Apache Spark Tutorial for Beginners

Apache Spark is an open source, fast, general-purpose engine for large-scale data processing. It is a cluster computing framework designed for fast computation.

Apache Spark:

Streaming Data

A key use case for Apache Spark is processing streaming data. With so much data being generated every day, it has become essential for companies to be able to stream and analyze it in real time. Spark Streaming is the component designed to handle this workload.

Machine Learning

Spark comes with an integrated framework for advanced analytics that helps users run repeated queries on sets of data, which is essentially what running machine learning algorithms amounts to. Among the components of this framework is Spark’s scalable Machine Learning Library (MLlib). MLlib covers areas such as clustering, classification, and dimensionality reduction, among many others. This lets Spark be used for common big data tasks such as predictive intelligence, customer segmentation for marketing, and sentiment analysis. Companies that use a recommendation engine will find Spark well suited to the job.

Spark SQL

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation.

Graph Processing

GraphX is a Spark component for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a Graph abstraction: a directed multigraph with properties attached to each vertex and edge.
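
To make the "directed multigraph with properties" idea concrete, here is a minimal sketch in plain Python. It is not the GraphX API: the names PropertyGraph, Edge, add_vertex, add_edge and out_degree are hypothetical and exist only for this illustration, and nothing here is distributed.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Edge:
    src: int   # id of the source vertex (edges are directed)
    dst: int   # id of the destination vertex
    attr: Any  # property attached to this edge


@dataclass
class PropertyGraph:
    # vertex id -> property attached to that vertex
    vertices: Dict[int, Any] = field(default_factory=dict)
    # a list rather than a set, so parallel edges between the
    # same pair of vertices are allowed (that is the "multi" part)
    edges: List[Edge] = field(default_factory=list)

    def add_vertex(self, vid: int, attr: Any) -> None:
        self.vertices[vid] = attr

    def add_edge(self, src: int, dst: int, attr: Any) -> None:
        self.edges.append(Edge(src, dst, attr))

    def out_degree(self, vid: int) -> int:
        # number of edges leaving the given vertex
        return sum(1 for e in self.edges if e.src == vid)


g = PropertyGraph()
g.add_vertex(1, "alice")
g.add_vertex(2, "bob")
g.add_edge(1, 2, "follows")
g.add_edge(1, 2, "emails")   # second edge between the same pair
g.add_edge(2, 1, "follows")  # opposite direction is a distinct edge
```

In this toy graph, vertex 1 has two outgoing edges to vertex 2, each carrying its own property, which a simple graph could not represent.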

R on Spark

SparkR is an R package that provides a lightweight frontend for using Apache Spark from R. SparkR also supports distributed machine learning using MLlib.



What is Apache Spark:

Apache Spark is a fast and general engine for large-scale data processing. Its main feature is in-memory cluster computing, which increases the processing speed of an application. Spark is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries and streaming. By supporting all of these workloads in a single system, it reduces the management burden of maintaining separate tools.
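
Why in-memory processing helps iterative algorithms can be sketched in plain Python. This is a toy model and not Spark code: CountingSource is a hypothetical stand-in for a dataset on disk, and the two functions only count how often that dataset has to be read.

```python
class CountingSource:
    """Hypothetical stand-in for a dataset on disk; counts its reads."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._rows)


def iterate_from_disk(source, iterations):
    # Disk-based style: every pass reloads the data.
    total = 0
    for _ in range(iterations):
        rows = source.load()
        total += sum(rows)
    return total


def iterate_in_memory(source, iterations):
    # In-memory style: load once, then reuse the cached rows.
    cached = source.load()
    total = 0
    for _ in range(iterations):
        total += sum(cached)
    return total


disk_source = CountingSource([1, 2, 3])
mem_source = CountingSource([1, 2, 3])

disk_total = iterate_from_disk(disk_source, 5)
mem_total = iterate_in_memory(mem_source, 5)
```

Both functions compute the same total, but the first reads the source once per iteration and the second reads it once overall. With real data sizes, those repeated reads are where a disk-based engine spends much of its time.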

Speed
According to the Spark project, programs can run up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

Ease of Use
Write applications quickly in Java, Scala, Python, or R.

Generality
Combine SQL, streaming, and complex analytics.

Runs Everywhere
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

Hadoop & Spark

Big data is hard to store and to process at scale. One solution to this problem is Apache Hadoop, which addresses both the storage and the processing side.

Apache Spark is another solution to the big data problem. It processes many types of data in memory and in a distributed manner.

Apache Hadoop is an open source framework. It has two layers: HDFS and MapReduce.

HDFS (Hadoop Distributed File System) stores the data as files.

MapReduce processes the data in a distributed manner. This processing is disk-based.
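
The MapReduce model itself can be sketched as a word count in plain Python. This runs on a single machine and does not use the Hadoop API; the function names map_phase, shuffle_phase and reduce_phase are hypothetical labels for the three stages, and the input lines are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    # Emit one (word, 1) pair per word in the line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    # Group all values that share a key, as the framework
    # would do between the map and reduce stages.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Collapse the values for one key into a single count.
    return key, sum(values)


lines = [
    "spark runs on hadoop",
    "hadoop stores data",
    "spark processes data",
]

mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
word_counts = dict(reduce_phase(k, v) for k, v in grouped.items())
```

In a real cluster, the map and reduce functions run on many machines and the intermediate pairs are written to disk between stages, which is the disk-based cost mentioned above.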

Apache Spark is also an open source framework. It processes data in an in-memory fashion. Spark doesn’t have a storage system of its own, and it runs on several platforms: Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources, including HDFS and Cassandra.

Spark and Big Data

Spark is an open source engine built around speed, ease of use, and analytics. If you have large amounts of data that require low-latency processing that a typical MapReduce program cannot provide, Spark is a way to avoid that high latency.

Spark performs at speeds up to 100 times faster than MapReduce for iterative algorithms or interactive data mining.
Spark provides in-memory cluster computing for speed and supports Java, Python, R, and Scala APIs for ease of development.
Apache Spark can handle a wide range of data processing scenarios by combining SQL, streaming and complex analytics in the same application.
Apache Spark runs on top of Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources such as HDFS and Cassandra.

Spark Scala

Spark itself is written in Scala, so it is possible to get some performance benefit from using Scala. Spark applications written in Scala are often faster than equivalent Python ones, particularly for RDD-based code, where Python incurs extra serialization between the JVM and the Python worker processes. Many organizations therefore prefer to build their applications in Scala, which runs on the JVM, a highly optimized platform.

Spark is built around data “transformation” and “mapping” concepts, which fit naturally in a functional programming language such as Scala. Moreover, Scala runs on the JVM, which makes it easier to integrate with the Hadoop framework.
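
The functional style of chained transformations can be illustrated with Python's built-in map, filter and functools.reduce. This is a single-machine sketch rather than Spark code, written in Python to stay consistent with the other sketches in this tutorial; the readings list is made-up sample data.

```python
from functools import reduce

readings = [3, 8, 15, 4, 23]

# Each step describes a transformation of the previous one.
# In Python 3, map and filter are lazy: nothing is computed yet.
doubled = map(lambda x: x * 2, readings)
large = filter(lambda x: x > 10, doubled)

# reduce consumes the pipeline and produces a single value.
total = reduce(lambda a, b: a + b, large, 0)
```

Spark follows a similar pattern at cluster scale: transformations are only recorded, and the work runs when an action asks for a result.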
