Introduction to Apache Spark

Introduction to Apache Spark

Apache Spark

Apache Spark is a open source processing engine for Hadoop data built around speed and sophisticated analytics anf easy to use. It was originally developed in 2009 in UC Berkeley’s AMPLab, and open sourced in 2010. Spark has quickly become  one of the largest open source communities in big data

Let us see properities of Apache Spark

1)Speed: Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk.

2)Sophisticated analytics:Spark supports for sql queries ,streaming data,complex analytics such as machine learning and graphs algorithms

3)Easy to use:Spark supports three types of languages to processing of  data such as java,scala,python,It comes with a built-in set of over 80 high-level operators. And you can use it interactively to query data within the shell.

Hadoop is effective for storing large amount of data,MapReduce is only able to execute simple computations and uses a high-latency batch model. Spark provides a more general and powerful alternative to Hadoop’s MapReduce, offering rich functionality such as stream processing, machine learning, and graph computations.

Apache Spark is 100% compatible with Hadoop’s Distributed File System (HDFS), HBase, and any Hadoop storage system, so your existing data is immediately usable in Spark.

Apache Spark is intended to enhance, not replace, the Hadoop stack. From day one, Spark was designed to read and write data from and to HDFS, as well as other storage systems, such as HBase and Amazon’s S3. As such, Hadoop users can enrich their processing capabilities by combining Spark with Hadoop MapReduce, HBase, and other big data frameworks.

Through Shark,Apache  Spark enables Apache Hive users to run their unmodified queries much faster. Hive is a popular data warehouse solution running on top of Hadoop, while Shark is a system that allows the Hive framework to run on top of Spark instead of Hadoop. As a result, Shark can accelerate Hive queries by as much as 100x when the input data fits into memory, and up 10x when the input data is stored on disk.

Speak Your Mind