Hadoop Ecosystem

Distributed FileSystem:

Apache HDFS: The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines.

Red Hat GlusterFS: GlusterFS is a scale-out network-attached storage file system.

Quantcast File System (QFS): QFS is an open-source distributed file system software package for large-scale MapReduce and other batch-processing workloads.

Ceph Filesystem: Ceph is a free software storage platform designed to present object, block, and file storage from a single distributed computer cluster.

Lustre file system: The Lustre filesystem is a high-performance distributed filesystem intended for large networks and high-availability environments.

Distributed Programming:

Apache MapReduce: MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
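To make the model concrete, here is a minimal sketch of the map/shuffle/reduce phases for the classic word-count example, simulated in plain Python (no Hadoop cluster involved; function names are illustrative, not part of any Hadoop API):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all counts recorded for one word.
    return key, sum(values)

docs = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # 2
```

On a real cluster, the map and reduce functions run in parallel on many nodes and the shuffle moves data over the network; the structure of the computation is the same.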

Apache Pig: Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows.

JAQL: JAQL is a functional, declarative programming language designed especially for working with large volumes of structured, semi-structured and unstructured data.

Apache Spark: Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). Spark provides an easier-to-use alternative to Hadoop MapReduce and offers performance up to 10 times faster than previous-generation systems like Hadoop MapReduce for certain applications.

Apache Flink: Apache Flink (formerly called Stratosphere) features powerful programming abstractions in Java and Scala, a high-performance runtime, and automatic program optimization. It has native support for iterations, incremental iterations, and programs consisting of large DAGs of operations.

Facebook Corona: "The next version of MapReduce" from Facebook, based on its own fork of Hadoop. The classic Hadoop implementation of MapReduce uses a single job tracker, which causes scaling issues for very large data sets.

Apache Tez: Tez is a generic application framework for processing complex data-processing task DAGs; it runs natively on Apache Hadoop YARN.

SQL on Hadoop:

Apache Hive: Data warehouse infrastructure originally developed by Facebook for data summarization, query, and analysis. It provides a SQL-like language (not SQL-92 compliant): HiveQL.

Apache HCatalog: HCatalog’s table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored.

AMPLAB Shark: Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can execute HiveQL queries up to 100 times faster than Hive without any modification to the existing data.

Apache Drill: Drill is the open-source version of Google's Dremel system, which Google offers as an infrastructure service called Google BigQuery.

Apache Phoenix: Apache Phoenix is a SQL skin over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data.

Apache MRQL: MRQL is a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, and Spark.

NoSQL Databases:

Apache HBase: Inspired by Google BigTable. A non-relational distributed database offering random, real-time read/write operations on column-oriented, very large tables (BDDB: Big Data DataBase). It is the Hadoop database, commonly used as the backing store for the output of MapReduce jobs.
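The BigTable-style data model behind HBase can be sketched in a few lines of plain Python: rows are addressed by a row key, cells by a column family:qualifier pair, and every cell keeps timestamped versions. This is a toy illustration of the model only, not HBase's API; the class and method names are invented for the example:

```python
import time

class TinyTable:
    """Toy sketch of HBase's data model: rows keyed by a row key,
    cells addressed by "family:qualifier", values versioned by timestamp."""
    def __init__(self):
        self.rows = {}

    def put(self, row, column, value, ts=None):
        # Append a new timestamped version rather than overwriting.
        ts = ts if ts is not None else time.time()
        cell = self.rows.setdefault(row, {}).setdefault(column, [])
        cell.append((ts, value))

    def get(self, row, column):
        # Return the most recent version, as HBase does by default.
        cell = self.rows.get(row, {}).get(column, [])
        return max(cell)[1] if cell else None

t = TinyTable()
t.put("row1", "cf:name", "alice", ts=1)
t.put("row1", "cf:name", "bob", ts=2)
print(t.get("row1", "cf:name"))  # bob
```

The real system adds sorted on-disk storage, region splitting, and replication on top of this sparse, versioned map.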

Apache Cassandra: Distributed NoSQL DBMS; it is a BDDB. MapReduce jobs can retrieve data from Cassandra. It can run without HDFS, or on top of HDFS (the DataStax fork of Cassandra).

Hypertable: The project is based on the experience of engineers who spent many years solving large-scale, data-intensive tasks. Hypertable runs on top of a distributed file system such as the Apache Hadoop DFS, GlusterFS, or the Kosmos File System (KFS).

Apache Accumulo: A distributed key/value store: a robust, scalable, high-performance data storage and retrieval system.

Data Ingestion:

Apache Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Apache Sqoop: System for bulk data transfer between HDFS and structured datastores such as relational databases. Where Flume moves log data into HDFS, Sqoop moves bulk data between HDFS and an RDBMS in either direction.

Apache Storm: A distributed real-time computation system for processing fast, large streams of data. Storm's architecture is based on a master/workers paradigm.

Apache Kafka: Distributed publish-subscribe system for processing large amounts of streaming data. Kafka is a message queue developed by LinkedIn that persists messages to disk very efficiently.
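Kafka's core idea is an append-only log per topic from which each consumer reads at its own offset, so messages are retained rather than deleted on delivery. A toy in-memory sketch of that idea (invented names; this is not Kafka's client API):

```python
class TinyLog:
    """Toy sketch of Kafka's model: an append-only log per topic;
    each consumer tracks its own read offset independently."""
    def __init__(self):
        self.topics = {}

    def publish(self, topic, message):
        # Producers only ever append to the end of the log.
        self.topics.setdefault(topic, []).append(message)

    def consume(self, topic, offset):
        # Consumers read from their own offset; the log is not modified.
        log = self.topics.get(topic, [])
        return log[offset:], len(log)

broker = TinyLog()
broker.publish("clicks", "page1")
broker.publish("clicks", "page2")
msgs, new_offset = broker.consume("clicks", 0)
print(msgs)        # ['page1', 'page2']
print(new_offset)  # 2
```

Because the broker never tracks per-message delivery state, many independent consumers can replay the same stream, which is what makes the design fast and durable.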


Scheduling and Data Management:
Apache Oozie: Workflow scheduler system for MapReduce jobs using DAGs (Directed Acyclic Graphs). The Oozie Coordinator can trigger jobs by time (frequency) and by data availability.
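The scheduling problem Oozie solves, running jobs only after their upstream dependencies finish, amounts to a topological sort of the workflow DAG. A minimal sketch with Python's standard-library `graphlib` (the workflow steps here are hypothetical, not an Oozie workflow definition):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical workflow DAG: each job maps to the set of jobs
# that must complete before it may start.
workflow = {
    "transform": {"ingest"},
    "load": {"transform"},
    "report": {"transform"},
}

# A valid execution order respecting every dependency edge.
order = list(TopologicalSorter(workflow).static_order())
print(order[0])  # ingest
```

Oozie additionally handles retries, time-based triggers, and data-availability triggers, but the dependency ordering at its heart is exactly this.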

Apache Falcon: Apache Falcon is a data management framework for simplifying data lifecycle management and processing pipelines on Apache Hadoop®. It enables users to configure, manage and orchestrate data motion, pipeline processing, disaster recovery, and data retention workflows.


Security:
Apache Sentry: Sentry is the next step in enterprise-grade big data security and delivers fine-grained authorization to data stored in Apache Hadoop.

Apache Knox Gateway: System that provides a single point of secure access for Apache Hadoop clusters.

Apache Ranger: Apache Ranger (formerly called Apache Argus or HDP Advanced Security) delivers a comprehensive approach to central security policy administration across the core enterprise security requirements of authentication, authorization, auditing, and data protection.
