Hadoop Definition and Ecosystems

Hadoop

Hadoop is one of the powerful technology in today Market,Hadoop is an open source and specially designed for commodity hardware and it is open source,java based frame work  that supports both storage purpose and processing purpose. Hadoop is especially designed for large data sets in distributed computing environment. Apache Hadoop is a part of Apache foundation.

In this Internet world every business every step and every decision is depends upon the data . In this Internet world every day data is increasing and every data data storage and processing technologies facing new challenges because of this big data problem,so Here is Big data is the problem and Hadoop is the solution. Hadoop is solving the many  problems in this real time data analyzing world.

What is Hadoop

hadoop

hadoop

Hadoop Definition

Actually hadoop was created by computer scientists Doug Cutting and Mike Cafarella in 2006 to support distribution for the Apache Nutch search engine.Hadoop inspired by google’s mapreduce papaers and google file system (GFS).After years of development within the open source community, Hadoop 1.0 became publicly available in November 2012 as part of the Apache project sponsored by the Apache Software Foundation.Now hadoop 2.7 is the latest version of hadoop.In every update Hadoop is keep improving it it’s resource management,security and scheduling.It features a high-availability file-system option and support for Microsoft Windows and other components to expand the framework’s versatility for data processing and analytics like Spark,Kafka,flink etc.

Hadoop works on Master slave architecture.we can run a application at a time parallel on thousands of commodity hardware nodes and can handle thousands of Petabytes of data.Hadoop mainly depends on distributed file system,because of this distributed file system hadoop can process and transfer the data between the nodes very fastly.Hadoop fault tolerance,data replication,Rack awareness,Name node high availability,HDFS federation are very highlighted features of hadoop.Famous hadoop supported companies are Hortonworks,Cloudera and MapR and Amazon Web Services (AWS), Google Cloud Platform and Microsoft Azure provides the cloud services for hadoop deployment.

hadoop features list :

  • Hadoop Brings Flexibility In Data Processing: …
  • Hadoop Is Easily Scalable. …
  • Hadoop Is Fault Tolerant. …
  • Hadoop Is Great At Faster Data Processing. …
  • Hadoop Ecosystem Is Robust: …
  • Hadoop Is Very Cost Effective.

hadoop ecosystem explained

Hadoop mainly divide into two parts that are

i) HDFS (For Storing the Actual data)

ii) Mapreduce (For Processing the data)

here HDFS means Hadoop distributed file system and it is useful for store the thousands of petabytes of data in large cluster across the all nodes and Mapreduce is very useful for process the raw data as per requirement and give exact records as output.In hadoop 2.X version Other components include that is Hadoop Yet Another Resource Negotiator (YARN), which provides resource management and scheduling for user applications.Mapreduce is providing the programming logical model to process large amount of data in a distributed manner.mapping data and reducing it to a result.

Hadoop provides some hadoop ecosystems to process the data in distributed manner based on client requirement.

Hadoop Tutorial

  • HDFS :- HDFSis the storage component of Hadoop. It is optimized for high throughput and works best when reading and writing large files (gigabytes and larger).
  • MapReduce :- MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.
  • Apache HcatLog :- HCatalogis a table and storage management layer for Hadoop that enables users with different data processing tools
  • Apache Pig :- A data flow language and execution environment for exploring very large datasets.Pig runs on HDFS and MapReduce clusters.
  • Apache Sqoop :- A tool for efficiently moving data between relational databases and HDFS
  • Apache Hive :- A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (and which is translated by the runtime engine to MapReduce jobs) for querying the data.
  • Apache Hbase :- A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
  • ZooKeeper :- A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
  • Oozie :- Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
  • Apache Spark A fast engine for big data processing capable of streaming and supporting SQL, machine learning and graph processing;
  • Apache Storm. An open source data processing system
  • Apache Phoenix. An open source, massively parallel processing, relational database engine for Hadoop that is based on Apache HBase.

Speak Your Mind

*