Hadoop Use Cases


This is a collection of some use cases of Hadoop. It is not meant to be an exhaustive list, but a sample to give you some ideas.
2012 US Presidential Election
How Big Data helped Obama win re-election [http://www.computerworld.com/s/article/9233587/Barack_Obama_39_s_Big_Data_won_the_US_election] – by Michael Lynch, the founder of Autonomy [http://www.autonomy.com/] (cached copy [cached_reports/ Barack_Obama_Big_Data_won_the_US_election__Computerworld.pdf])

2. Data Storage
NetApp collects diagnostic data from its storage systems deployed at customer sites. This data is used to analyze the health of NetApp systems.
Problem: NetApp collects over 600,000 data transactions weekly, consisting of unstructured logs and system diagnostic information. Traditional data storage systems proved inadequate to capture and process this data.
Solution: A Cloudera Hadoop system captures the data and allows parallel processing of data.
Hadoop Vendor: Cloudera
Cluster/Data size: 30+ nodes; 7TB of data / month
Cloudera case study [http://www.cloudera.com/content/dam/cloudera/Resources/PDF/Cloudera_Case_Study_NetApp.pdf] (cached copy [cached_reports/Cloudera_Case_Study_NetApp.pdf]) (Published Sep 2012)
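The parallel processing mentioned in the solution can be pictured with a small MapReduce-style sketch. It is written in plain Python rather than against a real Hadoop API, and it is not NetApp's actual pipeline: the log format (`<system_id> <severity> <message>`), the sample lines, and the function names are all assumptions made for illustration. It counts log entries per system and severity, the kind of aggregation that a cluster would spread across many nodes.

```python
from collections import defaultdict

# Hypothetical diagnostic log lines: "<system_id> <severity> <message>".
# The format is an assumption for illustration only.
SAMPLE_LOGS = [
    "sys-001 ERROR disk latency high",
    "sys-002 INFO heartbeat ok",
    "sys-001 WARN fan speed low",
    "sys-001 ERROR disk latency high",
    "malformed",
    "sys-002 ERROR controller reset",
]


def map_phase(line):
    """Emit ((system_id, severity), 1) for each well-formed log line."""
    parts = line.split(maxsplit=2)
    if len(parts) < 2:
        return  # skip lines that do not match the assumed format
    system_id, severity = parts[0], parts[1]
    yield (system_id, severity), 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    for (system_id, severity), count in sorted(run_job(SAMPLE_LOGS).items()):
        print(system_id, severity, count)
```

On a real cluster the map and reduce functions run on many nodes at once, each over its own slice of the data; here they run in a single process only to show the data flow.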

3. Financial Services
Dodd-Frank Compliance at a bank
A leading retail bank is using Cloudera and Datameer to validate data accuracy and quality to comply with regulations like Dodd-Frank.
Problem: The previous solution using Teradata and IBM Netezza was time-consuming and complex, and the data mart approach didn’t provide the data completeness required for determining overall data quality.
Solution: A Cloudera + Datameer platform allows analyzing trillions of records which currently result in approximately one terabyte per month of reports. The results are reported through a data quality dashboard.
Hadoop Vendor: Cloudera + Datameer
Cluster/Data size: 20+ nodes; 1TB of data / month
Cloudera case study [http://www.cloudera.com/content/dam/cloudera/Resources/PDF/connect_case_study_datameer_banking_financial.pdf] (cached copy [cached_reports/connect_case_study_datameer_banking_financial.pdf]) (Published Nov 2012)

4. Health Care
Storing and processing Medical Records
Problem: A health IT company instituted a policy of saving seven years of historical claims and remit data, but its in-house database systems had trouble meeting the data retention requirement while processing millions of claims every day.
Solution: A Hadoop system archives seven years of claims and remit data, which requires complex processing to get into a normalized format. It also logs the terabytes of data generated daily by transactional systems and stores them in CDH for analytical purposes.
Hadoop vendor: Cloudera
Cluster/Data size: 10+ nodes pilot; 1TB of data / day
Cloudera case study [http://www.cloudera.com/content/dam/cloudera/Resources/PDF/Cloudera_Case_Study_Healthcare.pdf] (cached copy [cached_reports/Cloudera_Case_Study_Healthcare.pdf]) (Published Oct 2012)
Monitoring patient vitals at Los Angeles Children’s Hospital
Researchers at LA Children’s Hospital are using Hadoop to capture and analyze medical sensor data.
Problem: Collecting lots (billions) of data points from sensors / machines attached to the patients. This data was periodically purged before because storing this large volume of data on expensive storage was cost-prohibitive.
Solution: Continuously streaming data from sensors/machines is collected and stored in HDFS. HDFS provides scalable data storage at reasonable cost.
video [http://www.youtube.com/watch?v=NmMbFM7l1Rs]
SiliconANGLE story [http://siliconangle.com/blog/2013/06/27/leveraging-hadoop-to-advance-healthcare-research-childrens-hospital-use-case-hadoopsummit/] (Published June 2013)

Let us look at another Hadoop use case, this time in the human sciences.

5. Human Sciences
NextBio is using Hadoop MapReduce and HBase to process massive amounts of human genome data.
Problem: Processing multi-terabyte data sets wasn’t feasible using traditional databases like MySQL.
Solution: NextBio uses Hadoop MapReduce to process genome data in batches, and it uses HBase as a scalable data store.
Hadoop vendor: Intel
NextBio [http://www.nextbio.com/]
Intel case study [http://hadoop.intel.com/pdfs/IntelNextBioCaseStudy.pdf] (cached copy [cached_reports/IntelNextBioCaseStudy.pdf]) (Published Feb 2013)
Information Week article (May 2012) [http://www.informationweek.com/software/information-management/hbase-hadoops-next-big-data-chapter/232901601?pgno=1] (cached copy [cached_reports/
China Telecom Guangdong
Problem: Storing billions of mobile call records and providing real-time access to the call records and billing information to customers. Traditional storage/database systems couldn’t scale to the loads and provide a cost-effective solution.
Solution: HBase is used to store billions of rows of call record details. 30TB of data is added monthly.
Hadoop vendor: Intel
Hadoop cluster size: 100+ nodes
China Telecom Guangdong [http://gd.10086.cn/]
Intel case study [http://hadoop.intel.com/pdfs/IntelChinaMobileCaseStudy.pdf] (cached copy [cached_reports/IntelChinaMobileCaseStudy.pdf]) (Published Feb 2013)
Intel APAC presentation [http://www.slideshare.net/IntelAPAC/apac-big-data-dc-strategy-update-foridh-launch-rk]
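To make the idea of real-time access to billions of call records more concrete, here is a minimal sketch of one common HBase pattern: a composite row key made of the phone number plus a reversed timestamp, so that a subscriber's newest calls sort first and can be read with a short prefix scan. The case study does not say this is the key design actually used; the key layout, the `SortedStore` class (a toy in-memory stand-in for a table sorted by row key, not an HBase client), and the sample phone numbers are all assumptions.

```python
import bisect

# Hypothetical upper bound for millisecond timestamps (13 digits).
MAX_TS_MS = 10**13 - 1


def row_key(phone_number, call_ts_ms):
    """Compose '<phone>#<reversed timestamp>' so newer calls sort first."""
    reversed_ts = MAX_TS_MS - call_ts_ms
    return f"{phone_number}#{reversed_ts:013d}"


class SortedStore:
    """Toy stand-in for a table that keeps rows sorted by key."""

    def __init__(self):
        self._keys = []
        self._rows = {}

    def put(self, key, value):
        # Insert new keys in sorted position; overwrite existing rows.
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def scan_prefix(self, prefix, limit=None):
        """Return (key, row) pairs whose key starts with prefix, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        out = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            out.append((key, self._rows[key]))
            if limit is not None and len(out) >= limit:
                break
        return out


def recent_calls(store, phone_number, limit):
    """Fetch the most recent call records for one subscriber."""
    return [row for _, row in store.scan_prefix(phone_number + "#", limit)]


if __name__ == "__main__":
    store = SortedStore()
    store.put(row_key("5550001", 1000), {"duration_s": 60})
    store.put(row_key("5550001", 3000), {"duration_s": 10})
    store.put(row_key("5550001", 2000), {"duration_s": 30})
    store.put(row_key("5550002", 2500), {"duration_s": 5})
    print(recent_calls(store, "5550001", 2))
```

Because rows for one subscriber sit next to each other in key order, a lookup touches only a small, contiguous range of the table no matter how many billions of rows it holds overall.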
Nokia collects and analyzes vast amounts of data from mobile phones.
Problem:
(1) Dealing with 100TB of structured data and 500TB+ of semi-structured data
(2) 10s of PB across Nokia, 1TB / day
Solution: An HDFS data warehouse stores all the semi/multi-structured data and supports processing at petabyte scale.
Hadoop Vendor: Cloudera
Cluster/Data size:
(1) 500TB of data
(2) 10s of PB across Nokia, 1TB / day
(1) Cloudera case study [http://www.cloudera.com/content/dam/cloudera/Resources/PDF/Cloudera_Nokia_Case_Study_Hadoop.pdf] (cached copy [cached_reports/Cloudera_Nokia_Case_Study_Hadoop.pdf]) (Published Apr 2012)
(2) Strata NY 2012 presentation slides [http://cdn.oreillystatic.com/en/assets/1/event/85/Big%20Data%
%20Tool%20for%20the%20Right%20Workload%20Presentation.pdf] (cached copy [cached_reports/
Strata NY 2012 presentation [http://strataconf.com/stratany2012/public/schedule/detail/26880]
Problem: Orbitz generates tremendous amounts of log data. The raw logs are only stored for a few days because of costly data warehousing. Orbitz needed an effective way to store and process this data, plus they needed to improve their hotel rankings.
Solution: A Hadoop cluster provided a very cost-effective way to store vast amounts of raw logs. The data is cleaned and analyzed, and machine learning algorithms are run on it.
Orbitz presentation [http://www.slideshare.net/jseidman/windy-citydb-final-4635799] (Published 2010)
Datanami article [http://www.datanami.com/datanami/2012-04-26/six_superscale_
Seismic Data at Chevron
Problem: Chevron analyzes vast amounts of seismic data to find potential oil reserves.
Solution: Hadoop offers the storage capacity and processing power to analyze this data.
Hadoop Vendor: IBM Big Insights
Presentation [http://almaden.ibm.com/colloquium/resources/Managing%20More%20Bits%20Than%20Barrels%20Breuning.PDF] (cached copy [cached_reports/IBM_Chevron.pdf]) (Published June 2012)
OPower works with utility companies to provide engaging, relevant, and personalized content about home energy use to millions of households.
Problem: Collecting and analyzing massive amounts of data and deriving insights into customers’ energy usage.
Solution: Hadoop provides a single store for all of this data, and machine learning algorithms are run on it.
%20at%20Opower%20Presentation.pdf] (cached copy [cached_reports/Opower.pdf]) (Published Oct
Strata NY 2012 [http://strataconf.com/stratany2012/public/schedule/detail/25736]
Strata 2013 [http://strataconf.com/strata2013/public/schedule/detail/27158]
OPower.com [http://www.opower.com]
Trucking data @ US Xpress
US Xpress, one of the largest trucking companies in the US, is using Hadoop to store sensor data from its trucks. The intelligence mined from this data saves the company $6 million / year in fuel costs alone.
Problem: Collecting and storing 100s of data points from thousands of trucks, plus lots of geo data.
Solution: Hadoop allows storing enormous amounts of sensor data. Hadoop also allows querying this data and joining it with other data sets.
Computer Weekly article [http://www.computerweekly.com/news/2240146943/Case-Study-US-Xpressdeploys-hybrid-big-data-with-Informatica] (Published May 2012)
Hortonworks white paper on ‘Business Value of Hadoop’ [http://hortonworks.com/wp-content/uploads/downloads/2013/06/Hortonworks.BusinessValueofHadoop.v1.0.pdf] (cached copy [cached_reports/Hortonworks.BusinessValueofHadoop.v1.0.pdf]) (Published July 2013)
USXpress.com [http://www.usxpress.com/]
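Joining sensor data with other data sets, as the US Xpress solution describes, can be sketched as a reduce-side join: records from both inputs are tagged with their source, grouped by a shared key, and combined per key. The sketch below is plain Python, not the company's actual job. The field names (`truck_id`, `speed_mph`, `fuel_gph`, `depot`), the sample values, and the idle-reading query are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records; field names and values are illustrative only.
SENSOR_READINGS = [
    {"truck_id": "T1", "speed_mph": 62, "fuel_gph": 6.1},
    {"truck_id": "T2", "speed_mph": 55, "fuel_gph": 5.0},
    {"truck_id": "T1", "speed_mph": 0, "fuel_gph": 0.9},
    {"truck_id": "T3", "speed_mph": 48, "fuel_gph": 4.4},
]
TRUCK_INFO = [
    {"truck_id": "T1", "depot": "depot-A"},
    {"truck_id": "T2", "depot": "depot-B"},
]


def tag_records(readings, trucks):
    """Map step: emit (join_key, (source_tag, record)) for both inputs."""
    for rec in trucks:
        yield rec["truck_id"], ("truck", rec)
    for rec in readings:
        yield rec["truck_id"], ("reading", rec)


def join_by_truck(readings, trucks):
    """Group by truck_id, then inner-join readings to truck info."""
    groups = defaultdict(lambda: {"truck": [], "reading": []})
    for key, (tag, rec) in tag_records(readings, trucks):
        groups[key][tag].append(rec)

    joined = []
    for truck_id in sorted(groups):
        for truck in groups[truck_id]["truck"]:
            for reading in groups[truck_id]["reading"]:
                joined.append({**truck, **reading})
    return joined


def idle_readings_by_depot(joined):
    """Example query on the joined data: count zero-speed readings per depot."""
    counts = defaultdict(int)
    for row in joined:
        if row["speed_mph"] == 0:
            counts[row["depot"]] += 1
    return dict(counts)


if __name__ == "__main__":
    joined = join_by_truck(SENSOR_READINGS, TRUCK_INFO)
    print(idle_readings_by_depot(joined))
```

This is an inner join, so readings from trucks with no matching entry in the second data set (T3 in the sample) are dropped. In practice a query layer such as Hive would usually express the same join declaratively.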

These are some Hadoop use cases from real-world scenarios.
