Introduction to Big Data and Hadoop
Objectives
·
Identify the need for Big Data
·
Explain the concept of Big Data
·
Describe the basics of Hadoop
·
Explain the benefits of Hadoop
Introduction
Over 2.5 exabytes (2.5 billion gigabytes) of data is
generated every day.
Following are some of the sources of the huge volume of
data:
·
A typical large stock exchange captures more
than 1 TB of data every day.
·
There are around 5 billion mobile phones (including
1.75 billion smartphones) in the world.
·
YouTube users upload more than 48 hours of video
every minute.
·
Large social networks such as Twitter and
Facebook capture more than 10 TB of data daily.
·
There are more than 30 million networked sensors
in the world.
Types of Data
There are three types of data:
·
Structured Data: Data which is represented in a tabular format. E.g. Databases
·
Semi-structured Data: Data which does not have a formal data model but carries tags or markers that organize it. E.g. XML files
·
Unstructured Data: Data which does not have a predefined data model. E.g. Text files
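The difference between the three types shows up as soon as a program tries to read them. The sketch below is only an illustration: the sample records and field names are made up, and it uses nothing beyond the Python standard library.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: rows and columns with a fixed, known schema (hypothetical sample).
structured = "id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: tags describe the fields, but records may differ in shape.
semi_structured = """
<customers>
  <customer id="1"><name>Asha</name><city>Pune</city></customer>
  <customer id="2"><name>Ravi</name></customer>
</customers>
""".strip()
root = ET.fromstring(semi_structured)
customers = [
    {"id": c.get("id"), "name": c.findtext("name"), "city": c.findtext("city")}
    for c in root.findall("customer")
]

# Unstructured: free text with no fields at all; any structure must be inferred.
unstructured = "Asha from Pune called about a delayed order. Ravi asked for a refund."
word_count = len(unstructured.split())

print(rows[0]["city"])       # every row has a city column
print(customers[1]["city"])  # the second record has no city element
print(word_count)
```

The structured rows can be queried by column name right away, the XML records need a check for missing fields, and the free text offers nothing beyond raw words until some analysis is applied.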
Characteristics of Big Data
Big data has three characteristics: variety, velocity, and
volume.
·
Variety:
Variety encompasses managing the complexity of data in many different
structures, ranging from relational data to logs and raw text.
·
Velocity:
Velocity accounts for the streaming of data and the movement of large volumes of data at
high speed.
·
Volume:
Volume denotes the scaling of data ranging from terabytes to zettabytes.
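To get a feel for these scales, the unit arithmetic can be checked with a few lines of code. This is a minimal sketch that assumes decimal (SI) units, where each step is a factor of 1,000; the helper name convert is made up for illustration.

```python
# Decimal (SI) storage units, expressed in bytes.
UNITS = {"GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}


def convert(value, from_unit, to_unit):
    """Convert a data size between two of the units listed in UNITS."""
    return value * UNITS[from_unit] / UNITS[to_unit]


# The daily figure quoted in the Introduction: 2.5 exabytes in gigabytes.
daily_gb = convert(2.5, "EB", "GB")
print(f"{daily_gb:,.0f} GB per day")

# One zettabyte expressed in terabytes.
print(f"{convert(1, 'ZB', 'TB'):,.0f} TB in a zettabyte")
```

The first conversion confirms that 2.5 exabytes is 2.5 billion gigabytes, and the second shows that a zettabyte is a billion terabytes.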
Appeal of Big Data Technology
Big Data Technology is appealing because of the following
reasons:
·
It helps to manage and process a huge amount of
data cost-efficiently.
·
It analyzes data in its native form, which may
be unstructured, structured, or streaming.
·
It captures data from fast-happening events in
real time.
·
It can handle failure of isolated nodes and
tasks assigned to such nodes.
·
It can turn data into actionable insights.
Business Benefits of Big Data Technology
Following are the business benefits of implementing Big Data
technology, with examples:
·
It can help organizations to create personalized
products, gain insight into products that are profitable, and retain customers
by solving their problems.
Example: Utilizing Big Data Analytics permits banks to study the money-saving patterns and practices of individual clients.
·
Big Data analytic solutions support or
automate cost cutting, bring greater operational efficiency, and support the
evaluation of historical trends.
Example: Using Big Data Analytics, banks keep track of their clients’ geographical shopping locations.
·
By using Big Data predictive analysis
techniques, organizations can provide early warning of a problem and enable
preventive maintenance to avoid a potential outage.
Example: A laptop manufacturer can collect and analyze data generated during component manufacturing. This data can help the manufacturer determine acceptable levels of heat, vibration, and other variables.
·
Big Data offers a range of analytical techniques
that help organizations to develop new products and services.
Example: An application can analyze data at the most granular level, even observing that customers who bought a smartphone also bought memory cards or back covers.
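The smartphone example amounts to counting which items appear together in the same purchase. The sketch below shows that idea on a handful of made-up baskets. It is a plain co-occurrence count, not a full recommendation algorithm, and the data and the function name bought_with are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets, one set of items per customer.
baskets = [
    {"smartphone", "memory card", "back cover"},
    {"smartphone", "memory card"},
    {"smartphone", "back cover"},
    {"laptop", "mouse"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1


def bought_with(item):
    """Return (other_item, count) pairs for items bought together with `item`."""
    related = Counter()
    for (first, second), count in pair_counts.items():
        if first == item:
            related[second] += count
        elif second == item:
            related[first] += count
    return related.most_common()


print(bought_with("smartphone"))
```

On real data the same counting would run over millions of baskets, which is where a distributed platform becomes useful.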
Traditional IT Analytics Approach
The following are the requirements of the traditional IT
analytics approach and the challenging factors:
Requirements:
·
The business team needs to define questions
before IT development.
·
They need to define data sources and structures.
Challenging Factors:
·
The requirements are iterative and volatile.
·
The data sources keep changing.
In a typical scenario of traditional IT systems development,
the requirements are defined, followed by solution design and build. Once the
solution is implemented, queries are executed. If there are new requirements or
queries, the system is redesigned and rebuilt.
Define Requirements → Design Solution → Execute Queries → Redesign and Rebuild for new requirements.
Approach for Big Data Solutions
Following are the requirements for using Big Data technology
as a platform for discovery and exploration, and the challenges it
overcomes:
Requirements
·
The business team needs to define data sources
·
They need to establish the hypothesis
Challenges overcome by Big Data
·
The technology should enable explorative
analysis.
·
Data systems and sources need to be integrated
as required.
The following steps illustrate how IT systems are built with the help
of Big Data technology:
·
Initial Data Sources are identified
·
IT Team creates a platform for creative
exploration of available data and content
·
The business teams determine the questions to
ask and test hypotheses
·
Any new questions lead to addition of data
sources and integration without the need to redesign or rebuild the platform.
Big Data Technology Capabilities
·
Understand and navigate Big Data sources
·
Manage and store huge volume of a variety of data
·
Process Data in reasonable time
·
Ingest data at a high speed
·
Analyze unstructured data
·
Bear faults and exceptions
Big Data Use Cases
The use cases of Big Data and Hadoop, by industry, are given below:
·
Automotive: Auto sensors reporting location and problems
·
Communication: Location-based advertising
·
Consumer Packaged Goods: Sentiment analysis of what’s hot, customer service
·
Financial Services: Risk and portfolio analysis, new products
·
Education and Research: Experiment sensor analysis
·
High Technology/Industrial Mfg.: Manufacturing quality, warranty analysis
·
Life Sciences: Clinical trials, genomics
·
Media/Entertainment: Viewer and advertising effectiveness
·
Online Services/Social Media: People and career matching
·
Health Care: Patient sensors, monitoring, EHRs, quality of care
·
Oil and Gas: Drilling exploration sensor analysis
·
Retail: Consumer sentiment, optimized marketing
·
Travel and Transportation: Sensor analysis for optimal traffic flows, customer sentiment
·
Utilities: Smart meter analysis for network capability
·
Law Enforcement and Defense: Threat analysis (social media monitoring, photo analysis)
Challenges of Big Data
Following are the challenges that need to be addressed by
Big Data Technology:
·
Fault tolerance and handling the system uptime
and downtime
o
Using commodity hardware for data storage and
analysis
o
Maintaining a copy of the same data across
clusters
·
Combining data accumulated from all systems
o
Analyzing data across different machines
o
Merging of data
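The first challenge, keeping data available when commodity machines fail, is usually met by storing several copies of each piece of data on different machines. The following is a toy sketch of that idea and not Hadoop code. The node names, block names, rotating placement rule, and replication factor of 3 are all assumptions chosen for illustration.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, rotating the start node."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for index, block in enumerate(block_ids):
        placement[block] = [nodes[(index + i) % len(nodes)] for i in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {block for block, holders in placement.items() if set(holders) - failed}


nodes = ["node1", "node2", "node3", "node4"]
blocks = ["blk_1", "blk_2", "blk_3"]

replicated = place_replicas(blocks, nodes, replication=3)
single_copy = place_replicas(blocks, nodes, replication=1)

# Two nodes fail at the same time.
print(sorted(readable_blocks(replicated, ["node2", "node3"])))
print(sorted(readable_blocks(single_copy, ["node2", "node3"])))
```

With three copies, every block survives the loss of two nodes. With a single copy, only the block that happened to sit on a healthy node remains readable.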
Introduction to Hadoop
Following are the facts related to Hadoop and why it is
required:
What is Hadoop?
·
A free, Java-based programming framework that
supports the processing of large data sets in a distributed computing environment.
·
Its design is based on Google's published work on the Google File System (GFS) and MapReduce
Why Hadoop?
·
Runs applications on distributed
systems with thousands of nodes involving petabytes of data
·
Has a distributed file system, called the Hadoop
Distributed File System or HDFS, which enables fast data transfer among the
nodes.
·
It also includes a distributed processing
framework called MapReduce
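The MapReduce model can be pictured as three steps: map each input record to key/value pairs, group the pairs by key, and reduce each group to a result. The sketch below walks through a word count in plain Python on a single machine. It does not use the Hadoop API, and the function names are chosen only for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big clusters", "data nodes"])
print(counts)
```

In a real cluster the map calls run in parallel on the nodes that hold the input, and the grouped keys are spread across several reducers. The three-step structure stays the same.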
Hadoop and Traditional RDBMS
Feature | RDBMS | Hadoop
Computing Model | Notion of transactions; a transaction is the unit of work; ACID properties and concurrency control | Notion of jobs; a job is the unit of work; no concurrency control
Data Model | Structured data with a known schema; read/write mode | Data in any format (unstructured, semi-structured, or structured); read-only mode
Cost Model | Expensive servers | Cheap commodity machines
Fault Tolerance | Failures are rare; recovery mechanisms | Failures are common over thousands of machines; simple yet efficient fault tolerance
History and Milestones of Hadoop
Hadoop originated from the Nutch open source project on
search engines and works over distributed network nodes.
Period | Milestone
2003 & 2004 | Google released two papers that provided insight into its success: "The Google File System" (GFS) and "MapReduce: Simplified Data Processing on Large Clusters". The papers told the world how Google performed large-scale data processing.
July 2005 | Nutch used a GFS-style distributed file system to perform MapReduce operations
Feb 2006 | Hadoop was split out of Nutch as a Lucene subproject, which led to the era of Hadoop
Apr 2007 | Yahoo started using Hadoop on a 1000-node cluster
Jan 2008 | Hadoop became a top-level Apache project
Jul 2008 | A 4000-node cluster with Hadoop was tested by Apache. The performance of that cluster was the fastest when compared to the other technologies implemented that year
May 2009 | Hadoop successfully sorted a petabyte of data in 17 hours
Dec 2011 | Hadoop reached version 1.0
Hadoop Core Services and Components
Major components of Hadoop are:
·
HDFS: Runs on low-cost commodity machines. It is highly
fault tolerant and efficient enough to process huge amounts of data.
·
NameNode: The brain of the system. It stores the metadata of the data blocks along
with the location of the data blocks. If the NameNode crashes, the entire system goes
down.
·
Secondary NameNode: Keeps a periodically checkpointed copy of the Primary NameNode's
namespace image. It is not a hot standby, but if the Primary NameNode crashes, the
namespace image on the Secondary NameNode can be used to restart the system.
·
DataNode:
Stores the blocks of data.
·
JobTracker: Accepts client jobs, creates Map and Reduce tasks, and schedules them. It
can run on the same machine as the NameNode or on a different node.
·
TaskTracker: Runs on DataNodes. Its primary responsibility is to run the MapReduce tasks
assigned by the JobTracker.
Layer | Master | Slave 1 | Slave 2 | … | Slave N
MapReduce | JobTracker | TaskTracker | TaskTracker | TaskTracker | TaskTracker
HDFS | NameNode | DataNode | DataNode | DataNode | DataNode
HDFS Architecture
HDFS architecture can be
summarized as follows:
·
The NameNode is the master and the DataNodes are the
slaves.
·
The NameNode is the brain of the system and handles
client requests for access to data. The DataNodes manage the storage of data.
·
Each file is split into one or more
blocks.
·
When a client needs data, it first interacts
with the NameNode, which holds the metadata and replies to the client with the location
of the data on the DataNodes.
·
After this, the client interacts directly with the
DataNodes until its data requirement is met.
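The read path described above can be modelled in a few lines. The following is a toy, in-memory sketch and not the real HDFS client API. The class layout, block ids, file path, and the rule of reading from the first listed DataNode are assumptions made for illustration.

```python
class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self):
        self.file_blocks = {}      # file name -> ordered list of block ids
        self.block_locations = {}  # block id -> list of DataNode names

    def locate(self, filename):
        """Return (block id, holders) pairs for a file, in block order."""
        return [(block, self.block_locations[block]) for block in self.file_blocks[filename]]


class DataNode:
    """Stores the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> block contents


def read_file(filename, namenode, datanodes):
    """Client read: ask the NameNode for locations, then fetch from DataNodes."""
    parts = []
    for block_id, holders in namenode.locate(filename):
        node = datanodes[holders[0]]  # simply pick the first listed holder
        parts.append(node.blocks[block_id])
    return "".join(parts)


# Hypothetical two-node cluster holding one file split into two blocks.
dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.blocks["blk_1"] = "hello "
dn2.blocks["blk_2"] = "hadoop"

nn = NameNode()
nn.file_blocks["/logs/a.txt"] = ["blk_1", "blk_2"]
nn.block_locations = {"blk_1": ["dn1"], "blk_2": ["dn2"]}

result = read_file("/logs/a.txt", nn, {"dn1": dn1, "dn2": dn2})
print(result)
```

Note that the NameNode object never touches the file contents. It only answers the question of where the blocks are, which matches its role as the holder of metadata.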
Organizations Using Hadoop
The following table shows how various organizations use
Hadoop:
Name of the Organization | Cluster Specifications | Uses
A9.com (Amazon) | Clusters vary from 1 to 100 nodes | Amazon’s product search indices are built using Hadoop; processes millions of sessions daily for analysis
Yahoo | More than 100,000 CPUs in approximately 20,000 computers running Hadoop; biggest cluster has 2000 nodes (2 * 4 CPU boxes with 4 TB disk space) | To support research for ad systems and web search
AOL | Cluster size is 50 machines (Intel Xeon, dual processors, dual core), each with 16 GB RAM and 800 GB hard disk, for a total of 37 TB HDFS capacity | For a variety of functions ranging from generating data to running advanced algorithms for performing behavioral analysis and targeting
Facebook | 320-machine cluster with 2,560 cores and about 1.3 PB raw storage | Storing copies of internal logs and dimension data sources; used as a source for reporting/analytics and machine learning
Summary
·
Big Data is characterized by volume, velocity, and variety.
·
Data can be divided into three types –
unstructured data, semi-structured data, and structured data.
·
Big Data technology understands and navigates
big data sources, analyzes unstructured data, and ingests data at a high speed.
·
Hadoop is a free, Java-based programming framework
that supports the processing of large data sets in a distributed computing
environment.
·
Hadoop originated from the Nutch open source
project on search engines and works over distributed network nodes.
·
The core services of Hadoop are NameNode,
DataNode, JobTracker, TaskTracker, and Secondary NameNode.