Hadoop Quick Notes: 1 Introduction to Big Data and Hadoop

Introduction to Big Data and Hadoop

Objectives

· Identify the need for Big Data

· Explain the concept of Big Data

· Describe the basics of Hadoop

· Explain the benefits of Hadoop

Introduction

Over 2.5 exabytes (2.5 billion gigabytes) of data is generated every day.

Following are some of the sources of the huge volume of data:

· A typical large stock exchange captures more than 1 TB of data every day.

· There are around 5 billion mobile phones (including 1.75 billion smartphones) in the world.

· YouTube users upload more than 48 hours of video every minute.

· Large social networks such as Twitter and Facebook capture more than 10 TB of data daily.

· There are more than 30 million networked sensors in the world.

Types of Data

There are three types of data:

· Structured Data: Data which is represented in a tabular format. E.g. Databases

· Semi-structured Data: Data which does not have a formal data model but still carries organizing markers such as tags. E.g. XML files

· Unstructured Data: Data which does not have a predefined data model. E.g. Text files
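
As a rough illustration of the three types, the Python sketch below reads one sample of each kind using only the standard library. The records, field names, and values are made up for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: tabular rows that all share the same columns (made-up sample).
structured = "id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: tags describe the data, but records may carry different fields.
semi_structured = (
    "<users>"
    "<user><name>Asha</name></user>"
    "<user><name>Ravi</name><phone>555-0100</phone></user>"
    "</users>"
)
root = ET.fromstring(semi_structured)
users = [{child.tag: child.text for child in user} for user in root.findall("user")]

# Unstructured: free text with no data model; any structure has to be inferred.
unstructured = "Asha called from Pune about a late delivery."
words = unstructured.split()

print(rows[0]["city"])  # column lookup works because the schema is fixed
print(users[1])         # the second user has a field the first one lacks
print(len(words))       # only generic operations such as tokenizing apply
```

The structured rows can be queried by column name, the XML records have to be inspected tag by tag, and the free text offers nothing beyond its raw words.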

Characteristics of Big Data

Big data has three characteristics: variety, velocity, and volume.

· Variety: Variety encompasses managing the complexity of data in many different structures, ranging from relational data to logs and raw text.

· Velocity: Velocity accounts for the streaming of data and the movement of large volumes of data at high speed.

· Volume: Volume denotes the scaling of data ranging from terabytes to zettabytes.

Appeal of Big Data Technology

Big Data Technology is appealing because of the following reasons:

· It helps to manage and process a huge amount of data cost-efficiently.

· It analyzes data in its native form, which may be unstructured, structured, or streaming.

· It captures data from fast-happening events in real time.

· It can handle failure of isolated nodes and tasks assigned to such nodes.

· It can turn data into actionable insights.

Business Benefits of Big Data Technology

Following are the business benefits of implementing Big Data technology, with examples:

· It can help organizations to create personalized products, gain insight into products that are profitable, and retain customers by solving their problems.
Example: Utilizing Big Data Analytics permits banks to study the money-saving patterns and practices of individual clients.

· Big Data analytic solutions support or automate cost cutting, improve operational efficiency, and enable the evaluation of historical trends.
Example: Using Big Data Analytics, banks keep track of their clients’ geographical shopping locations.

· By using Big Data predictive analysis techniques, organizations can provide early warning of a problem and enable preventive maintenance to avoid a potential outage.
Example: A portable computer manufacturer can collect and analyze data generated during component manufacturing. This data can help the manufacturer determine acceptable levels of heat, vibration, and other variables.

· Big Data offers a range of analytical techniques that help organizations to develop new products and services.
Example: An application can analyze data to the most granular level, even to observe that the customers who bought a smart phone also bought memory cards or back covers.

Traditional IT Analytics Approach

The following are the requirements of the traditional IT analytics approach and the challenging factors:

Requirements:

· The business team needs to define questions before IT development.

· They need to define data sources and structures.

Challenging Factors:

· The requirements are iterative and volatile.

· The data sources keep changing.

In a typical scenario of traditional IT systems development, the requirements are defined first, followed by solution design and build. Once the solution is implemented, queries are executed. If there are new requirements or queries, the system is redesigned and rebuilt.

Define Requirements → Design Solution → Execute queries → Redesign and Rebuild for new requirements.

Approach for Big Data Solutions

Following are the requirements for using Big Data technology as a platform for discovery and exploration, and the challenges overcome by the same:

Requirements

· The business team needs to define data sources

· They need to establish the hypothesis

Challenges overcome by Big Data

· The technology should enable explorative analysis.

· Data systems and sources need to be integrated as required.

The following steps illustrate how IT systems are built with the help of Big Data technology.

· Initial Data Sources are identified

· IT Team creates a platform for creative exploration of available data and content

· The business teams determine the questions to ask and the hypotheses to test

· Any new questions lead to addition of data sources and integration without the need to redesign or rebuild the platform.

Big Data Technology Capabilities

· Understand and navigate Big Data sources

· Manage and store huge volumes of varied data

· Process data in a reasonable time

· Ingest data at a high speed

· Analyze unstructured data

· Bear faults and exceptions

Big Data Use Cases

The use cases of Big Data Hadoop are given below:

· Automotive: Auto sensors reporting location, problems

· Communication: Location based advertising

· Consumer Packaged Goods: Sentiment analysis of what’s hot, customer service

· Financial Services: Risk and portfolio analysis, new products

· Education and Research: Experiment sensors analysis

· High Technology/Industrial Mfg.: Manufacturing quality, warranty analysis

· Life Sciences: Clinical trials, genomics

· Media/Entertainment: Viewers/advertising effectiveness

· Online Services/Social Media: People and career matching

· Health Care: Patient sensors, monitoring, EHRs, quality of care

· Oil and Gas: Drilling exploration sensor analysis

· Retail: Consumer sentiment, optimized marketing

· Travel and Transportation: Sensor analysis for optimal traffic flows, customer sentiment

· Utilities: Smart Meter analysis for network capability

· Law Enforcement and Defense: Threat analysis – social media monitoring, photo analysis

Challenges of Big Data

Following are the challenges that need to be addressed by Big Data Technology:

· Fault tolerance and handling the system uptime and downtime

o Using commodity hardware for data storage and analysis

o Maintaining a copy of the same data across clusters

· Combining data accumulated from all systems

o Analyzing data across different machines

o Merging of data
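
The first challenge above, keeping copies of the same data on several commodity machines so that a failure does not lose it, can be sketched in Python as follows. This is a simplified illustration only: the round-robin placement rule, the function names, and the node and block names are assumptions made for this example, not the placement policy of any real system.

```python
def place_replicas(block_ids, nodes, replication=2):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one copy on a live node."""
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    ]


nodes = ["node1", "node2", "node3", "node4"]  # hypothetical machine names
placement = place_replicas(["blk_0", "blk_1", "blk_2"], nodes, replication=2)

# One failed machine: every block still has a surviving copy.
print(readable_blocks(placement, {"node2"}))

# Both holders of blk_1 fail: that block is no longer readable.
print(readable_blocks(placement, {"node2", "node3"}))
```

With two copies per block, any single machine can fail without data loss. Losing a block requires every machine that holds a copy of it to fail at once.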

Introduction to Hadoop

Following are the facts related to Hadoop and why it is required:

What is Hadoop?

· A free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.

· Based on the designs Google published for the Google File System (GFS) and MapReduce

Why Hadoop?

· Runs a number of applications on distributed systems with thousands of nodes involving petabytes of data

· Has a distributed file system, called the Hadoop Distributed File System or HDFS, which enables fast data transfer among the nodes

· Also includes a distributed processing framework called MapReduce
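
To give a feel for the MapReduce model, the sketch below counts words in plain Python on a single machine. It only mimics the map, shuffle, and reduce phases and does not use the Hadoop API. The function names and the sample lines are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


lines = ["big data needs big storage", "hadoop stores big data"]

# Each line could be mapped on a different node, since map calls are independent.
pairs = [pair for line in lines for pair in map_phase(line)]

# After grouping, each key can likewise be reduced independently of the others.
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Because each map call and each reduce call depends only on its own input, a framework can spread these calls across many nodes.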

Hadoop and Traditional RDBMS

Feature	RDBMS	Hadoop
Computing Model	· Notion of transactions · Transaction is the unit of work · ACID properties, Concurrency control	· Notion of jobs · Job is the unit of work · No concurrency control
Data Model	· Structured data with known schema · Read/Write mode	· Any data will fit in any format · (un)(semi)structured · Read Only Mode
Cost Model	· Expensive Server	· Cheap commodity machines
Fault Tolerance	· Failures are rare · Recovery mechanisms	· Failures are common over thousands of machines · Simple yet efficient fault tolerance

History and Milestones Of Hadoop

Hadoop originated from the Nutch open source project on search engines and works over distributed network nodes.

Period	Milestone
2003 & 2004	Google released two papers that provided insight into its success: "The Google File System" (GFS) and "MapReduce: Simplified Data Processing on Large Clusters". The papers told the world how Google performed large-scale data processing.
July 2005	Nutch used GFS to perform MapReduce operations
Feb 2006	Hadoop was split out of Nutch as a Lucene subproject, which led to the era of Hadoop
Apr 2007	Yahoo started using Hadoop on a 1000-node cluster
Jan 2008	Apache took over Hadoop and made it a top-level project
Jul 2008	A 4000-node cluster with Hadoop was tested by Apache. The performance of that cluster was surprisingly the fastest when compared to the other technologies implemented that year
May 2009	Hadoop successfully sorted a petabyte of data in 17 hours
Dec 2011	Hadoop reached version 1.0

Hadoop Core Services and Components

Major components of Hadoop are:

· HDFS: Runs on low-cost commodity hardware. It is highly fault tolerant and efficient enough to handle huge amounts of data.

· NameNode: The brain of the system. It stores the metadata of the data blocks along with their locations. If the NameNode crashes, the entire system becomes unavailable.

· Secondary NameNode: Keeps a checkpointed copy of the primary NameNode's namespace image. It is not a live standby, but if the primary NameNode crashes, the namespace image on the Secondary NameNode can be used to restart the system.

· DataNode: Stores the blocks of data.

· JobTracker: Accepts client jobs, creates Map and Reduce tasks, and schedules them. It can run on the same machine as the NameNode or on a different node.

· TaskTracker: Runs on the DataNodes. Its primary responsibility is to run the MapReduce tasks assigned by the JobTracker.

	Master	Slave 1	Slave 2	…	Slave N
MapReduce	JobTracker
MapReduce	TaskTracker	TaskTracker	TaskTracker		TaskTracker
HDFS	NameNode
HDFS	DataNode	DataNode	DataNode		DataNode

HDFS Architecture

HDFS architecture can be summarized as follows:

· The NameNode is the master and the DataNodes are the slaves.

· The NameNode is the brain of the system and controls client access to data. The DataNodes manage the storage of data.

· Each file is split into one or more blocks.

· When a client needs data, it first contacts the NameNode, which holds the metadata and replies with the locations of the data on the DataNodes.

· The client then interacts directly with the DataNodes until its data request is complete.
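
The read path described above can be sketched as a small in-memory simulation in Python. It is not the HDFS client API: the class and method names, the 8-byte block size, and the one-copy-per-block placement are assumptions chosen to keep the example short.

```python
class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self):
        self.file_blocks = {}      # file name -> ordered list of block ids
        self.block_locations = {}  # block id -> list of DataNode names

    def locate(self, filename):
        return [(b, self.block_locations[b]) for b in self.file_blocks[filename]]


class DataNode:
    """Stores the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def read_block(self, block_id):
        return self.blocks[block_id]


def store_file(namenode, datanodes, filename, data, block_size=8):
    """Split data into blocks, spread them over DataNodes, record the metadata."""
    names = sorted(datanodes)
    block_ids = []
    for offset in range(0, len(data), block_size):
        index = offset // block_size
        block_id = f"{filename}#blk{index}"
        target = names[index % len(names)]
        datanodes[target].blocks[block_id] = data[offset:offset + block_size]
        namenode.block_locations[block_id] = [target]
        block_ids.append(block_id)
    namenode.file_blocks[filename] = block_ids


def read_file(namenode, datanodes, filename):
    """Client read: ask the NameNode for locations, then fetch from DataNodes."""
    parts = []
    for block_id, holders in namenode.locate(filename):
        parts.append(datanodes[holders[0]].read_block(block_id))
    return b"".join(parts)


datanodes = {name: DataNode(name) for name in ("dn1", "dn2")}
namenode = NameNode()
store_file(namenode, datanodes, "notes.txt", b"hello hadoop file system")
print(read_file(namenode, datanodes, "notes.txt"))
```

The NameNode object never touches file contents. It only answers "which blocks, on which DataNodes", and the client then fetches the bytes from the DataNodes.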

Organizations Using Hadoop

The following table shows how various organizations use Hadoop:

Name of the Organization	Cluster Specifications	Uses
A9.com: Amazon	Clusters vary from 1 to 100 nodes	· Amazon’s product search indices are built using this program · Processes millions of sessions daily for analysis
Yahoo	More than 100,000 CPUs in approximately 20,000 computers running Hadoop; biggest cluster has 2000 nodes {2 * 4 cpu boxes with 4 TB disk space}	· To support research for ad systems and web search
AOL	Cluster size is 50 machines, Intel Xeon, dual processors, and dual core, each with 16GB RAM and 800 GB hard disk in total of 37 TB HDFS capacity	· For variety of functions ranging from generating data to running advanced algorithms for performing behavioral analysis and targeting
Facebook	320-machine cluster with 2,560 cores and about 1.3 PB raw storage	· Storing copies of internal logs and dimension data sources · Used as a source for reporting analytics and machine learning

Summary

· Big Data is characterized by volume, velocity, and variety.

· Data can be divided into three types – unstructured data, semi-structured data, and structured data.

· Big Data technology understands and navigates big data sources, analyzes unstructured data, and ingests data at a high speed.

· Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.

· Hadoop originated from the Nutch open source project on search engines and works over distributed network nodes.

· The core services of Hadoop are NameNode, DataNode, JobTracker, TaskTracker, and Secondary NameNode.

Wednesday, September 9, 2015