Wednesday, September 9, 2015

0 Hadoop Admin Course

Big Data and Hadoop Administrative Course

Lesson 00 Course Introduction

This lesson gives you an overview of the course, its prerequisites, and the opportunities it opens up.

Course Objectives

·         Describe Big Data and Hadoop ecosystem
·         Describe advanced cluster configuration features
·         Explain Hadoop Distributed File System
·         Discuss MapReduce and YARN
·         Discuss Hadoop administration and maintenance
·         Explain important Hadoop components and ecosystem components

Course Overview

This training course provides the following:
·         A detailed introduction to the basics of Apache Hadoop
·         Knowledge of planning a Hadoop cluster, Sqoop, MapReduce, YARN, Pig, Hive, and Impala
·         An overview of the Hadoop installation and configuration, advanced cluster configuration features, HDFS, and important Hadoop components
·         Knowledge on Hadoop administration and maintenance, and Hadoop ecosystem components
The target audience of this course is as follows:
·         Professionals aspiring for a career in Big Data analytics using Apache Hadoop
·         Individuals who intend to design, deploy, and maintain Hadoop clusters
·         System Administrators, Developers, Architects, IT professionals, Analytics professionals, and experts are also key beneficiaries
·         Other aspirants and students who wish to gain a thorough understanding of Hadoop clusters
The prerequisites of the Big Data and Hadoop Administrator course are as follows:
·         Fundamental knowledge of any programming language
·         Good knowledge of Linux
·         Fundamental programming skills
·         Working knowledge of Java (not mandatory)

Value to Professionals

Hadoop professionals will be:
·         Equipped with Hadoop framework skills in the fast-growing Big Data analytics industry.
·         Prepared to drive Big Data strategy, from Hadoop implementation and cluster monitoring through to strong security, at high speed and scale.
·         In demand at leading organizations worldwide over the next decade.
·         Leading the move from traditional databases and data warehouses to a more flexible, scalable architecture based on Apache Hadoop.

Lessons Covered

The following lessons are covered in this course:
Lesson 1 Introduction to Big Data and Hadoop
Lesson 2 Planning Hadoop Cluster
Lesson 3 Hadoop Installation and Configuration
Lesson 4 Advanced Clusters and Configuration
Lesson 5 Hadoop Distributed File System
Lesson 6 Overview of MapReduce and YARN
Lesson 7 Important Hadoop Components
Lesson 8 Hadoop Administration and Maintenance
Lesson 9 Hadoop Ecosystem Components

Demos and Lab Exercises are also included in the course.

1 Introduction to Big Data and Hadoop

Introduction to Big Data and Hadoop

Objectives

·         Identify the need for Big Data
·         Explain the concept of Big Data
·         Describe the basics of Hadoop
·         Explain the benefits of Hadoop

Introduction

Over 2.5 exabytes (2.5 billion gigabytes) of data is generated every day.
Following are some of the sources of the huge volume of data:
·         A typical large stock exchange captures more than 1 TB of data every day.
·         There are around 5 billion mobile phones (including 1.75 billion smartphones) in the world.
·         YouTube users upload more than 48 hours of video every minute.
·         Large social networks such as Twitter and Facebook capture more than 10 TB of data daily.
·         There are more than 30 million networked sensors in the world.

Types of Data

There are three types of data:
·         Structured data: Data that is represented in a tabular format with a fixed schema. E.g. databases
·         Semi-structured data: Data that does not follow a formal tabular data model but carries tags or markers that describe its fields. E.g. XML files
·         Unstructured data: Data that does not have a predefined data model. E.g. text files
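The difference between the three types shows up as soon as you try to read them. The following is a minimal sketch using only the Python standard library; the sample records and the function names are invented for illustration and are not part of the course material.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample records, invented for illustration only.
STRUCTURED_CSV = "id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n"
SEMI_STRUCTURED_XML = (
    "<customers>"
    "<customer id='1'><name>Asha</name><city>Pune</city></customer>"
    "<customer id='2'><name>Ravi</name></customer>"
    "</customers>"
)
UNSTRUCTURED_TEXT = "Asha called from Pune about a delayed order."


def read_structured(text):
    """Every row has the same columns, so each row maps cleanly to a dict."""
    return list(csv.DictReader(io.StringIO(text)))


def read_semi_structured(text):
    """Tags describe each field, but a record may omit some fields."""
    records = []
    for node in ET.fromstring(text).findall("customer"):
        record = {"id": node.get("id")}
        for child in node:
            record[child.tag] = child.text
        records.append(record)
    return records


def read_unstructured(text):
    """No fields at all: without further analysis we can only tokenize."""
    return text.split()
```

Note that the second XML record has no city, which the parser accepts without complaint. A relational table would require a value (or an explicit NULL) in that column.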

Characteristics of Big Data

Big data has three characteristics: variety, velocity, and volume.
·         Variety: Variety encompasses managing the complexity of data in many different structures, ranging from relational data to logs and raw text.
·         Velocity: Velocity accounts for the streaming of data and the movement of large volumes of data at high speed.
·         Volume: Volume denotes the scaling of data ranging from terabytes to zettabytes.

Appeal of Big Data Technology

Big Data Technology is appealing because of the following reasons:
·         It helps to manage and process a huge amount of data cost-efficiently.
·         It analyzes data in its native form, which may be unstructured, structured, or streaming.
·         It captures data from fast-happening events in real time.
·         It can handle failure of isolated nodes and tasks assigned to such nodes.
·         It can turn data into actionable insights.

Business Benefits of Big Data Technology

Following are the business benefits of implementing Big Data technology, with examples:
·         It can help organizations to create personalized products, gain insight into products that are profitable, and retain customers by solving their problems.
Example: Big Data analytics lets banks study the saving patterns and habits of individual customers.
·         Big Data analytics solutions support or automate cost cutting, improve operational efficiency, and enable the evaluation of historical trends.
Example: Using Big Data analytics, banks keep track of the geographical locations where their clients shop.
·         By using Big Data predictive analysis techniques, organizations can provide early warning of a problem and enable preventive maintenance to avoid a potential outage.
Example: A laptop manufacturer can collect and analyze data generated during component manufacturing. This data can help the manufacturer determine acceptable levels of heat, vibration, and other variables.
·         Big Data offers a range of analytical techniques that help organizations develop new products and services.
Example: An application can analyze data at the most granular level, for instance to observe that customers who bought a smartphone also bought memory cards or back covers.

Traditional IT Analytics Approach

The following are the requirements of the traditional IT analytics approach and the challenging factors:
Requirements:
·         The business team needs to define questions before IT development.
·         They need to define data sources and structures.
Challenging Factors:
·         The requirements are iterative and volatile.
·         The data sources keep changing.
In a typical scenario of traditional IT systems development, the requirements are defined, followed by solution design and build. Once the solution is implemented, queries are executed. If there are new requirements or queries, the system is redesigned and rebuilt.
Define requirements → Design solution → Execute queries → Redesign and rebuild for new requirements.

Approach for Big Data Solutions

Following are the requirements for using Big Data technology as a platform for discovery and exploration, and the challenges overcome by the same:
Requirements
·         The business team needs to define data sources
·         They need to establish the hypothesis
Challenges overcome by Big Data
·         The technology should enable explorative analysis.
·         Data systems and sources need to be integrated as required.
The following steps illustrate how IT systems are built with the help of Big Data technology:
·         Initial Data Sources are identified
·         IT Team creates a platform for creative exploration of available data and content
·         The business teams determine the questions to ask and test their hypotheses
·         Any new questions lead to addition of data sources and integration without the need to redesign or rebuild the platform.

Big Data Technology Capabilities

·         Understand and navigate Big Data sources
·         Manage and store huge volumes of varied data
·         Process data in reasonable time
·         Ingest data at a high speed
·         Analyze unstructured data
·         Tolerate faults and exceptions

Big Data Use Cases

The use cases of Big Data Hadoop are given below:
·         Automotive: Auto sensors reporting location, problems
·         Communication: Location based advertising
·         Consumer Packaged Goods: Sentiment analysis of what’s hot, customer service
·         Financial Services: Risk and portfolio analysis, new products
·         Education and Research: Experiment sensor analysis
·         High Technology/Industrial Manufacturing: Manufacturing quality, warranty analysis
·         Life Sciences: Clinical trials, genomics
·         Media/Entertainment: Viewer and advertising effectiveness
·         Online Services/Social Media: People and career matching
·         Health Care: Patient sensors, monitoring, EHRs, quality of care
·         Oil and Gas: Drilling exploration sensor analysis
·         Retail: Consumer sentiment, optimized marketing
·         Travel and Transportation: Sensor analysis for optimal traffic flows, customer sentiment
·         Utilities: Smart Meter analysis for network capability
·         Law Enforcement and Defense: Threat analysis – social media monitoring, photo analysis

Challenges of Big Data

Following are the challenges that need to be addressed by Big Data Technology:
·         Fault tolerance and handling the system uptime and downtime
o   Using commodity hardware for data storage and analysis
o   Maintaining a copy of the same data across clusters
·         Combining data accumulated from all systems
o   Analyzing data across different machines
o   Merging of data
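To make the fault-tolerance point concrete, here is a small sketch of why keeping several copies of each block on different machines lets a cluster survive node failures. It is an illustration only: the function names are hypothetical, the replication factor of 3 is an assumption, and the round-robin placement is a deliberate simplification rather than the placement policy Hadoop actually uses.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        # Consecutive offsets modulo len(nodes) are always distinct nodes.
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one live replica."""
    failed = set(failed_nodes)
    return [b for b, holders in placement.items() if set(holders) - failed]


# Four hypothetical nodes, four blocks, three copies of each block.
placement = place_replicas(["b0", "b1", "b2", "b3"], ["n1", "n2", "n3", "n4"])
still_ok = readable_blocks(placement, failed_nodes=["n1", "n2"])
```

With three copies per block, losing any two of the four nodes still leaves every block readable. Data is lost only when all holders of a block fail together.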

Introduction to Hadoop

Following are the facts related to Hadoop and why it is required:
What is Hadoop?
·         A free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
·         Based on Google File System (GFS)
Why Hadoop?
·         Runs a number of applications on distributed systems with thousands of nodes involving petabytes of data
·         Has a distributed file system, called Hadoop Distributed File System or HDFS, which enables fast data transfer among the nodes
·         Also includes a distributed processing framework called MapReduce
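The MapReduce model itself can be sketched in a few lines. The example below is a single-process word count in plain Python, not the Hadoop API: the function names are hypothetical and there is no distribution across nodes. It only shows the three logical phases (map, shuffle, reduce) that a MapReduce job goes through.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs).items())


counts = word_count(["big data big cluster", "Data node"])
```

On a real cluster the map calls would run in parallel on the nodes that hold the input, and the shuffle would move intermediate pairs across the network to the reducers.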

Hadoop and Traditional RDBMS

Feature: Computing Model
RDBMS:
·      Notion of transactions
·      Transaction is the unit of work
·      ACID properties, concurrency control
Hadoop:
·      Notion of jobs
·      Job is the unit of work
·      No concurrency control
Feature: Data Model
RDBMS:
·      Structured data with known schema
·      Read/write mode
Hadoop:
·      Any data will fit, in any format
·      Unstructured, semi-structured, or structured
·      Read-only mode
Feature: Cost Model
RDBMS:
·      Expensive servers
Hadoop:
·      Cheap commodity machines
Feature: Fault Tolerance
RDBMS:
·      Failures are rare
·      Recovery mechanisms
Hadoop:
·      Failures are common over thousands of machines
·      Simple yet efficient fault tolerance

History and Milestones Of Hadoop

Hadoop originated from the Nutch open source search engine project and works over distributed network nodes.
Period: Milestone
2003 and 2004: Google released two papers that provided insight into its success: "The Google File System" (GFS) and "MapReduce: Simplified Data Processing on Large Clusters". The papers told the world how Google performed large-scale data processing.
July 2005: Nutch used a GFS-style file system to perform MapReduce operations.
Feb 2006: Hadoop was split out of Nutch into a separate Lucene subproject, which marked the start of the Hadoop era.
Apr 2007: Yahoo started using Hadoop on a 1000-node cluster.
Jan 2008: Hadoop became a top-level Apache project.
Jul 2008: A 4000-node Hadoop cluster was tested. Its performance was the fastest when compared with the other technologies implemented that year.
May 2009: Hadoop successfully sorted a petabyte of data in 17 hours.
Dec 2011: Hadoop reached version 1.0.

Hadoop Core Services and Components

Major components of Hadoop are:
·         HDFS: HDFS runs on low-cost commodity hardware. It is highly fault tolerant and efficient enough to process huge amounts of data.
·         NameNode: The brain of the system. It stores the metadata of the data blocks along with the location of each block. If the NameNode crashes, the entire file system becomes unavailable.
·         Secondary NameNode: Periodically checkpoints the NameNode's namespace image by merging it with the edit log. It is not a hot standby, but if the primary NameNode crashes, the namespace image kept on the Secondary NameNode can be used to restart the system.
·         DataNode: Stores the blocks of data.
·         JobTracker: Accepts client jobs, creates Map and Reduce tasks, and schedules them. It can run on the same machine as the NameNode or on a different node.
·         TaskTracker: Runs on the DataNodes. Its primary responsibility is to run the MapReduce tasks assigned by the JobTracker.

Layer        Master        Slave 1       Slave 2       ...   Slave N
MapReduce    JobTracker
             TaskTracker   TaskTracker   TaskTracker   ...   TaskTracker
HDFS         NameNode
             DataNode      DataNode      DataNode      ...   DataNode

HDFS Architecture

HDFS architecture can be summarized as follows:
·         The NameNode is the master and the DataNodes are the slaves.
·         The NameNode is the brain of the system: it holds the metadata and regulates client access to data. The DataNodes manage the storage of data.
·         Each file is split into one or more blocks, which are stored on the DataNodes.
·         When a client needs data, it first contacts the NameNode, which holds the metadata and replies with the location of the data on the DataNodes.
·         After this, the client interacts directly with the DataNodes until its data requirement is fulfilled.
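The read path described above can be sketched as a toy in-memory model. This is a simplified illustration, not the HDFS client API: the class and function names are hypothetical, the 8-byte block size is chosen only to keep the example small, and replication is left out entirely.

```python
BLOCK_SIZE = 8  # bytes; tiny on purpose so the example produces several blocks


class DataNode:
    """Stores block contents keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}


class NameNode:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self, datanode_names):
        self.datanode_names = list(datanode_names)
        self.metadata = {}  # path -> list of (block_id, datanode_name)

    def allocate(self, path, block_count):
        """Pick a DataNode for each block of a new file (round-robin)."""
        count = len(self.datanode_names)
        locations = [(f"{path}#blk{i}", self.datanode_names[i % count])
                     for i in range(block_count)]
        self.metadata[path] = locations
        return locations

    def locate(self, path):
        """Step 1 of a read: return block locations, never the data itself."""
        return self.metadata[path]


def client_write(namenode, datanodes, path, data):
    """Split data into blocks and store each on the DataNode chosen for it."""
    chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    locations = namenode.allocate(path, len(chunks))
    for (block_id, node_name), chunk in zip(locations, chunks):
        datanodes[node_name].blocks[block_id] = chunk


def client_read(namenode, datanodes, path):
    """Step 2 of a read: fetch each block directly from its DataNode."""
    return b"".join(datanodes[node_name].blocks[block_id]
                    for block_id, node_name in namenode.locate(path))


datanodes = {name: DataNode(name) for name in ("dn1", "dn2")}
namenode = NameNode(datanodes)
client_write(namenode, datanodes, "/logs/a.txt", b"hello hadoop world")
content = client_read(namenode, datanodes, "/logs/a.txt")
```

The point of the design is visible in `locate`: the NameNode answers with locations only, so the bulk data never passes through the master and the DataNodes can serve many clients in parallel.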

Organizations Using Hadoop

The following table shows how various organizations use Hadoop:
Organization: A9.com (Amazon)
Cluster specifications: Clusters vary from 1 to 100 nodes
Uses:
·    Amazon's product search indices are built using Hadoop
·    Processes millions of sessions daily for analysis
Organization: Yahoo
Cluster specifications: More than 100,000 CPUs in approximately 20,000 computers running Hadoop; the biggest cluster has 2000 nodes (2 x 4 CPU boxes with 4 TB disk space)
Uses:
·    To support research for ad systems and web search
Organization: AOL
Cluster specifications: 50 machines (Intel Xeon, dual processor, dual core), each with 16 GB RAM and an 800 GB hard disk, for a total of 37 TB of HDFS capacity
Uses:
·    A variety of functions ranging from generating data to running advanced algorithms for behavioral analysis and targeting
Organization: Facebook
Cluster specifications: 320-machine cluster with 2,560 cores and about 1.3 PB of raw storage
Uses:
·    Storing copies of internal logs and dimension data sources
·    Used as a source for reporting analytics and machine learning
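As a quick sanity check, the cluster figures quoted above can be turned into per-machine numbers. The sketch below only does arithmetic on the values stated in the table; it assumes decimal units (1 PB = 1000 TB, 1 TB = 1000 GB), and the variable names are made up for the example.

```python
def per_machine(total, machines):
    """Average share of a cluster-wide total that falls on one machine."""
    return total / machines


# Facebook: 320 machines, 2,560 cores, about 1.3 PB (1300 TB) of raw storage.
facebook_cores_each = per_machine(2560, 320)
facebook_tb_each = per_machine(1300, 320)

# AOL: 50 machines with 800 GB disks, 37 TB of HDFS capacity.
aol_raw_tb = 50 * 800 / 1000
aol_hdfs_share = 37 / aol_raw_tb

# Yahoo: more than 100,000 CPUs in approximately 20,000 computers.
yahoo_cpus_each = per_machine(100_000, 20_000)
```

Under these assumptions the Facebook machines come out at 8 cores and roughly 4 TB each, and AOL exposes about 92.5% of its 40 TB of raw disk as HDFS capacity.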

Summary

·         Big Data is characterized by volume, velocity, and variety.
·         Data can be divided into three types: unstructured data, semi-structured data, and structured data.
·         Big Data technology understands and navigates big data sources, analyzes unstructured data, and ingests data at a high speed.
·         Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
·         Hadoop originated from the Nutch open source search engine project and works over distributed network nodes.
·         The core services of Hadoop are NameNode, DataNode, JobTracker, TaskTracker, and Secondary NameNode.

Tuesday, September 1, 2015

Setting Up Hadoop Single Node Cluster

To set up a single-node Hadoop cluster, you will need a Linux system:

Install Linux


Install Java

For Debian and Ubuntu:

Hadoop Introduction

For a detailed introduction to Hadoop, please visit: http://hadoop.apache.org/

Hadoop

  • Setting Up Hadoop Single Node Cluster
  • Setting Up Hadoop Multi Node Cluster
  • Hadoop documentation is classified as follows:
      • Hadoop Common
      • Hadoop Distributed File System (HDFS)
      • Hadoop MapReduce
      • Hadoop Yet Another Resource Negotiator (YARN)

Other Projects Under Hadoop

  • Ambari - Web Based Tool for provisioning, managing and monitoring Apache Hadoop Clusters
  • Avro - Data Serialization System
  • Cassandra - Scalable Multi-Master Database with no single point of failure.
  • Chukwa - Data Collection System
  • HBase - A scalable, distributed database that supports structured data storage for large tables.
  • Hive - A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout - A Scalable machine learning and data mining library
  • Pig - A high-level data-flow language and execution framework for parallel computation
  • Spark - A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation
  • Tez - A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine
  • ZooKeeper - A high-performance coordination service for distributed applications