Hadoop Quick Notes: 6 Overview of Map Reduce and Yarn

Bigdata and Hadoop Administrator

Lesson 6: Overview of Map Reduce and YARN

6.2 Objectives:

After completing this lesson you will be able to:
Describe Map Reduce Architecture and Concepts
Explain Map Reduce applications and their libraries
Describe Map Reduce failure components and recovery components
Explain YARN concepts and its architecture
Install and configure YARN
Work with YARN and YARN WebUI

6.3 Map Reduce Introduction

Map Reduce is a programming model and an associated implementation for processing and generating large data sets with parallel, distributed algorithms on a cluster. A Map Reduce operation includes:

Specifying the computation in terms of map and reduce functions
Running the computation in parallel across large-scale clusters of machines
Handling machine failures and performance issues
Ensuring efficient communication between the nodes

6.4 Concepts of Map Reduce

Some of the key concepts of Map Reduce are:

It is a model for processing large amounts of data in parallel.
It is derived from functional programming, which provides the Map and Reduce functions.
It can be implemented in languages such as Java, C++, Ruby, and Python, but not in ColdFusion.

6.5 History of Map Reduce

2004 - Google publishes the Map Reduce paper

2006 - Hadoop becomes a sub-project of Lucene

2008 - Hadoop becomes an Apache top-level project

2012 - Map Reduce 2.0/YARN

6.6 Automatic Parallel Execution in Map Reduce

An architectural diagram on how parallel execution is done in map reduce is shown below:

The principal aim of the Map Reduce model is to hide the details of parallel execution and let the user concentrate only on the data processing strategy. The model consists of two user-defined functions: Map and Reduce.

The number of map tasks depends on how the input is split. In the diagram there are three map tasks. The mapper logic is applied to every input key-value pair, which generates intermediate key-value pairs. These intermediate pairs are stored and then pulled based on their key. The reducer logic is applied to these key-value pairs and the results are emitted.

Let us look at map task 1 in detail. The input to map task 1 is a list of (key1, value1) pairs. The map function is applied to these pairs to compute intermediate key-value pairs, i.e. (key2, value2). The intermediate pairs are then grouped together by key equality, i.e. (key2, list(value2)). For each key2, reduce works on the list of all its values and produces zero or more aggregated results. All the map tasks follow the same pattern.
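The (key1, value1) -> (key2, value2) -> (key2, list(value2)) flow can be sketched in plain Java, without Hadoop. This is only a minimal single-process illustration: the class name MiniMapReduce, the run and demo methods, and the even/odd sample data are all assumptions made for the example and are not part of any Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

// Hypothetical single-process model of the map -> group -> reduce flow.
final class MiniMapReduce {

    static <K1, V1, K2 extends Comparable<K2>, V2, R> Map<K2, R> run(
            List<Map.Entry<K1, V1>> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> mapFn,
            BiFunction<K2, List<V2>, R> reduceFn) {

        // Map: apply the mapper logic to every (key1, value1) pair.
        List<Map.Entry<K2, V2>> intermediate = new ArrayList<>();
        for (Map.Entry<K1, V1> pair : input) {
            intermediate.addAll(mapFn.apply(pair.getKey(), pair.getValue()));
        }

        // Group: collect the intermediate pairs into (key2, list(value2)).
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K2, V2> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }

        // Reduce: one call per key2 over the list of all its values.
        Map<K2, R> result = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduceFn.apply(group.getKey(), group.getValue()));
        }
        return result;
    }

    // Sample job (made up for illustration): sum the even and the odd numbers in 1..6.
    static Map<String, Integer> demo() {
        List<Map.Entry<Integer, Integer>> input = new ArrayList<>();
        for (int i = 1; i <= 6; i++) {
            input.add(Map.entry(i, i)); // key1 = record id, value1 = number
        }
        BiFunction<Integer, Integer, List<Map.Entry<String, Integer>>> mapFn =
                (id, n) -> List.of(Map.entry(n % 2 == 0 ? "even" : "odd", n));
        BiFunction<String, List<Integer>, Integer> reduceFn = (key, values) -> {
            int sum = 0;
            for (int v : values) {
                sum += v;
            }
            return sum;
        };
        return run(input, mapFn, reduceFn);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {even=12, odd=9}
    }
}
```

In a real cluster the three loops run on different machines and the grouping step is the shuffle; here they run one after another only to show the data shapes.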

6.7 Map Reduce Framework

Map Reduce Framework takes care of distributed processing and coordination. The high level steps of Map Reduce framework are:

Scheduling:

Jobs are broken down into smaller chunks called tasks.
These tasks are scheduled by the Job Tracker and executed by the Task Trackers.

Task Localization with Data:

Framework strives to place task on the nodes that host the segment of data to be processed by that specific task.
Code is moved to where the data is.

Error Handling:

Failures are expected behavior, so failed tasks are automatically retried on other machines.

Data Synchronization:

The shuffle and sort barrier rearranges and moves data between machines.
Input and output are coordinated by the framework.

6.8 How Map and Reduce work together

The combined tasks of Map and Reduce are:

1. The mapper passes information to the reducer.

2. The reducer accepts the information and starts its processing. Meanwhile, tasks such as sorting and combining are also performed.

3. The reduce function removes the unwanted data.

6.9 Map Reduce - Example

Count the occurrence of different words in the collection:

Suppose you are passing a large file with words like:

Cold Warm, Warm Hot Hotter, Hotter Hot Cold Warm,.....

Now the task is to count the occurrences of different words in the collection:
{Cold Warm, Warm Hot Hotter, Hotter Hot Cold Warm}
To design a solution:
* Start from scratch
* Add and relax constraints
* Perform incremental design, improving the solution for performance and scalability.

6.10 Workflow of Map Reduce

Input: {Cold Warm, Warm Hot Hotter, Hotter Hot Cold Warm}

Split	Mapping
{Cold Warm}	{Cold, 1} {Warm, 1}
{Warm Hot Hotter}	{Warm, 1} {Hot, 1} {Hotter, 1}
{Hotter Hot Cold Warm}	{Hotter, 1} {Hot, 1} {Cold, 1} {Warm, 1}

Shuffle	Reducing
{Cold, 1} {Cold, 1}	{Cold, 2}
{Hot, 1} {Hot, 1}	{Hot, 2}
{Hotter, 1} {Hotter, 1}	{Hotter, 2}
{Warm, 1} {Warm, 1} {Warm, 1}	{Warm, 3}

Output: Cold, 2 Warm, 3 Hot, 2 Hotter, 2
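The split, mapping, shuffle, and reducing stages of this workflow can be traced with a small plain-Java sketch. It runs in one process and does not use Hadoop; the class name WordCountFlow and its methods are hypothetical names chosen for the example, and the comma is assumed to be the split boundary, as in the sample input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical walk-through of the word count workflow stages.
final class WordCountFlow {

    // Mapping: turn one split into a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle: bring all the 1s that belong to the same word together, sorted by word.
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapped) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> pairs : mapped) {
            for (Map.Entry<String, Integer> pair : pairs) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        return grouped;
    }

    // Reducing: add up the 1s collected for one word.
    static int reduce(List<Integer> ones) {
        int total = 0;
        for (int one : ones) {
            total += one;
        }
        return total;
    }

    static Map<String, Integer> count(String input) {
        // Split: one split per comma-separated group of words.
        List<List<Map.Entry<String, Integer>>> mapped = new ArrayList<>();
        for (String split : input.split(",")) {
            mapped.add(map(split));
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(mapped).entrySet()) {
            output.put(group.getKey(), reduce(group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        // prints {Cold=2, Hot=2, Hotter=2, Warm=3}
        System.out.println(count("Cold Warm, Warm Hot Hotter, Hotter Hot Cold Warm"));
    }
}
```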

6.11 Map Reduce Characteristics

Some of the Map Reduce characteristics are as follows:

It handles very large-scale data: petabytes, exabytes, and so on.
It works well on Write Once and Read Many (WORM) data.
It allows parallelism without mutexes.
The map and reduce operations are typically performed by the same physical processor.
The operations are provisioned near the data, i.e. data locality is preferred.
Commodity hardware and storage are leveraged.
The run-time takes care of splitting and moving data for operations.

6.12 Development and Libraries of Map Reduce

The native libraries give developers the finest granularity of coding. Given that all other approaches are essentially abstractions, this language offers the least overhead and the best performance.

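import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Enclosing class and fields follow the standard Hadoop WordCount example.
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line into whitespace-separated tokens.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one); // emit (word, 1)
        }
    }
}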

The two things to notice here are:

The mapper looks at a data set and reads it line by line. The mapper's StringTokenizer then splits each line into words, which are emitted as key-value pairs.
The reducer code (not shown in the fragment above) receives the key-value pairs, counts each occurrence, and writes the result to disk.

6.13 Map Reduce Failure and Recovery

The steps involved in Map Reduce failure and recovery:

Task processes send heartbeats to the Task Tracker.
Task Trackers send heartbeats to the Job Tracker.
Any task that fails to respond within 10 minutes or throws an exception is killed by the Task Tracker.
The Task Tracker reports the failed tasks to the Job Tracker.
The Job Tracker reschedules any failed task on a different Task Tracker.
If a task fails 4 times (the default maximum number of attempts), the whole job fails.
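The timeout and rescheduling rules above can be modelled with a short plain-Java sketch. It is not Hadoop code: the class TaskRetrySketch, its methods, and the round-robin choice of the next Task Tracker are assumptions made for illustration, with the 10-minute timeout and the limit of 4 attempts taken from the steps above.

```java
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical model of task timeout detection and rescheduling.
final class TaskRetrySketch {
    static final int MAX_ATTEMPTS = 4;                // attempts before the job fails
    static final long TIMEOUT_MS = 10 * 60 * 1000L;   // 10 minutes without a heartbeat

    // Task Tracker side: a task that has been silent for the timeout period is killed.
    static boolean isTimedOut(long lastHeartbeatMs, long nowMs) {
        return nowMs - lastHeartbeatMs >= TIMEOUT_MS;
    }

    // Job Tracker side: retry a failed task on another Task Tracker.
    // 'attempt' receives (trackerName, attemptNumber) and returns true on success.
    // Returns the attempt number that succeeded, or -1 if the whole job fails.
    static int runTask(List<String> trackers, BiPredicate<String, Integer> attempt) {
        for (int n = 1; n <= MAX_ATTEMPTS; n++) {
            // Assumption: pick the next tracker round-robin so a retry lands elsewhere.
            String tracker = trackers.get((n - 1) % trackers.size());
            if (attempt.test(tracker, n)) {
                return n;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> trackers = List.of("tt1", "tt2", "tt3");
        // A task that only succeeds on its third attempt.
        System.out.println(runTask(trackers, (tracker, n) -> n == 3)); // prints 3
        // A task that always fails takes the whole job down.
        System.out.println(runTask(trackers, (tracker, n) -> false));  // prints -1
    }
}
```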

6.14 Introduction to YARN

Map Reduce version 1 had three main components:

APIs for user-level programming of Map Reduce,
Run-time services for Map Reduce, and
Infrastructure to monitor nodes, allocate resources, and schedule jobs.

A new component called Yet Another Resource Negotiator (YARN) was introduced in Hadoop 2.0, and resource management was moved from Map Reduce to YARN.

The Map Reduce API and framework are still handled by Map Reduce, while the YARN API and resource management are now handled by YARN.

6.15 Need for YARN

The need for YARN arose from the resource management issues in the earlier version of Map Reduce. The issues were:

cluster utilization was very low during Map Reduce tasks,
resources were not shared,
the least preference was given to non-Map Reduce applications, and
there was only one Job Tracker per cluster.

6.16 Benefits of Yarn

The benefits of YARN are as follows:

With the introduction of YARN, the resources are assigned to applications as and when needed.
MapReduce and non-MapReduce applications run on the same cluster.
An Application Master was introduced, to which most of the Job Tracker responsibilities were delegated.
One cluster can have many Application Masters, one per application.

6.17 The YARN Architecture

The fundamental idea of Map Reduce 2.0 is to split the two main functions of the Job Tracker, resource management and job scheduling/monitoring, into separate daemons. The idea of having a global Resource Manager and a per-application Application Master was also introduced in this version.

The Resource Manager has two main components:

Scheduler: Is responsible for allocating resources to the various applications and does not do any monitoring or tracking of status.
Applications Manager: Accepts job submissions and negotiates the first container, in which the application's Application Master is executed.
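The division of work described above can be pictured with a toy Java model. It does not use the real YARN API: the classes YarnSketch, Scheduler, and ResourceManager, the slot-based notion of capacity, and the placement strings are all assumptions made for the sketch. The scheduler only hands out containers and tracks nothing afterwards, and the first container granted to a submitted job hosts its Application Master.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of the Resource Manager and its scheduler.
final class YarnSketch {

    // Scheduler: allocates containers from free node capacity, with no monitoring.
    static final class Scheduler {
        private final Map<String, Integer> freeSlots = new LinkedHashMap<>();

        void addNode(String node, int slots) {
            freeSlots.put(node, slots);
        }

        // Returns the node that hosts the new container, or null if the cluster is full.
        String allocate() {
            for (Map.Entry<String, Integer> node : freeSlots.entrySet()) {
                if (node.getValue() > 0) {
                    node.setValue(node.getValue() - 1);
                    return node.getKey();
                }
            }
            return null;
        }
    }

    static final class ResourceManager {
        final Scheduler scheduler = new Scheduler();

        // Accepts a job submission. The first container runs the job's
        // Application Master; the remaining containers run its tasks.
        List<String> submit(String job, int taskContainers) {
            List<String> placements = new ArrayList<>();
            String masterNode = scheduler.allocate();
            if (masterNode == null) {
                return placements; // no capacity: nothing was started
            }
            placements.add(job + ":AM@" + masterNode);
            for (int i = 0; i < taskContainers; i++) {
                String node = scheduler.allocate();
                if (node == null) {
                    break;
                }
                placements.add(job + ":task" + i + "@" + node);
            }
            return placements;
        }
    }

    public static void main(String[] args) {
        ResourceManager rm = new ResourceManager();
        rm.scheduler.addNode("node1", 2);
        rm.scheduler.addNode("node2", 1);
        // prints [job1:AM@node1, job1:task0@node1, job1:task1@node2]
        System.out.println(rm.submit("job1", 2));
    }
}
```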

6.18 The YARN Architecture Illustration

A typical YARN architecture consists of:

Client >> Resource Manager >>

Friday, January 15, 2016