Bigdata and Hadoop Administrator
Lesson 6: Overview of Map Reduce and YARN
6.2 Objectives:
After completing this lesson you will be able to:
- Describe Map Reduce architecture and concepts
- Explain Map Reduce applications and their libraries
- Describe Map Reduce failure and recovery components
- Explain YARN concepts and its architecture
- Install and configure YARN
- Work with YARN and YARN WebUI
6.3 Map Reduce Introduction
Map Reduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. A Map Reduce operation includes:
- Specifying the computation in terms of a map function and a reduce function
- Running the computation in parallel across large-scale clusters of machines
- Handling machine failures and performance issues
- Ensuring efficient communication between the nodes
6.4 Concepts of Map Reduce
Some of the key concepts of Map Reduce are:
- A model for processing large amounts of data in parallel
- Derived from functional programming, which provides the map and reduce functions
- Can be implemented in languages such as Java, C++, Ruby, and Python, but not in ColdFusion
6.5 History of Map Reduce
2004 - Map Reduce paper published
2006 - Hadoop becomes a Lucene sub-project
2008 - Hadoop becomes an Apache top-level project
2012 - Map Reduce 2.0/YARN
6.6 Automatic Parallel Execution in Map Reduce
Parallel execution in Map Reduce, as laid out in the lesson's architectural diagram, works as follows:
The principal aim of the Map Reduce model is to hide the details of parallel execution and let the user concentrate on the data processing logic. The model consists of two user-supplied functions: Map and Reduce. Depending on the input, the work is divided into map tasks; in the diagram there are three. The mapper logic is applied to every input key-value pair, and intermediate key-value pairs are generated. These intermediate key-value pairs are stored and then pulled by the reducers based on the key. The reducer logic is then applied to these key-value pairs and the results are emitted.
Let us look at the parallel execution in map task 1 in detail. The input for map task 1 is a list of (key1, value1) pairs. The map function is applied to each pair to compute intermediate key-value pairs, i.e. (key2, value2). The intermediate key-value pairs are then grouped together by key equality, i.e. (key2, list(value2)). For every key2, reduce works over the whole list of values and reduces it to an aggregated result for that key. A similar pattern is followed by all the map tasks.
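The flow from (key1, value1) through (key2, value2) and (key2, list(value2)) to the reduced result can be sketched in a single JVM. This is a minimal illustration, not the Hadoop API: the class MiniMapReduce and its methods run and wordCount are hypothetical names, and a parallel stream stands in for the cluster.

```java
import java.util.*;
import java.util.function.BiFunction;
import java.util.stream.Collectors;

// Hypothetical single-JVM stand-in for the model; not the Hadoop API.
final class MiniMapReduce {
    // mapper:  (key1, value1)       -> list of (key2, value2)
    // reducer: (key2, list(value2)) -> result
    static <K1, V1, K2 extends Comparable<K2>, V2, R> Map<K2, R> run(
            Map<K1, V1> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> mapper,
            BiFunction<K2, List<V2>, R> reducer) {
        // Map phase: every input pair is independent, so the pairs can be processed in parallel.
        List<Map.Entry<K2, V2>> intermediate = input.entrySet().parallelStream()
                .flatMap(e -> mapper.apply(e.getKey(), e.getValue()).stream())
                .collect(Collectors.toList());

        // Grouping: collect all values that share the same key2.
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K2, V2> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce phase: the reducer is applied once per key2.
        Map<K2, R> result = new TreeMap<>();
        grouped.forEach((key, values) -> result.put(key, reducer.apply(key, values)));
        return result;
    }

    // Usage: key1 is a split number, value1 is the text of that split.
    static Map<String, Integer> wordCount(Map<Integer, String> splits) {
        return MiniMapReduce.<Integer, String, String, Integer, Integer>run(
                splits,
                (id, text) -> {
                    List<Map.Entry<String, Integer>> out = new ArrayList<>();
                    for (String w : text.trim().split("\\s+")) {
                        out.add(new AbstractMap.SimpleEntry<>(w, 1)); // emit (word, 1)
                    }
                    return out;
                },
                (word, ones) -> ones.stream().mapToInt(Integer::intValue).sum());
    }
}
```

The user only supplies the two lambdas; splitting the work, grouping by key, and collecting results stay inside run, which is the division of labour the model aims for.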
6.7 Map Reduce Framework
The Map Reduce framework takes care of distributed processing and coordination. The high-level steps of the framework are:
Scheduling:
- Jobs are broken down into smaller chunks called tasks.
- These tasks are scheduled by the Job Tracker and run by the Task Trackers.
Task Localization with Data:
- The framework strives to place each task on a node that hosts the segment of data to be processed by that task.
- Code is moved to where the data is.
Error Handling:
- Failures are expected behavior, so failed tasks are automatically re-attempted on other machines.
Data Synchronization:
- The shuffle and sort barrier rearranges and moves data between machines.
- Input and output are coordinated by the framework.
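The task localization step above can be sketched as a small placement rule. This is only an illustration under assumed names (LocalityScheduler and pickNode are hypothetical, not Hadoop classes): prefer a free node that already hosts the task's data, and fall back to any free node otherwise.

```java
import java.util.*;

// Hypothetical sketch of locality-aware placement; not a Hadoop class.
final class LocalityScheduler {
    // Prefer a free node that hosts the task's data block; otherwise fall back
    // to any free node, in which case the data has to travel over the network.
    static Optional<String> pickNode(Set<String> nodesHostingBlock, Collection<String> freeNodes) {
        for (String node : freeNodes) {
            if (nodesHostingBlock.contains(node)) {
                return Optional.of(node); // data-local placement
            }
        }
        return freeNodes.stream().findFirst(); // empty if no node is free
    }
}
```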
6.8 How Map and Reduce work together
The combined tasks of Map and Reduce are:
1. The Mapper passes information to the Reducer.
2. The Reducer accepts the information and starts its process. Meanwhile, tasks such as sorting and combining are also performed.
3. The Reduce function removes the unwanted data.
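The sorting and combining mentioned in step 2 can be pictured as a local pre-aggregation of one mapper's output before it is passed on. The sketch below is a plain-Java stand-in (LocalCombiner and combine are hypothetical names, not the Hadoop combiner API) that merges duplicate keys and keeps them sorted.

```java
import java.util.*;

// Hypothetical stand-in for a combiner; not the Hadoop API.
final class LocalCombiner {
    // Merge the (word, count) pairs emitted by one mapper so that each word is
    // passed on only once; the TreeMap keeps the keys in sorted order.
    static SortedMap<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        SortedMap<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }
}
```

Combining locally reduces the amount of data that has to move between machines before the reduce step starts.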
6.9 Map Reduce - Example
Count the occurrences of different words in a collection.
Suppose you are passing in a large file with words like:
Cold Warm, Warm Hot Hotter, Hotter Hot Cold Warm, ...
The task is to count the occurrences of each word in the collection:
{Cold Warm, Warm Hot Hotter, Hotter Hot Cold Warm}
To design a solution:
* Start from scratch
* Add and relax constraints
* Perform incremental design, improving the solution for performance and scalability.
6.10 Workflow of Map Reduce
Input -> Split -> Mapping -> Shuffle -> Reducing -> Output

Input:
{Cold Warm, Warm Hot Hotter, Hotter Hot Cold Warm}

Split:
{Cold Warm}
{Warm Hot Hotter}
{Hotter Hot Cold Warm}

Mapping:
{Cold Warm} -> {Cold, 1} {Warm, 1}
{Warm Hot Hotter} -> {Warm, 1} {Hot, 1} {Hotter, 1}
{Hotter Hot Cold Warm} -> {Hotter, 1} {Hot, 1} {Cold, 1} {Warm, 1}

Shuffle:
{Cold, 1} {Cold, 1}
{Hot, 1} {Hot, 1}
{Hotter, 1} {Hotter, 1}
{Warm, 1} {Warm, 1} {Warm, 1}

Reducing:
{Cold, 2}
{Hot, 2}
{Hotter, 2}
{Warm, 3}

Output:
Cold, 2
Warm, 3
Hot, 2
Hotter, 2
6.11 Map Reduce Characteristics
Some of the Map Reduce characteristics are as follows:
- It handles very large-scale data: petabytes, exabytes, and so on.
- It works well on Write Once, Read Many (WORM) data.
- It allows parallelism without mutexes.
- The map and reduce operations are typically performed by the same physical processor.
- The operations are provisioned near the data, i.e. data locality is preferred.
- Commodity hardware and storage are leveraged.
- The runtime takes care of splitting and moving data for operations.
6.12 Development and Libraries of Map Reduce
The native libraries give developers the finest granularity of coding. Given that all other approaches are essentially abstractions over them, the native Java API offers the least overhead and the best performance.
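// Mapper as in the standard Hadoop WordCount example (nested in the driver class; imports from java.io, java.util, org.apache.hadoop.io and org.apache.hadoop.mapreduce omitted).
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString()); // split the line into tokens
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one); // emit (word, 1)
        }
    }
}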
The two things to notice here are:
- The Mapper looks at a data set and reads it line by line. The Mapper's StringTokenizer then splits each line into words, which are emitted as key-value pairs.
- The reducer code, at the bottom of the full program, receives the key-value pairs, counts the instances of each key, and writes the results to disk.
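The reducer is only described above, not shown. A plain-Java stand-in for what it does with one key might look like the following; SumReducer is a hypothetical name and this is not the Hadoop Reducer class, which would receive Writable types and write through a Context instead of returning a value.

```java
// Plain-Java stand-in for the reduce step of word count; not the Hadoop Reducer class.
final class SumReducer {
    // Receives one word together with all the counts emitted for it and returns their total.
    static int reduce(String word, Iterable<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }
}
```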
6.13 Map Reduce Failure and Recovery
The steps involved in Map Reduce failure and recovery:
- Task processes send heartbeats to the Task Tracker.
- Task Trackers send heartbeats to the Job Tracker.
- Any task that fails to respond within 10 minutes or throws an exception is killed by the Task Tracker.
- The Task Tracker reports the failed tasks to the Job Tracker.
- The Job Tracker reschedules any failed task on a different Task Tracker.
- If a task fails 4 times (the default maximum number of attempts), the whole job fails.
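The bookkeeping behind these steps can be sketched as follows. TaskMonitor and its methods are hypothetical names, not Hadoop code; the 10-minute timeout and the limit of 4 come from the list above, and treating the fourth failure as the one that fails the job is an assumption of this sketch.

```java
import java.util.*;

// Hypothetical sketch of heartbeat tracking and rescheduling; not Hadoop code.
final class TaskMonitor {
    static final long TIMEOUT_MS = 10 * 60 * 1000L; // 10-minute heartbeat timeout
    static final int MAX_ATTEMPTS = 4;              // assumed: the 4th failure fails the job

    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final Map<String, Integer> failedAttempts = new HashMap<>();
    private boolean jobFailed = false;

    // Record that a running task reported in at the given time.
    void heartbeat(String taskId, long nowMs) {
        lastHeartbeat.put(taskId, nowMs);
    }

    // Kill tasks whose heartbeat is too old and return those to reschedule elsewhere.
    List<String> check(long nowMs) {
        List<String> toReschedule = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMs - entry.getValue() > TIMEOUT_MS) {
                String taskId = entry.getKey();
                it.remove(); // the unresponsive attempt is killed
                int failures = failedAttempts.merge(taskId, 1, Integer::sum);
                if (failures >= MAX_ATTEMPTS) {
                    jobFailed = true; // too many failures: the whole job fails
                } else {
                    toReschedule.add(taskId); // retry on a different tracker
                }
            }
        }
        return toReschedule;
    }

    boolean isJobFailed() {
        return jobFailed;
    }
}
```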
6.14 Introduction to YARN
There were 3 main components in Map Reduce version 1:
- APIs for user-level programming of Map Reduce,
- Runtime services for Map Reduce, and
- Infrastructure to monitor nodes, allocate resources, and schedule jobs.
A new component called Yet Another Resource Negotiator (YARN) was introduced in Hadoop 2.0, and resource management was moved from Map Reduce to YARN.
The Map Reduce API and framework are still handled by Map Reduce, while the YARN API and resource management are now handled by YARN.
6.15 Need for YARN
The need for YARN arose from the resource management issues in the earlier version of Map Reduce. The issues were:
- cluster utilization was very low during Map Reduce tasks,
- resources were not shared,
- non-Map Reduce applications were given the lowest priority, and
- there was only one Job Tracker per cluster.
6.16 Benefits of YARN
The benefits of YARN are as follows:
- With the introduction of YARN, resources are assigned to applications as and when needed.
- Map Reduce and non-Map Reduce applications run on the same cluster.
- The Application Master was introduced, and most of the Job Tracker responsibilities were delegated to it.
- One cluster can have many Application Masters, one per application.
6.17 The YARN Architecture
The fundamental idea of Map Reduce 2.0 is to split the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. This version also introduced the idea of a global Resource Manager and a per-application Application Master.
The Resource Manager has two main components:
- Scheduler: Responsible for allocating resources to the various applications; it does no monitoring or tracking of application status.
- Applications Manager: Accepts job submissions and negotiates the first container for executing the application-specific Application Master.
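The Scheduler's role, allocating resources on request without tracking what the application does with them, can be illustrated with a toy allocator. SimpleScheduler and its methods are hypothetical and do not reflect the YARN API; memory in MB is assumed as the only resource.

```java
import java.util.*;

// Hypothetical sketch of on-demand container allocation; not the YARN API.
final class SimpleScheduler {
    // Free memory per node, in MB, in the order the nodes were registered.
    private final Map<String, Integer> freeMemoryMb = new LinkedHashMap<>();

    void addNode(String node, int memoryMb) {
        freeMemoryMb.put(node, memoryMb);
    }

    // Grant a container on the first node with enough free memory. The scheduler
    // only allocates; it does not monitor or track the application afterwards.
    Optional<String> allocate(int memoryMb) {
        for (Map.Entry<String, Integer> node : freeMemoryMb.entrySet()) {
            if (node.getValue() >= memoryMb) {
                node.setValue(node.getValue() - memoryMb);
                return Optional.of(node.getKey());
            }
        }
        return Optional.empty(); // no capacity right now; the request has to wait
    }

    // Give a finished container's memory back to its node.
    void release(String node, int memoryMb) {
        freeMemoryMb.merge(node, memoryMb, Integer::sum);
    }
}
```

Because the allocator knows nothing about the kind of work inside a container, Map Reduce and non-Map Reduce applications can draw from the same pool.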
6.18 The YARN Architecture Illustration
A typical YARN architecture consists of:
Client >> Resource Manager >>