Friday, January 15, 2016

6 Overview of Map Reduce and Yarn

Big Data and Hadoop Administrator

Lesson 6: Overview of Map Reduce and YARN

6.2 Objectives:

After completing this lesson you will be able to:
Describe Map Reduce Architecture and Concepts
Explain Map Reduce applications and its libraries
Describe Map Reduce failure components and recovery components
Explain YARN concepts and its architecture
Install and configure YARN
Work with YARN and YARN WebUI

6.3 Map Reduce Introduction

Map Reduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. A Map Reduce operation includes:

Specifying the computation in terms of map and reduce functions
Running the computation in parallel across large-scale clusters of machines
Handling machine failures and performance issues
Ensuring efficient communication between the nodes

6.4 Concepts of Map Reduce

Some of the key concepts of Map Reduce are:

A model for processing large amounts of data in parallel
Derived from functional programming concepts, namely the map and reduce functions
Can be implemented in languages such as Java, C++, Ruby, and Python, but not in ColdFusion.

6.5 History of Map Reduce

2004 - Map Reduce Paper

2006 - Lucene's sub-project

2008 - Apache top level project

2012 - Map Reduce 2.0/YARN

6.6 Automatic Parallel Execution in Map Reduce

An architectural diagram on how parallel execution is done in map reduce is shown below:

The principal aim of the Map Reduce model is to hide the details of parallel execution and let the user concentrate on the data-processing strategy. The model consists of two user-supplied functions, Map and Reduce. Depending on the input, the work is divided into map tasks; the diagram shows three of them. The mapper logic is applied to every input key-value pair and generates intermediate key-value pairs. These intermediate pairs are stored and then pulled together based on the key. The reducer logic is then applied to each key and its values, and the results are emitted. Consider map task 1 in detail. Its input is a list of (key1, value1) pairs. The map function is applied to these pairs to compute intermediate pairs, i.e. (key2, value2). The intermediate pairs are then grouped by key equality, i.e. (key2, list(value2)). For every key2, the reduce function works over all the values in the list and aggregates them into the result. All the map tasks follow the same pattern.

6.7 Map Reduce Framework

Map Reduce Framework takes care of distributed processing and coordination. The high level steps of Map Reduce framework are:

Scheduling:

Jobs are broken down into smaller chunks called tasks.
These tasks are scheduled by the Job Tracker and run by the Task Trackers.

Task Localization with Data:

The framework strives to place each task on a node that hosts the segment of data to be processed by that specific task.
Code is moved to where the data is.

Error Handling:

Failures are expected, so failed tasks are automatically retried on other machines.

Data Synchronization:

The shuffle and sort barrier rearranges and moves data between machines.
Input and output are coordinated by the framework.
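
The shuffle and sort step can be pictured with a small sketch. It is not Hadoop code: the function names are hypothetical, and CRC32 is used only as a stand-in for whatever stable hash a real partitioner applies. It shows the two guarantees the barrier provides: every pair with the same key reaches the same reducer, and each reducer sees its input sorted by key.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Pick a reducer for a key with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle_and_sort(intermediate_pairs, num_reducers):
    """Route (key, value) pairs to reducers, then sort each reducer's input by key."""
    buckets = defaultdict(list)
    for key, value in intermediate_pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return {reducer: sorted(items) for reducer, items in buckets.items()}

# Intermediate pairs as a few mappers might emit them.
pairs = [("Warm", 1), ("Cold", 1), ("Hot", 1), ("Warm", 1), ("Cold", 1)]
reducer_inputs = shuffle_and_sort(pairs, num_reducers=2)
for reducer, items in sorted(reducer_inputs.items()):
    print(reducer, items)
```

Which reducer a given key lands on depends on the hash, so the sketch makes no claim about the exact assignment, only that it is consistent.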

6.8 How Map and Reduce work together

Map and Reduce work together as follows:

1. The mapper passes its output to the reducer.

2. The reducer accepts the information and starts its process. Meanwhile, tasks such as sorting and combining are also performed.

3. The reduce function removes the unwanted data.

6.9 Map Reduce - Example

Count the occurrences of different words in a collection.

Suppose you are passing a large file with words like:

Cold Warm, Warm Hot Hotter, Hotter Hot Cold Warm, ...

The task is to count the occurrences of the different words in the collection:
{Cold Warm, Warm Hot Hotter, Hotter Hot Cold Warm}
To design a solution:
* Start from scratch
* Add and relax constraints
* Perform incremental design, improving the solution for performance and scalability.

6.10 Workflow of Map Reduce

Input	Split	Mapping	Shuffle	Reducing	Output
{Cold Warm, Warm Hot Hotter, Hotter Hot Cold Warm}	{Cold Warm}	{Cold, 1} {Warm, 1}	{Cold, 1} {Cold, 1}	{Cold, 2}	Cold, 2
	{Warm Hot Hotter}	{Warm, 1} {Hot, 1} {Hotter, 1}	{Hot, 1} {Hot, 1}	{Hot, 2}	Hot, 2
	{Hotter Hot Cold Warm}	{Hotter, 1} {Hot, 1} {Cold, 1} {Warm, 1}	{Hotter, 1} {Hotter, 1}	{Hotter, 2}	Hotter, 2
			{Warm, 1} {Warm, 1} {Warm, 1}	{Warm, 3}	Warm, 3
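
The workflow in the table can be traced in a few lines of plain Python. This is only an in-memory sketch of the split, map, shuffle, and reduce stages (the function names are illustrative and nothing here uses the Hadoop API), but it reproduces the counts shown in the Output column.

```python
from collections import defaultdict

def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]

def shuffle_phase(mapped_pairs):
    """Group the intermediate values by key."""
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return dict(groups)

def reduce_phase(word, counts):
    """Sum the list of counts collected for one word."""
    return word, sum(counts)

def word_count(splits):
    mapped = [pair for split in splits for pair in map_phase(split)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(word, counts) for word, counts in sorted(grouped.items()))

splits = ["Cold Warm", "Warm Hot Hotter", "Hotter Hot Cold Warm"]
result = word_count(splits)
print(result)  # {'Cold': 2, 'Hot': 2, 'Hotter': 2, 'Warm': 3}
```

In a real cluster each split would be mapped on a different node and the grouping would happen across the network; the logic per stage stays the same.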

6.11 Map Reduce Characteristics

Some of the Map Reduce characteristics are as follows:

It handles very large-scale data: petabytes, exabytes, and so on.
It works well on Write Once and Read Many (WORM) data.
It allows parallelism without mutexes.
The map reduce operations are typically performed by the same physical processor.
The operations are provisioned near the data, i.e. data locality is preferred.
Commodity hardware and storage are leveraged.
The run-time takes care of splitting and moving data for operations.

6.12 Development and Libraries of Map Reduce

The native libraries give developers the finest granularity of coding. Given that all the other approaches are essentially abstractions, the native Java API offers the least overhead and the best performance.

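import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The class name is illustrative; the loop body is the fragment from the lesson.
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Split the input line into words and emit (word, 1) for each one.
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}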

The two things to be noticed here are:

The Mapper looks at a data set and reads it line by line. The Mapper's StringTokenizer then splits each line into words, which are emitted as key-value pairs.
At the end, the reducer code (not shown here) receives the key-value pairs, counts each instance, and writes the result to disk.

6.13 Map Reduce Failure and Recovery

The steps involved in Map Reduce failure and recovery:

Task processes send heartbeats to the Task Tracker.
Task Trackers send heartbeats to the Job Tracker.
Any task that fails to respond in 10 minutes or throws an exception is killed by the Task Tracker.
The Task Tracker reports the failed tasks to the Job Tracker.
The Job Tracker reschedules any failed task on a different Task Tracker.
If a task fails 4 times (the default maximum number of attempts), the whole job fails.
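
The rescheduling logic can be sketched as a simple retry loop. This is a toy model, not Job Tracker code: the function and tracker names are hypothetical, and the limit of four attempts per task is taken as an assumption. A failed attempt is moved to the next tracker, and the job is declared failed once the attempts are used up.

```python
MAX_ATTEMPTS = 4  # assumed per-task attempt limit

def run_with_retries(task, trackers, max_attempts=MAX_ATTEMPTS):
    """Run a task, moving to another tracker after each failed attempt."""
    failures = []
    for attempt in range(max_attempts):
        tracker = trackers[attempt % len(trackers)]
        try:
            result = task(tracker)
        except Exception as exc:
            # Record the failure and reschedule on the next tracker.
            failures.append((tracker, str(exc)))
            continue
        return tracker, result, failures
    raise RuntimeError(f"job failed: task failed {len(failures)} times")

def flaky_task(tracker):
    """Hypothetical task that only fails on one bad machine."""
    if tracker == "tracker-1":
        raise IOError("disk error")
    return "done"

tracker, result, failures = run_with_retries(flaky_task, ["tracker-1", "tracker-2"])
print(tracker, result, failures)
```

Here the first attempt fails on tracker-1 and the second succeeds on tracker-2, so the job as a whole still completes.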

6.14 Introduction to YARN

Map Reduce version 1 had three main components:

APIs for user-level programming of Map Reduce,
Run-time services for Map Reduce, and
Infrastructure to monitor nodes, allocate resources, and schedule jobs.

A new component called Yet Another Resource Negotiator (YARN) was introduced in Hadoop 2.0, and resource management was moved from Map Reduce to YARN.

The Map Reduce API and framework are still handled by Map Reduce, while the YARN API and resource management are now handled by YARN.

6.15 Need for YARN

The need for YARN arose from the resource management issues in the earlier version of Map Reduce. The issues were:

cluster utilization was very low during Map Reduce tasks,
resources were not shared,
the lowest preference was given to non-Map Reduce applications, and
there was only one Job Tracker per cluster.

6.16 Benefits of YARN

The benefits of YARN are as follows:

With the introduction of YARN, resources are assigned to applications as and when needed.
MapReduce and non-MapReduce applications run on the same cluster.
The Application Master was introduced, and most of the Job Tracker responsibilities were delegated to it.
One cluster can have many Application Masters.

6.17 The YARN Architecture

The fundamental idea of Map Reduce 2.0 is to split the two major functionalities of the JobTracker, resource management and job scheduling, into separate daemons. The idea of having a global Resource Manager and a per-application Application Master was also introduced in this version.

Resource Manager

Scheduler: Is responsible for allocating resources to the various applications; it does not monitor or track application status.
Applications Manager: Accepts job submissions and negotiates the first container for executing the application-specific Application Master.

6.18 The YARN Architecture Illustration

A typical YARN architecture consists of:

Client >> Resource Manager >>

Wednesday, December 16, 2015

5 SQOOP

Introduction to SQOOP
Sqoop is a tool used to import and export data between Hadoop and structured relational databases. Characteristics of Sqoop are as follows:

Uses Sqoop import and Sqoop export functions.
Is written in Java, which provides an API called Java Database Connectivity (JDBC).
Relies on the database to describe the schema of the data to be imported.
Uses MapReduce to import and export the data, which provides parallel operation.

How SQOOP Works
Sqoop performs activities such as:

Import: Allows data imports from external data stores and enterprise data warehouses into Hadoop
Transfer: Parallelizes data transfer for fast performance and optimal system utilization
Copy: Copies data quickly from external systems to Hadoop
Increase efficiency: Makes data analysis more efficient
Reduce load: Mitigates excessive loads on external systems

Prerequisite for SQOOP Installation
The prerequisites to install Sqoop are:

A release of Hadoop must be installed and configured.
Currently, Sqoop supports 4 major Hadoop releases: 0.20, 0.23, 1.0, and 2.0.
Hadoop 2.2.0 can also be installed; it works well with Sqoop 1.4.4.
A Linux environment (Ubuntu 12.04) is also required.

Installing and configuring Sqoop
There are ten steps in the process. The first four steps to install and configure sqoop 1.4.4 are:

Download the sqoop-1.4.4.bin_hadoop-1.0.0.tar.gz file from www.apache.org/dyn/closer.cgi/sqoop/1.4.4
Extract the tar file: sudo tar -zxvf sqoop-1.4.4.bin_hadoop-1.0.0.tar.gz
Move sqoop-1.4.4.bin_hadoop-1.0.0 to sqoop using the command:
user@ubuntu:~$ sudo mv sqoop-1.4.4.bin_hadoop-1.0.0 /usr/local/sqoop
Create a directory sqoop in /usr/lib using the command:
user@ubuntu:~$ sudo mkdir /usr/lib/sqoop
Go to the extracted folder sqoop-1.4.4.bin_hadoop-1.0.0 and run the command:
user@ubuntu:~$ sudo mv ./* /usr/lib/sqoop
Go to the home directory using the cd command and open .bashrc for editing:
user@ubuntu:~$ sudo gedit ~/.bashrc
Add the following lines:
export SQOOP_HOME=/usr/lib/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
To check whether Sqoop has been installed successfully, type the command:
user@ubuntu:~$ sqoop version

Importing Data from MySQL

The steps to import Data from MySQL using sqoop are:

Download mysql-connector-java-5.1.28-bin.jar and move it to /usr/lib/sqoop/lib
Log in to MySQL using the command:
user@ubuntu:~$ mysql -u root -p
Log in to the secure shell using the command:
user@ubuntu:~$ ssh localhost
Start Hadoop using the command:
user@ubuntu:~$ bin/start-all.sh
Run the command

Business Scenario
Olivia is the EVP of IT Operations with Nutri Worldwide Inc. The company's data has grown exponentially to 500 terabytes, and the current RDBMS systems pose challenges in terms of latency. Hence, Olivia has decided to move all the data from the RDBMS systems to HDFS, for which Sqoop needs to be installed.

Demo- Install sqoop

Visit the site sqoop.apache.org in firefox Browser.
Under the heading "Download" click on "Download a release from a nearby mirror"
Click on the suggested mirror link on the top.
Under the title "Index of /sqoop" click on the folder "1.4.4/"
Right click and copy link to "sqoop-1.4.4.bin_hadoop-1.0.0.tar.gz"
Open a terminal and then type "wget " and then paste the link:
user@ubuntu:~$ wget http://apache.petsads.us/sqoop/1.4.4/sqoop-1.4.4.bin_hadoop-1.0.0.tar.gz
Untar the downloaded tar file.
user@ubuntu:~$ tar -xvf sqoop-1.4.4.bin_hadoop-1.0.0.tar.gz
Copy the folder to the /usr/local/sqoop folder
user@ubuntu:~$ sudo cp -r sqoop-1.4.4.bin_hadoop-1.0.0 /usr/local/sqoop
Edit the bashrc to update the environment variables.
user@ubuntu:~$ sudo vi $HOME/.bashrc
export PATH=$PATH:/usr/local/sqoop
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:/usr/local/sqoop/bin
Type the command
user@ubuntu:~$ exec bash

Install MySQL Server

To perform the demo on hadoop you need to have a database server.

We will install mysql server.

Type the command:
user@ubuntu:~$ sudo apt-get install mysql-server
Once the MySQL server is installed, type the command:
user@ubuntu:~$ mysql -u root -p
To create the database sl, type the command:
mysql> create database sl;
To use the database sl, type the command:
mysql> use sl;
Create a table called authentication that has columns named username and password.
mysql> create table authentication(username varchar(30), password varchar(30));
Insert some data into the table:
mysql> insert into authentication values('admin','12345');
mysql> insert into authentication values('s1001','s1001');
Ensure that the data is inserted into the table using the command:
mysql> select * from authentication;
The next step is to download the database driver that will be used by Sqoop. This can be done by visiting the site http://www.mysql.com/downloads
Click on the "Download from MySQL Developer Zone" link under MySQL Community Edition.
You will be taken to the "http://dev.mysql.com/downloads/" page. Click the "DOWNLOAD" link under the title "MySQL Connectors".
You will be taken to "http://dev.mysql.com/downloads/connector/". Click on the "Connector/J" link to download the required distribution.
Select the platform-independent option and click on the download.
Click on the "No thanks, just start my download" link. A download file info pop-up box appears.
Click on the start download button.
Transfer the downloaded file (for example, over FTP) to the folder /usr/local/sqoop/lib.
Change the ownership of the folder using the command:
user@ubuntu:~$ sudo chown s1000 /usr/local/sqoop
user@ubuntu:~$ sudo chown s1000 /usr/local/sqoop/lib
You have successfully installed and configured sqoop.
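
With the database and driver in place, an import of the authentication table could be launched with a command along the following lines. The sketch below only assembles the argument list; it does not run Sqoop. The helper name, the JDBC URL for a local MySQL server, and the target directory are assumptions made for illustration, while the flags themselves (--connect, --username, -P, --table, --target-dir, -m) are standard sqoop import options.

```python
import shlex

def sqoop_import_args(jdbc_url, username, table, target_dir, num_mappers=1):
    """Build the argument list for a basic `sqoop import` invocation."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,       # JDBC connection string
        "--username", username,
        "-P",                        # prompt for the password at run time
        "--table", table,
        "--target-dir", target_dir,  # HDFS directory for the imported files
        "-m", str(num_mappers),      # number of parallel map tasks
    ]

# Hypothetical values matching the demo database created above.
args = sqoop_import_args(
    jdbc_url="jdbc:mysql://localhost/sl",
    username="root",
    table="authentication",
    target_dir="/user/hadoop/authentication",
)
command = shlex.join(args)
print(command)
```

A single mapper is used here because the demo table has no primary key; a parallel import would need a split column to divide the rows between mappers.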

Summary:
Sqoop is a tool to import and export data between Hadoop and structured relational databases.

Friday, December 4, 2015

5 Hadoop Distributed File System

Objectives

Describe Hadoop Distributed File System concepts and architecture
Explain HDFS Storage mechanisms and HDFS Rack awareness
Explain HDFS Writes and Reads
List the important commands of HDFS
Describe Sqoop
Install and configure Sqoop

Introduction to Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a subproject of the Apache Hadoop project. It is a distributed, highly fault-tolerant file system designed to run on low-cost commodity hardware.
HDFS provides high-throughput access to application data and is suitable for applications with large data sets. Hadoop provides a distributed file system that is capable of analyzing and processing large sets of data using the MapReduce paradigm.
The concept of HDFS is based on UNIX, even though some standards were relaxed to improve performance for the applications.
HDFS is similar to other distributed file systems; some of the differences are:

Its write-once-read-many model relaxes concurrency control requirements.
It moves the processing logic close to the data.

Goals of HDFS

Detecting faults and recovering automatically
Streaming data access through MapReduce
Maintaining a simple coherency model
Bringing the processing logic close to the data
Processing huge amounts of data
Redeploying the processing logic in the event of failure.

HDFS Architecture
HDFS Architecture can be summarized as follows:

NameNode and the Secondary NameNode services constitute the master service. DataNode service is the slave service.
The master service is responsible for accepting a job from clients and ensures that the data required for operation will be loaded and segregated into chunks of data blocks.
HDFS exposes a file system namespace and allows user data to be stored in files. A file is split into one or more blocks stored and replicated in DataNodes. The data blocks are then distributed to the DataNode system within the cluster. This ensures that replicas of the data are maintained.

Design of HDFS
HDFS is designed to store and process very large files with streaming data access.

Situations where HDFS suits:

Very large files, where Hadoop uses clusters to store and process the data

e.g. files that are hundreds of megabytes or more in size. For reference:
  1 KB (Kilobyte) = 1024 Bytes
  1 MB (Megabyte) = 1024 KB
  1 GB (Gigabyte) = 1024 MB
  1 TB (Terabyte) = 1024 GB
  1 PB (Petabyte) = 1024 TB
  1 EB (Exabyte) = 1024 PB
  1 ZB (Zettabyte) = 1024 EB
  1 YB (Yottabyte) = 1024 ZB
  1 SB (Shilentnobyte) = 1024 YB (informal name, not a standard unit)
  1 DB (Domegemegrottebyte) = 1024 SB (informal name, not a standard unit)

Streaming data access
Hadoop is built around the idea of a data-processing pattern in which data is written once and read many times.
No need for expensive hardware to run the system
Hadoop is designed to run on clusters of commodity hardware, on which the risk of node failure is high. Nevertheless, Hadoop is able to carry on working even when there is a hardware failure.

Situations where HDFS does not suit:

Low-latency data access
Applications that require access to data within a few milliseconds should not be used with HDFS. HDFS is optimized for delivering high data throughput, and this may come at the expense of latency. Using HBase from the Hadoop ecosystem would be a better choice.
Many small files
Since the NameNode holds the file system metadata in memory, the limit on the number of files in a file system is governed by the amount of memory on the NameNode.
Multiple writers
Files in HDFS follow the write-once-read-many pattern. Writes are always made at the end of the file. There is no support for multiple writers.

HDFS Concepts

Concept	Description
Blocks	Units in which the data is stored; 64 MB by default, much larger than ordinary disk blocks
NameNode	The master that manages the file system namespace
DataNode	Slaves in the HDFS system; they store and retrieve blocks when asked by the client or the NameNode
HDFS Federation	Because a NameNode keeps a reference to every file and block in memory, federation adds more NameNodes; each NameNode manages a namespace volume made up of the metadata for the namespace and a block pool
High Availability	Includes support for deploying two NameNodes in an active/standby configuration; the standby NameNode is also able to perform the checkpointing role
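
To make the block concept concrete, the arithmetic below shows how a file is divided into blocks. It is a minimal sketch with a hypothetical helper name, assuming the 64 MB default from the table and that the last block holds only the remaining bytes.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes (in bytes) of the blocks a file of file_size bytes occupies."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block is only as large as the data left
    return sizes

blocks = split_into_blocks(200 * MB)
print(len(blocks), [size // MB for size in blocks])  # 4 [64, 64, 64, 8]

larger_blocks = split_into_blocks(200 * MB, block_size=128 * MB)
print(len(larger_blocks), [size // MB for size in larger_blocks])  # 2 [128, 72]
```

Fewer, larger blocks mean fewer entries for the NameNode to track, which is one reason a larger block size is often chosen for very large files.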

Hadoop Storage Mechanism
Storage can be evaluated mainly on three classes of metrics:

Cost per MB: The cost of storing the data, computed per megabyte.
Durability: The measure of the permanence of data once it has been successfully written to the medium.
Performance: The two measures of storage performance are throughput and I/O operations per second (IOPS).

Measures of Storage Performance
Throughput

The maximum raw read/write rate that the storage can support
Typically measured in Mbps
Primary metric for batch processing

I/O Operations per Second (IOPS)

The quantity is influenced by the workload and I/O size
The rotational latency of spinning disks limits the maximum IOPS for a random I/O workload

HDFS Heterogeneous Storage Architecture

DataNodes communicate their storage state through the following types of messages: the storage report and the block report.

Storage Report

Contains summary information about the state of a storage
Includes capacity and usage details
Contained within a heartbeat, sent once every few seconds

Block Report

A detailed report of the individual block replicas on a given DataNode
The two types of block reports are the incremental block report and the full block report

HDFS Storage Architecture Illustrated

With Heterogeneous Storage, the DataNode exposes the types and usage statistics for each individual storage to the NameNode.

HDFS Rack Awareness
The idea behind Rack Awareness is data loss prevention and network performance.

Each block of data is replicated on multiple machines to guard against losing data when a machine fails.
Two machines in the same rack have more bandwidth and lower latency between them than two machines in two separate racks.
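
One way to picture the network side of rack awareness is as a distance between nodes in a tree of racks. The sketch below assumes each node is described by a path such as /rack1/node1 (the paths and function names are made up for illustration) and counts the hops up to the nearest common ancestor and back down: 0 for the same node, 2 for the same rack, 4 for different racks.

```python
def network_distance(node_a, node_b):
    """Count the hops between two nodes given as '/rack/node' style paths."""
    path_a = node_a.strip("/").split("/")
    path_b = node_b.strip("/").split("/")
    common = 0
    for part_a, part_b in zip(path_a, path_b):
        if part_a != part_b:
            break
        common += 1
    # Hops up from each node to the closest shared ancestor.
    return (len(path_a) - common) + (len(path_b) - common)

def closest_replica(reader, replicas):
    """Prefer the replica that is nearest to the reader in the topology."""
    return min(replicas, key=lambda replica: network_distance(reader, replica))

reader = "/rack1/node1"
replicas = ["/rack2/node5", "/rack1/node2", "/rack2/node6"]
print(closest_replica(reader, replicas))  # /rack1/node2
```

Keeping replicas on more than one rack protects against a rack failure, while reading from the nearest replica keeps traffic inside the rack where bandwidth is highest.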

HDFS Writes—Example
An online shopping portal plans to improve the quality of their products by analyzing how many customers in their emails specify the word Refund.
File name: Email.txt
Key Points:

Client consults NameNode
Client writes block directly to one DataNode
DataNodes replicate the block
Cycle repeats for next block

HDFS Reads
The workflow of HDFS Reads is:

To retrieve a file from HDFS, the client consults the NameNode and requests the block locations of the file.
The client picks a DataNode from each block list and reads one block at a time over TCP on port 50010.

Important Commands of HDFS
Some of the Hadoop shell commands to manage HDFS are:

To create a directory: -mkdir <path of the directory>
To list the contents of a directory: -ls <args>
To move a file from source to destination: -mv <source> <destination>
To copy a file between the local system and HDFS:

From the local system to HDFS: -copyFromLocal <localsrc> <destination>
From HDFS to the local system: -copyToLocal <source> <localdst>

To see the contents of a file: -cat <path[filename]>
To upload a file to HDFS: fs -put <source file> ... <destination path>
To download a file from HDFS: fs -get <source file> ... <destination path>

Types of HDFS Commands
All the Hadoop commands are invoked by the bin/hadoop script. Running the Hadoop script without any arguments prints the description for all commands. HDFS commands are grouped into two types:

HDFS Commands

User
Administrator

User Commands
Some Important User Commands are:

Archive: Usage: hadoop archive -archiveName NAME <src>* <dest>
Distcp: Usage: hadoop distcp <srcurl> <desturl>
FS: Usage: hadoop fs
FSCK: Usage: hadoop fsck <path> [-move | -delete | -openforwrite]
[-files [-blocks [-locations | -racks]]]
Jar: Usage: hadoop jar <jar> [mainClass] args...
ClassName: Usage: hadoop CLASSNAME
Job: Usage: hadoop job [-submit <job-file>] | [-status <job-id>] |
[-counter <job-id> <group-name> <counter-name>] | [-kill <job-id>] |
[-events <job-id> <from-event-#> <#-of-events>] | [-history [all]] |
[-kill-task <task-id>] | [-fail-task <task-id>]

Administrator Commands

Some Important Administrator Commands are:

Balancer: Usage: hadoop balancer [-threshold <threshold>]
Daemon Log: Usage: hadoop daemonlog -getlevel <host:port> <name>
DataNode: Usage: hadoop datanode [-rollback]
DFSadmin: Usage: hadoop dfsadmin [-report] [-safemode enter | leave | get | wait ]
[-refreshNodes] [-finalizeUpgrade] [-upgradeProgress status | details | force] [-metasave filename] [-setQuota <quota> <dirname>...<dirname>] [-clrQuota <dirname>...<dirname>] [-help [cmd]]
JobTracker: Usage: hadoop jobtracker
NameNode: Usage: hadoop namenode [-format] | [-upgrade] | [-rollback] | [-finalize] | [-importCheckpoint]
Secondary NameNode: Usage: hadoop secondarynamenode [-checkpoint [force]] | [-geteditsize]
TaskTracker: Usage: hadoop tasktracker

Business Scenario
Olivia is the EVP - IT Operations at Nutri Worldwide Inc. Her team is involved in setting up Hadoop infrastructure for the organization. After performing the steps to set up the Hadoop infrastructure, Olivia and her team decide to test the effectiveness of the HDFS infrastructure.

The following demo illustrates how to set up HDFS.

Start the HDFS demo provided by Simplilearn

start-all.sh
jps
hadoop jar $HADOOP_PREFIX/hadoop-test-1.2.1.jar TestDFSIO -write -nrFiles 5 -fileSize 100

The results show the execution time to write the files, the IO rate, and the throughput.

hadoop jar $HADOOP_PREFIX/hadoop-test-1.2.1.jar TestDFSIO -read -nrFiles 5 -fileSize 100

The results show the execution time to read the files, the IO rate, and the throughput.
This is how benchmarking is done.

sudo vi /usr/local/hadoop/conf/hdfs-site.xml

<configuration>
<property>
<name>dfs.block.size</name>
<value>134217728</value> <!-- i.e. 128 MB -->
</property>
</configuration>
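
As a quick sanity check on the value above, the property can be read back and converted to megabytes. This is a small sketch using Python's standard XML parser on an inline copy of the configuration; the helper name is hypothetical and the snippet does not touch the real file under /usr/local/hadoop/conf.

```python
import xml.etree.ElementTree as ET

# Inline copy of the configuration shown above (an assumption for the sketch).
HDFS_SITE = """<configuration>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
</configuration>"""

def read_property(xml_text, name):
    """Return the value of a named Hadoop property, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

block_size = int(read_property(HDFS_SITE, "dfs.block.size"))
print(block_size // (1024 * 1024), "MB")  # 128 MB
```

The value is given in bytes, so 128 MB is written as 128 x 1024 x 1024 = 134217728.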

We need to refresh the hadoop services:

stop-all.sh
start-all.sh

Let us upload some data to test the new setting.

hadoop fs -copyFromLocal /home/hadoop/data/big/201201hourly.txt hdfs:/datanew/big/2012hourly.txt

Let us see the block size by accessing the gui:

http://192.168.21.184:50070/dfshealth.jsp
Click on datanew
Click on big
Observe the blocksize of the file 2012hourly.txt

Thus we have successfully set the blocksize.

Decommissioning of Data Nodes:

sudo vi /usr/local/hadoop/conf/exclude
192.168.21.154
:wq!

Enter the IP address of the data node and save the file.

Let us refresh the cluster.

hadoop dfsadmin -refreshNodes

Summary:

HDFS is a subproject of the Apache Hadoop project. It is a distributed, highly fault-tolerant file system designed to run on low-cost commodity hardware.
HDFS is designed to store and process very large files with streaming data access.
With Heterogeneous Storage, the DataNode exposes the types and usage statistics for each individual storage to the NameNode.
The idea behind Rack Awareness is data loss prevention and network performance.

Saturday, October 17, 2015

2 Clustering of Hadoop Environment

Clustering of Hadoop Environment

Pre-Preparation: Install one Hadoop System on VMware Player/Workstation and clone it for second Node.

We will create a two-node cluster with one NameNode and two DataNodes.

Open the master VM and go to the hadoop/conf directory
$ cd hadoop/conf

List the contents of the directory to view various configuration files.
$ ls
capacity-scheduler.xml
configuration.xsl
core-site.xml
hadoop-env.sh
hadoop-metrics.properties
hadoop-policy.xml
hdfs-site.xml
log4j.properties
mapred-site.xml
masters
slaves
slaves.multi
ssl-client.xml.example
ssl-server.xml.example

Open core-site.xml for editing in vi:
Enter the file system's Namenode IP address with 9000 port.

$ vi core-site.xml

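<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>fs.default.name</name>
<!-- Replace NAMENODE_IP with the IP address of the NameNode. -->
<value>hdfs://NAMENODE_IP:9000</value>
</property>
</configuration>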

Save and Exit

:wq!

Open hdfs-site.xml for editing in vi:
Enter dfs.replication value as 2.
$ vi hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>

Save and Exit
:wq!

Open mapred-site.xml for editing in vi:
Enter mapred.job.tracker IP address and port 9001.
$ vi mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


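<configuration>
<property>
<name>mapred.job.tracker</name>
<!-- Replace JOBTRACKER_IP with the IP address of the Job Tracker node. -->
<value>JOBTRACKER_IP:9001</value>
</property>
</configuration>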

Save and Exit
:wq!

Open masters file for editing in vi:
Enter IP address of master.
$ vi masters
192.168.197.128

Save and Exit
:wq!

Open slaves file for editing in vi:
Enter the IP addresses of the master and the slave.
$ vi slaves
192.168.197.128

192.168.197.134

Save and Exit
:wq!

Switch to the slave terminal now:
Check the IP address of the slave
$ ifconfig
...
inet addr:192.168.197.134 Bcast:192.168.197.255 Mask:255.255.255.0
...

Note down and clear the screen
$ clear

Go to the configuration directory: hadoop/conf
$ cd hadoop/conf

List the contents to see various configuration files.
$ ls
capacity-scheduler.xml
configuration.xsl
core-site.xml
hadoop-env.sh
hadoop-metrics.properties
hadoop-policy.xml
hdfs-site.xml
log4j.properties
mapred-site.xml
masters
slaves
slaves.multi
ssl-client.xml.example
ssl-server.xml.example

Clear the screen
$ clear

Open core-site.xml for editing in vi:
Enter the file system's Namenode IP address with 9000 port.

$ vi core-site.xml

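<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>fs.default.name</name>
<!-- Replace NAMENODE_IP with the IP address of the NameNode. -->
<value>hdfs://NAMENODE_IP:9000</value>
</property>
</configuration>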

Save and Exit

:wq!

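Open mapred-site.xml for editing in vi:
Enter mapred.job.tracker IP address and port 9001.
$ vi mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>mapred.job.tracker</name>
<!-- Replace JOBTRACKER_IP with the IP address of the Job Tracker node. -->
<value>JOBTRACKER_IP:9001</value>
</property>
</configuration>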

Save and Exit
:wq!

Open masters file for editing in vi:
Enter IP address of master.
$ vi masters
192.168.197.128

Save and Exit
:wq!

Open slaves file for editing in vi:
Enter IP address of slave.
$ vi slaves
192.168.197.134

Save and Exit
:wq!

Switch to masters terminal
Generate ssh keys for passwordless communication between nodes.
$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop-user/.ssh/id_rsa):
/home/hadoop-user/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop-user/.ssh/id_rsa.
Your public key has been saved in /home/hadoop-user/.ssh/id_rsa.pub.
The key fingerprint is:

00:ba:38:a2:a9:da:33:71:37:8d:56:e3:31:7c:ea:6f hadoop-user@hadoop-desk
[ master@domain ]: cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Copy ssh keys from the master to the slave node.
[ master@domain ]: ssh-copy-id -i $HOME/.ssh/id_rsa hadoop-user@192.168.197.128

Now try logging into the machine, with "ssh 'hadoop-user@192.168.197.128'", and check in:

.ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.

[ slave@domain ]: ssh hadoop-user@192.168.197.128
...
[ master@domain ]:

Exit after login.
On Master Node start all the daemons of hadoop

[ master@domain ]: start-all.sh
......

Hadoop starts successfully on the master.
On the slave node, start all the Hadoop daemons
[ slave@domain ]: start-all.sh
......

Hadoop starts successfully on the slave.

On Master list the contents of hdfs

[ master@domain ]: hadoop fs -ls /
......

On Slave list the contents of hdfs
[ slave@domain ]: hadoop fs -ls /
......

You will observe the same content on both master and slave.
Create a sample file.
vi sample.txt
...
:wq!

Copy file to HDFS

[ master@domain ]: hadoop fs -copyFromLocal sample.txt /

Turn safemode to off if prompted.

[ master@domain ]: hadoop dfsadmin -safemode leave

Copy the file to HDFS again

[ master@domain ]: hadoop fs -copyFromLocal sample.txt /

List the hdfs contents that are to be verified

[ master@domain ]: hadoop fs -ls /
......

You will observe the new file. Cat the contents of the file

[ master@domain ]: hadoop fs -cat /sample.txt
......

Switch to Slave Node
List the hdfs contents that are to be verified

[ slave@domain ]: hadoop fs -ls /
......

You will observe the new file. Cat the contents of the file

[ slave@domain ]: hadoop fs -cat /sample.txt
......

You will observe the distributed file system is working properly on the two nodes!

Wednesday, September 9, 2015

0 Hadoop Admin Course

Big Data and Hadoop Administrative Course

Lesson 00 Course Introduction

This lesson will give you an overview of the course, its pre-requisites and opportunities.

Course Objectives

· Describe Big Data and Hadoop ecosystem

· Describe advanced cluster configuration features

· Explain Hadoop Distributed File System

· Discuss MapReduce and YARN

· Discuss Hadoop administration and maintenance

· Explain important Hadoop components and ecosystem components

Course Overview

This training course provides the following:

· A detailed introduction to the basics of Apache Hadoop

· Knowledge on planning Hadoop cluster, Sqoop, MapReduce, YARN, Pig, Hive, and Impala

· An overview of the Hadoop installation and configuration, advanced cluster configuration features, HDFS, and important Hadoop components

· Knowledge on Hadoop administration and maintenance, and Hadoop ecosystem components

Following are the target audience of this course:

· Professionals aspiring for a career in Big Data analytics using Apache Hadoop

· Individuals who intend to design, deploy, and maintain Hadoop clusters

· System Administrators, Developers, Architects, IT professionals, Analytics professionals, and experts are also key beneficiaries

· Other aspirants and students who wish to gain a thorough understanding of Hadoop clusters

The prerequisites of the Big Data and Hadoop Administrator course are as follows:

· Fundamental knowledge of any programming language

· Good knowledge on Linux

· Fundamental programming skills

· Working knowledge of Java (not mandatory)

Value to Professionals

Hadoop professionals will be:

· Equipped with Hadoop framework skills in the fast-growing Big Data analytics industry.

· Prepared to drive Big Data strategy, from Hadoop implementation and cluster monitoring through to advanced security, at high speed and scale.

· In demand in all leading organizations worldwide in the coming decade.

· Leading the move from traditional databases and data warehouses to more flexible, scalable infrastructure based on Apache Hadoop.

Lessons Covered

Following is the list of lessons covered in this course:

Lesson 1 Introduction to Big Data and Hadoop

Lesson 2 Planning Hadoop Cluster

Lesson 3 Hadoop Installation and Configuration

Lesson 4 Advanced Clusters and Configuration

Lesson 5 Hadoop Distributed File System

Lesson 6 Overview of MapReduce and YARN

Lesson 7 Important Hadoop Components

Lesson 8 Hadoop Administration and Maintenance

Lesson 9 Hadoop Ecosystem components

Demos and Lab Exercises are also included in the course.