Wednesday, December 16, 2015

5 SQOOP

Introduction to SQOOP
Sqoopis a tool used to import and trade information between Hadoop and tagged social databases. Characteristics of Sqoopare as follows:

  • Uses SqoopImport and SqoopExport capacities.
  • Is composed in Java which gives an API called Java Database Connectivity.
  • Depends on the database to depict the composition of the information to be foreign made.
  • Uses Mapreduce to import and fare the information, which gives parallel operation.


                How SQOOP Works
                The Sqoop includes activities such as:

                • ImportAllows data imports from external data stores and enterprise data warehouses into Hadoop
                • TransferParalleled data transfer for fast performance and optimal system utilization
                • CopyCopies data quickly from external systems to Hadoop
                • Increase efficiencyMakes data analysis more efficient
                • Reduce loadMitigates excessive loads to external systems

                Prerequisite for SQOOP Installation
                The prerequisites to install Sqoop are:

                • An arrival of Hadoop must be introduced and designed.
                • Currently, Sqoop supports 4 noteworthy Hadoop discharges—0.20, 0.23, 1.0 and 2.0.
                • Hadoop 2.2.0 is also introduced and it goes well with sqoop1.4.4.
                • A Linux environment Ubuntu 12.04 is also required.

                Installing and configuring Sqoop
                There are ten steps in the process. The first four steps to install and configure sqoop 1.4.4 are:

                • Download the sqoop-1.4.4.bin_hadoop-1.0.0.tar.gz file from www.apache.org/dyn/closer.cgl/sqoop/1.4.4
                • Unzip the tar ?le: sudo tar -zxvf  sqoop-1.4.4.bin_hadoop-1.0.0.tar.gz
                • Move  sqoop-1.4.4.bin_hadoop-1.0.0 to sqoop using command
                  user@ubuntu:~$ sudo mv sqoop-1.4.4.bin_hadoop-1.0.0 /usr/local/sqoop
                • Create a directory sqoop in usr/lib using command:
                  user@ubuntu:~$ sudo mkdir /usr/lib/sqoop
                • Go to the zipped folder sqoop-1.4.4.bin_hadoop-1.0.0 and run the command:
                  user@ubuntu:~$ sudo mv ./* /usr/lib/sqoop
                • Go to root directory using cd command (example)
                  user@ubuntu:~$ sudo gedit ~/.bashrc
                • Reduce remove unwanted data
                • Add the following lines:
                  export SQOOP_HOME:iusr/lib/sqoop
                  export PATH=$PATH:$SQOOP_HOME/BIN
                • To check if the sqoop has been installed successfully type the command:
                  user@ubuntu:~$ sqoop version
                Importing Data from MySQL

                The steps to import Data from MySQL using sqoop are:

                • Download mysql-connector-java-5.1.28-bin.jar and move to /usr/lib/sqoop/lib
                • Login to mysql using command
                  user@ubuntu:~$ mysql -u root -p 
                • Login to secure shell using command
                  user@ubuntu:~$ ssh localhost
                • Start hadoop using the command
                  user@ubuntu:~$ bin/hadoop start-all.sh
                • Run the command

                Business Scenario
                Olivia is the EVP of IT Operations with Nutri Worldwide Inc. The data for the company has grwon exponentially to 500 terabytes, and the current RDBMS systems poses challenges in terms of latency. Hence, Olivia has decided to move all the data from RDBMS systems to HDFS, for which Sqoop needs to be installed.

                Demo- Install sqoop

                • Visit the site sqoop.apache.org in firefox Browser.
                • Under the heading "Download" click on "Download a release from a nearby mirror"
                • Click on the suggested mirror link on the top.
                • Under the title "Index of /sqoop" click on the folder "1.4.4/"
                • Right click and copy link to "sqoop-1.4.4.bin_hadoop-1.0.0.tar.gz"
                • Open a terminal and then type "wget " and then paste the link:
                  user@ubuntu:~$ wget http://apache.petsads.us/sqoop/1.4.4/sqoop-1.4.4.bin_hadoop-1.0.0.tar.gz
                • Untar the downloaded tar file.
                  user@ubuntu:~$ tar -xvf sqoop-1.4.4.bin_hadoop-1.0.0.tar.gz
                • Copy the folder to the /usr/local/sqoop folder
                  user@ubuntu:~$ sudo cp -r  sqoop-1.4.4.bin_hadoop-1.0.0    /usr/local/sqoop
                • Edit the bashrc to update the environment variables.
                  user@ubuntu:~$ sudo vi $HOME/.bashrc
                  export PATH=$PATH:/usr/local/sqoop
                  export HADOOP_COMMON_HOME=/usr/local/hadoop
                  export HADOOP_MAPRED_HOME=/usr/local/hadoop
                  export HIVE_HOME=/usr/local/hive
                  export HBASE_HOME=/usr/local/hbase
                  export PATH=$PATH:/usr/local/sqoop/bin
                • Type the command
                  user@ubuntu:~$ exec bash
                Install My SQL Server
                To perform the demo on hadoop you need to have a database server.
                We will install mysql server.
                • Type the command:
                  user@ubuntu:~$ sudo apt-get install mysql-server
                • One mysql server is installed type the command
                  user@ubuntu:~$ mysql -u root -p
                • To create the database sl type the command
                  mysql> create database sl;
                • To use the database sl type the command
                  mysql> use sl;
                • Create a table called authentication that has columns with headings as username and password.
                  mysql> create table authentication(usename varchar(30), password varchar(30));
                • Insert some data in the table
                  mysql> insert into authentication value('admin','12345');
                  mysql> insert into authentication value('s1001','s1001');
                • Ensure that the data is inserted into the table using the command
                  select * from authentication;
                • The next step is to download the database driver that will be used by sqoop. This can be done by visiting the site http://www.mysql.com/downloads
                • Click on the "Download from MySQL Developer Zone" link under My SQL Community Edition
                • You will be taken to "http://dev.mysql.com/downloads/" page. Click the "DOWNLOAD" link under the title "MySQL Connectors".
                • You will be taken to "http://dev.mysql.com/downloads/connector/". Click on the "Connector /J" link to download the required distribution. Select platform independent option. Click on the download. Click on "No thanks, just start my download link". A download file info pop-up box appears. Click on start download button. FTP the downloaded file in the folder /usr/local/sqoop/lib.
                • Change the ownership of the folder using the command:
                  user@ubuntu:~$ sudo chown s1000 /usr/local/sqoop
                  user@ubuntu:~$ sudo chown s1000 /usr/local/sqoop/lib
                • You have successfully installed and configured sqoop.


                Summary: 
                Sqoop is a tool to import and trade information between Hadoop and tagged social databases.











                Friday, December 4, 2015

                5 Hadoop Distributed File System

                Objectives

                • Describe Hadoop Distributed File System concepts and architecture
                • Explain HDFS Storage mechanisms and HDFS Rack awareness
                • Explain HDFS Writes and Reads
                • List the important commands of HDFS
                • Describe Sqoop
                • Install and configure Sqoop

                Introduction to Hadoop Distributed File System
                The Hadoop Distributed File System (HDFS) is the subproject of the Apache Hadoop venture. It is a distributed, extremely fault tolerant document framework intended to run on minimal effort item fittings.
                HDFS gives a high throughput access to aplication information and is suitable for applications with huge information sets. Hadoop gives a circulated document framework that is equipped for investigation and preparing of vast sets of information utilizing mapreduce ideal model.
                The concept of HDFS is based on UNIX, even through gauges were traded off to some degree to enhance execution for the applications.
                HDFS is similar to other Distributed File framework, some of the differences are:

                • Write-once-read-many model relaxes concurrency controls prerequisites.
                • Takes handling rational near the information.

                Goals of HDFS
                • Recognizing deficiencies and programmed recovery
                • Streaming information access through MapReduce
                • Upholding simple coherency model
                • Bringing the handling rationale close to information
                • Preparing immense measure of information
                • Redeploying activity of handling rationale in the events of failure.
                HDFS Architecture
                HDFS Architecture can be summarized as follows:


                • NameNode and the Secondary NameNode services constitute the master service. DataNode service is the slave service.
                • The master service is responsible for accepting a job from clients and ensures that the data required for operation will be loaded and segregated into chunks of data blocks.
                • HDFS exposes a file system namespace and allows user data to be stored in files. A file is split into one or more blocks stored and replicated in DataNodes. The data blocks are then distributed to the DataNode system within the cluster. This ensures that replicas of the data are maintained.
                Design of HDFS
                HDFS is intended to store and process extensive documents with streaming information access.

                Situations where HDFS suits:

                • Substantial files and where Hadoop uses groups to store and processes the information
                                e.g. Document that are 100s of
                                1 KB (KiloBytes)= 1024 Bytes
                                1 MB (MegaBytes)= 1024 KB
                                1 GB (Gigabyte) = 1024 MB
                                1 TB (Terabyte) = 1024 GB
                                1 PB (Petabyte) = 1024 TB
                                1 EB (Exabyte)  = 1024 PB
                                1 ZB (Zettabyte)= 1024 EB
                                1 YB (Yottabyte)= 1024 ZB
                                1 SB (Shilentnobyte) = 1024 YB
                                1 DB (Domegemegrottebyte) = 1024 SB
                                Reference
                • Streaming information access
                  As hadoop is focused around the idea of transforming information design which is composed once and read ordinarily form.
                • No extravagant equipment to run the framework
                  Hadoop is designed to run on bunches of ware equipment due to which risk of no failure is high. None the less hadoop is empowered to carry on the work regardless of the fact that there is a hardware failure.
                Situations where HDFS does not suit:

                  • Low-latency information access
                    Application that require access to information in several milliseconds reach must not be coordinated with HDFS. HDFS is converying high throughput of informaiton and this may prompt latency. Utilizing HBASE environment of Hadoop would be a superior decision.
                  • Many small documents
                    Since the NameNode holds record framework Meta information in the memory, the point of confinement for the quantity of documents in a document framework is represented by Measure of memory on the NameNode.
                  • When there are various composes
                    Records in HDFS takes after writing once and read commonly format. Composes are constantly made toward the end of the record. There is no backing for different composes.
                  HDFS Concepts

                  Concept
                  Description
                  Blocks
                  Stores the data – 64 MB by default, larger blocks
                  NameNode
                  The champion that deals with the record framework namespace
                  DataNode
                  Slaves in the HDFS framework, store and recover blocks when they are asked by the customer or NameNode
                  HDFS Federation
                  A reference to every document and block in memory, each NameNode deals with a namespace volume made up to the Meta information for the namespace and a block pool
                  High Availability
                  Includes help for sending two NameNodes in a dynamic or detached setup, checks if the inactive NameNode is fit for performing the check pointing part.

                  Hadoop Storage Mechanism
                  Storages can be mainly evaluated on three classes of performance metrics:

                  • Cost per MB: The decision of storing the data is computed based on every Mega Byte.
                  • Sturdiness: The measure of the durability of data once it has been effectively composed to the medium.
                  • Execution: The two measures of capacity execution are Throughput and IO Operations per second.


                  Measure of Capacity Execution
                  Throughput

                  • The extreme unfinished read compose rate that the storage can help
                  • Ordinarily measured in Mbps
                  • Essential metric for batch processing


                  IO Operation per second.

                  • The quantity is influenced by the workload and IO size
                  • The rotational inactivity of turning circles confines the greatest IOPS for an arbitrary IO workload


                  HDFS Storage Architecture Heterogeneous

                  Information nodes impart their capacity state through the accompanying sorts of messages, storage report and block report:

                  Storage Report

                  • Contains the outline data of the condition of a storage
                  • Includes limit and utilization points of interest
                  • Found inside a heartbeat, sent once in every few seconds


                  Block Report

                  • Also called a block report
                  • An informal report of the individual piece imitations on a given DataNode
                  • Two parts of piece reports are Incremental square report and full piece report

                  HDFS Storage Architecture Illustrated

                  With Heterogeneous Storage, the DataNodeuncovered the sorts and utilization insights for every individual storage to the NameNode.





                  HDFS Rack Awareness
                  The idea behind Rack Awarenessis data loss prevention and network performance

                  • Each square of information is recreated on different machines to keep up with the failure of losing information.
                  • Two machines in the same rack have more data transfer capacity and lower latency than two machines in two separate racks.





                  HDFS Writes—Example
                  An online shopping portal plans to improve the quality of their products by analyzing how many customers in their emails specify the word Refund.
                  File name: Email.txt
                  Key Points:

                  • Client consults NameNode
                  • Client writes block directly to one DataNode
                  • Data replicates the block
                  • Cycle repeats for next block


                  HDFS Reads
                  The workflow of HDFS Reads is:

                  • To recover a document from HDFS, the client recounselsthe NameNodeand requests the piece areas of the record.
                  • Client picks the DataNodefrom each square rundown and uses one piece at once with TCP on port 50010.

                  Important Commands of HDFS
                  Some of the Hadoop shell commands to manage HDFS are:

                  • To create directory: -cat <path[filename]>
                  • To list the contents of a directory: ls<args>
                  • To move file from source to destination: mv <source> <destination>
                  • To copy a file from/to local system from HDFS:

                  Copyfromlocal
                  copyFromLocal<localsrc>
                  Copyfromlocal
                  copyFromLocal<destination>

                  Some of the Hadoop shell commands to manage HDFS are:
                  • To see contents of a file: mkdir<path of the directory>
                  • To upload file in HDFS: fs-put <source file> ... <destination path>
                  • To download file in HDFS : fs-get <source file> ... <destination path>

                  Types of HDFS Commands
                  All the Hadoop commands are invoked by the bin/hadoopscript. Running Hadoop script without any arguments prints the description for all commands. HDFS Commands are grouped into two types:

                  HDFS Commands

                  • User
                  • Administrator
                  User Commands
                  Some Important User Commands are:

                  • Archive: Usage: hadoop archive -archiveName NAME <src>* <dest>
                  • Distcp: Usage: distcp <srcurl> <desturl>
                  • FS: Usage: hadoop fs
                  • FSCK: Usage: hadoop fsck <path> [-move | -delete | -openforwrite]
                    [-files [-blocks [-location | -racks]]]
                  • Jar: Usage: hadoop jar <jar> [mainClass] args...
                  • ClassName: Usage: hadoop CLASSNAME
                  • Job: Usage: hadoop job [-submit <job-file>] | [-status <job-id>] |
                    [-counter <job-id> <group-name> <counter-name>] | [kill <job-id>] |
                    [-events  <job-id> <from-event-#> <#-of-events>] | [-history [all]] |
                    [-kill-task <task-id>] | [-fail-task <task-id>]
                  Administrator Commands
                  Some Important Administrator Commands are:
                  • Balancer: Usage: hadoop balancer [- threshold <threshold>]
                  • Daemon Log: Usage: hadoop daemonlog -getlevel <host:port> <name>
                  • DataNode: Usage: hadoop datanode [-rollback]
                  • DFSadmin: Usage: hadoop dfsadmin [-report] [-safemode enter | leave | get | wait ]
                    [-refreshNodes] [-finalizeUpgrade] [-upgradeProgress status | details | force] [-metasave filename] [-setQuota <quota> <dirname>...<dirname>] [-clrQuota <dirname>...<dirname>] [-help [cmd]]
                  • JobTracker: Usage: hadoop jobtracker
                  • NameNode: Usage: hadoop namenode [-format] | [-upgrade] | [-rollback] | [-finalize] | [-importCheckpoint]
                  • Secondary NameNode: Usage hadoop secondarynamenode [-checkpoint [force]] | [-geteditsize]
                  • TaskTracker: Usage: hadoop tasktracker

                  Business Scenario
                  Olivia is the EVP - IT Operations at Nutri Worldwide Inc. Her team is involved in setting up Hadoop infrastructure for the organization. After performing the steps to set up the Hadoop infrastructure, Olivia and her team decides to test the effectiveness of the HDFS infrastructure.

                  The demo in the subsequent will illustrate how to setup HDFS. Let is proceed to the next screen to see the demo.

                  Start the HDFS demo provided by Simplilearn
                  • start-all.sh
                  • jps
                  • hadoop jar $HADOOP_PREFIX/hadoop-test-1.2.1.jar TestDFSIO -write nrFiles 5 -fileSize 100
                   The result show the execution time to write files and IO rate and throughput.
                  • hadoop jar $HADOOP_PREFIX/hadoop-test-1.2.1.jar TestDFSIO -read nrFiles 5 -fileSize 100
                  The result show the execution time to read files and IO rate and throughput.
                  This is how bench marking is done.

                  • sudo vi /usr/local/hadoop/conf/hdfssite.xml
                  <configuration>
                  <property>
                      <name>dfs.block.size</name>
                      <value>134217728</value>  ### i.e 128 MB
                  </property>
                  </configuration>


                  We need to refresh the hadoop services:
                  • stop-all.sh
                  • start-all.sh
                   Let us upload a data for testing the same.
                  • hadoop fs -copyFromLocal /home/hadoop/data/big/201201hourly.txt hdfs:/datanew/big/2012hourly.txt
                   Let us see the block size by accessing the gui:
                  • http://192.168.21.184:50070/dfshealth.jsp
                  • Click on datanew
                  • Click on big
                  • Observe the blocksize of the file 2012hourly.txt
                  Thus we have successfully set the blocksize.

                  Decommissioning of Data Nodes:
                  • sudo vi /usr/local/hadoop/conf/exclude
                  • 192.168.21.154
                  • :wq!
                   Enter the IP address of the data node and save the file.
                  Let us refresh the cluster.
                  • hadoop dfsadmin -refreshNodes
                  Summary:

                  • HDFS is a subproject of the Apache Hadoop venture. It is a distributed, exteremely fault-tolerant document framework intended to run on minimal effort item fittings.
                  • HDFS is intended to store and process extensive documents within streaming information access.
                  • With Heterogeneous Storage, the DataNode uncovered the sorts and utilization insights for every individual storage to the NameNode.
                  • The idea behind Rack Awareness is data loss prevention and network performance.


                  Saturday, October 17, 2015

                  2 Clustering of Hadoop Environment

                  Clustering of Hadoop Environment

                  Pre-Preparation: Install one Hadoop System on VMware Player/Workstation and clone it for second Node.

                  We will be creating two node clusters with one Namenode and two Datanode

                  Open the master VM and go to the hadoop/conf directory
                  $ cd  hadoop/conf

                  List the contents of the directory to view various configuration files.
                  $ ls
                  capacity-scheduler.xml
                  configuration.xls
                  core-site.xml
                  hadoop-env.sh
                  hadoop-metrics.properties
                  hadoop-policy.xml
                  hdfs-site.xml
                  log4j.properties
                  mapred-site.xml
                  masters
                  slaves
                  slaves.multi
                  ssl-client.xml.example
                  ssl-server.xml.example

                  Open core-site.xml for editing in vi:
                  Enter the file system's Namenode IP address with 9000 port.
                  $ vi core-site.xml
                  <?xml version="1.0"?>
                  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

                  <!-- Put site-specific property overrides in this file -->
                  <configuration>
                      <property>
                          <name>fs.default.name</name>
                          <value>hdfs://192.168.197.128:9000</value>
                      </property>
                  </configuration>

                  Save and Exit
                  :wq!

                  Open hdfs-site.xml for editing in vi:
                  Enter dfs.replication value as 2.
                  $ vi hdfs-site.xml
                  <?xml version="1.0"?>
                  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

                  <!-- Put site-specific property overrides in this file -->
                  <configuration>
                      <property>
                          <name>dfs.replication</name>
                          <value>2</value>
                      </property>
                  </configuration>

                  Save and Exit
                  :wq!

                  Open mapred-site.xml for editing in vi:
                  Enter mapred.job.tracker IP address and port 9001.
                  $ vi mapred-site.xml
                  <?xml version="1.0"?>
                  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

                  <!-- Put site-specific property overrides in this file -->
                  <configuration>
                      <property>
                          <name>mapred.job.tracker</name>
                          <value>hdfs://192.168.197.128:9001</value>
                      </property>
                  </configuration>

                  Save and Exit
                  :wq!

                  Open masters file for editing in vi:
                  Enter IP address of master.
                  $ vi masters
                  192.168.197.128

                  Save and Exit
                  :wq!

                  Open slaves file for editing in vi:
                  Enter IP address of master.
                  $ vi slaves
                  192.168.197.128
                  192.168.197.134

                  Save and Exit
                  :wq!

                  Switch to the slave terminal now:
                  Check the IP address of the slave
                  $ ifconfig
                  ...
                  inet addr:192.168.197.134 Bcast:192.168.197.255 Mask:255.255.255.0
                  ...

                  Note down and clear the screen
                  $ clear

                  Go to the configuration directory: hadoop/conf
                  $ cd hadoop/conf

                  List the contents to see various configuration files.
                  $ ls
                  capacity-scheduler.xml
                  configuration.xls
                  core-site.xml
                  hadoop-env.sh
                  hadoop-metrics.properties
                  hadoop-policy.xml
                  hdfs-site.xml
                  log4j.properties
                  mapred-site.xml
                  masters
                  slaves
                  slaves.multi
                  ssl-client.xml.example
                  ssl-server.xml.example

                  Clear the screen
                  $ clear

                  Open core-site.xml for editing in vi:
                  Enter the file system's Namenode IP address with 9000 port.
                  $ vi core-site.xml
                  <?xml version="1.0"?>
                  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

                  <!-- Put site-specific property overrides in this file -->
                  <configuration>
                      <property>
                          <name>fs.default.name</name>
                          <value>hdfs://192.168.197.128:9000</value>
                      </property>
                  </configuration>

                  Save and Exit
                  :wq!

                  Open hdfs-site.xml for editing in vi:
                  Enter dfs.replication value as 2.
                  $ vi hdfs-site.xml
                  <?xml version="1.0"?>
                  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

                  <!-- Put site-specific property overrides in this file -->
                  <configuration>
                      <property>
                          <name>dfs.replication</name>
                          <value>2</value>
                      </property>
                  </configuration>

                  Save and Exit
                  :wq!

                  Open mapred-site.xml for editing in vi:
                  Enter mapred.job.tracker IP address and port 9001.
                  $ vi mapred-site.xml
                  <?xml version="1.0"?>
                  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

                  <!-- Put site-specific property overrides in this file -->
                  <configuration>
                      <property>
                          <name>mapred.job.tracker</name>
                          <value>hdfs://192.168.197.128:9001</value>
                      </property>
                  </configuration>

                  Save and Exit
                  :wq!

                  Open masters file for editing in vi:
                  Enter IP address of master.
                  $ vi masters
                  192.168.197.128

                  Save and Exit
                  :wq!

                  Open slaves file for editing in vi:
                  Enter IP address of master.
                  $ vi slaves
                  192.168.197.134

                  Save and Exit
                  :wq!

                  Switch to masters terminal
                  Generate ssh keys for password less communication between nodes.
                  $ ssh-keygen
                  Generating public/private rsa key pair.
                  Enter file in which to save the key(/home/hadoop-user/.ssh/id_rsa):
                  /home/hadoop-user/.ssh/id_rsa already exists.
                  Overwrite (y/n)? y
                  Enter paraphrase (empty for no paraphrase):
                  Enter same paraphrase again:
                  Your identification has been saved in /home/hadoop-user/.ssh/id_rsa.
                  Your public key has been saved in /home/hadoop-user/.ssh/id_rsa.pub
                  The key finger print is: 
                  00:ba:38:a2:a9:da:33:71:37:8d:56:e3:31:7c:ea:6f hadoop-user@hadoop-desk
                  [ master@domain ]: cat ~/.ssh/id_rsa >> ~/.ssh/authorized_keys

                  Copy ssh keys from the master to the slave node.
                  [ master@domain ]: ssh-copy-id -i $HOME/.ssh/id_rsa
                  hadoop-user@192.168.197.128 


                  Now trying logging into machine, with "ssh 'hadoop-user @192.168.197.128'", and check in:

                    .ssh/authorized_keys

                  to make sure we haven't added extra keys that you weren't expecting.

                  [ slave@domain ]: ssh hadoop-user@192.168.197.128 
                  ...
                  [ master@domain ]:

                  Exit after login.
                  On Master Node start all the daemons of hadoop
                  [ master@domain ]: start-all.sh
                  ......

                  Hadoop get successfully started on master
                  On the slave node start all hadoop daemons
                  [ slave@domain ]: start-all.sh
                  ......

                  Hadoop get successfully started on slave


                  On Master list the contents of hdfs
                  [ master@domain ]: hadoop fs -ls /
                  ......

                  On Slave list the contents of hdfs
                  [ slave@domain ]: hadoop fs -ls /
                  ......

                  You will observe the same content on both master and slave.
                  Create a sample file.
                  vi  sample.txt
                  ...
                  :wq!
                  Copy file to HDFS
                  [ master@domain ]: hadoop fs -copyFromLocal sample.txt /

                  Turn safemode to off if prompted.
                  [ master@domain ]: hadoop dfsadmin -safemode leave

                  Copy file to HDFS
                  [ master@domain ]: hadoop fs -copyFromLocal sample.txt /

                  List the hdfs contents that are to be verified
                  [ master@domain ]: hadoop fs -ls /
                  ......

                  You will observe the new file. Cat the contents of the file
                  [ master@domain ]: hadoop fs -cat sample.txt
                  ......

                  Switch to Slave Node
                  List the hdfs contents that are to be verified
                  [ slave@domain ]: hadoop fs -ls /
                  ......

                  You will observe the new file. Cat the contents of the file
                  [ slave@domain ]: hadoop fs -cat sample.txt
                  ......

                  You will observe the distributed file system is working properly on the two nodes!

                  Wednesday, September 9, 2015

                  0 Hadoop Admin Course

                  Big Data and Hadoop Administrative Course

                  Lesson 00 Course Introduction

                  This lesson will give you an overview of the course, its pre-requisites and opportunities.

                  Course Objectives

                  ·         Describe Big Data and Hadoop ecosystem
                  ·         Describe advanced cluster configuration features
                  ·         Explain Hadoop Distributed File System
                  ·         Discuss MapReduce and YARN
                  ·         Discuss Hadoop administration and maintenance
                  ·         Explain important Hadoop components and ecosystem components

                  Course Overview

                  This training course provides the following:
                  ·         A detailed introduction to the basics of Apache Hadoop
                  ·         Knowledge on planning Hadoop cluster, Sqoop, MapReduce, YARN, Pig, Hive, and Impala
                  ·         An overview of the Hadoop installation and configuration, advanced cluster configuration features, HDFS, and important Hadoop components
                  ·         Knowledge on Hadoop administration and maintenance, and Hadoop ecosystem components
                  Following are the target audience of this course:
                  ·          Professionals aspiring for a career in Big Data analytics using Apache Hadoop
                  ·         Individuals who intend to design, deploy, and maintain Hadoop clusters
                  ·         System Administrators, Developers, Architects, IT professionals, Analytics professionals, and experts are also key beneficiaries
                  ·         Other aspirants and students, who wish to gain through understanding of Hadoop clusters
                  The prerequisites of the Big Data and Hadoop Administrator course are as follows:
                  ·         Fundamental knowledge of any programming language
                  ·         Good knowledge on Linux
                  ·         Fundamental programming skills
                  ·         Working knowledge of Java (not mandatory)

                  Value to Professionals

                  Hadoop professionals will be:
                  ·         Furnished with Hadoop skeleton aptitudes in the fast-developing Big Data Analytics industry.
                  ·         Prepared to drive Big Data methodology from Hadoop execution and bunch checking completely through exceptional security at huge speed and scale.
                  ·         Popular in all leading associations worldwide in the following decade.
                  ·         Leading the move from customary databases and information distribution centers to more adaptable, versatile framework based on Apache Hadoop.

                  Lessons Covered

                  Following is the list of lessons covered in this course:
                  Lesson 2 Planning Hadoop Cluster
                  Lesson 3 Hadoop Installation and Configuration
                  Lesson 4 Advanced Clusters and Configuration
                  Lesson 5 Hadoop Distributed File System
                  Lesson 6 Overview of MapReduce and YARN
                  Lesson 7 Important Hadoop Components
                  Lesson 8 Hadoop Administration and Maintenance
                  Lesson 9 Hadoop Ecosystem components

                  Demos and Lab Exercises are also included in the course.

                  1 Introduction to Big Data and Hadoop

                  Introduction to Big Data and Hadoop

                  Objectives

                  ·         Identify the need for Big Data
                  ·         Explain the concept of Big Data
                  ·         Describe the basics of Hadoop
                  ·         Explain the benefits of Hadoop

                  Introduction

                  Over 2.5 exabytes (2.5 billion gigabytes) of data is generated every day.
                  Following are some of the sources of the huge volume of data:
                  ·         A typical, large stock exchange captures more than 1 TB of data everyday
                  ·         There are around 5 billion mobile phones (including 1.75 billion smart phones) in the world
                  ·         YouTube users upload more than 48 hours of video every minute.
                  ·         Large social networks such as twitter and facebook capture more than 10 TB of data daily.
                  ·         There are more than 30 million networked sensors in the world.

                  Types of Data

                  There are three types of data:
                  ·         Structure Data: Data which is represented in a tabular format. E.g. Databases
                  ·         Semi-structured Data: Data which does not have a formal data model. E.g. XML files
                  ·         Un-structured Data: Data which does not have a predefined data model. E.g. Text files

                  Characteristics of Big Data

                  Big data has three characteristics: variety, velocity, and volume.
                  ·         Variety: Variety encompasses managing the complexity of data in many different structures, ranging from relational data to logs and raw text.
                  ·         Velocity: Velocity account from streaming of data and movement of large volume data at a high speed.
                  ·         Volume: Volume denotes the scaling of data ranging from terabytes to zettabytes.

                  Appeal of Big Data Technology

                  Big Data Technology is appealing because of the following reasons:
                  ·         It helps to manage and process a huge amount of data cost efficiently
                  ·         It analyzes data in its native form, which may be unstructured, structured, or streaming.
                  ·         It captures data from fast-happening events in real time.
                  ·         It can handle failure of isolated nodes and tasks assigned to such nodes.
                  ·         It can turn data into actionable insights.

                  Business Benefits of Big Data Technology

                  Following are the business benefits of implementing Big Data technology, with examples:
                  ·         It can help organizations to create personalized products, gain insight into products that are profitable, and retain customers by solving their problems.
                  Example: Utilizing Big Data Analytics permits banks to study the money-saving patterns and practices of individual clients.
                  ·         The Big Data analytic solutions support or automate cost cutting, bring greater efficiency of operations and the evaluation of historical trends.
                  Example: Using Big Data Analytics, banks keep track of their client’s geographical shopping locations.
                  ·         By using Big Data predictive analysis techniques, organizations can provide an early warnings of a problem and enable preventive maintenance to avoid a potential outage
                  Example: A portable computer manufacturer has the capacity to assemble and break down information utilized as part of segment assembling. This information can help the producer to focus on satisfactory levels of heat, vibration, and different variables utilized.
                  ·         Big Data offers a range of analytical techniques help organizations to develop new products and services.
                  Example: An application can analyze data to the most granular level, even to observe that the customers who bought a smart phone also bought memory cards or back covers.

                  Traditional IT Analytics Approach

                  The following are the requirements of the traditional IT analytics approach and the challenging factors:
                  Requirements:
                  ·         The business team needs to define questions before IT development.
                  ·         They need to define data sources and structures.
                  Challenging Factors:
                  ·         The requirements are iterative and volatile.
                  ·         The data sources keep changing.
                  In typical scenario of traditional IT systems development, the requirements are defined, followed by solution design and build. Once the solution is implemented, queries are executed. If there are new requirements or queries, the system is redesigned and rebuilt.
                  Define Requirements à Design Solution à Execute queries à Redesign and Rebuild for new requirements.

                  Approach for Big Data Solutions

                  Following are the requirements for using Big Data technology as a platform for discovery and exploration, and the challenges overcome by the same:
                  Requirements
                  ·         The business team needs to define data sources
                  ·         They need to establish the hypothesis
                  Challenges overcome by Big Data
                  ·         The technology should enable explorative analysis.
                  ·         Data systems and sources need to be integrated as required.
                  The steps illustrates how IT systems are built with the help if Big Data technology.
                  ·         Initial Data Sources are identified
                  ·         IT Team creates a platform for creative exploration of available data and content
                  ·         The business teams determine the questions to ask and test hypothesis
                  ·         Any new questions lead to addition of data sources and integration without the need to redesign or rebuild the platform.

                  Big Data Technology Capabilities

                  ·         Understand and navigate Big Data sources
                  ·         Manage and store huge volume of a variety of data
                  ·         Process Data in reasonable time
                  ·         Ingest data at a high speed
                  ·         Analyze unstructured data
                  ·         Bear faults and exceptions

                  Big Data Use Cases

                  The use cases of Big Data Hadoop are given below:
                  ·         Automotive: Auto sensors reporting location, problems
                  ·         Communication: Location based advertising
                  ·         Consumer Packaged Goods: Sentiment analysis of what’s hot, customer service
                  ·         Financial Services: Risk and portfolio analysis New Products
                  ·         Education and Research: Experiment sensors analysis
                  ·         High Technology/Industrial Mfg.: Mfg. quality Warranty analysis
                  ·         Life Sciences: Clinical Trials Geonomics
                  ·         Media/ Entertainment: Viewers/Advertising Effectiveness
                  ·         Online Services/Social Media: People and career matching
                  ·         Health Care: Patient sensors, monitoring, EHRs Quality of care
                  ·         Oil and Gas: Drilling exploration sensor analysis
                  ·         Retail: Consumer sentiments Optimized marketing
                  ·         Travel and Transportation: Sensors analysis for optimal traffic flows. Customer sentiments
                  ·         Utilities: Smart Meter analysis for network capability
                  ·         Law Enforcement and Defense: Threat analysis – social media monitoring, photo analysis

                  Challenges of Big Data

                  Following are the challenges that need to be addressed by Big Data Technology:
                  ·         Fault tolerance and handling the system uptime and downtime
                  o   Using commodity hardware for data storage and analysis
                  o   Maintaining a copy of the same data across clusters
                  ·         Combining data accumulated from all systems
                  o   Analyzing data across different machines
                  o   Merging of data

                  Introduction to Hadoop

                  Following are the facts related to Hadoop and why it is required:
                  What is Hadoop?
                  ·         A free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
                  ·         Based on Google File System (GFS)
                  Why Hadoop?
                  ·         Runs a number of applications on distributed system with thousands of nodes involving petabytes of data
                  ·         Has a distributed file system, called Hadoop Distributed File System or HDFS which enables fast data transfer among the nodes.
                  ·         Further it encompasses a distributed processing framework called MapReduce

                  Hadoop and Traditional RDBMS

                  Feature
                  RDBMS
                  Hadoop
                  Computing Model
                  ·      Notion of transactions
                  ·      Transaction is the unit of work
                  ·      ACID properties, Concurrency control
                  ·      Notion of jobs
                  ·      Job is the unit of work
                  ·      No concurrency control
                  Data Model
                  ·      Structured data with known schema
                  ·      Read/Write mode
                  ·      Any data will fit in any format
                  ·      (un)(semi)structured
                  ·      Read Only Mode
                  Cost Model
                  ·      Expensive Server
                  ·      Cheap commodity machines
                  Fault Tolerance
                  ·      Failures are rare
                  ·      Recovery mechanisms
                  ·      Failures are common over thousands of machines
                  ·      Simple yet efficient fault tolerance

                  History and Milestones Of Hadoop

                  Hadoop Originated from the Nutch open source project on search engines and works over distributed network nodes.
                  Period
                  Milestone
                  2003 & 2004
                  Google released two papers which provided insight into their success. The google file system or GFS and MapReduce. Simplified Data processing on large clusters. The papers told the world how Google performed large scale data processing.
                  July 2005
                  Nutch used GFS to perform MapReduce operations
                  Feb 2006
                  Nutch started a Lucene sub project which led to the era of Hadoop
                  Apr 2007
                  Yahoo started using Hadoop on a 1000-node cluster
                  Jan 2008
                  Apache took over Hadoop and made it a top-level project
                  Jul 2008
                  A 4000-node cluster with Hadoop was tested by Apache. The performance of that cluster was surprisingly the fastest when compared to the other technologies implemented that year
                  May 2009
                  Hadoop Successfully sorted a petabyte of data in 17 hours
                  Dec 2011
                  Hadoop reached version 1.0

                  Hadoop Core Services and Components.

                  Major components of Hadoop are:
                  ·         HDFS: HDFS runs on commodity machines which are low in cost and hardware. It is highly fault tolerant and efficient enough to process huge amount of data.
                  ·         NameNode: Is the brain of the system. It stores the Metadata of the data blocks along with location of data blocks. If this NameNode crashes the entire system is dead.
                  ·         Secondary NameNode: Is the replica of the Primary NameNode. This is used to ensure that even if the Primary NameNode crashes Hadoop system is not dead, but name space image on Secondary NameNode can be used to restart the system.
                  ·         DataNode: Stores the blocks of data.
                  ·         JobTracker: Schedules client jobs and creates Map or Reduce tasks and schedules them. It can run on the same machines as NameNode or different Node.
                  ·         TaskTracker: Runs on DataNodes and its primary responsibility is to run the MapReduce tasks assigned by the name node.

                  Master
                  Slave 1
                  Slave 2
                  Slave N
                  MapReduce
                  JobTracker




                  TaskTracker
                  TaskTracker
                  TaskTracker

                  TaskTracker
                  HDFS
                  NameNode




                  DataNode
                  DataNode
                  DataNode

                  DataNode

                  HDFS Architecture

                  HDFS architecture and be summarized as follows:
                  ·         The NameNode is the master and DataNode are the slaves.
                  ·         NameNode is the brain of the system, and is accessing client data. DataNode manages the storage of data.
                  ·         The data is split into files of one or more blocks.
                  ·         When a client needs a data, it first interacts with NameNode that holds the MetaData and replies back to client with location of the Data on DataNodes.
                  ·         After this, client starts interactions with DataNode, till the time data requirement is completed.

                  Organizations Using Hadoop

                  The following table shows how various organizations use Hadoop:
                  Name of the Organization
                  Cluster Specifications
                  Uses
                  A9.com: Amazon
                  Clusters vary from 1 to 100 nodes
                  ·    Amazon’s product search indices are built using this program
                  ·    Processes millions of sessions daily for analysis
                  Yahoo
                  More than 100,000 CPUs in approximately 20,000 computers running Hadoop; biggest cluster has 2000 nodes {2 * 4 cpu boxes with 4 TB disk space}
                  ·    To support research for ad systems and web search
                  AOL
                  Cluster size is 50 machines, Intel Xeon, dual processors, and dual core, each with 16GB RAM and 800 GB hard disk in total of 37 TB HDFS capacity
                  ·    For variety of functions ranging from generating data to running advanced algorithms for performing behavioral analysis and targeting
                  Facebook
                  320-machines cluster with 2,560 cores and about 1.3 PB raw storage
                  ·    Storing copies of internal logs and dimension data sources
                  ·    Used as a source for reporting analytics and machine learning

                  Summary

                  ·         Big Data relies on volume, velocity, and variety with respect to processing.
                  ·         Data can be divided into three types – unstructured data, semi-structured data, and structured data.
                  ·         Big Data technology understands and navigates big data sources, analyzes unstructured data, and ingests data at a high speed.
                  ·         Hadoop is free, Java based programming framework that supports the processing of large data sets in a distributed computing environment
                  ·         Hadoop originated from the Nutch open source project on search engines and works over distributed network nodes
                  The core services of Hadoop are NameNode, DataNode, JobTracker, TaskTracker, and Secondary NameNode.