Friday, December 4, 2015

5 Hadoop Distributed File System

Objectives

  • Describe Hadoop Distributed File System concepts and architecture
  • Explain HDFS Storage mechanisms and HDFS Rack awareness
  • Explain HDFS Writes and Reads
  • List the important commands of HDFS
  • Describe Sqoop
  • Install and configure Sqoop

Introduction to Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a subproject of the Apache Hadoop project. It is a distributed, highly fault-tolerant file system designed to run on low-cost commodity hardware.
HDFS provides high-throughput access to application data and is suitable for applications with large data sets. Hadoop provides a distributed file system that is capable of analyzing and processing large sets of data using the MapReduce paradigm.
The design of HDFS is based on the UNIX file system, even though some standards were relaxed to some degree to improve performance for the applications.
HDFS is similar to other distributed file systems. Some of the differences are:

  • The write-once-read-many model relaxes concurrency control requirements.
  • It moves the processing logic close to the data.

Goals of HDFS
  • Detecting faults and recovering from them automatically
  • Streaming data access through MapReduce
  • Supporting a simple coherency model
  • Bringing the processing logic close to the data
  • Processing huge amounts of data
  • Redeploying the processing logic in the event of failure
HDFS Architecture
HDFS Architecture can be summarized as follows:


  • The NameNode and the Secondary NameNode services constitute the master service. The DataNode service is the slave service.
  • The master service is responsible for accepting jobs from clients and for ensuring that the data required for an operation is loaded and split into data blocks.
  • HDFS exposes a file system namespace and allows user data to be stored in files. A file is split into one or more blocks stored and replicated in DataNodes. The data blocks are then distributed to the DataNode system within the cluster. This ensures that replicas of the data are maintained.
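
The split into blocks and the replication across DataNodes can be sketched in a few lines of Python. This is a toy model, not HDFS code: the function names `split_into_blocks` and `place_replicas`, the round-robin placement, the DataNode names, and the replication factor of 3 are all assumptions made for illustration.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size):
    """Return the size in bytes of each block a file would be split into."""
    if file_size <= 0:
        return []
    full, rest = divmod(file_size, block_size)
    # Every block is full-sized except possibly the last one.
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + i) % len(datanodes)] for i in range(replication)
        ]
    return placement

# A hypothetical 200 MB file with 64 MB blocks on a four-node cluster.
blocks = split_into_blocks(200 * MB, 64 * MB)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print([b // MB for b in blocks])
print(placement)
```

In this sketch the last block is smaller than the others, and every block lands on three different DataNodes, so losing any single node still leaves two copies of each block.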
Design of HDFS
HDFS is designed to store and process very large files with streaming data access.

Situations where HDFS suits:

  • Very large files, where Hadoop uses clusters to store and process the data
                e.g. files whose size runs to 100s of the units below
                1 KB (Kilobyte) = 1024 Bytes
                1 MB (Megabyte) = 1024 KB
                1 GB (Gigabyte) = 1024 MB
                1 TB (Terabyte) = 1024 GB
                1 PB (Petabyte) = 1024 TB
                1 EB (Exabyte)  = 1024 PB
                1 ZB (Zettabyte)= 1024 EB
                1 YB (Yottabyte)= 1024 ZB
                1 SB (Shilentnobyte) = 1024 YB
                1 DB (Domegemegrottebyte) = 1024 SB
                (Shilentnobyte and Domegemegrottebyte are informal names, not standardized units.)
  • Streaming information access
    Hadoop is built around a data processing pattern in which data is written once and read many times.
  • No expensive hardware needed to run the system
    Hadoop is designed to run on clusters of commodity hardware, so the chance of a node failing is high. Nonetheless, Hadoop is able to carry on working even when there is a hardware failure.
Situations where HDFS does not suit:

    • Low-latency data access
      Applications that require access to data within a few milliseconds are not a good fit for HDFS. HDFS is optimized for delivering high throughput of data, and this may come at the cost of latency. Using HBase, which is part of the Hadoop ecosystem, would be a better choice.
    • Many small files
      Since the NameNode holds the file system metadata in memory, the limit on the number of files in a file system is governed by the amount of memory on the NameNode.
    • Multiple writers
      Files in HDFS follow the write-once, read-many pattern. Writes are always made at the end of the file. There is no support for multiple writers.
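
To see why many small files are a problem, here is a rough back-of-the-envelope estimate in Python. It assumes the widely quoted rule of thumb that each file, directory, and block costs the NameNode about 150 bytes of memory; that figure, the function name, and the file counts are assumptions for illustration, not measurements.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not a measured value

def namenode_memory_bytes(num_files, blocks_per_file=1, num_dirs=0):
    """Estimate NameNode heap used by file, block, and directory objects."""
    objects = num_files + num_files * blocks_per_file + num_dirs
    return objects * BYTES_PER_OBJECT

# Ten million small files, one block each.
small = namenode_memory_bytes(10_000_000)
# Ten thousand large files, eight blocks each.
large = namenode_memory_bytes(10_000, blocks_per_file=8)
print(small, large)
```

Under these assumptions the ten million small files need about 3 GB of NameNode memory, while the ten thousand large files need only a few megabytes.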
    HDFS Concepts

    Concept            Description
    Blocks             The unit in which data is stored; 64 MB by default, and larger block sizes can be configured
    NameNode           The master that manages the file system namespace
    DataNode           The slaves in HDFS; they store and retrieve blocks when asked by the client or the NameNode
    HDFS Federation    Because a NameNode keeps a reference to every file and block in memory, federation uses several NameNodes; each one manages a namespace volume made up of the metadata for the namespace and a block pool
    High Availability  Adds support for deploying two NameNodes in an active/standby configuration; the standby NameNode is also able to perform the checkpointing role

    Hadoop Storage Mechanism
    Storage can be evaluated mainly on three classes of metrics:

    • Cost per MB: The cost of storing data, computed per megabyte.
    • Durability: The measure of the permanence of data once it has been successfully written to the medium.
    • Performance: The two measures of storage performance are throughput and I/O operations per second (IOPS).


    Measures of Storage Performance
    Throughput

    • The maximum raw read/write rate that the storage can support
    • Typically measured in MB/s (megabytes per second)
    • The primary metric for batch processing


    I/O Operations per Second (IOPS)

    • The number is influenced by the workload and the I/O size
    • The rotational latency of spinning disks limits the maximum IOPS for a random I/O workload
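
The link between rotational latency and IOPS can be made concrete with a small calculation. The sketch below uses a simplified model in which each random I/O costs one average seek plus half a rotation; transfer time is ignored, and the 7200 RPM and 8.5 ms seek figures are assumed example values rather than data from any particular disk.

```python
def max_random_iops(rpm, avg_seek_ms):
    """Upper bound on random IOPS for a single spinning disk."""
    rotation_ms = 60_000 / rpm                    # time for one full rotation
    avg_rotational_latency_ms = rotation_ms / 2   # on average, half a rotation
    service_time_ms = avg_seek_ms + avg_rotational_latency_ms
    return 1000 / service_time_ms

def throughput_mb_per_s(iops, io_size_kb):
    """Throughput implied by an IOPS figure and an I/O size."""
    return iops * io_size_kb / 1024

iops = max_random_iops(rpm=7200, avg_seek_ms=8.5)
print(round(iops), "IOPS")
print(round(throughput_mb_per_s(iops, io_size_kb=4), 2), "MB/s at 4 KB per I/O")
```

With these inputs the disk tops out at roughly 79 random IOPS, which at a small I/O size is well under 1 MB/s. This is why batch workloads favor large sequential reads, where throughput is the metric that matters.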


    HDFS Heterogeneous Storage Architecture

    DataNodes communicate their storage state through the following types of messages, the storage report and the block report:

    Storage Report

    • Contains summary information about the state of a storage
    • Includes capacity and usage details
    • Contained in a heartbeat, which is sent once every few seconds


    Block Report

    • A detailed report of the individual block replicas on a given DataNode
    • The two types of block report are the incremental block report and the full block report
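
The two message types can be modelled roughly as follows. This is only a sketch: the class names, field names, and the set-difference approach to the incremental report are hypothetical and are not taken from the HDFS source code or wire protocol.

```python
from dataclasses import dataclass, field

@dataclass
class StorageReport:
    """Summary of one storage: capacity and usage details."""
    storage_id: str
    capacity_bytes: int
    used_bytes: int

    @property
    def remaining_bytes(self):
        return self.capacity_bytes - self.used_bytes

@dataclass
class Heartbeat:
    """Periodic message from a DataNode carrying its storage reports."""
    datanode: str
    storage_reports: list = field(default_factory=list)

def full_block_report(block_ids):
    """Full report: every block replica currently held."""
    return sorted(block_ids)

def incremental_block_report(previous, current):
    """Incremental report: (added, removed) block IDs since the last report."""
    return sorted(current - previous), sorted(previous - current)

hb = Heartbeat("dn1", [StorageReport("disk0", capacity_bytes=1000, used_bytes=400)])
print(hb.storage_reports[0].remaining_bytes)
print(incremental_block_report({1, 2, 3}, {2, 3, 4}))
```

The incremental report stays small because it only carries the changes, while the full report grows with the number of replicas on the node.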

    HDFS Storage Architecture Illustrated

    With Heterogeneous Storage, the DataNode exposes the types and usage statistics for each individual storage to the NameNode.





    HDFS Rack Awareness
    The idea behind Rack Awareness is data loss prevention and network performance.

    • Each block of data is replicated on multiple machines to guard against losing data when a machine fails.
    • Two machines in the same rack have more bandwidth and lower latency between them than two machines in two separate racks.
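
A rack-aware placement decision might look like the sketch below. It assumes the placement commonly described for a replication factor of three: the first replica on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack. The function name, the topology layout, and the deterministic "first in sorted order" choice are illustrative assumptions; a real cluster would pick among candidates rather than sort them.

```python
def choose_replica_nodes(writer, topology):
    """Pick three DataNodes for one block.

    `topology` maps a rack name to the list of node names in that rack.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer]

    # Replica 1: the writer's own node, so the first copy needs no network hop.
    first = writer

    # Replica 2: a node on a different rack, to survive a whole-rack failure.
    remote_racks = sorted(r for r in topology if r != local_rack and topology[r])
    if not remote_racks:
        raise ValueError("need at least two racks")
    remote_rack = remote_racks[0]
    second = sorted(topology[remote_rack])[0]

    # Replica 3: another node on the same remote rack, so the copy from
    # replica 2 to replica 3 stays inside one rack.
    others = sorted(n for n in topology[remote_rack] if n != second)
    if not others:
        raise ValueError("remote rack needs at least two nodes")
    third = others[0]

    return [first, second, third]

topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(choose_replica_nodes("dn1", topology))
```

The result spans two racks, which covers the data loss goal, while only one of the two replication hops crosses racks, which covers the network performance goal.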





    HDFS Writes—Example
    An online shopping portal plans to improve the quality of its products by analyzing how many customers mention the word Refund in their emails.
    File name: Email.txt
    Key Points:

    • Client consults NameNode
    • Client writes block directly to one DataNode
    • DataNodes replicate the block
    • Cycle repeats for next block
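
The four key points can be traced in a small in-memory simulation. Everything here is a toy stand-in: `ToyNameNode`, `write_file`, the dictionary used as DataNode storage, and the 64-byte block size are assumptions chosen so the example stays readable, and no real HDFS API is involved.

```python
class ToyNameNode:
    """Keeps the mapping from (file, block index) to DataNodes."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}

    def allocate_block(self, filename, index):
        n = len(self.datanodes)
        targets = [self.datanodes[(index + i) % n] for i in range(self.replication)]
        self.block_map[(filename, index)] = targets
        return targets

def write_file(namenode, storage, filename, data, block_size):
    for index, start in enumerate(range(0, len(data), block_size)):
        block = data[start:start + block_size]
        key = (filename, index)
        # 1. The client consults the NameNode for target DataNodes.
        pipeline = namenode.allocate_block(filename, index)
        # 2. The client writes the block directly to the first DataNode.
        storage[pipeline[0]][key] = block
        # 3. Each DataNode forwards the block to the next one in the pipeline.
        for prev, nxt in zip(pipeline, pipeline[1:]):
            storage[nxt][key] = storage[prev][key]
        # 4. The loop repeats for the next block.

datanodes = ["dn1", "dn2", "dn3", "dn4"]
storage = {dn: {} for dn in datanodes}
namenode = ToyNameNode(datanodes)
write_file(namenode, storage, "Email.txt", b"Refund please. " * 10, block_size=64)
print(namenode.block_map)
```

Note that the client only ever sends each block once; the remaining copies are made by the DataNodes themselves.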


    HDFS Reads
    The workflow of HDFS Reads is:

    • To retrieve a file from HDFS, the client consults the NameNode and requests the block locations of the file.
    • The client picks a DataNode from each block list and reads one block at a time with TCP on port 50010.
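
The read path can be sketched the same way. In this toy example the block locations are given as a plain dictionary, standing in for the NameNode's answer, and a DataNode that is missing from `storage` is treated as unavailable. The names, the sample data, and the "first available node" rule are assumptions for illustration only.

```python
def read_file(block_locations, storage, filename):
    """Reassemble a file by reading one block at a time."""
    data = b""
    for index in sorted(block_locations):
        candidates = block_locations[index]
        # Pick the first DataNode in the block's list that is reachable.
        datanode = next(dn for dn in candidates if dn in storage)
        data += storage[datanode][(filename, index)]
    return data

# Block locations as the NameNode might return them (hypothetical).
block_locations = {0: ["dn1", "dn2"], 1: ["dn2", "dn3"]}

# dn1 is absent from storage, which simulates a failed node.
storage = {
    "dn2": {("Email.txt", 0): b"Refund ", ("Email.txt", 1): b"request"},
    "dn3": {("Email.txt", 1): b"request"},
}

print(read_file(block_locations, storage, "Email.txt"))
```

Because every block has more than one location, the read still succeeds when one DataNode is down.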

    Important Commands of HDFS
    Some of the Hadoop shell commands to manage HDFS are:

    • To create a directory: mkdir <path of the directory>
    • To list the contents of a directory: ls <args>
    • To move a file from source to destination: mv <source> <destination>
    • To see the contents of a file: cat <path[filename]>
    • To copy a file between the local system and HDFS:
      copyFromLocal <localsrc> <destination>
      copyToLocal <source> <localdst>
    • To upload a file to HDFS: fs -put <source file> ... <destination path>
    • To download a file from HDFS: fs -get <source file> ... <destination path>

    Each command is passed as an option to hadoop fs, for example hadoop fs -ls <args>.
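
When these commands are driven from a script, it can help to build the argument list in one place. The helper below is a hypothetical convenience wrapper, not part of Hadoop: it only assembles and prints the command line, and the file names and paths are made-up examples.

```python
import shlex

# Shell actions covered in this section; extend as needed.
ALLOWED_ACTIONS = {
    "mkdir", "ls", "mv", "cat", "put", "get", "copyFromLocal", "copyToLocal",
}

def hdfs_fs_command(action, *args):
    """Build the argument list for `hadoop fs -<action> <args...>`."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return ["hadoop", "fs", f"-{action}", *args]

cmd = hdfs_fs_command("put", "Email.txt", "/user/demo/Email.txt")
print(shlex.join(cmd))
```

On a machine where Hadoop is installed, the resulting list could be handed to `subprocess.run(cmd, check=True)` to execute it.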

    Types of HDFS Commands
    All the Hadoop commands are invoked by the bin/hadoop script. Running the hadoop script without any arguments prints the description for all commands. HDFS commands are grouped into two types:

    HDFS Commands

    • User
    • Administrator
    User Commands
    Some Important User Commands are:

    • Archive: Usage: hadoop archive -archiveName NAME <src>* <dest>
    • Distcp: Usage: hadoop distcp <srcurl> <desturl>
    • FS: Usage: hadoop fs
    • FSCK: Usage: hadoop fsck <path> [-move | -delete | -openforwrite]
      [-files [-blocks [-locations | -racks]]]
    • Jar: Usage: hadoop jar <jar> [mainClass] args...
    • ClassName: Usage: hadoop CLASSNAME
    • Job: Usage: hadoop job [-submit <job-file>] | [-status <job-id>] |
      [-counter <job-id> <group-name> <counter-name>] | [-kill <job-id>] |
      [-events  <job-id> <from-event-#> <#-of-events>] | [-history [all]] |
      [-kill-task <task-id>] | [-fail-task <task-id>]
    Administrator Commands
    Some Important Administrator Commands are:
    • Balancer: Usage: hadoop balancer [-threshold <threshold>]
    • Daemon Log: Usage: hadoop daemonlog -getlevel <host:port> <name>
    • DataNode: Usage: hadoop datanode [-rollback]
    • DFSadmin: Usage: hadoop dfsadmin [-report] [-safemode enter | leave | get | wait ]
      [-refreshNodes] [-finalizeUpgrade] [-upgradeProgress status | details | force] [-metasave filename] [-setQuota <quota> <dirname>...<dirname>] [-clrQuota <dirname>...<dirname>] [-help [cmd]]
    • JobTracker: Usage: hadoop jobtracker
    • NameNode: Usage: hadoop namenode [-format] | [-upgrade] | [-rollback] | [-finalize] | [-importCheckpoint]
    • Secondary NameNode: Usage: hadoop secondarynamenode [-checkpoint [force]] | [-geteditsize]
    • TaskTracker: Usage: hadoop tasktracker

    Business Scenario
    Olivia is the EVP - IT Operations at Nutri Worldwide Inc. Her team is involved in setting up Hadoop infrastructure for the organization. After performing the steps to set up the Hadoop infrastructure, Olivia and her team decide to test the effectiveness of the HDFS infrastructure.

    The demo that follows illustrates how to set up and test HDFS. Let us proceed to the demo.

    Start the HDFS demo provided by Simplilearn
    • start-all.sh
    • jps
    • hadoop jar $HADOOP_PREFIX/hadoop-test-1.2.1.jar TestDFSIO -write -nrFiles 5 -fileSize 100
     The results show the execution time to write the files, the I/O rate, and the throughput.
    • hadoop jar $HADOOP_PREFIX/hadoop-test-1.2.1.jar TestDFSIO -read -nrFiles 5 -fileSize 100
    The results show the execution time to read the files, the I/O rate, and the throughput.
    This is how benchmarking is done.

    • sudo vi /usr/local/hadoop/conf/hdfs-site.xml
    <configuration>
    <property>
        <name>dfs.block.size</name>
        <value>134217728</value>  <!-- i.e. 128 MB -->
    </property>
    </configuration>
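
The value of dfs.block.size is given in bytes, so it is worth checking the arithmetic before editing the file. The short Python check below also shows how the block size changes the number of blocks for a file; the 1 GB file size and the helper name `block_count` are assumed for the example.

```python
MB = 1024 * 1024

def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed to hold a file (ceiling division)."""
    return -(-file_size_bytes // block_size_bytes)

block_size = 128 * MB
print(block_size)  # the value to place in dfs.block.size

one_gb = 1024 * MB
print(block_count(one_gb, 64 * MB), "blocks at 64 MB")
print(block_count(one_gb, 128 * MB), "blocks at 128 MB")
```

Doubling the block size halves the number of blocks for the same file, which in turn reduces the amount of block metadata the NameNode has to track.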


    We need to restart the Hadoop services:
    • stop-all.sh
    • start-all.sh
     Let us upload some data to test the new setting.
    • hadoop fs -copyFromLocal /home/hadoop/data/big/201201hourly.txt hdfs:/datanew/big/2012hourly.txt
     Let us check the block size through the web GUI:
    • http://192.168.21.184:50070/dfshealth.jsp
    • Click on datanew
    • Click on big
    • Observe the block size of the file 2012hourly.txt
    Thus we have successfully set the block size.

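    Decommissioning of DataNodes:
    • sudo vi /usr/local/hadoop/conf/exclude
    • 192.168.21.154
    • :wq!
     Enter the IP address of the DataNode to be decommissioned and save the file. This assumes the exclude file is the one referenced by the dfs.hosts.exclude property.
    Let us refresh the cluster.
    • hadoop dfsadmin -refreshNodes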
    Summary:

    • HDFS is a subproject of the Apache Hadoop project. It is a distributed, highly fault-tolerant file system designed to run on low-cost commodity hardware.
    • HDFS is designed to store and process very large files with streaming data access.
    • With Heterogeneous Storage, the DataNode exposes the types and usage statistics for each individual storage to the NameNode.
    • The idea behind Rack Awareness is data loss prevention and network performance.

