
Introduction to Hadoop

Several tools are available for working with big data. Many of the tools are open source and Linux-based. Explore the fundamentals of Apache Hadoop, including distributed computing, design principles, HDFS, YARN, MapReduce, and Spark.
Table of Contents
1. Distributed Computing
2. Design Principles of Hadoop
3. Functional View of Hadoop
4. HDFS in Action
5. YARN in Action
6. MapReduce in Action
7. Spark in Action
Distributed Computing
Hadoop software is the foundational technology used in building a supercomputing platform. Hadoop has two parts: first, distributed data storage, and second, a parallel processing framework. In this video we will take a look at both of these in some detail. Let's begin. There is always confusion about how much storage is involved in Big Data, and the root of this is not using terminology correctly. Let me share some common terms you should learn. First, disk space. This is the available read/write space on a single disk. Normally it's measured in terabytes, for example, a 2 TB hard disk. Then there's physical space; this is the total amount of space from all the hard disks. For example, if you have 300 of those 2 TB hard disks you will have 600 TB of physical space. Then there's usable space. This is the amount of space available after accounting for the replication factor. A standard is to have a replication factor of 3, which means 600 TB of physical space yields 200 TB of usable space. And then there's temp space; this is the overhead amount of space on each disk required for swapping and I/O performance. This is frequently 10% to 15%, which means 200 TB of usable space needs 20 to 30 TB for overhead.
Heading: Terms of Storage.
A table with three columns and five rows is displayed. The columns are: Term,
Definition, and Example. Each row provides a term, then the definition, followed by
the example.
Disk Space: available read/write space on a single disk: 2 TB.
Physical Space: total amount of disk space across all disks: 600 TB.
Usable Space: space available after the replication factor (defaults to 3 copies): 200 TB.
Temp Space: overhead for performance (10% to 15%): 30 TB.
Data Space: space available for the data store: 170 TB.
And then there's data space; this is the most important term. This is the amount of space available to Hadoop for storing data. It takes into account the replication factor and the required temp space. In our example, we will have 170 to 180 TB of data space.
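To make the sizing arithmetic concrete, here is a small Python sketch that reproduces the example above; the disk count, disk size, replication factor, and overhead percentage are just the example values from the table, not Hadoop defaults.

```python
# Rough cluster-sizing arithmetic using the example values above.
# These inputs are illustrative, not Hadoop defaults or requirements.

def data_space(disk_count, disk_size_tb, replication=3, temp_overhead=0.15):
    physical = disk_count * disk_size_tb          # total raw disk space
    usable = physical / replication               # after keeping 3 copies
    temp = usable * temp_overhead                 # swap / I/O working space
    return physical, usable, temp, usable - temp  # space left for data

physical, usable, temp, data = data_space(300, 2)
print(f"Physical: {physical} TB, Usable: {usable:.0f} TB, "
      f"Temp: {temp:.0f} TB, Data: {data:.0f} TB")
# Physical: 600 TB, Usable: 200 TB, Temp: 30 TB, Data: 170 TB
```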
Now the primary components of a Hadoop cluster include the master server, the switches, the racks, and the data servers, which are also commonly called data workers, data nodes, or just nodes. The master server has a number of key responsibilities. It is responsible for managing and coordinating the entire Hadoop cluster; for monitoring the health of all the data nodes and taking corrective action when required; for mapping the location of all the data and directing all data movement; for scheduling and rescheduling all compute jobs; and finally for handling failures and errors, including the loss of a data node and the rescheduling of failed compute jobs.
Heading: Functional View of Hadoop Cluster.
A diagram depicts a Hadoop cluster.
In the diagram, the master server is connected to DC Switch A and DC Switch B.
Each of the switches is connected to two racks, which are made up of six nodes.
Heading: Responsibilities for Master Server.
The responsibilities are:
Coordinate: manages cluster of worker nodes.
Health checks: monitors the health of all worker nodes.
Data location and movement: mapping and direction for all data movement.
Scheduling: scheduling and rescheduling all jobs.
Error handling: handles all errors and failures.
Responsibilities for the data server are very simple. First, it is responsible for receiving and executing instructions from the master server. Second, the data server is responsible for performing work as assigned by the master server. All work is performed on the physical host; this is an important part of our share nothing design principle. And all the data servers must continuously report status, including health checks and ongoing job progress. We have some recommendations regarding what you need to purchase in terms of hardware for your master server. The master server is critical to availability, and money should be spent to ensure the highest level of availability. We recommend you always max out the memory, starting at 64 gigabytes. You should have minimal growth of data on the master, so do not oversize its data storage. And the master server should always be configured with the highest level of RAID disks, dual network interfaces, and dual power supplies. One very firm rule: never colocate a data node on the master server; it should always be dedicated to its primary purpose.
Heading: Responsibilities for Data Server.
The data server is responsible for processing instructions and performing processing
tasks on data located on the physical node.
Heading: Recommendation for Master Server.
Some of the recommendations are:
Max out memory, minimize physical storage, configure the highest level of RAID disks, and use dual power supplies and dual network interfaces.
Now regarding recommendations for the data server. When selecting hardware for your data nodes, we strongly suggest you follow these rules: do not add redundancy, do not add multiple network ports, do not overload the CPU or memory, and do not overload storage per node. Remember, one of the ways to keep supercomputing affordable is to scale out by adding low-cost data nodes. There are a number of great reference models provided by vendors such as Dell, IBM, and HP. These are comprehensive reference models; they include everything you should need to know to build a Hadoop cluster from bare iron, and they all come with strong support. Obviously they are not vendor neutral, but they are a solid starting point for building out your supercomputing platform. Now you have a good overview of what it takes to build a Hadoop cluster.
Design Principles of Hadoop
Let us discuss designing supercomputing platforms by first listing the problems that need to be solved. These problems have been set out in requirements for a few decades, and there have been many different solutions over the years. Common problems in designing supercomputers include: how to move large volumes of transactional data across the backplane; how to move large volumes of static data across the network; how to process large volumes of data faster and faster; how to ensure even job scheduling and fair usage of resources; how to handle errors without interrupting other jobs; and how to coordinate and optimize resources. These are all tough challenges for computer scientists. Many of the early solutions were handled at a hardware level, which significantly drove up cost. But then along came Hadoop. Hadoop is a radical redesign to meet the desire for supercomputing. The early Hadoop developers stated their design principles as: dumb hardware and smart software; share nothing; move processing, not data; embrace failure; and build applications, not infrastructure.
The design principle of dumb hardware and smart software means using commodity hardware versus purpose-built high-availability servers. The lower the cost and the simpler the hardware for the data node, the better for a Hadoop cluster. "Share nothing" is the design principle that no CPU, I/O, or data should be shared. A supercomputing platform is optimized when each data node is independent and able to manage its own work. Move processing, not data: this design principle is about moving the processing to the data using a parallel processing framework, versus moving the data from disk to a central application to be processed. Embrace failure: this is a critical construct in the development of Hadoop technology. We don't design to prevent failure; we accept it as a part of normal operations. This design principle is to ensure failure does not impact performance or result in data loss. Build applications, not infrastructure: this design principle is about creating a software framework for our supercomputer versus designing specialized and expensive hardware. It ensures we can take full advantage of the flexibility of Hadoop, as we are working at a software level, not a hardware level.
Functional View of Hadoop
Now it is time to gain an understanding of Hadoop itself. Personally, I always find myself marveling at the elegance and simplicity of Hadoop. As you gain an understanding of the solution, you too will pass through one of those moments that make you say, now why didn't I think of that? Hadoop is made universal and accessible because it is built on technology that is readily available and inexpensive. The software can be downloaded from Apache for free. Hadoop comes with high availability and fault tolerance purposely designed into the architecture. These qualities are designed in from an operational perspective, not added on to prevent failure. It really does function in environments with a large number of hardware failures, and it really does continue to run jobs until they're complete. Hadoop scales. Hadoop scales horizontally. Hadoop scales massively. We frequently see 10, 20, 30, 40, or more racks filled with data nodes for a single supercomputing platform. And I think one of the most important attributes is that Hadoop is simple for end users. The intelligence to run the distributed file system and the parallel processing framework is in the software, and the software abstracts away the complexities of distributed computing.
Now while writing a MapReduce job is nontrivial, it is actually easy for the
programmer to write massively parallel jobs across a massively distributed file
system. The programmer does not have to determine data location or size the number
of parallel compute jobs. Hadoop does it for you. Let us take a look at a functional
view of Hadoop. There are two parts of Hadoop. First there is the Hadoop distributed
file system, which we will now call HDFS. Second there is YARN. YARN stands for
Yet Another Resource Negotiator; fun name, right? Hadoop covers two layers in our Big Data stack. HDFS is the storage layer, responsible for creating a distributed repository, while YARN is the data refinery layer, the processing layer responsible for scheduling parallel compute jobs.
Heading: Functional View of Hadoop.
A diagram depicts a Hadoop Distributed File System (HDFS) and Yet Another
Resource Negotiator (YARN). In the diagram there is a master server and two data
servers.
The YARN master server runs the ResourceManager. It has a mortgage ratings compute job, which is connected to a NodeManager in one data server and an ApplicationMaster in another data server.
The HDFS master server is connected to a DataNode in one data server and a second
DataNode in the second data server.
HDFS consists of two daemons. The NameNode is the bookkeeper for HDFS. It runs on the master server. There is only one NameNode per cluster, and it is responsible for managing all the data on the cluster. The DataNode runs on all the data nodes. It is responsible for managing the local data, and it does all the reading and writing of data. Now, YARN consists of three daemons. The ResourceManager is responsible for managing and allocating all of the resources on the cluster. It resides on the master server. The NodeManager resides on every single compute node in the cluster. It manages and communicates the status of the onboard resources, such as memory, back to the ResourceManager. The ApplicationMaster is responsible for running the compute jobs. It requests resources and then ensures the successful completion of its assigned compute jobs.
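As a quick way to keep the daemon layout straight, here is a small, purely illustrative Python summary of which daemon runs where; the host names are hypothetical.

```python
# Summary of which daemon runs on which host (host names are hypothetical).
cluster_daemons = {
    "master":      ["NameNode", "ResourceManager"],   # one of each per cluster
    "data-node-1": ["DataNode", "NodeManager"],       # on every worker node
    "data-node-2": ["DataNode", "NodeManager"],
}

# The ApplicationMaster is not a fixed resident: it is launched on demand,
# in a container on some worker node, one per running compute job.
for host, daemons in cluster_daemons.items():
    print(host, "runs", ", ".join(daemons))
```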
HDFS in Action
Let me give a high-level overview of the Hadoop Distributed File System, HDFS. Let
us start with the key learning elements. The first important thing to learn about HDFS
is its purpose. HDFS abstracts away the complexity of distributing large data files
across multiple data nodes. The second important thing to learn is every data file has
multiple copies. HDFS keeps track of all the file segments and all the copies of each
file segment. This is the power of HDFS. HDFS can be imagined as an Inode table of
Inode tables living on a master server. At a local system level an Inode table is in
effect a mapping of the file name against the location of the data blocks for that file.
In HDFS the master server keeps a superior table that maps the file names against the
data blocks. But here the listing contains the names and network addresses of all of
the data nodes. It is rather a brilliant solution. HDFS works best with a single large file, and we mean a large file. A data file of 60 terabytes is easily handled. Such a large file must be split up and the segments sent to different data nodes. We call these segments data blocks. This is a key term.
Heading: Key Attributes for HDFS.
The default data block size for HDFS is typically 64 MB or 128 MB. HDFS also replicates data blocks as they are moved to different locations. The default replication factor is 3.
In HDFS, data blocks have standard sizes of 64 or 128 megabytes. They can be made even larger by changing a configuration parameter. HDFS is basically a write-once file system. The normal sequence is for small data files to be created elsewhere and then merged together into an HDFS file. These HDFS files are then ready for analytics. HDFS is not a good system for storing data that needs to be constantly modified. Our architectural principle of embracing failure leads to one of the most important attributes of HDFS: data replication. The number of copies of data files is a configuration parameter, but it is normally left at the default value of three. The system does an excellent job of distributing the data across many nodes at load time. HDFS then does an analysis and makes multiple copies of each data block. HDFS is very quick to detect the loss of access to a data block, and when this happens HDFS begins replicating another copy to keep the number of copies at the replication level. This is high availability and fault tolerance by operational design.
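To see how block size and replication interact, here is a short, illustrative Python calculation using the 60 terabyte example file, a 128 MB block size, and the default replication factor of three.

```python
# Illustrative only: how a single large file breaks down into HDFS blocks.
import math

file_size_tb = 60              # the 60 TB example file from the text
block_size_mb = 128            # a common HDFS block size
replication = 3                # default replication factor

file_size_mb = file_size_tb * 1024 * 1024
blocks = math.ceil(file_size_mb / block_size_mb)
raw_storage_tb = file_size_tb * replication

print(f"{blocks} data blocks, {raw_storage_tb} TB of raw storage "
      f"once all {replication} copies are written")
# 491520 data blocks, 180 TB of raw storage once all 3 copies are written
```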
Let us walk through a simple example of how HDFS works. Let me again introduce the two key players, the NameNode and the DataNode. The NameNode is the master of HDFS. It is a daemon that resides on the master server. It keeps all of the file system tables in memory to ensure high performance. The DataNode is the data worker. It is a daemon that resides on each data node and is responsible for managing the data blocks on that node. It keeps the NameNode apprised of status through a heartbeat message. Here we can see the NameNode has a table which lists two file names. There is a file called creditcards and a file called mortgages. These very large files are broken up into data blocks. Each data block is assigned a number. The NameNode assigns each data block to a DataNode within the cluster.
Heading: HDFS Architecture.
A diagram depicts a typical HDFS architecture with a NameNode and five
DataNodes. Each of the DataNodes has eight blocks in it. Three of the blocks in each
DataNode have a number in them, and the rest are empty. The NameNode contains
the text "/file/creditcards(data block:1,2)
/file/mortgages(data blocks: 3,4,5)".
The first DataNode contains the numbers five, three, and two, the second DataNode
contains the numbers three, one, and four, the third DataNode contains the numbers
three, five, and two, the fourth DataNode contains the numbers one, four, and two, the
fifth node contains the numbers one, five, and four.
The NameNode table contains the network address and local block number for that data. Then, using the replication factor, the NameNode assigns copies of each data block to other DataNodes. As you can see in this example, each data block has three copies. It is the responsibility of the NameNode to ensure there are always three copies of each data block. If a DataNode stops transmitting a heartbeat, the NameNode will wait a configurable amount of time and then make a new copy of its data blocks.
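To illustrate the bookkeeping the NameNode performs, here is a toy Python model of the block map from the diagram above; it is not HDFS code, just a sketch of the idea that the NameNode re-replicates blocks when a DataNode's heartbeat stops.

```python
# Toy model of the NameNode's bookkeeping: file -> blocks -> DataNodes.
# This mirrors the diagram above; it is not real HDFS code.

block_map = {
    "/file/creditcards": [1, 2],
    "/file/mortgages":   [3, 4, 5],
}

# Which DataNodes currently hold a copy of each block (from the diagram).
replicas = {
    1: {"dn2", "dn4", "dn5"},
    2: {"dn1", "dn3", "dn4"},
    3: {"dn1", "dn2", "dn3"},
    4: {"dn2", "dn4", "dn5"},
    5: {"dn1", "dn3", "dn5"},
}

def handle_lost_heartbeat(dead_node, replication=3):
    """When a DataNode stops sending heartbeats, re-replicate its blocks."""
    all_nodes = {"dn1", "dn2", "dn3", "dn4", "dn5"}
    for block, nodes in replicas.items():
        nodes.discard(dead_node)
        if len(nodes) < replication:
            # Pick any healthy node that does not already hold this block.
            candidates = all_nodes - {dead_node} - nodes
            nodes.add(sorted(candidates)[0])

handle_lost_heartbeat("dn4")
print(replicas[1])   # block 1 again has three copies, all on surviving nodes
```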
There is a lot of intelligence and a lot of configuration in how HDFS works, but this
overview gives you the primary principles. This is a fascinating subject and is really
worthy of further study.
Yarn in Action
The following are my key learning points for YARN. YARN is the parallel processing framework. It can process large amounts of data over multiple compute nodes. Writing compute jobs to run on YARN is not trivial, but once that is done YARN makes it very easy to scale. Move code to data; need I say more? Colocating the data with the compute node improves performance. We advance this idea by adding the concept of containers with Hadoop version 2.0. Containers are an abstract notion to support multi-tenancy. A container is a way to define a requirement for memory, CPU, disk, and network. By dividing the DataNode up into containers, the DataNode can now simultaneously host multiple compute jobs. What really makes YARN work is the design principle of Share Nothing. It is very intentionally designed so that no compute job has any dependency on any other compute job. Each compute job runs in its own container and does not share its assigned resources. Each compute job is only responsible for its assigned work.
Heading: Key Attributes of YARN.
Another key attribute of YARN is embrace failure.
YARN is highly fault tolerant. In many ways, it epitomizes the design principle of embracing failure. The fact that most compute jobs run in a Share Nothing manner means a compute job can fail and be restarted without any impact on the final output. YARN allows for the graceful rescheduling of failed compute jobs. Let us begin our walkthrough with a more detailed introduction of the players. The client is responsible for submitting the job, beginning by requesting an application ID from the ResourceManager. The client then must provide the location of the JAR files, command-line arguments, and any environment settings. The client can monitor job status by querying either the ResourceManager or the ApplicationMaster. The ResourceManager is first responsible for maintaining knowledge of the resource status of the entire cluster. It needs to keep a picture of the available memory, CPU, disk, and network (currently only memory is supported). It does this through ongoing communication with the NodeManagers.
Heading: YARN Architecture.
A diagram depicts typical YARN architecture. The diagram uses the flow of data in a
bank as an example.
The diagram contains a client, which requests scoring for mortgages. The ResourceManager, which is connected to the client, manages cluster resources and assigns containers. The ResourceManager is connected to the ApplicationMaster, which manages compute jobs, and to the Container, which completes assigned work. The ApplicationMaster moves code to data between itself and the Container.
The ApplicationMaster and Container are each connected to a different NodeManager.
Second, the ResourceManager is responsible for scheduling these resources by allocating containers. It does this according to the requests submitted by the client, the cluster capacity, the work queues, and the overall prioritization of resources on the cluster. The ResourceManager uses an algorithm for allocating the containers; the general principle is to start a container on the same node as the data required by the compute job. Third, the ResourceManager is responsible for managing applications by accepting job submissions from the client and by maintaining the status of the ApplicationMasters. It will launch, and if required relaunch, the ApplicationMaster. The ApplicationMaster is the actual owner of the job. It is launched by the ResourceManager as a result of a request for a compute job from a client. The ApplicationMaster communicates directly with the client. An ApplicationMaster is launched for every compute job and keeps running until the job is complete. The ApplicationMaster determines resource requirements and then negotiates for those resources with the ResourceManager. The ResourceManager interacts with the NameNode to determine data location and assigns containers as required. The ApplicationMaster is responsible for moving code to the appropriate container.
It is also responsible for job completion; it must restart failed compute jobs to ensure the overall compute job completes. The ApplicationMaster emits a heartbeat to the ResourceManager to keep it informed of job status. The NodeManager is responsible for keeping an up-to-date status of its resources and reporting these resources to the ResourceManager. Every node within the cluster must have a running NodeManager. In this example, the client submits a job to request scoring on mortgages. The job is submitted to the ResourceManager. The ResourceManager accepts the job and opens a container for the ApplicationMaster on one of the nodes of the cluster. The ApplicationMaster receives the job and then determines its need for more containers. It makes this request to the ResourceManager, which assigns containers based on data location and resource availability. The ResourceManager uses intelligence to determine which one of the replicated data blocks is best suited to launch the container. The ApplicationMaster then moves the code to the various containers and begins the compute job. The ApplicationMaster will monitor the containers assigned to the compute job and react to ensure job completion. This is a complex framework, but it ensures compute jobs can be massively parallelized across the entire Hadoop cluster.
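To summarize the flow just described, here is a purely illustrative Python sketch of locality-aware container assignment; the node names, block names, and memory figures are made up, and real YARN scheduling with queues, capacities, and priorities is far more involved.

```python
# Purely illustrative: locality-aware container assignment, in miniature.
# Real YARN scheduling (queues, capacities, priorities) is far richer.

block_locations = {            # hypothetical data blocks -> nodes holding a copy
    "mortgages-block-1": ["node1", "node3"],
    "mortgages-block-2": ["node2", "node3"],
}
free_memory_mb = {"node1": 4096, "node2": 8192, "node3": 1024}

def assign_container(block, needed_mb=2048):
    """Prefer a node that already holds the block and has enough free memory."""
    for node in block_locations[block]:
        if free_memory_mb[node] >= needed_mb:
            free_memory_mb[node] -= needed_mb
            return node                      # data-local assignment
    # Fall back to any node with capacity (data must then move over the network).
    node = max(free_memory_mb, key=free_memory_mb.get)
    free_memory_mb[node] -= needed_mb
    return node

for blk in block_locations:
    print(blk, "->", assign_container(blk))
# mortgages-block-1 -> node1
# mortgages-block-2 -> node2
```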
MapReduce in Action
Before I begin the discussion about the key attributes of MapReduce, I want to provide a historical comment. There was no YARN prior to Hadoop version 2.0. In the early releases of Hadoop, the functions of YARN and MapReduce were combined into MapReduce; resource scheduling and job management were all part of MapReduce. Please do not be confused if you're dealing with a Hadoop release prior to version 2.0. MapReduce is a heavyweight for batch processing compute jobs. As we discussed in another video, MapReduce was designed to use key-value pairs. There are two primary parts of MapReduce. The mapper is meant to filter and transform data. It completes the initial data crunch to produce intermediate output files. These output files are then shuffled across the network to the reducer for the reduce phase. The reducer aggregates data. It receives the intermediate data and consolidates it into the final output.
Heading: Key Attributes for MapReduce
The two primary parts of MapReduce are Mapper and Reducer.
Let us walk through a MapReduce process. The inputs to MapReduce are the data blocks of a large HDFS file. Frequently, part of the compute job is to preprocess this data by cleaning out dirty data and/or duplicates. This ensures the following phases of MapReduce run cleanly and at a higher performance level. During the Map phase, a mapper is started within a container on each data node where data must be processed. Each mapper will read its assigned data blocks and then process them. For example, the mapper could read in each line and count each word of each line. It will output its data into a file consisting of each word and a count of one for each occurrence. A second step may be run to further sort each file down to a list of unique words and the total count for each word. This is repeated for each mapper until all the map jobs have produced their local files. Then comes the Shuffle phase. Using the network, the files are shuffled around the cluster. If done correctly, this should be the first time data is transferred over the network.
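To make the word-count example concrete, here is a minimal mapper written in the style of Hadoop Streaming, where a mapper reads raw text lines on standard input and emits tab-separated key-value pairs; the file name mapper.py is just a conventional choice.

```python
#!/usr/bin/env python3
# mapper.py -- a minimal word-count mapper in the Hadoop Streaming style.
# Reads raw text lines on stdin and emits "word<TAB>1" for every word seen.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```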
Heading: MapReduce Architecture.
A flow diagram illustrates how data is passed and processed through the MapReduce
system.
The diagram is divided into Map input, Map phase, Shuffle phase, Reduce phase, and
Final output sections. There are two rectangular blocks named Data node 1 and Data
node 2 respectively which have their own separate content and lead on to a single
rectangular block called Data node 3 which they are both connected to.
In the Map input section of the diagram, Data node 1 contains Block 0 and Block 1.
Data node 2 contains Block 2 and Block 3.
In the Map phase section on the diagram Block 0 and Block 1 in Data node 1 are
connected to a Mapper block and Block 2 and Block 3 in Data node 2 are connected
to another Mapper block.
In the Shuffle phase section of the diagram, the Mapper block in Data node 1 is connected to a block named Data A, and the Mapper block in Data node 2 is connected to a block named Data B. In this section of the diagram, Data node 1 and Data node 2 are connected to Data node 3, where Data A and Data B appear.
In the Reduce phase section of the diagram, the Data A and Data B are connected to a
block called Reducer. In the Final output section the Reducer block is connected to a
block named Partnnnn-xx.
Then comes the Reduce phase. If the files are not already sorted by key, each file is first sorted. Then all the files are combined into a larger file, and another sort occurs on the keys. The job of a reducer is to consolidate the file into a single answer. The last phase is the creation of the output. The output is assembled by the reducer into as many files as required, and they are all stored on HDFS. I should mention, we have a unique output file naming strategy of Partnnnn-xx. This is a very high level introduction to MapReduce. It becomes extremely complex, and it takes a combination of talent and experience to write good MapReduce jobs. If you are interested in learning more about MapReduce, I recommend you start by reading the Apache web site for Hadoop.
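A matching reducer for the word-count example could look like the following Hadoop Streaming-style sketch; the framework delivers the mapper output to the reducer sorted by key, so the reducer only needs to total consecutive counts for the same word.

```python
#!/usr/bin/env python3
# reducer.py -- word-count reducer in the Hadoop Streaming style.
# Input arrives sorted by key, so equal words are adjacent on stdin.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

# Emit the final word after the input is exhausted.
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```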
Spark in Action
Spark is an open source computing framework that can process data on disk or load data into cluster memory. Spark became a top-level Apache project in February 2014. It appears to be quickly rising to the top of a lot of people's lists as their favorite software to run on Big Data. Spark has an extremely strong community, many of whom advocate it as a replacement for MapReduce. This is because Spark is fast; some say blindingly fast. Databricks, a company founded by the team at UC Berkeley that originated Spark, provides commercial support. Spark is robust and versatile. It has successfully combined a number of different functions into a single software solution. Key attributes include that Spark applications can be written in Java, Scala, and Python. This makes it easy for programmers to write in their native language. There are also interactive Scala and Python shells. Spark is built to run on top of HDFS and can use YARN to run alongside MapReduce jobs. It can read any existing Hadoop data file. It also reads from HBase, Cassandra, and many other data sources. Spark is scalable to 2,000 nodes, and it will continue to expand its ability to scale compute jobs.
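As a small illustration of the Python API mentioned above, here is a classic word count written against Spark's RDD interface; the HDFS input and output paths and the application name are hypothetical placeholders.

```python
# A classic PySpark word count over a file stored in HDFS.
# The paths and application name are placeholders for illustration.
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (sc.textFile("hdfs:///data/mortgages.txt")   # hypothetical input file
            .flatMap(lambda line: line.split())       # split lines into words
            .map(lambda word: (word, 1))              # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))         # total counts per word

counts.saveAsTextFile("hdfs:///output/word_counts")   # hypothetical output path
sc.stop()
```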
Heading: Introducing Spark.
Spark is built on top of HDFS and is able to use YARN.
Heading: Key Attributes for Spark.
Spark has high level libraries for streaming, machine learning, graph processing, and
R statistical programming.
One of the design principles of Spark was to combine SQL, streaming, and complex analytics. The software provides high-level libraries, so programmers can quickly write jobs for streaming, machine learning, graph processing, and the R statistical programming language. Spark is fast. Let me make two quick points about its speed and why this is leading to its increasing popularity. It runs against data both on disk and in memory, and both modes are demonstrably faster than other existing software. For example, in machine learning, it is clocked running a disk-based compute job ten times faster than Apache Mahout, and on a large-scale statistical analysis it is benchmarked running a hundred times faster in memory than the same job in MapReduce. At this point the market has not determined the outcome between Spark and MapReduce. Both are strong tools, and it is possible both will be in use for years to come.