map - salsahpc - Indiana University

advertisement
Hadoop, Hadoop, Hadoop!!!
Jerome Mitchell
Indiana University
Outline
•
•
•
•
•
•
•
BIG DATA
Hadoop MapReduce
The Hadoop Distributed File System (HDFS)
Workflow
Conclusions
References
HandsOn
LOTS OF DATA
EVERYWHERE
Why Should You Care? Even Grocery
Stores Care
What is Hadoop?
• At Google MapReduce operation are run on a special
file system called Google File System (GFS) that is
highly optimized for this purpose.
• GFS is not open source.
• Doug Cutting and Yahoo! reverse engineered the GFS
and called it Hadoop Distributed File System (HDFS).
• The software framework that supports HDFS,
MapReduce and other related entities is called the
project Hadoop or simply Hadoop.
• This is open source and distributed by Apache.
What exactly is Hadoop
• A growing collection of subprojects
Motivation for MapReduce
• Large-Scale Data Processing
– Want to use 1000s of CPUs
• But, don’t want the hassle of managing things
• MapReduce Architecture provides
– Automatic Parallelization and Distribution
– Fault Tolerance
– I/O Scheduling
– Monitoring and Status Updates
MapReduce Model
• Input & Output: a set of key/value pairs
• Two primitive operations
– map: (k1,v1)
 list(k2,v2)
– reduce: (k2,list(v2))  list(k3,v3)
• Each map operation processes one input key/value pair and
produces a set of key/value pairs
• Each reduce operation
– Merges all intermediate values (produced by map ops) for a particular
key
– Produce final key/value pairs
• Operations are organized into tasks
– Map tasks: apply map operation to a set of key/value pairs
– Reduce tasks: apply reduce operation to intermediate key/value pairs
– Each MapReduce job comprises a set of map and reduce (optional)
tasks.
9
HDFS Architecture
The WorkFlow
•
•
•
•
Load data into the Cluster (HDFS writes)
Analyze the data (MapReduce)
Store results in the Cluster (HDFS)
Read the results from the Cluster (HDFS reads)
Hadoop Server Roles
Client
Distributed Data Storage
Distributed Data Processing
HDFS
MapReduce
Name Node
Job Tracker
Secondary Name
Node
Data Nodes &
Task Tracker
Data Nodes &
Task Tracker
Data Nodes &
Task Tracker
Data Nodes &
Task Tracker
Data Nodes &
Task Tracker
Data Nodes &
Task Tracker
Master
Slaves
Hadoop Cluster
Switch
Switch
DN + TT
DN + TT
DN + TT
DN + TT
Rack 1
Switch
DN + TT
DN + TT
DN + TT
DN + TT
Rack 2
Switch
Switch
Switch
DN + TT
DN + TT
DN + TT
DN + TT
DN + TT
DN + TT
DN + TT
DN + TT
DN + TT
Rack 3
Rack N
Hadoop Rack Awareness – Why?
Name Node
metatadata
Switch
Rack Aware
Switch
Switch
Switch
Data Node 1
Data Node 5
Data Node 5
Data Node 2
Data Node 6
Data Node 10
Data Node 3
Data Node 5
Rack 1
Data Node 7
Data Node 8
Rack 5
Data Node 11
Data Node 12
Rack 9
Rack 1:
Data Node
1
Data Node
2
Data Node
3
Rack 1:
Data Node
5
Data Node
Results.txt =
BLK A:
DN1, DN5, DN6
BLK B:
DN7, DN1, DN2
BLK C
DN5, DN8, DN9
Sample Scenario
How many times did our customers type the word refund
into emails sent to Indiana University?
Huge file containing all emails sent
to Indiana University…
File.txt
Writing Files to HDFS
File.txt
BLK A
BLK B
NameNode
BLK C
Client
Data Node 1
Data Node 5
Data Node 6
BLK A
BLK B
BLK C
Data Node N
Data Node Reading Files From HDFS
Name Node
Switch
Switch
Switch
Switch
Data Node 1
Data Node 1
Data Node 1
Data Node 2
Data Node 2
Data Node 2
Data Node
Data Node
Rack 1
Data Node
Data Node
Rack 5
Data Node
Data Node
Rack 9
Rack Aware
metatadata
Rack 1:
Data Node 1
Data Node 2
Data Node 3
Results.txt =
BLK A:
DN1, DN5, DN6
Rack 1:
Data Node 5
BLK B:
DN7, DN1, DN2
BLK C
DN5, DN8, DN9
MapReduce: Three Phases
1. Map
2. Sort
3. Reduce
Data Processing: Map
Job Tracker
Name Node
Map Task
Map Task
BLK A
Data Node 1
BLK B
Data Node 5
File.txt
Map Task
BLK C
Data Node 9
MapReduce: The Map Step
Input
key-value pairs
Intermediate
key-value pairs
k
v
k
v
k
v
map
k
v
k
v
…
k
map
…
v
k
v
Data Processing: Reduce
Results.txt
HDFS
Reduce Task
Map Task
Map Task
BLK A
BLK B
Map Task
BLK C
MapReduce: The Reduce Step
Intermediate
key-value pairs
Output
key-value pairs
Key-value groups
reduce
k
v
k
v
v
v
k
v
k
v
reduce
k
v
k
v
group
k
v
…
…
k
v
v
k
…
v
k
v
Clients Reading Files from HDFS
NameNode
Client
metatadata
Results.txt =
BLK A:
DN1, DN5, DN6
Switch
Switch
Switch
Data Node 1
Data Node 1
Data Node 1
Data Node 2
Data Node 2
Data Node 2
Data Node
Data Node
Rack 1
Data Node
Data Node
Rack 5
Data Node
Data Node
Rack 9
BLK B:
DN7, DN1, DN2
BLK C
DN5, DN8, DN9
Conclusions
• We introduced MapReduce programming
model for processing large scale data
• We discussed the supporting Hadoop
Distributed File System
• The concepts were illustrated using a simple
example
• We reviewed some important parts of the
source code for the example.
References
1. Apache Hadoop Tutorial: http://hadoop.apache.org
http://hadoop.apache.org/core/docs/current/mapred
_tutorial.html
2. Dean, J. and Ghemawat, S. 2008. MapReduce:
simplified data processing on large clusters.
Communication of ACM 51, 1 (Jan. 2008), 107-113.
3. Cloudera Videos by Aaron Kimball:
http://www.cloudera.com/hadoop-training-basic
4.
http://www.cse.buffalo.edu/faculty/bina/mapreduce.
html
FINALLY, A HANDS-ON ASSIGNMENT!
The MapReduce Framework
(pioneered by Google)
Automatic Parallel Execution in
MapReduce (Google)
Handles failures automatically, e.g., restarts tasks if a
node fails; runs multiples copies of the same task to
avoid a slow task slowing down the whole job
The Map (Example)
inputs
tasks (M=3)
When in the
course of human
events it …
map
(when,1), (course,1) (human,1) (events,1) (best,1) …
(in,1) (the,1) (of,1) (it,1) (it,1) (was,1) (the,1) (of,1) …
It was the best of
times and the worst
of times…
Over the past five
years, the authors
and many…
partitions (intermediate files) (R=2)
map
(over,1), (past,1) (five,1) (years,1) (authors,1) (many,1) …
(the,1), (the,1) (and,1) …
This paper evaluates
the suitability of the …
map
(this,1) (paper,1) (evaluates,1) (suitability,1) …
(the,1) (of,1) (the,1) …
The Reduce (Example)
partition (intermediate files) (R=2)
(in,1) (the,1) (of,1) (it,1) (it,1) (was,1) (the,1) (of,1) …
reduce task
sort
(the,1), (the,1) (and,1) …
(the,1) (of,1) (the,1) …
(and, (1)) (in,(1)) (it, (1,1)) (the, (1,1,1,1,1,1))
(of, (1,1,1)) (was,(1))
reduce
(and,1) (in,1) (it, 2) (of, 3) (the,6) (was,1)
Note: only one of the two reduce tasks shown
Lifecycle of a MapReduce Job
Map function
Reduce function
Run this program as a
MapReduce job
Job Configuration Parameters
• 190+ parameters in
Hadoop
• Set manually or defaults
are used
Formatting the NameNode
Before we start, we have to format Hadoop's distributed filesystem (HDFS) for the namenode.
You need to do this the first time you set up Hadoop. Do not format a running Hadoop
namenode, this will cause all your data in the HDFS filesytem to be erased.
To format the filesystem, run the command (from the master):
--------------------------------------------
bin/hadoop namenode -format
--------------------------------------------Starting Hadoop:
Starting hadoop is done in two steps: First, the HDFS daemons are started: the namenode
daemon is started on master, and datanode daemons are started on all slaves (here: master and
slave). Second, the MapReduce daemons are started: the jobtracker is started on master, and
tasktracker daemons are started on all slaves (here: master and slave).
HDFS daemons:
Run the command <HADOOP_HOME>/bin/start-dfs.sh on the machine you want the namenode
to run on. This will bring up HDFS with the namenode running on the machine you ran the
previous command on, and datanodes on the machines listed in the conf/slaves file.
In our case, we will run bin/start-dfs.sh on master:
------------------------bin/start-dfs.sh
-----------------------On slave, you can examine the success or failure of this command by inspecting the log file
<HADOOP_HOME>/logs/hadoop-hadoop-datanode-slave.log.
Now, the following Java processes should run on master:
root@ubuntu:$HADOOP_HOME/bin: jps
14799 NameNode
15314 Jps
14880 DataNode
14977 SecondaryNameNode
------------------------------------
MapReduce Daemons:
In our case, we will run bin/start-mapred.sh on master:
------------------------------------bin/start-mapred.sh
------------------------------------On slave, you can examine the success or failure of this command by inspecting the log file
<HADOOP_HOME>/logs/hadoop-hadoop-tasktracker-slave.log.
At this point, the following Java processes should run on master:
---------------------------------------------------root@ubuntu:$HADOOP_HOME/bin$ jps
16017 Jps
14799 NameNode
15686 TaskTracker
14880 DataNode
15596 JobTracker
14977 SecondaryNameNode
----------------------------------------------------
We will execute your first Hadoop MapReduce job. We will use the WordCount example, which
will read a text file and count the frequency of words. The input is a text file and the output is a
text file with each line of which contains a word and the count of how often it occurred,
separated by a tab.
Download example input data:
Copy local data file to HDFS
Before we run the actual MapReduce job, we first have to copy the files from our local file system
to Hadoop's HDFS
----------------------------root@ubuntu:$HADOOP_HOME/bin$ hadoop dfs -copyFromLocal /tmp/source destination
Build WordCount
Execute build.sh
Run the MapReduce job
Now, we actually run the WordCount example job.
This command will read all the files in the HDFS “destination” directory , process it, and store the result in the HDFS directory
“output”.
----------------------------------------root@ubuntu:$HADOOP_HOME/bin bin/hadoop hadoop-example wordcount destination output
----------------------------------------You can check if the result is successfully stored in HDFS directory “output”.
Retrieve the job result from HDFS
To inspect the file, you can copy it from HDFS to the local file system.
------------------------------------root@ubuntu:/usr/local/hadoop$ mkdir /tmp/output
root@ubuntu:/usr/local/hadoop$ bin/hadoop dfs –copyToLocal output/part-00000 /tmp/output
---------------------------------------Alternatively, you can read the file directly from HDFS without copying it to the local file system by using the command :
--------------------------------------------root@ubuntu:/usr/local/hadoop$ bin/hadoop dfs –cat output/part-00000
Even though the Hadoop framework is written in Java, programs for Hadoop
need not to be coded in Java but can also be developed in other languages
like Python or C++.
Creating a launching program for your application
The launching program configures:
– The Mapper and Reducer to use
– The output key and value types (input types are inferred from the
InputFormat)
– The locations for your input and output
The launching program then submits the job and typically waits for it to
complete
A Map/Reduce may specify how it’s input is to be read by specifying an
InputFormat to be used
A Map/Reduce may specify how it’s output is to be written by specifying an
OutputFormat to be used
Download