Distributed and Parallel Processing Technology
Chapter 2. MapReduce
Sun Jo
Introduction
 MapReduce is a programming model for data processing.
 Hadoop can run MapReduce programs written in various languages.
 We shall look at the same program expressed in Java, Ruby, Python, and
C++.
A Weather Dataset
 Program that mines weather data
 Weather sensors collect data every
hour at many locations across the
globe
 They gather a large volume of log data, which is a good candidate for analysis with MapReduce.
 Data Format
 Data from the National Climatic Data Center (NCDC)
 Stored using a line-oriented ASCII
format, in which each line is a record
A Weather Dataset
 Data Format
 Data files are organized by date and weather station.
 There is a directory for each year from 1901 to 2001, each containing a gzipped file
for each weather station with its readings for that year.
 The whole dataset is made up of a large number of relatively small files, since there are tens of thousands of weather stations.
 The data was preprocessed so that each year’s readings were concatenated into a
single file.
Analyzing the Data with Unix Tools
 What’s the highest recorded global temperature for each year in the dataset?
 A Unix shell script with awk, the classic tool for processing line-oriented data
 The script loops through the compressed year files, first printing the year, then processing each file using awk.
 awk extracts the air temperature and the quality code from the data.
 A temperature value of 9999 signifies a missing value in the NCDC dataset.
 Beginning of a run: the maximum temperature for 1901 is 31.7℃.
 The complete run for the century took 42 minutes in one run on a single EC2 High-CPU
Extra Large Instance.
Analyzing the Data with Unix Tools
 To speed up the processing, run parts of the program in parallel
 Problems for parallel processing
 Dividing the work into equal-size pieces isn’t always easy or obvious.
• The file size for different years varies
• The whole run is dominated by the longest file
• A better approach is to split the input into fixed-size chunks and assign each chunk to a process
 Combining the results from independent processes may need further processing.
 Processing is still limited by the capacity of a single machine, and using multiple machines requires handling coordination and reliability.
 Although it’s feasible to parallelize the processing, in practice it’s messy.
Analyzing the Data with Hadoop – Map and Reduce
 Map and Reduce
 MapReduce works by breaking the processing into two phases: the map phase and the reduce phase.
 Both the map and reduce phases have key-value pairs as input and output.
 The programmer has to specify two functions: the map function and the reduce function.
 The input to the map phase is the raw NCDC data.
• Here, the key is the offset of the beginning of the line within the file and the value is the line itself.
 The map function pulls out the year and the air temperature from each input value.
 The reduce function takes <year, temperature> pairs as input and produces the
maximum temperature for each year as the result.
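As an illustration of this contract, here is a plain-Java sketch (not Hadoop code; the class name and sample values are illustrative) of what the reduce step conceptually computes for the shuffled <year, [temperatures]> groups:

import java.util.Collections;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: given shuffled <year, [temperatures]> groups,
// emit the maximum reading for each year (temperatures in tenths of a degree).
public class MaxPerYearSketch {
  public static void main(String[] args) {
    Map<String, List<Integer>> shuffled = Map.of(
        "1949", List.of(111, 78),
        "1950", List.of(0, 22, -11));
    shuffled.forEach((year, temps) ->
        System.out.println(year + "\t" + Collections.max(temps)));
  }
}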
Analyzing the Data with Hadoop – Map and Reduce
 Original NCDC Format
 Input file for the map function, stored in HDFS
 Output of the map function, running in parallel for each block
 Input for the reduce function & Output of the reduce function
Analyzing the Data with Hadoop – Map and Reduce
 The whole data flow: input file → map() → shuffling → reduce()
• Map outputs such as <1950, 0>, <1950, 22>, <1950, -11>, <1949, 111>, <1949, 78> are shuffled into per-year groups <1949, [111, 78]> and <1950, [0, 22, -11]>, and the reduce function emits the maximum for each year: <1949, 111>, <1950, 22>.
Analyzing the Data with Hadoop – Java MapReduce
 Having run through how the MapReduce program works, the next step is to express it in code.
 A map function, a reduce function, and some code to run the job are needed.
 Map function
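A sketch of such a map function in the old (org.apache.hadoop.mapred) API might look like the following; the fixed-width field offsets assume the NCDC record layout described earlier:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    String year = line.substring(15, 19);          // year field
    int airTemperature;
    if (line.charAt(87) == '+') {                  // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);       // quality code
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      output.collect(new Text(year), new IntWritable(airTemperature));
    }
  }
}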
Analyzing the Data with Hadoop – Java MapReduce
 Reduce function
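A matching sketch of the reduce function in the old API, which finds the maximum of the values collected for each year, might be:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int maxValue = Integer.MIN_VALUE;               // running maximum for this year
    while (values.hasNext()) {
      maxValue = Math.max(maxValue, values.next().get());
    }
    output.collect(key, new IntWritable(maxValue));
  }
}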
Analyzing the Data with Hadoop – Java MapReduce
 Main function for running the MapReduce job
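A sketch of the driver in the old API: it configures the job on a JobConf, sets the mapper and reducer classes, and submits the job with JobClient.runJob:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperature {
  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    JobConf conf = new JobConf(MaxTemperature.class);
    conf.setJobName("Max temperature");

    FileInputFormat.addInputPath(conf, new Path(args[0]));     // input file or directory
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));   // output directory (must not exist yet)

    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setReducerClass(MaxTemperatureReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}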
Analyzing the Data with Hadoop – Java MapReduce
 A test run
 The output is written to the output directory, which contains one output file
per reducer
Analyzing the Data with Hadoop – Java MapReduce
 The new Java MapReduce API
 The new API, referred to as “Context Objects”, is type-incompatible with the old, so
applications need to be rewritten to take advantage of it.
 Notable differences
• Favors abstract classes over interfaces: Mapper and Reducer are now abstract classes rather than interfaces.
• The new API is in the org.apache.hadoop.mapreduce package and subpackages.
• The old API can still be found in org.apache.hadoop.mapred.
• Makes extensive use of context objects that allow the user code to communicate with the MapReduce system.
• e.g., the MapContext essentially unifies the roles of the JobConf, the OutputCollector, and the Reporter.
• Supports both a ‘push’ and a ‘pull’ style of iteration
• Key-value record pairs are pushed to the mapper as before, but in addition, the new API allows a mapper to pull records from within the map() method.
• The same goes for the reducer.
• Configuration has been unified.
• The old API has a JobConf object for job configuration, which is an extension of Hadoop’s vanilla
Configuration object.
• In the new API, job configuration is done through a Configuration.
• Job control is performed through the Job class rather than JobClient.
• Output files are named slightly differently
• part-m-nnnnn for map outputs, part-r-nnnnn for reduce outputs
• (nnnnn is an integer designating the part number, starting from 0)
Analyzing the Data with Hadoop – Java MapReduce
 The new Java MapReduce API
 Example 2-6 shows the MaxTemperature application rewritten to use the new API.
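A sketch of the rewritten application (not the book's exact listing): the mapper and reducer extend the abstract Mapper and Reducer classes, output is written through a Context object, and job control goes through the Job class:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewMaxTemperature {

  static class NewMaxTemperatureMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      String year = line.substring(15, 19);
      int airTemperature;
      if (line.charAt(87) == '+') {
        airTemperature = Integer.parseInt(line.substring(88, 92));
      } else {
        airTemperature = Integer.parseInt(line.substring(87, 92));
      }
      String quality = line.substring(92, 93);
      if (airTemperature != MISSING && quality.matches("[01459]")) {
        context.write(new Text(year), new IntWritable(airTemperature));   // output goes through the context object
      }
    }
  }

  static class NewMaxTemperatureReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int maxValue = Integer.MIN_VALUE;
      for (IntWritable value : values) {             // values are an Iterable in the new API
        maxValue = Math.max(maxValue, value.get());
      }
      context.write(key, new IntWritable(maxValue));
    }
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: NewMaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    Job job = new Job();                             // job configuration and control via the Job class
    job.setJarByClass(NewMaxTemperature.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(NewMaxTemperatureMapper.class);
    job.setReducerClass(NewMaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}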
Scaling Out
 To scale out, we need to store the data in a distributed filesystem, HDFS.
 Hadoop moves the MapReduce computation to each machine hosting a part
of the data.
 Data Flow
 A MapReduce job consists of the input data, the MapReduce program, and
configuration information.
 Hadoop runs the job by dividing it into two types of tasks: map tasks and reduce tasks.
 Two types of nodes: one jobtracker and several tasktrackers
• Jobtracker: coordinates the job by scheduling tasks to run on tasktrackers.
• Tasktrackers: run tasks and send progress reports to the jobtracker.
 Hadoop divides the input into fixed-size pieces, called input splits, or just splits.
 Hadoop creates one map task for each split, which runs the user-defined map function
for each record in the split.
 The quality of the load balancing increases as the splits become more fine-grained.
• Default size : 1 HDFS block, 64MB
 Map tasks write their output to the local disk, not to HDFS.
 If the node running a map task fails, Hadoop will automatically rerun the map task on
another node to re-create the map output.
Scaling Out
 Data Flow – single reduce task
 Reduce tasks don’t have the advantage of data locality – the input to a single reduce
task is normally the output from all mappers.
 All map outputs are merged across the network and passed to the user-defined reduce
function.
 The output of the reduce is normally stored in HDFS.
Scaling Out
 Data Flow – multiple reduce tasks
 The number of reduce tasks is specified independently; it is not governed by the size of the input (see the configuration fragment below).
 The map tasks partition their output by keys, each creating one partition for each
reduce task.
 There can be many keys and their associated values in each partition, but the records for
any key are all in a single partition.
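For example, with the old API the reducer count is set on the JobConf; a small sketch (the helper class and method here are hypothetical, only setNumReduceTasks is the real call) might look like:

import org.apache.hadoop.mapred.JobConf;

public class ReducerCountExample {
  // Hypothetical helper: ask for two reduce tasks on an old-API job.
  // With two reducers, the default HashPartitioner sends each key to
  // partition (key.hashCode() & Integer.MAX_VALUE) % 2, so all records
  // for a given year still end up in the same partition.
  public static void useTwoReducers(JobConf conf) {
    conf.setNumReduceTasks(2);
  }
}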
Scaling Out
 Data Flow – zero reduce tasks
Scaling Out
 Combiner Functions
 Many MapReduce jobs are limited by the bandwidth available on the cluster.
 It pays to minimize the data transferred between map and reduce tasks.
 Hadoop allows the user to specify a combiner function to be run on the map
output – the combiner function’s output forms the input to the reduce function.
 The contract for the combiner function constrains the type of function that may be used.
 Example without a combiner function
• First map output: <1950, 0>, <1950, 20>, <1950, 10>; second map output: <1950, 25>, <1950, 15>
• After shuffling, the reduce function receives <1950, [0, 20, 10, 25, 15]> and emits <1950, 25>
 Example with a combiner function that finds the maximum temperature for each map's output
• The combiner reduces the first map's output to <1950, 20> and the second map's output to <1950, 25>
• After shuffling, the reduce function receives <1950, [20, 25]> and still emits <1950, 25>
Scaling Out
 Combiner Functions
 The function calls on the temperature values can be expressed as follows:
• max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
 A job calculating mean temperatures could not use the mean as its combiner function, since the results differ:
• mean(0, 20, 10, 25, 15) = 14
• mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
 The combiner function doesn’t replace the reduce function.
 It can help cut down the amount of data shuffled between the maps and the reduces
Scaling Out
 Combiner Functions
 Specifying a combiner function
• The combiner function is defined using the Reducer interface
• For this application, it is the same implementation as the reduce function in MaxTemperatureReducer.
• The only change is to set the combiner class on the JobConf.
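A sketch of the driver with the combiner set (old API); apart from the setCombinerClass call, it is the same as the earlier driver:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperatureWithCombiner {
  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperatureWithCombiner <input path> <output path>");
      System.exit(-1);
    }
    JobConf conf = new JobConf(MaxTemperatureWithCombiner.class);
    conf.setJobName("Max temperature");

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setCombinerClass(MaxTemperatureReducer.class);  // max is commutative and associative, so the reducer can be reused
    conf.setReducerClass(MaxTemperatureReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}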
Hadoop Streaming
 Hadoop provides an API to MapReduce that allows you to write the map and reduce functions in languages other than Java.
 Any language that can read standard input and write to standard output can be used to write MapReduce programs.
 Hadoop Streaming
 Map input data is passed over standard input to your map function.
 The map function processes the data line by line and writes lines to standard output.
 A map output key-value pair is written as a single tab-delimited line.
 The reduce function reads lines from standard input (sorted by key) and writes its results to standard output.
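To make the stdin/stdout contract concrete, here is a sketch of a Streaming-style map program written in Java; any executable that follows the same protocol, in any language, works the same way (field offsets again assume the NCDC layout):

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Sketch of a Streaming map program: read NCDC records from standard input,
// write tab-delimited <year, temperature> lines to standard output.
public class StreamingMaxTemperatureMapper {
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = in.readLine()) != null) {
      if (line.length() < 93) {
        continue;                                   // skip malformed records
      }
      String year = line.substring(15, 19);
      String temp = line.substring(87, 92);         // signed temperature, e.g. "+0022"
      String quality = line.substring(92, 93);
      if (!temp.equals("+9999") && quality.matches("[01459]")) {
        System.out.println(year + "\t" + temp);     // one tab-delimited key-value pair per line
      }
    }
  }
}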
Hadoop Streaming
 Ruby
 The map function can be expressed in Ruby.
 Simulating the map function in Ruby with a Unix pipeline
 The reduce function for maximum temperature in Ruby
Hadoop Streaming
 Ruby
 Simulating the whole MapReduce pipeline with a Unix pipeline
 The hadoop command to run the whole MapReduce job
 A combiner can also be written in any Streaming language.
Hadoop Streaming
 Python
 Streaming supports any programming language that can read from standard input and
write to standard output.
 The map and reduce scripts in Python
 Test the programs and run the job in the same way we did in Ruby.
Hadoop Pipes
 Hadoop Pipes
 The name of the C++ interface to Hadoop MapReduce.
 Pipes uses sockets as the channel over which the tasktracker communicates with the
process running the C++ map or reduce function.
 The source code for the map and reduce functions in C++
Hadoop Pipes
 Compiling and Running
 Makefile for C++ MapReduce program
 The Makefile defines PLATFORM, which specifies the operating system, architecture, and data model (e.g., 32- or 64-bit).
 To run a Pipes job, we need to run Hadoop (its daemons) in pseudo-distributed mode.
 The next step is to copy the executable (the program) to HDFS.
 Next, the sample data is copied from the local filesystem to HDFS.
Hadoop Pipes
 Compiling and Running
 Now we can run the job. For this, we use the hadoop pipes command, passing the URI of the executable in HDFS using the -program argument: