Distributed and Parallel Processing Technology
Chapter 2. MapReduce
Sun Jo

Introduction
MapReduce is a programming model for data processing. Hadoop can run MapReduce programs written in various languages; we shall look at the same program expressed in Java, Ruby, Python, and C++.

A Weather Dataset
A program that mines weather data.
Weather sensors collect data every hour at many locations across the globe. They gather a large volume of log data, which is a good candidate for analysis with MapReduce.
Data Format
The data is from the National Climatic Data Center (NCDC).
It is stored using a line-oriented ASCII format, in which each line is a record.
Data files are organized by date and weather station: there is a directory for each year from 1901 to 2001, each containing a gzipped file for each weather station with its readings for that year.
The whole dataset is made up of a large number of relatively small files, since there are tens of thousands of weather stations.
The data was therefore preprocessed so that each year's readings were concatenated into a single file.

Analyzing the Data with Unix Tools
What's the highest recorded global temperature for each year in the dataset?
A Unix shell script with awk, the classic tool for processing line-oriented data.
The script loops through the compressed year files, printing the year and then processing each file using awk.
Awk extracts the air temperature and the quality code from the data; the temperature value 9999 signifies a missing value in the NCDC dataset.
The beginning of a run shows a maximum temperature of 31.7℃ for 1901.
The complete run for the century took 42 minutes on a single EC2 High-CPU Extra Large instance.

To speed up the processing, we would like to run parts of the program in parallel.
Problems with parallel processing:
Dividing the work into equal-size pieces isn't always easy or obvious.
• The file size for different years varies.
• The whole run is dominated by the longest file.
• A better approach is to split the input into fixed-size chunks and assign each chunk to a process.
Combining the results from independent processes may need further processing.
We are still limited by the processing capacity of a single machine, and with multiple machines we have to handle coordination and reliability ourselves.
So although it is feasible to parallelize the processing, in practice it is messy.

Analyzing the Data with Hadoop – Map and Reduce
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase.
Both the map and reduce phases have key-value pairs as input and output.
The programmer specifies two functions: the map function and the reduce function.
The input to the map phase is the raw NCDC data.
• Here, the key is the offset of the beginning of the line within the file, and the value is the line of the dataset itself.
The map function pulls out the year and the air temperature from each input value.
The reduce function takes <year, temperature> pairs as input and produces the maximum temperature for each year as the result (a plain-Java sketch of this flow follows below).
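To make the two phases concrete before bringing Hadoop in, here is a minimal plain-Java sketch of the same idea. It is not the slides' code: the input lines are simplified to a hypothetical year<TAB>temperature format (temperatures in tenths of a degree, as in the NCDC data, using the sample values from the data-flow figure), but it shows map output being grouped by key and then reduced to a per-year maximum.

```java
import java.util.*;

public class MaxTemperatureConcept {

    // Map step: turn one input line into a (year, temperature) pair.
    // The "lines" here are already in a tiny year<TAB>temp format; the real
    // map function would pull these fields out of a raw NCDC record.
    static Map.Entry<String, Integer> map(String line) {
        String[] fields = line.split("\t");
        return Map.entry(fields[0], Integer.parseInt(fields[1]));
    }

    public static void main(String[] args) {
        List<String> input = List.of(
                "1949\t111", "1949\t78", "1950\t0", "1950\t22", "1950\t-11");

        // Shuffle: group the map output by key, as the framework does between the phases.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input) {
            Map.Entry<String, Integer> kv = map(line);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }

        // Reduce step: for each year, emit the maximum temperature seen.
        grouped.forEach((year, temps) ->
                System.out.println(year + "\t" + Collections.max(temps)));
        // Prints 1949 -> 111 and 1950 -> 22 (tenths of a degree Celsius).
    }
}
```

In the real program the grouping and sorting are done by the framework between the map and reduce tasks; the programmer only writes the two functions.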
Analyzing the Data with Hadoop – Map and Reduce
[Figure: sample data at each stage of the job, from the original NCDC format to the input file for the map function stored in HDFS, the output of the map function (which runs in parallel for each block), and the input to and output of the reduce function.]
[Figure: the whole MapReduce data flow. Map output pairs such as <1950, 0>, <1950, 22>, <1950, -11>, <1949, 111>, <1949, 78> are shuffled and grouped by key into reduce inputs such as <1949, [111, 78]> and <1950, [0, 22, -11]>, and the reduce function emits the per-year maximum, e.g. <1949, 111> and <1950, 22>.]

Analyzing the Data with Hadoop – Java MapReduce
Having run through how the MapReduce program works, we can express it in code. A map function, a reduce function, and some code to run the job are needed.
[Listing: the map function]
[Listing: the reduce function]
[Listing: the main function for running the MapReduce job]
A test run: the output is written to the output directory, which contains one output file per reducer.

The new Java MapReduce API
The new API, referred to as "Context Objects", is type-incompatible with the old one, so applications need to be rewritten to take advantage of it.
Notable differences:
• The new API favors abstract classes over interfaces: in the new API, Mapper and Reducer are abstract classes.
• The new API is in the org.apache.hadoop.mapreduce package and subpackages; the old API can still be found in org.apache.hadoop.mapred.
• The new API makes extensive use of context objects that allow the user code to communicate with the MapReduce system. For example, the MapContext unifies the roles of the JobConf, the OutputCollector, and the Reporter.
• The new API supports both a "push" and a "pull" style of iteration. Basically, key-value record pairs are pushed to the mapper, but in addition the new API allows a mapper to pull records from within the map() method. The same goes for the reducer.
• Configuration has been unified. The old API has a JobConf object for job configuration, which is an extension of Hadoop's vanilla Configuration object; in the new API, job configuration is done through a Configuration.
• Job control is performed through the Job class rather than JobClient.
• Output files are named slightly differently: part-m-nnnnn for map outputs and part-r-nnnnn for reduce outputs (nnnnn is an integer designating the part number, starting from 0).
Example 2-6 shows the MaxTemperature application rewritten to use the new API (a sketch along these lines follows below).
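Since the Example 2-6 listing itself is not captured in this text, the following is a sketch of what a MaxTemperature application using the new API looks like. The fixed-column offsets (year in columns 15–19, temperature around columns 87–92, quality code in column 92) follow the NCDC record layout used in the book's example; treat the offsets and class names as illustrative rather than as the slides' exact code.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

  // Mapper: pulls (year, temperature) out of each raw NCDC line.
  public static class MaxTemperatureMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      String year = line.substring(15, 19);
      int airTemperature;
      if (line.charAt(87) == '+') {  // signed temperature field: skip an explicit plus sign
        airTemperature = Integer.parseInt(line.substring(88, 92));
      } else {
        airTemperature = Integer.parseInt(line.substring(87, 92));
      }
      String quality = line.substring(92, 93);
      if (airTemperature != MISSING && quality.matches("[01459]")) {
        context.write(new Text(year), new IntWritable(airTemperature));
      }
    }
  }

  // Reducer: picks the maximum temperature seen for each year.
  public static class MaxTemperatureReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int maxValue = Integer.MIN_VALUE;
      for (IntWritable value : values) {
        maxValue = Math.max(maxValue, value.get());
      }
      context.write(key, new IntWritable(maxValue));
    }
  }

  // Driver: in the new API, job control goes through Job rather than JobClient,
  // and configuration through Configuration rather than JobConf.
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note how the mapper and reducer extend the abstract Mapper and Reducer classes and write their output through a Context object, which is exactly the "context objects" difference listed above.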
Scaling Out
To scale out, we need to store the data in a distributed filesystem, HDFS. Hadoop can then move the MapReduce computation to each machine hosting a part of the data.

Data Flow
A MapReduce job consists of the input data, the MapReduce program, and configuration information.
Hadoop runs the job by dividing it into two types of tasks: map tasks and reduce tasks.
There are two types of nodes: one jobtracker and several tasktrackers.
• Jobtracker: coordinates the job by scheduling tasks to run on tasktrackers.
• Tasktrackers: run tasks and send progress reports to the jobtracker.
Hadoop divides the input into fixed-size pieces, called input splits, or just splits, and creates one map task for each split, which runs the user-defined map function for each record in the split.
The quality of the load balancing increases as the splits become more fine-grained.
• Default split size: one HDFS block, 64 MB.
Map tasks write their output to the local disk, not to HDFS. If the node running a map task fails, Hadoop will automatically rerun the map task on another node to re-create the map output.

Data Flow – single reduce task
Reduce tasks don't have the advantage of data locality: the input to a single reduce task is normally the output from all mappers.
The map outputs are transferred across the network, merged, and passed to the user-defined reduce function.
The output of the reduce is normally stored in HDFS.

Data Flow – multiple reduce tasks
The number of reduce tasks is specified independently; it is not governed by the size of the input.
The map tasks partition their output by key, each creating one partition for each reduce task.
There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition.

Data Flow – zero reduce tasks
It is also possible to have zero reduce tasks; this is appropriate when the processing can be carried out entirely in parallel and no shuffle is needed.

Combiner Functions
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks.
Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function.
The contract for the combiner function constrains the type of function that may be used.
Example without a combiner function:
• One map emits <1950, 0>, <1950, 20>, <1950, 10> and another emits <1950, 25>, <1950, 15>.
• After shuffling, the reduce function receives <1950, [0, 20, 10, 25, 15]> and emits <1950, 25>.
Example with a combiner function that finds the maximum temperature for each map's output:
• The first map's combiner emits <1950, 20> and the second map's combiner emits <1950, 25>.
• The reduce function then receives only <1950, [20, 25]> and still emits <1950, 25>.
The function calls on the temperature values can be expressed as follows:
• max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
If we were calculating mean temperatures, we could not use the mean as the combiner function:
• mean(0, 20, 10, 25, 15) = 14, but
• mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15.
The combiner function doesn't replace the reduce function, but it can help cut down the amount of data shuffled between the maps and the reduces.
Specifying a combiner function (see the driver sketch below):
• The combiner function is defined using the Reducer interface.
• For this application it has the same implementation as the reduce function in MaxTemperatureReducer.
• The only change needed is to set the combiner class on the JobConf.
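The combiner listing from the slides is not captured in this text. As a hedged sketch only: with the old API the single change is a setCombinerClass() call on the JobConf, and the equivalent driver written against the new API used in the earlier sketch would look roughly like this (it reuses the hypothetical MaxTemperatureMapper and MaxTemperatureReducer classes from that sketch).

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver only; the mapper and reducer classes come from the earlier MaxTemperature sketch.
public class MaxTemperatureWithCombiner {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperatureWithCombiner <input path> <output path>");
      System.exit(-1);
    }
    Job job = new Job();
    job.setJarByClass(MaxTemperatureWithCombiner.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperature.MaxTemperatureMapper.class);
    // The only change: run the reducer implementation as a combiner on each map's output.
    job.setCombinerClass(MaxTemperature.MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperature.MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Reusing the reducer as the combiner works here only because taking a maximum is commutative and associative; as shown above, a mean could not be reused this way.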
Hadoop Streaming
Hadoop provides an API to MapReduce that lets you write the map and reduce functions in languages other than Java, so a MapReduce program can be written in virtually any language.
With Hadoop Streaming:
• Map input data is passed over standard input to your map function, which processes it line by line and writes lines to standard output.
• A map output key-value pair is written as a single tab-delimited line.
• The reduce function reads lines from standard input (sorted by key) and writes its results to standard output.

Ruby
The map function can be expressed in Ruby.
[Listing: simulating the map function in Ruby with a Unix pipeline]
[Listing: the reduce function for maximum temperature in Ruby]
[Listing: simulating the whole MapReduce pipeline with a Unix pipeline]
[Listing: the hadoop command to run the whole MapReduce job]
[Listing: running the job with a combiner, which can be coded in any streaming language]

Python
Streaming supports any programming language that can read from standard input and write to standard output.
[Listing: the map and reduce scripts in Python]
We test the programs and run the job in the same way we did in Ruby.

Hadoop Pipes
Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce.
Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function.
[Listing: the source code for the map and reduce functions in C++]

Compiling and Running
The Makefile for the C++ MapReduce program defines PLATFORM, which specifies the operating system, architecture, and data model (e.g., 32- or 64-bit).
To run a Pipes job, we need to run the Hadoop daemons in pseudo-distributed mode.
The next step is to copy the executable (the compiled program) to HDFS; then the sample data is copied from the local filesystem to HDFS (see the sketch at the end of this section).
Now we can run the job. For this, we use the hadoop pipes command, passing the URI of the executable in HDFS using the -program argument.
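The hadoop fs copy commands from the slides are not reproduced in this text. Purely as an aside on that same step, the copies into HDFS can also be done from Java with Hadoop's FileSystem API; the file names below (max_temperature, sample.txt) are hypothetical, and this is a sketch of the idea rather than the slides' procedure.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: copy the locally built Pipes executable and a sample data file into HDFS
// before launching the job. The slides use the hadoop fs command-line tool instead.
public class CopyToHdfs {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Picks up the default (pseudo-distributed) filesystem from the cluster configuration.
    FileSystem fs = FileSystem.get(conf);

    // Copy the compiled C++ executable into HDFS so the tasktrackers can fetch it.
    fs.copyFromLocalFile(new Path("max_temperature"), new Path("bin/max_temperature"));

    // Copy the sample NCDC data from the local filesystem into HDFS.
    fs.copyFromLocalFile(new Path("sample.txt"), new Path("sample.txt"));
  }
}
```

For a one-off copy the command-line tool is the usual choice; the API route is shown only to make the "copy to HDFS" step concrete.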