MapReduce-Mehta

SIDDHARTH MEHTA
PURSUING MASTERS IN COMPUTER SCIENCE
(FALL 2008)
INTERESTS: SYSTEMS, WEB


A programming model and an associated implementation (library) for processing and generating large data sets on large clusters.
A new abstraction that lets programmers express simple computations while hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library.

Large-Scale Data Processing
◦ Want to use 1000s of CPUs
 But don't want the hassle of managing them

MapReduce provides
◦ Automatic parallelization & distribution
◦ Fault tolerance
◦ I/O scheduling
◦ Monitoring & status updates

The MapReduce programming model has been successfully used at Google for many different purposes.
◦ First, the model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing.
◦ Second, a large variety of problems are easily expressible as MapReduce computations. For example, MapReduce is used to generate the data for Google's production web search service, and for sorting, data mining, machine learning, and many other systems.
◦ Third, Google developed an implementation of MapReduce that scales to clusters comprising thousands of machines. The implementation makes efficient use of these machine resources and is therefore suitable for many of the large computational problems encountered at Google.
map(key=url, val=contents):
    For each word w in contents, emit (w, "1")
reduce(key=word, values=uniq_counts):
    Sum all "1"s in the values list
    Emit result "(word, sum)"
Example: word count over two input lines
  Input:
    see bob throw
    see spot run
  Map output:
    (see, 1), (bob, 1), (run, 1), (see, 1), (spot, 1), (throw, 1)
  Reduce output (after grouping by key):
    (bob, 1), (run, 1), (see, 2), (spot, 1), (throw, 1)
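The pseudocode and worked example above can be turned into a tiny runnable sketch. The single-process Python driver below is only an illustration; the function and driver names are assumptions, not part of Google's library.

```python
from collections import defaultdict

def map_fn(key, contents):
    """Map: for each word w in contents, emit (w, 1); key (e.g. a URL) is unused."""
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: sum all the 1s emitted for this word."""
    return (word, sum(counts))

def run_local(documents):
    """Tiny single-process driver standing in for the MapReduce framework."""
    intermediate = defaultdict(list)
    for key, contents in documents:                 # map phase
        for word, count in map_fn(key, contents):
            intermediate[word].append(count)
    # the shuffle is implicit in the dict grouping; then the reduce phase:
    return [reduce_fn(w, counts) for w, counts in sorted(intermediate.items())]

print(run_local([("doc1", "see bob throw"), ("doc2", "see spot run")]))
# [('bob', 1), ('run', 1), ('see', 2), ('spot', 1), ('throw', 1)]
```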




Distributed grep:
◦ Map: (key, whole doc / a line) → (the matched line, key)
◦ Reduce: identity function
Count of URL Access Frequency:
◦ Map: logs of web page requests → (URL, 1)
◦ Reduce: (URL, total count)
Reverse Web-Link Graph:
◦ Map: (source, target) → (target, source)
◦ Reduce: (target, list(source)) → (target, list(source))
Inverted Index (sketched below):
◦ Map: (docID, document) → (word, docID)
◦ Reduce: (word, list(docID)) → (word, sorted list(docID))
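As a concrete illustration of the inverted-index case, here is a small Python sketch; the function names and the in-memory shuffle are assumptions for illustration, not the actual library interface.

```python
from collections import defaultdict

def map_fn(doc_id, document):
    """Map: (docID, document) -> (word, docID) for every distinct word in the document."""
    for word in set(document.split()):   # set(): emit each (word, docID) pair once per doc
        yield (word, doc_id)

def reduce_fn(word, doc_ids):
    """Reduce: (word, list(docID)) -> (word, sorted list(docID))."""
    return (word, sorted(doc_ids))

# Tiny in-memory stand-in for the shuffle and reduce phases.
docs = [("d1", "see bob throw"), ("d2", "see spot run")]
intermediate = defaultdict(list)
for doc_id, text in docs:
    for word, d in map_fn(doc_id, text):
        intermediate[word].append(d)
index = dict(reduce_fn(w, ids) for w, ids in intermediate.items())
print(index["see"])   # ['d1', 'd2']
```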

Google clusters are built from large numbers of commodity PCs:
◦ Intel Xeon, 2 x 2 MB cache, Hyper-Threading
◦ 2-4 GB memory
◦ 100 Mbps - 1 Gbps networking
◦ Local IDE disks + Google File System (GFS)
◦ Jobs are submitted to a scheduling system
[Figure: execution overview with M map tasks and R reduce tasks]

Fault Tolerance – in a word: redo
◦ The master pings every worker periodically and re-schedules the tasks of workers that fail to respond.
◦ Note: completed map tasks are re-executed on failure because their output is stored on the local disk of the failed machine and is therefore inaccessible; completed reduce tasks need not be re-executed, since their output is in the global file system.
◦ Master failure: redo; the current implementation simply aborts the computation and lets the client retry.
◦ Semantics in the presence of failures:
 Deterministic map/reduce functions: produce the same output as would have been produced by a non-faulting sequential execution of the entire program.
 Atomic commits of map and reduce task outputs are relied on to achieve this property.
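The re-scheduling idea can be sketched in a few lines of Python. The Worker class, ping() method, and task queue below are hypothetical stand-ins for the real RPC-based master/worker machinery.

```python
import queue

class Worker:
    """Hypothetical worker handle; ping() would be an RPC in a real system."""
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.running = []          # tasks currently assigned to this worker
        self.completed_maps = []   # map outputs live on the worker's local disk

    def ping(self):
        return self.alive

def handle_failures(workers, task_queue):
    """Master-side loop body: detect dead workers and re-enqueue their tasks."""
    for w in workers:
        if not w.ping():
            # In-progress tasks (map or reduce) are reset to idle and re-scheduled.
            for task in w.running:
                task_queue.put(task)
            # Completed map tasks are also redone: their output sat on w's local disk.
            for task in w.completed_maps:
                task_queue.put(task)
            w.running, w.completed_maps = [], []

# Toy usage
tasks = queue.Queue()
w1, w2 = Worker("w1"), Worker("w2")
w1.running, w1.completed_maps = ["map-7"], ["map-3"]
w1.alive = False
handle_failures([w1, w2], tasks)
print(list(tasks.queue))   # ['map-7', 'map-3']
```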








Refinements
◦ Partitioning
◦ Ordering guarantees
◦ Combiner function
◦ Side effects
◦ Skipping bad records
◦ Local execution
◦ Status information
◦ Counters



Straggler: a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation.
Cause: a bad disk, …
Resolution: near the end of the MapReduce operation, schedule backup executions of the remaining in-progress tasks; a task is marked complete when either the primary or the backup execution finishes.
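A toy Python sketch of the backup-task idea using threads: the same task is launched twice and whichever copy finishes first wins. The names and delays are illustrative; the real system launches backups across machines via the master.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait
import time

def run_task(task_id, copy):
    """Stand-in for a map/reduce task; the 'straggler' copy is artificially slow."""
    delay = 2.0 if copy == "primary-on-slow-machine" else 0.1
    time.sleep(delay)
    return f"{task_id} finished by {copy}"

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {
        pool.submit(run_task, "map-42", "primary-on-slow-machine"),
        pool.submit(run_task, "map-42", "backup"),
    }
    done, not_done = wait(futures, return_when=FIRST_COMPLETED)
    print(next(iter(done)).result())   # whichever execution completed first wins
    for f in not_done:                 # the loser is simply ignored/cancelled
        f.cancel()
```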



Partitioning
The output of each map task is partitioned into R pieces.
Default: hash(key) mod R
User-provided partitioning function
◦ E.g. hash(Hostname(url)) mod R, so that all URLs from the same host end up in the same partition (sketched below)
[Figure: outputs of the M map tasks hashed into R partitions; each reduce task reads one partition]
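A small Python sketch of the default and host-based partitioning functions; R and the helper names are illustrative assumptions (a production system would also use a stable hash rather than Python's per-process salted built-in hash).

```python
from urllib.parse import urlparse

R = 8  # number of reduce tasks / partitions (illustrative value)

def default_partition(key, r=R):
    """Default partitioner: hash(key) mod R."""
    return hash(key) % r

def host_partition(url, r=R):
    """User-provided partitioner: hash(Hostname(url)) mod R,
    so all URLs from one host land in the same partition/output file."""
    return hash(urlparse(url).netloc) % r

print(host_partition("http://example.com/a"))
print(host_partition("http://example.com/b"))   # same partition as /a
```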


Ordering guarantees
Guarantee: within a given partition, the intermediate key/value pairs are processed in increasing key order.
MapReduce implementation of distributed sort:
◦ Map: (key, value) → (key for sort, value)
◦ Reduce: emit unchanged.






Combiner function
◦ E.g. word count: each map task produces many <the, 1> pairs.
◦ Combine them once on the map side, before they are sent to the reduce task, to save network bandwidth.
◦ Executed on the machine performing the map task.
◦ Typically the same code as the reduce function; the difference is that combiner output goes to an intermediate file bound for a reduce task, while reduce output goes to the final output file.
◦ Example: counting words (see the sketch below).
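A minimal Python sketch of a word-count combiner running on the map side before the shuffle; the function names are assumptions for illustration.

```python
from collections import Counter

def map_fn(key, contents):
    """Map: emit (word, 1) for every word occurrence."""
    for word in contents.split():
        yield (word, 1)

def combine(pairs):
    """Combiner: partially merge counts on the map machine before the shuffle.
    Same logic as the reduce function, but its output goes to an intermediate
    file bound for a reduce task rather than to the final output."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

pairs = list(map_fn("doc1", "the cat sat on the mat the end"))
print(len(pairs))            # 8 pairs would cross the network without a combiner
print(combine(pairs))        # [('the', 3), ('cat', 1), ...]
print(len(combine(pairs)))   # 6 pairs after local combining
```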

Skipping Bad Records
◦ Bugs in user code sometimes make tasks crash deterministically on certain records, and it can be acceptable to ignore those records (e.g. when doing statistical analysis on a huge data set).
◦ An optional mode of execution.
◦ Each worker installs a signal handler to catch segmentation violations and bus errors; the worker reports the offending record's sequence number to the master, and records that fail repeatedly are skipped on re-execution.
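A simplified Python sketch of the record-skipping mechanism. The real library catches SIGSEGV/SIGBUS in a signal handler inside the worker process; here an ordinary exception stands in for the crash, and a plain set stands in for the master's bookkeeping, both purely for illustration.

```python
def process_with_skipping(records, user_fn, skip_set, failure_counts):
    """Worker loop: apply user_fn to each record, skipping records the master
    has flagged; count failures so repeat offenders get skipped next time."""
    results = []
    for seq_no, record in enumerate(records):
        if seq_no in skip_set:                    # master said: skip this record
            continue
        try:
            results.append(user_fn(record))
        except Exception:                         # stand-in for the SIGSEGV/SIGBUS handler
            failure_counts[seq_no] = failure_counts.get(seq_no, 0) + 1
            if failure_counts[seq_no] > 1:        # failed more than once: skip on re-execution
                skip_set.add(seq_no)
            raise                                 # the task still dies and is re-scheduled
    return results

def buggy_fn(record):
    if record == "bad":
        raise ValueError("deterministic crash on this record")
    return record.upper()

records, skip, fails = ["a", "bad", "b"], set(), {}
out = []
for attempt in range(3):                          # the master re-executes the failed task
    try:
        out = process_with_skipping(records, buggy_fn, skip, fails)
        break
    except Exception:
        pass
print(skip, out)   # {1} ['A', 'B']: the bad record is skipped on the third attempt
```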

Status Information
◦ The master runs an internal HTTP server and
exports a set of status pages
◦ Monitor progress of computation: how many
tasks have been completed, how many are in
progress, bytes of input, bytes of intermediate
data, bytes of output, processing rates, etc. The
pages also contain links to the standard error and
standard output files generated by each task.
◦ In addition, the top-level status page shows
which workers have failed, and which map and
reduce tasks they were processing when they
failed.


Tests on grep and sort
Cluster characteristics:
◦ 1800 machines (!)
◦ Intel Xeon, 2 x 2 MB cache, Hyper-Threading
◦ 2-4 GB memory
◦ 100 Mbps - 1 Gbps networking
◦ Local IDE disks + Google File System (GFS)




Grep
◦ 1 terabyte of input: 10^10 100-byte records
◦ Search for a rare three-character pattern (about 10^5 occurrences)
◦ Input split into 64 MB pieces: M = 15,000; R = 1 (output is small)
◦ Peak input rate of over 30 GB/s with 1764 workers
◦ Scan completes in < 1.5 minutes, after roughly 1 minute of startup overhead:
 Propagation of the program to the workers
 GFS interaction: opening the 1000 input files
 Gathering the information needed for the locality optimization







Sort
◦ 1 terabyte of input: 10^10 100-byte records
◦ Map: extract a 10-byte sorting key and emit <key, value> = <10-byte key, 100-byte record>
◦ Reduce: identity
◦ 2-way replication of the output (for redundancy, typical for GFS)
◦ M = 15,000; R = 4000
◦ A pre-pass MapReduce may be needed to compute the distribution of keys (to choose partition boundaries); see the sketch below
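A minimal Python sketch of the sort benchmark's map and reduce functions, together with the within-partition ordering the framework guarantees; the record layout and names are assumptions for illustration.

```python
import os

def map_fn(offset, record):
    """Map: extract the first 10 bytes as the sorting key, keep the 100-byte record as value."""
    return (record[:10], record)

def reduce_fn(key, values):
    """Reduce: identity; emit the pairs unchanged."""
    for value in values:
        yield (key, value)

# Toy stand-in for one of the R partitions: the framework guarantees each partition
# is processed in increasing key order, so with range partitioning the concatenation
# of the R output files is globally sorted.
records = [os.urandom(100) for _ in range(5)]
partition = sorted((map_fn(i, r) for i, r in enumerate(records)), key=lambda kv: kv[0])
sorted_keys = [key for key, _ in partition]
print(sorted_keys == sorted(sorted_keys))   # True: keys come out in increasing order
```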







◦ Input rate is lower than for grep: the sort map tasks spend about half their time and I/O bandwidth writing intermediate output to local disk
◦ Shuffle rate shows two humps: the R = 4000 reduce tasks run in two waves of roughly 1700 (2 x 1700 ≈ 4000), since each machine runs one reduce task at a time
◦ Final output is delayed because the reduce tasks must sort the intermediate data
◦ Input rate > shuffle and output rates: the locality optimization reads most input from local disk
◦ Shuffle rate > output rate: the output is written as two replicas
◦ Effect of backup tasks (disabling them makes the run noticeably slower)
◦ Effect of machine failures (the computation still completes with modest slowdown)
Conclusions
◦ Restricting the programming model makes it easy to parallelize and distribute computations and to make such computations fault-tolerant.
◦ Network bandwidth is a scarce resource. A number of optimizations in the system are therefore targeted at reducing the amount of data sent across the network: the locality optimization reads data from local disks, and writing a single copy of the intermediate data to local disk saves network bandwidth.
◦ Redundant execution can be used to reduce the impact of slow machines, and to handle machine failures and data loss.