Siddharth Mehta
Pursuing Masters in Computer Science (Fall 2008)
Interests: Systems, Web

MapReduce is a programming model and an associated implementation (library) for processing and generating large data sets on large clusters. It is a new abstraction that lets users express simple computations while hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library.

Large-Scale Data Processing
◦ Want to use 1000s of CPUs
◦ But don't want the hassle of managing things

MapReduce provides
◦ Automatic parallelization & distribution
◦ Fault tolerance
◦ I/O scheduling
◦ Monitoring & status updates

The MapReduce programming model has been successfully used at Google for many different purposes.
◦ First, the model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing.
◦ Second, a large variety of problems are easily expressible as MapReduce computations. For example, MapReduce is used to generate the data for Google's production web search service, for sorting, for data mining, for machine learning, and for many other systems.
◦ Third, Google developed an implementation of MapReduce that scales to clusters comprising thousands of machines. The implementation makes efficient use of these machine resources and is therefore suitable for many of the large computational problems encountered at Google.

Example: word count (a runnable single-machine sketch of this pipeline appears after the partitioning and ordering discussion below)

   map(key=url, val=contents):
      For each word w in contents, emit (w, "1")

   reduce(key=word, values=uniq_counts):
      Sum all "1"s in values list
      Emit result "(word, sum)"

   Input:
      see bob throw
      see spot run
   Map output:
      (see, 1) (bob, 1) (throw, 1)
      (see, 1) (spot, 1) (run, 1)
   Reduce output:
      (bob, 1) (run, 1) (see, 2) (spot, 1) (throw, 1)

Distributed grep:
◦ Map: (key, whole doc / a line) → (the matched line, key)
◦ Reduce: identity function
Count of URL Access Frequency:
◦ Map: logs of web page requests → (URL, 1)
◦ Reduce: sums the counts → (URL, total count)
Reverse Web-Link Graph:
◦ Map: (source, target) → (target, source)
◦ Reduce: (target, list(source)) → (target, list(source))
Inverted Index:
◦ Map: (docID, document) → (word, docID)
◦ Reduce: (word, list(docID)) → (word, sorted list(docID))

Google clusters are comprised of commodity PCs:
◦ Intel Xeon, 2 x 2 GHz, Hyper-Threading
◦ 2-4 GB memory
◦ 100 Mbps – 1 Gbps network
◦ Local IDE disks + Google File System (GFS)
◦ Jobs are submitted to a scheduling system
[Figure: execution overview — M map tasks, R reduce tasks]

Fault Tolerance – in a word: redo
◦ Master pings workers and re-schedules the tasks of failed workers.
◦ Note: completed map tasks are re-executed on worker failure because their output is stored on the failed machine's local disk (completed reduce tasks need not be, since their output is in the global file system).
◦ Master failure: redo (the current implementation aborts the computation and lets the client retry).
◦ Semantics in the presence of failures: with deterministic map/reduce functions, the system produces the same output as would have been produced by a non-faulting sequential execution of the entire program. It relies on atomic commits of map and reduce task outputs to achieve this property.

Refinements
◦ Partitioning
◦ Ordering guarantees
◦ Combiner function
◦ Side effects
◦ Skipping bad records
◦ Local execution
◦ Status information
◦ Counters

Straggler: a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation.
◦ Cause: bad disk, …
◦ Resolution: schedule backup executions of the remaining in-progress tasks near the end of the MapReduce operation.

Partitioning
◦ Partition the output of a map task into R pieces.
◦ Default: hash(key) mod R
◦ User provided, e.g. hash(Hostname(url)) mod R
[Figure: each of the M map tasks splits its output into R partitions; one partition goes to each reduce task]

Ordering guarantee: within a given partition, the intermediate key/value pairs are processed in increasing key order.
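The following is a minimal, single-machine Python sketch of the word-count pipeline referenced above, folding in the default partitioning function hash(key) mod R and the per-partition ordering guarantee. It is an illustration only: the names map_fn, reduce_fn, run_wordcount, and R are invented here and are not part of the actual MapReduce library interface.

from collections import defaultdict

def map_fn(key, contents):
    # (key = document name, val = contents) -> intermediate (word, 1) pairs
    return [(w, 1) for w in contents.split()]

def reduce_fn(word, counts):
    # Sum all counts emitted for one word.
    return word, sum(counts)

def run_wordcount(documents, R=3):
    # Map phase: every document yields intermediate (word, 1) pairs,
    # routed with the default partitioning function hash(key) mod R.
    # (Python's built-in hash is salted per process; a real library
    # would use a stable hash of the key.)
    partitions = [defaultdict(list) for _ in range(R)]
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            partitions[hash(word) % R][word].append(count)

    # Reduce phase: within each partition, keys are visited in increasing
    # order (the ordering guarantee), so each partition's output is sorted.
    output = []
    for r, partition in enumerate(partitions):
        for word in sorted(partition):
            output.append((r, reduce_fn(word, partition[word])))
    return output

if __name__ == "__main__":
    docs = {"d1": "see bob throw", "d2": "see spot run"}
    for r, (word, total) in run_wordcount(docs):
        print(f"partition {r}: {word} {total}")

Running it on the two-line input from the example prints each word with its total count, grouped by partition and sorted within each partition; which words land in which partition depends only on the hash of the key.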
The ordering guarantee yields a simple MapReduce implementation of distributed sort:
◦ Map: (key, value) → (key for sort, value)
◦ Reduce: emit all pairs unchanged.

Combiner function
◦ E.g. word count: the map output contains many <the, 1> pairs.
◦ Combine them once before the reduce task to save network bandwidth (see the small sketch after the conclusions at the end).
◦ Executed on the machine performing the map task.
◦ Typically the same code as the reduce function.
◦ Output goes to an intermediate file (reduce output goes to the final output file).
◦ Example: count words.

Skipping Bad Records
◦ Certain records cause user code to crash deterministically; sometimes it is acceptable to ignore them.
◦ This is an optional mode of execution.
◦ Each worker installs a signal handler to catch segmentation violations and bus errors, so the offending record can be reported to the master and skipped on re-execution.

Status Information
◦ The master runs an internal HTTP server and exports a set of status pages.
◦ They show the progress of the computation: how many tasks have been completed, how many are in progress, bytes of input, bytes of intermediate data, bytes of output, processing rates, etc.
◦ The pages also contain links to the standard error and standard output files generated by each task.
◦ In addition, the top-level status page shows which workers have failed, and which map and reduce tasks they were processing when they failed.

Performance: tests on grep and sort
Cluster characteristics
◦ 1800 machines (!)
◦ Intel Xeon, 2 x 2 GHz, Hyper-Threading
◦ 2-4 GB memory
◦ 100 Mbps – 1 Gbps network
◦ Local IDE disks + Google File System (GFS)

Grep
◦ 1 terabyte of input: 10^10 100-byte records.
◦ Search for a rare three-character pattern (~10^5 occurrences).
◦ Input split into 64 MB pieces: M = 15000, R = 1 (output is small).
◦ The scan rate peaks at over 30 GB/s with 1764 workers assigned.
◦ About 1 minute of startup overhead; the computation then completes in under 1.5 minutes.
◦ Startup overhead: propagation of the program to the workers, GFS delays while opening the set of 1000 input files, and computing the information needed for the locality optimization.

Sort
◦ 1 terabyte of input: 10^10 100-byte records.
◦ Map: extract a 10-byte sorting key and emit <key, value> = <10-byte key, 100-byte record>.
◦ Reduce: identity.
◦ Output is 2-way replicated (for redundancy, as is typical in GFS).
◦ M = 15000, R = 4000.
◦ May need a pre-pass MapReduce to compute the distribution of keys (to choose partition split-points).
◦ Input rate is lower than for grep.
◦ The shuffle rate shows two humps: the first batch of about 1700 reduce tasks (one per machine) is followed by a second wave (2 x 1700 ≈ 4000 = R).
◦ Final output is delayed because the reduce workers must first sort the intermediate data.
◦ Rates: input > shuffle, output (the locality optimization!).
◦ Rates: shuffle > output (the output writes two copies).
◦ Additional experiments: effect of backup tasks; effect of machine failures.

Conclusions
◦ Restricting the programming model makes it easy to parallelize and distribute computations and to make such computations fault-tolerant.
◦ Network bandwidth is a scarce resource. A number of optimizations in the system are therefore targeted at reducing the amount of data sent across the network: the locality optimization lets workers read data from local disks, and writing a single copy of the intermediate data to local disk saves network bandwidth.
◦ Redundant execution can be used to reduce the impact of slow machines, and to handle machine failures and data loss.
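To make the combiner optimization referenced earlier concrete, here is a tiny Python sketch, again with invented names (map_fn, combine): it counts how many intermediate pairs a map task would ship to the reducers with and without local combining.

from collections import Counter

def map_fn(contents):
    # Map emits one (word, 1) pair per word.
    return [(w, 1) for w in contents.split()]

def combine(pairs):
    # Combiner: the same merging logic as the reduce function, but run
    # locally on the map worker before any data crosses the network.
    merged = Counter()
    for word, count in pairs:
        merged[word] += count
    return list(merged.items())

doc = "the quick fox jumps over the lazy dog the end"
raw = map_fn(doc)        # 10 pairs, including three ("the", 1)
combined = combine(raw)  # 8 pairs; ("the", 3) is shipped once
print(len(raw), "pairs without combiner,", len(combined), "with combiner")

On this toy input the combiner shrinks the intermediate data from 10 pairs to 8; on real word-count inputs, where common words such as "the" appear millions of times, the savings in network bandwidth are far larger.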