Google MapReduce: Simplified Data Processing on Large Clusters
Jeff Dean, Sanjay Ghemawat (Google, Inc.)
Presented by Conroy Whitney, 4th-year CS – Web Development
http://labs.google.com/papers/mapreduce.html

Outline
- Motivation
- MapReduce concept: Map? Reduce?
- Example of a MapReduce problem: Reverse Web-Link Graph
- MapReduce cluster environment
- Lifecycle of a MapReduce operation
- Optimizations to the MapReduce process
- Conclusion
- MapReduce in Googlicious Action

Motivation: Large-Scale Data Processing
- Many tasks consist of processing lots of data to produce lots of other data
- Want to use hundreds or thousands of CPUs... but this needs to be easy!
- MapReduce provides:
  - User-defined functions
  - Automatic parallelization and distribution
  - Fault tolerance
  - I/O scheduling
  - Status and monitoring

Programming Concept
- Map: perform a function on each individual value in a data set to create a new list of values
  - Example: square x = x * x; map square [1,2,3,4,5] returns [1,4,9,16,25]
- Reduce: combine the values in a data set to create a single new value
  - Example: sum accumulates each element into a running total; reduce sum [1,2,3,4,5] returns 15 (the sum of the elements)

Example: Reverse Web-Link Graph
- Goal: find all pages that link to a certain page
- Map function: outputs a <target, source> pair for each link to a target URL found in a source page
  - For each page, we know what pages it links to
- Reduce function: concatenates the list of all source URLs associated with a given target URL and emits the pair <target, list(source)>
  - For a given web page, we now know what pages link to it

Additional Examples
- Distributed grep
- Distributed sort
- Term vector per host
- Web access log statistics
- Document clustering
- Machine learning
- Statistical machine translation

Performance Boasts
- Distributed grep
  - 10^10 100-byte records (~1 TB of data); a 3-character substring found in ~100K records
  - ~1800 workers
  - 150 seconds start to finish, including ~60 seconds of startup overhead
- Distributed sort
  - Same records and workers as above
  - 50 lines of MapReduce code
  - 891 seconds, including overhead
  - Best reported result at the time: 1057 seconds for the TeraSort benchmark

Typical Cluster
- 100s/1000s of dual-core machines with 2-4 GB of memory
- Limited internal bandwidth
- Temporary storage on local IDE disks
- Google File System (GFS): distributed file system for permanent/shared storage
- Job scheduling system: jobs are made up of tasks; the master scheduler assigns tasks to worker machines

Execution: Initialization
- Split the input file into 64 MB sections (GFS), read in parallel by multiple machines
- Fork the program onto multiple machines; one machine becomes the Master
- The Master assigns idle machines to either Map or Reduce tasks
- The Master coordinates data communication between the map and reduce machines

Map-Machine
- Reads the contents of its assigned portion of the input file
- Parses and prepares the data for input to the map function (e.g. reads <a> links out of HTML)
- Passes the data into the map function and saves the result in memory (e.g. <target, source> pairs)
- Periodically writes completed work to local disk
- Notifies the Master of this partially completed work (intermediate data)
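To make the map-side loop above concrete, here is a minimal single-machine Python sketch. It is not the paper's C++ library: the helper names (user_map, extract_links, run_map_task), the JSON spill format, and the fixed number of reduce partitions R are assumptions made for illustration. It emits the <target, source> pairs of the reverse web-link-graph example, partitions them by hash(key) mod R, and spills each partition to local disk, roughly as a Map-Machine does before reporting file locations to the Master.

```python
import json
import re
from collections import defaultdict

R = 4  # number of reduce partitions; an assumption for this sketch

def extract_links(html):
    # Very rough href extractor, good enough for the illustration only.
    return re.findall(r'href="([^"]+)"', html)

def user_map(source_url, html):
    # User-defined map for the reverse web-link graph:
    # emit a <target, source> pair for every link found in the source page.
    for target in extract_links(html):
        yield target, source_url

def run_map_task(input_split, task_id):
    # input_split: iterable of (source_url, html) pairs, i.e. the assigned
    # portion of the input that the Map-Machine reads and parses.
    buffers = defaultdict(list)  # intermediate <key, value> pairs held in memory
    for source_url, html in input_split:
        for key, value in user_map(source_url, html):
            buffers[hash(key) % R].append((key, value))
    # Spill each partition to local disk (the real worker does this periodically).
    local_files = []
    for r, pairs in buffers.items():
        path = f"map-{task_id}-part-{r}.json"
        with open(path, "w") as f:
            json.dump(pairs, f)
        local_files.append(path)
    return local_files  # locations the worker would report back to the Master
```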
Reduce-Machine
- Receives notification from the Master of partially completed work
- Retrieves the intermediate data from the Map-Machine via a remote read
- Sorts the intermediate data by key (e.g. by target page)
- Iterates over the intermediate data; for each unique key, sends the corresponding set of values through the reduce function
- Appends the result of the reduce function to the final output file (GFS)
- (A reduce-side sketch of these steps appears at the end of this deck.)

Worker Failure
- The Master pings workers periodically; any machine that does not respond is considered "dead"
- Both Map- and Reduce-Machines: any task in progress is reset and becomes eligible for rescheduling
- Map-Machines: completed tasks are also reset, because their results are stored on the failed machine's local disk; Reduce-Machines are notified to fetch the data from the new machine assigned to the task

Skipping Bad Records
- Bugs in user code (triggered by unexpected data) cause deterministic crashes
  - Optimally, fix the bug and re-run; not possible with third-party code
- When a worker dies, it sends a "last gasp" UDP packet to the Master describing the record it was processing
- If more than one worker dies on a specific record, the Master issues another re-execute command and tells the new worker to skip the problem record

Backup Tasks
- Some "stragglers" are not performing optimally:
  - Other processes demanding resources
  - Bad disks with correctable errors that slow I/O from 30 MB/s to 1 MB/s
  - CPU caches disabled (?!)
- Near the end of a phase, schedule redundant executions of the remaining in-progress tasks; the first copy to complete "wins"

Locality
- Network bandwidth is scarce
- Google File System (GFS)
  - Files stored in ~64 MB chunks
  - Redundant storage (usually on 3+ machines)
- Assign Map-Machines to work on portions of the input files they already have on local disk
  - Input is read at local-disk speed; without this, read speed is limited by the network switch

Conclusion
- Complete rewrite of the production indexing system
  - 20+ TB of data; indexing takes 5-10 MapReduce operations
  - Indexing code is simpler, smaller, and easier to understand
- Fault tolerance, distribution, and parallelization are hidden within the MapReduce library
- Avoids extra passes over the data
- Easy to change the indexing system, and indexing performance can be improved simply by adding new machines to the cluster
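As a wrap-up, here is the matching reduce-side sketch referenced in the Reduce-Machine slide. Again, this is a single-machine Python illustration rather than Google's implementation: user_reduce, run_reduce_task, and the file handling are assumptions. It reads the intermediate files produced by the map sketch, sorts and groups the pairs by target URL, runs each group through the reduce function, and appends <target, list(source)> lines to the output file.

```python
import json
from itertools import groupby
from operator import itemgetter

def user_reduce(target_url, sources):
    # User-defined reduce for the reverse web-link graph:
    # concatenate (here: deduplicate and sort) the source URLs for one target.
    return target_url, sorted(set(sources))

def run_reduce_task(intermediate_files, output_path):
    # Read the intermediate data spilled by the map tasks
    # (a remote read from the Map-Machines' local disks in the real system).
    pairs = []
    for path in intermediate_files:
        with open(path) as f:
            pairs.extend(tuple(p) for p in json.load(f))
    pairs.sort(key=itemgetter(0))  # sort intermediate data by key (target page)
    # Group by unique key, send each group through the reduce function,
    # and append the result to the final output file (GFS in the real system).
    with open(output_path, "a") as out:
        for key, group in groupby(pairs, key=itemgetter(0)):
            target, sources = user_reduce(key, [value for _, value in group])
            out.write(f"{target}\t{','.join(sources)}\n")
```

Chaining run_map_task and run_reduce_task over a handful of example pages reproduces the reverse web-link graph end to end; the real system differs mainly in that the Master schedules many such tasks in parallel, the intermediate data moves over the network, and the output lands in GFS.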