Data-Intensive Scalable Computing with MapReduce
CS290N, Spring 2010
© Spinnaker Labs, Inc.

Overview
• What is MapReduce?
• Related technologies
  – GFS
  – BigTable
• Technology comparison
• MapReduce in a modern pipeline

Motivations for MapReduce
• Data processing: > 1 TB
• Massively parallel (hundreds or thousands of CPUs)
• Must be easy to use

How MapReduce is Structured
• Functional programming meets distributed computing
• A batch data processing system
• Factors out many reliability concerns from application logic

MapReduce Provides:
• Automatic parallelization & distribution
• Fault tolerance
• Status and monitoring tools
• A clean abstraction for programmers

MapReduce: Insight
• "Consider the problem of counting the number of occurrences of each word in a large collection of documents"
• How would you do it in parallel?
• One possible solution:
  – Divide the collection of documents among the class.
  – Each person counts the individual words in a document.
  – Each person repeats this for their assigned quota of documents. (Done without communication.)
  – Sum up the counts from all the documents to give the final answer.

MapReduce Programming Model
• Inspired by the map and reduce operations commonly used in functional programming languages such as Lisp.
• Users implement an interface of two primary methods:
  – 1. Map: (key1, val1) → (key2, val2)
  – 2. Reduce: (key2, [val2]) → [val3]

Map Operation
• Map, a pure function written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs.
  – e.g., (doc-id, doc-content)
• By analogy to SQL, map can be viewed as the group-by clause of an aggregate query.
• map(in_key, in_value) -> list of (out_key, intermediate_value)

Reduce Operation
• On completion of the map phase, all the intermediate values for a given output key are combined into a list and given to a reducer.
• Can be viewed as an aggregate function (e.g., average) computed over all the rows with the same group-by attribute.
• reduce(out_key, list of intermediate_value) -> list of out_value

Pseudo-code

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

Example vs. Actual Source Code
• The example is written in pseudo-code.
• The actual implementation is in C++, using a MapReduce library.
• Bindings for Python and Java exist via interfaces.
• True code is somewhat more involved (it defines how the input key/values are divided up and accessed, etc.)

MapReduce: Extensions and Similar Systems
• Pig (Yahoo!)
• Hadoop (Apache)
• DryadLINQ (Microsoft)

Large-Scale Systems Architecture Using MapReduce
• User application
• MapReduce
• BigTable / distributed file system (GFS)

MapReduce: Execution Overview
• The master server distributes M map tasks to mappers and monitors their progress.
• Each map worker reads its allocated data and saves the map results in a local buffer.
• The shuffle phase assigns reducers to these buffers, which are remotely read and processed by the reducers.
• Reducers output the result to stable storage.
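The word-count pseudo-code above can be simulated in a few lines of ordinary Python. This is only an illustrative single-process sketch, not the distributed library: the real runtime shards the input, runs map and reduce tasks on many machines, and performs the shuffle over the network, whereas here a dictionary stands in for the shuffle. The names map_fn, reduce_fn, and map_reduce are invented for this sketch, not part of any MapReduce API.

```python
from collections import defaultdict

def map_fn(doc_name, doc_contents):
    # Map: emit (word, 1) for every word in the document.
    for word in doc_contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum all partial counts for one word.
    yield (word, sum(counts))

def map_reduce(documents):
    # Shuffle: group intermediate values by key, as the runtime would.
    intermediate = defaultdict(list)
    for doc_name, contents in documents.items():
        for key, value in map_fn(doc_name, contents):
            intermediate[key].append(value)
    # Reduce phase: one reduce_fn call per distinct key.
    results = {}
    for key, values in intermediate.items():
        for out_key, out_value in reduce_fn(key, values):
            results[out_key] = out_value
    return results

docs = {"d1": "the quick brown fox", "d2": "the lazy dog"}
print(map_reduce(docs))  # word -> total count across all documents
```

Because map_fn is a pure function of one (key, value) pair, the outer loop over documents could be split across any number of workers without changing the result, which is exactly the property the library exploits.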
MapReduce: Example

MapReduce in Parallel: Example

MapReduce: Execution Details
• Input reader
  – Divides the input into splits; assigns each split to a Map processor
• Map
  – Applies the Map function to each record in the split
  – Each Map function returns a list of (key, value) pairs
• Shuffle/Partition and Sort
  – Shuffle distributes sorting & aggregation to many reducers
  – All records for key k are directed to the same reduce processor
  – Sort groups the same keys together and prepares for aggregation
• Reduce
  – Applies the Reduce function to each key
  – The result of the Reduce function is a list of (key, value) pairs
© 2010, Jamie Callan

MapReduce in One Picture
(Tom White, Hadoop: The Definitive Guide)
© 2010, Le Zhao

MapReduce: Runtime Environment
• Partitioning the input data
• Scheduling the program across a cluster of machines, locality optimization, and load balancing
• Dealing with machine failure
• Managing inter-machine communication

MapReduce: Fault Tolerance
• Handled via re-execution of tasks; task completion is committed through the master.
• What happens if a mapper fails?
  – Re-execute completed + in-progress map tasks
• What happens if a reducer fails?
  – Re-execute in-progress reduce tasks
• What happens if the master fails?
  – Potential trouble!

MapReduce Refinements: Locality Optimization
• Leverage GFS to schedule a map task on a machine that holds a replica of the corresponding input data.
• Thousands of machines read input at local disk speed.
• Without this, rack switches limit the read rate.

MapReduce Refinements: Redundant Execution
• Slow workers are a source of bottlenecks and may delay completion time.
• Near the end of a phase, spawn backup copies of the remaining tasks; whichever copy finishes first wins.
• Effectively utilizes spare computing power, substantially reducing job completion time.

MapReduce Refinements: Skipping Bad Records
• Map/Reduce functions sometimes fail for particular inputs.
• Fixing the bug might not be possible (e.g., in third-party libraries).
• On error:
  – The worker sends a signal to the master.
  – If multiple errors occur on the same record, the record is skipped.

MapReduce Refinements: Miscellaneous
• Combiner function at the mapper
• Sorting guarantees within each reduce partition
• Local execution for debugging/testing
• User-defined counters

Combining Phase
• Runs on mapper nodes after the map phase
• A "mini-reduce," run only on local map output
• Used to save bandwidth before sending data to the full reducer
• The reducer can be used as the combiner if it is commutative & associative

Combiner, Graphically
(Diagram: on one mapper machine, the combiner replaces the raw map output with locally aggregated pairs before they are sent to the reducers.)

MapReduce: Some More Applications
• Distributed grep
• Count of URL access frequency
• Clustering (k-means)
• Graph algorithms
• Indexing systems

MapReduce Programs in Google's Source Tree

More Applications with MapReduce

MapReduce Use Case (1) – Map Only
• Data-distributive tasks – Map only
  – E.g., classify individual documents
• Map does everything
  – Input: (docno, doc_content), …
  – Output: (docno, [class, class, …]), …
• No reduce

MapReduce Use Case (2) – Filtering and Accumulation
• Filtering & accumulation – Map and Reduce
• E.g., counting the total enrollments of two given classes
• Map selects records and outputs initial counts
  – In: (Jamie, 11741), (Tom, 11493), …
  – Out: (11741, 1), (11493, 1), …
• Shuffle/Partition by class_id
• Sort
  – In: (11741, 1), (11493, 1), (11741, 1), …
  – Out: (11493, 1), …, (11741, 1), (11741, 1), …
• Reduce accumulates counts
  – In: (11493, [1, 1, …]), (11741, [1, 1, …])
  – Sum and output: (11493, 16), (11741, 35)

MapReduce Use Case (3) – Database Join
• Problem: massive lookups
  – Given two large lists: (URL, ID) and (URL, doc_content) pairs
  – Produce (ID, doc_content)
• Solution: database join
• Input stream: both the (URL, ID) and (URL, doc_content) lists
  – (http://del.icio.us/post, 0), (http://digg.com/submit, 1), …
  – (http://del.icio.us/post, <html0>), (http://digg.com/submit, <html1>), …
• Map simply passes the input along
• Shuffle and Sort on URL (groups the ID & doc_content for the same URL together)
  – Out: (http://del.icio.us/post, 0), (http://del.icio.us/post, <html0>), (http://digg.com/submit, <html1>), (http://digg.com/submit, 1), …
• Reduce outputs a result stream of (ID, doc_content) pairs
  – In: (http://del.icio.us/post, [0, <html0>]), (http://digg.com/submit, [<html1>, 1]), …
  – Out: (0, <html0>), (1, <html1>), …

MapReduce Use Case (4) – Secondary Sort
• Problem: sorting on values
• E.g., reverse graph edge directions & output in node order
  – Input: adjacency list of a graph (3 nodes and 4 edges): (1, [2, 3]), (3, [1, 2])
  – Output: adjacency list of the reversed graph, in node order: (1, [3]), (2, [1, 3]), (3, [1])
• Note: the node_ids in the output values are also sorted. But Hadoop only sorts on keys!
• Solution: secondary sort
• Map
  – In: (3, [1, 2]), (1, [2, 3])
  – Intermediate: (1, [3]), (2, [3]), (2, [1]), (3, [1]) (reverse edge direction)
  – Out: (<1, 3>, [3]), (<2, 3>, [3]), (<2, 1>, [1]), (<3, 1>, [1])
  – Copy node_ids from value to key.

MapReduce Use Case (4) – Secondary Sort (ctd.)
• Shuffle on Key.field1, and sort on the whole key (both fields)
  – In: (<1, 3>, [3]), (<2, 3>, [3]), (<2, 1>, [1]), (<3, 1>, [1])
  – Out: (<1, 3>, [3]), (<2, 1>, [1]), (<2, 3>, [3]), (<3, 1>, [1])
• Grouping comparator
  – Merge according to part of the key
  – Out: (<1, 3>, [3]), (<2, 1>, [1, 3]), (<3, 1>, [1]) – this will be the reducer's input
• Reduce
  – Merge & output: (1, [3]), (2, [1, 3]), (3, [1])

Using MapReduce to Construct Indexes: Preliminaries
• Construction of binary inverted lists
  – Input: documents: (docid, [term, term, ..]), (docid, [term, ..]), ..
  – Output: (term, [docid, docid, …])
    » E.g., (apple, [1, 23, 49, 127, …])
  – Binary inverted lists fit on a slide more easily
  – Everything also applies to frequency and positional inverted lists
• A document id is an internal document id, e.g., a unique integer
  – Not an external document id such as a URL
• MapReduce elements used: combiner, secondary sort, complex keys, sorting on key fields

Using MapReduce to Construct Indexes: A Simple Approach
• Each Map task is a document parser
  – Input: a stream of documents
  – Output: a stream of (term, docid) tuples
    » (long, 1) (ago, 1) (and, 1) … (once, 2) (upon, 2) …
• Shuffle sorts the tuples by key and routes them to Reducers
• Reducers convert streams of keys into streams of inverted lists
  – Input: (long, 1) (long, 127) (long, 49) (long, 23) …
  – The reducer sorts the values for a key and builds an inverted list
    » The longest inverted list must fit in memory
  – Output: (long, [df:492, docids:1, 23, 49, 127, …])

Inverted Index: Data Flow
• Input documents:
  – Foo: "This page contains so much text"
  – Bar: "My page contains text too"
• Foo map output: (contains: Foo), (much: Foo), (page: Foo), (so: Foo), (text: Foo), (This: Foo)
• Bar map output: (contains: Bar), (My: Bar), (page: Bar), (text: Bar), (too: Bar)
• Reduced output: (contains: Foo, Bar), (much: Foo), (My: Bar), (page: Foo, Bar), (so: Foo), (text: Foo, Bar), (This: Foo), (too: Bar)

Using MapReduce to Construct Indexes: A Simple Approach
• A more succinct representation of the previous algorithm
  – docid: a unique integer
  – t: a term, e.g., "apple"
  – ilist: a complete inverted list
• Map: (docid1, content1) → (t1, docid1) (t2, docid1) …
• Shuffle by t
• Sort by t: (t5, docid1) (t4, docid3) … → (t4, docid3) (t4, docid1) (t5, docid1) …
• Reduce: (t4, [docid3 docid1 …]) → (t, ilist)
• But: a) this is inefficient, b) docids are sorted in reducers, and c) it assumes the ilist of a word fits in memory

Using MapReduce to Construct Indexes: Using Combine
• Map: (docid1, content1) → (t1, ilist1,1) (t2, ilist2,1) (t3, ilist3,1) …
  – Each output inverted list covers just one document
• Combine: sort by t; (t1, [ilist1,2 ilist1,3 ilist1,1 …]) → (t1, ilist1,27)
  – Each output inverted list covers a sequence of documents
• Shuffle by t
• Sort by t: (t4, ilist4,1) (t5, ilist5,3) … → (t4, ilist4,2) (t4, ilist4,4) (t4, ilist4,1) …
• Reduce: (t7, [ilist7,2, ilist7,1, ilist7,4, …]) → (t7, ilistfinal)
• ilisti,j: the j'th inverted list fragment for term i

Using MapReduce to Construct Indexes
(Diagram: documents → parser/indexer processors (Map/Combine) → inverted list fragments → Shuffle/Sort → mergers partitioned by term range (A-F, G-P, Q-Z) (Reduce) → inverted lists.)

Using MapReduce to Construct Partitioned Indexes
• Map: (docid1, content1) → ([p, t1], ilist1,1)
• Combine to sort and group values: ([p, t1], [ilist1,2 ilist1,3 ilist1,1 …]) → ([p, t1], ilist1,27)
• Shuffle by p
• Sort values by [p, t]
• Reduce: ([p, t7], [ilist7,2, ilist7,1, ilist7,4, …]) → ([p, t7], ilistfinal)
• p: partition (shard) id

Using MapReduce to Construct Indexes: Secondary Sort
• So far, we have assumed that Reduce can sort values in memory…
• …but what if there are too many to fit in memory?
• Map: (docid1, content1) → ([t1, fd1,1], ilist1,1)
• Combine to sort and group values
• Shuffle by t
• Sort by [t, fd], then group by t (secondary sort): ([t7, fd7,2], ilist7,2), ([t7, fd7,1], ilist7,1) … → (t7, [ilist7,1, ilist7,2, …])
• Reduce: (t7, [ilist7,1, ilist7,2, …]) → (t7, ilistfinal)
  – Values arrive in order, so Reduce can stream its output
• fdi,j is the first docid in ilisti,j

Using MapReduce to Construct Indexes: Putting It All Together
• Map: (docid1, content1) → ([p, t1, fd1,1], ilist1,1)
• Combine to sort and group values: ([p, t1, fd1,1], [ilist1,2 ilist1,3 ilist1,1 …]) → ([p, t1, fd1,27], ilist1,27)
• Shuffle by p
• Secondary sort by [(p, t), fd]: ([p, t7], [ilist7,2, ilist7,1, ilist7,4, …]) → ([p, t7], [ilist7,1, ilist7,2, ilist7,4, …])
• Reduce: ([p, t7], [ilist7,1, ilist7,2, ilist7,4, …]) → ([p, t7], ilistfinal)

Using MapReduce to Construct Indexes
(Diagram: the same pipeline as before, but with shard mergers in the Reduce stage, each producing the inverted lists for one shard.)

MapReduce: PageRank
• PageRank models the behavior of a "random surfer":
  PR(x) = (1 - d) + d * Σ_{i=1..n} PR(t_i) / C(t_i)
  where t_1, .., t_n are the pages linking to x, C(t) is the out-degree of t, d is the damping factor, and (1 - d) accounts for a random jump.
• The random surfer keeps clicking on successive links at random, not taking content into consideration.
• Each page distributes its PageRank equally among all pages it links to.
• The damping factor models the surfer "getting bored" and typing an arbitrary URL.

Computing PageRank
• Start with seed PageRank values.
• Each page distributes PageRank "credit" to all pages it points to.
• Each target page adds up the "credit" from multiple inbound links to compute PR_{i+1}.

PageRank: Key Insights
• The effect of each iteration is local.
• The (i+1)th iteration depends only on the ith iteration.
• At iteration i, the PageRank of individual nodes can be computed independently.

PageRank Using MapReduce
• Use a sparse matrix representation (M).
• Map each row of M to a list of PageRank "credit" to assign to its outlink neighbors.
• These credit scores are reduced to a single PageRank value for each page by aggregating over them.

PageRank Using MapReduce
• Map: distribute PageRank "credit" to link targets.
• Reduce: gather up PageRank "credit" from multiple sources to compute the new PageRank value.
• Iterate until convergence.
(Source of image: Lin 2008)

Phase 1: Process HTML
• The Map task takes (URL, page-content) pairs and maps them to (URL, (PRinit, list-of-urls))
  – PRinit is the "seed" PageRank for the URL
  – list-of-urls contains all pages pointed to by the URL
• The Reduce task is just the identity function

Phase 2: PageRank Distribution
• The Reduce task gets (URL, url_list) and many (URL, val) values
  – Sum the vals and fix up with d to get the new PR
  – Emit (URL, (new_rank, url_list))
• Check for convergence using a non-parallel component

PageRank Calculation: Preliminaries
• One PageRank iteration:
  – Input: (id1, [score1(t), out11, out12, ..]), (id2, [score2(t), out21, out22, ..]), ..
  – Output: (id1, [score1(t+1), out11, out12, ..]), (id2, [score2(t+1), out21, out22, ..]), ..
• MapReduce elements used: score distribution and accumulation, database join, side-effect files

PageRank: Score Distribution and Accumulation
• Map
  – In: (id1, [score1(t), out11, out12, ..]), (id2, [score2(t), out21, out22, ..]), ..
  – Out: (out11, score1(t)/n1), (out12, score1(t)/n1), .., (out21, score2(t)/n2), ..
• Shuffle & Sort by node_id
  – In: (id2, score1), (id1, score2), (id1, score1), ..
  – Out: (id1, score1), (id1, score2), .., (id2, score1), ..
• Reduce
  – In: (id1, [score1, score2, ..]), (id2, [score1, ..]), ..
  – Out: (id1, score1(t+1)), (id2, score2(t+1)), ..
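The score distribution and accumulation step above can be sketched as a single-process Python loop. This is an illustrative sketch under stated assumptions, not the distributed implementation: pagerank_iteration (a name invented here) plays both the Map role, where each node distributes score/out-degree to its outlink targets, and the Reduce role, where each node sums its incoming credit and applies the damping factor d. It assumes every node has at least one outlink, i.e., no dangling nodes.

```python
def pagerank_iteration(graph, scores, d=0.85):
    # graph: node -> list of outlink targets (assumed non-empty: no dangling nodes)
    # scores: node -> current PageRank value
    # Map: each node distributes scores[node] / out-degree to each outlink target.
    intermediate = {}
    for node, outlinks in graph.items():
        share = scores[node] / len(outlinks)
        for target in outlinks:
            intermediate.setdefault(target, []).append(share)
    # Reduce: new score = (1 - d) + d * (sum of incoming credit),
    # matching PR(x) = (1 - d) + d * sum(PR(t_i) / C(t_i)).
    return {node: (1 - d) + d * sum(intermediate.get(node, []))
            for node in graph}

# Toy 3-node graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = {node: 1.0 for node in graph}
for _ in range(20):  # iterate until (approximately) converged
    scores = pagerank_iteration(graph, scores)
```

Note that with this (1 - d) formulation and no dangling nodes, the total score over all nodes is preserved from one iteration to the next, which is a convenient sanity check when debugging the distributed version.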
PageRank: Database Join to Associate Outlinks with Scores
• Map
  – In & Out: (id1, score1(t+1)), (id2, score2(t+1)), .., (id1, [out11, out12, ..]), (id2, [out21, out22, ..]), ..
• Shuffle & Sort by node_id
  – Out: (id1, score1(t+1)), (id1, [out11, out12, ..]), (id2, [out21, out22, ..]), (id2, score2(t+1)), ..
• Reduce
  – In: (id1, [score1(t+1), out11, out12, ..]), (id2, [out21, out22, .., score2(t+1)]), ..
  – Out: (id1, [score1(t+1), out11, out12, ..]), (id2, [score2(t+1), out21, out22, ..]), ..

PageRank: Side-Effect Files for Dangling Nodes
• Dangling nodes
  – Nodes with no outlinks (observed but not crawled URLs)
  – Their score has no outlet
    » It needs to be distributed evenly to all graph nodes
• Map for dangling nodes:
  – In: .., (id3, [score3]), ..
  – Out: .., ("*", 0.85×score3), ..
• Reduce
  – In: .., ("*", [score1, score2, ..]), ..
  – Out: .., everything else, ..
  – Output to a side-effect file: ("*", score), fed to the Mapper of the next iteration

Manipulating Large Data
• Do everything in Hadoop (and HDFS)
  – Make sure every step is parallelized!
  – Any serial step breaks your design
• E.g., storing the URL list for a Web graph
  – Each node in the Web graph has an id
  – [URL1, URL2, …] with the line number as the id – a bottleneck
  – [(id1, URL1), (id2, URL2), …] with explicit ids – parallelizable

Hadoop-Based Tools
• For developing in Java: the NetBeans plugin
  – http://www.hadoopstudio.org/docs.html
• Pig Latin: a SQL-like high-level data processing script language
• Hive: data warehouse, SQL
• Cascading: data processing
• Mahout: machine learning algorithms on Hadoop
• HBase: a distributed data store presented as a large table
• More
  – http://hadoop.apache.org/
  – http://en.wikipedia.org/wiki/Hadoop
  – Many other toolkits: Nutch, Cloud9, Ivory

Get Your Hands Dirty
• Hadoop virtual machine
  – http://www.cloudera.com/developers/downloads/virtualmachine/
    » This runs Hadoop 0.20
  – An earlier Hadoop 0.18.0 version is here: http://code.google.com/edu/parallel/tools/hadoopvm/index.html
• Amazon EC2
• Various other Hadoop clusters around
• The NetBeans plugin simulates Hadoop
  – The workflow view works on Windows
  – Local running & debugging work on MacOS and Linux
  – http://www.hadoopstudio.org/docs.html

Conclusions
• Why large scale
• MapReduce advantages
• Hadoop uses
• Use cases
  – Map only: for totally distributive computation
  – Map+Reduce: for filtering & aggregation
  – Database join: for massive dictionary lookups
  – Secondary sort: for sorting on values
  – Inverted indexing: combiner, complex keys
  – PageRank: side-effect files
• Large data

For More Information
• L. A. Barroso, J. Dean, and U. Hölzle. "Web Search for a Planet: The Google Cluster Architecture." IEEE Micro, 2003.
• J. Dean and S. Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137-150. 2004.
• S. Ghemawat, H. Gobioff, and S.-T. Leung.
"The Google File System." Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP-03), pages 29-43. 2003.
• I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes. Morgan Kaufmann. 1999.
• J. Zobel and A. Moffat. "Inverted Files for Text Search Engines." ACM Computing Surveys, 38(2). 2006.
• "Map/Reduce Tutorial." http://hadoop.apache.org/common/docs/current/mapred_tutorial.html. Fetched January 21, 2010.
• Tom White. Hadoop: The Definitive Guide. O'Reilly Media. June 5, 2009.
• J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce. Book draft, February 7, 2010.