Data-Intensive Scalable Computing with MapReduce
CS290N, Spring 2010
Overview
• What is MapReduce?
• Related technologies
– GFS
– BigTable
• Technology Comparison
• MapReduce in a modern pipeline
Motivations for MapReduce
• Data processing: > 1 TB
• Massively parallel (hundreds or thousands of CPUs)
• Must be easy to use
How MapReduce is Structured
• Functional programming meets distributed computing
• A batch data processing system
• Factors out many reliability concerns from application logic
MapReduce Provides:
• Automatic parallelization & distribution
• Fault-tolerance
• Status and monitoring tools
• A clean abstraction for programmers
MapReduce: Insight
• "Consider the problem of counting the number of occurrences of each
word in a large collection of documents."
• How would you do it in parallel?
One possible solution:
• Divide the collection of documents among the class.
• Each person counts the occurrences of each individual word in a
document, and repeats for their assigned quota of documents.
(Done without communication.)
• Sum up the counts from all the documents to give the final answer.
MapReduce Programming Model
• Inspired by the map and reduce operations commonly used in
functional programming languages like Lisp.
• Users implement interface of two primary methods:
– 1. Map: (key1, val1) → (key2, val2)
– 2. Reduce: (key2, [val2]) → [val3]
Map operation
• Map, a pure function written by the user, takes an input key/value
pair and produces a set of intermediate key/value pairs.
– e.g. (doc-id, doc-content)
• By analogy to SQL, map can be viewed as the group-by clause of an
aggregate query.
map (in_key, in_value) -> (out_key, intermediate_value) list
Reduce operation
• On completion of the map phase, all the intermediate values for a
given output key are combined into a list and given to a reducer.
• By the same SQL analogy, reduce is the aggregate function (e.g.,
average) computed over all the rows with the same group-by attribute.
reduce (out_key, intermediate_value list) -> out_value list
Pseudo-code
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
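For concreteness, here is a minimal single-process Python sketch of the same
job. The run_job driver below is purely illustrative: it simulates the
map/shuffle/reduce contract in memory and stands in for the real distributed
runtime.

from collections import defaultdict

def map_fn(input_key, input_value):
    # input_key: document name; input_value: document contents
    for word in input_value.split():
        yield (word, 1)

def reduce_fn(output_key, intermediate_values):
    # output_key: a word; intermediate_values: its counts
    yield (output_key, sum(intermediate_values))

def run_job(documents):
    # "Shuffle": group all intermediate values by key
    groups = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            groups[key].append(value)
    # Reduce each key group
    return dict(kv for key in sorted(groups)
                for kv in reduce_fn(key, groups[key]))

print(run_job({"doc1": "a rose is a rose", "doc2": "a fox"}))
# {'a': 3, 'fox': 1, 'is': 1, 'rose': 2}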
Example vs. Actual Source Code
• Example is written in pseudo-code
• Actual implementation is in C++, using a MapReduce library
• Bindings for Python and Java exist via interfaces
• True code is somewhat more involved (it defines how the input
key/values are divided up and accessed, etc.)
MapReduce: Extensions and similar apps
• Pig (Yahoo!)
• Hadoop (Apache)
• DryadLINQ (Microsoft)
Large Scale Systems Architecture using MapReduce
User App
   ↓
MapReduce
   ↓
BigTable / Distributed File System (GFS)
MapReduce: Execution overview
• The master server distributes M map tasks to mappers and monitors
their progress.
• Each map worker reads its allocated input data and saves the map
results in a local buffer.
• The shuffle phase assigns reducers to these buffers, which are read
remotely and processed by the reducers.
• Reducers output the result to stable storage.
[Figure: MapReduce word-count example]
[Figure: the same example executed in parallel]
MapReduce: Execution Details
• Input reader
– Divide input into splits, assign each split to a Map processor
• Map
– Apply the Map function to each record in the split
– Each Map function returns a list of (key, value) pairs
• Shuffle/Partition and Sort
– Shuffle distributes sorting & aggregation to many reducers
– All records for key k are directed to the same reduce processor
– Sort groups the same keys together, and prepares for aggregation
• Reduce
– Apply the Reduce function to each key
– The result of the Reduce function is a list of (key, value) pairs
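The routing rule in the shuffle step is ordinarily a plain hash partition.
As a sketch (Hadoop's default HashPartitioner does the equivalent with
key.hashCode()):

def partition(key, num_reducers):
    # Every record with the same key lands on the same reduce processor.
    return (hash(key) & 0x7FFFFFFF) % num_reducers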
MapReduce in One Picture
[Figure: MapReduce data flow, from Tom White, Hadoop: The Definitive Guide]
MapReduce: Runtime Environment
The runtime environment handles:
• Partitioning the input data
• Scheduling the program across a cluster of machines, locality
optimization, and load balancing
• Dealing with machine failure
• Managing inter-machine communication
MapReduce: Fault Tolerance
• Handled via re-execution of tasks
– Task completion is committed through the master
• What happens if a mapper fails?
– Re-execute completed + in-progress map tasks (completed map output
lives on the failed machine's local disk, so it is lost)
• What happens if a reducer fails?
– Re-execute only in-progress reduce tasks (completed reduce output is
already on stable storage)
• What happens if the master fails?
– Potential trouble!
MapReduce: Refinements
Locality Optimization
• Leverage GFS to schedule a map task on a machine that holds a
replica of the corresponding input data.
• Thousands of machines then read input at local disk speed.
• Without this, rack switches would limit the aggregate read rate.
MapReduce: Refinements
Redundant Execution
• Slow workers ("stragglers") are a source of bottleneck and may delay
job completion.
• Near the end of a phase, spawn backup copies of the remaining tasks;
whichever copy finishes first wins.
• Uses a little extra computing power to significantly reduce job
completion time.
MapReduce: Refinements
Skipping Bad Records
• Map/Reduce functions sometimes fail deterministically on particular inputs.
• Fixing the bug might not be possible: e.g., it lives in a third-party library.
• On error:
– The worker sends a signal to the master
– If multiple failures occur on the same record, skip that record
MapReduce: Refinements
Miscellaneous
• Combiner function at the mapper
• Sorting guarantees within each reduce partition
• Local execution for debugging/testing
• User-defined counters
Combining Phase
• Runs on mapper nodes after the map phase
• A "mini-reduce," only over local map output
• Saves bandwidth before sending data to a full reducer
• The reducer itself can serve as the combiner if it is commutative &
associative (see the sketch below)
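A small Python sketch of the idea for the word-count job: the reducer's
summing logic is reused locally on one mapper's buffered output, which is
legal because integer addition is commutative and associative.

from collections import defaultdict

def combine(local_map_output):
    # local_map_output: (word, count) pairs from ONE mapper's buffer
    partial = defaultdict(int)
    for word, count in local_map_output:
        partial[word] += count        # same logic as the reducer
    return list(partial.items())      # far fewer pairs cross the network

print(combine([("the", 1), ("the", 1), ("fox", 1)]))
# [('the', 2), ('fox', 1)]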
Combiner, graphically
[Figure: on one mapper machine, the combiner replaces the raw map
output with one partial sum per key before anything is sent to the
reducers.]
MapReduce: Some More Apps
• Distributed grep
• Count of URL access frequency
• Clustering (k-means)
• Graph algorithms
• Indexing systems
[Figure: growth in the number of MapReduce programs in the Google
source tree]
More Applications with MapReduce
MapReduce Use Case (1) – Map Only
Tasks that distribute over the data – Map only
• E.g. classify individual documents
• Map does everything
– Input: (docno, doc_content), …
– Output: (docno, [class, class, …]), …
• No reduce
MapReduce Use Case (2) – Filtering and Accumulation
Filtering & Accumulation – Map and Reduce
• E.g. Counting total enrollments of two given classes
• Map selects records and outputs initial counts
– In: (Jamie, 11741), (Tom, 11493), …
– Out: (11741, 1), (11493, 1), …
• Shuffle/Partition by class_id
• Sort
– In: (11741, 1), (11493, 1), (11741, 1), …
– Out: (11493, 1), …, (11741, 1), (11741, 1), …
• Reduce accumulates counts
– In: (11493, [1, 1, …]), (11741, [1, 1, …])
– Sum and Output: (11493, 16), (11741, 35)
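A sketch of the filtering map in Python (the record format and hard-coded
class ids are assumptions made for this example):

WANTED_CLASSES = {11741, 11493}

def enroll_map(student, class_id):
    if class_id in WANTED_CLASSES:    # filter: keep only the two classes
        yield (class_id, 1)           # initial count for the accumulator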
MapReduce Use Case (3) – Database Join
Problem: Massive lookups
– Given two large lists: (URL, ID) and (URL, doc_content) pairs
– Produce (ID, doc_content)
Solution: Database join
• Input stream: both (URL, ID) and (URL, doc_content) lists
– (http://del.icio.us/post, 0), (http://digg.com/submit, 1), …
– (http://del.icio.us/post, <html0>), (http://digg.com/submit, <html1>), …
• Map simply passes input along
• Shuffle and Sort on URL (group ID & doc_content for the same URL together)
– Out: (http://del.icio.us/post, 0), (http://del.icio.us/post, <html0>),
(http://digg.com/submit, <html1>), (http://digg.com/submit, 1), …
• Reduce outputs result stream of (ID, doc_content) pairs
– In: (http://del.icio.us/post, [0, html0]), (http://digg.com/submit, [html1, 1]), …
– Out: (0, <html0>), (1, <html1>), …
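A minimal reduce-side join sketch in Python. Tagging each value with its
source list is an assumption added here: once values are grouped under the
same URL, the reducer must be able to tell IDs apart from page contents.

def join_reduce(url, tagged_values):
    ids = [v for tag, v in tagged_values if tag == "id"]
    docs = [v for tag, v in tagged_values if tag == "doc"]
    for id_ in ids:                   # usually one id and one doc per URL
        for doc in docs:
            yield (id_, doc)

print(list(join_reduce("http://digg.com/submit",
                       [("doc", "<html1>"), ("id", 1)])))
# [(1, '<html1>')]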
MapReduce Use Case (4) – Secondary Sort
Problem: Sorting on values
• E.g. Reverse graph edge directions & output in node order
– Input: adjacency lists of a graph (3 nodes and 4 edges)
(1, [2, 3]), (3, [1, 2])
– Output: adjacency lists of the reversed graph, in node order
(1, [3]), (2, [1, 3]), (3, [1])
• Note that the node_ids in the output values are also sorted,
but Hadoop only sorts on keys!
Solution: Secondary sort
• Map
– In: (3, [1, 2]), (1, [2, 3]).
– Intermediate: (1, [3]), (2, [3]), (2, [1]), (3, [1]). (reverse edge direction)
– Out: (<1, 3>, [3]), (<2, 3>, [3]), (<2, 1>, [1]), (<3, 1>, [1]).
– Copy node_ids from value to key.
36
© 2010, Le Zhao
MapReduce Use Case (4) – Secondary Sort
Secondary Sort (ctd.)
• Shuffle on Key.field1, and Sort on whole Key (both fields)
– In: (<1, 3>, [3]), (<2, 3>, [3]), (<2, 1>, [1]), (<3, 1>, [1])
– Out: (<1, 3>, [3]), (<2, 1>, [1]), (<2, 3>, [3]), (<3, 1>, [1])
• Grouping comparator
– Merge according to part of the key
– Out: (<1, 3>, [3]), (<2, 1>, [1, 3]), (<3, 1>, [1])
» This will be the reducer's input
• Reduce
– Merge & output: (1, [3]), (2, [1, 3]), (3, [1])
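The whole trick can be checked in a few lines of plain Python: sort on the
full composite key, then group on its first field only.

from itertools import groupby

# Map output from the example: (<node, neighbor> composite key, value)
pairs = [((1, 3), 3), ((2, 3), 3), ((2, 1), 1), ((3, 1), 1)]
pairs.sort()                                   # sort on the whole key
for node, grp in groupby(pairs, key=lambda kv: kv[0][0]):  # group on field 1
    print((node, [v for _, v in grp]))
# (1, [3])
# (2, [1, 3])
# (3, [1])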
Using MapReduce to Construct Indexes:
Preliminaries
Construction of binary inverted lists
• Input: documents: (docid, [term, term..]), (docid, [term, ..]), ..
• Output: (term, [docid, docid, …])
– E.g., (apple, [1, 23, 49, 127, …])
• Binary inverted lists fit on a slide more easily
• Everything also applies to frequency and positional inverted lists
A document id is an internal document id, e.g., a unique integer
• Not an external document id such as a url
MapReduce elements
• Combiner, Secondary Sort, complex keys, Sorting on keys’ fields
Using MapReduce to Construct Indexes:
A Simple Approach
A simple approach to creating binary inverted lists
• Each Map task is a document parser
– Input: A stream of documents
– Output: A stream of (term, docid) tuples
» (long, 1) (ago, 1) (and, 1) … (once, 2) (upon, 2) …
• Shuffle sorts tuples by key and routes tuples to Reducers
• Reducers convert streams of keys into streams of inverted lists
– Input:
(long, 1) (long, 127) (long, 49) (long, 23) …
– The reducer sorts the values for a key and builds an inverted list
» Longest inverted list must fit in memory
– Output: (long, [df:492, docids:1, 23, 49, 127, …])
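The same parser/reducer pair as a Python sketch; split() is a crude
stand-in for a real tokenizer.

def index_map(docid, content):
    for term in content.split():
        yield (term, docid)

def index_reduce(term, docids):
    postings = sorted(set(docids))    # the in-memory sort noted above
    yield (term, {"df": len(postings), "docids": postings})

print(list(index_reduce("long", [1, 127, 49, 23])))
# [('long', {'df': 4, 'docids': [1, 23, 49, 127]})]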
Inverted Index: Data Flow
Input documents:
• Foo: "This page contains so much text"
• Bar: "My page contains text too"
Foo map output:
(contains: Foo) (much: Foo) (page: Foo) (so: Foo) (text: Foo) (This: Foo)
Bar map output:
(contains: Bar) (My: Bar) (page: Bar) (text: Bar) (too: Bar)
Reduced output:
contains: Foo, Bar
much: Foo
My: Bar
page: Foo, Bar
so: Foo
text: Foo, Bar
This: Foo
too: Bar
Using MapReduce to Construct Indexes:
A Simple Approach
A more succinct representation of the previous algorithm:
• Map: (docid1, content1) → (t1, docid1) (t2, docid1) …
• Shuffle by t
• Sort by t
(t5, docid1) (t4, docid3) … → (t4, docid3) (t4, docid1) (t5, docid1) …
• Reduce: (t4, [docid3 docid1 …]) → (t, ilist)
where docid is a unique integer, t is a term (e.g., "apple"), and ilist
is a complete inverted list.
But this is a) inefficient, b) requires docids to be sorted in reducers,
and c) assumes a word's ilist fits in memory.
Using MapReduce to Construct Indexes:
Using Combine
• Map: (docid1, content1) → (t1, ilist1,1) (t2, ilist2,1) (t3, ilist3,1) …
– Each output inverted list covers just one document
• Combine: sort by t, then merge lists for the same term
(t1, [ilist1,2 ilist1,3 ilist1,1 …]) → (t1, ilist1,27)
– Each output inverted list covers a sequence of documents
• Shuffle by t
• Sort by t
(t4, ilist4,1) (t5, ilist5,3) … → (t4, ilist4,2) (t4, ilist4,4) (t4, ilist4,1) …
• Reduce: (t7, [ilist7,2, ilist7,1, ilist7,4, …]) → (t7, ilistfinal)
ilisti,j: the j'th inverted list fragment for term i
Using MapReduce to Construct Indexes
[Figure: documents flow through parallel Parser/Indexer processors
(Map/Combine), which emit inverted list fragments; Shuffle/Sort routes
each term range (A-F, G-P, Q-Z) to its own Merger processor (Reduce),
which writes the final inverted lists.]
Using MapReduce to Construct
Partitioned Indexes
• Map: (docid1, content1) → ([p, t1], ilist1,1)
• Combine to sort and group values
([p, t1], [ilist1,2 ilist1,3 ilist1,1 …]) → ([p, t1], ilist1,27)
• Shuffle by p
• Sort values by [p, t]
• Reduce: ([p, t7], [ilist7,2, ilist7,1, ilist7,4, …]) → ([p, t7], ilistfinal)
p: partition (shard) id
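A sketch of the corresponding partition function: with composite keys
[p, t], the shuffle routes on p alone, so each reducer assembles exactly
one index shard.

def shard_partition(composite_key, num_reducers):
    p, _term = composite_key          # ignore the term; route by shard id
    return p % num_reducers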
Using MapReduce to Construct Indexes:
Secondary Sort
So far, we have assumed that Reduce can sort values in memory
…but what if there are too many to fit in memory?
• Map: (docid1, content1) → ([t1, fd1,1], ilist1,1)
• Combine to sort and group values
• Shuffle by t
• Sort by [t, fd], then group by t (secondary sort)
([t7, fd7,2], ilist7,2), ([t7, fd7,1], ilist7,1) … → (t7, [ilist7,1, ilist7,2, …])
• Reduce: (t7, [ilist7,1, ilist7,2, …]) → (t7, ilistfinal)
• Values arrive in order, so Reduce can stream its output
fdi,j is the first docid in ilisti,j
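A sketch of why streaming works, assuming (as in this pipeline) that
fragments for a term come from disjoint docid ranges: once they are
ordered by first docid, Reduce only concatenates.

def merge_fragments(term, fragments_in_order):
    # fragments_in_order: ilist fragments, already sorted by first docid
    for fragment in fragments_in_order:
        for docid in fragment:
            yield docid               # stream out; O(1) memory per term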
Using MapReduce to Construct Indexes:
Putting it All Together
• Map: (docid1, content1) → ([p, t1, fd1,1], ilist1,1)
• Combine to sort and group values
([p, t1, fd1,1], [ilist1,2 ilist1,3 ilist1,1 …]) → ([p, t1, fd1,27], ilist1,27)
• Shuffle by p
• Secondary sort by [(p, t), fd]
([p, t7], [ilist7,2, ilist7,1, ilist7,4, …]) → ([p, t7], [ilist7,1, ilist7,2, ilist7,4, …])
• Reduce: ([p, t7], [ilist7,1, ilist7,2, ilist7,4, …]) → ([p, t7], ilistfinal)
Using MapReduce to Construct Indexes
[Figure: the same pipeline as before, but Shuffle/Sort now routes by
shard id, so each Shard Merger (Reduce) produces one complete index
partition.]
MapReduce : PageRank
• PageRank models the behavior of a "random surfer":

PR(x) = (1 - d) + d * sum_{i=1..n} PR(t_i) / C(t_i)

where t_1 .. t_n are the pages linking to x, C(t_i) is the out-degree
of t_i, and d is the damping factor (1 - d is the probability of a
random jump).
• The "random surfer" keeps clicking on successive links at random,
not taking content into consideration.
• Each page distributes its PageRank equally among all pages it links to.
• The damping factor models the surfer "getting bored" and typing an
arbitrary URL.
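A quick numeric check of the formula (the values are arbitrary):

d = 0.85
# Page x has two in-links: t1 (PR = 0.5, 2 outlinks), t2 (PR = 0.3, 3 outlinks)
pr_x = (1 - d) + d * (0.5 / 2 + 0.3 / 3)
print(pr_x)   # 0.4475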
Computing PageRank
• Start with seed PageRank values.
• Each page distributes PageRank "credit" to all pages it points to.
• Each target page adds up the "credit" from its multiple inbound links
to compute PR_{i+1}.
PageRank : Key Insights
• Effects at each iteration are local: the (i+1)-th iteration depends
only on the i-th iteration.
• Within iteration i, the PageRank of individual nodes can be computed
independently.
PageRank using MapReduce
• Use a sparse matrix representation (M).
• Map each row of M to a list of PageRank "credit" to assign to its
outlink neighbors.
• Reduce aggregates these credit scores into a single new PageRank
value for each page.
PageRank using MapReduce
• Map: distribute PageRank "credit" to link targets
• Reduce: gather up PageRank "credit" from multiple sources to compute
a new PageRank value
• Iterate until convergence
Source of Image: Lin 2008
Phase 1: Process HTML
• Map task takes (URL, page-content) pairs and maps them to (URL,
(PRinit, list-of-urls))
– PRinit is the “seed” PageRank for URL
– list-of-urls contains all pages pointed to by URL
• Reduce task is just the identity function
Phase 2: PageRank Distribution
• Reduce task gets (URL, url_list) and many (URL, val) values
– Sum the vals and apply the damping factor d to get the new PR
– Emit (URL, (new_rank, url_list))
• Check for convergence using a non-parallel component (a local sketch
of both phases follows)
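Run locally, both phases collapse into a few lines. This sketch assumes
d = 0.85 and ignores dangling nodes; the driver loop plays the role of the
non-parallel convergence check.

from collections import defaultdict

def pagerank_iteration(graph, ranks, d=0.85):
    # graph: {url: [outlinks]}; ranks: {url: current PR value}
    credit = defaultdict(float)
    for url, outlinks in graph.items():        # "Map": distribute credit
        for target in outlinks:
            credit[target] += ranks[url] / len(outlinks)
    # "Reduce": sum the credit per page and apply the damping factor
    return {url: (1 - d) + d * credit[url] for url in graph}

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {url: 1.0 for url in graph}
for _ in range(20):                            # iterate "until convergence"
    ranks = pagerank_iteration(graph, ranks)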
PageRank Calculation:
Preliminaries
One PageRank iteration:
• Input:
– (id1, [score1(t), out11, out12, ..]), (id2, [score2(t), out21, out22, ..]) ..
• Output:
– (id1, [score1(t+1), out11, out12, ..]), (id2, [score2(t+1), out21, out22, ..]) ..
MapReduce elements
• Score distribution and accumulation
• Database join
• Side-effect files
PageRank:
Score Distribution and Accumulation
• Map
– In: (id1, [score1(t), out11, out12, ..]), (id2, [score2(t), out21, out22, ..]) ..
– Out: (out11, score1(t)/n1), (out12, score1(t)/n1) .., (out21, score2(t)/n2), ..
(ni is the number of outlinks of idi)
• Shuffle & Sort by node_id
– In: (id2, score1), (id1, score2), (id1, score1), ..
– Out: (id1, score1), (id1, score2), .., (id2, score1), ..
• Reduce
– In: (id1, [score1, score2, ..]), (id2, [score1, ..]), ..
– Out: (id1, score1(t+1)), (id2, score2(t+1)), ..
PageRank:
Database Join to associate outlinks with score
• Map
– In & Out: (id1, score1(t+1)), (id2, score2(t+1)), .., (id1, [out11, out12, ..]),
(id2, [out21, out22, ..]) ..
• Shuffle & Sort by node_id
– Out: (id1, score1(t+1)), (id1, [out11, out12, ..]), (id2, [out21, out22, ..]), (id2,
score2(t+1)), ..
• Reduce
– In: (id1, [score1(t+1), out11, out12, ..]), (id2, [out21, out22, .., score2(t+1)]), ..
– Out: (id1, [score1(t+1), out11, out12, ..]), (id2, [score2(t+1), out21, out22, ..])
..
PageRank:
Side Effect Files for dangling nodes
• Dangling Nodes
– Nodes with no outlinks (observed but not crawled URLs)
– Score has no outlet
» It needs to be distributed evenly to all graph nodes
• Map for dangling nodes:
– In: .., (id3, [score3]), ..
– Out: .., ("*", 0.85×score3), ..
• Reduce
– In: .., ("*", [score1, score2, ..]), ..
– Out: .., everything else, ..
– Output to side-effect: ("*", score), fed to Mapper of next iteration
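A sketch of how the next iteration might fold that side-effect record back
in (the function name and data shapes are assumptions for illustration):

def apply_dangling_mass(ranks, dangling_mass):
    # Spread the accumulated "*" mass evenly over all graph nodes.
    share = dangling_mass / len(ranks)
    return {node: score + share for node, score in ranks.items()}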
Manipulating Large Data
• Do everything in Hadoop (and HDFS)
– Make sure every step is parallelized!
– Any serial step breaks your design
• E.g. storing the URL list for a Web graph
– Each node in the Web graph has an id
– [URL1, URL2, …], using the line number as the id – a bottleneck,
since assigning line numbers is a serial step
– [(id1, URL1), (id2, URL2), …], explicit ids – fully parallel
Hadoop based Tools
• For developing in Java, a NetBeans plugin
– http://www.hadoopstudio.org/docs.html
• Pig Latin, a SQL-like high-level data processing script language
• Hive, data warehouse, SQL
• Cascading, data processing
• Mahout, machine learning algorithms on Hadoop
• HBase, distributed data store as a large table
• More
– http://hadoop.apache.org/
– http://en.wikipedia.org/wiki/Hadoop
– Many other toolkits: Nutch, Cloud9, Ivory
Get Your Hands Dirty
• Hadoop virtual machine
– http://www.cloudera.com/developers/downloads/virtualmachine/
» This runs Hadoop 0.20
– An earlier Hadoop 0.18.0 version is here:
http://code.google.com/edu/parallel/tools/hadoopvm/index.html
• Amazon EC2
• Various other Hadoop clusters around
• The NetBeans plugin simulates Hadoop
– The workflow view works on Windows
– Local running & debugging works on MacOS and Linux
– http://www.hadoopstudio.org/docs.html
Conclusions
• Why large scale
• MapReduce advantages
• Hadoop uses
• Use cases
– Map only: for totally distributive computation
– Map+Reduce: for filtering & aggregation
– Database join: for massive dictionary lookups
– Secondary sort: for sorting on values
– Inverted indexing: combiner, complex keys
– PageRank: side effect files
• Large data
For More Information
• L. A. Barroso, J. Dean, and U. Hölzle. "Web Search for a Planet: The
Google Cluster Architecture." IEEE Micro, 2003.
• J. Dean and S. Ghemawat. "MapReduce: Simplified Data Processing on
Large Clusters." Proceedings of the 6th Symposium on Operating System
Design and Implementation (OSDI 2004), pages 137-150. 2004.
• S. Ghemawat, H. Gobioff, and S.-T. Leung. "The Google File System."
Proceedings of the 19th ACM Symposium on Operating Systems Principles
(SOSP-03), pages 29-43. 2003.
• I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes. Morgan
Kaufmann. 1999.
• J. Zobel and A. Moffat. "Inverted Files for Text Search Engines."
ACM Computing Surveys, 38(2). 2006.
• "Map/Reduce Tutorial."
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html.
Fetched January 21, 2010.
• Tom White. Hadoop: The Definitive Guide. O'Reilly Media. June 5, 2009.
• J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce.
Book draft, February 7, 2010.