Tutorial for MapReduce (Hadoop) & Large Scale Processing

Le Zhao (LTI, SCS, CMU)
Database Seminar & Large Scale Seminar
2010-Feb-15
Some slides adapted from IR course lectures by Jamie Callan
© 2010, Le Zhao
Outline
• Why MapReduce (Hadoop)
• MapReduce basics
• The MapReduce way of thinking
• Manipulating large data
Outline
• Why MapReduce (Hadoop)
– Why go large scale
– Compared to other parallel computing models
– Hadoop related tools
• MapReduce basics
• The MapReduce way of thinking
• Manipulating large data
Why NOT to do parallel computing
• Concerns: a parallel system needs to provide:
– Data distribution
– Computation distribution
– Fault tolerance
– Job scheduling
Why MapReduce (Hadoop)
• Previous parallel computation models
– 1) scp + ssh
» Manual everything
– 2) network cross-mounted disks + condor/torque
» No data distribution; disk access is the bottleneck
» Can only partition totally distributed computation
» No fault tolerance
» Prioritized job scheduling
Hadoop
• Parallel batch computation
– Data distribution
» Hadoop Distributed File System (HDFS)
» Like a Linux FS, but with automatic data replication
– Computation distribution
» Automatic; the user only needs to specify #input_splits
» Can distribute aggregation computations as well
– Fault tolerance
» Automatic recovery from failure
» Speculative execution (a backup task)
– Job scheduling
» Ok, but still relies on the politeness of users
How you can use Hadoop
• Hadoop Streaming
– Quick hacking – much like shell scripting
» Uses STDIN & STDOUT to carry data
» cat file | mapper | sort | reducer > output
– Easier to use legacy code, all programming languages
• Hadoop Java API
– Build large systems
» More data types
» More control over Hadoop’s behavior
» Easier debugging with Java’s error stacktrace display
– NetBeans plugin for Hadoop provides easy programming
» http://hadoopstudio.org/docs.html
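The streaming pipeline above (cat file | mapper | sort | reducer) can be simulated end to end in Python. This is a minimal word-count sketch; `mapper`, `reducer`, and `run_pipeline` are illustrative names, not part of Hadoop Streaming itself. In a real job, mapper and reducer would be separate scripts reading STDIN and writing STDOUT.

```python
def mapper(lines):
    # Emit one "word\t1" record per token, as a streaming mapper would.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # Records arrive sorted by key, so equal keys are adjacent.
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

def run_pipeline(lines):
    # Simulates: cat file | mapper | sort | reducer > output
    return list(reducer(sorted(mapper(lines))))
```

For example, `run_pipeline(["a b a"])` yields the records `a\t2` and `b\t1`.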
Outline
• Why MapReduce (Hadoop)
• MapReduce basics
• The MapReduce way of thinking
• Manipulating large data
Map and Reduce
MapReduce is a new use of an old idea in Computer Science
• Map: Apply a function to every object in a list
– Each object is independent
» Order is unimportant
» Maps can be done in parallel
– The function produces a result
• Reduce: Combine the results to produce a final result
You may have seen this in a Lisp or functional programming
course
© 2009, Jamie Callan
MapReduce
• Input reader
– Divide input into splits, assign each split to a Map processor
• Map
– Apply the Map function to each record in the split
– Each Map function returns a list of (key, value) pairs
• Shuffle/Partition and Sort
– Shuffle distributes sorting & aggregation to many reducers
– All records for key k are directed to the same reduce processor
– Sort groups the same keys together, and prepares for aggregation
• Reduce
– Apply the Reduce function to each key
– The result of the Reduce function is a list of (key, value) pairs
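The four phases above can be condensed into a toy in-memory sketch. The `map_reduce` helper and its arguments are illustrative, not the Hadoop API; sorting the mapped pairs by key stands in for shuffle/partition and sort.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, map_fn, reduce_fn):
    # Map: apply map_fn to every input record; each call returns
    # a list of (key, value) pairs.
    mapped = [kv for rec in records for kv in map_fn(rec)]
    # Shuffle/Sort: bring all values that share a key together.
    mapped.sort(key=itemgetter(0))
    grouped = ((k, [v for _, v in kvs])
               for k, kvs in groupby(mapped, key=itemgetter(0)))
    # Reduce: apply reduce_fn to each (key, [values]) group.
    return [out for k, vs in grouped for out in reduce_fn(k, vs)]
```

Word count then becomes a one-liner: map each line to `(word, 1)` pairs and reduce with `sum`.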
© 2010, Jamie Callan
MapReduce in One Picture
Tom White, Hadoop: The Definitive Guide
Outline
• Why MapReduce (Hadoop)
• MapReduce basics
• The MapReduce way of thinking
– Two simple use cases
– Two more advanced & useful MapReduce tricks
– Two MapReduce applications
• Manipulating large data
MapReduce Use Case (1) – Map Only
Data distributive tasks – Map Only
• E.g. classify individual documents
• Map does everything
– Input: (docno, doc_content), …
– Output: (docno, [class, class, …]), …
• No reduce
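A map-only job of this shape might look like the following sketch. The rule-based `classify_map` is a hypothetical stand-in for a real classifier; only its input/output shape matches the slide.

```python
def classify_map(docno, doc_content):
    # Map does everything: classify one document independently.
    # Input: (docno, doc_content); Output: (docno, [class, class, ...])
    classes = []
    if "mapreduce" in doc_content.lower():
        classes.append("distributed-computing")
    if "index" in doc_content.lower():
        classes.append("information-retrieval")
    return (docno, classes)
```

Because each record is handled independently, Hadoop can run one such mapper per input split with no reduce phase at all.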
MapReduce Use Case (2) – Filtering and
Accumulation
Filtering & Accumulation – Map and Reduce
• E.g. Counting total enrollments of two given classes
• Map selects records and outputs initial counts
– In: (Jamie, 11741), (Tom, 11493), …
– Out: (11741, 1), (11493, 1), …
• Shuffle/Partition by class_id
• Sort
– In: (11741, 1), (11493, 1), (11741, 1), …
– Out: (11493, 1), …, (11741, 1), (11741, 1), …
• Reduce accumulates counts
– In: (11493, [1, 1, …]), (11741, [1, 1, …])
– Sum and Output: (11493, 16), (11741, 35)
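The filtering-and-accumulation pattern above can be simulated in a few lines of Python. Function names are illustrative, and the shuffle is simulated with a dictionary keyed on class_id.

```python
from collections import defaultdict

def enrollment_map(record):
    # In: (student, class_id); Out: (class_id, 1) initial count
    student, class_id = record
    return [(class_id, 1)]

def enrollment_reduce(class_id, counts):
    # Accumulate the per-record counts for one class.
    return [(class_id, sum(counts))]

def count_enrollments(records):
    # Simulated shuffle/partition: group map output by class_id.
    groups = defaultdict(list)
    for rec in records:
        for k, v in enrollment_map(rec):
            groups[k].append(v)
    return sorted(out for k in groups for out in enrollment_reduce(k, groups[k]))
```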
MapReduce Use Case (3) – Database Join
Problem: Massive lookups
– Given two large lists: (URL, ID) and (URL, doc_content) pairs
– Produce (ID, doc_content)
Solution: Database join
• Input stream: both (URL, ID) and (URL, doc_content) lists
– (http://del.icio.us/post, 0), (http://digg.com/submit, 1), …
– (http://del.icio.us/post, <html0>), (http://digg.com/submit, <html1>), …
• Map simply passes input along,
• Shuffle and Sort on URL (group ID & doc_content for the same URL together)
– Out: (http://del.icio.us/post, 0), (http://del.icio.us/post, <html0>),
(http://digg.com/submit, <html1>), (http://digg.com/submit, 1), …
• Reduce outputs result stream of (ID, doc_content) pairs
– In: (http://del.icio.us/post, [0, html0]), (http://digg.com/submit, [html1, 1]), …
– Out: (0, <html0>), (1, <html1>), …
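A reduce-side join of this kind can be sketched as follows. Distinguishing the ID from the document content by type is a simplification for illustration; a real job would tag each record with its source list.

```python
from collections import defaultdict

def join_reduce(records):
    # records: a mixed stream of (URL, ID) and (URL, doc_content) pairs.
    # Map is the identity; shuffle groups both values for each URL.
    by_url = defaultdict(list)
    for url, value in records:
        by_url[url].append(value)
    out = []
    for url, values in by_url.items():
        # One grouped value is the integer ID, the other the content.
        doc_id = next(v for v in values if isinstance(v, int))
        content = next(v for v in values if not isinstance(v, int))
        out.append((doc_id, content))
    return sorted(out)
```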
MapReduce Use Case (4) – Secondary Sort
Problem: Sorting on values
• E.g. Reverse graph edge directions & output in node order
– Input: adjacency list of graph (3 nodes and 4 edges)
» (1, [2, 3]), (3, [1, 2])
– Output: reversed adjacency list, in node order
» (1, [3]), (2, [1, 3]), (3, [1])
[Figure: the 3-node graph drawn before and after reversing its edge directions]
• Note: the node_ids in the output values are also sorted.
But Hadoop only sorts on keys!
Solution: Secondary sort
• Map
– In: (3, [1, 2]), (1, [2, 3]).
– Intermediate: (1, [3]), (2, [3]), (2, [1]), (3, [1]). (reverse edge direction)
– Out: (<1, 3>, [3]), (<2, 3>, [3]), (<2, 1>, [1]), (<3, 1>, [1]).
– Copy node_ids from value to key.
MapReduce Use Case (4) – Secondary Sort
Secondary Sort (ctd.)
• Shuffle on Key.field1, and Sort on whole Key (both fields)
– In: (<1, 3>, [3]), (<2, 3>, [3]), (<2, 1>, [1]), (<3, 1>, [1])
– Out: (<1, 3>, [3]), (<2, 1>, [1]), (<2, 3>, [3]), (<3, 1>, [1])
• Grouping comparator
– Merge according to part of the key
– Out: (<1, 3>, [3]), (<2, 1>, [1, 3]), (<3, 1>, [1])
this will be the reducer’s input
• Reduce
– Merge & output: (1, [3]), (2, [1, 3]), (3, [1])
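The composite-key trick above can be simulated in Python: sorting tuples `(dst, src)` plays the role of shuffling on key.field1 plus sorting on the whole key, and the final grouping on `dst` alone plays the role of the grouping comparator. `reverse_graph` is an illustrative name.

```python
def reverse_graph(adjacency):
    # adjacency: list of (node, [outlinks]) pairs.
    # Map: reverse each edge and copy the destination id into the key,
    # giving the composite key (dst, src).
    intermediate = [((dst, src), src)
                    for src, outs in adjacency for dst in outs]
    # Shuffle on key[0] and sort on the whole composite key:
    # Python's lexicographic tuple sort gives exactly this ordering.
    intermediate.sort(key=lambda kv: kv[0])
    # Grouping comparator: merge on key[0] only, so each reduce group
    # receives its values already sorted.
    result = []
    for (dst, src), value in intermediate:
        if result and result[-1][0] == dst:
            result[-1][1].append(value)
        else:
            result.append((dst, [value]))
    return result
```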
Using MapReduce to Construct Indexes:
Preliminaries
Construction of binary inverted lists
• Input: documents: (docid, [term, term..]), (docid, [term, ..]), ..
• Output: (term, [docid, docid, …])
– E.g., (apple, [1, 23, 49, 127, …])
• Binary inverted lists fit on a slide more easily
• Everything also applies to frequency and positional inverted lists
A document id is an internal document id, e.g., a unique integer
• Not an external document id such as a url
MapReduce elements
• Combiner, Secondary Sort, complex keys, Sorting on keys’ fields
Using MapReduce to Construct Indexes:
A Simple Approach
A simple approach to creating binary inverted lists
• Each Map task is a document parser
– Input: A stream of documents
– Output: A stream of (term, docid) tuples
» (long, 1) (ago, 1) (and, 1) … (once, 2) (upon, 2) …
• Shuffle sorts tuples by key and routes tuples to Reducers
• Reducers convert streams of keys into streams of inverted lists
– Input:
(long, 1) (long, 127) (long, 49) (long, 23) …
– The reducer sorts the values for a key and builds an inverted list
» Longest inverted list must fit in memory
– Output: (long, [df:492, docids:1, 23, 49, 127, …])
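The simple approach can be sketched as an in-memory simulation (a real job would stream `(term, docid)` tuples through Hadoop rather than materializing them):

```python
from itertools import groupby
from operator import itemgetter

def build_inverted_lists(docs):
    # docs: (docid, [term, term, ...]) pairs.
    # Map (document parser): emit one (term, docid) tuple per posting.
    tuples = [(term, docid) for docid, terms in docs for term in terms]
    # Shuffle/Sort: sort tuples by term so each reducer sees one term's
    # postings contiguously.
    tuples.sort(key=itemgetter(0))
    # Reduce: sort the docids for each term and build its inverted list.
    return [(term, sorted(d for _, d in postings))
            for term, postings in groupby(tuples, key=itemgetter(0))]
```

Note how the reducer sorts docids in memory, which is exactly the "longest inverted list must fit in memory" limitation named above.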
Using MapReduce to Construct Indexes:
A Simple Approach
A more succinct representation of the previous algorithm
• Map: (docid1, content1) → (t1, docid1) (t2, docid1) …
• Shuffle by t
• Sort by t
(t5, docid1) (t4, docid3) … → (t4, docid3) (t4, docid1) (t5, docid1) …
• Reduce: (t4, [docid3 docid1 …]) → (t, ilist)
docid: a unique integer
t: a term, e.g., “apple”
ilist: a complete inverted list
but a) it is inefficient, b) docids must still be sorted inside the reducers,
and c) it assumes the ilist of a word fits in memory
Using MapReduce to Construct Indexes:
Using Combine
• Map: (docid1, content1) → (t1, ilist1,1) (t2, ilist2,1) (t3, ilist3,1) …
– Each output inverted list covers just one document
• Combine
– Sort by t
– Combine: (t1, [ilist1,2 ilist1,3 ilist1,1 …]) → (t1, ilist1,27)
– Each output inverted list covers a sequence of documents
• Shuffle by t
• Sort by t
(t4, ilist4,1) (t5, ilist5,3) … → (t4, ilist4,2) (t4, ilist4,4) (t4, ilist4,1) …
• Reduce: (t7, [ilist7,2, ilist7,1, ilist7,4, …]) → (t7, ilistfinal)
ilisti,j: the j’th inverted list fragment for term i
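Merging docid-sorted fragments, as the combiner and the reducer must do here, can be sketched with a k-way merge. `combine_fragments` is an illustrative name; the key point is that sorted runs merge without re-sorting and without holding more than one element per fragment in flight.

```python
import heapq

def combine_fragments(term, fragments):
    # Each fragment is a docid-sorted inverted list covering a disjoint
    # set of documents; heapq.merge streams them into one sorted list.
    return (term, list(heapq.merge(*fragments)))
```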
Using MapReduce to Construct Indexes
[Figure: Documents flow into Parser/Indexer processors (Map/Combine), which emit inverted list fragments; Shuffle/Sort routes each term range (A-F, G-P, Q-Z) to a Merger processor (Reduce), which writes the final inverted lists.]
Using MapReduce to Construct
Partitioned Indexes
• Map: (docid1, content1) → ([p, t1], ilist1,1)
• Combine to sort and group values
([p, t1], [ilist1,2 ilist1,3 ilist1,1 …]) → ([p, t1], ilist1,27)
• Shuffle by p
• Sort values by [p, t]
• Reduce: ([p, t7], [ilist7,2, ilist7,1, ilist7,4, …]) → ([p, t7], ilistfinal)
p: partition (shard) id
Using MapReduce to Construct Indexes:
Secondary Sort
So far, we have assumed that Reduce can sort values in memory
…but what if there are too many to fit in memory?
• Map: (docid1, content1) → ([t1, fd1,1], ilist1,1)
• Combine to sort and group values
• Shuffle by t
• Sort by [t, fd], then Group by t (Secondary Sort)
([t7, fd7,2], ilist7,2), ([t7, fd7,1], ilist7,1) … → (t7, [ilist7,1, ilist7,2, …])
• Reduce: (t7, [ilist7,1, ilist7,2, …]) → (t7, ilistfinal)
Values arrive in order, so Reduce can stream its output
fdi,j is the first docid in ilisti,j
Using MapReduce to Construct Indexes:
Putting it All Together
• Map: (docid1, content1) → ([p, t1, fd1,1], ilist1,1)
• Combine to sort and group values
([p, t1, fd1,1], [ilist1,2 ilist1,3 ilist1,1 …]) → ([p, t1, fd1,27], ilist1,27)
• Shuffle by p
• Secondary Sort by [(p, t), fd]
([p, t7], [ilist7,2, ilist7,1, ilist7,4, …]) → ([p, t7], [ilist7,1, ilist7,2, ilist7,4, …])
• Reduce: ([p, t7], [ilist7,1, ilist7,2, ilist7,4, …]) → ([p, t7], ilistfinal)
Using MapReduce to Construct Indexes
[Figure: Documents flow into Parser/Indexer processors (Map/Combine); Shuffle/Sort routes inverted list fragments by shard id to Shard Merger processors (Reduce), each of which writes the inverted lists for one index partition.]
PageRank Calculation:
Preliminaries
One PageRank iteration:
• Input:
– (id1, [score1(t), out11, out12, ..]), (id2, [score2(t), out21, out22, ..]) ..
• Output:
– (id1, [score1(t+1), out11, out12, ..]), (id2, [score2(t+1), out21, out22, ..]) ..
MapReduce elements
• Score distribution and accumulation
• Database join
• Side-effect files
PageRank:
Score Distribution and Accumulation
• Map
– In: (id1, [score1(t), out11, out12, ..]), (id2, [score2(t), out21, out22, ..]) ..
– Out: (out11, score1(t)/n1), (out12, score1(t)/n1) .., (out21, score2(t)/n2), ..
• Shuffle & Sort by node_id
– In: (id2, score1), (id1, score2), (id1, score1), ..
– Out: (id1, score1), (id1, score2), .., (id2, score1), ..
• Reduce
– In: (id1, [score1, score2, ..]), (id2, [score1, ..]), ..
– Out: (id1, score1(t+1)), (id2, score2(t+1)), ..
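One distribution-and-accumulation step can be simulated as below. The damping factor 0.85 and the (1 - d)/N teleport term are standard PageRank conventions assumed here rather than spelled out on the slide, and `pagerank_step` is an illustrative name.

```python
from collections import defaultdict

def pagerank_step(graph, damping=0.85):
    # graph: {id: (score, out1, out2, ...)} records, as on the slide.
    n = len(graph)
    # Map: each node sends score/n_i to every outlink target.
    contributions = defaultdict(list)
    for node, (score, *outs) in graph.items():
        for out in outs:
            contributions[out].append(score / len(outs))
    # Shuffle groups contributions by target node_id;
    # Reduce accumulates them into the next score.
    return {node: (1 - damping) / n + damping * sum(contributions[node])
            for node in graph}
```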
PageRank:
Database Join to associate outlinks with score
• Map
– In & Out: (id1, score1(t+1)), (id2, score2(t+1)), .., (id1, [out11, out12, ..]),
(id2, [out21, out22, ..]) ..
• Shuffle & Sort by node_id
– Out: (id1, score1(t+1)), (id1, [out11, out12, ..]), (id2, [out21, out22, ..]), (id2,
score2(t+1)), ..
• Reduce
– In: (id1, [score1(t+1), out11, out12, ..]), (id2, [out21, out22, .., score2(t+1)]), ..
– Out: (id1, [score1(t+1), out11, out12, ..]), (id2, [score2(t+1), out21, out22, ..])
..
PageRank:
Side Effect Files for dangling nodes
• Dangling Nodes
– Nodes with no outlinks (observed but not crawled URLs)
– Score has no outlet
» need to distribute to all graph nodes evenly
• Map for dangling nodes:
– In: .., (id3, [score3]), ..
– Out: .., ("*", 0.85×score3), ..
• Reduce
– In: .., ("*", [score1, score2, ..]), ..
– Out: .., everything else as usual, ..
– Output to a side-effect file: ("*", score), fed to the Mapper of the next iteration
Outline
• Why MapReduce (Hadoop)
• MapReduce basics
• The MapReduce way of thinking
• Manipulating large data
Manipulating Large Data
• Do everything in Hadoop (and HDFS)
– Make sure every step is parallelized!
– Any serial step breaks your design
• E.g. storing the URL list for a Web graph
– Each node in the Web graph has an id
– [URL1, URL2, …], using the line number as id – a bottleneck (assigning line numbers is a serial step)
– [(id1, URL1), (id2, URL2), …], explicit ids
Hadoop based Tools
• For developing in Java, NetBeans plugin
– http://www.hadoopstudio.org/docs.html
• Pig Latin, a SQL-like high-level data processing script language
• Hive, data warehouse, SQL
• Cascading, data processing
• Mahout, machine learning algorithms on Hadoop
• HBase, distributed data store as a large table
• More
– http://hadoop.apache.org/
– http://en.wikipedia.org/wiki/Hadoop
– Many other toolkits: Nutch, Cloud9, Ivory
Get Your Hands Dirty
• Hadoop Virtual Machine
– http://www.cloudera.com/developers/downloads/virtualmachine/
» This runs Hadoop 0.20
– An earlier Hadoop 0.18.0 version is here:
» http://code.google.com/edu/parallel/tools/hadoopvm/index.html
• Amazon EC2
• Various other Hadoop clusters around
• The NetBeans plugin simulates Hadoop
– The workflow view works on Windows
– Local running & debugging works on MacOS and Linux
– http://www.hadoopstudio.org/docs.html
Conclusions
• Why large scale
• MapReduce advantages
• Hadoop uses
• Use cases
– Map only: for totally distributive computation
– Map+Reduce: for filtering & aggregation
– Database join: for massive dictionary lookups
– Secondary sort: for sorting on values
– Inverted indexing: combiner, complex keys
– PageRank: side effect files
• Large data
For More Information
• L. A. Barroso, J. Dean, and U. Hölzle. “Web search for a planet: The Google cluster architecture.” IEEE Micro, 2003.
• J. Dean and S. Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters.” Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137-150. 2004.
• S. Ghemawat, H. Gobioff, and S.-T. Leung. “The Google File System.” Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP-03), pages 29-43. 2003.
• I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes. Morgan Kaufmann. 1999.
• J. Zobel and A. Moffat. “Inverted files for text search engines.” ACM Computing Surveys, 38(2). 2006.
• “Map/Reduce Tutorial.” http://hadoop.apache.org/common/docs/current/mapred_tutorial.html. Fetched January 21, 2010.
• Tom White. Hadoop: The Definitive Guide. O’Reilly Media. June 5, 2009.
• J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce. Book draft, February 7, 2010.