Randolph - MapReduce.pptx

Design Patterns for Efficient
Graph Algorithms in MapReduce
Jimmy Lin and Michael Schatz
(Slides by Tyler S. Randolph)
What is MapReduce?
• Definition: Programming model and an associated implementation for
processing and generating large datasets with a parallel, distributed
algorithm on a cluster
• 2 main parts
- Mapper
- Reducer
• 2 sub parts
- Combiner
- Partitioner
What is MapReduce?
1) Mappers applied to input
2) Combiners perform local
aggregation
3) Partitioners send data to
reducers
4) Reducers aggregate results
• Very parallelizable
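The four stages above can be sketched as a tiny single-process simulation. This is only an illustration, not how a real MapReduce framework is implemented: `run_mapreduce`, its parameters, and the parity-counting usage are hypothetical names made up for this sketch, and the "partitioner" is assumed to be a simple hash of the key.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, combiner, reducer,
                  num_mappers=2, num_reducers=2):
    """Simulate map -> combine -> partition -> reduce in one process."""
    # One bucket of {key: [values]} per simulated reducer.
    partitions = [defaultdict(list) for _ in range(num_reducers)]

    for i in range(num_mappers):
        split = records[i::num_mappers]  # this mapper's share of the input

        # 1) Map: apply the mapper to every record in the split.
        local = defaultdict(list)
        for record in split:
            for key, value in mapper(record):
                local[key].append(value)

        # 2) Combine: aggregate locally before anything leaves the mapper.
        for key, values in local.items():
            combined = combiner(key, values)
            # 3) Partition: the key alone decides which reducer gets the data.
            partitions[hash(key) % num_reducers][key].append(combined)

    # 4) Reduce: each reducer aggregates the values for its keys.
    results = {}
    for part in partitions:
        for key in sorted(part):
            results[key] = reducer(key, part[key])
    return results


# Hypothetical usage: count even (key 0) and odd (key 1) numbers.
parity_counts = run_mapreduce(
    [1, 2, 3, 4, 5, 6, 7],
    mapper=lambda n: [(n % 2, 1)],
    combiner=lambda key, values: sum(values),
    reducer=lambda key, values: sum(values),
)
print(parity_counts)  # {0: 3, 1: 4}
```

Because every mapper and every reducer works only on its own share of the data, each loop iteration here could run on a different machine, which is where the parallelism comes from.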
Example
Step through a MapReduce job that counts how many times each word
length appears in the following sentence:
We should all take summer classes this year.
Write and label the outputs of the mapper, combiner, and reducer (no
need for a partitioner with an example this small)
Example (continued)
“We should all take summer classes this year.”
Mapper
2: We
6: should
3: all
4: take
6: summer
7: classes
4: this
4: year
Example (continued)
“We should all take summer classes this year.”
Mapper (output sorted by key)
2: We
3: all
4: take
4: this
4: year
6: should
6: summer
7: classes
Example (continued)
“We should all take summer classes this year.”
Combiner
2: [We]
3: [all]
4: [take, this, year]
6: [should, summer]
7: [classes]
Example (continued)
“We should all take summer classes this year.”
Reducer
2: 1
3: 1
4: 3
6: 2
7: 1
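The worked example can be checked with a short script. This is a minimal sketch that mirrors the slides (the mapper emits word lengths, the combiner groups words by length, the reducer counts them); the function names are illustrative and nothing here is tied to a real MapReduce framework.

```python
from collections import defaultdict

SENTENCE = "We should all take summer classes this year."


def mapper(sentence):
    """Emit (word length, word) for each word, ignoring punctuation."""
    for word in sentence.split():
        word = word.strip(".,;:!?\"'")
        if word:
            yield len(word), word


def combiner(pairs):
    """Group the words by length, sorted by key as in the slides."""
    groups = defaultdict(list)
    for length, word in pairs:
        groups[length].append(word)
    return dict(sorted(groups.items()))


def reducer(groups):
    """Count how many words share each length."""
    return {length: len(words) for length, words in groups.items()}


mapped = list(mapper(SENTENCE))
combined = combiner(mapped)
counts = reducer(combined)
print(counts)  # {2: 1, 3: 1, 4: 3, 6: 2, 7: 1}
```

Both "should" and "summer" have six letters, so length 6 is counted twice and no word has length 5.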
“Message Passing” Graphs
• G = (V, E)
- Graph = (Vertices, Edges)
- Directed graphs
• In-degree
- how many vertices point to me
• Out-degree
- how many vertices do I point to
• Metadata
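As a small illustration of in-degree and out-degree, the sketch below stores a directed graph as an adjacency list and counts both. The four-vertex graph is an assumed example chosen for this sketch.

```python
# Assumed example graph: vertex -> list of vertices it points to.
graph = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

# Out-degree: how many vertices do I point to?
out_degree = {vertex: len(targets) for vertex, targets in graph.items()}

# In-degree: how many vertices point to me?
in_degree = {vertex: 0 for vertex in graph}
for targets in graph.values():
    for target in targets:
        in_degree[target] += 1

print(out_degree)  # {'A': 1, 'B': 2, 'C': 1, 'D': 3}
print(in_degree)   # {'A': 3, 'B': 2, 'C': 2, 'D': 0}
```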
PageRank
• Definition: Google's link-analysis algorithm, which counts the
number and quality of links to a page to produce a rough estimate
of how important that page is.
• Assumption
- More important pages are likely to receive more links from other
pages (really one big popularity contest)
• Graph Topology
- “physical” layout of the graph
- what points to what
PageRank
• At each iteration…
- Computations occur at every vertex as a function of the vertex’s
internal state and the LOCAL graph structure
- Partial results in the form of messages are “passed” via
DIRECTED edges to each vertex’s neighbors
- Computations occur at every vertex based on incoming partial
results, potentially altering the vertex’s internal state
PageRank
Basic PageRank Algorithm
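A sketch of the simplified PageRank update, written to match the L(·) notation of the basic example in these slides. The symbol B_u, the set of pages that link to page u, is notation assumed here; L(v) is the number of outgoing links on page v.

```latex
PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}
```

Each page v splits its current rank equally among the pages it links to, and page u adds up the shares it receives.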
Basic Example
Say A has a link to B, B has links to C and A, C has a link to A, and D has
links to A, B, and C.
Basic Example (continued)
• Each page has starting rank of 0.25
• PR(A) = (0.25 / L(B)) + (0.25 / L(C)) + (0.25 / L(D))
• B has 2 links, C has 1 link, D has 3 links
• PR(A) = (0.25 / 2) + (0.25 / 1) + (0.25 / 3)
• PR(A) = 11/24 ≈ 0.4583
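The arithmetic above can be reproduced with exact fractions. This is a minimal sketch of one iteration of the basic (undamped) update on the four-page example; `basic_pagerank_step` is an illustrative name, not part of any library.

```python
from fractions import Fraction

# Link structure from the example: page -> pages it links to.
links = {
    "A": ["B"],
    "B": ["C", "A"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

# Every page starts with rank 1/4 = 0.25.
rank = {page: Fraction(1, 4) for page in links}


def basic_pagerank_step(links, rank):
    """One undamped iteration: each page splits its rank over its links."""
    new_rank = {page: Fraction(0) for page in links}
    for page, targets in links.items():
        share = rank[page] / len(targets)
        for target in targets:
            new_rank[target] += share
    return new_rank


new_rank = basic_pagerank_step(links, rank)
print(new_rank["A"], round(float(new_rank["A"]), 4))  # 11/24 0.4583
```

In this graph no page links to D, so D's rank drops to 0 after one step, while the ranks still sum to 1.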
Complications
• Need a way to deal with…
- Random hops (the surfer jumps to an arbitrary page instead of
following a link)
- Sinks (pages with no outgoing links)
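Damping Factor
• Probability that, at any step, the surfer keeps following links
instead of jumping to a random page
• Usually set to 0.85
- Every page then receives a baseline rank of (1 - 0.85) / N, where N
is the number of pages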
Damping Factor
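A sketch of the PageRank update with this factor included, assuming the commonly quoted form. Here d is the factor (0.85 in these slides), N is the number of pages, L(v) is the number of outgoing links on page v, and B_u (notation assumed here) is the set of pages linking to u.

```latex
PR(u) = \frac{1 - d}{N} + d \sum_{v \in B_u} \frac{PR(v)}{L(v)}
```

With d = 0.85 the first term is the (1 - 0.85) / N baseline that every page receives from random hops; the second term is the rank passed along links, scaled by d.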
Tying It All Together
• Why MapReduce
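- Every vertex performs the same independent computation at each
iteration, which maps well onto mappers and reducers
- The shuffle and sort phase delivers each message to its destination
vertex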
• Parallelization of PageRank
- Each vertex only needs its local topology and the damping factor
- No need to see the entire graph
- Create an adjacency list representation of the graph, where the key is
the id of a vertex and the value is the vertex's structure and metadata
- Metadata probably includes out-degree and internal state
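One PageRank iteration in this style can be sketched in plain Python. This is a single-process simulation under stated assumptions, not a Hadoop job: the function names and the `("structure", ...)` / `("rank", ...)` message tags are made up for illustration, the graph is the four-page example from the earlier slides, and sinks are not handled because that graph has none.

```python
from collections import defaultdict

DAMPING = 0.85  # value quoted in the slides


def pagerank_mapper(vertex_id, vertex):
    """vertex is {"rank": float, "out": [neighbor ids]}."""
    # Pass the graph structure along so the reducer can rebuild the vertex.
    yield vertex_id, ("structure", vertex["out"])
    # Send an equal share of this vertex's rank along each outgoing edge.
    for neighbor in vertex["out"]:
        yield neighbor, ("rank", vertex["rank"] / len(vertex["out"]))


def pagerank_reducer(vertex_id, messages, num_vertices):
    """Sum incoming shares and rebuild the vertex with its new rank."""
    out_links, incoming = [], 0.0
    for kind, payload in messages:
        if kind == "structure":
            out_links = payload
        else:
            incoming += payload
    rank = (1 - DAMPING) / num_vertices + DAMPING * incoming
    return {"rank": rank, "out": out_links}


def pagerank_iteration(graph):
    # Shuffle and sort: group every message by its destination vertex id.
    inbox = defaultdict(list)
    for vertex_id, vertex in graph.items():
        for key, message in pagerank_mapper(vertex_id, vertex):
            inbox[key].append(message)
    return {
        vertex_id: pagerank_reducer(vertex_id, messages, len(graph))
        for vertex_id, messages in sorted(inbox.items())
    }


graph = {
    "A": {"rank": 0.25, "out": ["B"]},
    "B": {"rank": 0.25, "out": ["C", "A"]},
    "C": {"rank": 0.25, "out": ["A"]},
    "D": {"rank": 0.25, "out": ["A", "B", "C"]},
}
graph = pagerank_iteration(graph)
print({v: round(data["rank"], 4) for v, data in graph.items()})
```

The mapper and reducer each see only one vertex and its messages, so neither needs the whole graph. Running `pagerank_iteration` repeatedly, feeding each output back in as input, approximates the iterative algorithm.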
Bibliography
• "PageRank." Wikipedia. Wikimedia Foundation, 26 Apr. 2015. Web. 03
May 2015.
• "MapReduce." Wikipedia. Wikimedia Foundation, 01 May 2015. Web.
03 May 2015.
• Lin, Jimmy, and Michael Schatz. "Design Patterns for Efficient Graph
Algorithms in MapReduce." Thesis. University of Maryland, College
Park, 2010. cs.wmich.edu. Web. 1 May 2015.
<https://cs.wmich.edu/gupta/teaching/cs5950/sumII10cloudComputing/graphAlgo%20in%20mapReduce%20paper%20p78-lin.pdf>.
Questions?