Design Patterns for Efficient Graph Algorithms in MapReduce
Jimmy Lin and Michael Schatz
(Slides by Tyler S. Randolph)

What is MapReduce?
• Definition: A programming model and an associated implementation for processing and generating large datasets with a parallel, distributed algorithm on a cluster
• 2 main parts
  - Mapper
  - Reducer
• 2 sub parts
  - Combiner
  - Partitioner

What is MapReduce?
1) Mappers are applied to the input
2) Combiners perform local aggregation
3) Partitioners send data to reducers
4) Reducers aggregate results
• Very parallelizable

Example
Step through a MapReduce job that returns the number of times each word length appears in the following sentence:
We should all take summer classes this year.
Write and label the outputs of the mapper, combiner, and reducer (no need for a partitioner with an example this small)

Example (continued)
Mapper
2: We
6: should
3: all
4: take
6: summer
7: classes
4: this
4: year

Example (continued)
Mapper (output sorted by key)
2: We
3: all
4: take
4: this
4: year
6: should
6: summer
7: classes

Example (continued)
Combiner
2: [We]
3: [all]
4: [take, this, year]
6: [should, summer]
7: [classes]

Example (continued)
Reducer
2: 1
3: 1
4: 3
6: 2
7: 1

“Message Passing” Graphs
• G = (V, E)
  - Graph = (Vertices, Edges)
  - directed graphs
• In-degree - how many vertices point to me
• Out-degree - how many vertices do I point to
• Metadata

PageRank
• Definition: Google’s main link-analysis algorithm, which counts the number and quality of links to a page to produce a rough estimate of how important the page is.
• Assumption - more important pages receive more links from other pages; really one big popularity contest
• Graph Topology - “physical” layout of the graph
  - what points to what

PageRank
• At each iteration…
  - Computations occur at every vertex as a function of the vertex’s internal state and the LOCAL graph structure
  - Partial results in the form of messages are “passed” via DIRECTED edges to each vertex’s neighbors
  - Computations occur at every vertex based on incoming partial results, potentially altering the vertex’s internal state

PageRank
Basic PageRank Algorithm
• PR(u) = sum of PR(v) / L(v) over every page v that links to u
• L(v) = number of outbound links on page v

Basic Example
Say A has a link to B, B has links to C and A, C has a link to A, and D has links to A, B, and C…

Basic Example (continued)
• Each page has a starting rank of 0.25
• PR(A) = (0.25 / L(B)) + (0.25 / L(C)) + (0.25 / L(D))
• B has 2 links, C has 1 link, D has 3 links
• PR(A) = (0.25 / 2) + (0.25 / 1) + (0.25 / 3)
• PR(A) = 11/24 = 0.4583…

Complications
• Need a way to deal with…
  - Random hops
  - Sinks

Damping Factor
• Probability that, at any step, the surfer keeps following links instead of jumping to a random page
• Usually set to 0.85
  - each page then receives (1 – 0.85) / N from random jumps, where N is the number of pages

Damping Factor
• PR(u) = (1 – d) / N + d × (sum of PR(v) / L(v) over every page v that links to u)

Tying It All Together
• Why MapReduce
  - Good for this type of calculation
  - Exploit the shuffle and sort phase to aid information passing
• Parallelization of PageRank
  - Only care about local topology and the damping factor
  - No need to worry about the entire picture
  - Create an adjacency list representation of the graph where the key is the vertex id and the value is the vertex’s structure and metadata
  - Metadata probably includes out-degree and internal state

Bibliography
• "PageRank." Wikipedia. Wikimedia Foundation, 26 Apr. 2015. Web. 03 May 2015.
• "MapReduce." Wikipedia. Wikimedia Foundation, 01 May 2015. Web. 03 May 2015.
• Lin, Jimmy, and Michael Schatz. "Design Patterns for Efficient Graph Algorithms in MapReduce." University of Maryland, College Park, 2010. Web. 1 May 2015. <https://cs.wmich.edu/gupta/teaching/cs5950/sumII10cloudComputing/graphAlgo%20in%20mapReduce%20paper%20p78-lin.pdf>.

Questions?
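
Backup: PageRank iteration sketch
The message-passing view of PageRank from the earlier slides can be sketched as one map step and one reduce step over the four-page graph from the Basic Example. This is a minimal, single-machine sketch and not the implementation from the paper: the names map_vertex, reduce_vertex, and pagerank_iteration are hypothetical, a plain dictionary stands in for the shuffle and sort phase, and the damping term uses the common (1 – d) / N form. Sinks are not handled; a page with no out-links simply sends no messages.

```python
from collections import defaultdict

# Adjacency list from the Basic Example slide: page -> pages it links to.
GRAPH = {
    "A": ["B"],
    "B": ["C", "A"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}


def map_vertex(page, rank, out_links):
    """Mapper: re-emit the graph structure and send a rank share to each neighbor."""
    yield page, ("links", out_links)
    for target in out_links:
        # Each neighbor receives an equal share of this page's current rank.
        yield target, ("share", rank / len(out_links))


def reduce_vertex(page, values, num_pages, damping):
    """Reducer: sum the incoming shares and apply the damping formula."""
    out_links, incoming = [], 0.0
    for kind, payload in values:
        if kind == "links":
            out_links = payload  # structure is carried forward for the next iteration
        else:
            incoming += payload
    new_rank = (1.0 - damping) / num_pages + damping * incoming
    return page, new_rank, out_links


def pagerank_iteration(graph, ranks, damping=0.85):
    """One simulated MapReduce job; the dict below stands in for shuffle and sort."""
    grouped = defaultdict(list)
    for page, out_links in graph.items():
        for key, value in map_vertex(page, ranks[page], out_links):
            grouped[key].append(value)
    new_ranks = {}
    for page in sorted(grouped):
        _, rank, _ = reduce_vertex(page, grouped[page], len(graph), damping)
        new_ranks[page] = rank
    return new_ranks


if __name__ == "__main__":
    start = {page: 0.25 for page in GRAPH}
    # damping=1.0 switches off random hops and reproduces the Basic Example.
    basic = pagerank_iteration(GRAPH, start, damping=1.0)
    print(round(basic["A"], 4))  # 0.4583
    damped = pagerank_iteration(GRAPH, start, damping=0.85)
    print({page: round(rank, 4) for page, rank in damped.items()})
```

Running further iterations means feeding the returned ranks back in as the next call's input. Emitting the adjacency list alongside the rank shares is what lets each reducer work from local topology alone.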