Parallel Graph Algorithms
Aydın Buluç, ABuluc@lbl.gov, http://gauss.cs.ucsb.edu/~aydin/
Lawrence Berkeley National Laboratory
CS267, Spring 2013. April 9, 2013.
Slide acknowledgments: K. Madduri, J. Gilbert, D. Delling, S. Beamer

Graph Preliminaries
Define: Graph G = (V, E), a set of vertices and a set of edges between vertices.
n = |V| (number of vertices)
m = |E| (number of edges)
D = diameter (maximum number of hops between any pair of vertices)
• Edges can be directed or undirected, weighted or not.
• They can even carry attributes (i.e., semantic graphs).
• A sequence of edges <u1,u2>, <u2,u3>, ..., <un-1,un> is a path from u1 to un. Its length is the sum of the weights of its edges.

Lecture Outline
• Applications
• Designing parallel graph algorithms
• Case studies:
  A. Graph traversals: Breadth-first search
  B. Shortest paths: Delta-stepping, PHAST, Floyd-Warshall
  C. Ranking: Betweenness centrality
  D. Maximal independent sets: Luby's algorithm
  E. Strongly connected components
• Wrap-up: challenges on current systems

Routing in transportation networks
• Point-to-point shortest paths on road networks: about 15 seconds with a naïve approach, about 10 microseconds with preprocessing-based methods.
H. Bast et al., "Fast Routing in Road Networks with Transit Nodes", Science, 2007.

Internet and the WWW
• The world-wide web can be represented as a directed graph
  - Web search and crawl: traversal
  - Link analysis, ranking: PageRank and HITS
  - Document classification and clustering
• Internet topologies (router networks) are naturally modeled as graphs

Scientific Computing
• Reorderings for sparse solvers
  - Fill-reducing orderings: partitioning, eigenvectors
  - Heavy diagonal to reduce pivoting (matching)
• Data structures for efficient exploitation of sparsity
• Derivative computations for optimization: graph colorings, spanning trees
• Preconditioning
  - Incomplete factorizations
  - Partitioning for domain decomposition
  - Graph techniques in algebraic multigrid: independent sets, matchings, etc.
  - Support theory: spanning trees and graph embedding techniques
[Figure: elimination example; the filled graph G+(A) is chordal]
Image source: Yifan Hu, "A gallery of large graphs"
B. Hendrickson, "Graphs and HPC: Lessons for Future Architectures", http://www.er.doe.gov/ascr/ascac/Meetings/Oct08/Hendrickson%20ASCAC.pdf

Large-scale data analysis
• Graph abstractions are very useful for analyzing complex data sets.
• Sources of data: petascale simulations, experimental devices, the Internet, sensor networks
• Challenges: data size, heterogeneity, uncertainty, data quality
  - Astrophysics: massive datasets, temporal variations
  - Bioinformatics: data quality, heterogeneity
  - Social informatics: new analytics challenges, data uncertainty
Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2,3) www.visualComplexity.com

Graph-theoretic problems in social networks
• Targeted advertising: clustering and centrality
• Studying the spread of information
Image source: Nexus (Facebook application)

Network Analysis for Neurosciences
Graph-theoretical models are used to predict the course of degenerative illnesses like Alzheimer's.
Vertices: ROIs (regions of interest). Edges: structural/functional connectivity.
Some disease indicators:
- Deviation from the small-world property
- Emergence of "epicenters" with disease-associated patterns

Research in Parallel Graph Algorithms
• Application areas and their methods/problems:
  - Social network analysis: find central entities, community detection, network dynamics
  - WWW: marketing, social search
  - Computational biology: gene regulation, metabolic pathways, genomics
  - Scientific computing: graph partitioning, matching, coloring, ...
  - Engineering: VLSI CAD, route planning, ...
• Graph algorithms: traversal, shortest paths, connectivity, max flow, ...
• Challenges: data size, problem complexity
• Architectures: GPUs, FPGAs; x86 multicore servers; massively multithreaded architectures (Cray XMT, Sun Niagara); multicore clusters (NERSC Hopper); clouds (Amazon EC2)

Characterizing graph-theoretic computations
• Input data and problem: find paths, clusters, partitions, matchings, patterns, orderings, ...
• Graph kernels: traversal, shortest-path algorithms, flow algorithms, spanning-tree algorithms, topological sort, ...
• Factors that influence the choice of algorithm: graph sparsity (m/n ratio), static/dynamic nature, weighted/unweighted and weight distribution, vertex degree distribution, directed/undirected, simple/multi/hyper graph, problem size, granularity of computation at nodes/edges, domain-specific characteristics
• Graph problems are often recast as sparse linear algebra (e.g., partitioning) or linear programming (e.g., matching) computations.

Lecture Outline (section transition: designing parallel graph algorithms)

The PRAM model
• Many PRAM graph algorithms were developed in the 1980s.
• Idealized parallel shared-memory system model
• Unbounded number of synchronous processors; no synchronization or communication cost; no parallel overhead
• Variants: EREW (Exclusive Read Exclusive Write), CREW (Concurrent Read Exclusive Write)
• Measuring performance: space and time complexity; total number of operations (work)

PRAM Pros and Cons
• Pros
  - Simple and clean semantics
  - The majority of theoretical parallel algorithms are designed using the PRAM model
  - Independent of the communication network topology
• Cons
  - Not realistic: the communication model is too powerful
  - Communication costs are ignored
  - Processors are fully synchronized
  - No local memory
  - Big-O notation is often misleading

Building blocks of classical PRAM graph algorithms
• Prefix sums
• Symmetry breaking
• Pointer jumping
• List ranking
• Euler tours
• Vertex collapse
• Tree contraction
[some covered in the "Tricks with Trees" lecture]

Work / Span Model
tp = execution time on p processors
t1 = work
t∞ = span (also called critical-path length or computational depth)
WORK LAW: tp ≥ t1 / p
SPAN LAW: tp ≥ t∞

Data structures: graph representation
Static case
• Dense graphs (m = Θ(n2)): adjacency matrix commonly used.
• Sparse graphs: adjacency lists, compressed sparse matrices
Dynamic case
• Representation depends on the common-case query
• Edge insertions or deletions? Vertex insertions or deletions? Edge weight updates?
• Graph update rate
• Queries: connectivity, paths, flow, etc.
• Optimizing for locality is a key design consideration.

Graph representations
Compressed sparse rows (CSR) = cache-efficient adjacency lists
• Row pointers: indices into the adjacency array, one entry per vertex (plus one).
• Adjacencies: the column ids (neighbor vertices) of each vertex, stored contiguously.
• Weights: numerical values stored alongside the adjacencies (for weighted graphs).
[Figure: a small weighted graph and its CSR arrays (row pointers, column ids, numerical values)]

Distributed graph representations
• Each processor stores the entire graph ("full replication")
• Each processor stores n/p vertices and all adjacencies out of these vertices ("1D partitioning")
• How to create these p vertex partitions?
  - Graph partitioning algorithms: recursively optimize for conductance (edge cut / size of the smaller partition)
  - Randomly shuffling the vertex identifiers ensures that the edge count per processor is roughly the same

2D checkerboard distribution
• Consider a logical 2D processor grid (pr * pc = p) and the matrix representation of the graph
• Assign each processor a sub-matrix (i.e., the edges within the sub-matrix)
[Figure: 9 vertices on a 3x3 processor grid; the flattened sparse matrix is divided into 3x3 sub-matrices, one per processor, each holding a local graph representation]

High-performance graph algorithms
• Implementations are typically array-based for performance (e.g., CSR representation).
• Concurrency: minimize synchronization (span).
• Where is the data? Find the distribution that minimizes inter-processor communication.
• Memory access patterns
  - Regularize them (spatial locality)
  - Minimize DRAM traffic (temporal locality)
• Work-efficiency
  - Is (parallel time) x (# of processors) = (serial work)?

Lecture Outline (section transition: case study A, graph traversals)

Graph traversal: depth-first search (DFS)
[Figure: DFS from a source vertex, with preorder vertex numbers]

procedure DFS(vertex v)
    v.visited = true
    previsit(v)
    for all w s.t. (v, w) ∈ E
        if (!w.visited) DFS(w)
    postvisit(v)

Parallelizing DFS is a bad idea: span(DFS) = O(n)
J.H. Reif. Depth-first search is inherently sequential. Information Processing Letters 20 (1985) 229-234.

Graph traversal: breadth-first search (BFS)
Input: a graph and a source vertex. Output: the distance of every vertex from the source.
[Figure: BFS from a source vertex, with distances from the source]
Memory requirements (# of machine words):
• Sparse graph representation: m + n
• Stack of visited vertices: n
• Distance array: n

Parallel BFS strategies
1. Expand the current frontier (level-synchronous approach, suited for low-diameter graphs)
   • O(D) parallel steps
   • Adjacencies of all vertices in the current frontier are visited in parallel
2. Stitch multiple concurrent traversals (Ullman-Yannakakis approach, suited for high-diameter graphs)
   • Path-limited searches from "super vertices"
   • APSP between "super vertices"

BFS is a key building block for many graph algorithms.
Common real-world graph properties:
• Scale-free: highly skewed (power-law) degree distribution
• Small-world: low effective diameter
Performance observations of the level-synchronous algorithm:
[Figure: percentage of total vertices in the frontier array per BFS phase, YouTube social network vs. a synthetic network]
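To make the CSR representation and the level-synchronous (top-down) BFS above concrete, here is a minimal Python sketch; the array and function names (row_ptr, col_ids, bfs_levels) are illustrative, not taken from the slides, and the inner loop over the frontier is where a parallel implementation extracts concurrency.

def bfs_levels(row_ptr, col_ids, source, n):
    """Return the BFS level (distance in hops) of every vertex, or -1 if unreached."""
    level = [-1] * n
    level[source] = 0
    frontier = [source]
    depth = 0
    while frontier:                      # one iteration per BFS level
        next_frontier = []
        for u in frontier:               # in a parallel BFS, this loop runs concurrently
            for j in range(row_ptr[u], row_ptr[u + 1]):
                v = col_ids[j]
                if level[v] == -1:       # claim an unvisited neighbor
                    level[v] = depth + 1
                    next_frontier.append(v)
        frontier = next_frontier
        depth += 1
    return level

# Example: the undirected 4-vertex path 0-1-2-3 in CSR form.
row_ptr = [0, 1, 3, 5, 6]
col_ids = [1, 0, 2, 1, 3, 2]
print(bfs_levels(row_ptr, col_ids, source=0, n=4))   # [0, 1, 2, 3]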
BFS implementation challenges
• Often little locality (spatial or temporal)
• Low arithmetic intensity
• Usually bottlenecked by the memory system
• Scale-free graphs are harder to load-balance
• Small-world graphs are harder to partition
[Figure: execution time per BFS phase; when the frontier is at its peak, almost all edge examinations "fail" to claim a child]

Insight: a sample search with the classic algorithm shows that many edges are examined unnecessarily.
• The classic algorithm is top-down. What if we went bottom-up?
[Figure: breakdown of neighbor/edge-check types per phase (sufficient checks vs. extra checks)]
(Slides: Scott Beamer, Aydın Buluç, David Patterson, Krste Asanović)

Bottom-up BFS algorithm
The classical (top-down) algorithm is optimal in the worst case, but pessimistic for low-diameter graphs (previous slide).
• Top-down: for all v in the frontier, attempt to parent all neighbors(v)
• Bottom-up: for all unvisited v, find any parent among neighbors(v) that is in the frontier
Direction-optimizing approach:
- Switch from top-down to bottom-up search when the majority of the vertices have been discovered. [Read the paper for the exact heuristic.]

Hybrid of top-down and bottom-up
• Combine both to get the best of both
• Top-down for small frontiers, bottom-up for large frontiers
[Figure: per-step time breakdown; the switch heuristic uses counts of frontier nodes and unexplored edges]
Scott Beamer, Krste Asanović, and David Patterson, "Direction-Optimizing Breadth-First Search", Int. Conf. on High Performance Computing, Networking, Storage and Analysis (SC), 2012.

Breadth-first search in the language of linear algebra
• Represent the frontier as a sparse vector x and the graph as a sparse matrix A; one BFS step is the sparse matrix-sparse vector product AT x.
• Replace the scalar operations: multiply becomes select, add becomes minimum.
• For parent computation, select the vertex with the minimum label as the parent.
[Figure: successive frontiers of a 7-vertex example expressed as AT x products, with the resulting parents array after each step]

1D parallel BFS algorithm
ALGORITHM (per BFS level):
1. Find owners of the current frontier's adjacency. [computation]
2. Exchange adjacencies via all-to-all. [communication]
3. Update distances/parents for unvisited vertices. [computation]

2D parallel BFS algorithm
ALGORITHM (per BFS level):
1. Gather vertices in the processor column. [communication]
2. Find owners of the current frontier's adjacency. [computation]
3. Exchange adjacencies in the processor row. [communication]
4. Update distances/parents for unvisited vertices. [computation]

BFS strong scaling
[Figure: GTEPS and communication time vs. number of cores (5,040 to 40,000) for 1D flat MPI, 1D hybrid, 2D flat MPI, and 2D hybrid variants]
• NERSC Hopper (Cray XE6, Gemini interconnect, AMD Magny-Cours)
• Hybrid: in-node 6-way OpenMP multithreading
• Kronecker (Graph500) input: 4 billion vertices and 64 billion edges
A. Buluç, K. Madduri. Parallel breadth-first search on distributed memory systems. Proc. Supercomputing, 2011.

Pseudocode for the parallel BFS algorithm with 2D partitioning (f and t are implemented as sparse vectors):
4:  while f ≠ ∅ do
5:      TRANSPOSEVECTOR(f_ij)
6:      f_i  ← ALLGATHERV(f_ij, P(:, j))
7:      t_i  ← A_ij ⊗ f_i
8:      t_ij ← ALLTOALLV(t_i, P(i, :))
9:      t_ij ← t_ij ⊙ π_ij
10:     π_ij ← π_ij + t_ij
11:     f_ij ← t_ij

For distributed vectors, the syntax v_ij denotes the local n/p sized piece of the vector owned by the P(i,j)th processor. The syntax v_i denotes the hypothetical n/pr sized piece of the vector collectively owned by all the processors along the ith processor row P(i,:). The algorithm has four major steps:
• Expand: for all locally owned edges, gather the originating vertices (line 6).
• Local discovery: find (partial) owners of the current frontier's adjacency (line 7).
• Fold: exchange discovered adjacencies (line 8).
• Local update: update distances/parents for unvisited vertices (lines 9, 10, and 11).
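As a rough illustration of the direction-optimizing idea above, here is a Python sketch that switches between a top-down and a bottom-up step. The switching rule (a simple frontier-size threshold, parameter alpha) and all names are assumptions; the published heuristic compares estimated edge-examination counts instead. The sketch assumes an undirected graph so that the same CSR arrays give both edge directions.

def bfs_direction_optimizing(row_ptr, col_ids, source, n, alpha=0.05):
    parent = [-1] * n
    parent[source] = source
    frontier = {source}
    while frontier:
        if len(frontier) < alpha * n:
            # Top-down step: scan edges out of the frontier.
            next_frontier = set()
            for u in frontier:
                for j in range(row_ptr[u], row_ptr[u + 1]):
                    v = col_ids[j]
                    if parent[v] == -1:
                        parent[v] = u
                        next_frontier.add(v)
        else:
            # Bottom-up step: every unvisited vertex looks for a parent in the frontier.
            next_frontier = set()
            for v in range(n):
                if parent[v] == -1:
                    for j in range(row_ptr[v], row_ptr[v + 1]):
                        u = col_ids[j]
                        if u in frontier:
                            parent[v] = u
                            next_frontier.add(v)
                            break        # any parent suffices; stop early
        frontier = next_frontier
    return parent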
Direction-optimizing BFS with 2D decomposition
• Achieved 243 GTEPS
• 115.6K cores on Hopper
[Figure: strong scaling with the Twitter graph]

Results on Titan
• ORNL Titan (Cray XK6, Gemini interconnect, AMD Interlagos)
• Kronecker (Graph500) input: 16 billion vertices and 256 billion edges
[Figures: weak scaling with Kronecker inputs, time breakdown, and sensitivity to degree]
Scott Beamer, Aydın Buluç, Krste Asanović, and David Patterson, "Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search", Workshop on Multithreaded Architectures and Applications (MTAAP), at the International Parallel & Distributed Processing Symposium (IPDPS), 2013.

Lecture Outline (section transition: case study B, shortest paths)

Parallel single-source shortest paths (SSSP) algorithms
• Famous serial algorithms:
  - Bellman-Ford: label-correcting; works on any graph
  - Dijkstra: label-setting; requires nonnegative edge weights
• No known PRAM algorithm runs in sub-linear time with O(m + n log n) work
• Ullman-Yannakakis randomized approach
• Meyer and Sanders: ∆-stepping algorithm
• Distributed-memory implementations based on graph partitioning
• Heuristics for load balancing and termination detection
K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, "An Experimental Study of a Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances," Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.

∆-stepping algorithm
• Label-correcting algorithm: can relax edges from unsettled vertices as well
• An "approximate bucket implementation of Dijkstra"
• For random edge weights in [0,1], runs in O(n + m + D·L), where L is the maximum distance from the source to any node
• Vertices are ordered using buckets of width ∆
• Each bucket may be processed in parallel
• ∆ < min w(e): degenerates into Dijkstra; ∆ > max w(e): degenerates into Bellman-Ford
U. Meyer and P. Sanders, ∆-stepping: a parallelizable shortest path algorithm. Journal of Algorithms 49 (2003).
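A minimal serial sketch of the ∆-stepping bucket structure follows; in a parallel implementation the relaxations generated from one bucket are applied concurrently. The graph format (graph[u] is a list of (v, weight) pairs) and all names are illustrative, not from the Meyer-Sanders paper.

import math
from collections import defaultdict

def delta_stepping(graph, source, delta):
    """Serial sketch of delta-stepping SSSP. graph[u] = list of (v, weight) pairs."""
    dist = defaultdict(lambda: math.inf)
    buckets = {}                                # bucket index -> set of tentative vertices

    def relax(v, d):
        if d < dist[v]:
            if dist[v] < math.inf:              # remove v from its old bucket, if any
                b = int(dist[v] // delta)
                if b in buckets:
                    buckets[b].discard(v)
                    if not buckets[b]:
                        del buckets[b]
            dist[v] = d
            buckets.setdefault(int(d // delta), set()).add(v)

    relax(source, 0.0)
    while buckets:
        i = min(buckets)                        # smallest non-empty bucket
        settled = set()
        while i in buckets:                     # light-edge phases: bucket i may refill
            frontier = buckets.pop(i)
            settled |= frontier
            requests = [(v, dist[u] + w) for u in frontier
                        for (v, w) in graph.get(u, []) if w <= delta]
            for v, d in requests:               # conceptually, these relaxations run in parallel
                relax(v, d)
        for u in settled:                       # heavy edges are relaxed once bucket i is empty
            for (v, w) in graph.get(u, []):
                if w > delta:
                    relax(v, dist[u] + w)
    return dict(dist)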
∆-stepping algorithm: illustration (∆ = 0.1)
One parallel phase:
while (current bucket is non-empty)
  i) Inspect light (w < ∆) edges
  ii) Construct a set of "requests" (R)
  iii) Clear the current bucket
  iv) Remember the deleted vertices (S)
  v) Relax the request pairs in R
Relax heavy request pairs (from S)
Go on to the next bucket

[Worked example on a 7-vertex graph with edge weights in [0,1]:
- Initialization: insert the source s = 0 into bucket 0 with d(0) = 0.
- Bucket 0, phase 1: the light edge out of vertex 0 generates request (2, 0.01); vertex 0 moves to S; relaxing the request puts vertex 2 into bucket 0 with d(2) = 0.01.
- Bucket 0, phase 2: light edges out of vertex 2 generate requests (1, 0.03) and (3, 0.06); vertex 2 moves to S; relaxing the requests puts vertices 1 and 3 into bucket 0.
- Subsequent phases process vertices 1 and 3, and heavy edges from the settled set S = {0, 2, 1, 3} are relaxed, giving tentative distances d = [0, 0.03, 0.01, 0.06, 0.16, 0.29, 0.62] for vertices 0 through 6 and placing vertices 4, 5, and 6 into buckets 1, 2, and 6, respectively.]

Number of phases (a machine-independent performance count)
[Figure: number of ∆-stepping phases for various graph families (Rnd-rnd, Rnd-logU, Scale-free, LGrid-rnd, LGrid-logU, SqGrid, USAd NE, USAt NE); low-diameter families need far fewer phases than high-diameter families]
Too many phases in high-diameter graphs: level-synchronous breadth-first search has the same problem.

Average shortest path weight for various graph families
• ~2^20 vertices, 2^22 edges, directed graph, edge weights normalized to [0,1]
• Complexity: O(n + m + D·L), where L is the maximum distance (shortest path weight) from the source
[Figure: average shortest path weight for the same graph families]

PHAST: hardware-accelerated shortest path trees
Preprocessing: build a contraction hierarchy
• Order nodes by importance (highway dimension)
• Process nodes in that order
• Add shortcuts to preserve distances
• Assign levels (ca. 150 in road networks)
• About a 75% increase in the number of edges (for road networks)
D. Delling, A. V. Goldberg, A. Nowatzyk, and R. F. Werneck. PHAST: Hardware-Accelerated Shortest Path Trees. In IPDPS. IEEE Computer Society, 2011.

PHAST: one-to-all search from source s
• Run a forward search from s
• Only follow edges to more important nodes
• Set distance labels d of reached nodes

PHAST: top-down sweep
• Process all nodes u in reverse level order:
  - Check incoming arcs (v,u) with lev(v) > lev(u)
  - Set d(u) = min{d(u), d(v) + w(v,u)}

PHAST: implementation
• Linear sweep without a priority queue
• Reorder nodes, arcs, and distance labels by level
• Accesses to the d array become contiguous (no jumps)
• Parallelize with SIMD (SSE instructions) and/or a GPU

PHAST: performance comparison
[Table: Europe and USA road networks, including a GPU implementation; results shown for edge weights as estimated travel times and as physical distances]

PHAST: caveats
• Specifically designed for road networks
• Fill-in can be much higher for other graphs (hint: think about sparse Gaussian elimination)
• Road networks are (almost) planar
• Planar graphs have O(√n) separators
• Good separators lead to orderings with minimal fill
Lipton, Richard J.; Tarjan, Robert E. (1979), "A separator theorem for planar graphs", SIAM Journal on Applied Mathematics 36 (2): 177-189.
Alan George. "Nested Dissection of a Regular Finite Element Mesh". SIAM Journal on Numerical Analysis, Vol. 10, No. 2 (Apr. 1973), 345-363.
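As an illustration of the PHAST downward sweep described above, here is a minimal Python sketch of the level-ordered linear scan. It assumes the contraction hierarchy and the upward search from the source have already been done; all names are illustrative.

import math

def phast_downward_sweep(nodes_by_desc_level, incoming, level, dist):
    """Propagate distance labels down the contraction hierarchy.

    nodes_by_desc_level: node ids sorted by decreasing level (reverse level order)
    incoming[u]: list of (v, w) arcs entering u
    level[v]: hierarchy level of node v
    dist: tentative distances, already set for nodes reached by the upward search
          (math.inf elsewhere); updated in place and returned
    """
    for u in nodes_by_desc_level:            # linear sweep, no priority queue
        for (v, w) in incoming.get(u, []):
            if level[v] > level[u]:          # only arcs from higher-level nodes matter
                if dist[v] + w < dist[u]:
                    dist[u] = dist[v] + w
    return dist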
All-pairs shortest-paths (APSP) problem
• Input: directed graph with "costs" on edges
• Find least-cost paths between all reachable vertex pairs
• Classical algorithm: Floyd-Warshall

for k = 1:n    // the induction sequence
    for i = 1:n
        for j = 1:n
            if (w(i→k) + w(k→j) < w(i→j))
                w(i→j) := w(i→k) + w(k→j)

[Figure: the k = 1 case, showing which entries of a 5x5 distance matrix are updated]
• It turns out that a previously overlooked recursive version is more parallelizable than the triple nested loop.

Recursive (blocked) formulation: partition the matrix as [[A, B], [C, D]] and work over the semiring where + is "min" and × is "add":

A = A*        % recursive call
B = AB
C = CA
D = D + CB
D = D*        % recursive call
B = BD
C = DC
A = A + BC

[Figures: a 6-vertex worked example showing the distance and parent matrices after each step; e.g., the cost of the 3-to-2 path is obtained as a(3,2) = a(3,1) + a(1,2), so P(3,2) = P(1,2), and a(1,3) = a(1,2) + a(2,3) gives P(1,3) = P(2,3), i.e., the path 1-2-3]

All-pairs shortest paths on the GPU
[Figure: performance of Floyd-Warshall ported to the GPU, a naïve recursive implementation, and the recursive formulation built on the right primitive (matrix multiply); speedups up to 480x]
A. Buluç, J. R. Gilbert, and C. Budak. Solving path problems on the GPU. Parallel Computing, 36(5-6):241-253, 2010.

Communication-avoiding APSP in distributed memory
• Bandwidth: W_bc-2.5D(n, p) = O(n² / √(cp))
• Latency: S_bc-2.5D(p) = O(√(cp) · log²(p))
• c: number of replicas
• Optimal for any memory size
[Figure: GFlops vs. number of compute nodes (1 to 1024) for n = 4096 and n = 8192, with replication factors c = 1, 4, 16]
E. Solomonik, A. Buluç, and J. Demmel. Minimizing communication in all-pairs shortest paths. In Proceedings of the IPDPS, 2013.

Lecture Outline (section transition: case study C, ranking / betweenness centrality)

Betweenness Centrality (BC)
• Centrality: a quantitative measure to capture the importance of a vertex/edge in a graph
  - degree, closeness, eigenvalue, betweenness
• Betweenness centrality: BC(v) = Σ_{s≠v≠t} σ_st(v) / σ_st
  (σ_st: number of shortest paths between s and t; σ_st(v): number of those passing through v)
• Applied to several real-world networks:
  - Social interactions
  - WWW
  - Epidemiology
  - Systems biology

Algorithms for computing betweenness
• All-pairs shortest-path approach: compute the length and number of shortest paths between all s-t pairs (O(n³) time), then sum up the fractional dependency values (O(n²) space).
• Brandes' algorithm (2001): augment a single-source shortest path computation to count paths; uses the Bellman criterion; O(mn) work and O(m+n) space on unweighted graphs.
• Dependency of a source on v: δ(v) = Σ_{t∈V} δ_st(v), computed by the recurrence
  δ(v) = Σ_{w : v ∈ pred(w)} (σ(v) / σ(w)) · (1 + δ(w)),
  where σ(w) is the number of shortest paths from the source to w.
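To make the two phases of Brandes' algorithm concrete (path counting by BFS, then dependency accumulation in reverse BFS order), here is a minimal serial Python sketch for unweighted graphs; the adjacency-list format and names are illustrative.

from collections import deque

def betweenness_centrality(adj, n):
    """Brandes-style BC for an unweighted graph given as adjacency lists adj[v]."""
    bc = [0.0] * n
    for s in range(n):
        # Phase 1: BFS from s, counting shortest paths (sigma) and recording predecessors.
        sigma = [0] * n
        dist = [-1] * n
        pred = [[] for _ in range(n)]
        sigma[s], dist[s] = 1, 0
        order = []                        # vertices in non-decreasing distance from s
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] == -1:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        # Phase 2: accumulate dependencies in reverse BFS order.
        delta = [0.0] * n
        for w in reversed(order):
            for v in pred[w]:
                delta[v] += (sigma[v] / sigma[w]) * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc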
Parallel BC algorithms for unweighted graphs
• High-level idea: level-synchronous parallel breadth-first search augmented to compute centrality scores
• Two steps (both implemented as BFS):
  - traversal and path counting
  - dependency accumulation

Shared-memory parallel algorithms for betweenness centrality
• Exact algorithm: O(mn) work, O(m+n) space, O(nD + nm/p) time.
  D.A. Bader and K. Madduri. Parallel algorithms for evaluating centrality indices in real-world networks. ICPP 2008.
• Improved with lower synchronization overhead and fewer non-contiguous memory references.
  K. Madduri, D. Ediger, K. Jiang, D.A. Bader, and D. Chavarria-Miranda. A faster parallel algorithm and efficient multithreaded implementations for evaluating betweenness centrality on massive datasets. MTAAP 2009.

Parallel BC algorithm: illustration
1. Traversal step: visit adjacent vertices, update distance and path counts. At the end, we have all reachable vertices, their corresponding predecessor multisets, and the distance values. Level-synchronous approach: the adjacencies of all vertices in the current frontier can be visited in parallel.
2. Accumulation step: pop vertices from the stack and update dependence scores,
   δ(v) = Σ_{w : v ∈ P(w)} (σ(v) / σ(w)) · (1 + δ(w)).
   This step can also be done in a level-synchronous manner.
[Figures: traversal and accumulation steps on a 10-vertex example, showing the stack S, predecessor lists P, distances D, and delta values]

Distributed-memory BC algorithm
• Work-efficient parallel breadth-first search via parallel sparse matrix-matrix multiplication over semirings
• Encapsulates three levels of parallelism:
  1. columns(B): multiple BFS searches in parallel
  2. columns(AT) + rows(B): parallel over frontier vertices in each BFS
  3. rows(AT): parallel over incident edges of each frontier vertex
A. Buluç, J.R. Gilbert. The Combinatorial BLAS: Design, implementation, and applications. The International Journal of High Performance Computing Applications, 2011.

Lecture Outline (section transition: case study D, maximal independent sets)

Maximal Independent Set
• Graph with vertices V = {1, 2, ..., n}
• A set S of vertices is independent if no two vertices in S are neighbors.
• An independent set S is maximal if it is impossible to add another vertex and stay independent.
• An independent set S is maximum if no other independent set has more vertices.
• Finding a maximum independent set is intractably difficult (NP-hard).
• Finding a maximal independent set is easy, at least on one processor.
[Example: in the 8-vertex example graph, the set of red vertices S = {4, 5} is independent and maximal, but not maximum]

Sequential maximal independent set algorithm
1. S = empty set;
2. for vertex v = 1 to n {
3.     if (v has no neighbor in S) {
4.         add v to S
5.     }
6. }
[Illustration on the example graph: S grows as {} → {1} → {1, 5} → {1, 5, 6}]
Work ~ O(n), but span ~ O(n).

Parallel, randomized MIS algorithm (Luby's algorithm)
1. S = empty set; C = V;
2. while C is not empty {
3.     label each v in C with a random r(v);
4.     for all v in C in parallel {
5.         if r(v) < min( r(neighbors of v) ) {
6.             move v from C to S;
7.             remove neighbors of v from C;
8.         }
9.     }
10. }
[Illustration on the example graph: with the first set of random labels, vertices 1 and 5 are local minima and enter S, and their neighbors leave C, so S = {1, 5} and C = {6, 8}; with the second set of labels, vertex 8 enters S and C becomes empty, giving S = {1, 5, 8}]
M. Luby. "A Simple Parallel Algorithm for the Maximal Independent Set Problem". SIAM Journal on Computing 15 (4): 1036-1053, 1986.
Theorem: this algorithm "very probably" finishes within O(log n) rounds.
Work ~ O(n log n), but span ~ O(log n).
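A minimal Python sketch of Luby's algorithm as described above; the per-round parallel loop is written serially here, and the graph format and names are illustrative.

import random

def luby_mis(adj, n, seed=0):
    """Luby-style maximal independent set. adj[v] is the list/set of neighbors of v."""
    rng = random.Random(seed)
    S = set()
    C = set(range(n))
    while C:
        r = {v: rng.random() for v in C}        # fresh random labels each round
        winners = {v for v in C
                   if all(r[v] < r[u] for u in adj[v] if u in C)}
        S |= winners                            # local minima join the independent set
        removed = set(winners)
        for v in winners:                       # neighbors of winners leave C
            removed |= set(adj[v])
        C -= removed
    return S

# Example usage on a hypothetical 4-cycle 0-1-2-3-0:
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
print(luby_mis(adj, 4))                         # e.g. {0, 2} or {1, 3}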
Lecture Outline (section transition: case study E, strongly connected components)

Strongly connected components (SCC)
[Figure: a directed graph, its SCCs, and the symmetric permutation of its adjacency matrix to block triangular form]
• Symmetric permutation to block triangular form
• Find the permutation P in linear time by depth-first search
Tarjan, R. E. (1972), "Depth-first search and linear graph algorithms", SIAM Journal on Computing 1 (2): 146-160.

Strongly connected components of a directed graph
• Sequential: use depth-first search (Tarjan); work = O(m+n) for m = |E|, n = |V|.
• DFS seems to be inherently sequential.
• Parallel: divide-and-conquer and BFS (Fleischer et al.); worst-case span O(n), but good in practice on many graphs.
L. Fleischer, B. Hendrickson, and A. Pınar. On identifying strongly connected components in parallel. Parallel and Distributed Processing, pages 505-511, 2000.

Fleischer/Hendrickson/Pınar algorithm
• Partition the given graph into three disjoint subgraphs; each can be processed independently/recursively.
• FW(v): vertices reachable from vertex v. BW(v): vertices from which v is reachable.
• Lemma: FW(v) ∩ BW(v) is a unique SCC for any v. For every other SCC s, either
  (a) s ⊂ FW(v) \ BW(v),
  (b) s ⊂ BW(v) \ FW(v), or
  (c) s ⊂ V \ (FW(v) ∪ BW(v)).
(A code sketch of this decomposition appears before the conclusions below.)

Lecture Outline (section transition: wrap-up, challenges on current systems)

The locality challenge
"Large memory footprint, low spatial and temporal locality impede performance."
• Serial performance of "approximate betweenness centrality" on a 2.67 GHz Intel Xeon 5560 (12 GB RAM, 8 MB L3 cache)
• Input: synthetic R-MAT graphs (# of edges m = 8n)
[Figure: performance rate (million traversed edges per second) vs. problem size (log2 # of vertices, 10 to 26); roughly a 5x drop in performance between the regime with no last-level cache (LLC) misses and the regime with O(m) LLC misses]

The parallel scaling challenge
"Classical parallel graph algorithms perform poorly on current parallel systems."
• Graph topology assumptions in classical algorithms do not match real-world datasets.
• Parallelization strategies are at loggerheads with techniques for enhancing memory locality.
• Classical "work-efficient" graph algorithms may not fully exploit new architectural features
  - Increasing complexity of the memory hierarchy, processor heterogeneity, wide SIMD.
• Tuning an implementation to minimize parallel overhead is nontrivial
  - Shared memory: minimizing the overhead of locks and barriers.
  - Distributed memory: bounding message buffer sizes, bundling messages, overlapping communication with computation.

Designing parallel algorithms for large sparse graph analysis
• System requirements: high (on-chip memory, DRAM, network, I/O) bandwidth.
• Solution: efficiently utilize the available memory bandwidth; algorithmic innovation to avoid corner cases.
[Figure: work performed (O(n), O(n log n), O(n²)) vs. problem size (10^4 up to peta-scale # of vertices/edges); "RandomAccess"-like problems call for improved locality and data reduction/compression, while "Stream"-like problems call for faster methods]
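Returning to case study E: here is a minimal serial Python sketch of the FW/BW decomposition used by the Fleischer et al. algorithm. The reachability searches are written as BFS, the recursion is not parallelized here, and the graph format and names are illustrative.

from collections import deque

def reachable(vertices, edges, start):
    """Vertices of `vertices` reachable from `start` following `edges` (dict of adjacency lists)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in edges.get(u, []):
            if v in vertices and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def scc_fwbw(vertices, out_edges, in_edges):
    """Return the list of SCCs (as sets) of the subgraph induced by `vertices`."""
    if not vertices:
        return []
    pivot = next(iter(vertices))
    fw = reachable(vertices, out_edges, pivot)      # FW(v): reachable from the pivot
    bw = reachable(vertices, in_edges, pivot)       # BW(v): can reach the pivot
    scc = fw & bw                                   # FW(v) ∩ BW(v) is one SCC
    sccs = [scc]
    # The three remaining disjoint pieces can be processed independently (and recursively).
    for piece in (fw - scc, bw - scc, vertices - (fw | bw)):
        sccs.extend(scc_fwbw(piece, out_edges, in_edges))
    return sccs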
Conclusions
• The best algorithm depends on the input.
• Communication costs (and hence data distribution) are crucial for distributed memory.
• Locality is crucial for in-node performance.
• Graph traversal (BFS) is a fundamental building block for more complex algorithms.
• The best parallel algorithm can be significantly different from the best serial algorithm.