All-Pairs Shortest Paths
Csc8530 – Dr. Prasad
Jon A. Preston
March 17, 2004

Outline
• Review of graph theory
• Problem definition
• Sequential algorithms
• Properties of interest
• Parallel algorithm
• Analysis
• Recent research
• References

Graph Terminology
• G = (V, E)
• W = weight matrix
  – $w_{ij}$ = weight/length of edge $(v_i, v_j)$
  – $w_{ij} = \infty$ if $v_i$ and $v_j$ are not connected by an edge
  – $w_{ii} = 0$
• Assume W has positive, zero, and negative values
• For this problem, we cannot have a negative-sum cycle in G

Weighted Graph and Weight Matrix
[Figure: an undirected weighted graph on vertices $v_0$–$v_4$ and its symmetric 5 × 5 weight matrix W]

Directed Weighted Graph and Weight Matrix
[Figure: a directed weighted graph on vertices $v_0$–$v_5$ and its (asymmetric) 6 × 6 weight matrix W; this graph is reused in the examples below]

All-Pairs Shortest Paths Problem Defined
• For every pair of vertices $v_i$ and $v_j$ in V, it is required to find the length of the shortest path from $v_i$ to $v_j$ along edges in E.
• Specifically, a matrix D is to be constructed such that $d_{ij}$ is the length of the shortest path from $v_i$ to $v_j$ in G, for all i and j.
• The length of a path (or cycle) is the sum of the lengths (weights) of the edges forming it.

Sample Shortest Path
[Figure: the directed example graph on $v_0$–$v_5$]
The shortest path from $v_0$ to $v_4$ is along edges $(v_0, v_1)$, $(v_1, v_2)$, $(v_2, v_4)$ and has length 6.

Disallowing Negative-Length Cycles
• APSP does not allow the input to contain negative-length cycles.
• This is necessary because, if such a cycle existed within a path from $v_i$ to $v_j$, one could traverse the cycle indefinitely, producing paths of ever shorter length from $v_i$ to $v_j$.
• If a negative-length cycle existed, then all paths containing it would have a length of $-\infty$.

Recent Work on Sequential Algorithms
• Floyd-Warshall algorithm is $\Theta(V^3)$
  – Appropriate for dense graphs: $|E| = O(|V|^2)$
• Johnson's algorithm
  – Appropriate for sparse graphs: $|E| = O(|V|)$
  – $O(V^2 \log V + VE)$ if using a Fibonacci heap
  – $O(VE \log V)$ if using a binary min-heap
• Shoshan and Zwick (1999), based on Strassen-style fast matrix multiplication
  – Integer edge weights in {1, 2, …, W}
  – $O(W\,V^{\omega}\,p(V, W))$, where $\omega \le 2.376$ and p is a polylog function
• Pettie (2002)
  – Allows real-weighted edges
  – $O(V^2 \log\log V + VE)$

Properties of Interest
• Let $d_{ij}^{(k)}$ denote the length of the shortest path from $v_i$ to $v_j$ that goes through at most $k - 1$ intermediate vertices (k hops).
• $d_{ij}^{(1)} = w_{ij}$ (edge length from $v_i$ to $v_j$)
• If $i \ne j$ and there is no edge from $v_i$ to $v_j$, then $d_{ij}^{(1)} = w_{ij} = \infty$
• Also, $d_{ii}^{(1)} = w_{ii} = 0$
• Given that there are no negative-weight cycles in G, there is no advantage in visiting any vertex more than once in the shortest path from $v_i$ to $v_j$.
• Since there are only n vertices in G, $d_{ij} = d_{ij}^{(n-1)}$.
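These properties already determine a naive sequential scheme: start from $D^{(1)} = W$ and relax one extra hop per iteration until $D^{(n-1)}$ is reached (the recurrence behind each iteration is derived on the next slides). Below is a minimal Python sketch of that scheme, my illustration rather than anything from the slides, assuming an adjacency-matrix input with math.inf for missing edges.

```python
import math

def hop_limited_apsp(W):
    """Compute D = D^(n-1) from D^(1) = W by relaxing one extra hop
    per iteration: d_ij^(k) = min over l of (d_il^(k-1) + w_lj).
    Since w_jj = 0, the l = j term preserves the old value d_ij^(k-1).
    W: n x n list of lists; W[i][j] = math.inf if no edge, W[i][i] = 0."""
    n = len(W)
    D = [row[:] for row in W]                  # D^(1)
    for _ in range(n - 2):                     # D^(2), ..., D^(n-1)
        D = [[min(D[i][l] + W[l][j] for l in range(n))
              for j in range(n)]
             for i in range(n)]
    return D

inf = math.inf
print(hop_limited_apsp([[0, 1, inf],
                        [inf, 0, 2],
                        [3, inf, 0]]))  # [[0, 1, 3], [5, 0, 2], [3, 4, 0]]
```

Each of the n - 2 iterations costs $O(n^3)$, so this naive form is $O(n^4)$; the repeated-squaring and Floyd variants discussed later improve on it.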
Guaranteeing Shortest Paths
• If the shortest path from $v_i$ to $v_j$ contains $v_r$ and $v_s$ (where $v_r$ precedes $v_s$), then the sub-path from $v_r$ to $v_s$ must itself be minimal (or it wouldn't be part of the shortest path).
• Thus, to obtain the shortest path from $v_i$ to $v_j$, we can compute all combinations of optimal sub-paths (whose concatenation is a path from $v_i$ to $v_j$), and then select the shortest one.
[Figure: $v_i \to v_r \to v_s \to v_j$, each segment a MIN; the overall path is the MIN over sums of MINs]

Iteratively Building Shortest Paths
[Figure: paths from $v_i$ to each $v_l$ of length $d_{il}^{(k-1)}$, each followed by the edge $(v_l, v_j)$ of weight $w_{lj}$]
$$d_{ij}^{(k)} = \min\left( d_{ij}^{(k-1)},\ \min_{1 \le l \le n}\left\{ d_{il}^{(k-1)} + w_{lj} \right\} \right) = \min_{1 \le l \le n}\left\{ d_{il}^{(k-1)} + w_{lj} \right\}$$
(The two forms agree because $w_{jj} = 0$, so the $l = j$ term already preserves $d_{ij}^{(k-1)}$.)

Recurrence Definition
• For $k > 1$, $d_{ij}^{(k)} = \min_{l}\left( d_{il}^{(k/2)} + d_{lj}^{(k/2)} \right)$
[Figure: $v_i \to v_l$ using at most k/2 vertices, then $v_l \to v_j$ using at most k/2 vertices, at most k vertices in total]
• Guarantees $O(\log k)$ steps to calculate $d_{ij}^{(k)}$

Similarity
$$d_{ij}^{(k)} = \min_{1 \le l \le n}\left\{ d_{il}^{(k-1)} + w_{lj} \right\} \qquad \text{vs.} \qquad C_{ij} = \sum_{l=1}^{n} A_{il} \cdot B_{lj}$$

Computing D
• Let $D_k$ = the matrix with entries $d_{ij}^{(k)}$ for $0 \le i, j \le n - 1$.
• Given $D_1$, compute $D_2, D_4, \ldots, D_m$, where $m = 2^{\lceil \log(n-1) \rceil}$.
• $D = D_m$
• To calculate $D_k$ from $D_{k/2}$, use a special form of matrix multiplication: '×' → '+' and '+' → 'min' (a sequential sketch of this min-plus squaring appears after the "Recent Research" slide below).

"Modified" Matrix Multiplication
Step 2: for r = 0 to N - 1 dopar
          C_r = A_r + B_r
        end for
Step 3: for m = 2q to 3q - 1 do
          for all r in N (r_m = 0) dopar
            C_r = min(C_r, C_r(m))
          end for
        end for

"Modified" Example
[Figure: three slides tracing 2 × 2 matrices A and B through the hypercube algorithm on processors P_000–P_111 (cf. Akl §9.2), showing the register contents after step (1.3), after modified step 2, and after the pairwise MINs of modified step 3]

Hypercube Setup
• Begin with a hypercube of $n^3$ processors
  – Each has registers A, B, and C
  – Arrange them in an $n \times n \times n$ array (cube)
• Set $A(0, j, k) = w_{jk}$ for $0 \le j, k \le n - 1$
  – i.e., the processors in positions (0, j, k) contain $D_1 = W$
• When done, $C(0, j, k)$ contains APSP $= D_m$

Setup Example
[Figure: the directed example graph on $v_0$–$v_5$ and its weight matrix, loaded as $D_1 = W_{jk} = A(0, j, k)$]

APSP Parallel Algorithm
Algorithm HYPERCUBE SHORTEST PATH (A, C)
Step 1: for j = 0 to n - 1 dopar
          for k = 0 to n - 1 dopar
            B(0, j, k) = A(0, j, k)
          end for
        end for
Step 2: for i = 1 to ⌈log(n - 1)⌉ do
          (2.1) HYPERCUBE MATRIX MULTIPLICATION(A, B, C)
          (2.2) for j = 0 to n - 1 dopar
                  for k = 0 to n - 1 dopar
                    (i)  A(0, j, k) = C(0, j, k)
                    (ii) B(0, j, k) = C(0, j, k)
                  end for
                end for
        end for

An Example
[Figure: the successive distance matrices $D_1$, $D_2$, $D_4$, $D_8$ for the directed example graph; since n = 6, $D_8$ holds the final all-pairs shortest path lengths]

Analysis
• Steps 1 and (2.2) require constant time.
• There are $\lceil \log(n - 1) \rceil$ iterations of Step (2.1), each requiring $O(\log n)$ time.
• The overall running time is $t(n) = O(\log^2 n)$.
• $p(n) = n^3$
• Cost is $c(n) = p(n)\,t(n) = O(n^3 \log^2 n)$.
• Efficiency is $E = \dfrac{T_1}{c(n)} = \dfrac{O(n^3)}{O(n^3 \log^2 n)} = \dfrac{1}{O(\log^2 n)}$

Recent Research
• Jenq and Sahni (1987) compared various parallel algorithms for solving APSP empirically.
• Kumar and Singh (1991) used the isoefficiency metric (developed by Kumar and Rao) to analyze the scalability of parallel APSP algorithms
  – Hardware vs. scalability
  – Memory vs. scalability
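As promised under "Computing D", here is a compact sequential sketch (Python; my illustration, not the hypercube code itself) of the min-plus squaring scheme: the same '×' → '+', '+' → 'min' substitution that the parallel algorithm applies inside HYPERCUBE MATRIX MULTIPLICATION.

```python
def min_plus(A, B):
    """'Modified' matrix multiplication: multiply becomes add, add becomes min."""
    n = len(A)
    return [[min(A[i][l] + B[l][j] for l in range(n))
             for j in range(n)]
            for i in range(n)]

def apsp_by_squaring(W):
    """D_1 = W; square via min-plus until D_m with m = 2^ceil(log(n-1)).
    A serial analogue of Step 2 of HYPERCUBE SHORTEST PATH."""
    n = len(W)
    D, k = [row[:] for row in W], 1
    while k < n - 1:            # ceil(log(n-1)) squarings in total
        D = min_plus(D, D)      # D_{2k} from D_k
        k *= 2
    return D
```

Each squaring costs $O(n^3)$ serially, so the total is $O(n^3 \log n)$; the hypercube version instead spends $O(\log n)$ time per squaring on $n^3$ processors, giving the $O(\log^2 n)$ running time in the analysis above.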
Isoefficiency
• For "scalable" algorithms (those whose efficiency increases monotonically as p remains constant and the problem size increases), efficiency can be maintained for an increasing number of processors provided that the problem size also increases.
• Relates the problem size to the number of processors necessary for the speedup to increase in proportion to the number of processors used.

Isoefficiency (cont)
• Given an architecture, defines the "degree of scalability".
• Tells us the required growth in problem size to be able to efficiently utilize an increasing number of processors.
• Example: given an isoefficiency of $kp^3$:
  – with $p_0$ processors and problem size $w_0$, suppose speedup $= 0.8\,p_0$ (efficiency $= 0.8$);
  – then with $p_1 = 2p_0$, maintaining efficiency 0.8 requires $w_1 = 2^3 w_0 = 8w_0$.
• Indicates the superiority of one algorithm over another only when problem sizes are increased in the range between the two isoefficiency functions.

Memory Overhead Factor (MOF)
• The ratio: total memory required across all processors / memory required for the same problem size on a single processor.
• We'd like this to be low!

Architectures Discussed
• Shared Memory (CREW)
• Hypercube (Cube)
• Mesh
• Mesh with Cut-Through Routing
• Mesh with Cut-Through and Multicast Routing
• Also examined fast and slow communication technologies

Parallel APSP Algorithms
• Floyd Checkerboard
• Floyd Pipelined Checkerboard
• Floyd Striped
• Dijkstra Source-Partition
• Dijkstra Source-Parallel

General Parallel Algorithm (Floyd)
Repeat steps 1 through 4 for k := 1 to n
  Step 1: If this processor has a segment of $P_{k-1}[*, k]$, then transmit it to all processors that need it.
  Step 2: If this processor has a segment of $P_{k-1}[k, *]$, then transmit it to all processors that need it.
  Step 3: Wait until the needed segments of $P_{k-1}[*, k]$ and $P_{k-1}[k, *]$ have been received.
  Step 4: For all i, j in this processor's partition, compute $P_k[i, j] := \min\{P_{k-1}[i, j],\ P_{k-1}[i, k] + P_{k-1}[k, j]\}$
(A serial sketch of this step-4 update appears after the partitioning schemes below.)

Floyd Checkerboard
[Figure: the n × n matrix divided into blocks of size (n/√p) × (n/√p)]
Each block ("cell") is assigned to a different processor, and this processor is responsible for updating the cost matrix values in its block at each iteration of the Floyd algorithm. Steps 1 and 2 of the GPF involve each of the p processors sending its data to the "neighbor" columns and rows.

Floyd Pipelined Checkerboard
[Figure: the same (n/√p) × (n/√p) block partitioning]
Similar to the preceding. Steps 1 and 2 of the GPF involve each of the p processors sending its data to the "neighbor" columns and rows. The difference is that the processors are not synchronized: each computes and sends data as soon as possible (sending as soon as it receives).

Floyd Striped
[Figure: the n × n matrix divided into p stripes of n/p columns each]
Each column stripe is assigned to a different processor, and this processor is responsible for updating the cost matrix values in its columns at each iteration of the Floyd algorithm. Step 1 of the GPF involves each of the p processors sending its data to the "neighbor" columns. Step 2 is not needed, since each column is contained entirely within one processor.
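For reference, the serial computation that GPF step 4 distributes is ordinary Floyd's algorithm. A minimal Python sketch follows (my illustration; the slides give only the parallel pseudocode). In the parallel variants, row k and column k of $P_{k-1}$ are exactly what steps 1–3 communicate before each processor applies this update to its own partition of (i, j) cells.

```python
def floyd(W):
    """Serial Floyd: after iteration k, P[i][j] is the shortest i -> j
    distance using only intermediate vertices from {v_0, ..., v_k}."""
    n = len(W)
    P = [row[:] for row in W]                  # P_0 = W
    for k in range(n):
        row_k = P[k][:]                        # what GPF step 2 would broadcast
        col_k = [P[i][k] for i in range(n)]    # what GPF step 1 would broadcast
        for i in range(n):
            for j in range(n):
                P[i][j] = min(P[i][j], col_k[i] + row_k[j])
        # Copying row k / column k mirrors the communication structure and is
        # safe: entries in row k and column k are not improved in iteration k.
    return P
```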
Dijkstra Source-Partition
• Distributes the n runs of Dijkstra's single-source shortest path algorithm equally over the p processors and executes them in parallel.
• Each processor finds the shortest paths from each vertex in its set to all other vertices in the graph.
• Fortunately, this approach involves no inter-processor communication.
• Unfortunately, only n processors can be kept busy.
• Also, the memory overhead is high, since each processor has a copy of the weight matrix.

Dijkstra Source-Parallel
• Motivated by keeping more processors busy.
• Run n copies of Dijkstra's SSSP, each copy on p/n processors (p > n).
[Figure: the p processors divided into n groups of p/n processors each]

Calculating Isoefficiency
• Example: Floyd Checkerboard
• At most $n^2$ processors can be kept busy.
• n must grow as $\Theta(\sqrt{p})$ due to the problem structure.
• By Floyd (sequential), $T_e = \Theta(n^3)$.
• Thus the isoefficiency is $(\sqrt{p})^3 = \Theta(p^{1.5})$.
• But what about communication?

Calculating Isoefficiency (cont)
• $t_s$ = message startup time
• $t_w$ = per-word communication time
• $t_c$ = time to compute the next iteration value for one cell of the matrix
• m = number of words sent
• d = number of hops between nodes
• Hypercube:
  – $(t_s + t_w m)\log d$ = time to deliver m words
  – $2(t_s + t_w m)\log p$ = barrier synchronization time (up and down the "tree")
  – $d = \sqrt{p}$
  – Step 1 = $(t_s + t_w n/\sqrt{p})\log\sqrt{p}$
  – Step 2 = $(t_s + t_w n/\sqrt{p})\log\sqrt{p}$
  – Step 3 (barrier synch) = $2(t_s + t_w)\log p$
  – Step 4 = $t_c n^2/p$
$$T_p = n\left[\,2\left(t_s + t_w\frac{n}{\sqrt{p}}\right)\log\sqrt{p} + 2(t_s + t_w)\log p + t_c\frac{n^2}{p}\,\right]$$
Isoefficiency = $\Theta(p^{1.5}(\log p)^3)$

Mathematical Details
$$T_o = pT_p - T_e$$
$$T_o = p\,n\left[\,2\left(t_s + t_w\frac{n}{\sqrt{p}}\right)\log\sqrt{p} + 2(t_s + t_w)\log p + t_c\frac{n^2}{p}\,\right] - t_c n^3$$
$$T_o = (3t_s + 2t_w)\,np\log p + t_w n^2\sqrt{p}\log p$$
How are n and p related?

Mathematical Details (cont)
Setting $t_c n^3 = K\left[(3t_s + 2t_w)\,np\log p + t_w n^2\sqrt{p}\log p\right]$, where $K = \frac{E}{1 - E}$:
• balancing against the first term gives $n^3 = \Theta((p\log p)^{1.5})$;
• balancing against the second gives $n^3 = \Theta(p^{1.5}(\log p)^3)$.
The second term dominates, so the isoefficiency is $\Theta(p^{1.5}(\log p)^3)$.

Calculating Isoefficiency (cont)
• Mesh (store-and-forward): delivering m words over d hops costs $(t_s + t_w m)\,d$, with $d = \sqrt{p}$:
  – Step 1 = $(t_s + t_w n/\sqrt{p})\sqrt{p}$
  – Step 2 = $(t_s + t_w n/\sqrt{p})\sqrt{p}$
  – Step 3 (barrier synch) = $2(t_s + t_w)\sqrt{p}$
  – Step 4 = $t_c n^2/p$
• $T_e = t_c n^3$, and $T_p$ = n × [comm/sync terms + $t_c n^2/p$] as before.
• Isoefficiency = $\Theta(p^3 + p^{2.25}) = \Theta(p^3)$

Isoefficiency and MOF for Algorithm & Architecture Combinations

Base Algorithm | Parallel Variant        | Architecture                        | Isoefficiency      | MOF
Dijkstra       | Source-Partitioned      | SM, Cube, Mesh, Mesh-CT, Mesh-CT-MC | p^3                | p
Dijkstra       | Source-Parallel         | SM, Cube                            | (p log p)^1.5      | n
Dijkstra       | Source-Parallel         | Mesh, Mesh-CT, Mesh-CT-MC           | p^1.8              | n
Floyd          | Stripe                  | SM                                  | p^3                | 1
Floyd          | Stripe                  | Cube                                | (p log p)^3        | 1
Floyd          | Stripe                  | Mesh                                | p^4.5              | 1
Floyd          | Stripe                  | Mesh-CT                             | (p log p)^3        | 1
Floyd          | Stripe                  | Mesh-CT-MC                          | p^3                | 1
Floyd          | Checkerboard            | SM                                  | p^1.5              | 1
Floyd          | Checkerboard            | Cube                                | p^1.5 (log p)^3    | 1
Floyd          | Checkerboard            | Mesh                                | p^3                | 1
Floyd          | Checkerboard            | Mesh-CT                             | p^2.25             | 1
Floyd          | Checkerboard            | Mesh-CT-MC                          | p^2.25             | 1
Floyd          | Pipelined Checkerboard  | SM, Cube, Mesh, Mesh-CT, Mesh-CT-MC | p^1.5              | 1

Comparing Metrics
• We've used "cost" previously this semester (cost = $p \cdot T_p$).
• But notice that the cost of all of the architecture-algorithm combinations discussed here is $\Theta(n^3)$.
• Clearly some are more scalable than others.
• Thus isoefficiency is a useful metric when analyzing algorithms and architectures.
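To see concretely why isoefficiency separates combinations whose cost looks identical, here is a small numeric sketch (Python; the constants $t_s$, $t_w$, $t_c$ are illustrative values chosen for this sketch, not figures from Kumar and Singh) that evaluates $E = T_e / (p\,T_p)$ using the hypercube checkerboard $T_p$ expression above. With n held fixed, efficiency decays as p grows; growing n along the $\Theta(p^{1.5}(\log p)^3)$ curve prevents the decay.

```python
import math

ts, tw, tc = 50.0, 1.0, 1.0     # illustrative machine constants (assumed)

def efficiency(n, p):
    """E = Te / (p * Tp) for checkerboard Floyd on a hypercube."""
    lg = math.log2
    Tp = n * (2 * (ts + tw * n / math.sqrt(p)) * lg(math.sqrt(p))
              + 2 * (ts + tw) * lg(p)
              + tc * n * n / p)
    return (tc * n ** 3) / (p * Tp)

for p in (16, 64, 256):
    n_iso = round((p ** 1.5 * math.log2(p) ** 3) ** (1 / 3))  # n^3 ~ p^1.5 (log p)^3
    print(f"p={p:4d}  fixed n=64: E={efficiency(64, p):.3f}   "
          f"scaled n={n_iso}: E={efficiency(n_iso, p):.3f}")
```

With these constants, the fixed-n efficiency falls from roughly 0.28 to 0.01 across the three machine sizes, while the scaled-n efficiency stays in a narrow band (about 0.03 to 0.05, creeping toward its asymptote) rather than decaying, which is exactly the behavior the isoefficiency function promises.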
References
• Akl, S. G. Parallel Computation: Models and Methods. Prentice Hall, Upper Saddle River, NJ, pp. 381-384, 1997.
• Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to Algorithms (2nd Edition). The MIT Press, Cambridge, MA, pp. 620-642, 2001.
• Jenq, J. and Sahni, S. All Pairs Shortest Path on a Hypercube Multiprocessor. In International Conference on Parallel Processing, pp. 713-716, 1987.
• Kumar, V. and Singh, V. Scalability of Parallel Algorithms for the All Pairs Shortest Path Problem. Journal of Parallel and Distributed Computing, vol. 13, no. 2, Academic Press, San Diego, CA, pp. 124-138, 1991.
• Pettie, S. A Faster All-Pairs Shortest Path Algorithm for Real-Weighted Sparse Graphs. In Proc. 29th Int'l Colloq. on Automata, Languages, and Programming (ICALP '02), LNCS vol. 2380, pp. 85-97, 2002.