COT 5405 – Fall 2006
Introduction to Parallel Algorithms
Ashok Srinivasan
www.cs.fsu.edu/~asriniva
Florida State University

Outline
• Background
• Primitives
• Algorithms
• Important points

Background
• Terminology
  – Time complexity
  – Speedup
  – Efficiency
  – Scalability
• Communication cost model

Time complexity
• Parallel computation
  – A group of processors works together to solve a problem
  – The time required for the computation is the period from when the first processor starts working until the last processor stops
• (Figure: execution timelines comparing sequential, bad parallel, ideal parallel, and realistic parallel runs)

Other terminology
• Notation
  – P = number of processors
  – T1 = time on one processor
  – TP = time on P processors
• Speedup: S = T1/TP
• Efficiency: E = S/P
• Work: W = P TP
• Scalability
  – How does TP decrease as we increase P to solve the same problem?
  – How should the problem size increase with P, to keep E constant?

Communication cost model
• Processes spend some time doing useful work, and some time communicating
• Model the cost of one message as
  – TC = ts + L tb, where L = message size
  – The cost is independent of the locations of the processes
  – Any process can communicate with any other process
  – A process can simultaneously send one message and receive another

I/O model
• We will ignore I/O issues, for the most part
• We will assume that input and output are distributed across the processors in a manner of our choosing
• Example: sorting
  – Input: x1, x2, ..., xn; initially, xi is on processor i
  – Output: xp1, xp2, ..., xpn; xpi is on processor i, and xpi < xpi+1

Primitives
• Reduction
• Broadcast
• Gather/Scatter
• All-gather
• Prefix

Reduction – 1
• Compute x1 + x2 + ... + xn, where xi starts on processor i
• All other processors send their values to one processor, which adds them up
• Tn = n − 1 + (n − 1)(ts + tb)
• Sn = 1/(1 + ts + tb)

Reduction – 2
• Run Reduction-1 on {x1, ..., xn/2} and on {xn/2+1, ..., xn} in parallel, then combine the two partial sums
• Tn = n/2 − 1 + (n/2 − 1)(ts + tb) + (ts + tb) + 1 = n/2 + (n/2)(ts + tb)
• Sn ~ 2/(1 + ts + tb)

Reduction – 3
• Apply Reduction-2 recursively (divide and conquer): split into halves, then into quarters {x1, ..., xn/4}, {xn/4+1, ..., xn/2}, {xn/2+1, ..., x3n/4}, {x3n/4+1, ..., xn}, and so on
• Tn ~ log2 n + (ts + tb) log2 n
• Sn ~ (n/log2 n) × 1/(1 + ts + tb)
• Note that any associative operator can be used in place of +

Parallel addition features
• If n >> P
  – Each processor adds n/P distinct numbers, then the P partial sums are combined with a parallel reduction
  – TP ~ n/P + (1 + ts + tb) log P
  – The optimal P is obtained by differentiating TP with respect to P: Popt ~ n/(1 + ts + tb)
  – If the communication cost is high, then fewer processors ought to be used
• E = [1 + (1 + ts + tb) P log P / n]^−1
  – As the problem size increases, efficiency increases
  – As the number of processors increases, efficiency decreases

Some common collective operations
• Broadcast: one process holds A; every process ends with A
• Scatter: one process holds A, B, C, D and sends one piece to each process
• Gather: each process holds one piece; one process collects A, B, C, D
• All-gather: each process holds one piece; every process ends with A, B, C, D

Broadcast
• Tree-based broadcast: in each round, every process that already has the data sends it to a process that does not, doubling the number of copies
• T ~ (ts + L tb) log P, where L is the length of the data
• (Figure: a broadcast tree over eight processors)

Gather/Scatter
• Tree-based: data move towards the root, and the message size doubles at each level (L at the leaves, then 2L, 4L, ...)
• Note: Σi=0..(log P)−1 2^i = (2^log P − 1)/(2 − 1) = P − 1, so roughly P L data cross the link into the root
• T ~ ts log P + P L tb
• Gather: data move towards the root
• Scatter: the reverse, data move from the root towards the leaves (review question)

All-gather
• Equivalent to each processor broadcasting to all the processors
• Recursive doubling: processors exchange everything they hold with a partner in each round, so the message sizes grow as L, 2L, 4L, ... over log P rounds
• Tn ~ ts log P + P L tb

Review question: Pipelining
• Pipelining is useful when repeatedly and regularly performing a large number of primitive operations
  – The optimal time for one broadcast is log P, so doing n broadcasts one after another takes n log P time
  – Pipelining the broadcasts takes n + P time
• Almost constant amortized time per broadcast if n >> P
  – n + P << n log P when n >> P
• Review question: how can you accomplish this time complexity?
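The divide-and-conquer reduction can be simulated sequentially. Below is a minimal Python sketch (not from the slides; the function name and the list standing in for the processors are illustrative) in which each round combines pairs across the two halves, so the round count matches the log2 n term in Tn:

```python
def tree_reduce(values, op=lambda a, b: a + b):
    # Simulate the divide-and-conquer reduction (Reduction-3): in each
    # round, processor i combines its value with the one sent by
    # processor i + half, so the number of rounds is log2(n),
    # matching Tn ~ log2 n + (ts + tb) log2 n.
    vals = list(values)
    assert len(vals) & (len(vals) - 1) == 0, "power-of-two n assumed"
    rounds = 0
    while len(vals) > 1:
        half = len(vals) // 2
        # all of these combines would happen in parallel on a real machine
        vals = [op(vals[i], vals[i + half]) for i in range(half)]
        rounds += 1
    return vals[0], rounds
```

Passing `op=max` instead of the default addition illustrates the slides' remark that any associative operator can replace +.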
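The recursive-doubling all-gather can be sketched the same way (again an illustrative sequential simulation; each dict stands in for one processor's memory, keyed by the original owner of each piece):

```python
def all_gather(pieces):
    # Simulate recursive-doubling all-gather on P = 2^k "processors".
    # In round r, processor i exchanges everything it holds with its
    # partner i XOR 2^r, so message sizes grow as L, 2L, 4L, ... and
    # each process receives (P - 1) L data in total, matching
    # Tn ~ ts log P + P L tb.
    P = len(pieces)
    assert P & (P - 1) == 0, "power-of-two P assumed for simplicity"
    data = [{i: pieces[i]} for i in range(P)]  # owner -> piece, per process
    d = 1
    rounds = 0
    while d < P:
        snapshot = [dict(blk) for blk in data]  # exchanges are simultaneous
        for i in range(P):
            data[i].update(snapshot[i ^ d])
        d *= 2
        rounds += 1
    return data, rounds
```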
Sequential prefix
• Input
  – Values xi, 1 ≤ i ≤ n
• Output
  – Xi = x1 * x2 * ... * xi, 1 ≤ i ≤ n
  – * is an associative operator
• Algorithm
  – X1 = x1
  – for i = 2 to n: Xi = Xi−1 * xi

Parallel prefix
• Input
  – Processor i has xi
• Output
  – Processor i has Xi = x1 * x2 * ... * xi
• Divide and conquer: define f(a, b) so that after the call, for a ≤ i ≤ b, processor Pi holds both
  – the prefix Xi = xa * ... * xi, and
  – the total Yi = xa * ... * xb
  f(1, n) then solves the problem
• f(a, b):
  – if a == b: Xi = Yi = xi on Pi
  – else
    • compute f(a, (a+b)/2) and f((a+b)/2 + 1, b) in parallel
    • pair each Pi in the left half with Pj in the right half, j = i + (b − a + 1)/2; Pi and Pj send their totals to each other
    • on Pj: Xj = Yi * Xj (the right half's prefixes pick up the left half's total)
    • on both: Yi = Yj = Yi * Yj
• T(n) = T(n/2) + 2 + (ts + tb), so T(n) = O(log n)
• An iterative implementation improves the constant

Iterative parallel prefix example
• Let xij denote xi * ... * xj; eight inputs x0, ..., x7
  – Start:                 x0  x1  x2  x3  x4  x5  x6  x7
  – After distance-1 step: x0  x01 x12 x23 x34 x45 x56 x67
  – After distance-2 step: x0  x01 x02 x03 x14 x25 x36 x47
  – After distance-4 step: x0  x01 x02 x03 x04 x05 x06 x07
• At distance d, element i (for i ≥ d) computes Xi = Xi−d * Xi

Algorithms
• Linear recurrence
• Matrix-vector multiplication

Linear recurrence
• Determine each xi, 2 ≤ i ≤ n
  – xi = ai xi−1 + bi xi−2
  – x0 and x1 are given
• Sequential solution
  – for i = 2 to n: xi = ai xi−1 + bi xi−2
  – Follows directly from the recurrence
  – This approach is not easily parallelized

Linear recurrence in parallel
• Given xi = ai xi−1 + bi xi−2, take two steps at a time:
  – x2i = a2i x2i−1 + b2i x2i−2
  – x2i+1 = a2i+1 x2i + b2i+1 x2i−1
• Rewrite this in matrix form, with Xi = (x2i, x2i+1)^T:

  [ x2i   ]   [ b2i         a2i               ] [ x2i−2 ]
  [ x2i+1 ] = [ a2i+1 b2i   b2i+1 + a2i+1 a2i ] [ x2i−1 ]

  i.e. Xi = Ai Xi−1, so Xi = Ai Ai−1 ... A1 X0
• This is a parallel prefix computation, since matrix multiplication is associative
• Solved in O(log n) time

Matrix-vector multiplication
• c = A b
  – Often performed repeatedly: bi = A bi−1
  – So we need the same data distribution for c and b
• One-dimensional decomposition
  – Example: row-wise block-striped A, with b and c replicated
  – Each process computes its components of c independently
  – Then all-gather the components of c

1-D matrix-vector multiplication
• c: replicated; A: row-wise; b: replicated
• Each process computes its components of c independently
  – Time = Θ(n^2/P)
• Then all-gather the components of c
  – Time = ts log P + tb n
• Note: P ≤ n

2-D matrix-vector multiplication
• Arrange the processes in a √P × √P grid; Pij holds the block Aij, and Ci = Σj Aij Bj
  (Figure: a 4 × 4 block example with C0..C3, A00..A33, B0..B3)
• Process Pi0 sends Bi to P0i
  – Time: ts + tb n/√P
• Processes P0j broadcast Bj to all Pij in their column
  – Time = ts log √P + tb n log √P / √P
• Processes Pij compute Cij = Aij Bj
  – Time = Θ(n^2/P)
• Processes Pij reduce the Cij onto Pi0, 0 ≤ i < √P
  – Time = ts log √P + tb n log √P / √P
• Total time = Θ(n^2/P + ts log P + tb n log P / √P)
  – P ≤ n^2
  – More scalable than the one-dimensional decomposition

Important points
• Efficiency
  – Increases with increase in problem size
  – Decreases with increase in number of processors
• Aggregation of tasks to increase granularity
  – Reduces communication overhead
• Data distribution
  – 2-dimensional may be more scalable than 1-dimensional
  – Has an effect on load balance too
• General techniques
  – Divide and conquer
  – Pipelining
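The iterative parallel prefix and its use on the linear recurrence can be sketched together in Python. This is a sequential simulation under stated assumptions: `scan`, `matmul2`, and `linear_recurrence` are illustrative names, and the recurrence version uses the one-step companion matrix [[ai, bi], [1, 0]] as a simplification of the slides' paired two-step matrices (same idea: a prefix over an associative matrix product):

```python
def scan(xs, op):
    # Inclusive prefix by recursive doubling (the iterative parallel
    # prefix): at distance d, element i (for i >= d) computes
    # out[i] = op(out[i - d], out[i]); log2(n) rounds in all.
    out = list(xs)
    d = 1
    while d < len(out):
        # on a parallel machine, all combines in a round happen at once
        out = [out[i] if i < d else op(out[i - d], out[i])
               for i in range(len(out))]
        d *= 2
    return out

def matmul2(A, B):
    # 2x2 matrix product; associativity of matrix multiplication is
    # what lets the recurrence become a prefix computation.
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def linear_recurrence(a, b, x0, x1):
    # Solve x_i = a_i x_{i-1} + b_i x_{i-2} (i = 2..n) via a prefix of
    # companion matrices: [x_i; x_{i-1}] = [[a_i, b_i], [1, 0]] [x_{i-1}; x_{i-2}],
    # so [x_i; x_{i-1}] = (A_i ... A_2) [x_1; x_0].
    mats = [[[a[i], b[i]], [1, 0]] for i in range(2, len(a))]
    # later matrices multiply on the LEFT, so the scan operator is
    # op(earlier, later) = later * earlier (still associative)
    pre = scan(mats, lambda L, R: matmul2(R, L))
    xs = [x0, x1]
    for M in pre:
        xs.append(M[0][0] * x1 + M[0][1] * x0)
    return xs
```

With ai = bi = 1 this reproduces the Fibonacci numbers; the sequential O(n) loop and this O(log n)-depth formulation give identical results.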
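The 1-D row-wise decomposition can likewise be sketched as a sequential simulation (illustrative name; the loop over p stands in for the P independent processes, and the final concatenation stands in for the all-gather):

```python
def matvec_1d(A, b, P):
    # Sketch of the 1-D row-wise block-striped matrix-vector multiply:
    # A is split into P row blocks; each "process" computes its slice
    # of c = A b independently (theta(n^2/P) work each), then the
    # slices are all-gathered (cost ~ ts log P + tb n) so every
    # process ends with the full, replicated c.
    n = len(A)
    assert n % P == 0, "assume P divides n for simplicity"
    rows = n // P
    slices = []
    for p in range(P):  # each iteration = one independent process
        local_c = [sum(A[i][j] * b[j] for j in range(n))
                   for i in range(p * rows, (p + 1) * rows)]
        slices.append(local_c)
    # the all-gather: every process receives every slice
    return [v for s in slices for v in s]
```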