On-line adaptive parallel prefix computation
Jean-Louis Roch, Daouda Traoré and Julien Bernard
Presented by Andreas Söderström, ITN

The prefix problem
Given X = x_1, x_2, …, x_n, compute the n products π_k = x_1 ∘ x_2 ∘ … ∘ x_k for 1 ≤ k ≤ n, where ∘ is some associative operation.
Example: ∘ = + (i.e. addition), X = 1, 3, 5, 7
  π_1 = 1
  π_2 = 1 + 3 = 4
  π_3 = 1 + 3 + 5 = 9
  π_4 = 1 + 3 + 5 + 7 = 16

Parallel prefix sum (first pass)
[Figure: summation tree over the input 1, 2, …, 8, built bottom-up]
  Step 0: 1  2  3  4  5  6  7  8
  Step 1: 3  7  11  15
  Step 2: 10  26
  Step 3: 36

Parallel prefix sum (second pass)
For every even position, use the value of the parent node.
For every odd position p_n, compute p_(n-1) + p_n, where p_(n-1) is the already updated value to its left and p_n is the first-pass value (position 1 keeps its value).
[Figure: the same tree traversed top-down]
  Step 0: 36
  Step 1: 10  36
  Step 2: 3  10  21  36
  Step 3: 1  3  6  10  15  21  28  36

Parallel prefix computation
Parallel time: 2n/p + O(log n) for p < n/(log n)
Lower bound for parallel time: 2n/(p+1) for n > p(p+1)/2
Assumes identical processors!

Parallel prefix computation
Potential practical problems:
- Processor setup may be heterogeneous
- Processor load may vary due to other users computing on the same machine
Off-line optimal scheduling is potentially not optimal anymore!
Solution: use on-line scheduling!

The basic idea
Combine a sequentially optimal algorithm with fine-grained parallelism using work stealing.
[Figure: processes P0, P1, P2, …, Pn, with arrows labeled "Steal work"]

The algorithm
Sequential process Ps:
- Ps starts working on [π_1, π_k], i.e. value indices [1, k], where indices [k+1, m] have been stolen.
- When Ps reaches index k, it communicates π_k to the parallel process Pv that has stolen [k+1, m] and recovers the last index n computed by Pv (here n denotes that index, k < n ≤ m) together with the local prefix result r_n = x_(k+1) ∘ … ∘ x_n.
- Ps uses associativity to calculate π_n = π_k ∘ r_n and continues with the computation from index n+1.

The algorithm
Parallel process Pv:
- Pv scans for active processes (can be Ps or another Pv) and steals part of the work from that process.
- Pv computes the local prefix operation on the stolen interval.
- The computation of Pv depends on a previous value and needs to be finalized when that value is known.

The algorithm
[Figure: indices 1 to 16 distributed over processes P0, P1 and P2; legend: Jump, Result, Finalize, Stealable]

Performance
- If a processor is or becomes slow, part of its work can be stolen by an idle processor.
- Asymptotic optimality (proof provided in the paper)

Performance
p homogeneous processors
[Chart: Lower bound, Min, Average and Max for the static and adaptive versions at p = 2, 4, 6, 8; vertical axis from 0 to 8]

Performance
p heterogeneous processors
[Chart: Lower bound, Min, Average and Max for the static and adaptive versions at p = 2, 4, 6, 8; vertical axis from 0 to 12]

Questions?
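
Appendix: sketch of the Ps/Pv hand-off
The hand-off between the sequential process Ps and a stealing process Pv described on "The algorithm" slides can be illustrated with a small single-threaded simulation. This is only a sketch under assumptions, not the implementation from the paper: the function name `adaptive_prefix_sketch` and its parameters are hypothetical, there is exactly one steal, and the "parallel" part runs sequentially. Indices are 0-based, so the slice `xs[k:n]` plays the role of the part of the stolen interval that Pv managed to compute.

```python
from itertools import accumulate
from operator import add


def adaptive_prefix_sketch(xs, k, n, op=add):
    """Prefix products of xs, with xs[k:n] treated as the stolen interval.

    Hypothetical helper for illustration: the steal is simulated
    sequentially, no threads or real work stealing are involved.
    """
    if not 0 < k < n <= len(xs):
        raise ValueError("need 0 < k < n <= len(xs)")

    # Ps: sequential prefix over the first k values; the last entry is pi_k.
    head = list(accumulate(xs[:k], op))
    pi_k = head[-1]

    # Pv: local prefix over the stolen values, computed without knowing pi_k.
    local = list(accumulate(xs[k:n], op))
    r_n = local[-1]

    # Finalize: once pi_k is known, combine it with every local prefix.
    finalized = [op(pi_k, r) for r in local]

    # Jump: Ps obtains pi_n = pi_k o r_n by associativity and continues
    # sequentially with the values after the stolen interval.
    pi_n = op(pi_k, r_n)
    tail = list(accumulate(xs[n:], op, initial=pi_n))[1:]

    return head + finalized + tail
```

For example, `adaptive_prefix_sketch([1, 2, 3, 4, 5, 6, 7, 8], 3, 6)` returns `[1, 3, 6, 10, 15, 21, 28, 36]`, the same values as the last step of the second-pass tree. The operation is always applied with π_k on the left, so the sketch also works for associative operations that are not commutative, such as string concatenation.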