On-line adaptive parallel prefix computation Jean-Louis Roch, Daouda Traoré and Julien Bernard

Presented by Andreas Söderström, ITN
The prefix problem


Given X = x1, x2, …, xn, compute the n products
πk = x1 ο x2 ο … ο xk for 1 ≤ k ≤ n,
where ο is some associative operation
Example:
o = + (i.e. addition)
X = 1,3,5,7
π1 = 1
π2 = 1+3 = 4
π3 = 1+3+5 = 9
π4 = 1+3+5+7 = 16
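As a point of reference, the sequential version of the problem can be sketched in a few lines of Python. The helper name `prefix` is only illustrative; the sketch relies on the standard `itertools.accumulate` and works for any associative operation passed as `op`.

```python
from itertools import accumulate
import operator

def prefix(xs, op=operator.add):
    """Return [x1, x1 op x2, ..., x1 op ... op xn] for an associative op."""
    return list(accumulate(xs, op))

print(prefix([1, 3, 5, 7]))                # the example from the slide
print(prefix([1, 3, 5, 7], operator.mul))  # same input, with multiplication
```

This sequential loop performs n-1 operations, which is the baseline the parallel schemes are compared against.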
Parallel prefix sum (first pass)
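[Figure: reduction tree over the inputs 1 to 8; each step adds adjacent pairs of the step below]
Step 0: 1 2 3 4 5 6 7 8
Step 1: 3 7 11 15
Step 2: 10 26
Step 3: 36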
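The first pass can be sketched as a bottom-up pairwise reduction. This is a minimal illustration, assuming the input length is a power of two; the function name `up_sweep` is hypothetical and the loop runs sequentially, whereas in a parallel setting all additions within one step are independent.

```python
def up_sweep(xs):
    """Return the levels of the reduction tree, leaves first.

    Assumes len(xs) is a power of two.
    """
    levels = [list(xs)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Each node is the sum of two adjacent nodes of the step below.
        levels.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    return levels

for step, level in enumerate(up_sweep([1, 2, 3, 4, 5, 6, 7, 8])):
    print(f"Step {step}: {level}")
```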
Parallel prefix sum (second pass)


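For every even position, use the value of the parent node
For every odd position pn (other than the first), compute pn-1 + pn, where pn-1 is the already updated value to its left and pn is the first-pass value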
[Figure: the same tree traversed from the root down to the leaves; values after the second pass]
Step 0: 36
Step 1: 10 36
Step 2: 3 10 21 36
Step 3: 1 3 6 10 15 21 28 36
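The two rules of the second pass can be sketched as a top-down traversal of the tree built in the first pass. This is an illustrative sketch, not code from the paper: the names `up_sweep` and `down_sweep` are hypothetical, the input length is assumed to be a power of two, and positions are 1-based in the comments to match the slide.

```python
def up_sweep(xs):
    """First pass: levels of pairwise sums, leaves first."""
    levels = [list(xs)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    return levels

def down_sweep(levels):
    """Second pass: return the updated levels, root first."""
    result = [list(levels[-1])]            # the root already holds the total
    for sums in reversed(levels[:-1]):
        parent, cur = result[-1], []
        for i, s in enumerate(sums):       # i is 0-based, position is i + 1
            if i % 2 == 1:                 # even position: copy the parent
                cur.append(parent[i // 2])
            elif i == 0:                   # first position: unchanged
                cur.append(s)
            else:                          # odd position: left neighbour + own sum
                cur.append(cur[i - 1] + s)
        result.append(cur)
    return result

for step, level in enumerate(down_sweep(up_sweep([1, 2, 3, 4, 5, 6, 7, 8]))):
    print(f"Step {step}: {level}")
```

The last level printed holds the prefix sums of the input.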
Parallel prefix computation



Parallel time: 2n/p + O(log n)
for p < n/(log n)
Lower bound for parallel time:
2n/(p+1) for n > p(p+1)/2
Assumes identical processors!
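One common way to see where a term of about 2n/p comes from is a two-phase blocked scheme: each processor first scans its own block, then adds the total of all preceding blocks to every element. The sketch below simulates this sequentially and counts additions per processor. It is an assumption made for illustration, not necessarily the scheme analysed in the paper, and `blocked_prefix` is a hypothetical name. It assumes n ≥ p and addition as the operation.

```python
from itertools import accumulate

def blocked_prefix(xs, p):
    """Simulate a two-phase prefix sum on p blocks.

    Returns (prefix sums, additions performed by each simulated processor).
    """
    n = len(xs)
    size = -(-n // p)                                   # ceil(n / p)
    blocks = [xs[i:i + size] for i in range(0, n, size)]
    # Phase 1: every processor scans its own block (len - 1 additions).
    local = [list(accumulate(b)) for b in blocks]
    ops = [len(b) - 1 for b in blocks]
    # Offsets: total of all preceding blocks (done serially here for brevity).
    offsets = [0] + list(accumulate(l[-1] for l in local))[:-1]
    # Phase 2: every processor but the first adds its offset to each element.
    out = []
    for j, (l, off) in enumerate(zip(local, offsets)):
        if j == 0:
            out.extend(l)
        else:
            out.extend(v + off for v in l)
            ops[j] += len(l)
    return out, ops

result, ops = blocked_prefix(list(range(1, 17)), p=4)
print(result)
print("additions per processor:", ops, "vs 2n/p =", 2 * 16 / 4)
print("lower bound 2n/(p+1) =", 2 * 16 / (4 + 1))
```

For n = 16 and p = 4 the condition n > p(p+1)/2 = 10 holds, so the lower bound from the slide applies to this instance.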
Parallel prefix computation

Potential practical problems:
- Processor setup may be heterogeneous
- Processor load may vary due to other users computing on the same machine
An off-line optimal schedule is potentially not optimal anymore!
Solution:

Use on-line scheduling!
The basic idea

Combine a sequentially optimal algorithm with fine-grained parallelism using work stealing
[Figure: processes P0 and P1, P2, …, Pn, connected by "steal work" arrows]
The algorithm
Sequential process Ps:
- Ps starts working on [π1, πk], i.e. value indices [1,k], where indices [k+1,m] have been stolen
- When Ps reaches index k, it communicates πk to the parallel process Pv that has stolen [k+1,m] and retrieves the last index n computed by Pv together with the local prefix result rn = xk+1 ο … ο xn
- Ps uses associativity to calculate πn = πk ο rn and continues the computation from index n+1
The algorithm
Parallel process Pv:
- Pv scans for active processes (Ps or another Pv) and steals part of the work from one of them
- Pv computes the local prefix operation on the stolen interval
- The result computed by Pv depends on a previous value and needs to be finalized once that value is known
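The interplay between Ps and a single thief Pv can be sketched as a deterministic simulation. This is a simplified illustration, not the authors' implementation: there are no threads, only one steal, and the function name and the parameters `k` and `n_done` (the index n reached by Pv when Ps arrives at k) are hypothetical. The local result is taken to be rn = xk+1 ο … ο xn, which is an assumption about the slide's notation.

```python
import operator

def adaptive_prefix_one_steal(xs, k, n_done, op=operator.add):
    """Prefix of xs with one simulated steal; requires 1 <= k < n_done <= len(xs).

    Ps computes indices [1, k]; Pv has stolen [k+1, m] and has computed
    local prefixes up to index n_done when Ps reaches index k.
    """
    m = len(xs)
    pi = [None] * m
    # Ps: sequential prefix on indices [1, k] (0-based slots 0 .. k-1).
    pi[0] = xs[0]
    for i in range(1, k):
        pi[i] = op(pi[i - 1], xs[i])
    # Pv: local prefix on [k+1, n_done], computed without knowing pi_k.
    local = [xs[k]]
    for i in range(k + 1, n_done):
        local.append(op(local[-1], xs[i]))
    # Jump (Ps): combine pi_k with the last local result r_n.
    pi[n_done - 1] = op(pi[k - 1], local[-1])
    # Finalize (Pv): pi_i = pi_k op r_i for the indices in between.
    for j, r in enumerate(local[:-1]):
        pi[k + j] = op(pi[k - 1], r)
    # Ps continues sequentially from index n_done + 1.
    for i in range(n_done, m):
        pi[i] = op(pi[i - 1], xs[i])
    return pi

print(adaptive_prefix_one_steal(list(range(1, 17)), k=6, n_done=10))
print(adaptive_prefix_one_steal(list("abcdef"), k=2, n_done=4))
```

The string example uses concatenation, which is associative but not commutative, to show that only associativity is needed. In the actual algorithm the steals happen on-line and the remaining interval can be stolen again; the sketch fixes these choices up front.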
The algorithm
[Figure: example run on indices 1 to 16 with processes P0, P1 and P2; legend: Result, Jump, Finalize, Stealable]
Performance


If a processor is or becomes slow, part of its work can be stolen by an idle processor
Asymptotic optimality (proof provided in the paper)
Performance
p homogeneous processors
[Figure: bar chart comparing the static and the adaptive algorithm for p = 2, 4, 6, 8; series: lower bound, min, average, max; vertical axis from 0 to 8]
Performance
p heterogeneous processors
[Figure: bar chart comparing the static and the adaptive algorithm for p = 2, 4, 6, 8; series: lower bound, min, average, max; vertical axis from 0 to 12]
Questions?