Load Balancing of Irregular Parallel Divide-and-Conquer Algorithms in Group-SPMD Programming Environments

Mattias Eriksson, Christoph Kessler, Mikhail Chalabine
Linköping University, Sweden

Originally presented at the PASA workshop, 16 March 2006

Outline
1 Introduction
  Parallel quicksort
  Imbalance in workload
2 Load balancing
  Strategies for load balancing
3 Optimal local load balancing strategy
  Problem: When to serialise/repivot
  Dynamic programming method
Parallel Quicksort
A processor group selects a shared pivot π and partitions S into L = {x ∈ S | x ≤ π} and R = {x ∈ S | x > π}.
The processor group splits into two groups.
The first processor group sorts L and the second processor group sorts R.
(A code sketch of this recursion follows the figure below.)
[Figure: a processor group partitions S (1000 keys) into L (10 keys) and R (990 keys); the group splits accordingly, leaving one processor for L and the rest for R.]
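To make the recursion concrete, here is a minimal runnable sketch in C with MPI (our illustration, not the authors' code). It assumes, for simplicity, that every processor in a group holds a copy of its group's keys, so the partition step is replicated rather than distributed; the final exchange of sorted blocks between groups is omitted, and group_qsort and cmp are names introduced here.

```c
#include <mpi.h>
#include <stdlib.h>

/* compare two ints for qsort */
static int cmp(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sort the group's copy of keys[0..n-1] with the processors in comm. */
static void group_qsort(int *keys, int n, MPI_Comm comm) {
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);
    if (p == 1 || n < 2) {            /* one-processor phase */
        qsort(keys, n, sizeof(int), cmp);
        return;
    }
    int pivot = keys[n / 2];          /* shared pivot: identical on all ranks */
    int *tmp = malloc(n * sizeof(int));
    int nl = 0, nr = n;
    for (int i = 0; i < n; i++)       /* L = {x <= pivot}, R = {x > pivot} */
        if (keys[i] <= pivot) tmp[nl++] = keys[i]; else tmp[--nr] = keys[i];
    int pl = (int)((long long)p * nl / n);  /* processors for L, ~ |L|/|S| */
    if (pl < 1) pl = 1;
    if (pl > p - 1) pl = p - 1;
    MPI_Comm sub;                     /* the group splits into two groups */
    MPI_Comm_split(comm, rank < pl, rank, &sub);
    if (rank < pl) group_qsort(tmp, nl, sub);          /* first group: L  */
    else           group_qsort(tmp + nl, n - nl, sub); /* second group: R */
    MPI_Comm_free(&sub);
    free(tmp);  /* exchange of sorted blocks between groups omitted */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int keys[16] = {9, 3, 14, 1, 12, 5, 7, 0, 11, 2, 15, 8, 4, 13, 6, 10};
    group_qsort(keys, 16, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```

Note how termination comes from the processor count: every split strictly shrinks the group, and one-processor groups fall back to sequential quicksort.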
Imbalanced workload in parallel quicksort
With subproblem sizes being data-dependent, a good processor assignment to the subproblems is not always possible.
Imbalance may accumulate in the parallel phase, and computing resources are wasted.
Examples of DC algorithms with data-dependent (irregular) subproblem sizes: quicksort and quickhull.
Examples of DC algorithms where subproblem sizes are fixed: mergesort, FFT and Strassen matrix multiplication.
The manager strategy
One of the processors is dedicated to coordinating the others.
+ A processor with a high workload can get help from an idle processor.
- The manager does no sorting at all.

The task queue strategy
After partitioning is done in the one-processor phase, one of the subtasks is put into a shared task queue.
When a processor has no more work to do, it fetches a new task from the queue.
[Figure: S is partitioned into L and R; one of the subtasks goes to the task queue.]
+ The load imbalance will be very small.
- Managing the task queue causes overhead.

The repivoting strategy
If the processor groups would become imbalanced, a new pivot is selected.
[Figure: rather than splitting 1000 keys into subproblems of 10 and 990 keys, the group p0, p1, p2 repartitions the 1000 keys with a new pivot.]
+ Bad load balance between processor groups is avoided.
- When a new pivot is selected, the partitioning has to be redone.
- Does not work where the pivot is uniquely determined (e.g. quickhull).
- Exactly how do we define bad load balance?

The serialisation strategy
If the processor groups would become imbalanced, the whole processor group first sorts the first subproblem and then the other subproblem.
[Figure: the whole group p0, p1, p2 first sorts the 10-key subproblem, then the 990-key subproblem.]
+ No load imbalance at all when serialising the execution in the parallel phase.
- The processors will execute more work in the parallel phase.
- Exactly how do we define bad load balance?
(A sketch of the serialise-or-split recursion step follows below.)
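The serialise-or-split decision slots into the recursion step. Below is a hedged variant of the earlier group_qsort sketch (it additionally needs <string.h> for memcpy): a three-way partition guarantees that both subproblems shrink even when the group serialises, and should_serialise() is a placeholder for the threshold test derived later in the talk.

```c
/* Hybrid recursion step (sketch): serialise or split per call.
   Reuses cmp() from the earlier sketch; should_serialise() stands
   in for the tuned threshold test shown later. */
static void group_qsort_hybrid(int *keys, int n, MPI_Comm comm) {
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);
    if (p == 1 || n < 2) { qsort(keys, n, sizeof(int), cmp); return; }
    int pivot = keys[n / 2];
    int *tmp = malloc(n * sizeof(int));
    int nl = 0, nr = n;
    for (int i = 0; i < n; i++) {      /* three-way partition: <, ==, > */
        if (keys[i] < pivot) tmp[nl++] = keys[i];
        else if (keys[i] > pivot) tmp[--nr] = keys[i];
    }
    for (int i = nl; i < nr; i++) tmp[i] = pivot;  /* keys equal to pivot */
    if (should_serialise(p, n, nl)) {
        /* serialise: the whole group sorts L, then R; no group imbalance,
           but the group pays for two parallel phases in sequence */
        group_qsort_hybrid(tmp, nl, comm);
        group_qsort_hybrid(tmp + nr, n - nr, comm);
    } else {
        /* otherwise split the group proportionally, as before */
        int pl = (int)((long long)p * nl / n);
        if (pl < 1) pl = 1;
        if (pl > p - 1) pl = p - 1;
        MPI_Comm sub;
        MPI_Comm_split(comm, rank < pl, rank, &sub);
        if (rank < pl) group_qsort_hybrid(tmp, nl, sub);
        else           group_qsort_hybrid(tmp + nr, n - nr, sub);
        MPI_Comm_free(&sub);
    }
    memcpy(keys, tmp, n * sizeof(int)); /* after a split, each rank only
                                           holds its own group's part sorted */
    free(tmp);
}
```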
Optimal local strategy: example
[Figure: cost as a function of n1 for P = 2 and N = 1000, comparing splitting, serialising and repivoting.]
The optimal strategy is sometimes repivoting/serialisation and sometimes group splitting.
We want to find the threshold values!
Optimal local strategy: numerical approach
Question: When is it worthwhile to serialise/repivot?
Find out numerically with dynamic programming:
Dynamic programming is a method for solving problems of a recursive nature.
Calculate E[T_p(n)] for increasingly large p and n, using previously calculated values.
Also calculate which strategy of group splitting, repivoting and serialising is best for a given situation.
Example: 2 processors sort 1000 elements; n1 is the size of the first subproblem.
[Figure: the best strategy as a function of n1 for p = 2 and n = 1000.]
(A sketch of the tabulation follows below.)
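To make the tabulation concrete, here is a small self-contained C sketch (our illustration). The cost model is assumed, not the paper's: uniform pivot rank, parallel partition cost n/p plus a synchronisation constant, and sequential quicksort costing about 1.4 n log2 n. Repivoting is omitted for brevity, so the sketch only compares group splitting with serialisation.

```c
#include <stdio.h>
#include <math.h>

#define PMAX 4            /* small grid for the sketch */
#define NMAX 2000
#define SYNC 50.0         /* assumed per-step synchronisation overhead */

static double T[PMAX + 1][NMAX + 1];   /* T[p][n] approximates E[T_p(n)] */

int main(void) {
    /* base case: one processor runs sequential quicksort */
    for (int n = 0; n <= NMAX; n++)
        T[1][n] = n > 1 ? 1.4 * n * log2((double)n) : 0.0;
    /* fill the table for increasingly large p and n, reusing old entries */
    for (int p = 2; p <= PMAX; p++) {
        for (int n = 0; n <= NMAX; n++) {
            if (n < 2) { T[p][n] = 0.0; continue; }
            double acc = 0.0;
            for (int n1 = 0; n1 < n; n1++) {   /* pivot rank assumed uniform */
                int n2 = n - 1 - n1;
                int p1 = (int)((double)p * n1 / n + 0.5);  /* proportional */
                if (p1 < 1) p1 = 1;
                if (p1 > p - 1) p1 = p - 1;
                /* split: the two subgroups work concurrently */
                double split = fmax(T[p1][n1], T[p - p1][n2]);
                /* serialise: the whole group does L, then R */
                double serial = T[p][n1] + T[p][n2];
                acc += fmin(split, serial);    /* best local strategy */
            }
            T[p][n] = (double)n / p + SYNC + acc / n; /* partition + E[rest] */
        }
    }
    /* read off a threshold: smallest n1/n at which splitting wins (p = 2,
       where the smaller side always gets one processor) */
    int p = 2, n = NMAX;
    for (int n1 = 1; n1 < n / 2; n1++) {
        int n2 = n - 1 - n1;
        if (fmax(T[1][n1], T[1][n2]) < T[p][n1] + T[p][n2]) {
            printf("p=%d, n=%d: split when n1/n > %.3f\n",
                   p, n, (double)n1 / n);
            break;
        }
    }
    return 0;
}
```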
Having a look at the results
[Figure: the computed strategy table over small (p, n): cells marked "s" where group splitting is optimal, with threshold values t1, ..., t9 marking where serialisation takes over.]
Results of dynamic programming
We found from dynamic programming that:
On average, serialisation is always better than repivoting.
Serialisation is better than group splitting when one of the subproblems is very small compared to the other.
n1 = n/2 is not the optimal split point!

[Figure: expected cost as a function of n1 for P = 7 and N = 1000000, comparing splitting, sequentialising and repivoting.]

MPI thresholds (serialise when n1/n is below the entry for p processors and n keys):

p \ n   10^5   10^6   10^7   10^8   10^9
  2     .156   .328   .382   .401   .407
  3     0      .216   .265   .281   .284
  4     0      .136   .191   .206   .210
  5     0      .076   .143   .159   .163
  6     0      .022   .109   .126   .131
  7     0      0      .083   .102   .107
  8     0      0      .062   .083   .088
  9     0      0      .042   .067   .073
 10     0      0      .026   .054   .061
 11     0      0      .011   .043   .051
 12     0      0      0      .033   .042
 13     0      0      0      .025   .034
 14     0      0      0      .015   .027
 15     0      0      0      .008   .021
 16     0      0      0      0      .016
Example: Consider the case p = 3 and n = 10^7. If n1/n < .265, serialisation is better than group splitting.
We implement a hybrid local heuristic, parameterised by this table (or a corresponding table for Fork); a lookup sketch follows below.
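A sketch of that lookup in C (our illustration): the values are copied from the table above, while the function name should_serialise, the rounding to the nearest power-of-ten column, and the use of the smaller subproblem's share are our choices. This is the test that the hybrid recursion sketch earlier plugs in.

```c
#include <math.h>

/* MPI thresholds from the table above: rows p = 2..16, columns
   n = 10^5..10^9; serialise when the share is below the entry. */
static const double THRESH[15][5] = {
    {.156, .328, .382, .401, .407},   /* p = 2  */
    {0,    .216, .265, .281, .284},   /* p = 3  */
    {0,    .136, .191, .206, .210},   /* p = 4  */
    {0,    .076, .143, .159, .163},   /* p = 5  */
    {0,    .022, .109, .126, .131},   /* p = 6  */
    {0,    0,    .083, .102, .107},   /* p = 7  */
    {0,    0,    .062, .083, .088},   /* p = 8  */
    {0,    0,    .042, .067, .073},   /* p = 9  */
    {0,    0,    .026, .054, .061},   /* p = 10 */
    {0,    0,    .011, .043, .051},   /* p = 11 */
    {0,    0,    0,    .033, .042},   /* p = 12 */
    {0,    0,    0,    .025, .034},   /* p = 13 */
    {0,    0,    0,    .015, .027},   /* p = 14 */
    {0,    0,    0,    .008, .021},   /* p = 15 */
    {0,    0,    0,    0,    .016},   /* p = 16 */
};

/* Serialise if the smaller subproblem's share of n is below the threshold
   for (p, n); the nearest power-of-ten column is used for n in between. */
static int should_serialise(int p, long n, long n1) {
    if (p < 2 || p > 16) return 0;
    int col = (int)(log10((double)n) + 0.5) - 5;  /* 1e5 -> 0, ..., 1e9 -> 4 */
    if (col < 0) col = 0;
    if (col > 4) col = 4;
    double share = (double)(n1 < n - n1 ? n1 : n - n1) / n;
    return share < THRESH[p - 2][col];
}
```

Sanity check against the slide's example: should_serialise(3, 10000000, n1) selects column 10^7 and row p = 3, i.e. threshold .265.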
Execution times in Fork
Speedups of quicksort running on the SB-PRAM simulator. In this example 40000 integers are sorted.
[Figure: absolute speedup vs. number of processors (0 to 32) with task queue, with repivoting, with serialisation, and with no load balancing.]
The dynamic load balancing strategy with a task queue performs better than our strategy with serialisation.
SB-PRAM: nonblocking task queue operations, low overhead.

Execution times in MPI
Speedups of quicksort on a cluster with Intel Xeon processors. In this example 100 million integers are sorted.
[Figure: absolute speedup vs. number of processors (0 to 32) with serialisation, with the manager, and with no load balancing.]
Here our local load balancing strategy with serialisation outperforms the dynamic manager solution!
Manager: one processor is dedicated; communication overhead.

Execution times in MPI with a better pivot
Investing more time early in a better pivot can reduce the need for load balancing.
Example: speedups of quicksort in MPI; 100 million integers are sorted; the pivot is selected as the average of 7 elements.
[Figure: absolute speedup vs. number of processors (0 to 32) with serialisation, with the manager, and with no load balancing.]
Conclusions
We have proposed and studied new strategies for local
load balancing in SPMD: repivoting and serialisation.
We developed an optimal hybrid local load balancing
method, calibrated with threshold tables derived from
offline dynamic programming.
Our hybrid local strategy outperforms the global dynamic
load balancing (manager solution) on the MPI cluster.
Global dynamic load balancing (task queue solution)
outperforms our local hybrid strategy on the SB-PRAM
where synchronization is very cheap.