Towards Dynamic Reconfigurable Load-Balancing
for Hybrid Desktop Platforms
Alécio P. D. Binotto (abinotto@inf.ufrgs.br) – Carlos E. Pereira – Dieter W. Fellner
24th IEEE International Parallel and Distributed Processing Symposium – PhD Forum
Motivation
Desktop accelerators (such as GPUs) combined with multi-core CPUs form a powerful heterogeneous platform. To improve application performance on such platforms, load balancing plays an important role in distributing the workload, and it is even more important to consider online parameters that cannot be known a priori for system scheduling.

A Computational Fluid Dynamics (CFD) application developed at Fraunhofer IGD is used to exemplify the implementation of iterative solvers for Systems of Linear Equations (SLEs) and the need for a hybrid CPU-GPU approach to optimize performance.
Innovation
The approach abstracts the Processing Units (PUs) using the OpenCL API as a platform-independent programming model. The proposal is to extend OpenCL with a high-level module that schedules and balances the workload over the CPU and GPU for the specific case study. Starting from an initial scheduling configuration defined when the application starts, an online profiler monitors and stores task execution times and platform conditions. Since the tasks are non-deterministic, a reconfigurable dynamic scheduling is performed during application execution, taking changes in the runtime conditions into account. The figure above depicts the approach; the two-phase scheduling approach is described below.
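As a concrete illustration of the kind of data such an online profiler could keep per task, the following host-side sketch uses hypothetical names (PU, TaskProfile, estimatedTimeMs) that are assumptions of this summary, not the actual module interface:

#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Processing units considered in the case study.
enum class PU { CPU, GPU_8800GT, GPU_GTX285 };

// Per-task record kept by the online profiler: measured execution times
// per PU for a given problem size, plus the task's current assignment.
struct TaskProfile {
    std::string name;                            // e.g. "CG solve"
    std::size_t unknowns;                        // problem (domain) size
    std::map<PU, std::vector<double>> timesMs;   // measured times per PU
    PU currentPU = PU::CPU;
};

// Average of the stored measurements for one PU; returns a negative value
// when no measurement exists yet. Serves as the cost estimate used later.
inline double estimatedTimeMs(const TaskProfile& t, PU pu) {
    auto it = t.timesMs.find(pu);
    if (it == t.timesMs.end() || it->second.empty()) return -1.0;
    double sum = 0.0;
    for (double ms : it->second) sum += ms;
    return sum / static_cast<double>(it->second.size());
}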
First Assignment: the first guess faces a multidimensional scheduling problem of NP-hard complexity, which becomes even harder when dealing with more than two PUs and several tasks. To keep it tractable, this assignment is based on heuristics that take the performance benchmarks of the Preliminary Results into account. The first scheduling is coded statically but executed dynamically right after the application starts, considering the domain size and the premise that the PUs are idle, and defining a rule based on the break-even point values presented there, as sketched below. However, this strategy can lead to a PU overload, i.e., tasks being scheduled to the same PU, showing the need for a context-aware adaptation.
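A minimal sketch of such a break-even rule, assuming idle PUs and using the roughly 2K-unknowns CPU break-even point reported in the results; the threshold value and the function name are illustrative, not the exact heuristic:

#include <cstddef>

enum class PU { CPU, GPU_8800GT, GPU_GTX285 };   // as in the profiling sketch

// Hypothetical first-assignment rule: decides only from the domain size,
// following the measured break-even points, and assumes the PUs are idle.
inline PU firstAssignment(std::size_t unknowns) {
    const std::size_t cpuBreakEven = 2000;       // below ~2K unknowns the CPU wins
    if (unknowns < cpuBreakEven) return PU::CPU;
    return PU::GPU_GTX285;                       // most powerful PU for larger systems
}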
Dynamic Reconfiguration: after the first assignment, the information provided by the profiler is considered. Based on estimated costs, dynamic parameters, and awareness of the runtime conditions, a task can be reassigned to another PU only if its estimated execution time on the new PU is lower than on the current PU.
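Continuing the profiling sketch above (TaskProfile, estimatedTimeMs), the reassignment test could be as simple as the comparison just stated; the migration-cost parameter is an assumption added for illustration, since the poster only compares the two time estimates:

// Reuses TaskProfile, PU, and estimatedTimeMs from the profiling sketch.
// A task moves to 'candidate' only if the estimated execution time there
// beats the estimate on its current PU; migrationCostMs (data transfer,
// re-initialization) is an added assumption of this sketch.
inline bool shouldReassign(const TaskProfile& t, PU candidate,
                           double migrationCostMs = 0.0) {
    const double current  = estimatedTimeMs(t, t.currentPU);
    const double proposed = estimatedTimeMs(t, candidate);
    if (current < 0.0 || proposed < 0.0) return false;   // not enough profiling data yet
    return proposed + migrationCostMs < current;
}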
Preliminary Results
[Figure: Conjugate Gradient time (ms) vs. unknowns – PCG on the CPU, CG on the 8800 without and with ("new") memory coalescing, and CG on the GTX 285 with memory coalescing ("new").]
[Figure: Jacobi and red-black Gauss-Seidel time (ms) vs. unknowns (×1000) on the GTX 285.]
[Figure: Conjugate Gradient, Jacobi, and Gauss-Seidel time (ms) vs. unknowns (×1000) on the GTX 285 for smaller problem sizes.]
The GPU implementation was written in CUDA and optimized for memory coalescing. In particular, the first figure above compares the CG without and with ("new") our strategy for enabling memory coalescing.
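The kernels themselves are not shown here; as an illustration of the idea behind memory coalescing, the CUDA sketch below contrasts a chunked (uncoalesced) and an interleaved (coalesced) access pattern for a simple AXPY-style update of the kind that appears inside CG. Both kernels are hypothetical examples, not the actual implementation:

// Uncoalesced: each thread walks through its own contiguous chunk, so at any
// given moment the threads of a warp touch addresses that lie far apart.
__global__ void axpy_chunked(int n, float alpha, const float* x, float* y) {
    int nThreads = gridDim.x * blockDim.x;
    int chunk = (n + nThreads - 1) / nThreads;
    int start = (blockIdx.x * blockDim.x + threadIdx.x) * chunk;
    for (int i = start; i < start + chunk && i < n; ++i)
        y[i] += alpha * x[i];
}

// Coalesced: in every iteration, consecutive threads of a warp access
// consecutive addresses, which the hardware merges into few transactions.
__global__ void axpy_coalesced(int n, float alpha, const float* x, float* y) {
    int nThreads = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += nThreads)
        y[i] += alpha * x[i];
}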
[Figure: Conjugate Gradient time (ms) vs. unknowns (×1000) on a single GPU and averaged over two GPUs.]
Importance
This PhD research follows the state of the art, as several scientific applications can now be executed on multi-core desktop platforms. To our knowledge, there is a need for research oriented towards supporting load balancing over a CPU-GPU (or CPU-accelerator) platform. The work shows its relevance by analyzing not only platform characteristics but also the platform's execution context.
The CG was also used to compare performance when using multiple GPUs (figure above). About 2M unknowns are needed before using two GPUs pays off; with fewer elements, the communication overhead increases. The multi-GPU approach demonstrates that the speedup depends on the problem size; the achieved speedup was 1.7 for 8M unknowns. A one-line formulation of this rule of thumb is sketched below.
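The function name is illustrative and the 2M threshold is taken from the observation above:

#include <cstddef>

// Below roughly 2M unknowns the extra inter-GPU communication outweighs the
// gain, so stay on a single GPU; above it, split the CG across two GPUs.
inline int gpusToUse(std::size_t unknowns) {
    const std::size_t multiGpuBreakEven = 2000000;
    return unknowns >= multiGpuBreakEven ? 2 : 1;
}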
The figures above show the performance of the solvers on the most powerful PU used, the GTX 285 GPU. For small problem sizes, GS and Jacobi are faster; for large problems, the CG is. The first figure compares the CG across the PUs: for problems of up to 2K unknowns, the CPU performs better than the GPUs.
Specifically, three iterative methods for solving SLEs are analyzed: Jacobi, red-black Gauss-Seidel (GS), and Conjugate Gradient (CG). The break-even points over the PUs (quad-core CPU, 8800 GT GPU, and GTX 285 GPU), which are used in the first assignment of the scheduling, are presented as first results.
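The solver code itself is not part of the poster; as a reference for what one sweep of the simplest of the three methods looks like on the GPU, here is a minimal CUDA Jacobi iteration for a sparse system in CSR format (an illustrative sketch, not the Fraunhofer IGD implementation):

// One Jacobi sweep for A x = b with A stored in CSR (rowPtr, colIdx, val):
// xNew[i] = (b[i] - sum over j != i of A(i,j) * xOld[j]) / A(i,i).
// One thread handles one matrix row.
__global__ void jacobiSweep(int n, const int* rowPtr, const int* colIdx,
                            const float* val, const float* b,
                            const float* xOld, float* xNew) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    float diag = 0.0f, off = 0.0f;
    for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k) {
        int col = colIdx[k];
        if (col == row) diag = val[k];        // diagonal entry
        else            off += val[k] * xOld[col];
    }
    xNew[row] = (b[row] - off) / diag;        // assumes a non-zero diagonal
}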
Conclusions and Remaining Challenges
Based on the performance evaluation, the need for dynamic scheduling strategies that improve on OpenCL was verified. We propose a method for dynamic load balancing over the CPU and GPU, applied to solvers for SLEs. The remaining challenges are to develop the load-balancing reconfiguration phase for the presented case study, and to analyze the performance of the application without the proposed method in order to evaluate the strategy's overhead.
Acknowledgments
Thanks to Daniel Weber and Christian Daniel for their support. A. Binotto thanks DAAD and Alβan (scholarship no. E07D402961BR) for their support.