Parallel processing is not easy
Good schedules for parallel implementations of composite tasks are hard to find.
• Parallel implementation
• Mapping and scheduling
• Optimization for makespan or throughput
[Diagram: the user's idea on parallel computations (e.g. STAP) must be turned into an executable file for a given parallel computer.]
Parallel processing can be made easy
with ALPS software framework
/*
This is a .cpp file of the partially
adaptive STAP methods that the
user had in mind.
This is optimized for minimum
latency or maximum throughput
on a given parallel machine.
*/
#include <alps.h>
...
[Diagram: the user's idea on parallel computations (e.g. STAP), together with the ALPS Library (various STAP heuristics constructed from ALPS building blocks), is fed through ALPS CodeGen to produce the optimized .cpp file above.]
1. Task Graph
2. ALPS Benchmarker & Modeler: modeling execution times of basic building blocks under various configurations
3. ALPS Scheduler: find the optimal configuration that minimizes overall makespan or maximizes throughput
1. Task Graph: PRI-overlapped STAP
User describes the algorithm using task graphs.
[Task graph: target cube, data cube, and steering vector feed the PRI-overlapped STAP computation.]
The ALPS software framework determines the implementation parameters, such as the number of processors for each task, and returns the executable code. (A sketch of such a task-graph structure follows.)
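The poster does not show the text format of the task-graph front end, so the following is a purely hypothetical C++ sketch of the kind of tree-shaped structure it describes; the Task fields and task names are illustrative, not the ALPS API.

#include <string>
#include <vector>

// Hypothetical sketch of a tree-shaped task graph for PRI-overlapped
// STAP; names are illustrative, not the ALPS API or its text format.
struct Task {
    std::string name;           // building-block name
    std::vector<int> inputs;    // indices of tasks feeding this one
    int procs;                  // processor count, filled in by a scheduler
};

int main() {
    std::vector<Task> graph;
    graph.push_back({"target_cube", {}, 0});          // 0: input
    graph.push_back({"data_cube", {}, 0});            // 1: input
    graph.push_back({"steering_vector", {}, 0});      // 2: input
    graph.push_back({"stap_combine", {0, 1, 2}, 0});  // 3: joining node
    return 0;
}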
2. ALPS Benchmarker and Modeler
1. Actual timings y_i are measured for a set of input parameters.
2. Candidate terms {g_k}, such as the size of the data cube or the size of the processor cube, are generated based on the input parameters.
3. Model coefficients x_k are found using least-squares:
$T_{\text{task}} \approx \sum_k x_k g_k$
In matrix form, with $A = [g_1\ g_2\ \cdots\ g_n]$ (one row per measurement), $x = (x_1, \ldots, x_n)^T$, and $b = (y_1, \ldots, y_m)^T$:
$\min_x \| W(Ax - b) \|_2^2$
Solved incrementally by adding the best column to the model one at a time.
Ordinary least-squares: W = I
COV-weighted least-squares (BLUE estimator): W = R^{-1/2}, where R is the covariance matrix; one of the best, but requires many replications, hence not practical
VAR-weighted least-squares: W = D^{-1/2}, where D = diag(R)
INV-weighted least-squares: W = diag(1/b)
Used ordinary, VAR-weighted, and INV-weighted least-squares with 5 replications for each measurement.
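As an illustration of this modeling step, here is a minimal, self-contained C++ sketch of a weighted least-squares fit, shown with INV weighting (W = diag(1/b)). It is not the ALPS implementation; the incremental column selection described above would call a solver like this once per candidate column and keep the column that most reduces the weighted residual.

#include <cstdio>
#include <vector>

// Minimal sketch: fit T ~ sum_k x_k * g_k by weighted least squares,
// solving the normal equations (A^T W^2 A) x = A^T W^2 b directly.
using Mat = std::vector<std::vector<double>>;

std::vector<double> wls(const Mat& A, const std::vector<double>& b,
                        const std::vector<double>& w) {
    size_t m = A.size(), n = A[0].size();
    Mat N(n, std::vector<double>(n + 1, 0.0));   // augmented normal matrix
    for (size_t i = 0; i < m; ++i) {
        double w2 = w[i] * w[i];
        for (size_t j = 0; j < n; ++j) {
            for (size_t k = 0; k < n; ++k) N[j][k] += w2 * A[i][j] * A[i][k];
            N[j][n] += w2 * A[i][j] * b[i];
        }
    }
    for (size_t p = 0; p < n; ++p)               // forward elimination (no pivoting)
        for (size_t r = p + 1; r < n; ++r) {
            double f = N[r][p] / N[p][p];
            for (size_t c = p; c <= n; ++c) N[r][c] -= f * N[p][c];
        }
    std::vector<double> x(n);
    for (size_t p = n; p-- > 0; ) {              // back substitution
        double s = N[p][n];
        for (size_t c = p + 1; c < n; ++c) s -= N[p][c] * x[c];
        x[p] = s / N[p][p];
    }
    return x;
}

int main() {
    // Toy data: two candidate terms, g1 = (data-cube size)/p and g2 = 1.
    Mat A = {{4.0, 1.0}, {2.0, 1.0}, {1.0, 1.0}, {8.0, 1.0}};
    std::vector<double> y = {9.1, 5.2, 3.1, 17.0};            // measured timings
    std::vector<double> w(y.size());
    for (size_t i = 0; i < y.size(); ++i) w[i] = 1.0 / y[i];  // INV weighting
    std::vector<double> x = wls(A, y, w);
    std::printf("x1 = %.3f, x2 = %.3f\n", x[0], x[1]);
    return 0;
}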
[Scatter plot: relative modeling error from weighted least-squares (y-axis) vs. from ordinary least-squares (x-axis), both from 0 to 0.7; ordinary least-squares gives 26.6% error, INV-weighted least-squares 7.5%.]
3. ALPS Scheduler
Schedules tree-shaped task graphs:
(tree-shaped graphs) = (linear chains) + (joining nodes)
Scheduling linear chains is rather simple; scheduling joining nodes is complicated.
[Gantt charts for a joining node C with predecessors A and B: data-parallelism only [3] runs A, B, and C each across all processors in series; task-parallelism preferred [4] runs A and B side by side; other possibly better schedules exist.]
We consider both data-parallel (series) and task-parallel (parallel) schedules for a joining node.
Scheduling a joining node with two incoming arcs:
Use dynamic programming, considering the two incoming arcs both in series and in parallel.
Scheduling a joining node with more than two incoming arcs:
1. Sort the incoming arcs by makespan when maximally parallelized.
2. Schedule two incoming nodes at a time (see the sketch after the example below).
[Example: incoming arcs with maximally parallelized makespans A=10, B=1, C=8, D=5 are sorted (A, C, D, B) and scheduled two at a time.]
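A minimal sketch of the series/parallel decision for one joining node, assuming each incoming arc exposes its makespan as a function of processor count; the makespan tables and the simple min/max recurrence are illustrative, not the ALPS scheduler's actual dynamic program.

#include <algorithm>
#include <cstdio>
#include <vector>

// tA[p] and tB[p] are the makespans of the two incoming arcs when run
// on p processors (index 0 unused). For each total p, try the arcs in
// series (data-parallel) and in parallel (task-parallel, split k/p-k).
std::vector<double> combine(const std::vector<double>& tA,
                            const std::vector<double>& tB) {
    int P = (int)tA.size() - 1;
    std::vector<double> best(P + 1, 1e30);
    for (int p = 1; p <= P; ++p) {
        best[p] = tA[p] + tB[p];                 // in series: share all p
        for (int k = 1; k < p; ++k)              // in parallel: split k / p-k
            best[p] = std::min(best[p], std::max(tA[k], tB[p - k]));
    }
    return best;  // with >2 arcs, fold the sorted arcs through combine()
}

int main() {
    // Toy makespan tables for up to 4 processors (entry 0 unused).
    std::vector<double> tA = {0, 10, 6, 4.5, 4};
    std::vector<double> tB = {0, 8, 5, 4, 3.5};
    std::vector<double> best = combine(tA, tB);
    for (int p = 1; p <= 4; ++p)
        std::printf("p = %d: best makespan %.1f\n", p, best[p]);
    return 0;
}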
Experiments on an 8-node Linux cluster: executable code was generated by the ALPS software framework.
Makespan: achieved vs. estimated
[Plots: achieved and estimated makespan (sec) vs. the number of processors used (1, 2, 4, 8), and achieved and estimated makespan for data-parallelism only [3], task-parallelism preferred [4], and our scheduling algorithm.]
Schedules found in makespan minimization
Target cube (32x2048x32), data cube (32x2048x32), steering vector (32x1x32):
• Data-parallelism only [3]: ~31 sec
• Task-parallelism preferred [4]: ~24 sec
• Our scheduling algorithm: ~24 sec
[Gantt charts of each schedule over processors P1 through P8]
Throughput: achieved vs. estimated
[Plots: achieved and estimated throughput (1/sec) vs. the number of processors used (1, 2, 4, 8), and achieved and estimated throughput for data-parallelism only [3], task-parallelism preferred [4], and our scheduling algorithm.]
Schedules found in throughput maximization
Target cube (32x2048x32), data cube (32x2048x32), steering vector (32x1x32):
• Data-parallelism only [3] and our scheduling algorithm: the schedule on 1 processor, with period ~77 sec, replicated 8 times
• Task-parallelism preferred [4]: the schedule on 4 processors, with period ~45 sec, replicated 2 times
Replication is what makes the first schedule win: eight 1-processor replicas deliver 8 results every ~77 sec (~0.10/sec), while two 4-processor replicas deliver 2 results every ~45 sec (~0.044/sec). A tiny sketch of this arithmetic follows.
[Gantt charts of each schedule over processors P1 through P8]
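To make the throughput comparison concrete, here is the arithmetic as a small C++ sketch, assuming each replica completes one data cube per period (periods and replica counts are the poster's numbers).

#include <cstdio>

// Throughput of a replicated schedule: with P processors total and a
// schedule occupying p processors per replica, P/p copies run side by
// side, so aggregate throughput is (P/p) / period.
int main() {
    const double P = 8;  // processors in the cluster
    std::printf("1-proc schedule x 8: %.3f /sec\n", (P / 1) / 77.0);
    std::printf("4-proc schedule x 2: %.3f /sec\n", (P / 4) / 45.0);
    return 0;
}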
Conclusion
The schedules found by the ALPS framework are as good as, and often better than, those computed by other published algorithms.
The relative error between the estimated and the achieved makespan/throughput was around 10%, confirming that the modeled execution time is useful for predicting performance.
Weighted least-squares produced estimates 20 to 30% better than ordinary least-squares in terms of relative error; it may be useful when modeling unreliable timings.
Related work
Frameworks
[1] PVL and [2] S3P
We use a text version of the task graph as the front end and source-code generation as the back end, so users never deal with C++ code directly.
We cover a larger scheduling problem space (tree-structured task graphs) than S3P.
Scheduling algorithms
[3] J. Subhlok and G. Vondran: chains of tasks only
[4] D. Nicol, R. Simha, and A. Choudhary: series-parallel task graphs, but task-parallelism is preferred
(... and many other works; refer to the references in the abstract)
References
[1] E. Rutledge and J. Kepner, "PVL: An Object Oriented Software Library for Parallel Signal Processing", IEEE Cluster 2001.
[2] H. Hoffmann, J. V. Kepner, and R. A. Bond, "S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures", in Proceedings of the Fifth Annual High Performance Embedded Computing (HPEC) Workshop, Nov. 2001.
[3] J. Subhlok and G. Vondran, "Optimal Use of Mixed Task and Data Parallelism for Pipelined Computations", Journal of Parallel and Distributed Computing, vol. 60, 2000.
[4] D. Nicol, R. Simha, and A. Choudhary, "Optimal Processor Assignments for a Class of Pipelined Computations", IEEE Transactions on Parallel and Distributed Systems, vol. 5(4), 1994.