Parallel processing is not easy
Good schedules for parallel implementations of composite tasks are hard to find. The user must handle:
• Parallel implementation
• Mapping and scheduling
• Optimization for makespan or throughput

Parallel processing can be made easy with the ALPS software framework
The user's idea of a parallel computation (e.g., STAP) goes through ALPS CodeGen, which produces an executable file for a given parallel computer:

/* This is a .cpp file of the partially adaptive STAP methods that the
   user had in mind, optimized for minimum latency or maximum
   throughput on a given parallel machine. */
#include <alps.h>
...

The ALPS Library provides various STAP heuristics constructed from ALPS building blocks. The framework has three components:
1. Task Graph: the user describes the algorithm as a task graph.
2. ALPS Benchmarker & Modeler: models the execution times of the basic building blocks under various configurations.
3. ALPS Scheduler: finds the configuration that minimizes the overall makespan or maximizes the throughput.

1 Task Graph: PRI-overlapped STAP
The user describes the algorithm using task graphs.
[Figure: task graph of PRI-overlapped STAP, with Target cube, Data cube, and Steering vector as inputs.]
The ALPS software framework determines the implementation parameters, such as the number of processors for each task, and returns the executable code.

2 ALPS Benchmarker and Modeler
1. Measure actual timings y_i for a set of input parameters.
2. Generate candidate terms {g_k}, such as the size of the data cube or the size of the processor cube, from the input parameters.
3. Find the model coefficients x_k by least-squares, so that T_task ≈ sum_k x_k g_k:

    min_x || W(Ax - b) ||_2^2

where column k of A holds candidate term g_k evaluated at each of the m measurements, and b = (y_1, ..., y_m)^T. The problem is solved incrementally by adding the best column to the model one at a time (see the sketch after Section 3 below).

Choices of the weighting matrix W:
• Ordinary least-squares: W = I
• COV-weighted least-squares (BLUE estimator): W = R^(-1/2), where R is the covariance matrix; one of the best, but requires many replications, hence not practical
• VAR-weighted least-squares: W = D^(-1/2), where D = diag(R)
• INV-weighted least-squares: W = diag(1/b)

We used ordinary, VAR-weighted, and INV-weighted least-squares with 5 replications for each measurement.
[Figure: scatter plot of estimates from weighted vs. ordinary least-squares; 26.6% relative error for ordinary least-squares vs. 7.5% for INV-weighted least-squares.]

3 ALPS Scheduler
The scheduler handles tree-shaped task graphs:
(tree-shaped graphs) = (linear chains) + (joining nodes)
where chains are rather simple and joining nodes are complicated.
[Figure: Gantt charts (processors vs. time) of tasks A, B, C: data-parallelism only [3], task-parallelism preferred [4], and other possibly better schedules.]
We consider both data-parallel (series) and task-parallel (parallel) schedules for a joining node:
• Joining node with two incoming arcs: use dynamic programming, considering the two incoming arcs both in series and in parallel (see the sketch below).
• Joining node with more than two incoming arcs:
1. Sort the incoming arcs by their makespan when maximally parallelized.
2. Schedule two incoming nodes at a time.
[Figure: four arcs A, B, C, D with makespans 10, 1, 8, 5 are sorted to 10 (A), 8 (C), 5 (D), 1 (B) and paired two at a time.]
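The two-arc combination step can be illustrated with a small sketch. This is not the ALPS scheduler itself, only a minimal C++ illustration under assumed semantics: the best makespan of each incoming branch on p processors is already known (the hypothetical tables A and B below), and the joining node takes the better of the series (data-parallel) and parallel (task-parallel) combinations.

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const int P = 4;
    // Best makespan of each incoming branch on 1..P processors
    // (index 0 unused); hypothetical numbers for illustration.
    std::vector<double> A = {0, 10.0, 6.0, 4.5, 4.0};
    std::vector<double> B = {0, 8.0, 4.5, 3.5, 3.0};

    for (int p = 1; p <= P; ++p) {
        // Series (data-parallel): run branch A, then branch B, each on all p.
        double series = A[p] + B[p];
        // Parallel (task-parallel): split p = pa + (p - pa) between branches.
        double par = series;
        for (int pa = 1; pa < p; ++pa)
            par = std::min(par, std::max(A[pa], B[p - pa]));
        std::printf("p=%d: series=%.1f best-parallel=%.1f chosen=%.1f\n",
                    p, series, par, std::min(series, par));
    }
    return 0;
}

With these numbers the parallel combination wins for p = 2 and p = 4, and the series combination for p = 3, which is why both alternatives must be kept in the dynamic program.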
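The incremental model fitting of Section 2 can be sketched in the same style. The following is a simplified, hypothetical illustration, not the ALPS Benchmarker code: each step fits one candidate column alone against the current residual (a matching-pursuit-style shortcut to adding the best column at a time) under INV weighting W = diag(1/y).

#include <cstddef>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;

// Weighted inner product <a, b>_w; w holds the squared diagonal of W.
double wdot(const Vec& a, const Vec& b, const Vec& w) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += w[i] * a[i] * b[i];
    return s;
}

int main() {
    // Toy data: m = 4 measurements, n = 3 candidate terms (all hypothetical).
    std::vector<Vec> g = {
        {1, 1, 1, 1},     // constant term
        {1, 2, 3, 4},     // e.g., data-cube size
        {1, 4, 9, 16},    // e.g., square of processor-cube size
    };
    Vec y = {2.1, 6.3, 12.2, 20.4};   // measured timings y_i
    Vec w(y.size());                   // INV weighting: W = diag(1/y)
    for (std::size_t i = 0; i < y.size(); ++i) w[i] = 1.0 / (y[i] * y[i]);

    Vec r = y;                         // residual of the (initially empty) model
    std::vector<bool> used(g.size(), false);
    for (int step = 0; step < 2; ++step) {   // add two terms, one at a time
        int best = -1;
        double bestDrop = 0.0, bestCoef = 0.0;
        for (std::size_t k = 0; k < g.size(); ++k) {
            if (used[k]) continue;
            // 1-D weighted least-squares fit of the residual on column g_k,
            // and the resulting drop in the weighted residual norm.
            double xk = wdot(g[k], r, w) / wdot(g[k], g[k], w);
            double drop = xk * xk * wdot(g[k], g[k], w);
            if (drop > bestDrop) { best = (int)k; bestDrop = drop; bestCoef = xk; }
        }
        if (best < 0) break;
        used[best] = true;
        for (std::size_t i = 0; i < r.size(); ++i) r[i] -= bestCoef * g[best][i];
        std::printf("step %d: added term g_%d, coefficient %.3f\n",
                    step + 1, best, bestCoef);
    }
    return 0;
}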
Experiments
Experiments were run on an 8-node Linux cluster; the executable codes were generated by the ALPS software framework.

Makespan: achieved vs. estimated
[Figure: makespan (sec) vs. number of processors (1, 2, 4, 8), estimated and achieved; bar chart comparing data-parallelism only [3], task-parallelism preferred [4], and our scheduling algorithm.]

Schedules found in makespan minimization
Target cube (32x2048x32), Data cube (32x2048x32), Steering vector (32x1x32):
• Data-parallelism only [3]: ~31 sec
• Task-parallelism preferred [4]: ~24 sec
• Our scheduling algorithm: ~24 sec
[Figure: Gantt charts of the three schedules on processors P1-P8.]

Throughput: achieved vs. estimated
[Figure: throughput (1/sec) vs. number of processors (1, 2, 4, 8), estimated and achieved; bar chart comparing data-parallelism only [3], task-parallelism preferred [4], and our scheduling algorithm.]

Schedules found in throughput maximization
Target cube (32x2048x32), Data cube (32x2048x32), Steering vector (32x1x32):
• Data-parallelism only [3] and our scheduling algorithm: ~77 sec; the schedule on 1 processor is replicated 8 times
• Task-parallelism preferred [4]: ~45 sec; the schedule on 4 processors is replicated 2 times
[Figure: Gantt charts of the schedules on processors P1-P8.]

Conclusion
• The schedules found by the ALPS framework are as good as, and often better than, those computed by other published algorithms.
• The relative error between the estimated and the achieved makespan/throughput was around 10%, confirming that the modeled execution times are useful for predicting performance.
• Weighted least-squares produced estimates 20-30% better than ordinary least-squares in terms of relative error; it may be useful when modeling unreliable timings.

Related works
Frameworks: [1] PVL and [2] S3P. We use a text version of the task graph as the front-end and source-code generation as the back-end, so users do not deal with C++ code. We cover a larger scheduling problem space (tree-structured task graphs) than S3P.
Scheduling algorithms: [3] J. Subhlok and G. Vondran: chains of tasks only. [4] D. Nicol, R. Simha, and A. Choudhary: series-parallel task graphs, but task-parallelism is preferred. (... and many other works; refer to the references in the abstract.)

References
[1] E. Rutledge and J. Kepner, "PVL: An Object Oriented Software Library for Parallel Signal Processing", IEEE Cluster 2001.
[2] H. Hoffmann, J. V. Kepner, and R. A. Bond, "S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures", Proceedings of the Fifth Annual High-Performance Embedded Computing (HPEC) Workshop, Nov. 2001.
[3] J. Subhlok and G. Vondran, "Optimal Use of Mixed Task and Data Parallelism for Pipelined Computations", Journal of Parallel and Distributed Computing, vol. 60, 2000.
[4] D. Nicol, R. Simha, and A. Choudhary, "Optimal Processor Assignments for a Class of Pipelined Computations", IEEE Transactions on Parallel and Distributed Systems, vol. 5(4), 1994.