Making progress with multi-tier programming
Scott B. Baden
Daniel Shalit
Department of Computer Science and Engineering
University of California, San Diego
Introducing Multi-tier Computers
• Hierarchical construction
• Two kinds of communication: slow
messages, fast shared memory
– SMP clusters (numerous vendors)
– NPACI Blue Horizon
– ASCI Blue-Pacific CTR (LLNL)
High Opportunity Cost of Communication
• Interconnect speeds are not keeping pace
– r: DGEMM floating-point rate per node, MFLOP/s
– ß: peak point-to-point MPI message bandwidth, MB/s
• IBM SP2/Power2SC: r = 640, ß = 100
• NPACI Blue Horizon: r = 5,500, ß = 100
• ASCI Blue-Pacific CTR: r = 750, ß = 80
What programming models are available for
multi-tier computers?
• Single Tier
– Flatten the hierarchical communication structure
of the machine: one level or “tier” of parallelism
– Simplest approach; MPI codes are reusable
– Disadvantages: poor use of shared memory,
unable to overlap communication with computation
• Multi-tier
– Utilize information about the hierarchical
communication structure of the machine
– Hybrid model: message passing + threads/OpenMP
– More complicated, but possible to overcome
disadvantages of single tier model
Road Map
• A hierarchical model of parallelism: multi-tier
prototype of KeLP, KeLP2
• How to improve processor utilization when non-blocking, asynchronous, point-to-point message passing fails to realize communication overlap
• What are the opportunities and the limitations?
– Guidelines for employing overlap
– Studies on ASCI Blue-Pacific CTR
– Progress on NPACI Blue Horizon
What is KeLP?
• KeLP = Kernel Lattice Parallelism
• Thesis topic of Stephen J. Fink (Ph.D. 1998)
• A set of run-time C++ class libraries for
parallel computation
– Reduce application development time
without sacrificing performance
– Run-time decomposition and communication
objects
– Structured blocked N-dimensional data
• http://www-cse.ucsd.edu/groups/hpcl/scg/kelp
Multi-tier programming
• For an n-level machine, we identify n levels
of parallelism + one collective level of
control
• KeLP2 programs have 3 levels of control (see the sketch below):
– Collective level: operations performed on all
nodes
– Node level: operations performed on one node
– Processor level: operations performed on one
processor
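As a schematic illustration only (this is not the KeLP2 API), the three levels can be pictured with the hybrid model named earlier: MPI with one process per node plus OpenMP threads.

// Schematic sketch of the three control levels; hypothetical, not KeLP2 code.
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);               // collective level: executed by all nodes

    int node;
    MPI_Comm_rank(MPI_COMM_WORLD, &node); // node level: this node's single process

    #pragma omp parallel                  // processor level: one thread per processor
    {
        int cpu = omp_get_thread_num();
        // ... per-processor computation out of shared memory ...
        (void)cpu;
    }

    MPI_Finalize();
    return 0;
}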
More about the model
• Communication reflects the organization of the
hardware
– Two kinds of communication: slow messages, fast shared
memory
– A node communicates on behalf of its processors
– Direct inter-processor communication only on-node
• Hierarchical parallel control flow
– Node level communication may run as a concurrent task
– Processors execute computation out of shared memory
KeLP’s central abstractions
• MetaData
– Region
– FloorPlan
• Distributed storage container
– XArray
• Parallel control flow
– Iterators
The Region
• Region: box in multidimensional index space
• A geometric calculus to manipulate the regions
• Similar to BoxTools (Colella et al.); does not support general domains as in Titanium (UCB)
Aggregate abstractions
• FloorPlan: a table of regions and their assignment
to processors
• XArray: a distributed collection of multidimensional arrays instantiated over a FloorPlan (see the sketch below)
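As an illustration only (hypothetical class shapes and names, not the actual KeLP headers), the metadata abstractions can be sketched in C++ as:

// Hypothetical sketch of the KeLP-style metadata abstractions (3-D case).
#include <algorithm>
#include <vector>

struct Region {                      // box in 3-D index space
    int lo[3], hi[3];                // inclusive bounds
};

// Part of the geometric calculus: intersection of two boxes
inline Region intersect(const Region& a, const Region& b) {
    Region r;
    for (int d = 0; d < 3; ++d) {
        r.lo[d] = std::max(a.lo[d], b.lo[d]);
        r.hi[d] = std::min(a.hi[d], b.hi[d]);
    }
    return r;                        // empty if lo[d] > hi[d] in any dimension
}

struct FloorPlan {                   // table of regions and their assignment
    std::vector<Region> regions;
    std::vector<int>    owner;       // owner[i] = node holding regions[i]
};

struct Grid { /* storage for one block, allocated over a Region */ };

struct XArray {                      // distributed collection of blocks
    FloorPlan plan;                  // instantiated over a FloorPlan
    std::vector<Grid> localBlocks;   // only the blocks owned by this node
};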
Data Motion Model
• Unit of transfer is a regular section
• Build a persistent communication object, the KeLP Mover, in
terms of regular section copies
• Satisfy dependencies by executing the Mover; may be
executed asynchronously to realize overlap
• Replace point-to-point message passing with geometric descriptions of data dependences (see the sketch below)
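A minimal sketch of the data motion model, assuming a hypothetical Mover-style interface with start/wait semantics (not the actual KeLP Mover API); the Region type is the one sketched above.

// Hypothetical sketch: a persistent communication object built from
// regular-section copies, executable asynchronously to realize overlap.
#include <vector>

struct SectionCopy {
    int    srcBlock, dstBlock;   // indices into the XArray's FloorPlan
    Region srcSection;           // regular section to read
    Region dstSection;           // regular section to write
};

class Mover {
public:
    void add(const SectionCopy& c) { copies_.push_back(c); }  // built once
    void start() { /* launch the copies, e.g. from a proxy thread or with MPI_Isend */ }
    void wait()  { /* block until every copy has completed */ }
private:
    std::vector<SectionCopy> copies_;
};

// Per iteration:
//   mover.start();   // dependences are satisfied asynchronously
//   ... compute on data the Mover does not touch ...
//   mover.wait();    // overlap is realized if there is enough computation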
Road Map
• A hierarchical model of parallelism: multi-tier
prototype of KeLP, KeLP2
• How to improve processor utilization when non-blocking, asynchronous, point-to-point message passing fails to realize communication overlap
• What are the opportunities and the limitations?
– Studies on ASCI Blue-Pacific CTR
– Progress on NPACI Blue Horizon
– Guidelines for employing overlap
Single-tier formulation of an iterative method
• Finite difference solver for Poisson’s eqn
• Decompose the data BLOCKwise
• Execute one process per processor
Transmit halo regions between processes
Compute inner region after communication completes
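A minimal single-tier sketch of this step, assuming a 1-D block decomposition along z, one MPI process per processor, and a ghost plane at each end of the local block (array layout and names are illustrative):

// Single-tier sketch: exchange one ghost plane with each z-neighbor,
// then compute once the halo has arrived. Names and layout are illustrative.
#include <mpi.h>
#include <vector>

void exchange_then_relax(std::vector<double>& u, int nx, int ny, int nzLocal,
                         int below, int above)   // neighbor ranks or MPI_PROC_NULL
{
    const int plane = nx * ny;                   // points per z-plane
    MPI_Request req[4];

    // Receive into the ghost planes k = 0 and k = nzLocal + 1
    MPI_Irecv(&u[0],                   plane, MPI_DOUBLE, below, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&u[(nzLocal + 1)*plane], plane, MPI_DOUBLE, above, 1, MPI_COMM_WORLD, &req[1]);
    // Send the boundary planes k = 1 and k = nzLocal
    MPI_Isend(&u[1*plane],             plane, MPI_DOUBLE, below, 1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&u[nzLocal*plane],       plane, MPI_DOUBLE, above, 0, MPI_COMM_WORLD, &req[3]);

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);    // no overlap attempted here

    // Relax the local block only after communication completes:
    // for (int k = 1; k <= nzLocal; ++k) ... 7-point stencil on plane k ...
}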
Hierarchical multi-tier reformulation
• One process per node, p threads per process
Transmit halo regions between nodes with MPI
Compute inner region in shared memory using threads
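A minimal multi-tier sketch of the compute step, assuming one MPI process per node (halo exchange as in the previous sketch) and OpenMP threads for the shared-memory computation; the 7-point Poisson relaxation and array layout are illustrative.

// Multi-tier sketch: after the node-level halo exchange, every processor on
// the node relaxes part of the block out of shared memory.
#include <omp.h>

void relax_node_block(const double* u, double* unew, const double* f,
                      int nx, int ny, int nzLocal, double h2)
{
    #pragma omp parallel for                     // processor level: p threads per node
    for (int k = 1; k <= nzLocal; ++k)           // interior z-planes (ghosts at 0, nzLocal+1)
        for (int j = 1; j < ny - 1; ++j)
            for (int i = 1; i < nx - 1; ++i) {
                const int idx = (k*ny + j)*nx + i;
                unew[idx] = ( u[idx - 1]     + u[idx + 1]
                            + u[idx - nx]    + u[idx + nx]
                            + u[idx - nx*ny] + u[idx + nx*ny]
                            - h2 * f[idx] ) / 6.0;
            }
}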
A communication-computation imbalance
Only a single thread communicates on each node
Load imbalance due to a serial section
If we have enough computation, we can shift work
to improve hardware utilization
Overlapping Communication with Computation
Reformulate the algorithm
Isolate the inner region from the halo
Execute communication concurrently with computation on
the inner region
Compute on the annulus when the halo finishes
Give up one processor to handle communication; it may not be practical to have that processor also compute (see the sketch below)
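A sketch of the overlapped schedule with OpenMP threads and the Mover-style interface sketched earlier: thread 0 serves as the communication proxy (only one thread touches MPI), threads 1..p-1 compute the inner region, and all p threads then compute the annulus. The chunk helpers named in the comments are hypothetical.

// Sketch of proxy-based overlap; Mover is the hypothetical class sketched earlier.
#include <omp.h>

struct Mover { void start(); void wait(); };     // see the earlier sketch

void overlapped_step(Mover& halo, int p)         // p = processors per node
{
    #pragma omp parallel num_threads(p)
    {
        const int t = omp_get_thread_num();
        if (t == 0) {
            halo.start();                        // proxy: drive the halo exchange
            halo.wait();
        } else {
            // threads 1..p-1 relax the inner region, which the halo never touches
            // compute_inner_chunk(t - 1, p - 1);
        }
        #pragma omp barrier                      // halo and inner region both done
        // all p threads now relax the annulus adjacent to the ghost cells
        // compute_annulus_chunk(t, p);
    }
}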
Road Map
• A hierarchical model of parallelism: multi-tier
prototype of KeLP, KeLP2
• How to improve processor utilization when non-blocking, asynchronous, point-to-point message passing fails to realize communication overlap
• What are the opportunities and the limitations?
– Guidelines for employing overlap
– Studies on ASCI Blue-Pacific CTR
– Progress on NPACI Blue Horizon
A Performance Model of overlap
• Give up one processor to communication; p = number of processors per node
• Normalize the non-overlapped running time to 1.0
• f < 1 is the fraction of the multi-tier, non-overlapped time spent in communication
• Overlapped running time = MAX( (1-f) x p/(p-1), f )
– p/(p-1) is the slowdown factor for the computation
• Useful range: f > 1/p
• When f > 0.8, improvement < 20%
• Equal time in communication and computation when f = p/(2p-1) (worked example below)
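A small sketch that evaluates the model; the numbers in the comments use p = 4 and f = 0.51, matching the 8-node, N = 128 case reported later in the talk.

// Evaluate the overlap model: non-overlapped running time normalized to 1.0,
// f = fraction spent in communication, one of p processors given to the proxy.
#include <algorithm>
#include <cstdio>

double overlapped_time(double f, int p) {
    return std::max((1.0 - f) * p / (p - 1),     // computation, dilated by p/(p-1)
                    f);                          // communication, hidden behind it
}

int main() {
    const int    p = 4;
    const double f = 0.51;                       // e.g. 8 nodes, N = 128
    const double T = overlapped_time(f, p);      // max(0.49 * 4/3, 0.51) = 0.653
    std::printf("T = %.3f  speedup = %.2f\n", T, 1.0 / T);
    // Prints T = 0.653, speedup 1.53, i.e. the ~53% improvement predicted
    // for this case in the comparison with the model.
    return 0;
}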
Performance
• When we displace computation to make way for
the proxy, computation time increases
• Wait on communication drops to zero, ideally
• When f < p/(2p-1) (compute bound): improvement is (p-1)/(p(1-f)) - 1
• When f > p/(2p-1) (communication bound): improvement is (1-f)/f
[Figure: time lines before and after overlap. The non-overlapped run (T = 1.0) splits into computation (1-f) and communication (f); with overlap, the computation dilates to T = (1-f) x p/(p-1) and the communication is hidden beneath it.]
Road Map
• A hierarchical model of parallelism: multi-tier
prototype of KeLP, KeLP2
• How to improve processor utilization when non-blocking, asynchronous, point-to-point message passing fails to realize communication overlap
• What are the opportunities and the limitations?
– Guidelines for employing overlap
– Studies on ASCI Blue-Pacific CTR
– Progress on NPACI Blue Horizon
Results: ASCI Blue Pacific CTR
• Multiple SMP nodes
– 320 4-way 332 MHz PowerPC 604e compute nodes
– 1.5 GB memory per node; 32 KB L1, 256 KB L2 per processor
• Differential MPI communication rates (peak ring bandwidth)
– 82 MB/sec off-node, 77 MB/sec on-node
• 81% parallel efficiency on 1 node w/ 4 threads
Redblack3D on 1 node
[Figure: Redblack3D MFLOP/s (0 to 120) on one node vs. number of threads (1 to 4)]
Variants
• KeLP2
– Overlapped, using inner annulus
– Non-overlapped: communication runs
serially, no annulus
– Single tier: 1 thread per MPI process, 4
processes per node
• MPI: hand coded, 4 processes / node
Environment settings
• KeLP2, overlapped
MP_CSS_INTERRUPT=no
MP_POLLING_INTERVAL=2000000000
MP_SINGLE_THREAD=yes
AIXTHREAD_SCOPE=S
• KeLP2, non-overlapped
MP_CSS_INTERRUPT=yes
AIXTHREAD_SCOPE=S
• MPI and KeLP2 single-tier
– #PSUB -g @tpn4
Software
• KeLP layered on top of MPI and pthreads
• Parallelism expressed in C++, computation in f77
• Compilers: mpCC_r, mpxlf_r
• Compiler flags
-O3 -qstrict -qtune=auto -qarch=auto
• OS: AIX 4.3.3
Performance improves with overlap
Redblack3D, ASCI Blue-Pacific CTR, 8 Nodes
[Figure: running time (sec, 0 to 5) for N = 128 and N = 160, comparing hand-coded MPI, single-tier (ST), KeLP2 without overlap, and KeLP2 with overlap; each bar split into Compute and Fillpatch time]
Comparison with the model
• Consider compute bound cases: f < p/(2p-1)
• On 8 nodes, N=128, f=0.51
– We predict 53%, observe 33%
– Underestimated slowdown of computation
• N=160: f=0.41. Predict 21%, observe 16%
• N=256: f=0.28. Slight slowdown
• For P=4, N=320, 64 nodes: f=0.52
– Predict 35%, observe 14%
• Investigating cause of increased slowdown
A closer look at performance
• 8 nodes, N=128, without overlap (times in seconds)

NThrds   Total   Wait    Compute   Proxy
  1      3.51    0.814   2.73      0.792
  2      2.26    0.814   1.43      0.78
  3      1.89    0.825   1.05      0.789
  4      1.67    0.851   0.798     0.811
• With overlap
NThrds   Total   Wait    Compute   Proxy
  1      3.42    0.081   3.38      2.83
  2      1.73    0.166   1.67      1.52
  3      1.26    0.119   1.16      1.01
  4      1.66    0.647   1.14      1.16
NPACI Blue Horizon
• Consider N=800 on 8 nodes (64 proc)
• Nodes are ~8 times faster than the CTR's, but inter-node communication is twice as slow
• Non-overlapped execution: f=0.34
• Predict 25% improvement with overlap
• Communication proxy interferes with
computation threads
• We can hide the communication at the cost
of slowing down computation enough to
offset any advantages
• Currently under investigation
A closer look at performance
By comparison, single-tier K2 runs in 41 sec (all times in seconds)

Without overlap
NThrds   Total   Wait    Compute   Proxy
  1       293    16.9     279      16.8
  6      55.1    15.4    41.2      15.4
  7      50.1    15.6    36.3      15.6
  8      45.5    15.6    31        15.6

With overlap
NThrds   Total   Wait    Compute   Proxy
  6      49.9    0.398   49.8      32.6
  7      44.2    1.48    44.2      31.9
  8      48.4    6.62    42.5      34.1
Inside the KeLP Mover
• Mover encapsulates installation and
architecture-specific optimizations,
without affecting correctness of the
code
– Packetization to improve performance of
ATM
– Mover run as a proxy on SMP-based nodes
Related Work
• Fortran P [Sawdey et al., 1997]
• SIMPLE [Bader & JáJá 1997]
• Multiprotocol Active Messages [Lummeta,
Mainwaring and Culler, 1997]
• Message Proxies [Lim, Snir, et al., 1998]
Conclusions and Future Work
• The KeLP Mover separates correctness concerns
from policy decisions that affect performance
• Accommodates generational changes in hardware,
for software that will live for many generations
• Multi-tier programming can realize performance
improvements via overlap
• Overlap will become more attractive in the future
due to increased multiprocessing on the node
• Future work
– Hierarchical algorithm design
– Large grain dataflow
Acknowledgements and Further
Information
• Sponsorship
– NSF (CARM, ACR), NPACI, DoE (ISCR)
– State of California, Sun Microsystems (Cal Micro)
– NCSA
• Official NPACI release KeLP1.3
– NPACI Blue Horizon, Origin 2000, Sun HPC, clusters
– Workstations: Solaris, Linux, etc.
– http://www.cse.ucsd.edu/groups/hpcl/scg/kelp
• Thanks to John May, Bronis de Supinski (CASC/LLNL), Bill Tuel (IBM)