Making progress with multi-tier programming
Scott B. Baden
Daniel Shalit
Department of Computer Science and Engineering
University of California, San Diego
Introducing Multi-tier Computers
• Hierarchical construction
• Two kinds of communication: slow
messages, fast shared memory
– SMP clusters (numerous vendors)
– NPACI Blue Horizon
– ASCI Blue-Pacific CTR (LLNL)
High Opportunity Cost of Communication
• Interconnect speeds are not keeping pace
– r: DGEMM floating-point rate per node, MFLOP/s
– ß: peak point-to-point MPI message bandwidth, MB/s
• IBM SP2/Power2SC: r = 640, ß = 100
• NPACI Blue Horizon: r = 5,500, ß = 100
• ASCI Blue-Pacific CTR: r = 750, ß = 80
What programming models are available for
multi-tier computers?
• Single Tier
– Flatten the hierarchical communication structure
of the machine: one level or “tier” of parallelism
– Simplest approach; MPI codes are reusable
– Disadvantages: poor use of shared memory,
unable to overlap communication with computation
• Multi-tier
– Utilize information about the hierarchical
communication structure of the machine
– Hybrid model: message passing + threads/OpenMP
– More complicated, but possible to overcome
disadvantages of single tier model
Road Map
• A hierarchical model of parallelism: multi-tier
prototype of KeLP, KeLP2
• How to improve processor utilization when non-blocking, asynchronous, point-to-point message passing fails to realize communication overlap
• What are the opportunities and the limitations?
– Guidelines for employing overlap
– Studies on ASCI Blue-Pacific CTR
– Progress on NPACI Blue Horizon
What is KeLP?
• KeLP = Kernel Lattice Parallelism
• Thesis topic of Stephen J. Fink (Ph.D. 1998)
• A set of run-time C++ class libraries for
parallel computation
– Reduce application development time
without sacrificing performance
– Run-time decomposition and communication
objects
– Structured blocked N-dimensional data
• http://www-cse.ucsd.edu/groups/hpcl/scg/kelp
Multi-tier programming
• For an n-level machine, we identify n levels
of parallelism + one collective level of
control
• KeLP2 programs have 3 levels of control (see the sketch below):
– Collective level: operations performed on all
nodes
– Node level: operations performed on one node
– Processor level: operations performed on one
processor
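As a schematic illustration only (this is not the KeLP2 API), the three levels can be pictured with the hybrid model named earlier: MPI with one process per node plus OpenMP threads.

// Schematic sketch of the three control levels; hypothetical, not KeLP2 code.
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);               // collective level: executed by all nodes

    int node;
    MPI_Comm_rank(MPI_COMM_WORLD, &node); // node level: this node's single process

    #pragma omp parallel                  // processor level: one thread per processor
    {
        int cpu = omp_get_thread_num();
        // ... per-processor computation out of shared memory ...
        (void)cpu;
    }

    MPI_Finalize();
    return 0;
}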
More about the model
• Communication reflects the organization of the
hardware
– Two kinds of communication: slow messages, fast shared
memory
– A node communicates on behalf of its processors
– Direct inter-processor communication only on-node
• Hierarchical parallel control flow
– Node level communication may run as a concurrent task
– Processors execute computation out of shared memory
KeLP’s central abstractions
• MetaData
– Region
– FloorPlan
• Distributed storage container
– XArray
• Parallel control flow
– Iterators
The Region
• Region: box in multidimensional index space
• A geometric calculus to manipulate the regions
• Similar to BoxTools (Colella et al.); does not support general domains as in Titanium (UCB)
Aggregate abstractions
• FloorPlan: a table of regions and their assignment
to processors
• XArray: a distributed collection of multidimensional arrays instantiated over a FloorPlan (see the sketch below)
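As an illustration only (hypothetical class shapes and names, not the actual KeLP headers), the metadata abstractions can be sketched in C++ as:

// Hypothetical sketch of the KeLP-style metadata abstractions (3-D case).
#include <algorithm>
#include <vector>

struct Region {                      // box in 3-D index space
    int lo[3], hi[3];                // inclusive bounds
};

// Part of the geometric calculus: intersection of two boxes
inline Region intersect(const Region& a, const Region& b) {
    Region r;
    for (int d = 0; d < 3; ++d) {
        r.lo[d] = std::max(a.lo[d], b.lo[d]);
        r.hi[d] = std::min(a.hi[d], b.hi[d]);
    }
    return r;                        // empty if lo[d] > hi[d] in any dimension
}

struct FloorPlan {                   // table of regions and their assignment
    std::vector<Region> regions;
    std::vector<int>    owner;       // owner[i] = node holding regions[i]
};

struct Grid { /* storage for one block, allocated over a Region */ };

struct XArray {                      // distributed collection of blocks
    FloorPlan plan;                  // instantiated over a FloorPlan
    std::vector<Grid> localBlocks;   // only the blocks owned by this node
};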
Data Motion Model
• Unit of transfer is a regular section
• Build a persistent communication object, the KeLP Mover, in
terms of regular section copies
• Satisfy dependencies by executing the Mover; may be
executed asynchronously to realize overlap
• Replace point-to-point message passing with geometric descriptions of data dependences (see the sketch below)
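A minimal sketch of the data motion model, assuming a hypothetical Mover-style interface with start/wait semantics (not the actual KeLP Mover API); the Region type is the one sketched above.

// Hypothetical sketch: a persistent communication object built from
// regular-section copies, executable asynchronously to realize overlap.
#include <vector>

struct SectionCopy {
    int    srcBlock, dstBlock;   // indices into the XArray's FloorPlan
    Region srcSection;           // regular section to read
    Region dstSection;           // regular section to write
};

class Mover {
public:
    void add(const SectionCopy& c) { copies_.push_back(c); }  // built once
    void start() { /* launch the copies, e.g. from a proxy thread or with MPI_Isend */ }
    void wait()  { /* block until every copy has completed */ }
private:
    std::vector<SectionCopy> copies_;
};

// Per iteration:
//   mover.start();   // dependences are satisfied asynchronously
//   ... compute on data the Mover does not touch ...
//   mover.wait();    // overlap is realized if there is enough computation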
Road Map
• A hierarchical model of parallelism: multi-tier
prototype of KeLP, KeLP2
• How to improve processor utilization when non-blocking, asynchronous, point-to-point message passing fails to realize communication overlap
• What are the opportunities and the limitations?
– Studies on ASCI Blue-Pacific CTR
– Progress on NPACI Blue Horizon
– Guidelines for employing overlap
Single-tier formulation of an iterative method
• Finite difference solver for Poisson’s eqn
• Decompose the data BLOCKwise
• Execute one process per processor
Transmit halo regions between processes
Compute inner region after communication completes
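A minimal single-tier sketch of this step, assuming a 1-D block decomposition along z, one MPI process per processor, and a ghost plane at each end of the local block (array layout and names are illustrative):

// Single-tier sketch: exchange one ghost plane with each z-neighbor,
// then compute once the halo has arrived. Names and layout are illustrative.
#include <mpi.h>
#include <vector>

void exchange_then_relax(std::vector<double>& u, int nx, int ny, int nzLocal,
                         int below, int above)   // neighbor ranks or MPI_PROC_NULL
{
    const int plane = nx * ny;                   // points per z-plane
    MPI_Request req[4];

    // Receive into the ghost planes k = 0 and k = nzLocal + 1
    MPI_Irecv(&u[0],                   plane, MPI_DOUBLE, below, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&u[(nzLocal + 1)*plane], plane, MPI_DOUBLE, above, 1, MPI_COMM_WORLD, &req[1]);
    // Send the boundary planes k = 1 and k = nzLocal
    MPI_Isend(&u[1*plane],             plane, MPI_DOUBLE, below, 1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&u[nzLocal*plane],       plane, MPI_DOUBLE, above, 0, MPI_COMM_WORLD, &req[3]);

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);    // no overlap attempted here

    // Relax the local block only after communication completes:
    // for (int k = 1; k <= nzLocal; ++k) ... 7-point stencil on plane k ...
}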
Hierarchical multi-tier reformulation
• One process per node, p threads per process
Transmit halo regions between nodes with MPI
Compute inner region in shared memory using threads
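A minimal multi-tier sketch of the compute step, assuming one MPI process per node (halo exchange as in the previous sketch) and OpenMP threads for the shared-memory computation; the 7-point Poisson relaxation and array layout are illustrative.

// Multi-tier sketch: after the node-level halo exchange, every processor on
// the node relaxes part of the block out of shared memory.
#include <omp.h>

void relax_node_block(const double* u, double* unew, const double* f,
                      int nx, int ny, int nzLocal, double h2)
{
    #pragma omp parallel for                     // processor level: p threads per node
    for (int k = 1; k <= nzLocal; ++k)           // interior z-planes (ghosts at 0, nzLocal+1)
        for (int j = 1; j < ny - 1; ++j)
            for (int i = 1; i < nx - 1; ++i) {
                const int idx = (k*ny + j)*nx + i;
                unew[idx] = ( u[idx - 1]     + u[idx + 1]
                            + u[idx - nx]    + u[idx + nx]
                            + u[idx - nx*ny] + u[idx + nx*ny]
                            - h2 * f[idx] ) / 6.0;
            }
}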
A communication-computation imbalance
Only a single thread communicates on each node
Load imbalance due to a serial section
If we have enough computation, we can shift work
to improve hardware utilization
Overlapping Communication with Computation
Reformulate the algorithm
Isolate the inner region from the halo
Execute communication concurrently with computation on
the inner region
Compute on the annulus when the halo finishes
Give up one processor to handle communication; it may not be practical to have that processor also compute (see the sketch below)
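A sketch of the overlapped schedule with OpenMP threads and the Mover-style interface sketched earlier: thread 0 serves as the communication proxy (only one thread touches MPI), threads 1..p-1 compute the inner region, and all p threads then compute the annulus. The chunk helpers named in the comments are hypothetical.

// Sketch of proxy-based overlap; Mover is the hypothetical class sketched earlier.
#include <omp.h>

struct Mover { void start(); void wait(); };     // see the earlier sketch

void overlapped_step(Mover& halo, int p)         // p = processors per node
{
    #pragma omp parallel num_threads(p)
    {
        const int t = omp_get_thread_num();
        if (t == 0) {
            halo.start();                        // proxy: drive the halo exchange
            halo.wait();
        } else {
            // threads 1..p-1 relax the inner region, which the halo never touches
            // compute_inner_chunk(t - 1, p - 1);
        }
        #pragma omp barrier                      // halo and inner region both done
        // all p threads now relax the annulus adjacent to the ghost cells
        // compute_annulus_chunk(t, p);
    }
}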
Road Map
• A hierarchical model of parallelism: multi-tier
prototype of KeLP, KeLP2
• How to improve processor utilization when non-blocking, asynchronous, point-to-point message passing fails to realize communication overlap
• What are the opportunities and the limitations?
– Guidelines for employing overlap
– Studies on ASCI Blue-Pacific CTR
– Progress on NPACI Blue Horizon
A Performance Model of overlap
• Give up one processor to communication; p = number of processors per node
• Normalize the non-overlapped running time to 1.0
• f < 1 is the fraction of the multi-tier, non-overlapped time spent in communication
• Overlapped running time = MAX( (1-f) x p/(p-1), f )
– p/(p-1) is the slowdown factor for the computation
• Useful range: f > 1/p
• When f > 0.8, improvement < 20%
• Equal time in communication and computation when f = p/(2p-1) (worked example below)
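A small sketch that evaluates the model; the numbers in the comments use p = 4 and f = 0.51, matching the 8-node, N = 128 case reported later in the talk.

// Evaluate the overlap model: non-overlapped running time normalized to 1.0,
// f = fraction spent in communication, one of p processors given to the proxy.
#include <algorithm>
#include <cstdio>

double overlapped_time(double f, int p) {
    return std::max((1.0 - f) * p / (p - 1),     // computation, dilated by p/(p-1)
                    f);                          // communication, hidden behind it
}

int main() {
    const int    p = 4;
    const double f = 0.51;                       // e.g. 8 nodes, N = 128
    const double T = overlapped_time(f, p);      // max(0.49 * 4/3, 0.51) = 0.653
    std::printf("T = %.3f  speedup = %.2f\n", T, 1.0 / T);
    // Prints T = 0.653, speedup 1.53, i.e. the ~53% improvement predicted
    // for this case in the comparison with the model.
    return 0;
}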
Performance
• When we displace computation to make way for
the proxy, computation time increases
• Wait on communication drops to zero, ideally
• When f < p/(2p-1) (compute bound): improvement is (p-1)/(p(1-f)) - 1
• When f > p/(2p-1) (communication bound): improvement is (1-f)/f
[Figure: time lines before and after overlap. The non-overlapped run (T = 1.0) splits into computation (1-f) and communication (f); with overlap, the computation dilates to T = (1-f) x p/(p-1) and the communication is hidden beneath it.]
Road Map
• A hierarchical model of parallelism: multi-tier
prototype of KeLP, KeLP2
• How to improve processor utilization when non-blocking, asynchronous, point-to-point message passing fails to realize communication overlap
• What are the opportunities and the limitations?
– Guidelines for employing overlap
– Studies on ASCI Blue-Pacific CTR
– Progress on NPACI Blue Horizon
Results: ASCI Blue Pacific CTR
• Multiple SMP nodes
– 320 4-way 332 MHz PowerPC 604e compute nodes
– 1.5 GB memory per node; 32 KB L1, 256 KB L2 per processor
• Differential MPI communication rates (peak ring bandwidth)
– 82 MB/sec off-node, 77 MB/sec on-node
• 81% parallel efficiency on 1 node w/ 4 threads
Redblack3D on 1 node
[Figure: Redblack3D MFLOP/s (0 to 120) on one node vs. number of threads (1 to 4)]
Variants
• KeLP2
– Overlapped, using inner annulus
– Non-overlapped: communication runs
serially, no annulus
– Single tier: 1 thread per MPI process, 4
processes per node
• MPI: hand coded, 4 processes / node
Environment settings
• KeLP2, overlapped
MP_CSS_INTERRUPT=no
MP_POLLING_INTERVAL=2000000000
MP_SINGLE_THREAD=yes
AIXTHREAD_SCOPE=S
• KeLP2, non-overlapped
MP_CSS_INTERRUPT=yes
AIXTHREAD_SCOPE=S
• MPI and KeLP2 single-tier
– #PSUB -g @tpn4
Software
• KeLP layered on top of MPI and pthreads
• Parallelism expressed in C++, computation in f77
• Compilers: mpCC_r, mpxlf_r
• Compiler flags
-O3 -qstrict -qtune=auto -qarch=auto
• OS: AIX 4.3.3
Performance improves with overlap
Redblack3D, ASCI Blue-Pacific CTR, 8 Nodes
[Figure: running time (sec, 0 to 5) for N = 128 and N = 160, comparing hand-coded MPI, single-tier (ST), KeLP2 without overlap, and KeLP2 with overlap; each bar split into Compute and Fillpatch time]
Comparison with the model
• Consider compute bound cases: f < p/(2p-1)
• On 8 nodes, N=128, f=0.51
– We predict 53%, observe 33%
– Underestimated slowdown of computation
• N=160: f=0.41. Predict 21%, observe 16%
• N=256: f=0.28. Slight slowdown
• For P=4, N=320, 64 nodes: f=0.52
– Predict 35%, observe 14%
• Investigating cause of increased slowdown
A closer look at performance
• 8 nodes, N=128, without overlap (times in seconds)

NThrds   Total   Wait    Compute   Proxy
  1      3.51    0.814   2.73      0.792
  2      2.26    0.814   1.43      0.78
  3      1.89    0.825   1.05      0.789
  4      1.67    0.851   0.798     0.811
• With overlap
NThrds   Total   Wait    Compute   Proxy
  1      3.42    0.081   3.38      2.83
  2      1.73    0.166   1.67      1.52
  3      1.26    0.119   1.16      1.01
  4      1.66    0.647   1.14      1.16
NPACI Blue Horizon
• Consider N=800 on 8 nodes (64 proc)
• Nodes are ~8 times faster than the CTR's, but inter-node communication is twice as slow
• Non-overlapped execution: f=0.34
• Predict 25% improvement with overlap
• Communication proxy interferes with
computation threads
• We can hide the communication at the cost
of slowing down computation enough to
offset any advantages
• Currently under investigation
A closer look at performance
By comparison, single-tier K2 runs in 41 sec (all times in seconds)

Without overlap
NThrds   Total   Wait    Compute   Proxy
  1       293    16.9     279      16.8
  6      55.1    15.4    41.2      15.4
  7      50.1    15.6    36.3      15.6
  8      45.5    15.6    31        15.6

With overlap
NThrds   Total   Wait    Compute   Proxy
  6      49.9    0.398   49.8      32.6
  7      44.2    1.48    44.2      31.9
  8      48.4    6.62    42.5      34.1
Inside the KeLP Mover
• Mover encapsulates installation and
architecture-specific optimizations,
without affecting correctness of the
code
– Packetization to improve performance of
ATM
– Mover run as a proxy on SMP-based nodes
Related Work
• Fortran P [Sawdey et al., 1997]
• SIMPLE [Bader & JáJá 1997]
• Multiprotocol Active Messages [Lummeta,
Mainwaring and Culler, 1997]
• Message Proxies [Lim, Snir, et al., 1998]
Conclusions and Future Work
• The KeLP Mover separates correctness concerns
from policy decisions that affect performance
• Accommodates generational changes in hardware,
for software that will live for many generations
• Multi-tier programming can realize performance
improvements via overlap
• Overlap will become more attractive in the future
due to increased multiprocessing on the node
• Future work
– Hierarchical algorithm design
– Large grain dataflow
Acknowledgements and Further
Information
• Sponsorship
– NSF (CARM, ACR), NPACI, DoE (ISCR)
– State of California, Sun Microsystems (Cal Micro)
– NCSA
• Official NPACI release KeLP1.3
– NPACI Blue Horizon, Origin 2000, Sun HPC, clusters
– Workstations: Solaris, Linux, etc.
– http://www.cse.ucsd.edu/groups/hpcl/scg/kelp
• Thanks to John May, Bronis de Supinski (CASC/LLNL), Bill Tuel (IBM)