A Communication-Optimal N-Body Algorithm for Direct Interactions
Michael Driscoll, Evangelos Georganas, Penporn Koanantakool, Edgar Solomonik, Katherine Yelick*
UC Berkeley
*Lawrence Berkeley National Laboratory
Overview
• Intro to the N-Body problem
• Communication bounds
• Communication-optimal algorithm
• Performance results
• Conclusion
Direct N-Body
n particles
– molecules, galaxies, database tuples, etc.
– O(n²) pairwise interactions

for i = 1 to n:
    for j = 1 to n:
        force[i] += interact(particles[i], particles[j])

The n particles are divided among p processors (a serial code sketch follows below).
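As a concrete, serial illustration of the all-pairs loop above, here is a minimal NumPy sketch; the particle layout and the inverse-square repulsive force are assumptions chosen to match the experiments described later, not the authors' code.

    import numpy as np

    def interact(pi, pj):
        """Assumed pairwise repulsive force that falls off with the square of distance."""
        d = pi - pj
        r2 = np.dot(d, d)
        if r2 == 0.0:                       # skip self-interaction
            return np.zeros_like(pi)
        return d / (r2 * np.sqrt(r2))       # unit direction * 1/r^2

    def all_pairs_forces(particles):
        """Direct O(n^2) force computation: every particle interacts with every other."""
        n = len(particles)
        forces = np.zeros_like(particles)
        for i in range(n):
            for j in range(n):
                forces[i] += interact(particles[i], particles[j])
        return forces

    # Example: 8 particles in 2D
    particles = np.random.rand(8, 2)
    print(all_pairs_forces(particles))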
Communication Model
• Communication cost along critical path.
• Alpha-beta model:  T_comm = α · S + β · W   (a tiny cost helper is sketched below)
  – α = latency, S = # messages
  – β = 1/bandwidth, W = # words
• Can we find lower bounds on S or W?
• Do current algorithms meet those bounds?
• If not, can we find algorithms that do, or tighter bounds?
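To make the model concrete, here is a tiny Python helper that evaluates the alpha-beta cost; the numeric defaults are purely illustrative assumptions, not measurements from the machines used later.

    def comm_time(S, W, alpha=1e-6, beta=1e-9):
        """Alpha-beta model: S messages pay the latency alpha, W words pay 1/bandwidth beta."""
        return alpha * S + beta * W

    # e.g. 100 messages carrying 1e6 words in total:
    print(comm_time(S=100, W=1e6))   # 0.0001 + 0.001 = 0.0011 seconds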
Communication Lower Bounds
From Minimizing Communication in Numerical Linear Algebra [Ballard et al. 2011]:
F = # flops
M = size of fast memory
H = max flops per M words
S = # messages
W = # words

Latency (message) lower bound:   S ≥ F / H
Bandwidth (word) lower bound:    W ≥ M · F / H
Generalized in: Communication Lower Bounds and Optimal Algorithms for Programs
That Reference Arrays [Christ et al. 2013].
Lower Bounds for N-Body
Flops:  F = n²/p per processor
Memory:  M words per processor
Max flops per M words:  H = O(M²)
Plug into the latency and bandwidth lower bounds (a short derivation sketch follows below):
  S = Ω(n² / (p·M²)) messages
  W = Ω(n² / (p·M)) words
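A brief sketch of where these come from, following the segment argument of [Ballard et al. 2011] with constants suppressed; this expansion is ours, not on the slide:

\begin{align*}
F &= \frac{n^2}{p} && \text{interactions per processor}\\
H &= O(M^2) && \text{$M$ words hold $O(M)$ particles, hence $O(M^2)$ interactions}\\
S &\ge \frac{F}{H} = \Omega\!\left(\frac{n^2}{p\,M^2}\right) && \text{latency bound}\\
W &\ge M \cdot \frac{F}{H} = \Omega\!\left(\frac{n^2}{p\,M}\right) && \text{bandwidth bound}
\end{align*}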
Do current algorithms meet these bounds?
A Naïve N-Body Algorithm
[Diagram: processors 0 through p−1 arranged in a ring; each holds n/p resident particles plus a buffer of n/p replicas that is passed around the ring, with partial forces accumulated (+) at each step.]
• For p steps, send n/p particles each step (see the parallel sketch below).
  – # messages: p
  – # words: p · (n/p) = n
• Recall the bounds, with M = n/p:
  – S = Ω(n²/(p·M²)) = Ω(p) ✔
  – W = Ω(n²/(p·M)) = Ω(n) ✔
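A minimal sketch of this pass-the-replicas ring scheme, assuming mpi4py and NumPy; the force kernel and data layout are illustrative assumptions, not the authors' implementation.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, p = comm.Get_rank(), comm.Get_size()

    def interact_blocks(forces, targets, sources):
        """Accumulate an assumed 1/r^2 repulsive force on `targets` from `sources`."""
        for i, pi in enumerate(targets):
            d = pi - sources
            r2 = np.maximum(np.sum(d * d, axis=1), 1e-12)   # clamp also zeroes self-interactions
            forces[i] += np.sum(d / (r2 * np.sqrt(r2))[:, None], axis=0)

    n_local, dim = 1024, 3                  # each rank owns n/p particles (illustrative sizes)
    mine = np.random.rand(n_local, dim)
    replicas = mine.copy()                  # the buffer that travels around the ring
    forces = np.zeros_like(mine)

    left, right = (rank - 1) % p, (rank + 1) % p
    for _ in range(p):                      # p steps, each sending n/p particles
        interact_blocks(forces, mine, replicas)
        replicas = comm.sendrecv(replicas, dest=right, source=left)

After p shifts every rank has interacted its resident particles with all n particles, at a cost of p messages and n words per processor, matching the counts above.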
The naïve algorithm is optimal…
• Recall the lower bounds: S = Ω(n²/(p·M²)), W = Ω(n²/(p·M)).
• Notice M in the denominators.
• Increasing M decreases the communication lower bounds.
• Using more memory lets us target a "lower" lower bound (see the substitution below).
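For instance, replicating the particles c times raises the per-processor memory to M = c·n/p, and the bounds drop accordingly (a worked substitution, anticipating the next slide):

M = \frac{c\,n}{p} \;\Longrightarrow\; W = \Omega\!\left(\frac{n^2}{p\,M}\right) = \Omega\!\left(\frac{n}{c}\right), \qquad S = \Omega\!\left(\frac{n^2}{p\,M^2}\right) = \Omega\!\left(\frac{p}{c^2}\right)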
Communication-Optimal N-Body
[Diagram: the p processors are arranged as p/c teams of c processors each (c layers); each team's block of particles is replicated on all c processors of the team, blocks shift around the ring of teams, and partial forces are summed (+) within each team.]
• Replication factor: c copies of each particle, so M grows from n/p to c·n/p.
• Communication cost per processor (broadcast, shifts, reduction):
  – Shifts dominate: p/c² messages carrying n/c words in total.
  – Broadcast and reduction within each team of c add only O(log c) messages.
  – Total vs. the naïve algorithm: # messages reduced by a factor of c², # words reduced by a factor of c.
• c = p^(1/2) recovers the force-decomposition algorithm [Plimpton 1995].
• A code sketch of the algorithm follows below.
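Below is a minimal mpi4py sketch of the replicated algorithm under stated assumptions (c² divides p, an illustrative block size, and the same assumed 1/r² force kernel as before); it is meant to show the communication pattern, broadcast → p/c² shifts → reduction, not to reproduce the authors' implementation.

    import numpy as np
    from mpi4py import MPI

    world = MPI.COMM_WORLD
    rank, p = world.Get_rank(), world.Get_size()

    c = 4                                     # replication factor; assumes c**2 divides p
    n_teams = p // c
    team_id, layer = rank // c, rank % c

    # One communicator per team (broadcast/reduction across the c layers) and one per
    # layer (a ring of teams, for the shifts). Rank within ring_comm equals team_id.
    team_comm = world.Split(color=team_id, key=layer)
    ring_comm = world.Split(color=layer, key=team_id)

    blk, dim = 512, 3                         # illustrative block size: n*c/p particles
    mine = np.random.rand(blk, dim) if layer == 0 else np.empty((blk, dim))
    team_comm.Bcast(mine, root=0)             # replicate the team's block on all c layers

    def interact_blocks(forces, targets, sources):
        """Accumulate an assumed 1/r^2 repulsive force on `targets` from `sources`."""
        for i, pi in enumerate(targets):
            d = pi - sources
            r2 = np.maximum(np.sum(d * d, axis=1), 1e-12)
            forces[i] += np.sum(d / (r2 * np.sqrt(r2))[:, None], axis=0)

    # Each layer sweeps a different window of source blocks so the c layers of a team
    # collectively visit all n_teams blocks: p/c**2 shift steps per processor.
    steps = n_teams // c
    offset = layer * steps
    left, right = (team_id - 1) % n_teams, (team_id + 1) % n_teams

    if offset:                                # prefetch this layer's starting block
        visiting = ring_comm.sendrecv(mine, dest=(team_id - offset) % n_teams,
                                      source=(team_id + offset) % n_teams)
    else:
        visiting = mine.copy()

    partial = np.zeros_like(mine)
    for _ in range(steps):                    # the shifts that dominate communication
        interact_blocks(partial, mine, visiting)
        visiting = ring_comm.sendrecv(visiting, dest=left, source=right)

    # Sum the c layers' partial forces on this team's block (reduction within the team).
    forces = np.zeros_like(partial)
    team_comm.Reduce(partial, forces, op=MPI.SUM, root=0)

Setting c = 1 degenerates to the naïve ring algorithm; c = √p gives a single shift step per processor, the force-decomposition limit mentioned above.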
Experiments
• Developed a particle code:
  – flat MPI
  – 52-byte particles
  – repulsive force that drops off with the square of distance
  – reflective boundary conditions
• Platforms:
  – Hopper: Cray XE6 at NERSC, 24 cores/node
  – Intrepid: IBM Blue Gene/P at ALCF, 4 cores/node
  – Both have 3D torus interconnects.
Performance on Hopper
24K particles, 6K cores
[Chart: Execution Time vs. Replication Factor. Bars show execution time per timestep (sec), split into computation, shift communication, and reduce communication, for c = 1, 2, 4, 8, 16, 32. Communication time drops by 95.6% at the best replication factor. Down is good.]
Performance on Intrepid
262K particles, 32K cores
[Chart: Execution Time vs. Replication Factor. Bars show execution time per timestep (sec), split into computation, shift communication, and reduce communication, for c = 1 (tree), c = 1 (no-tree), and c = 2 through 128. Communication time drops by 99.3% at the best replication factor. Down is good.]
Strong Scaling on Intrepid
262K particles
[Chart: Parallel efficiency on Intrepid (n = 262,144), relative to one core, versus machine size from 2,048 to 32,768 cores, for c = 1 through 64 and the ideal line. Larger replication factors give up to a 4.5× speedup and perfect strong scaling. Up is good.]
CA N-Body with Cutoff Distance
• No interactions beyond cutoff radius r
• Assuming:
– uniform particle distribution
– spatial processor decomposition
• Simple extension to support a cutoff:
– still communication-optimal
– works in spaces of any dimension
– speedups from 1D and 2D experiments
N-Body with Cutoff
[Diagram: as before, p/c teams of c processors (c layers), but each team owns a spatial slab of particles; with a cutoff diameter, a team's block only needs to visit teams whose slabs fall within the cutoff.]
• Shifts only need to traverse the teams within the cutoff distance (see the sketch below).
• Optimality still holds:
  – same counting argument
  – see the paper for details
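As a rough illustration of the idea, under assumptions we are adding here (a 1D domain split into equal-width slabs, one per team), the number of blocks a processor must visit is set by the cutoff rather than by p/c; the helper below is hypothetical, not from the paper.

    import math

    def shift_steps_per_processor(n_teams, c, domain_width, cutoff):
        """Teams own equal-width slabs; only slabs within the cutoff must be visited.
        Falls back to the full sweep when the cutoff covers the whole domain."""
        slab_width = domain_width / n_teams
        teams_in_range = min(n_teams, 2 * math.ceil(cutoff / slab_width) + 1)
        # The c layers of a team split the visits, as in the no-cutoff algorithm.
        return math.ceil(teams_in_range / c)

    # e.g. 128 teams, c = 8, unit domain, cutoff of 0.05:
    print(shift_steps_per_processor(128, 8, 1.0, 0.05))   # 2 visits instead of 16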
1D Simulation on Intrepid
262K particles, 32K cores
[Chart: Execution Time vs. Replication Factor for the 1D cutoff simulation. Bars show execution time per timestep (sec), split into computation, shift, reduce, and re-assign communication, for c = 1 through 128. Communication time drops by 84.6%. Down is good.]
2D Simulation on Hopper
196K particles, 24K cores
[Chart: Execution Time vs. Replication Factor for the 2D cutoff simulation. Bars show execution time per timestep (sec), split into computation, shift, reduce, and re-assign communication, for c = 1 through 64. Communication time drops by 74.8%. Down is good.]
Strong Scaling on Hopper
2D space, 24K cores, 196K particles
[Chart: Parallel efficiency (n = 196,608), relative to one core, versus machine size from 96 to 24,576 cores, for c = 1, 4, 16, 64 and the ideal line. Larger replication factors maintain good strong scaling. Up is good.]
Conclusions
• By using c times more memory, we reduce:
  – words sent along the critical path by a factor of c
  – messages sent along the critical path by a factor of c²
• Theory: maximize c.
• Practice: tune for the best c.
  – We saw up to a 99.5% reduction in communication (11.8× speedup).
• Applications beyond direct n-body:
– collision detection algorithms
– database joins
– bottom solvers in hierarchical n-body codes