A Communication-Optimal N-Body Algorithm for Direct Interactions
Michael Driscoll, Evangelos Georganas, Penporn Koanantakool, Edgar Solomonik, Katherine Yelick*
UC Berkeley, *Lawrence Berkeley National Laboratory

Overview
• Intro to the N-Body problem.
• Communication bounds.
• Communication-optimal algorithm.
• Performance results.
• Conclusion.

Direct N-Body
• n particles: molecules, galaxies, database tuples, etc.
• O(n²) interactions:
    for i = 1 to n:
      for j = 1 to n:
        force[i] += interact( particles[i], particles[j] )
• p processors.

Communication Model
• Communication cost along the critical path.
• Alpha-beta model: T_comm = α·S + β·W, where α = latency, S = # messages, β = 1/bandwidth, W = # words.
• Can we find lower bounds on S or W?
• Do current algorithms meet those bounds?
• If not, can we find ones that do? Or better bounds?

Communication Lower Bounds
From Minimizing Communication in Numerical Linear Algebra [Ballard et al. 2011]:
• F = # flops, M = size of fast memory, H = max flops per M words, S = # messages, W = # words.
• Bandwidth bound: W = Ω(M·F/H); latency bound: S = Ω(F/H).
Generalized in: Communication Lower Bounds and Optimal Algorithms for Programs That Reference Arrays [Christ et al. 2013].

Lower Bounds for N-Body
• Flops: F = n²/p per processor.
• Memory: M words of fast memory.
• Max flops per M words: H = Θ(M²).
• Plug into the latency and bandwidth lower bounds:
    W = Ω(n² / (p·M)),  S = Ω(n² / (p·M²)).
• Do current algorithms meet these bounds?

A Naïve N-Body Algorithm
[Figure: particles distributed across Proc. 0 … Proc. P; a replica block circulates around the ring while partial forces accumulate.]
• For p steps, send n/p particles.
• # messages: S = O(p); # words: W = O(n).
• Recall the bounds with M = n/p: W = Ω(n) ✔ and S = Ω(p) ✔.

The naïve algorithm is optimal…
• Recall the lower bounds: W = Ω(n²/(p·M)), S = Ω(n²/(p·M²)).
• Notice M in the denominator.
• Increase M => decrease communication.
• Realize a "lower" lower bound.

Communication-Optimal N-Body
[Figure: processors arranged as p/c teams × c layers (Team 0 … Team p/c); each team's particle block is replicated across its c layers.]
• Replication factor: c copies of each particle (see the sketch below).
• Communication cost (broadcast + shifts + reduction):
    – total per processor: O(p/c²) messages, O(n/c) words
    – reduce # messages by c², reduce # words by c.
• c = √p recovers force decomposition [Plimpton 1995].
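The replication scheme can be sketched in a few dozen lines of MPI. The following is a minimal illustration using mpi4py and NumPy, not the authors' implementation: the team/layer split, the sendrecv-based ring shifts, the hard-coded replication factor, and the interact() kernel are all illustrative choices, and it assumes c² divides p and every rank holds the same number of particles.

    import numpy as np
    from mpi4py import MPI

    def interact(targets, sources, eps=1e-6):
        # Toy softened inverse-square repulsion of `sources` on `targets`.
        d = targets[:, None, :] - sources[None, :, :]   # pairwise displacements
        r2 = (d * d).sum(axis=-1) + eps                 # softened squared distances
        return (d / r2[..., None] ** 1.5).sum(axis=1)   # total force on each target

    comm = MPI.COMM_WORLD
    p, rank = comm.Get_size(), comm.Get_rank()
    c = 2                              # replication factor (illustrative choice)
    t = p // c                         # number of teams, p/c
    assert p % c == 0 and t % c == 0, "sketch assumes c divides p and c**2 divides p"

    team, layer = rank // c, rank % c
    team_comm = comm.Split(color=team, key=layer)   # the c replicas of one team
    ring_comm = comm.Split(color=layer, key=team)   # ring of p/c teams at this layer

    n_local = 4                                     # n/p particles per rank (toy size)
    mine = np.random.rand(n_local, 3)               # this rank's own particles

    # 1) Replicate: every layer gathers the team's full block of n*c/p particles.
    block = np.concatenate(team_comm.allgather(mine))
    forces = np.zeros_like(block)

    # 2) Shift: layer l starts at source block (team + l*t/c) and visits t/c blocks.
    steps = t // c
    traveler = block.copy()
    offset = layer * steps
    if offset:
        traveler = ring_comm.sendrecv(traveler, dest=(team - offset) % t,
                                      source=(team + offset) % t)
    for s in range(steps):
        forces += interact(block, traveler)
        if s < steps - 1:              # pass the traveling copy one team down the ring
            traveler = ring_comm.sendrecv(traveler, dest=(team - 1) % t,
                                          source=(team + 1) % t)

    # 3) Reduce: sum the c layers' partial forces, then keep my own n/p slice.
    total = team_comm.allreduce(forces, op=MPI.SUM)
    my_forces = total[layer * n_local:(layer + 1) * n_local]

Each layer walks a disjoint stretch of p/c² source blocks, so summing the per-layer partial forces within the team visits every source block exactly once, while each rank sends only O(n/c) words in O(p/c²) messages during the shift phase.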
Experiments
• Developed a particle code:
    – flat MPI
    – 52-byte particles
    – repulsive force that drops off with the square of the distance
    – reflective boundary conditions.
• Platforms:
    – Hopper: Cray XE-6 at NERSC, 24 cores/node
    – Intrepid: IBM BlueGene/P at ALCF, 4 cores/node
    – both have a 3D torus interconnect.

Performance on Hopper (24K particles, 6K cores)
[Figure: execution time per timestep (sec) vs. replication factor c = 1…32, split into computation, shift communication, and reduce communication; down is good. 95.6% reduction in communication.]

Performance on Intrepid (262K particles, 32K cores)
[Figure: execution time per timestep (sec) vs. replication factor, for c = 1 (tree), c = 1 (no-tree), and c = 2…128, split into computation, shift communication, and reduce communication; down is good. 99.3% reduction in communication.]

Strong Scaling on Intrepid (262K particles)
[Figure: parallel efficiency on Intrepid (n = 262144), relative efficiency vs. one core, against machine size from 2048 to 32768 cores for c = 1…64 and the ideal curve; up is good. Perfect strong scaling, with a 4.5x speedup over c = 1.]

CA N-Body with Cutoff Distance
• No interactions beyond a cutoff radius r.
• Assuming:
    – uniform particle distribution
    – spatial processor decomposition.
• Simple extension to support a cutoff:
    – still communication-optimal
    – works in any number of spatial dimensions
    – speedups from 1D and 2D experiments.

N-Body with Cutoff
[Figure: p/c teams × c layers as before; only the particle blocks within the cutoff diameter of a team's block need to be visited.]
• Shifts occur modulo the cutoff distance.
• Optimality holds:
    – same counting argument
    – see the paper for details.

1D Simulation on Intrepid (262K particles, 32K cores)
[Figure: execution time per timestep (sec) vs. replication factor c = 1…128, split into computation, shift, reduce, and re-assign communication; down is good. 84.6% reduction in communication.]

2D Simulation on Hopper (196K particles, 24K cores)
[Figure: execution time per timestep (sec) vs. replication factor c = 1…64, split into computation, shift, reduce, and re-assign communication; down is good. 74.8% reduction in communication.]

Strong Scaling on Hopper (2D space, 24K cores, 196K particles)
[Figure: parallel efficiency (n = 196608), relative efficiency vs. one core, against machine size from 96 to 24576 cores for c = 1, 4, 16, 64 and the ideal curve; up is good. Good strong scaling.]

Conclusions
• By using c times more memory, we reduce:
    – words sent along the critical path by a factor of c
    – messages sent along the critical path by a factor of c².
• Theory: maximize c.
• Practice: tune for the best c.
    – Saw a 99.5% reduction in communication (11.8x speedup).
• Applications beyond direct n-body:
    – collision detection algorithms
    – database joins
    – bottom solvers in hierarchical n-body codes.
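As a closing illustration of those factors, here is a quick back-of-the-envelope check of the per-processor cost model: the naïve ring algorithm moves about n words in p messages, while c-fold replication moves about n/c words in p/c² messages. This is a toy calculation only; constants and the lower-order broadcast/reduction terms are ignored, and the problem size is reused purely for illustration.

    # Back-of-the-envelope check of the asymptotic per-processor communication costs.
    # Constants and the lower-order broadcast/reduction terms (O(c*n/p) words) are
    # ignored; the point is the factor-of-c and factor-of-c**2 ratios claimed above.
    def comm_costs(n, p, c):
        naive = {"words": n, "messages": p}              # ring: send n/p particles p times
        repl = {"words": n // c, "messages": p // c**2}  # shift c*n/p words, p/c**2 times
        return naive, repl

    n, p = 262_144, 32_768   # the Intrepid problem size, used here only for illustration
    for c in (1, 2, 4, 8, 16):
        naive, repl = comm_costs(n, p, c)
        print(f"c={c:2d}: words {naive['words']:>7} -> {repl['words']:>6} ({c}x), "
              f"messages {naive['messages']:>6} -> {repl['messages']:>4} ({c * c}x)")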