Practical Performance Modelling of High Performance Scientific Codes

Robert Bird & David Beckingsale
Performance Computing and Visualisation, University of Warwick, Coventry

Predicting the execution times of scientific applications

Why?

[Figure: Money (vertical axis) against Time (horizontal axis), with "Future" marked]

$T_{total} = T_{comp} + T_{comms}$

$T_{comp}$

[Figure: an nx × ny grid, labelled Wg]

$T_{comp} = W_g \times \text{problem size}$

$T_{comms}$

$T_{comms}$ = time to send all messages

[Figure: Time (s) against Message Size (bytes); a straight-line fit with gradient 1/bandwidth and intercept latency]

$y = mx + c$
$T_{msg} = (1/bw) \times msg\_size + latency$
$T_{comms} = \sum_{i \in messages} T_i$
$T_{total} = T_{comp} + T_{comms}$

Plasma Physics

Lare2D

Grids

[Figure: an nx × ny grid]

Decomposition

[Figure: the nx × ny grid divided between processors]

Strong vs Weak Scaling

[Figure: strong and weak scaling panels; nx × ny grids assigned to processors P1 to P4]

Main Loop

    DO
      ...
      CALL lagrangian_step
      CALL eulerian_remap
      ...
    END DO

$T_{comp} = \sum_{i=0}^{iterations} (W_g \times nx \times ny)$

Grind Time             Value
Predictor Corrector    2.63E-07
Viscosity and B        1.84E-07
Lagran Remap           2.20E-08
Remap X                5.41E-07
Remap Y                5.09E-07
Remap Z                2.10E-08
Remap Remainder        1.43E-07

$T_{comms} = \sum MPI\_comms$
$T_{comms} = \sum t(send\_recv) + \sum t(allreduce)$

[Figure: the nx × ny grid decomposed across P1 to P4]

All Reduce

[Figure: animation of an all-reduce exchange between four processes; annotated "Log2 n"]

Can we improve on this?

Simulation

SST/Macro

Structural Simulation Toolkit

Simulated Topologies

[Figure: four nodes, numbered 1 to 4, connected in different topologies]

Original Fortran kernel:

    SUBROUTINE remap_x
      ...
      DO iy = -1, ny+2
        iym = iy - 1
        DO ix = -1, nx+2
          ixm = ix - 1
          ...
        END DO
      END DO
      ...
    END SUBROUTINE remap_x

SST/Macro skeleton of the same kernel:

    void remap_x(int rank) {
      ...
      sstmac::timestamp t(remap_x_w * nx * ny);
      compute(t);
      ...
    }

Original Fortran boundary exchange:

    SUBROUTINE dm_x_bcs
      ...
      CALL MPI_SENDRECV(dm(nx-1, 0:ny+1), ny+2, mpireal, &
           proc_x_max, tag, dm(-1, 0:ny+1), ny+2, mpireal, &
           proc_x_min, tag, comm, status, errcode)
      ...
    END SUBROUTINE dm_x_bcs

SST/Macro skeleton of the same exchange:

    void dm_x_bcs(int rank) {
      ...
      mpi->sendrecv(ny + 2, sstmac::sw::mpitype::mpi_real,
                    proc_x_max, tag, ny + 2, sstmac::sw::mpitype::mpi_real,
                    proc_x_min, tag, world(), stat);
      ...
    }

Validation

Machines

                   Sierra             Minerva
Processor          Intel Xeon 5660    Intel Xeon 5650
Processor Speed    2.8 GHz            2.66 GHz
Cores/Node         12                 12
Nodes              1849               258
Memory/Node        24 GB              24 GB
Compilers          Intel 12.0         Intel 12.0
MPI                MVAPICH2 1.7       OpenMPI 1.4.3

Results

[Figure: Weak Scaling; Time against Cores (12, 48, 108, 192, 252, 432); series: Actual, Predicted]
[Figure: Weak Scaling with the time axis zoomed to 520 to 580; same cores and series]
[Figure: Strong Scaling; Time against Cores (48, 96, 144, 192, 288); series: Predicted, Actual]

Strong Scaling
Accuracy: >90%
Average Error: 4.97%

Weak Scaling
Accuracy: >95%
Average Error: 3.05%

Future Optimisations

Values in %. Columns: Relative Cost. Rows: Frequency.

Frequency    1         0.5       0.25      0.2       0.1      0.01
1            0.00      32.15     48.22     51.44     57.87    64.24
2            -64.30    0.00      32.15     38.58     51.44    64.17
4            -192.9    -64.30    0.00      12.86     38.58    64.04
5            -257.2    -96.45    -16.07    0.00      32.15    63.98
10           -578.7    -257.2    -96.45    -64.30    0.00     63.66

Optimisation of Patch Distribution Strategies for AMR Applications

Adaptive Mesh Refinement

[Figure: an nx × ny patch, labelled Wg]

$(nx \times ny) \times w_g$
$l(s) + s / bw(s)$
$t_{q,p}$
$\min_{p \in procs} (t_p + t_{q,p})$
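The patch-placement cost above can be read as "send each patch to the processor where compute time plus transfer time is smallest". The following is a minimal sketch of that reading, not the authors' implementation. It assumes that t_p is the work already queued on processor p plus the patch's own (nx × ny) × wg cost, that t_{q,p} is the latency-plus-size-over-bandwidth cost of moving the patch from its current owner q to p (zero when p is q), and that latency and bandwidth are constants rather than the size-dependent l(s) and bw(s) on the slide. The names `Patch`, `transfer_time` and `choose_processor` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical description of one AMR patch.
struct Patch {
    int nx;             // cells in x
    int ny;             // cells in y
    double bytes;       // size s of the patch data if it has to be moved
    std::size_t owner;  // processor q that currently holds the patch
};

// Transfer cost l(s) + s / bw(s), simplified here to a constant latency
// and a constant bandwidth (an assumption; the slide allows both to vary with s).
double transfer_time(double bytes, double latency, double bandwidth) {
    return latency + bytes / bandwidth;
}

// Return the processor p that minimises t_p + t_{q,p}.
// load[p] is the compute time already assigned to processor p (assumed input).
std::size_t choose_processor(const Patch& patch,
                             const std::vector<double>& load,
                             double wg, double latency, double bandwidth) {
    assert(!load.empty());
    const double patch_cost =
        static_cast<double>(patch.nx) * static_cast<double>(patch.ny) * wg;
    std::size_t best = 0;
    double best_cost = std::numeric_limits<double>::infinity();
    for (std::size_t p = 0; p < load.size(); ++p) {
        const double t_p = load[p] + patch_cost;
        // Keeping the patch where it already lives costs no communication.
        const double t_qp = (p == patch.owner)
            ? 0.0
            : transfer_time(patch.bytes, latency, bandwidth);
        const double cost = t_p + t_qp;
        if (cost < best_cost) {  // strict '<' keeps the lowest index on ties
            best_cost = cost;
            best = p;
        }
    }
    return best;
}
```

With two processors, a heavily loaded owner hands the patch to its idle neighbour, while two equally idle processors leave the patch where it is because moving it only adds transfer time.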
[Figure: Average time (s) against Cores (1 to 256); series: Original, Local, Optimised]
[Figure: Average time (s) against Cores (16 to 256); series: Original, Local, Optimised]

Building a Model

1. Benchmark Wg Times
2. Benchmark network
3. Find regions
4. Do the regression
5. Identify the comms
6. Combine

Take Away Points

•Ahead of time prediction
•Aids in day to day, procurement, and time management
•Lets you see into the future
•Could help you?

Future Work
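The six steps under "Building a Model" can be sketched in a few functions. This is an illustrative outline under stated assumptions, not the code used for the results in this talk. It assumes a single network region (so one straight-line fit covers all message sizes), that the grind times Wg have already been benchmarked per kernel, and that every iteration sends the same list of messages. The names `Sample`, `Network`, `fit_network`, `message_time`, `compute_time` and `total_time` are hypothetical.

```cpp
#include <cassert>
#include <vector>

// One network benchmark sample: message size and measured time.
struct Sample {
    double bytes;
    double seconds;
};

// Parameters recovered from the fit T_msg = (1/bw) * msg_size + latency.
struct Network {
    double latency;    // intercept c, in seconds
    double bandwidth;  // 1 / gradient m, in bytes per second
};

// Step 4: ordinary least-squares fit of y = m*x + c to the samples.
Network fit_network(const std::vector<Sample>& samples) {
    assert(samples.size() >= 2);
    const double n = static_cast<double>(samples.size());
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (const Sample& s : samples) {
        sx += s.bytes;
        sy += s.seconds;
        sxx += s.bytes * s.bytes;
        sxy += s.bytes * s.seconds;
    }
    const double denom = n * sxx - sx * sx;
    assert(denom != 0.0);  // needs at least two distinct message sizes
    const double slope = (n * sxy - sx * sy) / denom;
    const double intercept = (sy - slope * sx) / n;
    return Network{intercept, 1.0 / slope};
}

// T_msg for one message of the given size.
double message_time(const Network& net, double bytes) {
    return bytes / net.bandwidth + net.latency;
}

// T_comp: sum of per-kernel grind times, times cells, times iterations.
double compute_time(const std::vector<double>& grind_times,
                    int nx, int ny, int iterations) {
    double wg = 0.0;
    for (double w : grind_times) wg += w;
    return wg * static_cast<double>(nx) * static_cast<double>(ny) * iterations;
}

// Step 6: T_total = T_comp + T_comms, with T_comms summed over all messages.
double total_time(const std::vector<double>& grind_times,
                  int nx, int ny, int iterations,
                  const Network& net,
                  const std::vector<double>& message_bytes_per_iteration) {
    double comms = 0.0;
    for (double b : message_bytes_per_iteration) comms += message_time(net, b);
    return compute_time(grind_times, nx, ny, iterations) + iterations * comms;
}
```

A real model would repeat the fit once per region found in step 3 and would add a separate term for collectives such as all-reduce, which this sketch leaves out.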