Practical Performance Modelling of High Performance Scientific Codes

Robert Bird & David Beckingsale
Performance Computing and Visualisation,
University of Warwick, Coventry
Predicting the execution times of scientific applications
Why?
[Figure: money plotted against time, with the future marked]
Ttotal = Tcomp + Tcomms
Tcomp
[Figure: an nx by ny grid of cells, annotated with Wg]
Tcomp = Wg x problem size
Tcomms
Tcomms = Time to send all messages
[Figure: time (s) against message size (bytes); a straight-line fit with gradient 1/bandwidth and intercept latency]
y = mx + c
Tmsg = (1/bw) x msg_size + latency
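The straight-line fit above can be sketched as an ordinary least-squares regression. This is a minimal illustration, not the authors' tooling: the function names are hypothetical, and the sample timings are synthetic values generated from an assumed latency of 2 microseconds and an assumed bandwidth of 1 GB/s.

```python
def fit_message_model(sizes_bytes, times_s):
    """Least-squares fit of Tmsg = (1/bw) * msg_size + latency.

    Returns (bandwidth, latency): bandwidth is 1/gradient, latency the intercept.
    """
    n = len(sizes_bytes)
    mean_x = sum(sizes_bytes) / n
    mean_y = sum(times_s) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes_bytes)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(sizes_bytes, times_s))
    gradient = sxy / sxx                   # m in y = mx + c, i.e. 1/bw
    latency = mean_y - gradient * mean_x   # c in y = mx + c
    return 1.0 / gradient, latency


def message_time(msg_size, bandwidth, latency):
    """Predicted time for one message of msg_size bytes."""
    return msg_size / bandwidth + latency


# Synthetic benchmark points (assumed values, not measurements).
sizes = [1024, 4096, 16384, 65536]
times = [2e-6 + s / 1e9 for s in sizes]
bandwidth, latency = fit_message_model(sizes, times)
```

With real ping-pong timings in place of the synthetic points, the fitted pair can be passed to message_time to predict the cost of any message size.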
Tcomms = Σ_{i ∈ messages} T_i
Ttotal = Tcomp + Tcomms
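As a rough sketch of how the two terms combine, the snippet below evaluates Ttotal for one set of inputs. All numbers (Wg, grid size, message sizes, bandwidth, latency) are hypothetical placeholders chosen for the example, and the function names are not taken from the slides.

```python
def compute_time(wg, nx, ny):
    """Tcomp = Wg x problem size, with problem size taken as nx * ny cells."""
    return wg * nx * ny


def comms_time(message_sizes, bandwidth, latency):
    """Tcomms = sum over messages of (1/bw) * msg_size + latency."""
    return sum(size / bandwidth + latency for size in message_sizes)


def total_time(wg, nx, ny, message_sizes, bandwidth, latency):
    """Ttotal = Tcomp + Tcomms."""
    return compute_time(wg, nx, ny) + comms_time(message_sizes, bandwidth, latency)


# Hypothetical inputs: 1e-7 s per cell, a 1000 x 1000 grid,
# two 8000-byte messages, 1 GB/s bandwidth, 1 microsecond latency.
t_total = total_time(1e-7, 1000, 1000, [8000, 8000], 1e9, 1e-6)
```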
Plasma Physics
Lare2D
Grids
[Figure: an nx by ny grid]
Decomposition
[Figure: the nx by ny grid after decomposition]
Strong vs Weak Scaling
Strong
[Figure: nx by ny grids shared among one, two, three and then four processors, P1 to P4]
Weak
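The difference between the two scaling modes can be illustrated with a small sketch of how many cells each processor holds. It uses the standard definitions (strong: fixed global problem; weak: fixed problem per processor); the function name and the grid sizes are hypothetical.

```python
def cells_per_processor(nx, ny, procs, scaling):
    """Return (local_cells, global_cells) for an nx x ny base grid."""
    if scaling == "strong":
        # Strong scaling: the global grid is fixed and shared out,
        # so each processor's share shrinks as procs grows.
        return nx * ny / procs, nx * ny
    if scaling == "weak":
        # Weak scaling: every processor keeps its own nx x ny grid,
        # so the global problem grows with procs.
        return nx * ny, nx * ny * procs
    raise ValueError("scaling must be 'strong' or 'weak'")


strong = cells_per_processor(1000, 1000, 4, "strong")
weak = cells_per_processor(1000, 1000, 4, "weak")
```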
Main Loop
DO
  ...
  CALL lagrangian_step
  CALL eulerian_remap
  ...
END DO
Tcomp = Σ_{i=0}^{iterations} (Wg x nx x ny)
Grind Time            Value
Predictor Corrector   2.63E-07
Viscosity and B       1.84E-07
Lagran Remap          2.20E-08
Remap X               5.41E-07
Remap Y               5.09E-07
Remap Z               2.10E-08
Remap Remainder       1.43E-07
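A minimal sketch of how the grind times feed into Tcomp follows. It assumes the seven values pair with the seven kernel names in the order listed, that they are seconds per cell, and that every kernel runs once per iteration over the whole local grid; the grid size and iteration count are hypothetical.

```python
# Grind times from the slide, paired with names in listed order (assumed pairing).
GRIND_TIMES = {
    "Predictor Corrector": 2.63e-07,
    "Viscosity and B": 1.84e-07,
    "Lagran Remap": 2.20e-08,
    "Remap X": 5.41e-07,
    "Remap Y": 5.09e-07,
    "Remap Z": 2.10e-08,
    "Remap Remainder": 1.43e-07,
}


def compute_time(grind_times, nx, ny, iterations):
    """Tcomp = sum over iterations of (Wg x nx x ny), with Wg summed over kernels."""
    wg_per_cell = sum(grind_times.values())
    return iterations * wg_per_cell * nx * ny


# Hypothetical run: a 1000 x 1000 local grid for 100 iterations.
tcomp = compute_time(GRIND_TIMES, 1000, 1000, 100)
```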
Tcomms = ΣMPI_comms
Tcomms = Σ t(send_recv) + Σ t(allreduce)
[Figure: the nx by ny grid decomposed across P1, P2, P3 and P4]
All Reduce
[Figure: an All Reduce over processors 1 to 4, built up step by step and annotated "2 log2 n x"]
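One way to read the "2 log2 n" annotation is as the number of message steps in the all-reduce. The sketch below takes that reading and assumes the missing multiplicand is the point-to-point message time (1/bw) x msg_size + latency; this is an interpretation, not something the slide states, and all input values are hypothetical.

```python
import math


def message_time(msg_size, bandwidth, latency):
    """Point-to-point time: (1/bw) * msg_size + latency."""
    return msg_size / bandwidth + latency


def allreduce_time(n_procs, msg_size, bandwidth, latency):
    """Assumed model: 2 * log2(n) message steps, each costing one message time.

    n_procs is taken to be a power of two so that log2(n) is a whole number.
    """
    steps = 2 * math.log2(n_procs)
    return steps * message_time(msg_size, bandwidth, latency)


# Hypothetical case: 4 processors reducing one 8-byte value,
# 1 GB/s bandwidth, 1 microsecond latency.
t_allreduce = allreduce_time(4, 8, 1e9, 1e-6)
```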
Can we improve on this?
Simulation
SST/Macro
Structural Simulation Toolkit
Simulated Topologies
[Figure: four nodes, 1 to 4, connected in several different simulated topologies]
SUBROUTINE remap_x
  ...
  DO iy = -1, ny+2
    iym = iy - 1
    DO ix = -1, nx+2
      ixm = ix - 1
      ...
    END DO
  END DO
  ...
END SUBROUTINE remap_x
void remap_x(int rank) {
  ...
  // Replace the loop nest with its modelled cost: per-cell weight x local grid size
  sstmac::timestamp t(remap_x_w * nx * ny);
  compute(t);
  ...
}
SUBROUTINE dm_x_bcs
  ...
  CALL MPI_SENDRECV(dm(nx-1, 0:ny+1), ny+2, mpireal, &
      proc_x_max, tag, dm(-1, 0:ny+1), ny+2, mpireal, &
      proc_x_min, tag, comm, status, errcode)
  ...
END SUBROUTINE dm_x_bcs
void dm_x_bcs(int rank) {
  ...
  mpi->sendrecv(ny + 2, sstmac::sw::mpitype::mpi_real,
                proc_x_max, tag, ny + 2, sstmac::sw::mpitype::mpi_real,
                proc_x_min, tag, world(), stat);
  ...
}
Validation
Machines
                  Sierra            Minerva
Processor         Intel Xeon 5660   Intel Xeon 5650
Processor Speed   2.8 GHz           2.66 GHz
Cores/Node        12                12
Nodes             1849              258
Memory/Node       24 GB             24 GB
Compilers         Intel 12.0        Intel 12.0
MPI               MVAPICH2 1.7      OpenMPI 1.4.3
Results
[Figure: Weak Scaling, actual and predicted time against cores (12 to 432), time axis from 0 to 580]
[Figure: Weak Scaling, the same series with the time axis narrowed to 520 to 580]
[Figure: Strong Scaling, predicted and actual time against cores (48 to 288), time axis from 0 to 580]
Strong Scaling
Accuracy: >90%
Average Error: 4.97%
Weak Scaling
Accuracy: >95%
Average Error: 3.05%
Future Optimisations
Rows: Frequency; columns: Relative Cost; entries in %
Frequency       1      0.5     0.25     0.2     0.1    0.01
1            0.00    32.15    48.22   51.44   57.87   64.24
2          -64.30     0.00    32.15   38.58   51.44   64.17
4          -192.9   -64.30     0.00   12.86   38.58   64.04
5          -257.2   -96.45   -16.07    0.00   32.15   63.98
10         -578.7   -257.2   -96.45  -64.30    0.00   63.66
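The entries in the first five columns appear consistent with a simple linear relation, 64.30 x (1 - frequency x relative cost). That relation is inferred from the numbers rather than stated on the slide, so the sketch below should be read as a guess at how the table was generated. The column headed 0.01 does not fit with a relative cost of 0.01 (its values match 0.001 instead), so it is not used here.

```python
# Inferred from the table: the entry at frequency 2, relative cost 1 is -64.30,
# which suggests a baseline share of 64.30%. This is an assumption.
BASE_SHARE = 64.30


def table_entry(frequency, relative_cost, base_share=BASE_SHARE):
    """Percentage entry for a given frequency and relative cost (inferred formula)."""
    return base_share * (1.0 - frequency * relative_cost)


# Reproduce a few cells of the table.
half_cost = table_entry(1, 0.5)     # tabulated as 32.15
ten_times = table_entry(10, 1)      # tabulated as -578.7
mixed = table_entry(4, 0.2)         # tabulated as 12.86
```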
Optimisation of Patch Distribution Strategies for AMR Applications
Adaptive Mesh Refinement
[Figure: an nx by ny patch, annotated with Wg]
(nx x ny) x wg
l(s) + s/bw(s)
t_{q,p}
min_{p ∈ procs} (t_p + t_{q,p})
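The minimisation above can be sketched as a search over candidate processors. The interpretation here is an assumption: t_p is taken to be the time already assigned to processor p, and t_{q,p} the cost of moving a patch of s bytes from its current owner q to p, modelled as l(s) + s/bw(s) with constant latency and bandwidth and zero cost when p is q. All names and values are hypothetical.

```python
def transfer_time(size_bytes, latency, bandwidth):
    """l(s) + s/bw(s), simplified to a constant latency and bandwidth."""
    return latency + size_bytes / bandwidth


def best_processor(loads, owner, patch_bytes, latency, bandwidth):
    """Return the processor p minimising t_p + t_{q,p} for a patch owned by q."""
    def cost(p):
        # Keeping the patch on its current owner needs no transfer (assumed).
        move = 0.0 if p == owner else transfer_time(patch_bytes, latency, bandwidth)
        return loads[p] + move
    return min(range(len(loads)), key=cost)


# Hypothetical case: a 1 MB patch on processor 0, 1 GB/s, 1 microsecond latency.
moved = best_processor([5.0, 1.0, 3.0], 0, 1_000_000, 1e-6, 1e9)
# When the load imbalance is smaller than the transfer cost, the patch stays put.
stayed = best_processor([1.0, 0.9995], 0, 1_000_000, 1e-6, 1e9)
```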
[Figure: average time (s) against cores (1 to 256) for the Original, Local and Optimised strategies]
[Figure: average time (s) against cores (16 to 256) for the Original, Local and Optimised strategies]
Building a Model
1. Benchmark Wg Times
2. Benchmark network
3. Find regions
4. Do the regression
5. Identify the comms
6. Combine
Take Away Points
• Ahead-of-time prediction
• Aids day-to-day work, procurement and time management
• Lets you see into the future
• Could help you?
Future Work