BG/L Application Tuning and Lessons Learned
Bob Walkup
IBM Watson Research Center
Performance Decision Tree

[Diagram] Total Performance branches into Computation, Communication, and I/O.
Tools along the branches: Xprofiler (routines/source), HPM (summary/blocks), and
compiler listings (source listing) for computation; MP_Profiler (summary/events)
for communication; the MIO library for I/O.
Timing Summary from MPI Wrappers

Data for MPI rank 0 of 32768:
Times and statistics from MPI_Init() to MPI_Finalize().
-----------------------------------------------------------------
MPI Routine                 #calls     avg. bytes      time(sec)
-----------------------------------------------------------------
MPI_Comm_size                    3            0.0          0.000
MPI_Comm_rank                    3            0.0          0.000
MPI_Sendrecv                  2816       112084.3         23.197
MPI_Bcast                        3           85.3          0.000
MPI_Gather                    1626          104.2          0.579
MPI_Reduce                      36          207.2          0.028
MPI_Allreduce                  679        76586.3         19.810
-----------------------------------------------------------------
total communication time = 43.614 seconds.
total elapsed time       = 302.099 seconds.
top of the heap address  = 84.832 MBytes.
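The summary above comes from wrapper routines layered on the MPI profiling (PMPI)
interface. A minimal sketch of one such wrapper is shown below; this is not the
libmpitrace source, and it assumes the MPI-3 const-qualified prototype for
MPI_Allreduce.

  /* Minimal PMPI wrapper sketch (not the libmpitrace source): intercept
   * MPI_Allreduce, accumulate call count and time, and forward the call
   * to the real routine through its PMPI_ entry point. */
  #include <mpi.h>

  static long   allreduce_calls = 0;
  static double allreduce_time  = 0.0;

  int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                    MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
  {
      double t0 = PMPI_Wtime();
      int rc = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
      allreduce_time  += PMPI_Wtime() - t0;
      allreduce_calls += 1;
      return rc;
  }

Linking such wrappers ahead of the MPI library is enough; the application itself
needs no changes.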
Compute-Bound => Use gprof/Xprofiler
Compile/link with -g -pg.
Optionally link with libmpitrace.a to limit profiler output: get profile data
for rank 0 and for the ranks with minimum, maximum, and median communication time.
Analysis is the same for serial and parallel codes.
gprof => subroutine-level profile.
Xprofiler => statement-level profile: clock ticks tied to source lines.
Gprof Example: GTC Flat Profile

Each sample counts as 0.01 seconds.
  %    cumulative    self              self     total
 time    seconds    seconds   calls   s/call   s/call   name
37.43     144.07     144.07     201     0.72     0.72   chargei
25.44     241.97      97.90     200     0.49     0.49   pushi
 6.12     265.53      23.56                             _xldintv
 4.85     284.19      18.66                             cos
 4.49     301.47      17.28                             sinl
 4.19     317.61      16.14     200     0.08     0.08   poisson
 3.79     332.18      14.57                             _pxldmod
 3.55     345.86      13.68                             _ieee754_exp
Time is concentrated in two routines and intrinsic functions.
Good prospects for tuning.
Statement-Level Profile: GTC pushi

Line   ticks   source
115            do m=1,mi
116      657     r=sqrt(2.0*zion(1,m))
117      136     rinv=1.0/r
118       34     ii=max(0,min(mpsi-1,int((r-a0)*delr)))
119       55     ip=max(1,min(mflux,1+int((r-a0)*d_inv)))
120      194     wp0=real(ii+1)-(r-a0)*delr
121       52     wp1=1.0-wp0
122      104     tem=wp0*temp(ii)+wp1*temp(ii+1)
123       86     q=q0+q1*r*ainv+q2*r*r*ainv*ainv
124      166     qinv=1.0/q
125       68     cost=cos(zion(2,m))
126       18     sint=sin(zion(2,m))
129      104     b=1.0/(1.0+r*cost)
Can pipeline expensive operations like sqrt, reciprocal, cos, sin, …
Requires either compiler option (-qhot=vector) or hand-tuning.
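As an illustration of the hand-tuning route (a hypothetical loop, not the GTC
source), one common pattern is to split the loop so that the expensive intrinsics
sit in their own simple loops over a small block, which the compiler or the
vector math library can then pipeline:

  /* Hand-tuning sketch: hoist the expensive operation into a simple loop
   * over a block so it can be pipelined/vectorized (e.g. with -qhot=vector),
   * then do the light-weight arithmetic in a second loop. */
  #include <math.h>

  #define BLK 64

  void example(int mi, const double *zeta, double *b_out, double r)
  {
      double c[BLK];

      for (int m0 = 0; m0 < mi; m0 += BLK) {
          int n = (mi - m0 < BLK) ? mi - m0 : BLK;

          /* Loop 1: only the expensive intrinsic. */
          for (int m = 0; m < n; m++)
              c[m] = cos(zeta[m0 + m]);

          /* Loop 2: cheap arithmetic using the precomputed values. */
          for (int m = 0; m < n; m++)
              b_out[m0 + m] = 1.0 / (1.0 + r * c[m]);
      }
  }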
Compiler Listing Example: GTC chargei

Source section:
  55 |! inner flux surface
  56 |  im=ii
  57 |  tdum=pi2_inv*(tflr-zetatmp*qtinv(im))+10.0
  58 |  tdum=(tdum-aint(tdum))*delt(im)
  59 |  j00=max(0,min(mtheta(im)-1,int(tdum)))
  60 |  jtion0(larmor,m)=igrid(im)+j00
  61 |  wtion0(larmor,m)=tdum-real(j00)

Register section:
  GPR's set/used: ssss ssss ssss s-ss ssss ssss ssss ssss
  FPR's set/used: ssss ssss ssss ssss ssss ssss ssss ssss
                  ssss ssss ssss ss-- ---- ---- ---s s--s

Assembler section:
  50| 000E04 stw      0  ST4A   #SPILL52(gr31,520)=gr4
  58| 000E08 bl       0  CALLN  fp1=_xldintv,0,fp1,…
  59| 000E0C mullw    2  M      gr3=gr19,gr15
  58| 000E10 rlwinm   1  SLL4   gr5=gr15,2
Issues: function call for aint(), register spills, pipelining, …
Get listing with source code: -qlist -qsource
GTC – Performance on Blue Gene/L
Original code: main loop time = 384 sec (512 nodes, coprocessor)
Tuned code : main loop time = 244 sec (512 nodes, coprocessor)
Factor of ~1.6 performance improvement by tuning.
Weak scaling, relative performance per processor:

#nodes          512     1024     2048     4096     8192    16384
coprocessor   1.000    1.002    0.985    1.002    1.009    0.968
virtual-node  0.974    0.961    0.963    0.956    0.935      NAN
BG/L Daxpy Performance

[Plot] Flops per cycle (0 to 1.2) vs. vector size in bytes (1.0E+02 to 1.0E+08)
for compiler-generated code: curves for 440d+alignx, 440d, and 440.
Daxpy: y(:) = a*x(:) + y(:), with compiler-generated code.
Getting Double-FPU Code Generation
Use library routines (blas, vector intrinsics, ffts,…)
Try compiler options : -O3 -qarch=440d (-qlist -qsource)
-O3 -qarch=440d -qhot=simd
Add alignment assertions:
Fortran: call alignx(16,array(1))
C: __alignx(16,array);
Try double-FPU intrinsics:
Fortran: loadfp(A), fpadd(a,b), fpmadd(c,b,a), …
C : __ldpf(address), __fpadd(a,b), __fpmadd(c,b,a)
Can write assembler code.
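For example, a daxpy written with the alignment assertions from the list above
might look like the following sketch (the routine and the fallback macro are
hypothetical; __alignx is the XL C assertion named on the slide):

  /* Sketch: daxpy arranged so the XL compiler, with -O3 -qarch=440d, can
   * generate paired (double-FPU) loads and stores.  The fallback macro just
   * keeps the file compilable with other compilers. */
  #include <stddef.h>

  #if !defined(__IBMC__) && !defined(__IBMCPP__)
  #define __alignx(align, addr) ((void)0)   /* no-op outside IBM XL */
  #endif

  void daxpy(size_t n, double a, const double *x, double *y)
  {
      __alignx(16, x);   /* promise 16-byte alignment of x */
      __alignx(16, y);   /* promise 16-byte alignment of y */
      for (size_t i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
  }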
16K file write test

[Plot] Time in seconds (0 to 350) for mkdir, open, and write operations when
creating 16K files, as a function of the number of directories used
(1, 8, 64, 512, 4096).
Optimizing Communication Performance
3D torus network => bandwidth degrades when traffic travels many hops and
shares links => keep communication local if possible.
Example: 2D domain on 1024 nodes (8x8x16 torus)
try 16x64 with BGLMPI_MAPPING=ZXYT
Example: QCD codes with logical 4D decomposition
try Px = torus x dimension (same for y, z)
Pt = 2 (for virtual-node-mode)
Layout optimizer : Record the communication matrix, then
minimize the cost function to obtain a good mapping.
Currently limited to modest torus dimensions.
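A sketch of such a cost function follows (the form is assumed, not the actual
layout optimizer; the data layout and coordinate arrays are hypothetical):

  /* Cost of a candidate mapping: sum over rank pairs of bytes sent times the
   * number of torus hops between the nodes they are mapped to. */
  #include <stdlib.h>

  /* Hop count between two nodes on an nx x ny x nz torus with wrap-around. */
  static int torus_hops(const int a[3], const int b[3], const int dim[3])
  {
      int hops = 0;
      for (int d = 0; d < 3; d++) {
          int diff = abs(a[d] - b[d]);
          hops += (diff < dim[d] - diff) ? diff : dim[d] - diff;
      }
      return hops;
  }

  /* comm[i*nranks + j] = bytes sent from rank i to rank j (e.g. recorded by
   * MPI wrappers); coords[r] = torus coordinates assigned to rank r. */
  double mapping_cost(int nranks, const double *comm,
                      int (*coords)[3], const int dim[3])
  {
      double cost = 0.0;
      for (int i = 0; i < nranks; i++)
          for (int j = 0; j < nranks; j++)
              cost += comm[i * nranks + j]
                      * torus_hops(coords[i], coords[j], dim);
      return cost;
  }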
Finding Communication Problems

[Figure] POP communication time on 1920 processors (40x48 decomposition).
Some Experience with Performance Tools
Follow the decision tree – don’t get buried in data.
Get details for just a subset of MPI ranks.
Use the parallel job for data analysis (min, max, median etc.).
For applications that repeat the same work:
sample or trace just a few repeat units.
Save cumulative performance data for all ranks in one file.
Some Lessons Learned
Text core files are great, as long as you get the call stack (need -g).
Use addr2line: it takes you from an instruction address to the source file and
line number. It is a standard GNU binutils tool; compile/link with -g.
Use the backtrace() library routine – standard GNU libc utility.
Can make wrappers for the exit() and abort() routines so that normal exits
provide the call stack (see the sketch after this list).
What do you do when >10**4 processes are hung?
Halt the cores, dump the stacks, make separate text core files, and use grep:
grep -L tells you which of the >10**4 core files was not stopped in an MPI
routine; grep combined with wc (word count) is also useful.
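A minimal sketch of the call-stack-on-exit idea, using the standard glibc
backtrace routines and atexit() rather than a true exit()/abort() wrapper
(the function names here are mine):

  /* Print a call stack when the program exits normally. */
  #include <execinfo.h>
  #include <stdlib.h>
  #include <unistd.h>

  static void print_call_stack(void)
  {
      void *addrs[64];
      int depth = backtrace(addrs, 64);
      /* One address/symbol per line on stderr; feed the addresses to
       * addr2line -e <executable> for file and line (compile with -g). */
      backtrace_symbols_fd(addrs, depth, STDERR_FILENO);
  }

  /* Call this early in main(); atexit() runs the handler on normal exit. */
  void install_exit_stack_trace(void)
  {
      atexit(print_call_stack);
  }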
Lesson: Flash Flood
If (task = 0)
    for (t = 1, …, P-1) recv data from task t
Else
    send data to task 0            => results in a flood at task 0

Add flow control … slow but safe:

If (task = 0)
    for (t = 1, …, P-1) { send a flag to task t; then recv data from task t }
Else
    { recv a flag from task 0; then send data to task 0 }
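An MPI sketch of the flow-controlled version (a hypothetical routine, not the
original code): rank 0 sends a go-ahead flag to one sender at a time, so at most
one message is in flight toward rank 0.

  #include <mpi.h>

  void gather_with_flow_control(double *mydata, double *alldata, int n,
                                MPI_Comm comm)
  {
      int rank, P, flag = 1;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &P);

      if (rank == 0) {
          for (int i = 0; i < n; i++) alldata[i] = mydata[i];
          for (int t = 1; t < P; t++) {
              MPI_Send(&flag, 1, MPI_INT, t, 0, comm);        /* go ahead  */
              MPI_Recv(alldata + (long)t * n, n, MPI_DOUBLE,  /* then recv */
                       t, 1, comm, MPI_STATUS_IGNORE);
          }
      } else {
          MPI_Recv(&flag, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
          MPI_Send(mydata, n, MPI_DOUBLE, 0, 1, comm);
      }
  }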
Lesson: P*P => Can’t Scale
integer table(P,P) requires 1 GB for P = 16K
Memory requirement limits scalability: example Metis
Can sometimes replace table(P,P) with local(P) and remote(P) plus
communication to get values stored elsewhere.
Some computational algorithms scale as P*P, which can limit scaling:
example - certain methods for automatic mesh refinement
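One way to realize the local(P)-plus-communication idea is MPI one-sided
communication; the following is a sketch under that assumption, not necessarily
the approach used by Metis or by the codes discussed here:

  /* Replace a P x P table with one row per rank (O(P) memory) plus MPI_Get
   * to fetch entries owned by other ranks. */
  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rank, P;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &P);

      /* Each rank owns row "rank" of the logical table. */
      int *local = malloc(P * sizeof(int));
      for (int c = 0; c < P; c++) local[c] = rank * P + c;   /* example data */

      MPI_Win win;
      MPI_Win_create(local, (MPI_Aint)P * sizeof(int), sizeof(int),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win);

      /* Read table(r, c) from the rank that owns row r. */
      int r = (rank + 1) % P, c = 0, value;
      MPI_Win_fence(0, win);
      MPI_Get(&value, 1, MPI_INT, r, c, 1, MPI_INT, win);
      MPI_Win_fence(0, win);

      MPI_Win_free(&win);
      free(local);
      MPI_Finalize();
      return 0;
  }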
More Processors = More Fun
Looking forward to the petaflop scale …