BG/L Application Tuning and Lessons Learned
Bob Walkup
IBM Watson Research Center

Performance Decision Tree
[Diagram: Total Performance branches into Computation, Communication, and I/O.
  Computation   -> Xprofiler (routines/source), HPM (summary/blocks), Compiler (source listing)
  Communication -> MP_Profiler (summary/events)
  I/O           -> MIO Library]

Timing Summary from MPI Wrappers
Data for MPI rank 0 of 32768: times and statistics from MPI_Init() to MPI_Finalize().
  MPI Routine       #calls   avg. bytes   time(sec)
  MPI_Comm_size          3          0.0       0.000
  MPI_Comm_rank          3          0.0       0.000
  MPI_Sendrecv        2816     112084.3      23.197
  MPI_Bcast              3         85.3       0.000
  MPI_Gather          1626        104.2       0.579
  MPI_Reduce            36        207.2       0.028
  MPI_Allreduce        679      76586.3      19.810
  total communication time = 43.614 seconds
  total elapsed time       = 302.099 seconds
  top of the heap address  = 84.832 MBytes

Compute-Bound => Use gprof/Xprofiler
- Compile/link with -g -pg.
- Optionally link with libmpitrace.a to limit profiler output: keep profile data only for rank 0 and for the ranks with the minimum, maximum, and median communication time.
- Analysis is the same for serial and parallel codes.
- gprof     => subroutine level.
- Xprofiler => statement level: clock ticks tied to source lines.

Gprof Example: GTC Flat Profile
Each sample counts as 0.01 seconds.
    %   cumulative    self             self    total
  time    seconds   seconds   calls   s/call   s/call   name
  37.43    144.07    144.07     201     0.72     0.72   chargei
  25.44    241.97     97.90     200     0.49     0.49   pushi
   6.12    265.53     23.56                             _xldintv
   4.85    284.19     18.66                             cos
   4.49    301.47     17.28                             sinl
   4.19    317.61     16.14     200     0.08     0.08   poisson
   3.79    332.18     14.57                             _pxldmod
   3.55    345.86     13.68                             _ieee754_exp
Time is concentrated in two routines and in intrinsic functions: good prospects for tuning.

Statement-Level Profile: GTC pushi
  Line   ticks   source
  115            do m=1,mi
  116      657     r=sqrt(2.0*zion(1,m))
  117      136     rinv=1.0/r
  118       34     ii=max(0,min(mpsi-1,int((r-a0)*delr)))
  119       55     ip=max(1,min(mflux,1+int((r-a0)*d_inv)))
  120      194     wp0=real(ii+1)-(r-a0)*delr
  121       52     wp1=1.0-wp0
  122      104     tem=wp0*temp(ii)+wp1*temp(ii+1)
  123       86     q=q0+q1*r*ainv+q2*r*r*ainv*ainv
  124      166     qinv=1.0/q
  125       68     cost=cos(zion(2,m))
  126       18     sint=sin(zion(2,m))
  129      104     b=1.0/(1.0+r*cost)
Expensive operations such as sqrt, reciprocal, cos, sin, ... can be pipelined.
This requires either a compiler option (-qhot=vector) or hand-tuning (see the strip-mining sketch after the GTC performance summary below).

Compiler Listing Example: GTC chargei
Source section:
  55 |! section: inner flux surface
  56 |  im=ii
  57 |  tdum=pi2_inv*(tflr-zetatmp*qtinv(im))+10.0
  58 |  tdum=(tdum-aint(tdum))*delt(im)
  59 |  j00=max(0,min(mtheta(im)-1,int(tdum)))
  60 |  jtion0(larmor,m)=igrid(im)+j00
  61 |  wtion0(larmor,m)=tdum-real(j00)
Register section:
  GPR's set/used: ssss ssss ssss s-ss ssss ssss ssss ssss
  FPR's set/used: ssss ssss ssss ssss ssss ssss ssss ssss ssss ssss ssss ss-- ---- ---- ---s s--s
Assembler section:
  50| 000E04  stw     0  ST4A   #SPILL52(gr31,520)=gr4
  58| 000E08  bl      0  CALLN  fp1=_xldintv,0,fp1,...
  59| 000E0C  mullw   2  M      gr3=gr19,gr15
  58| 000E10  rlwinm  1  SLL4   gr5=gr15,2
Issues: a function call for aint(), register spills, pipelining, ...
Get the listing with source code using -qlist -qsource.

GTC - Performance on Blue Gene/L
- Original code: main loop time = 384 sec (512 nodes, coprocessor mode)
- Tuned code:    main loop time = 244 sec (512 nodes, coprocessor mode)
- Factor of ~1.6 performance improvement from tuning.
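The statement-level profile shows most of the pushi time going into sqrt, reciprocal, cos, and sin, so a minimal hand-tuning sketch is shown below. It is not the actual GTC change: the routine name push_block, the block size nb, and the reduced zion declaration are illustrative assumptions. Isolating the expensive intrinsics in a simple countable loop gives -qhot=vector (or explicit calls to the vector MASS routines) a chance to pipeline them across a block of particles.

  ! Sketch only: strip-mine the particle loop so that the expensive math
  ! intrinsics are evaluated over a short block with no other work in between.
  ! zion is declared here with just the two components this sketch uses.
  subroutine push_block(mi, zion, r, cost, sint)
    implicit none
    integer, intent(in)  :: mi
    real,    intent(in)  :: zion(2,mi)
    real,    intent(out) :: r(mi), cost(mi), sint(mi)
    integer, parameter   :: nb = 64    ! block length (assumed; tune for the 440 core)
    integer :: mb, n, k

    do mb = 1, mi, nb
       n = min(nb, mi - mb + 1)
       ! simple loop with no calls or branches: a candidate for -qhot=vector
       do k = 0, n - 1
          r(mb+k)    = sqrt(2.0*zion(1,mb+k))
          cost(mb+k) = cos(zion(2,mb+k))
          sint(mb+k) = sin(zion(2,mb+k))
       end do
       ! the rest of the original pushi body would consume r, cost, sint here,
       ! block by block, keeping the working set in cache
    end do
  end subroutine push_block

With the math operations isolated this way, the compiler can substitute vector MASS equivalents for the scalar sqrt, cos, and sin calls, which is essentially what -qhot=vector tries to do automatically when it can prove the loop is safe.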
Weak Scaling
Relative performance per processor:
  #nodes         512     1024    2048    4096    8192    16384
  coprocessor    1.000   1.002   0.985   1.002   1.009   0.968
  virtual-node   0.974   0.961   0.963   0.956   0.935   NAN

BG/L Daxpy Performance
[Plot: daxpy, y(:) = a*x(:) + y(:), with compiler-generated code; flops per cycle (0 to 1.2) vs. bytes (1.0E+02 to 1.0E+08) for 440, 440d, and 440d+alignx code generation.]

Getting Double-FPU Code Generation
- Use library routines (BLAS, vector intrinsics, FFTs, ...).
- Try compiler options: -O3 -qarch=440d (-qlist -qsource) or -O3 -qarch=440d -qhot=simd.
- Add alignment assertions:
    Fortran: call alignx(16,array(1))
    C:       __alignx(16,array);
- Try double-FPU intrinsics:
    Fortran: loadfp(A), fpadd(a,b), fpmadd(c,b,a), ...
    C:       __lfpd(address), __fpadd(a,b), __fpmadd(c,b,a), ...
- Can write assembler code.

16K File Write Test
[Plot: time (sec), 0 to 350, for mkdir, open, and write vs. number of directories (1, 8, 64, 512, 4096).]

Optimizing Communication Performance
- 3D torus network => bandwidth is degraded when traffic travels many hops and shares links => keep communication local if possible.
- Example: 2D domain decomposition on 1024 nodes (8x8x16 torus): try a 16x64 layout with BGLMPI_MAPPING=ZXYT.
- Example: QCD codes with a logical 4D decomposition: try Px = torus x dimension (same for y and z), and Pt = 2 for virtual-node mode.
- Layout optimizer: record the communication matrix, then minimize a cost function to obtain a good mapping. Currently limited to modest torus dimensions.

Finding Communication Problems
[Plot: POP communication time on 1920 processors (40x48 decomposition).]

Some Experience with Performance Tools
- Follow the decision tree: don't get buried in data.
- Get details for just a subset of MPI ranks.
- Use the parallel job itself for data analysis (min, max, median, etc.).
- For applications that repeat the same work, sample or trace just a few repeat units.
- Save cumulative performance data for all ranks in one file.

Some Lessons Learned
- Text core files are great, as long as you get the call stack (need -g).
- Use addr2line: it takes you from an instruction address to the source file and line number. Standard GNU binutils; compile/link with -g.
- Use the backtrace() library routine, a standard GNU libc facility.
- Wrappers for exit() and abort() can make normal exits provide the call stack.
- What do you do when >10**4 processes are hung? Halt the cores, dump the stacks, make separate text core files, and use grep: grep -L tells you which of the >10**4 core files was not stopped in an MPI routine; grep plus wc (word count) is also useful.

Lesson: Flash Flood
  if (task == 0)
     for (t = 1, ..., P-1) { recv data from task t }
  else
     send data to task 0
=> results in a flood of messages at task 0.
Add flow control ... slow but safe (a Fortran MPI sketch of this pattern follows at the end of the section):
  if (task == 0)
     for (t = 1, ..., P-1) { send a flag to task t; then recv data from task t }
  else
     { recv a flag from task 0; then send data to task 0 }

Lesson: P*P => Can't Scale
- integer table(P,P) requires 1 GB for P = 16K.
- The memory requirement limits scalability: example - Metis.
- Can sometimes replace table(P,P) with local(P) and remote(P) plus communication to fetch values stored elsewhere.
- Some computational algorithms scale as P*P, which can limit scaling: example - certain methods for automatic mesh refinement.

More Processors = More Fun
Looking forward to the petaflop scale ...
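As referenced under "Lesson: Flash Flood", a minimal Fortran MPI sketch of the flow-controlled version is shown below. The routine name gather_with_flow_control, the message tags, and the buffer layout are assumptions for illustration, not details from the talk.

  ! Sketch of the flow-controlled exchange: each task owns a real array "data"
  ! of length n, and task 0 collects it one task at a time.
  subroutine gather_with_flow_control(data, n, comm)
    implicit none
    include 'mpif.h'
    integer, intent(in)    :: n, comm
    real,    intent(inout) :: data(n)
    integer :: rank, nproc, t, flag, ierr
    integer :: status(MPI_STATUS_SIZE)

    call MPI_Comm_rank(comm, rank, ierr)
    call MPI_Comm_size(comm, nproc, ierr)
    flag = 1
    if (rank == 0) then
       do t = 1, nproc - 1
          ! hand out one "go" flag at a time, then receive from that task only
          call MPI_Send(flag, 1, MPI_INTEGER, t, 100, comm, ierr)
          call MPI_Recv(data, n, MPI_REAL, t, 101, comm, status, ierr)
          ! ... process or write out the block received from task t here ...
       end do
    else
       ! wait for the flag before sending, so task 0 is never flooded
       call MPI_Recv(flag, 1, MPI_INTEGER, 0, 100, comm, status, ierr)
       call MPI_Send(data, n, MPI_REAL, 0, 101, comm, ierr)
    end if
  end subroutine gather_with_flow_control

Task 0 admits only one sender at a time, so its queue of unexpected messages stays bounded; the price is serialization of the receives, which is the "slow but safe" trade-off noted on the slide.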