PDC Summer School 2016
Performance Engineering: Lab Session

The following exercises are to be done with your own code, as applicable. If you do not have a code to work with, a sample toy application, a 2D heat equation solver, is provided via the course web page (start the lab session by reading the appendix of this document for details on how to compile and run it). The exercises do not have to be completed in consecutive order; feel free to pick the ones that interest you the most. Note that some of them are not doable or interesting for every application; in that case, simply skip the exercise.

1. Porting and running applications

Move all your files to Beskow. Adapt your makefiles or other compilation scripts to comply with the XC environment if necessary (the heat equation solver should work out of the box). Remember: the compilers are always referred to as ftn, cc and CC. Build your application. Prepare a batch job script and submit the job (a sketch of such a script is given after Exercise 2). Make sure the application runs correctly. Measure the strong scaling curve, and optionally a weak scaling curve, of your application using a representative input dataset (one that also matches the available core count).

2. Compiler optimization

Recompile your application with the following flags and record the wall-clock time (or other meaningful timing information, e.g. the time for one simulation step). Remember to verify the output. If working with the heat equation solver, modify the CCFLAGS (C version) or FCFLAGS (Fortran version) lines of the makefiles.

  Cray compiler flags | Time
  --------------------|------
  -O0                 |
  (no flags)          |
  -O3                 |
  -hfp3               |
                      |
                      |

The empty rows are reserved for experimenting with flag combinations of your own. If you would like to experiment with other compilers, the compiling environment on Beskow is changed by swapping the PrgEnv module (with "module swap", e.g. "module swap PrgEnv-cray PrgEnv-gnu"). When changing the compiler, note that the flags are always specific to the compiler, so you will need to remove the Cray-specific flags from the FCFLAGS/CCFLAGS entries. The heat equation solver may have issues with the libpng library when built with other compilers.
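For Exercises 1 and 2, the batch script can be kept very simple. The sketch below is a minimal example, assuming the standard SLURM plus aprun setup on Beskow; the job name, project id, node count, wall-clock time and executable name are placeholders that you need to adjust for your own runs.

  #!/bin/bash -l
  #SBATCH -J heat              # job name (placeholder)
  #SBATCH -A <project>         # your allocation/project id (placeholder)
  #SBATCH -t 00:10:00          # requested wall-clock time
  #SBATCH -N 2                 # number of nodes (32 cores per node on Beskow)

  # Launch 64 MPI ranks, 32 per node; vary -n and -N when measuring the scaling curve.
  aprun -n 64 -N 32 ./heat 2048 2048 100

Submit the script with "sbatch" and check its state with "squeue -u $USER".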
3. Performance analysis with CrayPAT

In the following we will carry out the combined sampling and tracing experiment known as the Automatic Profiling Analysis (APA) approach for your software. Refer back to the lectures. Do this first with a core count where the code still scales (see the scalability curve recorded in Exercise 1).

1. Load the performance tools:
   module load perftools-base
   module load perftools
2. Rebuild your application ("make clean && make" or alike).
3. Do "pat_build a.out" (a.out is the name of your binary). You should get a new binary with a "+pat" addition.
4. Run a.out+pat with the selected core count. You obtain a file with the .xf suffix, in addition to the normal output files.
5. Do "pat_report a.out+pat+(something).xf > samp.prof", replacing a.out+pat+... with the proper file name. This should produce the sampling report (the file samp.prof), a file with the .apa suffix and a file with the .ap2 suffix.
6. Read through the sampling profile to see where the time is being spent, etc. Any surprises?
7. Continuing with the APA experiment, let us select the most important parts for more detailed study. Have a look at the .apa file, but leave it unchanged for now. This file controls the tracing experiment; through it you can include more library groups (MPI by default) and user functions in the experiment.
8. Do "pat_build -O <name of the .apa file>". This should produce yet another binary, with a "+apa" addition.
9. Run that binary. You should get a new .xf file.
10. Apply pat_report again: "pat_report (...) > tracing.prof", where (...) is the name of the most recent .xf file. See "pat_report -O help" for the available reports.
11. Read through the file tracing.prof.
12. Have a look also at the CrayPAT GUI: do "app2 (name of the most recent .ap2 file)".

Once done with this, repeat these steps for the core count where the code does not scale anymore. Compare the profiles.

4. Single-core optimization

Gather data on the single-core performance of the application you are working with, at its scalability limit. Which computational routine takes the most time?

If working with the Cray compiler, recompile your code with the additional compiler flag "-h profile_generate" and redo the CrayPAT analysis (Ex. 3) to get more detailed information about the loops. Look at the compiler feedback (see the lectures for the needed flags). Does the compiler vectorize the loops in the most computationally intensive routines identified above?

Focusing on the loops in the observed hot-spot routines, try to restructure them for better performance, especially aiming at getting the compiler to vectorize them (a sketch of what such a loop might look like is given after this exercise). Talk to the instructors for hints! If working with the 2D heat equation solver, refer back to the lectures. Describe the modification and the observed timing below:

  Rewrite (routine, performed optimization) | Time
  ------------------------------------------|------
  Baseline                                  |
                                            |
                                            |
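To make the loop restructuring of Exercise 4 more concrete, here is a minimal sketch of what a vectorization-friendly 2D stencil update can look like in C. The routine name and all variable names (evolve, u, unew, nx, ny, a, dt, inv_dx2, inv_dy2) are hypothetical and do not refer to the actual code of the provided solver; the point is that the innermost loop runs over the contiguous index and contains no branches or function calls, which is what the compiler needs in order to vectorize it.

  /* Hypothetical hot-spot routine of a 2D heat solver (all names are placeholders).
     The inner loop runs over j, the contiguous index in C, and its body is plain
     arithmetic, so the compiler can vectorize it. */
  void evolve(int nx, int ny, double a, double dt,
              double inv_dx2, double inv_dy2,
              const double u[nx][ny], double unew[nx][ny])
  {
      for (int i = 1; i < nx - 1; i++) {
          for (int j = 1; j < ny - 1; j++) {
              unew[i][j] = u[i][j] + a * dt *
                  ((u[i-1][j] - 2.0 * u[i][j] + u[i+1][j]) * inv_dx2 +
                   (u[i][j-1] - 2.0 * u[i][j] + u[i][j+1]) * inv_dy2);
          }
      }
  }

The compiler feedback (loopmark listing) should then report the inner loop as vectorized; if it does not, check for aliasing, loop-carried dependencies or calls inside the loop body.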
5. Collecting hardware performance counters

By modifying the PAT_RT_PERFCTR field in the .apa file, rebuild the a.out+apa binary with PAT_RT_PERFCTR=0 and again with PAT_RT_PERFCTR=2, and run both versions. Review the resulting profile files (refer back to Ex. 3): in the top time-consuming routines, what are the L1 and L2 cache hit ratios?

6. Environment variables affecting MPI performance

Let us investigate the impact of the following system parameter tweaks (a sketch of a job script combining them is given after this exercise).

1. Skip this step if you are running with fewer than 128 MPI ranks (4 full nodes on Beskow). Rerun your application (on the two core counts, i.e. the one where the scalability is still sufficient and the one where it stops scaling) with a different rank placement, using the optimal placements suggested by CrayPAT (refer back to the CrayPAT runs of Exercise 3). Copy the CrayPAT-provided MPICH_RANK_ORDER.User_time file to the submit folder, renaming it to "MPICH_RANK_ORDER", and insert "export MPICH_RANK_REORDER_METHOD=3" into the job script.

     Core count | Time without rank reordering | Time with rank reordering
     -----------|------------------------------|--------------------------
                |                              |
                |                              |

2. Increase the eager limit by inserting "export MPICH_GNI_EAGER_MSG_SIZE=131072" into your batch job script and rerun the application with the two core counts under investigation.
3. Load the module "craype-hugepages8M", rebuild and rerun your application. Did you see any speed-up?
4. Look at the profile. If your application spends a lot of time in collective communication (skip this step if your code does not use collectives at all, which is the case for the 2D heat equation solver), try out the impact of using DMAPP collectives. Recompile your code with DMAPP linked in: modify your makefile to include either "-ldmapp" or "-Wl,--whole-archive,-ldmapp,--no-whole-archive" (for dynamic or static linking, respectively) in the LIBS (or LDFLAGS or equivalent) section. Then insert "export MPICH_USE_DMAPP_COLL=1" into your batch script and rerun.

  Technique                        | Core count =  | Core count =
  ---------------------------------|---------------|---------------
  (Baseline)                       |               |
  Making more messages eager       |               |
  Hugepages                        |               |
  DMAPP collectives                |               |
  Those that speed it up, combined |               |

After these, consider what kind of source code changes could be made to improve the scalability of the code. Discuss with the instructors for comments. Time permitting, you may even try to implement these changes.
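As a reference for Exercise 6, the run-time part of a job script that enables the tweaks above together could look roughly like the following sketch. The core count and binary name are placeholders, and remember that the hugepages module and the DMAPP linking are build-time changes that must be done before submitting.

  # Sketch only: run-time settings of Exercise 6 combined in the job script.
  # (craype-hugepages8M and -ldmapp are applied at build time, not here.)

  cp MPICH_RANK_ORDER.User_time MPICH_RANK_ORDER   # custom order suggested by CrayPAT
  export MPICH_RANK_REORDER_METHOD=3               # use the MPICH_RANK_ORDER file
  export MPICH_GNI_EAGER_MSG_SIZE=131072           # raise the eager limit
  export MPICH_USE_DMAPP_COLL=1                    # DMAPP collectives (only if linked in)

  aprun -n 128 ./a.out                             # adjust -n to the core count under study

Enable the settings one at a time first, as in the table above, before combining the ones that help.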
7. Removing I/O bottlenecks

Skip this exercise if you are working with the 2D heat equation solver.

Analyze the I/O of your application by reviewing the CrayPAT reports and/or the files produced (written) or referenced (read) by the application. Does the I/O take a major amount of the execution time, and is it a scalability bottleneck? Is the I/O scheme parallel, and is it asynchronous, or are the processes waiting for the I/O?

Within the time limits it will not be possible to completely rewrite the application I/O. Therefore, on top of the analysis above, see how much a mere reduction of the I/O (less frequent checkpointing, writing results less often, writing less to the output file or screen, etc.) improves the performance.

In case of large (> 1 GB) writes, try whether Lustre striping improves the write times. Apply "lfs setstripe -c 4 -S 4M" to the directory the output files will be written to (adjust your job script if needed). If this decreases the I/O time, try other stripe counts (-c from 2 to 16; the default count is 1) and stripe sizes (-S from 2M to 8M; the default is 1M). Also experiment with the impact of the IOBUF library:
o module load iobuf
o Rebuild your application.
o Insert "export IOBUF_PARAMETERS='*'" into the job script, rerun (with some core count) and compare the timings.

Appendix: Heat equation solver

The heat equation is a partial differential equation that describes the variation of temperature in a given region over time:

  ∂u/∂t = α (∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z²)

where u(x, y, z, t) represents the temperature variation over space at a given time, and α is a thermal diffusivity constant. We limit ourselves to two dimensions (a plane) and discretize the equation onto a grid. The Laplacian can then be expressed with finite differences as

  ∇²u(i,j) ≈ (u(i-1,j) - 2 u(i,j) + u(i+1,j)) / (∆x)² + (u(i,j-1) - 2 u(i,j) + u(i,j+1)) / (∆y)²

where ∆x and ∆y are the grid spacings of the temperature grid u(i,j). We can then study the development of the temperature grid with explicit time evolution over time steps ∆t:

  u^(m+1)(i,j) = u^m(i,j) + ∆t α ∇²u^m(i,j)

Parallelization of the program is fairly straightforward, since each point in the grid can be updated independently. On distributed memory machines, the boundaries of the domains need to be communicated between tasks before each step.

Solvers for the 2D equation, implemented in C and Fortran, are provided in the file heat.tar.gz. The solver evolves the equation on a grid of user-provided size and over a user-provided number of time steps. The default geometry is a flat rectangle with a disc in the middle, but the user can give other shapes as input files; a coke bottle is provided as an example. You can compile the program with the make command, adjusting the Makefile as needed. There are three versions available: a serial version ("make serial"), an MPI-parallelized version using MPI_Sendrecv operations ("make mpi"), and an MPI+OpenMP hybrid version ("make hyb").

Examples of how to run the binary (remember to prefix with "aprun" together with the proper -n, -d, -N etc. flags; see the sketch at the end of this appendix):

./heat
  No arguments: runs the program with the default grid size (2048x2048) and number of time steps (100).
./heat bottle.dat
  One argument: runs the program starting from the temperature grid provided in the file bottle.dat, for the default number of time steps (100).
./heat bottle.dat 1000
  Two arguments: runs the program starting from the temperature grid provided in the file bottle.dat, for 1000 time steps.
./heat 4096 4096 500
  Three arguments: runs the program for the default case, in a 4096x4096 grid, for 500 time steps.

The program will produce a .png image of the temperature field after a given number of iterations. You can change that interval via the parameter image_interval defined in heat_X_main.F90/.c (X = serial, mpi, or hyb).
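As an example of the aprun prefix mentioned above, the lines below sketch how the provided solver might be launched on Beskow. The core counts are arbitrary, and the binary name follows the examples above; the name actually produced by the Makefile may differ between the serial, MPI and hybrid versions.

  # Sketch: MPI version on two full Beskow nodes (32 cores each)
  aprun -n 64 -N 32 ./heat 4096 4096 500

  # Sketch: hybrid version with 8 ranks per node and 4 OpenMP threads per rank
  export OMP_NUM_THREADS=4
  aprun -n 16 -N 8 -d 4 ./heat bottle.dat 1000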