PDC Summer School 2016
Performance Engineering: Lab Session
The following exercises are to be done with your own code, as applicable. In case you do not have a code to
work with, you can find a sample toy application, a 2D heat equation solver, provided via the course web
page (start the lab session by reading the appendix of this document for details on how to compile and run it).
The exercises do not have to be completed in consecutive order; feel free to pick the ones that interest you
the most. Note that some of them may not be doable or interesting for your application – just skip the
exercise in that case.
1. Porting and running applications
Move all your files to Beskow. Adapt your makefiles or other compilation scripts to comply with the XC
environment if necessary (the heat equation solver should work out of the box). Remember: the compilers
are always invoked through the wrappers ftn, cc and CC. Build your application.
• Prepare a batch job script and submit the job. Make sure the application runs correctly. A sketch of a
  possible job script is given below.
• Measure the strong scaling curve and optionally a weak scaling curve of your application using a
  representative input dataset (matching also the available core count).
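For reference, a minimal sketch of a Beskow batch job script (SLURM with the aprun launcher); the job name,
allocation, wall time, node count and the solver arguments are placeholders to adapt, and 32 cores per node
is assumed:

  #!/bin/bash -l
  #SBATCH -J heat_lab            # job name
  #SBATCH -A <your-allocation>   # time allocation to charge
  #SBATCH -t 00:10:00            # wall-clock time limit
  #SBATCH -N 2                   # number of nodes
  #SBATCH --ntasks-per-node=32   # MPI ranks per node (assuming 32-core nodes)

  # aprun launches the parallel job: -n = total ranks, -N = ranks per node
  aprun -n 64 -N 32 ./heat 4096 4096 500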
2. Compiler optimization
Recompile your application with the following flags and record the wall-clock time (or other meaningful
timing information, e.g. the time for one simulation step). Remember to verify the output. If working with the
heat equation solver, modify the CCFLAGS (C version) or FCFLAGS (Fortran version) line of the Makefile.
  Cray compiler flags         Time
  -O0
  (no flags)
  -O3 -hfp3
  (your own combination)
  (your own combination)
The empty rows are reserved for experimenting with flag combinations of your own. If you would like to
experiment with other compilers, the compiling environment on Beskow is changed by swapping the PrgEnv
module (with “module swap”, e.g. “module swap PrgEnv-cray PrgEnv-gnu”, see the sketch below). When
changing the compiler, note that the flags are always compiler-specific, so you will need to remove the
Cray-specific flags from the FCFLAGS/CCFLAGS entries. The heat equation solver may have issues with
the libpng library when working with other compilers.
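As an illustration, switching to the GNU environment and rebuilding could look like the following; the GNU
flags shown are only an example of compiler-specific options to substitute for the Cray ones:

  # Swap the compiler suite; ftn/cc/CC now invoke the GNU compilers
  module swap PrgEnv-cray PrgEnv-gnu

  # Rebuild from scratch with GNU-specific flags (example only; use FCFLAGS for the Fortran version)
  make clean
  make CCFLAGS="-O3 -funroll-loops"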
3. Performance analysis with CrayPAT
In the following we will carry out the combined sampling and tracing experiment known as the Automatic
Profiling Analysis (APA) approach for your software; refer back to the lectures for details. Do this first with a
core count where the code still scales (see the scaling curve recorded in Exercise 1). A condensed command
summary is sketched at the end of this exercise.
1. module load perftools-base
   module load perftools
2. Rebuild your application (“make clean && make” or similar).
3. Do “pat_build a.out” [a.out is the name of your binary]. You should get a new binary with a
   “+pat” suffix.
4. Run a.out+pat with the selected core count. You will obtain a file with the .xf suffix, in addition to
   the normal output files.
5. Do “pat_report a.out+pat+(something).xf > samp.prof”, replacing a.out+pat+(something).xf with
   the proper filename. This should produce the sampling report (file samp.prof), a file with the .apa
   suffix and a file with the .ap2 suffix.
6. Read through the sampling profile (samp.prof) to see where the time is being spent, etc. Any surprises?
7. Continuing with the APA experiment, let us select the most important parts for more detailed
   study. Have a look at the .apa file, but leave it unchanged for now. This file controls the tracing
   experiment; through it you can include more library groups (MPI by default) and user functions in
   the experiment.
8. Do “pat_build -O <name of the .apa file>”. This should produce yet another binary with a
   “+apa” suffix.
9. Run that binary. You should get a new .xf file.
10. Apply pat_report again: “pat_report (...) > tracing.prof”, where (...) is the name of the
most recent .xf file. See “pat_report -O help” for available reports.
11. Read through the file tracing.prof.
12. Have a look also at the CrayPAT GUI: do “app2 (name of the most recent .ap2 file)”.
Once done with this, repeat these steps for the core count where the code does not scale anymore.
Compare the profiles.
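For reference, a condensed sketch of the whole APA workflow; the core count and the exact .xf/.apa/.ap2 file
names are illustrative, so use the names actually printed by the tools:

  module load perftools-base
  module load perftools
  make clean && make                            # rebuild with CrayPAT support

  pat_build a.out                               # instrument for sampling -> a.out+pat
  aprun -n 32 ./a.out+pat                       # run; produces a .xf file
  pat_report a.out+pat+<...>.xf > samp.prof     # sampling report, plus .apa and .ap2 files

  pat_build -O a.out+pat+<...>.apa              # instrument for tracing -> a.out+apa
  aprun -n 32 ./a.out+apa                       # run; produces another .xf file
  pat_report a.out+apa+<...>.xf > tracing.prof  # tracing report
  app2 a.out+apa+<...>.ap2                      # Cray Apprentice2 GUI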
4. Single-core optimization
Gather data on the single-core performance of the application you are working with, at its scalability limit.
• Which computational routine takes most time?
If working with the Cray compiler, recompile your code with the additional compiler flag “-h
profile_generate” and do the CrayPAT analysis (Ex. 3) again to get more detailed information about loops.
Also look at the compiler feedback (see the lectures for the needed flags).
• Does the compiler vectorize the loops in the most computationally intensive routines identified
  above?
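One possible way to get the compiler feedback with the Cray compiler is through its listing flags; the flags
and source file names below are assumptions, so check the lectures or the compiler man pages for the exact
options:

  # Annotated listings (.lst files) with loopmarks showing vectorization decisions
  cc  -hlist=a -O3 -c core.c            # C source (illustrative file name)
  ftn -rm -O3 -c core.F90               # Fortran source (illustrative file name)

  # Rebuild with loop profiling for a more detailed CrayPAT loop report
  make clean
  make CCFLAGS="-O3 -h profile_generate"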
Focusing on the loops of the observed hot-spot routines, try to restructure them for better performance,
in particular aiming at getting the compiler to vectorize the hotspot loops. Talk to the instructors for hints!
If working with the 2D heat equation solver, refer back to the lectures.
Describe the modification and the observed timing below:
  Rewrite (routine, performed optimization)        Time
  Baseline
5. Collecting hardware performance counters
By modifying the .apa file and the PAT_RT_PERFCTR field therein, rebuild the a.out+apa binaries with
PAT_RT_PERFCTR=0 and PAT_RT_PERFCTR=2 and run them both (a sketch is given below). By reviewing the
resulting profile files (refer back to Ex. 3), see:
• In the top time-consuming routines, what are the L1 and L2 cache hit ratios?
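A possible workflow; the exact form of the counter line in the .apa file may differ from this sketch, so look
for the PAT_RT_PERFCTR entry in your own file:

  # In the .apa file, set the hardware-counter group on the runtime-environment line, e.g.
  #   -Drtenv=PAT_RT_PERFCTR=2
  pat_build -O <name of the .apa file>        # rebuild the +apa binary
  aprun -n 32 ./a.out+apa                     # rerun with counters enabled
  pat_report <newest>.xf > counters2.prof     # the report now includes counter-derived metrics
  # Repeat with PAT_RT_PERFCTR=0 and compare the two reports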
6. Environment variables affecting MPI performance
Let us investigate the impact of the following system parameter tweaks. An example job script fragment
combining them is sketched after the table below.
1. Skip this if you are running with fewer than 128 MPI ranks (4 full nodes on Beskow). Rerun your
   application (on the two core counts, i.e. the one where the scalability is still sufficient and the one
   where it stops scaling) with a different rank placement, using the optimal placement suggested by
   CrayPAT (refer back to the CrayPAT runs of Exercise 3). Copy the CrayPAT-provided
   MPICH_RANK_ORDER.User_time file to the submission folder, rename it to
   “MPICH_RANK_ORDER” and insert “export MPICH_RANK_REORDER_METHOD=3” into the job
   script.
  Core count        Time without rank reordering        Time with rank reordering
2. Increase the eager limit by inserting “export MPICH_GNI_EAGER_MSG_SIZE=131072” into your
batch job script and rerun the application with the two core counts under investigation.
3. Load the module “craype-hugepages8M”, rebuild and rerun your application. Did you see any
speed-up?
4. See the profile. If your application spends a lot of time in collective communication (skip this
   step if your code does not use collectives at all, which is the case with the 2D heat equation
   solver), try out the impact of using DMAPP collectives:
   • Recompile your code with DMAPP linked in: modify your makefile to include either “-ldmapp”
     or “-Wl,--whole-archive,-ldmapp,--no-whole-archive” (for dynamic or static linking,
     respectively) in the LIBS (or LDFLAGS or equivalent) section.
   • Then insert “export MPICH_USE_DMAPP_COLL=1” into your batch script and rerun.
  Technique                             Core count =        Core count =
  (Baseline)
  Making more messages eager
  Hugepages
  DMAPP collectives
  Those that speed it up, combined
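As an illustration, the tweaks above might show up in the batch job script roughly as follows; combine only
the ones that actually helped, and note that the hugepages module must also be loaded when building:

  # Rank reordering: use the CrayPAT-suggested order (file MPICH_RANK_ORDER in the submit directory)
  export MPICH_RANK_REORDER_METHOD=3

  # Raise the eager message limit to 128 kB
  export MPICH_GNI_EAGER_MSG_SIZE=131072

  # DMAPP-optimized collectives (requires linking with -ldmapp)
  export MPICH_USE_DMAPP_COLL=1

  # Hugepages (the application must have been built with craype-hugepages8M loaded)
  module load craype-hugepages8M

  aprun -n 128 -N 32 ./a.out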
After these, consider what kind of source code changes could be made to improve the scalability. Discuss
with the instructors for comments. Time permitting, you may even try to implement these changes.
7. Removing I/O bottlenecks
Skip this exercise if you are working with the 2D heat equation solver. Analyze the I/O of your application
by reviewing the CrayPAT reports and/or the files produced (written) or referred to (read in) by the
application.
• Does the I/O take a major amount of execution time, and is it a scalability bottleneck?
• Is the I/O scheme parallel, and is it asynchronous, or are the processes waiting for the I/O?
Within the time limits it will not be possible to completely rewrite the application I/O. Therefore, on top of
the analysis above, try how much a mere reduction of I/O (reducing checkpointing, writing results less
often, writing less to the output file or screen, etc.) improves the performance.
• In case of large (> 1 GB) writes, try whether Lustre striping improves the write times. Apply “lfs
  setstripe -c 4 -S 4M” to the directory the output files will be written to (adjust your job script if
  needed). If this decreases the I/O time, try other stripe counts (-c from 2 to 16; the default count
  is 1) and stripe sizes (-S 2M...8M; the default is 1M). An example is sketched below.
• Also experiment with the impact of the IOBUF library:
  o module load iobuf
  o Rebuild your application
  o Insert “export IOBUF_PARAMETERS='*'” into the job script, rerun (with some core count)
    and compare timings.
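For example, the striping and IOBUF experiments could look like the following; the directory name, binary
name and rank count are illustrative:

  # Set striping on the output directory before the run and verify the layout
  mkdir -p output
  lfs setstripe -c 4 -S 4M output
  lfs getstripe output

  # IOBUF: after "module load iobuf" and a rebuild, enable buffering for all files in the job script
  export IOBUF_PARAMETERS='*'
  aprun -n 32 ./my_app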
Appendix: Heat equation solver
The heat equation is a partial differential equation that describes the variation of temperature in a given
region over time,
   ∂u/∂t = α ∇²u,
where u(x, y, z, t) represents the temperature variation over space at a given time, and α is a thermal
diffusivity constant.
We limit ourselves to two dimensions (a plane) and discretize the equation onto a grid. The Laplacian can
then be expressed with finite differences as
   ∇²u(i,j) ≈ (u(i-1,j) - 2u(i,j) + u(i+1,j)) / (∆x)² + (u(i,j-1) - 2u(i,j) + u(i,j+1)) / (∆y)²,
where ∆x and ∆y are the grid spacings of the temperature grid u(i,j). We can then study the development of
the temperature grid with explicit time evolution over time steps ∆t:
   u^(m+1)(i,j) = u^m(i,j) + α ∆t ∇²u^m(i,j)
Parallelization of the program is fairly straightforward, since each point in the grid can be updated
independently. On distributed memory machines, the boundaries of the domains need to be communicated
between neighbouring tasks before each step.
Solvers for the 2D equation implemented in C and Fortran are provided in the file heat.tar.gz. The solver
evolves the equation on a grid of user-provided size over a user-provided number of time steps.
The default geometry is a flat rectangle with a disc in the middle, but the user can give other shapes as input
files - a coke bottle is provided as an example.
You can compile the program with the make command, adjusting the Makefile as needed. There are three
versions available: a serial version (“make serial”), an MPI-parallelized version with MPI_Sendrecv
operations (“make mpi”) and an MPI+OpenMP hybrid version (“make hyb”).
Examples of how to run the binary (remember to prefix it with “aprun” together with the proper -n, -d, -N
etc. flags, see the sketch below):
• ./heat (no arguments - will run the program with the default grid size (2048x2048) and number of
  time steps (100))
• ./heat bottle.dat (one argument - will run the program starting from a temperature grid provided in
  the file bottle.dat for the default number of time steps, i.e. 100)
• ./heat bottle.dat 1000 (two arguments - will run the program starting from the temperature grid
  provided in the file bottle.dat for 1000 time steps)
• ./heat 4096 4096 500 (three arguments - will run the program for the default case, on a 4096x4096
  grid, for 500 time steps)
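On Beskow the corresponding aprun launches could look roughly like the following; the core and thread
counts are just examples, 32 cores per node are assumed, and the binary names depend on your build:

  # Pure MPI: 64 ranks in total, 32 ranks per node
  aprun -n 64 -N 32 ./heat 4096 4096 500

  # Hybrid MPI+OpenMP: 4 ranks with 8 threads each on one node
  export OMP_NUM_THREADS=8
  aprun -n 4 -N 4 -d 8 ./heat bottle.dat 1000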
The program will produce a .png image of the temperature field after a given number of iterations. You can
change this interval via the parameter image_interval defined in heat_X_main.F90/.c (X=serial, mpi, or hyb).