Multicore Exercises: Running Simple OpenMP Programs

David Henty

1 Introduction

The basic aim of this exercise is to familiarise you with compiling and running OpenMP parallel programs. Most importantly, this will enable you to determine the correct options and settings for your particular combination of machine, operating system, programming language and compiler. Running the programs will also give you information on how well your system copes with running real parallel programs, in terms of both CPU and memory usage. The first example is a simple image-processing program; the second is a standard benchmark for measuring memory bandwidth.

2 Image Processing

The program does a simple form of image processing to try and sharpen up a fuzzy picture. You will be able to measure the time taken by the code using different numbers of threads, to check that the execution time decreases with processor count as expected.

2.1 Accessing the code

If you are using ness you can copy a tar file containing the code directly to your own course account:

[user@ness ~]$ cp /home/s05/course00/multicore/sharpen.tar .

If you are using your own machine then you can download all the files from:

http://www.epcc.ed.ac.uk/~etg/multicore/

If you are using a Unix system or Cygwin then you should download sharpen.tar. If you have a Windows compiler then you should download sharpen.zip; you can use standard Windows tools to unpack the zip file. The tar file can be unpacked straightforwardly as follows:

[user@ness ~]$ tar xvf sharpen.tar
sharpen/C/
sharpen/C/sharpen.c
...

2.2 Compiling the code

Two versions are supplied, one in C and one in Fortran; you should work in the appropriate directory. A Makefile is supplied for compilation. For example, to compile the Fortran code using make on ness:

[user@ness ~]$ cd sharpen/F
[user@ness sharpen/F]$ make
pgf90 -g -mp -c sharpen.f90
pgf90 -g -mp -c dosharpen.f90
pgf90 -g -mp -c filter.f90
pgf90 -g -mp -c fio.f90
pgf90 -g -mp -o sharpen sharpen.o dosharpen.o filter.o fio.o

which creates the executable program sharpen (or sharpen.exe under Cygwin or Windows). If you are not using ness then you will need to edit the Makefile to use the correct compilers and flags. Options already exist in the Makefile for compiling under Cygwin – you just need to uncomment them. For other systems you will need to set these options by hand.

On Unix systems you can look at the input file using the display program:

[user@ness sharpen/F]$ display fuzzy.pgm

and you can close display by typing "q" anywhere in the window. For Windows or Cygwin you will need to convert the PGM files to a Windows format such as PNG, e.g. using the convert program:

convert fuzzy.pgm fuzzy.png

You should then be able to view the PNG files using standard Windows applications.

2.3 Running the Code

You can run the executable on your own system, or on the frontend of ness, in the standard way. All you need to do is set the number of threads, e.g. on a Unix system or under Cygwin:

[user@ness sharpen/F]$ export OMP_NUM_THREADS=2
[user@ness sharpen/F]$ ./sharpen
Image sharpening code running on 2 thread(s)
Input file is: fuzzy.pgm
Image size is: 564 x 770
...

You should note the timings reported by the code, and view the output file sharp.pgm to see the effect of the image sharpening.

If you are using ness then you will need to run on the backend system to get reliable timings and access to up to 16 cores (the frontend is shared between all users and is only a dual-core machine). To run the code on the backend you need to submit a script to the Sun Grid Engine (SGE) batch system. You are provided with a template script ompbatch.sge which is appropriate for submitting any code parallelised using OpenMP. All you need to do is make a copy of this file with the same name as the executable you want to run, e.g.:

[user@ness sharpen/F]$ cp ompbatch.sge sharpen.sge

To run the code on, say, two threads:

[user@ness sharpen/F]$ qsub -pe omp 2 ./sharpen.sge

When the code has completed it will produce a log file which contains, among other things, the standard output from the program. The log file will have a name of the form sharpen.sge.oXXXX, where XXXX is a unique job identifier assigned by the SGE batch system.
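Before looking at the performance in detail, it is worth seeing the basic pattern that makes the calculation phase parallel: the image I/O is left serial, while the loop over pixels is shared between threads by a single OpenMP directive and timed with the OpenMP wall-clock timer. The C sketch below is purely illustrative – the array names, the stencil and the structure of the supplied dosharpen and filter routines all differ in detail.

#include <stdio.h>
#include <omp.h>

/* illustrative image dimensions; the supplied fuzzy.pgm reports 564 x 770 */
#define NX 564
#define NY 770

double fuzzy[NX][NY], sharp[NX][NY];

int main(void)
{
    double tstart, tstop;
    int i, j;

    /* ... read fuzzy.pgm into fuzzy (serial I/O) ... */

    tstart = omp_get_wtime();

    /* every pixel can be computed independently, so the outer loop
       can be divided between threads with one parallel for directive */
    #pragma omp parallel for private(j)
    for (i = 1; i < NX-1; i++)
    {
        for (j = 1; j < NY-1; j++)
        {
            /* illustrative 5-point sharpening stencil only */
            sharp[i][j] = 5.0*fuzzy[i][j] - fuzzy[i-1][j] - fuzzy[i+1][j]
                                          - fuzzy[i][j-1] - fuzzy[i][j+1];
        }
    }

    tstop = omp_get_wtime();

    printf("Calculation time: %f seconds on %d thread(s)\n",
           tstop - tstart, omp_get_max_threads());

    /* ... write sharp to sharp.pgm (serial I/O) ... */

    return 0;
}

Because only the pixel loop is inside the timed, parallelised region, the calculation time should fall as threads are added while the I/O time should not – which is exactly what the next section asks you to measure.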
2.4 Parallel Performance

If you examine the output you will see that it contains two timings: the total time taken by the entire program (including IO) and the time taken solely by the calculation. The image input and output is not parallelised, so this is a serial overhead performed by a single core. The calculation part is, in theory, perfectly parallel (each thread operates on different parts of the image), so this should get faster on more processors.

You should do a number of runs and fill in Table 1: the IO time is the difference between the overall run time and the calculation time; the total CPU time is the calculation time multiplied by the number of threads. You should see performance improvements up to the number of cores on your system. However, it may be interesting to run with more threads than cores to see what effect this has on performance. Look at your results – do they make sense?

# Threads           1      2      4      7      10
Overall run time
Calculation time
IO time
Total CPU time

Table 1: Time taken by parallel image processing code

Given the structure of the code, you would expect the IO time to be roughly constant, and the performance of the calculation to increase linearly with the number of processors: this would give a roughly constant figure for the total CPU time. Is this what you observe?

Unless you are running via a batch system, you should be able to monitor the CPU usage while the program is running, e.g. using top or Task Manager. Does the CPU usage change as you would expect between running on a single thread and using multiple threads?

3 Memory Bandwidth

The streams benchmark is a simple way to measure the memory bandwidth of any computer. This is of particular relevance for multicore systems, where many cores share the same memory bus. For full information on streams, see http://www.cs.virginia.edu/stream/ref.html

The benchmark has four kernels: copy, scale, sum and triad:

name     kernel                 bytes/iter   FLOPS/iter
COPY     a(i) = b(i)            16           0
SCALE    a(i) = q*b(i)          16           1
SUM      a(i) = b(i) + c(i)     24           1
TRIAD    a(i) = b(i) + q*c(i)   24           2

On all modern systems the rate of execution is determined by the access to memory rather than by the peak FLOP rate (i.e. the clock rate). The size of the arrays, n, can be varied; to get sensible timings, each operation is performed multiple times, as specified by ntimes. If n is very large then the program will be accessing main memory; if it is small enough then the data may fit into cache, giving an increased bandwidth on the repeated iterations.

3.1 Accessing the code

The code is available from the same locations as the image processing code – copy stream.tar or stream.zip to your local account.

3.2 Compiling the code

A Makefile is supplied which works as-is on ness and has commented-out options for Cygwin; you may need to edit the settings appropriately for your local system. For simplicity, only the Fortran version of the code is supplied. Note that this has been slightly modified from the standard stream benchmark – see the README for details.
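Although the supplied source is Fortran, the four kernels are simple enough to sketch in C, which may make the bytes/iter and FLOPS/iter figures in the table above more concrete. Each array element is an 8-byte double, so COPY and SCALE move 16 bytes per iteration (one read and one write) while SUM and TRIAD move 24. The array names a, b, c and the scalar q follow the table; the value of n, the initialisation and the bandwidth calculation are illustrative rather than copied from the benchmark.

#include <stdio.h>
#include <omp.h>

#define N 1000000                /* plays the role of n in the Fortran code */

static double a[N], b[N], c[N];

int main(void)
{
    const double q = 3.0;
    double t0, t1;
    int i;

    /* give the source arrays some values to read */
    for (i = 0; i < N; i++)
    {
        b[i] = 1.0;
        c[i] = 2.0;
    }

    /* COPY: 16 bytes, 0 FLOPs per iteration */
    t0 = omp_get_wtime();
    #pragma omp parallel for
    for (i = 0; i < N; i++) a[i] = b[i];
    t1 = omp_get_wtime();
    printf("COPY bandwidth: %.1f MB/s\n", 1.0e-6 * 16.0 * N / (t1 - t0));

    /* SCALE: 16 bytes, 1 FLOP per iteration */
    #pragma omp parallel for
    for (i = 0; i < N; i++) a[i] = q * b[i];

    /* SUM: 24 bytes, 1 FLOP per iteration */
    #pragma omp parallel for
    for (i = 0; i < N; i++) a[i] = b[i] + c[i];

    /* TRIAD: 24 bytes, 2 FLOPs per iteration */
    #pragma omp parallel for
    for (i = 0; i < N; i++) a[i] = b[i] + q * c[i];

    return 0;
}

The supplied benchmark repeats each kernel ntimes to obtain reliable timings, but it counts bytes and floating-point operations in the same way.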
3.3 Running the Code

The default version is parallelised using OpenMP. You can simply run it on your system as before, setting the number of threads as required. On ness you can run it on the shared frontend, or submit it to the backend with the standard script (renamed appropriately).

To change the size of the array, simply edit line 109 of the source, which initially reads:

PARAMETER (n=1000000,offset=0,ndim=n+offset,ntimes=1000)

It is very important that you increase the value of ntimes when you decrease n (and vice versa) in order to maintain a reasonable runtime of several seconds. For example, you could alter each by a factor of 10.

3.4 Performance

First, run the code on a single thread and alter the value of n (remembering to change ntimes as appropriate). You should see that the bandwidth changes substantially at various thresholds as the arrays start to fit into the different levels of cache.

Now run with more than one thread. You would expect the bandwidth to stay relatively constant when accessing main memory, as the threads are contending for the shared bus. However, when the arrays fit into level-one cache the bandwidth should scale with the number of threads, since each core has its own cache. Things may be a bit more complicated at the level-two threshold, as this cache may or may not be shared depending on the architecture and on how threads are allocated to cores. If you observe performance thresholds, do they coincide with the cache sizes on your system?

# Threads    n    COPY    SCALE    SUM    TRIAD

Table 2: Bandwidth (MB/s) for STREAM benchmark

3.5 Issues with OpenMP Performance

On some systems, shared-memory parallelisation can incur large overheads. In this situation the time taken for operations on small arrays is dominated by the OpenMP overheads, and you may not observe any benefit from the caches. It is still possible to investigate cache effects by running multiple copies of a serial stream benchmark rather than one copy of a parallel one.

To do this, edit the Makefile so that the source file is specified as stream_tuned-serial.f. This is compiled as before: note that OpenMP is enabled in order to gain access to its inbuilt timer, but is not otherwise used in the serial code. To execute multiple copies at the same time, the simplest method is to run the benchmark several times in the background (i.e. using '&'):

[user@ness ~]$ ./stream &
[user@ness ~]$ ./stream &
...

Having entered the first command you can repeat it quickly by typing CTRL-P or hitting the up arrow. Provided the benchmarks run for several seconds, the delay between starting them should be insignificant.

If the two (or more) instances of streams are contending for memory bandwidth then they will all report a smaller bandwidth than a single instance. If all accesses are coming from level-one cache then they should all report a similar bandwidth to a single instance.
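Whichever approach you use – one parallel benchmark or several serial copies – it helps to relate the value of n to the amount of data the kernels actually touch when deciding where the cache thresholds should appear. The benchmark works on three double-precision arrays of n elements (a, b and c), so the total working set is roughly 3 x 8 x n bytes. The default n = 1000000 therefore corresponds to about 24 MB, which is far larger than typical on-chip caches and so measures main-memory bandwidth; n = 10000 is about 240 KB, which may fit in a level-two cache; and n = 1000 is about 24 KB, which should sit comfortably in a typical level-one data cache. These sizes are only indicative – check the actual cache sizes of the processor in your own machine and compare them with the thresholds you record in Table 2.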