Multicore Exercises: Running Simple OpenMP Programs
David Henty
1 Introduction
The basic aim of this exercise is to familiarise you with compiling and running OpenMP parallel programs. Most importantly this will enable you to determine the correct options and settings for your
particular combination of machine, OS, programming language and compiler. Running the programs
will also give you information on how well your system copes with running real parallel programs in
terms of both CPU and memory usage. The first example is a simple image processing program, the
second is a standard benchmark for measuring memory bandwidth.
2 Image Processing
The program does a simple form of image processing to try and sharpen up a fuzzy picture. You will be
able to measure the time taken by the code using different numbers of threads to check that the execution
time decreases with processor count as expected.
2.1 Accessing the code
If you are using ness you can copy a tar file containing the code directly to your own course account:
[user@ness ~]$ cp /home/s05/course00/multicore/sharpen.tar .
If you are using your own machine then you can download all the files from:
http://www.epcc.ed.ac.uk/~etg/multicore/.
If you are using a Unix system or Cygwin then you should download sharpen.tar. If you have a
Windows compiler then you should download sharpen.zip.
You can use standard Windows tools to unpack the zip file. The tar files can be unpacked straightforwardly as follows:
[user@ness ~]$ tar xvf sharpen.tar
sharpen/C/
sharpen/C/sharpen.c
...
2.2 Compiling the code
Two versions are supplied, one for C and one for Fortran. You should work in the appropriate directory.
A Makefile is supplied for compilation – for example, to compile the Fortran code using make on ness:
[user@ness ~]$ cd sharpen/F
[user@ness sharpen/F]$ make
pgf90 -g -mp -c sharpen.f90
pgf90 -g -mp -c dosharpen.f90
pgf90 -g -mp -c filter.f90
pgf90 -g -mp -c fio.f90
pgf90 -g -mp -o sharpen sharpen.o dosharpen.o filter.o fio.o
which creates the executable program sharpen (or sharpen.exe in Cygwin or Windows).
If you are not using ness then you will need to edit the Makefile to use the correct compilers and flags.
Options already exist in the Makefile for compiling under Cygwin - you just need to uncomment them.
For other systems you will need to set these options by hand.
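For example, if you have the GNU compilers installed, the equivalent commands would look something like the following (a sketch only: the OpenMP flag is -fopenmp rather than -mp, but check your own compiler's documentation):
gfortran -g -fopenmp -c sharpen.f90
gfortran -g -fopenmp -c dosharpen.f90
gfortran -g -fopenmp -c filter.f90
gfortran -g -fopenmp -c fio.f90
gfortran -g -fopenmp -o sharpen sharpen.o dosharpen.o filter.o fio.o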
On Unix systems you can look at the input file using the display program:
[user@ness sharpen/F]$ display fuzzy.pgm
and you can close display by typing “q” anywhere in the window. For Windows or Cygwin you will
need to convert the PGM files to a Windows format such as PNG, e.g. using the convert program:
convert fuzzy.pgm fuzzy.png
You should be able to view the PNG files using standard Windows applications.
2.3 Running the Code
You can run the executable on your own system, or on the frontend of ness, in the standard way. All
you need to do is set the number of threads, e.g. on a Unix system or under Cygwin:
[user@ness sharpen/F]$ export OMP_NUM_THREADS=2
[user@ness sharpen/F]$ ./sharpen
Image sharpening code running on 2 thread(s)
Input file is: fuzzy.pgm
Image size is: 564 x 770
...
You should note the timings reported by the code, and view the output file sharp.pgm to see the effect
of the image sharpening.
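For example, on a Unix system:
[user@ness sharpen/F]$ display sharp.pgm
(under Windows or Cygwin, convert it to PNG first as described above).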
If you are using ness then you will need to run on the backend system to get reliable timings and access
up to 16 cores (the frontend is shared between all users and is only a dual-core machine).
To run the code on the backend you need to submit a script to the Sun Grid Engine (SGE) batch system.
You are provided with a template script ompbatch.sge which is appropriate for submitting any code
parallelised using OpenMP. All you need to do is make a copy of this file with the same name as the
executable you want to run, e.g.:
[user@ness sharpen/F]$ cp ompbatch.sge sharpen.sge
To run the code on, say, two threads:
[user@ness sharpen/F]$ qsub -pe omp 2 ./sharpen.sge
When the code has completed it will produce a log file which contains, among other things, the standard
output from the program. The log file will have a name of the form sharpen.sge.oXXXX where
XXXX will be a unique job identifier assigned by the SGE batch system.
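For reference, an OpenMP batch script for SGE typically does little more than select the submission directory, set the number of threads from the number of slots requested, and launch the executable. A minimal sketch is shown below; it is purely illustrative, as the supplied ompbatch.sge already contains the correct settings for ness:
#!/bin/bash
#$ -cwd                          # run the job in the directory it was submitted from
#$ -V                            # pass the current environment to the job
export OMP_NUM_THREADS=$NSLOTS   # one thread per slot requested with -pe omp
./sharpen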
2.4 Parallel Performance
If you examine the output you will see that it contains two timings: the total time taken by the entire
program (including IO) and the time taken solely by the calculation. The image input and output is not
parallelised so this is a serial overhead, performed by a single core. The calculation part is, in theory,
perfectly parallel (each thread operates on different parts of the image) so this should get faster on more
processors.
You should do a number of runs and fill in Table 1: the IO time is the overall run time minus the calculation
time; the total CPU time is the calculation time multiplied by the number of threads. You should see performance improvements up to the number of cores on your system. However,
it may be interesting to run with more threads than cores to see what effect this has on performance.
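As a purely hypothetical example of how the entries are related: a 4-thread run reporting an overall run time of 2.0 seconds and a calculation time of 1.6 seconds would give an IO time of 2.0 - 1.6 = 0.4 seconds and a total CPU time of 1.6 x 4 = 6.4 seconds.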
Look at your results – do they make sense?
# Threads | Overall run time | Calculation time | IO time | Total CPU time
----------+------------------+------------------+---------+---------------
    1     |                  |                  |         |
    2     |                  |                  |         |
    4     |                  |                  |         |
    7     |                  |                  |         |
   10     |                  |                  |         |

Table 1: Time taken by parallel image processing code
Given the structure of the code, you would expect the IO time to be roughly constant, and the performance
of the calculation to increase linearly with the number of processors: this would give a roughly constant
figure for the total CPU time. Is this what you observe?
Unless you are running via a batch system you should be able to monitor the CPU usage while the
program is running, e.g. using top or Task Manager. Does the CPU usage change as you would expect
between running on a single thread and using multiple threads?
3 Memory Bandwidth
The streams benchmark is a simple way to measure the memory bandwidth of any computer. This is
of particular relevance for multicore systems where many cores share the same memory bus. For full
information on streams, see http://www.cs.virginia.edu/stream/ref.html
The benchmark has four kernels: copy, scale, sum and triad:
name    kernel                  bytes/iter   FLOPS/iter
COPY    a(i) = b(i)                 16            0
SCALE   a(i) = q*b(i)               16            1
SUM     a(i) = b(i) + c(i)          24            1
TRIAD   a(i) = b(i) + q*c(i)        24            2
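The byte counts assume 8-byte double-precision array elements: COPY and SCALE read one array and write another (2 x 8 = 16 bytes per iteration), while SUM and TRIAD read two arrays and write one (3 x 8 = 24 bytes per iteration).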
On all modern systems, the rate of execution is determined by the access to memory rather than the peak
FLOP rate (i.e. the clock rate). The size of the arrays n can be varied; to get sensible timings, each
operation is performed multiple times as specified by ntimes.
If n is very large then the program will be accessing main memory. If it is small enough then the data may
fit into cache, leading to an increased bandwidth over multiple iterations.
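To make the structure concrete, the following is a minimal, self-contained sketch of how one kernel (TRIAD) might be timed: the loop is parallelised with OpenMP and repeated ntimes over arrays of length n, and the bandwidth follows from the 24 bytes moved per iteration. This is an illustration only; the real stream source is organised differently and measures all four kernels.

program triad_sketch
  ! Illustrative sketch of a STREAM-style TRIAD measurement.
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000, ntimes = 100
  double precision, allocatable :: a(:), b(:), c(:)
  double precision :: q, t0, t1, rate
  integer :: i, k

  allocate(a(n), b(n), c(n))
  q = 3.0d0
  b = 1.0d0
  c = 2.0d0

  t0 = omp_get_wtime()
  do k = 1, ntimes
!$omp parallel do
     do i = 1, n
        a(i) = b(i) + q*c(i)
     end do
!$omp end parallel do
  end do
  t1 = omp_get_wtime()

  ! TRIAD moves 24 bytes per iteration: read b(i) and c(i), write a(i)
  rate = 24.0d0*n*ntimes/((t1 - t0)*1.0d6)
  print *, 'TRIAD bandwidth (MB/s):', rate
end program triad_sketch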
3.1 Accessing the code
The code is available from the same locations as the image processing code – copy stream.tar or
stream.zip to your local account.
3.2 Compiling the code
A Makefile is supplied which works as-is on ness, and has commented-out options for Cygwin. You
may need to edit the settings appropriately for your local system. For simplicity, we have only supplied
the Fortran version of the code. Note that this has been slightly modified from the standard stream
benchmark – see the README for details.
3.3 Running the Code
The default version is parallelised using OpenMP. You can simply run it on your system as before, setting
the number of threads as required. On ness, you can run it on the shared frontend, or submit to the
backend with the standard script (renamed appropriately).
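For example, to run on four threads on the frontend:
[user@ness ~]$ export OMP_NUM_THREADS=4
[user@ness ~]$ ./stream
or, to submit to the backend:
[user@ness ~]$ cp ompbatch.sge stream.sge
[user@ness ~]$ qsub -pe omp 4 ./stream.sge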
To change the size of the array, you simply edit line 109 which initially reads:
PARAMETER (n=1000000,offset=0,ndim=n+offset,ntimes=1000)
It is very important that you increase the value of ntimes when you decrease n (and vice-versa) in order
to maintain a reasonable runtime of several seconds. For example, you could alter each by a factor of 10.
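For example, reducing n by a factor of 10 suggests increasing ntimes by the same factor, i.e. something like:
PARAMETER (n=100000,offset=0,ndim=n+offset,ntimes=10000)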
3.4 Performance
First, run the code on a single thread and alter the value of n (remembering to change ntimes as
appropriate). You should see that the bandwidth changes substantially at various thresholds as the
arrays start to fit into different levels of cache.
Now, run with more than one thread. You would expect to see that the bandwidth stays relatively constant
when accessing main memory, as the threads are contending for the shared bus. However, when the
arrays fit into level-one cache, the bandwidth should scale with the number of threads, as each core has its
own cache. Things may be a bit more complicated at the level-two threshold, which may or may not be
shared depending on the architecture and how threads are allocated to cores.
If you observe performance thresholds, do they coincide with the cache sizes on your system?
# Threads |    n    | COPY | SCALE | SUM | TRIAD
----------+---------+------+-------+-----+------
          |         |      |       |     |
          |         |      |       |     |

Table 2: Bandwidth (MB/s) for STREAM benchmark
3.5 Issues with OpenMP Performance
On some systems, shared-memory parallelisation can incur large overheads. In this situation the time
taken for operations on small arrays is dominated by the OpenMP overheads, and you may not observe
any benefit from the caches.
It is still possible to investigate cache effects by running multiple copies of a serial stream benchmark as
opposed to one copy of a parallel one. To do this, you should edit the Makefile so the source is specified
as stream_tuned-serial.f. This is compiled as before: note that OpenMP is enabled to gain
access to its inbuilt timer, but is not otherwise used in the serial code.
To execute multiple copies at the same time, the simplest method is to run the benchmark multiple times
in the background (i.e. using ’&’):
[user@ness ~]$ ./stream &
[user@ness ~]$ ./stream &
...
Having entered the first command you can repeat it quickly by typing ’CTRL-P’ or hitting the up arrow.
Providing the benchmarks run for several seconds the delay between starting them should be insignificant.
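If you would rather start several copies in one go, a simple shell loop does the same job, e.g. for four copies:
[user@ness ~]$ for i in 1 2 3 4; do ./stream & done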
If the two (or more) instances of streams are contending for memory bandwidth then they will all report
a smaller bandwidth than a single instance. If all accesses are coming from level-one cache then they
should all report a similar bandwidth to a single instance.