Multicore Programming with OpenMP Exercise Notes September 7, 2009 Getting started If you are using ness you can copy a tar file containing the code directly to your course account: [user@ness ˜]$ cp /home/s05/course00/multicore/openmp.tar . If you are using your own machine, then you can download the files from: http://www.epcc.ed.ac.uk/˜etg/multicore If you are using a Unix/Linux system, or Cygwin then you should download openmp.tar Unpack the tar file with the command tar xvf openmp.tar If you are using a Windows compiler then you should download openmp.zip and use standard Windows tools to unpack it. Exercise 1: Hello World This is a simple exercise to introduce you to the compilation and execution of OpenMP programs. The example code can be found in */HelloWorld/ where the * represents the language of your choice, i.e. C, or Fortran90. Compile the code, making sure you use the appropriate flag to enable OpenMP. Before running it, set the environment variable OMP NUM THREADS to a number n between 1 and 4 with the command: export OMP_NUM_THREADS=n When run, the code enters a parallel region at the !$OMP PARALLEL/#pragma omp parallel command. At this point n threads are spawned, and each thread executes the print command separately. The OMP GET THREAD NUM()/ omp get thread num() library routine returns a number (between 0 and n-1) which identifies each thread. 1 Extra Exercise Incorporate a call to omp get num threads() into the code and print its value within and outside of the parallel region. Exercise 2: Area of the Mandelbrot Set The aim of this exercise is to use the OpenMP directives learned so far and apply them to a real problem. It will demonstrate some of the issues which need to be taken into account when adapting serial code to a parallel version. The Mandelbrot Set The Mandelbrot Set is the set of complex numbers c for which the iteration z = z 2 + c does not diverge, from the initial condition z = c. To determine (approximately) whether a point c lies in the set, a finite number of iterations are performed, and if the condition |z| > 2 is satisfied then the point is considered to be outside the Mandelbrot Set. What we are interested in is calculating the area of the Mandelbrot Set. There is no known theoretical value for this, and estimates are based on a procedure similar to that used here. We will use Monte Carlo sampling to calculate a numerical solution to this problem. N.B. Complex numbers in C: Complex arithmetic is supported directly in Fortran, but in C it must be done manually. For this exercise the template has a struct complex declared. Then for a given complex number z, z.creal represents the real part and z.cimag the imaginary part. Then assigning z = z 2 + c is given by: ztemp=(z.creal*z.creal)-(z.cimag*z.cimag)+c.creal; z.cimag=z.creal*z.cimag*2+c.cimag; z.creal=ztemp; using double ztemp as temporary storage for the updated real part of z. Also, for the threshold test |z| > 2.0, use abs(z)>2.0 in Fortran, and in C use: z.creal*z.creal+z.cimag*z.cimag>4.0. The Code : Sequential The Monte Carlo method we shall use generates a number of random points in the range [(−2.0, 0), (0.5, 1.125)] of the complex plane. Then each point will be iterated using the equation above a finite number of times (say 100000). If within that number of iterations the threshold condition |z| > 2 is satisfied then that point is considered to be outside of the Mandelbrot Set. Then counting the number of random points within the Set and those outside will reveal a good approximation of the area of the Set. The template for this code can be found in */Mandelbrot/. Implement the Monte Carlo sampling, looping over all npoints points. For each point, assign z=c(i) and iterate as defined in the earlier equation, testing for the threshold condition at each iteration. If the threshold condition is not satisfied, i.e. |z| ≤ 2, then 2 repeat the iteration (up to a maximum number of iterations, say 100,000). If after the maximum number of iterations the condition is still not satisfied then add one to the total number of points inside the Set. If the threshold condition is satisfied, i.e. |z| > 2 then stop iterating and move on to the next point. Parallelisation Now parallelise the serial code using the OpenMP directives and library routines that you have learned so far. The method for doing this is as follows 1. Start a parallel region before the main Monte Carlo iteration loop, making sure that any private, shared or reduction variables within the region are correctly declared. 2. Distribute the npoints random points across the n threads available so that each thread has an equal number of the points. For this you will need to use some of the OpenMP library routines. N.B.: there is no ceiling function in Fortran77, so to calculate the number of points on each thread (nlocal) compute (npoints+n-1)/n instead of npoints/n. Then for each thread the iteration bounds are given by: ilower = nlocal * myid + 1 iupper = min(npoints, (myid+1)*nlocal) where myid stores the thread identification number. 3. Rewrite the iteration loop to take into account the new distribution of points. 4. End the parallel region after the iteration loop. Once you have written the code try it out using 1, 2, 3 and 4 threads. Check that the results are identical in each case, and compare the time taken for the calculations using the different number of threads. Extra Exercise Write out which iterations are being performed by each thread to make sure all the points are being iterated. Exercise 3: Image The aim of this exercise is to use the OpenMP worksharing directives, specifically the PARALLEL DO/parallel for directives. This code reconstructs an image from a version which has been passed through a a simple edge detection filter. The edge pixel values are constructed from the image using 3 edgei,j = imagei−1,j + imagei+1,j + image1,j−1 + imagei,j+1 − 4 imagei,j If an image pixel has the same value as its four surrounding neighbours (i.e. no edge) then the value of edgei,j will be zero. If the pixel is very different from its four neighbours (i.e. a possible edge) then edgei,j will be large in magnitude. We will always consider i and j to lie in the range 1, 2, . . . M and 1, 2, . . . N respectively. Pixels that lie outside this range (e.g. imagei,0 or imageM+1,j ) are simply considered to be set to zero. Many more sophisticated methods for edge detection exist, but this is a nice simple approach. The exercise is actually to do the reverse operation and construct the initial image given the edges. This is a slightly artificial thing to do, and is only possible given the very simple approach used to detect the edges in the first place. However, it turns out that the reverse calculation is iterative, requiring many successive operations each very similar to the edge detection calculation itself. The fact that calculating the image from the edges requires a large amount of computation, makes it a much more suitable program than edge detection itself for the purposes of timing and parallel scaling studies. As an aside, this inverse operation is also very similar to a large number of real scientific HPC calculations that solve partial differential equations using iterative algorithms such as Jacobi or Gauss-Seidel. Parallelisation The code for this exercise can be found in */Image/. The file notlac.dat contains the initial data. The code can be parallelised by adding PARALLEL DO/parallel for directives to the three routines new2old, jacobistep and residue, taking care to correctly identify shared, private and reduction variables. Hint: use default(none). Remember to add the OpenMP compiler flag. To test for correctness, compare the final residue value to that from the sequential code. Run the code on different numbers of threads. Try to answer the following questions: • Where is synchronisation taking place? • Where is communication taking place? Extra exercise Redo Exercise 2 using worksharing directives. Extra exercise (Fortran90 only) Implement the routine new2old using Fortran 90 array syntax, and parallelise with a PARALLEL WORKSHARE directive. Compare the performance of the two versions. 4 Exercise 4: Load imbalance This exercise will demonstrate the use of the SCHEDULE/schedule clause to improve performance of a parallelised code. The Goldbach Conjecture In the world of number theory, the Goldbach Conjecture states that: Every even number greater than two is the sum of two primes. In this exercise you are going to write a code to test this conjecture. The code will find the number of Goldbach pairs (that is, number of pairs of primes p1 , p2 such that p1 + p2 = i) for i = 2, . . . , 8000. The computational cost is proportional to i3/2 , so for optimal performance on multiple threads the work load must be intelligently distributed using the OpenMP SCHEDULE/schedule clause. The Code The template for the code can be found in */Goldbach/. 1. Initialise the array numpairs to zero. 2. Create a do/for loop which loops over i and sets each element of the array numpairs to the returned value of a function goldbach(2*i). This array will contain the number of Goldbach pairs for each even number up to 8000. 3. Once the number of Goldbach pairs for each even number has been calculated, test that the Goldbach conjecture holds true for these numbers, that is, that for each even number greater than two there is at least one Goldbach pair. Your output should match these results: Even number 2 802 1602 2402 3202 4002 4802 5602 6402 7202 Goldbach pairs 0 16 53 37 40 106 64 64 156 78 Table 1: Example output for Goldbach code 5 Parallelisation Parallelise the code using OpenMP directives around the do/for loop which calls the goldbach function. Validate that the code is still working when using more than one thread, and then experiment with the schedule clause to try and improve performance. Extra Exercise 1. Print out which thread is doing each iteration so that you can see directly the effect of the different scheduling clauses. 2. Try running the loop in reverse order, and using the GUIDED schedule. Exercise 5: Molecular Dynamics The aim of this exercise is to demonstrate how to use several of the OpenMP directives to parallelise a molecular dynamics code. The Code The code can be found in */md/. The code is a molecular dynamics (MD) simulation of argon atoms in a box with periodic boundary conditions. The atoms are initially arranged as a face-centred cubic (fcc) lattice and then allowed to melt. The interaction of the particles is calculated using a Lennard-Jones potential. The main loop of the program is in the file main.[f|c|f90]. Once the lattice has been generated and the forces and velocities initialised, the main loop begins. The following steps are undertaken in each iteration of this loop: 1. The particles are moved based on their velocities, and the velocities are partially updated (call to domove) 2. The forces on the particles in their new positions are calculated and the virial and potential energies accumulated (call to forces) 3. The forces are scaled, the velocity update is completed and the kinetic energy calculated (call to mkekin) 4. The average particle velocity is calculated and the temperature scaled (call to velavg) 5. The full potential and virial energies are calculated and printed out (call to prnout) Parallelisation The parallelisation of this code is a little less straightforward. There are several dependencies within the program which will require use of the ATOMIC/atomic directive as well as the REDUCTION/reduction clause. The instructions for parallelising the code are as follows: 6 1. Edit the subroutine/function forces.[f|c|f90]. Start a !$OMP PARALLEL DO / #pragma omp parallel for for the outer loop in this subroutine, identifying any private or reduction variables. Hint: There are 2 reduction variables. 2. Identify the variable within the loop which must be updated atomically and use !$OMP ATOMIC/#pragma omp atomic to ensure this is the case. Hint: You’ll need 6 atomic directives. Once this is done the code should be ready to run in parallel. Compare the output using 2, 3 and 4 threads with the serial output to check that it is working. Try adding the schedule clause with the option static,n to the DO/for directive for different values of n. Does this have any effect on performance? Exercise 6: Molecular Dynamics Part II Following on from the previous exercise, we shall update the molecular dynamics code to take advantage of orphaning and examine the performance issues of using the atomic directive. Orphaning To reduce the overhead of starting and stopping the threads, you can change the PARALLEL DO/omp parallel for directive to a DO/for directive and start the parallel region outside the main loop of the program. Edit the main.[f|c|f90] file. Start a PARALLEL/parallel region before the main loop. End the region after the end of the loop. Except for the forces routine, all the other work in this loop should be executed by one thread only. Ensure that this is the case using the SINGLE or MASTER directive. Recall that any reduction variables updated in a parallel region but outside of the DO/for directive should be updated by one thread only. As before, check your code is still working correctly. Is there any difference in performance compared with the code without orphaning? Multiple Arrays The use of the atomic directive is quite expensive. One way round the need for the directive in this molecular dynamics code is to create a new temporary array for the variable f. This array can then be used to collect the updated values of f from each thread, which can then be ”reduced” back into f. 1. Edit the main.[f|c|f90] file. Create an array ftemp of dimensions (npart, 3, 0:MAXPROC-1) for Fortran, [3*npart][MAXPROC] for C, where MAXPROC is a constant value which the number of threads used will not exceed. This is the array to be used to store the updates on f from each thread. 2. Within the parallel region, create a variable nprocs and assign to it the number of threads being used. 7 3. The variables ftemp and nprocs and the parameter MAXPROC need to be passed to the forces routine. (N.B. In C it is easier, but rather sloppy, to declare ftemp as a global variable). 4. In the forces routine define a variable my id and set it to the thread number). This variable can now be used as the index for the last dimension of the array ftemp. Initialise ftemp to zero. 5. Within the parallelised loop, remove the atomic directives and replace the references to the array f with ftemp. 6. After the loop, the array ftemp must now be reduced into the array f. Loop over the number of particles (npart) and the number of threads (nprocs) and sum the temporary array into f The code should now be ready to try again. Check its still working and see if the performance (especially the scalability) has improved. Extra exercise (Fortran only) Use a reduction clause for f instead of the temporary arrays. Compare the performance of the two versions. Exercise 8: Performance analysis N.B. these instructions are for ness only. If you wish to conduct this excercise on another machine, please speak to a demonstrator. This exercise is intended to use a profiling tool to do some performance analysis. */MolDyn4 contains a correct, but not very efficient solution to the MolDyn example. First of all, run a sequential version of the code (by compiling without OpenMP, and note the execution time. To profile the code, compile with -qp in addition to other flags. Run the code on the back end, say on 4 threads. This will generate a file gmon.out. The command gprof ./md will generate profiling information. Compute the ideal time (the sequential time divided by the number of threads) and compare it to the actual execution time. Estimate how much of the overhead is due to • Sequential code (code not in the parallel region) • Load imbalance ( mp barrierp) • Synchronisation (other routines starting with mp) Appendix: OpenMP on ness Hardware For this course you can compile and run code on the EPCC HPC Service. This is provided on a Sun X4600 system (ness) confugred as follows 8 • A 2-processor front-end used for compilation, debugging and testing of parallel jobs, and submission of jobs to the back-end. • One 16-processor and one 8-processor back-ends used for production runs of parallel jobs. Compiling using OpenMP The OpenMP compilers we use are the Portland Group compilers for Fortran 90 and C. To compile an OpenMP code, simply: Fortran (pgf90): add the flag -mp C (pgcc): add the flag -mp Using a Makefile The Makefile below is a typical example of the Makefiles used in the exercises. The option -O3 is a standard optimisation flag for speeding up execution of the code. ## Fortran compiler and options FC= pgf90 -O3 -mp ## Object files OBJ= main.o \ sub.o ## Compile execname: $(OBJ) $(FC) -o $@ $(OBJ) .f.o: $(FC) -c $< ## Clean out object files and the executable. clean: rm *.o execname Job Submission Batch processing is very important on the HPC service, because it is the only way of accessing the back-end. Interactive access is not allowed. For doing timing runs you must use the back-end. To do this, you should submit a batch job as follows, for example using 4 threads: qsub -cwd -pe omp 4 scriptfile where scriptfile is a shell script containing 9 #! /usr/bin/bash export OMP_NUM_THREADS=4 ./execname The -cwd flag ensures the output files are written to the current directory. The -pe flag requestes 4 processors. Note that with OpenMP codes the number of processors requested should be the same as the environment variable OMP NUM THREADS. This can be ensured by using: export OMP_NUM_THREADS=$NSLOTS You can monitor your jobs status with the qstat command, or the qmon utility. Jobs can be deleted with qdel (or via the qmon utility). 10