Parallel Programming with OpenMP [CBPrice Draft4 31-01-12]

Introduction

OpenMP is not a new programming language; rather it is a set of directives which can be inserted into a C/C++ program or a Fortran program. These directives instruct the compiler to produce an executable whose work the operating system will share amongst the multiple processors present on a machine, so that the program is executed on several processors at once. For example, suppose we have an array of numbers a[i] of length ten and our program must double all the elements. (Here i is the 'index' into the array.) If we have two processors, then the parallel program will assign the computations on elements a[0], a[1], a[2], a[3] and a[4] to processor A and the computations on elements a[5], a[6], a[7], a[8] and a[9] to processor B. Both processors run in parallel, so the work is shared.

Each parallel strand of computation is called a thread. So in the above example there are two threads of execution: one runs on processor A and the other runs on processor B. OpenMP works using shared memory, where all threads have access to this common memory. It provides the programmer with the ability to (i) create a team of threads to run on the processors of the machine, (ii) specify how work is to be shared between the members of the team, (iii) designate shared and private variables from the point of view of each thread (don't worry, this will become clearer later, trust me!), and (iv) prevent the execution of one thread from spoiling the computations of another thread.

In this text we aim to introduce the basics of OpenMP. We shall look at the following set of constructs and provide links to the associated workshop activities:

(i) The parallel construct.
(ii) Work-sharing constructs such as loop, sections and single.
(iii) Scheduling of parallel processing.
(iv) The barrier construct.
(v) The critical construct.

In the examples provided below, we shall provide you with sequential code designed to run on a single-processor machine. You will be familiar with this kind of code from your programming experience. Then we shall provide you with the parallelised code. We hope that, through a comparison of these, you will learn how to parallelise your own code.

The Parallel Construct

Let's start with a traditional programming exercise, to print out "Hello World". Here's a C program to do this using sequential code.

int _tmain(int argc, _TCHAR* argv[])
{
	printf("Hello World\n");
}

Now the equivalent parallel program uses the construct #pragma omp parallel to create a number of threads and to get each of these to execute the print statement. Here's the parallel program. Note that all the code between the braces following the construct will be executed in parallel; that means that the "Hello World" text will be printed out a number of times, once for each thread you have on your machine (typically one per processor).

int _tmain(int argc, _TCHAR* argv[])
{
	#pragma omp parallel
	{
		printf("Hello World\n");
	}
}

This exercise is explored in the workshop activity Investigation 1.

It is useful to know which thread is executing a statement at a particular time. This can be done using a call to the OpenMP function omp_get_thread_num(). Let's add this to the above code. We write this.

int _tmain(int argc, _TCHAR* argv[])
{
	int tid;

	#pragma omp parallel
	{
		tid = omp_get_thread_num();
		printf("Thread %d says Hello World\n",tid);
	}
}

Here the variable tid is the 'thread id', and tells you which thread is actually printing out the "Hello World" message.
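For reference, here is a self-contained version of the same program. It is a sketch, with a standard main() assumed in place of the Visual Studio _tmain(), and the two headers added so that it builds on its own; note also that declaring tid inside the parallel block makes it private, so each thread prints its own id rather than sharing a single copy. The program needs to be compiled with OpenMP support enabled (the /openmp switch in Visual Studio, or -fopenmp with gcc).

#include <stdio.h>
#include <omp.h>

int main(void)
{
	// Every thread in the team executes the block between the braces
	#pragma omp parallel
	{
		int tid = omp_get_thread_num();   // declared inside the block, so private to each thread
		printf("Thread %d says Hello World\n", tid);
	}
	return 0;
}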
This is incorporated into the workshop activity Investigation 1. As you will find when you run this Investigation several times, the order of execution will differ from run to run, i.e., the messages will be printed by the threads in a different order each time. This 'non-deterministic' element of parallel execution is something which you must understand.

Sharing Work – The Loop Construct

Sharing work between threads is the most fundamental of OpenMP's facilities, since it allows your computation to be distributed between the members of a team of threads, giving a speedup and a reduction in the total computation time. One important case is sharing the work involved in a loop. Look at the following code, which initialises each element of an array to the value 2*i using a serial program.

int _tmain(int argc, _TCHAR* argv[])
{
	int n;     // This is the length of the array
	int i;     // This is the index into the array
	int m;     // This is the value to be assigned to all array elements
	int a[10]; // This is the array of length 10

	n = 10;

	// Initialise all 10 elements of the array with value 2*i
	for(i=0;i<n;i++) {
		a[i] = 2*i;
	}
}

This is straightforward to understand. First we declare the variables n, i, m and a which we shall use, then we define the value of n. Finally, we do the computation within the for loop, which does nothing else but assign the value 2*i to each element of the array a.

The parallelised program is shown below. This program will distribute the computation of the array a[i] values over the threads we have available. For example, if we have two threads then each thread will execute half the computations.

int _tmain(int argc, _TCHAR* argv[])
{
	int n;     // This is the length of the array
	int i;     // This is the index into the array
	int m;     // This is the value to be assigned to all array elements
	int a[10]; // This is the array of length 10

	n = 10;

	#pragma omp parallel
	{
		#pragma omp for
		for(i=0;i<n;i++) {
			a[i] = 2*i;
		}
	}// end parallel
} // end main

Here the first directive #pragma omp parallel identifies a block of code which may be parallelised, and the second directive #pragma omp for tells us that it is the for loop which will be parallelised. When this code is executed, the computations will be divided between the threads. Investigation 150/151 will explore this.

Loop Construct – Shared and Private Variables

While the above parallelised code works, you are not advised to write code like this. It is too simplistic and has made some assumptions. We are assuming that the variable i is private to each thread, so that no two threads have the same value of i and therefore update the same array element a[i], which could cause a problem. We are also assuming that each thread sees the same array a[] and not a separate copy. To remove these assumptions, we add two "clauses" to the #pragma omp parallel directive: private(i) ensures that no two threads work on the same i, and shared(a) ensures that all threads see a common array a[] within the parallel region. Here's the corrected program.

int _tmain(int argc, _TCHAR* argv[])
{
	int n;     // This is the length of the array
	int i;     // This is the index into the array
	int m;     // This is the value to be assigned to all array elements
	int a[10]; // This is the array of length 10

	n = 10;

	#pragma omp parallel private(i) shared(a)
	{
		#pragma omp for
		for(i=0;i<n;i++) {
			a[i] = 2*i;
		}
	}// end parallel
} // end main

This will be explored in Investigation 15.
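To see how the iterations are divided up, here is a small self-contained sketch. It assumes a standard main() and adds a hypothetical extra array, who[], to record which thread handled each element; the exact split of iterations between the threads is decided by the OpenMP runtime and may vary from run to run.

#include <stdio.h>
#include <omp.h>

int main(void)
{
	int i;
	int n = 10;
	int a[10];
	int who[10];   // hypothetical extra array: which thread set each element

	#pragma omp parallel private(i) shared(n,a,who)
	{
		#pragma omp for
		for(i=0;i<n;i++) {
			a[i]   = 2*i;
			who[i] = omp_get_thread_num();  // remember which thread did this iteration
		}
	}// end parallel

	for(i=0;i<n;i++) {
		printf("a[%d] = %2d  (set by thread %d)\n", i, a[i], who[i]);
	}
	return 0;
}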
An incorrect use of shared and private will be shown in Investigation 12.

An Aside: Combined Work-Sharing Constructs

You may have noticed in the parallel code example above that the loop construct #pragma omp for was contained within the parallel construct #pragma omp parallel. As we stated in the introductory material, the OpenMP approach aims to parallelise loops. So it may not come as a surprise that both of these constructs can be combined into one, which simplifies the writing of code and therefore reduces the chance of errors. The above example may be condensed as follows.

n = 10;
#pragma omp parallel for private(i) shared(a)
for(i=0;i<n;i++) {
	a[i] = 2*i;
}

Loop Construct – lastprivate and firstprivate Clauses

Let's consider this parallel code block, which distributes the computation of variables i and m over the threads available. This is perfectly good code. If m had been declared in the shared clause, then multiple threads would have attempted to update its value in an uncontrolled manner, with non-deterministic results.

n = 10;
#pragma omp parallel for private(i,m) shared(n)
for(i=0;i<n;i++) {
	m = i + 1;
}
k = 2*m;

But what about the serial statement k = 2*m outside the parallel block: which value of m does it get from the parallel block? The answer is that we cannot know; the value of m outside the parallel section is undefined, since we identified it as being private. Fortunately OpenMP provides a clause which allows us to access the last value of a private variable in a parallel loop. Here's the syntax.

n = 10;
#pragma omp parallel for private(i) lastprivate(m) shared(n)
for(i=0;i<n;i++) {
	m = i + 1;
}
k = 2*m;

Here the lastprivate(m) clause makes the value m held in the last iteration of the loop available after the parallel region, so k = 2*m now gets a well-defined value. This is explored in Investigation 14.

Issues with Parallel Loops – The Data Race Pathology

You may feel that you are becoming confident in your understanding of how to parallelise loops. Nevertheless there are some subtle issues which you should know about or, stated differently, there are traps you should not fall into. Here's one. Consider this code segment, in purely sequential code.

n = 10;

// initialise array
for(i=0;i<n;i++) a[i] = i;

// compute new values of the array
for(i=0;i<n-1;i++) a[i] = a[i+1] + 1;

Nothing wrong with this (note that the second loop stops one short of the end, since the last element has no neighbour a[i+1] to read). We first initialise the array, so that a[0] = 0, a[1] = 1 and so on, and then we compute the new values of the array, so that a[0] becomes a[1] + 1, which is 2, and so on. There is absolutely no problem here. Now let's look at a naive (and incorrect) attempt to parallelise this, which will fail.

n = 10;

// initialise array (serial section)
for(i=0;i<n;i++) a[i] = i;

// compute new values of the array (parallel section)
#pragma omp parallel for private(tid,i) shared(a,n)
for(i=0;i<n-1;i++) {
	a[i] = a[i+1] + 1;
}

Here we have indicated that a[] is a shared variable (as in the examples above) in order that all the parallel threads may work on this array together. But there is a problem. The line of code a[i] = a[i+1] + 1; says that the value of a[i] depends on the value of a[i+1]. But these values are being computed in parallel, so a[i+1] may be overwritten by another thread before or after this thread reads it. In other words, there is no way of knowing whether the values obtained are correct. This situation is known as a data race. The point is that not all code can be parallelised. A data race is explored in Investigation 13.
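One way to make this particular computation parallelisable is to remove the dependence between iterations, for example by writing the results into a second array so that the loop only reads a[] and only writes b[]. The sketch below illustrates this; it is not part of the workshop code, and the name b[] and the handling of the final element are choices made purely for illustration.

#include <stdio.h>
#include <omp.h>

int main(void)
{
	int i;
	int n = 10;
	int a[10], b[10];

	// initialise array (serial section)
	for(i=0;i<n;i++) a[i] = i;

	// each iteration reads a[] and writes its own element of b[],
	// so no two threads touch the same memory location: no data race
	#pragma omp parallel for private(i) shared(a,b,n)
	for(i=0;i<n-1;i++) {
		b[i] = a[i+1] + 1;
	}
	b[n-1] = a[n-1];   // the last element has no right-hand neighbour; simply copied here

	for(i=0;i<n;i++) printf("b[%d] = %d\n", i, b[i]);
	return 0;
}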
The 'Sections' Construct

When constructing a program which is to be executed on parallel processors, it is sometimes straightforward to see how the computational load can be divided up. Consider the sequential program below, which needs to initialise the values of an array a[] and, quite independently, initialise the values of a second array b[].

n = 10;
for(i=0;i<n;i++) a[i] = 10;
for(i=0;i<n;i++) b[i] = 20;

Both of these initialisations are independent, so it makes sense to assign them to individual threads which can run at the same time. The sections construct allows us to do this. Here's how we would parallelise these initialisations. Each section identifies an independent block of code which can be executed on its own, without reference to variables within the other section. Let's have a look.

n = 10;
#pragma omp parallel private(tid,i) shared(n,a,b)
{
	#pragma omp sections
	{
		#pragma omp section
		for(i=0;i<n;i++) a[i] = 10;

		#pragma omp section
		for(i=0;i<n;i++) b[i] = 20;
	}// sections
}// parallel

Here the first section initialises the values of the array a[] and the second section initialises the values of the array b[]. There are no dependencies between a[] and b[], i.e., we have no statement like a[i] = 2*b[i], so we are sure that the execution of the code within the two sections is independent. In this example the OpenMP runtime will assign the processing of the first section to one thread and the processing of the second section to another thread, so the code will run in parallel using two threads. Of course, if our system has more than two processors, then the additional processors will be idle (or else dealing with your Facebook or email traffic).

You might think from the above code that one section will be completed by its thread before the other thread starts work. This is not the case: both threads may work at the same time, so the computations will be interleaved, or perhaps one thread will complete its computations before the other. Only investigation will establish what actually happens; see Investigation 8.

The Single Construct

Often within a parallel region we need to have some code executed by a single thread. A good example is the initialisation of a shared variable. If we let a whole team of threads try to initialise the same variable, there is a chance that the variable will be incorrectly initialised. This is due to details of how the CPU actually writes to memory, which is beyond our work at the moment. We therefore need a method to ensure that the variable is correctly initialised, and this will always succeed if we use a single thread. This is the purpose of the single construct. Look at the code below.

n = 10;
#pragma omp parallel shared(m,b) private(i)
{
	#pragma omp single
	{
		m = 35;
	}

	#pragma omp for
	for(i=0;i<n;i++) {
		b[i] = m;
	}
}

Note that the entire code is contained in a parallel region. Variables m and b are shared. In the parallel for loop, the array b[] is initialised to the value of the variable m. This variable needs to be initialised first, and this happens within the single block. When this code runs, a single thread will assign the value 35 to the variable m; the single construct has an implied barrier at its end, so only once that assignment is complete do the threads go on to fill the elements of the array b[] with the value of m.
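Here is a small self-contained sketch of the single construct in action, assuming a standard main() and the headers shown; which thread executes the single block is chosen by the runtime and may differ from run to run.

#include <stdio.h>
#include <omp.h>

int main(void)
{
	int m;         // shared: set once inside the single block, read by every thread
	int b[10];     // shared array filled in by the parallel for loop
	int n = 10;

	#pragma omp parallel shared(m,b,n)
	{
		int i;     // declared inside the parallel region, so private to each thread

		#pragma omp single
		{
			m = 35;    // exactly one thread executes this block
			printf("Thread %d performed the initialisation\n", omp_get_thread_num());
		}              // implied barrier: all threads wait here until m has been set

		#pragma omp for
		for(i=0;i<n;i++) {
			b[i] = m;
		}
	}// end parallel

	printf("b[0] = %d, b[9] = %d\n", b[0], b[9]);
	return 0;
}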
The Critical Construct

This construct ensures that multiple threads do not attempt to update the same shared variable in memory at the same time, which could cause erroneous results. As an illustration, let's attempt to parallelise the serial loop shown below, where we sum the elements of an array.

n = 10;

// All the following is serial
// initialise array a[]
for(i=0;i<n;i++) a[i] = 5;

// sum the array
sum = 0;
for(i=0;i<n;i++) sum += a[i];

There's nothing wrong with this program; it will work fine and give us the correct value for sum. Now, here's an incorrect attempt to parallelise this computation.

n = 10;
sum = 0;
#pragma omp parallel shared(n,a,sum) private(sumLocal)
{
	sumLocal = 0;            // each thread starts its own partial sum at zero
	#pragma omp for
	for(i=0;i<n;i++) sumLocal += a[i];

	sum += sumLocal;         // PROBLEM: all threads update sum at the same time
}

Remember that the code within the parallel block will be executed by all the threads we have available. So the idea is that the threads will divide the summation of the array values between them in the parallel for loop, each storing its partial sum in its own private variable sumLocal. Then the partial sums are added together in the final statement sum += sumLocal. Sounds correct. But in this final stage the threads may attempt to change the value of sum at the same time, since they are operating in parallel. This is a data race and may lead to erroneous results. To prevent it, we define a critical region of code, which prevents more than one thread at a time from updating a shared variable, in this case sum. Here's how the correct code appears.

n = 10;
sum = 0;
#pragma omp parallel shared(n,a,sum) private(sumLocal)
{
	sumLocal = 0;
	#pragma omp for
	for(i=0;i<n;i++) sumLocal += a[i];

	#pragma omp critical (name)
	{
		sum += sumLocal;
	}
}

Here 'name' can be any name you choose. It identifies a particular critical region, since in serious code there could be more than one. So now, within the parallel region, the summation of a[] is distributed amongst the threads and each thread computes a partial sum, stored in sumLocal (private to each thread). When the threads have finished their summation and hold the correct values of their individual sums in sumLocal, they enter the critical region, where only one thread at a time is allowed to update the value of sum. This is explored in Investigation 7.
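For reference, here is a self-contained version of the critical-region sum that can be built and run as it stands. It is a sketch which assumes a standard main() in place of _tmain(), and the critical region name sum_update is simply an arbitrary choice.

#include <stdio.h>
#include <omp.h>

int main(void)
{
	int i;
	int n = 10;
	int a[10];
	int sum = 0;

	// serial: initialise the array
	for(i=0;i<n;i++) a[i] = 5;

	#pragma omp parallel shared(n,a,sum)
	{
		int j;
		int sumLocal = 0;      // declared inside the region, so private to each thread

		#pragma omp for
		for(j=0;j<n;j++) sumLocal += a[j];   // each thread sums its share of the elements

		#pragma omp critical (sum_update)
		{
			sum += sumLocal;   // one thread at a time adds its partial sum to the total
		}
	}// end parallel

	printf("sum = %d (expected %d)\n", sum, 5*n);
	return 0;
}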
The Barrier Construct

A barrier is placed in a parallel program where we need a team of threads to wait for each other before proceeding. For example, a team of threads may be assigned to perform some calculations and write to a shared variable (such as an array), and the same team may then be assigned to read that shared variable and perform some further calculations. Clearly the writes to the array need to be complete before the reads begin, otherwise we would have a data race. The barrier is placed between the writes and the reads, to ensure that all writes are completed before any reads commence.

Most of the OpenMP constructs have an implied barrier, which the compiler automatically inserts at the end of the construct. You will not normally have to add one explicitly, but it is important to know that this is going on. The code below shows a rather contrived example of an explicit use of the barrier construct.

tt0 = omp_get_wtime();
#pragma omp parallel private(tid,tt1,tt2)
{
	tid = omp_get_thread_num();
	Sleep(1000*tid);
	tt1 = omp_get_wtime() - tt0;
	printf("Thread %d: before barrier at time %4.1f\n",tid,tt1);

	#pragma omp barrier

	tt2 = omp_get_wtime() - tt0;
	printf("Thread %d: after barrier at time %4.1f\n",tid,tt2);
}

Here, in the parallel section, we get the tid ("thread identification") of each thread and we put it to sleep for a number of seconds equal to its tid; Sleep(1000) means sleep for 1000 milliseconds = 1 second. We print off the time at which each thread completes its sleep. In a real program each thread would be doing a computation, and these computations would take different amounts of time. So some threads will finish before others, and we need to synchronise all the threads before proceeding to a subsequent computation. We therefore place the #pragma omp barrier construct next, to force all threads to wait until they are all done. Then we print off each thread's time after the barrier. These times should be (very nearly) the same. This is explored in Investigation 30.
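To make the write-then-read pattern described above concrete, here is a small self-contained sketch (not from the workshop text): each thread writes one element of a shared array, the explicit barrier guarantees that every write has finished, and only then does each thread read an element written by a neighbouring thread. The array size of 64 is an arbitrary choice that simply assumes the team has no more than 64 threads.

#include <stdio.h>
#include <omp.h>

int main(void)
{
	int a[64];   // shared scratch array, one element per thread (assumes at most 64 threads)

	#pragma omp parallel shared(a)
	{
		int tid, nthr, neighbour;            // declared inside the region, so private to each thread

		tid  = omp_get_thread_num();
		nthr = omp_get_num_threads();

		a[tid] = 100 + tid;                  // write phase: each thread fills its own element

		#pragma omp barrier                  // wait until every thread has completed its write

		neighbour = (tid + 1) % nthr;        // read phase: look at another thread's element
		printf("Thread %d sees a[%d] = %d\n", tid, neighbour, a[neighbour]);
	}
	return 0;
}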