Comp3104 The Nature of Computing
Worksheet 1

Purpose
(a) To learn the basic OpenMP constructs, (b) to understand how parallel code actually executes, (c) to experience some issues peculiar to parallel execution.

Files Required
Code: Executable OpenMP_1.exe. Install this executable onto your desktop, or a place of your choice. This worksheet will make use of this executable. The Visual Studio 10 solution OpenMP_1.sln is also provided.

Associated Reading and/or Background Material
- Parallel Programming with OpenMP

Source Code for this worksheet
OpenMP_1.cpp, which is part of the VS10 solution.

Notes
The code snippets below are cut-down versions of the code which appears in the source file (just to make things simpler to understand). Comments have been added where printing to screen and the log file occurs. Each Investigation will also write a log file which mirrors what you see on the screen. This file is created in the folder where you dropped the .exe. The file is written to disk when you select '0' at the input prompt, and it over-writes the previous file.

Tasks

1 Hello World Code.
Here's the code which runs when you select Investigation 1.

    #pragma omp parallel private(tid)   // Start of the Parallel Section
    {
        tid = omp_get_thread_num();
        printf("Thread %d: Hello World from thread = %d\n", tid, tid);
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            tid = omp_get_thread_num();
            printf("\nThread %d: Number of threads = %d\n", tid, nthreads);
        }
    }   // End of the Parallel Section

Don't worry about the details yet. You should be able to identify (i) the parallel section of code, (ii) the code which gets the thread ID (tid) and (iii) the code which prints the Hello World message. There is also a chunk of code which tests for the master thread and then outputs the number of threads running.

1 Hello World Investigation.
Run the investigation by double left-clicking on OpenMP_1.exe, which will open up a console window. Select Investigation 1 and the code will run. Note down the order in which the threads write their message to the console. Repeat and note the order again. What do you notice? Remember that the threads are running in parallel, so the order you see them printed out is actually determined by the Windows OS. Exit the program by selecting 0 and look at the contents of the log file, which you may like to keep safe somewhere.

2 Work Sharing Loop Code.
Now let's have a look at the main use of parallel processing, the sharing of work in a loop. Look at the code below.

    #pragma omp parallel    // start parallel section
    {
        #pragma omp for     // the following for loop is parallel
        for (i = 0; i < n; i++)
        {
            a[i] = 2*i;
            tid = omp_get_thread_num();
            printf("Thread %d assigns a[%d] to %d\n", tid, i, a[i]);
        }
    }   // end parallel

Here the computation a[i] = 2*i; sets the value of the array element a[i] to two times its index, so we get a[0] = 0, a[1] = 2, ... a[7] = 14. But which thread gets to calculate which value? Let's investigate.

2 Work Sharing Loop Investigation 1.
Run the executable and choose Investigation 150. It will ask you for the length of the array a[] in the above program. Since we have 8 threads, a good starting choice for the length is 8. Try this and note down which thread calculates which array element. Repeat if you like. Now I suggest you try an array length less than 8. What do you notice? Now try an array length greater than 8. What do you notice? Write down a summary in simple English.
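If you would like to experiment beyond what the executable offers, here is a minimal, self-contained sketch of the work-sharing loop above. This is not the worksheet's OpenMP_1.cpp: the array length n and the build commands are assumptions you can adapt. Build with, e.g., cl /openmp from a Visual Studio command prompt, or gcc -fopenmp elsewhere.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int a[100];          /* big enough for the lengths suggested above */
        int i, tid;
        int n = 8;           /* try lengths below and above the thread count */

        #pragma omp parallel private(i, tid) shared(a, n)
        {
            #pragma omp for
            for (i = 0; i < n; i++)
            {
                a[i] = 2 * i;                   /* each element is twice its index */
                tid = omp_get_thread_num();     /* which thread ran this iteration? */
                printf("Thread %d assigns a[%d] to %d\n", tid, i, a[i]);
            }
        }
        return 0;
    }

Varying n lets you reproduce the behaviour of Investigation 150 for lengths below, equal to, and above the number of threads.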
2 Work Sharing Loop Investigation 2.
Run the executable and choose Investigation 151, where I have added some extra code to count how often each thread is used (and also to plot a histogram of thread usage). Think of some interesting array lengths to investigate (not more than 100). I suggest you perform two investigations: (i) where the length is a multiple of 8, (ii) where the length is not a multiple of 8. What do you notice? Write down a summary in simple English.

3 Loop Shared and Private Variables Code (correct).
The code below correctly uses the shared and private variables within a loop.

    #pragma omp parallel private(i, tid) shared(a)
    {
        #pragma omp for
        for (i = 0; i < n; i++)
        {
            a[i] = 2*i;
            tid = omp_get_thread_num();
            printf("Thread %d assigns a[%d] to %d\n", tid, i, a[i]);
        }
    }   // end parallel

The array a[] is identified as shared, since all threads in the team will be working on the same array. But the array index i is identified as private, since each thread will be working on an individual array element at any one time. (The thread ID tid is private for the same reason: each thread needs its own copy.)

3 Loop Shared and Private Variables Investigation (correct).
Select Investigation 15 to run this code and check that the threads assign the correct value to each array element. The length of the array has been fixed to 15.

4 Loop Shared and Private Variables Code (incorrect).
The code below incorrectly uses the shared and private variables within a loop.

    #pragma omp parallel for private(tid, i) shared(m, n)
    for (i = 0; i < n; i++)
    {
        m = i + 1;
    }

Whereas this code is correct:

    #pragma omp parallel for private(tid, i, m) shared(n)
    for (i = 0; i < n; i++)
    {
        m = i + 1;
    }

Spot the difference? Why do you think making m shared is a mistake? Remember that a shared variable can be updated by any member of the thread team.

4 Loop Shared and Private Variables Investigation (incorrect).
Run Investigation 12. But I advise you first to open up OpenMP_1.sln in Visual Studio and find the code for the investigation. You will find there are two parallel sections. The first corresponds to the incorrect code given above. In the code provided we loop 100000 times and check whether the computation has given an incorrect result; if so, it is printed out. The second section corresponds to the correct code. Again we loop 100000 times and check for incorrect results. Only incorrect results are printed out. Run the investigation a few times and observe the difference.

5 Loop Shared and Lastprivate Variables Code (incorrect and correct).
This investigation concerns accessing a variable which has been calculated within a parallel section, after the parallel section is completed. Let's consider the following code.

    // parallel region
    #pragma omp parallel for private(i, m, tid) shared(n)
    for (i = 0; i < n; i++)
    {
        m = i + 1;
    }
    // serial region
    printf("m = %d", m);

The variable m, which we need to print out after the parallel section, has been declared as private within the parallel section. This had to be done, since if it were shared then all members of the thread team would have been able to change it, which would have produced non-deterministic results. However, making m private means we cannot access it outside the parallel region. So the above code would fail. Here's how we fix this issue, by using the lastprivate clause.

    // parallel region
    #pragma omp parallel for private(i, tid) lastprivate(m) shared(n)
    for (i = 0; i < n; i++)
    {
        m = i + 1;
    }
    // serial region
    printf("m = %d", m);

This clause makes the last value of the variable m accessible outside the parallel region. The last value of m is of course the last value calculated in the for loop, for i = (n-1).
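To see lastprivate in action outside the provided program, here is a minimal compilable sketch. The variable names follow the snippet above; the value of n and the initial value of m are assumptions.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int i, m = -1;       /* -1 so we can tell whether the loop's value came back */
        int n = 8;           /* assumed length, matching the worksheet's 8 threads */

        /* lastprivate(m) copies m from the sequentially last iteration,
           i = n-1, back into the serial region after the loop. */
        #pragma omp parallel for private(i) lastprivate(m) shared(n)
        for (i = 0; i < n; i++)
        {
            m = i + 1;
        }

        printf("m = %d\n", m);   /* prints 8 when n = 8 */
        return 0;
    }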
5 Loop Shared and Lastprivate Variables Investigation (incorrect and correct).
Run Investigation 14. You will see calculations of m for both the code chunks listed above. Note how the first chunk does not assign m to the value calculated for i=7 in the parallel loop. The second chunk assigns m correctly to the last value computed in the loop. If you wish to repeat the investigation, then you must exit the program and restart it. (Problems with data caching, methinks.)

6 Parallel Loops: The Data Race Pathology Code.
Let's look at this code, which calculates the values of array a[] based on previously initialised values.

    // serial region to initialise a[]
    for (i = 0; i < n; i++)
    {
        a[i] = i;
    }

    #pragma omp parallel for private(tid, i) shared(a)
    for (i = 0; i < (n-1); i++)
    {
        a[i] = a[i+1] + 1;
    }

The initialised array elements are a[0] = 0, a[1] = 1, a[2] = 2 and so on. The line of code a[i] = a[i+1] + 1; says that the value of the array element a[i] is calculated from the value of a different element, a[i+1]. So in the second for loop we expect to get a[0] = a[1] + 1 = 2, a[1] = a[2] + 1 = 3, and so on. However, in the parallel section all elements are being calculated in parallel: one thread may write a[i+1] at the same moment another thread reads it, so the assignment a[i] = a[i+1] + 1; may pick up a value of a[i+1] that has already been overwritten. This is called a data race.

6 Parallel Loops: The Data Race Pathology Investigation.
Run Investigation 13. You may want to look at the detailed code before you do this. The investigation will print out the values initially assigned to the array a[]. Then it will print out the calculations of a correct serial implementation, where each array element is correctly calculated according to a[i] = a[i+1] + 1; You should check this is correct. Then it will print out the calculations according to the parallel code, which is incorrect. Look at the results of the parallel code. Can you see where there are incorrect calculations? Make a note of these.

7 The Sections Construct Code.
In the above investigations we did not choose to assign computation to a particular thread. This is the usual case, where the code cannot easily be explicitly mapped onto a thread. However, there are some cases where this is possible, as in the example below.

    #pragma omp parallel private(tid, i) shared(n, a, b)
    {
        #pragma omp sections
        {
            #pragma omp section
            for (i = 0; i < n; i++)
                a[i] = 10;

            #pragma omp section
            for (i = 0; i < n; i++)
            {
                b[i] = 20;
            }
        }
    }   // end parallel

The code in each section will be assigned to a different thread (though we have no control over which threads will be used). This is possible since the computations in each section are not dependent on each other: arrays a[] and b[] have no data dependencies.

7 The Sections Construct Investigation.
Run Investigation 8 and note how a single thread computes the values of a[] and a different thread computes the values of b[]. Now run the investigation several times and look at the computations. There are two interesting things that emerge. Can you spot them?
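To experiment with sections outside the executable, here is a minimal, self-contained sketch. It is not the worksheet's source; the array length N and the printed messages are assumptions added so you can see which thread ran which section.

    #include <stdio.h>
    #include <omp.h>

    #define N 5              /* assumed array length */

    int main(void)
    {
        int a[N], b[N];
        int i, tid;

        #pragma omp parallel private(tid, i) shared(a, b)
        {
            #pragma omp sections
            {
                #pragma omp section       /* one thread fills a[] ... */
                {
                    tid = omp_get_thread_num();
                    for (i = 0; i < N; i++) a[i] = 10;
                    printf("Thread %d filled a[]\n", tid);
                }
                #pragma omp section       /* ... and (usually) another fills b[] */
                {
                    tid = omp_get_thread_num();
                    for (i = 0; i < N; i++) b[i] = 20;
                    printf("Thread %d filled b[]\n", tid);
                }
            }   /* implicit barrier at the end of the sections construct */
        }
        return 0;
    }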
8 The Critical Construct Code.
Let's have a look at the code chunk below, where we add the values in an array a[] to find their sum. The approach is to distribute the addition over the thread team, where each thread maintains its private sum sumLocal. Finally, when the parallel for loop is done, we add the values of the private sums to get the final sum:

    #pragma omp parallel shared(n, a, sum) private(tid, i, sumLocal)
    {
        sumLocal = 0;   // each thread's private sum starts at zero
        #pragma omp for
        for (i = 0; i < n; i++)
        {
            sumLocal += a[i];
        }
        sum += sumLocal;
    }

Unfortunately this won't always work. The problem is with the line sum += sumLocal; This will be executed by every member of the thread team, so the variable sum could be updated by more than one thread at the same time. This is of course a data race condition. To prevent this we use, as discussed, a critical section:

    #pragma omp parallel shared(n, a, sum) private(tid, i, sumLocal)
    {
        sumLocal = 0;
        #pragma omp for
        for (i = 0; i < n; i++)
        {
            sumLocal += a[i];
        }
        #pragma omp critical (Agatha)
        {
            sum += sumLocal;
        }
    }

8 The Critical Construct Investigation.
Let's now investigate these situations. You should have a look at the detailed code in Visual Studio before you run Investigation 7. This investigation outputs data from the incorrect code and then from the correct code. In each case the array a[10] contains 10 elements, each with value 5, so the sum is 50. Run the investigation several times until the first sum is not 50. Now look at the values each thread computes for its sum and try to find something strange here. To help you find this, compare the thread sums for the correct code.

9 The Barrier Construct Code.
The code snippet below shows the barrier in operation.

    #pragma omp parallel private(tid, tt1, tt2)
    {
        tid = omp_get_thread_num();
        Sleep(1000*tid);
        tt1 = omp_get_wtime() - tt0;
        printf("Thread %d: before barrier at time %4.1f\n", tid, tt1);
        #pragma omp barrier
        tt2 = omp_get_wtime() - tt0;
        printf("Thread %d: after barrier at time %4.1f\n", tid, tt2);
    }

Each thread is put to sleep for tid seconds, then it prints out the time before it hits the barrier. Then it prints out its time when it has passed the barrier.

9 The Barrier Construct Investigation.
Run Investigation 30 and look at the time each thread hits the barrier and the time that each thread leaves the barrier. What does this tell you about the behaviour of the barrier and of each thread? Hint: if thread A hits the barrier at 2 secs and thread B hits the barrier at 4 secs, but both leave at 4 secs, what has A been doing in the meantime?
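If you want to run the barrier snippet yourself, here is a minimal compilable sketch. It assumes a Windows build, as in the worksheet's environment (Sleep() comes from windows.h); on POSIX systems you would swap in sleep() from unistd.h.

    #include <stdio.h>
    #include <omp.h>
    #include <windows.h>     /* for Sleep(); on POSIX use sleep() from unistd.h */

    int main(void)
    {
        double tt0 = omp_get_wtime();   /* common start time for all threads */

        #pragma omp parallel
        {
            int tid = omp_get_thread_num();

            Sleep(1000 * tid);          /* thread tid sleeps for tid seconds */
            printf("Thread %d: before barrier at time %4.1f\n",
                   tid, omp_get_wtime() - tt0);

            #pragma omp barrier         /* nobody proceeds until all have arrived */

            printf("Thread %d: after barrier at time %4.1f\n",
                   tid, omp_get_wtime() - tt0);
        }
        return 0;
    }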