Comp1211 Investigating OpenMP Worksheet 1

Purpose
(a) To learn the basic OpenMP constructs, (b) to understand how parallel code actually executes, (c) to
experience some issues peculiar to parallel execution.
Files Required
Load the resources from the Workshops page. You will find the executable OpenMP_1.exe. Install this
executable onto your desktop, or a place of your choice. The Visual Studio 10 solution OpenMP_1.sln,
which contains the source code for the exe, is also provided.
Associated Reading and/or Background Material
- Parallel Programming with OpenMP
- Source code for this worksheet, OpenMP_1.cpp, which is part of the VS10 solution.
Notes
The code snippets below are cut-down versions of what appears in the source code file (just to make things
simpler to understand). Note that each Investigation writes a log file which mirrors what you see on the
screen. This file is written to the folder where you dropped the .exe when you select ‘0’ at the input
prompt. It overwrites the previous file.
Tasks
1
Hello World Code. Here’s the code which runs when you select Investigation 1.
#pragma omp parallel private(tid) // Start of the Parallel Section
{
    tid = omp_get_thread_num();
    printf("Thread %d: Hello World from thread = %d\n", tid, tid);
    if (tid == 0) {
        nthreads = omp_get_num_threads();
        tid = omp_get_thread_num();
        printf("\nThread %d: Number of threads = %d\n", tid, nthreads);
    }
} // End of the Parallel Section.
Don’t worry about the details yet. You should be able to identify (i) the parallel section of code, (ii)
the code which gets the thread ID (tid) and (iii) the code which prints the Hello World message.
Also there is a chunk of code which tests for the master thread and then outputs the number of
threads running.
1
Hello World Investigation. Run the investigation by double left-clicking on OpenMP_1.exe which
will open up a console window. Select Investigation 1 and the code will run. Note down in which
order the threads write their message to the console. Repeat and note the order again. What do you
notice?
Remember that the threads are running in parallel, so the order in which their messages appear is
determined by how the Windows OS schedules them. Exit the program by selecting 0 and look at the
contents of the log file, which you may like to keep safe somewhere.
2
Work Sharing Loop Code. Now let’s have a look at the main use of parallel processing, the
sharing of work in a loop. Look at the code below.
#pragma omp parallel private(tid) // start parallel section; each thread needs its own tid
{
    #pragma omp for // The following for loop is parallel
    for(i=0;i<n;i++) {
        a[i] = 2*i;
        tid = omp_get_thread_num();
        printf("Thread %d assigns a[%d] to %d\n",tid,i,a[i]);
    }
}// end parallel
Here the computation a[i] = 2*i; sets the value of the array element a[i] to two times its index, so
for n = 8 we get a[0] = 0, a[1] = 2, … a[7] = 14. But which thread gets to calculate which value? Let’s
investigate.
2
Work Sharing Loop Investigation 1. Run the executable and choose Investigation 150. It will ask
you for the length of the array a[] in the above program. Since we have 8 threads, a good starting
choice for the length is 8. Try this and note down which thread calculates which array element.
Repeat if you like. Now I suggest you try an array length less than 8. What do you notice? Now try
an array length greater than 8. What do you notice? Write down a summary in simple English.
2
Work Sharing Loop Investigation 2. Run the executable and choose Investigation 151 where I
have added some extra code to count how often each thread is used (and also plot the histogram of
thread usage). Think of some interesting array lengths to investigate (not more than 100). I suggest
you perform two investigations, (i) where the length is a multiple of 8, (ii) where the length is not a
multiple of 8. What do you notice? Write down a summary in simple English.
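To give an idea of the kind of bookkeeping Investigation 151 describes, here is a minimal sketch that counts how many loop iterations each thread performs. It is not the code from OpenMP_1.cpp: the function name share_loop, the arrays owner[] and count[], the helper my_thread() and the limit MAX_THREADS are all assumptions made for this example. The _OPENMP guard lets the same file build as ordinary serial code when OpenMP is switched off.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

#define MAX_THREADS 64  /* assumed upper bound on the team size (hypothetical) */

/* Thread ID inside a parallel region; 0 when built without OpenMP. */
static int my_thread(void)
{
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

/* Fill a[i] = 2*i in a work-shared loop. owner[i] records which thread
   handled index i, count[t] records how many iterations thread t ran.
   count[] must have room for MAX_THREADS entries. */
static void share_loop(int *a, int *owner, int *count, int n)
{
    for (int t = 0; t < MAX_THREADS; t++)
        count[t] = 0;

    #pragma omp parallel
    {
        int tid = my_thread();      /* declared inside the region, so private */

        #pragma omp for
        for (int i = 0; i < n; i++) {
            a[i] = 2 * i;
            owner[i] = tid;
            if (tid < MAX_THREADS)
                count[tid]++;       /* each thread only touches its own slot */
        }
    }
}
```

After a call such as share_loop(a, owner, count, 20), the entries of count[] add up to 20 (provided the team has no more than MAX_THREADS threads), but how the 20 iterations are split between the threads is decided by the OpenMP runtime.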
3
Loop Shared and Private Variables Code (correct). The code below correctly uses the shared and
private variables within a loop.
#pragma omp parallel private(i,tid) shared(a) // tid is also per-thread, so it is private too
{
    #pragma omp for
    for(i=0;i<n;i++) {
        a[i] = 2*i;
        tid = omp_get_thread_num();
        printf("Thread %d assigns a[%d] to %d\n",tid,i,a[i]);
    }
}// end parallel
The array a[] is identified as shared since all threads in the team will be working on the same array.
But the array index i is identified as private since each thread will be working on an individual array
element at any one time.
3
Loop Shared and Private Variables Investigation (correct). Select Investigation 15 to run this
code and check that the threads assign the correct value to each array element. The length of the
array has been fixed to 15.
4
Loop Shared and Private Variables Code (incorrect). The code below incorrectly uses the shared
and private variables within a loop.
#pragma omp parallel for private(tid,i) shared(m,n)
for(i=0;i<n;i++) {
    m = i + 1;
}
Whereas this code is correct:
#pragma omp parallel for private(tid,i,m) shared(n)
for(i=0;i<n;i++) {
    m = i + 1;
}
Spot the difference? Why do you think making m shared is a mistake? Remember that a shared
variable can be updated by any thread in the team.
4
Loop Shared and Private Variables Investigation (incorrect). Run Investigation 12. But I advise
you to open up OpenMP_1.sln in Visual Studio and find the code for the investigation first. You will
find there are two parallel sections. The first corresponds to the incorrect code given above. In the
code provided we loop 100000 times and check whether each computation has given an incorrect result;
if so, it is printed out. The second section corresponds to the correct code. Again we loop 100000
times and check for incorrect results. Only incorrect results are printed out. Run the investigation a
few times and observe the difference.
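The check described above can be sketched roughly as follows. This is only my guess at the idea, not the code in OpenMP_1.cpp: the function names are hypothetical, the reduction(+:bad) clause is simply a safe way to count failures across threads, and volatile is used in the shared version only so that the compiler really re-reads m from memory. The shared version is deliberately wrong and its result can differ from run to run.

```cpp
/* Correct: m is private, so each thread writes and checks its own copy. */
static int mismatches_private(int n)
{
    int m = 0, bad = 0;
    #pragma omp parallel for private(m) reduction(+:bad)
    for (int i = 0; i < n; i++) {
        m = i + 1;
        if (m != i + 1)     /* never true for a private m */
            bad++;
    }
    return bad;
}

/* Deliberately incorrect: m is shared, so another thread may overwrite it
   between the assignment and the check. The count is non-deterministic. */
static int mismatches_shared(int n)
{
    volatile int m = 0;     /* volatile: force a real re-read of m */
    int bad = 0;
    #pragma omp parallel for shared(m) reduction(+:bad)
    for (int i = 0; i < n; i++) {
        m = i + 1;          /* racy write to the shared variable */
        if (m != i + 1)     /* may see another thread's value */
            bad++;
    }
    return bad;
}
```

mismatches_private(100000) returns 0 every time. mismatches_shared(100000) may return 0 on one run and some positive number on the next, which is exactly why a shared m is a mistake here.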
5
Loop Shared and Lastprivate Variables Code (incorrect and correct). This investigation
concerns accessing a variable which has been calculated within a parallel section, after the parallel
section is completed. Let’s consider the following code.
// parallel region
#pragma omp parallel for private(i,m,tid) shared(n)
for(i=0;i<n;i++) {
    m = i + 1;
}
// serial region
printf("m = %d",m);
The variable m, which we need to print out after the parallel section, has been declared as private
within the parallel section. This had to be done, since if it were shared then all members of the
thread team would have been able to change it, which would have produced non-deterministic
results. However, making m private means each thread works on its own copy of m, and those copies
are discarded at the end of the parallel region. The value computed in the loop is therefore not
available in the serial region, so the above code fails to print it.
Here’s how we fix this issue, by using the lastprivate clause.
// parallel region
#pragma omp parallel for private(i,tid) lastprivate(m) shared(n)
for(i=0;i<n;i++) {
    m = i + 1;
}
// serial region
printf("m = %d",m);
This clause allows the last value of the variable m to be accessible outside the parallel region. The
last value of m is the value calculated in the sequentially last iteration of the for loop, i = (n-1),
whichever thread happens to execute that iteration.
5
Loop Shared and Lastprivate Variables Investigation (incorrect and correct). Run Investigation
14. You will see calculations of m for both the code chunks listed above. Note how the first chunk
does not assign m to the value calculated for i=7 in the parallel loop. The second chunk assigns m
correctly to the last value computed in the loop. If you wish to repeat the investigation, then you
must exit the program and restart it. (Problems with data caching methinks).
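As a small self-contained sketch of the lastprivate behaviour described above (the function name last_value is made up for this example, and it is not taken from OpenMP_1.cpp):

```cpp
/* Returns the value of m after the parallel loop. With lastprivate(m) the
   copy from the sequentially last iteration (i = n-1) is written back to
   the original m when the loop finishes. Intended for n >= 1. */
static int last_value(int n)
{
    int m = 0;
    #pragma omp parallel for lastprivate(m)
    for (int i = 0; i < n; i++) {
        m = i + 1;
    }
    return m;   /* serial region: m now holds (n-1) + 1 = n */
}
```

For an array length of 8, last_value(8) returns 8, the value computed for i = 7, no matter which thread ran that iteration.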
6
Parallel Loops The DataRace Pathology Code. Let’s look at this code which calculates the values
of array a[] based on previously initialised values:
// serial region to initialise a[]
for(i=0;i<n;i++) {
    a[i] = i;
}
#pragma omp parallel for private(tid,i) shared(a)
for(i=0;i<(n-1);i++) {
    a[i] = a[i+1] + 1;
}
The initialised array elements are a[0] = 0, a[1] = 1, a[2] = 2 and so on. The line of code a[i] =
a[i+1] + 1; indicates that the value of the array element a[i] is calculated from the value of a
different element, a[i+1]. So in the second for loop we expect to get a[0] = a[1] + 1 = 2,
a[1] = a[2] + 1 = 3 and so on.
However, in the parallel section the elements are calculated by different threads at the same time,
so a[i+1] may already have been overwritten by one thread before another thread reads it to compute
a[i]. When that happens, the sum a[i] = a[i+1] + 1; uses the new value instead of the initial one and
gives the wrong result. This is called a data race.
6
Parallel Loops The DataRace Pathology Investigation. Run Investigation 13. You may want
to look at the detailed code before you do this. The investigation will print out the values initially
assigned to the array a[]. Then it will print out the calculations of a correct serial implementation,
where each array element is correctly calculated according to a[i] = a[i+1] + 1; You should check this
is correct. Then it will print out the calculations according to the parallel code, which is incorrect.
Look at the results of the parallel code. Can you see where there are incorrect calculations? Make a
note of these.
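The worksheet does not give a fix for this data race, but one common way to remove it is to read from an untouched copy of the array, so that no thread ever reads an element another thread is writing. The sketch below assumes that approach; the function name shift_add_parallel and the temporary array old[] are my own choices for illustration.

```cpp
#include <stdlib.h>
#include <string.h>

/* Computes a[i] = (initial a[i+1]) + 1 for i = 0 .. n-2 without a data race.
   Returns 0 on success, -1 if the temporary copy cannot be allocated. */
static int shift_add_parallel(int *a, int n)
{
    if (n < 2)
        return 0;                       /* nothing to compute */

    int *old = (int *)malloc((size_t)n * sizeof *old);
    if (old == NULL)
        return -1;
    memcpy(old, a, (size_t)n * sizeof *old);   /* snapshot of the initial values */

    #pragma omp parallel for shared(a, old)
    for (int i = 0; i < n - 1; i++) {
        a[i] = old[i + 1] + 1;          /* reads only the snapshot, writes only a[i] */
    }

    free(old);
    return 0;
}
```

With a[i] = i initially and n = 8, this gives a[0] = 2, a[1] = 3, … a[6] = 8, while a[7] keeps its initial value 7. That is the same result as the serial loop. The price is the extra memory and time for the copy.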
7
The Sections Construct Code. In the above investigations we did not choose to assign computation
to a particular thread. This is the usual case, where the code cannot easily be mapped explicitly onto
a thread. However, there are some cases where this is possible, as in the example below:
#pragma omp parallel private(tid,i) shared(n,a,b)
{
    #pragma omp sections
    {
        #pragma omp section
        for(i=0;i<n;i++)
            a[i]=10;
        #pragma omp section
        for(i=0;i<n;i++) {
            b[i]=20;
        }
    } // end sections
} // end parallel
The code in each section is executed once, by a single thread from the team (though we have no
control over which thread is used). This is possible since the computations in the two sections do not
depend on each other: arrays a[] and b[] have no data dependencies.
7
The Sections Construct Investigation. Run the Investigation 8 and note how a single thread
computes the values of a[] and a different thread computes the value of b[]. Now run the
investigation several times and look at the computations. There are two interesting things that
emerge. Can you spot them?
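If you want to experiment with sections outside the provided exe, here is a minimal sketch along the same lines which also records the ID of the thread that ran each section. The function name fill_sections, the helper my_thread() and the output parameters tid_a and tid_b are assumptions made for this example, not part of OpenMP_1.cpp.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

/* Thread ID inside a parallel region; 0 when built without OpenMP. */
static int my_thread(void)
{
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

/* Fills a[] with 10 in one section and b[] with 20 in another.
   *tid_a and *tid_b receive the ID of the thread that ran each section. */
static void fill_sections(int *a, int *b, int n, int *tid_a, int *tid_b)
{
    #pragma omp parallel shared(a, b, n, tid_a, tid_b)
    {
        #pragma omp sections
        {
            #pragma omp section
            {
                for (int i = 0; i < n; i++)
                    a[i] = 10;
                *tid_a = my_thread();
            }
            #pragma omp section
            {
                for (int i = 0; i < n; i++)
                    b[i] = 20;
                *tid_b = my_thread();
            }
        }
    }
}
```

Printing *tid_a and *tid_b after several calls is a quick way to see which threads the runtime chose each time.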
8
The Critical Construct Code. Let’s have a look at this code chunk below where we add the values
in an array a[] to find their sum. The approach is to distribute the addition over the thread team
where each thread maintains its private sum “sumLocal”. Finally when the parallel for loop is done,
we add the values of the private sums to get the final sum:
#pragma omp parallel shared(n,a,sum) private(tid,i,sumLocal)
{
    sumLocal = 0; // private copies start uninitialised
    #pragma omp for
    for(i=0;i<n;i++) {
        sumLocal += a[i];
    }
    sum += sumLocal;
}
Unfortunately this won’t always work. The problem is with the line sum += sumLocal; This is
executed by every thread in the team, so the shared variable sum could be updated by more than one
thread at the same time. This is of course a data race condition. To prevent this we use, as discussed,
a critical section:
#pragma omp parallel shared(n,a,sum) private(tid,i,sumLocal)
{
    sumLocal = 0; // private copies start uninitialised
    #pragma omp for
    for(i=0;i<n;i++) {
        sumLocal += a[i];
    }
    #pragma omp critical (Agatha)
    {
        sum += sumLocal; // only one thread at a time may execute this
    }
}
8
The Critical Construct Investigation. Let’s now investigate these situations. You should have a
look at the detailed code in Visual Studio before you run Investigation 7. This investigation outputs
data from the incorrect code then the data from the correct code. In each case the array a[10]
contains 10 elements each with value 5, so the sum is 50. Run the investigations several times until
the first sum is not 50. Now look at the values each thread computes for its sum and try to find
something strange here. To help you find this, compare the thread sums for the correct code.
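For reference, here is a self-contained sketch of the critical-section version wrapped in a function, next to the same sum written with OpenMP's reduction clause, which is not covered in this worksheet but does the per-thread bookkeeping for you. The function names parallel_sum and reduction_sum are assumptions for this example only.

```cpp
/* Sum with per-thread partial sums combined inside a named critical section. */
static int parallel_sum(const int *a, int n)
{
    int sum = 0;
    #pragma omp parallel shared(a, n, sum)
    {
        int sumLocal = 0;           /* declared in the region: private, starts at 0 */

        #pragma omp for
        for (int i = 0; i < n; i++)
            sumLocal += a[i];

        #pragma omp critical (Agatha)
        {
            sum += sumLocal;        /* one thread at a time updates the shared sum */
        }
    }
    return sum;
}

/* Same result using a reduction: each thread gets a private sum initialised
   to 0, and the runtime adds the private sums together at the end. */
static int reduction_sum(const int *a, int n)
{
    int sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

With ten elements each of value 5, both functions return 50 on every run.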
9
The Barrier Construct Code. The code snippet below shows the barrier in operation:
#pragma omp parallel private(tid,tt1,tt2) // each thread keeps its own times
{
    tid = omp_get_thread_num();
    Sleep(1000*tid); // Windows Sleep() takes milliseconds
    tt1 = omp_get_wtime() - tt0;
    printf("Thread %d: before barrier at time %4.1f\n",tid,tt1);
    #pragma omp barrier
    tt2 = omp_get_wtime() - tt0;
    printf("Thread %d: after barrier at time %4.1f\n",tid,tt2);
}
Each thread is put to sleep for tid seconds, then it prints out the time before it hits the barrier. Then it
prints out its time when it has passed the barrier.
9
The Barrier Construct Investigation. Run Investigation 30 and look at the time each thread hits
the barrier and the time that each thread leaves the barrier. What does this tell you about the
behaviour of the barrier and of each thread? Hint: If thread A hits the barrier at 2 secs and thread B
hits the barrier at 4 secs, but both leave at 4 secs, then what has A been doing in the meantime?
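A way to check the barrier without timing anything is sketched below: every thread sets a flag before the barrier, and after the barrier every thread looks at all the flags. This is my own illustration, not code from OpenMP_1.cpp; the function name, the helpers my_thread() and team_size(), and the limit MAX_THREADS are assumptions.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

#define MAX_THREADS 64  /* assumed upper bound on the team size (hypothetical) */

/* Thread ID and team size; 0 and 1 when built without OpenMP. */
static int my_thread(void)
{
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

static int team_size(void)
{
#ifdef _OPENMP
    return omp_get_num_threads();
#else
    return 1;
#endif
}

/* Counts how often a thread, once past the barrier, finds a team member
   whose "arrived" flag is still unset. */
static int threads_missing_after_barrier(void)
{
    int arrived[MAX_THREADS] = {0};
    int missing = 0;

    #pragma omp parallel shared(arrived) reduction(+:missing)
    {
        int tid = my_thread();
        int team = team_size();

        if (tid < MAX_THREADS)
            arrived[tid] = 1;       /* work done before the barrier */

        #pragma omp barrier         /* wait until the whole team gets here */

        for (int t = 0; t < team && t < MAX_THREADS; t++)
            if (!arrived[t])
                missing++;
    }
    return missing;
}
```

Because no thread leaves the barrier until the whole team has reached it, the function returns 0. Removing the barrier line would allow a fast thread to find flags that are not yet set.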