
Parallel Programming with OpenMP
[CBPrice Draft4 31-01-12]
Introduction
OpenMP is not a new programming language; rather, it is a set of directives which can be inserted into a
C/C++ or Fortran program. These directives instruct the compiler to produce an executable which
the Operating System will share amongst the multiple processors present on a machine, so that the program
will be executed on multiple processors. For example, suppose we have an array of numbers a[i] of length
ten and our program will double all the elements. (Here i is the ‘index’ into the array.) If we have two
processors, then the parallel program will assign the computations on elements a[0],a[1],a[2], a[3] and a[4]
to processor A and the computations on elements a[5],a[6],a[7],a[8] and a[9] to processor B. Both
processors run in parallel, so the work is shared.
Each parallel strand of computation is called a thread. So in the above example, there are two threads of
execution: one runs on processor A and the other runs on processor B.
OpenMP uses a shared-memory model, in which all threads have access to a common memory. It provides the
programmer with the ability to (i) create a team of threads to run on the processors of the machine, (ii)
specify how work is to be shared between the members of the team, (iii) designate variables as shared or
private from the point of view of each thread (don't worry, this will become clearer later, trust me!), and (iv)
prevent the execution of one thread from spoiling the computations of another thread.
In this text we aim to introduce the basics of OpenMP. We shall look at the following set of constructs and
provide links to the associated workshop activities:
(i) The parallel construct.
(ii) Work-sharing constructs such as loop, sections, and single.
(iii) Scheduling of parallel processing.
(iv) The barrier construct.
(v) The critical construct.
In the examples below, we start with sequential code designed to run on a single-processor machine; you
will be familiar with this style of code from your programming experience. We then give the parallelised
version. We hope that by comparing the two you will learn how to parallelise your own code.
The Parallel Construct
Let’s start with a traditional programming exercise, to print out “Hello World”. Here’s a C program to do
this using sequential code.
int _tmain(int argc, _TCHAR* argv[]) {
    printf("Hello World\n");
}
Now the equivalent parallel program uses the construct #pragma omp parallel to create a number of
threads and to get each of these to execute the print statement. Here's the parallel program. Note that all the
code between the braces following the construct will be executed in parallel; that means that the "Hello
World" text will be printed out a number of times, once for each thread created on your
machine.
int _tmain(int argc, _TCHAR* argv[]) {
    #pragma omp parallel
    {
        printf("Hello World\n");
    }
}
This exercise is explored in the workshop activity Investigation 1.
It is useful to know which processor or thread is executing a statement at a particular time. This can be done
using a call to the OpenMP function omp_get_thread_num(). Let's add this to the above code. We write
this:
int _tmain(int argc, _TCHAR* argv[]) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        printf("Thread %d says Hello World\n", tid);
    }
}
Here the variable tid is the ‘thread id’, and tells you which thread is actually printing out the “Hello World”
message. This is incorporated into the workshop activity Investigation 1. As you will find when you run this
Investigation several times, the order of execution differs from run to run, i.e., the messages from the
different threads appear in a different order each time. This 'non-deterministic' element of parallel execution
is something which you must understand.
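If you also want to know how many threads are in the team, the standard OpenMP call omp_get_num_threads() returns the team size. Below is a minimal sketch which combines it with omp_get_thread_num(); it assumes a plain main() entry point and the <stdio.h> and <omp.h> headers rather than the Visual Studio _tmain wrapper used above.

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        int tid  = omp_get_thread_num();   // id of this thread (0, 1, 2, ...)
        int nthr = omp_get_num_threads();  // total number of threads in the team
        printf("Thread %d of %d says Hello World\n", tid, nthr);
    }
    return 0;
}

Because tid and nthr are declared inside the parallel block, each thread automatically gets its own copy of them.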
Sharing Work – The Loop Construct.
Sharing work between threads is the most fundamental of OpenMP's capabilities, since it allows your desired
computation to be distributed between a team of threads and so obtain a speedup, reducing the total
computation time. One important case is sharing the work involved in a loop. Look at the following code,
which initialises an array to particular values (2*i) using a serial program.
int _tmain(int argc, _TCHAR* argv[]) {
    int n;     // This is the length of the array
    int i;     // This is the index into the array
    int m;     // This is the value to be assigned to all array elements
    int a[10]; // This is the array of length 10

    n = 10;
    // Initialise all 10 elements of the array with value 2*i
    for(i=0;i<n;i++) {
        a[i] = 2*i;
    }
}
This is straightforward to understand. First we declare the variables n, i, m and a which we shall use, then
we define the value of n. Finally, we do the computation within the for loop, which does nothing other than
assign each element of array a the value 2*i. The parallelised program is shown below.
This program will distribute the computation of the array values a[i] over the threads we have available. For
example, if we have two threads then each thread will execute half the computations.
int _tmain(int argc, _TCHAR* argv[]) {
    int n;     // This is the length of the array
    int i;     // This is the index into the array
    int m;     // This is the value to be assigned to all array elements
    int a[10]; // This is the array of length 10

    n = 10;
    #pragma omp parallel
    {
        #pragma omp for
        for(i=0;i<n;i++) {
            a[i] = 2*i;
        }
    } // end parallel
} // end main
Here the first directive #pragma omp parallel identifies a block of code which may be parallelised, and
the second directive #pragma omp for tells the compiler that it is the for loop which is to be parallelised.
When this code is executed, the loop iterations will be divided between the threads. Investigation 150/151
will explore this.
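To see the division of work for yourself, you can print the thread id inside the loop. The sketch below is a complete, minimal program (assuming a plain main() and the <stdio.h>/<omp.h> headers); each line of output shows which thread computed which element.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int i;
    int n = 10;
    int a[10];

    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < n; i++) {
            a[i] = 2*i;
            // report which thread handled this iteration
            printf("Thread %d computed a[%d] = %d\n", omp_get_thread_num(), i, a[i]);
        }
    }
    return 0;
}

With two threads you would typically see one thread handling i = 0..4 and the other i = 5..9, though the exact split is up to the implementation.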
Loop Construct – Shared and Private Variables
While the above parallelised code is correct and it works, you are not advised to write code like this. It is too
simplistic and makes some assumptions. We are assuming that each thread gets its own copy of the variable i,
so that no two threads share the same i and end up updating the same array element a[i], which could
cause a problem. We are also assuming that every thread sees the same array a[] and not a separate copy. To
make these assumptions explicit, we add two "clauses" to the #pragma omp parallel directive: private(i),
which gives each thread its own copy of i, and shared(a), which ensures that all threads see a common array
a[] within the parallel region. Here's the correct program.
int _tmain(int argc, _TCHAR* argv[]) {
    int n;     // This is the length of the array
    int i;     // This is the index into the array
    int m;     // This is the value to be assigned to all array elements
    int a[10]; // This is the array of length 10

    n = 10;
    #pragma omp parallel private(i) shared(a)
    {
        #pragma omp for
        for(i=0;i<n;i++) {
            a[i] = 2*i;
        }
    } // end parallel
} // end main
This will be explored in Investigation 15. An incorrect use of shared and private will be shown in
Investigation 12.
An Aside: Combined Work Sharing Constructs
You may have noticed in the parallel code example above that the loop construct #pragma omp for
was contained inside the parallel construct #pragma omp parallel. As we stated in the introductory
material, the OpenMP approach aims to parallelise loops, so it may not come as a surprise that the two
constructs can be combined into one, which simplifies the writing of code and therefore reduces the
chance of errors. The above example may be condensed as follows.
n = 10;
#pragma omp parallel for private(i) shared(a)
for(i=0;i<n;i++) {
a[i] = 2*i;
}
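For completeness, here is the condensed example as a full, minimal program (again assuming a plain main() and the standard headers), with a serial loop at the end that prints the array so you can check the result.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int i;
    int n = 10;
    int a[10];

    // One combined directive creates the team and shares out the loop
    #pragma omp parallel for private(i) shared(a)
    for (i = 0; i < n; i++) {
        a[i] = 2*i;
    }

    // Serial check of the result
    for (i = 0; i < n; i++)
        printf("a[%d] = %d\n", i, a[i]);
    return 0;
}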
Loop Construct – lastprivate and firstprivate clauses.
Let's consider this parallel code block, which distributes the loop iterations (and hence the computation of m)
over the threads available. This is perfectly good code. If m had been declared in a shared clause, then
multiple threads would have attempted to update its value in an uncontrolled manner, with non-deterministic
results.
n = 10;
#pragma omp parallel for private(i,m) shared(n)
for(i=0;i<n;i++) {
m = i + 1;
}
k = 2*m;
But what about the serial statement k = 2*m outside the parallel block: which value of m does it get from the
parallel block? The answer is that we cannot know; the value of m outside the parallel region is undefined,
since we declared it as private. Fortunately OpenMP provides a clause which allows us to access the last
value taken by a private variable in a parallel loop. Here's the syntax.
n = 10;
#pragma omp parallel for private(i) lastprivate(m) shared(n)
for(i=0;i<n;i++) {
m = i + 1;
}
k = 2*m;
Here the lastprivate(m) clause makes the value of m from the sequentially last loop iteration (i = n-1)
available after the parallel region, so the statement k = 2*m sees a well-defined value. This is explored in
Investigation 14.
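The firstprivate clause mentioned in the heading of this section works in the opposite direction: each thread receives its own private copy of the variable, but that copy starts out with the value the variable held before the parallel region (a plain private copy would start uninitialised). The following is a minimal sketch under the same assumptions as before (plain main(), standard headers); the variable offset is introduced purely for illustration.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int i;
    int n = 10;
    int offset = 100;   // value set before the parallel region
    int a[10];

    // Each thread's private copy of offset is initialised to 100
    #pragma omp parallel for private(i) firstprivate(offset) shared(a)
    for (i = 0; i < n; i++) {
        a[i] = offset + i;   // uses the copied-in value of offset
    }

    for (i = 0; i < n; i++)
        printf("a[%d] = %d\n", i, a[i]);
    return 0;
}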
Issues with Parallel Loops – The Data Race Pathology.
You may feel that you are becoming confident in your understanding of how to parallelise loops.
Nevertheless there are some subtle issues which you should know about or, stated differently, there are traps
you should not fall into. Here's one: consider this code segment, in purely sequential code.
n = 10;
// initialise array
for(i=0;i<n;i++)
    a[i] = i;
// compute new values of the array (stop at n-1 so that a[i+1] stays in bounds)
for(i=0;i<n-1;i++)
    a[i] = a[i+1] + 1;
Nothing wrong with this. We first initialise the array, so that a[0] = 0, a[1]=1 and so on, and then we
compute the new values of the array, so that a[0] becomes a[1] + 1 which is 2 and so on. There is absolutely
no problem here.
Now let’s look at a naive (and incorrect) attempt to parallelise this, which will fail.
n = 10;
// initialise array (serial section)
for(i=0;i<n;i++)
    a[i] = i;
// compute new values of the array (parallel section)
#pragma omp parallel for private(tid,i) shared(a,n)
for(i=0;i<n-1;i++) {
    a[i] = a[i+1] + 1;
}
Here we have indicated that a[] is a shared variable (as in the examples above) so that all parallel
threads may work on this array together. But there is a problem. The line of code a[i] = a[i+1] + 1;
says that the value of a[i] depends on the value of a[i+1]. But the iterations are being executed in parallel,
so the thread computing a[i] may read a[i+1] either before or after another thread has updated it. In other
words, there is no way of knowing which values are correct. This situation is known as a data race. The
point is that not all code can be parallelised as it stands. A data race is explored in Investigation 13.
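One common way round this particular race, sketched below rather than taken from the workshop material, is to write the results into a second array b[] so that no thread ever reads an element which another thread is writing. The usual assumptions apply (plain main(), standard headers, a fixed array length).

#include <stdio.h>
#include <omp.h>

int main(void) {
    int i;
    int n = 10;
    int a[10], b[10];

    // serial initialisation
    for (i = 0; i < n; i++)
        a[i] = i;

    // Each iteration only reads a[] and writes its own element of b[],
    // so the iterations are independent and safe to run in parallel.
    #pragma omp parallel for private(i) shared(a,b,n)
    for (i = 0; i < n-1; i++) {
        b[i] = a[i+1] + 1;
    }
    b[n-1] = a[n-1];   // last element: there is no a[n], so just copy it

    for (i = 0; i < n; i++)
        printf("b[%d] = %d\n", i, b[i]);
    return 0;
}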
The ‘Sections’ Construct.
When constructing a program which distributes its computational load over parallel processors, there are
cases where the split is fairly obvious. Consider the sequential program below, which needs to initialise the
values of an array a[] and, independently, the values of a second array b[].
n = 10;
for(i=0;i<n;i++)
a[i] = 10;
for(i=0;i<n;i++)
b[i] = 20;
Both of these initialisations are independent, so it makes sense to assign them to separate threads which can
run at the same time. The sections construct allows us to do this. Here's how we would parallelise these
initialisations. Each section marks a block of code which can be executed independently, without reference
to variables within the other section. Let's have a look.
n = 10;
#pragma omp parallel private(tid,i) shared(n,a,b)
{
    #pragma omp sections
    {
        #pragma omp section
        for(i=0;i<n;i++)
            a[i]=10;
        #pragma omp section
        for(i=0;i<n;i++)
            b[i]=20;
    } // sections
} // parallel
Here the first section initialises the values of the array a[] and the second section initialises the values of the
array b[]. There are no dependencies between a[] and b[], i.e. we have no statement like a[i] = 2*b[i], so we
can be sure that the execution of the code within the two sections is independent.
In this example the compiler will assign the processing of the first section to a dedicated thread and the
processing of the second section to another dedicated thread, so the code will run in parallel using two
threads. Of course, if our system has more than two processors, then the additional processors will be idle
(or else dealing with your Facebook or email traffic). You may think from the above code that one section
will be completed by one thread before the second thread starts work. This is not so: both threads may work
at the same time, so the computations will be interleaved, or perhaps one thread will happen to complete its
computations before the other. Only experiment will establish what actually happens; this is explored in
Investigation 8.
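Just as the parallel and for constructs were combined earlier, the parallel and sections constructs can be combined into a single #pragma omp parallel sections directive. A minimal sketch, under the same assumptions as the earlier sketches:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int i;
    int n = 10;
    int a[10], b[10];

    #pragma omp parallel sections private(i) shared(a,b)
    {
        #pragma omp section
        for (i = 0; i < n; i++)   // first independent block
            a[i] = 10;

        #pragma omp section
        for (i = 0; i < n; i++)   // second independent block
            b[i] = 20;
    }

    printf("a[0] = %d, b[0] = %d\n", a[0], b[0]);
    return 0;
}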
The Single Construct
Often within a parallel region we need some code to be executed by a single thread only. A good example is
the initialisation of a shared variable. If we let many threads try to initialise the same variable, there is a
chance that it will be incorrectly initialised; this is due to details of how the CPU actually writes to memory,
which is beyond our work at the moment. We therefore need a method to ensure that the variable is correctly
initialised, and this will always succeed if we use a single thread. That is the purpose of the single construct.
Look at the code below.
n = 10;
#pragma omp parallel shared(m,b) private(i)
{
    #pragma omp single
    {
        m = 35;
    }
    #pragma omp for
    for(i=0;i<n;i++) {
        b[i]=m;
    }
}
Note that the entire code is contained in a parallel region. Variables m and b are shared. In the parallel for
loop, the array b[] is initialised to the value of the variable m. This variable needs to be initialised first, and
this happens within the single block. When this code runs, a single thread will assign the value 35 to the
variable m (the other threads wait at the end of the single block before carrying on), then the parallel threads
will initialise the elements of the array b[] with the value of m.
The Critical Construct
This construct ensures that multiple threads do not attempt to update the same shared variable in memory at
the same time, which could cause erroneous results. As an illustration, let's attempt to parallelise the serial
loop shown below, where we sum the elements of an array.
n=10;
// All the following is serial
// initialise array a[];
for(i=0;i<n;i++)
a[i]=5;
// sum the array
sum = 0;
for(i=0;i<n;i++)
sum += a[i];
There's nothing wrong with this program; it will work fine and give us the correct value for sum. Now,
here's an incorrect attempt to parallelise this computation.
n = 10;
sum = 0;
#pragma omp parallel shared(n,a,sum) private(sumLocal)
{
    sumLocal = 0;
    #pragma omp for
    for(i=0;i<n;i++)
        sumLocal += a[i];
    sum += sumLocal;
}
Remember that the code within the parallel block will be distributed amongst all the threads we have
available. So the idea is that the threads within the parallel for loop divide the summation of the array values
between them, each storing its partial sum in the variable sumLocal, one copy per thread. The partial sums
are then added together in the final sum statement. This sounds correct, but in that final statement several
threads may attempt to change the value of sum at the same time, since they are operating in parallel. This is
a data race and may lead to erroneous results. To prevent it, we define a critical region of code which stops
multiple threads from updating a shared variable simultaneously, in this case sum. Here's how the correct
code appears.
n = 10;
sum = 0;
#pragma omp parallel shared(n,a,sum) private(sumLocal)
{
    sumLocal = 0;
    #pragma omp for
    for(i=0;i<n;i++)
        sumLocal += a[i];
    #pragma omp critical (name)
    {
        sum += sumLocal;
    }
}
Here 'name' can be any identifier; it is used to label a particular critical region, since in serious code there
could be more than one. So now, within the parallel region, the summation of a[] is distributed amongst the
threads, and each thread computes a partial sum, stored in sumLocal (private to each thread). When the
threads have finished their summations and each has the correct value of its individual sum, sumLocal, they
enter the critical region, where only one thread at a time is allowed to update the value of sum. This is
explored in Investigation 7.
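To tie the pieces together, here is a complete, minimal version of the summation program (plain main(), standard headers; the critical region name sum_update is just an arbitrary choice). Note that declaring sumLocal inside the parallel block makes it private to each thread automatically, a slight variation on the private(sumLocal) clause used above.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int i;
    int n = 10;
    int a[10];
    int sum = 0;

    // serial initialisation of the array
    for (i = 0; i < n; i++)
        a[i] = 5;

    #pragma omp parallel shared(n,a,sum) private(i)
    {
        int sumLocal = 0;   // one partial sum per thread

        #pragma omp for
        for (i = 0; i < n; i++)
            sumLocal += a[i];

        // only one thread at a time may add its partial sum to sum
        #pragma omp critical (sum_update)
        {
            sum += sumLocal;
        }
    }

    printf("sum = %d (expected %d)\n", sum, 5*n);
    return 0;
}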
The Barrier Construct
A barrier is placed in a parallel program where we need a team of threads to wait for each other before
proceeding. For example, a team of threads may be assigned to perform some calculations and write to a
shared variable (such as an array), and the team will also be assigned to read the shared variable and perform
some further calculations. Clearly the writes to the array need to be complete before the reads, otherwise we
would have a data race. The barrier would be placed between the writes and the reads, to ensure that all
writes are completed before the reads commence.
Most of the OpenMP constructs have an implied barrier, so that the compiler automatically inserts a barrier
at the end of the construct. You will not normally have to do this explicitly, but it is important to know that it
is going on. The code below shows a rather contrived example of an explicit use of the barrier construct.
tt0 = omp_get_wtime();
#pragma omp parallel private(tid,tt1,tt2)
{
    tid = omp_get_thread_num();
    Sleep(1000*tid);
    tt1 = omp_get_wtime() - tt0;
    printf("Thread %d: before barrier at time %4.1f\n",tid,tt1);
    #pragma omp barrier
    tt2 = omp_get_wtime() - tt0;
    printf("Thread %d: after barrier at time %4.1f\n",tid,tt2);
}
Here in the parallel section we get the tid ("thread identification") of each thread and put it to sleep for a
number of seconds equal to its tid; Sleep(1000) means sleep for 1000 milliseconds, i.e. one second. We print
the time at which each thread completes its sleep. In a real program each thread would be doing a
computation taking a different amount of time, so some threads would finish before others and we would
need to synchronise all threads before proceeding to a subsequent computation. So we place the
#pragma omp barrier construct next, to force all threads to wait until they are all done. Then we print
each thread's time after the barrier; these times should be the same. This is explored in Investigation 30.
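Finally, here is a minimal sketch of an implied barrier (plain main(), standard headers assumed): the barrier at the end of the first omp for guarantees that every element of a[] has been written before any thread starts the second loop, which reads elements of a[] computed by other threads.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int i;
    int n = 10;
    int a[10], b[10];

    #pragma omp parallel shared(a,b,n) private(i)
    {
        #pragma omp for
        for (i = 0; i < n; i++)
            a[i] = 2*i;                // writes to a[]
        // implied barrier here: all of a[] is now written

        #pragma omp for
        for (i = 0; i < n; i++)
            b[i] = a[n-1-i] + 1;       // safely reads elements written by other threads
    }

    for (i = 0; i < n; i++)
        printf("b[%d] = %d\n", i, b[i]);
    return 0;
}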