Open Multiprocessing
Dr. Bo Yuan
E-mail: yuanb@sz.tsinghua.edu.cn
OpenMP
• An API for shared-memory multiprocessing (parallel) programming in C, C++ and Fortran.
   – Supports multiple platforms (processor architectures and operating systems).
   – Higher-level approach: the programmer marks a block of code that should be executed in parallel.
• A method of parallelizing whereby a master thread forks a number of slave threads and a task is divided among them.
• Based on preprocessor directives (#pragma)
   – Requires compiler support.
   – omp.h
• References
   – http://openmp.org/
   – https://computing.llnl.gov/tutorials/openMP/
   – http://supercomputingblog.com/openmp/
2
Hello, World!
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void Hello(void);   /* Thread function prototype */

int main(int argc, char* argv[]) {
   /* Get number of threads from command line */
   int thread_count = strtol(argv[1], NULL, 10);

#  pragma omp parallel num_threads(thread_count)
   Hello();

   return 0;
}

void Hello(void) {
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();
   printf("Hello from thread %d of %d\n", my_rank, thread_count);
}
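For reference, a program like this is typically built with the compiler's OpenMP flag and started with the desired thread count as its argument; for example, with gcc one might use gcc -g -Wall -fopenmp -o omp_hello omp_hello.c and then run ./omp_hello 4 to request 4 threads (the file and executable names here are only illustrative).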
3
Definitions
# pragma omp parallel [clauses]
{
   code_block
}
• clauses: optional text that modifies the directive.
• code_block is executed by a team of threads (master + slaves); there is an implicit barrier at the end of the block.

Error Checking
#ifdef _OPENMP
#  include <omp.h>
#endif

#ifdef _OPENMP
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();
#else
   int my_rank = 0;
   int thread_count = 1;
#endif
4
The Trapezoidal Rule
/* Input: a, b, n */
h = (b-a)/n;
approx = (f(a)+f(b))/2.0;
for (i=1; i<=n-1; i++) {
   x_i = a + i*h;
   approx += f(x_i);
}
approx = h*approx;

Shared Memory → Shared Variables → Race Condition
(e.g., Thread 0 and Thread 2 both adding their partial results to the same shared variable)

#  pragma omp critical
   global_result += my_result;
5
The critical Directive
# pragma omp critical
y = f(x);
...
double f(double x) {
#  pragma omp critical
   z = g(x);
   ...
}

The two unnamed critical blocks cannot be executed simultaneously!

# pragma omp critical(one)
y = f(x);
...
double f(double x) {
#  pragma omp critical(two)
   z = g(x);
   ...
}
6
The atomic Directive
# pragma omp atomic
x <op>=<expression>;
<op> can be one of the binary operators:
+, *, -, /, &, ^, |, <<, >>
x++, ++x, x--, and --x are also supported.
• Higher performance than the critical directive.
• Only single C assignment statement is protected.
• Only the load and store of x is protected.
• <expression> must not reference x.
#  pragma omp atomic
   x += f(y);      /* only the update of x is protected, not the call f(y) */

#  pragma omp critical
   x = g(x);

An atomic directive and a critical directive do not exclude each other: these two protected updates of x can be executed simultaneously!
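As a minimal sketch (not from the slide), atomic fits a simple shared-counter update, while a statement that references the shared variable on both sides needs critical; the variable names here are illustrative:

#include <stdio.h>
#include <omp.h>

int main(void) {
   int hits = 0;           /* shared counter */
   double acc = 1.0;       /* shared value updated by a more complex expression */

#  pragma omp parallel num_threads(4)
   {
      /* atomic: protects only the load and store of hits */
#     pragma omp atomic
      hits += 1;

      /* critical: needed because the statement references acc on both sides */
#     pragma omp critical
      acc = 2.0*acc + 1.0;
   }
   printf("hits = %d, acc = %f\n", hits, acc);
   return 0;
}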
7
Locks
/* Executed by one thread */
Initialize the lock data structure;
...
/* Executed by multiple threads */
Attempt to lock or set the lock data structure;
Critical section;
Unlock or unset the lock data structure;
...
/* Executed by one thread */
Destroy the lock data structure;
void omp_init_lock(omp_lock_t* lock_p);
void omp_set_lock(omp_lock_t* lock_p);
void omp_unset_lock(omp_lock_t* lock_p);
void omp_destroy_lock(omp_lock_t* lock_p);
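A minimal sketch of how these four calls might be combined (the shared counter is only an illustrative example, not from the slide):

#include <stdio.h>
#include <omp.h>

int main(void) {
   omp_lock_t lock;
   int count = 0;               /* shared data protected by the lock */

   omp_init_lock(&lock);        /* executed once, before the parallel region */

#  pragma omp parallel num_threads(4)
   {
      omp_set_lock(&lock);      /* attempt to acquire the lock */
      count++;                  /* critical section */
      omp_unset_lock(&lock);    /* release the lock */
   }

   omp_destroy_lock(&lock);     /* executed once, after the parallel region */
   printf("count = %d\n", count);
   return 0;
}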
8
Trapezoidal Rule in OpenMP
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
void Trap(double a, double b, int n, double* global_result_p);
int main(int argc, char* argv[]) {
   double global_result = 0.0;
   double a, b;
   int n, thread_count;

   thread_count = strtol(argv[1], NULL, 10);
   printf("Enter a, b, and n\n");
   scanf("%lf %lf %d", &a, &b, &n);

#  pragma omp parallel num_threads(thread_count)
   Trap(a, b, n, &global_result);

   printf("With n = %d trapezoids, our estimate\n", n);
   printf("of the integral from %f to %f = %.15e\n", a, b, global_result);
   return 0;
}
9
Trapezoidal Rule in OpenMP
void Trap(double a, double b, int n, double* global_result_p) {
   double h, x, my_result;
   double local_a, local_b;
   int i, local_n;
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();

   h = (b-a)/n;
   local_n = n/thread_count;
   local_a = a + my_rank*local_n*h;
   local_b = local_a + local_n*h;
   my_result = (f(local_a)+f(local_b))/2.0;
   for (i=1; i<=local_n-1; i++) {
      x = local_a + i*h;
      my_result += f(x);
   }
   my_result = my_result*h;

#  pragma omp critical
   *global_result_p += my_result;
}
10
Scope of Variables
In serial programming:
• Function-wide scope
• File-wide scope

Shared Scope
• Accessible by all threads in a team
• Declared before a parallel directive
• Examples: a, b, n, global_result, thread_count

Private Scope
• Only accessible by a single thread
• Declared in the code block (or local to a function called from it)
• Examples: my_rank, my_result, global_result_p (note that *global_result_p refers to the shared global_result)

(A short sketch of these default rules follows below.)
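A minimal sketch of the default scoping rules, with illustrative variable names:

#include <stdio.h>
#include <omp.h>

int main(void) {
   int n = 100;                    /* declared before the parallel directive: shared */
   double global_result = 0.0;     /* shared */

#  pragma omp parallel num_threads(4)
   {
      int my_rank = omp_get_thread_num();   /* declared inside the block: private */
      double my_result = 0.5 * n;           /* private; reads the shared n */

#     pragma omp critical
      {
         global_result += my_result;        /* every thread updates the shared variable */
         printf("thread %d added %f\n", my_rank, my_result);
      }
   }
   printf("global_result = %f\n", global_result);
   return 0;
}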
11
Another Trap Function
double Local_trap(double a, double b, int n);

/* Version 1: the critical directive serializes the calls to Local_trap */
global_result = 0.0;
#  pragma omp parallel num_threads(thread_count)
   {
#     pragma omp critical
      global_result += Local_trap(a, b, n);
   }

/* Version 2: only the final additions are serialized */
global_result = 0.0;
#  pragma omp parallel num_threads(thread_count)
   {
      double my_result = 0.0;   /* Private */
      my_result = Local_trap(a, b, n);
#     pragma omp critical
      global_result += my_result;
   }
12
The Reduction Clause
• Reduction: a computation that repeatedly applies the same binary reduction operator
  (e.g., addition or multiplication) to a sequence of operands in order to get a single result.

   reduction(<operator>: <variable list>)

• Note:
   – The reduction variable itself is shared.
   – A private copy is created for each thread in the team.
   – The private copies are initialized to 0 for the addition operator.

global_result = 0.0;
#  pragma omp parallel num_threads(thread_count) \
      reduction(+: global_result)
   global_result = Local_trap(a, b, n);
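As a hedged sketch (not from the slide), the same clause works with other operators; with *, each private copy starts at 1 rather than 0:

#include <stdio.h>
#include <omp.h>

int main(void) {
   double product = 1.0;   /* reduction variable for a multiplicative reduction */
   int i;

#  pragma omp parallel for num_threads(4) reduction(*: product)
   for (i = 1; i <= 10; i++)
      product *= i;        /* each thread multiplies into its private copy (initialized to 1) */

   printf("10! = %.0f\n", product);   /* prints 3628800 */
   return 0;
}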
13
The parallel for Directive
/* Serial version */
h = (b-a)/n;
approx = (f(a)+f(b))/2.0;
for (i=1; i<=n-1; i++) {
   approx += f(a+i*h);
}
approx = h*approx;

/* parallel for version */
h = (b-a)/n;
approx = (f(a)+f(b))/2.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: approx)
   for (i=1; i<=n-1; i++) {
      approx += f(a+i*h);
   }
approx = h*approx;

• The code block must be a for loop.
• Iterations of the for loop are divided among the threads.
• approx is a reduction variable.
• i is a private variable.
14
The parallel for Directive
• Sounds like a truly wonderful approach to parallelizing serial programs.
• Does not work with while or do-while loops.
   – How about converting them into for loops?
• The number of iterations must be determined in advance.

for (; ;) {              /* iteration count unknown */
   ...
}

for (i=0; i<n; i++) {    /* early exit makes the count unknown */
   if (...) break;
   ...
}

int x, y;
#  pragma omp parallel for num_threads(thread_count) private(y)
   for (x=0; x < width; x++) {
      for (y=0; y < height; y++) {
         finalImage[x][y] = f(x, y);
      }
   }
15
Estimating π

(1) k
 1 1 1

  41       4
 3 5 7

k 0 2 K  1
double factor=1.0;
double sum=0.0;
for(k=0; k<n; k++) {
sum+=factor/(2*k+1);
factor=-factor;
}
pi_approx=4.0*sum;
?
Loop-carried dependence
double factor=1.0;
double sum=0.0;
# pragma omp parallel for\
num_threads(thread_count)\
reduction(+: sum)
for(k=0; k<n; k++) {
sum+=factor/(2*k+1);
factor=-factor;
}
pi_approx=4.0*sum;
16
Estimating π
/* Computing factor directly from k removes the dependence */
if (k%2 == 0)
   factor = 1.0;
else
   factor = -1.0;
sum += factor/(2*k+1);

/* Equivalently */
factor = (k%2 == 0) ? 1.0 : -1.0;
sum += factor/(2*k+1);

double factor = 1.0;
double sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: sum) private(factor)
   for (k=0; k<n; k++) {
      if (k%2 == 0)
         factor = 1.0;
      else
         factor = -1.0;
      sum += factor/(2*k+1);
   }
pi_approx = 4.0*sum;
17
Scope Matters
double factor = 1.0;
double sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      default(none) reduction(+: sum) private(k, factor) shared(n)
   for (k=0; k<n; k++) {
      /* note: the initial value of the private factor is not specified */
      if (k%2 == 0)
         factor = 1.0;
      else
         factor = -1.0;
      sum += factor/(2*k+1);
   }
pi_approx = 4.0*sum;

• With the default(none) clause, we need to specify the scope of each variable that we
  use in the block that has been declared outside the block.
• The value of a variable with private scope is unspecified at the beginning (and after
  completion) of a parallel or parallel for block (see the sketch below).
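If a defined initial value is actually needed inside the block, OpenMP's firstprivate clause (not used on this slide) copies the value from the enclosing scope into each thread's private copy; a minimal sketch under that assumption:

#include <stdio.h>
#include <omp.h>

int main(void) {
   double scale = 4.0;   /* set before the parallel for */
   double sum = 0.0;
   int k, n = 1000;

   /* firstprivate: each thread's private scale starts as a copy of 4.0,
      unlike private, whose initial value would be unspecified */
#  pragma omp parallel for num_threads(4) \
      default(none) reduction(+: sum) firstprivate(scale) shared(n) private(k)
   for (k = 0; k < n; k++)
      sum += scale * ((k%2 == 0) ? 1.0 : -1.0) / (2*k + 1);

   printf("pi approx = %f\n", sum);
   return 0;
}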
18
Bubble Sort
for (len=n; len>=2; len--)
   for (i=0; i<len-1; i++)
      if (a[i]>a[i+1]) {
         tmp = a[i];
         a[i] = a[i+1];
         a[i+1] = tmp;
      }

• Can we make it faster?
• Can we parallelize the outer loop?
• Can we parallelize the inner loop?
19
Odd-Even Sort
Phase-by-phase state of the array (subscripts 0 to 3):

   Start:            9  7  8  6
   Phase 0 (even):   7  9  6  8
   Phase 1 (odd):    7  6  9  8
   Phase 2 (even):   6  7  8  9
   Phase 3 (odd):    6  7  8  9
Any opportunities for parallelism?
20
Odd-Even Sort
void Odd_even_sort(int a[], int n) {
   int phase, i, temp;
   for (phase=0; phase<n; phase++)
      if (phase%2 == 0) {          /* Even phase */
         for (i=1; i<n; i+=2)
            if (a[i-1]>a[i]) {
               temp = a[i];
               a[i] = a[i-1];
               a[i-1] = temp;
            }
      } else {                     /* Odd phase */
         for (i=1; i<n-1; i+=2)
            if (a[i]>a[i+1]) {
               temp = a[i];
               a[i] = a[i+1];
               a[i+1] = temp;
            }
      }
}
21
Odd-Even Sort in OpenMP
for (phase=0; phase<n; phase++) {
   if (phase%2 == 0) {             /* Even phase */
#     pragma omp parallel for num_threads(thread_count) \
         default(none) shared(a, n) private(i, temp)
      for (i=1; i<n; i+=2)
         if (a[i-1]>a[i]) {
            temp = a[i];
            a[i] = a[i-1];
            a[i-1] = temp;
         }
   } else {                        /* Odd phase */
#     pragma omp parallel for num_threads(thread_count) \
         default(none) shared(a, n) private(i, temp)
      for (i=1; i<n-1; i+=2)
         if (a[i]>a[i+1]) {
            temp = a[i];
            a[i] = a[i+1];
            a[i+1] = temp;
         }
   }
}
22
Odd-Even Sort in OpenMP
#  pragma omp parallel num_threads(thread_count) \
      default(none) shared(a, n) private(i, temp, phase)
   for (phase=0; phase<n; phase++) {
      if (phase%2 == 0) {          /* Even phase */
#        pragma omp for
         for (i=1; i<n; i+=2)
            if (a[i-1]>a[i]) {
               temp = a[i];
               a[i] = a[i-1];
               a[i-1] = temp;
            }
      } else {                     /* Odd phase */
#        pragma omp for
         for (i=1; i<n-1; i+=2)
            if (a[i]>a[i+1]) {
               temp = a[i];
               a[i] = a[i+1];
               a[i+1] = temp;
            }
      }
   }
23
Data Partitioning
Block partition (9 iterations, 3 threads):
   Thread 0: iterations 0, 1, 2
   Thread 1: iterations 3, 4, 5
   Thread 2: iterations 6, 7, 8

Cyclic partition (9 iterations, 3 threads):
   Thread 0: iterations 0, 3, 6
   Thread 1: iterations 1, 4, 7
   Thread 2: iterations 2, 5, 8
24
Scheduling Loops
sum = 0.0;
for (i=0; i<=n; i++)
   sum += f(i);

double f(int i) {
   int j, start = i*(i+1)/2, finish = start+i;
   double return_val = 0.0;
   for (j=start; j<=finish; j++) {
      return_val += sin(j);
   }
   return return_val;
}

The cost of f(i) grows with i, so dividing the iterations evenly into blocks would leave the threads with very different amounts of work.
25
The schedule clause
sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: sum) schedule(static, 1)
   for (i=0; i<n; i++)
      sum += f(i);

The second argument of schedule(static, chunksize) is the chunksize.
Example with n=12 iterations and t=3 threads:

   schedule(static, 1)       schedule(static, 2)       schedule(static, 4)
   Thread 0: 0, 3, 6, 9      Thread 0: 0, 1, 6, 7      Thread 0: 0, 1, 2, 3
   Thread 1: 1, 4, 7, 10     Thread 1: 2, 3, 8, 9      Thread 1: 4, 5, 6, 7
   Thread 2: 2, 5, 8, 11     Thread 2: 4, 5, 10, 11    Thread 2: 8, 9, 10, 11

The default behaviour is typically equivalent to schedule(static, total_iterations/thread_count).
26
The dynamic and guided Types
• In a dynamic schedule:
   – Iterations are broken into chunks of chunksize consecutive iterations.
   – Default chunksize value: 1
   – Each thread executes a chunk.
   – When a thread finishes a chunk, it requests another one.
• In a guided schedule:
   – Each thread executes a chunk.
   – When a thread finishes a chunk, it requests another one.
   – As chunks are completed, the size of the new chunks decreases: each new chunk is
     approximately the number of iterations remaining divided by the number of threads.
   – The size of chunks decreases down to chunksize or 1 (default).
   – A usage sketch follows below.
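A minimal sketch of both clauses applied to the uneven-cost loop from the previous slide (the thread count and the chunksize of 16 are only illustrative choices):

#include <math.h>
#include <stdio.h>
#include <omp.h>

/* The cost of f(i) grows with i, so a plain block partition is unbalanced. */
double f(int i) {
   int j, start = i*(i+1)/2, finish = start + i;
   double return_val = 0.0;
   for (j = start; j <= finish; j++)
      return_val += sin(j);
   return return_val;
}

int main(void) {
   int i, n = 10000;
   double sum;

   /* dynamic: chunks of 16 consecutive iterations are handed out on demand */
   sum = 0.0;
#  pragma omp parallel for num_threads(4) reduction(+: sum) schedule(dynamic, 16)
   for (i = 0; i < n; i++)
      sum += f(i);
   printf("dynamic: sum = %f\n", sum);

   /* guided: chunk sizes start large and shrink as iterations are completed */
   sum = 0.0;
#  pragma omp parallel for num_threads(4) reduction(+: sum) schedule(guided)
   for (i = 0; i < n; i++)
      sum += f(i);
   printf("guided:  sum = %f\n", sum);

   return 0;
}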
27
Which schedule?
• The optimal schedule depends on:
   – The type of problem
   – The number of iterations
   – The number of threads
• Overhead
   – guided > dynamic > static
   – If you are getting satisfactory results (e.g., close to the theoretical maximum
     speedup) without a schedule clause, go no further.
• The cost of the iterations
   – If it is roughly the same for every iteration, use the default schedule.
   – If it decreases or increases linearly as the loop executes, a static schedule with
     small chunksize values will be good.
   – If it cannot be determined in advance, try exploring different options (see the
     sketch below).
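One convenient way to explore the options without recompiling, not mentioned on the slide, is the schedule(runtime) clause together with the OMP_SCHEDULE environment variable; a minimal sketch (f here is just a placeholder for the real per-iteration work):

#include <stdio.h>
#include <omp.h>

/* Placeholder per-iteration work; any real loop body could be used instead. */
double f(int i) {
   return i * 0.001;
}

int main(void) {
   int i, n = 10000;
   double sum = 0.0;

   /* The actual schedule is chosen at run time from the OMP_SCHEDULE
      environment variable, e.g.
         export OMP_SCHEDULE="static,1"
         export OMP_SCHEDULE="dynamic,16"
         export OMP_SCHEDULE="guided"      */
#  pragma omp parallel for num_threads(4) reduction(+: sum) schedule(runtime)
   for (i = 0; i < n; i++)
      sum += f(i);

   printf("sum = %f\n", sum);
   return 0;
}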
28
Performance Issue
Matrix-vector multiplication: y = A x

#  pragma omp parallel for num_threads(thread_count) \
      default(none) private(i, j) shared(A, x, y, m, n)
   for (i=0; i<m; i++) {
      y[i] = 0.0;
      for (j=0; j<n; j++)
         y[i] += A[i][j]*x[j];
   }
29
Performance Issue
Run-times and efficiencies for different matrix dimensions:

                 8,000,000 x 8        8,000 x 8,000        8 x 8,000,000
   Threads     Time    Efficiency    Time    Efficiency    Time    Efficiency
      1        0.322     1.000       0.264     1.000       0.333     1.000
      2        0.219     0.735       0.189     0.698       0.300     0.555
      4        0.141     0.571       0.119     0.555       0.303     0.275
30
Performance Issue
• 8,000,000-by-8
   – y has 8,000,000 elements → potentially a large number of write misses.
• 8-by-8,000,000
   – x has 8,000,000 elements → potentially a large number of read misses.
   – y has only 8 elements (8 doubles) → could be stored in the same cache line (64 bytes).
   – Potentially serious false sharing effect for multiple processors.
• 8,000-by-8,000
   – y has 8,000 elements (8,000 doubles).
   – Thread 2: y[4000] to y[5999]; Thread 3: y[6000] to y[7999].
   – Only a cache line that straddles the boundary between two threads' blocks, such as
     {y[5996], y[5997], y[5998], y[5999], y[6000], y[6001], y[6002], y[6003]}, is shared.
   – The effect of false sharing is highly unlikely.
31
Thread Safety
• How do we generate random numbers in C?
   – First, call srand() with an integer seed.
   – Second, call rand() to create a sequence of random numbers.
• Pseudorandom Number Generator (PRNG), e.g., a linear congruential generator:

   $X_{n+1} = (a X_n + c) \bmod m$

• Is it thread safe?
   – Can it be simultaneously executed by multiple threads without causing problems?
     (See the sketch below.)
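Because rand() updates hidden internal state ($X_n$), it is generally not guaranteed to be thread safe. A minimal sketch of one common alternative, the POSIX rand_r() function with a per-thread seed (the seeding scheme here is just illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
   int thread_count = 4;

#  pragma omp parallel num_threads(thread_count)
   {
      int my_rank = omp_get_thread_num();
      unsigned int seed = 100 + my_rank;   /* private, per-thread PRNG state */
      int i;
      double local_sum = 0.0;

      for (i = 0; i < 5; i++)
         local_sum += rand_r(&seed) / (double) RAND_MAX;   /* uniform in [0,1] */

#     pragma omp critical
      printf("thread %d: local_sum = %f\n", my_rank, local_sum);
   }
   return 0;
}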
32
Foster’s Methodology
• Partitioning
   – Divide the computation and the data into small tasks.
   – Identify tasks that can be executed in parallel.
• Communication
   – Determine what communication needs to be carried out.
   – Local communication vs. global communication
• Agglomeration
   – Group tasks into larger tasks.
   – Reduce communication.
   – Task dependence
• Mapping
   – Assign the composite tasks to processes/threads.
33
Foster’s Methodology
34
The n-body Problem
• To predict the motion of a group of objects that interact with each other
  gravitationally over a period of time.
   – Inputs: Mass, Position and Velocity
• Astrophysicist
   – The positions and velocities of a collection of stars
• Chemist
   – The positions and velocities of a collection of molecules
35
Newton’s Law
Force exerted on particle q by particle k:

   $f_{qk}(t) = -\frac{G m_q m_k}{\left|s_q(t) - s_k(t)\right|^3}\,\bigl(s_q(t) - s_k(t)\bigr)$

Total force on particle q:

   $F_q(t) = -G m_q \sum_{\substack{k=0 \\ k \ne q}}^{n-1} \frac{m_k}{\left|s_q(t) - s_k(t)\right|^3}\,\bigl(s_q(t) - s_k(t)\bigr)$

Acceleration of particle q:

   $a_q(t) = -G \sum_{\substack{k=0 \\ k \ne q}}^{n-1} \frac{m_k}{\left|s_q(t) - s_k(t)\right|^3}\,\bigl(s_q(t) - s_k(t)\bigr)$
36
The Basic Algorithm
Get input data;
for each timestep {
   if (timestep output)
      Print positions and velocities of particles;
   for each particle q
      Compute total force on q;
   for each particle q
      Compute position and velocity of q;
}

Computing the total force on particle q:

for each particle q {
   forces[q][0] = forces[q][1] = 0;
   for each particle k != q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      forces[q][0] -= G*masses[q]*masses[k]/dist_cubed * x_diff;
      forces[q][1] -= G*masses[q]*masses[k]/dist_cubed * y_diff;
   }
}
37
The Reduced Algorithm
for each particle q
   forces[q][0] = forces[q][1] = 0;
for each particle q {
   for each particle k > q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      force_qk[0] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
      force_qk[1] = -G*masses[q]*masses[k]/dist_cubed * y_diff;
      forces[q][0] += force_qk[0];
      forces[q][1] += force_qk[1];
      forces[k][0] -= force_qk[0];
      forces[k][1] -= force_qk[1];
   }
}
38
Euler Method
$y(t_0 + \Delta t) \approx y(t_0) + y'(t_0)\,(t_0 + \Delta t - t_0) = y(t_0) + \Delta t\, y'(t_0)$
39
Position and Velocity
sq (t )  sq (0)  tsq' (0)  sq (0)  tvq (0)
vq (t )  vq (0)  tv (0)  vq (0)  taq (0)  vq (0)  t
'
q
Fq (0)
mq
sq (2t )  sq (t )  tsq' (t )  sq (t )  tvq (t )
vq (2t )  vq (t )  tv (t )  vq (t )  taq (t )  vq (t )  t
'
q
Fq (t )
mq
for each particle q {
pos[q][0]+=delta_t*vel[q][0];
pos[q][1]+=delta_t*vel[q][1];
vel[q][0]+=delta_t*forces[q][0]/masses[q];
vel[q][1]+=delta_t*forces[q][1]/masses[q];
}
40
Communications
[Figure: task/channel graph for two particles q and r. Computing F_q(t) requires s_q(t) and s_r(t);
s_q(t + Δt) and v_q(t + Δt) are then computed from s_q(t), v_q(t) and F_q(t), and feed into F_q(t + Δt).
The same structure holds symmetrically for particle r.]
41
Agglomeration
[Figure: agglomerated tasks. Each particle q becomes one composite task that carries s_q, v_q and F_q
from timestep t to t + Δt; between the tasks for q and r, only the positions s_q and s_r need to be
communicated.]
42
Parallelizing the Basic Solver
#  pragma omp parallel
   for each timestep {
      if (timestep output) {
#        pragma omp single nowait
         Print positions and velocities of particles;
      }
#     pragma omp for
      for each particle q
         Compute total force on q;
#     pragma omp for
      for each particle q
         Compute position and velocity of q;
   }
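A sketch of what the two omp for loops might look like with the concrete code from the earlier slides. The function name Basic_timestep and its parameter list are hypothetical (not from the slides); it assumes the data layout used so far and is meant to be called from inside an enclosing parallel region, so the omp for directives share the loops among the existing team:

#include <math.h>
#include <omp.h>

/* Sketch: one timestep of the basic solver; pos, vel, forces are n-by-2 arrays,
   masses has n entries. The implicit barrier at the end of the first omp for
   ensures all forces are computed before any positions are updated. */
void Basic_timestep(int n, double pos[][2], double vel[][2],
                    double forces[][2], const double masses[],
                    double G, double delta_t) {
   int q, k;
   double x_diff, y_diff, dist, dist_cubed;

   /* Compute the total force on each particle q (basic O(n^2) version) */
#  pragma omp for
   for (q = 0; q < n; q++) {
      forces[q][0] = forces[q][1] = 0;
      for (k = 0; k < n; k++) {
         if (k == q) continue;
         x_diff = pos[q][0] - pos[k][0];
         y_diff = pos[q][1] - pos[k][1];
         dist = sqrt(x_diff*x_diff + y_diff*y_diff);
         dist_cubed = dist*dist*dist;
         forces[q][0] -= G*masses[q]*masses[k]/dist_cubed * x_diff;
         forces[q][1] -= G*masses[q]*masses[k]/dist_cubed * y_diff;
      }
   }

   /* Update the position and velocity of each particle q (Euler step) */
#  pragma omp for
   for (q = 0; q < n; q++) {
      pos[q][0] += delta_t*vel[q][0];
      pos[q][1] += delta_t*vel[q][1];
      vel[q][0] += delta_t*forces[q][0]/masses[q];
      vel[q][1] += delta_t*forces[q][1]/masses[q];
   }
}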
43
Parallelizing the Reduced Solver
#  pragma omp for
   for each particle q
      forces[q][0] = forces[q][1] = 0;
#  pragma omp for
   for each particle q {
      for each particle k > q {
         x_diff = pos[q][0] - pos[k][0];
         y_diff = pos[q][1] - pos[k][1];
         dist = sqrt(x_diff*x_diff + y_diff*y_diff);
         dist_cubed = dist*dist*dist;
         force_qk[0] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
         force_qk[1] = -G*masses[q]*masses[k]/dist_cubed * y_diff;
         forces[q][0] += force_qk[0];
         forces[q][1] += force_qk[1];
         forces[k][0] -= force_qk[0];
         forces[k][1] -= force_qk[1];
      }
   }
44
Does it work properly?
• Consider 2 threads and 4 particles.
• Thread 0 is assigned particles 0 and 1.
• Thread 1 is assigned particles 2 and 3.
• F3 = -f03 - f13 - f23
• Who will calculate f03 and f13?
• Who will calculate f23?
• Any race conditions?
45
Thread Contributions
3 threads, 6 particles, block partition (Thread 0: particles 0, 1; Thread 1: particles 2, 3; Thread 2: particles 4, 5).
Contributions added to the force on each particle by each thread:

   Particle   Thread 0                        Thread 1              Thread 2
      0       f01 + f02 + f03 + f04 + f05     0                     0
      1       -f01 + f12 + f13 + f14 + f15    0                     0
      2       -f02 - f12                      f23 + f24 + f25       0
      3       -f03 - f13                      -f23 + f34 + f35      0
      4       -f04 - f14                      -f24 - f34            f45
      5       -f05 - f15                      -f25 - f35            -f45
46
First Phase
#  pragma omp for
   for each particle q {
      for each particle k > q {
         x_diff = pos[q][0] - pos[k][0];
         y_diff = pos[q][1] - pos[k][1];
         dist = sqrt(x_diff*x_diff + y_diff*y_diff);
         dist_cubed = dist*dist*dist;
         force_qk[0] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
         force_qk[1] = -G*masses[q]*masses[k]/dist_cubed * y_diff;
         loc_forces[my_rank][q][0] += force_qk[0];
         loc_forces[my_rank][q][1] += force_qk[1];
         loc_forces[my_rank][k][0] -= force_qk[0];
         loc_forces[my_rank][k][1] -= force_qk[1];
      }
   }
47
Second Phase
#  pragma omp for
   for (q=0; q<n; q++) {
      forces[q][0] = forces[q][1] = 0;
      for (thread=0; thread<thread_count; thread++) {
         forces[q][0] += loc_forces[thread][q][0];
         forces[q][1] += loc_forces[thread][q][1];
      }
   }

• In the first phase, each thread carries out the same calculations as before,
  but the values are stored in its own array of forces (loc_forces).
• In the second phase, the thread that has been assigned particle q will add
  the contributions that have been computed by the different threads.
48
Evaluating the OpenMP Codes
• In the reduced code:
   – Loop 1: Initialization of the loc_forces array
   – Loop 2: The first phase of the computation of forces
   – Loop 3: The second phase of the computation of forces
   – Loop 4: The updating of positions and velocities
• Which schedule should be used?

Run-times of the solvers:

   Threads   Basic   Reduced, Default   Reduced, Forces Cyclic   Reduced, All Cyclic
      1      7.71         3.90                  3.90                    3.90
      2      3.87         2.94                  1.98                    2.01
      4      1.95         1.73                  1.01                    1.08
      8      0.99         0.95                  0.54                    0.61
49
Review
• What are the major differences between MPI and OpenMP?
• What is the scope of a variable?
• What is a reduction variable?
• How to ensure mutual exclusion in a critical section?
• What are the common loop scheduling options?
• What factors may potentially affect the performance of an OpenMP program?
• What is a thread-safe function?
50