Open Multiprocessing
Dr. Bo Yuan
E-mail: yuanb@sz.tsinghua.edu.cn

OpenMP
• An API for shared-memory multiprocessing (parallel) programming in C, C++ and Fortran.
  – Supports multiple platforms (processor architectures and operating systems).
  – Higher-level approach: the programmer marks a block of code that should be executed in parallel.
• A method of parallelizing whereby a master thread forks a number of slave threads and a task is divided among them.
• Based on preprocessor directives (pragmas).
  – Requires compiler support.
  – omp.h
• References
  – http://openmp.org/
  – https://computing.llnl.gov/tutorials/openMP/
  – http://supercomputingblog.com/openmp/
2

Hello, World!
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void Hello(void);

int main(int argc, char* argv[]) {
   /* Get number of threads from command line */
   int thread_count = strtol(argv[1], NULL, 10);

#  pragma omp parallel num_threads(thread_count)
   Hello();

   return 0;
}

void Hello(void) {
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();
   printf("Hello from thread %d of %d\n", my_rank, thread_count);
}
3

Definitions
# pragma omp parallel [clauses]
{
   code_block
}
• The clauses are text that modifies the directive.
• There is an implicit barrier at the end of the code block.
• Thread team = master + slaves.

Error Checking
#ifdef _OPENMP
#  include <omp.h>
#endif

#ifdef _OPENMP
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();
#else
   int my_rank = 0;
   int thread_count = 1;
#endif
4

The Trapezoidal Rule
/* Input: a, b, n */
h = (b-a)/n;
approx = (f(a)+f(b))/2.0;
for (i = 1; i <= n-1; i++) {
   x_i = a + i*h;
   approx += f(x_i);
}
approx = h*approx;
• Threads (e.g., thread 0 and thread 2) communicate through shared variables in shared memory.
• Several threads updating a shared result creates a race condition, so the update must be protected:
# pragma omp critical
   global_result += my_result;
5

The critical Directive
# pragma omp critical
   y = f(x);
...
double f(double x) {
#  pragma omp critical
   z = g(x);
   ...
}
• Unnamed critical sections cannot be executed simultaneously, even though the two blocks protect unrelated updates.

# pragma omp critical(one)
   y = f(x);
...
double f(double x) {
#  pragma omp critical(two)
   z = g(x);
   ...
}
• Named critical sections with different names can be executed simultaneously.
6

The atomic Directive
# pragma omp atomic
   x <op>= <expression>;
• <op> can be one of the binary operators: +, *, -, /, &, ^, |, <<, >>
• Also supported: x++, ++x, x--, --x
• Higher performance than the critical directive.
• Only a single C assignment statement is protected.
• Only the load and store of x are protected.
• <expression> must not reference x.

# pragma omp atomic
   x += f(y);
# pragma omp critical
   x = g(x);
• These two can be executed simultaneously: mixing atomic and critical on the same variable does not guarantee mutual exclusion.
7
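To make the contrast between atomic and critical concrete, here is a minimal compilable sketch (not from the slides) that protects two independent counters, one with each directive; the iteration count and the default thread count are arbitrary assumptions.

/* counter_demo.c: a minimal sketch contrasting atomic and critical.
 * Compile with: gcc -fopenmp counter_demo.c -o counter_demo */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char* argv[]) {
   int thread_count = (argc > 1) ? strtol(argv[1], NULL, 10) : 4;
   long atomic_count = 0, critical_count = 0;
   const long n = 1000000;   /* arbitrary iteration count */
   long i;

#  pragma omp parallel for num_threads(thread_count)
   for (i = 0; i < n; i++) {
      /* atomic: protects only this single update of the shared variable */
#     pragma omp atomic
      atomic_count += 1;

      /* critical: general mutual exclusion, higher overhead */
#     pragma omp critical
      critical_count += 1;
   }

   printf("atomic_count = %ld, critical_count = %ld (expected %ld)\n",
          atomic_count, critical_count, n);
   return 0;
}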
Locks
/* Executed by one thread */
Initialize the lock data structure;
...
/* Executed by multiple threads */
Attempt to lock or set the lock data structure;
Critical section;
Unlock or unset the lock data structure;
...
/* Executed by one thread */
Destroy the lock data structure;

void omp_init_lock(omp_lock_t* lock_p);
void omp_set_lock(omp_lock_t* lock_p);
void omp_unset_lock(omp_lock_t* lock_p);
void omp_destroy_lock(omp_lock_t* lock_p);
8

Trapezoidal Rule in OpenMP
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void Trap(double a, double b, int n, double* global_result_p);

int main(int argc, char* argv[]) {
   double global_result = 0.0;
   double a, b;
   int n, thread_count;

   thread_count = strtol(argv[1], NULL, 10);
   printf("Enter a, b, and n\n");
   scanf("%lf %lf %d", &a, &b, &n);

#  pragma omp parallel num_threads(thread_count)
   Trap(a, b, n, &global_result);

   printf("With n=%d trapezoids, our estimate\n", n);
   printf("of the integral from %f to %f = %.15e\n", a, b, global_result);
   return 0;
}
9

Trapezoidal Rule in OpenMP
void Trap(double a, double b, int n, double* global_result_p) {
   double h, x, my_result;
   double local_a, local_b;
   int i, local_n;
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();

   h = (b-a)/n;
   local_n = n/thread_count;
   local_a = a + my_rank*local_n*h;
   local_b = local_a + local_n*h;
   my_result = (f(local_a)+f(local_b))/2.0;
   for (i = 1; i <= local_n-1; i++) {
      x = local_a + i*h;
      my_result += f(x);
   }
   my_result = my_result*h;

#  pragma omp critical
   *global_result_p += my_result;
}
10

Scope of Variables
• In serial programming, a variable has function-wide or file-wide scope.
• Shared scope: accessible by all threads in a team; declared before the parallel directive.
  – a, b, n, global_result, thread_count
• Private scope: only accessible by a single thread; declared in the code block.
  – my_rank, my_result, global_result_p (note that *global_result_p refers to the shared global_result)
11

Another Trap Function
double Local_trap(double a, double b, int n);

global_result = 0.0;
# pragma omp parallel num_threads(thread_count)
{
#  pragma omp critical
   global_result += Local_trap(a, b, n);   /* the whole computation is serialized */
}

global_result = 0.0;
# pragma omp parallel num_threads(thread_count)
{
   double my_result = 0.0;   /* private */
   my_result = Local_trap(a, b, n);
#  pragma omp critical
   global_result += my_result;   /* only the update is serialized */
}
12

The Reduction Clause
• Reduction: a computation that repeatedly applies the same binary reduction operator (e.g., addition or multiplication) to a sequence of operands in order to get a single result.
  reduction(<operator>: <variable list>)
• Notes:
  – The reduction variable itself is shared.
  – A private copy is created for each thread in the team.
  – The private copies are initialized to 0 for the addition operator.

global_result = 0.0;
# pragma omp parallel num_threads(thread_count)\
      reduction(+: global_result)
   global_result += Local_trap(a, b, n);
13

The parallel for Directive
Serial:
h = (b-a)/n;
approx = (f(a)+f(b))/2.0;
for (i = 1; i <= n-1; i++) {
   approx += f(a + i*h);
}
approx = h*approx;

Parallel:
h = (b-a)/n;
approx = (f(a)+f(b))/2.0;
# pragma omp parallel for num_threads(thread_count)\
      reduction(+: approx)
for (i = 1; i <= n-1; i++) {
   approx += f(a + i*h);
}
approx = h*approx;

• The code block must be a for loop.
• Iterations of the for loop are divided among the threads.
• approx is a reduction variable.
• i is a private variable.
14
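A minimal compilable sketch that combines parallel for with the reduction clause for the trapezoidal rule; the integrand f(x) = x*x, the interval and the value of n are assumed for illustration and are not part of the slides.

/* trap_for.c: trapezoidal rule with parallel for + reduction.
 * Compile with: gcc -fopenmp trap_for.c -o trap_for
 * Run with:     ./trap_for <thread_count> */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

double f(double x) { return x*x; }   /* assumed test integrand */

int main(int argc, char* argv[]) {
   int thread_count = (argc > 1) ? strtol(argv[1], NULL, 10) : 4;
   double a = 0.0, b = 1.0;          /* assumed interval */
   int n = 1000000;                  /* assumed number of trapezoids */
   double h = (b - a)/n;
   double approx = (f(a) + f(b))/2.0;
   int i;

   /* Iterations are split among the threads; each thread accumulates
    * into its own private copy of approx, and the copies are summed. */
#  pragma omp parallel for num_threads(thread_count) reduction(+: approx)
   for (i = 1; i <= n-1; i++)
      approx += f(a + i*h);

   approx = h*approx;
   printf("Estimate of integral from %f to %f = %.15e\n", a, b, approx);
   return 0;
}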
The parallel for Directive
• Sounds like a truly wonderful approach to parallelizing serial programs.
• Does not work with while or do-while loops.
  – How about converting them into for loops?
• The number of iterations must be determinable in advance, so loops like the following cannot be parallelized:
for (; ;) {
   ...
}
for (i = 0; i < n; i++) {
   if (...) break;
   ...
}
• Only the loop named in the directive is parallelized; variables of inner loops declared outside the block must be made private explicitly:
int x, y;
# pragma omp parallel for num_threads(thread_count) private(y)
for (x = 0; x < width; x++) {
   for (y = 0; y < height; y++) {
      finalImage[x][y] = f(x, y);
   }
}
15

Estimating π (1)
\pi = 4\left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots\right) = 4\sum_{k=0}^{\infty}\frac{(-1)^k}{2k+1}

Serial:
double factor = 1.0;
double sum = 0.0;
for (k = 0; k < n; k++) {
   sum += factor/(2*k+1);
   factor = -factor;
}
pi_approx = 4.0*sum;

Incorrect parallel version (loop-carried dependence on factor):
double factor = 1.0;
double sum = 0.0;
# pragma omp parallel for\
      num_threads(thread_count)\
      reduction(+: sum)
for (k = 0; k < n; k++) {
   sum += factor/(2*k+1);
   factor = -factor;
}
pi_approx = 4.0*sum;
16

Estimating π
Remove the dependence by computing factor from k directly:
if (k % 2 == 0)
   factor = 1.0;
else
   factor = -1.0;
sum += factor/(2*k+1);

factor = (k % 2 == 0) ? 1.0 : -1.0;
sum += factor/(2*k+1);

double factor = 1.0;
double sum = 0.0;
# pragma omp parallel for num_threads(thread_count)\
      reduction(+: sum) private(factor)
for (k = 0; k < n; k++) {
   if (k % 2 == 0)
      factor = 1.0;
   else
      factor = -1.0;
   sum += factor/(2*k+1);
}
pi_approx = 4.0*sum;
17

Scope Matters
double factor = 1.0;
double sum = 0.0;
# pragma omp parallel for num_threads(thread_count)\
      default(none) reduction(+: sum) private(k, factor) shared(n)
for (k = 0; k < n; k++) {
   if (k % 2 == 0)
      factor = 1.0;
   else
      factor = -1.0;   /* each thread's private factor starts out unspecified, so it must be assigned before use */
   sum += factor/(2*k+1);
}
pi_approx = 4.0*sum;
• With the default(none) clause, we need to specify the scope of every variable used in the block that was declared outside the block.
• The value of a variable with private scope is unspecified at the beginning (and after completion) of a parallel or parallel for block.
18
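Putting the last three slides together, here is a minimal compilable sketch of the corrected π estimate with default(none) scoping; the number of terms and the default thread count are assumed values.

/* pi_openmp.c: corrected pi estimate with reduction, private(factor)
 * and default(none).
 * Compile with: gcc -fopenmp pi_openmp.c -o pi_openmp */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char* argv[]) {
   int thread_count = (argc > 1) ? strtol(argv[1], NULL, 10) : 4;
   long n = 100000000;              /* assumed number of terms */
   double sum = 0.0;
   double factor;
   double pi_approx;
   long k;

#  pragma omp parallel for num_threads(thread_count) \
      default(none) reduction(+: sum) private(k, factor) shared(n)
   for (k = 0; k < n; k++) {
      factor = (k % 2 == 0) ? 1.0 : -1.0;   /* no loop-carried dependence */
      sum += factor/(2*k + 1);
   }

   pi_approx = 4.0*sum;
   printf("pi_approx = %.15f\n", pi_approx);
   return 0;
}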
Bubble Sort
for (len = n; len >= 2; len--)
   for (i = 0; i < len-1; i++)
      if (a[i] > a[i+1]) {
         tmp = a[i];
         a[i] = a[i+1];
         a[i+1] = tmp;
      }
• Can we make it faster?
• Can we parallelize the outer loop?
• Can we parallelize the inner loop?
19

Odd-Even Sort
Example with a = {9, 7, 8, 6} (contents of a[0..3] after each phase):

   Phase       a[0]  a[1]  a[2]  a[3]
   Start          9     7     8     6
   0 (even)       7     9     6     8
   1 (odd)        7     6     9     8
   2 (even)       6     7     8     9
   3 (odd)        6     7     8     9

Any opportunities for parallelism?
20

Odd-Even Sort
void Odd_even_sort(int a[], int n) {
   int phase, i, temp;
   for (phase = 0; phase < n; phase++)
      if (phase % 2 == 0) {   /* Even phase */
         for (i = 1; i < n; i += 2)
            if (a[i-1] > a[i]) {
               temp = a[i];
               a[i] = a[i-1];
               a[i-1] = temp;
            }
      } else {                /* Odd phase */
         for (i = 1; i < n-1; i += 2)
            if (a[i] > a[i+1]) {
               temp = a[i];
               a[i] = a[i+1];
               a[i+1] = temp;
            }
      }
}
21

Odd-Even Sort in OpenMP
for (phase = 0; phase < n; phase++) {
   if (phase % 2 == 0) {   /* Even phase */
#     pragma omp parallel for num_threads(thread_count)\
         default(none) shared(a, n) private(i, temp)
      for (i = 1; i < n; i += 2)
         if (a[i-1] > a[i]) {
            temp = a[i];
            a[i] = a[i-1];
            a[i-1] = temp;
         }
   } else {                /* Odd phase */
#     pragma omp parallel for num_threads(thread_count)\
         default(none) shared(a, n) private(i, temp)
      for (i = 1; i < n-1; i += 2)
         if (a[i] > a[i+1]) {
            temp = a[i];
            a[i] = a[i+1];
            a[i+1] = temp;
         }
   }
}
22

Odd-Even Sort in OpenMP
# pragma omp parallel num_threads(thread_count) \
      default(none) shared(a, n) private(i, temp, phase)
for (phase = 0; phase < n; phase++) {
   if (phase % 2 == 0) {   /* Even phase */
#     pragma omp for
      for (i = 1; i < n; i += 2)
         if (a[i-1] > a[i]) {
            temp = a[i];
            a[i] = a[i-1];
            a[i-1] = temp;
         }
   } else {                /* Odd phase */
#     pragma omp for
      for (i = 1; i < n-1; i += 2)
         if (a[i] > a[i+1]) {
            temp = a[i];
            a[i] = a[i+1];
            a[i+1] = temp;
         }
   }
}
23

Data Partitioning
Nine iterations (0 to 8) assigned to three threads (0 to 2):
• Block partition: thread 0 gets iterations 0, 1, 2; thread 1 gets 3, 4, 5; thread 2 gets 6, 7, 8.
• Cyclic partition: thread 0 gets iterations 0, 3, 6; thread 1 gets 1, 4, 7; thread 2 gets 2, 5, 8.
24

Scheduling Loops
sum = 0.0;
for (i = 0; i <= n; i++)
   sum += f(i);

double f(int i) {
   int j, start = i*(i+1)/2, finish = start + i;
   double return_val = 0.0;
   for (j = start; j <= finish; j++) {
      return_val += sin(j);
   }
   return return_val;
}
• The cost of f(i) grows linearly with i, so a simple block partition of the iterations is unbalanced.
25

The schedule clause
sum = 0.0;
# pragma omp parallel for num_threads(thread_count) \
      reduction(+: sum) schedule(static, 1)
for (i = 0; i < n; i++)
   sum += f(i);
• The second argument of schedule is the chunksize.
• Example with n = 12 iterations and t = 3 threads:

   schedule(static, 1)        schedule(static, 2)        schedule(static, 4)
   Thread 0: 0, 3, 6, 9       Thread 0: 0, 1, 6, 7       Thread 0: 0, 1, 2, 3
   Thread 1: 1, 4, 7, 10      Thread 1: 2, 3, 8, 9       Thread 1: 4, 5, 6, 7
   Thread 2: 2, 5, 8, 11      Thread 2: 4, 5, 10, 11     Thread 2: 8, 9, 10, 11

• schedule(static, total_iterations/thread_count) corresponds to a block partition.
26

The dynamic and guided Types
• In a dynamic schedule:
  – Iterations are broken into chunks of chunksize consecutive iterations.
  – Default chunksize value: 1.
  – Each thread executes a chunk.
  – When a thread finishes a chunk, it requests another one.
• In a guided schedule:
  – Each thread executes a chunk, and when a thread finishes a chunk, it requests another one.
  – As chunks are completed, the size of the new chunks decreases.
  – A new chunk is approximately equal to the number of iterations remaining divided by the number of threads.
  – The size of chunks decreases down to chunksize or 1 (the default).
27

Which schedule?
• The optimal schedule depends on:
  – The type of problem
  – The number of iterations
  – The number of threads
• Overhead: guided > dynamic > static.
  – If you are getting satisfactory results (e.g., close to the theoretical maximum speedup) without a schedule clause, go no further.
• The cost of the iterations:
  – If it is roughly the same for every iteration, use the default schedule.
  – If it decreases or increases linearly as the loop executes, a static schedule with small chunksize values will be good.
  – If it cannot be determined in advance, try different options.
28
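To experiment with these options, here is a minimal sketch (not from the slides) that times the triangular workload from the scheduling example under one schedule clause; n, the default thread count and the chosen schedule are assumptions, and replacing schedule(static, 1) with schedule(dynamic) or schedule(guided) allows a comparison.

/* schedule_demo.c: timing a loop with linearly growing iteration cost.
 * Compile with: gcc -fopenmp schedule_demo.c -o schedule_demo -lm */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

/* Cost grows linearly with i: iteration i performs i+1 calls to sin. */
double f(int i) {
   int j, start = i*(i+1)/2, finish = start + i;
   double return_val = 0.0;
   for (j = start; j <= finish; j++)
      return_val += sin(j);
   return return_val;
}

int main(int argc, char* argv[]) {
   int thread_count = (argc > 1) ? strtol(argv[1], NULL, 10) : 4;
   int n = 10000;                   /* assumed iteration count */
   double sum = 0.0;
   int i;

   double start = omp_get_wtime();
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: sum) schedule(static, 1)
   for (i = 0; i < n; i++)
      sum += f(i);
   double elapsed = omp_get_wtime() - start;

   printf("sum = %f, elapsed = %f seconds\n", sum, elapsed);
   return 0;
}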
Performance Issue
Matrix-vector multiplication y = Ax:
# pragma omp parallel for num_threads(thread_count) \
      default(none) private(i, j) shared(A, x, y, m, n)
for (i = 0; i < m; i++) {
   y[i] = 0.0;
   for (j = 0; j < n; j++)
      y[i] += A[i][j]*x[j];
}
29

Performance Issue
Run times and efficiencies of matrix-vector multiplication for three matrix shapes:

                 8,000,000 x 8        8,000 x 8,000        8 x 8,000,000
   Threads      Time   Efficiency    Time   Efficiency    Time   Efficiency
   1            0.322    1.000       0.264    1.000       0.333    1.000
   2            0.219    0.735       0.189    0.698       0.300    0.555
   4            0.141    0.571       0.119    0.555       0.303    0.275
30

Performance Issue
• 8,000,000-by-8: y has 8,000,000 elements.
  – Potentially large number of write misses.
• 8-by-8,000,000: x has 8,000,000 elements.
  – Potentially large number of read misses.
• 8-by-8,000,000: y has 8 elements (8 doubles), which could be stored in the same cache line (64 bytes).
  – Potentially serious false sharing effect for multiple processors.
• 8,000-by-8,000: y has 8,000 elements (8,000 doubles).
  – With a block partition, thread 2 is assigned y[4000] to y[5999] and thread 3 is assigned y[6000] to y[7999]; only the cache line holding {y[5996], y[5997], y[5998], y[5999], y[6000], y[6001], y[6002], y[6003]} straddles two threads.
  – The effect of false sharing is highly unlikely.
31

Thread Safety
• How to generate random numbers in C?
  – First, call srand() with an integer seed.
  – Second, call rand() to create a sequence of random numbers.
• Pseudorandom Number Generator (PRNG):
  X_{n+1} = (a X_n + c) \bmod m
• Is it thread safe?
  – Can it be simultaneously executed by multiple threads without causing problems?
32
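Since rand() keeps hidden internal state between calls, a common way to get thread-safe random numbers is to give each thread its own generator state. The sketch below is not from the slides; it uses a simple linear congruential generator of the form X_{n+1} = (a X_n + c) mod m with illustrative constants.

/* prng_per_thread.c: each thread owns the state of its own LCG, so
 * there is no hidden shared state and the generator is thread safe.
 * Compile with: gcc -fopenmp prng_per_thread.c -o prng_per_thread */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* One LCG step; the caller passes (and keeps) the state. The constants
 * are illustrative choices; the modulus is 2^32 via unsigned wraparound. */
static unsigned int my_rand(unsigned int* state_p) {
   *state_p = 1103515245u*(*state_p) + 12345u;   /* a*X_n + c */
   return (*state_p >> 16) & 0x7fff;             /* 15-bit result */
}

int main(int argc, char* argv[]) {
   int thread_count = (argc > 1) ? strtol(argv[1], NULL, 10) : 4;

#  pragma omp parallel num_threads(thread_count)
   {
      int my_rank = omp_get_thread_num();
      unsigned int state = 1u + my_rank;   /* per-thread seed */
      unsigned int r = my_rand(&state);
      printf("Thread %d drew %u\n", my_rank, r);
   }
   return 0;
}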
Foster's Methodology
• Partitioning
  – Divide the computation and the data into small tasks.
  – Identify tasks that can be executed in parallel.
• Communication
  – Determine what communication needs to be carried out.
  – Local communication vs. global communication.
• Agglomeration
  – Group tasks into larger tasks.
  – Reduce communication.
  – Task dependence.
• Mapping
  – Assign the composite tasks to processes/threads.
33

Foster's Methodology
(Diagram: partitioning, communication, agglomeration and mapping applied in sequence to a problem.)
34

The n-body Problem
• To predict the motion of a group of objects that interact with each other gravitationally over a period of time.
  – Inputs: mass, position and velocity of each object.
• Astrophysicist
  – The positions and velocities of a collection of stars.
• Chemist
  – The positions and velocities of a collection of molecules.
35

Newton's Law
Force on particle q due to particle k:
f_{qk}(t) = -\frac{G m_q m_k}{\left|s_q(t) - s_k(t)\right|^3}\,\left[s_q(t) - s_k(t)\right]

Total force on particle q:
F_q(t) = -G m_q \sum_{k=0,\,k \ne q}^{n-1} \frac{m_k}{\left|s_q(t) - s_k(t)\right|^3}\,\left[s_q(t) - s_k(t)\right]

Acceleration of particle q:
a_q(t) = \frac{F_q(t)}{m_q} = -G \sum_{k=0,\,k \ne q}^{n-1} \frac{m_k}{\left|s_q(t) - s_k(t)\right|^3}\,\left[s_q(t) - s_k(t)\right]
36

The Basic Algorithm
Get input data;
for each timestep {
   if (timestep output)
      Print positions and velocities of particles;
   for each particle q
      Compute total force on q;
   for each particle q
      Compute position and velocity of q;
}

for each particle q {
   forces[q][0] = forces[q][1] = 0;
   for each particle k != q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      forces[q][0] -= G*masses[q]*masses[k]/dist_cubed * x_diff;
      forces[q][1] -= G*masses[q]*masses[k]/dist_cubed * y_diff;
   }
}
37

The Reduced Algorithm
for each particle q
   forces[q][0] = forces[q][1] = 0;
for each particle q {
   for each particle k > q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      force_qk[0] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
      force_qk[1] = -G*masses[q]*masses[k]/dist_cubed * y_diff;
      forces[q][0] += force_qk[0];
      forces[q][1] += force_qk[1];
      forces[k][0] -= force_qk[0];
      forces[k][1] -= force_qk[1];
   }
}
38

Euler Method
y(t_0 + \Delta t) \approx y(t_0) + y'(t_0)\,(t_0 + \Delta t - t_0) = y(t_0) + \Delta t\, y'(t_0)
39

Position and Velocity
s_q(\Delta t) \approx s_q(0) + \Delta t\, s_q'(0) = s_q(0) + \Delta t\, v_q(0)
v_q(\Delta t) \approx v_q(0) + \Delta t\, v_q'(0) = v_q(0) + \Delta t\, a_q(0) = v_q(0) + \Delta t\, \frac{F_q(0)}{m_q}
s_q(2\Delta t) \approx s_q(\Delta t) + \Delta t\, s_q'(\Delta t) = s_q(\Delta t) + \Delta t\, v_q(\Delta t)
v_q(2\Delta t) \approx v_q(\Delta t) + \Delta t\, v_q'(\Delta t) = v_q(\Delta t) + \Delta t\, a_q(\Delta t) = v_q(\Delta t) + \Delta t\, \frac{F_q(\Delta t)}{m_q}

for each particle q {
   pos[q][0] += delta_t*vel[q][0];
   pos[q][1] += delta_t*vel[q][1];
   vel[q][0] += delta_t*forces[q][0]/masses[q];
   vel[q][1] += delta_t*forces[q][1]/masses[q];
}
40

Communications
(Diagram: for particles q and r, the forces F_q(t) and F_r(t) are computed from the positions s_q(t) and s_r(t); together with v_q(t) and v_r(t) they produce s_q(t + \Delta t), v_q(t + \Delta t), s_r(t + \Delta t) and v_r(t + \Delta t), which in turn feed F_q(t + \Delta t) and F_r(t + \Delta t).)
41

Agglomeration
(Diagram: the tasks that compute s_q, v_q and F_q for particle q at successive timesteps are agglomerated into one composite task per particle; at each timestep the composite tasks exchange the positions s_q and s_r.)
42

Parallelizing the Basic Solver
# pragma omp parallel
for each timestep {
   if (timestep output) {
#     pragma omp single nowait
      Print positions and velocities of particles;
   }
#  pragma omp for
   for each particle q
      Compute total force on q;
#  pragma omp for
   for each particle q
      Compute position and velocity of q;
}
43
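The pseudocode above can be fleshed out into a small compilable sketch of the basic solver. Everything concrete below (number of particles, number of steps, timestep, G = 1, the initial state) is an assumed value for illustration; the structure follows the basic algorithm, in which each thread computes the full force on its own particles, so no two threads write to the same element of forces.

/* nbody_basic.c: a minimal sketch of the basic OpenMP n-body solver.
 * Compile with: gcc -fopenmp nbody_basic.c -o nbody_basic -lm */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

#define N 4            /* assumed number of particles */
#define N_STEPS 100    /* assumed number of timesteps */

int main(int argc, char* argv[]) {
   int thread_count = (argc > 1) ? strtol(argv[1], NULL, 10) : 2;
   double G = 1.0, delta_t = 0.01;   /* assumed constants */
   double masses[N], pos[N][2], vel[N][2], forces[N][2];
   int q, k, step;

   /* Assumed initial state: unit masses on a line, at rest. */
   for (q = 0; q < N; q++) {
      masses[q] = 1.0;
      pos[q][0] = q;   pos[q][1] = 0.0;
      vel[q][0] = 0.0; vel[q][1] = 0.0;
   }

#  pragma omp parallel num_threads(thread_count) \
      default(none) shared(masses, pos, vel, forces, G, delta_t) \
      private(q, k, step)
   for (step = 0; step < N_STEPS; step++) {
      /* Basic algorithm: each thread computes the total force on the
       * particles assigned to it, reading all positions. */
#     pragma omp for
      for (q = 0; q < N; q++) {
         forces[q][0] = forces[q][1] = 0.0;
         for (k = 0; k < N; k++) {
            if (k == q) continue;
            double x_diff = pos[q][0] - pos[k][0];
            double y_diff = pos[q][1] - pos[k][1];
            double dist = sqrt(x_diff*x_diff + y_diff*y_diff);
            double dist_cubed = dist*dist*dist;
            forces[q][0] -= G*masses[q]*masses[k]/dist_cubed * x_diff;
            forces[q][1] -= G*masses[q]*masses[k]/dist_cubed * y_diff;
         }
      }
      /* Euler update; the implicit barrier at the end of each omp for
       * keeps the force phase and the update phase separate. */
#     pragma omp for
      for (q = 0; q < N; q++) {
         pos[q][0] += delta_t*vel[q][0];
         pos[q][1] += delta_t*vel[q][1];
         vel[q][0] += delta_t*forces[q][0]/masses[q];
         vel[q][1] += delta_t*forces[q][1]/masses[q];
      }
   }

   for (q = 0; q < N; q++)
      printf("Particle %d: pos = (%f, %f), vel = (%f, %f)\n",
             q, pos[q][0], pos[q][1], vel[q][0], vel[q][1]);
   return 0;
}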
Parallelizing the Reduced Solver
# pragma omp for
for each particle q
   forces[q][0] = forces[q][1] = 0;
# pragma omp for
for each particle q {
   for each particle k > q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      force_qk[0] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
      force_qk[1] = -G*masses[q]*masses[k]/dist_cubed * y_diff;
      forces[q][0] += force_qk[0];
      forces[q][1] += force_qk[1];
      forces[k][0] -= force_qk[0];
      forces[k][1] -= force_qk[1];
   }
}
44

Does it work properly?
• Consider 2 threads and 4 particles.
• Thread 0 is assigned particle 0 and particle 1.
• Thread 1 is assigned particle 2 and particle 3.
• F_3 = -f_{03} - f_{13} - f_{23}
• Who will calculate f_{03} and f_{13}?
• Who will calculate f_{23}?
• Any race conditions?
45

Thread Contributions
Contributions to forces[q] with 3 threads, 6 particles and a block partition
(thread 0 owns particles 0 and 1, thread 1 owns 2 and 3, thread 2 owns 4 and 5):

   Particle   Thread 0                         Thread 1               Thread 2
   0          f01 + f02 + f03 + f04 + f05      0                      0
   1          -f01 + f12 + f13 + f14 + f15     0                      0
   2          -f02 - f12                       f23 + f24 + f25        0
   3          -f03 - f13                       -f23 + f34 + f35       0
   4          -f04 - f14                       -f24 - f34             f45
   5          -f05 - f15                       -f25 - f35             -f45
46

First Phase
# pragma omp for
for each particle q {
   for each particle k > q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      force_qk[0] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
      force_qk[1] = -G*masses[q]*masses[k]/dist_cubed * y_diff;
      loc_forces[my_rank][q][0] += force_qk[0];
      loc_forces[my_rank][q][1] += force_qk[1];
      loc_forces[my_rank][k][0] -= force_qk[0];
      loc_forces[my_rank][k][1] -= force_qk[1];
   }
}
47

Second Phase
# pragma omp for
for (q = 0; q < n; q++) {
   forces[q][0] = forces[q][1] = 0;
   for (thread = 0; thread < thread_count; thread++) {
      forces[q][0] += loc_forces[thread][q][0];
      forces[q][1] += loc_forces[thread][q][1];
   }
}
• In the first phase, each thread carries out the same calculations as before, but the values are stored in its own array of forces (loc_forces), so no two threads write to the same location.
• In the second phase, the thread that has been assigned particle q adds up the contributions that have been computed by the different threads.
48

Evaluating the OpenMP Codes
• In the reduced code:
  – Loop 1: initialization of the loc_forces array.
  – Loop 2: the first phase of the computation of forces.
  – Loop 3: the second phase of the computation of forces.
  – Loop 4: the updating of positions and velocities.
• Which schedule should be used? Run times:

   Threads   Basic   Reduced Default   Reduced Forces Cyclic   Reduced All Cyclic
   1         7.71         3.90                 3.90                   3.90
   2         3.87         2.94                 1.98                   2.01
   4         1.95         1.73                 1.01                   1.08
   8         0.99         0.95                 0.54                   0.61
49

Review
• What are the major differences between MPI and OpenMP?
• What is the scope of a variable?
• What is a reduction variable?
• How do we ensure mutual exclusion in a critical section?
• What are the common loop scheduling options?
• What factors may potentially affect the performance of an OpenMP program?
• What is a thread-safe function?
50
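A closing sketch: the two-phase reduced solver above uses a per-thread array loc_forces without showing how it is set up. One possible way to allocate and zero it is sketched below; the sizes, the C99 pointer-to-array layout and the variable names are assumptions rather than code from the slides.

/* loc_forces_alloc.c: allocating the per-thread force array used by
 * the two-phase reduced solver so that loc_forces[thread][q][dim]
 * indexing works as on the slides.
 * Compile with: gcc -std=c99 -fopenmp loc_forces_alloc.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
   int thread_count = 4;   /* assumed */
   int n = 1000;           /* assumed number of particles */
   int t, q;

   /* thread_count blocks, each n x 2 doubles */
   double (*loc_forces)[n][2] = malloc(thread_count * sizeof *loc_forces);
   if (loc_forces == NULL) return 1;

   /* Zero the array in parallel (corresponding to "Loop 1" above). */
#  pragma omp parallel for num_threads(thread_count) private(q)
   for (t = 0; t < thread_count; t++)
      for (q = 0; q < n; q++)
         loc_forces[t][q][0] = loc_forces[t][q][1] = 0.0;

   printf("Allocated %d x %d x 2 doubles\n", thread_count, n);
   free(loc_forces);
   return 0;
}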