OpenMP (Open Multi-Processing)

Pthreads vs. OpenMP
• Pthreads: the programmer explicitly defines thread behavior; OpenMP: the compiler and runtime system define thread behavior
• Pthreads: a library, independent of the compiler; OpenMP: requires compiler support
• Pthreads: low-level, with maximum flexibility; OpenMP: higher-level, with less flexibility
• Pthreads: the application is parallelized all at once; OpenMP: the programmer can incrementally parallelize an application
• Pthreads: difficult to program high-performance parallel applications; OpenMP: much simpler to program high-performance parallel applications
• Pthreads: explicit fork/join or detached-thread model; OpenMP: implicit fork/join model

Creating OpenMP Programs
• In C, make use of the preprocessor
• General syntax: #pragma omp directive [clause [clause] ...]
• Note: to extend a directive across multiple lines, end each line with the '\' character
• Error checking: adapt if compiler support is lacking

    #ifdef _OPENMP     // Only include the header if it exists
    #include <omp.h>
    #endif

    #ifdef _OPENMP     // Get the thread count and rank if OpenMP exists
    int rank = omp_get_thread_num();
    int threads = omp_get_num_threads();
    #else              // Default to one thread, rank zero, if no OpenMP support
    int rank = 0;
    int threads = 1;
    #endif

OpenMP Parallel Regions
[Figure from "An Introduction Into OpenMP", © 2005 Sun Microsystems]

Parallel Directive
Initiate a parallel structured block:

    #pragma omp parallel
    { /* code here */ }

Overview
• The compiler, encountering the omp parallel pragma, creates multiple threads
• All threads execute the specified structured block of code
• A structured block is either a single statement or a compound statement created with { ... }, with a single entry point and a single exit point
• There is an implicit barrier at the end of the construct
• Within the block, we can declare variables to be
  – Private: local to a particular thread
  – Shared: visible to all threads

Example

    int x, threads;
    #pragma omp parallel private(x, threads)
    {
        x = omp_get_thread_num();
        threads = omp_get_num_threads();
        a[x] += threads;
    }

• omp_get_num_threads() returns the number of active threads
• omp_get_thread_num() returns the rank (counting from 0)
• x and threads are private variables, local to each thread
• Array a[] is a global shared array
• Note: a[x] in this example is not a critical section (a[x] is a unique location for each thread)

Matrix Times Vector
[Figure from "An Introduction Into OpenMP", © 2005 Sun Microsystems]

Trapezoidal Rule
Calculate the approximate integral of f over [a, b]:
• Sum up a series of adjacent trapezoids, each of the same width
• Trapezoid area: h * (f(x_i) + f(x_{i+1})) / 2
• With a single trapezoid: approximate integral = (b - a) * (f(a) + f(b)) / 2
• A closer approximation with n trapezoids of width h = (b - a) / n:
  h * [f(x_0)/2 + f(x_1) + f(x_2) + ... + f(x_{n-1}) + f(x_n)/2]
• Sequential algorithm (given a, b, and n):

    w = (b - a) / n;                  // Width of one trapezoid
    integral = 0;
    for (i = 1; i <= n - 1; i++) {    // Evaluate at each interior point
        integral += f(a + i * w);
    }
    integral = w * (integral + f(a)/2 + f(b)/2);   // The approximate result

Parallel Trapezoidal Integration

    #include <stdio.h>    // printf
    #include <stdlib.h>   // strtol, strtod
    #include <omp.h>      // OpenMP runtime routines

    void TrapezoidIntegration(double a, double b, int n, double* global_result) {
        int i;
        int rank = omp_get_thread_num();
        int threads = omp_get_num_threads();
        double w = (b - a) / n;            // Width of one trapezoid
        int myN = n / threads;             // Trapezoids per thread (assumes threads divides n)
        double myA = a + rank * myN * w;   // Left end of this thread's subinterval
        double myB = myA + myN * w;        // Right end of this thread's subinterval
        double myResult = 0.0;
        for (i = 1; i <= myN - 1; i++)
            myResult += f(myA + i * w);
        #pragma omp critical               // Mutual exclusion
        *global_result += w * (myResult + f(myA)/2 + f(myB)/2);
    }

    int main(int argc, char* argv[]) {
        double result = 0.0;
        int threads = strtol(argv[1], NULL, 10);
        double a = strtod(argv[2], NULL);
        double b = strtod(argv[3], NULL);
        int n = strtol(argv[4], NULL, 10);
        #pragma omp parallel num_threads(threads)
        TrapezoidIntegration(a, b, n, &result);
        printf("%d trapezoids, %f to %f, integral = %.14e\n", n, a, b, result);
        return 0;
    }
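The program above assumes an integrand f() that the slides never show. A minimal sketch to make it complete, assuming gcc as the compiler (the body x*x, the file name trap.c, and the argument values are all illustrative):

    /* Hypothetical integrand for the trapezoid programs;
     * x*x is only a placeholder. */
    double f(double x) {
        return x * x;
    }

    /* Compile and run with gcc (threads=4, a=0.0, b=3.0, n=1024):
     *   gcc -fopenmp trap.c -o trap
     *   ./trap 4 0.0 3.0 1024
     */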
Global Reduction
Each thread can instead return its local result and let a reduction clause combine the results, avoiding the explicit critical section:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    double TrapezoidIntegration(double a, double b, int n) {
        int i;
        int rank = omp_get_thread_num();
        int threads = omp_get_num_threads();
        double w = (b - a) / n;
        int myN = n / threads;
        double myA = a + rank * myN * w;
        double myB = myA + myN * w;
        double myResult = 0.0;
        for (i = 1; i <= myN - 1; i++)
            myResult += f(myA + i * w);
        return w * (myResult + f(myA)/2 + f(myB)/2);
    }

    int main(int argc, char* argv[]) {
        double result = 0.0;
        int threads = strtol(argv[1], NULL, 10);
        double a = strtod(argv[2], NULL);
        double b = strtod(argv[3], NULL);
        int n = strtol(argv[4], NULL, 10);
        #pragma omp parallel num_threads(threads) reduction(+: result)
        result += TrapezoidIntegration(a, b, n);
        printf("%d trapezoids, %f to %f, integral = %.14e\n", n, a, b, result);
        return 0;
    }

Parallel for Loop
Corresponds to the forall construct. Syntax:

    #pragma omp parallel for
    for (i = 0; i < MAX; i++) { /* block of code */ }

• The parallel directive creates a team of threads to execute the specified block of code in parallel
• To make optimal use of system resources, the number of threads in a team is determined dynamically
• By default, an implied barrier follows the loop: no thread continues until all iterations are complete
• Team size is determined by the first of the following that applies, in decreasing order of precedence:
  1. The num_threads clause after the parallel directive, which specifies team size for that particular directive
  2. The most recent call to the omp_set_num_threads() library routine
  3. The environment variable OMP_NUM_THREADS

Illustration of Parallel for Loops
[Figure from "An Introduction Into OpenMP", © 2005 Sun Microsystems]

Loop Restrictions
• The compiler rejects loops that don't follow OpenMP's rules:
  – The number of iterations must be known in advance
  – The loop-control expressions cannot be floats or doubles and cannot change during execution of the loop; the index can only change via the increment part of the for statement

    int Linear_search(int key, int A[], int n) {
        int i;
        #pragma omp parallel for num_threads(thread_count)
        for (i = 0; i < n; i++)
            if (A[i] == key)
                return i;    // Compiler error: invalid exit from OpenMP structured block
        return -1;
    }

Data Dependencies
One iteration depends upon computations of another
• Compiles, but results are inconsistent
• Fibonacci example: f_0 = f_1 = 1; f_i = f_{i-1} + f_{i-2}

    fibo[0] = fibo[1] = 1;
    #pragma omp parallel for num_threads(threads)
    for (i = 2; i < n; i++)
        fibo[i] = fibo[i-1] + fibo[i-2];

• Possible outcomes using two threads:
    1 1 2 3 5 8 13 21 34 55    // correct
    1 1 2 3 5 8 0 0 0 0        // incorrect
• Conclusions
  – Dependencies within a single iteration will work correctly
  – OpenMP compilers don't reject parallel for directives with cross-iteration dependences
  – Avoid attempting to parallelize loops with cross-iteration dependencies

More parallel for Examples

Trapezoidal integration:

    h = (b - a) / n;
    integral = (f(a) + f(b)) / 2.0;
    #pragma omp parallel for \
            num_threads(threads) \
            reduction(+: integral)
    for (i = 1; i <= n - 1; i++)
        integral += f(a + i * h);
    integral = h * integral;

Calculation of π (from the series 1 - 1/3 + 1/5 - 1/7 + ...):

    double sum = 0.0;
    #pragma omp parallel for \
            num_threads(threads) \
            reduction(+: sum) \
            private(factor)
    for (k = 0; k < n; k++) {
        if (k % 2 == 0)
            factor = 1.0;
        else
            factor = -1.0;
        sum += factor / (2 * k + 1);
    }
    double pi = 4 * sum;
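As a self-contained variant of the π loop (a sketch; the term count and thread count are illustrative), note that a variable declared inside the loop body is automatically private to each thread, so the private(factor) clause can be dropped:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        long n = 100000000;    // number of series terms (illustrative)
        double sum = 0.0;

        #pragma omp parallel for num_threads(4) reduction(+: sum)
        for (long k = 0; k < n; k++) {
            // Declared inside the loop body, factor is automatically
            // private; no private() clause is needed.
            double factor = (k % 2 == 0) ? 1.0 : -1.0;
            sum += factor / (2 * k + 1);
        }

        printf("pi ~= %.14f\n", 4.0 * sum);
        return 0;
    }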
Odd/Even Sort
Note: the default(none) clause forces the programmer to specify the scope of all variables

    // Note that for, unlike parallel, does not fork threads; it uses those available.
    // Spawning new threads is an expensive operation, so it is done sparingly.
    #pragma omp parallel num_threads(threads) \
            default(none) shared(a, n) private(i, tmp, phase)
    for (phase = 0; phase < n; phase++) {
        if (phase % 2 == 0) {
            #pragma omp for
            for (i = 1; i < n; i += 2) {
                if (a[i-1] > a[i]) {
                    tmp = a[i-1]; a[i-1] = a[i]; a[i] = tmp;
                }
            }
        } else {
            #pragma omp for
            for (i = 1; i < n - 1; i += 2) {
                if (a[i] > a[i+1]) {
                    tmp = a[i+1]; a[i+1] = a[i]; a[i] = tmp;
                }
            }
        }
    }

Note: there is a default barrier at the end of each inner for construct, which keeps the phases in step

Scheduling of Threads
Clause: schedule(type, chunk)
• static
  – Iterations are assigned to threads before the loop is executed
  – The system assigns chunks of iterations to threads in a round-robin fashion
  – For eight iterations, 0, 1, ..., 7, and two threads:
    schedule(static, 1) assigns 0,2,4,6 to thread 0 and 1,3,5,7 to thread 1
    schedule(static, 4) assigns 0,1,2,3 to thread 0 and 4,5,6,7 to thread 1
• dynamic or guided
  – Iterations are assigned to the threads while the loop is executing
  – After a thread completes its current set of iterations, it requests more
  – guided initially assigns large chunks, which decrease down to chunk size as threads request more work; dynamic always assigns chunk-size sets
• auto: the compiler and/or the run-time system determines the schedule
• runtime: the schedule is chosen at run time, from the OMP_SCHEDULE environment variable

Scheduling Example

    #pragma omp parallel for num_threads(threads) \
            reduction(+: sum) schedule(static, 1)
    for (i = 0; i <= n; i++)
        sum += f(i);

Iterations are assigned before the loop executes, round robin, one iteration at a time to each thread

Sections
The structured blocks are shared among the threads of a team; the sections directive does not create new thread teams

    // Allocate sections among available threads
    #pragma omp sections
    {
        // The first section directive is implied and optional
        #pragma omp section
        { /* structured_block */ }

        // Each section can have its own individual code
        #pragma omp section
        { /* structured_block */ }
        . . .
    }

Notes: sections can be nested; different independent code blocks run simultaneously in sections

OMP: Sequential within Parallel Blocks
• single: the block is executed by exactly one of the threads
  – Syntax: #pragma omp single { /* code */ }
  – Note: there is an implied barrier at the end of the construct unless nowait appears on the pragma line
• master: the block is executed by the master thread
  – Syntax: #pragma omp master { /* code */ }
  – Note: there is no implied barrier in this construct
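A minimal sketch showing the difference in practice (the 4-thread team and the messages are illustrative):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel num_threads(4)
        {
            #pragma omp single
            printf("single: run by exactly one thread (%d)\n",
                   omp_get_thread_num());
            // Implicit barrier here: all threads wait (no nowait given)

            #pragma omp master
            printf("master: run only by thread 0\n");
            // No barrier: the other threads continue immediately
        }
        return 0;
    }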
Critical Sections / Synchronization
• Critical section: #pragma omp critical [name] { /* code */ }
  – A critical section is keyed by its name
  – A thread reaching the critical directive blocks until no other thread is executing the same critical section (one with the same name)
  – The name is optional; if not specified, a global default name is used
• Barrier: #pragma omp barrier
  – Threads wait until all threads reach the barrier; then they all proceed together
  – Caution: all threads must be able to reach the barrier
• Atomic expression: #pragma omp atomic <expression_statement>
  – A critical section that updates a variable with a single simple expression
• Flush: #pragma omp flush (variable_list)
  – The executing thread gets a consistent view of the shared variables
  – Pending read and write operations on the listed variables complete, and their values are written back to memory; memory operations appearing after the flush are not started before it completes, creating a "memory fence"
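A minimal sketch contrasting atomic and a named critical section (the variable names and the 4-thread team are illustrative): atomic suffices for the simple counter update, while the compound read-test-write on the maximum needs a critical section:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int hits = 0;          // shared counter
        double max_val = 0.0;  // shared running maximum

        #pragma omp parallel num_threads(4)
        {
            double local = omp_get_thread_num() * 1.5;  // stand-in for real work

            #pragma omp atomic                // single simple update: atomic is enough
            hits += 1;

            #pragma omp critical(max_update)  // read-test-write must run as one unit
            {
                if (local > max_val)
                    max_val = local;
            }
        }
        printf("hits = %d, max = %.1f\n", hits, max_val);
        return 0;
    }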