Week 10 Power Point Slides

OpenMP: Open [M]ulti-[P]rocessing
Pthreads: The programmer explicitly defines thread behavior
OpenMP: The compiler and run-time system define thread behavior
Pthreads: Library is independent of the compiler
OpenMP: Requires compiler support
Pthreads: Low-level, with maximum flexibility
OpenMP: Higher-level, with less flexibility
Pthreads: The application is parallelized all at once
OpenMP: The programmer can incrementally parallelize an application
Pthreads: Difficult to program high-performance parallel applications
OpenMP: Much simpler to program high-performance parallel applications
Pthreads: Explicit fork/join or detached-thread model
OpenMP: Implicit fork/join model
Creating OpenMP Programs
• In C, make use of the preprocessor
General Syntax: #pragma omp directive [clause [clause] ...]
Note: To extend the syntax to multiple lines, end a line with the '\' character
• Error checking: adapt if compiler support is lacking
#ifdef _OPENMP              // Only include the header if it exists
#include <omp.h>
#endif

#ifdef _OPENMP              // Get the thread count and rank if OpenMP exists
int rank = omp_get_thread_num();
int threads = omp_get_num_threads();
#else                       // Default to one thread, rank zero, without OpenMP support
int rank = 0;
int threads = 1;
#endif
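A minimal, self-contained sketch (not from the slides) that puts these guards together; it builds with or without OpenMP support (compiling with, e.g., gcc -fopenmp defines _OPENMP):

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>            /* only available when the compiler supports OpenMP */
#endif

int main(void)
{
#pragma omp parallel         /* treated as an unknown pragma (and ignored) without OpenMP */
    {
#ifdef _OPENMP
        int rank = omp_get_thread_num();
        int threads = omp_get_num_threads();
#else
        int rank = 0, threads = 1;
#endif
        printf("Hello from thread %d of %d\n", rank, threads);
    }
    return 0;
}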
OpenMP Parallel Regions
[Figure: parallel regions, from "An Introduction Into OpenMP", ©2005 Sun Microsystems]
Parallel Directive
Initiates execution of a parallel structured block:
#pragma omp parallel { /* code here */ }
Overview
• The compiler, encountering the omp parallel pragma, creates
multiple threads
• All threads execute the specified structured block of code
• A structured block can be either a single statement or a
compound statement created with { ...} with a single entry
point and a single exit point
• There is an implicit barrier at the end of the construct
• Within the block, we can declare variables to be
– Private: local to a particular thread
– Shared: visible to all threads
Example
int x, threads;
#pragma omp parallel private(x, threads)
{
    x = omp_get_thread_num();
    threads = omp_get_num_threads();
    a[x] += threads;        // a[] is a shared global array; "threads" matches the declaration above
}
• omp_get_num_threads() returns number of active threads
• omp_get_thread_num() returns rank (counting from 0)
• x and threads are private variables, local to the threads
• Array a[] is a global shared array
• Note: a[x] in this example is not a critical section (a[x] is a
unique location for each thread)
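A self-contained version of this example (a sketch; the array size, the 64-thread limit, and the printout are choices made here for illustration):

#include <stdio.h>
#include <omp.h>

#define MAX_THREADS 64          /* assumption: at most 64 threads are used */

int a[MAX_THREADS];             /* shared: visible to all threads */

int main(void)
{
    int x, threads;
#pragma omp parallel private(x, threads)
    {
        x = omp_get_thread_num();
        threads = omp_get_num_threads();
        a[x] += threads;        /* each thread updates its own slot: no race */
    }
    for (x = 0; x < omp_get_max_threads(); x++)
        printf("a[%d] = %d\n", x, a[x]);
    return 0;
}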
Matrix Times Vector
[Figure: matrix-times-vector example, from "An Introduction Into OpenMP", ©2005 Sun Microsystems]
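The original slide reproduces a figure from the Sun deck rather than code. As a stand-in, here is a hedged sketch of the idea it illustrates: each row's dot product is independent, so the outer loop parallelizes cleanly. The function name, row-major storage layout, and the use of the combined parallel for directive (introduced later in these slides) are choices made here for illustration.

/* y = A * x, with A stored row-major as m rows of n doubles.
   Each iteration of the outer loop writes a distinct y[i], so the
   rows can be computed by different threads without synchronization. */
void matvec(int m, int n, const double *A, const double *x, double *y)
{
    int i, j;
#pragma omp parallel for private(j)
    for (i = 0; i < m; i++) {
        double sum = 0.0;               /* declared in the loop body: private to each iteration */
        for (j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];
        y[i] = sum;
    }
}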
Trapezoidal Rule
Area of one trapezoid: h * (f(xi) + f(xi+1)) / 2
Calculate the approximate integral
• Sum up a bunch of adjacent trapezoids
• Each trapezoid has the same width h = (b-a)/n
• Single-trapezoid approximation: (b-a) * (f(a) + f(b)) / 2
• A closer approximation: ((b-a)/n) * [f(x0)/2 + f(x1) + f(x2) + ··· + f(xn)/2]
Sequential algorithm (given a, b, and n):
w = (b - a) / n;                                    // Width of each trapezoid
integral = 0;
for (i = 1; i <= n - 1; i++) { integral += f(a + i*w); }   // Interior points x1 ... x(n-1)
integral = w * (integral + f(a)/2 + f(b)/2);        // The approximate result
Parallel Trapezoidal Integration
// Requires <stdio.h>, <stdlib.h>, <omp.h>, and a definition of f()
void TrapezoidIntegration(double a, double b, int n, double *global_result)
{
    int i, rank = omp_get_thread_num(), threads = omp_get_num_threads();
    double w = (b - a) / n;                 // Width of each trapezoid
    int myN = n / threads;                  // Trapezoids handled by this thread
    double myA = a + rank * myN * w;        // This thread's sub-interval [myA, myB]
    double myB = myA + myN * w;
    double myResult = 0;
    for (i = 1; i <= myN - 1; i++) { myResult += f(myA + i*w); }   // Interior points
#   pragma omp critical                     // Mutual exclusion on the shared total
    *global_result += w * (myResult + f(myA)/2 + f(myB)/2);
}

int main(int argc, char *argv[])
{
    double result = 0.0, a = atof(argv[2]), b = atof(argv[3]);
    int threads = strtol(argv[1], NULL, 10);
    int n = strtol(argv[4], NULL, 10);      // Assumption: trapezoid count passed as argv[4]
#   pragma omp parallel num_threads(threads)
    TrapezoidIntegration(a, b, n, &result);
    printf("%d trapezoids, %f to %f, Integral = %.14e\n", n, a, b, result);
    return 0;
}
Global Reduction
double TrapezoidIntegration(double a, double b, int n)   // Returns this thread's share of the integral
{
    int i, rank = omp_get_thread_num(), threads = omp_get_num_threads();
    double w = (b - a) / n;
    int myN = n / threads;
    double myA = a + rank * myN * w, myB = myA + myN * w;
    double myResult = 0;
    for (i = 1; i <= myN - 1; i++) { myResult += f(myA + i*w); }
    return w * (myResult + f(myA)/2 + f(myB)/2);
}

int main(int argc, char *argv[])
{
    double result = 0.0, a = atof(argv[2]), b = atof(argv[3]);
    int threads = strtol(argv[1], NULL, 10);
    int n = strtol(argv[4], NULL, 10);      // Assumption: trapezoid count passed as argv[4]
#   pragma omp parallel num_threads(threads) reduction(+: result)
    result += TrapezoidIntegration(a, b, n);
    printf("%d trapezoids, %f to %f, Integral = %.14e\n", n, a, b, result);
    return 0;
}
Parallel for loop
Corresponds to the forall construct
Syntax: #pragma omp parallel for
        for (i = 0; i < MAX; i++) { /* block of code */ }
(or #pragma omp for inside an existing parallel region)
• The parallel directive creates a team of threads to execute a
specified block of code in parallel
• To optimally use system resources, the number of threads in a
team is determined dynamically
• By default, an implied barrier follows the end of the loop (all threads wait there until every iteration is done)
• Any one of the following three approaches determines the team size (listed from lowest to highest precedence; see the sketch after this list):
1. The environment variable OMP_NUM_THREADS
2. Calling the omp_set_num_threads() library routine
3. A num_threads clause on the parallel directive, which sets the team size for that particular directive
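A brief sketch (not from the slides) contrasting the three approaches; each later mechanism overrides the earlier ones:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* 1. Environment variable, set before the program runs:
          export OMP_NUM_THREADS=8                                   */

    /* 2. Library call: overrides the environment variable */
    omp_set_num_threads(4);
#pragma omp parallel
    printf("region A: %d threads\n", omp_get_num_threads());

    /* 3. num_threads clause: overrides both, for this region only */
#pragma omp parallel num_threads(2)
    printf("region B: %d threads\n", omp_get_num_threads());
    return 0;
}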
Illustration of parallel for loops
[Figure: parallel for loop illustration, from "An Introduction Into OpenMP", ©2005 Sun Microsystems]
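In place of the figure, a minimal sketch (array names and size chosen here for illustration) of a loop whose iterations are divided among the team of threads:

#define N 1000
double a[N], b[N], c[N];

void vector_add(int threads)
{
    int i;
    /* Each thread gets a share of the iterations; the implied barrier at the
       end of the loop guarantees c[] is fully written before execution continues. */
#pragma omp parallel for num_threads(threads)
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}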
Data Dependencies
• The compiler rejects loops that don't follow OpenMP rules:
– The number of iterations must be known before the loop starts
– The loop bounds and increment cannot be floats or doubles and cannot change
during execution of the loop; the index may be changed only by the
increment expression of the for statement
int Linear_search(int key, int A[], int n)
{
    int i;
#   pragma omp parallel for num_threads(thread_count)
    for (i = 0; i < n; i++)
        if (A[i] == key) return i;
    return -1;
} // Compiler error: invalid exit from OpenMP structured block
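One legal way around this (a sketch, not from the slides; it requires OpenMP 3.1 or later for the min reduction): let the loop run to completion and record the smallest matching index instead of returning early.

int Linear_search(int key, int A[], int n)
{
    int i, result = n;                     /* n acts as "not found" */
#   pragma omp parallel for num_threads(thread_count) reduction(min: result)
    for (i = 0; i < n; i++)
        if (A[i] == key && i < result)
            result = i;                    /* single entry, single exit: legal */
    return (result == n) ? -1 : result;
}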
Data Dependencies
One iteration depends upon computations of another
• Compiles, but results are inconsistent
• Fibonacci example: f[0] = f[1] = 1; f[i] = f[i-1] + f[i-2]
fibo[0] = fibo[1] = 1;
#   pragma omp parallel for num_threads(threads)
    for (i = 2; i < n; i++) fibo[i] = fibo[i-1] + fibo[i-2];
• Possible outcomes using two threads
1 1 2 3 5 8 13 21 34 55 // correct, 1 1 2 3 5 8 0 0 0 0 // incorrect
• Conclusions
– Dependencies within a single iteration will work correctly
– OpenMP compilers do not detect or reject cross-iteration dependences under a parallel for directive
– Avoid attempting to parallelize loops with cross-iteration dependencies
More par/for examples
Trapezoidal Integration
h = (b - a) / n;
integral = (f(a) + f(b)) / 2.0;
#   pragma omp parallel for \
        num_threads(threads) \
        reduction(+: integral)
    for (i = 1; i <= n - 1; i++)
        integral += f(a + i*h);
integral = h * integral;
Calculation of π
double sum = 0.0;
#   pragma omp parallel for \
        num_threads(threads) \
        reduction(+: sum) \
        private(factor)
    for (k = 0; k < n; k++)
    {
        if (k % 2 == 0) factor = 1.0;
        else factor = -1.0;
        sum += factor / (2*k + 1);
    }                               // 1 - 1/3 + 1/5 - 1/7 + ...
double pi = 4 * sum;
Odd/Even Sort
Note: the default(none) clause forces the programmer to specify the scope of every variable used in the region
// Note that for, unlike parallel, does not fork new threads; it uses the threads already available
// Spawning new threads is expensive, so it should be done sparingly
#   pragma omp parallel num_threads(threads) \
        default(none) shared(a, n) private(i, tmp, phase)
    for (phase = 0; phase < n; phase++)
    {
        if (phase % 2 == 0) {
#           pragma omp for
            for (i = 1; i < n; i += 2)
            { if (a[i-1] > a[i]) { tmp = a[i-1]; a[i-1] = a[i]; a[i] = tmp; } }
        } else {
#           pragma omp for
            for (i = 1; i < n - 1; i += 2)
            { if (a[i] > a[i+1]) { tmp = a[i+1]; a[i+1] = a[i]; a[i] = tmp; } }
        }
    }
Note: There is an implied barrier at the end of each inner omp for construct, i.e., after each phase
Scheduling of Threads
Clause: schedule( type, chunk)
• Static
– Iterations are assigned to threads before the loop is executed.
– System assigns chunks of iterations to threads in a round robin fashion
– For eight iterations 0, 1, . . . , 7 and two threads:
• schedule(static,1) assigns 0,2,4,6 to thread 0 and 1,3,5,7 to thread 1
• schedule(static,4) assigns 0,1,2,3 to thread 0 and 4,5,6,7 to thread 1
• Dynamic or guided
– Iterations are assigned to the threads while the loop is executing.
– After a thread completes its current set of iterations, it requests more
– Guided initially assigns large chunks, which shrink toward the chunk size as
threads request more work; dynamic always hands out chunks of the given size
• auto: The compiler and/or the run-time system determine the schedule
• runtime: The schedule is chosen at run time, from the OMP_SCHEDULE environment variable (or omp_set_schedule())
Scheduling Example
#pragma omp parallel for num_threads( threads) \
reduction ( + : sum ) schedule ( static , 1 )
for ( i = 0 ; i <= n ; i++)
sum += f ( i ) ;
Iterations are assigned to the threads before the loop executes, in
round-robin fashion, one iteration at a time
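For contrast, a hedged sketch (not from the slides) mirroring the example above but with dynamic scheduling, which suits loops whose iterations have uneven cost:

/* Chunks of 16 iterations are handed out on demand, so threads that finish
   early simply request more work instead of sitting idle. */
#pragma omp parallel for num_threads(threads) \
        reduction(+: sum) schedule(dynamic, 16)
for (i = 0; i <= n; i++)
    sum += f(i);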
Sections
The structured blocks are shared among threads of a team
The sections directive does not create new thread teams
// Allocate sections among available threads
#pragma omp sections
{ // The first section directive is implied and optional
#pragma omp section
{ /* structured_block */ }
// Each section can have its own individual code
#pragma omp section
{ /* structured_block */ }
. . .
}
Notes: sections can be nested.
The independent code blocks run simultaneously,
each section executed by one thread of the team.
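A short sketch of two independent tasks running as separate sections (the function names are illustrative, not from the slides):

#pragma omp parallel sections num_threads(2)
{
    #pragma omp section
    { sort_left_half(a, n / 2); }                 /* one thread runs this block  */

    #pragma omp section
    { sort_right_half(a + n / 2, n - n / 2); }    /* another thread runs this one */
}
/* Implied barrier here: both halves are done before execution continues. */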
OMP: Sequential within Parallel Blocks
• Single: the block executed by one of the threads
– Syntax: #pragma omp single {//code}
– Note: There is an implied barrier at the end of the
construct unless nowait appears on the pragma line
• Master: The block executed by the master thread
– Syntax: #pragma omp master {//code}
– Note: There is no implied barrier in this construct
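A brief sketch contrasting the two inside one parallel region (a fragment; assumes <stdio.h> and <omp.h>):

#pragma omp parallel
{
    #pragma omp single
    printf("printed once, by whichever thread gets here first\n");
    /* implied barrier: all threads wait until the single block is done */

    #pragma omp master
    printf("printed once, by thread 0 only\n");
    /* no barrier: the other threads continue past immediately */
}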
Critical Sections/ Synchronization
• Critical Sections: #pragma omp critical (name) { /* code */ }
– A critical section is keyed by its name.
– Thread reaching the critical directive blocks until no other thread is executing
the same critical section (one with the same name)
– The name is optional. If not specified, a global default name is used
• Barrier: #pragma omp barrier
– Threads wait till all threads reach the barrier; then they all proceed together
– Caution: All threads must be able to reach the barrier
• Atomic expression: #pragma omp atomic <expressionStatement>
– A critical section updating a variable by executing a simple expression
• Flush: #pragma omp flush (variable_list)
– The executing thread gets a consistent view of the shared variables
– Current read and write operations on the variables complete and values are
written back to memory. New memory operations in the code after flush are
not started, creating a “memory fence”.
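A small sketch (variable and section names are illustrative) contrasting atomic and a named critical section; a reduction clause would be faster for the sum, but atomic is shown to illustrate the construct:

double total = 0.0, max = a[0];
int i;
#pragma omp parallel for num_threads(threads)
for (i = 0; i < n; i++) {
    #pragma omp atomic              /* cheap protection for one simple update   */
    total += a[i];

    #pragma omp critical (maxval)   /* named section: guards only this block    */
    { if (a[i] > max) max = a[i]; }
}
/* The implied barrier at the end of the loop ensures every thread has
   finished before total and max are used. */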