Introduction to OpenMP Programming

OpenMP - Introduction
Süha TUNA
Bilişim Enstitüsü (Informatics Institute)
UHeM Yaz Çalıştayı (Summer Workshop) - 21.06.2012
Outline
• What is OpenMP?
– Introduction (Code Structure, Directives, Threads etc.)
– Limitations
– Data Scope Clauses
• Shared, Private
– Work-sharing constructs
– Synchronization
What is OpenMP?
• An Application Program Interface (API) that may be used to
explicitly direct multithreaded, shared memory parallelism
• Three main API components
– Compiler directives
– Runtime library routines
– Environment variables
• Portable & Standardized
– APIs exist for both C/C++ and Fortran 90/77
– Multi-platform support (Unix, Linux, etc.)
OpenMP Specifications
• Version 3.1, Complete Specifications, July 2011
• Version 3.0, May 2008
• Version 2.5, May 2005 (C/C++ & Fortran)
• Version 2.0
– C/C++, March 2002
– Fortran, November 2000
• Version 1.0
– C/C++, October 1998
– Fortran, October 1997
Detailed Info: http://www.openmp.org/wp/openmp-specifications/
Intel & GNU OpenMP
• Intel Compilers
– OpenMP 2.5 conforming
– Nested parallelism
– Workqueuing extension to OpenMP
– Interoperability with POSIX and Windows threads
– OMP_DYNAMIC support
• GNU OpenMP (OpenMP + gcc)
– OpenMP 3.0 support (gcc 4.4 and later)
OpenMP Programming Model
• Explicit parallelism
• Thread-based parallelism; the program runs with a user-specified number of threads
• Uses the fork & join model
[Figure: fork & join execution model. Synchronization points: "barrier", "critical region", "single processor region".]
Limitations of OpenMP
• Shared Memory Model
– Each thread must be able to reach the shared memory (SMP)
• Intel compilers use the POSIX threads library to implement
OpenMP.
Terminology and Behavior
• OpenMP Team = Master + Workers
• A Parallel Region is a block of code executed by all threads
simultaneously (it has an implicit barrier at the end)
– The master thread always has thread id 0
– Parallel regions can be nested
– An if clause can be used to guard the parallel region
• A Work-Sharing construct divides the execution of the
enclosed code region among the members of the team. (Loop,
Section etc.)
OpenMP Code Structure
C/C++

#include <omp.h>

int main () {
    int var1, var2, var3;

    /* Serial code */
    .
    .
    .

    /* Beginning of parallel section. Fork a team of
       threads. Specify variable scoping. */
    #pragma omp parallel private(var1, var2) shared(var3)
    {
        /* Parallel section executed by all threads */
        .
        .
        /* All threads join the master thread and disband */
    }

    /* Resume serial code */
    .
    .
}
Fortran
PROGRAM MYCODE
  USE omp_lib          ! or: INCLUDE "omp_lib.h"

  INTEGER var1, var2, var3

! Serial code
  .
  .
  .

! Beginning of parallel section. Fork a team of
! threads. Specify variable scoping.
!$OMP PARALLEL PRIVATE(var1, var2) SHARED(var3)

! Parallel section executed by all threads
  .
!$OMP BARRIER
  .
! All threads join the master thread and disband
!$OMP END PARALLEL

! Resume serial code
  .
  .
END
OpenMP Directives
• Format in C/C++:
#pragma omp directive-name [clause, ...]
• Format in Fortran 90:
!$OMP directive-name [clause, ...]
• Format in Fortran 77:
C$OMP directive-name [clause, ...]

– #pragma omp / !$OMP / C$OMP: Required for all OpenMP directives.
– directive-name: A valid OpenMP directive. Must appear after the pragma (or sentinel) and before any clauses.
– [clause, ...]: Optional. Clauses can be in any order, and repeated as necessary unless otherwise restricted.
OpenMP Directives
• Example:
#pragma omp parallel default(shared) private(beta,pi)
• General Rules:
– Directives follow conventions of the C/C++ standards for compiler directives.
– Case sensitive.
– Only one directive-name may be specified per directive.
– Each directive applies to at most one succeeding statement, which must be a structured block.
– Long directive lines can be "continued" on succeeding lines by escaping the newline character with a backslash ("\") at the end of a directive line.
OpenMP Directives
• PARALLEL Region Construct:
– A parallel region is a block of code that will be executed by multiple threads.
– This is the fundamental OpenMP parallel construct.

#pragma omp parallel [clause ...] newline
    if (scalar_expression)
    private (list)
    shared (list)
    default (shared | none)
    firstprivate (list)
    reduction (operator: list)
    copyin (list)

    structured_block
OpenMP Directives
C/C++ OpenMP structured block definition.
#pragma omp parallel [clause ...]
{
structured_block
}
Fortran OpenMP structured block definition.
!$OMP PARALLEL [clause ...]
structured_block
!$OMP END PARALLEL
OpenMP Directives
• Parallel region construct … supported clauses are listed above.
• When a thread reaches a PARALLEL directive:
– It creates a team of threads and becomes the master of that team.
– The master is a member of the team and has thread number 0 (thread id) within it.
– Starting from the beginning of the parallel region, the code is duplicated and all threads execute it (possibly along different paths of execution).
– There is an implied barrier at the end of a parallel section.
– Only the master thread continues execution past this point.
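A minimal sketch of this fork/join behavior (not from the original slides; the file name and messages are illustrative):

/* parallel.c -- compile with: gcc -fopenmp parallel.c */
#include <omp.h>
#include <stdio.h>

int main(void) {
    printf("Serial part: only the initial thread runs here\n");

    #pragma omp parallel
    {
        /* This block is duplicated: every thread in the team executes it */
        int tid = omp_get_thread_num();
        if (tid == 0)   /* the master always has thread id 0 */
            printf("Team has %d threads\n", omp_get_num_threads());
        printf("Hello from thread %d\n", tid);
    }   /* implied barrier: only the master continues past this point */

    printf("Serial part again\n");
    return 0;
}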
OpenMP Constructs
Data Scope Attribute Clauses
C/C++: shared (list)        Fortran: SHARED (list)

• SHARED Clause:
– It declares the variables in its list to be shared among all threads in the team.
– Behavior
• The variable exists in a single memory location; it is not duplicated per thread
• All threads reference the original object
• The default scope for variables in OpenMP is SHARED
Data Scope Attribute Clauses
C/C++: private (list)        Fortran: PRIVATE (list)
• PRIVATE Clause:
– It declares variables in its list to be private to each thread.
– Behavior
• A new object of the same type is declared once for each thread in the
team
• All references to the original object are replaced with references to the
new object
• Variables declared PRIVATE are uninitialized for each thread
(FIRSTPRIVATE can be used for initialization of variables)
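A small sketch contrasting the two clauses (not from the original slides; the variable names follow the earlier code-structure template):

/* scope.c -- compile with: gcc -fopenmp scope.c */
#include <omp.h>
#include <stdio.h>

int main(void) {
    int var1 = 42, var3 = 0;

    #pragma omp parallel private(var1) shared(var3) num_threads(4)
    {
        /* var1: each thread has its own uninitialized copy,
           so it must be assigned before use */
        var1 = omp_get_thread_num();

        /* var3: one memory location seen by all threads,
           so concurrent updates need synchronization */
        #pragma omp critical
        var3 += var1;
    }

    /* the private copies are discarded: var1 is still 42 */
    printf("var1 = %d, var3 = %d\n", var1, var3);   /* var3 == 0+1+2+3 = 6 */
    return 0;
}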
Data Scope Attribute Clauses
C/C++: default (shared | none)        Fortran: DEFAULT (PRIVATE | SHARED | NONE)

• DEFAULT Clause:
– It declares the default scope attribute for the variables in a parallel region. (C/C++ allows only shared or none; Fortran also allows private.)
– If not declared, the default is SHARED.
– If declared, the chosen default applies only within that specific data scope.
– You are not encouraged to change the default to PRIVATE: duplicating every variable per thread adds overhead to the parallelization.
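A safer habit is default(none), which forces every variable to be scoped explicitly. A minimal sketch (not from the original slides; names are illustrative):

/* defnone.c -- compile with: gcc -fopenmp defnone.c */
#include <omp.h>
#include <stdio.h>

int main(void) {
    int n = 8, sum = 0;

    /* default(none): the compiler rejects the region if any
       variable used inside lacks an explicit scope clause */
    #pragma omp parallel default(none) shared(n, sum)
    {
        #pragma omp critical
        sum += n;
    }
    printf("sum = %d\n", sum);
    return 0;
}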
Lab: Helloworld
• INTEL
bash: $ ifort -openmp hi-omp.f -o hi-omp.x
hi-omp.f(3) : (col. 6) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
• GCC
bash: $ gcc -fopenmp hi-omp.c -o hi-omp.x
• LSF submission
bash: $ bsub -a openmp -q short -o %J.out -e %J.err -n 4 -x ./hi-omp.x
Lab: Helloworld
Optional Exercise:
1 - Set OMP_NUM_THREADS to a higher value (such as 10)
2 - Uncomment the critical section
3 - Repeat the example
• Set environment variables (setenv, export)
bash: $ export OMP_NUM_THREADS=4
• Run your OpenMP executable
bash: $ ./hi-omp.x
Hello OpenMP!
Hello OpenMP!
Hello OpenMP!
Hello OpenMP!
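The lab file itself is not reproduced in these slides; a minimal sketch of what hi-omp.c might look like (a hypothetical reconstruction; the commented-out critical section matches step 2 of the optional exercise):

/* hi-omp.c (hypothetical) -- gcc -fopenmp hi-omp.c -o hi-omp.x */
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        /* #pragma omp critical */   /* uncomment for step 2 of the exercise */
        {
            printf("Hello OpenMP!\n");
            printf("  ... from thread %d\n", tid);
        }
    }
    return 0;
}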
Work-Sharing Constructs
• A work-sharing construct divides the execution of the enclosed code region among the members of the team that encounter it.
• It must be enclosed in a parallel region; otherwise it is simply ignored.
• Work-sharing constructs do not launch/create new threads.
• There is no implied barrier upon entry to a work-sharing construct; however, there is an implicit barrier at the end of a work-sharing construct.
Work-Sharing Constructs
• Types:
– DO / for: shares iterations of a loop across the team. Represents a type of "data parallelism".
– SECTIONS: breaks work into separate, discrete sections; each section is executed by a thread. Can be used to implement a type of "functional parallelism".
– SINGLE: serializes a section of code.
– WORKSHARE (only available in Fortran): parallelizes array operations, for example A(:,:) = B(:,:) + C(:,:).
Work-Sharing Constructs
• DO directive (Fortran)

!$OMP DO [clause ...]
    SCHEDULE (type [,chunk])
    ORDERED
    PRIVATE (list)
    FIRSTPRIVATE (list)
    LASTPRIVATE (list)
    SHARED (list)
    REDUCTION (operator | intrinsic : list)

    do_loop
!$OMP END DO [ NOWAIT ]
Work-Sharing Constructs
• for directive (C/C++)

#pragma omp for [clause ...] newline
    schedule (type [,chunk])
    ordered
    private (list)
    firstprivate (list)
    lastprivate (list)
    shared (list)
    reduction (operator: list)
    nowait

    for_loop
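A minimal sketch of the for construct in use (illustrative, not one of the lab files):

/* for.c -- compile with: gcc -fopenmp for.c */
#include <omp.h>
#include <stdio.h>

#define N 16

int main(void) {
    int i, a[N];

    #pragma omp parallel shared(a) private(i)
    {
        /* the iterations 0..N-1 are divided among the team */
        #pragma omp for
        for (i = 0; i < N; i++)
            a[i] = i * i;
    }   /* implicit barrier at the end of the for construct */

    for (i = 0; i < N; i++)
        printf("a[%2d] = %3d\n", i, a[i]);
    return 0;
}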
Work-Sharing Constructs
• schedule clause: schedule(kind [,chunk_size])
– static: less overhead, default on many OpenMP compilers
– dynamic & guided: useful for poorly balanced and unpredictable
workload. In guided the size of chunk decreases over time.
– runtime: kind is selected according to the value of environment
variable OMP_SCHEDULE.
– Larger chunks are desirable because they reduce the overhead
– Load balancing is often more of an issue toward the end of
computation
Work-Sharing Constructs
• schedule clause: describes how iterations of the loop are divided among the threads in the team.
– static: loop iterations are divided into pieces of size chunk and statically assigned to threads.
– dynamic: when a thread finishes one chunk, it is dynamically assigned another. The default chunk size is 1.
– guided: the chunk size is exponentially reduced with each dispatched piece of the iteration space. The default chunk size is 1.
Work-Sharing Constructs
• schedule clause:
– runtime: if this schedule is selected, the decision regarding the scheduling kind is made at run time. The schedule and (optional) chunk size are set through the OMP_SCHEDULE environment variable.
• NOWAIT (Fortran) / nowait (C/C++) clause:
– If specified, threads do not synchronize at the end of the parallel loop; they proceed directly to the statements after the loop.
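A sketch combining the schedule and nowait clauses (illustrative; the two loops write independent arrays, which is what makes nowait safe here):

/* nowait.c -- compile with: gcc -fopenmp nowait.c */
#include <omp.h>
#include <stdio.h>

#define N 100

int main(void) {
    int a[N], b[N];

    #pragma omp parallel
    {
        /* dynamic schedule, chunks of 4; nowait lets each thread
           move on to the second loop without waiting for the team */
        #pragma omp for schedule(dynamic, 4) nowait
        for (int i = 0; i < N; i++)
            a[i] = i * i;

        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++)
            b[i] = 2 * i;
    }

    printf("a[%d] = %d, b[%d] = %d\n", N-1, a[N-1], N-1, b[N-1]);
    return 0;
}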
Work-Sharing Lab 1
bash: $ icc -openmp omp_workshare1.c -o omp_workshare1.x
bash: $ ./omp_workshare1.x
• Example steps:
– Examine the code with the 'static' schedule, compile, and run.
– Change to the 'dynamic' schedule and rerun. What changed? The iterations of the loop are now distributed dynamically in chunk-sized pieces.
– Add 'nowait' to the omp for directive. What changed? Threads no longer synchronize upon completing their individual pieces of work.
Work-Sharing Lab 2
• SECTIONS construct:
– Easiest way to get different threads to carry out different kinds of work
– Each section must be a structured block of code that is independent of
the other sections
– If there are fewer code blocks than threads, the remaining threads will be
idle
– If there are fewer threads than code blocks, some or all of the threads
execute multiple code blocks
– Depending on the type of work, this construct might lead to a load-balancing problem
Work-Sharing Lab 2
• SECTIONS construct for 2 functions (or threads)
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        {
            FUNCTION_1(MAX);
        }
        #pragma omp section
        {
            FUNCTION_2(MIN);
        }
    }  /* sections end here */
}      /* parallel ends here */
Work-Sharing Lab 2
bash: $ icc -openmp omp_workshare2.c -o omp_workshare2.x
bash: $ ./omp_workshare2.x
• This example demonstrates the use of the OpenMP SECTIONS work-sharing construct. Note how the PARALLEL region is divided into separate sections, each of which will be executed by one thread.
• Run the program several times and observe any differences in
output. Because there are only two sections, you should notice
that some threads do not do any work.
• You may or may not notice that the threads doing the work vary from run to run. For example, the first time thread 0 and thread 1 may do the work, and the next time it may be thread 0 and thread 3.
Work-Sharing Constructs
• SINGLE Constructs:
– It specifies that the enclosed code is to be executed by only
one thread in the team.
– The thread chosen could vary from one run to another.
– Threads that are not executing in the SINGLE directive wait at
the END SINGLE directive unless NOWAIT is specified.
C/C++
#pragma omp single [clause ...]
structured_block
Fortran
!$OMP SINGLE [clause...]
structured-block
!$OMP END SINGLE [NOWAIT]
Work-Sharing Constructs
• SINGLE Constructs:
[Figure: only one thread initializes the shared variable a.]
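A sketch of that situation (illustrative; assumes a shared integer a as in the figure):

/* single.c -- compile with: gcc -fopenmp single.c */
#include <omp.h>
#include <stdio.h>

int main(void) {
    int a = 0;

    #pragma omp parallel shared(a)
    {
        /* exactly one thread (not necessarily the master) runs this;
           the others wait at the implicit barrier of END SINGLE */
        #pragma omp single
        {
            a = 10;
            printf("thread %d initialized a\n", omp_get_thread_num());
        }

        /* past the barrier, every thread sees a == 10 */
        printf("thread %d sees a = %d\n", omp_get_thread_num(), a);
    }
    return 0;
}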
Synchronization (BARRIER)
C/C++:
#pragma omp barrier newline

Fortran:
!$OMP BARRIER newline

Example: check the barrier.f and barrier.c example code.
• BARRIER Directive:
– Synchronizes all threads in the team.
– When a BARRIER directive is reached, a thread will wait at
that point until all other threads have reached that barrier.
– All threads then resume executing in parallel the code that
follows the barrier.
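A minimal sketch of a two-phase computation that needs the barrier (illustrative, not the barrier.c lab file):

/* two-phase.c -- compile with: gcc -fopenmp two-phase.c */
#include <omp.h>
#include <stdio.h>

int main(void) {
    int ready[64] = {0};

    #pragma omp parallel num_threads(4) shared(ready)
    {
        int tid = omp_get_thread_num();
        ready[tid] = 1;              /* phase 1: each thread fills its slot */

        #pragma omp barrier          /* wait for every thread's phase 1 */

        /* phase 2: now it is safe to read another thread's slot */
        int next = (tid + 1) % omp_get_num_threads();
        printf("thread %d sees ready[%d] = %d\n", tid, next, ready[next]);
    }
    return 0;
}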
Synchronization (BARRIER)
• BARRIER Directive
• Important restrictions
– Each barrier must be encountered by all threads in a team, or
by none at all
– The sequence of work-sharing regions and barrier regions
encountered must be the same for every thread in the team.
• Without these rules some threads wait forever (or until
somebody kills the process) for other threads to reach a
barrier
Synchronization (MASTER)
C/C++:
#pragma omp master newline
    statement_or_expression

Fortran:
!$OMP MASTER newline
    statement_or_expression
!$OMP END MASTER
• MASTER Directive:
– Specifies a region that is to be executed only by the master
thread of the team.
– All other threads on the team skip this section of code
– It is similar to the SINGLE construct
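A short sketch contrasting it with SINGLE (illustrative): unlike SINGLE, MASTER has no implied barrier at its end.

/* master.c -- compile with: gcc -fopenmp master.c */
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        /* only thread 0 executes this; the other threads simply
           skip it without waiting (no implied barrier) */
        #pragma omp master
        printf("master: team of %d threads\n", omp_get_num_threads());

        printf("thread %d working\n", omp_get_thread_num());
    }
    return 0;
}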
Synchronization (ORDERED)
C/C++:
#pragma omp ordered newline
    structured_block

Fortran:
!$OMP ORDERED newline
    structured_block
!$OMP END ORDERED

Example: check the ordered.c example code.
• ORDERED Directive:
– allows one to execute a structured block within a parallel loop in
sequential order
– The code outside this block runs in parallel
– if threads finish out of order, there may be an additional
performance penalty because some threads might have to wait.
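A minimal sketch (illustrative, not the ordered.c lab file); note that the loop directive itself must carry the ordered clause:

/* in-order.c -- compile with: gcc -fopenmp in-order.c */
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel for ordered
    for (int i = 0; i < 8; i++) {
        int sq = i * i;                        /* computed in parallel */

        #pragma omp ordered                    /* printed in iteration order */
        printf("i = %d, i*i = %d\n", i, sq);
    }
    return 0;
}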
Synchronization (CRITICAL)
C/C++:
#pragma omp critical (name)
    structured_block

Fortran:
!$OMP CRITICAL (name)
    structured_block
!$OMP END CRITICAL (name)

Example: correct the critical.F90 and critical.c example code.
• CRITICAL Directive:
– It provides a means to ensure that multiple
threads do not attempt to update the same
shared data simultaneously.
– An optional name can be given to a critical
construct. Name must be global and unique
– When a thread encounters a critical construct,
it waits until no other thread is executing a
critical region with the same name.
[Figure: unsynchronized updates to shared data cause a race condition.]
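A sketch of the kind of race the lab asks you to fix (illustrative, not the critical.c lab file; the construct name is arbitrary):

/* counter.c -- compile with: gcc -fopenmp counter.c */
#include <omp.h>
#include <stdio.h>

int main(void) {
    int counter = 0;

    #pragma omp parallel num_threads(4) shared(counter)
    {
        for (int i = 0; i < 1000; i++) {
            /* without this critical construct, the read-modify-write
               on counter races and the final total is usually wrong */
            #pragma omp critical (update_counter)
            counter++;
        }
    }
    printf("counter = %d (expected 4000)\n", counter);
    return 0;
}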
Synchronization (ATOMIC)
C/C++:
#pragma omp atomic newline
    expression_statement

Fortran:
!$OMP ATOMIC newline
    expression_statement

Example: check the atomic.c example code.
• ATOMIC Directive:
– Specifies that a specific memory location must be updated
atomically, rather than letting multiple threads attempt to write
to it.
– In essence, this directive provides a mini-CRITICAL section. It
is an efficient alternative to the critical region
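A minimal sketch (illustrative, not the atomic.c lab file); only the update of sum is protected, not the evaluation of the right-hand side:

/* sum.c -- compile with: gcc -fopenmp sum.c */
#include <omp.h>
#include <stdio.h>

int main(void) {
    int sum = 0;

    #pragma omp parallel num_threads(4) shared(sum)
    {
        /* atomic guards this single memory update; it is a cheaper
           alternative to a critical section for simple statements */
        #pragma omp atomic
        sum += omp_get_thread_num() + 1;
    }
    printf("sum = %d (expected 10 with 4 threads)\n", sum);
    return 0;
}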
THANK YOU! (TEŞEKKÜRLER!)