Shared Memory Parallelism Programming in the SGI Environment

Parallel Programming on SGI
• Automatic parallelization with APC
– Fortran and C compiler option: -apo
– HPF: pghpf (Portland Group), xHPF (third party)
• Parallelization by inserting OpenMP compiler directives (using the -mp compiler option)
  – Fortran: !$omp parallel, !$omp do
  – C: #pragma omp parallel
• Low-level parallelism
  – POSIX standard: Pthreads
  – SGI library: sproc()
  – UNIX native: fork()
  – Sequent library: m_fork()
• Message Passing
– PVM, MPI
– shmem
• Definition of the number of parallel threads (man pe_environ):
  – SMP (APC and OpenMP, sproc): setenv OMP_NUM_THREADS n (see the sketch below)
  – MPI: mpirun -np n prog
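As a minimal sketch (assuming an OpenMP-aware Fortran compiler invoked with -mp), the thread count selected with setenv OMP_NUM_THREADS n can be checked from inside the program:

      PROGRAM NTHREADS
      INTEGER OMP_GET_NUM_THREADS
C$OMP PARALLEL
C$OMP MASTER
C     prints the value selected with: setenv OMP_NUM_THREADS n
      PRINT *, 'running with', OMP_GET_NUM_THREADS(), 'threads'
C$OMP END MASTER
C$OMP END PARALLEL
      END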
TM
SMP Programming
The parallelism in the code:
• can be found by the compiler
• can be asserted by the programmer with directives that help
the compiler analyze the code
– compiler analyzes data access rules based on the additional
information presented
• can be forced by the programmer with directives/pragmas
– compiler has to be told about data access rules
(shared/private)
– explicit synchronization is often necessary to access the
shared data
• can be forced by the programmer with library calls
– POSIX Pthreads
Automatic Parallelization

% setenv OMP_NUM_THREADS 4
Automatically Parallelizable Code
• linear assignments:
      DO I=1,N
        A(I) = B(I) + X(I)
      ENDDO
• reductions:
      DO I=1,N
        S = S + X(I)
      ENDDO
• triangular assignments:
      DO I=1,N
        DO J=I,N
          A(J,I) = B(J,I) + X(I)
        ENDDO
      ENDDO
• NOT parallelizable:
  – early exits
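For instance, a search loop with an early exit cannot be parallelized automatically, because the trip count depends on the data (a made-up sketch):

      DO I=1,N
C       the loop must stop at the first negative element,
C       so the iterations cannot run in an arbitrary order
        IF (X(I) .LT. 0.0) THEN
          IFIRST = I
          GOTO 10
        ENDIF
      ENDDO
 10   CONTINUE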
Compiler Assertions
Assertions help the compiler to deal with data dependencies that
cannot be analyzed based on program semantics alone:
C*$* (NO) CONCURRENTIZE
#pragma (no) concurrentize
– (not) to apply parallel optimizations, the scope is the entire compilation unit
C*$* ASSERT DO (SERIAL)
C*$* ASSERT DO (CONCURRENT)
#pragma serial
#pragma concurrent
– implement the required action for the loop (nest) immediately following the
directive
C*$* ASSERT PERMUTATION(A) #pragma permutation(a)
– similar to the IVDEP directive, asserts that the indirect addressing with index
array A in the following loop nest will not alias to the same storage location
C*$* ASSERT CONCURRENT CALL #pragma concurrent call
– asserts that the following loop nest can be parallel despite subroutine
invocations that cannot be analyzed for side effects
The -LNO:ignore_pragmas compiler option disables these directives.
Assertion Examples
      do …
        do …
c*$* assert do (serial)
          do …
            do …
            enddo
          enddo
        enddo
      enddo

The compiler will not parallelize the outer loops, since the inner loop
must be serial; it might parallelize the inner loops.
C*$* assert concurrent call
do I=1,n
call do_parallel_work(I)
enddo
C*$* assert permutation(i1)
do k=1,nsite
A(i1(k)) = A(i1(k))+force*B(k)*B(k)
enddo
• The compiler analyzes the code using the information given in the assertion and
parallelizes automatically where possible
• Even with the c*$* assert do (concurrent) directive, the compiler will not make
loops parallel where it finds obvious data dependencies
• If that does not work, the next step is to introduce parallelization directives with
explicit rules for variable usage given by the user (e.g. OpenMP semantics)
Manual Parallelization
When parallelizing a program (or part of it) manually,
one needs to consider the following:
• what is the highest level of parallelism
(coarse-grain vs. fine-grain parallelism)
• which data should be shared and which private
• how the data should be distributed between processors
(a special optimization for CC-NUMA programming)
• how the work should be distributed between processors
(or select automatic work distribution by the compiler)
• whether threads can have simultaneous access to shared data, such
that race conditions can occur
Creation of Parallel Regions
With OpenMP, the syntax is:

C$OMP PARALLEL [clause[,clause]…]
      ……… code to run in parallel
C$OMP END PARALLEL

#pragma omp parallel [clause[,clause]…]
{
  /* this code is run in parallel OMP_NUM_THREADS times */
}

clause specifies the data association rules (default, shared, private)
for each variable (array, common block).

Internally, the compiler generates a subroutine call to the body of the
parallel code. The data definitions determine how variables are
passed to that subroutine.
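A minimal sketch of a parallel region with explicit data association clauses (the routine and variable names are made up for the illustration):

      SUBROUTINE SCALE(A, N, FACTOR)
      INTEGER N, I
      REAL A(N), FACTOR
C$OMP PARALLEL DEFAULT(NONE) SHARED(A,N,FACTOR) PRIVATE(I)
C$OMP DO
      DO I = 1, N
C        A, N and FACTOR are shared; the loop index I is private
         A(I) = A(I)*FACTOR
      ENDDO
C$OMP END DO
C$OMP END PARALLEL
      END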
Old syntax:
C$DOACROSS
!$par parallel
      ……
!$par end parallel
#pragma parallel
Data Status Association
In a serial program:
• global data
– variables at a fixed well known address (global variables)
• subroutine arguments, common blocks, externals
• local data
– variables with address known only to a single block of code
• automatic variables, (re-)created on the stack
• variables with address allocated at load time (static)
» all DATA, SAVE, -static compiled programs
• dynamic variables allocated at run time
» malloc, ALLOCATE, etc.
Parallel program has additional qualification:
• shared data
– variables that can be accessed by every thread
• private data
– variables with address known only to a single thread
• volatile data
– special variables whose value can change outside the program's control
• synchronization data
Data Status Transition in Parallel Programs
In parallel programs the following transition takes place:

  Serial program            Parallel program
  Global variables          Shared variables
  Automatic variables       Private variables
  Static local variables    Shared variables
  Dynamic variables         Shared or private variables
• It is safe to work with automatic local variables that are allocated
on the stack. In a parallel region each thread of execution gets its
own private stack, so these variables are automatically private.
• Especially hazardous are static local variables. Since they are
statically allocated rather than placed on the stack, their address is
the same in every thread (see the sketch below).
• Dynamic variables are mostly created on the heap and are therefore
shared in a parallel program. However, when the allocation happens
from inside a parallel region these variables become private.
• OpenMP syntax allows the status of each variable to be declared
explicitly. However, in a call hierarchy the variables assume their
default status as specified above.
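For instance, a local variable with the SAVE attribute has a single statically allocated copy that every thread reads and writes (a made-up sketch of the hazard):

      subroutine accumulate(x)
      real x, total
      save total
c     one copy of total is shared by all threads: calling this
c     routine from a parallel region creates a race condition
      total = total + x
      end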
Volatile Variables
Volatile variables get special attention from the compiler
• data that is updated may be kept in a register as part of
compiler optimization
• a write to shared data in one thread may therefore not be visible
to another thread until the register is flushed to memory
• this is mostly incorrect behaviour that can lead to a deadlock or
incorrect results
• variables whose updates must propagate immediately to all threads
have to be declared volatile
• the compiler performs no optimizations on volatile variables, nor on
the code around them, so that every write to a volatile variable
is flushed to memory
• there is a performance penalty associated with volatile variables
• Syntax:
  – Fortran: VOLATILE var-list
  – C:       volatile type var-list
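For example, a shared completion flag used in a hand-rolled wait loop needs the attribute (a hypothetical sketch; the flag name is made up):

      LOGICAL DONE
      VOLATILE DONE
      COMMON /FLAGS/ DONE
C     one thread eventually sets DONE = .TRUE.;
C     without VOLATILE the waiting thread could spin forever
C     on a stale register copy of DONE
      DO WHILE (.NOT. DONE)
         CONTINUE
      ENDDO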
Synchronization Variables
To be found in man sync(3F):
• call synchronize
  – full memory barrier (FMB): memory references will not cross the FMB
• atomic operations that perform j OP k and return the old or the new value:
  – i = fetch_and_OP(j,k)        <- atomically i=j, j = j OP k
  – i = OP_and_fetch(j,k)        <- atomically j = j OP k, i=j
  – OP stands for the functions add, sub, or, and, xor, nand
  – i,j,k,l can be INTEGER*4 or INTEGER*8 variables
• l = compare_and_swap(j,k,i)    <- (j==k) ? j=i, l=1 : l=0
• i = lock_test_and_set(j,k)     <- atomically: i=j, j=k
• call lock_release(i)           <- atomically: i=0
All these functions are guaranteed to be atomic and are performed on
globally visible arguments.
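As a sketch built on the interface listed above (the counter variable and its use are made up for the example), a shared counter could be advanced atomically by several threads:

      INTEGER*4 COUNTER, MYTICKET, fetch_and_add
      COMMON /CNT/ COUNTER
C     each thread draws a unique ticket; fetch_and_add returns
C     the value COUNTER held before the atomic increment
      MYTICKET = fetch_and_add(COUNTER, 1)
C     ensure earlier memory references have completed
      call synchronize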
Workload Distribution
• Load imbalance is the condition when some processors do more
work than others. Load imbalance leads to bad scalability.
• Load imbalance can be diagnosed by profiling the program and
comparing the execution times for code in different threads.
• In OpenMP, the compiler accepts directives for work distribution
(see the sketch below):
  – C$OMP DO SCHEDULE(type[,chunk]) where type is
    • STATIC: iterations are divided into pieces at compile time (default)
      [Figure: SCHEDULE(STATIC,6), 26 iterations on 4 processors]
    • DYNAMIC: iterations are assigned to processors dynamically, as they
      finish. This requires synchronization after each chunk of iterations.
    • GUIDED: pieces reduce exponentially in size with each dispatched piece
      [Figure: SCHEDULE(GUIDED,4), 26 iterations on 4 processors]
    • RUNTIME: schedule determined by the environment variable OMP_SCHEDULE.
      With RUNTIME it is illegal to specify chunk.
• The compiler accepts the -mp_schedtype=type switch
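A sketch of a schedule clause on a loop with uneven per-iteration cost (WORK stands in for some uneven computation; the names are made up):

C$OMP PARALLEL DO SCHEDULE(DYNAMIC,8) SHARED(A,N) PRIVATE(I)
      DO I = 1, N
C        chunks of 8 iterations are handed to threads as they go idle,
C        which evens out the load when iterations differ in cost
         A(I) = WORK(I)
      ENDDO
C$OMP END PARALLEL DO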
Example: Parallel Region in FORTRAN
Compiler directives are enabled with the -mp switch or the -apo
switch. With -apo automatic parallelization is requested.

      SUBROUTINE MXM(A,B,C,L,M,N,N1)
      DIMENSION A(M,L), B(N1,M), C(N1,L)
      DO 100 K=1,L
      DO 100 I=1,N
         C(I,K) = 0.0
 100  CONTINUE

This first loop nest will run serially unless -apo is specified.

Default variable type association rules: all variables are shared
(i.e. A,B,C are shared). The induction variables I,J,K will be found
automatically and assigned private scope.

C$OMP PARALLEL DO
      DO 110 K=1,L
      DO 110 I=1,N
      DO 110 J=1,M
         C(I,K) = C(I,K) + B(I,J)*A(J,K)
 110  CONTINUE
      END
• In this example all loops can be made parallel automatically (-apo)
Parallelization Examples/3
Code with shared data
      common //a(im,jm),b(im,jm),c(im,jm),d(im,jm),f(im)
      do j=1,jm
         call dowork(j)
      enddo
      end

      subroutine dowork(j)
      common //a(im,jm),b(im,jm),c(im,jm),d(im,jm),f(im)
      do I=1,im
         f(I) = (a(I,j) + b(I,j))*3.141592*j
         c(I,j) = f(I)**3
      enddo
      end

Making the j loop parallel will lead to incorrect results, because all
threads will use the same address for the scratch array f.
Three solutions are possible:
• the scratch array f should be declared local in the subroutine
• f should be isolated in a special common block /scratch/, which
  – could be declared c$omp threadprivate(/scratch/) (see the sketch below), or
  – could be loaded with the flag -Wl,-Xlocal,scratch_
• f could be extended to f(*,numthreads), one column per thread
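A sketch of the threadprivate variant, following the slide's shorthand (im and jm are assumed to be parameters; each thread then gets its own copy of /scratch/):

      subroutine dowork(j)
      common //a(im,jm),b(im,jm),c(im,jm),d(im,jm)
      common /scratch/ f(im)
c$omp threadprivate(/scratch/)
c     f now lives in a per-thread copy of /scratch/, so parallel
c     calls to dowork no longer overwrite each other's scratch data
      do I=1,im
         f(I) = (a(I,j) + b(I,j))*3.141592*j
         c(I,j) = f(I)**3
      enddo
      end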
False Sharing
Bad data or work distribution can lead to false sharing
for(i=myid; i<n; i+= nthr) {
a[i] = b[i] + c[i];
}
• example:
  – processor P0 does only the even iterations, processor P1 only the odd iterations
  – both processors read and write the same cache lines
  – each write invalidates the cache line of the peer processor
[Figure: a single cache line holding a[0]…a[9] is shared by P0 and P1;
a[1] is loaded by P0 but not used, a[0] is loaded by P1 but not used, etc.]
[Figure: performance as a function of the number of processors,
comparing normal behaviour with false sharing.]
• each processor runs the cache coherency protocol on the cache lines, thus the cache lines will ping-pong from one cache to another
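One way to avoid this, sketched here in Fortran in the style of the document's other examples (myid, nthr, n and the arrays are assumed to exist), is to give each thread a contiguous block of iterations so that different threads rarely touch the same cache line:

c     block distribution instead of interleaved elements
      integer chunk, ilow, ihigh
      chunk = (n + nthr - 1)/nthr
      ilow  = myid*chunk + 1
      ihigh = min((myid + 1)*chunk, n)
      do i = ilow, ihigh
         a(i) = b(i) + c(i)
      enddo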
Scalability & Data Distribution
The Scalability of parallel programs might be affected by:
• Amdahl's law for the parallel part of the code
– plot elapsed run times as function of number of processors
• Load balancing and work distribution to the processors
– profiling the code will reveal significant run times for libmp library routines
– compare perfex for event counter 21 (fp operations) for different threads
• False sharing of the data
– perfex high value of cache invalidation events (event counter 31: store/prefetch
exclusive to shared block in scache)
• Excessive synchronization costs
– perfex high value for store conditionals (event counter 4)
When cache misses account for only a small percentage of the run
time, the program makes good use of the memory and its
performance will not be affected by data placement
– diagnose significant memory load with perfex statistics: high value of memory traffic
Program scalability will be improved with data placement if there is:
• memory contention on one node
• excessive cache traffic
Why Distribute?
CC-NUMA architecture optimization. If:
• all data is initialized on one node
  – memory bottleneck
• all shared data is on one node
  – serialization of the CC (cache coherency) traffic
• data is used from a different node
  – large interconnect traffic
  – average net latency on the SGI 3800 with p processors:
    ~(1/2 log2(p/16) + 1)*50 ns
[Figure: SGI node diagrams; each node holds R1x000 processors with L2
caches connected through the Bedrock ASIC to the local Mem/Dir memory.]
SGI Additions to OpenMP
Data Distribution Extensions:
• IRIX dplace command
• setenv _DSM_PLACEMENT
(FIRST_TOUCH|ROUND_ROBIN)
• compiler directives:
– !$sgi distribute a(*,BLOCK,CYCLIC(5))
– !$sgi distribute_reshape a(*,BLOCK,…)
Default Distribution: “First Touch”
IRIX strategy for data placement:
• respect program distribution directives
• allocation on the node where data is initialized
• allocation on the node where memory is available
PARAMETER (PAGESIZE=16384, SIZEDBLE=8)
REAL, ALLOCATABLE:: WORK(:)!! Distribution directive not available
READ *, MX
ALLOCATE(WORK(MX))
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(WORK,MX)
ID = OMP_GET_THREAD_NUM()
NP = OMP_GET_NUM_THREADS()
NPAGES = MX*SIZEDBLE/PAGESIZE
DO I=ID,NPAGES-NP+1,NP
!! round-robin distribution of pages:
WORK(1+I*PAGESIZE/SIZEDBLE) = 0.0
ENDDO
!$OMP END PARALLEL
Data Placement Policies
[Figures: four-node diagrams (each node with a HUB, local memory and two
cpus) illustrate the resulting page placement for each policy.]

Initialization with the "first touch" policy on a single processor:
• all data on a single node
• bottleneck in access to that node

Initialization with the "first touch" policy with multiple processors:
• data is distributed naturally
• each processor has local data to run on
• minimal data exchange between nodes
• page edge effects

Initialization with the "round robin" policy with a single processor:
• circular page placement on the nodes
• on average, all pages are remote
• no single bottleneck
• significant traffic between nodes
Example: Data Distribution
To avoid the bottle neck resulting from memory allocation
on a single node:
• Perform the initialization in parallel, such that each processor
initializes the data that it is likely to access later for calculation
(see the sketch below)
• Use explicit data placement compiler directives
(e.g. C$DISTRIBUTE)
REAL*8 A(N),B(N),C(N),D(N)
C$DISTRIBUTE A(BLOCK),B(BLOCK),C(BLOCK),D(BLOCK)
DO I=1,N
A(I) = 0.0
B(I) = I/2
C(I) = I/3
D(I) = I/7
ENDDO
C$OMP PARALLEL DO
DO I=1,N
A(I) = B(I) + C(I) + D(I)
ENDDO
• Enable environment variable for data placement
(setenv _DSM_PLACEMENT)
• Use dplace command (with appropriate options) to start the
program execution
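A sketch of the parallel, first-touch initialization for the example above (same arrays; the loop scheduling should match the compute loop so each thread touches the pages it will later use):

C$OMP PARALLEL DO
      DO I=1,N
C        each thread first-touches the pages of its block of A,B,C,D
         A(I) = 0.0
         B(I) = I/2
         C(I) = I/3
         D(I) = I/7
      ENDDO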
Data Initialization: Example
The F90 allocatable arrays should be initialized explicitly to place
the pages:
PARAMETER (PAGESIZE=16384, SIZEDBLE=8)
REAL, ALLOCATABLE:: WORK(:)
READ *, MXRVAR
ALLOCATE(WORK(MXRVAR))
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(WORK,MXRVAR)
ID = OMP_GET_THREAD_NUM()
NP = OMP_GET_NUM_THREADS()
NBLOCK = MXRVAR/NP
NPAGES = MXRVAR*SIZEDBLE/PAGESIZE
NLOWER = ID*NBLOCK
NUPPER = NLOWER + NBLOCK
NLOWER = NLOWER + 1
NUPPER = MIN(NUPPER,MXRVAR)
c round-robin distribution of pages:
DO I=ID,NPAGES-NP+1,NP
WORK(1+I*PAGESIZE/SIZEDBLE) = 0.0
ENDDO
!$OMP END PARALLEL
Data Distribution Directives
Compiler supports the following data distribution directives:
!$SGI DISTRIBUTE A(shape[,shape]…) ONTO(n[,m]…)
#pragma distribute …
– define regular distribution of an array with given shape
– the shape can be
• BLOCK
• CYCLIC(k)
• *
(default: do not distribute)
– the placement of the pages will not violate Fortran/C rules for data usage and therefore
the exact shape is not guaranteed, in particular no padding at page boundary
!$SGI DISTRIBUTE_RESHAPE A(shape[,shape]…) ONTO(n[,m]…)
#pragma distribute_reshape …
– define a reshaped distribution of an array with the given shape, i.e. an exact mapping of the data
to the given shape. The shape parameters are as for the DISTRIBUTE directive
!$SGI DYNAMIC A
#pragma dynamic …
– permit dynamic redistribution of an array A
!$SGI REDISTRIBUTE A(shape[,shape]…)
#pragma redistribute …
– executable statement; forces redistribution of a (previously distributed) array A
!$SGI PAGE_PLACE A(…)
#pragma page_place …
– specify distribution by pages
!$SGI AFFINITY(i)=[DATA(A(I))|THREAD(I)]    #pragma affinity …
– the affinity clause specifies that iteration i should execute on the processor whose
memory contains element A(i), or on the processor that runs thread i
Data Distribution: Examples/1
Regular data distributions to 4 processors:
1) !$SGI DISTRIBUTE X(BLOCK)
2) X(CYCLIC) or X(CYCLIC(1))
3) X(CYCLIC(3))
4) Y(*,BLOCK)
      Real*8 Y(1000,1000)
c$sgi distribute Y(*,block)
   each individual portion is a contiguous piece of size 8*10^6/P bytes,
   i.e. for 4 processors it is 2 MB per piece
5) Y(BLOCK,*)
      Real*8 Y(1000,1000)
c$sgi distribute Y(block,*)
   each individual portion is still of size 8*10^6/P bytes, but each
   contiguous piece is only 8*10^3/P bytes,
   i.e. for 4 processors it is 2 KB (< 16 KB page size)
6) Y(BLOCK,BLOCK)
If the requested distribution incurs significant page boundary effects
(example: (5)), reshaping of the array data is necessary.
Data Distributions: Examples/2
Regular data distributions on 6 processors;
the ONTO clause changes the aspect ratio:
7) Y(BLOCK,BLOCK)
8) Y(BLOCK,BLOCK) ONTO (2,3)
9) Y(BLOCK,BLOCK) ONTO (1,*)
The arguments of the ONTO clause should be constant integers.
Regular data distribution can be re-distributed:
C$SGI DISTRIBUTE A(BLOCK,BLOCK,BLOCK) ONTO (2,2,8)
C$SGI DYNAMIC A
…
c - executable statement, can appear anywhere in the code:
C$SGI REDISTRIBUTE A(BLOCK,BLOCK,BLOCK) ONTO (1,1,32)
Data Reshaping Restrictions/1
The DISTRIBUTE_RESHAPE directive declares that the program
makes no assumptions about the storage layout of the array.
Array reshaping is legal only under certain conditions:
• an array can be declared with only one of these options
– default (no (re-)shaping)
– distribute
– distribute_reshape
– it will keep the declared attribute for the duration of the program
• reshaped array cannot be re-distributed
• a reshaped array cannot be equivalenced to another array
– explicitly through an EQUIVALENCE statement
– implicitly through multiple declarations of a common block
• if an array in a common block is reshaped, all declarations of that common
block must:
– contain an array at the same offset within the common block
– declare that array with the same number of dimensions and same size
– specify the same reshaped distribution for the array
Data Reshaping Restrictions/2
• when a reshaped array is passed as an actual argument to a subroutine:
  – if the reshaped array is passed as a whole (i.e. call sub(A)), the formal parameter
    should match exactly the number of dimensions and their sizes
  – if only elements of the reshaped array are passed (i.e. call sub(A(0)) or call
    sub(A(I))), the called procedure treats the incoming parameter as non-distributed
  – for elements of a reshaped array, the declared bounds on the formal parameter must
    not exceed the size of the distributed array portion passed in as the actual
    argument. Example:

      Real*8 A(1000)
c$ distribute_reshape A(cyclic(5))
      do I=1,1000,5
         call mysub(A(I))
      enddo
      end

      subroutine mysub(X)
      real*8 X(5)
      …
      end

• the compiler performs consistency checks on the legality of the reshaping:
  – -MP:dsm=ON                    (default)
  – -MP:check_reshape={ON|OFF}    (default OFF): generation of run-time consistency checks
  – -MP:clone={ON|OFF}            (default ON): automatically generate clones of
    procedures called with reshaped parameters
Data Reshaping Restrictions/3
Consider also the following restrictions:
• BLOCK DATA (i.e. statically) initialized data cannot be reshaped
• alloca(3) or malloc(3) arrays cannot be reshaped
• F90: a dynamically sized automatic array can be distributed, but an F90
allocatable array cannot
• I/O for a reshaped array cannot be mixed with namelist I/O or a function call in the
same I/O statement
• a common block containing a reshaped array cannot be linked -Xlocal
Data Distribution: Affinity
Usage of the AFFINITY clause:
• data form:
REAL*8 A(N),B(N),Q
!$SGI DISTRIBUTE A(BLOCK),B(BLOCK)
C INITIALIZATION:
DO I=1,N
A(I) = 1 - I/2
B(I) = -10 + 0.01*(I*I)
ENDDO
C REAL WORK:
DO IT=1,NITERS
Q = 0.01*IT
C$OMP PARALLEL DO
C$SGI AFFINITY(I)=DATA(A(I))
DO I=1,N
A(I) = A(I) + Q*B(I)
ENDDO
ENDDO
thread form:
REAL*8 A(N),B(N),Q
!$SGI DISTRIBUTE A(BLOCK),B(BLOCK)
C INITIALIZATION:
DO I=1,N
A(I) = 1 - I/2
B(I) = -10 + 0.01*(I*I)
ENDDO
C REAL WORK:
DO IT=1,NITERS
Q = 0.01*IT
C$OMP PARALLEL DO
C$SGI AFFINITY(I)=thread(I/(N/P))
DO I=1,N
A(I) = A(I) + Q*B(I)
ENDDO
ENDDO
iterations are scheduled such that each processor does iterations
over data that is placed in the memory near to that processor
The dplace Program
dplace - shell level tool for Data Placement:
• declare number of threads to use
• declare number of memories to use
• specify the topology of the memories
• specify the distribution of threads on to memories
• specify the distribution of virtual addresses on to memories
• specify text/data/stack page sizes
• specify migration policy and threshold
• dynamically migrate virtual address ranges between memories
• dynamically move threads from one node to another
dplace [-place file] [-verbose] [-mustrun]
       [-data_pagesize size] [-stack_pagesize size] [-text_pagesize size]
       [-migration threshold] program arguments
• see man dplace(1)
Note: dplace switches off the _DSM_… environment variables
dplace: Placement File
The file must include:
• the number of memories being used and their topology:
– memories M in topology [none|cube|cluster|physical near]
• number of threads
– threads N
• assignment of threads to memories:
– run thread n on memory m [using cpu k]
– distribute threads [t0:t1:[:dt]] across memories
[m0:m1[:dm]] [block [m]] | [cyclic [k]]
• optional specifications for policy:
– policy stack|data|text pagesize N [k|K]
– policy migration n [%]
– mode verbose [on|off|toggle]
Example:
memories ($OMP_NUM_THREADS+1)/2 in topology cube
threads $OMP_NUM_THREADS
distribute threads across memories
Data Distribution: page_place
Useful for irregular and dynamically allocated data structures:
#pragma page_place (A, index-range, thread-id)
C$SGI PAGE_PLACE (A, index-range, thread-id)
      Parameter (n=8*1024*1024)
      real*8 a(n)
      p=1
c$    p = omp_get_max_threads()
c- distribute using block mapping
      npp=(n + p-1)/p          ! Number of elements per processor
c$omp parallel do
      do I=0,p-1
c$sgi page_place (a(1+I*npp), npp*8, I)
      enddo
Note: page_place is useful only for initial placement of the data, it
has no effect if pages have been created already
Environment Variables: Process
OMP_NUM_THREADS
(default min(8,#cpu))
• sets the number of threads to use during execution. If OMP_DYNAMIC is
set to TRUE, this is the maximum number of threads
OMP_DYNAMIC TRUE|FALSE
(default TRUE)
• enables or disables dynamic adjustment of parallel threads to load on the
machine. The adjustment can take place at the start of each parallel region
Setting OMP_DYNAMIC to FALSE will switch MPC_GANG ON
MPC_GANG ON|OFF
(default OFF)
• With gang scheduling all parallel threads are started simultaneously and
the time quantum is twice as long as normal. To obtain higher priority for
parallel jobs it is often necessary to use:
setenv OMP_DYNAMIC FALSE
setenv MPC_GANG OFF
_DSM_MUSTRUN
(default not set)
• locks each thread to the cpu on which it happened to be started
_DSM_PPM
(default 2/4 processes per node)
• if _DSM_PPM is set to 1 only one process will be started on each node
Environment Variables: Size
MP_SLAVE_STACKSIZE
(default 16 MB (4 MB >64 threads))
• stack size of the slave processes. The stack space is private; an adjustment
of this size might be necessary if the machine has constraints on memory
MP_STACK_OVERFLOW
(default ON)
• stack size checking. Some programs overflow their allocated stack, while
still functioning correctly. This variable has to be set OFF in that case.
PAGESIZE_STACK    (default 16 KB)
PAGESIZE_DATA     (default 16 KB)
PAGESIZE_TEXT     (default 16 KB)
• specifies the page size for the stack, data and text segments. Sizes can be
specified in KBytes or MBytes in powers of 2: 64 KB, 256 KB, 1 MB, 4 MB.
If the system does not have the requested pages available, it might silently
choose a smaller page size.
Environment Variables: DSM
_DSM_PLACEMENT
(default FIRST_TOUCH)
• memory allocation policy for stack, data and text segments. This variable
can be set to the following values:
– FIRST_TOUCH: allocate pages close to the process which requested them
– ROUND_ROBIN: allocate pages alternating among the nodes where the processes run
_DSM_OFF
(default not set)
• data placement directives will not be honoured if _DSM_OFF is set to ON.
In particular, this is the mechanism the dplace program uses to force
the specified data distribution in an executable. E.g.
      dplace -place file myprogram
will set _DSM_OFF to ON and proceed with the specifications in file.
_DSM_VERBOSE
(default not set)
• when set, reports memory allocation decisions to stdout.
See the man pe_environ(5) page for other useful environment variables.
Case Study: Maxwell Code
REAL EX(NX,NY,NZ),EY(NX,NY,NZ),EZ(NX,NY,NZ) !Electric field
REAL HX(NX,NY,NZ),HY(NX,NY,NZ),HZ(NX,NY,NZ) !Magnetic field
…
      DO K=2,NZ-1
      DO J=2,NY-1
      DO I=2,NX-1
        HX(I,J,K)=HX(I,J,K)-(EZ(I,J,K)-EZ(I,J-1,K))*CHDY
     &                     +(EY(I,J,K)-EY(I,J,K-1))*CHDZ
        HY(I,J,K)=HY(I,J,K)-(EX(I,J,K)-EX(I,J,K-1))*CHDZ
     &                     +(EZ(I,J,K)-EZ(I-1,J,K))*CHDX
        HZ(I,J,K)=HZ(I,J,K)-(EY(I,J,K)-EY(I-1,J,K))*CHDX
     &                     +(EX(I,J,K)-EX(I,J-1,K))*CHDY
      ENDDO
      ENDDO
      ENDDO
here NX=NY=NZ = 32, 64, 128, 256 (i.e. with real*4 elements: 0.8MB, 6.3MB, 50MB, 403MB)
• For scalar optimization, NX,NY,NZ have been adjusted to uneven
values and the arrays H,E have been padded to avoid cache thrashing
• Now parallelization can be attempted to further increase
performance on a parallel machine
– automatic parallelization with the -apo compiler option
– manual parallelization approach
• Study the performance implications of the different parallelization
approaches
Case Study: Maxwell Code
Automatic Parallelization
The Maxwell program is automatically parallelized with:
f77 -apo -O3 maxwell.f
There are 3 parallel sections, inside an outer iteration loop used for statistics:
• (1) initialization
• (2) calculation of H
• (3) calculation of E

c (1) initialization
      DO K=1,NZ
      DO J=1,NY
      DO I=1,NX
        HX(I,J,K)=1E-3/(I+J+K)
        EX(I,J,K)=1E-3/(I+J+K)
        ….
      ENDDO
      ENDDO
      ENDDO

c (2) calculation of H
      DO K=2,NZ-1
      DO J=2,NY-1
      DO I=2,NX-1
        HX(I,J,K)=HX(I,J,K)-(EZ(I,J,K)-EZ(I,J-1,K))*CHDY
     &                     +(EY(I,J,K)-EY(I,J,K-1))*CHDZ
        HY(I,J,K)=HY(I,J,K)-(EX(I,J,K)-EX(I,J,K-1))*CHDZ
     &                     +(EZ(I,J,K)-EZ(I-1,J,K))*CHDX
        HZ(I,J,K)=HZ(I,J,K)-(EY(I,J,K)-EY(I-1,J,K))*CHDX
     &                     +(EX(I,J,K)-EX(I,J-1,K))*CHDY
      ENDDO
      ENDDO
      ENDDO

c (3) calculation of E
      DO K=2,NZ-1
      DO J=2,NY-1
      DO I=2,NX-1
        EX(I,J,K)=EX(I,J,K)-(HZ(I,J,K)-HZ(I,J-1,K))*CHDY
     &                     +(HY(I,J,K)-HY(I,J,K-1))*CHDZ
        EY(I,J,K)=EY(I,J,K)-(HX(I,J,K)-HX(I,J,K-1))*CHDZ
     &                     +(HZ(I,J,K)-HZ(I-1,J,K))*CHDX
        EZ(I,J,K)=EZ(I,J,K)-(HY(I,J,K)-HY(I-1,J,K))*CHDX
     &                     +(HX(I,J,K)-HX(I,J-1,K))*CHDY
      ENDDO
      ENDDO
      ENDDO
There are no data dependencies in any of the loops;
the compiler divides the k-loop iterations among N processors.
Scalability of Auto-Parallelization
[Figure: speedup of the auto-parallelized code vs. number of processors for the different problem sizes.]
• super-linear speedup when the data fits into the cache
• slow-down due to load balancing when the k-loop range is too small
• the parallel k-loop is DO K=2,NZ-1, therefore the performance peaks
  at the "perfect" points 62/2 and 126/2
Case Study: Workload Distribution
Parallelization with directives:
      c$omp parallel do
      c$sgi nest(k,j)
placed in front of the properly nested loop nest.
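A sketch of these directives applied to the H-field update loop shown earlier (same arrays and constants; only the HX statement is repeated here):

c$omp parallel do
c$sgi nest(k,j)
      DO K=2,NZ-1
      DO J=2,NY-1
      DO I=2,NX-1
C        the k and j loops are treated as a single parallel nest
        HX(I,J,K)=HX(I,J,K)-(EZ(I,J,K)-EZ(I,J-1,K))*CHDY
     &                     +(EY(I,J,K)-EY(I,J,K-1))*CHDZ
      ENDDO
      ENDDO
      ENDDO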
Case Study: Data Placement
• Since the default page placement strategy is "first touch", all pages end up
in the memory of the single node where the program starts; data traffic
from/to that single node limits the scalability.
• Parallel initialization can be forced to be done serially with the directive
c*$* assert do (serial).
• Data distribution directive: c$distribute
• Data placement with an environment variable:
setenv _DSM_PLACEMENT ROUND_ROBIN
There is then no single bottleneck, but most data is remote.
[Figure annotations: write-back traffic; most data is in cache.]
Parallel Thread Creation
There are several mechanisms to create a parallel processing stream:
• Library calls with a shared address space:
  – sproc()          (shared memory fork)
  – m_fork()         (simpler sproc)
  – pthread_create   (POSIX threads)
  – start_pes        (shmem init routine)
• Library calls with a disjoint address space:
  – fork()           (UNIX)
  – MPI_Init()       (MPI start up)
• Compiler level (compiler directives):
  – C$DOACROSS                              (SGI old syntax)
  – c$par parallel                          (ANSI X3H5 spec)
  – !$omp parallel / #pragma omp parallel   (OpenMP specs)
• Automatic parallelization (-apo)
• At low level, all shared memory parallelizations pass through the sproc
call to create threads of control with a global address space
Compile Time Actions
Each parallel region (either found automatically or via directive):
• is converted into a subroutine call executed by each parallel thread
• The subroutine contains the execution body of the parallel region
• the arguments to the subroutine contain scheduling parameters (loop ranges)
Subroutine daxpy(n,a,x,y)
real*8 a,x(n),y(n)
do I=1,n
y(I) = y(I)+a*x(I)
enddo
end
• the new subroutine name:
  – is derived as _mpdo_sub_#
  – will appear in the profile
• data in the parallel region:
  – private data: local (automatic) variables in the newly created subroutine
  – shared data: passed to the new subroutine by its well-known address
Subroutine daxpy(n,a,x,y)
real*8 a,x(n),y(n)
integer*4 args
_LOCAL_NTRIP = n/__mp_sug_numthreads_func$()
_LOCAL_START = 1+__mp_my_thread_id()*_LOCAL_NTRIP
_INCR = 1
args(1) = _LOCAL_START
args(2) = _LOCAL_NTRIP
args(3) = _INCR
args(4) = 0
call sproc$(_mpdo_daxpy_1,%VAL(PR_SALL),args)
END
subroutine _mpdo_daxpy_1(_LOCAL_START,
_LOCAL_NTRIP,_INCR,_INF)
do I=_LOCAL_START,_LOCAL_NTRIP
y(I) = y(I)+a*x(I)
enddo
end
Run Time Actions
[Figure: timeline of the master process p0 and the slaves p1…pn-1:
mp_setup()/sproc() creates the shared memory process group; during serial
execution the slaves spin in __mp_slave_wait_for_work(); at !$omp parallel
the master calls __mp_simple_sched(par_loop) and every thread p0…pn-1 runs
its part of par_loop; the threads meet in the barrier
(__mp_barrier_master()/__mp_barrier_nthreads(), __mp_wait_for_completion());
then the master resumes serial execution.]
When the parallel threads are spawned:
• they spin in __mp_slave_wait_for_work()
When the master enters a parallel region:
• it passes the address of the _mpdo subroutine to the slaves
• at the end of the parallel region all threads synchronize
• the slaves go back to spinning in __mp_slave_wait_for_work()
• the master continues with serial execution
Timing and Profiling MP Program
• To obtain timing information for an MP program
– setenv OMP_NUM_THREADS n
– time a.out
– example:
n=1 time output:
9.678u 0.091s 0:09.86 98.9% 0+0k 0+1io 0pf+0w
n=2 time output:
10.363u 0.121s 0:05.72 183.2% 0+0k 3+8io 3pf+0w
from this output we derive the following performance data:

  processors   User+System   Elapsed
  1            9.77          9.86
  2            10.48         5.72

therefore the parallel overhead is (10.48 - 9.77)/9.77 = 7.3%,
and the speedup is S(2) = 9.86/5.72 = 1.72
• use Speedshop to obtain profiling information for an MP program
– setenv OMP_NUM_THREADS n
– ssrun -exp pcsamp a.out
– each thread will create a separate (binary) profiling file
      foreach file (a.out.pcsamp.*)
         prof $file
      end
MP SpeedShop Example Output
Slave execution:

  samples   secs(%)        cum.secs(%)    function
  103       1s(20.6)       1s(20.6)       divfree (./CWIROS:DIVFREE.f)
  96        0.96s(19.2)    2s(39.7)       ROWKAP (./CWIROS:KAPNEW.f)
  67        0.67s(13.4)    2.7s(53.1)     ITER (./CWIROS:CHEMIE.f)
  40        0.4s(8.0)      3.1s(61.1)     FLUXY (./CWIROS:KAPNEW.f)
  35        0.35s(7.0)     3.4s(68.1)     __expf (.. fexp.c)
  30        0.3s(6.0)      3.7s(74.1)     __mp_barrier_nthreads (..libmp.so)
  26        0.26s(5.2)     4s(79.2)       FLUXX (./CWIROS:KAPNEW.f)
  24        0.24s(4.8)     4.2s(84.0)     KAPPA (./CWIROS:KAPNEW.f)
  23        0.23s(4.6)     4.4s(88.6)     __mp_slave_wait_for_work (.. libmp.s)
  11        0.11s(2.2)     4.5s(90.8)     TWOSTEP (./CWIROS:CHEMIE.f)

Master execution:

  samples   secs(%)        cum.secs(%)    function
  124       1.2s(24.0)     1.2s(24.0)     divfree (./CWIROS:DIVFREE.f)
  93        0.93s(18.0)    2.2s(42.1)     ROWKAP (./CWIROS:KAPNEW.f)
  76        0.76s(14.7)    2.9s(56.8)     ITER (./CWIROS:CHEMIE.f)
  36        0.36s(7.0)     3.3s(63.8)     FLUXY (./CWIROS:KAPNEW.f)
  27        0.27s(5.2)     3.6s(69.0)     __expf (.. libm/fexp.c)
  23        0.23s(4.5)     3.8s(73.4)     FLUXX (./CWIROS:KAPNEW.f)
  22        0.22s(4.3)     4s(77.7)       KAPPA (./CWIROS:KAPNEW.f)
  15        0.15s(2.9)     4.2s(80.6)     __mp_barrier_master (.. libmp.so)
  15        0.15s(2.9)     4.3s(83.5)     diff (./CWIROS:DIFF.f)
  9         0.09s(1.7)     4.4s(85.3)     TWOSTEP (./CWIROS:CHEMIE.f)
• compare times (seconds) for each important subroutine to determine load balance
• evaluate contribution of libmp: pure overhead
Example Profile Analysis
Running:   ssrun -fpcsampx pipe.x
Analyzing: prof pipe.x.fpcsampx.*

  func (par reg subr)   thread 0   thread 1   funcT
  __mpdo_output_3         19.7       19.5      39.2
  __mpdo_output_2          3.5        3.4       6.9
  __mpdo_output_1          1.6        1.6       3.2
  matrix                   9.3        9.4      18.7
  …
  __mpdo_predic_3          4.7        4.7       9.5
  __mpdo_predic_2          4.6        4.6       9.2
  __mpdo_predic_1          4.7        4.8       9.5

S(2 threads) and S(par reg subr) compare these times with the serial
(funcS, scalar) times: the output routines sum to 49.3 against 10.7 in
the scalar code (False Sharing); matrix takes 18.7 against 18.6 (OK);
the predic routines sum to 28.2 against 26.8.
Summary
• Auto-parallelizing compilers can do a lot, but they are limited
to loop-level parallelism and in their analysis of data dependencies
• Manual parallelization with directives is easy to implement,
however care must be given to the treatment of data
• It is important to understand the exact semantics of the
parallelization directives
• To obtain scalable performance it is often necessary to make
program parallel at coarse-grain level, e.g. across subroutine
invocations
• To obtain scalable performance it is often necessary to take care
of proper data distribution on the machine
• It is necessary to profile parallel programs to diagnose cases of
load imbalance and parallelization overhead.