Shared Memory Parallelism Programming in the SGI Environment

Parallel Programming on SGI
• Automatic parallelization with APC
  – Fortran and C compiler option: -apo
  – HPF: pghpf (Portland Group), xHPF (third party)
• Parallelization by inserting OpenMP compiler directives (using the -mp compiler option)
  – Fortran: !$omp parallel, !$omp do
  – C: #pragma omp parallel
• Low-level parallelism
  – POSIX standard: Pthreads
  – SGI library: sproc()
  – UNIX native: fork()
  – Sequent library: m_fork()
• Message passing
  – PVM, MPI
  – shmem
• Definition of the number of parallel threads (man pe_environ)
  – SMP (APC, OpenMP, sproc): setenv OMP_NUM_THREADS n
  – MPI: mpirun -np n prog

SMP Programming
The parallelism in the code:
• can be found by the compiler
• can be asserted by the programmer with directives that help the compiler analyze the code
  – the compiler analyzes data access rules based on the additional information presented
• can be forced by the programmer with directives/pragmas
  – the compiler has to be told about data access rules (shared/private)
  – explicit synchronization is often necessary to access the shared data
• can be forced by the programmer with library calls
  – POSIX Pthreads

Automatic Parallelization
      % setenv OMP_NUM_THREADS 4

Automatically Parallelizable Code
• linear assignments:
      DO I=1,N
        A(I) = B(I) + X(I)
      ENDDO
• reductions:
      DO I=1,N
        S = S + X(I)
      ENDDO
• triangular assignments:
      DO I=1,N
        DO J=I,N
          A(J,I) = B(J,I) + X(I)
        ENDDO
      ENDDO
• NOT parallelizable:
  – early exits

Compiler Assertions
Assertions help the compiler to deal with data dependencies that cannot be analyzed based on program semantics alone:
• C*$* (NO) CONCURRENTIZE / #pragma (no) concurrentize
  – (do not) apply parallel optimizations; the scope is the entire compilation unit
• C*$* ASSERT DO (SERIAL), C*$* ASSERT DO (CONCURRENT) / #pragma serial, #pragma concurrent
  – apply the required action to the loop (nest) immediately following the directive
• C*$* ASSERT PERMUTATION(A) / #pragma permutation(a)
  – similar to the IVDEP directive; asserts that the indirect addressing with index array A in the following loop nest will not alias to the same storage location
• C*$* ASSERT CONCURRENT CALL / #pragma concurrent call
  – asserts that the following loop nest can be run in parallel despite subroutine invocations that cannot be analyzed for side effects
The -LNO:ignore_pragmas compiler option disables these directives.

Assertion Examples
      do …
        do …
c*$* assert do (serial)
          do …
            do …
            enddo
          enddo
        enddo
      enddo
The compiler will not parallelize the outer loops, since the inner loop must be serial; it might parallelize the inner loops.

c*$* assert concurrent call
      do I=1,n
        call do_parallel_work(I)
      enddo

c*$* assert permutation(i1)
      do k=1,nsite
        A(i1(k)) = A(i1(k)) + force*B(k)*B(k)
      enddo

• The compiler analyzes the code given the information in the assertion and parallelizes automatically if possible.
• Even with the c*$* assert do (concurrent) directive, the compiler will not make loops parallel where it finds obvious data dependencies.
• If that does not work, the next step is to introduce parallelization directives with explicit rules for variable usage given by the user (e.g. OpenMP semantics).
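As a minimal sketch of that next step (plain C with OpenMP, compiled with -mp or an equivalent OpenMP switch; the array names are illustrative, not from the course examples), the data-sharing and reduction rules are stated explicitly instead of being left to compiler analysis:

    #include <stdio.h>
    #define N 1000

    int main(void)
    {
        double a[N], b[N], x[N], s = 0.0;

        for (int i = 0; i < N; i++) { b[i] = i; x[i] = 0.5 * i; }

        /* explicit data-sharing rules: arrays shared, loop index private,
           partial sums combined with a reduction clause */
        #pragma omp parallel for shared(a, b, x) reduction(+:s)
        for (int i = 0; i < N; i++) {
            a[i] = b[i] + x[i];   /* linear assignment */
            s += x[i];            /* reduction */
        }

        printf("s = %f\n", s);
        return 0;
    }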
Manual Parallelization
When parallelizing a program (or part of it) manually, one needs to consider the following:
• what is the highest level of parallelism (coarse-grain vs fine-grain parallelism)
• which data should be shared and which should be private
• how the data should be distributed between processors (a special optimization for CC-NUMA programming)
• how the work should be distributed between processors (or select automatic work distribution by the compiler)
• can threads have simultaneous access to shared data, such that race conditions can occur

Creation of Parallel Regions
With OpenMP, the syntax is:

Fortran:
      C$OMP PARALLEL [clause[,clause]…]
      ……… code to run in parallel
      C$OMP END PARALLEL

C:
      #pragma omp parallel [clause[,clause]…]
      {
        /* this code is run in parallel OMP_NUM_THREADS times */
      }

• clause specifies the data association rules (default, shared, private) for each variable (array, common block)
• internally, the compiler generates a subroutine call to the body of the parallel code; the data definitions determine how variables are passed to that subroutine

Old syntax:
      C$DOACROSS
      !$par parallel … !$par end parallel
      #pragma parallel

Data Status Association
In a serial program:
• global data – variables at a fixed, well-known address (global variables)
  – subroutine arguments, common blocks, externals
• local data – variables with an address known only to a single block of code
  – automatic variables, (re-)created on the stack
  – variables with an address allocated at load time (static)
    » all DATA, SAVE, -static compiled programs
  – dynamic variables allocated at run time
    » malloc, ALLOCATE, etc.
A parallel program has additional qualifications:
• shared data – variables that can be accessed by every thread
• private data – variables with an address known only to a single thread
• volatile data – special variables that can be changed outside of the program
• synchronization data

Data Status Transition in Parallel Programs
In parallel programs the following transition takes place:

  Serial program           Parallel program
  Global variables         Shared variables
  Automatic variables      Private variables
  Static local variables   Shared variables
  Dynamic variables        Shared or private variables

• It is safe to work with automatic local variables that are allocated on the stack. In a parallel region each thread of execution gets its own private stack, so these variables are automatically private.
• Especially hazardous are static local variables. Since these are allocated on the heap, their address is the same in different threads.
• Dynamic variables are mostly created on the heap and are therefore shared in a parallel program. However, when the allocation happens from within a parallel region, these variables become private.
• OpenMP syntax allows declaration of the status of each variable. However, in a call hierarchy the variables assume their default status as specified above.
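A small C sketch of these default transitions (hypothetical variable names; the static case is exactly the hazard described above):

    #include <stdio.h>

    int global_counter = 0;          /* global data  -> shared by all threads    */

    void worker(void)
    {
        int stack_local = 0;         /* automatic    -> private (thread's stack) */
        static int static_local = 0; /* static local -> shared: same address in
                                        every thread, a potential race           */
        stack_local++;               /* safe without synchronization             */
        #pragma omp atomic
        static_local++;              /* must be protected if updated in parallel */
        #pragma omp atomic
        global_counter++;            /* shared: also needs protection            */
    }

    int main(void)
    {
        #pragma omp parallel
        worker();
        printf("global_counter = %d\n", global_counter);
        return 0;
    }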
Volatile Variables
Volatile variables get special attention from the compiler:
• data that gets updated may be allocated in a register as part of compiler optimization
• a write to shared data in one thread may therefore not be visible to another thread until the register is flushed to memory
• this is mostly incorrect behaviour that can lead to a deadlock or incorrect results
• variables whose updates have to propagate immediately to all threads have to be declared volatile
• the compiler performs no optimizations on volatile variables, nor on the code around them, so that every write to a volatile variable is flushed to memory
• there is a performance penalty associated with volatile variables
• Syntax:
  – Fortran: VOLATILE var-list
  – C: volatile "type" var-list

Synchronization Variables
To be found in man sync(3F):
• call synchronize
  – full memory barrier: memory references will not cross the FMB
• perform the operation between j and k atomically and return the old or the new value:
  – i = fetch_and_OP(j,k)   <- atomically: i=j, then j = j OP k (returns the old value)
  – i = OP_and_fetch(j,k)   <- atomically: j = j OP k, then i=j (returns the new value)
  – OP stands for the functions add, sub, or, and, xor, nand; i,j,k,l can be INTEGER*4 or INTEGER*8 variables
• l = compare_and_swap(j,k,i)  <- (j==k) ? j=i, l=1 : l=0
• i = lock_test_and_set(j,k)   <- atomically: i=j, j=k
  call lock_release(i)         <- atomically: i=0
All these functions are guaranteed to be atomic and are performed on globally visible arguments.

Workload Distribution
• Load imbalance is the condition when some processors do more work than others. Load imbalance will lead to bad scalability.
• Load imbalance can be diagnosed by profiling the program and comparing the execution times of the code in different threads.
• In OpenMP, the compiler accepts directives for work distribution:
      C$OMP DO SCHEDULE(type[,chunk])
  where type is
  – STATIC: iterations are divided into pieces at compile time (default); e.g. SCHEDULE(STATIC,6) for 26 iterations on 4 processors
  – DYNAMIC: iterations are assigned to processors dynamically, as they finish; this requires synchronization after each chunk of iterations
  – GUIDED: pieces reduce exponentially in size with each dispatched piece; e.g. SCHEDULE(GUIDED,4) for 26 iterations on 4 processors
  – RUNTIME: schedule determined by the environment variable OMP_SCHEDULE; with RUNTIME it is illegal to specify chunk
• The compiler also accepts the -mp_schedtype=type switch.
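A hedged C illustration of the scheduling trade-off (the work function and loop bounds are made up): with uneven iteration costs, schedule(dynamic,chunk) usually balances the load better than the default static split, at the price of extra synchronization per chunk.

    #include <stdio.h>

    /* iteration cost grows with i, so a static split leaves the
       last threads with most of the work */
    static double work(int i)
    {
        double s = 0.0;
        for (int k = 0; k < i; k++)
            s += 1.0 / (k + 1.0);
        return s;
    }

    int main(void)
    {
        const int n = 10000;
        double total = 0.0;

        /* dynamic scheduling: each thread fetches the next chunk of
           32 iterations when it becomes idle */
        #pragma omp parallel for schedule(dynamic, 32) reduction(+:total)
        for (int i = 0; i < n; i++)
            total += work(i);

        printf("total = %f\n", total);
        return 0;
    }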
Example: Parallel Region in FORTRAN
Compiler directives are enabled with the -mp switch or the -apo switch. With -apo, automatic parallelization is requested.

      SUBROUTINE MXM(A,B,C,L,M,N,N1)
      DIMENSION A(M,L), B(N1,M), C(N1,L)
      DO 100 K=1,L
        DO 100 I=1,N
          C(I,K) = 0.0
 100  CONTINUE

• The first loop nest will be run serially unless -apo is specified.
• Default variable type association rules: all variables are shared (i.e. A, B, C are shared).
• Induction variables I, J, K will be found automatically and assigned private scope.

C$OMP PARALLEL DO
      DO 110 K=1,L
        DO 110 I=1,N
          DO 110 J=1,M
            C(I,K) = C(I,K) + B(I,J)*A(J,K)
 110  CONTINUE
      END

• In this example all loops can be made parallel automatically (-apo).

Parallelization Examples/3
Code with shared data:

      COMMON //a(im,jm),b(im,jm),c(im,jm),d(im,jm),f(im)
      do j=1,jm
        call dowork(j)
      enddo
      end

      subroutine dowork(j)
      COMMON //a(im,jm),b(im,jm),c(im,jm),d(im,jm),f(im)
      do I=1,im
        f(I)   = (a(I,j) + b(I,j))*3.141592*j
        c(I,j) = f(I)**3
      enddo
      end

Making the j loop parallel will lead to incorrect results, because all threads will use the same address for the scratch array f. Three solutions are possible:
• the scratch array f should be declared local in the subroutine
• f should be isolated in a special common block /scratch/, which
  – could be declared c$omp threadprivate(/scratch/), or
  – could be loaded with the flag -Wl,-Xlocal,scratch_
• f could be extended as f(*,"numthreads")

False Sharing
Bad data or work distribution can lead to false sharing:

      for(i=myid; i<n; i+=nthr) {
        a[i] = b[i] + c[i];
      }

• Example: processor P0 does only the even iterations and processor P1 only the odd ones. Both processors read and write the same cache lines, and each write invalidates the cache line in the peer processor's cache: P0 loads a[1] but does not use it, P1 loads a[0] but does not use it, and so on.
• Each processor runs the cache coherency protocol on these cache lines, so the cache lines ping-pong from one cache to the other.
[slide figures: a cache line holding a[0]…a[9] accessed alternately by P0 and P1, and a performance vs. number-of-processors plot contrasting normal behaviour with false sharing]

Scalability & Data Distribution
The scalability of parallel programs might be affected by:
• Amdahl's law for the parallel part of the code
  – plot elapsed run times as a function of the number of processors
• Load balancing and work distribution to the processors
  – profiling the code will reveal significant run times in libmp library routines
  – compare perfex event counter 21 (fp operations) for the different threads
• False sharing of the data
  – perfex: high count of cache invalidation events (event counter 31: store/prefetch exclusive to shared block in scache)
• Excessive synchronization costs
  – perfex: high count of store conditionals (event counter 4)
When cache misses account for only a small percentage of the run time, the program makes good use of the memory and its performance will not be affected by data placement.
• diagnose significant memory load with perfex statistics: a high value of memory traffic
Program scalability will be improved with data placement if there is:
• memory contention on one node
• excessive cache traffic
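Returning to the false-sharing loop above, a sketch of the usual cure (array names and sizes are illustrative): give each thread a contiguous block instead of interleaved indices, so that two threads rarely write to the same cache line.

    #include <stdio.h>
    #include <omp.h>

    #define N 100000

    double a[N], b[N], c[N];

    int main(void)
    {
        #pragma omp parallel
        {
            int nthr = omp_get_num_threads();
            int myid = omp_get_thread_num();

            /* contiguous block per thread instead of the interleaved
               i += nthr stride: writes of different threads land on
               different cache lines (except at block boundaries)     */
            int chunk = (N + nthr - 1) / nthr;
            int lo = myid * chunk;
            int hi = lo + chunk < N ? lo + chunk : N;

            for (int i = lo; i < hi; i++)
                a[i] = b[i] + c[i];
        }
        printf("a[0] = %f\n", a[0]);
        return 0;
    }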
Why Distribute?
This is a CC-NUMA architecture optimization. If:
• all data is initialized on one node
  – memory bottleneck
• all shared data is on one node
  – serialization of the cache coherency traffic
• data is used from a different node
  – large interconnect traffic
  – average net latency on the SGI 3800 with p processors: ~(1/2 log2(p/16)+1)*50 ns
[slide figure: two SGI 3800 nodes, each with R1x000 processors, L2 caches, memory/directory and Bedrock ASIC, connected through the interconnect]

SGI Additions to OpenMP
Data distribution extensions:
• IRIX dplace command
• setenv _DSM_PLACEMENT (FIRST_TOUCH|ROUND_ROBIN)
• compiler directives:
  – !$sgi distribute a(*,BLOCK,CYCLIC(5))
  – !$sgi distribute_reshape a(*,BLOCK,…)

Default Distribution: "First Touch"
IRIX strategy for data placement:
• respect program distribution directives
• allocate on the node where the data is initialized
• allocate on a node where memory is available

      PARAMETER (PAGESIZE=16384, SIZEDBLE=8)
      REAL, ALLOCATABLE:: WORK(:)   ! Distribution directive not available
      READ *, MX
      ALLOCATE(WORK(MX))
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(WORK,MX)
      ID = OMP_GET_THREAD_NUM()
      NP = OMP_GET_NUM_THREADS()
      NPAGES = MX*SIZEDBLE/PAGESIZE
!     round-robin distribution of pages:
      DO I=ID,NPAGES-NP+1,NP
        WORK(1+I*PAGESIZE/SIZEDBLE) = 0.0
      ENDDO
!$OMP END PARALLEL

Data Placement Policies
• Initialization with the "first touch" policy on a single processor:
  – all data ends up on a single node
  – bottleneck in access to that node
• Initialization with the "first touch" policy with multiple processors:
  – data is distributed naturally
  – each processor has local data to run on
  – minimal data exchange between nodes
  – page edge effects
• Initialization with the "round robin" policy with a single processor:
  – circular page placement on the nodes
  – on average, all pages are remote
  – no single bottleneck
  – significant traffic between nodes
[slide figures: four nodes (HUB, memory, two cpus each) showing where the pages land under each policy]
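A C sketch of exploiting the default first-touch policy described above (malloc-based allocation; the array name is hypothetical): the same threads that will later compute on a block are the ones that touch it first, so the pages land in their local memories.

    #include <stdlib.h>
    #include <stdio.h>

    int main(void)
    {
        const long n = 1L << 24;
        double *work = malloc(n * sizeof *work);
        if (!work) return 1;

        /* parallel first touch: a page is placed on the node of the
           thread that writes it first                                */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++)
            work[i] = 0.0;

        /* a later compute phase with the same static schedule reuses
           the locally placed pages                                   */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++)
            work[i] += 1.0;

        printf("work[0] = %f\n", work[0]);
        free(work);
        return 0;
    }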
Example: Data Distribution
To avoid the bottleneck resulting from memory allocation on a single node:
• Perform the initialization in parallel, such that each processor initializes the data that it is likely to access later for calculation.
• Use explicit data placement compiler directives (e.g. C$DISTRIBUTE):

      REAL*8 A(N),B(N),C(N),D(N)
C$DISTRIBUTE A(BLOCK),B(BLOCK),C(BLOCK),D(BLOCK)
      DO I=1,N
        A(I) = 0.0
        B(I) = I/2
        C(I) = I/3
        D(I) = I/7
      ENDDO
C$OMP PARALLEL DO
      DO I=1,N
        A(I) = B(I) + C(I) + D(I)
      ENDDO

• Enable the environment variable for data placement (setenv _DSM_PLACEMENT).
• Use the dplace command (with appropriate options) to start the program execution.

Data Initialization: Example
F90 allocatable arrays should be initialized explicitly to place the pages:

      PARAMETER (PAGESIZE=16384, SIZEDBLE=8)
      REAL, ALLOCATABLE:: WORK(:)
      READ *, MXRVAR
      ALLOCATE(WORK(MXRVAR))
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(WORK,MXRVAR)
      ID = OMP_GET_THREAD_NUM()
      NP = OMP_GET_NUM_THREADS()
      NBLOCK = MXRVAR/NP
      NPAGES = MXRVAR*SIZEDBLE/PAGESIZE
      NLOWER = ID*NBLOCK
      NUPPER = NLOWER + NBLOCK
      NLOWER = NLOWER + 1
      NUPPER = MIN(NUPPER,MXRVAR)
c     round-robin distribution of pages:
      DO I=ID,NPAGES-NP+1,NP
        WORK(1+I*PAGESIZE/SIZEDBLE) = 0.0
      ENDDO
!$OMP END PARALLEL

Data Distribution Directives
The compiler supports the following data distribution directives:
• !$SGI DISTRIBUTE A(shape[,shape]…) ONTO(n[,m]…) / #pragma distribute …
  – defines a regular distribution of an array with the given shape
  – the shape can be BLOCK, CYCLIC(k), or * (default: do not distribute)
  – the placement of the pages will not violate the Fortran/C rules for data usage, and therefore the exact shape is not guaranteed; in particular, there is no padding at page boundaries
• !$SGI DISTRIBUTE_RESHAPE A(shape[,shape]…) ONTO(n[,m]…) / #pragma distribute_reshape …
  – defines a reshaped distribution of an array with the given shape, i.e. an exact mapping of the data to the given shape; the shape parameters are as for the DISTRIBUTE directive
• !$SGI DYNAMIC A / #pragma dynamic …
  – permits dynamic redistribution of the array A
• !$SGI REDISTRIBUTE A(shape[,shape]…) / #pragma redistribute …
  – executable statement; forces redistribution of a (previously distributed) array A
• !$SGI PAGE_PLACE A(…) / #pragma page_place …
  – specifies distribution by pages
• !$SGI AFFINITY(i) = [DATA(A(I)) | THREAD(I)] / #pragma affinity …
  – the affinity clause specifies that iteration i should execute on the processor whose memory contains element A(i), or on the processor which runs thread i
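As a hedged illustration of what the BLOCK and CYCLIC(k) shapes mean (plain arithmetic, not an SGI API), the helpers below compute which of P processors would own element i of an N-element array:

    #include <stdio.h>

    /* BLOCK: contiguous chunks of ceil(N/P) elements per processor */
    static int owner_block(long i, long n, int p)
    {
        long chunk = (n + p - 1) / p;
        return (int)(i / chunk);
    }

    /* CYCLIC(k): chunks of k elements dealt out round-robin */
    static int owner_cyclic(long i, int k, int p)
    {
        return (int)((i / k) % p);
    }

    int main(void)
    {
        long n = 1000;
        int  p = 4;
        printf("element 600, BLOCK     -> processor %d\n", owner_block(600, n, p));
        printf("element 600, CYCLIC(5) -> processor %d\n", owner_cyclic(600, 5, p));
        return 0;
    }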
Data Distribution: Examples/1
Regular data distributions to 4 processors:
1) !$SGI DISTRIBUTE X(BLOCK)
2) X(CYCLIC) or X(CYCLIC(1))
3) X(CYCLIC(3))
4) Y(*,BLOCK):
      Real*8 Y(1000,1000)
c$sgi distribute Y(*,block)
   each individual portion is a contiguous piece of size 8*10^6/P bytes, i.e. for 4 processors 2 MB per piece
5) Y(BLOCK,*):
      Real*8 Y(1000,1000)
c$sgi distribute Y(block,*)
   each individual portion is still of total size 8*10^6/P bytes, but each contiguous piece is only 8*10^3/P bytes, i.e. for 4 processors 2 KB (< 16 KB page size)
6) Y(BLOCK,BLOCK)
If the requested distribution incurs significant page boundary effects (example: (5)), reshaping of the array data is necessary.

Data Distributions: Examples/2
Regular data distributions on 6 processors; the ONTO clause changes the aspect ratio (its arguments must be constant integers):
7) Y(BLOCK,BLOCK)
8) Y(BLOCK,BLOCK) ONTO (2,3)
9) Y(BLOCK,BLOCK) ONTO (1,*)
A regular data distribution can be re-distributed:
C$SGI DISTRIBUTE A(BLOCK,BLOCK,BLOCK) ONTO (2,2,8)
C$SGI DYNAMIC A
…
c - executable statement, can appear anywhere in the code:
C$SGI REDISTRIBUTE A(BLOCK,BLOCK,BLOCK) ONTO (1,1,32)

Data Reshaping Restrictions/1
The DISTRIBUTE_RESHAPE directive declares that the program makes no assumptions about the storage layout of the array. Array reshaping is legal only under certain conditions:
• an array can be declared with only one of these options:
  – default (no (re-)shaping)
  – distribute
  – distribute_reshape
  and it keeps the declared attribute for the duration of the program
• a reshaped array cannot be re-distributed
• a reshaped array cannot be equivalenced to another array
  – explicitly through an EQUIVALENCE statement
  – implicitly through multiple declarations of a common block
• if an array in a common block is reshaped, all declarations of that common block must:
  – contain an array at the same offset within the common block
  – declare that array with the same number of dimensions and the same size
  – specify the same reshaped distribution for the array

Data Reshaping Restrictions/2
• when a reshaped array is passed as an actual argument to a subroutine:
  – if the reshaped array is passed as a whole (i.e. call sub(A)), the formal parameter must match exactly the number of dimensions and their sizes
  – if only elements of the reshaped array are passed (i.e. call sub(A(0)) or call sub(A(I))), the called procedure treats the incoming parameter as non-distributed
  – for elements of a reshaped array, the declared bounds of the formal parameter must not exceed the size of the distributed array portion passed in as the actual argument. Example:

      Real*8 A(1000)
c$    distribute_reshape A(cyclic(5))
      do I=1,1000,5
        call mysub(A(I))
      enddo
      end

      subroutine mysub(X)
      real*8 X(5)
      …
      end

• the compiler performs a consistency check on the legality of the reshaping:
  – -MP:dsm=ON (default)
  – -MP:check_reshape={ON|OFF} (default OFF): generation of run-time consistency checks
  – -MP:clone={ON|OFF} (default ON): automatically generate clones of procedures called with reshaped parameters
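The same bound rule expressed in C terms (a hedged analogue with made-up names, not the SGI reshaping machinery): the callee declares a formal parameter no larger than the contiguous portion handed to it by the caller.

    #include <stdio.h>

    #define N 1000

    /* the callee sees exactly one 5-element portion; its declared
       bound must not exceed the piece passed by the caller          */
    static void mysub(double x[5])
    {
        for (int k = 0; k < 5; k++)
            x[k] += 1.0;
    }

    int main(void)
    {
        static double a[N];
        for (int i = 0; i < N; i += 5)
            mysub(&a[i]);          /* pass one 5-element piece at a time */
        printf("a[0] = %f\n", a[0]);
        return 0;
    }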
Data Reshaping Restrictions/3
Consider also the following restrictions:
• BLOCK DATA (i.e. statically) initialized data cannot be reshaped
• alloca(3) or malloc(3) arrays cannot be reshaped
• F90: a dynamically sized automatic array can be distributed, but an F90 allocatable array cannot
• I/O for a reshaped array cannot be mixed with namelist I/O or a function call in the same I/O statement
• a common block containing a reshaped array cannot be linked -Xlocal

Data Distribution: Affinity
Usage of the AFFINITY clause.

Data form:
      REAL*8 A(N),B(N),Q
!$SGI DISTRIBUTE A(BLOCK),B(BLOCK)
C INITIALIZATION:
      DO I=1,N
        A(I) = 1 - I/2
        B(I) = -10 + 0.01*(I*I)
      ENDDO
C REAL WORK:
      DO IT=1,NITERS
        Q = 0.01*IT
C$OMP PARALLEL DO
C$SGI AFFINITY(I)=DATA(A(I))
        DO I=1,N
          A(I) = A(I) + Q*B(I)
        ENDDO
      ENDDO

Thread form:
      REAL*8 A(N),B(N),Q
!$SGI DISTRIBUTE A(BLOCK),B(BLOCK)
C INITIALIZATION:
      DO I=1,N
        A(I) = 1 - I/2
        B(I) = -10 + 0.01*(I*I)
      ENDDO
C REAL WORK:
      DO IT=1,NITERS
        Q = 0.01*IT
C$OMP PARALLEL DO
C$SGI AFFINITY(I)=THREAD(I/(N/P))
        DO I=1,N
          A(I) = A(I) + Q*B(I)
        ENDDO
      ENDDO

Iterations are scheduled such that each processor executes the iterations over data that is placed in the memory near that processor.
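To make the thread-form mapping concrete, here is a hedged C sketch (plain OpenMP, not the SGI clause) in which iteration i is executed by thread i/(N/P), i.e. each thread gets one contiguous block, just as AFFINITY(I)=THREAD(I/(N/P)) requests:

    #include <stdio.h>
    #include <omp.h>

    #define N 1600

    int main(void)
    {
        double a[N], b[N], q = 0.01;
        for (int i = 0; i < N; i++) { a[i] = 1.0 - i / 2.0; b[i] = -10.0 + 0.0001 * i * i; }

        #pragma omp parallel
        {
            int p   = omp_get_num_threads();
            int id  = omp_get_thread_num();
            int blk = N / p;
            int lo  = id * blk;
            int hi  = (id == p - 1) ? N : lo + blk;   /* last thread takes the remainder */

            /* iteration i runs on thread i/(N/p), as in the thread-form clause */
            for (int i = lo; i < hi; i++)
                a[i] = a[i] + q * b[i];
        }
        printf("a[0] = %f\n", a[0]);
        return 0;
    }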
The dplace Program
dplace is a shell-level tool for data placement. It can:
• declare the number of threads to use
• declare the number of memories to use
• specify the topology of the memories
• specify the distribution of threads onto memories
• specify the distribution of virtual addresses onto memories
• specify text/data/stack page sizes
• specify the migration policy and threshold
• dynamically migrate virtual address ranges between memories
• dynamically move threads from one node to another

      dplace [-place file] [-verbose] [-mustrun]
             [-data_pagesize size] [-stack_pagesize size] [-text_pagesize size]
             [-migration threshold] program arguments

See man dplace(1). Note: dplace switches off the _DSM_… environment variables.

dplace: Placement File
The file must include:
• the number of memories being used and their topology:
  – memories M in topology [none|cube|cluster|physical near]
• the number of threads:
  – threads N
• the assignment of threads to memories:
  – run thread n on memory m [using cpu k]
  – distribute threads [t0:t1[:dt]] across memories [m0:m1[:dm]] [block [m]] | [cyclic [k]]
• optional specifications for policy:
  – policy stack|data|text pagesize N [k|K]
  – policy migration n [%]
  – mode verbose [on|off|toggle]
Example:
      memories ($OMP_NUM_THREADS+1)/2 in topology cube
      threads $OMP_NUM_THREADS
      distribute threads across memories

Data Distribution: page_place
Useful for irregular and dynamically allocated data structures:
      #pragma page_place (A, index-range, thread-id)
      C$SGI PAGE_PLACE (A, index-range, thread-id)

      Parameter (n=8*1024*1024)
      real*8 a(n)
      p=1
c$    p = omp_get_max_threads()
c-    distribute using block mapping
      npp=(n + p-1)/p        ! Number of elements per processor
c$omp parallel do
      do I=0,p-1
c$sgi page_place (a(1+I*npp), npp*8, I)
      enddo
Note: page_place is useful only for the initial placement of the data; it has no effect if the pages have already been created.

Environment Variables: Process
OMP_NUM_THREADS (default max(8,#cpu))
• sets the number of threads to use during execution. If OMP_DYNAMIC is set to TRUE, this is the maximum number of threads.
OMP_DYNAMIC TRUE|FALSE (default TRUE)
• enables or disables dynamic adjustment of the number of parallel threads to the load on the machine. The adjustment can take place at the start of each parallel region. Setting OMP_DYNAMIC to FALSE switches MPC_GANG ON.
MPC_GANG ON|OFF (default OFF)
• with gang scheduling all parallel threads are started simultaneously and the time quantum is twice as long as normal. To obtain higher priority for parallel jobs it is often necessary to use:
      setenv OMP_DYNAMIC FALSE
      setenv MPC_GANG OFF
_DSM_MUSTRUN (default not set)
• locks each thread to the cpu where it happened to be started.
_DSM_PPM (default 2/4 processes per node)
• if _DSM_PPM is set to 1, only one process will be started on each node.

Environment Variables: Size
MP_SLAVE_STACKSIZE (default 16 MB; 4 MB for more than 64 threads)
• stack size of the slave processes. The stack space is private; an adjustment of this size might be necessary if the machine has constraints on memory.
MP_STACK_OVERFLOW (default ON)
• stack size checking. Some programs overflow their allocated stack while still functioning correctly; this variable has to be set to OFF in that case.
PAGESIZE_STACK, PAGESIZE_DATA, PAGESIZE_TEXT (default 16 KB each)
• specify the page size for the stack, data and text segments. Sizes can be specified in KBytes or MBytes in powers of 2: 64 KB, 256 KB, 1 MB, 4 MB. If the system does not have the requested pages available, it might silently choose a smaller page size.

Environment Variables: DSM
_DSM_PLACEMENT (default FIRST_TOUCH)
• memory allocation policy for the stack, data and text segments. This variable can be set to:
  – FIRST_TOUCH: allocate pages close to the process which requested them
  – ROUND_ROBIN: allocate pages alternating over the nodes where the processes run
_DSM_OFF (default not set)
• data placement directives will not be honoured if _DSM_OFF is set to ON. In particular, this is the mechanism that the dplace program uses to force the specified data distribution in an executable: e.g. dplace -place file myprogram will set _DSM_OFF to ON and proceed with the specifications in file.
_DSM_VERBOSE (default not set)
• when set, reports memory allocation decisions to stdout.
See the man pe_environ(5) page for other useful environment variables.
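A small sketch (standard OpenMP calls plus getenv; nothing SGI-specific is assumed, and the selection of variables printed is just an example) that reports the settings discussed above at program start, which is handy when comparing runs:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    static void show(const char *name)
    {
        const char *v = getenv(name);
        printf("%-18s = %s\n", name, v ? v : "(not set)");
    }

    int main(void)
    {
        printf("max threads        = %d\n", omp_get_max_threads());
        printf("dynamic adjustment = %d\n", omp_get_dynamic());
        show("OMP_NUM_THREADS");
        show("OMP_DYNAMIC");
        show("_DSM_PLACEMENT");
        show("MP_SLAVE_STACKSIZE");
        return 0;
    }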
Case Study: Maxwell Code

      REAL EX(NX,NY,NZ),EY(NX,NY,NZ),EZ(NX,NY,NZ)   ! Electric field
      REAL HX(NX,NY,NZ),HY(NX,NY,NZ),HZ(NX,NY,NZ)   ! Magnetic field
      …
      DO K=2,NZ-1
        DO J=2,NY-1
          DO I=2,NX-1
            HX(I,J,K)=HX(I,J,K)-(EZ(I,J,K)-EZ(I,J-1,K))*CHDY+(EY(I,J,K)-EY(I,J,K-1))*CHDZ
            HY(I,J,K)=HY(I,J,K)-(EX(I,J,K)-EX(I,J,K-1))*CHDZ+(EZ(I,J,K)-EZ(I-1,J,K))*CHDX
            HZ(I,J,K)=HZ(I,J,K)-(EY(I,J,K)-EY(I-1,J,K))*CHDX+(EX(I,J,K)-EX(I,J-1,K))*CHDY
          ENDDO
        ENDDO
      ENDDO

Here NX=NY=NZ = 32, 64, 128, 256 (i.e. with REAL*4 elements: 0.8 MB, 6.3 MB, 50 MB, 403 MB).
• For scalar optimization, NX, NY, NZ have been adjusted to uneven values and the arrays H, E have been padded to avoid cache thrashing.
• Now parallelization can be attempted to further increase performance on a parallel machine:
  – automatic parallelization with the -apo compiler option
  – manual parallelization approach
• Study the performance implications of the different parallelization approaches.

Case Study: Maxwell Code, Automatic Parallelization
The Maxwell program is automatically parallelized with:
      f77 -apo -O3 maxwell.f
There are three parallel sections, enclosed in an outer iteration loop (repeated for statistics):
1. Initialization:
      DO K=1,NZ
        DO J=1,NY
          DO I=1,NX
            HX(I,J,K)=1E-3/(I+J+K)
            EX(I,J,K)=1E-3/(I+J+K)
            ….
          ENDDO
        ENDDO
      ENDDO
2. Calculation of H: the loop nest shown above.
3. Calculation of E:
      DO K=2,NZ-1
        DO J=2,NY-1
          DO I=2,NX-1
            EX(I,J,K)=EX(I,J,K)-(HZ(I,J,K)-HZ(I,J-1,K))*CHDY+(HY(I,J,K)-HY(I,J,K-1))*CHDZ
            EY(I,J,K)=EY(I,J,K)-(HX(I,J,K)-HX(I,J,K-1))*CHDZ+(HZ(I,J,K)-HZ(I-1,J,K))*CHDX
            EZ(I,J,K)=EZ(I,J,K)-(HY(I,J,K)-HY(I-1,J,K))*CHDX+(HX(I,J,K)-HX(I,J-1,K))*CHDY
          ENDDO
        ENDDO
      ENDDO
There are no data dependencies in any of the loops; the compiler divides the k-loop iterations among the N processors.

Scalability of Auto-Parallelization
• Super-linear speedup when the data fits into the cache.
• Slow-down due to load balancing: the k-loop range is too small. The parallel k-loop is DO K=2,NZ-1, therefore the performance peaks at the "perfect" points 62/2 and 126/2.

Case Study: Workload Distribution
Parallelization with directives:
      c$omp parallel do
      c$sgi nest(k,j)
placed in front of the properly nested loop nest.

Case Study: Data Placement
• Parallel initialization can be forced to run serially with the directive c*$* assert do(serial).
• Data distribution directive: c$distribute.
• Data placement with the environment variable: setenv _DSM_PLACEMENT ROUND_ROBIN.
Observations:
• Since the default page placement strategy is "first touch", all pages end up in the memory of the single node where the program starts; data traffic from/to that single node limits the scalability.
• With round-robin placement there is no single bottleneck, but most data is remote.
• Most data is in cache; the remaining traffic is write-back traffic.

Parallel Thread Creation
There are several mechanisms to create a parallel processing stream:
• Library calls (shared address space):
  – sproc()            (shared memory fork)
  – m_fork()           (simpler sproc)
  – pthread_create     (POSIX threads)
  – start_pes          (shmem init routine)
• Library calls (disjoint address space):
  – fork()             (UNIX)
  – MPI_Init()         (MPI start-up)
• Compiler level (compiler directives):
  – C$DOACROSS           (SGI old syntax)
  – c$par parallel       (ANSI X3H5 spec)
  – !$omp parallel       (OpenMP spec)
  – #pragma omp parallel
• Automatic parallelization (-apo)
• At low level, all shared memory parallelization passes through the sproc call to create threads of control with a global address space.

Compile Time Actions
Each parallel region (whether found automatically or via a directive):
• is converted into a subroutine call executed by each parallel thread
• the subroutine contains the execution body of the parallel region
• the arguments to the subroutine contain scheduling parameters (loop ranges)
• the new subroutine name:
  – is derived as _mpdo_sub_#
  – will appear in the profile
• data in the parallel region:
  – private data: local (automatic) variables in the newly created subroutine
  – shared data: passed by well-known address to the new subroutine

Original code:
      Subroutine daxpy(n,a,x,y)
      real*8 a,x(n),y(n)
      do I=1,n
        y(I) = y(I)+a*x(I)
      enddo
      end

Compiler-generated code (schematically):
      Subroutine daxpy(n,a,x,y)
      real*8 a,x(n),y(n)
      integer*4 args
      _LOCAL_NTRIP = n/__mp_sug_numthreads_func$()
      _LOCAL_START = 1+__mp_my_thread_id()*_LOCAL_NTRIP
      _INCR = 1
      args(1) = _LOCAL_START
      args(2) = _LOCAL_NTRIP
      args(3) = _INCR
      args(4) = 0
      call sproc$(_mpdo_daxpy_1,%VAL(PR_SALL),args)
      END

      subroutine _mpdo_daxpy_1(_LOCAL_START,_LOCAL_NTRIP,_INCR,_INF)
      do I=_LOCAL_START,_LOCAL_NTRIP
        y(I) = y(I)+a*x(I)
      enddo
      end
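A hedged C sketch of the same outlining idea using POSIX threads (not the actual compiler output and not the sproc interface): the loop body becomes an outlined function, and each thread receives its own start/trip-count arguments.

    #include <pthread.h>
    #include <stdio.h>

    #define N        1000
    #define NTHREADS 4

    static double a = 2.0, x[N], y[N];

    /* scheduling parameters passed to the outlined routine,
       analogous to _LOCAL_START / _LOCAL_NTRIP above          */
    struct args { int start; int ntrip; };

    static void *mpdo_daxpy_1(void *p)       /* outlined loop body */
    {
        struct args *s = p;
        for (int i = s->start; i < s->start + s->ntrip; i++)
            y[i] = y[i] + a * x[i];
        return NULL;
    }

    int main(void)
    {
        pthread_t   tid[NTHREADS];
        struct args arg[NTHREADS];
        int chunk = N / NTHREADS;            /* assumes N divisible by NTHREADS */

        for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }

        for (int t = 0; t < NTHREADS; t++) { /* "sproc" step: spawn the workers */
            arg[t].start = t * chunk;
            arg[t].ntrip = chunk;
            pthread_create(&tid[t], NULL, mpdo_daxpy_1, &arg[t]);
        }
        for (int t = 0; t < NTHREADS; t++)   /* barrier at the end of the region */
            pthread_join(tid[t], NULL);

        printf("y[0] = %f\n", y[0]);
        return 0;
    }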
Run Time Actions
[slide figure: the master process p0 calls sproc(), via mp_setup() or the first !$omp parallel, to create the shared memory process group p1 … pn-1; during serial execution the slaves spin in __mp_slave_wait_for_work(); at a parallel region the master calls __mp_simple_sched(par_loop) and __mp_wait_for_completion(), all threads execute par_loop, and they synchronize in the barrier through __mp_barrier_master() and __mp_barrier_nthreads()]

When the parallel threads are spawned:
• they spin in __mp_slave_wait_for_work()
When the master enters a parallel region:
• it passes the address of the _mpdo subroutine to the slaves
• at the end of the parallel region all threads synchronize
• the slaves go back to spinning in __mp_slave_wait_for_work()
• the master continues with serial execution

Timing and Profiling an MP Program
• To obtain timing information for an MP program:
  – setenv OMP_NUM_THREADS n
  – time a.out
  – example:
      n=1 time output: 9.678u 0.091s 0:09.86 98.9% 0+0k 0+1io 0pf+0w
      n=2 time output: 10.363u 0.121s 0:05.72 183.2% 0+0k 3+8io 3pf+0w
    From this output we derive the following performance data:

      processors   User+System   Elapsed
      1            9.77          9.86
      2            10.48         5.72

    The parallel overhead is therefore 7.3% (10.48 s vs 9.77 s), and the speedup is S(2) = 9.86/5.72 = 1.72.
• Use SpeedShop to obtain profiling information for an MP program:
  – setenv OMP_NUM_THREADS n
  – ssrun -exp pcsamp a.out
  – each thread will create a separate (binary) profiling file:
      foreach file (a.out.pcsamp.*)
        prof $file
      end

MP SpeedShop Example Output
Slave execution (samples, seconds(%), cumulative seconds(%), procedure):
  103  1s   (20.6)  1s   (20.6)  divfree (./CWIROS:DIVFREE.f)
   96  0.96s(19.2)  2s   (39.7)  ROWKAP (./CWIROS:KAPNEW.f)
   67  0.67s(13.4)  2.7s (53.1)  ITER (./CWIROS:CHEMIE.f)
   40  0.4s ( 8.0)  3.1s (61.1)  FLUXY (./CWIROS:KAPNEW.f)
   35  0.35s( 7.0)  3.4s (68.1)  __expf (.. fexp.c)
   30  0.3s ( 6.0)  3.7s (74.1)  __mp_barrier_nthreads (.. libmp.so)
   26  0.26s( 5.2)  4s   (79.2)  FLUXX (./CWIROS:KAPNEW.f)
   24  0.24s( 4.8)  4.2s (84.0)  KAPPA (./CWIROS:KAPNEW.f)
   23  0.23s( 4.6)  4.4s (88.6)  __mp_slave_wait_for_work (.. libmp.s)
   11  0.11s( 2.2)  4.5s (90.8)  TWOSTEP (./CWIROS:CHEMIE.f)
Master execution:
  124  1.2s (24.0)  1.2s (24.0)  divfree (./CWIROS:DIVFREE.f)
   93  0.93s(18.0)  2.2s (42.1)  ROWKAP (./CWIROS:KAPNEW.f)
   76  0.76s(14.7)  2.9s (56.8)  ITER (./CWIROS:CHEMIE.f)
   36  0.36s( 7.0)  3.3s (63.8)  FLUXY (./CWIROS:KAPNEW.f)
   27  0.27s( 5.2)  3.6s (69.0)  __expf (.. libm/fexp.c)
   23  0.23s( 4.5)  3.8s (73.4)  FLUXX (./CWIROS:KAPNEW.f)
   22  0.22s( 4.3)  4s   (77.7)  KAPPA (./CWIROS:KAPNEW.f)
   15  0.15s( 2.9)  4.2s (80.6)  __mp_barrier_master (.. libmp.so)
   15  0.15s( 2.9)  4.3s (83.5)  diff (./CWIROS:DIFF.f)
    9  0.09s( 1.7)  4.4s (85.3)  TWOSTEP (./CWIROS:CHEMIE.f)
• Compare the times (seconds) of each important subroutine across threads to determine the load balance.
• Evaluate the contribution of libmp: it is pure overhead.

Example Profile Analysis
Running:   ssrun -fpcsampx pipe.x
Analyzing: prof pipe.x.fpcsampx.*
[slide table: per-thread times (threads 0 and 1) for the parallel-region subroutines __mpdo_output_1..3, matrix, …, __mpdo_predic_1..3, compared against the serial times (funcT, funcS, scalar), with the derived speedups S(2 threads) and S(par reg subr); the output routines are annotated "False Sharing", the predic routines "OK"]

Summary
• Auto-parallelizing compilers can do a lot, but they are limited to loop-level parallelism and in their analysis of data dependencies.
• Manual parallelization with directives is easy to implement; however, care must be given to the treatment of data.
• It is important to understand the exact semantics of the parallelization directives.
• To obtain scalable performance it is often necessary to make the program parallel at a coarse-grain level, e.g. across subroutine invocations.
• To obtain scalable performance it is often necessary to take care of proper data distribution on the machine.
• It is necessary to profile parallel programs to diagnose cases of load imbalance and parallelization overhead.