Cache Optimizations & the Loop Nest Optimizer

Improvement Opportunities

A program runs slowly when not all resources are used:
• processor:
  – opportunities to go superscalar (ILP) are not used
  – scheduling of instructions is not optimal (too many wait states)
• memory access:
  – not all data in a cache line is used (spatial locality)
  – data in the cache is not reused (temporal locality)
Performance analysis is used to diagnose the problem.
The compiler will attempt to optimize the program for the given architecture:
• data structures can inhibit compiler optimizations
• algorithm presentation can inhibit compiler optimizations
Often it is necessary to rewrite the critical parts of the code (loops) so that the compiler can do better performance optimization.
-> Understand compiler optimization techniques

Compiler Optimization Techniques

The following optimizations are built into the compiler:
• general:
  – procedure inlining
  – data and array padding
• loop based (loop nests, which implies usage of multi-dimensional arrays; enabled at -O3 or with -LNO:opt=[1|0]):
  – loop interchange
  – outer and inner loop unrolling
  – cache blocking
  – loop fusion (merge) and fission (split)
• code generation:
  – software pipelining
  – instruction reordering
Presenting the algorithm in the program such that the compiler can apply these optimization techniques leads to optimal program performance on the machine.

Scalar Architecture: Cache System

• The hierarchy of memory devices:
  [Figure: speed of access (1/clock: 1, 0.1, 0.01) versus device capacity (size). Devices: 64 registers; 32 KB L1 cache (~2-3 cy); 8 MB L2 cache (~10 cy); memory, ~1 to 100s of GB (~100 cy); disk (~4000 cy)]
• The goal of the memory hierarchy:
  – access speed ~ fastest memory
  – effective capacity ~ size of largest memory
-> Programs should follow the principle of locality (use items in the cache):
  – spatial locality of reference (use all words in a cache line)
  – temporal locality of reference (use the same cache line again)

Scalar Architecture: Cache Organization

Example: L2 cache on O2K (e.g. 8 MB or 2097152 words)
[Figure: words in memory are grouped into cache lines (32 words); a load instruction (ld) for 1 word causes a cache line transfer]
• A cache hit will load the word from the cache
• A cache miss will load the cache line from memory
The goal of scalar optimization:
  – spatial locality of reference (use all words in a cache line)
  – temporal locality of reference (use the same cache line again)

Problems of Scalar Optimization

    DO i=1,n
      DO j=1,n
        DO k=1,n
          C(i,j)=C(i,j) + A(i,k)*B(k,j)
        ENDDO
      ENDDO
    ENDDO

[Figure: C(i,j) = A(i,k) x B(k,j), with the cache lines running down the columns]
  – each C(i,j) value is accumulated in a register for A(i,k)*B(k,j)
  – B is traversed in sequence of cache lines (spatial locality)
  – A accesses only 1 word from each cache line (no locality)
  – for A and B there is no reuse of cache lines (if n is large)
This is a problem only if A, B, C do not fit into the cache.

Loop Nest Optimizer

LNO performs loop restructuring to optimize data access:
• loop interchange
• loop unrolling
• loop blocking for cache
• loop fusion
• loop fission
• pre-fetching
LNO is controlled with compiler options and/or compiler directives or pragmas; the same options apply to both.
• LNO is the default at -O3, but can be turned on/off individually by -LNO:opt=[1|0]
• directive/pragma syntax:
  – Fortran: C*$* keyword [=value(s)]
  – C/C++ : #pragma keyword [=value(s)]
• directives/pragmas can be disabled with the compiler switch -LNO:ignore_pragmas

Array Indexing

There are several ways to index arrays (rated from ++, best, to --, worst):

Direct Addressing (++)
    DO j=1,M
      DO i=1,N
        … A(i,j) …
      ENDDO
    ENDDO

Loop carried Addressing (+)
    DO j=1,M
      DO i=1,N
        k = k + 1
        … A(k) …
      ENDDO
    ENDDO

Explicit Addressing (-)
    DO j=1,M
      DO i=1,N
        … A(i+(j-1)*N) …
      ENDDO
    ENDDO

Indirect Addressing (--)
    DO j=1,M
      DO i=1,N
        … A(index(i,j)) …
      ENDDO
    ENDDO

• The addressing scheme has an impact on performance
• Arrays should be accessed in the most natural, direct way so that the compiler can apply loop optimization techniques

Data Storage in Memory

Data storage order is language dependent:
• Fortran stores multi-dimensional arrays "column-wise": for A(I,J) the leftmost index changes fastest
  [Figure: in memory, column j (all i) is followed by column j+1, then column j+2]
• C stores multi-dimensional arrays "row-wise": for a[i][j] the rightmost index changes fastest
  [Figure: in memory, row i (all j) is followed by row i+1, then row i+2]
• Accessing array elements in storage order greatly improves performance for arrays that do not fit in the cache(s)

Loop Interchange: FORTRAN

Original loop:
    c*$* no interchange
    DO I=1,N
      DO J=1,M
        C(I,J)=A(I,J)+B(I,J)
      ENDDO
    ENDDO

Interchanged loops:
    c*$* interchange(J,I)
    DO J=1,M
      DO I=1,N
        C(I,J)=A(I,J)+B(I,J)
      ENDDO
    ENDDO

[Figure: storage order versus access order of A(I,J), B(I,J), C(I,J) for both loop orders]
• The distribution of data in memory is not changed; only the access pattern is changed
• The compiler can do this optimization automatically: -LNO:interchange=[on|off] (default on)

Index Reversal

Original loop:
    DO I=1,N
      DO J=1,M
        C(I,J)=A(I,J)+B(J,I)
      ENDDO
    ENDDO
The access is poor for A and C, while it is optimal for B.

Interchanged loops + index reversal (interchange alone would be good for A and C, but bad for B):
    DO J=1,M
      DO I=1,N
        C(I,J)=A(I,J)+B(I,J)
      ENDDO
    ENDDO

• Index reversal on B, i.e. B(J,I) replaced by B(I,J), must be done everywhere in the program
• This has to be done manually; there is no compiler optimization that does index reversal.
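The manual index reversal described above can be sketched in C, where the rightmost index is the fast one. This is only an illustration under stated assumptions: the function names `transpose_b`, `add_original` and `add_reversed` are hypothetical, and the matrices are stored as flat row-major arrays. The reversed copy `bt` plays the role of the array whose indices have been swapped "everywhere in the program".

```c
#include <stddef.h>

/* Hypothetical helper: build bt (n x m) such that bt[i][j] == b[j][i],
 * where b is an m x n row-major matrix. */
void transpose_b(size_t n, size_t m, const double *b, double *bt)
{
    for (size_t j = 0; j < m; j++)
        for (size_t i = 0; i < n; i++)
            bt[i * m + j] = b[j * n + i];
}

/* Original form: c[i][j] = a[i][j] + b[j][i].
 * In the inner j loop, a and c are unit stride but b is read with stride n. */
void add_original(size_t n, size_t m, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            c[i * m + j] = a[i * m + j] + b[j * n + i];
}

/* After index reversal: c[i][j] = a[i][j] + bt[i][j].
 * All three arrays are now traversed with unit stride in the inner loop. */
void add_reversed(size_t n, size_t m, const double *a, const double *bt, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            c[i * m + j] = a[i * m + j] + bt[i * m + j];
}
```

Both versions compute the same result; the second pays off only if the reversed layout is used consistently, so that no transpose is needed at run time.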
The Significance of Loop Interchange

    DO I=1,700
      DO J=1,700
        DO K=1,700
          A(I,J,K)=A(I,J,K)+B(I,J,K)*C(I,J,K)
        ENDDO
      ENDDO
    ENDDO

Run time in seconds obtained on an Origin 3000, R12K@400MHz (8 MB cache):

    loop order    run time (s)
    i,j,k         535.0
    j,i,k          32.0
    k,j,i          11.0

Loop Interchange in C

In C, the situation is exactly the opposite to Fortran.

Original loop (addressing of c[i][j] and a[i][j] is poor; addressing of b[j][i] is optimal):
    #pragma no interchange
    for(j=0; j<m; j++)
      for(i=0; i<n; i++)
        c[i][j]=a[i][j]+b[j][i];

Interchanged loop:
    #pragma interchange(i,j)
    for(i=0; i<n; i++)
      for(j=0; j<m; j++)
        c[i][j]=a[i][j]+b[j][i];

Index reversal loop:
    for(j=0; j<m; j++)
      for(i=0; i<n; i++)
        c[j][i]=a[j][i]+b[j][i];

• The performance benefits in C are the same as in Fortran
• In most practical situations, loop interchange (supported by the compiler) is much easier to achieve than index reversal.

Array Placement Effects

"Poor" data placement in memory can lead to cache thrashing. There are 2 techniques built into the compiler to avoid cache thrashing:
• array padding
• leading dimension extension
NOTE: the leading dimension of arrays should be an odd number; if a multi-dimensional array has small extents (e.g. a(64,64,64,..)), several leading dimensions should be odd numbers.
Direct-Mapped Caches: Thrashing

    COMMON // A(8192), B(8192)
    DO I=1,N
      PROD = PROD + A(I)*B(I)
    ENDDO

[Figure: (virtual) memory holds A(1)..A(8192) (32 KB) followed by B(1)..B(8192); data moves through the direct mapped cache into the registers in the CPU]

Direct mapped cache (32 KB), cache line: 4 words

    cache line    contents
    1             A(1)    A(2)    A(3)    A(4)
    2             A(5)    A(6)    A(7)    A(8)
    ...
    2047          A(8185) A(8186) A(8187) A(8188)
    2048          A(8189) A(8190) A(8191) A(8192)

Thrashing: every memory reference results in a cache miss.
Location in the cache: (memory-address) mod (cache-size)
In this case loc(A(1)) mod 32KB = loc(B(1)) mod 32KB
[because loc(B(1)) = loc(A(1)) + 8192*4B and 8192*4B mod 32KB = 0]

Set-Associative Caches

    COMMON // A(8192), B(8192)
    DO I=1,N
      PROD = PROD + A(I)*B(I)
    ENDDO

[Figure: (virtual) memory holds A(1)..A(8192) (32 KB) followed by B(1)..B(8192); set select (1 bit, LRU) chooses between the two sets on the way to the registers in the CPU]

2 way set associative cache (32 KB), cache line: 4 words

    cache line    first set                          second set
    1             A(1)    A(2)    A(3)    A(4)       B(1)    B(2)    B(3)    B(4)
    2             A(5)    A(6)    A(7)    A(8)       B(5)    B(6)    B(7)    B(8)
    ...
    1023          A(4089) A(4090) A(4091) A(4092)    B(4089) B(4090) B(4091) B(4092)
    1024          A(4093) A(4094) A(4095) A(4096)    B(4093) B(4094) B(4095) B(4096)

No thrashing: conflicting cache lines are stored into a different set.
Location in the cache: (memory-address) mod (set-size), where each of the two sets holds 16 KB
In this case loc(A(1)) mod 16KB = loc(B(1)) mod 16KB, BUT A DIFFERENT SET!
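The mapping arithmetic on the two slides above can be checked with a small C sketch. The helper names `direct_mapped_line` and `set_assoc_row` are hypothetical, and the model is deliberately simplified: it only computes where an address lands and does not simulate replacement.

```c
#include <stddef.h>
#include <stdint.h>

/* Line index in a direct-mapped cache:
 * (memory-address mod cache-size) / line-size. */
size_t direct_mapped_line(uintptr_t addr, size_t cache_bytes, size_t line_bytes)
{
    return (size_t)(addr % cache_bytes) / line_bytes;
}

/* Row index in a set-associative cache. Each of the `ways` sets (in the
 * terminology of the slides) covers cache_bytes / ways bytes, so the
 * address is reduced modulo that smaller size. */
size_t set_assoc_row(uintptr_t addr, size_t cache_bytes, size_t ways, size_t line_bytes)
{
    size_t way_bytes = cache_bytes / ways;
    return (size_t)(addr % way_bytes) / line_bytes;
}
```

With a 32 KB cache and 16-byte lines, an address and the same address plus 8192*4 bytes give the same `direct_mapped_line`, which is the thrashing case. Shifting the second address by one cache line (a pad of 4 words) moves it to a different line. In the 2-way case both addresses still share a row, but the two lines can be resident at the same time.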
Array Padding: Example

    COMMON // A(1024,1024), B(1024,1024), C(1024,1024)
    DO J=1,1024
      DO I=1,1024
        A(I,J) = A(I,J)+B(I,J)*C(I,J)
      ENDDO
    ENDDO

Assume a 32 KB cache:
    Addr[C(1,1)] = Addr[B(1,1)] + 1024*1024*4
    position in the cache: C(1,1) = B(1,1), since (1024*1024*4) mod 32KB = 0

With padding:
    COMMON // A(1024,1024), pad1(129), B(1024,1024), pad2(129), C(1024,1024)
    DO J=1,1024
      DO I=1,1024
        A(I,J) = A(I,J)+B(I,J)*C(I,J)
      ENDDO
    ENDDO

    Addr[C(1,1)] = Addr[B(1,1)] + 1024*1024*4 + 129*4
    position in the cache: C(1,1) = B(130,1) mod 32KB (an offset of 129 words from B(1,1))

• Padding will cause cache lines to be placed in different cache locations
• The compiler will try to do padding automatically

Maxwell Code Example

    REAL EX(NX,NY,NZ),EY(NX,NY,NZ),EZ(NX,NY,NZ) !Electric field
    REAL HX(NX,NY,NZ),HY(NX,NY,NZ),HZ(NX,NY,NZ) !Magnetic field
    …
    DO K=2,NZ-1
      DO J=2,NY-1
        DO I=2,NX-1
          HX(I,J,K)=HX(I,J,K)-(EZ(I,J,K)-EZ(I,J-1,K))*CHDY
     &                       +(EY(I,J,K)-EY(I,J,K-1))*CHDZ
          HY(I,J,K)=HY(I,J,K)-(EX(I,J,K)-EX(I,J,K-1))*CHDZ
     &                       +(EZ(I,J,K)-EZ(I-1,J,K))*CHDX
          HZ(I,J,K)=HZ(I,J,K)-(EY(I,J,K)-EY(I-1,J,K))*CHDX
     &                       +(EX(I,J,K)-EX(I,J-1,K))*CHDY
        ENDDO
      ENDDO
    ENDDO

Here NX=NY=NZ = 32, 64, 128, 256 (i.e. with real*4 elements: 0.8MB, 6.3MB, 50MB, 403MB).
Reusing the loads from the previous iteration (I-1) gives in total:
• 13 memory operations (6H+7E) -> minimum 13 cycles/iteration
• 18 floating point operations in this code
• 18/(13*2) = 69% of peak, i.e.
800 Mflop/s on the R10000@400MHz processor
Compiling with: -mips4 -O3 -LNO:opt=0 -OPT:reorg_common=off (to show the effect of the compiler not performing the necessary optimizations) gives a performance of 4.6 Mflop/s on this code.

Maxwell Example - continued

Problem:
• the array dimensions are small even numbers (powers of 2), and the arrays map to the same location in both the 1st level and the 2nd level caches
In general:
    primary cache   32 KB = 2(way-set-ass) * 4(size-real) * 4096
    secondary cache  8 MB = 2(way-set-ass) * 4(size-real) * 1048576
Print the position of the arrays in memory with the code:
    Integer*8 aEX
    aEX = %LOC(EX(1,1,1))
    print *,'Addr EX=',mod(aEX,4096), mod(aEX,1048576),'words'
• for the Maxwell example the print shows with NX=NY=NZ=64:
    Addr EX= 3720 470664
    Addr EY= 3720 470664
    …….. etc.
    Addr HZ= 3720 470664
  All arrays map to the same locations in both caches
• The compiler is able to pad the arrays automatically. Compiling with the default optimizations, -mips4 -O3, gives a performance of 162 Mflop/s

Dangers of Array Padding

• The compiler will automatically pad local data
• -O3 optimization will automatically pad common blocks
• Padding of common blocks is safe as long as the Fortran standard is not violated, as it is here:
    SUBROUTINE SUB
    COMMON // A(512,512), B(512,512)
    DO I=1, 2*512*512
      A(I) = 0.0
    ENDDO
    END
• Fix the violation, or do not use this optimization, either by compiling with lower optimization or by using the explicit compiler flag:
    -OPT:reorg_common=off

Variable Length Arrays (VLA)

The SGI compiler supports Variable Length Arrays in C and Fortran.
• It is standard in F90 and an SGI extension in F77:
    SUBROUTINE NAME1(N,M)
    DIMENSION R(N,M)
    ……… etc. …
    END
  These arrays are created on the stack, as opposed to a location in a static area
• In C it is an SGI extension:
    void name1(int m, int n){
      double r[m][n][n+m];
      …… etc. …..
    }
• VLAs are very handy as scratch arrays, since they are created each time execution enters the subroutine and are destroyed at exit
• Unlike static arrays, VLAs allow for proper aliasing and alignment considerations by the compiler

Loop Unrolling

Loop unrolling: perform multiple loop iterations at the same time

Original loop:
    DO I=1,N,1
      …(I)…
    ENDDO

Directive: C*$* unroll(p)
    p = 0        default unrolling
    p = 1        no unrolling
    p = UNROLL   unroll by that factor

Unrolled loop & cleanup:
    DO I=1,N-mod(N,UNROLL),UNROLL
      …(I)…
      …(I+1)…
      …(I+2)…
      …(I+UNROLL-1)…
    ENDDO
    DO I=N-mod(N,UNROLL)+1,N
      …(I)…
    ENDDO

Advantages of loop unrolling:
• more opportunities for super-scalar code
• more data re-use & pseudo-prefetch
• exploits the presence of cache lines
• reduction in loop overhead (minor)
NOTE: Inner loops should "never" be unrolled by hand:
• the compiler will typically unroll the inner loop by the amount necessary for SWP

Prefetch Data from Memory

Reordering instructions in an unrolled loop leads to an effective (pseudo) prefetch of the data:

    for(i=0; i<n; i+=4){
      a += b[i+0];
      a += b[i+1];
      a += b[i+2];
      a += b[i+3];
    }

    for(i=0; i<n; i+=4){
      t = b[i+3];
      a += b[i+0];
      a += b[i+1];
      a += b[i+2];
      a += t;
    }

• no instruction overhead; the compiler does this optimization automatically.
Explicit (manual) prefetch for memory:
• prefetch to the 1st level cache should be done in the form of pseudo-prefetch
• the compiler will insert prefetch to the 2nd level cache automatically (LNO)
• manual prefetch to the 2nd level cache can be done with a compiler directive:
    c*$* prefetch_ref=a(1)
    c*$* prefetch_ref=a(1+16)
    do I=1,n
    c*$* prefetch_ref=a(I+32),stride=16,kind=rd
      sum = sum + a(I)
    enddo
• the same in C with the corresponding #pragma directive

Outer Loop Unrolling

    DO I=1,N
      DO J=1,N
        A(I)=A(I)+B(I,J)*C(J)
      ENDDO
    ENDDO

Problem:
• A(I) is constant for the inner loop J
• C(J) is traversed in each I iteration
• B(I,J) is traversed poorly

    DO I=1,N,4 ! Unrolling by 4
      DO J=1,N
        A(I+0)=A(I+0)+B(I+0,J)*C(J)
        A(I+1)=A(I+1)+B(I+1,J)*C(J)
        A(I+2)=A(I+2)+B(I+2,J)*C(J)
        A(I+3)=A(I+3)+B(I+3,J)*C(J)
      ENDDO
    ENDDO

Unrolling the outer loop will load the complete cache line of B into the registers -> data re-use of one 1st level cache line
• the unroll factor should match the cache line size
• mostly a 1st level cache optimization
• if the data fits into the 2nd level cache, this is a good optimization to use
    -LNO:outer_unroll=n

Blocking for Cache (tiling)

Blocking for cache:
• An optimization that applies to data sets that do not fit into the (2nd level) data cache
• A way to increase spatial locality of reference (i.e. exploit full cache lines)
• A way to increase temporal locality of reference (i.e. to improve data re-use)
• It is beneficial mostly with multi-dimensional arrays

Original loop:
    DO I=1,N
      …. (I) ….
    ENDDO

Options:
    -LNO:blocking=[on|off] (default on)
    -LNO:blocking_size=n1,n2 (for L1 and L2)
By default L1=32KB and L2=1MB; use -LNO:cs2=8M to specify the 8MB L2 cache.

Blocked loop:
    DO i1=1,N,nb
      DO I=i1,min(i1+nb-1,N)
        …. (I) ….
      ENDDO
    ENDDO
The inner loop is traversed only in a range of nb at a time.

Blocking: Example

The following loop nest:
    for(i=0; i<n; i++)
      for(j=0; j<m; j++)
        x[i][j] = y[i] + z[j];
• z[j] is reused for each i iteration
  – x[i][j] is traversed in order
  – y[i] is loop invariant
  – z[j] is traversed sequentially
  – changing the loop order is not beneficial in this case
• For large m the array z will not be reused from the cache

Blocking the loops for cache:
    for(it=0; it<n; it += nb)
      for(jt=0; jt<m; jt += nb)
        for(i=it; i<min(it+nb,n); i++)
          for(j=jt; j<min(jt+nb,m); j++)
            x[i][j] = y[i] + z[j];
• nb elements of the z array will be brought into the cache and reused nb times before moving on to the next tile

Loop Fusion

Loop fusion (merging two or more loops together):
• fusing loops that refer to the same data enhances temporal locality
• a larger loop body allows more effective scalar optimizations

Example, original loops:
    for(i=0; i<n; i++)
      a[i] = b[i] + 1;
    for(i=0; i<n; i++)
      c[i] = a[i]/2;
    for(i=0; i<n; i++)
      d[i] = 1/c[i+1];

Fused loops:
    for(i=0; i<n; i++){
      a[i] = b[i] + 1;
      c[i] = a[i]/2;
    }
    for(i=0; i<n; i++)
      d[i] = 1/c[i+1];

Fusing more loops with loop peeling:
    a[0] = b[0] + 1;
    c[0] = a[0]/2;
    for(i=1; i<n; i++){
      a[i] = b[i] + 1;
      c[i] = a[i]/2;
      d[i-1] = 1/c[i];
    }
    d[n-1] = 1/c[n];

    -LNO:fusion=[0,1,2] (default 1)
• loop peeling can be used to break data dependencies when fusing loops
• sometimes temporary arrays can be replaced by scalars (this optimization has to be done manually)
• The compiler will attempt to fuse loops if they are adjacent, i.e.
  no code between the loops to be fused

Loop Fusion in Array Assignments

Loop fusion is instrumental in generating good F90 code.

F90 code sequence:
    A(1:N) = B(1:N)+1
    C(1:N) = A(1:N)/2
    D(1:N) = 1/C(2:N+1)

The compiler will typically generate the following instruction sequence:
    Allocate T(1:N)
    DO I=1,N
      T(I)=B(I)+1
    ENDDO
    DO I=1,N
      A(I) = T(I)
    ENDDO
    DO I=1,N
      T(I)= A(I)/2
    ENDDO
    DO I=1,N
      C(I) = T(I)
    ENDDO
    DO I=1,N
      T(I)=1/C(I+1)
    ENDDO
    DO I=1,N
      D(I) = T(I)
    ENDDO

The compiler can optimize the loop sequence by fusion:
• for that, all assignments (loops) should be adjacent
• preserving data dependencies, this can be fused:

Fused loops:
    DO I=1,N
      A(I) = B(I)+1
      C(I) = A(I)/2
    ENDDO
    DO I=1,N
      D(I) = 1/C(I+1)
    ENDDO

Further peeling to break data dependencies will merge the two remaining loops.
• for this optimization to work automatically, no code should be placed between the array assignments, such that the assignments are adjacent

Loop Fission

Loop fission (splitting) or loop distribution:
• improves memory locality by splitting out loops that refer to different independent arrays

Original loop:
    for(i=1; i<n; i++){
      a[i] = a[i] + b[i-1];
      b[i] = c[i-1]*x + y;
      c[i] = 1/b[i];
      d[i] = sqrt(c[i]);
    }

After fission:
    for(i=0; i<n-1; i++){
      b[i+1] = c[i]*x + y;
      c[i+1] = 1/b[i+1];
    }
    for(i=0; i<n-1; i++)
      a[i+1] = a[i+1] + b[i];
    for(i=0; i<n-1; i++)
      d[i+1] = sqrt(c[i+1]);
    i = n;   /* restore the final value of i from the original loop */

    -LNO:fission=[0,1,2] (default 1)
      0  no fission
      1  normal fission
      2  fission tried before fusion
    The compiler attempts to distribute inner loops.

LNO: Gather-Scatter

Special form of loop fission:
• If the loop to be optimized contains conditional execution, it is often faster to evaluate all the conditions first.

Original code:
    Subroutine fred(a,b,c,n)
    real*8 a(n), b(n), c(n)
    do I=1,n
      if(c(I) .gt. 0) then
        a(I) = c(I)/b(I)
        c(I) = c(I)*b(I)
        b(I) = 2*b(I)
      endif
    enddo
    end

Conditional execution removed:
    do I=1,n
      deref_gs(inc_0+1) = I
      if(c(I) .gt. 0) then
        inc_0 = inc_0 + 1
      endif
    enddo
    do ind_0=0,inc_0-1
      I=deref_gs(ind_0+1)
      a(I) = c(I)/b(I)
      c(I) = c(I)*b(I)
      b(I) = 2*b(I)
    enddo
    end

• The computationally intensive loop runs only over the indices for which the condition was true and can be better optimized (SWP)
• LNO will not evaluate nested IF conditions, unless -LNO:gather_scatter=2 is used

LNO: Vector Intrinsics

Most intrinsics have their "vector" equivalents. The compiler will automatically substitute vector intrinsics where legal, when the functions are invoked in a loop:

    SUBROUTINE VFRED(A,N)
    REAL*8 A(N)
    DO I=1,N
      A(I) = A(I) + COS(A(I))
    ENDDO
    END

becomes

    CALL VCOS$(A(1),DEREF_SE1_F8(1), %VAL(N-1),%VAL(1), %VAL(1))
    DO I=1,N
      A(I) = A(I) + DEREF_SE1_F8(I)
    ENDDO

• Vector intrinsics are faster if N>10 for most intrinsics
• Vector intrinsics have different precision rules (1 or 2 ulp less)
• illegal arguments cannot be trapped with the vector intrinsics
• -LNO:vintr=off disables the generation of the vector intrinsics

[Figure: Vector Intrinsics: Performance]

Data Dependence in Loops

In loops, each statement can be executed many times.
• loop carried data dependence
  – dependence between statements in different iterations
• loop independent data dependence
  – dependence between statements in the same iteration
• lexically forward dependence:
  – the source precedes the target lexically
• lexically backward dependence:
  – the opposite of the above
• the right-hand side of an assignment precedes the left-hand side

Example:
    for( i=2; i<9; i++){
      x[i] = y[i] + z[i];
      a[i] = x[i-1] + 1;
    }
[Figure: the loop unrolled for analysis, statements (1) to (4)]
This is a loop carried, lexically forward dependence with distance 1 between the two statements.

Specifying the Dependency Rules

In the following example:

    SUBROUTINE DAXPYI(N,X,K,A)
    INTEGER N,K
    REAL*8 X(N),A
    DO I=1,N
      X(K+I) = X(K+I) + A*X(I)
    ENDDO
    END

Compiler schedules:
    K<N (dependence)      14% peak
    K>N (no dependence)   33% peak

If K>N there is no dependency; if K<N there is a dependency. The value of K is unknown to the compiler, thus the compiler will assume dependencies.
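Why the compiler must be conservative here can be shown with a small C sketch of the same update. The names `daxpyi_seq` and `daxpyi_loads_first` are hypothetical; the second function only imitates a schedule in which all loads of the source elements are moved ahead of the stores, which is the kind of reordering an aggressive schedule might want. Indices are zero-based, and the array `x` is assumed to hold at least n+k elements.

```c
#include <stddef.h>
#include <string.h>

/* Sequential semantics: x[k+i] += a*x[i] for i = 0..n-1, in order. */
void daxpyi_seq(size_t n, double *x, size_t k, double a)
{
    for (size_t i = 0; i < n; i++)
        x[k + i] += a * x[i];
}

/* Imitation of a reordered schedule: every source element x[i] is read
 * (into tmp, a scratch array of n elements) before any store is done. */
void daxpyi_loads_first(size_t n, double *x, size_t k, double a, double *tmp)
{
    memcpy(tmp, x, n * sizeof *tmp);
    for (size_t i = 0; i < n; i++)
        x[k + i] += a * tmp[i];
}
```

For k >= n the stored elements never feed later iterations, so both functions give the same result. For k < n an element written in one iteration is read k iterations later, and the two schedules differ; that is the dependence the compiler has to assume when it cannot see the value of K.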
The ivdep directive can be used to communicate the data dependency rules to the compiler:

    SUBROUTINE DAXPYI(N,X,K,A)
    INTEGER N,K
    REAL*8 X(N),A
    cdir$ ivdep
    DO I=1,N
      X(K+I) = X(K+I) + A*X(I)
    ENDDO
    END

IVDEP = Ignore Vector DEPendency

The IVDEP Directive

With indexed addressing, IVDEP is the only way to specify to the compiler that there are no data dependencies:

    void update(int n, float *a, float *b, int *indx, float s)
    {
      int i;
    #pragma ivdep
      for(i=0; i<n; i++)
        a[indx[i]] += s*b[i];
    }

• here ivdep means that the integer values stored in the indx array are all different, i.e. indx is a permutation array
• assuming no data dependencies will produce faster processor code, because the compiler has fewer constraints on ordering the load-store instructions
The IVDEP directive to the compiler is not part of the language and its interpretation is not standardized.

Three Types of IVDEP Directive

    CDIR$ IVDEP
    DO I=1,N
      A(INDEX(1,I)) = B(I)
      A(INDEX(2,I)) = C(I)
    ENDDO

SGI default behaviour: A and B and C are independent, i.e. index(*,i) != index(*,j)
• Default interpretation:
  – A and B and C are independent; that breaks both lexically forward (i+k) and backward (i-k) dependencies:
    • index(1,i) != index(1,j)
    • index(2,i) != index(2,j)
    • but for some i: index(1,*) == index(2,*)
• The default interpretation can be changed with the -OPT: compiler option. Possible other interpretations:
  – break only lexically backward dependencies (Cray IVDEP), i.e.
    assume only index(*,i) != index(*,i-k) (cray_ivdep=on)
  – there are no dependencies whatsoever (Liberal IVDEP, enable with -OPT:liberal_ivdep=on)

The Argument Alias Problem

In Fortran, the compiler assumes A and B do not overlap:
    SUBROUTINE COPY(A,B,N)
    REAL*8 A(N),B(N)
    DO I=1,N
      B(I) = A(I)
    ENDDO
    END

In C, the compiler assumes pointers a and b can point to the same address:
    void copy(double *a, double *b, int n)
    {
      int i;
      for(i=0; i<n; i++)
        b[i] = a[i];
    }

• In Fortran, it is a mistake to invoke copy with overlapping arguments. The compiler will perform optimizations assuming A and B are not aliases over the computational range.
• In C, argument aliases are allowed. Therefore optimizations (SWP) that change the original order of loads and stores are not possible. There are several ways to remove this restriction:
  – the ivdep pragma
  – the compiler optimization flag: -OPT:alias=memory-access-model
  – the restrict keyword

Aliases: the Optimizer Options

These options work over all of the compilation unit.
    -OPT:alias=[any,typed,unnamed,restrict,disjoint]
• any is the default: any pair of memory references may be aliased.
Of the other memory access models, the most important are:
• restrict – assume that any pair of memory references that are named differently do not point to the same regions in memory
    float *p, *q
    *p does not alias with *q, q, p or any global variable
• disjoint – assume the same restrictions as "restrict"; in addition, no pointer dereference will point to an overlapping region in memory
    float *p, *q
    *p does not alias with *q, q, p or any global variable
    *p does not alias with **q, **p, ***q, etc.

The restrict Keyword

The Numerical C Extensions Group X3J11.1 proposed (1993) a restrict keyword as the way to specify pointer access models.
The restrict semantics:
• assume that de-referencing the qualified pointer is the only way the program can access the memory pointed to by that pointer
• loads and stores through such a pointer do not alias with any other loads and stores, except those through the same pointer

    void copy(double * restrict a, double * restrict b, int n)
    {
      int i;
      for(i=0; i<n; i++)
        b[i] = a[i];
    }

• in this example, it is sufficient to indicate restrict b, since it is necessary to qualify only the pointers being stored through
• to enable the restrict keyword it is necessary to use the compiler flag (7.2 and 7.3 compilers): -LANG:restrict

Alias in Storage Allocation

Program data can be stored in memory in 2 ways:
• Storage in the global area
  – memory pages are allocated statically, i.e. all data is put at a fixed (virtual) address at load time
  – loading such data often takes 2 instructions, since the load immediate instruction in MIPS is limited to a 64 KB offset:
        ldadr R1,addr         #load base pointer
        ldw   R2,R1+offset    #load base+offset
  – COMMON block data, global data, SAVE data, malloc, mmap
  – compilation with -static: all variables are allocated in the global area
• Storage on the stack
  – memory pages are allocated dynamically during program execution
  – each subroutine gets a new stack area for local data
  – loading data from the stack requires a single instruction:
        ldw   R2,TOS+offset   #load TopOfStack+offset
  – local (automatic) variables, temporary storage, alloca data
• Routines called from a parallel region:
  – allocate a private stack area
  – variables allocated on the private stack are private
  – variables in the global area are shared (aliases).
Procedure Inlining

Inlining: replace a function call by that function's source code

    DO I=1,N
      call DO_WORK(A(I),C(I))
    ENDDO

    Subroutine DO_WORK(X,Y)
    Y=1+X*(1+X*0.5)
    END

Advantages:
• increased opportunities for processor optimizations
• more opportunities for Loop Nest optimizations

Options:
    -INLINE:list=[on|off] (default off)
    -INLINE:must=sub1:never=sub2
    -IPA:inline=[on|off] (default on)

Candidates for inlining are modules that:
• are "small", i.e. not much source code
• are called very often (typically in a loop)
• do not take much time per call

Inhibitors to inlining:
• mismatch in the subroutine arguments (type or shape)
• no inlining across languages (e.g. Fortran calls a C subroutine)
• no static (SAVE) local variables
• no varargs routines, no recursive routines
• no functions with alternate entry points
• no nested subroutines (like in F90)

Software Pipelining (SWP)

Software pipelining is the way to mix iterations in a loop such that all processor execution slots are filled:
• SWP is performed by the Code Generator (CG), which also unrolls the inner loop to achieve the best SWP schedule (-O3 opt level). This can be computationally intensive.
• Vector loops are well-suited for SWP; short loops may run slower with SWP
Inhibitors to SWP:
• loops with subroutine (or intrinsic) calls cannot be SWP-ed
• loops with complicated conditionals or branching
• loops that are too long cannot be software pipelined because the compiler runs out of available registers (remedy: loop fission)
• loops with data dependences between iterations are harder to SWP

Summary

• Scalar optimization:
  – improving ILP by code transformation and grouping independent instructions
  – improving memory access by restructuring loop nests to take better advantage of the memory hierarchy
• compilers are good at instruction level optimizations and loop transformations.
  It depends on the language, however:
  – F77 is the easiest for the compiler to work with
  – C is more difficult
  – F90/C++ are the most complex for compiler optimizations
• the user is responsible for presenting the code in a way that allows for compiler optimizations:
  – don't violate the language standard
  – write clean and clear code
  – consider the data structures for (false) sharing and alignment
  – consider the data structures for data dependencies
  – use the most natural presentation of algorithms with multi-dimensional arrays

Case Study: Vector Update
Scalar Optimization Techniques

Vector Update Code

Profiling tells us that we spend most time in this part:

    ll=0
    do jj=1,nj
      do ii=1,ni
        ll=ll+1
        res=0
        do n=1,nib
          na=ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
          nb=n+(jj-1)*nib
          ndb1=nmb1/2
          naa1=nma1+na
          nbb1=ndb1+nb
          res=res+p(naa1)*dp(nbb1)
        end do
        nde1=nme1/2
        lle1=nde1+ll
        dp(lle1)=dp(lle1)+res     ! this is the net result of all the computations
      end do
    end do

    L1 Cache (sec)     50
    L2 Cache (sec)     37
    TLB (sec)         215
    Execution (sec)   286

Vector Update: Stride Analysis

    do jj=1,nj
      do ii=1,ni
        ….
        do n=1,nib
          na=ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
          nb=n+(jj-1)*nib
          ndb1=nmb1/2
          naa1=nma1+na
          nbb1=ndb1+nb
          res=res+P(naa1)*DP(nbb1)
        end do
        ….
• for the inner loop, the stride on array P is controlled by naa1:
    naa1 = nma1+ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
• the loop index is n, therefore the stride is nra
• the stride on array DP is controlled by nbb1:
    nbb1 = ndb1+n+(jj-1)*nib
• therefore the stride is 1
• loop exchange consideration (note: nra, nib ~5000):

    inner loop over    n      ii     jj
    stride on P        nra    1      0
    stride on DP       1      0      nib

• thus ii should be the inner loop

Vector Update: Loop Interchange

To interchange the loops they have to be properly nested:
• substitute expressions and eliminate temporary variables

    do jj=1,nj
      do ii=1,ni
        res=0                     ! res can be eliminated by placing the update in the inner loop
        do n=1,nib
          ndb1=nmb1/2
          naa1=nma1+ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub    ! substituted NA
          nbb1=ndb1+n+(jj-1)*nib                             ! substituted NB
          res=res+p(naa1)*dp(nbb1)
        end do
        nde1=nme1/2
        lle1=nde1+ii+(jj-1)*ni    ! eliminated LL
        dp(lle1)=dp(lle1)+res
      end do
    end do

• now the loops can be interchanged

    do jj=1,nj
      do ii=1,ni
        do n=1,nib
          ndb1=nmb1/2
          naa1=nma1+ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
          nbb1=ndb1+n+(jj-1)*nib
          nde1=nme1/2
          lle1=nde1+ii+(jj-1)*ni
          dp(lle1)=dp(lle1)+p(naa1)*dp(nbb1)
        end do
      end do
    end do

Vector Update: DAXPY Form

    ndb1=nmb1/2
    nde1=nme1/2
    do jj=1,nj
      do n=1,nib
        do ii=1,ni
          naa1=nma1+ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
          nbb1=ndb1+n+(jj-1)*nib
          lle1=nde1+ii+(jj-1)*ni
          dp(lle1)=dp(lle1)+p(naa1)*dp(nbb1)
        end do
      end do
    end do

simplifying indexing….
    ndb1=nmb1/2
    nde1=nme1/2
    do jj=1,nj
      do n=1,nib
        naa1=nma1+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
        dp_temp=dp(ndb1+n+(jj-1)*nib)
        lle1=nde1+(jj-1)*ni
        do ii=1,ni
          dp(lle1+ii)=dp(lle1+ii)+p(naa1+ii)*dp_temp
        end do
      end do
    end do

    ndb1=nmb1/2
    nde1=nme1/2
    id1 =nma1+(i-1)*nru+(l-1)*nra*nrub
    do jj=1,nj
      id2 = ndb1+(jj-1)*nib
      lle1= nde1+(jj-1)*ni
      id3 = id1
      do n=1,nib
        dp_temp=dp(id2+n)
        do ii=1,ni                ! this is a DAXPY operation
          dp(lle1+ii)=dp(lle1+ii)+p(id3+ii)*dp_temp
        end do
        id3 = id3 + nra
      end do
    end do

Vector Update: 2D Form

With the DAXPY operation in the inner loop, we should consider further optimization with outer loop unrolling and blocking.
• hand tuning was necessary
• the compiler would not implement loop interchange because in the original code the loops are not properly nested
• With the DAXPY formulation, we can consider a 2-dimensional implementation of that code:

    real*8 dp(ni,nj), p(ni,nib)
    ndb1=nmb1/2
    nde1=nme1/2
    id1 =nma1+(i-1)*nru+(l-1)*nra*nrub
    do jj=nde1,nj
      do n=ndb1,nib
        dp_temp=dp(n,jj-nde1)
        do ii=1,ni
          dp(ii,jj)=dp(ii,jj)+p(ii,n)*dp_temp
        end do
      end do
    end do

Vector Update: Compiler Opt

Compiling the new 2D version with -O3:
• the compiler can automatically perform the necessary loop transformations

Main tiled and unrolled loops:

    DO tile2jj = 1, nj, 126
      DO tile1ii = 1, ni, 544
        DO n = 1, (nib + -3), 4
          DO jj = tile2jj, MIN((nj + -1), (tile2jj + 124)), 2
            mi0 = dp2(n, jj)
            mi1 = dp2(n + 3, jj + 1)
            mi2 = dp2(n + 2, jj + 1)
            mi3 = dp2(n + 1, jj + 1)
            mi4 = dp2(n, jj + 1)
            mi5 = dp2(n + 1, jj)
            mi6 = dp2(n + 2, jj)
            mi7 = dp2(n + 3, jj)
            DO ii = tile1ii, MIN((tile1ii + 543), ni), 1
              dp1(ii, jj) = (dp1(ii, jj) +(p(ii, n) * mi0))
              dp1(ii, jj) = (dp1(ii, jj) +(p(ii, n + 1) * mi5))
              dp1(ii, jj) = (dp1(ii, jj) +(p(ii, n + 2) * mi6))
              dp1(ii, jj) = (dp1(ii, jj) +(p(ii, n + 3) * mi7))
              dp1(ii, jj + 1) = (dp1(ii, jj + 1) +(p(ii, n) * mi4))
              dp1(ii, jj + 1) = (dp1(ii, jj + 1) +(p(ii, n + 1) * mi3))
              dp1(ii, jj + 1) = (dp1(ii, jj + 1) +(p(ii, n + 2) * mi2))
              dp1(ii, jj + 1) = (dp1(ii, jj + 1) +(p(ii, n + 3) * mi1))
            END DO
          END DO
        END DO
      END DO
    END DO

Remainder (wd_*) loops, as listed on the slide:

          DO wd_jj0 = jj, MIN((tile2jj + 125), nj), 1
            mi8 = dp2(n, wd_jj0)
            mi9 = dp2(n + 1, wd_jj0)
            mi10 = dp2(n + 3, wd_jj0)
            mi11 = dp2(n + 2, wd_jj0)
            DO ii0 = tile1ii, MIN((tile1ii + 543), ni), 1
              dp1(ii0, wd_jj0) = (dp1(ii0, wd_jj0) +(p(ii0, n) * mi8))
              dp1(ii0, wd_jj0) = (dp1(ii0, wd_jj0) +(p(ii0, n + 1) * mi9))
              dp1(ii0, wd_jj0) = (dp1(ii0, wd_jj0) +(p(ii0, n + 2) * mi11))
              dp1(ii0, wd_jj0) = (dp1(ii0, wd_jj0) +(p(ii0, n + 3) * mi10))
            END DO
          END DO
        END DO
        DO wd_n = n, nib, 1
          DO jj0 = tile2jj, MIN((nj + -1), (tile2jj + 124)), 2
            mi12 = dp2(wd_n, jj0)
            mi13 = dp2(wd_n, jj0 + 1)
            DO ii1 = tile1ii, MIN((tile1ii + 543), ni), 1
              dp1(ii1, jj0) = (dp1(ii1, jj0) +(p(ii1, wd_n) * mi12))
              dp1(ii1, jj0 + 1) = (dp1(ii1, jj0 + 1) +(p(ii1, wd_n) * mi13))
            END DO
          END DO
          DO wd_jj = jj0, MIN((tile2jj + 125), nj), 1
            mi14 = dp2(wd_n, wd_jj)
            DO ii2 = tile1ii, MIN((tile1ii + 543), ni), 1
              dp1(ii2, wd_jj) = (dp1(ii2, wd_jj) +(p(ii2, wd_n) * mi14))
            END DO
          END DO

Vector Update Summary

[Figure: ORIGINAL CODE]
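The step from the dot-product form to the DAXPY form in this case study can be sketched in C. This is a simplified illustration under assumptions that are not in the original code: the source vector `d` and the destination `e` are separate arrays (the case study uses two regions of dp), all arrays are flat and column-major as in Fortran, the stride on `p` is ni rather than nra, and the function names `update_dot` and `update_daxpy` are hypothetical.

```c
#include <stddef.h>

/* Dot-product form, n innermost: p is read with stride ni in the inner loop.
 * p is ni x nib, d is nib x nj, e is ni x nj, all column-major. */
void update_dot(size_t ni, size_t nj, size_t nib,
                const double *p, const double *d, double *e)
{
    for (size_t jj = 0; jj < nj; jj++)
        for (size_t ii = 0; ii < ni; ii++) {
            double res = 0.0;
            for (size_t n = 0; n < nib; n++)
                res += p[ii + n * ni] * d[n + jj * nib];
            e[ii + jj * ni] += res;
        }
}

/* DAXPY form, ii innermost: p and e are unit stride, and the element of d
 * is loop invariant in the inner loop (dp_temp in the case study). */
void update_daxpy(size_t ni, size_t nj, size_t nib,
                  const double *p, const double *d, double *e)
{
    for (size_t jj = 0; jj < nj; jj++)
        for (size_t n = 0; n < nib; n++) {
            double d_temp = d[n + jj * nib];
            for (size_t ii = 0; ii < ni; ii++)
                e[ii + jj * ni] += p[ii + n * ni] * d_temp;
        }
}
```

Both functions accumulate the same products; only the order of the additions differs, so for general floating-point data the results may differ in the last bits. The second form is the one a loop nest optimizer can go on to tile and unroll.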