Cache Optimizations &
the Loop Nest Optimizer
Improvement Opportunities
A program runs slowly when not all resources are used:
• processor:
– not using opportunities to go superscalar (ILP)
– scheduling of instructions is not optimal (too many wait states)
• memory access:
– not all data in a cache line is used (spatial locality)
– data in the cache is not reused (temporal locality)
Performance analysis is used to diagnose the problem.
The compiler will attempt to optimize the program for the given architecture, but:
• data structures can inhibit compiler optimizations
• the way the algorithm is presented can inhibit compiler optimizations
Often it is necessary to rewrite critical parts of the code (loops) so that the compiler can do better performance optimization.
=> Understand the compiler's optimization techniques
Compiler Optimization Techniques
The following optimizations are built into the compiler:
• general:
– procedure inlining
– data and array padding
• loop based (these work on loop nests, which implies the use of multi-dimensional arrays; enabled at -O3 or with -LNO:opt=[1|0]):
– loop interchange
– outer and inner loop unrolling
– cache blocking
– loop fusion (merge) and fission (split)
• code generation:
– software pipelining
– instruction reordering
Presenting the algorithm in the program such that the compiler can apply these optimization techniques leads to optimal program performance on the machine.
Scalar Architecture: Cache System
• The hierarchy of memory devices (figure: access speed in 1/clock vs. device capacity):
– registers: 64, 1 cycle
– L1 cache: 32 KB, ~2-3 cycles
– L2 cache: 8 MB, ~10 cycles
– memory: ~1 - 100s GB, ~100 cycles
– disk: ~4000 cycles
• The goals of the memory hierarchy:
– access speed ~ that of the fastest memory
– effective capacity ~ the size of the largest memory
-> Programs should follow the principle of locality (use items in the cache):
– spatial locality of reference (use all words in a cache line)
– temporal locality of reference (reuse the same cache line)
Scalar Architecture: Cache Organization
Example: the L2 cache on the O2K (8 MB, or 2097152 words). A figure shows the words in memory grouped into cache lines (32 words each), a load instruction (ld) requesting a single word, and the resulting cache line transfer:
• a cache hit will load the word from the cache
• a cache miss will load the whole cache line from memory
The goal of scalar optimization:
– Spatial locality of reference (use all words in cache line)
– Temporal locality of reference (use same cache line)
Problems of Scalar Optimization
DO i=1,n
DO j=1,n
DO k=1,n
C(i,j)=C(i,j) + A(i,k)*B(k,j)
ENDDO ENDDO ENDDO
(figure: C = A x B, with the cache lines running down the columns of each matrix)
– each C(i,j) value is accumulated in a register for the A(i,k)*B(k,j) products
– B is traversed in sequence of cache lines (spatial locality)
– A is accessing only 1 word from each cache line (no locality)
– for A and B there is no reuse of cache lines (if n is large)
This is a problem only if A, B, C do not fit into the cache.
Loop Nest Optimizer
LNO performs loop restructuring to optimize data access:
• loop interchange
• loop unrolling
• loop blocking for cache
• loop fusion
• loop fission
• pre-fetching
LNO is controlled with compiler options and/or compiler directives or pragmas; the same options exist for both:
• LNO is the default at -O3, but can be turned on/off individually with -LNO:opt=[1|0]
• directive/pragma syntax:
– Fortran: C*$* keyword [=value(s)]
– C/C++ : #pragma keyword [=value(s)]
• directives/pragmas can be disabled with the compiler switch -LNO:ignore_pragmas
Array Indexing
There are several ways to index arrays:
Direct addressing (++):
DO j=1,M
DO i=1,N
… A(i,j) …
ENDDO ENDDO
Loop carried addressing (+):
DO j=1,M
DO i=1,N
k = k + 1
… A(k) …
ENDDO ENDDO
Explicit addressing (-):
DO j=1,M
DO i=1,N
… A(i+(j-1)*N) …
ENDDO ENDDO
Indirect addressing (--):
DO j=1,M
DO i=1,N
… A(index(i,j)) …
ENDDO ENDDO
• The addressing scheme has an impact on performance
• Arrays should be accessed in the most natural, direct way for the compiler to apply its loop optimization techniques
Data Storage in Memory
Data storage order is language dependent:
• Fortran stores multi-dimensional arrays “column-wise”: for A(I,J), memory holds column j, then column j+1, then column j+2, …
the left-most index changes fastest
• C stores multi-dimensional arrays “row-wise”: for a[i][j], memory holds row i, then row i+1, then row i+2, …
the right-most index changes fastest
• Accessing array elements in storage order greatly improves performance for arrays that do not fit in the cache(s)
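As an illustration (a minimal C sketch, not part of the original slides), the two functions below touch the same elements, but the row-order version walks memory with stride 1 while the column-order version jumps a whole row per access:

#include <stddef.h>
/* C stores a[i][j] row-wise: a[i][j+1] is adjacent, a[i+1][j] is m words away */
double sum_row_order(size_t n, size_t m, const double a[n][m]) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)        /* rows outer */
        for (size_t j = 0; j < m; j++)    /* columns inner: stride-1, uses full cache lines */
            s += a[i][j];
    return s;
}
double sum_col_order(size_t n, size_t m, const double a[n][m]) {
    double s = 0.0;
    for (size_t j = 0; j < m; j++)        /* columns outer */
        for (size_t i = 0; i < n; i++)    /* rows inner: stride-m, one word per cache line */
            s += a[i][j];
    return s;
}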
Loop Interchange: FORTRAN
Original loop:
c*$* no interchange
DO I=1,N
DO J=1,M
C(I,J)=A(I,J)+B(I,J)
ENDDO ENDDO
Interchanged loops:
c*$* interchange(J,I)
DO J=1,M
DO I=1,N
C(I,J)=A(I,J)+B(I,J)
ENDDO ENDDO
(figure: the storage order of the N x M arrays A, B, C runs down the columns, I varying fastest; the original loop's access order runs across the rows, while the interchanged loop's access order follows the storage order)
• The distribution of data in memory is not changed; only the access pattern is changed
• The compiler can do this optimization automatically:
-LNO:interchange=[on|off] (default on)
Index Reversal
Original loop:
DO I=1,N
DO J=1,M
C(I,J)=A(I,J)+B(J,I)
ENDDO ENDDO
The access is poor for A and C, while it is optimal for B. Interchanging the loops alone would be good for A and C, but bad for B.
Interchanged loops + index reversal:
DO J=1,M
DO I=1,N
C(I,J)=A(I,J)+B(I,J)
ENDDO ENDDO
• Index reversal on B (i.e. every B(J,I) replaced by B(I,J)) must be done everywhere in the program
• This has to be done manually; there is no compiler optimization that does index reversal.
The Significance of Loop Interchange
DO I=1,700
DO J=1,700
DO K=1,700
A(I,J,K)=A(I,J,K)+B(I,J,K)*C(I,J,K)
ENDDO ENDDO ENDDO
Run time in seconds obtained on an Origin 3000 (R12K@400MHz, 8 MB cache):
loop order i,j,k: 535.0
loop order j,i,k: 32.0
loop order k,j,i: 11.0
Loop Interchange in C
In C, the situation is exactly the opposite of Fortran:
Original loop:
#pragma no interchange
for(j=0; j<m; j++)
for(i=0; i<n; i++)
c[i][j]=a[i][j]+b[j][i];
Here the addressing of c[i][j] and a[i][j] is poor, while the addressing of b[j][i] is optimal.
Interchanged loop:
#pragma interchange(i,j)
for(i=0; i<n; i++)
for(j=0; j<m; j++)
c[i][j]=a[i][j]+b[j][i];
Index reversal loop:
for(j=0; j<m; j++)
for(i=0; i<n; i++)
c[j][i]=a[j][i]+b[j][i];
• The performance benefits in C are the same as in Fortran
• In most practical situations, loop interchange (supported by the compiler) is much easier to achieve than index reversal.
Array Placement Effects
“Poor” data placement in memory can lead to cache thrashing.
There are 2 techniques built into the compiler to avoid cache thrashing:
• array padding
• leading dimension extension
NOTE: the leading dimension of arrays should be an odd number; if the multi-dimensional array has small extents (e.g. a(64,64,64,..)), several of the leading dimensions should be odd numbers.
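As a sketch of the idea in C, where the fastest-varying dimension is the last one (playing the role of Fortran's leading dimension); the array names are illustrative:

/* Padding a power-of-two extent to an odd one staggers consecutive
   rows in the cache; the extra elements are simply never used. */
static double bad[64][64][64];    /* rows 64 words apart: conflict-prone    */
static double good[64][64][65];   /* odd fastest dimension: rows staggered  */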
Direct-Mapped Caches: Thrashing
COMMON //A(8192), B(8192)
DO I=1,N
PROD = PROD + A(I)*B(I)
ENDDO
(figure: A(1)…A(8192) and B(1)…B(8192) are contiguous in (virtual) memory; a direct-mapped 32 KB cache with 4-word cache lines has 2048 line slots, so A(1)-A(4) and B(1)-B(4) compete for slot 1, A(5)-A(8) and B(5)-B(8) for slot 2, …, A(8189)-A(8192) and B(8189)-B(8192) for slot 2048)
Thrashing: every memory reference results in a cache miss.
Location in the cache:
(memory-address) mod (cache-size)
in this case loc(A(1)) mod 32KB = loc(B(1)) mod 32KB
[because loc(B(1)) = loc(A(1)) + 8192*4B, and 8192*4B mod 32KB = 0]
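A minimal C sketch (not from the slides) of this mapping arithmetic, using the sizes of the example:

#include <stdio.h>
/* Compute where two arrays land in a direct-mapped cache: 32 KB cache,
   4-byte words, two 8192-word arrays laid out back to back as in the
   COMMON block above. */
int main(void) {
    const unsigned cache_size = 32 * 1024;   /* bytes */
    const unsigned line_size  = 4 * 4;       /* 4 words of 4 bytes */
    unsigned addrA = 0;                      /* loc(A(1)), arbitrary base */
    unsigned addrB = addrA + 8192 * 4;       /* B starts right after A */
    /* cache location = (address mod cache size); equal values -> same slot */
    printf("A(1): byte %u, line slot %u\n",
           addrA % cache_size, (addrA % cache_size) / line_size);
    printf("B(1): byte %u, line slot %u\n",
           addrB % cache_size, (addrB % cache_size) / line_size);
    /* both print slot 0: each A(I)/B(I) pair evicts the other -> thrashing */
    return 0;
}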
Set-Associative Caches
COMMON //A(8192), B(8192)
DO I=1,N
PROD = PROD + A(I)*B(I)
ENDDO
(figure: a 2-way set associative 32 KB cache with 4-word cache lines, a 1-bit set select and LRU replacement; the conflicting cache lines of A and B occupy the two different ways: lines 1…1024 hold A(1)-A(4) … A(4093)-A(4096) in one way and B(1)-B(4) … B(4093)-B(4096) in the other)
No thrashing: conflicting cache lines are stored into a different set.
Location in the cache:
(memory-address) mod (cache-size / associativity)
in this case loc(A(1)) mod 16KB = loc(B(1)) mod 16KB
BUT A DIFFERENT SET!
Array Padding: Example
COMMON // A(1024,1024), B(1024,1024), C(1024,1024)
DO J=1,1024
DO I=1,1024
A(I,J) = A(I,J)+B(I,J)*C(I,J)
ENDDO ENDDO
Assume a 32 KB cache:
Addr[C(1,1)] = Addr[B(1,1)] + 1024*1024*4
position in the cache: C(1,1) = B(1,1), since (1024*1024*4) mod 32KB = 0
Padded version:
COMMON // A(1024,1024),pad1(129), B(1024,1024),pad2(129), C(1024,1024)
DO J=1,1024
DO I=1,1024
A(I,J) = A(I,J)+B(I,J)*C(I,J)
ENDDO ENDDO
• padding causes the conflicting cache lines to be placed in different cache locations
• the compiler will try to do padding automatically
Addr[C(1,1)] = Addr[B(1,1)] + 1024*1024*4 + 129*4
position in the cache: C(1,1) = B(130,1) mod 32KB
Maxwell Code Example
REAL EX(NX,NY,NZ),EY(NX,NY,NZ),EZ(NX,NY,NZ) !Electric field
REAL HX(NX,NY,NZ),HY(NX,NY,NZ),HZ(NX,NY,NZ) !Magnetic field
…
DO K=2,NZ-1
DO J=2,NY-1
DO I=2,NX-1
HX(I,J,K)=HX(I,J,K)-(EZ(I,J,K)-EZ(I,J-1,K))*CHDY
+(EY(I,J,K)-EY(I,J,K-1))*CHDZ
HY(I,J,K)=HY(I,J,K)-(EX(I,J,K)-EX(I,J,K-1))*CHDZ
+(EZ(I,J,K)-EZ(I-1,J,K))*CHDX
HZ(I,J,K)=HZ(I,J,K)-(EY(I,J,K)-EY(I-1,J,K))*CHDX
+(EX(I,J,K)-EX(I,J-1,K))*CHDY
ENDDO ENDDO ENDDO
here NX=NY=NZ = 32, 64, 128, 256 (i.e. with real*4 elements: 0.8MB, 6.3MB, 50MB, 403MB)
Reusing the loads from the previous iteration (I-1) gives in total:
13 memory operations (6 H + 7 E) -> a minimum of 13 cycles/iteration
18 floating point operations in this code
18/(13*2) = 69% of peak; with a peak of 800 Mflop/s on the R10000@400MHz (2 flops/cycle), that is ~550 Mflop/s
Compiling with:
-mips4 -O3 -LNO:opt=0 -OPT:reorg_common=off
(to show the effect of the compiler not performing the necessary optimizations)
gives a performance on this code of 4.6 Mflop/s
Maxwell Example - continued
Problem:
• the array dimensions are small even numbers (powers of 2) and map to the same location in both the 1st level and the 2nd level caches
In general:
primary cache: 32 KB = 2 (way-set-assoc.) * 4 (size of real) * 4096
secondary cache: 8 MB = 2 (way-set-assoc.) * 4 (size of real) * 1048576
Print the position of the arrays in memory with the code:
INTEGER*8 aEX
aEX = %LOC(EX(1,1,1))
print *,'Addr EX=',mod(aEX,4096), mod(aEX,1048576),' words'
• for the Maxwell example with NX=NY=NZ=64 the print shows:
Addr EX= 3720  470664 words
Addr EY= 3720  470664 words
…… etc. ……
Addr HZ= 3720  470664 words
All arrays map to the same locations in both caches.
• The compiler is able to pad the arrays automatically. Compiling with the default optimizations, -mips4 -O3, gives a performance of 162 Mflop/s.
Dangers of Array Padding
• The compiler will automatically pad local data
• -O3 optimization will automatically pad common blocks
• Padding of common blocks is safe as long as the Fortran standard is not violated; the example below violates the standard by writing past the end of A into B:
SUBROUTINE SUB
COMMON // A(512,512), B(512,512)
DO I=1, 2*512*512
A(I) = 0.0
ENDDO
END
• Fix the violation, or do not use this optimization, either by compiling with lower optimization or by using the explicit compiler flag:
-OPT:reorg_common=off
Variable Length Arrays (VLA)
The SGI compiler supports variable length arrays in C and Fortran.
• It is standard in F90 and an SGI extension in F77:
SUBROUTINE NAME1(N,M)
DIMENSION R(N,M)
……… etc. …
END
These arrays are created on the stack, as opposed to a location in a static area.
• In C it is an SGI extension:
void name1(int m, int n){
double r[m][n][n+m];
…… etc. …..
}
• VLAs are very handy as scratch arrays, since they are created each time execution enters the subroutine and destroyed at exit
• Unlike static arrays, VLAs allow for proper aliasing and alignment considerations by the compiler
Loop Unrolling
Loop unrolling: perform multiple loop iterations at the same time.
Original loop:
DO I=1,N,1
…(I)…
ENDDO
C*$* unroll(p)
p = 0: default unrolling
p = 1: no unrolling
p = UNROLL: unroll by that factor
Unrolled loop & cleanup:
DO I=1,N-mod(N,UNROLL),UNROLL
…(I)…
…(I+1)…
…(I+2)…
…(I+UNROLL-1)…
ENDDO
DO I=N-mod(N,UNROLL)+1,N
…(I)…
ENDDO
Advantages of loop unrolling:
• more opportunities for super-scalar code
• more data re-use & pseudo-prefetch
• exploits the presence of cache lines
• reduction in loop overhead (minor)
NOTE: inner loops should “never” be unrolled by hand:
• the compiler will typically unroll the inner loop by the amount necessary for SWP
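As a concrete illustration of the mechanics (a minimal C sketch, not from the slides; the compiler normally does exactly this for inner loops):

/* 4-way manual unrolling with a cleanup loop, for a simple sum. */
double sum_unrolled(const double *b, int n) {
    double a = 0.0;
    int i;
    /* main loop: 4 iterations at a time, stop before overshooting n */
    for (i = 0; i + 3 < n; i += 4) {
        a += b[i];
        a += b[i + 1];
        a += b[i + 2];
        a += b[i + 3];
    }
    /* cleanup loop: the remaining 0-3 elements */
    for (; i < n; i++)
        a += b[i];
    return a;
}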
Prefetch Data from Memory
Reordering the instructions in an unrolled loop leads to an effective (pseudo) prefetch of the data.
Unrolled loop:
for(i=0; i<n; i+=4){
a += b[i+0];
a += b[i+1];
a += b[i+2];
a += b[i+3];
}
Reordered loop:
for(i=0; i<n; i+=4){
t = b[i+3];
a += b[i+0];
a += b[i+1];
a += b[i+2];
a += t;
}
Loading b[i+3] first touches the cache line early, so the line is being fetched while the other three additions execute.
• no instruction overhead; the compiler does this optimization automatically.
Explicit (manual) prefetch from memory:
• prefetch into the 1st level cache should be done in the form of pseudo-prefetch
• the compiler will insert prefetches into the 2nd level cache automatically (LNO)
• manual prefetch into the 2nd level cache can be done with a compiler directive:
c*$* prefetch_ref=a(1)
c*$* prefetch_ref=a(1+16)
do I=1,n
c*$* prefetch_ref=a(I+32),stride=16,kind=rd
sum = sum + a(I)
enddo
• the same can be done in C with the corresponding #pragma directive
Outer Loop Unrolling
Original loop:
DO I=1,N
DO J=1,N
A(I)=A(I)+B(I,J)*C(J)
ENDDO ENDDO
Problem:
– A(I) is constant for the inner loop over J
– C(J) is traversed on each I iteration
– B(I,J) is traversed poorly (large stride)
Outer loop unrolled by 4:
DO I=1,N,4
DO J=1,N
A(I+0)=A(I+0)+B(I+0,J)*C(J)
A(I+1)=A(I+1)+B(I+1,J)*C(J)
A(I+2)=A(I+2)+B(I+2,J)*C(J)
A(I+3)=A(I+3)+B(I+3,J)*C(J)
ENDDO ENDDO
Unrolling the outer loop loads the complete 1st level cache line of B into the registers -> data re-use.
• the unroll factor should match the cache line size
• mostly a 1st level cache optimization
• if the data fits into the 2nd level cache, this is a good optimization to use
-LNO:outer_unroll=n
Blocking for Cache (tiling)
Blocking for cache:
• an optimization that applies to data sets that do not fit into the (2nd level) data cache
• a way to increase spatial locality of reference (i.e. exploit full cache lines)
• a way to increase temporal locality of reference (i.e. to improve data re-use)
• it is beneficial mostly with multi-dimensional arrays
Original loop:
DO I=1,N
…. (I) ….
ENDDO
Blocked loop (the inner loop is traversed only in a range of nb at a time):
DO i1=1,N,nb
DO I=i1,min(i1+nb-1,N)
…. (I) ….
ENDDO ENDDO
-LNO:blocking=[on|off] (default on)
-LNO:blocking_size=n1,n2 (for L1 and L2)
By default LNO assumes L1=32KB and L2=1MB; use -LNO:cs2=8M to specify the 8MB L2 cache.
Blocking: Example
The following loop nest:
for(i=0; i<n; i++)
for(j=0; j<m; j++)
x[i][j] = y[i] + z[j];
• x[i][j] is traversed in storage order
• y[i] is loop invariant in the inner loop
• z[j] is traversed sequentially, and is reused for each i iteration
• changing the loop order is not beneficial in this case
• for large m the array z will not be reused from the cache
Blocking the loops for cache:
for(it=0; it<n; it += nb)
for(jt=0; jt<m; jt += nb)
for(i=it; i<min(it+nb,n); i++)
for(j=jt; j<min(jt+nb,m); j++)
x[i][j] = y[i] + z[j];
• nb elements of the z array are brought into the cache and reused nb times before moving on to the next tile
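A self-contained C rendering of the blocked nest (a sketch: the MIN macro, the flattened indexing and the function signature are local choices, not from the slides):

#define MIN(a, b) ((a) < (b) ? (a) : (b))
/* x is an n x m array stored row-wise; nb is the tile size (>= 1). */
void blocked(int n, int m, int nb,
             double *x, const double *y, const double *z) {
    for (int it = 0; it < n; it += nb)
        for (int jt = 0; jt < m; jt += nb)
            for (int i = it; i < MIN(it + nb, n); i++)
                for (int j = jt; j < MIN(jt + nb, m); j++)
                    x[i * m + j] = y[i] + z[j];   /* x[i][j], flattened */
}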
Loop Fusion
Loop fusion (merging two or more loops together):
• fusing loops that refer to the same data enhances temporal locality
• a larger loop body allows more effective scalar optimizations
Example - original loops:
for(i=0; i<n; i++)
a[i] = b[i] + 1;
for(i=0; i<n; i++)
c[i] = a[i]/2;
for(i=0; i<n; i++)
d[i] = 1/c[i+1];
Fused loops:
for(i=0; i<n; i++){
a[i] = b[i] + 1;
c[i] = a[i]/2;
}
for(i=0; i<n; i++)
d[i] = 1/c[i+1];
Fusing all three loops with loop peeling:
a[0] = b[0] + 1;
c[0] = a[0]/2;
for(i=1; i<n; i++){
a[i] = b[i] + 1;
c[i] = a[i]/2;
d[i-1] = 1/c[i];
}
d[n-1] = 1/c[n];
-LNO:fusion=[0,1,2] (default 1)
• loop peeling can be used to break data dependencies when fusing loops
• sometimes temporary arrays can be replaced by scalars (this optimization has to be done manually; see the sketch below)
• the compiler will attempt to fuse loops only if they are adjacent, i.e. there is no code between the loops to be fused
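A minimal sketch of the manual scalar replacement (assuming a[] is not needed after the loops, which the slide does not state):

void fused_scalar(const double *b, double *c, double *d, int n) {
    double t = b[0] + 1;       /* peeled first iteration */
    c[0] = t / 2;
    for (int i = 1; i < n; i++) {
        t = b[i] + 1;          /* the scalar t replaces the array a[i] */
        c[i] = t / 2;
        d[i - 1] = 1 / c[i];
    }
    d[n - 1] = 1 / c[n];       /* requires c[n] to be a valid element */
}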
Loop Fusion in Array Assignments
Loop fusion is instrumental in generating good F90 code.
F90 code sequence:
A(1:N) = B(1:N)+1
C(1:N) = A(1:N)/2
D(1:N) = 1/C(2:N+1)
The compiler will typically generate the following instruction sequence:
Allocate T(1:N)
DO I=1,N
T(I)=B(I)+1
ENDDO
DO I=1,N
A(I) = T(I)
ENDDO
DO I=1,N
T(I)= A(I)/2
ENDDO
DO I=1,N
C(I) = T(I)
ENDDO
DO I=1,N
T(I)=1/C(I+1)
ENDDO
DO I=1,N
D(I) = T(I)
ENDDO
The compiler can optimize this loop sequence by fusion:
• for that, all assignments (loops) should be adjacent
• preserving the data dependencies, this can be fused into:
DO I=1,N
A(I) = B(I)+1
C(I) = A(I)/2
ENDDO
DO I=1,N
D(I) = 1/C(I+1)
ENDDO
Further peeling to break data dependencies will merge the two remaining loops.
• for this optimization to work automatically, no code should be placed between the array assignments, so that the assignments are adjacent
Loop Fission
Loop fission (splitting), or loop distribution:
• improves memory locality by splitting out loops that refer to different independent arrays
Original loop:
for(i=1; i<n; i++){
a[i] = a[i] + b[i-1];
b[i] = c[i-1]*x + y;
c[i] = 1/b[i];
d[i] = sqrt(c[i]);
}
Distributed loops:
for(i=0; i<n-1; i++){
b[i+1] = c[i]*x + y;
c[i+1] = 1/b[i+1];
}
for(i=0; i<n-1; i++)
a[i+1] = a[i+1] + b[i];
for(i=0; i<n-1; i++)
d[i+1] = sqrt(c[i+1]);
-LNO:fission=[0,1,2] (default 1)
0 - no fission
1 - normal fission
2 - fission tried before fusion; attempts to distribute inner loops
LNO: Gather-Scatter
Special form of loop fission:
• if the loop to be optimized contains conditional execution, it is often faster to evaluate all the conditions first.
subroutine fred(a,b,c,n)
real*8 a(n), b(n), c(n)
do I=1,n
if(c(I) .gt. 0) then
a(I) = c(I)/b(I)
c(I) = c(I)*b(I)
b(I) = 2*b(I)
endif
enddo
end
Transformed code (conditional execution removed from the work loop):
inc_0 = 0
do I=1,n
deref_gs(inc_0+1) = I
if(c(I) .gt. 0) then
inc_0 = inc_0 + 1
endif
enddo
do ind_0=0,inc_0-1
I=deref_gs(ind_0+1)
a(I) = c(I)/b(I)
c(I) = c(I)*b(I)
b(I) = 2*b(I)
enddo
end
• the computationally intensive loop runs only over the indices for which the condition was true and can be better optimized (SWP)
• LNO will not evaluate nested IF conditions unless -LNO:gather_scatter=2 is used
LNO: Vector Intrinsics
Most intrinsics have their “vector” equivalents. The compiler will automatically substitute vector intrinsics where legal, when the functions are invoked in a loop:
SUBROUTINE VFRED(A,N)
REAL*8 A(N)
DO I=1,N
A(I) = A(I) + COS(A(I))
ENDDO
END
is transformed into:
CALL VCOS$(A(1),DEREF_SE1_F8(1),%VAL(N-1),%VAL(1),%VAL(1))
DO I=1,N
A(I) = A(I) + DEREF_SE1_F8(I)
ENDDO
Vector intrinsics are faster if N>10, for most intrinsics.
• vector intrinsics have different precision rules (1 or 2 ulp less precision)
• illegal arguments cannot be trapped with the vector intrinsics
• -LNO:vintr=off disables the generation of vector intrinsics
Vector Intrinsics: Performance
(figure: vector intrinsics performance)
Data Dependence in Loops
In loops, each statement can be executed many times.
• loop carried data dependence
– dependence between statements in different iterations
• loop independent data dependence
– dependence between statements in the same iteration
• lexically forward dependence:
– the source precedes the target lexically
• lexically backward dependence:
– the opposite of the above
• within a statement, the right-hand side of an assignment precedes the left-hand side
Example:
for( i=2; i<9; i++){
x[i] = y[i] + z[i];      (S2)
a[i] = x[i-1] + 1;       (S3)
}
Unroll to analyze: S3 reads x[i-1], which S2 wrote in the previous iteration, so there is a loop carried, lexically forward dependence from S2 to S3 (distance 1).
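Writing out two consecutive iterations makes the dependence visible (a worked sketch of the "unroll to analyze" step; the wrapper function is just for illustration):

void unrolled_two_iterations(double *x, double *a,
                             const double *y, const double *z) {
    /* iteration i=2 */
    x[2] = y[2] + z[2];   /* S2 writes x[2] */
    a[2] = x[1] + 1;      /* S3 reads x[1] (set before the loop) */
    /* iteration i=3 */
    x[3] = y[3] + z[3];   /* S2 writes x[3] */
    a[3] = x[2] + 1;      /* S3 reads x[2], written by S2 one iteration
                             earlier: loop carried, lexically forward */
}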
Specifying the Dependency Rules
SUBROUTINE DAXPYI(N,X,K,A)
INTEGER N,K
REAL*8 X(N),A
DO I=1,N
X(K+I) = X(K+I) + A*X(I)
ENDDO
END
If K>N there is no dependency; if K<N there is a dependency. The value of K is unknown to the compiler, thus the compiler must assume dependencies.
Compiler schedules:
K<N (dependence): 14% of peak
K>N (no dependence): 33% of peak
The ivdep directive can be used to communicate the data dependency rules to the compiler:
SUBROUTINE DAXPYI(N,X,K,A)
INTEGER N,K
REAL*8 X(N),A
cdir$ ivdep
DO I=1,N
X(K+I) = X(K+I) + A*X(I)
ENDDO
END
IVDEP = Ignore Vector DEPendency
The IVDEP Directive
With indexed addressing IVDEP is the only way to specify
no data dependencies to the compiler:
void update(int n, float *a, float *b,
int *indx, float s)
{
int i;
#pragma ivdep
for(i=0; i<n; i++)
a[indx[i]] += s*b[i];
}
• here ivdep means that the integer values stored in the indx array are all different, i.e. indx is a permutation array
• assuming no data dependencies produces faster code, because the compiler has fewer constraints on ordering the load-store instructions
The IVDEP directive to the compiler is not part of the language and its interpretation is not standardized.
Three Types of IVDEP Directive
The IVDEP directive to the compiler is not part of any language and its interpretation is not standardized.
CDIR$ IVDEP
DO I=1,N
A(INDEX(1,I)) = B(I)
A(INDEX(2,I)) = C(I)
ENDDO
• Default (SGI) interpretation:
– A and B and C are independent; this breaks both the lexically forward (i+k) and the backward (i-k) dependencies, i.e. it assumes:
• index(1,i) != index(1,j)
• index(2,i) != index(2,j)
• but for some i, index(1,*) == index(2,*) is still possible
• The default interpretation can be changed with -OPT: compiler options. Possible other interpretations:
– break only lexically backward dependencies (Cray IVDEP), i.e. assume only index(*,i) != index(*,i-k) (-OPT:cray_ivdep=on)
– there are no dependencies whatsoever (liberal IVDEP, enabled with -OPT:liberal_ivdep=on)
The Argument Alias Problem
SUBROUTINE COPY(A,B,N)
REAL*8 A(N),B(N)
DO I=1,N
B(I) = A(I)
ENDDO
END
In Fortran, the compiler assumes A and B do not overlap.
void copy(double *a, double *b, int n)
{
int i;
for(i=1; i<n; i++) b[i] = a[i];
}
In C, the compiler assumes pointers a and b can point to the same addresses.
• In Fortran, it is a mistake to invoke COPY with overlapping arguments. The compiler will perform optimizations assuming A and B are not aliased over the computational range.
• In C, argument aliases are allowed. Therefore optimizations (SWP) that change the original order of loads and stores are not possible. There are several ways to remove this restriction:
– the ivdep pragma
– the compiler optimization flag: -OPT:alias=memory-access-model
– the restrict keyword
Aliases: the Optimizer Options
These options work over all of the compilation unit:
-OPT:alias=[any,typed,unnamed,restrict,disjoint]
• any is the default: any pair of memory references may be aliased.
Of the other memory access models, the most important are:
• restrict
– assume that any pair of memory references that are named differently do not point to the same regions in memory:
float *p, *q;
*p does not alias with *q, q, p or any global variable
• disjoint
– assume the same restrictions as “restrict”; in addition, any two pointer dereferences are assumed to point to non-overlapping regions in memory:
float *p, *q;
*p does not alias with *q, q, p or any global variable
*p does not alias with **q, **p, ***q, etc.
The restrict Keyword
The Numerical C Extensions Group X3J11.1 proposed (1993) a restrict keyword as the way to specify pointer access models.
The restrict semantics:
• assume that de-referencing the qualified pointer is the only way the program can access the memory pointed to by that pointer
• loads and stores through such a pointer do not alias with any other loads and stores, except those through the same pointer
void copy(double * restrict a,
double * restrict b, int n)
{
int i;
for(i=1; i<n; i++) b[i] = a[i];
}
• in this example it is sufficient to qualify b, since only the pointers being stored through need to be qualified
• to enable the restrict keyword it is necessary to use the compiler flag (7.2 and 7.3 compilers): -LANG:restrict
Alias in Storage Allocation
Program data can be stored in memory in 2 ways:
• Storage in the global area
– memory pages are allocated statically, i.e. all data is put at a fixed (virtual) address at load time
– loading such data often takes 2 instructions, since the load instruction in MIPS is limited to a 64 KB offset:
ldadr R1,addr        #load base pointer
ldw   R2,R1+offset   #load base+offset
– COMMON block data, global data, SAVE data, malloc, mmap
– compilation with -static: all variables are allocated in the global area
• Storage on the stack
– memory pages are allocated dynamically during program execution
– each subroutine gets a new stack area for its local data
– loading data from the stack requires a single instruction:
ldw R2,TOS+offset    #load TopOfStack+offset
– local (automatic) variables, temporary storage, alloca data
• Routines called from a parallel region:
– allocate a private stack area
– variables allocated on the private stack are private
– variables in the global area are shared (aliases)
Procedure Inlining
Inlining: replace a function call by that function's source code (a sketch of the inlined result follows below).
DO I=1,N
call DO_WORK(A(I),C(I))
ENDDO
SUBROUTINE DO_WORK(X,Y)
Y=1+X*(1+X*0.5)
END
Advantages:
• increased opportunities for processor optimizations
• more opportunities for loop nest optimizations
Controlling flags:
-INLINE:list=[on|off] (default off)
-INLINE:must=sub1:never=sub2
-IPA:inline=[on|off] (default on)
Candidates for inlining are modules that:
• are “small”, i.e. not much source code
• are called very often (typically in a loop)
• do not take much time per call
Inhibitors to inlining:
• mismatched subroutine arguments (type or shape)
• no inlining across languages (e.g. Fortran calls a C subroutine)
• no static (SAVE) local variables
• no varargs routines, no recursive routines
• no functions with alternate entry points
• no nested subroutines (as in F90)
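For illustration, a sketch (not actual compiler output) of what the loop effectively becomes once DO_WORK is inlined, rendered here in C:

/* The call is replaced by the body of DO_WORK, exposing the arithmetic
   to loop optimizations and software pipelining. */
void do_work_inlined(const float *a, float *c, int n) {
    for (int i = 0; i < n; i++) {
        float x = a[i];
        c[i] = 1 + x * (1 + x * 0.5f);   /* body of DO_WORK, inlined */
    }
}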
Software Pipelining (SWP)
Software pipelining is a way to mix iterations in a loop such that all processor execution slots are filled:
• SWP is performed by the Code Generator (CG), which also unrolls the inner loop to achieve the best SWP schedule (-O3 opt level). This can be computationally intensive.
• Vector loops are well suited for SWP; short loops may run slower with SWP.
Inhibitors to SWP:
• loops with subroutine (or intrinsic) calls cannot be SWP-ed
• loops with complicated conditionals or branching
• loops that are too long cannot be software pipelined because the compiler runs out of available registers (use loop fission)
• data dependences between iterations make loops harder to SWP
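A hand-written C sketch of the idea (the compiler's CG does this at the instruction level on real code; this function and its names are illustrative):

/* The load for the next iteration is issued while the current
   multiply-add executes, overlapping memory and arithmetic. */
double dot_swp(const double *a, const double *b, int n) {
    double s = 0.0;
    if (n <= 0) return s;
    double a0 = a[0], b0 = b[0];       /* prologue: first loads */
    for (int i = 1; i < n; i++) {
        double a1 = a[i], b1 = b[i];   /* load for iteration i */
        s += a0 * b0;                  /* compute for iteration i-1 */
        a0 = a1; b0 = b1;
    }
    s += a0 * b0;                      /* epilogue: last compute */
    return s;
}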
Summary
• Scalar optimization:
– improving ILP by code transformation and grouping independent instructions
– improving memory access by restructuring loop nests to take better advantage of the memory hierarchy
• Compilers are good at instruction-level optimizations and loop transformations. It depends on the language, however:
– F77 is the easiest for the compiler to work with
– C is more difficult
– F90/C++ are the most complex for compiler optimizations
• The user is responsible for presenting the code in a way that allows compiler optimizations:
– don't violate the language standard
– write clean and clear code
– consider the data structures for (false) sharing and alignment
– consider the data structures for data dependencies
– use the most natural presentation of algorithms with multi-dimensional arrays
Case Study: Vector Update
Scalar Optimization Techniques
Vector Update Code
ll=0
do jj=1,nj
do ii=1,ni
ll=ll+1
res=0
do n=1,nib
na=ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
nb=n+(jj-1)*nib
ndb1=nmb1/2
naa1=nma1+na
nbb1=ndb1+nb
res=res+p(naa1)*dp(nbb1)
end do
nde1=nme1/2
lle1=nde1+ll
dp(lle1)=dp(lle1)+res
end do
end do
Profiling tells us that most of the time is spent in the inner loop over n; the update dp(lle1)=dp(lle1)+res is the net result of all the computations.
Profile: L1 cache 50 s, L2 cache 37 s, TLB 215 s, total execution 286 s.
Vector Update: Stride Analysis
do jj=1,nj
do ii=1,ni
….
do n=1,nib
na=ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
nb=n+(jj-1)*nib
ndb1=nmb1/2
naa1=nma1+na
nbb1=ndb1+nb
res=res+P(naa1)*DP(nbb1)
end do
….
• for the inner loop, the stride on array P is controlled by naa1:
naa1 = nma1+ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
• the loop index is n, therefore the stride is nra
• the stride on array DP is controlled by nbb1: nbb1 = ndb1+n+(jj-1)*nib
• therefore the stride is 1
Loop exchange consideration - stride of each array with respect to each loop index (note: nra, nib ~ 5000):
              n    ii   jj
stride on P:  nra  1    0
stride on DP: 1    0    nib
• thus ii should be the inner loop
Vector Update: Loop Interchange
To interchange the loops, they have to be properly nested:
• substitute the index expressions and eliminate the temporary variables (na, nb and ll below); res can be eliminated by placing the update in the inner loop
do jj=1,nj
do ii=1,ni
res=0
do n=1,nib
ndb1=nmb1/2
naa1=nma1+ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
nbb1=ndb1+n+(jj-1)*nib
res=res+p(naa1)*dp(nbb1)
end do
nde1=nme1/2
lle1=nde1+ii+(jj-1)*ni
dp(lle1)=dp(lle1)+res
end do
end do
• now the loops can be interchanged:
do jj=1,nj
do ii=1,ni
do n=1,nib
ndb1=nmb1/2
naa1=nma1+ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
nbb1=ndb1+n+(jj-1)*nib
nde1=nme1/2
lle1=nde1+ii+(jj-1)*ni
dp(lle1)=dp(lle1)+p(naa1)*dp(nbb1)
end do
end do
end do
Vector Update: DAXPY Form
Interchanged loops:
ndb1=nmb1/2
nde1=nme1/2
do jj=1,nj
do n=1,nib
do ii=1,ni
naa1=nma1+ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
nbb1=ndb1+n+(jj-1)*nib
lle1=nde1+ii+(jj-1)*ni
dp(lle1)=dp(lle1)+p(naa1)*dp(nbb1)
end do
end do
end do
Simplifying the indexing:
ndb1=nmb1/2
nde1=nme1/2
do jj=1,nj
do n=1,nib
naa1=nma1+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
dp_temp=dp(ndb1+n+(jj-1)*nib)
lle1=nde1+(jj-1)*ni
do ii=1,ni
dp(lle1+ii)=dp(lle1+ii)+p(naa1+ii)*dp_temp
end do
end do
end do
And hoisting the invariant index arithmetic:
ndb1=nmb1/2
nde1=nme1/2
id1 =nma1+(i-1)*nru+(l-1)*nra*nrub
do jj=1,nj
id2 = ndb1+(jj-1)*nib
lle1= nde1+(jj-1)*ni
id3 = id1
do n=1,nib
dp_temp=dp(id2+n)
do ii=1,ni
dp(lle1+ii)=dp(lle1+ii)+p(id3+ii)*dp_temp
end do
id3 = id3 + nra
end do
end do
The inner loop is now a DAXPY operation.
Vector Update: 2D Form
With a DAXPY operation in the inner loop, we should consider further optimization with outer loop unrolling and blocking.
• hand tuning was necessary: the compiler would not perform the loop interchange because in the original code the loops are not properly nested
• with the DAXPY formulation, we can consider a 2-dimensional implementation of the code:
real*8 dp(ni,nj), p(ni,nib)
ndb1=nmb1/2
nde1=nme1/2
id1 =nma1+(i-1)*nru+(l-1)*nra*nrub
do jj=nde1,nj
do n=ndb1,nib
dp_temp=dp(n,jj-nde1)
do ii=1,ni
dp(ii,jj)=dp(ii,jj)+p(ii,n)*dp_temp
end do
end do
end do
Vector Update: Compiler Optimization
Compiling the new 2D version with -O3:
• the compiler can perform the necessary loop transforms automatically:
DO tile2jj = 1, nj, 126
DO tile1ii = 1, ni, 544
DO n = 1, (nib + -3), 4
DO jj = tile2jj, MIN((nj + -1), (tile2jj + 124)), 2
mi0 = dp2(n, jj)
mi1 = dp2(n + 3, jj + 1)
mi2 = dp2(n + 2, jj + 1)
mi3 = dp2(n + 1, jj + 1)
mi4 = dp2(n, jj + 1)
mi5 = dp2(n + 1, jj)
mi6 = dp2(n + 2, jj)
mi7 = dp2(n + 3, jj)
DO ii = tile1ii, MIN((tile1ii + 543), ni), 1
dp1(ii, jj) = (dp1(ii, jj) +(p(ii, n) * mi0))
dp1(ii, jj) = (dp1(ii, jj) +(p(ii, n + 1) * mi5))
dp1(ii, jj) = (dp1(ii, jj) +(p(ii, n + 2) * mi6))
dp1(ii, jj) = (dp1(ii, jj) +(p(ii, n + 3) * mi7))
dp1(ii, jj + 1) = (dp1(ii, jj + 1) +(p(ii, n) * mi4))
dp1(ii, jj + 1) = (dp1(ii, jj + 1) +(p(ii, n + 1) * mi3))
dp1(ii, jj + 1) = (dp1(ii, jj + 1) +(p(ii, n + 2) * mi2))
dp1(ii, jj + 1) = (dp1(ii, jj + 1) +(p(ii, n + 3) * mi1))
END DO
END DO
DO wd_jj0 = jj, MIN((tile2jj + 125), nj), 1
mi8 = dp2(n, wd_jj0)
mi9 = dp2(n + 1, wd_jj0)
mi10 = dp2(n + 3, wd_jj0)
mi11 = dp2(n + 2, wd_jj0)
DO ii0 = tile1ii, MIN((tile1ii + 543), ni), 1
dp1(ii0, wd_jj0) = (dp1(ii0, wd_jj0) +(p(ii0, n) * mi8))
dp1(ii0, wd_jj0) = (dp1(ii0, wd_jj0) +(p(ii0, n + 1) * mi9))
dp1(ii0, wd_jj0) = (dp1(ii0, wd_jj0) +(p(ii0, n + 2) * mi11))
dp1(ii0, wd_jj0) = (dp1(ii0, wd_jj0) +(p(ii0, n + 3) * mi10))
END DO
END DO
END DO
DO wd_n = n, nib, 1
DO jj0 = tile2jj, MIN((nj + -1), (tile2jj + 124)), 2
mi12 = dp2(wd_n, jj0)
mi13 = dp2(wd_n, jj0 + 1)
DO ii1 = tile1ii, MIN((tile1ii + 543), ni), 1
dp1(ii1, jj0) = (dp1(ii1, jj0) +(p(ii1, wd_n) * mi12))
dp1(ii1, jj0 + 1) = (dp1(ii1, jj0 + 1) +(p(ii1, wd_n) * mi13))
END DO
END DO
DO wd_jj = jj0, MIN((tile2jj + 125), nj), 1
mi14 = dp2(wd_n, wd_jj)
DO ii2 = tile1ii, MIN((tile1ii + 543), ni), 1
dp1(ii2, wd_jj) = (dp1(ii2, wd_jj) +(p(ii2, wd_n) * mi14))
END DO
END DO
END DO
END DO
END DO
Vector Update Summary
(figure: run-time comparison of the original code and the optimized versions)