Embedded Computer Architecture
Loop Transformations
TU/e 5kk73
Henk Corporaal
Bart Mesman

Overview
• Motivation
• Loop-level transformation catalogue
  – Loop merging
  – Loop interchange
  – Loop unrolling
  – Unroll-and-jam
  – Loop tiling
• Loop transformation theory and dependence analysis

Thanks for many of the slides go to the DTSE people from IMEC and to Dr. Peter Knijnenburg († 2007) of Leiden University.

Loop Transformations
• Change the order in which the iteration space is traversed.
• Can expose parallelism, increase the available ILP, or improve memory behavior.
• Dependence testing is required to check the validity of a transformation.

Why loop transformations? Example 1: in-place mapping

  for (j=1; j<M; j++)
    for (i=0; i<N; i++)
      A[i][j] = f(A[i][j-1]);
  for (i=0; i<N; i++)
    OUT[i] = A[i][M-1];

  =>

  for (i=0; i<N; i++){
    for (j=1; j<M; j++)
      A[i][j] = f(A[i][j-1]);
    OUT[i] = A[i][M-1];
  }

[Figure: layout of array A and the OUT column in the (i,j) space]

Why loop transformations? Example 2: memory allocation

  // N cycles per loop; 2 background memory ports needed
  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (i=0; i<N; i++)
    C[i] = g(B[i]);

  =>

  // 2N cycles; 1 background + 1 foreground port needed
  for (i=0; i<N; i++){
    B[i] = f(A[i]);
    C[i] = g(B[i]);
  }

[Figure: CPU with foreground memory, external memory interface, and background memory]

Loop transformation catalogue
• merge (fusion)
  – improve locality/regularity
• interchange
  – improve locality
• bump
• extend/reduce
• body split
• reverse
  – improve regularity
• skew
• tiling
• index split
• unroll
• unroll-and-jam

Loop Merge

  for Ia = exp1 to exp2
    A(Ia)
  for Ib = exp1 to exp2
    B(Ib)

  =>

  for I = exp1 to exp2
    A(I)
    B(I)

• Improves locality
• Reduces loop overhead

Loop Merge: example of locality improvement

  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (j=0; j<N; j++)
    C[j] = f(B[j],A[j]);

  =>

  for (i=0; i<N; i++){
    B[i] = f(A[i]);
    C[i] = f(B[i],A[i]);
  }

[Figure: location vs. time plots of productions and consumptions, before and after merging]
• The consumptions of the second loop are now closer to the productions and consumptions of the first loop.
• Is this always the case after merging loops?
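To make the merge concrete, here is a minimal compilable C sketch of the merged version; the helpers f and g, the array size, and the contents are invented for illustration (the slide's two-argument f is written as a separate helper g here):

  #include <stdio.h>
  #define N 8

  static int f(int x)        { return x + 1; }  /* placeholder, invented */
  static int g(int x, int y) { return x + y; }  /* placeholder, invented */

  int main(void) {
      int A[N], B[N], C[N];
      for (int i = 0; i < N; i++) A[i] = i;

      /* merged loop: each B[i] is consumed right after it is produced,
         so its lifetime shrinks from the whole array to one iteration */
      for (int i = 0; i < N; i++) {
          B[i] = f(A[i]);
          C[i] = g(B[i], A[i]);
      }

      for (int i = 0; i < N; i++) printf("%d ", C[i]);
      printf("\n");
      return 0;
  }

Because each B[i] is dead as soon as C[i] has been computed, B could be demoted to a foreground scalar, which is exactly the memory-allocation gain of Example 2.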
Loop Merge not always allowed!
• Data dependencies from the first loop to the second loop can block a Loop Merge.
• Merging is allowed if, for every iteration I, the consumption cons(I) in loop 2 happens no earlier than the corresponding production prod(I) in loop 1.
• Enablers: Bump, Reverse, Skew

  // not allowed: B[N-1] is produced at iteration N-1 >= i
  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (i=0; i<N; i++)
    C[i] = g(B[N-1]);

  // allowed: B[i-2] is produced at iteration i-2 < i
  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (i=2; i<N; i++)
    C[i] = g(B[i-2]);

Loop Bump

  for I = exp1 to exp2
    A(I)

  =>

  for I = exp1+N to exp2+N
    A(I-N)

Loop Bump: example as enabler

  for (i=2; i<N; i++)
    B[i] = f(A[i]);
  for (i=0; i<N-2; i++)
    C[i] = g(B[i+2]);

• i+2 > i: direct merging is not possible, since we would consume B[i+2] before it is produced.

  => after Loop Bump:

  for (i=2; i<N; i++)
    B[i] = f(A[i]);
  for (i=2; i<N; i++)
    C[i-2] = g(B[i+2-2]);

• i+2-2 = i: merging is now possible.

  => after Loop Merge:

  for (i=2; i<N; i++){
    B[i] = f(A[i]);
    C[i-2] = g(B[i]);
  }
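A compilable C rendering of this bump-plus-merge sequence; f, g, and the array contents are invented for illustration, and the program checks that the merged loop reproduces the original result:

  #include <stdio.h>
  #define N 10

  static int f(int x) { return x + 1; }  /* placeholders, invented */
  static int g(int x) { return 2 * x; }

  int main(void) {
      int A[N], B[N], C1[N] = {0}, C2[N] = {0};
      for (int i = 0; i < N; i++) A[i] = i;

      /* original pair of loops: bounds differ and C consumes B[i+2] */
      for (int i = 2; i < N; i++)     B[i]  = f(A[i]);
      for (int i = 0; i < N - 2; i++) C1[i] = g(B[i + 2]);

      /* after Loop Bump (i -> i+2 in the second loop) and Loop Merge:
         B[i] is consumed in the same iteration that produces it */
      for (int i = 2; i < N; i++) {
          B[i]      = f(A[i]);
          C2[i - 2] = g(B[i]);
      }

      for (int i = 0; i < N - 2; i++)
          if (C1[i] != C2[i]) { printf("MISMATCH\n"); return 1; }
      printf("merged loop matches the original\n");
      return 0;
  }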
Loop Extend

  for I = exp1 to exp2
    A(I)

  => choose exp3 <= exp1 and exp4 >= exp2:

  for I = exp3 to exp4
    if I >= exp1 and I <= exp2
      A(I)

Loop Extend: example as enabler

  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (i=2; i<N+2; i++)
    C[i-2] = g(B[i]);

  => after Loop Extend:

  for (i=0; i<N+2; i++)
    if (i<N)
      B[i] = f(A[i]);
  for (i=0; i<N+2; i++)
    if (i>=2)
      C[i-2] = g(B[i]);

  => after Loop Merge:

  for (i=0; i<N+2; i++){
    if (i<N)
      B[i] = f(A[i]);
    if (i>=2)
      C[i-2] = g(B[i]);
  }

Loop Reduce

  for I = exp1 to exp2
    if I >= exp3 and I <= exp4
      A(I)

  =>

  for I = max(exp1,exp3) to min(exp2,exp4)
    A(I)

Loop Body Split

  for I = exp1 to exp2
    A(I)
    B(I)

  =>

  for Ia = exp1 to exp2
    A(Ia)
  for Ib = exp1 to exp2
    B(Ib)

• A(I) must be single-assignment: its elements should be written only once.

Loop Body Split: example as enabler

  for (i=0; i<N; i++){
    A[i] = f(A[i-1]);
    B[i] = g(in[i]);
  }
  for (j=0; j<N; j++)
    C[j] = h(B[j],A[N]);

  => after Loop Body Split:

  for (i=0; i<N; i++)
    A[i] = f(A[i-1]);
  for (k=0; k<N; k++)
    B[k] = g(in[k]);
  for (j=0; j<N; j++)
    C[j] = h(B[j],A[N]);

  => after Loop Merge:

  for (i=0; i<N; i++)
    A[i] = f(A[i-1]);
  for (j=0; j<N; j++){
    B[j] = g(in[j]);
    C[j] = h(B[j],A[N]);
  }

Loop Reverse

  for I = exp1 to exp2
    A(I)

  =>

  for I = exp2 downto exp1
    A(I)

  or, written differently:

  for I = exp1 to exp2
    A(exp2-(I-exp1))

[Figure: location vs. time plots; reversal mirrors the order of productions and consumptions in time]

Loop Reverse: satisfy dependencies

  for (i=0; i<N; i++)
    A[i] = f(A[i-1]);

• No loop-carried dependencies allowed! This loop cannot be reversed.
• Enabler: data-flow transformations, e.g. exploiting the associativity of the + operator:

  A[0] = …;
  for (i=1; i<=N; i++)
    A[i] = A[i-1] + f(…);
  … = A[N] …;

  =>

  A[N] = …;
  for (i=N-1; i>=0; i--)
    A[i] = A[i+1] + f(…);
  … = A[0] …;

Loop Reverse: example as enabler

  for (i=0; i<=N; i++)
    B[i] = f(A[i]);
  for (i=0; i<=N; i++)
    C[i] = g(B[N-i]);

• N-i > i (for small i): merging is not possible.

  => after Loop Reverse:

  for (i=0; i<=N; i++)
    B[i] = f(A[i]);
  for (i=0; i<=N; i++)
    C[N-i] = g(B[N-(N-i)]);

• N-(N-i) = i: merging is now possible.

  => after Loop Merge:

  for (i=0; i<=N; i++){
    B[i] = f(A[i]);
    C[N-i] = g(B[i]);
  }

Loop Interchange

  for I1 = exp1 to exp2
    for I2 = exp3 to exp4
      A(I1, I2)

  =>

  for I2 = exp3 to exp4
    for I1 = exp1 to exp2
      A(I1, I2)

Loop Interchange: index traversal

  for(i=0; i<W; i++)
    for(j=0; j<H; j++)
      A[i][j] = …;

  =>

  for(j=0; j<H; j++)
    for(i=0; i<W; i++)
      A[i][j] = …;

[Figure: traversal order of the (i,j) iteration space before and after the interchange]

Loop Interchange
• Validity: check the dependence direction vectors.
• Often used to improve cache behavior.
• The innermost loop (whose index changes fastest) should (only) appear in the right-most array index expression, in case of row-major storage as in C.
• Can improve execution time by one or two orders of magnitude.

Loop Interchange (cont'd)
• Loop interchange can also expose parallelism.
• If an inner loop does not carry a dependence (its entry in the direction vector equals '='), this loop can be executed in parallel.
• Moving this loop outwards increases the granularity of the parallel loop iterations.

Loop Interchange: example of locality improvement

  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      B[i][j] = f(A[j],B[i][j-1]);

  =>

  for (j=0; j<M; j++)
    for (i=0; i<N; i++)
      B[i][j] = f(A[j],B[i][j-1]);

• In the second loop nest:
  – A[j] is reused N times (temporal locality)
  – however, spatial locality in B is lost (assuming row-major ordering)
• => Exploration required

Loop Interchange: satisfy dependencies
• No loop-carried dependencies allowed, unless every production prod(I2) still precedes its consumption cons(I2) after the interchange.
• Compare:

  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      A[i][j] = f(A[i-1][j+1]);   // interchange not allowed

  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      A[i][j] = f(A[i-1][j]);     // interchange allowed

• Enablers:
  – Data-flow transformations
  – Loop Bump
  – Skewing

Loop Skew: basic transformation

  for I1 = exp1 to exp2
    for I2 = exp3 to exp4
      A(I1, I2)

  => with skew factor α and offset β:

  for I1 = exp1 to exp2
    for I2 = exp3+α·I1+β to exp4+α·I1+β
      A(I1, I2-α·I1-β)

Loop Skew

  for(j=0; j<H; j++)
    for(i=0; i<W; i++)
      A[i][j] = ...;

  =>

  for(j=0; j<H; j++)
    for(i=0+j; i<W+j; i++)
      A[i-j][j] = ...;

[Figure: the rectangular (i,j) iteration space becomes a parallelogram after skewing]

Loop Skew: example as enabler of regularity improvement

  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      … = f(A[i+j]);

  => after Loop Skew:

  for (i=0; i<N; i++)
    for (j=i; j<i+M; j++)
      … = f(A[j]);

  => after Loop Extend & Interchange:

  for (j=0; j<N+M; j++)
    for (i=0; i<N; i++)
      if (j>=i && j<i+M)
        … = f(A[j]);

Loop skewing as enabler for Loop Interchange

  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      A[i+1][j] = f(A[i][j+1]);

[Figure: (i,j) iteration spaces with dependence arrows, before skewing, after skewing, and after the final interchange]
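A compilable C sketch of the skew-then-interchange on the nest above, under our own derivation (skewing j by i gives j2 = j + i, turning the (<,>) dependence into (<,=) so the loops may be interchanged); N, M, f, and the equivalence check are invented for illustration:

  #include <stdio.h>
  #include <string.h>
  #define N 4
  #define M 4

  static int f(int x) { return 2 * x + 1; }  /* placeholder, invented */

  int main(void) {
      int A[N + 2][M + 2], B[N + 2][M + 2];
      for (int i = 0; i < N + 2; i++)        /* identical initial contents */
          for (int j = 0; j < M + 2; j++)
              A[i][j] = B[i][j] = i * 10 + j;

      /* original nest: dependence direction (<,>), so a direct
         interchange would be illegal */
      for (int i = 0; i < N; i++)
          for (int j = 0; j < M; j++)
              A[i + 1][j] = f(A[i][j + 1]);

      /* skewed (j2 = j + i) and then interchanged nest; the guards
         come from Loop Extend as on the regularity slide */
      for (int j2 = 0; j2 < N + M - 1; j2++)
          for (int i = 0; i < N; i++)
              if (j2 >= i && j2 < i + M)
                  B[i + 1][j2 - i] = f(B[i][j2 - i + 1]);

      printf("%s\n", memcmp(A, B, sizeof A) ? "MISMATCH" : "identical");
      return 0;
  }

Within one j2 value the inner i loop runs upwards, so the writer iteration (i-1) always executes before the reader iteration i, which is the (<,=) direction made concrete.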
Loop Tiling

  for I = 0 to exp1·exp2 - 1
    A(I)

  => with tile size exp1 and tile factor exp2:

  for I1 = 0 to exp2 - 1
    for I2 = exp1·I1 to exp1·(I1+1) - 1
      A(I2)

Loop Tiling (1-D example)

  for(i=0; i<9; i++)
    A[i] = ...;

  =>

  for(j=0; j<3; j++)
    for(i=4*j; i<4*j+4; i++)
      if (i<9)
        A[i] = ...;

[Figure: the 1-D iteration space folded into a 2-D (j,i) tile space]

Loop Tiling
• Improves cache reuse by dividing the (2-dimensional or higher) iteration space into tiles and iterating over these tiles.
• Only useful when the working set does not fit into the cache, or when there is much interference.
• Two adjacent loops can legally be tiled if they can legally be interchanged.

2-D Tiling Example

  for(i=0; i<N; i++)
    for(j=0; j<N; j++)
      A[i][j] = B[j][i];

  =>

  for(TI=0; TI<N; TI+=16)
    for(TJ=0; TJ<N; TJ+=16)
      for(i=TI; i<min(TI+16,N); i++)    // inside
        for(j=TJ; j<min(TJ+16,N); j++)  //   a tile
          A[i][j] = B[j][i];

2-D Tiling: changes the execution order
[Figure: traversal of the (i,j) iteration space, row by row before tiling and tile by tile afterwards]

What is the best tile size?
• Current tile-size selection algorithms use a cache model:
  – generate a collection of tile sizes;
  – estimate the resulting cache miss rate;
  – select the best one.
• They only take the L1 cache into account.
• They mostly do not take n-way associativity into account.

Loop Index Split

  for I = exp1 to exp2
    A(I)

  =>

  for Ia = exp1 to p
    A(Ia)
  for Ib = p+1 to exp2
    A(Ib)

Loop Unrolling
• Duplicate the loop body and adjust the loop header.
• Increases the available ILP, reduces loop overhead, and increases the opportunities for common subexpression elimination.
• Always valid!

(Partial) Loop Unrolling

  for I = exp1 to exp2
    A(I)

  => fully unrolled:

  A(exp1)
  A(exp1+1)
  A(exp1+2)
  …
  A(exp2)

  => unrolled by a factor of 2:

  for I = exp1/2 to exp2/2
    A(2·I)
    A(2·I+1)

Loop Unrolling: downside
• If the unroll factor is not a divisor of the trip count, a remainder loop has to be added (see the sketch below).
• If the trip count is not known at compile time, a runtime check is needed.
• Code size increases, which may result in a higher I-cache miss rate.
• Global determination of optimal unroll factors is difficult.
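A minimal C sketch of unrolling with a remainder loop, assuming a trip count n that is not known at compile time; the array summation is invented for illustration:

  #include <stdio.h>

  /* sum an array with a 4-way unrolled main loop plus a remainder
     loop for the iterations left over when 4 does not divide n */
  static int sum(const int *a, int n) {
      int s = 0;
      int i = 0;
      for (; i + 3 < n; i += 4) {   /* unrolled by a factor of 4 */
          s += a[i];
          s += a[i + 1];
          s += a[i + 2];
          s += a[i + 3];
      }
      for (; i < n; i++)            /* remainder: at most 3 iterations */
          s += a[i];
      return s;
  }

  int main(void) {
      int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
      printf("%d\n", sum(a, 10));   /* prints 55: 8 unrolled + 2 remainder */
      return 0;
  }

The guard i + 3 < n doubles as the runtime trip-count check mentioned above, at the cost of one extra comparison per unrolled step.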
Loop Unroll: example of removing a non-affine iterator

  for (L=1; L < 4; L++)
    for (i=0; i < (1<<L); i++)
      A(L,i);

  => after fully unrolling the outer loop:

  for (i=0; i < 2; i++)
    A(1,i);
  for (i=0; i < 4; i++)
    A(2,i);
  for (i=0; i < 8; i++)
    A(3,i);

Unroll-and-Jam
• Unroll the outer loop and fuse the new copies of the inner loop.
• Increases the size of the loop body and hence the available ILP.
• Can also improve locality.

Unroll-and-Jam Example

  for (i=0; i<N; i++)
    for (j=0; j<N; j++)
      A[i][j] = B[j][i];

  => after unrolling the outer loop:

  for (i=0; i<N; i=i+2){
    for (j=0; j<N; j++)
      A[i][j] = B[j][i];
    for (j=0; j<N; j++)
      A[i+1][j] = B[j][i+1];
  }

  => after jamming (fusing) the inner loops:

  for (i=0; i<N; i+=2)
    for (j=0; j<N; j++){
      A[i][j] = B[j][i];
      A[i+1][j] = B[j][i+1];
    }

• More ILP exposed
• Spatial locality of B enhanced

Simplified loop transformation script
• Give all loops the same nesting depth
  – use dummy 1-iteration loops if necessary
• Improve regularity
  – usually by applying loop interchange or reverse
• Improve locality
  – usually by applying loop merge
  – break data dependencies with loop bump/skew
  – sometimes loop index split or unrolling is easier

Loop transformation theory
• Iteration spaces
• Polyhedral model
• Dependence analysis

Technical Preliminaries (1)

  do i = 2, N
    do j = i, N
      x[i,j] = 0.5*(x[i-1,j] + x[i,j-1])
    enddo
  enddo                                    (1)

• perfect loop nest
• iteration space
• dependence vectors
• address expression, e.g. for the access x[i-1,j]:

  [ 1 0 ] [ i ]   [ -1 ]
  [ 0 1 ] [ j ] + [  0 ]

[Figure: triangular (i,j) iteration space for i,j = 2..4, with dependence arrows between iterations]

Technical Preliminaries (2)

Switch the loop indices:

  do m = 2, N
    do n = 2, m
      x[n,m] = 0.5*(x[n-1,m] + x[n,m-1])
    enddo
  enddo                                    (2)

This is the affine transformation

  [ m ]   [ 0 1 ] [ i ]                 [ i ]   [ 0 1 ] [ m ]
  [ n ] = [ 1 0 ] [ j ]   and inversely [ j ] = [ 1 0 ] [ n ]

Polyhedral Model
• A polyhedron is a set {x : Ax ≤ c} for some matrix A and bounds vector c.
• Polyhedra (or polytopes) are objects in a many-dimensional space without holes.
• Iteration spaces of loops (with unit stride) can be represented as polyhedra.
• Array accesses and loop transformations can be represented as matrices.

Iteration Space
• A loop nest is represented as B·I ≤ b, for iteration vector I.
• Example:

  for(i=0; i<10; i++)
    for(j=i; j<10; j++)

  [ -1  0 ]          [ 0 ]
  [  1  0 ]  [ i ]   [ 9 ]
  [  1 -1 ]  [ j ] ≤ [ 0 ]
  [  0  1 ]          [ 9 ]

Array Accesses
• Any array access A[e1][e2], for linear index expressions e1 and e2, can be represented by an access matrix and an offset vector: A·I + a.
• This can be considered as a mapping from the iteration space into the storage space of the array (which is a trivial polyhedron).

Unimodular Matrices
• A unimodular matrix T is a matrix with integer entries and determinant ±1.
• This means that such a matrix maps an object onto another object with exactly the same number of integer points in it.
• Its inverse T⁻¹ always exists and is unimodular as well.

Types of Unimodular Transformations
• Loop interchange
• Loop reversal
• Loop skewing, for an arbitrary skew factor
• Since unimodular matrices are closed under multiplication, any combination of these is again a unimodular transformation.

Application
• Apply T to the iteration vector: I' = T·I. The transformed loop nest is given by B·T⁻¹·I' ≤ b.
• Any array access matrix A is transformed into A·T⁻¹ (the access becomes A·T⁻¹·I' + a).
• The transformed loop nest has to be normalized by means of Fourier-Motzkin elimination, to ensure that every loop bound is an affine expression in the more outer loop indices.
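A minimal C sketch of applying a unimodular transformation, using the skew T = [1 0; 1 1], i.e. (i2, j2) = (i, i + j); the bounds of the transformed nest are derived by hand here rather than by Fourier-Motzkin, and N, M, and the visit check are invented for illustration:

  #include <stdio.h>
  #define N 3
  #define M 4

  int main(void) {
      int visits[N][M] = {{0}};

      /* original space: 0 <= i < N, 0 <= j < M.
         Transformed space: substituting the inverse
         (i, j) = (i2, j2 - i2) into the original bounds gives
         0 <= i2 < N and i2 <= j2 < i2 + M */
      for (int i2 = 0; i2 < N; i2++)
          for (int j2 = i2; j2 < i2 + M; j2++)
              visits[i2][j2 - i2]++;   /* body runs on original (i, j) */

      /* a unimodular map is a bijection on integer points, so every
         original iteration must be visited exactly once */
      for (int i = 0; i < N; i++)
          for (int j = 0; j < M; j++)
              if (visits[i][j] != 1) { printf("MISMATCH\n"); return 1; }
      printf("all %d iterations visited exactly once\n", N * M);
      return 0;
  }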
Dependence Analysis

Consider the following statements:

  S1: a = b + c;
  S2: d = a + f;
  S3: a = g + h;

• S1 → S2: true or flow dependence = RaW
• S2 → S3: anti-dependence = WaR
• S1 → S3: output dependence = WaW

Dependences in Loops
• Consider the following loop:

  for(i=0; i<N; i++){
    S1: a[i] = …;
    S2: b[i] = a[i-1];
  }

• There is a loop-carried dependence S1 → S2.
• We need to detect whether there exist iterations i and i' such that i = i'-1 in the loop space.

Definition of Dependence
• There exists a dependence if there are two statement instances that refer to the same memory location and (at least) one of them is a write.
• There should not be a write to that location between these two statement instances.
• In general, it is undecidable whether a dependence exists.

Direction of Dependence
• If there is a flow dependence between two statements S1 and S2 in a loop, then S1 writes to a variable in an earlier iteration than the one in which S2 reads that variable.
• The iteration vector of the write is lexicographically less than the iteration vector of the read:

  I ≺ I'  iff  i1 = i'1, …, i(k-1) = i'(k-1), and ik < i'k, for some k.

Direction Vectors
• A direction vector is a vector

  (=, =, …, =, <, *, *, …, *)

  where * can denote =, <, or >.
• Such a vector encodes a (collection of) dependences.
• A loop transformation should result in a new direction vector for the dependence that is also lexicographically positive.

Loop Interchange
• Interchanging two loops also interchanges the corresponding entries in a direction vector.
• Example: if the direction vector of a dependence is (<,>), then we may not interchange the loops, because the resulting direction would be (>,<), which is lexicographically negative.

Affine Bounds and Indices
• We assume that loop bounds and array index expressions are affine expressions:

  a0 + a1·i1 + … + ak·ik

• Each loop bound for loop index ik is an affine expression over the previous loop indices i1 to i(k-1).
• Each array index expression is an affine expression over all loop indices.

Non-Affine Expressions
• Index expressions like i*j cannot be handled by dependence tests; we must conservatively assume that a dependence exists.
• An important class of index expressions are indirections A[B[i]]. These occur frequently in scientific applications (sparse matrix computations).
• In embedded applications???

Linear Diophantine Equations
• A linear diophantine equation has the form

  a1·x1 + a2·x2 + … + an·xn = c

• The equation has an integer solution iff gcd(a1,…,an) is a divisor of c.

GCD Test for Dependence
• Assume a single loop and two references A[a+b·i] and A[c+d·i].
• If there exists a dependence, then gcd(b,d) divides (c-a).
• Note the direction of the implication!
• If gcd(b,d) does not divide (c-a), then there exists no dependence.

GCD Test (cont'd)
• However, if gcd(b,d) does divide (c-a), then there might exist a dependence.
• The test is not exact, since it does not take the loop bounds into account. For example, it cannot disprove the (non-existent) dependence between A[i] and A[i+10] in:

  for(i=0; i<10; i++)
    A[i] = A[i+10] + 1;
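A minimal C sketch of this single-loop GCD test; the Euclidean gcd helper and the two example reference pairs are invented for illustration:

  #include <stdio.h>
  #include <stdlib.h>

  /* Euclidean algorithm */
  static int gcd(int x, int y) {
      while (y != 0) { int t = x % y; x = y; y = t; }
      return abs(x);
  }

  /* GCD test for references A[a + b*i] and A[c + d*i]:
     returns 0 if a dependence is impossible, 1 if one MAY exist */
  static int gcd_test(int a, int b, int c, int d) {
      int g = gcd(b, d);
      if (g == 0)                  /* both subscripts are constants */
          return a == c;
      return (c - a) % g == 0;
  }

  int main(void) {
      /* A[2i] vs A[2i+1]: gcd(2,2)=2 does not divide 1 -> independent */
      printf("%s\n", gcd_test(0, 2, 1, 2) ? "may depend" : "independent");
      /* A[i] vs A[i+10]: gcd(1,1)=1 divides 10 -> the test cannot
         disprove it, even though the bounds 0..9 rule it out */
      printf("%s\n", gcd_test(0, 1, 10, 1) ? "may depend" : "independent");
      return 0;
  }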
GCD Test (cont'd)
• Using the theorem on linear diophantine equations, we can test in arbitrary loop nests.
• We need one test for each direction vector: a vector (=, =, …, =, <, …) implies that the first k indices of the source and the sink are the same.
• See the book by Zima for details.

Other Dependence Tests
• There exist many dependence tests:
  – Separability test
  – GCD test
  – Banerjee test
  – Range test
  – Fourier-Motzkin test
  – Omega test
• Exactness increases, but so does the cost.

Fourier-Motzkin Elimination
• Consider a collection of linear inequalities over the variables i1,…,in:

  e1(i1,…,in) ≤ e1'(i1,…,in)
  …
  em(i1,…,in) ≤ em'(i1,…,in)

• Is this system consistent, i.e., does there exist a solution?
• FM-elimination can determine this.

FM-Elimination (cont'd)
• First, form all pairs of a lower and an upper bound on in:

  L(i1,…,i(n-1)) ≤ in  and  in ≤ U(i1,…,i(n-1))

  These pairs characterize the solutions for in.
• Then create a new system consisting of

  L(i1,…,i(n-1)) ≤ U(i1,…,i(n-1))

  for each such pair, together with all original inequalities not involving in.
• This new system has one variable less, and we continue this way.

FM-Elimination (cont'd)
• After eliminating i1, we end up with a collection of inequalities between constants, c ≤ c'.
• The original system is consistent iff every such inequality can be satisfied, i.e., there is no inequality like 10 ≤ 3.
• Note that exponentially many new inequalities may be generated!

Fourier-Motzkin Test
• The loop bounds plus the array index expressions generate a system of inequalities, using new loop indices i' for the sink of the dependence.
• Each direction vector generates additional (in)equalities:

  i1 = i1', …, i(k-1) = i(k-1)', ik < ik'

• If all these systems are inconsistent, then there exists no dependence.
• This test is not exact (there may be real solutions but no integer ones), but it is almost exact.

N-Dimensional Arrays
• Test each dimension separately.
• This can introduce another level of inaccuracy.
• Some tests (the FM test and the Omega test) can test many dimensions at the same time.
• Otherwise, you can linearize an array: transform a logically N-dimensional array into its one-dimensional storage format.

Hierarchy of Tests
• Try a cheap test first, then more expensive ones:

  if (cheap test1 = NO)
    then print 'NO'
  else if (test2 = NO)
    then print 'NO'
  else if (test3 = NO)
    then print 'NO'
  else …

Practical Dependence Testing
• Cheap tests, like the GCD and Banerjee tests, can disprove many dependences.
• Adding expensive tests only disproves a few more possible dependences.
• A compiler writer needs to trade off compilation time against the accuracy of dependence testing.
• For time-critical applications, expensive tests like the Omega test (exact!) can be used.
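As a small worked illustration of FM-elimination, here is the iteration-space example from earlier, for(i=0; i<10; i++) for(j=i; j<10; j++), written out by hand (the derivation and notation are ours, not the deck's):

  \begin{align*}
  &\text{System: } 0 \le i,\quad i \le 9,\quad i \le j,\quad j \le 9.\\
  &\text{Eliminate } j:\ \text{pair the lower bound } i \le j
    \text{ with the upper bound } j \le 9 \;\Rightarrow\; i \le 9.\\
  &\text{New system: } 0 \le i,\quad i \le 9,\quad i \le 9.\\
  &\text{Eliminate } i:\ \text{pair } 0 \le i \text{ with } i \le 9
    \;\Rightarrow\; 0 \le 9.\\
  &0 \le 9 \text{ is satisfied, so the system is consistent: the polytope
    is non-empty and the nest executes at least one iteration.}
  \end{align*}

In a dependence test, the same machinery runs on the combined system of bounds, index equalities, and direction-vector constraints; a contradiction such as 10 ≤ 3 in the final constant inequalities proves independence.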