Array Dependence Analysis with the Chains of Recurrences Framework for Loop Optimization

Robert van Engelen
Florida State University
Also thanks to J. Birch, Y. Shou, and K. Gallivan
NCSU 2/24/06

Outline

- Motivation
- Restructuring compilers
- Chains of recurrences algebra and associated algorithms for the GCC and Polaris compilers
- Nonlinear array dependence testing for loop restructuring and vectorization
- Experimental results
- Conclusions

Motivation

- Intel CTO: “the increased power requirements of newer chips will lead to CPUs that are hotter than the surface of the sun by 2010”
- Enter multi-core CPUs
  - Increase the overall system speed by adding CPU cores
  - Speed up multi-threaded applications
  - Can effectively lower the power consumption
- Enter (more?) multi-media extensions
  - Vector-like instruction sets: MMX, SSE, AltiVec
  - Speed up multi-media codes, such as JPEG, MPEG

Code Optimization by Hand or Automatic?

- Rewriting applications by hand to exploit parallelism is doable, if:
  - Tasks can be identified that run independently, such as a Web browser’s rendering and communications tasks
  - The parallelism is coarse-grain: tasks must have sufficient work
- Rewriting applications by hand to exploit lots of fine-grain parallelism is not doable
  - Thousands of read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW) data dependences must be analyzed

Restructuring Compilers

- A restructuring compiler typically applies source-code transformations automatically to meet various performance enhancement criteria:
  - Exploit parallelism in loops by reordering the loop structure to run loop iterations in parallel
  - Find small loops to replace with vector instructions
  - Optimize data locality by reordering code to change the memory access order and improve cache use
- All code changes are safe as long as RAW, WAR, and WAW data dependences are preserved!
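
As a rough illustration of the three dependence kinds named above, the following Python sketch classifies the dependences between two statements from their read and write sets. The function name `classify_dependences` and the set-based statement model are assumptions made for this example. Every memory location is treated as a plain name, whereas array references need the subscript analysis that the rest of the talk is about.

```python
def classify_dependences(first, second):
    """Classify dependences from `first` to `second`.

    Each statement is a dict with 'reads' and 'writes' sets of location names
    (a hypothetical encoding), and `first` executes before `second`.
    """
    kinds = set()
    if first["writes"] & second["reads"]:
        kinds.add("RAW")   # second reads what first wrote (flow dependence)
    if first["reads"] & second["writes"]:
        kinds.add("WAR")   # second overwrites what first read (anti dependence)
    if first["writes"] & second["writes"]:
        kinds.add("WAW")   # both write the same location (output dependence)
    return kinds


# Two scalar statements: t = b + c, followed by d = t * 2
producer = {"reads": {"b", "c"}, "writes": {"t"}}
consumer = {"reads": {"t"}, "writes": {"d"}}
print(classify_dependences(producer, consumer))
```

A transformation may reorder the two statements only if the returned set is empty.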

Example: Loop Fission

- Loop fission splits a single loop into multiple loops
- Allows vectorization and parallelization of the new loops when the original loop was sequential
- Loop fission must preserve all dependence relations of the original loop

Original loop nest, with dependence S3 δ(=,<) S4:

    S1 DO I = 1, 10
    S2   DO J = 1, 10
    S3     A(I,J) = B(I,J) + C(I,J)
    S4     D(I,J) = A(I,J-1) * 2.0
    S5   ENDDO
    S6 ENDDO

After fission of the J-loop, S3 δ(=,<) S4 still holds:

    S1 DO I = 1, 10
    S2   DO J = 1, 10
    S3     A(I,J) = B(I,J) + C(I,J)
    Sx   ENDDO
    Sy   DO J = 1, 10
    S4     D(I,J) = A(I,J-1) * 2.0
    S5   ENDDO
    S6 ENDDO

After vectorization and parallelization, S3 δ(=,<) S4 still holds:

    S1 PARALLEL DO I = 1, 10
    S3   A(I,1:10) = B(I,1:10) + C(I,1:10)
    S4   D(I,1:10) = A(I,0:9) * 2.0
    S6 ENDDO

Loop Fission: Algorithm

    S1 DO I = 1, 10
    S2   A(I) = A(I) + B(I-1)
    S3   B(I) = C(I-1)*X + Z
    S4   C(I) = 1/B(I)
    S5   D(I) = sqrt(C(I))
    S6 ENDDO

Dependences:

    S3 δ(<) S2
    S4 δ(<) S3
    S3 δ(=) S4
    S4 δ(=) S5

[Figure: dependence graph over S2, S3, S4, and S5, and its acyclic condensation, in which the cycle between S3 and S4 forms a single node]

Compute the acyclic condensation of the dependence graph to find a legal order of the loops:

    DO I = 1, 10
      B(I) = C(I-1)*X + Z
      C(I) = 1/B(I)
    ENDDO
    A(1:10) = A(1:10) + B(0:9)
    D(1:10) = sqrt(C(1:10))

Example: Loop Interchange

- Changes the loop nesting order
- Allows vectorization of an outer loop and more effective parallelization of an inner loop
- Can be used to improve spatial locality
- Loop interchange must preserve all dependence relations of the original loop

Original loop nest, with dependence S3 δ(=,<) S3:

    S1 DO I = 1, N
    S2   DO J = 1, M
    S3     A(I,J) = A(I,J-1) + B(I,J)
    S4   ENDDO
    S5 ENDDO

After interchange, the dependence becomes S3 δ(<,=) S3:

    S2 DO J = 1, M
    S1   DO I = 1, N
    S3     A(I,J) = A(I,J-1) + B(I,J)
    S4   ENDDO
    S5 ENDDO

After vectorization of the inner loop, S3 δ(<,=) S3:

    S2 DO J = 1, M
    S3   A(1:N,J) = A(1:N,J-1) + B(1:N,J)
    S5 ENDDO

Loop Interchange: Algorithm

    S1 DO I = 1, N
    S2   DO J = 1, M
    S3     DO K = 1, L
    S4       A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)
    S5     ENDDO
    S6   ENDDO
    S7 ENDDO

Dependences:

    S4 δ(<,<,=) S4
    S4 δ(<,=,>) S4

Direction matrix (one row per dependence; columns I, J, K):

    < < =
    < = >

Compute the direction matrix and find which columns (and therefore which loops) can be permuted without violating dependence relations in the original loop nest. A permutation is invalid when the leftmost non-"=" entry of some row becomes ">".

    Columns permuted to (J, K, I):       Columns permuted to (J, I, K):
        < = <                                < < =
        = > <      Invalid                   = < >      Valid

Complications

- Loop restructuring is complicated by:
  - The presence of several induction variables
  - Nonlinear and symbolic array index expressions
  - The use of pointer arithmetic instead of arrays in C
  - Non-unit loop strides and unstructured loops
  - Control flow
- Need loop normalization and preprocessing
  - Apply induction variable substitution
  - Convert pointer dereferences to array accesses
  - Normalize the loop iteration space

Induction Variable Substitution

Example loop:

    I = 0
    J = 1
    while (I<N)
      I = I+1
      … = A[J]
      J = J+2
      K = 2*I
      A[K] = …
    endwhile

After IV substitution (IVS):

    for i=0 to N-1
      S1: … = A[2*i+1]
      S2: A[2*i+2] = …
    endfor

After parallelization (note the affine indexes):

    forall (i=0,N-1)
      … = A[2*i+1]
      A[2*i+2] = …
    endforall

[Figure: alternating write (W) and read (R) accesses to A[] made by A[2*i+2] and A[2*i+1]]

Dep test: the GCD test is used to solve the dependence equation 2 i_d - 2 i_u = -1. Since 2 does not divide 1 there is no data dependence.

IV Recognition on SSA Forms [Cytron91, Wolfe92]

    I1 = 3
    M1 = 0
    do
      I2 = φ(I1,I3)
      J1 = φ(?,J3)
      K1 = φ(?,K2)
      L1 = φ(?,L2)
      M2 = φ(M1,M3)
      J2 = 3
      I3 = I2+1
      L2 = M2+1
      M3 = L2+2
      J3 = I3+J2
      K2 = 2*J3
    while (…)

[Figure: spanning tree of the SSA graph]

    I2(i) = 3+i
    L2(i) = 1+3i
    M2(i) = 3i
    J1(i) = 7+i
    K1(i) = 14+2i

Symbolic Differencing [Haghighat95]

Use abstract interpretation to evaluate loop iterations and construct a symbolic difference table of the IV values.

    do
      x = x+z
      y = z+1
      z = y+1
    while (…)

    Iteration | x        | diff | diff | y   | diff | z   | diff
    1         | x+z      |      |      | z+1 |      | z   |
    2         | x+2z+2   | z+2  |      | z+3 | 2    | z+2 | 2
    3         | x+3z+6   | z+4  | 2    | z+5 | 2    | z+4 | 2

    x(i) = x0 + z0·i + (i^2 - i)
    y(i) = z0 + 2i + 1
    z(i) = z0 + 2i

Pointer-to-Array Conversion [vanEngelen01, Franke01]

    f += 2; lsp += 2;
    for (i = 2; i <= 5; i++) {
      *f = f[-2];
      for (j = 1; j < i; j++, f--)
        *f += f[-2]-2*(*lsp)*f[-1];
      *f -= 2*(*lsp);
      f += i; lsp += 2;
    }

Lsp_az speech codec segment from ETSI with pointer updates.
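
The pointer updates in the segment above can be traced by modeling each C pointer as an integer offset into a Python list. The sketch below does that and compares the result with an index-based form of the same loop nest in which every subscript is an affine function of normalized counters i and j. The function names, the offset model, and the integer test data are assumptions made for illustration; the codec's actual data types and arithmetic are not modeled.

```python
def lsp_pointer_form(f, lsp):
    """Pointer-style loop; fp and lp stand in for the C pointers f and lsp."""
    f = list(f)
    fp, lp = 2, 2                      # f += 2; lsp += 2;
    for i in range(2, 6):              # for (i = 2; i <= 5; i++)
        f[fp] = f[fp - 2]              # *f = f[-2];
        j = 1
        while j < i:                   # for (j = 1; j < i; j++, f--)
            f[fp] += f[fp - 2] - 2 * lsp[lp] * f[fp - 1]
            j += 1
            fp -= 1
        f[fp] -= 2 * lsp[lp]           # *f -= 2*(*lsp);
        fp += i                        # f += i;
        lp += 2                        # lsp += 2;
    return f


def lsp_array_form(f, lsp):
    """Index-based loop with affine subscripts in the normalized counters."""
    f = list(f)
    for i in range(0, 4):
        f[i + 2] = f[i]
        for j in range(0, i + 1):
            f[i - j + 2] += f[i - j] - 2 * lsp[2 * i + 2] * f[i - j + 1]
        f[1] -= 2 * lsp[2 * i + 2]
    return f


# Hypothetical integer data so that the comparison is exact
f_in = [1, 2, 3, 4, 5, 6]
lsp_in = [3, 1, 4, 1, 5, 9, 2, 6, 5]
print(lsp_pointer_form(f_in, lsp_in) == lsp_array_form(f_in, lsp_in))
```

Both functions copy their input first, so the same data can be passed to each of them.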

    for (i = 0; i <= 3; i++) {
      f[i+2] = f[i];
      for (j = 0; j <= i; j++)
        f[i-j+2] += f[i-j]-2*lsp[2*i+2]*f[i-j+1];
      f[1] -= 2*lsp[2*i+2];
    }

Lsp_az speech codec segment after pointer-to-array conversion. Note that all array index expressions are affine.

Control-Flow Issues

Conditional array accesses and conditionally updated induction variables present problems:

    for (…) {
      if (…)
        A[I] = …
      else
        … = A[J]
    }

Assume RAW and WAR dependences.

    do {
      K = 3;
      K = K+J;
      if (…)
        J = K;
      else
        J = J+3;
      A[J] = …
    } while (J<N)

Extensive analysis reveals that J := J+3 on both branches.

    DO I=1,10
      IF …
        J = J+2
      ELSE
        J = I
      ENDIF
      A(J) = …
    ENDDO

Problem: J has no single recurrence form.

Chains of Recurrences for Compiler Optimization

Chains of recurrences forms and algebra can be used to:

- Detect (non)linear coupled IVs
- Analyze pointer arithmetic
- Effectively handle control flow
- Implement array dependence testing

Chains of Recurrences

- A chain of recurrences (CR) represents a polynomial or exponential function, or a mix of the two, evaluated over a unit-distance grid [Zima92]
- Basic form: {init, ⊙, stride}, where ⊙ is + or *

    Iteration | {init, ⊙, stride}                | f(i) = 2i+1 = {1,+,2} | f(i) = 2^i = {1,*,2}
    i = 0     | init                             | 1                     | 1
    i = 1     | init ⊙ stride                    | 3                     | 2
    i = 2     | init ⊙ stride ⊙ stride           | 5                     | 4
    i = 3     | init ⊙ stride ⊙ stride ⊙ stride  | 7                     | 8

Chains of Recurrences: General Formulation

- The key idea is to represent a non-constant CR stride in CR form itself, thereby forming a chain of recurrences
- Example: f(i) = i^2 = {0, +, s(i-1)} = {0, +, 1, +, 2} where s(i) = {1, +, 2}

    Iteration | {init, ⊙, s(i-1)}             | s(i) = {1, +, 2} | f(i) = {0, +, s(i-1)}
    i = 0     | init                          | 1                | 0
    i = 1     | init ⊙ s(0)                   | 3                | 1
    i = 2     | init ⊙ s(0) ⊙ s(1)            | 5                | 4
    i = 3     | init ⊙ s(0) ⊙ s(1) ⊙ s(2)     | 7                | 9

CRs for Expediting Function Evaluations on Grids

- Suppose f(i) = a + b·i + c·i^2 = {a, +, {b+c, +, 2c}}
- We have two IVs x and y:
    f(i) = x = {x0, +, y} with x0 = a
    s(i) = y = {y0, +, 2c} with y0 = b+c
- Implement a loop to update x and y for efficient evaluation of f(i) over a unit-distance grid i = 0, …, n:

    x = a
    y = b+c
    for i=0 to n
      f[i] = x
      x = x+y
      y = y+2*c
    endfor

[Figure: plot of s(i) and f(i) against the iteration number, 0 to 10]

Multi-Dimensional Example

Let f(i,j) = i^2 + i·j + 1

1. Create IV k for f(i,j) in the j-loop:
     f(i,j) = k_j = {p_i, +, r_i}_j with p_i = i^2 + 1 and r_i = i
2. Create IVs for p_i and r_i in the i-loop:
     p_i = {p_0, +, q_i}_i with p_0 = 1
     q_i = {q_0, +, 2}_i with q_0 = 1
     r_i = {r_0, +, 1}_i with r_0 = 0
3. Implement k, p, q, and r in the i-j-loop nest:

    p = 1
    q = 1
    r = 0
    for i = 0 to n
      k = p
      for j = 0 to m
        f[i,j] = k
        k = k+r
      endfor
      p = p+q
      q = q+2
      r = r+1
    endfor

CR Construction with the CR Algebra

To construct the CR form of a symbolic function f(i):

1. Replace i with the CR {0,+,1}
2. Apply the CR algebra rewrite rules (selected rules shown):

    {x, +, y} + c            ⇒  {x+c, +, y}
    c{x, +, y}               ⇒  {c·x, +, c·y}
    {x, +, y} + {u, +, v}    ⇒  {x+u, +, y+v}
    {x, +, y} * {u, +, v}    ⇒  {x·u, +, y{u, +, v}+v{x, +, y}+y·v}

Example: f(i) = c·(i+a) = c·({0, +, 1}+a) = c{a, +, 1} = {c·a, +, c}

Loop Analysis with CR Forms [vanEngelen01]

The basic idea:

- Scan the loop to detect IV updates
- Construct the CR form for each IV using the CR algebra

    do
      J = J+I
      I = I+3
      P = 2*P
    while (…)

    J = {J0, +, I}
    I = {I0, +, 3}
    P = {P0, *, 2}

Substituting the CR form of I gives J = {J0, +, {I0, +, 3}}.

Algorithm 1: Find Recurrences

Input: Loop L with live variable information
Output: Set S of recurrence relations of IVs

1. Start with set S = { ⟨v, v⟩ | v is live at loop header }
2. Search L from bottom to top: for each assignment v = x of expression x to scalar variable v, update the tuples ⟨u, y⟩ in S by replacing v in y with x
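
The two steps above can be sketched in Python with plain textual substitution. The function name `find_recurrences` and the encoding of the loop body as (target, expression) pairs are assumptions made for this illustration. A compiler would substitute on expression trees rather than on strings, and this sketch ignores control flow inside the loop body.

```python
import re


def find_recurrences(assignments, live_vars):
    """Backward substitution over a straight-line loop body.

    assignments: list of (target, expression) string pairs in program order.
    live_vars:   names that are live at the loop header.
    Returns a dict mapping each live variable v to its update expression y,
    that is, the tuples <v, y> of the set S.
    """
    # Step 1: S = { <v, v> | v is live at the loop header }
    S = {v: v for v in live_vars}
    # Step 2: walk the body bottom to top, replacing the target by its expression
    for target, expr in reversed(assignments):
        pattern = r"\b" + re.escape(target) + r"\b"
        S = {v: re.sub(pattern, lambda m: "(" + expr + ")", y) for v, y in S.items()}
    return S


# Sample straight-line loop body
body = [
    ("M", "2"),
    ("L", "J-H"),
    ("J", "L+M"),
    ("K", "K+M*I"),
    ("I", "I+1"),
]
S = find_recurrences(body, ["H", "I", "J", "K"])
for v, y in S.items():
    print(v, "->", y)
```

The resulting expressions are fully parenthesized; simplifying them is left to the CR algebra.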

Example trace, starting from S = {⟨H, H⟩, ⟨I, I⟩, ⟨J, J⟩, ⟨K, K⟩}:

    Loop L          | Step | Changes to S
    do              |      |
      M = 2         | 5    | S5 = {⟨H, H⟩, ⟨I, I+1⟩, ⟨J, J-H+2⟩, ⟨K, K+2*I⟩}
      L = J-H       | 4    | S4 = {⟨H, H⟩, ⟨I, I+1⟩, ⟨J, J-H+M⟩, ⟨K, K+M*I⟩}
      J = L+M       | 3    | S3 = {⟨H, H⟩, ⟨I, I+1⟩, ⟨J, L+M⟩, ⟨K, K+M*I⟩}
      K = K+M*I     | 2    | S2 = {⟨H, H⟩, ⟨I, I+1⟩, ⟨J, J⟩, ⟨K, K+M*I⟩}
      I = I+1       | 1    | S1 = {⟨H, H⟩, ⟨I, I+1⟩, ⟨J, J⟩, ⟨K, K⟩}
    while (…)       |      |

Algorithm 2: Compute CR Forms

Input: Set S with recurrence relations
Output: CR forms for IVs in S

1. For each relation ⟨v, x⟩ in S do:
   - if x is of the form v then v = v0 (v is loop invariant)
   - if x is of the form v + y then v = {v0, +, y}
   - if x is of the form v * y then v = {v0, *, y}
   - if x does not contain v then v = {v0, #, x} (v is a wrap-around variable)
2. Simplify the CR forms with the CR algebra rewrite rules

    Recurrence relation in S | CR form           | Simplified CR form
    ⟨H, H⟩                   | H = H0            | H = H0
    ⟨I, I+1⟩                 | I = {I0, +, 1}    | I = {I0, +, 1}
    ⟨J, J-H+2⟩               | J = {J0, +, 2-H}  | J = {J0, +, 2-H0}
    ⟨K, K+2*I⟩               | K = {K0, +, 2*I}  | K = {K0, +, 2I0, +, 2}

Algorithm 3: Solve

Input: CR forms for IVs
Output: Closed-form solutions for IVs (when possible)

1. For each CR form of v apply the CR inverse algebra, assuming the loop is normalized for i = 0, …, n
2. Certain “exotic” mixed non-polynomial and non-exponential CR forms may not have closed forms
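
For CR forms that contain only + operators, one way to obtain a closed form is the identity {c0, +, c1, +, ..., +, ck}(i) = c0·C(i,0) + c1·C(i,1) + ... + ck·C(i,k), where C(i,j) is a binomial coefficient. The Python sketch below checks that identity against direct iteration of the CR. It is not the CR inverse algebra itself, which also covers * operators, and the function names and sample values (K0 = 5, I0 = 3) are assumptions made for this example.

```python
from math import comb


def cr_values(coeffs, n):
    """Iterate the pure-sum CR {c0, +, c1, +, ..., +, ck} for i = 0, ..., n-1."""
    state = list(coeffs)
    values = []
    for _ in range(n):
        values.append(state[0])
        # Each coefficient absorbs its right neighbour; ascending order ensures
        # that the neighbour still holds its value from the current iteration.
        for j in range(len(state) - 1):
            state[j] += state[j + 1]
    return values


def cr_closed_form(coeffs, i):
    """Closed form of a pure-sum CR via binomial coefficients."""
    return sum(c * comb(i, j) for j, c in enumerate(coeffs))


# K = {K0, +, 2*I0, +, 2} with the hypothetical values K0 = 5 and I0 = 3
K0, I0 = 5, 3
k_cr = [K0, 2 * I0, 2]
print(cr_values(k_cr, 6))
print([cr_closed_form(k_cr, i) for i in range(6)])
```

Expanding the binomials for this CR gives K0 + 2·I0·i + (i^2 - i), which is the polynomial K0 + i^2 + (2·I0 - 1)·i.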

Example, for the loop L:

    do
      M = 2
      L = J-H
      J = L+M
      K = K+M*I
      I = I+1
    while (…)

    Simplified CR form        | Closed form
    J = {J0, +, 2-H0}         | J(i) = J0 + (2-H0)*i
    K = {K0, +, 2I0, +, 2}    | K(i) = K0 + i^2 + (2I0-1)*i
    I = {I0, +, 1}            | I(i) = I0 + i

Example 1

Loop L:

    x = 2
    z = 0
    do
      A(x) = A(z)
      x = x+z
      y = z+1
      z = y+1
    while (z<N)

Find recurrences, starting from S = {⟨x, x⟩, ⟨z, z⟩}:

    Step 1: S1 = {⟨x, x⟩, ⟨z, y+1⟩}
    Step 2: S2 = {⟨x, x⟩, ⟨z, z+2⟩}
    Step 3: S3 = {⟨x, x+z⟩, ⟨z, z+2⟩}

CR form:

    x = {x0, +, z}
    z = {z0, +, 2}

Closed form:

    x(i) = x0 + z0·i + i^2 - i
    z(i) = z0 + 2i

With x0 = 2 and z0 = 0, the loop becomes the following, where the upper bound uses integer division because z advances by 2 per iteration until z >= N:

    do i=0,(N-1)/2
      A(i*i-i+2) = A(2*i)
    end do

Example 2

    DO I=1,M
      DO J=1,I
        ij = ij+1
        ijkl = ijkl+I-J+1
        DO K=I+1,M
          DO L=1,K
            ijkl = ijkl+1
            xijkl[ijkl]=xkl[L]
          ENDDO
        ENDDO
        ijkl = ijkl+ij+left
      ENDDO
    ENDDO

TRFD code segment from Perfect Benchmark with IV updates.

    DO I=0,M-1
      DO J=0,I
        DO K=0,M-I-2
          DO L=0,I+K+1
            tmp = ijkl+L+I*(K+(M+M*M+2*left+6)/4)+J*(left+(M+M*M)/2)+((I*I*M*M)+2*(K*K+3*K+I*I*(left+1))+M*I*I)/4+2
            xijkl[tmp] = xkl[L+1]
          ENDDO
        ENDDO
      ENDDO
    ENDDO

TRFD after aggressive induction variable substitution.

Example 3 (SSA)

    a = 1;
    while (a<10) {
      x = a+2;
      a = a+1;
    }

SSA form:

        a0 = 1
        if (a0>=10) goto L2
    L1: a1 = φ(a0, a2)
        x0 = a1 + 2
        a2 = a1+1
        if (a2<10) goto L1
    L2:

[Figure: SSA graph over a0, a1, a2, and x0]

GCC 4.x uses our approach applied to SSA form.
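
A minimal Python sketch of how the evolution of a φ-defined variable could be read off the SSA form above: start at the back-edge operand of the φ node and follow the additions until the walk returns to the φ node. The dictionary encoding, the tuple tags, and the function name `evolution` are assumptions made for this example, and only constant additive steps are handled. This is not GCC's implementation.

```python
def evolution(defs, phi_name):
    """Return (init, step) for a loop-header phi updated by constant additions.

    defs maps an SSA name to one of (hypothetical encoding):
      ("const", value, None)
      ("phi", name_on_loop_entry, name_on_back_edge)
      ("add", operand_name, constant)
    """
    kind, init_name, back_name = defs[phi_name]
    if kind != "phi":
        raise ValueError(phi_name + " is not a phi node")
    init_kind, init_value, _ = defs[init_name]
    if init_kind != "const":
        raise ValueError("only constant initial values are handled")
    step = 0
    name = back_name
    seen = set()
    # Follow the use-def chain from the back-edge value to the phi node
    while name != phi_name:
        if name in seen:
            raise ValueError("cycle that does not pass through the phi node")
        seen.add(name)
        kind, operand, constant = defs[name]
        if kind != "add":
            raise ValueError("only constant additive updates are handled")
        step += constant
        name = operand
    return init_value, step


# SSA form of: a = 1; while (a<10) { x = a+2; a = a+1; }
ssa = {
    "a0": ("const", 1, None),
    "a1": ("phi", "a0", "a2"),
    "x0": ("add", "a1", 2),
    "a2": ("add", "a1", 1),
}
print(evolution(ssa, "a1"))
```

The returned pair (init, step) corresponds to the CR {init, +, step}.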

Result: a1 = {1,+,1}

Note: GCC developers refer to CRs as “scalar evolutions”.

Example 4 (SSA)

    x = 0;
    i = 1;
    while (i<10) {
      x = x+i;
      i = i+1;
    }

SSA form:

        x0 = 0
        i0 = 1
        if (i0>=10) goto L2
    L1: x1 = φ(x0, x2)
        i1 = φ(i0, i2)
        x2 = x1+i1
        i2 = i1+1
        if (i2<10) goto L1
    L2:

[Figure: SSA graph over i0, i1, i2, x0, x1, and x2]

    i1 = {1,+,1}
    x1 = {0,+,i1} = {0,+,1,+,1}

Example 5 (SSA)

    j = 0;
    i = 1;
    while (i<10) {
      if (p)
        j = j+2;
      else
        j = j+3;
      i = i+1;
    }

SSA form:

        j0 = 0
        i0 = 1
        if (i0>=10) goto L2
    L1: i1 = φ(i0, i2)
        j1 = φ(j0, j4)
        if (!p) goto L3
        j2 = j1+2
        goto L4
    L3: j3 = j1+3
    L4: j4 = φ(j2, j3)
        i2 = i1+1
        if (i2<10) goto L1
    L2:

[Figure: SSA graph over j0, j1, j2, j3, and j4]

The conditional update bounds j1 by two CR forms: {0,+,2} ≤ j1 ≤ {0,+,3}

Recognizing Mixed Functional Forms and Reductions

Factorial:

    I = 1
    do
      F = F*I
      I = I+1
    while (…)

    Simplified CR form: F = {F0, *, 1, +, 1}, I = {1, +, 1}
    Closed form: F = F0 * i!

Reduction:

    I = 0; S = 0
    do
      S = S+A[I]
      I = I+2
    while (…)

    Simplified CR form: S = {0, +, A[{0, +, 2}]}, I = {0, +, 2}
    Closed form: S = ∑ A[2i]

Pointer Access Descriptions of Pointer and Array References

- A pointer access description (PAD) [vanEngelen01] is a CR form of a pointer or array reference in a loop nest
- PADs are computed with the CR-based IV algorithms

    short a[…], *p; int i;
    p = a;
    for(i=0;…;i++) { }

    Loop code      | PAD                  | Sequence
    a[i]           | {a, +, 1}            | a[0],a[1],a[2],a[3]
    a[2*i+1]       | {a+1, +, 2}          | a[1],a[3],a[5],a[7]
    a[(i*i-i)/2]   | {a, +, 0, +, 1}      | a[0],a[0],a[1],a[3]
    a[1<<i]        | {a+1, +, 1, *, 2}    | a[1],a[2],a[4],a[8]
    p++            | {a, +, 1}            | a[0],a[1],a[2],a[3]
    p+=i           | {a, +, 0, +, 1}      | a[0],a[0],a[1],a[3]

CR-Enhanced Array Dependence Testing

- Basic idea: construct dependence equations in CR form for both pointer and array accesses
- Determine the solution intervals by computing the value ranges of the equations in CR form
- If the solution space is empty, there is no dependence

Example

    float a[…], *p, *q;
    p = a;
    q = a+2*n;
    for (i=0; i<n; i++) {
      t = *p;
      S: *p++ = *q;
      *q-- = t;
    }

Dependence to test: S δ(*) S, with PADs

    p = {a, +, 1}
    q = {a+2n, +, -1}

Dependence equation:

    {a, +, 1}_id = {a+2n, +, -1}_iu

Constraints:

    0 ≤ i_d ≤ n-1
    0 ≤ i_u ≤ n-1

Rewrite the dependence equation:

    {a, +, 1}_id = {a+2n, +, -1}_iu
    ⇒ {a, +, 1}_id - {a+2n, +, -1}_iu = 0
    ⇒ {{-2n, +, 1}_iu, +, 1}_id = 0

Compute the solution interval:

    Low[{{-2n, +, 1}_iu, +, 1}_id] = Low[{-2n, +, 1}_iu] = -2n
    Up[{{-2n, +, 1}_iu, +, 1}_id] = Up[{-2n, +, 1}_iu + n-1] = Up[-2n + 2n - 2] = -2

The interval [-2n, -2] does not contain 0: no dependence.

Determining the Value Range of a CR Form

- Suppose x(i) = {x0, +, s(i-1)} for i = 0, …, n
  - If s(i-1) > 0 then x(i) is monotonically increasing
  - If s(i-1) < 0 then x(i) is monotonically decreasing
- If a function is monotonic on its domain, then it is trivial to find its exact value range

Example: Nonlinear and Symbolic Dependence Testing

Nonlinear pointer accesses:

    float a[…], *p, *q;
    p = q = a;
    for (i=0; i<n; i++) {
      for (j=0; j<=i; j++)
        *q += *++p;
      q++;
    }

    p = {{a+1, +, 1, +, 1}_i, +, 1}_j = a[(i^2+i)/2+j+1]
    q = {a, +, 1}_i = a[i]

The CR dependence test disproves flow dependence (<, <).

Symbolic array accesses:

    DO i = 1, M+1
      S1: A[I*N+10] = ...
      S2: ... = A[2*I+K]
      K = 2*K+N
    ENDDO

    S1: A[{N+10, +, N}_i]
    S2: A[{K0+2N, +, K0+N+2, *, 2}_i]

The CR range test disproves dependence when K+N > 10 and K > 2.

Results

- Implemented a CR-enhanced trapezoidal Banerjee test
  - Relatively simple test
  - Enhanced with support for nonlinear forms
  - Enhanced with support for conditional flow
  - Constructs dependence equations in CR form
- Implementation based on the Polaris compiler
  - Pros: can compare to powerful dependence tests such as the Omega and Range tests
  - Cons: Fortran only

Additional Independences Filtered over Omega Test

[Figure: bar chart, 0% to 100%, of CR-EVT versus Omega for the Perfect Benchmark codes DYFESM, MDG, OCEAN, QCD, and TRFD and the LAPACK codes GEP, NEP, and SEP]

Additional Independences Filtered over Range Test

[Figure: bar chart, 0% to 100%, of CR-EVT versus Range for DYFESM, MDG, OCEAN, QCD, TRFD, GEP, NEP, and SEP]

Additional Independences Filtered over Omega+Range

[Figure: bar chart, 0% to 100%, of CR-EVT versus Omega+Range for DYFESM, MDG, OCEAN, QCD, TRFD, GEP, NEP, and SEP]

Percentage of Conditional IVs w/o Closed Forms in LAPACK

[Figure: bar chart, 0% to 100%, of conditional IVs versus other IVs for GEP, NEP, and SEP]

Timing Comparison: Perf Bench.

[Figure: bar chart of time (s), 0 to 10, for Range, Omega, CR-EVT, and CR-EVT (opt) on DYFESM, MDG, OCEAN, QCD, and TRFD]

Timing Comparison: LAPACK

[Figure: bar chart of time (s), 0 to 70, for Range, Omega, CR-EVT, and CR-EVT (opt) on GEP, NEP, and SEP]

Conclusions

- A CR-based compiler framework has advantages:
  - Applicable to CFG, AST, and SSA forms
  - Handles conditional flow
  - Handles nonlinear and symbolic induction variable expressions
  - Allows array and pointer-based dependence testing to be applied directly to the CR forms without induction variable substitution
- Future work:
  - Improve the GCC implementation
  - Enhance other dependence tests with CR forms

Further Reading

- Robert van Engelen, Johnnie Birch, Yixin Shou, Burt Walsh, and Kyle Gallivan, “A Unified Framework for Nonlinear Dependence Testing and Symbolic Analysis”, in proceedings of the ACM International Conference on Supercomputing (ICS), 2004, pages 106-115.
- Robert van Engelen, Johnnie Birch, and Kyle Gallivan, “Array Dependence Testing with the Chains of Recurrences Algebra”, in proceedings of the IEEE International Workshop on Innovative Architectures for Future Generation High-Performance Processors and Systems (IWIA), January 2004, pages 70-81.
- Robert van Engelen and Kyle Gallivan, “An Efficient Algorithm for Pointer-to-Array Access Conversion for Compiling and Optimizing DSP Applications”, in proceedings of the 2001 International Workshop on Innovative Architectures for Future Generation High-Performance Processors and Systems (IWIA), January 2001, pages 80-89.
- Robert van Engelen, “Efficient Symbolic Analysis for Optimizing Compilers”, in proceedings of the International Conference on Compiler Construction, ETAPS 2001, LNCS 2027, pages 118-132.

The End