Towards Automated Characterization of the Data Movement Complexity of Affine Programs HiPEAC'15 1/19/2015 † Elango , V. †The F. § Rastello , Ohio State University L-N † Pouchet , J. ‡ Ramanujam , P. ‡Louisiana State University §INRIA 0:7 † Sadayappan Although a CDAG is derived from analysis of dependences between instances of statements executed by a sequential program, it abstracts away that sequential schedule of operations and only imposes an essential partial order captured by the data dependences between the operation instances. Control dependences in the computation need not be represented since the goal is to capture the inherent data locality characteristics based on the set of operations that actually transpired during an execution of the program. o CDAG abstrac#on: o Arbitrary CDAGs: – To compute a vertex, predecessor ver#ces must hold values in 1fast memory 2 – Limited fast memory => computed values may need to be temporarily stored in slow memory and reloaded for N=6 Fig. 5: CDAGCDAG for Gauss-Seidel code in Fig. 2. Fig. 6: m Convex-partition of the CDAG for the Input vertices are shown in black, all other• Inherent verd ata vmt. c omplexity o f code in Fig. 2 for N = 10. ticesis represent operations performed. • Data movement cost CDAG: Minimal #loads+#stores among different for two versions all possible valid schedules Theyskey idea behind the work presented in this article is to perform analysis on the CDAG of a • Also depends on cache ize upper order bounds on min-­‐cost computation, attempting Develop to find a different of execution of the operations that can improve the Ques#on: Can we do reuse-distance beJer? profile compared to that of the given program’s sequential execution trace. If this analysis revealsnao significantly improved reuse distance profile, it suggests that suitable source code transformaHow do we know when tions have the potential to enhance data locality.dOn themother hand, if cthe analysis is unable to improve Minimum p ossible ata ovement ost? further improvement ossible? profile of the code, it is likely that it is already as well optimized for data locality as thepreuse-distance No known effec?ve solu?on to problem possible. Ques#on: What is the The lowest dynamic analysis involves the following steps: achievable data movement (1) Generate a sequential execution trace of a program. cost among all equivalent (2) Form a CDAG from the executionlower trace. bounds on min-­‐cost Develop versions of the computa#on? (3) Perform a multi-level convex partitioning of the CDAG, which is then used to change the schedule of operations of the CDAG from the original order in the given input code. A convex partitioning of a CDAG is analogous to tiling the iteration space of a regular nested loop computation. Multi-level convex partitioning is analogous to multi-level cache-oblivious blocking. (4) Perform standard reuse-distance analysis of the reordered trace after multi-level convex partitioning. Lower Bounds: Geometric Reasoning with Data Footprints Finally, Fig. 6 shows the convex partitioning of the CDAG corresponding to the code in Fig. 2. After such a partitioning, the execution order of the vertices is reordered so that the convex partitions are executed in some valid order (corresponding to a topological sort of a coarse-grained inter-partition o Divide execu#on trace ininto segments wasith dependence graph), with the vertices within a partition being executed the same relative order the for (i=0; i<N; i++) Load original sequential execution. DetailsS areload/stores presented in the next (3 isection. n ex.) Seg. 3 Source: Jim Demmel …... Time Seg. 2 Seg. 1 o Loomis-­‐Whitney inequality (2D): bounds #points in a set by product of for (j=0;j<N;j++) Load Load OF CDAG 3. CONVEX PARTITIONING o Within each segment, #dis#nct elements # projected points on coordinate axes if (i <> j) force[i] we provide FLOP In this+= section, details on our algorithm for convex partitioning of CDAGs, which is at of pos[] <= 2S (up to S coming into o Prior work: Uses Loomis-­‐Whitney Store dynamic analysis. In the case of loops, numerous efforts have attempted to the heart of our proposed func(pos[i],pos[j]) optimize data localityFLOP by applying loop transformations, in particular involving loop tilingSand segment in scratchpad and another loop Background inequality & generaliza#on (Holder-­‐ fusion [Irigoin and Triolet Load 1988; Wolf and Lam 1991; Kennedy and McKinley 1993; Bondhugula et al. explicitly loaded) 2008]. Tiling for locality Brascamp-­‐Lieb) for lower bounds for Load attempts to group points in an iteration space of a loop into smaller blocks (tiles) allowing reuseLoad (thereby reducing reuse distance) in multiple directions when the block fits in o For c ode e xample, p rojec#on o f S tmt(i,j) linear-­‐algebra-­‐like computa#ons y Inequality FLOP onto i-­‐axis maps to data element pos[i]; ACM Transactions on Architecture § Projec#ons of itera#on-­‐space FLOP and Code Optimization, Vol. 0, No. 0, Article 0, Publication date: 2014. E FLOP similarly for j-­‐axis Max. # dis#nct elts of E points < ==> D ata f ootprint j bset in the euclidian d-space Store pos[i] or pos[j] read in any segment <= 2S § Geometric i nequality: B ound m ax. Store projections on the coordinates hyperplanes FLOP o By Loomis-­‐Whitney, max. # itera#on #of ops for a given # of data moves j …... Store points in any segment, |P| <= 2S*2S Load i 2/4S2; each seg. (but d Ei FLOP o Min. # segments > = N 1/(d 1) |E| |fj (E)| 2D Loomis-­‐Whitney Inequality last) has S load/stores j=1 |E| <= |Ei|*|Ej| o #load/stores >= (N2/4S2-­‐1)*S = Ω(N2/S) ’ Geometric Reasoning with Data Footprints: Limita?ons Contributions: POPL 2015 Affine computations o Cannot handle mul#-­‐ o Cannot model effect of data for (i=0; i<N; i++) for (t=1; t<T; t++) { for (j=0;j<N;j++) statement programs dependences for (i=1; i<N-1; i++) for (k=0;k<N;k++) § Computa#ons with § Dependences may impose Can be represented as (union of)very Z -polyhedra:B[i] = A[i-1]+A[i]+A[i+1]; C[i][j] += A[i][k]*B[k][j]; different d ata m vmt. constraints = > h igher d ata d for (i=1; i<N-1; i++) I Space: d -dimensional integer lattice (Z ). Rqmts. but same array movement cost than A[i] = B[i]; I Points: Each instance of the statement.} for (i=0; i<N; i++) access footprint => same LB footprint analysis reveals for (j=0;j<N;j++) I Arrows: True datapdependencies. § Seman#cs reserving loop § Example: 1D Jacobi – for (k=0;k<N;k++) { transforma#ons can result footprint based geometric C[i][j] += 1; for ( i =0; in i <cNhange ; i ++)to lower bound analysis cannot derive A[i][k] += 1; S1: A[i] = I[i]; B[k][j] += 1; known LB of Ω(NT/S) for ( t =1; Same t < T ; access t ++) functions } ⇒ same analysis result { 3/√S) LB = Ω(N for ( i =1; i <N -1; i ++) POPL 2015 Loop Distribution S2: B[i] = A[i-1]+A[i]+A[i+1]; 12 / 27 for ( i =1; i <N -1; i ++) S3: A[i] = B[i]; Semantically equivalent for (i,j,k) C[i][j] += 1; } code after loop distribution: for (i,j,k) A[i][k] += 1; for (i,j,k) B[k][j] += 1; but different IO lower bound LB = Ω(N2) · Apply geometric reasoning on Z -polyhedra to bound |P| 21 Lower Bounds for CDAGs: Geometric Reasoning o Linear-­‐Algebra-­‐like algorithms: § Hong & Kung (1981): strong § Irony et al. (2004) and Ballard et al. rela#on between:1) Data (2011): Geometric approach based movement cost for a CDAG on geometric inequality schedule, and 2) Number of § Christ et al. (2013): Automa#on, vertex-­‐sets in “2S-­‐par##on” based on generalized geometric HBL of CDAG inequality (Holder-­‐Brascamp-­‐Lieb) § Change from reasoning about § (+) Automated asympto#c all valid schedules to all valid parametric lower bound expressions, 2S-­‐par##ons of graph e.g., O(N3/sqrt(S)) for NxN mat-­‐mult § (+) Generality Background § (-­‐) R estricted c omputa#onal m odel: § (-­‐) Manual CDAG-­‐specific weakness of bounds or reasoning => c hallenge t o Hong & Kung’s S-partitioning inapplicability automate 5.3 Automated I/O Lower Bound Computation We present a static analysis algorithm for automated derivation of expressions for parametric asymptotic I/O lower bounds for partitioning. programs. We use two illustrative examples to explain the various steps in the algorithm before providing the detailed pseudo-code for the full algorithm. I/O lower boundingasympto?c technique based on graph Our work: Sta?c analysis to automate parametric lower bounds analysis of affine for without CDAG m odel Validcodes with and re-computation. Definition (SNR -partitioning Illustrative example 1: Consider the following example of Jacobi 1D stencil computation. – Recomputation prohibited) SNR -partitioning CDAG Lower Bounds: Hong/Kung S-­‐Par??oning Given a CDAG C, an such that: 1 2 3 4 5 FLOP Store FLOP VS1 Load 6 of C is a collection of h 0:7 V \I Parameters : N , T subsets Inputs : I[N] of Outputs : A[N] for (i =0; i <N; i ++) Parameters: N, T S1 : A[i] = I[i ]; Load Load FLOP FLOP VS2 3 5 1 2 i 5: CDAG for Gauss-Seidel code in Fig. 2. the CDAG for the § |In(VSFig. )| 6:<= Convex-partition 2S (up to Sof from prev. VS3 Input vertices are shown in black, all other verStore icode in Fig. 2 for N = 10. tices represent operations performed. Store FLOPFig. for (t=1; t<T; t++) { for (i=1; i<N-1; i++) B[i] = A[i-1]+A[i]+A[i+1]; for (i=1; i<N-1; i++) A[i] = B[i]; } o How to upper-­‐bound |VSi|for any valid 2S-­‐par##on? o Key Idea: Use geometric inequality, but relate itera#on points to In(VSi) and not data footprint § Find rays corresponding to [T,N]->{S1[i]->S2[1,i+1]:1<=i<N-2} dependence chains => projec#ons [T,N]->{S1[i]->S2[1,i]:1<=i<N-1} [T,N]->{S1[i]->S2[1,i-1]:2<=i<N-1} § Projected points from VSi must be • Edges e5 and e6: Multiple uses of the boundary elements of A[t][1] In(VSand I[0] andsubset I[N-1] by i) A[t][N-2], respectively, for 1<=t<T are represented by the following relations. § Transform itera#on space so that [T,N]->{S1[0]->S2[t,1]:1<=t<T} [T,N]->{S1[N-1]->S2[t,N-2]:1<=t<T} are along coordinate axes • Edge e7:rays The use of array B in statement S3 corresponds to edge e7 with the following relation. [T,N]->{S2[t,i]->S3[t,i]:1<=t<T and 1<=i<N-1} • Edges e8, e9 and e10: The uses of array A in statement S2 from S3 are represented by these edges with the following relations. [T,N]->{S3[t,i]->S2[t+1,i+1]:1<=t<T-1 and 1<=i<N-2} [T,N]->{S3[t,i]->S2[t+1,i]:1<=t<T-1 and 1<=i<N-1} [T,N]->{S3[t,i]->S2[t+1,i-1]:1<=t<T-1 and 2<=i<N-1} Given a path p = (e1 , . . . , el ) with associated edge relations (R1 , . . . , Rl ), the relation associated with p can be computed by composing the relations of its edges, i.e., relation(p) = Rl ◦ · · · ◦ R1 . For instance, the relation for the path (e7, e8) in the example, obtained through the composition Re8 ◦ Re7 , is given by R p = [T,N] -> {S2[t,i] -> S2[t+1,i+1]}. Further, the domain and image of a composition are restricted to the points for which the composition can apply, i.e., domain(R j ◦ Ri ) = Ri −1 (image(Ri ) ∩ domain(R j )) and image(R j ◦ Ri ) = R j (image(Ri ) ∩ domain(R j )). Hence, domain(R p ) = [T,N] -> {S2[t,i] : 1<=t<T-1 and 1<=i<N-2} and image(R p ) = [T,N] -> {S2[t,i] : 2<=t<T Background and 2<=i<N-1}. Two kinds of paths, namely, injective circuit and broadcast path, defined below, are of specific importance to the analysis. o Use ISL to find all “must” data flow dependences o Cycles data dep. graph == “rays” in the CDAG o Generalized geom. inequality allows different dimensional orthogonal projec#ons § Parametric exponents in inequality: sum of weighted ranks of projected subspaces 5 4 I must exceed rank of full itera#on space Out(Vi ): setassociated of vertices w ofith Vi athat the output set O or that 2S-­‐par##on of CofDAG 3 are also part § solve a linear program to find op#mal 7 have at least § Divide one successor outside of Vi . incurring trace into segments e1 Brascamp-Lieb inequality weights exactly S l oad/stores 4 6 POPL 2015 9 / 27 1 2 D EFINITION 6 (Injective edge and circuit). Anlower injective b edge a is § => a sympto#c p arametric I /O ound S1 § Ops executed in segment-­‐i form a an edge of a –data-flow whose associated relation Ra isnot both Generalizes Loomis-Whitney allowsgraph more general linear maps, affine and injective, i.e., Ra = A.⃗x +⃗b, where A is an invertible for a ffine p rogram convex vertex set VS necessarily all mapping onto spaces of the same dimension. Sh Inputs: In[N]; Outputs: A[N] P1 8i 6= j, Vi \ Vj = 0/ , and i=1 Vi = V \ I Although a CDAG is derived from analysis of dependences between instances of statements executed for (t =1; t <T; for t(++) i=0; i<N; i++) { by a sequential program, it abstracts away that sequential schedule of operations and only imposes P2 there is no cyclic dependence between subsets for (i =1; i <N A-1; [i] = iIn[i]; ++) S1 an essential partial order captured by the data dependences between the operation instances. Control S2 : B[i] = A[i -1] + A[i] + A[i for (t=0;t<T;t++) { +1]; dependences in the computation need not be represented since the goal is to capture the inherent data P3 8i, |In(Vi )| actually S transpired S-­‐Par??on of Cof DAG for(i=1;i<N-­‐1;i++) locality characteristics based on the set of operations that during an execution the for (i =1; i <N -1; i ++) program. ]; B[i] = A[i-­‐1]+A[i]+A[i+1]; S2 S3 : A[i] = B[i sa?sfies 4 p roper?es |Out(V P4 8i, )| S i Load } for(i=1;i<N-­‐1;i++) Load A[i] = B[i]; } S3 Load In(Vi ): seto Any of vertices of V \ Vi that have at least one successor in Vi . valid schedule using S registers is Segment 1 for(it = 1; it<N−1; it +=B) for(jt = 1; jt<N−1; jt +=B) for(i = it; i < min(it+B, N−1); i++) for(j = jt; j < min(jt+B, N−1); j++) Un#led version A[i][j] = A[i−1][j] + A[i][j−1]; 2 Comp. complexity: (N-­‐1) Ops Tiled Version Comp. complexity: (N-­‐1)2 Ops for (i=1; i<N-1; i++) for (j=1;j<N-1; j++) A[i][j] = A[i][j-1] + A[i-1][j]; Segment 2 – Vertex = opera#on, edge = data dep. 5S fast 4 with • 2-­‐level memory hierarchy 3 mem loca#ons & infinite slow locs. Prior Work: Data Movement Lower Bounds e2 e3 e4 e5 e6 matrix. An injective circuit is a circuit E of a data-flow graph such that every edge e ∈ E is an injective edge. Further generalized by Bennett et al. d m EFINITION 7 (Broadcast edge and path). A broadcast edge mb is 1/(d D1) sj segment a nd u p t o S n ew l oads) 7 FLOP |E| ’ |fj (E)| |E| |f (E)| s.t., 8i, 1 Rb isÂaffine S2 ’ whose j an edge of a data-flow graph associated relation j=1 si di,j j=1 s egment ( except l ast) h as S l oads/ and dim(domain(Rb )) <j=1 dim(image(Rb )). A broadcast path is a Store They key idea behind the work presented§ Each in this article is to perform analysis on the CDAG of a 0:7 where, path (e1 , . . . , en ) of a data-flow graph such that e1 is a broadcast Load computation, attempting to find a different order of execution of the operations that can improve the stores = > S *NS > = T otal I /O > =S*(NS-­‐1) e7 e8 e9 e10 edgemand ∀ni=2 ei are injective edges. …. FLOPreuse-distance (s1 , . . . , sm ) 2 [0, 1] profile compared to that of the given program’s sequential execution trace. If this analysis § Reasoning about minimum vertex sets d ! Rdj are orthogonal projections reveals a significantly improved reuse distance profile, it suggests that suitable source code#transformaStore f : R Injective circuits and broadcast paths in a data-flow graph essenj Although a CDAG is derived from analysis of dependences between instances of statements executed tions have the potential to enhance data locality. On the other hand, if the analysis is unable to improve S3 Load tially indicate multiple ! uses of same data, and therefore are good ! for a ny v alid 2 S-­‐par##on = > L ower d : dim(f (span( e ))) – where e i-th canonical i,j j i i ,analysis. the reuse-distance profileitofabstracts the code, it is likely it is already as well optimized for dataof locality as by a sequential program, awaythatthat sequential schedule operations and only imposes candidates for lower bound Hence onlyvector. paths of these two Total of NS Segments kinds are considered in the analysis. The current example of Jacobi bound on # loads/stores an essentialpossible. partial order captured by the data dependences between the operation instances. Control 1D computation illustrates the use of injective circuits to derive I/O The dynamic analysis involves the following steps: Figure 6: Data-flow graph for Jacobi 1D lower bounds, while the use of broadcast paths for lower bound dependences the acomputation need be represented since the goal is to capture the inherent data (1) in Generate sequential execution trace ofnot a program. analysis is explained in another example that follows. Fig. 6 shows the static data-flow graph G = (V , E ) for Jacobi F F F (2) Form a CDAGbased from the on execution locality characteristics the trace. set of operations that actually transpired during execution of the Injective circuits: In the Jacobi example, we have three circuits Lower B ounds: R esearch D irec?ons 1D. GF an contains a vertex for each statement in the code. The input Perform a multi-level convex partitioning of the CDAG, which is then used to change the schedule I is also explicitly represented in G by node I (shaded in black in to vertex S2 through S3. The relation for each circuit is computed F program. (3) of operations of the CDAG from the original order in the given input code. A convex partitioning of Fig. 6). Each vertex has an associated domain as shown below: by composing the relations of its edges as explained earlier. The 1) Alternate lower bounds relations, and the dependence vectors they represent, are listed a CDAG is analogous to tiling the iteration space of a regular nested loop computation. Multi-level • DI =[N]->{I[i]:0<=i<N} below. convex partitioning is analogous to multi-level cache-oblivious blocking. approach ( graph min-­‐cut be8): ased) • DS1 =[N]->{S1[i]:0<=i<N} • o Analyze C DAG s tructure Circuit c = (e7, 1 (4) Perform standard reuse-distance analysis of the reordered trace after multi-level convex partitioning. • DS2 =[T,N]->{S2[t,i]:1<=t<T and 1<=i<N-1} 2) C omposi#on o f l ower ->ounds {S2[t,i]->S2[t+1,i+1] : 1<=t<T-1 and 13 / 27 Rc = [T,N] b • DS3 =[T,N]->{S3[t,i]:1<=t<T and 1<=i<N-1} POPL 2015 § Establish Max |toVS <= inVFig. Smax Theory & Models Finally, Fig. 6 shows the convex partitioning of the CDAG corresponding thei| code 2. (S) The edges represent the true (read-after-write) data 1<=i<N-2} 3) dependences Modeling ver#cal + h orizontal After such a partitioning, the execution order of the vertices is reordered so that the convex partitions between the statements. Each edge has an associated affine depenb⃗1 = (1, 1)T Min. sort # vofertex sets = Ninter-partition Smin(S) = 4 data m ovement b ounds f or • Circuit c = (e7, e9): are executed in some valid order (corresponding to§ => a topological a coarse-grained 5 2 dence relation as shown below: = [T,N] -> scalable parallel Rscystems {S2[t,i]->S2[t+1,i] : 1<=t<T-1 and dependence graph), with the vertices within a partition being executed the same relative order as the • Edge e1: This edge corresponds to the dependence due to copyNver#ces /VSmaxin(S) 3 1<=i<N-1} original sequential execution. Details are presented in the next section. the following ing the inputs I to array A at statement S1 and has [ SPAA ‘ 14] b⃗2 = (1, 0)T § => IOmin(S) >= (NSmin-­‐1)*S relation. • Circuit c3 = (e7, e10): 3. CONVEX PARTITIONING OF CDAG [N]->{I[i]->S1[i]:0<=i<N} § In S=2, ofNCDAGs, Rc = [T,N] -> {S2[t,i]->S2[t+1,i-1] : 1<=t<T-1 and of array elements A[i-1], A[i] ver#ces=16 In this section, we provide details on our algorithm forexample: convex partitioning which is at • Edges e2, e3 and e4: The useTools Applica#ons 2<=i<N-1} and A[i+1] at statement S2 are captured by edges e2, e3 and the heart of our proposed dynamic analysis. In the§ VS case of loops, (S) =numerous 4 => efforts NSminhave = 1attempted 6/4 = 4to b⃗3 = (1, −1)T max e4, respectively. optimize data locality by applying loop transformations, in particular involving loop tiling1and loop 1) C ompara#ve a nalysis o f 1) Automated l ower 2 § IOKennedy 2*(4-­‐1) 6 Bondhugula et al. fusion [Irigoin and Triolet 1988; Wolf and Lam 1991; and McKinley=1993; min >= algorithms via lower bounds bounds for arbitrary 2008]. Tiling for locality attempts to group points in an iteration space of a loop into smaller blocks 2) Assessment of compiler (tiles) allowing reuse (thereby reducing reuse distance) in multiple directions when the block fits in explicit CDAGs effec#veness 2) Automated parametric ACM Transactions on Architecture and Code Optimization, Vol. 0, No. 0, Article 0, Publication date: 2014. 3) Algorithm/architecture co-­‐ lower bounds for affine Fig. 5: CDAG for Gauss-Seidel code in Fig. 2. design space explora#on Fig. 6: Convex-partition of the CDAG for the programs Input vertices are shown in black, all other ver [ACM TACO ’14, Hipeac ’15] [this poster; POPL ’15] code in Fig. 2 for N = 10. tices represent operations performed. Segment 3 Modeling Data Movement Complexity 1 2 3