Compiler Optimizations for Improving Data Locality

Steve Carr
Department of Computer Science, Michigan Technological University
carr@cs.mtu.edu

Kathryn S. McKinley
Department of Computer Science, University of Massachusetts
mckinley@cs.umass.edu

Chau-Wen Tseng
Computer Systems Laboratory, Stanford University
tseng@cs.stanford.edu

(Steve Carr was supported by NSF grant CCR-9409341. Chau-Wen Tseng was supported in part by an NSF CISE Postdoctoral Fellowship in Experimental Science.)

Abstract

In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. We demonstrate that these program transformations are useful for optimizing many programs.

To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments with kernels illustrate that our model and algorithm can select and achieve the best performance. For over thirty complete applications, we executed the original and transformed versions and simulated cache hit rates. We collected statistics about the inherent characteristics of these programs and our ability to improve their data locality. To our knowledge, these studies are the first of such breadth and depth. We found performance improvements were difficult to achieve because benchmark programs typically have high hit rates even for small data caches; however, our optimizations significantly improved several programs.

1 Introduction

Because processor speed is increasing at a much faster rate than memory speed, computer architects have turned increasingly to the use of memory hierarchies with one or more levels of cache memory. Caches take advantage of data locality in programs. Data locality is the property that references to the same memory location or adjacent locations are reused within a short period of time.

Caches also have an impact on programming; programmers substantially enhance performance by using a style that ensures more memory references are handled by the cache. Scientific programmers expend considerable effort at improving locality by structuring loops so that the innermost loop iterates over the elements of a column, which are stored consecutively in Fortran. This task is time consuming, tedious, and error-prone. Instead, achieving good data locality should be the responsibility of the compiler. By placing the burden on the compiler, programmers can get good uniprocessor performance even if they originally wrote their program for a vector or parallel machine. In addition, programs will be more portable because programmers will be able to achieve good performance without making machine-dependent source-level transformations.

1.1 Optimization Framework

Based on our experiments and experiences, we believe that compiler optimization to improve data locality should proceed in the following order:

1. Improve the order of memory accesses to exploit all levels of the memory hierarchy through loop permutation, fusion, distribution, skewing, and reversal. This process is mostly machine-independent and requires knowledge only of the cache line size.

2. Fully utilize the cache through tiling, a combination of strip-mining and loop permutation [IT88]. Knowledge of the cache size, associativity, and replacement policy is essential. Higher degrees of tiling can be applied to exploit multi-level caches, the TLB, etc.

3. Promote register reuse through unroll-and-jam (also known as register tiling) and scalar replacement [CCK90]. The number and type of registers available are required to determine the degree of unroll-and-jam and the number of array references to replace with scalars.

In this paper, we concentrate on the first step. Our algorithms are complementary to, and in fact improve the effectiveness of, optimizations performed in the latter two steps [Car92]. However, the other steps and interactions between steps are beyond the scope of this paper.

1.2 Overview

We present a compiler strategy based on an effective, yet simple, model for estimating the cost of executing a given loop nest in terms of the number of cache line references.
This paper extends previous work [KM92] with a slightly more accurate memory model. We use the model to derive a loop structure which results in the fewest accesses to main memory. The loop structure is achieved through an algorithm that applies compound loop transformations; the transformations are permutation, fusion, distribution, and reversal. The compound transformation algorithm is unique to this paper and is implemented in a source-to-source Fortran 77 translator.

We present extensive empirical results for kernels and benchmark programs that validate the effectiveness of our optimization strategy; they also reveal that programmers often use programming styles with good locality. We measure both the inherent data locality characteristics of scientific programs and our ability to improve their data locality. When the cache miss rate for a program is non-negligible, we show there are often opportunities to improve data locality. Our optimization algorithm takes advantage of these opportunities and consequently improves performance. As expected, loop permutation plays the key role; however, loop fusion and distribution may also produce significant improvements. Our algorithms never found an opportunity where loop reversal could improve locality.

2 Related Work

Abu-Sufah first discussed applying compiler transformations based on data dependence (e.g., loop interchange and tiling) to improve paging [AS79]. In this paper, we extend and validate recent research that integrates optimizations to improve data locality [KM92]. We extend the original cost model to capture more types of reuse, and whereas that work performs permutation and fusion for parallelism, we integrate permutation, fusion, distribution, and reversal into a comprehensive approach.
Wolf and Lam use unimodular transformations (permutation, skewing, and reversal) and tiling to improve data locality [WL91]. Their memory model is potentially more precise than ours because it directly calculates reuse across outer loops; however, it may be less precise because it ignores loop bounds even when they are known constants.

Our approach has several advantages over previous research [FST91, GJG88, LP92, WL91]. It is applicable to a wider range of programs because we do not require perfect nests or nests that can be made perfect. It is also quicker. Previous work generates all loop permutations or all unimodular transformations (a combination of permutation, skewing, and reversal), evaluates the locality of each legal alternative, and then picks the best; this process requires the evaluation of up to n! loop permutations (though n is typically small). In comparison, our approach performs only one evaluation step for each loop in the nest. The exhaustive approach also becomes impractical once fusion and distribution are integrated, since fusing for locality is by itself NP-hard [KM93]. Our extensive experimental results are also unique to this paper: we measure both the effectiveness of our approach and the inherent data locality characteristics of the programs.

It is difficult to compare our experimental results directly with Wolf and Lam's evaluation on the Perfect Benchmarks and the routines in Dnasa7 [Wol92], because their cache optimizations include tiling and scalar replacement, they report on a subset of our test suite, and their programs are executed on a different processor. We improve a few more programs and routines than they do; their optimizations degrade 6 programs/routines, in one case by 20% [Wol92], whereas we degrade only one program, Applu (see Section 5.7). In Wolf and Lam's experiments, loop reversal was seldom applied and skewing was never needed, even though both are implemented and their model can drive them [Wol92]. We therefore chose not to integrate skewing into our system.

3 Background

In this section, we characterize data reuse and present our data locality cost model.

3.1 Data Dependence

We assume the reader is familiar with the concept of data dependence [KKP+81, GKT91]. δ = {δ1, ..., δk} is a hybrid distance/direction vector that contains the most precise information derivable. It represents a data dependence between two array references, corresponding from left to right to the loops enclosing the references, from the outermost to the innermost loop. Dependences are loop-independent if the accesses to the same memory location occur in the same loop iteration, and loop-carried if the accesses occur on different loop iterations.

3.2 Sources of Data Reuse

The two sources of data reuse are temporal reuse, multiple accesses to the same memory location, and spatial reuse, accesses to nearby memory locations that share a cache line or a block of memory at some level of the memory hierarchy. (Unit-stride access is the most common type of spatial locality.) Temporal and spatial reuse may result from self-reuse from a single array reference or group-reuse from multiple references. Without loss of generality, we assume Fortran's column-major storage.

To simplify analysis, we concentrate on reuse that occurs between small numbers of inner loop iterations. Our memory model assumes there will be no conflict or capacity cache misses in one iteration of the innermost loop. (Lam et al. confirm that this assumption usually holds [LRW91].)

3.3 Reference Groups

Our cost model first applies algorithm RefGroup to account for group reuse.

RefGroup: Two references Ref1 and Ref2 belong to the same reference group with respect to loop l if:

1. Ref1 δ Ref2 (there is a dependence between them), and
   (a) δ is a loop-independent dependence, or
   (b) δl, the entry for loop l, is a small constant d (|d| <= 2) and all other entries are zero;

2. or, Ref1 and Ref2 refer to the same array and differ by at most d' in the first subscript dimension, where d' is less than or equal to the cache line size in terms of array elements, and all other subscripts are identical.

Condition 1 accounts for group-temporal reuse, and condition 2 detects most forms of group-spatial reuse. This formulation is more general than uniformly generated references [GJG88], but slightly more restrictive than Wolf and Lam's definition of group reuse, which admits references that access the same cache line on the same or different iterations of any loop [WL91].
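To make the RefGroup test concrete, the following Python sketch (an illustration added here, not code from the paper or from Memoria) applies the two conditions above. It assumes each reference is summarized as a pair (array name, tuple of constant subscript offsets) with the stride-one dimension first, and that dependences between references are available as integer distance vectors through a caller-supplied dist_of function; the names same_group and ref_groups are ours.

    def same_group(r1, r2, dist, loop, cls):
        # Condition 1: group-temporal reuse carried by a dependence that is
        # loop independent (all-zero distance vector) or whose only nonzero
        # entry is a small constant (|d| <= 2) in position `loop`.
        if dist is not None:
            if all(d == 0 for d in dist):
                return True
            if abs(dist[loop]) <= 2 and \
               all(d == 0 for i, d in enumerate(dist) if i != loop):
                return True
        # Condition 2: group-spatial reuse -- same array, identical trailing
        # subscripts, leading subscripts within one cache line of each other.
        (a1, s1), (a2, s2) = r1, r2
        return a1 == a2 and s1[1:] == s2[1:] and abs(s1[0] - s2[0]) <= cls

    def ref_groups(refs, dist_of, loop, cls):
        groups = []
        for r in refs:
            for g in groups:
                if any(same_group(r, prev, dist_of(r, prev), loop, cls) for prev in g):
                    g.append(r)
                    break
            else:
                groups.append([r])
        return groups

Only one representative from each group then needs to be passed to the cost functions of Section 3.4.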
3.4 Loop Cost in Terms of Cache Lines

Once we account for group reuse, we can calculate the reuse carried by each loop using the functions RefCost and LoopCost in Figure 1, which determine the total number of cache lines accessed when a candidate loop l is placed in the innermost position. To determine the cost in cache lines of a reference group, we select an arbitrary array reference with the deepest nesting from the group. Each loop l with trip_l iterations in the nest is considered as a candidate for the innermost position. Let cls be the cache line size in data items and stride be the step size of l multiplied by the coefficient of l's index variable in the first subscript. RefCost computes the number of cache lines the reference uses when l is innermost: 1 for loop-invariant references, trip_l / (cls/stride) for consecutive references, or trip_l for non-consecutive references. LoopCost then calculates the total number of cache lines accessed by all references when l is the innermost loop; it simply sums RefCost over all reference groups and multiplies the result by the trip counts of all the remaining loops. This method handles imperfectly nested loops, complicated subscript expressions, and nests with symbolic loop bounds [McK92].

The result reveals the relative amount of reuse between loops in the same nest and across disjoint nests; it also drives permutation, fusion, distribution, and reversal to improve data locality, as described in Section 4.

Figure 1: LoopCost Algorithm

  INPUT:
    L = {l_1, ..., l_n}       a loop nest with headers lb_l, ub_l, step_l
    R = {Ref_1, ..., Ref_m}   representatives from each reference group
    cls                       the cache line size in data items
    coef(f, i_l)              the coefficient of index variable i_l in subscript f (it may be zero)
    trip_l  = (ub_l - lb_l + step_l) / step_l
    stride  = | step_l * coef(f_1, i_l) |

  OUTPUT:
    LoopCost(l) = number of cache lines accessed with l as the innermost loop

    LoopCost(l) = ( sum_{k=1..m} RefCost( Ref_k(f_1(i_1,...,i_n), ..., f_j(i_1,...,i_n)), l ) ) * prod_{h != l} trip_h

    RefCost(Ref_k, l) =
        1                        if coef(f_1, i_l) = 0 and ... and coef(f_j, i_l) = 0                    (loop invariant)
        trip_l / (cls / stride)  if stride < cls and coef(f_2, i_l) = 0 and ... and coef(f_j, i_l) = 0   (consecutive)
        trip_l                   otherwise                                                               (no reuse)
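The following Python rendering of Figure 1 is a sketch under simplifying assumptions, not the implementation used in Memoria: a reference group's representative is described by a matrix coef[d][i] holding the coefficient of loop i's index variable in subscript dimension d (dimension 0 is the contiguous column dimension in Fortran), and step and trip give each loop's step size and trip count.

    def ref_cost(coef, step, trip, l, cls):
        # Cache lines accessed by one reference group with loop l innermost.
        dims = range(len(coef))
        if all(coef[d][l] == 0 for d in dims):
            return 1.0                              # loop invariant with respect to l
        stride = abs(step[l] * coef[0][l])
        if stride < cls and all(coef[d][l] == 0 for d in dims if d != 0):
            return trip[l] / (cls / stride)         # consecutive (small-stride) access
        return float(trip[l])                       # non-consecutive: a new line per iteration

    def loop_cost(group_reps, step, trip, l, cls):
        # Total cache lines accessed when candidate loop l is innermost.
        outer = 1
        for h in range(len(trip)):
            if h != l:
                outer *= trip[h]
        return sum(ref_cost(coef, step, trip, l, cls) for coef in group_reps) * outer

With cls = 4 and the three reference groups of matrix multiply, loop_cost ranks the I loop as the cheapest candidate inner loop, reproducing the memory order JKI used in Section 4.1.1.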
4 Loop Transformations

In this section, we show how the cost model guides the application of loop permutation, fusion, distribution, and reversal. Each subsection describes a test based on the cost model that determines when an individual transformation is profitable. Using these components, Section 4.5 presents Compound, an algorithm for discovering and applying legal compound loop nest transformations that attempt to minimize the number of cache lines accessed. All of these transformations are implemented in our translator.

4.1 Loop Permutation

To determine the loop permutation which accesses the fewest cache lines, we rely on the following observation: if loop l promotes more reuse than loop l' when both are considered as candidate innermost loops, then l will promote more reuse than l' at any outer loop position. We therefore simply rank the loops using LoopCost and order them from outermost to innermost so that LoopCost(l_{i-1}) >= LoopCost(l_i), i.e., the loop with the lowest cost (the most reuse) is innermost. We call this permutation with the least cost memory order. If the loop bounds are symbolic, we compare the dominating terms of the cost expressions.

To determine whether the permutation is legal, we permute the corresponding entries in the distance/direction vectors; if each result is lexicographically positive, the permutation is legal and we transform the nest. (Permutation is applied to perfect nests; in Section 4.5 we perform imperfect interchanges with distribution.) We define the algorithm MemoryOrder to compute memory order and Permute to achieve it when possible. When memory order is legal, as it is for 80% of the loop nests in our test suite, Permute simply transforms the nest. If it is not legal, Permute selects a legal permutation as close to memory order as possible by placing the outer loops in position first, building up lexicographically positive dependence vectors. Because most data reuse occurs on the innermost loop, positioning it correctly matters most; if the desired inner loop cannot be obtained legally, the next most desirable inner loop is positioned innermost if possible, and so on. If Permute cannot legally place a loop in a desired position, it tests the remaining loops for that position in order of desirability.

Complexity. When memory order is legal, Permute sorts the loops in O(n log n) time, where n is the number of loops in the nest. If it is not legal, selecting a legal permutation as close to memory order as possible takes at worst n(n-1) steps [KM92]. These steps are inexpensive; evaluating LoopCost for each loop in the nest is the most expensive part of the algorithm.

4.1.1 Example: Matrix Multiply

Consider matrix multiply in Figure 2. Algorithm RefGroup puts the two references to C(I,J) in the same reference group and places A(I,K) and B(K,J) in separate groups for all loops. Algorithm MemoryOrder uses LoopCost to select JKI as memory order: C(I,J) and A(I,K) exhibit spatial locality and B(K,J) is loop invariant when I is the innermost loop, so choosing I as the inner loop results in the fewest cache line accesses.

To validate our cost model, we gathered results for all possible permutations of matrix multiply (JKI, KJI, JIK, IJK, KIJ, IKJ) on 300 x 300 and 512 x 512 matrices; they appear in Figure 2. Consistent with our model, choosing I as the inner loop results in the best execution time, and memory order (JKI) always resulted in the best performance. Execution times vary by factors of up to 3.7 on the Sparc2, 6.2 on the i860, and 23.9 on the RS/6000. The impact of the inner loop on performance is dramatic because it determines how large a portion of the working set stays in cache; the disparity between loop organizations is greater on the 512 x 512 matrices than on the 300 x 300 matrices because less of the working set fits in cache. We performed this type of comparison on several more kernels with the same result: our algorithm correctly predicts relative performance and selects the loop organization with the best performance.

4.2 Loop Reversal

Loop reversal reverses the order in which the iterations of a loop execute and is legal if the dependences remain carried on outer loops. Reversal does not change the pattern of reuse, but it is an enabler: it may enable permutation to achieve better locality. We extend Permute to perform reversal as follows. If memory order is not legal, Permute tests whether reversing a loop makes it legal to place that loop in the desired position. Reversal did not improve locality in our experiments, therefore we do not discuss it further.
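A compact sketch of the Permute strategy, without the reversal extension, is given below. It is illustrative only: loop_costs holds LoopCost for each loop as a candidate inner loop, and is_legal is a caller-supplied test that a prefix of an ordering can still be completed into a permutation that keeps every dependence vector lexicographically non-negative.

    def memory_order(loop_costs):
        # Loops ordered outermost-to-innermost by decreasing LoopCost,
        # so the loop with the most reuse ends up innermost.
        return sorted(range(len(loop_costs)), key=lambda l: -loop_costs[l])

    def permute(loop_costs, is_legal):
        desired = memory_order(loop_costs)
        if is_legal(desired):
            return desired                      # memory order itself is legal
        # Otherwise build the nearest legal permutation outermost-first,
        # preferring the most desirable loop that still admits a legal completion.
        chosen, remaining = [], list(desired)
        while remaining:
            for l in remaining:
                if is_legal(chosen + [l]):
                    chosen.append(l)
                    remaining.remove(l)
                    break
            else:
                return None                     # no legal permutation exists
        return chosen

For matrix multiply, every permutation is legal, so this procedure simply returns JKI.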
Figure 2: Matrix Multiply

      DO J = 1, N
        DO K = 1, N
          DO I = 1, N
            C(I,J) = C(I,J) + A(I,K) * B(K,J)

(The figure also tabulates LoopCost, with cls = 4, for each reference group — C(I,J), A(I,K), and B(K,J) — with each loop as the candidate innermost loop, and plots execution times in seconds for all six loop orders on 300 x 300 and 512 x 512 matrices on the Sun Sparc2, Intel i860, and IBM RS/6000; the numeric entries are not recoverable from this copy.)

Figure 3: Loop Fusion

(a) Sample Fortran 90 loops for ADI integration

      DO I = 2, N
S1      X(I,1:N) = X(I,1:N) - X(I-1,1:N)*A(I,1:N)/B(I-1,1:N)
S2      B(I,1:N) = B(I,1:N) - A(I,1:N)*A(I,1:N)/B(I-1,1:N)

(b) Translation to Fortran 77

      DO I = 2, N
        DO K = 1, N
          X(I,K) = X(I,K) - X(I-1,K)*A(I,K)/B(I-1,K)
        DO K = 1, N
          B(I,K) = B(I,K) - A(I,K)*A(I,K)/B(I-1,K)

(c) Loop Fusion & Interchange

      DO K = 1, N
        DO I = 2, N
          X(I,K) = X(I,K) - X(I-1,K)*A(I,K)/B(I-1,K)
          B(I,K) = B(I,K) - A(I,K)*A(I,K)/B(I-1,K)

(The figure also lists LoopCost, with cls = 4, for the references to X, A, and B in each version; the numeric entries are not recoverable from this copy.)

4.3 Loop Fusion

Loop fusion takes multiple loop nests and combines their bodies into one loop nest. It is legal only if no data dependences are reversed [War84]. As an example of its effect, consider the Fortran 90 code fragment for ADI integration in Figure 3(a) and its scalarization into Fortran 77 in Figure 3(b). The scalarized code exhibits both poor temporal and poor spatial locality. The problem is not the fault of the programmer; instead, it is inherent in how the computation can be expressed in Fortran 90. Fusing the two K loops results in temporal locality for array B. In addition, the compiler is then able to apply loop interchange, significantly improving spatial locality for all the arrays. The result of this compound transformation appears in Figure 3(c).

4.3.1 Loop Fusion to enable Loop Permutation

Loop fusion may also indirectly improve reuse in imperfect loop nests by providing a perfect nest that enables a loop permutation with better data locality. In Figure 3(b), the cost model detects that interchange is desirable, since the LoopCost of the I loop is lower than that of the K loops, but memory order cannot be achieved because of the loop structure. We therefore test whether fusing all of the inner nests is legal and creates a perfect nest in which memory order can be achieved. Fusion thus serves two purposes: (1) to improve temporal locality, and (2) to fuse all inner loops, creating a nest that is permutable.

4.3.2 Loop Fusion to improve temporal locality

Loop fusion may improve temporal locality directly by moving accesses to the same cache line into the same loop iteration. Using the cost model, the compiler discovers this reuse between two nests by treating the statements as if they already were in the same loop body, i.e., fused. Two loop headers are compatible if the loops have the same number of iterations; two nests are compatible at level l if the loops at levels 1 through l are compatible and the headers are perfectly nested up to level l. Note that memory order for the fused loops may differ from that of the individual nests. To determine the profitability of fusing two compatible nests, we use the cost model as follows:

1. Compute RefGroup and LoopCost as if all the statements were in the same nest, i.e., fused.
2. Compute RefGroup and LoopCost for each candidate nest independently and add the results.
3. Compare the total LoopCosts.

If the fused LoopCost is lower, fusion alone will result in additional locality. As an example, fusing the two K loops in Figure 3 lowers the LoopCost for K from 5n^2 to 3n^2 (see the sketch following Section 4.3.3).

4.3.3 Loop Fusion Algorithm

Optimizing temporal locality for an adjacent set of n loops with compatible headers is NP-hard [KM93]. We therefore apply a greedy strategy based on the depth of compatibility. We build a DAG from the candidate nests: the edges are dependences between the loops, and the weight of an edge is the difference between the LoopCosts of the fused and unfused versions. We partition the nests into sets of compatible nests at the deepest levels possible. To yield the most locality, we first fuse nests with the deepest compatibility. Nests are fused only if fusion is legal, i.e., no dependences are violated between the loops or in the DAG. We then update the graph and fuse at the next level, until all compatible sets have been considered. This algorithm appears in Figure 4. Its complexity is O(m^2) time and space, where m is the number of candidate nests for fusion.
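The profitability test of Section 4.3.2 and the greedy strategy of Section 4.3.3 can be summarized with the following sketch, which simplifies the DAG-based Fuse algorithm of Figure 4 to adjacent pairs. The helpers fuse, legal, compatible, and nest_cost (the LoopCost of a nest with its best inner loop) are assumed to be supplied; this is not the paper's code.

    def fusion_profitable(n1, n2, fuse, nest_cost):
        # Fuse only when the fused nest is estimated to touch fewer cache lines.
        return nest_cost(fuse(n1, n2)) < nest_cost(n1) + nest_cost(n2)

    def greedy_fuse(nests, compatible, legal, fuse, nest_cost):
        # Left-to-right greedy fusion of adjacent, compatible, legal nests.
        result = [nests[0]]
        for n in nests[1:]:
            prev = result[-1]
            if compatible(prev, n) and legal(prev, n) and \
               fusion_profitable(prev, n, fuse, nest_cost):
                result[-1] = fuse(prev, n)
            else:
                result.append(n)
        return result

For the two K loops of Figure 3, fusion_profitable succeeds because the fused LoopCost for K (3n^2) is lower than the sum of the separate costs (5n^2).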
Figure 4: Fusion Algorithm

  Fuse(L)
  INPUT:     L = {l_1, ..., l_k}, nests that are fusion candidates
  ALGORITHM:
    build H = {H_1, ..., H_j}, H_i = {h_1, ..., h_m}, sets of compatible nests
      with depth(H_i) >= depth(H_{i+1})
    build dependence DAG G with edges and weights between the nests in L
    foreach H_i = {h_1, ..., h_m}
      for l_1 = h_1 to h_m
        for l_2 = h_2 to l_1
          if ((there is an edge (l_1, l_2) with weight > 0) and
              (it is legal to fuse l_1 and l_2))
            fuse l_1 and l_2 and update G
        endfor
      endfor
    endfor

Figure 5: Distribution Algorithm

  Distribute(L, S)
  INPUT:     L = {l_1, ..., l_m}, a loop nest containing statements S = {s_1, ..., s_k}
  ALGORITHM:
    for j = m-1 to 1
      restrict the dependence graph to dependences carried at level j or deeper
        and loop-independent dependences
      divide S into the finest partitions P = {p_1, ..., p_r} such that
        if s_r and s_t are in a recurrence, then s_r and s_t are in the same p_i
      compute MemoryOrder_i for each p_i
      if (there is an i such that MemoryOrder_i is achievable with distribution
          and permutation)
        perform distribution and permutation
        return
    endfor
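The "finest partitions" step in Figure 5 amounts to computing the strongly connected components of the statement-level dependence graph restricted to the relevant loop levels, since statements in a recurrence must remain together. The sketch below uses the networkx graph library for brevity (our choice, not the paper's); constructing the restricted dependence edges is assumed to be done by the caller.

    import networkx as nx

    def finest_partitions(statements, dep_edges):
        # dep_edges: (src, dst) statement pairs for dependences carried at the
        # levels of interest or loop-independent dependences.
        g = nx.DiGraph()
        g.add_nodes_from(statements)
        g.add_edges_from(dep_edges)
        sccs = list(nx.strongly_connected_components(g))
        # The condensation is acyclic, so its topological order is a legal
        # statement order for the distributed loops.
        order = nx.topological_sort(nx.condensation(g, sccs))
        return [list(sccs[i]) for i in order]

Each resulting partition is then tested independently to see whether distribution followed by permutation can place it in its memory order.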
4.3.4 Example: Erlebacher

The original hand-coded version of Erlebacher, a program that solves PDEs using ADI integration with 3D arrays, consists mostly of single-statement loops in memory order. We permuted the remaining loops into memory order, producing a distributed program version. Since the loops are fully distributed in this version, it resembles the output of a Fortran 90 scalarizer. We then applied Fuse to obtain more temporal locality without affecting the amount of distribution that still enables permutation. In Table 1, we measure the performance of the original program (Hand), the distributed version (Distributed), and the fused version (Fused).

Table 1: Performance of Erlebacher (in seconds). (The timing entries are not recoverable from this copy.)

Fusion is always an improvement (of up to 17%) over the hand-coded and distributed versions. Since each statement is in a separate loop, many variables are shared between loops. Permuting the loops into memory order increases locality within each nest but slightly degrades locality between nests, hence the degradation of the distributed version compared to the original. Even though the benefits of fusion are additive rather than multiplicative, as they are for loop permutation, its impact can be significant. In addition, its impact will increase as more programs are written with Fortran 90 array syntax.

4.4 Loop Distribution

Loop distribution separates independent statements in a single loop into multiple loops with identical headers. To maintain the meaning of the original loop, statements in a recurrence (a cycle in the dependence graph) must be placed in the same loop. Groups of statements that must remain in the same loop are called partitions. In our system, we use loop distribution only to improve temporal locality indirectly, by enabling loop permutation into memory order on a nest that is not otherwise permutable. (Distribution could also be effective if there is no temporal locality between partitions and the accessed arrays are too numerous to fit in cache at once, or if register pressure is a concern; we do not address these issues here.) Statements in different partitions may prefer different memory orders that become achievable after distribution.

The algorithm Distribute appears in Figure 5. It divides the statements into the finest-granularity partitions and tests whether that enables loop permutation into memory order for any of the partitions. For a nest of depth m, it starts with the loop at level m-1 and works out to the outermost loop, stopping as soon as it succeeds. The dependence test for loop permutation on a partition is performed by restricting the dependence structure to dependences on statements in that partition. We call Distribute only if memory order cannot be achieved on a nest and not all of the inner nests can be fused (see Section 4.5); we thus perform distribution only if it combines with permutation to improve the actual LoopCost. The algorithm's complexity is dominated by the time required to determine the LoopCost of the individual partitions. Section 4.5.1 gives an example.

4.5 Compound Transformation Algorithm

The driving force behind our application of compound loop transformations is the minimization of actual LoopCost by achieving memory order for as many statements in a nest as possible. Algorithm Compound in Figure 6 uses permutation, fusion, distribution, and reversal as needed to place the loop that provides the most reuse at the innermost position for each statement.

Compound considers adjacent loop nests. It first optimizes each nest independently and then applies fusion between the resulting nests when fusion is legal and temporal locality is improved. To optimize a nest, the algorithm begins by computing memory order and determining whether the loop containing the most reuse can be placed innermost. If it can, the algorithm does so and goes on to the next loop nest. Otherwise, it tries to enable permutation into memory order by fusing all inner loops to form a perfect nest. If fusion cannot enable memory order, the algorithm tries distribution. If distribution succeeds in enabling memory order, several new nests may be formed; these nests then become candidates for fusion.

Complexity. Ignoring Distribute for a moment, the complexity of the compound algorithm is O(n) + O(n^2) + O(m^2) time, where n is the maximum number of loops in a nest and m is the maximum number of adjacent nests in the original program. O(n) steps are needed to evaluate the locality of the statements in each nest, O(n^2) non-trivial steps result in the worst case when finding a legal permutation, and O(m^2) steps result from building the fusion problem. Because distribution produces more adjacent nests that are candidates for fusion, m includes the additional adjacent nests created by distribution. In practice, this increase was negligible; distribution never created more than 3 new nests.
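The decision order of Compound (its pseudocode appears in Figure 6 below) can be expressed as the following Python sketch. The primitives permute, inner_in_memory_order, is_perfect, fuse_inner (fuse all inner loops into a perfect nest, or return None), distribute, and fuse_adjacent are assumed to be supplied; the sketch illustrates the control flow only.

    def compound(nests, permute, inner_in_memory_order, is_perfect,
                 fuse_inner, distribute, fuse_adjacent):
        optimized = []
        for nest in nests:
            nest = permute(nest)
            if inner_in_memory_order(nest):
                optimized.append(nest)
                continue
            if not is_perfect(nest):
                fused = fuse_inner(nest)          # try to form a perfect nest
                if fused is not None:
                    fused = permute(fused)
                    if inner_in_memory_order(fused):
                        optimized.append(fused)
                        continue
            parts = distribute(nest)              # may enable memory order
            optimized.extend(fuse_adjacent(parts))
        return fuse_adjacent(optimized)           # finally, fuse across adjacent nests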
Groups of statements new nests may be formed. that must be in the same loop are called partitions. the statements into the finest partitions, only use loop distribution loop permutation to indirectly improve In our system we in different partitions may prefer different achievable after distribution. The algorithm for fusion to recover temporal reuse by enabling on a nest that is not permutable3. memory Complexity. Statements Ignoring of the compound orders that are Distn”bufe Since the distribution n is the maximum appears in algorithm is O(n) for a momen~ the complexity + 0(n2) 256 0(n2) + 0(m2) time, where number of loops in a nes~ and m is the maximum steps are needed to evaluate the locality nest. divides locality. distribution number of adjacent nests in the original %stribution could also be effective if there is no temporal locality between partitions and the accessed arrays are too numerous to tit in cache at once, or register pressure is a concern. We do not address these issues hem. algorithm these nests are candidates program. O(n) non-trivial of the statements in each simple steps result in the worst case when tinding a Figure 6: Compound Loop Transformation Figure 7: Cholesky Algorithm Factorization (a) {KJJ form] Compound(N) JNNm DO K=l,N A(LK) = SQKT(A(JLK)) DO I=K+l,N S1 ~.k }, adjacent loop neStS Af={nl,... A&tC) = A(JJC)/A(K,K) D(3 J=K+l J A(IJ) = A(J,J)-A(LK)*A(J,K) s~ ALGORITHM: fori=ltok Compute S3 MemoryOrder if (Permute(n, (n,) ) places inner loop in memory continue else if (7s, is not a perfect nest& adjacent loops rnj) if (FuseAll(mj,2) (b) J. {WI form} Loop Disti”bution & TriangulIzr Interchange & DO K=l~ A(K,K) = SQKT(A(lC,K)) DO I=K+I ,N A(LK)=A(t,K)/A(K,K) DO J=K,N DO I=J+l,N A&,I+l) = A(I,J+l)-A(tJC) *A(J+lJC) order) contains only and Permute(l) places inner loop in memory order) continue else if (Distribute(n, LoopCost ,1)) efs w Fuse(1) ~ A!&/ end for Fuse(Af) legal permutation problem. and 0(rn2) However, steps result from building because distribution nests that are candidates for fusion, a single application produces the fusion more adjacent m includes the addkional In practice, this increase was jacent nests created by dktribution. negligible; K n*n of dkmibution u S3 ad- 2n3 + total Consider Cholesky optimizing ure 7(a) with memory Factorization Compound. kernel LoopCost order for the nest is KJI, ranking in Fig- determines that the nests horn lowest 6 cost to highest (KJI, JKI, KJJ, IKJ, JIK, IJK). Because KJI cannot be achieved with permutation alone and fusion is of no help here, Compound calls Distribute. separate partitions 3 Since the loop is of depth 3, Distn”bute 0 at depth 2, the I loop. S2 and S3 go into starts by testing distribution u times (in seconds) 300 x 300 Factorization the Cholesky algorithm il }ras+ ?Z2 :n3 +n2 nz Execution Example: I l*n 1 never created more than 3 new nests. 4.5.1 J — I (there is no recurrence between them at level 2 ,p . .... I u - ........ . :.:. -” RS6000 .-. Sparc2 . . . ..i860 I I I IKJ JIK I JKI KM . . . . . . . . . . . . . . . .--. . . . .-— .... KIJ IJK or deeper). Memory order of Sq is also KJJ. Distribution of the I loop places S3 alone in a JJ nest where I and J may be legally interchanged into memory order, as shown in Figure 7(b). Note that our system handles tbe permutation of both triangular KMT93]. and rectangular Memoria Fortran programs nests. formance. To gather performance possible results for Choles~, loop permutations; they are all legal. 
5.2 Transformation Results

In Table 2, we report the results of transforming the loop nests of each program. Table 2 first lists the number of non-comment lines (Lines), the number of loops (Loops), and the number of loop nests (Nests) for each program; only loop nests of depth 2 or more are considered for transformation. The Memory Order and Inner Loop entries reflect the percentage of loop nests and inner loops, respectively, that are:

  Orig:  originally in memory order,
  Perm:  permuted into memory order, or
  Fail:  unable to achieve memory order.

The Memory Order entries count nests whose entire nest is in memory order; the Inner Loop entries count nests where the most desirable innermost loop is positioned correctly. The percentage of loop nests in memory order after transformation is therefore the sum of the Orig and Perm entries, and similarly for the inner loop.

Table 2: Memory Order Statistics. (For each program in the Perfect, SPEC, NAS, and miscellaneous suites, the table lists Lines, Loops, and Nests; the Memory Order and Inner Loop percentages Orig/Perm/Fail; the Loop Fusion candidates and applications C/A; the Loop Distribution applications and resulting nests D/R; and the LoopCost Ratio for the Final and Ideal versions. The individual numeric entries are not recoverable from this copy.)

Table 2 also lists the number of times the compound algorithm applied fusion and distribution. Either fusion, distribution, or both were applied to 22 of the 35 programs. In the Loop Fusion column, C is the number of candidate nests for fusion and A is the number of nests that were actually fused. Candidate nests for fusion were adjacent nests in which at least one pair of nests was compatible. There were 229 adjacent loop nests that were candidates for fusion; of these, 80 were fused with one or more other nests to improve reuse. Fusion was applicable in 17 programs, fusing 26 and 12 nests of depth 2 and 3, respectively. In Wave and Arc2d, fusion also enabled the algorithm to attain memory order, improving group-temporal locality for these programs.

In the Loop Distribution column, D is the number of loop nests distributed to achieve a better loop permutation and R is the number of nests that resulted. Compound applied distribution in 12 of the 35 programs; it applied distribution only when distribution enabled memory order to be achieved for a nest or for its innermost loop. On 23 nests, distribution enabled loop permutation to position the inner loop or the entire nest correctly, creating 29 additional nests. In Bdna, Ocean, Applu, and Su2cor, 6 or more nests resulted. The algorithm found no opportunities where loop reversal could improve locality.
In Bdna, Ocean, Applu and Su2cor 6 or more nests resulted. nes~ for fusion were adjacent nests, where at least one pair of nests were compatible. algorithm to attairt memory LoopCost to 258 Ratio in Table 2 estimates the potential for the final transformed program (Final) reduction in and the ideal program (Ideal). The ideal program every nest without in the implementation. By ignoring the best data locality versions, one could the average ratio LoopCost achieves memory regard to dependence constraints is listed. order for Figure 8: Achieving or Iiiitations Memory Order for Loop Nests correctness, it is in some sense achieve. of original For the final LoopCost and ideal to transformed These values reveal the potential for locality improvement. Memoria may not obtain memory loop permutation followed order due to the following: is illegal due to dependence; by permutation bounds are too complex, is illegal due to dependence; i.e., not rectangulm 20’%. of nests where the compiler (3) the loop or triangular. For the could not achieve memory der, 87% were because permutation by permutation (1) (2) loop distribution and then distribution or- followed could not be applied because of dependence <=20 con- straints. The rest were because the loop bounds were too complex. More sophkticated algorithms 5.3 dependence to lransform Coding Imprecise >= 70 >=80 >= 90 Percentage of Loop Nests in Memoxy Order may ena~ble the a few more nests. Figure 9: Achieving Memory Order for the Inner Loop Styles dependence for improvements analysis is a factor in limiting in our application analysis for the program for our algorithm data ‘locality U Original due to the use of index ~ Final Cgm cannot expose potential Mg3d is written style introduces the potential suite. For example, dependence because of imprecision arrays. The program coding testing tectilques >= 60 >=40 symbolics and again makes dependence with linearized arrays. This into rhe subscript analysis imprecise. expressions The inability to analyze the use of index arrays and linearized arrays prevents many optimization and is not a deficiency specific to our system, Other coding styles may also inhibit For example, LinpackdandMatrix300 optimization in our system. are written in amoduk-wstyle with singly nested loops enclosing function calls to routines which also contain singly nested loops. To improve programs written this style requires interprocedural optimization [CHK93, HKM91]; these optirnizations Many are not currently loop nests (69Yo) in the original in memory order, and even more the most reuse in the innermost scientific implemented programmers able to permute order, resulting This result indicates We illustrate in Figures memory however, Our compiler was sociative 9. to transform for data locality characterize Memona by of their nests and inner loops that are originally in order and that are transformed Figure 8, half of the original into memory older. of programs achieve memory Unfortunately, Our transformation order in the majority our abllily algorithms scakrized cache, the transformed significantly loop nests may be cpu-bound and the optimized contribute transformer. portions applications improvement (Arc2d, Dnasa7, success- or degradation optirnizations vector programs, with significant and Simple). are particularly since these programs the predicted improvements did not materialize programs. 
Figure 8: Achieving Memory Order for Loop Nests. (Histogram of the percentage of programs, original versus final, whose loop nests are in memory order, bucketed at <=20%, >=40%, >=60%, >=70%, >=80%, and >=90%.)

Figure 9: Achieving Memory Order for the Inner Loop. (Histogram of the percentage of programs, original versus final, whose inner loops are in memory order, using the same buckets.)

5.4 Successful Transformation

Many loop nests (69%) in the original programs are already in memory order, and even more (74%) have the loop carrying the most reuse in the innermost position. This result indicates that scientific programmers often pay attention to data locality; however, there are still many opportunities for improvement. Our compiler permuted an additional 11% of the loop nests into memory order, resulting in a total of 80% of the nests in memory order and a total of 85% of the inner loops in the memory order position.

Figures 8 and 9 illustrate our ability to transform programs for data locality by characterizing, for each program, the percentage of its nests and inner loops that are originally in memory order and that are transformed into memory order. As illustrated in Figure 8, half of the original programs have fewer than 70% of their nests in memory order. In the transformed versions, only 29% have fewer than 70% of their nests in memory order, and over half now have 80% or more of their nests in memory order. The results in Figure 9 are more dramatic: our transformation algorithms achieve memory order for one or more additional inner loops in 66% of the programs, and the majority of programs can be transformed such that 90% or more of their inner loops are positioned correctly for the best locality.

5.5 Performance Results

In Table 3, we present the performance of our test suite running on an IBM RS/6000 model 540 with a 64KB, 4-way set-associative data cache and 128-byte cache lines. Both the original program and the version produced by our automatic source-to-source transformer were compiled with the -O option using the standard IBM RS/6000 Fortran 77 compiler and executed on the RS/6000. For the applications not listed in Table 3, no performance improvement or degradation occurred.

Table 3 shows several applications with significant improvements (e.g., Arc2d, Dnasa7, and Simple). These results indicate that data locality optimizations are particularly effective for scalarized vector programs, since such programs are structured to emphasize vector operations rather than cache-line reuse. However, the predicted improvements did not materialize for many of the other programs, for several reasons: the data sets for benchmark programs tend to be small enough to fit mostly in cache, the transformed loop nests may be cpu-bound rather than memory-bound, and the optimized portions of a program may not contribute significantly to its overall execution time.

To explore these results, we simulated cache behavior to determine cache hit rates for our test suite. We simulated cache1, an RS/6000 cache (64KB, 4-way set associative, 128-byte cache lines), and cache2, an i860 cache (8KB, 2-way set associative, 32-byte cache lines). We determined the change in hit rates both for just the optimized procedures and for the entire program; the resulting hit rates are presented in Table 4. Places where the compiler affected cache hit rates by 0.1% or more are emboldened for emphasis. For the Final columns, we chose the better of the fused and unfused versions for each program.
performance applu because of the presence of and without improved and Erlebacher whole by 0.51 We were surprised applying with fusion by 5.3% on the subroutine Linpackd’s routine fusion also lowered hit rates Track, Dnasa7, and Wave; the degradation be due to added cache conflict u rnatgen and by 0.02% for the entire program. Ma~gen is an initialization whose performance is not usually measured. Unfortunately, To recognize hh and ().24~0, to improve reuse at the innermost array references that interfere correct thk deficiency We intend 5.6 though duect comparisons with cache optimization when compared to Wolf’s on a DECstation 5000 with a 64KB results show performance which degradations transformations programs in execution did not degrade performance and performance of Arc2d time. improvements on Btri.x, Gmtry, and Vpenta. Our 96.04 simple 91.0 99.1 84.3 93.7 97.35 9934 93.33 95.65 98.2 99.9 82.9 95.9 99.74 99.82 87.31 88.09 ,, and by a slight amount on the i860. or degrade either kernel. More direct u We neither comparisons are times were measured on different architectures. Data Access Properties (Refs/Group), temporsl improved. Wolf ,,-96.04 that we significantly im- in Table 5 .4 We report the ideal memory order, and final of Reference Group classifies constructed pardy or completely using group-spatial reuse. The amount of group reuse is indicated by measuring the average number of references in each RefGroup where a RefGroup reuse and occasionally size greater than 1 implies group-spatial group- reuse. The amount of group reuse is presented for each type of self reuse and their av - Our results on the routines in Dnasa7 are similar to Wolf ‘s, both showing 99.65 wave tains the percentage of RefGroups His on any of the Perfect was significantly 92.1 the percentage of RefGroups displaying each form of self reuse as invariant (Inv), unit-stride (Unit), or none (None). (Group) con- or no change in all but Adm, showed a small (1 VO) improvement 91.6 we present the data access properties tiling cache. 99.65 99.8 H u II 99.3 locality statistics for the original, versions of the programs. Locality only relative to direct-map 88.76 95.92 98.34 93.28 81.67 170.41 [81.11 195.25 proved on the RS/6000 (Arc2d, Dnusa7, Appsp, Simple and Wave), programs widt scalar replacement ~o192]. Wolf applied permutation, skewing, reversal and tiling to the Perfect Benchmarks and Dnasa7 96.95 93.80 “ — — 93.72 L 98.79 93.78 97.54 96.40 “ appsp ties for our test suite. For the applications to results, are dh%cult because he combmes and reports improvements 100 96.3 87.4 98.7 92.8 L To further interpret our results, we measured the data access proper- in the future. Our results are very favorable 99.98 99.97 97.02 98.77 98.77 93.84 — — — — — — 99.36 99.36 93.71 100 100 99.83 99.83 98.85 100 100 99.28 99.28 93.79 10Q 100 99.81 99.81 97.49 93.7 93.7 99.92 99.92 96.43 SPEC Benchmarks 54.5 73.9 99.26 99.27 85.45 95.5 95.5 99.77 99.77 95.92 100 100 99.99 99.99 98.34 90.2 91.9 98.36 98.48 92.77 91.6 92.1 93.26 93.26 81.66 1199.2199.81198 .83198,831170.41 I[1OO 1100 II1198.83198.841181.00 , , ,3 II 87.3187.3 II 99.20 199.201[95.26 NA.VRenchmrk~ 100 96.7 87.4 95,3 93.0 mmid execution loop level, it may sometimes merge cache. 
Our results compare favorably with Wolf's results for cache optimization on a DECstation 5000 with a 64KB direct-mapped cache [Wol92], though direct comparisons are difficult because Wolf combines tiling with permutation, skewing, and reversal, reports improvements relative to programs with scalar replacement, does not present cache hit rates, and measured execution times on different architectures. Wolf applied permutation, skewing, reversal, and tiling to the Perfect Benchmarks and to the routines of Dnasa7. For the applications that we significantly improved on the RS/6000 (Arc2d, Dnasa7, Appsp, Simple, and Wave), Wolf showed a small (1%) improvement or no change in all but Adm, and his transformations did not degrade the performance of Arc2d. Our results on the routines in Dnasa7 are similar to Wolf's, both showing improvements on Btrix, Gmtry, and Vpenta. Wolf improved Mxm by about 10% on the DECstation but slightly degraded its performance on the i860, and he slowed Cholesky by about 10% on the DECstation and by a slight amount on the i860; we neither improve nor degrade either kernel.

5.6 Data Access Properties

To further interpret our results, we measured the data access properties of our test suite. For the applications that we significantly improved on the RS/6000 (Arc2d, Dnasa7, Appsp, Simple, and Wave), we present the data access properties in Table 5; the properties for all of the programs appear elsewhere [CMT94]. We report data locality statistics for the original, final, and ideal versions of each program. Locality of Reference Groups classifies the percentage of RefGroups that display each form of self reuse in the inner loop: loop invariant (Inv), unit stride (Unit), or none (None). The Group column gives the percentage of RefGroups constructed partly or completely using group reuse. The amount of group reuse is indicated by the average number of references in each RefGroup (Refs/Group); a RefGroup size greater than 1 implies group-temporal reuse and occasionally group-spatial reuse. The LoopCost Ratio columns estimate the potential improvement as an average (Avg) over all the nests and as a weighted average (Wt) that takes nesting depth into account.
are able to achieve unit-stride is therefore in a form that she or he understands, machine-dependent be permuted. some of the invarisnt across rows, resultiig most loop. The programmer and time-step loops performance Apphr suffers from reuse Below, we examine Arc2d, Simple, Gtntry (three of the applications tion in performance. effectively a tiny accesses in the itmer- allowed to non-unit 6 solver from the Perfect benchmarks. degradation in performance routines exhibit stride accesses. The poor cache performance main computational and two with nesting depth 3. Our algorithm factor of 6 improvement Permuting due tertn cache-liie loop is an better performance unit- Tiling in the two loops with nesting depth 3. improvement illustrated on the RS/6000. Locality within the cart apply loop and loop interchange, WL91, to capW0187]. because it affects scalar opti- Our cost model provides primary criterion us with the key in- for tiling is to create references with respect to the target loop. These reffewer cache lines than both consecutive access significantly It contains and non-consecutive references, making tiliig worthwhile despite the potential loss of spatial reuse at tile boundaries. For machines form (i.e., a recurrence with long cache lines, it may also be advantageous to tile outer loops hydrodynamics two loops that are written in a “vectorizable” short- Assuming increases loop overhead, and may decrease spatial reuse loop-invariant loop order for performance. is a two-dimensional must be applied judiciously sight to guide tiliig-the erences large, the compiler of strip-mining estimated of inner loops. reuse at outer loops [IT88, LRW91, at tile boundaries. in Tnble 3 is order maximizes reuse across iterations a combination mization, attained similarly by improving less time-critical routines. Our optimization strategy obviated the need for the programmer to select Simple loops into memory ture long-term alone accounted for a factor of 1.9 on the whole The additional the “correct” forunit-stride give the original loops is not a problem. that the cache size is relatively tiling, is able to achieve a on the main loop nest by attaining stride accesses to memory application. (l%). Tiling The loop nest with four inner loops, two with nesting, depth 2 This improvement handles the of the main data arrays are very small better performance two innermost to write the code whtle the compiler We note specific coding styles that our system Arc2d is a fluid-flow imperfect reductions with a degrada- ported to the RS/6000. main computational Al- details. (5 x 5). While ourmodelpredicts (the only application can it is irn- though this may have been how the author viewed Gaussian elimination conceptually, it translated to poor performance. Distribution Programs and Applu Finally, was allowed to write the code in access to the amays, the small array dmensions that we improved) the potential is carried by the Gmtry, a SPEC benchmark kernel fkom Dnasa7, performs Gaus- The two leading dimensions of Individual shown in Table 3. In this case, a form for one type of machine and still attain machine-independent carried by outer loops. 5.7 Wt portartt to note that the programmer strat- or final. Invariant and cannot Avg parallelism Although compute Avg 1.25 in cache performance access on reuse. 
The ideal program exhibits occurs on loops with reductions 1.26 1.00 1.00 1.16 1.10 1.07 1.08 1.09 1.02 1.85 1.00 1.00 1.27 1.02 1.01 1.15 1.05 1.03 loss in low-level row in Table 5 reveals that on average fewer reuse typically Ratios None the improvements effort. more invariant reuse than the original Unit 1.23 1.34 1.31 1.48 1.48 1.27 1.04 1.03 1.03 2.25 2.26 2.27 1.48 1.55 1.55 1.26 1.27 1.26 ization to achieve the improvements program. than two references exhibited group-temporal reuse in the inner loop, and no references displayed group-spatial reuse. Instead, most programs exhibit self-spatial Irrv 1.53 2.12 1.72 1.41 1.33 1.61 0 0 1.49 0 0 1.50 1.95 2.00 1.63 1.53 1.52 1.23 loops exhibited poor cache performance. Compound reorders these loops for data locality (both spatial and temporal) rather than vector- a variety of coding styles can be additional The all programs ; 0 0 0 0 the totals we have shown that our optimization egy makes this unnecessary. ntn efficiently : 0 0 0 0 0 0 0 0 0 0 long cache lines on the RS/6000, spatial can make the effort RS/6GO0 applications, Group had a sig- over the original was the key to getting good cache performance. progratrtrners None 44 20 20 47 35 28 62 51 48 7 2 2 47 28 27 LoopCost Refs/Gmup % Groups code. is carried by the outer loop rather than the innermost loop). These 261 if they carry many unit-stride references, D. Gannon, W. Jalby, and K. Gallwan. [GJG88] such as when ~ansposing local memory a matxix. In tJte future, we intend to study the cumulative effects of optimization presented in this paper widr tiling, unroll-and-jam, Journal [GKT91] 7 Conclusion This paper presents a comprehensive locality and is the first to combine approach to improving loop permutation, in practice, More importantly, the compiler’s making to exploit presented in this paper validate [HKM91] The empirical but particularly machine-independent 5(5):587-616, Practicaf dependence Conference Design andlmplementation, on Pro- Toronto, Canada, and K. S. McKinley. Interproceduraf for parallel code generation. 91, Albuquerque, and R. Triolet. In Proceedings NM, November Supemode partitioning. of 1991. In Proceed- ings of the Fi)leenth Annual ACM Symposium on the Principles of Programming [KKP+ 81] D. Kuck, Languages, R. Kuhn, Dependence San Diego, CA, January 1988. D. Padua, B. Leasure, graphs and compiler of Programming with [KM92] programming. K. Kennedy Languages, Conference In Conference Symposium on the Principles Williamsburg, and K. S. McKinley. and data locality. and M. J. Wolfe. optimization. Record of the Eighth AnnuaiACM We believe good performance of the SIGPL4N’91 F. Irigoin [IT88] results target architecture, 90 programs. step towards achieving and C.-W. Tseng. M. W. Hall, K. Kennedy, the accuracy of our cost model and for vector and Fortran this is a significant G. Goff, K. Kennedy, Supercornpding’ algorithms for selecting the best loop structure for data locality. In addition, they show this approach has wide applicability for existing Fortran programs regardless of their original Computing, testing. Jn Proceedings transformations in our model is not a factor in data locality. and DLstribtied June 1991. fusion, distri- them ideal for use in a compiler. the imprecision ability Strategies for cache and by global program transformation. 1988. gram Language data bution, and reversal into an integrated algorithm. Because we accept some imprecision in the cost model, our algorithms are simple and inexpensive of Parallel October and scalar replacement. 
[AS79] W. Abu-Sufah. Improving the Performance of Virtual Memory Computers. PhD thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign, 1979.

[Car92] S. Carr. Memory-Hierarchy Management. PhD thesis, Dept. of Computer Science, Rice University, September 1992.

[CCK88] D. Callahan, J. Cocke, and K. Kennedy. Estimating interlock and improving balance for pipelined machines. Journal of Parallel and Distributed Computing, 5(4):334-358, August 1988.

[CCK90] D. Callahan, S. Carr, and K. Kennedy. Improving register allocation for subscripted variables. In Proceedings of the SIGPLAN '90 Conference on Program Language Design and Implementation, White Plains, NY, June 1990.

[CHH+93] K. Cooper, M. W. Hall, R. T. Hood, K. Kennedy, K. S. McKinley, J. M. Mellor-Crummey, L. Torczon, and S. K. Warren. The ParaScope parallel programming environment. Proceedings of the IEEE, 81(2):244-263, February 1993.

[CHK93] K. Cooper, M. W. Hall, and K. Kennedy. A methodology for procedure cloning. Computer Languages, 19(2):105-117, February 1993.

[CK94] S. Carr and K. Kennedy. Scalar replacement in the presence of conditional control flow. Software—Practice and Experience, 24(1):51-77, January 1994.

[CMT94] S. Carr, K. S. McKinley, and C.-W. Tseng. Compiler optimization for improving data locality. Technical Report TR94-234, Dept. of Computer Science, Rice University, July 1994.

[FST91] J. Ferrante, V. Sarkar, and W. Thrash. On estimating and enhancing cache effectiveness. In U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing, Fourth International Workshop, Santa Clara, CA, August 1991. Springer-Verlag.

[GJG88] D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformation. Journal of Parallel and Distributed Computing, 5(5):587-616, October 1988.

[GKT91] G. Goff, K. Kennedy, and C.-W. Tseng. Practical dependence testing. In Proceedings of the SIGPLAN '91 Conference on Program Language Design and Implementation, Toronto, Canada, June 1991.

[HKM91] M. W. Hall, K. Kennedy, and K. S. McKinley. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91, Albuquerque, NM, November 1991.

[IT88] F. Irigoin and R. Triolet. Supernode partitioning. In Proceedings of the Fifteenth Annual ACM Symposium on the Principles of Programming Languages, San Diego, CA, January 1988.

[KKP+81] D. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. J. Wolfe. Dependence graphs and compiler optimizations. In Conference Record of the Eighth Annual ACM Symposium on the Principles of Programming Languages, Williamsburg, VA, January 1981.

[KM92] K. Kennedy and K. S. McKinley. Optimizing for parallelism and data locality. In Proceedings of the 1992 ACM International Conference on Supercomputing, Washington, DC, July 1992.

[KM93] K. Kennedy and K. S. McKinley. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In Proceedings of the Sixth Workshop on Languages and Compilers for Parallel Computing, Portland, OR, August 1993.

[KMT93] K. Kennedy, K. S. McKinley, and C.-W. Tseng. Analysis and transformation in an interactive parallel programming tool. Concurrency: Practice & Experience, 5(7):575-602, October 1993.

[LP92] W. Li and K. Pingali. Access normalization: Loop restructuring for NUMA compilers. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, MA, October 1992.

[LRW91] M. Lam, E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, April 1991.

[McK92] K. S. McKinley. Automatic and Interactive Parallelization. PhD thesis, Dept. of Computer Science, Rice University, April 1992.

[War84] J. Warren. A hierarchical basis for reordering transformations. In Conference Record of the Eleventh Annual ACM Symposium on the Principles of Programming Languages, Salt Lake City, UT, January 1984.

[WL91] M. E. Wolf and M. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Program Language Design and Implementation, Toronto, Canada, June 1991.

[Wol87] M. J. Wolfe. Iteration space tiling for memory hierarchies, December 1987. Extended version of a paper which appeared in Proceedings of the Third SIAM Conference on Parallel Processing.

[Wol91] M. J. Wolfe. The Tiny loop restructuring research tool. In Proceedings of the 1991 International Conference on Parallel Processing, St. Charles, IL, August 1991.

[Wol92] M. E. Wolf. Improving Locality and Parallelism in Nested Loops. PhD thesis, Dept. of Computer Science, Stanford University, August 1992.