Improving the Ratio of Memory operations to Floating-Point Operations in Loops STEVE CARR Michigan Technological Unwerslty and KEN KENNEDY Rice University Over the past tional power decade, floatmg-pomt computational cycles memory It to write program to a target is our mance and memory statically to have that target architectures, memory and because A this a better per Idle these balance 1s a step m the wrong programmer instead, cache To overcome ensures In our view, from are plpelmed, are executed that the computa- data operatmns style machme-specific should the task not be of speclahzmg a to the compder. optimizing found their paper compder way mto we attempt automatically to elimmate between more a coding a sophisticated In this and use these on mcreaslng reqmre for each new machme, techmques seek focused operations, be left that have often apphcatlons use more hierarchy? and program Can tricks the balance particular scientific version should practical? on specific accesses when learned programs machme evaluate performance can dehver and floatmg-pomt makes of the memory develop a machme have of programming strateges computations a new program v.ew design Because are common references because requu-ed But than programmers dmectlon myriad chip operation bottlenecks, between microprocessor on a single to answer restructure methods attempt or reduce pipeline interlock operations need the that question loops to balance to apply the perfor- To do so, we high computation operations whether for to achieve To do thin, and floating-point to determme the to improve program These estimates obviate practice they and estimate for each loop m a various loop transfor- mations, Experiments kernels, and with our automatic Additionally, the between the application of the the balance achieved Categories and Subject techniques estimate show of the balance estimate are by our Descriptors: very automatic that Integer-factor between speedups memory operations accurate—experiments system and that D 3,4 [Programming are possible reveal possible httle by hand Languages] on and computation, Processors difference optlmlzatlon. —compz[ers, optzmzzatzon General Terms: Addltlonal This IBM. Languages Key Words research and was supported Authors’ address: Umvers,ty, Houghton, Permission to copy without not made of the or distributed publication Assoclatlon MI and Phrases by NSF grant unroll-and-jam CCR-9120008, ONR S, Carr, Department of Computer 49931; K. Kennedy, CITI, Rice fee all or part for direct Its date for Computmg Balance, of this commercial appear, Machinery specific permission. c, 1994 ACM 0164-0925/94/1100-1768 ACh%T,an~act~.n~ “n Pmgranumng Languag,, and grant is granted advantage, the ACM notice 1s ~ven Mich]gan University, material To copy otherwise, NOO014-91-J-1989, Science, that Houston, provided copyright copying or to repubhsh, TX 77251-1892, that notice the copies Is by permission reqmres VO1 lh NO 6 N., ember 1994, are and the title of the a fee and\or $03.50 and Sj+twns and by Technolo~cal Pa~,s 176 X-181O Improving the Ratio of Memory Operations 1769 . 1. INTRODLICTION Over the past decade, the computer industry has realized ments in the power of microprocessors. These gains have dramatic improvebeen achieved both by cycle-time improvements and by architectural innovations like instruction issue and pipelined functional units. As a result of these tural improvements, tions per machine But with these today’s microprocessors cycle than gains have can perform their predecessors. come problems. Because many multiple architec- more the latency opera- and band- width of memory systems have not kept pace with processor speed, computations are often delayed waiting for data from memory. As a result, processors see idle computational cycles and empty pipeline stages more frequently. In fact, memory and pipeline delays have become the principal performance bottleneck for high-performance microprocessors. To overcome these performance problems, programmers employ a coding style that ences and floating-point eliminate pipeline floating-point achieves operations. interlock operations a good balance The goal of this and to match in program the ratio have between coding to refer- methodology of memory loops to the optimum learned memory is to operations such ratio to that the target machine can deliver. This is usually done by unrolling outer loops and merging the resulting copies of inner loops. Unfortunately, this coding method almost always leads to very ugly code. Additionally, this method is error prone, tedious, and time consuming. The thesis of this paper is that most of this type of hand optimization is unnecessary, because it can, compiler. Having the compiler two ways. First, it avoids and should, be performed automatically in a optimize loops is better for the programmer in a lot of work that is not necessary for the expression of a correct solution. Second, it makes it possible for the same program to be used on a variety of machine architectures. To establish this thesis, the paper presents new techniques to restructure program loops automatically for high performance on specific target architectures. A method to estimate statically the performance of a given loop on a particular architecture is presented and used in an algorithm to determine how to transform a loop to best improve its performance. This approach extends previous work by showing how to decide automatically when and how to transform a loop. The transformation system is tailored to a specific target machine based on a few parameters of that machine, such as number of registers, floating-point operations transformations are machine retargeted to new processors described in this and experiments per cycle, independent by changing and paper has been implemented to determine its effectiveness paper concludes with these experiments. a report loads per cycle. in the sense that a few parameters. on the performance Thus, they The in an experimental have been carried improvements the can be method compiler, out. The observed in 2. BACKGROUND This section lays the foundation tions that improve the memory ACM TransactIons for the application of reordering transformaperformance of programs. It begins with on Programmmg Languages and Systems, Vol. 16, No, 6, November a 1994 1770 . measure Steve Carr and Ken Kennedy of machine and loop performance that will be used in the application of the transformations described in the next section. Then, a form dence graph used by the transformation system to model memory presented. 2.1 Performance Model It is assumed that chronous execution IBM RS/6000 has a typical only. of depenusage is the target of memory or HP PA-RISC). optimizing In particular, architecture is pipelined accesses and floating-point It is also assumed compiler—one the compiler ters globally (via a typical tion and the arithmetic that performs coloring pipelines. that the performs strength scheme) This and allows operations reduction, and makes target scalar schedules machine optimizations allocates both it possible asyn(e.g., the regisinstruc- for the transforma- tion system to restructure ing the loop code to the the loop nests, while leaving the details of optimizcompiler for the target machine. To measure the performance of program notion of balance defined loops given by Callahan 2.1.1 Machine steady state Balance. manner the above assumptions, et al. [ 1988]. A computer with both is balanced memory when accesses and tions being performed at peak speed. To quantify defined as the rate at which data can be fetched pared to the rate at which floating-point max words/cycle Alf = The values of MM and FM represent peak it can operate the in a opera- relationship, /3~1 is memory, il/l~[, com- can be performed, FAI: = Mh[ max flops/cycle word is the same as the precision machine has at least one intrinsic use floating-point this from operations we = FII “ performance where the size of a of the floating-point operations. Every ~lf. For example, on the IBM RS\6000 PM = l.l 2.1.2 Balance Loop Balance. Just as machines for a specific loop can be defined pL It is assumed that = number references have as balance of memory references number of flops = F to array variables ratios, so do loops. = M are (1) actually references to memory, while references to scalar variables involve only registers. Memory references are assigned a uniform cost under the assumption that loop interchange, tiling, and cache copying can be performed to attain cache locality [Kennedy and McKinley 1992; Lam et al. 1991; Meadows et al. 1992; Wolf and Lam 1991]. Comparing /3~ to & gives a measure of the performance of a loop running on a particular 1Note that multlply-add architecture. we count floating-point instruction A@M Transactions counts on Programming If ~~ > /3~, then instruction issues the loop needs rather than actual data operations. at a higher Therefore, as one operation, Languages and Systems, Vol 16, No 6, November 1994 a Improving the Ratio of Memory Operations rate than the machine can provide, and idle Such a loop is said to be memory bound, by lowering fi~. If /3~ < ~~, then data supplied to the processor, said to be compute bound. rate of a machine point operations computational for exist. is and need not be further balanced. This is because floatingcannot usually be removed, and arbitrarily increasing the ~L= order will and idle memory cycles will exist. Such a loop Compute-bound loops run at the peak floating-point 2.1.3 Pipeline Interlock. Previous measure of pipeline interlock in the formulation of balance is as follows: caused cycles and its performance can be improved cannot be processed as fast as it is number of memory operations is unlikely to be beneficial. the loop is balanced for the target machine. In 1771 . balance to relate by pipeline interlock how to use unroll-and-jam interlock using Equation does not explicitly F + more must Finally, if ~~ = ~~, work in loop balance has included a computation of balance. The complete M idle cycles “ closely (2) to performance, be removed. Callahan idle cycles et al. [1988] the show explicitly to remove idle cycles due to pipeline (2). The method described in Section 3 in this paper remove interlock. Instead, interlock is implicitly removed by unrolling to create balanced loops using Equation (1). It is assumed that unrolling to balance a loop will create enough copies of an innerloop recurrence to remove all interlock. The experiments in Section 4 validate this assumption unrolling, amounts for short pipelines. the Callahan et al. until the handled separately removal in more 2.2 Dependence A dependence interlock However, technique is removed. and only when Interlock necessary. exists between antidependence writes —an output —an input ignored, after unroll but 3.2.4 describes rather interlock Graph two references if the if there exists a control-flow to the second, and both references 1978]. The dependence is —a true dependence or flow dependence location and the second reads from it, second is not Section interlock increase detail. path from the first reference same memory location [Kuck —an if there is still can be used to first reference if the first reads reference from the access writes location the to the and the to it, dependence dependence if both if both references references write read to the location, from and the location. u and w, are contained in n common loops, separate If two references, instances of the execution of the references can be described by an iteration vector. An iteration vector, denoted ~ is simply the values of the loop control variables of the loops containing u and w. The set of iteration vectors corresponding to all iterations of the loop nest is called the iteration space. ACM Transactions on Programmmg Languages and Systems, Vol 16, No 6, November 1994 1772 . Steve Carr and Ken Kennedy DO 10 I = I,N DO 10 J = l,N DO 10 K = I,N A(I+I, 10 J,K-1) = A(I, J,K) + C(I a t 1 (=,=,*) J) (<,=,>) (1,0,-1) Fig. Using iteration fined for vectors, each dependence: 1. Example a distance if accesses location Z on iteration .>this definition, z ~ . Under l—[1’ u dependence vector, graph, d= (dl, d2,..., dn), can be de- accesses location Z on iteration ~, and vector for this dependence i~U, the distance the k th component of the distance vector w is is equal to the number of iterations of the k th loop (numbered from outermost to innermost) between accesses to Z. The direction uector D = ( D1, . . . . D,) of a dependence is defined by the equation: These symbols can be combined < > , and # . If all direction when more than one condition is true to give vectors are possible for a dependence, then its ~r’ection is denoted as *. The elements are always displayed in order left to right, from the outermost to the innermost loop in the nest. As an example, consider Figure 1. The distance and direction vectors for the dependence between the definition and use of array A are (1, O, – 1) and ( <, = , > ), respectively. The direction vector for the dependence from C to itself is ( = , = . *). There is no single distance vector for this dependence. The loop associated with the outermost nonzero direction (or vector entry is said to be the carrier of the dependence. The distance) threshold of a carried dependence is the value of the distance vector entry for the carrier. A minimum threshold equal to the minimum possible positive distance vector value is associated with those dependence having no single distance vector value in the outermost position. If all entries are O or = , the dependence is anti Its threshold M O. In this paper, we consider only those loop independent, dependence that have a consistent threshold-that is, those dependence for which the threshold is constant throughout the execution of the loop—for memory reuse [Callahan et al. 1988; Gannon et al. 1987]. For a dependence to have a consistent threshold, the location accessed by the dependence source on iteration i must be accessed by the sink on iteration i + c, where c does L. A dependence with a consistent threshold is called a not vary with consistent dependence; otherwise, it is called an inconsistent dependence. For the purposes of this paper, only those consistent dependence whose source ACM Transact,oni on Programming Languages and Systems, Vol 16, No 6. November 1994 Improving the Rat[o of Memory Operations and sink have considered only one for memory dled elsewhere induction reuse. [Carr variable in each subscript Multiple-induction-variable 1773 . position subscripts are are han- 1992]. 3. METHOD Conceptually, the transformation computed. Then, method is simple. for each memory-bound For each loop nest, loop, transformations try to make ~~ < &I. This section presents two transformations reduce fl~. The first transformation reduces the number of memory in an innermost loop. The second introduces into the innermost loop, keeping the loop-nest the overall memory requirements more total of the entire ~~ is are applied to used to accesses floating-point operations constant, while reducing loop nest. 3.1 Scalar Replacement The number references of memory to arrays references with in sequences a loop can of scalar be lowered variables. In by replacing the code shown iteration of the loop below, DO IO 1=2, N 10 A(I) =A(I – l)+ the value accessed B(I) by A(1 – 1) is defined on the previous Using this knowledge, the flow of values by A(1) on all but the first iteration. between the references can be expressed with scalar temporaries as follows. T= A(l) D0101=2, N T= T+ B(l) 10 A(I)=T Since the values held in scalar of A(1 – 1) has been load cycle requirements quantities removed, of the loop will resulting [ Chaitin probably be in registers, in a reduction et al. 1981]. This in the transformation called 1988; scalar replacement and is described in detail elsewhere [Callahan 1990; Carr and Kennedy 1994; Dehnert et al. 1989; Duesterwald 1993; Rau 1991; Rau et al. 1992]. Essentially, any the memory consistent true is et al. et al. or input vector entries is amenable to dependence with O’s in the outer n – 1 distance scalar replacement. Additionally, any array reference that is invariant with respect to the innermost loop may be removed by scalar replacement. 3.2 Unroll-And-Jam Unroll-and-jam is a transformation that can be used in conjunction with scalar replacement to improve the performance of memory-bound loops [Aiken and Nicolau 1987; Allen and Cocke 1972; Callahan et al. 1988]. The transformation unrolls an outer loop and then jams the resulting inner loops back together.z z Note capture that Using in scalar loop-carried be done to improve ACM unroll-and-jam, replacement reuse. the innermost Unroll-and-jam scheduling Transactions more and scalar on Programming computation loop is separate can be introduced is unrolled to remove any copies from innerloop unrolling any into an needed that to might replacement. Languages and Systems, Vol 16, No 6, November 1994 1774 . Steve Carr and Ken Kennedy innermost loop body without For example, the loop a proportional increase in memory references. DO1OI=1,2*M DOIOJ=I, N 10 A(1) = A(1) + B(J) after unroll-and-jam of I by a factor DO IO I= I,2KM,2 DO IO J=I, N A(1) = A(1) + B(J) 10 A(I+l)=A(I+l)+ In the original loop, B(J), are left after unroll-and-jam, of 1 becomes:3 B(J) one floating-point scalar operation replacement, two floating-point giving and one memory a balance operations reference, of 1. After and only applying one memory reference exist in the loop, giving a balance of 0.5. On a machine that can perform twice as many floating-point operations as memory accesses per clock cycle, the second loop would perform better. Although unroll-and-jam has been studied extensively, it has not been shown how to tailor unroll-and-jam to specific loops run on specific architectures. In the past, unroll amounts have been determined experimentally and specified with a compile-time parameter [Callahan et al. 1990]. However, best choice for unroll amounts varies between loops and architectures. tion to 3.2.3 derives balance a method program Unroll-and-jam loops is tailored to choose with unroll respect to a specific amounts to automatically a specific machine based target the Sec- in order architecture. on a few parameters of the architecture, such as effective number of registers and machine balance. The result is a machine-independent transformation system in the sense that it can be retargeted to new processors by changing a few parameters. The rest of this section begins with an overview of the safety conditions for unroll-and-jam. Then, a method for updating the dependence scalar replacement can take advantage of the new unroll-and-jam is presented. Finally, an automatic gram loops with respect to a particular architecture 3.2.1 Safety. With a semantics that is only graph so that opportunities created by method to balance prois derived. defined on correct programs (as in Fortran), unroll-and-jam is safe if the flow of values is not changed. Dependence can prevent unroll-and-jam or limit the amount of unrolling that can be done and still allow jamming to be safe. Consider the following loop. DO1OI=1,2*M DO1OJ=1,N 10 A(I,J) = A(1 – l,J + 1) There is a true of (1, – 1). ~Note ACM that dependence Unrolling unrolling TransactIons the by a factor on Progammmg from A(l ,J) to A(l – 1,J + 1) with outer loop of k results Languages once creates a distance a loop-independent in k + 1 loop bodies and Systems, Vol 16, No 6, November 1994 vector true . 1775 dependence that becomes an antidependence in the reverse direction loop jamming. This dependence prevents jamming because the order after of the Improving the Ratio of Memory Operations store to and load from a location would be reversed as shown in the unrolled code below. DO IO I= I,2*M,2 DO1OJ=1,N A(I,J) = A(1 – I,J + 1) 10 A(1+ I,J) =A(I,J + 1) Location the A(3,2) original is defined loop. In the on iteration (3,2) transformed loop, and the read on iteration location is read (4, 1) in on iteration (3, 1) and defined on iteration (3, 2), reversing the access order and changing the semantics of the loop. The following theorem from Callahan et al. establishes the conditions for unroll-and-jam safety and is stated without proof et al. 19881. [Callahan 3.2.1. THEOREM with distance Let e be a true, d= (O,..., O,dk, where d] < 0 and all components at level k can be unrolled at generated output or antidependence carried at level k vector that prevents Essentially, this fusion O,..., ), between the kth and jth are O. Then the loop most dk — 1 times before a dependence is of the inner theorem O,dl),.. states that n – k loops. unrolling loop k more than d~ – 1 will introduce a dependence that prevents loop fusion. If all d~ are positive, fusion will not be prevented. To allow unroll-and-jam in some loops containing edges with negative distance vector entries, loop skewing can be used to remove the assumed the safety appear for at am entries skewing appear unroll-and-j dence any test statements loops negative that in preventing done unroll-and-jam at levels inner levels, to be safe. can be violated is other [Allen than loop No dependence by a transformation applied. Additionally, the innermost distribution recurrence and Kennedy must involving 1987; Kuck unroll-and-jam is safe if no jamming-preventing by loop unrolling and if no recurrence is violated [Wolfe system level when or when be safe data 1986]. is done in et al. 1980]. is non-DO multiple order or control dependence by distribution. It before for depen- Therefore, are generated 3.2.2 Dependence Copying. To allow scalar replacement to take advantage of the new opportunities for reuse created by unroll-and-jam, the dependence graph must be updated to reflect these changes. In this section, we describe a method to update of variables. Both classes subscript expression. dependence have 3.2.2.1 Variant References. dependence graph after loop array references contain only position and whose references [Callahan only for the two most one induction prominent variable in classes any given Below is a method to compute the updated unrolling for consistent dependence whose one loop induction variable in each subscript are variant with respect to the unrolled loop et al. 19881. ACM Transactions on Programming Languages and Systems, Vol. 16. No 6, November 1994 1776 If the . Steve Carr and Ken Kennedy ith loop in a nest L is unrolled by a factor of k, then dependence graph G = (V’, E’) for Z can be computed graph G = (V, E) for L by the following rules: (1) For each u = V, there to the original reference from the updated the dependence are k + 1 nodes UO,... , U}, in V’. These and its k copies. correspond ● E, with distance vector ci(e ), there are k + 1 em = (u~,w~), u~ is the mth copy of u, w. is the (2) For each edge e = (u, w ) edges eo, . . . . ek where n th copy of w and n = (m +d, (e)) mod(k d,(e) \ To extend unrolled dependence copying in an outermost- ifn>m 11k+l = d(e) * [ H dl(en) (3) + 1) (4) i~n<m +1 to unroll-and-jam, to innermost-loop we assume ordering. that If loop and a new distance vector, d(e) = ( dl, dz, . . . . d,) d~(e) = O and Vi < k, d,(e) = O then the next inner loops are k is unrolled is created such that nonzero entry, d,(e), becomes the threshold of e and j becomes the carrier of e. If Vi, d,(e) = O, then e is loop independent. If 3 i < k such that d,(e) > 0 then e remains carried by loop i with the same threshold. To demonstrate dependence copying, consider Figure 2. The original edge A(1, J) to A(1,J – 2) has a distance vector of (2, O) before unroll-and-jam. 10 After unroll-and-jam the edge from A(I,J) now goes to A(I,J) in statement and has a vector of (O, O). An edge from A(I,J + 1) to A(I,J – 2) with a vector of (1, O) and an edge from A(I,J + 2) to A(I,J – 1) with a vector of (1, O) have also from been created. For the first edge, the first formula latter two edges, the second formula applies. 3.2.2.2 Inuarlant copying for invariant References. references, (1) If any reference is invariant reference invariant. (2) If a specific wirwariant. reference, for d,( em ) applies. For the To aid in the explanation of dependence the following terms are defined. with respect u, is invariant In the following example, I and invariant; and J is A(l) -invariant. J are with to a particular respect reference-invariant loop, that to a loop, that loops: loop is loop is I is B(J)- DOIOI=l, N DO1OJ=1,N 10 A(1) = A(1) + B(J) The behavior of invariant references differs from that of variant references when unroll-and-jam is applied. Given an invariant reference u, the direction vector entry of an incoming edge corresponding to a u-invariant loop represents multiple values rather than just one value. Therefore, Equation (4) is ACM TransactIons on Programmmg Languages and Systems, Vol 16, No 6, No> ember 1994 Improving the Ratio of Memory Operations DO 10 J = I,H DO 10 J. = I,H 1777 . DO 10 J = I,N,3 DO 10 I = I,N I A(I 10 ‘(H-2) (0,0) (1,0) (2,0) A(I -1 I = A(I~J-2) J) 10 A(I, J+l r = A(I, J-1) 1’+2) ‘)(l’J] \ (1,0) Distance Flg.2. insufficient. To accommodate entry, the following dence edge. steps vectors before the in are * applied the andafter outermost in order (1) Equation (4) is applied to the minimum “*” in the outermost nonzero direction The minimum distance is 1, and the unroll-and-jam nonzero to copy direction vector an invariant-depen- positive distance indicated by the vector entry of the original edge. result will be a loop-independent edge. edge is created in each new loop body between the (2) A copy of the original copies of the source and sink corresponding to the original references. This is because the “*” includes a distance vector entry of O. (3) A copy of the original identical references. invariant. Therefore, dependence edge is inserted All copies of an invariant between reference a dependence every exists between every pair of will also be reference (4) Loop-independent dependence are inserted from the references statement to all identical references following them. This inserts the edge created in step There is no need to insert In Figure pair. within a copies of 1 to account for unroll values of more than an edge identical to the one inserted in step 3, J is A(1, K)-invariant. When J is unrolled 1. 1. by 1, a loop-independent dependence is inserted between the two references to A(I, K) per step 1; a copy of the original dependence is made for the new copy of A(1,K) per step 2; and there are loop-carried dependence inserted between the two references per step 3. Step 4 is not necessary because step 1 has already inserted a loop-independent dependence. The safety rules of unroll-and-jam prevent the exposure of a negative distance vector entry in the outermost nonzero position for true, output and antidependences. However, the same is not true for input dependence. The order in which reads are performed does not change the semantics of the loop. Therefore, If the outermost entry of an input dependence distance vector becomes negative during unroll-and-jam, the direction of the dependence is changed, and the entries in the vector are negated. ACM TransactIons on Programmmg Languages and Systems, Vol 16, No 6, November 1994. 1778 . Steve Carr and Ken Kennedy DO IOK= 1.X DO 10 J ~ I,M,2 DO 10 I = I,N DO 10 K = I,li DO 10 J = 1,11 DO 10 I = i,li 10 ... = A( 1 ,K) a [step 3) . ..=. ,=, *,_-) (=,*,=) El I,K) (=,*,=) (original) (0,0,0) (step 1) t 10 . ..= ,1, K L (=,*,=) (step Flg.3. 3.2.3 ImprolingB is to convert Invariant-dependence alancewith loops that copying. i7nro11-and-Jam. are bound by memory 2) Asstated access time earlier, into loops the goal that are balanced. Since unroll-and-jam can introduce more floating-point operations without a proportional increase in memory accesses, it has the potential to improve loop balance. In this section, a decision procedure to compute the balance ratio decision memory procedure accesses in unrolled loops dence carried by the l-loop advantage. After unrolling become leaving decision original is works, consider per floating-point developed. As an Figure 4. The operation, but of which unroll-and-jam the l-loop by 1, there example original it also of how the loop has three has two depen- by a factor of 1 can take are two dependence that loop independent and expose opportunities for scalar replacement, a balance of two memory operations per floating-point operation. The procedure for unroll-and-jam loop and the unroll amount takes the for the outer dependence loop, and then edges in computes the the unrolled-loop balance as if the loop were unrolled by the specified amount. This procedure is applied to various unroll amounts m an efficient manner until the unroll amount that best balances a loop on a particular architecture is found. Although unroll-and-jam improves balance, using it carelessly can be counterproductive. Transformed loops that spill floating-point registers or whose object code is larger than the instruction cache may suffer performance degradation. Since the size of the object code is compiler dependent, it is assumed that the instruction cache is large enough to hold the unrolled loop body. Thus, the transformation heuristic can remain relatively independent ACM TransactIons on Programming Languages and Systems, Vol 16, NrJ 6, November 1994 Improving the Ratio of Memory Operations DO 10 I = I,l? I = I,M,2 DO 10 J = I,l? DO 10 (1,0) DO 10 J = I,lf I ‘0 1779 . + ‘(1113’’J) ,0 +A(1-2’J) ::;~~::~;:’:; 9 -? (1,0) Fig,4. Objective of the details of a particular objectives for unroll-and-jam: (1) Balance a loop with (2) Control register The goals stated forunroll-and-jam software a particular system. This leaves the following architecture. pressure. can be expressed as a nonlinear integer optimization prob- lem. objective min I /3L – ~J~Io # floating-point function: constraint: registers required s register-set size I flL – L?MIo is the balance where The decision the loops the point objective variables in a loop before in nest. the norm problem The c causes are the register-pressure floating-point function, and has the following registers the amounts constraint are balance unroll spilled. norm for limits For the to favor bound loop over a slightly memory-bound one. For each loop nest within a program, its possible definition: each unrolling solution a slightly of to to the compute- transformation is modeled as a problem of the above form. Solving the optimization problem yields unroll amounts that balance the loop nest as much as possible. Using the following definitions throughout the derivation, the compile-time construction and efficient solution V = array references V, = members X, = number (This M = number the perfectly of times nested. of memory of this Section loops. Additionally, flow in the innermost ACM the ith Transactions it outermost loop in L is unrolled created by unrolling after unroll-and-jam operations references derivation, is detailed. reads of loop bodies 3.2.4.5 problem loop are memory of floating-point purposes optimization in an innermost of V that is the number F = number For of a balance after it shows + 1 loop i) loop nest unroll-and-jam is assumed how to that handle each imperfectly is assumed that loops containing conditional loop body are not candidates for unroll-and-jam. on Programming Langnages and Systems, Vol. 16, No 6, November is nested control This is 1994 1780 Steve Carr and Ken Kennedy . because inserting disastrous effects replacement additional control flow within 4 Finally, upon performance. removes not a limitation no stores. This of the method an innermost loop can have it is assumed that scalar is done for clarity [Carr of presentation and is 1992]. Since the value of f?~ is a compile-time 3.2.3.1 Computing Loop Balance. constant defined by the target architecture, only the coefficients in LIL need to be computed at compile time to compute the objective function for a particular loop. To compute ~L for a particular loop, the number of floatingpoint operations, F, and memory references, M, are computed for one innermost-loop iteration. F is simply the number of floating-point operations in the original loop, unroll-and-j f, multiplied by the number of loop bodies created by am, giving F=fx ~ lsl where n is the the dependence In computing Xl, ..77 number of loops in a nest. M requires a detailed analysis of graph. that floating-point common subexpressions F, it is assumed are not removed from the inner loop. There are two reasons for this assumption. First, the removal of floating-point computation from memory-bound loops is not likely to improve performance. Since the loop is already bound by memory increase assuming cycles, the removal of floating-point the number of idle computation no common subexpression operations will probably only cycles in the loop. Second, by removal, the solution equation is an order of magnitude faster. In computing M, an infinite register set is assumed. to the loop balance The register-pressure constraint in the optimization problem ensures that any assumption about a memory reference being removed is true. To simplify the derivation of the formulation for memory cycles, it is initially assumed that at most one incoming or outgoing edge is incident upon each u G V. This restriction is removed at the end of Section 3.2.4. The reference set of the loop can different memory listed below.5 (1) V“ behavior = references when without be partitioned unroll-and-jam an incoming (2) V,c = memory ing consistent reads that dependence, (3) Vrl = memory reads that consistent are invariant M= have ‘At this point, edges that shown this to be we are only concerned can possibly be made sets that exhibit These sets dependence. with with A#+M:+ true with innermost. on the loops respect to a loop. each partition, {I@’, M:, TransactIons on Programmmg Languages IBM RS/6000 where Therefore, and M:}, M:. fusion we only is legal deal after with unroll-and-jam posltwe distance entries. ACM are have a loop-carried or loop-independent incombut are not invariant with respect to any loop. Using this partitioning, the cost associated is computed to get the total memory cost, 4 Experiments into is applied. Systems, Vol 16, No 6, November 1994 and vector Improving the Ratio of Memory Operations A(I, J,K) = A(I, A(I, A(I, A(I, . . . 10 Fig.5, For the first and-jam partition, contains ence. This value V“ @, before andafter each one memory canbe copy expressed additional memory. in copies For each Figure from . loop body associated with produced by unroll- each original refer- nxt]. q lsz<n V@ 5, unrolling the . . as follows: ,1 f= example, J,K) = . . . J,K+I) = . . J+I, K) = . . =.. J+ I,K+I) unroll-and-jam. of the reference M@= For 1781 DO 10 K = I,IT,2 DO 10 J = I,N,2 DO 10 I = I,N DO 10 K = l,N DO 10 J = 1,11 DO 10 I = 1,14 10 -. . the reference outer two to A(1,J, K) that loops will by 1 creates be references 3 to u G V,c, unroll-and-jam may create dependence useful in scalar for some of the new references created. In order for a reference, replacement u, to be scalar replaced, its incoming an innermost-loop-carried an innermost dependence dependence, e,,, must be converted into or a loop-independent dependence, which is called edge. In terms of the distance vector associated with an edge, d(e,, ) = ( dl, d2, . . . . d.), the edge is made innermost when its distance vector becomes d(ej ) = (O, 0,...,0, d.). Essentially, the subscript expression associated with each outerloop induction variable must be identical in both the source and sink of the dependence in order for the reference at the sink of the dependence the references will be the to be removed. After sink of an innermost for applying unroll-and-jam 6, the copy of A(I,J – 2) (the from A(I,J + 3) that first dependence copies (A(I,J) and must quantify this reference is not A(1,J + 1)) are the unroll-and-jam, only some of edge, The decision procedure value. innermost, sinks For example, in Figure to A(I,J – 1)) is the but the of innermost sink second edges (in of a and third this case, ones that are loop independent) and will be removed by scalar replacement. To compute the number of memory references inserted in the innermost loop by unroll-and-jam after scalar replacement, first the total number of references before scalar replacement is computed as in @, Next, the ment is computed number of new references by quantifying that the will number be removed by scalar of references that replace- will be the sink of an innermost dependence after unroll-and-jam. To simplify the derivai is considered, where Ve, d(e) = only the unrolling of loop tion, is extended to arbitrary loop (o,..., O,d,, O,...,0, d. ). Later, the formulation nests and distance vectors. ACM Transactions on Programming Languages and Systems, Vol. 16, No 6, November 1994 1782 Steve Carr and Ken Kennedy . DO 10 J = I,H,4 DO 10 I = l,?i DO 10 J = 1,11 DO 10 I = l,li + A(I, ‘0 ‘(r!-H7 J) = A(I, J:2) (o, o) (1,0) (2,0) A(I ‘( I, J-l) J+l) (0,0) \+ A(I, J+2) A(I, (1,0) J) \ I 10 V,( before Flg.6 A(I, andafter J+3) I = A(I, J+I) unroll-and-Jam Recall from Eauation (4) in Section 3.2.2 that, in orderto create adistance . vector with a O in the ith position, there must be some u G V,c for which d,(e, ) <X,, resulting in an innermost dependence edge (if d,(e) >Xl, then not enough outer iterations are unrolled to create a reference with the subscript expressions containing the ith induction variable being equivalent in both the source and sink of a dependence). However, depending upon which of the two distance formulas is applied, it may be impossible to create an edge with entry a O in the ith position. In order for the formula that allows an to become O, d,( e; ) = [ci,(e,, )\X, j, to be applied, the edge created by and-j am, ej = ( w~, U. ), must have n > m. If n < m, then the applica- unroll- an entry from becoming O. ble formula, d, (e;, ) = 1dt(e,, )/Xl ] + 1, prevents Below, Theorem 3.2.3.1.1 shows that n > m is true for Xl – dl(ec ) of the new dependence edges if d,( e,, ) < X,. This can be removed by scalar replacement. shows that X, – d, ( e, ) new references THEOREM 3.2.3.1.1. Given O < m, d,(e) <Xl, and n = (m + d,(e)) mod X, for each new edge e; = ( Zum, U.) created from e,, = ( WO, ZJO ) by unmll-andjam, then n>mtim<Xc-dl(e). PROOF. See Appendix In the case where a general formula placement.G For an arbitrary after ❑ C. X, < d,( e, ), no edge can be made of (XC — d, (e,))+ distance unroll-and-jam, the edges vector, new if any edge will that outer not innermost, are amenable entry is left be innermost. resulting in to scalar re- greater 6 lfx>o ‘+= ACM TransactIons on Programmmg Languages ( ; ifx and Systems, <O’ Vol 16, No than If a dependence 6, November 1994 O Improving the Ratio of Memory Operations edge becomes that innermost, additional edge. For example, innermost edge that in Figure unrolling of any 6, the load from is a copy of the incoming loop creates 1783 a copy of A(1, J + 1) has an incoming edge for the load The edge into A(1, J + 1) was created by increasing the J-loop after A(1,J)’s edge became innermost. Quantifying JJ (XL - . unroll these from A(1, J). amount of the ideas gives d,(e)))+ l<2<n references that from total the can be scalar number unroll-and-jam gives replaced for each u c V c. Subtracting before scalar replacement of references the value for M,c. Therefore, ~il<c<n predicts Xl <2 6, Xl = 4 and correctly that were no distance true, vector For references dl( e) = 2 for the reference no memory that to A(1,J – 2). The formula be removed. Note that if references would have been removed because a O in the first entry. 2 memory would have references are invariant is different from variant loop L,, i < n, unrolling with expression of v. Unrolling memory references. Finally, loop, no memory references outer loops loop-invariant ory references, while respect to a loop, memory access the same memory location as U. for L, does not appear in the subscript (scalar In Figure unrolling the behavior is invariant with respect to more references to memory any loop that is not v-invariant will if u is invariant with respect to the will be left after scalar replacement are unrolled references). will references. If v G V,l L, will not introduce because each unrolled copy of u will This is because the induction variable which t I<z<n UEVT In Figure this value and after replacement 7, unrolling J-loop removes all introduce innermost no matter innermost- the K-loop introduces does not. Using these mem- properties gives where o(e, i, n) = if e is invariant wrt loop n then return O else if e IS invariant wrt loop i then return 1 else return X, Now that the formulation for loop balance in relation been obtained, it can be used to prove that unroll-and-jam ACM TransactIons on Programmmg Languages and Systems, to unroll-and-jam has will never increase Vol 16, No 6, November 1994 1784 Steve Carr and Ken Kennedy . DO 10 K = I,li DO 10 J = I,li DO 10 I = l,li ,.. 10 = A( DO 10 K = i,li,2 DO 10 J = l,?f,2 DO 10 I = I,F4 ... ,K) El (=,*,=) 10 Fig 7 the measure unrolling ~L = A simplified all of the following loop ~ *X, q K) andafter Consider ... = A( . . . = . . . = A(I, , +1) A( Ij , )(0,0,0) +1) unroll-and-jam. the following formula for ~~ when i. + E,= V;X, – (Xt –d, (e,,))++ ~uel,; ~(ec), i,zz) fxx( formula vertices where Combining Vrl before balance. of loop only = A(I, (0,0,0) Vj, these of M and for a given edges in X, each may be determined of the” above vertex by examining sets, giving the al >0. terms W’ = aoX[ M; = alX, M: = a,3X, + a4 + az gives pL = (Coxt + cl) fxxl where co, c1 > 0. Note that the values of co and c1 may vary with the value of X,, but they are always nonnegative. As X1 increases, co may decrease and c1 become positive for may increase because the value of (X, — d,(~, ))+ may some v G V,c. Theorem 3.2.3.1.2 below states that if loop z is unrolled n times, there will be n * f flops and possibly less than n * m memory references, where m is the number of memory references in the original loop. Hence, unroll-and-jam does not increase the measure of loop balance.7 71f floating-point not hold common Unroll-and-j am subexpresslon may insert removal proportionally is taken more into account, memory Theorem references than 3,2.3.1.2 operations, ACM ‘IYan+actlons on Programmmg Language> and Systems, Vol 16, No 6, November does floatmg-pomt 1994 Improving the Ratio of Memory Operations THEOREM The function 3.2.3.1.2. for fl~ is nonincreasing . 1785 in the i dimen- sion. See Appendix PROOF. THEOREM Unrolling 3.2.3.1.3. innermost-loop Theorem ❑ C. order, 3.2.3.1.3 multiple does not increase can be proved loop nests in an outermost- to ~~. by simple induction on the number of loops unroll-and-jammed. 3.2.3.2 An Example. the following loop. As an example of a formula for loop balance, consider DO IO J=l,N DO IO I=l,N 10 A(I,J) = A(I,J – 1) + B(1) A(1, J) G V@, A(I,J – 1) = V,c B(1) G V,z. This gives pL = with an incoming distance vector of (1, O), and Xl+xl–(xl–l)++l xl 3.2.3.3 Register Pressure. The next step in creating a balance problem for a particular loop is to derive a formula for register optimization pressure, To compute the number of registers required by an unrolled loop, it must be determined which references can be removed by scalar replacement and how many registers each of those references needs. This quantity is called R. As in computing k!, the partitioned sets of V, {V@, V,c, V,r}, are considered. Associated required with by that each of the set, {~, R:, partitions R;}, R= Since each member used to capture hold a value ters required R’”+R; of Vc’ has no incoming reuse. However, during expression for those memory temporaries needed technique of Sethi number of registers of V is the number of registers giving members evaluation. references +R; . dependence, of Y’ti still registers require cannot a register be to To compute the number of regisnot scalar replaced and for those during the evaluation of expressions, the tree-labeling and Unman [1970] is used. Using this technique, the required for each expression in the original loop body is computed, and the maximum over all expressions I?z. This technique does not take into account pressure due to instruction scheduling. This issue is taken as the value for the increase in register is addressed at the end of Section 3.2.4. For variant references whose incoming edge is carried by some loop, how many references will be removed by u G vr~, it must be determined scalar replacement after unroll-and-j am and how many registers are required ACM Transactmns on Programmmg Languages and Systems, Vol 16, No 6, November 1994 1786 for Steve Carr and Ken Kennedy . those references. computation The latter assumed flow The of M,!’ value, that across all former value and is shown d~(e, ) + 1, comes registers a loop-carried has previously been derived in the below. interfere from scalar [Callahan dependence, all replacement, et al. 1990]. such values For where it values that interfere with is one another across the back edge of the loop. For values that flow across loopindependent dependence, predicting the effects of scheduling before optimization to determine interference for loop-independent dependence is compiler loop dependent. Therefore, all values body. In addition, registers are avoid result reliance upon a particular scheduling in the following formula for A?:: References ters are assumed to be live throughout the assumed not to be reused in order to only that are invariant if a corresponding with respect algorithm. These to an outer reference-invariant loop assumptions require loop is unrolled. regis- Unrolling a loop that is not u-invariant does not increase register pressure, but rather creates opportunities for scalar replacement that may be utilized if a Uinvariant loop is unrolled. No matter which loops are unrolled, references that are invariant ters. In Figure with respect 7, the reference to the innermost to A(1, K) requires is unrolled and since K is unrolled ing this behavior gives loop always no registers once, two registers require unless are required. regis- the J-loop Formulat- where a (e, i, n) ~ 3.2.3.4 pressure, An if e is Invariant wrt loop I then If (3X,1X, > 1 A e IS Invariant wrl loop j) or (e E Invanant wrt loop n) then return 1 else return O else return X, Example. consider As an example the following loop. of a formula for computing register DO1OJ=1,N DO IO I=I,N 10 A(I,J) = A(I,J – 1) + B(J) ACM TransactIons on Programmmg Languages and Systems, Vol 16, No 6, November 1994 Improving the Ratio of Memory Operations A(I,J) G Va, A(I,J – 1) G V,c B(J) G 1’:. This with an incoming distance 1787 . vector of (1, O) and gives R=l+(xl–l)++xl. 3.2.4 Applying to take program loops Unroll-and-Jam in a Compiler. This section the previous optimization problem and solve it loop nests on real machines. First, it is shown to which choose the unroll-and-jam unroll will amounts for be applied, those loops. and issues that may have it is shown interlock sented, followed by the effects of multiple incident tion. Then, imperfectly nested loops are handled, compiler-dependent presented. then Next, discusses how practically for real how to choose the how removal to is pre- edges on balance computaand finally, machineand an effect on loop balance are 3.2.4.1 loops in Picking Loops to Unroll. Applying unroll-and-jam to all of the an arbitrary loop nest can make the solution to the optimization problem extremely complex. To simplify the solution procedure, unroll-and- jam is only applied to a subset of the loop nest. Since experience suggests that most loop nests have a nesting depth of 3 or less, unroll-and-jam can be limited to 1 or 2 loops in a given nest. Even though tiling for cache may increase the depth of the loop, it is assumed that unroll-and-jam is only performed on the tiled iteration space that contains the cache reuse [Wolf and Lam 1991]. Unrolling outer loops that iterate by the tile size would increase the amount of tiling. of data required For example, to be held in the tiled in cache, and thus, defeat the purpose loop DO IO I= I,N,IS DO IO J=I,N DO IO H= II, II+ IS-1 10 A(n) = A(n) + B(J) unroll-and-jam probably The pick might be applied to the J-loop, but applying it to I would not be profitable. heuristic those for determining loops references in the whose the unrolling unrolled loop. subset of loops is likely to result In particular, the for unroll-and-jam in the fewest loops that carry is to memory the most dependence that can be innermost after unroll-and-jam is applied should be unrolled. To find these loops, each distance vector is examined to determine if it can be made innermost for each pair of loops in the nest. Given a distance where d,, d] >0, unrolling loops i vector d(e) =(dl, . . ..dl. dl, ,dl, . . . . d.) and j can make e innermost if Vk, k + i, j, n d~(e) = O. If only one loop, i, is unrolled, then e can be made innermost if Vk, k # i, n dh(e) = O. 3.2.4.2 Picking lem solution, the solution ACM Unroll unrolling procedure. Transactions In Amounts. only Later, one loop we on Programming the following is considered discuss Languages how and discussion in to order extend Systems, Vol of the to bring the solution 16, No. prob- clarity 6, November to to two 1994 1788 Steve Cam and Ken Kennedy . loops. Given following where register this restriction, optimization formula can be reduced to the S = (Xl – d,(e))+ and Rkl is the effective size of the floating-point set for the target architecture. The solution procedure begins by examining the vertices coefficient for X, takes the O(G) time, and and edges of the dependence any where constant term G is the as was size of the the summations are solved to get a linear requires us to keep an array of coefficients graph shown to accumulate earlier. dependence This graph. the process Essentially, function of Xl. The ‘-function and constants, one set for each value of d,, that will be used depending upon the value of Xl. In practice, most distances are O, 1, or 2, allowing us to keep 4 sets of values: one for each of the common distances and one for all remaining determines which coefficient set is used. If d, >2 may be imprecise if 3 < X1 < d[. However, instances of this situation. Given a set of coefficients, the solution checking register pressure, to determine distances. The value of X, for some edge, the results experimentation the space can best unroll be are O, 1, or 2, unrolling more than pressure beyond the physical limits chine. size Therefore, time the 3.2.3.1.2, bound the of O(log the best answer. of the solution R~r ) time Thus, decision to two loops, loop can be created from the general terms in ~~ for two particular loops, i sets of coefficient values for each loop. is held constant, the function for ~~ dimensional solution space with each space is sorted for a binary the entire time for one loop. To extend unroll-and-jam solution space searched, factor. dependence distances increase the register Theorem discovered can be bounded requires a simplified most by R~. order, of the solution procedure while Since R~ will probably of the target ma- in descending search no By giving a space to find 0( G + log RA1 ) optimization problem formula. The coefficients and constant and j, can be constructed by keeping 4 By Theorem 3.2.3.1.2, if either X, or Xj is nonincreasing, which gives a tworow and column sorted in decreasing order, lending itself to an 0( RV1 ) intelligent search (since the rows and columns are sorted, each step of a search algorithm can eliminate an entire row or column from consideration) and a total time of 0( G + R~ ) to determine unroll amounts. In general, a solution for unroll-and-jam of k loops, k > 1, can be found in O(G + R$l- 1) time. As an example of the result of applying dix A. unroll-and-jam using 3.2.4.3 Removing Interlock. pipeline interlock, leaving idle ACM TransactIons on Programmmg a balance optimization problem, After unroll-and-jam, computational cycles. a loop To remove Languages and Systems, Vol 16, No 6, November see Appen- may contain these cycles, 1994 Improving the Ratio of Memory Operations the technique interlock. of Callahan If interlock et al. [1988] exists, one loop can be used to estimate can be unrolled 1789 . the amount to create more of paral- lelism by introducing enough copies of the innerloop recurrence. Given the pipeline length, 1P, and the length of the longest recurrence carried by the innermost loop, p( L, ), unroll-and-jam can be applied to an outer loop until enough floating-point computation is moved into F > p(L~ ) X 1P. Essentially, the innermost loop to fill the delay slots caused by interlock. This is possible because copies of the innerloop recurrence will be parallel with each other and unroll-and-jam will al. 1988]. Experimental interlock of the not move evidence detection pipeline in Section completely is short. outerloop 4 suggests and only Because recurrences optimize of the short that inward [Callahan it is reasonable for balance pipeline, when little et to ignore the length unrolling will probably be necessary to remove interlock, and the balance optimization problem already discussed will most likely take care of that interlock. Applying only one heuristic will speed up compilation. However, with a long pipeline, it may be advantageous to remove interlock first because the unrolling determined by the optimization enough to remove interlock. Additionally, second allows the optimization resources of the machine. 3.2.4.4 be only multiple Multiple Edges. one edge entering incident edges problem In general, problem removing to keep it cannot will be less interlock first register likely rather pressure be assumed that to be than within there or leaving each node in the dependence graph are possible and likely. One possible solution the will since is to treat each edge separately as if it were the only incident edge. Unfortunately, this is inadequate because many edges may correspond to the flow of the same set of values, Consider the loop in Figure 8. There is a loop-carried input dependence from A(1,J) to each A(1 – 1,J) with distance vector (O, 1) and a loop-independent dependence between the A(1 – 1, J) ’s. Considering each edge separately would give a register pressure estimation of 5 registers per unrolled iteration provided Although come from the this estimation of J rather than the actual value of 2, since both values same source and require use of the same registers. is conservative, it is more conservative than neces- sary. To improve the estimation, register sharing is considered. To relate all references that access the same values, the scalar replacement algorithm considers the oldest reference in a dependence chain as the reference that provides the value for the entire chain [Carr and Kennedy 1994]. The oldest reference in a chain is called the generator. Using this technique alone to capture sharing within the context of unroll-and-jam, however, may underestimate register pressure. While the generator of the dependence chain is the source of the value that flows through the chain, the value from the generator may cross dependence carried by an outer loop. Since scalar replacement is only amenable to innermost dependence, the value from the generator may not provide the value for a scalar-replaced reference until enough unrolling has occurred to ensure that the chain only crosses the innermost loop. ACM Transactions on Programmmg Languages and Systems, Vol 16, No 6, November 1994 1790 Steve Carr and Ken Kennedy . DO 10 J = l,H DO 10 10 I B(I, = l,B (0,0) J) - “(’m= (0,1) Fig DO 10 D I = 10 8. Effect of multiple DO 10 I,li A I+; I 1 I I L r ..J)=A(I!, . J)+A(I, (0,0)1 ----- I DO 10 J = I,li ~._#o) 10 edges. J) ~ I - -1 ----- 1 l,lJ,2 I,N ‘------’ (0,0) m :’(1+ I (w; ----- : : ,J)=A(~,J)+A(I,J) I I 1 I I I I (0,0) I I (0,0) \\ (1,0) 10 = J = ~A(I+?,J)=A(I+i,J)+A(I+i,J) 1, I L-J ~ I I I L------------- t(u) : I _---J (1,0) Flg 9 Consider the example in innermost). In the original incoming input dependence: l,J) and one with a distance Mult]ple Figure generators 9(note in one partition. that the solid dependence edges are loop, the second reference to A(I,J) has two one with a distance vector of(l,O) from A(1 + vector of (O, O) from A(l,J). Before unrolling, the generator of the dependence chain, A(I+l,J) does not provide the value to scalar replace the second reference to A(l,J). In this case, the first referenceto A(I,J) provides the value. However, after unrolling thel-loopby l, the generator provides the value for scalar replacementto the copies ofA(l,J)—that is, the references to A(1 + I,J) in statement 10 in the unrolled loop. Therefore, the edge from the generator, A(1 + 1,J) in the original loop, cannot necessarily be picked to calculate the register pressure for the whole chain. Instead, a stepwise approximation can be used, where each possible intermediate generator within a dependence chain is considered. These intermediate generators occur when the value from a generator of a dependence chain flows across an outerloop-carried dependence edge to a member, A, of the dependence chain and some third member, B, of the chain has a value that flows across an generainnermost dependence edge to A. In this case, B is an intermediate tor. ACM Tran&actlons on Programming Languages and Systems. Vol 16, No 6, November 1994 Improving the Ratio of Memory Operations Procedure PartitionNodes( Input: G = (V, .!3), graph to be unrolled each v c V in its own partition, while 3 an unmarked node do P(v) put let u be an arbitrary mark v fordf w E Vl((de input (e is t= (for mark 1791 G,~, k) the dependence ~, k = loops . unmarked node = (w, u) Ve or true = (v, w))A A and is consistent) 1 . ..n–l. i#j, t#k, d,(e) =O)) do w merge P(w) P(v) and enddo enddo end Fig. The first graph step in the stepwise so that all same partition. references references been node in each register ing dependence those this edges that section, the value of the the edges of each partition d, for that to ensure After computing tion for register this ACM of the the distance generator’s reuse loop body. number in the values. algorithm, partially For In addition, does not provide Transactions register on Programming Languages the vector the This value is underestimated. to the formula have of a partition Vol encomfrom be measured coefficients points Systems, will the maximum is not and across is picked. will where at the innermost-loop and unrolling for i = 1,...,n – edges requirements the intermediate the reuse d., used that in provide only partition) a partition of registers a distance the vector in the outgoing within will to flow Only earlier (if enough To be conservative, a summarized distance vector, pressure are accumulated. point of a partition all in the unrolled that to estimate set of unroll is computed. the innermost-loop as soon as possible partition generator node flows is found. the generator the the that as described entire to each to the has no incom- loops loop, leaving (one 11 is applied or has only loop-carried for the value SummarizeEdges, outgoing 1 the minimum At replacement to cause in the Once the OldestValue, the value v-invariant unroll-and-jam, in procedure created by task. z), that of its partition carried After Next, guarantees are to reference are considered. edges). generator member that be put this in Figure The reference, in the unrolled for scalar all is first reuse innermost picked another the dependence will in the procedure can create is performed pass algorithm is determined. dependence generator 10 accomplishes the that is to partition same in Figure partition from the used in R. First, the partition incoming approximation partitioned, the coefficients throughout PartitionNodes having The algorithm have calculate 10. 16, No. equabeen given a the generator level for some 6, h’ovember 1994 1792 . Steve Ken Kennedy Carr and Procedure C’omputeR Input: P = partition G (V, E), = (P, of pc memory references the dependence R = register foreach G, l?) pressure graph formula P do v = OldestValue(p) d=S ummarizeEdges(v) update R using d Prune(p,u) while size(p) > 1 do u = OldestValue(p) d=S ummarizeEdges(u) update R Using Prune(p, R,j – Rd+d[ej, 6 = (~, u) u) U=u enddo enddo end Procedure Oldest foreach v E Value(p) p if /l an incoming dependence v Ve = (u, u), ((u is invariant (P(u) return wrt # P(v))) the loop at level(carrier(e))) V then u end Procedure SummarizeEdgm(v) fort= lton–1 e = (v, w) 6 E, w is a memory foreach df = min(d~, foreach dj e c E, w is a memory = max(d~, return read d,(e)) read d~(e)) d’ end Procedure Prune(p,u) p remove v from foreach e = (v, w) c E if d,(e) = O, 1 < ~ < n then remove w end Fig, of the references 1 I, within Compute the coefficients partition for optimization must problem be determined (see A(1,J) in the original loop in Figure 9). Register requirements will be underestimated these situations are not handled. First, applying procedure Prune to each partition, we remove all nodes which there are current generator ACM Transactions on no intermediate points where provides the values for scalar Programmmg Languages and Systems, Vol a reference replacement. 16, No other Any 6, November if for than the reference 1994 Improving the Ratio of Memory Operations that occurs onthesame iteration of loops of the partition will have no such registers needed by the remaining calculating the coefficients I to n – Iasthe points. nodes the previous of R for an intermediate generator. Since current generator Next, the number of additional in the partition is computed by partition. The coefficients of R derived from of the modified partition will include registers with 1793 . generator of the modified the summarized distance vector that have already been counted the generator for a partition will point provide the values for the entire partition, we do not want same registers multiple times. Overcounting of register pressure at some to count the is prevented by computing the coefficients for R as if an intermediate generator were the generator in an original partition. Then, we subtract from R the number of references that have already been handled by the previous generator. The distance vector intermediate current current d(6) of the edge, generator, summarized and previous represents the 6, from u, is used the previous to compute generator, the latter Now ences, summarized distance vector as if the – O)+ – ( Xl – 1)+ additional points are repeated until references the partition respect to some invariant with loop. the other and have dependence only edge be incident respect can Since multiple edges to one of the unrolled carried upon are multiple edges also by both the generator presented in by A(1 + q ,J), edges for variant that are invariant allowed, loops. a reference. affect previous example headed references loops and variant occurs, the reference is treated as a member loop and V c with respect to the second. Finally, the removed. These steps for has 1 or fewer members. that we have shown how to handle multiple we discuss the differences for references one new Given distance vector, d’, and d(2) for the edge between the Rd. — R(d S+~(;)). The value d’ + generators, compute was the generator of the modified partition. In the Fi~re 9, d’ = (Oj O) and d(~) = (1, O) for the partition giving (Xl intermediate u, to the value. If This the of V 1 with computation with referwith may be respect to is not the case if previous situation respect to the first of Here, M. each reference is considered separately, using a summarized distance vector that consists of the minimum d, over all incoming edges of u for i = 1, . . . . n – 1. The summarized distance vector represents the earliest possible point in unrolling many when vectors, estimate a reference the point may be removed. calculated may Since the vector be too early, is a summary giving of a conservative of M. 3.2.4.5 Imperfectly which unroll-and-jam following loop. It may be the case that the loop nest on Nested Loops. is to be performed is not perfectly nested. Consider the DO IO I=I,N D020J=1,N 20 A(I,J) = B + E DO1OJ=1,N 10 C(J) = C(J) i- D(l) On the RS/6000, where /3M = 1, the l-loop would not be unrolled surrounding statement 20, but would be unrolled for the loop ACM TransactIons on Programmmg Languages and Systems, Vol. 16, No. for the loop surrounding 6, November 1994 1794 . Steve Ken Kennedy Carr and statement 10.X To handle the J-loops as follows this situation, the l-loop can be distributed around D0201=1, N D020J=1,N A(I,J) = B + E 20 DO IO I=I,N DO1OJ=1,N 10 C(J) = C(J) + D(l) and each loop unrolled independently. In general, unroll amounts are computed with a balance optimization formula for each innermost loop body as if it were within a perfectly nested loop body. A vector of unroll amounts that includes a value for each of a loop’s surrounding loops, and a value for the loop itself is kept for each loop within the nest. When determining if distribution is necessary, if a loop that has multiple inner loops is encountered at the next level, each of the unroll vectors for each of the inner amounts differ. If they differ, loop at the Then outermost level each of the newly at inner levels. where created If any loop that to a recurrence, unroll loops is compared to determine if any unroll each of the loops from the current loop to the the unroll amount the loops must vector unroll differ cannot in each vector for each level is distributed. for additional be distributed values in all of the vectors amounts is examined (note distribution be distributed due are set to the smallest that a loop that cannot be distributed will have an unroll value of O to ensure safety) [Allen and Kennedy 1987]. The complete algorithm for distributing loops for unroll-andjam is given in Figure 3.2.4.6 Machinescribed a solution there are other compiler 12. and Compiler-Dependent to the loop balance and factors that may need Issues. This paper has register pressure problem, to be addressed debut on some architecture- combinations. Addressing. Unroll-and-jam will likely increase the number of address registers required for the array references in tbe innerloop body. Address-register pressure is a compiler-dependent issue that will affect balance if there is register spilling. In addition to address registers, the updating of their contents must also be considered. On architectures with an autoincrement addressing mode, there will be no effect on balance. However, if explicit instructions are necessary to increment address registers, then these instructions must instructions be included. Most likely since memory references Instruction Cache. In this paper, is large enough to hold an unrolled they will count are often serwced as additional by an integer memory pipeline. it is assumed that the instruction cache loop body. However, some architectures (e.g., Intel i860) have very small on-chip instruction caches. In these cases, it would be necessary to understand how the back-end compiler generates 9If both statements ACM loops and Transactions are leave unrolled the loop the same structure on Programming amount, we would Just make copies of the intact. Languages and Systems, Vol 16, No 6, November 1994 non-DO Improving the Ratio of Memory Operations Procedure DistributeLoops( Input: L = loop level if L has maltiple 1 = outermost loop L,i) to check i = nesting 1795 . for distribution of L inner level loops at level where t + 1 then the IUUWll amo~ts &ffer for the nests. if 1 > 0 then if loops at levels 1, . . . . z can be distributed then L=L while level(L) ~ 1 do distribute C around its inner loops L = parent(L) enddo else for] =lto ido m = fiNmum of all WOll for the inner vectors Of loops L at level J assign to those vectors at level j the value m enddo endif endif N = the first if N exists inner DistributeLoops( N = the next while of L at level loop z+ 1 then N,, loop N exists DistributeLoops( N = the next + 1) at level t following L do N,,) loop at level t following N enddo end Fig, instructions so that limit unrolling. Scheduling. heuristic 12, some estimate Instruction presented the back-end Distribute scheduling for unroll-and-jam. of code size could scheduling does not handle compiler’s loops can this be performed increase directly algorithm register since intimate would in order pressure. The knowledge be necessary to to give of an effective estimate. One method for handling scheduling is to reduce the number of available registers artificially to allow room for the scheduler to require additional resources. The number of effective registers can be determined Data experimentally. Cache. The design of the data cache can have a great effect on the ability of the cache to retain data as assumed by the heuristic. Direct-mapped caches often suffer from interference between references [Lam et al. 1991]. However, this issue is beyond the scope of this paper and is left to the cache optimization phase of the Wolf and Lam 1991]. ACM Transactmns compiler on Pmgrammmg [Lam Lal.guages et al. 1991; and Systems, Meadows Vol 16, No et al. 6, November 1992; 1994 1796 Steve Carr and Ken Kennedy . Special Instructions. multiply-add instruction completed in separately.g time This, number of flops balance can give Many that less than in turn, that advanced microprocessors allows the two arithmetic it would affects to perform the peak machine can be performed a false take of ~~ procedure instruction floating-point reducing if the instruction be performed. rather than instruction by increasing per cycle. However, impression counts each balance as a result, unnecessary unrolling may add can be counted as one operation actually have a floating-point instructions to be the machine is not used and, Instead, the multiplytwo since the decision issues. 4. EXPERIMENT A source-to-source ParaScope PFC [Allen mental translator, programming and Kennedy design called is illustrated preprocessor, unroll-and-jam transformed Memoria, has been in Figure has a good Additionally, compiler and the RS/6000 loop balance a natural 13. In this a large number has a number estimate scheme, This Renaming. Zero-Cost Branches. ing instructions next iteration allows before pipelined Large 4-Way cache the the the pressure achieve Mode. (64K ). ignore for loops Set-Associative the effects of stalls related to This the parallelism Autoincrement by fetch- operations helps assumed allows on the simulate the to exist heuristic a in to of balance. The size of the cache allows the heuristic Data Large Cache. associativity tends or loop-independent Implementation identifies mstructlon Trmwactlons to reduce effectively. most-loop-carried ACM it (32). make loop size. interference a multlply as a because are optimized beginning finishes. in the computation Cache to and to ignore and iteration loop and helps Instruction heuristic boundary current instructions 540 was chosen As discussed at the end of Section 3.2.4, address-register be controlled. For the RS/6000, any expression having ‘Our serves of floating-point registers of hardware features that branches loop Addressing address to ignore Backward across Autoincrement ignore Memoria of performance: pipelining across iterations on register antidependences on registers. software a loop. in the rewriting Fortran programs to improve loop balance using both and scalar replacement as described. Both the original and versions of the program are then compiled and run using the standard product compiler for the target machine. For the test machine, the IBM RS/6000 model Register implemented environment, using the dependence analyzer from 1987; Callahan et al. 1987; Carr 1992]. The experi- on multlply-add M an add mstructlon Programmmg Languages dependence mstructlons This and may requires in an AST not be valid Systems, Vol pressure needs to no incoming inner- 16, an address by checking if the parent on all architectures. No 6. NCJ> ember regis- 1994 of Improving the Ratio of Memory Operations 1797 Maoria Fig. ter. Although dence from reasons. In considering Experimental 4.1 c- Original compiler, the it number pressure could not removes be avoided the indepen- of registers available for was performance artificially re- additional register spilling. with scalar replacement 1992]. Performance Figures 14 and Results 15 show applying unroll-and-jam complete applications. and Improved design. address-register a particular addition, 13. duced to 26 to allow for scheduling and to prevent This number was determined by experimentation [Carr * Compiler SOurc. original reported formed Figures results results performance and scalar Full code using in normalized optimization the IBM execution on the was used XLF compiler where on both version attained Below code is 100. XLF compiler for loop-independent the results with by using more sophisticated we discuss the improvement applications in our after of kernels the and transformed 2.2. The results are of the trans- of the original code. In code; (SR) reports the and (UJSR) reports the scalar replacement. The performs a limited form of dependence and loop-invariant refscalar replacement show the benefit dependence attained suite. First analysis. by Memoria the on each results of each played in Figure 14 is presented. Then, a discussion performance shown in Figure 15 is presented. of the 4.1.1 and RS/6000 the performance code is normalized against the performance 14 and 15, (Original) de~otes the original after Memoria applied scalar replacement, after Memoria applied unroll-and-jam and scalar replacement erences. Therefore, IBM to a number times base execution time for the original It should be noted that the IBM kernels results replacement of the kernel dis- application Kernels DMXPY. DMXPY (denoted by VM in Figure 14) version of a vector-matrix multiply that is machine RS/6000)) and difficult to understand. Unroll-a~d-jam is a hand-optimized specific (not for the and scalar replace- ment are applied to the unoptimized version (Original) and compared with the performance with the LINPACKD version (look at the SR bar). Memoria (the UJSR bar) is able to attain slightly better performance than the hand-optimized code, while allowing the kernel to be expressed in a machine-indepenACM TransactIons on Programming Languages and Systems, Vol 16, No 6, November 1994 1798 . Steve Carr and Ken Kennedy l’;\hhhkOk!i &+Fig. VM 14. Kernel unroll-and-j MMk MMi BLU performance am and scalar BLUP on IBM Gmt Emit Vpa Afold (SR = after RS/6000 Fold Sevsd scalar Sor ;: .... .x .,.. ::: :::: ..:. .... ! IU300 Flg 15. after Tciuv Apphcatlon unroll- A&n scalar on IBM RS/6000 (SR = after 9895 w 89 “::: i.:: ,,,. j .,.. : :; :,:: .,.. ,.,. ::j :.., ,.,. ,... .::: ,,,, ,::: :::; ;:: .: .... .:: .::! ,,,, 11° oOpt phot Arc2d performance and-j am and UJSR replacement, scalar ❑ Orlgmd ❑R wUJSR Mesa UJSR replacement, of how hand optimization is not necessarily kernel in a machine-independent form and allow the compiler to handle the machine details. This allows good mance across a variety of architectures without programmer effort. Matrix Multiply. The results for two versions of matrix 50 x 50 are displayed. MMk is matrix multiply by the innermost loop, and MMi is matrix multiply carried by the middle loop. Memoria achieves integer small matrices where most memory references are from memory lelism. reuse = replacement) dent fashion. This is an example best. It is better to express the of size carried = after replacement) 98 ::: ::.; ,.:: ;::: x Mean in the loop improve, but also available Linear Algebra Kernels. Experimentation with decomposition with and without partial pivoting multiply perfor- on arrays with the reduction with the reduction factor speedups on cache. Not only does instruction-level paral- the block versions (BLUP and BLU, of LU respec- tively) is shown. Memoria attains a factor of over 2 improvement in run-time. Each of the kernels contained an innerloop reduction that is amenable to unroll-and-jam. In these instances, unroll-and-jam lel copies of the mnerloop recurrence to Improve introduces balance. multiple paral- Three of the NAS kernels (Gmtry, Emit, and Vpenta) from NAS Kernels. the SPEC benchmark suite are examined. Gmtry and Emit observe larger improvements because they contain outerloop reductions to which unroll-andjam can be applied. Geophysics nels: one that another ACM that Tran.act]on~ Kernels. computes Included in the experiment are two geophysics the adjoint convolution of two time series (Afold) computes the on Programmmg convolution Languages and (Fold). Systems, Each Vol 16, of these No loops 6, November 1994 kerand requires Improving the Ratio of Memory Operations Table Suite Application SPEC Matrix300 Description Matrix Mesh Adm Transonic Onedim Seval and Ser. tained by applying computes only the to remove Particle 4.1.2 Seval scalar We Exploration and a B-spline applicable). replacement interlock, Simulation triangular developed in other works kernels contain innerloop The to the successive-over-relaxation Applications. Particle rhomboidal, evaluates is not Transport Electromagnetic Oil Equai.ion Prediction Sphot coopt pipeline Flow 2d Hydrodynamics of trapezoidal, (unroll-and-jam Solver Simple using techniques that have been and Kennedy 1992]. Again, these loops Pollution Schroedinger Weather Wave optimization Air Inviscid Time-Independent ShaJ the Generation 2d Fluid-Flow F1052 Local Multiply Pseudospectral Arc2d RiCEPS 1799 I. Tomcatv Perfect . main of a grid. but originally and has reported 27 spaces only singly is ob- computational Unroll-and-jam Carr nested improvement no improvement chose iteration [Carr 1992; reductions. loop. Sor is applied is noted. Fortran applications from SPEC, Perfect, RiCEPS, and local sources to be a part of our study. Of those 27, 5 failed to be analyzed successfully by PFC; 1 failed to compile on the RS6000; and 10 contained no reuse opportunities for the loop-balance optimization algorithms. Most of those programs that contained no reuse opportunities had loop-independent or loop-invariant reuse that was the IBM XLF compiler. Table I contains a short description of each of the remaining captured by applications that can be transformed by Memoria. The results of performing scalar replacement and unroll-and-j am on the application suite are shown in Figure 15. The results for most applications are modest with the exception of Matrix300 and Onedim. These two applications spend considerable time ACM Transactions on Programming Languages and Systems, Vol 16, No 6, November 1994 1800 . Steve Carr and Ken Kennedy computing reductions tions. As shown for improvement. ments from 15% unroll-and-jam tion time. is given 50°6 Shal, improves effective completely loop on the individual loops by reduc- loops to which do not dominate unroll-and-jam is never execu- applied. of unroll-and-jam A more on the applica- 4.2. unroll-and-jam carry is most a large instruction-level cost. dominated present the best opportunity Simple, and CoOpt, improve- these performance often in the computational achieved and Sphot, that Reductions dominates are of the suggests reductions. being Unfortunately, in Section The study least to examination reduction Matrix300 is applied. For Adm, detailed tions with in the kernel results, reductions For Tomcatv, Arc2d, F1052, presence Since effective amount parallelism greatly. of a floating-point a divide performance takes to make Balance Estimate of unrolling because on the memory and a Unroll-and-jam divide 19 cycles enough in the presence of reuse, is of its RS/6000, high its references cost a minimal factor. 4.2 Effectweness The next jam of the Loop portion decision balance of the experiment procedure prediction dependence the PFC in statistics analyzer analyzer deals predicting can is used because for this the implementation of the statistics was upgraded to be of PFC quality. In Figure 16, the accuracy kernels problem “Observed’ with The kernel. “Predicted’ “Compiler” the IBM are after estimates numbers with balance given a fixed number register allocator). (cache miss experiment bar numbers graph the registers unroll and multiply-add 16 indicates identical to the balance prediction is counted the different cies do not emerge on this Programmmg predicted produced are likely with on that balance dependence TransactIons source code applied. generated (using by the best the IBM amounts are used or a minimum the loop. represents XLF are increased balance is reached. as one floating-point Division is counted as 19 flops since the base operations one cycle and division takes 19 cycles. ACM from each In all references to memory are counted as one memory operation penalties are not included), and each addition, subtraction, multiplication, Figure balance are assembler of floating-point numbers, the replacement “Optimal” on the the for by examining scalar than analyzer are obtained dependence of during by Memoria represents optimization. registers rather unsupported of balance by examining full tables ParaScope and (2) the ParaScope obtained For the “Optimal” all floating-point each column, the detailed B. The became “Initial” and are obtained compiler until from unroll-and-jam XLF of the analyzer gathering of the unroll-and- More Appendix of the computation created numbers Memoria in portion is detailed. of the loop in the original optimization the accuracy balance. be found (1) the PFC aforementioned with loop by balance the to occur when distances Languages references at multiple set of kernels. and for compiler. Systems, Vol this set of kernels is inaccuracies in have multiple These also be noted 16, take Slight loop levels. It should operation. on the RS/6000 No. 6, November incoming inaccurathat 1994 these Improwng the Ratio of Memory Operations . 1801 1:lLLLLLBkBmwL— Vm MMk W Fig. BLU 16 BLUP Kernel Gmt Emit Vpe balance-prediction Afold Fold Scr accuracy. .illl a b% Dmm Q Redl.ad ❑ Oh.arw!d Q Campi!a ❑ o@und Vpe Fig. 17. Kernel register-pressure AMd Fold Sm accuracy. inaccuracies will not occur between the observed balance and the compiler balance. No results are reported for “Seval” because unroll-and-jam is not performed on the kernel. For the kernels in this study, Memoria achieves a loop balance that is not appreciably different from ence between the optimal the optimal balance. The balance and the balance occurred on Vpenta. Here, ranges contain reuse. Not loop-independent dependence with considering register reuse causes overestimate register pressure and unnecessarily only significant differachieved by Memoria restrict limited Memoria live to unrolling. The register pressure results in Figure 17 indicate that Memoria overestimates register pressure in most instances. This would argue for Memoria to include some heuristic for approximating scheduling to allow register reuse. However, it does not argue for the elimination of reserving registers as a buffer against only considers the register allocator of the target compiler because Memoria register pressure on a per loop basis, not over a whole routine as in global register allocation [Chaitin 1984]. Previous experiments have shown loops with very low register spills standing Table of program II presents register are done pressure for an entire structure information et al. 1981; that register due live Chow and Hennessy spilling will occur in to scalar replacement range rather than [Briggs 1992; Carr 1992]. concerning the unrolling with because an under- of the loops in each kernel. “Unrolled” shows how many loops are unrolled due to balance and pipeline interlock. Of those that contained interlock, only SOR requires additional unrolling after balancing the loop without considering interlock. Interlock is removed in all other loops just by unrolling to improve the ratio of memory to floating-point operations. “Not Unrolled shows how many loops are not unrolled due to balance and lack of potential improvement (“None”), and transformation legality (“Safety”). “Nests” reports the number of perfectly nested loop structures ACM Transactions that on Programming are considered Languages and for unroll-and-jam. Systems, Vol 16, No 6, November Imper1994 1802 Steve Carr and Ken Kennedy - Table Kernel Unrolling Information II Kernel Not Unrolled Unrolled None Safety 3MXPY o 0 1 MatJIK 0 0 1 1 0 0 1 2 2 0 1 0 3 3 0 5 1 5 6 --1--- MatJKI BLU BLUP 2 Gmtry 1/0 I 0 2 0 0 5 7 1 0 0 1 2 Afold 0 0 1 Fold 0 0 1 0 0 0 0 0 1 Emit Vpent a % Q----k Seval 0 Sor 4.0 Nests 1 1 ..” Tcatv Iu300 Flg Arc2d 18 Apphcatlon FI052 Simp O&n balance-prediction accuracy. feet loop nests have multiple perfect loop structures that unroll-and-jam. Figure 18 reports the accuracy of Memoria in predicting application suite. in an application are considered loop balance for in the The balance numbers are reported as an average of all loops to which unroll-and-jam is applied. “Initial,” “Predicted,” and “Observed” numbers have the same meaning as in Figure 16. No numbers from the compiler-generated code are reported. It is expected that the compiler numbers for balance would be identical to the observed numbers as previously reported for the kernels because the output of the XLF compiler is very predictable. ACM TransactIons on Programmmg Languages and Systems, Vol 16, No 6 November 1994 Improving the Ratio of Memory Operations Memoria This that note predicts application the observed contains numbers references in every with application multiple incoming . 1803 except F1052. dependence required summarization as described in Section 3.2.4. It is pleasing how accurately the decision procedure performs across a variety applications and kernels. 5. RELATED WORK Callahan but they et al. [1988] describe unroll-and-jam do not present a method to compute [ 1987] discuss in the context of loop balance, unroll amounts automatically. Aiken and Nicolau called zation loop quantization. To ensure parallelism, where each loop is unrolled until iterations dent. However, with between the unrolled software iterations their register method misses unnecessarily pressure. Wolf restricts and Lam a transformation identical to unroll-and-jam they perform a strict quantiare no longer data indepen- or hardware pipelining, do not prohibit low-level usage unroll [ 1991] to of benefits from true dependence parallelism. Thus, true dependence amounts. They also do not control present a framework for determining and register the data locality for a loop nest and use loop interchange and tiling to improve locality. They present unroll-and-jam in this context as register tiling, but they do not present a method to determine unroll amounts. 6. SUMMARY We have presented a design replacement and unroll-and-j transformations input only are machine a few parameters and number of registers. A major achievement problem that for a transformation am to improve minimizes independent of the target of this work the distance the in system that applies balance of loops. the sense machine, is the development between that they such as machine scalar These take as balance of an optimization machine balance and loop balance, bringing the balance of the loop as close as possible to the balance of the machine. This work has also shown how to use a few simplifying assumptions to make the solution to the optimization inclusion in real compilers. These transformations have been implemented source preprocessor and applied to a substantial problem in fast enough a Fortran collection for source-to- of kernels and whole programs. The accuracy of our method is established in the experimental results. These results show that in almost all cases hand optimization is unable to achieve a significantly better balance than Memoria. We believe this result shows that hand-balancing loops is unnecessary. In addition to the accuracy of Memoria, the run-time results show that the methods are particularly successful on kernels from linear algebra. Even though the technique is targeted toward loop optimization, we still achieve spectacular results on a couple of programs; however, in general, modest results are reported on full applications. These results are achieved on an IBM RS/6000, which has an effective optimizing compiler and a small load penalty (1 cycle). We would expect such methods to produce larger improvements on machines with greater load ACM penalties. TransactIons on Programmmg Languages and Systems, Vol 16, No 6, November 1994 1804 Steve Carr and Ken Kennedy . The effectiveness of scalar replacement and unroll-and-jam, as demon- strated by the experiments, argues that they should be included highly optimizing scalar compiler. Since it has been established that be carried out by a machine-independent few machine parameters to drive the possible to develop a preprocessor processors as they emerge. APPENDIX Al. A. MATRIX Original Matrix MULTIPLY— in every they can source-to-source system that uses a optimization algorithm, it should be that can BEFORE be quickly retargeted to new AND AFTER Multlply doj=l, n dol=l,n dok=l, n c(I,j = c(i,j) + a(i, k) * b(k,j) enddo enddo enddo A2. Balanced c C c Matrix Multiply pre-loop removed do J-mod(n,2) + 1,n,2 c C c pre-loop removed do I = mod(n,2) + l,n,2 if (1 ,gt. n) goto 1 a$l$O = a(l,l) b$2$0 = b(],]) c$O$O = C(I,J)+ a$l $0 * b$2$0 b$4$0 = b(l,j + 1) c$3$0 = C(I,J + 1) + a$l$O*b$4$0 a$6$0 =a(l + 1,1) c$5$0 = c(i + 1,j) + a$6$0 x b$2$0 c$7$0 do = c(i + 1 J + 1 ) + a$6$0 ~ b$4$0 k = 2,n a$l$O = b$2$0 = c$O$O = b$4$0 = c$3$0 = a$6$0 = a(i, k) b(k,]) c$O$O + a$l $0 * b$2$0 b(k,j + 1) c$3$0 + a$l $0 t b$4$0 a(i + 1,k) c$5$0 = c$5$0 + a$6$0 x b$2$0 c$7$0 = c$7$0 + a$6$0 L b$4$0 Languages and enddo c(i+l,j+l)=c$7$0 c(i + 1,j) = c$5$0 C(I,J + 1) = C$3$0 c(i,j) = c$O$O 1 continue enddo enddo ACM TransactIons on F’rogrammmg Systems, Vol 16, No 6, November 1994 Improving the Ratio of Memory Operations APPENDIX B. BALANCE PREDICTION Label Meanin,s 1Predicted OL obtained from the dependence (Observed PL obtained A I the optimization graph 0= obtained IBM (3ptimaf XLF by examining the murm and scalar by examining compiler the best ~= given problem created from for each loop aft er unroll-and-jam (Dompiler 1805 TABLES Table ~ . full a fixed the IBM XLF 1[B initial PL 1~B @L after 17P floating-point register 1vests the number of perfectly generated by the optimization number register unroll-and-jam Memoria replacement the assembler with (using code with of floating-point re~istern allocator). and scalar replacement pressure nested loops considered for unrokand.jarn 1Jnrolled results for those nests 1WA Unrolled results those P 1 the number that nests that of perfect are unrolled am not unrolled nests in that appfkation that are unrolled I 1 the number of loops N! 1 the number of perfect not NI 1 that are unroUed due to interlock. nests in the application that are unmffed the number of those in W’ that could not be improved by unmfkmd-jam s the number not ACM Transactions of those in “N” where unrolf -and-jam is safe on Programming Languages and Systems, Vol 16, No 6, November 1994 1806 . Steve Carr and Ken Kennedy Table Kernel Predicted Kernel ACM Prediction Accuracy Optmd Compiler Observed lB FB FP FB FP B3 FB FP FB Fp DMXPY 3.00 1.09 26 1.09 25 3.00 1.09 26 1.07 32 MatJIK 2.00 1.00 10 1,00 10 2.00 1.00 7 1.00 7 MatJKI 3.00 1.00 17 1.00 16 300 1.00 14 100 14 BLU 3.00 2.04 26 2,04 25 3.00 2.04 25 2.03 32 2.00 1.00 10 1.00 10 2.00 1.00 7 1.00 7 3.00 2.04 26 204 25 3.00 2.04 25 203 32 2.00 1.00 10 1.00 10 2.00 100 7 1.00 7 Gmtry 3.00 2.04 26 2.04 25 3.00 2.04 25 2.03 32 Emit 300 1.09 26 109 25 300 I 09 26 107 32 300 1.09 26 1.09 25 3.00 1.09 26 1.07 32 Vpenta 2.33 1.58 23 1.58 22 2.33 1.58 12 1,01 12 Afold 2.00 1.00 5 1.00 5 2.00 1.00 4 1.00 4 Fold 2.00 1.00 5 100 5 2.00 1.00 4 1.00 4 Sor 0.60 0.53 9 0.53 8 0.80 0.53 9 0.53 9 BLUP Table Table Table Balance A.11 A. I m-esents a key to the other two tables in this appendix. A.1~ presents the-balance prediction accuracy of Mernoria on kernels. A.111 presents Memoria’s balance prediction accuracy on applications. TransactIons on Programmmg Languages and Systems, Vol 16, No 6, November 1994 Improving the Ratio of Memory Operations Table Application Application Accuracy Not Unrolled Predicted IB FB FP FB FP N NI s 2 2 0 2.00 1.04 26 1.04 25 0 0 0 5 1 0 2.00 1.58 24 1.58 24 4 4 0 Adm 106 0 0 - - - - - 106 72 34 Arc2d 75 15 0 2.70 1.55 15.9 1.55 15.1 60 46 14 F1052 76 14 0 2.13 1.53 15.2 1.57 12.8 62 62 0 Onedim 10 6 0 2.00 1.00 10 1.00 10 4 4 0 Shal 7 0 0 - - - - - 7 7 0 Sphot 10 0 0 - - - - - 10 7 3 Simple 22 2 0 4.00 3.08 17.5 3.08 16.5 20 18 2 Wave 10 0 0 - - - - 10 7 3 139 15 0 1.04 21.2 1.04 17.2 124 124 0 v coopt each new 2.5 C. PROOFS 3.2.3.1.1. THEOREM for Observed 1 Tomcat APPENDIX Unrolled N Matrix300 1807 A.III. Balance Nests . edge e; = Giuen ( w~, O < m, d,(e) v. ) created < X,, from and e,, = ( w“, n = (m + d,(e)) modXz V. ) by unroll-and-jam, then n>m=m<X1—dL(e). PROOF, ( ~ ) Given n > m, assume m >XL – d,(e). Since O <m, d,(e) <XL then n = (m + d,(e)) = m + d,(e) ACM Transactions on Programming Languages mod X, –X, and Systems, Vol 16, No 6, November 1994, 1808 . Steve Carr and Ken Kennedy giving n2+cZ, (e)- X,>n2. Since d,(e) the inequality is violated, <X,, leaving m <X, – d,(e) Given (=) m –d,(e) and n = (m +dl(e))mod <X, m,dl(e) > 0, then = m +d, Xl (e). Since m,d,(e) > 0 then m +d, (e) >m, giving ❑ n>m. THEOREM The 3.2.3.1.2. fimction for ~L is nonincreasing in the i dinlen- sion. For PROOF. jam the is applied original to the loop, i-loop pi X, = 1 and 1 times, n – (c:r2 = ~~ = (c. n > +c~ 1, then + cl)/f. If unroll-andX, = n giving +c~) fn where the in Cz is the sum of all X,, value would not (X, decrease between in – d, ( eU )) ‘ became positive value of fl~ is (c~n f+ = +Cl + (CO –c~)(n – 1)) fn To show that flI, is nonincreasing, ~L n(cO 2 + Cl) > PI > ~~ must be true. % (c~n + (c. –c~)(n – 1’) +cl) fh nf nco+ncl>nc~+nco — nc~+c~—co+cl (n–l)cl>c~l–cO ACM Transactions on with or remain the same with the change co and CL is the number of edges becomes positive with the change in X,. Given that of any d, ( e, ) added into Cz is n – 1 (any greater positive). Therefore, the maximum make (X, — d, ( e, ))+ where (X, – d,(e, ))+ X, = n, the maximum value d, ( e, ) where X,. Since co will c; s co. The difference increase Programming Languages and Systems, Vol 16 No 6, November 1994 lmprovlng Since n > 1, c1 >0 ing. El and the Raho of Memory Operations CL s co, the inequality holds and & 1809 o is nonincreas- ACKNOWLEDGMENTS The authors this paper. Reynolds would first Their like to thank suggestions and Phil Sweany the referees greatly provided valuable tion of this document. Preston Briggs, helpful suggestions on document form. and David Callahan for their improved the suggestions Keith Finally, valuable paper’s input quality. during to Robert the prepara- Cooper, and Uli Kremer made John Cocke, Peter Markstein, gave us the inspiration for this work. REFERENCES AIKEN, A. AND 87-821, ALLEN, NICOLAU, Cornell F. AND J. ACM J. KENNEDY, Program. P. 1992. Science, Rice CALLAHAN, D., CAER, S., CAILAHAN, CALLAHAN, S. Univ., CARR, S. ANU CARR, S. ings KF,NNEI)Y, Pratt. AND G,j Re~ster Not Desig]z and of Fortran programs to vector form. Ph.D. thesis, Dept. of Computer Improvmg allocation for subscripted SIGPLAN ’90 Confer- Implementation. 1988. Esblmatmg Comput. interlock and improving balance 5, 334-358. K., KENNEDY, mm Proceedings In register of the ACM TORCZON, of the L. 1987. 1st International l?araScope: A Conference on Berhn. Ph.D. thesis, Dept. of Computer Science, Rice K. 1992 M., via 19, Scalar 1 (Jan.), Compiler A., J. L. the of Society Press, COCKS, Corrzput J., Lang. 1984. 222–232. in blockability Computer CIJANDRA, coloring 6 (June), replacement presence of conditional control 51-77. numerical algorithms. Los Alamitos, HOPKINS, M., AND Proceed- In Calif., 114– 124. MARIHTEIN, P. 1981. 6, 45-57. Register allocation proceedings by priority-based of the ACM SIGPLAN coloring. ’84 SZG- S.vmposzum on Construction. J. C., Hsu, DEHNER’r, Proceedings Languages GUPTA, reference AND BRAT-T, J. Internutlonal Operutzng E., Proceedings P. Y.-T.. of the 3rd a?ld DUESTERWALD, array R., 1994, 24, CHOW, F. C. AND HENNIMSY, In K. ng ’92. IEEE AUSLANN?R, Compiler Rep. N. J., 1-30. Proceedings management. K, KENNEDY, allocation PLAN In transformations. Cliffs, coloring. 1990. Dmtrlb. HOOD, Exper. of Supercomputz CHAI’rIN, Tech. algorithm. Tex. Soft[L,. flow, graph and KENNEDY, J. Parallel Memory-hierarchy Houston, and 491-542. 53 –65. Desgn Springer-Verlag, 1992. K. environment. Supercomputmg. CARR, via KENNEDY, AND programming optimizing translation 9, 4 (Oct.), 25, 6 (June), COOPER, K., D., parallel J., analysis Tex. AND machines. of Automatic Language D , COCKE, for pipelined An Englewood allocation Not. ence on Programming catalogue Syst. Houston, SIGPLAN variables. A 1987. Lang. Register Univ., quantization: Prentice-Hall, K. AND Loop N.Y. 1972. of Compilers. Trans. BRIGM, 1987. Ithaca, COCKE, OptunLzcztLon ALLEN, A. Univ., Systems. R., analysis AND and of the ACM P. ACM, SOFFA, its 1989. Conference use SIGPLAN New M in ’93 Overlapped York, L. loop on Arch ~tectural support in Support the Cydra 5. for Programming 26-38. 1993. A practical data optimization. SIGPLAN Conference on Programmmg flow Not. 28, framework 6 (June), Language for 68–77. DeSLgn and Irnplementatlon, GANNON, D., JALBY, W., agement by global on Supercomputzng. KENNEDY, K. Proceedings AND of AND GALLIVAN, K. 1987. program transformations. Springer-Verlag, Berhn. MCKINLEY, the 1992 K. 1992. Strategies In Proceedings Opt]m~zing Znternatzonal Conference on Programmmg Languages for cache for parallelism on and local of the 1st Znternatlonal and memory Supercomputing. ACM, memory man- Conference hierarchy. New In York, 323-334. ACM TransactIons and Systems, Vol. 16, No. 6, November 1994 1810 . KUCK, D. New The Structureof 1978. Computerland Computations.Vol. I. John Wiley and D., KLTHN, getable R,, LEASURE, vectorlzer. Press, In Los Alamltos, M Calif., algorithms. for Programmmg M~AJIOWS, L , NAKAMOTO, piler for LIW RAU, B. and 1991. ceedings WOLFE, Languages Lecture superscalar Notes software plpelmed and SIGPLAN 592 Conference 17, 4 (Oct.), M E. (June), Design WOLFE, AND 30–44 and M. The structure Apphcchons. The cache 4th of an advanced IEEE Society In for 589 Not. 1970 rntng The 27, and Language Destgn of optimal parallehsrn and comIn for Parallel Berlin, 283–299. 63-74. pipelimng ’92. 1992. S on Architectural York, software C’ompllers Sprmger-Verlag, 7 (July), generation zing, mstructlon-level P. P., ANU SCHLANSKF,R, M SIGPLAN New of RISC on Languages vol ACM, A vector] Proceedings analysis Science, Systems. 1992 and optimizations Conference Pro- Comput- 236–250 Register allocation Proceedings of the for ACM Implementatmn. code for arithmetic expressions. J. 715-728. L.AM, M S Proceedings 1991 A data of the ACM locabty optlmlzmg SIGPLAN algor,thm. ’91 Conference SIGPLAN Not. on Programming 26, 6 Language Implementation. 1986. Loop skewing: December 1992; revised The wave front method rewslted. J. Parallel Program. 279-293. Received retar- Computer performance International Operatzng dependence on Program SI?THI, R. AND ULLIWMJ, J. ACM and Workshop TIRUMALAI, loops. and 1991 architectures in Computer M., E of the S., AND SCHUSTER, V Data-flow 1984. M. Destgn Proceedings of the 4th International R.Au, B. R., LEE, WOLF, AND 163-178 In Support zng B., Supercomputers: S., ROTHBERG, E E , AND WOLF, M of blocked ACM Sons, York. KLTCK, LMI, Steve Carr and Ken Kennedy Transactions on Programming December Languages 1993 and and May Systems, 1994; Vol 16, accepted No June 6, November 1994 1994 15, 4,