Improving Register Allocation for Subscripted Variables*

David Callahan†    Steve Carr‡    Ken Kennedy‡

*Research supported by NSF Grant CCR-8809615 and by IBM Corporation.
†Tera Computer Company, 400 N. 34th St., Suite 300, Seattle, Washington 98103.
‡Department of Computer Science, Rice University, Houston, Texas 77251-1892.

Proceedings of the ACM SIGPLAN '90 Conference on Programming Language Design and Implementation, White Plains, New York, June 20-22, 1990.

Abstract

Most conventional compilers fail to allocate array elements to registers because standard data-flow analysis treats arrays like scalars, making it impossible to analyze the definitions and uses of individual array elements. This deficiency is particularly troublesome for floating-point registers, which are most often used as temporary repositories for subscripted variables. In this paper, we present a source-to-source transformation, called scalar replacement, that finds opportunities for reuse of subscripted variables and replaces the references involved by references to temporary scalar variables. The objective is to increase the likelihood that these elements will be assigned to registers by the coloring-based register allocators found in most compilers. In addition, we present transformations to improve the overall effectiveness of scalar replacement and show how these transformations can be applied in a variety of loop nest types. Finally, we present experimental results showing that these techniques are extremely effective, capable of achieving integer-factor speedups over code generated by good optimizing compilers of conventional design.

1 Introduction

Although conventional compilation systems do a good job of allocating scalar variables to registers, their handling of subscripted variables leaves much to be desired. Most compilers fail to recognize even the simplest opportunities for reuse of subscripted variables. For example, in the code shown below,

      DO 10 I = 1, N
        DO 10 J = 1, M
10        A(I) = A(I) + B(J)

most compilers will not keep A(I) in a register in the inner loop. This happens in spite of the fact that standard optimization techniques are able to determine that the address of the subscripted variable is invariant in the inner loop. On the other hand, if the loop is rewritten as

      DO 20 I = 1, N
        T = A(I)
        DO 10 J = 1, M
10        T = T + B(J)
20      A(I) = T

even the most naive compilers allocate T to a register in the inner loop.

The principal reason for the problem is that the data-flow analysis used by standard compilers is not powerful enough to recognize most opportunities for reuse of subscripted variables. Standard register allocation methods, following Chaitin et al. [CAC+81], concentrate on allocating scalar variables to registers throughout a procedure. Furthermore, they treat subscripted variables in a particularly naive fashion, if at all, making it impossible to determine when a specific element might be reused. This is particularly problematic for floating-point register allocation, because most of the computational quantities held in such registers originate in subscripted arrays.
For the past decade, the PFC Project at Rice University has been using the theory of data dependence to recognize parallelism in Fortran programs [AK84b]. Since data dependences arise from the reuse of memory cells, it is natural to speculate that dependence might be used for register allocation. This speculation led to methods for the allocation of vector registers [AK88], which led to scalar register allocation via the observation that scalar registers are vector registers of length 1 [CCK87]. In this paper, we extend the theory developed in our earlier work and report on experiments with a prototype system, based on PFC, that improves register allocation by rewriting subscripted variables as scalars, as illustrated by the example above. Our current implementation, a modified version of our automatic parallelizer PFC, will be described in Section 3.

2 Method

In this section, we will describe a set of transformations that can be applied to Fortran source to improve the performance of compiled code. We assume that the target machine has a typical optimizing Fortran compiler, one that performs scalar optimizations only. In particular, we assume that it performs strength reduction, allocates registers globally (via some coloring scheme) and schedules both the instruction and the arithmetic pipelines. This makes it possible for our preprocessor to restructure the loop nests in Fortran, while leaving the details of optimizing the loop code to the compiler. These assumptions proved to be valid for the Fortran compiler on the MIPS M120, which served as the target for most of our experiments.

We begin the description of our method in the first subsection with a discussion of the special form of dependence graph that we use in subsequent transformations, paying special attention to the differences between this form and the standard dependence graph. Next, we introduce scalar replacement, a transformation that replaces references to subscripted variables by references to generated scalar variables whenever reuse is possible. Finally, we present a set of transformations that can be used before scalar replacement to improve its effectiveness.

2.1 Dependence Graph

2.1.1 Basics

We say that a dependence exists between two statements if there exists a control-flow path from the first statement to the second, and both statements reference the same memory location [Kuc78]. We say the dependence is

- a true dependence or flow dependence if the first statement writes to the location and the second reads from it,
- an antidependence if the first statement reads from the location and the second writes to it,
- an output dependence if both statements write to the location, and
- an input dependence if both statements read from the location.

For the purposes of improving register allocation, we concentrate on the true and input dependences, each of which represents an opportunity to eliminate a load. In addition, output dependences can be helpful in determining when it is possible to eliminate a store.

A dependence is said to be carried by a particular loop if the references at the source and sink of the dependence are on different iterations of the loop. If both the source and sink occur in the same iteration, the dependence is loop independent. The threshold of a loop-carried dependence, τ(e), is the number of loop iterations between the source and sink. In determining which dependences represent reuse, we consider only those that have a consistent threshold, that is, those dependences for which the threshold is constant throughout the execution of the loop [GJG87, CCK87]. For a dependence to have a consistent threshold, it must be the case that the location accessed by the dependence source on iteration i is accessed by the sink on iteration i + c, where c does not vary with i.
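To make the consistency test concrete for the common case of subscripts that are affine in the loop index (a*I + b), the threshold is just the constant iteration gap between the two references. The following sketch in Python is ours, not part of the PFC implementation, and the function name is illustrative.

      from fractions import Fraction

      def consistent_threshold(a1, b1, a2, b2):
          # Source subscript a1*i + b1, sink subscript a2*i + b2, over the
          # same loop index. The sink on iteration i + c touches the
          # source's location on iteration i for every i exactly when
          # a1 == a2 and c == (b1 - b2) / a1; otherwise the gap varies
          # with i and the dependence has no consistent threshold.
          if a1 != a2 or a1 == 0:
              return None
          c = Fraction(b1 - b2, a1)
          return int(c) if c.denominator == 1 and c >= 0 else None

      # A(I+2) writes, A(I) reads: threshold 2, as in the example of
      # Section 2.2.
      assert consistent_threshold(1, 2, 1, 0) == 2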
2.1.2 Dependence Pruning

Since conventional methods for computing a dependence graph often produce transitive dependence edges, not all edges entering a reference represent the true flow of values [Wol82]. For example, if there were two stores to the same location in a loop nest, the dependence graph would have true dependences from both definitions to every use that they could reach. Therefore, in order to capture potential reuse, it must be determined which of these definitions is the last to store into the memory location before a value is loaded from that location. In general, for each memory load in the inner loop body, we want to find the most recent definition within the innermost loop or, if no definition exists, the least recent use.

Procedure PRUNEGRAPH is an algorithm that takes as input a dependence graph for a loop nest containing no inner-loop conditional control flow and returns a dependence graph pruned of all transitive, inconsistent and outer-loop-carried dependence edges, leaving only those edges which represent the true flow of values in the innermost loop.

      procedure PRUNEGRAPH(G)
        Definitions: G = (V, E) and E_v = incoming edges for node v ∈ V.
          VALID(e) returns true if edge e is carried by an innermost loop,
          or is loop independent and its source and sink are in the same
          loop body.
        for each v ∈ V do begin
          if ∃ an incoming true dependence then begin
            T = {e | e ∈ E_v is a true dependence with the smallest
                 threshold and VALID(e)}
            e_g = the e ∈ T whose source does not have an outgoing
                 loop-independent output dependence whose sink has a
                 true dependence in T
            if τ(e_g) is not consistent then e_g = ⊥
          end
          else begin
            T = {e | e ∈ E_v is an input dependence with the largest
                 consistent threshold and VALID(e)}
            e_g = the e ∈ T whose source does not have an incoming
                 loop-independent consistent input dependence
          end
          for each e ∈ E_v do
            if e ≠ e_g and e is not an output dependence then
              remove e from G
        end

Once the graph has been pruned, the source of each true or input dependence can be seen as "generating" a value that could eliminate a load if kept in a register. Hence, we call these sources generators.
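For concreteness, the selection rule in PRUNEGRAPH can also be expressed as an executable sketch. The Python version below is ours and simplifies the side conditions: each edge is assumed to carry kind, threshold, consistent and valid attributes, plus a precomputed shadowed flag standing in for the loop-independent output/input dependence tests on its source.

      def prune_graph(nodes, incoming):
          # incoming[v]: list of edges with attributes kind ('true',
          # 'anti', 'output', 'input'), threshold, consistent, valid,
          # and shadowed (the loop-independent side conditions above).
          for v in nodes:
              trues = [e for e in incoming[v] if e.kind == 'true' and e.valid]
              if trues:
                  # most recent definition: smallest-threshold true edge
                  # whose source is not overwritten before v is reached
                  best = min((e for e in trues if not e.shadowed),
                             key=lambda e: e.threshold, default=None)
                  if best is not None and not best.consistent:
                      best = None
              else:
                  inputs = [e for e in incoming[v] if e.kind == 'input'
                            and e.valid and e.consistent]
                  # least recent use: the largest consistent threshold
                  best = max((e for e in inputs if not e.shadowed),
                             key=lambda e: e.threshold, default=None)
              # keep the selected edge; keep output dependences for the
              # later store-elimination test; drop everything else
              incoming[v] = [e for e in incoming[v]
                             if e is best or e.kind == 'output']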
2.2 Scalar Replacement

Before discussing loop restructuring transformations, we will describe how to replace subscripted variables by temporary scalar variables to effect reuse. In making this transformation, we consider only loop-independent dependences and dependences carried by the innermost loop, since it would be impractical to save a value in a register over all the iterations of the inner loop just to eliminate a single memory reference. The procedure consists of six steps, as follows.

Determining the Number of Temporaries. We begin by calculating the number of temporary variables, t_i, needed to hold each variable generated by the source, g_i, of a true or input dependence long enough to achieve reuse. For each generator, we need τ(e) + 1 temporary variables to eliminate the load at the sink of e, where e is the outgoing edge with the largest threshold. This provides one temporary for each of the values computed on the τ(e) previous iterations plus one for the value computed on the current iteration. Note that for some backward dependence edges it may be possible to use one less temporary variable. However, for the sake of simplicity, our implementation treats backward and forward edges uniformly when determining the number of temporaries. Although this would seem to create unnecessary register pressure, the typical coloring register allocator will coalesce the extra temporaries with others before generating code.
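Stated as code (our sketch, reusing the edge attributes from the pruning sketch above), the count for one generator is one more than its largest outgoing reuse threshold:

      def temporaries_needed(outgoing):
          # One temporary for the value computed on the current iteration
          # plus one for each of the tau(e) previous iterations, where e
          # is the outgoing true/input edge with the largest threshold.
          reuse = [e.threshold for e in outgoing
                   if e.kind in ('true', 'input') and e.consistent]
          return max(reuse) + 1 if reuse else 0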
Reference Replacement. The next step in the transformation is to create temporary variables, T_i^0, T_i^1, ..., T_i^(t_i-1), for each generator, g_i, and replace the reference to the generator with T_i^0. The sink of each of the generator's outgoing dependence edges should be replaced with T_i^k, where k is the threshold of the dependence edge. If g_i is a load from memory, then we must insert a "load" of the form T_i^0 = g_i immediately before the generating statement. If g_i is a definition, then we must insert a "store" of the form g_i = T_i^0 immediately following the generating statement. This second assignment may be unnecessary if there are other definitions of the same location in the same iteration of the loop body, which can be detected by looking for a loop-independent consistent output dependence leaving the definition.

The effect of scalar replacement will be illustrated on the following loop nest.

         DO 10 I = 1, 100
1          A(I+2) = A(I) + C(I)
2          B(K) = B(K) + C(I)
10       CONTINUE

The reuse-generating dependences are:

1. A true dependence from A(I+2) in statement 1 to A(I) in statement 1 (threshold 2),
2. A loop-independent input dependence from C(I) in statement 1 to C(I) in statement 2 (threshold 0), and
3. A true dependence from B(K) to B(K) in statement 2 (threshold 1).

In addition, there is an output dependence from B(K) in statement 2 to itself. By our method, the generator A(I+2) in statement 1 needs three temporaries, which we call T10, T11 and T12. Here we are using the first numeric digit to indicate the number of the generator. The generator C(I) in statement 1 needs only one temporary, T20, and the generator B(K) in statement 2 needs two temporaries, T30 and T31. Note that we could use only one temporary for the last generator, but we have chosen to be consistent for the sake of simplicity. When we apply the reference replacement procedure to the example loop, we generate the following code.

         DO 10 I = 1, 100
           T20 = C(I)
           T10 = T12 + T20
1          A(I+2) = T10
           T30 = T31 + T20
2          B(K) = T30
10       CONTINUE

Register-to-Register Moves. In order to keep the values in temporary variables correct across the iterations of the innermost loop, we insert, for each generator g_i, the following block of assignments at the end of the loop body.

           T_i^(t_i-1) = T_i^(t_i-2)
           T_i^(t_i-2) = T_i^(t_i-3)
             ...
           T_i^1 = T_i^0

In the example from above, the code becomes:

         DO 10 I = 1, 100
           T20 = C(I)
           T10 = T12 + T20
1          A(I+2) = T10
           T30 = T31 + T20
2          B(K) = T30
           T12 = T11
           T11 = T10
           T31 = T30
10       CONTINUE
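The move block is mechanical; a small sketch (ours) that emits it for generator number g with t temporaries reproduces the assignments shown above:

      def shift_block(g, t):
          # Age each saved value by one iteration: T<g><t-1> = T<g><t-2>,
          # ..., T<g>1 = T<g>0, matching the block inserted at the end of
          # the loop body.
          return ["T%d%d = T%d%d" % (g, k, g, k - 1)
                  for k in range(t - 1, 0, -1)]

      assert shift_block(1, 3) == ["T12 = T11", "T11 = T10"]
      assert shift_block(3, 2) == ["T31 = T30"]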
Code Motion. It may be the case that the assignment to a generator may be moved entirely out of the innermost loop and put after the loop. This is possible when the reference to the generator does not vary with the iterations of the innermost loop. In the example above, B(K) does not change with each iteration of the I-loop; therefore, its value can be kept in a register during the entire execution of the loop and stored back into B(K) after the loop exit, as below.

         DO 10 I = 1, 100
           T20 = C(I)
           T10 = T12 + T20
1          A(I+2) = T10
           T30 = T31 + T20
           T12 = T11
           T11 = T10
           T31 = T30
10       CONTINUE
         B(K) = T30

The enabling conditions for this type of code motion in the presence of true dependences are:

- There exists a consistent output dependence carried by the innermost loop from a definition to itself, and
- All innermost-loop-carried and loop-independent dependences related to that definition are consistent.

In the presence of only input dependences, the condition for moving a load out of the loop is:

- There exists an innermost-loop-carried consistent input dependence from a use to itself.

We call each of these array reference types scalar because they are scalar with respect to a given loop. When only the first condition for scalar array references is met, a store to a generator for that variable cannot be moved outside of the innermost loop. Consider the following example.

         DO 10 I = 1, N
           DO 10 J = 1, N
10           A(I) = A(I) + A(J)

The true dependence from A(I) to A(J) is not consistent. If the value of A(I) were kept in a temporary and stored into A(I) outside of the loop, then the value of A(J) would be wrong whenever I = J and I > 1. This situation can be recognized while pruning the dependence graph.

Initialization. Finally, max(t_1, ..., t_n) - 1 iterations of the loop must be peeled from the beginning of the loop to initialize the temporary variables correctly. In these initial iterations, the sinks of the dependences from generators for iteration k are replaced with the temporary variable T_i^j only if j < k, and the appropriate number of register transfers are inserted for each peeled iteration. When this transformation is applied to our example, we get the following code.

         T20 = C(1)
         T10 = A(1) + T20
         A(3) = T10
         T30 = B(K) + T20
         T11 = T10
         T31 = T30
         T20 = C(2)
         T10 = A(2) + T20
         A(4) = T10
         T30 = T31 + T20
         T12 = T11
         T11 = T10
         T31 = T30
         DO 10 I = 3, 100
           T20 = C(I)
           T10 = T12 + T20
           A(I+2) = T10
           T30 = T31 + T20
           T12 = T11
           T11 = T10
           T31 = T30
10       CONTINUE
         B(K) = T30

We have eliminated three loads and one store from each iteration of the loop, at the cost of three register-to-register transfers in each iteration. Fortunately, these transfer instructions can be eliminated if we unroll the original loop by 2 [CCK87]. Even without unrolling, many transfers will be eliminated by live-range coalescing in the target compiler's register allocator.

2.2.1 Reducing Register Pressure

During our experiments, we discovered that a problem arises when scalar replacement creates a large number of scalars that compete for registers: some compilers were not as effective as we expected when presented with loops that require almost every available register. In particular, some compilers failed to coalesce temporaries whose live ranges did not interfere. Because of this, we decided to generate temporaries in a way that minimizes register pressure. In effect, we wanted to do part of the register allocator's job ourselves.

Our goal is to allocate temporary variables so that we can eliminate the most memory accesses, given the available number of machine registers. This can be achieved by modeling the task as a knapsack problem, where the number of scalar temporaries required for a generator is the size of the object to be put in the knapsack, the number of memory accesses eliminated is the value of the object and the size of the register file is the knapsack size. Using dynamic programming, an optimal solution to the knapsack problem can be found in O(kn) time, where k is the number of registers available for allocation and n is the number of generators. Hence, for a specific machine, we get a linear time bound. The best solution to the knapsack problem may leave the knapsack partially empty, requiring us to examine each of the generators not yet allocated to see whether they have an edge which requires few enough temporaries to be allocated in the leftover space.

If the optimal solution takes too much time, as it may on a machine with a large number of registers, a greedy approximation algorithm might be used. In this approximation scheme, the generators are ordered in decreasing order based upon the ratio of benefit to cost, or in this case, the ratio of memory accesses saved to registers required. At each step, the algorithm chooses the first generator that fits into the remaining registers. This method requires O(n log n) time to sort the ratios and O(n) time to select the generators. Hence, the total running time is O(n log n). With the greedy method, we have a running time that is independent of machine architecture, making it more practical for use in a general tool such as PFC.

Unfortunately, the greedy approximation can produce suboptimal results. To see this, consider the following example, in which each generator is represented by a pair (n, m), where n is the number of registers needed and m is the number of memory accesses eliminated. Assume that we have a machine with 6 registers and that we are generating temporaries for a loop that has the following generators: (4,8), (3,5), (3,5), (2,1). The greedy algorithm would first choose (4,8) and then (2,1). This would result in the elimination of nine memory accesses instead of the optimal ten. However, it may be possible to improve the greedy strategy if partial allocations for generators are considered.
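Both strategies are easy to state in code. The sketch below is ours; it reproduces the four-generator example just given, with each generator a (registers required, accesses saved) pair.

      def knapsack(generators, registers):
          # 0/1 knapsack by dynamic programming: best[r] holds the most
          # memory accesses removable with at most r registers, plus the
          # generators chosen. O(k*n) for k registers and n generators.
          best = [(0, [])] * (registers + 1)
          for size, value in generators:
              for r in range(registers, size - 1, -1):
                  cand = best[r - size][0] + value
                  if cand > best[r][0]:
                      best[r] = (cand, best[r - size][1] + [(size, value)])
          return best[registers]

      def greedy(generators, registers):
          # Take generators in decreasing order of accesses saved per
          # register required, whenever one still fits.
          saved, chosen = 0, []
          for size, value in sorted(generators,
                                    key=lambda g: g[1] / g[0], reverse=True):
              if size <= registers:
                  registers -= size
                  saved += value
                  chosen.append((size, value))
          return saved, chosen

      gens = [(4, 8), (3, 5), (3, 5), (2, 1)]
      print(knapsack(gens, 6))   # (10, [(3, 5), (3, 5)]) -- optimal
      print(greedy(gens, 6))     # (9, [(4, 8), (2, 1)])  -- suboptimal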
2.2.2 Handling Conditional Control-Flow

The algorithm we have described fails for loops with conditional control flow because it makes no provision for multiple generators: it is possible that the generator for a particular use depends upon the control-flow path taken through the body of the loop. This is the case in the following loop.

         DO 10 I = 1, N
           IF (B(I) .LT. 0.0) THEN
S1           A(I) = C(I) + D(I)
           ELSE
S2           A(I) = E(I) + F(I)
           ENDIF
S3 10      G(I) = 2*A(I)

The definition of A(I) in either statement S1 or S2 can provide the value used in statement S3. The problem of multiple generators is only one of many that must be handled by scalar replacement in the presence of control flow. The following is a more complete list of the responsibilities of the algorithm.

1. Determining which possible generators reach a load along the control-flow paths.
2. Finding the most recent definitions (or least recent uses) along the control-flow paths.
3. Determining the profitability of scalar replacement.
4. Determining the optimal dependence threshold of scalar replacement on each generator.
5. Naming the scalar temporaries consistently within the loop.

Many of these problems can be solved by formulating them as data-flow problems. For example, to determine if a definition of identical threshold reaches a load on all paths, we can solve a data-flow problem similar to available expressions.

We have developed a preliminary method for dealing with these problems. The method requires that each path through a loop define the requisite temporaries, a condition which may result in the insertion of loads. In the example above, this strategy would result in the following code.

         DO 10 I = 1, N
           IF (B(I) .LT. 0.0) THEN
             T10 = C(I) + D(I)
             A(I) = T10
           ELSE
             T10 = E(I) + F(I)
             A(I) = T10
           ENDIF
10         G(I) = 2*T10

While our approach works in most simple cases, it may introduce unnecessary overhead in the presence of more complex control flow. Our future research will pursue a more comprehensive algorithm for scalar replacement in the presence of conditionals.
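The data-flow formulation mentioned above can be sketched as a standard forward "must" analysis over the loop body's control-flow graph; because facts are intersected over predecessors, a generator is available at a load only if every path defines it. This sketch is ours, with a deliberately generic representation of blocks and facts.

      def must_available(blocks, preds, gen, kill):
          # IN[b]  = intersection of OUT[p] over predecessors p
          # OUT[b] = (IN[b] - KILL[b]) | GEN[b]
          # Blocks with no predecessors start empty; facts can be tokens
          # such as ('A(I)', threshold). Iterates to a fixed point.
          universe = set().union(*gen.values(), *kill.values())
          out = {b: set(universe) for b in blocks}
          changed = True
          while changed:
              changed = False
              for b in blocks:
                  if preds[b]:
                      inn = set.intersection(*(out[p] for p in preds[b]))
                  else:
                      inn = set()
                  new = (inn - kill[b]) | gen[b]
                  if new != out[b]:
                      out[b], changed = new, True
          return out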
2.3 Restructuring Transformations

In this section, we discuss two transformations that can be used to improve scalar replacement: loop interchange and unroll-and-jam. These transformations move dependences carried by outer loops inward, so that the distances between references to the same array location will be shortened enough to make scalar replacement feasible. We will also show how to perform unroll-and-jam on loops with irregular-shaped iteration spaces.

2.3.1 Loop Interchange

Loop interchange can be used to shift an outer loop to the innermost level, possibly moving dependences inward to the new innermost loop. Therefore, we want to use loop interchange, where safe, to increase reuse and remove as many memory accesses as possible [AK84a]. Our first goal with loop interchange is to move the loop that carries the most dependences for scalar array references to the innermost level. This will achieve the largest gains because loads and stores can be removed completely from the innermost loop. Also, the transformation presented in the next section can take advantage of this situation. If this step is not possible, then we would like to move the loop that carries the most consistent dependences. Unfortunately, different loop orderings may produce different carriers for the same dependence, which adds to the complexity of computing the optimal loop ordering.

Another problem with loop interchange is that getting the best loop ordering for scalar replacement may hurt other critical performance areas. Consider the following example.

         DO 10 J = 1, N
           DO 10 I = 1, N
10           A(I) = A(I) + B(I,J)

Because Fortran stores its arrays in column-major order, the stride between accesses to the B array is one. With a cache that has a line length of l, this will correspond to l-1 cache hits and 1 cache miss for l accesses to B. If the I and J loops were interchanged so that A(I) could be kept in a register, the stride between accesses to B would be equal to the dimension of the column of B, resulting in poor cache performance.

2.3.2 Unroll-and-Jam

Unroll-and-jam is another transformation that can be used, where safe, to shorten the distances between references to the same array location [AC72, AN87, CCK87]. The transformation unrolls an outer loop and then fuses the inner loops back together. It can transform dependences carried by an outer loop into dependences that are either loop independent or carried by the innermost loop. For example, the loop:

         DO 10 J = 1, 2*M
           DO 10 I = 1, N
10           A(I,J) = A(I-1,J) + A(I-1,J-1)

after unrolling becomes:

         DO 10 J = 1, 2*M, 2
           DO 20 I = 1, N
20           A(I,J) = A(I-1,J) + A(I-1,J-1)
           DO 10 I = 1, N
10           A(I,J+1) = A(I-1,J+1) + A(I-1,J)

and after fusion becomes:

         DO 10 J = 1, 2*M, 2
           DO 10 I = 1, N
S1           A(I,J) = A(I-1,J) + A(I-1,J-1)
S2 10        A(I,J+1) = A(I-1,J+1) + A(I-1,J)

The true dependence from A(I,J) to A(I-1,J-1) is originally carried by the outer loop, but after unroll-and-jam the dependence edge goes from A(I,J) in S1 to A(I-1,J) in S2 and is carried by the innermost loop. We can now allocate a temporary for A(I,J) and remove two loads instead of one as in the original loop.

To allow scalar replacement to take advantage of the new opportunities for reuse created by unroll-and-jam, we must update the dependence graph to reflect these changes. Below is a method to compute the updated dependence graph after loop unrolling if we restrict ourselves to consistent dependences that contain only one loop induction variable in each subscript position [CCK87]. The dependence graph G' = (V', E') for L' can be computed from the dependence graph G = (V, E) for L by the following rules:

1. For each v ∈ V, there are k+1 nodes v_0, ..., v_k in V'. These correspond to the original and its k copies.

2. For each edge e = (v, w) ∈ E with consistent threshold τ(e), there are k+1 edges e_0, ..., e_k, where e_j = (v_j, w_i) and we have the relations:

         i = (j + τ(e)) mod (k + 1)
         τ(e_j) = (j + τ(e)) div (k + 1)

Intuitively, these rules simply take account of the fact that several iterations are unrolled into the body, so there are k+1 copies of both v and w and the thresholds are greater than zero only for dependences that cross iterations of the new loop. These rules can be extended to unroll-and-jam in the following manner. Whenever a threshold of a dependence becomes zero, the next inner non-zero entry in the distance vector for the dependence becomes the new threshold and the loop corresponding to that entry becomes the new carrier. If there is no non-zero entry, then the new dependence becomes loop independent. If the edge is not carried by the unrolled loop, then it is just copied into the new loop body.
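Rule 2 has a direct arithmetic reading: the sketch below (ours) computes, for each copy j of the source, which copy of the sink it reaches and the new threshold.

      def unrolled_edge(j, tau, k):
          # Copy j of the source of an edge with consistent threshold tau,
          # after the carrying loop is unrolled k times (k+1 body copies):
          # the sink lands in copy (j + tau) mod (k+1), and the edge now
          # crosses (j + tau) div (k+1) iterations of the new loop.
          new_tau, i = divmod(j + tau, k + 1)
          return i, new_tau

      # The A(I,J) example: tau = 1, unrolled once (k = 1). Copy 0 feeds
      # copy 1 in the same new iteration (threshold 0), while copy 1
      # feeds copy 0 on the next iteration (threshold 1).
      print([unrolled_edge(j, 1, 1) for j in (0, 1)])   # [(1, 0), (0, 1)]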
Unroll-and-jam not only improves register allocation opportunities; it also improves cache performance by shortening distances between sources and sinks of dependences. As a result, cache hits are more likely [Por89]. Finally, unroll-and-jam can be used to improve pipeline utilization in the presence of recurrences [CCK87].

The major difficulty with unroll-and-jam is determining the amount of unrolling necessary to produce optimal results. There are many factors that are involved in the decision process, some of which are register pressure (both integer and floating-point), storage layout, loop balance, pipeline utilization and instruction-cache size [CCK87]. The current implementation picks the two outer loops that carry the most dependences (true and input) and unrolls them by a factor provided as a parameter. We have developed, but not implemented, an efficient heuristic that chooses this parameter automatically. Because we are using unroll-and-jam to improve scalar replacement, our heuristic concentrates on register-file utilization. We would like to use all of the floating-point registers available if possible, but we do not want to unroll the loop more than necessary. Unrolling carelessly can lead to a large explosion in code size with little additional improvement in execution time. Therefore, the decision procedure becomes a constrained integer-optimization problem, where the size of the inner-loop body is maximized with the constraint that the register pressure is not increased beyond the limits of the hardware. To handle the problems of pipeline utilization and instruction-cache size, we can increase or decrease the unroll factor to fit the situation, although unrolling for register usage, by itself, is likely to create enough parallelism, while keeping the inner-loop body within a reasonable size limit.

2.3.3 Irregular-Shaped Iteration Spaces

When the iteration space of a loop is not rectangular, unroll-and-jam cannot be directly applied as shown previously. For each iteration of the outer loop, the number of iterations of the inner loop may vary, which makes unrolling a specific amount illegal. Below, we show how to perform unroll-and-jam on triangular- and trapezoidal-shaped iteration spaces by extending previous work done on triangular loop interchange [Wol86, Wol87, CK89].

Another way to think of unroll-and-jam is as an application of strip mining, loop interchange and loop unrolling. Using this model to demonstrate the transformation, we give the general form of a loop that has been already strip mined. The I and J loops have been normalized to give a lower bound of 1, α and β are integer constants (β may be a loop invariant), α > 0 and, for simplicity, S divides N.

         DO 10 I = 1, N, S
           DO 10 II = I, I+S-1
             DO 10 J = 1, αII+β
10             loop body

To interchange the II and J loops, we have to account for the fact that the line J = αII+β intersects the iteration space at the point where J = αI+β. Therefore, when we interchange the loops, the II loop must iterate over a trapezoidal region, as shown in Figure 1, requiring its lower bound to begin at I until (J-β)/α > I. This gives the following loop nest.

         DO 10 I = 1, N, S
           DO 10 J = 1, α(I+S-1)+β
             DO 10 II = MAX(I, (J-β)/α), I+S-1
10             loop body

[Figure 1: New iteration space]

To complete the unroll-and-jam transformation, we need to unroll the inner loop. Unfortunately, the iteration space is a trapezoidal region, making the number of iterations of the inner loop vary with J. To overcome this, we can split the index set of J into one loop that iterates over the rectangular region below the line J = αI+β and one loop that iterates over the triangular region above the line. Since we know the length of the rectangular region, the first loop can be unrolled to give the following loop nest.

         DO 10 I = 1, N, S
           DO 20 J = 1, αI+β
             loop body unrolled S-1 times
20         CONTINUE
           DO 10 II = I+1, I+S-1
             DO 10 J = MAX(1, αI+β+1), αII+β
10             loop body

Depending upon the values of α and β, it may also be possible to determine the size of the triangular region; therefore, it may be possible to completely unroll the second loop nest to eliminate the overhead. The formula above can be trivially extended to handle α < 0.

While the previous method applies to many of the common irregular-shaped iteration spaces, there are still some important loops that it will not handle. In linear algebra, signal processing and partial differential equation codes, we often find loops with trapezoidal-shaped iteration spaces. Consider the following example.

         DO 10 I = 1, N
           DO 10 J = I, MIN(I+M, L)
10           loop body

Here, as is often the case, a MIN function has been used to handle boundary conditions, resulting in the loop's trapezoidal shape. The formula to handle a triangular region does not apply in this case, so we must extend it. Consider the previous loop after normalization.

         DO 10 I = 1, N
           DO 10 J = 1, MIN(I+M, L)-I+1
10           loop body

We have a rectangular iteration space while I+M < L and a triangular region after that point. Because we already know how to handle rectangular and triangular regions and because we want to reduce the execution overhead in the rectangular region, we can use index-set splitting, as described previously, to split the loop into one rectangular and one triangular iteration space.

         DO 10 I = 1, MIN(N, L-M)
           DO 10 J = 1, M+1
10           loop body
         DO 10 I = MAX(1, MIN(N, L-M)+1), N
           DO 10 J = 1, L-I+1
10           loop body

We can now apply regular unroll-and-jam to the first loop and triangular unroll-and-jam to the second. For the general case, we can easily extend the index-set splitting to allow MIN or MAX functions in either the upper or lower bound of a loop and to allow constant coefficients on the induction variable appearances in the bounds of the inner loop.
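As a quick check that the index-set split enumerates exactly the normalized trapezoid, the following sketch (ours) compares the two iteration sets for a few parameter choices.

      def trapezoid(N, M, L):
          # DO I = 1,N; DO J = 1, MIN(I+M,L)-I+1
          return {(I, J) for I in range(1, N + 1)
                  for J in range(1, min(I + M, L) - I + 2)}

      def split(N, M, L):
          # rectangular nest: DO I = 1, MIN(N,L-M); DO J = 1, M+1
          rect = {(I, J) for I in range(1, min(N, L - M) + 1)
                  for J in range(1, M + 2)}
          # triangular nest: DO I = MAX(1,MIN(N,L-M)+1), N; DO J = 1, L-I+1
          tri = {(I, J) for I in range(max(1, min(N, L - M) + 1), N + 1)
                 for J in range(1, L - I + 2)}
          return rect | tri

      for N, M, L in [(10, 3, 8), (5, 2, 20), (7, 0, 7)]:
          assert split(N, M, L) == trapezoid(N, M, L)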
3 Experimental Results

We have implemented a prototype source-to-source translator based on PFC, a dependence analyzer and parallelization system for Fortran programs. The prototype restructures loops using unroll-and-jam and then replaces subscripted variables with scalars, using both the knapsack and greedy algorithms to moderate register pressure. For our test machine, we chose the MIPS M120 (based on the MIPS R2000) because it had a good compiler and a reasonable number of floating-point registers (16) and because it was available to us. In examining these results, the reader should be aware that some of the improvements cited may have been achieved because the transformed versions of the program have, in addition to better register utilization, more operations to fill the pipeline on the M120.

The experimental design is illustrated in Figure 2. In this scheme, PFC serves as a preprocessor, rewriting Fortran programs to improve register allocation. Both the original and transformed versions of the program are then compiled and run using the standard product compiler for the target machine.

[Figure 2: Experimental design. The Fortran source program is fed both directly and through the transformer (PFC) to the Fortran compiler, yielding the original and transformed times.]

Livermore Loops. We tested scalar replacement on a number of Livermore loops with unroll-and-jam turned off. Many of the loops did not contain opportunities for reuse or contained inner-loop conditional control flow, which we do not currently handle. Therefore, we do not show results for these loops, since no improvement was attainable by our system. In Table 1, we show the performance of the original and transformed loops, using arrays of size 1000 and single-precision REALs. The "Unrolled" column shows the speed of the original loop after unrolling it an amount equivalent to that done by scalar replacement. The speedup column may have two fields. The first field shows the speedup of the transformed loop over the original loop. The second field, if present, shows the speedup of the transformed loop over the unrolled loop.

      Loop   Iterations   Original   Unrolled   Transformed   Speedup
        1      20,000      25.44s     20.6s       20.0s       1.27 / 1.03
        5      25,000      16.61s     15.12s      12.15s      1.37 / 1.24
       13       2,600      30.22s       -         29.14s      1.04
       14       3,000      22.8s        -         19.61s      1.16
       19      10,000      21.19s       -         19.36s      1.09

                    TABLE 1: Livermore Loops results.
Some of the interesting results include the performances of loops 5 and 11, which compute first-order linear recurrences. In each of these loops, one load per iteration of the loop is removed and the loop is unrolled once. The unrolling has limited effect because of the recurrence and the fact that our version of the MIPS compiler (1.31) seems to be more conservative in allocating registers to loop-independent dependences when stores to arrays are present. Our method removes this problem with the allocation of temporary variables and we see significant improvements.

Loops 1 and 12 do not show as much improvement over the unrolled original loop because they involve input dependences rather than true dependences. Because there is no recurrence, unrolling improves parallelism and the MIPS compiler can allocate registers to the loop-independent dependences that are created.

Loop 6, shown below, is an interesting case because it involves a scalar array reference that is difficult to recognize.

         DO 6 I = 2, N
           DO 6 K = 1, I-1
6            W(I) = W(I) + B(I,K)*W(I-K)

A compiler that recognizes loop-invariant addresses would fail to get this case because of the load of W(I-K). Through the use of dependence analysis, we are able to determine that there is no dependence between W(I) and W(I-K) that is carried by the innermost loop. Because of this additional information, we are able to get a speedup of 1.09.

Loop 8 required more registers than the MIPS provides, so the transformed loop was approximately 10% slower than the original loop. After we used either the knapsack or greedy algorithm to relieve register pressure, we saw a speedup of 1.04 over the original loop.

Linpack Benchmark. An important test for a restructuring compiler is to achieve the same results as a hand-optimized version of code. We chose the routine SMXPY from LINPACK (see Appendix A), which computes a vector-matrix product and was hand unrolled to improve performance [DBMS79]. We rerolled it to get the original loop, shown below:

         DO 10 J = 1, N
           DO 10 I = 1, N
10           Y(I) = Y(I) + X(J)*M(I,J)

and then fed this loop to PFC, which performed unroll-and-jam 15 times (the same as the hand-optimized version) plus scalar replacement. As the table below shows, our automatic techniques achieved slightly better performance from the easy-to-understand original loop, without the effort of hand optimization. The speedup column represents the improvement of the automatic version over the hand-optimized version. To make this measurement, we iterated over each loop 30 times using arrays of dimension 1000.

      Original   LINPACK   Automatic   Speedup
       28.0s      18.47s     17.52s      1.05

Matrix Multiplication. Next, we performed our test on matrix multiplication with 500 x 500 arrays of single-precision REAL numbers (see Appendix B). For matrix multiplication, we used various unroll values performed on both outer loops plus scalar replacement. We show their separate effects in the table below. The first entry in the speedup column of the table represents the speedup due to scalar replacement alone, whereas the second column shows the speedup due to unroll-and-jam plus scalar replacement.

[Table: matrix multiplication times at various unroll values, with speedups from scalar replacement alone and from unroll-and-jam plus scalar replacement]

When the unroll values used on matrix multiply exceeded 1, the register pressure of the loop caused scalar replacement to hurt performance rather than help it. Fortunately, both the knapsack and greedy algorithms successfully eliminated this problem, reducing the time for the loop unrolled by a factor of 3 from 67.2s to 60.49s.
A Trapezoidal Loop. To test the effectiveness of unroll-and-jam on loops with irregular-shaped iteration spaces, we chose the following trapezoidal loop, which computes the adjoint-convolution of two time series.

         DO 10 I = 1, N1
           DO 10 K = I, MIN(I+N2, N3)
10           F3(I) = F3(I) + DT*F1(K)*WXLK(I-K)

We performed trapezoidal unroll-and-jam by hand with an unroll value of 3 plus scalar replacement. We ran it on arrays of single-precision REALs of size 100 so that 75% of the execution was done in the triangular region. Below is a table of the results.

      Iterations   Original   Transformed   Speedup
         5000       15.59s       9.08s        1.72

A Complete Application. We also performed the test on a complete application, onedim. This program computes the eigenfunctions and eigenenergies of a time-independent Schrödinger equation. The table below shows the results when scalar replacement and unroll-and-jam are applied with an unroll value of 2.

      Original   Transformed   Speedup
       47.9s        33.34s       1.44

4 Summary and Future Work

We have presented the method of scalar replacement, which transforms program source to uncover opportunities for reuse of subscripted variables. This method can be enhanced by a number of transformations, particularly unroll-and-jam. We have shown how unroll-and-jam can be applied to a variety of irregular loop structures, including triangular and trapezoidal loops. We have also demonstrated that these techniques can lead to dramatic speedups on RISC architectures, even those for which the native compiler does extensive code optimization (the MIPS R2000, for example).

We are currently engaged in studying the generality of these techniques by extending our implementation and applying it to a wider class of programs. This research involves the following subprojects:

- The current implementation is incomplete. We are in the process of extending it to automatically compute unroll amounts and to handle triangular and trapezoidal loops and loops with conditionals.

- Similar methods can be used to improve cache performance. The key transformation for this purpose is strip-mine-and-interchange [Por89] (sometimes called loop blocking [CK89], iteration space tiling [Wol87, Wol89] or supernode partitioning [IT88]). This transformation is essentially unroll-and-jam with the copies of the inner loop body rolled back into a loop. We have implemented a preliminary version of this transformation in PFC and performed some experiments that indicate that it further enhances machine performance [CK89]. We are in the process of enhancing this transformation to automatically select strip sizes and to handle complicated loop structures.

- Although the study reported here demonstrates the potential of these methods, we need to conduct a study that determines their effectiveness on whole programs. Once the implementation is complete, we plan to apply the transformations to the RiCEPS Fortran benchmark suite being developed at Rice.

Given that future machine designs are certain to have increasingly complex memory hierarchies, compilers will need to adopt increasingly sophisticated strategies for managing the memory hierarchy so that programmers can remain free to concentrate on program logic. The transformations and experiments reported in this paper represent an encouraging first step in that direction. It is pleasing to note that the theory of dependence, originally developed for vectorizing and parallelizing compilers, can be used to substantially improve the performance of modern scalar machines as well.

Acknowledgments

The authors would like to thank Preston Briggs, Keith Cooper, Ervan Darnell, Paul Havlak and Chau-Wen Tseng for their helpful suggestions during the preparation of this document. We also thank John Cocke and Peter Markstein for giving us the inspiration for this work and supporting us while we pursued it.
References

[AC72] F.E. Allen and J. Cocke. A catalogue of optimizing transformations. In Design and Optimization of Compilers, pages 1-30. Prentice-Hall, 1972.

[AK84a] J.R. Allen and K. Kennedy. Automatic loop interchange. In Proceedings of the SIGPLAN '84 Symposium on Compiler Construction, SIGPLAN Notices Vol. 19, No. 6, June 1984.

[AK84b] J.R. Allen and K. Kennedy. PFC: A program to convert Fortran to parallel form. In Supercomputers: Design and Applications, pages 186-205. IEEE Computer Society Press, Silver Spring, MD, 1984.

[AK88] J.R. Allen and K. Kennedy. Vector register allocation. Technical report, Department of Computer Science, Rice University, 1988.

[AN87] A. Aiken and A. Nicolau. Loop quantization: An analysis and algorithm. Technical Report 87-821, Cornell University, March 1987.

[CAC+81] G.J. Chaitin, M.A. Auslander, A.K. Chandra, J. Cocke, M.E. Hopkins, and P.W. Markstein. Register allocation via coloring. Computer Languages, 6:47-57, January 1981.

[CCK87] D. Callahan, J. Cocke, and K. Kennedy. Estimating interlock and improving balance for pipelined machines. In Proceedings of the 1987 International Conference on Parallel Processing, August 1987.

[CK89] S. Carr and K. Kennedy. Blocking linear algebra codes for memory hierarchies. In Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing, Chicago, IL, December 1989.

[DBMS79] J.J. Dongarra, J.R. Bunch, C.B. Moler, and G.W. Stewart. LINPACK User's Guide. SIAM Publications, Philadelphia, 1979.

[GJG87] D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformations. In Proceedings of the First International Conference on Supercomputing. Springer-Verlag, Athens, Greece, 1987.

[IT88] F. Irigoin and R. Triolet. Supernode partitioning. In Conference Record of the Fifteenth ACM Symposium on the Principles of Programming Languages, pages 319-328, January 1988.

[Kuc78] D. Kuck. The Structure of Computers and Computations, Volume 1. John Wiley and Sons, New York, 1978.

[Por89] A.K. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Applications. PhD thesis, Rice University, May 1989.

[Wol82] M. Wolfe. Optimizing Supercompilers for Supercomputers. PhD thesis, University of Illinois, October 1982.

[Wol86] M. Wolfe. Advanced loop interchange. In Proceedings of the 1986 International Conference on Parallel Processing, August 1986.

[Wol87] M. Wolfe. Iteration space tiling for memory hierarchies, December 1987. Extended version of a paper which appeared in Proceedings of the Third SIAM Conference on Parallel Processing.

[Wol89] M. Wolfe. More iteration space tiling. In Proceedings of the Supercomputing '89 Conference, 1989.

Appendix A

A version of SMXPY from LINPACK.

      J = MOD(N2,2)
      IF (J .GE. 1) THEN
         DO 1 I = 1, N1
            Y(I) = (Y(I)) + X(J)*M(I,J)
    1    CONTINUE
      ENDIF
      J = MOD(N2,4)
      IF (J .GE. 2) THEN
         DO 2 I = 1, N1
            Y(I) = ((Y(I)) + X(J-1)*M(I,J-1)) + X(J)*M(I,J)
    2    CONTINUE
      ENDIF
      J = MOD(N2,8)
      IF (J .GE. 4) THEN
         DO 3 I = 1, N1
            Y(I) = ((((Y(I)) + X(J-3)*M(I,J-3)) + X(J-2)*M(I,J-2))
     $             + X(J-1)*M(I,J-1)) + X(J)*M(I,J)
    3    CONTINUE
      ENDIF
      J = MOD(N2,16)
      IF (J .GE. 8) THEN
         DO 4 I = 1, N1
            Y(I) = ((((((((Y(I))
     $             + X(J-7)*M(I,J-7)) + X(J-6)*M(I,J-6))
     $             + X(J-5)*M(I,J-5)) + X(J-4)*M(I,J-4))
     $             + X(J-3)*M(I,J-3)) + X(J-2)*M(I,J-2))
     $             + X(J-1)*M(I,J-1)) + X(J)*M(I,J)
    4    CONTINUE
      ENDIF
      JMIN = J + 16
      DO 6 J = JMIN, N2, 16
         DO 5 I = 1, N1
            Y(I) = ((((((((((((((((Y(I))
     $             + X(J-15)*M(I,J-15)) + X(J-14)*M(I,J-14))
     $             + X(J-13)*M(I,J-13)) + X(J-12)*M(I,J-12))
     $             + X(J-11)*M(I,J-11)) + X(J-10)*M(I,J-10))
     $             + X(J-9)*M(I,J-9)) + X(J-8)*M(I,J-8))
     $             + X(J-7)*M(I,J-7)) + X(J-6)*M(I,J-6))
     $             + X(J-5)*M(I,J-5)) + X(J-4)*M(I,J-4))
     $             + X(J-3)*M(I,J-3)) + X(J-2)*M(I,J-2))
     $             + X(J-1)*M(I,J-1)) + X(J)*M(I,J)
    5    CONTINUE
    6 CONTINUE
Appendix B

Automatically generated matrix multiply with unroll-and-jam performed on both the I-loop and J-loop with an unroll value of 1 plus scalar replacement.

      DO 1 J = 1, MOD(N,2), 1
         DO 2 I = 1, MOD(N,4), 1
            C(I,J) = 0.0
            IF (1 .GT. N) GOTO 4
            C$1$0 = C(I,J) + A(I,1)*B(1,J)
            DO 3 K = 2, N, 1
               C$1$0 = C$1$0 + A(I,K)*B(K,J)
    3       CONTINUE
            C(I,J) = C$1$0
    4       CONTINUE
    2    CONTINUE
         DO 5 I = MOD(N,4)+1, N, 4
            C(I,J) = 0.0
            C(I+1,J) = 0.0
            C(I+2,J) = 0.0
            C(I+3,J) = 0.0
            IF (1 .GT. N) GOTO 7
            B$1$0 = B(1,J)
            C$1$0 = C(I,J) + A(I,1)*B$1$0
            C$2$0 = C(I+1,J) + A(I+1,1)*B$1$0
            C$3$0 = C(I+2,J) + A(I+2,1)*B$1$0
            C$4$0 = C(I+3,J) + A(I+3,1)*B$1$0
            DO 6 K = 2, N, 1
               B$1$0 = B(K,J)
               C$1$0 = C$1$0 + A(I,K)*B$1$0
               C$2$0 = C$2$0 + A(I+1,K)*B$1$0
               C$3$0 = C$3$0 + A(I+2,K)*B$1$0
               C$4$0 = C$4$0 + A(I+3,K)*B$1$0
    6       CONTINUE
            C(I+3,J) = C$4$0
            C(I+2,J) = C$3$0
            C(I+1,J) = C$2$0
            C(I,J) = C$1$0
    7       CONTINUE
    5    CONTINUE
    1 CONTINUE
      DO 8 J = MOD(N,2)+1, N, 2
         DO 9 I = 1, MOD(N,2), 1
            C(I,J) = 0.0
            C(I,J+1) = 0.0
            IF (1 .GT. N) GOTO 11
            A$1$0 = A(I,1)
            C$1$0 = C(I,J) + A$1$0*B(1,J)
            C$2$0 = C(I,J+1) + A$1$0*B(1,J+1)
            DO 10 K = 2, N, 1
               A$1$0 = A(I,K)
               C$1$0 = C$1$0 + A$1$0*B(K,J)
               C$2$0 = C$2$0 + A$1$0*B(K,J+1)
   10       CONTINUE
            C(I,J+1) = C$2$0
            C(I,J) = C$1$0
   11       CONTINUE
    9    CONTINUE
         DO 12 I = MOD(N,2)+1, N, 2
            C(I,J) = 0.0
            C(I,J+1) = 0.0
            C(I+1,J) = 0.0
            C(I+1,J+1) = 0.0
            IF (1 .GT. N) GOTO 14
            B$1$0 = B(1,J)
            A$1$0 = A(I,1)
            C$1$0 = C(I,J) + A$1$0*B$1$0
            B$2$0 = B(1,J+1)
            C$2$0 = C(I,J+1) + A$1$0*B$2$0
            A$2$0 = A(I+1,1)
            C$3$0 = C(I+1,J) + A$2$0*B$1$0
            C$4$0 = C(I+1,J+1) + A$2$0*B$2$0
            DO 13 K = 2, N, 1
               B$1$0 = B(K,J)
               A$1$0 = A(I,K)
               C$1$0 = C$1$0 + A$1$0*B$1$0
               B$2$0 = B(K,J+1)
               C$2$0 = C$2$0 + A$1$0*B$2$0
               A$2$0 = A(I+1,K)
               C$3$0 = C$3$0 + A$2$0*B$1$0
               C$4$0 = C$4$0 + A$2$0*B$2$0
   13       CONTINUE
            C(I+1,J+1) = C$4$0
            C(I+1,J) = C$3$0
            C(I,J+1) = C$2$0
            C(I,J) = C$1$0
   14       CONTINUE
   12    CONTINUE
    8 CONTINUE