Program Improvement by Source-to-Source Transformation

DAVID B. LOVEMAN

Massachusetts Computer Associates, Inc., Wakefield, Massachusetts

ABSTRACT. The use of source-to-source program transformations has proved valuable in improving program performance. The concept of program manipulation is elucidated by describing its role in both conventional optimization and high level modification of conditional, looping, and procedure structures. An example program fragment written in an Algol-like language is greatly improved by transformations enabled by a user-provided assertion about a data array. A compilation model based on the use of source-to-source program transformations is used to provide a framework for discussing issues of code generation, compilation of high level languages such as APL, and eliminating overhead commonly associated with modular structured programming. Application of the compilation model to several different languages is discussed.

KEY WORDS AND PHRASES: program optimization, source-to-source transformation, program improvement, compilation, program manipulation, program transformation

CR CATEGORIES: 4.12, 4.22, 5.24, 5.25

Copyright © 1977, Association for Computing Machinery, Inc. This paper is a revised version of one with the same title that appears in the Conference Record of the Third ACM Symposium on Principles of Programming Languages, held by ACM-SIGACT and ACM-SIGPLAN, January 19-21, 1976. Author's address: Massachusetts Computer Associates, Inc., 26 Princess Street, Wakefield, MA 01880.

Introduction

A computer program may be subject to severe time and/or space constraints caused by the necessity of execution on a small minicomputer, real time performance requirements, or economic considerations resulting from repetitive execution of the program. Optimization is the term given to the application of a set of rules for manipulating various representations of a program by exploiting local or global invariances within the program in order to improve the program relative to some measure. Conventionally, optimization has been regarded as a behind-the-scenes activity to be done by a compiler on a representation of the program invisible to the programmer. We have found, however, that program improvement can be viewed profitably as successive transformations on source language representations of a program visible to the programmer and subject to his direction and guidance. This perspective, advocated recently by such investigators as Knuth [16], Wegbreit [28], Standish et al. [25], Cheatham and Townley [8], and Hoare [12], has proved invaluable at Massachusetts Computer Associates in such diverse applications as optimizing compilers for Fortran and a high level cryptographic language, design of a hardware resource allocation technique for a language for computer controlled automatic test equipment [19], and indicating parallelism in a Fortran compiler for the ILLIAC IV [21].
This view of program improvement will be shown in this paper to provide a model of the bulk of the compilation process (including most of code generation), to indicate how to efficiently compile high level languages such as APL, and to show how much of the overhead associated with conventional structured or modular programming can be eliminated.

Our purpose is not to provide a formal treatment of source-to-source transformations, nor to prove that certain transformations, under specific enabling conditions, preserve program equivalence. It is, rather, to establish informally and by means of examples a uniform view of the compilation process, from the processing of high level language features to the problems of code generation. As such we have glossed over a number of interesting problems, not the least of which are those of representation, both of program and of transformations. Our only assumptions are that a tree-structured representation of a program is obtainable and that the source-to-source transformations are expressible as pattern-directed rearrangements of tree-structured forms of program text.

In our study of source-to-source transformations, both in the abstract and in the implementation of several language processing systems, we have found several notions of use. First, transformations are not applied in random order; the successful application of a transformation suggests successor transformations. Indeed, in a practical application, the most difficult software engineering problems are concerned with transformation ordering and information gathering. Second, although a transformation may be applicable, it may not win, or cause an improvement in the program. Third, the distinction between machine-dependent and machine-independent portions of a compiler is more subtle than usually thought; a transformation on a program may be machine independent, in the usual sense, but the reason for applying it may well depend on the target machine architecture. Fourth, a number of interesting transformations were identified. In particular, the concept that a variable use may on occasion be replaced by an expression representing an assertion about the value of the variable is quite powerful. Fifth, several sequences of transformations have been identified and studied. The code generation sequence in particular is quite powerful.
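To give one concrete reading of "pattern-directed rearrangements of tree-structured forms," the following Python sketch matches a pattern containing variables against a tuple-encoded expression tree. It is purely illustrative: the tuple encoding, the "?" convention for pattern variables, and the use of Python itself are assumptions for exposition, not part of any system described in this paper.

    # Patterns are trees whose leaves may be pattern variables ("?x").
    # match returns a binding environment, or None if the match fails.
    def match(pattern, tree, env=None):
        env = dict(env or {})
        if isinstance(pattern, str) and pattern.startswith('?'):
            if pattern in env and env[pattern] != tree:
                return None                    # inconsistent rebinding
            env[pattern] = tree
            return env
        if (isinstance(pattern, tuple) and isinstance(tree, tuple)
                and len(pattern) == len(tree)):
            for p, t in zip(pattern, tree):
                env = match(p, t, env)
                if env is None:
                    return None
            return env
        return env if pattern == tree else None

    # The simplification x*1 + 0 => x expressed as a single pattern:
    lhs = ('+', ('*', '?x', 1), 0)
    assert match(lhs, ('+', ('*', ('-', 'a', 'b'), 1), 0)) == {'?x': ('-', 'a', 'b')}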
The Programming Process

Before discussing optimization, we should discuss some ideas on the nature of the programming process. The programming process can be divided somewhat arbitrarily into a number of conceptual phases: construction, verification, optimization, compilation, execution, maintenance, documentation, etc. In an idealized world one constructs a program with the help of computer aids such as a text editor, file system, and syntax checker. The intent during program construction is to describe data objects of interest, primary operations on these objects, and an algorithm using these objects and operations to solve some problem. At the construction phase in the programming process, programming style should be aimed primarily at clarity of expression and ease of verification; only secondary attention should be paid to efficiency. Unfortunately, programming styles which lead to logically clear programs tend to result in programs which, when compiled, are not efficient enough to meet their operational requirements.

Structured programming techniques, such as the isolation of implementation details of operations from the application of the operations, tend to result in highly modular programs. Naive compilation of highly modular programs results in great inefficiency as a consequence of the interconnection of the modules. It is necessary to decide, for example, which modules should be specialized to their particular calling environments and which should be left as general modules. The code resulting from modules which are expanded in line will usually allow for considerable optimization when viewed in the context of the program in which the expansion occurs. Conventional optimization techniques, unfortunately, will perform only a little of this optimization. More powerful techniques which address themselves to program structure, loops and conditionals in particular, are needed. These optimization techniques may be triggered in two ways: by facts known about the particular data objects being manipulated at the point of module expansion, or by interactions with the program text that is the context of the point where the module is expanded. Examples of these two different ways of triggering optimization will be given in a later section.

In general we view the process of optimization as one which either deduces, or accepts as given, statements of invariance in a program. Statements of invariance include the fixing of the specification of the internal representation of the basic operations on the data objects, as well as such "standard" constancy information as "N has the value of 4," "SIN represents the library sine routine," and "This algorithm will be executed only on a CDC 7600." Given statements of invariance, certain program transformations may be shown to preserve program correctness while at the same time "improving" the program relative to some measure.

Obviously fault can be found with the simplistic view of the programming process presented above. Algorithms to manipulate data objects are not designed without some notion of how to implement these objects. Even at program construction some constancy information such as "this program will be compiled by IBM Fortran Level H" may be known, and the statement of the algorithm may be affected. Also, while doing optimization, one may discover that a chosen representation does not allow the operational requirements to be met, thus requiring the construction, verification, and optimization of a new representation.
Our intent in dividing the programming process into construction, verification, and optimization for algorithm and representation is to stress two separations: of the data objects and operations on them which are natural to a problem area from the underlying representation of the objects and operations; and of the problem of constructing a correct algorithm from the problem of making an algorithm meet certain time and/or space requirements.

Our model of optimization is that it is a phase in the program preparation process following the construction and verification of a correct program. The first step in the optimization process is the measurement of the program to determine whether or not optimization is in fact needed at all. If it is needed, various transformations must be made to the program in order to improve it. In the absence of software tools, the programmer must resort to his instinct and knowledge of the language system in order to improve the program. Alternatively, the programmer can utilize a set of guidelines on how to improve a program [27]. In addition to giving unpredictable results, these methods require the programmer to physically modify the program, degrading the readability and, in all probability, introducing subtle errors. With more sophisticated tools, the program might have been debugged by using a "checkout" compiler with good diagnostic messages. If the resulting program needs improvement, it can be compiled by using the "optimizing" compiler. If both compilers accept the same language, and if the optimizing compiler improves the program enough to meet the operational requirements, this is an acceptable approach.

The approach we favor, however, is to allow the programmer to be the strategist and to provide a mechanical assistant to perform the optimization itself. A canned set of transformations may improve the program sufficiently. Otherwise the programmer may have to experiment, applying transformations and measuring the resulting program repetitively until a sufficiently improved program results. The programmer indicates the transformations to perform; the system performs them mechanically, verifying that program equivalence is preserved or requesting that the programmer so verify.

The advantages of such an interactive program manipulation system seem obvious. First, effort in optimization is spent only when program measurement determines that the effort is necessary and potentially beneficial, and only on those portions of the program where the payoff appears to be high. Work on program measurement [13] and goal-directed program transformations [28] is relevant here. Second, by concatenating sequences of low level transformations for the purpose of performing higher level manipulations, the system will provide the programmer with a powerful set of tools. Such sequences might include, for example, constant folding, consisting of constant detection, propagation, computation, and dead variable elimination, or code generation, which we discuss at length later. A reasonable manipulation system will allow the programmer to experiment with and produce his own sequences of transformations for his own purposes. Indeed, we might view a conventional compiler with an optional optimizing phase as a very simple example of a program manipulation system, although there is only one fixed sequence of transformations [9, 20].
Third, if a transformation has been shown to preserve program correctness, the programmer may invoke it and investigate its effect on the program's efficiency without any concern for its effect on program correctness. Finally, a programmer is free to use all the facilities of his programming language to produce a high level, well-structured, modular program with the knowledge that, should it prove to be inefficient, he has access to a set of tools to change representations and eliminate modularization overhead while preserving the correctness of his original program.

High Level Optimization

In optimization we treat the algorithm as an object of manipulation, determine items of algorithm constancy either from the algorithm itself or by statement from the outside, and simplify the algorithm based on the constancy. We include within optimization such standard techniques as constant propagation, constant computation, common subexpression elimination, removal of invariant code from loops, strength reduction, elimination of redundant code, etc. Our primary interest, however, is in high level optimizations such as case splitting based on known possible values of a variable, loop unrolling, loop fusion, interaction between loops and conditional statements, etc. We are interested in those optimizations which can be described as transformations on a tree-structured representation of a program, that is, optimization which can be considered as source-to-source. We thus may describe an optimization as a mapping of one particular tree structure to another.

The vital issue is that this mapping must preserve program validity. In some cases, for example loop unrolling, the transformation always preserves validity; in others certain enabling conditions must be true before the transformation can be performed. For example, two syntactically identical computations are conventional common subexpressions only if corresponding variables referenced by them can be shown to have the same values. A second consideration, the raison d'être of optimization, is that the transformation must lead to an improvement (of some form) in the program. A complication is that the transformation itself may not improve the program, but may change it in such a way that later transformations will result in a gain. This is analogous to the intermediate expression swell issue in algebraic symbol manipulation. For example, in certain cases computation done prior to an if ... then ... else can be duplicated on the then and else branches. This results in no savings of time and an increase in space, but it may be that taking advantage of the knowledge that the if predicate is true on the then branch and false on the else branch will lead to considerable savings.

One can imagine an optimizer searching blindly through a program tree, trying to decide where to apply transformations. Such a process would be extremely wasteful, however. It appears that the application of certain transformations immediately suggests other transformations to apply. For example, having discovered that a variable is assigned a constant value immediately suggests that we try to propagate that constant and then see whether the assignment is to a dead variable. Having propagated the constant value, we want to see if constant computations can be done and whether we can assert that other variables are constant.
We thus view an optimization as containing seven parts:
- the name of the optimization, for identification purposes;
- a pattern to be found in the program tree structure;
- a pattern predicate which evaluates to true if the optimization preserves program validity;
- a win predicate which evaluates to true if the optimization, or a succeeding optimization, will improve the program;
- a sequence of pattern-replacement rules which make the transformation;
- a set of predicates asserted to be true as a result of applying this optimization;
- a sequence of names of optimizations which should be tried after applying this optimization.

A serious problem involves the propagation of information from its point of origin to where it is needed to evaluate predicates in an optimization rule. Classical optimizers such as IBM's Fortran H have used the techniques of back dominators or interval analysis as described in Schaefer [22]. An improved form of global analysis called "p-graph analysis" has been developed by Massachusetts Computer Associates and used in the ILLIAC IV Fortran compiler and in the Fortran Laundry, a program developed for Fortran source language optimization [20, 24]. Wegbreit [29] has extended the p-graph concepts so that they apply to any property set which satisfies an appropriate set of axioms. Recent work by Karr [14] has studied the problems of gathering and propagating information about programs.

We are interested in techniques for manipulating programs which have looping structures explicitly written by the programmer. Although these are of interest, of greater interest are the loops which exist implicitly within the program as a result of high level programming language features. A programmer could conceivably perform transformations such as loop fusion on loops which he has written. There is no way, however, that he can touch the implicit loops in the program. By analogy, a programmer can, if he is careful, perform common subexpression elimination on code he has written. There is no way, however, that he can eliminate the common code resulting from similar array references; the compiler must accomplish this. High level programming language features which provide implicit looping may exist in two forms: They may be built into the language as standard features, such as the primary operators and condition expressions in cryptological languages or the vector and array operations in APL, or they may be provided by extension in an extensible language such as ECL [30]. The first case can clearly be viewed as a special case of the second, where the "extensions" are made at language definition time.
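Returning to the seven-part view of an optimization above, the parts might be collected into a single record. The following Python sketch is purely illustrative; the field names, the Tree placeholder, and the choice of Python are assumptions for exposition, not a description of any of the systems mentioned here.

    from dataclasses import dataclass, field
    from typing import Callable, List, Tuple

    Tree = object   # stands for the tree-structured program representation

    @dataclass
    class Optimization:
        name: str                                      # identification
        pattern: Tree                                  # shape sought in the program tree
        pattern_predicate: Callable[[Tree], bool]      # true => validity preserved
        win_predicate: Callable[[Tree], bool]          # true => this or a successor improves the program
        replacements: List[Tuple[Tree, Tree]]          # pattern-replacement rewrite rules
        assertions: List[str] = field(default_factory=list)   # predicates asserted after applying
        try_next: List[str] = field(default_factory=list)     # optimizations to try afterward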
Examples of Higher Level Optimization

1. Assume the data type matrix is defined in a natural way and mult is a procedure which will multiply an l by m and an m by n matrix, giving an l by n result matrix:

    procedure mult(x: matrix[l, m], y: matrix[m, n], z: matrix[l, n]):
        loop for i := 1 to l:
            loop for j := 1 to n:
                z[i, j] := 0;
                loop for k := 1 to m:
                    z[i, j] := x[i, k]*y[k, j] + z[i, j];
                repeat;
            repeat;
        repeat;
    end;

Let us look at a particular call on mult, namely, a multiplication of a by b giving c, where a happens to be a diagonal matrix:

    diagonal(x: matrix) ≡ i ≠ j implies x[i, j] = 0;
    declare a: matrix[10, 10], b: matrix[10, 20], c: matrix[10, 20];
    assume a is diagonal;
    mult(a, b, c);

Directly expanding mult in line gives, with constant propagation,

    loop for i := 1 to 10:
        loop for j := 1 to 20:
            c[i, j] := 0;
            loop for k := 1 to 10:
                c[i, j] := a[i, k]*b[k, j] + c[i, j];
            repeat;
        repeat;
    repeat;

(The syntax we use in the examples is close to that of [16], except that we use endif where Knuth uses fi.)

Case splitting based on the knowledge that a is diagonal allows a replacement of a[i, k] by if i ≠ k then 0 else a[i, k]. The assignment statement in the innermost loop becomes

    c[i, j] := (if i ≠ k then 0 else a[i, k])*b[k, j] + c[i, j];

By expanding the scope of the if, we get

    loop for i := 1 to 10:
        loop for j := 1 to 20:
            c[i, j] := 0;
            loop for k := 1 to 10:
                if i ≠ k then c[i, j] := 0*b[k, j] + c[i, j];
                else c[i, j] := a[i, k]*b[k, j] + c[i, j];
                endif;
            repeat;
        repeat;
    repeat;

The then clause simplifies to c[i, j] := c[i, j], which is redundant and may be eliminated. In the else clause i = k; thus references to k may be replaced by references to i. The program is now

    loop for i := 1 to 10:
        loop for j := 1 to 20:
            c[i, j] := 0;
            loop for k := 1 to 10:
                if i = k then c[i, j] := a[i, i]*b[i, j] + c[i, j]; endif;
            repeat;
        repeat;
    repeat;

The if statement within the innermost loop selects the one number in the range of k which is equal to the current value of i. In other words, the loop on k coupled with the if statement says, "If the sets of values of i and k intersect, and the current value of i is within the intersection, then perform the given code." In general the intersection of the sets of values might be null, or difficult to determine. In the case of nested loops, however, the tests are particularly easy. If we have, for example,

    loop for i := i_initial to i_final:
        loop for k := k_initial to k_final:
            if i = k then ... endif;
        repeat;
    repeat;

then the values of i and k intersect if i_initial ≤ k_final and k_initial ≤ i_final. The value of i in the inner loop is within the intersection if max(k_initial, i_initial) ≤ i and i ≤ min(k_final, i_final). The program may thus be written by using this transformation as

    loop for i := 1 to 10:
        loop for j := 1 to 20:
            c[i, j] := 0;
            if 1 ≤ 10 and 1 ≤ 10 and max(1, 1) ≤ i and i ≤ min(10, 10) then
                c[i, j] := a[i, i]*b[i, j] + c[i, j];
            endif;
        repeat;
    repeat;

1 ≤ 10, max(1, 1), and min(10, 10) clearly may be evaluated at compile time, and the predicate of the if statement is clearly true for all i, since the if statement is in the range of a loop in which 1 ≤ i ≤ 10.
Thus the if statement may be replaced by its then branch, giving

    loop for i := 1 to 10:
        loop for j := 1 to 20:
            c[i, j] := 0;
            c[i, j] := a[i, i]*b[i, j] + c[i, j];
        repeat;
    repeat;

At this point constant propagation, expression simplification, and dead variable elimination (after the zero has been propagated, the assignment c[i, j] := 0 is redundant since that generation of c[i, j] is dead) give the simplified code

    loop for i := 1 to 10:
        loop for j := 1 to 20:
            c[i, j] := a[i, i]*b[i, j];
        repeat;
    repeat;

With the program in this form, conventional compilation techniques would result in good object code. This example of optimization depended on assertions known true at the beginning and on assertions derived from the program. Clearly the more that is known about data and program regularities, the better the job an optimizer can do. Such initial assertions can come from two sources: from the programmer by direct statement, or derived from the program as a result of global compilation, compiling the routine in the presence of the call of the routine in a main program.

As a second example, suppose we have, in addition to the above, extended the definition of "*" as a binary infix operator for matrix multiply, "+" as a binary infix operator for matrix addition, and ":=" as a binary infix operator for matrix assignment. The meaning of "*" is the procedure mult; the meaning of "+" is

    procedure plus(x: matrix[l, m], y: matrix[l, m], z: matrix[l, m]):
        loop for i := 1 to l:
            loop for j := 1 to m:
                z[i, j] := x[i, j] + y[i, j];
            repeat;
        repeat;
    end;

and the meaning of ":=" is

    procedure assign(x: matrix[l, m], z: matrix[l, m]):
        loop for i := 1 to l:
            loop for j := 1 to m:
                z[i, j] := x[i, j];
            repeat;
        repeat;
    end;

Suppose we have the following:

    declare a: matrix[10, 10], b: matrix[10, 10], c: matrix[10, 10], d: matrix[10, 10];
    d := (a*b) + c;

Naive compilation of the assignment statement will give

    declare t1: matrix[10, 10], t2: matrix[10, 10];
    mult(a, b, t1);
    plus(t1, c, t2);
    assign(t2, d);

Note the need for 200 words of temporary storage. Expanding mult, plus, and assign in line gives

    loop for i := 1 to 10:
        loop for j := 1 to 10:
            t1[i, j] := 0;
            loop for k := 1 to 10:
                t1[i, j] := a[i, k]*b[k, j] + t1[i, j];
            repeat;
        repeat;
    repeat;
    loop for i := 1 to 10:
        loop for j := 1 to 10:
            t2[i, j] := t1[i, j] + c[i, j];
        repeat;
    repeat;
    loop for i := 1 to 10:
        loop for j := 1 to 10:
            d[i, j] := t2[i, j];
        repeat;
    repeat;

The outer loops can clearly be fused, since the index sets are the same and there are no data dependencies which forbid it, giving

    loop for i := 1 to 10:
        loop for j := 1 to 10:
            t1[i, j] := 0;
            loop for k := 1 to 10:
                t1[i, j] := a[i, k]*b[k, j] + t1[i, j];
            repeat;
            t2[i, j] := t1[i, j] + c[i, j];
            d[i, j] := t2[i, j];
        repeat;
    repeat;

We now note that each instance of t1[i, j] is dead after the (i, j) iteration of the two outer loops. That is, t1 and t2 have been reduced to scalars. Subsumption (replacing a variable which is used only once by its definition) and initializing t1 to c[i, j] gives

    loop for i := 1 to 10:
        loop for j := 1 to 10:
            t1 := c[i, j];
            loop for k := 1 to 10:
                t1 := a[i, k]*b[k, j] + t1;
            repeat;
            d[i, j] := t1;
        repeat;
    repeat;

We could have eliminated t1 completely by accumulating in d[i, j], but we did not for two reasons. First, on many machines an array reference is more expensive than a scalar reference. Second, a compiler's code generator is more likely to assign a scalar which is dead at the end of each loop iteration to a high speed register than to assign an array element reference to a high speed register.
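The net effect of the first example's derivation is easy to check mechanically. The following Python sketch is illustrative only: 0-based lists stand in for the paper's 1-based matrices, and the function names are hypothetical.

    def mult_full(a, b):
        # General l-by-m times m-by-n multiplication, as in procedure mult.
        l, m, n = len(a), len(b), len(b[0])
        c = [[0] * n for _ in range(l)]
        for i in range(l):
            for j in range(n):
                for k in range(m):
                    c[i][j] = a[i][k] * b[k][j] + c[i][j]
        return c

    def mult_diagonal(a, b):
        # The fully transformed form: c[i, j] := a[i, i]*b[i, j].
        n = len(b[0])
        return [[a[i][i] * b[i][j] for j in range(n)] for i in range(len(a))]

    a = [[(i + 1) if i == j else 0 for j in range(10)] for i in range(10)]  # diagonal
    b = [[i + 2 * j for j in range(20)] for i in range(10)]
    assert mult_full(a, b) == mult_diagonal(a, b)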
Description of Program Transformation Rules

As indicated earlier, a program transformation rule says: "Find some pattern in the program for which some conditions are true. If an evaluation function indicates that this transformation results in a gain, relative to the appropriate measure, then perform a set of replacements and make a set of assertions. When a rule has been executed, a set of rules may become applicable; see if they apply." A possible syntax for such a rule might be

    name:
        in pattern
        where pattern\predicate
        when win\predicate
        then pat_1 → replacement_1, ..., pat_n → replacement_n
        assert predicate_1, ..., predicate_k
        try name_1, ..., name_m;

As an example of a conventional optimization, the rules for constant folding might be written

    constant\assert:
        in "x := y" where constant(y)
        assert constant(x), value(x) = value(y)
        try dead\var, constant\propagate;

    constant\propagate:
        in "x" where constant(x)
        then "x" → value(x)
        try constant\computation, constant\assert;

    constant\computation:
        in "x op y" where constant(x) and constant(y)
        then "x op y" → value(x) op value(y)
        try constant\computation, constant\assert;

    dead\var:
        in "x := y" where dead(x)
        then "x := y" → null;

The rules in the above examples of constant folding do not have win predicates because they always result in an improvement. Other rules are not always beneficial. For example, in general, splitting a loop into two loops introduces an overhead of extra control structure. However, on a CDC 7600, if the new loops will fit into the instruction stack of 12 words, there will be a gain. This rule might be written

    loop\split:
        in "loop A; B repeat;"
        where can\split(loop) and code\length(loop) > 12
        when code\length("loop A repeat;") ≤ 12 or code\length("loop B repeat;") ≤ 12
        then "loop A; B repeat;" → "loop A repeat; loop B repeat;";

If we were to actually state formally all the details of the rules presented later, the presentation would be quite tedious, excessively long, and difficult to follow. Some predicates and functions, such as can\split and code\length above, are decidedly nontrivial, but their existence can be assumed without much difficulty. We are most interested in the pattern and replacements. Rather than give the detailed replacements, in most cases we have just given a generalized before and after example in the hope that the reader can deduce the form of the transformations.
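A minimal executable rendering of this rule style, specialized to constant computation, might look as follows in Python. The tuple-encoded expression trees and the recursive driver are simplifying assumptions for exposition, not the pattern machinery of any actual implementation.

    import operator

    OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul}

    def fold(tree):
        # constant\computation: in "x op y" where constant(x) and constant(y)
        # then "x op y" -> value(x) op value(y); try constant\computation again.
        if isinstance(tree, tuple):
            op, x, y = tree
            x, y = fold(x), fold(y)
            if isinstance(x, int) and isinstance(y, int):
                return OPS[op](x, y)        # the replacement; result is a constant
            return (op, x, y)
        return tree                          # leaf: a constant or a variable name

    assert fold(('+', 3, ('*', 2, 5))) == 13                 # 3 + 2*5  =>  13
    assert fold(('+', 'n', ('*', 2, 5))) == ('+', 'n', 10)   # partial folding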
Typical Program Transformation Rules

We have somewhat arbitrarily divided the program transformations into five classes for ease of exposition: simplifications, optimizations, evolutions, devolutions, and manipulations. Since program transformation rules are still being developed, we cannot give a complete set of such rules, if indeed such a set exists. We shall describe informally and by example a collection of approximately 40 transformation rules. These include fairly trivial program simplifications, conventional optimizations, sophisticated loop parallelism detection evolutions, straightforward devolution (replacement of a construct by its lower level meaning), and manipulation rules. They include all those referred to in the examples. An extensive collection of over 100 transformation rules is presented by Standish et al. [26].

Simplification refers to a generalization of the idea of constant computation and involves cleaning up local patches of program in fairly obvious ways. We shall list some of the more obvious simplifications:

• Constant computation, including evaluation of procedures with constant arguments (and no side effects): 3 + 2 ⇒ 5; sqrt(9) ⇒ 3.

• Expression simplification: x*1 + 0 ⇒ x.

• Loop collapse: elimination of a vacuous loop: loop repeat; ⇒ empty.

• Prune conditional: if true then A else B endif; ⇒ A; and if false then A else B endif; ⇒ B.

• Reorder conditional: if p then else A endif; ⇒ if not p then A endif;

• Assignment elimination by equality: if a = b, the assignment a := b; can be eliminated. A special case of this eliminates the assignment a := a;.

Optimization is the term we use to describe those program transformations, usually of a low level, which are already well understood in some sense. These transformations include all the classical ones and perhaps a few new ones:

• Common subexpression elimination.

• Code motion from program areas of high frequency to those of lower frequency.

• Strength reduction.

• Dead variable elimination.

• Constant propagation.

• Subsumption: following an assignment m := expr; and before modification of any part of expr, a use of m may be replaced by expr (only if expr has no side effects). This is especially valuable if there is only one use of m, or if expr is a simple variable name, in which case the transformation is called scalar propagation. The idea of subsumption may be used with embedded assignment to assist register assignment. For example, r := a + b; ... ; z := r; can be replaced by z := (r := a + b);

• go to chasing: a branch to an unconditional branch can be redirected to the final target: go to a; ... a: go to b; becomes go to b; ... a: go to b;

• Array temporary elimination: a[i, j] := b*c; ... ; a[j, i] := a[i, j]*2; can be replaced by t := b*c; a[i, j] := t; ... ; a[j, i] := t*2;

• Count up to zero: a counting up loop can be shifted so that the last loop value is zero, thus simplifying the loop termination test. This requires modifying all uses of the index within the loop and insertion of an index adjustment assignment following the loop. If the loop final value is a constant, the index offset is a constant and its addition can be done via address arithmetic. If the index is dead following the loop, the index adjustment assignment is not needed. (An executable rendering follows this list.)

    loop for i := 1 by 1 to n:
        ... i ...
    repeat;
⇒
    loop for i := 1 - n by 1 to 0:
        ... i + n ...
    repeat;
    i := n + 1;

• Move while to end: any loop must contain a conditional jump. Thus if a loop terminates with an unconditional jump, the loop can be rotated to move the conditional jump to the end, thus eliminating the unconditional jump.

    loop:
        A;
        while p;
        B;
    repeat;
⇒
    go to l;
    loop:
        B;
    l:  A;
        while p;
    repeat;
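As promised under "count up to zero" above, the two Python functions below compute the same sum; the second counts from 1 - n up to 0 so that the termination test compares against zero. This sketch is illustrative only; 0-based lists stand in for 1-based arrays.

    def sum_counting_up(a):
        s = 0
        for i in range(1, len(a) + 1):    # loop for i := 1 by 1 to n
            s = s + a[i - 1]
        return s

    def sum_counting_to_zero(a):
        n, s = len(a), 0
        for i in range(1 - n, 1):         # loop for i := 1 - n by 1 to 0
            s = s + a[(i + n) - 1]        # every use of i is rewritten as i + n
        return s
        # i := n + 1; after the loop is omitted here because i is dead.

    assert sum_counting_up([3, 1, 4, 1, 5]) == sum_counting_to_zero([3, 1, 4, 1, 5])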
The term evolution is used to describe those program transformations which discover higher level language constructs lurking among lower level constructs. Primarily we are concerned with discovering higher level parallel looping constructs. Lamport's work on the parallel execution of loops [17, 18, 21] is of great importance here. Lamport defines three types of parallel loops: loop conc ..., concurrent loops, in which individual processors are assumed to operate asynchronously on different iterations of the loop; loop sim ..., simultaneous loops, in which individual processors operate in lock-step fashion, similar to the ILLIAC IV; and loop sync ..., synchronized loops, in which the individual processors are assumed not to be in lock-step but to obey a synchronization condition which guarantees preservation of generation-use orderings. Two transformations have been developed to detect parallelism in conventional iterated loops:

• The coordinate method transforms certain well behaved sets of nested loop for's into a loop sim.

• The hyperplane method transforms certain well behaved sets of nested loop for's into a loop conc.

Obviously any loop conc is also a loop sim. It has been shown that any loop sim produced by the coordinate method is also a loop sync.

• Strip mining refers to the transformation of a given parallel loop into the iterative execution of a parallel loop with a smaller index set. This transformation is necessary for any parallel machine with a fixed number of processors. For example, on the ILLIAC IV, which has 64 processors (see the executable sketch following this list):

    loop for i := 1 by 1 to 256:
        a(i) := b(i) + c(i);
    repeat;
⇒
    loop conc for i := 1 by 1 to 256:
        a(i) := b(i) + c(i);
    repeat;
⇒
    loop for l := 0 by 1 to 3:
        loop sim for i := 64*l + 1 by 1 to 64*l + 64:
            a(i) := b(i) + c(i);
        repeat;
    repeat;

Observe that strip mining can work equally well for a strip of width one, i.e. for a machine with only one processor. It has been shown that any loop conc or loop sync can be strip mined. Thus in place of the second transformation in the example above, we could have strip mined in reverse order with a width of one to get the counting backward loop:

    loop for i := 256 by -1 to 1:
        a(i) := b(i) + c(i);
    repeat;

The transformations to parallel loops require considerable data dependency analysis, exactly the analysis necessary for loop splitting, loop fusion, etc. We shall see the role of parallel loop detection in easing the explanation of these transformations later. Another evolution transformation we need is

• Coindex detection, which, for a given loop, determines variables other than those given explicitly which also serve as loop indexes. The coindexes are written in the for clause as explicit parallel indexes. For example:

    j := 4;
    loop for i := 1 by 1 to 7:
        A;
        j := j + 3;
    repeat;
⇒
    loop for i := 1 by 1 to 7 # j := 4 by 3 to 22:
        A;
    repeat;
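As noted under strip mining above, the rewrite is easy to exhibit executably. In the Python sketch below (illustrative only; 0-based lists stand in for 1-based arrays) the 64-wide strips are written as an ordinary inner loop, which on an ILLIAC IV-like machine would run across the processors in lock-step.

    def strip_mined_add(a, b, c):
        for l in range(0, 4):                          # loop for l := 0 by 1 to 3
            for i in range(64*l + 1, 64*l + 64 + 1):   # loop sim for i := 64*l+1 by 1 to 64*l+64
                a[i - 1] = b[i - 1] + c[i - 1]

    b = list(range(256)); c = list(range(256, 512)); a = [0] * 256
    strip_mined_add(a, b, c)
    assert a == [b[i] + c[i] for i in range(256)]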
Devolution describes the replacement of a higher level language feature by its meaning in terms of lower level language features. Typically we perform transformations on given language constructs, then replace the constructs by their meaning and perform further transformations at the lower level.

• Coindex elimination:

    loop for i := bi by si to fi # j := bj by sj to fj:
        A;
    repeat;
⇒
    j := bj;
    until done:
        loop for i := bi by si to fi:
            A;
            j := j + sj;
            if (j - fj)*sign(sj) > 0 then done; endif;
        repeat;
    end;

• Multistep iteration elimination:

    loop for i := i1 by s1 to f1, i2 by s2 to f2:
        A;
    repeat;
⇒
    declare b: procedure: A end;
    loop for i := i1 by s1 to f1: b; repeat;
    loop for i := i2 by s2 to f2: b; repeat;

• Value-list iteration elimination:

    loop for i := v1, v2, v3, v4:
        A;
    repeat;
⇒
    declare b: procedure: A end;
    i := v1; b;
    i := v2; b;
    i := v3; b;
    i := v4; b;

or alternatively

    declare b: procedure: A end;
    declare a: array[1:4] initially (v1, v2, v3, v4);
    loop for j := 1 by 1 to 4:
        i := a[j]; b;
    repeat;

• Single step iteration reduction:

    loop for i := b by s to f: A repeat;
⇒
    si := s; fi := f;
    loop for i := b by si to fi: A repeat;

• Single step iteration elimination:

    loop for i := b by s to f: A repeat;
⇒
    i := b; si := s; fi := f;
    go to l;
    loop:
        A;
        i := i + si;
    l:  while (i - fi)*sign(si) ≤ 0
    repeat;

• Dahl loop elimination:

    loop:
        A;
        while p;
        B;
    repeat;
⇒
    go to l;
    m:  B;
    l:  A;
    if p then go to m; endif;

• Procedure integration: Suppose f has been declared a procedure of n arguments p1, ..., pn of modes m1, ..., mn, respectively, and with return value pr of mode mr:

    declare f: procedure(p1: m1, ..., pn: mn; pr: mr): A end;

A reference to f can be replaced by a block which computes the appropriate value:

    f(a1, ..., an)
⇒
    begin
        declare p1: m1 initially a1, ..., pn: mn initially an, pr: mr;
        A;
        pr
    end

Manipulation is the mapping of one version of a computation into another version which is better in the sense that it is smaller or faster or will enable other beneficial transformations to apply to the computation. Some of the manipulations that have been developed are given here:

• Back expansion of conditional allows a computation A to be tailored by the information that p is true on the then branch and false on the else branch. If the absorbed computation is an assignment, the variable assigned to may well be dead on one of the branches.

    A;
    if p then TC else FC endif;
⇒
    if p then A; TC
    else A; FC
    endif;

• Forward expansion of conditional:

    if p then TC else FC endif;
    A;
⇒
    if p then TC; A
    else FC; A
    endif;

• Loop unrolling in general reduces the number of tests and jumps executed and increases the number of instructions exposed for parallel execution. It often aids register allocation by ameliorating the seam matching problem.

    loop: A repeat;
⇒
    loop: A; A repeat;

    loop while p: A repeat;
⇒
    until done:
        loop while p:
            A;
            if not p then done; endif;
            A;
        repeat;
    end;

    loop for i := b by s to f: A repeat;
⇒
    until done:
        loop for i := b by s to f:
            A;
            i := i + s;
            if i > f then done; endif;
            A;
        repeat;
    end;

Note that if the number of iterations of the loop is divisible by the unrolling factor, then the midloop test for termination is not needed. The assignment i := i + s; can often be eliminated by subsumption.

• Loop first case:

    loop for i := b by s to f: A repeat;
⇒
    i := b;
    if i ≤ f then
        A;
        loop for i := i + s by s to f: A repeat;
    endif;

If it can be shown that the loop will have at least one iteration, i.e. that i ≤ f initially, then the enclosing if statement can be removed. The assignment i := b; can often be eliminated by subsumption.

• Loop last case:

    loop for i := b by s to f: A repeat;
⇒
    if b ≤ f then
        loop for i := b by s to f - s: A repeat;
        A;
    endif;

• Loop middle case:

    loop for i := b by s to f: A repeat;
⇒
    loop for i := b by s to m: A repeat;
    loop for i := i by s to f: A repeat;

This manipulation is particularly useful when the loop body is an appropriate conditional statement, for example if i ≥ m + 1 then .... If the predicate is i = m + 1, performing a loop first case manipulation on the second resulting loop is often valuable.
• Case split: Assume there is an assertion of the form p implies q, where p is a predicate about array subscripts and q an assertion about array values of the form ar = value, where ar is an array reference (whose subscripts are restricted by p). For example, i ≠ j implies a[i, j] = 0. If there is an expression which contains an array reference ref which "matches" ar, then replace ref with if p then value else ref endif. If there is an assignment ref := x where ref "matches" ar and x matches value, the assignment can be replaced with if not p then ref := x; endif. For example, suppose i ≠ j implies a[i, j] = 0; then a[k, l] := 0; will become if k = l then a[k, l] := 0; endif;

• Loop elimination: a predicate within an inner loop may be used to select an item out of the intersection of the index sets of the inner loop and outer loop. When detected, this condition can be greatly simplified.

    loop for i := j by 1 to k:
        A;
        loop for l := m by 1 to n:
            if i = l then S endif;
        repeat;
        B;
    repeat;
⇒
    loop for i := j by 1 to k:
        A;
        if max(j, m) ≤ i ≤ min(k, n) then
            l := i;
            S;
        endif;
        B;
    repeat;

Typically the max and min expressions evaluate to constants, and the assignment l := i is eliminated by subsumption.

• Loop compression: often a predicate within a loop serves to restrict the index set of the loop. In this case the restriction can be incorporated into the index set itself:

    suppose b ≤ l ≤ f;
    loop for i := b by 1 to f:
        if i ≥ l then A endif;
    repeat;
⇒
    loop for i := l by 1 to f: A repeat;

or

    suppose b ≤ l ≤ f;
    loop for i := b by 1 to f:
        if i ≤ l then A endif;
    repeat;
⇒
    loop for i := b by 1 to l: A repeat;

• Order reduction: if an array a is used in a loop on i and all references to a have i in, for example, the first subscript position, then if a is dead following the loop, the dimensionality of a may be reduced. (As stated here, this is a special case of a more general rule.) For example:

    loop for i := ...:
        ... a[i, 1] ... a[i, 3] ... a[i, j + 7] ...
    repeat;
⇒
    loop for i := ...:
        ... a[1] ... a[3] ... a[j + 7] ...
    repeat;

• Index set shift: it may be that the set of values assumed by an index variable, if modified, will be more compatible with additional transformations.

    loop for i := b by s to f:
        ... i ...
    repeat;
⇒
    loop for i := b + c by s to f + c:
        ... i - c ...
    repeat;

• Loop fusion: adjacent loops over the same index set, if data dependency considerations allow, can be fused into a single loop. The data dependency analysis can be represented by first evolving conventional loops into sim loops, fusing, and strip mining to get back to a conventional loop. One of the index sets may require index set shifting to make it identical to the other. (An executable sketch of a fusion appears after this list of manipulations.)

    loop for i := b by s to f: A repeat;
    loop for j := b by s to f: B repeat;
⇒
    loop sim for i := b by s to f: A repeat;
    loop sim for j := b by s to f: B repeat;
⇒
    loop sim for i := b by s to f:
        A;
        j := i;
        B;
    repeat;
⇒
    loop for i := b by s to f:
        A;
        j := i;
        B;
    repeat;
• Loop splitting: the loop splitting manipulation reverses a loop fusion and depends on similar analysis: evolve to sim loops, split, strip mine.

    loop for i := b by s to f:
        A;
        B;
    repeat;
⇒
    loop sim for i := b by s to f:
        A;
        B;
    repeat;
⇒
    loop sim for i := b by s to f: A repeat;
    loop sim for i := b by s to f: B repeat;
⇒
    loop for i := b by s to f: A repeat;
    loop for i := b by s to f: B repeat;

• Loop reordering: the loop reordering manipulation depends on finding a tightly nested loop set, evolving to a sim loop, and strip mining in an appropriate way:

    loop for i := bi by si to fi:
        loop for j := bj by sj to fj:
            A;
        repeat;
    repeat;
⇒
    loop sim for i, j := (bi by si to fi) × (bj by sj to fj):
        A;
    repeat;
⇒
    loop for j := bj by sj to fj:
        loop for i := bi by si to fi:
            A;
        repeat;
    repeat;

• Loop variable elimination: suppose t is invariant in the loop and dt is the change in t as a result of the loop:

    loop for i := t + b by s to f:
        A;
    repeat;
⇒
    loop for t := t + b by s to f:
        i := t;
        A;
    repeat;
    t := t - dt;
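As promised under loop fusion above, a small executable illustration in Python (hypothetical names; illustrative only). Fusion is legitimate here because the second loop reads only what the first loop wrote at the same index.

    def unfused(a, b):
        n = len(a)
        c, d = [0] * n, [0] * n
        for i in range(n):       # loop for i := ... : A
            c[i] = a[i] + 1
        for j in range(n):       # loop for j := ... (same index set): B
            d[j] = c[j] * b[j]
        return d

    def fused(a, b):
        n = len(a)
        c, d = [0] * n, [0] * n
        for i in range(n):       # single fused loop; j := i
            c[i] = a[i] + 1
            d[i] = c[i] * b[i]
        return d

    assert unfused([1, 2, 3], [4, 5, 6]) == fused([1, 2, 3], [4, 5, 6])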
The Compilation Model

As Hoare [12] has commented, a solution to the problem of producing efficient object code for programs written in a high level language is to have designed the language so that "a simple straightforward 'non-pessimizing' compiler will produce straightforward object programs of acceptable compactness and efficiency ... the language [should be] sufficiently expressive that most ... optimizations can be made in the language itself ... that a general machine-independent optimizer can simply translate an inefficient program into a more efficient one with guaranteed identical effects, and expressed in the same source language." This is our approach.

Our model is that a straightforward parser, perhaps automatically generated, will parse the character string representation of a program into a more highly structured, easier to manipulate, internal form. It is this internal form that is in fact transformed by our source-to-source transformations. At any point in the transformation process we can inspect the results by use of an unparser or transcriber which will re-create the character string representation of a program. We believe that by applying an appropriate set of transformations to a given source program written in a suitably rich multilevel source language, we can transform the given program in such a way that quite naive code generation techniques will result in very good code. It is our hope that this reduction in scope of the code generation problem may in fact make possible some real progress toward the automatic, or at least semiautomatic, production of code generators. The point here is that a considerable amount of the work done by conventional code generation can be viewed as source-to-source transformation in which the win predicate depends on the target machine. That is, the transformation may always be a valid one, but may be useful only for a particular target machine.

We have found, however, that most existing languages do not allow "nonpessimizing" compilation. It is an assumption inherent in source-to-source transformation that there is some subset of the language for which the compiler knows how to generate efficient code. For example, part of the benefit of constant propagation will be lost if the compiler does not generate immediate instructions when one argument of a computation is a constant. In our work with existing languages, we have found it necessary to make suitable language extensions so that the results of our transformations can be expressed. It is intended that the algorithm writer, however, will confine himself to the original language.

Examples of language extensions we have used in existing languages include embedded assignment; running relationals such as a < x < b or x = 3 | 5 | 8; value returning blocks and conditionals; storage attributes in declarations; restricted procedure invocations; and many forms of looping constructs with restricted semantics, including value list iteration, multiple indexes, coindexes, the Dahl loop [16], concurrent, simultaneous, and synchronous loops, and Zahn situation statements [16]. Since our compilation model attempts to reflect as much as possible what is going on in the source code itself, it is important that language facilities exist at all different levels of abstractness to express the desired meaning.

Although we favor a single, unified language, the ideas are equally applicable to multiple languages. One could imagine, for example, performing transformations in a very high level language, then devolving the language constructs to a language like Pascal. In this lower form, more transformations could take place; loops which were previously implicit would now become explicit, for example. Following these transformations, we could devolve the program into a low level language, perhaps BLISS [31], enabling still more transformations. Finally we have the program in a low level form, with language constructs reflecting the machine architecture, for which simple machine code generation will result in high quality code.

It is important to note our distinction between machine independent and machine dependent. In a conventional compiler there is a (more or less) machine independent portion consisting of parse and optimization, followed by a (typically) ad hoc code generation phase. In our model of compilation a program is parsed into a manipulable form, subjected to a large number of source-to-source transformations, and then has code generated by a naive walk over the program representation. The transformations themselves, even those done early in the processing, may be machine independent in that they do not affect the program meaning or reflect directly machine dependent features, but the motivation for performing them, the win predicate, does depend on machine features. For example, an index set shift so that the final value of a loop variable is zero may be done if the target machine has a very efficient register compare with zero and if addition of a constant in an address calculation is free:

    loop for i := 1 to 10:
        a[i] := a[i] + b[i];
    repeat;
⇒
    loop for i := -9 to 0:
        a[i + 10] := a[i + 10] + b[i + 10];
    repeat;

Alternatively, and probably better in this specific case, we could transform the loop into a loop sim and strip mine in reverse order, giving a counting down loop rather than a counting up loop. As a second example, x = a or x = b could be rewritten as x = a | b, effectively performing the common subexpression elimination of the reference to x. Alternatively, at a lower level, conventional common subexpression elimination would perform an essentially similar transformation.

This model of compilation has been grossly simplified by conveniently forgetting the problems of global information gathering. Typical global processing requires complete global flow analysis of the entire program and is used, for example, to produce use-set information or register assignment. This analysis falls outside the scope of this paper, although we expect that the results are directly applicable. Use-set information can be reflected in the program by attaching generation markers to the variable names, implemented as threads in the internal representation. Register assignment can be reflected by introducing a (low level) register data type.
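The index set shift shown above can be rendered executably as follows. In this Python sketch (illustrative only; 0-based lists stand in for 1-based arrays) both functions update a identically, but the second finishes its loop with a final index value of zero.

    def add_b(a, b):
        for i in range(1, 11):            # loop for i := 1 to 10
            a[i - 1] = a[i - 1] + b[i - 1]

    def add_b_shifted(a, b):
        for i in range(-9, 1):            # loop for i := -9 to 0
            a[i + 10 - 1] = a[i + 10 - 1] + b[i + 10 - 1]

    a1 = list(range(10)); a2 = list(range(10)); b = list(range(10, 20))
    add_b(a1, b); add_b_shifted(a2, b)
    assert a1 == a2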
A Sequence of Transformations for Code Generation

One of the more interesting aspects of source-to-source transformations is the possibilities presented by the ability to concatenate sequences of transformations to perform complex program manipulations. One such sequence, for constant folding, was hinted at earlier. The sequence of transformations for performing "conventional" optimization is described in [20]. In this section we wish to consider briefly a sequence of transformations that we have termed code generation.

At some point in the compilation process the compiler has a program representation which has been optimized as far as is possible at the level of abstraction of the language operators present in the program. The compiler then has one of two choices: The program representation can be turned over for execution to an interpreter which understands the program in its current form. Alternatively, the language features represented in the program can be devolved to lower level forms which can then be subject to further optimization. For example, we may have the APL expression (A × B) + A × B. Source-to-source transformations might replace this with 2 × A × B. No further improvements can be made without looking at the meaning of × on matrices and scalars. Once this is done, we can proceed in a manner similar to the previous examples. This problem, for APL, has been examined by Abrams [1].

A similar problem has been studied by Carter [7]. Carter is concerned with special case analysis for the PL/I concatenation operation. His solution is to define a procedure concatenate which contains within its body executable code for performing the case analysis. He then uses the transformation of procedure integration to expand the body of the procedure in line at each point of call. Conventional optimization then simplifies the resulting code. Standish et al. [25] also utilized a version of this approach. Standish et al.'s model is one of partial evaluation of the procedure body based on actual arguments of call, followed by replacement of the call by the procedure body.

At Massachusetts Computer Associates we have had considerable success with this approach. An application described later used this technique to produce optimized code for high level string operators in a language for cryptographic problems. The approach has also been used at a low level to design a (more or less) machine independent code generator. Procedures were devised to perform case analysis for language operators. For example, a procedure (written very informally) for integer exponentiation might be

    procedure exponentiate(i, j):
        if i = 2**k then shift "1" k*j bits left;
        elsif j is constant then
            case (j):
                ...
                6: ((i*i)*i)**2;
                ...
            end case;
        elsif time more valuable than space then in line code;
        else intrinsic function reference;
        endif;
    end;

Transformations were developed to handle sequencing and addressing issues. For example, A*B - C*D on a one-accumulator machine which can subtract memory from register but not register from memory should be rewritten as T := C*D; A*B - T. Note that this is another example of a machine independent transformation with a machine dependent win predicate. Such sequencing transformations can have significant payoff if data objects are packed. For example, suppose in a language like JOVIAL, I, J, and K are 12-bit packed integers starting at bit positions 0, 4, and 8, respectively. Consider the computation of K ← I + K - J on a one-accumulator machine. If this is computed naively as K ← (I + K) - J, it will require two more temporaries, four more instructions, and shifts of 12 positions as opposed to 8, compared with K ← (I - J) + K.
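Returning to the exponentiate procedure above, a loose executable reading of it is given below in Python. It is illustrative only: the case analysis is performed at run time here, whereas in the text it is meant to be specialized away at compile time, and only the single constant exponent 6 from the case analysis is shown.

    def exponentiate(i, j):
        if i > 0 and i & (i - 1) == 0:     # i = 2**k for some k
            k = i.bit_length() - 1
            return 1 << (k * j)            # shift "1" left k*j bits
        if j == 6:                         # one entry of the constant-exponent case analysis
            t = (i * i) * i
            return t * t                   # ((i*i)*i)**2
        return i ** j                      # otherwise: in-line code or an intrinsic reference

    assert exponentiate(8, 5) == 8 ** 5
    assert exponentiate(3, 6) == 3 ** 6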
Obviously transformations are necessary for explication of implied scaling, field access, and mode conversion. Also necessary is the explication of location and contents operations. This allows conventional optimization transformations to, for example, remove invariant load register instructions from loops.

The following is the sequence of transformations we have termed code generation:

    systematically rename procedure locals
    model parameter passing (substitution for call by name, assignment for call by value)
    simplify procedure body:
        constant computation
        expression simplification, including relationals and Booleans
        control structure simplification:
            collapse loops
            prune conditionals
            reorder conditionals
    substitute body of procedure for call
    further optimizations

Note now that further optimization will cause interactions of the procedure body with the remaining program. For example, if the call occurred within a loop and the body contains loop invariant code, that code will be removed from the loop. There are at least two exceptions to code generation: If space is more important than time, it may be appropriate to leave the procedures intact, accessed via calls; or if particularly cunning algorithms (such as high level machine instructions) are available, they may be appropriate to use. For example, the assignment A ← B, where A and B are matrices, is better implemented by a block transfer or move character instruction, if available, than by expansion into the general loop form.

The following section continues a previous example down to a very low level. Further transformations are possible, but not without introducing machine dependent operators.

Code Generation Example

Let us reconsider the previous example of matrix multiplication, which we had transformed to

    loop for i := 1 to 10:
        loop for j := 1 to 20:
            c[i, j] := a[i, i]*b[i, j];
        repeat;
    repeat;

We shall continue this example to show how source-to-source transformations (given a rich enough source language) can assist in the remaining compilation and code generation. The underlying representation of an m by n matrix is a one-dimensional array of m*n items, where selection of the [i, j] item of the matrix selects the [(i - 1)*n + j] item of the underlying array, assuming row major storage. Letting ua, ub, and uc be the underlying arrays for a, b, and c, respectively, we get

    loop for i := 1 to 10:
        loop for j := 1 to 20:
            uc[(i - 1)*20 + j] := ua[(i - 1)*10 + i]*ub[(i - 1)*20 + j];
        repeat;
    repeat;

With expression simplification and common subexpression elimination within the inner loop body, the inner loop body becomes

    t1 := 20*i + j - 20;
    t2 := 11*i - 10;
    uc[t1] := ua[t2]*ub[t1];

Part of the t1 computation and all of the t2 computation are invariant in the inner loop. Performing code motion, we get

    loop for i := 1 to 10:
        t3 := 20*i - 20;
        t2 := 11*i - 10;
        loop for j := 1 to 20:
            t1 := t3 + j;
            uc[t1] := ua[t2]*ub[t1];
        repeat;
    repeat;
The following is the sequence of transformations we have termed code generation:

    systematically rename procedure locals
    model parameter passing (substitution for call by name, assignment for call by value)
    simplify procedure body
      constant computation
      expression simplification, including relationals and Booleans
      control structure simplification
        collapse loops
        prune conditionals
        reorder conditionals
    substitute body of procedure for call
    further optimizations

Note now that further optimization will cause interactions of the procedure body with the remaining program. For example, if the call occurred within a loop and the body contains loop invariant code, that code will be removed from the loop. There are at least two exceptions to code generation: If space is more important than time it may be appropriate to leave the procedures intact, accessed via calls; or if particular cunning algorithms (such as high level machine instructions) are available, they may be appropriate to use. For example, the assignment A := B where A and B are matrices is better implemented by a block transfer or move character instruction, if available, than by expansion into the general loop form. The following section continues a previous example down to a very low level. Further transformations are possible, but not without introducing machine dependent operators.

Code Generation Example

Let us reconsider the previous example of matrix multiplication which we had transformed to

    loop for i := 1 to 10:
      loop for j := 1 to 20:
        c[i, j] := a[i, i]*b[i, j];
      repeat;
    repeat;

We shall continue this example to show how source-to-source transformations (given a rich enough source language) can assist in the remaining compilation and code generation. The underlying representation of an m by n matrix is a one-dimensional array of m*n items, where selection of the [i, j] item of the matrix selects the [(i - 1)*n + j] item of the underlying array, assuming row major storage. Letting ua, ub, and uc be the underlying arrays for a, b, and c, respectively, we get

    loop for i := 1 to 10:
      loop for j := 1 to 20:
        uc[(i - 1)*20 + j] := ua[(i - 1)*10 + i]*ub[(i - 1)*20 + j];
      repeat;
    repeat;

With expression simplification and common subexpression elimination within the inner loop body, the inner loop body becomes

    t1 := 20*i + j - 20;
    t2 := 11*i - 10;
    uc[t1] := ua[t2]*ub[t1];

Part of the t1 computation and all of the t2 computation are invariant in the inner loop. Performing code motion, we get

    loop for i := 1 to 10:
      t3 := 20*i - 20;
      t2 := 11*i - 10;
      loop for j := 1 to 20:
        t1 := t3 + j;
        uc[t1] := ua[t2]*ub[t1];
      repeat;
    repeat;

After both loops have been analyzed for coindexes, the computation becomes

    loop for i := 1 to 10 # t3 := 0 by 20 to 180 # t2 := 1 by 11 to 100:
      loop for j := 1 to 20 # t1 := t3 + 1 by 1 to t3 + 20:
        uc[t1] := ua[t2]*ub[t1];
      repeat;
    repeat;

The i and j indexes are now dead and can be eliminated:

    loop for t3 := 0 by 20 to 180 # t2 := 1 by 11 to 100:
      loop for t1 := t3 + 1 by 1 to t3 + 20:
        uc[t1] := ua[t2]*ub[t1];
      repeat;
    repeat;

Devolution removes the one remaining coindex. Note that the index sets of t3 and t2 contain the same number of elements; thus range testing for t3 is not needed:

    t3 := 0;
    loop for t2 := 1 by 11 to 100:
      loop for t1 := t3 + 1 to t3 + 20:
        uc[t1] := ua[t2]*ub[t1];
      repeat;
      t3 := t3 + 20;
    repeat;

Using loop variable elimination, we can replace t1 by t3. As a result, at the end of the inner loop t3 is too large; we compensate by subtracting an appropriate amount. Using index set shifting, we can have the t2 loop final value be zero, thus easing code select. This requires that references to t2 be replaced by references to t2 + 100:

    t3 := 0;
    loop for t2 := -99 by 11 to 0:
      loop for t3 := t3 + 1 by 1 to t3 + 20:
        t1 := t3;
        uc[t1] := ua[t2 + 100]*ub[t1];
      repeat;
      t3 := t3 - 21;
      t3 := t3 + 20;
    repeat;

Scalar propagation eliminates the remaining t1 references; subsumption and constant computation simplify the t3 computation; and devolution eliminates the inner iteration loop:

    t3 := 0;
    loop for t2 := -99 by 11 to 0:
      t3 := t3 + 1;
      ft3 := t3 + 20;
      go to l;
      loop: uc[t3] := ua[t2 + 100]*ub[t3];
        t3 := t3 + 1;
      l: while (t3 - ft3) < 0 repeat;
      t3 := t3 - 1;
    repeat;

Subsumption between loop iterations again simplifies the t3 computations:

    t3 := 1;
    loop for t2 := -99 by 11 to 0:
      ft3 := t3 + 20;
      go to l;
      loop: uc[t3] := ua[t2 + 100]*ub[t3];
        t3 := t3 + 1;
      l: while (t3 - ft3) < 0 repeat;
    repeat;

Devolution eliminates the outer iteration loop:

    t3 := 1;
    t2 := -99;
    go to ll;
    loop: ft3 := t3 + 20;
      go to l;
      loop: uc[t3] := ua[t2 + 100]*ub[t3];
        t3 := t3 + 1;
      l: while (t3 - ft3) < 0 repeat;
      t2 := t2 + 11;
    ll: while t2 ≤ 0 repeat;

Finally devolution eliminates both Dahl loops:

    t3 := 1;
    t2 := -99;
    go to ll;
    label outer loop;
      ft3 := t3 + 20;
      go to l;
    label inner loop;
      uc[t3] := ua[t2 + 100]*ub[t3];
      t3 := t3 + 1;
    label l;
      if (t3 - ft3) < 0 then go to inner loop; endif;
      t2 := t2 + 11;
    label ll;
      if t2 ≤ 0 then go to outer loop; endif;

The program is now in a form in which, with only reasonable assumptions about the target machine, a naive code generation would produce good code. By "naive" we mean that code generation need look only at one statement at a time, not that it need not be clever. Indeed, it is very important that code generation know how to handle the obvious special cases: load a register with a constant; increment a register by a constant; access an indexed memory location; conditional transfer of control; etc.
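For what it is worth, this final form maps almost statement for statement onto C, whose goto and labels play the role of the explicit labels above. The rendering below is ours, with the 1-origin subscripts of the example shifted to C's 0-origin arrays and with sample data supplied; it computes the same 10 by 20 product with a diagonal a.

    #include <stdio.h>

    /* The final transformed program rendered in C.  ua, ub, and uc are
       the underlying arrays of the 10x10 diagonal a, the 10x20 b, and
       the 10x20 c; subscripts in the text are 1-origin, hence the "- 1"
       in each access.  An illustrative rendering, not the paper's. */
    int main(void)
    {
        static double ua[100], ub[200], uc[200];
        int t2, t3, ft3;
        for (int i = 0; i < 100; i++) ua[i] = 2.0;         /* sample a */
        for (int i = 0; i < 200; i++) ub[i] = (double)i;   /* sample b */

        t3 = 1; t2 = -99;
        goto ll;
    outer_loop:
        ft3 = t3 + 20;
        goto l;
    inner_loop:
        uc[t3 - 1] = ua[t2 + 100 - 1] * ub[t3 - 1];
        t3 = t3 + 1;
    l:
        if (t3 - ft3 < 0) goto inner_loop;
        t2 = t2 + 11;
    ll:
        if (t2 <= 0) goto outer_loop;

        printf("%g %g\n", uc[0], uc[199]);   /* prints 0 398 */
        return 0;
    }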
Application of Higher Level Optimization

At Massachusetts Computer Associates we have had considerable success using the concepts of source-to-source transformations in the manufacture of production software as well as in the specification and design of language processing systems. The IVTRAN compiler for Fortran for the ILLIAC IV uses source-to-source transformations in two distinct ways. Powerful parallelism detection algorithms can determine that certain Fortran loops can be executed in parallel on the ILLIAC IV. The results of program modification are reflected in the source by means of language extensions [21]. The "paralyzed" source is available for a programmer to examine. In addition, the conventional optimization phase is implemented in terms of source-to-source transformations [20]. Since this phase operates on extended ANSI Fortran and since a transcriber (unparser) was available, the parser, optimizer, and transcriber were packaged as a Fortran Laundry. A similar implementation of Fortran optimization has been reported by Schneck and Angel [23].

We have designed a language, OEAL, which is the U.S. Army's interim standard language for automatic test equipment. The language has difficult compilation problems associated with hardware device handling at a high level. The language was designed in such a way that source-to-source transformations conveniently solve many of the language processing problems [19].

In previous sections we have seen how source-to-source transformations can remove overhead in modular programming, perform much of code generation, and provide the framework for an APL compiler. Standish et al. [25] report the use of source-to-source transformation in establishing that the result of multiplying two triangular matrices is a triangular matrix. They term this proof by program transformation.

In order to improve the object code of an existing high level production language, the techniques of source-to-source program transformation were used. This language is a Fortran-like programming language designed specifically for the processing of cryptologic problems. The language contains the following features:

(1) ability to define new character string alphabets;
(2) computational operations extended to character strings, such as character-by-character addition (modulo alphabet size; sketched below);
(3) high level cryptologic operators, such as monographic frequency count;
(4) ability to extend operations to digraphs and, in general, k-graphs;
(5) explicit within-statement indexing over strings in a replacement statement or conditional statement;
(6) complex relations over strings for use in conditional statements;
(7) compound indexing in loop statements and within-statement indexing including multiple and parallel indexes.
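The paper gives no implementation for these operators. Purely as an illustration of feature (2), the following C sketch (the function name, the 26-letter alphabet, and the mapping of 'A' to zero are our assumptions) performs character-by-character addition modulo the alphabet size:

    #include <stdio.h>

    /* Illustrative semantics only: add two equal-length strings
       character by character over the alphabet 'A'..'Z', with 'A'
       mapped to 0, reducing each sum modulo 26. */
    static void string_add(const char *a, const char *b, char *out, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = (char)('A' + ((a[i] - 'A') + (b[i] - 'A')) % 26);
        out[n] = '\0';
    }

    int main(void)
    {
        char out[8];
        string_add("HELLO", "WORLD", out, 5);
        printf("%s\n", out);   /* DSCWR: H+W = 7+22 = 29, and 29 mod 26 = 3 = D */
        return 0;
    }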
To study this language we used a language laboratory, consisting of a parser generator and program manipulator, to construct a model front end compiler for the language [5]. The Language Laboratory is a tool designed to assist in the development of optimization techniques for high level programming languages. It is written in BCPL and designed so that experimental optimization techniques can be plugged in for testing and analysis. The laboratory is capable of syntactically analyzing a program, creating a structural representation of the program, and transcribing that representation into a new source program. It also provides utilities to manipulate and print the structural representation so that experimental optimization techniques can be quickly implemented, debugged, and analyzed. The laboratory is composed of five parts: Syntax Analyzer, Working Tree and Table Generator, Transcriber, Utilities, and Experiments. There is also an interactive Command Interpreter to assist with on-line experimentation. The Laboratory implementation was somewhat language dependent. It now appears that this language dependence could have been removed from all parts of the laboratory except, of course, the experiments themselves.

In this laboratory environment we were able at low cost to study the effect of various transformations on typical source programs. Our results allowed us to make suggestions for changes in the production compiler. The introduction of high level loop optimizations in the production compiler allowed certain critical language constructs to be executed up to a factor of three faster.

The language proved to be an inhospitable target language for source-to-source transformations. Some study revealed that several appropriate low level constructs were not present and that the production compiler's code generator often missed many special cases. A number of simple changes were made in code generation as a result of this study. In addition to defining and tuning the nonpessimizing subset of the language, the special-case analysis in code select also made a noticeable improvement in the object code.

As discussed earlier, we found that specific high level language features can be handled in one of two ways: what we termed previously code generation, or cunning algorithm design. The use of a language feature can be replaced by its implementation in terms of lower level language features, thus exposing lower level code to further optimization. This technique was used very successfully in the handling of a large class of complex string assignment operations and predicates over strings. It was also used to handle logical operations in conditional statements and the processing of array references.

Alternatively, cunning algorithms were sometimes devised to implement certain features. This was often machine dependent and was very useful when a given target machine had specialized, powerful instructions. For example, we designed a special algorithm for a PL/1-like INDEX function (to search a string for an instance of a second string) which used the IBM 370 translate instruction and executed approximately six times faster than a conventionally implemented INDEX function.
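The special algorithm itself is not given in the paper. Purely as a sketch of the underlying idea, a table-driven scan in C stops at bytes flagged in a 256-entry table, much as a hardware translate-and-test instruction does, and then verifies each candidate directly; the function below is our illustration, not the production algorithm.

    #include <stdio.h>
    #include <string.h>

    /* Table-driven INDEX(s, pat): flag the first character of pat in a
       256-entry table, scan s with one table lookup per byte (the
       software analogue of translate-and-test), and verify candidates
       by direct comparison.  Returns the 0-origin match position, or
       -1 if there is none. */
    static int index_of(const char *s, const char *pat)
    {
        unsigned char stop[256] = {0};
        size_t n = strlen(s), m = strlen(pat);
        if (m == 0 || m > n) return -1;
        stop[(unsigned char)pat[0]] = 1;
        for (size_t i = 0; i + m <= n; i++)
            if (stop[(unsigned char)s[i]] && memcmp(s + i, pat, m) == 0)
                return (int)i;
        return -1;
    }

    int main(void)
    {
        printf("%d\n", index_of("SOURCE TO SOURCE", "TO"));   /* prints 7 */
        return 0;
    }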
We studied the problem of computation on elements in strings and devised techniques for performing certain computations one word at a time (packed) rather than one character at a time. We studied the problem of providing efficient formatted output for languages using the Fortran run time input-output package. The techniques developed allowed certain programs to run up to three times faster than they had prior to the application of these techniques.

Topics for Future Research

The use of source-to-source transformations appears to be very promising both for high level program optimization and as a model of the bulk of the compilation process. Obviously much work remains to be done. Initial steps have been taken at gathering the information needed to evaluate the predicates in transformations [29, 14, 15]. Considerable effort is still needed in this area and in the specification of the predicates themselves. Only a little work has been done on the ordering of transformations [20, 28]. Although a large number of compilers have been constructed, and therefore the software engineering aspects have had to be addressed, viewing compilation as an integrated information gathering and transformation system poses a number of interesting software engineering problems. If the role of code generation in a compiler can in fact be reduced as much as we hope, so that semiautomatic production of code generators becomes possible, considerable work will be needed in this field.

The transformations themselves pose a number of problems: How does one precisely define a transformation? How can one prove that a given transformation preserves program correctness? Is there a complete (in some sense) set of transformations for a given language? Are there language independent transformations? Is a given set of transformations closed? How can one concatenate numbers of related transformations to form subsystems? Does the approach work well in all programming domains? How difficult is it to write and debug the transformations themselves?

As has been pointed out earlier, the use of source-to-source transformations makes demands on programming language features: The language must contain features at both a high and a quite low level to serve as target for the transformations. Clearly research is needed in language design: Will a single language with both high level and low level features solve the problem? And if so, how does an installation restrict a user to the preferred subset of the language? Alternatively, perhaps a hierarchy of languages from high level to Fortran level to machine architecture level is the solution.

The use of a Language Laboratory in the development of transformation techniques proved to be very valuable. A similar but more general and more powerful system should be produced. Such an effort would have many interesting software engineering problems of its own, and would provide a very powerful tool for future research.

The technology of information gathering and the tools provided by the Language Laboratory make possible the idea of constructing an interactive program tailoring system. Such a system would allow a user to interact with and transform his program, taking advantage of known constancies of data ("matrix A is lower left triangular") or environment ("the compiled version of this program will run on a CDC 7600"). Since the transformations would preserve the correctness of his program, the user could concentrate on the job of tailoring his program so that it would meet his required space and time constraints.
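Such a tailoring step can be pictured concretely. Given a user's assertion that matrix a is diagonal, the general product collapses to the specialized form with which the code generation example above began. The C sketch below is ours; the interactive tailoring system it imagines is proposed here, not built.

    #include <stdio.h>

    #define M 10
    #define N 20

    /* General matrix product c := a*b. */
    static void mult_general(double a[M][M], double b[M][N], double c[M][N])
    {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++) {
                c[i][j] = 0.0;
                for (int k = 0; k < M; k++)
                    c[i][j] += a[i][k] * b[k][j];
            }
    }

    /* Tailored under the assertion a[i][k] == 0 whenever i != k:
       the k loop disappears, as in the paper's running example. */
    static void mult_diagonal(double a[M][M], double b[M][N], double c[M][N])
    {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                c[i][j] = a[i][i] * b[i][j];
    }

    int main(void)
    {
        static double a[M][M], b[M][N], c1[M][N], c2[M][N];
        for (int i = 0; i < M; i++) a[i][i] = 2.0;        /* diagonal a */
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++) b[i][j] = i + j;
        mult_general(a, b, c1);
        mult_diagonal(a, b, c2);
        printf("%g %g\n", c1[3][4], c2[3][4]);            /* prints 14 14 */
        return 0;
    }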
Conclusions

The use of source-to-source transformations has already proved to be a valuable approach in the design and construction of several language processing systems. We have found that, in a production compiler [20], when careful attention is paid to software engineering issues, the cost of this approach in implementation efficiency is comparable to a conventional approach. In addition, two significant benefits resulted: First, as a result of the inherent modularization resulting from identifying particular transformations, modification was considerably easier. Second, since the various optimizations were reflected directly in a representation of the source, use of a transcriber or unparser allowed the developers to view the results of optimizations directly, greatly easing the system debugging problem. The examples in this paper show that the techniques can easily be used to provide quite powerful high level program improvement. Indeed, source-to-source program transformations appear to provide a coherent model of the compilation process for high level languages and for much of code generation, and to be an implementation technique for eliminating overhead in modular programming. This view of program improvement provides not only a large class of interesting problems remaining to be solved, but also a useful framework for constructing language processors today.

ACKNOWLEDGMENTS. I would like to express my appreciation to Kirk Sattley for numerous stimulating conversations, Michael Karr for discussions on program information gathering, Mat Myszewski for his insights into code generation, and Ross Faneuf for his pragmatic software engineering approach to conventional optimization.

REFERENCES

(Note: References [2-4, 6, 10, 11] are not cited in the text.)

1. ABRAMS, P. An APL machine. SU-SEL-70-017, Stanford Electron. Lab., Stanford, Calif., Feb. 1970.
2. ACM SIGPLAN. Symposium on very high level languages. SIGPLAN Notices (ACM) 9, 4 (April 1974).
3. ACM SIGPLAN. Proceedings of a symposium on compiler optimization. SIGPLAN Notices (ACM) 5, 7 (July 1970).
4. ALLEN, F.E., AND COCKE, J. A catalogue of optimizing transformations. In Design and Optimization of Compilers, R. Rustin, Ed., Prentice-Hall, Englewood Cliffs, N.J., 1972, pp. 1-30.
5. BEARISTO, D., AND SATTLEY, K. BETA laboratory. Final Rep. CADD-7312-3111, Mass. Computer Associates, Inc., Wakefield, Mass., Dec. 1973.
6. BURSTALL, R.M., AND DARLINGTON, J. A transformation system for developing recursive programs. J. ACM 24, 1 (Jan. 1977), 44-67 (this issue).
7. CARTER, J.L. A case study of a new compiling code generation technique. RC 5666, IBM Thomas J. Watson Res. Ctr., Yorktown Heights, N.Y., Oct. 1975.
8. CHEATHAM, T.E., AND TOWNLEY, J.A. Symbolic evaluation of programs - a look at loop analysis. Proc. 1976 ACM Symp. on Symbolic and Algebraic Comput., Aug. 1976, pp. 90-96.
9. CHEATHAM, T.E., AND WEGBREIT, B. A laboratory for the study of automatic programming. Proc. AFIPS 1972 SJCC, Vol. 40, AFIPS Press, Montvale, N.J., pp. 11-21.
10. GERHART, S.L. Correctness-preserving program transformations. Conf. Rec. Second ACM Symp. on Principles of Programming Languages, Jan. 1975, pp. 54-66.
11. GESCHKE, C.M. Global program optimizations. Ph.D. Th., Comptr. Sci. Dep., Carnegie-Mellon U., Pittsburgh, Pa., 1972.
12. HOARE, C.A.R. Hints on programming language design. STAN-CS-73-403, Comptr. Sci. Dep., Stanford U., Stanford, Calif., Dec. 1973.
13. INGALLS, D. The execution time profile as a programming tool. In Design and Optimization of Compilers, R. Rustin, Ed., Prentice-Hall, Englewood Cliffs, N.J., 1972, pp. 107-128.
14. KARR, M. Gathering information about programs. CA-7507-1411, Mass. Computer Associates, Inc., Wakefield, Mass., July 1975.
15. KARR, M. On affine relationships among variables of a program. Acta Informatica 6, 2 (1976), 133-152.
16. KNUTH, D. Structured programming with go to statements. Computing Surveys 6, 4 (Dec. 1974), 261-301.
17. LAMPORT, L. Parallel execution on array and vector computers. Proc. 1975 Sagamore Conf. on Parallel Processing, Syracuse U., Aug. 1975, pp. 187-191.
18. LAMPORT, L. The parallel execution of DO loops. Comm. ACM 17, 2 (Feb. 1974), 83-93.
19. LOVEMAN, D. An ATE language processing system. Autotestcon 76: Automatic Support Systems for Advanced Maintainability, IEEE, New York, 1976, pp. 1-9.
20. LOVEMAN, D., AND FANEUF, R. Program optimization - theory and practice. Proc. Conf. on Programming Languages and Compilers for Parallel and Vector Machines, SIGPLAN Notices (ACM) 10, 3 (March 1975), 97-102.
21. PRESBERG, D., AND JOHNSON, N. The Paralyzer: IVTRAN's parallelism analyzer and synthesizer. Proc. Conf. on Programming Languages and Compilers for Parallel and Vector Machines, SIGPLAN Notices (ACM) 10, 3 (March 1975), 9-16.
22. SCHAEFER, M. A Mathematical Theory of Global Program Optimization. Prentice-Hall, Englewood Cliffs, N.J., 1973.
23. SCHNECK, P.B., AND ANGEL, E. A FORTRAN to FORTRAN optimizing compiler. Computer J. 16, 4 (Nov. 1973), 322-330.
24. SHAPIRO, R.M., AND SAINT, H. The representation of algorithms. Final Tech. Rep. RADC-TR-69-313, Vol. II, Applied Data Research, Inc., Rome Air Develop. Ctr., Sept. 1969.
25. STANDISH, T.A., HARRIMAN, D.C., KIBLER, D.F., AND NEIGHBORS, J.M. Improving and refining programs by program manipulation. Proc. 1976 ACM Annual Conf., Oct. 20-22, 1976, pp. 509-516.
26. STANDISH, T., HARRIMAN, D., KIBLER, D., AND NEIGHBORS, J. The Irvine program transformation catalogue. Dep. of Information and Computer Science, U. of California at Irvine, Irvine, Calif., Jan. 1976.
27. VAN TASSEL, D. Program Style, Design, Efficiency, Debugging and Testing. Prentice-Hall, Englewood Cliffs, N.J., 1974.
28. WEGBREIT, B. Goal-directed program transformation. IEEE Trans. on Software Eng. SE-2, 2 (June 1976), 69-80.
29. WEGBREIT, B. Property extraction in well-founded property sets. IEEE Trans. on Software Eng. SE-1, 3 (Sept. 1975), 270-285.
30. WEGBREIT, B. The ECL programming system. Proc. AFIPS 1971 FJCC, Vol. 39, AFIPS Press, Montvale, N.J., pp. 253-262.
31. WULF, W.A., RUSSELL, D.B., AND HABERMANN, A.N. BLISS: A language for systems programming. Comm. ACM 14, 12 (Dec. 1971), 780-790.

RECEIVED MARCH 1976; REVISED JULY 1976