RICE UNIVERSITY

Memory-Hierarchy Management

by

Steve Carr

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

Approved, Thesis Committee:

Ken Kennedy, Chairman, Noah Harding Professor of Computer Science
Keith D. Cooper, Associate Professor of Computer Science
Danny C. Sorensen, Professor of Computational and Applied Math

Houston, Texas
July, 1994

Memory-Hierarchy Management

Steve Carr

Abstract

The trend in high-performance microprocessor design is toward increasing computational power on the chip. Microprocessors can now process dramatically more data per machine cycle than previous models. Unfortunately, memory speeds have not kept pace. The result is an imbalance between computation speed and memory speed. This imbalance is leading machine designers to use more complicated memory hierarchies. In turn, programmers are explicitly restructuring codes to perform well on particular memory systems, leading to machine-specific programs.

It is our belief that machine-specific programming is a step in the wrong direction. Compilers, not programmers, should handle machine-specific implementation details. To this end, this thesis develops and experiments with compiler algorithms that manage the memory hierarchy of a machine for floating-point intensive numerical codes. Specifically, we address the following issues:

Scalar replacement. Lack of information concerning the flow of array values in standard data-flow analysis prevents the capturing of array reuse in registers. We develop and experiment with a technique to perform scalar replacement in the presence of conditional control flow to expose array reuse to standard data-flow algorithms.

Unroll-and-jam. Many loops require more data per cycle than can be processed by the target machine. We present and experiment with an automatic technique to apply unroll-and-jam to such loops to reduce their memory requirements.

Loop interchange. Cache locality in programs run on advanced microprocessors is critical to performance. We develop and experiment with a technique to order loops within a nest to attain good cache locality.

Blocking. Iteration-space blocking is a technique used to attain temporal locality within cache. Although it has been applied to "simple" kernels, there has been no investigation into its applicability over a range of algorithmic styles. We show how to apply blocking to loops with trapezoidal-, rhomboidal-, and triangular-shaped iteration spaces. In addition, we show how to overcome certain complex dependence patterns.

Experiments with the above techniques have shown that integer-factor speedups on a single chip are possible. These results reveal that many numerical algorithms can be expressed in a natural, machine-independent form while retaining good memory performance through the use of compiler optimizations.

Acknowledgments

I would like to thank the members of my committee for their input and support throughout the development of this thesis. In particular, I would like to thank Ken Kennedy for his guidance and support throughout my graduate career, especially through the numerous times when I was ready to quit. I would also like to give special thanks to Keith Cooper for being willing to listen when I had ideas that needed to be heard.

From the onset of graduate school, I found that many of my fellow students provided much-needed support and insight. Preston Briggs, Uli Kremer, Mary Hall and Ervan Darnell have given feedback on the ideas developed within this thesis.
Those who began graduate school with me, René Rodríguez, Elmootazbellah Elnozahy and Rebecca Parsons, have been great friends and a constant source of encouragement. The members of the ParaScope research group provided the infrastructure for the implementation of the ideas developed within this thesis. Finally, special thanks goes to Ivy Jorgensen for proofreading this thesis.

Without the encouragement of my undergraduate professors at Michigan Tech, I would have never even considered graduate school. Dave Poplawski, Steve Seidel, and Karl Ottenstein provided that encouragement and I am indebted to them for it.

As with all endeavors, financial support is needed to allow one to eat. I have been supported by IBM Corporation, the National Science Foundation and DARPA at various points in my graduate career.

Finally, I wish to thank my family. My parents have supported me with love and encouragement that has been invaluable. Most of all, I thank my wife Becky, who has been the best friend that I could ever have. Her love and encouragement has done more to make my life meaningful than she could ever imagine.

Contents

Abstract
Acknowledgments
List of Illustrations

1 Introduction
  1.1 Background
    1.1.1 Performance Model
    1.1.2 Dependence Graph
  1.2 Transformations To Improve Memory Performance
    1.2.1 Scalar Replacement
    1.2.2 Unroll-and-Jam
    1.2.3 Loop Interchange
    1.2.4 Strip-Mine-And-Interchange
  1.3 Related Work
    1.3.1 Register Allocation
    1.3.2 Memory Performance Studies
    1.3.3 Memory Management
  1.4 Overview

2 Scalar Replacement
  2.1 Partial Redundancy Elimination
  2.2 Algorithm
    2.2.1 Control-Flow Analysis
    2.2.2 Availability Analysis
    2.2.3 Reachability Analysis
    2.2.4 Potential-Generator Selection
    2.2.5 Anticipability Analysis
    2.2.6 Dependence-Graph Marking
    2.2.7 Name Partitioning
    2.2.8 Register-Pressure Moderation
    2.2.9 Reference Replacement
    2.2.10 Statement-Insertion Analysis
    2.2.11 Register Copying
    2.2.12 Code Motion
    2.2.13 Initialization
    2.2.14 Register Subsumption
  2.3 Experiment
  2.4 Summary
3 Unroll-And-Jam
  3.1 Safety
  3.2 Dependence Copying
  3.3 Improving Balance with Unroll-and-Jam
    3.3.1 Computing Transformed-Loop Balance
    3.3.2 Estimating Register Pressure
  3.4 Applying Unroll-and-Jam in a Compiler
    3.4.1 Picking Loops to Unroll
    3.4.2 Picking Unroll Amounts
    3.4.3 Removing Interlock
    3.4.4 Multiple Edges
    3.4.5 Multiple-Induction-Variable Subscripts
    3.4.6 Non-Perfectly Nested Loops
  3.5 Experiment
  3.6 Summary

4 Loop Interchange
  4.1 Performance Model
    4.1.1 Data Locality
    4.1.2 Memory Cycles
  4.2 Algorithm
    4.2.1 Computing Loop Order
    4.2.2 Non-Perfectly Nested Loops
  4.3 Experiment
  4.4 Summary

5 Blockability
    5.0.1 Iteration-Space Blocking
  5.1 Index-Set Splitting
    5.1.1 Triangular Iteration Spaces
    5.1.2 Trapezoidal Iteration Spaces
    5.1.3 Complex Dependence Patterns
  5.2 Control Flow
  5.3 Solving Systems of Linear Equations
    5.3.1 LU Decomposition without Pivoting
    5.3.2 LU Decomposition with Partial Pivoting
    5.3.3 QR Decomposition with Householder Transformations
    5.3.4 QR Decomposition with Givens Rotations
  5.4 Language Extensions
  5.5 Summary
6 Conclusion
  6.1 Contributions
    6.1.1 Registers
    6.1.2 Cache
  6.2 Future Work
  6.3 Final Remarks

A Formulas for Non-Rectangular Iteration Spaces
  A.1 Triangular Loops
    A.1.1 Upper Left: > 0
    A.1.2 Upper Right: < 0
    A.1.3 Lower Right: > 0
    A.1.4 Lower Left: < 0
  A.2 Trapezoidal Loops
    A.2.1 Upper-Bound MIN Function
    A.2.2 Lower-Bound MAX Function
  A.3 Rhomboidal Loops

Bibliography

Illustrations

2.1 Partially Redundant Computation
2.2 After Partial Redundancy Elimination
2.3 FindPotentialGenerators
2.4 MarkDependenceGraph
2.5 GenerateNamePartitions
2.6 CalculateBenefit
2.7 ModerateRegisterPressure
2.8 ReplaceReferences
2.9 InsertStatements
2.10 InsertRegisterCopies
2.11 CodeMotion
2.12 Initialize
2.13 Subsume
2.14 Example After Scalar Replacement
2.15 Experimental Design
3.1 Distance Vectors Before and After Unroll-and-Jam
3.2 V Before and After Unroll-and-Jam
3.3 VrC Before and After Unroll-and-Jam
3.4 VrI Before and After Unroll-and-Jam
3.5 PickLoops
3.6 PartitionNodes
3.7 Compute Coefficients for Optimization Problem
3.8 Distribute Loops for Unroll-and-Jam

4.1 Example Loop for Memory Costs
4.2 Algorithm for Ordering Loops
4.3 Algorithm for Non-Perfectly Nested Loops

5.1 Upper Left Triangular Iteration Space
5.2 Trapezoidal Iteration Space with Rectangle
5.3 Trapezoidal Iteration Space with Rhomboid
5.4 Data Space for A
5.5 Matrix Multiply After IF-Inspection
5.6 Regions Accessed in LU Decomposition
5.7 Block LU Decomposition
5.8 LU Decomposition with Partial Pivoting
5.9 Block LU Decomposition with Partial Pivoting
5.10 QR Decomposition with Givens Rotations
5.11 Optimized QR Decomposition with Givens Rotations
5.12 Block LU in Extended Fortran

A.1 Upper Left Triangular Iteration Space
A.2 Upper Right Triangular Iteration Space
A.3 Lower Right Triangular Iteration Space
A.4 Lower Left Triangular Iteration Space
A.5 Trapezoidal Iteration Space with MIN Function
A.6 Trapezoidal Iteration Space with MAX Function
A.7 Rhomboidal Iteration Space

Chapter 1

Introduction

Over the past decade, microprocessor design strategies have focused on increasing the computational power available on a single chip. These advances in power have been achieved not only through reduced cycle times, but also via architectural changes such as multiple instruction issue and pipelined floating-point functional units. The resulting microprocessors can process dramatically more data per machine cycle than previous models. Unfortunately, the performance of memory has not kept pace. The result has been an increase in the number of cycles for a memory access (a latency of 10 to 20 machine cycles is now quite common), causing an imbalance between the rate at which computations can be performed and the rate at which operands can be delivered onto the chip.

To ameliorate these problems, machine designers have turned increasingly to complex memory hierarchies. For example, the Intel i860XR has an on-chip cache memory for 8K bytes of data, and there are several systems that use two levels of cache with a MIPS R3000. Still, these systems perform poorly on scientific calculations that are memory intensive and are not structured to take advantage of the target machine's memory hierarchy.

This situation has led many programmers to restructure their codes by hand to improve performance in the memory hierarchy. We believe that this is a step in the wrong direction. The user should not be writing programs that target a particular machine; instead, the task of specializing a program to a target machine should fall to the compiler. If this trend continues, an increasing fraction of the human resources available for science and engineering will be spent on conversion of high-level language programs from one machine to another, an unacceptable eventuality.

There is a long history of the use of sophisticated compiler optimizations to achieve machine independence. The Fortran I compiler included enough optimizations to make it possible for scientists to abandon machine language programming. More recently, advanced vectorization technology has made it possible to write machine-independent vector programs in a sublanguage of Fortran 77. Is it possible to achieve the same success for memory-hierarchy management on scalar processors? More precisely, can we enhance compiler technology to make it possible to express an algorithm in a natural, machine-independent form while achieving memory-hierarchy performance good enough to obviate the need for hand optimization?

This thesis shows that compiler technology can make it possible for a program expressed in a natural form to achieve high performance, even on a complex memory hierarchy. Compiler algorithms to manage the memory hierarchy automatically are developed and shown through experimentation to be effective. By adapting compiler technology developed for parallel and vector architectures, we are able to restructure scientific codes to achieve good memory performance on scalar architectures.
In many cases, our techniques have been extremely effective, capable of achieving integer-factor speedups over code generated by a good optimizing compiler of conventional design. This accomplishment represents a step forward in the encouragement of machine-independent programming.

This chapter introduces the models, transformations and previous work on which our compiler strategy is based. Section 1.1 presents the models we use to understand machine and program behavior. Section 1.2 presents the transformations that we will use to improve program performance. Section 1.3 presents the previous work related to memory-hierarchy management and, finally, Section 1.4 gives a brief overview of the thesis.

1.1 Background

In this section, we lay the foundation for the application of our reordering transformations that improve the memory performance of programs. First, we describe a measure of machine and loop performance used in the application of the transformations described in the next section. Second, we present a special form of dependence graph that can be used by our transformation system to model memory usage.

1.1.1 Performance Model

We direct our research toward architectures that are pipelined and allow asynchronous execution of memory accesses and floating-point operations (e.g., Intel i860 or IBM RS/6000). We assume that the target machine has a typical optimizing compiler, one that performs scalar optimizations only. In particular, we assume that it performs strength reduction, allocates registers globally (via some coloring scheme) and schedules the arithmetic pipelines [CK77, CAC+81, GM86]. This makes it possible for our transformation system to restructure the loop nests while leaving the details of optimizing the loop code to the compiler.

Given these assumptions for the target machine and compiler, we will describe the notion of balance defined by Callahan et al. to measure the performance of program loops in relation to memory [CCK88]. This model will serve as the force behind the application of our transformations described throughout this thesis.

Machine Balance

A computer is balanced when it can operate in a steady-state manner with both memory accesses and floating-point operations being performed at peak speed. To quantify this relationship, we define β_M as the rate at which data can be fetched from memory, M_M, compared to the rate at which floating-point operations can be performed, F_M:

    β_M = M_M / F_M = (max words/cycle) / (max flops/cycle)

The values of M_M and F_M represent peak performance, where the size of a word is the same as the precision of the floating-point operations. Every machine has at least one intrinsic β_M (there may be one for single-precision floating point and one for double precision). For example, on the IBM RS/6000 β_M = 1 and on the DEC Alpha β_M = 3. Although this measure can be used to determine if a machine is balanced between computation and data accesses, it is not our purpose to use balance in this manner. Instead, our goal is to evaluate the performance of a loop on a particular machine based upon that machine's particular balance ratio. To do this, we introduce the notion of loop balance.

Loop Balance

Just as machines have balance ratios, so do loops. We can define balance for a specific loop as

    β_L = M_L / F_L = (number of memory references) / (number of flops).

We assume that references to array variables are actually references to memory, while references to scalar variables involve only registers. Memory references are assigned a uniform cost under the assumption that loop interchange, software prefetching or tiling will attain cache locality [WL91, KM92, CKP91].
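For concreteness, here is a small worked example of our own (not drawn from the original text). Consider the loop

      DO 10 I = 1,N
10    Y(I) = Y(I) + A*X(I)

where A is a scalar. Under the model above, the loads of X(I) and Y(I) and the store of Y(I) count as memory references (A stays in a register), so M_L = 3; the multiply and the add give F_L = 2. Thus β_L = 3/2. On the IBM RS/6000, where β_M = 1, this loop is memory bound: it is lowering β_L, not adding floating-point work, that would improve its performance.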
Comparing β_M to β_L can give us a measure of the performance of a loop running on a particular architecture. If β_L = β_M, the loop is balanced for the machine and will run well on that particular machine. The balance measure favors no particular machine or loop balance. It reports that out-of-balance loops will run well on similarly out-of-balance architectures and balanced loops will run well on similarly balanced architectures. In addition, it reports that performance bottlenecks occur when loop and machine balance do not match. If β_L > β_M, then the loop needs data at a higher rate than the memory system can provide and idle computational cycles will exist. Such a loop is said to be memory bound and its performance can be improved by lowering β_L. If β_L < β_M, then data cannot be processed as fast as it is supplied to the processor and memory bandwidth will be wasted. Such a loop is said to be compute bound. Compute-bound loops run at the peak floating-point rate of a machine and need not be further balanced. Floating-point operations usually cannot be removed, and arbitrarily increasing the number of memory operations is pointless.

1.1.2 Dependence Graph

To aid in the computation of the number of memory references for the application of our transformations, we use a form of dependence graph that exposes reuse of values. We say that a dependence exists between two references if there exists a control-flow path from the first reference to the second and both references access the same memory location [Kuc78]. The dependence is a true dependence or flow dependence if the first reference writes to the location and the second reads from it, an antidependence if the first reference reads from the location and the second writes to it, an output dependence if both references write to the location, and an input dependence if both references read from the location.

If two references, v and w, are contained in n common loops, we can refer to separate instances of the execution of the references by an iteration vector. An iteration vector, denoted ī, is simply the values of the loop control variables of the loops containing v and w. The set of iteration vectors corresponding to all iterations of the loop nest is called the iteration space. Using iteration vectors, we can define a distance vector, d = ⟨d_1, d_2, ..., d_n⟩, for each consistent dependence: if v accesses location Z on iteration ī_v and w accesses location Z on iteration ī_w, the distance vector for this dependence is ī_w − ī_v. Under this definition, the k-th component of the distance vector is equal to the number of iterations of the k-th loop (numbered from outermost to innermost) between accesses to Z. For example, given the following loop:

      DO 10 I = 1,N
      DO 10 J = 1,N
10    A(I,J) = A(I-1,J) + A(I-2,J+3)

the true dependence from A(I,J) to A(I-1,J) has a distance vector of ⟨1, 0⟩ and the true dependence from A(I,J) to A(I-2,J+3) has a distance vector of ⟨2, −3⟩. The loop associated with the outermost non-zero distance vector entry is said to be the carrier. The distance vector value associated with the carrier loop is called the threshold of the dependence. If all distance vector entries are zero, the dependence is loop independent.

In determining which dependences can be used for memory analysis, we consider only those that have a consistent threshold, that is, those dependences for which the threshold is constant throughout the execution of the loop [GJG87, CCK88]. For a dependence to have a consistent threshold, it must be the case that the location accessed by the dependence source on iteration i is accessed by the sink on iteration i + c, where c does not vary with i.
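As an illustration (ours, not from the text): in the loop

      DO 10 I = 1,N
      A(I) = B(I) + 1.0
10    C(I) = A(I-2) + 2.0

the location defined by A(I) on iteration i is read by A(I-2) on iteration i + 2, so c = 2 on every iteration and the threshold is consistent. In contrast, in

      DO 20 I = 1,N
      A(2*I) = B(I) + 1.0
20    C(I) = A(I) + 2.0

the element written on iteration i is read (when 2i does not exceed N) on iteration 2i, a gap of i iterations that grows as the loop runs, so the threshold is inconsistent.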
Formally, if we let

    A(f(ī)) = A(a_0 + a_1 I_1 + · · · + a_n I_n)
    A(g(ī)) = A(b_0 + b_1 I_1 + · · · + b_n I_n)

be array references where each a_i and b_i is a constant and each I_j is a loop induction variable (I_n is associated with the innermost loop), then we have the following definition.

Theorem 1.1 A dependence has a consistent threshold iff a_i = b_i for each 1 ≤ i ≤ n and there exists an integer c such that c·b_n = a_0 − b_0.

Proof See Callahan et al. [CCK88].

Some carried dependences will have multiple distance vector values associated with one entry. Consider the following loop.

      DO 10 I = 1, N
10    A(K) = A(K) + ...

The true dependence between the references to A(K) has the distances 1, 2, 3, ..., N−1 for the entry associated with the I-loop. For the purposes of memory management, we will use the minimum value, 1 in this case, as the distance vector entry.

1.2 Transformations To Improve Memory Performance

Based upon the dependence graph described in the previous section, we can apply a number of transformations to improve the performance of memory-bound programs. In this section, we give a brief introduction to these transformations.

1.2.1 Scalar Replacement

In the model of balance presented in the previous section, all array references are assumed to be memory references. The principal reason for this is that the data-flow analysis used by typical scalar compilers is not powerful enough to recognize most opportunities for reuse in subscripted variables. Arrays are treated in a particularly naive fashion, if at all, making it impossible to determine when a specific element might be reused. This, however, need not be the case. In the code shown below,

      DO 10 I = 2, N
10    A(I) = A(I-1) + B(I)

the value accessed by A(I-1) is defined on the previous iteration of the loop by A(I) on all but the first iteration. Using this knowledge, obtained via dependence analysis, the flow of values between the references can be expressed with temporaries as follows.

      T = A(1)
      DO 10 I = 2, N
      T = T + B(I)
10    A(I) = T

Since global register allocation will most likely put scalar quantities in registers, we have removed the load of A(I-1), resulting in a reduction in the balance of the loop from 3 to 2 [CAC+81, CH84, BCKT89]. This transformation is called scalar replacement, and in Chapter 2 we show how to apply it to loops automatically.

1.2.2 Unroll-and-Jam

Unroll-and-jam is another transformation that can be used to improve the performance of memory-bound loops [AC72, AN87, CCK88]. The transformation unrolls an outer loop and then jams the resulting inner loops back together. Using unroll-and-jam we can introduce more computation into an innermost loop body without a proportional increase in memory references. For example, the loop:

      DO 10 I = 1, 2*M
      DO 10 J = 1, N
10    A(I) = A(I) + B(J)

after unrolling becomes:

      DO 10 I = 1, 2*M, 2
      DO 20 J = 1, N
20    A(I) = A(I) + B(J)
      DO 10 J = 1, N
10    A(I+1) = A(I+1) + B(J)

and after jamming becomes:

      DO 10 I = 1, 2*M, 2
      DO 10 J = 1, N
      A(I) = A(I) + B(J)
10    A(I+1) = A(I+1) + B(J)

In the original loop, we have one floating-point operation and one memory reference after scalar replacement, giving a balance of 1. After applying unroll-and-jam, we have two floating-point operations and still only one memory reference, giving a balance of 0.5. On a machine that can perform twice as many floating-point operations as memory accesses per clock cycle, the second loop would perform better. In Chapter 3, we will show how to apply unroll-and-jam automatically to improve loop balance.
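As a further illustration, here is a sketch of ours (not an example from the text, and assuming N is even; an odd trip count would need a cleanup loop) applying unroll-and-jam to the J-loop of matrix multiply so that two floating-point pairs share each load of A(I,K):

      DO 10 J = 1, N, 2
      DO 10 K = 1, N
      DO 10 I = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
10    C(I,J+1) = C(I,J+1) + A(I,K)*B(K,J+1)

After scalar replacement of B(K,J) and B(K,J+1), which are invariant in the I-loop, the innermost body performs 4 flops against 5 memory references (a load and a store of each C element plus one load of A(I,K)), giving β_L = 1.25, compared with β_L = 1.5 (2 flops, 3 references) before the transformation. Further unrolling lowers β_L at the cost of more registers, which is exactly the trade-off that Chapter 3 automates.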
1.2.3 Loop Interchange

Not only are we concerned with the number of references to memory, but also with whether the data accessed by a reference is stored in cache or main memory. Consider the following Fortran loop, where arrays are stored in column-major order.

      DO 10 I = 1, N
      DO 10 J = 1, N
10    A = A + B(I,J)

References to successive elements of B by B(I,J) are a long distance apart in number of memory accesses, requiring an extremely large cache to capture the potential cache-line reuse. With the likelihood of cache misses on accesses to B, we can interchange the I- and J-loops to make the distance between successive accesses small, as shown below [AK87, Wol86a].

      DO 10 J = 1, N
      DO 10 I = 1, N
10    A = A + B(I,J)

Now, we will only have a cache miss on accesses to B once every cache line, resulting in better memory performance. In Chapter 4, we derive a compiler algorithm to apply loop interchange, when safe, to a loop nest to improve cache performance.

1.2.4 Strip-Mine-And-Interchange

Sometimes loops access more data than can be handled by a cache even after loop interchange. In these cases, the iteration space of a loop can be blocked into sections whose reuse can be captured by the cache. Strip-mine-and-interchange is a transformation that achieves this result [Wol87, Por89]. The effect is to shorten the distance between the source and sink of a dependence so that it is more likely for the datum to reside in cache when the reuse occurs. Consider the following example.

      DO 10 J = 1,N
      DO 10 I = 1,M
10    A(I) = A(I) + B(I,J)

Assuming that the value of M is much greater than the size of the cache, we would miss the opportunity to reuse the values of A on each iteration of J. To capture this reuse, we can use strip-mine-and-interchange. First, we strip mine the I-loop as shown below.

      DO 10 J = 1,N
      DO 10 I = 1,M,IS
      DO 10 II = I,MIN(I+IS-1,M)
10    A(II) = A(II) + B(II,J)

Note that this loop executes the iterations of the loop in precisely the same order as the original and its reuse properties are unchanged. However, when we interchange the stripped loop with the outer loop to give

      DO 10 I = 1,M,IS
      DO 10 J = 1,N
      DO 10 II = I,MIN(I+IS-1,M)
10    A(II) = A(II) + B(II,J)

the iterations are now executed in blocks of size IS by N. With this blocking, we can reuse IS values of A out of cache for every iteration of the J-loop if IS is less than half the size of the cache.

Together, unroll-and-jam and strip-mine-and-interchange make up a transformation technique known as iteration-space blocking. As discussed, the first is used to block for registers and the second for cache. In Chapter 5, we explore the additional knowledge necessary to apply iteration-space blocking to many real-world algorithms.

1.3 Related Work

Much research has been done in memory-hierarchy management using the aforementioned transformations. In this section, we review the previous work, noting the deficiencies that we intend to address.

1.3.1 Register Allocation

The most significant work in the area of register allocation has been done by Chaitin et al. and by Chow and Hennessy [CAC+81, CH84]. Both groups cast the problem of register allocation into that of graph coloring on an interference graph, where the nodes represent scalar memory locations and an edge between two nodes prevents them from occupying the same register. The objective is to find a k-coloring of the interference graph, where k is the number of machine registers. Since graph coloring is NP-complete, heuristic methods must be applied. It is usually the case that a k-coloring is not possible, requiring values to be spilled from registers to satisfy physical constraints.
This method has been shown to be very effective at allocating scalar variables to registers, but because information concerning the flow of array values is lacking, subscripted variables are not handled well.

Allen and Kennedy show how data dependence information can be used to recognize reuse of vector data and how that information can be applied to perform vector register allocation [AK88]. They also present two transformations, loop interchange and loop fusion, as methods to improve vector register allocation opportunities. Both of these transformations were originally designed to enhance loop parallelism. Because dependences represent the flow of values in an array, Allen and Kennedy suggest that this information could be used to recognize reuse in arrays used in scalar computations.

Callahan, Cocke and Kennedy have expanded these ideas to develop scalar replacement as shown in Section 1.2.1 [CCK88]. Their method is only applicable to loops that have no inner-loop conditional control flow; therefore, it has limited applicability. Also, their algorithm does not consider register pressure and may expose more reuse than can be handled by a particular machine's register file. If spill code must be inserted, the resulting performance degradation may negate most of the value of scalar replacement. Callahan, Cocke and Kennedy also use unroll-and-jam to improve the effectiveness of scalar replacement and to increase the amount of low-level parallelism in the inner-loop body. Although the mechanics of unroll-and-jam are described in detail, there is no discussion of how or when to apply the transformation to a loop nest.

Aiken and Nicolau present a transformation identical to unroll-and-jam, called loop quantization [AN87]. However, they do not use this transformation to increase data locality, but rather to improve inner-loop parallelism. They perform a strict quantization, where the unrolled iterations of the original loop in the body of the unrolled loop are data-independent. This means that they do not improve the data locality in the innermost loop for true dependences; therefore, little reduction in memory references can be obtained.

1.3.2 Memory Performance Studies

Previous studies have shown how poor cache behavior can have disastrous effects on program performance. Abu-Sufah and Malony showed that the performance of the LANL BMK8A1 benchmark fell by a factor of as much as 2.26 on an Alliant FX/8 when vector sizes were too large to be maintained in cache [ASM86, GS84]. Similarly, Liu and Strother found that vector performance on the IBM 3090 fell by a factor of 1.40 when vector lengths exceeded the cache capacity [LS88]. In this second study, it was also shown that if the vector code were blocked into smaller sections that fit into cache, the optimal performance was regained. Porterfield reported that on computers with large memory latencies, many large scientific programs spent half of their execution time waiting for data to be delivered to cache [Por89].

A number of other studies have shown the effectiveness of blocking loops for cache performance. Gallivan et al. show that on the Alliant FX/8, the blocked version of LU decomposition is nearly 8 times faster than the unblocked version, using BLAS3 and BLAS2 respectively [GJMS88, DDHH88, DDDH90]. The BLAS2 version performs a rank-1 update of the matrix, while the best BLAS3 version performs a blocked rank-96 update.
Also on the Alliant FX/8, Berry and Sameh have achieved speedups of as large as 9 over the standard LINPACK versions for solving tridiagonal linear systems [BS88, DBMS79], and on the Cray-2, Calahan showed that blocking LU decomposition improved the performance by a factor of nearly 6 [Cal86].

All of these studies involved tedious hand optimization to attain maximal performance. The BLAS primitives are noteworthy examples of this methodology, in which each primitive must be recoded in assembly language to get performance on each separate architecture [LHKK79]. Hand optimization is less than ideal because it takes months to code the BLAS primitives by hand, although recoding the whole program is a worse alternative. Therefore, significant motivation exists for the development of a restructuring compiler that can optimize for any memory hierarchy and relieve the programmer from the tedious task of memory-hierarchy management.

1.3.3 Memory Management

The first major work in the area of memory management by a compiler is that of Abu-Sufah on improving virtual memory performance [AS78]. In his thesis, he describes the use of a restructuring compiler to improve the virtual memory behavior of a program. Through the use of dependence analysis, Abu-Sufah is able to perform transformations on the loop structure of a program to reduce the number of physical pages required and to group accesses to each page. The transformations that Abu-Sufah uses to improve the virtual memory performance of loops are clustering (or loop distribution), loop fusion and page indexing (a combination of strip mining, interchange and distribution). Clustering is used to split the loop into components whose name spaces (or working sets) are disjoint. Loop fusion is then used to fuse components which originated in different loops but whose name spaces intersect. The page indexing transformation is used to block a loop nest so that all the references to a page are contained in one iteration over a block, thus maximizing locality. The goal is a set of loops whose working sets are disjoint; hence, the required number of physical pages is reduced without increasing the page fault rate. Although Abu-Sufah's transformation system shows the potential for improving a program's memory behavior, his use of loop distribution can increase the amount of pipeline interlock and cause performance degradation [CCK88]. Additionally, the fact that virtual memory is fully associative rather than set associative prevents the application of his model to cache management.

Fabri's work in automatic storage optimization concentrates on extended graph-coloring techniques to manage storage overlays in memory [Fab79]. She presents the notion of array renaming (analogous to live-range splitting) to minimize the memory requirements of a program. However, the problem of storage overlays does not map to cache management, since the compiler does not have explicit control over the cache.

Thabit has examined software methods to improve cache performance through the use of packing, prefetching and loop transformations [Tha81]. He shows that optimal packing of scalars for cache-line reuse is an NP-complete problem and proposes some heuristics. However, scalars are not considered to be a significant problem because of register-allocation technology. In addition, Thabit maps graph coloring to the allocation of arrays to eliminate cache interference caused by set associativity. However, no evaluation of the effectiveness of this approach is given. In particular, he does not address the problem of an array interfering with itself.
Finally, Thabit establishes the safety conditions of loop distribution and strip-mine-and-interchange.

Wolfe's memory performance work has concentrated on developing transformations to reshape loops to improve their cache performance [Wol87]. He shows how tiling (or iteration-space blocking) can be used to improve the memory performance of program loops. Wolfe also shows that his techniques for advanced loop interchange can be used to tile loops with non-rectangular iteration spaces and loops that are not perfectly nested [Wol86a]. In particular, he discusses blocking for triangular- and trapezoidal-shaped iteration spaces, but he does not present an algorithm. Instead, he illustrates the transformation by a few examples.

Irigoin and Triolet describe a new dependence abstraction, called a dependence cone, that can be used to block code for two levels of parallelism and two levels of memory [IT88]. The dependence cone gives more information than a distance vector by maintaining a system of linear inequalities that can be used to derive all dependences. The cone allows a larger set of perfectly nested loops to be transformed than other dependence abstractions by providing a general framework to partition an iteration space into supernodes (or blocks). The idea is to aggregate many loop iterations, so as to provide vector statements, parallel tasks and data reference locality. To improve memory performance, the supernodes can be chosen to maximize the amount of data locality in each supernode. Unfortunately, this technique does not work on imperfectly nested loops, nor does it handle partially blockable loops, both of which occur in linear algebra codes.

Wolfe presents work that is very similar to Irigoin and Triolet's [Wol89]. He does not use the dependence cone as the dependence abstraction, but instead he uses the standard distance vector. Using loop skewing, loop interchange and strip mining, he can tile an iteration space into blocks which have both data locality and parallelism. Wolfe is limited by the transformations that he applies and by the restrictive nature of the subscripts handled with distance vectors, but he is capable of handling non-perfectly nested loops. The main advantage of Wolfe's techniques over Triolet's is that Wolfe's method is more straightforward to implement in a restructuring compiler, although the complexity of the algorithm is O(d!), where d is the loop nesting depth.

Gannon et al. present a technique to describe the amount of data that must be in the cache for reuse to be possible [GJG87]. They call this the reference window. The window for a dependence describes all of the data that is brought into the cache for the two references from the time that one datum is accessed at the source until it is used again at the sink. A family of reference windows for an array represents all of its elements that must fit into cache to capture all of the reuse. To determine if the cache is large enough to hold every window, the window sizes are summed and compared against the size of the cache. If the cache is too small, blocking transformations such as strip-mine-and-interchange can be used to decrease the size of the reference windows.

Porterfield proposes a method to determine when the data referenced in a loop does not fit entirely in cache [Por89]. He develops the idea of an overflow iteration, which is that iteration of a loop that brings in the data item which will cause the cache to overflow.
Using this measurement, Porterfield can predict when a loop needs to be tiled to improve memory performance and determine the size of a one-dimensional tile for the loop that causes the cache to overflow. Porterfield also presents two new transformations to block loops. The first is peel-and-jam, which can be used to fuse loops that have certain types of fusion-preventing dependences by peeling off the offending iterations of the first loop and fusing the resulting loop bodies. The second is either a combination of loop skewing, interchanging and strip mining, or of loop unrolling, peeling and jamming, to perform wavefront blocking. The key technique here is the use of non-reordering transformations (skewing, peeling and unrolling) to make it possible to block loops. Some of these non-reordering transformations will become especially important when dealing with partially blockable loops.

Porterfield also discusses the applicability of blocking transformations to twelve Fortran programs. He organizes them into three categories: transformable, semi-transformable and non-transformable. One-third of the programs in his study were transformable, since the blocking transformations were directly applicable. The semi-transformable programs contained coding styles that made it difficult to transform them automatically, and in the case of the non-transformable program, partial pivoting was the culprit. Porterfield claims that a compiler cannot perform iteration-space blocking in the presence of partial pivoting, but his analysis is not extensive enough to make this claim. He does not consider increasing the "intelligence" of the compiler to improve its effectiveness.

Lam and Wolf present a framework for determining memory usage within loop nests and use that framework to apply loop interchange, loop skewing, loop reversal, tiling and unroll-and-jam [WL91, Wol86b]. Their method does not work on non-perfectly nested loops and does not encompass a technique to determine unroll-and-jam amounts automatically. Additionally, they do not necessarily derive the best block algorithm with their technique, leaving open the possibility of suboptimal performance.

Lam, Rothberg and Wolf present a method to determine block sizes for blocked algorithms automatically [LRW91]. Their results show that typical effective block sizes use less than 10% of the cache. They suggest the use of copy optimizations to remove the effects of set associativity and allow the use of larger portions of the cache.

Kennedy and McKinley present a simplified model of cache performance that only considers inner-loop reuse [KM92]. Any reuse that occurs across an outer-loop iteration is considered to be prevented by cache interference. Using this model, they describe a method to determine the number of cache lines required by a loop if it were innermost, and then they reorder the loops to use the minimum number of cache lines. This simple model is less precise than the Lam and Wolf model, but is very effective in practice. While the number of cache lines used by a loop is related to memory performance, it is not a direct measure of performance. Their work is directed toward shared-memory parallel architectures, where bus contention is a real performance problem and minimizing main-memory accesses is vital. On scalar processors with a higher cache-access cost, the number of cache lines accessed may not be an accurate measure of performance.
We present algorithms to apply transformations to improve loop balane and we present experiments to validate the eetiveness of the algorithms. In Chapter 2, we address salar replaement in the presene of onditional-ontrol ow. In Chapter 3, we disuss automati unroll-andjam to improve loop balane. Chapter 4 addresses loop interhange to improve ahe performane and in Chapter 5, we analyze the appliability to real-world numerial subprograms of ompiler bloking tehniques that further optimize ahe performane. Finally, we present our onlusions and future work in Chapter 6. 11 Chapter 2 Salar Replaement Although onventional ompilation systems do a good job of alloating salar variables to registers, their handling of subsripted variables leaves muh to be desired. Most ompilers fail to reognize even the simplest opportunities for reuse of subsripted variables. For example, in the ode shown below, DO 10 I = 1, N DO 10 J = 1, M 10 A(I) = A(I) + B(J) most ompilers will not keep A(I) in a register in the inner loop. This happens in spite of the fat that standard optimization tehniques are able to determine that the address of the subsripted variable is invariant in the inner loop. On the other hand, if the loop is rewritten as 10 20 DO 20 I = 1, N T = A(I) DO 10 J = 1, M T = T + B(J) A(I) = T even the most naive ompilers alloate T to a register in the inner loop. The prinipal reason for the problem is that the data-ow analysis used by standard ompilers is not powerful enough to reognize most opportunities for reuse of subsripted variables. Subsripted variables are treated in a partiularly naive fashion, if at all, making it impossible to determine when a spei element might be reused. This is partiularly problemati for oating-point register alloation beause most of the omputational quantities held in suh registers originate in subsripted arrays. Salar replaement is a transformation that uses dependene information to nd reuse of array values and expose it by replaing the referenes with salar temporaries as was done in the above example [CCK88, CCK90℄. By enoding the reuse of array elements in salar temporaries, we an give a oloring register alloator the opportunity to alloate values held in arrays to registers [CAC+ 81℄. Although previous algorithms for salar replaement have been shown to be eetive, they have only handled loops without onditional-ontrol ow [CCK90℄. The priniple reason for past deienies is the reliane solely upon dependene information. A dependene ontains little information onerning ontrol ow between its soure and sink. It only reveals that both statements may be exeuted. In the loop, 5 10 DO 10 I = 1,N IF (M(I) .LT. 0) A(I) = B(I) + C(I) D(I) = A(I) + E(I) the true dependene from statement 5 to statement 10 does not reveal that the denition of A(I) is onditional. Using only dependene information, previous salar replaement algorithms would produe the following inorret ode. 5 10 DO 10 I = 1,N IF (M(I) .LT. 0) THEN A0 = B(I) + C(I) A(I) = A0 ENDIF D(I) = A0 + E(I) 12 CHAPTER 2. SCALAR REPLACEMENT If the result of the prediate is false, no denition of A0 will our, resulting in an inorret value for A0 at statement 10. To ensure A0 has the proper value, we an insert a load of A0 from A(I) on the false branh, as shown below. 5 10 DO 10 I = 1,N IF (M(I) .LT. 0) THEN A0 = B(I) + C(I) A(I) = A0 ELSE A0 = A(I) ENDIF D(I) = A0 + E(I) The hazard with inserting instrutions is the potential to inrease run-time osts. In the previous example, we have avoided the hazard. 
If the true branch is taken, one load of A(I) is removed. If the false branch is taken, one load of A(I) is inserted and one load is removed. It will be a requirement of our scalar replacement algorithm to prevent an increase in run-time accesses to memory.

This chapter addresses scalar replacement in the presence of forward conditional control flow. We show how to map partial redundancy elimination to scalar replacement in the presence of conditional control flow to ensure that memory costs will not increase along any execution path [MR79, DS88]. This chapter begins with an overview of partial redundancy elimination. Then, a detailed derivation of our algorithm for scalar replacement is given. Finally, an experiment with an implementation of this algorithm is reported.

2.1 Partial Redundancy Elimination

In the elimination of partial redundancies, the goal is to remove the latter of two identical computations that are performed on a given execution path. A computation is partially redundant when there may be paths on which both computations are performed and paths on which only the latter computation is performed. In Figure 2.1, the expression A+B is redundant along one branch of the IF and not redundant along the other. Partial redundancy elimination will remove the computation C=A+B, replacing it with an assignment, and insert a computation of A+B on the path where the expression does not appear (see Figure 2.2). Because there may be no basic block in which new computations can be inserted on a particular path, insertion is done on flow-graph edges and new basic blocks are created when necessary [DS88]. The essential property of this transformation is that it is guaranteed not to increase the number of computations performed along any path [MR79].

[Figure 2.1: Partially Redundant Computation. A branch on IF (P) computes D = A+B on one path only; both paths join before the computation C = A+B.]

[Figure 2.2: After Partial Redundancy Elimination. The path that computes D = A+B also copies T = D; the other path computes T = A+B; the join computes C = T.]

In mapping partial redundancy elimination to scalar replacement, references to array expressions can be seen as the computations. A load, or a store, followed by another load from the same location means that the second load is redundant and can be removed. Thus, using this mapping will guarantee that the number of memory accesses in a loop will not increase. However, the mapping that we use will not guarantee that a minimal number of loads is inserted.
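To ground the mapping, here is a sketch of ours in the spirit of Figures 2.1 and 2.2 (not an example from the text). The following loop contains a partially redundant load of A(I):

      DO 10 I = 1,N
      IF (P(I) .GT. 0.0) THEN
         X(I) = A(I) + 1.0
      ENDIF
10    Y(I) = A(I) * 2.0

When the true branch executes, the load of A(I) at statement 10 repeats the load just performed. Treating the loads as the "computations" of partial redundancy elimination, the true branch keeps the loaded value in a temporary, a compensating load is inserted on the false edge, and statement 10 reads the temporary: the true path drops from two loads to one, and the false path still performs exactly one. No path gets worse.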
First, the ow graph of the innermost loop must be reduible. Seond, bakward jumps are not allowed within the innermost DO-loop beause they potentially reate loops. Finally, multiple loop exits are prohibited. This restrition is for simpliity and an be removed with slight modiation to the algorithm. 2.2.2 Availability Analysis The rst step in performing salar replaement is to alulate available array expressions. Here, we will determine if the value provided by the soure of a dependene is generated on every path to the sink of the dependene. We assume that enough iterations of the loop have been peeled so values an be available upon entry to the loop. Sine eah lexially idential array referene aesses the same memory loation on a given loop iteration, we do not treat eah lexially idential array referene as a separate array expression. Rather, we onsider them in onert. Array expressions most often ontain referenes to indution variables. Therefore, their naive treatment in availability analysis is inadequate. To illustrate this, in the loop, 10 20 30 DO 30 I = 1,N IF (B(I) .LT. 0.0) THEN C(I) = A(I) + D(I) ELSE C(I) = A(I-1) + D(I) ENDIF E(I) = C(I) + A(I) the value aessed by the array expression A(I) is fully available at the referene to A(I-1) in statement 20, but it is not available at statement 10 and is only partially available at statement 30. Using a ompletely syntati notion of array expressions, essentially treating eah lexially idential expression as a salar, A(I) will be inorretly reported as available at statements 10 and 30. Thus, more information is required. We must aount for the fat that the value of an indution variable ontained in an array expression hanges on eah iteration of the loop. A onvenient solution is to split the problem into loop-independent availability, denoted liav, where the bak edge of the loop is ignored, and loop-arried availability, lav, where the bak edge is inluded. Thus, an array expression is only available if it is in liav and there is a onsistent inoming loop-independent dependene, or if it is in lav and there is a mathing onsistent inoming loop-arried dependene. The data-ow equations for availability analysis are shown below. T liavin(b) = p2preds (b) liavout(p) S = (liavin(b) likill(b)) ligen(b) T lavout (p) lavin(b) = p2preds (b) S lavout(b) = lavin(b) lgen(b) liavout(b) For liav, an array expression is added to gen when it is enountered whether it is a load or a store. At eah store, the soures of all inoming inonsistent dependenes are added to kill and removed from gen. At loads, nothing is done for kill beause a previously generated value annot be killed. We all these sets ligen and likill. For lav, we must onsider the fat that the ow of ontrol from the soure to the sink of a dependene will inlude at least the next iteration of the loop. Subsequent loop iterations an eet whether a value has truly been generated and not killed by the time the sink of the dependene is reahed. In the loop, DO 10 I = 1,N IF (A(I) .GT. 0.0) THEN C(I) = B(I-2) + D(I) ELSE B(K) = C(I) + D(I) ENDIF 10 B(I) = E(I) + C(I) the value generated by B(I) on iteration I=1 will be available at the referene to B(I-2) on iteration I=3 only if the false branh of the IF statement is not taken on iteration I=2. Sine determining the diretion 2.2. 15 ALGORITHM of a branh at ompile time is undeidable in general, we must assume that the value generated by B(I) will be killed by the denition of B(K). 
In general, any definition that is the source of an inconsistent output dependence can never be in lgen(b) for any b: it will always be killed on the current or next iteration of the loop. Therefore, we need only compute lgen and not lkill.

There is one special case where this definition of lgen will unnecessarily limit the effectiveness of scalar replacement. When a dependence threshold is 1, it may happen that the sink of the generating dependence edge occurs before the killing definition. Consider the following loop.

      DO 10 I = 1,N
        B(K) = B(I-1) + D(I)
10      B(I) = E(I) + C(I)

Here, the value used by B(I-1) that is generated by B(I) will never be killed by B(K). The solution to this limitation is to create a new set called lavif1 that contains availability information only for loop-carried dependences with a threshold of 1. Control flow through the current and next iterations of the loop is included when computing this set. Because we consider control flow on the next iteration of the loop, unlike lav, the data-flow equations for lavif1 are identical in form to those of liav. They are shown below.

    lavif1_in(b)  = ∩_{p ∈ preds(b)} lavif1_out(p)
    lavif1_out(b) = (lavif1_in(b) − likill(b)) ∪ ligen(b)

Because we consider both fully and partially redundant array accesses, we must also compute partial availability in order to scalar replace references whose load is only partially redundant. As in full-availability analysis, we partition the problem into loop-independent, loop-carried and loop-carried-if-1 sets. Computation of kill and gen corresponds to that of availability analysis. Below are the data-flow equations for partially available array expression analysis.

    lipav_in(b)    = ∪_{p ∈ preds(b)} lipav_out(p)
    lipav_out(b)   = (lipav_in(b) − likill(b)) ∪ ligen(b)
    lpav_in(b)     = ∪_{p ∈ preds(b)} lpav_out(p)
    lpav_out(b)    = lpav_in(b) ∪ lgen(b)
    lpavif1_in(b)  = ∪_{p ∈ preds(b)} lpavif1_out(p)
    lpavif1_out(b) = (lpavif1_in(b) − likill(b)) ∪ ligen(b)

2.2.3 Reachability Analysis

Because there may be multiple lexically identical array references within a loop, we want to determine which references actually supply values that reach a sink of a dependence and which supply values that are killed before reaching such a sink. In other words, we want to know which values reach their potential reuses. In computing reachability, we do not treat the lexically identical array expressions in concert; rather, each reference is considered independently.

Reachability information, along with availability, is used to select which array references provide values for scalar replacement. While reachability information is not required for correctness, it can prevent the marking of a reference as providing a value for scalar replacement when that value is redefined by a later identical reference. This improves the readability of the transformed code.

We partition the reachability information into three sets: one for loop-independent dependences (lirg), one for loop-carried dependences with a threshold of 1 (lrgif1) and one for other loop-carried dependences (lrg). Calculation of ligen and likill is the same as that for availability analysis except in one respect: the source of any incoming loop-independent dependence whose sink is a definition is killed whether the dependence is consistent or inconsistent. Additionally, likill is subtracted from lrg_out to account for consistent references that redefine a value on the current iteration of a loop. For example, in the following loop, the definition of B(I) kills the load from B(I); therefore, only the definition reaches the reference to B(I-1).

      DO 10 I = 1,N
        A(I) = B(I-1) + B(I)
10      B(I) = E(I) + C(I)
Using reachability information, the most recent access to a value can be determined. Even using likill when computing lrg will not eliminate all unreachable references: references with only outgoing consistent loop-carried output dependences or antidependences will not be killed. This does not affect correctness, only readability, and can only happen when partially available references provide the value to be scalar replaced. Below are the data-flow equations used in computing array-reference reachability.

    lirg_in(b)    = ∪_{p ∈ preds(b)} lirg_out(p)
    lirg_out(b)   = (lirg_in(b) − likill(b)) ∪ ligen(b)
    lrg_in(b)     = ∪_{p ∈ preds(b)} lrg_out(p)
    lrg_out(b)    = (lrg_in(b) − likill(b)) ∪ lgen(b)
    lrgif1_in(b)  = ∪_{p ∈ preds(b)} lrgif1_out(p)
    lrgif1_out(b) = (lrgif1_in(b) − likill(b)) ∪ ligen(b)

2.2.4 Potential-Generator Selection

At this point, we have enough information to determine which array references potentially provide values to the sinks of their outgoing dependence edges. We call these references potential generators because they can be seen as generating the value used at the sink of some outgoing dependence. The dependences leaving a generator are called generating dependences. Generators are only potential at this point because we need more information to determine whether scalar replacement will even be profitable. In Figure 2.3, we present the algorithm FindPotentialGenerators for determining potential generators.

We have two goals in choosing potential generators. The first is to insert the fewest loads, which correlates with the maximum number of memory accesses removed. The second is to minimize register pressure, the number of registers required to eliminate the loads. To meet the first objective, fully available expressions are given the highest priority in generator selection. To meet the second, loop-independent fully available generators are preferred because they require the fewest registers. If no loop-independent generator exists, loop-carried fully available generators are considered next. If there are multiple such generators, the one that requires the fewest registers (the one with the smallest threshold) is chosen.

If there are no fully available generators, partially available array expressions are considered next. Partially available generators do not guarantee a reduction in the number of memory accesses, because memory loads need to be inserted on paths along which a value is needed but not generated. However, we can guarantee that there will not be an increase in the number of memory loads by inserting load instructions only if they are guaranteed to have a corresponding load removal on every execution path [MR79]. Without this guarantee, we might increase the number of memory accesses at execution time, resulting in a performance degradation.

The best choice for a partially available generator is one that is loop-independent. Although there may be a "more available" loop-carried generator, register pressure is kept to a minimum and scalar replacement will still be applied if we choose a loop-independent generator. If there are no loop-independent partially available array expressions, then the next choice is a loop-carried partially available array expression with a generating dependence having the largest threshold of any incoming potential generating dependence. Although this contradicts the goal of keeping register pressure to a minimum, we increase the probability that there will be a use of the value on every path by increasing the window size for potential uses of the generated value. We have chosen to sacrifice register pressure for potential savings in memory accesses.
Finally, when propagating data-flow sets through a basic block to determine availability or reachability at a particular point, information is not always incrementally updated. For loop-independent information, we update a data-flow set with gen and kill information as we encounter statements. However, the same is not true for loop-carried and loop-carried-if-1 information. gen information is not used to update any loop-carried data-flow set: loop-carried information must propagate around the loop to be valid, and a loop-carried data-flow set at the entry to a basic block already contains this information. kill information is incrementally updated only for loop-carried-if-1 sets, since flow through the second iteration of a loop after a value is generated is considered.

    procedure FindPotentialGenerators(G)
      Input: G = (V, E), the dependence graph
      Defs:  b_v = basic block containing v
      for each v ∈ V do
        if ∃ e = (w,v) ∈ E with w ∈ liav(b_v), w ∈ lirg(b_v), d_n(e) = 0 then
          mark all such w's as v's potential liav generator
        else if ∃ e = (w,v) ∈ E with w ∈ lavif1(b_v), w ∈ lrgif1(b_v),
                w ∈ lavif1_out(exit), w ∈ lrgif1_out(exit), d_n(e) = 1 then
          mark all such w's as v's potential lav generator
        else if ∃ e = (w,v) ∈ E with w ∈ lav_out(exit), w ∈ lrg_out(exit),
                d_n(e) > 0, min_{f ∈ E_v}(d_n(f)) = d_n(e) then
          mark all such w's as v's potential lav generator
        else if ∃ e = (w,v) ∈ E with w ∈ lipav(b_v), w ∈ lirg(b_v), d_n(e) = 0 then
          mark all such w's as v's potential lipav generator
        else if ∃ e = (w,v) ∈ E with w ∈ lpav_out(exit), w ∈ lrg_out(exit),
                d_n(e) > 0, max_{f ∈ E_v}(d_n(f)) = d_n(e) then
          mark all such w's as v's potential lpav generator
        else if ∃ e = (w,v) ∈ E with w ∈ lpavif1(b_v), w ∈ lrgif1(b_v),
                w ∈ lpavif1_out(exit), w ∈ lrgif1_out(exit), d_n(e) = 1 then
          mark all such w's as v's potential lpav generator
        else
          v has no generator

    Figure 2.3  FindPotentialGenerators

In the next portion of the scalar replacement algorithm, we will ensure that the value needed at a reference to remove its load is fully available. This involves insertion of memory loads for references whose generator is partially available. We guarantee through our partial redundancy elimination mapping that we will not increase the number of memory accesses at run time. However, we do not guarantee a minimal insertion of memory accesses.

2.2.5 Anticipability Analysis

After determining potential generators, we need to locate the paths along which loads must be inserted to make partially available generators fully available. Loads need to be inserted on paths along which a value is needed but not generated. We have already encapsulated value generation in availability information; now we encapsulate value need with anticipability. The value generated by an array expression v is anticipated by an array expression w if there is a true or input edge v → w and v is w's potential generator.

      DO 6 I = 1,N
        IF (A(I) .GT. 0.0) THEN
5         B(I) = C(I) + D(I)
        ELSE
          F(I) = C(I) + D(I)
        ENDIF
6       C(I) = E(I) + B(I)

In the above example, the value generated by the definition of B(I) in statement 5 is anticipated at the use of B(I) in statement 6.

As in availability analysis, we consider each lexically identical array expression in concert. We split the problem, but this time into only two partitions: one for loop-independent generators, lian, and one for loop-carried generators, lan. We do not consider the back edge of the loop during analysis for either partition. For lian, the reason is obvious.
For lan, we only want to know that a value is anticipated on all paths through the loop. This ensures that we do not increase the number of memory accesses in the loop.

In each partition, an array expression is added to gen at an array reference if it is a potential generator for that reference. For members of lian, array expressions are killed at the point where they are defined by a consistent or inconsistent reference. For lan, only inconsistent definitions kill anticipation, because consistent definitions do not define the value being anticipated on the current iteration. For example, in the loop

      DO 1 I = 1,N
        A(I) = B(I) + D(I)
1       B(I) = A(I-1) + C(I)

the definition of A(I) does not redefine the particular value anticipated by A(I-1). The value is generated by A(I) and then never redefined because of the iteration change. The data-flow equations, which propagate backward over the flow graph, are shown below.

    lian_out(b) = ∩_{s ∈ succs(b)} lian_in(s)
    lian_in(b)  = (lian_out(b) − likill(b)) ∪ ligen(b)
    lan_out(b)  = ∩_{s ∈ succs(b)} lan_in(s)
    lan_in(b)   = (lan_out(b) − lkill(b)) ∪ lgen(b)

2.2.6 Dependence-Graph Marking

Once anticipability information has been computed, the dependence graph can be marked so that only the dependences to be scalar replaced are left unmarked. The other edges no longer matter because their participation in value flow has already been considered. Figure 2.4 shows the algorithm MarkDependenceGraph.

At a given array reference, we mark any incoming true or input edge that is inconsistent, and any incoming true or input edge that has a symbolic threshold or a threshold greater than that of the dependence edge from the potential generator. Inconsistent and symbolic edges are not amenable to scalar replacement because it is impossible to determine at compile time the number of registers needed to expose the potential reuse. When a reference has a consistent generator, all edges with threshold less than or equal to the generating threshold are left in the graph. This facilitates the consistent register naming discussed in subsequent sections: it ensures that any reference occurring between the source and sink of an unmarked dependence that can provide the value at the sink will be connected to the dependence sink. Finally, any edge from a loop-carried partially available generator that is not anticipated at the entry to the loop is removed, because there will not be a dependence sink on every path.

    procedure MarkDependenceGraph(G)
      Input: G = (V, E), the dependence graph
      for each v ∈ V
        if v has no generator then
          mark v's incoming true and input edges
        else if v's generator is inconsistent or has symbolic d_n then
          mark v's incoming true and input edges
          v no longer has a generator
        else if v's generator is lpav and v ∉ lan_in(entry) then
          mark v's incoming true and input edges
          v no longer has a generator
        else
          τ_v = threshold of edge from v's generator
          mark v's incoming true and input edges with d_n > τ_v
          mark v's incoming edges whose source does not reach v
            or whose source is not partially available at v
          mark v's incoming inconsistent and symbolic edges

    Figure 2.4  MarkDependenceGraph

2.2.7 Name Partitioning

At this point, the unmarked dependence graph represents the flow of values for references to be scalar replaced. We have determined which references provide values that can be scalar replaced. Now, we move on to linking together the references that share values. Consider the following loop.

      DO 6 I = 1,N
5       B(I) = B(I-1) + D(I)
6       C(I) = B(I-1) + E(K)

After performing the analysis discussed so far, the generator for the reference to B(I-1) in statement 6 would be the load of B(I-1) in statement 5. However, the true generator for this reference is the definition of B(I) in statement 5, making B(I-1) in statement 5 an intermediate point in the flow of the value rather than the actual generator for B(I-1) in statement 6.
These references need to be considered together when generating temporary names because they address the same memory locations. The nodes of the dependence graph can be partitioned into groups by dependence edges (see Figure 2.5) to ensure that the temporary names for all references that participate in the flow of values through a memory location are consistent. Any two nodes connected by an unmarked dependence edge after graph marking belong in the same partition. Partitioning is accomplished by a traversal of the dependence graph, following unmarked true and input dependences only, since these dependences represent value flow. Partitioning ties together all references that access a particular memory location and represent reuse of a value in that location.

After name partitioning is completed, we can determine the number of temporary variables (or registers) necessary to perform scalar replacement on each of the partitions. To calculate register requirements, we split the references (or partitions) into two groups: variant and invariant. Variant references contain the innermost-loop induction variable within their subscript expression; invariant references do not. In the previous example, the reference to E(K) is invariant with respect to the I-loop, while all other array references are variant.

    procedure GenerateNamePartitions(G, P)
      Input: G = (V, E), the dependence graph
             P = the set of name partitions
      mark each v ∈ V as unvisited
      i = 0
      for each v ∈ V
        if v is unvisited then
          put v in P_i
          Partition(v, P_i)
          i = i + 1
      for each p ∈ P do
        FindGenerator(p, γ_p)
        if p is invariant then r_p = 1
        else r_p = max_{g ∈ γ_p}(CalcNumberOfRegisters(g)) + 1
      enddo
    end

    procedure Partition(v, p)
      mark v as visited
      add v to p
      for each e = (v,w) or (w,v) ∈ E
        if e is true or input and unmarked and w not visited then
          Partition(w, p)
    end

    procedure FindGenerator(p, γ)
      for each v ∈ p that is a potential generator
        if (v is invariant ∧ (v is a store ∨ v has no incoming unmarked
            loop-independent edge)) ∨ (v is variant ∧ v has no unmarked
            incoming true or input dependences from another w ∈ p) then
          γ = γ ∪ {v}
    end

    procedure CalcNumberOfRegisters(v)
      dist = 0
      for each e = (v,w) ∈ E
        if e is true or input and unmarked ∧ w not visited ∧ P(w) = P(v) then
          dist = max(dist, d_n(e) + CalcNumberOfRegisters(w))
      return dist
    end

    Figure 2.5  GenerateNamePartitions

For variant references, we begin by finding all references within each partition that are first to access a value on a given path from the loop entry. We call these references partition generators. A variant partition generator will already have been selected as a potential generator in Section 2.2.4 and will not have incoming unmarked dependences from another reference within its partition. The set of partition generators within one partition is called the generator set, γ_p. Next, the maximum over all dependence distances from a potential generator to a reference within the partition is calculated. Even if dependence edges between the generator and a reference have been removed from the graph, we can compute the distance by finding the length of a chain, excluding cycles, from the generator to that reference. Letting τ_p be the maximum distance, each partition requires r_p = τ_p + 1 registers or temporary variables: one register for the value generated on the current iteration and one register for each of the τ_p values that were generated previously and still need to flow to the last sink. A sketch of this distance computation appears below.
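The following sketch mirrors CalcNumberOfRegisters from Figure 2.5; it is illustrative rather than the thesis's code, and the node and edge attributes (out_edges, kind, marked, distance) are hypothetical stand-ins for the dependence-graph representation.

    # Longest acyclic chain of innermost-loop distances from a generator
    # to any reference in the same partition; r_p = 1 + max over generators.
    def chain_distance(v, partition, visited=None):
        if visited is None:
            visited = set()
        visited.add(v)
        dist = 0
        for e in v.out_edges:
            w = e.sink
            if (e.kind in ("true", "input") and not e.marked
                    and w not in visited and w in partition):
                dist = max(dist,
                           e.distance + chain_distance(w, partition, visited))
        return dist

    def registers_needed(partition, generators, invariant):
        if invariant:
            return 1
        return 1 + max(chain_distance(g, partition) for g in generators)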
In the previous example, B(I) is the oldest reference and is the generator of the value used at both references to B(I-1). The threshold of the partition is 1, which requires 2 registers, because two values generated by B(I) are live after executing statement 5 and before executing statement 6. For invariant references, we can use one register for scalar replacement because each reference accesses the same value on every iteration. In order for an invariant potential generator to be a partition generator, it must be a definition or the first load of a value along a path through the loop.

2.2.8 Register-Pressure Moderation

Unfortunately, experiments have shown that exposing all of the reuse possible with scalar replacement may result in a performance degradation [CCK90]. Register spilling can completely counteract any savings from scalar replacement, and the specialized information necessary to recover the original code and prevent excessive spilling is lost. As a result, we need to generate scalar temporaries in such a way as to minimize register pressure, in effect doing part of the register allocator's job.

Our goal is to allocate temporary variables so that we eliminate the most memory accesses given the available number of machine registers. This can be approximated using a greedy algorithm. In this approximation scheme, the partitions are ordered in decreasing order of the ratio of benefit to cost, in this case the ratio of memory accesses saved to registers required. At each step, the algorithm chooses the first partition that fits into the remaining registers. This method requires O(n log n) time to sort the ratios and O(n) time to select the partitions; hence, the total running time is O(n log n).

To compute the benefit of scalar replacing a partition, we begin by computing the probability that each basic block will be executed. The first basic block in the loop is assigned an execution probability of 1. Each outgoing edge from the block is given a proportional probability of execution; in the case of two outgoing edges, each edge is given a 50% probability. For the remaining blocks, this procedure is repeated, except that the execution probability of a remaining block is the sum of the probabilities of its incoming edges. Next, we compute the probability that the generator for a partition is available at the entrance to and exit from a basic block. The probability upon entry is the sum of the probabilities at the exits of the block's predecessors, weighted by the number of incoming edges. Upon exit from a block, the probability is 1 if the generator is available, 0 if it is not available, and the entry probability otherwise. After computing each of these probabilities, the benefit for each reference within a partition can be computed by multiplying the execution probability of the basic block containing the reference by the availability probability of the reference's generator. Figure 2.6 gives the complete algorithm for computing benefit. As an example, in the loop

      DO 1 I = 1,N
        IF (M(I) .LT. 0) THEN
          A(I) = B(I) + C(I)
        ENDIF
1       D(I) = A(I) + E(I)

the load of A(I) in statement 1 has a benefit of 0.5.
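The greedy selection itself is only a few lines; the sketch below is illustrative (it corresponds to the first phase of ModerateRegisterPressure in Figure 2.7, not to the thesis's actual code), and the partition fields benefit and registers are hypothetical.

    # Greedy register-pressure moderation: sort partitions by their
    # benefit/cost ratio and take each one that still fits.
    def greedy_select(partitions, registers_available):
        chosen, skipped = [], []
        for p in sorted(partitions,
                        key=lambda p: p.benefit / p.registers,
                        reverse=True):
            if p.registers <= registers_available:
                chosen.append(p)
                registers_available -= p.registers
            else:
                skipped.append(p)   # later candidates for partial allocation
        return chosen, skipped, registers_available

The skipped list feeds the partial-allocation step described later in this section.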
Unfortunately, the greedy approximation can produce suboptimal results. To see this, consider the following example, in which each partition is represented by a pair (n, m), where n is the number of registers needed and m is the number of memory accesses eliminated. Assume that we have a machine with 6 registers and that we are generating temporaries for a loop that has the following generators: (4,8), (3,5), (3,5), (2,1). The greedy algorithm would first choose (4,8) and then (2,1), resulting in the elimination of nine accesses instead of the optimal ten.

    procedure CalculateBenefit(P, FG)
      Input: P = set of reference partitions
             FG = flow graph
      Defs:  A_p^e(b) = probability p is available on entry to b
             A_p^x(b) = probability p is available on exit from b
             b_p = benefit for partition p
             P(b) = probability block b will be executed
      P(entry) = 1
      n = number of outgoing forward edges of entry
      weight each outgoing forward edge of entry at P(entry)/n
      for the remaining basic blocks b ∈ FG in reverse depth-first order
        let P(b) = sum of all incoming edge weights
        n = number of outgoing forward edges of b
        weight each outgoing forward edge at P(b)/n
      for each p ∈ P
        for each basic block b ∈ FG in reverse depth-first order
          n = number of incoming forward edges of b
          for each incoming forward edge of b, e = (a,b)
            A_p^e(b) = A_p^e(b) + A_p^x(a)/n
          if p ∈ liav_out(b) ∪ lav_out(b) then A_p^x(b) = 1
          else if p ∈ lipav_out(b) ∪ lpav_out(b) then A_p^x(b) = A_p^e(b)
          else A_p^x(b) = 0
        for each v ∈ p
          if p is lpav or lpavif1 then b_p = b_p + A_p^x(exit) · P(b_v)
          else if p is lipav then b_p = b_p + A_p^e(b_v) · P(b_v)
          else b_p = b_p + P(b_v)
    end

    Figure 2.6  CalculateBenefit

To get a possibly better allocation, we can model register-pressure moderation as a knapsack problem, where the number of scalar temporaries required for a partition is the size of the object to be put in the knapsack, and the size of the register file is the knapsack size. Using dynamic programming, an optimal solution to the knapsack problem can be found in O(kn) time, where k is the number of registers available for allocation and n is the number of generators. Hence, for a specific machine, we get a linear time bound. However, the greedy method has a running time that is independent of machine architecture, making it more practical for use in a general tool. The greedy approximation of the knapsack problem is also provably no more than two times worse than the optimal solution [GJ79], and our experiments suggest that in practice the greedy algorithm performs as well as the knapsack algorithm.

After determining which generators will be fully scalar replaced, there may still be a few registers available. The partitions that were eliminated from consideration can be examined to see whether partial allocation is possible. In each eliminated partition whose generator is not lpav, we allocate the references whose distance from γ_p is less than the number of remaining registers. All references within the partition that do not fit this criterion are removed from it. This step is performed on each partition, if possible, while registers remain unused. Finally, since we may have removed references from a partition, anticipability analysis for potential generators must be redone.

To illustrate partial allocation, assume that in the following loop there is one register available.

      DO 10 I = 1,N
        A(I) = ...
        IF (B(I) .GT. 0.0) THEN
          ... = A(I)
        ENDIF
10      ... = A(I-1)

Here, full allocation is not possible, but there is a loop-independent dependence between the A(I)'s. In partial allocation, A(I-1) is removed from the partition, allowing scalar replacement to be performed on the remaining references.
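For comparison with the greedy sketch above, the knapsack formulation mentioned earlier in this section can be solved exactly with the usual O(kn) dynamic program. This is an illustrative sketch, not the thesis's implementation; the partition fields are the same hypothetical ones used above.

    # 0/1 knapsack over partitions: maximize memory accesses saved
    # within a budget of k registers.
    def knapsack_select(partitions, k):
        best = [(0, []) for _ in range(k + 1)]   # best[r] = (benefit, chosen)
        for p in partitions:
            for r in range(k, p.registers - 1, -1):   # descending: use p once
                cand = best[r - p.registers][0] + p.benefit
                if cand > best[r][0]:
                    best[r] = (cand, best[r - p.registers][1] + [p])
        return best[k]

On the (4,8), (3,5), (3,5), (2,1) example with k = 6, this procedure selects the two (3,5) partitions, saving ten accesses where the greedy method saves nine.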
The algorithm in Figure 2.7 gives the complete register-pressure moderation algorithm, including partial allocation. Although this method of partial allocation may still leave possible reuse not scalar replaced, experience suggests this rarely, if ever, happens. One possible refinement would be to consider dependences from intermediate points within a partition when looking for potential reuse.

    procedure ModerateRegisterPressure(P, G)
      Input: P = set of reference partitions
             G = (V, E), the dependence graph
      Defs:  b_p = benefit for a partition p
             r_p = registers required by a partition p
      CalculateBenefit(P, G)
      R = registers available
      H = P sorted on the ratio b_p/r_p in decreasing order
      for i = 1 to |H| do
        if r_{H_i} < R then
          allocate(H_i)
          R = R − r_{H_i}
        else
          S = S ∪ H_i
          P = P − H_i
        endif
      enddo
      if R > 0 then
        i = 1
        while i ≤ |S| and R > 0 do
          S_i = {v | v ∈ S_i, d_n(ê) < R, ê = (γ_{S_i}, v)}
          allocate(S_i)
          R = R − r_{S_i}
          P = P ∪ S_i
          i = i + 1
        enddo
        redo anticipability analysis
      endif
    end

    Figure 2.7  ModerateRegisterPressure

2.2.9 Reference Replacement

At this point, we have determined which references will be scalar replaced, and we move into the code generation phase of the algorithm. Here, we replace array references with temporary variables and ensure that the temporaries contain the proper value at each point in the loop.

After we have determined which partitions will be scalar replaced, we replace the array references within each partition. This algorithm is shown in Figure 2.8. First, for each variant partition p, we create the temporary variables T_p^0, T_p^1, ..., T_p^{r_p−1}, where T_p^i represents the value generated i iterations earlier by g ∈ γ_p, the first generator to access the value used throughout the partition. Each reference within the partition is replaced with the temporary that coincides with its distance from g. For invariant partitions, each reference is replaced with T_p^0.

If a replaced reference v ∈ γ_p is a memory load, then a statement of the form T_p^i = v is inserted before the generating statement. Requiring that the load be in γ_p for load insertion ensures that a load that is a potential generator but also has a potential generator itself will not have a load inserted; the value for the potential generator not in γ_p will already be provided by its own potential generator. If v is a store, then a statement of the form v = T_p^i is inserted after the generating statement. The latter assignment is unnecessary if the store has an outgoing loop-independent output edge to a definition that is always executed and it has no outgoing inconsistent true dependences. We could get better results by performing availability and anticipability analysis exclusively for definitions to determine whether a value is always redefined.

    procedure ReplaceReferences(P)
      Input: P = set of reference partitions
      for each p ∈ P do
        let g = the v ∈ p with no incoming marked or unmarked edge
                from another w ∈ p, or v invariant
        for each v ∈ p do
          if v is invariant then d = 0
          else d = distance from g to v
          replace v with T_p^d
          if v ∈ γ_p and v is a load then
            insert "T_p^d = v" before v's statement
          if v is a store ∧ (∃ e = (v,w) that is true and inconsistent ∨
             ∀ e = (v,w) where e is output and d_n(e) = 0, w is not
             always executed) then
            insert "v = T_p^d" after v's statement
    end

    Figure 2.8  ReplaceReferences

The effect of reference replacement is illustrated on the following loop nest.

      DO 3 I = 1, 100
1       IF (M(I) .LT. 0) E(I) = C(I)
2       A(I) = C(I) + D(I)
3       B(K) = B(K) + A(I-1)

The reuse-generating dependences are:
1. a loop-independent input dependence from C(I) in statement 1 to C(I) in statement 2 (threshold 0),

2. a true dependence from A(I) in statement 2 to A(I-1) in statement 3 (threshold 1), and

3. a true dependence from B(K) to B(K) in statement 3 (threshold 1).

By our method, the generator C(I) in statement 1 needs only one temporary, T10. Here, we use the first numeric digit to indicate the number of the partition and the second to represent the distance from the generator. The generator B(K) in statement 3 needs one temporary, T20, since it is invariant, and the generator A(I) needs two temporaries, T30 and T31. When we apply the reference replacement procedure to the example loop, we generate the following code.

      DO 3 I = 1, 100
        IF (M(I) .LT. 0) THEN
          T10 = C(I)
          E(I) = T10
        ENDIF
        T30 = T10 + D(I)
        A(I) = T30
        T20 = T20 + T31
3       B(K) = T20

The value for T31 is not generated in this example. We will discuss its generation in Sections 2.2.11 and 2.2.13.

2.2.10 Statement-Insertion Analysis

After we replace the array references with scalar temporaries, we need to insert loads for partially available generators. Given a reference that has a partially available potential generator, we must insert a statement at the highest point on a path from the loop entrance to the reference where the generator is anticipated but not partially available. By performing statement-insertion analysis on potential generators, we guarantee that every reference's anticipated value will be fully available. Here, we handle each individual reference, whereas name partitioning linked together the references that share values. This philosophy will not necessarily introduce a minimum number of newly inserted loads, but there will be no increase in the number of run-time loads.

The place for insertion of loads for partially available generators can be determined using Drechsler and Stadel's formulation of partial redundancy elimination, shown below [DS88].

    pp_in(b)    = ant_in(b) ∩ pav_in(b) ∩ (antloc(b) ∪ (transp(b) ∩ pp_out(b)))
    pp_out(b)   = FALSE if b is the loop exit; ∩_{s ∈ succ(b)} pp_in(s) otherwise
    insert(b)   = pp_out(b) ∩ ¬av_out(b) ∩ (¬pp_in(b) ∪ ¬transp(b))
    insert(a,b) = pp_in(b) ∩ ¬pp_out(a) ∩ ¬av_out(a)

Here, pp_in(b) denotes that placement of a statement is possible at the entry to block b, and pp_out(b) denotes that placement is possible at the exit from b. insert(b) determines which loads need to be inserted at the bottom of block b. insert(a,b) is defined for each edge in the control-flow graph and determines which loads are inserted on the edge from a to b. transp(b) is true for an array expression if it is not defined by a consistent or inconsistent definition in block b. antloc(b) is the same as gen(b) for anticipability information.

Three problems of the above form are solved: one for lipav generators, one for lpav generators and one for lpavif1 generators. Additionally, any reference to loop-carried ant_in information refers to the entry block. If insert(a,b) is true for some potential generator g, then we insert a load of the form T_p^d = g on the edge (a,b), where T_p^d is the temporary name associated with g. If insert(b) is true for some potential generator g, then a statement of identical form is inserted at the end of block b. Finally, if insert(a,b) is true for all a ∈ preds(b), then the loads can be collapsed into the beginning of block b. The algorithm for inserting statements is shown in Figure 2.9.

If we perform statement insertion on our example loop, we get the following result.

      DO 3 I = 1, 100
        IF (M(I) .LT. 0) THEN
          T10 = C(I)
          E(I) = T10
        ELSE
          T10 = C(I)
        ENDIF
        T30 = T10 + D(I)
        A(I) = T30
        T20 = T20 + T31
3       B(K) = T20

Again, the generation of T31 is left for Sections 2.2.11 and 2.2.13.
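Once the pp sets have been computed, evaluating the insertion predicates is mechanical. The sketch below is illustrative only; pp_in, pp_out, av_out, and transp are assumed to be precomputed per-block boolean maps for a single generator, which is a hypothetical simplification of the set-valued analysis.

    # insert(a,b) = pp_in(b) and not pp_out(a) and not av_out(a)
    def edge_insertions(edges, pp_in, pp_out, av_out):
        return [(a, b) for (a, b) in edges
                if pp_in[b] and not pp_out[a] and not av_out[a]]

    # insert(b) = pp_out(b) and not av_out(b)
    #             and (not pp_in(b) or not transp(b))
    def block_insertions(blocks, pp_in, pp_out, av_out, transp):
        return [b for b in blocks
                if pp_out[b] and not av_out[b]
                and (not pp_in[b] or not transp[b])]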
2.2.11 Register Copying

Next, we need to ensure that the values held in the temporary variables are correct across loop iterations. The value held in T_p^i needs to move one iteration further away from its generator, T_p^0, on each subsequent loop iteration. Since i is the number of loop iterations the value is from the generator, the variable T_p^{i+1} must take on the value of T_p^i at the end of the loop body in preparation for the iteration change. The algorithm in Figure 2.10 effects the following shift of values for each partition p.

    T_p^{r_p−1} = T_p^{r_p−2}
    T_p^{r_p−2} = T_p^{r_p−3}
        ...
    T_p^1 = T_p^0

    procedure InsertStatements(FG)
      Input: FG = flow graph
      perform insert analysis
      for each flow edge e ∈ FG
        for each v ∈ insert(e)
          if v is invariant then d = 0
          else d = distance from the oldest generator in P(v) to v
          insert statement "T_{P(v)}^d = v" on e
      for each basic block b ∈ FG
        for each v ∈ insert(b)
          if v is invariant then d = 0
          else d = distance from the oldest generator in P(v) to v
          insert statement "T_{P(v)}^d = v" at the end of b
      for each v such that insert(a,b)(v) is true ∀ a ∈ preds(b)
        collapse loads of v into the beginning of b
    end

    Figure 2.9  InsertStatements

After inserting register copies, our example code becomes:

      DO 3 I = 1, 100
        IF (M(I) .LT. 0) THEN
          T10 = C(I)
          E(I) = T10
        ELSE
          T10 = C(I)
        ENDIF
        T30 = T10 + D(I)
        A(I) = T30
        T20 = T20 + T31
        B(K) = T20
3       T31 = T30

2.2.12 Code Motion

It may be the case that the assignment to or a load from a generator can be moved entirely out of the innermost loop. This is possible when the reference to the generator is invariant with respect to the innermost loop. In the example above, B(K) does not change with each iteration of the I-loop; therefore, its value can be kept in a register during the entire execution of the loop and stored back into B(K) after the loop exit.

      DO 3 I = 1, 100
        IF (M(I) .LT. 0) THEN
          T10 = C(I)
          E(I) = T10
        ELSE
          T10 = C(I)
        ENDIF
        T30 = T10 + D(I)
        A(I) = T30
        T20 = T20 + T31
3       T31 = T30
      B(K) = T20

The algorithm for this type of code motion is shown in Figure 2.11.

    procedure InsertRegisterCopies(G, P, x)
      Input: G = (V, E), the dependence graph
             P = set of reference partitions
             x = peeled iteration number, or ∞ for the loop body
      for each p ∈ P
        for i = min(r_p − 1, x) downto 1 do
          insert "T_p^i = T_p^{i−1}" at the end of the loop body in G
    end

    Figure 2.10  InsertRegisterCopies

    procedure CodeMotion(P)
      Input: P = set of reference partitions
      for each p ∈ P
        for each v ∈ p
          if v is a store ∧ v ∈ lan_in(entry) then
            if ∃ e = (v,w) ∈ E with v = w ∧ ∀ e = (v,z) ∈ E, e is consistent then
              move the store to v after the loop
          else if ∃ e = (v,w) ∈ E such that e is consistent ∧ v = w ∧
                  ∀ e = (u,v) ∈ E, if e is true it is consistent then
            move the load of v before the loop
    end

    Figure 2.11  CodeMotion

When inconsistent dependences leave an invariant array reference that is a store, the generating store for that variable cannot be moved outside of the innermost loop. Consider the following example.

      DO 10 J = 1, N
10      A(I) = A(I) + A(J)

The true dependence from A(I) to A(J) is not consistent. If the value of A(I) were stored into A(I) only outside of the loop, then the value of A(J) would be wrong whenever I=J and I > 1.

2.2.13 Initialization

To ensure that the temporary variables contain the correct values upon entry to the loop, the loop is peeled using the algorithm in Figure 2.12. We peel max(r_{p_1}, ..., r_{p_n}) − 1 iterations from the beginning of the loop, replacing the members of a variant partition p for peeled iteration k with their original array references, substituting the iteration value for the induction variable, only if j ≥ k for a temporary T_p^j. For invariant partitions, we only replace non-partition-generators' temporaries on the first peeled iteration. Additionally, we let r_p = 2 for each invariant partition when calculating the number of peeled iterations; this ensures that invariant partitions will be initialized correctly. Finally, at the end of each peeled iteration, the appropriate number of register transfers is inserted.
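Before showing the effect on our running example, the overall shape of the initialization step can be sketched as follows. This is an illustrative outline under stated assumptions, not the thesis's code: the emit helper and the partition fields are hypothetical.

    # Peel x = max(r_p) - 1 iterations (invariant partitions count as
    # r_p = 2) and insert the register transfers after each peeled body.
    def peel_for_initialization(loop, partitions, emit):
        x = max(2 if p.invariant else p.registers for p in partitions) - 1
        for k in range(1, x + 1):
            # temporaries T_p^j with j >= k revert to their array
            # references, with the induction variable set to iteration k
            emit.peeled_body(loop, iteration=k)
            for p in partitions:
                for i in range(min(p.registers - 1, k), 0, -1):
                    emit.copy(p, i, i - 1)        # T_p^i = T_p^{i-1}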
When this transformation is applied to our example, we get the following code.

      IF (M(1) .LT. 0) THEN
        T10 = C(1)
        E(1) = T10
      ELSE
        T10 = C(1)
      ENDIF
      T30 = T10 + D(1)
      A(1) = T30
      T20 = B(K) + A(0)
      T31 = T30
      ... LOOP BODY ...

    procedure Initialize(P, G)
      Input: P = set of reference partitions
             G = (V, E), the dependence graph
      x = max(r_{p_1}, ..., r_{p_n}) − 1
      for k = 1 to x
        G' = peel off the k-th iteration of G
        for each v ∈ V'
          if v = T_p^j ∧ ((v is variant ∧ j ≥ k) ∨
             (v is invariant ∧ v ∉ γ_{P(v)} ∧ j + 1 ≥ k ∧
              v's generator is loop carried)) then
            replace v with its original array reference
            replace the inner-loop induction variable with its k-th iteration value
          endif
        InsertRegisterCopies(G', P, k)
    end

    Figure 2.12  Initialize

2.2.14 Register Subsumption

In our example loop, we have eliminated three loads and one store from each iteration of the loop, at the cost of three register-to-register transfers in each iteration. Fortunately, the inserted transfer instructions can be eliminated if we unroll the scalar-replaced loop using the algorithm in Figure 2.13. If we have the partitions p_0, p_1, ..., p_n, we can remove the transfers by unrolling lcm(r_{p_0}, r_{p_1}, ..., r_{p_n}) − 1 times. In the k-th unrolled body, the temporary variable T_p^j is replaced with the variable T_p^{mod(j−k, r_p)}, where mod(y,x) = y − ⌊y/x⌋·x. Essentially, we capture the permutation of values by circulating the register names within the unrolled iterations.

    procedure Subsume(G, P)
      Input: P = set of reference partitions
             G = (V, E), the dependence graph
      x = lcm(r_{p_0}, ..., r_{p_n}) − 1
      unroll G x times
      for G_k = each of the x new loop bodies
        for each v ∈ V_k
          if v = T_p^j then
            replace v with T_p^{mod(j−k, r_p)}
    end

    Figure 2.13  Subsume

The final result of scalar replacement on our example is shown in Figure 2.14 (the pre-loop to capture the extra iterations and the initialization code are not shown).

      DO 3 I = 2, 100, 2
        IF (M(I) .LT. 0) THEN
          T10 = C(I)
          E(I) = T10
        ELSE
          T10 = C(I)
        ENDIF
        T30 = T10 + D(I)
        A(I) = T30
        T20 = T20 + T31
        IF (M(I+1) .LT. 0) THEN
          T10 = C(I+1)
          E(I+1) = T10
        ELSE
          T10 = C(I+1)
        ENDIF
        T31 = T10 + D(I+1)
        A(I+1) = T31
3       T20 = T20 + T30

    Figure 2.14  Example After Scalar Replacement
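The circulation of names used in Figures 2.13 and 2.14 is just modular renaming. The sketch below is illustrative; the partition objects with a registers field are hypothetical stand-ins, and the mod helper follows the thesis's definition exactly.

    import math
    from functools import reduce

    def mod(y, x):                  # the thesis's mod: y - floor(y/x)*x
        return y - (y // x) * x

    # In the k-th unrolled body, T_p^j is renamed T_p^{mod(j-k, r_p)}.
    def subsumed_index(j, k, r_p):
        return mod(j - k, r_p)

    def bodies_to_unroll(partitions):
        return reduce(math.lcm, (p.registers for p in partitions), 1) - 1

For a partition with r_p = 2, body k = 1 swaps the roles of T^0 and T^1: subsumed_index(0, 1, 2) is 1 and subsumed_index(1, 1, 2) is 0, which is exactly the T30/T31 circulation visible in Figure 2.14.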
2.3 Experiment

We have implemented a source-to-source translator in the ParaScope programming environment, a programming system for Fortran, that uses the dependence analyzer from PFC. The translator replaces subscripted variables with scalars using the algorithm described in this chapter. The experimental design is illustrated in Figure 2.15: ParaScope serves as a preprocessor, rewriting Fortran programs to improve register allocation, and both the original and transformed versions of the program are then compiled and run using the standard product compiler for the target machine.

[Figure 2.15  Experimental design: a Fortran program passes through the transformer (ParaScope) and then the Fortran compiler, in both original and improved forms.]

For our test machine, we chose the IBM RS/6000 model 540 because it had a good compiler and a large number of floating-point registers (32). In fact, the IBM XLF compiler performs scalar replacement for those references that do not require dependence analysis; many fully available loop-independent cases and invariant cases are handled. Therefore, the results described here reflect only the cases that required dependence analysis. Essentially, we show the results of performing scalar replacement on loop-carried dependences and in the presence of inconsistent dependences.

Livermore Loops.  We tested scalar replacement on a number of the Livermore Loops. Some of the kernels did not contain opportunities for our algorithm; therefore, we do not show their results. The table below shows the performance gain attained by our transformation system.

    Loop  Iterations  Original  Transformed  Speedup
     1      10000      3.40s       2.54s       1.34
     5      10000      3.05s       1.36s       2.24
     6      10000      3.82s       2.82s       1.35
     7       5000      3.94s       2.02s       1.95
     8      10000      3.38s       3.07s       1.10
    11      10000      4.52s       1.69s       2.67
    12      10000      1.70s       1.42s       1.20
    13       5000      3.25s       3.01s       1.08
    18       1000      2.62s       2.54s       1.03
    20        500      2.99s       2.90s       1.03
    23       5000      2.68s       2.35s       1.14

Some of the interesting results include the performances of loops 5 and 11, which compute first-order linear recurrences. Livermore Loop 5 is shown below.

      DO 5 I = 2,N
5       X(I) = Z(I) * (Y(I) - X(I-1))

Here, scalar replacement not only removes one memory access but also improves pipeline performance: the store to X(I) no longer causes the load of X(I-1) on the next iteration to block. Loop 11 presents a similar situation.

Loop 6, shown below, is also an interesting case because it involves an invariant array reference that requires dependence analysis to detect.

      DO 6 I = 2,N
        DO 6 K = 1,I-1
6         W(I) = W(I) + B(I,K)*W(I-K)

A compiler that recognizes loop-invariant addresses, such as the IBM compiler, fails to get this case because of the load of W(I-K). Through the use of dependence analysis, we are able to prove that there is no dependence between W(I) and W(I-K) carried by the innermost loop, allowing code motion. Because of this additional information, we are able to get a speedup of 1.35.

Linear Algebra Kernels.  We also tested scalar replacement on both the point and block versions of LU decomposition, with and without partial pivoting. The results are shown in the table below.

    Kernel             Original  Transformed  Speedup
    LU Decomp            6.76s      6.09s       1.11
    Block LU             6.39s      4.40s       1.45
    LU w/ Pivot          7.01s      6.35s       1.10
    Block LU w/ Pivot    6.84s      4.81s       1.42

Each of these kernels contains invariant array references that require dependence analysis to detect. The speedup achieved on the block algorithms is higher because an invariant load and store are removed, rather than just a load as in the point algorithms.

Applications.  To complete our study, we ran a number of Fortran applications through our translator. We chose programs from SPEC, Perfect, RiCEPS and local sources. Of the programs that belong to the benchmark suites but are not included in the experiment, 5 failed to be successfully analyzed by PFC, 1 failed to compile on the RS/6000 and 10 contained no opportunities for our algorithms. The table below contains a short description of each application.
    Suite    Application  Description
    SPEC     Matrix300    Matrix multiplication
             Tomcatv      Mesh generation
    Perfect  Adm          Pseudospectral air pollution
             Arc2d        2D fluid-flow solver
             Flo52        Transonic inviscid flow
    RiCEPS   Onedim       Time-independent Schrodinger equation
             Shal         Weather prediction
             Simple       2D hydrodynamics
             Sphot        Particle transport
             Wave         Electromagnetic particle simulation
    Local    CoOpt        Oil exploration
             Seval        B-spline evaluation
             Sor          Successive over-relaxation

The results of performing scalar replacement on these applications are reported in the following table. Any application not listed observed a speedup of 1.00.

    Suite    Program  Original  Transformed  Speedup
    Perfect  Adm      236.84s    228.84s      1.03
             Arc2d    410.13s    407.57s      1.01
             Flo52     66.32s     63.83s      1.04
    RiCEPS   Shal     302.03s    290.42s      1.04
             Simple   963.20s    934.13s      1.03
             Sphot      3.85s      3.78s      1.02
             Wave     445.94s    431.11s      1.03
    Local    CoOpt    122.88s    120.44s      1.02
             Seval      0.62s      0.56s      1.11
             Sor        1.83s      1.26s      1.46

The applications Sor and Seval performed the best because we were able to optimize their respective computationally intensive loops: each had one loop that comprised almost the entire running time of the program. For the program Simple, one loop comprised approximately 50% of the program execution time, but the carried dependence could not be exposed due to a lack of registers. In fact, without the register-pressure moderation algorithm of Section 2.2.8, program performance deteriorated. Sphot's improvement was gained by performing scalar replacement on one partition in one loop; this particular partition contained a loop-independent partially available generator that required our extension to handle control flow.

The IBM RS/6000 has a load penalty of only 1 cycle. On processors with larger load penalties, such as the DEC Alpha, we would expect to see a larger performance gain from scalar replacement. Additionally, the problem sizes for benchmarks are typically small; on larger problem sizes, we expect larger performance gains due to a higher percentage of time being spent inside loops.

2.4 Summary

In this chapter, we have presented an algorithm to perform scalar replacement in the presence of forward conditional-control flow. By mapping partial redundancy elimination to scalar replacement, we ensure that we will not increase the run-time memory costs of a loop. We have applied our algorithm to a number of kernels and whole applications and shown that integer-factor speedups over good optimizing compilers are possible on kernels.

Chapter 3
Unroll-And-Jam

Because applying scalar replacement alone may still leave a loop memory bound, we can sometimes apply unroll-and-jam before scalar replacement to allow a larger reduction in loop balance. For example, recall from Chapter 1 that given the loop

      DO 10 I = 1, 2*M
        DO 10 J = 1, N
10        A(I) = A(I) + B(J)

with a balance of 1, unroll-and-jam of the I-loop by 1 produces the loop

      DO 10 I = 1, 2*M, 2
        DO 10 J = 1, N
          A(I) = A(I) + B(J)
10        A(I+1) = A(I+1) + B(J)

with a balance of 0.5. On a machine that can perform 2 flops per load, giving a machine balance of 0.5, the transformed loop would execute faster because it would be balanced.

Although unroll-and-jam has been studied extensively, it has not been shown how to tailor unroll-and-jam to specific loops run on specific architectures. In the past, unroll amounts have been determined experimentally and specified with a compile-time parameter [CCK90]. However, the best choice of unroll amounts varies between loops and architectures. Therefore, in this chapter, we derive a method to choose unroll amounts automatically in order to balance program loops with respect to a specific target architecture.
Unroll-and-jam is tailored to a specific machine based on a few parameters of the architecture, such as the effective number of registers and the machine balance. The result is a machine-independent transformation system in the sense that it can be retargeted to new processors by changing parameters.

This chapter begins with an overview of the safety conditions for unroll-and-jam. Then, we present a method for updating the dependence graph so that scalar replacement can take advantage of the new opportunities created by unroll-and-jam. Next, we derive an automatic method to balance program loops with respect to a particular architecture and, finally, we report on performance improvements achieved by an experimental implementation of this algorithm.

3.1 Safety

Certain dependences prevent or limit the amount of unrolling that can be done while still allowing jamming to be safe. Consider the following loop.

      DO 10 I = 1, 2*M
        DO 10 J = 1,N
10        A(I,J) = A(I-1,J+1)

A true dependence with a distance vector of ⟨1,−1⟩ goes from A(I,J) to A(I-1,J+1). Unrolling the outer loop once creates a loop-independent true dependence that becomes an antidependence in the reverse direction after loop jamming. This dependence prevents jamming because the order of the store to and load from the location would be reversed, as shown in the unrolled code below.

      DO 10 I = 1, 2*M, 2
        DO 10 J = 1,N
          A(I,J) = A(I-1,J+1)
10        A(I+1,J) = A(I,J+1)

Location A(3,2) is defined on iteration ⟨3,2⟩ and read on iteration ⟨4,1⟩ in the original loop. In the transformed loop, the location is read on iteration ⟨3,1⟩ and defined on iteration ⟨3,2⟩, reversing the access order and changing the semantics of the loop. With a semantics that is only defined on correct programs (as in Fortran), we say that unroll-and-jam is safe if the flow of values is not changed. The following theorem summarizes this property.

Theorem 3.1  Let e be a true, output or antidependence carried at level k with distance vector

    d = ⟨0, ..., 0, d_k, 0, ..., 0, d_j, ...⟩,

where d_j < 0 and all components between the k-th and j-th are 0. Then the loop at level k can be unrolled at most d_k − 1 times before a dependence is generated that prevents fusion of the inner n − k loops.

Proof  See Callahan, et al. [CCK88]. Essentially, unrolling more than d_k − 1 times will introduce a dependence with a negative entry in the outermost position after fusion. Since negative thresholds are not defined, the dependence direction must be reversed, and dependence direction reversal makes unroll-and-jam unsafe.

Additionally, when non-DO statements appear at levels other than the innermost level, loop distribution must be safe in order for unroll-and-jam to be safe. Essentially, no recurrence involving data or control dependences can be violated [AK87].

3.2 Dependence Copying

To allow scalar replacement to take advantage of the new opportunities for reuse created by unroll-and-jam, we must update the dependence graph to reflect these changes. Below is a method to compute the updated dependence graph after loop unrolling for consistent dependences that contain only one loop induction variable in each subscript position and are not invariant with respect to the unrolled loop [CCK88]. If we unroll the m-th loop in a nest L by a factor of k, then the updated dependence graph G' = (V', E') for L' can be computed from the dependence graph G = (V, E) for L by the following rules:

1. For each v ∈ V, there are k + 1 nodes v_0, ..., v_k in V'. These correspond to the original reference and its k copies.

2. For each edge e = ⟨v,w⟩ ∈ E with distance vector d(e), there are k + 1 edges e_0, ..., e_k, where e_j = ⟨v_j, w_i⟩, v_j is the j-th copy of v, w_i is the i-th copy of w, and

    i = (j + d_m(e)) mod (k + 1)

    d_m(e_j) = ⌊d_m(e)/(k+1)⌋        if i ≥ j
    d_m(e_j) = ⌊d_m(e)/(k+1)⌋ + 1    if i < j
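These rules are mechanical and compact in code. The sketch below is illustrative, not the thesis's implementation; it returns plain (source body, sink body, new distance) tuples rather than operating on graph structures.

    # Copy one dependence edge when loop m is unrolled by a factor of k
    # (k + 1 loop bodies); d is the original distance in loop m.
    def copy_edge(d, k):
        edges = []
        for j in range(k + 1):                # source reference in body j
            i = (j + d) % (k + 1)             # sink reference lands in body i
            new_d = d // (k + 1) + (0 if i >= j else 1)
            edges.append((j, i, new_d))
        return edges

    # Figure 3.1's edge of distance 2, with the J-loop unrolled by 2:
    # copy_edge(2, 2) -> [(0, 2, 0), (1, 0, 1), (2, 1, 1)]

The example output reproduces the three edges discussed with Figure 3.1 below: one loop-independent edge within the unrolled body and two edges of distance 1.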
Below, dependence copying is described for invariant references. To aid in the explanation, the following terms are defined.

1. If references are invariant with respect to a particular loop, we call that loop reference invariant.

2. If a specific reference, v, is invariant with respect to a loop, we call that loop v-invariant.

The behavior of invariant references differs from that of variant references when unroll-and-jam is applied. Given an invariant reference v, the distance vector value of an incoming edge corresponding to a v-invariant loop represents multiple values rather than just one value. Therefore, we first apply the previous formula to the minimum distance. Then, we create a copy of the original edge for each new loop body between the source and sink corresponding to the original references, and we insert loop-independent dependences from the references within a statement to all identical references following them.

To extend dependence copying to unroll-and-jam, we have the following rules, using an outermost- to innermost-loop application ordering. If we unroll loop k and the new distance vector d(e) = ⟨d_1, d_2, ..., d_n⟩ satisfies

    d_k(e) = 0 and d_m(e) = 0 for all m < k,

then the next inner non-zero entry, d_p(e), p > k, becomes the threshold of e, and p becomes the carrier of e. If d_m(e) = 0 for all m, then e is loop independent. If there exists m ≤ k such that d_m(e) > 0, then e remains carried by loop m with the same threshold.

The safety rules of unroll-and-jam prevent the exposure of a negative distance-vector entry in the outermost non-zero position for true, output and antidependences. However, the same is not true for input dependences: the order in which reads are performed does not change the semantics of the loop. Therefore, if the outermost entry of an input dependence distance vector becomes negative during unroll-and-jam, we change the direction of the dependence and negate the entries in the vector.

To demonstrate dependence copying, consider Figure 3.1. The original edge from A(I,J) to A(I,J-2) has a distance vector of ⟨2,0⟩ before unroll-and-jam. After unroll-and-jam, the edge from A(I,J) goes to A(I,J) in statement 10 and has a vector of ⟨0,0⟩. An edge from A(I,J+1) to A(I,J-2) with a vector of ⟨1,0⟩ and an edge from A(I,J+2) to A(I,J-1) with a vector of ⟨1,0⟩ have also been created. For the first edge, the first formula for d_m(e_j) applies; for the latter two edges, the second formula applies.

      DO 10 J = 1,N            |      DO 10 J = 1,N,3
        DO 10 I = 1,N          |        DO 10 I = 1,N
10        A(I,J) = A(I,J-2)    |          A(I,J)   = A(I,J-2)
                               |          A(I,J+1) = A(I,J-1)
                               | 10       A(I,J+2) = A(I,J)

    Figure 3.1  Distance Vectors Before and After Unroll-and-Jam

3.3 Improving Balance with Unroll-and-Jam

As stated earlier, we would like to convert loops that are bound by main-memory access time into loops that are balanced. Since unroll-and-jam can introduce more floating-point operations without a proportional increase in memory accesses, it has the potential to improve loop balance.

Although unroll-and-jam improves balance, using it carelessly can be counterproductive. Transformed loops that spill floating-point registers or whose object code is larger than the instruction cache may suffer performance degradation. Since the size of the object code is compiler dependent, we are forced to assume that the instruction cache is large enough to hold the unrolled loop body. By doing so, we remain relatively independent of the details of a particular software system. This assumption proved valid on the IBM RS/6000 and leaves us with the following objectives for unroll-and-jam:

1. balance a loop with a particular architecture, and

2. control register pressure.
Expressing these goals as an integer optimization problem, we have

    objective function:  min |β_L − β_M|
    constraint:          floating-point registers required ≤ register-set size

where the variables in the problem are the unroll amounts for each of the loops in a loop nest. For the solution to the objective function, we prefer a slightly negative difference over a slightly positive difference.*

*On massively parallel systems, it may be advantageous to have β_L << β_M to reduce network contention.

For each loop nest within a program, we model its possible transformation as a problem of this form; solving it gives the unroll amounts that balance the loop nest as much as possible. Using the following definitions throughout our derivation, we will show how to construct and efficiently solve a balance-optimization problem at compile time.

    V   = array references in an innermost loop
    V_r = members of V that are memory reads
    V_w = members of V that are memory writes
    X_i = number of times the i-th outermost loop in L is unrolled, plus 1
          (the number of loop bodies created by unrolling loop i)
    F   = number of floating-point operations after unroll-and-jam
    M   = number of memory references after unroll-and-jam

For the purposes of this derivation, we assume that each loop nest is perfectly nested; we show how to handle non-perfectly nested loops in Section 3.4.6. Additionally, loops with conditional control flow in the innermost loop body are not candidates for unroll-and-jam, because inserting additional control flow within an innermost loop can have disastrous effects upon performance.†

†Experiments have shown this to be true on the IBM RS/6000.

3.3.1 Computing Transformed-Loop Balance

Since the value of β_M is a compile-time constant defined by the target architecture, we need only create the formula for β_L at compile time to create the objective function. To create β_L, we compute the number of floating-point operations, F, and memory references, M, in one innermost-loop iteration. F is simply the number of floating-point operations in the original loop, f, multiplied by the number of loop bodies created by unroll-and-jam, giving

    F = f · ∏_{1 ≤ i < n} X_i

where n is the number of loops in the nest.

M requires a detailed analysis of the dependence graph. Additionally, in computing M we assume an infinite register set; the register-pressure constraint in the optimization problem ensures that any assumption about a memory reference being removed will hold. To simplify the derivation of our formulation for memory cycles, we initially assume that at most one incoming or outgoing edge is incident upon each v ∈ V and that all subscript positions contain at most one induction variable. We will remove these restrictions later. Additionally, we partition the reference set of the loop into sets that exhibit different memory behavior when unroll-and-jam is applied. These sets are listed below.

1. V^∅ = references without an incoming consistent dependence.

2. V_r^C = memory reads that have a loop-carried or loop-independent incoming consistent dependence, but are not invariant with respect to any loop.

3. V_r^I = memory reads that are invariant with respect to a loop.

4. V_w^C = memory writes that have a loop-carried or loop-independent incoming consistent dependence, but are not invariant with respect to any loop.

5. V_w^I = memory writes that are invariant with respect to a loop.
Using this partitioning, we compute the cost associated with each partition, {M^∅, M_r^C, M_r^I, M_w^C, M_w^I}, to get the total memory cost, M = M^∅ + M_r^C + M_r^I + M_w^C + M_w^I.

For the first partition, M^∅, each copy of the loop body produced by unroll-and-jam contains one memory reference associated with each original reference. This value can be expressed as follows.

    M^∅ = Σ_{v ∈ V^∅} ∏_{1 ≤ i < n} X_i

For example, in Figure 3.2, unrolling the outer two loops by 1 creates 3 additional copies of the reference to A(I,J,K), each of which is a reference to memory.

      DO 10 K = 1,N            |      DO 10 K = 1,N,2
        DO 10 J = 1,N          |        DO 10 J = 1,N,2
          DO 10 I = 1,N        |          DO 10 I = 1,N
10          A(I,J,K) = ...     |            A(I,J,K)     = ...
                               |            A(I,J,K+1)   = ...
                               |            A(I,J+1,K)   = ...
                               | 10         A(I,J+1,K+1) = ...

    Figure 3.2  V^∅ Before and After Unroll-and-Jam

For each v ∈ V_r^C, unroll-and-jam may create dependences useful in scalar replacement for some of the new references created. In order for a reference v to be scalar replaced, its incoming dependence, e_v, must be converted into an innermost-loop-carried or a loop-independent dependence, which we call an innermost dependence edge. In terms of the distance vector associated with an edge, d(e_v) = ⟨d_1, d_2, ..., d_n⟩, the edge is made innermost when it becomes d(e'_v) = ⟨0, 0, ..., 0, d_n⟩. After unroll-and-jam, only some of the references will be the sink of an innermost edge; it is our goal to quantify this value. For example, in Figure 3.3, the first copy of A(I,J-2), namely the reference A(I,J-1), is the sink of a dependence from A(I,J+3) that is not innermost, but the second and third copies, A(I,J) and A(I,J+1), are the sinks of innermost edges (in this case, ones that are loop independent) and will be removed by scalar replacement.

To quantify the behavior of V_r^C, we first compute the total number of references before scalar replacement, as in M^∅:

    Σ_{v ∈ V_r^C} ∏_{1 ≤ i < n} X_i

Next, we determine which of the new references will be removed by scalar replacement by quantifying the number of references that will be the sink of an innermost dependence after unroll-and-jam. To simplify the derivation, we consider unrolling only loop i, where for every edge e and every k ≠ i, n we have d_k(e) = 0. This limits the dependences to those carried by loop i or already innermost. Later, we extend the formulation to arbitrary loop nests and distance vectors.

Recall from our dependence-copying formulas in Section 3.2 that d_i(e_v) < X_i, for v ∈ V_r^C, must hold in order to create a distance vector with a 0 in the i-th position, resulting in an innermost dependence edge. However, depending upon which of the two distance formulas is applied, the edge may still have a 1 in the i-th position. In order for the formula that allows an entry to become 0,

    d_i(e'_v) = ⌊d_i(e_v)/X_i⌋,

to be applied, the edge created by unroll-and-jam, e'_v = ⟨w_m, v_n⟩, must have n ≥ m. If n < m, then the applicable formula,

    d_i(e'_v) = ⌊d_i(e_v)/X_i⌋ + 1,

prevents the entry from becoming 0. In the following theorem, we show that n ≥ m is true for X_i − d_i(e_v) of the new dependence edges if d_i(e_v) < X_i. This shows that X_i − d_i(e_v) new references can be removed by scalar replacement.

Theorem 3.2  Given 0 ≤ m, d_i(e) < X_i and n = (m + d_i(e)) mod X_i for each new edge e'_v = ⟨w_m, v_n⟩ created from e_v = ⟨w_0, v_0⟩ by unroll-and-jam,

    n ≥ m  ⟺  m < X_i − d_i(e).

Proof  (⇒) Given n ≥ m, assume m ≥ X_i − d_i(e).
For each $v \in V_r^C$, unroll-and-jam may create dependences useful in scalar replacement for some of the new references created. In order for a reference, v, to be scalar replaced, its incoming dependence, $e_v$, must be converted into an innermost-loop-carried or a loop-independent dependence, which we call an innermost dependence edge. In terms of the distance vector associated with an edge, $d(e_v) = \langle d_1, d_2, \ldots, d_n \rangle$, the edge is made innermost when it becomes $d(e'_v) = \langle 0, 0, \ldots, 0, d_n \rangle$. After unroll-and-jam, only some of the references will be the sink of an innermost edge. It is our goal to quantify this value. For example, in Figure 3.3, the first copy of A(I,J-2), the reference to A(I,J-1), is the sink of a dependence from A(I,J+3) that is not innermost, but the second and third copies, A(I,J) and A(I,J+1), are the sinks of innermost edges (in this case, ones that are loop independent) and will be removed by scalar replacement.

      DO 10 J = 1,N
        DO 10 I = 1,N
10      A(I,J) = A(I,J-2)

      DO 10 J = 1,N,4
        DO 10 I = 1,N
          A(I,J)   = A(I,J-2)
          A(I,J+1) = A(I,J-1)
          A(I,J+2) = A(I,J)
10        A(I,J+3) = A(I,J+1)

Figure 3.3  $V_r^C$ Before and After Unroll-and-Jam

To quantify the behavior of $V_r^C$, we first compute the total number of references before scalar replacement, as in $M_\emptyset$:

$$\sum_{v \in V_r^C} \Big( \prod_{1 \le i < n} X_i \Big).$$

Next, we determine which of the new references will be removed by scalar replacement by quantifying the number of references that will be the sink of an innermost dependence after unroll-and-jam. To simplify the derivation, we consider unrolling only loop i, where $\forall e$ and $\forall k \ne i, n$ we have $d_k(e) = 0$. This limits the dependences to be carried by loop i or to be innermost already. Later, we will extend the formulation to arbitrary loop nests and distance vectors.

Recall from our dependence-copying formulas in Section 3.2 that $d_i(e_v) < X_i$, for some $v \in V_r^C$, must be true in order to create a distance vector with a 0 in the ith position, resulting in an innermost dependence edge. However, depending upon which of the two distance formulas is applied, the edge may still have a 1 in the ith position. In order for the formula that allows an entry to become 0,

$$d_i(e'_v) = \Big\lfloor \frac{d_i(e_v)}{X_i} \Big\rfloor,$$

to be applied, the edge created by unroll-and-jam, $e'_v = \langle w_m, v_n \rangle$, must have $n \ge m$. If $n < m$, then the applicable formula,

$$d_i(e'_v) = \Big\lfloor \frac{d_i(e_v)}{X_i} \Big\rfloor + 1,$$

prevents an entry from becoming 0. In the following theorem, we show that $n \ge m$ is true for $X_i - d_i(e_v)$ of the new dependence edges if $d_i(e_v) < X_i$. This shows that $X_i - d_i(e_v)$ new references can be removed by scalar replacement.

Theorem 3.2  Given $0 \le m, d_i(e) < X_i$ and $n = (m + d_i(e)) \bmod X_i$ for each new edge $e'_v = \langle w_m, v_n \rangle$ created from $e_v = \langle w_0, v_0 \rangle$ by unroll-and-jam, then $n \ge m \iff m < X_i - d_i(e)$.

Proof  ($\Rightarrow$) Given $n \ge m$, assume $m \ge X_i - d_i(e)$. Since $0 \le m, d_i(e) < X_i$ we have $n = (m + d_i(e)) \bmod X_i = m + d_i(e) - X_i$, giving $m + d_i(e) - X_i \ge m$. Since $d_i(e) < X_i$, the inequality is violated, leaving $m < X_i - d_i(e)$.
($\Leftarrow$) Given $m < X_i - d_i(e)$ and $m, d_i(e) \ge 0$, $n = (m + d_i(e)) \bmod X_i = m + d_i(e)$. Since $d_i(e) \ge 0$ we have $m + d_i(e) \ge m$, giving $n \ge m$. $\Box$

In the case where $X_i \le d_i(e_v)$, no edge can be made innermost, resulting in a general formula of $(X_i - d_i(e_v))^+$ edges with $d_i(e'_v) = 0$, where $x^+$ is the positive part of x. For an arbitrary distance vector, if any outer entry is left greater than 0 after unroll-and-jam, the new edge will not be innermost. Additionally, once a dependence edge becomes innermost, unrolling any loop additionally creates a copy of that edge. For example, in Figure 3.3, the load from A(I,J+1) has an incoming innermost edge that is a copy of the incoming edge for the load from A(I,J). The edge into A(I,J+1) was created by additional unrolling of the J-loop after A(I,J)'s edge was created. Quantifying these ideas, we have

$$\prod_{1 \le i < n} (X_i - d_i(e_v))^+$$

references that can be scalar replaced. Subtracting this value from the total number of references before scalar replacement and after unroll-and-jam gives the value for $M_r^C$. Therefore,

$$M_r^C = \sum_{v \in V_r^C} \Big( \prod_{1 \le i < n} X_i - \prod_{1 \le i < n} (X_i - d_i(e_v))^+ \Big).$$

In Figure 3.3, $X_1 = 4$ and $d_1(e) = 2$ for the reference to A(I,J-2). Our formula correctly predicts that 2 memory references will be removed. Note that if $X_1 \le 2$ were true, no memory references would have been removed because no distance vector would have a 0 in the first entry.

For references that are invariant with respect to a loop, memory behavior is different from that of variant references. If $v \in V_r^I$ is invariant with respect to loop $L_i$, $i < n$, unrolling $L_i$ will not introduce more references to memory because each unrolled copy of v will access the same memory location as v. This is because the induction variable for $L_i$ does not appear in the subscript expression of v. Unrolling any loop that is not v-invariant will introduce memory references. Finally, if v is invariant with respect to the innermost loop, no memory references will be left after scalar replacement no matter which outer loops are unrolled. In Figure 3.4, unrolling the K-loop introduces memory references, while unrolling the J-loop does not. Using these properties, we have

$$M_r^I = \sum_{v \in V_r^I} \Big( \prod_{1 \le i < n} \omega(e_v, i, n) \Big)$$

where

ω(e, i, n):
  if e is invariant wrt loop n then return 0
  else if e is invariant wrt loop i then return 1
  else return X_i

      DO 10 K = 1,N
        DO 10 J = 1,N
          DO 10 I = 1,N
10        ... = A(I,K)

      DO 10 K = 1,N,2
        DO 10 J = 1,N,2
          DO 10 I = 1,N
            ... = A(I,K)
            ... = A(I,K+1)
            ... = A(I,K)
10          ... = A(I,K+1)

Figure 3.4  $V_r^I$ Before and After Unroll-and-Jam

Members of $V_w^C$ behave exactly as those in $V_r^C$ except that scalar replacement will remove stores only when their incoming dependence is loop independent. Thus, for $M_w^C$ we have

$$M_w^C = \sum_{v \in V_w^C} \gamma(v, 1, n, n)$$

where

γ(v, j, k, n):
  if d_n(e_v) = 0 and e_v is an output edge then
    return $\prod_{j \le i < k} X_i - \prod_{j \le i < k} (X_i - d_i(e_v))^+$
  else
    return $\prod_{j \le i < k} X_i$

Finally, $M_w^I$ is defined exactly as $M_r^I$.
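To see how ω behaves, consider the reference A(I,K) in Figure 3.4, which is invariant with respect to the J-loop. For the K entry, ω returns $X_K$ (the reference varies with K); for the J entry it returns 1; and were the reference invariant with respect to the innermost I-loop, it would return 0 and the reference would disappear entirely after scalar replacement. With $X_K = X_J = 2$ as in the figure,

$$M_r^I = X_K \cdot 1 = 2,$$

matching the two distinct loads, A(I,K) and A(I,K+1), in the unrolled loop.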
Now that we have obtained the formulation for loop balance in relation to unroll-and-jam, we can use it to prove that unroll-and-jam will never increase loop balance. Consider the following formula for $\beta_L$ when unrolling only loop i.

$$\beta_L = \frac{\sum_{v \in V_\emptyset} X_i + \sum_{v \in V_r^C} \big( X_i - (X_i - d_i(e_v))^+ \big) + \sum_{v \in V_r^I \cup V_w^I} \omega(e_v, i, n) + \sum_{v \in V_w^C} \gamma(v, i, i+1, n)}{f X_i}$$

If we examine the vertices and edges in each of the above vertex sets to determine the simplified formula for $\beta_L$, we obtain the following, where $\forall j, a_j \ge 0$.

$M_\emptyset = a_0 X_i$
$M_r^C = a_1 X_i + a_2$
$M_r^I = a_3 X_i + a_4$
$M_w^C = a_5 X_i + a_6$
$M_w^I = a_7 X_i + a_8$

Combining these terms we get

$$\beta_L = \frac{\alpha_0 X_i + \alpha_1}{f X_i}$$

where $\alpha_0, \alpha_1 \ge 0$. Note that the values of $\alpha_0$ and $\alpha_1$ may vary with the value of $X_i$. As $X_i$ increases, $\alpha_0$ may decrease and $\alpha_1$ may increase because the value of $(X_i - d_i(e_v))^+$ may become positive for some $v \in V_r^C$. In the theorem below, we show that $\beta_L$ will not increase as $X_i$ increases, showing that unrolling one loop will not increase loop balance.

Theorem 3.3  The function for $\beta_L$ is nonincreasing in the i dimension.

Proof  For the original loop, $X_i = 1$ and

$$\beta_L = \frac{\alpha_0 + \alpha_1}{f}.$$

If we unroll-and-jam the i-loop $n-1$ times, $n > 1$, then $X_i = n$, giving

$$\beta'_L = \frac{\alpha'_0 n + \alpha_1 + \alpha_2}{f n}$$

where $\alpha_2$ is the sum of all $d_i(e_v)$ where $(X_i - d_i(e_v))^+$ became positive with the increase in $X_i$. Since $\alpha_0$ will decrease or remain the same with the change in $X_i$, $\alpha'_0 \le \alpha_0$. The difference between $\alpha_0$ and $\alpha'_0$ is the number of edges where $(X_i - d_i(e_v))^+$ becomes positive with the change in $X_i$. Given that $X_i = n$, the maximum value of any $d_i(e_v)$ added into $\alpha_2$ is $n-1$ (any greater value would not make $(X_i - d_i(e_v))^+$ positive). Therefore, the maximum value of $\beta'_L$ is

$$\beta'_L = \frac{\alpha'_0 n + \alpha_1 + (\alpha_0 - \alpha'_0)(n-1)}{f n}.$$

To show that $\beta_L$ is nonincreasing, we must show that $\beta_L \ge \beta'_L$, or that the unrolled loop's balance is no greater than the original balance.

$$n(\alpha_0 + \alpha_1) \ge \alpha'_0 n + \alpha_1 + (\alpha_0 - \alpha'_0)(n-1)$$
$$n\alpha_0 + n\alpha_1 \ge n\alpha'_0 + \alpha_1 + n\alpha_0 - \alpha_0 - n\alpha'_0 + \alpha'_0$$
$$n\alpha_1 \ge \alpha_1 - \alpha_0 + \alpha'_0$$
$$(n-1)\alpha_1 + \alpha_0 - \alpha'_0 \ge 0$$

Since $n > 1$, $\alpha_1 \ge 0$ and $\alpha'_0 \le \alpha_0$, the inequality holds and $\beta_L$ is nonincreasing. $\Box$

Given that we unroll in an outermost- to innermost-loop order, unrolling loop 1 produces a new loop with balance $\beta_{L_1}$ such that $\beta_{L_1} \le \beta_L$ by Theorem 3.3. After unrolling loop i, assume we have $\forall j < i, \beta_{L_i} \le \beta_{L_j}$. If we next unroll loop $i+1$ to produce $\beta_{L_{i+1}}$, then again by Theorem 3.3, $\beta_{L_{i+1}} \le \beta_{L_i}$. Therefore, unrolling multiple loops will not increase loop balance.

An Example.  As an example of a formula for loop balance, consider the following loop.

      DO 10 J = 1,N
        DO 10 I = 1,N
10      A(I,J) = A(I,J-1) + B(I)

We have A(I,J) $\in V_\emptyset$, A(I,J-1) $\in V_r^C$ with an incoming distance vector of $\langle 1, 0 \rangle$, and B(I) $\in V_r^I$. This gives

$$\beta_L = \frac{X_1 + \big( X_1 - (X_1 - 1)^+ \big) + 1}{X_1}.$$
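Evaluating this expression for a few unroll amounts illustrates Theorem 3.3 numerically: at $X_1 = 1$ (no unrolling), $\beta_L = (1 + 1 + 1)/1 = 3$; at $X_1 = 2$, $\beta_L = (2 + 1 + 1)/2 = 2$; and at $X_1 = 4$, $\beta_L = (4 + 1 + 1)/4 = 1.5$. The balance falls monotonically toward 1 as $X_1$ grows, since only one copy of A(I,J-1) and one access to B(I) remain references to memory no matter how far the J-loop is unrolled.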
3.3.2 Estimating Register Pressure

Now that we have the formula for computing loop balance given a set of unroll amounts, we must create the formula for register pressure. To compute the number of registers required by an unrolled loop, we need to determine which references can be removed by scalar replacement and how many registers each of those references needs. We will refer to this quantity as R. As in computing M, we consider the partitioned sets of V, $\{V_\emptyset, V_r^C, V_r^I\}$, but we do not consider $V_w$ because its register requirements are included in the computation of $V_r$. Associated with each of the partitions of V is the number of registers required by that set, $\{R_\emptyset, R_r^C, R_r^I\}$, giving $R = R_\emptyset + R_r^C + R_r^I$.

Since each member of $V_\emptyset$ has no incoming dependence, we cannot use registers to capture reuse. However, members of $V_\emptyset$ still require a register to hold a value during expression evaluation. To compute the number of registers required for those memory references not scalar replaced and for those temporaries needed during the evaluation of expressions, we use the tree-labeling technique of Sethi and Ullman [SU70]. Using this technique, the number of registers required for each expression in the original loop body is computed, and the maximum over all expressions is taken as the value for $R_\emptyset$.

For variant references whose incoming edge is carried by some loop, $v \in V_r^C$, we need to determine which (or how many) references will be removed by scalar replacement after unroll-and-jam and how many registers are required for those references. The former value is derived in the computation of $M_r^C$ and is shown below.

$$\sum_{v \in V_r^C} \Big( \prod_{1 \le i < n} (X_i - d_i(e_v))^+ \Big)$$

The latter value, $d_n(e_v) + 1$, comes from scalar replacement, where we assume that all registers interfere. Any value that flows across a loop-carried dependence interferes with all other values, and predicting the effects of scheduling before optimization to handle loop-independent interference is beyond the scope of this work. Therefore,

$$R_r^C = \sum_{v \in V_r^C} \Big( \prod_{1 \le i < n} (X_i - d_i(e_v))^+ \Big) \cdot (d_n(e_v) + 1).$$

References that are invariant with respect to an outer loop require registers only if a corresponding reference-invariant loop is unrolled. However, references that are invariant with respect to the innermost loop always require registers. For each $v \in V_r^I$, unrolling loops that are not v-invariant creates references that possibly require additional registers. In Figure 3.4, the reference to A(I,K) requires no registers unless the J-loop is unrolled, and since K is unrolled once, two registers are required. Formulating this behavior, we have

$$R_r^I = \sum_{v \in V_r^I} \Big( \prod_{1 \le i < n} \pi(e_v, i, n) \Big)$$

where

π(e, i, n):
  if e is invariant wrt loop i then
    if ($\exists X_j \mid X_j > 1$ and e is invariant wrt loop j) or (e is invariant wrt loop n)
      then return 1
      else return 0
  else return X_i

An Example.  As an example of a formula for computing register pressure, consider the following loop.

      DO 10 J = 1,N
        DO 10 I = 1,N
10      A(I,J) = A(I,J-1) + B(J)

We have A(I,J) $\in V_\emptyset$, A(I,J-1) $\in V_r^C$ with an incoming distance vector of $\langle 1, 0 \rangle$, and B(J) $\in V_r^I$. This gives

$$R = 1 + (X_1 - 1)^+ + X_1.$$
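To make the count concrete, here is a sketch of this loop after unroll-and-jam of the J-loop by 1 ($X_1 = 2$) and scalar replacement; the scalar names B0, B1 and A0 are ours, chosen for illustration, and a cleanup loop for odd N is omitted:

      DO 10 J = 1,N,2
C       B(J) and B(J+1) are invariant with respect to I: two registers
        B0 = B(J)
        B1 = B(J+1)
        DO 10 I = 1,N
C         the value stored into A(I,J) flows to the second statement
C         through a register rather than through memory
          A0 = A(I,J-1) + B0
          A(I,J) = A0
10      A(I,J+1) = A0 + B1

Counting registers: B0 and B1 account for the term $X_1 = 2$, A0 for $(X_1 - 1)^+ = 1$, and one more register is needed as an expression temporary (the $R_\emptyset = 1$ term), for a total of R = 4.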
3.4 Applying Unroll-and-Jam in a Compiler

In this section, we will discuss how to take the previous optimization problem and solve it practically for real program loop nests. In general, integer-programming problems are NP-complete, but we can make simplifying assumptions for our optimization problem that will make its solution tractable. First, we show how to choose the loops to which unroll-and-jam will be applied, and then we show how to choose the unroll amounts for those loops.

3.4.1 Picking Loops to Unroll

Applying unroll-and-jam to all of the loops in an arbitrary loop nest can make the solution to our optimization problem extremely complex. To simplify the solution procedure, we will only apply unroll-and-jam to a subset of the loop nest. Since experience suggests that most loop nests have a nesting depth of 3 or less, we can limit unroll-and-jam to 1 or 2 loops in a given nest. Our heuristic for determining the subset of loops for unroll-and-jam is to pick those loops whose unrolling is likely to result in the fewest memory references in the unrolled loop. In particular, we unroll the loops that carry the most dependences that will be innermost after unroll-and-jam is applied. To find these loops, we examine each distance vector to determine if it can be made innermost for each pair of loops in the nest. Given a distance vector $d(e) = \langle d_1, \ldots, d_i, \ldots, d_j, \ldots, d_n \rangle$ where $d_i, d_j \ge 0$, unrolling loops i and j can make e innermost if $\forall k, k \ne i, j, n$ we have $d_k(e) = 0$. If we unroll only one loop, i, then e can be made innermost if $\forall k, k \ne i, n$ we have $d_k(e) = 0$. The algorithm for choosing loops is shown in Figure 3.5.

procedure PickLoops(V)
  Input: V = array references in a loop
  for each v ∈ V
    if v is a memory read or d_n(e_v) = 0 then
      for i = 1 to n-1
        for each j such that ∀k, k ≠ i, j, n we have d_k(e) = 0
          count(i,j)++
  if unrolling two loops
    unroll loops i,j such that count(i,j) is the maximum
  else
    unroll loop i such that count(i,i) is the maximum
end

Figure 3.5  PickLoops

3.4.2 Picking Unroll Amounts

In the following discussion of our problem solution, we consider unrolling only one loop to bring clarity to the solution procedure. Later, we will discuss how to extend the solution to the problem for two loops. Given this restriction, we can reduce the optimization formula to the following, where $\delta = (X_i - d_i(e))^+$ and $\rho$ is the effective size of the floating-point register set for the target architecture.

$$\min \Big| \beta_M - \frac{\sum_{v \in V_\emptyset} X_i + \sum_{v \in V_r^C} (X_i - \delta) + \sum_{v \in V_r^I \cup V_w^I} \omega(e_v, i, n) + \sum_{v \in V_w^C} \gamma(v, i, i+1, n)}{f X_i} \Big|$$

subject to

$$R_\emptyset + \sum_{v \in V_r^C} \delta \cdot (d_n(e_v) + 1) + \sum_{v \in V_r^I} \pi(e_v, i, n) \le \rho, \qquad X_i \ge 1$$

We begin the solution procedure by examining the vertices and edges of the dependence graph to accumulate the coefficient for $X_i$ and any constant value, as was shown earlier. Essentially, we solve the summations to get a linear function of $X_i$. However, the $+$-function requires us to keep an array of coefficients and constants, one set for each value of $d_i$, that will be used depending upon the value of $X_i$. In practice, most distances are 0, 1 or 2, allowing us to keep 4 sets of values: one for each common distance and one for the remaining distances. If $d_i > 2$ is true for some edge, our results may be a bit imprecise if $3 \le X_i < d_i$. However, experimentation discovered no instances of this situation.

Given a set of coefficients, we can search the solution space, while checking register pressure, to determine the best unroll factor. Since most dependence distances are 0, 1 or 2, unrolling more than $\rho$ times will most likely increase the register pressure beyond the physical limits of the target machine. Therefore, by Theorem 3.3, we can find the solution to the optimization problem in the solution space in $O(\log \rho)$ time.

To extend unroll-and-jam to two loops, we can create a simplified problem from the original by setting n = 2. A formula for two particular loops, i and j, can be constructed by examining the dependence distance vectors as in the case for one loop. By Theorem 3.3, if we hold either $X_i$ or $X_j$ constant, the function for $\beta_L$ is nonincreasing, which gives a two-dimensional solution space with each row and column sorted in decreasing order, lending itself to an $O(\rho)$ intelligent search. In general, a solution for unroll-and-jam of k loops, k > 1, can be found in $O(\rho^{k-1})$ time.
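To illustrate the solution procedure on the register-pressure example of Section 3.3.2, assume a purely hypothetical machine with $\beta_M = 1$ and an effective register-set size of $\rho = 8$. For that loop, the formulas above give $\beta_L = (X_1 + 1)/X_1$ (B(J) is innermost-invariant and disappears from the memory cost) and $R = 1 + (X_1 - 1)^+ + X_1$. Since $\beta_L > \beta_M$ for every $X_1$, the search takes the largest $X_1$ satisfying the constraint: $X_1 = 4$ gives $R = 1 + 3 + 4 = 8 \le \rho$, while $X_1 = 5$ would require 10 registers. The chosen unroll amount is therefore 3 (four loop bodies), with $\beta_L = 1.25$.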
3.4.3 Removing Interlock

After unroll-and-jam, a loop may contain pipeline interlock, leaving idle computational cycles. To remove these cycles, we use the technique of Callahan et al. to estimate the amount of interlock and then unroll one loop to create more parallelism by introducing enough copies of the inner-loop recurrence [CCK88]. Essentially, the length of the longest recurrence carried by the innermost loop is calculated, and then unroll-and-jam is applied to an outer loop until there are more floating-point operations than the recurrence length.

3.4.4 Multiple Edges

In general, we cannot assume that there will be only one edge entering or leaving each node in the dependence graph, since multiple incident edges are possible and likely. One possible solution is to treat each edge separately as if it were the only incident edge. Unfortunately, this is inadequate because many edges may correspond to the flow of the same set of values. In the loop,

      DO 10 J = 1,N
        DO 10 I = 1,N
10      B(I,J) = A(I-1,J) + A(I-1,J) + A(I,J)

there is an input dependence from A(I,J) to each A(I-1,J) with distance vector $\langle 0, 1 \rangle$ and a loop-independent dependence between the A(I-1,J)'s. Considering each edge separately would give a register-pressure estimation of 5 registers per unrolled iteration of J rather than the actual value of 2. Both values provided come from the same source, requiring us to use the same registers. Although the original estimation is conservatively safe, it is more conservative than necessary. To improve our estimation, we consider register sharing.
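The two-register figure can be seen in a sketch of the scalar-replaced inner loop (the names A0 and A1 are ours, and the A(0,J) initialization simply loads the value read by the first iteration):

      DO 10 J = 1,N
        A0 = A(0,J)
        DO 10 I = 1,N
C         one register carries A(I-1,J); one holds the new A(I,J)
          A1 = A(I,J)
          B(I,J) = A0 + A0 + A1
10      A0 = A1

Both reads of A(I-1,J) are satisfied by the single register A0, so counting their incoming edges separately would double-count the registers they share.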
To relate all references that access the same values, our scalar replacement algorithm considers the oldest reference in a dependence chain as the reference that provides the value for the entire chain. Using this technique alone to capture sharing within the context of unroll-and-jam, however, can cause us to miss registers that are used. Consider the following example.

      DO 10 I = 1,N
        DO 10 J = 1,N
10      C(I,J) = A(I+1,J) + A(I,J) + A(I,J)

The second reference to A(I,J) has two incoming input dependences: one with a distance vector of $\langle 1, 0 \rangle$ from A(I+1,J) and one with a distance vector of $\langle 0, 0 \rangle$ from A(I,J). After unrolling the I-loop by 1 we have

      DO 10 I = 1,N,2
        DO 10 J = 1,N
1       C(I,J) = A(I+1,J) + A(I,J) + A(I,J)
10      C(I+1,J) = A(I+2,J) + A(I+1,J) + A(I+1,J)

Now, the reference A(I+1,J) in statement 1 provides the value used in statement 10 by both references to A(I+1,J), and the first reference to A(I,J) in statement 1 provides the value used by the second reference to A(I,J) in statement 1. Here, we cannot isolate which of the two references to A will ultimately provide the value for the second A(I,J) before unrolling because they both provide the value at different points. Therefore, we cannot pick just the edge from the oldest reference, A(I+1,J), to calculate its register pressure. Instead, we can use a stepwise approximation where each possible oldest reference within a dependence chain is considered.

The first step is to partition the dependence graph so that all references using the same registers after unrolling will be put in the same partition. The algorithm in Figure 3.6 accomplishes this task.

procedure PartitionNodes(G, j, k)
  Input: G = (V,E), the dependence graph
         j, k = loops to be unrolled
  put each v ∈ V in its own partition, P(v)
  while ∃ an unmarked node do
    let v be an arbitrary unmarked node
    mark v
    forall w ∈ V such that ((∃ e = (w,v) ∨ e = (v,w)) ∧
        (e is input or true and is consistent) ∧
        (for i = 1 ... n-1, i ≠ j, i ≠ k, d_i(e) = 0)) do
      mark w
      merge P(w) and P(v)
    enddo
  enddo
end

Figure 3.6  PartitionNodes

Once we have the references partitioned, we apply the algorithm in Figure 3.7 to calculate the coefficients used in R. First, in the procedure OldestValue, we compute the node in each register partition that is first to reference the value that flows throughout the partition. We look for the reference, v, that has no incoming dependence from another member of its partition or has only loop-carried incoming dependences that are carried by v-invariant loops. We consider only those edges that can create reuse in the unrolled loop, as described in Section 3.4.1. After unroll-and-jam, the oldest reference will provide the value for scalar replacement for the entire partition and is called the generator of the partition.

Next, in procedure SummarizeEdges, we determine the distance vector that will encompass all of the outgoing edges (one to each node in the partition) from the generator of each partition. Because we want to be conservative, for $i = 1, \ldots, n-1$ we pick the minimum $d_i$ over all of the generator's outgoing edges. This guarantees that the innermost-loop reuse within a partition will be measured as soon as possible in the unrolled loop body. For $d_n$, we pick the maximum value to make sure that we do not underestimate the number of registers used. After computing a summarized distance vector, we use it to accumulate the coefficients for the equation for register pressure.

At this point in the algorithm, we have created a distance vector and formula to estimate the register requirements of a partition partially, given a set of unroll values. In addition, we must account for the intermediate points where the generator of a partition does not provide the reuse at the innermost loop level for some of the references within the partition, as shown in the previous example (i.e., A(I-1,J) and A(I,J)). We will underestimate register requirements if we fail to handle these situations. First, using procedure Prune, we remove all nodes from a partition for which there are no intermediate points where a reference other than the current generator provides the values for scalar replacement. Any reference that occurs on the same iteration of loops 1 to n-1 as the current generator of the partition will have no such points; its value will always be provided by the current generator. Next, we compute the additional number of registers needed by the remaining nodes in the partition by calculating the coefficients of R for an intermediate generator of the modified partition. The only difference from the computation for the original generator is that we modify the coefficients of R derived from the summarized distance vector of the modified partition by accounting for the registers that we have already counted with the previous generator. This is done by computing the coefficients for R as if an intermediate generator were the generator in an original partition. Then, we subtract from R the number of references that have already been handled by the previous generator. We use the distance vector of the edge, ê, from the previous generator, v, to the new intermediate generator, u, to compute the latter value. Given $d^s$ for the current summarized distance vector and $d(\hat{e})$ for the edge between the current and previous generators, we compute $R_{d^s} - R_{(d^s + d(\hat{e}))}$. The value $d^s + d(\hat{e})$ would represent the summarized distance vector if the previous generator were the generator of the modified partition. In our example, $d^s = \langle 0, 0 \rangle$ and $d(\hat{e}) = \langle 1, 0 \rangle$, giving $(X_1 - 0)^+ - (X_1 - 1)^+$ additional references removed. These steps for intermediate points are repeated until the partition has 1 or fewer members.

With the allowance of multiple edges, references may be invariant with respect to only one of the unrolled loops and still have a dependence carried by the other unrolled loop. In this case, we treat the reference as a member of $V^I$ with respect to the first loop and $V^C$ with respect to the second.
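Spelling out the bookkeeping for this example: the generator A(I+1,J) has outgoing edges of distance $\langle 1, 0 \rangle$ to each A(I,J), so its summarized vector is $d^s = \langle 1, 0 \rangle$ and it contributes $(X_1 - 1)^+ \cdot (d_n + 1) = (X_1 - 1)^+$ registers. After pruning, the first A(I,J) becomes the intermediate generator with $d^s = \langle 0, 0 \rangle$, and the correction term $R_{\langle 0,0 \rangle} - R_{\langle 1,0 \rangle} = X_1 - (X_1 - 1)^+ = 1$ adds the one register that the generator-only view would miss.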
Partitioning is not necessary to handle multiple edges for M because sharing does not occur. Instead, we consider each reference separately, using a summarized vector for each $v \in V$ that consists of the minimum $d_i$ over all incoming edges of v for $i = 1, \ldots, n-1$. Although this is a slightly different approximation than that used for R, it is more accurate and will result in a better estimate of $\beta_L$.

procedure ComputeR(P, G, R)
  Input: P = partition of memory references
         G = (V,E), the dependence graph
         R = register-pressure formula
  foreach p ∈ P do
    v = OldestValue(p)
    d = SummarizeEdges(v)
    update R using d
    Prune(p,v)
    while size(p) > 1 do
      u = OldestValue(p)
      d = SummarizeEdges(u)
      update R using R_d - R_{d+d(ê)}, ê = (v,u)
      Prune(p,u)
      v = u
    enddo
  enddo
end

procedure OldestValue(p)
  foreach v ∈ p
    if ∀ e = (u,v), (v is invariant wrt the loop at level(e)) ∨ (P(u) ≠ P(v)) then
      return v
end

procedure SummarizeEdges(v)
  for i = 1 to n-1
    foreach e = (v,w) ∈ E, w a memory read
      ds_i = min(ds_i, d_i(e))
  foreach e = (v,w) ∈ E, w a memory read
    ds_n = max(ds_n, d_n(e))
  return ds
end

procedure Prune(p,v)
  remove v from p
  foreach e = (v,w) ∈ E
    if d(e) = (=, =, ..., =, *) then remove w
end

Figure 3.7  Compute Coefficients for Optimization Problem

3.4.5 Multiple-Induction-Variable Subscripts

In Section 3.2, we discussed a method for updating the dependence graph for references with one induction variable in each subscript position. Unfortunately, this method does not work on subscripts containing multiple induction variables, called MIV subscripts, that have incoming consistent dependences [GKT91]. Consider the following loop.

      DO 10 I = 1,N
        DO 10 J = 1,N
10      A(I) = A(I) + B(I-J)

The load of B(I-J) has a loop-independent incoming consistent dependence from itself carried by the I-loop. After unroll-and-jam of the I-loop by a factor of 2, we have

      DO 10 I = 1,N,3
        DO 10 J = 1,N
          A(I) = A(I) + B(I-J)
          A(I) = A(I) + B(I-J+1)
10        A(I) = A(I) + B(I-J+2)

Here, there is an input dependence carried by the innermost loop from B(I-J+2) to both B(I-J+1) and B(I-J). It was not possible for new references that contained only single-induction-variable subscripts, called SIV subscripts, to have multiple edges that have the same source and that arose from one original edge. In the above example, even though the I-loop step value has increased, the presence of J in the subscript removes its effect for the execution of the innermost loop. The formula presented in Section 3.2 is invalid in this case, since the change in loop step does not have the same effect on dependence distance and location. In general, we would need to re-apply dependence analysis to MIV subscripts to get the full and correct dependence graph. This is because of the interaction between multiple MIV subscripts. In addition, it would be very difficult to compute M and R given the behavior of such subscripts. Therefore, we restrict the type of MIV subscripts that we handle to a small, predictable subset, $V^{MIV}$. In our experimentation, the only MIV subscripts that we found fit the following description of $V^{MIV}$. In order for a reference to be classified as MIV in our system, it must have the following properties:

1. The reference is incident only upon consistent dependences.

2. The reference contains only one MIV subscript position.

3. The inner-loop induction variable does not appear in the subscript, or it appears only in the MIV subscript position.

4. At most one unrolled-loop induction variable is in the MIV subscript position.

5. The coefficients of the induction variables in the MIV subscript position are 1 for an unrolled loop and 1 or -1 for the innermost loop.
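For illustration (the particular subscripts here are our own): the reference B(I-J) above satisfies all five properties when the I-loop is unrolled, since its single MIV position has coefficient 1 on I and -1 on J. A reference such as B(2*I-J), by contrast, would violate the fifth property and, like any reference failing these criteria, would be classified in $V_\emptyset$.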
If an MIV reference is invariant with respect to the innermost loop, it is treated as a member of $V_r^I$ or $V_w^I$. If a loop is unrolled and its induction variable is not in the MIV subscript position, we use the classification of the reference in relation to the unrolled-loop induction variable when unroll-and-jam is applied (e.g., B(I-J) would be treated as a member of $V_r^I$ with respect to a K-loop). Any reference containing multiple-induction-variable subscripts that does not fit the above criteria is classified as a member of $V_\emptyset$.

To update the dependence graph for $V^{MIV}$, we restrict ourselves to the innermost loop to satisfy the needs of scalar replacement only. Essentially, we will apply the strong SIV test for the innermost-loop induction variable on each MIV reference pair [GKT91]. Given the references $A(a_n I_n + a_0)$ and $A(b_n I_n + b_0)$, where $a_n = b_n$, there is a dependence between the two references with a consistent threshold d if

$$d = \frac{a_0 - b_0}{a_n}$$

is an integer. Since we restrict $a_n$ and $b_n$ to be 1 or -1, there will always be a dependence.
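For instance, applying the test to B(I-J) and B(I-J+1) with J as the innermost induction variable: both subscripts have $a_n = b_n = -1$, with constant parts $a_0 = I$ and $b_0 = I + 1$, so

$$d = \frac{I - (I+1)}{-1} = 1,$$

an integer, confirming a consistent innermost dependence with threshold 1 between the two references.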
To compute $M^{MIV}$ and $R^{MIV}$, we must recognize that the effect of unroll-and-jam on MIV subscripts is to increase the distance from the first to the last access of a value by the unroll amount of the loop. Consider the following loop.

      DO 10 I = 1,N
        DO 10 J = 1,N
10      ... = B(I-J) + B(I-J+1)

Unroll-and-jam of I by 1 produces

      DO 10 I = 1,N,2
        DO 10 J = 1,N
          ... = B(I-J)   + B(I-J+1)
10        ... = B(I-J+1) + B(I-J+2)

In the original loop, the dependence distance for B's partition was 1. After unroll-and-jam by 1, the distance is 2. This example shows that each $v \in V_r^{MIV}$ requires $X_i + d_n(e)$ registers, assuming a step value of 1 for loop i, when unrolling the loop whose induction variable is in the MIV subscript position. In the loop,

      DO 10 I = 1,N
        DO 10 J = 1,N
10      ... = B(I-J)

we must unroll the I-loop at least once to get a dependence carried by the innermost loop. To quantify this, we have

$$R_r^{MIV} = \sum_{v \in V_r^{MIV}} pos(X_i - d_i(e_v)) \cdot (X_i + d_n(e_v))$$

where $d_n(e_v)$ is the maximum distance in a partition and pos(x) returns 1 if x > 0 and 0 otherwise. Here, $pos(X_i - d_i(e_v))$ ensures that a dependence will be carried by the innermost loop after unroll-and-jam.

Only the MIV reference that is first to access a value will be left as a reference to memory after unroll-and-jam and scalar replacement. This is because each later reference introduced by unroll-and-jam will be the sink of an innermost dependence, as shown in the examples. Therefore,

$$M_r^{MIV} = \sum_{v \in V_r^{MIV}} 1.$$

For each $v \in V_w^{MIV}$, unrolling at least $d_n(e_v)$ times will create an identical subscript expression, resulting in an incoming loop-independent dependence. Therefore,

$$M_w^{MIV} = \sum_{v \in V_w^{MIV}} \min(X_i, d_n(e_v)),$$

where $d_n(e_v)$ is the minimum over all of v's incoming output dependence edges.
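Checking the register formula against the two-reference example above: B's partition has maximum innermost distance $d_n = 1$, so with the I-loop unrolled by 1 ($X_1 = 2$) we get $pos(2 - 0) \cdot (2 + 1) = 3$ registers, one for each of the distinct values B(I-J), B(I-J+1) and B(I-J+2) live in an iteration of the unrolled loop.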
3.4.6 Non-Perfectly Nested Loops

It may be the case that the loop nest on which we wish to perform unroll-and-jam is not perfectly nested. Consider the following loop.

      DO 10 I = 1,N
        DO 20 J = 1,N
20      A(I) = A(I) + B(J)
        DO 10 J = 1,N
10      C(J) = C(J) + D(I)

On the RS/6000, where $\beta_M = 1$, the I-loop would not be unrolled for the loop surrounding statement 20, but would be unrolled for the loop surrounding statement 10. To handle this situation, we can distribute the I-loop around the J-loops as follows

      DO 20 I = 1,N
        DO 20 J = 1,N
20      A(I) = A(I) + B(J)
      DO 10 I = 1,N
        DO 10 J = 1,N
10      C(J) = C(J) + D(I)

and unroll each loop independently. In general, we compute unroll amounts with a balance-optimization formula for each innermost loop body as if it were within a perfectly nested loop. A vector of unroll amounts, including a value for each of a loop's surrounding loops and a value for the loop itself, is kept for each loop within the nest. When determining if distribution is necessary, if we encounter a loop that has multiple inner loops at the next level, we compare the unroll vectors of the inner loops to determine if any unroll amounts differ. If they differ, we distribute each of the loops from the current loop out to the loop at the outermost level where the unroll amounts differ. We then proceed by examining each of the newly created loops for additional distribution. To handle safety, if any loop that must be distributed cannot be distributed, we set the unroll-vector values in each vector to the smallest unroll amount in all of the vectors for each level. Distribution is unsafe if it breaks a recurrence [AK87]. The complete algorithm for distributing loops for unroll-and-jam is given in Figure 3.8.

procedure DistributeLoops(L, i)
  Input: L = loop to check for distribution
         i = nesting level of L
  if L has multiple inner loops at level i+1 then
    l = outermost level where the unroll amounts differ for the loop nests
    if l > 0 then
      if loops at levels l, ..., i can be distributed then
        while level(L) ≥ l do
          distribute L around its inner loops
          L = parent(L)
        enddo
      else
        for j = 1 to i do
          m = minimum of all unroll vectors for the inner loops of L at level j
          assign to those vectors at level j the value m
        enddo
      endif
    endif
  N = the first inner loop of L at level i+1
  if N exists then DistributeLoops(N, i+1)
  N = the next loop at level i following L
  while N exists do
    DistributeLoops(N, i)
    N = the next loop at level i following N
  enddo
end

Figure 3.8  Distribute Loops for Unroll-and-Jam

3.5 Experiment

Using the experimental system described in Chapter 2, we have implemented unroll-and-jam as presented in this chapter. In addition to controlling floating-point register pressure, we found that address-register pressure needed to be controlled. For the RS/6000, any expression having no incoming innermost-loop-carried or loop-independent dependence required an address register. Although this compromises our independence from a particular compiler, it could not be avoided for performance reasons. To allow for an imperfect scalar optimizer, the effective register-set size for both floating-point and address registers was chosen to be 26. This value was determined experimentally.

DMXPY.  For our first experiment, we chose DMXPY from the Level 2 BLAS library [DDHH88]. This version of DMXPY is a hand-optimized vector-matrix multiply that is machine-specific and difficult to understand. We applied our transformations to the unoptimized version and compared the performance with the BLAS2 version.

Array Size  Iterations  Original  BLAS2  UJ+SR  Speedup
300         700         7.78s     5.18s  5.06s  1.54

As the table shows, we were able to attain slightly better performance than the BLAS version, while allowing the kernel to be expressed in a machine-independent fashion. The hand-optimized version was not specific to the RS/6000, and more hand optimization would have been required to tune it to the RS/6000. However, our automatic techniques took care of the machine-specific details without the knowledge or intervention of the programmer and still obtained good performance.

Matrix Multiply.  The next experiment was performed on two versions of matrix multiply. The JKI loop ordering for better cache performance is shown below.

      DO 10 J = 1,N
        DO 10 K = 1,N
          DO 10 I = 1,N
10        C(I,J) = C(I,J) + A(I,K) * B(K,J)

The results from our transformation system show that integer-factor speedups are attainable on both small and large matrices for both versions of the loop. Not only did we improve memory reuse within the loop, but also available instruction-level parallelism.

Loop Order  Array Size  Iterations  Original  SR       UJ+SR   Speedup
JIK         50x50       500         4.53s     4.53s    2.37s   1.91
JIK         500x500     1           135.61s   135.61s  44.16s  3.07
JKI         50x50       500         6.71s     6.71s    3.3s    2.04
JKI         500x500     1           15.49s    15.49s   6.6s    2.35

Comparing the results of the two versions of matrix multiply shows how critical memory performance really is. The JKI version has better overall cache locality, and on large matrices that property gives dramatically better performance than the JIK version.

Linear Algebra Kernels.  We also experimented with both the point and block versions of LU decomposition with and without partial pivoting (LU and LUP, respectively). In the table below, we show the results of scalar replacement and unroll-and-jam.

Kernel     Array Size  Block Size  Original  SR     UJ+SR  Speedup
Block LU   300x300     32          1.37s     0.91s  0.56s  2.45
Block LU   300x300     64          1.42s     1.02s  0.72s  1.97
Block LU   500x500     32          6.58s     4.51s  2.48s  2.65
Block LU   500x500     64          6.59s     4.54s  2.79s  2.36
Block LUP  300x300     32          1.42s     0.97s  0.67s  2.12
Block LUP  300x300     64          1.48s     1.10s  0.77s  1.92
Block LUP  500x500     32          6.85s     4.81s  2.72s  2.52
Block LUP  500x500     64          6.83s     4.85s  3.02s  2.26

As shown in the above examples, a factor of 2 to 2.5 was attained. Each of the kernels contained an inner-loop reduction that was amenable to unroll-and-jam. In these instances, unroll-and-jam introduced multiple parallel copies of the inner-loop recurrence while simultaneously improving data locality.

NAS Kernels.  Next, we studied three of the NAS kernels from the SPEC benchmark suite. A speedup was observed for each of these kernels, with Emit and Gmtry doing much better. The reason is that one of the main computational loops in both Emit and Gmtry contained an outer-loop reduction that was unroll-and-jammed.

Kernel  Original  SR       UJ+SR    Speedup
Vpenta  149.68s   149.68s  138.69s  1.08
Emit    35.1s     33.83s   24.57s   1.43
Gmtry   155.3s    154.74s  25.95s   5.98

Geophysics Kernels.  We also included in our experiment two geophysics kernels: one that computed the adjoint convolution of two time series and another that computed the convolution. Each of these loops required the optimization of trapezoidal, rhomboidal and triangular iteration spaces using techniques that we develop in Chapter 5. Again, these kernels contained inner-loop reductions.

Kernel  Iterations  Array Size  Original  SR      UJ+SR  Speedup
Afold   1000        300         4.59s     4.59s   2.55s  1.80
Afold   1000        500         12.46s    12.46s  6.65s  1.87
Fold    1000        300         4.61s     4.61s   2.53s  1.82
Fold    1000        500         12.56s    12.56s  6.63s  1.91
Applications.  To complete our study, we ran a number of Fortran applications through our translator. We chose programs from SPEC, Perfect, RICEPS and local sources. Those programs that belong to our benchmark suites but are not included in this experiment contained no opportunities for our algorithm. The results of performing scalar replacement and unroll-and-jam on these applications are shown in the following table. (Our version of Matrix300 is after procedure cloning and inlining to create context for unroll-and-jam [BCHT90].)

Program    Original  SR       UJ+SR    Speedup
Arc2d      410.13s   407.57s  401.96s  1.02
CoOpt      122.88s   120.44s  116.67s  1.05
Flo52      61.01s    58.61s   58.8s    1.04
Matrix300  149.6s    149.6s   33.21s   4.50
Onedim     4.41s     4.41s    3.96s    1.11
Simple     963.20s   934.13s  928.84s  1.04
Tomcatv    37.66s    37.66s   37.41s   1.01

The applications that observed the largest improvements with unroll-and-jam (Matrix300, Onedim) were dominated by the cost of loops containing reductions that were highly optimized by our system. Although many of the loops found in other programs received a 15%-40% improvement, the applications themselves were not dominated originally by the costs of these loops.

Throughout our study, we found that unroll-and-jam was most effective in the presence of reductions. Unrolling a reduction introduced very few stores and improved cache locality and instruction-level parallelism. Stores seemed to limit the perceived available parallelism on the RS/6000. Unroll-and-jam was least effective in the presence of a floating-point divide because of its high computational cost. Because a divide takes 19 cycles on the RS/6000, its cost dominated loop performance enough to make data locality a minimal factor.

3.6 Summary

In this chapter, we have developed an optimization problem that minimizes the distance between machine balance and loop balance, bringing the balance of the loop as close as possible to the balance of the machine. We have shown how to use a few simplifying assumptions to make the solution times fast enough for inclusion in compilers. We have implemented these transformations in a Fortran source-to-source preprocessor and shown their effectiveness by applying them to a substantial collection of kernels and whole programs. These results show that, over whole programs, modest improvements are usually achieved, with spectacular results occurring on a few programs. The methods are particularly successful on kernels from linear algebra. These results were achieved on an IBM RS/6000, which has an extremely effective optimizing compiler. We would expect more dramatic improvements over a less sophisticated compiler. These methods should also produce larger improvements on machines where the load penalties are greater.

Chapter 4

Loop Interchange

The previous two chapters have addressed the problem of improving the performance of memory-bound loops under the assumption that good cache locality already exists in program loops. It is the case, however, that not all loops exhibit good cache locality, resulting in idle computational cycles while waiting for main memory to return data. For example, in the loop,

      DO 10 I = 1, N
        DO 10 J = 1, N
10      A = A + B(I,J)

references to successive elements of B are a long distance apart in number of memory accesses. Most likely, current cache architectures would not be able to capture the potential cache-line reuse available because of the volume of data accessed between reuse points. With each reference to B being a cache miss, the loop would spend a majority of its time waiting on main memory. However, if we interchange the I- and J-loops to get

      DO 10 J = 1, N
        DO 10 I = 1, N
10      A = A + B(I,J)

the references to successive elements of B immediately follow one another. In this case, we have attained locality of reference for B by moving reuse points closer together. The result will be fewer idle cycles waiting on main memory.
In this chapter, we show how the compiler can automate the above process to attain a loop ordering with good memory performance. We begin with a model of memory performance for program loops. Then, we show how to use this model to choose the loop ordering that maximizes memory performance. Finally, we present an experiment with an implementation of this technique.

4.1 Performance Model

Our model of memory performance consists of two parts. The first part models the ability of the cache to retain values between reuse points. The second part models the costs associated with each memory reference within an innermost loop.

4.1.1 Data Locality

When applied to data locality within the cache, a data dependence can be thought of as a potential opportunity for reuse. The reason the reuse opportunity is only potential is cache interference, where two data items need to occupy the same location in the cache at the same time. Previous studies have shown that interference is hard to predict and often prevents outer-loop reuse [CP90, LRW91]. However, inner-loop locality is likely to be captured by the cache because of short distances between reuse points. Given these factors, our model of the cache will assume that all outer-loop reuse will be prevented by cache interference and that all inner-loop reuse will be captured by the cache. This assumption allows us to ignore the unpredictable effects of set associativity and concentrate solely on the cache line size, miss penalty and access cost to measure memory performance. In essence, by ignoring set associativity, we assume that the cache will closely resemble a fully associative cache in relation to innermost-loop reuse. Although precision is lowered, our experimentation shows that loop interchange based on this model is effective in attaining locality of reference within the cache.

4.1.2 Memory Cycles

To compute the cost of an array reference, we must first know the array storage layout for a particular programming language. Our model assumes that arrays are stored in column-major order. Row-major order can be handled with slight modifications. Once we know the storage layout, we can determine where in the memory hierarchy values reside and the associated cost of accessing each value by analyzing reuse properties. In the rest of this section, we show how to determine the reuse properties of array references based upon the dependence graph and how to assign an average cost to each memory access.

One reuse property, temporal reuse, occurs when a reference accesses data that has been previously accessed in the current or a previous iteration of an innermost loop. These references are represented in the dependence graph as the sink of a loop-independent or innermost-loop-carried consistent dependence. If a reference is the sink of an output or antidependence, each access will be out of cache (assuming a write-back cache) and will cost $T = C_h$ cycles, where $C_h$ is the cache access cost. If a reference is the sink of a true or input dependence, then it will be removed by scalar replacement, resulting in a cost of 0 cycles per access. In Figure 4.1, the reference to A(I-1,J) has temporal reuse of the value defined by A(I,J) one iteration earlier and will be removed by scalar replacement. Therefore, it will cost 0 cycles.

The other reuse property, spatial reuse, occurs when a reference accesses data that is in the same cache line as some previous access. One type of reference possessing spatial reuse has the innermost-loop induction variable contained only in the first subscript position and has no incoming consistent dependence. This reference type accesses successive locations in a cache line on successive iterations of the innermost loop.
It requires the cost of accessing the cache for every memory access, plus the cache miss penalty, $C_m$, for every access that goes to main memory. Assuming that the cache line size is $C_l$ words and the reference has a stride of s words between successive accesses, $\lfloor C_l / s \rfloor$ successive accesses will be in the same cache line. Therefore, the probability that a reference with spatial reuse will be a cache miss is

$$P_m = \frac{1}{\lfloor C_l / s \rfloor},$$

giving the following formulation for memory cost:

$$S_l = C_h + P_m C_m.$$

In Figure 4.1, the reference to A(I,J) has spatial reuse of the form just described. Given a cache with $C_h = 1$, $C_m = 8$ and $C_l = 16$, A(I,J) costs $1 + \frac{1}{16} \cdot 8 = 1.5$ cycles/access.

The other type of reference having spatial reuse accesses memory within the same cache line as a different reference but does not access the same value. These references will be represented as the sink of an outer-loop-carried consistent dependence with a distance vector of $\langle 0, \ldots, 0, d_i, 0, \ldots, 0, d_n \rangle$, where the induction variable for loop i is only in the first subscript position. In this case, memory accesses will be cache hits on all but possibly the first few accesses and essentially cost $S_T = C_h$. In Figure 4.1, the reference to C(J-1,I) has this type of spatial reuse due to the reference to C(J,I). Under the previous cache parameters, C(J-1,I) requires 1 cycle/access.

The remaining types of references can be classified as those that contain no reuse and those that are stored in registers. Array references without spatial and temporal reuse fall into the former category and require $N = C_h + C_m$ cycles. References to scalars fall into the latter category and require 0 cycles. C(J,I) in Figure 4.1 has no reuse, since the stride between consecutive accesses is too large to attain spatial reuse. It has a cost of 9 cycles/access under the current cache parameters.

      DO 10 J = 1,N
        DO 10 I = 1,N
10      A(I,J) = A(I-1,J) + C(J,I) + C(J-1,I)

Figure 4.1  Example Loop for Memory Costs
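Collecting the cases, the per-access cost the model assigns to a reference can be summarized as

$$cost(v) = \begin{cases} 0 & \text{scalar, or scalar-replaced temporal reuse} \\ C_h & \text{temporal reuse via output/antidependence, or group spatial reuse} \\ C_h + C_m / \lfloor C_l / s \rfloor & \text{self spatial reuse at stride } s \\ C_h + C_m & \text{no reuse,} \end{cases}$$

so for Figure 4.1 under $C_h = 1$, $C_m = 8$, $C_l = 16$, the loop body costs $1.5 + 0 + 9 + 1 = 11.5$ cycles of memory access per iteration.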
4.2 Algorithm

This section presents an algorithm for ordering loop nests to maximize memory performance that is based upon the memory-cost model described in the previous section. First, we show how to compute the order for memory performance for perfectly nested loops. Then, we show how to extend the algorithm to handle non-perfectly nested loops.

4.2.1 Computing Loop Order

Using the model of memory from Section 4.1, we apply a loop-interchange algorithm to a loop nest to minimize the cost of accessing memory. The algorithm considers the memory costs of the references in the innermost-loop body when each loop is interchanged to the innermost position and then orders the loops from innermost to outermost based upon increasing costs. Although costs are computed relative only to the innermost position, they reflect the data-locality merit of a loop relative to all other loops within the nest. Loops with a lower memory cost have higher locality and a higher probability of attaining outer-loop reuse if interference happens not to occur. To compute the memory costs associated with each loop as described, the distance vectors of all of the dependence edges in the loop nest are viewed as if the entry for the loop under consideration were shifted to the innermost position with all other entries remaining in the same relative position. This allows determination of which edges can be made innermost to capture reuse.

After computing the memory cost for each loop and computing the order for the loops, we must ensure that no violation of dependences will occur. Safety is ensured using the technique of McKinley, et al. [KM92]. Beginning with the outermost position in the loop, we select the loop with the highest memory cost that can safely go in the current loop position. A loop positioning is safe if no resulting dependence has a negative threshold [AK87]. After selecting the outermost loop, we proceed by iteratively selecting the loop with the next highest cost for the next outermost position until a complete ordering is obtained. The algorithm for computing loop order for perfectly nested loops is shown in Figure 4.2.

procedure LoopOrder(L)
  Input: L = loop nest
  compute memory costs for each loop
  P = SortByCost(L)
  if ∃ a dependence with a negative threshold then
    L = P
    for i = 1 to |L| do
      P_i = the outermost loop, l_j ∈ L, such that putting l_j at level i
            will not create a dependence with a negative threshold
      remove P_i from L
    enddo
  endif
  store P for this perfectly nested loop
end

Figure 4.2  Algorithm for Ordering Loops

An Example.  To illustrate the algorithm in Figure 4.2, consider the following loop under two sets of memory costs.

      DO 10 I = 1,N
        DO 10 J = 1,N
10      A(J) = A(J) + B(J,I)

Given $C_l = 8$, $C_h = 1$ and $C_m = 8$, the cost of the I-loop in the innermost position is $0 + 0 + 9 = 9$ cycles/iteration and the cost of the J-loop is $3 \cdot (1 + \frac{1}{8} \cdot 8) = 6$ cycles/iteration. This cost model would argue for an I,J loop ordering. However, if $C_h$ were 3 cycles, the cost of the I-loop becomes 11 cycles/iteration and the cost of the J-loop becomes 12 cycles/iteration, giving a J,I loop ordering. As illustrated by this example, our cost model can optimize for a loop/architecture combination that calls for either better cache reuse or better register reuse. Unfortunately, we do not have access to a machine with a high load penalty to validate this effect of our model. It may be the case that TLB misses, due to long strides for references with no reuse, would dominate the cost of memory accesses. In that case, our model would need to be updated to reflect TLB performance.
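The per-reference breakdown behind these numbers: with J innermost, A(J) (load and store) and B(J,I) each walk stride-one through the cache, costing $C_h + C_m/\lfloor C_l/s \rfloor = 1 + 1 = 2$ cycles apiece, hence $3 \cdot 2 = 6$; with I innermost, both references to A(J) are loop-invariant and sit in a register (0 cycles), while B(J,I) strides across whole columns and misses every time, costing $C_h + C_m = 9$. Raising $C_h$ to 3 penalizes the three cache accesses of the J-innermost order ($3 \cdot 4 = 12$) more than the single miss of the I-innermost order ($0 + 0 + 3 + 8 = 11$), flipping the decision.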
4.2.2 Non-Perfectly Nested Loops

It may be the case that the loop nest on which we wish to perform interchange is not perfectly nested. Consider the following loop.

      DO 10 I = 1,N
        DO 20 J = 1,N
20      A(I,J) = A(I,J) + B(I,J)
        DO 10 J = 1,N
10      C(J,I) = C(J,I) + D(J,I)

It is desirable to interchange the I- and J-loops for statement 20, but not for statement 10. To handle this situation, we use an approach similar to that used in unroll-and-jam, where each innermost loop body is considered as if it were perfectly nested when computing memory costs. If interchange across a non-perfectly nested portion of the loop nest is required, or different loop orderings of common loops are requested, then the interchanged loop is distributed, when safe, and the interchange is performed as desired [Wol86a]. Distribution safety can be incorporated into LoopOrder at the point where interchange safety is tested. To be conservative, any loop containing a distribution-preventing recurrence, and any loop nested outside of that loop, cannot be interchanged inward. In the previous example, the result after distribution and interchange would be

      DO 20 J = 1,N
        DO 20 I = 1,N
20      A(I,J) = A(I,J) + B(I,J)
      DO 10 I = 1,N
        DO 10 J = 1,N
10      C(J,I) = C(J,I) + D(J,I)

In Figure 4.3, we show the complete algorithm for ordering nested loops.

procedure Interchange(L, i)
  Input: L = loop nest
         i = level of outermost loop
  for N = each possible perfect loop nest in L do
    LoopOrder(N)
  InterchangeWithDistribution(L, i)
end

procedure InterchangeWithDistribution(L, i)
  if L has multiple inner loops at level i+1 then
    if interchange is desired across level i+1 or
       multiple orderings requested for levels 1 to i then
      distribute L
      for N = L and its distributed copies do
        Interchange(N, i)
      return
  endif
  if L is innermost then
    let N = ordering for perfect nest produced by LoopOrder
    order perfect nest of loops by N
  endif
  N = the first inner loop of L at level i+1
  if N exists then InterchangeWithDistribution(N, i+1)
  N = the next loop at level i following L
  while N exists do
    InterchangeWithDistribution(N, i)
    N = the next loop at level i following N
  enddo
end

Figure 4.3  Algorithm for Non-Perfectly Nested Loops

4.3 Experiment

We have implemented our loop-interchange algorithm in the ParaScope programming environment along with scalar replacement and unroll-and-jam. Our experiment was performed on the IBM RS/6000 model 540, which has the following cache parameters: $C_h = 1$, $C_m = 8$ and $C_l = 128$ bytes. The order in which transformations are performed in our system is loop interchange followed by unroll-and-jam and scalar replacement. First, we get the best data locality in the innermost loop, and then we improve the resulting balance.

Matrix Multiply.  We begin our study with the effects of loop order on matrix multiply using matrices too large for the cache to capture all reuse. In the table below, the performance of the various versions using two different-sized matrices is given.

Loop Order  Array Size  Time
IJK         300x300     12.51s
IJK         500x500     131.71s
IKJ         300x300     50.93s
IKJ         500x500     366.77s
JIK         300x300     12.46s
JIK         500x500     135.61s
JKI         300x300     3.38s
JKI         500x500     15.49s
KIJ         300x300     51.32s
KIJ         500x500     366.62s
KJI         300x300     3.45s
KJI         500x500     15.83s

The loop order with the best performance is JKI because of stride-one access in the cache. This is the loop order that our transformation system will derive, given any of the possible orderings. The key observation is that the programmer can specify matrix multiply in the manner that is most understandable to him without concern for memory performance. Using our technique, the compiler will give the programmer good performance without requiring knowledge of the implementation of the array storage layout for a particular programming language or of the underlying machine. In essence, machine-independent programming is made possible.

NAS Kernels.  Next, we studied the effects of loop interchange on a couple of the NAS kernels from the SPEC benchmark suite. As shown in the results below, getting the proper loop ordering for memory performance can have a dramatic effect.

Kernel  Original  SR       LI+SR    UJ+SR    LI+UJ+SR  Speedup
Vpenta  149.68s   149.68s  115.62s  138.69s  115.62s   1.29
Gmtry   155.30s   154.74s  17.89s   25.95s   18.07s    8.68

The speedup reported for Gmtry does not include the application of unroll-and-jam. Although we have no analysis tool to determine exactly what caused the slight degradation in performance, the most likely suspect is cache interference. Unroll-and-jam probably created enough copies of the inner loop to incur cache-interference problems due to set associativity.

Applications.  To complete our study, we ran our set of applications through the translator to determine the effectiveness of loop interchange. In the following table, the results are shown for those applications where loop interchange was applicable.
Program  Original  SR       LI+SR    UJ+SR    LI+UJ+SR  Speedup
Arc2d    410.13s   407.57s  190.69s  401.96s  192.53s   2.15
Simple   963.20s   934.13s  850.18s  928.84s  847.82s   1.14
Wave     445.94s   431.11s  414.63s  431.11s  414.63s   1.08

Again, we have shown that remarkable speedups are attainable with loop interchange. Arc2d improved by a factor of over 2 on a single processor. This application was written in a "vectorizable" style, with regard for vector operations rather than data locality. What our transformation system has done is to allow the code to be portable across different architectures and still attain good performance. The very slight degradation when unroll-and-jam was added is probably again due to cache interference.

Throughout our study, one factor stood out as the key to performance on machines like the RS/6000: stride-one access in the cache. Long cache lines, a low access cost and a high miss penalty create this phenomenon. Attaining spatial reuse was a simple yet extremely effective approach to obtaining high performance.

4.4 Summary

In this chapter, we have shown how to automatically order loops in a nest to give good memory performance. The model that we have presented for memory performance is simple but extremely effective. Using our transformation system, the programmer is freed from having to worry about his loop order, allowing him to write his code in a machine-independent form with confidence that it will be automatically optimized to achieve good memory performance. The results presented here represent a positive step toward encouraging machine-independent programming.

Chapter 5

Blockability

In Chapter 1, we presented two transformations, unroll-and-jam and strip-mine-and-interchange, to effect iteration-space blocking. Although these transformations have been studied extensively, they have mainly been applied to kernels written in an easily analyzed algorithmic style. In this chapter, we examine the applicability of iteration-space blocking to more complex algorithmic styles. Specifically, we present the results of a project to see if a compiler could automatically generate block algorithms similar to those found in LAPACK from the corresponding point algorithms expressed in Fortran 77. In performing this study, we address the question, "What information does a compiler need in order to derive block versions of real-world codes that are competitive with the best hand-blocked versions?"

In the course of this study, we have found transformation algorithms that can be successfully used on triangular loops, which are quite common in linear algebra, and trapezoidal loops. In addition, we have discovered an algorithmic approach that can be used to analyze and block programs that exhibit complex dependence patterns. The latter method has been successfully applied to block LU decomposition without pivoting. The key to many of these results is a transformation known as index-set splitting. Our results with this transformation show that a wide class of numerical algorithms can be automatically optimized for a particular machine's memory hierarchy even if they are expressed in their natural form. In addition, we have discovered that specialized knowledge about which operations commute with one another can enable compilers to block codes that were previously thought to be unblockable by automatic means.

This chapter begins with a review of iteration-space blocking. Next, the transformations that we found necessary to block algorithms like those found in LAPACK are presented. Then, we present a study of the application of these transformations to derive the block LAPACK-like algorithms from their corresponding point algorithms. Finally, for those algorithms that cannot be blocked by a compiler, we propose a set of language extensions to allow the expression of block algorithms in a machine-independent form.
5.0.1 Iteration-Space Blocking

To improve the memory behavior of loops that access more data than can be handled by a cache, the iteration space of a loop can be blocked into sections whose temporal reuse can be captured by the cache. Strip-mine-and-interchange is a transformation that achieves this result [Wol87, WL91]. The effect is to shorten the distance between the source and sink of a dependence so that it is more likely for the datum to reside in cache when the reuse occurs. Consider the following loop nest.

      DO 10 J = 1,N
        DO 10 I = 1,M
10      A(I) = A(I) + B(J)

Assuming that the value of M is much greater than the size of the cache, we would get temporal reuse of the values of B while missing the temporal reuse of the values of A on each iteration of J. To capture A's reuse, we can use strip-mine-and-interchange as shown below.

      DO 10 J = 1,N,JS
        DO 10 I = 1,M
          DO 10 JJ = J,MIN(J+JS-1,N)
10        A(I) = A(I) + B(JJ)

Now, we can capture the temporal reuse of JS values of B out of the cache for every iteration of the J-loop, if JS is less than the size of the cache and no cache interference occurs, and we can capture the temporal reuse of A in registers [LRW91]. As stated earlier, both strip-mine-and-interchange and unroll-and-jam make up the transformation technique known as iteration-space blocking. Essentially, unroll-and-jam is strip-mine-and-interchange with the innermost loop unrolled. The difference is that unroll-and-jam is used to block for registers and strip-mine-and-interchange for cache.
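To see the correspondence, here is the same loop after unroll-and-jam of the J-loop with a block (unroll) size of 2, which is just the strip-mined nest above with the JJ-loop fully unrolled (a cleanup loop for N not divisible by 2 is omitted from this sketch):

      DO 10 J = 1,N,2
        DO 10 I = 1,M
C         two elements of B are now reused out of registers on every
C         iteration of I, halving the memory traffic per addition to A
10      A(I) = A(I) + B(J) + B(J+1)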
5.1 Index-Set Splitting

Iteration-space blocking cannot always be directly applied as shown in the previous section. Sometimes a transformation called index-set splitting must be applied to allow blocking. Index-set splitting involves the creation of multiple loops from one original loop, where each new loop iterates over a portion of the original iteration space. Execution order is not changed and the original iteration space is still completely executed. As an example of index-set splitting, consider the following loop.

      DO 10 I = 1,N
10      A(I) = A(I) + B(I)

We can split the index set of I at iteration 100 to obtain

      DO 10 I = 1,MIN(N,100)
10      A(I) = A(I) + B(I)
      DO 20 I = MIN(N,100)+1,N
20      A(I) = A(I) + B(I)

Although this transformation does nothing by itself, its application can enable the blocking of complex loop forms. This section shows how index-set splitting allows the blocking of triangular-, trapezoidal- and rhomboidal-shaped iteration spaces and the partial blocking of loops with complex dependence patterns.

5.1.1 Triangular Iteration Spaces

When the iteration space of a loop is not rectangular, iteration-space blocking cannot be directly applied. The problem is that when performing interchange of loops that iterate over a triangular region, the loop bounds must be modified to preserve the semantics of the loop [Wol86a, Wol87]. Below, we derive the formula for determining loop bounds when blocking is performed on triangular iteration spaces. We begin with the derivation for strip-mine-and-interchange and then extend it to unroll-and-jam. The general form of a strip-mined triangular loop is given below, where α and β are integer constants (β may be a loop invariant) and α > 0.

      DO 10 I = 1,N,IS
        DO 10 II = I,I+IS-1
          DO 10 J = α*II+β,M
10          loop body

Figure 5.1 gives a graphical description of the iteration space of this loop.

      Figure 5.1  Upper Left Triangular Iteration Space

To interchange the II and J loops, we have to account for the fact that the line J = α*II+β intersects the iteration space at the point (I, α*I+β). Therefore, when we interchange the loops, the II-loop must iterate over a trapezoidal region, requiring its upper bound to be (J-β)/α until (J-β)/α > I+IS-1. This gives the following loop nest.

      DO 10 I = 1,N,IS
        DO 10 J = α*I+β,M
          DO 10 II = I,MIN((J-β)/α,I+IS-1)
10          loop body

This formula can be trivially extended to handle the cases where α < 0 and where a linear function of I appears in the upper bound instead of the lower bound (see Appendix A). Given the formula for triangular strip-mine-and-interchange, we can extend it to triangular unroll-and-jam as follows. The iteration space defined by the two inner loops is a trapezoidal region, making unrolling the innermost loop non-trivial because the number of iterations varies with J. To overcome this, we can use index-set splitting on J to create one loop that iterates over the triangular region below the line J = α*(I+IS-1)+β and one loop that iterates over the rectangular region above the line. Since we know the length of the rectangular region, the second loop can be unrolled to give the following loop nest.

      DO 10 I = 1,N,IS
        DO 20 II = I,I+IS-2
          DO 20 J = α*II+β,MIN(α*(I+IS-2)+β,M)
20          loop body
        DO 10 J = α*(I+IS-1)+β,M
10        unrolled loop body

Depending upon the values of α and β, it may also be possible to determine the size of the triangular region; therefore, it may be possible to completely unroll the first loop nest to eliminate the overhead. Additionally, triangular unroll-and-jam can be trivially extended to handle other common triangles (see Appendix A). To see the potential of triangular unroll-and-jam, consider the following loop, which is used in a back solve after LU decomposition.

      DO 10 I = 1,N
        DO 10 J = I+1,N
10        A(J) = A(J) + B(J,I) * A(I)

We used an automatic system to perform triangular unroll-and-jam and scalar replacement on the above loop. We then ran the result on arrays of DOUBLE-PRECISION REALs on an IBM RS/6000 540. The results are shown in the table below.

      Size  Iterations  Original  Xformed  Speedup
      300   500         6.09s     3.49s    1.74
      500   200         6.78s     3.82s    1.77

5.1.2 Trapezoidal Iteration Spaces

While the previous method applies to many of the common non-rectangular iteration spaces, there are still some important loops that it will not handle. In linear algebra, seismic and partial differential equation codes, we often find loops with trapezoidal-shaped iteration spaces. Consider the following example, where L is assumed to be a constant and α > 0.

      DO 10 I = 1,N
        DO 10 J = L,MIN(α*I+β,N)
10        loop body

Here, as is often the case, a MIN function has been used to handle boundary conditions, resulting in the loop's trapezoidal shape as shown in Figure 5.2. The formula to handle a triangular region does not apply in this case, so we must extend it. The trapezoidal iteration space contains one rectangular region and one triangular region, separated at the point where α*I+β = N. Because we already know how to handle rectangular and triangular regions and because we want to reduce the execution overhead in a rectangular region, we can split the index set of I into two separate regions at the point I = (N-β)/α. This gives the following loop nest.

      DO 10 I = 1,MIN(N,(N-β)/α)
        DO 10 J = L,α*I+β
10        loop body
      DO 20 I = MAX(1,MIN(N,(N-β)/α)+1),N
        DO 20 J = L,N
20        loop body

We can now apply triangular iteration-space blocking to the first loop and rectangular unroll-and-jam to the second loop.
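Because the split-point arithmetic is easy to get wrong, a small self-checking sketch follows (our illustration, not from the thesis; ALPHA, BETA, L and N are arbitrary sample values standing in for α, β, L and N). It counts the iterations of the original trapezoidal nest and of the split pair and prints both totals, which should match.

      PROGRAM TSPLIT
C     Illustration only: check that index-set splitting of the
C     trapezoidal nest at I = (N-BETA)/ALPHA covers exactly the
C     iterations of the original nest.
      INTEGER ALPHA, BETA, L, N, I, J, NORIG, NSPLIT, ISPLIT
      PARAMETER (ALPHA = 2, BETA = 3, L = 1, N = 50)
      NORIG = 0
      DO 10 I = 1,N
        DO 10 J = L,MIN(ALPHA*I+BETA,N)
10        NORIG = NORIG + 1
      NSPLIT = 0
      ISPLIT = MIN(N,(N-BETA)/ALPHA)
C     Triangular region: the MIN never applies here.
      DO 20 I = 1,ISPLIT
        DO 20 J = L,ALPHA*I+BETA
20        NSPLIT = NSPLIT + 1
C     Rectangular region: the MIN always yields N here.
      DO 30 I = MAX(1,ISPLIT+1),N
        DO 30 J = L,N
30        NSPLIT = NSPLIT + 1
      PRINT *, 'original iterations: ', NORIG
      PRINT *, 'split    iterations: ', NSPLIT
      END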
In addition to loops with MIN functions in the upper bound, index-set splitting of trapezoidal regions can be extended to allow MAX functions in the lower bound of the inner loop (see Appendix A). The lower bound, L, of the inner loop in the trapezoidal nest need not be restricted to a constant value. It can essentially be any function that produces an iteration space that can be blocked. In the loop,

      DO 10 I = 1,N1
        DO 10 K = I,MIN(I+N2,N3)
10        F3(I) = F3(I)+DT*F1(K)*WORK(I-K)

the lower bound is a linear function of the outer-loop induction variable, resulting in rhomboidal and triangular regions (see Figure 5.3). To handle this loop, blocking can be extended to rhomboidal regions using index-set splitting, as in the case for triangular regions (see Appendix A). To see the potential for improving the performance of trapezoidal loops, see Section 3.5 under Geophysics Kernels. The algorithm Afold is the example shown above and computes the adjoint-convolution of two time series. Fold computes the convolution of two time series and contains a MAX function in the lower bound and a MIN function in the upper bound.

      Figure 5.2  Trapezoidal Iteration Space with Rectangle
      Figure 5.3  Trapezoidal Iteration Space with Rhomboid

5.1.3 Complex Dependence Patterns

In some cases, it is not only the shape of the iteration space that presents difficulties for the compiler, but also the dependence patterns within the loop. Consider the strip-mined example below.

      DO 10 I = 1,N,IS
        DO 10 II = I,I+IS-1
          T(II) = A(II)
          DO 10 K = II,N
10          A(K) = A(K) + T(II)

To complete blocking, the II-loop must be interchanged into the innermost position. Unfortunately, there is a recurrence between the definition of A(K) and the load from A(II) carried by the II-loop. Using only standard dependence abstractions, such as distance and direction vectors, we would be prevented from blocking the loop [Wol82]. However, if we analyze the sections of the arrays that are accessed at the source and sink of the offending dependence using array summary information, the potential to apply blocking is revealed [CK87, HK91]. Consider Figure 5.4. The region of the array A read by the reference to A(II) goes from I to I+IS-1 and the region written by A(K) goes from I to N. Therefore, the recurrence does not exist for the region from I+IS to N.

      Figure 5.4  Data Space for A

To allow partial blocking of the loop, we can split the index set so that one loop iterates over the common region and one loop iterates over the disjoint region. To determine the split point that creates these regions, we set the subscript expression for the larger region equal to the boundary between the common and disjoint regions and solve for the inner induction variable. In our example, we let K = I+IS-1 and solve for K. Splitting at this point yields

      DO 10 I = 1,N,IS
        DO 10 II = I,I+IS-1
          T(II) = A(II)
          DO 20 K = II,I+IS-1
20          A(K) = A(K) + T(II)
          DO 10 K = I+IS,N
10          A(K) = A(K) + T(II)

We can now distribute the II-loop and complete blocking on the loop nest surrounding statement 10. The method that we have just described may be applicable when the references involved in the preventing dependences have different induction variables in corresponding positions (e.g., A(II) and A(K) in the previous example). An outline of our application of the method after strip mining is given below.

1. Calculate the sections of the source and sink of the preventing dependence.
2. Intersect and union the sections.
3. If the intersection is equal to the union, then stop.
4. Set the subscript expression of the larger section equal to the boundary between the disjoint and common sections and solve for the inner-loop induction variable.
5. Split the index set of the inner loop at this point and proceed with blocking on the disjoint region.
6. Repeat steps 4 and 5 if there are multiple boundaries.

If multiple transformation-preventing dependences exist, we process each one with the above steps until failure or until a region is created where blocking can be performed.

5.2 Control Flow

In addition to iteration-space shapes and dependence patterns, we must also consider the effects of control flow on blocking. It may be the case that an inner loop is guarded by an IF-statement to prevent unnecessary computation. Consider the following matrix multiply code.

      DO 10 J = 1,N
        DO 10 K = 1,N
          IF (B(K,J) .NE. 0.0) THEN
            DO 20 I = 1,N
20            C(I,J) = C(I,J) + A(I,K) * B(K,J)
          ENDIF
10    CONTINUE

If we were to ignore the IF-statement and perform unroll-and-jam on the K-loop, we would obtain the following code.

      DO 10 J = 1,N
        DO 10 K = 1,N,2
          IF (B(K,J) .NE. 0.0) THEN
            DO 20 I = 1,N
              C(I,J) = C(I,J) + A(I,K) * B(K,J)
20            C(I,J) = C(I,J) + A(I,K+1) * B(K+1,J)
          ENDIF
10    CONTINUE

Here, the value of B(K+1,J) is never tested, and statements that were not executed in the original code may be executed in the unrolled code. Thus, we have performed unroll-and-jam illegally. One possible method to preserve correctness is to move the guard into the innermost loop and replicate it for each unrolled iteration. This, however, would result in a performance degradation due to a decrease in loop-level parallelism and an increase in instructions executed. Instead, we can use a combination of IF-conversion and sparse-matrix techniques, which we call IF-inspection, to keep the guard out of the innermost loop and still allow blocking [AK87]. The idea is to inspect at run time the values of an outer-loop induction variable for which the guard is true and the inner loop is executed. Then, we execute the loop nest for only those values. To effect IF-inspection, code is inserted within the IF-statement to record loop-bounds information for the loop that we wish to transform. On the true branch of the guard to be inspected we insert the following code, where KC is initialized to 0, FLAG is initialized to false, K is the induction variable of the loop to be inspected and KLB is the store for the lower bound of an executed range.

      IF (.NOT. FLAG) THEN
        KC = KC + 1
        KLB(KC) = K
        FLAG = .TRUE.
      ENDIF

On the false branch of the inspected guard, we insert the following code to store the upper bound of each executed range.

      IF (FLAG) THEN
        KUB(KC) = K-1
        FLAG = .FALSE.
      ENDIF

Note that we must also account for the fact that the value of the guard could be true on the last iteration of the loop, requiring a test of FLAG to store the upper bound of the last range after the IF-inspection loop body. After inserting the inspection code, we can distribute the loop we wish to transform around the inspection code and create a new loop nest that executes over the iteration space where the innermost loop was executed. In our example, the result would be

      DO 20 K = 1,N
C       IF-inspection code
20    CONTINUE
      DO 10 KN = 1,KC
        DO 10 K = KLB(KN),KUB(KN)
          DO 10 I = 1,N
10          C(I,J) = C(I,J) + A(I,K) * B(K,J)

The KN-loop executes over the ranges where the guarded loop is executed and the K-loop executes within those ranges. The new K-loop can be transformed as was desired in the original loop nest.
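The bookkeeping is easy to see on a small example. The program below (our illustration; the contents of B are sample data) runs the inspection for a single column and prints the executed ranges.

      PROGRAM INSPEC
C     Illustration only: IF-inspection bookkeeping on sample data.
C     Records the ranges of K for which the guard holds, then
C     iterates over only those ranges.
      INTEGER N, K, KC, KN, KLB(10), KUB(10)
      LOGICAL FLAG
      REAL B(8)
      DATA B /1.0, 2.0, 0.0, 0.0, 3.0, 4.0, 0.0, 5.0/
      N = 8
      KC = 0
      FLAG = .FALSE.
      DO 20 K = 1,N
        IF (B(K) .NE. 0.0) THEN
          IF (.NOT. FLAG) THEN
            KC = KC + 1
            KLB(KC) = K
            FLAG = .TRUE.
          ENDIF
        ELSE
          IF (FLAG) THEN
            KUB(KC) = K - 1
            FLAG = .FALSE.
          ENDIF
        ENDIF
20    CONTINUE
C     The guard may still hold on the last iteration.
      IF (FLAG) KUB(KC) = N
C     Prints the ranges 1..2, 5..6 and 8..8 for the data above.
      DO 10 KN = 1,KC
10      PRINT *, 'executed range: ', KLB(KN), ' .. ', KUB(KN)
      END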
The final IF-inspected code for our example is shown in Figure 5.5. If the ranges over which the inner loop is executed in the original loop are large, the increase in run-time cost caused by IF-inspection can be more than counteracted by the improvements in performance on the newly transformed inner loop. To show this, we performed unroll-and-jam on our IF-inspected matrix multiply example and ran it on an IBM RS/6000 model 540 on 300x300 arrays of REALs. In the table below, Frequency shows how often B(K,J) = 0, UJ is the result of performing unroll-and-jam after moving the guard into the innermost loop and UJ+IF is the result of performing unroll-and-jam after IF-inspection.

      FLAG = .FALSE.
      DO 10 J = 1,N
        KC = 0
        DO 20 K = 1,N
          IF (B(K,J) .NE. 0.0) THEN
            IF (.NOT. FLAG) THEN
              KC = KC + 1
              KLB(KC) = K
              FLAG = .TRUE.
            ENDIF
          ELSE
            IF (FLAG) THEN
              KUB(KC) = K-1
              FLAG = .FALSE.
            ENDIF
          ENDIF
20      CONTINUE
        IF (FLAG) THEN
          KUB(KC) = N
          FLAG = .FALSE.
        ENDIF
        DO 10 KN = 1,KC
          DO 10 K = KLB(KN),KUB(KN)
            DO 10 I = 1,N
10            C(I,J) = C(I,J) + A(I,K) * B(K,J)

      Figure 5.5  Matrix Multiply After IF-Inspection

      Frequency  Original  UJ     UJ+IF  Speedup
      2.5%       3.33s     3.84s  2.25s  1.48
      10%        3.08s     3.71s  2.13s  1.45

5.3 Solving Systems of Linear Equations

LAPACK is a project whose goal is to replace the algorithms in LINPACK and EISPACK with block algorithms that have better cache performance. Unfortunately, scientists have spent years developing this package to attain high performance on a variety of architectures. We believe that this process is the wrong direction for high-performance computing. Compilers, not programmers, should handle the machine-specific details required to attain high performance. It should be possible for algorithms to be expressed in a natural, machine-independent form, with the compiler performing the machine-specific optimizations to attain performance.

In this section, we examine the efficacy of this hypothesis by studying the machine-independent expression of algorithms similar to those found in LAPACK. We examine the blockability of three algorithms for solving systems of linear equations, where an algorithm is "blockable" if a compiler can automatically derive the best known block algorithm, similar to the one found in LAPACK, from its corresponding machine-independent point algorithm. Deriving the block algorithms poses two main problems. The first is developing the transformations that allow the compiler to attain the block algorithm from the point algorithm. The second is determining the machine-dependent blocking factor for the block algorithm. In this section, we address the first issue. The second is beyond the scope of this thesis. Our study shows that LU decomposition without pivoting is a blockable algorithm using the techniques derived in Section 5.1; that LU decomposition with partial pivoting is blockable if information in addition to the index-set splitting techniques is supplied to the compiler; and that QR decomposition with Householder transformations is not blockable. The study will also show how to improve the memory performance of a fourth, non-LAPACK algorithm, QR decomposition with Givens rotations, using the techniques of Sections 5.1 and 5.2.

5.3.1 LU Decomposition without Pivoting

Gaussian elimination is a form of LU decomposition in which the matrix A is decomposed into two matrices, L and U, such that A = LU, L is a unit lower triangular matrix and U is an upper triangular matrix. This decomposition can be obtained by multiplying the matrix A by a series of elementary lower triangular matrices, M_k, ..., M_1, as follows [Ste73]:
      U = M_k \cdots M_1 A,  \quad  A = M_1^{-1} \cdots M_k^{-1} U,  \quad  A = LU.      (5.1)

Using Equation 5.1, an algorithm for LU decomposition without pivoting can be derived. This point algorithm, where statement 20 computes M_k and statement 10 applies M_k to A, is shown below.

      DO 10 K = 1,N-1
        DO 20 I = K+1,N
20        A(I,K) = A(I,K) / A(K,K)
        DO 10 J = K+1,N
          DO 10 I = K+1,N
10          A(I,J) = A(I,J) - A(I,K) * A(K,J)

Unfortunately, this algorithm exhibits poor cache performance on large matrices. To improve its cache performance, scientists have developed a block algorithm that essentially groups a number of updates to the matrix A and applies them together to a block portion of the array [DDSvdV91]. To attain the best blocking, strip-mine-and-interchange is performed on the outer K-loop for only a portion of the inner loop nest, requiring the technique described in Section 5.1.3 for automatic derivation of the block algorithm. Consider the strip-mined version of LU decomposition below.

      DO 10 K = 1,N-1,KS
        DO 10 KK = K,K+KS-1
          DO 20 I = KK+1,N
20          A(I,KK) = A(I,KK)/A(KK,KK)
          DO 10 J = KK+1,N
            DO 10 I = KK+1,N
10            A(I,J) = A(I,J) - A(I,KK) * A(KK,J)

To complete the blocking of this loop, the KK-loop would have to be distributed around the loop that surrounds statement 20 and around the loop nest that surrounds statement 10 before being interchanged to the innermost position. However, there is a recurrence between statements 20 and 10 carried by the KK-loop that prevents distribution unless index-set splitting is done. If we analyze the regions of the array A accessed for the entire execution of the KK-loop, we find that the region touched by statement 20 is a subset of the region touched by statement 10 (Figure 5.6 gives a graphical description of the data regions).

      Figure 5.6  Regions Accessed in LU Decomposition

Since the recurrence exists for only a portion of the iteration space, we can split the larger region, defined by the reference to A(I,J) in statement 10, at the point J = K+KS-1. The new loop that covers the disjoint region is shown below.

      DO 10 KK = K,K+KS-1
        DO 10 J = K+KS,N
          DO 10 I = KK+1,N
10          A(I,J) = A(I,J) - A(I,KK) * A(KK,J)

Now, we can use triangular interchange to put the KK-loop in the innermost position. At this point, we have obtained the best block algorithm, making LU decomposition blockable (see Figure 5.7). Not only does this block algorithm exhibit better data locality, it also has increased parallelism, as the J-loop that surrounds statement 10 can be made parallel.

      DO 10 K = 1,N-1,KS
        DO 20 KK = K,MIN(K+KS-1,N-1)
          DO 30 I = KK+1,N
30          A(I,KK) = A(I,KK)/A(KK,KK)
          DO 20 J = KK+1,K+KS-1
            DO 20 I = KK+1,N
20            A(I,J) = A(I,J) - A(I,KK) * A(KK,J)
        DO 10 J = K+KS,N
          DO 10 I = K+1,N
            DO 10 KK = K,MIN(MIN(K+KS-1,N-1),I-1)
10            A(I,J) = A(I,J) - A(I,KK) * A(KK,J)

      Figure 5.7  Block LU Decomposition

At this point, we would like to perform unroll-and-jam and scalar replacement to further improve the performance of block LU decomposition. Unfortunately, the true dependence from A(I,J) to A(KK,J) in statement 10 is inconsistent and, when moved inward by unroll-and-jam, would prevent code motion of the assignment to A(I,J). However, if we examine the array sections of the source and sink of the dependence after splitting for trapezoidal and triangular regions, we find that the dependence does not exist in an unrolled loop body.
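As an illustration of what these transformations would produce on the update nest of Figure 5.7, the self-checking sketch below (ours, not the thesis implementation; the sizes are sample values and the extent of the J-loop is assumed even) unrolls the J-loop by two, jams the bodies, and scalar-replaces the two accumulators and A(I,KK).

      PROGRAM UPDDEM
C     Illustration only: the update nest of Figure 5.7 after
C     unroll-and-jam of J by 2 and scalar replacement, checked
C     against the untransformed nest.  Sample sizes; the J extent
C     N-(K+KS)+1 is assumed even.
      INTEGER N, K, KS, I, J, KK
      PARAMETER (N = 10, K = 1, KS = 4)
      DOUBLE PRECISION A(N,N), A2(N,N), T0, T1, AIK, DIFF
      DO 5 J = 1,N
        DO 5 I = 1,N
          A(I,J) = 1.0D0/DBLE(I+J)
5         A2(I,J) = A(I,J)
C     Reference: untransformed update nest from Figure 5.7.
      DO 10 J = K+KS,N
        DO 10 I = K+1,N
          DO 10 KK = K,MIN(MIN(K+KS-1,N-1),I-1)
10          A(I,J) = A(I,J) - A(I,KK) * A(KK,J)
C     Transformed: T0 and T1 keep the two accumulators in
C     registers across the KK-loop; AIK loads A2(I,KK) once for
C     both unrolled statements.
      DO 20 J = K+KS,N,2
        DO 20 I = K+1,N
          T0 = A2(I,J)
          T1 = A2(I,J+1)
          DO 30 KK = K,MIN(MIN(K+KS-1,N-1),I-1)
            AIK = A2(I,KK)
            T0 = T0 - AIK * A2(KK,J)
30          T1 = T1 - AIK * A2(KK,J+1)
          A2(I,J) = T0
20        A2(I,J+1) = T1
C     The difference should be zero.
      DIFF = 0.0D0
      DO 40 J = 1,N
        DO 40 I = 1,N
40        DIFF = MAX(DIFF, ABS(A(I,J)-A2(I,J)))
      PRINT *, 'max difference = ', DIFF
      END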
We applied our algorithm by hand to LU decomposition and compared its performance with the original program and a hand-coded version of the right-looking algorithm [DDSvdV91]. In the table below, "Block 1" refers to the right-looking version and "Block 2" refers to our algorithm in Figure 5.7. In addition, we used our automatic system to apply trapezoidal unroll-and-jam and scalar replacement to our blocked code, producing the version referred to as "Block 2+". The experiment was run on an IBM RS/6000 model 540 using DOUBLE-PRECISION REALs. The reader should note that these final transformations could have been applied to the Sorensen version as well, with similar improvements.

      Array Size  Block Size  Original  Block 1  Block 2  Block 2+  Speedup
      300x300     32          1.47s     1.37s    1.35s    0.49s     3.00
      300x300     64          1.47s     1.42s    1.38s    0.58s     2.53
      500x500     32          6.76s     6.58s    6.44s    2.13s     3.17
      500x500     64          6.76s     6.59s    6.38s    2.27s     2.98

5.3.2 LU Decomposition with Partial Pivoting

Although the compiler can discover the potential for blocking in LU decomposition without pivoting using dependence information, the same cannot be said when partial pivoting for numerical stability is added to the algorithm. Using the matrix formulation

      U = M_{n-1} P_{n-1} \cdots M_3 P_3 M_2 P_2 M_1 P_1 A,

the point version that includes partial pivoting can be derived (see Figure 5.8) [Ste73]. While we can apply index-set splitting to the algorithm in Figure 5.8 after strip mining to break the recurrence carried by the new KK-loop involving statements 10 and 40, as in the previous section, we cannot break the recurrence involving statements 10 and 25 using this technique. After index-set splitting, we have the following relevant sections of code.

      DO 10 KK = K,K+KS-1
C       ...
        DO 30 J = 1,N
          TAU = A(KK,J)
25        A(KK,J) = A(IMAX,J)
30        A(IMAX,J) = TAU
C       ...
        DO 10 J = K+KS,N
          DO 10 I = KK+1,N
10          A(I,J) = A(I,J) - A(I,KK) * A(KK,J)

(Footnote: The dependence from A(I,J) to A(KK,J) in statement 10 needed to be deleted in order for our system to work. Not only do we need section analysis to handle the dependence, but PFC also incorrectly reported the dependence as interchange preventing.)

Distributing the KK-loop around both J-loops would convert what was originally a true dependence from A(I,J) in statement 10 to A(IMAX,J) in statement 25 into an anti-dependence in the reverse direction. The rules for the preservation of data dependence prohibit reversing a dependence direction, which would seem to preclude the existence of a block analogue similar to the non-pivoting case. However, a block algorithm that essentially ignores the preventing recurrence, and that is similar to the non-pivoting case, can still be mathematically derived using the following result from linear algebra (see Figure 5.9) [DDSvdV91, Ste73]. If we have

      M_1 = \begin{pmatrix} 1 & 0 \\ m_1 & I \end{pmatrix}, \qquad
      P_2 = \begin{pmatrix} 1 & 0 \\ 0 & \hat{P}_2 \end{pmatrix},

then

      P_2 M_1 = \begin{pmatrix} 1 & 0 \\ 0 & \hat{P}_2 \end{pmatrix}
                \begin{pmatrix} 1 & 0 \\ m_1 & I \end{pmatrix}
              = \begin{pmatrix} 1 & 0 \\ \hat{P}_2 m_1 & \hat{P}_2 \end{pmatrix}
              = \begin{pmatrix} 1 & 0 \\ \hat{P}_2 m_1 & I \end{pmatrix}
                \begin{pmatrix} 1 & 0 \\ 0 & \hat{P}_2 \end{pmatrix}
              = \hat{M}_1 P_2.                                        (5.2)

      DO 10 K = 1,N-1
        TAU = ABS(A(K,K))
        IMAX = K
        DO 20 I = K+1,N
          IF (ABS(A(I,K)) .LE. TAU) GOTO 20
          IMAX = I
          TAU = ABS(A(I,K))
20      CONTINUE
        DO 30 J = 1,N
          TAU = A(K,J)
25        A(K,J) = A(IMAX,J)
30        A(IMAX,J) = TAU
        DO 40 I = K+1,N
40        A(I,K) = A(I,K)/A(K,K)
        DO 10 J = K+1,N
          DO 10 I = K+1,N
10          A(I,J) = A(I,J) - A(I,K) * A(K,J)

      Figure 5.8  LU Decomposition with Partial Pivoting

(Footnote: In this version of the algorithm, row interchanges are performed across every column. Although this is not necessary (they can be done only for columns K through N), it is done to allow the derivation of the block algorithm.)

      DO 10 K = 1,N-1,KS
        DO 20 KK = K,MIN(K+KS-1,N-1)
          TAU = ABS(A(KK,KK))
          IMAX = KK
          DO 30 I = KK+1,N
            IF (ABS(A(I,KK)) .LE. TAU) GOTO 30
            IMAX = I
            TAU = ABS(A(I,KK))
30        CONTINUE
          DO 40 J = 1,N
            TAU = A(KK,J)
25          A(KK,J) = A(IMAX,J)
40          A(IMAX,J) = TAU
          DO 50 I = KK+1,N
50          A(I,KK) = A(I,KK)/A(KK,KK)
          DO 20 J = KK+1,K+KS-1
            DO 20 I = KK+1,N
20            A(I,J) = A(I,J) - A(I,KK) * A(KK,J)
        DO 10 J = K+KS,N
          DO 10 I = K+1,N
            DO 10 KK = K,MIN(MIN(K+KS-1,N-1),I-1)
10            A(I,J) = A(I,J) - A(I,KK) * A(KK,J)

      Figure 5.9  Block LU Decomposition with Partial Pivoting

This result shows that we can postpone the application of the eliminator M_1 until after the application of the permutation matrix P_2 if we also permute the rows of the eliminator. Extending Equation 5.2 to the entire formulation, we have

      U = M_{n-1} P_{n-1} M_{n-2} P_{n-2} M_{n-3} P_{n-3} \cdots M_1 P_1 A
        = M_{n-1} \hat{M}_{n-2} P_{n-1} P_{n-2} M_{n-3} P_{n-3} \cdots M_1 P_1 A
        = M_{n-1} \hat{M}_{n-2} \hat{M}_{n-3} P_{n-1} P_{n-2} P_{n-3} \cdots M_1 P_1 A
        = M_{n-1} \hat{M}_{n-2} \hat{M}_{n-3} \cdots \hat{M}_1 P_{n-1} P_{n-2} P_{n-3} \cdots P_1 A.

In the implementation of the block algorithm, P_i cannot be computed until step i of the point algorithm. P_i depends only upon the first i columns of A, allowing the computation of k P_i's and \hat{M}_i's, where k is the blocking factor, and then the block application of the \hat{M}_i's, as is done in Figure 5.9 [DDSvdV91].

To install the above result into the compiler, we examine its implications from a data-dependence viewpoint. In the point version, each row interchange is followed by a whole-column update in which each row element is updated independently. In the block version, multiple row interchanges may occur before a particular column is updated. The same computations (column updates) are performed in both the point and block versions, but these computations may occur in different locations (rows) of the array because of the application of P_{i+1} to M_i. The key concept for the compiler to understand is that row interchanges and whole-column updates are commutable operations. Data dependence alone is not sufficient to understand this. A data-dependence relation maps values to memory locations. It reveals the sequence of values that pass through a particular location. In the block version of LU decomposition, the sequence of values that pass through a location is different from the point version, although the final values are identical. Therefore, from the point of view of a compiler that only understands data dependence, LU decomposition with partial pivoting is not blockable. Fortunately, a compiler can be equipped to understand that operations on whole columns are commutable with row permutations. To upgrade the compiler, one would have to install pattern matching to recognize both the row permutations and the whole-column updates to prove that the recurrence involving statements 10 and 25 of the index-set split code could be ignored. Forms of pattern matching are already done in commercially available compilers, so it is reasonable to believe that we can recognize the situation in LU decomposition. The question is, however, "Will the increase in knowledge be profitable?" To see the potential profitability of making the compiler more sophisticated, consider the table below, where "Block" refers to the algorithm given in Figure 5.9 and "Block+" refers to that algorithm after unroll-and-jam and scalar replacement. This experiment was run on an IBM RS/6000 model 540 using DOUBLE-PRECISION REALs.

      Array Size  Block Size  Original  Block  Block+  Speedup
      300x300     32          1.52s     1.42s  0.58s   2.62
      300x300     64          1.52s     1.48s  0.67s   2.27
      500x500     32          7.01s     6.85s  2.58s   2.72
      500x500     64          7.01s     6.83s  2.73s   2.57
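Equation 5.2 can also be sanity-checked numerically. The short program below (our illustration; the multipliers and vector are sample data) applies the eliminator and the row interchange in both orders, permuting the multipliers in the second, and prints the two matching results.

      PROGRAM COMMUT
C     Illustration only: numerical check of Equation 5.2.
C     P2*(M1*x) equals MH1*(P2*x), where MH1 is M1 with its
C     multipliers permuted by the same row interchange.
      DOUBLE PRECISION X(3), Y1(3), Y2(3), M(3), T
      INTEGER I
      DATA X /2.0D0, 5.0D0, 7.0D0/
C     Multipliers of the eliminator M1 (first column below the 1).
      DATA M /0.0D0, 0.5D0, 0.25D0/
C     Y1 = P2 * (M1 * x): eliminate, then swap rows 2 and 3.
      DO 10 I = 1,3
10      Y1(I) = X(I) + M(I)*X(1)
      T = Y1(2)
      Y1(2) = Y1(3)
      Y1(3) = T
C     Y2 = MH1 * (P2 * x): swap rows 2 and 3 of x first, then
C     apply the eliminator with its multipliers swapped too.
      Y2(1) = X(1)
      Y2(2) = X(3) + M(3)*X(1)
      Y2(3) = X(2) + M(2)*X(1)
C     The two columns printed should be identical.
      DO 20 I = 1,3
20      PRINT *, Y1(I), Y2(I)
      END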
5.3.3 QR Decomposition with Householder Transformations

The key to Gaussian elimination is the multiplication of the matrix A by a series of elementary lower triangular matrices that introduce zeros below each diagonal element. Any class of matrices that has this property can be used to solve a system of linear equations. One such class, having orthonormal columns, is used in QR decomposition [Ste73]. If A has linearly independent columns, then A can be written uniquely in the form A = QR, where Q has orthonormal columns, Q Q^T = I, and R is upper triangular with positive diagonal elements. One class of matrices that fits the properties of Q is elementary reflectors, or Householder transformations, of the form I - 2vv^T. The point algorithm for this form of QR decomposition consists of iteratively applying the elementary reflector V_k = I - 2 v_k v_k^T to A_k to obtain A_{k+1} for k = 1, ..., n-1. Each V_k eliminates the values below the diagonal in the k-th column. For a more detailed discussion of the QR algorithm and the computation of V_k, see Stewart [Ste73].

Although pivoting is not necessary for QR decomposition, the best block algorithm is not an aggregation of the original algorithm. The block application of a number of elementary reflectors involves both computation and storage that do not exist in the original algorithm [DDSvdV91]. Given

      A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},

the first step is to factor

      \begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix}
        = \hat{Q} \begin{pmatrix} R_{11} \\ 0 \end{pmatrix}, \qquad
      \hat{Q} = (I - 2 v_1 v_1^T)(I - 2 v_2 v_2^T) \cdots (I - 2 v_b v_b^T)
              = I - 2 V T V^T,

and then to solve

      \hat{Q} \begin{pmatrix} \hat{A}_{12} \\ \hat{A}_{22} \end{pmatrix}
        = \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix}.

The difficulty for the compiler comes in the computation of I - 2VTV^T, because it involves space and computation that did not exist in the original point algorithm. To illustrate this, consider the case where the block size is 2:

      \hat{Q} = (I - 2 v_1 v_1^T)(I - 2 v_2 v_2^T)
              = I - 2 (v_1\ v_2)
                \begin{pmatrix} 1 & -2(v_1^T v_2) \\ 0 & 1 \end{pmatrix}
                \begin{pmatrix} v_1^T \\ v_2^T \end{pmatrix}.

Here, the computation of the matrix

      T = \begin{pmatrix} 1 & -2(v_1^T v_2) \\ 0 & 1 \end{pmatrix}

is not part of the original algorithm, making it impossible to determine the computation of \hat{Q} from the data-dependence information. The expression of this block algorithm requires the choice of a machine-dependent blocking factor. We know of no way to express this algorithm in a current programming language in a manner that would allow a compiler to choose that factor automatically. Can we enhance the expressibility of a language to allow block algorithms to be stated in a machine-independent form? One possible solution is to define looping constructs whose semantics allow the compiler complete freedom in choosing the blocking factor. In Section 5.4, we will address this issue.
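The aggregation can be verified numerically for b = 2. The program below (our illustration; v1 and v2 are arbitrary sample vectors) forms the product of the two reflectors directly and through the I - 2VTV^T form and prints the elementwise differences, which should all be zero up to rounding.

      PROGRAM HHBLK
C     Illustration only, with sample vectors: check that
C     (I - 2 v1 v1')(I - 2 v2 v2') = I - 2 V T V', where
C     V = (v1 v2) and T = ( 1  -2 v1'v2 ; 0  1 ).
      DOUBLE PRECISION V1(3), V2(3), H1(3,3), H2(3,3)
      DOUBLE PRECISION Q1(3,3), Q2(3,3), V(3,2), T(2,2), W(3,2)
      DOUBLE PRECISION D, S
      INTEGER I, J, K
      DATA V1 /1.0D0, 0.5D0, -0.25D0/
      DATA V2 /0.5D0, 2.0D0, 1.0D0/
C     Build the two reflectors and V.
      DO 10 I = 1,3
        DO 10 J = 1,3
          D = 0.0D0
          IF (I .EQ. J) D = 1.0D0
          H1(I,J) = D - 2.0D0*V1(I)*V1(J)
10        H2(I,J) = D - 2.0D0*V2(I)*V2(J)
      DO 20 I = 1,3
        V(I,1) = V1(I)
20      V(I,2) = V2(I)
C     Q1 = H1 * H2, the product form.
      DO 30 I = 1,3
        DO 30 J = 1,3
          Q1(I,J) = 0.0D0
          DO 30 K = 1,3
30          Q1(I,J) = Q1(I,J) + H1(I,K)*H2(K,J)
C     T = ( 1  -2 v1'v2 ; 0  1 ).
      S = 0.0D0
      DO 40 I = 1,3
40      S = S + V1(I)*V2(I)
      T(1,1) = 1.0D0
      T(1,2) = -2.0D0*S
      T(2,1) = 0.0D0
      T(2,2) = 1.0D0
C     W = V * T, then Q2 = I - 2 * W * V', the aggregated form.
      DO 50 I = 1,3
        DO 50 J = 1,2
          W(I,J) = 0.0D0
          DO 50 K = 1,2
50          W(I,J) = W(I,J) + V(I,K)*T(K,J)
      DO 60 I = 1,3
        DO 60 J = 1,3
          D = 0.0D0
          IF (I .EQ. J) D = 1.0D0
          Q2(I,J) = D
          DO 60 K = 1,2
60          Q2(I,J) = Q2(I,J) - 2.0D0*W(I,K)*V(J,K)
      DO 70 I = 1,3
70      PRINT *, (Q1(I,J) - Q2(I,J), J = 1,3)
      END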
5.3.4 QR Decomposition with Givens Rotations

Another form of orthogonal matrix that can be used in QR decomposition is the Givens rotation matrix [Sew90]. We currently know of no best block algorithm to derive, so instead we show that the index-set splitting technique described in Section 5.1.3 and IF-inspection have wider applicability. Consider the Fortran code for Givens QR shown in Figure 5.10 (note that this algorithm does not check for overflow) [Sew90].

      DO 10 L = 1,N
        DO 10 J = L+1,M
          IF (A(J,L) .EQ. 0.0) GOTO 10
          DEN = DSQRT(A(L,L)*A(L,L) + A(J,L)*A(J,L))
          C = A(L,L)/DEN
          S = A(J,L)/DEN
          DO 10 K = L,N
            A1 = A(L,K)
            A2 = A(J,K)
            A(L,K) = C*A1 + S*A2
10          A(J,K) = -S*A1 + C*A2

      Figure 5.10  QR Decomposition with Givens Rotations

The references to A in the inner K-loop have a long stride between successive accesses, resulting in poor cache performance. Our algorithm from Chapter 4 would recommend interchanging the J-loop to the innermost position, giving stride-one access to the references to A(J,K) and making the references to A(L,K) invariant with respect to the innermost loop. In this case, loop interchange would necessitate distribution of the J-loop around the IF-block and the K-loop. However, a recurrence consisting of a true dependence and an antidependence between the definition of A(L,K) and the use of A(L,L) seems to prevent distribution. Examining the regular sections for these references reveals that the recurrence exists only for the element A(L,L), allowing index-set splitting of the K-loop at L, IF-inspection of the J-loop, distribution (with scalar expansion) and interchange, as shown in Figure 5.11 [KKP+81].

      DO 10 L = 1,N
        DO 20 J = L+1,M
          IF (A(J,L) .EQ. 0.0) GOTO 20
          DEN = DSQRT(A(L,L)*A(L,L) + A(J,L)*A(J,L))
          C(J) = A(L,L)/DEN
          S(J) = A(J,L)/DEN
          A1 = A(L,L)
          A2 = A(J,L)
          A(L,L) = C(J)*A1 + S(J)*A2
          A(J,L) = -S(J)*A1 + C(J)*A2
C         IF-inspection code
20      CONTINUE
        DO 10 K = L+1,N
          DO 10 JN = 1,JC
            DO 10 J = JLB(JN),JUB(JN)
              A1 = A(L,K)
              A2 = A(J,K)
              A(L,K) = C(J)*A1 + S(J)*A2
10            A(J,K) = -S(J)*A1 + C(J)*A2

      Figure 5.11  Optimized QR Decomposition with Givens Rotations

Below is a table of the results of the performance of Givens QR using DOUBLE-PRECISION REALs run on an IBM RS/6000 model 540.

      Array Size  Original  Optimized  Speedup
      300x300     6.86s     3.37s      2.04
      500x500     84.0s     15.3s      5.49

5.4 Language Extensions

The examination of QR decomposition with Householder transformations has shown that some block algorithms cannot be derived by a compiler from their corresponding point algorithms. In order for us to maintain our goal of machine-independent coding styles, we need to allow the expression of these types of block algorithms in a machine-independent form. Specifically, we need to direct the compiler to pick the machine-dependent blocking factor for an algorithm automatically. To this end, we present a preliminary proposal for two looping constructs to guide the compiler's choice of blocking factor. These constructs are BLOCK DO and IN DO. BLOCK DO specifies a DO-loop whose blocking factor is chosen by the compiler. IN DO specifies a DO-loop that executes over the region defined by a corresponding BLOCK DO and guides the compiler to the regions that it should analyze to determine the blocking factor. The bounds of an IN DO statement are optional. If they are not expressed, the bounds are assumed to start at the first value in the specified block and end at the last value, with a step of 1. To allow indexing within a block region, we define LAST to return the last index value in a block.
For example, if LU decomposition were not a blockable algorithm, it could be coded as in Figure 5.12 to achieve machine independence.

      BLOCK DO K = 1,N-1
        IN K DO KK
          DO I = KK+1,N
            A(I,KK) = A(I,KK)/A(KK,KK)
          ENDDO
          DO J = KK+1,LAST(K)
            DO I = KK+1,N
              A(I,J) = A(I,J) - A(I,KK) * A(KK,J)
            ENDDO
          ENDDO
        ENDDO
        DO J = LAST(K)+1,N
          DO I = K+1,N
            IN K DO KK = K,MIN(LAST(K),I-1)
              A(I,J) = A(I,J) - A(I,KK) * A(KK,J)
            ENDDO
          ENDDO
        ENDDO
      ENDDO

      Figure 5.12  Block LU in Extended Fortran

The principal advantage of the extensions is that the programmer can express a non-blockable algorithm in a natural block form, while leaving the machine-dependent details, namely the choice of blocking factor, to the compiler. In the case of LAPACK, the language extensions could be used, when necessary, to code the algorithms for a source-level library that is independent of the choice of blocking factor. Then, using compiler technology, the library could be ported from machine to machine and still retain good performance. By doing so, we would remove the only machine-dependency problem of LAPACK and make it more accessible for new architectures. To realize this goal, research efforts must focus on effective techniques for the choice of blocking factor.

5.5 Summary

We have set out to determine whether a compiler can automatically restructure computations well enough to avoid the need for hand blocking and encourage machine-independent programming. To that end, we have examined a collection of programs similar to LAPACK for which we were able to acquire both the block and corresponding point algorithms. For each of these programs, we determined whether a plausible compiler technology could succeed in obtaining the block version from the point algorithm. The results of this study are encouraging: we can block triangular, trapezoidal and rhomboidal loops, and we have found that many of the problems introduced by complex dependence patterns can be overcome by the use of the transformation known as index-set splitting. In many cases, index-set splitting yields codes that exhibit performance at least as good as the best block algorithms produced by the LAPACK developers. In addition, we have shown that, in the special case of LU decomposition with partial pivoting, knowledge about which operations commute can enable a compiler to succeed in blocking codes that could not be blocked by any compiler based strictly on dependence analysis.

Chapter 6  Conclusion

This dissertation has dealt with compiler-directed management of the memory hierarchy. We have described algorithms to improve the balance between floating-point operations and memory requirements in program loops by reducing the number of memory references and improving cache performance. These algorithms work under the assumption that the compiler for the target machine is effective at scheduling and register allocation. The implementation of our algorithms has validated our methodology by showing that integer-factor speedups over quality commercial optimizers are possible on whole applications. Our hope is that these results will encourage programmers to write their applications in a natural, machine-independent form, leaving the compiler to handle machine-dependent optimization details. In this chapter, we review this thesis. First, we discuss the contributions that we have made to register allocation and automatic management of cache. Next, we discuss the issues related to cache performance that still must be solved, and finally, we present some closing remarks.

6.1 Contributions

6.1.1 Registers

We have developed and implemented an algorithm to perform scalar replacement in the presence of inner-loop conditional control flow. The goal of scalar replacement is to expose the flow of values in arrays with scalar temporaries so that standard data-flow analysis will discover the potential for register allocation.
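As a minimal illustration of the transformation (our example, not one of the loops from the experiments), consider a loop in which the value stored to A(I) is re-loaded as A(I-1) on the next iteration; scalar replacement keeps that value in a scalar temporary, removing one load per iteration.

      PROGRAM SRDEMO
C     Minimal scalar-replacement sketch (illustration only).
      INTEGER N, I
      PARAMETER (N = 8)
      REAL A(N), A2(N), B(N), T
      DO 5 I = 1,N
        B(I) = REAL(I)
        A(I) = 1.0
5       A2(I) = 1.0
C     Before: A(I-1) re-loads the value stored to A(I) on the
C     previous iteration.
      DO 10 I = 2,N
10      A(I) = A(I-1) + B(I)
C     After: the flowing value is held in the scalar T, which a
C     standard coloring allocator can keep in a register.
      T = A2(1)
      DO 20 I = 2,N
        T = T + B(I)
20      A2(I) = T
C     The two columns printed should be identical.
      DO 30 I = 1,N
30      PRINT *, A(I), A2(I)
      END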
By mapping partial redundancy elimination to scalar replacement, we are able to replace array references whose defining value is only partially available, something that was not done before this thesis. The implementation of this algorithm has shown that significant improvements are possible on scientific applications.

We have also developed and implemented an algorithm to apply unroll-and-jam to a loop nest to improve its balance between memory references and floating-point operations automatically. Our algorithm chooses unroll amounts for one or two loops to create a loop that is as balanced as possible on a particular architecture. Included in this algorithm is an estimate of floating-point register pressure that is used to prevent spill-code insertion in the final compiled loop. The results of an experiment using this technique have shown that integer-factor speedups are possible on some applications. In particular, reductions benefit greatly from unroll-and-jam because of both improved balance and easily attained instruction-level parallelism.

The algorithms for scalar replacement and unroll-and-jam eliminate the need for hand optimization to effect register allocation of array values. Not only do the algorithms capture reuse in inner loops, but also in outer loops. Although outer-loop reuse can be obtained by hand, the process is extremely tedious and error prone and produces machine-dependent programs. In one particular case, we have shown that hand optimization actually produces slightly worse code than the automatically derived version. By relying on compiler technology to handle machine-dependent details, programs become more readable and are portable across different architectures. Because scalar replacement and unroll-and-jam can greatly increase register pressure within loops, one question might be "How many registers are enough?" The answer depends upon the balance of the target machine. For machines that can perform multiple floating-point operations per memory operation, more registers are needed to compensate for a lower memory bandwidth. However, balanced architectures will require fewer registers because memory bandwidth is not as much of a bottleneck.

6.1.2 Cache

We have developed and implemented an algorithm to attain the best loop ordering for a loop nest in relation to memory-hierarchy performance. Our algorithm is simple, but effective. It safely ignores the effects of cache interference on reuse by considering only reuse in the innermost loop. The algorithm is driven by cache-line size, access cost and miss penalty. Implementation has shown that the algorithm is capable of achieving dramatic speedups on whole applications on a single processor. Using this technology, it is possible for a programmer to order loop nests independent of the language's implementation of array storage and of the cache structure.

We have also shown that current compiler techniques are not sufficient to perform iteration-space blocking on real-world algorithms. Trapezoidal-, rhomboidal- and triangular-shaped iteration spaces, which are common in linear algebra and geophysics codes, require a transformation known as index-set splitting to be considered blockable. We have derived formulas to handle these common loop shapes with index-set splitting and shown that blocking these loops can result in significant speedups. In addition, we have applied index-set splitting to dependences that prevent iteration-space blocking. The objective is to create new loops where the preventing dependences do not exist and blocking can be performed.
Using this technique, we have been able to derive automatically the best-known block algorithms for LU decomposition with and without pivoting. Previously, compiler technology was unable to accomplish this. Unfortunately, not all block algorithms can be derived automatically by a compiler. Those block formulations that represent a change of algorithm from their corresponding point algorithm cannot be obtained automatically. To handle these situations, we have proposed that a set of programming-language extensions be developed to allow a programmer to specify block algorithms in a machine-independent manner. Finally, our study of blockability has also led to a transformation called IF-inspection to handle inner loops that are guarded by control conditions. With this transformation, we determine exactly which iterations of an innermost loop will execute, and then we optimize the memory performance of the loop nest for those iterations. Large integer-factor speedups have been shown to be possible with IF-inspection.

6.2 Future Work

Although we have addressed many memory-hierarchy issues in this dissertation, there is still much left to do. Most of that work lies in the area of cache management. In this section, we survey some of the major issues yet to be solved.

In the computation of loop balance, memory references are assigned a uniform cost under the assumption that all accesses are made out of the cache. This is not always the case. Can we do a better job of computing the memory requirements of a loop? By applying the memory-hierarchy cost model presented in Chapter 4, we can compute the actual number of cycles needed to access memory within the innermost loop body. Using this measurement, we will be able to get a better measure of the relationship between computation and memory cycles, resulting in a better measure of loop balance. The question is "Does this increased precision matter to performance?" An implementation and comparison of the two computations of balance would answer the question.

Our treatment of iteration-space blocking for cache is not complete. Although we have studied extensions to current compiler techniques that allow a larger class of algorithms to be blocked automatically, these additional techniques alone are insufficient for implementation in a real compiler. Picking block sizes and dealing with cache interference caused by set associativity are two issues that must be solved to make automatic blocking a viable technology. The optimal block size for a loop depends upon the behavior of the set associativity of the cache and has been shown to be difficult to determine [LRW91]. Can the compiler predict these effects at compile time, or is it hopeless to perform automatic blocking with today's cache architectures? Additionally, an implementation of the techniques developed in Chapter 5 needs to be done to show their viability.

As our study of blockability revealed, not all algorithms can be written in a style that allows them to be blocked optimally. The language extensions presented in Chapter 5 provide a vehicle for programmers to express those block algorithms in a machine-independent manner. It must be determined exactly which extensions are needed and how to implement them effectively. Assuming that we can automatically determine block sizes, can we use the language extensions to allow variable block sizes to increase performance?

One transformation for memory performance that has not been discussed in this thesis is software prefetching.
Previous work in this area has ignored the effects of cache-line length on the redundancy of prefetching in the presence of limited issue slots [CKP91]. Can we take advantage of our loop-interchange algorithm to attain stride-one accesses and use the cache-line size to derive an efficient and effective method for software prefetching? Future efforts should be directed toward the development of an effective software-prefetching algorithm.

Finally, we have not studied the effects of multi-level caches on the performance of scientific applications. Many current systems use a MIPS R3000 or an Intel i860 with a second-level cache to attain better cache performance. Are current compiler techniques (with the addition of the rest of our future work) good enough to handle this increase in complexity? What is the correct way to view the higher-level caches? What is the true payoff of level-2 caches? How much improvement can be attained with the extra level of cache? These issues and others must be answered before such complex memory hierarchies become effective.

6.3 Final Remarks

The complexity in the design of modern memory hierarchies and the lack of sophistication in modern commercial compilers have put a significant burden on the programmer to achieve any large fraction of the performance available on high-performance architectures. Given that future machine designs are certain to have increasingly complex memory hierarchies, compilers will need to adopt increasingly sophisticated memory-management strategies to offset the need for programmers to perform hand optimization. It is our belief that programmers should not be burdened with architectural details, but rather should concentrate solely on program logic. To this end, our goal has been to find compiler techniques that make it possible for a programmer to express numerical algorithms naturally with the expectation of good memory-hierarchy performance. We have demonstrated that there exist readily implementable methods that can manage the floating-point register set and improve the effectiveness of cache. By accomplishing these objectives, we have taken a significant step towards achieving our goal.

Appendix A  Formulas for Non-Rectangular Iteration Spaces

A.1 Triangular Loops

A.1.1 Upper Left: α > 0

      DO 10 I = 1,N
        DO 10 J = α*I+β,M
10        loop body

      Figure A.1  Upper Left Triangular Iteration Space

Strip-Mine-and-Interchange Formula

      DO 10 I = 1,N,IS
        DO 10 J = α*I+β,M
          DO 10 II = I,MIN((J-β)/α,I+IS-1)
10          loop body

Unroll-and-Jam Formula

      DO 10 I = 1,N,IS
        DO 20 II = I,I+IS-2
          DO 20 J = α*II+β,MIN(α*(I+IS-2)+β,M)
20          loop body
        DO 10 J = α*(I+IS-1)+β,M
10        unrolled loop body

A.1.2 Upper Right: α < 0

      DO 10 I = 1,N
        DO 10 J = α*I+β,M
10        loop body

      Figure A.2  Upper Right Triangular Iteration Space

Strip-Mine-and-Interchange Formula

      DO 10 I = 1,N,IS
        DO 10 J = α*(I+IS-1)+β,M
          DO 10 II = MAX(I,(J-β)/α),I+IS-1
10          loop body

Unroll-and-Jam Formula

      DO 10 I = 1,N,IS
        DO 20 II = I+1,I+IS-1
          DO 20 J = α*II+β,MIN(α*(I+1)+β,M)
20          loop body
        DO 10 J = α*I+β,M
10        unrolled loop body

A.1.3 Lower Right: α > 0

      DO 10 I = 1,N
        DO 10 J = L,α*I+β
10        loop body

      Figure A.3  Lower Right Triangular Iteration Space

Strip-Mine-and-Interchange Formula

      DO 10 I = 1,N,IS
        DO 10 J = L,α*(I+IS-1)+β
          DO 10 II = MAX(I,(J-β)/α),I+IS-1
10          loop body

Unroll-and-Jam Formula

      DO 10 I = 1,N,IS
        DO 20 J = L,α*I+β
20        unrolled loop body
        DO 10 II = I+1,I+IS-1
          DO 10 J = MAX(α*(I+1)+β,L),α*II+β
10          loop body
A.1.4 Lower Left: α < 0

      DO 10 I = 1,N
        DO 10 J = L,α*I+β
10        loop body

      Figure A.4  Lower Left Triangular Iteration Space

Strip-Mine-and-Interchange Formula

      DO 10 I = 1,N,IS
        DO 10 J = L,α*I+β
          DO 10 II = I,MIN((J-β)/α,I+IS-1)
10          loop body

Unroll-and-Jam Formula

      DO 10 I = 1,N,IS
        DO 20 J = L,α*(I+IS-1)+β
20        unrolled loop body
        DO 10 II = I,I+IS-2
          DO 10 J = MAX(α*(I+IS-2)+β,L),α*II+β
10          loop body

A.2 Trapezoidal Loops

A.2.1 Upper-Bound MIN Function

Assume α, β > 0.

      DO 10 I = 1,N
        DO 10 J = L,MIN(α*I+β,N)
10        loop body

      Figure A.5  Trapezoidal Iteration Space with MIN Function

After Index-Set Splitting

      DO 10 I = 1,MIN(N,(N-β)/α)
        DO 10 J = L,α*I+β
10        loop body
      DO 20 I = MAX(1,MIN(N,(N-β)/α)+1),N
        DO 20 J = L,N
20        loop body

A.2.2 Lower-Bound MAX Function

Assume α, β > 0.

      DO 10 I = 1,N
        DO 10 J = MAX(α*I+β,L),N
10        loop body

      Figure A.6  Trapezoidal Iteration Space with MAX Function

After Index-Set Splitting

      DO 10 I = 1,MIN(N,(L-β)/α)
        DO 10 J = L,N
10        loop body
      DO 20 I = MAX(1,MIN(N,(L-β)/α)+1),N
        DO 20 J = α*I+β,N
20        loop body

A.3 Rhomboidal Loops

Assume α > 0.

      DO 10 I = 1,N1
        DO 10 J = α*I+N,α*I+M
10        loop body

      Figure A.7  Rhomboidal Iteration Space

Strip-Mine-and-Interchange Formula

      DO 10 I = 1,N1,IS
        DO 10 J = α*I+N,α*(I+IS-1)+M
          DO 10 II = MAX(I,(J-M)/α),MIN((J-N)/α,I+IS-1)
10          loop body

Unroll-and-Jam Formula

      DO 10 I = 1,N1,IS
        DO 20 II = I,I+IS-2
          DO 20 J = α*II+N,MIN(α*(I+IS-2)+N,α*(I+IS-2)+M)
20          loop body
        DO 30 J = α*(I+IS-1)+N,α*I+M
30        unrolled loop body
        DO 10 II = I+1,I+IS-1
          DO 10 J = MAX(α*(I+1)+N,α*(I+1)+M),α*II+M
10          loop body

Bibliography

[AC72] F.E. Allen and J. Cocke. A catalogue of optimizing transformations. In Design and Optimization of Compilers, pages 1-30. Prentice-Hall, 1972.

[AK87] J.R. Allen and K. Kennedy. Automatic translation of Fortran programs to vector form. ACM Transactions on Programming Languages and Systems, 9(4):491-542, October 1987.

[AK88] J.R. Allen and K. Kennedy. Vector register allocation. Technical Report TR86-45, Department of Computer Science, Rice University, 1988.

[AN87] A. Aiken and A. Nicolau. Loop quantization: An analysis and algorithm. Technical Report 87-821, Cornell University, March 1987.

[AS78] W. Abu-Sufah. Improving the Performance of Virtual Memory Computers. PhD thesis, Dept. of Computer Science, University of Illinois, 1978.

[ASM86] W. Abu-Sufah and A. Malony. Vector processing on the Alliant FX/8 multiprocessors. In Proceedings of the 1986 International Conference on Parallel Processing, pages 559-566, August 1986.

[BCHT90] P. Briggs, K.D. Cooper, M.W. Hall, and L. Torczon. Goal-directed interprocedural optimization. Technical Report TR90-102, Rice University, CRPC, November 1990.

[BCKT89] P. Briggs, K.D. Cooper, K. Kennedy, and L. Torczon. Coloring heuristics for register allocation. In Proceedings of the ACM SIGPLAN '89 Conference on Program Language Design and Implementation, Portland, OR, June 1989.

[BS88] M. Berry and A. Sameh. Multiprocessor schemes for solving block tridiagonal linear systems. International Journal of Supercomputer Applications, 2(3):37-57, Fall 1988.

[CAC+81] G.J. Chaitin, M.A. Auslander, A.K. Chandra, J. Cocke, M.E. Hopkins, and P.W. Markstein. Register allocation via coloring. Computer Languages, 6:45-57, January 1981.

[Cal86] D.A. Calahan. Block-oriented, local-memory-based linear equation solution on the Cray-2: Uniprocessor algorithm.
In Proceedings of the 1986 International Conference on Parallel Processing, 1986.

[CCK88] D. Callahan, J. Cocke, and K. Kennedy. Estimating interlock and improving balance for pipelined machines. Journal of Parallel and Distributed Computing, 5, 1988.

[CCK90] D. Callahan, S. Carr, and K. Kennedy. Improving register allocation for subscripted variables. In Proceedings of the SIGPLAN '90 Conference on Programming Language Design and Implementation, White Plains, NY, June 1990.

[CH84] F. Chow and J. Hennessy. Register allocation by priority-based coloring. In Proceedings of the SIGPLAN '84 Symposium on Compiler Construction, SIGPLAN Notices Vol. 19, No. 6, June 1984.

[CK77] John Cocke and Ken Kennedy. An algorithm for reduction of operator strength. Communications of the ACM, 20(11), November 1977.

[CK87] D. Callahan and K. Kennedy. Analysis of interprocedural side effects in a parallel programming environment. In Proceedings of the First International Conference on Supercomputing. Springer-Verlag, Athens, Greece, 1987.

[CKP91] D. Callahan, K. Kennedy, and A. Porterfield. Software prefetching. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, April 1991.

[CP90] D. Callahan and A. Porterfield. Data cache performance of supercomputer applications. In Supercomputing '90, 1990.

[DBMS79] J.J. Dongarra, J.R. Bunch, C.B. Moler, and G.W. Stewart. LINPACK User's Guide. SIAM Publications, Philadelphia, 1979.

[DDDH90] J.J. Dongarra, J. DuCroz, I. Duff, and S. Hammerling. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16:1-17, 1990.

[DDHH88] J.J. Dongarra, J. DuCroz, S. Hammerling, and R. Hanson. An extended set of Fortran basic linear algebra subprograms. ACM Transactions on Mathematical Software, 14:1-17, 1988.

[DDSvdV91] J.J. Dongarra, I.S. Duff, D.C. Sorensen, and H.A. van der Vorst. Solving Linear Systems on Vector and Shared-Memory Computers. SIAM, Philadelphia, 1991.

[DS88] K.H. Drechsler and M.P. Stadel. A solution to a problem with Morel and Renvoise's "Global optimization by suppression of partial redundancies". ACM Transactions on Programming Languages and Systems, 10(4):635-640, October 1988.

[Fab79] Janet Fabri. Automatic storage optimization. In Proceedings of the SIGPLAN Symposium on Compiler Construction, Denver, CO, 1979.

[GJ79] M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Co., San Francisco, 1979.

[GJG87] D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformations. In Proceedings of the First International Conference on Supercomputing. Springer-Verlag, Athens, Greece, 1987.

[GJMS88] K. Gallivan, W. Jalby, U. Meier, and A.H. Sameh. Impact of hierarchical memory systems on linear algebra design. International Journal of Supercomputer Applications, 2(1):12-48, Spring 1988.

[GKT91] G. Goff, K. Kennedy, and C.-W. Tseng. Practical dependence testing. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, Toronto, Ontario, June 1991.

[GM86] P.B. Gibbons and S.S. Muchnick. Efficient instruction scheduling for a pipelined architecture. In Proceedings of the SIGPLAN '86 Symposium on Compiler Construction, 1986.

[GS84] J.H. Griffin and M.L. Simmons. Los Alamos National Laboratory Computer Benchmarking 1983. Technical Report LA-10051-MS, Los Alamos National Laboratory, June 1984.

[HK91] P. Havlak and K. Kennedy.
An implementation of interprocedural bounded regular section analysis. IEEE Transactions on Parallel and Distributed Systems, 2(3):350-360, July 1991.

[IT88] F. Irigoin and R. Triolet. Supernode partitioning. In Conference Record of the Fifteenth ACM Symposium on the Principles of Programming Languages, pages 319-328, January 1988.

[KKP+81] D. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. Wolfe. Dependence graphs and compiler optimizations. In Conference Record of the Eighth ACM Symposium on the Principles of Programming Languages, 1981.

[KM92] K. Kennedy and K. McKinley. Optimizing for parallelism and memory hierarchy. In Proceedings of the 1992 International Conference on Supercomputing, Washington, DC, July 1992.

[Kuc78] D. Kuck. The Structure of Computers and Computations, Volume 1. John Wiley and Sons, New York, 1978.

[LHKK79] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh. Basic linear algebra subprograms for Fortran usage. ACM Transactions on Mathematical Software, 5:308-329, 1979.

[LRW91] M.S. Lam, E.E. Rothberg, and M.E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991.

[LS88] B. Liu and N. Strother. Programming in VS FORTRAN on the IBM 3090 for maximum vector performance. Computer, 21(6), June 1988.

[MR79] E. Morel and C. Renvoise. Global optimization by suppression of partial redundancies. Communications of the ACM, 22(2), February 1979.

[Por89] A.K. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Applications. PhD thesis, Rice University, May 1989.

[Sew90] G. Sewell. Computational Methods of Linear Algebra. Ellis Horwood, England, 1990.

[Ste73] G.W. Stewart. Introduction to Matrix Computations. Academic Press, New York, 1973.

[SU70] R. Sethi and J.D. Ullman. The generation of optimal code for arithmetic expressions. Journal of the ACM, 17(4):715-728, October 1970.

[Tha81] Khalid O. Thabit. Cache Management by the Compiler. PhD thesis, Rice University, November 1981.

[WL91] M.E. Wolf and M.S. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, June 1991.

[Wol82] M. Wolfe. Optimizing Supercompilers for Supercomputers. PhD thesis, University of Illinois, October 1982.

[Wol86a] M. Wolfe. Advanced loop interchange. In Proceedings of the 1986 International Conference on Parallel Processing, August 1986.

[Wol86b] M. Wolfe. Loop skewing: The wavefront method revisited. International Journal of Parallel Programming, 1986.

[Wol87] M. Wolfe. Iteration space tiling for memory hierarchies. In Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing, December 1987.

[Wol89] M. Wolfe. More iteration space tiling. In Proceedings of the Supercomputing '89 Conference, 1989.