RICE UNIVERSITY
Memory-Hierarchy Management
by
Steve Carr
A Thesis Submitted
in Partial Fulfillment of the
Requirements for the Degree
Doctor of Philosophy
Approved, Thesis Committee:
Ken Kennedy, Chairman
Noah Harding Professor of Computer Science
Keith D. Cooper
Associate Professor of Computer Science
Danny C. Sorensen
Professor of Computational and Applied Math
Houston, Texas
July, 1994
Memory-Hierarchy Management
Steve Carr

Abstract

The trend in high-performance microprocessor design is toward increasing computational power on the chip. Microprocessors can now process dramatically more data per machine cycle than previous models. Unfortunately, memory speeds have not kept pace. The result is an imbalance between computation speed and memory speed. This imbalance is leading machine designers to use more complicated memory hierarchies. In turn, programmers are explicitly restructuring codes to perform well on particular memory systems, leading to machine-specific programs.

It is our belief that machine-specific programming is a step in the wrong direction. Compilers, not programmers, should handle machine-specific implementation details. To this end, this thesis develops and experiments with compiler algorithms that manage the memory hierarchy of a machine for floating-point intensive numerical codes. Specifically, we address the following issues:

Scalar replacement. Lack of information concerning the flow of array values in standard data-flow analysis prevents the capturing of array reuse in registers. We develop and experiment with a technique to perform scalar replacement in the presence of conditional control flow to expose array reuse to standard data-flow algorithms.

Unroll-and-jam. Many loops require more data per cycle than can be processed by the target machine. We present and experiment with an automatic technique to apply unroll-and-jam to such loops to reduce their memory requirements.

Loop interchange. Cache locality in programs run on advanced microprocessors is critical to performance. We develop and experiment with a technique to order loops within a nest to attain good cache locality.

Blocking. Iteration-space blocking is a technique used to attain temporal locality within cache. Although it has been applied to "simple" kernels, there has been no investigation into its applicability over a range of algorithmic styles. We show how to apply blocking to loops with trapezoidal-, rhomboidal-, and triangular-shaped iteration spaces. In addition, we show how to overcome certain complex dependence patterns.

Experiments with the above techniques have shown that integer-factor speedups on a single chip are possible. These results reveal that many numerical algorithms can be expressed in a natural, machine-independent form while retaining good memory performance through the use of compiler optimizations.
Acknowledgments

I would like to thank the members of my committee for their input and support throughout the development of this thesis. In particular, I would like to thank Ken Kennedy for his guidance and support throughout my graduate career, especially through the numerous times when I was ready to quit. I would also like to give special thanks to Keith Cooper for being willing to listen when I had ideas that needed to be heard.

From the onset of graduate school, I found that many of my fellow students provided much-needed support and insight. Preston Briggs, Uli Kremer, Mary Hall and Ervan Darnell have given feedback on the ideas developed within this thesis. Those who began graduate school with me, Rene Rodríguez, Elmootazbellah Elnozahy and Rebecca Parsons, have been great friends and a constant source of encouragement. The members of the ParaScope research group provided the infrastructure for the implementation of the ideas developed within this thesis. Finally, special thanks goes to Ivy Jorgensen for proofreading this thesis.

Without the encouragement of my undergraduate professors at Michigan Tech, I would have never even considered graduate school. Dave Poplawski, Steve Seidel, and Karl Ottenstein provided that encouragement and I am indebted to them for it.

As with all endeavors, financial support is needed to allow one to eat. I have been supported by IBM Corporation, the National Science Foundation and DARPA at various points in my graduate career.

Finally, I wish to thank my family. My parents have supported me with love and encouragement that has been invaluable. Most of all, I thank my wife Becky, who has been the best friend that I could ever have. Her love and encouragement have done more to make my life meaningful than she could ever imagine.
Contents

Abstract
Acknowledgments
List of Illustrations

1 Introduction
1.1 Background
1.1.1 Performance Model
1.1.2 Dependence Graph
1.2 Transformations To Improve Memory Performance
1.2.1 Scalar Replacement
1.2.2 Unroll-and-Jam
1.2.3 Loop Interchange
1.2.4 Strip-Mine-And-Interchange
1.3 Related Work
1.3.1 Register Allocation
1.3.2 Memory Performance Studies
1.3.3 Memory Management
1.4 Overview

2 Scalar Replacement
2.1 Partial Redundancy Elimination
2.2 Algorithm
2.2.1 Control-Flow Analysis
2.2.2 Availability Analysis
2.2.3 Reachability Analysis
2.2.4 Potential-Generator Selection
2.2.5 Anticipability Analysis
2.2.6 Dependence-Graph Marking
2.2.7 Name Partitioning
2.2.8 Register-Pressure Moderation
2.2.9 Reference Replacement
2.2.10 Statement-Insertion Analysis
2.2.11 Register Copying
2.2.12 Code Motion
2.2.13 Initialization
2.2.14 Register Subsumption
2.3 Experiment
2.4 Summary

3 Unroll-And-Jam
3.1 Safety
3.2 Dependence Copying
3.3 Improving Balance with Unroll-and-Jam
3.3.1 Computing Transformed-Loop Balance
3.3.2 Estimating Register Pressure
3.4 Applying Unroll-and-Jam in a Compiler
3.4.1 Picking Loops to Unroll
3.4.2 Picking Unroll Amounts
3.4.3 Removing Interlock
3.4.4 Multiple Edges
3.4.5 Multiple-Induction-Variable Subscripts
3.4.6 Non-Perfectly Nested Loops
3.5 Experiment
3.6 Summary

4 Loop Interchange
4.1 Performance Model
4.1.1 Data Locality
4.1.2 Memory Cycles
4.2 Algorithm
4.2.1 Computing Loop Order
4.2.2 Non-Perfectly Nested Loops
4.3 Experiment
4.4 Summary

5 Blockability
5.0.1 Iteration-Space Blocking
5.1 Index-Set Splitting
5.1.1 Triangular Iteration Spaces
5.1.2 Trapezoidal Iteration Spaces
5.1.3 Complex Dependence Patterns
5.2 Control Flow
5.3 Solving Systems of Linear Equations
5.3.1 LU Decomposition without Pivoting
5.3.2 LU Decomposition with Partial Pivoting
5.3.3 QR Decomposition with Householder Transformations
5.3.4 QR Decomposition with Givens Rotations
5.4 Language Extensions
5.5 Summary

6 Conclusion
6.1 Contributions
6.1.1 Registers
6.1.2 Cache
6.2 Future Work
6.3 Final Remarks

A Formulas for Non-Rectangular Iteration Spaces
A.1 Triangular Loops
A.1.1 Upper Left
A.1.2 Upper Right
A.1.3 Lower Right
A.1.4 Lower Left
A.2 Trapezoidal Loops
A.2.1 Upper-Bound MIN Function
A.2.2 Lower-Bound MAX Function
A.3 Rhomboidal Loops

Bibliography

Illustrations

2.1 Partially Redundant Computation
2.2 After Partial Redundancy Elimination
2.3 FindPotentialGenerators
2.4 MarkDependenceGraph
2.5 GenerateNamePartitions
2.6 CalculateBenefit
2.7 ModerateRegisterPressure
2.8 ReplaceReferences
2.9 InsertStatements
2.10 InsertRegisterCopies
2.11 CodeMotion
2.12 Initialize
2.13 Subsume
2.14 Example After Scalar Replacement
2.15 Experimental Design

3.1 Distance Vectors Before and After Unroll-and-Jam
3.2 V Before and After Unroll-and-Jam
3.3 VrC Before and After Unroll-and-Jam
3.4 VrI Before and After Unroll-and-Jam
3.5 PickLoops
3.6 PartitionNodes
3.7 Compute Coefficients for Optimization Problem
3.8 Distribute Loops for Unroll-and-Jam

4.1 Example Loop for Memory Costs
4.2 Algorithm for Ordering Loops
4.3 Algorithm for Non-Perfectly Nested Loops

5.1 Upper Left Triangular Iteration Space
5.2 Trapezoidal Iteration Space with Rectangle
5.3 Trapezoidal Iteration Space with Rhomboid
5.4 Data Space for A
5.5 Matrix Multiply After IF-Inspection
5.6 Regions Accessed in LU Decomposition
5.7 Block LU Decomposition
5.8 LU Decomposition with Partial Pivoting
5.9 Block LU Decomposition with Partial Pivoting
5.10 QR Decomposition with Givens Rotations
5.11 Optimized QR Decomposition with Givens Rotations
5.12 Block LU in Extended Fortran

A.1 Upper Left Triangular Iteration Space
A.2 Upper Right Triangular Iteration Space
A.3 Lower Right Triangular Iteration Space
A.4 Lower Left Triangular Iteration Space
A.5 Trapezoidal Iteration Space with MIN Function
A.6 Trapezoidal Iteration Space with MAX Function
A.7 Rhomboidal Iteration Space
Chapter 1
Introduction

Over the past decade, microprocessor design strategies have focused on increasing the computational power available on a single chip. These advances in power have been achieved not only through reduced cycle times, but also via architectural changes such as multiple instruction issue and pipelined floating-point functional units. The resulting microprocessors can process dramatically more data per machine cycle than previous models. Unfortunately, the performance of memory has not kept pace. The result has been an increase in the number of cycles for a memory access (a latency of 10 to 20 machine cycles is now quite common), causing an imbalance between the rate at which computations can be performed and the rate at which operands can be delivered onto the chip.

To ameliorate these problems, machine designers have turned increasingly to complex memory hierarchies. For example, the Intel i860XR has an on-chip cache memory for 8K bytes of data and there are several systems that use two levels of cache with a MIPS R3000. Still, these systems perform poorly on scientific calculations that are memory intensive and are not structured to take advantage of the target machine's memory hierarchy.

This situation has led many programmers to restructure their codes by hand to improve performance in the memory hierarchy. We believe that this is a step in the wrong direction. The user should not be writing programs that target a particular machine; instead, the task of specializing a program to a target machine should fall to the compiler. If this trend continues, an increasing fraction of the human resources available for science and engineering will be spent on conversion of high-level language programs from one machine to another, an unacceptable eventuality.

There is a long history of the use of sophisticated compiler optimizations to achieve machine independence. The Fortran I compiler included enough optimizations to make it possible for scientists to abandon machine language programming. More recently, advanced vectorization technology has made it possible to write machine-independent vector programs in a sublanguage of Fortran 77. Is it possible to achieve the same success for memory-hierarchy management on scalar processors? More precisely, can we enhance compiler technology to make it possible to express an algorithm in a natural, machine-independent form while achieving memory-hierarchy performance good enough to obviate the need for hand optimization?

This thesis shows that compiler technology can make it possible for a program expressed in a natural form to achieve high performance, even on a complex memory hierarchy. Compiler algorithms to manage the memory hierarchy automatically are developed and shown through experimentation to be effective. By adapting compiler technology developed for parallel and vector architectures, we are able to restructure scientific codes to achieve good memory performance on scalar architectures. In many cases, our techniques have been extremely effective, capable of achieving integer-factor speedups over code generated by a good optimizing compiler of conventional design. This accomplishment represents a step forward in the encouragement of machine-independent programming.

This chapter introduces the models, transformations and previous work on which our compiler strategy is based. Section 1.1 presents the models we use to understand machine and program behavior. Section 1.2 presents the transformations that we will use to improve program performance. Section 1.3 presents the previous work related to memory-hierarchy management and, finally, Section 1.4 gives a brief overview of the thesis.
1.1 Background
In this section, we lay the foundation for the application of our reordering transformations that improve the memory performance of programs. First, we describe a measure of machine and loop performance used in the application of the transformations described in the next section. Second, we present a special form of dependence graph that can be used by our transformation system to model memory usage.
1.1.1 Performance Model
We direct our research toward architectures that are pipelined and allow asynchronous execution of memory accesses and floating-point operations (e.g., Intel i860 or IBM RS/6000). We assume that the target machine has a typical optimizing compiler, one that performs scalar optimizations only. In particular, we assume that it performs strength reduction, allocates registers globally (via some coloring scheme) and schedules the arithmetic pipelines [CK77, CAC+81, GM86]. This makes it possible for our transformation system to restructure the loop nests while leaving the details of optimizing the loop code to the compiler.

Given these assumptions for the target machine and compiler, we will describe the notion of balance defined by Callahan et al. to measure the performance of program loops in relation to memory [CCK88]. This model will serve as the force behind the application of our transformations described throughout this thesis.
Machine Balance
A computer is balanced when it can operate in a steady-state manner with both memory accesses and floating-point operations being performed at peak speed. To quantify this relationship, we define β_M as the rate at which data can be fetched from memory, M_M, compared to the rate at which floating-point operations can be performed, F_M:

$$\beta_M = \frac{\text{max words/cycle}}{\text{max flops/cycle}} = \frac{M_M}{F_M}$$

The values of M_M and F_M represent peak performance where the size of a word is the same as the precision of the floating-point operations. Every machine has at least one intrinsic β_M (there may be one for single-precision floating point and one for double precision). For example, on the IBM RS/6000 β_M = 1 and on the DEC Alpha β_M = 3.
Although this measure can be used to determine if a machine is balanced between computation and data accesses, it is not our purpose to use balance in this manner. Instead, our goal is to evaluate the performance of a loop on a particular machine based upon that machine's particular balance ratio. To do this, we introduce the notion of loop balance.
Loop Balance
Just as machines have balance ratios, so do loops. We can define balance for a specific loop as

$$\beta_L = \frac{\text{number of memory references}}{\text{number of flops}} = \frac{M}{F}$$
We assume that references to array variables are actually references to memory, while references to scalar variables involve only registers. Memory references are assigned a uniform cost under the assumption that loop interchange, software prefetching or tiling will attain cache locality [WL91, KM92, CKP91].
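For concreteness, consider a loop whose body is A(I) = A(I-1) + B(I). The two loads and one store count as three memory references against a single floating-point addition, so

$$\beta_L = \frac{3}{1} = 3,$$

and on a machine with β_M = 1 such a loop is memory bound. (Section 1.2.1 shows how scalar replacement lowers this balance to 2.)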
Comparing β_M to β_L can give us a measure of the performance of a loop running on a particular architecture. If β_L = β_M, the loop is balanced for the machine and will run well on that particular machine. The balance measure favors no particular machine or loop balance. It reports that out-of-balance loops will run well on similarly out-of-balance architectures and balanced loops will run well on similarly balanced architectures. In addition, it reports that performance bottlenecks occur when loop and machine balance do not match. If β_L > β_M, then the loop needs data at a higher rate than the memory system can provide and idle computational cycles will exist. Such a loop is said to be memory bound and its performance can be improved by lowering β_L. If β_L < β_M, then data cannot be processed as fast as it is supplied to the processor and memory bandwidth will be wasted. Such a loop is said to be compute bound. Compute-bound loops run at the peak floating-point rate of a machine and need not be further balanced. Floating-point operations usually cannot be removed and arbitrarily increasing the number of memory operations is pointless.
1.1.2 Dependence Graph
To aid in the computation of the number of memory references for the application of our transformations, we use a form of dependence graph that exposes reuse of values. We say that a dependence exists between two references if there exists a control-flow path from the first reference to the second and both references access the same memory location [Kuc78]. The dependence is

- a true dependence or flow dependence if the first reference writes to the location and the second reads from it,
- an antidependence if the first reference reads from the location and the second writes to it,
- an output dependence if both references write to the location, and
- an input dependence if both references read from the location.

If two references, v and w, are contained in n common loops, we can refer to separate instances of the execution of the references by an iteration vector. An iteration vector, denoted $\vec{i}$, is simply the values of the loop control variables of the loops containing v and w. The set of iteration vectors corresponding to all iterations of the loop nest is called the iteration space. Using iteration vectors, we can define a distance vector, $\vec{d} = \langle d_1, d_2, \ldots, d_n \rangle$, for each consistent dependence: if v accesses location Z on iteration $\vec{i}_v$ and w accesses location Z on iteration $\vec{i}_w$, the distance vector for this dependence is $\vec{i}_w - \vec{i}_v$. Under this definition, the k-th component of the distance vector is equal to the number of iterations of the k-th loop (numbered from outermost to innermost) between accesses to Z. For example, given the following loop:
      DO 10 I = 1,N
         DO 10 J = 1,N
10          A(I,J) = A(I-1,J) + A(I-2,J+3)

the true dependence from A(I,J) to A(I-1,J) has a distance vector of ⟨1, 0⟩ and the true dependence from A(I,J) to A(I-2,J+3) has a distance vector of ⟨2, -3⟩.
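The second distance vector can be verified from the definition: the value written by A(I,J) at iteration ⟨i, j⟩ is read by A(I-2,J+3) at the iteration ⟨i', j'⟩ where i' - 2 = i and j' + 3 = j, that is, at ⟨i+2, j-3⟩, giving a distance of ⟨i+2, j-3⟩ - ⟨i, j⟩ = ⟨2, -3⟩.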
The loop associated with the outermost non-zero distance vector entry is said to be the carrier. The distance vector value associated with the carrier loop is called the threshold of the dependence. If all distance vector entries are zero, the dependence is loop independent. In determining which dependences can be used for memory analysis, we consider only those that have a consistent threshold, that is, those dependences for which the threshold is constant throughout the execution of the loop [GJG87, CCK88]. For a dependence to have a consistent threshold, it must be the case that the location accessed by the dependence source on iteration i is accessed by the sink on iteration i + γ, where γ does not vary with i. Formally, if we let

$$A(f(\vec{i})) = A(a_0 + a_1 I_1 + \cdots + a_n I_n)$$
$$A(g(\vec{i})) = A(b_0 + b_1 I_1 + \cdots + b_n I_n)$$

be array references where each a_i and b_i is a constant and each I_j is a loop induction variable (I_n is associated with the innermost loop), then we have the following definition.

Theorem 1.1 A dependence has a consistent threshold iff $a_i = b_i$ for each $1 \le i \le n$ and there exists an integer γ such that $\gamma b_n = a_0 - b_0$.

Proof. See Callahan et al. [CCK88].
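As a quick check of this condition, take the references A(I) and A(I-1) in a singly nested loop: f(i) = 0 + 1·I (a_0 = 0, a_1 = 1) and g(i) = -1 + 1·I (b_0 = -1, b_1 = 1). Here a_1 = b_1, and γ·b_1 = a_0 - b_0 becomes γ·1 = 0 - (-1), which has the integer solution γ = 1, so the dependence has a consistent threshold of one iteration.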
Some carried dependences will have multiple distance vector values associated with one entry. Consider the following loop.

      DO 10 I = 1, N
10       A(K) = A(K) + ...

The true dependence between the references to A(K) has the distances 1, 2, 3, ..., N-1 for the entry associated with the I-loop. For the purposes of memory management, we will use the minimum value, 1 in this case, as the distance vector entry.
1.2 Transformations To Improve Memory Performance
Based upon the dependence graph described in the previous section, we can apply a number of transformations to improve the performance of memory-bound programs. In this section, we give a brief introduction to these transformations.
1.2.1 Scalar Replacement
In the model of balance presented in the previous section, all array references are assumed to be memory references. The principal reason for this is that the data-flow analysis used by typical scalar compilers is not powerful enough to recognize most opportunities for reuse in subscripted variables. Arrays are treated in a particularly naive fashion, if at all, making it impossible to determine when a specific element might be reused. This, however, need not be the case. In the code shown below,

      DO 10 I = 2, N
10       A(I) = A(I-1) + B(I)

the value accessed by A(I-1) is defined on the previous iteration of the loop by A(I) on all but the first iteration. Using this knowledge, obtained via dependence analysis, the flow of values between the references can be expressed with temporaries as follows.

      T = A(1)
      DO 10 I = 2, N
         T = T + B(I)
10       A(I) = T

Since global register allocation will most likely put scalar quantities in registers, we have removed the load of A(I-1), resulting in a reduction in the balance of the loop from 3 to 2 [CAC+81, CH84, BCKT89]. This transformation is called scalar replacement and in Chapter 2, we show how to apply it to loops automatically.
1.2.2 Unroll-and-Jam
Unroll-and-jam is another transformation that can be used to improve the performance of memory-bound loops [AC72, AN87, CCK88]. The transformation unrolls an outer loop and then jams the resulting inner loops back together. Using unroll-and-jam we can introduce more computation into an innermost loop body without a proportional increase in memory references. For example, the loop:

      DO 10 I = 1, 2*M
         DO 10 J = 1, N
10          A(I) = A(I) + B(J)

after unrolling becomes:

      DO 10 I = 1, 2*M, 2
         DO 20 J = 1, N
20          A(I) = A(I) + B(J)
         DO 10 J = 1, N
10          A(I+1) = A(I+1) + B(J)

and after jamming becomes:

      DO 10 I = 1, 2*M, 2
         DO 10 J = 1, N
            A(I) = A(I) + B(J)
10          A(I+1) = A(I+1) + B(J)

In the original loop, we have one floating-point operation and one memory reference after scalar replacement, giving a balance of 1. After applying unroll-and-jam, we have two floating-point operations and still only one memory reference, giving a balance of 0.5. On a machine that can perform twice as many floating-point operations as memory accesses per clock cycle, the second loop would perform better. In Chapter 3, we will show how to apply unroll-and-jam automatically to improve loop balance.
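Since these counts assume that scalar replacement follows unroll-and-jam, a sketch of the jammed loop after scalar replacement (with hypothetical temporaries T0, T1 and T2) makes them explicit; the inner loop performs one load and two floating-point additions:

      DO 10 I = 1, 2*M, 2
C        T0 and T1 hold A(I) and A(I+1) in registers across the J-loop
         T0 = A(I)
         T1 = A(I+1)
         DO 20 J = 1, N
C           one load of B(J) feeds two additions: balance = 1/2
            T2 = B(J)
            T0 = T0 + T2
20          T1 = T1 + T2
         A(I) = T0
10       A(I+1) = T1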
1.2.3 Loop Interchange
Not only are we concerned with the number of references to memory, but also whether the data accessed by a reference is stored in cache or main memory. Consider the following Fortran loop where arrays are stored in column-major order.

      DO 10 I = 1, N
         DO 10 J = 1, N
10          A = A + B(I,J)

References to successive elements of B by B(I,J) are a long distance apart in number of memory accesses, requiring an extremely large cache to capture the potential cache-line reuse. With the likelihood of cache misses on accesses to B, we can interchange the I- and J-loops to make the distance between successive accesses small, as shown below [AK87, Wol86a].

      DO 10 J = 1, N
         DO 10 I = 1, N
10          A = A + B(I,J)

Now, we will only have a cache miss on accesses to B once every cache line, resulting in better memory performance. In Chapter 4, we derive a compiler algorithm to apply loop interchange, when safe, to a loop nest to improve cache performance.
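The benefit can be estimated with a simple model. If a cache line holds L consecutive elements of B (L is a parameter introduced here for the sketch) and the cache is assumed too small to retain whole columns across outer-loop iterations, the original stride-N loop misses on essentially every access to B, while the interchanged stride-1 loop misses only once per line:

$$\text{misses}_{\text{original}} \approx N^2, \qquad \text{misses}_{\text{interchanged}} \approx N^2 / L$$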
1.2.4 Strip-Mine-And-Interchange
Sometimes loops access more data than can be handled by a cache even after loop interchange. In these cases, the iteration space of a loop can be blocked into sections whose reuse can be captured by the cache. Strip-mine-and-interchange is a transformation that achieves this result [Wol87, Por89]. The effect is to shorten the distance between the source and sink of a dependence so that it is more likely for the datum to reside in cache when the reuse occurs. Consider the following example

      DO 10 J = 1,N
         DO 10 I = 1,M
10          A(I) = A(I) + B(I,J)

Assuming that the value of M is much greater than the size of the cache, we would miss the opportunity to reuse the values of A on each iteration of J. To capture this reuse, we can use strip-mine-and-interchange. First, we strip mine the I-loop as shown below.

      DO 10 J = 1,N
         DO 10 I = 1,M,IS
            DO 10 II = I,MIN(I+IS-1,M)
10             A(II) = A(II) + B(II,J)

Note that this loop executes the iterations of the loop in precisely the same order as the original and its reuse properties are unchanged. However, when we interchange the stripped loop with the outer loop to give

      DO 10 I = 1,M,IS
         DO 10 J = 1,N
            DO 10 II = I,MIN(I+IS-1,M)
10             A(II) = A(II) + B(II,J)

the iterations are now executed in blocks of size IS by N. With this blocking, we can reuse IS values of A out of cache for every iteration of the J-loop if IS is less than half the size of the cache.
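The factor of one half can be seen from the block's data footprint. During each iteration of the J-loop, the block touches the IS elements of A that must remain cached to be reused, plus IS new elements of B; with a cache of C words (C is a symbol introduced here for the sketch), keeping both in cache at once requires roughly

$$2 \cdot IS \le C \quad\Longrightarrow\quad IS \le C/2$$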
Together, unroll-and-jam and strip-mine-and-interchange make up a transformation technique known as iteration-space blocking. As discussed, the first is used to block for registers and the second for cache. In Chapter 5, we explore additional knowledge necessary to apply iteration-space blocking to many real-world algorithms.
1.3 Related Work
Much research has been done in memory-hierarchy management using the aforementioned transformations. In this section, we will review the previous work, noting the deficiencies that we intend to address.
1.3.1 Register Allocation
The most significant work in the area of register allocation has been done by Chaitin et al. and Chow and Hennessy [CAC+81, CH84]. Both groups cast the problem of register allocation into that of graph coloring on an interference graph, where the nodes represent scalar memory locations and an edge between two nodes prevents them from occupying the same register. The objective is to find a k-coloring of the interference graph, where k is the number of machine registers. Since graph coloring is NP-complete, heuristic methods must be applied. It is usually the case that a k-coloring is not possible, requiring values to be spilled from registers to satisfy physical constraints. This method has been shown to be very effective at allocating scalar variables to registers, but because information concerning the flow of array values is lacking, subscripted variables are not handled well.

Allen and Kennedy show how data dependence information can be used to recognize reuse of vector data and how that information can be applied to perform vector register allocation [AK88]. They also present two transformations, loop interchange and loop fusion, as methods to improve vector register allocation opportunities. Both of these transformations were originally designed to enhance loop parallelism. Because dependences represent the flow of values in an array, Allen and Kennedy suggest that this information could be used to recognize reuse in arrays used in scalar computations.

Callahan, Cocke and Kennedy have expanded these ideas to develop scalar replacement as shown in Section 1.2.1 [CCK88]. Their method is only applicable to loops that have no inner-loop conditional control flow; therefore, it has limited applicability. Also, their algorithm does not consider register pressure and may expose more reuse than can be handled by a particular machine's register file. If spill code must be inserted, the resulting performance degradation may negate most of the value of scalar replacement.

Callahan, Cocke and Kennedy also use unroll-and-jam to improve the effectiveness of scalar replacement and to increase the amount of low-level parallelism in the inner-loop body. Although the mechanics of unroll-and-jam are described in detail, there is no discussion of how or when to apply the transformation to a loop nest.

Aiken and Nicolau present a transformation identical to unroll-and-jam, called loop quantization [AN87]. However, they do not use this transformation to increase data locality, but rather to improve inner-loop parallelism. They perform a strict quantization, where the unrolled iterations of the original loop in the body of the unrolled loop are data-independent. This means that they do not improve the data locality in the innermost loop for true dependences; therefore, little reduction in memory references can be obtained.
1.3.2 Memory Performance Studies
Previous studies have shown how poor cache behavior can have disastrous effects on program performance. Abu-Sufah and Malony showed that the performance of the LANL BMK8A1 benchmark fell by a factor of as much as 2.26 on an Alliant FX/8 when vector sizes were too large to be maintained in cache [ASM86, GS84]. Similarly, Liu and Strother found that vector performance on the IBM 3090 fell by a factor of 1.40 when vector lengths exceeded the cache capacity [LS88]. In this second study, it was also shown that if the vector code were blocked into smaller sections that fit into cache, the optimal performance was regained. Porterfield reported that on computers with large memory latencies, many large scientific programs spent half of their execution time waiting for data to be delivered to cache [Por89].

A number of other studies have shown the effectiveness of blocking loops for cache performance. Gallivan et al. show that on the Alliant FX/8, the blocked version of LU decomposition is nearly 8 times faster than the unblocked version, using BLAS3 and BLAS2 respectively [GJMS88, DDHH88, DDDH90]. The BLAS2 version performs a rank 1 update of the matrix while the best BLAS3 version performs a blocked rank 96 update. Also on the Alliant FX/8, Berry and Sameh have achieved speedups of as large as 9 over the standard LINPACK versions for solving tridiagonal linear systems [BS88, DBMS79] and on the Cray-2, Calahan showed that blocking LU decomposition improved the performance by a factor of nearly 6 [Cal86].

All of these studies involved tedious hand optimization to attain maximal performance. The BLAS primitives are noteworthy examples of this methodology, in which each primitive must be recoded in assembly language to get performance on each separate architecture [LHKK79]. Hand optimization is less than ideal because it takes months to code the BLAS primitives by hand, although recoding the whole program is a worse alternative. Therefore, significant motivation exists for the development of a restructuring compiler that can optimize for any memory hierarchy and relieve the programmer from the tedious task of memory-hierarchy management.
1.3.3 Memory Management
The first major work in the area of memory management by a compiler is that of Abu-Sufah on improving virtual memory performance [AS78]. In his thesis, he describes the use of a restructuring compiler to improve the virtual memory behavior of a program. Through the use of dependence analysis, Abu-Sufah is able to perform transformations on the loop structure of a program to reduce the number of physical pages required and to group accesses to each page.

The transformations that Abu-Sufah uses to improve virtual memory performance of loops are clustering (or loop distribution), loop fusion and page indexing (a combination of strip mining, interchange and distribution). Clustering is used to split the loop into components whose name spaces (or working sets) are disjoint. Loop fusion is then used to fuse components which originated in different loops but whose name spaces intersect. The page indexing transformation is used to block a loop nest so that all the references to a page are contained in one iteration over a block, thus maximizing locality. The goal is a set of loops whose working sets are disjoint; hence, the required number of physical pages is reduced without increasing the page fault rate.

Although Abu-Sufah's transformation system shows the potential for improving a program's memory behavior, his use of loop distribution can increase the amount of pipeline interlock and cause performance degradation [CCK88]. Additionally, the fact that virtual memory is fully associative rather than set associative prevents the application of his model to cache management.
Fabri's work in automatic storage optimization concentrates on extended graph-coloring techniques to manage storage overlays in memory [Fab79]. She presents the notion of array renaming (analogous to live-range splitting) to minimize the memory requirements of a program. However, the problem of storage overlays does not map to cache management since the compiler does not have explicit control over the cache.

Thabit has examined software methods to improve cache performance through the use of packing, prefetching and loop transformations [Tha81]. He shows that optimal packing of scalars for cache-line reuse is an NP-complete problem and proposes some heuristics. However, scalars are not considered to be a significant problem because of register-allocation technology. In addition, Thabit maps graph coloring to the allocation of arrays to eliminate cache interference caused by set associativity. However, no evaluation of the effectiveness of this approach is given. In particular, he does not address the problem of an array interfering with itself. Finally, Thabit establishes the safety conditions of loop distribution and strip-mine-and-interchange.
Wolfe's memory performance work has concentrated on developing transformations to reshape loops to improve their cache performance [Wol87]. He shows how tiling (or iteration-space blocking) can be used to improve the memory performance of program loops. Wolfe also shows that his techniques for advanced loop interchange can be used to tile loops with non-rectangular iteration spaces and loops that are not perfectly nested [Wol86a]. In particular, he discusses blocking for triangular- and trapezoidal-shaped iteration spaces, but he does not present an algorithm. Instead, he illustrates the transformation by a few examples.

Irigoin and Triolet describe a new dependence abstraction, called a dependence cone, that can be used to block code for two levels of parallelism and two levels of memory [IT88]. The dependence cone gives more information than a distance vector by maintaining a system of linear inequalities that can be used to derive all dependences. The cone allows a larger set of perfectly nested loops to be transformed than other dependence abstractions by providing a general framework to partition an iteration space into supernodes (or blocks). The idea is to aggregate many loop iterations, so as to provide vector statements, parallel tasks and data reference locality. To improve memory performance, the supernodes can be chosen to maximize the amount of data locality in each supernode. Unfortunately, this technique does not work on imperfectly nested loops nor does it handle partially blockable loops, both of which occur in linear algebra codes.

Wolfe presents work that is very similar to Irigoin and Triolet's [Wol89]. He does not use the dependence cone as the dependence abstraction, but instead he uses the standard distance vector. Using loop skewing, loop interchange and strip mining, he can tile an iteration space into blocks which have both data locality and parallelism. Wolfe is limited by the transformations that he applies and by the restrictive nature of the subscripts handled with distance vectors, but he is capable of handling non-perfectly nested loops. The main advantage of Wolfe's techniques over Triolet's is that Wolfe's method is more straightforward to implement in a restructuring compiler, although the complexity of the algorithm is O(d!), where d is the loop nesting depth.
Gannon et al. present a technique to describe the amount of data that must be in the cache for reuse to be possible [GJG87]. They call this the reference window. The window for a dependence describes all of the data that is brought into the cache for the two references from the time that one datum is accessed at the source until it is used again at the sink. A family of reference windows for an array represents all of its elements that must fit into cache to capture all of the reuse. To determine if the cache is large enough to hold every window, the window sizes are summed and compared against the size of the cache. If the cache is too small, blocking transformations such as strip-mine-and-interchange can be used to decrease the size of the reference windows.

Porterfield proposes a method to determine when the data referenced in a loop does not fit entirely in cache [Por89]. He develops the idea of an overflow iteration, which is that iteration of a loop that brings in the data item which will cause the cache to overflow. Using this measurement, Porterfield can predict when a loop needs to be tiled to improve memory performance and can determine the size of a one-dimensional tile for the loop that causes the cache to overflow.
Porterfield also presents two new transformations to block loops. The first is peel-and-jam, which can be used to fuse loops that have certain types of fusion-preventing dependences by peeling off the offending iterations of the first loop and fusing the resulting loop bodies. The second is either a combination of loop skewing, interchanging and strip-mining, or loop unrolling, peeling and jamming, to perform wavefront blocking. The key technique here is the use of non-reordering transformations (skewing, peeling and unrolling) to make it possible to block loops. Some of these non-reordering transformations will become especially important when dealing with partially blockable loops.

Porterfield also discusses the applicability of blocking transformations to twelve Fortran programs. He organizes them into three categories: transformable, semi-transformable and non-transformable. One-third of the programs in his study were transformable since the blocking transformations were directly applicable. The semi-transformable programs contained coding styles that made it difficult to transform them automatically, and in the case of the non-transformable program, partial pivoting was the culprit. Porterfield claims that a compiler cannot perform iteration-space blocking in the presence of partial pivoting, but his analysis is not extensive enough to make this claim. He does not consider increasing the "intelligence" of the compiler to improve its effectiveness.
Lam and Wolf present a framework for determining memory usage within loop nests and use that framework to apply loop interchange, loop skewing, loop reversal, tiling and unroll-and-jam [WL91, Wol86b]. Their method does not work on non-perfectly nested loops and does not encompass a technique to determine unroll-and-jam amounts automatically. Additionally, they do not necessarily derive the best block algorithm with their technique, leaving the possibility that suboptimal performance is still possible.

Lam, Rothberg and Wolf present a method to determine block sizes for block algorithms automatically [LRW91]. Their results show that typical effective block sizes use less than 10% of the cache. They suggest the use of copy optimizations to remove the effects of set associativity and allow the use of larger portions of the cache.

Kennedy and McKinley present a simplified model of cache performance that only considers inner-loop reuse [KM92]. Any reuse that occurs across an outer-loop iteration is considered to be prevented by cache interference. Using this model, they describe a method to determine the number of cache lines required by a loop if it were innermost, and then they reorder the loops to use the minimum number of cache lines. This simple model is less precise than the Lam and Wolf model, but is very effective in practice. While the number of cache lines used by a loop is related to memory performance, it is not a direct measure of performance. Their work is directed toward shared-memory parallel architectures where bus contention is a real performance problem and minimizing main memory accesses is vital. On scalar processors with a higher cache-access cost, the number of cache lines accessed may not be an accurate measure of performance.
1.4 Overview
In the rest of this thesis, we will explore the ability of the compiler to optimize a program automatically for a machine's memory hierarchy. We present algorithms to apply transformations to improve loop balance and we present experiments to validate the effectiveness of the algorithms. In Chapter 2, we address scalar replacement in the presence of conditional control flow. In Chapter 3, we discuss automatic unroll-and-jam to improve loop balance. Chapter 4 addresses loop interchange to improve cache performance and in Chapter 5, we analyze the applicability to real-world numerical subprograms of compiler blocking techniques that further optimize cache performance. Finally, we present our conclusions and future work in Chapter 6.
Chapter 2
Scalar Replacement
Although conventional compilation systems do a good job of allocating scalar variables to registers, their handling of subscripted variables leaves much to be desired. Most compilers fail to recognize even the simplest opportunities for reuse of subscripted variables. For example, in the code shown below,

      DO 10 I = 1, N
         DO 10 J = 1, M
10          A(I) = A(I) + B(J)

most compilers will not keep A(I) in a register in the inner loop. This happens in spite of the fact that standard optimization techniques are able to determine that the address of the subscripted variable is invariant in the inner loop. On the other hand, if the loop is rewritten as

      DO 20 I = 1, N
         T = A(I)
         DO 10 J = 1, M
10          T = T + B(J)
20       A(I) = T

even the most naive compilers allocate T to a register in the inner loop.
The principal reason for the problem is that the data-flow analysis used by standard compilers is not powerful enough to recognize most opportunities for reuse of subscripted variables. Subscripted variables are treated in a particularly naive fashion, if at all, making it impossible to determine when a specific element might be reused. This is particularly problematic for floating-point register allocation because most of the computational quantities held in such registers originate in subscripted arrays.

Scalar replacement is a transformation that uses dependence information to find reuse of array values and expose it by replacing the references with scalar temporaries, as was done in the above example [CCK88, CCK90]. By encoding the reuse of array elements in scalar temporaries, we can give a coloring register allocator the opportunity to allocate values held in arrays to registers [CAC+81].

Although previous algorithms for scalar replacement have been shown to be effective, they have only handled loops without conditional control flow [CCK90]. The principal reason for past deficiencies is the reliance solely upon dependence information. A dependence contains little information concerning control flow between its source and sink. It only reveals that both statements may be executed. In the loop,

      DO 10 I = 1,N
5        IF (M(I) .LT. 0) A(I) = B(I) + C(I)
10       D(I) = A(I) + E(I)

the true dependence from statement 5 to statement 10 does not reveal that the definition of A(I) is conditional. Using only dependence information, previous scalar replacement algorithms would produce the following incorrect code.

      DO 10 I = 1,N
5        IF (M(I) .LT. 0) THEN
            A0 = B(I) + C(I)
            A(I) = A0
         ENDIF
10       D(I) = A0 + E(I)
If the result of the predicate is false, no definition of A0 will occur, resulting in an incorrect value for A0 at statement 10. To ensure A0 has the proper value, we can insert a load of A0 from A(I) on the false branch, as shown below.

      DO 10 I = 1,N
5        IF (M(I) .LT. 0) THEN
            A0 = B(I) + C(I)
            A(I) = A0
         ELSE
            A0 = A(I)
         ENDIF
10       D(I) = A0 + E(I)

The hazard with inserting instructions is the potential to increase run-time costs. In the previous example, we have avoided the hazard. If the true branch is taken, one load of A(I) is removed. If the false branch is taken, one load of A(I) is inserted and one load is removed. It will be a requirement of our scalar replacement algorithm to prevent an increase in run-time accesses to memory.

This chapter addresses scalar replacement in the presence of forward conditional control flow. We show how to map partial redundancy elimination to scalar replacement in the presence of conditional control flow to ensure that memory costs will not increase along any execution path [MR79, DS88]. This chapter begins with an overview of partial redundancy elimination. Then, a detailed derivation of our algorithm for scalar replacement is given. Finally, an experiment with an implementation of this algorithm is reported.
2.1 Partial Redundancy Elimination
In the elimination of partial redundancies, the goal is to remove the latter of two identical computations that are performed on a given execution path. A computation is partially redundant when there may be paths on which both computations are performed and paths on which only the latter computation is performed. In Figure 2.1, the expression A+B is redundant along one branch of the IF and not redundant along the other. Partial redundancy elimination will remove the computation C=A+B, replacing it with an assignment, and insert a computation of A+B on the path where the expression does not appear (see Figure 2.2). Because there may be no basic block in which new computations can be inserted on a particular path, insertion is done on flow-graph edges and new basic blocks are created when necessary [DS88].

The essential property of this transformation is that it is guaranteed not to increase the number of computations performed along any path [MR79]. In mapping partial redundancy elimination to scalar replacement, references to array expressions can be seen as the computations. A load, or a store followed by another load from the same location, represents a redundant load that can be removed. Thus, using this mapping will guarantee that the number of memory accesses in a loop will not increase. However, the mapping that we use will not guarantee that a minimal number of loads is inserted.
2.2 Algorithm
In this section, we present an algorithm for scalar replacement in the presence of forward conditional control flow. We begin by determining which array accesses provide values for a particular array reference. Next, we link together references that share values by having them share temporary names. Finally, we generate scalar-replaced code. For partially redundant array accesses, partial redundancy elimination is mapped onto the removal of the redundant loads.
2.2.1 Control-Flow Analysis
Our optimization strategy focuses on loops; therefore, we can restrict control-flow analysis to loop nests only. Furthermore, it usually makes no sense to hold values across iterations of outer loops for two reasons.

1. There may be no way to determine the number of registers needed to hold all the values accessed in the innermost loop because of symbolic loop bounds.
2. Even if we know the register requirement, it is doubtful that the target machine will have enough registers.
[Figure 2.1: Partially Redundant Computation. One branch of IF (P) computes D = A+B; both paths then reach C = A+B.]

[Figure 2.2: After Partial Redundancy Elimination. The defining branch becomes D = A+B, T = D; the other branch computes T = A+B; the later computation becomes C = T.]
Hence, we need only perform control-flow analysis on each innermost loop body.

To simplify our analysis, we impose a few restrictions on the control-flow graph. First, the flow graph of the innermost loop must be reducible. Second, backward jumps are not allowed within the innermost DO-loop because they potentially create loops. Finally, multiple loop exits are prohibited. This restriction is for simplicity and can be removed with slight modification to the algorithm.
2.2.2 Availability Analysis
The first step in performing scalar replacement is to calculate available array expressions. Here, we will determine if the value provided by the source of a dependence is generated on every path to the sink of the dependence. We assume that enough iterations of the loop have been peeled so values can be available upon entry to the loop. Since each lexically identical array reference accesses the same memory location on a given loop iteration, we do not treat each lexically identical array reference as a separate array expression. Rather, we consider them in concert.

Array expressions most often contain references to induction variables. Therefore, their naive treatment in availability analysis is inadequate. To illustrate this, in the loop,

      DO 30 I = 1,N
         IF (B(I) .LT. 0.0) THEN
10          C(I) = A(I) + D(I)
         ELSE
20          C(I) = A(I-1) + D(I)
         ENDIF
30       E(I) = C(I) + A(I)

the value accessed by the array expression A(I) is fully available at the reference to A(I-1) in statement 20, but it is not available at statement 10 and is only partially available at statement 30. Using a completely syntactic notion of array expressions, essentially treating each lexically identical expression as a scalar, A(I) will be incorrectly reported as available at statements 10 and 30. Thus, more information is required. We must account for the fact that the value of an induction variable contained in an array expression changes on each iteration of the loop.
A convenient solution is to split the problem into loop-independent availability, denoted liav, where the back edge of the loop is ignored, and loop-carried availability, lav, where the back edge is included. Thus, an array expression is only available if it is in liav and there is a consistent incoming loop-independent dependence, or if it is in lav and there is a matching consistent incoming loop-carried dependence. The data-flow equations for availability analysis are shown below.
    liavin(b)  = ∩_{p ∈ preds(b)} liavout(p)
    liavout(b) = (liavin(b) - likill(b)) ∪ ligen(b)
    lavin(b)   = ∩_{p ∈ preds(b)} lavout(p)
    lavout(b)  = lavin(b) ∪ lgen(b)
For liav, an array expression is added to gen when it is encountered, whether it is a load or a store. At each store, the sources of all incoming inconsistent dependences are added to kill and removed from gen. At loads, nothing is done for kill because a previously generated value cannot be killed. We call these sets ligen and likill.
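To make the computation concrete, the liav equations can be solved with a standard iterative fixed-point algorithm. The following Python sketch is purely illustrative (the translator itself is built in the ParaScope infrastructure, not Python); the block list, predecessor map, and the ligen/likill sets are assumed inputs, and all names are assumptions.

    # Illustrative iterative solver for the liav equations; the loop's
    # back edge is simply absent from `preds`, so the availability
    # computed here is loop independent.
    def solve_liav(blocks, preds, ligen, likill, all_exprs):
        # Optimistic initialization: everything available except at entry.
        liavin = {b: set(all_exprs) for b in blocks}
        liavout = {b: set(all_exprs) for b in blocks}
        changed = True
        while changed:
            changed = False
            for b in blocks:
                if preds[b]:  # the meet is set intersection for availability
                    new_in = set.intersection(*(liavout[p] for p in preds[b]))
                else:
                    new_in = set()  # loop entry: nothing available yet
                new_out = (new_in - likill[b]) | ligen[b]
                if (new_in, new_out) != (liavin[b], liavout[b]):
                    liavin[b], liavout[b] = new_in, new_out
                    changed = True
        return liavin, liavout

The lav, lavif1 and partial-availability problems below differ only in their meet operator, kill sets, and treatment of the back edge.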
For lav, we must consider the fact that the flow of control from the source to the sink of a dependence will include at least the next iteration of the loop. Subsequent loop iterations can affect whether a value has truly been generated and not killed by the time the sink of the dependence is reached. In the loop
      DO 10 I = 1,N
         IF (A(I) .GT. 0.0) THEN
            C(I) = B(I-2) + D(I)
         ELSE
            B(K) = C(I) + D(I)
         ENDIF
10       B(I) = E(I) + C(I)
the value generated by B(I) on iteration I=1 will be available at the reference to B(I-2) on iteration I=3 only if the false branch of the IF statement is not taken on iteration I=2. Since determining the direction
of a branch at compile time is undecidable in general, we must assume that the value generated by B(I) will be killed by the definition of B(K). In general, any definition that is the source of an inconsistent output dependence can never be in lgen(b), ∀b. It will always be killed on the current or next iteration of the loop. Therefore, we need only compute lgen and not lkill.
There is one special case where this definition of lgen will unnecessarily limit the effectiveness of scalar replacement. When a dependence threshold is 1, it may happen that the sink of the generating dependence edge occurs before the killing definition. Consider the following loop.
      DO 10 I = 1,N
         B(K) = B(I-1) + D(I)
10       B(I) = E(I) + C(I)
Here, the value used by B(I-1) that is generated by B(I) will never be killed by B(K). The solution to this limitation is to create a new set called lavif1 that contains availability information only for loop-carried dependences with a threshold of 1. Control flow through the current and next iterations of the loop is included when computing this set. The data-flow equations for lavif1 are identical to those of liav because, unlike lav, we consider control flow on the next iteration of the loop. Below are the data-flow equations for lavif1.
    lavif1in(b)  = ∩_{p ∈ preds(b)} lavif1out(p)
    lavif1out(b) = (lavif1in(b) - likill(b)) ∪ ligen(b)
Because we consider both fully and partially redundant array accesses, we need to compute partial availability in order to scalar replace references whose load is only partially redundant. As in full-availability analysis, we partition the problem into loop-independent, loop-carried and loop-carried-if-1 sets. Computation of kill and gen corresponds to that of availability analysis. Below are the data-flow equations for partially available array expression analysis.
    lipavin(b)  = ∪_{p ∈ preds(b)} lipavout(p)
    lipavout(b) = (lipavin(b) - likill(b)) ∪ ligen(b)
    lpavin(b)   = ∪_{p ∈ preds(b)} lpavout(p)
    lpavout(b)  = lpavin(b) ∪ lgen(b)
    lpavif1in(b)  = ∪_{p ∈ preds(b)} lpavif1out(p)
    lpavif1out(b) = (lpavif1in(b) - likill(b)) ∪ ligen(b)
2.2.3 Reachability Analysis
Because there may be multiple lexically identical array references within a loop, we want to determine which references actually supply values that reach a sink of a dependence and which supply values that are killed before reaching such a sink; in other words, which values reach their potential reuses. In computing reachability, we do not treat each lexically identical array expression in concert. Rather, each reference is considered independently. Reachability information, along with availability, is used to select which array references provide values for scalar replacement. While reachability information is not required for correctness, it can prevent the marking of references as providing a value for scalar replacement when that value is redefined by a later identical reference. This improves the readability of the transformed code.
We partition the reachability information into three sets: one for loop-independent dependences (lirg), one for loop-carried dependences with a threshold of 1 (lrgif1) and one for other loop-carried dependences (lrg). Calculation of ligen and likill is the same as that for availability analysis except in one respect: the source of any incoming loop-independent dependence whose sink is a definition is killed whether the dependence is consistent or inconsistent. Additionally, likill is subtracted from lrgout to account for consistent references that redefine a value on the current iteration of a loop. For example, in the following loop, the definition of B(I) kills the load from B(I); therefore, only the definition reaches the reference to B(I-1).
      DO 10 I = 1,N
         A(I) = B(I-1) + B(I)
10       B(I) = E(I) + C(I)
Using reachability information, the most recent access to a value can be determined. Even using likill when computing lrg will not eliminate all unreachable references. References with only outgoing consistent loop-carried output or antidependences will not be killed. This does not affect correctness, only readability, and can only happen when partially available references provide the value to be scalar replaced. Below are the data-flow equations used in computing array-reference reachability.
    lirgin(b)  = ∪_{p ∈ preds(b)} lirgout(p)
    lirgout(b) = (lirgin(b) - likill(b)) ∪ ligen(b)
    lrgin(b)   = ∪_{p ∈ preds(b)} lrgout(p)
    lrgout(b)  = (lrgin(b) - likill(b)) ∪ lgen(b)
    lrgif1in(b)  = ∪_{p ∈ preds(b)} lrgif1out(p)
    lrgif1out(b) = (lrgif1in(b) - likill(b)) ∪ ligen(b)
2.2.4 Potential-Generator Selection
At this point, we have enough information to determine which array references potentially provide values to the sinks of their outgoing dependence edges. We call these references potential generators because they can be seen as generating the value used at the sink of some outgoing dependence. The dependences leaving a generator are called generating dependences. Generators are only potential at this point because we need more information to determine if scalar replacement will even be profitable. In Figure 2.3, we present the algorithm FindPotentialGenerators for determining potential generators.
We have two goals in choosing potential generators. The first is to insert the fewest loads, corresponding to the maximum number of memory accesses removed. The second is to minimize register pressure, the number of registers required to eliminate the loads. To meet the first objective, fully available expressions are given the highest priority in generator selection. To meet the second, loop-independent fully available generators are preferred because they require the fewest registers. If no loop-independent generator exists, loop-carried fully available generators are considered next. If there are multiple such generators, the one that requires the fewest registers (the one with the smallest threshold) is chosen.
If there are no fully available generators, partially available array expressions are next considered as generators. Partially available generators do not guarantee a reduction in the number of memory accesses because memory loads need to be inserted on paths along which a value is needed but not generated. However, we can guarantee that there will not be an increase in the number of memory loads by only inserting load instructions if they are guaranteed to have a corresponding load removal on any execution path [MR79]. Without this guarantee, we may increase the number of memory accesses at execution time, resulting in a performance degradation.
The best choice for a partially available generator is one that is loop-independent. Although there may be a "more available" loop-carried generator, register pressure is kept to a minimum and scalar replacement will still be applied if we choose a loop-independent generator. If there are no loop-independent partially available array expressions, then the next choice is a loop-carried partially available array expression with a generating dependence having the largest threshold of any incoming potential generating dependence. Although this contradicts the goal of keeping register pressure to a minimum, we increase the probability that there will be a use of the value on every path by increasing the window size for potential uses of the generated value. We have chosen to sacrifice register pressure for potential savings in memory accesses.
Finally, when propagating data-flow sets through a basic block to determine availability or reachability at a particular point, information is not always incrementally updated. For loop-independent information, we update a data-flow set with gen and kill information as we encounter statements. However, the same is not true for loop-carried and loop-carried-if-1 information. gen information is not used to update any loop-carried data-flow set. Loop-carried information must propagate around the loop to be valid, and a loop-carried data-flow set at the entry to a basic block already contains this information. kill information is only incrementally updated for loop-carried-if-1 sets, since flow through the second iteration of a loop after a value is generated is considered.
procedure FindPotentialGenerators(G)
Input: G = (V, E), the dependence graph
Defs:  b_v = basic block containing v
for each v ∈ V do
    if ∃e ∈ E | e = (w,v), w ∈ liav(b_v), w ∈ lirg(b_v), d_n(e) = 0 then
        mark all such w's as v's potential liav generator
    else if ∃e ∈ E | e = (w,v), w ∈ lavif1(b_v), w ∈ lrgif1(b_v),
            w ∈ lavif1out(exit), w ∈ lrgif1out(exit), d_n(e) = 1 then
        mark all such w's as v's potential lav generator
    else if ∃e ∈ E | e = (w,v), w ∈ lavout(exit), w ∈ lrgout(exit),
            d_n(e) > 0, min_{f ∈ E_v}(d_n(f)) = d_n(e) then
        mark all such w's as v's potential lav generator
    else if ∃e ∈ E | e = (w,v), w ∈ lipav(b_v), w ∈ lirg(b_v), d_n(e) = 0 then
        mark all such w's as v's potential lipav generator
    else if ∃e ∈ E | e = (w,v), w ∈ lpavout(exit), w ∈ lrgout(exit),
            d_n(e) > 0, max_{f ∈ E_v}(d_n(f)) = d_n(e) then
        mark all such w's as v's potential lpav generator
    else if ∃e ∈ E | e = (w,v), w ∈ lpavif1(b_v), w ∈ lrgif1(b_v),
            w ∈ lpavif1out(exit), w ∈ lrgif1out(exit), d_n(e) = 1 then
        mark all such w's as v's potential lpav generator
    else
        v has no generator

Figure 2.3   FindPotentialGenerators
In the next portion of the scalar replacement algorithm, we will ensure that the value needed at a reference to remove its load is fully available. This will involve insertion of memory loads for references whose generator is partially available. We guarantee through our partial redundancy elimination mapping that we will not increase the number of memory accesses at run time. However, we do not guarantee a minimal insertion of memory accesses.
2.2.5 Anticipability Analysis
After determining potential generators, we need to locate the paths along which loads must be inserted to make partially available generators fully available. Loads need to be inserted on paths along which a value is needed but not generated. We have already encapsulated value generation in availability information. Now, we encapsulate value need with anticipability. The value generated by an array expression v is anticipated by an array expression w if there is a true or input edge v → w and v is w's potential generator.
      DO 6 I = 1,N
         IF (A(I) .GT. 0.0) THEN
5           B(I) = C(I) + D(I)
         ELSE
            F(I) = C(I) + D(I)
         ENDIF
6        C(I) = E(I) + B(I)
In the above example, the value generated by the definition of B(I) in statement 5 is anticipated at the use of B(I) in statement 6.
As in availability analysis, we consider each lexically identical array expression in concert. We again split the problem, but this time into only two partitions: one for loop-independent generators, lian, and one for loop-carried generators, lan. We do not consider the back edge of the loop during analysis for either partition. For lian, the reason is obvious. For lan, we only want to know that a value is anticipated on all paths through the loop. This ensures that we do not increase the number of memory accesses in the loop.
In each partition, an array expression is added to gen at an array reference if it is a potential generator for that reference. For members of lian, array expressions are killed at the point where they are defined by a consistent or inconsistent reference. For lan, only inconsistent definitions kill anticipation because consistent definitions do not define the value being anticipated on the current iteration. For example, in the loop
      DO 1 I = 1,N
         A(I) = B(I) + D(I)
1        B(I) = A(I-1) + C(I)
the definition of A(I) does not redefine a particular value anticipated by A(I-1). The value is generated by A(I) and then never redefined because of the iteration change.
    lianout(b) = ∩_{s ∈ succs(b)} lianin(s)
    lianin(b)  = (lianout(b) - likill(b)) ∪ ligen(b)
    lanout(b)  = ∩_{s ∈ succs(b)} lanin(s)
    lanin(b)   = (lanout(b) - lkill(b)) ∪ lgen(b)
2.2.6 Dependence-Graph Marking
Once anticipability information has been computed, the dependence graph can be marked so that only dependences to be scalar replaced are left unmarked. The other edges no longer matter because their participation in value flow has already been considered. Figure 2.4 shows the algorithm MarkDependenceGraph.
At a given array reference, we mark any incoming true or input edge that is inconsistent and any incoming true or input edge that has a symbolic threshold or a threshold greater than that of the dependence edge from the potential generator. Inconsistent and symbolic edges are not amenable to scalar replacement because it is impossible to determine at compile time the number of registers needed to expose potential reuse. When a reference has a consistent generator, all edges with a threshold less than or equal to the generating threshold are left in the graph. This is to facilitate the consistent register naming discussed in subsequent
Procedure MarkDependenceGraph(G)
Input: G = (V, E), the dependence graph
for each v ∈ V
    if v has no generator then
        mark v's incoming true and input edges
    else if v's generator is inconsistent or has a symbolic d_n then
        mark v's incoming true and input edges
        v no longer has a generator
    else if v's generator is lpav and v ∉ lanin(entry) then
        mark v's incoming true and input edges
        v no longer has a generator
    else
        δ_v = threshold of edge from v's generator
        mark v's incoming true and input edges with d_n > δ_v
        mark v's incoming edges whose source does not reach v or
            whose source is not partially available at v
        mark v's incoming inconsistent and symbolic edges

Figure 2.4   MarkDependenceGraph
sections. It ensures that any reference occurring between the source and sink of an unmarked dependence that can provide the value at the sink will be connected to the dependence sink. Finally, any edge from a loop-carried partially available generator that is not anticipated at the entry to the loop is removed because there will not be a dependence sink on every path.
2.2.7 Name Partitioning
At this point, the unmarked dependence graph represents the flow of values for references to be scalar replaced. We have determined which references provide values that can be scalar replaced. Now, we move on to linking together the references that share values. Consider the following loop.
      DO 6 I = 1,N
5        B(I) = B(I-1) + D(I)
6        C(I) = B(I-1) + E(K)
After performing the analysis discussed so far, the generator for the reference to B(I-1) in statement 6 would be the load of B(I-1) in statement 5. However, the actual generator for this reference is the definition of B(I) in statement 5, making B(I-1) in statement 5 an intermediate point in the flow of the value rather than the actual generator for B(I-1) in statement 6. These references need to be considered together when generating temporary names because they address the same memory locations.
The nodes of the dependence graph can be partitioned into groups by dependence edges (see Figure 2.5) to make sure that temporary names for all references that participate in the flow of values through a memory location are consistent. Any two nodes connected by an unmarked dependence edge after graph marking belong in the same partition. Partitioning is accomplished by performing a traversal of the dependence graph, following unmarked true and input dependences only, since these dependences represent value flow. Partitioning will tie together all references that access a particular memory location and represent reuse of a value in that location.
After name partitioning is completed, we can determine the number of temporary variables (or registers) that are necessary to perform scalar replacement on each of the partitions. To calculate register requirements, we split the references (or partitions) into two groups: variant and invariant. Variant references contain the innermost-loop induction variable within their subscript expression. Invariant references do not contain the
Procedure GenerateNamePartitions(G, P)
Input: G = (V, E), the dependence graph
       P = the set of name partitions
∀v ∈ V, mark v as unvisited
i = 0
for each v ∈ V
    if v is unvisited then
        put v in P_i
        Partition(v, P_i)
        i = i + 1
for each p ∈ P do
    FindGenerator(p, Γ_p)
    if p is invariant then
        r_p = 1
    else
        r_p = max_{g ∈ Γ_p}(CalcNumberOfRegisters(g)) + 1
enddo
end

Procedure Partition(v, p)
mark v as visited
add v to p
for each (e = (v,w) ∨ (w,v)) ∈ E
    if e is true or input and unmarked and w not visited then
        Partition(w, p)
end

Procedure FindGenerator(p, Γ)
for each v ∈ p that is a potential generator
    if (v is invariant ∧ (v is a store ∨
        v has no incoming unmarked loop-independent edge)) ∨
       (v is variant ∧ v has no unmarked incoming true or
        input dependences from another w ∈ p) then
        Γ = Γ ∪ {v}
end

Procedure CalcNumberOfRegisters(v)
dist = 0
for each e = (v,w) ∈ E
    if e is true or input and unmarked ∧ w not visited ∧
       P(w) = P(v) then
        dist = max(dist, d_n(e) + CalcNumberOfRegisters(w))
return(dist)
end

Figure 2.5   GenerateNamePartitions
innermost-loop induction variable. In the previous example, the reference to E(K) is invariant with respect to the I-loop, while all other array references are variant.
For variant references, we begin by finding all references within each partition that are first to access a value on a given path from the loop entry. We call these references partition generators. A variant partition generator will already have been selected as a potential generator in Section 2.2.4 and will not have incoming unmarked dependences from another reference within its partition. The set of partition generators within one partition is called the generator set, Γ_p. Next, the maximum over all dependence distances from a potential generator to a reference within this partition is calculated. Even if dependence edges between the generator and a reference have been removed from the graph, we can compute the distance by finding the length of a chain, excluding cycles, to that reference from the generator. Letting δ_p be the maximum distance, each partition requires r_p = δ_p + 1 registers or temporary variables: one register for the value generated on the current iteration and one register for each of the δ_p values that were generated previously and need to flow to the last sink. In the previous example, B(I) is the oldest reference and is the generator of the value used at both references to B(I-1). The threshold of the partition is 1. This requires 2 registers because there are two values generated by B(I) that are live after executing statement 5 and before executing statement 6.
For invariant references, we can use one register for scalar replacement because each reference accesses the same value on every iteration. In order for an invariant potential generator to be a partition generator, it must be a definition or the first load of a value along a path through the loop.
2.2.8 Register-Pressure Moderation
Unfortunately, experiments have shown that exposing all of the reuse possible with scalar replacement may result in a performance degradation [CCK90]. Register spilling can completely counteract any savings from scalar replacement. The problem is that specialized information is necessary to recover the original code to prevent excessive spilling. As a result, we need to generate scalar temporaries in such a way as to minimize register pressure, in effect doing part of the register allocator's job.
Our goal is to allocate temporary variables so that we can eliminate the most memory accesses given the available number of machine registers. This can be approximated using a greedy algorithm. In this approximation scheme, the partitions are sorted in decreasing order of their ratio of benefit to cost, in this case the ratio of memory accesses saved to registers required. At each step, the algorithm chooses the first partition that fits into the remaining registers. This method requires O(n log n) time to sort the ratios and O(n) time to select the partitions. Hence, the total running time is O(n log n).
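As an illustration of this approximation scheme, a greedy selector might be sketched as follows in Python; the benefit and registers fields (corresponding to b_p and r_p below) and all other names are assumptions, not part of the actual implementation.

    # Greedy register-pressure moderation: sort partitions by their
    # benefit-to-cost ratio, then take the first ones that fit.
    def greedy_select(partitions, num_registers):
        ordered = sorted(partitions,
                         key=lambda p: p.benefit / p.registers,
                         reverse=True)
        chosen, remaining = [], num_registers
        for p in ordered:
            if p.registers <= remaining:
                chosen.append(p)
                remaining -= p.registers
        return chosen, remaining  # the O(n log n) sort dominates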
To compute the benefit of scalar replacing a partition, we begin by computing the probability that each basic block will be executed. The first basic block in the loop is assigned a probability of execution of 1. Each outgoing edge from the basic block is given a proportional probability of execution. In the case of two outgoing edges, each edge is given a 50% probability of execution. For the remaining blocks, this procedure is repeated, except that the probability of execution of a remaining block is the sum of the probabilities of its incoming edges. Next, we compute the probability that the generator for a partition is available at the entrance to and exit from a basic block. The probability upon entry is the sum of the probabilities at the exit of a block's predecessors weighted by the number of incoming edges. Upon exit from a block, the probability is 1 if the generator is available, 0 if it is not available, and the entry probability otherwise. After computing each of these probabilities, the benefit for each reference within a partition can be computed by multiplying the execution probability for the basic block that contains the reference by the availability probability of the reference's generator. Figure 2.6 gives the complete algorithm for computing benefit. As an example of benefit, in the loop
      DO 1 I = 1,N
         IF (M(I) .LT. 0) THEN
            A(I) = B(I) + C(I)
         ENDIF
1        D(I) = A(I) + E(I)
the load of A(I) in statement 1 has a benefit of 0.5.
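The probability computation itself is a single pass over the forward edges. The sketch below is again only illustrative, assuming the blocks are given in the reverse depth-first order used by Figure 2.6 and that preds and succs contain forward edges only.

    # Block execution probabilities: the loop entry gets 1.0 and each
    # block splits its probability evenly over its outgoing edges.
    def block_probabilities(blocks, succs, preds):
        prob = {b: 0.0 for b in blocks}
        weight = {}
        prob[blocks[0]] = 1.0  # the loop entry always executes
        for b in blocks:
            if b is not blocks[0]:
                prob[b] = sum(weight[(p, b)] for p in preds[b])
            for s in succs[b]:
                weight[(b, s)] = prob[b] / len(succs[b])
        return prob

For the loop above, the THEN block receives probability 0.5, which is exactly where the benefit of 0.5 for the load of A(I) comes from.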
Unfortunately, the greedy approximation can produce suboptimal results. To see this, consider the following example, in which each partition is represented by a pair (n, m), where n is the number of registers needed and m is the number of memory accesses eliminated. Assume that we have a machine with 6 registers
Procedure CalculateBenefit(P, FG)
Input: P = set of reference partitions
       FG = flow graph
Defs:  Ape(b) = probability p is available on entry to b
       Apx(b) = probability p is available on exit from b
       b_p = benefit for partition p
       E_FG = edges in flow graph
       P(b) = probability block b will be executed
P(entry) = 1
n = # outgoing forward edges of entry
weight each outgoing forward edge of entry at P(entry)/n
for the remaining basic blocks b ∈ FG in reverse depth-first order
    let P(b) = sum of all incoming edge weights
    n = # outgoing forward edges of b
    weight each outgoing forward edge at P(b)/n
for each p ∈ P
    for each basic block b ∈ FG in reverse depth-first order
        n = # incoming forward edges of b
        for each incoming forward edge of b, e = (a, b)
            Ape(b) = Ape(b) + Apx(a)/n
        if p ∈ liavout(b) ∪ lavout(b) then
            Apx(b) = 1
        else if p ∈ lipavout(b) ∪ lpavout(b) then
            Apx(b) = Ape(b)
        else
            Apx(b) = 0
    for each v ∈ p
        if p is lpav or lpavif1 then
            b_p = b_p + Apx(exit) · P(b_v)
        else if p is lipav then
            b_p = b_p + Ape(b_v) · P(b_v)
        else
            b_p = b_p + P(b_v)
end

Figure 2.6   CalculateBenefit
and that we are generating temporaries for a loop that has the following generators: (4,8), (3,5), (3,5), (2,1). The greedy algorithm would first choose (4,8) and then (2,1), resulting in the elimination of nine accesses instead of the optimal ten.
To get a possibly better allocation, we can model register-pressure moderation as a knapsack problem, where the number of scalar temporaries required for a partition is the size of the object to be put in the knapsack, and the size of the register file is the knapsack size. Using dynamic programming, an optimal solution to the knapsack problem can be found in O(kn) time, where k is the number of registers available for allocation and n is the number of generators. Hence, for a specific machine, we get a linear time bound. However, the greedy method has a running time that is independent of machine architecture, making it more practical for use in a general tool. The greedy approximation of the knapsack problem is also provably no more than two times worse than the optimal solution [GJ79]. However, our experiments suggest that in practice the greedy algorithm performs as well as the knapsack algorithm.
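For comparison, the knapsack variant can be phrased as the usual dynamic program over register budgets; dp[r] below records the best benefit achievable with r registers. This sketch reuses the hypothetical benefit/registers fields from the greedy sketch above.

    # 0/1 knapsack over partitions: O(kn) for k registers, n partitions.
    def knapsack_select(partitions, num_registers):
        dp = [0.0] * (num_registers + 1)
        take = [[] for _ in range(num_registers + 1)]
        for p in partitions:
            for r in range(num_registers, p.registers - 1, -1):
                cand = dp[r - p.registers] + p.benefit
                if cand > dp[r]:
                    dp[r] = cand
                    take[r] = take[r - p.registers] + [p]
        return take[num_registers]

On the (4,8), (3,5), (3,5), (2,1) example with 6 registers, this selects the two (3,5) partitions, eliminating the optimal ten accesses.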
After determining which generators will be fully scalar replaced, there may still be a few registers available. Those partitions that were eliminated from consideration can be examined to see if partial allocation is possible. In each eliminated partition whose generator is not lpav, we allocate references whose distance from Γ_p is less than the number of remaining registers. All references within p that do not fit this criterion are removed from p. This step is performed on each partition, if possible, while registers remain unused. Finally, since we have possibly removed references from a partition, anticipability analysis for potential generators must be redone.
To illustrate partial allocation, assume that in the following loop there is one register available.
      DO 10 I = 1,N
         A(I) = ...
         IF (B(I) .GT. 0.0) THEN
            ... = A(I)
         ENDIF
10       ... = A(I-1)
Here, full allocation is not possible, but there is a loop-independent dependence between the A(I)'s. In partial allocation, A(I-1) is removed from the partition, allowing scalar replacement to be performed. Figure 2.7 gives the complete register-pressure moderation algorithm, including partial allocation.
Although this method of partial allocation may still leave possible reuses not scalar replaced, experience suggests this rarely, if ever, happens. One possible solution is to consider dependences from intermediate points within a partition when looking for potential reuse.
2.2.9 Reference Replacement
At this point, we have determined which references will be scalar replaced. We now move into the code generation phase of the algorithm. Here, we will replace array references with temporary variables and ensure that the temporaries contain the proper value at a given point in the loop.
After we have determined which partitions will be scalar replaced, we replace the array references within each partition. This algorithm is shown in Figure 2.8. First, for each variant partition p, we create the temporary variables T_p^0, T_p^1, ..., T_p^{r_p-1}, where T_p^i represents the value generated i iterations earlier by g ∈ Γ_p, the first generator to access the value used throughout the partition. Each reference within the partition is replaced with the temporary that coincides with its distance from g. For invariant partitions, each reference is replaced with T_p^0. If a replaced reference v ∈ Γ_p is a memory load, then a statement of the form T_p^i = v is inserted before the generating statement. Requiring that the load be in Γ_p for load insertion ensures that a load that is a potential generator but also has a potential generator itself will not have a load inserted; the value for the potential generator not in Γ_p will already be provided by its own potential generator. If v is a store, then a statement of the form v = T_p^i is inserted after the generating statement. The latter assignment is unnecessary if the store has an outgoing loop-independent edge to a definition that is always executed and it has no outgoing inconsistent true dependences. We could get better results by performing availability and anticipability analysis exclusively for definitions to determine if a value is always redefined.
Procedure ModerateRegisterPressure(P, G)
Input: P = set of reference partitions
       G = (V, E), the dependence graph
Defs:  b_p = benefit for a partition p
       r_p = registers required by a partition p
CalculateBenefit(P, G)
R = registers available
H = sort of P on ratio of b_p/r_p in decreasing order
for i = 1 to |H| do
    if r_{H_i} < R then
        allocate(H_i)
        R = R - r_{H_i}
    else
        S = S ∪ H_i
        P = P - H_i
    endif
enddo
if R > 0 then
    i = 1
    while i < |S| and R > 0 do
        while |S_i| > 0 and R > 0 do
            S_i = {v | v ∈ S_i, d_n(ê) < R, ê = (Γ_{S_i}, v)}
            allocate(S_i)
            R = R - r_{S_i}
            P = P ∪ S_i
        enddo
        i = i + 1
    enddo
    redo anticipability analysis
endif
end

Figure 2.7   ModerateRegisterPressure
Procedure ReplaceReferences(P)
Input: P = set of reference partitions
for each p ∈ P do
    let g = the v ∈ p with no incoming marked or unmarked
        edge from another w ∈ p, or that is invariant
    for each v ∈ p do
        if v is invariant then d = 0
        else d = distance from g to v
        replace v with T_p^d
        if v ∈ Γ_p and v is a load then
            insert "T_p^d = v" before v's statement
        if v is a store ∧ (∃e = (v,w) | e is true and inconsistent ∨
           ∀e = (v,w) where e is output and d_n(e) = 0,
           w is not always executed) then
            insert "v = T_p^d" after v's statement
end

Figure 2.8   ReplaceReferences
The effect of reference replacement will be illustrated on the following loop nest.
      DO 3 I = 1, 100
1        IF (M(I) .LT. 0) E(I) = C(I)
2        A(I) = C(I) + D(I)
3        B(K) = B(K) + A(I-1)
The reuse-generating dependences are:
1. a loop-independent input dependence from C(I) in statement 1 to C(I) in statement 2 (threshold 0),
2. a true dependence from A(I) to A(I-1) (threshold 1), and
3. a true dependence from B(K) to B(K) in statement 3 (threshold 1).
By our method, the generator C(I) in statement 1 needs only one temporary, T10. Here, we are using the first numeric digit to indicate the number of the partition and the second to represent the distance from the generator. The generator B(K) in statement 3 needs one temporary, T20, since it is invariant, and the generator A(I) needs two temporaries, T30 and T31. When we apply the reference replacement procedure to the example loop, we generate the following code.
      DO 3 I = 1, 100
         IF (M(I) .LT. 0) THEN
            T10 = C(I)
            E(I) = T10
         ENDIF
         T30 = T10 + D(I)
         A(I) = T30
         T20 = T20 + T31
3        B(K) = T20
The value for T31 is not generated in this example. We will discuss its generation in Sections 2.2.11 and 2.2.13.
2.2.10 Statement-Insertion Analysis
After we replace the array references with scalar temporaries, we need to insert loads for partially available generators. Given a reference that has a partially available potential generator, we need to insert a statement
at the highest point on a path from the loop entrance to the reference that anticipates the generator where the generator is not partially available and is anticipated. By performing statement-insertion analysis on potential generators, we guarantee that every reference's anticipated value will be fully available. Here, we handle each individual reference, whereas name partitioning linked together those references that share values. This philosophy will not necessarily introduce a minimum number of newly inserted loads, but there will not be an increase in the number of run-time loads. The place for insertion of loads for partially available generators can be determined using Drechsler and Stadel's formulation for partial redundancy elimination, as shown below [DS88].
    ppin(b)  = antin(b) ∩ pavin(b) ∩ (antloc(b) ∪ (transp(b) ∩ ppout(b)))

    ppout(b) = FALSE                      if b is the loop exit
             = ∩_{s ∈ succs(b)} ppin(s)   otherwise

    insert(b)   = ppout(b) ∩ ¬avout(b) ∩ (¬ppin(b) ∪ ¬transp(b))

    insert(a,b) = ppin(b) ∩ ¬avout(a) ∩ ¬ppout(a)
Here, ppin(b) denotes that placement of a statement is possible at the entry to block b, and ppout(b) denotes that placement of a statement is possible at the exit from b. insert(b) determines which loads need to be inserted at the bottom of block b. insert(a,b) is defined for each edge in the control-flow graph and determines which loads are inserted on the edge from a to b. transp(b) is true for some array expression if it is not defined by a consistent or inconsistent definition in block b. antloc(b) is the same as gen(b) for anticipability information. Three problems of the above form are solved: one for lipav generators, one for lpav generators and one for lpavif1 generators. Additionally, any reference to loop-carried antin information refers to the entry block.
If insert(a,b) is true for some potential generator g, then we insert a load of the form T_p^d = g on the edge (a,b), where T_p^d is the temporary name associated with g. If insert(b) is true for some potential generator g, then a statement of identical form is inserted at the end of block b. Finally, if insert(a,b) is true ∀a ∈ preds(b), then the loads can be collapsed into the beginning of block b. The algorithm for inserting statements is shown in Figure 2.9.
If we perform statement insertion on our example loop, we get the following results.
      DO 3 I = 1, 100
         IF (M(I) .LT. 0) THEN
            T10 = C(I)
            E(I) = T10
         ELSE
            T10 = C(I)
         ENDIF
         T30 = T10 + D(I)
         A(I) = T30
         T20 = T20 + T31
3        B(K) = T20
Again, the generation of T31 is left for Sections 2.2.11 and 2.2.13.
2.2.11 Register Copying
Next, we need to ensure that the values held in the temporary variables are correct across loop iterations. The value held in T_p^i needs to move one iteration further away from its generator, T_p^0, on each subsequent loop iteration. Since i is the number of loop iterations the value is from the generator, the variable T_p^{i+1} needs to take on the value of T_p^i at the end of the loop body in preparation for the iteration change. The algorithm in Figure 2.10 effects the following shift of values for each partition p.
    T_p^{r_p-1} = T_p^{r_p-2}
    T_p^{r_p-2} = T_p^{r_p-3}
        ...
    T_p^1 = T_p^0
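Generating these copies is mechanical. Assuming the two-digit temporary-naming scheme used in our examples (partition number followed by distance), the copy sequence for one partition could be sketched as:

    # Emit the end-of-body copies T_p^{r_p-1} = T_p^{r_p-2}, ...,
    # T_p^1 = T_p^0 for partition number p; the naming scheme is
    # an assumption for illustration.
    def register_copies(p, r_p):
        return [f"T{p}{i} = T{p}{i-1}" for i in range(r_p - 1, 0, -1)]

    # register_copies(3, 2) yields ['T31 = T30'], the copy that
    # appears in the example below.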
Procedure InsertStatements(FG)
Input: FG = flow graph
perform insert analysis
for each flow edge e ∈ FG
    for each v ∈ insert(e)
        if v is invariant then d = 0
        else d = distance from the oldest generator in P(v) to v
        insert statement "T_{P(v)}^d = v" on e
for each basic block b ∈ FG
    for each v ∈ insert(b)
        if v is invariant then d = 0
        else d = distance from the oldest generator in P(v) to v
        insert statement "T_{P(v)}^d = v" at the end of b
for each v such that insert(a,b)(v) is true ∀a ∈ preds(b)
    collapse loads of v into the beginning of b
end

Figure 2.9   InsertStatements
After inserting register copies, our example code becomes:
      DO 3 I = 1, 100
         IF (M(I) .LT. 0) THEN
            T10 = C(I)
            E(I) = T10
         ELSE
            T10 = C(I)
         ENDIF
         T30 = T10 + D(I)
         A(I) = T30
         T20 = T20 + T31
         B(K) = T20
3        T31 = T30
2.2.12 Code Motion
It may be the case that the assignment to or a load from a generator can be moved entirely out of the innermost loop. This is possible when the reference to the generator is invariant with respect to the innermost loop. In the example above, B(K) does not change with each iteration of the I-loop; therefore, its value can be kept in a register during the entire execution of the loop and stored back into B(K) after the loop exit.
      DO 3 I = 1, 100
         IF (M(I) .LT. 0) THEN
            T10 = C(I)
            E(I) = T10
         ELSE
            T10 = C(I)
         ENDIF
         T30 = T10 + D(I)
         A(I) = T30
         T20 = T20 + T31
3        T31 = T30
      B(K) = T20
The algorithm for this type of code motion is shown in Figure 2.11.
Procedure InsertRegisterCopies(G, P, x)
Input: G = (V, E), the dependence graph
       P = set of reference partitions
       x = peeled iteration number, or ∞ for a loop body
for each p ∈ P
    for i = min(r_p - 1, x) downto 1 do
        insert "T_p^i = T_p^{i-1}" at the end of the loop body in G
end

Figure 2.10   InsertRegisterCopies
When inconsistent dependences leave an invariant array reference that is a store, the generating store for that variable cannot be moved outside of the innermost loop. Consider the following example.
      DO 10 J = 1, N
10       A(I) = A(I) + A(J)
The true dependence from A(I) to A(J) is not consistent. If the value of A(I) were stored into A(I) outside of the loop, then the value of A(J) would be wrong whenever I=J and I > 1.
2.2.13 Initialization
To ensure that the temporary variables contain the correct values upon entry to the loop, the loop is peeled using the algorithm in Figure 2.12. We peel max(r_{p_1}, ..., r_{p_n}) - 1 iterations from the beginning of the loop, replacing the members of a variant partition p for peeled iteration k with their original array references, substituting the iteration value for the induction variable, only if j ≥ k for a temporary T_p^j. For invariant partitions, we only replace non-partition generators' temporaries on the first peeled iteration. Additionally, we let r_p = 2 for each invariant partition when calculating the number of peeled iterations. This ensures that invariant partitions will be initialized correctly. Finally, at the end of each peeled iteration, the appropriate number of register transfers is inserted.
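The peel count itself is a one-line computation. The sketch below assumes hypothetical invariant and registers fields on each partition object and applies the r_p = 2 convention for invariant partitions.

    # Number of iterations to peel: the maximum r_p over all
    # partitions (invariant partitions counted as 2), minus one.
    def peel_count(partitions):
        return max(2 if p.invariant else p.registers
                   for p in partitions) - 1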
When this transformation is applied to our example, we get the following code.
      IF (M(1) .LT. 0) THEN
         T10 = C(1)
         E(1) = T10
      ELSE
         T10 = C(1)
      ENDIF
      T30 = T10 + D(1)
      A(1) = T30
      T20 = B(K) + A(0)
      T31 = T30
      ...
      LOOP BODY
2.2.14 Register Subsumption
In our example loop, we have eliminated three loads and one store from each iteration of the loop, at the cost of three register-to-register transfers in each iteration. Fortunately, the inserted transfer instructions can be eliminated if we unroll the scalar-replaced loop using the algorithm in Figure 2.13. If we have the partitions p_0, p_1, ..., p_n, we can remove the transfers by unrolling lcm(r_{p_0}, r_{p_1}, ..., r_{p_n}) - 1 times. In the kth unrolled
Procedure CodeMotion(P)
Input: P = set of reference partitions
for each p ∈ P
    for each v ∈ p
        if v is a store ∧ v ∈ lanin(entry) then
            if ∃e = (v,w) ∈ E ∧ v = w ∧
               ∀e = (v,z) ∈ E, e is consistent then
                move the store to v after the loop
        else if ∃e = (v,w) ∈ E such that e is consistent ∧
               v = w ∧ ∀e = (u,v) ∈ E, if e is true it is consistent then
            move the load of v before the loop
end

Figure 2.11   CodeMotion
Procedure Initialize(P, G)
Input: P = set of reference partitions
       G = (V, E), the dependence graph
x = max(r_{p_1}, ..., r_{p_n}) - 1
for k = 1 to x
    G' = peel of the kth iteration of G
    for each v ∈ V'
        if v = T_p^j ∧ ((v is variant ∧ j ≥ k) ∨
           (v is invariant ∧ v ∉ Γ_{P(v)} ∧ j + 1 ≥ k ∧
            v's generator is loop carried)) then
            replace v with its original array reference
            replace the inner-loop induction variable with
                its kth iteration value
        endif
    InsertRegisterCopies(G', P, k)
end

Figure 2.12   Initialize
body, the temporary variable T_p^j is replaced with the variable T_p^{mod(j-k, r_p)}, where mod(y,x) = y - ⌊y/x⌋·x. Essentially, we capture the permutation of values by circulating the register names within the unrolled iterations.
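The renaming and the unroll amount are both small computations. The following sketch (with the same assumed two-digit naming scheme as before) shows them; note that Python's % operator already implements the floor-based mod defined above.

    from math import lcm

    # Unroll amount for subsumption and the circulated temporary name
    # for the k-th unrolled body.
    def unroll_amount(r_values):
        return lcm(*r_values) - 1

    def subsumed_name(p, j, k, r_p):
        return f"T{p}{(j - k) % r_p}"

With r_p = 2 in body k = 1, T30 becomes T31 and T31 becomes T30, which is exactly the circulation visible in Figure 2.14.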
The final result of scalar replacement on our example is shown in Figure 2.14 (the pre-loop to capture the extra iterations and the initialization code are not shown).
2.3 Experiment
We have implemented a source-to-source translator in the ParaScope programming environment, a programming system for Fortran, that uses the dependence analyzer from PFC. The translator replaces subscripted variables with scalars using the described algorithm. The experimental design is illustrated in Figure 2.15. In this scheme, ParaScope serves as a preprocessor, rewriting Fortran programs to improve register allocation. Both the original and transformed versions of the program are then compiled and run using the standard product compiler for the target machine.
For our test machine, we chose the IBM RS/6000 model 540 because it had a good compiler and a large number of floating-point registers (32). In fact, the IBM XLF compiler performs scalar replacement for those references that do not require dependence analysis. Many fully available loop-independent cases and invariant cases are handled. Therefore, the results described here only reflect the cases that required dependence analysis. Essentially, we show the results of performing scalar replacement on loop-carried dependences and in the presence of inconsistent dependences.
Livermore Loops. We tested scalar replacement on a number of the Livermore Loops. Some of the kernels did not contain opportunities for our algorithm; therefore, we do not show their results. In the table below, we show the performance gain attained by our transformation system.
    Loop   Iterations   Original   Transformed   Speedup
    1      10000        3.40s      2.54s         1.34
    5      10000        3.05s      1.36s         2.24
    6      10000        3.82s      2.82s         1.35
    7      5000         3.94s      2.02s         1.95
    8      10000        3.38s      3.07s         1.10
    11     10000        4.52s      1.69s         2.67
    12     10000        1.70s      1.42s         1.20
    13     5000         3.25s      3.01s         1.08
    18     1000         2.62s      2.54s         1.03
    20     500          2.99s      2.90s         1.03
    23     5000         2.68s      2.35s         1.14
Some of the interesting results include the performances of loops 5 and 11, which compute first-order linear recurrences. Livermore Loop 5 is shown below.
      DO 5 I = 2,N
5        X(I) = Z(I) * (Y(I) - X(I-1))
Here, scalar replacement not only removes one memory access, but also improves pipeline performance. The store to X(I) will no longer cause the load of X(I-1) on the next iteration to block. Loop 11 presents a similar situation.
Loop 6, shown below, is also an interesting case because it involves an invariant array reference that requires dependence analysis to detect.
      DO 6 I = 2,N
      DO 6 K = 1,I-1
6        W(I) = W(I) + B(I,K)*W(I-K)
A compiler that recognizes loop-invariant addresses, such as the IBM compiler, fails to get this case because of the load of W(I-K). Through the use of dependence analysis, we are able to prove that there is no dependence between W(I) and W(I-K) that is carried by the innermost loop, allowing code motion. Because of this additional information, we are able to get a speedup of 1.35.
Procedure Subsume(G, P)
Input: P = set of reference partitions
       G = (V, E), the dependence graph
x = lcm(r_{p_0}, ..., r_{p_n}) - 1
unroll G x times
for G_k = each of the x new loop bodies
    for each v ∈ V_k
        if v = T_p^j then
            replace v with T_p^{mod(j-k, r_p)}
end

Figure 2.13   Subsume
      DO 3 I = 2, 100, 2
         IF (M(I) .LT. 0) THEN
            T10 = C(I)
            E(I) = T10
         ELSE
            T10 = C(I)
         ENDIF
         T30 = T10 + D(I)
         A(I) = T30
         T20 = T20 + T31
         IF (M(I+1) .LT. 0) THEN
            T10 = C(I+1)
            E(I+1) = T10
         ELSE
            T10 = C(I+1)
         ENDIF
         T31 = T10 + D(I+1)
         A(I+1) = T31
3        T20 = T20 + T30

Figure 2.14   Example After Scalar Replacement

[Figure: a Fortran program enters the transformer (ParaScope); both the improved and the original versions are then fed to the Fortran compiler.]

Figure 2.15   Experimental design.
Linear Algebra Kernels. We also tested scalar replacement on both the point and block versions of LU decomposition, with and without partial pivoting. In the table below, we show the results.
    Kernel               Original   Transformed   Speedup
    LU Decomp            6.76s      6.09s         1.11
    Block LU             6.39s      4.40s         1.45
    LU w/ Pivot          7.01s      6.35s         1.10
    Block LU w/ Pivot    6.84s      4.81s         1.42
Each of these kernels contains invariant array references that require dependence analysis to detect. The speedup achieved on the block algorithms is higher because an invariant load and store are removed rather than just a load, as in the point algorithms.
Applications. To complete our study, we ran a number of Fortran applications through our translator. We chose programs from SPEC, Perfect, RiCEPS and local sources. Of those programs that belong to the benchmark suites but are not included in the experiment, 5 failed to be successfully analyzed by PFC, 1 failed to compile on the RS/6000, and 10 contained no opportunities for our algorithms. Table 3 contains a short description of each application.
    Suite     Application   Description
    SPEC      Matrix300     Matrix Multiplication
              Tomcatv       Mesh Generation
    Perfect   Adm           Pseudospectral Air Pollution
              Arc2d         2d Fluid-Flow Solver
              Flo52         Transonic Inviscid Flow
    RiCEPS    Onedim        Time-Independent Schrodinger Equation
              Shal          Weather Prediction
              Simple        2d Hydrodynamics
              Sphot         Particle Transport
              Wave          Electromagnetic Particle Simulation
    Local     CoOpt         Oil Exploration
              Seval         B-Spline Evaluation
              Sor           Successive Over-Relaxation
The results of performing scalar replacement on these applications are reported in the following table. Any application not listed observed a speedup of 1.00.
    Suite     Program   Original   Transformed   Speedup
    Perfect   Adm       236.84s    228.84s       1.03
              Arc2d     410.13s    407.57s       1.01
              Flo52     66.32s     63.83s        1.04
    RiCEPS    Shal      302.03s    290.42s       1.04
              Simple    963.20s    934.13s       1.03
              Sphot     3.85s      3.78s         1.02
              Wave      445.94s    431.11s       1.03
    Local     CoOpt     122.88s    120.44s       1.02
              Seval     0.62s      0.56s         1.11
              Sor       1.83s      1.26s         1.46
The applications Sor and Seval performed the best because we were able to optimize their respective computationally intensive loops. Each had one loop which comprised almost the entire running time of the program. For the program Simple, one loop comprised approximately 50% of the program execution time, but the carried dependence could not be exposed due to a lack of registers. In fact, without the register-pressure minimization algorithm of Section 2.2.8, program performance deteriorated. Sphot's improvement was gained by performing scalar replacement on one partition in one loop. This particular partition contained a loop-independent partially available generator that required our extension to handle control flow.
The IBM RS/6000 has a load penalty of only 1 cycle. On processors with larger load penalties, such as the DEC Alpha, we would expect to see a larger performance gain through scalar replacement. Additionally, the
problem sizes for benchmarks are typically small. On larger problem sizes, we expect to see larger performance gains due to a higher percentage of time being spent inside of loops.
2.4 Summary
In this chapter, we have presented an algorithm to perform scalar replacement in the presence of forward conditional-control flow. By mapping partial redundancy elimination to scalar replacement, we are able to ensure that we will not increase the run-time memory costs of a loop. We have applied our algorithm to a number of kernels and whole applications and shown that integer-factor speedups over good optimizing compilers are possible on kernels.
Chapter 3
Unroll-And-Jam
Because applying scalar replacement alone may still leave a loop memory bound, we can sometimes apply unroll-and-jam before scalar replacement to allow a larger reduction in loop balance. For example, recall from Chapter 1 that given the loop
      DO 10 I = 1, 2*M
      DO 10 J = 1, N
10       A(I) = A(I) + B(J)
with a balance of 1, unroll-and-jam of the I-loop by 1 produces the loop
      DO 10 I = 1, 2*M, 2
      DO 10 J = 1, N
         A(I) = A(I) + B(J)
10       A(I+1) = A(I+1) + B(J)
with a balance of 0.5. On a machine that can perform 2 flops per load, for a machine balance of 0.5, the transformed loop would execute faster because it would be balanced.
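Stated as arithmetic (using the balance definitions from Chapter 1), the example works out as follows; the tiny sketch is illustrative only.

    # Loop balance: memory references per floating-point operation in
    # one innermost-loop iteration.
    def loop_balance(mem_refs, flops):
        return mem_refs / flops

    # With A(I) held in a register, the original body issues one load
    # of B(J) per flop: loop_balance(1, 1) == 1.0. After unroll-and-jam
    # the two copies share that load: loop_balance(1, 2) == 0.5.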
Although unroll-and-jam has been studied extensively, it has not been shown how to tailor unroll-and-jam to specific loops run on specific architectures. In the past, unroll amounts have been determined experimentally and specified with a compile-time parameter [CCK90]. However, the best choice for unroll amounts varies between loops and architectures. Therefore, in this chapter, we derive a method to choose unroll amounts automatically in order to balance program loops with respect to a specific target architecture. Unroll-and-jam is tailored to a specific machine based on a few parameters of the architecture, such as the effective number of registers and machine balance. The result is a machine-independent transformation system in the sense that it can be retargeted to new processors by changing parameters.
This chapter begins with an overview of the safety conditions for unroll-and-jam. Then, we present a method for updating the dependence graph so that scalar replacement can take advantage of the new opportunities created by unroll-and-jam. Next, we derive an automatic method to balance program loops with respect to a particular architecture and, finally, we report on performance improvements achieved by an experimental implementation of this algorithm.
3.1 Safety
Certain dependences prevent or limit the amount of unrolling that can be done while still allowing jamming to be safe. Consider the following loop.
      DO 10 I = 1, 2*M
      DO 10 J = 1,N
10       A(I,J) = A(I-1,J+1)
A true dependence with a distance vector of ⟨1,-1⟩ goes from A(I,J) to A(I-1,J+1). Unrolling the outer loop once creates a loop-independent true dependence that becomes an antidependence in the reverse direction after loop jamming. This dependence prevents jamming because the order of the store to and the load
from the location would be reversed, as shown in the unrolled code below.
      DO 10 I = 1, 2*M, 2
      DO 10 J = 1,N
         A(I,J) = A(I-1,J+1)
10       A(I+1,J) = A(I,J+1)
Location A(3,2) is defined on iteration ⟨3,2⟩ and read on iteration ⟨4,1⟩ in the original loop. In the transformed loop, the location is read on iteration ⟨3,1⟩ and defined on iteration ⟨3,2⟩, reversing the access order and changing the semantics of the loop. With a semantics that is only defined on correct programs (as in Fortran), we say that unroll-and-jam is safe if the flow of values is not changed. The following theorem summarizes this property.
Theorem 3.1   Let e be a true, output or antidependence carried at level k with distance vector

    d = ⟨0, ..., 0, d_k, 0, ..., 0, d_j, ...⟩,

where d_j < 0 and all components between the kth and jth are 0. Then the loop at level k can be unrolled at most d_k - 1 times before a dependence is generated that prevents fusion of the inner n - k loops.
Proof   See Callahan, et al. [CCK88].
Essentially, unrolling more than d_k - 1 times will introduce a dependence with a negative entry in the outermost position after fusion. Since negative thresholds are not defined, the dependence direction must be reversed. Dependence direction reversal makes unroll-and-jam unsafe. Additionally, when non-DO statements appear at levels other than the innermost level, loop distribution must be safe in order for unroll-and-jam to be safe. Essentially, no recurrence involving data or control dependences can be violated [AK87].
3.2 Dependence Copying
To allow scalar replacement to take advantage of the new opportunities for reuse created by unroll-and-jam, we must update the dependence graph to reflect these changes. Below is a method to compute the updated dependence graph after loop unrolling for consistent dependences that contain only one loop induction variable in each subscript position and are not invariant with respect to the unrolled loop [CCK88].
If we unroll the mth loop in a nest L by a factor of k, then the updated dependence graph G' = (V', E') for L' can be computed from the dependence graph G = (V, E) for L by the following rules:

1. For each v ∈ V, there are k+1 nodes v_0, ..., v_k in V'. These correspond to the original reference and its k copies.

2. For each edge e = ⟨v,w⟩ ∈ E with distance vector d(e), there are k+1 edges e_0, ..., e_k, where e_j = ⟨v_j, w_i⟩, v_j is the jth copy of v, w_i is the ith copy of w, and

       i = (j + d_m(e)) mod (k+1)

       d_m(e_j) = ⌊d_m(e)/(k+1)⌋       if i ≥ j
       d_m(e_j) = ⌊d_m(e)/(k+1)⌋ + 1   if i < j
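These rules are easy to exercise. For a single consistent edge with distance d_m on the unrolled loop, the following sketch (function and names assumed, not from the implementation) enumerates the copied edges as (source copy, sink copy, new distance) triples.

    # Dependence copying for one edge when loop m is unrolled by a
    # factor of k (k+1 bodies): i = (j + d_m) mod (k+1), and the new
    # distance gains one whenever the sink copy precedes the source.
    def copy_edge(d_m, k):
        edges = []
        for j in range(k + 1):
            i = (j + d_m) % (k + 1)
            new_d = d_m // (k + 1) + (1 if i < j else 0)
            edges.append((j, i, new_d))
        return edges

    # copy_edge(2, 2) reproduces Figure 3.1:
    # [(0, 2, 0), (1, 0, 1), (2, 1, 1)].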
Below, dependence copying is described for invariant references. To aid in the explanation, the following terms are defined.
1. If references are invariant with respect to a particular loop, we call that loop reference invariant.
2. If a specific reference, v, is invariant with respect to a loop, we call that loop v-invariant.
The behavior of invariant references differs from that of variant references when unroll-and-jam is applied. Given an invariant reference v, the distance vector value of an incoming edge corresponding to a v-invariant loop represents multiple values rather than just one value. Therefore, we first apply the previous formula to the minimum distance. Then, we create a copy of the original edge for each new loop body between the source and sink corresponding to the original references, and insert loop-independent dependences from the references within a statement to all identical references following them.
To extend dependence copying to unroll-and-jam, we have the following rules, using an outermost- to innermost-loop application ordering. If we unroll loop k, we have a new distance vector d(e) = ⟨d_1, d_2, ..., d_n⟩. If d_k(e) = 0 and ∀m < k, d_m(e) = 0, then the next inner non-zero entry, d_p(e), p > k, becomes the threshold of e and loop p becomes the carrier of e. If ∀m, d_m(e) = 0, then e is loop independent. If ∃m ≤ k such that d_m(e) > 0, then e remains carried by loop m with the same threshold.
The safety rules of unroll-and-jam prevent the exposure of a negative distance vector entry in the outermost non-zero position for true, output and antidependences. However, the same is not true for input dependences: the order in which reads are performed does not change the semantics of the loop. Therefore, if the outermost entry of an input dependence distance vector becomes negative during unroll-and-jam, we change the direction of the dependence and negate the entries in the vector.
To demonstrate dependence copying, consider Figure 3.1. The original edge from A(I,J) to A(I,J-2) has a distance vector of ⟨2,0⟩ before unroll-and-jam. After unroll-and-jam, the edge from A(I,J) now goes to A(I,J) in statement 10 and has a vector of ⟨0,0⟩. An edge from A(I,J+1) to A(I,J-2) with a vector of ⟨1,0⟩ and an edge from A(I,J+2) to A(I,J-1) with a vector of ⟨1,0⟩ have also been created. For the first edge, the first formula for d_m(e_j) applies. For the latter two edges, the second formula applies.
3.3 Improving Balance with Unroll-and-Jam
As stated earlier, we would like to convert loops that are bound by main-memory access time into loops that are balanced. Since unroll-and-jam can introduce more floating-point operations without a proportional increase in memory accesses, it has the potential to improve loop balance. Although unroll-and-jam improves balance, using it carelessly can be counterproductive. Transformed loops that spill floating-point registers or whose object code is larger than the instruction cache may suffer performance degradation. Since the size of the object code is compiler dependent, we are forced to assume that the instruction cache is large enough to hold the unrolled loop body. By doing so, we can remain relatively independent of the details of a particular software system. This assumption proved valid on the IBM RS/6000 and leaves us with the following objectives for unroll-and-jam.
1. Balance a loop with a particular architecture.
2. Control register pressure.
      DO 10 J = 1,N          |        DO 10 J = 1,N,3
      DO 10 I = 1,N          |        DO 10 I = 1,N
10       A(I,J) = A(I,J-2)   |           A(I,J)   = A(I,J-2)
                             |           A(I,J+1) = A(I,J-1)
                             |  10       A(I,J+2) = A(I,J)

Figure 3.1   Distance Vectors Before and After Unroll-and-Jam
Expressing these goals as an integer optimization problem, we have

    objective function:  min |β_L - β_M|
    constraint:          # floating-point registers required ≤ register-set size

where the variables in the problem are the unroll amounts for each of the loops in a loop nest. For the solution to the objective function, we prefer a slightly negative difference over a slightly positive difference.*
For each loop nest within a program, we model its possible transformation as a problem of this form. Solving it will give us the unroll amounts to balance the loop nest as much as possible. Using the following definitions throughout our derivation, we will show how to construct and efficiently solve a balance-optimization problem at compile time.
    V   = array references in an innermost loop
    V_r = members of V that are memory reads
    V_w = members of V that are memory writes
    X_i = number of times the ith outermost loop in L is unrolled + 1
          (this is the number of loop bodies created by unrolling loop i)
    F   = number of floating-point operations after unroll-and-jam
    M   = number of memory references after unroll-and-jam
For the purposes of this derivation, we will assume that each loop nest is perfectly nested. We will show how to handle non-perfectly nested loops in Section 3.4.6. Additionally, loops with conditional-control flow in the innermost loop body will not be candidates for unroll-and-jam. Inserting additional control flow within an innermost loop can have disastrous effects upon performance.†
3.3.1 Computing Transformed-Loop Balance
Since the value of β_M is a compile-time constant defined by the target architecture, we need only create the formula for β_L at compile time to create the objective function. To create β_L, we compute the number of floating-point operations, F, and memory references, M, in one innermost-loop iteration. F is simply the number of floating-point operations in the original loop, f, multiplied by the number of loop bodies created by unroll-and-jam, giving

    F = f · ∏_{1 ≤ i < n} X_i,
where n is the number of loops in a nest. M requires a detailed analysis of the dependence graph.
Additionally, in computing M we will assume an infinite register set. The register-pressure constraint in the optimization problem will ensure that any assumption about a memory reference being removed will be true.
To simplify the derivation of our formulation for memory cycles, we initially assume that at most one incoming or outgoing edge is incident upon each v ∈ V and that all subscript positions contain at most one induction variable. We will remove these restrictions later. Additionally, we partition the reference set of the loop into sets that exhibit different memory behavior when unroll-and-jam is applied. These sets are listed below.
1. V ; = referenes without an inoming onsistent dependene.
2. VrC = memory reads that have a loop-arried or loop-independent inoming onsistent dependene,
but are not invariant with respet to any loop.
On massively parallel systems, it may be advantageous to have L << M
y Experiments have shown this to be true on the IBM RS/6000.
to redue network ontention.
3. $V_r^I$ = memory reads that are invariant with respect to a loop.
4. $V_w^C$ = memory writes that have a loop-carried or loop-independent incoming consistent dependence, but are not invariant with respect to any loop.
5. $V_w^I$ = memory writes that are invariant with respect to a loop.
Using this partitioning, we compute the cost associated with each partition, $\{M_\emptyset, M_r^C, M_r^I, M_w^C, M_w^I\}$, to get the total memory cost,
\[
M = M_\emptyset + M_r^C + M_r^I + M_w^C + M_w^I.
\]
For the first partition, $M_\emptyset$, each copy of the loop body produced by unroll-and-jam contains one memory reference associated with each original reference. This value can be expressed as follows:
\[
M_\emptyset = \sum_{v \in V_\emptyset} \Bigl( \prod_{1 \le i < n} X_i \Bigr).
\]
For example, in Figure 3.2, unrolling the outer two loops by 1 creates 3 additional copies from the reference to A(I,J,K) that will be references to memory.
For each $v \in V_r^C$, unroll-and-jam may create dependences useful in scalar replacement for some of the new references created. In order for a reference, $v$, to be scalar replaced, its incoming dependence, $e_v$, must be converted into an innermost-loop-carried or a loop-independent dependence, which we call an innermost dependence edge. In terms of the distance vector associated with an edge, $d(e_v) = \langle d_1, d_2, \ldots, d_n \rangle$, the edge is made innermost when it becomes $d(e'_v) = \langle 0, 0, \ldots, 0, d_n \rangle$. After unroll-and-jam, only some of the references will be the sink of an innermost edge. It is our goal to quantify this value. For example, in Figure 3.3, the first copy of A(I,J-2), the reference to A(I,J-1), is the sink of a dependence from A(I,J+3) that is not innermost, but the second and third copies, A(I,J) and A(I,J+1), are the sinks of innermost edges (in this case, ones that are loop independent) and will be removed by scalar replacement.
To quantify the behavior of $V_r^C$, we first compute the total number of references before scalar replacement as in $M_\emptyset$,
\[
\sum_{v \in V_r^C} \Bigl( \prod_{1 \le i < n} X_i \Bigr).
\]
Next, we determine which of the new references will be removed by scalar replacement by quantifying the number of references that will be the sink of an innermost dependence after unroll-and-jam. To simplify the derivation, we consider unrolling only loop $i$, where $\forall e$ and $\forall k \ne i, n$ we have $d_k(e) = 0$. This limits the dependences to be carried by loop $i$ or to be innermost already. Later, we will extend the formulation to arbitrary loop nests and distance vectors.
Recall from our dependence-copying formulas in Section 3.2 that $d_i(e_v) < X_i$, for some $v \in V_r^C$, must be true in order to create a distance vector with a 0 in the $i$th position, resulting in an innermost dependence edge. However, depending upon which of the two distance formulas is applied, the edge may still have a 1 in the $i$th position. In order for the formula that allows an entry to become 0, $d_i(e'_v) = \lfloor d_i(e_v)/X_i \rfloor$, to be applied, the edge created by unroll-and-jam, $e'_v = \langle w_m, v_n \rangle$, must have $n \ge m$ true. If $n < m$, then the applicable formula, $d_i(e'_v) = \lfloor d_i(e_v)/X_i \rfloor + 1$, prevents an entry from becoming 0. In the following theorem, we show that $n \ge m$ is true for $X_i - d_i(e_v)$ of the new dependence edges if $d_i(e_v) < X_i$. This shows that $X_i - d_i(e_v)$ new references can be removed by scalar replacement.
      DO 10 K = 1,N                      DO 10 K = 1,N,2
         DO 10 J = 1,N                      DO 10 J = 1,N,2
            DO 10 I = 1,N                      DO 10 I = 1,N
   10          A(I,J,K) = ...                     A(I,J,K)     = ...
                                                  A(I,J,K+1)   = ...
                                                  A(I,J+1,K)   = ...
                                      10          A(I,J+1,K+1) = ...

Figure 3.2  $V_\emptyset$ Before and After Unroll-and-Jam
Theorem 3.2  Given $0 \le m, d_i(e) < X_i$ and $n = (m + d_i(e)) \bmod X_i$ for each new edge $e'_v = \langle w_m, v_n \rangle$ created from $e_v = \langle w_0, v_0 \rangle$ by unroll-and-jam, then
\[
n \ge m \iff m < X_i - d_i(e).
\]
Proof
($\Rightarrow$) Given $n \ge m$, assume $m \ge X_i - d_i(e)$. Since $0 \le m, d_i(e) < X_i$, we have
\[
n = (m + d_i(e)) \bmod X_i = m + d_i(e) - X_i,
\]
giving $m + d_i(e) - X_i \ge m$. Since $d_i(e) < X_i$, the inequality is violated, leaving $m < X_i - d_i(e)$.
($\Leftarrow$) Given $m < X_i - d_i(e)$ and $m, d_i(e) \ge 0$, then
\[
n = (m + d_i(e)) \bmod X_i = m + d_i(e).
\]
Since $m, d_i(e) \ge 0$, we have $m + d_i(e) \ge m$, giving $n \ge m$. $\Box$
In the case where $X_i \le d_i(e_v)$, no edge can be made innermost, resulting in a general formula of $(X_i - d_i(e_v))^+$ edges with $d_i(e'_v) = 0$, where $x^+$ is the positive part of $x$.
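As a concrete check of Theorem 3.2 (our numbers, not an example from the text): take $X_i = 3$ and $d_i(e) = 1$, so the theorem predicts $X_i - d_i(e) = 2$ innermost copies:
\[
\begin{aligned}
m=0:&\; n=(0+1)\bmod 3 = 1 \ge m \\
m=1:&\; n=(1+1)\bmod 3 = 2 \ge m \\
m=2:&\; n=(2+1)\bmod 3 = 0 < m
\end{aligned}
\]
Exactly the copies with $m < 2$ satisfy $n \ge m$, as claimed.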
For an arbitrary distance vector, if any outer entry is left greater than 0 after unroll-and-jam, the new edge will not be innermost. Additionally, once a dependence edge becomes innermost, unrolling any loop additionally creates a copy of that edge. For example, in Figure 3.3, the load from A(I,J+1) has an incoming
      DO 10 J = 1,N                      DO 10 J = 1,N,4
         DO 10 I = 1,N                      DO 10 I = 1,N
   10       A(I,J) = A(I,J-2)                  A(I,J)   = A(I,J-2)
                                               A(I,J+1) = A(I,J-1)
                                               A(I,J+2) = A(I,J)
                                      10       A(I,J+3) = A(I,J+1)

Figure 3.3  $V_r^C$ Before and After Unroll-and-Jam
innermost edge that is a copy of the incoming edge for the load from A(I,J). The edge into A(I,J+1) was created by additional unrolling of the J-loop after A(I,J)'s edge was created. Quantifying these ideas, we have
\[
\prod_{1 \le i < n} (X_i - d_i(e_v))^+
\]
references that can be scalar replaced. Subtracting this value from the total number of references before scalar replacement and after unroll-and-jam gives the value for $M_r^C$. Therefore,
\[
M_r^C = \sum_{v \in V_r^C} \Bigl( \prod_{1 \le i < n} X_i - \prod_{1 \le i < n} (X_i - d_i(e_v))^+ \Bigr).
\]
In Figure 3.3, $X_1 = 4$ and $d_1(e) = 2$ for the reference to A(I,J-2). Our formula correctly predicts that 2 memory references will be removed. Note that if $X_1 \le 2$ were true, no memory references would have been removed because no distance vector would have a 0 in the first entry.
For references that are invariant with respect to a loop, memory behavior is different from variant references. If $v \in V_r^I$ is invariant with respect to loop $L_i$, $i < n$, unrolling $L_i$ will not introduce more references to memory because each unrolled copy of $v$ will access the same memory location as $v$. This is because the induction variable for $L_i$ does not appear in the subscript expression of $v$. Unrolling any loop that is not $v$-invariant will introduce memory references. Finally, if $v$ is invariant with respect to the innermost loop, no memory references will be left after scalar replacement no matter which outer loops are unrolled. In Figure 3.4, unrolling the K-loop introduces memory references, while unrolling the J-loop does not. Using these properties, we have
\[
M_r^I = \sum_{v \in V_r^I} \Bigl( \prod_{1 \le i < n} \omega(e_v, i, n) \Bigr)
\]
where
  ω(e, i, n):  if e is invariant wrt loop n then
                 return 0
               else if e is invariant wrt loop i then
                 return 1
               else
                 return X_i
Members of $V_w^C$ behave exactly as those in $V_r^C$ except that scalar replacement will remove stores only when their incoming dependence is loop independent. Thus, for $M_w^C$ we have
\[
M_w^C = \sum_{v \in V_w^C} \gamma(v, 1, n, n)
\]

      DO 10 K = 1,N                      DO 10 K = 1,N,2
         DO 10 J = 1,N                      DO 10 J = 1,N,2
            DO 10 I = 1,N                      DO 10 I = 1,N
   10          ... = A(I,K)                       ... = A(I,K)
                                                  ... = A(I,K+1)
                                                  ... = A(I,K)
                                      10          ... = A(I,K+1)

Figure 3.4  $V_r^I$ Before and After Unroll-and-Jam
where
\[
\gamma(v, j, k, n) = \begin{cases}
\displaystyle \prod_{j \le i < k} X_i - \prod_{j \le i < k} (X_i - d_i(e_v))^+ & \text{if } d_n(e_v) = 0 \text{ and } e_v \text{ is an output edge} \\[1ex]
\displaystyle \prod_{j \le i < k} X_i & \text{otherwise.}
\end{cases}
\]
Finally, $M_w^I$ is defined exactly as $M_r^I$.
Now that we have obtained the formulation for loop balance in relation to unroll-and-jam, we can use it to prove that unroll-and-jam will never increase loop balance. Consider the following formula for $\beta_L$ when unrolling only loop $i$:
\[
\beta_L = \frac{\displaystyle \sum_{v \in V_\emptyset} X_i + \sum_{v \in V_r^C} \bigl( X_i - (X_i - d_i(e_v))^+ \bigr) + \sum_{v \in V_r^I \cup V_w^I} \omega(e_v, i, n) + \sum_{v \in V_w^C} \gamma(v, i, i+1, n)}{f X_i}.
\]
If we examine the vertices and edges in each of the above vertex sets to determine the simplified formula for $M$, we obtain the following, where $\forall j,\; a_j \ge 0$:
\[
\begin{aligned}
M_\emptyset &= a_0 X_i \\
M_r^C &= a_1 X_i + a_2 \\
M_r^I &= a_3 X_i + a_4 \\
M_w^C &= a_5 X_i + a_6 \\
M_w^I &= a_7 X_i + a_8.
\end{aligned}
\]
Combining these terms we get
\[
\beta_L = \frac{\alpha_0 X_i + \alpha_1}{f X_i}
\]
where $\alpha_0, \alpha_1 \ge 0$. Note that the values of $\alpha_0$ and $\alpha_1$ may vary with the value of $X_i$. As $X_i$ increases, $\alpha_0$ may decrease and $\alpha_1$ may increase because the value of $(X_i - d_i(e_v))^+$ may become positive for some $v \in V_r^C$. In the theorem below, we show that $\beta_L$ will not increase as $X_i$ increases, showing that unrolling one loop will not increase loop balance.
Theorem 3.3  The function for $\beta_L$ is nonincreasing in the $i$ dimension.
Proof
For the original loop, $X_i = 1$ and
\[
\beta_L = \frac{\alpha_0 + \alpha_1}{f}.
\]
If we unroll-and-jam the $i$-loop $n-1$ times, $n > 1$, then $X_i = n$, giving
\[
\beta_L' = \frac{\alpha_0' n + \alpha_1 + \alpha_2}{f n}
\]
where $\alpha_2$ is the sum of all $d_i(e_v)$ where $(X_i - d_i(e_v))^+$ became positive with the increase in $X_i$. Since $\alpha_0$ will decrease or remain the same with the change in $X_i$, $\alpha_0' \le \alpha_0$. The difference between $\alpha_0$ and $\alpha_0'$ is the number of edges where $(X_i - d_i(e_v))^+$ becomes positive with the change in $X_i$. Given that $X_i = n$, the maximum value of any $d_i(e_v)$ added into $\alpha_2$ is $n - 1$ (any greater value would not make $(X_i - d_i(e_v))^+$ positive). Therefore, the maximum value of $\beta_L'$ is
\[
\beta_L' = \frac{\alpha_0' n + \alpha_1 + (\alpha_0 - \alpha_0')(n-1)}{f n}.
\]
To show that $\beta_L$ is nonincreasing, we must show that $\beta_L \ge \beta_L'$, or that the unrolled loop's balance is no greater than the original balance:
\[
\begin{aligned}
\frac{n(\alpha_0 + \alpha_1)}{n f} &\ge \frac{\alpha_0' n + (\alpha_0 - \alpha_0')(n-1) + \alpha_1}{f n} \\
n\alpha_0 + n\alpha_1 &\ge n\alpha_0' + n\alpha_0 - n\alpha_0' + \alpha_0' - \alpha_0 + \alpha_1 \\
(n-1)\alpha_1 &\ge \alpha_0' - \alpha_0.
\end{aligned}
\]
Since $n > 1$, $\alpha_1 \ge 0$ and $\alpha_0' \le \alpha_0$, the inequality holds and $\beta_L$ is nonincreasing. $\Box$
Given that we unroll in an outermost- to innermost-loop order, unrolling loop 1 produces a new loop with balance $\beta_L^1$ such that $\beta_L^1 \le \beta_L$ by Theorem 3.3. After unrolling loop $i$, assume we have $\forall j < i,\; \beta_L^i \le \beta_L^j$. If we next unroll loop $i+1$ to produce $\beta_L^{i+1}$, then again by Theorem 3.3, $\beta_L^{i+1} \le \beta_L^i$. Therefore, unrolling multiple loops will not increase loop balance.
An Example.  As an example of a formula for loop balance, consider the following loop.

      DO 10 J = 1,N
         DO 10 I = 1,N
   10       A(I,J) = A(I,J-1) + B(I)
We have A(I,J) $\in V_\emptyset$, A(I,J-1) $\in V_r^C$ with an incoming distance vector of $\langle 1, 0 \rangle$, and B(I) $\in V_r^I$. This gives
\[
\beta_L = \frac{X_1 + X_1 - (X_1 - 1)^+ + 1}{X_1}.
\]
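Plugging in a few unroll amounts (our arithmetic, not the thesis's) shows the balance falling monotonically, as Theorem 3.3 promises:
\[
X_1 = 1:\; \beta_L = \tfrac{1+1-0+1}{1} = 3; \qquad
X_1 = 2:\; \beta_L = \tfrac{2+2-1+1}{2} = 2; \qquad
X_1 = 4:\; \beta_L = \tfrac{4+4-3+1}{4} = 1.5,
\]
approaching $\beta_L \to 1$ as $X_1 \to \infty$: the store in each body remains, while only two loads survive in total.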
3.3.2
Estimating Register Pressure
Now that we have the formula for computing loop balance given a set of unroll amounts, we must create the formula for register pressure. To compute the number of registers required by an unrolled loop, we need to determine which references can be removed by scalar replacement and how many registers each of those references needs. We will refer to this quantity as $R$. As in computing $M$, we consider the partitioned sets of $V$, $\{V_\emptyset, V_r^C, V_r^I\}$, but we do not consider $V_w$ because its register requirements are included in the computation of $V_r$. Associated with each of the partitions of $V$ is the number of registers required by that set, $\{R_\emptyset, R_r^C, R_r^I\}$, giving
\[
R = R_\emptyset + R_r^C + R_r^I.
\]
Since each member of $V_\emptyset$ has no incoming dependence, we cannot use registers to capture reuse. However, members of $V_\emptyset$ still require a register to hold a value during expression evaluation. To compute the number of registers required for those memory references not scalar replaced and for those temporaries needed during the evaluation of expressions, we use the tree-labeling technique of Sethi and Ullman [SU70]. Using this technique, the number of registers required for each expression in the original loop body is computed and the maximum over all expressions is taken as the value for $R_\emptyset$.
For variant references whose incoming edge is carried by some loop, $v \in V_r^C$, we need to determine which (or how many) references will be removed by scalar replacement after unroll-and-jam and how many registers are required for those references. The former value is derived in the computation of $M_r^C$ and is shown below:
\[
\sum_{v \in V_r^C} \Bigl( \prod_{1 \le i < n} (X_i - d_i(e_v))^+ \Bigr).
\]
The latter value, $d_n(e_v) + 1$, comes from scalar replacement, where we assume that all registers interfere. Any value that flows across a loop-carried dependence interferes with all other values, and predicting the effects of scheduling before optimization to handle loop-independent interference is beyond the scope of this work. Therefore,
\[
R_r^C = \sum_{v \in V_r^C} \Bigl( \prod_{1 \le i < n} (X_i - d_i(e_v))^+ \Bigr) \cdot (d_n(e_v) + 1).
\]
References that are invariant with respect to an outer loop require registers only if a corresponding reference-invariant loop is unrolled. However, references that are invariant with respect to the innermost loop always require registers. For each $v \in V_r^I$, unrolling loops that are not $v$-invariant creates references that possibly require additional registers. In Figure 3.4, the reference to A(I,K) requires no registers unless the J-loop is unrolled, and since K is unrolled once, two registers are required. Formulating this behavior, we have
\[
R_r^I = \sum_{v \in V_r^I} \Bigl( \prod_{1 \le i < n} \psi(e_v, i, n) \Bigr)
\]
where
  ψ(e, i, n):  if e is invariant wrt loop i then
                 if (∃ X_j | X_j > 1 ∧ e is invariant wrt loop j) or
                    (e is invariant wrt loop n) then
                   return 1
                 else
                   return 0
               else
                 return X_i
An Example.  As an example of a formula for computing register pressure, consider the following loop.

      DO 10 J = 1,N
         DO 10 I = 1,N
   10       A(I,J) = A(I,J-1) + B(J)
We have A(I,J) $\in V_\emptyset$, A(I,J-1) $\in V_r^C$ with an incoming distance vector of $\langle 1, 0 \rangle$, and B(J) $\in V_r^I$. This gives
\[
R = 1 + (X_1 - 1)^+ + X_1.
\]
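For instance (our arithmetic), unrolling J three times gives $X_1 = 4$ and $R = 1 + 3 + 4 = 8$: one register for expression evaluation, three for the scalar-replaced copies of A(I,J-1), and four for the unrolled copies of B(J).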
3.4
Applying Unroll-and-Jam in a Compiler
In this section, we will discuss how to take the previous optimization problem and solve it practically for real program loop nests. In general, integer-programming problems are NP-complete, but we can make simplifying assumptions for our optimization problem that will make its solution tractable. First, we show how to choose the loops to which unroll-and-jam will be applied, and then we show how to choose the unroll amounts for those loops.
3.4.1 Picking Loops to Unroll
Applying unroll-and-jam to all of the loops in an arbitrary loop nest can make the solution to our optimization problem extremely complex. To simplify the solution procedure, we will only apply unroll-and-jam to a subset of the loop nest. Since experience suggests that most loop nests have a nesting depth of 3 or less, we can limit unroll-and-jam to 1 or 2 loops in a given nest.
Our heuristic for determining the subset of loops for unroll-and-jam is to pick those loops whose unrolling is likely to result in the fewest memory references in the unrolled loop. In particular, we unroll the loops that carry the most dependences that will be innermost after unroll-and-jam is applied. To find these loops, we examine each distance vector to determine if it can be made innermost for each pair of loops in the nest. Given a distance vector $d(e) = \langle d_1, \ldots, d_i, \ldots, d_j, \ldots, d_n \rangle$ where $d_i, d_j \ge 0$, unrolling loops $i$ and $j$ can make $e$ innermost if $\forall k,\; k \ne i, j, n$ we have $d_k(e) = 0$. If we unroll only one loop, $i$, then $e$ can be made innermost if $\forall k,\; k \ne i, n$ we have $d_k(e) = 0$. The algorithm for choosing loops is shown in Figure 3.5.
procedure PickLoops(V)
  Input: V = array references in a loop
  for each v ∈ V
    if v is a memory read or d_n(e_v) = 0 then
      for i = 1 to n−1
        for each j ≥ i such that ∀k, k ≠ i, j, n we have d_k(e_v) = 0
          count(i,j)++
  if unrolling two loops
    unroll loops i, j such that count(i,j) is the maximum
  else
    unroll loop i such that count(i,i) is the maximum
end

Figure 3.5  PickLoops

3.4.2 Picking Unroll Amounts
In the following discussion of our problem solution, we consider unrolling only one loop to bring clarity to the solution procedure. Later, we will discuss how to extend the solution to the problem for two loops. Given this restriction, we can reduce the optimization formula to the following, where $\delta = (X_i - d_i(e))^+$ and $\rho$ is the effective size of the floating-point register set for the target architecture:
\[
\min \left| \frac{\displaystyle \sum_{v \in V_\emptyset} X_i + \sum_{v \in V_r^C} (X_i - \delta) + \sum_{v \in V_r^I \cup V_w^I} \omega(e_v, i, n) + \sum_{v \in V_w^C} \gamma(e_v, i, i+1, n)}{f X_i} - \beta_M \right|
\]
subject to
\[
R_\emptyset + \sum_{v \in V_r^C} \delta \,(d_n(e_v) + 1) + \sum_{v \in V_r^I} \psi(e_v, i, n) \le \rho, \qquad X_i \ge 1.
\]
We begin the solution procedure by examining the vertices and edges of the dependence graph to accumulate the coefficient for $X_i$ and any constant value, as was shown earlier. Essentially, we solve the summations to get a linear function of $X_i$. However, the $^+$-function requires us to keep an array of coefficients and constants, one set for each value of $d_i$, that will be used depending upon the value of $X_i$. In practice, most distances are 0, 1 or 2, allowing us to keep 4 sets of values: one for each common distance and one for the remaining distances. If $d_i > 2$ is true for some edge, our results may be a bit imprecise if $3 \le X_i < d_i$. However, experimentation discovered no instances of this situation.
Given a set of coefficients, we can search the solution space, while checking register pressure, to determine the best unroll factor. Since most dependence distances are 0, 1 or 2, unrolling more than $\rho$ will most likely increase the register pressure beyond the physical limits of the target machine. Therefore, by Theorem 3.3, we can find the solution to the optimization problem in the solution space in $O(\log \rho)$ time.
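To make the search concrete, the following Fortran sketch enumerates candidate unroll amounts for a single loop. It is a minimal illustration, not the thesis implementation: the arguments A0, A1 (memory cost $M = A0 \cdot X_i + A1$, assumed already fixed rather than piecewise in $X_i$), F, BM, R0, R1 (register pressure $R = R0 \cdot X_i + R1$) and RHO are hypothetical, and a linear scan stands in for the $O(\log \rho)$ search that Theorem 3.3 permits.

*     Minimal sketch (not the thesis code): pick the unroll amount
*     XI in 1..RHO minimizing |beta_L - BM| subject to the register
*     constraint.  Assumes the summations have been solved into
*     linear coefficients: M = A0*XI + A1 and R = R0*XI + R1.
      INTEGER FUNCTION BESTXI(A0, A1, F, BM, R0, R1, RHO)
      REAL A0, A1, F, BM, R0, R1, BL, DIFF, BEST
      INTEGER RHO, XI
      BESTXI = 1
      BEST = 1.0E30
      DO 10 XI = 1, RHO
*        register pressure grows with XI, so stop at the first
*        unroll amount that exceeds the register budget
         IF (R0*REAL(XI) + R1 .GT. REAL(RHO)) GO TO 20
         BL = (A0*REAL(XI) + A1) / (F*REAL(XI))
         DIFF = ABS(BL - BM)
*        on a tie, prefer the slightly negative difference
         IF (DIFF .LT. BEST .OR.
     &       (DIFF .EQ. BEST .AND. BL .LT. BM)) THEN
            BEST = DIFF
            BESTXI = XI
         END IF
   10 CONTINUE
   20 RETURN
      END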
To extend unroll-and-jam to two loops, we can create a simplified problem from the original by setting $n = 2$. A formula for two particular loops, $i$ and $j$, can be constructed by examining the dependence distance vectors as in the case for one loop. By Theorem 3.3, if we hold either $X_i$ or $X_j$ constant, the function for $\beta_L$ is nonincreasing, which gives a two-dimensional solution space with each row and column sorted in decreasing order, lending itself to an $O(\rho)$ intelligent search. In general, a solution for unroll-and-jam of $k$ loops, $k > 1$, can be found in $O(\rho^{k-1})$ time.
3.4.3 Removing Interlock
After unroll-and-jam, a loop may contain pipeline interlock, leaving idle computational cycles. To remove these cycles, we use the technique of Callahan et al. to estimate the amount of interlock and then unroll one loop to create more parallelism by introducing enough copies of the inner-loop recurrence [CCK88]. Essentially, the length of the longest recurrence carried by the innermost loop is calculated, and then unroll-and-jam is applied to an outer loop until there are more floating-point operations than the recurrence length, as the example below illustrates.
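As an illustration (our example, not one from the thesis), consider a nest whose inner loop carries a recurrence through a scalar accumulator. Unroll-and-jam of the outer loop creates a second, independent recurrence whose operations can fill the cycles on which the first would otherwise stall:

*     Hypothetical example: after unroll-and-jam of I by 1, the two
*     accumulators T1 and T2 form independent recurrences, so their
*     adds can be interleaved in the pipeline.
      DO 10 I = 1,N,2
         T1 = 0.0
         T2 = 0.0
         DO 20 J = 1,N
            T1 = T1 + A(J,I)
   20       T2 = T2 + A(J,I+1)
         C(I) = T1
   10    C(I+1) = T2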
3.4.4 Multiple Edges
In general, we cannot assume that there will be only one edge entering or leaving each node in the dependence graph, since multiple incident edges are possible and likely. One possible solution is to treat each edge separately as if it were the only incident edge. Unfortunately, this is inadequate because many edges may correspond to the flow of the same set of values. In the loop,

      DO 10 J = 1,N
         DO 10 I = 1,N
   10       B(I,J) = A(I-1,J) + A(I-1,J) + A(I,J)

there is an input dependence from A(I,J) to each A(I-1,J) with distance vector $\langle 0, 1 \rangle$ and a loop-independent dependence between the A(I-1,J)'s. Considering each edge separately would give a register-pressure estimation of 5 registers per unrolled iteration of J rather than the actual value of 2. Both values provided come from the same source, requiring us to use the same registers. Although the original estimation is conservatively safe, it is more conservative than necessary. To improve our estimation, we consider register sharing.
To relate all references that access the same values, our scalar replacement algorithm considers the oldest reference in a dependence chain as the reference that provides the value for the entire chain. Using this technique alone to capture sharing within the context of unroll-and-jam, however, can cause us to miss registers that are used. Consider the following example.
      DO 10 I = 1,N
         DO 10 J = 1,N
   10       C(I,J) = A(I+1,J) + A(I,J) + A(I,J)

The second reference to A(I,J) has two incoming input dependences: one with a distance vector of $\langle 1, 0 \rangle$ from A(I+1,J) and one with a distance vector of $\langle 0, 0 \rangle$ from A(I,J). After unrolling the I-loop by 1 we have

      DO 10 I = 1,N,2
         DO 10 J = 1,N
    1       C(I,J)   = A(I+1,J) + A(I,J)   + A(I,J)
   10       C(I+1,J) = A(I+2,J) + A(I+1,J) + A(I+1,J)

Now, the reference A(I+1,J) in statement 1 provides the value used in statement 10 by both references to A(I+1,J), and the first reference to A(I,J) in statement 1 provides the value used in the second reference to A(I,J) in statement 1. Here, we cannot isolate which of the two references to A will ultimately provide the value for the second A(I,J) before unrolling because they both provide the value at different points. Therefore, we cannot pick just the edge from the oldest reference, A(I+1,J), to calculate its register pressure. Instead, we can use a stepwise approximation where each possible oldest reference within a dependence chain is considered.
The first step is to partition the dependence graph so that all references using the same registers after unrolling will be put in the same partition. The algorithm in Figure 3.6 accomplishes this task. Once we have the references partitioned, we apply the algorithm in Figure 3.7 to calculate the coefficients used in $R$. First, in the procedure OldestValue, we compute the node in each register partition that is first to reference the value that flows throughout the partition. We look for the reference, $v$, that has no incoming dependence from another member of its partition or has only loop-carried incoming dependences that are carried by $v$-invariant loops. We consider only those edges that can create reuse in the unrolled loop, as described in Section 3.4.1. After unroll-and-jam, the oldest reference will provide the value for scalar replacement for the entire partition and is called the generator of the partition.
Next, in procedure SummarizeEdges, we determine the distance vector that will encompass all of the outgoing edges (one to each node in the partition) from the generator of each partition. Because we want
procedure PartitionNodes(G, j, k)
  Input: G = (V, E), the dependence graph
         j, k = loops to be unrolled
  put each v ∈ V in its own partition, P(v)
  while ∃ an unmarked node do
    let v be an arbitrary unmarked node
    mark v
    forall w ∈ V such that ((∃ e = (w,v) ∨ e = (v,w)) ∧
        (e is input or true and is consistent) ∧
        (for i = 1 … n−1, i ≠ j, i ≠ k, d_i(e) = 0)) do
      mark w
      merge P(w) and P(v)
    enddo
  enddo
end

Figure 3.6  PartitionNodes
to be conservative, for $i = 1, \ldots, n-1$ we pick the minimum $d_i$ for all of the generator's outgoing edges. This guarantees that the innermost-loop reuse within a partition will be measured as soon as possible in the unrolled loop body. For $d_n$, we pick the maximum value to make sure that we do not underestimate the number of registers used. After computing a summarized distance vector, we use it to accumulate the coefficients to the equation for register pressure.
At this point in the algorithm, we have created a distance vector and formula to estimate the register requirements of a partition partially given a set of unroll values. In addition, we must account for the intermediate points where the generator of a partition does not provide the reuse at the innermost loop level for some of the references within the partition, as shown in the previous example (i.e., A(I-1,J) and A(I,J)). We will underestimate register requirements if we fail to handle these situations.
First, using procedure Prune, we remove all nodes from a partition for which there are no intermediate points where a reference other than the current generator provides the values for scalar replacement. Any reference that occurs on the same iteration of loops 1 to $n-1$ as the current generator of the partition will have no such points. Their value will always be provided by the current generator. Next, we compute the additional number of registers needed by the remaining nodes in the partition by calculating the coefficients of $R$ for an intermediate generator of the modified partition. The only difference from the computation for the original generator is that we modify the coefficients of $R$ derived from the summarized distance vector of the modified partition by accounting for the registers that we have already counted with the previous generator. This is done by computing the coefficients for $R$ as if an intermediate generator were the generator in an original partition. Then, we subtract from $R$ the number of references that have already been handled by the previous generator. We use the distance vector of the edge, $\hat{e}$, from the previous generator, $v$, to the new intermediate generator, $u$, to compute the latter value. Given $d^s$ for the current summarized distance vector and $d(\hat{e})$ for the edge between the current and previous generators, we compute $R_{d^s} - R_{(d^s + d(\hat{e}))}$. The value $d^s + d(\hat{e})$ would represent the summarized distance vector if the previous generator were the generator of the modified partition. In our example, $d^s = \langle 0, 0 \rangle$ and $d(\hat{e}) = \langle 1, 0 \rangle$, giving $(X_1 - 0)^+ - (X_1 - 1)^+$ additional references removed. These steps for intermediate points are repeated until the partition has 1 or fewer members.
With the allowance of multiple edges, references may be invariant with respect to only one of the unrolled loops and still have a dependence carried by the other unrolled loop. In this case, we treat the references as a member of $V^I$ with respect to the first loop and $V^C$ with respect to the second.
Partitioning is not necessary to handle multiple edges for $M$ because sharing does not occur. Instead, we consider each reference separately using a summarized vector for each $v \in V$ that consists of the minimum $d_i$ over all incoming edges of $v$ for $i = 1, \ldots, n-1$. Although this is a slightly different approximation than used for $R$, it is more accurate and will result in a better estimate of $\beta_L$.
procedure ComputeR(P, G, R)
  Input: P = partition of memory references
         G = (V, E), the dependence graph
         R = register pressure formula
  foreach p ∈ P do
    v = OldestValue(p)
    d = SummarizeEdges(v)
    update R using d
    Prune(p, v)
    while size(p) > 1 do
      u = OldestValue(p)
      d = SummarizeEdges(u)
      update R using R_d − R_{d+d(ê)}, ê = (v, u)
      Prune(p, u)
      v = u
    enddo
  enddo
end

procedure OldestValue(p)
  foreach v ∈ p
    if ∀ e = (u, v), (v is invariant wrt the loop at level(e) ∨
       P(u) ≠ P(v)) then return v
end

procedure SummarizeEdges(v)
  for i = 1 to n−1
    foreach e = (v, w) ∈ E, w is a memory read
      ds_i = min(ds_i, d_i(e))
  foreach e = (v, w) ∈ E, w is a memory read
    ds_n = max(ds_n, d_n(e))
  return ds
end

procedure Prune(p, v)
  remove v from p
  foreach e = (v, w) ∈ E
    if d(e) = (=, =, …, =, ∗) then remove w
end

Figure 3.7  Compute Coefficients for Optimization Problem
3.4.5 Multiple-Induction-Variable Subscripts
In Section 3.2, we discussed a method for updating the dependence graph for references with one induction variable in each subscript position. Unfortunately, this method does not work on subscripts containing multiple induction variables, called MIV subscripts, that have incoming consistent dependences [GKT91]. Consider the following loop.

      DO 10 I = 1,N
         DO 10 J = 1,N
   10       A(I) = A(I) + B(I-J)

The load of B(I-J) has an incoming consistent dependence from itself carried by the I-loop. After unroll-and-jam of the I-loop by a factor of 2, we have

      DO 10 I = 1,N,3
         DO 10 J = 1,N
            A(I) = A(I) + B(I-J)
            A(I) = A(I) + B(I-J+1)
   10       A(I) = A(I) + B(I-J+2)
Here, there is an input dependence carried by the innermost loop from B(I-J+2) to both B(I-J+1) and B(I-J). It was not possible for new references that contained only single-induction-variable subscripts, called SIV subscripts, to have multiple edges that have the same source and that arose from one original edge. In the above example, even though the I-loop step value has increased, the presence of J in the subscript removes its effect for the execution of the innermost loop. The formula presented in Section 3.2 is invalid in this case since the change in loop step does not have the same effect on dependence distance and location.
In general, we would need to re-apply dependence analysis to MIV subscripts to get the full and correct dependence graph. This is because of the interaction between multiple MIV subscripts. In addition, it would be very difficult to compute $M$ and $R$ given the behavior of such subscripts. Therefore, we restrict the type of MIV subscripts that we handle to a small, predictable subset, $V^{MIV}$. In our experimentation, the only MIV subscripts that we found fit into the following description of $V^{MIV}$.
In order for a reference to be classified as MIV in our system, it must have the following properties:
1. The reference is incident only upon consistent dependences.
2. The reference contains only one MIV subscript position.
3. The inner-loop induction variable does not appear in the subscript or it appears only in the MIV subscript position.
4. At most one unrolled-loop induction variable is in the MIV subscript position.
5. The coefficients of the induction variables in the MIV subscript position are 1 for an unrolled loop and 1 or −1 for the innermost loop.
If an MIV reference is invariant with respect to the innermost loop, it is treated as a member of $V_r^I$ or $V_w^I$. If a loop is unrolled and its induction variable is not in the MIV subscript position, we use the classification of the reference in relation to the unrolled-loop induction variable when unroll-and-jam is applied (e.g., B(I-J) would be treated as a member of $V_r^I$ with respect to a K-loop). Any reference containing multiple-induction-variable subscripts that does not fit the above criteria is classified as a member of $V_\emptyset$.
To update the dependence graph for $V^{MIV}$, we restrict ourselves to the innermost loop to satisfy the needs of scalar replacement only. Essentially, we will apply the strong SIV test for the innermost-loop induction variable on each MIV reference pair [GKT91]. Given the references $A(a_n I_n + a_0)$ and $A(b_n I_n + b_0)$, where $a_n = b_n$, there is a dependence between the two references with a consistent threshold $d$ if $d = \frac{a_0 - b_0}{a_n}$ is an integer. Since we restrict $a_n$ and $b_n$ to be 1 or $-1$, there will always be a dependence.
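For example (our instantiation of the test), the pair B(I-J) and B(I-J+1) from the loop above has $a_n = b_n = -1$ on the inner induction variable J, with $a_0 = I$ and $b_0 = I + 1$, giving
\[
d = \frac{a_0 - b_0}{a_n} = \frac{-1}{-1} = 1,
\]
an integer, so the two references are joined by a consistent dependence with threshold 1 on the innermost loop.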
To compute $M^{MIV}$ and $R^{MIV}$, we must recognize that the effect of unroll-and-jam on MIV subscripts is to increase the distance from the first to the last access of a value by the unroll amount of the loop. Consider the following loop.

      DO 10 I = 1,N
         DO 10 J = 1,N
   10       ... = B(I-J) + B(I-J+1)

Unroll-and-jam of I by 1 produces

      DO 10 I = 1,N,2
         DO 10 J = 1,N
            ... = B(I-J)   + B(I-J+1)
   10       ... = B(I-J+1) + B(I-J+2)
In the original loop, the dependence distance for B's partition was 1. After unroll-and-jam by 1, the distance is 2. This example shows that each $v \in V_r^{MIV}$ requires $X_i + d_n(e)$ registers, assuming a step value of 1 for loop $i$, when unrolling the loop whose induction variable is in the MIV subscript position. In the loop,

      DO 10 I = 1,N
         DO 10 J = 1,N
   10       ... = B(I-J)

we must unroll the I-loop at least once to get a dependence carried by the innermost loop. To quantify this, we have
\[
R_r^{MIV} = \sum_{v \in V_r^{MIV}} \mathrm{pos}(X_i - d_i(e_v)) \cdot (X_i + d_n(e_v))
\]
where $d_n(e_v)$ is the maximum distance in a partition and $\mathrm{pos}(x)$ returns 1 if $x > 0$ and 0 otherwise. Here, $\mathrm{pos}(X_i - d_i(e_v))$ ensures that a dependence will be carried by the innermost loop after unroll-and-jam.
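Checking this against the two-reference example above (our arithmetic): unrolling I by 1 gives $X_i = 2$, and the partition's summarized edge has $d_i(e_v) = 0$ with maximum innermost distance $d_n(e_v) = 1$, so
\[
R_r^{MIV} = \mathrm{pos}(2 - 0) \cdot (2 + 1) = 3,
\]
matching the three values B(I-J), B(I-J+1) and B(I-J+2) that must be held in registers across each inner iteration.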
Only the MIV reference that is first to access a value will be left as a reference to memory after unroll-and-jam and scalar replacement. This is because each later reference introduced by unroll-and-jam will be the sink of an innermost dependence, as shown in the examples. Therefore,
\[
M_r^{MIV} = \sum_{v \in V_r^{MIV}} 1.
\]
For each $v \in V_w^{MIV}$, unrolling at least $d_n(e_v)$ will create an identical subscript expression, resulting in an incoming loop-independent dependence. Therefore,
\[
M_w^{MIV} = \sum_{v \in V_w^{MIV}} \min(X_i, d_n(e_v)),
\]
where $d_n(e_v)$ is the minimum over all of $v$'s incoming output dependence edges.
3.4.6 Non-Perfectly Nested Loops
It may be the case that the loop nest on which we wish to perform unroll-and-jam is not perfectly nested. Consider the following loop.

      DO 10 I = 1,N
         DO 20 J = 1,N
   20       A(I) = A(I) + B(J)
         DO 10 J = 1,N
   10       C(J) = C(J) + D(I)
On the RS/6000, where $\beta_M = 1$, the I-loop would not be unrolled for the loop surrounding statement 20, but would be unrolled for the loop surrounding statement 10. To handle this situation, we can distribute the I-loop around the J-loops as follows

      DO 20 I = 1,N
         DO 20 J = 1,N
   20       A(I) = A(I) + B(J)
      DO 10 I = 1,N
         DO 10 J = 1,N
   10       C(J) = C(J) + D(I)

and unroll each loop independently, as sketched below.
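To show the end result (a plausible outcome of this strategy, not output quoted from the thesis's translator), suppose the optimizer leaves the first nest alone and chooses an unroll amount of 1 for I in the second nest; after unroll-and-jam of the distributed copy we would obtain:

*     Hypothetical result: only the second nest profits from
*     unrolling I, so only it is transformed after distribution.
      DO 20 I = 1,N
         DO 20 J = 1,N
   20       A(I) = A(I) + B(J)
      DO 10 I = 1,N,2
         DO 10 J = 1,N
            C(J) = C(J) + D(I)
   10       C(J) = C(J) + D(I+1)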
In general, we compute unroll amounts with a balance-optimization formula for each innermost loop body as if it were within a perfectly nested loop. A vector of unroll amounts that includes a value for each of a loop's surrounding loops and a value for the loop itself is kept for each loop within the nest. When determining if distribution is necessary, if we encounter a loop that has multiple inner loops at the next level, we compare the unroll vectors of the inner loops to determine if any unroll amounts differ. If they differ, we distribute each of the loops from the current loop to the loop at the outermost level where the unroll amounts differ. We then proceed by examining each of the newly created loops for additional distribution. To handle safety, if any loop that must be distributed cannot be distributed, we set the unroll-vector values in each vector to the smallest unroll amount in all of the vectors for each level. Distribution is unsafe if it breaks a recurrence [AK87]. The complete algorithm for distributing loops for unroll-and-jam is given in Figure 3.8.
3.5 Experiment
Using the experimental system described in Chapter 2, we have implemented unroll-and-jam as presented in this chapter. In addition to controlling floating-point register pressure, we found that address-register pressure needed to be controlled. For the RS/6000, any expression having no incoming innermost-loop-carried or loop-independent dependence required an address register. Although this removes our independence from a particular compiler, it could not be avoided for performance reasons. To allow for an imperfect scalar optimizer, the effective register-set size for both floating-point and address registers was chosen to be 26. This value was determined experimentally.
DMXPY.  For our first experiment, we chose DMXPY from the Level 2 BLAS library [DDHH88]. This version of DMXPY is a hand-optimized vector-matrix multiply that is machine-specific and difficult to understand. We applied our transformations to the unoptimized version and compared the performance with the BLAS2 version.
Array Size   Iterations   Original   BLAS2   UJ+SR   Speedup
   300          700         7.78s    5.18s   5.06s    1.54
As the table shows, we were able to attain slightly better performance than the BLAS version, while allowing the kernel to be expressed in a machine-independent fashion. The hand-optimized version was not specific to the RS/6000, and more hand optimization would have been required to tune it to the RS/6000. However, our automatic techniques took care of the machine-specific details without the knowledge or intervention of the programmer and still obtained good performance.
Matrix Multiply.  The next experiment was performed on two versions of matrix multiply. The JKI loop ordering for better cache performance is shown below.
      DO 10 J = 1,N
         DO 10 K = 1,N
            DO 10 I = 1,N
   10          C(I,J) = C(I,J) + A(I,K) * B(K,J)
The results from our transformation system show that integer-factor speedups are attainable on both small and large matrices for both versions of the loop. Not only did we improve memory reuse within the loop, but also the available instruction-level parallelism.
procedure DistributeLoops(L, i)
  Input: L = loop to check for distribution
         i = nesting level of L
  if L has multiple inner loops at level i+1 then
    l = outermost level where the unroll amounts differ for the loop nests
    if l > 0 then
      if loops at levels l, …, i can be distributed then
        L′ = L
        while level(L′) ≥ l do
          distribute L′ around its inner loops
          L′ = parent(L′)
        enddo
      else
        for j = 1 to i do
          m = minimum of all unroll vectors for the inner loops of L at level j
          assign to those vectors at level j the value m
        enddo
      endif
    endif
  N = the first inner loop of L at level i+1
  if N exists then
    DistributeLoops(N, i+1)
  N = the next loop at level i following L
  while N exists do
    DistributeLoops(N, i)
    N = the next loop at level i following N
  enddo
end

Figure 3.8  Distribute Loops for Unroll-and-Jam
Loop Order   Array Size   Iterations   Original    SR        UJ+SR    Speedup
   JIK         50x50          500        4.53s     4.53s      2.37s     1.91
   JIK        500x500           1      135.61s   135.61s     44.16s     3.07
   JKI         50x50          500        6.71s     6.71s      3.3s      2.04
   JKI        500x500           1       15.49s    15.49s      6.6s      2.35
Comparing the results of the two versions of matrix multiply shows how critical memory performance really is. The JKI version has better overall cache locality, and on large matrices that property gives dramatically better performance than the JIK version.
Linear Algebra Kernels.  We also experimented with both the point and block versions of LU decomposition with and without partial pivoting (LU and LUP, respectively). In the table below, we show the results of scalar replacement and unroll-and-jam.
Kernel      Array Size   Block Size   Original    SR       UJ+SR    Speedup
Block LU     300x300         32         1.37s     0.91s     0.56s     2.45
Block LU     300x300         64         1.42s     1.02s     0.72s     1.97
Block LU     500x500         32         6.58s     4.51s     2.48s     2.65
Block LU     500x500         64         6.59s     4.54s     2.79s     2.36
Block LUP    300x300         32         1.42s     0.97s     0.67s     2.12
Block LUP    300x300         64         1.48s     1.10s     0.77s     1.92
Block LUP    500x500         32         6.85s     4.81s     2.72s     2.52
Block LUP    500x500         64         6.83s     4.85s     3.02s     2.26
As shown in the above examples, a factor of 2 to 2.5 was attained. Each of the kernels contained an inner-loop reduction that was amenable to unroll-and-jam. In these instances, unroll-and-jam introduced multiple parallel copies of the inner-loop recurrence while simultaneously improving data locality.
NAS Kernels.  Next, we studied three of the NAS kernels from the SPEC benchmark suite. A speedup was observed for each of these kernels, with Emit and Gmtry doing much better. The reason is that one of the main computational loops in both Emit and Gmtry contained an outer-loop reduction that was unroll-and-jammed.
Kernel   Original    SR        UJ+SR     Speedup
Vpenta   149.68s    149.68s   138.69s     1.08
Emit      35.1s      33.83s    24.57s     1.43
Gmtry    155.3s     154.74s    25.95s     5.98
Geophysics Kernels.  We also included in our experiment two geophysics kernels: one that computed the adjoint convolution of two time series and another that computed the convolution. Each of these loops required the optimization of trapezoidal, rhomboidal and triangular iteration spaces using techniques that we develop in Chapter 5. Again, these kernels contained inner-loop reductions.
Kernel   Iterations   Array Size   Original    SR       UJ+SR   Speedup
Afold       1000         300         4.59s     4.59s    2.55s    1.80
Afold       1000         500        12.46s    12.46s    6.65s    1.87
Fold        1000         300         4.61s     4.61s    2.53s    1.82
Fold        1000         500        12.56s    12.56s    6.63s    1.91
Applications.  To complete our study, we ran a number of Fortran applications through our translator. We chose programs from SPEC, Perfect, RICEPS and local sources. Those programs that belong to our benchmark suites but are not included in this experiment contained no opportunities for our algorithm. The results of performing scalar replacement and unroll-and-jam on these applications are shown in the following table.‡
Program     Original    SR        UJ+SR     Speedup
Arc2d       410.13s    407.57s   401.96s    1.02
CoOpt       122.88s    120.44s   116.67s    1.05
Flo52        61.01s     58.61s    58.8s     1.04
Matrix300   149.6s     149.6s     33.21s    4.50
Onedim        4.41s      4.41s     3.96s    1.11
Simple      963.20s    934.13s   928.84s    1.04
Tomcatv      37.66s     37.66s    37.41s    1.01
The applications that observed the largest improvements with unroll-and-jam (Matrix300, Onedim) were dominated by the cost of loops containing reductions that were highly optimized by our system. Although many of the loops found in other programs received a 15%-40% improvement, the applications themselves were not dominated originally by the costs of these loops.
Throughout our study, we found that unroll-and-jam was most effective in the presence of reductions. Unrolling a reduction introduced very few stores and improved cache locality and instruction-level parallelism. Stores seemed to limit the perceived available parallelism on the RS/6000. Unroll-and-jam was least effective in the presence of a floating-point divide because of its high computational cost. Because a divide takes 19 cycles on the RS/6000, its cost dominated loop performance enough to make data locality a minimal factor.
3.6 Summary
In this chapter, we have developed an optimization problem that minimizes the distance between machine balance and loop balance, bringing the balance of the loop as close as possible to the balance of the machine. We have shown how to use a few simplifying assumptions to make the solution times fast enough for inclusion in compilers. We have implemented these transformations in a Fortran source-to-source preprocessor and shown its effectiveness by applying it to a substantial collection of kernels and whole programs. These results show that, over whole programs, modest improvements are usually achieved, with spectacular results occurring on a few programs. The methods are particularly successful on kernels from linear algebra. These results are achieved on an IBM RS/6000, which has an extremely effective optimizing compiler. We would expect more dramatic improvements over a less sophisticated compiler. These methods should also produce larger improvements on machines where the load penalties are greater.
‡ Our version of Matrix300 is after procedure cloning and inlining to create context for unroll-and-jam [BCHT90].
Chapter 4
Loop Interchange
The previous two chapters have addressed the problem of improving the performance of memory-bound loops under the assumption that good cache locality already exists in program loops. It is the case, however, that not all loops exhibit good cache locality, resulting in idle computational cycles while waiting for main memory to return data. For example, in the loop,

      DO 10 I = 1, N
         DO 10 J = 1, N
   10       A = A + B(I,J)

references to successive elements of B are a long distance apart in number of memory accesses. Most likely, current cache architectures would not be able to capture the potential cache-line reuse available because of the volume of data accessed between reuse points. With each reference to B being a cache miss, the loop would spend a majority of its time waiting on main memory. However, if we interchange the I- and J-loops to get
      DO 10 J = 1, N
         DO 10 I = 1, N
   10       A = A + B(I,J)

the references to successive elements of B immediately follow one another. In this case, we have attained locality of reference for B by moving reuse points closer together. The result will be fewer idle cycles waiting on main memory.
In this chapter we show how the compiler can automate the above process to attain a loop ordering with good memory performance. We begin with a model of memory performance for program loops. Then, we show how to use this model to choose the loop ordering that maximizes memory performance. Finally, we present an experiment with an implementation of this technique.
4.1 Performance Model
Our model of memory performance consists of two parts. The first part models the ability of the cache to retain values between reuse points. The second part models the costs associated with each memory reference within an innermost loop.
4.1.1 Data Locality
When applied to data locality within the cache, a data dependence can be thought of as a potential opportunity for reuse. The reason the reuse opportunity is only potential is cache interference, where two data items need to occupy the same location in the cache at the same time. Previous studies have shown that interference is hard to predict and often prevents outer-loop reuse [CP90, LRW91]. However, inner-loop locality is likely to be captured by the cache because of short distances between reuse points. Given these factors, our model of the cache will assume that all outer-loop reuse will be prevented by cache interference and that all inner-loop reuse will be captured by the cache. This assumption allows us to ignore the unpredictable effects of set
associativity and concentrate solely on the cache line size, miss penalty and access cost to measure memory performance. In essence, by ignoring set associativity, we assume that the cache will closely resemble a fully associative cache in relation to innermost-loop reuse. Although precision is lowered, our experimentation shows that loop interchange based on this model is effective in attaining locality of reference within the cache.
4.1.2 Memory Cycles
To compute the cost of an array reference, we must first know the array storage layout for a particular programming language. Our model assumes that arrays are stored in column-major order. Row-major order can be handled with slight modifications. Once we know the storage layout, we can determine where in the memory hierarchy values reside and the associated cost of accessing each value by analyzing reuse properties. In the rest of this section, we show how to determine the reuse properties of array references based upon the dependence graph and how to assign an average cost to each memory access.
One reuse property, temporal reuse, occurs when a reference accesses data that has been previously accessed in the current or a previous iteration of an innermost loop. These references are represented in the dependence graph as the sink of a loop-independent or innermost-loop-carried consistent dependence. If a reference is the sink of an output or antidependence, each access will be out of the cache (assuming a write-back cache) and will cost $T = C_h$ cycles, where $C_h$ is the cache access cost. If a reference is the sink of a true or input dependence, then it will be removed by scalar replacement, resulting in a cost of 0 cycles per access. In Figure 4.1, the reference to A(I-1,J) has temporal reuse of the value defined by A(I,J) one iteration earlier and will be removed by scalar replacement. Therefore, it will cost 0 cycles.
The other reuse property, spatial reuse, occurs when a reference accesses data that is in the same cache line as some previous access. One type of reference possessing spatial reuse has the innermost-loop induction variable contained only in the first subscript position and has no incoming consistent dependence. This reference type accesses successive locations in a cache line on successive iterations of the innermost loop. It requires the cost of accessing the cache for every memory access plus the cache miss penalty, $C_m$, for every access that goes to main memory. Assuming that the cache line size is $C_l$ words and the reference has a stride of $s$ words between successive accesses, $\lfloor C_l / s \rfloor$ successive accesses will be in the same cache line. Therefore, the probability that a reference with spatial reuse will be a cache miss is
\[
P_m = \frac{1}{\lfloor C_l / s \rfloor},
\]
giving the following formulation for memory cost:
\[
S_l = C_h + P_m C_m.
\]
In Figure 4.1, the reference to A(I,J) has spatial reuse of the form just described. Given a cache with $C_h = 1$, $C_m = 8$ and $C_l = 16$, A(I,J) costs $1 + \frac{8}{16} = 1.5$ cycles/access.
The other type of reference having spatial reuse accesses memory within the same cache line as a different reference but does not access the same value. These references will be represented as the sink of an outer-loop-carried consistent dependence with a distance vector of $\langle 0, \ldots, 0, d_i, 0, \ldots, 0, d_n \rangle$, where the induction variable for loop $i$ is only in the first subscript position. In this case, memory accesses will be cache hits on all but possibly the first few accesses and essentially cost $S_T = C_h$. In Figure 4.1, the reference to C(J-1,I) has this type of spatial reuse due to the reference to C(J,I). Under the previous cache parameters, C(J-1,I) requires 1 cycle/access.
The remaining types of references can be classified as those that contain no reuse and those that are stored in registers. Array references without spatial and temporal reuse fall into the former category and require $N = C_h + C_m$ cycles. References to scalars fall into the latter category and require 0 cycles. C(J,I) in Figure 4.1 has no reuse since the stride for consecutive elements is too large to attain spatial reuse. It has a cost of 9 cycles/access under the current cache parameters.
      DO 10 J = 1,N
         DO 10 I = 1,N
   10       A(I,J) = A(I-1,J) + C(J,I) + C(J-1,I)

Figure 4.1  Example Loop for Memory Costs
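The per-reference costs can be condensed into a small routine. The sketch below is our own summary of the Section 4.1.2 model, not code from the thesis; the class codes are hypothetical names for the cases just described.

*     Sketch (ours) of the memory-cost model.  KIND selects the
*     reference class, S is the stride in words, CH/CM are the
*     cache access cost and miss penalty, CL is the line size
*     in words.  Hypothetical class codes:
*       1 = temporal reuse, true/input dep (scalar replaced)
*       2 = temporal reuse, output/antidependence
*       3 = spatial reuse within the reference's own cache line
*       4 = spatial reuse from a different reference
*       5 = no reuse
*       6 = scalar held in a register
      REAL FUNCTION RCOST(KIND, S, CH, CM, CL)
      INTEGER KIND, S, CL
      REAL CH, CM
      IF (KIND .EQ. 1 .OR. KIND .EQ. 6) THEN
         RCOST = 0.0
      ELSE IF (KIND .EQ. 2 .OR. KIND .EQ. 4) THEN
         RCOST = CH
      ELSE IF (KIND .EQ. 3) THEN
*        S_l = C_h + C_m / floor(C_l / s)
         RCOST = CH + CM / REAL(CL / S)
      ELSE
*        N = C_h + C_m
         RCOST = CH + CM
      END IF
      RETURN
      END

With $C_h = 1$, $C_m = 8$ and $C_l = 16$, this reproduces the costs computed above for Figure 4.1: 1.5 cycles for A(I,J) (class 3, stride 1), 1 cycle for C(J-1,I) (class 4) and 9 cycles for C(J,I) (class 5).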
4.2 Algorithm
This section presents an algorithm for ordering loop nests to maximize memory performance that is based upon the memory-cost model described in the previous section. First, we show how to compute the order for memory performance for perfectly nested loops. Then, we show how to extend the algorithm to handle non-perfectly nested loops.
4.2.1 Computing Loop Order
Using the model of memory from Section 4.1, we will apply a loop interchange algorithm to a loop nest to minimize the cost of accessing memory. The algorithm considers the memory costs of the references in the innermost-loop body when each loop is interchanged to the innermost position and then orders the loops from innermost to outermost based upon increasing costs. Although costs are computed relative only to the innermost position, they reflect the data-locality merit of a loop relative to all other loops within the nest. Loops with a lower memory cost have a higher locality and have a higher probability of attaining outer-loop reuse if interference happens not to occur. To actually compute the memory costs associated with each loop as described, the distance vectors of all of the dependence edges in the loop nest are viewed as if the entry for the loop under consideration were shifted to the innermost position with all other entries remaining in the same relative position. This allows determination of which edges can be made innermost to capture reuse.
After computing the memory cost for each loop and computing the order for the loops, we must ensure that no violation of dependences will occur. Safety is ensured using the technique of McKinley et al. [KM92]. Beginning with the outermost position in the loop, we select the loop with the highest memory cost that can safely go in the current loop position. A loop positioning is safe if no resulting dependence has a negative threshold [AK87]. After selecting the outermost loop, we proceed by iteratively selecting the loop with the next highest cost for the next outermost position until a complete ordering is obtained. The algorithm for computing loop order for perfectly nested loops is shown in Figure 4.2.
An Example.  To illustrate the algorithm in Figure 4.2, consider the following loop under two sets of memory costs.

      DO 10 I = 1,N
         DO 10 J = 1,N
   10       A(J) = A(J) + B(J,I)

Given $C_l = 8$, $C_h = 1$ and $C_m = 8$, the cost of the I-loop in the innermost position is $0 + 0 + 9 = 9$ cycles/iteration and the cost of the J-loop is $3 \cdot (1 + \frac{1}{8} \cdot 8) = 6$ cycles/iteration. This cost model would argue for an I,J loop ordering. However, if $C_h$ were 3 cycles, the cost of the I-loop becomes 11 cycles/iteration
procedure LoopOrder(L)
  Input: L = loop nest
  compute memory costs for each loop
  P = SortByCost(L)
  if ∃ a dependence with a negative threshold then
    L = P
    for i = 1 to |L| do
      P_i = the outermost loop, l_j ∈ L, such that putting l_j at level i
            will not create a dependence with a negative threshold
      remove P_i from L
    enddo
  endif
  store P for this perfectly nested loop
end

Figure 4.2  Algorithm for Ordering Loops
and the cost of the J-loop becomes 12 cycles/iteration, giving a J,I loop ordering. As illustrated by this example, our cost model can optimize for a loop/architecture combination that calls for either better cache reuse or better register reuse. Unfortunately, we do not have access to a machine with a high load penalty to validate this effect of our model. It may be the case that TLB misses, due to long strides for references with no reuse, would dominate the cost of memory accesses. In that case, our model would need to be updated to reflect TLB performance.
4.2.2 Non-Perfectly Nested Loops
It may be the case that the loop nest on which we wish to perform interchange is not perfectly nested. Consider the following loop.

      DO 10 I = 1,N
         DO 20 J = 1,N
   20       A(I,J) = A(I,J) + B(I,J)
         DO 10 J = 1,N
   10       C(J,I) = C(J,I) + D(J,I)
It is desirable to interchange the I- and J-loops for statement 20, but it is not desirable for statement 10. To handle this situation, we will use an approach similar to that used in unroll-and-jam, where each innermost loop body will be considered as if it were perfectly nested when computing memory costs. If interchange across a non-perfectly nested portion of the loop nest is required or different loop orderings of common loops are requested, then the interchanged loop is distributed, when safe, and the interchange is performed as desired [Wol86a]. Distribution safety can be incorporated into LoopOrder at the point where interchange safety is tested. To be conservative, any loop containing a distribution-preventing recurrence and any loop nested outside of that loop cannot be interchanged inward. In the previous example, the result after distribution and interchange would be

      DO 20 J = 1,N
         DO 20 I = 1,N
   20       A(I,J) = A(I,J) + B(I,J)
      DO 10 I = 1,N
         DO 10 J = 1,N
   10       C(J,I) = C(J,I) + D(J,I)
In Figure 4.3, we show the complete algorithm for ordering nested loops.
4.3 Experiment
We have implemented our loop interchange algorithm in the ParaScope programming environment along with scalar replacement and unroll-and-jam. Our experiment was performed on the IBM RS/6000 model 540, which has the following cache parameters: $C_h = 1$, $C_m = 8$ and $C_l = 128$ bytes. The order in which transformations are performed in our system is loop interchange followed by unroll-and-jam and scalar replacement. First, we get the best data locality in the innermost loop and then we improve the resulting balance.
Matrix Multiply.  We begin our study with the effects of loop order on matrix multiply using matrices too large for the cache to capture all reuse. In the table below, the performance of the various versions using two different-sized matrices is given.
Loop Order   Array Size    Time
   IJK        300x300     12.51s
   IJK        500x500    131.71s
   IKJ        300x300     50.93s
   IKJ        500x500    366.77s
   JIK        300x300     12.46s
   JIK        500x500    135.61s
   JKI        300x300      3.38s
   JKI        500x500     15.49s
   KIJ        300x300     51.32s
   KIJ        500x500    366.62s
   KJI        300x300      3.45s
   KJI        500x500     15.83s
procedure Interchange(L, i)
  Input: L = loop nest
         i = level of outermost loop
  for N = each possible perfect loop nest in L do
    LoopOrder(N)
  InterchangeWithDistribution(L, i)
end

procedure InterchangeWithDistribution(L, i)
  if L has multiple inner loops at level i+1 then
    if interchange is desired across level i+1 or
       multiple orderings requested for levels 1 to i then
      distribute L
      for N = L and its distributed copies do
        Interchange(N, i)
      return
  endif
  if L is innermost then
    let N = ordering for perfect nest produced by LoopOrder
    order perfect nest of loops by N
  endif
  N = the first inner loop of L at level i+1
  if N exists then
    InterchangeWithDistribution(N, i+1)
  N = the next loop at level i following L
  while N exists do
    InterchangeWithDistribution(N, i)
    N = the next loop at level i following N
  enddo
end

Figure 4.3  Algorithm for Non-Perfectly Nested Loops
The loop order with the best performance is JKI because of stride-one access in the cache. This is the loop order that our transformation system will derive given any of the possible orderings. The key observation to make is that the programmer can specify matrix multiply in the manner that is most understandable to him without concern for memory performance. Using our technique, the compiler will give the programmer good performance without requiring knowledge of the implementation of the array storage layout for a particular programming language or of the underlying machine. In essence, machine-independent programming is made possible.
NAS Kernels.  Next, we studied the effects of loop interchange on a couple of the NAS kernels from the SPEC benchmark suite. As shown in the results below, getting the proper loop ordering for memory performance can have a dramatic effect.
Kernel   Original    SR        LI+SR     UJ+SR     LI+UJ+SR   Speedup
Vpenta   149.68s    149.68s   115.62s   138.69s    115.62s     1.29
Gmtry    155.30s    154.74s    17.89s    25.95s     18.07s     8.68
The speedup reported for Gmtry does not include the application of unroll-and-jam. Although we have no analysis tool to determine exactly what caused the slight degradation in performance, the most likely suspect is cache interference. Unroll-and-jam probably created enough copies of the inner loop to incur cache-interference problems due to set associativity.
Applications.  To complete our study, we ran our set of applications through the translator to determine the effectiveness of loop interchange. In the following table, the results are shown for those applications where loop interchange was applicable.
Program   Original    SR        LI+SR     UJ+SR     LI+UJ+SR   Speedup
Arc2d     410.13s    407.57s   190.69s   401.96s    192.53s     2.15
Simple    963.20s    934.13s   850.18s   928.84s    847.82s     1.14
Wave      445.94s    431.11s   414.63s   431.11s    414.63s     1.08
Again, we have shown that remarkable speedups are attainable with loop interchange. Arc2d improved by a factor of over 2 on a single processor. This application was written in a "vectorizable" style with regard for vector operations rather than data locality. What our transformation system has done is to allow the code to be portable across different architectures and still attain good performance. The very slight degradation when unroll-and-jam was added is probably again due to cache interference.
Throughout our study, one factor stood out as the key to performance on machines like the RS/6000: stride-one access in the cache. Long cache lines, a low access cost and a high miss penalty create this phenomenon. Attaining spatial reuse was a simple yet extremely effective approach to obtaining high performance.
4.4 Summary
In this chapter, we have shown how to automatically order loops in a nest to give good memory performance. The model that we have presented for memory performance is simple, but extremely effective. Using our transformation system, the programmer is freed from having to worry about his loop order, allowing him to write his code in a machine-independent form with confidence that it will be automatically optimized to achieve good memory performance. The results presented here represent a positive step toward encouraging machine-independent programming.
Chapter 5
Blockability
In Chapter 1, we presented two transformations, unroll-and-jam and strip-mine-and-interchange, to effect iteration-space blocking. Although these transformations have been studied extensively, they have mainly been applied to kernels written in an easily analyzed algorithmic style. In this chapter, we examine the applicability of iteration-space blocking to more complex algorithmic styles. Specifically, we present the results of a project to see if a compiler could automatically generate block algorithms similar to those found in LAPACK from the corresponding point algorithms expressed in Fortran 77. In performing this study, we address the question, "What information does a compiler need in order to derive block versions of real-world codes that are competitive with the best hand-blocked versions?"

In the course of this study, we have found transformation algorithms that can be successfully used on triangular loops, which are quite common in linear algebra, and on trapezoidal loops. In addition, we have discovered an algorithmic approach that can be used to analyze and block programs that exhibit complex dependence patterns. The latter method has been successfully applied to block LU decomposition without pivoting. The key to many of these results is a transformation known as index-set splitting. Our results with this transformation show that a wide class of numerical algorithms can be automatically optimized for a particular machine's memory hierarchy even if they are expressed in their natural form. In addition, we have discovered that specialized knowledge about which operations commute with one another can enable compilers to block codes that were previously thought to be unblockable by automatic means.

This chapter begins with a review of iteration-space blocking. Next, the transformations that we found were necessary to block algorithms like those found in LAPACK are presented. Then, we present a study of the application of these transformations to derive the block LAPACK-like algorithms from their corresponding point algorithms. Finally, for those algorithms that cannot be blocked by a compiler, we propose a set of language extensions to allow the expression of block algorithms in a machine-independent form.
5.0.1 Iteration-Space Blocking
To improve the memory behavior of loops that access more data than can be handled by a cache, the iteration space of a loop can be blocked into sections whose temporal reuse can be captured by the cache. Strip-mine-and-interchange is a transformation that achieves this result [Wol87, WL91]. The effect is to shorten the distance between the source and sink of a dependence so that it is more likely for the datum to reside in cache when the reuse occurs. Consider the following loop nest.

      DO 10 J = 1,N
         DO 10 I = 1,M
   10       A(I) = A(I) + B(J)
Assuming that the value of M is much greater than the size of the cache, we would get temporal reuse of the values of B while missing the temporal reuse of the values of A on each iteration of J. To capture A's reuse, we can use strip-mine-and-interchange as shown below.

      DO 10 J = 1,N,JS
         DO 10 I = 1,M
            DO 10 JJ = J, MIN(J+JS-1,N)
   10          A(I) = A(I) + B(JJ)
Now, we can capture the temporal reuse of JS values of B out of cache for every iteration of the J-loop, provided JS is less than the size of the cache and no cache interference occurs, and we can capture the temporal reuse of A in registers [LRW91].
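For instance, scalar replacement could keep the running value of A(I) in a register for the length of the JJ-loop; a sketch of the two inner loops of the strip-mined nest:

         DO 10 I = 1,M
            T = A(I)
            DO 20 JJ = J, MIN(J+JS-1,N)
   20          T = T + B(JJ)
   10       A(I) = T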
As stated earlier, both strip-mine-and-interchange and unroll-and-jam make up the transformation technique known as iteration-space blocking. Essentially, unroll-and-jam is strip-mine-and-interchange with the innermost loop unrolled. The difference is that unroll-and-jam is used to block for registers and strip-mine-and-interchange to block for cache.
5.1 Index-Set Splitting
Iteration-space blocking cannot always be directly applied as shown in the previous section. Sometimes a transformation called index-set splitting must be applied to allow blocking. Index-set splitting involves the creation of multiple loops from one original loop, where each new loop iterates over a portion of the original iteration space. Execution order is not changed and the original iteration space is still completely executed. As an example of index-set splitting, consider the following loop.
      DO 10 I = 1,N
   10    A(I) = A(I) + B(I)

We can split the index set of I at iteration 100 to obtain

      DO 10 I = 1,MIN(N,100)
   10    A(I) = A(I) + B(I)
      DO 20 I = MIN(N,100)+1,N
   20    A(I) = A(I) + B(I)
Although this transformation does nothing by itself, its application can enable the blocking of complex loop forms. This section shows how index-set splitting allows the blocking of triangular-, trapezoidal-, and rhomboidal-shaped iteration spaces and the partial blocking of loops with complex dependence patterns.
5.1.1 Triangular Iteration Spaces
When the iteration space of a loop is not rectangular, iteration-space blocking cannot be directly applied. The problem is that when performing interchange of loops that iterate over a triangular region, the loop bounds must be modified to preserve the semantics of the loop [Wol86a, Wol87]. Below, we derive the formula for determining loop bounds when blocking is performed on triangular iteration spaces. We begin with the derivation for strip-mine-and-interchange and then extend it to unroll-and-jam.

The general form of a strip-mined triangular loop is given below. α and β are integer constants (β may be a loop invariant) and α > 0.
      DO 10 I = 1,N,IS
         DO 10 II = I,I+IS-1
            DO 10 J = αII+β,M
   10          loop body
Figure 5.1 gives a graphical description of the iteration space of this loop. To interchange the II and J loops, we have to account for the fact that the line J = αII+β intersects the iteration space at the point (I, αI+β). Therefore, when we interchange the loops, the II-loop must iterate over a trapezoidal region, requiring its upper bound to be (J-β)/α until (J-β)/α > I+IS-1. This gives the following loop nest.
      DO 10 I = 1,N,IS
         DO 10 J = αI+β,M
            DO 10 II = I,MIN((J-β)/α,I+IS-1)
   10          loop body
This formula can be trivially extended to handle the cases where α < 0 and where a linear function of I appears in the upper bound instead of the lower bound (see Appendix A).

Given the formula for triangular strip-mine-and-interchange, we can extend it to triangular unroll-and-jam as follows. The iteration space defined by the two inner loops is a trapezoidal region, making unrolling the innermost loop non-trivial because the number of iterations varies with J. To overcome this, we can use index-set splitting on J to create one loop that iterates over the triangular region below the line J = α(I+IS-1)+β and one loop that iterates over the rectangular region above the line.
      Figure 5.1  Upper Left Triangular Iteration Space (J vs. II; the region lies above the line J = αII+β, with J up to M and II up to N)
Since we know the length of the rectangular region, the second loop can be unrolled to give the following loop nest.
      DO 10 I = 1,N,IS
         DO 20 II = I,I+IS-2
            DO 20 J = αII+β,MIN(α(I+IS-2)+β,M)
   20          loop body
         DO 10 J = α(I+IS-1)+β,M
   10       unrolled loop body
Depending upon the values of α and β, it may also be possible to determine the size of the triangular region; therefore, it may be possible to completely unroll the first loop nest to eliminate the overhead. Additionally, triangular unroll-and-jam can be trivially extended to handle other common triangles (see Appendix A).
To see the potential of triangular unroll-and-jam, consider the following loop, which is used in a back solve after LU decomposition.

      DO 10 I = 1,N
         DO 10 J = I+1,N
   10       A(J) = A(J) + B(J,I) * A(I)
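Instantiating the strip-mine-and-interchange formula above with α = 1, β = 1 and M = N gives the following sketch:

      DO 10 I = 1,N,IS
         DO 10 J = I+1,N
            DO 10 II = I,MIN(J-1,I+IS-1)
   10          A(J) = A(J) + B(J,II) * A(II)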
We used an automatic system to perform triangular unroll-and-jam and scalar replacement on the above loop. We then ran the result on arrays of DOUBLE PRECISION REALs on an IBM RS/6000 540. The results are shown in the table below.
    Size   Iterations   Original   Xformed   Speedup
    300    500          6.09s      3.49s     1.74
    500    200          6.78s      3.82s     1.77
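By way of illustration, a hand rendering of the transformed loop at unroll factor 2 (a sketch only; it assumes the columns pair evenly, with any leftover column handled by a cleanup copy of the original loop):

      DO 10 I = 1,N-1,2
         T0 = A(I)
C        triangular part: the single iteration below the unrolled band
         T1 = A(I+1) + B(I+1,I)*T0
         A(I+1) = T1
C        rectangular part: columns I and I+1 both update A(J)
         DO 10 J = I+2,N
   10       A(J) = A(J) + B(J,I)*T0 + B(J,I+1)*T1

Holding T0 and T1 in registers is the scalar replacement; the fused updates halve the loads and stores of A(J) per column.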
5.1.2 Trapezoidal Iteration Spaces
While the previous method applies to many of the common non-rectangular iteration spaces, there are still some important loops that it will not handle. In linear algebra, seismic and partial differential equation codes, we often find loops with trapezoidal-shaped iteration spaces. Consider the following example, where L is assumed to be a constant and α > 0.

      DO 10 I = 1,N
         DO 10 J = L,MIN(αI+β,N)
   10       loop body
Here, as is often the case, a MIN function has been used to handle boundary conditions, resulting in the loop's trapezoidal shape as shown in Figure 5.2. The formula to handle a triangular region does not apply in this case, so we must extend it.

The trapezoidal iteration space contains one rectangular region and one triangular region, separated at the point where αI+β = N. Because we already know how to handle rectangular and triangular regions and because we want to reduce the execution overhead in the rectangular region, we can split the index set of I into two separate regions at the point I = (N-β)/α. This gives the following loop nest.
      DO 10 I = 1,MIN(N,(N-β)/α)
         DO 10 J = L,αI+β
   10       loop body
      DO 20 I = MAX(1,MIN(N,(N-β)/α)+1),N
         DO 20 J = L,N
   20       loop body
We can now apply triangular iteration-space blocking to the first loop and rectangular unroll-and-jam to the second loop. In addition to loops with MIN functions in the upper bound, index-set splitting of trapezoidal regions can be extended to allow MAX functions in the lower bound of the inner loop (see Appendix A).

The lower bound, L, of the inner loop in the trapezoidal nest need not be restricted to a constant value. It can essentially be any function that produces an iteration space that can be blocked. In the loop,
      DO 10 I = 1,N1
         DO 10 K = I,MIN(I+N2,N3)
   10       F3(I) = F3(I)+DT*F1(K)*WORK(I-K)
the lower bound is a linear function of the outer-loop induction variable, resulting in rhomboidal and triangular regions (see Figure 5.3). To handle this loop, blocking can be extended to rhomboidal regions using index-set splitting as in the case for triangular regions (see Appendix A).

To see the potential for improving the performance of trapezoidal loops, see Section 3.5 under Geophysics Kernels. The algorithm Afold is the example shown above and computes the adjoint convolution of two time series. Fold computes the convolution of two time series and contains a MAX function in the lower bound and a MIN function in the upper bound.
      Figure 5.2  Trapezoidal Iteration Space with Rectangle
      Figure 5.3  Trapezoidal Iteration Space with Rhomboid

5.1.3 Complex Dependence Patterns
In some cases, it is not only the shape of the iteration space that presents difficulties for the compiler, but also the dependence patterns within the loop. Consider the strip-mined example below.

      DO 10 I = 1,N,IS
         DO 10 II = I, I+IS-1
            T(II) = A(II)
            DO 10 K = II,N
   10          A(K) = A(K) + T(II)
To complete blocking, the II-loop must be interchanged into the innermost position. Unfortunately, there is a recurrence between the definition of A(K) and the load from A(II) carried by the II-loop. Using only standard dependence abstractions, such as distance and direction vectors, we would be prevented from blocking the loop [Wol82]. However, if we analyze the sections of the arrays that are accessed at the source and sink of the offending dependence using array summary information, the potential to apply blocking is revealed [CK87, HK91]. Consider Figure 5.4. The region of the array A read by the reference to A(II) goes from I to I+IS-1 and the region written by A(K) goes from I to N. Therefore, the recurrence does not exist for the region from I+IS to N.
To allow partial blocking of the loop, we can split the index set so that one loop iterates over the common region and one loop iterates over the disjoint region. To determine the split point that creates these regions, we set the subscript expression for the larger region equal to the boundary between the common and disjoint regions and solve for the inner induction variable. In our example, we set K equal to the boundary I+IS-1. Splitting at this point yields
      DO 10 I = 1,N,IS
         DO 10 II = I,I+IS-1
            T(II) = A(II)
            DO 20 K = II,I+IS-1
   20          A(K) = A(K) + T(II)
            DO 10 K = I+IS,N
   10          A(K) = A(K) + T(II)
We can now distribute the II-loop and complete blocking on the loop nest surrounding statement 10.
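The resulting nest, after distribution and interchange of the II-loop into the innermost position of the second nest, would look roughly like this (our sketch, not the thesis's listing):

      DO 10 I = 1,N,IS
C        common region: the recurrence remains, so the loop order is unchanged
         DO 20 II = I,I+IS-1
            T(II) = A(II)
            DO 20 K = II,I+IS-1
   20          A(K) = A(K) + T(II)
C        disjoint region: II is now innermost, capturing the reuse of A(K)
         DO 10 K = I+IS,N
            DO 10 II = I,I+IS-1
   10          A(K) = A(K) + T(II)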
The method that we have just described may be applicable when the references involved in the preventing dependences have different induction variables in corresponding positions (e.g., A(II) and A(K) in the previous example). An outline of our application of the method after strip mining is given below.

1. Calculate the sections of the source and sink of the preventing dependence.
2. Intersect and union the sections.
      Figure 5.4  Data Space for A
3. If the intersection is equal to the union, then stop.
4. Set the subscript expression of the larger section equal to the boundary between the disjoint and common sections and solve for the inner-loop induction variable.
5. Split the index set of the inner loop at this point and proceed with blocking on the disjoint region.
6. Repeat steps 4 and 5 if there are multiple boundaries.
If multiple transformation-preventing dependences exist, we process each one with the above steps until failure or until a region is created where blocking can be performed.
5.2 Control Flow
In addition to iteration-space shapes and dependence patterns, we must also consider the effects of control flow on blocking. It may be the case that an inner loop is guarded by an IF statement to prevent unnecessary computation. Consider the following matrix multiply code.

      DO 10 J = 1,N
         DO 10 K = 1,N
            IF (B(K,J) .NE. 0.0) THEN
               DO 20 I = 1,N
   20             C(I,J) = C(I,J) + A(I,K) * B(K,J)
            ENDIF
   10 CONTINUE
If we were to ignore the IF statement and perform unroll-and-jam on the K-loop, we would obtain the following code.

      DO 10 J = 1,N
         DO 10 K = 1,N,2
            IF (B(K,J) .NE. 0.0) THEN
               DO 20 I = 1,N
                  C(I,J) = C(I,J) + A(I,K) * B(K,J)
   20             C(I,J) = C(I,J) + A(I,K+1) * B(K+1,J)
            ENDIF
   10 CONTINUE
Here, the value of B(K+1,J) is never tested, and statements that were not executed in the original code may be executed in the unrolled code. Thus, we have performed unroll-and-jam illegally.

One possible method to preserve correctness is to move the guard into the innermost loop and replicate it for each unrolled iteration. This, however, would result in a performance degradation due to a decrease in loop-level parallelism and an increase in instructions executed. Instead, we can use a combination of IF-conversion and sparse-matrix techniques, which we call IF-inspection, to keep the guard out of the innermost loop and still allow blocking [AK87]. The idea is to inspect, at run time, the values of an outer-loop induction variable for which the guard is true and the inner loop is executed. Then, we execute the loop nest for only those values.
To effect IF-inspection, code is inserted within the IF statement to record loop-bounds information for the loop that we wish to transform. On the true branch of the guard to be inspected, we insert the following code, where KC is initialized to 0, FLAG is initialized to false, K is the induction variable of the loop to be inspected, and KLB stores the lower bound of each executed range.
IF (.NOT. FLAG) THEN
KC = KC + 1
KLB(KC) = K
FLAG = .TRUE.
ENDIF
On the false branch of the inspected guard, we insert the following code to store the upper bound of each executed range.
IF (FLAG) THEN
KUB(KC) = K-1
FLAG = .FALSE.
ENDIF
Note that we must also account for the fact that the value of the guard could be true on the last iteration of the loop, requiring a test of FLAG after the IF-inspection loop body to store the upper bound of the last range.
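For the example at hand, that post-loop fixup (it appears again in context in Figure 5.5) is simply:

      IF (FLAG) THEN
         KUB(KC) = N
         FLAG = .FALSE.
      ENDIF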
After inserting the inspection code, we can distribute the loop we wish to transform around the inspection code and create a new loop nest that executes over the iteration space where the innermost loop was executed. In our example, the result would be
      DO 20 K = 1,N
C        IF-inspection code
   20 CONTINUE
      DO 10 KN = 1,KC
         DO 10 K = KLB(KN),KUB(KN)
            DO 10 I = 1,N
   10          C(I,J) = C(I,J) + A(I,K) * B(K,J)
The KN-loop executes over the number of ranges where the guarded loop is executed, and the K-loop executes within those ranges. The new K-loop can be transformed as was desired in the original loop nest. The final IF-inspected code for our example is shown in Figure 5.5.
If the ranges over which the inner loop is executed in the original loop are large, the increase in run-time cost caused by IF-inspection can be more than counteracted by the improvements in performance of the newly transformed inner loop. To show this, we performed unroll-and-jam on our IF-inspected matrix multiply example and ran it on an IBM RS/6000 model 540 on 300x300 arrays of REALs. In the table below, Frequency shows how often B(K,J) = 0, UJ is the result of performing unroll-and-jam after moving the guard into the innermost loop, and UJ+IF is the result of performing unroll-and-jam after IF-inspection.
    Frequency   Original   UJ      UJ+IF   Speedup
    2.5%        3.33s      3.84s   2.25s   1.48
    10%         3.08s      3.71s   2.13s   1.45

5.3 Solving Systems of Linear Equations
LAPACK is a project whose goal is to replace the algorithms in LINPACK and EISPACK with block algorithms that have better cache performance. Unfortunately, scientists have spent years developing this package to attain high performance on a variety of architectures. We believe that this process is a step in the wrong direction for high-performance computing. Compilers, not programmers, should handle the machine-specific details required to attain high performance. It should be possible for algorithms to be expressed in a natural, machine-independent form, with the compiler performing the machine-specific optimizations to attain performance.
      FLAG = .FALSE.
      DO 10 J = 1,N
         KC = 0
         DO 20 K = 1,N
            IF (B(K,J) .NE. 0.0) THEN
               IF (.NOT. FLAG) THEN
                  KC = KC + 1
                  KLB(KC) = K
                  FLAG = .TRUE.
               ENDIF
            ELSE
               IF (FLAG) THEN
                  KUB(KC) = K-1
                  FLAG = .FALSE.
               ENDIF
            ENDIF
   20    CONTINUE
         IF (FLAG) THEN
            KUB(KC) = N
            FLAG = .FALSE.
         ENDIF
         DO 10 KN = 1,KC
            DO 10 K = KLB(KN),KUB(KN)
               DO 10 I = 1,N
   10             C(I,J) = C(I,J) + A(I,K) * B(K,J)

      Figure 5.5  Matrix Multiply After IF-Inspection
In this section, we examine the efficacy of this hypothesis by studying the machine-independent expression of algorithms similar to those found in LAPACK. We examine the blockability of three algorithms for solving systems of linear equations, where an algorithm is "blockable" if a compiler can automatically derive the best known block algorithm, similar to the one found in LAPACK, from its corresponding machine-independent point algorithm. Deriving the block algorithms poses two main problems. The first is developing the transformations that allow the compiler to obtain the block algorithm from the point algorithm. The second is determining the machine-dependent blocking factor for the block algorithm. In this section, we address the first issue. The second is beyond the scope of this thesis.
Our study shows that LU decomposition without pivoting is blockable using the techniques derived in Section 5.1, that LU decomposition with partial pivoting is blockable if information beyond the index-set splitting techniques is supplied to the compiler, and that QR decomposition with Householder transformations is not blockable. The study also shows how to improve the memory performance of a fourth, non-LAPACK algorithm, QR decomposition with Givens rotations, using the techniques of Sections 5.1 and 5.2.
5.3.1 LU Decomposition without Pivoting
Gaussian elimination is a form of LU decomposition in which the matrix A is decomposed into two matrices, L and U, such that A = LU, where L is a unit lower triangular matrix and U is an upper triangular matrix. This decomposition can be obtained by multiplying the matrix A by a series of elementary lower triangular matrices, M_k, ..., M_1, as follows [Ste73]:

\[
A = LU = M_1^{-1} \cdots M_k^{-1} U, \qquad U = M_k \cdots M_1 A. \tag{5.1}
\]

Using Equation 5.1, an algorithm for LU decomposition without pivoting can be derived. This point algorithm, where statement 20 computes M_k and statement 10 applies M_k to A, is shown below.
      DO 10 K = 1,N-1
         DO 20 I = K+1,N
   20       A(I,K) = A(I,K) / A(K,K)
         DO 10 J = K+1,N
            DO 10 I = K+1,N
   10          A(I,J) = A(I,J) - A(I,K) * A(K,J)
Unfortunately, this algorithm exhibits poor cache performance on large matrices. To improve its cache performance, scientists have developed a block algorithm that essentially groups a number of updates to the matrix A and applies them together to a block portion of the array [DDSvdV91]. To attain the best blocking, strip-mine-and-interchange is performed on the outer K-loop for only a portion of the inner loop nest, requiring the technique described in Section 5.1.3 for automatic derivation of the block algorithm. Consider the strip-mined version of LU decomposition below.
      DO 10 K = 1,N-1,KS
         DO 10 KK = K,K+KS-1
            DO 20 I = KK+1,N
   20          A(I,KK) = A(I,KK)/A(KK,KK)
            DO 10 J = KK+1,N
               DO 10 I = KK+1,N
   10             A(I,J) = A(I,J) - A(I,KK) * A(KK,J)
To complete the blocking of this loop, the KK-loop would have to be distributed around the loop that surrounds statement 20 and around the loop nest that surrounds statement 10 before being interchanged to the innermost position. However, there is a recurrence between statements 20 and 10 carried by the KK-loop that prevents distribution unless index-set splitting is done.

      Figure 5.6  Regions Accessed in LU Decomposition
If we analyze the regions of the array A accessed over the entire execution of the KK-loop, we find that the region touched by statement 20 is a subset of the region touched by statement 10 (Figure 5.6 gives a graphical description of the data regions). Since the recurrence exists for only a portion of the iteration space, we can split the larger region, defined by the reference to A(I,J) in statement 10, at the point J = K+KS-1. The new loop that covers the disjoint region is shown below.
      DO 10 KK = K,K+KS-1
         DO 10 J = K+KS,N
            DO 10 I = KK+1,N
   10          A(I,J) = A(I,J) - A(I,KK) * A(KK,J)
Now, we can use triangular interchange to put the KK-loop in the innermost position. At this point, we have obtained the best block algorithm, making LU decomposition blockable (see Figure 5.7). Not only does this block algorithm exhibit better data locality, it also has increased parallelism, as the J-loop that surrounds statement 10 can be made parallel.

At this point, we would like to perform unroll-and-jam and scalar replacement to further improve the performance of block LU decomposition. Unfortunately, the true dependence from A(I,J) to A(KK,J) in statement 10 is inconsistent and, when moved inward by unroll-and-jam, would prevent code motion of the assignment to A(I,J).
      DO 10 K = 1,N-1,KS
         DO 20 KK = K,MIN(K+KS-1,N-1)
            DO 30 I = KK+1,N
   30          A(I,KK) = A(I,KK)/A(KK,KK)
            DO 20 J = KK+1,K+KS-1
               DO 20 I = KK+1,N
   20             A(I,J) = A(I,J) - A(I,KK) * A(KK,J)
         DO 10 J = K+KS,N
            DO 10 I = K+1,N
               DO 10 KK = K,MIN(MIN(K+KS-1,N-1),I-1)
   10             A(I,J) = A(I,J) - A(I,KK) * A(KK,J)

      Figure 5.7  Block LU Decomposition
However, if we examine the array sections of the source and sink of the dependence after splitting for trapezoidal and triangular regions, we find that the dependence does not exist in an unrolled loop body.
We applied our algorithm by hand to LU decomposition and compared its performance with the original program and a hand-coded version of the right-looking algorithm [DDSvdV91]. In the table below, "Block 1" refers to the right-looking version and "Block 2" refers to our algorithm in Figure 5.7. In addition, we used our automatic system to apply trapezoidal unroll-and-jam and scalar replacement to our blocked code, producing the version referred to as "Block 2+". The experiment was run on an IBM RS/6000 model 540 using DOUBLE PRECISION REALs. The reader should note that these final transformations could have been applied to the Sorensen version as well, with similar improvements.
    Array Size   Block Size   Original   Block 1   Block 2   Block 2+   Speedup
    300x300      32           1.47s      1.37s     1.35s     0.49s      3.00
    300x300      64           1.47s      1.42s     1.38s     0.58s      2.53
    500x500      32           6.76s      6.58s     6.44s     2.13s      3.17
    500x500      64           6.76s      6.59s     6.38s     2.27s      2.98

5.3.2 LU Decomposition with Partial Pivoting
Although the compiler can discover the potential for blocking in LU decomposition without pivoting using dependence information, the same cannot be said when partial pivoting for numerical stability is added to the algorithm. Using the matrix formulation

\[
U = M_{n-1} P_{n-1} \cdots M_3 P_3 M_2 P_2 M_1 P_1 A,
\]

the point version that includes partial pivoting can be derived (see Figure 5.8) [Ste73]. While we can apply index-set splitting to the algorithm in Figure 5.8 after strip mining to break the recurrence carried by the new KK-loop involving statements 10 and 40, as in the previous section, we cannot break the recurrence involving statements 10 and 25 using this technique.
After index-set splitting, we have the following relevant sections of code.

      DO 10 KK = K,K+KS-1
C        ...
         DO 30 J = 1,N
            TAU = A(KK,J)
   25       A(KK,J) = A(IMAX,J)
   30       A(IMAX,J) = TAU
C        ...
         DO 10 J = K+KS,N
            DO 10 I = KK+1,N
   10          A(I,J) = A(I,J) - A(I,KK) * A(KK,J)
Distributing the KK-loop around both J-loops would convert what was originally a true dependence from A(I,J) in statement 10 to A(IMAX,J) in statement 25 into an anti-dependence in the reverse direction. The rules for the preservation of data dependence prohibit reversing a dependence direction, which would seem to preclude the existence of a block analogue similar to the non-pivoting case. However, a block algorithm that essentially ignores the preventing recurrence, similar to the non-pivoting case, can still be mathematically derived using the following result from linear algebra (see Figure 5.9) [DDSvdV91, Ste73]. If we have
\[
M_1 = \begin{pmatrix} 1 & 0 \\ m_1 & I \end{pmatrix}, \qquad
P_2 = \begin{pmatrix} 1 & 0 \\ 0 & \hat{P}_2 \end{pmatrix},
\]
(Footnote: The dependence from A(I,J) to A(KK,J) in statement 10 needed to be deleted in order for our system to work. Not only do we need section analysis to handle the dependence, but PFC also incorrectly reported the dependence as interchange preventing.)

(Footnote: In this version of the algorithm, row interchanges are performed across every column. Although this is not necessary (it can be done only for columns K through N), it is done to allow the derivation of the block algorithm.)
      DO 10 K = 1,N-1
         TAU = ABS(A(K,K))
         IMAX = K
         DO 20 I = K+1,N
            IF (ABS(A(I,K)) .LE. TAU) GOTO 20
            IMAX = I
            TAU = ABS(A(I,K))
   20    CONTINUE
         DO 30 J = 1,N
            TAU = A(K,J)
   25       A(K,J) = A(IMAX,J)
   30       A(IMAX,J) = TAU
         DO 40 I = K+1,N
   40       A(I,K) = A(I,K)/A(K,K)
         DO 10 J = K+1,N
            DO 10 I = K+1,N
   10          A(I,J) = A(I,J) - A(I,K) * A(K,J)

      Figure 5.8  LU Decomposition with Partial Pivoting
      DO 10 K = 1,N-1,KS
         DO 20 KK = K,MIN(K+KS-1,N-1)
            TAU = ABS(A(KK,KK))
            IMAX = KK
            DO 30 I = KK+1,N
               IF (ABS(A(I,KK)) .LE. TAU) GOTO 30
               IMAX = I
               TAU = ABS(A(I,KK))
   30       CONTINUE
            DO 40 J = 1,N
               TAU = A(KK,J)
   25          A(KK,J) = A(IMAX,J)
   40          A(IMAX,J) = TAU
            DO 50 I = KK+1,N
   50          A(I,KK) = A(I,KK)/A(KK,KK)
            DO 20 J = KK+1,K+KS-1
               DO 20 I = KK+1,N
   20             A(I,J) = A(I,J) - A(I,KK) * A(KK,J)
         DO 10 J = K+KS,N
            DO 10 I = K+1,N
               DO 10 KK = K,MIN(MIN(K+KS-1,N-1),I-1)
   10             A(I,J) = A(I,J) - A(I,KK) * A(KK,J)

      Figure 5.9  Block LU Decomposition with Partial Pivoting
then

\[
P_2 M_1
= \begin{pmatrix} 1 & 0 \\ 0 & \hat{P}_2 \end{pmatrix}
  \begin{pmatrix} 1 & 0 \\ m_1 & I \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ \hat{P}_2 m_1 & \hat{P}_2 \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ \hat{P}_2 m_1 & I \end{pmatrix}
  \begin{pmatrix} 1 & 0 \\ 0 & \hat{P}_2 \end{pmatrix}
= \hat{M}_1 P_2. \tag{5.2}
\]
This result shows that we can postpone the application of the eliminator M_1 until after the application of the permutation matrix P_2 if we also permute the rows of the eliminator. Extending Equation 5.2 to the entire formulation, we have
\[
\begin{aligned}
U &= M_{n-1}P_{n-1}\,M_{n-2}P_{n-2}\,M_{n-3}P_{n-3}\cdots M_1P_1A \\
  &= M_{n-1}\hat{M}_{n-2}\,P_{n-1}P_{n-2}\,M_{n-3}P_{n-3}\cdots M_1P_1A \\
  &= M_{n-1}\hat{M}_{n-2}\hat{M}_{n-3}\,P_{n-1}P_{n-2}P_{n-3}\cdots M_1P_1A \\
  &= M_{n-1}\hat{M}_{n-2}\hat{M}_{n-3}\cdots\hat{M}_1\,P_{n-1}P_{n-2}P_{n-3}\cdots P_1A.
\end{aligned}
\]
In the implementation of the block algorithm, P_i cannot be computed until step i of the point algorithm. P_i depends only upon the first i columns of A, allowing the computation of k P_i's and \hat{M}_i's, where k is the blocking factor, followed by the block application of the \hat{M}_i's as is done in Figure 5.9 [DDSvdV91].
To install the above result into the compiler, we examine its implications from a data-dependence viewpoint. In the point version, each row interchange is followed by a whole-column update in which each row element is updated independently. In the block version, multiple row interchanges may occur before a particular column is updated. The same computations (column updates) are performed in both the point and block versions, but these computations may occur in different locations (rows) of the array because of the application of P_{i+1} to M_i. The key concept for the compiler to understand is that row interchanges and whole-column updates are commutable operations. Data dependence alone is not sufficient to understand this. A data dependence relation maps values to memory locations. It reveals the sequence of values that pass through a particular location. In the block version of LU decomposition, the sequence of values that pass through a location is different from the point version, although the final values are identical. Therefore, from the point of view of a compiler that only understands data dependence, LU decomposition with partial pivoting is not blockable.
Fortunately, a compiler can be equipped to understand that operations on whole columns commute with row permutations. To upgrade the compiler, one would have to install pattern matching to recognize both the row permutations and the whole-column updates in order to prove that the recurrence involving statements 10 and 25 of the index-set-split code can be ignored. Forms of pattern matching are already done in commercially available compilers, so it is reasonable to believe that we can recognize the situation in LU decomposition. The question is, however, "Will the increase in knowledge be profitable?" To see the potential profitability of making the compiler more sophisticated, consider the table below, where "Block" refers to the algorithm given in Figure 5.9 and "Block+" refers to that algorithm after unroll-and-jam and scalar replacement. This experiment was run on an IBM RS/6000 model 540 using DOUBLE PRECISION REALs.
    Array Size   Block Size   Original   Block   Block+   Speedup
    300x300      32           1.52s      1.42s   0.58s    2.62
    300x300      64           1.52s      1.48s   0.67s    2.27
    500x500      32           7.01s      6.85s   2.58s    2.72
    500x500      64           7.01s      6.83s   2.73s    2.57

5.3.3 QR Decomposition with Householder Transformations
The key to Gaussian elimination is the multiplication of the matrix A by a series of elementary lower triangular matrices that introduce zeros below each diagonal element. Any class of matrices that has this property can be used to solve a system of linear equations. One such class, matrices having orthonormal columns, is used in QR decomposition [Ste73].
If A has linearly independent columns, then A can be written uniquely in the form A = QR, where Q has orthonormal columns (QQ^T = I) and R is upper triangular with positive diagonal elements. One class of matrices that fits the properties of Q is the elementary reflectors, or Householder transformations, of the form I - 2vv^T.

The point algorithm for this form of QR decomposition consists of iteratively applying the elementary reflector V_k = I - 2v_k v_k^T to A_k to obtain A_{k+1}, for k = 1, ..., n-1. Each V_k eliminates the values below the diagonal in the k-th column. For a more detailed discussion of the QR algorithm and the computation of V_k, see Stewart [Ste73].
Although pivoting is not necessary for QR decomposition, the best block algorithm is not an aggregation of the original algorithm. The block application of a number of elementary reflectors involves both computation and storage that do not exist in the original algorithm [DDSvdV91]. Given
\[
A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},
\]

the first step is to factor

\[
\begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix}
= \begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix}
  \begin{pmatrix} R_{11} \\ 0 \end{pmatrix}
= \hat{Q} \begin{pmatrix} R_{11} \\ 0 \end{pmatrix},
\]

and then solve

\[
\hat{Q} \begin{pmatrix} \hat{A}_{12} \\ \hat{A}_{22} \end{pmatrix}
= \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix},
\]

where

\[
\hat{Q} = (I - 2v_1v_1^T)(I - 2v_2v_2^T)\cdots(I - 2v_bv_b^T) = I - 2VTV^T.
\]
The difficulty for the compiler comes in the computation of I - 2VTV^T because it involves space and computation that did not exist in the original point algorithm. To illustrate this, consider the case where the block size is 2:

\[
\hat{Q} = (I - 2v_1v_1^T)(I - 2v_2v_2^T)
        = I - 2\,(v_1\;\,v_2)
          \begin{pmatrix} 1 & -2(v_1^Tv_2) \\ 0 & 1 \end{pmatrix}
          \begin{pmatrix} v_1^T \\ v_2^T \end{pmatrix}.
\]

Here, the computation of the matrix

\[
T = \begin{pmatrix} 1 & -2(v_1^Tv_2) \\ 0 & 1 \end{pmatrix}
\]

is not part of the original algorithm, making it impossible to determine the computation of \hat{Q} from the data dependence information.
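To make the extra work concrete, the off-diagonal entry of T is an inner product that the point algorithm never forms. A minimal sketch, with hypothetical arrays V1 and V2 holding the two Householder vectors of length N:

C     neither VTV nor T12 exists anywhere in the point algorithm
      VTV = 0.0D0
      DO 10 I = 1,N
   10    VTV = VTV + V1(I)*V2(I)
      T12 = -2.0D0*VTV

Because no reference in the point code produces these values, no dependence-based transformation can introduce them.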
The expression of this block algorithm requires the choice of a machine-dependent blocking factor. We know of no way to express this algorithm in a current programming language in a manner that would allow a compiler to automatically choose that factor. Can we enhance the expressibility of a language to allow block algorithms to be stated in a machine-independent form? One possible solution is to define looping constructs whose semantics allow the compiler complete freedom in choosing the blocking factor. In Section 5.4, we address this issue.
5.3.4 QR Decomposition with Givens Rotations
Another form of orthogonal matrix that can be used in QR decomposition is the Givens rotation matrix [Sew90]. We currently know of no best block algorithm to derive, so instead we show that the index-set splitting technique described in Section 5.1.3 and IF-inspection have wider applicability.

Consider the Fortran code for Givens QR shown in Figure 5.10 (note that this algorithm does not check for overflow) [Sew90]. The references to A in the inner K-loop have a long stride between successive accesses, resulting in poor cache performance. Our algorithm from Chapter 4 would recommend interchanging the J-loop to the innermost position, giving stride-one access to the references to A(J,K) and making the references to A(L,K) invariant with respect to the innermost loop.
      DO 10 L = 1,N
         DO 10 J = L+1,M
            IF (A(J,L) .EQ. 0.0) GOTO 10
            DEN = DSQRT(A(L,L)*A(L,L) + A(J,L)*A(J,L))
            C = A(L,L)/DEN
            S = A(J,L)/DEN
            DO 10 K = L,N
               A1 = A(L,K)
               A2 = A(J,K)
               A(L,K) = C*A1 + S*A2
   10          A(J,K) = -S*A1 + C*A2

      Figure 5.10  QR Decomposition with Givens Rotations
In this case, loop interchange would necessitate distribution of the J-loop around the IF-block and the K-loop. However, a recurrence consisting of a true dependence and an anti-dependence between the definition of A(L,K) and the use of A(L,L) seems to prevent distribution. Examining the regular sections for these references reveals that the recurrence exists only for the element A(L,L), allowing index-set splitting of the K-loop at L, IF-inspection of the J-loop, distribution (with scalar expansion) and interchange, as shown in Figure 5.11 [KKP+81]. Below is a table of the performance results for Givens QR using DOUBLE PRECISION REALs, run on an IBM RS/6000 model 540.
    Array Size   Original   Optimized   Speedup
    300x300      6.86s      3.37s       2.04
    500x500      84.0s      15.3s       5.49

5.4 Language Extensions
The examination of QR decomposition with Householder transformations has shown that some block algorithms cannot be derived by a compiler from their corresponding point algorithms. In order to maintain our goal of machine-independent coding styles, we need to allow the expression of these types of block algorithms in a machine-independent form. Specifically, we need to direct the compiler to pick the machine-dependent blocking factor for an algorithm automatically.
To this end, we present a preliminary proposal for two looping constructs to guide the compiler's choice of blocking factor. These constructs are BLOCK DO and IN DO. BLOCK DO specifies a DO-loop whose blocking factor is chosen by the compiler. IN DO specifies a DO-loop that executes over the region defined by a corresponding BLOCK DO and guides the compiler to the regions that it should analyze to determine the blocking factor. The bounds of an IN DO statement are optional. If they are not expressed, the bounds are assumed to start at the first value in the specified block and end at the last value, with a step of 1. To allow indexing within a block region, we define LAST to return the last index value in a block. For example, if LU decomposition were not a blockable algorithm, it could be coded as in Figure 5.12 to achieve machine independence.
The principal advantage of the extensions is that the programmer can express a non-blockable algorithm in a natural block form, while leaving the machine-dependent details, namely the choice of blocking factor, to the compiler. In the case of LAPACK, the language extensions could be used, when necessary, to code the algorithms for a source-level library that is independent of the choice of blocking factor. Then, using compiler technology, the library could be ported from machine to machine and still retain good performance. By doing so, we would remove the only machine-dependency problem of LAPACK and make it more accessible for new architectures.
      DO 10 L = 1,N
         DO 20 J = L+1,M
            IF (A(J,L) .EQ. 0.0) GOTO 20
            DEN = DSQRT(A(L,L)*A(L,L) + A(J,L)*A(J,L))
            C(J) = A(L,L)/DEN
            S(J) = A(J,L)/DEN
            A1 = A(L,L)
            A2 = A(J,L)
            A(L,L) = C(J)*A1 + S(J)*A2
            A(J,L) = -S(J)*A1 + C(J)*A2
C           IF-inspection code
   20    CONTINUE
         DO 10 K = L+1,N
            DO 10 JN = 1,JC
               DO 10 J = JLB(JN),JUB(JN)
                  A1 = A(L,K)
                  A2 = A(J,K)
                  A(L,K) = C(J)*A1 + S(J)*A2
   10             A(J,K) = -S(J)*A1 + C(J)*A2

      Figure 5.11  Optimized QR Decomposition with Givens Rotations
      BLOCK DO K = 1,N-1
         IN K DO KK
            DO I = KK+1,N
               A(I,KK) = A(I,KK)/A(KK,KK)
            ENDDO
            DO J = KK+1,LAST(K)
               DO I = KK+1,N
                  A(I,J) = A(I,J) - A(I,KK) * A(KK,J)
               ENDDO
            ENDDO
         ENDDO
         DO J = LAST(K)+1,N
            DO I = K+1,N
               IN K DO KK = K,MIN(LAST(K),I-1)
                  A(I,J) = A(I,J) - A(I,KK) * A(KK,J)
               ENDDO
            ENDDO
         ENDDO
      ENDDO

      Figure 5.12  Block LU in Extended Fortran
To realize this goal, research efforts must focus on effective techniques for the choice of blocking factor.
5.5 Summary
We have set out to determine whether a compiler can automatically restructure computations well enough to avoid the need for hand blocking and encourage machine-independent programming. To that end, we have examined a collection of programs similar to LAPACK for which we were able to acquire both the block and corresponding point algorithms. For each of these programs, we determined whether a plausible compiler technology could succeed in obtaining the block version from the point algorithm.

The results of this study are encouraging: we can block triangular, trapezoidal and rhomboidal loops, and we have found that many of the problems introduced by complex dependence patterns can be overcome by the use of the transformation known as "index-set splitting". In many cases, index-set splitting yields codes that exhibit performance at least as good as the best block algorithms produced by the LAPACK developers. In addition, we have shown that, in the special case of LU decomposition with partial pivoting, knowledge about which operations commute can enable a compiler to succeed in blocking codes that could not be blocked by any compiler based strictly on dependence analysis.
Chapter 6
Conclusion
This dissertation has dealt with compiler-directed management of the memory hierarchy. We have described algorithms to improve the balance between floating-point operations and memory requirements in program loops by reducing the number of memory references and improving cache performance. These algorithms work under the assumption that the compiler for the target machine is effective at scheduling and register allocation. The implementation of our algorithms has validated our methodology by showing that integer-factor speedups over quality commercial optimizers are possible on whole applications. Our hope is that these results will encourage programmers to write their applications in a natural, machine-independent form, leaving the compiler to handle machine-dependent optimization details.

In this chapter, we review this thesis. First, we discuss the contributions that we have made to register allocation and to the automatic management of cache. Next, we discuss the issues related to cache performance that remain to be solved and, finally, we present some closing remarks.
6.1 Contributions

6.1.1 Registers
We have developed and implemented an algorithm to perform scalar replacement in the presence of inner-loop conditional control flow. The goal of scalar replacement is to expose the flow of values in arrays with scalar temporaries so that standard data-flow analysis will discover the potential for register allocation. By mapping partial redundancy elimination to scalar replacement, we are able to replace array references whose defining value is only partially available, something that was not done before this thesis. The implementation of this algorithm has shown that significant improvements are possible on scientific applications.
We have also developed and implemented an algorithm to apply unroll-and-jam to a loop nest to improve its balance between memory references and floating-point operations automatically. Our algorithm chooses unroll amounts for one or two loops to create a loop that is balanced as much as possible on a particular architecture. Included in this algorithm is an estimate of floating-point register pressure that is used to prevent spill-code insertion in the final compiled loop. The results of an experiment using this technique have shown that integer-factor speedups are possible on some applications. In particular, reductions benefit greatly from unroll-and-jam because of both improved balance and easily attained instruction-level parallelism.
The algorithms for scalar replacement and unroll-and-jam eliminate the need for hand optimization to effect register allocation of array values. Not only do the algorithms capture reuse in inner loops, but also in outer loops. Although outer-loop reuse can be obtained by hand, the process is extremely tedious and error prone and produces machine-dependent programs. In one particular case, we have shown that hand optimization actually produces slightly worse code than the automatically derived version. By relying on compiler technology to handle machine-dependent details, programs become more readable and are portable across different architectures.
Because scalar replacement and unroll-and-jam can greatly increase register pressure within loops, one question might be "How many registers are enough?" The answer depends upon the balance of the target machine. For machines that can perform multiple floating-point operations per memory operation, more registers are needed to compensate for the lower memory bandwidth. However, balanced architectures will require fewer registers because memory bandwidth is not as much of a bottleneck.
6.1.2 Cache
We have developed and implemented an algorithm to attain the best loop ordering for a loop nest with respect to memory-hierarchy performance. Our algorithm is simple, but effective. It safely ignores the effects of cache interference on reuse by considering only reuse in the innermost loop. The algorithm is driven by cache-line size, access cost and miss penalty. Implementation has shown that the algorithm is capable of achieving dramatic speedups on whole applications on a single processor. Using this technology, it is possible for a programmer to order loop nests independent of the language's implementation of array storage and of the cache structure.
We have also shown that current compiler techniques are not sufficient to perform iteration-space blocking on real-world algorithms. Trapezoidal-, rhomboidal- and triangular-shaped iteration spaces, which are common in linear algebra and geophysics codes, require a transformation known as index-set splitting to be considered blockable. We have derived formulas to handle these common loop shapes with index-set splitting and shown that blocking these loops can result in significant speedups.
In addition, we have applied index-set splitting to dependences that prevent iteration-space blocking. The objective is to create new loops where the preventing dependences do not exist and blocking can be performed. Using this technique, we have been able to derive automatically the best-known block algorithms for LU decomposition with and without pivoting. Previously, compiler technology was unable to accomplish this. Unfortunately, not all block algorithms can be derived automatically by a compiler. Those block formulations that represent a change of algorithm from their corresponding point algorithm cannot be obtained automatically. To handle these situations, we have proposed that a set of programming-language extensions be developed to allow a programmer to specify block algorithms in a machine-independent manner.
Finally, our study of blockability has also led to a transformation called IF-inspection to handle inner loops that are guarded by control conditions. With this transformation, we determine exactly which iterations of an innermost loop will execute and then we optimize the memory performance of the loop nest for those iterations. Large integer-factor speedups have been shown to be possible with IF-inspection.
6.2 Future Work
Although we have addressed many memory-hierarchy issues in this dissertation, there is still much left to do. Most of that work lies in the area of cache management. In this section, we survey some of the major issues yet to be solved.
In the computation of loop balance, memory references are assigned a uniform cost under the assumption that all accesses are made out of the cache. This is not always the case. Can we do a better job of computing the memory requirements of a loop? By applying the memory-hierarchy cost model presented in Chapter 4, we can compute the actual number of cycles needed to access memory within the innermost loop body. Using this measurement, we can obtain a better picture of the relationship between computation and memory cycles, resulting in a better measure of loop balance. The question is "Does this increased precision matter to performance?" An implementation and a comparison between the two computations of balance would answer the question.
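As a sketch of the refinement, assuming the balance notation of the earlier chapters (M_L memory references and F_L floating-point operations per loop iteration), the change is to replace the uniform-cost numerator with a modeled cycle count C_L:

\[
\beta_L = \frac{M_L}{F_L} \quad\longrightarrow\quad \beta_L' = \frac{C_L}{F_L},
\]

where C_L is the number of memory cycles per iteration predicted by the cache cost model.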
Our treatment of iteration-space blocking for cache is not complete. Although we have studied extensions to current compiler techniques that allow a larger class of algorithms to be blocked automatically, these additional techniques alone are insufficient for implementation in a real compiler. Picking block sizes and dealing with cache interference caused by set associativity are two issues that must be solved to make automatic blocking a viable technology. The optimal block size for a loop depends upon the behavior of the set associativity of the cache and has been shown to be difficult to determine [LRW91]. Can the compiler predict these effects at compile time, or is it hopeless to perform automatic blocking with today's cache architectures? Additionally, an implementation of the techniques developed in Chapter 5 is needed to show their viability.
As our study of blockability revealed, not all algorithms can be written in a style that allows them to be blocked optimally. The language extensions presented in Chapter 5 provide a vehicle for programmers to express those block algorithms in a machine-independent manner. It must be determined exactly which extensions are needed and how to implement them effectively. Assuming that we can automatically determine block sizes, can we use the language extensions to allow variable block sizes to increase performance?
One transformation for memory performance that has not been discussed in this thesis is software prefetching. Previous work in this area has ignored the effects of cache-line length on the redundancy of prefetching in the presence of limited issue slots [CKP91]. Can we take advantage of our loop-interchange algorithm to attain stride-one accesses and use the cache-line size to derive an efficient and effective method for software prefetching? Future efforts should be directed toward the development of an effective software-prefetching algorithm.
Finally, we have not studied the effects of multi-level caches on the performance of scientific applications. Many current systems use a MIPS R3000 or an Intel i860 with a second-level cache to attain better cache performance. Are current compiler techniques (with the addition of the rest of our future work) good enough to handle this increase in complexity? What is the correct way to view the higher-level caches? What is the true payoff of level-2 caches? How much improvement can be attained with the extra level of cache? These issues and others must be answered before such complex memory hierarchies become effective.
6.3 Final Remarks
The complexity in the design of modern memory hierarchies and the lack of sophistication in modern commercial compilers have put a significant burden on the programmer seeking to achieve any large fraction of the performance available on high-performance architectures. Given that future machine designs are certain to have increasingly complex memory hierarchies, compilers will need to adopt increasingly sophisticated memory-management strategies to offset the need for programmers to perform hand optimization. It is our belief that programmers should not be burdened with architectural details, but rather should concentrate solely on program logic. To this end, our goal has been to find compiler techniques that make it possible for a programmer to express numerical algorithms naturally with the expectation of good memory-hierarchy performance. We have demonstrated that there exist readily implementable methods that can manage the floating-point register set and improve the effectiveness of cache. By accomplishing these objectives, we have taken a significant step towards achieving our goal.
Appendix A
Formulas for Non-Rectangular Iteration Spaces
A.1 Triangular Loops

A.1.1 Upper Left: α > 0

      DO 10 I = 1,N
         DO 10 J = αI+β,M
   10       loop body
      Figure A.1  Upper Left Triangular Iteration Space
Strip-Mine-and-Interchange Formula

      DO 10 I = 1,N,IS
         DO 10 J = αI+β,M
            DO 10 II = I,MIN((J-β)/α,I+IS-1)
   10          loop body

Unroll-and-Jam Formula

      DO 10 I = 1,N,IS
         DO 20 II = I,I+IS-2
            DO 20 J = αII+β,MIN(α(I+IS-2)+β,M)
   20          loop body
         DO 10 J = α(I+IS-1)+β,M
   10       unrolled loop body
A.1.2 Upper Right: α < 0

      DO 10 I = 1,N
         DO 10 J = αI+β,M
   10       loop body

      Figure A.2  Upper Right Triangular Iteration Space

Strip-Mine-and-Interchange Formula

      DO 10 I = 1,N,IS
         DO 10 J = α(I+IS-1)+β,M
            DO 10 II = MAX(I,(J-β)/α),I+IS-1
   10          loop body

Unroll-and-Jam Formula

      DO 10 I = 1,N,IS
         DO 20 II = I+1,I+IS-1
            DO 20 J = αII+β,MIN(α(I+1)+β,M)
   20          loop body
         DO 10 J = αI+β,M
   10       unrolled loop body
A.1.3 Lower Right: α > 0

      DO 10 I = 1,N
         DO 10 J = L,αI+β
   10       loop body
      Figure A.3  Lower Right Triangular Iteration Space

Strip-Mine-and-Interchange Formula

      DO 10 I = 1,N,IS
         DO 10 J = L,α(I+IS-1)+β
            DO 10 II = MAX(I,(J-β)/α),I+IS-1
   10          loop body

Unroll-and-Jam Formula

      DO 10 I = 1,N,IS
         DO 20 J = L,αI+β
   20       unrolled loop body
         DO 10 II = I+1,I+IS-1
            DO 10 J = MAX(α(I+1)+β,L),αII+β
   10          loop body
A.1.4 Lower Left: α < 0

      DO 10 I = 1,N
         DO 10 J = L,αI+β
   10       loop body
      Figure A.4  Lower Left Triangular Iteration Space

Strip-Mine-and-Interchange Formula

      DO 10 I = 1,N,IS
         DO 10 J = L,αI+β
            DO 10 II = I,MIN((J-β)/α,I+IS-1)
   10          loop body

Unroll-and-Jam Formula

      DO 10 I = 1,N,IS
         DO 20 J = L,α(I+IS-1)+β
   20       unrolled loop body
         DO 10 II = I,I+IS-2
            DO 10 J = MAX(α(I+IS-2)+β,L),αII+β
   10          loop body
A.2 Trapezoidal Loops

A.2.1 Upper-Bound MIN Function

Assume α, β > 0.

      DO 10 I = 1,N
         DO 10 J = L,MIN(αI+β,N)
   10       loop body
      Figure A.5  Trapezoidal Iteration Space with MIN Function

After Index-Set Splitting

      DO 10 I = 1,MIN(N,(N-β)/α)
         DO 10 J = L,αI+β
   10       loop body
      DO 20 I = MAX(1,MIN(N,(N-β)/α)+1),N
         DO 20 J = L,N
   20       loop body
A.2.2 Lower-Bound MAX Function

Assume α, β > 0.

      DO 10 I = 1,N
         DO 10 J = MAX(αI+β,L),N
   10       loop body
      Figure A.6  Trapezoidal Iteration Space with MAX Function

After Index-Set Splitting

      DO 10 I = 1,MIN(N,(L-β)/α)
         DO 10 J = L,N
   10       loop body
      DO 20 I = MAX(1,MIN(N,(L-β)/α)+1),N
         DO 20 J = αI+β,N
   20       loop body
A.3 Rhomboidal Loops

Assume α > 0.

      DO 10 I = 1,N1
         DO 10 J = αI+N,αI+M
   10       loop body
      Figure A.7  Rhomboidal Iteration Space
Strip-Mine-and-Interchange Formula

      DO 10 I = 1,N1,IS
         DO 10 J = αI+N,α(I+IS-1)+M
            DO 10 II = MAX(I,(J-M)/α),MIN((J-N)/α,I+IS-1)
   10          loop body

Unroll-and-Jam Formula

      DO 10 I = 1,N1,IS
         DO 20 II = I,I+IS-2
            DO 20 J = αII+N,MIN(α(I+IS-2)+N,α(I+IS-2)+M)
   20          loop body
         DO 30 J = α(I+IS-1)+N,αI+M
   30       unrolled loop body
         DO 10 II = I+1,I+IS-1
            DO 10 J = MAX(α(I+1)+N,α(I+1)+M),αII+M
   10          loop body
Bibliography

[AC72]      F.E. Allen and J. Cocke. A catalogue of optimizing transformations. In Design and Optimization of Compilers, pages 1-30. Prentice-Hall, 1972.

[AK87]      J.R. Allen and K. Kennedy. Automatic translation of Fortran programs to vector form. ACM Transactions on Programming Languages and Systems, 9(4):491-542, October 1987.

[AK88]      J.R. Allen and K. Kennedy. Vector register allocation. Technical Report TR86-45, Department of Computer Science, Rice University, 1988.

[AN87]      A. Aiken and A. Nicolau. Loop quantization: An analysis and algorithm. Technical Report 87-821, Cornell University, March 1987.

[AS78]      W. Abu-Sufah. Improving the Performance of Virtual Memory Computers. PhD thesis, Dept. of Computer Science, University of Illinois, 1978.

[ASM86]     W. Abu-Sufah and A. Malony. Vector processing on the Alliant FX/8 multiprocessors. In Proceedings of the 1986 International Conference on Parallel Processing, pages 559-566, August 1986.

[BCHT90]    P. Briggs, K.D. Cooper, M.W. Hall, and L. Torczon. Goal-directed interprocedural optimization. Technical Report TR90-102, Rice University, CRPC, November 1990.

[BCKT89]    P. Briggs, K.D. Cooper, K. Kennedy, and L. Torczon. Coloring heuristics for register allocation. In Proceedings of the ACM SIGPLAN 89 Conference on Program Language Design and Implementation, Portland, OR, June 1989.

[BS88]      M. Berry and A. Sameh. Multiprocessor schemes for solving block tridiagonal linear systems. International Journal of Supercomputer Applications, 2(3):37-57, Fall 1988.

[CAC+81]    G.J. Chaitin, M.A. Auslander, A.K. Chandra, J. Cocke, M.E. Hopkins, and P.W. Markstein. Register allocation via coloring. Computer Languages, 6:45-57, January 1981.

[Cal86]     D.A. Calahan. Block-oriented, local-memory-based linear equation solution on the Cray-2: Uniprocessor algorithm. In Proceedings of the 1986 International Conference on Parallel Processing, 1986.

[CCK88]     D. Callahan, J. Cocke, and K. Kennedy. Estimating interlock and improving balance for pipelined machines. Journal of Parallel and Distributed Computing, 5, 1988.

[CCK90]     D. Callahan, S. Carr, and K. Kennedy. Improving register allocation for subscripted variables. In Proceedings of the SIGPLAN '90 Conference on Programming Language Design and Implementation, White Plains, NY, June 1990.

[CH84]      F. Chow and J. Hennessy. Register allocation by priority-based coloring. In Proceedings of the SIGPLAN '84 Symposium on Compiler Construction, SIGPLAN Notices Vol. 19, No. 6, June 1984.

[CK77]      John Cocke and Ken Kennedy. An algorithm for reduction of operator strength. Communications of the ACM, 20(11), November 1977.

[CK87]      D. Callahan and K. Kennedy. Analysis of interprocedural side effects in a parallel programming environment. In Proceedings of the First International Conference on Supercomputing. Springer-Verlag, Athens, Greece, 1987.

[CKP91]     D. Callahan, K. Kennedy, and A. Porterfield. Software prefetching. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, April 1991.

[CP90]      D. Callahan and A. Porterfield. Data cache performance of supercomputer applications. In Supercomputing '90, 1990.

[DBMS79]    J.J. Dongarra, J.R. Bunch, C.B. Moler, and G.W. Stewart. LINPACK User's Guide. SIAM Publications, Philadelphia, 1979.

[DDDH90]    J.J. Dongarra, J. DuCroz, I. Duff, and S. Hammarling. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16:1-17, 1990.

[DDHH88]    J.J. Dongarra, J. DuCroz, S. Hammarling, and R. Hanson. An extended set of Fortran basic linear algebra subprograms. ACM Transactions on Mathematical Software, 14:1-17, 1988.

[DDSvdV91]  J.J. Dongarra, I.S. Duff, D.C. Sorensen, and H.A. van der Vorst. Solving Linear Systems on Vector and Shared-Memory Computers. SIAM, Philadelphia, 1991.

[DS88]      K.H. Drechsler and M.P. Stadel. A solution to a problem with Morel and Renvoise's "Global optimization by suppression of partial redundancies". ACM Transactions on Programming Languages and Systems, 10(4):635-640, October 1988.

[Fab79]     Janet Fabri. Automatic storage optimization. In Proceedings of the SIGPLAN Symposium on Compiler Construction, Denver, CO, 1979.

[GJ79]      M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Co., San Francisco, 1979.

[GJG87]     D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformations. In Proceedings of the First International Conference on Supercomputing. Springer-Verlag, Athens, Greece, 1987.

[GJMS88]    K. Gallivan, W. Jalby, U. Meier, and A.H. Sameh. Impact of hierarchical memory systems on linear algebra design. International Journal of Supercomputer Applications, 2(1):12-48, Spring 1988.

[GKT91]     G. Goff, K. Kennedy, and C.W. Tseng. Practical dependence testing. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, Toronto, Ontario, June 1991.

[GM86]      P.B. Gibbons and S.S. Muchnick. Efficient instruction scheduling for a pipelined architecture. In Proceedings of the SIGPLAN '86 Symposium on Compiler Construction, 1986.

[GS84]      J.H. Griffin and M.L. Simmons. Los Alamos National Laboratory Computer Benchmarking 1983. Technical Report LA-10051-MS, Los Alamos National Laboratory, June 1984.

[HK91]      P. Havlak and K. Kennedy. An implementation of interprocedural bounded regular section analysis. IEEE Transactions on Parallel and Distributed Systems, 2(3):350-360, July 1991.

[IT88]      F. Irigoin and R. Triolet. Supernode partitioning. In Conference Record of the Fifteenth ACM Symposium on the Principles of Programming Languages, pages 319-328, January 1988.

[KKP+81]    D. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. Wolfe. Dependence graphs and compiler optimizations. In Conference Record of the Eighth ACM Symposium on the Principles of Programming Languages, 1981.

[KM92]      K. Kennedy and K. McKinley. Optimizing for parallelism and memory hierarchy. In Proceedings of the 1992 International Conference on Supercomputing, Washington, DC, July 1992.

[Kuc78]     D. Kuck. The Structure of Computers and Computations, Volume 1. John Wiley and Sons, New York, 1978.

[LHKK79]    C. Lawson, R. Hanson, D. Kincaid, and F. Krogh. Basic linear algebra subprograms for Fortran usage. ACM Transactions on Mathematical Software, 5:308-329, 1979.

[LRW91]     M.S. Lam, E.E. Rothberg, and M.E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991.

[LS88]      B. Liu and N. Strother. Programming in VS FORTRAN on the IBM 3090 for maximum vector performance. Computer, 21(6), June 1988.

[MR79]      E. Morel and C. Renvoise. Global optimization by suppression of partial redundancies. Communications of the ACM, 22(2), February 1979.

[Por89]     A.K. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Applications. PhD thesis, Rice University, May 1989.

[Sew90]     G. Sewell. Computational Methods of Linear Algebra. Ellis Horwood, England, 1990.

[Ste73]     G.W. Stewart. Introduction to Matrix Computations. Academic Press, New York, 1973.

[SU70]      R. Sethi and J.D. Ullman. The generation of optimal code for arithmetic expressions. Journal of the ACM, 17(4):715-728, October 1970.

[Tha81]     Khalid O. Thabit. Cache Management by the Compiler. PhD thesis, Rice University, November 1981.

[WL91]      M.E. Wolf and M.S. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, June 1991.

[Wol82]     M. Wolfe. Optimizing Supercompilers for Supercomputers. PhD thesis, University of Illinois, October 1982.

[Wol86a]    M. Wolfe. Advanced loop interchange. In Proceedings of the 1986 International Conference on Parallel Processing, August 1986.

[Wol86b]    M. Wolfe. Loop skewing: The wavefront method revisited. Journal of Parallel Programming, 1986.

[Wol87]     M. Wolfe. Iteration space tiling for memory hierarchies. In Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing, December 1987.

[Wol89]     M. Wolfe. More iteration space tiling. In Proceedings of the Supercomputing '89 Conference, 1989.