RICE UNIVERSITY
Improving Effective Bandwidth through
Compiler Enhancement of
Global and Dynamic Cache Reuse
by
Chen Ding
A Thesis Submitted
in Partial Fulfillment of the
Requirements for the Degree
Doctor of Philosophy
Approved, Thesis Committee:
Ken Kennedy
Ann and John Doerr Professor, Chair
Computer Science
Keith Cooper
Associate Professor
Computer Science
Danny C. Sorensen
Professor
Computational and Applied Mathematics
Alan Cox
Associate Professor
Computer Science
John Mellor-Crummey
Senior Faculty Fellow
Computer Science
Houston, Texas
January 14, 2000
Improving Effective Bandwidth through
Compiler Enhancement of
Global and Dynamic Cache Reuse
Chen Ding
Abstract
While CPU speed has been improved by a factor of 6400 over the past twenty years,
memory bandwidth has increased by a factor of only 139 during the same period.
Consequently, on modern machines the limited data supply simply cannot keep a
CPU busy, and applications often utilize only a few percent of peak CPU performance.
The hardware solution, which provides layers of high-bandwidth data cache, is not
effective for large and complex applications primarily for two reasons: far-separated
data reuse and large-stride data access. The first repeats unnecessary transfer and
the second communicates useless data. Both waste memory bandwidth.
This dissertation pursues a software remedy. It investigates the potential for
compiler optimizations to alter program behavior and reduce its memory bandwidth
consumption. To this end, this research has studied a two-step transformation strategy: first fuse computations on the same data and then group data used by the same
computation. Existing techniques such as loop blocking can be viewed as an application of this strategy within a single loop nest. In order to carry out this strategy
to its full extent, this research has developed a set of compiler transformations that
perform computation fusion and data grouping over the whole program and during
the entire execution. The major new techniques and their unique contributions are
• Maximal loop fusion: an algorithm that achieves maximal fusion among all program statements and bounded reuse distance within a fused loop.
• Inter-array data regrouping: the first to selectively group global data structures and to do so with guaranteed profitability and compile-time optimality.
• Locality grouping and dynamic packing: the first set of compiler-inserted and compiler-optimized computation and data transformations at run time.
These optimizations have been implemented in a research compiler and evaluated
on real-world applications on SGI Origin2000. The results show that, on average,
the new strategy eliminates 41% of memory loads in regular applications and 63% in
irregular and dynamic programs. As a result, the overall execution time is shortened
by 12% to 77%.
In addition to compiler optimizations, this research has developed a performance
model and designed a performance tool. The former allows precise measurement
of the memory bandwidth bottleneck; the latter enables effective user tuning and
accurate performance prediction for large applications: neither goal was achieved
before this thesis.
Acknowledgments
I wish to thank my advisor, Ken Kennedy, for his inspiration, technical direction and
material support. Without him this dissertation would not be possible. I want to
thank my other committee members, Keith Cooper, John Mellor-Crummey, Alan Cox,
and Danny Sorensen, for their interest and help. Sarita Adve helped with my proposal.
I also thank Ellen Butler for always reserving me a slot in Ken’s busy schedule. The
implementation of my work was based on the D System, an infrastructure project led
by John Mellor-Crummey (and in part by Vikram Adve before his leave). I heavily
used the scalar compiler framework put together by Nat Macintosh. The D System
also contains components from previous compilers built by generations of graduate
students.
I am very fortunate to study in a small department with leading researchers working in a close environment and in the same wonderful building. The professors and
students of other groups not only give superb teaching but are also always ready to
help. I thank in particular the language, scalar compiler, architecture and system
groups. My work was also helped by Nathaniel Dean and William Cook of the computational mathematics department. My writing was significantly improved by a seminar
taught by Jan Hewitt. In addition, I thank Ron Goldman for his valuable lunch-time
advice and my officemate Dejan Mircevski for helping me on everything I asked. The
financial support for my study came from Rice University, DARPA, and Compaq
Corporation.
I received my M.S. degree from Michigan Tech., where my former advisors Phil
Sweany and Steve Carr helped me to build a solid foundation for my research career. I
also thank other outside researchers for their help, especially Kathryn Knobe, Kathryn
McKinley, Wei Li, and Chau-Wen Tseng.
This dissertation is dedicated to my family: my dear wife Linlin, my parents
Shengyao Ding and Ruizhe Liu, and my brother Rui, for the never-ending love, support, and encouragement. I always remember what my father told me: “There are
mountains after mountains and sky outside sky.”
Contents

Abstract
Acknowledgments
List of Illustrations

1 Introduction
  1.0 Thesis
  1.1 Problem of Memory Performance
      1.1.0 Definitions
      1.1.1 Conflicting Trends of Software and Hardware
      1.1.2 Memory Bandwidth Bottleneck
  1.2 Solution through Cache Reuse
      1.2.1 Two-Step Strategy of Cache Reuse
      1.2.2 The Need for Compiler Automation
      1.2.3 A Unified Compiler Strategy
  1.3 Related Work
      1.3.1 Complementary Techniques
      1.3.2 Global and Dynamic Optimizations
      1.3.3 Performance Model and Tool for Memory Hierarchy
      1.3.4 Summary of Limitations
  1.4 Overview

2 Global Computation Fusion
  2.1 Introduction
  2.2 Analysis of Data Reuse
      2.2.1 Reuse Distance
      2.2.2 Reuse-Driven Execution
  2.3 An Algorithm for Maximal Loop Fusion
      2.3.1 Single-Level Fusion
      2.3.2 Properties
      2.3.3 Multi-level Fusion
  2.4 Optimal Loop Fusion
      2.4.1 Loop Fusion for Minimal Reuse Distance
      2.4.2 Loop Fusion for Minimal Data Sharing
      2.4.3 An Open Question
  2.5 Advanced Optimizations Enabled by Loop Fusion
      2.5.1 Storage Reduction
      2.5.2 Store Elimination
  2.6 Summary

3 Global Data Regrouping
  3.1 Introduction
  3.2 Program Analysis
  3.3 Regrouping Algorithm
      3.3.1 One-Level Regrouping
      3.3.2 Optimality
      3.3.3 Multi-level Regrouping
  3.4 Extensions
      3.4.1 Allowing Useless Data
      3.4.2 Allowing Dynamic Data Regrouping
      3.4.3 Minimizing Data Writebacks
  3.5 Summary

4 Run-time Cache-reuse Optimizations
  4.1 Introduction
  4.2 Locality Grouping and Data Packing
      4.2.1 Locality Grouping
      4.2.2 Dynamic Data Packing
      4.2.3 Combining Computation and Data Transformation
  4.3 Compiler Support for Dynamic Data Packing
      4.3.1 Packing and Packing Optimizations
      4.3.2 Compiler Analysis and Instrumentation
      4.3.3 Extensions to Fully Automatic Packing
  4.4 Summary

5 Performance Tuning and Prediction
  5.1 Introduction
  5.2 Bandwidth-based Performance Tool
      5.2.1 Data Analysis
      5.2.2 Integration with Compiler
  5.3 Performance Tuning and Prediction
  5.4 Extensions to More Accurate Estimation
  5.5 Summary

6 Evaluation
  6.1 Implementation
      6.1.1 Maximal Loop Fusion
      6.1.2 Inter-array Data Regrouping
      6.1.3 Data Packing and Its Optimizations
  6.2 Experimental Design
  6.3 Effect on Regular Applications
      6.3.1 Applications
      6.3.2 Transformations Applied
      6.3.3 Effect of Transformations
  6.4 Effect on Irregular and Dynamic Applications
      6.4.1 Applications
      6.4.2 Transformations Applied
      6.4.3 Effect of Transformations
  6.5 Effect of Performance Tuning and Predication
  6.6 Summary

7 Conclusions
  7.1 Compiler Optimizations for Cache Reuse
  7.2 Future Work
  7.3 Final Remarks

Bibliography
Illustrations

1.1 Comparison between program and machine balance
1.2 Ratios of bandwidth demand to its supply
1.3 Example of global cache reuse
1.4 Example of dynamic cache reuse
1.5 Comparison among hardware/OS, programmers and compilers
1.6 The overall compiler strategy for maximizing memory performance

2.1 Example reuse distances
2.2 Algorithm for reuse-driven execution
2.3 Effect of reuse-driven execution (I)
2.4 Effect of reuse-driven execution (II)
2.5 Examples of loop fusion
2.6 Assumptions on the input program
2.7 Algorithm for one-level fusion
2.8 Algorithm for multi-level fusion
2.9 Example of bandwidth-minimal loop fusion
2.10 Minimal-cut algorithm for a hyper-graph
2.11 Array shrinking and peeling
2.12 Store elimination
2.13 Effect of store elimination

3.1 Example of inter-array data regrouping
3.2 Computation phases of a hydrodynamics simulation program
3.3 Example of multi-level data regrouping
3.4 Algorithm for multi-level data regrouping
3.5 Examples of extending data regrouping

4.1 Example of locality grouping
4.2 Effect of locality grouping
4.3 Example of data packing
4.4 Algorithm of consecutive data packing
4.5 Moldyn and Mesh, on 2K and 4K cache
4.6 Mesh after locality grouping
4.7 Moldyn kernel with a packing directive
4.8 Moldyn kernel after data packing
4.9 Moldyn kernel after packing optimizations
4.10 Primitive packing groups in Moldyn
4.11 Compiler indirection analysis and packing optimization

5.1 Structure of the performance tool

6.1 Descriptions of Regular applications
6.2 Effect of transformations on regular applications
6.3 Reuse distances of NAS/SP after maximal fusion
6.4 Descriptions of irregular and dynamic applications
6.5 Input sizes of irregular and dynamic applications
6.6 Transformations applied to irregular and dynamic applications
6.7 Effect of transformations on irregular and dynamic applications
6.8 Effect of compiler optimizations for data packing
6.9 Memory bandwidth utilization of NAS/SP
6.10 Actual and predicted execution time
6.11 Actual and predicted data transfer

7.1 Summary of evaluation results
Chapter 1
Introduction
“In science there is only Physics; all the rest is stamp collecting” – Ernest
Rutherford (1871-1937)
1.0 Thesis
At the dawn of the 21st century, the computing world is witnessing two powerful
but diverging trends of hardware and software. On the hardware side, single-chip
microprocessors have become the dominant platform for most applications simply
because of their tremendous computing power, which has increased by an astonishing
6400 times in the past twenty years. However, in sharp contrast to the rapid on-chip
improvement is the much slower rate of growth for off-chip memory bandwidth, which
has increased by merely 139 times over the same period of time. To close the memory
gap, all modern machines provide high-bandwidth on-chip data caches in the hope
that most data can be cached so that applications can largely avoid direct access to
memory.
Although caches have been successful for programs with small data sets and simple access patterns, their effectiveness has become increasingly problematic as the
software community has been relentlessly pushing into ever larger and more complex
systems. Not only do today’s programs employ a massive amount of data that is far
too large to fit in cache, they also access memory in a complex and dynamically changing manner that leads to extremely poor utilization of the available cache resource.
The problem of poor cache utilization is further compounded by the use of module- or component-based programming styles that fragment both computation and data
that could be otherwise cached together. As a result of the poor cache utilization and
consequently poor memory performance, many applications can achieve only a few
percent of peak CPU performance on modern machines, leaving room for a potential
improvement of an order of magnitude if only caches could be better utilized.
The purpose of this thesis is to bridge the diverging trends of software and hardware by developing a compiler strategy that automatically transforms programs to
fully utilize machine cache. Specifically, this work demonstrates that
Global and run-time transformations can substantially improve the
overall performance of large, dynamic applications on machines
built from modern microprocessors; furthermore, these transformations can be automated and combined into a coherent compiler strategy.
The rest of this chapter first explains the problem of memory bottleneck and the
solution of cache reuse. Then it presents the overall compiler strategy for maximizing
cache reuse and compares this strategy with previous work. The succeeding chapters
will then flesh out the various components of the new compiler strategy.
1.1 Problem of Memory Performance
The problem of memory performance is rooted in the diverging trends of hardware
and software, in particular in the growing mismatch between the insufficient memory bandwidth supplied by machines and the massive memory transfer demanded
by applications. This section first defines a few key concepts of memory hierarchy.
The main part then studies the fundamental balance between computation and data
transfer on computing systems, formulates a performance model based on the concept of balance, and finally uses the model to identify the performance bottleneck on
modern machines.
1.1.0 Definitions
All modern machines built from microprocessors have data transferred through several
levels of storage. The closest to CPU is a set of registers, then one or more levels
of cache, and finally the main memory. This layered memory organization is called
memory hierarchy.
Memory bandwidth is the data bandwidth between CPU and main memory, that is,
how much data is communicated between them in each second. The communication is
two-way: data is fetched into CPU through memory reads and sent back to memory by
memory writebacks. The memory bandwidth of a program is called effective memory
bandwidth, which is the number of memory reads and writebacks a program performs
in each second. Since CPU and main memory are on different computer chips, the
effective memory bandwidth of a program is constrained by the physical memory
bandwidth of a machine, which is the cross-chip or off-chip hardware bandwidth
between CPU and main memory.
Cache is a data buffer between CPU and main memory. It serves memory requests for the buffered data without accessing main memory. Cache is organized as a
collection of non-unit cache blocks or cache lines. If a data item is buffered in cache,
the whole block of the adjacent data is also loaded into the same cache block. A
memory reference is a cache hit if the requested data is in a cache block; otherwise it
is a cache miss, and the data is loaded directly from memory.
A repeated memory reference to the same data is a data reuse. If the requested
data item is in cache, the data access is a cache reuse. Cache reuse may happen
directly when the same data is requested twice, in which case the reuse is called a
temporal cache reuse. Cache reuse may happen indirectly when a fresh data request
hits in cache because the requested data has been brought in by the block transfer of
a formerly requested data item, in which case the reuse is called a spatial cache reuse.
Since large cache blocks are more efficient for contiguous data access and less costly for
cache coherence, the size of cache blocks on modern machines is fairly large, ranging
from 32 bytes to 128 bytes. Large cache blocks make cache spatial reuse extremely
important for good cache utilization.
In the literature, cache spatial reuse is often defined differently in that it includes
the fuzzy property that cache blocks do not unnecessarily conflict with each other to
cause premature eviction from cache. This dissertation uses cache spatial reuse to
denote only the reuse within a cache block; the conflicts among different cache blocks
are referred to as cache interference.
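As a small illustration of these definitions (the loop below is my own example, not one from the thesis), consider summing an array on a machine with 64-byte cache blocks, that is, eight double-precision numbers per block:

    /* Illustration only: assumes 64-byte cache blocks (8 doubles per block). */
    double a[1024];
    double s = 0.0;

    /* First traversal: the miss on a[0] loads a whole block, so the
       accesses to a[1]..a[7] hit in cache -- spatial cache reuse. */
    for (int i = 0; i < 1024; i++)
        s += a[i];

    /* Second traversal: the same data is requested again -- temporal
       cache reuse, provided the 8 KB array still resides in cache and
       has not been evicted by interference. */
    for (int i = 0; i < 1024; i++)
        s += a[i];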
1.1.1 Conflicting Trends of Software and Hardware
Since the advent of microprocessors in the late 1970s, the capacity gap between off-chip
memory bandwidth and on-chip CPU power has been steadily widening. Historical
figures on processor performance and off-chip bandwidth have shown that over the
past twenty years, the average annual increase in CPU power is 55%, but the average
improvement in off-chip data bandwidth is merely 28%.¹ In other words, as CPU power increased by 6,400 times in the past, memory bandwidth increased by no more than 139 times.

¹ Estimation based on the historical figures compiled by Burger et al. [BGK96].
To bridge the memory gap, all modern machines provide high-bandwidth on-chip
data cache in the hope that most memory reads and writebacks can be served by
cache without consuming the valuable memory bandwidth. Although machine cache
has been successful for programs with small data sets and simple access patterns, its
effectiveness has become increasingly problematic because of the following directions
pursued by modern software:
• Large data sets: A major goal in computing is to model the physical world, from
a galaxy to a DNA, from an airplane to a robot, and from molecular dynamics
to electromagnetism. Since we desire as large a scope and as high a precision as
possible, the demand for larger data representations is insatiable.
• Dynamic computation: Most real-world events are non-uniform and evolving,
such as that of a car crash or a drug injection. Consequently, both their computation structure and data representation are irregular and dynamically changing.
Even in simpler cases where data stays the same, the order of data access may
still change radically in different parts of a program. For example, a physical
model can be traversed first top-down and then inside out.
• Modularized programming: To manage the complexity of developing software
systems with sophisticated capabilities, modern software development must
practice modularization along with computation and data abstraction. A computing task is frequently divided into a hierarchy of sub-steps, and a complex
object broken into many sub-components.
Since large programs perform computation in many phases and access data in
many different places, accesses to the same data item are far separated in time, and
these accesses are often non-contiguous with large strides. When the reuse of a data
item is far separated by a large amount of other data access, the value may be evicted
from cache before it is reused, causing unnecessary data transfer from memory. Large-stride accesses, on the other hand, waste cache capacity by causing useless data to be
transferred to cache. Furthermore, low utilization of cache blocks leads to an underutilized cache, effectively reducing its size and causing even more memory transfer.
Moreover, the extensive use of function and data abstraction aggravates the problem
by fragmenting computations and data that could be otherwise cached together.
On parallel machines such as high-end servers and supercomputers, the problem of
excessive memory transfer is as serious as it is on uni-processor machines. In fact, the
bandwidth problem may cause even worse consequences for such machines because
memory bandwidth is shared by a potentially large number of processors and consequently is a more critical resource. A single memory module can become the point of
contention and the bottleneck of the whole parallel system. Recently, cache-coherent
shared-memory multiprocessors have become increasingly popular because of their
ease of programming. On such machines, a cache block is the basis of cache coherence and consequently the unit of inter-processor communication. Therefore, low
cache-block utilization wastes not only memory bandwidth but also network bandwidth.
In summary, the analysis of hardware and software trends has revealed an alarming
tension between the excessive demand of memory transfer and the limited supply
of memory bandwidth. The next section examines the effect of this mismatch on
performance.
1.1.2 Memory Bandwidth Bottleneck
This section quantifies the memory bandwidth constraint by modeling and measuring
the fundamental balance between computation and data transfer.
Balance between Computation and Data Transfer
To understand the supply and demand of memory bandwidth as well as other computer resources, it is necessary to go back to the basis of computing systems, which
is the balance between computation and data transfer. This section first formulates
a performance model based on the concept of balance and then uses the model to
examine the performance bottleneck on current machines.
Both a program and a machine have balance. Program balance is the amount
of data transfer (including both data reads and writes) that the program needs for
each computation operation; machine balance is the amount of data transfer that the
machine provides for each machine operation. Specifically, for a scientific program,
the program balance is the average number of bytes that must be transferred per
floating-point operation (flop) in the program; the machine balance is the number
of bytes the machine can transfer per flop at its peak flop rate. On machines with
multiple levels of cache memory, the balance includes the data transfer between all
adjacent levels.
The table in Figure 1.1 compares program and machine balance. The upper half
of the table lists the balance of six representative scientific applications², including four kernels—convolution, dmxpy, matrix multiply, FFT—and two application
benchmarks—SP from the NAS benchmark suite and Sweep3D from DOE. For example, the first row shows that for each flop, convolution requires transferring 6.4
bytes between the level-one cache (L1) and registers, 5.1 bytes between L1 and the
level-two cache (L2), and 5.2 bytes between L2 and memory. The last row gives the
balance of SGI Origin2000³, which shows that for each flop at its peak performance,
the machine can transfer 4 bytes between registers and cache, 4 bytes between L1
and L2, but merely 0.8 bytes between cache and memory.
As the last column of the table shows, with the exception of mm(-O3), all applications demand a substantially higher rate of memory transfer than that provided by
Origin2000. The demands are between 2.7 and 8.4 bytes per flop, while the supply is
only 0.8 byte per flop. The striking mismatch clearly confirms the fact that memory
bandwidth is a serious performance bottleneck. In fact, memory bandwidth is the
least sufficient resource because its mismatch is much larger than that of register and
cache bandwidth, shown by the second and third columns in Figure 1.1. The next
section will take a closer look at this memory bandwidth bottleneck.

                     Program/machine balance (bytes per flop)
    Programs        L1-Reg     L2-L1     Mem-L2
    convolution       6.4        5.1       5.2
    dmxpy             8.3        8.3       8.4
    mm (-O2)         24.0        8.2       5.9
    mm (-O3)          8.08       0.97      0.04
    FFT               8.3        3.0       2.7
    NAS/SP           10.8        6.4       4.9
    Sweep3D          15.0        9.1       7.8
    Origin2000        4          4         0.8

    Figure 1.1 Comparison between program and machine balance

² Program balances are calculated by measuring the number of flops, register loads/stores and
cache misses/writebacks through hardware counters on SGI Origin2000.
³ The machine balance is calculated by taking the flop rate and register throughput from hardware
specification and measuring memory bandwidth through STREAM [McC95] and cache bandwidth
through CacheBench [ML98].
The reason matrix multiply mm (-O3) requires very little memory transfer is that
at the highest optimization level of -O3, the compiler performs advanced computation blocking, first developed by Carr and Kennedy[CK89]. The dramatic change of
results from -O2 to -O3 is clear evidence that a compiler may significantly reduce the
application’s demand for memory bandwidth; nevertheless, the current compiler is
not effective for all other programs. I will return to compiler issues in a moment and
for the rest of this dissertation.
Memory Bandwidth Bottleneck
The precise ratios of data bandwidth demand to its supply can be calculated by
dividing the program balances by the machine balance of Origin2000. The results
are listed in Figure 1.2. They show the degree of mismatch for each application at
each memory hierarchy level. The last column shows the largest gap: the programs
require 3.4 to 10.5 times as much memory bandwidth as that provided by the machine,
verifying that memory bandwidth is the most limited resource. The data bandwidth
on the other two levels of memory hierarchy is also insufficient by factors between 1.3
to 6.0, but the problem is comparatively less serious.
The insufficient memory bandwidth compels applications into unavoidable low
performance simply because data from memory cannot be delivered fast enough to
keep CPU busy. For example, the Linpack kernel dmxpy has a ratio of 10.5, which
means an average CPU utilization of no more than 1/10.5, or 9.5%. One may argue
that a kernel does not contain enough computation. However, the last two rows show
a grim picture even for large applications: the average CPU utilization can be no
more than 16% for NAS/SP and 10% for Sweep3D. In other words, over 80% of CPU
capacity is left unused because of the memory bandwidth bottleneck.

                     Ratios of demand over supply
    Applications    L1-Reg     L2-L1     Mem-L2
    convolution       1.6        1.3       6.5
    dmxpy             2.1        2.1      10.5
    mmjki (-O2)       6.0        2.1       7.4
    FFT               2.1        0.8       3.4
    NAS/SP            2.7        1.6       6.1
    Sweep3D           3.8        2.3       9.8

    Figure 1.2 Ratios of bandwidth demand to its supply
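The arithmetic behind these ratios is simple enough to state directly. The following C sketch, which is not part of the thesis tool set, recomputes the dmxpy figures from Figures 1.1 and 1.2: the demand-to-supply ratio is the program balance divided by the machine balance, and its reciprocal bounds CPU utilization.

    #include <stdio.h>

    int main(void) {
        /* Numbers taken from Figure 1.1: dmxpy and SGI Origin2000, Mem-L2 level. */
        double program_balance = 8.4;   /* bytes demanded per flop by dmxpy       */
        double machine_balance = 0.8;   /* bytes supplied per flop by the machine */

        double ratio = program_balance / machine_balance;            /* 10.5, as in Figure 1.2 */
        double utilization_bound = machine_balance / program_balance; /* 1/10.5, about 9.5%    */

        printf("demand/supply ratio: %.1f\n", ratio);
        printf("CPU utilization bound: %.1f%%\n", 100.0 * utilization_bound);
        return 0;
    }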
The memory bandwidth bottleneck exists on other machines as well. To fully
utilize a processor of speed comparable to the MIPS R10K on Origin2000, a machine
would need 3.4 to 10.5 times the 300 MB/s memory bandwidth of Origin2000.
Therefore, a machine must have 1.02 GB/s to 3.15 GB/s of memory bandwidth, far
exceeding the capacity of current machines such as those from HP and Intel. As
CPU speed rapidly increases, future systems will have even worse balance and a more
serious bottleneck because of the lack of memory bandwidth.
So far, the balance-based performance model has not considered the effect of the
latency constraint and, in particular, the effect of memory latency. It is possible that
memory access incurs such a high latency that even the limited memory bandwidth is
scarcely used. To verify that this is not the case, an additional study was performed
to measure the actual bandwidth consumption of a group of program kernels and a
full benchmark application, as reported in [DK00]. It found that these applications
consume most of the available memory bandwidth. Therefore, memory bandwidth is
a more limiting factor to performance than is memory latency.
In conclusion, the empirical study has shown that for most applications, machine
memory bandwidth is between one third and one tenth of that needed. As a result,
over 80% of CPU power is left un-utilized by large applications, indicating a significant
performance potential that may be realized if the applications can better utilize the
limited memory bandwidth. The next section introduces the solution developed by
this dissertation: improving effective memory bandwidth through global and dynamic
cache reuse.
1.2 Solution through Cache Reuse
This section starts with the general strategy of cache reuse, illustrates its power in
exploiting global and dynamic cache reuse, demonstrates the necessity for its compiler
automation, and finally presents the overall compiler strategy that systematically
applies this strategy to maximize cache performance.
1.2.1 Two-Step Strategy of Cache Reuse
Cache reuse can be maximized by the following two-step strategy.
• Step 1. fuse all the computation on the same data
• Step 2. group all the data used by the same computation
The first step, computation fusion, groups all the uses of the same data so that
when a data item is loaded into cache, the program performs all computation on that
data before moving it out. The second step, data grouping, gathers all data used by
the same computation so that during the computation, all cache blocks are utilized
to the greatest extent possible. Both temporal and spatial cache reuse are maximized
as a result of these steps.
Both steps have an implicit pre-step of separation before fusion and regrouping. The first step breaks computations into the smallest units before fusion so that
unrelated computations are separated. Similarly, the second step divides data into
the smallest pieces before regrouping so that unrelated data parts are disjoint.
Therefore, the two-step strategy can be viewed as having four steps if the separation
steps are made explicit.
The strategy is a direct solution to the problems caused by far-separated reuse and
large-stride access common in data-intensive programs. The fusion step minimizes
the distance of data reuse, and the grouping step optimizes the stride of data access.
As a result, computation fusion eliminates repeated memory transfer of the same data
while data grouping fully utilizes each memory transfer. Together they minimize the
total number of transferred cache blocks and therefore the total amount of memory
bandwidth consumption.
The two steps of this strategy are inherently related: they are inseparable and
they must proceed in order. The second step depends on the first because without
fusion, data reuses remain far-separated and the repeated data access would miss
in cache regardless of data grouping. On the other hand, the first step should be
followed by the second because without data grouping, the cache and cache blocks
may be polluted with useless data to the extent that only a few percent of cache is
useful, and the effective memory bandwidth can be reduced by an order of magnitude.
Therefore, neither step can work well without the other. This strategy and its benefits
are especially evident when optimizing large and dynamic programs, as described in
the next two sections and validated in the later chapters.
Global Cache Reuse
The strategy of cache reuse can be applied at the global level to improve data reuse
across all program segments and in all data structures. Figure 1.3 illustrates global
cache reuse. The example in (a) is a typical program written by a typical programmer.
It starts with data initialization and then proceeds with several steps of computation.
Although clear and simple logically, the program suffers from far-separated data reuse.
For example, none of the input data is used until all other inputs are processed.
Computation fusion merges the computations on the same data, as shown in
Figure 1.3(b). In the fused function Fused Step 1, each data element is used immediately after its initialization, thus having a minimal reuse distance. Therefore, each
element can now be buffered and reused with a fixed-size cache.
Initialize(...) {
  For i
    initial[i].data1 <- ...
    initial[i].data2 <- ...
  End for
}
Process(...) {
  Step_1(...) {
    For i
      tmp1[i].data1 <- initial[i].data1
      tmp1[i].data2 <- tmp1[i].data1
    End for
  }
  Step_2(...) {
    For i
      tmp2[i].data1 <- initial[i].data2
    End for
  }
  ...
}

(a) Original program

Fused_Step_1(...) {
  For i
    initial[i].data1 <- ...
    tmp1[i].data1 <- initial[i].data1
    tmp1[i].data2 <- tmp1[i].data1
  End for
}
Fused_Step_2(...) {
  For i
    initial[i].data2 <- ...
    tmp2[i].data1 <- initial[i].data2
  End for
}
...

(b) Computation fusion

Fused_Step_1(...) {
  For i
    Data_Group_1[i].data1 <- ...
    Data_Group_1[i].data2 <- Data_Group_1[i].data1
    Data_Group_1[i].data3 <- Data_Group_1[i].data2
  End for
}
Fused_Step_2(...) {
  For i
    Data_Group_2[i].data1 <- ...
    Data_Group_2[i].data2 <- Data_Group_2[i].data1
  End for
}
...

(c) Data grouping
Figure 1.3 Example of global cache reuse
The fused program is not perfect because it makes scattered data access to different
arrays. The second step, data grouping, gathers data used by the same computation
into the same data array, as shown in Figure 1.3(c). After data grouping, not only are
related data elements used together, they are also located together in physical memory. In
combination, the fusion shortens temporal reuse between global computations, and
the grouping increases spatial reuse among global data.
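The following C sketch is a hypothetical rendering of the transformation in Figure 1.3; the array and field names follow the figure rather than any real application, and the initialization values are placeholders.

    #define N 1000

    /* (a) Original: three separate passes; the reuse of initial[i] in the
       later steps is separated by a full sweep over all other elements. */
    double init1[N], init2[N], tmp1a[N], tmp1b[N], tmp2[N];

    void original(void) {
        for (int i = 0; i < N; i++) { init1[i] = 1.0; init2[i] = 2.0; }           /* Initialize */
        for (int i = 0; i < N; i++) { tmp1a[i] = init1[i]; tmp1b[i] = tmp1a[i]; } /* Step_1 */
        for (int i = 0; i < N; i++) { tmp2[i] = init2[i]; }                        /* Step_2 */
    }

    /* (b)+(c) Fused and regrouped: each element is produced and consumed in the
       same iteration, and the fields used together are packed into one structure
       so that they share cache blocks. */
    struct group1 { double data1, data2, data3; };
    struct group2 { double data1, data2; };
    struct group1 g1[N];
    struct group2 g2[N];

    void fused_and_grouped(void) {
        for (int i = 0; i < N; i++) {      /* Fused_Step_1 */
            g1[i].data1 = 1.0;
            g1[i].data2 = g1[i].data1;
            g1[i].data3 = g1[i].data2;
        }
        for (int i = 0; i < N; i++) {      /* Fused_Step_2 */
            g2[i].data1 = 2.0;
            g2[i].data2 = g2[i].data1;
        }
    }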
As shown by the example program, computation fusion and data grouping promise
significant global benefit but also impose drastic changes to the whole program.
Unlike localized techniques, a global transformation may move a piece of computation or data far away from its original place. New challenges immediately arise on
maintaining correctness and estimating profitability. Interestingly, computation and
data transformations follow different restrictions and cause different concerns. They
raise different sets of questions.
Computation fusion is limited by data dependence. Given the widespread and
complex dependences in real programs, how much fusion can a program have, or
equivalently, how close can the uses of the same data be? Starting from that, how
much can be achieved by a source-level transformation through a compiler? Since
computation fusion may produce loops of a huge size, what is the overhead of fusion
and how to eliminate or reduce that overhead? Chapter 2 will study computation
fusion and address these challenging questions.
Unlike computation fusion, data grouping is not constrained by correctness because it does not violate any data dependence as long as a single storage is maintained
for each program data. However, while fusion has no side effect on the unaltered program parts, data grouping uniformly affects every program segment that accesses the
transformed data. In particular, data grouping in one place may not be beneficial
for another place and may in fact be detrimental to overall performance. Therefore,
the crucial problem of data grouping is evaluating its profitability: how to address
the conflicting requirements of different program segments, and ultimately, how to
find an optimal data layout for the whole program? Chapter 3 will study solutions
to these problems.
Dynamic Cache Reuse
A large class of applications is dynamic, where some data structures and their access
pattern remain unknown until run time and may change during the computation. An
example is a car-crash simulation where the shape of the car remains unknown until
the simulation starts, and the shape may change radically during the simulation.
To optimize a dynamic application, the strategy of cache reuse must be applied
at run time after the computation and its data access are determined. Figure 1.4
illustrates dynamic data grouping. The example computation sequence traverses
random elements of array f. The stride of access is large and varied. Data grouping
first records the random data access and then gathers simultaneously used data into
contiguous memory locations. If the data is accessed in the same or similar order
multiple times, the overhead of grouping can be amortized effectively. With the
transformed array shown in Figure 1.4, the dynamic access becomes more contiguous
and obtains a better utilization of cache.
Figure 1.4 Example of dynamic cache reuse
Because of the unpredictable and dynamic nature of the computation and data,
both analysis and transformation have to be performed at run time and probably be
performed multiple times. Questions immediately arise on the feasibility, legality and
profitability of such transformations. How to insert run-time analysis and code generation? What methods are cost-effective at run time? How to ensure their correctness,
especially in the presence of repeated data layout changes? How much overhead do
they incur, and can it be reduced through additional compiler optimizations? These
questions will be addressed in Chapter 4.
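To make the idea concrete, here is a minimal C sketch of run-time packing in the spirit of Figure 1.4 and the consecutive packing studied in Chapter 4; the names (f, id, num_nodes) are illustrative rather than taken from the thesis benchmarks, and error handling is omitted.

    #include <stdlib.h>
    #include <string.h>

    /* Relocate the elements of f into the order in which the index stream id
       first touches them, and remap id so later traversals use the new layout. */
    void pack(double *f, int num_nodes, int *id, int n) {
        int *new_pos = malloc(num_nodes * sizeof(int));      /* old index -> new index */
        double *packed = malloc(num_nodes * sizeof(double));
        for (int i = 0; i < num_nodes; i++) new_pos[i] = -1;

        int next = 0;
        for (int i = 0; i < n; i++) {            /* first-touch order of the access stream */
            int old = id[i];
            if (new_pos[old] == -1) {            /* first time this element is referenced */
                new_pos[old] = next;
                packed[next] = f[old];           /* place it next to its neighbors in time */
                next++;
            }
            id[i] = new_pos[old];                /* remap the index to the new location */
        }
        for (int i = 0; i < num_nodes; i++)      /* elements never referenced go at the end */
            if (new_pos[i] == -1) { packed[next] = f[i]; new_pos[i] = next; next++; }

        memcpy(f, packed, num_nodes * sizeof(double));
        free(packed);
        free(new_pos);
    }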
1.2.2 The Need for Compiler Automation
Applying the strategy of cache reuse leads to radical program changes: computation
fusion rewrites the whole program structure, and data grouping re-shuffles the entire
data layout. In general, a program transformation may be carried out through three
different agents: programmers, compilers, or hardware/operating systems. However,
the global scope and extensive scale of computation fusion and data grouping suggest
that an automatic compiler is the most viable approach. To demonstrate, Figure 1.5
lists the characteristics of all three options.
    approaches           advantages                 disadvantages
    hardware or          √ precise run-time         × very limited scope
    operating systems      information              × run-time overhead of analysis
                                                      and transformation
    programmers          √ domain knowledge         × loss of function and data abstraction
                                                    × inter-dependence between function
                                                      and data
    compilers            √ global scope             × imprecise program and
                         √ off-line analysis          machine information
                           and transformation
Figure 1.5 Comparison among hardware/OS, programmers and compilers
Hardware and operating systems have precise knowledge of the operations being executed and the data being accessed. However, they cannot anticipate the future:
they can foresee at most a limited number of instructions down the executing path.
Furthermore, because of the run-time overhead, they cannot afford extensive analysis
and large-scale transformation, both of which are necessary for computation fusion
and data grouping.
Programmers have domain knowledge of their applications. But manual computation fusion and data grouping render program abstraction and modularization
impossible. Indeed, various functions must be mixed together if they access the
same data; similarly, different data structures must be merged if they belong to the
same computation. Furthermore, data layout now depends on computation structure.
Whenever a memory access is added or deleted, the entire data layout may have to be
reorganized. Therefore, if software development is to be scalable and maintainable,
manual fusion and grouping should be mostly avoided.
Among all three approaches, only a compiler can afford the global scope and the
extensive scale of computation fusion and data grouping. Given a source program,
a compiler can analyze and transform the structure of both global computation and
global data. The analysis and transformation are off-line without incurring any run-time overhead. A compiler, however, has its limitations. Its source-level analysis
may not always be accurate, and it cannot quantify the machine-dependent effect of a
transformation. Despite its limitations, a compiler is currently the only viable choice
to apply the strategy of cache reuse. If it succeeds, the benefit is enormous. The next
section outlines such a compiler.
1.2.3 A Unified Compiler Strategy
This section presents a unified compiler strategy that maximizes memory hierarchy
performance. It has four phases, as shown in Figure 1.6. The first two phases minimize
overall memory transfer by maximizing cache reuse. The third phase schedules the
remaining memory and cache access to tolerate its latency. The last phase engages
user’s help in identifying additional optimization opportunities that have been missed
by automatic methods. The last column of Figure 1.6 lists the suitable techniques.
Those developed by this dissertation are marked with a ⋆.
    main phases          sub-steps                   suitable techniques
                                                     (⋆ developed by this research)
    temporal reuse       global (multi-loop)         ⋆ maximal loop fusion
    in cache and         local (single loop)         unroll-and-jam, loop blocking,
    registers                                        register allocation
                         dynamic                     ⋆ locality grouping,
                                                     space partitioning, curve ordering
    cache-block          inter-array spatial reuse   ⋆ inter-array data regrouping
    reuse and            intra-array spatial reuse   memory-order loop permutation,
    cache                                            array reshaping, combined schemes
    utilization          dynamic spatial reuse       ⋆ dynamic data packing
                         cache non-interference      array padding, array copying,
                                                     cache-conscious placement
    latency              local (single loop)         data prefetching,
    tolerance                                        instruction scheduling
    user tuning          global (whole-program)      ⋆ model of machine & program balance
                                                     ⋆ bandwidth-based performance tool

    Figure 1.6 The overall compiler strategy for maximizing memory performance
The first phase converts data reuse into cache and register reuse. The primary
method is computation fusion, which is first carried out at the global level across
multiple loops, then at the local level within a single loop nest, and finally at run
time for dynamic applications.
On an ideal machine with unit-size cache blocks, the first phase is sufficient for
minimizing memory transfer. On a real machine, however, the second phase is needed
to fully utilize non-unit cache blocks as well as memory pages. The first step of this
phase exploits spatial reuse among global arrays. The succeeding steps improve spatial
reuse within a single array both statically for regular programs and dynamically for
dynamic applications. Finally, the last step adjusts the placement of large arrays to
avoid the remaining cache interference.
After minimizing the amount of memory access by the first two phases, the third
phase schedules the expensive memory and cache accesses so that their latency can be
hidden as much as possible. The scheduling includes source-level data prefetching for
high-latency memory access and assembly-level instruction scheduling for low-latency
cache access. It should be noted that although latency tolerance is important, it does
not help in ameliorating the memory bandwidth bottleneck as the previous phases
do. In fact, data prefetching exacerbates the memory bandwidth problem because it
causes additional memory transfer.
Compiler transformations, however, may still miss optimization opportunities or
make imperfect transformations. When this happens, user tuning is necessary to
achieve top performance. The last phase provides effective and efficient user tuning
through a bandwidth-based performance tool. The tool can also provide accurate
compile-time performance prediction, which is crucial for subsequent parallelization
and run-time scheduling.
The global and dynamic techniques developed by this work play a vital role in
the overall compiler strategy. The later chapters will describe these techniques and
demonstrate their importance. The next section discusses existing local techniques
and their limitations, as well as previous attempts at global and dynamic optimizations.
1.3 Related Work
This section surveys the techniques related to the overall compiler strategy, especially
the previous work on global and dynamic transformations. Their limitations are first
discussed individually and then summarized in the last section from three aspects:
narrower purpose, lack of integrated transformation, and lack of compiler automation.
1.3.1 Complementary Techniques
Loop blocking and data prefetching are two widely used optimizations for memory
hierarchy. They complement but cannot achieve the effect of global computation
fusion and data grouping.
Loop Blocking
Loop blocking is a transformation that groups computations on sub-blocks of data
that are small enough to fit in registers or in cache. A comprehensive study of blocking techniques can be found in Carr’s dissertation[Car92]. The recent developments
include the work by Kodukula et al[KAP97] and by Song and Li[SL99]. Since the
new studies can implicitly optimize beyond a single loop nest, they will be discussed
in the next section with the explicit work on loop fusion.
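For reference, a textbook form of loop blocking (shown here as a generic sketch, not the specific algorithms cited above) tiles matrix multiply so that the sub-blocks touched by the inner loops fit in cache and are reused before eviction; the tile size T is a tunable assumption.

    /* c = c + a * b, blocked so that T-by-T tiles of a, b and c are reused
       from cache.  T is chosen so that three tiles fit in cache together. */
    #define T 64
    for (int ii = 0; ii < n; ii += T)
        for (int kk = 0; kk < n; kk += T)
            for (int jj = 0; jj < n; jj += T)
                for (int i = ii; i < ii + T && i < n; i++)
                    for (int k = kk; k < kk + T && k < n; k++)
                        for (int j = jj; j < jj + T && j < n; j++)
                            c[i][j] += a[i][k] * b[k][j];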
The primary limitation of loop blocking is its local scope: blocking is applied
only to a single loop nest at a time. Consequently, it cannot exploit data reuse
among disjoint loops. To overcome this limitation, we have to fuse multiple loops
and determine how to interleave their iterations. This is precisely the process of loop
fusion, which is discussed in the next section. Another limitation of blocking is that it
cannot block computations and data that are unknown at compile time. Section 1.3.2
discusses related dynamic transformations.
Data Prefetching
Data prefetching is another widely studied technique. Unlike loop blocking or loop
fusion, the goal of data prefetching is to tolerate or hide memory latency rather
than to eliminate the memory access. Data prefetching identifies memory references
that are cache misses and then dispatches them early enough in execution so that
their latency can be overlapped with useful computation. Porterfield first developed
software prefetching[Por89]. Mowry designed and evaluated a complete algorithm
that later gained wide acceptance[Mow94].
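As a hand-written illustration of the idea (this uses the GCC __builtin_prefetch intrinsic rather than the compiler algorithms cited above; the prefetch distance of 16 iterations is an arbitrary example value):

    /* Request a[i + DIST] while computing on a[i], so the miss latency of the
       later element overlaps with the current work. */
    #define DIST 16
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + DIST], 0, 1);   /* 0 = read, 1 = low temporal locality */
        sum += a[i] * b[i];
    }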
Data prefetching, however, cannot hide memory latency imposed by the memory
bandwidth bottleneck. Indeed, data prefetching does not reduce a single byte of
memory transfer. On the contrary, it incurs additional memory transfer because it
may prefetch the wrong data or prefetch too early or too late. Since actual memory
latency is the reciprocal of the consumed memory bandwidth, data prefetching cannot
completely hide memory latency unless the memory bandwidth bottleneck has been
alleviated by other optimizations.
1.3.2 Global and Dynamic Optimizations
This section discusses previous work on global loop fusion, global data placement and
dynamic optimizations.
Global Loop Fusion
Many researchers have studied loop fusion. Allen and Cocke first published the
transformation[AC72]. The first significant role of fusion is to improve data reuse in
a virtual memory system, studied by Abu-Sufah et al[ASKL81]. Wolfe gave a simple
test for the legality of fusion[Wol82]. Two loops cannot be fused if they have fusion-preventing dependences, which are those forward dependences that are reversed after
loop fusion. In the same work, Wolfe demonstrated through a few examples how loop
fusion improves register reuse and reduces data storage on vector machines.
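A hypothetical two-loop example (the construct is standard; it is not taken from Wolfe's paper) shows what a fusion-preventing dependence looks like: the second loop reads a[i+1], which the first loop writes in a later iteration, so naive fusion reverses the direction of the dependence.

    /* Before fusion: every read of a[i+1] in the second loop sees the value
       written by the first loop. */
    for (int i = 0; i < n - 1; i++) a[i] = b[i] + 1.0;
    for (int i = 0; i < n - 1; i++) c[i] = a[i + 1] * 2.0;

    /* Naively fused: in iteration i, a[i+1] has not been written yet, so the
       forward dependence is reversed -- a fusion-preventing dependence.
       Fusion is illegal here without an enabling transformation such as
       alignment or peeling. */
    for (int i = 0; i < n - 1; i++) {
        a[i] = b[i] + 1.0;
        c[i] = a[i + 1] * 2.0;   /* reads the old, not-yet-updated value */
    }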
The first implementation of fusion in a compiler is by Allen[All83], who used loop
fusion to improve register reuse in a legendary compiler that was later adopted by all
vector supercomputers[AK87]. In its implementation, Allen required that fusible
loops must have the same lower bound, upper bound and increment, no fusion-preventing dependence, and no true dependence on any intervening statements. Since
the improvement by fusion is not as large as by other transformations such as loop
interchange, Allen used fusion as a “cleanup” operation.
Loop fusion later took a prominent role in the work of Callahan, who used it
to detect and construct coarse-grain parallelism[Cal87]. He gave a greedy fusion
algorithm that runs in linear time to the number of loops and produces the minimal
number of fused loops. The restriction for correctness is the same as in earlier studies,
and the criterion for profitability is parallelism rather than cache reuse. So Callahan’s
method may fuse loops of no data sharing.
To enable more loop fusion, Porterfield introduced a transformation called peel-and-jam, which can fuse loops with fusion-preventing dependences by peeling off some
iterations of the first loop and then applying fusion on the remaining parts[Por89].
While Porterfield considered only a pair of loops, Manjikian and Abdelrahman later
extended peel-and-jam to find the minimal peeling factor for a group of fusible
loops[MA97]. They evaluated their fusion scheme for parallel programs. Also enabled
by peel-and-jam, Song and Li developed a new tiling method that blocks multiple
loops within a time-step loop with the goal of improving cache reuse[SL99]. However,
these methods are not a complete global strategy because they did not address the
cases where not all loops in a program are fusible. In addition, peel-and-jam is a limited form of loop alignment because it can only shift the first loop up (or the second
loop down), but not the reverse. So it does not always minimize the distance of data
reuse in fused loops. Finally, peel-and-jam cannot fuse loops that have intervening
statements that use the same data.
To find a solution for global loop fusion, a graph-partitioning formulation was studied independently both by Gao et al.[GOST92] and by Kennedy and McKinley[KM93].
Both their aims were to improve temporal reuse in registers, and they modeled the
benefit of register reuse as weighted edges between a pair of loops. The goal was
to partition all loops into legal fusible groups so that the inter-group edge weight
(unrealized data reuse) is minimal. Kennedy and McKinley proved that the general
fusion problem is NP-Complete. Both approaches used the heuristic that recursively
applies a min-cut algorithm to bi-partition the graph. Both avoided fusing loops with
fusion-preventing dependences. However, a weighted edge between two loops does
not correctly model data sharing. Therefore, the partitioning method on normal
graphs does not minimize the bandwidth consumption of the whole program. In another study of loop fusion, Darte considered the added complexity of loop shifting
and proved that even loop fusion for single types (e.g. parallel loops) is strongly NP-complete in the presence of loop shifting[Dar99]. Recently, Kennedy developed a fast
algorithm that always fuses along the heaviest edge[Ken99]. His algorithm allows accurate modeling of data sharing as well as the use of fusion enabling transformations.
But none of these algorithms has been implemented or evaluated.
The first implementation for general fusion and its evaluation on non-trivial programs were accomplished by McKinley et al[MCT96]. They fused only loops with an
equal number of iterations and with no fusion-preventing dependences. As a result,
only 80 out of 1400, or 6% of tested loops were fused. The effect on full applications was mixed: fusion improved the hit rate for four out of 35 programs by 0.24%
to 0.95%, but it also degraded the performance of three other programs. Singhai and
McKinley improved the fusion heuristic by considering the register pressure and by
approximating graph partitioning with optimal tree partitioning[hSM97]. Since they
fused only loops with no fusion-preventing dependences, the improvement to whole-program performance is modest except for two programs running on DEC Alpha. The
potential of global data reuse is much larger, as demonstrated by a simulation study
by McKinley and Temam[MT96]. They found that the majority of program misses are
inter-loop temporal reuses. Therefore, the important question remains open on the
potential of global fusion, especially when aggressive fusion-enabling transformations
are used.
To enable more aggressive loop fusion, some researchers have taken a radically different approach. Instead of blocking loops, Kodukula et al. tiled data and “shackled”
computations on each data tile[KAP97]. Similarly, Pugh and Rosser sliced computations on each data element or data block[PR99]. Although effective for blocking
single loops, data-oriented approaches are not yet practical as a global strategy for
three reasons. First, without regular loop structures, it is not clear how to formulate and direct a global transformation. The shape of the transformed program is
highly dependent on the choice of not only the shackled or sliced data but also of its
starting loop. Furthermore, to maintain correctness, these methods need to compute
all-to-all transitive dependences, whose complexity is cubic in the number of memory
references in a program. Even when the dependence information is available, it is
still not clear how to derive the best partitioning and ordering of the computations
on different data elements, especially in the face of a large amount of unstructured
computation. Finally, it is not clear how data-oriented transformations interact with
traditional loop-based transformations, and how the side effect of fusion can be tackled. Kodukula et al. did not apply their work beyond a single loop nest[KAP97]. Pugh
and Rosser tested Swim and Tomcatv and found mixed results. On SGI Octane, the
first program was improved by 10% but the second “interacted poorly with the SGI
compiler”[PR99].
The previous work on loop fusion did not combine it with data transformations, with one exception: Manjikian and Abdelrahman applied padding to reduce cache conflicts[MA97]. Array padding at such a large data granularity is not a direct solution to poor cache utilization and has several important shortcomings compared to fine-grain data optimization, as discussed in the next section.
Global Data Placement
Once computation is optimized, data layout still needs careful arrangement because
it affects the utilization within cache blocks and the interference among cache blocks.
Thabit studied the packing of scalars into cache blocks[Tha81]. He proved that
finding the optimal packing for non-unit cache blocks is NP-complete.
The primary method for exploiting spatial reuse in arrays is to make data access contiguous. Instead of rearranging data, the early studies reordered loops so
that the innermost loop traverses data contiguously within each array. Various loop
permutation schemes were studied for perfect loop nests or loops that can be made
perfect, including those by Abu-Sufah et al.[ASKL81], Gannon et al.[GJG88], Wolf
and Lam[WL91], and Ferrante et al.[FST91]. McKinley et al. developed an effective heuristic that permutes loops into memory order for both perfectly and imperfectly nested loops[MCT96]. Loop reordering, however, cannot always achieve contiguous data traversal because of data dependences. This observation led Cierniak and Li to combine data transformation with loop reordering[CL95], a technique that was subsequently expanded by Kandemir et al.[KCRB98]. Regardless of the form of transformation, all these techniques are limited by their goal, which is to improve data reuse
within a single array, or intra-array spatial reuse.
Data reuse within a single array is not adequate because not all data access to
the same array can be made contiguous. One example is a dynamic application,
where the data access within the same array is unpredictable at compile time, making
it impossible to obtain contiguous memory access. Another example is a regular
application, where the computation traverses high-dimensional data in different
directions. Again, data access to a single array cannot always be made contiguous.
Data reuse among multiple arrays presents a promising alternative when data access cannot be made contiguous. By combining multiple arrays and increasing the granularity of data access, the portion of useful data in each cache block can be significantly increased. In fact, for large programs with many data arrays, inter-array reuse may fully utilize cache blocks without the need for contiguous data access. Inter-array data transformations, however, have not been attempted except for the work by Eggers and Jeremiassen[JE95]. They grouped all arrays accessed by a parallel thread to reduce false sharing among parallel processors. However, blindly grouping local data pollutes the cache and its cache blocks with useless data because not all local data objects are used at once. Besides the work on arrays, many researchers studied data placement optimizations for cache spatial reuse among pointer-based data
structures. Seidl and Zorn clustered frequently referenced objects[SZ98], and Calder
et al. reordered objects based on their temporal relations[CCJA98]. Chilimbi et
al. clustered frequently used attributes within each object class[CDL99]. The basic
approach shared by these methods is to place frequently used or closely referenced objects in nearby memory locations. However, the fact that two objects are frequently accessed, or are accessed together at some point, does not mean that they are always accessed simultaneously. Hence, these methods may place useless data into cache blocks and therefore degrade actual performance. In a large program where different data structures are used at different times, greedy grouping can seriously degrade cache-block reuse rather than improve it. Furthermore, these methods are static and
therefore cannot fully optimize dynamic programs whose data access pattern changes
during execution. For example, in a sparse-matrix code, the matrix may be iterated
first by rows and then by columns. In scientific simulations, the computation order
changes as the physical model evolves. In these cases, a fixed static data layout is
not likely to perform well throughout the computation.
In addition to the reuse within the same cache block, attention needs to be paid to
the interference among multiple cache blocks. A program can rearrange the location
of whole arrays or array fragments in two ways: make them either well separated
by padding, studied by Bailey[Bai92], or fully contiguous by copying, first used by
Lam et al.[LRW91]. Reducing cache interference, however, is not as direct and effective an approach as improving cache-block reuse. The best way to eliminate cache interference is to place simultaneously used data into the same cache block, not to arrange the data across multiple cache blocks. The large granularity used by packing precludes data reordering within a data object and across multiple data objects. Furthermore, padding cannot be applied to arrays of unknown size or to machines with different cache parameters. It can reduce only cache interference but not
the page-table working set. Moreover, both padding and copying carry a run-time
cost, especially copying. Therefore, a compiler should first organize data within the
same cache block and then use techniques such as data padding and copying to reduce
cache interference if necessary.
Kremer developed a general formulation for finding an optimal static or dynamic data layout for a program, at the expense of the problem being NP-hard[Kre95]. He also showed that it is practical to use integer programming to find an optimal solution for typical programs. However, Kremer's formulation requires an estimate of the overhead and the benefit of a data transformation, which is not readily available to a compiler. He and others demonstrated that run-time communication and computation performance could be approximated through
the use of training sets[BFKK91]. However, it is yet to be seen how well memory
hierarchy performance can be predicted.
Dynamic Transformations
Researchers have long been studying dynamic applications such as molecular simulations. The best-known scheme is called inspector-executor, pioneered by Saltz and his
colleagues[DUSH94]. At run time, the inspector analyzes the computation and produces an efficient parallelization scheme. Then the executor carries out the parallel
execution.
Various specific schemes were also developed for optimizing cache performance.
Saltz's group extended the inspector-executor model and used a reverse Cuthill-McKee
ordering to improve locality in a multi-grid computation[DMS+92]. Another method,
domain partitioning, has been used to block computation for cache by Tomko and
Abraham[TA94]. Al-Furaih and Ranka examined graph-based clustering of irregular
data for cache[AFR98]. Mellor-Crummey et al. employed space-filling curve ordering
to block N-body type computations for multi-level memory hierarchy[MCWK99]. The
above methods are powerful, but they incur a cost that is more than linear in the number of data objects. Such a cost becomes significant on large data sets and may not be cost-effective for run-time readjustment. In addition, these transformations rely on user knowledge, for example, the knowledge that the computation consists of interactions between nearby particles in a physical domain or between neighboring nodes in an irregular graph.
Han and Tseng used a general scheme of grouping parallel computations accessing
the same data object onto the same processor[HT98]. Although their transformation
can be done in linear time and may be cost-effective for cache, they did not extend it
to optimize cache performance. Mitchell et al. studied single non-affine memory references and used a more powerful partitioning method, which sorts irregular data accesses into "buckets"[MCF99]. They discussed methods for automatically detecting opportunities for their optimization, but they did not show how an automatic compiler can preserve correctness.
A common limitation shared by all previous run-time techniques is the lack of
general-purpose compiler automation. They targeted either a specific application domain or a very simplified computation model. The insufficient automation support
limits the types of programs that can be handled and the optimizations that can be used. As a result, large dynamic applications have had to be transformed partially or wholly by hand. Since both the order of computation and the layout of data may be reorganized multiple times at run time, the code transformation process is extremely labor-intensive and error-prone. Even if a hand-optimized version is possible, it will be
very difficult to maintain when new functions are added. Moreover, switching among
and experimenting with different optimization schemes are even harder. Therefore,
if run-time optimizations are to be practical and prevalent, they must be sufficiently
automated.
1.3.3
Performance Model and Tool for Memory Hierarchy
Callahan et al. first used the concept of balance to model whether register throughput can keep up with the CPU speed in scientific applications[CCK88]. However, they did not consider other levels of the memory hierarchy.
In the past, monitoring memory hierarchy performance has had to rely on machine simulators to gauge the exposed memory latency. Callahan et al. first used a compiler-based approach to analyze and visualize memory hierarchy performance with a memory simulator[CKP90]. Goldberg and Hennessy measured memory stall time by comparing the actual running time with the simulation result of running the same program on a perfect memory[GH93]. Simulators, however, are inconvenient in practice because they are much slower than actual execution, and they are architecture-dependent. In addition, simulation-based approaches cannot be used to predict memory hierarchy performance because they have to run the program before collecting its performance data.
Static or semi-static methods can be used to approximate run-time behavior and
thus predict program performance. Bala et al. used training sets, which construct a
database for the cost of various communication operations, to model communication
performance in data-parallel programs [BFKK91]. They did not consider cache performance, although the same idea applies to cache. In another work, Clements and Quinn predicted cache performance by multiplying the number of cache misses by the memory latency[CQ93]. Their method is no longer accurate on modern machines, where memory transfers proceed in parallel with each other as well as with CPU computations. Moreover, they did not extend their work to support performance
tuning.
Recently, researchers began to use bandwidth to measure machine memory performance. Examples are the STREAM benchmark by McCalpin [McC95] and CacheBench
by Mucci and London [ML98]. However, neither of them explored the possibility of
full-program tuning and performance prediction.
1.3.4
Summary of Limitations
No previous work has taken the goal of minimizing the total amount of data transfer
between memory and CPU, nor has anyone explored the compiler strategy of global
and dynamic computation fusion and data grouping. As a result, previous work shares
the following three limitations:
Narrower purpose Previous techniques were not designed to solve the memory
bandwidth problem, where single-loop based cache reuse is inadequate and latency
tolerance is of no help. Therefore, global and dynamic cache reuse is the only software
alternative to alleviate the memory bandwidth bottleneck. Failing to recognize this,
existing techniques either do not address global and dynamic optimization or are not
aimed at improving cache reuse. The former includes loop blocking; the latter, loop
fusion and dynamic parallelization. Furthermore, none of the previous studies addressed the problem of minimizing memory writebacks because they focused only on the latency of memory reads, not the bandwidth consumption of all memory accesses.
Lack of integrated transformation No previous work has successfully developed aggressive forms of computation fusion and data grouping, in part because
these two steps have not been studied as a combined strategy. Without global data
grouping, computation fusion may lead to extremely low cache utilization because the
fused loop accesses too many dispersed data items. On the other hand, without aggressive fusion, data grouping may find little opportunity for combining global data
structures since their accesses are separated in different parts of a program.
Lack of automation Because of the focus on memory latency, previous techniques are burdened with improving each memory reference individually, while neglecting the final goal of overall cache reuse across long computations on large data structures. These limitations lead to a preference for programmer-supplied or domain-specific transformations over compiler automation because of the possible compiler
overhead. Unfortunately, manual or semi-manual techniques not only cannot master
the scope and the scale of global computation fusion and data grouping, but they also
lead to programming styles that are not maintainable and not portable. Furthermore,
the focus on the latency of individual memory access makes performance modeling
and debugging impractical, leading to ineffective user assistance for monitoring and
tuning memory hierarchy performance.
1.4
Overview
To overcome the limitations of previous work, this dissertation has developed a new
set of techniques that unleash the power of global and dynamic cache reuse. Chapter 2
describes global computation fusion, which exploits cache temporal reuse for the whole
program. Chapter 3 presents inter-array data regrouping, which maximizes spatial
reuse for the entire data. The dynamic transformations are described in Chapter 4,
which include locality grouping for computation fusion and dynamic packing for data
grouping. Chapter 5 complements these automatic techniques with a performance
tool. The implementation and evaluation of all these techniques are described in
Chapter 6. Finally, Chapter 7 summarizes the techniques developed and outlines their possible extensions.
“as long as there were no machines, programming was no problem at all; when
we had a few weak computers, programming became a mild problem, and
now that we have gigantic computers, programming has become an equally
gigantic problem.” – Edsger W. Dijkstra, 1972
Chapter 2
Global Computation Fusion
“I hate quotations. Tell me what you know.” – Ralph Waldo Emerson (1803-1882)
2.1
Introduction
As the first step to address the bandwidth limitation, this chapter explores the potential of global fusion in improving cache reuse over whole programs. The chapter
investigates the following three problems. How is data reused in real programs? How
beneficial is global fusion? And how much benefit can be realized by automatic
transformations?
The chapter first defines reuse distance, a concept which precisely measures data
reuse in a program before and after transformations. The chapter then studies two
fusion transformations—one at the machine level and one at the program level. The
machine-level model, reuse-driven execution, examines the potential of global fusion
on an ideal machine, which always executes next the instructions that carry data
reuse. More important is the source-level transformation, maximal loop fusion, which
realizes the benefit of fusion on real machines. The main part of the chapter describes
the algorithm of maximal loop fusion and shows that the new algorithm fuses loops
whenever possible and achieves bounded reuse distance within a fused loop.
Although maximal fusion is the most aggressive in fusing global computations, it
is not optimal. It does not minimize reuse distance within the fused loop, nor does
it minimize the amount of data sharing among fused loops. The chapter formulates
these problems and examines their complexity.
By bringing together all uses of the same data, global computation fusion shortens
its live range. The localized data usage allows for aggressive storage transformations.
The last part of the chapter describes two: storage reduction reduces the size of arrays,
and store elimination removes memory writebacks to arrays.
2.2
Analysis of Data Reuse
This section first defines the concept of reuse distance and then explores the potential
for minimizing reuse distances through reuse-driven execution.
2.2.1
Reuse Distance
In a sequential execution, the reuse distance of a data reference is the number of distinct data items that appear between this reference and the closest previous reference to the same data. The example in Figure 2.1(a) shows four data reuses and
their reuse distance. On a perfect cache (fully associative with LRU replacement), a
data reuse hits in cache if and only if its reuse distance is smaller than the cache size.
(a) Example access sequence: a b c a a c b. Its four reuses (the second a, the third a, the second c, and the second b) have reuse distances 2, 0, 1, and 2.
(b) Transformed access sequence: a a a b b c c. All reuse distances are zero.
Figure 2.1 Example reuse distances
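The distances above can be computed mechanically from an access trace. The following is a minimal Python sketch of that computation (not the instrumentation used in this thesis); it reproduces the distances of Figure 2.1 and the perfect-cache hit test.

    def reuse_distances(trace):
        # reuse distance = number of distinct data items referenced between an
        # access and the previous access to the same item
        last_pos = {}            # data item -> index of its previous access
        distances = []           # one entry per reuse (first accesses are skipped)
        for i, x in enumerate(trace):
            if x in last_pos:
                distances.append(len(set(trace[last_pos[x] + 1 : i])))
            last_pos[x] = i
        return distances

    def hits_in_perfect_cache(distance, cache_size):
        # on a fully associative LRU cache, a reuse hits iff its distance is
        # smaller than the cache size
        return distance < cache_size

    print(reuse_distances(list("abcaacb")))   # [2, 0, 1, 2], as in Figure 2.1(a)
    print(reuse_distances(list("aaabbcc")))   # [0, 0, 0, 0], as in Figure 2.1(b)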
To avoid cache misses due to long reuse distances, a program can fuse computations on the same data. Figure 2.1(b) shows the computation sequence after fusion,
where all reuse distances are reduced to zero. In general, the problem of finding the minimal reuse distance can be reduced from the problem of weighted k-way cut; Section 2.4.1 studies this problem and demonstrates that a polynomial-time solution is unlikely because even the problem of unweighted 3-way cut is NP-complete. The next section studies the use of heuristic-based fusion on real programs.
2.2.2
Reuse-Driven Execution
This section presents and evaluates reuse-driven execution, a machine-level strategy
which fuses run-time instructions accessing the same data. In a sense, it is the inverse
of the Belady policy. While Belady evicts the data whose next reuse is furthest away, reuse-driven execution executes the instruction whose next reuse is closest. The insight gained in
this study will provide the motivation for the source-level transformation presented
in the next section.
Given a program, its reuse-driven execution is constructed as follows. First, the
source program is instrumented to collect the run-time trace of all source-level instructions as well as all their data access. The trace is re-run on an ideal parallel
machine where an instruction is executed as soon as all its operands have been computed. The trace of an ideal execution gives the ordering of instructions and their
minimal time difference. Finally, reuse-driven execution is carried out by the algorithm given in Figure 2.2. It is reuse-driven because it gives priority of execution to
later instructions that reuse the data of the current instruction. It employs a FIFO
queue to sequentialize the execution of instructions.
The effect of reuse-driven execution is shown in Figure 2.3 for a kernel program
ADI and an application benchmark NAS/SP (Serial version 2.3); the former has 8
loops in 4 loop nests, and the latter has over 218 loops in 67 loop nests. In each
figure, a point at (x, y) indicates that y thousand memory references have a reuse distance in the range [2^(x−1), 2^x). The figure links discrete points into a curve to emphasize
the elevated hills, where large portions of memory references reside. The important
measure is not the length of a reuse distance but whether that length increases with the input size. If so, the data reuse will become a cache miss when the data input is
sufficiently large. We call those reuses whose reuse distance increases with the input
size evadable reuses.
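The histograms in Figures 2.3 and 2.4 bin reuse distances by powers of two. A small Python sketch of that binning (with the assumption that distance 0 is placed in bucket 0) is:

    import math
    from collections import Counter

    def log2_histogram(distances):
        # bucket x counts the references whose reuse distance lies in [2^(x-1), 2^x);
        # distance 0 goes to bucket 0 (an assumption of this sketch)
        buckets = Counter()
        for d in distances:
            x = 0 if d == 0 else int(math.floor(math.log2(d))) + 1
            buckets[x] += 1
        return dict(sorted(buckets.items()))

    print(log2_histogram([0, 1, 3, 4, 7, 1000]))   # {0: 1, 1: 1, 2: 1, 3: 2, 10: 1}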
The upper two figures of Figure 2.3 show the reuse distances of ADI on two input
sizes. The two curves in each figure show reuse distances of the original program
and that of reuse-driven execution. In the original program, over 40% of memory
references (25 thousand in the first and 99 thousand in the second) are evadable reuses.
However, reuse-driven execution not only reduced the number of evadable reuses
by 33% (from 40% to 27%), but also slowed the lengthening rate of the remaining
evadable reuses.
A similar improvement is seen on NAS/SP, where reuse-driven execution reduced
the number of evadable reuses by 63% and slowed the rate of lengthening of reuse
distances.
We also tested two other programs—a FFT kernel and a full application,
DOE/Sweep3D, shown in Figure 2.4. Reuse-driven execution did not improve FFT
(where the number of evadable reuses was increased by 6%), but it reduced evadable
reuses by 67% in DOE/Sweep3D.
function Main
    for each instruction i in the ideal parallel execution order
        enqueue i to ReuseQueue
        while ReuseQueue is not empty
            dequeue instruction i from ReuseQueue
            if (i has not been executed)
                ForceExecute(i)
        end while
    end for
end Main

function ForceExecute(instruction j)
    while there exists un-executed instruction i that produces operands for j
        ForceExecute(i)
    end while
    execute j
    for each variable t used by j
        find the next instruction m that uses t
        enqueue m into ReuseQueue
    end for
end ForceExecute
Figure 2.2 Algorithm for reuse-driven execution
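For concreteness, the following is a runnable Python sketch of the algorithm in Figure 2.2. The trace format is an assumption of this sketch (each entry records an instruction name with the sets of variables it reads and writes, listed in the ideal parallel execution order), and "uses" in the figure is interpreted here as any access, read or write.

    from collections import deque

    def reuse_driven_order(trace):
        # trace: list of (name, uses, defs) in the ideal parallel execution order
        executed, order, queue = set(), [], deque()

        def accesses(k):
            return trace[k][1] | trace[k][2]

        def producer_of(v, before):
            # the latest unexecuted earlier instruction that defines v
            for k in range(before - 1, -1, -1):
                if k not in executed and v in trace[k][2]:
                    return k
            return None

        def next_accessor(v, after):
            # the next unexecuted instruction that touches v again (its closest reuse)
            for k in range(after + 1, len(trace)):
                if k not in executed and v in accesses(k):
                    return k
            return None

        def force_execute(i):
            if i in executed:
                return
            for v in trace[i][1]:           # first run unexecuted producers of operands
                p = producer_of(v, i)
                if p is not None:
                    force_execute(p)
            executed.add(i)
            order.append(trace[i][0])
            for v in accesses(i):           # then give priority to the closest reuses
                m = next_accessor(v, i)
                if m is not None:
                    queue.append(m)

        for i in range(len(trace)):
            queue.append(i)
            while queue:
                force_execute(queue.popleft())
        return order

    # Example: s3 reuses x right after s1 computes it, so it runs before s2.
    trace = [("s1", {"a"}, {"x"}), ("s2", {"b"}, {"y"}), ("s3", {"x"}, {"z"})]
    print(reuse_driven_order(trace))        # ['s1', 's3', 's2']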
[Figure 2.3 contains four plots: ADI on 50x50 and 100x100 inputs, and NAS/SP on 14x14x14 and 28x28x28 inputs. Each plot shows the number of references (in thousands) against reuse distance (log scale, base 2), with one curve for the program order and one for reuse-driven execution.]
Figure 2.3 Effect of reuse-driven execution (I)
In addition, other heuristics of reuse-driven execution were also evaluated, for example, not executing the next reuse if it is too far away (in the ideal parallel execution order). But the results were not improved.
The experiment with reuse-driven execution demonstrates the potential of fusion as
a global strategy for reducing the number of evadable reuses in large applications
with multiple loop nests. The next section studies aggressive loop fusion as a way to
realize this benefit. The effect of loop fusion on reuse distances will be measured in
Chapter 6.
[Figure 2.4 contains four plots: FFT on 64x64 and 128x128 inputs, and DOE/Sweep3D on 10x10x10 and 20x20x20 inputs. Each plot shows the number of references (in thousands) against reuse distance (log scale), with one curve for the program order and one for reuse-driven execution.]
Figure 2.4 Effect of reuse-driven execution (II)
2.3
An Algorithm for Maximal Loop Fusion
Since loops contain most data access and data reuse, loop fusion is obviously a promising solution for shortening reuse distances. The first half of this section presents an efficient algorithm that achieves maximal loop fusion and bounded reuse distance.
The second half formulates the problem of optimal loop fusion and then studies its
complexity. Although the following discussion assumes that a program is structured
in loops and arrays, the formulation and solution to loop fusion apply to programs in
other language structures such as recursive functions and object-based data.
The example program in Figure 2.5(a) has two loops sharing the access to array
A. They cannot be fused directly because of the two intervening statements that also
access part of A. To enable loop fusion, we need three supporting transformations.
The first is statement embedding, which fuses the two non-loop statements into the
first loop. It schedules A[2]=0.0 in the second iteration, where A[2] is last used.
Similarly, it puts A[1]=A[N] in the last iteration, where A[N] is last computed.
for i=2, N
    A[i] = f(A[i-1])
end for
s3  A[1] = A[N]
s4  A[2] = 0.0
for i=3, N
    B[i] = g(A[i-2])
end for

for i=2, N
    A[i] = f(A[i-1])
    if (i==3)
        A[2] = 0.0
    else if (i == N)
        A[1] = A[N]
    end if
    if (i>2 and i<N)
        B[i+1] = g(A[i-1])
    end if
end for
B[3] = g(A[1])

(a) fusion by statement embedding, loop alignment and loop splitting

for i=2,N
    A[i] = f(A[i-1])
end for
A[1] = A[N]
for i=2,N
    A[i] = f(A[i-1])
end for

(b) example of loops that cannot be fused
Figure 2.5 Examples of loop fusion
After statement embedding, two loops are still not directly fusible because the
first iteration of the second loop depends on the last iteration of the first loop. The
second transformation, iteration reordering, splits the second loop and peels off its
first iteration so that the remaining iterations can be fused with the first loop.
When two loops are fused, the third transformation, loop alignment, ensures that
their iterations are properly aligned. The second loop is shifted up by one iteration
so that the reuse of A[i − 1] happens within the same iteration. Otherwise, the reuse
would be one iteration apart, unnecessarily lengthening reuse distance. The fused
program is shown in Figure 2.5(a), where array A is closely reused.
Loop fusion introduces instruction overhead to the fused program because of
the inserted branch statements. Although this overhead was prohibitively high for
previous-generation machines, today's fast processors can easily offset this additional cost. Chapter 6 will measure the effect of fusion on real
machines.
Although the supporting transformations enable loop fusion in this example, they
do not always succeed. For example, the two loops in Figure 2.5(b) can never be
fused because all iterations of the second loop depend on all iterations of the first
loop. The dependences caused by the intervening statements make fusion impossible.
Since the feasibility test of fusion has to consider the effect of non-loop statements,
the cost can be too high if loop fusion is tested for every pair of loops. To avoid this
cost, the following algorithm employs incremental fusion, which examines only the
closest pair of data-sharing loops for fusion.
2.3.1
Single-Level Fusion
The following discussion of loop fusion makes several assumptions as listed in Figure 2.6.
At the beginning, we consider only single-dimensional loops accessing single-dimensional
arrays. Later we will use the same algorithm to fuse multi-dimensional loops level by
level. The other restrictions in Figure 2.6 can also be relaxed, however, at the cost
of a more complex fusion algorithm. For example, index expressions like A(d ∗ i + c)
can be considered by projecting the sparse index set into a dense index set.
• a program is a list of loop and non-loop statements
• all loops are one-dimensional and so are all variables
• all data accesses are in one of the two forms: A[i + t] and A[t], where A is the
variable name, i is the loop index, and t is a loop-invariant constant
Figure 2.6 Assumptions on the input program
The fusion algorithm is given in Figure 2.7, which incrementally fuses all data-sharing loops. For each statement p[i], subroutine GreedilyFuse tries to fuse it upwards
with the closest predecessor p[j] that accesses the same data. If p[i] is a statement, it
can be embedded into p[j]. Otherwise, subroutine FusibleTest is called to test whether
the two loops can be fused. If p[i] is fused with p[j], GreedilyFuse is recursively applied
on p[j] because it now accesses a larger set of data.
Subroutine FusibleTest determines whether two loops can be fused, and if so,
what reordering is needed and what the minimal alignment factor is. An alignment
factor of k means to shift the iterations of the second loop down by k iterations. The
alignment factor can be negative, when the second loop is shifted up to bring together
data reuses. For each data array, the subroutine determines the smallest alignment
factor that both satisfies data dependence and has the closest data reuse. To avoid
unnecessarily increasing the alignment factor, the algorithm does not allow positive
alignment factors for read-read data reuse. The final alignment factor is the largest
found among all arrays. The algorithm avoids repeated FusibleTest by remembering
infusible loop pairs.
A fused loop is represented as a collection of loop and non-loop statements, where
loops are aligned with each other, and non-loop statements are embedded in some
iteration of the fused loop. The data footprint of a loop includes the access to all arrays.
SingleLevelFusion
    let p be the list of program statements, either loop or non-loop statements
    iterate p[i] from the first statement to the last in the program
        GreedilyFuse(p[i])
end SingleLevelFusion

Subroutine GreedilyFuse(p[i])
    search from p[i] to find the most recent predecessor p[j] sharing data with p[i]
    if p[j] does not exist, exit and return
    if (p[i] is not a loop)
        embed p[i] into p[j]
        make p[i] an empty statement
    else if (FusibleTest(p[i], p[j]) finds a constant alignment factor)
        if (no splitting is required)
            fuse p[i] into p[j] by aligning p[i] by the alignment factor
            make p[i] an empty statement
            GreedilyFuse(p[j])
        end if
        if (splitting is required)
            split p[i] and/or p[j] and fuse p[i] into p[j] by aligning p[i]
            make p[i] an empty statement
            GreedilyFuse(p[j])
            for each remaining piece t' after splitting
                GreedilyFuse(t')
            end for
        end if
    end if
end GreedilyFuse

Subroutine FusibleTest(p[i], p[j])
    if (p[i], p[j]) has been marked as not fusible
        return false
    end if
    for each array accessed in both p[i] and p[j]
        find the smallest alignment factor that
            (1) satisfies data dependence, and
            (2) has the closest reuse
        apply iteration reordering if necessary and possible
    end for
    find the largest of all alignment factors
    if (the alignment factor is a bounded constant)
        return the alignment factor
    else
        mark (p[i], p[j]) as not fusible
        return false
    end if
end FusibleTest
Figure 2.7 Algorithm for one-level fusion
For each array, the data access consists of loop-invariant array locations and
loop-variant ranges such as [i + c1 , i + c2 ], where i is the loop index and c1 and c2 are
loop-invariant constants. Data dependences and alignment factors are calculated by
checking for non-empty intersections among footprints.
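As an illustration, the following small Python sketch (a hypothetical footprint representation, not the compiler's data structure) tests whether two loops share data by intersecting their footprints; for simplicity it assumes both loops run over the same iteration bounds.

    def footprints_share_data(fp1, fp2, lower, upper):
        # fp1, fp2: dict array -> (c1, c2, invariant_locations), meaning the loop
        # accesses A[i+c1]..A[i+c2] plus the fixed locations A[t] for t in the set;
        # lower, upper: the common iteration bounds of the two loops
        for array, (c1, c2, fixed1) in fp1.items():
            if array not in fp2:
                continue
            d1, d2, fixed2 = fp2[array]
            # do the loop-variant ranges [lower+c1, upper+c2] and [lower+d1, upper+d2] overlap?
            if lower + c1 <= upper + d2 and lower + d1 <= upper + c2:
                return True
            # is a loop-invariant location of one loop inside the range of the other?
            if any(lower + c1 <= t <= upper + c2 for t in fixed2):
                return True
            if any(lower + d1 <= t <= upper + d2 for t in fixed1):
                return True
            if fixed1 & fixed2:
                return True
        return False

    loop1 = {"A": (-1, 0, set())}                         # accesses A[i-1] and A[i]
    loop2 = {"A": (-2, -2, set()), "B": (0, 0, set())}    # accesses A[i-2] and B[i]
    print(footprints_share_data(loop1, loop2, 2, 10))     # True: both touch array A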
2.3.2
Properties
Maximal Fusion The three transformations achieve maximal fusion. Statement
embedding can always fuse a non-loop statement into a loop. Loop alignment avoids
conflicting data access between two loops by delaying the second loop by a sufficient
factor. When a bounded alignment factor cannot be found, iteration reordering is
used to extract fusible iterations and arrange them in a fusible order. Examples of
iteration reordering include loop splitting and loop reversal. By employing these three
transformations, the algorithm in Figure 2.7 fuses two loops whenever (1) they share
data and (2) their fusion is permitted by data dependence. Therefore, the algorithm
achieves maximal fusion.
Bounded Reuse Distance The lengths of reuse distances are bounded after loop
fusion, as proved in the following theorem. The restrictions listed in Figure 2.6 are
implicitly assumed throughout this section.
Theorem 2.1 In a fused loop, if the effect of loop-invariant data accesses
is excluded, the reuse distance of all other data accesses is not evadable,
that is, the reuse distance of all loop-variant accesses does not increase
when the input size grows.
Proof The reuse distance between two uses of the same data is bounded by the
product of the number of iterations between the two uses and the amount of data
accessed in each iteration. Next we examine the maximal value of these two terms.
The iteration difference between two data-sharing statements increases only because of loop alignment. Since the alignment factor between each pair of loops is a
constant, the total iteration difference between any two fusible loops is at most O(L),
where L is the number of loops in the program.
Excluding loop-invariant accesses, a fused loop accesses a collection of loop-variant
ranges. A loop has at most O(A) such ranges, where A is the number of arrays in
the program. In a given iteration, a loop-variant range includes a constant number
of data elements (because of the restrictions made in Figure 2.6). Therefore, each
iteration accesses at most O(A) data elements.
Since two uses of the same data are at most O(L) iterations apart with at most
O(A) elements in each iteration, the upper bound on the reuse distance is O(A ∗ L),
which is independent of the sizes of arrays.
The upper bound on reuse distance, O(A ∗ L), is tight because a worst-case example can be constructed as follows: the first loop is B(i)=A(i+1), followed by L loops of B(i)=B(i+1), and finally A(i)=B(i). Since the two accesses to A(i) must be separated
by L iterations, the reuse distance can be no less than L. Therefore, the fusion
algorithm achieves the tightest asymptotic upper bound on reuse distances.
Fast Algorithm The following theorem gives the time complexity of the fusion
algorithm in Figure 2.7. The cost is in fact smaller in practice because a restricted
version of loop fusion suffices for all tested programs, as explained after the theorem.
Theorem 2.2 The time complexity for the algorithm is O(V ∗ V ′ ∗ (T +
A)), where V is the number of program statements before fusion, V ′ is
the number of fused loops after fusion, T is the cost of FusibleTest, and
A is the number of data arrays in the program.
Proof The complexity of the fusion algorithm is the number of invocations of
GreedilyFuse times its cost. GreedilyFuse is called for each program statement first
and then for each new loop generated by fusion. In a program of V program statements, fusion generates at most V new loops (each successful fusion decreases the
number of loops by one). Therefore, the number of invocations of GreedilyFuse is
O(V ).
The cost of each GreedilyFuse includes the cost of (1) finding the most recent data-sharing loop, (2) fusing two statements, and (3) checking the fusibility by FusibleTest.
The data-sharing loop can be found by a backward search through fused loops. The
cost is O(V ′ ∗ A), where V ′ is the number of loops after fusion. Fusing two statements
requires updating the data footprint information, the cost of which is O(A). When
examining a fusion candidate, FusibleTest is invoked at most once for each fused loop,
so the number of invocations is O(V ∗ V ′), discounting the additional loops created
by iteration reordering. Each invocation of FusibleTest takes O(A) to check all arrays
of a footprint. The cost of iteration reordering is assumed to be T . Hence, the total
cost of FusibleTest is O(V ∗ V ′ ∗ (A + T )). The remaining cost of GreedilyFuse is
O(V ∗ (V ′ ∗ A + A)). Therefore, the total cost of the algorithm is O(V ∗ V ′ ∗ (A + T )).
The implementation in Chapter 6 makes two simplifications. It assumes that all
loop-invariant array accesses are on bordering elements, and it reorders iterations only
by splitting at boundary loop iterations. As shown in the evaluation chapter, these
two assumptions are sufficient to capture all possible fusion in the programs we tested.
In the simplified algorithm, the cost of each FusibleTest is O(A). Therefore, the time
complexity of fusion is O(V ∗ V ′ ∗ A). In a typical program where the number of fused
loops and the number of arrays are orders of magnitude smaller than the number of
program statements, the cost of simplified loop fusion is approximately linear in the
size of the program.
2.3.3
Multi-level Fusion
The previous sections have assumed loops and arrays of a single dimension. For
programs with multi-dimensional loops and arrays, the same fusion algorithm can
be applied level by level as long as the ordering of loop and data dimensions is
determined. Figure 2.8 gives the algorithm for multi-level fusion, which maximizes
the overall degree of fusion by favoring loop fusion at outer loop levels.
While all data structures and loop levels are used to determine the correctness of
fusion, only large data structures are considered in determining the profitability of
fusion because the sole concern is the overall data reuse. In the following discussion,
the term data dimension denotes a data dimension of a large array, and the term loop
level denotes a loop that iterates a data dimension of a large array.
For each loop level starting from the outermost (i.e., level 1), MultiLevelFusion
determines loop fusion at a given level in three steps. The first step tries loop fusion
for each data dimension and picks the data dimension that would have the smallest
number of fused loops. The second step applies loop fusion for the chosen data
dimension. Note that loops of the current level that traverse other data dimensions
are also fused if they access the same data dimension. The third step recursively
applies MultiLevelFusion at a higher level for each fused loop generated at the current
level. Since loops can be fused on a data dimension other than the chosen dimension,
the dimension s in this step is not always the dimension s′ found in the second step
of the algorithm.
Since all data dimensions are examined at most once at each loop level, the cost of MultiLevelFusion is O(D^2 ∗ M), where D is the number of data dimensions
and M is the cost of SingleLevelFusion given in Figure 2.7.
After loop levels and array dimensions are ordered, one remaining issue is the choice between fusing two loops and embedding one loop into another. Loop embedding is the equivalent of statement embedding in a multi-dimensional program. The resolution is as follows. Given two loops, if they iterate the same data dimension, loop fusion is applied. If, however, the data dimensions iterated by one loop are a subset of the data dimensions iterated by another, the former
loop is embedded into the latter. In all other cases, two loops are considered as not
sharing data, and neither loop fusion nor statement embedding is attempted.
2.4
Optimal Loop Fusion
The maximal fusion presented in the previous section is not optimal because it does
not minimize the reuse distance within a fused loop and the data sharing among fused
loops. This section formulates these two problems, examines their complexity, and
discusses special cases that are polynomial-time solvable.
2.4.1
Loop Fusion for Minimal Reuse Distance
The alignment of loops during fusion determines the reuse distance in the fused loop.
The problem for finding the minimal reuse distance can be formulated as a scheduling
problem, defined as follows.
Problem 2.1 Scheduling for Minimal Live Range (MLR) is a
triple of (D = (V, E), A), where D is a directed acyclic graph (dag) with
a vertex set V and edge set E, and A is a set of variables. Each node is in
fact an operation which accesses a subset of A. The task is to schedule all
operations at some time slot. The correctness of the schedule is specified
by the directed edges of the dag. For each edge, the sink node cannot be
scheduled until w time slots after the execution of the source node. The
quality of the schedule is measured by the live range of variables. Given
a schedule, the live range of a variable is the time difference between its
first and last use in the schedule.
MultiLevelFusion(S: set of data dimensions, L: current loop level)
    /* Step 1. find the best data dimension for loop level L */
    for each data dimension s
        LoopInterchange(s, L)
        apply SingleLevelFusion and count the number of fused loops
    end for
    choose the data dimension s' that has the fewest fused loops
    /* Step 2. fuse loops for level L */
    LoopInterchange(s', L)
    apply loop fusion at level L by invoking SingleLevelFusion
    /* Step 3. for each loop of level L, continue fusion at level L+1 */
    for each loop nest
        recursively apply MultiLevelFusion(S - {s}, L+1), where s is the
            data dimension iterated by the loop of level L
    end for
end MultiLevelFusion

Subroutine LoopInterchange(s: data dimension, L: loop level)
    for each loop nest
        if (loop level t (>=L) iterates data dimension s)
            apply loop interchange to make level t into level L if possible
        end if
    end for
end LoopInterchange
Figure 2.8 Algorithm for multi-level fusion
Given a machine with an unlimited number of processors, the problem of
MLR is to find a schedule such that
• (Correctness) for each edge of E, the sink node is scheduled at least
w time slots after the source node, where w is the weight of the edge,
and
• (Optimality) the sum of the live range of all variables is minimal.
The problem of scheduling for minimal live range differs from traditional task-scheduling problems because the latter assume a machine with a fixed number of processors. The classical problem, Precedence Constrained Scheduling (PCS), asks whether a dag of tasks can be scheduled in three machine time slots. PCS is NP-complete because it can be reduced from another NP-complete problem, finding a
size-k clique in a graph. Because the reduction relies on the fact that the machine
resources are limited, the same proof cannot be applied to MLR.
A possible formulation of MLR is as a graph-partitioning problem where operations scheduled at the same time slot are grouped in the same partition of a graph.
The graph-based formulation reflects unlimited resources because each partition can
contain arbitrarily many operations. A related partitioning problem is k-way cut,
defined as follows.
Problem 2.2 Given a graph G = (V, E) where each edge has a unit
weight but each node can connect to any number of edges. Also designate
k nodes in V as terminals. The k-way cut problem is to find a set of
edges of minimal total weight such that removing these edges renders all
k terminals disconnected from each other.
The k-way cut problem is NP-hard, as proved by Dahlhaus et al.[DJP+92]. They
showed that k-way cut is NP-hard even when k is equal to 3. Their proof used a
reduction from the problem of MAX-cut, which finds a maximal number of edges
separating two nodes in a graph.
MLR can be reduced from a problem similar to k-way cut. In particular, given a
k-way cut problem, one possible way to convert it into a MLR problem is as follows.
First, we create a list of k unit-time, sequential operations. Then for each node in
the problem of k-way cut, we create an independent operation that can be executed
with any of the sequential operations. For example, node u and v in k-way cut
become operation u′ and v ′ in MLR. Finally for each edge in k-way cut, we add a
new variable that is accessed by the corresponding operations. For example, if an edge
connects u and v, we add a variable t that is accessed by u′ and v ′. After conversion,
every possible schedule corresponds to a k-partitioning and vice versa. Note that
this reduction process builds a scheduling problem with an unlimited number of data
variables.
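The construction just described can be written down directly. The sketch below (hypothetical names, not part of the thesis) builds the MLR instance from a k-way cut instance: k unit-time sequential operations, one independent operation per graph node, and one variable per edge, shared by the two corresponding operations.

    def kway_cut_to_mlr(nodes, edges, k):
        # k unit-time operations that must run in consecutive time slots
        sequential_ops = ["slot%d" % t for t in range(1, k + 1)]
        # one independent operation per node of the k-way cut instance
        node_ops = ["op_%s" % u for u in nodes]
        # one variable per edge, accessed by the two corresponding operations
        variables = {"var_%s_%s" % (u, v): ("op_%s" % u, "op_%s" % v)
                     for (u, v) in edges}
        return sequential_ops, node_ops, variables

    ops, node_ops, variables = kway_cut_to_mlr(["u", "v", "w"],
                                               [("u", "v"), ("v", "w")], k=3)
    print(variables)   # {'var_u_v': ('op_u', 'op_v'), 'var_v_w': ('op_v', 'op_w')}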
The data-sharing relationship in MLR can be modeled by the cross-partition
weight in k-way cut. One complication, however, is that data sharing between operations in far-apart time slots contributes a longer live range than data sharing between operations in nearby time slots. For example, the live range of a variable accessed in the first and third time slots is twice that of a variable shared between the first and second time slots. Therefore, the
goal of minimization for MLR slightly differs from the k-way cut. For the reduction
to be correct, MLR should be reduced from the following modified problem of k-way
cut.
Problem 2.3 Weighted k-way cut is defined on a graph G = (V, E). V includes a set of k terminals, v1, . . . , vk. Each edge has unit weight, and each node connects to an arbitrary number of edges. Let p be a partitioning of the graph nodes so that vi (i = 1, . . . , k) are in different partitions. Let n_{i,j} be the number of edges between the nodes of the partition containing vi and those of the partition containing vj (i, j = 1, . . . , k). The problem is to find the k-way partitioning that minimizes the function Σ_{j>i; i,j=1,...,k} n_{i,j} ∗ (j − i).
The modified problem is weighted because the objective function has been changed
from minimizing Σ n_{i,j} to minimizing Σ n_{i,j} ∗ (j − i). The weighted k-way cut
problem is not known to be NP-hard. However, considering the complexity of unweighted k-way cut, the weighted version is not likely to be polynomial-time solvable.
Section 2.4.3 will revisit this problem and explore a better formulation of loop fusion.
2.4.2
Loop Fusion for Minimal Data Sharing
Maximal loop fusion is not optimal because it does not minimize the amount of data
reloading among fused loops. Therefore, it does not minimize the amount of total
memory transfer for the whole program. This section first formulates the problem
of loop fusion for minimal memory transfer, then gives a polynomial solution to a
restricted form of this problem, and finally proves that the complexity of the unrestricted form is NP-complete. In the process, it also points out the inadequacy
of the popular fusion model given by Gao et al.[GOST92] and by Kennedy and
McKinley[KM93].
Formulation
Given a sequence of loops accessing a set of data arrays, we can model both the
computation and the data in a fusion graph. A fusion graph consists of nodes, one per loop, and two types of edges: directed edges for modeling data dependences and undirected edges for fusion-preventing constraints. Although this
definition of a fusion graph looks similar to that of previous work, the objective of
fusion is radically different as stated below.
Problem 2.4 Bandwidth-minimal fusion problem: Given a fusion
graph, how can we divide the nodes into a sequence of partitions such that
• (Correctness) each node appears in one and only one partition; the
nodes in each partition have no fusion preventing constraint among
them; and dependence edges flow only from an earlier partition to a
later partition in the sequence,
• (Optimality) the sum of the number of distinct arrays in all partitions
is minimal.
The correctness constraint ensures that loop fusion obeys data dependences and
fusion-preventing constraints. Assuming arrays are large enough to prohibit cache
reuse among disjoint loops, the second requirement ensures optimality because for
each loop, the number of distinct arrays is the number of arrays the loop reads from
memory during execution. Therefore, the minimal number of arrays in all partitions
means the minimal memory transfer and minimal bandwidth consumption for the
whole program.
For example, Figure 2.9 shows the fusion graph of six loops. Assume that loop 5 and loop 6 cannot be fused, but either of them can be freely fused with any of the other four loops. Loop 6 depends on loop 5. Without fusion, the total number of arrays accessed by the six loops is 20. The optimal fusion leaves loop 5 alone and fuses all other
loops. The number of distinct arrays is 1 in the first partition and 6 in the second,
thus the total memory transfer is reduced from 20 arrays to 7.
[Figure 2.9 shows a fusion graph of six loops over data arrays A, B, C, D, E, F and scalar sum. Loops 1, 2, and 3 each access A, D, E, and F; loop 4 accesses B, C, D, E, and F; loop 5 accesses A and sum; loop 6 accesses B, C, and sum. Data-sharing edges connect loops that access common data, a fusion-preventing constraint connects loops 5 and 6, and a data dependence edge runs from loop 5 to loop 6.]
Figure 2.9 Example of bandwidth-minimal loop fusion
The optimality of bandwidth-minimal fusion is different from previous work on
loop fusion. Both Gao et al.[GOST92] and Kennedy and McKinley[KM93] constructed a fusion graph in a similar way but modeled data reuse as weighted edges
between graph nodes. For example, the edge weight between loop 1 and 2 would be 4
because they share four arrays. Their goal is to partition the nodes so that the total
weight of cross-partition edges is minimal.
The sum of edge weights does not correctly model the aggregation of data reuse.
For example, in Figure 2.9, loops 1 to 3 each have an edge of weight one to loop 5. But
the aggregated reuse between the first three loops and loop 5 should not be 3; on the
contrary, the amount of data sharing is 1 because they share access to only one array,
A.
To show that the weighted-edge formulation is not optimal, it suffices to give a counterexample, which is the real purpose of Figure 2.9. The optimal weighted-edge fusion is to fuse the first five loops and leave loop 6 alone; the total weight of cross-partition edges is 2, which lies between loop 4 and loop 6. However, this fusion has to load 8 arrays (6 in the first partition and 2 in the second), while the previous bandwidth-minimal fusion needs only 7. Conversely, the total inter-partition edge weight of the bandwidth-minimal fusion is 3, clearly not optimal under the weighted-edge formulation. Therefore, the weighted-edge formulation does not minimize overall memory
transfer.
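The objective of Problem 2.4 can be checked directly on the example of Figure 2.9. The Python sketch below restates the figure's data sets (the scalar sum is ignored, since only large arrays matter for bandwidth) and evaluates the cost of the three partitionings discussed above.

    # bandwidth cost of a partitioning = sum over partitions of the number of
    # distinct arrays accessed in that partition (Problem 2.4)
    loops = {
        1: {"A", "D", "E", "F"},
        2: {"A", "D", "E", "F"},
        3: {"A", "D", "E", "F"},
        4: {"B", "C", "D", "E", "F"},
        5: {"A"},
        6: {"B", "C"},
    }

    def bandwidth_cost(partitions):
        return sum(len(set.union(*(loops[l] for l in part))) for part in partitions)

    # No fusion: each loop is its own partition -> 20 arrays loaded.
    print(bandwidth_cost([[l] for l in loops]))          # 20
    # Bandwidth-minimal fusion: loop 5 alone, the rest fused -> 7 arrays loaded.
    print(bandwidth_cost([[5], [1, 2, 3, 4, 6]]))        # 7
    # The weighted-edge optimum (fuse loops 1-5, leave 6 alone) loads 8 arrays.
    print(bandwidth_cost([[1, 2, 3, 4, 5], [6]]))        # 8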
To understand the effect of data sharing and the complexity of bandwidth-minimal
fusion, the remaining part of this section studies a model based on a different type of
graphs, hyper-graphs.
Solution Based On Hyper-graphs
The traditional definition of an edge is inadequate for modeling data use because the
same data can be shared by more than two loops. Instead, we should use hyper-edges
because a hyper-edge can connect any number of nodes in a graph. A graph with
hyper-edges is called a hyper-graph. The optimality requirement of loop fusion can
now be restated as follows.
Problem 2.5 Bandwidth-minimal fusion problem (II): Given a
fusion graph as constructed by Problem 2.4, add a hyper-edge for each
array in the program, which connects all loops that access the array. How
can we divide all nodes into a sequence of partitions such that
• (Correctness) criteria are the same as Problem 2.4, but
• (Optimality) for each hyper-edge, let the length be the number of
partitions the edge connects to after partitioning, then the goal is to
minimize the total length of all hyper-edges.
The next part first solves the problem of optimal two-partitioning on hyper-graphs
and then proves the NP-completeness of multi-partitioning.
Two-partitioning is a special class of the fusion problem where the fusion graph has
only one fusion-preventing edge and no data dependence edge among non-terminal
nodes. The result of fusion will produce two partitions where any non-terminal node
can appear in any partition. The example in Figure 2.9 is a two-partitioning problem.
Two-partitioning can be solved as a connectivity problem between two nodes.
Two nodes are connected if there is a path between them. A path between two nodes
is a sequence of hyper-edges where the first edge connects one node, the last edge
connects the other node, and consecutive ones connect intersecting groups of nodes.
Given a hyper-graph with two end nodes, a cut is a set of hyper-edges such that
taking out these edges would disconnect the end nodes. In a two-partitioning problem,
any cut is a legal partitioning. The size of the cut determines the total amount of
data loading, which is the total amount of data plus the size of the cut (which is the
total amount of data reloading). Therefore, obtaining the optimal fusion amounts to finding a minimal cut.
The algorithm given in Figure 2.10 finds a minimal cut for a hyper-graph. At the
first step, the algorithm transforms the hyper-graph into a normal graph by converting
each hyper-edge into a node, and connecting two nodes in the new graph when the
respective hyper-edges overlap. The conversion also constructs two new end nodes for
the transformed graph. The problem now becomes one of finding minimal vertex cut
on a normal graph. The second step applies the standard algorithm for minimal vertex
cut, which converts the graph into a directed graph, splits each node into two and
connects them with a directed edge, and finally finds the edge cut set by the standard
Ford-Fulkerson method. The last step transforms the vertex-cut to the hyper-edge
cut in the fusion graph and constructs the two partitions.
Although the algorithm in Figure 2.10 can find a minimal cut for hyper-edges with non-negative weights, we are concerned only with fusion graphs whose edges have unit weight. In this case, the first step of the minimal-cut algorithm in Figure 2.10 takes O(E + V); the second step takes O(V′(E′ + V′)) if breadth-first search is used to find augmenting paths; finally, the last step takes O(E + V). Since V′ = E in the second step, the overall cost is O(E(E′ + E) + V), where E is the number of arrays, V is the number of loops, and E′ is the number of pairs of arrays that are accessed by the same loop. In the worst case, E′ = E^2, and the algorithm takes O(E^3 + V). What is surprising is that although the time is cubic in the number of arrays, it is linear in the number of loops in a program.
So far, the solution method has assumed the absence of dependence edges. The
dependence relation can be enforced by adding hyper-edges to the fusion graph. Given
a fusion graph with N edges and two end nodes s and t, assume the dependence
relations form an acyclic graph. Then if node a depends on b, we can add three sets
of N edges connecting s and a, a and b, and b and t. Minimal-cut will still find the
minimal cut although each dependence adds a weight of N to the total weight of
minimal cut. Any dependence violation would add an extra N to the weight of a cut,
which makes it impossible to be minimal. In other words, any minimal cut will not
place a before b, and the dependence is observed. However, adding such edges would
increase the time complexity because the number of hyper-edges will be in the same
order as the number of dependence edges.
Input:     A hyper-graph G = (V, E); two nodes s and t ∈ V.
Output:    A set of edges C, which is a minimal cut between s and t;
           two partitions V1 and V2, where s ∈ V1, t ∈ V2, V1 = V − V2,
           and an edge e connects V1 and V2 iff e ∈ C.
Algorithm:
    /* Initialization */
    let C, V1 and V2 be empty sets
    /* Step 1: convert G to a normal graph */
    construct a normal graph G' = (V', E')
    let array map be the one-to-one map between V' and E
    add a node v to V' for each hyper-edge e in E; let map[v] = e
    add edge (v1, v2) in G' iff map[v1] and map[v2] overlap in G
    /* add in two end nodes */
    add two new nodes s' and t' to V'
    for each node v in V'
        add edge (s', v) if map[v] contains s in G
        add edge (t', v) if map[v] contains t in G
    /* Step 2: find the minimal vertex cut in G' between s' and t' */
    convert G' into a directed graph
    split each node in V' and add in a directed edge in between
    use the Ford-Fulkerson method to find the minimal edge cut
    convert the minimal edge cut into the vertex cut in G'
    /* Step 3: construct the cut set and the partitions in G */
    let C be the node cut set of G' found in the previous step
    delete all edges of G corresponding to nodes in C
    let V1 be the set of nodes connected to s in G; let V2 be V - V1
    return C, V1 and V2
Figure 2.10 Minimal-cut algorithm for a hyper-graph
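The following is a small Python sketch of Step 1 of Figure 2.10 under an assumed data layout (one hyper-edge per array, recorded as the set of loops that access it): each hyper-edge becomes a node of a normal graph, overlapping hyper-edges are connected, and two artificial end nodes stand for s and t. The minimal vertex cut of this graph between s' and t' then identifies the arrays whose reloading must be tolerated.

    def hypergraph_to_graph(hyper_edges, s, t):
        # hyper_edges: dict mapping each array to the set of loops that access it
        # s, t: the two end loops that must end up in different partitions
        names = list(hyper_edges)
        edges = set()
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                if hyper_edges[a] & hyper_edges[b]:       # the hyper-edges overlap
                    edges.add((a, b))
        for a, touched in hyper_edges.items():            # two artificial end nodes
            if s in touched:
                edges.add(("s'", a))
            if t in touched:
                edges.add(("t'", a))
        return ["s'", "t'"] + names, edges

    # The data of Figure 2.9, with loop 5 and loop 6 as the two end nodes.
    hyper = {"A": {1, 2, 3, 5}, "B": {4, 6}, "C": {4, 6},
             "D": {1, 2, 3, 4}, "E": {1, 2, 3, 4}, "F": {1, 2, 3, 4}}
    nodes, edges = hypergraph_to_graph(hyper, s=5, t=6)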
The Complexity of General Loop Fusion
Although the two-partitioning problem can be solved in polynomial time, the multi-partitioning form of bandwidth-minimal fusion is NP-complete.
Theorem 2.3 Multi-partitioning of bandwidth-minimal fusion is NP-complete when the number of partitions is greater than two.
Proof The fusion problem is in NP because loops or nodes of a fusion graph can
be partitioned in a non-deterministic way, and the legality and optimality can be
checked in polynomial time.
The fusion problem is also NP-hard. To prove this, we reduce the k-way cut problem,
defined in Problem 2.2, to the fusion problem. Given a graph G = (V, E) and k nodes
to be designated as terminals, k-way cut is to find a set of edges of minimal total
weight such that removing the edges renders all k terminals disconnected from each
other. To convert a k-way cut problem to a fusion problem, we construct a hyper-graph G′ = (V′, E′) where V′ = V. We add a fusion-preventing edge between each
pair of terminals, and for each edge in E, we add a new hyper-edge connecting the two
end nodes of the edge. It is easy to see that a minimal k-way cut in G is an optimal
fusion in G′ and vice versa. Since k-way cut is NP-complete, bandwidth-minimal
fusion is NP-hard when the number of partitions is greater than two. Therefore, it is
NP-complete.
2.4.3
An Open Question
The previous two sections formulated the problem of optimal loop fusion as a graph-partitioning problem, in particular, as unweighted and weighted k-way cut. However,
the formulation is not entirely precise for the following two reasons.
On the one hand, optimal loop fusion is simpler than k-way cut because the fusion
graph usually has a limited number of edges. The problem of k-way cut assumes that
a node can connect to an unbounded number of edges. The number of edges in a
fusion graph corresponds to the number of data structures in the program. Although
this number is not bounded, it is usually not proportional to the size of the program.
Therefore, the cost of optimal k-way cut may not be very high for real programs
where the number of data structures is small.
On the other hand, optimal fusion is more complex than k-way cut because of the
dependence relations among program statements. Unlike k-way cut in which every
node can be grouped with any terminal, a loop can be fused with another only if
no dependence is violated. For a real program, a loop can be fused with a subset of
loops, and the members of this subset are determined by how other loops are fused.
Considering these two differences, the question remains open of how loop fusion
should be formulated and what its complexity is in terms of both the number of
loops and the number of data structures.
2.5
Advanced Optimizations Enabled by Loop Fusion
Aggressive fusion enables other optimizations. For example, the use of an array can
become enclosed within one or a few loops. The localized use allows aggressive storage
transformations that are not possible otherwise. This section describes the idea of
two such storage optimizations: storage reduction, which replaces a large array with a
small section or a scalar; and store elimination, which avoids writing back new values
to an array. Both save significantly more memory bandwidth than loop fusion alone.
2.5.1
Storage Reduction
After loop fusion, if the live range of an array is shortened to stay within a single
loop nest, the array can be replaced by a smaller data section or even a scalar. In
particular, two opportunities exist for storage reduction. The first case is where the
live range of a data element (all uses of the data element) is short, for example,
within one loop iteration. The second case is where the live range spans the whole
loop, but only a small section of data elements have such a live range. The first
case can be optimized by array shrinking, where a small temporary buffer is used to
carry live ranges. The second case can be optimized by array peeling, where only a
reduced section of an array is saved in dedicated storage. Figure 2.11 illustrates
both transformations.
The example program in Figure 2.11(a) uses two large arrays a[N, N] and b[N, N].
Loop fusion transforms the program into Figure 2.11(b). Not only does the fused loop
contain all accesses to both arrays, but the definitions and uses of many array elements are also
very close together in the computation. The live range of a b-array element is within one iteration
of the inner loop. Therefore, the whole b array can be replaced by a scalar b1. The
live range of an a-array element is longer, but it is still within every two consecutive
j iterations. Therefore, array a[N, N] can be reduced into a smaller buffer a3[N],
(a) Original program:

    // Initialization of data
    For j=1, N
      For i=1, N
        read(a[i,j])
      End for
    End for

    // Computation
    For j=2, N
      For i=1, N
        b[i,j] = f(a[i,j-1], a[i,j])
      End for
    End for
    For i=1, N
      b[i,N] = g(b[i,N], a[i,1])
    End for

    // Check results
    sum = 0.0
    For j=2, N
      For i=1, N
        sum += a[i,j]+b[i,j]
      End for
    End for
    print sum

(b) After loop fusion:

    sum = 0.0
    For i=1, N
      read(a[i,1])
    End for
    For j=2, N
      For i=1, N
        read(a[i,j])
        b[i,j] = f(a[i,j-1], a[i,j])
        if (j<=N-1)
          sum += a[i,j]+b[i,j]
        else
          b[i,N] = g(b[i,N], a[i,1])
          sum += b[i,N]+a[i,N]
        end if
      End for
    End for
    print sum

(c) After array shrinking and peeling:

    sum = 0.0
    For i=1, N
      read(a1[i])
    End for
    For j=2, N
      For i=1, N
        read(a2)
        if (j=2)
          b1 = f(a1[i], a2)
        else
          b1 = f(a3[i], a2)
        end if
        if (j<=N-1)
          sum += b1+a2
          a3[i] = a2
        else
          b1 = g(b1, a1[i])
          sum += b1+a2
        end if
      End for
    End for
    print sum
Figure 2.11 Array shrinking and peeling
which carries values from one j iteration to the next. A section of a[N, N] array has
a live range spanning the whole loop because a[1 . . . N, 1] is defined at the beginning
and used at the end. These elements can be peeled off into a smaller array a1[N] and
saved throughout the loop. After array shrinking and peeling, the original two arrays
of size N 2 have been replaced by two arrays of size N plus two scalars, achieving a
dramatic reduction in storage space.
Storage reduction directly reduces the bandwidth consumption between all levels
of memory hierarchy. First, the optimized program occupies a smaller amount of
memory, resulting in less memory-CPU transfer. Second, it has a smaller footprint
in cache, increasing the chance of cache reuse. When an array can be reduced to a
scalar, all its uses can be completed in a register, eliminating cache-register transfers
as well.
2.5.2
Store Elimination
While storage reduction optimizes only localized arrays, the second transformation,
store elimination, improves bandwidth utilization of arrays whose live range spans
multiple loop nests. The transformation first locates the loop containing the last
segment of the live range and then finishes all uses of the array so that the program
no longer needs to write new values back to the array.
(a) Original program:

    For i=1, N
      res[i] = res[i]+data[i]
    End for
    sum = 0.0
    For i=1, N
      sum += res[i]
    End for
    print sum

(b) After loop fusion:

    sum = 0.0
    For i=1, N
      res[i] = res[i]+data[i]
      sum += res[i]
    End for
    print sum

(c) After store elimination:

    sum = 0.0
    For i=1, N
      sum += res[i]+data[i]
    End for
    print sum
Figure 2.12 Store elimination
The program in Figure 2.12 illustrates this transformation. The first loop in
Figure 2.12(a) assigns new values to the res array, which is used in the next loop.
After the two loops are fused in (b), the writeback of the updated res array can
be eliminated because all uses of res are already completed in the fused loop. The
program after store elimination is shown in Figure 2.12(c).
The goal of store elimination differs from all previous cache optimizations because
it changes only the behavior of data writebacks and it does not affect the performance
of memory reads at all. Store elimination has no benefit if memory latency is the
main performance constraint. However, if the bottleneck is memory bandwidth, store
elimination becomes extremely useful because reducing memory writebacks is as important as reducing memory reads. The following experiment verifies the benefit of
store elimination on two of today’s fastest machines: HP/Convex Exemplar and SGI
Origin2000 (with R10K processors).
The table in Figure 2.13 lists the reduction in execution time by loop fusion and
store elimination. Fusion without store elimination reduces running time by 31% on
Origin and 13% on Exemplar; store elimination further reduces execution time by
27% on Origin and 33% on Exemplar. The combined effect is a speedup of almost 2
on both machines, clearly demonstrating the benefit of store elimination.
machines        original    fusion only    store elimination
Origin2000      0.32 sec    0.22 sec       0.16 sec
Exemplar        0.24 sec    0.21 sec       0.14 sec
Figure 2.13 Effect of store elimination
2.6
Summary
The central task of this chapter is to minimize the distance of data reuse. It first used the
ideal reuse-driven execution to measure the potential of global computation fusion.
Then it developed a new fusion algorithm, maximal loop fusion, which fuses all data-sharing program statements whenever possible and achieves bounded reuse distance
within a fused loop. The new algorithm employs statement embedding, loop alignment, and iteration reordering to support single-level loop fusion. For programs with
multi-dimensional loops and arrays, the new algorithm always minimizes the number
of outer loops. Under reasonable assumptions, the time complexity of maximal fusion
is O(V ∗ V′ ∗ A ∗ D²), where V is the number of program statements before fusion,
V ′ is the number of fused loops after fusion, A is the number of data arrays, and D
is the number of dimensions of the largest array in a program.
Maximal fusion is not optimal because it does not minimize reuse distance within
a fused loop and it does not minimize the amount of data sharing among fused loops.
The chapter formulated the first problem as weighted k-way cut and the second
problem as unweighted k-way cut. It used hyper-graphs to model data sharing and
proved that fusion for minimal data sharing is NP-complete.
Loop fusion enables advanced storage optimizations. This chapter described two:
storage reduction reduces the size of arrays by array shrinking and array peeling,
and store elimination removes memory writebacks by finishing all uses of the data in
advance. Store elimination is the first program transformation in the literature that
exclusively targets memory bandwidth. These two techniques are not fully developed
here and are part of the future work.
One main reason that loop fusion and storage optimizations become profitable on
modern machines is that their additional instruction overhead is compensated by fast
processors. In general, the dramatically increased computing power has allowed much
more aggressive ways of program optimization. However, one question we should not
forget to ask is whether we want to optimize programs manually or automatically.
Loop fusion is an example of complex program transformation that can and should
be automated by a compiler.
In addition to instruction overhead, loop fusion has a side effect on memory hierarchy performance because it may merge too many data accesses into a single fused loop. The
next chapter will show how to mitigate this problem by optimizing data layout and
exploiting spatial reuse among the global data.
Chapter 3
Global Data Regrouping
3.1
Introduction
Since a cache consists of non-unit cache blocks, making good use of each cache block becomes
critically important because low cache-block utilization leads directly to both low
memory-bandwidth utilization and low cache utilization. For example, for cache
blocks of 16 numbers, if only one number is useful in each cache block, 15/16 or 94%
of memory bandwidth is wasted, and furthermore, 94% of cache space is occupied by
useless data and only 6% of cache is available for data reuse.
A compiler can improve cache-block utilization, or equivalently, cache-block spatial reuse, by packing useful data into cache blocks so that all data elements in a cache
block are consumed before it is evicted. Since a program employs many data arrays,
the useful data in each cache block may come from two sources: the data within
one array, or the data from multiple arrays. Cache-block reuse within a single array
is not always possible because not all accesses to an array can be made contiguous.
Common examples are programs with regular but high-dimensional data and programs with irregular and dynamic data. Furthermore, even in the case of contiguous
access within single arrays, cache reuse can still be seriously hindered by excessive
cache interference when too many arrays are accessed simultaneously, as for example,
after loop fusion. Therefore, a compiler needs to combine useful data from multiple
arrays to address the limitations of single-array data reuse.
This chapter presents inter-array data regrouping, a global data transformation
that first splits and then selectively regroups all data arrays in a program. Figure 3.1
gives an example of this transformation. The left-hand side of the figure shows the
example program, which traverses a matrix first by rows and then by columns. One
of the loops must access non-contiguous data and cause low cache-block utilization
because only one number in each cache block is useful. Inter-array data regrouping
combines the two arrays by putting them into a single array that has an extra dimension, as shown in the right-hand side of Figure 3.1. Assuming that the first data
dimension is contiguous in memory, the regrouped version guarantees at least two
useful numbers in each cache block regardless of the order of traversal.
Original program:

    Array a[N,N], b[N,N]

    // row-by-row traversal
    For j=1, N
      For i=1, N
        F( a[i,j], b[i,j] )
      End for
    End for

    // column-by-column traversal
    For i=1, N
      For j=1, N
        G( a[i,j], b[i,j] )
      End for
    End for

After data regrouping:

    Array c[2,N,N]

    // row-by-row traversal
    For j=1, N
      For i=1, N
        F( c[1,i,j], c[2,i,j] )
      End for
    End for

    // column-by-column traversal
    For i=1, N
      For j=1, N
        G( c[1,i,j], c[2,i,j] )
      End for
    End for
Figure 3.1 Example of inter-array data regrouping
In addition to improving cache spatial reuse, data regrouping also reduces the
page-table (TLB) working set of a program because it merges multiple arrays into
a single one. On modern machines, the cost of a TLB overflow is very harmful to
performance because the CPU cannot continue program execution during a TLB miss.
Inter-array data regrouping can also improve the communication performance of shared-memory parallel machines. On these machines, cache blocks are the basis of data
consistency and consequently the unit of communication among parallel processors.
Good cache-block utilization enabled by inter-array data regrouping can amortize the
latency of communication and fully utilize communication bandwidth. However, the
use of data regrouping on parallel machines is outside the scope of this dissertation.
The rest of the chapter formulates the problem of inter-array data regrouping,
presents its solution and discusses its extensions.
3.2
Program Analysis
Given a program, a compiler identifies in two steps all opportunities of inter-array
data regrouping. The first step partitions the program into a sequence of computation
phases. A computation phase is defined as a segment of the program that accesses
data larger than cache. A compiler can estimate the amount of data access in loop
structures. The related compiler analysis techniques will be discussed in Chapter 5,
which designs a data analysis tool that estimates the total amount of memory transfer.
Where information is insufficient, a compiler can assume that unknown loop
counts are large and that unknown data references traverse the whole array. The purpose
of these conservative assumptions is to guarantee profitability; correctness is not
affected regardless of the assumptions made.
The second step of the analysis identifies the sets of compatible arrays. Two
arrays are compatible if their sizes differ by at most a constant, and if they are
always accessed in the same order in each computation phase. For example, the
size of array A(3, N) is compatible with B(N) and with B(N − 3) but not with
C(N/2) or D(N, N). The access order from A(1) to A(N) is compatible with B(1)
to B(N) but not with the order from C(N) to C(1) or from D(1) to D(N/2). The
second criterion does allow compatible arrays to be accessed differently in different
computation phases, as long as they have the same traversal order in the same phase.5
The second step requires identifying the data access order within each array.
Regular programs can be analyzed with various forms of array section analysis. For
irregular or dynamic programs, a compiler can use the data-indirection analysis described in Section 4.3.2 of Chapter 4.
The other important task of the second step is the separation of arrays into the
smallest possible units, which is done by splitting constant-size data dimensions into
multiple arrays. For example, A(2, N) is converted into A1(N) and A2(N).
After the partitioning of computation phases and compatible arrays, the task of
data regrouping becomes clear. First, data regrouping transforms each set of compatible arrays separately because grouping incompatible arrays is either impossible or
too costly. Second, a program is now modeled as a sequence of computation phases,
each of which accesses a subset of compatible arrays. The goal of data regrouping is
to divide the set of compatible arrays into a set of new arrays such that the overall
cache-block reuse is maximized in all computation phases.
5. In general, the traversal orders of two arrays need not be the same as long as they maintain a
consistent relationship. For example, arrays A and B have a consistent traversal order if whenever A[i]
is accessed, B[f(i)] is accessed, where f(x) is a one-to-one function.
3.3
Regrouping Algorithm
3.3.1
One-Level Regrouping
This section illustrates the problem and the solution of data regrouping through an
example—the application Magi from DOD, which simulates the shock and material
response of particles in a three-dimensional space (based on the smoothed particle hydrodynamics method). The table in Figure 3.2 lists the six major computation phases
of the program as well as the attributes of particles used in each phase. Since the
program stores an attribute of all particles in a separate array, different attributes do
not share the same cache block. Therefore, if a computation phase uses k attributes,
it needs to load k cache blocks each time it accesses a particle.
    Computation phases                Attributes accessed
1   constructing interaction list     position
2   smoothing attributes              position, speed, heat, derivate, viscosity
3   hydrodynamic interactions 1       density, momentum
4   hydrodynamic interactions 2       momentum, volume, energy, cumulative totals
5   stress interaction 1              volume, energy, strength, cumulative totals
6   stress interaction 2              density, strength
Figure 3.2 Computation phases of a hydrodynamics simulation program
Combining multiple arrays can reduce the number of cache blocks accessed and
consequently improve cache-block reuse. For example, we can group position and
speed into a new array such that the ith element of the new array contains the
position and speed of the ith particle. After array grouping, each particle reference
in the second phase accesses one fewer cache block since position and speed are
now loaded in a single cache block. In fact, we can regroup all five arrays used
in the second phase and consequently merge all attributes into a single cache block
(assuming a cache block holds five attributes).
However, excessive grouping in one phase may hurt cache-block reuse in other
phases. For example, grouping position with speed wastes half of each cache block
in the first phase because the speed attribute is never referenced in that phase.
The example program shows two requirements for data regrouping. The first is
to fuse as many arrays as possible in order to minimize the number of loaded cache
blocks, but at the same time, the other requirement is not to introduce any useless
data through regrouping. In fact, the second requirement mandates that two arrays
should not be grouped unless they are always accessed together. Therefore, the goal
of data regrouping is to partition data arrays such that (1) two arrays are in the same
partition only if they are always accessed together, and (2) the size of each partition
is the largest possible. The first property ensures no waste of cache, and the second
property guarantees the maximal cache-block reuse.
Although condition (1) might seem a bit restrictive in practice, many applications
use multiple fields of a data structure array together. The algorithm will split each
field as a separate array. In addition, aggressive loop fusion often gathers data access
of a large number of arrays in a fused loop. Therefore, it should be quite common
for two or more arrays to always be accessed together. Later, Section 3.4 discusses
methods for relaxing condition (1) at the cost of making the analysis more complex.
The problem of optimal regrouping is equivalent to a set-partitioning problem. A
program can be modeled as a set and a sequence of subsets where the set represents
all arrays and each subset models the data access of a computation phase in the
program.
Given a set and a sequence of subsets, we say two elements are buddies if every
subset that contains one of them also contains the other. The buddy relation is
reflexive, symmetric, and transitive; it is therefore an equivalence relation and induces a partition. A buddy partitioning
satisfies the two requirements of data regrouping because (1) all elements in each
partition are buddies, and (2) all buddies belong to the same partition. Thus the
data-regrouping problem is the same as finding the partitioning of buddies. For example,
in Figure 3.2, the arrays volume and energy are buddies because they are always accessed
together.
The buddy partitioning can be solved with efficient algorithms. For example,
the following partitioning method uses set memberships for each array, that is, a bit
vector whose entry i is 1 if the array is accessed by the ith phase. The method uses a
radix sort to find arrays with the same set memberships, i.e. arrays that are always
accessed together. Assuming a total of N arrays and S computation phases, the time
complexity of the method is O(N ∗ S). If the S membership bits are stored in a bit vector in the actual
implementation, the algorithm runs in O(N) vector steps. In this sense, the cost of
regrouping is linear in the number of arrays.
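A minimal sketch of this partitioning step, assuming the input is given as a list of array names and a list of per-phase access sets (a hash on the membership vector replaces the radix sort but groups the same buddies):

    def buddy_partition(arrays, phases):
        # signature of an array = bit vector of the phases that access it;
        # arrays with identical signatures are buddies and form one group
        groups = {}
        for a in arrays:
            signature = tuple(1 if a in phase else 0 for phase in phases)
            groups.setdefault(signature, []).append(a)
        return list(groups.values())

    # the six phases of Figure 3.2 ('totals' abbreviates cumulative totals)
    phases = [{'position'},
              {'position', 'speed', 'heat', 'derivate', 'viscosity'},
              {'density', 'momentum'},
              {'momentum', 'volume', 'energy', 'totals'},
              {'volume', 'energy', 'strength', 'totals'},
              {'density', 'strength'}]
    arrays = sorted(set().union(*phases))
    print(buddy_partition(arrays, phases))
    # volume, energy and the cumulative totals end up in one group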
3.3.2
Optimality
Qualitatively, the algorithm groups two arrays when and only when it is always profitable to do so. To prove this, consider that, on the one hand, data regrouping never brings
any useless data into cache, so it is applied only when profitable; on the other hand,
whenever two arrays can be merged without introducing useless data, they are regrouped by the algorithm. Therefore, data regrouping exploits inter-array spatial
reuse when and only when it is always profitable.
Under reasonable assumptions, the optimality can also be defined quantitatively
in terms of the amount of memory access and the size of TLB working set. The
key link between an array layout and the overall data access is the concept called
iteration footprint, which is the number of distinct arrays accessed by one iteration of
a computation phase. Assuming an array element is smaller than a cache block but
an array is larger than a virtual memory page, then the iteration footprint is equal to
the number of cache blocks and the number of pages accessed by one iteration. The
following lemma shows that data regrouping minimizes the iteration footprint.
Lemma 3.1 Under the restriction of no useless data in cache blocks,
data regrouping minimizes the iteration footprint of each computation
phase.
Proof After buddy partitioning, two arrays are regrouped when and only when they
are always accessed together. In other words, two arrays are combined when and only
when doing so does not introduce any useless data. Therefore, for any computation
phase after regrouping, no further array grouping is possible without introducing
useless data. Thus, the iteration footprint is minimal after data regrouping.
The size of a footprint directly affects cache performance because the more arrays
are accessed, the more active cache blocks are needed in cache, and therefore, the more
chances of premature eviction of useful data caused by either limited cache capacity
or associativity. For convenience, we refer to both cache capacity misses and cache
interference misses collectively as cache overhead misses. It is reasonable to assume
that the number of cache overhead misses is a non-decreasing function on the number
of active arrays. Intuitively, a smaller footprint should never cause more overhead
misses because a reduced number of active cache blocks can always be arranged so
that their conflicts with cache capacity and with each other do not increase. With this
assumption, the following theorem proves that a minimal footprint leads to minimal
cache overhead.
Theorem 3.1 Given a program of n computation phases, where the
total number of cache overhead misses is a non-decreasing function on the
size of its iteration footprint k, then data regrouping minimizes the total
number of overhead misses in the whole program.
Proof Assume the numbers of overhead misses in the n computation phases are
f1(k1), f2(k2), ..., fn(kn); then the total amount of memory re-transfer is proportional
to f1(k1) + f2(k2) + ... + fn(kn). According to the previous lemma, k1, k2, ..., kn are
the smallest possible after regrouping. Since all the functions are non-decreasing, the sum
of overhead misses is therefore minimal after data regrouping.
The assumption made by the theorem covers a broad range of data access patterns
in real programs, including two extreme cases. The first is the worst extreme, where
no cache reuse happens, for example, in random data access. The total number
of cache misses is linear in the size of the iteration footprint since each data access
causes a cache miss. The other extreme is perfect cache reuse, where no cache overhead
miss occurs, for example, in contiguous data access. The total amount of repeated
memory transfer is zero. In both cases, the number of cache overhead misses is a non-decreasing function of the size of the iteration footprint. Therefore, data regrouping
is optimal in both cases according to the theorem just proved.
In a similar way, data regrouping minimizes the overall TLB working set of a
program. Assuming arrays do not share the same memory page, the size of the
iteration footprint, i.e. the number of distinct arrays accessed by a computation
phase, is in fact the size of its TLB working set. Since the size of TLB working set is
a non-decreasing function over the iteration footprint, the same proof can show that
data regrouping minimizes the overall TLB working set of the whole program.
A less obvious benefit of data regrouping is the elimination of useless data by
grouping only those parts that are used by a computation phase of a program. The
elimination of useless data by array regrouping is extremely important for applications
written in languages with data abstraction features, as in, for example, C, C++, Java
and Fortran 90. In these programs, a data object contains many attributes, but only
a fraction of them is used in a given computation phase. Data regrouping will split
each attribute into a separate array and regroup only those that are used together,
in a way that is compile-time optimal for the whole program.
In summary, the regrouping algorithm is optimal because it minimizes all iteration
footprints of a program. With the assumption that cache overhead is a non-decreasing
function over the size of iteration footprints, data regrouping achieves maximal cache
reuse and minimal TLB working set.
3.3.3
Multi-level Regrouping
The previous sections have been aimed at improving cache-block reuse and therefore
did not group data at granularity larger than an array element. This section overcomes
this limitation by grouping arrays at higher levels. The extension is beneficial because
optimizing the layout of array segments reduces cache interference and the page-table
working set.
The example program in Figure 3.3 illustrates multi-level data regrouping. Arrays
A and B are grouped at the element level to improve spatial reuse in cache blocks.
In addition, the columns of all three arrays are grouped so that each outer-loop
iteration accesses a contiguous segment of memory. Consider, for example, the data
access of the first iteration of the outer loop. The first inner loop iterates through
the first column D[1-2, 1...N, 1, 1]. Then the second inner loop traverses through
the second column D[1...N, 2, 1]. Therefore, each outer-loop iteration accesses a
contiguous section of memory. Indeed, multi-level regrouping achieves contiguous
data access even for non-perfectly nested loops.
It should be noted that popular programming languages such as Fortran do not
allow arrays of non-uniform dimensions like those of array D. However, this is not a
problem when regrouping is applied by a back-end compiler. In addition, as revealed
later in the evaluation chapter, source-level regrouping may negatively affect register
allocation in the back-end compiler. However, data regrouping does not change the
temporal reuse relationship of any variable, so the problem must stem solely from the
confusion caused by source-level changes. Therefore, the problem should be easily solved
if data regrouping is applied by the back-end compiler itself.
The algorithm for multi-level regrouping is shown in Figure 3.4. The first step of
MultiLevelRegrouping collects simultaneous data access at all array dimensions. Two
criteria are used to find the set of arrays accessed at a given dimension. The first is
necessary for the algorithm to be correct. The second criterion does not affect correctness, but it makes sure that the algorithm considers only those memory references that
access the whole array. The sets of arrays found for each data dimension correspond
    for i
      for j
        g( A[j,i], B[j,i] )
      end for
      for j
        t( C[j,i] )
      end for
    end for

    A[j,i] -> D[1,j,1,i]
    B[j,i] -> D[2,j,1,i]
    C[j,i] -> D[j,2,i]
Figure 3.3 Example of multi-level data regrouping
to the computation phases for that dimension. The second step of the algorithm then
applies one-level regrouping for each data dimension. Subroutine OneLevelRegrouping
uses the partitioning algorithm discussed in Section 3.3.1 to regroup arrays at a single
data dimension. The correctness of multi-level regrouping is proved in the following
theorem. The purpose of the proof is to show that the grouping decision at a lower
level (e.g., grouping array a with b) does not contradict the decision at a higher level
(e.g., separating a and b).
Theorem 3.2 If the algorithm in Figure 3.4 merges two arrays at data
dimension d, the algorithm must also group these two arrays at all dimensions higher than d.
Proof It suffices to prove that if the algorithm groups two arrays at d, the two
arrays are always accessed together at dimensions higher than d. Suppose a loop l
exists where the two arrays are not accessed together at an outer dimension. Among
all references (of these two arrays) that are considered for dimension d, some of them
must be enclosed by loop l because l iterates through a dimension higher than d.
Since the two arrays are not always accessed together under loop l, the first step
of the algorithm must find a set for dimension d that contains only one of the two
arrays. Therefore, the two arrays will not be grouped at dimension d by the algorithm.
Contradiction.
3.4
Extensions
The previous section makes two restrictions in determining data regrouping. The first
is disallowing any useless data, and the second is assuming a static data layout without dynamic data re-mapping. This section relaxes these two restrictions and gives
Assumptions:
    arrays are stored in column-major order
    only memory references to compatible arrays are considered

MultiLevelRegrouping
    /* Step 1. find the subsets of arrays accessed in all loop levels */
    for each loop i
      examine all array references inside this loop, and
      for each data dimension d, find the set s of arrays such that
        (1) each array of s is accessed at all dimensions higher than or
            equal to d by loop i and its outer loops, and
        (2) each array of s is accessed at all other dimensions by the
            inner loops of i
    end for

    /* Step 2. partition arrays for each data dimension */
    for each data dimension d
      let S be the collection of sets found in Step 1 for dimension d
      let A be the set of all arrays
      OneLevelRegrouping(A, S, d)
    end for
end MultiLevelRegrouping

OneLevelRegrouping(A: all arrays, S: subsets of A, d: current dimension)
    let N be the size of A
    /* construct a bit vector for each subset */
    for each subset s in S
      construct a bit vector b of length N
      for i from 1 to N
        if (array i is in s) b[i]=1 otherwise b[i]=0
      end for
    end for
    /* partition arrays */
    sort all bit vectors using radix sort
    group arrays that have the same bit vector at dimension d
end OneLevelRegrouping
Figure 3.4 Algorithm for multi-level data regrouping
modified solutions. In addition, this section expands the scope of data regrouping to
minimizing not only memory reads but also memory writebacks.
3.4.1
Allowing Useless Data
Sometimes allowing useless data may lead to better performance. An example is the
first program in Figure 3.5. Since the first loop is executed 100 times more often
than the second loop, it is very likely that the benefit of grouping A and B in the
first loop exceeds the overhead of introducing useless data in the second loop.
Allowing useless data:

    for step=1, t
      for i=1, N
        Foo(A[i], B[i])
      end for
    end for
    for i=1, N
      Bar(A[i])
    end for

Allowing dynamic data remapping:

    for step=1, t
      for i=1, N
        Foo(A[i], B[i])
      end for
    end for
    for step=1, t
      for i=1, N
        Bar(A[i])
      end for
    end for
Figure 3.5 Examples of extending data regrouping
However, the tradeoff depends on the exact performance gain due to data regrouping and the performance loss due to useless data. Both the benefit and cost
are machine dependent. Therefore, not including useless data is in fact compile-time
optimal, because otherwise regrouping cannot guarantee profitability. In practice, the
implementation of data regrouping considers only frequently executed computation
phases. It applies data regrouping only on loops that are inside a time-step loop.
When the exact run-time benefit of regrouping and the overhead of useless data
are known, the problem of optimal regrouping can be formulated with a weighted,
undirected graph, called a data-regrouping graph. Each array is a node in the graph.
The weight of each edge is the run-time benefit of regrouping its two end nodes minus
the overhead of such grouping. The goal is to pack arrays that are most beneficial
into the same cache block. However, the packing problem on a data-regrouping graph
is NP-hard because it can be reduced from the G-partitioning problem[KH78].
3.4.2
Allowing Dynamic Data Regrouping
Until now, data regrouping uses a single data layout for the whole program. An
alternative strategy is to allow dynamic regrouping of data between computation
phases so that the data layout of a particular phase can be optimized without worrying
about the side effects in other phases. An example is the program in the right-hand
side of Figure 3.5. The best strategy may be to group A and B at the beginning of
the program and then separate these two arrays after the first time-step loop.
As in the case of allowing useless data, the profitability of dynamic regrouping
depends on the exact benefit of data grouping and the overhead of run-time readjustment, both of which are machine dependent. Therefore, not using dynamic
regrouping is compile-time optimal because it never causes a negative performance impact. A possible extension is to apply data regrouping within different time-step loops
and insert dynamic data regrouping in between.
When the precise benefit of regrouping and the cost of dynamic re-mapping
are known, the problem can be formulated in the same way as the one given by
Kremer [Kre95]. In his formulation, a program is separated into computation phases.
Each data layout results in a different execution time for each phase plus the cost of
changing data layouts among phases. The optimal layout, either dynamic or static, is
the one that minimizes the overall execution time. Without modification, Kremer’s
formulation can model the search space of all static or dynamic data-regrouping
schemes. However, as Kremer has proved, finding the optimal layout is NP-hard.
Since the search space is generally not large, he successfully used 0-1 integer programming to find the optimal data layout. The same method can be used to find the
optimal data regrouping when dynamic regrouping is allowed.
3.4.3
Minimizing Data Writebacks
On machines with insufficient memory bandwidth, data writebacks impede memory
read performance because they consume part of the available memory bandwidth.
To avoid unnecessary writebacks, read-only data should not be in the same cache
block as modified data; otherwise the read-only data will be unnecessarily written back.
Therefore, data regrouping should not combine arrays unless they are all read-only or
all modified in any computation phase. This new requirement can be easily enforced
as follows. For each computation phase, split the accessed arrays into two disjoint
subsets: the first is the set of read-only arrays and the second is the modified arrays.
Treat each subset as a distinct computation phase and then apply the partitioning.
As a result, two arrays are grouped if and only if they are always accessed together,
and the type of the access is either both read-only or both modified. With this
extension, data regrouping finds the largest subsets of arrays that can be grouped
without introducing useless data or redundant writebacks. Note that grouping two
arrays does not mean they must be read-only or modified throughout the whole
program. They can be both read-only in some phases and both modified in other
phases.
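The split can be folded directly into the buddy partitioning. A small sketch, assuming each phase is described by the pair of array sets it reads and writes (the names and the input shape are assumptions of this example):

    def buddy_partition_rw(arrays, phases):
        # split every phase into a read-only and a modified pseudo-phase, then
        # partition by membership signature as before: two arrays are grouped only
        # if they are always accessed together with the same read/write behavior
        pseudo_phases = []
        for reads, writes in phases:
            pseudo_phases.append(set(reads) - set(writes))   # read-only part
            pseudo_phases.append(set(writes))                # modified part
        groups = {}
        for a in arrays:
            signature = tuple(1 if a in p else 0 for p in pseudo_phases)
            groups.setdefault(signature, []).append(a)
        return list(groups.values())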
The above extension can be easily included into the multi-level data regrouping
algorithm given in Figure 3.4. In its first step, if the data dimension is the innermost
dimension, the algorithm will split each set s into two sets, one with all read-only
arrays and one with modified arrays. The regroupings at higher dimensions are not
affected because they cannot interleave data into the same cache block. All other
aspects of the algorithm are also unchanged.
When redundant writebacks are allowed, data regrouping can be more aggressive
by first combining data solely based on data access and then separating read-only and
modified data within each partition. The separation step is not easy because different
computation phases read and write a different set of arrays. The general problem can
be modeled with a weighted, undirected graph, in which each array is a node and
each edge has a weight labeling the combined effect of both regrouping and redundant
writebacks. The goal of regrouping is to pack nodes into cache blocks to maximize
the benefit. As in the case of allowing useless data, the packing problem here is also
NP-hard because it can be reduced from the G-partitioning problem[KH78].
3.5
Summary
This chapter has developed inter-array data regrouping, a global data transformation
that first splits and then regroups all arrays to achieve maximal inter-array spatial
reuse. The compiler first divides a program into computation phases and then partitions arrays into compatible sets. Data regrouping is applied within a compatible
set. Two arrays are grouped if and only if they are always accessed together. The
algorithm for regrouping is efficient; its time complexity is O(V ∗ A), where V is the
length of the program and A is the number of data structures in the program. In the
case of high dimensional data arrays, data grouping is applied at a hierarchy of levels
to maximize the degree of contiguous access at each loop level.
The regrouping method is conservative because it never interleaves useful data
with useless data at any moment of computation. Therefore it guarantees profitability. When useless data and dynamic regrouping are prohibited, the regrouping algorithm is also optimal. The relaxation of either constraint makes optimal data layout
dependent on the exact run-time effect of a data transformation, which makes the
result machine dependent. In contrast, the conservative regrouping achieves the best
machine-independent data layout, that is, the compile-time optimal solution. In addition, the chapter proved that relaxing either constraint leads to NP-hard problems.
Data regrouping can be extended to avoid unnecessary memory writebacks by
separating read-only access and read-write access into different cache blocks. The
extension has been incorporated in the overall algorithm of multi-level, inter-array
data regrouping.
Global data regrouping enables users to define data structures in their own style
without worrying about their impact on performance. Programmers should not need to attempt such data optimizations by hand, because the optimal array layout depends on the computation structure of the program, and a manual transformation would have to be readjusted every
time the program changes. With the inter-array data regrouping described in this chapter, a compiler can now derive the optimal data layout regardless of the user's initial choice.
Indeed, data regrouping is a perfect job for an automatic compiler, and a
compiler can do it perfectly.
Chapter 4
Run-time Cache-reuse Optimizations
4.1
Introduction
The previous chapters have assumed that both the data structure and its access
pattern are fixed and are known to a compiler. A large class of applications, however,
employs extensible data structures whose shape and content are constructed and
changed at run time. An example is molecular dynamics simulation, which models the
movement of particles in some physical domain (e.g. a 3-D space). The distribution
of molecules remains unknown until run time, and the distribution itself changes
during the computation. Another example is sparse linear algebra, where the nonzero entries in a sparse matrix change dynamically. Because of their non-uniform and
unpredictable nature, they are called irregular and dynamic applications.
Irregular and dynamic applications pose two new problems for cache optimization.
First, since a compiler knows neither the data nor its access pattern, optimizations cannot be
applied at compile time. In addition, since the computation evolves during the execution, no fixed program organization is likely to perform well at all times. Therefore,
a program may have to be transformed multiple times during the execution.
This chapter studies run-time optimizations for improving cache reuse in irregular
and dynamic applications. Specifically, it presents the run-time version of computation fusion and data grouping—locality grouping and data packing. The first half
of the chapter describes and evaluates these two transformations. The second half
is devoted to the compiler support for dynamic data packing. It first presents the
compiler analysis that automatically detects all opportunities of data packing. Since
switching among different data layouts at run time carries a significant overhead, two
compiler optimizations are then introduced to eliminate most of this overhead.
4.2
Locality Grouping and Data Packing
This section describes two run-time transformations: locality grouping, which fuses
dynamic computations on the same data item; and dynamic data packing, which
then groups data items that are used together. Both transformations are evaluated,
individually and combined, through various access sequences on simulated caches.
4.2.1
Locality Grouping
The effectiveness of cache is predicated on the existence of locality and good computation structure exploiting that locality. In a dynamic application such as molecular
dynamics simulation, the locality comes directly from its physical model in which
a particle interacts only with its neighbors. A set of neighboring particles forms a
locality group in which most interactions occur within the group. In most programs,
however, locality groups are not well separated. Although schemes such as domain
partitioning and space-filling curve ordering exist for explicitly extracting locality,
they are time-consuming and may therefore not be cost-effective in improving cache
performance of a sequential execution. Another limitation of previous work is that it
relies on user knowledge and manual program transformation. To pursue a faster algorithm and a general program transformation model, this section presents the most
efficient, yet also most general reordering scheme, locality grouping.
Given a sequence of independent computations, locality grouping clusters those
sharing access to the same data. Figure 4.1(a) shows an example input to a N-body
simulation program. Graph (a) draws three example objects and their interactions
and Graph (b) is the example sequence of all interactions, which are independent
computations. Assuming a cache of 3 objects, the example sequence incurs 10 misses.
Locality grouping reorders the sequence so that all computations on the same object
are clustered. The new sequence starts with all interactions on object a, then b, until
the last object g. The locality-grouped access sequence incurs only 6 misses.
Locality grouping incurs minimal run-time overhead. It consists of a two-pass
radix sort: the first pass collects a histogram and the second pass produces the
locality-grouped sequence. Locality grouping is widely applicable and can optimize
any set of independent computations. A compiler can automate locality grouping
by identifying parallel computations and inserting a call to a run-time library. The
legality and profitability of locality grouping can be determined either by compiler
(a) Example interactions: objects a, b, c, e, f, g and the interactions among them
    (each pair below is one interaction)

(b) Example sequence (10 misses on a 3-element, fully associative cache with
    LRU replacement):

        (b c) (e g) (e f) (a b) (f g) (a c)

(c) Sequence after locality grouping (6 misses):

        (a b) (a c)     group on a
        (b c)           group on b
        (e g) (e f)     group on e
        (f g)           group on f
Figure 4.1 Example of locality grouping
analysis or user directives. One example use of a user directive is presented in detail in
the second half of this chapter.
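A minimal sketch of the two-pass counting (radix) sort, assuming interactions are given as pairs of object indices and are clustered on their first object (all names here are illustrative):

    def locality_group(interactions, num_objects):
        # pass 1: histogram of interactions per leading object
        count = [0] * (num_objects + 1)
        for x, _ in interactions:
            count[x + 1] += 1
        # prefix sum gives the start of each object's cluster
        for i in range(1, num_objects + 1):
            count[i] += count[i - 1]
        # pass 2: scatter each interaction into its cluster
        out = [None] * len(interactions)
        for x, y in interactions:
            out[count[x]] = (x, y)
            count[x] += 1
        return out

    # Figure 4.1 with objects a, b, c, e, f, g mapped to 0, 1, 2, 4, 5, 6
    pairs = [(1, 2), (4, 6), (4, 5), (0, 1), (5, 6), (0, 2)]
    print(locality_group(pairs, 7))
    # [(0, 1), (0, 2), (1, 2), (4, 6), (4, 5), (5, 6)]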
The remainder of the section evaluates locality grouping on a data set from mesh,
a structural simulation. The data set is a list of edges of a mesh structure of some
physical object such as an airplane. Each edge connects two nodes of the mesh. This
specific data set, provided by the Chaos group at University of Maryland, has 10K
nodes and 60K edges. The experiment simulates only the data accesses on a fully
associative cache in order to isolate the inherent cache reuse behavior from other
factors. In fact, the simulation is very similar to the one used in Section 2.2, which
measures the reuse distance of repeated data access. The cache misses are the reuses
whose reuse distance is greater than or equal to the cache size. The specific cache sizes
measured are 2K and 4K objects. The cache uses unit-length cache lines.
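For reference, a minimal sketch of this simulation (fully associative, unit-line LRU cache, so a reference misses exactly when its reuse distance is at least the cache size; the example traces are the Figure 4.1 sequences flattened into object accesses):

    from collections import OrderedDict

    def lru_misses(trace, cache_size):
        cache = OrderedDict()              # keys kept in least-recently-used order
        misses = 0
        for obj in trace:
            if obj in cache:
                cache.move_to_end(obj)     # hit: becomes most recently used
            else:
                misses += 1
                cache[obj] = True
                if len(cache) > cache_size:
                    cache.popitem(last=False)   # evict the least recently used
        return misses

    before = [1, 2, 4, 6, 4, 5, 0, 1, 5, 6, 0, 2]
    after  = [0, 1, 0, 2, 1, 2, 4, 6, 4, 5, 5, 6]
    print(lru_misses(before, 3), lru_misses(after, 3))   # 10 6, as in Figure 4.1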
Figure 4.2 gives the miss rate of mesh with and without locality grouping. Locality
grouping eliminates 96.9% of cache misses in the 2K cache and 99.4% in the 4K cache.
The miss rates after locality grouping are extremely low, especially in the 4K cache
(0.37%). Further decreasing the miss rate with more powerful reordering schemes is
unlikely to be cost-effective in this case, since the extra reordering time would likely
outweigh the additional gain.
                        Original                After locality grouping
                        2K cache    4K cache    2K cache    4K cache
miss rate of mesh       93.2%       63.5%       2.93%       0.37%
Figure 4.2 Effect of locality grouping
4.2.2
Dynamic Data Packing
Correct data placement is critical to effective use of available memory bandwidth.
Dynamic data packing is a run-time optimization that groups data accessed at close
intervals in the program into the same cache line. For example, if two objects are
always accessed consecutively in a computation, placing them adjacent to each other
increases bandwidth utilization by increasing the number of bytes on each line that
are used before the line is evicted.
Figure 4.3 will be used as an example throughout this section to illustrate the
packing algorithms and their effects. Figure 4.3(a) shows an example access sequence.
The objects are numbered by their location in memory. In the sequence, the first
object interacts with the 600th and 800th object and subsequently the latter two
objects interact with each other. Assume that the cache size is limited and the access
to the last pair of the 600th and 800th objects cannot reuse the data loaded at the
beginning. Since each of these three objects is on different cache lines, the total
number of cache misses is 5. A transformed data layout is shown in Figure 4.3(b),
where the three objects are relocated at positions 0 to 2. Assuming a cache line can
hold three objects, the transformed layout only incurs two cache misses, a significant
reduction from the previous figure of 5 misses.
(a) Example interaction list before packing (the data set is larger than cache):

        (0 800)
        (0 600)
        ...
        (600 800)

    5 cache misses

(b) Interaction list after run-time data packing (objects 1 and 2 fall into the
    same cache line):

        (0 1)
        (0 2)
        ...
        (2 1)

    2 cache misses
Figure 4.3 Example of data packing
The rest of this section presents three packing algorithms and a comparison study
of their performance on different types of run-time inputs.
Packing Algorithms
The simplest packing strategy is to place data in the order they first appear in the
access sequence. I call this strategy consecutive packing or first-touch packing. The
packing algorithm is shown in Figure 4.4. To ensure that each object has one and
only one location in the new storage, the algorithm uses a tag for each object to label
whether the object has been packed or not.
    initialize each tag to be false (not packed)
    for each object i in the access sequence
      if i has not been packed
        place i in the next available location
        mark its tag to be true (packed)
      end if
    end for
    place the remaining unpacked objects
Figure 4.4 Algorithm of consecutive data packing
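The same first-touch strategy in runnable form, as a minimal sketch (names are illustrative); the old-to-new map it builds is the kind of run-time data map used later in Section 4.3:

    def consecutive_packing(trace, num_objects, old_data):
        new_location = [-1] * num_objects      # -1 plays the role of the 'not packed' tag
        next_free = 0
        for obj in trace:                      # place objects in first-touch order
            if new_location[obj] == -1:
                new_location[obj] = next_free
                next_free += 1
        for obj in range(num_objects):         # then place the remaining unpacked objects
            if new_location[obj] == -1:
                new_location[obj] = next_free
                next_free += 1
        new_data = [None] * num_objects        # build the packed array and remapped trace
        for old, new in enumerate(new_location):
            new_data[new] = old_data[old]
        new_trace = [new_location[obj] for obj in trace]
        return new_data, new_trace, new_location

    # the access trace of Figure 4.3(a), flattened into single-object accesses
    trace = [0, 800, 0, 600, 600, 800]
    _, new_trace, _ = consecutive_packing(trace, 1000, list(range(1000)))
    print(new_trace)                           # [0, 1, 0, 2, 2, 1], the layout of Figure 4.3(b)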
Consecutive packing carries a minimal time and space overhead because it traverses the access sequence and object array once and only once. For access sequences
in which each object is referenced at most once, consecutive packing yields optimal
cache line utilization because the objects are visited in stride-one fashion during the
computation. Achieving an optimal packing in the presence of repeated accesses, on
the other hand, is NP-complete, as this problem can be reduced from the G-partition
problem[KH78] following a similar reduction by Thabit[Tha81]. The packing algorithms presented in this section are therefore based on heuristics.
One shortcoming of consecutive packing is that it does not take into account the
different reuse patterns of different objects. Group packing attempts to overcome
this problem by classifying objects according to their reuse pattern and applying
consecutive packing within each group. In the example in Figure 4.3(a), the first
object is not reused later, but the 600th and 800th objects are reused after a similar
interval. Based on reuse patterns, group packing puts the latter two objects into a
new group and packs them separately from the first object. If we assume a cache line
of two objects, consecutive packing fails to put the latter two objects into one cache
line but group packing succeeds. As a result, consecutive packing yields four misses
while group packing incurs only three.
The key challenge for group packing is how to characterize a reuse pattern. The
simplest approach is to use the average reappearance distance of each object in the
access sequence, which can be efficiently computed in a single pass. More complex
characterizations of reuse patterns may be desirable if a user or compiler has additional knowledge on how objects are reused. However, more complex reuse patterns
may incur higher computation costs at run time.
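A one-pass computation of this signature is sketched below; the grouping step then buckets objects by distance granularity and applies first-touch packing within each bucket (names are illustrative):

    def average_reappearance(trace):
        last_seen, total_gap, reuses = {}, {}, {}
        for pos, obj in enumerate(trace):
            if obj in last_seen:               # a reappearance of obj
                total_gap[obj] = total_gap.get(obj, 0) + (pos - last_seen[obj])
                reuses[obj] = reuses.get(obj, 0) + 1
            last_seen[obj] = pos
        # objects that never reappear get no entry (their pattern is 'no reuse')
        return {obj: total_gap[obj] / reuses[obj] for obj in reuses}

    print(average_reappearance([0, 800, 0, 600, 600, 800]))
    # {0: 2.0, 600: 1.0, 800: 4.0}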
The separation of objects based on reuse patterns is not always profitable. It is
possible that two objects with the same reuse pattern are so far apart in the access
sequence that they can never be in cache simultaneously. In this case, we do
not want to pack them together. To solve this problem, we need to consider the
distance between objects in the access sequence as well as their reuse pattern. This
consideration motivates the third packing algorithm, consecutive-group packing.
Consecutive-group packing groups objects based on the position of their first appearance. For example, it first groups the objects that first appear in the first N positions
in the access sequence, then the objects in the next N positions, and so on until the
end of the access sequence. The parameter N is the consecutive range. Within each
range group, objects can then be reorganized with group packing.
The length of the consecutive range determines the balance between exploiting
closeness and exploiting reuse patterns. When the consecutive range is 1, data packing
is the same as consecutive packing. When the range is the full sequence, the packing is
the same as group packing. In this sense, these three packing algorithms are actually
one single packing heuristic with different parameters.
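This single parameterized heuristic can be sketched in a few lines; consec_range and granularity are the two parameters, a consecutive range of 1 degenerates to consecutive packing, and a range covering the whole trace degenerates to group packing (names are illustrative):

    def consecutive_group_packing(trace, consec_range, granularity):
        first_pos, last_seen, gaps, reuses = {}, {}, {}, {}
        for pos, obj in enumerate(trace):
            first_pos.setdefault(obj, pos)
            if obj in last_seen:
                gaps[obj] = gaps.get(obj, 0) + pos - last_seen[obj]
                reuses[obj] = reuses.get(obj, 0) + 1
            last_seen[obj] = pos
        avg = {o: gaps.get(o, 0) / reuses.get(o, 1) for o in first_pos}
        # order objects by: window of first appearance, then reuse-pattern bucket,
        # then first-touch order within the bucket
        order = sorted(first_pos, key=lambda o: (first_pos[o] // consec_range,
                                                 int(avg[o] // granularity),
                                                 first_pos[o]))
        return {old: new for new, old in enumerate(order)}   # old index -> new index

    trace = [0, 800, 0, 600, 600, 800]
    print(consecutive_group_packing(trace, 1, 2))            # {0: 0, 800: 1, 600: 2}: first touch
    print(consecutive_group_packing(trace, len(trace), 2))   # {600: 0, 0: 1, 800: 2}: by reuse pattern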
Evaluation of Packing Algorithms
All three packing algorithms are evaluated on mesh and another input access stream
which is extracted from moldyn, a molecular dynamics simulation program. The
Moldyn program initializes approximately 8K molecules with random positions. As
before, the experiment simulated only the data access on a fully associative cache.
The group packing classifies objects by their average reappearance distance; it is
parameterized by its distance granularity. A granularity of 1000 means that objects
whose average reappearance distances fall in the same 1000-element range are grouped
together. Consecutive-group packing has two parameters: the first is the consecutive
range, and the second is the group packing algorithm used inside each range.
[Figure 4.5 consists of four bar charts plotting miss rate (%) against cache line size
(1, 2, 4, 8, and 16 objects): Moldyn on a 4K cache, Moldyn on a 2K cache, Mesh on a
4K cache, and Mesh on a 2K cache. Each chart compares the original layout with
consecutive packing, group(1K) packing, and two consecutive-group configurations:
consecutive-group(1K, group(1K)) and consecutive-group(25, group(4K)) for Moldyn,
and consecutive-group(1K, group(1K)) and consecutive-group(1K, group(150)) for Mesh.]
Figure 4.5 Moldyn and Mesh, on 2K and 4K cache
The four graphs in Figure 4.5 show the effect of packing on the moldyn and
mesh data sets. The upper-left graph draws the miss rate on a 4K-sized cache for
different cache line sizes from 1 to 16 molecules long. The miss rate of the original data
layout, shown by the first bar of each cluster, increases dramatically as cache lines get
longer. The cache with 16-molecule cache lines incurs 6 times the number of misses
of the unit-line cache. Since the total amount of memory transfer is the number
of misses times cache line size, the 16-molecule cache lines result in 96 times the
memory transfer volume of the unit cache line case—it is wasting 99% of the available
memory bandwidth! Even 2-molecule cache lines waste over 80% of available memory
bandwidth. After various packing algorithms are applied, however, the miss rates drop
significantly, as indicated in the remaining four bars in each cluster. Consecutive
packing reduces the miss rate by factors ranging from 7.33 to over 26. Because
of the absence of consistent reuse pattern, group and consecutive-group packing do
not perform as well as consecutive packing but nevertheless reduce the miss rate
by a similar amount. The upper-right graph shows the effect of packing on a 2K
cache, which is very similar to the 4K cache except that the improvement is smaller.
Consecutive packing still performs the best and reduces the miss rate by 27% to a
factor of 3.2.
The original access sequence of the mesh data set has a cyclic reuse pattern and
a very high miss rate; see, for example, 64% on the 4K cache, shown in the lower-left graph of Figure 4.5. Interestingly, the cyclic data access pattern scales well on
longer cache lines, except at the size of 8. Data packing, however, evenly reduces
miss rate on all cache line sizes, including the size of 8. At that size, packing improves the miss rate by 29% to 46%. On other sizes, consecutive packing and group packing
yield slightly higher miss rates than the original data layout. One configuration,
consecutive-group(1K,group(150)), is found to be the best of all; it achieves the lowest miss rate in all cases, although it is only marginally better on sizes other than
8. Similar results are seen on a 2K cache, shown by lower-right graph in Figure 4.5.
The same version of consecutive-group packing reduces miss rate by 1% to 39%. It
should be noted that the result of consecutive-group packing is very close to the ideal
case where the miss rate halves when cache line size doubles. As shown in the next
section, dynamic packing, when combined with locality grouping, can reduce the miss
rate to as low as 0.02%.
4.2.3
Combining Computation and Data Transformation
When locality grouping is combined with data packing on mesh (moldyn was already
in locality-grouped form), the improvement is far greater than when they are individually applied. Figure 4.6 shows miss rates of mesh after locality grouping. On a 4K
cache, the miss rate on a unit-line cache is reduced from 64% to 0.37% after locality
grouping. On longer cache-line sizes, data packing further reduces the miss rate by
15% to a factor of over 6. In the 16-molecule cache line case, the combined effect is
a reduction from a miss rate of 4.52% (shown in Figure 4.5) to 0.02%, a factor of 226.
On a 2K cache with 16-molecule cache lines, the combined transformations reduce
miss rate from 7.48% to 0.25%, a factor of 30. Although not shown in the graph,
group and consecutive-group packing do not perform as well as consecutive packing.
In summary, the simulation results show that locality grouping effectively extracts
computation locality, and data packing significantly improves data locality. The effect
of data packing becomes even more pronounced in caches with longer cache lines. In
both programs, simple consecutive packing performs the best after locality grouping,
and the combination of locality grouping and consecutive packing yields the lowest miss rate.

[Two graphs: Mesh on 4K and 2K caches after locality grouping. Each plots miss rate (%) against cache line size (1 to 16) for locality grouping only and for locality grouping plus packing.]
Figure 4.6 Mesh after locality grouping
4.3  Compiler Support for Dynamic Data Packing
Run-time data transformations, dynamic data packing in particular, involve redirecting memory accesses to each transformed data structure. Such run-time changes
complicate program transformations and induce overhead during the execution. This
section first illustrates the process of data packing and the two optimizations that reduce packing overhead. It then describes the compiler analysis that identifies all opportunities for data packing and its optimizations. The compiler analysis itself guarantees correctness, but it still relies on a user hint to decide profitability. The last
section extends the compiler framework to collect run-time feedback and to automate
the profitability analysis.
4.3.1  Packing and Packing Optimizations
The core mechanism for supporting packing is a run-time data map, which maps from
the old location before data packing to the new location after data packing. Each
access to a transformed array is augmented with the indirection of the corresponding
run-time map. Thus the correctness of packing is ensured regardless of the location and the frequency of packing. Some existing language features such as sequence and storage association in Fortran prevent a compiler from accurately detecting all accesses to a transformed array. However, this problem can be safely solved by a combination of compile-, link- and run-time checks described in [CCC+97].
Although the compiler support can guarantee the correctness of packing, it needs
additional information to decide on the profitability of packing. Our compiler currently relies on a one-line user directive to specify whether packing should be applied,
when and where packing should be carried out and which access sequence should
be used to direct packing. The packing directive gives users full control over data packing, yet relieves them of any program transformation work.
The last part of this section will show how the profitability analysis of packing can
be automated without relying on any user-supplied directive.
A simplified dynamic program is given in Figure 4.7 to illustrate our compiler
support for data packing. The kernel of Moldyn has two computation loops: the first
loop calculates cumulative forces on each object, and the second loop calculates the
new location of each object as a result of those forces. The packing directive specifies
that packing is to be applied before the first loop.
Packing Directive: apply packing using interactions
for each pair (i,j) in interactions
    calculate_force( force[i], force[j] )
end for
for each object i
    update_location( location[i], force[i] )
end for
Figure 4.7 Moldyn kernel with a packing directive
The straightforward (unoptimized) packing produces the code shown in Figure 4.8.
The call to apply_packing analyzes the interactions array, packs the force array and generates the run-time data map, inter$map. After packing, indirections are added
in both loops.
apply_packing( interactions[*], force[*], inter$map[*] )
for each pair (i,j) in the interaction array
    calculate_force( force[ inter$map[i] ],
                     force[ inter$map[j] ] )
end for
for each object i
    update_location( location[i], force[ inter$map[i] ] )
end for
Figure 4.8 Moldyn kernel after data packing
The cost of data packing includes both data reorganization during packing and data redirection after packing. The first cost can be controlled by adjusting the frequency of packing, so that the cost of reorganizing data is amortized over multiple computation iterations. A compiler can make sure that this cost does not outweigh the performance gain by either applying packing infrequently or making the frequency adjustable at run time. As will be shown in Chapter 6, data reorganization incurs negligible overhead in practice.
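To make this mechanism concrete, here is a minimal C sketch of what a consecutive-packing apply_packing routine could look like. The names (new_of_old standing in for inter$map), the flat index representation and the first-touch placement policy are illustrative assumptions rather than the exact run-time library used in this dissertation.

#include <stdlib.h>

/* Consecutive packing: objects are placed in the order in which the access
 * sequence (the interactions list) first touches them.  new_of_old[] plays
 * the role of the run-time data map inter$map. */
void apply_packing(const int *interactions, int num_pairs,
                   double *force, int num_objects, int *new_of_old)
{
    int next = 0;
    for (int i = 0; i < num_objects; i++)
        new_of_old[i] = -1;                    /* not yet placed */

    /* First-touch order over the access sequence defines the new layout. */
    for (int p = 0; p < 2 * num_pairs; p++) {
        int old = interactions[p];
        if (new_of_old[old] == -1)
            new_of_old[old] = next++;
    }
    /* Objects never touched keep their relative order at the end. */
    for (int i = 0; i < num_objects; i++)
        if (new_of_old[i] == -1)
            new_of_old[i] = next++;

    /* Move the data array into the new layout. */
    double *packed = malloc(num_objects * sizeof(double));
    for (int i = 0; i < num_objects; i++)
        packed[new_of_old[i]] = force[i];
    for (int i = 0; i < num_objects; i++)
        force[i] = packed[i];
    free(packed);
}

After such a call, every reference force[i] in the computation loops becomes force[new_of_old[i]], exactly as in Figure 4.8.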
Data indirection, on the other hand, can be very expensive, because its cost is
incurred on every access to a transformed array. The indirection overhead comes from two sources: the instruction overhead of the indirection itself and the references to run-time data maps. The indirection instructions have a direct impact on the number of memory loads, but this overhead becomes less significant at deeper levels of the memory hierarchy. The cost of run-time data maps, however, has a consistent effect on all levels
of cache, although this cost is likely to be small in cases where the same data map
is shared by many data arrays. In addition, as shown next, the cost of indirection
can be almost entirely eliminated by two compiler optimizations, pointer update and
array alignment.
Pointer update modifies all references to transformed data arrays so that the
indirections are no longer necessary. In the above example, this means that the references stored in the interactions array are changed so that the indirections in the first loop can be completely eliminated. To implement this transformation correctly, a compiler must (1) make sure that every indirection array is associated with only one run-time data map and (2) when packing is applied multiple times, maintain two maps for each run-time data map: one from the original layout and one from the most recent data layout.
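A minimal sketch of pointer update under the same assumed names: the indices stored in the access array are rewritten through the map from the most recent layout to the new one, so the indirection in the first loop disappears.

/* Pointer update: rewrite the contents of the access array so that the
 * stored indices refer directly to the new data layout.  update_map maps
 * an index in the previous layout to its position in the new layout. */
void update_indirection_array(int *interactions, int num_entries,
                              const int *update_map)
{
    for (int i = 0; i < num_entries; i++)
        interactions[i] = update_map[interactions[i]];
}

Keeping this update map separate from the map back to the original layout is what makes repeated packing legal, as condition (2) requires.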
The indirections in the second loop can be eliminated by array alignment, which
reorganizes the location array in the same way as the force array, that is, aligns the i-th elements of both arrays. Two requirements are necessary for this optimization to be legal: (1) the loop iterations can be arbitrarily reordered, and (2) the range of loop iterations is identical to the range of the re-mapped data. Strictly speaking, this second optimization is more than a data transformation because it reorders loop iterations. However, the reordering preserves all data dependences and therefore preserves the full numerical accuracy of the application.
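Array alignment can be pictured as applying the same permutation to every alignable data array. A sketch, again with assumed names and a single double per element for simplicity:

#include <stdlib.h>
#include <string.h>

/* Array alignment: reorder a related data array (e.g. location) with the
 * same update_map used for the packed array, so that element i of both
 * arrays refers to the same object again after packing. */
void transform_data_array(double *data, int num_objects,
                          const int *update_map)
{
    double *tmp = malloc(num_objects * sizeof(double));
    for (int i = 0; i < num_objects; i++)
        tmp[update_map[i]] = data[i];
    memcpy(data, tmp, num_objects * sizeof(double));
    free(tmp);
}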
The example code after applying pointer update and array alignment is shown in
Figure 4.9. The update_map array is added to map data from the last layout to the
current layout. After the two transformations, all indirections through the inter$map
array have been removed.
apply_packing( interactions[*], force[*],
               inter$map[*], update_map[*] )
update_indirection_array( interactions[*], update_map[*] )
transform_data_array( location[*], update_map[*] )
for each pair (i,j) in interactions
    calculate_force( force[i], force[j] )
end for
for each object i
    update_location( location[i], force[i] )
end for
Figure 4.9 Moldyn kernel after packing optimizations
The overhead of array alignment can be further reduced by avoiding packing those
data arrays that are not live at the point of data packing. In the above example, if
the location array does not carry any live values at the point of packing, then the
third call, which transforms the location array, can be removed.
4.3.2  Compiler Analysis and Instrumentation
The first step of the compiler support is to find what I call primitive packing groups.
A primitive packing group contains two sets of arrays: the set of access arrays, which
hold the indirect access sequence, and the set of data arrays, which are either indirectly accessed through the first set of arrays or alignable with some arrays that are
indirectly accessed.
Primitive packing groups are identified as follows. For each indirect access, the
compiler finds the two arrays that are involved. They are an indirect access pair, and
they form a primitive packing group. For each loop, if its iterations can be arbitrarily
reordered, all accessed arrays are in an alignment group, and they form a primitive
packing group where the access array set is empty. An indirect access pair and an
alignment group of the Moldyn kernel are shown in Figure 4.10.
Figure 4.10 Primitive packing groups in Moldyn
After finding all primitive packing groups, the compiler partitions these groups
into disjoint packing partitions. Two primitive packing groups are disjoint if they do
not share any array between their access array sets and between their data array sets.
A union-find algorithm can efficiently perform the partitioning.
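A sketch of this partitioning step with a standard union-find structure. The array identifiers, the group representation and the fixed capacity are assumptions made only for illustration; for brevity the sketch unions every array a group mentions, without distinguishing access arrays from data arrays.

#define MAX_ARRAYS 1024

/* Union-find over array identifiers: two primitive packing groups fall
 * into the same packing partition whenever they mention a common array. */
static int parent[MAX_ARRAYS];

static int find(int x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];     /* path halving */
        x = parent[x];
    }
    return x;
}

static void unite(int a, int b)
{
    parent[find(a)] = find(b);
}

/* group_arrays[g] lists the ids of the arrays mentioned by primitive
 * packing group g.  After the loop, the distinct roots are the disjoint
 * packing partitions (the packing candidates). */
void partition_groups(int num_groups, const int group_size[],
                      int *const group_arrays[], int num_arrays)
{
    for (int a = 0; a < num_arrays; a++)
        parent[a] = a;
    for (int g = 0; g < num_groups; g++)
        for (int k = 1; k < group_size[g]; k++)
            unite(group_arrays[g][0], group_arrays[g][k]);
}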
After partitioning, each disjoint packing partition is a packing candidate. The
compiler then chooses those candidates that contain arrays specified in user directives.
The two packing optimizations are readily applied on any packing candidate, should
it become the choice of packing. Pointer update changes all arrays in the access array
set; array alignment transforms all arrays in the data array set and reorders the loops
that access aligned arrays.
The use of packing optimizations needs to be restricted if a packing candidate
has inconsistent requirements for array alignment and pointer update. The checks
for correctness are as follows. A data array can be reordered if all optimizations
agree on a single layout; an access array can be updated if all its indirections point
to data arrays of the same layout, or equivalently, if all its update requirements are the same. The optimizations are disabled for arrays with conflicting transformation
requirements.
The correctness of array alignment requires one additional check, which is that
any reordered loop must traverse the full range of all transformed arrays within the
loop. If a compiler knows the exact bounds of such loops, it can restrict packing to
reorder only within that range. Otherwise, a compiler must check at run time whether
the range requirement is met before a loop, and if not, fall back to the unoptimized
version with data indirections.
Whenever the packing optimizations are not applicable or are disabled, the compiler inserts indirections through run-time maps to all accesses to the transformed
data. The overall process of compiler analysis and instrumentation is summarized in
Figure 4.11.
Figure 4.11 Compiler indirection analysis and packing optimization
The compiler analysis does not assume any outside knowledge of a program and
its data structures. Yet it is powerful enough to identify all data indirections. In fact,
the packing candidates form an indirect access graph, which identifies all data arrays
as well as pointer arrays at different levels. For example, in a program with two-level
indirections, an array of first-level pointers is not only an access array in one packing
candidate but also a data array in another packing candidate. Both data and pointer
arrays can be packed, and packing at one level is independent of all other levels.
4.3.3  Extensions to Fully Automatic Packing
Although the one-line packing directive is convenient when a user knows how to apply
packing, requiring such a directive is not desirable in situations where a user cannot make an accurate judgement on the profitability of packing.
This section discusses several straightforward extensions that can fully automate the
profitability analysis, specifically, extensions that decide whether, where, and when
to apply packing.
With the algorithm described in the previous section, a compiler can identify all
packing candidates. For each candidate, the compiler can record the access sequence
at run time and determine whether it is non-contiguous and, if so, whether packing
can improve its spatial reuse. Such decisions depend on program inputs and must
be made with some sort of run-time feedback system. In addition, the same data
may be indirectly accessed by more than one access sequence, each of which may demand a different reorganization scheme. Again, run-time analysis is necessary to pick out the
best packing choice.
Once the compiler chooses a packing candidate, it can place packing calls right
before the place where the indirect data accesses begin. The placement requires
finding the right loop level under which the whole indirect access sequence is iterated.
The frequency of packing can also be automatically determined. One efficient
scheme is to monitor the average data distance in an indirect access sequence and
only invoke packing routines when adjacent computations access data that are too far
apart in memory. Since the overhead of data reorganization can be easily monitored
at run time, the frequency of packing can be automatically controlled to balance the
cost of data reorganization.
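One way such a monitor might look, sketched in C with assumed names and an arbitrary threshold: compute the average distance in memory between consecutively accessed elements and repack only when that distance grows large relative to a cache line.

#include <stdlib.h>

/* Average distance (in elements) between consecutively accessed items;
 * a large value suggests the layout has drifted away from the access
 * order and that repacking may pay off. */
double average_access_distance(const int *access_seq, int n)
{
    long long total = 0;
    for (int i = 1; i < n; i++)
        total += llabs((long long)access_seq[i] - access_seq[i - 1]);
    return (n > 1) ? (double)total / (n - 1) : 0.0;
}

/* Called once per time step; the factor of 4 is an arbitrary threshold. */
int should_repack(const int *access_seq, int n, double elems_per_line)
{
    return average_access_distance(access_seq, n) > 4.0 * elems_per_line;
}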
4.4  Summary
This chapter has presented two new techniques, locality grouping and data packing,
which are the run-time versions of computation fusion and data grouping. Two goals
have been achieved: to find the optimizations that are cost-effective at run time, and
to use a compiler to automate these optimizations and to reduce their overhead.
Locality grouping brings together all the computation units involving the same
data element. Its time and space cost is linear in the number of computation units. It is the least expensive of all existing reordering schemes, yet the simulation study shows it to be very powerful, leaving little room for additional improvement by more expensive methods. Furthermore, locality grouping is vital for the subsequent
data transformation to be effective.
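As a concrete illustration of this linear cost, the following C sketch performs locality grouping with a counting sort over interaction pairs; the pair representation and the names are assumptions for illustration, not taken from the implementation evaluated here.

#include <stdlib.h>

/* Locality grouping by counting sort: all pairs whose first object is the
 * same end up adjacent in the new order.  Time and space are linear in
 * the number of pairs and objects. */
void locality_group(const int (*pairs)[2], int num_pairs,
                    int num_objects, int (*out)[2])
{
    int *count = calloc(num_objects + 1, sizeof(int));
    for (int p = 0; p < num_pairs; p++)
        count[pairs[p][0] + 1]++;              /* histogram, shifted by one */
    for (int i = 0; i < num_objects; i++)
        count[i + 1] += count[i];              /* prefix sums: group offsets */
    for (int p = 0; p < num_pairs; p++) {
        int pos = count[pairs[p][0]]++;
        out[pos][0] = pairs[p][0];
        out[pos][1] = pairs[p][1];
    }
    free(count);
}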
Data packing clusters simultaneously used data elements into adjacent memory
locations. Since optimal data packing is NP-complete, the chapter presented and
evaluated three heuristic-based packing algorithms and found that simple consecutive
packing (at linear time and space cost) performs best when carried out after locality
grouping. When evaluated on a real data set, the combined computation and data
transformation reduced memory traffic by a factor of over 200.
More importantly, this chapter has described general compiler support for run-time
data transformations such as data packing. The core is an analysis algorithm for
detecting the structure of indirect data access in a program. This compiler analysis
serves two important purposes. The first is to automatically identify all opportunities
for data packing. The second is to enable two optimizations, pointer update and array
alignment, which eliminate data indirections after run-time data relocation. After a
new data layout is constructed by data packing, pointer update modifies the content
of an access array to redirect it to point to the new data layout. Array alignment
transforms other related arrays into the same data layout as the packed arrays so
that the full traversal of both groups of arrays can be made contiguous.
Chapter 5
Performance Tuning and Prediction
5.1  Introduction
The preceding chapters have developed automatic compiler transformations that minimize the overall memory transfer. Although effective, automatic optimizations are
not fully satisfactory for two reasons. First, a compiler may fail to optimize some
part of a program because of either imprecise analysis or imperfect transformations.
As compilers are taking an increasingly important role in optimizing the deep and
complex memory hierarchy, their failure also becomes more dangerous and may lead
to serious performance slowdown. The second limitation of compiler optimizations is
their inability to estimate program execution time, although such estimates would be extremely helpful in subsequent parallelization and task scheduling.
To overcome these two limitations, a compiler needs to provide support for performance tuning and prediction. The former detects and locates performance problems
and therefore allows effective user tuning; the latter estimates the execution time of
various program units and thus enables efficient concurrent execution.
In the past, various performance tools have been developed to support user tuning and performance prediction. However, existing tools have not been effective in
practice because they either do not consider the memory hierarchy or do so by pursuing the difficult measurement of memory latency. Since the exposed latency of a memory reference is determined by many factors of both the machine and the program, previous tools have had to rely on detailed machine simulations. Not only are simulations expensive, machine-specific and error-prone, but they also cannot predict memory hierarchy
performance.
To provide a practical performance tool, this chapter investigates a bandwidth-based approach. Because memory bandwidth is the bottleneck, program performance
is largely determined by its memory bandwidth utilization. On the one hand, memory bandwidth utilization determines machine utilization because the consumption
of the bottleneck resource determines the consumption of the whole system. On the
other hand, memory transfer time determines program execution time because the
time spent on crossing the bottleneck is the time spent on crossing the whole system.
Therefore, we can monitor program performance based on its memory bandwidth utilization, and we can predict its performance based on its memory transfer time. Based
on these two observations, the following sections present the design of a bandwidth-based performance tool, describe its use in performance tuning and prediction, and
discuss several extensions for more accurate analysis.
5.2  Bandwidth-based Performance Tool
The bandwidth-based performance tool takes as input a source program along with its inputs and the parameters of the target machine. It first estimates the total amount of
data transfer between memory and cache. This figure is then used to either predict the
performance without running the program or locate memory hierarchy performance
problems given the actual running time. Figure 5.1 shows the structure of the tool,
as well as its inputs and outputs.
5.2.1  Data Analysis
The core support of the tool is the data analysis that estimates the total amount of
data transfer between memory and cache. First, a compiler partitions the program
into a hierarchy of computation units. A computation unit is defined as a segment of the program that accesses more data than the cache holds. Given a loop structure, a compiler can automatically measure the bounds and stride of array accesses through, for example, the interprocedural bounded-section analysis developed by Havlak and Kennedy [HK91]. The bounded array sections are then used to calculate the total amount of data accessed and to determine whether that amount is greater than the size of the cache. The additional memory transfer due to cache interference can be approximated by the technique given by Ferrante et al. [FST91]. Once a program is partitioned into a hierarchy of computation units, a bottom-up pass summarizes the total amount of memory access for each computation unit in the program hierarchy, up to the root node, which is the whole program.
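A simplified sketch of this bottom-up pass, assuming each computation unit already carries the number of bytes its own loops touch (as computed from the bounded array sections); the structure layout is hypothetical.

#define MAX_CHILDREN 16

/* A node in the hierarchy of computation units. */
typedef struct unit {
    double local_bytes;               /* data touched by this unit's own loops */
    int num_children;
    struct unit *children[MAX_CHILDREN];
} unit;

/* Bottom-up summarization: the estimated memory transfer of a unit is its
 * own data volume plus that of its children.  A finer model would clip
 * reused data at the cache size and add an interference estimate. */
double total_transfer(const unit *u)
{
    double bytes = u->local_bytes;
    for (int i = 0; i < u->num_children; i++)
        bytes += total_transfer(u->children[i]);
    return bytes;
}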
Since exact data analysis requires precise information on the bounds of loops and
coefficients of array access, the analysis step needs to have run-time program input
to make the correct estimation, especially for programs with varying input sizes. In
certain cases, however, the number of iterations is still unknown until the end of
execution. An example is an iterative refinement computation, whose termination point is determined by a convergence test at run time. In these cases, the analysis can represent the total amount of memory access as a symbolic formula with the number of iterations as an unknown term. A compiler can still successfully identify the amount of data access within each iteration and provide performance tuning and prediction at the granularity of one computation iteration.

[Figure: the tool takes a source program, run-time program input (data sizes and loop counts) and machine parameters (maximal available bandwidth, cache size, cache-line size). It estimates the total amount of data transfer between memory and cache, then either identifies low-performance regions given the application running time, feeding the optimizing compiler, program restructuring tools and the user, or predicts performance for a machine scheduler.]
Figure 5.1 Structure of the performance tool
5.2.2  Integration with Compiler
Since all data-analysis steps are performed statically, the performance tool can be
integrated into the program compiler. In fact, an optimizing compiler may already
have these analyses built in, so including this tool in the compiler is not only feasible
but also straightforward. Although the tool requires additional information about the
run-time program inputs, the data analysis can proceed at compile time with symbolic
program inputs and then re-evaluate the symbolic results before execution.
The integration of the tool into the compiler is not only feasible but also profitable for both the tool and the compiler. First, the tool should be aware of certain
compiler transformations such as data-reuse optimizations because they may change
the actual amount of memory transfer. The most notable is global fusion and data
grouping, presented in previous chapters, which can radically change the structure of
both the computation and data of a program and can reduce the overall amount of
memory transfer by integral factors. The performance tool must be aware of these high-level transformations in order to obtain an accurate estimate of memory transfer.
In addition to helping data analysis, the integration of the tool helps the compiler
to make better optimization decisions. Since the tool has the additional knowledge
of the program inputs, it can supply this information to the compiler. The precise knowledge of run-time data and machine parameters is often necessary for certain
compiler optimizations such as cache blocking and array padding. Therefore, the
integration of the compiler and the performance tool improves not only the accuracy
of the performance tool but also the effectiveness of the compiler.
5.3  Performance Tuning and Prediction
Performance Tuning
In bandwidth-based performance tuning, a compiler searches for computation units
that have abnormally low memory bandwidth utilization. Because of the memory bandwidth bottleneck, low bandwidth utilization implies low utilization of all other hardware resources, signaling an opportunity for tuning. A compiler can
automatically identify all such tuning opportunities in the following two steps.
1. The first step executes the program and collects the running time of all its computation units. The achieved memory bandwidth is calculated by dividing the data transfer of each computation unit by its execution time. The achieved memory bandwidth is then compared with the machine memory bandwidth to obtain the bandwidth utilization.
2. The second step singles out the computation units that have low memory bandwidth utilization as candidates for performance tuning. For each candidate, the tool calculates the potential performance gain as the difference between the current execution time and the predicted execution time assuming full bandwidth utilization. The tuning candidates are ordered by their potential performance gain and then presented to a user, as sketched below.
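Both steps reduce to simple arithmetic once the per-unit data volumes and running times are known. The sketch below uses assumed field names and an arbitrary utilization threshold; sorting the reported candidates by potential gain would complete the second step.

#include <stdio.h>

typedef struct {
    const char *name;
    double bytes;       /* estimated memory transfer of the unit */
    double seconds;     /* measured running time of the unit     */
} comp_unit;

/* Report bandwidth utilization and the potential gain from reaching full
 * bandwidth for every computation unit whose utilization is low. */
void find_tuning_candidates(const comp_unit *units, int n,
                            double machine_bw /* bytes per second */)
{
    for (int i = 0; i < n; i++) {
        double achieved_bw = units[i].bytes / units[i].seconds;
        double utilization = achieved_bw / machine_bw;
        double ideal_time  = units[i].bytes / machine_bw;
        double potential   = units[i].seconds - ideal_time;
        if (utilization < 0.5)          /* threshold is an assumption */
            printf("%s: %.0f%% utilization, up to %.2f s to gain\n",
                   units[i].name, 100.0 * utilization, potential);
    }
}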
Bandwidth-based performance tuning requires no special hardware support or
software simulation. It is well suited for different machines and compilers because the
use of actual running time includes the effect of all levels of compiler and hardware
optimizations. Therefore, it is not only automatic, but also accurate and widely
applicable.
Bandwidth-based performance tuning does not necessarily rely on compiler-directed
data analysis when applied on machines with hardware counters such as MIPS R10K
and Intel Pentium III. The hardware counters can accurately measure the number
and the size of memory transfers. With these counters, bandwidth-based tuning can
be applied to programs that are not amenable to static compiler analysis. However,
compiler analysis should be used whenever feasible for three reasons. First, source-level analysis is necessary to partition a program into computation units and to help a user understand the performance at the source level. Second, static analysis is more accurate for tuning because it can identify the problem of excessive conflict misses, while hardware counters cannot distinguish different types of misses. Third, the compiler-directed analysis is more portable because it can be applied to all machine architectures, including those without hardware counters.
Performance Prediction
When a program uses all or most of the machine memory bandwidth, its execution time can be predicted by its estimated memory-transfer time, that is, by dividing the total amount of memory transfer by the available memory bandwidth. This bandwidth-based prediction is simple, accurate and widely applicable across machines, applications and parallelization schemes.
The assumption that a program utilizes all available bandwidth is not always
true—some parts of the program may have a low memory throughput even after
performance tuning. However, low memory throughput should happen very infrequently and it should not seriously distort the overall memory bandwidth utilization.
The variations in the overall utilization should not introduce large errors into performance prediction. Otherwise, the program must have a performance bottleneck
other than memory bandwidth. The next section discusses techniques for detecting
other resource bottlenecks such as loop recurrence or bandwidth saturation between
caches.
5.4  Extensions to More Accurate Estimation
Although the latency of arithmetic operations on modern machines is very small
compared to the time of memory transfer, it is still possible that computations in a
loop recurrence may involve so many operations that they become the performance bottleneck. The tool should therefore identify such cases with the computation-interlock analysis developed by Callahan et al. [CCK88].
Excessive misses at other levels of the memory hierarchy can be more expensive than memory transfer. Examples are excessive register loads and stores, higher-level cache misses, and TLB misses. To correctly detect these cases, the performance tool needs to measure the resource consumption at those levels. In fact, the tool can extend its data analysis to measure the number of higher-level cache misses and TLB misses, which are special cases of the existing data analysis.
On a machine with distributed memory modules, memory references may incur
remote data access. When a remote access is bandwidth limited, the tool can estimate its access time with the same bandwidth-based method except that it needs
to consider the communication bandwidth in addition to memory bandwidth. The
bandwidth-based method also needs to model bandwidth contention either at a memory module or in the network. When a remote access is not bandwidth-constrained,
we can train the performance estimator to recognize cases of long memory latency
using the idea of training sets [BFKK91]. The bandwidth-based tuning tool can automatically collect such cases from applications because they do not fully utilize
bandwidth.
Coherence misses in parallel programs should also be measured if they carry a
significant cost. A compiler can detect coherence misses, especially for compiler-parallelized code [McI97].
5.5  Summary
This chapter has presented the design of a bandwidth-based tool for performance tuning
and prediction. The central part of the tool is the compiler analysis that divides a
program into computation phases and estimates the total amount of memory transfer
for each computation phase on a given machine. For performance tuning, the tool
uses the data estimation and program execution time to calculate memory bandwidth
utilization. Then it picks out those computation phases with low memory bandwidth
utilization for further tuning. For performance prediction, the tool approximates program execution time by dividing the total amount of memory transfer by the machine memory bandwidth. To improve the accuracy of the bandwidth-based method and to consider cases where memory bandwidth is not the critical resource, the proposed tool is augmented with additional compiler analysis to monitor the effects of latency and bandwidth constraints in other parts of the computer system.
The bandwidth-based approach promises to be much more efficient, yet more accurate, than previous methods that are based on monitoring memory latency. Instead of simulating individual memory accesses, the new tool assesses the cost of all memory references, that is, the time of all memory transfer. The new tool is also much simpler because it focuses only on very large data structures. By changing from latency-based to bandwidth-based analysis, the tool avoids the cost of previous techniques and enables fast,
accurate, and portable performance tuning and prediction.
Chapter 6
Evaluation
This chapter evaluates the compiler strategy of global and dynamic computation fusion and data grouping. Section 1 describes the compiler implementation. Section
2 explains the experimental setup, which measures two classes of applications: regular programs with structured loops and predictable data access, and irregular and
dynamic programs with unstructured computation and unpredictable data access.
Sections 3 and 4 describe the benchmark applications, the applied transformations and the experimental results for each class of applications. Section 5 evaluates
the bandwidth-based performance tool. Finally, Section 6 summarizes the findings.
6.1  Implementation
The implementation is based on the D Compiler System at Rice University. The compiler performs whole-program compilation given all source files of an input program. It uses a powerful value-numbering package to handle symbolic variables and expressions inside each subroutine and parameter passing between subroutines. It has a standard set of loop and dependence analyses, data-flow analysis and interprocedural analysis. The D Compiler compiles programs written in Fortran 77 and consequently does not handle recursion. However, recursion should present no fundamental obstacle to the new methods described in this dissertation.
This research has implemented loop fusion, array regrouping and dynamic data packing for experimental evaluation. The following sections describe the implementation of these three new techniques.
6.1.1  Maximal Loop Fusion
For each loop, the compiler summarizes its data access by its data footprint. For each
dimension of an array, a data footprint describes whether the loop accesses the whole
dimension, a number of elements on the border, or a loop-variant section (a range
enclosing the loop index variable). Data dependence is tested by the intersection of
footprints. The range information is also used to calculate the minimal alignment
factor between loops.
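One simple way to represent such per-dimension footprints and to test two of them for overlap is sketched below; the representation is chosen only for illustration and collapses the border case into an index range.

#include <stdbool.h>

/* Per-dimension data footprint: the whole dimension, a border strip of
 * elements, or a loop-variant range, each described by [lo, hi]. */
typedef enum { WHOLE_DIM, BORDER, RANGE } footprint_kind;

typedef struct {
    footprint_kind kind;
    int lo, hi;                  /* unused when kind == WHOLE_DIM */
} footprint;

/* Two loops may depend on each other in this dimension only if their
 * footprints can overlap. */
bool footprints_intersect(footprint a, footprint b)
{
    if (a.kind == WHOLE_DIM || b.kind == WHOLE_DIM)
        return true;
    return a.lo <= b.hi && b.lo <= a.hi;      /* interval overlap test */
}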
Loop fusion is carried out by applying the fusion algorithm given in Figure 2.7 level
by level from the outermost to the innermost. The current implementation calculates
data footprints, aligns loops and schedules non-loop statements. Iteration reordering
is not yet implemented but the compiler signals the places where it is needed. Only
one program, Swim, required splitting, which was done by hand.
For multi-level loops, loop fusion orders loop levels to maximize the benefit of
fusion, as specified by the algorithm in Figure 2.8. The first loop level to fuse is the
one that produces the fewest loop nests after fusion. In the experiment, however, this
was largely unnecessary, as computations were mostly symmetric. One exception in
our test cases was Tomcatv, where level ordering (loop interchange) was performed
by hand.
Code generation is a straightforward mapping from the old iteration space to the fused iteration space. Currently, the code is generated by the Omega library [Pug92], which has been integrated into the D compiler system. Omega worked well for small programs, where the compilation time was under one minute for all kernels. For the full application SP, however, code generation took four minutes for one-level fusion but an hour and a half for three-level fusion. In contrast, the fusion analysis took about two minutes for one-level fusion and four minutes for three-level fusion. A direct code generation scheme has been designed whose cost is linear in the number of loop levels, but its implementation is not yet available.
6.1.2  Inter-array Data Regrouping
The analysis for data regrouping is trivial with data footprints. After fusion, data
regrouping is applied level by level on fused loops as specified by the algorithm in
Figure 3.4. However, the implementation makes two modifications. First, the SGI compiler does a poor job when arrays are fully interleaved at the innermost data dimension, so the compiler instead groups arrays up to the second innermost dimension. This restriction may result in grouping at a less desirable dimension, as in the case of Tomcatv. The other restriction is due to a limitation of the Fortran language, which does not allow non-uniform array dimensions. In cases where multi-level regrouping produced non-uniform arrays, manual changes were made not to group at outer data dimensions.
Code generation for array regrouping is semi-automatic. The compiler generates
the choice of regrouping. Then new array declarations are added by hand, and array
references are transformed through the macro processor cpp. This scheme worked
well when array names were consistent and unique throughout the program, which was the case for most of the programs tested. Manual changes were made to Magi to make one data array global instead of passing it as a parameter.
Data regrouping was performed by hand for the irregular and dynamic applications because that experiment was performed before the implementation became available. Since these applications use one-dimensional arrays and perform indirect accesses, one-dimensional grouping is sufficient. Therefore, the manual approximation of data
regrouping was trivial.
6.1.3  Data Packing and Its Optimizations
The implementation for data packing and its optimization follows the algorithm described in Section 4.3.2 and the steps illustrated in Figure 4.11. The compiler recognizes the structure of indirect access in a program and identifies all packing opportunities. A one-line user directive is used to specify which array should be packed,
as well as where and how often it should be packed. The two packing optimizations,
pointer update and array alignment, are applied automatically.
The current implementation does not work on programs where the indirect access
sequence is incrementally computed because the one-line directive requires the existence of a full access sequence. A possible extension would be to allow a user to specify
a region of computation in which to apply packing so that the compiler can record the
full access sequence at run time. The other restriction of the current implementation
is due to conservative handling of array parameter passing. For each subroutine with
array parameters, the implementation does not allow two different array layouts to
be passed to the same formal parameter. This problem can be solved by propagating
array layout information in a way similar to interprocedural constant propagation or
data type analysis and then cloning the subroutine for each reachable array layout.
In the programs encountered, however, there was no need for such cloning.
6.2  Experimental Design
All programs are measured on one of three SGI machines: an SGI Origin2000 with R10K processors, an SGI Origin2000 with R12K processors, and an SGI O2 with an R10K
processor. Both R12K and R10K provide hardware counters that measure cache
misses and other hardware events with high accuracy. All machines have two caches:
L1 is 32KB in size and uses 32-byte cache lines, L2 uses 128-byte cache lines, and the
size of L2 is 1MB for O2 and 4MB for Origin2000 (with either R10K or R12K). Both
caches are two-way set associative. All processors achieve good latency hiding as
a result of dynamic, out-of-order instruction issuing and compiler-directed prefetching. All applications are compiled with the highest optimization flag and prefetching
turned on (f77 -n32 -mips4 -Ofast); the only exception is Magi, for which the user-specified flags are -mp -64 -r10000 -O3 -SWP:=ON -mips4 -OPT:IEEE_arithmetic=3:roundoff=3:alias=restrict. The SGI compiler is MIPSpro Version 7.30.
The experiment on irregular and dynamic applications used the slower R10K
processor on Origin2000 because the newer machine with R12K was not available at
the time of the experiment; the same is true for the evaluation of performance tuning
and prediction. The later evaluation of regular programs used Origin2000 with R12K
processors. In addition, it used the SGI O2 for a direct comparison with earlier work by another group.
All transformations preserve dependences except locality grouping, which is applied only once to one application. All optimized programs produce output identical to that of their unoptimized versions.
The effect of the optimizations is measured by execution time and the number of L1,
L2 and TLB misses. Cache misses represent only memory read traffic; therefore,
they are not equal to the amount of bandwidth consumption. However, the improved
cache reuse has similar effects on memory reads as it does on memory writebacks.
Data regrouping should eliminate unnecessary writebacks, but the function is disabled
because it does not regroup at the innermost data dimension. Since none of the evaluated techniques optimizes specially for memory writebacks, they are not measured
in this experiment. Nevertheless, all the executables of the original and the optimized
programs have been preserved, and any additional measurement can be made in the
future if necessary.
6.3  Effect on Regular Applications
This section evaluates the global strategy of maximal loop fusion as the first step
and inter-array data regrouping as the second. The programs are written in loop and
array structures with a data access pattern that is mostly compile-time predictable.
Irregular and dynamic applications are measured in the next section, where dynamic
optimizations are tested.
6.3.1  Applications
Loop fusion and data regrouping are tested on four applications described in Figure 6.1.
The applications come from the SPEC and NAS benchmark suites except ADI, which is a
self-written kernel with separate loops processing boundary conditions. Since all programs use iterative algorithms, only the loops inside the time-step loop are counted.
name      source                input size             No. lines   loops (levels)   arrays
Swim      SPEC95                513x513                429         6 (1-2)          15
Tomcatv   SPEC95                513x513                221         5 (1-2)          7
ADI       self-written          2Kx2K                  108         6 (1-2)          3
SP        NAS/NPB Serial v2.3   class B, 3 iterations  1141        67 (2-4)         15
Figure 6.1 Descriptions of regular applications
6.3.2  Transformations Applied
Loop fusion and data regrouping have been applied to all programs. In addition,
an input program is processed by four preliminary transformations before applying
loop fusion. The first is procedure inlining, which brings all computation loops into
a single procedure. The next is array splitting and loop unrolling, which eliminates
data dimensions of a small constant size and the loops that iterate over those dimensions.
The third step is loop distribution. Finally, the last step propagates constants into
loop statements. The compiler performs loop unrolling and constant propagation
automatically. Currently, array splitting requires a user to specify the names, and
inlining is done by hand. However, both transformations can be automated with
additional implementation.
6.3.3  Effect of Transformations
The effect of optimizations is shown in Figure 6.2. All results are collected on
Origin2000 with R12K processors except Swim, which is on SGI O2. The graphs
for the first three applications show four sets of bars: the original performance (normalized to 1), the effect of only loop fusion, the effect of only data grouping, and the
effect of loop fusion plus data regrouping. For SP, one additional set of bars is used to
show the effect of fusing one loop level instead of fusing all loop levels. The execution
time and original miss rates are also given in the figures; however, reductions are
measured in the number of misses, not in the miss rate.
[Four bar graphs, one per application: Swim 513x513, Tomcatv 513x513, ADI 2Kx2K, and NAS/SP class B with 3 iterations. Each graph shows execution time and L1, L2 and TLB misses on a normalized scale for the original program, computation fusion only, data grouping only, and fusion plus grouping; the SP graph distinguishes one-level and three-level fusion. The original execution times and miss rates are annotated on the bars.]
Figure 6.2 Effect of transformations on regular applications
The performance of Swim is reported for SGI O2 because it has the same cache
configuration as SGI Octane, a machine used in the work of iteration slicing by Pugh
and Rosser [PR99]. Maximal loop fusion fuses all loop nests with the help of loop
splitting and achieves the same improvement (10%) as Pugh and Rosser reported for
iteration slicing. The succeeding data grouping, shown by the fourth bar in each
cluster, merges 13 arrays into 3 and shortens execution time by 2% more because of
the additional reduction in L1 and TLB misses.
Compared to data regrouping with loop fusion, data regrouping without loop
fusion (shown by the third bar of each cluster) merges 13 arrays into 5 and results in
similar reductions on L2 and TLB but causes 6% more L1 misses. Data regrouping
without loop fusion achieves the shortest execution time, a reduction of 16%. The
reason it performs better than the combined strategy is that the benefit of loop fusion
does not outweigh its instruction overhead when the data size is as small as in this
program. When the input size is large, the benefit of loop fusion should become
dominant. We will use much larger programs for the last two test cases.
One side effect of data regrouping seen in this program is a 61% increase in the
number of graduated register loads. The overhead shows that source-level data regrouping confuses the back-end compiler and results in less effective register allocation. However, the problem should disappear if data regrouping is applied by the
back-end compiler itself. Among all test cases, Swim is the only one where data
regrouping causes an increased amount of register traffic.
Compared to SGI O2, running Swim on Origin2000 incurs 66% fewer L2 misses
because of the larger L2 cache on Origin2000. Loop fusion decreases performance by
6% on Origin2000, but the loss is recovered by data regrouping. Data regrouping
without loop fusion reduces cache and TLB misses but increases register loads by
35%. Since cache is not the bottleneck for this data size on Origin2000, the increase
in register traffic has a dramatic effect, increasing execution time by 16%. However,
as explained before, data regrouping will not suffer from this overhead if it is applied
by the back-end compiler.
Tomcatv has two pipelined computations progressing along reverse directions, so
multi-level loop fusion permutes non-conflicting loops to the outside to enable fusion
at the outer loop level. Data regrouping merges 7 arrays into 4. As before, the
arrays are grouped at the outer data dimension instead of the inner data dimension in
order to avoid the poor code generation of the SGI compiler (see Section 6.1.2).
Figure 6.2 shows the program with an input size of 513x513. Loop fusion alone
decreases performance by 1%, but the combined transformation reduces L1 misses
by 5%, L2 misses by 20% and overall execution time by 16%. Data regrouping after
loop fusion increases TLB misses by 3% because of the side effect of grouping at the
outer data dimension. The side effect is small and becomes visible only because of
the extremely low TLB miss rate of this program, which is 0.03%.
Data regrouping without loop fusion groups 7 arrays into 5 and reduces L1 misses
by 32%, L2 misses by 12%, TLB misses by 6% and as a result reduces execution time
by 18%. These reductions are larger than those of regrouping with loop fusion except
for L2 misses, where the combined transformation eliminates 8% more misses with
the help of loop fusion. Although the regrouping is similar with and without loop
fusion, the effect on cache and TLB misses differs significantly for two reasons. The
first is that the input data size is small, and the overhead of loop fusion is pronounced.
The other reason is that after loop interchange, the inner loops iterate through the
outer data dimension, making memory performance sensitive to small changes in
data layout. For example, regrouping with fusion increases TLB misses by 3%, but
regrouping without fusion decreases TLB misses by 6%. These variations should not
occur if arrays are regrouped at the inner data dimension by the back-end compiler. In
addition, data regrouping can also permute data dimensions. However, determining
the best order for data dimensions is an NP-complete problem. The current algorithm
is conservative and does not allow transformations like this because they may be
detrimental to overall performance.
The original data input is 257x257 for Tomcatv. At this small size on Origin2000,
Tomcatv exhibits behavior similar to Swim: loop fusion decreases performance by 2%
but data regrouping recovers the loss and improves performance by 1%. On SGI O2,
loop fusion decreases performance by 1%, but the combined transformation improves
performance by 5%.
ADI uses the largest input size and consequently enjoys the highest improvement.
The reduction is 39% for L1 misses, 44% for L2 and 56% for TLB. The execution
time is reduced by 57%, a speedup of 2.33. Since only three arrays are used in the
program, data regrouping has little benefit on L2, TLB and the execution time, but
it reduces L1 misses by 20%. Without loop fusion, however, data regrouping finds no opportunity to merge any arrays. Therefore, regrouping without loop fusion has no effect
on performance.
Program changes for SP. SP is a full application and deserves special attention in
evaluating the global strategy. The main computation subroutine, adi, uses 15 global
data arrays in 218 loops, organized in 67 nests (after inlining). Loop distribution and
loop unrolling result in 482 loops at three levels—157 loops at the first level, 161 at
the second, and 164 at the third. One-level loop fusion merges 157 outer-most loops
into 8 loop nests. The performance is shown by the second bar in the lower-right
graph of Figure 6.2. The full fusion further fuses loops in the remaining two levels
and produces 13 loops at the second level and 17 at the third. The performance of
full fusion is shown by the third bar in the graph.
The fourth and fifth sets of bars show the effect of data regrouping with and
without loop fusion. The original program has 15 global arrays. Array splitting
resulted in 42 arrays. After full loop fusion, data regrouping combines 42 arrays
into 17 new ones. The choice of regrouping is very different from the specification
given by the programmer. For example, the third new array consists of four original arrays: {ainv(N, N, N), us(N, N, N), qs(N, N, N), u(N, N, N, 1 − 5)}, and the
15th new array includes two disjoint sections of an original array: {lhs(N, N, N, 6 −
8), lhs(N, N, N, 11 − 13)}.
One-level fusion increases L1 misses by 5%, but reduces L2 misses by 33% and
execution time by 27%, signaling that the original performance bottleneck was
memory bandwidth. Fusing all levels eliminates half of the L2 misses (49%). However,
it creates too much data access in the innermost loop and causes 8 times more TLB
misses. The performance is slowed by a factor of 2.32. Data regrouping, however,
merges related data in contiguous memory and achieves the best performance. It
reduces L1 misses by 20%, L2 by 51% and TLB by 39%. The execution time is
shortened by one third (33%), a speedup of 1.5 (from 64.5 Mf/s to 96.2 Mf/s).
Without loop fusion, however, data regrouping can merge only two original arrays,
lhs(N, N, N, 7) and lhs(N, N, N, 8). Consequently, data regrouping without loop fusion obtains only a modest improvement. It reduces L1 misses by 4%, L2 misses by
8%, and execution time by 7%. It actually increases TLB misses by 15%.
Recall that the motivation for maximal loop fusion comes from the simulation
study on reuse distances in Section 2.2, where reuse-driven execution on a perfect
machine found a large benefit from fusing computations on the same data. The
purpose of loop fusion, then, is to realize this benefit on a real machine. Figure 6.3
compares the effect of loop fusion with that of reuse-driven execution on SP. It shows
the reuse-distance curves of three versions of SP: the original program order, the reuse-driven execution order, and the transformed program order after maximal loop fusion. Maximal loop fusion reduces evadable reuses by 45%, which is not as good as the 63% reduction by the ideal reuse-driven execution. However, maximal fusion does realize a fairly large portion of its potential. Furthermore, the reduction in evadable reuses is very close to the reduction in L2 misses on Origin2000 (51%), indicating that
the measurement of reuse distance closely matches L2 cache performance on a real
machine.
[Two graphs, for NAS/SP with 14x14x14 and 28x28x28 inputs. Each plots the number of references (in thousands) against reuse distance (log scale, base 2) for the program order, reuse-based fusion, and reuse-driven execution.]
Figure 6.3 Reuse distances of NAS/SP after maximal fusion
The above evaluation has verified the effectiveness of the global optimization strategy. Maximal loop fusion realizes much of the potential benefit of computation fusion
and brings together data reuses among all parts of a program. Data regrouping eliminates the overhead of loop fusion and translates the reduction in memory traffic into a reduction in execution time. Together, these two techniques significantly improve
global cache reuse for the whole program.
6.4  Effect on Irregular and Dynamic Applications
In irregular and dynamic applications, the content of and access order to certain data
structures are unknown until run time and may change during the execution. This
section evaluates the effectiveness of dynamic optimizations, developed in Chapter 4.
The global optimization, data regrouping, is also applied.
6.4.1  Applications
Figure 6.4 lists the four irregular and dynamic applications used in the evaluation,
along with their description, programming language and code size. Three scientific
simulation applications from molecular dynamics, structural mechanics and hydrodynamics are used. Despite the difference in their physical model and computation,
they have similar dynamic data access patterns in which objects interact with their
neighbors. Moldyn and mesh are well-known benchmarks. A large input data set
is used for moldyn with random initialization. Mesh has a user-supplied input set.
Magi is a full, real-world application consisting of almost 10,000 lines of Fortran code.
In addition to the three simulation programs, a sparse-matrix benchmark is included
to show the effect of packing on irregular data accesses in such applications.
name     description                          source               language   lines
Moldyn   molecular dynamics simulation        Chaos group          f77        660
Mesh     structural simulation                Chaos group          C          932
Magi     particle hydrodynamics               DoD                  f90/f77    9339
NAS-CG   sparse matrix-vector multiplication  NAS/NPB Serial v2.3  f77        1141
Figure 6.4 Descriptions of irregular and dynamic applications
application   input size                                        source of input                exe. time
Moldyn        256K particles, 27.4M interactions, 1 iteration   random initialization          53.2 sec
Mesh          9.4K nodes, 60K edges, 20 iterations              Chaos group                    8.14 sec
Magi          28K particles, 253 cycles                         DoD                            885 sec
NAS-CG        14K non-zero entries, 15 iterations               NASA/NPB Serial 2.3, Class A   48.3 sec
Figure 6.5 Input sizes of irregular and dynamic applications
Figure 6.5 gives the input size for each application, the sources of the data inputs,
and the execution time before applying optimizations. The working set is significantly
larger than the L1 cache for all applications. Mesh, Magi and NAS-CG are a little larger than L2. Moldyn has the largest data input and its data size is significantly greater than the size of L2.
6.4.2  Transformations Applied
The transformations were applied in the following order: locality grouping, data regrouping, dynamic data packing and packing optimizations. Since the access sequence
is already transformed by locality grouping, consecutive packing is used for all cases
because of the observation made in Section 4.2.2. (One test case, NAS-CG, accesses each element only once, so consecutive packing is optimal for it.)
Figure 6.6 lists, for each application, the optimizations applied and the program
components measured. Each of the base programs came with one or more of the three
optimizations done by hand. Such cases are labeled with a ‘+’ sign in the table. The
‘V’ signs indicate the optimizations added, except in the case of NAS-CG. The base
program of NAS-CG came with data packing already done by hand, but I removed it for the purpose of demonstrating the effect of packing. I do not consider hand-applied packing practical because of the complexity of transforming tens of arrays repeatedly at run time in a large program.
                    optimizations applied                          program components
application    locality grouping   regrouping   packing      measured
Moldyn         +                   V            V            function Compute_Force()
Mesh           V                   no effect    V            full application
Magi           +                   V            V            full application
NAS-CG         n/a                 no effect    V/+          full application
Figure 6.6 Transformations applied to irregular and dynamic applications
Locality grouping and data regrouping were inserted by hand. Data regrouping
was applied to all programs but found opportunities only for Moldyn and Magi. Data
packing of moldyn and CG was performed automatically by our compiler given a one-line packing directive. The same compiler packing algorithm was applied to mesh by hand because our compiler infrastructure cannot yet compile C. Unlike the other programs, Magi is written in Fortran 90 and computes the interaction list incrementally. I slightly modified the source to let it run through the Fortran 77 front-end and inserted a loop to collect the overall data access sequence. Then our compiler successfully applied the base packing transformation to the whole program. The application of the two compiler optimizations was semi-automatic: I inserted a 3-line loop to perform pointer update, and I annotated a few dependence-free loops which otherwise would not be recognized by the compiler because of the procedure calls inside them. All other transformations were performed by the compiler. The optimized packing reorganizes a total of 45 arrays in magi.
The original program is referred to as the base program and the transformed version as the optimized program. For NAS-CG, the base program refers to the version
with no packing. Dynamic data packing is applied only once in each application
except magi, where data are repacked every 75 iterations.
6.4.3  Effect of Transformations
The four graphs of Figure 6.7 show the effect of the three transformations. The first plots the effect of the optimizations on execution speed. The first bar of each application is the performance of the base version, normalized to 1. The other bars show the performance after applying each transformation. Since not all transformations are necessary, an application may not have all three bars. The second bar, if shown, gives the speedup of locality grouping. The third and fourth bars give the speedup due to data regrouping and data packing. The other three graphs are organized in the same way, except that they show the reduction in the number of L1, L2, and TLB misses. The graphs include the miss rate of the base program, but the reduction is in the total number of misses, not in the miss rate.
Effect of Locality Grouping and Data Regrouping
Locality grouping eliminates over half of the L1 and L2 misses in mesh and improves performance by 20%. In addition, locality grouping prepares the program for data packing, which further reduces L1 misses by 35%. Without the locality-grouping step, however, consecutive packing not only brings no improvement but also incurs 5% more L1 misses and 54% more L2 misses. This confirms the observation from our simulation study that locality grouping is critical for the later data optimization to be effective.
Data regrouping significantly improves moldyn and magi. Magi has multiple computation phases; data regrouping splits 22 arrays into 26 and regroups them into 6 new arrays. As a result of better spatial reuse, the execution time is improved by a factor of 1.32 and cache misses are reduced by 38% for L1, 17% for L2, and 47% for TLB. By contrast, merging all 26 arrays improves performance by only 12%, reduces L1 misses by 35%, and, as a side effect, increases L2 misses by 32%. Data regrouping is even more effective on moldyn, eliminating 70% of L1 and L2 misses and almost doubling the execution speed.
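The effect of data regrouping can be pictured with a small C example. The arrays below are made up for illustration and are not the actual arrays of moldyn or magi; the point is that values always referenced by the same phase end up on the same cache lines.

    #define N 100000

    /* Before regrouping: three arrays that one computation phase always
     * touches together, each in its own region of memory. */
    double x[N], vx[N], fx[N];

    /* After regrouping: the three values of each element are interleaved,
     * so the loop below fetches one cache line per element instead of
     * one line from each of three separate arrays. */
    struct particle { double x, vx, fx; } p[N];

    void integrate(double dt)
    {
        for (int i = 0; i < N; i++) {
            p[i].vx += p[i].fx * dt;
            p[i].x  += p[i].vx * dt;
        }
    }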
Effect of Dynamic Data Packing
Data packing is applied to all four applications after locality grouping and data regrouping. It further improves performance in all cases. For moldyn, packing improves
performance by a factor of 1.6 and reduces L2 misses by 21% and TLB misses by 88%
over the version after data regrouping. For NAS-CG, the speedup is 4.36 and the amount of reduction is 44% for L1, 85% for L2, and over 97% for TLB.

Figure 6.7 Effect of transformations on irregular and dynamic applications (four panels: normalized execution speed and normalized L1, L2, and TLB misses for moldyn, mesh, magi, and NAS-CG, shown for the original version and after adding locality grouping, data regrouping, and data packing; original miss rates are given in parentheses)
For mesh after locality grouping, packing slightly improves performance and reduces misses by an additional 3% for L1 and 35% for L2. The main reason for the modest improvement in L1 is that the data granularity (24 bytes) is close to the size of L1 cache lines (32 bytes), leaving little room for additional spatial reuse. In addition, packing is directed by the traversal of edges, which does not work as well for the traversal of faces. The number of L1 misses is reduced by over 6% during edge traversals, but the reduction is less than 1% during face traversals. Since the input
data set almost fits in L2, the significant reduction in L2 misses does not produce a
visible effect on the execution time.
When applied after data regrouping on magi, packing speeds up the computation
by another 70 seconds (12%) and reduces L1 misses by 33% and TLB misses by 55%.
Because of the relatively small input data set, L2 and TLB misses are not a dominant
factor in performance. As a result, the speed improvement is not as pronounced as
the reduction in these misses.
Overall, packing achieves a significant reduction in the number of cache misses, especially for L2 and TLB, where opportunities for spatial reuse are abundant. The
reduction in L2 misses ranges from 21% to 84% for all four applications; the reduction
in TLB misses ranges from 55% to 97% except for mesh, whose working set fits in
TLB.
Packing Overhead and the Effect of Compiler Optimizations
The cost of dynamic data packing comes from the overhead of data reorganization and the cost of indirect memory accesses. The time spent in packing has a negligible effect on performance in all three applications measured. Packing time is 13% of the time of one computation iteration in moldyn and 5.4% in mesh. When packing is applied once every 20 iterations, the cost is less than 0.7% in moldyn and 0.3% in mesh. Magi packs data every 75 iterations and spends less than 0.15% of its time in packing routines.
The cost of data indirection after packing can be mostly eliminated by two compiler optimizations described in Section 4.3.1. Figure 6.8 shows the effect of these
two compiler optimizations on all four applications tested.
The upper-left graph shows that, for moldyn, the indirections (that can be optimized away) account for 10% of memory loads, 22% of L1 misses, 19% of L2 misses
and 37% of TLB misses. After the elimination of the indirections and the references
to the run-time map, execution time was reduced by 27%, a speedup of 1.37. The
improvement in mesh is even larger. In this case, the indirections account for 87%
of the loads from memory, in part because mesh is written in C and the compiler
does not do a good job of optimizing array references. Since the excessive number of
memory loads dominates execution time, the compiler optimizations achieve a similar reduction (82%) in execution time. The number of loads is increased in magi
after the optimizations because array alignment transforms 19 more arrays than the base packing, and not all indirections to these arrays can be eliminated.

Figure 6.8 Effect of compiler optimizations for data packing (normalized execution time, memory loads, and L1, L2, and TLB misses with and without the two packing optimizations, for moldyn, mesh, magi, and NAS-CG)

Despite the
increased number of memory loads, the cache misses and TLB misses are reduced
by 10% to 33%, and the overall speed is improved by 8%. For NAS-CG, the compiler recognizes that matrix entries are accessed in stride-one fashion and consequently replaces the indirect accesses with a direct stride-one iteration over the reorganized data array. The transformed matrix-vector multiply kernel accesses data as efficiently as the original hand-coded version. As a result, the number of loads and cache misses is reduced by 23% to 50%. The TLB working set fits in the machine's TLB after the optimizations, removing 97% of TLB misses. The execution time is reduced by 60%, a speedup of 2.47.
6.5 Effect of Performance Tuning and Prediction
This section evaluates bandwidth-based performance tuning and prediction on a well-known benchmark application, SP from NASA, on an SGI Origin2000 with R10K processors. NAS/SP is a complete application with over 20 subroutines and 3000 lines of Fortran77 code. Since the program consists of sequences of regular loop nests, it is partitioned into two levels of computation phases: subroutines and then loop nests. Class-B input is used and only three iterations are run to save experiment time.
Since the implementation of the compiler analysis was not complete at the time of the experiment, the following evaluation does not include compiler estimation of memory transfer. Instead, it uses manual approximation and machine hardware counters.
Performance Tuning
Tuning opportunities are identified by measuring the bandwidth utilization of each subroutine and each loop nest. Hardware counters are used to estimate the total amount of memory transfer. The table in Figure 6.9 lists the effective memory bandwidth of seven major subroutines, which represent 95% of the overall running time.
Subroutine     Achieved BW    BW Utilization
compute_rhs    252MB/s        84%
x_solve        266MB/s        89%
y_solve        197MB/s        66%
z_solve        262MB/s        87%
lhsx           321MB/s        107% [1]
lhsy           279MB/s        93%
lhsz           96MB/s         32%
Figure 6.9 Memory bandwidth utilization of NAS/SP
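The effective bandwidth of each subroutine is obtained by dividing its estimated memory transfer by its running time. Below is a minimal sketch of that calculation, assuming the transfer is approximated from L2 miss and writeback counts and a 128-byte L2 line; the structure and field names are illustrative, not the actual counter interface.

    /* Effective bandwidth and utilization of one program unit, estimated
     * from hardware-counter readings.  The 128-byte line size and the
     * counter fields are assumptions for illustration. */
    struct hw_counts { double l2_misses, l2_writebacks, seconds; };

    double achieved_bw(struct hw_counts c)         /* bytes per second */
    {
        double bytes = (c.l2_misses + c.l2_writebacks) * 128.0;
        return bytes / c.seconds;
    }

    double utilization(struct hw_counts c, double sustainable_bw)
    {
        return achieved_bw(c) / sustainable_bw;    /* e.g., 0.84 for 84% */
    }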
The last column of the table in Figure 6.9 shows that all subroutines utilized 84% or more of the memory bandwidth except y_solve and lhsz. Their low memory bandwidth utilization prompted the need for user tuning. Subroutine lhsz had the largest potential gain from performance tuning. The subroutine has three loop nests; all had normal bandwidth utilization except the first one, which had an extremely low bandwidth utilization of less than 11%. Inspection by a user (myself) revealed that the problem was due to excessive TLB misses. Manual application of array expansion and loop permutation eliminated a large portion of the TLB misses and improved the running time of the loop nest by a factor of 5 and the overall execution time by over 15%.
[1] Pure data-copying loops with little computation can achieve a memory bandwidth that is slightly higher than 300MB/s on SGI Origin2000.
A similar tuning process was then performed on compute_rhs, which had an average bandwidth utilization of 84%. However, not all loops in compute_rhs performed well. The examination of loop-level bandwidth utilization found two loops that utilized only 65% and 44% of the memory bandwidth because of cache conflicts in L1. Loop distribution and array padding were applied by hand. The modifications improved the two loops by 9% and 24% respectively and the overall running time by another 2.4%. After the tuning of both lhsz and compute_rhs, the performance of SP was improved from 45.1 MFlop/s to 55.5 MFlop/s, a speedup of 1.19.
Bandwidth-based tuning is more accurate in locating performance problems than other tuning techniques because it monitors the most critical resource, memory bandwidth. Flop rates, for example, are not as effective. The flop rates of the two previously mentioned loops in compute_rhs are over 30 MFlop/s before tuning, which is not much lower than in other parts of the program; all loops in lhsx, for instance, have flop rates below 18 MFlop/s. By comparing flop rates alone, a user may draw the wrong conclusion that the loops in lhsx are better candidates for tuning. However, the loops in lhsx cannot be improved because they already saturate the available memory bandwidth. Their flop rates are low because they are data-copying loops with little computation.
The successful tuning of SP shows that automatic tuning support is extremely effective in helping a user correct performance problems in large applications. Although there were over 80 loop nests in SP, bandwidth-based tuning automatically located three loop nests for performance tuning. As a result, we as programmers only needed to inspect these three loops, and simple source-level changes improved overall performance by 19%. In other words, the bandwidth-based tuning tool allowed us to obtain a 19% overall improvement by examining less than 5% of the code.
Performance Prediction
Bandwidth-based performance prediction approximates program running time with the estimated memory-transfer time, that is, the total amount of memory transfer divided by the memory bandwidth of the machine. This section examines the accuracy of this prediction technique on the SP benchmark. Since the prediction requires an estimate of the amount of memory transfer, the experiment first measures it with hardware counters and then applies the compiler analysis by hand to verify the accuracy of compiler-based estimation.
The table in Figure 6.10 lists the actual running time of a single iteration of SP, the predicted time, and the percentage error. The predicted time is the total amount of memory transfer divided by the memory bandwidth. The prediction is given both with and without considering the effect of TLB misses in the first loop of lhsz, discussed in the previous section on user tuning. The table lists two predictions: the first assumes full memory bandwidth utilization for the whole program, and the other assumes an average utilization of 90%.
Computation        Exe Time   Pred. Time I    Err. I   Pred. Time II   Err. II
                              (Util=100%)              (Util=90%)
adi w/o TLB est.   59.0s      43.8s           -26%     48.6s           -18%
adi w TLB est.     59.0s      50.9s           -14%     55.7s           -5.6%
adi w/o lhsz       47.0s      40.0s           -15%     44.3s           -5.7%
Figure 6.10 Actual and predicted execution time
The first row of the table in Figure 6.10 gives the estimation without considering the extra overhead of TLB misses in lhsz. The TLB overhead can be easily predicted by multiplying the number of TLB misses by the full memory latency (338ns, according to what is called restart latency in [HL97]), which adds up to a total of 7.1 seconds. The second row gives the performance prediction with this TLB overhead included. The third row predicts performance for the program without lhsz (the rest represents over 80% of the overall execution time).
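Putting the pieces together, the predictions in Figure 6.10 amount to the following calculation. This is only a sketch of the arithmetic; the transfer volume and TLB-miss count come from hardware counters or from compiler analysis, and the utilization factor is the assumed average.

    /* Bandwidth-based prediction of execution time.
     *   transfer_bytes : total memory transfer
     *   bw             : memory bandwidth of the machine (bytes/second)
     *   util           : assumed utilization, 1.0 or 0.9
     *   tlb_misses     : predicted number of TLB misses
     *   tlb_latency    : full memory latency per miss, 338e-9 s here   */
    double predict_time(double transfer_bytes, double bw, double util,
                        double tlb_misses, double tlb_latency)
    {
        return transfer_bytes / (bw * util) + tlb_misses * tlb_latency;
    }

For the first loop of lhsz, the TLB term alone contributes the 7.1 seconds mentioned above.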
The third and fifth columns of the table in Figure 6.10 show the error of the prediction. When assuming full bandwidth utilization, the prediction error is 26% when not considering the abnormal TLB overhead, 14% when considering the TLB overhead, and 15% for the program without lhsz. When the assumed utilization is 90%, the prediction error is 18% when not considering the TLB overhead, 5.6% when including the TLB cost, and 5.7% for the program without lhsz. The table shows that, with the estimation of the TLB cost and the assumption of 90% memory-bandwidth utilization, bandwidth-based prediction is very accurate, with an error of less than 6%. The similar errors in the last two rows also suggest that our static estimation of the TLB overhead is accurate.
The above predictions measured the amount of memory transfer through hardware counters. This is undesirable because we should predict program performance without running the program. So the next question is how accurate the static estimation of a compiler can be. I manually applied a simplified version of the data analysis described in Section 5.2. In fact, only the bounded-section analysis was used, which counts only the capacity misses in each loop nest. I did not expect many conflict misses because the L2 cache on SGI Origin2000 is two-way set associative and 4MB in size.
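The hand analysis follows the same pattern as the bounded-section analysis: for each loop nest, sum the sizes of the array sections it touches, and charge a section that was also touched by the previous nest only when that nest's footprint exceeded the cache. The sketch below is a simplification for illustration; the structure and field names are not those of the compiler.

    /* Per-loop-nest estimate of memory transfer from bounded sections.
     * Only capacity misses are modeled: a reused section is free when
     * the previous nest's footprint fit in cache.  Illustrative names. */
    struct section { double bytes; int reused_from_prev; };

    double nest_transfer(const struct section *s, int nsections,
                         double prev_footprint, double cache_bytes)
    {
        double total = 0.0;
        for (int i = 0; i < nsections; i++) {
            if (s[i].reused_from_prev && prev_footprint <= cache_bytes)
                continue;                    /* still cached, no transfer */
            total += s[i].bytes;
        }
        return total;
    }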
Two subroutines were analyzed by hand: compute_rhs and lhsx, which together account for 40% of the total running time. Subroutine compute_rhs has the largest source code and the longest running time among all subroutines. It is also representative of the mixed access patterns of the whole program because it traverses the cubic data structure in all three directions. The subroutine lhsx accesses memory contiguously in a single stride. The following table lists the actual memory transfer measured by hardware counters, the memory transfer predicted by the hand analysis, and the error of the static estimation.
Subroutine     Actual    Predicted   Error
lhsx           396MB     406MB       +3%
compute_rhs    5308MB    5139MB      -3%
Figure 6.11 Actual and predicted data transfer
The errors shown in the last column of the table in Figure 6.11 are within 3%, indicating that the static estimation is indeed very accurate. Assuming this accuracy holds for the other parts of SP, the bandwidth-based analysis tool could predict the overall performance within an error of less than 10%, assuming an average memory bandwidth utilization of 90%.
6.6 Summary
Global optimizations The two-step global strategy of loop fusion and data regrouping is extremely effective for the benchmark programs tested, improving overall speed by between 14% and a factor of 2.33 for the kernels and by a factor of 1.5 for the full application. Furthermore, the improvement is obtained solely through automatic source-to-source compiler optimization. The success especially underscores the following three important aspects:
• Aggressive loop fusion. All test programs have loops with different numbers of dimensions. Mere loop alignment cannot fuse any of the tested programs except for a few loops in SP. Swim also requires loop splitting.
• Conservative data regrouping. Data regrouping improves cache and program performance in most cases. The few degradations are due to the side effects of data regrouping on the back-end compiler. These problems can be easily corrected if the data transformation is made by the back-end compiler itself. Therefore, data regrouping should always be beneficial in practice.
• Combined optimization strategy of computation fusion and data regrouping. When used together they achieve substantial performance improvement, but neither can do so when used alone. In fact, loop fusion degrades performance in most cases if used without data regrouping, and data regrouping sees little or no opportunity without loop fusion, especially for large programs.
Data regrouping is also very beneficial for dynamic applications. Although these programs have unpredictable data access within each array, the relations among multiple arrays are consistent and can be determined by a compiler. Consequently, data regrouping is able to improve global cache reuse despite unknown and dynamic data access patterns. For the two dynamic applications where data regrouping is applied, it reduces cache and TLB misses by 17% to 70% and improves performance by factors of 1.3 and 1.9.
Dynamic optimizations Run-time data packing is very effective for dynamic programs whose access pattern remains unknown until run time and changes during the execution. By analyzing and optimizing data layout at run time, data packing reduces the number of L2 misses by 21% to 84% and the number of TLB misses by 33% to 97% and, as a result, improves overall program performance by a factor of up to 4.36. The run-time cost of data reorganization is negligible, consuming only 0.14% to 0.7% of the overall execution time. The overhead due to data indirections through run-time data maps is significant, but it can be effectively eliminated by the two packing optimizations, pointer update and array alignment. The two optimizations improve performance by factors ranging from 1.08 to 5.56. In addition, the run-time optimizations and global data regrouping complement each other and achieve the best performance when both are used together.
Bandwidth-based performance tool Bandwidth-based tuning and prediction is simple yet very accurate. When evaluated on the 3000-line NAS SP benchmark, it enables a user to obtain an overall speedup of 1.19 by inspecting and tuning only 5% of the program code. The compile-time prediction of the whole-program execution time can come within 10% of the actual execution time.
Chapter 7
Conclusions
“To travel hopefully is a better thing than to arrive, and the true success is to labour.” – Robert Louis Stevenson (1850-1894)
At the outset, this dissertation demonstrated the serious performance bottleneck caused by insufficient memory bandwidth. From then on, it has pursued the goal of minimizing memory-CPU communication through compiler optimizations. This chapter first summarizes the new techniques that have been developed in the preceding chapters. Then it discusses future extensions of this work. Finally, it concludes with final remarks restating the underlying theme of this dissertation.
7.1 Compiler Optimizations for Cache Reuse
The main contribution of this dissertation is a set of new compiler transformations that optimize cache performance both at the global level and at run time.
Global optimizations Global optimizations include computation fusion and data grouping. Chapter 2 describes two algorithms for computation fusion: reuse-driven execution measures the limit of global cache reuse on an ideal machine, and, more importantly, maximal loop fusion realizes most of the global benefit on a real machine. Maximal loop fusion fuses data-sharing statements whenever possible, achieves bounded reuse distance within a fused loop, and maximizes the amount of fusion for multi-dimensional loops. While maximal loop fusion improves temporal cache reuse for the whole program, inter-array data regrouping maximizes spatial cache reuse for the entire data set. Inter-array regrouping, presented in Chapter 3, merges data at a hierarchy of granularities, from large array segments to individual array elements. It makes data access as contiguous as possible and eliminates unnecessary memory writebacks. Both maximal loop fusion and inter-array data grouping are fast; their time complexity is approximately O(V * A), where V is the length of the program and A is the number of data structures in the program.
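As a small illustration of the fusion step (the code is made up, not taken from the benchmarks), two loops that traverse the same arrays can be fused so that each element is reused while it is still in cache:

    #define N 1000000
    double a[N], b[N], c[N];

    void unfused(void)
    {
        for (int i = 0; i < N; i++) b[i] = 2.0 * a[i];    /* streams a, b */
        for (int i = 0; i < N; i++) c[i] = b[i] + a[i];   /* reloads a, b */
    }

    void fused(void)
    {
        for (int i = 0; i < N; i++) {
            b[i] = 2.0 * a[i];        /* a[i] and b[i] are reused at once, */
            c[i] = b[i] + a[i];       /* so the reuse distance is bounded  */
        }
    }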
Maximal loop fusion and inter-array data regrouping are currently the most powerful set of global transformations in the literature. Maximal loop fusion is more
aggressive than previous loop fusion techniques because it can fuse all statements in
a program whenever permitted by data dependence. Data regrouping is the first to
split and regroup global data structures and to do so with guaranteed profitability
and compile-time optimality. The overall strategy is also the first in the literature to
combine a global computation transformation with a global data transformation.
Dynamic optimizations Dynamic optimizations include locality grouping and data packing, presented in Chapter 4. They improve dynamic cache reuse by reordering irregular computation and data at run time. Locality grouping merges computations involving the same data, and data packing groups data used by the same computation. Both are general-purpose transformations: locality grouping reorders any set of independent computations, and data packing reorganizes any non-contiguous data access. In addition, both transformations incur a minimal run-time overhead, which is linear in time and space.
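A minimal sketch of the run-time reordering, assuming the irregular computation is a list of pairwise interactions as in moldyn: a counting sort brings all interactions of the same first object together and keeps the overhead linear in time and space. The names are illustrative only.

    #include <stdlib.h>

    /* Locality grouping: reorder an interaction list so that all
     * interactions touching the same first object become adjacent. */
    void locality_group(const int *left, const int *right, int m,
                        int nobj, int *left_out, int *right_out)
    {
        int *count = calloc(nobj + 1, sizeof(int));
        for (int k = 0; k < m; k++)
            count[left[k] + 1]++;
        for (int i = 0; i < nobj; i++)       /* prefix sum: start offsets */
            count[i + 1] += count[i];
        for (int k = 0; k < m; k++) {
            int pos = count[left[k]]++;
            left_out[pos]  = left[k];
            right_out[pos] = right[k];
        }
        free(count);
    }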
More importantly, locality grouping and data packing are the first set of dynamic optimizations that are automatically inserted and optimized by a compiler. The basis for this automation is compiler indirection analysis, which analyzes indirect data access in a program and identifies all opportunities for data reorganization. Two compiler optimizations, pointer update and array alignment, are used to remove data indirections after data relocation. Both optimizations are extremely effective in removing the overhead of run-time data transformation.
Performance model and tool The balance-based performance model, described in Chapter 1, is the first to consider the balance of bandwidth resources at all levels of a computing system, from the CPU flop rate to memory bandwidth. The balance-based model has clearly demonstrated the existence of the memory bandwidth bottleneck and its constraint on performance. Based on this model, Chapter 5 designed a bandwidth-based tool. The new tool supports effective user tuning by automatically locating performance problems in large applications. In addition, the tool provides accurate performance prediction.
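The spirit of the model can be summarized as a lower bound: the demand a program places on each bandwidth resource, divided by that resource's bandwidth, bounds the running time, and the largest such bound identifies the bottleneck. The sketch below is a paraphrase of this idea, not the exact formulation of Chapter 1.

    /* Balance-based lower bound on execution time.  demand[l] is the
     * volume of traffic a program places on level l (flops, registers,
     * cache, memory); bandwidth[l] is what the machine delivers there;
     * the binding constraint is the largest ratio. */
    double time_lower_bound(const double *demand, const double *bandwidth,
                            int levels)
    {
        double bound = 0.0;
        for (int l = 0; l < levels; l++) {
            double t = demand[l] / bandwidth[l];
            if (t > bound)
                bound = t;
        }
        return bound;
    }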
Summary of evaluation results The experimental evaluation has verified that
the new global strategy achieved dramatic reductions in the volume of data transferred
for the programs studied. The table in Figure 7.1 compares the amount of data
transferred for versions of each program with no optimization, with optimizations
provided by the SGI compiler, and after transformation via the strategy developed
in this dissertation. If we compare the average reduction in misses due to compiler
techniques, the new strategy, labeled by column New, does significantly better than
the SGI compiler, labeled by column SGI.
             L1 misses             L2 misses             TLB misses
program      NoOpt  SGI    New     NoOpt  SGI    New     NoOpt  SGI     New
Swim         1.00   1.26   1.15    1.00   1.10   0.94    1.00   1.60    1.05
Tomcatv      1.00   1.02   0.97    1.00   0.49   0.39    1.00   0.010   0.010
ADI          1.00   0.66   0.40    1.00   0.94   0.53    1.00   0.011   0.005
NAS/SP       1.00   0.97   0.77    1.00   1.00   0.49    1.00   1.09    0.67
average      1.00   0.98   0.82    1.00   0.88   0.59    1.00   0.68    0.43
Moldyn       1.00   1.15   0.34    1.00   0.99   0.19    1.00   0.77    0.10
Mesh         1.00   1.02   0.47    1.00   1.34   0.39    1.00   0.57    0.57
Magi         1.00   1.15   0.82    1.00   1.25   0.76    1.00   1.00    0.36
NAS/CG       1.00   1.01   0.58    1.00   0.95   0.15    1.00   0.97    0.03
average      1.00   1.08   0.55    1.00   1.13   0.37    1.00   0.83    0.27
Figure 7.1 Summary of evaluation results
On average for the four regular applications measured, the new strategy outperforms the SGI compiler by factors of 9 for L1 misses, 3.4 for L2 misses, and 1.8 for
TLB misses. The improvement is even larger for the four irregular and dynamic programs, where both global and dynamic optimizations are applied. On average, the
new strategy reduces L1 misses by 45%, L2 misses by 63%, and TLB misses by 73%.
In contrast, the SGI compiler causes an average of 8% more L1 misses and 13% more
L2 misses for these dynamic applications. Thus, the global and dynamic strategy
developed in this dissertation has a clear advantage over the more local and static
strategies employed by an excellent commercial compiler.
7.2 Future Work
The successful development of global and dynamic optimizations has opened fascinating new opportunities for future compiler research. New improvements can come from overcoming the overhead of global and dynamic optimizations and from extending their capabilities. This section touches on research directions that are important and can be approached by direct extensions of this work.
Storage optimization after fusion Extensive loop fusion enables new opportunities for storage optimization. Section 2.5 of Chapter 2 has given two examples of
storage reduction and store elimination. These two techniques need to be developed
and evaluated on real programs. In general, computation fusion provides more freedom in organizing data and its uses. The data transformations mentioned here are just the tip of the iceberg.
Improving the fusion heuristic Although the benefit of maximal fusion has been verified, it remains unknown how much more benefit can be obtained by improving on the greedy heuristic currently used. How can data sharing be reduced among fused loops? How well do other heuristics perform, especially the heuristic of always fusing along the heaviest edge by Kennedy [Ken99]? Further evaluation is required to find the best fusion method.
Register allocation after fusion Since loop fusion may merge too much computation into a fused loop, it may overflow machine registers and dramatically increase the number of register loads and stores. A direct remedy exists, which is to distribute a large loop into smaller ones that do not overflow registers. The distribution is in fact a form of fusion after computations are divided into the smallest units. The fusion is equivalent to a fixed-size partitioning of hyper-graphs and is NP-hard. The problem is similar to loop fusion because it needs to minimize data sharing; however, it is also different because it fuses loops only up to a fixed size.
Data grouping for arbitrary programs In this dissertation, array regrouping uses compiler analysis to identify computation phases and data access patterns. When compiler analysis is not available, data regrouping can still use profiling analysis. By profiling the execution, it can define computation phases as time intervals in which the amount of data accessed is larger than the cache. If two data items, for example two members of different object classes, are always accessed together, they can then be grouped into the same cache block. In this way, data grouping can be applied to arbitrary programs.
Data grouping for parallel programs On shared memory machines, cache blocks
are the basis of data consistency and consequently the unit of communication among
parallel processors. Data regrouping can be applied to parallel programs to improve
cache-block utilization, which leads to reduced communication latency and increased
communication bandwidth.
Indirection analysis for object-oriented programs Indirection analysis can be extended to dynamically allocated objects linked by pointers. This can be done by analyzing the relations among objects based on their types and then recording the access sequence of related objects at run time. Such an extension would allow data packing to be applied to object-oriented programs.
Data reuse analysis for parallelization The reuse-distance analysis is a general
tool that can study unconventional optimizations by examining their effect on data
reuse. One important direction is to experiment with innovative methods of program
parallelization and data communication.
Automatic run-time optimizations The efficient performance monitoring technique developed in Chapter 5 has made it possible to monitor the execution status of a program and to identify and correct its performance problems dynamically. This adaptability is extremely important for large programs running on heterogeneous machines, where a single program code cannot work well everywhere.
7.3 Final Remarks
The dissertation can be viewed as a pursuit of two goals. The first, a fundamental one, is to minimize memory-CPU communication through caching. The second, a practical concern, is to avoid losing software productivity in the search for higher performance. This research has found a middle ground between these two goals. It optimizes the whole program at all times, but it does so with automatic methods that are transparent to a programmer. This dual theme of optimization and automation permeates this dissertation and extends to its vision of the future. As the world is becoming a ubiquitously connected computing environment, programming tasks in the future will be far more complex and difficult because of the scale of the
software and the complexity of hardware. Today’s manual process of programming is
unlikely to meet this future challenge. The extension of this work may offer a better
alternative through programming automation. In fact, by providing powerful techniques for global and dynamic program transformation, this research has paved the
way for the automation of complex and large-scale programming tasks, thus making
software development less labor-intensive and more manageable.
“He who knows others is learned. He who knows himself is wise.” – Lao Tzu
(about 500 BC)
Bibliography
[AC72]
F. Allen and J. Cocke.
A catalogue of optimizing transformations.
In J. Rustin, editor, Design and Optimization of Compilers. Prentice-Hall, 1972.
[AFR98]
I. Al-Furaih and S. Ranka.
Memory hierarchy management for iterative graph structures.
In Proceedings of IPPS, 1998.
[AK87]
J. R. Allen and K. Kennedy.
Automatic translation of Fortran programs to vector form.
ACM Transactions on Programming Languages and Systems, 9(4):491–
542, October 1987.
[All83]
J. R. Allen.
Dependence Analysis for Subscripted Variables and Its Application to
Program Transformations.
PhD thesis, Dept. of Computer Science, Rice University, April 1983.
[ASKL81]
W. Abu-Sufah, D. Kuck, and D. Lawrie.
On the performance enhancement of paging systems through program
analysis and transformations.
IEEE Transactions on Computers, C-30(5):341–356, May 1981.
[Bai92]
D. Bailey.
Unfavorable strides in cache memory systems.
Technical Report RNR-92-015, NASA Ames Research Center, 1992.
[BFKK91] V. Balasundaram, G. Fox, K. Kennedy, and U. Kremer.
A static performance estimator to guide data partitioning decisions.
In Proceedings of the Third ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming, Williamsburg, VA, April 1991.
[BGK96]
D. C. Burger, J. R. Goodman, and A. Kagi.
Memory bandwidth limitations of future microprocessors.
In Proceedings of the 23rd International Symposium on Computer
Architecture, 1996.
[Cal87]
D. Callahan.
A Global Approach to Detection of Parallelism.
PhD thesis, Dept. of Computer Science, Rice University, March 1987.
[Car92]
S. Carr.
Memory-Hierarchy Management.
PhD thesis, Dept. of Computer Science, Rice University, September 1992.
[CCC+ 97]
R. Chandra, D. Chen, R. Cox, D.E. Maydan, and N. Nedeljkovic.
Data distribution support on distributed shared memory multiprocessors.
In Proceedings of ’97 Conference on Programming Language Design and
Implementation, 1997.
[CCJA98]
B. Calder, K. Chandra, S. John, and T. Austin.
Cache-conscious data placement.
In Proceedings of the Eighth International Conference on Architectural
Support for Programming Languages and Operating Systems
(ASPLOS-VIII), San Jose, Oct 1998.
[CCK88]
D. Callahan, J. Cocke, and K. Kennedy.
Estimating interlock and improving balance for pipelined machines.
Journal of Parallel and Distributed Computing, 5(4):334–358, August
1988.
[CDL99]
T.M. Chilimbi, B. Davidson, and J.R. Larus.
Cache-conscious structure definition.
In Proceedings of SIGPLAN Conference on Programming Language
Design and Implementation, 1999.
[CK89]
S. Carr and K. Kennedy.
Blocking linear algebra codes for memory hierarchies.
In Proceedings of the Fourth SIAM Conference on Parallel Processing for
Scientific Computing, Chicago, IL, December 1989.
[CKP90]
D. Callahan, K. Kennedy, and A. Porterfield.
Analyzing and visualizing performance of memory hierarchies.
In Performance Instrumentation and Visualization, pages 1–26. ACM
Press, 1990.
[CL95]
M. Cierniak and W. Li.
Unifying data and control transformations for distributed shared memory machines.
In Proceedings of the SIGPLAN ’95 Conference on Programming
Language Design and Implementation, La Jolla, 1995.
[CQ93]
Mark J. Clement and Michael J. Quinn.
Analytical Performance Prediction on Multicomputers.
In Proceedings of Supercomputing’93, November 1993.
[Dar99]
Alain Darte.
On the complexity of loop fusion.
In Proceedings of International Conference on Parallel Architecture and
Compilation, pages 149–157, Newport Beach, CA, Oct 1999.
[DJP+ 92]
E. Dahlhaus, D. S. Johnson, C. H. Papadimitriou, P. D. Seymour, and
M. Yannakakis.
The complexity of multiway cuts.
In Proceedings of the 24th Annual ACM Symposium on the Theory of
Computing, May 1992.
[DK00]
C. Ding and K. Kennedy.
Memory bandwidth bottleneck and its amelioration by a compiler.
In Proceedings of the 2000 International Parallel and Distributed
Processing Symposium, Cancun, Mexico, May 2000.
[DMS+92]
R. Das, D. Mavriplis, J. Saltz, S. Gupta, and R. Ponnusamy.
The design and implementation of a parallel unstructured Euler solver using software primitives.
In Proceedings of the 30th Aerospace Science Meeting, Reno, Nevada,
January 1992.
[DUSH94]
R. Das, M. Uysal, J. Saltz, and Y.-S. Hwang.
Communication optimizations for irregular scientific computations on
distributed memory architectures.
Journal of Parallel and Distributed Computing, 22(3):462–479,
September 1994.
[FST91]
J. Ferrante, V. Sarkar, and W. Thrash.
On estimating and enhancing cache effectiveness.
In U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing, Fourth
International Workshop, Santa Clara, CA, August 1991. Springer-Verlag.
[GH93]
Aaron J. Goldberg and John L. Hennessy.
Mtool: An Integrated System for Performance Debugging Shared
Memory Multiprocessor Applications.
IEEE Transactions on Parallel and Distributed Systems, 4(1), 1993.
[GJG88]
D. Gannon, W. Jalby, and K. Gallivan.
Strategies for cache and local memory management by global program
transformation.
Journal of Parallel and Distributed Computing, 5(5):587–616, October
1988.
[GOST92]
G. Gao, R. Olsen, V. Sarkar, and R. Thekkath.
Collective loop fusion for array contraction.
In Proceedings of the Fifth Workshop on Languages and Compilers for
Parallel Computing, New Haven, CT, August 1992.
[HK91]
P. Havlak and K. Kennedy.
An implementation of interprocedural bounded regular section analysis.
IEEE Transactions on Parallel and Distributed Systems, 2(3):350–360,
July 1991.
[HL97]
Cristina Hristea and Daniel Lenoski.
Measuring memory hierarchy performance of cache-coherent multiprocessors using micro benchmarks.
In Proceedings of SC97: High Performance Networking and Computing,
1997.
[hSM97]
Sharad Singhai and Kathryn S. McKinley.
A parameterized loop fusion algorithm for improving parallelism and
cache locality.
The Computer Journal, 40(6):340–355, 1997.
[HT98]
H. Han and C.-W. Tseng.
Improving compiler and run-time support for adaptive irregular codes.
In Proceedings of the International Conference on Parallel Architectures
and Compilation Techniques, October 1998.
[JE95]
Tor E. Jeremiassen and Susan J. Eggers.
Reducing false sharing on shared memory multiprocessors through compile time data transformations.
In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming, pages 179–188, Santa Barbara,
CA, July 1995.
[KAP97]
Induprakas Kodukula, Nawaaz Ahmed, and Keshav Pingali.
Data-centric multi-level blocking.
In Proceedings of the SIGPLAN ’97 Conference on Programming
Language Design and Implementation, Las Vegas, NV, June 1997.
[KCRB98] M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee.
A matrix-based approach to the global locality optimization problem.
In Proceedings of International Conference on Parallel Architectures and
Compilation Techniques, 1998.
[Ken99]
K. Kennedy.
Fast greedy weighted fusion.
Technical Report CRPC-TR-99789, Center for Research on Parallel
Computation (CRPC), 1999.
[KH78]
D. G. Kirkpatrick and P. Hell.
On the completeness of a generalized matching problem.
In The Tenth Annual ACM Symposium on Theory of Computing, 1978.
[KM93]
K. Kennedy and K. S. McKinley.
Typed fusion with applications to parallel and sequential code generation.
Technical Report TR93-208, Dept. of Computer Science, Rice University,
August 1993.
(also available as CRPC-TR94370).
[Kre95]
U. Kremer.
Automatic Data Layout for Distributed Memory Machines.
PhD thesis, Dept. of Computer Science, Rice University, October 1995.
[LRW91]
M. Lam, E. Rothberg, and M. E. Wolf.
The cache performance and optimizations of blocked algorithms.
In Proceedings of the Fourth International Conference on Architectural
Support for Programming Languages and Operating Systems
(ASPLOS-IV), Santa Clara, CA, April 1991.
[MA97]
N. Manjikian and T. Abdelrahman.
Fusion of loops for parallelism and locality.
IEEE Transactions on Parallel and Distributed Systems, 8, 1997.
[McC95]
John D. McCalpin.
Sustainable memory bandwidth in current high performance computers.
http://reality.sgi.com/mccalpin asd/papers/bandwidth.ps, 1995.
[MCF99]
Nicholas Mitchell, Larry Carter, and Jeanne Ferrante.
Localizing non-affine array references.
In Proceedings of International Conference on Parallel Architecture and
Compilation, Newport Beach, CA, Oct 1999.
[McI97]
Nathaniel McIntosh.
Compiler Support for Software Prefetching.
PhD thesis, Rice University, Houston, TX, July 1997.
[MCT96]
K. S. McKinley, S. Carr, and C.-W. Tseng.
Improving data locality with loop transformations.
ACM Transactions on Programming Languages and Systems, 18(4):424–
453, July 1996.
[MCWK99] John Mellor-Crummey, David Whalley, and Ken Kennedy.
Improving memory hierarchy performance for irregular applications.
In Proceedings of the 13th ACM International Conference on
Supercomputing, pages 425–433, Rhodes, Greece, 1999.
[ML98]
Philip J. Mucci and Kevin London.
The cachebench report.
Technical Report ut-cs-98-394, University of Tennessee, 1998.
[Mow94]
T. Mowry.
Tolerating Latency Through Software Controlled Data Prefetching.
PhD thesis, Dept. of Computer Science, Stanford University, March 1994.
[MT96]
Kathryn S. McKinley and Olivier Temam.
A quantitative analysis of loop nest locality.
In Proceedings of Seventh International Conference on Architectural
Support for Programming Languages and Operating Systems
(ASPLOS-VII), Boston, MA, Oct 1996.
[Por89]
A. Porterfield.
Software Methods for Improvement of Cache Performance.
PhD thesis, Dept. of Computer Science, Rice University, May 1989.
[PR99]
W. Pugh and E. Rosser.
Iteration space slicing for locality.
In Proceedings of the Twelfth Workshop on Languages and Compilers for
Parallel Computing, August 1999.
[Pug92]
W. Pugh.
A practical algorithm for exact array dependence analysis.
Communications of the ACM, 35(8):102–114, August 1992.
[SL99]
Y. Song and Z. Li.
New tiling techniques to improve cache temporal locality.
In ACM SIGPLAN Conference on Programming Languages Design and
Implementation, 1999.
[SZ98]
M. L. Seidl and B. G. Zorn.
Segregating heap objects by reference behavior and lifetime.
In Proceedings of the Eighth International Conference on Architectural
Support for Programming Languages and Operating Systems
(ASPLOS-VIII), San Jose, Oct 1998.
[TA94]
K. A. Tomko and S. G. Abraham.
Data and program restructuring of irregular applications for cachecoherent multiprocessors.
In Proceedings of ’94 International Conference on Supercomputing, 1994.
[Tha81]
K. O. Thabit.
Cache Management by the Compiler.
PhD thesis, Dept. of Computer Science, Rice University, 1981.
[WL91]
M. E. Wolf and M. Lam.
A data locality optimizing algorithm.
In Proceedings of the SIGPLAN ’91 Conference on Programming
Language Design and Implementation, Toronto, Canada, June 1991.
[Wol82]
M. J. Wolfe.
Optimizing Supercompilers for Supercomputers.
PhD thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign, October 1982.