Compiler Optimizations for Improving Data Locality

Steve Carr
Department of Computer Science
Michigan Technological University
carr@cs.mtu.edu

Kathryn S. McKinley
Department of Computer Science
University of Massachusetts
mckinley@cs.umass.edu

Chau-Wen Tseng
Computer Systems Laboratory
Stanford University
tseng@cs.stanford.edu
Abstract

In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. We demonstrate that these program transformations are useful for optimizing many programs.

To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments with kernels illustrate that our model and algorithm can select and achieve the best performance. For over thirty complete applications, we executed the original and transformed versions and simulated cache hit rates. We collected statistics about the inherent characteristics of these programs and our ability to improve their data locality. To our knowledge, these studies are the first of such breadth and depth. We found performance improvements were difficult to achieve because benchmark programs typically have high hit rates even for small data caches; however, our optimizations significantly improved several programs.
1 Introduction

Because processor speed is increasing at a much faster rate than memory speed, computer architects have turned increasingly to the use of memory hierarchies with one or more levels of cache memory. Caches take advantage of data locality in programs. Data locality is the property that references to the same memory location or adjacent locations are reused within a short period of time.

Caches also have an impact on programming; programmers substantially enhance performance by using a style that ensures more memory references are handled by the cache. Scientific programmers expend considerable effort at improving locality by structuring loops so that the innermost loop iterates over the elements of a column, which are stored consecutively in Fortran. This task is time consuming, tedious, and error-prone. Instead, achieving good data locality should be the responsibility of the compiler. By placing the burden on the compiler, programmers can get good uniprocessor performance even if they originally wrote their program for a vector or parallel machine. In addition, programs will be more portable because programmers will be able to achieve good performance without making machine-dependent, source-level transformations.

1.1 Optimization Framework

Based on our experiments and experiences, we believe that compiler optimization to improve data locality should proceed in the following order:

1. Improve the order of memory accesses to exploit all levels of the memory hierarchy through loop permutation, fusion, distribution, skewing, and reversal. This process is mostly machine-independent and requires knowledge only of the cache line size.

2. Fully utilize the cache through tiling, a combination of strip-mining and loop permutation [IT88]. Knowledge of the cache size, associativity, and replacement policy is essential. Higher degrees of tiling can be applied to exploit multi-level caches, the TLB, etc.

3. Promote register reuse through unroll-and-jam (also known as register tiling) and scalar replacement [CCK90]. The number and type of registers available are required to determine the degree of unroll-and-jam and the number of array references to replace with scalars.

In this paper, we concentrate on the first step. Our algorithms are complementary to and in fact improve the effectiveness of optimizations performed in the latter two steps [Car92]. However, the other steps and interactions between steps are beyond the scope of this paper.

1.2 Overview

We present a compiler strategy based on an effective, yet simple, model for estimating the cost of executing a given loop nest in terms of the number of cache line references. This paper extends previous work [KM92] with a slightly more accurate memory model. We use the model to derive a loop structure which results in the fewest accesses to main memory. The loop structure is achieved through an algorithm that uses compound loop transformations. The compound loop transformations are permutation, fusion, distribution, and reversal. The algorithm is unique to this paper and is implemented in a source-to-source Fortran 77 translator.

We present extensive empirical results for kernels and benchmark programs that validate the effectiveness of our optimization strategy; they reveal that programmers often use programming styles with good locality. We measure both the inherent data locality characteristics of scientific programs and our ability to improve data locality.

(Steve Carr was supported by NSF grant CCR-9409341. Chau-Wen Tseng was supported in part by an NSF CISE Postdoctoral Fellowship in Experimental Science.)
When the cache miss rate for a program is non-negligible, we show there are often opportunities to improve data locality. Our optimization algorithm takes advantage of these opportunities and consequently improves performance. As expected, loop permutation plays the key role; however, loop fusion and distribution may also produce significant improvements. Our algorithms never found an opportunity where loop reversal could improve locality.
2 Related Work

Abu-Sufah first discussed applying compiler transformations based on data dependence (e.g., loop interchange, fusion, distribution, and tiling) to improve paging [AS79]. In this paper, we extend and validate recent research to integrate optimizations for parallelism and data locality [KM92]. We extend their original cost model to capture more types of reuse. The only transformation they perform is loop permutation, whereas we integrate permutation, fusion, distribution, and reversal into a comprehensive approach. Our extensive experimental results are unique to this paper. We measure both the effectiveness of our approach and, unlike other optimization studies, the inherent data locality characteristics of programs and our ability to exploit them.

Our approach has several advantages over previous research. It is applicable to a wider range of programs because we do not require perfect nests or nests that can be made perfect with conditionals [FST91, GJG88, LP92, WL91]. It is quicker, both in the expected and the worst case. Previous work generates all loop permutations [FST91, GJG88] or unimodular transformations (a combination of permutation, skewing, and reversal) [LP92, WL91], evaluates the locality of all legal permutations, and then picks the best. This process requires the evaluation of up to n! loop permutations (though n is typically small). In comparison, our approach only performs one evaluation step because it directly determines the best loop permutation.

Wolf and Lam use unimodular transformations and tiling with estimates of temporal and spatial reuse to improve data locality [WL91]. Their memory model is potentially more precise than ours because it directly calculates reuse across outer loops; however, it may be less precise because it ignores loop bounds even when they are known constants. The exhaustive approach taken by previous researchers [FST91, GJG88, LP92, WL91] is not practical when including transformations which create and combine loop nests (e.g., fusion and distribution). Fusion for improving reuse is by itself NP-hard [KM93]. Our algorithms are efficient heuristics; in our experiments they usually find the best transformations for data locality when including permutation, fusion, and distribution.

In Wolf and Lam's experiments, reversal was seldom applied and skewing was never needed [Wol92]. We therefore chose not to include skewing, even though it is implemented and our model can drive it. We did integrate reversal, but it did not help to improve data locality.

Wolf and Lam's evaluation includes the Perfect Benchmarks, ApplU from the NAS Benchmarks, and routines in Dnasa7 from the SPEC Benchmarks, a subset of our test suite. It is difficult to directly compare our experiments because their cache optimizations include tiling and scalar replacement and are executed on a different processor. We improve a few more programs/routines than they do, but their cache optimizations degrade 6 programs/routines, in one case by 20% [Wol92]. We degrade one program by 2%.
3 Background

In this section, we characterize data reuse and present an effective data locality cost model.

3.1 Data Dependence

We assume the reader is familiar with the concept of data dependence [KKP+81, GKT91]. $\vec{\delta} = \{\delta_1 \ldots \delta_k\}$ is a hybrid distance/direction vector with the most precise information derivable. It represents a data dependence between two array references, corresponding left to right from the outermost loop to the innermost loop enclosing the references. Dependences are loop-independent if the accesses to the same memory location occur in the same loop iteration; they are loop-carried if the accesses occur on different loop iterations.

3.2 Sources of Data Reuse

The two sources of data reuse are temporal reuse, multiple accesses to the same memory location, and spatial reuse, accesses to nearby memory locations that share a cache line or a block of memory at some level of the memory hierarchy. (Unit-stride access is the most common type of spatial locality.) Temporal and spatial reuse may result from self-reuse from a single array reference or from group-reuse among multiple references [GJG88]. Without loss of generality, we assume Fortran's column-major storage.

To simplify analysis, we concentrate on reuse that occurs between small numbers of inner loop iterations. Our memory model assumes there will be no conflict or capacity cache misses in one iteration of the innermost loop. (Lam et al. confirm this assumption [LRW91].)
3.3 Reference Groups

Our cost model first applies algorithm RefGroup to calculate group-reuse. Two references are in the same reference group if they exhibit group-temporal or group-spatial reuse, i.e., they access the same cache line on the same or different iterations of an inner loop. This formulation detects most forms of group-temporal and group-spatial reuse and is more general than previous work, but it is slightly more restrictive than uniformly generated references [GJG88].

RefGroup: Two references Ref1 and Ref2 belong to the same reference group with respect to loop l if:

1. there exists a dependence Ref1 delta Ref2, and
   (a) delta is loop-independent, or
   (b) delta_l is a small constant d (|d| <= 2) and all other entries are zero;

2. or, Ref1 and Ref2 refer to the same array and differ by at most d' in the first subscript dimension, where d' is less than or equal to the cache line size in terms of array elements. All other subscripts must be identical.

Condition 1 accounts for group-temporal reuse and condition 2 detects most forms of group-spatial reuse.
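As an example of how the conditions apply (a hypothetical fragment of ours, not one of the paper's examples), consider:

*     Hypothetical RefGroup example.
      DO J = 2, N
         DO I = 1, N-1
            A(I,J) = A(I,J-1) + B(I,J) + B(I+1,J)
         ENDDO
      ENDDO

With respect to the I loop, B(I,J) and B(I+1,J) satisfy condition 2 (first subscripts differ by 1 and the second subscripts are identical), so they form one group with group-spatial reuse; A(I,J) and A(I,J-1) stay in separate groups because the dependence between them is carried by J. With respect to the J loop, A(I,J) and A(I,J-1) satisfy condition 1(b) (the dependence distance on J is 1 and all other entries are zero), so they form one group with group-temporal reuse; B(I,J) and B(I+1,J) remain grouped by condition 2 for either choice of loop.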
3.4 Loop Cost in Terms of Cache Lines

Once we account for group-reuse, we can calculate the reuse carried by each loop using the functions RefCost and LoopCost in Figure 1. To determine the cost in cache lines of a reference group, we select an arbitrary array reference with the deepest nesting from each group. Each loop l with trip iterations in the nest is considered as a candidate for the innermost position. Let cls be the cache line size in data items and stride be the step size of l multiplied by the coefficient of the loop index variable. RefCost calculates the number of cache lines l uses: 1 for loop-invariant references, trip/(cls/stride) for consecutive references, or trip for non-consecutive references. LoopCost then calculates the total number of cache lines accessed by all references when l is the innermost loop. It simply sums RefCost for all reference groups, then multiplies the result by the trip counts of all the remaining loops. RefCost and LoopCost appear in Figure 1. This method evaluates imperfectly nested loops, complicated subscript expressions, and nests with symbolic bounds [McK92].

The result reveals the relative amount of reuse between loops in the same nest and across disjoint nests; it also drives permutation, fusion, distribution, and reversal to improve data locality.
Figure 1: LoopCost Algorithm

INPUT:
    L = {l_1, ..., l_n}, a loop nest with headers lb, ub, step
    R = {Ref_1, ..., Ref_m}, representatives from each reference group
    trip_l = (ub_l - lb_l + step_l) / step_l
    cls = the cache line size in data items
    coef(f, i_l) = the coefficient of the index variable i_l in the subscript f (it may be zero)
    stride = | step_l * coef(f_1, i_l) |

OUTPUT:
    LoopCost(l) = number of cache lines accessed with l as the innermost loop

ALGORITHM:
    LoopCost(l) = sum_{k=1..m} RefCost( Ref_k(f_1(i_1,...,i_n), ..., f_j(i_1,...,i_n)), l ) * prod_{h != l} trip_h

    RefCost(Ref_k, l) =
        1                        if (coef(f_1, i_l) = 0) and ... and (coef(f_j, i_l) = 0)      [loop invariant]
        trip_l / (cls / stride)  if (stride < cls) and (coef(f_2, i_l) = 0) and ... and (coef(f_j, i_l) = 0)   [consecutive]
        trip_l                   otherwise                                                      [no reuse]
4 Compound Loop Transformations

In this section, we show how the cost model guides loop permutation, fusion, distribution, and reversal. Each subsection describes tests based on the cost model to determine when individual transformations are profitable. Using these components, Section 4.5 presents Compound, an algorithm for discovering and applying legal compound loop nest transformations that attempt to minimize the number of cache lines accessed. All of these transformations are implemented in our system.

4.1 Loop Permutation

To determine the loop permutation which accesses the fewest cache lines, we rely on the following observation. If loop l promotes more reuse than loop l' when both are considered for the innermost position, l will promote more reuse than l' at any outer loop position. We therefore simply rank the loops using LoopCost, positioning them from outermost to innermost (l_1 ... l_n) so that LoopCost(l_{i-1}) >= LoopCost(l_i), i.e., ordering them left to right from the least to the most reuse. We call this permutation with the least cost memory order. If the bounds are symbolic, we compare the dominating terms.

We define the algorithm Permute to achieve memory order when possible on perfect nests. (In Section 4.5, we perform imperfect interchanges with distribution.) To determine whether a given order is legal, we permute the corresponding entries in the distance/direction vector of each dependence; if the result is lexicographically positive, the permutation is legal and we transform the nest. When memory order is legal, as it is for 80% of the loop nests in our test suite, Permute simply sorts the loops into memory order in O(n log n) time. If it is not legal, Permute selects a legal permutation as close to memory order as possible, taking worst case n(n-1) time [KM92]. These steps are inexpensive; evaluating the locality of the nest with LoopCost is the most expensive step. If a legal permutation exists which positions the loop with the most reuse innermost, the algorithm is guaranteed to find it. If the desired inner loop cannot be obtained, the next most desirable inner loop is positioned innermost if possible, and so on. Because most data reuse occurs on the innermost loop, positioning it correctly yields the best data locality.
4.1.1 Example: Matrix Multiplication

In Figure 2, algorithm RefGroup puts the two references to C(I,J) in the same reference group and puts A(I,K) and B(K,J) in separate groups for all loops. Algorithm MemoryOrder uses LoopCost to select JKI as memory order: with I innermost, C(I,J) and A(I,K) exhibit spatial locality and B(K,J) is loop invariant, resulting in the fewest cache line accesses.

To validate the cost model, we gathered results for all possible loop permutations of matrix multiply, ranking them from the lowest to the highest cost (JKI, KJI, JIK, IJK, KIJ, IKJ) in Figure 2. Consistent with our cost model, choosing I as the inner loop results in the best execution time. The position of the inner loop has a dramatic effect on performance: execution times vary by factors of up to 3.7 on the Sparc2, 6.2 on the i860, and 23.9 on the RS/6000. The impact is greater on the 512 x 512 matrices than on the 300 x 300 matrices, where a larger portion of the working set stays in the cache. In our experiments, memory order always resulted in the best performance, and the cost ranking correctly predicts relative performance. We performed this type of comparison on several more kernels with the same result.
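To make the ranking concrete, the following calculation (our arithmetic, consistent with Figure 2's assumption of cls = 4 and n x n matrices) applies RefCost to each reference group of matrix multiply with each loop as the innermost candidate:

\begin{align*}
\text{LoopCost}(I) &= \underbrace{\tfrac{n}{4}\cdot n^2}_{C(I,J)\ \text{consecutive}} + \underbrace{\tfrac{n}{4}\cdot n^2}_{A(I,K)\ \text{consecutive}} + \underbrace{1\cdot n^2}_{B(K,J)\ \text{invariant}} = \tfrac{1}{2}n^3 + n^2 \\
\text{LoopCost}(K) &= n^2 + n^3 + \tfrac{1}{4}n^3 = \tfrac{5}{4}n^3 + n^2 \\
\text{LoopCost}(J) &= n^3 + n^2 + n^3 = 2n^3 + n^2
\end{align*}

Since LoopCost(I) < LoopCost(K) < LoopCost(J), MemoryOrder places J outermost, K next, and I innermost, i.e., JKI.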
4.2 Loop Reversal

Loop reversal reverses the order in which the iterations of a loop execute and is legal if dependences remain carried on outer loops. Reversal does not change the pattern of reuse, but it is an enabler; i.e., it may enable permutation to achieve better locality. We extend Permute to perform reversal as follows. If memory order is not legal, Permute places the outer loops in position first, building up lexicographically positive dependence vectors [KM92]. If Permute cannot legally position a loop in a desired position, it tests whether reversal is legal and enables the loop to be put in that position. Reversal did not improve locality in our experiments, therefore we will not discuss it further.
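The following hypothetical fragment (our illustration, not an example from the paper) shows how reversal can act as an enabler. A direct I/J interchange would turn the dependence with distance vector (1,-1) into an illegal (-1,1), but reversing J first makes the interchange legal and places the unit-stride I loop innermost:

*     Original nest: the flow dependence on A has distance vector
*     (1,-1), so interchanging I and J directly would be illegal.
      DO I = 2, N
         DO J = 1, N-1
            A(I,J) = A(I-1,J+1) + 1.0
         ENDDO
      ENDDO

*     After reversing J (the distance becomes (1,1)) and interchanging,
*     the unit-stride I loop is innermost and all dependences remain
*     lexicographically positive.
      DO J = N-1, 1, -1
         DO I = 2, N
            A(I,J) = A(I-1,J+1) + 1.0
         ENDDO
      ENDDO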
Figure 2: Matrix Multiply

      DO J = 1, N
         DO K = 1, N
            DO I = 1, N
               C(I,J) = C(I,J) + A(I,K) * B(K,J)
            ENDDO
         ENDDO
      ENDDO

LoopCost (with cls = 4):
    reference    I innermost       K innermost       J innermost
    C(I,J)       1/4 * n * n^2     1 * n^2           n * n^2
    A(I,K)       1/4 * n * n^2     n * n^2           1 * n^2
    B(K,J)       1 * n^2           1/4 * n * n^2     n * n^2
    total        1/2 n^3 + n^2     5/4 n^3 + n^2     2 n^3 + n^2

[The figure also plots execution times in seconds for the 300 x 300 and 512 x 512 matrices on the Sun Sparc2 and Intel i860 for each permutation (JKI, KJI, JIK, IJK, KIJ, IKJ); the plotted values are not recoverable from this copy.]

Figure 3: Loop Fusion

(a) Sample Fortran 90 loops for ADI integration

      DO I = 2, N
 S1      X(I,1:N) = X(I,1:N) - X(I-1,1:N)*A(I,1:N)/B(I-1,1:N)
 S2      B(I,1:N) = B(I,1:N) - A(I,1:N)*A(I,1:N)/B(I-1,1:N)
      ENDDO

(b) Translation to Fortran 77

      DO I = 2, N
         DO K = 1, N
            X(I,K) = X(I,K) - X(I-1,K)*A(I,K)/B(I-1,K)
         ENDDO
         DO K = 1, N
            B(I,K) = B(I,K) - A(I,K)*A(I,K)/B(I-1,K)
         ENDDO
      ENDDO

(c) Loop Fusion & Interchange

      DO K = 1, N
         DO I = 2, N
            X(I,K) = X(I,K) - X(I-1,K)*A(I,K)/B(I-1,K)
            B(I,K) = B(I,K) - A(I,K)*A(I,K)/B(I-1,K)
         ENDDO
      ENDDO

LoopCost with K innermost (cls = 4):
    S1 (groups {X(I,K), X(I-1,K)}, {A(I,K)}, {B(I-1,K)}):   3 * n^2
    S2 (groups {B(I,K), B(I-1,K)}, {A(I,K)}):               2 * n^2
    S1 + S2, separate loops:                                5 * n^2
    S1 + S2, fused:                                         3 * n^2

[The figure also lists the corresponding costs with I innermost; those entries are not recoverable from this copy.]
4.3 Loop Fusion

Loop fusion takes multiple loop nests and combines their bodies into one loop nest. It is legal only if no data dependences are reversed [War84]. As an example of its effect, consider the scalarization into Fortran 77 in Figure 3(b) of the Fortran 90 code fragment for ADI integration in Figure 3(a). The Fortran 90 code exhibits both poor temporal and poor spatial locality. The problem is not the fault of the programmer; instead, it is inherent in how the computation can be expressed in Fortran 90. Fusing the K loops results in temporal locality for array B. In addition, the compiler is now able to apply loop interchange, significantly improving spatial locality for all the arrays. This transformation is illustrated in Figure 3(c).

4.3.1 Loop Fusion to enable Loop Permutation

Loop fusion may also indirectly improve reuse in imperfect loop nests by providing a perfect nest that enables a loop permutation with better data locality. For instance, fusing the K loops in Figure 3 enables permuting the loop nest, improving spatial and temporal locality. Using the cost model, we detect that this transformation is desirable since the LoopCost of the I loop is lower than that of the K loops, but memory order cannot be achieved because of the loop structure. We then test if fusion of all inner nests is legal and creates a perfect nest in which memory order can be achieved. Fusion thus serves two purposes:

1. to improve temporal locality, and
2. to fuse all inner loops, creating a nest that is permutable.

4.3.2 Profitability of Loop Fusion

Loop fusion may improve reuse directly by moving accesses to the same cache line into the same loop iteration. Previous research has shown that optimizing temporal locality for an adjacent set of n loops with compatible headers is NP-hard [KM93]; here all the headers are not necessarily compatible. Two loop headers are compatible if the loops have the same number of iterations. Two nests are compatible at level l if the loops at level 1 to l are compatible and the headers are perfectly nested up to level l. Using the cost model, the compiler discovers reuse between two candidate nests by treating the statements as if they already were in the same loop body. To determine the profitability of fusing two compatible nests, we use the cost model as follows:

1. Compute RefGroup and LoopCost as if all the statements were in the same nest, i.e., fused.
2. Compute RefGroup and LoopCost independently for each candidate nest and add the results.
3. Compare the total LoopCosts.

If the fused LoopCost is lower, fusion alone will result in additional locality. As an example, fusing the two K loops in Figure 3 lowers the LoopCost for K from 5n^2 to 3n^2.
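The drop from 5n^2 to 3n^2 follows directly from the RefGroup partition (our arithmetic, consistent with the values shown in Figure 3 and assuming cls = 4): with K innermost, every reference is non-consecutive, so each reference group costs n cache lines per iteration of the outer I loop.

\begin{align*}
\text{unfused:}\quad & \underbrace{3n \cdot n}_{S_1:\ \{X(I,K),\,X(I-1,K)\},\ \{A(I,K)\},\ \{B(I-1,K)\}} \;+\; \underbrace{2n \cdot n}_{S_2:\ \{B(I,K),\,B(I-1,K)\},\ \{A(I,K)\}} \;=\; 5n^2 \\
\text{fused:}\quad   & \underbrace{3n \cdot n}_{\{X(I,K),\,X(I-1,K)\},\ \{A(I,K)\},\ \{B(I,K),\,B(I-1,K)\}} \;=\; 3n^2
\end{align*}

Fusion pays off because the two B references and the two A references collapse into shared groups once the statements execute in the same iteration.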
4.3.3 Loop Fusion Algorithm

Candidate loops for fusion need not be nested within a common loop, and the memory order of the fused loops may differ from the memory orders of the individual nests. Since fusing for temporal locality among loops with compatible headers is NP-hard, we apply a greedy strategy based on the depth of compatibility. We partition the candidate nests into sets of compatible nests at the deepest levels possible. We build a DAG from the candidate nests; the edges are dependences between the loops, and the weight of an edge is the difference between the LoopCosts of the fused and unfused versions. To yield the most locality, we first fuse nests with the deepest compatibility and the most locality. Nests are fused only if it is legal, i.e., no dependences are violated between the loops or in the DAG. We update the graph, then fuse at the next level, until all compatible sets are considered. The algorithm appears in Figure 4. The complexity of this algorithm is O(m^2) time and space, where m is the number of candidate nests for fusion.
Figure 4: Fusion Algorithm

Fuse(L)
INPUT:    L = {l_1, ..., l_k}, nests that are fusion candidates
ALGORITHM:
    Build H = {H_1, ..., H_j}, H_i = {h_1, ..., h_m} a set of compatible nests
        with depth(H_i) >= depth(H_{i+1})
    Build dependence DAG G with edges and weights between the candidate nests
    for each H_i = {h_1, ..., h_m}
        for l_1 = h_1 to h_m
            for l_2 = h_2 to l_1
                if ((locality exists between l_1 and l_2, i.e., an edge (l_1, l_2)
                     with weight > 0) & (it is legal to fuse them))
                    fuse l_1 and l_2 and update G
            endfor
        endfor
    endfor

Figure 5: Distribution Algorithm

Distribute(L, S)
INPUT:    L = {l_1, ..., l_m}, a loop nest containing statements S = {s_1, ..., s_r}
ALGORITHM:
    for j = m-1 to 1
        Restrict the dependence graph to dependences carried at level j or deeper
            and to loop-independent dependences
        Divide S into the finest partitions P = {p_1, ..., p_k} such that
            if s_t and s_u are in a recurrence, then s_t, s_u are in the same p_i
        compute MemoryOrder_i for each p_i
        if (there exists an i such that MemoryOrder_i is achievable with
            distribution and permutation)
            perform distribution and permutation
            return
    endfor
4.3.4 Example: Erlebacher

The original hand-coded version of Erlebacher, a program solving PDEs using ADI integration with 3D arrays, consists mostly of single-statement loops in memory order. We permuted the remaining loops into memory order, producing a distributed program version. Since the loops are fully distributed in this version, it resembles the output of a Fortran 90 scalarizer. We then applied Fuse to obtain more temporal locality.

In Table 1, we measure the performance of the original program (Hand), the transformed program without fusion (Distributed), and the fused version (Fused). Fusion is always an improvement (of up to 17%) over the hand-coded and distributed versions. Since each statement is in a separate loop, many variables are shared between loops. Permuting the loops into memory order increases locality in each nest, but slightly degrades locality between nests, hence the degradation in performance of the distributed version compared to the original. Even though the benefits of fusion are additive rather than multiplicative as in loop permutation, its impact can be significant. In addition, its impact will increase as more programs are written with Fortran 90 array syntax.
Table 1: Performance of Erlebacher (in seconds). The table compares the hand-coded version (Hand), the distributed version in memory order (Distributed), and the fused version (Fused). [The numeric entries are not recoverable from this copy.]
4.4 Loop Distribution

Loop distribution separates independent statements in a single loop into multiple loops with identical headers. To maintain the meaning of the original loop, statements in a recurrence (a cycle in the dependence graph) must be placed in the same loop. Groups of statements that must be in the same loop are called partitions. In our system we use loop distribution only to indirectly improve data locality, i.e., when it enables loop permutation on a nest that is not otherwise permutable. (Distribution could also be effective if there is no temporal locality between partitions and the accessed arrays are too numerous to fit in cache at once, or if register pressure is a concern. We do not address these issues here.)

Statements in different partitions may prefer different memory orders that become achievable after distribution. The algorithm Distribute appears in Figure 5. It divides the statements into the finest granularity partitions and tests whether that enables loop permutation; distribution is performed only if it combines with permutation to improve the actual LoopCost. For a nest of depth m, it starts with the loop at level m-1 and works out to the outermost loop, stopping if successful. We only call Distribute if memory order cannot be achieved on a nest and not all of the inner nests can be fused (see Section 4.5). Distribute tests whether distribution will enable memory order to be achieved for any of the partitions. The dependence structure required to test for loop permutation is created by restricting attention to dependences on statements in the partition of interest. We thus perform the smallest amount of distribution that still enables permutation. The algorithm's complexity is dominated by the time required to determine the LoopCost of the individual partitions. See Section 4.5.1 for an example.
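As a small illustration (a hypothetical fragment of ours, not one of the paper's examples), distribution can isolate a statement whose dependences block interchange so that the remaining statement reaches memory order:

*     The dependences of the first statement (distance vectors (1,-1)
*     and (1,1) on A) prevent making I innermost for the whole nest;
*     the second statement has no loop-carried dependences and prefers
*     I innermost.
      DO I = 1, N-1
         DO J = 2, N-1
            A(I,J) = A(I+1,J-1) + A(I+1,J+1)
            B(I,J) = B(I,J) + 2.0
         ENDDO
      ENDDO

*     After distribution, the B nest is permuted into memory order
*     (J outermost, unit-stride I innermost); the A nest is unchanged.
      DO I = 1, N-1
         DO J = 2, N-1
            A(I,J) = A(I+1,J-1) + A(I+1,J+1)
         ENDDO
      ENDDO
      DO J = 2, N-1
         DO I = 1, N-1
            B(I,J) = B(I,J) + 2.0
         ENDDO
      ENDDO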
4.5 Compound Loop Transformation Algorithm

The driving force behind our application of compound loop transformations is to minimize actual LoopCost by achieving memory order for as many statements in the nest as possible. Algorithm Compound in Figure 6 considers adjacent loop nests. It first optimizes each nest independently, then applies fusion between the resulting nests when it is legal and temporal locality is improved.

To optimize a nest, the algorithm begins by computing memory order and determining whether the loop containing the most reuse can be placed innermost. If it can, the algorithm does so and goes on to the next nest. Otherwise, it tries to enable permutation into memory order by fusing all inner loops to form a perfect nest. If fusion cannot enable memory order, the algorithm tries distribution. If distribution succeeds in enabling memory order, several new nests may be formed; these nests become candidates for fusion to recover temporal locality.

Complexity. Ignoring Distribute for a moment, the complexity of the compound algorithm is O(n) + O(n^2) + O(m^2) time, where n is the maximum number of loops in a nest and m is the maximum number of adjacent nests in the original program. O(n) steps are needed to evaluate the locality of the statements in each nest, O(n^2) non-trivial steps result in the worst case when finding a legal permutation, and O(m^2) steps result from building the fusion problem. Because distribution produces more adjacent nests that are candidates for fusion, m includes the additional adjacent nests created by distribution. In practice, this increase was negligible; a single application of distribution never created more than 3 new nests.
Figure 6: Compound Loop Transformation Algorithm

Compound(N)
INPUT:    N = {n_1, ..., n_k}, adjacent loop nests
ALGORITHM:
    for i = 1 to k
        Compute MemoryOrder(n_i)
        if (Permute(n_i) places the inner loop in memory order)
            continue
        else if (n_i is not a perfect nest & contains only adjacent loops m_j)
            if (FuseAll(m_j, l) & Permute(l) places the inner loop in memory order)
                continue
        else if (Distribute(n_i, l))
            Fuse(l)
    endfor
    Fuse(N)

Figure 7: Cholesky Factorization

(a) {KIJ form}

      DO K = 1, N
 S1      A(K,K) = SQRT(A(K,K))
         DO I = K+1, N
 S2         A(I,K) = A(I,K)/A(K,K)
            DO J = K+1, I
 S3            A(I,J) = A(I,J) - A(I,K)*A(J,K)
            ENDDO
         ENDDO
      ENDDO

(b) {KJI form} after Loop Distribution & Triangular Interchange

      DO K = 1, N
         A(K,K) = SQRT(A(K,K))
         DO I = K+1, N
            A(I,K) = A(I,K)/A(K,K)
         ENDDO
         DO J = K, N
            DO I = J+1, N
               A(I,J+1) = A(I,J+1) - A(I,K)*A(J+1,K)
            ENDDO
         ENDDO
      ENDDO

[The figure also lists the LoopCost of each candidate inner loop and plots execution times (in seconds) of all permutations of the 300 x 300 Cholesky kernel on the RS/6000, Sparc2, and i860; those values are not recoverable from this copy.]
4.5.1 Example: Cholesky Factorization

Consider optimizing the Cholesky factorization kernel in Figure 7(a) with Compound. LoopCost determines that memory order for the nest is KJI, ranking the nests from lowest to highest cost (KJI, JKI, KIJ, IKJ, JIK, IJK). Because KJI cannot be achieved with permutation alone and fusion is of no help here, Compound calls Distribute. Since the loop is of depth 3, Distribute starts by testing distribution at depth 2, the I loop. S2 and S3 go into separate partitions (there is no recurrence between them at level 2 or deeper). Memory order of S3 is also KJI. Distribution of the I loop places S3 alone in a JI nest where I and J may be legally interchanged into memory order, as shown in Figure 7(b). Note that our system handles the permutation of both triangular and rectangular nests [KMT93].

To gather performance results for Cholesky, we generated all possible loop permutations; they are all legal. (Wolfe enumerates these loop organizations [Wol91].) For each permutation, we applied the minimal amount of loop distribution necessary to enable it. Compared to matrix multiply, there are more variations between the observed and predicted behavior. These variations are due to the triangular loop structure; however, Compound still attains the loop structure with the best performance.
5 Experimental Results

To validate our optimization strategy, we implemented our algorithms, executed the original and transformed program versions on our test suite, and simulated cache hit rates. To measure our ability to improve locality, we determined the best locality possible in the original, transformed, and ideal programs, where the ideal program assumes correctness constraints could be ignored. We also collected statistics on the inherent data locality characteristics of the programs and comment on the coding styles we encountered.

For our test suite, we used 35 programs from the Perfect Benchmarks, the SPEC benchmarks, the NAS kernels, and some miscellaneous programs. They ranged in size from 195 to 7608 non-comment lines. Their execution times on the IBM RS/6000 ranged from seconds to a couple of hours.

5.1 Methodology

We implemented the cost model, the transformations, and the algorithms described above in Memoria, the Memory Compiler in the ParaScope Programming Environment [Car92, CHH+93, CK94, KMT93]. Memoria is a source-to-source translator that analyzes Fortran programs and transforms them to improve their cache performance. To increase the precision of dependence analysis, we applied auxiliary induction variable substitution, constant propagation, forward expression propagation, and dead code elimination by hand when directed by the compiler. Memoria also determines if scalar expansion will further enable distribution; since scalar expansion is not integrated in the current version of the transformer, we applied it by hand when necessary. We then used the resulting program and dependence graph to gather statistics and perform the data locality optimizations.
5.2 Transformation Results

In Table 2, we report the results of transforming the loop nests of each program. Table 2 first lists the number of non-comment lines (Lines), the number of loops (Loops), and the number of loop nests (Nests) for each program. Only loop nests of depth 2 or more are considered for transformation. The Memory Order and Inner Loop entries reflect the percentage of loop nests and inner loops, respectively, that are:

Orig:  originally in memory order,
Perm:  permuted into memory order, or
Fail:  fail to achieve memory order.

The percentage of loop nests in the program that are in memory order after transformation is the sum of the original and permuted entries. Similarly for the inner loop, the sum of the original and permuted entries is the percentage of nests where the most desirable innermost loop is positioned correctly.

In the Loop Fusion column, C is the number of candidate nests for fusion, and A is the number of nests that were actually fused. Candidate nests for fusion were adjacent nests where at least one pair of nests was compatible. In the Loop Distribution column, D is the number of loop nests distributed to achieve a better loop permutation, and R is the number of nests that resulted.

Table 2 also lists the number of times that fusion and distribution were applied by the compound algorithm. Either fusion, distribution, or both were applied to 22 of the 35 programs. There were 229 adjacent loop nests that were candidates for fusion, and of these, 80 were fused with one or more other nests to improve reuse. Fusion was applicable in 17 programs and completely fused nests of depth 2 and 3. In Wave and Arc2d, Compound fused 26 and 12 nests, respectively; fusion improved group-temporal locality for these programs. The Compound algorithm only applied distribution when it enabled memory order to be attained in a nest or in the innermost loop for at least one of the resultant nests. Compound applied distribution in 12 of the 35 programs. On 23 nests, distribution enabled loop permutation to position the inner loop or the entire nest correctly, creating 29 additional nests. In Bdna, Ocean, Applu, and Su2cor, 6 or more nests resulted.

The LoopCost Ratio in Table 2 estimates the potential reduction in LoopCost for the final transformed program (Final) and the ideal program (Ideal). The ideal program achieves memory order for every nest, without regard to dependence constraints or limitations in the implementation. By ignoring correctness, it is in some sense the best data locality one could achieve. For the final and ideal versions, the average ratio of original LoopCost to transformed LoopCost is listed. These values reveal the potential for locality improvement.

Table 2: Memory Order Statistics. For each program the table lists Lines, Nests, and Loops; the Memory Order and Inner Loop percentages (Orig, Perm, Fail); the Loop Fusion counts (C, A); the Loop Distribution counts (D, R); and the LoopCost Ratio (Final, Ideal), grouped by suite (Perfect Benchmarks, SPEC Benchmarks, NAS Benchmarks, and miscellaneous programs). [The per-program numeric entries are too garbled in this copy to reproduce reliably.]
5.3 Coding Styles

Memoria may not obtain memory order for a nest for the following reasons: (1) loop permutation is illegal due to dependences; (2) loop distribution followed by permutation is illegal due to dependences; or (3) the loop bounds are too complex, i.e., not rectangular or triangular. For the 20% of nests where the compiler could not achieve memory order, 87% were because permutation, and then distribution followed by permutation, could not be applied because of dependence constraints. The rest were because the loop bounds were too complex. More sophisticated dependence testing techniques may enable the algorithms to transform a few more nests.

Imprecise dependence analysis is a factor in limiting the potential for improvements in our application suite. For example, dependence analysis for the program Cgm cannot expose potential data locality for our algorithm because of imprecision due to the use of index arrays. The program Mg3d is written with linearized arrays. This coding style introduces symbolics into the subscript expressions and again makes dependence analysis imprecise. The inability to analyze the use of index arrays and linearized arrays prevents many optimizations and is not a deficiency specific to our system.

Other coding styles may also inhibit optimization in our system. For example, Linpackd and Matrix300 are written in a modular style with singly nested loops enclosing function calls to routines which also contain singly nested loops. To improve programs written in this style requires interprocedural optimization [CHK93, HKM91]; these optimizations are not currently implemented in our translator.

5.4 Successful Transformation

Many loop nests (69%) in the original programs are already in memory order, and even more (74%) have the loop carrying the most reuse in the innermost position. Our compiler was able to permute an additional 11% of the loop nests into memory order, resulting in a total of 80% of the nests in memory order and a total of 85% of the inner loops in the memory order position. This result indicates that scientific programmers often pay attention to data locality; however, there are many opportunities for improvement.

We illustrate our ability to transform for data locality by program in Figures 8 and 9. The figures characterize the percentage of each program's nests and inner loops that are originally in memory order and that are transformed into memory order. In Figure 8, half of the original programs have fewer than 70% of their nests in memory order. In the transformed version, 29% have fewer than 70% of their nests in memory order, and over half now have 80% or more of their nests in memory order. The results in Figure 9 are more dramatic: the majority of programs can be transformed such that 90% or more of their inner loops are positioned correctly for the best locality. Our transformation algorithms achieve memory order in the majority of nests and programs, and improved data locality for one or more nests in 66% of the programs.

Figure 8: Achieving Memory Order for Loop Nests. (Histogram of programs by the percentage of their loop nests in memory order, from <=20 through >=90, for the Original and Final versions.)

Figure 9: Achieving Memory Order for the Inner Loop. (Histogram of programs by the percentage of their inner loops in memory order, from <=20 through >=90, for the Original and Final versions.)
algorithms
scakrized
cache, the transformed
significantly
loop nests may be cpu-bound
and the optimized
contribute
transformer.
portions
applications
improvement
(Arc2d, Dnasa7,
success-
or degradation
optirnizations
vector programs,
with significant
and Simple).
are particularly
since these programs
the predicted improvements
did not materialize
programs. To explore these results, we simulated
determine
perfor-
These results
effective for
are structured
to
correctly
sociative,
2-way
for many of the
cache behavior to
cache hit rates for out test suite.
We simulated
cachel,
an RS/6000 cache (64KB,
4-way
set as-
128 byte cache lines), and cache2, an i860 cache (8KB,
set associative,
32 byte cache limes).
We determined
the
change in the hit rates both for just the optimized procedures and
for the entire program. The resulting measured rates are presented
in Table 4. Places where the compiler affected cache hit rates by
z .1 V. are emboldened for greater emphasis. For the Find columns
programs
instead of
we chose the better of the timed and unfused versions for each pro-
of the progrmn may not
to the overall execution
All
emphasize vector operations rather than cache lime reuse. However,
may not result in run-time improvements for several reasons: data
sets for benchmark programs tend to be small enough to fit in
memory-bound,
with the -O option to
and executed on the RS/6000. For those applications
indicate that data locality
In
deternnine and
transform
set as-
We used
and the version produced by our
Table 3 shows a number of applications
of nests and programs.
to successfully
cache, 4-way
and 128 byte cache lines.
program
source-to-source
mance improvements
can be transformed
such that 90’%0 or more of their inner loops are positioned
of our test suite running
a 64KB
occurred.
programs have fewer than 70% of their
The majority
for the best locality.
policy,
both the original
automatic
nests in memory order. In the transformed version, Z9Y0 have fewer
than 70% of their nests in memory order. Over half now have 80%
or more of their nests in memory order. The results in Figure 9
are more dramatic.
replacement
fully compiled
by program
the programs
>= 90
Results
not listed in Table 3, no performance
The figures
>= 70 >=80
the standard IBM RS/6000 Fortran 77 compiler
compile
for one or more nests in 66% of the programs.
our ability
Performance
on an IBM RS/6000 model 540 with
order and a
order position.
>= 60
In Table 3, we present the performance
Transformation
8 and
the percentage
that
11% of the loop nests into memory
total of 85q0 of the inner loops in memory
Successful
5.5
(74~0) have the loop carrying
position.
>=40
Percent of Inner Loops in Memory Order
are already
in a total 801?’oof the nests in memory
improved data locality
S.4
programs
for improvement.
an additional
<=20
in our translator.
often pay attention to data locality;
there are many opportunities
in
gram.
time.
259
Table 3: Performance Results (in seconds)

    Program    Original   Transformed   Speedup
    applu      146.61     149.49        0.98
    appsp      361.43     337.84        1.07
    simple     963.20     850.18        1.13
    linpackd   159.04     157.48        1.01
    wave       445.94     414.60        1.08

[Entries for the remaining programs that changed, including Arc2d and the Dnasa7 routines (btrix, emit, gmtry, vpenta), are not cleanly recoverable from this copy.]

Table 4: Simulated Cache Hit Rates.
Cache1: 64K cache, 4-way set associative, 128 byte cache line (RS/6000).
Cache2: 8K cache, 2-way set associative, 32 byte cache line (i860).
Cold misses are not included. For each program, the table lists the original (Orig) and final (Final) hit rates for both caches, for the optimized procedures alone and for the whole program. [The per-program percentages are too garbled in this copy to reproduce reliably.]

As illustrated in Table 4, the reason more programs did not improve on the RS/6000 is the high hit ratios of the original, unoptimized programs, caused by small data set sizes. When the cache is reduced to 8K, the optimized programs obtain improvements in whole-program hit rates for Adm, Arc2d, Appsp, Erlebacher, Dnasa7, Hydro2d, Simple, and Wave. Improvements in the optimized loop nests were more dramatic. The hit rates for Dnasa7 and Appsp show significant improvement after optimization for the smaller cache even though they barely changed in the larger cache.

We measured the hit rates both with and without applying fusion. For the 8K cache, fusion improved whole-program hit rates for Hydro2d, Appsp, and Erlebacher by 0.51%, 0.95%, and 0.24%, respectively. We were surprised that applying fusion improved the hit rate by 5.3% on the subroutine matgen in Linpackd and by 0.02% for the entire program; matgen is an initialization routine whose performance is not usually measured. Unfortunately, fusion also lowered hit rates for Track, Dnasa7, and Wave; the degradation may be due to added cache conflict and capacity misses after loop fusion. To recognize and avoid these situations requires cache capacity and interference analysis similar to that performed for evaluating loop tiling [LRW91]. Because our fusion algorithm only attempts to improve reuse at the innermost loop level, it may sometimes merge array references that interfere or overflow the cache. We intend to correct this deficiency in the future.

Our results are very favorable when compared to Wolf's results, though direct comparisons are difficult because he combines tiling with cache optimizations and reports improvements only relative to programs with scalar replacement [Wol92]. Wolf applied permutation, skewing, reversal, and tiling to the Perfect Benchmarks and Dnasa7 on a DECstation 5000 with a 64KB direct-mapped cache. His results show performance degradations or no change in execution time on all but Adm, which showed a small (1%) improvement. Our transformations did not degrade performance on any of the Perfect programs, and the performance of Arc2d was significantly improved. Our results on the routines in Dnasa7 are similar to Wolf's, both showing improvements on Btrix, Gmtry, and Vpenta. Wolf improved Mxm by about 10% on the DECstation, but slightly degraded its performance on the i860; he slowed Cholesky by about 10% on the DECstation and by a slight amount on the i860. We neither improve nor degrade either kernel. More direct comparisons are not possible because Wolf does not present cache hit rates and the execution times were measured on different architectures.
5.6 Data Access Properties

To further interpret our results, we measured the data access properties of our test suite. For the applications that we significantly improved on the RS/6000 (Arc2d, Dnasa7, Appsp, Simple, and Wave), we present the data access properties in Table 5. (The data access properties for all of the programs are presented elsewhere [CMT94].) We report locality statistics for the original, ideal memory order, and final versions of the programs. Locality of Reference Groups classifies the percentage of RefGroups displaying each form of self reuse as invariant (Inv), unit-stride (Unit), or none (None); Group contains the percentage of RefGroups constructed partly or completely using group-spatial reuse. The amount of group reuse is indicated by the average number of references in each RefGroup (Refs/Group), where a RefGroup size greater than 1 implies group-temporal and occasionally group-spatial reuse; it is presented for each type of self reuse and their average (Avg). The LoopCost Ratio column estimates the potential improvement as an average (Avg) over all the nests and as a weighted average (W) that uses nesting depth. The last row contains the totals for all programs.

Table 5: Data Access Properties. For the original, final, and ideal versions of Arc2d, Dnasa7, Appsp, Simple, and Wave (and totals for all programs), the table lists the percentage of reference groups with invariant, unit-stride, no, and group reuse; the average references per group; and the LoopCost ratios. [The individual percentages are too garbled in this copy to reproduce reliably.]

The all-programs row in Table 5 reveals that, on average, fewer than two references exhibited group-temporal reuse in the inner loop, and no references displayed group-spatial reuse; instead, most programs exhibit self-spatial reuse. The ideal program exhibits more invariant reuse than the original or final versions. Invariant reuse typically occurs on loops with reductions, which are often involved in recurrences and therefore cannot be placed innermost. Although we were not able to exploit all of this invariant reuse, the improvements in cache performance far outweigh the loss in low-level parallelism when the recurrence is carried by the innermost loop; to regain any lost parallelism, unroll-and-jam can be applied to the outermost loop [CCK88, Car92].

Table 5 also reveals that each of these applications had a significant gain in self-spatial reuse (Unit) over the original code. Because of the relatively long cache lines on the RS/6000, spatial locality was the key to getting good cache performance on these applications. Programmers can make the effort to ensure unit-stride access with machine-dependent, source-level transformations, but we have shown that our optimization strategy makes this unnecessary: a variety of coding styles can be transformed to run efficiently, and the programmer is allowed to write the code in a form that she or he understands while the compiler handles the machine-dependent details of selecting the loop ordering.
5.7 Analysis of Individual Programs

Below, we examine Arc2d, Simple, and Gmtry (three of the applications that we improved) and Applu (the only application with a degradation in performance). We note specific coding styles that our system handles effectively.

Arc2d is a fluid-flow solver from the Perfect Benchmarks whose main computational routines exhibit poor cache performance due to non-unit stride accesses. The main computational routine contains an imperfect loop nest with four inner loops, two with nesting depth 2 and two with nesting depth 3. Our algorithm is able to achieve a factor of 6 improvement on the main loop nest by attaining unit-stride accesses to memory in the two loops with nesting depth 3. Permuting the loops into memory order alone accounted for a factor of 1.9 on the whole application. The additional improvement illustrated in Table 3 is attained similarly by improving less time-critical routines. Our optimization strategy obviated the need for the programmer to select the "correct" loop order for performance.

Simple is a two-dimensional hydrodynamics code. It contains two loops that are written in a "vectorizable" form, i.e., a recurrence is carried by the outer loop rather than the innermost loop, and these loops exhibited poor cache performance. Compound reorders these loops for data locality (both spatial and temporal) rather than for vectorization to achieve the improvements over the original program.

Gmtry, a SPEC benchmark kernel from Dnasa7, performs Gaussian elimination across rows, resulting in no spatial locality. Although this may have been how the author viewed Gaussian elimination conceptually, it translated to poor performance. Distribution and permutation are able to achieve unit-stride accesses in the innermost loop, which was the key to getting good cache performance. The programmer was thus allowed to write the code in a form for one type of machine and still attain machine-independent performance.

Applu suffers from a tiny degradation in performance. The two leading dimensions of its main data arrays are very small (5 x 5). While our model predicts better performance for unit-stride access to these arrays, the small array dimensions mean that the working set of the two innermost loops fits in cache, so non-unit stride access is not a problem and the original loop order gives slightly better performance.
6 Tiling

Assuming that the cache size is relatively large, the compiler can also apply tiling, a combination of strip-mining and loop interchange, to capture long-term reuse at outer loops [IT88, LRW91, WL91, Wol87]. Tiling must be applied judiciously because it affects scalar optimization, increases loop overhead, and may decrease spatial reuse at tile boundaries. Our cost model provides the key insight to guide tiling: the primary criterion for tiling is to create loop-invariant references with respect to the target loop. These references access significantly fewer cache lines than both consecutive and non-consecutive references, making tiling worthwhile despite the potential loss of spatial reuse at tile boundaries. For machines with long cache lines, it may also be advantageous to tile outer loops if they carry many unit-stride references, such as when transposing a matrix. In the future, we intend to study the cumulative effects of the optimizations presented in this paper combined with tiling, unroll-and-jam, and scalar replacement.
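A minimal sketch of this kind of tiling (our illustration, not code from the paper): strip-mining the K loop of matrix multiply and moving the tile loop outward makes a block of A's columns loop invariant with respect to J, so the block is reused from cache across all J iterations. TS is a hypothetical tile-size parameter chosen from the cache size.

*     Untiled JKI matrix multiply: A(I,K) sweeps all of A for every J.
      DO J = 1, N
         DO K = 1, N
            DO I = 1, N
               C(I,J) = C(I,J) + A(I,K) * B(K,J)
            ENDDO
         ENDDO
      ENDDO

*     K strip-mined by TS and the tile loop KK interchanged outward:
*     the TS columns A(1:N,KK:KK+TS-1) stay in cache and are reused
*     across the entire J loop.
      DO KK = 1, N, TS
         DO J = 1, N
            DO K = KK, MIN(KK+TS-1, N)
               DO I = 1, N
                  C(I,J) = C(I,J) + A(I,K) * B(K,J)
               ENDDO
            ENDDO
         ENDDO
      ENDDO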
7 Conclusion

This paper presents a comprehensive approach to improving data locality and is the first to combine loop permutation, fusion, distribution, and reversal into an integrated algorithm. Because we accept some imprecision in the cost model, our algorithms are simple and inexpensive in practice, making them ideal for use in a compiler. More importantly, the imprecision in our model is not a factor in limiting the compiler's ability to exploit data locality: the empirical results presented in this paper validate the accuracy of our cost model and algorithms for selecting the best loop structure for data locality. In addition, they show this approach has wide applicability for existing Fortran programs regardless of their original target architecture, but particularly for vector and Fortran 90 programs. We believe this is a significant step towards achieving good performance with machine-independent programming.
Acknowledgments

We wish to thank Ken Kennedy for providing the impetus and guidance for much of this research. We are grateful to the ParaScope research group at Rice University for the software infrastructure on which this work depends. In particular, we appreciate the assistance of Nathaniel McIntosh with the simulations. We acknowledge the Center for Research on Parallel Computation at Rice University for supplying most of the computing resources for our experiments and simulations.
References

[AS79] W. Abu-Sufah. Improving the Performance of Virtual Memory Computers. PhD thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign, 1979.

[Car92] S. Carr. Memory-Hierarchy Management. PhD thesis, Dept. of Computer Science, Rice University, September 1992.

[CCK88] D. Callahan, J. Cocke, and K. Kennedy. Estimating interlock and improving balance for pipelined machines. Journal of Parallel and Distributed Computing, 5(4):334-358, August 1988.

[CCK90] D. Callahan, S. Carr, and K. Kennedy. Improving register allocation for subscripted variables. In Proceedings of the SIGPLAN '90 Conference on Program Language Design and Implementation, White Plains, NY, June 1990.

[CHH+93] K. Cooper, M. W. Hall, R. T. Hood, K. Kennedy, K. S. McKinley, J. M. Mellor-Crummey, L. Torczon, and S. K. Warren. The ParaScope parallel programming environment. Proceedings of the IEEE, 81(2):244-263, February 1993.

[CHK93] K. Cooper, M. W. Hall, and K. Kennedy. A methodology for procedure cloning. Computer Languages, 19(2):105-117, 1993.

[CK94] S. Carr and K. Kennedy. Scalar replacement in the presence of conditional control flow. Software Practice and Experience, 24(1):51-77, January 1994.

[CMT94] S. Carr, K. S. McKinley, and C.-W. Tseng. Compiler optimizations for improving data locality. Technical Report TR94-234, Dept. of Computer Science, Rice University, July 1994.

[FST91] J. Ferrante, V. Sarkar, and W. Thrash. On estimating and enhancing cache effectiveness. In U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing, Fourth International Workshop, Santa Clara, CA, August 1991. Springer-Verlag.

[GJG88] D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformation. Journal of Parallel and Distributed Computing, 5(5):587-616, October 1988.

[GKT91] G. Goff, K. Kennedy, and C.-W. Tseng. Practical dependence testing. In Proceedings of the SIGPLAN '91 Conference on Program Language Design and Implementation, Toronto, Canada, June 1991.

[HKM91] M. W. Hall, K. Kennedy, and K. S. McKinley. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91, Albuquerque, NM, November 1991.

[IT88] F. Irigoin and R. Triolet. Supernode partitioning. In Proceedings of the Fifteenth Annual ACM Symposium on the Principles of Programming Languages, San Diego, CA, January 1988.

[KKP+81] D. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. J. Wolfe. Dependence graphs and compiler optimizations. In Conference Record of the Eighth Annual ACM Symposium on the Principles of Programming Languages, Williamsburg, VA, January 1981.

[KM92] K. Kennedy and K. S. McKinley. Optimizing for parallelism and data locality. In Proceedings of the 1992 ACM International Conference on Supercomputing, Washington, DC, July 1992.

[KM93] K. Kennedy and K. S. McKinley. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In Proceedings of the Sixth Workshop on Languages and Compilers for Parallel Computing, Portland, OR, August 1993.

[KMT93] K. Kennedy, K. S. McKinley, and C.-W. Tseng. Analysis and transformation in an interactive parallel programming tool. Concurrency: Practice & Experience, 5(7):575-602, October 1993.

[LP92] W. Li and K. Pingali. Access normalization: Loop restructuring for NUMA compilers. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, MA, October 1992.

[LRW91] M. Lam, E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, April 1991.

[McK92] K. S. McKinley. Automatic and Interactive Parallelization. PhD thesis, Dept. of Computer Science, Rice University, April 1992.

[War84] J. Warren. A hierarchical basis for reordering transformations. In Conference Record of the Eleventh Annual ACM Symposium on the Principles of Programming Languages, Salt Lake City, UT, January 1984.

[WL91] M. E. Wolf and M. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Program Language Design and Implementation, Toronto, Canada, June 1991.

[Wol87] M. J. Wolfe. Iteration space tiling for memory hierarchies, December 1987. Extended version of a paper which appeared in Proceedings of the Third SIAM Conference on Parallel Processing.

[Wol91] M. J. Wolfe. The Tiny loop restructuring research tool. In Proceedings of the 1991 International Conference on Parallel Processing, St. Charles, IL, August 1991.

[Wol92] M. E. Wolf. Improving Locality and Parallelism in Nested Loops. PhD thesis, Dept. of Computer Science, Stanford University, August 1992.