Improving the Ratio of Memory Operations to Floating-Point Operations in Loops

STEVE CARR
Michigan Technological University
and
KEN KENNEDY
Rice University

Over the past decade, microprocessor design strategies have focused on increasing the computational power on a single chip. Because scientific applications often require more data from cache and memory than such machines can deliver, and because memory accesses and floating-point operations are pipelined, idle computational cycles are common when scientific programs are executed. To overcome these bottlenecks, programmers have learned to write programs in a coding style that ensures a better balance between memory references and floating-point operations. It is our view that this is a step in the wrong direction, because it makes programs machine-specific. A programmer should not be required to write a new program version for each new target architecture; instead, the task of specializing a program to a target machine should be left to the compiler.

But is our view practical? Can a sophisticated optimizing compiler obviate the need for the myriad of programming tricks that have found their way into practice to improve the performance of the memory hierarchy? In this paper we attempt to answer that question. To do so, we develop techniques that automatically restructure program loops to achieve high performance on specific target architectures. These techniques attempt to balance computation and memory accesses and to eliminate or reduce pipeline interlock. They statically estimate the performance of each loop in a program and use these estimates to determine whether to apply various loop transformations. Experiments show that the estimates of the balance between memory operations and computation are very accurate and that integer-factor speedups are possible with our automatic system. Additionally, the experiments reveal little difference between the balance achieved by our automatic techniques and the balance achieved by hand optimization.

Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors—compilers, optimization

General Terms: Languages

Additional Key Words and Phrases: Balance, unroll-and-jam

This research was supported by NSF grant CCR-9120008, ONR grant N00014-91-J-1989, and by IBM.
Authors' addresses: S. Carr, Department of Computer Science, Michigan Technological University, Houghton, MI 49931; K. Kennedy, CITI, Rice University, Houston, TX 77251-1892.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1994 ACM 0164-0925/94/1100-1768 $03.50
ACM Transactions on Programming Languages and Systems, Vol. 16, No. 6, November 1994, Pages 1768-1810.
1. INTRODUCTION
Over the past decade, the computer industry has realized dramatic improvements in the power of microprocessors. These gains have been achieved both by cycle-time improvements and by architectural innovations like multiple instruction issue and pipelined functional units. As a result of these architectural improvements, today's microprocessors can perform many more operations per machine cycle than their predecessors. But with these gains have come problems. Because the latency and bandwidth of memory systems have not kept pace with processor speed, computations are often delayed waiting for data from memory. As a result, processors see idle computational cycles and empty pipeline stages more frequently. In fact, memory and pipeline delays have become the principal performance bottleneck for high-performance microprocessors.
To overcome these performance problems, programmers have learned to employ a coding style that achieves a good balance between memory references and floating-point operations. The goal of this coding methodology is to eliminate pipeline interlock and to match the ratio of memory operations to floating-point operations in program loops to the optimum ratio that the target machine can deliver. This is usually done by unrolling outer loops and merging the resulting copies of inner loops. Unfortunately, this coding method almost always leads to very ugly code. Additionally, it is error prone, tedious, and time consuming.
The thesis of this paper is that most of this type of hand optimization is unnecessary, because it can, and should, be performed automatically in a compiler. Having the compiler optimize loops is better for the programmer in two ways. First, it avoids a lot of work that is not necessary for the expression of a correct solution. Second, it makes it possible for the same program to be used on a variety of machine architectures.
To establish this thesis, the paper presents new techniques to restructure program loops automatically for high performance on specific target architectures. A method to estimate statically the performance of a given loop on a particular architecture is presented and used in an algorithm to determine how to transform a loop to best improve its performance. This approach extends previous work by showing how to decide automatically when and how to transform a loop. The transformation system is tailored to a specific target machine based on a few parameters of that machine, such as number of registers, floating-point operations per cycle, and loads per cycle. Thus, the transformations are machine independent in the sense that they can be retargeted to new processors by changing a few parameters. The method described in this paper has been implemented in an experimental compiler, and experiments to determine its effectiveness have been carried out. The paper concludes with a report on the performance improvements observed in these experiments.
2. BACKGROUND
This section lays the foundation for the application of reordering transformations that improve the memory performance of programs. It begins with a measure of machine and loop performance that will be used in the application of the transformations described in the next section. Then, a form of dependence graph used by the transformation system to model memory usage is presented.
2.1 Performance Model

It is assumed that the target architecture is pipelined and allows asynchronous execution of memory accesses and floating-point operations (e.g., the IBM RS/6000 or HP PA-RISC). It is also assumed that the target machine has a typical optimizing compiler—one that performs scalar optimizations (e.g., strength reduction), allocates the registers globally (via a typical coloring scheme), and schedules both the instruction and the arithmetic pipelines. This makes it possible for the transformation system to restructure the loop nests, while leaving the details of optimizing the loop code to the compiler for the target machine.

To measure the performance of program loops given the above assumptions, we use the notion of balance defined by Callahan et al. [1988].

2.1.1 Machine Balance. A computer is balanced when it can operate in a steady-state manner with both memory accesses and floating-point operations being performed at peak speed. To quantify this relationship, β_M is defined as the rate at which data can be fetched from memory, M_M, compared to the rate at which floating-point operations can be performed, F_M:

    β_M = (max words/cycle) / (max flops/cycle) = M_M / F_M.

The values of M_M and F_M represent peak performance, where the size of a word is the same as the precision of the floating-point operations. Every machine has at least one intrinsic β_M. For example, on the IBM RS/6000 we use β_M = 1.¹

¹Note that we count floating-point instruction issues rather than actual operations. Therefore, a multiply-add instruction counts as one operation.
2.1.2 Loop Balance. Just as machines have balance ratios, so do loops. Balance for a specific loop can be defined as

    β_L = (number of memory references) / (number of flops) = M / F.        (1)

It is assumed that references to array variables are actually references to memory, while references to scalar variables involve only registers. Memory references are assigned a uniform cost under the assumption that loop interchange, tiling, and cache copying can be performed to attain cache locality [Kennedy and McKinley 1992; Lam et al. 1991; Meadows et al. 1992; Wolf and Lam 1991].

Comparing β_L to β_M gives a measure of the performance of a loop running on a particular architecture. If β_L > β_M, then the loop needs data at a higher rate than the machine can provide, and idle computational cycles exist. Such a loop is said to be memory bound, and its performance can be improved by lowering β_L. If β_L < β_M, then data cannot be processed as fast as it is supplied to the processor, and idle memory cycles will exist. Such a loop is said to be compute bound. Compute-bound loops run at the peak floating-point rate of a machine and need not be further balanced. This is because floating-point operations cannot usually be removed, and arbitrarily increasing the number of memory operations is unlikely to be beneficial. Finally, if β_L = β_M, the loop is balanced for the target machine.

2.1.3 Pipeline Interlock. Previous work in loop balance has included a measure of pipeline interlock in the computation of balance. The complete formulation of balance is as follows:

    β_L = (M + idle cycles) / F.        (2)

In order to relate balance more closely to performance, idle cycles caused by pipeline interlock must be removed. Callahan et al. [1988] show how to use unroll-and-jam to remove idle cycles due to pipeline interlock explicitly, using Equation (2). The method described in Section 3 of this paper does not explicitly remove interlock. Instead, interlock is implicitly removed by unrolling to create balanced loops using Equation (1). It is assumed that the unrolling required to balance a loop will create enough copies of an innerloop recurrence to remove all interlock. The experiments in Section 4 validate this assumption for short pipelines. However, the Callahan et al. technique can be used to increase unrolling amounts until the interlock is removed. Interlock that is still present after unroll-and-jam is not ignored, but rather is handled separately and only when necessary. Section 3.2.4 describes interlock removal in more detail.
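To make Equation (1) concrete, consider a hypothetical DAXPY-style loop body Y(I) = Y(I) + A*X(I); this loop and the machine figure below are illustrative assumptions and do not come from the original text. One iteration performs three memory references (loads of X(I) and Y(I) and a store of Y(I)) and two floating-point operations, so

    β_L = M / F = 3 / 2 = 1.5.

On a machine with β_M = 1 we have β_L > β_M, so the loop is memory bound and is a candidate for the transformations of Section 3.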
2.2 Dependence Graph

A dependence exists between two references if there exists a control-flow path from the first reference to the second, and both references access the same memory location [Kuck 1978]. The dependence is

—a true dependence or flow dependence if the first reference writes to the location and the second reads from it,
—an antidependence if the first reference reads from the location and the second writes to it,
—an output dependence if both references write to the location, and
—an input dependence if both references read from the location.

If two references, v and w, are contained in n common loops, separate instances of the execution of the references can be described by an iteration vector. An iteration vector, denoted i, is simply the values of the loop control variables of the loops containing v and w. The set of iteration vectors corresponding to all iterations of the loop nest is called the iteration space.
      DO 10 I = 1,N
        DO 10 J = 1,N
          DO 10 K = 1,N
   10     A(I+1,J,K-1) = A(I,J,K) + C(I,J)

Fig. 1. Example dependences. The edge from the definition of A to its use has distance vector (1,0,-1) and direction vector (<,=,>); the edge from C to itself has direction vector (=,=,*).
Using iteration vectors, a distance vector, d = (d_1, d_2, ..., d_n), can be defined for each dependence: if v accesses location Z on iteration i_v and w accesses location Z on iteration i_w, the distance vector for this dependence is i_w - i_v. Under this definition, the kth component of the distance vector is equal to the number of iterations of the kth loop (numbered from outermost to innermost) between accesses to Z. The direction vector D = (D_1, ..., D_n) of a dependence is defined by the equation

    D_k = "<"  if d_k > 0
          "="  if d_k = 0
          ">"  if d_k < 0.

These symbols can be combined when more than one condition is true to give ≤, ≥, and ≠. If all direction vectors are possible for a dependence, then its direction is denoted as *. The elements are always displayed in order left to right, from the outermost to the innermost loop in the nest. As an example, consider Figure 1. The distance and direction vectors for the dependence between the definition and use of array A are (1, 0, -1) and (<, =, >), respectively. The direction vector for the dependence from C to itself is (=, =, *). There is no single distance vector for this dependence.

The loop associated with the outermost nonzero direction (or distance) vector entry is said to be the carrier of the dependence. The threshold of a carried dependence is the value of the distance vector entry for the carrier. A minimum threshold equal to the minimum possible positive distance vector value is associated with those dependences having no single distance vector value in the outermost position. If all entries are 0 or =, the dependence is loop independent, and its threshold is 0. In this paper, we consider only those dependences that have a consistent threshold—that is, those dependences for which the threshold is constant throughout the execution of the loop—for memory reuse [Callahan et al. 1988; Gannon et al. 1987]. For a dependence to have a consistent threshold, the location accessed by the dependence source on iteration i must be accessed by the sink on iteration i + c, where c does not vary with i. A dependence with a consistent threshold is called a consistent dependence; otherwise, it is called an inconsistent dependence.
For the purposes of this paper, only those consistent dependences whose source and sink have only one induction variable in each subscript position are considered for memory reuse. Multiple-induction-variable subscripts are handled elsewhere [Carr 1992].
3. METHOD
Conceptually, the transformation method is simple. For each loop nest, β_L is computed. Then, for each memory-bound loop, transformations are applied to try to make β_L ≤ β_M. This section presents two transformations used to reduce β_L. The first transformation reduces the number of memory accesses in an innermost loop. The second introduces more floating-point operations into the innermost loop, keeping the loop-nest total constant, while reducing the overall memory requirements of the entire loop nest.
3.1 Scalar Replacement
The number of memory references in a loop can be lowered by replacing the references to arrays with sequences of scalar variables. In the code shown below,

      DO 10 I = 2,N
   10 A(I) = A(I-1) + B(I)

the value accessed by A(I-1) on all but the first iteration is defined by A(I) on the previous iteration of the loop. Using this knowledge, the flow of values between the references can be expressed with scalar temporaries as follows.

      T = A(1)
      DO 10 I = 2,N
      T = T + B(I)
   10 A(I) = T
Since the values held in scalar quantities will probably be in registers [Chaitin et al. 1981], the load of A(I-1) has been removed, resulting in a reduction in the memory-cycle requirements of the loop. This transformation is called scalar replacement and is described in detail elsewhere [Callahan et al. 1988; 1990; Carr and Kennedy 1994; Dehnert et al. 1989; Duesterwald et al. 1993; Rau 1991; Rau et al. 1992]. Essentially, any consistent true or input dependence with 0's in the outer n-1 distance vector entries is amenable to scalar replacement. Additionally, any array reference that is invariant with respect to the innermost loop may be removed by scalar replacement.
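As a small illustration of the loop-invariant case, consider a nest in which B(J) does not depend on the innermost loop index I. The reference can be held in a scalar outside the inner loop. The code below is our own minimal sketch (the loop and the temporary S are assumptions for illustration, not code from the paper), assuming the compiler keeps S in a register:

      DO 10 J = 1,N
        S = B(J)
        DO 10 I = 1,N
   10   A(I,J) = A(I,J) + S

The inner loop now performs one fewer load per iteration.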
3.2 Unroll-and-Jam
Unroll-and-jam is a transformation that can be used in conjunction with scalar replacement to improve the performance of memory-bound loops [Aiken and Nicolau 1987; Allen and Cocke 1972; Callahan et al. 1988]. The transformation unrolls an outer loop and then jams the resulting inner loops back together.²

²Note that unroll-and-jam is separate from any innermost-loop unrolling that might be done to improve scheduling; unroll-and-jam is needed to capture loop-carried reuse for scalar replacement.

Using unroll-and-jam, more computation can be introduced into an innermost loop body without a proportional increase in memory references. For example, the loop
      DO 10 I = 1,2*M
        DO 10 J = 1,N
   10   A(I) = A(I) + B(J)
after unroll-and-jam of I by a factor of 1 becomes³

      DO 10 I = 1,2*M,2
        DO 10 J = 1,N
        A(I) = A(I) + B(J)
   10   A(I+1) = A(I+1) + B(J)

³Note that unrolling by a factor of k results in k + 1 loop bodies.

In the original loop, one floating-point operation and one memory reference, B(J), are left after scalar replacement, giving a balance of 1. After applying unroll-and-jam and scalar replacement, two floating-point operations and only one memory reference exist in the loop, giving a balance of 0.5. On a machine that can perform twice as many floating-point operations as memory accesses per clock cycle, the second loop would perform better.
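The balance of 0.5 comes from keeping A(I) and A(I+1) in scalars across the J loop. The following is our own sketch of what the scalar-replaced loop looks like (the paper describes the effect but does not show this code; the temporaries T0 and T1 and the label 20 are ours):

      DO 10 I = 1,2*M,2
        T0 = A(I)
        T1 = A(I+1)
        DO 20 J = 1,N
        T0 = T0 + B(J)
   20   T1 = T1 + B(J)
        A(I) = T0
   10   A(I+1) = T1

Each J iteration now performs one load of B(J) and two floating-point additions.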
Although unroll-and-jam has been studied extensively, it has not been shown how to tailor unroll-and-jam to specific loops run on specific architectures. In the past, unroll amounts have been determined experimentally and specified with a compile-time parameter [Callahan et al. 1990]. However, the best choice for unroll amounts varies between loops and architectures. Section 3.2.3 derives a method to choose unroll amounts automatically in order to balance program loops with respect to a specific target architecture. Unroll-and-jam is tailored to a specific machine based on a few parameters of the architecture, such as the effective number of registers and machine balance. The result is a machine-independent transformation system in the sense that it can be retargeted to new processors by changing a few parameters.
The rest of this section begins with an overview of the safety conditions for unroll-and-jam. Then, a method for updating the dependence graph so that scalar replacement can take advantage of the new opportunities created by unroll-and-jam is presented. Finally, an automatic method to balance program loops with respect to a particular architecture is derived.

3.2.1 Safety. With a semantics that is defined only on correct programs (as in Fortran), unroll-and-jam is safe if the flow of values is not changed. Dependences can prevent unroll-and-jam or limit the amount of unrolling that can be done and still allow jamming to be safe. Consider the following loop.
      DO 10 I = 1,2*M
        DO 10 J = 1,N
   10   A(I,J) = A(I-1,J+1)
There is a true dependence from A(I,J) to A(I-1,J+1) with a distance vector of (1,-1). Unrolling the outer loop once creates a loop-independent true dependence that becomes an antidependence in the reverse direction after loop jamming. This dependence prevents jamming because the order of the store to and load from a location would be reversed, as shown in the unrolled code below.
      DO 10 I = 1,2*M,2
        DO 10 J = 1,N
        A(I,J) = A(I-1,J+1)
   10   A(I+1,J) = A(I,J+1)
Location A(3,2) is defined on iteration (3,2) and read on iteration (4,1) in the original loop. In the transformed loop, the location is read on iteration (3,1) and defined on iteration (3,2), reversing the access order and changing the semantics of the loop. The following theorem from Callahan et al. establishes the conditions for unroll-and-jam safety and is stated without proof [Callahan et al. 1988].
THEOREM 3.2.1. Let e be a true, output, or antidependence carried at level k with distance vector

    d = (0, ..., 0, d_k, 0, ..., 0, d_j, ...),

where d_j < 0 and all components between the kth and jth are 0. Then the loop at level k can be unrolled at most d_k - 1 times before a dependence is generated that prevents fusion of the inner n - k loops.

Essentially, this theorem states that unrolling loop k more than d_k - 1 times will introduce a dependence that prevents loop fusion. If all d_j are positive, fusion will not be prevented.
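Applying the theorem to the loop above makes the earlier observation precise (the arithmetic here is ours; the numbers come from the example already in the text): the dependence has distance vector (1,-1), so k = 1, d_k = 1, and d_j = -1, and the I loop can be unrolled at most

    d_k - 1 = 1 - 1 = 0

times. That is, no unroll-and-jam of I is legal for that loop, which matches the reversed store/load order seen in the unrolled code.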
To allow unroll-and-jam in some loops containing edges with negative distance vector entries, loop skewing can be used to remove the negative entries [Wolfe 1986]. It is assumed that any skewing is done before unroll-and-jam is applied. The safety test for unroll-and-jam can be violated when dependences preventing jamming appear at levels other than the innermost, or when non-DO statements appear at multiple loop levels; in that case loop distribution must be done before unroll-and-jam in order for the transformation to be safe. Therefore, unroll-and-jam is safe if no jamming-preventing dependence is generated by loop unrolling and if no recurrence involving data or control dependences is violated by distribution [Allen and Kennedy 1987; Kuck et al. 1980].
3.2.2 Dependence Copying. To allow scalar replacement to take advantage of the new opportunities for reuse created by unroll-and-jam, the dependence graph must be updated to reflect these changes. In this section, we describe a method to update the dependence graph after loop unrolling for the two most prominent classes of variables. Both classes have array references that contain only one induction variable in any given subscript expression.

3.2.2.1 Variant References. Below is a method to compute the updated dependence graph after unrolling for consistent dependences whose references contain only one loop induction variable in each subscript position and whose references are variant with respect to the unrolled loop [Callahan et al. 1988].
If the ith loop in a nest L is unrolled by a factor of k, then the updated dependence graph G' = (V', E') for L' can be computed from the dependence graph G = (V, E) for L by the following rules:

(1) For each v ∈ V, there are k + 1 nodes v_0, ..., v_k in V'. These correspond to the original reference and its k copies.

(2) For each edge e = (v, w) ∈ E, with distance vector d(e), there are k + 1 edges e_0, ..., e_k, where e_m = (v_m, w_n), v_m is the mth copy of v, w_n is the nth copy of w, and

    n = (m + d_i(e)) mod (k + 1)                                         (3)

    d_i(e_m) = ⌊d_i(e)/(k + 1)⌋        if n ≥ m
               ⌊d_i(e)/(k + 1)⌋ + 1    if n < m.                          (4)

To extend dependence copying to unroll-and-jam, we assume that loops are unrolled in an outermost- to innermost-loop ordering. If loop k is unrolled and a new distance vector, d(e) = (d_1, d_2, ..., d_n), is created such that d_k(e) = 0 and, for all i < k, d_i(e) = 0, then the next nonzero entry, d_j(e), becomes the threshold of e and j becomes the carrier of e. If d_i(e) = 0 for all i, then e is loop independent. If there exists i < k such that d_i(e) > 0, then e remains carried by loop i with the same threshold.

To demonstrate dependence copying, consider Figure 2. The original edge from A(I,J) to A(I,J-2) has a distance vector of (2,0) before unroll-and-jam. After unroll-and-jam the edge from A(I,J) now goes to A(I,J) in statement 10 and has a vector of (0,0). An edge from A(I,J+1) to A(I,J-2) with a vector of (1,0) and an edge from A(I,J+2) to A(I,J-1) with a vector of (1,0) have also been created. For the first edge, the first formula for d_i(e_m) applies. For the latter two edges, the second formula applies.
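Working Equations (3) and (4) through for this example (the arithmetic is ours, but it reproduces the edges listed above): the J loop is unrolled by k = 2 and the original edge has d_J(e) = 2, so for each copy m of the source

    m = 0:  n = (0 + 2) mod 3 = 2,  n ≥ m,  d_J(e_0) = ⌊2/3⌋ = 0
    m = 1:  n = (1 + 2) mod 3 = 0,  n < m,  d_J(e_1) = ⌊2/3⌋ + 1 = 1
    m = 2:  n = (2 + 2) mod 3 = 1,  n < m,  d_J(e_2) = ⌊2/3⌋ + 1 = 1,

which are exactly the (0,0), (1,0), and (1,0) vectors shown in Figure 2.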
3.2.2.2 Invariant References. To aid in the explanation of dependence copying for invariant references, the following terms are defined.

(1) If any reference is invariant with respect to a particular loop, that loop is reference invariant.

(2) If a specific reference, v, is invariant with respect to a loop, that loop is v-invariant.

In the following example, I and J are reference-invariant loops; I is B(J)-invariant; and J is A(I)-invariant.

      DO 10 I = 1,N
        DO 10 J = 1,N
   10   A(I) = A(I) + B(J)
The behavior of invariant references differs from that of variant references when unroll-and-jam is applied. Given an invariant reference v, the direction vector entry of an incoming edge corresponding to a v-invariant loop represents multiple values rather than just one value. Therefore, Equation (4) is insufficient.
      DO 10 J = 1,M                      DO 10 J = 1,M,3
        DO 10 I = 1,N                      DO 10 I = 1,N
   10   A(I,J) = A(I,J-2)                  A(I,J)   = A(I,J-2)
                                           A(I,J+1) = A(I,J-1)
                                    10     A(I,J+2) = A(I,J)

Fig. 2. Distance vectors before and after unroll-and-jam. The original edge has distance (2,0); after unrolling J by 2, the copies have distances (0,0), (1,0), and (1,0).
To accommodate the "*" in the outermost nonzero direction vector entry, the following steps are applied in order to copy an invariant-dependence edge.

(1) Equation (4) is applied to the minimum positive distance indicated by the "*" in the outermost nonzero direction vector entry of the original edge. The minimum distance is 1, and the result will be a loop-independent edge.
(2) A copy of the original edge is created in each new loop body between the copies of the source and sink corresponding to the original references. This is because the "*" includes a distance vector entry of 0.

(3) A copy of the original dependence edge is inserted between every pair of identical references. All copies of an invariant reference will also be invariant. Therefore, a dependence exists between every reference pair.

(4) Loop-independent dependences are inserted from the references within a statement to all identical references following them. This inserts copies of the edge created in step 1 to account for unroll values of more than 1. There is no need to insert an edge identical to the one inserted in step 1.

In Figure 3, J is A(I,K)-invariant. When J is unrolled by 1, a loop-independent dependence is inserted between the two references to A(I,K) per step 1; a copy of the original dependence is made for the new copy of A(I,K) per step 2; and there are loop-carried dependences inserted between the two references per step 3. Step 4 is not necessary because step 1 has already inserted a loop-independent dependence.
The safety rules of unroll-and-jam prevent the exposure of a negative distance vector entry in the outermost nonzero position for true, output, and antidependences. However, the same is not true for input dependences. The order in which reads are performed does not change the semantics of the loop. Therefore, if the outermost entry of an input dependence distance vector becomes negative during unroll-and-jam, the direction of the dependence is changed, and the entries in the vector are negated.
      DO 10 K = 1,N                      DO 10 K = 1,N
        DO 10 J = 1,M                      DO 10 J = 1,M,2
          DO 10 I = 1,N                      DO 10 I = 1,N
   10     ... = A(I,K)                       ... = A(I,K)
                                      10     ... = A(I,K)

Fig. 3. Invariant-dependence copying. The original edge on A(I,K) has direction vector (=,*,=); after J is unrolled by 1, step 1 adds a loop-independent (0,0,0) edge, step 2 copies the original edge, and step 3 adds loop-carried edges between the copies.
3.2.3 Improving Balance with Unroll-and-Jam. As stated earlier, the goal is to convert loops that are bound by memory access time into loops that are balanced. Since unroll-and-jam can introduce more floating-point operations without a proportional increase in memory accesses, it has the potential to improve loop balance.
In this section, a decision procedure to compute the balance ratio of memory accesses per floating-point operation in unrolled loops is developed. As an example of how the decision procedure works, consider Figure 4. The original loop has a balance of three memory operations per floating-point operation, but it also has two dependences carried by the I-loop of which unroll-and-jam by a factor of 1 can take advantage. After unrolling the I-loop by 1, there are two dependences that become loop independent and expose opportunities for scalar replacement, leaving a balance of two memory operations per floating-point operation.

The procedure for unroll-and-jam takes the dependence edges in the loop and the unroll amount for the outer loop, and then computes the unrolled-loop balance as if the loop were unrolled by the specified amount. This procedure is applied to various unroll amounts in an efficient manner until the unroll amount that best balances a loop on a particular architecture is found.

Although unroll-and-jam improves balance, using it carelessly can be counterproductive. Transformed loops that spill floating-point registers or whose object code is larger than the instruction cache may suffer performance degradation. Since the size of the object code is compiler dependent, it is assumed that the instruction cache is large enough to hold the unrolled loop body. Thus, the transformation heuristic can remain relatively independent of the details of a particular software system.
Fig. 4. Objective for unroll-and-jam (a loop of the form A(I,J) = A(I-1,J) + A(I-2,J) before and after unroll-and-jam of I by 1, showing the two I-carried dependences on A that become loop independent).

This leaves the following objectives for unroll-and-jam:

(1) Balance a loop with a particular architecture.

(2) Control register pressure.

The goals stated for unroll-and-jam can be expressed as a nonlinear integer optimization problem:

    objective function:  min |β_L - β_M|_0
    constraint:          # floating-point registers required ≤ register-set size

where |β_L - β_M|_0 is the balance norm for the objective function. The decision variables in the optimization problem are the unroll amounts for each of the loops in a loop nest. The register-pressure constraint limits the unrolling to the point before floating-point registers are spilled. The ε in the balance norm causes the solution to favor a slightly compute-bound loop over a slightly memory-bound one.

For each loop nest within a program, its possible transformation is modeled as a problem of the above form. Solving the optimization problem yields unroll amounts that balance the loop nest as much as possible.
the loop nest as much as possible.
Using the
following
definitions
throughout
the derivation,
the compile-time
construction
and efficient
solution
V = array
references
V, = members
X, = number
(This
M = number
the
perfectly
of times
nested.
of memory
of this
Section
loops. Additionally,
flow in the innermost
ACM
the
ith
Transactions
it
outermost
loop in L is unrolled
created
by unrolling
after
unroll-and-jam
operations
references
derivation,
is detailed.
reads
of loop bodies
3.2.4.5
problem
loop
are memory
of floating-point
purposes
optimization
in an innermost
of V that
is the number
F = number
For
of a balance
after
it
shows
+ 1
loop
i)
loop
nest
unroll-and-jam
is assumed
how
to
that
handle
each
imperfectly
is assumed
that loops containing
conditional
loop body are not candidates
for unroll-and-jam.
on Programming
Langnages
and
Systems,
Vol.
16, No
6, November
is
nested
control
This is
1994
1780
Steve Carr and Ken Kennedy
.
because inserting
disastrous
effects
replacement
additional
control
flow within
4 Finally,
upon
performance.
removes
not a limitation
no stores.
This
of the method
an innermost
loop can have
it is assumed
that
scalar
is done for clarity
[Carr
of presentation
and is
1992].
3.2.3.1 Computing Loop Balance. Since the value of β_M is a compile-time constant defined by the target architecture, only the coefficients in β_L need to be computed at compile time to compute the objective function for a particular loop. To compute β_L for a particular loop, the number of floating-point operations, F, and memory references, M, are computed for one innermost-loop iteration. F is simply the number of floating-point operations in the original loop, f, multiplied by the number of loop bodies created by unroll-and-jam, giving

    F = f × X_1 × X_2 × ... × X_n,

where n is the number of loops in a nest. M requires a detailed analysis of the dependence graph.
that floating-point
common
subexpressions
F, it is assumed
are not removed
from the inner loop. There are two reasons for this assumption. First,
the removal
of floating-point
computation
from memory-bound
loops is not likely to improve
performance.
Since the loop is already bound by
memory
increase
assuming
cycles, the removal
of floating-point
the number
of idle computation
no common
subexpression
operations
will probably
only
cycles in the loop. Second,
by
removal,
the solution
equation
is an order of magnitude
faster.
In computing
M, an infinite
register
set is assumed.
to the loop balance
The register-pressure
constraint
in the optimization
problem
ensures that any assumption
about a
memory
reference
being removed
is true. To simplify
the derivation
of the
formulation
for memory
cycles, it is initially
assumed
that
at most one
incoming
or outgoing
edge is incident
upon each u G V. This restriction
is
removed
at the end of Section 3.2.4.
The reference
set of the loop can
different
memory
listed below.5
(1) V“
behavior
= references
when
without
be partitioned
unroll-and-jam
an incoming
(2) V,c = memory
ing consistent
reads that
dependence,
(3) Vrl = memory
reads
that
consistent
are invariant
M=
have
‘At this point,
edges that
shown
this
to
be
we are only concerned
can possibly
be made
sets
that
exhibit
These
sets
dependence.
with
with
A#+M:+
true
with
innermost.
on
the
loops
respect
to a loop.
each partition,
{I@’,
M:,
TransactIons
on
Programmmg
Languages
IBM RS/6000
where
Therefore,
and
M:},
M:.
fusion
we only
is legal
deal
after
with
unroll-and-jam
posltwe
distance
entries.
ACM
are
have a loop-carried
or loop-independent
incombut are not invariant
with respect to any loop.
Using this partitioning,
the cost associated
is computed
to get the total memory
cost,
4 Experiments
into
is applied.
Systems,
Vol
16,
No
6, November
1994
and
vector
Improving the Ratio of Memory Operations
A(I,
J,K)
=
A(I,
A(I,
A(I,
A(I,
. . .
10
Fig.5,
For
the
first
and-jam
partition,
contains
ence. This
value
V“
@,
before andafter
each
one memory
canbe
copy
expressed
additional
memory.
in
copies
For each
Figure
from
.
loop
body
associated
with
produced
by unroll-
each original
refer-
nxt].
q
lsz<n
V@
5, unrolling
the
.
.
as follows:
,1 f=
example,
J,K)
= . . .
J,K+I)
= . .
J+I, K) = . .
=..
J+ I,K+I)
unroll-and-jam.
of the
reference
M@=
For
1781
DO 10 K = I,IT,2
DO 10 J = I,N,2
DO 10 I = I,N
DO 10 K = l,N
DO 10 J = 1,11
DO 10 I = 1,14
10
-.
.
the
reference
outer
two
to A(1,J, K) that
loops
will
by
1 creates
be references
3
to
u G V,c, unroll-and-jam
may create dependence
useful in scalar
for some of the new references
created. In order for a reference,
replacement
u, to be scalar
replaced,
its incoming
an innermost-loop-carried
an innermost
dependence
dependence,
e,,, must
be converted
into
or a loop-independent
dependence,
which is called
edge. In terms
of the distance
vector associated
with an edge, d(e,, ) = ( dl, d2, . . . . d.), the edge is made innermost
when its
distance
vector becomes
d(ej ) = (O, 0,...,0,
d.).
Essentially,
the subscript
expression
associated
with each outerloop
induction
variable
must be identical in both the source and sink of the dependence
in order for the reference
at
the sink of the dependence
the references
will be the
to be removed.
After
sink of an innermost
for applying
unroll-and-jam
6, the
copy
of A(I,J – 2) (the
from
A(I,J + 3) that
first
dependence
copies
(A(I,J)
and
must
quantify
this
reference
is not
A(1,J + 1)) are the
unroll-and-jam,
only some of
edge, The decision procedure
value.
innermost,
sinks
For example,
in Figure
to A(I,J – 1)) is the
but
the
of innermost
sink
second
edges
(in
of a
and
third
this
case,
ones that are loop independent)
and will be removed
by scalar replacement.
To compute
the number
of memory
references
inserted
in the innermost
loop
by unroll-and-jam
after scalar replacement,
first the total number
of
references
before scalar replacement
is computed
as in @,
Next,
the
ment
is computed
number
of new
references
by quantifying
that
the
will
number
be removed
by scalar
of references
that
replace-
will
be the
sink of an innermost
dependence
after unroll-and-jam.
To simplify
the derivai
is considered,
where
Ve, d(e) =
only
the
unrolling
of loop
tion,
is extended
to arbitrary
loop
(o,..., O,d,, O,...,0, d. ). Later, the formulation
nests and distance
vectors.
ACM
Transactions
on Programming
Languages
and
Systems,
Vol.
16, No
6, November
1994
1782
Steve Carr and Ken Kennedy
.
DO 10 J = I,H,4
DO 10 I = l,?i
DO 10 J = 1,11
DO 10 I = l,li
+
A(I,
‘0 ‘(r!-H7
J)
= A(I,
J:2)
(o, o)
(1,0)
(2,0)
A(I
‘( I, J-l)
J+l)
(0,0)
\+
A(I,
J+2)
A(I,
(1,0)
J)
\
I
10
V,( before
Flg.6
A(I,
andafter
J+3)
I
= A(I,
J+I)
unroll-and-Jam
Recall from Eauation
(4) in Section 3.2.2 that, in orderto
create adistance
.
vector with a O in the ith position,
there must be some u G V,c for which
d,(e, ) <X,,
resulting
in an innermost
dependence
edge (if d,(e) >Xl,
then
not enough
outer
iterations
are unrolled
to create
a reference
with
the
subscript
expressions
containing
the ith induction
variable
being equivalent
in both the source and sink of a dependence).
However,
depending
upon
which of the two distance
formulas
is applied,
it may be impossible
to create
an edge with
entry
a O in the
ith
position.
In order
for the formula
that
allows
an
to become
O, d,( e; ) = [ci,(e,, )\X, j, to be applied,
the edge created
by
and-j am, ej = ( w~, U. ), must have n > m. If n < m, then the applica-
unroll-
an entry
from becoming
O.
ble formula,
d, (e;, ) = 1dt(e,, )/Xl ] + 1, prevents
Below, Theorem
3.2.3.1.1 shows that n > m is true for Xl – dl(ec ) of the new
dependence
edges if d,( e,, ) < X,. This
can be removed
by scalar replacement.
shows
that
X, – d, ( e, ) new references
THEOREM
3.2.3.1.1.
Given O < m, d,(e) <Xl,
and n = (m + d,(e)) mod X,
for each new edge e; = ( Zum, U.) created from e,, = ( WO, ZJO
) by unmll-andjam,
then
n>mtim<Xc-dl(e).
PROOF.
See Appendix
In the case where
a general
formula
placement.G
For an arbitrary
after
❑
C.
X, < d,( e, ), no edge can be made
of (XC — d, (e,))+
distance
unroll-and-jam,
the
edges
vector,
new
if any
edge
will
that
outer
not
innermost,
are
amenable
entry
is left
be innermost.
resulting
in
to scalar
re-
greater
6
lfx>o
‘+=
ACM
TransactIons
on
Programmmg
Languages
(
;
ifx
and
Systems,
<O’
Vol
16,
No
than
If a dependence
6, November
1994
O
Improving the Ratio of Memory Operations
edge becomes
that
innermost,
additional
edge. For example,
innermost
edge that
in Figure
unrolling
of any
6, the load from
is a copy of the incoming
loop
creates
1783
a copy
of
A(1, J + 1) has an incoming
edge for the load
The edge into A(1, J + 1) was created by increasing
the
J-loop after A(1,J)’s edge became innermost.
Quantifying
JJ (XL -
.
unroll
these
from
A(1, J).
amount
of the
ideas gives
d,(e)))+
l<2<n
references
that
from
total
the
can be scalar
number
unroll-and-jam
gives
replaced
for each u c V c. Subtracting
before
scalar replacement
of references
the value
for M,c. Therefore,
~il<c<n
predicts
Xl
<2
6, Xl
= 4 and
correctly
that
were
no distance
true,
vector
For references
dl( e) = 2 for the reference
no memory
that
to A(1,J – 2). The formula
be removed.
Note that if
references
would have been removed
because
a O in the first entry.
2 memory
would
have
references
are invariant
is different
from variant
loop L,, i < n, unrolling
with
expression
of v. Unrolling
memory
references.
Finally,
loop, no memory
references
outer
loops
loop-invariant
ory
references,
while
respect
to a loop, memory
access the same memory
location
as U.
for L, does not appear in the subscript
(scalar
In Figure
unrolling
the
behavior
is invariant
with respect to
more references
to memory
any loop that is not v-invariant
will
if u is invariant
with respect to the
will be left after scalar replacement
are unrolled
references).
will
references.
If v G V,l
L, will not introduce
because each unrolled
copy of u will
This is because the induction
variable
which
t
I<z<n
UEVT
In Figure
this value
and after
replacement
7, unrolling
J-loop
removes
all
introduce
innermost
no matter
innermost-
the
K-loop
introduces
does not.
Using
these
mem-
properties
gives
where
o(e, i, n) = if e is invariant wrt loop n then
return O
else if e IS invariant wrt loop i then
return 1
else
return X,
Now that the formulation
for loop balance in relation
been obtained,
it can be used to prove that unroll-and-jam
ACM
TransactIons
on Programmmg
Languages
and
Systems,
to unroll-and-jam
has
will never increase
Vol
16, No
6, November
1994
1784
Steve Carr and Ken Kennedy
.
DO 10 K = I,li
DO 10 J = I,li
DO 10 I = l,li
,..
10
= A(
DO 10 K = i,li,2
DO 10 J = l,?f,2
DO 10 I = I,F4
...
,K)
El
(=,*,=)
10
Fig 7
the
measure
unrolling
~L =
A simplified
all
of the
following
loop
~ *X,
q
K)
andafter
Consider
...
= A(
. . .
=
. . .
= A(I,
,
+1)
A( Ij ,
)(0,0,0)
+1)
unroll-and-jam.
the
following
formula
for
~~
when
i.
+ E,= V;X,
– (Xt
–d,
(e,,))++
~uel,;
~(ec), i,zz)
fxx(
formula
vertices
where
Combining
Vrl before
balance.
of loop
only
= A(I,
(0,0,0)
Vj,
these
of M
and
for a given
edges
in
X,
each
may
be determined
of the” above
vertex
by examining
sets,
giving
the
al >0.
terms
W’
= aoX[
M;
= alX,
M:
= a,3X, + a4
+ az
gives
pL =
(Coxt
+ cl)
fxxl
where co, c1 > 0. Note that the values of co and c1 may vary with the value of
X,, but they are always nonnegative.
As X1 increases,
co may decrease and c1
become
positive
for
may increase
because the value of (X, — d,(~, ))+ may
some v G V,c. Theorem
3.2.3.1.2
below states that if loop z is unrolled
n
times, there will be n * f flops and possibly
less than
n * m memory
references, where
m is the number
of memory
references
in the original
loop.
Hence, unroll-and-jam
does not increase
the measure
of loop balance.7
71f floating-point
not
hold
common
Unroll-and-j
am
subexpresslon
may
insert
removal
proportionally
is
taken
more
into
account,
memory
Theorem
references
than
3,2.3.1.2
operations,
ACM
‘IYan+actlons
on
Programmmg
Language>
and
Systems,
Vol
16,
No
6, November
does
floatmg-pomt
1994
Improving the Ratio of Memory Operations
THEOREM
The function
3.2.3.1.2.
for
fl~ is nonincreasing
.
1785
in the i dimen-
sion.
See Appendix
PROOF.
THEOREM
Unrolling
3.2.3.1.3.
innermost-loop
Theorem
❑
C.
order,
3.2.3.1.3
multiple
does not increase
can be proved
loop
nests
in
an
outermost-
to
~~.
by simple
induction
on the
number
of loops
unroll-and-jammed.
3.2.3.2 An Example. As an example of a formula for loop balance, consider the following loop.

      DO 10 J = 1,N
        DO 10 I = 1,N
   10   A(I,J) = A(I,J-1) + B(I)

A(I,J) ∈ V^g, A(I,J-1) ∈ V_r^c with an incoming distance vector of (1,0), and B(I) ∈ V_r^i. This gives

    β_L = (X_1 + X_1 - (X_1 - 1)^+ + 1) / X_1.
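To see how this formula behaves as the unroll amount grows, here is a quick evaluation (our arithmetic, not from the paper) for a few values of X_1:

    X_1 = 1:  β_L = (1 + 1 - 0 + 1) / 1 = 3
    X_1 = 2:  β_L = (2 + 2 - 1 + 1) / 2 = 2
    X_1 = 4:  β_L = (4 + 4 - 3 + 1) / 4 = 1.5.

Unroll-and-jam of J drives β_L from 3 toward 1: each extra loop body adds one store of A(I,J) and one flop, while the A(I,J-1) load and the J-invariant B(I) load are shared across the bodies.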
3.2.3.3 Register Pressure.
The next step in creating
a balance
problem
for a particular
loop is to derive a formula
for register
optimization
pressure,
To
compute
the number
of registers
required
by an unrolled
loop, it must be
determined
which references
can be removed
by scalar replacement
and how
many registers
each of those references
needs. This quantity
is called R. As
in computing
k!, the partitioned
sets of V, {V@, V,c, V,r}, are considered.
Associated
required
with
by that
each
of the
set, {~,
R:,
partitions
R;},
R=
Since each member
used
to capture
hold a value
ters required
R’”+R;
of Vc’ has no incoming
reuse.
However,
during
expression
for those memory
temporaries
needed
technique
of Sethi
number
of registers
of
V
is
the
number
of registers
giving
members
evaluation.
references
+R;
.
dependence,
of Y’ti
still
registers
require
cannot
a register
be
to
To compute
the number
of regisnot scalar replaced
and for those
during
the evaluation
of expressions,
the tree-labeling
and Unman
[1970]
is used. Using
this technique,
the
required
for each expression
in the original
loop body is
computed,
and the maximum
over all expressions
I?z. This
technique
does not take into
account
pressure
due to instruction
scheduling.
This issue
is taken as the value for
the increase
in register
is addressed
at the end of
Section 3.2.4.
For variant
references
whose
incoming
edge is carried
by some loop,
how many references
will be removed
by
u G vr~, it must be determined
scalar replacement
after unroll-and-j
am and how many registers
are required
ACM
Transactmns
on Programmmg
Languages
and
Systems,
Vol
16, No
6, November
1994
1786
for
Steve Carr and Ken Kennedy
.
those
references.
computation
The
latter
assumed
flow
The
of M,!’
value,
that
across
all
former
value
and is shown
d~(e, ) + 1, comes
registers
a loop-carried
has
previously
been
derived
in
the
below.
interfere
from
scalar
[Callahan
dependence,
all
replacement,
et al. 1990].
such
values
For
where
it
values
that
interfere
with
is
one
another
across the back edge of the loop. For values that flow across loopindependent
dependence,
predicting
the effects of scheduling
before optimization
to determine
interference
for loop-independent
dependence
is compiler
loop
dependent.
Therefore,
all values
body. In addition,
registers
are
avoid
result
reliance
upon a particular
scheduling
in the following
formula
for A?::
References
ters
are assumed to be live throughout
the
assumed
not to be reused in order to
only
that
are invariant
if a corresponding
with
respect
algorithm.
These
to an outer
reference-invariant
loop
assumptions
require
loop is unrolled.
regis-
Unrolling
a
loop that is not u-invariant
does not increase
register
pressure,
but rather
creates
opportunities
for scalar
replacement
that
may be utilized
if a Uinvariant
loop is unrolled.
No matter
which
loops are unrolled,
references
that
are invariant
ters. In Figure
with
respect
7, the reference
to the
innermost
to A(1, K) requires
is unrolled
and since K is unrolled
ing this behavior
gives
loop
always
no registers
once, two registers
require
unless
are required.
regis-
the J-loop
Formulat-
where

    α(e, i, n) = if e is invariant wrt loop i then
                     if (∃ X_j such that X_j > 1 and e is invariant wrt loop j) or
                        (e is invariant wrt loop n) then
                         return 1
                     else
                         return 0
                 else
                     return X_i.

3.2.3.4 An Example. As an example of a formula for computing register pressure, consider the following loop.

      DO 10 J = 1,N
        DO 10 I = 1,N
   10   A(I,J) = A(I,J-1) + B(J)

A(I,J) ∈ V^g, A(I,J-1) ∈ V_r^c with an incoming distance vector of (1,0), and B(J) ∈ V_r^i. This gives

    R = 1 + (X_1 - 1)^+ + X_1.
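Evaluating this register-pressure formula for a few unroll amounts (our arithmetic, not from the paper):

    X_1 = 1:  R = 1 + 0 + 1 = 2
    X_1 = 2:  R = 1 + 1 + 2 = 4
    X_1 = 4:  R = 1 + 3 + 8 - 4 = 8.

So register pressure grows roughly linearly in the unroll amount, which is exactly what the register-set-size constraint in the optimization problem of Section 3.2.3 bounds.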
3.2.4
Applying
to take
program
loops
Unroll-and-Jam
in a Compiler.
This
section
the previous
optimization
problem
and solve it
loop nests on real machines.
First,
it is shown
to which
choose
the
unroll-and-jam
unroll
will
amounts
for
be applied,
those
loops.
and
issues
that
may
have
it is shown
interlock
sented, followed
by the effects of multiple
incident
tion. Then, imperfectly
nested loops are handled,
compiler-dependent
presented.
then
Next,
discusses
how
practically
for real
how to choose the
how
removal
to
is pre-
edges on balance computaand finally,
machineand
an
effect
on
loop
balance
are
3.2.4.1
loops in
Picking
Loops to Unroll.
Applying
unroll-and-jam
to all of the
an arbitrary
loop nest can make the solution
to the optimization
problem
extremely
complex.
To simplify
the
solution
procedure,
unroll-and-
jam is only applied to a subset of the loop nest. Since experience
suggests that
most loop nests have a nesting
depth
of 3 or less, unroll-and-jam
can be
limited
to
1 or 2 loops
in
a given
nest.
Even
though
tiling
for
cache
may
increase
the depth
of the loop, it is assumed
that unroll-and-jam
is only
performed
on the tiled iteration
space that contains
the cache reuse [Wolf and
Lam 1991]. Unrolling
outer loops that iterate
by the tile size would increase
the amount
of tiling.
of data
required
For example,
to be held
in the tiled
in cache, and thus,
defeat
the purpose
loop
DO IO I= I,N,IS
DO IO J=I,N
DO IO H= II, II+ IS-1
10
A(n) = A(n) + B(J)
unroll-and-jam
probably
The
pick
might
be applied
to the
J-loop,
but
applying
it
to
I would
not be profitable.
heuristic
those
for determining
loops
references
in the
whose
the
unrolling
unrolled
loop.
subset
of loops
is likely
to result
In particular,
the
for unroll-and-jam
in
the
fewest
loops
that
carry
is to
memory
the
most
dependence
that can be innermost
after unroll-and-jam
is applied
should be
unrolled.
To find these loops, each distance vector is examined
to determine
if
it can be made innermost
for each pair of loops in the nest. Given a distance
where
d,, d] >0,
unrolling
loops i
vector d(e) =(dl,
. . ..dl.
dl, ,dl, . . . . d.)
and j can make e innermost
if Vk, k + i, j, n d~(e) = O. If only one loop, i, is
unrolled,
then e can be made innermost
if Vk, k # i, n dh(e) = O.
3.2.4.2
Picking
lem
solution,
the
solution
ACM
Unroll
unrolling
procedure.
Transactions
In
Amounts.
only
Later,
one
loop
we
on Programming
the
following
is considered
discuss
Languages
how
and
discussion
in
to
order
extend
Systems,
Vol
of the
to bring
the
solution
16, No.
prob-
clarity
6, November
to
to
two
1994
1788
Steve Cam and Ken Kennedy
.
loops. Given
following
where
register
this
restriction,
optimization
formula
can be reduced
to the
S = (Xl – d,(e))+
and Rkl is the effective
size of the floating-point
set for the target
architecture.
The solution
procedure
begins
by
examining
the vertices
coefficient
for X,
takes
the
O(G)
time,
and
and edges of the dependence
any
where
constant
term
G is the
as was
size of the
the summations
are solved to get a linear
requires
us to keep an array of coefficients
graph
shown
to accumulate
earlier.
dependence
This
graph.
the
process
Essentially,
function
of Xl. The ‘-function
and constants,
one set for each
value of d,, that will be used depending
upon the value of Xl. In practice,
most distances
are O, 1, or 2, allowing
us to keep 4 sets of values: one for each
of the common distances
and one for all remaining
determines
which coefficient
set is used. If d, >2
may be imprecise
if 3 < X1 < d[. However,
instances
of this situation.
Given
a set of coefficients,
the solution
checking
register
pressure,
to determine
distances.
The value of X,
for some edge, the results
experimentation
the
space
can
best
unroll
be
are O, 1, or 2, unrolling
more than
pressure
beyond the physical
limits
chine.
size
Therefore,
time
the
3.2.3.1.2,
bound
the
of O(log
the best answer.
of the
solution
R~r ) time
Thus,
decision
to two
loops,
loop can be created from the general
terms in ~~ for two particular
loops, i
sets of coefficient
values for each loop.
is held constant,
the function
for ~~
dimensional
solution
space with each
space
is sorted
for a binary
the entire
time for one loop.
To extend unroll-and-jam
solution
space
searched,
factor.
dependence
distances
increase
the register
Theorem
discovered
can be bounded
requires
a simplified
most
by R~.
order,
of the solution
procedure
while
Since
R~ will probably
of the target
ma-
in descending
search
no
By
giving
a
space to find
0( G + log RA1 )
optimization
problem
formula.
The coefficients
and constant
and j, can be constructed
by keeping 4
By Theorem
3.2.3.1.2, if either X, or Xj
is nonincreasing,
which
gives a tworow and column
sorted in decreasing
order,
lending
itself
to an 0( RV1 ) intelligent
search (since the rows and
columns
are sorted, each step of a search algorithm
can eliminate
an entire
row or column
from consideration)
and a total time of 0( G + R~ ) to determine unroll
amounts.
In general,
a solution
for unroll-and-jam
of k loops,
k > 1, can be found in O(G + R$l- 1) time. As an example
of the result
of
applying
dix A.
unroll-and-jam
using
3.2.4.3 Removing
Interlock.
pipeline
interlock,
leaving
idle
ACM
TransactIons
on
Programmmg
a balance
optimization
problem,
After
unroll-and-jam,
computational
cycles.
a loop
To remove
Languages
and
Systems,
Vol
16,
No
6, November
see Appen-
may contain
these cycles,
1994
Improving the Ratio of Memory Operations
the technique
interlock.
of Callahan
If interlock
et al. [1988]
exists,
one loop
can be used to estimate
can be unrolled
1789
.
the amount
to create
more
of
paral-
lelism
by introducing
enough
copies of the innerloop
recurrence.
Given the
pipeline
length,
1P, and the length
of the longest
recurrence
carried
by the
innermost
loop, p( L, ), unroll-and-jam
can be applied
to an outer loop until
enough floating-point
computation
is moved into
F > p(L~ ) X 1P. Essentially,
the innermost
loop to fill the delay slots caused by interlock.
This is possible
because copies of the innerloop
recurrence
will be parallel
with each other
and unroll-and-jam
will
al. 1988].
Experimental
interlock
of the
not move
evidence
detection
pipeline
in Section
completely
is short.
outerloop
4 suggests
and only
Because
recurrences
optimize
of the
short
that
inward
[Callahan
it is reasonable
for balance
pipeline,
when
little
et
to ignore
the length
unrolling
will
probably
be necessary
to remove
interlock,
and the balance
optimization
problem
already
discussed will most likely take care of that interlock.
Applying only one heuristic
will
speed up compilation.
However,
with
a long
pipeline,
it may be advantageous
to remove
interlock
first
because
the
unrolling
determined
by the optimization
enough to remove interlock.
Additionally,
second allows the optimization
resources
of the machine.
3.2.4.4
be only
multiple
Multiple
Edges.
one edge entering
incident
edges
problem
In general,
problem
removing
to keep
it cannot
will be less
interlock
first
register
likely
rather
pressure
be assumed
that
to be
than
within
there
or leaving
each node in the dependence
graph
are possible
and likely.
One possible
solution
the
will
since
is to
treat each edge separately
as if it were the only incident
edge. Unfortunately,
this is inadequate
because many edges may correspond
to the flow of the
same set of values,
Consider
the loop in Figure
8. There is a loop-carried
input dependence
from A(1,J) to each A(1 – 1,J) with distance vector (O, 1) and
a loop-independent
dependence
between
the A(1 – 1, J) ’s. Considering
each
edge separately
would give a register
pressure
estimation
of 5 registers
per
unrolled
iteration
provided
Although
come from the
this estimation
of J rather
than
the
actual
value
of 2, since
both
values
same source and require
use of the same registers.
is conservative,
it is more conservative
than neces-
sary. To improve
the estimation,
register
sharing
is considered.
To relate all references
that access the same values, the scalar replacement
algorithm
considers
the oldest reference
in a dependence
chain as the reference that provides
the value for the entire
chain [Carr and Kennedy
1994].
The oldest reference
in a chain is called the generator.
Using this technique
alone to capture
sharing
within
the context of unroll-and-jam,
however,
may
underestimate
register
pressure.
While
the generator
of the dependence
chain is the source of the value that flows through
the chain, the value from
the generator
may cross dependence
carried
by an outer loop. Since scalar
replacement
is only amenable
to innermost
dependence,
the value from the
generator
may not provide
the value for a scalar-replaced
reference
until
enough
unrolling
has occurred
to ensure
that the chain
only crosses the
innermost
loop.
ACM
Transactions
on Programmmg
Languages
and
Systems,
Vol
16, No
6, November
1994
1790
Steve Carr and Ken Kennedy
.
DO 10 J = l,H
DO 10
10
I
B(I,
=
l,B
(0,0)
J)
-
“(’m=
(0,1)
Fig
DO 10
D
I
=
10
8.
Effect of multiple
DO 10
I,li
A
I+;
I
1
I
I
L
r
..J)=A(I!,
.
J)+A(I,
(0,0)1
-----
I
DO 10
J = I,li
~._#o)
10
edges.
J)
~
I
- -1
-----
1
l,lJ,2
I,N
‘------’
(0,0)
m
:’(1+
I
(w;
-----
:
:
,J)=A(~,J)+A(I,J)
I
I
1
I
I
I
I
(0,0)
I
I
(0,0)
\\
(1,0)
10
=
J =
~A(I+?,J)=A(I+i,J)+A(I+i,J)
1,
I
L-J
~
I
I
I
L-------------
t(u)
:
I
_---J
(1,0)
Flg 9
Consider the example in Figure 9 (note that the solid dependence edges are innermost). In the original loop, the second reference to A(I,J) has two incoming input dependences: one with a distance vector of (1,0) from A(I+1,J) and one with a distance vector of (0,0) from A(I,J). Before unrolling, the generator of the dependence chain, A(I+1,J), does not provide the value to scalar replace the second reference to A(I,J). In this case, the first reference to A(I,J) provides the value. However, after unrolling the I-loop by 1, the generator provides the value for scalar replacement to the copies of A(I,J), that is, the references to A(I+1,J) in statement 10 of the unrolled loop. Therefore, the edge from the generator, A(I+1,J) in the original loop, cannot necessarily be picked to calculate the register pressure for the whole chain. Instead, a stepwise approximation can be used, where each possible intermediate generator within a dependence chain is considered. These intermediate generators occur when the value from a generator of a dependence chain flows across an outer-loop-carried dependence edge to a member, A, of the dependence chain, and some third member, B, of the chain has a value that flows across an innermost dependence edge to A. In this case, B is an intermediate generator.
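The following small Python check (our own formulation, with a hypothetical edge record) restates the definition just given: a chain member is an intermediate generator when some member receives the generator's value only across an outer-loop-carried edge while also receiving that member's value across an innermost edge.

    # Sketch of the intermediate-generator test described above.
    # An edge is (source, sink, carrier); carrier "inner" means the dependence
    # is carried by (or independent within) the innermost loop, "outer" means
    # it is carried by an unrolled outer loop. Illustrative only.

    def intermediate_generators(generator, edges):
        inter = set()
        for (src, sink, carrier) in edges:
            if src == generator and carrier == "outer":
                # 'sink' gets the generator's value only across an outer loop;
                # any member feeding 'sink' across an innermost edge is an
                # intermediate generator.
                for (src2, sink2, carrier2) in edges:
                    if sink2 == sink and carrier2 == "inner" and src2 != generator:
                        inter.add(src2)
        return inter

    # Figure 9 before unrolling: A(I+1,J) is the generator, but the first
    # A(I,J) supplies the second A(I,J) within the iteration.
    edges = [("A(I+1,J)", "A(I,J)#2", "outer"), ("A(I,J)#1", "A(I,J)#2", "inner")]
    print(intermediate_generators("A(I+1,J)", edges))  # {'A(I,J)#1'}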
Procedure PartitionNodes(G, j, k)
Input: G = (V, E), the dependence graph to be unrolled
       j, k = loops to be unrolled
put each v in V in its own partition, P(v)
while there exists an unmarked node do
    let v be an arbitrary unmarked node
    mark v
    foreach w in V such that ((there exists e = (w, v) or e = (v, w)) and
                              (e is input or true and is consistent) and
                              (for i = 1,...,n-1, i /= j, i /= k: d_i(e) = 0)) do
        mark w
        merge P(w) and P(v)
    enddo
enddo
end

Fig. 10. PartitionNodes.
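A compact way to realize PartitionNodes is a union-find over the references, merging across exactly the edges the procedure admits. The sketch below is our own rendering in Python, with an assumed edge record (src, dst, kind, consistent, distance); it is not the authors' implementation.

    # Sketch of PartitionNodes (Figure 10) as a union-find.
    # distance is a tuple d[0..n-1]; j and k index the unrolled loops.

    def partition_nodes(nodes, edges, j, k, n):
        parent = {v: v for v in nodes}

        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v

        def union(a, b):
            parent[find(a)] = find(b)

        for (src, dst, kind, consistent, distance) in edges:
            admissible = (kind in ("input", "true") and consistent and
                          all(distance[i] == 0
                              for i in range(n - 1) if i not in (j, k)))
            if admissible:
                union(src, dst)

        groups = {}
        for v in nodes:
            groups.setdefault(find(v), []).append(v)
        return list(groups.values())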
The first step in the stepwise approximation is to partition the dependence graph so that all references that reference the same values throughout the entire unrolled loop body will be put in the same partition. The algorithm in Figure 10, procedure PartitionNodes, accomplishes this task. Only those edges that guarantee the value to flow to each member of a partition are considered: consistent input and true dependences whose distances are zero in the loops that are not unrolled.

Next, the generator of each partition is determined. The generator is the reference that provides the value reused by the rest of the partition: a node that has no incoming dependence edges within its partition, or has only loop-carried incoming edges from references that are invariant with respect to the carrying loop or that lie in another partition. The generator of a partition is picked in procedure OldestValue.

Once the generator of a partition is found, the algorithm in Figure 11 is applied to each partition to calculate the coefficients of the register-pressure formula R used in the optimization problem. First, a summarized distance vector, d, for the outgoing edges of the generator is computed in procedure SummarizeEdges: the minimum d_i for i = 1,...,n-1 and the maximum d_n over those edges are used, giving one summarized vector per partition. To be conservative, this summarization assumes that the reuse is measured at the innermost-loop level as soon as possible (if enough unrolling is performed to cause the value to flow within the innermost loop body), so that the register requirements of a partition are not underestimated. After computing the summarized distance vector, the coefficients of R for register pressure are accumulated.
Procedure ComputeR(P, G, R)
Input: P = partition of memory references
       G = (V, E), the dependence graph
       R = register pressure formula
foreach p in P do
    v = OldestValue(p)
    d = SummarizeEdges(v)
    update R using d
    Prune(p, v)
    while size(p) > 1 do
        u = OldestValue(p)
        d' = SummarizeEdges(u)
        update R using R_d' - R_(d'+d(delta)), delta = (v, u)
        Prune(p, u)
        v = u
    enddo
enddo
end

Procedure OldestValue(p)
foreach v in p
    if there is no incoming dependence edge to v, or
       for all e = (u, v): ((u is invariant wrt the loop at level(carrier(e))) or (P(u) /= P(v)))
    then return v
end

Procedure SummarizeEdges(v)
for i = 1 to n - 1
    foreach e = (v, w) in E, w is a memory read
        d'_i = min(d'_i, d_i(e))
foreach e = (v, w) in E, w is a memory read
    d'_n = max(d'_n, d_n(e))
return d'
end

Procedure Prune(p, v)
remove v from p
foreach e = (v, w) in E
    if d_i(e) = 0 for 1 <= i < n then
        remove w
end

Fig. 11. Computing the coefficients of R.
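SummarizeEdges is the piece that is easiest to get wrong, so here is a minimal Python sketch of it together with the accumulation loop of ComputeR. Everything below (the edge dictionaries and the callback parameters) is an assumption made for illustration; only the min/max summarization rule comes from the procedures above.

    # Sketch of SummarizeEdges and the ComputeR driver (Figure 11).
    # d_i for i = 1..n-1 is summarized with min, d_n with max, so reuse is
    # never assumed to arrive earlier across the outer loops, or last longer
    # in the innermost loop, than every outgoing edge allows.

    def summarize_edges(v, edges, n):
        outgoing = [e for e in edges if e["src"] == v and e["dst_is_read"]]
        if not outgoing:
            return None
        d = [min(e["distance"][i] for e in outgoing) for i in range(n - 1)]
        d.append(max(e["distance"][n - 1] for e in outgoing))
        return tuple(d)

    def compute_R(partitions, edges, n, oldest_value, update_R, prune):
        # oldest_value, update_R, and prune are assumed callbacks standing in
        # for the OldestValue, coefficient-update, and Prune procedures.
        for p in partitions:
            v = oldest_value(p)
            update_R(summarize_edges(v, edges, n), previous=None)
            prune(p, v)
            while len(p) > 1:
                u = oldest_value(p)
                d = summarize_edges(u, edges, n)
                update_R(d, previous=v)   # subtract what v already covered
                prune(p, u)
                v = u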
Having been given a generator for each partition, the intermediate points at which a reference other than the generator provides the values for scalar replacement must also be determined for the optimization problem (see A(I,J) in the original loop in Figure 9). Register requirements will be underestimated if these situations are not handled.

First, applying procedure Prune to each partition, we remove all nodes for which there are no intermediate points, that is, nodes for which no reference other than the current generator provides the values for scalar replacement. Any reference that occurs on the same iteration of loops 1 to n-1 as the current generator of the partition will have no such points.

Next, the number of additional registers needed by the remaining nodes in the partition is computed by calculating the coefficients of R for an intermediate generator of the modified partition. Since the previous generator of a partition will at some point provide the values for the entire partition, we do not want to count the same registers multiple times. Overcounting of register pressure is prevented by computing the coefficients for R as if the intermediate generator were the generator of an original partition and then subtracting from R the number of references that have already been handled by the previous generator. The summarized distance vector d(delta) of the edge, delta, from the previous generator, u, to the current intermediate generator is used to compute the latter value. Given the new summarized distance vector, d', and d(delta) for the edge between the current and previous generators, compute R_d' - R_(d'+d(delta)). The value d' + d(delta) represents the summarized distance vector as if the previous generator were the generator of the modified partition. In the example presented in Figure 9, d' = (0,0) and d(delta) = (1,0) for the partition headed by A(I+1,J), giving (X_1 - 0)+ - (X_1 - 1)+ additional references. These steps for intermediate points are repeated until the partition has 1 or fewer members.

Now that we have shown how to handle multiple edges for variant references, we discuss the differences for references that are invariant with respect to some loop. Multiple edges can also be incident upon a reference that is invariant with respect to one of the unrolled loops and variant with respect to the other, and those edges may be carried by both loops. If this situation occurs, the reference is treated as a member of V_I with respect to the first loop and V_C with respect to the second.

Finally, multiple incoming edges also affect the computation of M. Here, each reference is considered separately, using a summarized distance vector that consists of the minimum d_i over all incoming edges of v for i = 1,...,n-1. The summarized distance vector represents the earliest possible point in unrolling at which a reference may be removed. Since the vector is a summary of many vectors, the point calculated may be too early, giving a conservative estimate of M.
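To make the subtraction concrete with the Figure 9 numbers (our arithmetic, not the paper's): with d' = (0,0) and d(delta) = (1,0), unrolling the I-loop by 1 gives X_1 = 2, so the intermediate generator adds (X_1 - 0)+ - (X_1 - 1)+ = 2 - 1 = 1 register beyond those already charged to the original generator, rather than the 2 registers that treating it as a fresh generator would charge.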
3.2.4.5 Imperfectly Nested Loops. It may be the case that the loop nest on which unroll-and-jam is to be performed is not perfectly nested. Consider the following loop.

      DO 10 I = 1,N
         DO 20 J = 1,N
   20    A(I,J) = B + E
         DO 10 J = 1,N
   10    C(J) = C(J) + D(I)

On the RS/6000, where β_M = 1, the I-loop would not be unrolled for the loop surrounding statement 20, but would be unrolled for the loop surrounding statement 10.⁹ To handle this situation, the I-loop can be distributed around the J-loops as follows

      DO 20 I = 1,N
         DO 20 J = 1,N
   20    A(I,J) = B + E
      DO 10 I = 1,N
         DO 10 J = 1,N
   10    C(J) = C(J) + D(I)

and each loop unrolled independently.
In general, unroll amounts are computed with a balance optimization formula for each innermost loop body as if it were within a perfectly nested loop body. A vector of unroll amounts, which includes a value for each of a loop's surrounding loops and a value for the loop itself, is kept for each loop within the nest. When determining whether distribution is necessary, the unroll amount in each vector at each level is examined. If a loop that has multiple inner loops is encountered at the next level, the unroll vectors of the inner loops are compared to determine whether any unroll amounts differ. If they differ, each of the loops from the current loop to the outermost level where the unroll values differ is distributed. Then each of the newly created loops is examined for additional distribution at inner levels. If any loop that must be distributed cannot be distributed due to a recurrence, the unroll amounts in all of the vectors at those levels are set to the smallest value (note that a loop that cannot be distributed will have an unroll value of 0 to ensure safety) [Allen and Kennedy 1987]. The complete algorithm for distributing loops for unroll-and-jam is given in Figure 12.

3.2.4.6 Machine- and Compiler-Dependent Issues. This paper has described a solution to the loop balance and register pressure problem, but there are other factors that may need to be addressed on some architecture-compiler combinations.
Addressing. Unroll-and-jam will likely increase the number of address registers required for the array references in the innerloop body. Address-register pressure is a compiler-dependent issue that will affect balance if there is register spilling. In addition to address registers, the updating of their contents must also be considered. On architectures with an autoincrement addressing mode, there will be no effect on balance. However, if explicit instructions are necessary to increment address registers, then these instructions must be included. Most likely they will count as additional memory instructions, since memory references are often serviced by an integer pipeline.

Instruction Cache. In this paper, it is assumed that the instruction cache is large enough to hold an unrolled loop body. However, some architectures (e.g., Intel i860) have very small on-chip instruction caches. In these cases, it would be necessary to understand how the back-end compiler generates instructions so that some estimate of code size could be performed in order to limit unrolling.

⁹If both loops are unrolled the same amount, we would just make copies of the non-DO statements and leave the loop structure intact.
Procedure DistributeLoops(L, i)
Input: L = loop to check for distribution
       i = nesting level of L
l = outermost loop level where the unroll amounts differ for the inner nests of L
if L has multiple inner loops at level i + 1 then
    if l > 0 then
        if loops at levels l,...,i can be distributed then
            C = L
            while level(C) >= l do
                distribute C around its inner loops
                C = parent(C)
            enddo
        else
            for j = l to i do
                m = minimum of all unroll vectors at level j for the inner loops of L
                assign to those vectors at level j the value m
            enddo
        endif
    endif
N = the first inner loop of L at level i + 1
if N exists then
    DistributeLoops(N, i + 1)
N = the next loop at level i following L
while N exists do
    DistributeLoops(N, i)
    N = the next loop at level i following N
enddo
end

Fig. 12. Distribute loops for unroll-and-jam.
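The core decision in DistributeLoops is a comparison of unroll vectors. The Python sketch below shows that comparison and the fallback of clamping to the smallest value when distribution is blocked; the list-of-vectors representation is an assumption made for illustration.

    # Sketch of the unroll-vector comparison used by DistributeLoops (Figure 12).
    # Each inner loop nest carries a vector of unroll amounts, one entry per
    # surrounding loop level plus one for itself.

    def outermost_differing_level(unroll_vectors):
        # Return the outermost level at which the vectors disagree, or -1.
        for level in range(min(len(v) for v in unroll_vectors)):
            if len({v[level] for v in unroll_vectors}) > 1:
                return level
        return -1

    def reconcile(unroll_vectors, can_distribute):
        level = outermost_differing_level(unroll_vectors)
        if level < 0:
            return "nothing to do"
        if can_distribute:
            return "distribute enclosing loops down to level %d" % level
        # Distribution blocked by a recurrence: fall back to the smallest
        # amount at each remaining level so all nests agree (0 keeps safety).
        for lvl in range(level, min(len(v) for v in unroll_vectors)):
            m = min(v[lvl] for v in unroll_vectors)
            for v in unroll_vectors:
                v[lvl] = m
        return "clamped unroll amounts to the minimum"

    print(reconcile([[2, 1], [4, 1]], can_distribute=False))  # clamps level 0 to 2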
Scheduling. Instruction scheduling can increase register pressure. The heuristic presented does not handle scheduling directly, since intimate knowledge of the back-end compiler's scheduling algorithm would be necessary to give an effective estimate. One method for handling scheduling is to reduce the number of available registers artificially to allow room for the scheduler to require additional resources. The number of effective registers can be determined experimentally.
Data Cache. The design of the data cache can have a great effect on the ability of the cache to retain data as assumed by the heuristic. Direct-mapped caches often suffer from interference between references [Lam et al. 1991]. However, this issue is beyond the scope of this paper and is left to the cache optimization phase of the compiler [Lam et al. 1991; Meadows et al. 1992; Wolf and Lam 1991].
Special Instructions. Many advanced microprocessors have a floating-point multiply-add instruction that allows the two arithmetic operations to be completed in less time than it would take to perform each instruction separately.¹⁰ This, in turn, affects machine balance by increasing the number of flops that can be performed per cycle. However, if the instruction is not used, the peak machine rate gives a false impression of β_M and, as a result, unnecessary unrolling may be performed. Instead, each multiply-add is counted as one floating-point operation rather than two, since the decision to use the instruction rests with the compiler phase that issues machine instructions.
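The counting rule above can be sketched as a one-pass walk over an expression tree, charging a multiply whose parent is an add as part of that add. This is only an illustration under our own tree format, not the authors' implementation.

    # Sketch: count floating-point operations with multiply-add fusion.
    # An expression is a nested tuple ("op", left, right) or a leaf string.

    def count_flops(expr, parent_op=None):
        if isinstance(expr, str):
            return 0
        op, left, right = expr
        here = 1
        if op == "*" and parent_op == "+":
            here = 0          # folded into the surrounding add (multiply-add)
        return here + count_flops(left, op) + count_flops(right, op)

    # c + a*b: one multiply-add rather than two operations.
    print(count_flops(("+", "c", ("*", "a", "b"))))  # 1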
4. EXPERIMENT

A source-to-source translator, called Memoria, has been implemented in the ParaScope programming environment, using the dependence analyzer from PFC [Allen and Kennedy 1987; Callahan et al. 1987; Carr 1992]. The experimental design is illustrated in Figure 13. In this scheme, Memoria serves as a preprocessor, rewriting Fortran programs to improve loop balance using both unroll-and-jam and scalar replacement as described. Both the original and transformed versions of the program are then compiled and run using the standard product compiler for the target machine.

For the test machine, the IBM RS/6000 model 540 was chosen because it has a good optimizing compiler, a large number of floating-point registers (32), and a number of hardware features that help the heuristic estimate of balance:

Register Renaming. Renaming allows the heuristic to ignore antidependences on registers.

Zero-Cost Branches. Backward branches are optimized by fetching instructions for the next iteration at the beginning of a loop before the current iteration finishes. This helps simulate the parallelism across iterations of a loop that software pipelining is assumed to achieve and allows the heuristic to ignore iteration-boundary effects.

Autoincrement Addressing Mode. This allows the heuristic to ignore the effects of stalls related to the updating of address registers.

Large Instruction Cache. The size of the cache allows the heuristic to ignore loop size.

Large 4-Way Set-Associative Data Cache (64K). The associativity tends to reduce interference in the cache and allows the heuristic to assume that innermost-loop-carried and loop-independent reuse is retained.

As discussed at the end of Section 3.2.4, address-register pressure needs to be controlled. For the RS/6000, any expression having no incoming innermost-loop-carried or loop-independent dependence requires an address register.

¹⁰Our implementation identifies a multiply-add by checking whether the parent of a multiply instruction in an AST is an add instruction. This may not be valid on all architectures.
Fig. 13. Experimental design. [The figure diagrams the flow from the original Fortran source through Memoria to an improved source, with both versions compiled by the target compiler; artwork not reproduced.]

Although some address-register pressure could not be avoided for compiler-dependent reasons, the number of registers assumed to be available was artificially reduced to 26 to allow for scheduling and to prevent additional register spilling. This number was determined by experimentation with scalar replacement [Carr 1992].

4.1 Performance Results

Figures 14 and 15 show the performance results of applying unroll-and-jam and scalar replacement on both kernels and complete applications. Full optimization was used on the IBM XLF compiler, version 2.2. The results are reported in normalized execution time, where the performance of the transformed code is normalized against the performance of the original code; the base execution time for the original code is 100. In Figures 14 and 15, (Original) denotes the original code; (SR) reports the results after Memoria applied scalar replacement; and (UJSR) reports the results after Memoria applied unroll-and-jam and scalar replacement. It should be noted that the IBM XLF compiler performs a limited form of scalar replacement for loop-independent dependences and loop-invariant references. Therefore, the results show the benefit attained by Memoria by using more sophisticated dependence analysis. First, the results for each kernel displayed in Figure 14 are presented. Then, a discussion of the performance of each application shown in Figure 15 is presented.
4.1.1 Kernels

DMXPY. DMXPY (denoted by VM in Figure 14) is a hand-optimized version of a vector-matrix multiply that is machine specific (not for the RS/6000) and difficult to understand. Unroll-and-jam and scalar replacement are applied to the unoptimized version (Original) and compared with the performance of the LINPACKD version (look at the SR bar). Memoria (the UJSR bar) is able to attain slightly better performance than the hand-optimized code, while allowing the kernel to be expressed in a machine-independent fashion.
Fig. 14. Kernel performance on IBM RS/6000 (SR = after scalar replacement, UJSR = after unroll-and-jam and scalar replacement). [Bar chart over the kernels VM, MMk, MMi, BLU, BLUP, Gmtry, Emit, Vpenta, Afold, Fold, Seval, and Sor; artwork not reproduced.]

Fig. 15. Application performance on IBM RS/6000 (SR = after scalar replacement, UJSR = after unroll-and-jam and scalar replacement). [Bar chart over the applications Matrix300, Tomcatv, Adm, Arc2d, Flo52, Onedim, Shal, Sphot, Simple, Wave, and CoOpt; artwork not reproduced.]
This is an example of how hand optimization is not necessarily best. It is better to express the kernel in a machine-independent form and allow the compiler to handle the machine details. This allows good performance across a variety of architectures without programmer effort.

Matrix Multiply. The results for two versions of matrix multiply on arrays of size 50 x 50 are displayed. MMk is matrix multiply with the reduction carried by the innermost loop, and MMi is matrix multiply with the reduction carried by the middle loop. Memoria achieves integer-factor speedups on small matrices where most memory references are from cache. Not only does memory reuse in the loop improve, but instruction-level parallelism is also available.

Linear Algebra Kernels. Experimentation with the block versions of LU decomposition with and without partial pivoting (BLUP and BLU, respectively) is shown. Memoria attains a factor of over 2 improvement in run-time. Each of the kernels contained an innerloop reduction that is amenable to unroll-and-jam. In these instances, unroll-and-jam introduces multiple parallel copies of the innerloop recurrence to improve balance.

NAS Kernels. Three of the NAS kernels (Gmtry, Emit, and Vpenta) from the SPEC benchmark suite are examined. Gmtry and Emit observe larger improvements because they contain outerloop reductions to which unroll-and-jam can be applied.

Geophysics Kernels. Included in the experiment are two geophysics kernels: one that computes the adjoint convolution of two time series (Afold) and another that computes the convolution (Fold). Each of these loops requires unroll-and-jam to improve its balance; again, these kernels contain innerloop reductions.
Table I. Application Suite

    Suite     Application   Description
    SPEC      Matrix300     Matrix Multiply
              Tomcatv       Mesh Generation
    Perfect   Adm           Pseudospectral Air Pollution
              Arc2d         2d Fluid-Flow Solver
              Flo52         Transonic Inviscid Flow
    RiCEPS    Onedim        Time-Independent Schroedinger Equation
              Shal          Weather Prediction
              Simple        2d Hydrodynamics
              Sphot         Particle Transport
              Wave          Electromagnetic Particle Simulation
    Local     CoOpt         Oil Exploration

Seval and Sor. Seval evaluates a B-spline; its main computational loop is only singly nested, so unroll-and-jam is not applicable. The improvement reported is obtained by applying only scalar replacement to remove pipeline interlock, using techniques originally developed in other works [Carr 1992; Carr and Kennedy 1992]. Sor computes the successive-over-relaxation of trapezoidal, rhomboidal, and triangular iteration spaces of a grid. Unroll-and-jam is applied, but no improvement is noted.

4.1.2 Applications. We chose 27 Fortran applications from SPEC, Perfect, RiCEPS, and local sources to be a part of our study. Of those 27, 5 failed to be analyzed successfully by PFC; 1 failed to compile on the RS6000; and 10 contained no reuse opportunities for the loop-balance optimization algorithms. Most of the programs that contained no reuse opportunities had loop-independent or loop-invariant reuse that was captured by the IBM XLF compiler. Table I contains a short description of each of the remaining applications that can be transformed by Memoria.

The results of performing scalar replacement and unroll-and-jam on the application suite are shown in Figure 15. The results for most applications are modest, with the exception of Matrix300 and Onedim. These two applications spend considerable time
computing reductions on individual loops that are completely dominated by reductions. As shown in the kernel results, reductions present the best opportunity for improvement. For Tomcatv, Arc2d, Flo52, Shal, Simple, and CoOpt, improvements from 15% to 50% are often achieved on the individual computational loops to which unroll-and-jam is applied; unfortunately, these loops do not dominate execution time. For Adm and Sphot, unroll-and-jam is never applied. A more detailed examination of the effectiveness of unroll-and-jam on the applications is given in Section 4.2.

The study suggests that unroll-and-jam is most effective on loops dominated by reductions, because reductions carry a large amount of instruction-level parallelism. Unroll-and-jam also improves performance greatly in the presence of a floating-point divide: a divide takes 19 cycles on the RS/6000, and because of its high cost, enough unrolling is performed in the presence of reuse to make the cost of the memory references a minimal factor.
4.2 Effectiveness of the Loop Balance Estimate

The next portion of the experiment deals with the accuracy of the unroll-and-jam decision procedure in predicting loop balance. More detailed statistics for this portion of the experiment can be found in Appendix B. The ParaScope dependence analyzer is used for this problem because (1) the PFC dependence analyzer became unsupported and (2) the ParaScope dependence analyzer was upgraded to be of PFC quality during the implementation of the statistics gathering.

In Figure 16, the accuracy of balance prediction for the kernels is detailed. For each kernel, "Initial" numbers represent the balance of the loop in the original source code. "Predicted" numbers are obtained from the dependence analyzer by examining the optimization problem created by Memoria during unroll-and-jam and scalar replacement. "Observed" numbers are obtained by examining the assembler code generated by the IBM XLF compiler after unroll-and-jam and scalar replacement are applied. "Compiler" numbers are obtained by examining the assembler code generated by the IBM XLF compiler with its own register allocation. "Optimal" represents the best balance given a fixed number of floating-point registers (using the full set of floating-point registers rather than the number reserved for the register allocator); for the "Optimal" numbers, the unroll amounts are increased until all floating-point registers are used or a minimum balance is reached.

In this experiment, all references to memory are counted as one memory operation (cache-miss penalties are not included), and each addition, subtraction, multiplication, and multiply-add is counted as one floating-point operation. Division is counted as 19 flops, since the base operations take one cycle and division takes 19 cycles on the RS/6000.

Figure 16 indicates that the balance predicted for this set of kernels is identical to the observed balance. Slight inaccuracies in balance prediction are likely to occur when references have multiple incoming dependences with different distances at multiple loop levels; these inaccuracies do not emerge on this set of kernels. It should also be noted that these
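As a concrete reading of those counting rules (a sketch under our own naming, not the measurement harness used in the paper), the loop balance of a single loop body can be tallied like this:

    # Sketch: compute loop balance with the counting rules just described.
    # Memory references count 1 each (no cache-miss penalty); +, -, *, and a
    # fused multiply-add count 1 flop; a divide counts 19 flops (19 cycles vs.
    # 1 cycle for the base operations on the RS/6000).

    FLOP_COST = {"add": 1, "sub": 1, "mul": 1, "madd": 1, "div": 19}

    def loop_balance(memory_refs, flop_ops):
        flops = sum(FLOP_COST[op] for op in flop_ops)
        return memory_refs / flops if flops else float("inf")

    # a(i) = a(i) + b(i) / c(i): 3 loads and 1 store, one add and one divide.
    print(loop_balance(memory_refs=4, flop_ops=["add", "div"]))  # 0.2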
Fig. 16. Kernel balance-prediction accuracy. [Bar chart over the kernels with Initial, Predicted, Observed, Compiler, and Optimal values; artwork not reproduced.]

Fig. 17. Kernel register-pressure accuracy. [Bar chart over the kernels with Predicted, Observed, Compiler, and Optimal values; artwork not reproduced.]
inaccuracies will not occur between the observed balance and the compiler balance. No results are reported for Seval because unroll-and-jam is not performed on the kernel.

For the kernels in this study, Memoria achieves a loop balance that is not appreciably different from the optimal balance. The only significant difference between the optimal balance and the balance achieved by Memoria occurred on Vpenta. Here, limited live ranges contain reuse; not considering register reuse associated with loop-independent dependences causes Memoria to overestimate register pressure and unnecessarily restrict unrolling.

The register pressure results in Figure 17 indicate that Memoria overestimates register pressure in most instances. This would argue for Memoria to include some heuristic for approximating scheduling to allow register reuse. However, it does not argue for the elimination of reserving registers as a buffer against the register allocator of the target compiler, because Memoria only considers register pressure on a per-loop basis, not over a whole routine as in global register allocation [Chaitin et al. 1981; Chow and Hennessy 1984]. Previous experiments have shown that register spilling due to scalar replacement will occur in loops with very low register pressure because spills are done for an entire live range rather than with an understanding of program structure [Briggs 1992; Carr 1992].

Table II presents information concerning the unrolling of the loops in each kernel. "Unrolled" shows how many loops are unrolled due to balance and pipeline interlock. Of those that contained interlock, only Sor requires additional unrolling after balancing the loop without considering interlock; interlock is removed in all other loops just by unrolling to improve the ratio of memory to floating-point operations. "Not Unrolled" shows how many loops are not unrolled due to balance and lack of potential improvement ("None"), and transformation legality ("Safety"). "Nests" reports the number of perfectly nested loop structures that are considered for unroll-and-jam. Imper-
Table II. Kernel Unrolling Information

    Kernel    Unrolled    Not Unrolled (None / Safety)    Nests
    DMXPY, MatJIK, MatJKI, BLU, BLUP, Gmtry, Emit, Vpenta, Afold, Fold, Seval, Sor
    [The per-kernel counts in the original table could not be recovered reliably from this copy.]

Fig. 18. Application balance-prediction accuracy. [Bar chart over the applications with Initial, Predicted, and Observed values; artwork not reproduced.]
fect loop nests have multiple perfect loop structures that are considered for unroll-and-jam.

Figure 18 reports the accuracy of Memoria in predicting loop balance for the applications in the suite. The balance numbers are reported as an average over all loops to which unroll-and-jam is applied. "Initial," "Predicted," and "Observed" numbers have the same meaning as in Figure 16. No numbers from the compiler-generated code are reported; it is expected that the compiler numbers for balance would be identical to the observed numbers, as previously reported for the kernels, because the output of the XLF compiler is very predictable.
Improving the Ratio of Memory Operations
Memoria
This
that
note
predicts
application
the observed
contains
numbers
references
in every
with
application
multiple
incoming
.
1803
except
F1052.
dependence
required
summarization
as described
in Section 3.2.4. It is pleasing
how accurately
the decision
procedure
performs
across a variety
applications
and kernels.
5. RELATED WORK

Callahan et al. [1988] describe unroll-and-jam in the context of loop balance, but they do not present a method to compute unroll amounts automatically. Aiken and Nicolau [1987] discuss a transformation identical to unroll-and-jam called loop quantization. To ensure parallelism, they perform a strict quantization where each loop is unrolled until iterations are no longer data independent. However, with software or hardware pipelining, true dependences between the unrolled iterations do not prohibit low-level parallelism. Thus, their method unnecessarily restricts unroll amounts and misses benefits available from true dependences. They also do not control register pressure. Wolf and Lam [1991] present a framework for determining the data locality of a loop nest and use loop interchange and tiling to improve locality. They present unroll-and-jam in this context as register tiling, but they do not present a method to determine unroll amounts.
6. SUMMARY

We have presented a design for a transformation system that applies scalar replacement and unroll-and-jam to improve the balance of loops. These transformations are machine independent in the sense that they take as input only a few machine parameters, such as machine balance and number of registers.

A major achievement of this work is the development of an optimization problem that minimizes the distance between machine balance and loop balance, bringing the balance of the loop as close as possible to the balance of the machine. This work has also shown how to use a few simplifying assumptions to make the solution to the optimization problem fast enough for inclusion in real compilers.

These transformations have been implemented in a Fortran source-to-source preprocessor and applied to a substantial collection of kernels and whole programs. The accuracy of our method is established in the experimental results. These results show that in almost all cases hand optimization is unable to achieve a significantly better balance than Memoria; we believe this result shows that hand-balancing loops is unnecessary. In addition to the accuracy of Memoria, the run-time results show that the methods are particularly successful on kernels from linear algebra. Even though the technique is targeted toward loop optimization, we still achieve spectacular results on a couple of programs; in general, however, modest results are reported on full applications. These results are achieved on an IBM RS/6000, which has an effective optimizing compiler and a small load penalty (1 cycle). We would expect such methods to produce larger improvements on machines with greater load penalties.
The effectiveness of scalar replacement and unroll-and-jam, as demonstrated by the experiments, argues that they should be included in every highly optimizing scalar compiler. Since it has been established that they can be carried out by a machine-independent source-to-source system that uses a few machine parameters to drive the optimization algorithm, it should be possible to develop a preprocessor that can be quickly retargeted to new processors as they emerge.

APPENDIX A. MATRIX MULTIPLY: BEFORE AND AFTER

A1. Original Matrix Multiply
      do j = 1, n
         do i = 1, n
            do k = 1, n
               c(i,j) = c(i,j) + a(i,k) * b(k,j)
            enddo
         enddo
      enddo

A2. Balanced Matrix Multiply

c     pre-loop removed
      do j = mod(n,2) + 1, n, 2
c        pre-loop removed
         do i = mod(n,2) + 1, n, 2
            if (1 .gt. n) goto 1
            a$1$0 = a(i,1)
            b$2$0 = b(1,j)
            c$0$0 = c(i,j) + a$1$0 * b$2$0
            b$4$0 = b(1,j+1)
            c$3$0 = c(i,j+1) + a$1$0 * b$4$0
            a$6$0 = a(i+1,1)
            c$5$0 = c(i+1,j) + a$6$0 * b$2$0
            c$7$0 = c(i+1,j+1) + a$6$0 * b$4$0
            do k = 2, n
               a$1$0 = a(i,k)
               b$2$0 = b(k,j)
               c$0$0 = c$0$0 + a$1$0 * b$2$0
               b$4$0 = b(k,j+1)
               c$3$0 = c$3$0 + a$1$0 * b$4$0
               a$6$0 = a(i+1,k)
               c$5$0 = c$5$0 + a$6$0 * b$2$0
               c$7$0 = c$7$0 + a$6$0 * b$4$0
            enddo
            c(i+1,j+1) = c$7$0
            c(i+1,j) = c$5$0
            c(i,j+1) = c$3$0
            c(i,j) = c$0$0
    1    continue
         enddo
      enddo
APPENDIX B. BALANCE PREDICTION TABLES

Table A.I. Key

    Label          Meaning
    Predicted      β_L obtained from the dependence graph and the optimization problem created for each loop
    Observed       β_L obtained by examining the assembler code generated by the IBM XLF compiler after unroll-and-jam and scalar replacement
    Optimal        the best β_L given a fixed number of floating-point registers (using the full number of floating-point registers rather than the number given to the register allocator)
    Compiler       β_L obtained by examining the assembler code generated by the IBM XLF compiler with its own register allocation
    IB             initial β_L
    FB             β_L after unroll-and-jam and scalar replacement
    FP             floating-point register pressure after unroll-and-jam and scalar replacement
    Nests          the number of perfectly nested loops considered for unroll-and-jam
    Unrolled       results for those nests that are unrolled
    Not Unrolled   results for those nests that are not unrolled
    N (Unrolled)   the number of perfect nests in the application that are unrolled
    I              the number of loops that are unrolled due to interlock
    N (Not Unrolled)  the number of perfect nests in the application that are not unrolled
    NI             the number of those in "N" that could not be improved by unroll-and-jam
    S              the number of those in "N" where unroll-and-jam is not safe
Table A.II. Kernel Balance Prediction Accuracy

                 Predicted            Observed       Optimal              Compiler
    Kernel       IB     FB     FP     FB     FP      IB     FB     FP     FB     FP
    DMXPY        3.00   1.09   26     1.09   25      3.00   1.09   26     1.07   32
    MatJIK       2.00   1.00   10     1.00   10      2.00   1.00    7     1.00    7
    MatJKI       3.00   1.00   17     1.00   16      3.00   1.00   14     1.00   14
    BLU          3.00   2.04   26     2.04   25      3.00   2.04   25     2.03   32
                 2.00   1.00   10     1.00   10      2.00   1.00    7     1.00    7
    BLUP         3.00   2.04   26     2.04   25      3.00   2.04   25     2.03   32
                 2.00   1.00   10     1.00   10      2.00   1.00    7     1.00    7
    Gmtry        3.00   2.04   26     2.04   25      3.00   2.04   25     2.03   32
    Emit         3.00   1.09   26     1.09   25      3.00   1.09   26     1.07   32
                 3.00   1.09   26     1.09   25      3.00   1.09   26     1.07   32
    Vpenta       2.33   1.58   23     1.58   22      2.33   1.58   12     1.01   12
    Afold        2.00   1.00    5     1.00    5      2.00   1.00    4     1.00    4
    Fold         2.00   1.00    5     1.00    5      2.00   1.00    4     1.00    4
    Sor          0.60   0.53    9     0.53    8      0.80   0.53    9     0.53    9

Table A.I presents a key to the other two tables in this appendix. Table A.II presents the balance prediction accuracy of Memoria on kernels. Table A.III presents Memoria's balance prediction accuracy on applications.
Table A.III. Application Balance Prediction Accuracy

                          Unrolled     Predicted               Observed       Not Unrolled
    Application    Nests  N     I      IB     FB     FP        FB     FP      N      NI     S
    Matrix300      2      2     0      2.00   1.04   26        1.04   25      0      0      0
    Tomcatv        5      1     0      2.00   1.58   24        1.58   24      4      4      0
    Adm            106    0     0      -      -      -         -      -       106    72     34
    Arc2d          75     15    0      2.70   1.55   15.9      1.55   15.1    60     46     14
    Flo52          76     14    0      2.13   1.53   15.2      1.57   12.8    62     62     0
    Onedim         10     6     0      2.00   1.00   10        1.00   10      4      4      0
    Shal           7      0     0      -      -      -         -      -       7      7      0
    Sphot          10     0     0      -      -      -         -      -       10     7      3
    Simple         22     2     0      4.00   3.08   17.5      3.08   16.5    20     18     2
    Wave           10     0     0      -      -      -         -      -       10     7      3
    CoOpt          139    15    0      2.5    1.04   21.2      1.04   17.2    124    124    0
APPENDIX C. PROOFS

THEOREM 3.2.3.1.1. Given each new edge e' = (w_m, v_n) created from e = (w, v) by unroll-and-jam, where 0 <= m, d_i(e) < X_i and n = (m + d_i(e)) mod X_i, then

    n > m  if and only if  m < X_i - d_i(e).

PROOF. (=>) Given n > m, assume m >= X_i - d_i(e). Since 0 <= m, d_i(e) < X_i, then

    n = (m + d_i(e)) mod X_i = m + d_i(e) - X_i,

giving m + d_i(e) - X_i > m. Since d_i(e) < X_i, the inequality is violated, leaving m < X_i - d_i(e).

(<=) Given m < X_i - d_i(e) and m, d_i(e) > 0, then

    n = (m + d_i(e)) mod X_i = m + d_i(e).

Since m, d_i(e) > 0, then m + d_i(e) > m, giving n > m.
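For a quick sanity check of the theorem (our numbers, not the paper's): take X_i = 4 and d_i(e) = 1. For m = 2, which satisfies m < X_i - d_i(e) = 3, we get n = (2 + 1) mod 4 = 3 > m; for m = 3, which violates the bound, n = (3 + 1) mod 4 = 0 < m.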
THEOREM 3.2.3.1.2. The function for β_L is nonincreasing in the i dimension.

PROOF. For the original loop, X_i = 1 and β_L = (c_0 + c_1)/f. If unroll-and-jam is applied to the i-loop n - 1 times, n > 1, then X_i = n, giving

    β_L^n = (c_2 n + c_1 + Σ_e (X_i - d_i(e))+ ) / (fn),

where c_2 counts the references whose contribution increases with X_i, and c_0 - c_2 is the number of edges whose term (X_i - d_i(e))+ becomes positive with the change in X_i; since c_0 does not increase, c_2 <= c_0. Given that X_i = n, the maximum value of d_i(e) for such an edge is n - 1 (any greater value would not make (X_i - d_i(e))+ positive). Therefore, the maximum value of β_L^n is

    β_L^n = (c_2 n + c_1 + (c_0 - c_2)(n - 1)) / (fn).

To show that β_L is nonincreasing, β_L >= β_L^n must be true:

    (c_0 + c_1)/f >= (c_2 n + (c_0 - c_2)(n - 1) + c_1)/(fn)
    n(c_0 + c_1) >= c_2 n + (c_0 - c_2)(n - 1) + c_1
    n c_0 + n c_1 >= n c_2 + n c_0 - n c_2 + c_2 - c_0 + c_1
    (n - 1) c_1 >= c_2 - c_0.

Since n > 1, c_1 >= 0, and c_2 <= c_0, the inequality holds and β_L is nonincreasing.
ACKNOWLEDGMENTS

The authors would first like to thank the referees for their valuable suggestions; their suggestions greatly improved the paper's quality. Robert Reynolds and Phil Sweany provided valuable input during the preparation of this document. Preston Briggs, Keith Cooper, and Uli Kremer made helpful suggestions on document form. Finally, John Cocke, Peter Markstein, and David Callahan gave us the inspiration for this work.
REFERENCES

AIKEN, A. AND NICOLAU, A. 1987. Loop quantization: An analysis and algorithm. Tech. Rep. 87-821, Cornell Univ., Ithaca, N.Y.

ALLEN, F. AND COCKE, J. 1972. A catalogue of optimizing transformations. In Design and Optimization of Compilers. Prentice-Hall, Englewood Cliffs, N.J., 1-30.

ALLEN, J. R. AND KENNEDY, K. 1987. Automatic translation of Fortran programs to vector form. ACM Trans. Program. Lang. Syst. 9, 4 (Oct.), 491-542.

BRIGGS, P. 1992. Register allocation via graph coloring. Ph.D. thesis, Dept. of Computer Science, Rice Univ., Houston, Tex.

CALLAHAN, D., CARR, S., AND KENNEDY, K. 1990. Improving register allocation for subscripted variables. In Proceedings of the ACM SIGPLAN '90 Conference on Programming Language Design and Implementation. SIGPLAN Not. 25, 6 (June), 53-65.

CALLAHAN, D., COCKE, J., AND KENNEDY, K. 1988. Estimating interlock and improving balance for pipelined machines. J. Parallel Distrib. Comput. 5, 334-358.

CALLAHAN, D., COOPER, K., HOOD, R., KENNEDY, K., AND TORCZON, L. 1987. ParaScope: A parallel programming environment. In Proceedings of the 1st International Conference on Supercomputing. Springer-Verlag, Berlin.

CARR, S. 1992. Memory-hierarchy management. Ph.D. thesis, Dept. of Computer Science, Rice Univ., Houston, Tex.

CARR, S. AND KENNEDY, K. 1992. Compiler blockability of numerical algorithms. In Proceedings of Supercomputing '92. IEEE Computer Society Press, Los Alamitos, Calif., 114-124.

CARR, S. AND KENNEDY, K. 1994. Scalar replacement in the presence of conditional control flow. Softw. Pract. Exper. 24, 1 (Jan.), 51-77.

CHAITIN, G. J., AUSLANDER, M. A., CHANDRA, A. K., COCKE, J., HOPKINS, M. E., AND MARKSTEIN, P. W. 1981. Register allocation via coloring. Comput. Lang. 6, 45-57.

CHOW, F. C. AND HENNESSY, J. L. 1984. Register allocation by priority-based coloring. In Proceedings of the ACM SIGPLAN '84 Symposium on Compiler Construction. SIGPLAN Not. 19, 6 (June), 222-232.

DEHNERT, J. C., HSU, P. Y.-T., AND BRATT, J. P. 1989. Overlapped loop support in the Cydra 5. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, 26-38.

DUESTERWALD, E., GUPTA, R., AND SOFFA, M. L. 1993. A practical data flow framework for array reference analysis and its use in optimization. In Proceedings of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation. SIGPLAN Not. 28, 6 (June), 68-77.

GANNON, D., JALBY, W., AND GALLIVAN, K. 1987. Strategies for cache and local memory management by global program transformations. In Proceedings of the 1st International Conference on Supercomputing. Springer-Verlag, Berlin.

KENNEDY, K. AND MCKINLEY, K. 1992. Optimizing for parallelism and data locality. In Proceedings of the 1992 International Conference on Supercomputing. ACM, New York, 323-334.

KUCK, D. 1978. The Structure of Computers and Computations. Vol. 1. John Wiley and Sons, New York.

KUCK, D., KUHN, R., LEASURE, B., AND WOLFE, M. 1984. The structure of an advanced retargetable vectorizer. In Supercomputers: Design and Applications. IEEE Computer Society Press, Los Alamitos, Calif., 163-178.

LAM, M. S., ROTHBERG, E. E., AND WOLF, M. E. 1991. The cache performance and optimizations of blocked algorithms. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, 63-74.

MEADOWS, L., NAKAMOTO, S., AND SCHUSTER, V. 1992. A vectorizing, software pipelining compiler for LIW and superscalar architectures. In Proceedings of RISC '92.

RAU, B. R. 1991. Data flow and dependence analysis for instruction-level parallelism. In Proceedings of the 4th International Workshop on Languages and Compilers for Parallel Computing. Lecture Notes in Computer Science, vol. 589. Springer-Verlag, Berlin, 236-250.

RAU, B. R., LEE, M., TIRUMALAI, P. P., AND SCHLANSKER, M. S. 1992. Register allocation for software pipelined loops. In Proceedings of the ACM SIGPLAN '92 Conference on Programming Language Design and Implementation. SIGPLAN Not. 27, 7 (July), 283-299.

SETHI, R. AND ULLMAN, J. 1970. The generation of optimal code for arithmetic expressions. J. ACM 17, 4 (Oct.), 715-728.

WOLF, M. E. AND LAM, M. S. 1991. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation. SIGPLAN Not. 26, 6 (June), 30-44.

WOLFE, M. 1986. Loop skewing: The wavefront method revisited. J. Parallel Program. 15, 4, 279-293.

Received December 1992; revised December 1993 and May 1994; accepted June 1994