Improving Register Allocation for Subscripted Variables*

David Callahan†        Steve Carr‡        Ken Kennedy‡

†Tera Computer Company, 400 N 34th St, Suite 300, Seattle, Washington 98103
‡Department of Computer Science, Rice University, Houston, Texas 77251-1892
*Research supported by NSF Grant CCR-8809615 and by IBM Corporation.

Abstract

Most conventional compilers fail to allocate array elements to registers because standard data-flow analysis treats arrays like scalars, making it impossible to analyze the definitions and uses of individual array elements. This deficiency is particularly troublesome for floating-point registers, which are most often used as temporary repositories for subscripted variables.

In this paper, we present a source-to-source transformation, called scalar replacement, that finds opportunities for reuse of subscripted variables and replaces the references involved by references to temporary scalar variables. The objective is to increase the likelihood that these elements will be assigned to registers by the coloring-based register allocators found in most compilers. In addition, we present transformations to improve the overall effectiveness of scalar replacement and show how these transformations can be applied in a variety of loop nest types. Finally, we present experimental results showing that these techniques are extremely effective, capable of achieving integer factor speedups over code generated by good optimizing compilers of conventional design.

1 Introduction

Although conventional compilation systems do a good job of allocating scalar variables to registers, their handling of subscripted variables leaves much to be desired. Most compilers fail to recognize even the simplest opportunities for reuse of subscripted variables. For example, in the code shown below,

      DO 10 I = 1, N
        DO 10 J = 1, M
10        A(I) = A(I) + B(J)

most compilers will not keep A(I) in a register in the inner loop. This happens in spite of the fact that standard optimization techniques are able to determine that the address of the subscripted variable is invariant in the inner loop. On the other hand, if the loop is rewritten as

      DO 20 I = 1, N
        T = A(I)
        DO 10 J = 1, M
10        T = T + B(J)
20      A(I) = T

even the most naive compilers allocate T to a register in the inner loop.

The principal reason for the problem is that the data-flow analysis used by standard compilers is not powerful enough to recognize most opportunities for reuse of subscripted variables. Standard register allocation methods, following Chaitin et al. [CAC+81], concentrate on allocating scalar variables to registers throughout a procedure. Furthermore, they treat subscripted variables in a particularly naive fashion, if at all, making it impossible to determine when a specific element might be reused. This is particularly problematic for floating-point register allocation, because most of the computational quantities held in such registers originate in subscripted arrays.

For the past decade, the PFC Project at Rice University has been using the theory of data dependence to recognize parallelism in Fortran programs [AK84b]. Since data dependences arise from the reuse of memory cells, it is natural to speculate that dependence might be used for register allocation. This speculation led to methods for the allocation of vector registers [AK88], which led to scalar register allocation via the observation that scalar registers are vector registers of length 1 [CCK87]. In this paper, we extend the theory developed in our earlier work and report on experiments with a prototype system, based on PFC, that improves register allocation by rewriting subscripted variables as scalars, as illustrated by the example above.

2 Method

In this section, we will describe a set of transformations that can be applied to Fortran source to improve the performance of compiled code. We assume that the target machine has a typical optimizing Fortran compiler, one that performs scalar optimizations only. In particular, we assume that it performs strength reduction, allocates registers globally (via some coloring scheme) and schedules both the instruction and the arithmetic pipelines. This makes it possible for our preprocessor to restructure the loop nests in Fortran, while leaving the details of optimizing the loop code to the compiler. These assumptions proved to be valid for the Fortran compiler on the MIPS M120, which served as the target for most of our experiments.

We begin the description of our method in the first subsection with a discussion of the special form of dependence graph that we use in subsequent transformations, paying special attention to the differences between this form and the standard dependence graph. Next, we introduce scalar replacement, a transformation that replaces references to subscripted variables by references to generated scalar variables whenever reuse is possible. Finally, we present a set of transformations that can be used before scalar replacement to improve its effectiveness. Our current implementation, a modified version of our automatic parallelizer PFC, will be described in section 3.

2.1 Dependence Graph

2.1.1 Basics

We say that a dependence exists between two statements if there exists a control-flow path from the first statement to the second, and both statements reference the same memory location [Kuc78]. We say the dependence is

• a true dependence or flow dependence if the first statement writes to the location and the second reads from it,

• an antidependence if the first statement reads from the location and the second writes to it,

• an output dependence if both statements write to the location, and

• an input dependence if both statements read from the location.

For the purposes of improving register allocation, we concentrate on the true and input dependences, each of which represents an opportunity to eliminate a load. In addition, output dependences can be helpful in determining when it is possible to eliminate a store.

A dependence is said to be carried by a particular loop if the references at the source and sink of the dependence are on different iterations of the loop. If both the source and sink occur in the same iteration, the dependence is loop independent. The threshold of a loop-carried dependence, τ(e), is the number of loop iterations between the source and sink. In determining which dependences represent reuse, we consider only those that have a consistent threshold, that is, those dependences for which the threshold is constant throughout the execution of the loop [GJG87, CCK87]. For a dependence to have a consistent threshold, it must be the case that the location accessed by the dependence source on iteration i is accessed by the sink on iteration i + c, where c does not vary with i.

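For instance, consider the following small fragment (an illustrative example of ours, not one of the paper's benchmarks):

      DO 10 I = 1, N
10      A(I+3) = A(I) + B(I)

The true dependence from the definition A(I+3) to the use A(I) has a consistent threshold of 3: the location written on iteration i is read again on iteration i+3, regardless of i, so holding the last three values of A in registers would eliminate the load of A(I).
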
2.1.2 Dependence Pruning

Since conventional methods for computing a dependence graph often produce transitive dependence edges, not all edges entering a reference represent the true flow of values [Wol82]. For example, if there were two stores to the same location in a loop nest, the dependence graph would have true dependences from both definitions to every use that they could reach. Therefore, in order to capture potential reuse, it must be determined which of these definitions is the last to store into the memory location before a value is loaded from that location. In general, for each memory load in the inner loop body, we want to find the most recent definition within the innermost loop or, if no definition exists, the least recent use.

Procedure PRUNEGRAPH is an algorithm that takes as input a dependence graph for a loop nest containing no inner-loop conditional control flow and returns a dependence graph pruned of all transitive, inconsistent and outer-loop-carried dependence edges, leaving only those edges which represent the true flow of values in the innermost loop.

procedure PRUNEGRAPH(G)
  Definitions:
    G = (V, E) and E_v = incoming edges for node v ∈ V.
    VALID(e) returns true if edge e is carried by the innermost
      loop, or is loop independent and its source and sink are
      in the same loop body.

  for each v ∈ V do begin
    if ∃ an incoming true dependence then begin
      T = {e | e ∈ E_v is a true dependence with the smallest
           threshold & VALID(e)}
      e_g = e | e ∈ T and the source of e does not have an
           outgoing loop-independent output dependence whose
           sink has a true dependence in T
      if τ(e_g) is not consistent then e_g = ⊥
    end
    else begin
      T = {e | e ∈ E_v is an input dependence with the largest
           consistent threshold & VALID(e)}
      e_g = e | e ∈ T and the source of e does not have an
           incoming loop-independent consistent input dependence
    end
    for each e ∈ E_v do
      if e ≠ e_g and e is not an output dependence
        then remove e from G
  end

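To see what pruning does, consider the following illustrative fragment of ours with two stores to the same location:

      DO 10 I = 1, N
        A(I) = B(I)
        A(I) = A(I) + C(I)
        D(I) = A(I) * 2.0
10    CONTINUE

A conventional dependence graph has true edges from both definitions of A(I) to the use in the third statement, but the edge from the first definition is transitive. Because that definition has an outgoing loop-independent output dependence whose sink (the second definition) also has a true dependence to the use, PRUNEGRAPH discards its edge and keeps only the edge from the second definition, the most recent store.
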
Once the graph has been pruned, the source of each true or input dependence can be seen as "generating" a value that could eliminate a load if kept in a register. Hence, we call these sources generators.

2.2 Scalar Replacement

Before discussing loop restructuring transformations, we will describe how to replace subscripted variables by temporary scalar variables to effect reuse. In making this transformation, we consider only loop-independent dependences and dependences carried by the innermost loop, since it would be impractical to save a value in a register over all the iterations of the inner loop just to eliminate a single memory reference. The procedure consists of the following steps.

Determining the Number of Temporaries. We begin by calculating the number of temporary variables, n_i, needed to hold each variable generated by the source, g_i, of a true or input dependence long enough to achieve reuse. For each generator, we need τ(e) + 1 temporary variables to eliminate the load at the sink of e, where e is the outgoing edge with the largest threshold. This provides one temporary for each of the values computed on the τ(e) previous iterations plus one for the value computed on the current iteration. Note that for some backward dependence edges it may be possible to use one less temporary variable. However, for the sake of simplicity, our implementation treats backward and forward edges uniformly when determining the number of temporaries. Although this would seem to create unnecessary register pressure, the typical coloring register allocator will coalesce the extra temporaries with others before generating code.

Reference Replacement. The next step in the transformation is to create temporary variables, T_i^0, T_i^1, ..., T_i^{n_i-1}, for each generator, g_i, and replace the reference to the generator with T_i^0. The sink of each of the generator's outgoing dependence edges should be replaced with T_i^k, where k is the threshold of the dependence edge. If g_i is a load from memory, then we must insert a "load" of the form T_i^0 = g_i immediately before the generating statement. If g_i is a definition, then we must insert a "store" of the form g_i = T_i^0 immediately following the generating statement. This second assignment may be unnecessary if there are other definitions of the same location in the same iteration of the loop body, which can be detected by looking for a loop-independent consistent output dependence leaving the definition.

The effect of scalar replacement will be illustrated on the following loop nest.

      DO 10 I = 1, 100
1       A(I+2) = A(I) + C(I)
2       B(K)   = B(K) + C(I)
10    CONTINUE

The reuse-generating dependences are:

1. A true dependence from A(I+2) to A(I) in statement 1 (threshold 2),

2. A loop-independent input dependence from C(I) in statement 1 to C(I) in statement 2 (threshold 0), and

3. A true dependence from B(K) to B(K) in statement 2 (threshold 1).

In addition, there is an output dependence from B(K) in statement 2 to itself.

By our method, the generator A(I+2) in statement 1 needs three temporaries, which we call T10, T11 and T12. Here we are using the first numeric digit to indicate the number of the generator. The generator C(I) in statement 1 needs only one temporary, T20, and the generator B(K) in statement 2 needs two temporaries, T30 and T31. Note that we could use only one temporary for the last generator, but we have chosen to be consistent for the sake of simplicity. When we apply the reference replacement procedure to the example loop, we generate the following code.

      DO 10 I = 1, 100
        T20    = C(I)
1       T10    = T12 + T20
        A(I+2) = T10
2       T30    = T31 + T20
        B(K)   = T30
10    CONTINUE

Register-to-Register Moves. In order to keep the values in temporary variables correct across the iterations of the innermost loop, we insert, for each generator g_i, the following block of assignments at the end of the loop body.

      T_i^{n_i-1} = T_i^{n_i-2}
      T_i^{n_i-2} = T_i^{n_i-3}
        ...
      T_i^1 = T_i^0

In the example from above, the code becomes:

      DO 10 I = 1, 100
        T20    = C(I)
1       T10    = T12 + T20
        A(I+2) = T10
2       T30    = T31 + T20
        B(K)   = T30
        T12    = T11
        T11    = T10
        T31    = T30
10    CONTINUE

Code Motion. It may be the case that the assignment to a generator may be moved entirely out of the innermost loop and put after the loop. This is possible when the reference to the generator does not vary with the iterations of the innermost loop. In the example above, B(K) does not change with each iteration of the I-loop; therefore, its value can be kept in a register during the entire execution of the loop and stored back into B(K) after the loop exit, as below.

      DO 10 I = 1, 100
        T20    = C(I)
1       T10    = T12 + T20
        A(I+2) = T10
2       T30    = T31 + T20
        T12    = T11
        T11    = T10
        T31    = T30
10    CONTINUE
      B(K) = T30

The enabling conditions for this type of code motion in the presence of true dependences are:

• There exists a consistent output dependence carried by the innermost loop from a definition to itself, and

• All innermost-loop-carried and loop-independent dependences related to that definition are consistent.

In the presence of only input dependences, the condition for moving a load out of the loop is:

• There exists an innermost-loop-carried consistent input dependence from a use to itself.

We call each of these array reference types scalar because they are scalar with respect to a given loop. When only the first condition for scalar array references is met, a store to a generator for that variable cannot be moved outside of the innermost loop. Consider the following example.

      DO 10 I = 1, N
        DO 10 J = 1, N
10        A(I) = A(I) + A(J)

The true dependence from A(I) to A(J) is not consistent. If the value of A(I) were kept in a temporary and stored into A(I) outside of the loop, then the value of A(J) would be wrong whenever I = J and I > 1. This situation can be recognized while pruning the dependence graph.

Initialization. Finally, max(n_1, ..., n_m) - 1 iterations of the loop must be peeled from the beginning of the loop to initialize the temporary variables correctly. In these initial iterations, the sinks of the dependences from generators for iteration k are replaced with the temporary variable T_i^j only if j < k, and the appropriate number of register transfers are inserted for each peeled iteration. When this transformation is applied to our example, we get the following code.

      T20  = C(1)
      T10  = A(1) + T20
      A(3) = T10
      T30  = B(K) + T20
      T11  = T10
      T31  = T30
      T20  = C(2)
      T10  = A(2) + T20
      A(4) = T10
      T30  = T31 + T20
      T12  = T11
      T11  = T10
      T31  = T30
      DO 10 I = 3, 100
        T20    = C(I)
1       T10    = T12 + T20
        A(I+2) = T10
2       T30    = T31 + T20
        T12    = T11
        T11    = T10
        T31    = T30
10    CONTINUE
      B(K) = T30

We have eliminated three loads and one store from each iteration of the loop, at the cost of three register-to-register transfers in each iteration. Fortunately, these transfer instructions can be eliminated if we unroll the original loop by 2 [CCK87]. Even without unrolling, many transfers will be eliminated by live-range coalescing in the target compiler's register allocator.

2.2.1 Reducing Register Pressure

During our experiments, we discovered that a problem arises when scalar replacement creates a large number of scalars that compete for registers: some compilers were not as effective as we expected when presented with loops that require almost every available register. In particular, some compilers failed to coalesce temporaries whose live ranges did not interfere. Because of this, we decided to generate temporaries in a way that minimizes register pressure. In effect, we wanted to do part of the register allocator's job ourselves.

Our goal is to allocate temporary variables so that we can eliminate the most memory accesses, given the available number of machine registers. This can be achieved by modeling the task as a knapsack problem, where the number of scalar temporaries required for a generator is the size of the object to be put in the knapsack, the number of memory accesses eliminated is the value of the object and the size of the register file is the knapsack size. Using dynamic programming, an optimal solution to the knapsack problem can be found in O(kn) time, where k is the number of registers available for allocation and n is the number of generators. Hence, for a specific machine, we get a linear time bound. The best solution to the knapsack problem may leave the knapsack partially empty, requiring us to examine each of the generators not yet allocated to see whether they have an edge which requires few enough temporaries to be allocated in the leftover space.

If the optimal solution takes too much time, as it may on a machine with a large number of registers, a greedy approximation algorithm might be used. In this approximation scheme, the generators are ordered in decreasing order based upon the ratio of benefit to cost, or in this case, the ratio of memory accesses saved to registers required. At each step, the algorithm chooses the first generator that fits into the remaining registers. This method requires O(n log n) time to sort the ratios and O(n) time to select the generators. Hence, the total running time is O(n log n). With the greedy method, we have a running time that is independent of machine architecture, making it more practical for use in a general tool such as PFC.

Unfortunately, the greedy approximation can produce suboptimal results. To see this, consider the following example, in which each generator is represented by a pair (n, m), where n is the number of registers needed and m is the number of memory accesses eliminated. Assume that we have a machine with 6 registers and that we are generating temporaries for a loop that has the following generators: (4,8), (3,5), (3,5), (2,1). The greedy algorithm would first choose (4,8) and then (2,1). This would result in the elimination of nine memory accesses instead of the optimal ten. However, it may be possible to improve the greedy strategy if partial allocations for generators are considered.

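To make the greedy selection concrete, here is a minimal sketch in Fortran; the routine name GREEDY and its arguments are illustrative, not taken from the PFC implementation. It assumes the generators have already been sorted by decreasing ratio of memory accesses saved (M) to registers required (NR).

C     Greedy selection of generators for scalar replacement.
C     NGEN generators, sorted by decreasing M(I)/NR(I).
C     NR(I) = registers needed, M(I) = memory accesses saved,
C     NREGS = size of the register file. PICK(I) is set to
C     .TRUE. for each generator chosen. (Illustrative sketch,
C     not the PFC implementation.)
      SUBROUTINE GREEDY(NGEN, NR, M, NREGS, PICK)
      INTEGER NGEN, NR(NGEN), M(NGEN), NREGS
      LOGICAL PICK(NGEN)
      INTEGER I, LEFT
      LEFT = NREGS
      DO 10 I = 1, NGEN
        PICK(I) = .FALSE.
C       Take the generator if its temporaries fit in the
C       registers that remain.
        IF (NR(I) .LE. LEFT) THEN
          PICK(I) = .TRUE.
          LEFT = LEFT - NR(I)
        ENDIF
10    CONTINUE
      RETURN
      END

On the four generators above, with NREGS = 6, this routine picks (4,8) and then (2,1), matching the suboptimal outcome described in the text.
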
2.2.2 Handling Conditional Control Flow

The algorithm we have described fails for loops with conditional control flow because it makes no provision for multiple generators: it is possible that the generator for a particular use depends upon the control-flow path taken through the body of the loop. This is the case in the following loop.

      DO 10 I = 1,N
        IF (B(I) .LT. 0.0) THEN
S1        A(I) = C(I) + D(I)
        ELSE
S2        A(I) = E(I) + F(I)
        ENDIF
S3      G(I) = 2*A(I)
10    CONTINUE

The definition of A(I) in either statement S1 or S2 can provide the value used in statement S3.

The problem of multiple generators is only one of many that must be handled by scalar replacement in the presence of control flow. The following is a more complete list of the responsibilities of the algorithm.

1. Determining which possible generators reach a load along the control-flow paths.

2. Finding the most recent definitions (or least recent uses) along the control-flow paths.

3. Determining the optimal dependence threshold for scalar replacement.

4. Determining the profitability of scalar replacement on each generator.

5. Naming the scalar temporaries consistently within the loop.

Many of these problems can be solved by formulating them as data-flow problems. For example, to determine if a definition of identical threshold reaches a load on all paths, we can solve a data-flow problem similar to available expressions.

We have developed a preliminary method for dealing with these problems. The method requires that each path through a loop define the requisite temporaries, a condition which may result in the insertion of loads. In the example above, this strategy would result in the following code.

      DO 10 I = 1,N
        IF (B(I) .LT. 0.0) THEN
          T10 = C(I) + D(I)
S1        A(I) = T10
        ELSE
          T10 = E(I) + F(I)
S2        A(I) = T10
        ENDIF
S3      G(I) = 2*T10
10    CONTINUE

While our approach works in most simple cases, it may introduce unnecessary overhead in the presence of more complex control flow. Our future research will pursue a more comprehensive algorithm for scalar replacement in the presence of conditionals.

2.3 Restructuring Transformations

In this section, we discuss two transformations that can be used to improve scalar replacement: loop interchange and unroll-and-jam. These transformations move dependences carried by outer loops inward, so that the distances between references to the same array location will be shortened enough to make scalar replacement feasible. We will also show how to perform unroll-and-jam on loops with irregular-shaped iteration spaces.

2.3.1 Loop Interchange

Loop interchange can be used to shift an outer loop to the innermost level, possibly moving dependences inward to the new innermost loop. Therefore, we want to use loop interchange, where safe, to increase reuse and remove as many memory accesses as possible [AK84a].

Our first goal with loop interchange is to move the loop that carries the most dependences for scalar array references to the innermost level. This will achieve the largest gains because loads and stores can be removed completely from the innermost loop. Also, the transformation presented in the next section can take advantage of this situation. If this step is not possible, then we would like to move the loop that carries the most consistent dependences. Unfortunately, different loop orderings may produce different carriers for the same dependence, which adds to the complexity of computing the optimal loop ordering.

Another problem with loop interchange is that getting the best loop ordering for scalar replacement may hurt other critical performance areas. Consider the following example.

      DO 10 J = 1, N
        DO 10 I = 1, N
10        A(I) = A(I) + B(I,J)

Because Fortran stores its arrays in column major order, the stride between accesses to the B array is one. With a cache that has a line length of l, this will correspond to l-1 cache hits and 1 cache miss for l accesses to B. If the I and J loops were interchanged so that A(I) could be kept in a register, the stride between accesses to B would be equal to the dimension of the column of B, resulting in poor cache performance.

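For concreteness, the interchanged nest discussed here would be:

      DO 10 I = 1, N
        DO 10 J = 1, N
10        A(I) = A(I) + B(I,J)

Now A(I) is invariant in the inner loop and can be held in a register, but consecutive accesses to B walk along a row, so they are separated in memory by the column dimension of B.
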
2.3.2 Unroll-and-Jam

Unroll-and-jam is another transformation that can be used, where safe, to shorten the distances between references to the same array location [AC72, AN87, CCK87]. The transformation unrolls an outer loop and then fuses the inner loops back together. It can transform dependences carried by an outer loop into dependences that are either loop independent or carried by the innermost loop. For example, the loop:

      DO 10 J = 1, 2*M
        DO 10 I = 1, N
10        A(I,J) = A(I-1,J) + A(I-1,J-1)

after unrolling becomes:

      DO 10 J = 1, 2*M, 2
        DO 20 I = 1, N
20        A(I,J) = A(I-1,J) + A(I-1,J-1)
        DO 10 I = 1, N
10        A(I,J+1) = A(I-1,J+1) + A(I-1,J)

and after fusion becomes:

      DO 10 J = 1, 2*M, 2
        DO 10 I = 1, N
S1        A(I,J)   = A(I-1,J)   + A(I-1,J-1)
S2        A(I,J+1) = A(I-1,J+1) + A(I-1,J)
10    CONTINUE

The true dependence from A(I,J) to A(I-1,J-1) is originally carried by the outer loop, but after unroll-and-jam the dependence edge goes from A(I,J) in S1 to A(I-1,J) in S2 and is carried by the innermost loop. We can now allocate a temporary for A(I,J) and remove two loads instead of one as in the original loop.

To allow scalar replacement to take advantage of the new opportunities for reuse created by unroll-and-jam, we must update the dependence graph to reflect these changes. Below is a method to compute the updated dependence graph after loop unrolling if we restrict ourselves to consistent dependences that contain only one loop induction variable in each subscript position [CCK87]. The dependence graph G' = (V', E') for L' can be computed from the dependence graph G = (V, E) for L by the following rules:

1. For each v ∈ V, there are k+1 nodes v_0, ..., v_k in V'. These correspond to the original statement and its k copies.

2. For each edge e = (v, w) ∈ E with consistent threshold τ(e), there are k+1 edges e_0, ..., e_k, where e_j = (v_j, w_i) and we have the relations

      i = (j + τ(e)) mod (k + 1)
      τ(e_j) = ⌊(j + τ(e)) / (k + 1)⌋

Intuitively, these rules simply take account of the fact that several iterations are unrolled into the body, so there are k+1 copies of both v and w and the thresholds are greater than zero only for dependences that cross iterations of the new loop.

These rules can be extended to unroll-and-jam in the following manner. Whenever a threshold of a dependence becomes zero, the next inner non-zero entry in the distance vector for the dependence becomes the new threshold and the loop corresponding to that entry becomes the new carrier. If there is no non-zero entry, then the new dependence becomes loop independent. If the edge is not carried by the unrolled loop, then it is just copied into the new loop body.

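As a quick check of these rules on the fused example above (where k = 1): the edge from A(I,J) originally has τ(e) = 1 on the J loop, so copy j = 0 gets sink index i = (0 + 1) mod 2 = 1 and a new J-threshold of ⌊1/2⌋ = 0; the next non-zero entry of the distance vector, the I-distance of 1, then becomes the new threshold, giving exactly the innermost-loop-carried edge from S1 to S2 described above.
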
Unroll-and-jam not only improves register allocation opportunities; it also improves cache performance by shortening distances between sources and sinks of dependences. As a result, cache hits are more likely [Por89]. Finally, unroll-and-jam can be used to improve pipeline utilization in the presence of recurrences [CCK87].

The major difficulty with unroll-and-jam is determining the amount of unrolling necessary to produce optimal results. There are many factors that are involved in the decision process, some of which are register pressure (both integer and floating-point), storage layout, loop balance, pipeline utilization and instruction-cache size [CCK87]. The current implementation picks the two outer loops that carry the most dependences (true and input) and unrolls them by a factor provided as a parameter.

We have developed, but not implemented, an efficient heuristic that chooses this parameter automatically. Because we are using unroll-and-jam to improve scalar replacement, our heuristic concentrates on register-file utilization. We would like to use all of the floating-point registers available if possible, but we do not want to unroll the loop more than necessary. Unrolling carelessly can lead to a large explosion in code size with little additional improvement in execution time. Therefore, the decision procedure becomes a constrained integer-optimization problem, where the size of the inner-loop body is maximized with the constraint that the register pressure is not increased beyond the limits of the hardware. To handle the problems of pipeline utilization and instruction-cache size, we can increase or decrease the unroll factor to fit the situation, although unrolling for register usage, by itself, is likely to create enough parallelism, while keeping the inner-loop body within a reasonable size limit.

2.3.3 Irregular-Shaped Iteration Spaces

When the iteration space of a loop is not rectangular, unroll-and-jam cannot be directly applied, as shown previously. For each iteration of the outer loop, the number of iterations of the inner loop may vary, which makes unrolling a specific amount illegal. Below, we show how to perform unroll-and-jam on triangular- and trapezoidal-shaped iteration spaces by extending previous work done on triangular loop interchange [Wol86, Wol87, CK89].

Another way to think of unroll-and-jam is as an application of strip mining, loop interchange and loop unrolling. Using this model to demonstrate the transformation, we give the general form of a loop that has already been strip mined. The I and J loops have been normalized to give a lower bound of 1, α and β are integer constants (β may be a loop invariant), α > 0 and, for simplicity, S divides N.

      DO 10 I = 1,N,S
        DO 10 II = I,I+S-1
          DO 10 J = 1,α*II+β
10          loop body

To interchange the II and J loops, we have to account for the fact that the line J = α*II+β intersects the iteration space at the point where J = α*I+β. Therefore, when we interchange the loops, the II-loop must iterate over a trapezoidal region, as shown in Figure 1, requiring its lower bound to begin at I until (J-β)/α > I. This gives the following loop nest.

      DO 10 I = 1,N,S
        DO 10 J = 1,α*(I+S-1)+β
          DO 10 II = MAX(I,(J-β)/α),I+S-1
10          loop body

[FIGURE 1: New Iteration Space. Horizontal axis II with ticks at I, I+S and N; vertical axis J reaching α*N+β; the line J = α*II+β splits each strip into a rectangular region below it and a triangular region above it.]

To complete the unroll-and-jam transformation, we need to unroll the inner loop. Unfortunately, the iteration space is a trapezoidal region, making the number of iterations of the inner loop vary with J. To overcome this, we can split the index set of J into one loop that iterates over the rectangular region below the line J = α*I+β and one loop that iterates over the triangular region above the line. Since we know the length of the rectangular region, the first loop can be unrolled to give the following loop nest.

      DO 10 I = 1,N,S
        DO 20 J = 1,α*I+β
20        loop body unrolled S-1 times
        DO 10 II = I+1,I+S-1
          DO 10 J = MAX(1,α*I+β+1),α*II+β
10          loop body

Depending upon the values of α and β, it may also be possible to determine the size of the triangular region; therefore, it may be possible to completely unroll the second loop nest to eliminate the overhead. The formula above can be trivially extended to handle α < 0.

While the previous method applies to many of the common irregular-shaped iteration spaces, there are still some important loops that it will not handle. In linear algebra, signal processing and partial differential equation codes, we often find loops with trapezoidal-shaped iteration spaces. Consider the following example.

      DO 10 I = 1,N
        DO 10 J = I,MIN(I+M,L)
10        loop body

Here, as is often the case, a MIN function has been used to handle boundary conditions, resulting in the loop's trapezoidal shape. The formula to handle a triangular region does not apply in this case, so we must extend it. Consider the previous loop after normalization.

      DO 10 I = 1,N
        DO 10 J = 1,MIN(I+M,L)-I+1
10        loop body

We have a rectangular iteration space while I+M < L and a triangular region after that point. Because we already know how to handle rectangular and triangular regions and because we want to reduce the execution overhead in the rectangular region, we can use index set splitting, as described previously, to split the loop into one rectangular and one triangular iteration space.

      DO 10 I = 1,MIN(N,L-M)
        DO 10 J = 1,M+1
10        loop body
      DO 10 I = MAX(1,MIN(N,L-M)+1),N
        DO 10 J = 1,L-I+1
10        loop body

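As a quick check of the split bounds: for I ≤ MIN(N,L-M) we have I+M ≤ L, so the normalized upper bound MIN(I+M,L)-I+1 equals M+1 on every iteration, which is the rectangular region; for larger I the bound becomes L-I+1, which shrinks as I grows, which is the triangular region.
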
We can now apply regular unroll-and-jam to the first loop and triangular unroll-and-jam to the second. For the general case, we can easily extend the index set splitting to allow MIN or MAX functions in either the upper or lower bound of a loop and to allow constant coefficients on the induction variable appearances in the bounds of the inner loop.

3 Experimental Results

We have implemented a prototype source-to-source translator based on PFC, a dependence analyzer and parallelization system for Fortran programs. The prototype restructures loops using unroll-and-jam and then replaces subscripted variables with scalars, using both the knapsack and greedy algorithms to moderate register pressure. For our test machine, we chose the MIPS M120 (based on the MIPS R2000) because it had a good compiler and a reasonable number of floating-point registers (16) and because it was available to us. In examining these results, the reader should be aware that some of the improvements cited may have been achieved because the transformed versions of the program have, in addition to better register utilization, more operations to fill the pipeline on the M120.

The experimental design is illustrated in Figure 2. In this scheme, PFC serves as a preprocessor, rewriting Fortran programs to improve register allocation. Both the original and transformed versions of the program are then compiled and run using the standard product compiler for the target machine.

[FIGURE 2: Experimental design. A Fortran source program is fed to the source-to-source transformer (PFC); the original and the improved program are each compiled with the target machine's Fortran compiler, and the original and transformed run times are compared.]

Livermore Loops. We tested scalar replacement on a number of Livermore loops with unroll-and-jam turned off. Many of the loops did not contain opportunities for reuse or contained inner-loop conditional control flow, which we do not currently handle. Therefore, we do not show results for these loops, since no improvement was attainable by our system. In Table 1, we show the performance of the original and transformed loops, using arrays of size 1000 and single precision REALs. The "Unrolled" column shows the speed of the original loop after unrolling it an amount equivalent to that done by scalar replacement. The speedup column may have two fields. The first field shows the speedup of the transformed loop over the original loop. The second field, if present, shows the speedup of the transformed loop over the unrolled loop.

Loop | Iterations | Original | Unrolled | Transformed | Speedup
  1  |   20,000   |  25.44s  |  20.6s   |   20.0s     | 1.23 / 1.03
  5  |   25,000   |  16.61s  |  15.12s  |   12.15s    | 1.37 / 1.24
 13  |    2,600   |  30.22s  |    -     |   29.14s    | 1.04
 14  |    3,000   |  22.8s   |    -     |   19.61s    | 1.16
 19  |   10,000   |  21.19s  |    -     |   19.36s    | 1.09

TABLE 1: Livermore Loops results.

Some of the interesting results include the performances of loops 5 and 11, which compute first order linear recurrences. In each of these loops, one load per iteration of the loop is removed and the loop is unrolled once. The unrolling has limited effect because of the recurrence and the fact that our version of the MIPS compiler (1.31) seems to be more conservative in allocating registers to loop-independent dependences when stores to arrays are present. Our method removes this problem with the allocation of temporary variables and we see significant improvements. Loops 1 and 12 do not show as much improvement over the unrolled original loop because they involve input dependences rather than true dependences. Because there is no recurrence, unrolling improves parallelism and the MIPS compiler can allocate registers to the loop-independent dependences that are created.

Loop 6, shown below, is an interesting case because it involves a scalar array reference that is difficult to recognize.

      DO 6 I = 2,N
        DO 6 K = 1,I-1
6         W(I) = W(I) + B(I,K)*W(I-K)

A compiler that recognizes loop invariant addresses to get this case would fail because of the load of W(I-K). Through the use of dependence analysis, we are able to determine that there is no dependence between W(I) and W(I-K) that is carried by the innermost loop. Because of this additional information, we are able to get a speedup of 1.09.

Loop 8 required more registers than the MIPS provides, so the transformed loop was approximately 10% slower than the original loop. After we used either the knapsack or greedy algorithm to relieve register pressure, we saw a speedup of 1.04 over the original loop.

Linpack Benchmark. An important test for a restructuring compiler is to achieve the same results as a hand optimized version of code. We chose the routine SMXPY from LINPACK (see appendix A), which computes a vector matrix product and was hand unrolled to improve performance [DBMS79]. We rerolled it to get the original loop, shown below:

      DO 10 J = 1, N2
        DO 10 I = 1, N1
10        Y(I) = Y(I) + X(J)*M(I,J)

and then fed this loop to PFC, which performed unroll-and-jam 15 times (the same as the hand-optimized version) plus scalar replacement. As the table below shows, our automatic techniques achieved slightly better performance from the easy-to-understand original loop, without the effort of hand optimization. The speedup column represents the improvement of the automatic version over the hand-optimized version. To make this measurement, we iterated over each loop 30 times using arrays of dimension 1000.

Original | LINPACK | Automatic | Speedup
 28.0s   | 18.47s  |  17.52s   |  1.05

Matrix Multiplication. Next, we performed our test on matrix multiplication with 500 x 500 arrays of single precision REAL numbers (see appendix B). For matrix multiplication, we used various unroll values performed on both outer loops plus scalar replacement. We show their separate effects in the table below. The first entry in the speedup column of the table represents the speedup due to scalar replacement alone, whereas the second column shows the speedup due to unroll-and-jam plus scalar replacement.

[Table: matrix multiplication results by unroll value, with columns Unroll Value, Original, Scalar replacement and Unroll-and-jam plus scalar replacement; the entries are not recoverable from this copy.]

When the unroll values used on matrix multiply exceeded 1, the register pressure of the loop caused scalar replacement to hurt performance rather than help it. Fortunately, both the knapsack and greedy algorithms successfully eliminated this problem, reducing the time for the loop unrolled by a factor of 3 from 67.2s to 60.49s.

A Trapezoidal Loop. To test the effectiveness of unroll-and-jam on loops with irregular-shaped iteration spaces, we chose the following trapezoidal loop, which computes the adjoint-convolution of two time series.

      DO 10 I = 1,N1
        DO 10 K = I,MIN(I+N2,N3)
10        F3(I) = F3(I) + DT*F1(K)*WORK(I-K)

We performed trapezoidal unroll-and-jam by hand with an unroll value of 3 plus scalar replacement. We ran it on arrays of single-precision REALs of size 100 so that 75% of the execution was done in the triangular region. Below is a table of the results.

Iterations | Original | Transformed | Speedup
   5000    |  15.59s  |    9.08s    |  1.72

A Complete Application. We also performed the test on a complete application, onedim. This program computes the eigenfunctions and eigenenergies of a time-independent Schrödinger equation. The table below shows the results when scalar replacement and unroll-and-jam are applied with an unroll value of 2.

Original | Transformed | Speedup
 47.9s   |   33.34s    |  1.44

4 Summary and Future Work

We have presented the method of scalar replacement, which transforms program source to uncover opportunities for reuse of subscripted variables. This method can be enhanced by a number of transformations, particularly unroll-and-jam. We have shown how unroll-and-jam can be applied to a variety of irregular loop structures, including triangular and trapezoidal loops. We have also demonstrated that these techniques can lead to dramatic speedups on RISC architectures, even those for which the native compiler does extensive code optimization (the MIPS R2000, for example).

We are currently engaged in studying the generality of these techniques by extending our implementation and applying it to a wider class of programs. This research involves the following subprojects:

• The current implementation is incomplete. We are in the process of extending it to automatically compute unroll amounts and to handle triangular and trapezoidal loops and loops with conditionals.

• Similar methods can be used to improve cache performance. The key transformation for this purpose is strip-mine-and-interchange [Por89] (sometimes called loop blocking [CK89], iteration space tiling [Wol87, Wol89] or supernode partitioning [IT88]). This transformation is essentially unroll-and-jam with the copies of the inner loop body rolled back into a loop. We have implemented a preliminary version of this transformation in PFC and performed some experiments that indicate that it further enhances machine performance [CK89]. We are in the process of enhancing this transformation to automatically select strip sizes and to handle complicated loop structures.

• Although the study reported here demonstrates the potential of these methods, we need to conduct a study that determines their effectiveness on whole programs. Once the implementation is complete, we plan to apply the transformations to the RiCEPS Fortran benchmark suite being developed at Rice.

Given that future machine designs are certain to have increasingly complex memory hierarchies, compilers will need to adopt increasingly sophisticated strategies for managing the memory hierarchy so that programmers can remain free to concentrate on program logic. The transformations and experiments reported in this paper represent an encouraging first step in that direction. It is pleasing to note that the theory of dependence, originally developed for vectorizing and parallelizing compilers, can be used to substantially improve the performance of modern scalar machines as well.

Acknowledgments

The authors would like to thank Preston Briggs, Keith Cooper, Ervan Darnell, Paul Havlak and Chau-Wen Tseng for their helpful suggestions during the preparation of this document. We also thank John Cocke and Peter Markstein for giving us the inspiration for this work and supporting us while we pursued it.

References

[AC72] F.E. Allen and J. Cocke. A catalogue of optimizing transformations. In Design and Optimization of Compilers, pages 1-30. Prentice-Hall, 1972.

[AK84a] J.R. Allen and K. Kennedy. Automatic loop interchange. In Proceedings of the SIGPLAN '84 Symposium on Compiler Construction, SIGPLAN Notices Vol. 19, No. 6, June 1984.

[AK84b] J.R. Allen and K. Kennedy. PFC: A program to convert Fortran to parallel form. In Supercomputers: Design and Applications, pages 186-205. IEEE Computer Society Press, Silver Spring, MD, 1984.

[AK88] J.R. Allen and K. Kennedy. Vector register allocation. Technical report, Department of Computer Science, Rice University, 1988.

[AN87] A. Aiken and A. Nicolau. Loop quantization: An analysis and algorithm. Technical Report 87-821, Cornell University, March 1987.

[CAC+81] G.J. Chaitin, M.A. Auslander, A.K. Chandra, J. Cocke, M.E. Hopkins, and P.W. Markstein. Register allocation via coloring. Computer Languages, 6:45-57, January 1981.

[CCK87] D. Callahan, J. Cocke, and K. Kennedy. Estimating interlock and improving balance for pipelined machines. In Proceedings of the 1987 International Conference on Parallel Processing, August 1987.

[CK89] S. Carr and K. Kennedy. Blocking linear algebra codes for memory hierarchies. In Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing, Chicago, IL, December 1989.

[DBMS79] J.J. Dongarra, J.R. Bunch, C.B. Moler, and G.W. Stewart. LINPACK User's Guide. SIAM Publications, Philadelphia, 1979.

[GJG87] D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformations. In Proceedings of the First International Conference on Supercomputing. Springer-Verlag, Athens, Greece, 1987.

[IT88] F. Irigoin and R. Triolet. Supernode partitioning. In Conference Record of the Fifteenth ACM Symposium on the Principles of Programming Languages, pages 319-328, January 1988.

[Kuc78] D. Kuck. The Structure of Computers and Computations, Volume 1. John Wiley and Sons, New York, 1978.

[Por89] A.K. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Applications. PhD thesis, Rice University, May 1989.

[Wol82] M. Wolfe. Optimizing Supercompilers for Supercomputers. PhD thesis, University of Illinois, October 1982.

[Wol86] M. Wolfe. Advanced loop interchange. In Proceedings of the 1986 International Conference on Parallel Processing, August 1986.

[Wol87] M. Wolfe. Iteration space tiling for memory hierarchies, December 1987. Extended version of a paper which appeared in Proceedings of the Third SIAM Conference on Parallel Processing.

[Wol89] M. Wolfe. More iteration space tiling. In Proceedings of the Supercomputing '89 Conference, 1989.

Appendix A

Hand-optimized version of SMXPY from LINPACK.

      J = MOD(N2,2)
      IF (J .GE. 1) THEN
        DO 1 I = 1, N1
          Y(I) = (Y(I)) + X(J)*M(I,J)
    1   CONTINUE
      ENDIF
      J = MOD(N2,4)
      IF (J .GE. 2) THEN
        DO 2 I = 1, N1
          Y(I) = ((Y(I)) + X(J-1)*M(I,J-1)) + X(J)*M(I,J)
    2   CONTINUE
      ENDIF
      J = MOD(N2,8)
      IF (J .GE. 4) THEN
        DO 3 I = 1, N1
          Y(I) = ((((Y(I)) + X(J-3)*M(I,J-3)) + X(J-2)*M(I,J-2))
     $        + X(J-1)*M(I,J-1)) + X(J)*M(I,J)
    3   CONTINUE
      ENDIF
      J = MOD(N2,16)
      IF (J .GE. 8) THEN
        DO 4 I = 1, N1
          Y(I) = ((((((((Y(I))
     $        + X(J-7)*M(I,J-7)) + X(J-6)*M(I,J-6))
     $        + X(J-5)*M(I,J-5)) + X(J-4)*M(I,J-4))
     $        + X(J-3)*M(I,J-3)) + X(J-2)*M(I,J-2))
     $        + X(J-1)*M(I,J-1)) + X(J)*M(I,J)
    4   CONTINUE
      ENDIF
      JMIN = J + 16
      DO 6 J = JMIN, N2, 16
        DO 5 I = 1, N1
          Y(I) = ((((((((((((((((Y(I))
     $        + X(J-15)*M(I,J-15)) + X(J-14)*M(I,J-14))
     $        + X(J-13)*M(I,J-13)) + X(J-12)*M(I,J-12))
     $        + X(J-11)*M(I,J-11)) + X(J-10)*M(I,J-10))
     $        + X(J-9)*M(I,J-9)) + X(J-8)*M(I,J-8))
     $        + X(J-7)*M(I,J-7)) + X(J-6)*M(I,J-6))
     $        + X(J-5)*M(I,J-5)) + X(J-4)*M(I,J-4))
     $        + X(J-3)*M(I,J-3)) + X(J-2)*M(I,J-2))
     $        + X(J-1)*M(I,J-1)) + X(J)*M(I,J)
    5   CONTINUE
    6 CONTINUE

Appendix B

Automatically generated matrix multiply with unroll-and-jam performed on both the I-loop and J-loop with an unroll value of 1 plus scalar replacement.

      DO 1 J = 1,MOD(N,2),1
        DO 2 I = 1,MOD(N,4),1
          C(I,J) = 0.0
          IF (1 .GT. N) GOTO 4
          C$1$0 = C(I,J)+A(I,1)*B(1,J)
          DO 3 K = 2,N,1
            C$1$0 = C$1$0+A(I,K)*B(K,J)
    3     CONTINUE
          C(I,J) = C$1$0
    4     CONTINUE
    2   CONTINUE
        DO 5 I = MOD(N,4)+1,N,4
          C(I,J) = 0.0
          C(I+1,J) = 0.0
          C(I+2,J) = 0.0
          C(I+3,J) = 0.0
          IF (1 .GT. N) GOTO 7
          B$1$0 = B(1,J)
          C$1$0 = C(I,J)+A(I,1)*B$1$0
          C$2$0 = C(I+1,J)+A(I+1,1)*B$1$0
          C$3$0 = C(I+2,J)+A(I+2,1)*B$1$0
          C$4$0 = C(I+3,J)+A(I+3,1)*B$1$0
          DO 6 K = 2,N,1
            B$1$0 = B(K,J)
            C$1$0 = C$1$0+A(I,K)*B$1$0
            C$2$0 = C$2$0+A(I+1,K)*B$1$0
            C$3$0 = C$3$0+A(I+2,K)*B$1$0
            C$4$0 = C$4$0+A(I+3,K)*B$1$0
    6     CONTINUE
          C(I+3,J) = C$4$0
          C(I+2,J) = C$3$0
          C(I+1,J) = C$2$0
          C(I,J) = C$1$0
    7     CONTINUE
    5   CONTINUE
    1 CONTINUE
      DO 8 J = MOD(N,2)+1,N,2
        DO 9 I = 1,MOD(N,2),1
          C(I,J) = 0.0
          C(I,J+1) = 0.0
          IF (1 .GT. N) GOTO 11
          A$1$0 = A(I,1)
          C$1$0 = C(I,J)+A$1$0*B(1,J)
          C$2$0 = C(I,J+1)+A$1$0*B(1,J+1)
          DO 10 K = 2,N,1
            A$1$0 = A(I,K)
            C$1$0 = C$1$0+A$1$0*B(K,J)
            C$2$0 = C$2$0+A$1$0*B(K,J+1)
   10     CONTINUE
          C(I,J+1) = C$2$0
          C(I,J) = C$1$0
   11     CONTINUE
    9   CONTINUE
        DO 12 I = MOD(N,2)+1,N,2
          C(I,J) = 0.0
          C(I,J+1) = 0.0
          C(I+1,J) = 0.0
          C(I+1,J+1) = 0.0
          IF (1 .GT. N) GOTO 14
          B$1$0 = B(1,J)
          A$1$0 = A(I,1)
          C$1$0 = C(I,J)+A$1$0*B$1$0
          B$2$0 = B(1,J+1)
          C$2$0 = C(I,J+1)+A$1$0*B$2$0
          A$2$0 = A(I+1,1)
          C$3$0 = C(I+1,J)+A$2$0*B$1$0
          C$4$0 = C(I+1,J+1)+A$2$0*B$2$0
          DO 13 K = 2,N,1
            B$1$0 = B(K,J)
            A$1$0 = A(I,K)
            C$1$0 = C$1$0+A$1$0*B$1$0
            B$2$0 = B(K,J+1)
            C$2$0 = C$2$0+A$1$0*B$2$0
            A$2$0 = A(I+1,K)
            C$3$0 = C$3$0+A$2$0*B$1$0
            C$4$0 = C$4$0+A$2$0*B$2$0
   13     CONTINUE
          C(I+1,J+1) = C$4$0
          C(I+1,J) = C$3$0
          C(I,J+1) = C$2$0
          C(I,J) = C$1$0
   14     CONTINUE
   12   CONTINUE
    8 CONTINUE