Compiler Blockability of Numerical Algorithms*

Steve Carr          Ken Kennedy

Department of Computer Science
Rice University
Houston, TX 77251-1892

*Research supported by DARPA through ONR Grant N00014-91-J-1989 and by NSF Grant CCR-9120008.
Abstract

Over the past decade, microprocessor design strategies have focused on increasing the computational power available on a single chip. Unfortunately, memory speeds have not kept pace. The result is an imbalance between computation speed and memory speed. This imbalance is leading machine designers to use more complicated memory hierarchies. In turn, programmers are explicitly restructuring codes to perform well on particular memory systems, leading to machine-specific programs. We believe that machine-specific programming is a step in the wrong direction: advanced compiler optimizations can obviate much of the need for it by automatically tailoring machine-independent code to a particular memory hierarchy.

This paper describes an investigation of the ability of compiler technology to block numerical algorithms, using codes related to LAPACK as the subject of study. Our results show that many block algorithms can be generated automatically from their machine-independent point forms while retaining performance on par with the hand-blocked versions, and that, for those that cannot, a small sublanguage of extensions will make it possible to write block algorithms in a machine-independent form.
1   Introduction

The trend in high-performance microprocessor design is toward increasing the computational power available on a single chip. Unfortunately, memory speed has not increased at the same rate. The result is an imbalance in which the number of cycles needed for a memory access is now on the order of 10 to 20 machine cycles. Although cache helps ameliorate this problem, it performs poorly on scientific codes whose working sets are larger than the cache. This imbalance has led programmers to restructure their codes by hand to improve memory-hierarchy performance, producing programs that are specialized to a particular machine. We contend that this is a step in the wrong direction: the user should not be creating a program that is specific to one memory hierarchy. Instead, the task of specializing a machine-independent program to a target machine should fall to the compiler. There is a long history of compiler success in this form of specialization, from vectorization to the advanced optimizations now common on scalar processors.

The question addressed by our research is, "What compiler technology is needed so that numerical programs written in a natural, machine-independent form in Fortran 77 achieve memory-hierarchy performance on par with the corresponding hand-blocked versions?" In LAPACK, Dongarra, Sorensen and their colleagues make use of sophisticated block algorithms, designed by hand, to achieve good memory-hierarchy performance [DDSvdV91]. This paper reports on a preliminary study, extending our earlier investigation of blocking linear algebra codes [CK89], of whether a compiler can automatically generate the best known block versions of several common calculations from linear algebra, starting from their machine-independent point forms expressed in Fortran 77.

In the course of this study we have found that dependence information can be used to analyze quite complex loops, that the key transformation needed to block many of these codes is a form of index-set splitting, and that the block codes produced in this way are competitive with the hand-blocked versions. We have also discovered that some algorithms, such as LU decomposition with partial pivoting, can be blocked automatically only if the compiler is equipped with additional knowledge, and that others cannot be blocked automatically at all. For those algorithms we propose a set of language extensions that make it possible to express block algorithms in a machine-independent form. Our results contribute to establishing the viability of this approach: with enhanced compiler technology, it should be possible for scientists to write programs in a natural, machine-independent form and still obtain good memory-hierarchy performance.

We begin with a review of background material on dependence and on the transformations a compiler uses to improve memory-hierarchy performance. Next, we present index-set splitting and IF-inspection, the transformations we found necessary to block the algorithms in our study. Then, we describe a study of algorithms related to LAPACK in which we attempt to derive the corresponding block algorithms automatically, and we propose language extensions for those algorithms that cannot be blocked. Finally, we discuss related work and summarize our conclusions.
2   Background

In this section we review dependence, the two types of reuse that memory-hierarchy optimizations exploit, and the blocking transformations that a compiler can apply automatically.

2.1   Dependence

To improve memory-hierarchy behavior, the compiler uses the same fundamental tool that is used in vectorization and parallelization, namely dependence. A dependence exists between two statements if there is a control-flow path from the first statement to the second and both statements reference the same memory location [Kuc78]. If the first statement writes the location and the second reads it, the dependence is a true (flow) dependence. If the first reads the location and the second writes it, the dependence is an antidependence. If both statements write the location there is an output dependence, and if both read it there is an input dependence. A dependence is carried by a loop when the source and the sink of the dependence occur on different iterations of that loop. In addition to its use in enabling transformations, dependence information can be used to summarize, as bounded regular sections, the portions of arrays accessed by particular references and loops [CK87, HK91]; we use this capability in Section 3.

2.2   Types of Reuse

There are two types of reuse that a memory-hierarchy optimization can exploit: temporal and spatial. Temporal reuse occurs when a reference accesses data that has previously been accessed, either by the same reference on a previous iteration of a loop or by some other reference. Spatial reuse occurs when a reference accesses data that is in the same cache line as some previous access. Consider the following loop.

      DO 10 I = 1,N
10      A(I) = A(I-5) + B(I)

The reference A(I-5) has temporal reuse, since it reads the value defined by A(I) five iterations earlier. The references to B have spatial reuse, since consecutive elements of B are likely to reside in the same cache line. Temporal reuse is revealed by a dependence between the references involved; here there is a true dependence from A(I) to A(I-5) with a distance of 5 iterations. If the distance is short enough that the datum is still in cache when the sink is executed, the reuse is captured. When it is not, transformations that shorten the distance or group the accesses are needed.

2.3   Iteration-Space Blocking

Blocking is a transformation that restructures a loop nest so that the data accessed by a block of iterations fits in cache and is reused before it is displaced [Wol87, Por89, LRW91]. The most common form of blocking is strip-mine-and-interchange, which divides the iterations of a loop into strips and interchanges the strip loop to an inner position. Consider the following loop.

      DO 10 I = 1,M
        DO 10 J = 1,N
10        A(I) = A(I) + B(J)

The reference B(J) has temporal reuse carried by the outer I-loop: every iteration of I accesses all N elements of B. If N is much greater than the size of the cache, none of this reuse is captured. Strip-mine-and-interchange applied to the J-loop produces

      DO 10 J = 1,N,JS
        DO 10 I = 1,M
          DO 10 JJ = J,MIN(J+JS-1,N)
10          A(I) = A(I) + B(JJ)

Now only JS elements of B are accessed on each iteration of the I-loop. If JS is small enough that these elements remain in cache, and there is no interference between A and B, the temporal reuse of B is captured. A(I) retains its temporal reuse in the innermost loop and can reside in a register. The MIN function is used to handle the case in which JS does not evenly divide N; a pre-loop can be generated instead.

The analogous transformation used to improve register reuse is unroll-and-jam [AK87, CCK90]. Unroll-and-jam unrolls an outer loop by the blocking factor and jams the resulting copies of the inner loop body into a single inner loop; essentially, it is strip-mine-and-interchange in which the strip loop is completely unrolled instead of remaining a loop. It brings reuse carried by an outer loop into the innermost loop body, where the reused values can be kept in registers.
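The register-level effect can be seen directly on the loop above. The following is a minimal sketch (not a figure from the paper) of unroll-and-jam applied with an unroll factor of 2, assuming M is even; two copies of the I iteration are jammed into the J-loop so that each value of B(J) fetched from memory is used twice from a register.

C     Unroll-and-jam of the example loop, unroll factor 2 (M even).
      DO 10 I = 1,M,2
        DO 10 J = 1,N
          A(I)   = A(I)   + B(J)
10        A(I+1) = A(I+1) + B(J)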
3   Index-Set Splitting

Sometimes a transformation cannot be applied directly to a loop nest in the form in which it appears. In these cases index-set splitting can enable the transformation. Index-set splitting creates multiple loops from one original loop, with each new loop iterating over a nonintersecting portion of the original iteration space; the order in which the iterations are executed is unchanged. As an example of index-set splitting, consider the following loop.

      DO 10 I = 1,M
10      A(I) = A(I) + B(I)

The index set of I can be split at 100 to obtain

      DO 10 I = 1,MIN(M,100)
10      A(I) = A(I) + B(I)
      DO 20 I = MAX(1,MIN(M,100)+1),M
20      A(I) = A(I) + B(I)

Although this transformation does nothing useful by itself, it can enable blocking of loops with complex iteration spaces. Below we show how index-set splitting is used to block triangular and trapezoidal iteration spaces and loops with more complex access patterns.

3.1   Triangular Iteration Spaces

Triangular loop nests are common in linear algebra. We describe the blocking of one type of triangular iteration space; the other cases are handled similarly [Car92]. Consider the loop nest below, in which the lower bound of the inner loop is a linear function of the outer loop index, a > 0, and a and b are integer constants (b may be symbolic).

      DO 10 I = 1,N
        DO 10 J = a*I+b,N
10        loop body

Figure 1 gives a graphical description of the iteration space of this loop nest: it is the upper left triangular region on or above the line J = aI + b.

[Figure 1: Upper left triangular iteration space, bounded below by the line J = aI + b, with I and J running to N.]

Strip-mine-and-interchange cannot be applied to this nest without modification, because after strip-mining the I-loop the lower bound of the J-loop varies across the iterations of the strip. Interchanging the strip loop to the innermost position requires a formula for the portion of the strip that lies on or above the line for each value of J. Since the line is known at compile time, the intersection of the line with a particular value of J occurs at the point ((J-b)/a, J), and the blocked nest can be derived as follows.

      DO 10 I = 1,N,IS
        DO 10 J = a*I+b,N
          DO 10 II = I,MIN((J-b)/a,I+IS-1)
10          loop body

The inner II-loop iterates over exactly those iterations of the strip that lie on or above the line, so the blocked nest executes the same iterations as the original nest.

To apply unroll-and-jam to a triangular nest, index-set splitting is used instead of a MIN function in the loop bounds. Splitting the index set of the J-loop at the point where the whole strip lies above the line creates two loops: one that iterates over the triangular region next to the line, in which only part of the strip can be executed, and one that iterates over the rectangular region above it, in which the loop body can be completely unrolled for all IS values of I.

      DO 10 I = 1,N,IS
        DO 20 J = a*I+b,MIN(a*(I+IS-1)+b-1,N)
20        pre-loop body
        DO 10 J = a*(I+IS-1)+b,N
10        unrolled loop body

The first loop is a pre-loop that handles the triangular region; the second loop can be fully unrolled. Depending upon the values of a, b and IS, it may be possible to unroll portions of the pre-loop as well, or to eliminate it entirely, at the cost of some extra loop overhead [Car92]. The same approach extends to triangular regions defined by the upper loop bound and to loops that iterate over rows, extra diagonals, or other substructures of an array [Car92].
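As a concrete check of the bound derived above, the following is a minimal, self-contained sketch (not from the paper) that instantiates the triangular strip-mine-and-interchange with a = 1 and b = 0 and an arbitrary loop body; the array names and the body are purely illustrative. The original and blocked nests fill two arrays that must agree.

C     Hypothetical driver: triangular blocking with a = 1, b = 0.
      PROGRAM TRIBLK
      INTEGER N, IS, I, J, II
      PARAMETER (N = 50, IS = 8)
      DOUBLE PRECISION D(N,N), E(N,N), DIFF
      DO 5 I = 1,N
        DO 5 J = 1,N
          D(I,J) = 0.0D0
5         E(I,J) = 0.0D0
C     original triangular nest (inner loop J = I,N)
      DO 10 I = 1,N
        DO 10 J = I,N
10        D(I,J) = D(I,J) + DBLE(I+J)
C     blocked nest: the strip loop II is cut off at J so that only
C     iterations with II .LE. J are executed
      DO 20 I = 1,N,IS
        DO 20 J = I,N
          DO 20 II = I,MIN(J,I+IS-1)
20          E(II,J) = E(II,J) + DBLE(II+J)
      DIFF = 0.0D0
      DO 30 I = 1,N
        DO 30 J = 1,N
30        DIFF = MAX(DIFF, ABS(D(I,J)-E(I,J)))
      WRITE(*,*) 'max difference between original and blocked = ', DIFF
      END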
3.2   Trapezoidal Iteration Spaces

While the previous method applies to many non-rectangular-shaped iteration spaces, there are still some important cases it does not handle. In linear algebra and partial differential equation codes, trapezoidal-shaped iteration spaces occur in loops of the following form, where L is assumed to be loop invariant.

      DO 10 I = 1,M
        DO 10 J = L,MIN(a*I+b,N)
10        loop body

Because of the MIN function in the upper bound of the inner loop, the iteration space consists of two regions: a triangular region, where the upper bound is aI + b, and a rectangular region, where it is N. Since each of these regions can already be blocked separately, the index set of I is split at the point where the two regions meet, creating one loop nest for each region.

      DO 10 I = 1,MIN(M,(N-b)/a)
        DO 10 J = L,a*I+b
10        loop body
      DO 20 I = MAX(1,MIN(M,(N-b)/a)+1),M
        DO 20 J = L,N
20        loop body

The first nest iterates over the triangular region and is blocked with the technique of the previous section; the second nest is rectangular and can be blocked directly. The same idea can be extended to rhomboidal iteration spaces, in which both the lower and the upper bound of the inner loop are linear functions of the outer loop index; performing a complete index-set splitting on both bounds results in four loop nests [Car92]. As an example, the following loop computes the convolution of two series and produces rhomboidal regions.

      DO 10 I = 0,N3
        DO 10 K = MAX(0,I-N2),MIN(I,N1)
10        F3(I) = F3(I) + DT*F1(K)*F2(I-K)

The adjoint form of this computation, which occurs in seismic processing codes, has a similar shape.

      DO 10 I = 0,N3
        DO 10 K = I,MIN(I+N2,N1)
10        F3(I) = F3(I) + DT*F1(K)*F2(K-I)

Blocking these regions is handled in a manner similar to the triangular case.
3.3   Complex Splitting Patterns

In some cases it is not the shape of the iteration space that prevents blocking but a dependence pattern. Consider the following kernel from seismic code, shown after the outer loop has been strip-mined by IS.

      DO 10 I = 1,N,IS
        DO 10 II = I,I+IS-1
          T(II) = A(II)
          DO 10 K = II,N
10          A(K) = A(K) + T(II)

A value of A is read by A(II) and values of A are written by A(K). Standard dependence abstractions, such as distance vectors [Wol82], report a recurrence between the two references carried by the II-loop, and this recurrence prevents the K-loop from being interchanged with, or unrolled-and-jammed into, the II-loop. Analyzing the sections of A accessed by the two references, however, reveals that the recurrence exists over only part of the iteration space. Within one strip, the section of A accessed by A(II) goes from I to I+IS-1, while the section accessed by A(K) goes from I to N (Figure 2).

[Figure 2: Data space for A. Within one strip, A(II) accesses the section from I to I+IS-1; A(K) also accesses the disjoint section from I+IS to N.]

Therefore, the index set of K can be split at the point I+IS-1, creating one loop that iterates over the section where the recurrence exists and one loop that iterates over the section where the two references access disjoint locations. After index-set splitting, the kernel becomes

      DO 10 I = 1,N,IS
        DO 10 II = I,I+IS-1
          T(II) = A(II)
          DO 20 K = II,I+IS-1
20          A(K) = A(K) + T(II)
          DO 10 K = I+IS,N
10          A(K) = A(K) + T(II)

The second K-loop accesses locations of A that are disjoint from those accessed by A(II), so the recurrence no longer applies to it and unroll-and-jam can be performed to exploit the reuse of A(K) across the iterations of II. In timing measurements on an IBM RS/6000, splitting followed by unroll-and-jam improved the performance of kernels of this kind by roughly 20% to 75% [CCK90, Car92].

To determine the split point automatically, a compiler compares the sections of the array accessed by the source and sink of the transformation-preventing dependence. Figure 3 gives the procedure.

Figure 3: Procedure IndexSetSplit

  For each transformation-preventing dependence, repeat the following
  steps until failure or until a region is created that may be blocked:
    1. Calculate the sections of the array accessed by the source and
       the sink of the preventing dependence.
    2. Intersect the sections and compute their union.
    3. If the intersection is equal to the union, stop.
    4. Set the subscript expression equal to the boundary between the
       intersection and the union of the sections and solve the
       resulting equation for the common induction variable.
    5. Split the index set of the loop at this point, so that the new
       loops iterate over the disjoint sections separately.
    6. Repeat steps 1 through 5 for any remaining preventing
       dependences.

In the example above, the boundary between the source and sink sections occurs at the value I+IS-1; solving the equation K = I+IS-1 for the inner-loop induction variable gives the split point used above. The method requires a representation of the sections of arrays accessed by references and loops that is precise enough to relate section boundaries to induction-variable values; bounded regular sections [CK87, HK91] have been sufficient for the patterns we have encountered.
4   IF-Inspection

Sometimes the information needed to determine whether blocking is safe and profitable cannot be computed at compile time because control flow within the loop depends on run-time values. For these cases we propose a transformation called IF-inspection, which is related to IF-conversion [AK87]. The idea of IF-inspection is to inspect, at run time, the outcome of an inner-loop guard and to record the ranges of loop-index values for which the guarded code is executed. The transformed loop nest then iterates only over the recorded ranges, within which the guard is known to hold, so that blocking and unroll-and-jam can be applied to the loop body without replicating the guard.

To effect IF-inspection, inspection code is inserted around the guard. On the branch where the guarded statements are executed, the following code is inserted; K is the induction variable of the inspected loop, KC counts ranges and is initialized to 0, KLB holds the lower bound of each range, and FLAG, initialized to false, records whether an executed range is currently open.

      IF (.NOT. FLAG) THEN
        KC = KC + 1
        KLB(KC) = K
        FLAG = .TRUE.
      ENDIF

On the branch where the guarded statements are not executed, the following code is inserted to store the upper bound of the range.

      IF (FLAG) THEN
        KUB(KC) = K - 1
        FLAG = .FALSE.
      ENDIF

After the loop, a final test on FLAG closes the last range when it extends to the loop upper bound. The inspection must be completed before the transformed computation is executed, so the inspection code is distributed into a loop of its own and the guarded computation is placed in a new loop nest that iterates over the recorded ranges. For IF-inspection to preserve correctness, the guard must not depend on values computed by the statements it guards within the loop being transformed; this condition can be checked with dependence analysis.

Consider the following matrix multiply, in which a test on an element of B guards the innermost loop, a pattern related to special-case and sparse-matrix variants of the BLAS routine SGEMM [DDSvdV91].

      DO 20 J = 1,N
        DO 20 K = 1,N
          IF (B(K,J) .EQ. 1.0) GOTO 20
          DO 10 I = 1,N
10          C(I,J) = C(I,J) + A(I,K)*B(K,J)
20    CONTINUE

The guard prevents unroll-and-jam of the K-loop around the I-loop, since the guard would have to be replicated for each unrolled iteration and the copies could be guarded differently. Applying IF-inspection yields the code in Figure 4: the inspection loop records the ranges of K for which the guarded statement is executed, and the new loop nest (the KN-loop and the K-loop over KLB(KN) through KUB(KN)) performs the update only over those ranges, where unroll-and-jam can now be applied.

Figure 4: Matrix multiply after IF-inspection

      FLAG = .FALSE.
      DO 10 J = 1,N
        KC = 0
        DO 20 K = 1,N
          IF (B(K,J) .NE. 1.0) THEN
            IF (.NOT. FLAG) THEN
              KC = KC + 1
              KLB(KC) = K
              FLAG = .TRUE.
            ENDIF
          ELSE
            IF (FLAG) THEN
              KUB(KC) = K - 1
              FLAG = .FALSE.
            ENDIF
          ENDIF
20      CONTINUE
        IF (FLAG) THEN
          KUB(KC) = N
          FLAG = .FALSE.
        ENDIF
        DO 10 KN = 1,KC
          DO 10 K = KLB(KN),KUB(KN)
            DO 10 I = 1,N
10            C(I,J) = C(I,J) + A(I,K)*B(K,J)

To show the effectiveness of IF-inspection, we ran the original loop, a version with unroll-and-jam alone, and a version with unroll-and-jam after IF-inspection on an IBM RS/6000 540 using 300x300 arrays of REALs. In the table below, frequency is the percentage of iterations on which the guard test succeeds. Unroll-and-jam alone results in a slight performance degradation because of the replicated guard; unroll-and-jam after IF-inspection yields a speedup of about 1.45 to 1.48 over the original loop, because the run-time cost of the inspection is more than counteracted by the better performance of the unrolled loop over the inspected ranges.

      Frequency   Original   UJ      UJ + IF-inspection   Speedup
      10%         3.33s      3.84s   2.25s                1.48
      2.5%        3.08s      3.71s   2.13s                1.45
5   LAPACK

The goal of LAPACK is to replace LINPACK and EISPACK with a library that achieves better performance on machines with caches and other forms of memory hierarchy [DDSvdV91]. To do so, its designers have recast the principal linear algebra computations as block algorithms that operate on submatrices, achieving much better locality than the corresponding point algorithms. These block algorithms are coded by hand and specialized, through the choice of blocking factor and the use of tuned kernels, to each target machine.

In the context of our question, we examined whether a compiler, using the transformations described in Sections 3 and 4, can derive the block algorithms used in LAPACK from the corresponding point algorithms expressed in Fortran 77. We say an algorithm is blockable if a compiler can derive its best known block form from its machine-independent point form using only information obtainable by program analysis. We studied three computations: LU decomposition without pivoting, LU decomposition with partial pivoting, and QR decomposition. The access patterns in these computations are common to a wide class of dense linear algebra codes.

5.1   LU Decomposition

LU decomposition factors a matrix A into the product of two triangular matrices, A = LU, where L is unit lower triangular and U is upper triangular [Ste73]. The decomposition is obtained by Gaussian elimination: A is multiplied by a series of elementary lower triangular matrices Mk, each of which introduces zeros below the diagonal in the kth column, so that

      U = M(n-1) ... M(2) M(1) A      and      L = M(1)**(-1) M(2)**(-1) ... M(n-1)**(-1).

Using this formulation without pivoting, the point algorithm, with the K-loop strip-mined by KS in preparation for blocking, is shown below. Statement 20 computes the multipliers (a column of L) and statement 10 applies the elimination step to the remainder of A.

      DO 10 K = 1,N-1,KS
        DO 10 KK = K,K+KS-1
          DO 20 I = KK+1,N
20          A(I,KK) = A(I,KK)/A(KK,KK)
          DO 10 J = KK+1,N
            DO 10 I = KK+1,N
10            A(I,J) = A(I,J) - A(I,KK)*A(KK,J)

The nest surrounding statement 10 accesses the entire remaining portion of A on every iteration of the KK-loop, so this code exhibits poor cache behavior on matrices larger than the cache. To improve locality, the block algorithm developed for LAPACK groups a number of updates together and applies them to a portion of the matrix all at once [DDSvdV91]. The key to deriving that form automatically is the observation, illustrated in Figure 5, that within one block of KK values the section of A accessed by statement 20 is a subset of the section accessed by statement 10, and that for columns J greater than or equal to K+KS the updates from all KK in the block can be deferred and performed together.

[Figure 5: Sections of A accessed in LU decomposition. Within one block, statement 20 touches columns K through K+KS-1 of the lower triangle, while statement 10 also touches columns K+KS through N, where the sections are disjoint.]

Index-set splitting of the J-loop at K+KS-1 creates one loop that iterates over the columns inside the block, where the recurrence with statement 20 exists, and one loop that iterates over the columns outside the block, where the sections accessed are disjoint:

      DO 10 KK = K,K+KS-1
        DO 10 J = K+KS,N
          DO 10 I = KK+1,N
10          A(I,J) = A(I,J) - A(I,KK)*A(KK,J)

This loop can then be distributed out of the KK-loop, and the KK-loop can be interchanged to the innermost position, where it carries the temporal reuse of A(I,KK) and A(KK,J). Applying the triangular strip-mine-and-interchange of Section 3.1 to the resulting trapezoidal bounds produces the block algorithm of Figure 6, which is essentially the block LU decomposition found in LAPACK.

Figure 6: Block LU decomposition

      DO 10 K = 1,N-1,KS
        DO 20 KK = K,MIN(K+KS-1,N-1)
          DO 30 I = KK+1,N
30          A(I,KK) = A(I,KK)/A(KK,KK)
          DO 20 J = KK+1,K+KS-1
            DO 20 I = KK+1,N
20            A(I,J) = A(I,J) - A(I,KK)*A(KK,J)
        DO 10 J = K+KS,N
          DO 10 I = K+1,N
            DO 10 KK = K,MIN(MIN(K+KS-1,N-1),I-1)
10            A(I,J) = A(I,J) - A(I,KK)*A(KK,J)

To measure the benefit, we ran an experiment on an IBM RS/6000 540 comparing the point algorithm ("1"), the point algorithm with unroll-and-jam and scalar replacement applied by hand ("1+"), the block algorithm derived by the transformations above ("2"), and the derived block algorithm with the same hand optimizations applied ("2+"). On large double-precision matrices the derived block versions exhibit much better cache behavior than the point versions, and their performance is competitive with the hand-coded block algorithm. Not only does the derived block LU have better data locality, it also retains loop-level parallelism, since the deferred updates to distinct columns can be performed in parallel.
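The derivation can be checked numerically. The following is a minimal driver (not from the paper, and no substitute for the measurements reported above) that runs the point algorithm and the block algorithm of Figure 6 on the same diagonally dominant matrix, so that no pivoting is needed, and compares the in-place factors. The matrix contents, N and KS are arbitrary; N is chosen so that the last block does not run past column N.

C     Hypothetical check: point LU versus the block LU of Figure 6.
      PROGRAM LUCHK
      INTEGER N, KS, I, J, K, KK
      PARAMETER (N = 64, KS = 8)
      DOUBLE PRECISION A(N,N), B(N,N), DIFF
      DO 5 J = 1,N
        DO 5 I = 1,N
          A(I,J) = 1.0D0/DBLE(I+J)
          IF (I .EQ. J) A(I,J) = A(I,J) + DBLE(N)
5         B(I,J) = A(I,J)
C     point LU (Gaussian elimination without pivoting)
      DO 10 K = 1,N-1
        DO 20 I = K+1,N
20        A(I,K) = A(I,K)/A(K,K)
        DO 10 J = K+1,N
          DO 10 I = K+1,N
10          A(I,J) = A(I,J) - A(I,K)*A(K,J)
C     block LU as in Figure 6
      DO 40 K = 1,N-1,KS
        DO 50 KK = K,MIN(K+KS-1,N-1)
          DO 60 I = KK+1,N
60          B(I,KK) = B(I,KK)/B(KK,KK)
          DO 50 J = KK+1,K+KS-1
            DO 50 I = KK+1,N
50            B(I,J) = B(I,J) - B(I,KK)*B(KK,J)
        DO 40 J = K+KS,N
          DO 40 I = K+1,N
            DO 40 KK = K,MIN(MIN(K+KS-1,N-1),I-1)
40            B(I,J) = B(I,J) - B(I,KK)*B(KK,J)
      DIFF = 0.0D0
      DO 70 J = 1,N
        DO 70 I = 1,N
70        DIFF = MAX(DIFF, ABS(A(I,J)-B(I,J)))
      WRITE(*,*) 'max difference between point and block LU = ', DIFF
      END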
5.2   LU Decomposition with Partial Pivoting

LU decomposition without pivoting is numerically unstable unless the matrix has special properties, so in practice LU decomposition with partial pivoting is used: at each step, the element of largest magnitude in the current column is found and its row is interchanged with the pivot row before the elimination step is performed. The point algorithm is shown in Figure 7.

Figure 7: LU decomposition with partial pivoting

      DO 10 K = 1,N-1
C
C       ... pick the pivot: set IMAX to the row of the element of
C           largest magnitude in column K on or below the diagonal ...
C
        DO 30 J = 1,N
          TAU = A(K,J)
25        A(K,J) = A(IMAX,J)
30        A(IMAX,J) = TAU
        DO 20 I = K+1,N
20        A(I,K) = A(I,K)/A(K,K)
        DO 10 J = K+1,N
          DO 10 I = K+1,N
10          A(I,J) = A(I,J) - A(I,K)*A(K,J)

A block version of this algorithm, with performance similar to block LU without pivoting, has been developed by Sorensen and his colleagues and is used in LAPACK [DDSvdV91]. The same cannot be said for the ability of a compiler to discover it. After strip-mining the K-loop and applying IndexSetSplit as in the previous section, a recurrence remains that index-set splitting cannot remove. The row-interchange statements (statement 25 and statement 30) access whole rows of A, at positions that depend on run-time values, while statement 10 updates whole columns. Deferring the updates of the columns outside the block past the row interchanges of later elimination steps reverses the order of the true dependence and antidependence between statements 25 and 10, so standard dependence-based rules declare the transformation unsafe. By our definition, then, LU decomposition with partial pivoting is not blockable from dependence information alone.

The block algorithm of Figure 8 is nevertheless correct. The key to understanding why is that the block version does not preserve the sequence of values assumed by each memory location: the intermediate values differ from those of the point version, which is exactly why dependence-preserving transformations cannot derive it. It is correct because a whole-row interchange and a deferred whole-column update are mathematically commutative: the multipliers in the pivot column travel with their rows, so applying the updates after the interchanges yields the same final values. A compiler could be equipped with this knowledge, for example by pattern matching whole-row permutations and whole-column updates and recording that the two operations commute; with that addition, the transformations of the previous section derive the block algorithm. Without it, the block version must be written by hand.

Figure 8: Block LU decomposition with partial pivoting

      DO 10 K = 1,N-1,KS
        DO 20 KK = K,MIN(K+KS-1,N-1)
C
C         ... pick the pivot for column KK and interchange rows KK
C             and IMAX across all columns ...
C
          DO 30 J = 1,N
            TAU = A(KK,J)
            A(KK,J) = A(IMAX,J)
30          A(IMAX,J) = TAU
          DO 25 I = KK+1,N
25          A(I,KK) = A(I,KK)/A(KK,KK)
          DO 20 J = KK+1,K+KS-1
            DO 20 I = KK+1,N
20            A(I,J) = A(I,J) - A(I,KK)*A(KK,J)
        DO 10 J = K+KS,N
          DO 10 I = K+1,N
            DO 10 KK = K,MIN(MIN(K+KS-1,N-1),I-1)
10            A(I,J) = A(I,J) - A(I,KK)*A(KK,J)

In an experiment on an IBM RS/6000 540 with 300x300 double-precision matrices, the block algorithm with unroll-and-jam and scalar replacement applied by hand performed better than the corresponding point version, with improvements similar to those obtained for LU decomposition without pivoting.
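The commutativity argument can be checked on a small example. The following is a minimal sketch (not from the paper): for rows below the pivot row, an elimination update of a column followed by a row interchange yields the same column as the interchange followed by the update, because the multipliers in the pivot column travel with their rows. The matrix contents and the indices K, R1, R2 and J are arbitrary.

C     Hypothetical demonstration that a column update and a row
C     interchange of rows below the pivot row commute.
      PROGRAM COMMUT
      INTEGER N, K, R1, R2, J, I, JJ
      PARAMETER (N = 5, K = 1, R1 = 2, R2 = 4, J = 3)
      DOUBLE PRECISION A(N,N), B(N,N), T, DIFF
      DO 5 JJ = 1,N
        DO 5 I = 1,N
          A(I,JJ) = DBLE(I*JJ + I + 2*JJ)
5         B(I,JJ) = A(I,JJ)
C     order 1: update column J of A at step K, then swap rows R1, R2
      DO 10 I = K+1,N
10      A(I,J) = A(I,J) - (A(I,K)/A(K,K))*A(K,J)
      DO 20 JJ = 1,N
        T = A(R1,JJ)
        A(R1,JJ) = A(R2,JJ)
20      A(R2,JJ) = T
C     order 2: swap rows R1, R2 of B first, then update column J
      DO 30 JJ = 1,N
        T = B(R1,JJ)
        B(R1,JJ) = B(R2,JJ)
30      B(R2,JJ) = T
      DO 40 I = K+1,N
40      B(I,J) = B(I,J) - (B(I,K)/B(K,K))*B(K,J)
      DIFF = 0.0D0
      DO 50 I = 1,N
50      DIFF = MAX(DIFF, ABS(A(I,J) - B(I,J)))
      WRITE(*,*) 'difference between the two orders = ', DIFF
      END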
5.3   QR Decomposition

QR decomposition is used to solve systems of linear equations and least-squares problems involving matrices that may not have the properties required for Gaussian elimination [Ste73]. If A has linearly independent columns, then A can be written in the form A = QR, where Q has orthonormal columns (QQ**T = I) and R is upper triangular with positive diagonal elements. One point algorithm applies a sequence of elementary reflectors, or Householder transformations, of the form I - 2vv**T with v**T v = 1. Each reflector is chosen so that applying it eliminates the elements below the diagonal in the kth column:

      A(k+1) = (I - 2*v(k)*v(k)**T) A(k),      k = 1, ..., n-1.

For a more detailed discussion of the computation of the v(k), see Stewart [Ste73]. (An alternative point algorithm based on Givens rotations is treated in the next subsection.)

The block Householder algorithm used in LAPACK does not apply the reflectors one column at a time. Instead, it aggregates several reflectors and applies them together. For a block size of 2, the first step is to factor the first block of columns,

      (A11)         (R11)
      (   )  =  Q * (   )
      (A21)         ( 0 )

where

      Q = (I - 2*v1*v1**T)(I - 2*v2*v2**T).

The product of reflectors can itself be expressed in an aggregated form,

      (I - 2*v1*v1**T)(I - 2*v2*v2**T)
            = I - 2*(v1 v2) * ( 1   -2*v1**T*v2 ) * (v1**T)
                              ( 0        1      )   (v2**T)

so that the update of the rest of the matrix is performed with matrix-matrix operations on the aggregate, which have far better cache behavior than a sequence of rank-one updates.

The difficulty for the compiler comes from the computation of this aggregated form of Q. The block algorithm involves additional computation, the formation of the small matrix that combines the v(k) factors, that does not exist anywhere in the original point algorithm. A compiler restricted to reordering the computation that is already expressed cannot create computation that is not there, so we know of no way for current compiler technology to derive the block Householder algorithm from the point algorithm. By our definition, QR decomposition with Householder transformations is not blockable. In Section 6 we address language extensions that allow such block algorithms to be expressed in a machine-independent form.
5.4   QR Decomposition with Givens Rotations

Although the block Householder algorithm cannot be derived automatically, compiler technology can still substantially improve the memory behavior of other point QR algorithms, and index-set splitting and IF-inspection have applicability beyond blocking. Consider QR decomposition using Givens rotations, shown in Figure 9 [Sew90]. Each rotation, defined by the values C and S, operates on a pair of rows to introduce a zero below the diagonal.

Figure 9: QR decomposition with Givens rotations

      DO 10 L = 1,N
        DO 10 J = L+1,N
          IF (A(J,L) .EQ. 0.0) GOTO 10
          DEN = DSQRT(A(L,L)*A(L,L) + A(J,L)*A(J,L))
          C = A(L,L)/DEN
          S = A(J,L)/DEN
          DO 10 K = L,N
            A1 = A(L,K)
            A2 = A(J,K)
            A(L,K) = C*A1 + S*A2
            A(J,K) = -S*A1 + C*A2
10    CONTINUE

In the innermost K-loop the references to A walk along rows L and J, so successive references are separated by a long stride and the loop exhibits poor cache performance. To obtain stride-one access, the K-loop and the J-loop must be interchanged so that the J-loop, which runs down a column, is innermost. Three problems stand in the way. First, the scalars C and S are redefined on every iteration of the J-loop; scalar expansion converts them into the arrays C(J) and S(J) [KKP+81]. Second, the statements that compute the rotation and update column L form a recurrence with respect to the J-loop, involving an antidependence and a true dependence on A(L,L); loop distribution separates this computation, which must remain in its own J-loop, from the update of the remaining columns. Examining the sections accessed shows that the remaining update touches only columns K greater than L, where the references to A(L,K) and A(J,K) do not conflict with the column-L computation, so the distribution is safe. Third, the IF ... GOTO guard prevents the interchange and would otherwise have to be replicated; IF-inspection records the ranges of J for which a rotation is actually applied. The result of scalar expansion, loop distribution, IF-inspection and loop interchange is shown in Figure 10.

Figure 10: Optimized Givens QR decomposition

      DO 10 L = 1,N
        JC = 0
        FLAG = .FALSE.
        DO 20 J = L+1,N
          IF (A(J,L) .EQ. 0.0) GOTO 20
          DEN = DSQRT(A(L,L)*A(L,L) + A(J,L)*A(J,L))
          C(J) = A(L,L)/DEN
          S(J) = A(J,L)/DEN
          A1 = A(L,L)
          A2 = A(J,L)
          A(L,L) = C(J)*A1 + S(J)*A2
          A(J,L) = -S(J)*A1 + C(J)*A2
C         ... IF-inspection code, structured as in Figure 4, recording
C             the ranges JLB(1:JC), JUB(1:JC) of applied rotations ...
20      CONTINUE
        DO 10 K = L+1,N
          DO 10 JN = 1,JC
            DO 10 J = JLB(JN),JUB(JN)
              A1 = A(L,K)
              A2 = A(J,K)
              A(L,K) = C(J)*A1 + S(J)*A2
10            A(J,K) = -S(J)*A1 + C(J)*A2

To measure the improvement, we compared the original and optimized codes on an IBM RS/6000 540 using double-precision matrices. The results are shown below.

      Size      Original   Optimized   Speedup
      300x300   6.86s      3.37s       2.04
      500x500   84.0s      15.3s       5.49
6   Language Extensions

To adapt LAPACK to a new machine, its installers must currently supply machine-specific information, most notably the blocking factor, for each subroutine, and the block structure of each algorithm is fixed in the source code. If the machine-dependent details could instead be supplied by the compiler, the library could be written once in a machine-independent form and ported by recompilation, and algorithms that a compiler cannot block automatically, such as the block Householder QR decomposition, could still be expressed. To this end we present a preliminary proposal consisting of two constructs, BLOCK DO and IN.

BLOCK DO specifies a DO-loop that is to be blocked by the compiler; the blocking factor is not expressed by the programmer but is chosen by the compiler using machine-dependent information. IN specifies a loop that iterates over the current block of a named BLOCK DO loop; its bounds are therefore defined by the compiler's choice of blocking factor, and if they are not expressed they are assumed to start at the first value of the block. The intrinsic LAST(K) returns the last value of the block of K currently being executed. Figure 11 shows block LU decomposition written with these constructs. If the compiler chooses a blocking factor of 1, LAST(K) equals K and the code degenerates to the point algorithm, so correctness does not depend upon the choice of blocking factor.

Figure 11: Block LU decomposition in extended Fortran

      BLOCK DO K = 1,N-1
        IN K DO KK = K,LAST(K)
          DO I = KK+1,N
            A(I,KK) = A(I,KK)/A(KK,KK)
          ENDDO
          DO J = KK+1,LAST(K)
            DO I = KK+1,N
              A(I,J) = A(I,J) - A(I,KK)*A(KK,J)
            ENDDO
          ENDDO
        ENDDO
        DO J = LAST(K)+1,N
          DO I = K+1,N
            IN K DO KK = K,MIN(LAST(K),I-1)
              A(I,J) = A(I,J) - A(I,KK)*A(KK,J)
            ENDDO
          ENDDO
        ENDDO
      ENDDO

The principal advantage of this approach is that block algorithms can be coded in a machine-independent, source-level form, leaving the machine-dependent details, namely the choice of blocking factor, to the compiler. By doing so, the machine dependency of a library such as LAPACK would be removed while its carefully designed block structure, and hence its performance, would be retained; porting the library to a new machine would then require only recompilation.
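For concreteness, the following is a minimal sketch (not from the paper) of the code a compiler might generate from Figure 11 after choosing a concrete blocking factor KS: each IN K loop becomes a loop over the current block of K values, and LAST(K) becomes MIN(K+KS-1,N-1). With KS equal to 1 it reduces to the point algorithm, as noted above.

C     Hypothetical expansion of Figure 11 for a compiler-chosen KS.
      DO 10 K = 1,N-1,KS
        DO 20 KK = K,MIN(K+KS-1,N-1)
          DO 30 I = KK+1,N
30          A(I,KK) = A(I,KK)/A(KK,KK)
          DO 20 J = KK+1,MIN(K+KS-1,N-1)
            DO 20 I = KK+1,N
20            A(I,J) = A(I,J) - A(I,KK)*A(KK,J)
        DO 10 J = MIN(K+KS-1,N-1)+1,N
          DO 10 I = K+1,N
            DO 10 KK = K,MIN(MIN(K+KS-1,N-1),I-1)
10            A(I,J) = A(I,J) - A(I,KK)*A(KK,J)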
7   Previous Work

Wolfe [Wol86, Wol87, Wol89] discusses the application of iteration-space tiling, of which strip-mine-and-interchange is an instance, to many non-rectangular-shaped iteration spaces, and he describes an example in which a trapezoidal region is blocked using a technique he calls index-set splitting. However, he presents neither a general algorithm for determining where to split nor an extension of the technique to unroll-and-jam or to the partial recurrences of Section 3.3. Irigoin and Triolet describe a blocking method, supernode partitioning, based on a dependence abstraction called the dependence cone [IT88]. Wolf and Lam [WL91] present a framework for blocking and ordering transformations driven by an estimate of data locality. Unfortunately, none of these methods handles the problems that arise when the loops to be blocked are not perfectly nested, as is the case for most of the linear algebra codes we examined, nor do they make use of knowledge, such as the commutativity of operations, that cannot be obtained from dependence analysis. Our work extends these methods with index-set splitting, IF-inspection, and a study of how far such compiler technology can take real library codes.
8   Summary

We have set out to determine whether a compiler can automatically restructure numerical programs well enough that programmers can write them in a machine-independent form and still obtain good memory-hierarchy performance. To that end, we examined a collection of linear algebra codes, drawn from the computations blocked by hand in LAPACK, and asked whether the block version of each could be derived by a compiler from the corresponding machine-independent point version.

The results of this study are encouraging. A significant amount of the work can be done with known dependence-based transformations. For the cases those transformations cannot handle, we have introduced index-set splitting, a general technique for blocking loops over triangular, trapezoidal and rhomboidal iteration spaces and loops containing partial recurrences, and IF-inspection, a run-time technique for loops whose blocking is prevented by data-dependent control flow. With these techniques a compiler can block LU decomposition without pivoting, and the resulting code is competitive with the hand-blocked version. In addition, we have discovered that knowledge about the commutativity of certain operations, knowledge that cannot be obtained from dependence analysis, would allow a compiler to block LU decomposition with partial pivoting as well.

Unfortunately, success is not universal. The block QR decomposition has no corresponding point analogue from which it can be derived, because it performs additional computation, and block sizes larger than those needed for LU are required to compensate for that extra work. If automatic blocking is to succeed on such codes, the programmer must be able to express the block algorithm itself in a machine-independent form; the language extensions of Section 6 are a step toward making that possible.

Our goal has been to find compiler techniques and language support that free scientists from machine-specific programming without giving up memory-hierarchy performance. We have demonstrated that, for an important class of linear algebra computations, such techniques exist and are implementable. In the future, we plan to implement IndexSetSplit and IF-inspection fully in our experimental compiler and to apply them to a larger collection of scientific codes in order to determine their breadth of coverage. We will also continue to investigate language extensions for expressing block algorithms. With these results established, we expect that increasingly sophisticated compilers will make it possible for programs to remain machine independent while achieving good performance on machines with complex memory hierarchies.
Acknowledgments

Preston Briggs, Kathryn McKinley, Rebecca and Uli Kremer made many helpful suggestions during the preparation of this document. Danny Sorensen provided us with guidance toward understanding the LAPACK block algorithms. To all of these people go our heartfelt thanks.

References

[AK87] J. R. Allen and K. Kennedy. Automatic translation of Fortran programs to vector form. ACM Transactions on Programming Languages and Systems, 9(4):491-542, October 1987.

[Car92] S. Carr. Memory-Hierarchy Management. PhD thesis, Department of Computer Science, Rice University, 1992.

[CCK90] D. Callahan, S. Carr, and K. Kennedy. Improving register allocation for subscripted variables. In Proceedings of the SIGPLAN '90 Conference on Programming Language Design and Implementation, White Plains, NY, June 1990.

[CK87] D. Callahan and K. Kennedy. Analysis of interprocedural side effects in a parallel programming environment. In Proceedings of the First International Conference on Supercomputing, Athens, Greece, June 1987. Springer-Verlag.

[CK89] S. Carr and K. Kennedy. Blocking linear algebra codes for memory hierarchies. In Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing, Chicago, IL, December 1989.

[DDSvdV91] J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst. Solving Linear Systems on Vector and Shared-Memory Computers. SIAM, Philadelphia, PA, 1991.

[HK91] P. Havlak and K. Kennedy. An implementation of interprocedural bounded regular section analysis. IEEE Transactions on Parallel and Distributed Systems, 2(3):350-360, July 1991.

[IT88] F. Irigoin and R. Triolet. Supernode partitioning. In Conference Record of the Fifteenth ACM Symposium on the Principles of Programming Languages, pages 319-328, January 1988.

[KKP+81] D. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. Wolfe. Dependence graphs and compiler optimizations. In Conference Record of the Eighth ACM Symposium on the Principles of Programming Languages, January 1981.

[Kuc78] D. Kuck. The Structure of Computers and Computations, Volume 1. John Wiley and Sons, New York, NY, 1978.

[LRW91] M. S. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991.

[Por89] A. K. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Applications. PhD thesis, Rice University, May 1989.

[Sew90] G. Sewell. Computational Methods of Linear Algebra. Ellis Horwood, England, 1990.

[Ste73] G. W. Stewart. Introduction to Matrix Computations. Academic Press, New York, NY, 1973.

[WL91] M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, June 1991.

[Wol82] M. Wolfe. Optimizing Supercompilers for Supercomputers. PhD thesis, University of Illinois, October 1982.

[Wol86] M. Wolfe. Advanced loop interchange. In Proceedings of the 1986 International Conference on Parallel Processing, August 1986.

[Wol87] M. Wolfe. Iteration space tiling for memory hierarchies. In Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing, December 1987.

[Wol89] M. Wolfe. More iteration space tiling. In Proceedings of the Supercomputing '89 Conference, 1989.