Limits of Instruction-Level Parallelism

David W. Wall

Digital Equipment Corporation, Western Research Laboratory
Abstract

Growing interest in ambitious multiple-issue machines and heavily-pipelined machines requires a careful examination of how much instruction-level parallelism exists in typical programs. Such an examination is complicated by the wide variety of hardware and software techniques for increasing the parallelism that can be exploited, including branch prediction, register renaming, and alias analysis. By performing simulations based on instruction traces, we can model techniques at the limits of feasibility and even beyond. Our study shows a striking difference between assuming that the techniques we use are perfect and merely assuming that they are impossibly good. Even with impossibly good techniques, average parallelism rarely exceeds 7, with 5 more common.
1. Introduction
There is growing interest in machines that exploit, usually with compiler assistance, the parallelism that programs have at the instruction level. Figure 1 shows an example of this parallelism. The code fragment in 1(a) consists of three instructions that can be executed at the same time, because they do not depend on each other's results. The code fragment in 1(b) does have dependencies, and so cannot be executed in parallel. In each case, the parallelism is the number of instructions divided by the number of cycles required.

    r1 := 0[r9]          r1 := 0[r9]
    r2 := 17             r2 := r1 + 17
    4[r3] := r6          4[r2] := r6

    (a) parallelism=3    (b) parallelism=1

    Figure 1. Instruction-level parallelism (and lack thereof).

Architectures to take advantage of this parallelism have been proposed. A superscalar machine [1] is one that can issue multiple independent instructions in the same cycle. A superpipelined machine [7] issues one instruction per cycle, but the cycle time is set much less than the typical instruction latency. A VLIW machine [11] is like a superscalar machine, except the parallel instructions must be explicitly packed by the compiler into very long instruction words.

But how much parallelism is there to exploit? Popular wisdom, supported by a few studies [7,13,14], suggests that parallelism within a basic block rarely exceeds 3 or 4 on the average. Peak parallelism can be higher, especially for some kinds of numeric programs, but the payoff of high peak parallelism is low if the average is still small.

Many machines already have some degree of pipelining, as reflected in operations with latencies of multiple cycles. We can compute the degree of pipelining by multiplying the latency of each operation by its dynamic frequency in typical programs. For the DECStation 5000, load latencies, delayed branches, and floating-point latencies give the machine a degree of pipelining equal to about 1.5. Adding a superscalar capability to a machine with some pipelining is beneficial only if there is more parallelism available than the pipelining already exploits.
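The computation itself is just a latency-weighted average over the dynamic instruction mix. The sketch below illustrates it; the operation classes, latencies, and frequencies are invented placeholders rather than measurements from this study.

    # Degree of pipelining = sum over operation classes of (latency * dynamic frequency).
    # The latencies and frequencies here are illustrative assumptions only.
    ops = {
        # class: (latency in cycles, fraction of dynamic instructions)
        "load":   (2, 0.20),
        "branch": (2, 0.15),   # a delayed branch occupies the slot after it
        "fp":     (3, 0.05),
        "other":  (1, 0.60),
    }
    degree = sum(latency * freq for latency, freq in ops.values())
    print("degree of pipelining = %.2f" % degree)   # about 1.5 with these assumed numbers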
To increase the instruction-level parallelism that the hardware can exploit, people have explored a variety of techniques. These fall roughly into two categories. One category includes techniques for increasing the parallelism within a basic block; the other includes techniques for using parallelism across several basic blocks. These techniques often interact in a way that has not been adequately explored. We would like to bound the effectiveness of a technique whether it is used in combination with impossibly good companion techniques or with none. A general approach is therefore needed. In this paper, we describe our use of trace-driven simulation to study the instruction-level parallelism of programs. We will begin with a survey of ambitious techniques for increasing the exploitable parallelism.

1.1. Increasing parallelism within a basic block.

Parallelism within a basic block is limited by dependencies between pairs of instructions. Some of these dependencies are real, reflecting the flow of data in the program. Others are false dependencies, accidents of the code generation or results of our lack of precise knowledge about the flow of data.

In the code sequence

    r1 := 0[r9]
    r2 := r1 + 1
    r1 := ...

we must do the second and third instructions in that order, because the third changes the value of r1. However, if the compiler had used r3 instead of r1 in the third instruction, these two instructions would be independent.

Allocating registers assuming a traditional architecture can lead to a false dependency on a register. A smart compiler might pay attention to its allocation of registers, so as to maximize the opportunities for parallelism. Current compilers often do not, preferring instead to reuse registers as often as possible so that the number of registers needed is minimized.

An alternative is the hardware solution of register renaming, in which the hardware imposes a level of indirection between the register number appearing in the instruction and the actual register used. Each time an instruction sets a register, the hardware selects an actual register to use for as long as that value is needed. In a sense the hardware does the register allocation dynamically, which can give better results than the compiler's static allocation, even if the compiler did it as well as it could. In addition, register renaming allows the hardware to include more registers than will fit in the instruction format, further reducing false dependencies. Unfortunately, register renaming can also lengthen the machine pipeline, thereby increasing the branch penalties of the machine, but we are concerned here only with its effects on parallelism.
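A minimal sketch of the renaming idea, assuming an unbounded pool of physical registers (the helper below is illustrative, not the mechanism of our simulator): every instruction's destination gets a fresh physical register, and sources are mapped through the current table, so reusing an architectural register no longer creates a false dependency.

    # Register renaming with an unbounded physical register pool (a sketch).
    # Each instruction is (dest, [sources]); names like "r1" are architectural registers.
    def rename(instructions):
        mapping = {}              # architectural register -> current physical register
        next_phys = 0
        renamed = []
        for dest, srcs in instructions:
            # Sources read whichever physical register currently holds the value.
            phys_srcs = [mapping.get(s, s) for s in srcs]
            # The destination gets a fresh physical register, so a later reuse of
            # the same architectural name does not conflict with earlier uses.
            mapping[dest] = "p%d" % next_phys
            next_phys += 1
            renamed.append((mapping[dest], phys_srcs))
        return renamed

    # The reuse of r1 in the three-instruction example above forces sequential
    # execution; after renaming, the first and third instructions are independent.
    print(rename([("r1", ["r9"]), ("r2", ["r1"]), ("r1", ["r6"])]))

The finite renaming used in the experiments (Section 3.1) draws instead from a fixed pool, recycling the least recently used register.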
False dependencies can also involve memory. For example, in the code sequence

    r1 := 0[r9]
    4[r16] := r3

we may have no way of knowing whether the memory locations referenced in the load and store are the same. If they are the same, then there is a dependency between these two instructions: we cannot store a new value until we have fetched the old. If they are different, there is no dependency. Alias analysis can help a compiler decide when two memory references are independent, but even good alias analysis is imprecise; sometimes we must assume the worst. Hardware can resolve the question at run-time by determining the actual addresses referenced, but this may be too late to affect parallelism decisions. If the compiler or hardware are not sure the locations are different, we must assume conservatively that they are the same.

We assume that memory locations have meaning to the programmer that registers do not, and hence that hardware renaming of memory locations is not desirable. However, we may still have to make conservative assumptions that lead to false dependencies on memory.
1.2. Crossing block boundaries.

The number of instructions between branches is usually quite small, often averaging less than 6. If we want large parallelism, we must be able to issue instructions from different basic blocks in parallel. But this means we must know in advance whether a conditional branch will be taken, or else we must cope with the possibility that we do not know.

Branch prediction is often used to keep a pipeline full: while we are executing the branch and the instructions before it, we fetch and decode the instructions after the branch. To use branch prediction to increase parallel execution, we must be able to execute instructions across an unknown branch speculatively. This may involve maintaining shadow registers, whose values are not committed until we are sure we have correctly predicted the branch. It may involve being selective about the instructions we choose: we may not be willing to execute memory stores speculatively, for example. Some of this may be put partly under compiler control by designing an instruction set with explicitly squashable instructions. Each squashable instruction would be tied explicitly to a condition evaluated in another instruction, and would be squashed by the hardware if the condition turns out to be false.

Branch prediction can be a hardware technique. In the scheme we used [9,12], the branch predictor maintains a table of two-bit entries. Low-order bits of a branch's address provide the index into this table. Taking a branch causes us to increment its table entry; not taking it causes us to decrement it. We do not wrap around when the table entry reaches its maximum or minimum. We predict that a branch will be taken if its table entry is 2 or 3. This two-bit prediction scheme mispredicts a typical loop only once, when it is exited. A good initial value for table entries is 2, just barely predicting that each branch will be taken. Two branches that map to the same table entry interfere with each other; no "key" identifies the owner of the entry.
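The two-bit scheme translates directly into a few lines of code. The sketch below follows the description above (saturating counters indexed by low-order address bits, initialized to 2, predicting taken when the entry is 2 or 3); the table size and the use of a simple modulus as the index are assumptions made for illustration.

    # Two-bit saturating-counter branch prediction, as described above.
    class TwoBitPredictor:
        def __init__(self, entries=2048):
            self.entries = entries
            self.table = [2] * entries    # initial value 2: barely predicting "taken"

        def _index(self, branch_address):
            # Low-order bits of the branch address select the entry; two branches
            # that map to the same entry interfere with each other.
            return branch_address % self.entries

        def predict_taken(self, branch_address):
            return self.table[self._index(branch_address)] >= 2    # entry is 2 or 3

        def update(self, branch_address, taken):
            i = self._index(branch_address)
            if taken:
                self.table[i] = min(3, self.table[i] + 1)   # saturate; no wraparound
            else:
                self.table[i] = max(0, self.table[i] - 1)

    # A loop branch taken nine times and then not taken is mispredicted only at exit.
    p = TwoBitPredictor()
    correct = 0
    for taken in [True] * 9 + [False]:
        correct += (p.predict_taken(0x400) == taken)
        p.update(0x400, taken)
    print(correct, "of 10 predictions correct")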
Many architectures have two or three kinds of instructions to change the flow of control. Branches are conditional and have a destination some specified offset from the PC. Jumps are unconditional, and may be either direct or indirect. A direct jump is one whose destination is given explicitly in the instruction, while an indirect jump is one whose destination is expressed as an address computation involving a register. In principle we can know the destination of a direct jump well in advance. The same is true of a branch, assuming we know how its condition will turn out. The destination of an indirect jump, however, may require us to wait until the address computation is possible. Little work has been done on predicting the destinations of indirect jumps, but it might pay off in instruction-level parallelism. This paper considers a very simple (and, it turns out, fairly accurate) jump prediction scheme. A table of destination addresses is maintained. The address of a jump instruction provides the index into this table. Whenever we execute an indirect jump, we put its destination address in the table entry for that jump. We predict that an indirect jump will be to the address in its table entry. As with branch prediction, we do not prevent two jumps from mapping to the same table entry and interfering with each other.
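This "predict the last destination" table is even simpler than the branch table. A rough sketch, with the same caveats as before (the table size is an assumption, and two jumps that share an entry interfere):

    # Indirect-jump prediction: remember where each jump went last time.
    class JumpPredictor:
        def __init__(self, entries=2048):
            self.entries = entries
            self.table = [None] * entries      # destination addresses, None until seen

        def predict(self, jump_address):
            return self.table[jump_address % self.entries]

        def update(self, jump_address, destination):
            # Whenever we execute an indirect jump, record where it actually went.
            self.table[jump_address % self.entries] = destination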
Rather than try to predict the destinations of branches, we might speculatively execute instructions along both possible paths, squashing the instructions on the wrong path when we know which it is. Some of our parallelism capability is guaranteed to be wasted, but we will never miss out by taking the wrong path. Unfortunately, branches happen quite often in normal code, so for large degrees of parallelism we may encounter another branch before we have resolved the previous one. A real architecture may have limits on the amount of fanout we can tolerate before we must assume that new branches are not explored in parallel.

Trace scheduling [4] was developed for VLIW machines, where compiler scheduling is needed to exploit the parallelism of the long instruction words. It uses a profile to find a trace (a sequence of blocks that are executed often), and schedules the instructions in these blocks as a whole. Trace scheduling predicts a branch statically, based on the profile. To cope with occasions when this prediction fails, code is inserted outside the sequence of blocks to correct the state of registers and memory whenever instructions enter or leave the sequence unexpectedly. This added code may itself be scheduled as part of a later and less heavily executed trace.

Loop unrolling is an old technique that can also increase parallelism. If we unroll a loop ten times, thereby removing 90% of its branches, we effectively increase the basic block size tenfold. This larger basic block may hold parallelism that had been unavailable because of the branches.

Software pipelining [8] is a compiler technique for moving instructions across branches to increase parallelism. It analyzes the dependencies in a loop body, looking for ways to increase parallelism by moving instructions from one iteration into a previous or later iteration. Thus the dependencies in one iteration can be stretched out across several, effectively executing several iterations in parallel without the code expansion of unrolling.
2. This and previous work.

To better understand this bewildering array of techniques, we have built a system for scheduling the instructions produced by an instruction trace. Our system allows us to assume various kinds of branch and jump prediction, alias analysis, and register renaming. In each case the option ranges from perfect, which could not be implemented in reality, to non-existent. It is important to consider the full range in order to bound the effectiveness of the various techniques. For example, it is useful to ask how well a realistic branch prediction scheme could work even with impossibly good alias analysis and register renaming.

This is in contrast to the 1989 study of Jouppi and Wall [7]. Since they worked by compiler scheduling of the executable program rather than dynamic scheduling of instruction traces, they did only static scheduling within basic blocks, and did not consider more ambitious techniques.

The methodology of our paper is more like that of Tjaden and Flynn [14], which also scheduled instructions from a dynamic trace. Like Jouppi and Wall, however, Tjaden and Flynn did not move instructions across branches. Their results were similar to those of Jouppi and Wall, with parallelism rarely above 3, even though the two studies assumed quite different architectures.

Nicolau and Fisher's study [11] of the parallelism available to very long instruction word architectures was more liberal, assuming perfect branch prediction and perfect alias analysis. However, they did not consider more realistic assumptions, arguing instead that they were interested primarily in programs for which realistic implementations would be close to perfect.

The study by Smith, Johnson, and Horowitz [13] was a realistic application of trace-driven simulation that assumed neither too restrictive nor too generous a model. They were interested, however, in validating a particular realistic machine design, one that could consistently exploit a parallelism of only 2. They did not explore the range of techniques discussed in this paper.
We believe our study can provide useful bounds on the behavior not only of hardware techniques like branch prediction and register renaming, but also of compiler techniques like software pipelining and trace scheduling.

Unfortunately, we could think of no good way to model loop unrolling. Register renaming can cause much of the computation in a loop to migrate backward toward the beginning of the loop, providing opportunities for parallelism much like those presented by unrolling. Much of the computation, however, like the repeated incrementing of the loop index, is inherently sequential. We address loop unrolling in an admittedly unsatisfying manner, by unrolling the loops of some numerical programs by hand and comparing the results to those of the normal versions.
3. Our experimental framework.

To explore the parallelism available in a particular program, we execute the program to produce a trace of the instructions executed. This trace also includes the data addresses referenced and the results of branches and jumps. A greedy algorithm packs these instructions into a sequence of pending cycles, so that no dependencies occur between instructions packed into the same cycle.

In packing instructions into cycles, we assume that as many as 64 instructions may be issued in the same cycle. We further assume no limits on replicated functional units or ports to registers or memory: all 64 instructions may be multiplies, or even loads. We assume that every operation has a latency of one cycle, so the result of an operation executed in cycle N can be used by an instruction executed in cycle N+1. This includes memory references: we assume there are no cache misses.

We pack the instructions from the trace into cycles as follows. For each instruction in the trace, we start at the end of the cycle sequence, representing the latest pending cycle, and move earlier in the sequence until we find a conflict with the new instruction. Whether a conflict exists depends on which model we are considering. If the conflict is a false dependency (in models allowing them), we assume that we can put the instruction in that cycle but no farther back. Otherwise we assume only that we can put the instruction in the next cycle after this one. If the correct cycle is full, we put the instruction in the next non-full cycle. If we cannot put the instruction in any pending cycle, we start a new pending cycle at the end of the sequence.

As we add more and more cycles, the sequence gets longer. We assume that hardware and software techniques will have some limit on how many instructions they will consider at once. When the total number of instructions in the sequence of pending cycles reaches this limit, we remove the first cycle from the sequence, whether it is full of instructions or not. This corresponds to retiring the cycle's instructions from the scheduler and passing them on to be executed. When we have exhausted the trace, we divide the number of instructions executed by the number of cycles created. The result is the parallelism.
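The packing procedure and the final division are compact enough to sketch directly. The version below makes the simplifications stated above (unit latencies, a 64-instruction cycle width, a fixed window limit) and uses plain register dependencies as a stand-in for whichever dependency model is being simulated; the special placement rule for false dependencies is omitted for brevity.

    # Greedy packing of a dynamic trace into pending cycles (a sketch).
    # Each instruction is (dest, [sources]).
    CYCLE_WIDTH = 64        # at most 64 instructions per cycle
    WINDOW = 2048           # instructions allowed in the pending cycles at once

    def conflicts(instr, cycle):
        dest, srcs = instr
        for d, s in cycle:
            if d in srcs or d == dest or dest in s:   # RAW, WAW, or WAR dependency
                return True
        return False

    def parallelism(trace):
        pending = []        # the sequence of pending cycles, oldest first
        pending_count = 0
        retired_cycles = 0
        for instr in trace:
            # Scan from the latest pending cycle backward until we find a conflict.
            place = 0
            for i in range(len(pending) - 1, -1, -1):
                if conflicts(instr, pending[i]):
                    place = i + 1       # the cycle just after the conflicting one
                    break
            # Use the first non-full cycle at or after that point, else a new cycle.
            while place < len(pending) and len(pending[place]) >= CYCLE_WIDTH:
                place += 1
            if place == len(pending):
                pending.append([])
            pending[place].append(instr)
            pending_count += 1
            # When the window fills, retire the oldest cycle, full or not.
            if pending_count >= WINDOW:
                pending_count -= len(pending.pop(0))
                retired_cycles += 1
        cycles = retired_cycles + len(pending)
        return len(trace) / cycles if cycles else 0.0

    # The independent fragment of Figure 1(a) packs into one cycle (parallelism 3);
    # the dependent fragment of 1(b) needs three cycles (parallelism 1).
    print(parallelism([("r1", ["r9"]), ("r2", []), ("r4", ["r6"])]))
    print(parallelism([("r1", ["r9"]), ("r2", ["r1"]), ("r4", ["r2"])]))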
3.1. Parameters.

We can do three kinds of register renaming: perfect, finite, and none. For perfect renaming, we assume that there are an infinite number of registers, so that no false register dependencies occur. For finite renaming, we assume a finite register set dynamically allocated using an LRU discipline: when we need a new register, we select the register whose most recent use (measured in cycles rather than in instruction count) is earliest. Finite renaming is normally done with 256 integer registers and 256 floating-point registers; it is also interesting to see what happens when we reduce this to 64 or even 32, the number on our base machine. To simulate no renaming, we simply use the registers specified in the code; this is of course highly dependent on the register strategy of the compiler we use.

We can assume several degrees of branch prediction. One extreme is perfect prediction: we assume that all branches are correctly predicted. Next we can assume a two-bit prediction scheme as described before. The two-bit scheme can be either infinite, with a table big enough that two different branches never have the same table entry, or finite, with a table of 2048 entries. To model trace scheduling, we can also assume static branch prediction based on a profile run; in this case we predict that a branch will always go the way that it goes most frequently. And finally, we can assume that no branch prediction occurs; this is the same as assuming that every branch is predicted wrong.

The same choices are available for jump prediction. We can assume that indirect jumps are perfectly predicted. We can assume infinite or finite hardware prediction as described above (predicting that a jump will go where it went last time). We can assume static prediction based on a profile. And we can assume no prediction. In any case we are concerned only with indirect jumps; we assume that direct jumps are always predicted correctly.

The effect of branch and jump prediction on scheduling is easy to state. Correctly predicted branches and jumps have no effect on scheduling (except for register dependencies involving their operands). Instructions on opposite sides of an incorrectly predicted branch or jump, however, always conflict. Another way to think of this is that the sequence of pending cycles is flushed whenever an incorrect prediction is made. Note that we generally assume no other penalty for failure. This assumption is optimistic; in most real architectures, a failed prediction causes a bubble in the pipeline, in which one or more cycles pass in which no execution can occur. We will return to this topic later.
We can also allow instructions to move past a certain limited number of incorrectly predicted branches. This corresponds to architectures that speculatively execute instructions from both possible paths, up to some fanout limit. None of the experiments described here involved this kind of fanout.

Four levels of alias analysis are available. We can assume perfect alias analysis, in which we look at the actual memory address referenced by a load or store; a store conflicts with a load or store only if they access the same location. We can also assume no alias analysis whatsoever, so that a store always conflicts with a load or store. Between these two extremes would be alias analysis as a smart vectorizing compiler might do it. We don't have such a compiler, but we have implemented two intermediate schemes that may give us some insight.
One intermediate scheme is alias analysis by instruction inspection. This is a common technique in compile-time instruction-level code schedulers. We look at the two instructions to see if it is obvious that they are independent; the two ways this might happen are shown in Figure 2.

    r1 := 0[r9]          r1 := 0[fp]
    4[r9] := r2          0[gp] := r2

    (a)                  (b)

    Figure 2. Alias analysis by inspection.

The two instructions in 2(a) cannot conflict, because they use the same base register but different displacements. The two instructions in 2(b) cannot conflict, because one is manifestly a reference to the stack and the other is manifestly a reference to the global data area.
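The two tests amount to a few comparisons on the base register and displacement of each reference. A rough sketch, assuming each reference is described by a (base register, displacement) pair and that the frame and global pointers mark stack and global references as in Figure 2:

    # Alias analysis "by inspection": two references obviously cannot conflict if
    # (a) they share a base register but have different displacements, or
    # (b) one is manifestly a stack reference and the other a global reference.
    STACK_BASES = {"sp", "fp"}
    GLOBAL_BASES = {"gp"}

    def region(base):
        if base in STACK_BASES:
            return "stack"
        if base in GLOBAL_BASES:
            return "global"
        return "unknown"

    def obviously_independent(ref_a, ref_b):
        (base_a, disp_a), (base_b, disp_b) = ref_a, ref_b
        if base_a == base_b and disp_a != disp_b:
            return True        # Figure 2(a): same base register, different displacement
        ra, rb = region(base_a), region(base_b)
        if "unknown" not in (ra, rb) and ra != rb:
            return True        # Figure 2(b): one stack reference, one global reference
        return False           # otherwise we must conservatively assume a conflict

    print(obviously_independent(("r9", 0), ("r9", 4)))     # True
    print(obviously_independent(("fp", 0), ("gp", 0)))     # True
    print(obviously_independent(("r9", 0), ("r16", 4)))    # False: assume the worst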
The other intermediate scheme is called alias analysis by compiler, even though our own compiler doesn't do it. Under this model, we assume perfect analysis of stack and global references, regardless of which registers are used to make them. A store to an address on the stack conflicts only with a load or store to the same address. Heap references, on the other hand, are resolved by instruction inspection.

The idea behind alias analysis by compiler is that references outside the heap can often be resolved by the compiler, by doing dataflow and dependency analysis over the program, possibly solving diophantine equations over loop indexes, whereas heap references are often less tractable. Neither of these assumptions is particularly defensible. Many languages allow pointers into the stack and global areas, rendering them as difficult as the heap. Practical considerations such as separate compilation may also keep us from analyzing non-heap references perfectly. On the other side, even heap references are not as hopeless as this model assumes [2,6]. Nevertheless, our range of four alternatives provides some intuition about the effects of alias analysis on instruction-level parallelism.

Finally, there is the window of pending cycles itself. By default, the maximum number of instructions that can appear in the pending cycles at any time is 2048. We can manage this window either discretely or continuously. With discrete windows, we fetch an entire window of instructions, schedule them into cycles, and then start fresh with a new window; a missed prediction also causes us to start over with a new window. With continuous windows, new instructions enter the window one at a time, and old cycles leave the window whenever the number of instructions reaches the window size. Continuous windows are the norm for the results described here, although to implement them in hardware is more difficult; Smith, Johnson, and Horowitz [13] assumed discrete windows.
3.2. Programs measured.

As test cases we used four toy benchmarks, seven real programs used at WRL, and six SPEC benchmarks, for a total of seventeen test programs; they are listed in Figure 3. The programs were compiled for a DECStation 5000, which has a MIPS R3000 processor; the Mips version 1.31 compilers were used. The SPEC benchmarks were run on an official data set, but this was usually the "short" test data rather than a full-size reference data set.

    program      lines    dynamic instructions   remarks
    Livermore      268          24479634         Livermore loops 1-14
    Whetstones     462          22294030         floating-point
    Linpack        814         174883597         linear algebra [3]
    Stanford      1019          20759516         Hennessy's suite [5]
    sed           1751           1447717         stream editor
    egrep          844          13910586         file search
    yacc          1856          30948883         compiler-compiler
    metronome     4287          70235508         timing verifier
    grr           5883         142980475         PCB router
    eco           2721          26702439         recursive tree comparison
    ccom         10142          18465797         C compiler front end
    gcc1         83000          22745232         pass 1 of GNU C compiler
    espresso     12000         135317102         boolean function minimizer
    li            7000        1247190509         Lisp interpreter
    fpppp         2600         244124171         quantum chemistry
    doduc         5200         284697827         hydrocode simulation
    tomcatv        180        1986257545         mesh generation

    Figure 3. The seventeen test programs.

DECStation is a trademark of Digital Equipment Corporation. R3000 is a trademark of MIPS Computer Systems, Inc.
4. Results.

We ran these test programs for a wide range of configurations. The results are tabulated in the appendix, but we will extract some of them here to show some interesting trends. To provide a framework for our exploration, we defined a series of five increasingly ambitious models spanning the possible range. These five are specified in Figure 4; the window size in each is 2K instructions. Many of the results we present show the effects of variations on these standard models. Note that even the Fair model is quite ambitious.

    model      branch predict   jump predict   reg renaming   alias analysis
    Stupid     none             none           none           none
    Fair       infinite         infinite       256            inspection
    Good       infinite         infinite       256            perfect
    Great      infinite         infinite       perfect        perfect
    Perfect    perfect          perfect        perfect        perfect

    Figure 4. Five increasingly ambitious models.

4.1. Branch prediction.

The success of the two-bit branch prediction scheme has been reported elsewhere [9,10]. Our results were comparable and are shown in Figure 5. It makes little difference whether we use an infinite table or one with only 2K entries, even though several of the programs are more than twenty thousand instructions long. Static branch prediction based on a profile does almost exactly as well as hardware prediction across all of the tests. It would be interesting to explore how small the hardware table can be before performance starts to degrade, and to explore how well static prediction does if the profile is from a non-identical run of the program. Static jump prediction does a bit better than our simple dynamic scheme.

                   branch prediction           jump prediction
    program      infinite  finite  static    infinite  finite  static
    Linpack        96%      96%     95%        99%      99%     96%
    Livermore      98%      98%     98%        19%      19%     77%
    Stanford       90%      90%     89%        69%      69%     71%
    Whetstones     90%      90%     92%        80%      80%     88%
    sed            97%      97%     97%        96%      96%     97%
    egrep          90%      90%     91%        98%      98%     98%
    yacc           95%      95%     92%        75%      75%     71%
    met            92%      92%     92%        77%      77%     65%
    grr            85%      84%     82%        67%      66%     64%
    eco            92%      92%     91%        47%      47%     56%
    ccom           90%      90%     90%        55%      54%     64%
    li             90%      90%     90%        56%      55%     70%
    tomcatv        99%      99%     99%        58%      58%     72%
    doduc          95%      95%     95%        39%      39%     62%
    espresso       89%      89%     87%        65%      65%     53%
    fpppp          92%      91%     88%        84%      84%     80%
    gcc            90%      89%     90%        55%      54%     60%
    mean           92%      92%     92%        67%      67%     73%

    Figure 5. Success rates of branch and jump prediction.

4.2. Parallelism under the five models.

Figure 6 shows the parallelism of each program under each of the five models. The numeric programs are shown as dotted lines. Unsurprisingly, the Stupid model rarely gets above 2; the lack of branch prediction means that it finds only intra-block parallelism, and the lack of renaming and alias analysis means it won't find much of that. The Fair model is better, with parallelism between 2 and 4 common. Even the Great model, however, rarely has parallelism above 8.

    [Figure 6. Parallelism under the five models.]

It is interesting that Whetstones and Livermore, the two numeric benchmarks, do poorly even under the Perfect model. This is the result of Amdahl's Law: if we compute the parallelism of each Livermore loop independently, the values range from 2.4 to 29.9, with a median around 5. Speeding up a few loops 30-fold simply means that the cycles needed for less parallel loops will dominate the total.
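A small numerical illustration of that effect (the loop sizes and parallelisms below are invented for the example, not measured): the aggregate parallelism is total instructions divided by total cycles, so it stays close to that of the less parallel code.

    # Aggregate parallelism across loops of differing parallelism.
    # The instruction counts and per-loop parallelisms are illustrative placeholders.
    loops = [
        (1_000_000, 30.0),   # (dynamic instructions, parallelism) for a very parallel loop
        (1_000_000, 3.0),    # an inherently sequential loop of the same dynamic size
    ]
    instructions = sum(n for n, _ in loops)
    cycles = sum(n / p for n, p in loops)
    print("aggregate parallelism = %.1f" % (instructions / cycles))   # about 5.5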
A study that assumed perfect branch prediction, perfect alias analysis, and perfect register renaming would lead us down a dangerous garden path. So would a study that included only fpppp and tomcatv, unless that's really all we want to run on our machine.
4.3. Effects of unrolling.
Loop unrolling should have some effect on the five models. We explored this by unrolling two benchmarks by hand. The normal Linpack benchmark is already unrolled by four, so we made a version of it unrolled ten times, and a version in which we rolled the normal unrolling back up, removing it altogether. We also unrolled the loops of the Livermore benchmark ten times.

We did the unrolling in two different ways. One is naive unrolling, in which the loop body is simply replicated ten times with suitable adjustments to array indexes and so on. The other is careful unrolling, in which computations involving accumulators (like a scalar product) are reassociated to increase parallelism, and in which assignments to array members are delayed until after all the calculations are complete, so that false memory conflicts do not interfere with doing the calculations in parallel.

Figure 7 shows the result. Unrolling helps Livermore and Linpack in the more ambitious models, although unrolling Linpack by 10 doesn't make it much more parallel than unrolling by 4.

    [Figure 7. Unrolling under the five models.]

Under the Fair model, the benefit of unrolling by 4 or 10 disappears altogether; the parallelism actually goes down slightly. The reason is fairly simple. The Fair model uses alias analysis by inspection, which cannot resolve the conflict between a store at the end of one iteration and the loads at the beginning of the next. In naive unrolling, the loop body is simply replicated, and these memory conflicts impose the same rigid framework on the dependency structure as they did before unrolling. The unrolled versions have slightly less to do within that framework, however, because 3/4 or 9/10 of the loop overhead has been removed. As a result, the parallelism goes down slightly.

Naive unrolling is not always sufficient, and in some cases actually hurts. Even when alias analysis by inspection is adequate, unrolling the loops either naively or carefully sometimes causes the compiler to spill some registers. This is even harder for the alias analysis to deal with, because these references usually have a different base register than the array references.

Under the Perfect model, the carefully unrolled Linpack comes just short of our maximum, because the saxpy routine, which accounts for 75% of the executed instructions, achieves a parallelism of 64 in each case. The aggregate parallelism stays lower because the next most frequently executed code is the loop in the matgen routine. This loop includes an embedded random-number generator, and each iteration is thus very dependent on its predecessor. This confirms the importance of using whole program traces; studies that considered only the parallelism in saxpy would be quite misleading.

Loop unrolling is a good way to increase the available parallelism, but it is clear we must integrate unrolling with the rest of our techniques better than we have been able to do here.
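The difference between the two treatments of an accumulator described above can be seen on a simple dot product. The code below is an illustrative reconstruction, not the benchmark source: naive unrolling keeps one running sum, so the additions in each unrolled body still form a serial chain, while careful unrolling reassociates the sum into independent partial sums combined after the loop.

    # Naive vs. careful unrolling of an accumulator (illustrative sketch).
    def dot_naive_unrolled(x, y, unroll=10):
        s = 0.0
        for i in range(0, len(x), unroll):
            for j in range(unroll):              # each add depends on the previous one
                s += x[i + j] * y[i + j]
        return s

    def dot_careful_unrolled(x, y, unroll=10):
        partial = [0.0] * unroll                 # independent accumulators
        for i in range(0, len(x), unroll):
            for j in range(unroll):              # these adds are mutually independent
                partial[j] += x[i + j] * y[i + j]
        return sum(partial)                      # combine once, outside the loop

    x = [1.0] * 100
    y = [2.0] * 100
    assert dot_naive_unrolled(x, y) == dot_careful_unrolled(x, y) == 200.0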
4.4. Effects of window size.

Our standard models all have a window size of 2K instructions: the scheduler is allowed to keep that many instructions in the pending cycles at one time. Typical superscalar hardware is unlikely to handle windows of that size, but software techniques like trace scheduling for a VLIW machine might. Figure 8 shows the effect of varying the window size from 2K instructions down to 4. Unsurprisingly, the Perfect model does better the bigger its window. Under the Great model, which does not have perfect branch prediction, most programs do as well with a 32-instruction window as with a larger one. Below that, parallelism drops off quickly.

    [Figure 8(a). Size of continuous windows under Great model.]
    [Figure 8(b). Size of continuous windows under Perfect model.]

4.5. Effects of using discrete windows.

Figure 9 is the same as Figure 8, except that we used discrete rather than continuous windows: the window manager schedules a whole window of instructions, executes them, and then starts over with a fresh window. The Great model does nearly as well with discrete windows as with continuous windows of the same size. Under the Perfect model, however, discrete windows yield less parallelism; with a 2K window the parallelism is only two-thirds that of continuous windows. The relative difference decreases as the window size increases, before the curves level off. If we have very small windows, it might pay off to manage them continuously; in other words, continuous management of a small window is as good as multiplying the window size by 4. As before, the Perfect model does better the larger the window.

    [Figure 9(a). Size of discrete windows under Great model.]
    [Figure 9(b). Size of discrete windows under Perfect model.]
4.6. Effects of branch and jump prediction.

We have several levels of branch and jump prediction. Figure 10 shows the results of varying these while register renaming and alias analysis stay perfect. Reducing the level of jump prediction can have a large effect, but only if we have perfect branch prediction. Otherwise, removing jump prediction altogether has little effect. This graph does not show static or finite prediction: it turns out to make little difference whether prediction is infinite, finite, or static, because they have nearly the same success rate. If we assume instead that all indirect jumps (but not all branches) are mispredicted, the right ends of these curves drop by 10% to 30%.

    [Figure 10. Effect of branch and jump prediction with perfect alias analysis and register renaming.]

That jump prediction has little effect on the parallelism under non-Perfect models does not mean that jump prediction is useless. In a real machine, a jump predicted incorrectly (or not at all) may result in a bubble in the pipeline. The bubble is a series of cycles in which no execution occurs, while the unexpected instructions are fetched, decoded, and started down the execution pipeline. Depending on the penalty, this may have a serious effect on performance. Figure 11 shows the degradation of parallelism under the Great model, assuming that each mispredicted branch or jump adds N cycles with no instructions in them. Livermore and Linpack stay relatively horizontal over the entire range: they make fewer branches and jumps, and their branches are comparatively predictable. Tomcatv and fpppp are above the range of this graph, but their curves have slopes about the same as these.

    [Figure 11. Effect of a misprediction cycle penalty (cycles) on the Great model.]
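The degradation Figure 11 measures can be written as a one-line model: if a trace needs C cycles when mispredictions are free and suffers m mispredictions, a penalty of N empty cycles each gives parallelism I / (C + m*N). The numbers below are placeholders for illustration, not data from the figure.

    # Parallelism when each mispredicted branch or jump adds N empty cycles.
    def penalized_parallelism(instructions, cycles, mispredictions, penalty):
        return instructions / (cycles + mispredictions * penalty)

    # Illustrative only: 10M instructions in 1M cycles (parallelism 10), with
    # 40,000 mispredictions; each added penalty cycle costs noticeable parallelism.
    for n in range(0, 11, 2):
        print(n, round(penalized_parallelism(10e6, 1e6, 4e4, n), 2))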
183
64
60
fPPPP
tomcatv
56
doduc
52
48
44
40
espresso
44
36
CcInn
40
32
&
28
I
egrep
P.
Lmpack
24
32
&
28
,,.,
,,
1
eaprcsso
I
Cconr
egrep
w,
Lmpack
gcc
24
Stanford
16
doduc
36
$
!!$l”””m’
20
fPPPP
tolucatv
‘----------------.“,’
.“,’
,“,’
,’,’
,,
,,
.,
.,
.,
52
48
8
2
$
. ... .. . .. .. ... ....
-------------------
60
56
~~onome
20
Stanford
JiecO
16
12
8
12
Whetstones
Livennore
4
tie’o
8
Whetstones
Livennom
4
‘&n. ‘e
J nf
bW
Figure
?%P
jil%;.
~M
jpi$
j!t~s
~f#
){~$
10. Effect of branch and jump prediction
aNone
aInsp
with
Figure
12(a).
aCOmp
Effect
of alias
aPerf
analysis
model.
on Perfect
petfect alias analysis and register renaming
14
~-..
-------
--
------
-
52
---------------
.
12 -
-------
.-
, ----------------:
e--‘
Linpack
tomcatv
,’
48 ;
44 +
10 -
~
,, ---”” ””-” ----”---:
,’
32
,’
,’
& 28
8.
j
,’
40 &
~ 36 2
,’
,’
,’
,.
,,
,’ ,.’
20 :
,’
,,’ ,,’
,,
16 ~
fPPPP
.“
,.’
-------------------:
Linpack
12
2-
-~- . . . . . . . . . . . . . . . . . . . . . .
8
li
4
I
01
01234567
I
I
I
I
I
I
I
I
891o
mispraiiction
penalty
alnsp
aNOne
Figure 12(b).
Figure
11. Effect of a misprediction
Figure
12 shows
aCOmp
(cycles)
aPerf
Effect of alias analysis on Great model,
cycle penalty on the Great model,
the effect
of varying
analysis under the Perfect and Great models.
Figure 13 shows the effect of varying the register
renaming under the Perfect and Great models. Dropping
from infinitely many registers to 256 CPU and 256 FPU
registers rarely had a large effect unless the other parame-
the alias
We can see
isn’t very powerful;
it
that “alias analysis by inspection”
increased parallelism
by more than 0.5. “Alias
rarely
analysis by compiler”
was (by definition)
indistinguish-
ters were perfect.
Under the Great model, register
renaming with 32 registers, the number on the actual
machine, yielded parallelisms roughly halfway between
no renaming and perfect renaming.
able from perfect alias analysis on programs that do not
use the heap, and was somewhat helpful even on those
that do. There remains a gap between this analysis and
perfection, which suggests that the payoff of further work
4.8.
on heap disambiguation
may be significant.
Unless
branch prediction is perfect, however, even perfect alias
analysis usually leaves us with parallelism
Conclusions.
Good branch
critical
between 4 and
of
8.
instruction-level
reduce
on
Even ambitious models combining the techniques discussed here are disappointing. Figure 14 shows the parallelism achieved by a quite ambitious hardware-style model, with branch and jump prediction using infinite tables, 256 FPU and 256 CPU registers used with LRU renaming, perfect alias analysis, and windows of 64 instructions maintained continuously. The average parallelism is around 7, the median around 5. Figure 15 shows the parallelism achieved by a quite ambitious software-style model, with static branch and jump prediction, 256 FPU and 256 CPU registers used with LRU renaming, perfect alias analysis, and windows of 2K instructions maintained continuously. The average here is closer to 9, but the median is still around 5. A consistent speedup of 5 would be quite good, but we cannot honestly expect more (at least without developing techniques beyond those discussed here).

    [Figure 14. The parallelism from an ambitious hardware model: zero-conflict branch and jump prediction, 256 CPU and 256 FPU registers, perfect alias analysis, continuous windows of 64 instructions.]

    [Figure 15. The parallelism from an ambitious software model: static branch and jump prediction, 256 CPU and 256 FPU registers, perfect alias analysis, continuous windows of 2K instructions.]

We must also remember the simplifying assumptions this study makes. We have assumed that all operations have a latency of one cycle; in practice an instruction with larger latency uses some of the available parallelism. We have assumed unlimited resources, including a perfect cache; as the memory bottleneck gets worse it may be more helpful to have a bigger on-chip cache than a lot of duplicate functional units. We have assumed that there is no penalty for a missed prediction; in practice the penalty may be many empty cycles. We have assumed a uniform machine cycle time, though adding superscalar capability will surely not decrease the cycle time and may in fact increase it. We have assumed uniform technology, but an ordinary machine may have a shorter time-to-market and therefore use newer, faster technology. Sadly, any one of these considerations could reduce our expected parallelism by a third; together they could eliminate it completely.
References

[1] Tilak Agarwala and John Cocke. High performance reduced instruction set processors. IBM Thomas J. Watson Research Center Technical Report #55845, March 31, 1987.

[2] David R. Chase, Mark Wegman, and F. Kenneth Zadeck. Analysis of pointers and structures. Proceedings of the SIGPLAN '90 Conference on Programming Language Design and Implementation, pp. 296-310. Published as SIGPLAN Notices 25 (6), June 1990.

[3] Jack J. Dongarra. Performance of various computers using standard linear equations software in a Fortran environment. Computer Architecture News 11 (5), pp. 22-27, December 1983.

[4] Joseph A. Fisher, John R. Ellis, John C. Ruttenberg, and Alexandru Nicolau. Parallel processing: A smart compiler and a dumb machine. Proceedings of the SIGPLAN '84 Symposium on Compiler Construction, pp. 37-47. Published as SIGPLAN Notices 19 (6), June 1984.

[5] John Hennessy. Stanford benchmark suite. Personal communication.

[6] Neil D. Jones and Steven S. Muchnick. A flexible approach to interprocedural data flow analysis and programs with recursive data structures. Ninth Annual ACM Symposium on Principles of Programming Languages, pp. 66-74, January 1982.

[7] Norman P. Jouppi and David W. Wall. Available instruction-level parallelism for superscalar and superpipelined machines. Third International Symposium on Architectural Support for Programming Languages and Operating Systems, pp. 272-282, April 1989. Published as Computer Architecture News 17 (2), Operating Systems Review 23 (special issue), SIGPLAN Notices 24 (special issue). Also available as WRL Research Report 89/7.

[8] Monica Lam. Software pipelining: An effective scheduling technique for VLIW machines. Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, pp. 318-328. Published as SIGPLAN Notices 23 (7), July 1988.

[9] Johnny K. F. Lee and Alan J. Smith. Branch prediction strategies and branch target buffer design. Computer 17 (1), pp. 6-22, January 1984.

[10] Scott McFarling and John Hennessy. Reducing the cost of branches. Thirteenth Annual Symposium on Computer Architecture, pp. 396-403. Published as Computer Architecture News 14 (2), June 1986.

[11] Alexandru Nicolau and Joseph A. Fisher. Measuring the parallelism available for very long instruction word architectures. IEEE Transactions on Computers C-33 (11), pp. 968-976, November 1984.

[12] J. E. Smith. A study of branch prediction strategies. Eighth Annual Symposium on Computer Architecture, pp. 135-148. Published as Computer Architecture News 9 (3), 1981.

[13] Michael D. Smith, Mike Johnson, and Mark A. Horowitz. Limits on multiple instruction issue. Third International Symposium on Architectural Support for Programming Languages and Operating Systems, pp. 290-302, April 1989. Published as Computer Architecture News 17 (2), Operating Systems Review 23 (special issue), SIGPLAN Notices 24 (special issue).

[14] G. S. Tjaden and M. J. Flynn. Detection and parallel execution of parallel instructions. IEEE Transactions on Computers C-19 (10), pp. 889-895, October 1970.

Appendix. Parallelism under many models.

On the next two pages are the results of running the test programs under more than 100 different configurations. The columns labelled "Livcu10" and "Lincu10" are the Livermore and Linpack benchmarks unrolled carefully 10 times. The configurations are keyed by the following abbreviations:
    bPerf    perfect branch prediction, 100% correct.
    bInf     2-bit branch prediction with infinite table.
    b2K      2-bit branch prediction with 2K-entry table.
    bStat    static branch prediction from profile.
    bNone    no branch prediction.

    jPerf    perfect indirect jump prediction, 100% correct.
    jInf     indirect jump prediction with infinite table.
    j2K      indirect jump prediction with 2K-entry table.
    jStat    static indirect jump prediction from profile.
    jNone    no indirect jump prediction.

    rPerf    perfect register renaming: infinitely many registers.
    rN       register renaming with N cpu and N fpu registers.
    rNone    no register renaming: use registers as compiled.

    aPerf    perfect alias analysis: use actual addresses to decide.
    aComp    "compiler" alias analysis.
    aInsp    alias analysis "by inspection."
    aNone    no alias analysis.

    wN       continuous window of N instructions, default 2K.
    dwN      discrete window of N instructions, default 2K.
    [Tables: parallelism of each test program under each of the configurations.]