Single Instruction Stream Parallelism

Single
Instruction
Michael
Butler,
Tse-Yu
Department
and
Computer
Mitch
Is Greater
Alsup,
Hunter
Microprocessor
Since
Recent
(less
studies
than
have
two
cycle)
streams.
Since
the
should
it is important
ists.
study
benchmarks
repeat
the
hardware
we get similar
all constraints
semantics
allelism
and
amount
the amount
constraints
except
struction
stream
is properly
to design
hand,
of these
and
Johnson
measured
the
performance
by the
unconstrained
much
per
now,
cycle
cycle.
Finally,
single
we show
that
one can sustain
on a processor
vent
if the
from
that
results
processor,
is rea-
today.
The
increasing
research
into
ements
way
to
to
ways
use
these
tional
units
only
lelism
present
studies
[3,
case.
cluded
ecute
14,
More
that
scientific
a
real
15,
recently,
insufficient
than
circuits
large
chip.
suggested
some
two
replicate
instructions
logic
when
el-
all constraints
multiple
is
exists
processors
per
it.
this
researchers[1,
ef-
in
that
2] have
in
real,
that
will
per
tions
con-
allows
ex-
chines
ACM 0-89791-394-9/91/0005/0276 $1.50
data
pre-
on the
execution
On
the
except
those
of 17 instructions
the hardware
of
other
we have
an execution
flow,
re-
found
per cy-
is properly
rate
of 2.0 to 5.8
that
an abstract
is reasonable
stream
3 describes
our
and
4 reports
realistic
Section
to
issues.
and
discusses
1The
Nasa7
benchmark
parallelism
to
experiments:
machine
results
the
configuraof our
simu-
case, and several
implement
5 presents
remarks
2 dismodel
influence
of various
machine
The model
used by Smith
the unconstrained
are
the
the
Section
execution
instruction
and discusses
the
on performance.
which
in six sections.
Section
a brief
6 offers
the future
work
was not simulated
mark
consists
of seven independent
straints,
we omitted
these loops.
cycle.
program,
when
benchmarks,
Section
plementation
Permission to copy without fee all or part of this material is granted
provided that the copies are not made or distributed for direct commercial
advantage, the ACM copyright notice and the title of the publication and
its date appear, and notice is given that copying is by permission of the
Association for Computing Machinery. To copy otherwise, or to republish,
requires a fee and/or specific permission.
did not
constraints
on a processor
is organized
the
identified.
© 1991
cycle
Section
and Johnson[1],
cycle.
per
in excess
available
tested.
lations
features
non-
how
could
today.
paper
simulator,
fact
also
on an
that
parallel
are removed
one can sustain
be exploited.
Early
was
have
to quantify
benchmarks
undue
of the
we show that
cusses restricted
paral-
warrants
that
This
func-
technique
of parallelism
Finally,
design
to
instructions
by the semantics
instructions
One
instruction-level
parallelism
to support
of
We
benchmarks
in order
with
hand,
balanced,
motivated
numbers
This
of
applications
16]
has
performance.
is to
amount
applications
more
utilize
single
the
in
5,
VLSI
computational
elements
on
if
of
to
improve
fective
the
density
mod5] and
Jouppi[2].
to sustain
two
cle.
perfor-
Hwu[3,
of the processor
that,
than
degrees
Introduction
show
it is difficult
more
quired
1
if the artifacts
per-
the
execution
and
of these
in these
have
compute-bound
measured
by Patt
model
exists
programs
that
it.
Our
2.0
in
and
execution
parallelism
be exploited
in-
[1]
un-
ten
of the
applications
on several
studied
we have
nine
have
benchmarks
those
by Smith
of par-
exploiting
We
im-
degrees
for
used
do this,
in optimized,
standard
when
required
To
influ-
to verify
parallelism
a set of real
de facto
should
it is important
exists.
We have
els, including
those
balanced,
per
mance
parallelism
of available
benchmarks.
we have found
parallelism
to 5.8 instructions
We
of available
suite,l
the
formance
show
they
other
become
SPEC
and
code.
Group
West
78735
parallelism
SPEC
Shebanow
of the processor,
a study
compiled
in the
ex-
constraints.
On the
important
hardware
of the
researchers,
of 17 instructions
most
really
much
dertaken
of available
execution
resource
results.
how
in sin-
of the processor,
resource
are removed
in excess
sonable
the
parallelism
parallelism
previous
of the program,
perhaps
much
differing
of the
under
posed,
how
work
little
is available
the design
we model
under
the
that
influence
to verify
In this
that
per
gle instruction
parallelism
concluded
operations
Drive
Texas
ence the design
Michael
Two
Technology
Cannon
Austin,
Abstract
and
Memory
William
48109-2122
than
Incorporated
and
6501
of Michigan
Michigan
Scales,
Motorola
Engineering
Science
University
Arbor,
Parallelism
Yeh, and Yale Patt
of Electrical
The
Ann
Stream
loops.
today,
discussion
some
ma-
are
all
of im-
concluding
we have planned.
because
Due
to
this benchtime
con-
2
The
RDF
To exploit
whatever
stream,
facts
one
that
data
Processing
those
instructions
namic
order
they
The
that
from
present
in
of time.
execute.
appear
in the
size
dynamic
the
dynamic
issue
tions
can
that
stream
We
and
call
the
tion
class
machine
latency
the
can
tency
number
correspond
cpu
to
design
corresponds
sue
corresponds
the
niscient,
ple,
can
restriction
its
The
Ia-
RDF
size and
model
that
all
parallelism
size
the
on is-
present
unbounded,
ters
level
RDF
of the
in the
data
parallelism
the
and
(UDF)
available
exam-
RDF
hand,
model,
if we restrict
we obtain
upper
that
can support
a specific
It
All
of
SPEC
implementable
size, issue
execution
parameon the
floating
size and
tigate
We
point
the
ten
effect
of
latencies,
of
shorter
were
several
of the
instructions.
their
add,
simulated
and
loads,
etc.),
a description
class.
taken
The
from
machines
so
with
M for floating
memory
latencies,
pre–
each
class is listed
however,
the
t, gcc
limitations,
and
to that
pro-
with
output
point
L for
one machine
time)
in eqntot
million
cycles),
belong
know
the
instruction
(in
produced
run
scalar
called
all
was se-
unchanged
classes
multiply,
that
all but
Johnson[1].
for
Each
latency
instructions
cies for
the
of
integer
compiler
to time
(A for floating
execution
the
used
Due
instruction
latencies.
and
is not
and
was simulated
compiler
conventional
file.
1.8.5
with
shortest
run
the
FORTRAN
(i.e.
were
under
benchmarks
2.4 compiler
if that
code
cpp
cpp
Hills
intesuite:
matrix300,
run
The
C Rel.
MC88100–a
as the input
and
nine
SPEC
fpppp,
(A particular
object
without
its abbreviation
point
on.
for
the
for
Green
Data
exceptions:
are
from
doduc,
benchmark
an
1 shows
paper
architecture.
benchmarks
benchmark
instruction
bounds
a given
on
Ii,
the
Diab
turned
run
Table
machine.
window
influence
of the
for window
compiled
set
efficient
was run
an excall
the various
for
most
this
gcc,
using
or the
processor
We
in the
compiled
following
to exploiting
application.
flow
the
several
programs
tomcatv,
instruction
cessor.)
window
of performance
that is possible.
For example,
model gives an upper bound on the performance
a machine
espresso,
when
variations
we have
no impediment
an (opti-
capability.
in
point
spice2g6,
lected
its om-
For
units,
presented
floating
eqntott,
its
other
results
optimizations
functional
to study.
functional
are also
presents
with
several
with
performance
values
unit
to
of imple-
emulating
identify
differing
cost
Benchmarks
compiler
stream.
On the
Experiments
were
of
instruction
unbounded
are interesting
an unrestricted
all
above,
Nonetheless
to the
rate
unit
specifies
and
realizable.
issue
specified
prediction
ecution
model
model
with
functional
M88000
amount
on
3
ger and
parame-
restriction
limitation
and
The
the
window
the
will
rate,
3.1
constraints
on
We
models,
its
changed.
on the
RDF
it is continuin an effort
the
RDF
unbounded,
and
on
supported.
the
is specified.
limitation
the
instrucand
practical
reduce
al-
to the im-
Today,
group
we are concerned
parameters
benchmarks.
the
if in addition
this
is
because
important
to
RDF
branch
is not
the
be
units
set architectures,
objective,
has not
in
as a speedup
brought
The
In
paper,
RDF
(HPS)
its applicability
research
ultimate
of the
Substrate
architectures.
machine,
exploits
stream.
implementation
and
is
corre-
cycle.
be
operation,
defined
RDF
window
of operations
operation
first
the
of instruc-
a single
Its
RDF
In this
num-
bandwidth.
Clearly,
of the
each
The
that
units,
desired
a practical
buffering
memory
functional
some
[3].
to
rate
of
was
the
operation.
every
with
model
set
each
instant
mal)
that
efficiently
developed
our
that
instruction
instruction
by our
capability
restrict
model
Performance
performance
mentation.
be
unit
that
on
a specific
configuration
in a single
discovered
refined
its
unit
model
bound
if we further
originally
of all
being
improve
any
can
a packet.
the
with
perform
RDF
cycle
specify
associated
The
on
latencies
one
ally
an upper
an execution
for complex
we quickly
predic-
RDF
can support
a realizable
was
branch
the
if functional
machine
High
HPS
plementation
instruc-
as the
the
mechanism
though
instruction
in
that
[3].
can
dynamic
window
1985
in the
rate
a functional
specified
those
that
Finally,
parallelism
model,
that
number
the
the
at
issue
to a realizable
We first
dy-
we obtain
we have
available
RDF
a condi-
of
is less than
from
flow
stream.
as long
instructions
in
associated
model,
ters
of
graph
maximum
into
group
way
are issued
number
flow
and
and
If we couple
of a machine
to
their
The
stream
window
be removed
data
instruction
be issued
is the
entered
the
into
each
data
the
capability
predictor,
a problem.
model
sponds
an omniscient
the
maximum
in the
rate
knows
dynamic
can
of instructions
by
instruction
(Instructions
RDF
stream,
when
completed.
Instructions
is the
the
size. ) The
has
size
not
instruc-
and retiring
is created
always
will
window
tions
a dynamic
for execution
execution
predictor
branch
were
unit
a problem.
implementable,
instruction
into
not
window
a
window
issuing
dynamic
stream
is such
class latencies.
of systematically
if function
were
the performance
The
parameters:
have been resolved,
after
tional
parallelism.
issue rate
tion
to a real branch
of arti-
paradigm
by three
instructions
instruction
branch
(RDF)
instructions
scheduling
dependencies
ber
flow
a program’s
converting
devoid
of that
instruction
consists
from
graph,
and
in the instruction
model
utilization
It is characterized
tions
flow
the
rate,
exists
an execution
restricted
size, issue
of Execution
parallelism
needs
limit
abstract
model.
Model
in
we
Smith
with
order
of
latenand
smaller
to
simulated
invesone
Table 1: Instruction Classes and Latencies

Instruction Class   Execution Latency   Description
(A) FP Add          6                   FP add, sub, and convert
(M) Multiply        6                   FP mul and INT mul
(D) Divide          12                  FP div and INT div
(L) Mem Load        2                   Memory loads
(S) Mem Store       -                   Memory stores
(B) Branch          1                   Control instructions
(T) Bit Field       1                   Shift, and bit testing
(I) Integer         1                   INT add, sub and logic OPS

Figure 2: Instruction Class Distribution
(Data collected from M88k instruction trace; C compiler: Diab, FORTRAN compiler: Green Hills.)
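The class latencies of Table 1 can be written down as a small lookup table. The following is only an illustrative sketch: the names LATENCY and chain_latency are hypothetical, and the assumption that each instruction in a chain waits for the full latency of its predecessor is a simplification made for the example, not a description of the simulator.

```python
# Execution latencies in cycles for each instruction class, as listed in Table 1.
# The store class has no latency entry in the table, so it is recorded as None.
LATENCY = {
    "A": 6,     # FP add, sub, and convert
    "M": 6,     # FP mul and INT mul
    "D": 12,    # FP div and INT div
    "L": 2,     # memory loads
    "S": None,  # memory stores (no latency listed)
    "B": 1,     # control instructions
    "T": 1,     # shift and bit testing
    "I": 1,     # INT add, sub and logic ops
}


def chain_latency(classes):
    """Total cycles for a strictly serial dependence chain.

    Assumption for illustration only: every instruction depends on the
    one before it and starts when its predecessor's result is available.
    """
    total = 0
    for cls in classes:
        cycles = LATENCY[cls]
        if cycles is None:
            raise ValueError(f"class {cls!r} has no latency in Table 1")
        total += cycles
    return total


# A load feeding an FP add feeding a multiply: 2 + 6 + 6 cycles.
print(chain_latency("LAM"))
```

Under these assumptions, a chain of dependent long-latency operations bounds the execution time from below regardless of how many functional units are available.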
The
machine
configurations
3.
machine
Figure
Each
simulated
is specified
l__.--l
EFFFW
tional
S MATRIXW
retirement
units,
a Wx2G6
(i.e.
L? TWCATV
cycle),
Branch Memo(y Manwy sit Field Integer
Lcxad Store
Instruction
unit,
and
1: Benchmark
Code
Analysis
configuration
using
floating
point
Figure
1 shows
of each
each
respond
of
33 percent
ations;
the
for
to
the
data
can
are
branches
respectively.
same
nine
As
seen,
simple
Figure
bars
listed
integer
ALU
28
a stack
The
Machine
instruction
tant
for
the
ones
the
instruction
examining
we
utilization
units,
instruction
of these
cle),
can
the
execution
we
units.
functional
machine
on
if
functional
be
of
Machine
of
from
unit
issued
to
machines
are
to
Also,
configuration
(only
functional
unit
contributes
to
as
the
two
and
I.
tional
de-
one
incy-
upper
throughput.
can
units
would
service
the
can
be implemented
rest
of the
have
multiple
3) the
ma-
units.
One
instructions
execute
from
instructions
instructions
an
capabilities
from
effort
was
to instruc-
In addition,
the relative
cost
was considered.
If infrequently
could
cost,
at
As
be made
to service
class of instructions
frequently
machine.
functional
unit
be enhanced.
most
machines
(see Figure
execute
of
of executing
of handling
configurations
used)
in hardware
mance
can
functional
(frequently
remaining
can be
number
is capable
capable
one
contains
machines
I, one can execute
machine
to match
increase
is
each
and
used functional
efficient
issue
I,
tion
class frequencies,
of adding
functionality
reflect
achieve
since
configuration
each
such
must
only
unbounded
has four
units
L, S, T,
choosing
RDF.18
an
is ca-
machines,
of executing
For example
A, M, D, and
B and
and
are
as 4F2M
functional
chart
classes
resources
classes.
the classes
imPor-
unit
Smith/Johnson
The
that
indicate
functional
of which
classes.
units
15
particularly
the
with
each
identified
made
is
performance
frequency
on
struction
frequency
simulated.
of
pendent
bound
class
The
functional
and
Configurations
each
functional
parentheses
the
is capable
as machines
In
3.2
configuration,
a single
UDF
one letter.
chine
benchmark.
pipes
only
instruction
oper-
in a single
so each set of parentheses
functional
at
to
the
that
In
described
rate,
packet
load/store
class of instructions,
all
cor-
approximately
another
2 presents
by
benchmarks.
vertical
benchmarks
be
comprise
organized
frequency
nine
nine
8
respectively.
dynamic
of the
the
of 3, 3, and
divide,
the
each
class,
figure.
and
latencies
and
of
of instructions
loads
percent
the
class
instruction
respectively
right
point
multiply,
histograms
instruction
Within
the
floating
add,
unit
within
classes
unit
in
buffer.
letters
functional
issue
are issued
of the
corresponds
of executing.
each
for
the
instruction
pable
that
to its functional
parentheses
listed
set of func-
mechanism,
characteristics
prefetch
are
by its
size of an instruction
of instructions
the
respect
of
the
Figure
the
instruction
set
Class
prediction
and
With
NT/FP
Div
branch
mechanism,
a group
and
lNT/FP
Mul
in Benchmarks
Em
Eluo-c
FPAdd
SPCEZ26
mMcAlv
the resultant
For
example,
occurring
a nominal
such,
all
an addi-
at nominal
machine
perfor-
integer
ALUs
instructions
cost
relative
functional
and
to the
units
are
Machine
Functional
Unit
Branch
Load
Irlat.1
I
hue
Rediction Bate MecL Packet Store
Configuration
Name
PacketRetire #Of
S&J4
(A)(M)(D)(L)(SI(B)(TI(f)(f)(L)
85%Syth 4
UtIF
ca(A,M,f),fJ,B,’f,~
RDFJSco(~M,D,~~B,TJ)
OticfOtdE?1
~uenti,]
hi.
No caches
Prefetch
sion
Bufer
rates
Aligned
4wd,
No bank
for
IO(% hfinb oti~o&
I
outc40dR lhnliaredFetch
All
look
I
Otichk
store/load
8
4FZM (A,M,D,[)(SJ
Z(L,S,T,I)
1~
4EZM
B85(A,M,D,I)(B,l)
2(L~,T,fJ
85%Byth 1
owdofi
1
un~kmdFetc~
h Order
4 o.t,f orkr untbedFe~h
InOhkl
4 OutJD?&r Undigd Fekh
same
Inode?
6 OutorOr&r Unahgwd
FdcII
8 OutofO?fk UnJigiwd
Fetch
All
WAB85 (A,IXMHIVXTN!D
3U.$,TJ)85%Byth 1
h Order
8 OutnlOr&r Undigried
Fetch
initiate
8P3fJ
BE (A,lXfA,f)(Il,f)(T,fXBJ)
2&#,TJ) BedBP
in order
8 OutofO&r Unaiigred
Fetch
ally
●
Figure
3: Simulated
With
respect
set
io
Machine
its functional
of parentheses
unit,
and the letters
within
classes
executing.
A number
exist.
In
that
the
RDF.18,
unit
corresponds
instruction
ses indicates
Configurations
that
appearing
of our
abstract
an unbounded
number
indicate
unit
before
copies
each
the
of
a set ofparenthe
-
packet
in
the
simulations
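The functional unit notation of Figure 3 (each set of parentheses is one functional unit, the letters inside are the instruction classes it can execute, and a number before a set of parentheses gives the number of copies) can be expanded mechanically. The sketch below is a hypothetical helper, not part of the simulator; the function name parse_config and the example string are assumptions chosen for illustration.

```python
import re


def parse_config(spec):
    """Expand a functional unit string into one class set per unit.

    Each "(...)" group is one functional unit; an optional integer
    directly before the group is treated as a replication count.
    """
    units = []
    for count, body in re.findall(r"(\d*)\s*\(([^)]*)\)", spec):
        classes = frozenset(c.strip() for c in body.split(",") if c.strip())
        copies = int(count) if count else 1
        units.extend([classes] * copies)
    return units


# Illustrative string in the Figure 3 notation (hypothetical example).
units = parse_config("(A,M,D,I)(B,I)2(L,S,T,I)")
print(len(units))                    # number of functional units
print(sum("I" in u for u in units))  # units able to execute integer ops
```

Counting how many units accept a given class, as in the last line, is one way to compare a configuration against the instruction class frequencies of the benchmarks.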
3.3
Smith
Simulation
The
simulator
machines
process
An
is capable
as well
works
(ISIM)
unit
When
a trap
must
of functional
units
are
rently
in the window
reads
machine
trap
level
the
Our
The
any
range
of
simulation
gathers
makes
renaming
anti
and
achieving
execution
file
rate
sim-
●
machine,
and
thus
time
the
can
operands,
ready
Packet
Renaming
dependencies
with
the
models
●
as-
branch
pre-
interrupts.
being
sim-
all instructions
cur-
and then
that
execute
define
limits
can exist
execu-
but
the total
in the
num-
machine
is considered
occupying
space
until
in the
it
of four
waiting
for
free,
executing,
at
in the
window,
is retired.
states:
the
size is given
Rate
into
-
In-
waiting
assigned
for
func-
or waiting
in the
parameter
for
number
of
the
packet
solely
by the
Buffer
buffer
machine
per
A
cycle.
one
cycle
are
Nor-
and
the
is deter-
configuration.
-
be modeled
This
machines
to
the
cycle.
that
a single
is set
Configuration
characteristics.
Smith/Johnson
in
rate
issued
may
per
of instructions
machine
issue
indicates
can be issued
of a group
instructions
Prefetch
This
that
of
and refill
we simulated.
“re-
This
the machine
is issued
Window
the
prefetch
of
both,
for
parameter
of packets
brought
mally
elimi-
is critical
(i.e.
instructions
execution.
parameters
be in one
consists
mined
and
all
to complete,
to become
Issue
packet
assumptions:
to
packets.
number
simplifying
performance
unit
the
statistics.
able
are mutu-
use of checkpointing
instruction
it
structions
processing
sim-
after
precise
wait
that
from
number
schedul-
the window
and
size - This
An
the
The
from
to our
key
time.
describes
and
and
simulator.
one
issue
execu-
RDF
by ISIM.
analysis
is performed.
output
high
which
produced
several
our
very
instruction.
retirement.
MC88100
simulates
begins
dependency
execution
simulator
and
the
trace.
and then
stream
data
for
code
in a configuration
Register
for
simulator
instruction
instruction
nates
models.
object
an
performs
and
a wide
(i.e.
cycle)
units
issue,
ber of instructions
machines).
the
occurs
pipelined
is encountered,
stop
are several
in the
tional
to be simulated
dynamic
ulator
of modeling
as execution
in
reads
Johnson’s
(except
to
in the win-
each
for supporting
ulated
operations
performing
loads
resident
completed
recovery
and
as follows:
producing
ulator
and
have
corresponds
miss
Process
instruction
tion,
ing,
of
packet
UDF
integer
of
and
forwarding
are fully
are removed
Window
simple
this
operation
) in whole
There
executing
hit
independent.
diction
tion
of
units
Instructions
the
capable
stores
are both
that
[6] as a mechanism
present.
capable
found
machines,
of the functional
ver-
ly.
sumption
is capable
are
if
location
a new
in that
functional
the parentheses
the functional
multiple
case
a single
current
100 percent
are modeled.
modeled
functional
tired”
configuration,
to
in the
assume
D caches.
forwarding
We
InOrda
We
I and
conflicts
infrequent
1
1
both
memory
dow.
modeled
simulator.
machines
1
(AJ(M,D,I)(B,93(L3,T,I) 1~
6F3M
8F3M (.W(KNDJKT,W)W$,TJ) lM%
are explicitly
of the
in
An
with
instruction
specified
was used only
order
to
size
in the
match
their
configurations.
No
memory
stages
naming
renaming
renaming
of the
at the byte
occurred
the algorithms
renaming
simulator,
did
In
we supported
granularity
level,
so infrequently,
or the
not
is performed.
actions
noticeably
but
either
of the
improve
the
early
memory
found
Functional
Unit
Configuration
capabilities
of all functional
the configuration
file.
Each
that
because
compilers,
●
reof
fined
that
cuting.
performance.
by the instruction
-
The
number
and
units
are specified
in
functional
unit is cle-
classes
it is capable
of exe-
●
In/out
of
whether
order
Order
instructions
for
still
Execution
each
allows
— i.e.
can
for slip
other
This
unit.
to occur
between
but
functional
parameters
of the
out–of–issue
discuss
the
execution
models
in turn.
functional
in–order
out–of–order
for
indicates
In–order
are executed
unit,
flag
be executed
functional
operations
functional
-
each
respect
4.1
Previous
For
units.
Prediction
prediction
branch
accuracy
prediction,
branches
instruction
rectly
or
the
This
not.
Smith/Johnson
an actual
each
of
the
is made
outcome
prediction
This
pointing
scribe
issue
by
is stalled
models
Class
the
ecute
the
compared
the
until
the
of clock
type
used
are
these
latencies
given
-
If a
branch
is
cycles
in
table
1.
machine
used
This
of 3, 3, and 8 for floating
the
floating
point
to
point
add,
Units
to model
of
-
a machine
functional
units.
are effectively
may
window
allows
with
an unbounded
this
functional
the
Issue
whether
functional
unit
they
issued
are
sequence
to
executing
it,
that
it cannot
indicates
at most
window.
at
to match
unit.
all
in that
between
con-
time
harmonic
superscalar
pede
instructions
high
ous limit
cycle.
of four.
Each
chine
work
ing
benchmark
Results
was simulated
configurations:
of Smith
no
realizable
artificial
and
Johnson[1],
constraints
obtained
faithful
(2)
on
by
three
the
Aside
to the previous
a model
to achieving
processor,
restricting
and
the
IPC.
to four
from
performance
floating
and
results
Johnson,
the
IPC
benchmark
machines
The
performance
most
independent,
for a peak
a maximum
peak
imobvi-
is the issue rate
of nine
allowing
per cycle,
and
of the
machine.
is comprised
since
suite.
our
a particular
in
report
benchmark
Smith/Johnson
high
point
performance
between
unnecessarily.
However,
execution
of four
execution
in-
has been
IPC.
the four
additional
a poorly
an overly–constrained
values
the
by
benchmarks,
the
entire
and
units,
are issued
performance:
(3)
machine
functional
are three
represent-
of
of nine
reduced
sets of ma-
details
The
their
of
published
show
show
of a scalar
performance
structions
under
(1) a model
machines
rate
IPC
and
prediction
processor,
executing
by the
order
compilers,
integer
across
a scalar
by Smith
and
com-
the results
results
the
Johnson
for
of
due to differ-
12, that
comparison
machine
Several
of
mean
published
is divided
capable
over
in
Branch
with
3.1 IPC
and
a rough
pipelined
Simulation
Smith
will
functional
all previously
and different
for
is only
order
after
simulations
IPC
1.4 and
of speedup
those
is
Our
2.1
ex-
as in [1].
4 through
are consistent
for
upon
is not possible
Figures
1.7 and
The
executed
only
completed.
comparison
and Johnson.
there
window
(synthetic)
one notes
the
instruction
unit
and
be issued
issue
have
a direct
To provide
if an instruction
to a functional
instruction
The
each
Thus
one in-
are
set architectures
terms
or whether
stores
is 85 percent
benchmarks.
a particular
unit),
instructions
functional
flag
to
(with
a common
be assigned
following
assigned
time
is constrained
to a unique
can not
This
to each functional
of assigning
issue
-
are
at issue
going
that
Constraint
instructions
struction
instructions
Thus,
act
con-
can be placed
instructions
the
are executed
the
is not
in data-dependent
ent instruction
and
Instruction
and
prefetch
issue,
cycles).
from
a
mem-
boundary.
in the
though
integer
model,
with
machine,
even
be-
of the
In this
is filled
instruction
(The
removed
Loads
between
the
for different
instructions
Smith
as are
cycle.
selected
configuration.
a cycle
instructions
from
was
Instruction
into
ALU
our simulations
machine,
units
in
are
accuracy
are pre-
4 instructions
unit
be scheduled
issued
buffer
one integer
integer
pletion.
la-
flag
than
execute
config-
results
performance
models.)
be required.
by functional
in
of Smith
on a four-instruction-word
instructions
more
SJ4
highest
prefetch
of model-
machine
SJ4
figures.
as 3 of the
not
ample,
store
multiply,
This
With
as many
in any given
buffer
four
only
machine
aligned
as many
While
Functional
simulator
words
instructions
“SF”
but
4 through
configuration
(All
the
wide
Thus,
units
we
exception
with
Smith/Johnson
course
ex-
4[1].
in the
instruction
a single
respectively.
Unbounded
needed
smaller
to
and machine
clarity
four
the
(Figures
the results
demonstrated
strained
de-
latencies
only
model
tencies
number
The
The
suffix.
divide
required
of instruction.
is the machine
latencies
it
of bringing
of a check-
These
of machine
benchmarks
simulated,
four
ory
to
trace.
performance
Latency
number
a given
there
then
by
mechanism.
Instruction
and
used
as determined
fails,
resolved.
cor-
prediction,
for
cause
is generated
branch
and
We will
set
SJ4 presents
machine
were
sented
mechanism
nine
labeled
Johnson’s
urations
dynamic
is predicted
real
and
prediction
number
the
With
1]. With
the
each
Work
ing the assumptions
configuration
in
branch
is
[1].
prediction
real
branch
machine
a random
whether
of branch
and real[l
the
in the
types
are encountered
stream,
to determine
are two
- synthetic
is specified
As
the
There
supported
synthetic
file.
-
machine.
for
to
12), the curve
Branch
results
units
within
with
unconstrained
simulation
instruction
machine
balanced
prefetch
wide
fetch
limit,
there
characteristics
that
machine
configuration,
buffer,
and
limit
a sequential
load/store
execution
A significant
Smith/Johnson
chine
rise
constraint.
source
machines
configurations.
is an
The
to a performance
comprise
EQNTW1’I InatructknLevelPamllebm Upper Boond 3(0 lPC
of performance
degradation
imbalance
scarcity
in
If integer
operations
38 percent
of the
instructions
integer
benchmarks
and
machine
contains
integer
unit,
of much
ALU
then
greater
will
to expect
than
because
be saturated.
our simulations
and
it is unreasonable
two
the
simply
This
is in fact
— the integer
execution
time
units
for
the
ma-
units
nearly
the
the
of integer
bottleneck.
7- - --_
in
by contention
Another
the
bottleneck
prefetch
respect
cache
buffer.
lines
will
reduce
cussed
one
per
--
busy
two
prefetch
lines
from
only
the i–cache,
the
sue two
window,
to fetch
new
space.
The
space,
instructions.
cost
of a more
features
chine
to
cycle,
complex
into
only
the
limit
the
instructions
and
thus
!
the
loads
occur
spect
to other
strictly
all previous
the
in
loads,
stores
(i.e.
been
applied
in order
to
maintain
1
p4
memory
system.
This
support
of a checkpointed
write
all
in
of the
a synergistic
cantly.
changes
a machine
effect
and
the
with
state
1
.“
- A ----------------------------------------
For instance,
removing
allows
greater
x 4F2M285
~t-1
o
9
12345678
from
the
for
5: Instruction
‘~;
6
the
in
addition
of extra
load/store
of implementing
all of these
30 to 50 percent
(.6 to 1.0 IPC)
model,
across
Ii, eqntott,
and
the
four
pipes.
is a performance
integer
13
14
15
Parallelism
16
of ESPRESSO
Upper Boud 38 lPC
-----------------------------
..__
m
.,.,’
1
––– ______________________
KJFB
48$24.4
02$3MR7sc
+. Ww Rs
--m----------------------------------‘“
c
3
have
/
signifi-
..__
,
----
2
* 8F3MP45
I
o 6F2M
‘-”-------
+ ..9=.+::?
::.::
;:>:.-i,>%<.+?.kg.%:-*-*
-*-~--v. ..+..- ,. --,
.$, ------A
,
_- .--,.-—
,-—, —--,-- —.-—-,---,..... ..-.
<.--m.. .=- .-. -- -— --- -—. -— -.-. ---
load/store
#if:-:
- 4F2M
X 4F2M@85
[
__________________________________
- S&J4
to be gained
The
P
result
increase
of
tested
~1
12345678
over the Smith/Johnson
benchmarks
12
*
--------
-.. --’ .8__________
5
are imple-
improves
Level
CCChwtmction Level Padelbm
are
in
as described
the sequential
11
before
the
performance
10
Window Size (packets)
,
constraint
‘) 8F3M,?45
/
- S&J4
re-
improvements
performance
.+.2FWRB
after
with
above
~sF3MlWSF
—________________________
of
only
however,
suggested
model,
* 8F2M
1-
‘4
11
When
FD$!8
1
__ —________________________
o 6FW
[6]
mented
_----
ma-
constraints
buffer
Level Panileli,m Up~r Bourn+179uc
—__ ——___-__
t
is unnecessary,
of EQNTOTT
- 4F2+4
These
order
a consistent
Parallelism
at the
issued
These
Level
two
execution
instructions
16
execution
are executed
executed.
15
does
of the
impede
issue
t4
available
buffer.
model,
sequential
and
instructions
have
store)
Smith/Johnson
13
~~
Figure
in
12
,.”
is-
issue
ability
;,--
--_:
5
throughput,
Finally,
11
■
‘~q
----6
by prefetch-
prefetch
10
empty
be eliminated
controller
in the
1
9
4: Instruction
ESPSEPSOInfiction
it is
buffer
only
unnecessarily
issue
can
of the
can
- S&J4
567s
Figure
instance,
regardless
prefetch
is room
for
prefetch
can
-X 4F2MB85
Window SJze (packets)
improve.
buffer
inefficiency
there
If,
instructions
::-
to
is that
but
the
prefetch
This
ing anytime
two
next
the
buffer
instructions
to a full
attempt
buffer
four
?::::::::,
H-J
1234
absence
would
empty.
::”::
alignment
(in the
prefetch
completely
contains
due
window
of the
tie-?
‘“-
,.,x’
1 “z----------------------------------------Y
As dis-
expanded
and perform
performance
*WW
- 4F2M
will
buffer.
were
instructions
blocks),
when
buffer
not
useful
shortcoming
refilled
buffer
-“
o
ba-
instructions
a prefetch
:.x--
of
and small
2,5 useful
by such
of the end of basic
Another
only
:__J_J’
instructions
distribution
cycle
four
of useful
#.,:, *.-.x
with
middle
--_--::_
~T::&”-----+.’-+*--_+-+
2
of
is aligned
-- 8F3M245
---------------------Jo
A;.:ZT=-====:”;::=
~;.~_____._.___
3 -- ~------
in
is the behavior
*8FSMR3S.C
0 ff’24JRB
,’
c
integer
happens
in the
= F31F.B
* 8F3M
- _+----------------
resource.
buffer
the number
if the
so as to produce
ALU
——_________________
,.
I
is determined
targets
a uniform
[1],
in
the
branch
on average
be fetched
integer
Because
any
Assuming
sic blocks,
the
to performance
to memory,
fetched.
fetch
for
---
5- ---------------
“+
largely
m-.---v
s
m-e
#
the
are constantly
,..
8
,.
m
~,__________________________________
6- - -----
a speedup
the
.m,
, -.-t” ---_
——_____
gives
only
what
program
8
in the
9
10
11
12
1S
14
15
16
Window Size (packets)
(gcc,
espresso).
Figure 6: Instruction Level Parallelism of GCC
Figure 7: Instruction Level Parallelism of XLISP
Figure 8: Instruction Level Parallelism of DODUC
Figure 9: Instruction Level Parallelism of FPPPP
Figure 10: Instruction Level Parallelism of MATRIX300
Figure 11: Instruction Level Parallelism of SPICE2G6
Figure 12: Instruction Level Parallelism of TOMCATV
(Each figure plots IPC against window size in packets, from 1 to 16, with one curve per machine configuration.)
We
on
also
the
similar
to
primary
that
his
published
limitation
on
and
it
parallelism
precede
that
which
packets
we chose
Lack
ability
to
block.
integer
the
floating
the
Smith/Johnson
machines
4.2
The
RDF
each
the
of
of the
highest
nine
performance
a restricted
eight
eight
tion
class,
move
instructions
complete
on
per
until
capable
the
of
an
that
Branch
results
Note
of eight
model.
placed
an upper
bound
model
performance
for
with
an
the
machines
issue
approaches
an
size
or fetch
rate,
bound
for
either
tion
accuracy
ulator
provides
ticular
benchmark.
Machine
of
true
This
In
of
all
segments
absolute
upper
is the
this
rate
graph,
the
ranges
from
parallelism
that
is
value,
the
that
chine
100 percent
IPC
for the
the
machine
to
exists
in
the
The
4.3.1
eight
have
program
units
of the
Ground
the
middle
machine
curves
assumptions
chine
that
terest
that
in
schemes.
tions
drove
knowing
the
Consequently,
units
(synthetic),
and
a real
model
of
from
are
on
execution
the
entered
of
the
a series
this
a desire
set
one
of
to
of
pick
with
with
of
configura-
of the
85
following
percent
removed
in
these
model
from
the
85 percent
1.5 to 2.8
of
ranges
(floating
that
mempredic-
3.6 IPC
for
the
the floating
for
the
significant
impact
of the
per
mechanism.
limits
limita-
The
average
advantage
ranges
for
the
prediction,
point)
integer
and
of addifrom
ap-
benchmarks
from
With
1.6 to 5.0
the real branch
2.4 and 3.4 (integer),
is obtained.
of floating
performance
of these
cycle
the
machines
in con-
character-
Performance
floating
outcome
Performance
point
with
point
of the
Because
latencies
benchmarks,
smaller
the experiment
for the eight
using smaller
latencies.
With
the
the
branch
branch
branch
branch
of between
the
each
100
and real
predicts
benchmarks.
performance
three
characteristics:
of that
one
(five)
with
machines
synthetic,
issue
branch
of several
These
gathered
units.
point
machines
history
the
5.8 (floating
predictor,
three
branch
3.2 to
program.
from
1.9 and
repeated
machine
window
with
mechanism
on the
and
existence
experiin
real
blocks
performance
predictor.
simulated
from
prediction
1.8 to 5.7 for
unit
2.2 to 2.9 IPC
predictor,
branch
from
from
85 percent
functional
for the floating
accurate
and
running
suffers
we assumed
using
and
ranges
dynamically
the
proximately
in-
machine
rate,
and
100 percent
prediction
The
tional
conan
with
branch
based
size of basic
ma-
prediction
issue
ma-
ranges
to 1.4 to 2.7 IPC
presented.
prediction.
tion
The
branch
size,
Smith/Johnson
and
of
better
accurate,
branch
of
12,
assumptions.
coupled
window
with
percent
differs
of
each
100
instructions
based
a realistic
of functional
various
selection
effect
4 through
performance
implementable,
predictors:
The
the
were
are
combines
ments
the
under
configurations
figurations
set
show
configurations
Figures
with
Performance
branch
(integer)
perfor-
of the
benchmarks
configuration
synthetic,
junction
benchmarks,
the
prediction
85 percent
units
predic-
machine
benchmarks.
functional
of a branch
istics
nine
memory
branch
balanced
are also
different
percent
Middle
for
use
benchmarks.
machines
each
with
benchmarks
memory
and
Configurations
For
point
two
demonstrate
branch
is presented
Several
the
traced.
4.3
our
effective
Performance
for the integer
floating
Performance
integer
point
at
IPC
units
tion.
par-
constrained
1165
to
constraints.
1.9 to 2.2 IPC
ory
Flow
reported
17
than
point).
sim-
Data
execution
This
more
percent
of a properly
buffer
with
with
100
shown
advantages
no prefetch
from
of
asymptote
Unrestricted
case,
dependencies.
each
indicates
window
an
(UDF).
by
are
or
A six-functional-unit
restrict
of
figures,
benchmarks
it makes
machines
85 percent
2.7 to 2.9 IPC
Clearly
However,
and
IPC.
If we do not
top
as they
are
better
intensive
for
Results
mance
re-
is omniscient.
machine
RDF
units,
instruction,
restrictions
performs
point
sizes because
Four-functional-unit
is-
instruc-
Functional
of
performance
of
machine
of
the
the
window.
4.3.2
rate
out-of-order
prediction
the
that
full.
type
other
provide
issue
reduces
significant
12,
performance
an
abstract
window
No
approximate
only
the
the
regardless
is
any
unrealizable
simulation
eight.
cycle
window
execution.
is
This
each
executing
4 through
with
cycle.
from
execution.
this
shows
machine
instructions
each
Figures
curve
dataflow
instructions
sues
benchmarks,
completed
mechanism
As seen in the
machine
window
of the
For
benchmarks.
the
removed
is not
it impacts
on the floating
smaller
Machine
benchmarks,
point
has
by the
this
to
are
checkpointing,
While
in which
contrast
instruction
implementing
the
is in
required
in the
order
instructions
as the
size.
instructions
in the
This
constraint,
window
all
and
where
as soon
for
after
issued.
model
This
effective
only
completed,
were
execution.
and
a basic
units,
been
out-of-order
intro-
machine’s
have
Smith/Johnson
ex-
I-stream.
is within
the
imposes
instruction
the
the
The
effectively
packet
packet
Jouppi’s
speculative
an
in
limits
to
IPC),
for
by
that
execution
two
no
in issue
results
execution
performance
between
speculative
than
In-order
based
obtained
performance
dependencies
instructions
utilize
high
branches.
models
and
(less
execution
in-order
false
simulated
data
are
beyond
ducing
have
Jouppi[2]
to
a serious
of
we
of
impediments
machine
ecution
all
note
assumptions
of
on the
and
latencies,
the
we
functional
unit
the real branch
smaller
latency
ma-
chine
ranges
marks.
from
The
mains
2.8 to 5.8 on the floating
performance
on the integer
point
bench-
benchmarks
the
re-
register
value
issue from
unchanged.
value
the
is known.
instruction
4.3.3
Lack
Analysis
ber
Performance,
clock
measured
cycle,
in
instructions
can be modeled
executed
per
where
clock
I is the
cycle,
is the
and
dynamic
and
cycle
the
dynamic
unit
compiler
chine
below
of at
one
issued
only
ρ can
code
are minimized.
presented,
value
branch
every
issue
will
issue
that
clearly
of the
chine,
RDF.18
The
and
RDF
only
3 memory
size,
and
machines
The
of clock
cycles
This
availability,
prediction
range
model,
in which
and
the first
instruction
full
window
three
factors
and
In
these
have
effect
factors
certain
a major
circumstances,
instructions
architectures
in a register,
play
bandwidth
carded
after
The
effect
may
not
which
the jump
a real
the
since
location
permit
cannot
a jump
after
a prior
free).
There
cache
in our
which
example,
of branch
sinlula-
be discarded.
this
issue
after
the
of predicting
it.
There
are
reduce
the frequency
branches
algorithm.
execute.
This
dis-
branch.
11.
It
is clearly
is most
branch
than
8F3M.
greater
effects
window
it is removed
a minimum,
time
The
4-12
issue
after
so that
at small
beyond
window
some
efficiency,
branch
if the instrucThis
are likely
overflow
sizes.
by
is equal
the compiler
point,
is
the window
lifetime
effect
attempt
to be avail-
or by increasing
of window
life-
instruction
it is issued.
operands
is issued
effect
of an
from
having
is
machine.
lifetime
can be increased
by either
code
RB
instruction
instruction
for operands
wait
of
prediction
Instruction
issuance
in gcc
performance
arise when
size.
from
until
At
the
the
visible
notable
omniscient
the window
by
re-
of instructions
with
time
or
Second,
machine
as the
From
instruction
misprediction
window
it
instructions,
machines:
the
stalls.
until
all instruc-
8F3M
full
issue
is issued
the
exceeds
size.
must
a mispredicted
50 percent
must
causes
issue to branch
This
However,
organization
fect.
Consider
there
the win-
can be seen in
Once
other
the
factors
misprediction
are code fragments
or window
the
window
such
losses,
size increases
Livermore
Loops
for
which
will
have
kernel
#5:
to
DO
in
10
i
X(i)=
to a location
be performed
data
as
etc.
dominate.
un-
however,
from
For
beNeu-
instruction
modeled
the number
tion
static
on performance.
be known.
of branch
of reducing
6) for
figures
prediction
machine
even
eliminating
4, 5, 6, 7, and
size increases
mis-
In a pure
no part
con-
a subsequent
(lock-up
with
prediction
to its latency.
dow
instruction
branch
cache
First,
either
branch
able as an instruction
were effectively
frequency.
this
can be minimized
the frac-
branch
arise
bank
Von
mispredicted,
stalled
by
a better
retirement.
window
is ap-
factors:
num-
against
allow
able to retire
effect.
into
and
factor
availability,
of
can
in
buffer
branch
branch
for
remedies
measured
adds
both
the
of being
two
Finally,
policy,
represents
by four
instruction
is assumed.
stored
this
instructions
omniscience
machine
any given
for
efficiency
is determined
losses,
bounded
In
for
the
This
traffic
effectively
instead
time
window
machine
reduce
is a dichotomy
to
data
branch
almost
be
ma-
per issue cycle
stall
register
as a lack
memory
we can
was not
to having
to schedule
execution
pointer
fetch
by
of retirement
The
8F3M
only
8F3M
Otherwise,
exception
can
.33 to .72.
dynamic
issued.
RDF
is limited
as one branch
are identical.
proximately
The
the
of the
data
mispredicted
(figure
were
factor
and
it has been
after
of view
in figures
for
the performance
to that
latencies.
such
efficiency
by comparing
machine
ports.
with
issue
machine
operation
issue restrictions,
tion
static
seen in the results
levels
size
of the
that
issued
the 8F3M
effects
there
a mispredicted
duce the latency
performed.
The
time
has the effect
ma-
optimizations
that
to deal
misprediction
the
using
restriction
be emphasized
way
of misprediction
through
a superscalar
misses,
the cache
condition
is equivalent
caused
one multiply
that
no superscalar
to access the
has missed
tions
be is-
one multiplier)
for
load
the point
1 unless
as those
will
is possible
load
is determined
per
can
the
condition
be issued.
example,
memory
Branch
a branch
be improved
such
It should
branch
that
such
of only
It
memory
From
de-
machine.
beyond
has one
A compiler
reorganize
the results
most
(e.g.,
Clearly,
assistance.
could
effects
of the
stream
out
traffic
For
This
must
until
tions.
by the
is program/data
restrictions,
machines.
misses.
misses.
de-
on performance.
the maximum
conflicts
p.
by the
restrictions
due to the presence
reduce
6
ρ and
instruction
mann
can
cache
We point
cache
this
availability
that
is no corresponding
of instructions
Other
and
Both
ρ is determined
(which
issue
clock
per
factor,
factor.
bound
factor
instruction
by functional
further
the
in instructions
I is determined
an upper
number
1 instructions.
per clock
rate
no instructions
sued in a single
the
efficiency
a restriction
and
reduce
issue
stream
by
example,
rate
one inclusive.
efficiency
instruction
pendent)
clock
issue
issue
issue
efficiency
and
establishes
static
will
zero
maximum
signer
For
execution
between
The
static
ρ is the
dynamic
δ range
The
maximum
define
of instruction
etc.
machine
is issued
availability.
of instruction
tween
ipc = I * ρ * δ
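The multiplicative model can be evaluated directly. The following is a minimal sketch, assuming I is the static issue rate (the maximum number of instructions issued per clock), ρ is the dynamic issue rate, and δ is the issue efficiency, with ρ and δ each between zero and one inclusive. The function name, parameter names, and the sample values are hypothetical and are not measurements from this study.

```python
def modeled_ipc(static_issue_rate, dynamic_issue_rate, issue_efficiency):
    """Return ipc = I * rho * delta for the three-factor performance model.

    static_issue_rate  -- I, assumed to be the maximum instructions issued per clock
    dynamic_issue_rate -- rho, assumed to lie in [0, 1]
    issue_efficiency   -- delta, assumed to lie in [0, 1]
    """
    if static_issue_rate < 0:
        raise ValueError("static issue rate must be non-negative")
    for name, value in (("dynamic_issue_rate", dynamic_issue_rate),
                        ("issue_efficiency", issue_efficiency)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between zero and one inclusive")
    return static_issue_rate * dynamic_issue_rate * issue_efficiency


if __name__ == "__main__":
    # Hypothetical machine: issues up to 8 instructions per clock,
    # fills half of its issue slots, and keeps 75 percent of issued work.
    print(modeled_ipc(8, 0.5, 0.75))
```

Because the factors multiply, a loss in any one of them caps the achieved rate regardless of how well the other two are doing.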
We
of instructions
flicts,
The
the jump
pointer
because
as:
is known.
time
until
DO 10 i = 2,n
X(i) = Z(i) * (Y(i) - X(i-1))
10 CONTINUE
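The recurrence in Livermore Loops kernel #5 can be written out as a short sketch to make the loop-carried dependence visible: each X(i) needs the X(i-1) produced by the previous iteration, so iterations cannot overlap freely however wide the machine is. The function name and the sample data below are hypothetical, and the Fortran 1-based indices are mapped to Python 0-based lists.

```python
def livermore_kernel5(x, y, z):
    """Apply X(i) = Z(i) * (Y(i) - X(i-1)) for i = 2..n (1-based), in place.

    The first element of x is the seed value; every later element
    depends on the element computed just before it.
    """
    if not (len(x) == len(y) == len(z)):
        raise ValueError("x, y and z must have the same length")
    for i in range(1, len(x)):
        # x[i - 1] was written by the previous iteration: a true data dependence.
        x[i] = z[i] * (y[i] - x[i - 1])
    return x


if __name__ == "__main__":
    x = [1.0, 0.0, 0.0, 0.0]
    y = [3.0, 5.0, 7.0, 9.0]
    z = [2.0, 1.0, 0.5, 2.0]
    print(livermore_kernel5(x, y, z))
```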
code relittle
ef-
This
simple
loads,
recurrence
a floating
tiply,
a store,
perhaps
degree
to execute
each
a full
the
state
Inside
loop,
type
will
The
rate
dynamic
such
of latency
and fpppp
induced
to issue
the
each
a minimum
itera-
For large
will
out
achieve
efficiency
situations
access
the
Standard
presence
aspects
exhibit
this
6
problem.
Issues
based
of implementing
on RDF
difficult
principles
issues
multaneous
data
Instruction
what
delivery
delivering
them
paper
multiplier
ficult.
[1],
to
the
that
to issue
address
a fetch
address,
of fetching
degree
four
holds
cache
lines
within
the
Once
structions
Issue
can
address
should
resolved
until
of
branches
packet
The
be issued,
in
as well
branch
needs
in time
problems
be
decoded
caching
an
instruction
in the
DIC.
instructions,
packet
can
be
The
DIC,
in
can
store
branch
the
through
(DIC)
instructions
next
[10,
in-
fetch
of
branch
Rather
of
to
thus
improving
we have
to the average
per
come
window
increase
instructions
While
we are not
decoded
window
we expect
size
which
the
instructions
In
in
this
way,
window
space
a different
use
be
increments.
in
all
imple-
to eliminate
of the
one branch
This
window,
would
per
being
limits
size of a basic
cycle,
issued
in
the available
block.
increase
Allowing
the
amount
numbers
of 2.0 to 5.8 instructions
data
and
today.
capabilities
shows
information,
limita-
instruction
instructions
the branch
a restricted
is reasonable
stored
storing
on the perthese
is possible
only
cycle
our results
from
predecoded
to
is
parallelism.
bounded
prediction
assumed
with
cache,
and
that
performance.
branches
will
on
mechanisms
valuable
it
the branch.
rates
than
data
order
after
better
the same cycle
bandwidth
than
today.
packet
With
following
a limited
to more
that
consume
make
in-
applications
execution.
and no instructions
what
up
of single
the
be removed.
and
Furthermore,
a
in
only
of checkpointing,
inefficiency
simula-
substantially.
completed
instructions
waiting
this
cycle
These
use
have
of available
tar-
fetch.
packet
two
When
in whole
only
capable
Our
hardware
required
and
vi-
[1, 2].
limitations
improve
window
issued,
multiple
the
the
removed
parallelism
address.
will
in an instruction
addition
in
the
of producing
the performance
are
Second,
the
presence
end
the
3].
each other
of
integer
and
imposed
were
mentation
are not
next
the next
can
to implement
achieved.
be
packets
completed
the
many
the
and
to perform
cache
the
the
re-
techniques
processor
performance
we have
from
in
on large
we have
processors
while
address
however,
determine
circumvented
instruction
undecoded
each
can be
how
what
or-
per memory
excess
stated
software
Packets
the
in
can
example,
removed
sequen-
to fetch
non-sequential
to be predicted
get calculated
can
thus
the
demonstrated
stream
interests
the
are removed,
stream
in
fetched,
Furthermore,
packet
as a possibly
been
constraints,
decode.
the
dif-
that
two
where
determine
and
These
after
that
example,
of the fetch
has
constraints
be.
For
in the
today,
necessary
instructions
determining
make
Another
with
in the
are reasonable
with
available
tions
limit
shown
that
formance
line.
of instructions
consistent
rates
per cycle
ex-
simultaneously,
assuming
four
would
consistency
instruction
have
that,
For
fetching
that
lies in quickly
packet.
engine,
size
cache
a beginning
cycle
and
as proposed
lines
of the alignment
cache
a group
difficulty
next
first
results
Note
point
delivery
technique
instructions,
guarantees
regardless
per
consistent
here.
in one clock.
four
for
branches
the raw bandwidth
instructions
cache
delivered
instruction
can
with
stores.
execution
3 instructions
and
one floating
issues
cache
them,
units
issues,
make
several
for a superscalar
line
tion
si-
determining
function
these
the
way of providing
several
of
(e.g., only
issue per clock)
is a good
tial
proper
caches
fully
cache
Remarks
of a single
structions
these
multiple
fetching
bandwidth
restrictions
We briefly
Given
and
problem
a small
be implemented
small
cache
is only
processors
be fetched,
from
issue
Among
delivery
is the
must
Aside
machine
trivial.
machine
accesses.
instructions
ecution,
are not
First,
ported
Concluding
ability
a superscalar
are instruction
to provide
less prohibitive,
of memory
of delivering
Several
We briefly
be used
or recode
benchmarks
Implementation
Its
ports
of providing
cycle.
a conventional
would
uses one, single
unit.
This
5
ports.
multiple
can
by
cache
be used to keep the
is to re-
instructions
multiple
quest
6 is
bandwidth.
backed
small
sizes.
is that
per machine
memory
The
ganization
of twelve.
accesses
cache
of the
packet
problem
which
be used.
cost
and
techniques
necessary
associative
in Table
prior
machine
of the executed
doduc
two
arise.
execution
for
describe
given
of one clock
remedy
branch:
addresses,
implementation
a superscalar
on the
the
target
Another
memory
take
quickly
branch
mul-
multiple
clock
will
is dependent
the
only
duce the latencies
the loop.
one
of iterations,
The
and
the latencies
condition
at an issue
this
.08!
just
two
point
On
iteration
using
iteration
window
number
steady
each
at least
compare
iteration.
it takes
However,
1. Since
only
per
require
a floating
an increment,
8 machine,
of 12 clocks
‘n’,
and
will
subtract,
8 instructions
iteration.
tion,
relation
point
flow
issue
As
rate,
levels
increase,
per
cycle
suggesting
well
in
that
in excess
consistent
machine
the
this
17
to
(the
1165
is possible
of 5 instructions
has
with
and
sizes and
In the limit,
size and issue rate
per
that
of integration
window
correspondingly.
engine
issue
our unUDF)
range
(yet),
per
[8]
cycle.
Finally,
and
perhaps
re-emphasizing
tained
that
using
sue.
code
compiler
motion
support
mented
today
deliverable
is
that
is still
that
deliver
more
issue
increase
than
still
the
instruction
will
be
[10]
pro-
question.
[2]
M. Johnson,
on Multiple
Third
N.P.
Jouppi,
Level
tems,
for
D. R. Ditzel,
and
Architecture
of the
Proceedings
on
Computer
“Available
on
and
Patt,
Op-
[12]
and
Super-
of the
Third
W.
Hwu,
New
Microarchitecture:
Proceedings
and
and
Yeh,
“,
Support
Operating
[13]
Sys-
“Adaptive
Steve
Melvin
and
Parallelism
ware
and
Software
18th
Annual
ture,
(May
Robert
Patt,
Shebanow,
Rationale
18th
“HPS,
and
W.
Hwu,
Regarding
Workshop
and
HPS,
M.
A
[14]
pp.103-
Shebanow,
High
Proceedings
Patt,
and
W.
Restricted
Minimal
18th
“HPSm,
Mi-
[15]
1985),
ture,
(June
1986),
on
Per-
computer
[16]
Hav-
Patt,
Execution
of
the
on Computers,
"Checkpoint
(December
Repair
IEEE
1987),
University,
Report
(June
“Super-Scalar
No.
Processor
CSL-TR-89-383,
Hard-
Proceedings
on
Computer
Nix,
of the
Architec-
John
O'Donnell,
Rodman,
computers
and
“A
VLIW
Compiler.”,
, (August
Joseph
Available
C-33,
ArIEEE
1988),
Fisher,
for Very
IEEE
11, (November
Edward
M.
Riseman
Inhibition
of
Jumps.”,
IEEE
D.
J.
967-
“Measuring
Long
Instruction
Transactions
1984),
for
[17]
Trans-
G.
on com-
968-976.
Transactions
Design”,
Stanford
1989).
1972),
Muraoka,
of
Operations
in
Fortran–Like
Speedup”,
Transactions
1970),
889-895.
“The
Condition
computers
and
C-21,
S–C.
M.
of
on
1972),
J.
Chen,
Simultaneously
and
Transactions
12, (December
and
on
by
Programs
IEEE
Execution
C. Foster,
1405-1411.
Y.
S. Tjaden
IEEE
Caxton
Parallelism
Kuck,
Parallel
pp.1496-
and
Potential
Number
sulting
tober
Johnson,
Fine-
Combined
Scheduling
Architectures”,
cutable
Architec-
1514.
Technical
on
Nicolau
ers C-21,
Machines”,
“Exploiting
Robert
for a Trace
Alexandru
the
pp.297-307.
and Y.N.
Out-of-order
a High
Architecture
Proceedings
Symposium
Patt,
Through
and Paul
the Parallelism
Annual
(December
Flow
Yale
“Critical
Performance
of the
Functionality.”,
Annual
Hwu
Hwu,
Data
Predic-
of Michigan,
1991)
12, (December
13th
[7] W.M.
pp.
979.
Workshop
1985),
on Microprogramming,
formance
actions
1987),
Introduc-
Annual
(December
A
pp.109-116.
W.W.
MicroSympo-
Branch
Techniques”,
Colwell,
Word
croarchitecture.”,
[6]
(June
University
Symposium
Papworth,
puters
ing
McLellan,
CRISP
Annual
Training
Report,
Grained
David
108.
[5] Y.N.
H.R.
of the l./th
Architecture,
Technical
chitecture
M.
of the
Microprogramming,
Issues
pp.
In-
pp.272-282.
tion.”,
[4] Y.N.
Sym-
1987),
(1991).
Instruction-
Architectural
Languages
1989),
Proceedings
Superscalar
Conference
Tse–Yu
Transactions
on
Annual
(June
A. D. Berenbaum
Hardware
tion
pp.290-302.
Proceedings
Machines.”,
(April
[3] Y.N.
of the ldth
Architecture,
Issue
Pipelined
on Architec-
Languages
1989),
D. Wall,
Programming
Horowitz,
Issue”,
Programming
Parallelism
M.A.
Conference
(April
and
ternational
for
Instruction
for
Systems,
pipelined
and
International
Support
erating
“Instruction
Interruptable
Proceedings
processor.”,
[11]
tural
1989),
309-319.
Smith,
“Limits
12, (December
S. Vajapeyam,
on Computer
“The
References
of the
38, No.
and
Transactions
27-34.
perfor-
stream
and
Processors.”,
sium
[1] M.D.
Parallelism
IEEE
for High-Performance
posium
imple-
to what
G. S. Sohi
Logic
further.
be
twice
[9]
With
execu-
can
Performance”,
Vol.
Distribution
Machine
density,
further.
the limits
by single
an open
pp.1645-1658.
processors
in [1, 2], and
tomorrow
on Computers,
to
Nonuniform
and
on
is-
granularity
should
Effect
for superscalar
higher
larger
Its
ob-
assistance
“The
P. Jouppi,
been
to increase
[12], performance
suggested
cessors
provide
Norman
of Instruction-Level
is worth
have
compiler
produce
line
it
results
optimized
to
to
bottom
mance
not
our
can be expected
units
The
of
importantly,
architecture-specific
performance
tion
all
compilers
With
perform
most
Flynn,
Independent
computers
“On
Exe-
Their
Re-
on comput-
1293-1310.
“Detection
and
Instructions”,
C-19,
10,
(Oc-