Two-Level Adaptive Training Branch Prediction

Tse-Yu Yeh and Yale N. Patt
Department of Electrical Engineering and Computer Science
The University of Michigan
Ann Arbor, Michigan 48109-2122
Abstract

High-performance microarchitectures use, among other structures, deep pipelines to help speed up execution. The importance of a good branch predictor to the effectiveness of a deep pipeline in the presence of conditional branches is well-known. In fact, the literature contains proposals for a number of branch prediction schemes. Some are static in that they use opcode information and profiling statistics to make predictions. Others are dynamic in that they use run-time execution history to make predictions.

This paper proposes a new dynamic branch predictor, the Two-Level Adaptive Training scheme, which alters the branch prediction algorithm on the basis of information collected at run-time. Several configurations of the Two-Level Adaptive Training Branch Predictor are introduced, simulated, and compared to simulations of other known static and dynamic branch prediction schemes. Two-Level Adaptive Training Branch Prediction achieves 97 percent accuracy on nine of the SPEC benchmarks, compared to less than 93 percent for the other schemes. Since a prediction miss requires flushing of the speculative execution already in progress, the relevant metric is the miss rate. The miss rate is 3 percent for Two-Level Adaptive Training vs. 7 percent (best case) for the other schemes. This represents more than a 100 percent improvement in reducing the number of pipeline flushes required.

1 Introduction

Pipelining, in use at least as early as [18] and continuing through the microarchitectures of today [6], is one of the most effective ways to speed up the execution of instructions. The effectiveness of a deep pipeline, however, can be impeded by branches. An unconditional branch requires its target address to be calculated before the instructions at the target can be fetched; a conditional branch requires, in addition, the branch condition to be resolved before the execution path is known. If the pipeline simply waits for unresolved branches, instruction fetching and decoding stall and pipeline bubbles are introduced; since conditional branches make up a large fraction of dynamic instructions, the resulting performance loss is considerable, and it grows as pipelines get deeper and as superscalar processors issue more instructions per cycle.

There are two ways to reduce this loss. The first is to resolve the branch as early as possible in the pipeline, so that fewer bubbles are introduced. The second is to predict the outcome and target of the branch and to continue fetching, decoding, and issuing instructions from the predicted path before the branch is resolved. With an accurate branch prediction mechanism and speculative execution, most of the pipeline bubbles can be eliminated; when a prediction is wrong, however, the speculative work must be flushed, so the effectiveness of this approach depends on the prediction accuracy.

Branch prediction schemes can be classified as static or dynamic. Static schemes make the same prediction for a branch every time it is encountered. The simplest is to predict that all branches are taken, which achieves approximately 60 to 68 percent accuracy, as reported by Lee and Smith [13]. Predictions can also be based on the opcode or on the branch direction, since certain classes of branch instructions tend to be strongly biased in one direction: for example, the Backward Taken, Forward Not-taken scheme exploits the fact that backward branches usually close loops and are taken for all iterations of a loop except the last one. Profiling [12, 5] can also be used: the predicted direction of each static branch is preset by measuring its tendencies in a sample run of the program. Profiling is usually more accurate than the simpler static schemes, but the statistics are collected with particular data sets, so they may not be applicable when the program runs with other data sets, and a single preset direction cannot follow the irregular dynamic behavior of a branch.

Dynamic schemes use execution history collected at run-time to make predictions. Smith [16] proposed a scheme in which each entry of a table indexed by the branch address contains a 2-bit saturating up-down counter; the counter keeps track of the recent outcomes of the branch and the branch is predicted in the direction indicated by the counter. Lee and Smith [13] proposed the Branch Target Buffer, in which each entry records a branch's target address along with prediction information such as a saturating counter or the result of the last execution. Lee and Smith also proposed a scheme, referred to as Static Training in this paper, which predicts a branch on the basis of the history pattern of its last several executions; the prediction for each pattern, however, is determined from statistics accumulated in a pre-run of the program, so the program has to be run beforehand and the resulting predictions do not adapt to the current execution. The disadvantage of such schemes is that the pre-collected statistics may not be applicable to the execution being predicted.

This paper proposes a new dynamic branch prediction scheme, Two-Level Adaptive Training, which uses two levels of execution history, the history of the last several occurrences of each branch and the branch behavior for the last occurrences of each history pattern, and which collects all of its history information on the fly by updating its structures during execution. Since the information is gathered at run-time, the predictions adjust to the current execution of the program and no pre-runs are necessary.

This paper is organized as follows: Section 2 introduces the concept and structure of Two-Level Adaptive Training Branch Prediction. Section 3 discusses practical implementation methods. Section 4 describes the trace-driven simulation methodology, the benchmarks, and the simulation model, including the configurations of the schemes simulated in this study. Section 5 reports the simulation results of the Two-Level Adaptive Training schemes, the Static Training schemes, and the other static and dynamic schemes used for comparison. Section 6 contains concluding remarks.
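The claimed "more than 100 percent" improvement follows directly from the two miss rates quoted above; the short derivation below is a worked example using only those numbers, not additional data.

```latex
% Worked out from the miss rates quoted above (3 percent vs. 7 percent),
% with N the number of dynamic conditional branches:
%   flushes(other schemes, best case)    = 0.07 N
%   flushes(Two-Level Adaptive Training) = 0.03 N
\[
  \frac{0.07\,N}{0.03\,N} \approx 2.3
\]
% The best of the other schemes therefore causes more than twice as many
% pipeline flushes, i.e. the improvement in flush count exceeds 100 percent.
```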
2 Two-Level Adaptive Training Branch Prediction

2.1 Concept of Two-Level Adaptive Training

The trend toward deeper pipelines and superscalar processors that rely on speculative execution [1, 8] means that a large amount of speculative work has to be discarded whenever a branch is mispredicted; the performance of such machines therefore depends critically on a high-accuracy branch predictor. This is the motivation for proposing a new dynamic predictor which uses two levels of branch history information to make its predictions and which adapts that information to the program as it executes.

Two-Level Adaptive Training Branch Prediction has the following characteristics: the first level of history records the outcomes of the last k executions of each branch, and the second level records the branch behavior that followed each particular pattern of those outcomes. All of the history information is collected on the fly during program execution, by updating the history registers and the pattern table after each branch; predictions are therefore based on the run-time behavior of the current execution rather than on statistics gathered beforehand.

The structure of the scheme is shown in Figure 1. It consists of two major data structures: the branch history register table (HRT) and the pattern table (PT). Each entry of the history register table contains a k-bit branch history register (HR), a shift register which records the outcomes, taken or not taken, of the last k executions of one branch; if the branch was taken, a 1 is shifted in, otherwise a 0. The history register table is indexed by the branch instruction address, so each static branch keeps its own history register. Since a history register has k bits, there are 2^k different history patterns, and the pattern table has 2^k entries, one for each distinct pattern; all branches share the same pattern table. Each pattern table entry contains pattern history bits which record the branch behavior that followed the occurrences of that pattern.

When a conditional branch B_i is to be predicted, its entry in the history register table is located with the branch address, and the content of its history register is used to index the pattern table. The pattern history bits in the pattern table entry addressed by that history pattern are then used to make the prediction.
Pat&n
Table (PT)
BraDchmtllry
Pattero
—
OQ-...oo
00...0I
00-,10_
BrauchEistmy B.egk& (Eli)
(Mftleftwheoupiate)
RwBi@+I
. . . . . .. RwR~.I
II
s:
Pattern m)
Hie4tK0’y edidion
o
~~~~~ I
Bit(a)
‘*’y=
“’$ ,“ ,,,,,0
7
(LT)
Ime-t-Time
Auta!mtnn
Figure
1:
The
structure
of
the
l’ramitim
Two-Level
Figure
Adaptive
scheme.
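To make the two-level lookup and update concrete, here is a minimal software sketch of the mechanism just described. It is not the authors' implementation: the use of a plain dictionary in place of a hardware history register table, the 12-bit history length, and the choice of a 2-bit saturating counter (automaton A2, described below) for the pattern history bits are illustrative assumptions.

```python
K = 12                                   # history register length (bits)

hrt = {}                                 # branch address -> k-bit history pattern
pattern_table = [3] * (2 ** K)           # 2^k entries, here 2-bit counters, biased taken

def predict(branch_addr):
    """First level: fetch the branch's history register.
       Second level: use that pattern to index the pattern table."""
    history = hrt.get(branch_addr, 0)
    return pattern_table[history] >= 2   # predict taken if the counter is 2 or 3

def update(branch_addr, taken):
    """After the branch resolves, update both levels."""
    history = hrt.get(branch_addr, 0)
    counter = pattern_table[history]
    # Update the pattern history bits for the pattern that just occurred.
    pattern_table[history] = min(counter + 1, 3) if taken else max(counter - 1, 0)
    # Shift the outcome into the branch's history register (shift left).
    hrt[branch_addr] = ((history << 1) | int(taken)) & ((1 << K) - 1)
```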
Let R_{i,j} denote the outcome of the j-th execution of branch B_i, and let c be the index of the current execution. Before the branch executes for the c-th time, its history register contains the pattern R_{i,c-k} R_{i,c-k+1} ... R_{i,c-1}, the outcomes of its last k executions. In order to keep track of the branch behavior that follows each of the 2^k possible patterns, there is one pattern table entry for each pattern; the entry indexed by the pattern above is denoted PT_{R_{i,c-k} R_{i,c-k+1} ... R_{i,c-1}}. The pattern history bits in that entry are used to make the prediction for the branch.

After the conditional branch is resolved, its outcome R_{i,c} is shifted left into the history register, whose content then becomes R_{i,c-k+1} R_{i,c-k+2} ... R_{i,c}, and the same outcome is used to update the pattern history bits in the pattern table entry that supplied the prediction. Only the executions whose history registers produce the same pattern update the same pattern table entry; the pattern history bits therefore record what the branch did the last time or times that pattern appeared.

The updating of the pattern history bits can be characterized by a finite-state machine. The state S_c is represented by the pattern history bits, and the transition function \delta takes the old state and the branch outcome as inputs to generate the new state:

    S_{c+1} = \delta(S_c, R_{i,c})        (1)

The prediction is based only on the current state; the prediction decision z_c is given by the output function \lambda:

    z_c = \lambda(S_c)                    (2)

The finite-state machine is therefore a Moore machine, and a straightforward way to implement the transition function \delta and the output function \lambda is with combinational logic.

The state transition diagrams of the finite-state machines used in this study are shown in Figure 2. The Last-Time automaton uses only one bit of pattern history: it records whether the branch was taken the last time the same pattern appeared, and the next occurrence of the pattern is predicted to go the same way. Automata A1, A2, A3, and A4 each use two bits of pattern history. Automaton A2 is the 2-bit saturating up-down counter that Smith proposed for his branch prediction scheme [16] and that is also used in Lee and Smith's Branch Target Buffer design [13]: the counter is incremented when the branch is taken and decremented when it is not, and the branch is predicted taken when the counter value is greater than or equal to two, not taken otherwise. Automata A1, A3, and A4 are variations of the two-bit automaton which differ in their state transitions, as shown in Figure 2.
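As a concrete reading of equations (1) and (2), the sketch below writes two of the automata as explicit transition and output functions in software. The state encodings are assumptions chosen for illustration; the behavior follows the descriptions above (A2 as a saturating up-down counter, Last-Time as a single bit).

```python
# Automaton A2: 2-bit saturating up-down counter, states 0..3.
def delta_A2(state, taken):          # equation (1): next state
    return min(state + 1, 3) if taken else max(state - 1, 0)

def lambda_A2(state):                # equation (2): prediction output
    return state >= 2                # taken iff the counter is 2 or 3

# Last-Time automaton: one bit that records the previous outcome.
def delta_LT(state, taken):
    return 1 if taken else 0

def lambda_LT(state):
    return state == 1                # predict what happened last time
```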
Figure 2: The state transition diagrams of the automata used to update the pattern history bits: Last-Time, A1, A2, A3, and A4.

Two-Level Adaptive Training and Lee and Smith's Static Training [13] both predict a branch according to the history pattern of its last several executions, but they differ in how the prediction for each pattern is determined. In Static Training, the prediction for a given pattern is preset from profiling statistics gathered in a training run; the pattern table contents are fixed before execution and do not change at run-time. If the branch behavior of the program under the current data set differs from the behavior captured in the training run, many of the preset predictions are wrong and accuracy is lost. In Two-Level Adaptive Training, on the other hand, the decision for each pattern is made by the state transition automaton in the pattern table entry, which keeps adjusting to the branch behavior of the current execution; the predictions therefore adapt when the program changes its behavior, for example when it is run with different data sets.
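The contrast can be made concrete with a small sketch of how a Static Training pattern table could be preset from a training run. The paper specifies only that each pattern's prediction is determined from pre-run statistics; the counting and majority-vote details below, and the trace format, are illustrative assumptions.

```python
from collections import defaultdict

def preset_pattern_table(profile_trace, k=12):
    """profile_trace: iterable of (branch_addr, taken) pairs from a training run.
       Returns a preset prediction bit for each observed k-bit history pattern."""
    taken_count = defaultdict(int)       # per-pattern taken occurrences
    total_count = defaultdict(int)       # per-pattern total occurrences
    history = defaultdict(int)           # per-branch k-bit history register

    for addr, taken in profile_trace:
        pattern = history[addr]
        total_count[pattern] += 1
        taken_count[pattern] += int(taken)
        history[addr] = ((pattern << 1) | int(taken)) & ((1 << k) - 1)

    # Preset bit = the direction the pattern followed most often in the
    # training run; this bit is never changed during the test run.
    return {p: taken_count[p] * 2 >= total_count[p] for p in total_count}
```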
3 Implementation of Two-Level Adaptive Training

3.1 Prediction Latency

Making a prediction requires two sequential lookups: the history register table is accessed first to fetch the history register of the branch, and the content of the history register is then used to access the pattern table. On a high-performance processor the prediction is needed very early, usually in the cycle in which the branch instruction is fetched, and possibly before the previous execution of the same branch has been resolved and the tables updated. This case occurs, for example, with the branch of a tight loop on a deep-pipelined or superscalar machine: the next execution of the loop branch must be predicted before the result of the previous execution is confirmed. Such a branch is usually still predicted correctly, because a loop branch has a high tendency to be taken, but in general the tables may not yet reflect the most recent outcome when the next prediction is made.

If the two lookups cannot both be performed in one cycle, the prediction can be prepared ahead of time: when a branch's history register is updated with a resolved outcome, the new history can immediately be used to access the pattern table, and the resulting prediction can be recorded together with the history register in the history register table entry. The prediction is then available with a single lookup the next time the branch is fetched.

3.2 Implementation Methods of the History Register Table

It is not feasible in a practical design for every static branch to have its own history register, so the history register table must be implemented with a limited number of entries; it is hard to squeeze the histories of all the static branches of a large program into such a table without some loss. Two practical implementations of the history register table are proposed and simulated in this study, along with an ideal one used as a reference.

Per-address History Register Table (AHRT). The history register table can be implemented as a set-associative cache of history registers. The lower part of the branch address is used to index the table and the higher part is used as a tag; in this study the AHRT is 4-way set-associative, and a Least-Recently-Used (LRU) algorithm is used to choose the entry to replace within a set. When a branch does not have an entry in the AHRT, a new entry is allocated for it. Because each entry is identified by a tag, the history kept in an entry belongs to a single branch; accuracy is lost only when a branch misses in the table and its history must be rebuilt.

Hash History Register Table (HHRT). A cheaper approach is to implement the history register table as a hash table: a hash function of the branch address (in this study, the lower bits of the address) selects an entry directly, and no tags are kept. Different branches which hash to the same entry share it, so the history recorded in the entry can be a mixture of the histories of the colliding branches. The predictor can still adapt to the actual execution behavior, but the interference due to collisions reduces the prediction accuracy.

Ideal History Register Table (IHRT). The IHRT is an ideal table with one history register for every static branch, so no history is ever lost to misses or interference. It is not practical to implement, but it is simulated to show the accuracy that could be obtained and how much is lost by the practical designs. In this study, AHRTs and HHRTs with 256 and 512 entries were simulated, in addition to the IHRT.
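The two practical organizations can be sketched as simple software models. The set indexing, tag derivation, and LRU bookkeeping below are illustrative choices, not details taken from the paper; the point is only the structural difference between a tagged set-associative table and an untagged hash table.

```python
class AHRT:
    """4-way set-associative history register table with LRU replacement."""
    def __init__(self, entries=512, ways=4):
        self.sets = entries // ways
        self.ways = ways
        # each set holds up to `ways` (tag, history) pairs, most recent last
        self.table = [[] for _ in range(self.sets)]

    def lookup(self, addr):
        index, tag = addr % self.sets, addr // self.sets
        way = self.table[index]
        for i, (t, hist) in enumerate(way):
            if t == tag:
                way.append(way.pop(i))      # refresh LRU order on a hit
                return hist
        if len(way) == self.ways:
            way.pop(0)                      # evict the least recently used entry
        way.append((tag, 0))                # allocate a new entry, empty history
        return 0

class HHRT:
    """Hashed history register table: no tags, colliding branches share entries."""
    def __init__(self, entries=512):
        self.table = [0] * entries

    def lookup(self, addr):
        return self.table[addr % len(self.table)]
```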
Figure 3: Distribution of dynamic instructions. (A stacked bar per benchmark showing the percentage of dynamic instructions that are conditional branches, other branch instructions, and non-branch instructions.)
4 Simulation Methodology

4.1 Description of Traces

Trace-driven simulations were used in this study. A Motorola 88100 instruction-level simulator (ISIM) was used to generate instruction and address traces, which are fed into the branch prediction simulator; the branch prediction simulator decodes the instructions, makes a prediction for each branch, and verifies the prediction against the actual execution path in order to collect the prediction statistics.

Branch instructions in the M88100 instruction set [4] are classified into four classes: conditional branches, unconditional branches, subroutine calls, and subroutine returns. This study focuses on the prediction of conditional branches, because the other classes can be predicted precisely with enough hardware support. Unconditional branches and subroutine calls are always taken, and their target addresses can be generated by adding the immediate offset in the instruction to the program counter, so the target is available as soon as the instruction is decoded. Subroutine returns jump back to their callers, whose addresses vary at run-time; a return address stack, as proposed by Kaeli and Emma [2], is able to predict return addresses precisely. When a subroutine call is detected, the return address is pushed onto the stack; when a return instruction is detected, the address on the top of the stack is popped and used as the predicted target. The return address prediction misses only when the stack overflows. Branches whose targets or conditions depend on register contents have to wait for the register value to become ready.

Nine benchmarks from the SPEC benchmark suite were used in this study: five floating point benchmarks, doduc, fpppp, matrix300, spice2g6, and tomcatv, and four integer benchmarks, eqntott, espresso, gcc, and li. Nasa7 is not included because it takes too long to finish. The floating point benchmarks tend to have regular, loop-bound branch behavior, for which high prediction accuracy is attainable; the integer benchmarks have more irregular branch behavior and are harder to predict. The traces range from about twenty million to 1.8 billion dynamic instructions per benchmark. The number of static conditional branches in each benchmark is listed in Table 1; the counts range from about 213 for the smallest benchmark to 6922 for the largest.

Table 1: Number of static conditional branches in each benchmark.
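A minimal sketch of the return address stack described above. The fixed depth and the handling of overflow by discarding the oldest entry are illustrative assumptions; the text only states that mispredictions occur when the stack overflows.

```python
class ReturnAddressStack:
    """Predicts subroutine return targets, following Kaeli and Emma [2]."""
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_addr):
        # Push the address of the instruction following the call.
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # overflow: the oldest entry is lost
        self.stack.append(return_addr)

    def predict_return(self):
        # Pop the most recent return address; mispredicts after an overflow.
        return self.stack.pop() if self.stack else None
```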
The distribution of dynamic instructions for the benchmarks is shown in Figure 3, and the distribution of dynamic branch instructions is shown in Figure 4. About 24 percent of the dynamic instructions are branch instructions for the integer benchmarks, and about 5 percent for the floating point benchmarks. As can be seen from Figure 4, about 80 percent of the dynamic branch instructions are conditional branches, which is therefore the class that should be studied in order to improve branch prediction accuracy.

Figure 4: Distribution of dynamic branch instructions.

4.2 Simulation Model

Several configurations of the Two-Level Adaptive Training Branch Predictor were simulated in this study, along with Lee and Smith's Static Training scheme, Lee and Smith's Branch Target Buffer designs, and some simple static prediction schemes, in order to compare their prediction accuracy. The simulated configurations are listed in Table 2.

The naming convention used to specify a configuration is Scheme(HRT(Size, Entry_Content), Pattern(Size, Entry_Content), Data). Scheme identifies the prediction scheme: AT for Two-Level Adaptive Training, ST for Static Training, and LS for Lee and Smith's Branch Target Buffer design. HRT identifies the history register table implementation: AHRT for the 4-way set-associative per-address table, HHRT for the hash table, and IHRT for the ideal table; Size is the number of entries (not specified for the IHRT), and Entry_Content is the information kept in each entry, for example 12SR for a 12-bit history shift register. Pattern identifies the pattern table: Size is the number of entries and Entry_Content is the automaton used to update each entry (A1 through A4, LT for Last-Time, or PB for the preset prediction bits used by Static Training). Data applies only to the Static Training schemes and specifies whether the training data set is the same as the testing data set (Same) or different from it (Diff); when Data is not specified, no training is involved. For example, AT(AHRT(512,12SR), PT(2^12,A2),) denotes a Two-Level Adaptive Training configuration with a 512-entry 4-way set-associative AHRT whose entries hold 12-bit history shift registers and a 4096-entry pattern table updated with automaton A2.

At the beginning of the simulation of each benchmark, the history register bits are initialized to 1's and the pattern table entries are initialized to states that predict taken. This initialization is used because branches are more likely to be taken than not taken: about 60 percent of the dynamic conditional branches in these benchmarks are taken.
Table 2: Configurations of the simulated branch prediction schemes.

  Two-Level Adaptive Training (AT):
    AT(AHRT(512,12SR), PT(2^12,A2),)     AT(AHRT(512,12SR), PT(2^12,A3),)
    AT(AHRT(512,12SR), PT(2^12,A4),)     AT(AHRT(512,12SR), PT(2^12,LT),)
    AT(AHRT(512,10SR), PT(2^10,A2),)     AT(AHRT(512,8SR),  PT(2^8,A2),)
    AT(AHRT(512,6SR),  PT(2^6,A2),)      AT(AHRT(256,12SR), PT(2^12,A2),)
    AT(HHRT(512,12SR), PT(2^12,A2),)     AT(HHRT(256,12SR), PT(2^12,A2),)
    AT(IHRT(,12SR),    PT(2^12,A2),)

  Static Training (ST):
    ST(AHRT(512,12SR), PT(2^12,PB), Same)    ST(AHRT(512,12SR), PT(2^12,PB), Diff)
    ST(HHRT(512,12SR), PT(2^12,PB), Same)    ST(HHRT(512,12SR), PT(2^12,PB), Diff)
    ST(IHRT(,12SR),    PT(2^12,PB), Same)    ST(IHRT(,12SR),    PT(2^12,PB), Diff)

  Lee and Smith's Branch Target Buffer design (LS):
    LS(AHRT(512,A2),,)    LS(AHRT(512,LT),,)
    LS(HHRT(512,A2),,)    LS(HHRT(512,LT),,)
    LS(IHRT(,A2),,)       LS(IHRT(,LT),,)

  AT - Two-Level Adaptive Training, ST - Static Training, LS - Lee and Smith's Branch Target Buffer Design; AHRT - 4-way Set-associative History Register Table, HHRT - Hash History Register Table, IHRT - Ideal History Register Table; SR - Shift Register, PB - Preset Bit, LT - Last-Time, A1-A4 - Automata.

In the LS configurations the table entry contains the prediction automaton itself (A2 or Last-Time) rather than a history register, so predictions are made directly from the recent outcomes of the branch without a pattern table. Three simple static schemes were also simulated for comparison: Always Taken, Backward Taken Forward Not-taken (BTFN), and a profiling scheme. In the profiling scheme, a pre-run of each benchmark counts, for every static branch, the number of times it is taken and the number of times it is not taken, and the branch is then predicted in the direction it takes most frequently. The prediction accuracy of the profiling scheme is calculated by taking the ratio of the sum, over all static branches, of the larger of the two direction counts to the total number of dynamic conditional branches.
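A minimal sketch of that profiling accuracy calculation (the trace format is an assumption; the calculation itself, taking the larger of the taken and not-taken counts for each static branch, follows the description above). Note that this measures the case in which the profiling run and the measured run use the same data set; with a different data set the preset directions may no longer match.

```python
from collections import defaultdict

def profiling_accuracy(trace):
    """trace: iterable of (branch_addr, taken) pairs from one run.
       Each static branch is predicted in its most frequent direction."""
    taken = defaultdict(int)
    not_taken = defaultdict(int)
    for addr, t in trace:
        (taken if t else not_taken)[addr] += 1
    branches = set(taken) | set(not_taken)
    correct = sum(max(taken[b], not_taken[b]) for b in branches)
    total = sum(taken[b] + not_taken[b] for b in branches)
    return correct / total
```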
5 Simulation Results

The simulation results are presented in Figures 5 through 10. Each graph plots branch prediction accuracy for the nine benchmarks on the horizontal axis; the vertical axis is the fraction of dynamic conditional branches predicted correctly. In addition to the individual benchmarks, each graph shows three geometric means: the category labeled Int G Mean is the geometric mean over the four integer benchmarks, FP G Mean is the geometric mean over the five floating point benchmarks, and Tot G Mean is the geometric mean over all nine benchmarks. Section 5.1 examines the effects of the design parameters of Two-Level Adaptive Training, Section 5.2 presents the results of the Static Training schemes, Section 5.3 presents the results of the other schemes, and Section 5.4 compares all of the schemes.

5.1 Effects of Design Parameters on Two-Level Adaptive Training

5.1.1 Effect of State Transition Automata

Figure 5 shows the prediction accuracy of Two-Level Adaptive Training schemes using different state transition automata in the pattern table; all of the schemes shown use a 512-entry 4-way AHRT with 12-bit history registers. The four-state automata A2, A3, and A4 all maintain an accuracy of about 97 percent and differ from one another by less than about 1 percent, with A2 usually performing the best in this study. The Last-Time automaton is clearly inferior to the four-state automata, because it records only what happened the last time a pattern appeared, so a single irregular outcome immediately changes its prediction; the four-state automata are more tolerant of such noise in the history. Automaton A1 was also simulated, but it was inferior to the other automata, so it is not included in the figure.

Figure 5: Two-Level Adaptive Training schemes using different state transition automata.

5.1.2 Effect of History Register Length

Figure 6 shows the effect of the history register length on prediction accuracy. Two-Level Adaptive Training schemes with 6-bit, 8-bit, 10-bit, and 12-bit history registers were simulated, each with a pattern table containing one entry for every pattern of that length. The prediction accuracy improves as the history register is lengthened, because a longer register keeps more of the branch's execution history; the improvement diminishes with increasing length, and the accuracy of the 12-bit configuration is about 97 percent.

Figure 6: Two-Level Adaptive Training schemes using different history register lengths.

5.1.3 Effect of History Register Table Implementation

Figure 7 shows the accuracy of Two-Level Adaptive Training schemes using different history register table implementations: the ideal IHRT, 512-entry and 256-entry 4-way set-associative AHRTs, and 512-entry and 256-entry HHRTs. The IHRT scheme performs best, the 512-entry AHRT second, the 256-entry AHRT third, the 512-entry HHRT fourth, and the 256-entry HHRT worst; the accuracy decreases in the same order as the hit ratio of the history register table decreases and the interference between branches increases. The accuracy of the 512-entry 4-way AHRT is close to that of the ideal IHRT.

Figure 7: Two-Level Adaptive Training schemes using different history register table implementations.
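The G Mean categories used in Figures 5 through 10 are geometric means of the per-benchmark accuracies. A one-line sketch of that averaging (the example values are hypothetical):

```python
def geometric_mean(accuracies):
    """Geometric mean of per-benchmark prediction accuracies, as used for
       the Int G Mean, FP G Mean, and Tot G Mean categories."""
    product = 1.0
    for a in accuracies:
        product *= a
    return product ** (1.0 / len(accuracies))

# e.g. geometric_mean([0.97, 0.96, 0.98, 0.97])  # hypothetical values
```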
5.2 Static Training Schemes

The Static Training schemes require a training run to gather the statistics from which the pattern table prediction bits are preset. To show the effect of training the predictor on one data set and executing the program with another, two data sets were used for each benchmark where possible: the schemes were trained with one data set and then tested both with the same data set and with the other one. The training and testing data sets are listed in Table 3; for benchmarks for which no second data set was used, testing with a different data set is not applicable and is marked NA.

Table 3: Training and testing data sets of the benchmarks. (The data sets include cexp.i and dbxout.i for gcc, the eight queens and towers of hanoi inputs for li, int.pri.eqn for eqntott, bca, cps, and ti for espresso, doducin for doduc, and greycode.in for spice2g6; NA marks the cases in which only one data set was used.)

Static Training keeps essentially the same run-time structures as Two-Level Adaptive Training: a history register table is still required to keep track of the last k outcomes of each branch, and the history pattern is used to index the pattern table. The difference is that each pattern table entry contains a prediction bit preset from the training statistics instead of an automaton updated at run-time, so the prediction for a given pattern never changes during execution. Although the profiling and presetting can be done in software, the history registers and table lookups still require hardware support at run-time, so Static Training is not much less expensive to implement than Two-Level Adaptive Training, and the training runs are an additional cost.

Figure 8 shows the prediction accuracy of the Static Training schemes with AHRT, HHRT, and IHRT implementations, trained and tested with the same data set and with different data sets. The effect of the history register length on Static Training is similar to its effect on Two-Level Adaptive Training: the accuracy increases with increasing history register length. When the training and testing data sets are the same, the accuracy of the Static Training scheme using 12-bit history registers is about 97 percent, similar to that of Two-Level Adaptive Training. When the program is run with a data set different from the one used for training, the accuracy drops: the degradations are within about 0.5 percent for some benchmarks and about 1 percent or more for others, depending on how much the branch behavior changes with the input data. Since the branch behavior of the floating point programs is more regular, the effect is not so apparent for them. These results show that Static Training is vulnerable to changes in program behavior from one data set to another, which Two-Level Adaptive Training avoids by collecting all of its history information at run-time.

Figure 8: Prediction accuracy of Static Training schemes.
5.3 Other Schemes

Figure 9 shows the simulation results of Lee and Smith's Branch Target Buffer designs, simulated with AHRT, HHRT, and IHRT implementations using automaton A2 and using Last-Time. The designs using A2 are more accurate than the designs using Last-Time, and the practical AHRT and HHRT implementations are somewhat lower than the ideal IHRT; the average accuracy of these designs is around 93 percent, noticeably lower than that of Two-Level Adaptive Training.

Figure 9: Prediction accuracy of Lee and Smith's Branch Target Buffer designs.

Three simple static schemes were also simulated. The Always Taken scheme predicts that every branch is taken; the Backward Taken, Forward Not-taken (BTFN) scheme predicts a branch according to its direction. The average prediction accuracies of these two schemes are only about 69 and 70 percent, much lower than those of the other schemes. BTFN is effective on loop-bound programs such as matrix300 and tomcatv, where its accuracy is as high as 98 percent, because backward branches close loops and are taken for all iterations but the last; on the other benchmarks it predicts poorly. The profiling scheme predicts each static branch in the direction it takes most frequently: a prediction bit associated with the branch is set or cleared depending on whether the branch was taken more often than not in the profiling run. Its run-time cost is fairly low, but the profiling run is an extra cost, and the chosen direction may not match the behavior of the program on other data sets. The average prediction accuracy of the profiling scheme is about 92.5 percent.

5.4 Comparison of Schemes

Figure 10 illustrates the comparison between the schemes mentioned above. The 512-entry 4-way AHRT was chosen for all the uses of a history register table, because it is simple enough to be practical to implement and its accuracy is close to that of the ideal IHRT. As can be seen from the graph, the Two-Level Adaptive Training scheme is the top curve, predicting with about 97 percent accuracy. Lee and Smith's Branch Target Buffer design, Smith's scheme, and the Static Training scheme predict almost as well as one another, around 93 percent, and the profiling scheme predicts with about 92.5 percent accuracy; all are roughly 4 to 5 percent lower than Two-Level Adaptive Training. The simple static schemes, Always Taken and BTFN, are markedly lower than all of the others.

Figure 10: Comparison of branch prediction schemes.
6 Concluding Remarks

This paper proposes a new dynamic branch prediction scheme, Two-Level Adaptive Training, which uses two levels of branch history information, the history of the last k occurrences of each branch and the branch behavior following each history pattern, and which adjusts its predictions to the run-time behavior of the program by examining the branch history collected during execution. Several configurations of the Two-Level Adaptive Training Branch Predictor, with different history register table implementations and history register lengths, were simulated on nine benchmarks from the SPEC benchmark suite and compared with simulations of other known static and dynamic prediction schemes, including Lee and Smith's Branch Target Buffer designs, the Static Training scheme, Smith's scheme, profiling, Always Taken, and Backward Taken Forward Not-taken.

Two-Level Adaptive Training achieves an average prediction accuracy of about 97 percent, which is about 4 percent higher than the best of the other schemes. Static and profiling-based schemes approach this accuracy only when the training data happen to match the behavior of the run being predicted; since Two-Level Adaptive Training collects all of its history information at run-time, no training runs are required and the predictor adapts to changes in program behavior.

The accuracy of the branch predictor is critical to the performance of a deep-pipelined or superscalar processor that uses speculative execution, because each mispredicted branch causes the speculative work in progress to be flushed. Reducing the miss rate from about 7 percent to 3 percent eliminates more than half of the pipeline flushes, an improvement of more than 100 percent over the other schemes, which translates into a considerable performance improvement for a high-performance processor.

References

[1] M. Butler, T-Y Yeh, Y. N. Patt, M. Alsup, H. Scales, and M. Shebanow, "Single Instruction Stream Parallelism Is Greater Than Two", Proceedings of the 18th International Symposium on Computer Architecture, (May 1991), pp. 276-286.

[2] D. R. Kaeli and P. G. Emma, "Branch History Table Prediction of Moving Target Branches Due to Subroutine Returns", Proceedings of the 18th International Symposium on Computer Architecture, (May 1991), pp. 34-42.

[3] T-Y Yeh and Y. N. Patt, "Two-Level Adaptive Training Branch Prediction", Technical Report, University of Michigan, (1991).

[4] Motorola Inc., "M88100 User's Manual", Phoenix, Arizona, (March 13, 1989).

[5] W. W. Hwu, T. M. Conte, and P. P. Chang, "Comparing Software and Hardware Schemes for Reducing the Cost of Branches", Proceedings of the 16th International Symposium on Computer Architecture, (1989).

[6] N. P. Jouppi and D. Wall, "Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines", Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, (April 1989), pp. 272-282.

[7] D. J. Lilja, "Reducing the Branch Penalty in Pipelined Processors", IEEE Computer, (July 1988), pp. 47-55.

[8] W. W. Hwu and Y. N. Patt, "Checkpoint Repair for Out-of-order Execution Machines", IEEE Transactions on Computers, (December 1987), pp. 1496-1514.

[9] P. G. Emma and E. S. Davidson, "Characterization of Branch and Data Dependencies in Programs for Evaluating Pipeline Performance", IEEE Transactions on Computers, (July 1987), pp. 859-876.

[10] J. A. DeRosa and H. M. Levy, "An Evaluation of Branch Architectures", Proceedings of the 14th International Symposium on Computer Architecture, (June 1987), pp. 10-16.

[11] D. R. Ditzel and H. R. McLellan, "Branch Folding in the CRISP Microprocessor: Reducing Branch Delay to Zero", Proceedings of the 14th International Symposium on Computer Architecture, (June 1987), pp. 2-9.

[12] S. McFarling and J. Hennessy, "Reducing the Cost of Branches", Proceedings of the 13th International Symposium on Computer Architecture, (1986), pp. 396-403.

[13] J. Lee and A. J. Smith, "Branch Prediction Strategies and Branch Target Buffer Design", IEEE Computer, (January 1984), pp. 6-22.

[14] T. R. Gross and J. Hennessy, "Optimizing Delayed Branches", Proceedings of the 15th Annual Workshop on Microprogramming, (Oct. 1982), pp. 114-120.

[15] D. A. Patterson and C. H. Sequin, "RISC-I: A Reduced Instruction Set VLSI Computer", Proceedings of the 8th International Symposium on Computer Architecture, (May 1981), pp. 443-458.

[16] J. E. Smith, "A Study of Branch Prediction Strategies", Proceedings of the 8th International Symposium on Computer Architecture, (May 1981), pp. 135-148.

[17] L. E. Shar and E. S. Davidson, "A Multiminiprocessor System Implemented Through Pipelining", IEEE Computer, (Feb. 1974), pp. 42-51.

[18] T. C. Chen, "Parallelism, Pipelining, and Computer Efficiency", Computer Design, Vol. 10, No. 1, (Jan. 1971), pp. 69-74.