Dynamic Branch Predictors

Dynamic Branch Prediction
• Why does prediction work?
– Underlying algorithm has regularities
– Data that is being operated on has regularities
– Instruction sequence has redundancies that are artifacts of
the way that humans/compilers think about problems
• Is dynamic branch prediction better than static
branch prediction?
– Seems to be
– There are a small number of important branches in programs
which have dynamic behavior
4/8/2015
Lec4 ILP
Dynamic Branch Prediction
• Performance = ƒ(accuracy, cost of misprediction)
• Branch History Table: Lower bits of PC address
index table of 1-bit values
– Says whether or not branch taken last time
– No address check
• Problem: in a loop, a 1-bit BHT causes two
mispredictions per loop execution (avg is 9 iterations before exit):
– End of loop, when the branch exits instead of looping as before
– First iteration the next time the loop is entered, when it
predicts exit instead of looping
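The loop problem above can be checked with a small simulation. This is a minimal sketch, not taken from the lecture: the function name `mispredictions_1bit`, the loop shape (9 taken branches followed by one exit), and the initial not-taken state are assumptions chosen for illustration.

```python
def mispredictions_1bit(outcomes, state=False):
    """Count mispredictions of a single 1-bit predictor entry.

    outcomes: iterable of booleans, True = taken.
    state: initial prediction (assumed not taken).
    """
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken  # 1-bit scheme: remember only the last outcome
    return misses

# Hypothetical loop branch: taken 9 times, then not taken on exit.
one_pass = [True] * 9 + [False]

# Run the loop three times in a row.
print(mispredictions_1bit(one_pass * 3))  # prints 6
```

Each pass through the loop costs two mispredictions: one on the first iteration and one on the exit.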
Dynamic Branch Prediction
• Solution: 2-bit scheme that changes the prediction
only after mispredicting twice
[Figure: state diagram of the 2-bit scheme. Two Predict Taken states and two Predict Not Taken states, connected by T (taken) and NT (not taken) transitions.]
• Red: stop, not taken
• Green: go, taken
• Adds hysteresis to decision making process
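The effect of the hysteresis can be sketched with a 2-bit saturating counter. This is an illustration under assumptions: the saturating-counter encoding (0..3, with 2 and 3 predicting taken) is one common variant of the 2-bit scheme and may differ in detail from the state diagram on the slide; the name `mispredictions_2bit` and the loop shape are hypothetical.

```python
def mispredictions_2bit(outcomes, counter=0):
    """Count mispredictions of one 2-bit saturating counter.

    counter ranges over 0..3; values 2 and 3 predict taken.
    Assumed variant: increment on taken, decrement on not taken.
    """
    misses = 0
    for taken in outcomes:
        predict_taken = counter >= 2
        if predict_taken != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses

# Hypothetical loop branch: taken 9 times, then not taken on exit.
one_pass = [True] * 9 + [False]

print(mispredictions_2bit(one_pass * 3))  # prints 5
```

Two of the five misses are warm-up from the initial state. After that, only the loop exit is mispredicted, one miss per pass.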
BHT Accuracy
• Mispredict because either:
– Wrong guess for that branch
– Got the branch history of the wrong branch when indexing the table
• 4096 entry table:
[Figure: bar chart of Misprediction Rate (axis 0% to 20%) per benchmark, grouped as Integer and Floating Point. Recoverable benchmark names: nasa7, matrix300, fpppp, spice, doduc, li, gcc, eqntott, espresso. Labeled bar values range from 0% to 18%.]
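The second cause of misprediction (reading another branch's history) follows from indexing with low PC bits and no address check. A minimal sketch, assuming 4-byte aligned instructions and a power-of-two table size; `bht_index` and the two addresses are hypothetical:

```python
BHT_ENTRIES = 4096  # table size from the slide

def bht_index(pc, entries=BHT_ENTRIES):
    """Index from the low-order PC bits, dropping the 2 byte-offset bits.

    Assumes 4-byte aligned instructions and a power-of-two table size.
    """
    return (pc >> 2) & (entries - 1)

# Two hypothetical branch addresses exactly 4096 instructions apart.
pc_a = 0x00401000
pc_b = pc_a + 4 * BHT_ENTRIES

# Both map to the same entry, so each branch sees the other's history.
print(bht_index(pc_a), bht_index(pc_b))
```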
Correlating Branch Predictor
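• It may be possible to improve the accuracy if we
look at the behavior of other branches.
    if (aa == 2)        /* b1 */
        aa = 0;
    if (bb == 2)        /* b2 */
        bb = 0;
    if (aa != bb)       /* b3 */
The behavior of b3 is correlated with the behavior of b1 and b2: if the conditions of b1 and b2 are both true, aa and bb are both 0, so the condition of b3 is false.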
Correlating Predictors
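• Two-level predictors
    if (d == 0)         /* b1 */
        d = 1;
    if (d == 1)         /* b2 */

Possible sequences (a branch is taken when its condition is false, so b1 is taken when d != 0 and b2 is taken when d != 1):

initial value of d    b1    value of d before b2    b2
0                     nt    1                       nt
1                     t     1                       nt
2                     t     2                       t

1-bit Predictor (Initialized to NT), d alternating between 2 and 0:

d    b1 prediction    b1 action    new b1 prediction    b2 prediction    b2 action    new b2 prediction
2    nt               t            t                    nt               t            t
0    t                nt           nt                   t                nt           nt
2    nt               t            t                    nt               t            t
0    t                nt           nt                   t                nt           nt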
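The 1-bit trace for d alternating between 2 and 0 can be replayed in a few lines. This is a sketch under an assumed convention: a branch counts as taken when its `if` condition is false (b1 taken when d != 0, b2 taken when d != 1), which matches the first row of the trace. The function name `run_1bit` is hypothetical.

```python
def run_1bit(d_values):
    """Trace b1 and b2 with one 1-bit predictor each, initialized to NT.

    Assumed convention: b1 is taken when d != 0, b2 is taken when d != 1.
    Returns (miss count, list of (branch, prediction, outcome)).
    """
    pred = {"b1": False, "b2": False}  # False = predict not taken
    misses = 0
    trace = []
    for d in d_values:
        b1_taken = d != 0
        if d == 0:
            d = 1
        b2_taken = d != 1
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            misses += pred[name] != taken
            trace.append((name, pred[name], taken))
            pred[name] = taken  # 1-bit update: store the last outcome
    return misses, trace

misses, trace = run_1bit([2, 0, 2, 0])
print(misses)  # prints 8
```

With this input every one of the eight branch executions is mispredicted, which is the motivation for looking at other branches.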
(1,1) Predictor
• Every branch has two separate prediction bits.
– First bit: the prediction if the last
branch in the program is not taken.
– Second bit: the prediction if the last
branch in the program is taken.
• Write the pair of prediction bits together.
Combinations & Meaning
Prediction bits    Prediction if last branch not taken    Prediction if last branch taken
NT/NT              NT                                     NT
NT/T               NT                                     T
T/NT               T                                      NT
T/T                T                                      T
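A (1,1) predictor can be tried on the same two-branch example (`if (d == 0) d = 1; if (d == 1)`) with d alternating between 2 and 0. This is a minimal sketch, assuming all prediction bits start at NT, the initial "last branch" outcome is not taken, and a branch is taken when its condition is false; `run_1_1` is a hypothetical name.

```python
def run_1_1(d_values):
    """Count mispredictions of b1 and b2 under a (1,1) predictor.

    Each branch keeps two 1-bit predictions, selected by whether the
    last executed branch (any branch) was taken. Assumed convention:
    b1 is taken when d != 0, b2 is taken when d != 1.
    """
    # pred[name][last_taken] -> predicted direction (False = not taken)
    pred = {"b1": [False, False], "b2": [False, False]}
    last_taken = False  # assumed initial global history
    misses = 0
    for d in d_values:
        b1_taken = d != 0
        if d == 0:
            d = 1
        b2_taken = d != 1
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            misses += pred[name][last_taken] != taken
            pred[name][last_taken] = taken  # update only the selected bit
            last_taken = taken
    return misses

print(run_1_1([2, 0, 2, 0]))  # prints 2
```

Only the two branches of the first iteration are mispredicted; afterwards the selected bit is always correct for this input.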
Correlated Branch Prediction
• Idea: record m most recently executed branches
as taken or not taken, and use that pattern to
select the proper n-bit branch history table
• In general, (m,n) predictor means record last m
branches to select between 2^m history tables,
each with n-bit counters
– Thus, old 2-bit BHT is a (0,2) predictor
• Global Branch History: m-bit shift register
keeping T/NT status of last m branches.
• Each entry in table has 2^m n-bit predictors.
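The (m,n) organization can be sketched as a small class. This is an illustration, not the lecture's implementation: the class name `MNPredictor`, the saturating-counter update, and the `(pc >> 2) % entries` index are assumptions.

```python
class MNPredictor:
    """Sketch of an (m, n) correlating predictor with saturating counters."""

    def __init__(self, m, n, entries):
        self.m, self.n, self.entries = m, n, entries
        self.history = 0  # m-bit global shift register, newest outcome in bit 0
        # One row per branch-address index, 2**m n-bit counters per row.
        self.table = [[0] * (2 ** m) for _ in range(entries)]

    def _row(self, pc):
        return self.table[(pc >> 2) % self.entries]

    def predict(self, pc):
        # Predict taken when the selected counter is in its upper half.
        return self._row(pc)[self.history] >= 2 ** (self.n - 1)

    def update(self, pc, taken):
        row = self._row(pc)
        top = 2 ** self.n - 1
        c = row[self.history]
        row[self.history] = min(c + 1, top) if taken else max(c - 1, 0)
        # Shift the outcome into the global history, keeping m bits.
        self.history = ((self.history << 1) | int(taken)) & (2 ** self.m - 1)

    def total_bits(self):
        return (2 ** self.m) * self.n * self.entries

# A 4096-entry (0,2) table and a 1024-entry (2,2) table use the same storage.
print(MNPredictor(0, 2, 4096).total_bits(), MNPredictor(2, 2, 1024).total_bits())
```

Both configurations come to 8192 bits, so comparing them isolates the benefit of using global history rather than more entries.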
Correlating Branches
(2,2) predictor
– Behavior of recent branches selects between four predictions of next branch, updating just that prediction
[Figure: (2,2) predictor. Labels: Branch address (4), 2-bits per branch predictor, 2-bit global branch history, Prediction.]
Accuracy of Different Schemes
[Figure: bar chart of Frequency of Mispredictions (axis 0% to 20%) for nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, and li. Three bars per benchmark:
– 4,096 entries: 2-bits per entry
– Unlimited entries: 2-bits/entry
– 1,024 entries (2,2)]
Tournament Predictors
• Multilevel branch predictor
• Use n-bit saturating counter to choose between
predictors
• Usual choice between global and local predictors
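The selection step can be sketched as follows. This is an assumption-laden illustration: the 2-bit selector encoding (2 and 3 choose the global predictor, 0 and 1 the local one) and the rule of moving the selector only when exactly one component was right are common choices, not details given on the slide; `tournament_step` is a hypothetical name.

```python
def tournament_step(selector, global_pred, local_pred, taken):
    """One step of a 2-bit selector choosing between two predictions.

    selector: 0..3; values 2 and 3 choose the global predictor,
    0 and 1 choose the local predictor (encoding is an assumption).
    Returns (prediction_used, new_selector).
    """
    prediction = global_pred if selector >= 2 else local_pred
    global_ok = global_pred == taken
    local_ok = local_pred == taken
    # Move the selector only when exactly one component was right.
    if global_ok and not local_ok:
        selector = min(selector + 1, 3)
    elif local_ok and not global_ok:
        selector = max(selector - 1, 0)
    return prediction, selector

# The local prediction is used, but the global one was right: selector moves up.
print(tournament_step(1, True, False, True))  # prints (False, 2)
```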
Tournament Predictors
Tournament predictor using, say, 4K 2-bit counters
indexed by local branch address. Chooses
between:
• Global predictor
– 4K entries indexed by history of last 12 branches (2^12 = 4K)
– Each entry is a standard 2-bit predictor
• Local predictor
– Local history table: 1024 10-bit entries recording last 10
branches, indexed by branch address
– The pattern of the last 10 occurrences of that particular branch is
used to index a table of 1K entries with 3-bit saturating counters
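The storage cost of this configuration can be added up directly from the sizes listed above. The variable names are hypothetical; the sizes are the ones on the slide.

```python
K = 1024

selector_bits = 4 * K * 2        # 4K 2-bit selector counters
global_bits = 4 * K * 2          # 4K entries, 2-bit predictor each
local_history_bits = 1 * K * 10  # 1024 entries of 10-bit local history
local_counter_bits = 1 * K * 3   # 1K entries of 3-bit saturating counters

total_bits = selector_bits + global_bits + local_history_bits + local_counter_bits
print(total_bits // K, "Kbits")  # prints 29 Kbits

# 12 bits of global history index exactly 4K entries.
assert 2 ** 12 == 4 * K
```

The total of 29K bits is in line with the roughly 30K bits quoted for tournament predictors in the summary.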
Comparing Predictors
• Advantage of tournament predictor is ability to select the right
predictor for a particular branch
– Particularly crucial for integer benchmarks
– A typical tournament predictor will select the global predictor almost 40%
of the time for the SPEC integer benchmarks and less than 15% of the
time for the SPEC FP benchmarks
[Figure: predictor comparison on SPEC89, including a 2-bit BHT.]
Pentium 4 Misprediction Rate
(per 1000 instructions, not per branch)
• 6% misprediction rate per branch SPECint
(19% of INT instructions are branch)
• 2% misprediction rate per branch SPECfp
(5% of FP instructions are branch)
[Figure: bar chart of Branch mispredictions per 1000 Instructions (axis 0 to 14) for SPECint2000 (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty) and SPECfp2000 (168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa).]
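The per-branch rates and the per-1000-instruction rates are related by the branch frequency. A short worked calculation using the averages quoted on this slide; the function name `mispredicts_per_1000` is hypothetical.

```python
def mispredicts_per_1000(branch_fraction, mispredict_rate):
    """Mispredictions per 1000 instructions from per-branch figures."""
    return 1000 * branch_fraction * mispredict_rate

# SPECint: 19% of instructions are branches, 6% of them mispredicted.
print(round(mispredicts_per_1000(0.19, 0.06), 1))  # prints 11.4

# SPECfp: 5% of instructions are branches, 2% of them mispredicted.
print(round(mispredicts_per_1000(0.05, 0.02), 1))  # prints 1.0
```

So the integer programs see roughly eleven times as many mispredictions per instruction, even though the per-branch rate is only three times higher.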
Branch Target Buffers (BTB)
• Branch target calculation is costly and stalls the
instruction fetch.
• BTB stores PCs the same way as caches
• The PC of a branch is sent to the BTB
• When a match is found the corresponding
Predicted PC is returned
• If the branch was predicted taken, instruction
fetch continues at the returned predicted PC
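The lookup described above can be sketched as a direct-mapped table with a full-PC match. This is an illustration under assumptions: 4-byte instructions, 1024 entries, and only taken branches being stored are choices made here, and the names `BranchTargetBuffer`, `next_pc`, and `record_taken` are hypothetical.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: each entry holds (branch PC, predicted PC)."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [None] * entries  # None = empty entry

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def next_pc(self, pc):
        """Return the fetch address that follows the instruction at pc."""
        entry = self.table[self._index(pc)]
        if entry is not None and entry[0] == pc:  # full-PC match, like a cache tag
            return entry[1]                       # hit: fetch from predicted PC
        return pc + 4                             # miss: proceed normally

    def record_taken(self, pc, target):
        """Install or refresh an entry after a branch at pc is taken."""
        self.table[self._index(pc)] = (pc, target)

btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x1000)))   # miss: 0x1004
btb.record_taken(0x1000, 0x2000)
print(hex(btb.next_pc(0x1000)))   # hit: 0x2000
```

Unlike the BHT, the stored branch PC is compared on every lookup, so an instruction that merely shares the index does not pick up another branch's target.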
Branch Target Buffers
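[Figure: branch target buffer. The PC of the instruction in FETCH is compared (=?) against the stored Branch PC of each entry; an entry also holds the Predicted PC and extra prediction state bits.
Yes: inst is branch, Next PC = predicted PC
No: proceed normally (Next PC = PC+4)]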
Dynamic Branch Prediction Summary
• Prediction becoming important part of execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: Recently executed branches correlated
with next branch
– Either different branches (GA)
– Or different executions of same branches (PA)
• Tournament predictors take insight to next level, by
using multiple predictors
– usually one based on global information and one based on local
information, and combining them with a selector
– In 2006, tournament predictors using about 30K bits are in processors
like the Power5 and Pentium 4
• Branch Target Buffer: include branch address &
prediction