Dynamic Hardware Prediction
• Importance of control dependences
– Branches and jumps are frequent
– Limiting factor as ILP increases (Amdahl’s law)
• Schemes to attack control dependences
– Static
• Basic (stall the pipeline)
• Predict-not-taken and predict-taken
• Delayed branch and canceling branch
– Dynamic predictors
• Effectiveness of dynamic prediction schemes
– Accuracy
– Cost of a correctly predicted branch
– Cost of an incorrectly predicted branch
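One way to combine these three factors into a single number is sketched below. This is an illustrative calculation, not something taken from the slides: the function names and the workload numbers (20% branches, 90% accuracy, a 2-cycle misprediction cost) are assumptions chosen only for the example.

```python
def average_branch_cost(accuracy, cost_correct, cost_incorrect):
    """Expected cycles lost per branch for a given prediction accuracy."""
    return accuracy * cost_correct + (1.0 - accuracy) * cost_incorrect


def cpi_with_branches(base_cpi, branch_fraction, accuracy,
                      cost_correct, cost_incorrect):
    """Base CPI plus the average stall cycles contributed by branches."""
    return base_cpi + branch_fraction * average_branch_cost(
        accuracy, cost_correct, cost_incorrect)


# Hypothetical workload: 20% branches, 90% accuracy,
# 0 cycles lost when the prediction is right, 2 cycles when it is wrong.
cpi = cpi_with_branches(1.0, 0.20, 0.90, 0, 2)
print(round(cpi, 2))
```

Under these assumed numbers the CPI comes out at about 1.04, which shows why both the accuracy and the two costs matter.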
Basic Branch Prediction Buffers
a.k.a. Branch History Table (BHT)
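– Small direct-mapped cache of T/NT bits
[Figure: the PC indexes the BHT. On T (predict taken) the next PC is the branch target, computed by adding the PC to the offset in the branch instruction (IR); on NT (predict not taken) the next PC is PC + 4.]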
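A minimal sketch of such a buffer follows. It assumes 4-byte instructions (so the two low PC bits are dropped before indexing) and an untagged table; the class name `OneBitBHT` and the table size are made up for the example.

```python
class OneBitBHT:
    """Direct-mapped, untagged table of 1-bit taken/not-taken predictions."""

    def __init__(self, entries=1024):
        assert entries & (entries - 1) == 0, "entries must be a power of two"
        self.entries = entries
        self.bits = [False] * entries  # False = predict not taken

    def _index(self, pc):
        # Assumption: instructions are 4-byte aligned, so drop the low 2 bits.
        return (pc >> 2) & (self.entries - 1)

    def predict(self, pc):
        return self.bits[self._index(pc)]

    def update(self, pc, taken):
        # A 1-bit predictor simply remembers the last outcome.
        self.bits[self._index(pc)] = taken


bht = OneBitBHT(entries=4)
bht.update(0x1000, True)
same_slot = bht.predict(0x1010)   # maps to the same entry as 0x1000
other_slot = bht.predict(0x1004)  # maps to a different entry
```

Because the table has no tags, two branches that map to the same entry (here 0x1000 and 0x1010) share one prediction bit.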
N-bit Branch Prediction Buffers
• Use an n-bit saturating counter
– For a loop branch, only the loop exit causes a misprediction
• A 2-bit predictor is almost as good as any general n-bit predictor
[Figure: 2-bit predictor state diagram. States 11 and 10 predict taken; states 01 and 00 predict not taken. A taken branch moves the counter toward 11, a not-taken branch moves it toward 00.]
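The loop-exit claim can be checked with a small simulation. The sketch below is illustrative: the class and function names, the starting counter values, and the loop shape (nine taken iterations followed by one exit, run three times) are assumptions for the example.

```python
class SaturatingCounter:
    """n-bit saturating counter; the upper half of the range predicts taken."""

    def __init__(self, bits=2, value=0):
        self.max_value = (1 << bits) - 1
        self.threshold = 1 << (bits - 1)
        self.value = value

    def predict(self):
        return self.value >= self.threshold

    def update(self, taken):
        if taken:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)


def count_mispredictions(bits, outcomes, start):
    counter = SaturatingCounter(bits, start)
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# Hypothetical loop branch: taken 9 times, then not taken at the exit,
# with the whole loop executed 3 times.
loop = ([True] * 9 + [False]) * 3
one_bit_misses = count_mispredictions(1, loop, start=1)
two_bit_misses = count_mispredictions(2, loop, start=3)
print(one_bit_misses, two_bit_misses)
```

In this run the 1-bit counter mispredicts both the exit and the first iteration of each re-entry (5 misses), while the 2-bit counter mispredicts only the three exits.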
Correlating Predictors
a.k.a. Two-level Predictors
– Use recent behavior of other (previous) branches
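[Figure: as in the basic BHT, the PC selects a BHT entry and the next PC is either the branch target (T, predict taken) or PC + 4 (NT, predict not taken). Each entry now holds two predictions, and a 1-bit global branch history (T/NT, storing the behavior of the previous branch) selects which one is used.]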
Example
    if (d == 0) d = 1;
    if (d == 1) whatever;

            BNEZ   R1, L1        ; branch b1 (d!=0)
            ADDI   R1, R0, #1
    L1:     SUBUI  R3, R1, #1
            BNEZ   R3, L2        ; branch b2 (d!=1)
            ...
    L2:
Basic one-bit predictor
d=?   b1 pred   b1 action   new b1 pred   b2 pred   b2 action   new b2 pred
2     NT        T           T             NT        T           T
0     T         NT          NT            T         NT          NT
2     NT        T           T             NT        T           T
0     T         NT          NT            T         NT          NT
One-bit predictor with one-bit correlation
(X/Y: X is the prediction used if the last branch was not taken, Y if it was taken)
d=?   b1 pred   b1 action   new b1 pred   b2 pred   b2 action   new b2 pred
2     NT/NT     T           T/NT          NT/NT     T           NT/T
0     T/NT      NT          T/NT          NT/T      NT          NT/T
2     T/NT      T           T/NT          NT/T      T           NT/T
0     T/NT      NT          T/NT          NT/T      NT          NT/T
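The two tables above can be reproduced with a short simulation. The sketch below is illustrative only: it assumes every prediction bit and the global history start as NT, as in the first row of each table, and the function name `count_misses` is made up for this example.

```python
def count_misses(d_values, correlate):
    """Run branches b1 and b2 from the example and count mispredictions."""
    # preds[name][h] is the 1-bit prediction used when the previous branch
    # outcome was h (0 = NT, 1 = T). Without correlation only slot 0 is used.
    preds = {"b1": [False, False], "b2": [False, False]}
    last = False   # assumption: the history starts as not taken
    misses = 0
    for d in d_values:
        b1_taken = d != 0          # BNEZ R1, L1
        if d == 0:
            d = 1                  # ADDI R1, R0, #1
        b2_taken = d != 1          # BNEZ R3, L2
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            slot = int(last) if correlate else 0
            if preds[name][slot] != taken:
                misses += 1
            preds[name][slot] = taken   # 1-bit predictor: remember last outcome
            last = taken
    return misses


basic_misses = count_misses([2, 0, 2, 0], correlate=False)
correlated_misses = count_misses([2, 0, 2, 0], correlate=True)
print(basic_misses, correlated_misses)
```

With d alternating between 2 and 0, the basic predictor mispredicts all eight branches, while the correlating one mispredicts only the two branches of the first iteration.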
(m, n) Predictors
• Use behavior of the last m branches
• 2^m n-bit predictors for each branch
• Simple implementation
– Use an m-bit shift register to record the behavior of the last m branches
[Figure: the PC and the m-bit global branch history (GBH) together index the (m,n) BPF, selecting one n-bit predictor.]
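A minimal sketch of this organization is shown below. It is not the slides' implementation: the class name `MNPredictor`, the PC-to-row mapping (drop two low bits, then modulo the number of entries), and the all-zero initial state are assumptions made for the example.

```python
class MNPredictor:
    """(m, n) predictor: each table row holds 2**m n-bit saturating counters,
    and an m-bit global history shift register picks the counter in the row."""

    def __init__(self, m, n, entries):
        self.m = m
        self.max_count = (1 << n) - 1
        self.threshold = 1 << (n - 1)
        self.entries = entries
        self.history = 0  # m-bit global branch history, newest outcome in bit 0
        self.table = [[0] * (1 << m) for _ in range(entries)]

    def _counters(self, pc):
        # Assumption: 4-byte instructions; the row is chosen by the PC alone.
        return self.table[(pc >> 2) % self.entries]

    def predict(self, pc):
        return self._counters(pc)[self.history] >= self.threshold

    def update(self, pc, taken):
        counters = self._counters(pc)
        c = counters[self.history]
        counters[self.history] = (min(c + 1, self.max_count) if taken
                                  else max(c - 1, 0))
        # Shift the outcome into the history register, keeping m bits.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


def misses(predictor, pc, outcomes):
    count = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            count += 1
        predictor.update(pc, taken)
    return count


# Hypothetical single branch that alternates taken / not taken 10 times.
alternating = [True, False] * 10
misses_0_2 = misses(MNPredictor(0, 2, entries=16), 0x400, alternating)
misses_2_2 = misses(MNPredictor(2, 2, entries=16), 0x400, alternating)
print(misses_0_2, misses_2_2)
```

On this alternating pattern the (0,2) predictor mispredicts every taken outcome, while the (2,2) predictor stops mispredicting once its history-indexed counters have warmed up.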
Size of the Buffers
• Number of bits in an (m,n) predictor
– 2^m × n × number of entries in the table
• Example – assume 8K bits in the BHT
– (0,1): 8K entries
– (0,2): 4K entries
– (2,2): 1K entries
– (12,2): 1 entry!
• Does not use the branch address
• Relies only on the global branch history
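The entry counts in the example follow directly from the size formula. A small sketch (the function names are chosen only for illustration):

```python
def predictor_bits(m, n, entries):
    """Total storage of an (m, n) predictor: 2**m * n bits per entry."""
    return (2 ** m) * n * entries


def entries_for_budget(m, n, total_bits):
    """How many table entries fit in a fixed bit budget."""
    return total_bits // ((2 ** m) * n)


BUDGET = 8 * 1024  # 8K bits, as in the example above
sizes = {(m, n): entries_for_budget(m, n, BUDGET)
         for (m, n) in [(0, 1), (0, 2), (2, 2), (12, 2)]}
for (m, n), entries in sizes.items():
    print(f"({m},{n}): {entries} entries")
```

With 12 history bits the whole budget is consumed by a single entry, which is why that configuration cannot use the branch address at all.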
Performance of 2-bit Predictors
[Figure: frequency of mispredictions (axis from 0 to 20) on the SPEC89 benchmarks nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, and li, comparing three predictors: (0,2) with 4K entries, (0,2) with 1M entries, and (2,2) with 1K entries.]
Branch-Target Buffers
• Further reduce control stalls (hopefully to 0)
• Store the predicted address in the buffer
• Access the buffer during IF
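[Figure: the PC is looked up in the buffer and compared (=) against the stored entries. On a match (YES: instruction is a branch) the buffer supplies the predicted address and a T/NT prediction; on no match (NO: instruction is not a branch).]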
Prediction with BTF
• IF: send PC to memory and BTF
• Entry found in BTF?
– NO: in ID, is the instruction a taken branch?
• YES: in EX, update BTF
– YES: send out predicted address; in ID, is it a taken branch?
• NO: in EX, kill the fetched instruction, restart fetch at the other target, and delete the entry from BTF
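The flow above can be sketched in a few lines. This is a simplification under stated assumptions: the buffer is modeled as an unbounded dictionary rather than a finite tagged cache, only taken branches are stored, instructions are 4 bytes, and the names `BranchTargetBuffer`, `fetch`, and `resolve` are invented for the example.

```python
class BranchTargetBuffer:
    """Maps the PC of a taken branch to its predicted target address."""

    def __init__(self):
        self.entries = {}  # assumption: unbounded, fully associative

    def fetch(self, pc):
        """IF: return the predicted next PC (the target on a hit, else PC + 4)."""
        return self.entries.get(pc, pc + 4)

    def resolve(self, pc, taken, target):
        """Once the branch outcome is known, update the buffer.
        Returns True if the address used at fetch was wrong."""
        predicted = self.fetch(pc)
        actual = target if taken else pc + 4
        if taken:
            # Miss on a taken branch: enter it (also refreshes the target on a hit).
            self.entries[pc] = target
        elif pc in self.entries:
            # Hit, but the branch was not taken: delete the entry.
            del self.entries[pc]
        return predicted != actual


btb = BranchTargetBuffer()
first = btb.resolve(0x40, True, 0x10)    # not in buffer yet, taken
second = btb.resolve(0x40, True, 0x10)   # found, taken
third = btb.resolve(0x40, False, 0x10)   # found, but not taken
fourth = btb.resolve(0x40, False, 0x10)  # entry was deleted, not taken
print(first, second, third, fourth)
```

Only the first and third resolutions are mispredictions in this run: the first because the branch was not yet in the buffer, the third because a stored branch fell through.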
Target Instruction Buffers
• Store target instructions instead of addresses
• Advantages
– BTB access can take longer than the time between IFs, and the BTB can be larger
– Branch folding
• Zero-cycle unconditional branches
– Replace branch with target instruction
• Zero-cycle conditional branches
– Condition codes preset
Procedure Return Predictors
• Use buffer (stack) of return addresses
[Figure: misprediction rate (axis from 0 to 60) versus number of entries in the return stack (1, 2, 4, 8, 16) for gcc, li, and fpppp.]
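A minimal sketch of a return-address stack is given below to show why deeper stacks mispredict less. The overflow policy (drop the oldest entry) and all names are assumptions for the example; the slides do not specify them.

```python
class ReturnAddressStack:
    """Fixed-size stack of predicted return addresses."""

    def __init__(self, size):
        self.size = size
        self.stack = []

    def on_call(self, return_address):
        if len(self.stack) == self.size:
            self.stack.pop(0)  # assumption: on overflow, drop the oldest entry
        self.stack.append(return_address)

    def predict_return(self):
        # None means the stack has no prediction to offer.
        return self.stack.pop() if self.stack else None


def return_mispredictions(size, depth):
    """Nest `depth` calls, then return from all of them; count wrong predictions."""
    ras = ReturnAddressStack(size)
    addresses = [0x1000 + 4 * i for i in range(depth)]  # hypothetical addresses
    for address in addresses:
        ras.on_call(address)
    misses = 0
    for address in reversed(addresses):
        if ras.predict_return() != address:
            misses += 1
    return misses


results = {size: return_mispredictions(size, depth=8) for size in (1, 2, 4, 8, 16)}
print(results)
```

For a call chain eight deep, every return beyond the stack's capacity is mispredicted, so the miss count falls to zero once the stack has at least eight entries.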
Performance Issues
• Limitations of branch prediction schemes
– Prediction accuracy (80% - 95%)
• Type of program
• Size of buffer
– Penalty of misprediction
• Fetch from both directions to reduce penalty
– To fetch from both directions, the memory system should do one of the following:
• Be dual-ported
• Have an interleaved cache
• Fetch from one path and then from the other
Approaches to Improve Performance
• Goal so far: achieve CPI = 1
– Eliminate structural, data, and control stalls
• Additional performance improvements
– Make clock rate faster
• Improve manufacturing process
– Increase the number of stages
• Superpipelining
– Multiple issue of instructions
• Superscalar
• VLIW
• IPC instead of CPI!
Superscalar Processors
• Issue more than one instruction per cycle
• Duplication of functional units
• Constraints
– Structural
– Data dependencies
– Control dependencies
Sound familiar?
• Scheduling of instructions
– Static
– Dynamic