Branch Prediction

Static, Dynamic Branch prediction techniques
10/14
branch.1
Control Flow Penalty
Why Branch Prediction
Modern processors have 10-14 pipeline stages between next-PC calculation and branch resolution!
Work lost if the pipeline makes a wrong prediction ~ loop length x pipeline width
[Figure: pipeline from PC, I-cache, Fetch Buffer, Decode, Issue Buffer, Func. Units, Result Buffer, Commit to Arch. State, grouped into Fetch, Decode, Execute, and Commit; the next fetch is started at the PC while the branch is executed in the functional units]
Branch Penalties in a Superscalar Are Extensive
Reducing Control Flow Penalty
Software solutions
• Minimize branches - loop unrolling
Increases the run length
Hardware solutions
• Find something else to do - delay slots
• Speculate - dynamic branch prediction
Speculative execution of instructions beyond the branch
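A minimal sketch of the loop-unrolling point, assuming a simple array sum. The function names and the unroll factor of 4 are illustrative and not from the slides. Each function counts how many times the loop-closing branch sends control back into a loop body, so the longer run length between branches becomes visible.

```python
def sum_rolled(values):
    """One loop-closing branch per element."""
    total, branches, i = 0, 0, 0
    while i < len(values):
        total += values[i]
        i += 1
        branches += 1  # one trip through the loop body
    return total, branches


def sum_unrolled_by_4(values):
    """Unrolled by 4 (assumed factor): one loop-closing branch per 4 elements."""
    total, branches, i = 0, 0, 0
    n = len(values)
    while i + 4 <= n:
        total += values[i] + values[i + 1] + values[i + 2] + values[i + 3]
        i += 4
        branches += 1
    while i < n:  # cleanup loop for the leftover elements
        total += values[i]
        i += 1
        branches += 1
    return total, branches


data = list(range(10))
print(sum_rolled(data))         # (45, 10)
print(sum_unrolled_by_4(data))  # (45, 4)
```

Both versions compute the same sum; the unrolled one executes fewer branches at the cost of larger code.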
Branch Prediction
Motivation:
Branch penalties limit performance of deeply pipelined processors
Much worse for superscalar processors
Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly
Required hardware support:
Dynamic Prediction HW:
• Branch history tables, branch target buffers, etc.
Mispredict recovery mechanisms:
• Keep computation results separate from commit
• Kill instructions following the branch
• Restore state to the state following the branch
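To make the motivation concrete, here is a rough sketch of a commonly used first-order model: effective CPI = base CPI + branch fraction x misprediction rate x misprediction penalty. The model and all numbers below (base CPI of 1.0, 20% branches, 12-cycle penalty) are assumptions for illustration; the slides only state the >95% accuracy figure and the 10-14 stage depth.

```python
def effective_cpi(base_cpi, branch_fraction, accuracy, mispredict_penalty):
    """First-order CPI model (an assumed simplification, not from the slides)."""
    mispredict_rate = 1.0 - accuracy
    return base_cpi + branch_fraction * mispredict_rate * mispredict_penalty


# Hypothetical workload: 20% branches, 12-cycle misprediction penalty.
for accuracy in (0.80, 0.95):
    cpi = effective_cpi(1.0, 0.20, accuracy, 12)
    print(f"accuracy={accuracy:.2f} -> CPI={cpi:.2f}")
```

Under these assumed numbers, raising accuracy from 80% to 95% cuts the branch-related CPI overhead from 0.48 to 0.12.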
Static Branch Prediction - review
Overall probability a branch is taken is ~60-70%, but:
• backward branch (JZ): ~90% taken
• forward branch (JZ): ~50% taken
ISA can attach preferred direction semantics to branches, e.g., Motorola MC88110:
bne0 (preferred taken), beq0 (not taken)
ISA can allow arbitrary choice of statically predicted direction, e.g., HP PA-RISC, Intel IA-64
Typically reported as ~80% accurate
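The backward/forward statistics suggest the common "backward taken, forward not taken" heuristic. The slides do not name this scheme explicitly, so treat the sketch below as one plausible static predictor. The trace, addresses, and function names are hypothetical.

```python
def predict_static(branch_pc, target_pc):
    """Predict taken for backward branches, not taken for forward ones."""
    return target_pc < branch_pc


def static_accuracy(trace):
    """trace: list of (branch_pc, target_pc, actually_taken) tuples."""
    correct = sum(
        1 for pc, target, taken in trace if predict_static(pc, target) == taken
    )
    return correct / len(trace)


# Hypothetical trace: a backward loop branch taken 9 of 10 times,
# and a forward branch taken 5 of 10 times.
loop_branch = [(0x40, 0x20, True)] * 9 + [(0x40, 0x20, False)]
forward_branch = [(0x30, 0x50, True)] * 5 + [(0x30, 0x50, False)] * 5
print(static_accuracy(loop_branch + forward_branch))  # 14 of 20 correct
```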
Branch Prediction Needs
• Target address generation
– Get register: PC, Link reg, GP reg.
– Calculate: +/- offset, auto inc/dec
– Target speculation
• Condition resolution
– Get register: condition code reg., count reg., other reg.
– Compare registers
– Condition speculation
Target address generation takes time
Condition resolution takes time
Solution: Branch speculation
Branch Prediction Schemes
1. 2-bit Branch-Prediction Buffer
2. Branch Target Buffer
3. Correlating Branch Prediction Buffer
4. Tournament Branch Predictor
5. Integrated Instruction Fetch Units
6. Return Address Predictors (for subroutines, Pentium, Core Duo)
7. Predicated Execution (Itanium)
Dynamic Branch Prediction
learning based on past behavior
[Figure: a Branch Predictor holding History Information; Incoming Branches { Address } enter, Predictions { Address, Value } leave, and Corrections { Address, Value } come back]
• Incoming stream of addresses
• Fast outgoing stream of predictions
• Correction information returned from pipeline
Branch History Table (BHT)
Table of predictors
• Each branch given its own predictor
• BHT is a table of "Predictors"
– Could be 1-bit or more
– Indexed by PC address of branch
[Figure: Branch PC indexes a table of predictors, Predictor 0 through Predictor 7]
• Problem: in a loop, a 1-bit BHT will cause two mispredictions (avg is 9 iterations before exit):
– End of loop case: when it exits the loop
– First time through the loop on the next entry, it predicts exit instead of looping
• Most schemes use at least 2-bit predictors
• Performance = ƒ(accuracy, cost of misprediction)
– Misprediction => Flush Reorder Buffer
• In Fetch stage of branch:
– Use Predictor to make prediction
• When branch completes
– Update corresponding Predictor
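The two-mispredictions-per-loop problem can be checked with a small simulation. This is a sketch under stated assumptions: an 8-entry table (matching Predictor 0 through Predictor 7 in the figure), 4-byte instructions, all entries initialised to "not taken", and a hypothetical branch at address 0x40. Class and function names are illustrative.

```python
class OneBitBHT:
    """1-bit predictors: each entry remembers the branch's last outcome."""

    def __init__(self, num_entries=8):
        self.num_entries = num_entries
        self.table = [False] * num_entries  # False = predict not taken

    def _index(self, pc):
        # Assumes 4-byte instructions, so the low 2 PC bits are dropped.
        return (pc >> 2) % self.num_entries

    def predict(self, pc):
        return self.table[self._index(pc)]

    def update(self, pc, taken):
        self.table[self._index(pc)] = taken


def count_mispredictions(bht, pc, outcomes):
    """Predict at fetch, update when the branch completes."""
    misses = 0
    for taken in outcomes:
        if bht.predict(pc) != taken:
            misses += 1
        bht.update(pc, taken)
    return misses


# Loop branch: taken 9 times, then falls through; the loop is run 3 times.
outcomes = ([True] * 9 + [False]) * 3
print(count_mispredictions(OneBitBHT(), 0x40, outcomes))  # 6, i.e. 2 per run
```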
Branch History Table Organization
Target PC calculation takes time
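[Figure: k bits of the Fetch PC (above the low-order 00 bits) form the BHT Index into a 2^k-entry BHT with 2 bits/entry, accessed alongside the I-Cache; the fetched instruction's opcode answers "Branch?", its offset is added (+) to form the Target PC, and the BHT answers "Taken/¬Taken?"]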
4K-entry BHT, 2 bits/entry, ~80-90% correct predictions
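A short sketch of the indexing implied by the figure, assuming 4-byte instructions (so the two low PC bits are always 00) and k = 12 for the 4K-entry table. The function names are illustrative.

```python
def bht_index(fetch_pc, k=12):
    """Select k PC bits just above the two always-zero low bits."""
    word_address = fetch_pc >> 2
    return word_address & ((1 << k) - 1)


def bht_storage_bits(k=12, bits_per_entry=2):
    """Total predictor storage for a 2^k-entry table."""
    return (1 << k) * bits_per_entry


print(hex(bht_index(0x1004)))   # 0x401
print(bht_storage_bits())       # 8192 bits = 1 KiB
# Branches whose PCs differ only above bit k+1 alias to the same entry:
print(bht_index(0x0004) == bht_index(0x4004))  # True
```

Because the table is untagged, aliasing branches share one predictor, which is one reason accuracy depends on table size.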
2-bit Dynamic Branch Prediction
more accurate than 1-bit
• Better solution: 2-bit scheme where the prediction changes only after two mispredictions in a row
[Figure: four-state diagram with two "Predict Taken" states and two "Predict Not Taken" states; transitions are labeled T (taken) and NT (not taken)]
• Red: stop, not taken
• Green: go, taken
• Adds hysteresis to decision making process
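The hysteresis can be sketched with a 2-bit saturating counter. This is one common encoding of a 2-bit scheme and may differ in detail from the state diagram on the slide. The initial state and names below are assumptions.

```python
class TwoBitCounter:
    """Saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def mispredictions_per_run(counter, iterations=9, runs=3):
    """Loop branch taken `iterations` times, then not taken, for each run."""
    results = []
    for _ in range(runs):
        misses = 0
        for taken in [True] * iterations + [False]:
            if counter.predict() != taken:
                misses += 1
            counter.update(taken)
        results.append(misses)
    return results


print(mispredictions_per_run(TwoBitCounter()))  # [3, 1, 1]
```

After warm-up, only the loop exit is mispredicted: the single not-taken outcome moves the counter one step but does not flip the prediction for the next loop entry.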
BTB: Branch Address at Same Time as Prediction
• Branch Target Buffer (BTB): the address of the branch is used as an index to get the prediction AND the branch address (if taken)
[Figure: the PC of the instruction in FETCH is compared (=?) against the stored Branch PCs; each entry holds a Branch PC, a Predicted PC, and prediction state bits]
– Yes: instruction is a branch; use predicted PC as next PC
– No: branch not predicted, proceed normally (Next PC = PC+4)
Only predicted taken branches and jumps held in BTB
Next PC determined before branch fetched and decoded
Later: check prediction; if wrong, kill instruction and update BPb
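The lookup can be sketched as a small map from branch PC to predicted target. This is a simplification: a real BTB is a fixed-size, tagged hardware table, while the dictionary below is unbounded, and evicting an entry on the first not-taken outcome is an assumed policy. Names and addresses are hypothetical.

```python
class BranchTargetBuffer:
    """Holds only branches/jumps currently predicted taken."""

    def __init__(self):
        self.entries = {}  # branch PC -> predicted target PC

    def next_pc(self, pc):
        # Hit: use the predicted PC. Miss: proceed normally with PC + 4.
        return self.entries.get(pc, pc + 4)

    def resolve(self, pc, taken, target):
        """Called later, once the branch outcome is known."""
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)  # assumed policy: evict on not taken


btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x100)))   # 0x104, not yet in the BTB
btb.resolve(0x100, taken=True, target=0x80)
print(hex(btb.next_pc(0x100)))   # 0x80, redirected without decoding
print(hex(btb.next_pc(0x104)))   # 0x108, non-branch instruction
```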
BTB contains only Branch & Jump Instructions
BTB contains information for branch and jump instructions only
=> not updated for other instructions
For all other instructions the next PC is PC+4!
Achieved without decoding the instruction
Combining BTB and BHT
• BTB entries are considerably more expensive than BHT entries, but the BTB redirects fetch earlier in the pipeline and can accelerate indirect branches (JR)
• BHT can hold many more entries and is more accurate
• BHT in a later pipeline stage corrects when the BTB misses a predicted taken branch
Pipeline stages:
A - PC Generation/Mux
P - Instruction Fetch Stage 1
F - Instruction Fetch Stage 2
B - Branch Address Calc/Begin Decode
I - Complete Decode
J - Steer Instructions to Functional units
R - Register File Read
E - Integer Execute
BTB/BHT only updated after branch resolves in E stage
Subroutine Return Stack
• Small stack - accelerates subroutine returns
• More accurate than BTBs
Push return address when function call executed
Pop return address when subroutine return decoded
[Figure: stack of k entries (typically k=8-16) holding the return addresses &nextc, &nextb, &nexta]
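A minimal sketch of the return stack, assuming 4-byte instructions (so the return address is the call PC + 4) and assuming that a full stack silently drops its oldest entry. The overflow policy and names are illustrative, not taken from the slides.

```python
from collections import deque


class ReturnAddressStack:
    """Small k-entry stack of predicted return addresses."""

    def __init__(self, k=8):
        # deque(maxlen=k) discards the oldest entry when full (assumed policy).
        self.stack = deque(maxlen=k)

    def on_call(self, call_pc):
        self.stack.append(call_pc + 4)  # push return address

    def on_return(self):
        # Pop the predicted return address; None means no prediction.
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(k=2)
for call_pc in (0x100, 0x200, 0x300):  # nested calls a, b, c
    ras.on_call(call_pc)
print(hex(ras.on_return()))  # 0x304
print(hex(ras.on_return()))  # 0x204
print(ras.on_return())       # None: the oldest entry was dropped on overflow
```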
Mispredict Recovery
In-order execution machines:
– Instructions issued after the branch cannot write back before the branch resolves
– All instructions in the pipeline behind a mispredicted branch are killed
Predicated Execution
• Avoid branch prediction by turning branches into conditionally executed instructions:
if (x) then A = B op C else NOP
– If false, then neither store the result nor cause an exception
– Expanded ISAs of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr.
– IA-64: 64 1-bit condition fields can be selected, allowing conditional execution of any instruction
– This transformation is called "if-conversion"
[Figure: a branch on x around "A = B op C"]
• Drawbacks to conditional instructions
– Still takes a clock even if "annulled"
– Stall if condition evaluated late
– Complex conditions reduce effectiveness; condition becomes known late in pipeline
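If-conversion can be illustrated in software with an arithmetic select. This is only an analogy: the sketch uses integer addition as the "op", the function names are hypothetical, and it does not model the hardware's suppression of exceptions for annulled instructions.

```python
def branchy(x, a, b, c):
    """Original form: a control-flow branch guards the assignment."""
    if x:
        a = b + c
    return a


def if_converted(x, a, b, c):
    """Predicated form: the operation always executes, a predicate selects."""
    p = int(bool(x))   # 1-bit predicate
    result = b + c     # still "takes a clock" even when p == 0
    return p * result + (1 - p) * a


for x in (0, 1, 5):
    print(x, branchy(x, 10, 2, 3), if_converted(x, 10, 2, 3))
```

The predicated version removes the branch from the control flow but always pays for the operation, which mirrors the first drawback listed above.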
Accuracy v. Size (SPEC89)
Dynamic Branch Prediction Summary
• Prediction is becoming an important part of scalar execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: recently executed branches are correlated with the next branch
• Tournament Predictor: give more resources to competing predictors and pick between them
• Branch Target Buffer: includes branch address & prediction
• Predicated Execution can reduce the number of branches and the number of mispredicted branches
• Return address stack for prediction of indirect jumps