CS 152 Computer Architecture and Engineering Lecture 14 - Branch Prediction Krste Asanovic

Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152
Last time in Lecture 13
• Register renaming removes WAR, WAW hazards by
giving a new internal destination register for every
new result
• Pipeline is structured with in-order
fetch/decode/rename, followed by out-of-order
execution/complete, followed by in-order commit
• At commit time, can detect exceptions and roll back
buffer to provide precise interrupts
3/20/2008
CS152-Spring’08
2
Recap: Overall Pipeline Structure
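[Figure: in-order Fetch and Decode feed the Reorder Buffer; Execute is out-of-order; Commit is in-order. When an exception is detected at commit, the instructions in Fetch, Decode, and the Reorder Buffer are killed and the handler PC is injected.]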
• Instructions fetched and decoded into instruction
reorder buffer in-order
• Execution is out-of-order (⇒ out-of-order completion)
• Commit (write-back to architectural state, i.e., regfile &
memory) is in-order
Temporary storage needed to hold results before commit
(shadow registers and store buffers)
Control Flow Penalty
[Figure: pipeline from PC through I-cache, Fetch Buffer (Fetch), Decode, Issue Buffer, Functional Units (Execute), and Result Buffer (Commit) to Architectural State. The next fetch is started at the PC; the branch is executed in the functional units.]
Modern processors may have > 10 pipeline stages between next PC calculation and branch resolution!
How much work is lost if pipeline doesn’t follow correct instruction flow?
~ Loop length x pipeline width
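The estimate "loop length x pipeline width" can be made concrete with a small calculation. The sketch below is only illustrative: the function name and the sample numbers (10 stages, 4-wide) are hypothetical, not measurements of any particular machine.

```python
def lost_issue_slots(loop_length_stages: int, pipeline_width: int) -> int:
    """Approximate work discarded on a wrong-path fetch.

    loop_length_stages: stages between next-PC calculation and branch
                        resolution (hypothetical parameter name).
    pipeline_width:     instructions handled per stage per cycle.
    """
    return loop_length_stages * pipeline_width


# Hypothetical example: a 10-stage loop on a 4-wide machine.
example_loss = lost_issue_slots(10, 4)
```

Under these assumed numbers, one wrong-path excursion throws away about 40 instruction slots.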
MIPS Branches and Jumps
Each instruction fetch depends on one or two pieces
of information from the preceding instruction:
1) Is the preceding instruction a taken branch?
2) If so, what is the target address?
Instruction    Taken known?          Target known?
J              After Inst. Decode    After Inst. Decode
JR             After Inst. Decode    After Reg. Fetch
BEQZ/BNEZ      After Reg. Fetch*     After Inst. Decode
*Assuming zero detect on register read
Branch Penalties in Modern Pipelines
UltraSPARC-III instruction fetch pipeline stages
(in-order issue, 4-way superscalar, 750MHz, 2000)
A  PC Generation/Mux
P  Instruction Fetch Stage 1
F  Instruction Fetch Stage 2
B  Branch Address Calc/Begin Decode
I  Complete Decode
J  Steer Instructions to Functional units
R  Register File Read
E  Integer Execute
   Remainder of execute pipeline (+ another 6 stages)
[Figure annotations mark two points along these stages: "Branch Target Address Known" and "Branch Direction & Jump Register Target Known".]
Reducing Control Flow Penalty
Software solutions
• Eliminate branches - loop unrolling
Increases the run length
• Reduce resolution time - instruction scheduling
Compute the branch condition as early
as possible (of limited value)
Hardware solutions
• Find something else to do - delay slots
Replaces pipeline bubbles with useful work
(requires software cooperation)
• Speculate - branch prediction
Speculative execution of instructions beyond
the branch
Branch Prediction
Motivation:
Branch penalties limit performance of deeply pipelined
processors
Modern branch predictors have high accuracy
(>95%) and can reduce branch penalties significantly
Required hardware support:
Prediction structures:
• Branch history tables, branch target buffers, etc.
Mispredict recovery mechanisms:
• Keep result computation separate from commit
• Kill instructions following branch in pipeline
• Restore state to state following branch
Static Branch Prediction
Overall probability a branch is taken is ~60-70% but:
backward branches (JZ): ~90% taken
forward branches (JZ): ~50% taken
ISA can attach preferred direction semantics to branches,
e.g., Motorola MC88110
bne0 (preferred taken) beq0 (not taken)
ISA can allow arbitrary choice of statically predicted direction,
e.g., HP PA-RISC, Intel IA-64
typically reported as ~80% accurate
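The backward/forward statistics suggest a simple fixed rule: predict backward branches taken and forward branches not taken. The slide does not name this rule, so treat the sketch below as one possible static scheme under that assumption; the function name is hypothetical.

```python
def predict_static(branch_pc: int, target_pc: int) -> bool:
    """Static guess: True means 'predict taken'.

    Assumption: a branch whose target lies at a lower address than the
    branch itself (a backward branch, typically a loop) is predicted
    taken; a forward branch is predicted not taken.
    """
    return target_pc < branch_pc


# A loop-closing branch at 0x400 jumping back to 0x3F0 is predicted taken.
loop_branch_guess = predict_static(0x400, 0x3F0)
# A forward branch at 0x400 skipping ahead to 0x420 is predicted not taken.
forward_branch_guess = predict_static(0x400, 0x420)
```

The rule needs no state at all, which is why it is attractive, but it cannot adapt to a branch whose behavior differs from the average.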
Dynamic Branch Prediction
learning based on past behavior
Temporal correlation
The way a branch resolves may be a good
predictor of the way it will resolve at the next
execution
Spatial correlation
Several branches may resolve in a highly
correlated manner (a preferred path of
execution)
Branch Prediction Bits
• Assume 2 BP bits per instruction
• Change the prediction after two consecutive mistakes!
[State diagram: four states (take/right, take/wrong, ¬take/wrong, ¬take/right), with transitions between them labeled taken and ¬taken.]
BP state:
(predict take/¬take) x (last prediction right/wrong)
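The rule "change the prediction after two consecutive mistakes" can be sketched as a tiny state machine over (prediction, last prediction right/wrong). The exact arcs of the slide's diagram are not recoverable from the text, so one detail here is an assumption: after the prediction flips, the new prediction starts in the "right" state. The class name is hypothetical.

```python
class TwoBitPredictor:
    """State = (predict take / not take) x (last prediction right / wrong)."""

    def __init__(self, predict_taken: bool = True):
        self.predict_taken = predict_taken
        self.last_right = True

    def predict(self) -> bool:
        return self.predict_taken

    def update(self, taken: bool) -> None:
        if taken == self.predict_taken:
            # Correct prediction: forget any earlier mistake.
            self.last_right = True
        elif self.last_right:
            # First mistake: remember it but keep the prediction.
            self.last_right = False
        else:
            # Second consecutive mistake: flip the prediction.
            self.predict_taken = not self.predict_taken
            self.last_right = True  # assumption, see lead-in


def count_mispredictions(predictor: TwoBitPredictor, outcomes) -> int:
    """Run the predictor over a sequence of actual outcomes."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong
```

For a loop branch that is taken three times and then falls through, repeated, this scheme mispredicts only the fall-through, because a single not-taken outcome never flips the prediction.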
Branch History Table
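[Figure: k bits of the fetch PC (above the two low-order 00 bits) form the BHT index into a 2^k-entry BHT with 2 bits/entry, which supplies Taken/¬Taken. In parallel, the I-cache supplies the instruction: its opcode indicates whether it is a branch, and its offset is added (+) to the PC to form the target PC.]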
4K-entry BHT, 2 bits/entry, ~80-90% correct predictions
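A branch history table of this shape can be sketched in a few lines. The slide specifies only 2 bits per entry; using 2-bit saturating counters, the initial counter value, and the class and method names are assumptions made for illustration.

```python
class BranchHistoryTable:
    """2^k entries of 2-bit saturating counters, indexed by PC bits."""

    def __init__(self, k: int = 12):
        self.mask = (1 << k) - 1
        # Counter values 0..3; 2 and 3 mean "predict taken".
        # Starting at 1 (weakly not taken) is an arbitrary choice.
        self.counters = [1] * (1 << k)

    def index(self, pc: int) -> int:
        # Drop the two low-order zero bits of the word-aligned PC,
        # then keep k bits.
        return (pc >> 2) & self.mask

    def predict(self, pc: int) -> bool:
        return self.counters[self.index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

With k=12 this gives the 4K-entry table mentioned above. There are no tags, so two branches whose PCs share the same k index bits share (and disturb) one counter.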
Exploiting Spatial Correlation
Yeh and Patt, 1992
if (x[i] < 7) then
    y += 1;
if (x[i] < 5) then
    c -= 4;
If first condition false, second condition also false
History register, H, records the direction of the last
N branches executed by the processor
Two-Level Branch Predictor
Pentium Pro uses the result from the last two branches
to select one of the four sets of BHT bits (~95% correct)
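[Figure: k bits of the fetch PC (above the low-order 00 bits) index the BHT. A 2-bit global branch history shift register, which shifts in the Taken/¬Taken result of each branch, selects one of the four sets of BHT bits to produce Taken/¬Taken.]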
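The two-level idea can be sketched as several counter tables, with a global history register choosing which table to consult. This is a simplified model, not a description of the Pentium Pro hardware: the 2-bit saturating counters, the table size k, and all names are assumptions.

```python
class TwoLevelPredictor:
    """Global history selects one of 2^h sets of 2-bit counters."""

    def __init__(self, k: int = 10, history_bits: int = 2):
        self.index_mask = (1 << k) - 1
        self.history_mask = (1 << history_bits) - 1
        self.history = 0  # most recent branch outcome in the low bit
        self.tables = [[1] * (1 << k) for _ in range(1 << history_bits)]

    def _counter_slot(self, pc: int):
        return self.tables[self.history], (pc >> 2) & self.index_mask

    def predict(self, pc: int) -> bool:
        table, i = self._counter_slot(pc)
        return table[i] >= 2

    def update(self, pc: int, taken: bool) -> None:
        table, i = self._counter_slot(pc)
        table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
        # Shift the outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def run(predictor: TwoLevelPredictor, pc: int, outcomes) -> list:
    """Return, per outcome, whether the prediction was correct."""
    results = []
    for taken in outcomes:
        results.append(predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    return results
```

A branch that strictly alternates taken/not taken defeats a single 2-bit counter, but here the history register distinguishes the two situations, so after a short warm-up every prediction is correct.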
Limitations of BHTs
Only predicts branch direction. Therefore, cannot redirect
fetch stream until after branch target is determined.
UltraSPARC-III fetch pipeline:
A  PC Generation/Mux
P  Instruction Fetch Stage 1
F  Instruction Fetch Stage 2
B  Branch Address Calc/Begin Decode
I  Complete Decode
J  Steer Instructions to Functional units
R  Register File Read
E  Integer Execute
   Remainder of execute pipeline (+ another 6 stages)
[Figure annotations mark two spans along these stages: the correctly predicted taken branch penalty and the jump register penalty.]
Branch Target Buffer
[Figure: k bits of the PC index both IMEM and a Branch Target Buffer of 2^k entries; each BTB entry holds a predicted target and BP bits (BPb).]
BP bits are stored with the predicted target address.
IF stage: If (BP=taken) then nPC=target else nPC=PC+4
Later: check prediction; if wrong, then kill the instruction and update BTB & BPb, else update BPb
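The IF-stage rule above can be sketched with an untagged table. To keep it short, the BP bits are reduced to a single taken/not-taken flag and the table is indexed by PC modulo the number of entries; both simplifications, and all names, are assumptions.

```python
class UntaggedBTB:
    """Predicted target plus a 1-bit prediction per entry, no tags."""

    def __init__(self, entries: int = 128):
        self.entries = entries
        self.target = [0] * entries
        self.predict_taken = [False] * entries

    def index(self, pc: int) -> int:
        return pc % self.entries  # assumed indexing, see lead-in

    def next_pc(self, pc: int) -> int:
        # IF stage: if (BP = taken) then nPC = target else nPC = PC + 4
        i = self.index(pc)
        return self.target[i] if self.predict_taken[i] else pc + 4

    def update(self, pc: int, taken: bool, target: int) -> None:
        # Later, once the instruction's real behavior is known.
        i = self.index(pc)
        self.predict_taken[i] = taken
        if taken:
            self.target[i] = target
```

Because entries carry no tag, every PC that maps to the same entry receives the same prediction, whether or not it is a branch.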
Address Collisions
Assume a 128-entry BTB
Instruction memory:
132   Jump 100      (BTB entry: target = 236, BPb = take)
1028  Add .....
What will be fetched after the instruction at 1028?
BTB prediction = 236
Correct target = 1032
⇒ kill PC=236 and fetch PC=1032
Is this a common occurrence?
Can we avoid these bubbles?
BTB is only for Control Instructions
BTB contains useful information for branch and
jump instructions only
⇒ Do not update it for other instructions
For all other instructions the next PC is PC+4 !
How to achieve this effect without decoding the
instruction?
Branch Target Buffer (BTB)
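[Figure: the PC goes to the I-Cache and, in parallel, k bits of it index a 2^k-entry direct-mapped BTB (can also be associative). Each entry holds an entry PC, a valid bit, and a predicted target PC. The entry PC is compared (=) with the fetch PC to produce match; the outputs are valid and target.]
• Keep both the branch PC and target PC in the BTB
• PC+4 is fetched if match fails
• Only taken branches and jumps held in BTB
• Next PC determined before branch fetched and decoded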
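A tagged, direct-mapped BTB following these four points might look like the sketch below. The BP bits are omitted, the index is taken as PC modulo the number of entries (which is what makes addresses 132 and 1028 share an entry in a 128-entry table), and the names are hypothetical.

```python
class TaggedBTB:
    """Direct-mapped BTB that stores the branch PC with each target."""

    def __init__(self, entries: int = 128):
        self.entries = entries
        # Each slot is None (invalid) or a pair (entry_pc, target_pc).
        self.slots = [None] * entries

    def next_pc(self, pc: int) -> int:
        slot = self.slots[pc % self.entries]
        if slot is not None and slot[0] == pc:
            return slot[1]  # match: fetch the predicted target
        return pc + 4       # match fails: fetch PC+4

    def update(self, pc: int, taken: bool, target: int) -> None:
        i = pc % self.entries
        if taken:
            # Only taken branches and jumps are held in the BTB.
            self.slots[i] = (pc, target)
        elif self.slots[i] is not None and self.slots[i][0] == pc:
            # The branch resolved not taken: drop its entry.
            self.slots[i] = None


btb = TaggedBTB(128)
btb.update(132, True, 236)           # the jump at 132 goes to 236
after_jump = btb.next_pc(132)        # hits in the BTB
after_add = btb.next_pc(1028)        # same slot, but the tag differs
```

The lookup needs only the fetch PC, so the next PC is available before the instruction at that PC has been fetched or decoded.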
Consulting BTB Before Decoding
132   Jump 100      (BTB entry: entry PC = 132, target = 236, BPb = take)
1028  Add .....
• The match for PC=1028 fails and 1028+4 is fetched
⇒ eliminates false predictions after ALU instructions
• BTB contains entries only for control transfer instructions
⇒ more room to store branch targets
CS152 Administrivia
• Lab 4, branch predictor competition, due April 3
– PRIZE (TBD) for winners in both unlimited and realistic categories
• Quiz 4, Tuesday April 8
Combining BTB and BHT
• BTB entries are considerably more expensive than BHT, but can
redirect fetches at earlier stage in pipeline and can accelerate
indirect branches (JR)
• BHT can hold many more entries and is more accurate
[Figure: fetch pipeline stages A (PC Generation/Mux), P (Instruction Fetch Stage 1), F (Instruction Fetch Stage 2), B (Branch Address Calc/Begin Decode), I (Complete Decode), J (Steer Instructions to Functional units), R (Register File Read), E (Integer Execute), with the BTB and the BHT attached at different stages. The BHT, in a later pipeline stage, corrects when the BTB misses a predicted taken branch.]
BTB/BHT only updated after branch resolves in E stage
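The division of labor between the two structures can be sketched as two small functions, one per pipeline stage. The BTB is modeled as a plain dictionary from branch PC to target, and the function and parameter names are hypothetical; this illustrates the idea and is not the UltraSPARC-III logic.

```python
def fetch_next_pc(pc: int, btb: dict) -> int:
    """Early stage: a BTB hit redirects fetch, otherwise fall through."""
    return btb.get(pc, pc + 4)


def decode_redirect(fetched_next_pc: int, is_branch: bool,
                    branch_target: int, bht_predicts_taken: bool):
    """Later stage: the target is now computed and the BHT consulted.

    Returns the corrected next PC if the BHT predicts a taken branch
    that the BTB did not steer toward, else None (no redirect).
    """
    if is_branch and bht_predicts_taken and fetched_next_pc != branch_target:
        return branch_target
    return None


btb = {132: 236}
hit_next = fetch_next_pc(132, btb)    # BTB hit
miss_next = fetch_next_pc(200, btb)   # BTB miss, falls through to 204
correction = decode_redirect(miss_next, True, 300, True)
```

A correction from the later stage still costs the stages between the two structures, but that is cheaper than waiting for the branch to resolve in the E stage.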
Uses of Jump Register (JR)
• Switch statements (jump to address of matching case)
BTB works well if same case used repeatedly
• Dynamic function call (jump to run-time function address)
BTB works well if same function usually called (e.g., in
C++ programming, when objects have same type in
virtual function call)
• Subroutine returns (jump to return address)
BTB works well if usually return to the same place
⇒ Often one function called from many distinct call sites!
How well does BTB work for each of these cases?
Subroutine Return Stack
Small structure to accelerate JR for subroutine returns,
typically much more accurate than BTBs.
fa() { fb(); }
fb() { fc(); }
fc() { fd(); }
Push call address when function call executed
Pop return address when subroutine return decoded
[Figure: stack of k entries (typically k=8-16) holding &fd(), &fc(), &fb()]
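A return stack of this kind can be sketched as a small bounded stack. The slides do not say what happens when more than k calls are outstanding; discarding the oldest entry is an assumption here, as are the class and method names.

```python
class ReturnAddressStack:
    """Bounded stack of predicted return addresses."""

    def __init__(self, k: int = 8):
        self.k = k
        self.stack = []

    def push(self, return_address: int) -> None:
        # Called when a function call is executed.
        if len(self.stack) == self.k:
            self.stack.pop(0)  # assumed overflow policy: drop the oldest
        self.stack.append(return_address)

    def pop(self):
        # Called when a subroutine return is decoded.
        # Returns None when there is no prediction to offer.
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(k=8)
for address in (0x104, 0x204, 0x304):   # fa calls fb, fb calls fc, fc calls fd
    ras.push(address)
predicted = [ras.pop(), ras.pop(), ras.pop()]
```

Returns come back in the reverse order of the calls, so the most recent call site is always on top, no matter how many distinct call sites a function has.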
Mispredict Recovery
In-order execution machines:
– Assume no instruction issued after branch can write-back before
branch resolves
– Kill all instructions in pipeline behind mispredicted branch
Out-of-order execution?
– Multiple instructions following branch in program
order can complete before branch resolves
In-Order Commit for Precise Exceptions
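[Figure: in-order Fetch and Decode feed the Reorder Buffer; Execute is out-of-order; Commit is in-order. When an exception is detected at commit, the instructions in Fetch, Decode, and the Reorder Buffer are killed and the handler PC is injected.]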
• Instructions fetched and decoded into instruction
reorder buffer in-order
• Execution is out-of-order (⇒ out-of-order completion)
• Commit (write-back to architectural state, i.e., regfile &
memory) is in-order
Temporary storage needed in ROB to hold results before commit
Branch Misprediction in Pipeline
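[Figure: pipeline PC, Fetch, Decode, Reorder Buffer, Commit, with Execute fed from the Reorder Buffer and reporting Complete back to it. Branch prediction acts at the PC. Branch resolution at Execute kills the instructions in Fetch, Decode, and the Reorder Buffer and injects the correct PC.]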
• Can have multiple unresolved branches in ROB
• Can resolve branches out-of-order by killing all the
instructions in ROB that follow a mispredicted branch
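Killing everything that follows a mispredicted branch is simple when the reorder buffer is kept in program order. The sketch below models the ROB as a Python list, oldest instruction first; that representation and the function name are assumptions.

```python
def kill_after_branch(rob: list, branch_position: int):
    """Split the ROB at a mispredicted branch.

    rob:             entries in program order, oldest first.
    branch_position: index of the mispredicted branch in rob.
    Returns (survivors, killed). The branch itself survives.
    """
    survivors = rob[:branch_position + 1]
    killed = rob[branch_position + 1:]
    return survivors, killed


# Two unresolved branches in flight; the older one (b1) mispredicts.
rob = ["ld", "b1", "add", "b2", "mul"]
survivors, killed = kill_after_branch(rob, rob.index("b1"))
```

Note that the younger branch b2 is killed along with the rest, which is why branches can be resolved in any order: whatever a mispredicted branch kills no longer needs resolving.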
Recovering ROB/Renaming Table
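[Figure: rename table (entries r1, r2, ..., each with a tag t and valid bit v) backed by a set of rename snapshots, next to the register file. The reorder buffer holds entries t1 ... tn with fields Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data; Ptr2 marks the next entry to commit, Ptr1 the next available entry, and a rollback arrow moves the next-available pointer back. The Load Unit, FUs, and Store Unit return < t, result > pairs; Commit writes the register file.]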
Take snapshot of register rename table at each predicted
branch, recover earlier snapshot if branch mispredicted
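Snapshot-based recovery can be sketched with a dictionary as the rename table and a list of saved copies. The names, the use of string tags, and the choice to discard all younger snapshots on recovery are assumptions made for illustration.

```python
class RenameTable:
    """Architectural register -> tag mapping with per-branch snapshots."""

    def __init__(self, arch_regs):
        self.mapping = {reg: None for reg in arch_regs}
        self.snapshots = []  # (branch_id, saved mapping), oldest first

    def rename(self, arch_reg: str, tag: str) -> None:
        self.mapping[arch_reg] = tag

    def snapshot(self, branch_id: str) -> None:
        # Taken at each predicted branch.
        self.snapshots.append((branch_id, dict(self.mapping)))

    def recover(self, branch_id: str) -> None:
        # On a misprediction: restore the table as it was at the branch
        # and discard this snapshot and every younger one.
        for position, (saved_id, saved) in enumerate(self.snapshots):
            if saved_id == branch_id:
                self.mapping = dict(saved)
                del self.snapshots[position:]
                return
        raise KeyError(branch_id)

    def resolve_correct(self, branch_id: str) -> None:
        # Branch was predicted correctly: its snapshot is no longer needed.
        self.snapshots = [s for s in self.snapshots if s[0] != branch_id]
```

For example, if r1 is renamed to t1, branch b1 is predicted, and r1 is later renamed to t2 on the speculative path, then recover("b1") makes r1 map to t1 again in one step, without walking the reorder buffer.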
Speculating Both Directions
An alternative to branch prediction is to execute
both directions of a branch speculatively
• resource requirement is proportional to the
number of concurrent speculative executions
• only half the resources engage in useful work
when both directions of a branch are executed
speculatively
• branch prediction takes fewer resources
than speculative execution of both paths
With accurate branch prediction, it is more cost
effective to dedicate all resources to the predicted
direction
Acknowledgements
• These slides contain material developed and
copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
• MIT material derived from course 6.823
• UCB material derived from course CS252