CS 152 Computer Architecture and Engineering Lecture 14 - Branch Prediction Krste Asanovic

CS 152 Computer Architecture and Engineering Lecture 14 - Branch Prediction Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste http://inst.eecs.berkeley.edu/~cs152 Last time in Lecture 13 • Register renaming removes WAR, WAW hazards by giving a new internal destination register for every new result • Pipeline is structured with in-order fetch/decode/rename, followed by out-of-order execution/complete, followed by in-order commit • At commit time, can detect exceptions and roll back buffer to provide precise interrupts 3/20/2008 CS152-Spring’08 2 Recap: Overall Pipeline Structure In-order Fetch Out-of-order Reorder Buffer Decode Kill In-order Commit Kill Kill Execute Inject handler PC Exception? • Instructions fetched and decoded into instruction reorder buffer in-order • Execution is out-of-order (  out-of-order completion) • Commit (write-back to architectural state, i.e., regfile & memory) is in-order Temporary storage needed to hold results before commit (shadow registers and store buffers) 3/20/2008 CS152-Spring’08 3 Control Flow Penalty Next fetch started PC I-cache Modern processors may have > 10 pipeline stages between next PC calculation and branch resolution ! Fetch Buffer Fetch Decode Issue Buffer How much work is lost if pipeline doesn’t follow correct instruction flow? Func. Units Execute ~ Loop length x pipeline width Branch executed Result Buffer Commit Arch. State 3/20/2008 CS152-Spring’08 4 MIPS Branches and Jumps Each instruction fetch depends on one or two pieces of information from the preceding instruction: 1) Is the preceding instruction a taken branch? 2) If so, what is the target address? Instruction Taken known? Target known? J After Inst. Decode After Inst. Decode JR After Inst. Decode After Reg. Fetch BEQZ/BNEZ After Reg. Fetch* After Inst. Decode *Assuming 3/20/2008 zero detect on register read CS152-Spring’08 5 Branch Penalties in Modern Pipelines UltraSPARC-III instruction fetch pipeline stages (in-order issue, 4-way superscalar, 750MHz, 2000) Branch Target Address Known Branch Direction & Jump Register Target Known 3/20/2008 A P F B I J R E PC Generation/Mux Instruction Fetch Stage 1 Instruction Fetch Stage 2 Branch Address Calc/Begin Decode Complete Decode Steer Instructions to Functional units Register File Read Integer Execute Remainder of execute pipeline (+ another 6 stages) CS152-Spring’08 6 Reducing Control Flow Penalty Software solutions • Eliminate branches - loop unrolling Increases the run length • Reduce resolution time - instruction scheduling Compute the branch condition as early as possible (of limited value) Hardware solutions • Find something else to do - delay slots Replaces pipeline bubbles with useful work (requires software cooperation) • Speculate - branch prediction Speculative execution of instructions beyond the branch 3/20/2008 CS152-Spring’08 7 Branch Prediction Motivation: Branch penalties limit performance of deeply pipelined processors Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly Required hardware support: Prediction structures: • Branch history tables, branch target buffers, etc. Mispredict recovery mechanisms: • Keep result computation separate from commit • Kill instructions following branch in pipeline • Restore state to state following branch 3/20/2008 CS152-Spring’08 8 Static Branch Prediction Overall probability a branch is taken is ~60-70% but: backward 90% forward 50% JZ JZ ISA can attach preferred direction semantics to branches, e.g., Motorola MC88110 bne0 (preferred taken) beq0 (not taken) ISA can allow arbitrary choice of statically predicted direction, e.g., HP PA-RISC, Intel IA-64 typically reported as ~80% accurate 3/20/2008 CS152-Spring’08 9 Dynamic Branch Prediction learning based on past behavior Temporal correlation The way a branch resolves may be a good predictor of the way it will resolve at the next execution Spatial correlation Several branches may resolve in a highly correlated manner (a preferred path of execution) 3/20/2008 CS152-Spring’08 10 Branch Prediction Bits • Assume 2 BP bits per instruction • Change the prediction after two consecutive mistakes! taken taken take right ¬ taken ¬take wrong taken ¬ taken ¬take right ¬ taken taken take wrong ¬ taken BP state: (predict take/¬take) x (last prediction right/wrong) 3/20/2008 CS152-Spring’08 11 Branch History Table Fetch PC 00 k I-Cache Instruction Opcode BHT Index 2k-entry BHT, 2 bits/entry offset + Branch? Target PC Taken/¬Taken? 4K-entry BHT, 2 bits/entry, ~80-90% correct predictions 3/20/2008 CS152-Spring’08 12 Exploiting Spatial Correlation Yeh and Patt, 1992 if (x[i] < y += if (x[i] < c -= 7) then 1; 5) then 4; If first condition false, second condition also false History register, H, records the direction of the last N branches executed by the processor 3/20/2008 CS152-Spring’08 13 Two-Level Branch Predictor Pentium Pro uses the result from the last two branches to select one of the four sets of BHT bits (~95% correct) 00 Fetch PC k 2-bit global branch history shift register Shift in Taken/¬Taken results of each branch Taken/¬Taken? 3/20/2008 CS152-Spring’08 14 Limitations of BHTs Only predicts branch direction. Therefore, cannot redirect fetch stream until after branch target is determined. Correctly predicted taken branch penalty Jump Register penalty A P F B I J R E PC Generation/Mux Instruction Fetch Stage 1 Instruction Fetch Stage 2 Branch Address Calc/Begin Decode Complete Decode Steer Instructions to Functional units Register File Read Integer Execute Remainder of execute pipeline (+ another 6 stages) UltraSPARC-III fetch pipeline 3/20/2008 CS152-Spring’08 15 Branch Target Buffer predicted target BPb Branch Target Buffer (2k entries) IMEM k PC target BP BP bits are stored with the predicted target address. IF stage: If (BP=taken) then nPC=target else nPC=PC+4 later: check prediction, if wrong then kill the instruction and update BTB & BPb else update BPb 3/20/2008 CS152-Spring’08 16 Address Collisions 132 Jump 100 Assume a 128-entry BTB 1028 Add ..... target 236 BPb take Instruction What will be fetched after the instruction at 1028? Memory BTB prediction Correct target = 236 = 1032  kill PC=236 and fetch PC=1032 Is this a common occurrence? Can we avoid these bubbles? 3/20/2008 CS152-Spring’08 17 BTB is only for Control Instructions BTB contains useful information for branch and jump instructions only  Do not update it for other instructions For all other instructions the next PC is PC+4 ! How to achieve this effect without decoding the instruction? 3/20/2008 CS152-Spring’08 18 Branch Target Buffer (BTB) I-Cache 2k-entry direct-mapped BTB PC (can also be associative) Entry PC Valid predicted target PC valid target k = match • • • • Keep both the branch PC and target PC in the BTB PC+4 is fetched if match fails Only taken branches and jumps held in BTB Next PC determined before branch fetched and decoded 3/20/2008 CS152-Spring’08 19 Consulting BTB Before Decoding 132 Jump 100 entry PC 132 target 236 BPb take 1028 Add ..... • The match for PC=1028 fails and 1028+4 is fetched  eliminates false predictions after ALU instructions • BTB contains entries only for control transfer instructions  more room to store branch targets 3/20/2008 CS152-Spring’08 20 CS152 Administrivia • Lab 4, branch predictor competition, due April 3 – PRIZE (TBD) for winners in both unlimited and realistic categories • Quiz 4, Tuesday April 8 3/20/2008 CS152-Spring’08 21 Combining BTB and BHT • BTB entries are considerably more expensive than BHT, but can redirect fetches at earlier stage in pipeline and can accelerate indirect branches (JR) • BHT can hold many more entries and is more accurate BTB BHT in later pipeline stage corrects when BTB misses a predicted taken branch BHT A P F B I J R E PC Generation/Mux Instruction Fetch Stage 1 Instruction Fetch Stage 2 Branch Address Calc/Begin Decode Complete Decode Steer Instructions to Functional units Register File Read Integer Execute BTB/BHT only updated after branch resolves in E stage 3/20/2008 CS152-Spring’08 22 Uses of Jump Register (JR) • Switch statements (jump to address of matching case) BTB works well if same case used repeatedly • Dynamic function call (jump to run-time function address) BTB works well if same function usually called, (e.g., in C++ programming, when objects have same type in virtual function call) • Subroutine returns (jump to return address) BTB works well if usually return to the same place  Often one function called from many distinct call sites! How well does BTB work for each of these cases? 3/20/2008 CS152-Spring’08 23 Subroutine Return Stack Small structure to accelerate JR for subroutine returns, typically much more accurate than BTBs. fa() { fb(); } fb() { fc(); } fc() { fd(); } Pop return address when subroutine return decoded Push call address when function call executed &fd() &fc() &fb() 3/20/2008 CS152-Spring’08 k entries (typically k=8-16) 24 Mispredict Recovery In-order execution machines: – Assume no instruction issued after branch can write-back before branch resolves – Kill all instructions in pipeline behind mispredicted branch Out-of-order execution? – Multiple instructions following branch in program order can complete before branch resolves 3/20/2008 CS152-Spring’08 25 In-Order Commit for Precise Exceptions In-order Fetch Out-of-order Reorder Buffer Decode Kill In-order Commit Kill Kill Execute Inject handler PC Exception? • Instructions fetched and decoded into instruction reorder buffer in-order • Execution is out-of-order (  out-of-order completion) • Commit (write-back to architectural state, i.e., regfile & memory, is in-order Temporary storage needed in ROB to hold results before commit 3/20/2008 CS152-Spring’08 26 Branch Misprediction in Pipeline Inject correct PC Branch Prediction Branch Resolution Kill Kill Kill PC Fetch Decode Reorder Buffer Commit Complete Execute • Can have multiple unresolved branches in ROB • Can resolve branches out-of-order by killing all the instructions in ROB that follow a mispredicted branch 3/20/2008 CS152-Spring’08 27 Recovering ROB/Renaming Table Rename Table r1 t t vvv t t v Rename Snapshots Registe r File r2 Ptr2 next to commit Ins# use exec op p1 src1 p2 src2 pd dest t1 t2 . . tn data rollback next available Ptr1 next available Reorder buffer Load Unit FU FU FU Store Unit Commit < t, result > Take snapshot of register rename table at each predicted branch, recover earlier snapshot if branch mispredicted 3/20/2008 CS152-Spring’08 28 Speculating Both Directions An alternative to branch prediction is to execute both directions of a branch speculatively • resource requirement is proportional to the number of concurrent speculative executions • only half the resources engage in useful work when both directions of a branch are executed speculatively • branch prediction takes less resources than speculative execution of both paths With accurate branch prediction, it is more cost effective to dedicate all resources to the predicted direction 3/20/2008 CS152-Spring’08 29 Acknowledgements • These slides contain material developed and copyright by: – – – – – – Arvind (MIT) Krste Asanovic (MIT/UCB) Joel Emer (Intel/MIT) James Hoe (CMU) John Kubiatowicz (UCB) David Patterson (UCB) • MIT material derived from course 6.823 • UCB material derived from course CS252 3/20/2008 CS152-Spring’08 30

CS 152 Computer Architecture and Engineering Lecture 14 - Branch Prediction Krste Asanovic

Related documents

Products

Support

CS 152 Computer Architecture and Engineering Lecture 14 - Branch Prediction Krste Asanovic

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib