Branch Prediction
Static, Dynamic Branch prediction techniques
10/14 branch.1

Control Flow Penalty: Why Branch Prediction
• Modern processors have 10-14 pipeline stages between next-PC calculation and branch resolution!
• Work lost if pipeline makes wrong prediction ~ loop length x pipeline width
[Figure: pipeline of PC, I-cache, Fetch Buffer, Issue Buffer, Func. Units, Result Buffer and Arch. State, with stages labelled Fetch, Decode, Execute and Commit; the labels "Next fetch started" and "Branch executed" mark the loop whose length sets the penalty.]

Branch Penalties in a Superscalar are extensive

Reducing Control Flow Penalty
Software solutions
• Minimize branches - loop unrolling
  – Increases the run length
Hardware solutions
• Find something else to do - delay slots
• Speculate
  – Dynamic branch prediction
  – Speculative execution of instructions beyond branch

Branch Prediction
Motivation:
• Branch penalties limit performance of deeply pipelined processors
• Much worse for superscalar processors
• Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly
Required hardware support:
Dynamic prediction HW:
• Branch history tables, branch target buffers, etc.
Mispredict recovery mechanisms:
• Keep computation result separate from commit
• Kill instructions following branch
• Restore state to state following branch

Static Branch Prediction - review
• Overall probability a branch is taken is ~60-70%, but:
  – backward: 90%
  – forward: 50%
[Figure: two JZ branches, one jumping backward and one jumping forward.]
• ISA can attach preferred direction semantics to branches, e.g., Motorola MC88110: bne0 (preferred taken), beq0 (not taken)
• ISA can allow arbitrary choice of statically predicted direction, e.g., HP PA-RISC, Intel IA-64; typically reported as ~80% accurate

Branch Prediction Needs
• Target address generation
  – Get register: PC, Link reg., GP reg.
  – Calculate: +/- offset, auto inc/dec
  – Target speculation
• Condition resolution
  – Get register: condition code reg., count reg., other reg.
  – Compare registers
  – Condition speculation

Target address generation takes time

Condition resolution takes time

Solution: Branch speculation

Branch Prediction Schemes
1. 2-bit Branch-Prediction Buffer
2. Branch Target Buffer
3. Correlating Branch Prediction Buffer
4. Tournament Branch Predictor
5. Integrated Instruction Fetch Units
6. Return Address Predictors (for subroutines, Pentium, Core Duo)
7. Predicated Execution (Itanium)

Dynamic Branch Prediction: learning based on past behavior
[Figure: a Branch Predictor block holding History Information; input "Incoming Branches { Address }", output "Prediction { Address, Value }", feedback "Corrections { Address, Value }".]
• Incoming stream of addresses
• Fast outgoing stream of predictions
• Correction information returned from pipeline

Branch History Table (BHT): Table of predictors
• Each branch given its own predictor
• BHT is table of “Predictors”
  – Could be 1-bit or more
  – Indexed by PC address of branch
• Problem: in a loop, a 1-bit BHT will cause two mispredictions (avg is 9 iterations before exit):
  – End of loop case: when it exits the loop
  – First iteration the next time the loop is entered: it predicts exit instead of looping
• Most schemes use at least 2-bit predictors
• Performance = ƒ(accuracy, cost of misprediction)
  – Misprediction: flush Reorder Buffer
• In Fetch stage of branch:
  – Use Predictor to make prediction
• When branch completes:
  – Update corresponding Predictor
[Figure: the Branch PC indexes a table of entries Predictor 0, Predictor 1, ..., Predictor 7.]

Branch History Table Organization
• Target PC calculation takes time
[Figure: the Fetch PC (low two bits 00 dropped) supplies k bits as the BHT Index into a 2^k-entry BHT, 2 bits/entry, in parallel with the I-Cache access; the fetched instruction's opcode answers "Branch?", PC + offset gives the Target PC, and the BHT entry gives Taken/¬Taken?]
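The indexing in the BHT organization figure can be sketched in a few lines. This is a minimal illustration, not the hardware's actual logic: it assumes 4-byte aligned instructions (so the two low PC bits are always 00) and an index width of k = 12; the names `K` and `bht_index` are hypothetical.

```python
K = 12  # assumed index width: 2**12 = 4096 BHT entries


def bht_index(pc: int, k: int = K) -> int:
    """Return the BHT index for the branch at byte address pc.

    Drops the two low bits (always 00 for 4-byte instructions) and
    keeps the next k bits, as in the figure's "00 | k" split of the PC.
    """
    return (pc >> 2) & ((1 << k) - 1)


# Consecutive instructions map to consecutive entries.
print(hex(bht_index(0x1000)))  # 0x400
print(hex(bht_index(0x1004)))  # 0x401

# Branches 2**(k+2) bytes apart alias to the same entry, because the
# table is not tagged with the remaining PC bits.
print(bht_index(0x1000) == bht_index(0x1000 + (1 << (K + 2))))  # True
```

Because the table is indexed but not tagged, two branches that alias share one predictor and can disturb each other's history; a larger k reduces how often that happens.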
4K-entry BHT, 2 bits/entry: ~80-90% correct predictions

2-bit Dynamic Branch Prediction: more accurate than 1-bit
• Better solution: 2-bit scheme that changes the prediction only after two mispredictions
[Figure: four-state diagram with two "Predict Taken" states and two "Predict Not Taken" states; edges are labelled T (taken) and NT (not taken).]
• Red: stop, not taken
• Green: go, taken
• Adds hysteresis to decision making process

BTB: Branch Address at Same Time as Prediction
• Branch Target Buffer (BTB): address of branch used as index to get prediction AND branch address (if taken)
[Figure: the PC of the instruction in FETCH is compared (=?) with the Branch PC field of each BTB entry; a matching entry supplies the Predicted PC and the prediction state bits.]
• Yes: instruction is a branch; use predicted PC as next PC
• No: branch not predicted, proceed normally (Next PC = PC+4)
• Only predicted taken branches and jumps held in BTB
• Next PC determined before branch fetched and decoded
• Later: check prediction; if wrong, kill instruction, update BPb

BTB contains only Branch & Jump Instructions
• BTB contains information for branch and jump instructions only; it is not updated for other instructions
• For all other instructions the next PC is PC+4!
• Achieved without decoding instruction

Combining BTB and BHT
• BTB entries are considerably more expensive than BHT entries, but the BTB redirects fetch earlier in the pipeline and can accelerate indirect branches (JR)
• BHT can hold many more entries and is more accurate
• BHT in later pipeline stage corrects when BTB misses a predicted taken branch
Pipeline stages (the BTB is consulted at the front of the pipeline, the BHT in a later stage):
  A  PC Generation/Mux
  P  Instruction Fetch Stage 1
  F  Instruction Fetch Stage 2
  B  Branch Address Calc/Begin Decode
  I  Complete Decode
  J  Steer Instructions to Functional units
  R  Register File Read
  E  Integer Execute
• BTB/BHT only updated after branch resolves in E stage

Subroutine Return Stack
• Small stack: accelerates subroutine returns
• More accurate than BTBs
• Pop return address when subroutine return decoded
• Push return address when function call executed
[Figure: stack holding the return addresses &nextc, &nextb, &nexta; k entries (typically k=8-16).]

Mispredict Recovery
In-order execution machines:
  – Instructions issued after branch cannot write back before branch resolves
  – All instructions in pipeline behind mispredicted branch are killed

Predicated Execution
• Avoid branch prediction by turning branches into conditionally executed instructions:
  if (x) then A = B op C else NOP
  – If false, then neither store result nor cause exception
  – Expanded ISAs of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr.
  – IA-64: 64 1-bit condition fields, selectable so that any instruction can be executed conditionally
  – This transformation is called “if-conversion”
[Figure: predicate x guarding the instruction A = B op C.]
• Drawbacks to conditional instructions
  – Still takes a clock even if “annulled”
  – Stall if condition evaluated late
  – Complex conditions reduce effectiveness; condition becomes known late in pipeline

Accuracy v. Size (SPEC89)

Dynamic Branch Prediction Summary
• Prediction becoming important part of scalar execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: recently executed branches correlated with next branch
• Tournament Predictor: give more resources to competing predictors and pick between them
• Branch Target Buffer: includes branch address & prediction
• Predicated Execution can reduce number of branches, number of mispredicted branches
• Return address stack for prediction of indirect jump
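The "2 bits for loop accuracy" point can be checked with a short simulation. The sketch below is illustrative only: it models a single predictor entry as an n-bit saturating counter, which is one common 2-bit encoding and may differ in detail from the state diagram on the 2-bit slide. The initial state (strongly taken), the 100 loop executions, and the names `count_mispredictions` and `loop_branch` are all assumptions; the 9-taken-then-exit pattern follows the BHT slide's "avg is 9 iterations before exit".

```python
def count_mispredictions(outcomes, bits):
    """Count mispredictions of one n-bit saturating counter on a branch outcome stream."""
    top = (1 << bits) - 1        # highest counter value
    threshold = 1 << (bits - 1)  # counter >= threshold means "predict taken"
    counter = top                # assumed initial state: strongly taken
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= threshold
        if predicted_taken != taken:
            misses += 1
        # Train toward the actual outcome, saturating at 0 and top.
        counter = min(top, counter + 1) if taken else max(0, counter - 1)
    return misses


# Loop-closing branch: taken 9 times, then not taken on exit; the loop runs 100 times.
loop_branch = ([True] * 9 + [False]) * 100

one_bit = count_mispredictions(loop_branch, bits=1)
two_bit = count_mispredictions(loop_branch, bits=2)
print(one_bit, two_bit)  # 199 100
```

With one bit, every loop execution after the first costs two mispredictions (the exit, then the first iteration of the next execution); with two bits, the single not-taken outcome at exit only weakens the counter, so just the exit is mispredicted.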