CDA 5155 Week 4
Branch Prediction

[Figure: pipelined datapath with IF/ID, ID/EX, EX/Mem, and Mem/WB registers, showing PC, instruction memory, register file, sign extension, ALU, data memory, control, and the beq target / eq? logic feeding the PC mux]

Branch Target Buffer
[Flowchart: send the fetch PC to the BTB. Found? Yes: use the predicted target PC. No: use PC+1.]

Branch prediction
• Predict not taken: ~50% accurate
  – No BTB needed; always use PC+1
• Predict backward taken: ~65% accurate
  – BTB holds targets for backward branches (loops)
• Predict same as last time: ~80% accurate
  – Update BTB for any taken branch

What about indirect branches?
• Could use same approach
  – PC+1 is an unlikely indirect target
  – Indirect jumps often have multiple targets (for the same instruction)
    • Switch statements
    • Virtual function calls
    • Shared library (DLL) calls

Indirect jump: Special Case
• Return address stack
  – Function returns have deterministic behavior (usually)
    • Return to different locations (BTB doesn’t work well)
    • Return location known ahead of time
      – In some register at the time of the call
  – Build a specialized structure for return addresses
    • Call instructions write return address to R31 AND RAS
    • Return instructions pop predicted target off stack
  – Issues: finite size (save or forget on overflow?)
  – Issues: long jumps (clear when wrong?)

Costs of branch prediction/speculation
• Performance costs?
  – Minimal: no difference between waiting and squashing, and it is a huge gain when the prediction is correct!
• Power?
  – Large: in very long/wide pipelines many instructions can be squashed
    • Squashed = # mispredictions × pipeline length/width before target resolved

Costs of branch prediction/speculation (continued)
• Area?
  – Can be large: predictors can get very big, as we will see next time
• Complexity?
  – Designs are more complex
  – Testing becomes more difficult, but …

What else can be speculated?
• Dependencies
  – I think this data is coming from that store instruction
• Values
  – I think I will load a 0 value
• Accuracy?
  – Branch prediction (direction) is Boolean (T, NT)
  – Branch targets are stable or predictable (RAS)
  – Dependencies are limited
  – Values cover a huge space (0 – 4B)

Parts of the branch predictor
• Direction Predictor
  – For conditional branches
    • Predicts whether the branch will be taken
  – Examples:
    • Always taken; backwards taken
• Address Predictor
  – Predicts the target address (use if predicted taken)
  – Examples:
    • BTB; Return Address Stack; Precomputed Branch
• Recovery logic
Ref: The Precomputed Branch Architecture

Characteristics of branches
• Individual branches differ
  – Loops tend not to exit
    • Unoptimized code: not-taken
    • Optimized code: taken
  – If-statements:
    • Tend to be less predictable
  – Unconditional branches
    • Still need address prediction

Example gzip:
• gzip: loop branch A @ 0x1200098d8
  • Executed: 1359575 times
  • Taken: 1359565 times
  • Not-taken: 10 times
  • % time taken: 99% - 100%
Easy to predict (direction and address)

Example gzip:
• gzip: if branch B @ 0x12000fa04
  • Executed: 151409 times
  • Taken: 71480 times
  • Not-taken: 79929 times
  • % time taken: ~47%
Easy to predict? (maybe not / maybe dynamically)

Example: gzip
[Figure: total branch executions (y-axis, 0 to 16,000,000) versus % taken per branch (x-axis, 0 to 100); branches A and B are marked, and both ends of the x-axis are labeled "Easy to predict"]
Direction prediction: always taken
Accuracy: ~73%

Branch Backwards
[Figure: % of total branches (y-axis, 0 to 3.5) versus distance of branch target (x-axis, negative for backward targets to positive for forward targets), with separate series for taken and not taken]
Most backward branches are heavily TAKEN
Forward branches slightly more likely to be NOT-TAKEN
Ref: The Effects of Predicated Execution on Branch Prediction

Using history
• 1-bit history (direction predictor)
  – Remember the last direction for a branch
[Diagram: branchPC indexes the Branch History Table; each entry holds a 1-bit state, NT or T]
How big is the BHT?
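The 1-bit history scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the design from the lecture: the table size (1024 entries), the indexing by low-order PC bits, the initial NT state, and the outcome sequence are all assumptions made for the example. Only the branch address is borrowed from the gzip loop branch A.

```python
class OneBitPredictor:
    """1-bit history: predict that a branch repeats its last direction."""

    def __init__(self, entries=1024):
        self.entries = entries            # assumed BHT size (entries x 1 bit)
        self.table = [False] * entries    # False = NT, True = T; assumed to start at NT

    def _index(self, pc):
        # Assumed indexing: low-order PC bits select the entry.
        return pc % self.entries

    def predict(self, pc):
        return self.table[self._index(pc)]

    def update(self, pc, taken):
        # Remember the direction the branch actually went.
        self.table[self._index(pc)] = taken


def count_mispredicts(predictor, pc, outcomes):
    """Run one branch through the predictor and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# Hypothetical loop branch: taken 9 times, then falls through, run 3 times.
loop = ([True] * 9 + [False]) * 3
print(count_mispredicts(OneBitPredictor(), 0x1200098d8, loop))
```

With this sequence the predictor misses twice per loop execution: once on the exit and once more on re-entry, because the exit overwrote the single history bit. Under the assumed size, the table costs 1024 bits of state.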
Example: gzip
[Figure: total branch executions (y-axis, 0 to 16,000,000) versus % taken per branch (x-axis, 0 to 100), with branches A and B marked]
Direction prediction: always taken
Accuracy: ~73%
How many times will branch A mispredict?
How many times will branch B mispredict?

Using history
• 2-bit history (direction predictor)
[Diagram: branchPC indexes the Branch History Table; each entry is a 2-bit state machine with states SN, NT, T, ST]
How big is the BHT?

Example: gzip (same plot)
How many times will branch A mispredict?
How many times will branch B mispredict?

Using History Patterns
~80 percent of branches are either heavily TAKEN or heavily NOT-TAKEN.
For the other 20%, we need to look at patterns of reference to see if they are predictable using a more complex predictor.
Example: gcc has a branch that flips each time
T(1) NT(0): 10101010101010101010101010101010101010

Local history
[Diagram: branchPC indexes the Branch History Table; the selected history (10101010) indexes the Pattern History Table, whose entries predict NT or T]
What is the prediction for this BHT 10101010?
When do I update the tables?

Local history
[Diagram: same structure, now with history 01010101 selecting a different Pattern History Table entry]
On the next execution of this branch instruction, the branch history table is 01010101, pointing to a different pattern.
What is the accuracy of a flip/flop branch 0101010101010…?
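The local-history scheme above can be sketched as a two-level predictor in Python. This is an illustrative model under stated assumptions rather than the lecture's exact design: the sizes (8 history bits, 1024 BHT entries), the 2-bit saturating counters in the pattern table, their initial "weakly not-taken" value, and the choice to update both tables when the branch resolves are all assumptions. The branch address used below is hypothetical.

```python
class LocalHistoryPredictor:
    """Two-level predictor: per-branch history selects a pattern-table counter."""

    def __init__(self, history_bits=8, bht_entries=1024):
        self.history_bits = history_bits      # assumed history length
        self.bht_entries = bht_entries        # assumed BHT size
        self.bht = [0] * bht_entries          # per-branch history shift registers
        # Pattern History Table of 2-bit counters (0..3), one per history pattern.
        # Assumed initial value 1 = weakly not-taken.
        self.pht = [1] * (1 << history_bits)

    def predict(self, pc):
        history = self.bht[pc % self.bht_entries]
        return self.pht[history] >= 2         # counter value 2 or 3 means predict taken

    def update(self, pc, taken):
        i = pc % self.bht_entries
        history = self.bht[i]
        # Train the counter that was selected by the old history...
        if taken:
            self.pht[history] = min(3, self.pht[history] + 1)
        else:
            self.pht[history] = max(0, self.pht[history] - 1)
        # ...then shift the actual outcome into this branch's history.
        mask = (1 << self.history_bits) - 1
        self.bht[i] = ((history << 1) | int(taken)) & mask


def accuracy(predictor, pc, outcomes):
    """Fraction of correct predictions for one branch."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(outcomes)


# Flip/flop branch: T, NT, T, NT, ...
flip_flop = [i % 2 == 0 for i in range(1000)]
print(accuracy(LocalHistoryPredictor(), 0x4000, flip_flop))
```

Once the history register fills, the branch only ever sees the two patterns 10101010 and 01010101, and each one trains its own counter, so the predictor becomes correct after a short warm-up. A single 1-bit or 2-bit counter without history cannot separate the two cases.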
Global history
[Diagram: the Branch History Register (e.g. 01110101) indexes the Pattern History Table]

  for (i=0; i<100; i++)
    for (j=0; j<3; j++)

  branch   j     history   outcome
  j<3      j=1   1101      taken
  j<3      j=2   1011      taken
  j<3      j=3   0111      not taken
  i<100          1110      usually taken

  if (aa == 2) aa = 0;
  if (bb == 2) bb = 0;
  if (aa != bb) {…

How can branches interfere with each other?

Gshare predictor
[Diagram: branchPC is XORed with the Branch History Register (e.g. 01110101) to index the Pattern History Table]
Must read! Ref: Combining Branch Predictors

Bimod predictor
[Diagram: branchPC indexes a choice predictor; branchPC XOR the global history register indexes two PHTs, one skewed taken and one skewed not-taken; a mux driven by the choice predictor selects between them]

Hybrid predictors
[Diagram: a local predictor (e.g. 2-bit) produces Prediction 1; a global/gshare predictor (much more state) produces Prediction 2; a selection table (2-bit state machine) picks the final prediction]
How do you select which predictor to use?
How do you update the various predictors and the selector?

Overriding Predictors
• Big predictors are slow, but more accurate
• Use a single-cycle predictor in fetch
• Start the multi-cycle predictor
  – When it completes, compare it to the fast prediction.
    • If same, do nothing
    • If different, assume the slow predictor is right and flush the pipeline.
• Advantage: reduced branch penalty for those branches mispredicted by the fast predictor and correctly predicted by the slow predictor

Pipelined Gshare Predictor
• How can we get a pipelined global prediction by stage 1?
  – Start in stage -2
  – Don’t have the most recent branch history…
• Access multiple entries
  – E.g. if we are missing the last three branches, get 8 histories and pick between them during the fetch stage.
Ref: Reconsidering Complex Branch Predictors

Exceptions
• Exceptions are events that are difficult or impossible to manage in hardware alone.
• Exceptions are usually handled by jumping into a service (software) routine.
• Examples: I/O device request, page fault, divide by zero, memory protection violation (seg fault), hardware failure, etc.

Taking an Exception
• Once an exception occurs, how does the processor proceed?
  – Non-pipelined: don’t fetch from PC; save state; fetch from interrupt vector table
  – Pipelined: depends on the exception
    • Precise interrupt: must stop all instructions “after the exception” (squash)
      – Divide by zero: flush fetch/decode
      – Page fault: (fetch or mem stage?)
    • Save state after last instruction before exception completes (PC, regs)
    • Fetch from interrupt vector table

How Much ILP is There?
ALU Operation GOOD, Branch BAD
Expected Number of Branches Between Mispredicts
E(X) ~ 1/(1-p)
E.g., p = 95%, E(X) ~ 20 brs, 100-ish insts

How Accurate are Branch Predictors?
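To see how strongly accuracy drives the run length between mispredicts, the estimate E(X) ~ 1/(1-p) from the "How Much ILP is There?" slide can be tabulated for a few accuracies. The sketch below assumes roughly 5 instructions per branch, a figure inferred from the slide's "20 brs, 100-ish insts" and not stated there directly. The function names are hypothetical.

```python
def branches_between_mispredicts(p):
    """Expected branches per mispredict, E(X) ~ 1/(1-p), for accuracy p < 1."""
    return 1.0 / (1.0 - p)


def insts_between_mispredicts(p, insts_per_branch=5):
    """Expected instructions per mispredict.

    insts_per_branch=5 is an assumption inferred from the slide's
    "20 brs, 100-ish insts" example.
    """
    return branches_between_mispredicts(p) * insts_per_branch


for p in (0.90, 0.95, 0.99):
    brs = branches_between_mispredicts(p)
    insts = insts_between_mispredicts(p)
    print(f"p = {p:.2f}: about {brs:.0f} branches, about {insts:.0f} instructions")
```

Under this model, moving from 95% to 99% accuracy lengthens the expected run of correctly predicted branches by about a factor of five.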