Branch Prediction
High-Performance Computer Architecture
Joe Crop
Oregon State University, School of Electrical Engineering and Computer Science

Chapter 2: A Five Stage RISC Pipeline

Control Hazard
[Pipeline diagram: beq r1,r3,label followed by and r2,r3,r5 / or r6,r1,r7 / add r8,r1,r9 / label: xor r10,r1,r11, each flowing through the Ifetch, Reg, ALU, DMem, and write-back stages; the instructions after the branch are fetched before the branch outcome is known]

Branch Penalty Impact
• If CPI = 1, 30% of instructions are branches, and each branch stalls 3 cycles, the new CPI = 1 + 0.3 × 3 = 1.9!
• Two-part solution:
  – Determine whether the branch is taken or not sooner, AND
  – Compute the taken-branch target address earlier
• MIPS branches test whether a register = 0 or ≠ 0
  – beqz R4, name
• MIPS solution:
  – Move the zero test to the ID/RF stage
  – Add an adder to calculate the new PC in the ID/RF stage
  – 1 clock cycle penalty for a branch instead of 3

Modified MIPS Datapath
[Datapath diagram: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, and Write Back stages with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers; the Zero? test and a second adder for the branch target are moved into the ID/RF stage, and a MUX selects the next PC between the sequential PC (Next SEQ PC) and the branch target]

Branch Resolved in ID Stage
[Pipeline diagram: the same beq sequence as before, now with a single bubble after the branch because both the outcome and the target address are available at the end of ID]

Branch Prediction
• Predict Branch Not Taken
  – Execute successor instructions in sequence
  – “Squash” instructions in the pipeline if the branch is actually taken
  – 47% of MIPS branches are not taken on average
  – PC+4 is already calculated, so use it to get the next instruction
• Predict Branch Taken
  – 53% of MIPS branches are taken on average
  – But the branch target address hasn’t been calculated yet
    • MIPS still incurs the 1-cycle branch penalty
    • Other machines: branch target is known before the outcome
• Delayed Branch Technique

Delayed Branches
• This technique relies on software (the compiler) to make the delay slots valid and useful.
• The n instructions after the branch are executed regardless of whether the branch is taken:

    branch instruction
    sequential successor 1
    sequential successor 2      }
    ........                    }  branch delay of length n
    sequential successor n      }
    branch target if taken

• 1 delay slot is enough to allow for the branch decision and the target address calculation in the 5-stage pipeline
• MIPS uses this.

Performance Effect of Branch Penalty
Let
    pb  = the probability that an instruction is a branch
    pt  = the probability that a branch is taken
    b   = the branch penalty
    CPI = the average number of cycles per instruction.
Then
    CPI = (1 − pb) + pb[pt(1 + b) + (1 − pt)]
    CPI = 1 + b·pt·pb

Delayed Branch Technique (1): “From before”

    A := B + C                          If B > C Then Goto Next
    If B > C Then Goto Next    becomes  A := B + C          ← delay slot
    <delay slot>                        ....
    ...                                 Next:
    Next:

Delayed Branch Technique (2): “From target” (may need to duplicate the target instruction; it must be OK to execute it when the branch is not taken)

    Next: X := Y * Z                    Next: X := Y * Z
    ...                        becomes  ...
    B := A + C                          B := A + C
    If B > C Then Goto Next             If B > C Then Goto Next
    <delay slot>                        X := Y * Z          ← delay slot

Delayed Branch Technique (3): “From fall through” (the instruction must be OK to execute when the branch is taken)

    B := A + C                          B := A + C
    If B > C Then Goto Next    becomes  If B > C Then Goto Next
    <delay slot>                        X := Y * Z          ← delay slot
    X := Y * Z                          ...
    ...                                 Next:
    Next:

Delayed Branch Technique (cont.)
The performance of delayed branches can be modeled by

    CPI = 1 + b·pb·pnop

where pnop is the fraction of the b delay slots filled with nops.
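The penalty models so far can be checked numerically. A minimal sketch: the 30%-branch, 3-cycle-stall numbers come from the Branch Penalty Impact slide, while the pt and pnop values below are illustrative assumptions, not from the slides:

```python
# Numeric check of the branch-penalty CPI models.

def cpi_stall(pb, b):
    """CPI when every branch stalls the pipeline for b cycles."""
    return 1 + pb * b

def cpi_resolved(pb, pt, b):
    """CPI = 1 + b*pt*pb: only taken branches pay the penalty b."""
    return 1 + b * pt * pb

def cpi_delayed(pb, b, pnop):
    """CPI = 1 + b*pb*pnop: only unfilled (nop) delay slots cost cycles."""
    return 1 + b * pb * pnop

print(cpi_stall(0.30, 3))          # ≈ 1.9, matching the slide
print(cpi_resolved(0.30, 0.6, 1))  # hypothetical pt = 0.6, b = 1
print(cpi_delayed(0.30, 1, 0.4))   # hypothetical pnop = 0.4
```

Each refinement (resolving the branch earlier, then filling delay slots) shrinks the term added to the base CPI of 1.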
Thus, if fi is the probability that delay slot i is filled with a useful instruction, then

    pnop = 1 − (f1 + f2 + … + fb)/b

Example: Suppose b = 4, f1 = 0.6, f2 = 0.1, f3 = f4 = 0, and pb = 0.2. Then pnop = 1 − 0.7/4 = 0.825, and

    CPI = 1 + 4 × 0.2 × 0.825 = 1.66

Delayed Branch Technique (cont.)
The concept of squashing (annulling) can be used in conjunction with delayed branches:

    Next: X := Y * Z
    ...
    B := A + C
    If B > C Then Goto Next
    X := Y * Z      ← this instruction is nullified if the branch is not taken

With an annul bit (e.g. bne,a rs,rt,label):

    a bit    Branch outcome    Delay inst. executed?
    clear    taken             yes
    clear    not taken         yes
    set      taken             yes
    set      not taken         no (annulled)

Delayed Branch Technique (cont.)
• For processors with this capability, the performance can be modeled as

    CPI = 1 + b·pb[pnop(1 − pnull) + pnull]

  where pnull = (1 − pt) for nullify-on-branch-not-taken.
• Suppose b = 4, f1 = 0.8, f2 = 0.3, f3 = 0.1, f4 = 0, pb = 0.2, and pnull = 0.35. Then pnop = 0.7 and CPI = 1 + 0.8 × [0.7 × 0.65 + 0.35] = 1.644.

Delayed Branch Performance
• Compiler effectiveness for a single branch delay slot:
  – Fills about 60% of branch delay slots.
  – About 80% of the instructions executed in branch delay slots are useful computation.
  – So about 50% (60% × 80%) of slots are usefully filled.

Evaluating Branch Alternatives

    Pipeline speedup = Pipeline depth / (1 + Branch frequency × Branch penalty)

Suppose conditional & unconditional branches = 14% of instructions, and 65% of them change the PC:

    Branch scheme       Penalty   CPI    Speedup v. unpipelined   Speedup v. stall
    Stall pipeline      3         1.42   3.5                      1.0
    Predict taken       1         1.14   4.4                      1.26
    Predict not taken   1         1.09   4.5                      1.29
    Delayed branch      0.5       1.07   4.6                      1.31

Reducing Branch Penalty
Branch penalty in dynamically scheduled processors: cycles wasted flushing the pipeline on mispredicted branches. To reduce the branch penalty:
1. Predict branch/jump instructions AND the branch direction (taken or not taken)
2. Predict the branch/jump target address (for taken branches)
3.
Speculatively execute instructions along the predicted path

What to Use and What to Predict
Available info:
  – Current (predicted) PC
  – Past branch history (direction and target)
What to predict:
  – Conditional branch inst: branch direction and target address
  – Jump inst: target address
  – Procedure call/return: target address
May need the instruction pre-decoded.
[Diagram: the current PC indexes both the instruction memory (IM) and the predictors; the predictors produce pred_PC plus prediction info, and receive feedback on actual outcomes]

Mis-prediction Detection and Feedback
Detection:
• At the end of decoding
  – Target address is known at decode time and does not match the prediction
  – Flush the fetch stage
• At commit (most cases)
  – Wrong branch direction, or target address does not match
  – Flush the whole pipeline
Feedback to the predictors:
• Any time a mis-prediction is detected
• At a branch’s commit (or at EXE: called speculative update)
[Pipeline diagram: FETCH (with predictors) → RENAME → SCHD → EXE → WB → COMMIT, with a RS/ROB; feedback flows from the back end to the predictors]

Branch Direction Prediction
• Predict the branch direction: taken or not taken (T/NT)

    BNE R1, R2, L1     ← taken: go to L1; not taken: fall through
    …
    L1: …

• Static prediction: the compiler decides the direction
• Dynamic prediction: the hardware decides the direction using dynamic information
  1. 1-bit branch-prediction buffer
  2. 2-bit branch-prediction buffer
  3. Correlating branch-prediction buffer
  4. Tournament branch predictor
  5. and more …

Predictor for a Single Branch
General form:
1. Access the predictor state (indexed by the PC)
2. Predict: output T/NT
3.
Feedback: update the predictor state with the actual outcome (T/NT)

1-bit Prediction
[State diagram: two states — state 1 “Predict Taken” and state 0 “Predict Not Taken”; a taken (T) outcome moves to state 1, a not-taken (NT) outcome moves to state 0]

Branch History Table (BHT) of 1-bit Predictors
(also called a Branch Prediction Buffer in the textbook)
• Could use a single 1-bit predictor, but its accuracy is low
• BHT: a table of simple predictors, indexed by k bits from the PC (2^k entries)
• Similar to a direct-mapped cache
• More entries: more cost, but fewer conflicts and higher accuracy
• A BHT can also contain more complex predictors

1-bit BHT Weakness
• Example: in a loop, a 1-bit BHT causes 2 mispredictions per loop execution
• Consider a loop of 9 iterations before exit:

    for (…) {
        for (i = 0; i < 9; i++)
            a[i] = a[i] * 2.0;
    }

  – End-of-loop case: it exits instead of looping as before
  – First time through the loop on the next pass: it predicts exit instead of looping
• Only ~80% accuracy even though the branch loops 90% of the time

2-bit Saturating Counter
• Solution: a 2-bit scheme that changes the prediction only after mispredicting twice (Figure 3.7, p. 249)
[State diagram: four states — 11 and 10 “Predict Taken”, 01 and 00 “Predict Not Taken”; a taken outcome moves toward 11, a not-taken outcome moves toward 00, so the prediction flips only after two consecutive mispredictions]
• Adds hysteresis to the decision-making process

Correlating Branches
Code example showing the potential:

    if (d == 0) d = 1;
    if (d == 1) …

Assembly code:

    BNEZ   R1, L1
    DADDIU R1, R0, #1
    L1: DADDIU R3, R1, #-1
    BNEZ   R3, L2
    L2: …

Observation: if the first BNEZ is not taken, then the second BNEZ is taken.

(1, 1) Predictor
• (1,1) predictor: 1 bit of history (the last branch executed), 1-bit predictions
• We use a pair of bits: the first bit is the prediction if the last branch was not taken, and the second bit is the prediction if the last branch was taken.

    Prediction bits   Prediction if last branch not taken   Prediction if last branch taken
    NT/NT             Not taken                             Not taken
    NT/T              Not taken                             Taken
    T/NT              Taken                                 Not taken
    T/T               Taken                                 Taken

Chapter 3 - Exploiting ILP

(1, 1) Predictor: Example
• Consider the following code, assuming d is assigned to R1:

    if (d == 0) d = 1;
    if (d == 1) …

        bnez R1, L1        ; branch b1 (d != 0)
        addi R1, R0, #1    ; d == 0, so d = 1
    L1: subi R3, R1, #1
        bnez R3, L2        ; branch b2 (d != 1)
        ...
    L2: …

• Suppose d alternates between 2 and 0, and the (1,1) predictor is initialized to not taken. In each “pred” column, the bit actually used is selected by the outcome of the previous branch:

    d=?   b1 pred   b1 action   new b1 pred   b2 pred   b2 action   new b2 pred
    2     NT/NT     T           T/NT          NT/NT     T           NT/T
    0     T/NT      NT          T/NT          NT/T      NT          NT/T
    2     T/NT      T           T/NT          NT/T      T           NT/T
    0     T/NT      NT          T/NT          NT/T      NT          NT/T

• The only mispredictions are on the first iteration (d = 2), because b1 had not yet been correlated with the previous outcome of b2.

(1, 1) Predictor: Example (cont.)
• If we had used a 1-bit predictor instead:

    d=?   b1 pred   b1 action   new b1 pred   b2 pred   b2 action   new b2 pred
    2     NT        T           T             NT        T           T
    0     T         NT          NT            T         NT          NT
    2     NT        T           T             NT        T           T
    0     T         NT          NT            T         NT          NT

• We would have had all the branches mispredicted!

(m, n) Predictor
In general, an (m,n) predictor uses the behavior of the last m branches (kept in a shift register) to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch.

Performance of a (2, 2) Predictor
• The improvement is most noticeable in the integer benchmarks.
• An (m,n) predictor outperforms a 2-bit predictor, even one with unlimited entries!
[Chart: misprediction rates on SPEC integer benchmarks for the (2,2) predictor versus 2-bit predictors, including a 2-bit predictor with unlimited entries]

Tournament Predictors
• Use multiple predictors, usually one based on local information and one based on global information:
  – Local predictors are better for some branches
  – Global predictors are better at exploiting correlation
• A selector chooses among the predictors, usually a 2-bit saturating counter.
[State diagram: selector states 00, 01, 10, 11; each transition is labeled n/m, where n says whether the left predictor was correct (1) or incorrect (0) and m says the same for the right predictor]

Example: Alpha 21264 Branch Predictor
The 21264 uses one of the most sophisticated branch predictors of its generation.
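The 1-bit BHT weakness and the 2-bit fix described above can be illustrated with a small simulation. This is a sketch under the slides' assumptions: the loop branch maps to its own table entry, and each pass through the 9-iteration loop produces 8 taken outcomes followed by 1 not-taken:

```python
# Simulate 1-bit vs 2-bit branch predictors on a loop-branch pattern.

def simulate_1bit(outcomes):
    state, correct = 1, 0          # 1 = predict taken
    for taken in outcomes:
        correct += (state == taken)
        state = taken              # retrain to the last outcome
    return correct / len(outcomes)

def simulate_2bit(outcomes):
    state, correct = 3, 0          # saturating counter 0..3; >= 2 predicts taken
    for taken in outcomes:
        correct += ((state >= 2) == bool(taken))
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

stream = ([1] * 8 + [0]) * 100     # 100 passes through the 9-iteration loop
print(simulate_1bit(stream))       # ~0.78: about two mispredictions per pass
print(simulate_2bit(stream))       # ~0.89: only the loop exit mispredicts
```

The accuracies match the slide's numbers: roughly 80% for the 1-bit scheme even though the branch is taken about 90% of the time, and the 2-bit counter's hysteresis removes the extra misprediction on loop re-entry.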
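The (1,1) example above can likewise be checked by simulation. In this sketch (the run function and its encoding are illustrative, not from the slides), each branch keeps a pair of 1-bit predictions and the outcome of the most recently executed branch selects which bit to use and update:

```python
# (1,1) correlating predictor vs plain 1-bit predictors on the
# d = 2, 0, 2, 0 example: b1 is taken iff d != 0, and b2 tests the
# value of d after the "if (d == 0) d = 1" statement.

def run(d_values, correlating=True):
    preds = {"b1": [0, 0], "b2": [0, 0]}  # [pred if last NT, pred if last T]
    last = 0                              # outcome of the previous branch
    mispredicts = 0
    for d in d_values:
        dd = d if d != 0 else 1           # the d = 1 assignment when d == 0
        for name, taken in (("b1", int(d != 0)), ("b2", int(dd != 1))):
            slot = last if correlating else 0   # plain 1-bit uses one slot
            if preds[name][slot] != taken:
                mispredicts += 1
            preds[name][slot] = taken     # update only the selected bit
            last = taken
    return mispredicts

print(run([2, 0, 2, 0], correlating=True))   # 2: only the first pass misses
print(run([2, 0, 2, 0], correlating=False))  # 8: every branch mispredicted
```

This reproduces both tables: the (1,1) predictor misses only the two branches of the first iteration, while the uncorrelated 1-bit predictors miss all eight.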
[Diagram: the 21264 tournament predictor — a global predictor of 2-bit saturating counters indexed by the last 12 outcomes of all branches, a local predictor whose 3-bit saturating counters are indexed by the last 10 outcomes of the current branch, and a 2-bit choice predictor that selects between them]

Tournament Predictor in the Alpha 21264
• The local predictor is itself a 2-level predictor:
  – Top level: a local history table of 1024 10-bit entries; each entry records the most recent 10 outcomes of the branches mapped to it. The 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
  – Next level: the selected entry from the local history table indexes a table of 1K 3-bit saturating counters, which provide the local prediction.
• Total size: 4K×2 + 4K×2 + 1K×10 + 1K×3 = 29K bits (~180K transistors)!

[Chart: % of predictions coming from the local predictor in the tournament scheme — nasa7 98%, matrix300 100%, tomcatv 94%, doduc 90%, spice 55%, fpppp 76%, gcc 72%, espresso 63%, eqntott 37%, li 69%]

[Chart: accuracy of branch prediction on SPEC benchmarks for profile-based, 2-bit counter, and tournament predictors; the tournament predictor is consistently the most accurate]
• Profile: a branch profile from a previous execution (static, in that it is encoded in the instruction, but based on a profile)

Accuracy v. Size (SPEC89)
[Chart (fig 3.40): conditional branch misprediction rate (0–10%) vs. total predictor size (0–128 Kbits); local 2-bit counters level off highest, the correlating (2,2) scheme lower, and the tournament predictor lowest, with little gain from extra size beyond a few tens of Kbits]

Power Consumption
[Slide: BlueRISC’s compiler-driven power-aware branch prediction (patent-pending), compared with a 512-entry BTAC bimodal predictor. Copyright 2007 CAM & BlueRISC]

Pitfall: Sometimes Dumber is Better
• The Alpha 21264 uses a tournament predictor (29 Kbits)
• The earlier 21164 uses a simple 2-bit predictor with 2K entries (a total of 4 Kbits)
• On the SPEC95 benchmarks, the 21264 outperforms the 21164:
  – 21264: avg. 11.5 mispredictions per 1000 instructions
  – 21164: avg.
16.5 mispredictions per 1000 instructions
• Reversed for transaction processing (TP)!
  – 21264: avg. 17 mispredictions per 1000 instructions
  – 21164: avg. 15 mispredictions per 1000 instructions
• TP code is much larger, and the 21164 holds 2× as many local branch predictions (2K entries vs. the 1K local-predictor entries in the 21264)
• What about power?
  – Large predictors give some increase in prediction rate, but at a large power cost

Branch Target Buffer
The BTB acts as a cache for branch target addresses (BTAs). It eliminates the cycles per branch otherwise wasted calculating the BTA.

BTB (cont.)
The BTA and the outcome of the branch are known by the end of the ID stage, but are not relayed until the EX stage.

BTB (cont.)
Penalty in cycles:

    Buffer hit,  taken:      0
    Buffer hit,  not taken:  2
    Buffer miss, taken:      2
    Buffer miss, not taken:  0

The performance of the BTB can be modeled by

    CPI = 1 + b·pb(1 − pm)·pw + b·pb·pm·pw + c·pb·pm·pt

where b is the normal branch penalty, c is the number of cycles required to service a BTB miss, and pm is the probability of a BTB miss. The probability of a wrong prediction pw depends on whether there was a BTB miss or a hit: in the case of a BTB hit, pw = 1 − pt; for a BTB miss, pw = pt. Thus

    CPI = 1 + b·pb(1 − pm)(1 − pt) + (b + c)·pb·pm·pt

Return Address Prediction
The BTB and BPB do a good job of predicting how past behavior will repeat. However, the subroutine call/return paradigm makes correct prediction difficult. Consider:

    100  jal subr
    104  ...
    108  ...
    112  jal subr
    116  ...
    ...
    500  subr: ...
         ...
    520  jr $31

After the second call to the subroutine, the BTB contains:

    Inst. Addr   Target Addr.
    100          500
    520          104
    112          500

When we return from subr the second time, we get a hit on a valid BTB entry (Inst. Addr. = 520) and predict a return to address 104. However, this is not correct: the next instruction should be 116!

Subroutine Return Stack
To avoid such mispredictions, a subroutine return stack can be used to augment the BTB.
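The jal/jr example above can be sketched as a simulation. This is illustrative, not an actual hardware model: the addresses match the example, the simulate function is hypothetical, and the BTB is reduced to a single entry for the jr $31 at address 520:

```python
# Return-address prediction: BTB entry vs. a return-address stack (RAS).
# Each 'jal' at call_addr returns to call_addr + 4; calls here are
# sequential (call, return, call, return), as in the example.

def simulate(calls, use_ras):
    btb = {}                   # inst addr -> last observed target
    ras, correct = [], 0
    JR_ADDR = 520              # address of 'jr $31' in the example
    for call_addr in calls:
        ras.append(call_addr + 4)      # jal pushes its return address
        actual = call_addr + 4
        predicted = ras.pop() if use_ras else btb.get(JR_ADDR)
        correct += (predicted == actual)
        btb[JR_ADDR] = actual          # BTB remembers only the last target
    return correct

print(simulate([100, 112], use_ras=False))  # 0: no entry, then stale 104
print(simulate([100, 112], use_ras=True))   # 2: the stack gives 104, then 116
```

The BTB mispredicts the second return because it replays the previous return target (104) instead of 116, while the stack naturally tracks the matching call site.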
Performance of the Subroutine Return Stack
[Chart: prediction accuracy of the subroutine return stack on SPEC95]

Pentium 4’s Branch Predictor
• “Unveiling the Intel Branch Predictors” (Pentium 4)
  – http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1597026

Neural Branch Predictors
• “Towards a High Performance Neural Branch Predictor”
  – http://webspace.ulbsibiu.ro/lucian.vintan/html/USA.pdf
  – The main advantage of the neural predictor is its ability to exploit long histories while requiring only linear resource growth
  – Used in IA-64 simulators

Core 2’s Branch Predictor?
• TAGE: TAgged GEometric history length predictor

TAGE Performance
[Chart: TAGE misprediction rates]

To Learn More