CENG 450 Computer Systems and Architecture
Lecture 11
Amirali Baniasadi
amirali@ece.uvic.ca

This Lecture
• Branch Prediction
• Multiple Issue

Branch Prediction
Predicting the outcome of a branch:
• Direction: Taken / Not Taken -> direction predictors
• Target address: PC+offset (Taken) / PC+4 (Not Taken) -> target address predictors
  • Branch Target Buffer (BTB)

Why do we need branch prediction?
Branch prediction:
• Increases the number of instructions available for the scheduler to issue
• Increases instruction-level parallelism (ILP)
• Allows useful work to be completed while waiting for the branch to resolve

Branch Prediction Strategies
• Static: decided before runtime. Examples:
  • Always-Not-Taken
  • Always-Taken
  • Backwards Taken, Forward Not Taken (BTFNT)
  • Profile-driven prediction
• Dynamic: prediction decisions may change during the execution of the program

What happens when a branch is predicted?
On a misprediction:
• No speculative state may commit
• Squash the instructions in the pipeline
• Must not allow stores in the pipeline to occur
  • Cannot allow stores that would not have happened to commit
• Even for good branch predictors, more than half of the fetched instructions are squashed

A Generic Branch Predictor
[Figure: fetch produces a predicted stream of (PC, T or NT) predictions; resolve compares it against the actual stream in execution order]
• f(PC, x) = T or NT
• What is f(PC, x)?
• x can be any relevant information; thus far (static prediction) x was empty

Bimodal Branch Predictors
• Dynamically store information about branch behaviour
  • Branches tend to behave in a fixed way
  • Branches tend to behave in the same way across program execution
• Index a Pattern History Table (PHT) using the branch address
  • 1 bit: branch behaves as it did last time
  • Saturating 2-bit counter: branch behaves as it usually does

Saturating-Counter Predictors
• Consider a strongly biased branch with an infrequent outcome:
  TTTTTTTTNTTTTTTTTNTTTT
• Last-outcome prediction will mispredict twice per infrequent outcome: once on the N itself, and once on the T that follows it
• Idea: remember the most frequent case
• Saturating counter: hysteresis, often called a bimodal predictor
• Captures temporal bias

Bimodal Prediction
• Table of 2-bit saturating counters (PHT), indexed by the PC
• Predict the most common direction
[Figure: 2-bit counter state machine; states 00 and 01 predict Not Taken, states 10 and 11 predict Taken; a taken outcome moves the counter toward 11 and a not-taken outcome toward 00, saturating at both ends]
• Advantages: simple, cheap, "good" accuracy
• Bimodal will mispredict only once per infrequent outcome:
  TTTTTTTTNTTTTTTTTNTTTT
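A minimal sketch of this bimodal scheme in Python (not from the slides; the table size, the indexing by low PC bits, the example branch address, and the weakly-taken initial counter value are illustrative assumptions):

class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by the branch PC.
    Counter values 0 and 1 predict not-taken; 2 and 3 predict taken."""

    def __init__(self, entries=4096):
        self.entries = entries              # assumed table size
        self.table = [2] * entries          # assumed initial state: weakly taken

    def _index(self, pc):
        return (pc >> 2) % self.entries     # drop the byte offset, keep low bits

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2    # True means predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(self.table[i] + 1, 3)   # saturate at 3
        else:
            self.table[i] = max(self.table[i] - 1, 0)   # saturate at 0

# The biased stream from the slide: the counter stays near 3, so only the
# infrequent N outcomes are mispredicted (one miss per N).
bp, pc, misses = BimodalPredictor(), 0x400100, 0
for outcome in (c == 'T' for c in "TTTTTTTTNTTTTTTTTNTTTT"):
    misses += bp.predict(pc) != outcome
    bp.update(pc, outcome)
print("mispredictions:", misses)    # prints 2: one per infrequent N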
Correlating Predictors
• From the program's perspective, different branches may be correlated:
  if (aa == 2) aa = 0;
  if (bb == 2) bb = 0;
  if (aa != bb) then ...
• Can be viewed as a pattern detector
• Instead of keeping aggregate history information (i.e., the most frequent outcome), keep exact history information: the pattern of the n most recent outcomes
• Example: a BHR holds the n most recent branch outcomes; use the PC and the BHR (xor?) to access the prediction table

Pattern-based Prediction
• Nested loops:
  for i = 0 to N
    for j = 0 to 3
      ...
• Branch outcome stream for the j-loop branch: 11101110111011101110
• Patterns:
  111 -> 0
  110 -> 1
  101 -> 1
  011 -> 1
• 100% accuracy; learning time: 4 instances
• Table index: (PC, 3-bit history)

Two-level Branch Predictors
A branch outcome depends on the outcomes of previous branches.
• First level: Branch History Registers (BHR)
  • Global history / branch correlation: past executions of all branches
  • Self history / private history: past executions of the same branch
• Second level: Pattern History Table (PHT)
  • Use the first-level information to index a table, possibly XORed with the branch address
  • PHT entries are usually saturating 2-bit counters
  • The PHT can also be private, shared, or global

Gshare Predictor (McFarling)
[Figure: the global BHR and the PC are combined by a function f to index the branch history table, which supplies the prediction]
• PC and BHR can be concatenated, completely overlapped, partially overlapped, xored, etc.
• How deep should the BHR be?
  • Really depends on the program
  • Deeper increases learning time, but may increase the quality of the information
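A minimal gshare-style sketch in Python for comparison; XOR-ing the global history with the PC bits is McFarling's combining function, but the history depth, table size, example branch address, and PC hashing below are illustrative assumptions:

class GsharePredictor:
    """Global branch history XORed with PC bits indexes a table of
    2-bit saturating counters (a two-level predictor with global history)."""

    def __init__(self, history_bits=10):
        self.size = 1 << history_bits       # assumed table size = 2^(history depth)
        self.table = [2] * self.size        # assumed initial state: weakly taken
        self.ghr = 0                        # global branch history register

    def _index(self, pc):
        return ((pc >> 2) ^ self.ghr) & (self.size - 1)

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = min(self.table[i] + 1, 3) if taken else max(self.table[i] - 1, 0)
        # shift the new outcome into the global history register
        self.ghr = ((self.ghr << 1) | int(taken)) & (self.size - 1)

# The j-loop branch from the pattern-based slide: 1110 repeating.
# With a 3-bit history, each pattern (111, 110, 101, 011) trains its own
# counter, so after the first pass the stream is predicted perfectly.
gp, pc, misses = GsharePredictor(history_bits=3), 0x400200, 0
for outcome in (c == '1' for c in "11101110111011101110"):
    misses += gp.predict(pc) != outcome
    gp.update(pc, outcome)
print("mispredictions:", misses)    # prints 1: only the first 0 during learning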
Hybrid Prediction
• Combining branch predictors: use two different branch predictors and access both in parallel
• A third table (the selector) determines which prediction to use
• Two or more predictor components can be combined
[Figure: the PC feeds both a gshare and a bimodal predictor; each produces a T/NT prediction, and a selector chooses between them]
• Different branches benefit from different types of history

Issues Affecting Accurate Branch Prediction
• Aliasing: more than one branch may use the same BHT/PHT entry
  • Constructive: a prediction that would have been incorrect is predicted correctly
  • Destructive: a prediction that would have been correct is predicted incorrectly
  • Neutral: no change in accuracy

More Issues
• Training time: need to see enough branches to uncover the pattern, and enough time to reach steady state
• "Wrong" history: incorrect type of history for the branch
• Stale state: the predictor is updated after the information was needed
• Operating system context switches: more aliasing caused by branches in different programs

Performance Metrics
• Misprediction rate: mispredicted branches per executed branch
  • Unfortunately the one most commonly reported
• Instructions per mispredicted branch
  • Gives a better idea of program behaviour
  • Branches are not evenly spaced

Upper Limit to ILP: Ideal Machine
• Amount of parallelism when there are no branch mispredictions and we are limited only by data dependencies
• Instructions that could theoretically be issued per cycle
[Chart: IPC on the ideal machine; integer programs 18-60 (gcc 54.8, espresso 62.6, li 17.9), FP programs 75-150 (fpppp 75.2, doduc 118.7, tomcatv 150.1)]

Impact of Realistic Branch Prediction
• Limiting the type of branch prediction
[Chart: IPC for gcc, espresso, li, fpppp, doduc, and tomcatv under Perfect, Selective predictor, Standard 2-bit, Static, and None; with realistic predictors FP programs reach roughly 15-45 issues per cycle and integer programs roughly 6-12]

Pentium III
• Dynamic branch prediction
  • 512-entry BTB predicts direction and target; a 4-bit history is used with the PC to derive the direction
• Misprediction penalty: at least 9 cycles, as many as 26, on average 10-15 cycles

AMD Athlon K7
• 10-stage integer and 15-stage FP pipeline; the predictor is accessed in fetch
• 2K-entry bimodal predictor, 2K-entry BTB
• Branch penalties: misprediction penalty of at least 10 cycles

Multiple Issue
• Multiple issue is the ability of the processor to start more than one instruction in a given cycle
• Two broad approaches:
  • Superscalar processors
  • Very Long Instruction Word (VLIW) processors

1990s: Superscalar Processors
• Bottleneck: CPI >= 1
  • Limit on scalar (single-instruction-issue) performance
  • Hazards
• Superpipelining? Diminishing returns (hazards + overhead)
• How can we make CPI = 0.5? Multiple instructions in every pipeline stage (superscalar):

          1    2    3    4    5    6    7
  Inst0   IF   ID   EX   MEM  WB
  Inst1   IF   ID   EX   MEM  WB
  Inst2        IF   ID   EX   MEM  WB
  Inst3        IF   ID   EX   MEM  WB
  Inst4             IF   ID   EX   MEM  WB
  Inst5             IF   ID   EX   MEM  WB

Superscalar vs. VLIW
• A religious debate, similar to RISC vs. CISC
  • Wisconsin + Michigan (superscalar) vs. Illinois (VLIW)
• Q: Who can schedule code better, hardware or software?

Hardware Scheduling
• High branch prediction accuracy
• Dynamic information on latencies (cache misses)
• Dynamic information on memory dependences
• Easy to speculate (and recover from mis-speculation)
• Works for generic, non-loop, irregular code
  • Ex: databases, desktop applications, compilers
• But: limited reorder buffer size limits "lookahead"
• But: high cost/complexity and a slow clock

Software Scheduling
• Large scheduling scope (the full program), large "lookahead"
• Can handle very long latencies
• Simple hardware with a fast clock
• But: only works well for "regular" codes (scientific, FORTRAN)
• But: low branch prediction accuracy
  • Can improve by profiling
• But: no information on latencies like cache misses
  • Can improve by profiling
• But: pain to speculate and recover from mis-speculation
  • Can improve with hardware support

Superscalar Processors
• Pioneer: IBM (America => RIOS, RS/6000, Power-1)
• Superscalar instruction combinations:
  • 1 ALU, memory, or branch + 1 FP (RS/6000)
  • Any 1 + 1 ALU (Pentium)
  • Any 1 ALU or FP + 1 ALU + 1 load + 1 store + 1 branch (Pentium II)
• Impact of superscalar:
  • More opportunity for hazards (why?)
  • More performance loss due to hazards (why?)

Superscalar Processors
• Issue a varying number of instructions per clock
• Scheduling: static (by the compiler) or dynamic (by the hardware)
• Superscalars issue a varying number of instructions per cycle (1 to 8), scheduled by the compiler or by hardware (Tomasulo)
• Examples: IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP 8000

Elements of Advanced Superscalars
• High-performance instruction fetching
  • Good dynamic branch and jump prediction
  • Multiple instructions per cycle; multiple branches per cycle?
• Scheduling and hazard elimination
  • Dynamic scheduling (not necessarily: the Alpha 21064 and the Pentium were statically scheduled)
  • Register renaming to eliminate WAR and WAW hazards
• Parallel functional units; paths/buses/multiple register ports
• High-performance memory systems
• Speculative execution

SS + DS + Speculation
• Superscalar + dynamic scheduling + speculation: three great tastes that taste great together
• CPI >= 1? Overcome with superscalar
• Superscalar increases hazards? Overcome with dynamic scheduling
• RAW dependences still a problem? Overcome with a large window
• Branches a problem for filling a large window? Overcome with speculation

The Big Picture
[Figure: static program -> fetch & branch predict -> issue -> execution -> reorder & commit]

Readings
• A new paper on branch prediction is online. READ.
• The material will be used in the THIRD quiz.
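As a closing back-of-the-envelope sketch tying the two halves of this lecture together, the cost of branch mispredictions on a multiple-issue machine can be estimated as effective CPI = base CPI + branch frequency x misprediction rate x misprediction penalty. The numbers below are illustrative assumptions, not measurements; only the 10-15 cycle penalty range comes from the Pentium III slide:

# Rough effective-CPI estimate for a dual-issue machine with branch
# mispredictions. All inputs are illustrative assumptions.
base_cpi    = 0.5    # ideal dual-issue: two instructions per cycle
branch_freq = 0.20   # fraction of instructions that are branches (assumed)
miss_rate   = 0.05   # branch misprediction rate (assumed)
penalty     = 12     # cycles per misprediction (Pentium III averages 10-15)

effective_cpi = base_cpi + branch_freq * miss_rate * penalty
print(f"effective CPI: {effective_cpi:.2f}")                   # 0.5 + 0.12 = 0.62
print(f"slowdown vs. ideal: {effective_cpi / base_cpi:.2f}x")  # 1.24x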