Branch Prediction Schemes
ECE404: Advanced Microprocessor System
Raj Parihar
3/7/2012

Motivation
- Why do we need (dynamic) branch predictors?
- Branches are very frequent
  - More than 20% of instructions are branches
- A pipeline stall (bubble) is required until the exact outcome and target of a branch are known
  - Longer pipelines increase this latency, which makes things even worse
- Use a naive predictor (always taken or always not taken)?
  - Correct only about 50% of the time
- The branch penalty is too high
  - On a misprediction, all instructions after the branch are squashed
- That is why we need really smart branch predictors

References
- "Alternative Implementations of Two-Level Adaptive Branch Prediction" by Tse-Yu Yeh and Yale N. Patt
- "Combining Branch Predictors" by Scott McFarling
- "The Agree Predictor: A Mechanism for Reducing Negative Branch History Interference" by Sprangle et al.
- "Dynamic History-Length Fitting: A Third Level of Adaptivity for Branch Prediction" by Toni Juan et al.
- "Neural Methods for Dynamic Branch Prediction" by Daniel A. Jimenez et al.

Outline
- Branch Prediction: Basics
- Various Branch Predictors
- BP in Commercial Design
- Enhancement Techniques
- Schemes to Reduce Interference
- Some Simulations and Results
- Conclusion

Conditional Branches
- High-performance systems use multi-level branch predictors
  - The state of the art gives a hit rate of approximately 98.8%
- Two aspects of conditional branch prediction
  - Branch outcome: taken or not taken
  - Branch address: if taken, then to where?
- What about unconditional branches?
  - Don't even bother! Compilers are smart enough to deal with them

Basic Branch Prediction
- Bimodal branch prediction
- Local branch prediction schemes
  - Per branch address
- Global branch prediction schemes
  - All branch addresses combined

Bimodal Branch Predictors
[Figure: the PC supplies a 13-bit index into a pattern history table with entries 0 to 8191. Plot: conditional branch prediction accuracy (%) versus predictor size (bytes) for the bimodal predictor.]
Reference: Combining Branch Predictors, Scott McFarling, WRL TN-36

Two-Level Branch Predictor
[Figure: the PC and a global history register (table) feed a pattern history table.]

Two-Level Branch Predictor (Cont.)
[Figure: bits 12 to 0 of the PC (13 bits) and the global history register (table) select one of the pattern history table entries 0 to 2^13 - 1 = 8191.]

Combining Branch Predictors
[Figure: a meta predictor chooses between a two-level predictor (GHT/GHR plus PHT) and a bimodal predictor (PHT). One table is indexed with 13 bits (entries 0 to 2^13 - 1) and the other with 12 bits (entries 0 to 4095).]

Branch Predictors in Commercial Processors
- POWER4, Alpha GS 21264 and Intel

POWER4: Core
[Figure: block diagram of the POWER4 core: branch prediction unit; instruction fetch unit; decode, crack, and group; issue queues; LD/ST queue; and the execution units (FP, fixed point, BR, and CR).]

POWER4: Branch Prediction Unit
- Three sets of branch-history tables
- Local predictor (traditional BHT)
  - 16K entries, indexed by branch address, 1-bit prediction
- Global predictor
  - 11-bit global history vector (similar to a GHR)
  - The GHR is XORed with the branch address to index the history table
  - 16K-entry global history table, 1-bit prediction
- Selector table
  - Keeps track of the better predictor (global or local)
  - 16K entries, 1-bit prediction

POWER4 Branch Prediction (Cont.)
- Fetching is redirected based on the prediction
- Eventually branches are executed in the BR unit
- Upon execution, the predictor tables are updated
- Dynamic branch prediction can be overridden by software, if needed
- A link stack is used to predict the target of branches
- The target address of a branch-to-count is often repetitive

Alpha 21264: BP
- Composed of 3 units
  - Local predictor
  - Global predictor
  - Choice predictor
- Local predictor
  - 2-level, per-branch history table
  - 1K table entries, 3-bit saturating counters
  - Indexed by VPC[11:2] of the current address

Alpha 21264 (Cont.)
- Global predictor
  - Uses the 12 most recent branches
  - 4K-entry global history table
  - 2-bit saturating counters
- Choice predictor
  - Monitors the history of the local and global predictors
  - 4K-entry table, 2 bits each
  - Chooses the better of the two

Intel Processors
- It is really tough to find any information about Intel chips
- 386/486
  - All branches are statically predicted not taken
- Pentium III
  - 2-level, local histories
  - 2-bit saturating counters (Lee-Smith)
- Pentium M
  - Combines 3 predictors: bimodal, global, and loop predictor
  - The loop predictor analyzes branches to see whether they have loop behavior
    - Moving in one direction (taken or not taken) a fixed number of times

Branch Prediction: Insights
- What makes them nearly perfect?
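The bimodal, global two-level, and combined schemes from the preceding slides can be sketched in a few lines of Python. This is only an illustrative model, not the code of any processor or of McFarling's study: the class names, the `run` helper, the initial counter value, the `pc >> 2` indexing, and the default table sizes are all assumptions made for the example.

```python
class CounterTable:
    """Table of 2-bit saturating counters; a value of 2 or 3 means 'taken'."""

    def __init__(self, index_bits):
        self.mask = (1 << index_bits) - 1
        self.counters = [1] * (1 << index_bits)  # assumed start: weakly not taken

    def predict(self, index):
        return self.counters[index & self.mask] >= 2

    def update(self, index, taken):
        i = index & self.mask
        step = 1 if taken else -1
        self.counters[i] = max(0, min(3, self.counters[i] + step))


class Bimodal:
    """Counters indexed by low-order branch address bits only."""

    def __init__(self, index_bits=13):
        self.table = CounterTable(index_bits)

    def predict(self, pc):
        return self.table.predict(pc >> 2)

    def update(self, pc, taken):
        self.table.update(pc >> 2, taken)


class GlobalPredictor:
    """Two-level scheme: global history XOR branch address indexes the PHT."""

    def __init__(self, history_bits=12):
        self.history_mask = (1 << history_bits) - 1
        self.ghr = 0
        self.pht = CounterTable(history_bits)

    def predict(self, pc):
        return self.pht.predict((pc >> 2) ^ self.ghr)

    def update(self, pc, taken):
        self.pht.update((pc >> 2) ^ self.ghr, taken)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.history_mask


class Combined:
    """Meta predictor: a chooser counter picks bimodal or global per branch."""

    def __init__(self):
        self.bimodal = Bimodal()
        self.glob = GlobalPredictor()
        self.chooser = CounterTable(12)  # 2 or 3 means 'trust global'

    def predict(self, pc):
        if self.chooser.predict(pc >> 2):
            return self.glob.predict(pc)
        return self.bimodal.predict(pc)

    def update(self, pc, taken):
        bimodal_ok = self.bimodal.predict(pc) == taken
        global_ok = self.glob.predict(pc) == taken
        if bimodal_ok != global_ok:  # train the chooser only on disagreement
            self.chooser.update(pc >> 2, global_ok)
        self.bimodal.update(pc, taken)
        self.glob.update(pc, taken)


def run(predictor, pc, outcomes):
    """Return the fraction of outcomes predicted correctly for one branch."""
    hits = 0
    for taken in outcomes:
        hits += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return hits / len(outcomes)
```

Feeding `run` a strictly alternating taken/not-taken pattern shows the difference between the schemes: the bimodal counter keeps flipping and mispredicts, while the history-indexed table learns the pattern once its history register fills, and the chooser then steers the combined predictor toward the global component.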
Potential Interferences in Two-Level
- Interference is caused by multiple branch instructions being mapped to the same table entry
- Types of interference:
  - Neutral interference
  - Positive interference
  - Negative interference
- Negative interference is more dominant than positive interference
Reference: The Agree Predictor: A Mechanism for Reducing Negative Branch History Interference, by Sprangle, Chappell, Alsup, and Patt, ISCA 97

Destructive Interference (Aliasing)
- Unrelated branches might accidentally use the same counter
  - If two branches behave differently, the predictor cannot learn either behavior
  - This leads to decreased accuracy
- Almost all known techniques change the microarchitecture
  - The techniques have been shown to work well in simulation
  - But microprocessor manufacturers still use relatively simple predictors
- Can we reduce destructive interference without changing the processor?

Ways to Reduce Interference
- Larger prediction table
- Efficient hash function
  - Use different mapping schemes to better distribute branches among different entries
- Profiling of branches
  - Map conflicting branches to different table entries
- Separate different classes of branches so that they use different prediction tables
- Avoid negative interference
  - By converting negative interference into neutral or positive interference

Agree Predictor
- The idea is that
  - Most branches are highly biased, either taken or not taken
- The hope is that
  - The first time a branch enters the BTB, it exhibits its biased nature
- A bias bit is assigned to each branch in the BTB
- The PHT gives the prediction as "agree" or "disagree" with the bias bit

Pattern History Table: Utilization
- We need a very efficient hash function
- We assume that the utilization of PHT entries is uniform
  - This is not true, though; it depends on the efficiency of the hash function
- XOR is the simplest hash function
- And here are some surprising results

SimpleScalar Implementation (Baseline)
[Figure: SimpleScalar two-level predictor, baseline and tweaked. Labels: L1 table (GHR) with entries 0 to L1 size - 1; L2 table (PHT) with entries 0 to 8191; L1 index; L2 index; branch address >> 2; XOR; AND with a mask of ones (history register size).]

SimpleScalar Tweak
- Change history_reg_size in the .cfg file
- Branch lookup
  - bpred.c: bpred_dir_lookup function
  - XOR the upper half of l2index with the lower half before indexing the L2 table, which acts as the PHT
- Branch update
  - Because the value returned for l2index is a pointer, the correct entry is updated automatically once the bpred_update function is called

PHT Utilization GHR20
- Baseline: GHR = 13 bits, PHT = 8192 entries
- GHR20: GHR = 20 bits
- BR+GHR: the branch address (baddr) is XORed with the GHR
- Simulation: 10 million instructions
- Fast forward: 50 million instructions
[Figure: number of branches versus PHT entry [n] for the Baseline, GHR20, and BR+GHR configurations.]

PHT Utilization: Art Benchmark
[Figure: number of branches and mispredictions versus PHT entry for the art benchmark; series BR_Baseline and Miss_Baseline.]

Impact of GHR Length
- In general, the length of the GHR can affect the overall prediction accuracy
- A longer GHR touches more entries in the PHT, but it may either reduce or increase interference
- Determining the appropriate GHR length is nontrivial
- Allowing the GHR length to change dynamically is one possible way to improve performance
- How to determine the best GHR length is an open research issue
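The index tweak and the GHR-length experiments described above can be approximated with a small model. The sketch below is not the SimpleScalar source: `pht_index` and `utilization` are hypothetical names, the table size is assumed to be a power of two, and folding the index in table-width chunks is one assumed reading of "XOR the upper half with the lower half".

```python
from collections import Counter


def pht_index(branch_addr, ghr, ghr_bits, pht_entries=8192, fold=True):
    """Map a branch address and global history to a PHT entry.

    Assumes pht_entries is a power of two.
    """
    history = ghr & ((1 << ghr_bits) - 1)   # keep the last ghr_bits outcomes
    index = (branch_addr >> 2) ^ history    # "BR+GHR": address XORed with history
    mask = pht_entries - 1
    if not fold:
        return index & mask                 # upper bits are simply discarded
    index_bits = pht_entries.bit_length() - 1
    folded = 0
    while index:                            # XOR the upper chunks onto the lower one
        folded ^= index & mask
        index >>= index_bits
    return folded


def utilization(trace, ghr_bits, pht_entries=8192, fold=True):
    """Count how many dynamic branches land on each PHT entry.

    trace is an iterable of (branch_addr, taken) pairs.
    """
    ghr = 0
    counts = Counter()
    for branch_addr, taken in trace:
        counts[pht_index(branch_addr, ghr, ghr_bits, pht_entries, fold)] += 1
        ghr = ((ghr << 1) | int(taken)) & ((1 << ghr_bits) - 1)
    return counts
```

With a 20-bit history and an 8192-entry table, truncation ignores the seven oldest history bits, whereas folding lets them influence the chosen entry. Running `utilization` on the same trace with different `ghr_bits` values gives per-entry counts comparable in spirit to the utilization plots.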

Variable GHR Length: Simulation
[Figure: GHR length versus number of mispredictions for Baseline (GHR13), GHR10, GHR16, GHR20, and GHR26; applications: ammp, applu, apsi, art, bzip2, swim, twolf, vortex.]

Variable GHR Length: Simulation
[Figure: GHR length versus % improvement over the baseline for GHR10, GHR16, GHR20, and GHR26; applications: ammp, applu, apsi, art, bzip2, swim, twolf, vortex.]

Dynamic History Length
- Optimal branch history length
  - Some branches prefer a short history (less training time)
  - Some require a longer history (complex behavior)
- Vary the history length
  - Choose through profiling or compile-time hints
  - Or learn it dynamically
- References
  - Maria-Dana Tarlescu, Kevin B. Theobald, and Guang R. Gao. Elastic History Buffer: A Low-Cost Method to Improve Branch Prediction Accuracy. ICCD, October 1996.
  - Jared Stark, Marius Evers, and Yale N. Patt. Variable Path Branch Prediction. ACM SIGPLAN Notices, 33(11):170-179, 1998.

Variable GHR and PHT Utilization
[Figure: number of branches and mispredictions versus PHT entry number (8192 total); series BR_Baseline, Miss_Baseline, BR_GHR20, and Miss_GHR20.]

Changing the Branch Predictor
- Before 2001, most work refined two-level adaptive branch prediction [Yeh & Patt '92]
  - A first-level table records recent global or per-branch pattern histories
  - A second-level table learns correlations between histories and outcomes
- Refinements focus on reducing destructive interference
- Some of the better refinements
  - gshare [McFarling '93], agree [Sprangle et al. '97], hybrid predictors [Evers et al. '96], skewed predictors [Michaud et al. '97]

A Machine Learning Approach
- Conditional branch prediction is a machine learning problem
  - The machine learns to predict conditional branches
  - So why not apply a machine learning algorithm?
- Artificial neural networks
  - A simple model of the networks of neurons in the brain
  - Learn to recognize and classify patterns

Neuron Based Prediction
- Some of the well-known techniques
  - Perceptron-based predictors
  - Back propagation
  - Radial basis network
  - Elman network
  - Learning Vector Quantization (LVQ) network
- All are well-known, complex neural-network-based approaches
  - A lot of computation and implementation overhead
- The idea is to implement a basic, lightweight version of the above

Basics of Neuron Based Predictor
- The inputs to a neuron are branch outcome histories
  - The last n branch outcomes
  - Can be global, local (per-branch), or both (alloyed)
- Conceptually, branch outcomes are represented as
  - +1 for taken
  - -1 for not taken
- The output of the neuron is
  - Non-negative if the branch is predicted taken
  - Negative if the branch is predicted not taken
- Ideally, each static branch is allocated its own neuron

Perceptron Based Predictors
- Inputs (x's) come from the branch history
- n + 1 small integer weights (w's) are learned by on-line training
- The output (y) is the dot product of the x's and w's; predict taken if y ≥ 0
- Training finds correlations between history and outcome
[Figure: branch history information (global, local) feeds the perceptron, which produces the branch outcome (taken, not taken).]

Conclusion
- Branch predictors exploit the correlation between a branch's outcome and both its own history and the behavior of other branches
- Neural prediction could be incorporated into future CPUs
  - Accuracy is very good; complexity is still a bottleneck
  - Power and energy need to be amortized
- Predictor accuracy matters more for deeper pipelines because the misprediction penalty grows with pipeline depth

Question!
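
As a supplement to the perceptron predictor slide, here is a minimal Python sketch of the idea: one weight vector per branch, a dot product with the +1/-1 history, and on-line training. The class name, table size, history length, weight limit, and modulo indexing are assumptions chosen for illustration. The training threshold uses the form reported in the perceptron predictor literature by Jimenez and Lin; treat its exact constants as an assumption here.

```python
class PerceptronPredictor:
    """Perceptron branch predictor sketch with a global +1/-1 history."""

    def __init__(self, history_len=12, num_perceptrons=256, weight_limit=127):
        self.theta = int(1.93 * history_len + 14)  # assumed training threshold
        self.limit = weight_limit                  # assumed 8-bit signed weights
        # n + 1 weights per perceptron: w[0] is the bias weight
        self.weights = [[0] * (history_len + 1) for _ in range(num_perceptrons)]
        self.history = [-1] * history_len          # +1 = taken, -1 = not taken

    def _row(self, pc):
        return self.weights[(pc >> 2) % len(self.weights)]

    def output(self, pc):
        w = self._row(pc)
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], self.history))

    def predict(self, pc):
        return self.output(pc) >= 0                # predict taken if y >= 0

    def _clamp(self, value):
        return max(-self.limit, min(self.limit, value))

    def update(self, pc, taken):
        y = self.output(pc)
        t = 1 if taken else -1
        # Train on a misprediction or when the output is not yet confident
        if (y >= 0) != taken or abs(y) <= self.theta:
            w = self._row(pc)
            w[0] = self._clamp(w[0] + t)
            for i, xi in enumerate(self.history, start=1):
                w[i] = self._clamp(w[i] + t * xi)
        self.history = self.history[1:] + [t]      # shift in the newest outcome


def run(predictor, pc, outcomes):
    """Return the fraction of outcomes predicted correctly for one branch."""
    hits = 0
    for taken in outcomes:
        hits += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return hits / len(outcomes)
```

Each weight ends up measuring how strongly the corresponding history bit correlates with the branch outcome, which is the sense in which training "finds correlations between history and outcome". For a strictly alternating branch, the weights settle after a short warm-up and the prediction is then always correct.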