Dynamic Branch Prediction
• Why does prediction work?
  – The underlying algorithm has regularities
  – The data being operated on has regularities
  – The instruction sequence has redundancies that are artifacts of the way humans and compilers think about problems
• Is dynamic branch prediction better than static branch prediction?
  – It seems to be
  – A small number of important branches in programs have dynamic behavior
4/8/2015 Lec4 ILP

Dynamic Branch Prediction
• Performance = ƒ(accuracy, cost of misprediction)
• Branch History Table (BHT): the lower bits of the PC index a table of 1-bit values
  – Each bit says whether or not the branch was taken last time
  – No address check, so branches that share the same low-order PC bits share an entry
• Problem: in a loop, a 1-bit BHT causes two mispredictions per execution of the loop (the average is 9 iterations before exit):
  – At the end of the loop, when the branch exits instead of looping as before
  – On the first iteration the next time the loop is entered, when it predicts exit instead of looping

Dynamic Branch Prediction
• Solution: a 2-bit scheme that changes the prediction only after mispredicting twice
[Figure: state diagram with two "Predict Taken" states and two "Predict Not Taken" states, connected by T (taken) and NT (not taken) transitions.]
• Red: stop, not taken
• Green: go, taken
• Adds hysteresis to the decision-making process

BHT Accuracy
• Mispredict because either:
  – Wrong guess for that branch
  – Got the branch history of the wrong branch when indexing the table
• 4096-entry table:
[Figure: bar chart of misprediction rate (y-axis, 0% to 20%) per benchmark, grouped into integer and floating-point programs (eqntott, espresso, gcc, li, spice, doduc, fpppp, matrix300, nasa7); the bars range from 0% to 18%.]

Correlating Branch Predictor
• It may be possible to improve accuracy by looking at the behavior of other branches.

    if (aa == 2)        /* b1 */
        aa = 0;
    if (bb == 2)        /* b2 */
        bb = 0;
    if (aa != bb)       /* b3 */

• The behavior of b3 is correlated with the behavior of b1 and b2.
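The correlation between b3 and the two earlier branches can be checked by brute force. The sketch below is only an illustration: the function name `run_fragment` and the range of input values are hypothetical, and the second test is assumed to be `bb == 2`, the usual textbook form of this fragment. It records, for each combination of outcomes of the first two conditions, which outcomes of the third condition are possible.

```python
from itertools import product


def run_fragment(aa, bb):
    """Run the three tests once; return the truth value of each condition."""
    c1 = aa == 2          # condition tested by b1
    if c1:
        aa = 0
    c2 = bb == 2          # condition tested by b2 (assumed textbook form)
    if c2:
        bb = 0
    c3 = aa != bb         # condition tested by b3
    return c1, c2, c3


# Group the possible b3 conditions by what b1 and b2 observed.
b3_given = {}
for aa, bb in product(range(4), repeat=2):
    c1, c2, c3 = run_fragment(aa, bb)
    b3_given.setdefault((c1, c2), set()).add(c3)

for key in sorted(b3_given):
    print(key, sorted(b3_given[key]))
```

When both earlier conditions hold, aa and bb are both reset to 0, so the third condition is always false. A predictor that looks only at b3's own history cannot see this, while one that also records the outcomes of b1 and b2 can.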
Correlating Predictors
• Two-level predictors

    if (d == 0)         /* b1 */
        d = 1;
    if (d == 1)         /* b2 */

• In the tables below, a branch is taken when its condition is false (for d = 2, both b1 and b2 are taken).

Initial value of d   b1          Value of d before b2   b2
0                    not taken   1                      not taken
1                    taken       1                      not taken
2                    taken       2                      taken

1-bit Predictor (Initialized to NT), with d alternating between 2 and 0

d   b1 prediction   b1 action   New b1 prediction   b2 prediction   b2 action   New b2 prediction
2   NT              T           T                   NT              T           T
0   T               NT          NT                  T               NT          NT
2   NT              T           T                   NT              T           T
0   T               NT          NT                  T               NT          NT

(1,1) Predictor
• Every branch has two separate prediction bits.
  – First bit: the prediction if the last branch in the program is not taken.
  – Second bit: the prediction if the last branch in the program is taken.
• Write the pair of prediction bits together.

Combinations & Meaning

Prediction bits   Prediction if last branch not taken   Prediction if last branch taken
NT/NT             NT                                    NT
NT/T              NT                                    T
T/NT              T                                     NT
T/T               T                                     T

Correlated Branch Prediction
• Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table
• In general, an (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters
  – Thus, the old 2-bit BHT is a (0,2) predictor
• Global Branch History: an m-bit shift register keeping the T/NT status of the last m branches.
• Each entry in the table has 2^m n-bit predictors.

Correlating Branches
(2,2) predictor
  – The behavior of recent branches selects between four predictions of the next branch, updating just that prediction
[Figure: 4 bits of the branch address select an entry holding four 2-bit per-branch predictors; the 2-bit global branch history selects which of the four supplies the prediction.]

Accuracy of Different Schemes
[Figure: bar chart of frequency of mispredictions (y-axis, 0% to 20%) for nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, and li, comparing three schemes: 4,096 entries with 2 bits per entry, unlimited entries with 2 bits per entry, and 1,024 entries (2,2).]

Tournament Predictors
• Multilevel branch predictor
• Use an n-bit saturating counter to choose between predictors
• The usual choice is between global and local predictors

Tournament Predictors
Tournament predictor using, say, 4K 2-bit counters indexed by local branch address.
Chooses between:
• Global predictor
  – 4K entries indexed by the history of the last 12 branches (2^12 = 4K)
  – Each entry is a standard 2-bit predictor
• Local predictor
  – Local history table: 1024 10-bit entries recording the last 10 branches, indexed by branch address
  – The pattern of the last 10 occurrences of that particular branch is used to index a table of 1K entries with 3-bit saturating counters

Comparing Predictors
• The advantage of a tournament predictor is its ability to select the right predictor for a particular branch
  – Particularly crucial for integer benchmarks
  – A typical tournament predictor will select the global predictor almost 40% of the time for the SPEC integer benchmarks and less than 15% of the time for the SPEC FP benchmarks
[Figure: predictor accuracy on SPEC89, with the 2-bit BHT as one of the plotted schemes.]

Pentium 4 Misprediction Rate (per 1000 instructions, not per branch)
[Figure: bar chart of branch mispredictions per 1000 instructions (y-axis, 0 to 14) for SPECint2000 (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty) and SPECfp2000 (168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa).]
• SPECint: 6% misprediction rate per branch (19% of INT instructions are branches)
• SPECfp: 2% misprediction rate per branch (5% of FP instructions are branches)

Branch Target Buffers (BTB)
• Branch target calculation is costly and stalls instruction fetch.
• The BTB stores PCs the same way caches do
• The PC of a branch is sent to the BTB
• When a match is found, the corresponding predicted PC is returned
• If the branch was predicted taken, instruction fetch continues at the returned predicted PC

Branch Target Buffers
[Figure: the PC of the instruction being fetched is compared (=?) against the Branch PC column of the BTB; each entry holds a Branch PC, a Predicted PC, and extra prediction state bits.]
  – Yes: the instruction is a branch; Next PC = predicted PC
  – No: proceed normally (Next PC = PC+4)

Dynamic Branch Prediction Summary
• Prediction is becoming an important part of execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: recently executed branches are correlated with the next branch
  – Either different branches (GA)
  – Or different executions of the same branch (PA)
• Tournament predictors take this insight to the next level by using multiple predictors
  – Usually one based on global information and one based on local information, combined with a selector
  – In 2006, tournament predictors using 30K bits were in processors like the Power5 and Pentium 4
• Branch Target Buffer: includes the branch address and prediction
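The BHT and correlation points summarized above can be tried out with a small simulation. The following is a minimal sketch of an (m,n) predictor, not any particular processor's design. The class name, the table size, the branch addresses, and the use of plain saturating counters (which may differ in detail from the 2-bit state diagram on the slide) are all assumptions made for illustration. It replays the loop branch with 9 taken iterations before exit and the d = 2, 0, 2, 0 example, assuming b1 is taken when d != 0 and b2 is taken when d != 1.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m bits of global history, n-bit saturating counters."""

    def __init__(self, m, n, index_bits=4):
        self.m, self.n, self.index_bits = m, n, index_bits
        self.history = 0                    # last m outcomes, newest in bit 0
        self.top = (1 << n) - 1             # saturation value of a counter
        self.threshold = 1 << (n - 1)       # counter >= threshold predicts taken
        # One row per table index; each row holds 2**m counters, all starting at 0 (NT).
        self.table = [[0] * (1 << m) for _ in range(1 << index_bits)]

    def _row(self, pc):
        # Low-order word-address bits of the PC; no tag check, so branches can alias.
        return self.table[(pc >> 2) & ((1 << self.index_bits) - 1)]

    def predict(self, pc):
        return self._row(pc)[self.history] >= self.threshold

    def update(self, pc, taken):
        row = self._row(pc)
        c = row[self.history]
        row[self.history] = min(c + 1, self.top) if taken else max(c - 1, 0)
        # Shift the outcome into the global history (stays 0 when m == 0).
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


def mispredictions(predictor, trace):
    """Replay (pc, taken) pairs and count wrong predictions."""
    misses = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


LOOP_PC, B1_PC, B2_PC = 0x40, 0x100, 0x104   # hypothetical branch addresses

# One execution of the loop: taken 9 times, then not taken on exit.
loop_once = [(LOOP_PC, True)] * 9 + [(LOOP_PC, False)]


def steady_state_loop_misses(m, n, runs=10):
    p = CorrelatingPredictor(m, n)
    mispredictions(p, loop_once)             # warm-up execution, not counted
    return mispredictions(p, loop_once * runs)


# d alternates 2, 0, 2, 0: both branches taken for d = 2, both not taken for d = 0.
d_trace = [(B1_PC, True), (B2_PC, True), (B1_PC, False), (B2_PC, False)] * 2

print("loop, (0,1) vs (0,2):",
      steady_state_loop_misses(0, 1), steady_state_loop_misses(0, 2))
print("d example, (0,1) vs (1,1):",
      mispredictions(CorrelatingPredictor(0, 1), d_trace),
      mispredictions(CorrelatingPredictor(1, 1), d_trace))
```

Under these assumptions, over ten loop executions the 1-bit table mispredicts twice per execution and the 2-bit table once. On the d example the 1-bit predictor misses all eight branches, while the (1,1) predictor misses only the two branches of the first iteration.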