CS 7960-4 Lecture 7
Combining Branch Predictors
Scott McFarling
WRL Tech. Report TN-36, 1993

Bimodal Branch Prediction
• Identifies the most popular prediction in the recent past
• Updates happen during commit
[Diagram: 10-bit PC index → table of 1024 2-bit saturating counters]

Results
• SPEC’89 programs simulated for 10M instructions (modern studies use hard-to-predict programs)
• A larger predictor reduces contention for counters
• Prediction rates saturate at 93.5% (at 2K bytes) (Fig.3)

Local Predictors
• Two-level predictor: the first level has history, the second level has saturating counters
• History gets updated immediately
[Diagram: 10-bit PC index → history table (1024 entries, 4-bit history) → 2-bit saturating counters (16 entries)]

Results
• For small predictors, there could be contention at both levels, resulting in inaccurate predictions
• Will also take longer to warm up after every context switch
• Does very well for large predictors: saturates at 97.1%

Global Predictors
• A single history register: neighboring branches have correlated results
• However, the PC is not used
[Diagram: 10-bit global history → table of 1024 2-bit saturating counters]

Do We Need PC?
• Note that the global history reveals which branch is being examined
• Hence, it outdoes bimodal predictors when the transistor budget is large (Fig.7)
• Local predictor does better: it is more important to identify the PC and local history than the behavior of neighboring branches

Gselect
• Use a combination of PC and global history
• Bimodal and global prediction are special cases (Fig.9)
[Diagram: n PC bits and m global-history bits (5-bit global history shown) are concatenated into an (n+m)-bit index into 1024 entries of 2-bit saturating counters]

GShare
• XOR-ing 10 history bits and 10 PC bits has more info than the concatenation of 5 bits of each, and more info than each individual component

Branch Address   Global History   Gselect 4/4   Gshare 8/8
00000000         00000001         00000001      00000001
00000000         00000000         00000000      00000000
11111111         00000000         11110000      11111111
11111111         10000000         11110000      01111111
01111110         00000001         11100001      01111111

Terminology
• GAG: Global history indexes into global array of saturating counters
• PAG: Per-address history indexes into global array of saturating counters
• GAP: Global history indexes into each PC’s private array of counters (gselect)
• PAP: Per-address history indexes into each PC’s private array of counters

Trade-Offs
• Some predictors warm up faster than others
• Some programs benefit from global history, some from local history
• Some programs have branches that interfere with each other
• Note that a 64KB local predictor has fewer saturating counters than a 64KB bimodal predictor; the former won’t be better for every program

Combining Predictors
• Use an array of saturating counters to pick the best available predictor for each PC
[Diagram: PC indexes a 1024-entry table of 2-bit saturating counters that selects between Predictor A and Predictor B]

Results
• The combination of local and gshare increases the prediction accuracy to 98.1% (Fig.16)
• For smaller transistor budgets, the combination of bimodal and gshare is better (gshare is twice the size to make sure the total is a power of two)
• A 1KB combined predictor does as well as a 16KB gselect predictor

Future Work
• Detect conflicts, correlations, and common predictions through profiling/compiler analysis
• Functions that compress information in history or PC
• Pipeline predictions: predict two branches ahead
• Hierarchical predictors: get a quick prediction in a cycle and a more accurate one two cycles later

Next Week’s Paper
• “Design Trade-Offs for the Alpha EV8 Conditional Branch Predictor”, Seznec et al., ISCA’02
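
Illustrative Sketch (not from the paper)
• The gselect and gshare index functions and the bimodal + gshare combining scheme from the earlier slides can be sketched in Python as below
• This is a minimal illustration under stated assumptions, not McFarling's simulator: all function and class names, the initial counter value, the chooser direction (counter >= 2 selects gshare), and updating every table at once rather than at commit are hypothetical choices made for brevity

```python
def gselect_index(pc, history, n, m):
    """Concatenate the low n PC bits (upper part) with the low m history bits."""
    return ((pc & ((1 << n) - 1)) << m) | (history & ((1 << m) - 1))


def gshare_index(pc, history, bits):
    """XOR the PC with the global history and keep the low `bits` bits."""
    return (pc ^ history) & ((1 << bits) - 1)


class CounterTable:
    """Array of 2-bit saturating counters (values 0..3); >= 2 predicts taken."""

    def __init__(self, index_bits):
        # Assumption: every counter starts at 1 (weakly not-taken).
        self.counters = [1] * (1 << index_bits)

    def predict(self, index):
        return self.counters[index] >= 2

    def update(self, index, up):
        c = self.counters[index]
        self.counters[index] = min(3, c + 1) if up else max(0, c - 1)


class CombinedPredictor:
    """Bimodal + gshare, with a PC-indexed chooser table picking between them."""

    def __init__(self, index_bits=10):
        self.bits = index_bits
        self.mask = (1 << index_bits) - 1
        self.bimodal = CounterTable(index_bits)
        self.gshare = CounterTable(index_bits)
        self.chooser = CounterTable(index_bits)  # >= 2 means "use gshare"
        self.history = 0                         # global history register

    def predict(self, pc):
        b = self.bimodal.predict(pc & self.mask)
        g = self.gshare.predict(gshare_index(pc, self.history, self.bits))
        return g if self.chooser.predict(pc & self.mask) else b

    def update(self, pc, taken):
        bi = pc & self.mask
        gi = gshare_index(pc, self.history, self.bits)
        b_ok = self.bimodal.predict(bi) == taken
        g_ok = self.gshare.predict(gi) == taken
        # Train the chooser only when exactly one component was right.
        if b_ok != g_ok:
            self.chooser.update(bi, g_ok)
        self.bimodal.update(bi, taken)
        self.gshare.update(gi, taken)
        # Shift the outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

• With these definitions, gselect_index(0b01111110, 0b00000001, 4, 4) gives 0b11100001 and gshare_index(0b01111110, 0b00000001, 8) gives 0b01111111, matching the last row of the Gselect 4/4 vs. Gshare 8/8 table
• A branch that strictly alternates taken / not-taken defeats the bimodal component in this sketch, so the chooser moves toward gshare for that PC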