5. Elágazás kezelés 3.1. Bevezetés: Figure 1.: Basic scheme of instruction fetching in sequential processors Figure 2.: Basic scheme of instruction fetching using branch prediction Figure 3.: Straightforward processing of an unconditional branch on a short pipeline Figure 4.: Straightforward processing of a conditional branch on a short pipeline Figure 5.: Straightforward processing of a conditional branch on a long pipeline Figure 6.: Pipeline stages in Intel’s and AMD’s processors Figure 7.: Performance (P) and dissipation vs. number of pipeline stages 3.2. Áttekintés és Lokális: Figure 8.: Overview of basic prediction mechanisms for predicting the branch direction Figure 9.: Local prediction Figure 10.: Global prediction (Prediction depends on the actual path) Figure 11.: Hardware prediction Figure 12.: Principle of the local, dynamic prediction with 1-bit branch history Figure 13.: Alternatives to implement 2-level local branch prediction Figure 14.: The principle of Pentium Pro’s 128x4 way set associative BHT Figure 15.: The actual layout of Pentium Pro’s 128x4 way set associative BHT 3.3. Globális: Figure 16.: Straightforward global prediction Figure 17.: Principle of the Gshare prediction Figure 18.: Principle of the Gselect prediction Figure 19.: Dynamic branch prediction of Athlon (K7) using the Gselect (Gshare) mechanism 3.4. Global choice: Figure 20.: Dynamic branch prediction in the Athlon-64 (K8) using the Gselect (Gshare) mechanism Figure 21.: Principle of the combined prediction using local and global prediction at the same time (used in the Alpha 21264, or the POWER 4) Figure 21.: Principle of the combined prediction using local and global prediction at the same time (used in the Alpha 21264, or the POWER 4) Figure 22.: Implementation alternatives of choice prediction Figure 23.: The principle of the combined branch prediction of the POWER 4 3.5. Advanced features: Figure 24.: Advanced branch prediction features of the Pentium family Figure 25.: Advanced branch prediction features of AMD’s K6/K7/K8 family Figure 26.: Advanced branch prediction features of IBM’s POWER family 3.6. Áttekintő összefoglalás: Figure 27.: Overview of the implemented schemes for predicting the branch direction 3.7. Többszörös eljárások: Figure 28.: Assumed Simplified branch prediction scheme of the K7, without showing the global prediction (A: address bit, C: Conditional branch, W: Way) Figure 29.: Assumed simplified branch prediction scheme of the K8, without showing the global prediction (C: Conditional branch, R: Return, W: Way 0/1, SA: Start address) Figure 30.: The physical implementation of branch prediction in Intel’s Northwood and Prescott processors Figure 31.: Instruction Cache: More then instructions alone Figure 32.: Chapter 4, Opteron’s Instruction Cache and Decoding 3.8. BTA: Figure 33.: Alternatives to generate the BTA Figure 34.: The principle of branch prediction using a BHT and a BTAC (C: counter) Figure 35.: Alternative ways to access BHTs/BTACs + Sequential I F A R IIFA Branch I$ BTA IB Further processing Figure 1.: Basic scheme of instruction fetching in sequential processors + I F A R IIFA N Y I$ Branch fetch BTA Branch prediction IB Further processing Figure 2.: Basic scheme of instruction fetching using branch prediction Unconditional branh instruction F D W E Branch BT A 2 bubbles Fetch BT I Figure 3.: Straightforward processing of an unconditional branch on a short pipeline Conditional branh instruct ion F D E W Branch Condition ? BT A Fetch BT I 3 bubbles Figure 4.: Straightforward processing of a conditional branch on a short pipeline F1 F2 F3 D1 E1 E2 Condition ? BT A Fetch BT I Large number of bubbles Figure 5.: Straightforward processing of a conditional branch on a long pipeline No of pipeline stages 40 30 20 Pentium Pro (~12) Pentium K6 * (6) (5) * * 10 1990 1995 Pentium 4 (~20) * Athlon (6) P4 Prescott (~30) * Conroe (14) * Athlon-64 (12) * * 2000 2005 Figure 6.: Pipeline stages in Intel’s and AMD’s processors Year P/D P D 10 20 30 Nr. of pipeline stages 40 Figure 7.: Performance (P) and dissipation vs. number of pipeline stages Prediction of the branch direction Local Global (2-level) 1-level 2-level Static Compiler hints Straightforward global Gshare Dynamic HW-prediction 1-bit 2-bit Figure 8.: Overview of basic prediction mechanims for predicting the branch direction Gselect ? Figure 9.: Local prediction 1 0 0 Path 2: . . 0 0 Path 1: 0 0 . 0 0 ? Figure 10.: Global prediction (Prediction depends on the actual path) . 1 0 0 D < 0 Taken ? D>0 Not taken Figure 11.: Hardware prediction BHT IFA: x } x: 0: sequential cont. 1: branch Figure 12.: Principle of the local, dynamic prediction with 1-bit branch history 2-level local branch prediction With a shared global history table for all patterns With individual history tables for different patterns (Alpha 21264) (Pentium Pro) IFA: IFA: Local BHT (e.g. 1K*10bit) Local BHT (e.g. 1K*3bit) Local BHT (e.g. 128*4bit) Local BHT e.g. 4-ways each Figure 13.: Alternatives to implement 2-level local branch prediction 76 0 BTA (linear) Tag BHT Index 127 Way 2 Way 3 Way 0 Way 1 0 1 0 0 15 0 xx xx: 00/01 not taken 10/11 taken Tags History 4-bit Tags History 4-bit Tags History 4-bit 0 Counters Figure 14.: The principle of Pentium Pro’s 128x4 way set associative BHT Tags History 4-bit 127 0 Tag Tag Tag Tag H C H C H C H Figure 15.: The actual layout of Pentium Pro’s 128x4 way set associative BHT C Global history (shift register) 0 1 1 0 0 1 1 BHT x Branch history Figure 16.: Straightforward global prediction Global history IFA ... 0 1 1 0 0 1 1 1 0 0 1 1 0 0 } XOR BHT x Figure 17.: Principle of the Gshare prediction Branch history Global history 0 1 1 0 0 1 1 BHT Branch history x ... 1 IFA: 0 1 1 0 Figure 18.: Principle of the Gselect prediction 0 BHT (4K * 2 bit) 8 Global history 31 IFA: 7 4 3 0 11 0 0 Subsequent fetch blocks 16 xx } xx: 00/01: Not taken 10/11: Taken 256 Figure 19.: Dynamic branch prediction of Athlon (K7) using the Gselect (Gshare) mechanism 8 BTH Global history 47 IFA: 8 7 4 3 (4*4K*2bit) 0 16 xx 256 ? 256 Branch 2 256 Branch 1 256 Branch 0 Figure 20.: Dynamic branch prediction in the Athlon-64 (K8) using the Gselect (Gshare) mechanism Global history IFA: Local BHT Global BHT IFA: Best choice BHT x Local prediction Global prediction Resulting prediction Figure 21.: Principle of the combined prediction using local and global prediction at the same time (used in the Alpha 21264, or the POWER 4) Global history IFA: Local BHT IFA: Global BHT Best choice BHT x Local prediction Resulting prediction Global prediction Global prediction Local prediction Actual prediction Figure 21.: Principle of the combined prediction using local and global prediction at the same time (used in the Alpha 21264, or the POWER 4) Choice prediction 1. prediction Alpha 21264 2. level local history prediction with a shared counter table for all patterns (1K * 10 bits/1K * 3 bits) 1-level local history prediction POWER 4 (16K * 1-bit) PA-8500/8700 2. prediction Choice Simple 2-level global prediction Global history referenced choice table (12-bit global history/4K * 2 bits) (12-bit global history/4K * 2-bits) 2-level Gshare global prediction (11-bit global history is hashed with the IFA, 16K * 1-bit counter table) Compiler hints Figure 22.: Implementation alternatives of choice prediction Accessed in the same way as the global counter table (16K * 1-bit) 2-bits 11-bit global history ... 18 5 0 1 1 0 0 1 1 1 0 XOR 0 1 1 0 0 } IFA IFA: BHT 14 14 14 16K*1bit 16K*1bit 16K*1bit Local History Selector Table Global History Update Local prediction Select the better Global prediction Figure 23.: The principle of the combined branch prediction of the POWER 4 1-bit per group Alpha 21264 Branch Hybrid Prediction gshare P1 P2 EECC551 - Shaaban #41 lec # 5 Winter 2005 1-9-2006 Pentium Pro PII/PIII P4 W/N P4 P RAS Y Y Y Y Branch hints - - Y Y Loop detector - - - - Indir. branch pred. - - - Y Figure 24.: Advanced branch prediction features of the Pentium family K6 K7 K8 RAS Y Y Y Branch hints - - - Loop detector - - - Indir. branch pred. - - - Figure 25.: Advanced branch prediction features of AMD’s K6/K7/K8 family POWER 3 POWER 4 Return Stack Link Stack* 2x Return Stack* Branch hints - Y Y Loop detector - Count Stack* Indir. branch pred. - RAS - POWER 5 Count Stack* - * Supported by compiler hints Figure 26.: Advanced branch prediction features of IBM’s POWER family Prediction of the branch direction Basic prediction mechanism Local 2-level 1-level S tatic Dynamic Compiler hints 1-bit S hared counters 2-bit Individual counters S imple global Auxiliary combined mechanism (Advanced branch prediction features) Global Combined 2-level Choice prediction Gshare Preemptive use of compiler hints Dedicated prediction RAS Pentium Pentium Pro (512*2) Pentium Pro Pentium Pro P4 Will/Northw. (4K*2) P4 Will/Northw. P4 Will/Northw. P4 Will/Northw. P4 Will/Northw. P4 Prescott (4K*2) P4 Prescott P4 Prescott P4 Prescott P4 Prescott K6 K6 (8K*2) K6 (16-entries) K7 K7 K7 (12-entries) K8 K8 (16K*2) K8 (12-entries) PPC 604 PPC 604 (512*2) PPC 620 PPC 620 (2K*2) POWER 3 POWER 3 (2K*2) PPC 620 (POWER 4) (11-bit/16K*1) POWER 4 PPC 620 POWER 4 POWER 4 Alpha 21164 (12-entries) (Alpha 21264) (1K*10/1K*3) Alpha 21264 (Alpha 21264) (12its/4K*2) Alpha 21264 (32-entries) Alpha 21264 PA-8000 (256*3) PA-8000 PA-8500/8700 UltraS PARC-III POWER 4 Alpha 21164 (2K*2) PA-8000 P4 Prescott PPC 604 (POWER 4) (16K*1) Alpha 21164 PA-8500/8700 (UltraS PARC-III) Indirect branch 3-bit Pentium Pro POWER 4 For cycles Gselect Pentium (256*2) Pentium Backup use of static prediction UltraS PARC-III (12-bits/16K*2) UltraS PARC-III UltraS PARC-III (8-entries) Figure 27.: Overview of the implemented schemes for predicting the branch direction POWER 4 31 15 14 IFA: Tag 31 4 30 2-way set associative I$ 14 13 43 0 BTAC IFA: Tag Index BTA 0 BTA 1 Index 1K x 2 addr. I F A R IFA [13:4] 1K*16B fetch blocks Way 0 Way 1 IFA [14:4] IFA [14:4] [31:15] [31:15] 16 b 16 b S elector Table BTA (Exec.) (shared for the upper and lower parts of the I$) 1K*16B fetch blocks BTA1 BTA0 Fetch unit (during predecoding) Tags 15 16B+P IFA [3:0] 16 B Fetch block 16B+P 0 16 B Selector block Tags IFA [3:1] A 15 0 IFA14 W: 31 BTA 0 C: BTA x x 32-bit Decode and issue instructions beginning with the given address Sequential (no branch) 12 entries RET BTA 1 BTA 0 Take or not according to the global prediction (cond. branch) Take the branch (uncond. branch) RAT RET address Figure 28.: Assumed Simplified branch prediction scheme of the K7, without showing the global prediction (A: address bit, C: Conditional branch, W: Way) +16 31 15 14 IFA: Tag 31 4 30 2-way set associative I$ 14 13 43 0 IFA: Tag Index SA Index BTAC ? BTA 2 BTA 1 BTA 0 I F A R 512 x 4 addr. 1K*16B fetch blocks IFA [12:4] Way 1 Way 0 S elector Table IFA [14:4] IFA [14:4] [31:15] [31:15] BTA calculator ? BTA2 BTA1 BTA0 1K*16B fetch blocks Tags 15 16B+P 16 b SA Predecoding SA [3:0] 16 B Fetch block 16 b 0 16 B Selector block 16B+P Tags IFA [3:1] 15 0 x x Old IFA 13 14 W 14 New IFA 0 RC BTA 11-bit Decode and issue instructions beginning with the given address Sequential (no branch) BTA2/RET BTA1/RET BTA0/RET 12 entries RAT Take or not according to the global prediction (cond. branch) Take the branch (uncond. branch) RET address Figure 29.: Assumed simplified branch prediction scheme of the K8, without showing the global prediction (C: Conditional branch, R: Return, W: Way 0/1, SA: Start address) + 16 Figure 30.: The physical implementation of branch prediction in Intel’s Northwood and Prescott processors Figure 31.: Instruction Cache: More then instructions alone Figure 32.: Chapter 4, Opteron’s Instruction Cache and Decoding BTA Calculated on the fly Examples Ultra SPARC III K6 Power 4, 5 Accessed from BTAC From the I$ PPro/PII/PIII/P4 21264 K7/K8 Power 3 Figure 33.: Alternatives to generate the BTA + IFA: I$ IFA: BHT BTAC IFA: Tag I F A R Update BTAC (create/delete BTAC entry) IB C Further processing Update BHT with branch result Tags BTA Update BTAC with BTA if BHT initiates it. Figure 34.: The principle of branch prediction using a BHT and a BTAC (C: counter) if BTAC misses IIFA BTA if mispred. if BTAC hits Accessing BHTs/BTACs Cache-like access (direct / set associative) Indexed access IFA: Associative access IFA: Index BHT C (Counters) For large tables most branches will map to a unique entry. For smaller tables multiple branches may map to the same entry, resulting in interferences and thus in degrated prediction accuracy. IFA: Tags Index IFA Tags C Tags C IFA Reduces interferences but increases cost. Avoids interference but stronly increases cost. Examples: 16K entry local BHT (Power4) 16K entry global BHT (Power4) 16K entry selector table (Power4) C (E.g. two-way set associative) 128*4 way BHT/BTAC (Pentium Pro) 1K*4 way BHT/BTAC (Pentium II, III, 4) 128*2 way BTAC (Power3) Figure 35.: Alternative ways to access BHTs/BTACs 64K entry BTAC (PPC 604) Irodalomjegyzék Többszálas processzorok 1. J.M. Borkenhagen at all: “A multithreaded PowerPC processor for commercial servers”, IBM. J. Res. Develop. Vol.44, No. 6., November 2005, pp. 885-898. 2. MAJC 3. C. McNary, R. Bhatia: Montecito: “A dual-core, dual-thread Itanium processor”, IEEE Micro, MarchApril 2005,pp. 10-20. 4. P. Kongetira at all: “Niagara: a 32-way multithreaded sparc processor”, IEEE Micro, March-April 2005, pp. 21-29. 5. Sun’s UltraSparc T1 – the Next Generation Server CPUs, December 29, 2005, 8 pages. 6. R. Hetherington: “The UltraSPARC T1 Processor – Power Efficient Throughput Computing”, Sun Microsystems, December 2005, pp. 3-11. 7. R. Cross: “Hyper-Threading Technology”, Intel Technology Journal, Volume 06, Issue 01, 2002, pp. 4-16. 8. 21464 9. B. Sinharcy at all: “Power5 system microarchitecture”, IBM J. Res. & Dev. Vol. 49. No. 4/5, July/September 2005, pp. 505-521. 10. “Fundamentals of Multithreading”, SystemLogic.net, June 2001, 19 pages. Többmagos processzorok 1. “Intel Core Microarchitecture Design Goals”, Inside Intel Core Microarchitecture White Paper, pp. 3-12. 2. “The UltraSPARC IV Processor Architecture”, pp. 1-8. 3. J.M. Tendler at all: “Power4 system microarchitecture”, IBM J. Res. & Dev. Vol. 46, No. 1, January 2002, pp. 5-25. 4. “Lostcircuits HP PA-8800 RISC Processor”, Review by MS, October 19, 2001. 12 pages. 5. “UltraSPARC IV Processor Architecture Overview”, Technical Whitepaper, Sun Microsystems, February 2004, pp. 1-8.