Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti Pentium Pro Case Study • Microarchitecture – Order-3 Superscalar – Out-of-Order execution – Speculative execution – In-order completion • Design Methodology • Performance Analysis • Retrospective Goals of P6 Microarchitecture IA-32 Compliant Performance (Frequency - IPC) Validation Die Size Schedule Power BTB/ICU P6 – The Big Picture BAC/Rename Allocation Fetch 4 Cycles Decode 2 Cycles Dispatch 2 Cycles Reservation Station (20) 4 2 cyc ROB (40x157) 3 2 AGU1 AGU0 MOB RRF 1 0 2 Cycles IEU1 IEU0 JEU Fadd Fmul Imul DCU Div Memory Hierarchy Main Memory PCI 64 bit ICache (8KB) CPU BIU 16 bytes DCache (8Kb) L2 Cache (256Kb) • Level 1 instruction and data caches - 2 cycle access time • Level 2 unified cache - 6 cycle access time • Separate level 2 cache and memory address/data bus Instruction Fetch Prediction Inst. Instruction TLB Branch Target Buffer (512) 2 cycle 16 bytes + marks Inst. Buf Inst. Rotator Physical Addr. Length Marks Instruction Data Victim Cache Inst. Length Decoder Prediction Marks Fetch Address Next Addr. Logic ICache (8Kb) Instruction Data Mux Stream Buffer Other Fetch Requests 16 bytes L2 Cache (256Kb) To Decode Branch Target Instruction Cache Unit Lower 12 bits Stream Buffer Bus Interface Unit Fetch Address Upper 20 bits ICache (8 Kb) Victim Cache Instruction Tag Array Data Mux Instruction Data Lower 12 bits Hit/Miss ITLB Branch Target Buffer Target Addr. 4-bit BHR spec. 4-bit BHR Br. Offset Fetch Addr. Tag Tag Compare Fetch Address Return Stack Prediction Control Logic Prediction & Target Addr. Pattern History Table (PHT) is not speculatively updated A speculative Branch History Register (BHR) and prediction state is maintained Uses speculative prediction state if it exist for that branch 16 entries/set PHT Way 3 Target Addr. 4-bit BHR spec. 4-bit BHR Br. Offset Fetch Addr. Tag Way 1 Target Addr. 4-bit BHR spec. 4-bit BHR Br. Offset Fetch Addr. Tag 128 Sets Way 0 Branch Prediction Algorithm 0010 Branch Execution Br. History 0 0 1 State Machine 0 11 10 Br. Pred. 1 0 1 0 1 Speculative History 0101 Pattern Table 0000 0001 0010 0011 0100 0101 0110 Spec. Pred. 1 0 1 . . . 1110 1111 Current prediction updates the speculative history prior to the next instance of the branch instruction Branch History Register (BHR) is updated during branch execution Branch recovery flushes front-end and drains the execution core Branch mis-prediction resets the speculative branch history state to match BHR Instruction Decode - 1 Macro-Instruction Bytes from IFU Instruction Buffer uROM Decoder 0 4 uops Decoder 1 Decoder 2 1 uop 1 uop 16 bytes To Next Address Calc. Branch Address Calc. uop Queue (6) Up to 3 uops Issued to dispatch Branch instruction detection Branch address calculation - Static prediction and branch always execution One branch decode per cycle (break on branch) Instruction Decode - 2 Macro-Instruction Bytes from IFU Instruction Buffer uROM Decoder 0 4 uops Decoder 1 Decoder 2 1 uop 1 uop 16 bytes To Next Address Calc. Branch Address Calc. uop Queue (6) Up to 3 uops Issued to dispatch Instruction Buffer contains up to 16 instructions, which must be decoded and queued before the instruction buffer is re-filled Macro-instructions must shift from decoder 2 to decoder 1 to decoder 0 What is a uop? Small two-operand instruction - Very RISC like. IA-32 instruction add (eax),(ebx) Uop decomposition: ld guop0, (eax) ld guop1, (ebx) add guop0,guop1 sta eax std guop0 MEM(eax) <- MEM(eax) + MEM(ebx) guop0 <- MEM(eax) guop1 <- MEM(ebx) guop0 <- guop0 + guop1 MEM(eax) <- guop0 Renaming Dispatch Buffer (3) uoP Queue (6) Instruction Dispatch To Reservation Station Mux Logic Retirement Info Allocator 2 cycles Register Renaming Allocation requirements “3-or-none” Reorder buffer entries Reservation station entry Load buffer or store buffer entry Dispatch buffer “probably” dispatches all 3 uops before re-fill Register Renaming - 1 Real Register File (RRF) EAX EBX 8 ECX 8 12 4 9 FST0 FST1 GuoP0 GuoP1 Integer RAT EAX EBX ECX GuoP0 GuoP1 Floating Point RAT FST0 FST1 FST2 IuoP(0-3) CC/Events Reorder Buffer (ROB) 0 1 2 3 4 5 6 7 8 9 39 FST7 Similar to Tomasulo’s Algorithm - Uses ROB entry number as tags The register alias tables (RAT) maintain a pointer to the most recent data for the renamed register Execution results are stored in the ROB Register Renaming - Example Real Register File (RRF) EAX EBX 8 ECX 8 FST0 FST1 12 GuoP0 GuoP1 4 9 IuoP(0-3) CC/Events Integer RAT EAX EBX ECX Reorder Buffer (ROB) 0 1 2 GuoP0 3 GuoP1 4 5 Comp sub 6 7 Floating Point RAT Alloc 8 FST0 9 FST1 FST2 39 FST7 Dispatching: add eax, ebx add eax, ecx fxch f0, f1 Completing: sub eax, ecx © Shen, Lipasti 15 Challenges to Register Renaming Real Register File (RRF) EAX EBX 8 ECX 8 12 4 9 Integer RAT EAX EBX ECX GuoP0 GuoP1 FST0 FST1 GuoP0 GuoP1 Floating Point RAT FST0 FST1 FST2 39 IuoP(0-3) CC/Events 8-bit code mov mov add add Reorder Buffer (ROB) 0 1 2 3 4 5 6 7 8 9 FST7 AL, #data1 AH, #data2 AL, #data3 AL, #data4 Byte addressable registers Out-of-Order Execution Engine RS bypass Reservation Station (20) 4 3 2 AGU1 AGU0 1 0 2 Cycles IEU1 IEU0 JEU Fadd MOB Fmul Imul DCU (8Kb) Div • In-order branch issue and execution • In-order load/store issue to address generation units • Instruction execution and result bus scheduling • Is the reservation station “truly” centralized & what is “binding”? Reservation Station From Dispatch Queue Port 4 Port 3 Port 2 Port 1 Port 0 Cycle 1 Cycle 2 To Execution Units • Cycle 1 – Order checking – Operand availability • Cycle 2 – Writeback bus scheduling Memory Ordering Buffer (MOB) AGU0 Load Buffer (16) AGU1 Store Address Buffer (12) Conflict Logic R/S Store Data Buffer (12) 2 cycle Data Cache Unit (8Kb) MOB Control 2 cycle Bypass Logic • Load Data Result Load buffer retains loads until completed, for coherency checking • Store forwarding out of store buffers • 2 cycle latency through MOB • “Store Coloring” - Load instructions are tagged by the last store Instruction Completion • Handles all exception/interrupt/trap conditions • Handles branch recovery – OOO core drains out right-path instructions, commits to RRF – In parallel, front end starts fetching from target/fall-through – However, no renaming is allowed until OOO core is drained – After draining is done, RAT is reset to point to RRF – Avoids checkpointing RAT, recovering to intermediate RAT state • Commits execution results to the architectural state in-order – Retirement Register File (RRF) – Must handle hazards to RRF (writes/reads in same cycle) – Must handle hazards to RAT (writes/reads in same cycle) • “Atomic” IA-32 instruction completion – uops are marked as 1st or last in sequence – exception/interrupt/trap boundary • 2 cycle retirement Pentium Pro Design Methodology - 1 Performance Model Behavioral Model Microarchitecture 0.75 years 1 year Start 1990 Finish 4Q 1994 Structural Model Conception Logic Design 1 year Silicon Debug 1.75 years Circuit Design Layout Pentium Pro Performance Analysis • Observability – On-chip event counters – Dynamic analysis • Benchmark Suite – BAPco Sysmark32 - 32-bit Windows NT applications – Winstone97 - 32-bit Windows NT applications – Some SPEC95 benchmarks Performance – Run Times User-Mode Processor Cycles Unhalted User-Mode Processor Cycles 1.2E+11 Total of 27.5 billion cycles 1E+11 8E+10 6E+10 4E+10 2E+10 0 li avs Pwave word Corel excel MSVC PhotoShop microstation paradox PowerPnt WordPro SYSflw96 SYSexcel SYSword7 SYSCorel6SYSexcel7 compress PageMaker SYSParadox7 SYSpowerpnt7 SYSpagemaker6 go Performance – IPC vs. uPC Instructions and UOPs Uops Retired Per Cycle Instructions and retired per cycle 1.6 1.4 2 uops/instruction 1.2 Inst retired UOPS retired 1 0.8 0.6 # Retired Per Cycle 0.4 0.2 0 WS-AVS WS-Excel WS-PWave WS-MSVC++ WS-PhotoShop WS-Microstation WS-CorelDraw WS-Word WS-Paradox WS-PowerPnt WS-PageMaker SPEC-Li SPEC-Go SYS-ExcelSYS-Word SYS-Excel SYS-FlW96 WS-WordPro SYS-CorelDraw SYS-Paradox SYS-PowerPnt SYS-Pagemaker SPEC-Compress Performance – IPC vs. uPC Inst.Inst.oror UOPS Uops per cycle retired retired per cycle 100 90 80 70 AVS PageMaker 60 Paradox 50 PowerPoint WordPro 40 30 Instructions were Retired 20 Percent10 of Cycles where n=0-3 UOPs or 0 3 UOPs 2 UOPs 1 UOP 0 UOPs 3 Inst. 2 Inst. 1 Inst. 0 Inst. Performance – Cache Misses Misses Per Cycle CacheCache Misses Per Cycle 2.5 2 IL1 - 0.51% DL1 - 1.15% L2 - 0.25% IFU Misses / Cycle DCU Misses/ Cycle L2 Misses / Cycle 1.5 1 %UnhaltedCycles 0.5 0 WS-AVS WS-Excel WS-PWave WS-Microstation WS-MSVC++ WS-PhotoShop WS-CorelDraw WS-Word WS-Paradox WS-PowerPnt WS-PageMaker SPEC-Li SYS-ExcelSYS-Word SYS-Excel SYS-FlW96 WS-WordPro SYS-CorelDraw SYS-Paradox SYS-PowerPnt SYS-Pagemaker SPEC-Compress SPEC-Go Performance – Branch Prediction Branch Mispredict Rate Branch Mispredict Rate 16% 14% 6.8% avg 12% 10% 8% 6% 4% % B ran ches M isp red icted 2% 0% WS-AVS SPEC-Li WS-Excel WS-Word SYS-Ex SPEC-Go SYS-Word cel SYS-Excel WS-PWave SYS-FlW96 WS-Paradox WS- MSVC++ WS-WordPro SYS-Paradox WS-PowerPnt SYS-PowerPnt WS-CorelDraw WS-PhotoShop WS-PageMaker SYS- CorelDraw WS-Mi crostation SYS-Pagemaker SPEC-Compr ess BTB Miss Rate B T B M i ss R ate 45% 40% 35% 30% 25% 20% 15% 10% % Bran ches th at M iss in BT B 5% 0% W S -A V S S P E C -L i W S -E x ce l W S -W o rd S Y S -E S P E C -G o S xc Y S e-W l o rd S Y S -E xc e l W S -P W a v e S Y S -F lW 9 6 W S -P a ra d o x W S -MS V C+ + W S -W o rd P ro S Y S -P a ra d o x W S - P o w e rP n t S Y S -P o w e rP n t So -C ra W S -P h o toW Sh p o re lD W Sw -P a g e Ma ke r S Y S -C o re lD ra w W S -M ic ro st a t io n S Y S -P a g e ma k e r S P E C -Co mp re ss Conclusions IA-32 Compliant Performance (Frequency - IPC) 366.0 ISpec92 283.2 FSpec92 8.09 SPECint95 6.70 SPECfp95 Validation Die Size - Fabable Schedule - 1 year late Power - Retrospective • Most commercially successful microarchitecture in history • Evolution – Pentium II/III, Xeon, etc. • Derivatives with on-chip L2, ISA extensions, etc. – Replaced by Pentium 4 as flagship in 2001 • High frequency, deep pipeline, extreme speculation – Resurfaced as Pentium M in 2003 • Initially a response to Transmeta in laptop market • Pentium 4 derivative (90nm Prescott) delayed, slow, hot – Core Duo, Core 2 Duo, Core i7 replaced Pentium 4 © Shen, Lipasti 29 Microarchitectural Updates • Pentium M (Banias), Core Duo (Yonah) – Micro-op fusion (also in AMD K7/K8) • Multiple uops in one: (add eax,[mem] => ld/alu), sta/std • These uops decode/dispatch/commit once, issue twice – Better branch prediction • Loop count predictor • Indirect branch predictor – Slightly deeper pipeline (12 stages) • Extra decode stage for micro-op fusion • Extra stage between issue and execute (for RS/PLRAM read) – Data-capture reservation station (payload RAM) • Clock gated for 32 (int) , 64 (fp), and 128 (SSE) operands © Shen, Lipasti 30 Microarchitectural Updates • Core 2 Duo (Merom) – 64-bit ISA from AMD K8 – Macro-op fusion • Merge uops from two x86 ops • E.g. cmp, jne => cmpjne – 4-wide decoder (Complex + 3x Simple) • Peak x86 decode throughput is 5 due to macro-op fusion – Loop buffer • Loops that fit in 18-entry instruction queue avoid fetch/decode overhead – Even deeper pipeline (14 stages) – Larger reservation station (32), instruction window (96) – Memory dependence prediction © Shen, Lipasti 31 Microarchitectural Updates • Nehalem (Core i7/i5/i3) – – – – – – – – – – – RS size 36, ROB 128 Loop cache up to 28 uops L2 branch predictor L2 TLB I$ and D$ now 32K, L2 back to 256K, inclusive L3 up to 8M Simultaneous multithreading RAS now renamed (repaired) 6 issue, 48 load buffers, 32 store buffers New system interface (QPI) – finally dropped front-side bus Integrated memory controller (up to 3 channels) New STTNI instructions for string/text handling © Shen, Lipasti 32 Microarchitectural Updates • Sandybridge (2nd generation Core i7) – On-chip integrated graphics (GPU) – Decoded uop cache up to 1.5K uops, handles loops, but more general – 54-entry RS, 168-entry ROB – Physical register file: 144 FP, 160 integer – 256-bit AVX units: 8 DPFLOP/cycle, 16 SPFLOP/cycle – 2 general AGUs enable 2 ld/cycle, 2 st/cycle or any combination, 2x128-bit load path from L1 D$ © Shen, Lipasti 33