CS 7810 Lecture 22 Processor Case Studies, The Microarchitecture of the Pentium 4 Processor G. Hinton et al. Intel Technology Journal Q1, 2001 Clock Frequencies • Aggressive clocks => little work per pipeline stage => deep pipelines => low IPC, large buffers, high power, high complexity, low efficiency • 50% increase in clock speed => 30% increase in performance Mispredict latency = 10 cyc Mispredict latency = 20 cyc Deep Pipelines Variable Clocks • The fastest clock is defined as the time for an ALU operation and bypass (twice the main processor clock) • Different parts of the chip operate at slower clocks to simplify the pipeline design (e.g. RAMs) Microarchitecture Overview Front End • ITLB, RAS, decoder • Trace Cache: contains 12Kmops (~8K-16KB I-cache), saves 3 pipe stages, reduces power • Front-end BTB accessed on a trace cache miss and smaller Trace-cache BTB to detect next trace line – no details on branch pred algo • Microcode ROM: implements mop translation for complex instructions Execution Engine • Allocator: resource (regs, IQ, LSQ, ROB) manager • Rename: 8 logical regs are renamed to 128 phys regs; ROB (126 entries) only stores pointers (Pentium 4) and not the actual reg values (unlike P6) – simpler design, less power • Two queues (memory and non-memory) and multiple schedulers (select logic) – can issue six instrs/cycle Schedulers • 3GHz clock speed = time for a 16-bit add and bypass NetBurst • 3GHz ALU clock = time for a 16-bit add and bypass to itself (area is kept to a minimum) • Used by 60-70% of all mops in integer programs • Staggered addition – speeds up execution of dependent instrs – an add takes three cycles • Early computation of lower 16 bits => early initiation of cache access Detailed Microarchitecture Data Cache • 4-way 8KB cache; 2-cycle load-use latency for integer instrs and 6-cycle latency for fp instrs • Distance between load scheduler and execution is longer than load latency • Speculative issue of load-dependent instrs and selective replay • Store buffer (24 entries) to forward results to loads (48 entries) – no details on load issue algo Cache Hierarchy • 256KB 8-way L2; 7-cycle latency; new operation every two cycles • Stream prefetcher from memory to L2 – stays 256 bytes ahead • 3.2GB/s system bus: 64-bit wide bus at 400MHz Performance Results Quick Facts • November 2000: Willamette, 0.18m, Al interconnect, 42M transistors, 217mm2, 55W, 1.5GHz • February 2004: Prescott, 0.09m, Cu interconnect, 125M transistors, 112mm2, 103W, 3.4GHz Improvements • Willamette (2000) Prescott (2004) • L1 data cache 8KB 16KB • L2 cache 256KB 1MB • Pipeline stages 20 31 • Frequency 1.5GHz 3.4GHz • Technology 0.18m 0.09m Pentium M • Based on the P6 microarchitecture • Lower design complexity (some inefficiencies persist, such as copying register values from ROB to architected register file) • Improves on P4 branch predictor PM Changes to P6, cont. • Intel has not released the exact length of the pipeline. • Known to be somewhere between the P4 (20 stage) and the P3 (10 stage). Rumored to be 12 stages. • Trades off slightly lower clock frequencies (than P4) for better performance per clock, less branch prediction penalties, … Banias • 1st version • 77 million transistors, 23 million more than P4 • 1 MB on die Level 2 cache • 400 MHz FSB (quad pumped 100 MHZ) • 130 nm process • Frequencies between 1.3 – 1.7 GHz • Thermal Design Pointhttp://www.intel.com/pressroom/archive/photos/centrino.htm of 24.5 watts Dothan • Launched May 10, 2004 • 140 million transistors • 2 MB Level 2 cache • 400 or 533 MHz FSB • Frequencies between 1.0 to 2.26 GHz • Thermal Design Point of 21(400 MHz FSB) http://www.intel.com/pressroom/archive/photos/centrino.htm to 27 watts Branch Prediction • Longer pipelines mean higher penalties for mispredicted branches • Improvements result in added performance and hence less energy spent per instruction retired Branch Prediction in Pentium M • Enhanced version of Pentium 4 predictor • Two branch predictors added that run in tandem with P4 predictor: – Loop detector – Indirect branch detector • 20% lower misprediction rate than PIII resulting in up to 7% gain in real performance Branch Prediction Based on diagram found here: http://www.cpuid.org/reviews/PentiumM/index.php Loop Detector • A predictor that always branches in a loop will always incorrectly branch on the last iteration • Detector analyzes branches for loop behavior • Benefits a wide variety of program types http://www.intel.com/technology/itj/2003/volume07 issue02/art03_pentiumm/p05_branch.htm Indirect Branch Predictor • Picks targets based on global flow control history • Benefits programs compiled to branch to calculated addresses http://www.intel.com/technology/itj/2003/volume07iss ue02/art03_pentiumm/p05_branch.htm Benchmark Battery Life UltraSPARC IV • CMP with 2 UltraSPARC IIIs – speedups of 1.6 and 1.14 for swim and lucas (static parallelization) • UltraSPARC III : 4-wide, 16 queue entries, 14 pipeline stages • 4KB branch predictor – 95% accuracy, 7-cycle penalty • 2KB prefetch buffer between L1 and L2 Alpha 21364 • Tournament predictor – local and global; 36Kb • Issue queue (20-Int, 15-FP), 4-wide Int, 2-wide FP • Two clusters, each with 2 FUs and a copy of the 80-entry register file Title • Bullet