Intel Pentium 4 Processor
Presented by Michele Co
(much slide content courtesy of Zhijian Lu and Steve Kelley)

Outline
  Introduction (Zhijian)
    – Willamette (11/2000)
  Instruction Set Architecture (Zhijian)
  Instruction Stream (Steve)
  Data Stream (Zhijian)
  What went wrong (Steve)
  Pentium 4 revisions
    – Northwood (1/2002)
    – Xeon (Prestonia, ~2002)
    – Prescott (2/2004)
  Dual Core
    – Smithfield

Introduction
  Intel Pentium 4 processor
    – Latest IA-32 processor, equipped with a full set of IA-32 SIMD operations
  First implementation of Intel's new “NetBurst” microarchitecture (11/2000)

IA-32
  Intel architecture 32-bit (IA-32)
    – 80386 instruction set (1985)
    – CISC, 32-bit addresses
  “Flat” memory model
  Registers
    – Eight 32-bit general-purpose registers
    – Eight FP stack registers
    – Six segment registers

IA-32 (cont’d)
  Addressing modes
    – Register indirect (mem[reg])
    – Base + displacement (mem[reg + const])
    – Base + scaled index (mem[reg + (2^scale x index)])
    – Base + scaled index + displacement (mem[reg + (2^scale x index) + displacement])
  SIMD instruction sets
    – MMX (Pentium MMX, Pentium II)
      » Eight 64-bit MMX registers, integer ops only
    – SSE (Streaming SIMD Extensions, Pentium III)
      » Eight 128-bit registers

Pentium III vs. Pentium 4 Pipeline

Comparison Between Pentium III and Pentium 4 Execution on MPEG4 Benchmarks @ 1 GHz

Instruction Set Architecture
  Pentium 4 ISA = Pentium III ISA + SSE2 (Streaming SIMD Extensions 2)
  SSE2 is an architectural enhancement to the IA-32 architecture

SSE2
  Extends MMX and the SSE extensions with 144 new instructions:
    – 128-bit SIMD integer arithmetic operations
    – 128-bit SIMD double-precision floating-point operations
    – Enhanced cache and memory management operations

Comparison Between SSE and SSE2
  Both support operations on the 128-bit XMM registers
  SSE supports only 4 packed single-precision floating-point values
  SSE2 supports more:
    – 2 packed double-precision floating-point values
    – 16 packed byte integers
    – 8 packed word integers
    – 4 packed doubleword integers
    – 2 packed quadword integers

Packing
  [Figure: a 128-bit double quadword divided into two 64-bit quadwords or four 32-bit doublewords (word = 2 bytes)]

Hardware Support for SSE2
  Adder and multiplier units in the SSE2 engine are 128 bits wide, twice the width of those in the Pentium III
  Increased load/store bandwidth for floating-point values
    – Loads and stores are 128 bits wide
    – One load plus one store between an XMM register and the L1 cache can complete in one clock cycle

SSE2 Instructions (1)
  Data movement
    – Move data between XMM registers, and between XMM registers and memory
  Double-precision floating-point operations
    – Arithmetic instructions on both scalar and packed values
  Logical instructions
    – Perform logical operations on packed double-precision floating-point values

SSE2 Instructions (2)
  Compare instructions
    – Compare packed and scalar double-precision floating-point values
  Shuffle and unpack instructions
    – Shuffle or interleave double-precision floating-point values in packed double-precision floating-point operands
  Conversion instructions
    – Convert between doubleword and double-precision floating-point values, or between single-precision and double-precision floating-point values

SSE2 Instructions (3)
  Packed single-precision floating-point instructions
    – Convert between single-precision floating-point and doubleword integer operands
  128-bit SIMD integer instructions
    – Operations on integers contained in XMM registers
  Cacheability control and instruction ordering
    – More operations for caching of data when storing from XMM registers to memory, and additional control of instruction ordering on store operations

Conclusion
  The Pentium 4 is equipped with the full set of IA-32 SIMD technology
  All existing software can run correctly on it
  AMD has decided to embrace and implement SSE and SSE2 in future CPUs

Instruction Stream
  What’s new?
    – Added Trace Cache
    – Improved branch predictor
  Terminology
    – µop: micro-op, an already decoded RISC-like instruction
    – Front end: instruction fetch and issue

Front End
  Prefetches instructions that are likely to be executed
  Fetches instructions that haven’t been prefetched
  Decodes instructions into µops
  Generates µops for complex instructions or special-purpose code
  Predicts branches

Prefetch
  Three methods of prefetching:
    – Instructions only: hardware
    – Data only: software
    – Code or data: hardware

Decoder
  Single decoder that can operate at a maximum of 1 instruction per cycle
  Receives instructions from the L2 cache 64 bits at a time
  Some complex instructions must enlist the help of the microcode ROM

Trace Cache
  Primary instruction cache in the NetBurst architecture
  Stores decoded µops
  ~12K µops capacity
  On a Trace Cache miss, instructions are fetched and decoded from the L2 cache

What is a Trace Cache?
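
A minimal sketch of the idea, assuming a toy program that reuses the instruction names I1 to I7 and the label L1 from this slide's example. A conventional instruction cache holds instructions in address order, while a trace cache holds them in predicted execution order, so a predicted-taken branch is followed directly by its target. The names `build_trace`, `LABELS`, `max_len` and the predictor callbacks are hypothetical, chosen for illustration; the real Trace Cache stores decoded µops and is not implemented this way.

```python
# Toy model: the program in address order (what a traditional I-cache holds).
PROGRAM = ["I1", "I2", "br L1", "I3", "I4", "I5", "I6", "I7"]
# Hypothetical label table: L1 marks the address (list index) of I6.
LABELS = {"L1": 6}


def build_trace(program, labels, predict_taken, max_len=8):
    """Return instructions in predicted execution order, starting at index 0.

    predict_taken(pc) stands in for the branch predictor; max_len is an
    illustrative cap on trace length, not a documented Pentium 4 value.
    """
    trace = []
    pc = 0
    while pc < len(program) and len(trace) < max_len:
        instr = program[pc]
        trace.append(instr)
        if instr.startswith("br ") and predict_taken(pc):
            pc = labels[instr.split()[1]]  # continue the trace at the target
        else:
            pc += 1  # fall through to the next sequential instruction
    return trace


taken_trace = build_trace(PROGRAM, LABELS, lambda pc: True)
fallthrough_trace = build_trace(PROGRAM, LABELS, lambda pc: False)
print(taken_trace)        # ['I1', 'I2', 'br L1', 'I6', 'I7']
print(fallthrough_trace)  # identical to PROGRAM
```

With the branch predicted taken, the trace skips I3 to I5 entirely, which is why fetching from a trace avoids wasting fetch and decode bandwidth on instructions behind a taken branch.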
  [Figure: example code sequence I1, I2, br r2, L1, I3, I4, I5, L1: I6, I7, shown as laid out in a traditional instruction cache and in a trace cache]

Pentium 4 Trace Cache
  Has its own branch predictor that directs where instruction fetching needs to go next in the Trace Cache
  Removes
    – Decoding costs on frequently decoded instructions
    – Extra latency to decode instructions upon branch mispredictions

Microcode ROM
  Used for complex IA-32 instructions (> 4 µops), such as string move, and for fault and interrupt handling
  When a complex instruction is encountered, the Trace Cache jumps into the microcode ROM, which then issues the µops
  After the microcode ROM finishes, the front end of the machine resumes fetching µops from the Trace Cache

Branch Prediction
  Predicts ALL near branches
    – Includes conditional branches, unconditional calls and returns, and indirect branches
  Does not predict far transfers
    – Includes far calls, irets, and software interrupts

Branch Prediction (2)
  Dynamically predicts the direction and target of branches based on the PC, using the BTB
  If no dynamic prediction is available, predicts statically
    – Taken for backward (looping) branches
    – Not taken for forward branches
  Traces are built across predicted branches to avoid branch penalties

Branch Target Buffer
  Uses a branch history table and a branch target buffer to predict
  Updating occurs when the branch is retired

Return Address Stack
  16 entries
  Predicts return addresses for procedure calls
  Allows branches and their targets to coexist in a single cache line
    – Increases parallelism since decode bandwidth is not wasted

Branch Hints
  The P4 permits software to provide hints to the branch prediction and trace formation hardware to enhance performance
  Hints take the form of prefixes to conditional branch instructions
  Used only at trace build time; they have no effect on already built traces

Out-of-Order Execution
  Designed to optimize performance by handling the most common operations in the most common context as fast as possible
  126 µops can be in flight at once
    – Up to 48 loads / 24 stores

Issue
  Instructions are fetched and decoded by the translation engine
  The translation engine builds instructions into sequences of µops
  Stores µops to the trace cache
  The trace cache can issue 3 µops per cycle

Execution
  Can dispatch up to 6 µops per cycle
  Exceeds trace cache and retirement µop bandwidth
    – Allows for greater flexibility in issuing µops to different execution units

Execution Units
  Double-pumped ALUs
  An ALU executes an operation on both the rising and falling edges of the clock cycle

Retirement
  Can retire 3 µops per cycle
  Precise exceptions
  Reorder buffer to organize completed µops
  Also keeps track of branches and sends updated branch information to the BTB

Execution Pipeline

Data Stream of Pentium 4 Processor

Register Renaming

Register Renaming (2)
  8-entry architectural register file
  128-entry physical register file
  2 RATs
    – Frontend RAT and Retirement RAT
  Data does not need to be copied between register files when the instruction retires

On-chip Caches
  L1 instruction cache (Trace Cache)
  L1 data cache
  L2 unified cache
  Parameters: [Table: cache parameters]
  The caches are not inclusive, and a pseudo-LRU replacement algorithm is used

L1 Instruction Cache
  Execution Trace Cache stores decoded instructions
  Removes decoder latency from main execution loops
  Integrates the path of program execution flow into a single line

L1 Data Cache
  Nonblocking
    – Supports up to 4 outstanding load misses
  Load latency
    – 2 clocks for integer
    – 6 clocks for floating-point
  1 load and 1 store per clock
  Speculative load
    – Assumes the access will hit the cache
    – “Replays” the dependent instructions when a miss happens

L2 Cache
  Load latency
    – Net load access latency of 7 cycles
  Nonblocking
  Bandwidth
    – One load and one store in one cycle
    – A new cache operation can begin every 2 cycles
    – 256-bit wide bus between L1 and L2
    – 48 GB/s @ 1.5 GHz

Data Prefetcher in L2 Cache
  Hardware prefetcher monitors the reference patterns
  Brings in cache lines automatically
  Attempts to stay 256 bytes ahead of the current data access location
  Prefetches for up to 8 simultaneous independent streams

Store and Load
  Out-of-order store and load operations
    – Stores are always in program order
  48 loads and 24 stores can be in flight
  Store buffers and load buffers are allocated at the allocation stage
    – Total of 24 store buffers and 48 load buffers

Store
  Store operations are divided into two parts:
    – Store data
    – Store address
  Store data is dispatched to the fast ALU, which operates twice per cycle
  Store address is dispatched to the store AGU, which operates once per cycle

Store-to-Load Forwarding
  Forwards data from a pending store buffer to a dependent load
  Load stalls still happen when the bytes of the load operation are not exactly the same as the bytes in the pending store buffer

System Bus
  Delivers data at 3.2 GB/s
  64-bit wide bus
  Four data phases per clock cycle (quad pumped)
  100 MHz clocked system bus

Conclusion
  Reduced cache size vs. increased bandwidth and lower latency

What Went Wrong

No L3 cache
  Original plans called for a 1 MB cache
  Intel’s idea was to strap a separate memory chip, perhaps an SDRAM, on the back of the processor to act as the L3
  But that added another 100 pads to the processor, and would also have forced Intel to devise an expensive cartridge package to contain the processor and cache memory

Small L1 Cache
  Only 8 KB!
    – Doubled size of L2 cache to compensate
  Compare with
    – AMD Athlon: 128 KB
    – Alpha 21264: 64 KB
    – PIII: 32 KB
    – Itanium: 16 KB

Loses consistently to AMD
  In terms of performance, the Pentium 4 is as slow as or slower than existing Pentium III and AMD Athlon processors
  In terms of price, an entry-level Pentium 4 system sells for about double the cost of a similar Pentium III or AMD Athlon based system
  1.5 GHz clock rate is more hype than substance

Northwood (1/2002)
  Differences from Willamette
    – Socket 478
    – 21 stage pipeline
    – 512 KB L2 cache
    – 2.0 GHz, 2.2 GHz clock frequency
    – 0.13 µm fabrication process (130 nm)
      » 55 million transistors

Prescott (2/2004)
  Differences
    – 31 stage pipeline!
    – 1 MB L2 cache
    – 3.8 GHz clock frequency
    – 0.09 µm fabrication process (90 nm)
    – SSE3
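
The peak-bandwidth figures quoted earlier in the deck (3.2 GB/s for the system bus, 48 GB/s between L1 and L2 at 1.5 GHz) are consistent with bus width x transfers per clock x clock rate. A minimal sketch of that arithmetic follows; the function name `peak_bandwidth` is hypothetical, and treating the L1/L2 bus as one 256-bit transfer per core clock is an assumption made to reproduce the quoted number, not something the slides state.

```python
def peak_bandwidth(width_bits, clock_hz, transfers_per_clock=1):
    """Peak bandwidth in bytes per second for a simple parallel bus."""
    bytes_per_transfer = width_bits // 8
    return bytes_per_transfer * transfers_per_clock * clock_hz


# System bus: 64 bits wide, 100 MHz clock, quad pumped (4 data phases per clock).
system_bus = peak_bandwidth(64, 100_000_000, transfers_per_clock=4)

# L1/L2 bus: 256 bits wide at a 1.5 GHz core clock
# (assumption: one transfer per core clock).
l1_l2_bus = peak_bandwidth(256, 1_500_000_000)

print(f"System bus: {system_bus / 1e9:.1f} GB/s")  # System bus: 3.2 GB/s
print(f"L1/L2 bus:  {l1_l2_bus / 1e9:.1f} GB/s")   # L1/L2 bus:  48.0 GB/s
```

The factor of 15 between the two results is the gap the on-chip caches and the L2 hardware prefetcher are there to cover.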