Inside the Intel® Pentium® 4 Processor Micro-architecture
Next Generation IA-32 Micro-architecture

Doug Carmean
Principal Architect
Intel Architecture Group
August 24, 2000

Fall 2000
Copyright © 2000 Intel Corporation.

Agenda
• IA-32 Processor Roadmap
• Design Goals
• Frequency
• Instructions Per Cycle
• Summary

Intel® Pentium® 4 Processor
[Figure: IA-32 roadmap, Performance versus Time. Labels: 486 Micro-architecture, P5 Micro-architecture, P6 Micro-architecture, Intel® NetBurst™ Micro-architecture; a "Today" marker.]

Intel® Pentium® 4 Processor Design Goals
• Deliver world class performance across both existing and emerging applications
• Deliver performance headroom and scalability for the future
Micro-architecture that will Drive Performance Leadership for the Next Several Years

Intel® NetBurst™ Micro-architecture
• 400 MHz System Bus
• Advanced Transfer Cache
• Advanced Dynamic Execution
• Hyper Pipelined Technology
• Rapid Execution Engine
• Streaming SIMD Extensions 2
• Execution Trace Cache
• Enhanced Floating Point / Multi-Media

Pentium® 4 Processor Block Diagram
[Figure: block diagram. Blocks: 3.2 GB/s System Interface; L2 Cache and Control; BTB & I-TLB; Decoder; Trace Cache with its BTB; uCode ROM; Rename/Alloc; uop Queues; Schedulers; Integer RF; FP RF; Store AGU; Load AGU; four ALUs; FP move; FP store; FMul; FAdd; MMX; SSE; L1 D-Cache and D-TLB.]

CPU Architecture 101
Delivered Performance = Frequency * Instructions Per Cycle
Focus: Frequency

Frequency
• What limits frequency?
  – Process technology
  – Microarchitecture
• On a given process technology
  – Fewer gates per pipeline stage will deliver higher frequency
Frequency is driven by Micro-architecture

NetBurst™ Micro-architecture Pipeline vs P6

Basic P6 Pipeline (intro at 733MHz, .18µ), 10 stages:
  1 Fetch | 2 Fetch | 3 Decode | 4 Decode | 5 Decode | 6 Rename | 7 ROB Rd | 8 Rdy/Sch | 9 Dispatch | 10 Exec

Basic Pentium® 4 Processor Pipeline (intro at ≥ 1.4GHz, .18µ), 20 stages:
  1-2 TC Nxt IP | 3-4 TC Fetch | 5 Drive | 6 Alloc | 7-8 Rename | 9 Que | 10-12 Sch | 13-14 Disp | 15-16 RF | 17 Ex | 18 Flgs | 19 Br Ck | 20 Drive

Hyper pipelined Technology enables industry leading performance and clock rate

Hyper Pipelined Technology
[Figure: frequency over time for three micro-architectures, each labeled with a number: P5 Micro-architecture (5), P6 Micro-architecture (10), NetBurst™ Micro-architecture (20). Frequency labels: 60MHz, 166MHz, 233MHz, 1.13GHz, ≥ 1.4GHz. Markers: Introduction, Today.]
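The relation "Delivered Performance = Frequency * Instructions Per Cycle" can be made concrete with a short calculation. The sketch below is illustrative only: the function names are made up for this example, the 1.0 and 1.4 GHz inputs mirror the comparison points the deck uses elsewhere, and the IPC figure is a hypothetical placeholder rather than a measurement. It shows how much IPC a higher-frequency design can give up before it stops being faster.

```python
def delivered_performance(frequency_ghz: float, ipc: float) -> float:
    """Billions of instructions per second: frequency (GHz) times IPC."""
    return frequency_ghz * ipc


def break_even_ipc_ratio(old_ghz: float, new_ghz: float) -> float:
    """Fraction of the old IPC the faster-clocked design must retain
    to deliver the same performance as the old design."""
    return old_ghz / new_ghz


if __name__ == "__main__":
    # HYPOTHETICAL baseline IPC, chosen only to make the arithmetic visible.
    baseline_ipc = 1.0
    old = delivered_performance(1.0, baseline_ipc)
    ratio = break_even_ipc_ratio(1.0, 1.4)
    new_at_break_even = delivered_performance(1.4, baseline_ipc * ratio)
    print(f"break-even IPC ratio: {ratio:.3f}")
    print(f"old: {old:.3f}  new at break-even: {new_at_break_even:.3f}")
```

Any IPC ratio above the break-even value means the deeper pipeline comes out ahead, which is why the rest of the deck concentrates on keeping IPC up.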
Hyper Pipelined Technology: stage by stage
(Each slide repeats the 20-stage pipeline and the block diagram, highlighting one stage and the block it uses.)

TC Nxt IP (stages 1-2): Trace cache next instruction pointer
  Pointer from the BTB, indicating location of next instruction.

TC Fetch (stages 3-4): Trace cache fetch
  Read the decoded instructions (uOPs) out of the Execution Trace Cache.

Drive (stage 5): Wire delay
  Drive the uOPs to the allocator.

Alloc (stage 6): Allocate
  Allocate resources required for execution. The resources include load buffers, store buffers, etc.

Rename (stages 7-8): Register renaming
  Rename the logical registers (e.g., EAX) to the physical register space (128 are implemented).

Que (stage 9): Write into the uOP queue
  uOPs are placed into the queues, where they are held until there is room in the schedulers.

Sch (stages 10-12): Schedule
  Write into the schedulers and compute dependencies. Watch for dependencies to resolve.
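The Rename stage described above can be illustrated with a small model. This is a minimal sketch under stated assumptions, not the processor's actual mechanism: the class and method names are invented for the example, the 128 physical registers are treated as one pool, and the freeing of physical registers at retirement is left out.

```python
from collections import deque


class RenameTable:
    """Maps logical (architectural) registers to physical registers."""

    def __init__(self, logical_regs, num_physical=128):
        if num_physical < len(logical_regs):
            raise ValueError("need at least one physical register per logical register")
        # Initially logical register i lives in physical register i.
        self.mapping = {name: i for i, name in enumerate(logical_regs)}
        # Remaining physical registers are free for new results.
        self.free_list = deque(range(len(logical_regs), num_physical))

    def rename(self, dest, sources):
        """Rename one uOP; returns (physical_dest, physical_sources)."""
        # Sources read the current mapping, i.e. the most recent producer.
        phys_sources = [self.mapping[s] for s in sources]
        if not self.free_list:
            raise RuntimeError("no free physical register; allocation would stall")
        # Each new result gets a fresh physical register, which removes
        # false (write-after-write, write-after-read) dependencies.
        phys_dest = self.free_list.popleft()
        self.mapping[dest] = phys_dest
        return phys_dest, phys_sources


if __name__ == "__main__":
    table = RenameTable(["EAX", "EBX", "ECX", "EDX"])
    print(table.rename("EAX", ["EBX", "ECX"]))  # EAX = EBX op ECX
    print(table.rename("EDX", ["EAX"]))         # reads the new EAX
    print(table.rename("EAX", ["EAX", "EDX"]))  # second write to EAX
```

The two writes to EAX land in different physical registers, so a scheduler only has to track true data dependencies between uOPs.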
Disp (stages 13-14): Dispatch
  Send the uOPs to the appropriate execution unit.

RF (stages 15-16): Register file
  Read the register file. These are the source(s) for the pending operation (ALU or other).

Ex (stage 17): Execute
  Execute the uOPs on the appropriate execution port.

Flgs (stage 18): Flags
  Compute flags (zero, negative, etc.). These are typically the input to a branch instruction.

Br Ck (stage 19): Branch check
  The branch operation compares the actual branch direction with the prediction.

Drive (stage 20): Wire delay
  Drive the result of the branch check to the front end of the machine.

CPU Architecture 101
Delivered Performance = Frequency * Instructions Per Cycle
Focus: Instructions Per Cycle

Improving Instructions Per Cycle
• Improve efficiency
  – Do more things in a clock
  – Branch prediction
• Reduce time it takes to do something
  – Reducing latency
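The last three pipeline stages in the walkthrough above (Flgs, Br Ck, Drive) amount to: derive flags from a result, decide the actual branch direction, and compare it with the prediction. The sketch below is a simplified illustration; the function names are hypothetical, only the zero and negative flags are modeled, and a 32-bit result width is assumed.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit result width


def compute_flags(result: int) -> dict:
    """Zero and negative (sign) flags for a 32-bit result."""
    value = result & MASK32
    return {"zero": value == 0, "negative": bool(value >> 31)}


def taken_if_zero(flags: dict) -> bool:
    """Actual direction of a 'branch if zero' style conditional branch."""
    return flags["zero"]


def branch_check(predicted_taken: bool, actual_taken: bool) -> bool:
    """True when the prediction was wrong, i.e. the front end must be redirected."""
    return predicted_taken != actual_taken


if __name__ == "__main__":
    flags = compute_flags(5 - 5)          # a compare is a subtract that sets flags
    actual = taken_if_zero(flags)
    print(flags, "mispredicted:", branch_check(False, actual))
```

Because the check happens near the end of a 20-stage pipeline, a wrong prediction is only discovered after many later uOPs have entered the machine, which is why prediction accuracy matters for a design this deep.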
Branch Prediction
• Accurate branch prediction is key to enabling longer pipelines
• Dramatic improvement over P6 branch predictor:
  – 8x the size (4K)
  – Eliminated 1/3 of the mispredictions
• Proven to be better than all other publicly disclosed predictors
  – (g-share, hybrid, etc.)

The Execution Trace Cache
[Figure: Pentium® 4 processor block diagram.]

Execution Trace Cache
• Advanced L1 instruction cache
  – Caches "decoded" IA-32 instructions (uops)
• Removes decoder pipeline latency
• Capacity is ~12K uOps
• Integrates branches into single line
  – Follows predicted path of program execution
Execution Trace Cache feeds fast engine

Execution Trace Cache: Trace Cache Delivery

Program as laid out in memory:
       1  cmp
       2  br -> T1
          ... (unused code)
  T1:  3  sub
       4  br -> T2
          ... (unused code)
  T2:  5  mov
       6  sub
       7  br -> T3
          ... (unused code)
  T3:  8  add
       9  sub
      10  mul
      11  cmp
      12  br -> T4

Trace Cache delivery:
  1 cmp | 2 br T1 | 3 T1: sub | 4 br T2 | 5 mov | 6 sub | 7 br T3 | 8 T3: add | 9 sub | 10 mul | 11 cmp | 12 br T4

Advanced Dynamic Execution
• Extends basic features found in P6 core
• Very deep speculative execution
  – 126 instructions in flight (3x P6)
  – 48 loads (3x P6) and 24 stores (2x P6)
• Provides larger window of visibility
  – Better use of execution resources
Deep Speculation Improves Parallelism
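The Trace Cache Delivery example above (a program with unused code between taken branches, delivered as one contiguous sequence) can be reproduced with a few lines of code. This is a toy model under loud assumptions: every instruction counts as one uOP, every branch is predicted taken, and the names build_trace, PROGRAM and max_uops are invented for the illustration. It is not how the hardware builds traces.

```python
from typing import List, Optional, Tuple

# One instruction: (label or None, mnemonic, branch target or None)
Instr = Tuple[Optional[str], str, Optional[str]]

# The program from the slide, with "(unused code)" as a placeholder entry.
PROGRAM: List[Instr] = [
    (None, "cmp", None),
    (None, "br", "T1"),
    (None, "(unused code)", None),
    ("T1", "sub", None),
    (None, "br", "T2"),
    (None, "(unused code)", None),
    ("T2", "mov", None),
    (None, "sub", None),
    (None, "br", "T3"),
    (None, "(unused code)", None),
    ("T3", "add", None),
    (None, "sub", None),
    (None, "mul", None),
    (None, "cmp", None),
    (None, "br", "T4"),
]


def build_trace(program: List[Instr], max_uops: int = 12) -> List[str]:
    """Follow the predicted (here: always taken) path and record it in order."""
    labels = {label: i for i, (label, _, _) in enumerate(program) if label}
    trace: List[str] = []
    pc = 0
    while pc < len(program) and len(trace) < max_uops:
        _, op, target = program[pc]
        trace.append(op if target is None else f"{op} {target}")
        if target is None:
            pc += 1                # fall through to the next instruction
        elif target in labels:
            pc = labels[target]    # predicted taken: skip the unused code
        else:
            break                  # target lies outside this snippet (T4)
    return trace


if __name__ == "__main__":
    for number, uop in enumerate(build_trace(PROGRAM), start=1):
        print(number, uop)
```

The resulting sequence contains none of the unused code, which is the point of the slide: the trace stores the predicted path of execution, not the memory layout of the program.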
Improving Instructions Per Cycle
• Improve efficiency
  – Do more things in a clock
  – Branch prediction
• Reduce time it takes to do something
  – Reducing latency

Rapid Execution Engine
• Dramatically lower ALU latency
• P6:
  – 1 clock @ 1GHz: 1ns
• Intel® NetBurst™ micro-architecture:
  – ½ clock @ >1.4GHz: <0.35ns

L1 Data Cache
[Figure: Pentium® 4 processor block diagram.]

High Performance L1 Data Cache
• 8KB, 4-way set associative, 64-byte lines
• Very high bandwidth
  – 1 Ld and 1 St per clock
• New access algorithms
• Very low latency
  – 2 clock read
New algorithm enables faster cache

Data Speculation
• Observation: Almost all memory accesses hit in the cache
• Optimize for the common case
  – Assume that the access will hit the cache
  – Use a low cost mechanism to fix the rare cases that miss
• Benefit:
  – Reduces latency
  – Significantly higher performance

Replay
• Repairs incorrect speculation
  – Re-execute until correct
• Replay is uOP specific
  – Replay the uOP that mis-speculated
  – Replay dependent uOPs
  – Independent uOPs are not replayed
Efficient mechanism to reduce latency

L1 Cache is >2x Faster
• P6:
  – 3 clocks @ 1GHz: 3ns
• Intel® NetBurst™ micro-architecture:
  – 2 clocks @ ≥ 1.4GHz: <1.4ns
Lower Latency is Higher Performance

Example with higher IPC and Faster Clock!
Code sequence: Ld, Add, Add, Ld, Add, Add

              P6 @ 1GHz    Intel® NetBurst™ Micro-architecture @ 1.4GHz
  Clocks      10           6
  Time        10ns         4.3ns
  IPC         0.6          1.0

L2 Advanced Transfer Cache
[Figure: Pentium® 4 processor block diagram.]

L2 ATC Organization
• 256KB, 8-way set associative
  – 128-byte lines
  – Two 64-byte pieces per line
• Holds both data and instructions
• High bandwidth: 45 GB/sec @ 1.4GHz
  – 2.8x P6 @ 1GHz

Aggregate Cache Latency
• Function of all caches in a processor
• Overall Effective Latency =
    L1 latency + L1 Miss Rate * L2 latency + L2 Miss Rate * DRAM Latency
Average cache speed is >1.8x better than the Pentium® III Processor
(Average on desktop applications, Intel® Pentium® III processor @ 1GHz, Intel® Pentium® 4 processor @ 1.4GHz)

Comparison

                          Pentium® III Processor   Pentium® 4 Processor   Relative Improvement
  Frequency               1 GHz                    ≥ 1.4 GHz              ≥ 1.4
  Adder Speed             1 ns                     < .36 ns               ≥ 2.8
  Adder Bandwidth         2 billion/sec            ≥ 5.6 billion/sec      > 2.8
  L1 Cache Speed          3 ns                     < 1.42 ns              ≥ 2.1
  L1 Cache Size           16 KB                    8 KB                   0.5
  L1 Cache Bandwidth      16 GB/sec                ≥ 44.8 GB/sec          ≥ 2.8
  L2 Cache Bandwidth      16 GB/sec                ≥ 44.8 GB/sec          ≥ 2.8
  Instructions in flight  40                       126                    3.15
  Loads in flight         16                       48                     3
  Stores in flight        12                       24                     2
  Branch targets          512                      4096                   8
  Uop Fetch Bandwidth     3 billion/sec            ≥ 4.2 billion/sec      ≥ 1.4
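The arithmetic behind the code-sequence example and the Aggregate Cache Latency formula above can be checked with a few helper functions. This is a sketch for illustration: the function names are made up, the effective-latency formula is transcribed term by term from the slide, and the miss rates and the L2 and DRAM latencies passed to it are hypothetical placeholders, since the deck does not give them. As written on the slide, the L2 miss rate multiplies DRAM latency directly, so it is read here as misses per memory access rather than per L2 access.

```python
def time_ns(clocks: float, frequency_ghz: float) -> float:
    """Elapsed time in nanoseconds for a number of clocks at a given frequency."""
    return clocks / frequency_ghz


def ipc(instructions: int, clocks: int) -> float:
    """Instructions per cycle."""
    return instructions / clocks


def effective_latency(l1_latency: float, l1_miss_rate: float,
                      l2_latency: float, l2_miss_rate: float,
                      dram_latency: float) -> float:
    """Overall effective latency, term by term as on the slide."""
    return l1_latency + l1_miss_rate * l2_latency + l2_miss_rate * dram_latency


if __name__ == "__main__":
    instructions = 6  # Ld, Add, Add, Ld, Add, Add
    print("P6:       %.1f ns, IPC %.1f" % (time_ns(10, 1.0), ipc(instructions, 10)))
    print("NetBurst: %.1f ns, IPC %.1f" % (time_ns(6, 1.4), ipc(instructions, 6)))
    # HYPOTHETICAL cache parameters (latencies in clocks), for illustration only.
    print("effective latency:", effective_latency(2, 0.05, 10, 0.01, 100), "clocks")
```

The first two lines reproduce the 10ns and 4.3ns figures and the IPC values of 0.6 and 1.0 from the example slide.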
Intel® Pentium® 4 Processor Summary
• Revolutionary, new microarchitecture from Intel designed for the evolving Internet
• Design features for balanced, high performance platform scalability and headroom
• The world’s highest performance desktop processor