Inside the Pentium® 4 Processor Micro-architecture
Fall 2000
Next Generation IA-32 Micro-architecture
Doug Carmean
Principal Architect
Intel Architecture Group
August 24, 2000
Intel Labs
Copyright © 2000 Intel Corporation.
Agenda
• IA-32 Processor Roadmap
• Design Goals
• Frequency
• Instructions Per Cycle
• Summary
Intel® Pentium® 4 Processor
[Figure: performance (vertical axis) over time (horizontal axis) for successive micro-architectures: 486, P5, P6, and, marked "Today", Intel® NetBurst™.]
Intel® Pentium® 4 Processor Design Goals
• Deliver world class performance across both existing and emerging applications
• Deliver performance headroom and scalability for the future
Micro-architecture that will Drive Performance Leadership for the Next Several Years
Intel® NetBurst™ Micro-architecture
• 400 MHz System Bus
• Advanced Transfer Cache
• Advanced Dynamic Execution
• Hyper Pipelined Technology
• Rapid Execution Engine
• Streaming SIMD Extensions 2
• Execution Trace Cache
• Enhanced Floating Point / Multi-Media
Pentium® 4 Processor Block Diagram
[Figure: block diagram with the following units. BTB & I-TLB, Decoder, Trace Cache with its own BTB, uCode ROM, Rename/Alloc, uop Queues, Schedulers, Integer RF, FP RF, Store AGU, Load AGU, four ALUs, FP move, FP store, FMul, FAdd, MMX, SSE, L1 D-Cache and D-TLB, L2 Cache and Control, 3.2 GB/s System Interface. Two paths in the diagram are labeled 3.]
CPU Architecture 101
Delivered Performance = Frequency * Instructions Per Cycle
Frequency
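The relation on this slide can be evaluated directly. The sketch below is only an illustration: the function name is made up here, and the inputs (1GHz at IPC 0.6, 1.4GHz at IPC 1.0) are borrowed from the Ld/Add code-sequence example later in this deck rather than from any general measurement.

```python
def delivered_performance(frequency_ghz: float, ipc: float) -> float:
    """Return delivered performance in billions of instructions per second.

    Implements the slide's relation: Performance = Frequency * IPC.
    """
    return frequency_ghz * ipc


# Inputs taken from the deck's Ld/Add code-sequence example (one short
# dependent sequence, not a benchmark).
p6 = delivered_performance(frequency_ghz=1.0, ipc=0.6)
netburst = delivered_performance(frequency_ghz=1.4, ipc=1.0)

print(f"P6:       {p6:.2f} billion instructions/sec")
print(f"NetBurst: {netburst:.2f} billion instructions/sec")
print(f"Ratio:    {netburst / p6:.2f}x")
```

Because the two factors multiply, a gain in either frequency or IPC helps only if the other factor does not fall by more.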
Frequency
• What limits frequency?
  – Process technology
  – Microarchitecture
• On a given process technology
  – Fewer gates per pipeline stage will deliver higher frequency
Frequency is driven by Micro-architecture
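One common textbook way to see why fewer gates per stage raises frequency is to model the cycle time as the logic delay per stage plus a fixed per-stage overhead (latches, clock skew). The sketch below uses that simplified model; it is not Intel's design methodology, and the delay values are hypothetical placeholders chosen only to show the trend.

```python
def max_frequency_ghz(total_logic_delay_ns: float, stages: int,
                      overhead_ns: float) -> float:
    """Estimate clock frequency for a pipeline split into `stages` stages.

    Simplified model (an assumption, not from the slides):
        cycle time = total logic delay / stages + per-stage overhead
    """
    if stages < 1:
        raise ValueError("a pipeline needs at least one stage")
    cycle_ns = total_logic_delay_ns / stages + overhead_ns
    return 1.0 / cycle_ns


# Hypothetical numbers: 10 ns of total logic, 0.1 ns overhead per stage.
for stages in (5, 10, 20):
    freq = max_frequency_ghz(10.0, stages, 0.1)
    print(f"{stages:2d} stages -> {freq:.2f} GHz")
```

In this model the fixed overhead makes the return from each additional stage smaller, so doubling the stage count yields less than double the frequency.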
NetBurst™ Micro-architecture Pipeline vs P6

Basic P6 Pipeline (intro at 733MHz, .18µ)
  Stage  1      Fetch
  Stage  2      Fetch
  Stage  3      Decode
  Stage  4      Decode
  Stage  5      Decode
  Stage  6      Rename
  Stage  7      ROB Rd
  Stage  8      Rdy/Sch
  Stage  9      Dispatch
  Stage 10      Exec

Basic Pentium® 4 Processor Pipeline (intro at ≥ 1.4GHz, .18µ)
  Stages  1-2   TC Nxt IP
  Stages  3-4   TC Fetch
  Stage   5     Drive
  Stage   6     Alloc
  Stages  7-8   Rename
  Stage   9     Que
  Stages 10-12  Sch
  Stages 13-14  Disp
  Stages 15-16  RF
  Stage  17     Ex
  Stage  18     Flgs
  Stage  19     Br Ck
  Stage  20     Drive

Hyper pipelined Technology enables industry leading performance and clock rate
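The two pipelines on this slide can be written down as plain data, which makes it easy to check the stage counts and look up which stage numbers a given step occupies. This is just a transcription of the slide; the variable and function names are chosen here for illustration.

```python
# P6: one entry per pipeline stage, in order.
P6_PIPELINE = ["Fetch", "Fetch", "Decode", "Decode", "Decode",
               "Rename", "ROB Rd", "Rdy/Sch", "Dispatch", "Exec"]

# Pentium 4: (step name, number of stages it occupies), in order.
PENTIUM4_PIPELINE = [
    ("TC Nxt IP", 2), ("TC Fetch", 2), ("Drive", 1), ("Alloc", 1),
    ("Rename", 2), ("Que", 1), ("Sch", 3), ("Disp", 2), ("RF", 2),
    ("Ex", 1), ("Flgs", 1), ("Br Ck", 1), ("Drive", 1),
]


def stage_numbers(pipeline):
    """Return (name, first_stage, last_stage) for each step, 1-indexed."""
    result, next_stage = [], 1
    for name, count in pipeline:
        result.append((name, next_stage, next_stage + count - 1))
        next_stage += count
    return result


print("P6 depth:       ", len(P6_PIPELINE))
print("Pentium 4 depth:", sum(count for _, count in PENTIUM4_PIPELINE))
for name, first, last in stage_numbers(PENTIUM4_PIPELINE):
    span = str(first) if first == last else f"{first}-{last}"
    print(f"  {span:>5}  {name}")
```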
Hyper Pipelined Technology
[Figure: pipeline depth and clock frequency over time. P5 micro-architecture: 5 stages; P6 micro-architecture: 10 stages; NetBurst™ micro-architecture: 20 stages. Frequency labels on the chart, between "Introduction" and "Today": 60MHz, 166MHz, 233MHz, 1.13GHz, ≥ 1.4GHz.]
Hyper Pipelined Technology
TC Nxt IP (stages 1-2): Trace cache next instruction pointer
Pointer from the BTB, indicating location of next instruction.
Hyper Pipelined Technology
TC Fetch (stages 3-4): Trace cache fetch
Read the decoded instructions (uOPs) out of the Execution Trace Cache.
Hyper Pipelined Technology
Drive (stage 5): Wire delay
Drive the uOPs to the allocator.
Hyper Pipelined Technology
Alloc (stage 6): Allocate
Allocate resources required for execution. The resources include Load buffers, Store buffers, etc.
Hyper Pipelined Technology
Rename (stages 7-8): Register renaming
Rename the logical registers (e.g. EAX) to the physical register space (128 are implemented).
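A minimal sketch of what renaming does, assuming a simple map-table-plus-free-list scheme. The slide gives only the register count (128); the class below, its method names, and the allocation order are hypothetical, and reclaiming registers at retirement is left out.

```python
from collections import deque


class RenameTable:
    """Toy register renamer: logical register name -> physical register id."""

    def __init__(self, logical_regs, num_physical=128):
        # 128 physical registers per the slide; everything else is assumed.
        self.free = deque(range(num_physical))
        self.mapping = {reg: self.free.popleft() for reg in logical_regs}

    def rename(self, dest, sources):
        """Rename one uOP; return (physical dest, list of physical sources)."""
        # Sources read the current mapping, before the destination is remapped.
        phys_sources = [self.mapping[reg] for reg in sources]
        if not self.free:
            raise RuntimeError("no free physical register: rename must stall")
        phys_dest = self.free.popleft()
        self.mapping[dest] = phys_dest
        return phys_dest, phys_sources


table = RenameTable(["EAX", "EBX"])
# add EAX, EBX  (writes EAX, reads EAX and EBX)
print(table.rename("EAX", ["EAX", "EBX"]))
# add EAX, EAX  (reads the EAX value produced by the previous uOP)
print(table.rename("EAX", ["EAX"]))
```

Giving every write a fresh physical register removes false dependencies on the small set of logical register names, which is what lets many uOPs be in flight at once.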
Hyper Pipelined Technology
Que (stage 9): Write into the uOP Queue
uOPs are placed into the queues, where they are held until there is room in the schedulers.
Hyper Pipelined Technology
Sch (stages 10-12): Schedule
Write into the schedulers and compute dependencies. Watch for dependency to resolve.
Hyper Pipelined Technology
Disp (stages 13-14): Dispatch
Send the uOPs to the appropriate execution unit.
Hyper Pipelined Technology
RF (stages 15-16): Register File
Read the register file. These are the source(s) for the pending operation (ALU or other).
Hyper Pipelined Technology
Ex (stage 17): Execute
Execute the uOPs on the appropriate execution port.
Hyper Pipelined Technology
Flgs (stage 18): Flags
Compute flags (zero, negative, etc.). These are typically the input to a branch instruction.
Hyper Pipelined Technology
Br Ck (stage 19): Branch Check
The branch operation compares the result of the actual branch direction with the prediction.
Hyper Pipelined Technology
Drive (stage 20): Wire delay
Drive the result of the branch check to the front end of the machine.
CPU Architecture 101
Delivered Performance = Frequency * Instructions Per Cycle
Instructions Per Cycle
Improving Instructions Per Cycle
• Improve efficiency
  – Do more things in a clock
  – Branch prediction
• Reduce time it takes to do something
  – Reducing latency
Branch Prediction
• Accurate branch prediction is key to enabling longer pipelines
• Dramatic improvement over P6 branch predictor:
  – 8x the size (4K)
  – Eliminated 1/3 of the mispredictions
• Proven to be better than all other publicly disclosed predictors
  – (g-share, hybrid, etc.)
The Execution Trace Cache
[Figure: Pentium® 4 processor block diagram.]
Execution Trace Cache
• Advanced L1 instruction cache
  – Caches "decoded" IA-32 instructions (uops)
• Removes decoder pipeline latency
• Capacity is ~12K uOps
• Integrates branches into single line
  – Follows predicted path of program execution
Execution Trace Cache feeds fast engine
Execution Trace Cache

Instruction stream as laid out in memory:
       1  cmp
       2  br -> T1
          ... (unused code)
  T1:  3  sub
       4  br -> T2
          ... (unused code)
  T2:  5  mov
       6  sub
       7  br -> T3
          ... (unused code)
  T3:  8  add
       9  sub
      10  mul
      11  cmp
      12  br -> T4

Trace Cache Delivery:
   1  cmp
   2  br T1
   3  T1: sub
   4  br T2
   5  T2: mov
   6  sub
   7  br T3
   8  T3: add
   9  sub
  10  mul
  11  cmp
  12  br T4
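The example above can be reproduced with a few lines of code. The sketch below is a toy model only: it assumes every branch is predicted taken, represents the program as a hypothetical list of (label, mnemonic, branch target) tuples, and ignores trace line size and real uOP decoding.

```python
# (label or None, mnemonic, branch target label or None), in memory order.
PROGRAM = [
    (None, "cmp", None),
    (None, "br", "T1"),
    (None, "unused", None),
    ("T1", "sub", None),
    (None, "br", "T2"),
    (None, "unused", None),
    ("T2", "mov", None),
    (None, "sub", None),
    (None, "br", "T3"),
    (None, "unused", None),
    ("T3", "add", None),
    (None, "sub", None),
    (None, "mul", None),
    (None, "cmp", None),
    (None, "br", "T4"),
]


def build_trace(program, start=0):
    """Follow the predicted (here: always taken) path and collect the trace."""
    labels = {label: i for i, (label, _, _) in enumerate(program) if label}
    trace, pc = [], start
    while pc < len(program):
        _, op, target = program[pc]
        trace.append(op if target is None else f"{op} {target}")
        if target is None:
            pc += 1                  # fall through to the next instruction
        elif target in labels:
            pc = labels[target]      # jump; skipped code never enters the trace
        else:
            break                    # target lies outside this code region
    return trace


for position, entry in enumerate(build_trace(PROGRAM), start=1):
    print(f"{position:2d}  {entry}")
```

The resulting trace holds the 12 executed operations back to back, with none of the unused code between the branch targets.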
Advanced Dynamic Execution
• Extends basic features found in P6 core
• Very deep speculative execution
  – 126 instructions in flight (3x P6)
  – 48 loads (3x P6) and 24 stores (2x P6)
• Provides larger window of visibility
  – Better use of execution resources
Deep Speculation Improves Parallelism
Improving Instructions Per Cycle
• Improve efficiency
  – Do more things in a clock
  – Branch prediction
• Reduce time it takes to do something
  – Reducing latency
Rapid Execution Engine
• Dramatically lower ALU latency
• P6:
  – 1 clock @ 1GHz: 1ns
• Intel® NetBurst™ micro-architecture:
  – ½ clock @ ≥ 1.4GHz: < 0.36ns
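The latency figures on this slide follow from dividing the latency in clocks by the clock frequency. A small sketch of that conversion, with a helper name chosen here for illustration:

```python
def latency_ns(clocks: float, frequency_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds.

    One cycle at f GHz lasts 1/f ns, so latency = clocks / f.
    """
    return clocks / frequency_ghz


p6_alu = latency_ns(clocks=1, frequency_ghz=1.0)          # 1 clock @ 1GHz
netburst_alu = latency_ns(clocks=0.5, frequency_ghz=1.4)  # 1/2 clock @ 1.4GHz

print(f"P6 ALU latency:       {p6_alu:.3f} ns")
print(f"NetBurst ALU latency: {netburst_alu:.3f} ns")
print(f"Ratio:                {p6_alu / netburst_alu:.1f}x")
```

At exactly 1.4GHz the half-clock ALU latency comes to about 0.357ns, and any higher clock rate lowers it further.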
L1 Data Cache
[Figure: Pentium® 4 processor block diagram.]
High Performance L1 Data Cache
• 8KB, 4-way set associative, 64-byte lines
• Very high bandwidth
  – 1 Ld and 1 St per clock
• New access algorithms
• Very low latency
  – 2 clock read
New algorithm enables faster cache
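The size, associativity, and line length on this slide determine the number of sets and how an address would be split in a conventional set-associative cache. The sketch below assumes that conventional tag/index/offset split and power-of-two sizes; the slide does not describe the processor's actual indexing scheme, and the function names are made up for this example.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes size, ways, and line length are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2(line_bytes)
    index_bits = sets.bit_length() - 1          # log2(sets)
    return sets, offset_bits, index_bits


def split_address(address: int, offset_bits: int, index_bits: int):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = address & ((1 << offset_bits) - 1)
    index = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset


# L1 data cache from the slide: 8KB, 4-way, 64-byte lines.
sets, offset_bits, index_bits = cache_geometry(8 * 1024, 4, 64)
print(f"sets={sets}, offset bits={offset_bits}, index bits={index_bits}")
print(split_address(0x12345, offset_bits, index_bits))
```

With these parameters the cache has 32 sets of 4 lines each, so only 11 low address bits are needed to select a set and a byte within the line.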
Data Speculation
• Observation: Almost all memory accesses hit in the cache
• Optimize for the common case
  – Assume that the access will hit the cache
  – Use a low cost mechanism to fix the rare cases that miss
• Benefit:
  – Reduces latency
  – Significantly higher performance
Replay
• Repairs incorrect speculation
  – Re-execute until correct
• Replay is uOP specific
  – Replay the uOP that mis-speculated
  – Replay dependent uOPs
  – Independent uOPs are not replayed
Efficient mechanism to reduce latency
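The selection rule in the bullets above (replay the mis-speculated uOP and everything that depends on it, leave independent uOPs alone) amounts to a reachability walk over the dependency graph. The sketch below shows only that selection step; the data layout and names are hypothetical, and the hardware mechanism itself is not described on the slide.

```python
from collections import defaultdict, deque


def uops_to_replay(sources, mis_speculated):
    """Return the set of uOPs that must be replayed.

    `sources` maps each uOP name to the list of uOPs whose results it reads.
    """
    # Invert the graph: producer -> uOPs that consume its result.
    consumers = defaultdict(list)
    for uop, srcs in sources.items():
        for src in srcs:
            consumers[src].append(uop)

    replay = {mis_speculated}
    pending = deque([mis_speculated])
    while pending:
        current = pending.popleft()
        for consumer in consumers[current]:
            if consumer not in replay:
                replay.add(consumer)
                pending.append(consumer)
    return replay


# ld1 was assumed to hit the cache but missed; ld2 and add3 are unrelated.
SOURCES = {
    "ld1": [],
    "add1": ["ld1"],
    "add2": ["add1"],
    "ld2": [],
    "add3": ["ld2"],
}
print(sorted(uops_to_replay(SOURCES, "ld1")))
```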
L1 Cache is >2x Faster
• P6:
  – 3 clocks @ 1GHz: 3ns
• Intel® NetBurst™ micro-architecture:
  – 2 clocks @ ≥ 1.4GHz: < 1.43ns
Lower Latency is Higher Performance
Example with higher IPC and Faster Clock!

Code Sequence: Ld, Add, Add, Ld, Add, Add

            P6 @1GHz     Intel® NetBurst™ Micro-architecture @1.4GHz
  Clocks    10 clocks    6 clocks
  Time      10ns         4.3ns
  IPC       0.6          1.0
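The numbers on this slide can be reproduced by adding up per-operation latencies, assuming each operation in the sequence waits for the one before it (a fully dependent chain, which is an assumption about the example rather than something the slide states). The latencies are those given earlier in the deck: loads take 3 clocks on P6 and 2 on NetBurst, adds take 1 clock and ½ clock.

```python
SEQUENCE = ["Ld", "Add", "Add", "Ld", "Add", "Add"]

# Latencies in clocks, from the L1 cache and Rapid Execution Engine slides.
P6_LATENCY = {"Ld": 3, "Add": 1}
NETBURST_LATENCY = {"Ld": 2, "Add": 0.5}


def chain_timing(sequence, latency_clocks, frequency_ghz):
    """Return (clocks, nanoseconds, IPC) for a fully dependent chain."""
    clocks = sum(latency_clocks[op] for op in sequence)
    nanoseconds = clocks / frequency_ghz
    ipc = len(sequence) / clocks
    return clocks, nanoseconds, ipc


for name, latency, freq in (("P6 @1GHz", P6_LATENCY, 1.0),
                            ("NetBurst @1.4GHz", NETBURST_LATENCY, 1.4)):
    clocks, ns, ipc = chain_timing(SEQUENCE, latency, freq)
    print(f"{name:18s} {clocks:g} clocks, {ns:.1f} ns, IPC = {ipc:.1f}")
```

Under that assumption the sequence finishes in fewer clocks and each clock is shorter, so both factors of the performance equation improve together.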
L2 Advanced Transfer Cache
[Figure: Pentium® 4 processor block diagram.]
L2 ATC Organization
• 256KB, 8-way set associative
  – 128-byte lines
  – Two 64-byte pieces per line
• Holds both data and instructions
• High bandwidth: 45 GB/sec @ 1.4GHz
  – 2.8x P6 @ 1GHz
Aggregate Cache Latency
• Function of all caches in a processor
• Overall Effective Latency =
    L1 latency +
    L1 Miss Rate * L2 latency +
    L2 Miss Rate * DRAM Latency
Average cache speed is >1.8x better than the Pentium® III Processor
Average on desktop applications, Intel® Pentium® III processor @ 1GHz, Intel® Pentium® 4 processor @ 1.4GHz
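The effective-latency formula above is easy to evaluate once miss rates and latencies are known. The sketch below implements the formula as written on the slide; the input values are hypothetical placeholders picked to make the arithmetic easy to follow, not measured Pentium® III or Pentium® 4 figures.

```python
def effective_latency(l1_latency, l1_miss_rate, l2_latency,
                      l2_miss_rate, dram_latency):
    """Overall effective latency, following the slide's formula.

    L1 latency + L1 miss rate * L2 latency + L2 miss rate * DRAM latency.
    Miss rates are fractions of all accesses; latencies share one unit.
    """
    return (l1_latency
            + l1_miss_rate * l2_latency
            + l2_miss_rate * dram_latency)


# Hypothetical inputs (all latencies in ns): 10% of accesses miss L1,
# 1% of accesses also miss L2.
latency = effective_latency(l1_latency=2.0, l1_miss_rate=0.10,
                            l2_latency=10.0, l2_miss_rate=0.01,
                            dram_latency=100.0)
print(f"Effective latency: {latency:.2f} ns")
```

Because the L1 term is paid on every access, lowering L1 latency moves the aggregate more than a similar reduction further down the hierarchy, as long as miss rates stay small.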
Comparison

                         Pentium® III    Pentium® 4          Relative
                         Processor       Processor           Improvement
  Frequency              1 GHz           ≥ 1.4 GHz           ≥ 1.4
  Adder Speed            1 ns            < .36 ns            ≥ 2.8
  Adder Bandwidth        2 billion/sec   ≥ 5.6 billion/sec   ≥ 2.8
  L1 Cache Speed         3 ns            < 1.43 ns           ≥ 2.1
  L1 Cache Size          16 KB           8 KB                0.5
  L1 Cache Bandwidth     16 GB/sec       ≥ 44.8 GB/sec       ≥ 2.8
  L2 Cache Bandwidth     16 GB/sec       ≥ 44.8 GB/sec       ≥ 2.8
  Instructions in flight 40              126                 3.15
  Loads in flight        16              48                  3
  Stores in flight       12              24                  2
  Branch targets         512             4096                8
  Uop Fetch Bandwidth    3 billion/sec   ≥ 4.2 billion/sec   ≥ 1.4
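The "Relative Improvement" column can be recomputed from the other two columns. In the sketch below the Pentium® 4 entries use the 1.4GHz lower bound, the two speed rows are derived from clock counts given on earlier slides (½ clock for an add, 2 clocks for an L1 read), and branch targets uses 4096 (the 4K stated on the Branch Prediction slide). The row layout and helper name are illustrative choices.

```python
# (row name, Pentium III value, Pentium 4 value, higher_is_better)
ROWS = [
    ("Frequency (GHz)", 1.0, 1.4, True),
    ("Adder speed (ns)", 1.0, 0.5 / 1.4, False),
    ("Adder bandwidth (billion/sec)", 2.0, 5.6, True),
    ("L1 cache speed (ns)", 3.0, 2 / 1.4, False),
    ("L1 cache size (KB)", 16, 8, True),
    ("L1 cache bandwidth (GB/sec)", 16.0, 44.8, True),
    ("L2 cache bandwidth (GB/sec)", 16.0, 44.8, True),
    ("Instructions in flight", 40, 126, True),
    ("Loads in flight", 16, 48, True),
    ("Stores in flight", 12, 24, True),
    ("Branch targets", 512, 4096, True),
    ("Uop fetch bandwidth (billion/sec)", 3.0, 4.2, True),
]


def relative_improvement(p3, p4, higher_is_better):
    """Ratio > 1 means the Pentium 4 value is better."""
    return p4 / p3 if higher_is_better else p3 / p4


for name, p3, p4, higher in ROWS:
    print(f"{name:35s} {relative_improvement(p3, p4, higher):.2f}")
```

For the latency rows a smaller value is better, so the ratio is inverted there; every other row divides the Pentium® 4 value by the Pentium® III value.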
Intel® Pentium® 4 Processor Summary
• Revolutionary, new microarchitecture from Intel designed for the evolving Internet
• Design features for balanced, high performance platform scalability and headroom
• The world's highest performance desktop processor