Pentium Pro Case Study - ECE 752, Advanced Computer

Pentium Pro Case Study
Prof. Mikko H. Lipasti
University of Wisconsin-Madison
Lecture notes based on notes by John P. Shen
Updated by Mikko Lipasti
Pentium Pro Case Study
• Microarchitecture
– Order-3 Superscalar
– Out-of-Order execution
– Speculative execution
– In-order completion
• Design Methodology
• Performance Analysis
• Retrospective
Goals of P6 Microarchitecture
IA-32 Compliant
Performance (Frequency - IPC)
Validation
Die Size
Schedule
Power
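P6 – The Big Picture
[Figure: P6 pipeline block diagram. Front-end blocks: BTB/ICU, BAC/Rename, Allocation. Stage latencies: Fetch 4 cycles, Decode 2 cycles, Dispatch 2 cycles. Out-of-order core: Reservation Station (20 entries, 2 cycles) with ports 0-4 feeding IEU0, IEU1, JEU, Fadd, Fmul, Imul, Div, AGU0, AGU1, plus MOB and DCU. Retirement: ROB (40 x 157) and RRF, 2 cycles.]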
Memory Hierarchy
[Figure: memory hierarchy. CPU with ICache (8 KB) and DCache (8 KB), BIU, L2 Cache (256 KB), Main Memory, PCI. Bus width labels: 64 bit and 16 bytes.]
• Level 1 instruction and data caches - 2 cycle access time
• Level 2 unified cache - 6 cycle access time
• Separate level 2 cache and memory address/data bus
Instruction Fetch
[Figure: instruction fetch datapath. Blocks shown: Next Addr. Logic (inputs: Branch Target, Other Fetch Requests), Fetch Address, Branch Target Buffer (512), Instruction TLB (Physical Addr.), ICache (8 KB), Victim Cache, Stream Buffer, L2 Cache (256 KB, 16 bytes), Instruction Data Mux, Inst. Length Decoder (Length Marks), Prediction Marks, Inst. Buf, Inst. Rotator. Output to Decode: 16 bytes + marks. Latency label: 2 cycle.]
Instruction Cache Unit
[Figure: Instruction Cache Unit. The Fetch Address is split into its lower 12 bits and upper 20 bits. Blocks shown: ICache (8 KB), Victim Cache, Stream Buffer, Bus Interface Unit, ITLB, Instruction Tag Array, Instruction Data Mux. Outputs: Instruction Data, Hit/Miss.]
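The address split in the figure can be illustrated with a short Python sketch. It assumes 32-bit addresses and 4 KB pages, so that the lower 12 bits are the untranslated page offset and the upper 20 bits are what the ITLB would translate; that reading of the figure, and the name `split_fetch_address`, are assumptions rather than something the slide states.

```python
PAGE_OFFSET_BITS = 12  # assumed 4 KB pages: lower 12 bits need no translation

def split_fetch_address(addr):
    """Return (upper 20 bits, lower 12 bits) of a 32-bit fetch address."""
    addr &= 0xFFFFFFFF                             # keep 32 bits
    upper = addr >> PAGE_OFFSET_BITS               # assumed to go to the ITLB
    lower = addr & ((1 << PAGE_OFFSET_BITS) - 1)   # assumed to index the cache
    return upper, lower

upper, lower = split_fetch_address(0x00401A3C)
print(hex(upper), hex(lower))
```

Under this assumption the cache lookup can start from the lower bits while the ITLB works on the upper bits in parallel.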
Branch Target Buffer
[Figure: Branch Target Buffer organization. 128 sets x 4 ways (Way 0 to Way 3, 512 entries in total). Each entry holds Fetch Addr. Tag, Br. Offset, 4-bit BHR, 4-bit speculative BHR, and Target Addr. Other blocks: PHT (16 entries/set), Tag Compare on the Fetch Address, Return Stack, Prediction Control Logic. Output: Prediction & Target Addr.]
• Pattern History Table (PHT) is not speculatively updated
• A speculative Branch History Register (BHR) and prediction state are maintained
• Uses the speculative prediction state if it exists for that branch
Branch Prediction Algorithm
[Figure: branch prediction example. The Br. History (0010) and the Speculative History (0101) each select an entry of the Pattern Table (0000 to 1111). A State Machine updates the table at Branch Execution; the selected entries supply Br. Pred. and Spec. Pred.]
• Current prediction updates the speculative history prior to the next instance of the branch instruction
• Branch History Register (BHR) is updated during branch execution
• Branch recovery flushes front-end and drains the execution core
• Branch mis-prediction resets the speculative branch history state to match BHR
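The update rules above can be sketched in Python for a single branch. This is a minimal model under stated assumptions: the state machine is taken to be a 2-bit saturating counter, counters start in a weakly not-taken state, and there is one entry per branch with no BTB lookup. The names `BranchEntry` and `run` are hypothetical.

```python
class BranchEntry:
    """One branch: 4-bit BHR, 4-bit speculative BHR, 16-entry pattern table."""

    def __init__(self):
        self.bhr = 0            # history updated at branch execution
        self.spec_bhr = 0       # history updated at prediction time
        self.pht = [1] * 16     # assumed 2-bit counters, weakly not-taken

    def predict(self):
        taken = self.pht[self.spec_bhr] >= 2
        # The prediction is shifted into the speculative history only.
        self.spec_bhr = ((self.spec_bhr << 1) | int(taken)) & 0xF
        return taken

    def execute(self, taken, predicted):
        # The pattern table is updated at execution, never speculatively.
        ctr = self.pht[self.bhr]
        self.pht[self.bhr] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        self.bhr = ((self.bhr << 1) | int(taken)) & 0xF
        if taken != predicted:
            self.spec_bhr = self.bhr   # misprediction: resync with the BHR

def run(outcomes):
    """Predict then execute each outcome; return a list of mispredict flags."""
    entry = BranchEntry()
    misses = []
    for taken in outcomes:
        predicted = entry.predict()
        entry.execute(taken, predicted)
        misses.append(predicted != taken)
    return misses

# A loop branch: taken three times, then not taken, repeated.
misses = run([True, True, True, False] * 20)
print(sum(misses), "mispredictions in", len(misses), "branches")
```

With four bits of history, each position in this repeating pattern gets its own pattern table entry, so the sketch stops mispredicting after a short warm-up.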
Instruction Decode - 1
[Figure: decoder datapath. Macro-instruction bytes from the IFU (16 bytes) enter the Instruction Buffer. Decoder 0 (with the uROM) produces up to 4 uops; Decoder 1 and Decoder 2 produce 1 uop each. Uops enter the uop Queue (6), and up to 3 uops are issued to dispatch. Branch Address Calc. feeds Next Address Calc.]
• Branch instruction detection
• Branch address calculation - Static prediction and branch always execution
• One branch decode per cycle (break on branch)
Instruction Decode - 2
[Figure: decoder datapath, repeated. Instruction Buffer (16 bytes from the IFU), Decoder 0 with uROM (4 uops), Decoder 1 and Decoder 2 (1 uop each), uop Queue (6), up to 3 uops issued to dispatch, Branch Address Calc. to Next Address Calc.]
• Instruction Buffer contains up to 16 instructions, which must be decoded and queued before the instruction buffer is re-filled
• Macro-instructions must shift from decoder 2 to decoder 1 to decoder 0
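The decoder arrangement in the figure (one decoder producing up to 4 uops, two producing 1 uop each) can be sketched as a grouping function. This is a simplified model, not the actual decode logic: it ignores the break-on-branch rule and assumes that an instruction needing more than 4 uops occupies a decode cycle alone. The name `decode_cycles` is hypothetical.

```python
def decode_cycles(uop_counts):
    """Group in-order instructions (given as uop counts) into decode cycles."""
    cycles = []
    i = 0
    while i < len(uop_counts):
        group = [uop_counts[i]]     # decoder 0 takes the oldest instruction
        i += 1
        if group[0] <= 4:
            # Decoders 1 and 2 accept only single-uop instructions; a
            # multi-uop instruction waits until it reaches decoder 0.
            while i < len(uop_counts) and len(group) < 3 and uop_counts[i] == 1:
                group.append(uop_counts[i])
                i += 1
        cycles.append(group)
    return cycles

print(decode_cycles([1, 1, 1, 2, 1, 4, 1, 1, 1]))
```

The 2-uop instruction in the example cannot use decoder 1 or 2, so it starts a new cycle at decoder 0.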
What is a uop?
Small two-operand instruction - very RISC-like.
IA-32 instruction:
add (eax),(ebx)      MEM(eax) <- MEM(eax) + MEM(ebx)
Uop decomposition:
ld  guop0, (eax)     guop0 <- MEM(eax)
ld  guop1, (ebx)     guop1 <- MEM(ebx)
add guop0, guop1     guop0 <- guop0 + guop1
sta eax
std guop0            MEM(eax) <- guop0
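The decomposition can be checked by stepping through the five uops on a toy machine state. The sketch below assumes a dict-based memory and register file and 32-bit wraparound on the add; `run_add_mem_mem` is a hypothetical name.

```python
def run_add_mem_mem(mem, regs):
    """Execute the uops for add (eax),(ebx) on dict-based memory and registers."""
    guop0 = mem[regs["eax"]]                # ld  guop0, (eax)
    guop1 = mem[regs["ebx"]]                # ld  guop1, (ebx)
    guop0 = (guop0 + guop1) & 0xFFFFFFFF    # add guop0, guop1
    store_addr = regs["eax"]                # sta eax  (store address)
    mem[store_addr] = guop0                 # std guop0 (store data)
    return mem

mem = {0x100: 5, 0x200: 7}
regs = {"eax": 0x100, "ebx": 0x200}
print(run_add_mem_mem(mem, regs))
```

The store is split into an address uop (sta) and a data uop (std), which is why one IA-32 instruction becomes five uops here.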
Instruction Dispatch
[Figure: dispatch datapath. The uop Queue (6) feeds the Dispatch Buffer (3). Blocks shown: Register Renaming, Allocator (2 cycles), Mux Logic, Retirement Info. Output: to the Reservation Station.]
Allocation requirements
• “3-or-none” Reorder buffer entries
• Reservation station entry
• Load buffer or store buffer entry
Dispatch buffer “probably” dispatches all 3 uops before re-fill
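The allocation requirements can be written as a single check. This is a sketch under assumptions: each uop is taken to need one reservation station entry, each load one load buffer entry, each store one store buffer entry, and the whole group is assumed to stall if any resource is short. The function name and the uop kind strings are hypothetical.

```python
def can_allocate(uops, free_rob, free_rs, free_lb, free_sb):
    """Return True if a group of up to 3 uops can be allocated this cycle."""
    loads = sum(1 for u in uops if u == "load")
    stores = sum(1 for u in uops if u == "store")
    return (free_rob >= 3               # "3-or-none" reorder buffer entries
            and free_rs >= len(uops)    # one reservation station entry per uop
            and free_lb >= loads        # a load buffer entry per load
            and free_sb >= stores)      # a store buffer entry per store

print(can_allocate(["alu", "load", "store"], 3, 3, 1, 1))
print(can_allocate(["alu"], 2, 20, 16, 12))
```

The second call returns False even though only one uop is waiting, which is the cost of the 3-or-none rule.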
Register Renaming - 1
[Figure: renaming structures. Real Register File (RRF) with EAX, EBX, ECX, FST0 to FST7, GuoP0, GuoP1, IuoP(0-3), CC/Events. Integer RAT (EAX, EBX, ECX, GuoP0, GuoP1) and Floating Point RAT (FST0 to FST7), whose entries hold Reorder Buffer (ROB) entry numbers such as 4, 8, 9, 12. ROB entries 0 to 39.]
• Similar to Tomasulo’s Algorithm - Uses ROB entry number as tags
• The register alias tables (RAT) maintain a pointer to the most recent data for the renamed register
• Execution results are stored in the ROB
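A minimal Python sketch of renaming with ROB entry numbers as tags follows. It assumes a RAT entry of `None` means the current value lives in the RRF, handles one destination per uop, and ignores flags and floating-point stack renaming. The class and method names are hypothetical.

```python
class Renamer:
    def __init__(self, regs, rob_size=40):
        self.rrf = {r: 0 for r in regs}       # committed register values
        self.rat = {r: None for r in regs}    # None: read the RRF
        self.rob = []                         # in-flight entries, oldest first
        self.next_tag = 0
        self.rob_size = rob_size

    def rename(self, dst, srcs):
        """Allocate a ROB entry for dst; return (tag, source tags)."""
        if len(self.rob) >= self.rob_size:
            raise RuntimeError("ROB full")
        src_tags = [self.rat[s] for s in srcs]    # read sources before remap
        tag = self.next_tag
        self.next_tag = (self.next_tag + 1) % self.rob_size
        self.rob.append({"tag": tag, "dst": dst, "value": None})
        self.rat[dst] = tag                       # RAT points at newest producer
        return tag, src_tags

    def complete(self, tag, value):
        """Write an execution result into its ROB entry."""
        for entry in self.rob:
            if entry["tag"] == tag:
                entry["value"] = value

    def retire(self):
        """Commit the oldest entry to the RRF if its result is ready."""
        if not self.rob or self.rob[0]["value"] is None:
            return False
        head = self.rob.pop(0)
        self.rrf[head["dst"]] = head["value"]
        if self.rat[head["dst"]] == head["tag"]:
            self.rat[head["dst"]] = None          # value now lives in the RRF
        return True

r = Renamer(["eax", "ebx", "ecx"])
print(r.rename("eax", ["eax", "ebx"]))   # add eax, ebx
print(r.rename("eax", ["eax", "ecx"]))   # add eax, ecx reads the first add's tag
```

The second add reads EAX through the tag of the first add, not the RRF, which is the Tomasulo-style dependence tracking described above.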
Register Renaming - Example
[Figure: snapshot of the Real Register File (RRF), Integer RAT, Floating Point RAT, and Reorder Buffer (ROB, entries 0 to 39) for the uops below. The ROB marks a completing sub entry (Comp) and the allocation pointer (Alloc).]
Dispatching:
add eax, ebx
add eax, ecx
fxch f0, f1
Completing:
sub eax, ecx
© Shen, Lipasti
Challenges to Register Renaming
[Figure: Real Register File (RRF), Integer RAT, Floating Point RAT, and Reorder Buffer (ROB, entries 0 to 39), annotated with the 8-bit code below.]
8-bit code:
mov AL, #data1
mov AH, #data2
add AL, #data3
add AL, #data4
Byte addressable registers
Out-of-Order Execution Engine
[Figure: out-of-order engine. Reservation Station (20) with RS bypass and ports 0 to 4 feeding IEU0, IEU1, JEU, Fadd, Fmul, Imul, Div, AGU0, AGU1; MOB and DCU (8 KB). Latency label: 2 cycles.]
• In-order branch issue and execution
• In-order load/store issue to address generation units
• Instruction execution and result bus scheduling
• Is the reservation station “truly” centralized & what is “binding”?
Reservation Station
[Figure: Reservation Station, fed from the Dispatch Queue, issuing over Cycle 1 and Cycle 2 through Ports 0 to 4 to the execution units.]
• Cycle 1
– Order checking
– Operand availability
• Cycle 2
– Writeback bus scheduling
Memory Ordering Buffer (MOB)
[Figure: Memory Ordering Buffer. Blocks shown: AGU0, AGU1, R/S, Load Buffer (16), Store Address Buffer (12), Store Data Buffer (12), Conflict Logic, MOB Control, Bypass Logic, Data Cache Unit (8 KB). Output: Load Data Result. Two 2-cycle latency labels.]
• Load buffer retains loads until completed, for coherency checking
• Store forwarding out of store buffers
• 2 cycle latency through MOB
• “Store Coloring” - Load instructions are tagged by the last store
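Store coloring and store forwarding can be sketched together. The model below is a simplification under assumptions: store ids increase in program order without wraparound, every older store address is already known, and accesses are whole words at identical addresses. The function name `load_value` and the tuple layout are hypothetical.

```python
def load_value(load_addr, store_color, store_buffer, dcache):
    """Return the value a load sees.

    store_color  : id of the last store that precedes the load in program order
    store_buffer : list of (store_id, addr, data), oldest first
    dcache       : dict mapping address to data
    """
    # Search older stores only (id <= color), youngest first.
    for store_id, addr, data in reversed(store_buffer):
        if store_id <= store_color and addr == load_addr:
            return data                 # forwarded from the store buffer
    return dcache.get(load_addr, 0)     # no older matching store: read the cache

stores = [(0, 0x10, 1), (1, 0x20, 2), (2, 0x10, 3)]
print(load_value(0x10, 1, stores, {}))  # store 2 is younger than the load
print(load_value(0x10, 2, stores, {}))  # store 2 is older than the load
```

The color tells the load which buffered stores are older than it, so a younger store to the same address is never forwarded.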
Instruction Completion
• Handles all exception/interrupt/trap conditions
• Handles branch recovery
– OOO core drains out right-path instructions, commits to RRF
– In parallel, front end starts fetching from target/fall-through
– However, no renaming is allowed until OOO core is drained
– After draining is done, RAT is reset to point to RRF
– Avoids checkpointing RAT, recovering to intermediate RAT state
• Commits execution results to the architectural state in-order
– Retirement Register File (RRF)
– Must handle hazards to RRF (writes/reads in same cycle)
– Must handle hazards to RAT (writes/reads in same cycle)
• “Atomic” IA-32 instruction completion
– uops are marked as 1st or last in sequence
– exception/interrupt/trap boundary
• 2 cycle retirement
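In-order retirement with first/last uop marks can be sketched as follows. The retirement width of 3 uops per cycle is an assumption, as is the rule that an interrupt may only be taken when the next uop to retire is marked first. The dict fields and the name `retire_cycle` are hypothetical.

```python
def retire_cycle(rob, width=3):
    """Retire up to `width` completed uops from the head of the ROB.

    rob is a list of dicts with boolean fields "first", "last", "done",
    oldest first. Returns (retired uops, at_instruction_boundary).
    """
    retired = []
    while rob and len(retired) < width and rob[0]["done"]:
        retired.append(rob.pop(0))      # strictly in program order
    # At a boundary when the next uop to retire starts a new IA-32 instruction.
    at_boundary = (not rob) or rob[0]["first"]
    return retired, at_boundary

def uop(first, last, done=True):
    return {"first": first, "last": last, "done": done}

rob = [uop(True, True),                                  # 1-uop instruction
       uop(True, False), uop(False, False), uop(False, True),  # 3-uop instruction
       uop(True, True, done=False)]                      # still executing
print(retire_cycle(rob)[1])   # stops mid-instruction
print(retire_cycle(rob)[1])   # instruction finished
```

In the first cycle retirement stops in the middle of the 3-uop instruction, so no exception, interrupt, or trap would be recognized until the following cycle.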
Pentium Pro Design Methodology - 1
[Figure: project timeline from Start 1990 to Finish 4Q 1994. Phases shown: Conception, Microarchitecture, Performance Model, Behavioral Model, Structural Model, Logic Design, Circuit Design, Layout, Silicon Debug. Duration labels: 0.75 years, 1 year, 1 year, 1.75 years.]
Pentium Pro Performance Analysis
• Observability
– On-chip event counters
– Dynamic analysis
• Benchmark Suite
– BAPco Sysmark32 - 32-bit Windows NT applications
– Winstone97 - 32-bit Windows NT applications
– Some SPEC95 benchmarks
Performance – Run Times
[Figure: bar chart of unhalted user-mode processor cycles per benchmark, y-axis 0 to 1.2E+11. Annotation: total of 27.5 billion cycles. Benchmarks: li, avs, Pwave, word, Corel, excel, MSVC, PhotoShop, microstation, paradox, PowerPnt, WordPro, SYSflw96, SYSexcel, SYSword7, SYSCorel6, SYSexcel7, compress, PageMaker, SYSParadox7, SYSpowerpnt7, SYSpagemaker6, go.]
Performance – IPC vs. uPC
[Figure: instructions and uops retired per cycle for each benchmark, y-axis 0 to 1.6. Series: Inst retired, UOPS retired. Annotation: 2 uops/instruction. Benchmarks: WS-AVS, WS-Excel, WS-PWave, WS-MSVC++, WS-PhotoShop, WS-Microstation, WS-CorelDraw, WS-Word, WS-Paradox, WS-PowerPnt, WS-PageMaker, WS-WordPro, SYS-Excel, SYS-Word, SYS-FlW96, SYS-CorelDraw, SYS-Paradox, SYS-PowerPnt, SYS-Pagemaker, SPEC-Li, SPEC-Go, SPEC-Compress.]
Performance – IPC vs. uPC
[Figure: percent of cycles (y-axis 0 to 100) in which n = 0-3 uops or instructions were retired, shown for AVS, PageMaker, Paradox, PowerPoint, and WordPro. Series: 0, 1, 2, 3 UOPs and 0, 1, 2, 3 Inst.]
Performance – Cache Misses
[Figure: cache misses per cycle as a percentage of unhalted cycles, y-axis 0 to 2.5, for the WS-, SYS-, and SPEC- benchmarks. Series: IFU Misses/Cycle, DCU Misses/Cycle, L2 Misses/Cycle. Annotated values: IL1 - 0.51%, DL1 - 1.15%, L2 - 0.25%.]
Performance – Branch Prediction
Branch Mispredict Rate
[Figure: percent of branches mispredicted, y-axis 0% to 16%, for the WS-, SYS-, and SPEC- benchmarks. Annotation: 6.8% avg.]
BTB Miss Rate
[Figure: percent of branches that miss in the BTB, y-axis 0% to 45%, for the WS-, SYS-, and SPEC- benchmarks.]
Conclusions
IA-32 Compliant
Performance (Frequency - IPC)
366.0 ISpec92
283.2 FSpec92
8.09 SPECint95
6.70 SPECfp95
Validation
Die Size - Fabable
Schedule - 1 year late
Power -
Retrospective
• Most commercially successful
microarchitecture in history
• Evolution
– Pentium II/III, Xeon, etc.
• Derivatives with on-chip L2, ISA extensions, etc.
– Replaced by Pentium 4 as flagship in 2001
• High frequency, deep pipeline, extreme speculation
– Resurfaced as Pentium M in 2003
• Initially a response to Transmeta in laptop market
• Pentium 4 derivative (90nm Prescott) delayed, slow, hot
– Core Duo, Core 2 Duo, Core i7 replaced Pentium 4
Microarchitectural Updates
• Pentium M (Banias), Core Duo (Yonah)
– Micro-op fusion (also in AMD K7/K8)
• Multiple uops in one: (add eax,[mem] => ld/alu), sta/std
• These uops decode/dispatch/commit once, issue twice
– Better branch prediction
• Loop count predictor
• Indirect branch predictor
– Slightly deeper pipeline (12 stages)
• Extra decode stage for micro-op fusion
• Extra stage between issue and execute (for RS/PLRAM read)
– Data-capture reservation station (payload RAM)
• Clock gated for 32 (int), 64 (fp), and 128 (SSE) operands
Microarchitectural Updates
• Core 2 Duo (Merom)
– 64-bit ISA from AMD K8
– Macro-op fusion
• Merge uops from two x86 ops
• E.g. cmp, jne => cmpjne
– 4-wide decoder (Complex + 3x Simple)
• Peak x86 decode throughput is 5 due to macro-op fusion
– Loop buffer
• Loops that fit in 18-entry instruction queue avoid fetch/decode overhead
– Even deeper pipeline (14 stages)
– Larger reservation station (32), instruction window (96)
– Memory dependence prediction
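Macro-op fusion can be illustrated with a sketch that merges a compare and the conditional jump that follows it. The set of fusible pairs used here (cmp or test followed by any conditional jump) is an assumption for illustration; the slide only gives cmp, jne as an example. `fuse_macro_ops` is a hypothetical name.

```python
def fuse_macro_ops(instrs):
    """Merge each compare + conditional-jump pair into one fused op."""
    fused = []
    i = 0
    while i < len(instrs):
        is_pair = (i + 1 < len(instrs)
                   and instrs[i] in ("cmp", "test")      # assumed fusible compares
                   and instrs[i + 1].startswith("j")
                   and instrs[i + 1] != "jmp")           # conditional jumps only
        if is_pair:
            fused.append(instrs[i] + instrs[i + 1])      # e.g. cmp, jne => cmpjne
            i += 2
        else:
            fused.append(instrs[i])
            i += 1
    return fused

print(fuse_macro_ops(["add", "cmp", "jne", "mov", "mov"]))
```

Five x86 instructions become four decode slots in this example, which matches the peak decode throughput of 5 on a 4-wide decoder.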
Microarchitectural Updates
• Nehalem (Core i7/i5/i3)
– RS size 36, ROB 128
– Loop cache up to 28 uops
– L2 branch predictor
– L2 TLB
– I$ and D$ now 32K, L2 back to 256K, inclusive L3 up to 8M
– Simultaneous multithreading
– RAS now renamed (repaired)
– 6 issue, 48 load buffers, 32 store buffers
– New system interface (QPI) – finally dropped front-side bus
– Integrated memory controller (up to 3 channels)
– New STTNI instructions for string/text handling
Microarchitectural Updates
• Sandybridge (2nd generation Core i7)
– On-chip integrated graphics (GPU)
– Decoded uop cache up to 1.5K uops, handles loops, but more general
– 54-entry RS, 168-entry ROB
– Physical register file: 144 FP, 160 integer
– 256-bit AVX units: 8 DPFLOP/cycle, 16 SPFLOP/cycle
– 2 general AGUs enable 2 ld/cycle, 2 st/cycle or any combination, 2x128-bit load path from L1 D$
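The AVX peak rates follow from simple arithmetic, sketched below. The sketch assumes two 256-bit floating-point pipes (one add, one multiply) that can each start a vector operation every cycle; the pipe count is an assumption not stated on the slide, and `peak_flops_per_cycle` is a hypothetical name.

```python
def peak_flops_per_cycle(vector_bits, element_bits, pipes=2):
    """Peak FLOP/cycle = lanes per vector x number of FP pipes (assumed 2)."""
    lanes = vector_bits // element_bits
    return lanes * pipes

print(peak_flops_per_cycle(256, 64))   # double precision
print(peak_flops_per_cycle(256, 32))   # single precision
```

With 4 double-precision or 8 single-precision lanes per 256-bit vector, two pipes give the 8 DPFLOP/cycle and 16 SPFLOP/cycle figures.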