Hot Chips P5 (1992)

Hot Chips IV
Superscalar Architecture of the P5
Intel's Next Generation Microprocessor
Donald Alpert
Intel Corporation
Outline
• Integer Pipeline
• Superscalar Execution
• Branch Prediction
• Dual-Access Data Cache
• Compiler Optimizations
Integer Pipeline
PF  Fetch and Align Instruction
D1  Decode Instruction; Generate Control Word
D2  Decode Control Word; Generate Memory Address
E   Access Data Cache or Calculate ALU Result
WB  Write Result
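As an illustration of how the five stages overlap, the following minimal sketch prints which stage each instruction occupies in each cycle. It assumes an ideal flow with one instruction entering PF per cycle and no stalls; the function name `timeline` and the `--` idle marker are hypothetical, chosen only for this example.

```python
STAGES = ["PF", "D1", "D2", "E", "WB"]  # stage names from the slide


def timeline(num_instructions):
    """Return one row per instruction; row[c] is the stage occupied in cycle c.

    Assumption: instruction i enters PF in cycle i and never stalls.
    """
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = []
        for cycle in range(total_cycles):
            stage_index = cycle - i
            if 0 <= stage_index < len(STAGES):
                row.append(STAGES[stage_index])
            else:
                row.append("--")  # instruction not in the pipeline this cycle
        rows.append(row)
    return rows


if __name__ == "__main__":
    for i, row in enumerate(timeline(3)):
        print(f"I{i + 1}: " + " ".join(f"{s:>2}" for s in row))
```

Under these assumptions, three instructions complete in seven cycles, and from the fifth cycle onward one instruction reaches WB per cycle.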
Superscalar Execution
PF  Fetch and Align Instruction
D1  Decode Instruction; Generate Control Word
The D2, E, and WB stages are duplicated in the U-Pipe and the V-Pipe:
D2  Decode Control Word; Generate Memory Address
E   Access Data Cache or Calculate ALU Result
WB  Write Result
Instruction Issue Algorithm
Decode Two Consecutive Instructions: I1 and I2
If the Following Are All True
I1 Is a "Simple" Instruction
I2 Is a "Simple" Instruction
I1 Is Not a JUMP Instruction
Destination of I1 ≠ Source of I2
Destination of I1 ≠ Destination of I2
Then Issue I1 to U-Pipe and I2 to V-Pipe
Else Issue I1 to U-Pipe
"Simple" Instructions Are Generally ALU or MOV
Operations, Including Reg-Reg, Imm-Reg, Mem-Reg, and Reg-Mem Formats, and JUMPS
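The issue rules above can be sketched as a small pairing check. This is only an illustration under stated assumptions: the `Instr` record, the `ins` helper, and the hand-written register sets are hypothetical, only general registers are modeled (flags and partial registers such as %ah are ignored), and the "simple" classification is supplied by the caller rather than derived from an opcode table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str
    simple: bool        # "simple" in the slide's sense; supplied by the caller
    is_jump: bool
    dest: frozenset     # registers written (hypothetical, hand-annotated)
    srcs: frozenset     # registers read (hypothetical, hand-annotated)


def ins(text, dest=(), srcs=(), simple=True, is_jump=False):
    """Convenience constructor for the example below."""
    return Instr(text, simple, is_jump, frozenset(dest), frozenset(srcs))


def can_pair(i1, i2):
    """Apply the five conditions from the slide to two consecutive instructions."""
    return (
        i1.simple
        and i2.simple
        and not i1.is_jump
        and not (i1.dest & i2.srcs)   # destination of I1 != source of I2
        and not (i1.dest & i2.dest)   # destination of I1 != destination of I2
    )


def issue(stream):
    """Return a list of (u_pipe, v_pipe) texts; v_pipe is None when I1 issues alone."""
    cycles = []
    k = 0
    while k < len(stream):
        if k + 1 < len(stream) and can_pair(stream[k], stream[k + 1]):
            cycles.append((stream[k].text, stream[k + 1].text))
            k += 2
        else:
            cycles.append((stream[k].text, None))
            k += 1
    return cycles


prog = [
    ins("movl %eax,%ecx", dest=["ecx"], srcs=["eax"]),
    ins("addl $1,%ebx", dest=["ebx"], srcs=["ebx"]),
    ins("movl (%ecx),%edx", dest=["edx"], srcs=["ecx"]),
    ins("addl $10,%edx", dest=["edx"], srcs=["edx"]),
]

if __name__ == "__main__":
    for u, v in issue(prog):
        print(f"U: {u:<20} V: {v or '(empty)'}")
```

In this sketch the first two instructions pair because they touch disjoint registers, while the load into %edx and the following add to %edx issue in separate cycles because the second reads and writes the first one's destination.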
Example
U-Pipe
Proc2:
    pushl   %ebx
    movl    (%ecx),%edx
    addl    $10,%edx
    cmpb    $65,%ah
    decl    %edx
    movl    %edx,%ebx
    movl    %edx,(%ecx)
.B4_4:
    popl    %ebx
    ret
V-Pipe
    movl    %eax,%ecx
    movb    Char1Glob,%ah
    jne     .B4_4
    movl    IntGlob,%eax
    subl    %eax,%edx
    movl    %ebx,%edx
Branch Prediction
Branch Target Buffer
Entry Fields: Branch Instruction Address, Branch Destination Address, History
• Correctly Predicted Branches Execute with No Delays
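A branch target buffer with the three fields named on the slide can be sketched as follows. The slide does not say how the history field is encoded or how entries are allocated, so this sketch assumes a 2-bit saturating counter, allocation on the first taken branch, and an unbounded dictionary instead of a fixed-size table. The class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Toy BTB: branch address -> [destination address, history counter]."""

    def __init__(self):
        self.entries = {}

    def predict(self, branch_addr):
        """Return the predicted destination, or None to predict fall-through."""
        entry = self.entries.get(branch_addr)
        if entry is not None and entry[1] >= 2:   # counter 2 or 3 means "taken"
            return entry[0]
        return None

    def update(self, branch_addr, taken, destination):
        """Record the actual outcome once the branch has executed."""
        entry = self.entries.get(branch_addr)
        if entry is None:
            if taken:
                # Assumption: allocate on first taken branch, starting at "strongly taken".
                self.entries[branch_addr] = [destination, 3]
            return
        if taken:
            entry[0] = destination
            entry[1] = min(3, entry[1] + 1)
        else:
            entry[1] = max(0, entry[1] - 1)


if __name__ == "__main__":
    btb = BranchTargetBuffer()
    loop_branch, loop_top = 0x1040, 0x1000
    for taken in [True, True, True, False]:
        guess = btb.predict(loop_branch)
        actual = loop_top if taken else None
        print(f"predicted={guess!r} actual={actual!r} correct={guess == actual}")
        btb.update(loop_branch, taken, loop_top)
```

With these assumptions, a loop branch is mispredicted on its first execution, predicted correctly while the loop repeats, and mispredicted once more at loop exit.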
Dual-Access Data Cache
Inputs: U-Pipe Address, V-Pipe Address
• TLB: Dual-Ported
• Cache Tags: Dual-Ported
• Bank Conflict Detect
• Cache Data: Single-Ported, Interleaved
Outputs: U-Pipe Data, V-Pipe Data
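The role of bank conflict detection in an interleaved, single-ported data array can be sketched as below. The slide gives neither the number of banks nor the interleave width, so `NUM_BANKS = 8` and `BANK_BYTES = 4` are assumptions made for illustration, as is the simple rule that two accesses to the same bank are serialized into two cycles.

```python
NUM_BANKS = 8    # assumption: not stated on the slide
BANK_BYTES = 4   # assumption: interleave granularity in bytes


def bank_of(addr):
    """Bank selected by an address under the assumed interleaving."""
    return (addr // BANK_BYTES) % NUM_BANKS


def bank_conflict(u_addr, v_addr):
    """True when the U-Pipe and V-Pipe accesses fall in the same bank."""
    return bank_of(u_addr) == bank_of(v_addr)


def access_cycles(u_addr, v_addr=None):
    """Cycles needed for the data access in this simplified model."""
    if v_addr is None:
        return 1
    return 2 if bank_conflict(u_addr, v_addr) else 1


if __name__ == "__main__":
    for u, v in [(0x1000, 0x1004), (0x1000, 0x1020)]:
        print(f"U={u:#x} V={v:#x} banks=({bank_of(u)},{bank_of(v)}) "
              f"cycles={access_cycles(u, v)}")
```

In this model, accesses to adjacent words proceed together, while two addresses that differ by a multiple of `NUM_BANKS * BANK_BYTES` collide in the same bank.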
Compiler Optimization
• Instruction Selection
- Use Simple Formats for Efficient
Decoding
• Instruction Scheduling
- Minimize Address Generation Interlocks
- Maximize Parallel Execution
• Register Allocation
- Schedule and Allocate Together to Make
Best Use of Small Register Set
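To illustrate the scheduling bullet about address generation interlocks, here is a minimal sketch that counts them in a straight-line sequence. It assumes the usual reading of the term: because the memory address is generated in D2, an instruction stalls for one cycle if it uses, as an address register, a register written by the instruction issued in the previous cycle. It also assumes one instruction per cycle; the function name and the set-based instruction encoding are hypothetical.

```python
def count_agi_stalls(instrs):
    """Count address generation interlocks in a single-issue sequence.

    instrs: list of (dest_regs, addr_regs) pairs, where dest_regs are the
    registers an instruction writes and addr_regs are the registers it
    uses to form a memory address.
    """
    stalls = 0
    prev_dest = set()
    for dest_regs, addr_regs in instrs:
        if prev_dest & set(addr_regs):
            stalls += 1          # address register not ready yet
        prev_dest = set(dest_regs)
    return stalls


# movl %eax,%ecx ; movl (%ecx),%edx ; addl $1,%ebx
unscheduled = [({"ecx"}, set()), ({"edx"}, {"ecx"}), ({"ebx"}, set())]
# movl %eax,%ecx ; addl $1,%ebx ; movl (%ecx),%edx
scheduled = [({"ecx"}, set()), ({"ebx"}, set()), ({"edx"}, {"ecx"})]

if __name__ == "__main__":
    print("unscheduled stalls:", count_agi_stalls(unscheduled))
    print("scheduled stalls:  ", count_agi_stalls(scheduled))
```

Moving an independent instruction between the write of %ecx and its use as an address removes the interlock in this model, which is the kind of reordering the scheduling bullet refers to.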
Summary
• Superscalar Microarchitecture
- Dual Integer Pipelines
- Branch Target Buffer
- Dual-Access Data Cache
• Fully Compatible with Intel486™ CPU
The P5 Floating-Point Unit
Dror Avnon
Intel Corporation
Agenda
• Design Goals
• Micro-Architecture Overview
• Register-Stack Manipulation
• Transcendental Functions
• Compiler Optimization
• Summary
Design Goals
• Architectural Compatibility
- Full Compatibility with Intel486™ CPU
- IEEE Standard 754
• High Performance
- 4-10 Times Intel486™ DX 33MHz CPU
Micro-Architecture Overview
Floating Point Pipeline
• Three Dedicated Arithmetic Units
• Eight Stage Pipeline
Integer Pipe: PF | D1 | D2 | E | WB
FP Pipe:
• Three Execution Stages
Micro-Architecture
Overview
Floating Point Pipeline Characteristics
• One Cycle Throughput
• Execution in U-Pipe
• U-Pipe and V-Pipe Used to Access
Data Cache
• Concurrent Data Cache Access and
FP Computation
• Tuned for Double Precision
Memory-Register Operations
Micro-Architecture
Overview
Safe Instruction Recognition
• Early Detection of Potential Exceptions
Example:
FMULP Recognized as Safe
Cycle 1: FMULP ST(2), ST
Cycle 2: FADD QWORD PTR [EAX]
FMULP Recognized as Unsafe
Cycle 1: FMULP ST(2), ST
Cycles 2-4: (stall)
Cycle 5: FADD QWORD PTR [EAX]
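One way to picture early detection is a conservative check on operand exponents alone, made before the multiply completes. The slides do not give the actual criteria, so the following is only a sketch under assumptions: it works on IEEE double precision values, considers only overflow and underflow, and uses an illustrative bound `SAFE_EXP` chosen so that the product of two in-range operands is always a finite normal number. The function name is hypothetical.

```python
import math
import sys

SAFE_EXP = 510  # assumption: illustrative bound for IEEE doubles, not a P5 value


def is_safe_multiply(a, b):
    """Conservatively decide, from the operands only, that a * b cannot
    overflow or underflow. Returning False means "might raise", not "will raise"."""
    if not (math.isfinite(a) and math.isfinite(b)):
        return False
    _, exp_a = math.frexp(a)   # a == m * 2**exp_a with 0.5 <= |m| < 1 (or a == 0)
    _, exp_b = math.frexp(b)
    return abs(exp_a) <= SAFE_EXP and abs(exp_b) <= SAFE_EXP


if __name__ == "__main__":
    for a, b in [(1.5, 2.0), (1e300, 1e10), (1e-300, 1e-10)]:
        verdict = "safe" if is_safe_multiply(a, b) else "unsafe"
        print(f"{a!r} * {b!r}: {verdict}")
```

The check is deliberately pessimistic: some operand pairs flagged unsafe would in fact multiply without any exception, which matches the idea of detecting potential rather than actual exceptions.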
Micro-Architecture
Overview
Arithmetic Units
• Multiplier
- Full Extended Precision Multiply Array
- Three Cycles Latency for All Precisions
- Support for Integer Multiplication
• Adder
- Execution of Majority of Basic Instructions
- Three 71-Bit Adders
- Two 69-Bit Shifters
- Three Cycles Latency for All Precisions
• Divider
- Divide, Remainder and Square-Root Operations
- SRT Algorithm
Register Stack Manipulation
• Instruction Set Uses Top of Register Stack as Accumulator
• Parallel Execution of FXCH
Example:
Cycle 1: FADD QWORD PTR [EAX]    FXCH ST(2)
Cycle 2: FMUL QWORD PTR [EBX]    FXCH ST(3)
[Figure: register stack st0-st7 shown Before Cycle 1, After Cycle 1, and After Cycle 2]
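The effect of the two cycles in the example can be traced with a small model of the register stack. The slide says only that FXCH executes in parallel; this sketch assumes one plausible mechanism, a mapping table from stack positions to physical registers, so that FXCH swaps two table entries and moves no data. The class `FPStack`, its methods, and the operand values are hypothetical.

```python
class FPStack:
    """Toy x87-style register stack with an assumed pointer-mapping table."""

    def __init__(self, values):
        self.phys = list(values)              # physical registers
        self.map = list(range(len(values)))   # map[i] = physical index of ST(i)

    def st(self, i):
        return self.phys[self.map[i]]

    def set_st(self, i, value):
        self.phys[self.map[i]] = value

    def fxch(self, i):
        """Exchange ST(0) and ST(i) by swapping map entries only."""
        self.map[0], self.map[i] = self.map[i], self.map[0]

    def logical(self):
        return [self.st(i) for i in range(len(self.map))]


if __name__ == "__main__":
    stack = FPStack([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    print("Before Cycle 1:", stack.logical())

    # Cycle 1: FADD QWORD PTR [EAX] paired with FXCH ST(2); memory operand assumed 0.5
    stack.set_st(0, stack.st(0) + 0.5)
    stack.fxch(2)
    print("After Cycle 1: ", stack.logical())

    # Cycle 2: FMUL QWORD PTR [EBX] paired with FXCH ST(3); memory operand assumed 2.0
    stack.set_st(0, stack.st(0) * 2.0)
    stack.fxch(3)
    print("After Cycle 2: ", stack.logical())
```

Because the top of stack acts as the accumulator, each exchange brings a fresh operand to ST(0) for the next arithmetic instruction while the previous result moves down the stack.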
Transcendental Functions
• Direct Microcode Support for All Architecturally Defined
Transcendental Instructions:
- Sine
- Cosine
- Sine-and-Cosine
- Tangent
- Arctangent
- 2^X - 1
- Y × log2(X)
- Y × log2(X + 1)
• Table Driven Algorithms Using Polynomial Approximation
• Performance and Error Bound Improvement over Intel486™ DX 33MHz CPU
• Comprehensive Validation Program
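The general shape of a table-driven algorithm with polynomial approximation can be sketched for 2^X - 1. This is not the P5 microcode, whose tables and polynomials are not given in the slides. The sketch assumes an input range of -1 to 1, a table of 2^(k/32), and a degree-5 Taylor polynomial for the small remainder; the names `N`, `TABLE`, and `two_x_minus_1` are hypothetical.

```python
import math

N = 32                                   # assumption: table granularity
LN2 = math.log(2.0)
TABLE = [2.0 ** (k / N) for k in range(-N, N + 1)]   # TABLE[k + N] == 2**(k/N)


def two_x_minus_1(x):
    """Approximate 2**x - 1 for -1 <= x <= 1.

    Split x = k/N + r with |r| <= 1/(2N), look up t = 2**(k/N), and evaluate
    a short polynomial p ~= 2**r - 1. Then 2**x - 1 == t*p + (t - 1).
    """
    if not -1.0 <= x <= 1.0:
        raise ValueError("x must lie in [-1, 1]")
    k = round(x * N)
    r = x - k / N
    u = r * LN2
    # Degree-5 Taylor polynomial of exp(u) - 1 in Horner form (an assumption;
    # a production design would more likely use minimax coefficients).
    p = u * (1.0 + u * (1.0 / 2 + u * (1.0 / 6 + u * (1.0 / 24 + u / 120))))
    t = TABLE[k + N]
    return t * p + (t - 1.0)


if __name__ == "__main__":
    for x in (-1.0, -0.3, 0.0, 1e-9, 0.3, 1.0):
        approx = two_x_minus_1(x)
        reference = math.expm1(x * LN2)
        print(f"x={x:+.9f} approx={approx:+.15e} diff={approx - reference:+.2e}")
```

The table lookup keeps the polynomial argument small, so a low-degree polynomial suffices, and computing 2^X - 1 directly rather than 2^X avoids cancellation for X near zero.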
Compiler Optimization
• Instruction Scheduling
• Register Allocation
• Loop Unrolling
• Parallel FXCH
Summary
• Streamlined Pipeline Provides High Performance
- Integration with Integer Pipeline
- One Cycle Throughput
- Tuning for Memory-Register Double Precision
Operations
• Fast Arithmetic Units Using State-of-the-Art Algorithms
- Multiplier
-Adder
- Divider
• Improved Performance and Accuracy of Transcendentals
• New Compiler Optimizations Co-Developed
with Micro-Architecture