intel
Hot Chips IV

Superscalar Architecture of the P5
Intel's Next Generation Microprocessor
Donald Alpert
Intel Corporation

Outline
• Integer Pipeline
• Superscalar Execution
• Branch Prediction
• Dual-Access Data Cache
• Compiler Optimizations

Integer Pipeline
PF   Fetch and Align Instruction
D1   Decode Instruction, Generate Control Word
D2   Decode Control Word, Generate Memory Address
E    Access Data Cache or Calculate ALU Result
WB   Write Result

Superscalar Execution
PF   Fetch and Align Instruction
D1   Decode Instruction, Generate Control Word

     U-Pipe                          V-Pipe
D2   Decode Control Word,            Decode Control Word,
     Generate Memory Address         Generate Memory Address
E    Access Data Cache or            Access Data Cache or
     Calculate ALU Result            Calculate ALU Result
WB   Write Result                    Write Result

Instruction Issue Algorithm
Decode Two Consecutive Instructions: I1 and I2
If the Following Are All True
  I1 Is a "Simple" Instruction
  I2 Is a "Simple" Instruction
  I1 Is Not a JUMP Instruction
  Destination of I1 ≠ Source of I2
  Destination of I1 ≠ Destination of I2
Then Issue I1 to U-Pipe and I2 to V-Pipe
Else Issue I1 to U-Pipe

"Simple" Instructions Are Generally ALU or MOV Operations, Including Reg-Reg, Imm-Reg, Mem-Reg, and Reg-Mem Formats, and JUMPS

Example
U-Pipe
Proc2:  pushl   %ebx
        movl    (%ecx),%edx
        addl    $10,%edx
        cmpb    $65,%ah
        decl    %edx
        movl    %edx,%ebx
        movl    %edx,(%ecx)
.B4_4:  popl    %ebx
        ret
V-Pipe
        movl    %eax,%ecx
        movb    Char1Glob,%ah
        jne     .B4_4
        movl    IntGlob,%eax
        subl    %eax,%edx
        movl    %ebx,%edx

Branch Prediction
• Branch Target Buffer
  - Branch Instruction Address
  - Branch Destination Address
  - History
• Correctly Predicted Branches Execute with No Delays

Dual-Access Data Cache
• Inputs: U-Pipe Address, V-Pipe Address
• TLB: Dual-Ported
• Cache Tags: Dual-Ported
• Bank Conflict Detect
• Cache Data: Single-Ported, Interleaved
• Outputs: U-Pipe Data, V-Pipe Data
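
The pairing test on the Instruction Issue Algorithm slide can be sketched in a few lines of Python. This is only an illustrative model, not the hardware logic: the Instr record, its field names, and the can_pair and issue helpers are hypothetical, operands are tracked as plain register-name strings, and sub-register aliasing (for example %ah versus %eax) and memory operands are ignored. The slide's "Destination of I1 ≠ Source of I2" check is generalized here to "no register in common".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    # Hypothetical record for illustration; the field names are not from the slides.
    text: str                          # assembly text, used only for display
    simple: bool                       # True if the instruction counts as "simple"
    is_jump: bool = False              # True for JUMP instructions
    dests: frozenset = frozenset()     # registers written
    sources: frozenset = frozenset()   # registers read


def can_pair(i1: Instr, i2: Instr) -> bool:
    """Apply the five conditions from the issue-algorithm slide."""
    return (
        i1.simple
        and i2.simple
        and not i1.is_jump
        and i1.dests.isdisjoint(i2.sources)   # destination of I1 != source of I2
        and i1.dests.isdisjoint(i2.dests)     # destination of I1 != destination of I2
    )


def issue(i1: Instr, i2: Instr):
    """Return (U-pipe instruction, V-pipe instruction or None)."""
    if can_pair(i1, i2):
        return (i1.text, i2.text)
    return (i1.text, None)


# Instructions taken from the Example slide; the operand sets are filled in by hand.
load = Instr("movl (%ecx),%edx", simple=True,
             dests=frozenset({"edx"}), sources=frozenset({"ecx"}))
movb = Instr("movb Char1Glob,%ah", simple=True, dests=frozenset({"ah"}))
addl = Instr("addl $10,%edx", simple=True,
             dests=frozenset({"edx"}), sources=frozenset({"edx"}))

print(issue(load, movb))   # independent destinations, so both pipes are used
print(issue(load, addl))   # addl reads %edx written by the load, so only U-pipe issues
```

In this model an instruction that fails the test is simply left for the next decode cycle, where it becomes the new I1.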

Compiler Optimization
• Instruction Selection
  - Use Simple Formats for Efficient Decoding
• Instruction Scheduling
  - Minimize Address Generation Interlocks
  - Maximize Parallel Execution
• Register Allocation
  - Schedule and Allocate Together to Make Best Use of Small Register Set

Summary
• Superscalar Microarchitecture
  - Dual Integer Pipelines
  - Branch Target Buffer
  - Dual-Access Data Cache
• Fully Compatible with Intel486™ CPU


The P5 Floating-Point Unit
Dror Avnon
Intel Corporation

Agenda
• Design Goals
• Micro-Architecture Overview
• Register-Stack Manipulation
• Transcendental Functions
• Compiler Optimization
• Summary

Design Goals
• Architectural Compatibility
  - Full Compatibility with Intel486™ CPU
  - IEEE Standard 754
• High Performance
  - 4-10 Times Intel486™ DX 33 MHz CPU

Micro-Architecture Overview
Floating Point Pipeline
• Three Dedicated Arithmetic Units
• Eight Stage Pipeline
    Integer Pipe: PF | D1 | D2 | E | WB
    FP Pipe: [figure]
• Three Execution Stages

Micro-Architecture Overview
Floating Point Pipeline Characteristics
• One Cycle Throughput
• Execution in U-Pipe
• U-Pipe and V-Pipe Used to Access Data Cache
• Concurrent Data Cache Access and FP Computation
• Tuned for Double Precision Memory-Register Operations

Micro-Architecture Overview
Safe Instruction Recognition
• Early Detection of Potential Exceptions
Example:
FMULP Recognized as Safe
  Cycle 1   FMULP ST(2), ST
  Cycle 2   FADD QWORD PTR [EAX]
FMULP Recognized as Unsafe
  Cycle 1   FMULP ST(2), ST
  Cycle 2   ○
  Cycle 3   ○
  Cycle 4   ○
  Cycle 5   FADD QWORD PTR [EAX]

Micro-Architecture Overview
Arithmetic Units
• Multiplier
  - Full Extended Precision Multiply Array
  - Three Cycles Latency for All Precisions
  - Support for Integer Multiplication
• Adder
  - Execution of Majority of Basic Instructions
  - Three 71-Bit Adders
  - Two 69-Bit Shifters
  - Three Cycles Latency for All Precisions
• Divider
  - Divide, Remainder and Square-Root Operations
  - SRT Algorithm

Register Stack Manipulation
• Instruction Set Uses Top of Register Stack as Accumulator
• Parallel Execution of FXCH
Example:
  Cycle 1   FADD QWORD PTR [EAX]    FXCH ST(2)
  Cycle 2   FMUL QWORD PTR [EBX]    FXCH ST(3)
[Figure: register stack st0-st7 shown Before Cycle 1, After Cycle 1, and After Cycle 2]

Transcendental Functions
• Direct Microcode Support for All Architecturally Defined Transcendental Instructions:
  - Sine
  - Cosine
  - Sine-and-Cosine
  - Tangent
  - Arctangent
  - 2^X - 1
  - Y × log2(X)
  - Y × log2(X + 1)
• Table Driven Algorithms Using Polynomial Approximation
• Performance and Error Bound Improvement over Intel486™ DX 33 MHz CPU
• Comprehensive Validation Program

Compiler Optimization
• Instruction Scheduling
• Register Allocation
• Loop Unrolling
• Parallel FXCH

Summary
• Streamlined Pipeline Provides High Performance
  - Integration with Integer Pipeline
  - One Cycle Throughput
  - Tuning for Memory-Register Double Precision Operations
• Fast Arithmetic Units Using State-of-the-Art Algorithms
  - Multiplier
  - Adder
  - Divider
• Improved Performance and Accuracy of Transcendentals
• New Compiler Optimizations Co-Developed with Micro-Architecture
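
The Register Stack Manipulation example (FADD then FXCH ST(2), FMUL then FXCH ST(3)) can be traced with a small architectural model. This sketch assumes a plain Python list in which index 0 stands for ST(0); the helper names and the numeric values are hypothetical and chosen only for illustration. It models the visible effect of each instruction on the register stack and says nothing about how the hardware overlaps FXCH with the arithmetic operation.

```python
def fxch(stack, i):
    """Swap ST(0) with ST(i), the architectural effect of FXCH ST(i)."""
    stack[0], stack[i] = stack[i], stack[0]


def fadd_mem(stack, value):
    """FADD with a memory operand: ST(0) = ST(0) + value."""
    stack[0] = stack[0] + value


def fmul_mem(stack, value):
    """FMUL with a memory operand: ST(0) = ST(0) * value."""
    stack[0] = stack[0] * value


def run_example(stack, mem_eax, mem_ebx):
    """Return stack snapshots before cycle 1, after cycle 1, and after cycle 2."""
    before = list(stack)

    # Cycle 1: FADD QWORD PTR [EAX] together with FXCH ST(2)
    fadd_mem(stack, mem_eax)
    fxch(stack, 2)
    after_cycle_1 = list(stack)

    # Cycle 2: FMUL QWORD PTR [EBX] together with FXCH ST(3)
    fmul_mem(stack, mem_ebx)
    fxch(stack, 3)
    after_cycle_2 = list(stack)

    return before, after_cycle_1, after_cycle_2


# Made-up initial contents of st0..st7 and of the two memory operands.
snapshots = run_example([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], 10.0, 2.0)
for label, snap in zip(("Before Cycle 1", "After Cycle 1", "After Cycle 2"), snapshots):
    print(label, snap)
```

Each FXCH moves the freshly computed result out of ST(0) and brings a different stack entry to the top, so the next arithmetic instruction works on an operand that does not depend on the previous result.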