Microprocessor and DSP Technologies for the Nanoscale Era
Seminar 1
Ram Kumar Krishnamurthy
Microprocessor Research Labs, Intel Corporation, Hillsboro, OR
ram.krishnamurthy@intel.com
July 5, 2005
Intel Labs

About Circuits Research Lab
• Established 1996
• Part of the Microprocessor Technology Labs
• Located in Hillsboro, Oregon, USA (primary) and Bangalore, India
• 75 researchers
• Charter:
  • High-performance & low-power digital circuits
  • Off-chip I/O signaling circuits
  • Power delivery circuits
• >50 patents, >25 papers per year

Motivation: Higher Performance at Lower Power and Cost
[Chart: MIPS (0.01 to 1,000,000, log scale) vs. year (1970-2010), tracing the 8086 and 286 through the 386, 486, Pentium®, Pentium® Pro, and Pentium® 4 architectures]
Strong demand for >1 TIPS of performance beyond this decade. How do you get there?

Our Research Agenda Outlook
Year                        2004   2006   2008   2010   2012   2014   2016   2018
Technology node (nm)          90     65     45     32     22     16     11      8
Integration capacity (BT)    0.5      1      2      4      8     16     32     64
Delay = CV/I scaling         0.7   ~0.7   >0.7   (delay scaling will slow down)
Energy/logic-op scaling    >0.35   >0.5   >0.5   (energy scaling will slow down)
Bulk planar CMOS            high probability → low probability
Alternate devices, 3G, etc. low probability → high probability
Variability                 medium → high → very high
ILD (K)                     ~3 → <3, reducing slowly towards 2-2.5
RC delay                    1 per generation
Metal layers                6-7 → 7-8 → 8-9 (0.5 to 1 layer added per generation)

Intel's Research Focus
• Research conducted through FCRP (MARCO), internal programs, and universities
• Technology leadership
[Chart: gate length (nm, log scale) vs. year (1990-2010), with Intel ahead of the industry curve]
• Complete solution stack: Technology → Architecture & Design → Platforms → Software

Architectures & Designs
Segment:        Back-end server | Server | Desktop | Mobile | Handheld
Family:         Itanium® | Itanium®/Xeon® | Pentium®/Celeron® | Centrino®/Pentium® | XScale®
Architecture:   IA-64 (VLIW) | IA-64/IA-32 | IA-32 | IA-32 | ARM
Word:           64-bit | 64-bit (Itanium®), 32-bit (Xeon®) | 32-bit | 32-bit | 32-bit
Address space:  Huge | Huge / 4 GB | 4 GB | 4 GB | 4 GB
Cache:          6 MB | 6 MB / 2 MB | 1 MB | 1 MB | 512 KB
Performance:    High | High | High | Medium | Low
Power:          ~130 W | ~100 W | <100 W | ~25 W | <1 W
Power metric:   Watts/sq ft | Watts/cu ft | Watts | Watt-hours | Battery life
Cost:           High | High | Med | Med | Low
Our research agenda addresses all these platforms.

Is the Transistor a Good Switch?
• Ideal switch: on, I = ∞; off, I = 0
• Real MOS transistor: on, I ≈ 1 mA/μm; off, I ≠ 0 (leakage)

Sub-threshold Leakage
[Chart: MOS transistor Ids (log scale) vs. Vgs; a threshold shift ΔVt produces an exponential increase in Ioff]
[Chart: Ioff (1-10,000 nA/μm, log scale) vs. temperature (30-130 °C) for 0.25 μm, 90 nm, and 45 nm technologies]
[Chart: source-drain leakage (0.1-1000 W, log scale) vs. technology (0.25 μm to 45 nm) for 1.5X and 2X transistor-count growth per generation]
Transistors will not be switches, but dimmers.

Leakage Power
[Chart: leakage power as % of total power vs. technology (1.5 μm to 0.05 μm), rising toward 50%, where it must stop. A. Grove, IEDM 2002]
Leakage power limits Vt scaling.

High Leakage Impacts Functionality
[Schematic: wide-OR dynamic gate with clock, precharge device Pk0, keeper, pulldown legs M11...M1K and M21...M2K, and leakage current on the dynamic node]
[Chart: required keeper/pulldown ratio (0-1.6) vs. subthreshold + gate leakage (1X-20X) for sub-70 nm. M. Anders, R. Krishnamurthy et al., 2001 Symp. VLSI Circuits]
Sub-65 nm dynamic-circuit active-leakage tolerance: caches, register files, arrays, and bitlines are most affected
• Keeper sizes grow to >50% of pulldown strength
• High contention → degraded performance
• Slow keeper shutoff → high short-circuit power

Power Will Be the Limiter
[Chart: transistor count vs. year (1970-2020), from the 8080 and 8086 to 10 million (386) and toward 1 billion: 1B-transistor integration capacity will exist]
[Chart: frequency (MHz, log scale) vs. year, extrapolating to 15-30 GHz]
[Chart: MIPS vs. year, headed toward 1 TIPS, at 1000s of Watts?]
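The exponential dependence of sub-threshold leakage on Vt can be sketched numerically. This is a toy model of the trend on the leakage slides, not measured Intel data: the prefactor `i0` and the sub-threshold slope `s` are illustrative assumptions.

```python
# Toy model: sub-threshold leakage rises exponentially as Vt is lowered.
# I_off = I0 * 10^(-Vt/S), one decade of leakage per S volts of threshold.
# I0 and S below are illustrative assumptions, not slide data.

def subthreshold_leakage_na_per_um(vt_volts, i0=10_000.0, s=0.1):
    """Off-state leakage per micron of width at threshold voltage Vt."""
    return i0 * 10 ** (-vt_volts / s)

high_vt = subthreshold_leakage_na_per_um(0.40)  # nominal-Ioff device
low_vt = subthreshold_leakage_na_per_um(0.30)   # Vt lowered by 100 mV

# A 100 mV Vt reduction costs ~10X the leakage: the dual-Vt trade-off the
# talk exploits later (high Vt: 1X leakage, low Vt: ~10X leakage).
print(round(low_vt / high_vt))  # → 10
```

The same exponential is why leakage power, per the Grove chart, limits how far Vt can be scaled down.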
[Chart: power (W, log scale) vs. year (1970-2020) for the same processor line, from the 8080 and 8086 through the Pentium® and Pentium® 4 processors]
Applications will demand TIPS performance, but the power could reach 1000s of Watts.
Challenge: deliver the highest performance within the power envelope.

Power Trend
[Chart: power (W, log scale) vs. year (1985-2000) for the 386, 486, Pentium®, Pentium® II, and Pentium® 4 processors, approaching the cooling capacity of a conventional system]
"Business as usual" is not an option: C scales down by 30% per generation, but Vcc scales by only 10-15%. Power must be held constant or reduced in future generations.

Gate Oxide is Near the Limit
[TEM cross-section: 130 nm transistor with CoSi2, Si3N4, 70 nm gate length, poly-Si gate electrode, 1.5 nm gate oxide, Si substrate]
[Chart: gate leakage (W, log scale) vs. technology (0.25 μm to 45 nm) for 1.5X and 2X transistor growth; leakage rises further at the ~1.4X Vdd applied during burn-in]
Intel's high-k leadership is crucial for the industry.

Power Density Will Get Even Worse
[Chart: power density (W/cm², log scale) vs. year ('70-'10), from the 4004, 8008, 8080, 8085, 8086, 286, 386, and 486 up through the Pentium® processors. Pat Gelsinger, ISSCC 2001]
Need to keep the junctions cool for:
• Performance (higher frequency)
• Lower leakage (exponential in temperature)
• Better reliability (exponential in temperature)

Active Power Reduction
• Multiple supply voltages: slow paths at a low supply, fast paths at a high supply
• Replicated designs:
  • One logic block at Vdd: Freq = 1, Vdd = 1, Throughput = 1, Power = 1, Area = 1, Power density = 1
  • Two logic blocks at Vdd/2: Freq = 0.5, Vdd = 0.5, Throughput = 1, Power = 0.25, Area = 2, Power density = 0.125
Need high-speed multi-supply level-converter techniques.

Leakage Control
• Body bias (Vbp/Vbn, +Ve/-Ve): 2-10X reduction
• Stack effect (equal loading): 2-1000X reduction
• Sleep transistor: 2-200X reduction
Need low-leakage and leakage-tolerant techniques.

Dual-Vt Design for Active Leakage Reduction
Technology provides two thresholds:
• High Vt with nominal Ioff (lower performance)
• Low Vt with ~10X higher Ioff (higher performance)
[Charts: number of paths vs. delay for the logic paths between latch boundaries, for all-high-Vt, all-low-Vt, and dual-Vt designs]
• Employing high Vt everywhere yields lower performance and lower leakage (1X)
• Employing low Vt everywhere yields higher performance but higher leakage (10X)
• Dual Vt: selective use of low and high Vt yields higher performance, yet leakage between 1X and <<10X

Chip Multi-Processing
[Chart: relative performance vs. die area/power; CMP scales toward ~3X while single-thread (ST) flattens near 2X]
[Diagram: four cores C1-C4 sharing a cache]
• Multi-core, each core multi-threaded
• Shared cache and front-side bus
• Each core at its own Vdd and frequency
• Core hopping to spread hot spots
• Lower junction temperature

Memory Latency
CPU → cache (small, ~few clocks) → memory (large, 50-100 ns)
[Chart: memory latency in clocks (log scale) vs. frequency (100 MHz-10 GHz), assuming 50 ns memory latency]
A cache miss hurts performance, and hurts more at higher frequency.
Need power-efficient high-speed I/O techniques.

Increase On-die Memory
[Chart: power density (W/cm², log scale) for logic vs. memory across the Pentium® generations; large memory arrays run far cooler than logic]
[Chart: cache as % of full chip area, rising from the Pentium® and Pentium® Pro through the Pentium® III and 4]
Large on-die memory provides:
1. Increased data bandwidth and reduced latency
2. Hence, higher performance for much lower power
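The replicated-design numbers on the Active Power Reduction slide follow directly from the usual dynamic-power model P ∝ C·Vdd²·f (with frequency assumed proportional to Vdd). A quick normalized check; the helper name is mine:

```python
# Normalized active-power arithmetic for replicated designs:
# P ∝ n_blocks * Vdd^2 * f, throughput ∝ n_blocks * f, area ∝ n_blocks.

def design_point(n_blocks, vdd, freq):
    """(throughput, power, area, power density) relative to one block
    running at Vdd = 1, f = 1."""
    throughput = n_blocks * freq
    power = n_blocks * vdd ** 2 * freq
    area = n_blocks
    return throughput, power, area, power / area

print(design_point(1, 1.0, 1.0))  # → (1.0, 1.0, 1, 1.0)
print(design_point(2, 0.5, 0.5))  # → (1.0, 0.25, 2, 0.125)
```

Two half-speed blocks at half the supply match the original throughput at a quarter of the power, which is exactly the slide's case for multi-supply design (at the cost of 2X area and the level converters it calls for).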
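The memory-latency slide's point is one line of arithmetic: a fixed 50 ns memory access costs more core clocks as frequency rises. A tiny sketch (the frequencies are illustrative):

```python
# A fixed DRAM latency, measured in core clocks, grows linearly with
# core frequency: latency_clocks = latency_ns * freq_GHz.

MEMORY_LATENCY_NS = 50.0  # the slide's assumed memory latency

def miss_penalty_clocks(freq_ghz):
    """Core cycles lost to one memory access at the given frequency."""
    return MEMORY_LATENCY_NS * freq_ghz  # ns * cycles/ns

print(miss_penalty_clocks(1.0))   # → 50.0 cycles at 1 GHz
print(miss_penalty_clocks(10.0))  # → 500.0 cycles at 10 GHz
```

This is why cache misses hurt more at higher frequency, and why the talk argues for larger on-die memory.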
Special-Purpose Hardware Acceleration
TCP offload engine (TOE): 2.23 mm × 3.54 mm, 260K transistors
[Chart: MIPS (log scale) vs. year (1995-2015); general-purpose MIPS at 75 W vs. TOE MIPS at ~2 W]
Opportunities for acceleration:
• Network processing engines
• MPEG encode/decode engines
• Speech engines
• Wireless communication/baseband
Special-purpose hardware delivers the best MIPS/Watt.

Energy-efficient Data-path Circuits
[Processor thermal map: the execution core's ALUs are the hottest region]
• Integer and FP ALUs and MACs are performance and peak-current limiters
• Their high activity makes them thermal hotspots
Goal: high-performance, energy-efficient design.

130 nm 9 GHz 32-bit Integer ALU (ISSCC'02)
[Die photo: input FIFO, output FIFO, misc. clock, register file, ALU, body-bias control, scan control]
Die size: 1.61 × 1.44 mm
Process: 130 nm CMOS, 1 poly, 6 metal
Transistors: 160K
Maximum Vcc: 1.5 V
Design target: 6.5 GHz at 120 mW
[Charts: Fmax (up to 9 GHz), power (up to 450 mW), and leakage power (up to 100 mW) vs. supply voltage (0.8-1.5 V) for the 32-bit integer execution core]
M. Anders, R. Krishnamurthy et al., Intl. Solid-State Circuits Conf. 2002 & IEEE Journal of Solid-State Circuits 11/02

90 nm 7 GHz 64-bit Integer ALU (ISSCC'04)
[Die microphotograph: upper-order 32-bit ALU, clock generator and drivers, lower-order 32-bit ALU, I/O circuits]
Process: 90 nm dual-Vt CMOS, 7 metal
Die area: 0.474 mm²
64-bit ALU layout area: 0.073 mm²
Total transistor count: 6,100
64-bit ALU average switching power (α = 0.3): 89 mW at 4 GHz, 1.3 V, 25 °C
64-bit ALU active leakage power: 9.6 mW at 1.3 V, 25 °C
64-bit ALU maximum frequency: 7 GHz at 2.1 V, 25 °C
32-bit ALU average switching power (α = 0.3): 71 mW at 7 GHz, 1.3 V, 25 °C
32-bit ALU active leakage power: 4.4 mW at 1.3 V, 25 °C
• 7 GHz single-cycle 64-bit integer ALU (measured in 90 nm CMOS)
• Simultaneous 9 GHz single-cycle 32-bit integer ALU mode
• Fastest reported single-cycle 64-bit integer ALU performance
S. Mathew, R. Krishnamurthy et al., Intl. Solid-State Circuits Conf. 2004 & IEEE Journal of Solid-State Circuits 01/05

90 nm 1 GHz 9 mW 16×16b Multiplier (ISSCC'05)
[Die microphotograph: registers, clock generator and drivers, R-PLA, 16×16b multiplier, I/O circuits]
Process: 90 nm dual-Vt CMOS
Die area: 0.474 mm²
16b multiplier and PLA layout area: 0.03 mm²
16b multiplier worst-case power: 9 mW at 1 GHz, 1.3 V, 50 °C (nominal)
16b multiplier active leakage power: 540 μW at 1.3 V, 50 °C (nominal)
16b multiplier peak performance: 1.5 GHz, 32 mW at 1.95 V, 50 °C
16b multiplier low-voltage mode: 50 MHz, 79 μW at 0.57 V, 50 °C
Reconfigurable PLA peak performance: 2.3 GHz, 4.2 mW at 1.3 V, 50 °C
Reconfigurable PLA worst-case power: 2 mW at 1 GHz, 1.3 V, 50 °C (nominal)
Stand-by mode power: 75 μW (7X reduction vs. active leakage)
S. Hsu, R. Krishnamurthy et al., Intl. Solid-State Circuits Conf. 2005
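The multiplier's headline metrics are self-consistent: energy per operation is power divided by frequency, and GOPS/Watt is its reciprocal. A quick check (the helper names are mine; the slide rounds the result down to 110):

```python
# Cross-checking the reported efficiency numbers.
# mW / GHz = pJ per operation; GOPS/Watt = 1000 * GHz / mW.

def energy_pj_per_op(power_mw, freq_ghz):
    return power_mw / freq_ghz

def gops_per_watt(power_mw, freq_ghz):
    return 1000.0 * freq_ghz / power_mw

# 16x16b multiplier: 9 mW at 1 GHz
print(energy_pj_per_op(9.0, 1.0))      # → 9.0 pJ/op
print(round(gops_per_watt(9.0, 1.0)))  # → 111, i.e. ~110 GOPS/Watt
```

The same arithmetic puts the 64-bit ALU (89 mW at 4 GHz) at roughly 22 pJ per operation, illustrating why the DSP multiplier is the efficiency showcase.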
• 1 GHz single-cycle 16×16-bit DSP multiplier (measured in 90 nm CMOS)
• Reconfigurable PLA control engine
• 9 pJ/op, i.e. 110 GOPS/Watt
• Highest reported GOPS/Watt for a single-cycle 16-bit multiply

32-bit ALU Architecture
[Block diagram: two 6:1 muxes select external operands and loopback-bus values; a 5:1 mux (mux/shift control) feeds the adder core; a 2:1 output mux (mux/sign control) drives Sum onto the loopback bus]
Multiple ALUs are clustered together in the execution core → high power density.

Full-Adder Intro
A full adder takes inputs A, B, and Cin and produces Sum and Cout.

The Binary Adder
S = A ⊕ B ⊕ Ci = A'B'Ci + A'BCi' + AB'Ci' + ABCi
Co = AB + BCi + ACi

The Ripple-Carry Adder
[Diagram: full adders FA0-FA3 chained, with each Co,i feeding the next stage's Ci]
Worst-case delay is linear in the number of bits:
td = O(N);  tadder = (N-1)·tcarry + tsum
Goal: make the fastest possible carry-path circuit.

Static CMOS Full Adder
[Schematic: 28-transistor static CMOS full adder producing S and Co]

Carry Look-ahead
Partial sum: Sumi = Ai ⊕ Bi ⊕ Carryi-1
Carryi = AiBi + (Ai + Bi)·Carryi-1
Define Generate Gi = AiBi and Propagate Pi = Ai + Bi; then
Carryi = Gi + Pi·Carryi-1

High-performance Adders: Kogge-Stone
[Diagram: even and odd PG generation from the input bits, carry-merge stages CM1-CM5, and sum XORs for Sum(even) and Sum(odd)]
Carry-merge operators: GG = Gi + Pi·Gi-1, GP = Pi·Pi-1
Generating all 32 carries with a full-blown binary tree is energy-inefficient.
Number of carry-merge stages = log2(32) = 5

Kogge-Stone Adder
[Diagram: 32-bit Kogge-Stone tree over bits 31...0: PG stage, 5 carry-merge stages, sum XORs]
• Critical path = PG + 5 carry-merge stages + XOR = 7 gate stages
• Generate/propagate fanouts of 2-3
• Maximum interconnect spans 16 bits
• Energy-inefficient

Sparse-tree Adder Architecture
[Diagram: 32-bit sparse tree over bits 31...0, generating every 4th carry (C3, C7, C11, C15, C19, C23, C27) in parallel]
• Side path: 4-bit conditional sum generator
• 73% fewer carry-merge gates → energy-efficient

Non-critical Sum Generator
[Diagram: per-4-bit-group ripple of conditional carry-merge (CM) gates and XORs producing Sumi,0 and Sumi,1 for each bit, selected by 2:1 muxes on the sparse-tree carry]
• Non-critical path is a ripple-carry chain → reduced area, energy consumption, and leakage
• Conditional sums are generated for each bit (for carry-in = 0 and carry-in = 1)
• The sparse-tree carry selects the appropriate sum

Adder Core Critical Path
[Diagram: single-rail dynamic sparse-tree path clk → PG → GG1 → GG3 → GG7 → GG15 → GG27 → C27, then a static sum generator producing Sum31_0 and Sum31_1, a 2:1 mux (CM0/CM1, XOR), and an output latch]
• Critical path: 7 gate stages, the same as Kogge-Stone
• Sparse tree: single-rail dynamic
• The sum generator is non-critical, so it is converted to static logic → semi-dynamic design

Sparse-tree Architecture
Performance impact (20% speedup):
• 33-50% reduced generate/propagate fanouts
• 80% reduced wiring complexity
• 30% reduction in maximum interconnect length
Power impact (56% reduction):
• 73% fewer carry-merge gates
• 50% reduction in average transistor size

Energy-Delay Space (130 nm CMOS, 1.2 V, 110 °C)
[Chart: worst-case energy (pJ) vs. delay (140-280 ps); the semi-dynamic sparse-tree curve sits below the dynamic Kogge-Stone curve, giving 20% speedup at the 4 GHz design point]
• 20% speedup over Kogge-Stone
• 56% worst-case energy reduction
• Savings scale with activity factor
[Chart: average energy (pJ) vs. activity factor (0-0.5) for the dynamic Kogge-Stone, semi-dynamic sparse-tree, and static designs]
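The adder structures above can be sketched at the bit level: a ripple-carry adder walks the carry chain in O(N) steps, while a Kogge-Stone parallel-prefix tree produces all carries in log2(N) carry-merge stages using the merge GG = Gi + Pi·Gi-1, GP = Pi·Pi-1 from the slides. This is a minimal Python model of the logic, not the dynamic circuit, checked against ordinary integer addition:

```python
# Bit-level models of the ripple-carry and Kogge-Stone carry networks.

N = 32
MASK = (1 << N) - 1

def ripple_carries(a, b, cin=0):
    """Carry-out of every bit position, one ripple step at a time: O(N)."""
    carries, c = [], cin
    for i in range(N):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        c = (ai & bi) | ((ai | bi) & c)  # Carry_i = G_i + P_i * Carry_{i-1}
        carries.append(c)
    return carries

def kogge_stone_carries(a, b):
    """All N carries in log2(32) = 5 parallel carry-merge stages."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(N)]  # generate  G_i
    p = [((a >> i) & 1) | ((b >> i) & 1) for i in range(N)]  # propagate P_i
    d = 1
    while d < N:  # per stage: GG = G_i + P_i*G_{i-d}, GP = P_i*P_{i-d}
        g = [g[i] | (p[i] & g[i - d]) if i >= d else g[i] for i in range(N)]
        p = [p[i] & p[i - d] if i >= d else p[i] for i in range(N)]
        d *= 2
    return g  # g[i] is now Carry_i for carry-in = 0

a, b = 0xDEADBEEF, 0x12345678
carries = kogge_stone_carries(a, b)
assert carries == ripple_carries(a, b)

# Sum_i = A_i XOR B_i XOR Carry_{i-1}; check against ordinary addition.
s = 0
for i in range(N):
    cin_i = carries[i - 1] if i else 0
    s |= (((a >> i) & 1) ^ ((b >> i) & 1) ^ cin_i) << i
assert s == (a + b) & MASK
```

The sparse-tree design in the talk keeps this 5-stage depth on the critical path but builds only every 4th carry, recovering the skipped sums from the conditional ripple chains on the non-critical side path.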
• Static sum generators have low switching activity
• 71% lower average energy at 10% activity factor

So, How Do We Get There?
[Chart: MIPS (0.01 to 1,000,000, log scale) vs. year (1970-2010)]
• Era of pipelined architecture: 8086, 286, 386, 486
• Era of instruction-level parallelism: superscalar; speculative, out-of-order
• Era of thread and processor level parallelism: multi-threaded, multi-threaded multi-core, special-purpose HW
Significant challenges lie ahead, and they can only be solved with joint industry-university collaboration.

Thank You for Your Attention. Q&A
Our publications can be found in:
• IEEE Intl. Solid-State Circuits Conference, 2001
• IEEE Journal of Solid-State Circuits, 2001
• Symposium on VLSI Circuits, 1999
• Intl. Symposium on Low-power Design, 1999
• Custom Integrated Circuits Conference, SOCC, etc., 1999

Backup

Optimized First-level Carry-merge
Conditional carry for Cin = 0:
C#_0 = (Gi + Pi·0)# = Gi#
The carry-merge stage reduces to an inverter: conditional carry_0 = Gi#.

Optimized First-level Carry-merge
Conditional carry for Cin = 1:
Ai Bi | Pi Gi | C#_1
 0  0 |  0  0 |  1
 0  1 |  1  0 |  0
 1  0 |  1  0 |  0
 1  1 |  1  1 |  0
Pi and Gi are correlated (Gi = 1 implies Pi = 1), so
C#_1 = (Gi + Pi·1)# = Pi#
Conditional carry_1 = Pi#.

Optimized Sum Generator
[Diagram: optimized first-level carry-merge feeding the conditional XOR/CM chains and 2:1 muxes for Sumi...Sumi+3, with Sumi,0 and Sumi,1 selected by the sparse-tree carry]
The optimized non-critical path is 4 stages.
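The backup-slide identities can be verified exhaustively: with Cin = 0 the first-level carry-merge output is just Gi (so the gate reduces to an inverter producing Gi#), and with Cin = 1 it collapses to Pi, because Gi = 1 forces Pi = 1. A four-case check in Python:

```python
# Exhaustive check of the optimized first-level carry-merge:
# Carry = G + P*Cin reduces to G when Cin = 0 and to P when Cin = 1,
# since G_i = A_i AND B_i is a subset of P_i = A_i OR B_i.

for a in (0, 1):
    for b in (0, 1):
        g, p = a & b, a | b                  # generate, propagate
        carry = lambda cin: g | (p & cin)    # Carry = G + P*Cin
        assert carry(0) == g                 # conditional carry_0 = G_i
        assert carry(1) == p                 # conditional carry_1 = P_i
print("identities hold for all four (A, B) combinations")
```

This is exactly why the optimized sum generator needs only an inverter in place of a full carry-merge gate at the first level, shortening the non-critical path to 4 stages.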