Explaining The Gap Between ASIC and Custom Power: A Custom Perspective

Andrew Chang, Cadence Design Systems*
William J. Dally, Computer Systems Laboratory, Stanford University
* Work done while author was at Stanford

Design Tradeoffs: Power vs. Performance
  [Figure: power vs. performance curves with three numbered moves]
  1. Move to More Energy Efficient Operating Point
       More Energy Efficient w/ Custom
  2. Trade Performance for Power
       Larger Range w/ Custom
  3. Move to Different Power vs. Performance Curve
       More Architectural Choice with Custom

Dynamic Power Dissipation
  $P_{dyn} = \alpha C V_{dd}^2 f = \alpha E_{circuit} f$
  Reduce $V_{dd}$
    Static, dynamic, voltage islands, power gating
  Reduce $\alpha$ and/or $f$
    Clock gating, block enables, bus encoding, glitch identification and elimination
  Reduce $E_{circuit}$
    Engineer interconnects, increase circuit efficiency, subthreshold circuit techniques

Static Power Dissipation
  $P_{static} = V_{dd} (I_{sub} + I_{ox})$
  $I_{sub} = K_1 W e^{-V_t / (n V_\theta)} (1 - e^{-V_{gs}/V_\theta})$
  $I_{ox} = K_2 W (V_{gs}/t_{ox})^2 e^{-\alpha t_{ox}/V_{gs}}$
  With $K_1$, $K_2$, $n$, and $\alpha$ experimentally determined
  Reduce $V_{dd}$
    Static, dynamic, voltage islands, power gating
  Increase effective $V_t$
    Substituting high-threshold devices, transistor stacking, static and active body bias
  Reduce effective $W$
    Reduce number and size of devices in design

Which Design Is More Efficient?
  0.7um CMOS, 173MHz chip w/ 460K transistors
    Vdd (typ) = 3.3V, Vdd (min) = 1.1V
    Power = 845mW
  0.18um CMOS, 10kHz chip w/ 640K transistors
    Vdd (max) = 1.8V, Vdd (min) = 0.18V
    Power = 1.6mW

Talk Outline
  Normalized Metric: Ebit
  Effect of Architecture
  ASIC vs. Custom Building Blocks
  Achievable Energy Efficiency
  16b 1024 FFT Example
  Answer to "Which Design is More Efficient"

Defining Ebit
  $E_{bit} = C_{bit} V_{dd}^2$
  $C_{bit} = 4 \times 2\,\mathrm{fF}/\mu\mathrm{m} \times W_{min}$
  Energy needed to write a 1-bit SRAM cell
  Approximates minimum useful capacitance
  For a range of circuits, the ratio of circuit energy to Ebit remains largely constant with technology scaling

Technology Scaling for Ebit
  Technology   Area (um^2)   Area (c^2)
  0.5um        58            18
  0.18um       5.7           18
  c is a normalized unit of distance equal to the M1 pitch

Technology Scaling for Nand2
  [Figure: NAND2 schematic (inputs A, B; output YN) and layout]
  Layout dimensions: 4c = 2.24um, 8c = 4.48um
  c is a normalized unit of distance equal to the M1 pitch

Applying Ebit
  Energy             180nm            130nm           90nm            65nm
  Ebit (fJ)          3.3              1.4             0.5             0.36

  Relative to Ebit   180nm            130nm           90nm            65nm
  Ebit               1                1               1               1
  1b FO4             ~10              ~10             ~10             ~10
  1b SP-SRAM         0.3-7            0.3-7           0.3-7           0.3-7
  1b RF              4-20+            4-20+           4-20+           4-20+
  1b DFF             20-30+           15-30+          10-30+          10-30+
  1b Nand2           11-30 (typ 19)   5-30 (typ 14)   5-30 (typ 14)   5-30 (typ 14)
  Move 1b 1000 c     ~100             ~100            ~100            ~100
  Move 1b 1.5mm      268              367             467             714

Talk Outline
  Normalized Metric: Ebit
  Effect of Architecture
  ASIC vs.
    Custom Building Blocks
  Achievable Energy Efficiency
  16b 1024 FFT Example
  Answer to "Which Design is More Efficient"

Effect of Architecture
  ASIC Architecture: 6x Efficiency
  NVIDIA GeForceFX
    Design Style: ASIC
    400MHz, 125M Transistors
    ~20 Watts: 10 GFlops & 13 GB/s
  Intel Pentium-4
    Design Style: Custom
    2600MHz, 55M Transistors
    ~60 Watts: 5 GFlops & 5 GB/s

Custom Circuits: 9x (7x) Efficiency
  NVIDIA GeForceFX
    Design Style: Custom
    400MHz, 125M Transistors
    ~3 Watts: 10 GFlops & 13 GB/s
    Vdd = 0.65V
  Intel Pentium-4
    Design Style: Custom
    2600MHz, 55M Transistors
    ~60 Watts: 5 GFlops & 5 GB/s
    Vdd = 1.3V

Combined Architecture and Circuits
  40x+ Improvement, but 1.5 Years vs. 3+ Years

ASIC vs.
Custom
  ASIC Methods
    Provide only coarse-grain control (100K+ gates), but require much less effort and historically scale with complexity
  Custom Methods
    Offer fine-grain control (individual transistors & gates), but require large effort and scale poorly with complexity
  Exploits Design Structure
  Exploits Circuit Techniques

Custom Methods Emphasize Fine-Grain Manual Control + Custom Library
  [Table: design flow steps by design style. Columns: Design Style, Gate Library, Floorplanning/Partitioning, Coarse Placement, Detailed Placement, Coarse Routing, Detailed Routing. Custom row: gate library Complex, Specific. ASIC row: gate library Simple, Generic. Per-step entries range over Manual, Manual/Automated, Automated w/ Hints, and Automated.]
  Operation and Performance Characterized for the Specific Case

ASIC Methods Substitute Coarse-Grain Control Automation + Generic Library
  [Table: the same design flow comparison of Custom and ASIC styles]
  Operation and Performance Characterized for the Typical/Generic Case

ASIC Focus on 100K+ Gates: Lost Opportunities to Exploit Structure
  Designs reuse similar basic building blocks
  Building blocks: 1-10K-gates not
    100K+ gate
    64-bit adder: 1K gates
    64x64 register file (rf): 2K gates
    64x64 multiplier: 20K gates
  Opportunities to exploit these structures are lost when the design is viewed in large chunks

Different Architectures, Similar Building Blocks
  [Figure: annotated die photos of two chips; block labels as extracted include clusters (CLST 0-2, Cluster0-Cluster7), memory banks (Bank 0, Bank 1), memory switch, cluster switch, NIF/router, microcontroller, EX, RF, SRAM, XCVRS, Bus, and CL]
  1998 "MAP" 64b Microprocessor, 5M transistors (MIT/Stanford)
  2002 "Imagine" 32b Stream Processor, 22M transistors (Stanford)

Significant Structure Exists Within 100K-gates
  [Figure: the MAP and Imagine die photos with the same block labels]

Energy of 100K-gate Equivalent
  ASIC (N2)       1400K Ebits (typ)
  Custom Logic     424K Ebits*
  SRAM (small)    1085K Ebits
  SRAM (med)       155K Ebits
  SRAM (large)      50K Ebits
  * Based on data extracted from Intel McKinley

Exploiting Circuit Techniques
  Custom circuits more efficient
    Reduced parasitics
    1.7x circuit techniques and flops
    1.4x libraries
    1.4x due to engineering interconnects
  Subthreshold Circuits
    Low performance but ultra-low power
    Requires architecture, gates, memories, CAD tools

Relating Power to Performance
  CV/I, Idsat, tFO4
  $I_{dsat} = K_3 L_{eff}^{-0.5} t_{ox}^{-0.8} (V_{gs} - V_t)^{1.25}$
  $t_{FO4} = K_4 [C_{eff} V_{dd} / I_{dsat}]$ ($K_4 \approx 13.5$)
  Relating Vdd and Vt to tFO4
  Correlation to Reported Foundry Data
  Technology Node
                     CV/I est (ps)   CV/I reported (ps)   tFO4 est (ps)
  Foundry A 180-nm   3.94            3.70                 53
  Foundry A 130-nm   2.55            2.17                 34
  Foundry A 90-nm    1.85            2.04                 25
  Foundry A 65-nm    1.45            1.00                 20

Achievable Power Improvement (Assuming 50/50 Split of Logic and Memory)

  Dynamic
  Technique                         Custom vs. ASIC   Energy   Type
  Circuit Styles and Flops          1.7               0.815    Logic
  Libraries + Vdd Scaling           1.4               0.855    Logic
  SRAM Circuits                     2                 0.95     SRAM
  Interconnect + Vdd Scaling        1.4               0.855    Interconnect
  Bit Encoding                      1                 0.84     Interconnect
  Clock Gating                      1                 0.84     Chip
  Frequency Scaling                 1                 0.5      Chip
  Subthreshold Circuits             N/A               0.062    Chip

  Static
  Technique                         Custom vs. ASIC   Energy   Type
  Vdd Scaling                       1                 0.79     Chip
  MT-CMOS                           1                 0.5      Chip
  Stacking and input state vector   1.4               0.7      Chip*
  Body Bias                         2                 0.5      Chip*
  Supply Gating                     10                0.1      Chip*
  * typically only one of these three is applied

  Net
  Type            Tech     Net Dynamic   Net Static   Total
  ASIC (Custom)   130-nm   45% (32%)     8% (4%)      53% (36%)
  ASIC (Custom)   90-nm    28% (20%)     20% (10%)    48% (30%)
  130nm uP assumes 80% Dynamic and 20% Static
  90nm uP assumes 50% Dynamic and 50% Static

Talk Outline
  Normalized Metric: Ebit
  Effect of Architecture
  ASIC vs.
    Custom Building Blocks
  Achievable Energy Efficiency
  16b 1024 FFT Example
  Answer to "Which Design is More Efficient"

16b 1024 point FFT
  Generally, k N log N operations (complex multiplies) with precomputation
  Radix-2, Radix-4, etc. implementations
  Decimation in time and/or decimation in frequency

Range of Implementations
  MIT FFT (2005)
    0.18um CMOS, 628K transistors, 10kHz: architecture and subthreshold circuits, 180mV operation
  Spiffee (1999)
    0.7um CMOS, 460K transistors, 173MHz: cached FFT architecture and algorithm, 1.1V operation
  SA-1100 (1999)
    0.35um CMOS, 2.6M transistors, 74MHz: commercial embedded processor, custom circuits, 1.5V operation
  Imagine (2003)
    0.15um CMOS, 22M transistors, 232MHz: streaming media processor, tiled standard cells, 1.2V operation
  Stratix IS25F627C8 (2005)
    0.13um CMOS, 3.9K logic elements, 123K memory bits, 24 DSP blocks, 272MHz: commercial FPGA co-processor
  Intel P4 (2003)
    0.13um CMOS, 3GHz, SSE: commercial general-purpose processor, custom circuits, 1.5V operation
  TI 'C6416 (2003)
    0.13um CMOS, 720MHz: commercial digital signal processor

Ebit Energy 16b 1024 point FFT
  Design      Fab (nm)   Vdd (V)   MHz    mW      Cycles
  MIT FFT     180        1.8       0.01   1.6     95
  Spiffee     700        3.3       173    845     5190
  SA-1100     350        2         74     39      31500
  Imagine     150        1.5       232    4000    3708
  Stratix     130        1.3       275    884     1291
  Intel P4    130        1.2       3000   51200   71680
  TI 'C6416   130        1.2       720    1200    6526

  Design      EDP (rel norm)   Ebit (fJ)   Efft (nJ)   Normalized to Ebit (1e6)   Energy Ratio
  MIT FFT     143              3.3         154         47                         1
  Spiffee     1                91          25350       277                        6
  SA-1100     283              4.2         16601       3953                       85
  Imagine     148              2.2         63931       29726                      637
  Stratix     24               1.4         4149        2964                       64
  Intel P4    12548            1.4         1E+06       873813                     18591
  TI 'C6416   27               1.4         10877       7769                       166

Which Design Is More Efficient? Depends on the Metric!
  0.7um CMOS, 173MHz chip w/ 460K transistors
    Vdd (typ) = 3.3V, Vdd (min) = 1.1V
    Power = 845mW
    EDP 143x better
  0.18um CMOS, 10kHz chip w/ 640K transistors
    Vdd (max) = 1.8V, Vdd (min) = 0.18V
    Power = 1.6mW
    Absolute energy 6x better

Summary
  The normalized metric, Ebit, enables meaningful comparisons across designs and technologies
  Custom designers can exploit a wide range of optimizations, enabling architecture with circuits and circuits with architecture
  Custom designs can readily achieve a 3x advantage in energy, with the potential for over 10x
  Selective application of custom techniques, together with automated support for performance characterization at specific instead of generic operating points, can enable ASIC designers to begin to bridge this power gap

Back-Up Slides

ASICs Rely on General Optimization Techniques
  Focus: Improve the Average Case
  Partitioning: hypergraph min-cut, ratio cut
  Solutions: move-based, geometric & combinatorial forms, clustering
  [Figure: a circuit and its hypergraph H(V,E) with vertices V1-V5 and edges e1-e8; E = {e1, e2, ...} are the nets]
  Vertex & edge weights used to encode costs

Designs with Structure Do Not Exhibit Average Characteristics
  [Figure: density and routing maps of a 64b multiplier (half-array)]
  Clear Disparity in Resource Usage
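
Worked example: hypergraph cut metrics
  The partitioning back-up slide names min-cut and ratio cut on a hypergraph H(V,E) whose edges are nets. The Python sketch below shows one way those two costs could be evaluated for a bipartition. The netlist and net weights are hypothetical (the connectivity in the slide's figure is not recoverable), vertex weights are taken as 1, and ratio cut is assumed to mean cut weight divided by the product of the two partition sizes.

```python
# Hypothetical 5-vertex netlist: net name -> (set of pins, net weight).
# This is NOT the hypergraph drawn on the slide; it is for illustration only.
NETS = {
    "e1": ({"V1", "V2"}, 1.0),
    "e2": ({"V2", "V4"}, 1.0),
    "e3": ({"V2", "V3", "V4"}, 2.0),
    "e4": ({"V1", "V3"}, 1.0),
    "e5": ({"V3", "V5"}, 1.0),
}
VERTICES = {"V1", "V2", "V3", "V4", "V5"}


def cut_weight(nets, part_a):
    """Sum the weights of nets that have pins on both sides of the cut."""
    total = 0.0
    for pins, weight in nets.values():
        inside = pins & part_a
        if inside and inside != pins:  # some pins in A, some outside A
            total += weight
    return total


def ratio_cut(nets, part_a, vertices):
    """Cut weight divided by |A| * |B| (assumed definition, unit vertex sizes)."""
    part_b = vertices - part_a
    return cut_weight(nets, part_a) / (len(part_a) * len(part_b))


if __name__ == "__main__":
    part_a = {"V1", "V2"}
    print("cut weight:", cut_weight(NETS, part_a))
    print("ratio cut :", ratio_cut(NETS, part_a, VERTICES))
```

  A move-based partitioner would evaluate costs like these repeatedly while moving one vertex at a time; the sketch only scores a given partition.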
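
Worked example: architecture and circuit efficiency ratios
  The Effect of Architecture slides quote 6x, 9x (7x), and 40x+ factors. The sketch below recomputes ratios from the GFlops and Watts figures on those slides, assuming efficiency means GFlops per Watt. The dictionary keys and function name are illustrative, and the 9x figure is not derived here because the slides do not say how it was obtained.

```python
# Figures copied from the slides; names are illustrative.
CHIPS = {
    "geforcefx_asic": {"gflops": 10.0, "watts": 20.0, "vdd": None},
    "geforcefx_custom": {"gflops": 10.0, "watts": 3.0, "vdd": 0.65},
    "pentium4_custom": {"gflops": 5.0, "watts": 60.0, "vdd": 1.3},
}


def gflops_per_watt(chip):
    """Efficiency metric assumed for this sketch."""
    return chip["gflops"] / chip["watts"]


# Architecture alone: ASIC GeForceFX vs. custom Pentium-4.
arch_gain = (gflops_per_watt(CHIPS["geforcefx_asic"])
             / gflops_per_watt(CHIPS["pentium4_custom"]))

# Circuits alone: the same GeForceFX architecture at ~3 W instead of ~20 W.
circuit_gain = (gflops_per_watt(CHIPS["geforcefx_custom"])
                / gflops_per_watt(CHIPS["geforcefx_asic"]))

# Both together: custom GeForceFX vs. custom Pentium-4.
combined_gain = (gflops_per_watt(CHIPS["geforcefx_custom"])
                 / gflops_per_watt(CHIPS["pentium4_custom"]))

# Share of the gain attributable to Vdd^2 scaling from 1.3 V to 0.65 V.
vdd_gain = (CHIPS["pentium4_custom"]["vdd"]
            / CHIPS["geforcefx_custom"]["vdd"]) ** 2

if __name__ == "__main__":
    print(f"architecture: {arch_gain:.1f}x")
    print(f"circuits    : {circuit_gain:.1f}x")
    print(f"combined    : {combined_gain:.1f}x")
    print(f"Vdd^2 term  : {vdd_gain:.1f}x")
```

  Under these assumptions the architecture and combined ratios come out at about 6x and 40x, and the circuit ratio at about 6.7x, in line with the 7x in parentheses on the slide.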
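
Worked example: tFO4 from CV/I
  The Relating Power to Performance slide gives tFO4 = K4 [Ceff Vdd / Idsat] with K4 of about 13.5, and tabulates estimated CV/I per node. The sketch below multiplies the tabulated CV/I estimates by K4 and compares the result with the tFO4 column. The variable names are illustrative; the numbers are the ones on the slide.

```python
K4 = 13.5  # fitting constant quoted on the slide

# Estimated CV/I in ps and the slide's tFO4 estimate in ps, per node.
CV_OVER_I_EST_PS = {"180-nm": 3.94, "130-nm": 2.55, "90-nm": 1.85, "65-nm": 1.45}
TFO4_SLIDE_PS = {"180-nm": 53, "130-nm": 34, "90-nm": 25, "65-nm": 20}


def tfo4_ps(cv_over_i_ps, k4=K4):
    """FO4 delay estimate: K4 times the CV/I delay metric."""
    return k4 * cv_over_i_ps


if __name__ == "__main__":
    for node, cv_i in CV_OVER_I_EST_PS.items():
        est = tfo4_ps(cv_i)
        print(f"{node}: K4 * CV/I = {est:.2f} ps (slide: {TFO4_SLIDE_PS[node]} ps)")
```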
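
Worked example: combining the Achievable Power Improvement factors
  The Achievable Power Improvement tables list a per-technique "Energy" factor and then net percentages for 130-nm and 90-nm. The sketch below is one reading of how those could combine, not something the slides state: it assumes the factors multiply, that the unparenthesized figures use the first four dynamic techniques plus Vdd scaling and MT-CMOS for static power, and that the parenthesized figures additionally apply bit encoding, clock gating, and one 0.5 static factor (the value listed for body bias). The groupings and names are assumptions chosen because they reproduce the table.

```python
from math import prod

# "Energy" factors copied from the slides; the grouping is an assumption.
DYNAMIC_BASE = [0.815, 0.855, 0.95, 0.855]  # circuits/flops, libraries, SRAM, interconnect
DYNAMIC_EXTRA = [0.84, 0.84]                # bit encoding, clock gating
STATIC_BASE = [0.79, 0.5]                   # Vdd scaling, MT-CMOS
STATIC_EXTRA = [0.5]                        # one further technique, e.g. body bias

# Dynamic share of total power assumed on the slide for each node.
DYNAMIC_SHARE = {"130-nm": 0.8, "90-nm": 0.5}


def net_power(node, extended):
    """Return (net dynamic, net static, total) as fractions of original power."""
    dyn_share = DYNAMIC_SHARE[node]
    dyn = prod(DYNAMIC_BASE + (DYNAMIC_EXTRA if extended else []))
    sta = prod(STATIC_BASE + (STATIC_EXTRA if extended else []))
    net_dyn = dyn_share * dyn
    net_sta = (1.0 - dyn_share) * sta
    return net_dyn, net_sta, net_dyn + net_sta


if __name__ == "__main__":
    for node in DYNAMIC_SHARE:
        for extended in (False, True):
            dyn, sta, total = net_power(node, extended)
            label = "extended" if extended else "base"
            print(f"{node} {label}: dynamic {dyn:.1%}, static {sta:.1%}, total {total:.1%}")
```

  Rounded to whole percentages, these products match the slide's net table for both nodes, which suggests (but does not prove) that the table was built this way.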
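
Worked example: FFT energy and Ebit normalization
  The two Ebit Energy tables list power, clock, and cycle count per design, then energy per FFT and energy normalized to Ebit. The sketch below assumes energy per FFT is power times cycles divided by clock frequency, and that the normalized column is that energy divided by the design's Ebit. Names are illustrative. The MIT FFT and Intel P4 rows are left out because their listed energies (154 nJ and a rounded 1E+06 nJ) cannot be checked the same way from the extracted columns.

```python
# Per design: (mW, MHz, cycles, Ebit in fJ, listed Efft in nJ, listed normalized value in 1e6 Ebit).
DESIGNS = {
    "Spiffee": (845, 173, 5190, 91, 25350, 277),
    "SA-1100": (39, 74, 31500, 4.2, 16601, 3953),
    "Imagine": (4000, 232, 3708, 2.2, 63931, 29726),
    "Stratix": (884, 275, 1291, 1.4, 4149, 2964),
    "TI 'C6416": (1200, 720, 6526, 1.4, 10877, 7769),
}


def fft_energy_nj(power_mw, clock_mhz, cycles):
    """Energy per FFT in nJ: mW * cycles / MHz = 1e-3 W * cycles / 1e6 Hz = 1e-9 J."""
    return power_mw * cycles / clock_mhz


def normalized_to_ebit(energy_nj, ebit_fj):
    """Energy in units of 1e6 Ebit: nJ / fJ = 1e6."""
    return energy_nj / ebit_fj


if __name__ == "__main__":
    for name, (mw, mhz, cycles, ebit_fj, _, _) in DESIGNS.items():
        energy = fft_energy_nj(mw, mhz, cycles)
        norm = normalized_to_ebit(energy, ebit_fj)
        print(f"{name:10s} Efft = {energy:9.0f} nJ, normalized = {norm:7.0f} x 1e6 Ebit")
```

  The recomputed energies agree with the listed Efft column to well under 1%; the normalized values agree to within a few percent, which is consistent with the Ebit column being rounded.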