VLSI Datapath Choices: Cell-Based Versus Full

Explaining The Gap Between ASIC and Custom Power:
A Custom Perspective
Andrew Chang
Cadence Design Systems*
William J. Dally
Computer Systems Laboratory
Stanford University
* Work done while Author was at Stanford
4
Design Tradeoffs: Power vs. Performance
[Figure: Power vs. Performance curves, with points 1, 2, 3 marking the moves below]
1. Move to More Energy Efficient Operating Point
 More Energy Efficient w/ Custom
2. Trade Performance for Power
 Larger Range w/ Custom
3. Move to Different Power vs. Performance Curve
 More Architectural Choice with Custom
5
Dynamic Power Dissipation
$P_{dyn} = \alpha C V_{dd}^2 f = \alpha E_{circuit} f$
Reduce Vdd
 Static, dynamic, voltage islands, power gating
Reduce α and/or f
 Clock gating, block enables, bus encoding, glitch identification and elimination
Reduce Ecircuit
 Engineer interconnects, increase circuit efficiency, subthreshold circuit techniques
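As a rough illustration of the dynamic power relation on this slide, the sketch below evaluates it at two supply voltages. The function name and every numeric value are assumptions chosen for illustration; none of them come from the talk.

```python
def dynamic_power(alpha, c_farads, vdd, freq_hz):
    """Dynamic power in watts: alpha * C * Vdd^2 * f."""
    return alpha * c_farads * vdd ** 2 * freq_hz

# Hypothetical operating point: 10% activity, 1 nF switched capacitance, 100 MHz.
base = dynamic_power(alpha=0.1, c_farads=1e-9, vdd=1.8, freq_hz=100e6)

# Same alpha, C and f at half the supply voltage.
half_vdd = dynamic_power(alpha=0.1, c_farads=1e-9, vdd=0.9, freq_hz=100e6)

# The quadratic dependence on Vdd gives a 4x reduction.
vdd_gain = base / half_vdd
print(base, half_vdd, vdd_gain)
```

Because Vdd enters squared while α and f enter linearly, halving the supply saves more than halving either of the other two, which is one way to read the ordering of the bullets above.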
6
Static Power Dissipation
$P_{static} = V_{dd} (I_{sub} + I_{ox})$
$I_{sub} = K_1 W e^{-V_t / (n V_\theta)} (1 - e^{-V_{gs}/V_\theta})$
$I_{ox} = K_2 W (V_{gs}/t_{ox})^2 e^{-\alpha t_{ox}/V_{gs}}$
With $K_1$, $K_2$, $n$, and $\alpha$ experimentally determined
Reduce Vdd
 Static, dynamic, voltage islands, power gating
Increase effective Vt
 Substituting high-threshold devices, transistor stacking, static and active
body bias
Reduce effective W
 Reduce number and size of devices in design
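A minimal sketch of the static power model on this slide follows. It reads the symbol printed next to V in the exponents as the thermal voltage (an assumption about the extracted text), and all constants are hypothetical placeholders rather than fitted values.

```python
import math

def i_sub(k1, w, vt, n, v_theta, vgs):
    """Subthreshold leakage: K1 * W * exp(-Vt/(n*Vtheta)) * (1 - exp(-Vgs/Vtheta))."""
    return k1 * w * math.exp(-vt / (n * v_theta)) * (1.0 - math.exp(-vgs / v_theta))

def i_ox(k2, w, vgs, tox, a):
    """Gate-oxide leakage: K2 * W * (Vgs/tox)^2 * exp(-a*tox/Vgs)."""
    return k2 * w * (vgs / tox) ** 2 * math.exp(-a * tox / vgs)

def p_static(vdd, isub, iox):
    """Static power: Vdd * (Isub + Iox)."""
    return vdd * (isub + iox)

# Hypothetical constants in arbitrary but consistent units.
K1, K2, N, A = 1e-6, 1e-9, 1.5, 10.0
V_THETA = 0.026              # assumed thermal voltage near room temperature (V)
VGS, TOX, W = 1.2, 2.0, 1.0

low_vt = i_sub(K1, W, vt=0.30, n=N, v_theta=V_THETA, vgs=VGS)
high_vt = i_sub(K1, W, vt=0.40, n=N, v_theta=V_THETA, vgs=VGS)
half_w = i_sub(K1, W / 2, vt=0.30, n=N, v_theta=V_THETA, vgs=VGS)

vt_gain = low_vt / high_vt   # effect of raising Vt by 100 mV
w_gain = low_vt / half_w     # effect of halving W
total = p_static(VGS, low_vt, i_ox(K2, W, VGS, TOX, A))
print(vt_gain, w_gain, total)
```

In this model, leakage falls exponentially with Vt but only linearly with W, which matches the two levers listed on the slide.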
9
Which Design Is More Efficient?
 0.7um CMOS 173MHz chip w/ 460K T’s
 Vdd (typ) = 3.3V, Vdd (min) = 1.1V
 Power = 845mW
0.18um CMOS 10kHz chip w/ 640K T’s
 Vdd (max) = 1.8V, Vdd (min) = 0.18V
 Power = 1.6mW
10
Talk Outline
 Normalized Metric: Ebit
 Effect of Architecture
 ASIC vs. Custom
Building Blocks
Achievable Energy Efficiency
 16b 1024 FFT Example
 Answer to “Which Design is More Efficient”
12
Defining Ebit
$E_{bit} = C_{bit} \times V_{dd}^2$
$C_{bit} = 4 \times 2\ \mathrm{fF}/\mu\mathrm{m} \times W_{min}$
Energy needed to write a 1-bit SRAM cell
Approximates minimum useful capacitance
The ratio of Ebit to the energy for a range of
circuits remains largely constant with technology
scaling
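The Ebit definition can be sketched as a small calculation. The minimum width used below is a hypothetical value, so the result is not meant to reproduce any number from the talk; the helper names are also assumptions.

```python
def c_bit_fF(w_min_um, cap_per_um_fF=2.0, factor=4):
    """Cbit = 4 * 2 fF/um * Wmin, returned in femtofarads."""
    return factor * cap_per_um_fF * w_min_um

def e_bit_fJ(w_min_um, vdd):
    """Ebit = Cbit * Vdd^2; fF times V^2 gives femtojoules."""
    return c_bit_fF(w_min_um) * vdd ** 2

# Hypothetical minimum device width of 0.2 um at a 1.8 V supply.
e_nominal = e_bit_fJ(w_min_um=0.2, vdd=1.8)

# Halving Vdd at the same width scales Ebit by 1/4.
e_half_vdd = e_bit_fJ(w_min_um=0.2, vdd=0.9)
print(e_nominal, e_half_vdd)
```

Since Ebit carries both the technology's capacitance and the supply voltage, dividing a circuit's energy by Ebit removes most of the process dependence, which is the point of the normalization.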
13
Technology Scaling for Ebit
| Technology | Area (µm²) | Area (c²) |
|---|---|---|
| 0.5 µm | 58 | 18 |
| 0.18 µm | 5.7 | 18 |
 c is a normalized unit of distance equal to the M1 pitch
14
Technology Scaling for Nand2
[Figure: NAND2 layout with inputs A, B and output YN; dimensions marked 4c = 2.24 µm and 8c = 4.48 µm]
 c is a normalized unit of distance equal to the M1 pitch
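The normalized unit c can be backed out of the NAND2 dimensions and checked against the 18 c² area from the "Technology Scaling for Ebit" slide. This is a sketch. Treating the NAND2 figure as a 0.18 µm layout is an inference from the numbers, not something the slides state.

```python
# From the NAND2 figure: 4c = 2.24 um.
c_um = 2.24 / 4

# Check the second marked dimension: 8c should be 4.48 um.
eight_c_um = 8 * c_um

# An 18 c^2 cell expressed in square microns for this value of c.
area_18c2_um2 = 18 * c_um ** 2

print(c_um, eight_c_um, area_18c2_um2)
```

The resulting area of about 5.6 µm² is close to the 5.7 µm² listed for 0.18 µm, which suggests the two slides use the same M1 pitch.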
15
Applying Ebit
| Energy | 180nm | 130nm | 90nm | 65nm |
|---|---|---|---|---|
| Ebit (fJ) | 3.3 | 1.4 | 0.5 | 0.36 |

| Relative | 180nm | 130nm | 90nm | 65nm |
|---|---|---|---|---|
| Ebit | 1 | 1 | 1 | 1 |
| 1b FO4 | ~10 | ~10 | ~10 | ~10 |
| 1b SP-SRAM | 0.3-7 | 0.3-7 | 0.3-7 | 0.3-7 |
| 1b RF | 4-20+ | 4-20+ | 4-20+ | 4-20+ |
| 1b DFF | 20-30+ | 15-30+ | 10-30+ | 10-30+ |
| 1b Nand2 | 11-30 (typ 19) | 5-30 (typ 14) | 5-30 (typ 14) | 5-30 (typ 14) |
| Move 1b 1000 c | ~100 | ~100 | ~100 | ~100 |
| Move 1b 1.5mm | 268 | 367 | 467 | 714 |
20
Effect of Architecture
ASIC Architecture: 6x Efficiency
NVIDIA GeForceFX
Design Style: ASIC
400MHz – 125M Transistors
~20 Watts: 10GFlops & 13 GBs
Intel Pentium-4
Design Style: Custom
2600MHz – 55M Transistors
~60 Watts: 5GFlops & 5 Gbs
21
Custom Circuits: 9x (7x) Efficiency
NVIDIA GeForceFX
Design Style: Custom
400MHz – 125M Transistors
~3 Watts: 10GFlops & 13 GBs
Vdd = 0.65V
Intel Pentium-4
Design Style: Custom
2600MHz – 55M Transistors
~60 Watts: 5GFlops & 5 Gbs
Vdd = 1.3V
22
Combined Architecture and Circuits
40x+ Improvement but 1.5 Years vs. 3+ Years
NVIDIA GeForceFX
Design Style: Custom
400MHz – 125M Transistors
~3 Watts: 10GFlops & 13 GBs
Vdd = 0.65V
Intel Pentium-4
Design Style: Custom
2600MHz – 55M Transistors
~60 Watts: 5GFlops & 5 Gbs
Vdd = 1.3V
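The efficiency ratios quoted on these slides can be approximated from the listed GFlops and watts. This is a sketch using only those two columns. The 9x figure is not reproduced by this simple ratio, and the dictionary keys are labels made up for the example.

```python
# (GFlops, watts) as listed on the slides.
designs = {
    "GeForceFX ASIC": (10.0, 20.0),
    "GeForceFX custom": (10.0, 3.0),   # the ~3 W, Vdd = 0.65V variant
    "Pentium-4 custom": (5.0, 60.0),
}

# Energy efficiency in GFlops per watt.
eff = {name: gflops / watts for name, (gflops, watts) in designs.items()}

arch_gain = eff["GeForceFX ASIC"] / eff["Pentium-4 custom"]       # architecture
circuit_gain = eff["GeForceFX custom"] / eff["GeForceFX ASIC"]    # circuits
combined_gain = eff["GeForceFX custom"] / eff["Pentium-4 custom"]  # both

print(arch_gain, circuit_gain, combined_gain)
```

The three ratios come out to 6x, roughly 6.7x (close to the 7x in parentheses), and 40x, in line with the slide titles.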
25
ASIC vs. Custom
ASIC Methods
 Provide only coarse-grain control (100K+ gates), but require much less effort and have historically scaled with complexity
Custom Methods
 Offer fine-grain control (individual transistors & gates), but require large effort and scale poorly with complexity
 Exploit Design Structure
 Exploit Circuit Techniques
27
Custom Methods Emphasize
Fine-Grain Manual Control + Custom Library
[Table: Design Style (Custom, ASIC) vs. Gate Library, Floorplanning/Partitioning, Coarse Placement, Detailed Placement, Coarse Routing, Detailed Routing]
Gate Library: Custom = Complex, Specific; ASIC = Simple, Generic
Flow step entries (cell order not recoverable): Manual (x5), Manual/Automated, Automated w/ Hints, Automated (x4)
Operation and Performance Characterized for the Specific Case
29
ASIC Methods Substitute
Coarse-Grain Control
Automation + Generic Library
[Table: Design Style (Custom, ASIC) vs. Gate Library, Floorplanning/Partitioning, Coarse Placement, Detailed Placement, Coarse Routing, Detailed Routing]
Gate Library: Custom = Complex, Specific; ASIC = Simple, Generic
Flow step entries (cell order not recoverable): Manual (x5), Manual/Automated, Automated w/ Hints, Automated (x4)
Operation and Performance Characterized for the Typical/Generic Case
30
ASIC Focus on 100K+ Gates
Lost Opportunities to Exploit Structure
 Designs reuse similar basic building blocks
 Building blocks: 1-10K gates, not 100K+ gates
   64-bit adder: 1K gates
   64x64 rf: 2K gates
   64x64 multiplier: 20K gates
 Opportunities to exploit these structures are lost when the design is viewed in large chunks
31
Different Architectures
Similar Building Blocks
[Figure: annotated chip diagrams of two processors. Labeled blocks include memory banks (Bank 0, Bank 1), MEMORY SWITCH, NIF/ROUTER, clusters (CLST 0-2, Cluster0-7), CLUSTER SWITCH, Microcontroller, SRAM, XCVRS, Bus, EX, RF, and CL]
1998 “MAP” 64b Microprocessor - 5M T’s (MIT/Stanford)
2002 “Imagine” 32b Stream Processor - 22M T’s (Stanford)
32
Significant Structure Exists
Within 100K-gates
[Figure: annotated chip diagrams of two processors. Labeled blocks include memory banks (Bank 0, Bank 1), MEMORY SWITCH, NIF/ROUTER, clusters (CLST 0-2, Cluster0-7), CLUSTER SWITCH, Microcontroller, SRAM, XCVRS, Bus, EX, RF, and CL]
1998 “MAP” 64b Microprocessor - 5M T’s (MIT/Stanford)
2002 “Imagine” 32b Stream Processor - 22M T’s (Stanford)
33
Energy of 100K-gate Equivalent
 ASIC (N2) = 1400K Ebits (typ)
 Custom Logic = 424K Ebits*
 SRAM (small) = 1085K Ebits
 SRAM (med) = 155K Ebits
 SRAM (large) = 50K Ebits
*Based on data extracted from Intel McKinley
34
Exploiting Circuit Techniques
 Custom circuits more efficient
   Reduced parasitics
   1.7x circuit techniques and flops
   1.4x libraries
   1.4x due to engineering interconnects
 Subthreshold Circuits
   Low Performance but ultra-low power
   Requires Architecture, Gates, Memories, CAD Tools
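The three factors listed for custom circuits can be compared with the 100K-gate energy figures given earlier (1400K Ebits for ASIC, 424K Ebits for custom logic). The sketch below assumes the factors compose multiplicatively, which is an assumption made here rather than a statement from the talk.

```python
import math

# Per-technique advantages from the slide.
factors = {
    "circuit techniques and flops": 1.7,
    "libraries": 1.4,
    "engineering interconnects": 1.4,
}

# Assumed: independent factors multiply.
combined = math.prod(factors.values())

# Ratio of the 100K-gate equivalent energies, ASIC (N2) over custom logic.
asic_ebits = 1400e3
custom_ebits = 424e3
measured_ratio = asic_ebits / custom_ebits

print(combined, measured_ratio)
```

Both numbers land near 3.3x, which is consistent with the roughly 3x energy advantage claimed for custom designs.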
35
Relating Power to Performance
CV/I, Idsat, tFO4
$I_{dsat} = K_3 L_{eff}^{-0.5} t_{ox}^{-0.8} (V_{gs} - V_t)^{1.25}$
$t_{FO4} = K_4 [C_{eff} V_{dd} / I_{dsat}]$ ($K_4 \approx 13.5$)
Relating Vdd and Vt to tFO4
Correlation to Reported Foundry Data
| Technology Node | CV/I est (ps) | CV/I reported (ps) | tFO4 est (ps) |
|---|---|---|---|
| Foundry A 180-nm | 3.94 | 3.70 | 53 |
| Foundry A 130-nm | 2.55 | 2.17 | 34 |
| Foundry A 90-nm | 1.85 | 2.04 | 25 |
| Foundry A 65-nm | 1.45 | 1.00 | 20 |
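The tFO4 estimates in the foundry table follow from the estimated CV/I values and K4 of about 13.5. A short sketch, with variable names chosen for the example:

```python
K4 = 13.5  # scaling constant from the slide (approximate)

# Estimated CV/I in picoseconds per node, from the table.
cv_over_i_est_ps = {
    "180-nm": 3.94,
    "130-nm": 2.55,
    "90-nm": 1.85,
    "65-nm": 1.45,
}

# tFO4 = K4 * (Ceff * Vdd / Idsat), with the bracketed term taken as CV/I.
t_fo4_ps = {node: K4 * cvi for node, cvi in cv_over_i_est_ps.items()}

for node, t in t_fo4_ps.items():
    print(node, round(t))
```

Rounded to the nearest picosecond, the results match the tFO4 est column for all four nodes.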
38
Achievable Power Improvement
(Assuming 50/50 Split of Logic and Memory)
| Power Type | Technique | Custom vs. ASIC | Energy | Type |
|---|---|---|---|---|
| Dynamic | Circuit Styles and Flops | 1.7 | 0.815 | Logic |
| Dynamic | Libraries + Vdd Scaling | 1.4 | 0.855 | Logic |
| Dynamic | SRAM Circuits | 2 | 0.95 | SRAM |
| Dynamic | Interconnect + Vdd Scaling | 1.4 | 0.855 | Interconnect |
39
Achievable Power Improvement
(Assuming 50/50 Split of Logic and Memory)
| Power Type | Technique | Custom vs. ASIC | Energy | Type |
|---|---|---|---|---|
| Dynamic | Bit Encoding | 1 | 0.84 | Interconnect |
| Dynamic | Clock Gating | 1 | 0.84 | Chip |
| Dynamic | Frequency Scaling | 1 | 0.5 | Chip |
| Dynamic | Subthreshold Circuits | N/A | 0.062 | Chip |
40
Achievable Power Improvement
(Assuming 50/50 Split of Logic and Memory)
| Power Type | Technique | Custom vs. ASIC | Energy | Type |
|---|---|---|---|---|
| Static | Vdd Scaling | 1 | 0.79 | Chip |
| Static | MT-CMOS | 1 | 0.5 | Chip |
| Static | Stacking and input state vector | 1.4 | 0.7 | Chip* |
| Static | Body Bias | 2 | 0.5 | Chip* |
| Static | Supply Gating | 10 | 0.1 | Chip* |
*Typically only one of these three is applied
41
Achievable Power Improvement
Assuming 50/50 Split of Logic and Memory
| Tech | Type | Net Dynamic | Net Static | Total |
|---|---|---|---|---|
| 130-nm | ASIC (Custom) | 45% (32%) | 8% (4%) | 53% (36%) |
| 90-nm | ASIC (Custom) | 28% (20%) | 20% (10%) | 48% (30%) |
130nm uP assumes 80% Dynamic and 20% Static
90nm uP assumes 50% Dynamic and 50% Static
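The totals on this slide can be reproduced by weighting a remaining-dynamic and a remaining-static fraction by the assumed dynamic/static split. The four fractions below (0.56 and 0.40 for ASIC, 0.40 and 0.20 for custom) are inferred by dividing the slide's net percentages by the split. The talk does not state them, so treat them as assumptions.

```python
# Inferred remaining fractions after optimization (assumptions, see lead-in).
REMAINING = {
    "ASIC": {"dynamic": 0.56, "static": 0.40},
    "Custom": {"dynamic": 0.40, "static": 0.20},
}

# Dynamic share of total power before optimization, per the slide.
DYNAMIC_SHARE = {"130-nm": 0.80, "90-nm": 0.50}

def net_power(tech, style):
    """Return (net dynamic, net static, total) as fractions of original power."""
    dyn_share = DYNAMIC_SHARE[tech]
    net_dyn = dyn_share * REMAINING[style]["dynamic"]
    net_stat = (1.0 - dyn_share) * REMAINING[style]["static"]
    return net_dyn, net_stat, net_dyn + net_stat

totals = {
    (tech, style): round(100 * net_power(tech, style)[2])
    for tech in DYNAMIC_SHARE
    for style in REMAINING
}
print(totals)
```

Under these inferred fractions the totals come out to 53% and 36% at 130 nm and 48% and 30% at 90 nm, so the gap between ASIC and custom stays large even as static power grows.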
44
16b 1024 point FFT
 Generally, k N log N operations
(complex multiplies) with precomputation
 Radix-2, Radix-4 etc…
implementations
 Decimation in time and/or decimation in
Frequency
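For a sense of scale, the sketch below counts stages and butterflies for a 1024-point FFT using the standard textbook counts for radix-2 and radix-4 decompositions. The constant k and the exact multiply count depend on the implementation, so these are only structural counts. The function names are made up for the example.

```python
def stages(n, radix):
    """Number of radix-r stages for an n-point FFT (n must be a power of radix)."""
    count = 0
    while n > 1:
        if n % radix:
            raise ValueError("n is not a power of the radix")
        n //= radix
        count += 1
    return count

def butterflies(n, radix):
    """Each stage has n/radix butterflies."""
    return (n // radix) * stages(n, radix)

N = 1024
n_log2_n = N * stages(N, 2)        # the N log N term with base-2 log
radix2 = butterflies(N, 2)         # 10 stages of 512 butterflies
radix4 = butterflies(N, 4)         # 5 stages of 256 butterflies
print(n_log2_n, radix2, radix4)
```

A higher radix needs fewer stages and fewer butterflies, each of which does more work, which is one reason the cycle counts of different implementations differ so widely.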
45
Range of Implementations
 MIT FFT (2005)
   0.18um CMOS, 628K T’s, 10KHz: Architecture and subthreshold circuits, 180mV operation
 Spiffee (1999)
   0.7um CMOS, 460K T’s, 173MHz: Cached FFT Architecture and algorithm, 1.1V operation
 SA-1100 (1999)
   0.35um CMOS, 2.6M T’s, 74MHz: Commercial embedded processor, Custom Circuits, 1.5V operation
 Imagine (2003)
   0.15um CMOS, 22M T’s, 232MHz: Streaming Media Processor, tiled standard cells, 1.2V operation
 Stratix IS25F627C8 (2005)
   0.13um CMOS, 3.9K logic elements, 123K memory bits, 24 DSP blocks, 272MHz: Commercial FPGA Co-processor
 Intel P4 (2003)
   0.13um CMOS, 3GHz, SSE: Commercial General Purpose Processor, Custom Circuits, 1.5V operation
 TI ‘C6416 (2003)
   0.13um CMOS, 720MHz: Commercial Digital Signal Processor
46
Ebit Energy 16b 1024 point FFT
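| Design | Fab (nm) | Vdd (V) | MHz | mW | Cycles |
|---|---|---|---|---|---|
| MIT FFT | 180 | 1.8 | 0.01 | 1.6 | 95 |
| Spiffee | 700 | 3.3 | 173 | 845 | 5190 |
| SA-1100 | 350 | 2 | 74 | 39 | 31500 |
| Imagine | 150 | 1.5 | 232 | 4000 | 3708 |
| Stratix | 130 | 1.3 | 275 | 884 | 1291 |
| Intel P4 | 130 | 1.2 | 3000 | 51200 | 71680 |
| TI 'C6416 | 130 | 1.2 | 720 | 1200 | 6526 |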
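The energy per FFT reported later in the talk (Efft, in nJ) appears to follow from this table as power times cycles divided by clock frequency. The sketch below checks that reading for five of the designs. The formula is an inference from the numbers, not something the slides state. The MIT FFT row is left out because its listed power, clock and cycle count do not reproduce its listed 154 nJ under this formula. The Intel P4 is left out because its Efft is listed only as 1E+06.

```python
# name: (clock MHz, power mW, cycles per FFT, listed Efft in nJ)
rows = {
    "Spiffee": (173, 845, 5190, 25350),
    "SA-1100": (74, 39, 31500, 16601),
    "Imagine": (232, 4000, 3708, 63931),
    "Stratix": (275, 884, 1291, 4149),
    "TI 'C6416": (720, 1200, 6526, 10877),
}

def fft_energy_nj(mhz, mw, cycles):
    """Energy per FFT: mW * cycles / MHz comes out in nanojoules."""
    return mw * cycles / mhz

computed = {name: fft_energy_nj(mhz, mw, cyc) for name, (mhz, mw, cyc, _) in rows.items()}

for name, (_, _, _, listed) in rows.items():
    print(name, round(computed[name], 1), listed)
```

For these five rows the computed energy agrees with the listed value to within about 1 nJ.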
47
Ebit Energy 16b 1024 point FFT
| Design | EDP (rel norm) | Ebit (fJ) | Efft (nJ) | Normalized to Ebit (1e6) | Energy Ratio |
|---|---|---|---|---|---|
| MIT FFT | 143 | 3.3 | 154 | 47 | 1 |
| Spiffee | 1 | 91 | 25350 | 277 | 6 |
| SA-1100 | 283 | 4.2 | 16601 | 3953 | 85 |
| Imagine | 148 | 2.2 | 63931 | 29726 | 637 |
| Stratix | 24 | 1.4 | 4149 | 2964 | 64 |
| Intel P4 | 12548 | 1.4 | 1E+06 | 873813 | 18591 |
| TI 'C6416 | 27 | 1.4 | 10877 | 7769 | 166 |
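The normalization step can be sketched from the Ebit and Efft values on this slide: dividing Efft in nJ by Ebit in fJ gives the energy in millions of Ebits, and dividing by the MIT FFT figure gives the energy ratio. Reading the table this way is an inference from the numbers. Small differences from the listed values are expected because the slide's Ebit entries are rounded. The Intel P4 is omitted since its Efft is listed only as 1E+06.

```python
# name: (Ebit in fJ, Efft in nJ, listed energy in 1e6 Ebits)
rows = {
    "MIT FFT": (3.3, 154, 47),
    "Spiffee": (91, 25350, 277),
    "SA-1100": (4.2, 16601, 3953),
    "Imagine": (2.2, 63931, 29726),
    "Stratix": (1.4, 4149, 2964),
    "TI 'C6416": (1.4, 10877, 7769),
}

# nJ / fJ = 1e6, so the quotient is already in units of 1e6 Ebits.
normalized = {name: efft / ebit for name, (ebit, efft, _) in rows.items()}

# Energy ratio relative to the most efficient design in the table.
ratio = {name: value / normalized["MIT FFT"] for name, value in normalized.items()}

for name in rows:
    print(name, round(normalized[name]), round(ratio[name]))
```

On this normalized basis Spiffee needs about 6 times the energy of the MIT FFT, even though its raw energy per FFT is more than 100 times larger.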
49
Which Design Is More Efficient?
Depends on the Metric!
 0.7um CMOS 173MHz chip w/ 460K T’s
 Vdd (typ) = 3.3V, Vdd (min) = 1.1V
 Power = 845mW
 EDP 143x better
0.18um CMOS 10kHz chip w/ 640K T’s
 Vdd (max) = 1.8V, Vdd (min) = 0.18V
 Power = 1.6mW
 Ebit-normalized energy 6x better
50
Summary
 Normalized metric (Ebit) enables meaningful comparisons across designs and technologies
 Custom designers can exploit a wide range of optimizations: enabling architecture with circuits and circuits with architecture
 Custom designs can readily achieve a 3x advantage in energy with the potential for over 10x
 Selective application of custom techniques and automated support for performance characterization at specific instead of generic operating points can enable ASIC designers to begin to bridge this power gap.
51
Back-Up Slides
52
ASICs Rely on General Optimization Techniques
Focus - Improve the Average Case
 Partitioning: Hyper-graph - min-cut, ratio cut
 Solutions: move-based, geometric & combinatorial forms, clustering
[Figure: a circuit and its hypergraph H(V,E), with vertices V1-V5 and nets E = { e1, e2, … } (e1-e8 shown); vertex & edge weights are used to encode costs]
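To make the min-cut objective concrete, the sketch below counts the nets cut by a bipartition of a small hypergraph. The connectivity in the slide's figure is not recoverable, so the netlist here is hypothetical, and weights are left out for brevity.

```python
def cut_size(nets, part_a):
    """Count nets that have vertices on both sides of the bipartition."""
    part_a = set(part_a)
    cut = 0
    for vertices in nets.values():
        inside = any(v in part_a for v in vertices)
        outside = any(v not in part_a for v in vertices)
        if inside and outside:
            cut += 1
    return cut

# Hypothetical netlist: each net (hyperedge) connects two or more vertices.
nets = {
    "e1": {"V1", "V2"},
    "e2": {"V2", "V3", "V4"},
    "e3": {"V4", "V5"},
    "e4": {"V1", "V3"},
}

cut_a = cut_size(nets, {"V1", "V2"})        # cuts e2 and e4
cut_b = cut_size(nets, {"V1", "V2", "V3"})  # cuts only e2
print(cut_a, cut_b)
```

A move-based partitioner repeatedly moves vertices between the two sides to lower this count. Because that objective treats all nets uniformly, it does not account for regular structure in the design.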
53
Designs with Structure
Do Not Exhibit Average Characteristics
[Figure: density and routing maps for a 64b Multiplier (half-array)]
Clear Disparity in Resource Usage