Low power design

ELEN 468
Advanced Logic Design
Lecture 29
Low Power Design
ELEN 468 Lecture 29
Power Dissipation
[Figure: power (Watts, log scale, 0.1 to 100) versus year (1971 to 2000) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® and P6 processors. Courtesy, Intel]
Power increases despite Vdd decrease
Power Density
[Figure: power density (W/cm2, log scale, 1 to 10000) versus year (1970 to 2010) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® and P6 processors, with hot plate, nuclear reactor, and rocket nozzle as reference levels. Courtesy, Intel]
Why Power Increased
Growing die size, fast frequency scaling
[Figure: clock frequency (MHz, log scale, 10 to 10000) versus year (1985 to 2005)]
Gate Power Dissipation
Leakage power
Dynamic power
Short circuit power
Dynamic Power
Occurs at each switching
$P_d = C_L \cdot V_{dd}^2 \cdot f_p$
$f_p$: switching frequency
[Figure: two CMOS inverter schematics (Vdd, out), one labeled Linear and one labeled Saturation]
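The quadratic dependence on Vdd is easiest to see with numbers. The following is a minimal sketch of the dynamic power formula; the capacitance, voltages, and frequency are illustrative assumptions, not values from the lecture.

```python
def dynamic_power(c_load, vdd, f_switch):
    """Dynamic power P_d = C_L * Vdd^2 * f_p, in watts."""
    return c_load * vdd ** 2 * f_switch


# Assumed operating point: 1 nF total switched capacitance at 100 MHz.
C_LOAD = 1e-9
F_SWITCH = 100e6

p_high = dynamic_power(C_LOAD, 1.5, F_SWITCH)  # assumed higher supply
p_low = dynamic_power(C_LOAD, 1.2, F_SWITCH)   # assumed lower supply

# A 20% supply reduction removes 36% of the dynamic power.
print(f"P at 1.5 V: {p_high:.3f} W, P at 1.2 V: {p_low:.3f} W")
print(f"ratio: {p_low / p_high:.2f}")
```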
Leakage Power
Static
Leakage current $= a \cdot V_{dd}$
Leakage current $= b / V_t$
Killer to CMOS technology
[Figure: two CMOS inverter schematics (Vdd, out) with the leakage path marked, one labeled Linear and one labeled Saturation]
Short Circuit Power
During switching, there is a short moment when both PMOS and NMOS are partially on
$P_s = Q \cdot (V_{dd} - V_t)^3 \cdot t_r \cdot f_p$
$t_r$: rise time
[Figure: two CMOS inverter schematics (Vdd, out), one for input falling and one for input rising]
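A short numeric sketch of the short circuit power expression, to show how strongly it depends on the overdrive (Vdd - Vt) and how it scales with rise time. The constant Q is not specified on the slide, so it is set to 1 here and only ratios are compared; all voltages and times are assumed values.

```python
def short_circuit_power(q, vdd, vt, t_rise, f_switch):
    """Short circuit power P_s = Q * (Vdd - Vt)^3 * t_r * f_p."""
    return q * (vdd - vt) ** 3 * t_rise * f_switch


# Assumed values for illustration only; Q = 1 is a placeholder constant.
Q, VT, T_RISE, F_SWITCH = 1.0, 0.4, 1e-10, 100e6

p_high = short_circuit_power(Q, 1.5, VT, T_RISE, F_SWITCH)
p_low = short_circuit_power(Q, 1.2, VT, T_RISE, F_SWITCH)
p_slow_edge = short_circuit_power(Q, 1.5, VT, 2 * T_RISE, F_SWITCH)

print(f"1.5 V vs 1.2 V supply: {p_high / p_low:.2f}x more short circuit power")
print(f"doubling the rise time: {p_slow_edge / p_high:.1f}x more short circuit power")
```

The cubic term means that lowering Vdd helps short circuit power even more than it helps dynamic power, while slow input edges make it worse.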
Where Does Power Go?
[Figure: power percentages (0% to 100%) split into active power, cache leakage, gate leakage, and core transistor leakage]
Total chip power based on ITRS roadmap
In 2004, we are just breaking even
[Kim, et al, Computer 2003]
Scalable X86 CPU Design for 90nm
Low VT devices are <1% of total non-memory transistor width
[J. Schultz and C. Webb, ISSCC 2004]
Energy – Performance Space
Every design is a point on a 2-D plane
[Figure: design points on a plane with Energy and Performance axes]
Low Power Design
Reduce dynamic power
• α (switching activity): clock gating, sleep mode
• C: small transistors (esp. on clock), short wires
• VDD: lowest suitable voltage
• f: lowest suitable frequency
Reduce static power
• Selectively use low Vt devices
• Power gating, MTCMOS
• Stacked devices
• Body bias
Clock Gating
Gate off clock to idle functional units
• e.g., floating point units
• need logic to generate disable signal
  • increases complexity of control logic
  • consumes power
  • timing critical to avoid clock glitches at OR gate output
• additional gate delay on clock signal
  • gating OR gate can replace a buffer in the clock distribution tree
[Figure: register (Reg) feeding a functional unit, with the clock and disable signals combined at an OR gate]
Active Power Reduction - Supply Voltage Reduction
[Figure: Slow, Fast, and Slow blocks fed from a low supply voltage and a high supply voltage]
Static
Pros:
• Always active in saving
Cons:
• Additional power delivery network
• Needs special care of interface between power domains
• Signals close to Vt: excessive leakage and reduced noise margins
Dynamic
Adjusting operation voltage and frequency to performance requirements:
• High performance: high Vdd & frequency
• Power saving: low Vdd & frequency
Pros:
• Doesn't limit performance
Cons:
• Penalty of transition between different power states can be high (in performance and power)
• Additional control logic
Voltage Islands (Multi-Vdd)
[Figure: Vddh/Vddl cell layouts for Usami+ JSSC'98, Lackey+ ICCAD'02, and GVI DAC'03]
Allow both macro and cell voltage assignment
Allow different voltage islands in the same circuit row
Lift unnatural layout restrictions
Minimal placement disturbance
Level Converter
Interface circuit when Vddl drives Vddh to avoid leakage
[Figure: conventional dual supply level converter (VddH, VddL, IN, OUT) and new single supply level converter (Vddh, IN, OUT); one device in the schematic is marked "weak on!"]
Adjacency Metrics for Clustering
Logic adjacency metric (LAM): Vddl fanin cone of level shifter without going through Vddh
[Figure: Vddh and Vddl regions with level converters LC1, LC2, and LC3]
Physical adjacency metric (PAM): for each candidate Vddl cell, compute total size of its neighbor Vddl cells
• LAM to guide logic aware voltage assignment
• PAM to guide placement aware voltage re-assignment
Level Converter Optimizations
Logic replacement (or gate sizing)
[Figure: two versions of a circuit with level converters (LC), MUX 1, MUX 2, DEC, and output Z]
LC/Buffer co-optimization
[Figure: two arrangements of a level converter (LC) between A and B]
Placement to Form Voltage Islands with Power Grid Co-design
Based on Vddl and Vddh cell placement after voltage assignment, define Vddl/Vddh power grids on demand
Detailed placement to form Vddl/Vddh voltage islands that can hit their corresponding power supplies
[Figure: power grids on demand, with alternating Vddl and Vddh rails]
Example of Voltage Islands
- IBM Cu11
- 0.13um
- 400 MHz
Vddh = 1.5V
Vddl = 1.2V
(courtesy IBM)
No timing degradation, no area increase!
Dynamic Frequency and Voltage Scaling
Always run at the lowest supply voltage that meets the timing constraints
• DFS (dynamic frequency scaling) saves only power
• DVS (dynamic voltage scaling) + DFS saves both energy and power
A DVS+DFS system requires the following
• A programmable clock generator (PLL)
  • PLL from 200MHz to 700MHz in increments of 33MHz
• A supply regulation loop that sets the minimum VDD necessary for operation at the desired frequency
  • 32 levels of VDD from 1.1V to 1.6V
• An operating system that sets the required frequency + supply voltage to meet the task completion deadlines
  • heavier load: ramp up VDD, when stable speed up clock
  • lighter load: slow down clock, when PLL locks onto new rate, ramp down VDD
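Why DFS alone saves power but not energy can be checked with a small calculation: for a task of a fixed number of cycles, energy is C·Vdd²·cycles and does not depend on frequency. The sketch below assumes a simple C·Vdd²·f dynamic power model, an illustrative capacitance and cycle count, and assumes (hypothetically) that the halved clock can run at 1.1 V. The `transition_sequence` helper is a hypothetical name that just encodes the ordering rule from the slide.

```python
def dynamic_power(c_eff, vdd, freq):
    """Dynamic power in watts for switched capacitance c_eff."""
    return c_eff * vdd ** 2 * freq


def task_energy(c_eff, vdd, cycles):
    """Energy in joules: power * (cycles / freq), so freq cancels out."""
    return c_eff * vdd ** 2 * cycles


def transition_sequence(f_old, f_new):
    """Order of operations when the OS changes the operating point."""
    if f_new > f_old:
        return ["ramp up VDD", "wait for VDD to stabilize", "speed up clock"]
    if f_new < f_old:
        return ["slow down clock", "wait for PLL lock", "ramp down VDD"]
    return []


C_EFF = 1e-9    # assumed switched capacitance (F)
CYCLES = 7e8    # assumed task length in clock cycles

p_full = dynamic_power(C_EFF, 1.6, 700e6)
p_dfs = dynamic_power(C_EFF, 1.6, 350e6)    # DFS only: same VDD, half clock
p_dvfs = dynamic_power(C_EFF, 1.1, 350e6)   # DVS + DFS (assumed feasible pair)

e_full = task_energy(C_EFF, 1.6, CYCLES)
e_dfs = task_energy(C_EFF, 1.6, CYCLES)     # unchanged: the task just runs longer
e_dvfs = task_energy(C_EFF, 1.1, CYCLES)

print(f"power  full/DFS/DVS+DFS: {p_full:.3f} / {p_dfs:.3f} / {p_dvfs:.3f} W")
print(f"energy full/DFS/DVS+DFS: {e_full:.3f} / {e_dfs:.3f} / {e_dvfs:.3f} J")
print(transition_sequence(350e6, 700e6))
```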
Leakage Reduction Techniques
[Figure: four leakage reduction techniques: stack effect (Wu, Wl, Vx); dual Vt partitioning (high Vt devices, low Vt devices); variable threshold, VTCMOS (Vnwell ≥ Vdd, Vpwell ≤ 0); multi-threshold, MTCMOS (HVT sleep transistors, virtual Vdd, low Vt logic, virtual Gnd)]
Natural Transistor Stacks
How?
• Reduce the leakage by stacking the devices
• Reduced Vds
• Negative Vgs
• Negative Vbs
Design with Dual Vth
[Figure: two panels, Dual Vth evaluation and Dual Vth design]
• Two flavors of transistors: slow (high Vth), fast (low Vth)
• Low Vth devices are faster, but have ≈10X leakage
Impacts of Variable VT
Reducing the VT increases the subthreshold leakage current (exponentially)
$V_T = V_{T0} + \gamma \left( \sqrt{|-2\phi_F + V_{SB}|} - \sqrt{|-2\phi_F|} \right)$
where $V_{T0}$ is the threshold voltage at $V_{SB} = 0$, $V_{SB}$ is the source-bulk (substrate) voltage, $\gamma$ is the body-effect coefficient, and $\phi_F$ is the Fermi potential
But, reducing VT decreases gate delay (increases performance)
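The body-effect relation can be evaluated directly. The sketch below uses the standard form VT = VT0 + γ(√|-2φF + VSB| - √|-2φF|); the parameter values (VT0 = 0.43 V, γ = 0.4 V^0.5, φF = -0.3 V) are assumed textbook-style NMOS numbers, not values given in the lecture.

```python
import math


def threshold_voltage(v_sb, vt0=0.43, gamma=0.4, phi_f=-0.3):
    """Threshold voltage (V) as a function of source-bulk voltage v_sb (V).

    vt0, gamma, and phi_f are assumed example parameters for an NMOS device.
    """
    surface = abs(-2.0 * phi_f)
    return vt0 + gamma * (math.sqrt(abs(-2.0 * phi_f + v_sb)) - math.sqrt(surface))


# Sweep the source-bulk voltage: a larger V_SB (substrate pulled below the
# source) raises VT, which is the handle that body biasing uses.
for v_sb in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5):
    print(f"V_SB = {v_sb:.1f} V -> VT = {threshold_voltage(v_sb):.3f} V")
```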
Variable VT through Body Bias
For NMOS, the substrate is normally tied to ground (VSB = 0)
A negative bias on the substrate (reverse body bias) causes VT to increase
Adjusting the substrate bias at runtime is called adaptive body-biasing (ABB) or dynamic threshold scaling (DTS)
• Requires a triple well fab process
[Figure: plot with vertical axis from 0.4 to 0.9 against VSB (V) from -2.5 to 0, with VSB,p and VSB,n labels]
Forward/Reverse Body Biasing
RBB (Reverse Body Bias): zero body bias in active mode, a deep reverse bias in standby mode.
Disadvantages:
• Increases PN junction reverse leakage
• Technology scaling worsens short channel effects and weakens the Vth modulation capability
FBB (Forward Body Bias): high Vth in standby mode, forward body biasing to achieve better current drive in active mode.
Disadvantages:
• Larger junction capacitance
• High body effect for stacked devices
Implementation of Dynamic Vth Scaling (DTS)
How?
• When critical path replica frequency is less than reference CLK, adjust bias to decrease Vth.
• Otherwise adjust bias to increase Vth.
Results:
• The lowest Vth is delivered (NBB, no body bias) if the highest performance is required.
• When the performance demand is low, clock frequency is lowered and Vth is raised via RBB to reduce the run time leakage power dissipation.
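The feedback rule above can be sketched as a simple bang-bang control loop. Everything numeric here is an assumption for illustration: the Vth range, the step size, and especially `replica_frequency`, which is a hypothetical linear model of how the critical path replica speeds up as Vth drops.

```python
def replica_frequency(vth, vdd=1.2, k=1000.0):
    """Hypothetical replica model: frequency (MHz) grows with gate overdrive."""
    return k * (vdd - vth)


def dts_step(vth, ref_clk, step=0.01, vth_min=0.2, vth_max=0.5):
    """One control step: compare the replica with the reference clock.

    vth_min stands for the no-body-bias (NBB) threshold and vth_max for the
    threshold at the deepest reverse body bias (both assumed values).
    """
    if replica_frequency(vth) < ref_clk:
        vth -= step   # too slow: adjust bias to decrease Vth
    else:
        vth += step   # fast enough: adjust bias to increase Vth, cutting leakage
    return min(max(vth, vth_min), vth_max)


def settle(ref_clk, vth=0.35, iterations=100):
    """Run the loop long enough to reach its operating point."""
    for _ in range(iterations):
        vth = dts_step(vth, ref_clk)
    return vth


for ref in (1100.0, 800.0, 600.0):
    print(f"reference {ref:.0f} MHz -> Vth settles near {settle(ref):.2f} V")
```

With a high reference clock the loop pins Vth at its minimum (NBB); with a low one it pins Vth at the maximum (deep RBB); in between it dithers within one step of the Vth that just meets the reference.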
Power Gating Using Sleep Transistors
Or can reduce leakage by gating the supply rails when the circuit is in sleep mode
• in normal mode, sleep = 0 and the sleep transistors must present as small a resistance as possible (via sizing)
• in sleep mode, sleep = 1, the transistor stack effect reduces leakage by orders of magnitude
Or can eliminate leakage by switching off the power supply (but lose the memory state)
Example of Power Gating
Can reduce power 1000X
Smaller voltage swing (IR drop on sleep transistors)
• Lower performance
• Increased noise coupling
• Local power grid design
[Figure: layout showing power switch control signals, embedded power switches, and rows of standard cells]
Power Dissipation on Variation Tolerance
Conventional variation tolerance
• Using large timing safety margin
• Implies aggressive timing target
• Greater power dissipation
Observation
• Near-worst-case variations occur rarely
• Safety margin is applied continuously to guard against the small chance of variations
• Poor power efficiency
Question..
Can we deal with errors instead of preventing them from occurring by conservative binning/clocking?
How fast can we speed up the circuit while keeping the error rate in a manageable range?
Fault tolerant system
Begin with reference values
Introduce redundancy
• Hardware: Triple Modular Redundancy
• Time: Repeated process
• Information: Code
• Software: various algorithms
How about for delay fault?
• how do we detect (and maybe correct?) errors?
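As a concrete instance of hardware redundancy, triple modular redundancy runs three copies of a module and takes a 2-of-3 majority of their outputs. The sketch below is a generic bitwise voter, not a circuit from the lecture; the function names are hypothetical.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three redundant module outputs."""
    return (a & b) | (a & c) | (b & c)


def disagreeing_modules(a, b, c):
    """Indices of modules whose output differs from the voted result."""
    voted = majority_vote(a, b, c)
    return [i for i, out in enumerate((a, b, c)) if out != voted]


# Module 2 suffers a fault; the voter masks it and the mismatch flags it.
outputs = (0b1010, 0b1010, 0b0110)
print(bin(majority_vote(*outputs)))    # prints 0b1010
print(disagreeing_modules(*outputs))   # prints [2]
```

The voter masks a value fault in any single module, but it does not by itself say whether a late-arriving (delay) fault occurred, which is the question the slide raises.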
Delay fault tolerant system
Delay fault detection
• Redundant timing margin in signal path
• +: Second sampling at increased clock period
• -: Decrease delay of reference signal between pipeline registers
[Figure: timing diagram showing the timing margin, sampling instants t1 and t2, and the 2nd sampling; time axis t]
Delay fault tolerant system
Delay fault removal
• Reference signal (SR)
• Reprocessing at slower clock period (t')
[Figure: timing diagram showing the timing margin, t1, t2, the reference signal SR, and clock periods t and t']
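The detect-then-reprocess idea can be illustrated with a toy timing model. The sketch assumes a result is sampled at t1 and again at t2: a path that settles by t1 is fine, one that settles between t1 and t2 is detected (the two samples differ) and reprocessed at the slower period, and one that settles after t2 escapes detection. The function names and all delay values are hypothetical.

```python
def classify_sample(path_delay, t1, t2):
    """Outcome of double sampling for one operation (simplified model)."""
    if path_delay <= t1:
        return "ok"
    if path_delay <= t2:
        return "detected"
    return "undetected"


def total_time(path_delays, t1, t2, t_slow):
    """Time to run all operations at period t1, reprocessing detected faults.

    Assumes every detected fault completes correctly when rerun at t_slow.
    """
    elapsed = 0.0
    for delay in path_delays:
        outcome = classify_sample(delay, t1, t2)
        if outcome == "undetected":
            raise ValueError("delay exceeds the second sampling point")
        elapsed += t1
        if outcome == "detected":
            elapsed += t_slow   # fault removal: reprocess at the slower clock
    return elapsed


# Assumed workload: nine fast operations and one near-worst-case operation.
delays = [0.8] * 9 + [1.1]
speculative = total_time(delays, t1=1.0, t2=1.2, t_slow=1.2)
conservative = len(delays) * 1.2   # always clocking for the worst case

print(f"speculative: {speculative:.1f}, conservative: {conservative:.1f}")
```

The scheme pays off only while faults are rare; with a high fault rate the reprocessing penalty outweighs the faster clock.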
Delay fault tolerant system: Example
RAZOR*
• Dynamic Voltage Scaling Design
• Reduce supply voltage down to a manageable failure rate
[Figure: timing diagram showing the timing margin, t1, and t2]
* D. Ernst et al., "Razor: a low-power pipeline based on circuit-level timing speculation," 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003
Delay fault tolerant system: Example
RAZOR continued
• Implemented at 120MHz clock frequency
• But for high speed circuits…
  • Managing two clocks
  • Minimum path delay constraint
  • Delay of MUX
Delay fault tolerant system: Example
Parity coding
• Parity generation based on output correlation
• Avoid well-correlated outputs for pairing
[Figure: timing diagram showing the timing margin; time axis t]
Now.. Let’s look at delay distribution(s)
Clock speed achieved for contained error rate
Delay fault tolerant system: Example
Parity coding (continued)
• Complexity
• Example: C449 ISCAS Benchmark
Recently Proposed Design
Fault detection
• Partial hardware and time redundancy
[Figure: pipeline stage with blocks FL, BL, and BL', gates g0, gi, and gm, and latches Ln, Ln+1, and L'n+1; timing margin marked on time axis t]
Proposed Design
Fault removal
• Pipeline flush & reprocessing at lower clock
[Figure: pipeline stage with blocks FL, BL, and BL', gates g0, gi, and gm, and latches Ln, Ln+1, and L'n+1]
Proposed Design
Division of FL and BL
[Figure: block diagram with PI, FL, BL, and PO, a second BL fed through a latch, and a comparison block CP producing "Error?"]
Proposed Design
Division of FL and BL
• Considerations
  • The effects on the original circuit should be minimal.
  • Maximize delay fault detection coverage
  • Minimize added complexity
Proposed Design
Division of FL and BL
• First, POs to BL
  • Gate with longest delay to gate with shortest delay
• For the gates connected to BL,
  • Choose the gate with maximum delay
  • Then, any gate whose number of fanouts > number of fanins
Proposed Design
Delay fault detection coverage
• $d_{FL}$: delay from PI to any gate in FL
• $d_i$: delay from PI to any gate in original circuit
$C_F = 1 - \frac{\max\{d_{FL}\}}{\max\{d_i\}}$
Add graphical view
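A small worked example of the coverage metric, reading it as C_F = 1 - max{d_FL} / max{d_i}: the deeper FL reaches into the circuit, the less of the delay range is covered by the duplicated BL. The gate delays below are made-up numbers.

```python
def fault_detection_coverage(d_fl, d_all):
    """C_F = 1 - max(d_FL) / max(d_i).

    d_fl:  delays from PI to each gate in FL
    d_all: delays from PI to each gate in the original circuit
    """
    return 1.0 - max(d_fl) / max(d_all)


# Hypothetical arrival times (ns) at five gates; the first three are in FL.
d_all = [1.0, 2.0, 3.0, 4.0, 5.0]
d_fl = d_all[:3]

coverage = fault_detection_coverage(d_fl, d_all)
print(f"C_F = {coverage:.2f}")   # 1 - 3/5 = 0.40
```

Moving the FL/BL boundary earlier (smaller max d_FL) raises the coverage but enlarges BL, which is the trade-off against added complexity.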
Proposed Design
Delay simulation
• SPICE simulation
  • TSMC 0.18um tech., Vcc=1.6V
  • Gate delay for rising and falling signal
  • Load: inverter
  • Different input combinations are considered
• Delay simulation
  • Randomly generated test vectors
  • $10^6$ to $10^8$ according to number of primary inputs (PI)
Proposed Design
Area complexity
• $N_{gate}$: Number of gates in the original circuit
• $N_{ff}$: Number of flip-flops in each pipeline, $(N_{PI}+N_{PO})/2$
• $N_{gate\_BL}$: Number of gates in BL
• $N_{gate\_CP}$: Number of gates in comparison block
• $N_{Latch}$: Number of latches = Number of connections between FL and BL
• $w$: Complexity ratio of flip-flop to gate
$C_A = \frac{N_{gate\_BL} + N_{gate\_CP} + N_{Latch}}{N_{gate} + w \cdot N_{ff}}$
Fault Coverage vs. Complexity
[Figure: four panels, Fault Detection Coverage vs. Added Complexity for C432, C499, C6288, and C880; axes are fault detection coverage CF and added complexity CA]
Complexity
Effective complexity penalty
• Depends on application
• More than half of area is cache
• Speed critical part: integer unit
$C_{AE} = C_A \cdot \frac{\text{Applicable area}}{\text{Total chip area}} \approx C_A \cdot 0.5$
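The two complexity measures can be chained in a short calculation. This sketch assumes the area complexity is the added gates and latches divided by the original gates plus weighted flip-flops, and that the effective penalty scales it by the fraction of the chip the technique applies to. All counts and the flip-flop weight w are made-up numbers.

```python
def area_complexity(n_gate_bl, n_gate_cp, n_latch, n_gate, n_ff, w):
    """C_A: added logic relative to the original stage (assumed reading).

    Numerator: duplicated BL gates, comparison gates, and latches.
    Denominator: original gates plus flip-flops weighted by w.
    """
    return (n_gate_bl + n_gate_cp + n_latch) / (n_gate + w * n_ff)


def effective_complexity(c_a, applicable_area, total_area):
    """C_AE: C_A scaled by the share of chip area the scheme applies to."""
    return c_a * applicable_area / total_area


# Hypothetical block: 160 gates, 10 flip-flops weighted as 4 gates each.
c_a = area_complexity(n_gate_bl=40, n_gate_cp=10, n_latch=10,
                      n_gate=160, n_ff=10, w=4)
# Assume the scheme applies to half of the chip (the rest being cache).
c_ae = effective_complexity(c_a, applicable_area=0.5, total_area=1.0)

print(f"C_A = {c_a:.2f}, C_AE = {c_ae:.2f}")
```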
Estimation of Complexity
[Figure: Intel® Pentium® 4 Processor on 90 nm Process, with Data Cache, Align Mux, ALUs & AGU, and Registers blocks marked]
Conclusion
Delay fault tolerant design is proposed
• Possible operation clock frequency gain is estimated from modeling and experiments
• Delay fault detection coverage and complexity are analyzed for optimal implementation
• It shows that 10% clock frequency gain is possible with proposed design at a moderate (8 to 25%) complexity increase