Yanqing_proposal

Synthesis Based Design Techniques for Ultra Low Voltage Energy Efficient SoCs
Robust Low Power VLSI
Yanqing Zhang
February 27th, 2012
Motivation for Ultra Low Voltage Design
[Figure: power vs. performance across application classes: wireless sensor nodes, portable electronics, desktop applications, servers and data centers]
2
Motivation for Ultra Low Voltage Design
[1]
Application Characteristics:
1. Device lifetime
2. Robust functionality
3. Relatively small form factor
4. Speed not a major concern
3
Motivation for Ultra Low Voltage Design
Trend has been to use voltage
scaling…
BUT IT’S NOT THAT SIMPLE!
[1]
Almost 2 orders-of-magnitude increase
in energy efficiency
[2]
4
Key Challenges: Increased Significance of
Leakage
[Figure: % leakage energy / total energy for a critical path vs. VDD (0 to 1.2 V). Annotations: Vth; "Minimum energy point occurs here"]
5
Key Challenges: Sensitivity to Variability
[Figure: histogram of local variation of delay for a 4-stage inverter chain; x-axis: delay (ns), y-axis: count]
Exponential dependence on Vth increases uncertainty in timing
closure metrics. This decreases chip yield.
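The exponential dependence can be illustrated with a small Monte Carlo sketch. It relies on the textbook relation that subthreshold current scales as exp(-Vth / (n·vT)), so gate delay scales as exp(+ΔVth / (n·vT)). The slope factor, the Vth sigma, and the function names below are illustrative assumptions, not values from this work.

```python
import math
import random
import statistics

THERMAL_VOLTAGE = 0.026  # volts, roughly kT/q at room temperature
SLOPE_FACTOR = 1.5       # assumed subthreshold slope factor n (illustrative)


def relative_delay(delta_vth):
    """Delay multiplier of one gate whose Vth is shifted by delta_vth volts.

    Subthreshold current ~ exp(-Vth / (n * vT)) and delay ~ 1 / current,
    so a positive Vth shift slows the gate exponentially.
    """
    return math.exp(delta_vth / (SLOPE_FACTOR * THERMAL_VOLTAGE))


def chain_delay_samples(stages=4, sigma_vth=0.03, trials=2000, seed=0):
    """Sample the normalized delay of an inverter chain under local Vth variation.

    Each stage gets an independent Gaussian Vth shift (sigma_vth is an assumed value).
    """
    rng = random.Random(seed)
    return [
        sum(relative_delay(rng.gauss(0.0, sigma_vth)) for _ in range(stages))
        for _ in range(trials)
    ]


if __name__ == "__main__":
    samples = chain_delay_samples()
    print("mean:", statistics.mean(samples), "stdev:", statistics.stdev(samples))
```

A Gaussian Vth shift turns into a long-tailed (lognormal) delay per gate, which is why a nominal or corner-only timing number says little about yield in this regime.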
6
Key Challenges: Efficient Hardware Selection
High Speed SoCs: Very powerful. Low power, so it is not a power hog. Not for the ULV domain.
COTS Based WSN: Fully functional TX and DSP, but 20mW power consumption → short lifetime.
Custom IC Based IC [3]: No DSP. 3 day lifetime. Lacks functionality; lifetime still short.
Conventionally, we consider SPEED as the main factor for a system. Our requirements are system LONGEVITY and ROBUST FUNCTIONALITY. We can really improve SoCs in the ULV domain if we change our strategy.
7
Summary of Dissertation Goals
PROJECT 1 (completed)
• Design architecture for a Body Area Sensor Node
(BASN) SoC capable of battery-less operation.
PROJECT 2
• Local variation robust standard cell library for sub-Vt
• Synthesis flow reducing leakage energy
PROJECT 3
• Hold time robust design methodology
PROJECT 4
• Alternative approach to DVFS
8
Outline
• Motivation
• Hardware Selection for Energy Efficient SoC (BASN chip)
  • Motivation
  • Hypothesis
  • Approach
  • Results
• Library Design and Characterization at ULVs for Robust Timing Closure
• Hold Time Analysis and Timing Closure Method for Subthreshold
• Latch Based Design for Single-VDD Alternative Approach to DVFS
9
Project 1: Hardware Selection for Energy
Efficient SoC (BASN chip)
10
Motivation
[Figure labels: Information, Assessment, Treatment]
• Wireless body area sensor nodes (BASN) enable inexpensive continuous monitoring of patients
• Battery replacement/charging for body-worn devices may not be feasible or desirable
11
Motivation
COTS Based WSN: Fully functional TX and DSP, but 20mW power consumption → short lifetime
Custom IC Based IC [3]: No DSP. 3 day lifetime. Lacks functionality; lifetime still short
[Figure label: MCU]
• BASNs exemplify a design space requiring energy efficiency to the extreme
• State-of-the-art low power modules help… but are not a full solution
• On-chip processing is a MUST (TX duty cycle, node size), but ‘throwing on an MCU’ entails high power, ~100µW
• Judicious hardware selection needed
12
Hypothesis
[Block diagram labels: ECG, EMG, EEG, AFE, ADC, µController, DSP, Memory, RF, Power Mgmt., TEG, Boost Converter, VBOOST, Voltage Regulation, RF Kick-Start; legend: Signal Path, Power Path; annotation: ~60µW]
We can achieve a battery-less (energy harvesting) BASN SoC capable of various bio-signal acquisition and flexible data processing with state-of-the-art low power circuit design and judicious hardware selection
13
Approach
[Figure: measured energy/op (pJ) vs. delay (µs) for the MCU, the RR+AFib accelerator, and the 30-tap FIR accelerator]
Accelerators:
• Programmable FIR
• Heart rate (R-R) extraction
• Atrial Fibrillation (AFib) detection
• Band energy envelope detection
• Direct memory access (DMA)
• Packetizer
Energy Efficiency / Sample
Function       MCU       Accel      Improvement
30 Tap FIR     6.3 nJ    57.6 pJ    110x
Env. Detect    3.6 nJ    530 fJ     6800x
R-R Extract    12 pJ     3 fJ       4000x
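The improvement factors follow directly from the energy-per-sample figures. The sketch below recomputes them; the dictionary layout and function name are illustrative choices, and only the energy values come from the slide.

```python
# Energy per sample from the slide, converted to joules.
ENERGY_PER_SAMPLE = {
    "30-tap FIR": {"mcu": 6.3e-9, "accel": 57.6e-12},
    "Envelope detect": {"mcu": 3.6e-9, "accel": 530e-15},
    "R-R extract": {"mcu": 12e-12, "accel": 3e-15},
}


def improvement(mcu_joules, accel_joules):
    """Return how many times less energy the accelerator uses per sample."""
    return mcu_joules / accel_joules


# Ratio of MCU energy to accelerator energy for each function.
ratios = {
    name: improvement(e["mcu"], e["accel"])
    for name, e in ENERGY_PER_SAMPLE.items()
}

if __name__ == "__main__":
    for name, ratio in ratios.items():
        print(f"{name}: about {ratio:.0f}x")
```

The computed ratios come out near 109x, 6800x, and 4000x, matching the rounded values on the slide.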
14
Significance
Sensors
This Work
[18]
ECG, EMG, EEG
ECG
Supply
E Harvesting
30mV, -10dBm
1.2V
Thermal, RF
X
DPM,
Clock
Power Mgmt.
Clock/Power gate gate
Gen. Purp.
1.5 pJ/Instr @
X
MCU
200kHz
Accelerators
Memory
Many
5.5kB (0.3V-0.7V)
Digital Power
Total Power
2.1µW
19µW
ASIC
42kB
(1.2V)
~12µW
31.1µW
[19]
Neural, ECG,
EMG, EEG
1V
X
[20]
X
X
X
Power gate
X
X
X
28.9pJ/Instr
@ 73kHz
X
ASIC
Few
X
X
N/A
500µW
EEG
1V
X
[21]
[22]
Temp,
ECG, TIV
Pressure
1.2V 0.4V/0.5V
X
Solar
20kB
5kB (0.4V)
(1.2V)
2.1µW 500µW
2.1µW
77.1µW 2.4mW
7.7µW
x
• Has lower power, lower minimum input supply voltage, and more complete system integration than all other reported wireless BASN SoCs
• First wireless biosignal acquisition chip powered solely from thermoelectric harvested power
15
16
Project 2: Library Design and
Characterization at ULVs for Robust
Timing Closure
17
Motivation
[Figure: SNM butterfly plot, VNOR2-IN-NAND2-OUT vs. VNAND2-IN-NOR2-OUT (0 to 0.25 V), for static CMOS NOR2]
Static CMOS NOR2 FAILS SNM @ TT corner with local variation
18
Motivation
Problem: Weak devices (PMOS) + stacked transistor variation
• A standard cell library is essential to synthesis, but scaling industry standard cells isn’t sufficient for sub-Vt: they fail SNM with variation
19
Motivation
[Figure: six logic gates [4]]
LEAKING WITHOUT PURPOSE!
20
Motivation
[Figure: probability (%) vs. log(delay) for 2-stage, 4-stage, 8-stage, and 16-stage chains; σ/µ = .014, .019, .022, .024]
• Conventional method of ‘process corner based timing closure’ is unsuitable for sub-Vt
→ Doesn’t capture sensitivity to local variation
21
Hypothesis
1. Using TX-gate style logic, we can achieve lower energy
consumption for a given yield when compared to static CMOS
gates.
2. We can achieve decreased total energy with a flow that
optimizes leakage on non-critical paths, but still ensures path
yield with variation aware cell characterization.
22
Proposed Approach
1. TX-Gate Based Gate Design
2. Long Length Low Leakage Gate Design
3. Setup/Hold Optimized Register
New Cell Library
4. Synthesis Gate Replacement
5. Place and Route Retiming
6. Clock Network Extraction
7. Post Clock Extraction Retiming
8. Circuit Simulation and Evaluation
lang = spectre
parameters …
INVX1 A B VDD VSS …
.sim opt …
23
Anticipated Contributions
• Variation immune TX-Gate standard cell library (publication)
• Variation aware path leakage optimization technique
(publication)
Anticipated Bottlenecks
• Minimizing leakage in TX-based cells
• Matching speed with static CMOS counterparts
• Layout compactness issues
24
25
Project 3: Hold Time Analysis and
Timing Closure Method for Subthreshold
26
Motivation
Skew is increased in sub-Vt because of increased PVT variation sensitivity
[Timing diagram: Data 1, Data 2, Clock, Clock+skew; tSKEW marked]
27
Motivation
Slew is degraded in sub-Vt because of increased PVT variation sensitivity
[Timing diagram: Data 1, Data 2, Clock w/ BAD slew]
28
Motivation
Hold time, clock-q uncertainty in sub-Vt because of increased PVT variation sensitivity
[Timing diagram: Data 1, Data 2, Clock]
29
Motivation
[Figure label: tSKEW]
• Conventional method to solve hold time:
  • Use clock tree synthesis to design a tree with many levels (control skew) and large buffers (control slew)
  • Use buffer insertion to take care of hold time, clock-q
THIS WON’T WORK IN Sub-Vt!
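For reference, the conventional hold check that buffer insertion targets can be sketched as below. This is the standard textbook constraint (minimum clock-q plus minimum logic delay must exceed hold time plus skew), not the flow proposed here; the function names and the single fixed buffer delay are illustrative assumptions.

```python
import math


def hold_slack(t_clk_q, t_logic_min, t_hold, t_skew):
    """Hold slack of one register-to-register path (all times in the same unit).

    t_skew is how much later the capture clock arrives than the launch clock.
    A negative result is a hold violation.
    """
    return t_clk_q + t_logic_min - t_hold - t_skew


def buffers_needed(slack, t_buffer):
    """Number of identical delay buffers needed to bring a path to non-negative slack.

    Assumes every buffer adds exactly t_buffer, which ignores variation.
    """
    if slack >= 0:
        return 0
    return math.ceil(-slack / t_buffer)


if __name__ == "__main__":
    slack = hold_slack(t_clk_q=10, t_logic_min=5, t_hold=8, t_skew=12)
    print("slack:", slack, "buffers:", buffers_needed(slack, t_buffer=2))
```

In sub-Vt every term in this inequality, including the buffer delay itself, varies strongly, so a fix computed from nominal values does not guarantee yield.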
30
Motivation
[Figure: yield (%) vs. level of clock tree (1 to 4)]
• More levels = more skew! Contrary to conventional wisdom…
31
Motivation
[Figure: % power overhead of buffers (PCLK, PREG, PHOLD) of total circuit power (normalized) vs. yield (%)]
• Buffer insertion is energy costly!
• And it still doesn’t solve our problem (subject to variation too…)
32
Hypothesis
1. We can achieve a similar parameter controlling method
suitable for sub-Vt by re-analyzing the effects of each parameter.
2. We can achieve a more energy efficient method for a given
yield constraint using a novel two-phase clock based timing
scheme
33
Approach
EDA Tools Method
Find the lowest energy approach to accomplish:
1. Limited Skew
2. Judicious Hold Buffer Insertion
3. Tolerable Data Slew
4. Tolerable Clock Slew
5. Robust VS Register
Two-phase Clock Method
No More Buffers!
[Diagram labels: DLL, Master Clock, Slave Clock, tSKEW, Less tSKEW]
34
Approach
Split register into 2 positive transparent latches
Tune DLL to fix timing
[Timing diagram, steps 1 to 4: Master Clock, Master Clock+skew, Slave Clock, Data 1, Data 2; tSKEW marked]
35
Anticipated Contributions
• Design methodology using EDA tools suitable for sub-Vt
(publication)
• A novel hold time fixing scheme using two-phase clocking
(publication)
Anticipated Bottlenecks
• Simulation time for coming up with design methodology
• DLL design for two-phase clocking
• Incorporating timing scheme into synthesis flow
36
37
Project 4: Latch Based Design for Single-VDD Alternative Approach to DVFS
38
Motivation
[Figure: SNM yield (%) by register type; labeled values 99.56, 98.96, 88.57] [5]
• Recent research has demonstrated near ideal energy savings using this concept by using three voltage islands.
39
Motivation
[Figure: energy (0 to 1) vs. workload (0 to 1) for Single-VDD, MVDD, and PDVS]
• Potential drawback: when considering total energy through the DC-DC converter, energy savings may be compromised
40
Hypothesis
1. We can achieve better energy efficiency in DVFS by
dynamically switching level of pipelining in a latch based design
running off of single VDD for a certain frequency range.
41
Approach
Level 0: Logic Block
Level 1: Logic Block/2, Logic Block/2
Level 2: Logic Block/4, /4, /4, /4
42
Approach
[Figure labels: DCDC; Energy, Power vs. Delay; Latch-based Design; Blk1, Blk2, Blkn]
43
Anticipated Contributions
• Analysis of optimal latch pipelining for ULVs (publication)
• Dynamic pipelining alternative approach to DVFS (publication)
Anticipated Bottlenecks
• Minimizing the overhead for switching the amount of
pipelining
• Latch-based timing issues
44
Publications
1. Fan Zhang, Yanqing Zhang et al., “A Batteryless 19µW MICS/ISM-Band Energy Harvesting Body Area Sensor Node SoC”, to appear in 2012 International Solid-State Circuits Conference, 02/2012.
2. Benton H. Calhoun et al., “Body Sensor Networks: A Holistic Approach from Silicon to Users”, IEEE Proceedings
3. Yanqing Zhang and Benton H. Calhoun, “The Cost of Fixing Hold Time Violations in Sub-threshold Circuits”, 2011 Subthreshold Microelectronics Conference, 09/2011
4. Yanqing Zhang et al., “Energy Efficient Design for Body Sensor Nodes”, Journal of Low Power Electronics and Applications, 04/2011.
5. Benton H. Calhoun, Sudhanshu Khanna, Yanqing Zhang, Joseph Ryan, and Brian Otis, “System Design Principles Combining Subthreshold Circuits and Architectures with Energy Scavenging Mechanisms”, International Symposium on Circuits and Systems (ISCAS), Paris, France, pp. 269-272, 05/2010.
45
References
[1] A. Barth, “TEMPO 3.1: A Body Area Sensor Network Platform for Continuous Movement Assessment”, BSN 2009.
[2] B. Calhoun and A. Chandrakasan, “Characterizing and Modeling Minimum Energy Operation for Subthreshold Circuits”, ISLPED 2004.
[3] S. Rai et al., “A 500uW Neural Tag with 2uVrms AFE and Frequency-Multiplying MICS/ISM FSK Transmitter”, ISSCC 2009.
[4] H. L. Yeager et al., “Microprocessor Power Optimization through Multi-Performance Device Insertion”, VLSI 2004.
[5] Y. Shakhsheer et al., “A 90nm Data Flow Processor Demonstrating Fine Grained DVS for Energy Efficient Operation from 0.25V to 1.2V”, CICC 2011.
46
Schedule: Key Anticipated Milestones
Project
Milestone (Publication for…) Expected Date
BASN chip
Hardware platform
comparison
Batteryless SoC chip
Completed
Library Design
Library Design
TX-gate based standard cells
Variation aware leakage
optimization
09/2012
12/2012
Hold Closure
Sub-Vt hold time method
using EDA tools
Latch pipelining analysis in
sub-Vt
12/2012
Alternative DVFS approach
Two-phase clock method
09/2013
10/2013
BASN chip
Latch DVFS
Latch DVFS
Hold Closure
Completed
01/2013
47
THANK YOU!
“PhD Degrees:
You have to be Lin it to Lin it”
-Yanqing Zhang
48
How Does Synthesis Relate?
1. Determine Architecture: MCU? Memories? Accelerators? Bus protocol?
2. HDL Description: Module SoC_components (in, out, clk) …
3. Standard Cell Design
4. Characterization: INV: delay=…, POWER=…, Leakage=…
5. Gate Translation
6. Timing Closure: Clock, Data
7. Place and Route
8. Chip Verification: DUT
49
Key Challenges: Weakened Drive Strength
[Figure: ring oscillator frequency (Hz, log scale from 10^2 to 10^9) vs. VDD (0 to 1.8 V)] [2]
We would like a slower drop-off in frequency, because the steep drop-off leads to a drastic increase in leakage
50
Key Challenges: Unbalanced FET Strengths
[Figure: relative strength of NMOS/PMOS: ratio of drain current vs. VDD (0.2 to 1.2 V) for sizings 140nm/90nm, 140nm/180nm, 140nm/270nm, 280nm/90nm, 420nm/90nm; annotations: "Increasing area", 2.6]
Standard cells are designed at nominal VDD. We can’t just scale VDD and expect balance. This constrains speed and increases
51
Approach
       Energy per     Energy per    Delay per            Max achievable    GOPS / W
       Instruction    Sample        Sample               data rate
GPP    2.62 pJ        210 pJ        8 us (80 cycles)     125 kHz           4.76
FPGA   N/A            2.22 pJ       94.5 ns (1 cycle)    10 MHz            450
ASIC   N/A            0.23 pJ       6.18 ns (1 cycle)    150 MHz           4348
• Implemented same R-R extraction algorithm
• Same technology, manual optimization of codes
• 100X energy efficiency for ASICs vs. GPPs
• Use GPPs sparingly, steer processing to ASICs
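The GOPS / W column appears to be the reciprocal of the energy per sample, treating one processed sample as one operation. That reading is inferred from the numbers rather than stated on the slide, and the function name below is illustrative.

```python
def gops_per_watt(energy_per_sample_joules):
    """Giga-operations per second per watt, assuming one operation per sample.

    ops/s per watt equals ops per joule, so this is 1 / energy, scaled to giga.
    """
    return 1.0 / energy_per_sample_joules / 1e9


if __name__ == "__main__":
    for name, energy in [("GPP", 210e-12), ("FPGA", 2.22e-12), ("ASIC", 0.23e-12)]:
        print(f"{name}: {gops_per_watt(energy):.2f} GOPS/W")
```

Under this assumption the three energy-per-sample values reproduce the 4.76, 450, and 4348 GOPS/W entries to within rounding.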
52
Approach
[Block diagram labels: Chip program, IMEM, MCU, DMA/SRAM, Bio-signal Accelerators, Packetizer, DPM, ADC, LNA, VGA, VBOOST, Digitized VBOOST. Control signals: power and channel control; sampling rate control; power/clock gate, clock rate, and bus control; duty cycle, data rate control]
53
Approach
Flexible Architecture for Data Processing
[Diagram labels: Generic Path, Example Custom Path, Example of Mixed Path; blocks: ECG, EMG, EEG, AFE, MCU (microcontroller), FIR, RR+AFib, ENV Detect, Processed Data]
Flexible Architecture for Data Transmission
[Diagram labels: Stream, Store and Burst, Event-Based Burst (if event); blocks: Data for TX, 4kB DMem, MCU]
• Data processing: max flexibility (generic path) or max efficiency (biosignal accelerators)
• Data transmission: supports modes from streaming (100% DC) to rare event detection (~0% DC)
54
Results
[Figure: input ECG signal (V) and AFib detect output (V) vs. time (s); annotations: "AFib begins", "Chip detects AFib"]
• When a rare AFib occurs, TX is enabled to transmit the last 8 beats of ECG (in the data memory).
• 19 µW total chip
55
Results
[Figure: VBoost sample, ADC IN (V), TX EN, and TX DATA (Header, Data, CRC) vs. time (s), with a zoomed view near 1.798 s; annotations: 655 ms, 650 µs]
• Every 5s, VBOOST is sampled to check for sufficient energy
• DPM enables RF crystal oscillator (20ms) and TX (650µs)
• 19 µW total chip
56
Motivation
[Figure: yield (%), 99.0 to 100, by cell type]
• A standard cell library is essential to synthesis, but scaling industry standard cells isn’t sufficient for sub-Vt: they fail SNM with variation
57
Motivation
[Figure: occurrence (%) vs. delay (ns) for L=90nm, L=180nm, L=270nm, L=360nm]
• Make the cells bigger?
→ Won’t work: greater active energy, and no guarantee of robustness
→ Even if it did work, area at least quadruples
58
Preliminary Results
[Figure: SNM butterfly plot, VNOR2-IN-NAND2-OUT vs. VNAND2-IN-NOR2-OUT (0 to 0.25 V), TX-Gate NOR2 vs. static CMOS NOR2]
Increased SNM @ FS corner
59
Preliminary Results
[Figure: SNM butterfly plot, VNOR2-IN-NAND2-OUT vs. VNAND2-IN-NOR2-OUT (0 to 0.25 V), TX-Gate NOR2 vs. static CMOS NOR2]
Increased SNM @ SS corner
60
Preliminary Results
[Figure: SNM butterfly plot, VNOR2-IN-NAND2-OUT vs. VNAND2-IN-NOR2-OUT (0 to 0.25 V), for TX-Gate NOR2]
TX-Gate NOR2 PASSES SNM @ TT corner with local variation
61
Preliminary Results
[Figure: occurrence (%) vs. delay (ns, -800 to 800) for thold and for tc-q at slew = 329ns, 419ns, 750ns, 1200ns]
• Hold time is quite immune to slew variation
• Slew affects clock-q: there is a limit to slew before clock-q becomes detrimental
62
Preliminary Results
DLL
P2p jitter: 373 ps
Frequency: 100 MHz
Power: 15 uW
% Jitter/Freq: 3.73%
Main Contribution: Low Power
[Circuit diagram labels: CLK_IN, Header/Footer Array, Current Starved Inverters, Weak Latches, Level Restorers, Out, Out_b]
• Low power DLL makes the novel two-phase timing scheme potentially worthwhile
63
Motivation
[Figure: SNM yield (%) by register type; labeled values 99.56, 98.96, 88.57] [4]
• DVFS provides the ability to trade off energy and delay to cater to variable workloads
64
Approach
[Diagram: latches ($t_{c\text{-}q}$, $t_{setup}$, $E_{latch}$, $P_{leak,latch}$) around logic blocks ($t_{logic}$, $E_{logic}$, $P_{leak,logic}$)]
Level 0:
Delay: $t_{c\text{-}q} + t_{setup} + t_{logic} = PER$
Energy: $2E_{latch} + E_{logic} + PER\,(P_{leak,logic} + 2P_{leak,latch})$
Level 1:
Delay: $t_{c\text{-}q} + t_{setup} + t_{logic}/2 = PER$
Energy: $3E_{latch} + E_{logic} + PER\,(P_{leak,logic} + 4P_{leak,latch})$
Level 2:
Delay: $t_{c\text{-}q} + t_{setup} + t_{logic}/4 = PER$
Energy: $5E_{latch} + E_{logic} + PER\,(P_{leak,logic} + 8P_{leak,latch})$
General (level $n$):
Delay: $t_{c\text{-}q} + t_{setup} + t_{logic}/2^{n} = PER$
Energy: $(2^{n}+1)E_{latch} + E_{logic} + PER\,(P_{leak,logic} + 2^{n+1}P_{leak,latch})$
Is this energy efficient? $(2^{n}+1)E_{latch} + \alpha E_{logic} + (t_{c\text{-}q} + t_{setup} + t_{logic}/2^{n})(P_{leak,logic} + 2^{n+1}P_{leak,latch})$
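The general expressions can be evaluated numerically to compare pipelining levels. The sketch below is a direct transcription of the delay and energy formulas; the function and parameter names are illustrative, and the meaning of α is not defined on the slide (it is treated here as a plain scaling factor on the logic energy, defaulting to 1).

```python
def period(n, t_cq, t_setup, t_logic):
    """Clock period at pipelining level n: t_c-q + t_setup + t_logic / 2^n."""
    return t_cq + t_setup + t_logic / 2 ** n


def energy_per_cycle(n, e_latch, e_logic, p_leak_logic, p_leak_latch,
                     t_cq, t_setup, t_logic, alpha=1.0):
    """Energy at pipelining level n, following the slide's general expression.

    (2^n + 1) * E_latch + alpha * E_logic
        + PER * (P_leak,logic + 2^(n+1) * P_leak,latch)
    alpha is an assumed scaling factor on logic energy (undefined on the slide).
    """
    per = period(n, t_cq, t_setup, t_logic)
    switching = (2 ** n + 1) * e_latch + alpha * e_logic
    leakage = per * (p_leak_logic + 2 ** (n + 1) * p_leak_latch)
    return switching + leakage


if __name__ == "__main__":
    # Placeholder numbers in arbitrary units, only to show the trend with n.
    for level in range(3):
        print(level, energy_per_cycle(level, 1.0, 10.0, 1.0, 0.5, 1.0, 1.0, 4.0))
```

Deeper pipelining shortens the period, which cuts the leakage term, but adds latch switching and latch leakage, so whether a given level pays off depends on the relative size of these terms.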
65
Preliminary Results
[Figure: intrinsic energy/latch (fJ) vs. average (tc-q+tsetup)/2 (ns)]
• Efficiency of latches has the potential to mitigate the pipelining overhead of this scheme
66
Preliminary Results
[Figure: energy (fJ) vs. delay (ms) for Reg and Latch]
• Efficiency of latches has the potential to mitigate the pipelining overhead of this scheme
67