Synthesis Based Design Techniques for Ultra Low Voltage Energy Efficient SoCs
Robust Low Power VLSI
Yanqing Zhang
February 27th, 2012

Motivation for Ultra Low Voltage Design
[Figure: Power vs. Performance; application classes: Servers and Data Centers, Desktop Applications, Portable Electronics, Wireless Sensor Nodes]

Motivation for Ultra Low Voltage Design
[1]
Application Characteristics:
1. Device lifetime
2. Robust functionality
3. Relatively small form factor
4. Speed not a major concern

Motivation for Ultra Low Voltage Design
Trend has been to use voltage scaling… BUT IT’S NOT THAT SIMPLE! [1]
Almost 2 orders of magnitude increase in energy efficiency [2]

Key Challenges: Increased Significance of Leakage
[Figure: % Leakage Energy/Total Energy for a Critical Path vs. VDD; annotations: Vth, “Minimum energy point occurs here”]

Key Challenges: Sensitivity to Variability
[Figure: Local Variation of Delay for 4 Stage Inverter Chain; Count vs. Delay (ns)]
Exponential dependence on Vth increases uncertainty in timing closure metrics. This decreases chip yield.

Key Challenges: Efficient Hardware Selection
High Speed SoCs: Very powerful. Low power, so it is not a power hog. Not for ULV domain.
COTS Based WSN: Fully functional TX and DSP, but 20mW power consumption. Short lifetime.
Custom IC Based IC [3]: No DSP. 3 day lifetime. Lacks functionality. Lifetime still short.
Conventionally, we consider SPEED as the main factor for a system. Our requirements are: system LONGEVITY and ROBUST FUNCTIONALITY. We can really improve SoCs in the ULV domain if we change our strategy.

Summary of Dissertation Goals
PROJECT 1 (completed)
• Design architecture for a Body Area Sensor Node (BASN) SoC capable of battery-less operation.
PROJECT 2
• Local variation robust standard cell library for sub-Vt
• Synthesis flow reducing leakage energy
PROJECT 3
• Hold time robust design methodology
PROJECT 4
• Alternative approach to DVFS

Outline
• Motivation
• Hardware Selection for Energy Efficient SoC (BASN chip)
  • Motivation
  • Hypothesis
  • Approach
  • Results
• Library Design and Characterization at ULVs for Robust Timing Closure
• Hold Time Analysis and Timing Closure Method for Subthreshold
• Latch Based Design for Single-VDD Alternative Approach to DVFS

Project 1: Hardware Selection for Energy Efficient SoC (BASN chip)

Motivation
[Figure labels: Information; Assessment, Treatment]
Wireless body area sensor nodes (BASN) enable inexpensive continuous monitoring of patients.
Battery replacement/charging for body-worn devices may not be feasible or desirable.

Motivation
COTS Based WSN: Fully functional TX and DSP, but 20mW power consumption. Short lifetime.
Custom IC Based IC [3]: No DSP. 3 day lifetime. Lacks functionality. Lifetime still short.
[Figure label: MCU]
• BASNs exemplify design space requiring energy efficiency to the extreme
• State-of-the-art low power modules help…but not full solution
• On-chip processing a MUST (TX duty cycle, node size), but ‘throwing on an MCU’ entails high power ~100µW
• Judicious hardware selection needed

Hypothesis
[Block diagram: ECG, EMG, EEG inputs; AFE; ADC; µController; DSP; Memory; RF; Power Mgmt.; Boost Converter; TEG; VBOOST; Voltage Regulation; RF Kick-Start; Signal Path and Power Path; ~60µW]
We can achieve a battery-less (energy harvesting) BASN SoC capable of various bio-signal acquisition and flexible data processing with state-of-the-art low power circuit design and judicious hardware selection.

Approach
[Figure: Measured Energy/Op (pJ) vs. Delay (µs); series: MCU, RR+AFib Accel., 30-Tap FIR Accel.]
Accelerators:
• Programmable FIR
• Heart rate (R-R) extraction
• Atrial Fibrillation (AFib) detection
• Band energy envelope detection
• Direct memory access (DMA)
• Packetizer
Energy Efficiency / Sample:
30 Tap FIR | MCU 6.3 nJ | Accel 57.6 pJ | 110x
Env. Detect | MCU 3.6 nJ | Accel 530 fJ | 6800x
R-R Extract | MCU 12 pJ | Accel 3 fJ | 4000x

Significance
Sensors This Work [18] ECG, EMG, EEG ECG Supply E Harvesting 30mV, -10dBm 1.2V Thermal, RF X DPM, Clock Power Mgmt. Clock/Power gate gate Gen. Purp. 1.5 pJ/Instr @ X MCU 200kHz Accelerators Memory Many 5.5kB (0.3V-0.7V) Digital Power Total Power 2.1µW 19µW ASIC 42kB (1.2V) ~12µW 31.1µW [19] Neural, ECG, EMG, EEG 1V X [20] X X X Power gate X X X 28.9pJ/Instr @ 73kHz X ASIC Few X X N/A 500µW EEG 1V X [21] [22] Temp, ECG, TIV Pressure 1.2V 0.4V/0.5V X Solar 20kB 5kB (0.4V) (1.2V) 2.1µW 500µW 2.1µW 77.1µW 2.4mW 7.7µW x
• Has lower power, lower minimum input supply voltage, and more complete system integration than all other reported wireless BASN SoCs
• First wireless biosignal acquisition chip powered solely from thermoelectric harvested power

Project 2: Library Design and Characterization at ULVs for Robust Timing Closure

Motivation
[Figure: VNOR2-IN-NAND2-OUT vs. VNAND2-IN-NOR2-OUT for Static CMOS NOR2. Static CMOS NOR2 FAILS SNM @ TT corner with local variation]

Motivation
Problem: Weak devices (PMOS) + Stacked transistor variation
Standard cell library essential to synthesis, but scaled industry standard cells aren’t sufficient for sub-Vt: they fail SNM with variation.

Motivation
[Figure: chain of logic gates]
LEAKING WITHOUT PURPOSE! [4]

Motivation
[Figure: Probability (%) vs. log(delay); series: 2-stage, 4-stage, 8-stage, 16-stage; σ/µ = .014, .019, .022, .024]
Conventional method of ‘process corner based timing closure’ is unsuitable for sub-Vt.
It doesn’t capture sensitivity to local variation.

Hypothesis
1. Using TX-gate style logic, we can achieve lower energy consumption for a given yield when compared to static CMOS gates.
2. We can achieve decreased total energy with a flow that optimizes leakage on non-critical paths, but still ensures path yield with variation aware cell characterization.

Proposed Approach
A New Cell Library:
1. TX-Gate Based Gate Design
2. Long Length Low Leakage Gate Design
3. Setup/Hold Optimized Register
Flow:
4. Synthesis Gate Replacement
5. Place and Route Retiming
6. Clock Network Extraction
7. Post Clock Extraction Retiming
8. Circuit Simulation and Evaluation
[Figure: TX-gate schematic with inputs A, B; netlist excerpt: lang = spectre parameters … INVX1 A B VDD VSS … .sim opt …]

Anticipated Contributions
• Variation immune TX-Gate standard cell library (publication)
• Variation aware path leakage optimization technique (publication)
Anticipated Bottlenecks
• Minimizing leakage in TX-based cells
• Matching speed with static CMOS counterparts
• Layout compactness issues

Project 3: Hold Time Analysis and Timing Closure Method for Subthreshold

Motivation
Skew is increased in sub-Vt because of increased PVT variation sensitivity.
[Figure: timing diagram; signals: Data 1, Data 2, Clock, Clock+skew; tSKEW]

Motivation
Slew is decreased in sub-Vt because of increased PVT variation sensitivity.
[Figure: timing diagram; signals: Data 1, Data 2, Clock w/ BAD slew]

Motivation
Hold time, clock-q uncertainty in sub-Vt because of increased PVT variation sensitivity.
[Figure: timing diagram; signals: Data 1, Data 2, Clock]

Motivation
• Conventional method to solve hold time:
  • Use clock tree synthesis to design a tree with many levels (control skew) and large buffers (control slew)
  • Use buffer insertion to take care of hold time, clock-q
THIS WON’T WORK IN Sub-Vt!
[Figure annotation: tSKEW]

Motivation
[Figure: Yield (%) vs. Level of clock tree]
• More levels = more skew! Contrary to conventional wisdom…

Motivation
[Figure: Total Circuit Power (Normalized) and % Power Overhead of Buffers vs. Yield (%); series: PCLK, PREG, PHOLD]
• Buffer insertion is energy costly!
• And still doesn’t solve our problem (subject to variation too…)

Hypothesis
1. We can achieve a similar parameter controlling method suitable for sub-Vt by re-analyzing the effects of each parameter.
2. We can achieve a more energy efficient method for a given yield constraint using a novel two-phase clock based timing scheme.

Approach
EDA Tools Method: find the lowest energy approach to accomplish:
1. Limited Skew
2. Judicious Hold Buffer Insertion
3. Tolerable Data Slew
4. Tolerable Clock Slew
5. Robust VS Register
Two-phase Clock Method: No More Buffers!
[Figure: DLL driving Master Clock and Slave Clock; annotations: tSKEW, Less tSKEW]

Approach
Split register into 2 positive transparent latches.
Tune DLL to fix timing.
[Figure: timing diagram with steps 1 to 4; signals: Master Clock, Master Clock+skew, Slave Clock, Data 1, Data 2; tSKEW]

Anticipated Contributions
• Design methodology using EDA tools suitable for sub-Vt (publication)
• A novel hold time fixing scheme using two-phase clocking (publication)
Anticipated Bottlenecks
• Simulation time for coming up with design methodology
• DLL design for two-phase clocking
• Incorporating timing scheme into synthesis flow

Project 4: Latch Based Design for Single-VDD Alternative Approach to DVFS

Motivation
[Figure: SNM Yield (%) by Register Type; values: 99.56, 98.96, 88.57] [5]
• Recent research has demonstrated near ideal energy savings using this concept by using three voltage islands.

Motivation
[Figure: Energy vs. Workload; series: Single-VDD, MVDD, PDVS]
• Potential drawback: when considering total energy through the DC-DC converter, may compromise energy savings

Hypothesis
1. We can achieve better energy efficiency in DVFS by dynamically switching the level of pipelining in a latch based design running off of a single VDD for a certain frequency range.
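The DC-DC drawback noted in the Motivation above can be put into numbers with a minimal sketch. It assumes the simplest possible model, in which the energy drawn from the source equals the load energy divided by the converter efficiency. The function name and every energy and efficiency value below are hypothetical, chosen only for illustration, and are not taken from the slides or from [5].

```python
def source_energy(load_energy: float, converter_efficiency: float) -> float:
    """Energy drawn from the supply for a given load energy (same units).

    Assumed model (not from the slides): E_source = E_load / efficiency.
    """
    if not 0.0 < converter_efficiency <= 1.0:
        raise ValueError("converter_efficiency must be in (0, 1]")
    return load_energy / converter_efficiency


# Hypothetical, normalized numbers for illustration only.
# Multi-rail scheme: lower load energy, but a lossier conversion path.
multi_vdd = source_energy(load_energy=0.6, converter_efficiency=0.5)
# Single-VDD scheme: higher load energy, but a more efficient conversion path.
single_vdd = source_energy(load_energy=1.0, converter_efficiency=0.9)

print(f"multi-VDD source energy:  {multi_vdd:.3f}")
print(f"single-VDD source energy: {single_vdd:.3f}")
```

With these made-up numbers the saving at the load is erased once conversion loss is counted, which is the kind of trade-off the hypothesis sets out to examine.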
Approach
Level 0: Logic Block
Level 1: Logic Block/2, Logic Block/2
Level 2: /4, /4, /4, /4

Approach
[Figure: Energy, Power vs. Delay for DCDC-based operation and for Latch-based Design; blocks: Blk1, Blk2, Blkn]

Anticipated Contributions
• Analysis of optimal latch pipelining for ULVs (publication)
• Dynamic pipelining alternative approach to DVFS (publication)
Anticipated Bottlenecks
• Minimizing the overhead for switching the amount of pipelining
• Latch-based timing issues

Publications
1. Fan Zhang, Yanqing Zhang et al., “A Batteryless 19µW MICS/ISM-Band Energy Harvesting Body Area Sensor Node SoC”, to appear in 2012 International Solid-State Circuits Conference, 02/2012.
2. Benton H. Calhoun et al., “Body Sensor Networks: A Holistic Approach from Silicon to Users”, IEEE Proceedings.
3. Yanqing Zhang and Benton H. Calhoun, “The Cost of Fixing Hold Time Violations in Sub-threshold Circuits”, 2011 Subthreshold Microelectronics Conference, 09/2011.
4. Yanqing Zhang et al., “Energy Efficient Design for Body Sensor Nodes”, Journal of Low Power Electronics and Applications, 04/2011.
5. Benton H. Calhoun, Sudhanshu Khanna, Yanqing Zhang, Joseph Ryan, and Brian Otis, “System Design Principles Combining Subthreshold Circuits and Architectures with Energy Scavenging Mechanisms”, International Symposium on Circuits and Systems (ISCAS), Paris, France, pp. 269-272, 05/2010.

References
[1] A. Barth, “TEMPO 3.1: A Body Area Sensor Network Platform for Continuous Movement Assessment”, BSN 2009.
[2] B. Calhoun and A. Chandrakasan, “Characterizing and Modeling Minimum Energy Operation for Subthreshold Circuits”, ISLPED 2004.
[3] S. Rai et al., “A 500uW Neural Tag with 2uVrms AFE and Frequency-Multiplying MICS/ISM FSK Transmitter”, ISSCC 2009.
[4] H. L. Yeager et al., “Microprocessor Power Optimization through Multi-Performance Device Insertion”, VLSI 2004.
[5] Y. Shakhsheer et al., “A 90nm Data Flow Processor Demonstrating Fine Grained DVS for Energy Efficient Operation from 0.25V to 1.2V”, CICC 2011.

Schedule: Key Anticipated Milestones
Project Milestone (Publication for…) Expected Date BASN chip Hardware platform comparison Batteryless SoC chip Completed Library Design Library Design TX-gate based standard cells Variation aware leakage optimization 09/2012 12/2012 Hold Closure Sub-Vt hold time method using EDA tools Latch pipelining analysis in sub-Vt 12/2012 Alternative DVFS approach Two-phase clock method 09/2013 10/2013 BASN chip Latch DVFS Latch DVFS Hold Closure Completed 01/2013

THANK YOU!
“PhD Degrees: You have to be Lin it to Lin it” -Yanqing Zhang

How Does Synthesis Relate?
1. Determine Architecture: MCU? Memories? Accelerators? Bus protocol?
2. HDL Description: Module SoC_components (in, out, clk) …
3. Standard Cell Design
4. Characterization: INV: delay=… POWER=… Leakage=…
5. Gate Translation
6. Timing Closure (Clock, Data)
7. Place and Route
8. Chip Verification (DUT)

Key Challenges: Weakened Drive Strength
[Figure: Ring Oscillator Frequency (Hz), 10^2 to 10^9, vs. VDD] [2]
We would like a slower drop-off in frequency, because this leads to drastic increase in leakage.

Key Challenges: Unbalanced FET Strengths
[Figure: Relative Strength of NMOS/PMOS, Ratio of Drain Current vs. VDD; series: 140nm/90nm, 140nm/180nm, 140nm/270nm, 280nm/90nm, 420nm/90nm; annotations: Increasing area, 2.6]
Standard cells are designed at nominal VDD. We can’t just scale VDD and expect balance. This constrains speed and increases

Approach
Platform | Energy per Instruction | Energy per Sample | Delay per Sample | Max achievable data rate | GOPS / W
GPP | 2.62 pJ | 210 pJ | 8 us (80 cycles) | 125 kHz | 4.76
FPGA | N/A | 2.22 pJ | 94.5 ns (1 cycle) | 10 MHz | 450
ASIC | N/A | 0.23 pJ | 6.18 ns (1 cycle) | 150 MHz | 4348
• Implemented same R-R extraction algorithm
• Same technology, manual optimization of codes
• 100X energy efficiency for ASICs vs. GPPs
• Use GPPs sparingly, steer processing to ASICs

Approach
[Block diagram: Chip program; VBOOST; Digitized VBOOST; DPM; IMEM; DMA/SRAM; MCU; Bio-signal Accelerators; LNA; VGA; ADC; Packetizer; control signals: Power and Channel control; Sampling rate control; Power/clock gate, clock rate, and bus control; Duty cycle, data rate control]

Approach
Flexible Architecture for Data Processing
[Figure: ECG, EMG, EEG inputs through AFE; Generic Path (MCU: microcontroller); Custom Path, ECG example (FIR, RR+AFib); Example of Mixed Path (FIR, ENV Detect, MCU); outputs: Processed Data, Data for TX]
Flexible Architecture for Data Transmission
[Figure: Stream; Store and Burst (4kB DMem); Event-Based Burst (4kB DMem, “If event”)]
• Data processing: max flexibility (generic path) or max efficiency (biosignal accelerators)
• Data transmission: supports modes from streaming (100% DC) to rare event detection (~0% DC)

Results
[Figure: Input ECG Signal (V) and AFib Detect (V) vs. Time (s); annotations: AFib begins, Chip detects AFib]
• When a rare AFib occurs, TX is enabled to transmit the last 8 beats of ECG (in the data memory).
• 19 µW total chip

Results
[Figure: ADC IN (V), TX EN, and TX DATA (Header, Data, CRC) vs. Time (s); annotations: VBoost sample, 655 ms, 650 µs]
• Every 5s, VBOOST is sampled to check for sufficient energy
• DPM enables RF crystal oscillator (20ms) and TX (650µs)
• 19 µW total chip

Motivation
[Figure: Yield (%), 99.0 to 100, by Cell Type]
Standard cell library essential to synthesis, but scaled industry standard cells aren’t sufficient for sub-Vt: they fail SNM with variation.

Motivation
[Figure: Occurrence (%) vs. Delay (ns); series: L=90nm, L=180nm, L=270nm, L=360nm]
Make the cells bigger?
Won’t work: greater active energy, and no guarantee of robustness.
Even if it did work, area at least quadruples.

Preliminary Results
[Figure: VNOR2-IN-NAND2-OUT vs. VNAND2-IN-NOR2-OUT; TX-Gate NOR2 and Static CMOS NOR2. Increased SNM @ FS corner]

Preliminary Results
[Figure: VNOR2-IN-NAND2-OUT vs. VNAND2-IN-NOR2-OUT; TX-Gate NOR2 and Static CMOS NOR2. Increased SNM @ SS corner]

Preliminary Results
[Figure: VNOR2-IN-NAND2-OUT vs. VNAND2-IN-NOR2-OUT; TX-Gate NOR2. TX-Gate NOR2 PASSES SNM @ TT corner with local variation]

Preliminary Results
[Figure: Occurrence (%) vs. Delay (ns); series: thold; tc-q, slew=329ns; tc-q, slew=419ns; tc-q, slew=750ns; tc-q, slew=1200ns]
• Hold time is quite immune to slew variation
• Slew affects clock-q: there is a limit to slew before clock-q becomes detrimental

Preliminary Results
P2p jitter | DLL CLK_IN Frequency | Power | % Jitter/Freq | Main Contribution
373 ps | 100 MHz | 15 uW | 3.73% | Low Power
[Figure: DLL schematic; labels: Header/Footer Array, Current Starved Inverters, Weak Latches, Level Restorers, Out, Out_b]
• Low power DLL makes the novel two-phase timing scheme possibly worthwhile

Motivation
[Figure: SNM Yield (%) by Register Type; values: 99.56, 98.96, 88.57] [4]
• DVFS provides the ability to trade off energy and delay to cater to variable workloads

Approach
[Figure: latch, logic block, latch at each pipelining level; labels: tc-q, Elatch, Pleak,latch; tlogic, Elogic, Pleak,logic; tsetup, Elatch, Pleak,latch]
Level 0: Logic Block
Delay: $t_{c-q} + t_{setup} + t_{logic} = PER$
Energy: $2E_{latch} + E_{logic} + PER\,(P_{leak,logic} + 2P_{leak,latch})$
Level 1: Logic Block/2, Logic Block/2
Delay: $t_{c-q} + t_{setup} + t_{logic}/2 = PER$
Energy: $3E_{latch} + E_{logic} + PER\,(P_{leak,logic} + 4P_{leak,latch})$
Level 2:
Delay: $t_{c-q} + t_{setup} + t_{logic}/4 = PER$
Energy: $5E_{latch} + E_{logic} + PER\,(P_{leak,logic} + 8P_{leak,latch})$
Level n:
Delay: $t_{c-q} + t_{setup} + t_{logic}/2^{n} = PER$
Energy: $(2^{n}+1)E_{latch} + E_{logic} + PER\,(P_{leak,logic} + 2^{n+1}P_{leak,latch})$
Is this energy efficient?
$(2^{n}+1)E_{latch} + \alpha E_{logic} + (t_{c-q} + t_{setup} + t_{logic}/2^{n})(P_{leak,logic} + 2^{n+1}P_{leak,latch})$

Preliminary Results
[Figure: Intrinsic Energy/latch (fJ) vs. Average (tc-q+tsetup)/2 (ns)]
• Efficiency of latches has the potential to mitigate the pipelining overhead of this scheme

Preliminary Results
[Figure: Energy (fJ) vs. Delay (ms); series: Reg, Latch]
• Efficiency of latches has the potential to mitigate the pipelining overhead of this scheme
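
The level-n delay and energy expressions from the pipelining Approach slide above can be evaluated with a short sketch. The class and function names are hypothetical, the parameter values in the usage lines are made-up and unitless, and α is read here as a scaling factor on logic energy (the slides do not define it), so this is an illustration under those assumptions rather than the author's model code.

```python
from dataclasses import dataclass


@dataclass
class LatchPipelineParams:
    """Symbols from the Approach slide; names here are illustrative."""
    t_cq: float          # latch clock-to-q delay
    t_setup: float       # latch setup time
    t_logic: float       # delay of the unpipelined logic block
    e_latch: float       # energy per latch stage
    e_logic: float       # energy of the logic block
    p_leak_logic: float  # leakage power of the logic block
    p_leak_latch: float  # leakage power per latch term in the slide formula
    alpha: float = 1.0   # assumed scaling on logic energy (undefined in slides)


def period(p: LatchPipelineParams, n: int) -> float:
    """PER at pipelining level n: t_cq + t_setup + t_logic / 2^n."""
    return p.t_cq + p.t_setup + p.t_logic / 2 ** n


def energy(p: LatchPipelineParams, n: int) -> float:
    """(2^n + 1) E_latch + alpha E_logic + PER (P_leak,logic + 2^(n+1) P_leak,latch)."""
    switching = (2 ** n + 1) * p.e_latch + p.alpha * p.e_logic
    leakage = period(p, n) * (p.p_leak_logic + 2 ** (n + 1) * p.p_leak_latch)
    return switching + leakage


# Made-up, unitless values purely to exercise the formulas.
params = LatchPipelineParams(t_cq=1.0, t_setup=1.0, t_logic=8.0,
                             e_latch=1.0, e_logic=10.0,
                             p_leak_logic=0.5, p_leak_latch=0.25)
for level in range(3):
    print(level, period(params, level), energy(params, level))
```

Sweeping n this way shows where added latch energy and latch leakage start to outweigh the shorter period, which is the question the "Is this energy efficient?" expression poses.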