ELEN 468 Advanced Logic Design
Lecture 29: Low Power Design

Power Dissipation
• Power increases despite Vdd decrease
[Figure: power (Watts, log scale from 0.1 to 100) vs. year (1971 to 2000) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® proc, and P6. Courtesy, Intel]

Power Density
[Figure: power density (W/cm^2, log scale from 1 to 10000) vs. year (1970 to 2010) for the 4004 through the P6, with Hot Plate, Nuclear Reactor, and Rocket Nozzle marked as reference levels. Courtesy, Intel]

Why Power Increased
• Growing die size, fast frequency scaling
[Figure: clock frequency (MHz, log scale from 10 to 10000) vs. year (85 to 05)]

Gate Power Dissipation
• Leakage power
• Dynamic power
• Short circuit power

Dynamic Power
• Occurs at each switching
• $P_d = C_L \cdot V_{dd}^2 \cdot f_p$
  • $f_p$: switching frequency
[Figure: inverter charging and discharging its output, with the conducting transistor in the linear and saturation regions]

Leakage Power
• Static
• Leakage current = $a \cdot V_{dd}$
• Leakage current = $b / V_t$
• Killer to CMOS technology
[Figure: inverter leakage paths from Vdd to the output and from the output to ground, in the linear and saturation regions]

Short Circuit Power
• During switching, there is a short moment when both PMOS and NMOS are partially on
• $P_s = Q \cdot (V_{dd} - V_t)^3 \cdot t_r \cdot f_p$
  • $t_r$: rise time
[Figure: inverter with input falling and input rising]

Where Does Power Go?
[Figure: stacked power percentages (0% to 100%) split into Active power, Cache leakage, Gate leakage, and Core transistor leakage]
• Total chip power based on ITRS roadmap
• In 2004, we are just breaking even [Kim, et al, Computer 2003]
• Scalable X86 CPU Design for 90nm: low VT devices are <1% of total non-memory transistor width [J. Schultz and C. Webb, ISSCC 2004]

Energy - Performance Space
• Every design is a point on a 2-D plane
[Figure: axes Energy and Performance]

Low Power Design
• Reduce dynamic power
  • a (activity factor): clock gating, sleep mode
  • C: small transistors (esp. on clock), short wires
  • VDD: lowest suitable voltage
  • f: lowest suitable frequency
• Reduce static power
  • Selectively use low Vt devices
  • Power gating, MTCMOS
  • Stacked devices
  • Body bias

Clock Gating
• Gate off clock to idle functional units
  • e.g., floating point units
  • need logic to generate disable signal
    • increases complexity of control logic
    • consumes power
    • timing critical to avoid clock glitches at OR gate output
  • additional gate delay on clock signal
    • gating OR gate can replace a buffer in the clock distribution tree
[Figure: clock and disable feed an OR gate that clocks the register (Reg) in front of the functional unit]

Active Power Reduction - Supply Voltage Reduction
• Static
  [Figure labels: Low Supply Voltage, High Supply Voltage, Slow, Fast, Slow]
  • Pros:
    • Always active in saving
  • Cons:
    • Additional power delivery network
    • Needs special care of interface between power domains
    • Signals close to Vt: excessive leakage and reduced noise margins
• Dynamic
  • Adjusting operation voltage and frequency to performance requirements:
    • High performance: high Vdd & frequency
    • Power saving: low Vdd & frequency
  • Pros:
    • Doesn't limit performance
  • Cons:
    • Penalty of transition between different power states can be high (in performance and power)
    • Additional control logic

Voltage Islands (Multi-Vdd)
• Usami+ JSSC'98; Lackey+ ICCAD'02; GVI DAC'03
• Allow both macro and cell voltage assignment
• Allow different voltage islands in the same circuit row
• Lift unnatural layout restrictions
• Minimal placement disturbance
[Figure: layout with Vddh and Vddl regions]

Level Converter
• Interface circuit when Vddl drives Vddh, to avoid leakage
[Figure: a Vddl-driven gate feeding a Vddh gate leaves a transistor weakly on; conventional dual supply level converter (Vddh and Vddl, IN, OUT); new single supply level converter (IN, OUT)]

Adjacency Metrics for Clustering
• Logic adjacency metric (LAM): Vddl fanin cone of level shifter without going through Vddh
• Physical adjacency metric (PAM): for each candidate Vddl cell, compute total size of its neighbor Vddl cells
• LAM to guide logic aware voltage assignment
• PAM to guide placement aware voltage re-assignment
[Figure: Vddh and Vddl cells with level converters LC1, LC2, LC3]

Level Converter Optimizations
• Logic replacement (or gate sizing)
• LC/Buffer co-optimization
[Figure: level converters (LC) around MUX 1, MUX 2, and DEC driving output Z; level converter and buffer B placed between A and B]

Placement to Form Voltage Islands with Power Grid Co-design
• Based on Vddl and Vddh cell placement after voltage assignment, define Vddl/Vddh power grids on demand
• Detailed placement to form Vddl/Vddh voltage islands that can hit their corresponding power supplies
[Figure: power grids on demand, alternating Vddh and Vddl rails]

Example of Voltage Islands
• IBM Cu11, 0.13um, 400 MHz
• Vddh = 1.5V
• Vddl = 1.2V
• No timing degradation, no area increase!
(courtesy IBM)
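The saving available from a voltage island follows directly from the quadratic Vdd term in the dynamic power formula. The sketch below applies Pd = CL·Vdd^2·fp to the Vddh = 1.5V, Vddl = 1.2V, 400 MHz example. The total switched capacitance `C_TOTAL`, the use of the clock rate as the switching frequency, and the helper names are assumptions made for illustration; level converter overhead and leakage are ignored.

```python
def dynamic_power(c_load, vdd, f_switch):
    """Dynamic power in watts, following Pd = CL * Vdd^2 * fp."""
    return c_load * vdd ** 2 * f_switch


def island_power(c_total, f_switch, vddh, vddl, frac_low):
    """Dynamic power when frac_low of the switched capacitance runs at vddl."""
    p_high = dynamic_power(c_total * (1.0 - frac_low), vddh, f_switch)
    p_low = dynamic_power(c_total * frac_low, vddl, f_switch)
    return p_high + p_low


VDDH, VDDL = 1.5, 1.2   # volts, from the voltage island example
F_SWITCH = 400e6        # hertz; assumes switching at the 400 MHz clock rate
C_TOTAL = 1e-9          # farads; hypothetical value chosen for illustration

baseline = dynamic_power(C_TOTAL, VDDH, F_SWITCH)

# Fraction of dynamic power saved by any cell moved from Vddh to Vddl.
per_cell_saving = 1.0 - (VDDL / VDDH) ** 2

for frac_low in (0.0, 0.25, 0.5, 0.75):
    p = island_power(C_TOTAL, F_SWITCH, VDDH, VDDL, frac_low)
    saved = 1.0 - p / baseline
    print(f"{frac_low:.0%} of capacitance at Vddl: {p:.3f} W ({saved:.1%} saved)")
```

Under these assumptions each cell moved to Vddl saves 36% of its dynamic power, so the chip-level saving is 36% times the fraction of switched capacitance placed in Vddl islands.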
Dynamic Frequency and Voltage Scaling
• Always run at the lowest supply voltage that meets the timing constraints
  • DFS (dynamic frequency scaling) saves only power
  • DVS (dynamic voltage scaling) + DFS saves both energy and power
• A DVS+DFS system requires the following
  • A programmable clock generator (PLL)
    • PLL from 200MHz to 700MHz in increments of 33MHz
  • A supply regulation loop that sets the minimum VDD necessary for operation at the desired frequency
    • 32 levels of VDD from 1.1V to 1.6V
  • An operating system that sets the required frequency + supply voltage to meet the task completion deadlines
    • Heavier load: ramp up VDD, when stable speed up clock
    • Lighter load: slow down clock, when PLL locks onto new rate, ramp down VDD

Leakage Reduction Techniques
[Figure: four circuit sketches]
• Stack effect: pullup (Vdd), stacked devices of widths Wu and Wl with intermediate node Vx
• Dual Vt partitioning: high Vt devices and low Vt devices
• Variable threshold (VTCMOS): Vnwell ≥ Vdd, Vpwell ≤ 0
• Multi-threshold (MTCMOS): HVT sleep transistors between Vdd and virtual Vdd and between virtual Gnd and ground, with low Vt logic in between

Natural Transistor Stacks
How?
• Reduce the leakage by stacking the devices
  • Reduced Vds
  • Negative Vgs
  • Negative Vbs

Design with Dual Vth
[Figure: dual Vth evaluation and dual Vth design]
• Two flavors of transistors: slow (high Vth) and fast (low Vth)
• Low Vth devices are faster, but have ≈10X leakage

Impacts of Variable VT
• Reducing the VT increases the subthreshold leakage current (exponentially)
  $V_T = V_{T0} + \gamma\left(\sqrt{|-2\phi_F + V_{SB}|} - \sqrt{|-2\phi_F|}\right)$
  where $V_{T0}$ is the threshold voltage at $V_{SB} = 0$, $V_{SB}$ is the source-bulk (substrate) voltage, and $\gamma$ is the body-effect coefficient
• But, reducing VT decreases gate delay (increases performance)

Variable VT through Body Bias
• For NMOS, the substrate is normally tied to ground (VSB = 0)
• A negative bias on VSB causes VT to increase
• Adjusting the substrate bias at runtime is called adaptive body-biasing (ABB) or dynamic threshold scaling (DTS)
• Requires a triple well fab process
[Figure: threshold voltage (axis ticks from 0.4 to 0.9) vs. VSB (V) from -2.5 to 0; circuit labels VSB,p and VSB,n]

Forward/Reverse Body Biasing
• RBB (Reverse Body Bias): zero body bias in active mode, a deep reverse bias in standby mode.
  Disadvantages:
  • Increases PN junction reverse leakage
  • Technology scaling worsens short channel effects and weakens the Vth modulation capability
• FBB (Forward Body Bias): high Vth in standby mode, forward body biasing to achieve better current drive in active mode.
  Disadvantages:
  • Larger junction capacitance
  • High body effect for stacked devices

Implementation of Dynamic Vth Scaling (DTS)
How?
• When the critical path replica frequency is less than the reference CLK, adjust bias to decrease Vth.
• Otherwise adjust bias to increase Vth.
Results:
• The lowest Vth is delivered (NBB, no body bias) if the highest performance is required.
• When the performance demand is low, clock frequency is lowered and Vth is raised via RBB to reduce the run time leakage power dissipation.
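The DTS rule above is a simple feedback loop, which the following sketch illustrates. It follows the slide's sign convention (a more negative bias raises Vth, zero bias is NBB). The step size, bias range, and the linear `toy_replica_freq` model are hypothetical values chosen for illustration, not parameters from the lecture.

```python
def dts_step(replica_freq, ref_freq, bias, step=0.05, bias_min=-1.0):
    """One update of the body bias (volts, 0.0 = no body bias).

    If the critical path replica is slower than the reference clock, move
    the bias toward zero (lower Vth, faster). Otherwise apply more reverse
    bias (higher Vth, less leakage). The result is clamped to [bias_min, 0].
    """
    if replica_freq < ref_freq:
        bias += step
    else:
        bias -= step
    return min(0.0, max(bias_min, bias))


def toy_replica_freq(bias):
    """Hypothetical linear model: 500 MHz at zero bias, 400 MHz at -1.0 V."""
    return 500e6 + 100e6 * bias


def settle(ref_freq, bias=0.0, iterations=50):
    """Run the loop for a fixed number of steps and return the final bias."""
    for _ in range(iterations):
        bias = dts_step(toy_replica_freq(bias), ref_freq, bias)
    return bias


for ref in (500e6 + 1, 450e6, 300e6):
    print(f"reference {ref / 1e6:.0f} MHz -> bias settles near {settle(ref):.2f} V")
```

With this toy model the loop stays at zero bias when the highest performance is demanded, dithers around the bias where the replica just meets a lower reference clock, and saturates at the deepest reverse bias when the demand is very low.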
Power Gating Using Sleep Transistors
• Can reduce leakage by gating the supply rails when the circuit is in sleep mode
  • in normal mode, sleep = 0 and the sleep transistors must present as small a resistance as possible (via sizing)
  • in sleep mode, sleep = 1, the transistor stack effect reduces leakage by orders of magnitude
• Or can eliminate leakage by switching off the power supply (but lose the memory state)

Example of Power Gating
• Can reduce power 1000X
• Smaller voltage swing (IR drop on sleep transistors)
• Lower performance
• Increased noise coupling
• Local power grid design
[Figure: power switch control signals, embedded power switches, rows of standard cells]

Power Dissipation on Variation Tolerance
• Conventional variation tolerance
  • Using large timing safety margin
  • Implies aggressive timing target
  • Greater power dissipation
• Observation
  • Near-worst-case variations occur rarely
  • Safety margin is applied continuously to guard against the small chance of variations
  • Poor power efficiency

Question
• Can we deal with errors instead of preventing them from occurring by conservative binning/clocking?
• How fast can we speed up the circuit with the error rate in a manageable range?

Fault tolerant system
• Begin with reference values
• Introduce redundancy
  • Hardware: Triple Modular Redundancy
  • Time: Repeated process
  • Information: Code
  • Software: various algorithms
• How about for delay faults?
  • How do we detect (and maybe correct) errors?
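As a concrete instance of hardware redundancy, the sketch below shows a triple modular redundancy voter in its simplest form: three copies compute the same result and a bitwise 2-of-3 majority masks a wrong value from any single copy. The `tmr` wrapper, its fault-injection parameters, and the `add_one` module are hypothetical names used only for this illustration. The voter masks value errors; it does not by itself answer the delay fault question raised above.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three integer words."""
    return (a & b) | (a & c) | (b & c)


def tmr(module, x, faulty_copy=None, flip_mask=0):
    """Run three copies of module on x and vote on the outputs.

    faulty_copy and flip_mask are hypothetical fault-injection knobs: the
    output of the chosen copy (0, 1, or 2) is XORed with flip_mask.
    """
    outputs = [module(x) for _ in range(3)]
    if faulty_copy is not None:
        outputs[faulty_copy] ^= flip_mask
    return majority_vote(*outputs)


def add_one(x):
    """Stand-in 8-bit functional unit used for the demonstration."""
    return (x + 1) & 0xFF


clean = tmr(add_one, 41)
masked = tmr(add_one, 41, faulty_copy=1, flip_mask=0b1000)
print(clean, masked)
```

Both calls return the same value because only one of the three copies is corrupted; two simultaneous faults on the same bit would defeat the voter.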
Delay fault tolerant system
• Delay fault detection
  • Redundant timing margin in signal path
    • +: Second sampling at an increased clock period
    • -: Decrease delay of reference signal between pipeline registers
[Figure: timing margin between t1 and t2 on time axis t, with 2nd sampling]

Delay fault tolerant system
• Delay fault removal
  • Reference signal (SR)
  • Reprocessing at slower clock period (t')
[Figure: timing margin between t1 and t2, reference signal SR, slower period t', time axis t]

Delay fault tolerant system: Example
• RAZOR*
  • Dynamic Voltage Scaling Design
  • Reduce supply voltage down to a manageable failure rate
[Figure: timing margin between t1 and t2]
* Razor: a low-power pipeline based on circuit-level timing speculation, D. Ernst et al, 36th Annual IEEE/ACM International Symposium on Microarchitecture 2003

Delay fault tolerant system: Example
• RAZOR continued
  • Implemented at 120MHz clock frequency
  • But for high speed circuits…
    • Managing two clocks
    • Minimum path delay constraint
    • Delay of MUX

Delay fault tolerant system: Example
• Parity coding
  • Parity generation based on output correlation
  • Avoid well-correlated outputs for pairing
[Figure: timing margin on time axis t]

Now.. Let's look at delay distribution(s)

Clock speed achieved for contained error rate

Delay fault tolerant system: Example
• Parity coding (continued)
  • Complexity
  • Example: C499 ISCAS Benchmark

Recently Proposed Design
• Fault detection
  • Partial hardware and time redundancy
[Figure: front logic FL (gates g0 to gi) and back logic BL (up to gm) between latches Ln and Ln+1, with duplicated back logic BL' feeding L'n+1; timing margin on time axis t]

Proposed Design
• Fault removal
  • Pipeline flush & reprocessing at lower clock
[Figure: FL, BL, and BL' between Ln, Ln+1, and L'n+1]

Proposed Design
• Division of FL and BL
[Figure: PI, FL, BL, PO; latch and duplicated BL feeding comparison block CP, which outputs Error?]

Proposed Design
• Division of FL and BL
  • Considerations
    • The effects on the original circuit should be minimal.
    • Maximize delay fault detection coverage
    • Minimize added complexity

Proposed Design
• Division of FL and BL
  • First, POs to BL
  • Gate with longest delay to gate with shortest delay
    • For the gates connected to BL, choose the gate with maximum delay
    • Then, any gate whose number of fanouts > number of fanins

Proposed Design
• Delay fault detection coverage
  • $d_{FL}$: delay from PI to any gate in FL
  • $d_i$: delay from PI to any gate in original circuit
  $C_F = 1 - \frac{\max\{d_{FL}\}}{\max\{d_i\}}$

Proposed Design
• Delay simulation
  • SPICE simulation
    • TSMC 0.18um tech.
    • Vcc = 1.6V
  • Gate delay for rising and falling signal
    • Load: inverter
    • Different input combinations are considered
• Delay simulation
  • Randomly generated test vectors
    • $10^6$~$10^8$ according to number of primary inputs (PI)

Proposed Design
• Area complexity
  • $N_{gate}$: Number of gates in the original circuit
  • $N_{ff}$: Number of flip-flops in each pipeline, $(N_{PI} + N_{PO})/2$
  • $N_{gate\_BL}$: Number of gates in BL
  • $N_{gate\_CP}$: Number of gates in comparison block
  • $N_{Latch}$: Number of latches = number of connections between FL and BL
  • $w$: Complexity ratio of flip-flop to gate
  $C_A = \frac{N_{gate\_BL} + N_{gate\_CP} + N_{Latch}}{N_{gate} + w \cdot N_{ff}}$

Fault Coverage vs. Complexity
[Figure: four panels, Fault Detection Coverage vs. Added Complexity for C432, C499, C6288, and C880; x axis: fault detection coverage $C_F$ (0 to about 0.7), y axis: added complexity $C_A$ (0 to about 0.6)]

Complexity
• Effective complexity penalty
  • Depends on application
  • More than half of area is cache
  • Speed critical part: integer unit
  $C_{AE} = C_A \cdot \frac{\text{Applicable area}}{\text{Total chip area}} < C_A \cdot 0.5$

Estimation of Complexity
• Intel® Pentium® 4 Processor on 90 nm Process
[Figure: die photo with Data Align, ALUs, Registers, Cache, and Mux & AGU marked]

Conclusion
• Delay fault tolerant design is proposed
• Possible operation clock frequency gain is estimated from modeling and experiments
• Delay fault detection coverage and complexity are analyzed for optimal implementation
• It shows that a 10% clock frequency gain is possible with the proposed design at a moderate (8~25%) complexity increase
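The coverage and complexity metrics can be evaluated with a few lines of code. The sketch below assumes one reading of the slide formulas: CF = 1 - max{dFL}/max{di}, CA = (Ngate_BL + Ngate_CP + NLatch)/(Ngate + w·Nff), and CAE = CA scaled by the fraction of chip area the technique applies to. All numeric inputs are hypothetical and are not taken from the ISCAS experiments.

```python
def fault_coverage(d_fl, d_all):
    """C_F = 1 - max{d_FL} / max{d_i}, with delays measured from the PIs."""
    return 1.0 - max(d_fl) / max(d_all)


def added_complexity(n_gate_bl, n_gate_cp, n_latch, n_gate, n_ff, w):
    """C_A: added gates and latches relative to the original pipeline stage."""
    return (n_gate_bl + n_gate_cp + n_latch) / (n_gate + w * n_ff)


def effective_complexity(c_a, applicable_area, total_area):
    """C_AE: C_A scaled by the share of the chip the technique applies to."""
    return c_a * applicable_area / total_area


# Hypothetical gate delays in ns: FL holds the three fastest gates.
d_all = [1.2, 2.0, 3.0, 4.5, 5.0]
d_fl = [1.2, 2.0, 3.0]

c_f = fault_coverage(d_fl, d_all)
c_a = added_complexity(n_gate_bl=30, n_gate_cp=10, n_latch=8,
                       n_gate=160, n_ff=20, w=4)
c_ae = effective_complexity(c_a, applicable_area=40.0, total_area=100.0)

print(f"C_F = {c_f:.2f}, C_A = {c_a:.2f}, C_AE = {c_ae:.2f}")
```

Moving more gates from FL into BL lowers max{dFL} and so raises CF, but it also raises Ngate_BL and therefore CA, which is the trade-off the coverage vs. complexity plots explore.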