High-level Power Reduction and Management
Copyright Agarwal & Srivaths, 2007
Low-Power Design and Test, Lecture 7

Outline
General Observations
RTL Power Management Techniques
■ Gated Clock Architecture
■ Precomputation
■ Guarded Evaluation
Behavior-Level Power Reduction Techniques
■ Performance Speedup Techniques
● Algebraic Transformations
● Common Case Computation
■ Switched Capacitance Reduction
● Algebraic Transformations
Power Supply Gating
■ Basic Concept
■ Isolation Cells
■ Retention Flip-Flops

General Observations
Not all components need to be active all the time.
Energy-efficient computation is achieved by selectively turning off (or reducing the performance of) system components when they are idle.
Issues:
■ Controls to support power management
● Frequency control (clock gating)
● Voltage control (power shutdown)
■ Identifying when circuits (or parts of them) can be idle
■ Location of controls
● Hardware
● Software (Hybrid)

Gated Clock Architecture
Block Fa is controlled by primary inputs, state, and primary outputs.
[Figure: gated clock architecture; labels: IN, OUT, STATE, Combinational Logic, Fa, latch L, AND gate (&), CLK, GCLK]
Latch L takes care of filtering glitches
■ L is transparent when the clock is inactive

Gated Clock Architecture: Redundant Clocking Detection
Idea [Ohnishi97]:
■ Redundant clockings activate registers unnecessarily
■ Use application
profiles to detect redundant clockings
● Difference in the numbers of incoming and outgoing data of a register
■ Gated clock scheme designed using this information
Redundant behaviors of a register
■ Unused data latching: Data not transferred to a destination
■ Unchanged data latching: Register re-stores data already present from source
■ Redundant data holding: Register re-stores data already present from itself

Redundant Clocking Detection
Identify the redundant behaviors for register X during the 10 clock cycle snapshot shown.
[Figure: 10-cycle snapshot of register X. Courtesy: [Ohnishi97]]
# Unused data latching(X), or $A_{UU}(X) = 8 - 6 = 2$
# Unchanged data latching(X), or $A_{UC}(X) = 8 - 5 = 3$
# Redundant data holding(X), or $A_{HOLD}(X) = 10 - 8 = 2$

Algorithm
Algorithm for redundant clocking detection and gated clock architecture definition
1. Register data transfer condition extraction
● Analyze the RTL HDL of the circuit to extract data transfer conditions
● These are the conditions under which data transfers to/from a register happen
2. Profiling
● Count the number of times these conditions become true during RTL simulation
● Estimate the number of redundant behaviors of each register from these counts
3. Register grouping algorithm applied and gated clock introduced for each group

Register Data Transfer Conditions
Data Transfer Graph (DTG) captures the data transfer condition between registers (denoted $C_{RT}(v_i, v_j)$)
[Figure: example DTG. Courtesy: [Ohnishi97]]

Register Data Transfer Conditions
Three types of data transfer conditions (Courtesy: [Ohnishi97])
■ $C_{LAT}(v_i)$: Data transfer condition between register i and one or more source registers of i
$C_{LAT}(v_i) = \bigvee_{r=1}^{m} C_{RT}(v_r, v_i)$
■ $C_{USED}(v_i)$: Data transfer condition between register i and one or more destination registers of i
$C_{USED}(v_i) = \bigvee_{r=1}^{n} C_{RT}(v_i, v_r)$
■ $C_{CHG}(v_i)$: Data transfer condition to one or more source registers of i
$C_{CHG}(v_i) = \bigvee_{r=1}^{k} C_{LAT}(v_r)$

Profiling
Count the number of times $C_{LAT}(v_i)$, $C_{USED}(v_i)$, and $C_{CHG}(v_i)$ become true during RTL simulation
■ Call these numbers $A_{LAT}(v_i)$, $A_{USED}(v_i)$, and $A_{CHG}(v_i)$
We can now determine, with $A_{CK}(v_i)$ the number of clock cycles:
$A_{HOLD}(v_i) = A_{CK}(v_i) - A_{LAT}(v_i)$
$A_{UU}(v_i) = A_{LAT}(v_i) - A_{USED}(v_i)$
$A_{UC}(v_i) = A_{LAT}(v_i) - A_{CHG}(v_i)$
Recall our initial example: there $A_{CK} = 10$, $A_{LAT} = 8$, $A_{USED} = 6$, and $A_{CHG} = 5$.

Register Grouping Algorithm
1. Record the clock cycles in which each register behaves redundantly, as follows:
■ Calculate $A_{HOLD} + A_{UU} + A_{UC}$ in every cycle for each register
■ If $(A_{HOLD} + A_{UU} + A_{UC})_{cycle\ t} > (A_{HOLD} + A_{UU} + A_{UC})_{cycle\ t-1}$, record t (redundant clocking detected in cycle t)
2. Greedy grouping of registers
    foreach reg i (i, j do not belong to any group) {
        Add i to new group Gi;
        foreach reg j {
            #redundancy_similarity = #clock_cycles in which both i and j behave redundantly;
            if (#redundancy_similarity > threshold) Add j to Gi;
        }
    }
3. Calculate the total redundant power for each group
4. Select groups whose total redundant power is more than a given threshold power

Pre-computation
Duplicate part of the logic to precompute circuit output values one cycle before they are required.
Use these values to reduce the total amount of switching in the circuit in the next cycle.
[Figure: circuit embodiments: original circuit (n inputs, single output) and circuit with pre-computation. Courtesy: [Macii98]]
■ g1, g0: Predictor functions
$g_1 = 1 \Rightarrow f = 1$
$g_0 = 1 \Rightarrow f = 0$
■ LE = 0 when either g1 or g0 evaluates to 1

Pre-computation
An Example [Devadas95]
■ N-bit comparator
■ Pre-computation circuit based on the behavior of the comparison operation
● If the MSBs of C and D are not equal, C > D can be evaluated using the MSBs alone
● Otherwise, the rest of the bits (of C and D) are also needed.
■ Therefore, LE is given by $LE = \overline{C(n-1) \oplus D(n-1)}$, the XNOR of the MSBs
[Figure: comparator circuit, and comparator circuit with pre-computation (XNOR gate on the MSBs)]

Guarded Evaluation
Operand Isolation: Use transparent latches as a mechanism for shutting down redundant switching
■ Latches are enabled when useful computation needs to be done
Guarded Evaluation [Tiwari98]
■ Identifies where transparent latches must be placed
■ Identifies which signals control enable/disable of these latches
[Figure: original circuit and circuit with guard logic. Courtesy: [Macii98]]

Guarded Evaluation
An Example
RTL Circuit: Dual-operation ALU
■ Ctrl=0 (1): SHIFT (ADD) operation performed
■ Clock gating will not work here!
[Figure: ALU (REG A and REG B feeding SHIFTER and ADDER, output selected by ctrl) and the same ALU with guard logic]

Background: Observability Don't Cares
Well-known concept in logic synthesis.
ODC set of a Boolean variable x: Conditions on the Primary Inputs such that x is not observable at the Primary Outputs.
Example: AND gate with inputs x, y and output z
■ x is not observable when y is 0
■ x is not observable when z is not observable
$ODC(x) = \overline{y} + ODC(z)$

Guarded Evaluation
Exploit the observability don't-care set $ODC_z$
■ Set of PI assignments to X such that the value at z has no effect at the POs
■ The guard logic control signal s must then satisfy the logical condition $s \Rightarrow ODC_z$
[Figure: circuit with guard logic (pure guarded evaluation)]
■ Further, $t_l(s) < t_e(Y)$, where $t_l(s)$ is the latest settling time of s to 1 and $t_e(Y)$ is the earliest time an input to F can change

Guarded Evaluation
Extended Guarded Evaluation
■ Larger set of conditions under which we can shut off logic: $s \Rightarrow (x + ODC_z)$
■ Shutdown conditions now additionally include
● PI assignments not in $ODC_z$
● But for which z = 1
[Figure: extended guard logic; signal labels z, w, s]

Behavior-level Power Reduction Techniques
Recall the equation for dynamic power consumption:
$P_{dyn} = \frac{1}{2} C V_{dd}^2 \cdot a \cdot f$
Two key approaches for reducing power:
■ Use performance speed-up transformations, and trade off performance for power through voltage scaling
● How will this work?
■ Reduce the effective capacitance being switched

Trading off performance for power consumption benefits
Exploit voltage and frequency scaling to trade off performance gains for significant power savings.
When voltage and frequency scaling is performed, the power benefit is calculated by determining the new operating voltage:
■ Let $T_{opt}$ be the shortened execution time due to the performance optimization
■ Assume that the voltage-scaled circuit takes the same time ($T_{orig}$) to complete as the original circuit
[Figure: execution timelines at $V_{dd}$ (completing in $T_{opt}$) and at $V_{ddnew}$ (completing in $T_{orig}$)]

First, the equations for $T_{opt}$ and $T_{orig}$:
$T_{opt} = N_{cyc} \cdot 1/f_{orig}$
$T_{orig} = N_{cyc} \cdot 1/f_{new}$
$T_{opt}/T_{orig} = f_{new}/f_{orig}$
The dependency of frequency on circuit voltage is:
$f \propto (V_{dd} - V_t)^2 / V_{dd}$
This gives the following equation for calculating $V_{ddnew}$:
$T_{opt}/T_{orig} = \frac{(V_{ddnew} - V_t)^2}{(V_{dd} - V_t)^2} \cdot \frac{V_{dd}}{V_{ddnew}}$
When $V_t$ is negligible compared with the supply voltages, this simplifies to:
$T_{opt}/T_{orig} \approx V_{ddnew}/V_{dd}$
Use $V_{ddnew}$ to calculate the final power consumption!
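The voltage equation above has no convenient closed form when $V_t$ is not negligible, so it can be solved numerically. The following is a minimal sketch, not part of the original lecture: the function names (`speed`, `scaled_vdd`, `power_ratio`) are hypothetical, and it assumes the frequency model $f \propto (V_{dd} - V_t)^2 / V_{dd}$ and unchanged throughput after scaling.

```python
def speed(vdd: float, vt: float) -> float:
    """Relative clock frequency under the model f ∝ (Vdd - Vt)^2 / Vdd."""
    return (vdd - vt) ** 2 / vdd


def scaled_vdd(vdd: float, vt: float, t_ratio: float, tol: float = 1e-9) -> float:
    """Solve speed(v_new) / speed(vdd) = t_ratio for v_new by bisection.

    t_ratio is Topt/Torig, the fraction of the original execution time
    that the optimized design needs at the original supply voltage.
    """
    if not 0.0 < t_ratio <= 1.0:
        raise ValueError("t_ratio must be in (0, 1]")
    if not 0.0 <= vt < vdd:
        raise ValueError("need 0 <= vt < vdd")
    target = t_ratio * speed(vdd, vt)
    lo, hi = vt, vdd  # speed() is strictly increasing on (vt, vdd]
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if speed(mid, vt) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def power_ratio(cap_ratio: float, vdd_new: float, vdd: float) -> float:
    """New/old dynamic power at unchanged throughput: (C_new/C_old) * (V_new/V_old)^2."""
    return cap_ratio * (vdd_new / vdd) ** 2
```

With `vt = 0` the solver reduces to the approximation $V_{ddnew} = (T_{opt}/T_{orig}) \cdot V_{dd}$; for example, `scaled_vdd(5.0, 0.0, 0.75)` converges to 3.75 V. A nonzero threshold voltage yields a higher $V_{ddnew}$ than the approximation for the same time ratio.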
Performance Optimization Transformations on an Example Behavior [Chandrakasan95a]
Example: Behavior of an IIR Filter
$Y_N = X_N + A \cdot Y_{N-1}$
[Figure: behavior data flow; input $X_N$, adder, multiplier by A, delay D, output $Y_N$]
Design Characteristics
• Vdd = 5V
• Critical path length = 2
• Throughput = 2*N
• Capacitance = 1 unit
• Power = 25 units

Transformation (1): Loop Unrolling
We can unroll the recursive equation once, and get the following:
$Y_{N-1} = X_{N-1} + A \cdot Y_{N-2}$
$Y_N = X_N + A \cdot Y_{N-1}$
[Figure: unrolled behavior data flow; inputs $X_N$, $X_{N-1}$, two adders, two multipliers by A, delay 2D, outputs $Y_N$, $Y_{N-1}$]
Design Characteristics
• Vdd = 5V
• Critical path length = 2
• Throughput = 2*N
• Capacitance = 1 unit
• Power = 25 units
No change in performance/power!

Transformation (2): Distributivity and Constant Propagation
We can apply the distributive law and constant propagation:
$Y_{N-1} = X_{N-1} + A \cdot Y_{N-2}$
$Y_N = X_N + A \cdot X_{N-1} + A^2 \cdot Y_{N-2}$
[Figure: transformed behavior data flow; inputs $X_N$, $X_{N-1}$, multipliers by A and $A^2$, delay 2D, outputs $Y_N$, $Y_{N-1}$]
Design Characteristics
• Vdd = 5V
• Critical path length = 3
• Throughput = 3*(N/2)
• Capacitance = 1.5 units
• Power = 25 units
Design Characteristics after Voltage Scaling
• Vdd = 3.75V (How?)
• Critical path length = 3
• Throughput = 2*N
• Capacitance = 1.5 units
• Power = 20 units

Transformation (3): Pipelining
Let us assume we will now process two samples in parallel at any given time.
[Figure: non-pipelined operation (op1, op2, op3, op4 in sequence) versus pipelined operation (operations overlapped)]

Transformation (3): Pipelining
Behavior Data Flow with Pipelining
■ Observe that the critical path length reduces to 2
[Figure: pipelined behavior data flow; inputs $X_N$, $X_{N-1}$, multipliers by A and $A^2$, pipeline delays D, delay 2D, outputs $Y_N$, $Y_{N-1}$]
Design Characteristics
• Vdd = 2.9V (How?)
• Critical path length = 2
• Throughput = 2*N
• Capacitance = 1.5 units
• Power = 12.5 units (2X reduction)

Transformation (3): Pipelining
[Figure. Source: [Chandrakasan95a]]

Common Case Computation: A Power-Optimization Technique [Lakshminarayana99]
Recall Amdahl's law!
Idea
■ Identify computations or sequences of computations in the behavior that occur most frequently
■ Design a separate circuit that implements the common-case behavior efficiently
[Figure: generic architecture; original circuit alongside a common-case detection and execution circuit; activity of the energy-optimized circuit]

CCC: Example [Lakshminarayana99]
GCD Behavior
    while (x != y) {
        if (x > y) {
            x := x - y;
        } else {
            y := y - x;
        }
    }
[Figure: STG annotated with state and state transition probabilities from simulation profiles]

CCC: Example [Lakshminarayana99]
Identified common case behavior
    if (x != y) { if (x > y) { x := x - y; }}
    if (x != y) { if (x > y) { x := x - y; }}
    if (x != y) { if (x > y) { x := x - y; }}
    if (x != y) { if (x > y) { x := x - y; }}
Simplified common case behavior
    Tempx := x - 4y;
    if (Tempx > 0) {
        x := Tempx;
    }
[Figure: common case detection and common case execution circuits; inputs x and y]

CCC: Results
Performance improvement of more than 4X!
Can be traded off for power savings
■ Average power consumption reduction: 59%
Average area overhead: 23%

Operation Reduction: Distributivity [Chandrakasan95a]
Reducing operations reduces the switched capacitance.
2nd order polynomial example
$X^2 + A \cdot X + B$ can be rewritten as $X \cdot (X + A) + B$
[Figure: data flow graphs of the original and rewritten polynomial]
One fewer multiplication!
Same throughput
No change to the critical path

Operation Reduction: Distributivity [Chandrakasan95a]
Reducing operations reduces the switched capacitance
■ Can also increase the critical path (can mean a higher voltage to realize the same throughput)
3rd order polynomial example
$X^3 + A \cdot X^2 + B \cdot X + C$ can be rewritten as $X \cdot (X \cdot (X + A) + B) + C$
[Figure: data flow graphs of the original and rewritten polynomial]
Original: #Operations = 7, Critical path = 4
Rewritten: #Operations = 5, Critical path = 5

Strength Reduction and Common Sub-Expression
Strength Reduction
■ Exploit the dissimilarity in energy consumption between operations
■ E.g., conversion of multiplications by constants into shift-add operations
Common Sub-Expression
■ Identify common computations between two computational threads and re-use them to reduce the number of operations
Example: 4-tap FIR Filter [Mehendale95]
$Y_n = \sum_{i=0}^{3} A_i \cdot X_{n-i}$
[Figure: 4-tap FIR filter; inputs $X_n$, $X_{n-1}$, $X_{n-2}$, $X_{n-3}$, multipliers by $A_0$ to $A_3$, adder chain, output $Y_n$]
Coefficient values (2's complement fixed-point arithmetic)
$A_0 = (0.0111011)_2$
$A_1 = (0.0101011)_2$
$A_2 = (1.0110011)_2$
$A_3 = (1.1001010)_2$

Strength Reduction and Common Sub-Expression
Step 1. Apply Strength Reduction
■ Replace each multiplication by the equivalent shifts and adds, taken from the binary representation of the coefficient
$Y_n = \sum_{i=0}^{3} A_i \cdot X_{n-i} = A_0 \cdot X_3 + A_1 \cdot X_2 + A_2 \cdot X_1 + A_3 \cdot X_0$
$A_0 = (0.0111011)_2$: $A_0 \cdot X_3 = 2^{-7} \cdot (X_3 + (X_3 \ll 1) + (X_3 \ll 3) + (X_3 \ll 4) + (X_3 \ll 5))$
$A_2 = (1.0110011)_2$: $A_2 \cdot X_1 = 2^{-7} \cdot (X_1 + (X_1 \ll 1) + (X_1 \ll 4) + (X_1 \ll 5) - (X_1 \ll 7))$
#Adds = 15, #Subs = 2, #Shifts = 15

Strength Reduction and Common Sub-Expression
Step 2. Identify common sub-expressions across coefficients
■ Look for two coefficients that both have a 1 in more than one common bit location
$A_0 \cdot X_3 = 2^{-7} \cdot (X_3 + (X_3 \ll 1) + (X_3 \ll 3) + (X_3 \ll 4) + (X_3 \ll 5))$
$A_2 \cdot X_1 = 2^{-7} \cdot (X_1 + (X_1 \ll 1) + (X_1 \ll 4) + (X_1 \ll 5) - (X_1 \ll 7))$
■ Compute (X1 + X3) = X13 separately
■ Similarly, compute (X0 + X2) = X02 separately
#Adds = 11, #Subs = 2, #Shifts = 10
■ Similarly, compute (X13 + X13 << 1) = X13_01 separately
#Adds = 10, #Subs = 2, #Shifts = 9

Power Supply Gating
Basic Concept:
■ Switches placed on-chip to turn off the power supply when the circuit (or parts of it) is idle.
Benefits:
■ Leakage power reduction
Challenges
■ IR drop leads to timing closure issues
■ Simultaneous switching of gating cells
Two styles of power gating (Courtesy: [Cadence-PowerMgmtDesignLine06])
■ Fine-grained power gating
● Power gating logic is part of the library cells
■ Coarse-grained power gating
● Power gating cells are part of the power grid network

Power Supply Gating: An Example [OMAP-ISSCC05]
[Figure: 90nm OMAP2420 SoC, and the power switch used in OMAP]
5 power domains in the OMAP SoC are enabled by power gating.
Power switches gate VDD; each consists of
■ Weak PMOS: Sinks low current for power restore
■ Strong PMOS: Delivers current for normal operation
2-pass power turn-on mechanism to prevent current surges
■ Weak switches are turned on first to almost fully restore VDD(local), and then the strong switches are turned on to support normal operation

Power Supply Gating: An Example [OMAP-ISSCC05]
Leakage currents compared between
■ All power domains ON
■ WkUp domain only ON
Nearly 40X reduction seen at room temperature

Isolation Cells
Special cells used at the interfaces between blocks that are shut down and blocks that are on.
They prevent the outputs of shut-down modules from floating.
Types of Isolation Cells
■ Sets the output to a known value (0 or 1)
■ Sets the output to the last valid value
The cells and their enables need to be always ON.
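The two isolation cell types can be illustrated with a small behavioral model. This is only a sketch for intuition, not a circuit-accurate description: the class names and the `iso_enable` signal are hypothetical, and `None` is used here as an assumed stand-in for a floating output of a powered-down block.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClampIsolationCell:
    """Drives a fixed, known value (0 or 1) while isolation is enabled."""
    clamp_value: int

    def output(self, data: Optional[int], iso_enable: bool) -> int:
        if iso_enable:
            return self.clamp_value  # source block may be off; ignore its output
        if data is None:
            raise ValueError("floating input while isolation is disabled")
        return data


@dataclass
class HoldIsolationCell:
    """Keeps driving the last valid value seen before isolation was enabled."""
    held: int = 0

    def output(self, data: Optional[int], iso_enable: bool) -> int:
        if not iso_enable:
            if data is None:
                raise ValueError("floating input while isolation is disabled")
            self.held = data  # track the live value while the source is powered
        return self.held
```

In this model the isolation enable must be asserted before the source block is powered down and released only after it is powered up again; otherwise the floating value reaches the always-on side, which the model reports as an error.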
Data Retention
Things to do before we power down
■ Save the state of the module(s) being shut down
Options [Zyuban02]
■ For processors, the OS can save the relevant state to local memory and read it back
● Save/restore overheads (time, energy consumption)
■ Use scan to save the complete state
■ Keep all latches on a separate power supply and just power down the logic
■ Provide each latch with a shadow latch called a retention latch (each retention latch is on a separate power supply)

Data Retention
[Figure: integrated scan retention; save and restore operations. Courtesy: [Zyuban02]]

References
Survey Papers
■ [Devadas95] S. Devadas, S. Malik, "A Survey of Optimization Techniques Targeting Low Power VLSI Circuits," DAC 1995, pp. 242-247.
■ [Macii98] E. Macii, M. Pedram, F. Somenzi, "High-level power modeling, estimation, and optimization," IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 17, no. 11, pp. 1061-1079, 1998.
■ [Chandrakasan95a] A. P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, R. Brodersen, "Optimizing power using transformations," IEEE Trans. Computer-Aided Design, vol. 14, pp. 12-31, Jan. 1995.
RTL Power Management
■ [Ohnishi97] M. Ohnishi, A. Yamada, H. Noda, T. Kambe, "A Method of Redundant Clocking Detection and Power Reduction at the RTL level," in Proc. Int. Symp. Low Power Electronics & Design (ISLPED), pp. 131-136, Aug. 1997.
■ [Tiwari98] V. Tiwari, S. Malik, P. Ashar, "Guarded evaluation: pushing power management to logic synthesis/design," IEEE Trans. on CAD of Integrated Circuits and Systems (TCAD), vol. 17, no. 10, pp. 1051-1060, 1998.
Behavioral Power Optimization
■ [Mehendale95] M. Mehendale, S. D. Sherlekar, G. Venkatesh, "Synthesis of multiplier-less FIR filters with minimum number of additions," ICCAD 1995, pp. 668-671.
■ [Lakshminarayana99] G. Lakshminarayana, A. Raghunathan, K. S. Khouri, N. K. Jha, S. Dey, "Common-Case Computation: A High-Level Technique for Power and Performance Optimization," DAC 1999, pp. 56-61.
Power Supply Gating
■ [Cadence-PowerMgmtDesignLine06] Anand Iyer, "Demystify power gating and stop leakage cold," Power Management DesignLine, 03/03/06.
■ [Zyuban02] V. Zyuban, S. V. Kosonocky, "Low power integrated scan-retention mechanism," ISLPED 2002, pp. 98-102.
■ [OMAP-ISSCC05] P. Royannez, H. Mair, F. Dahan, M. Wagner et al., "90nm Low Leakage SoC Design Techniques for Wireless Applications," ISSCC'05, Feb. 2005.