High-level Power Analysis
Copyright Agarwal & Srivaths, 2007
Low-Power Design and Test, Lecture 4

Outline
Background
■ CMOS Power Consumption Basics
■ Why Address Power Consumption Issues in High-Level Design
High-Level Power Analysis
■ RTL Power Estimation
  ● Fast Synthesis
  ● Analytical Approaches
  ● Characterization
■ Accelerating RTL Power Estimation
  ● Power Emulation (Hardware Accelerated Power Estimation)
■ Beyond RTL Power Estimation
  ● Power Estimation at the Cycle-accurate Behavior Level
■ Architectural Power Estimation

CMOS Power Consumption Basics
What are the various components of CMOS power consumption?

Levels of Design Abstraction
(a) Behavioral description
    x = input_x; y = input_y;
    while (x != y) {
      if (x < y) { y = y - x; }
      else { x = x - y; }
    }
    out = x;
  Scheduling
(b) Cycle-accurate functional description
    ST_1: x = input_x; y = input_y; goto ST_2;
    ST_2: c0 = x!=y; c1 = x<y; y1 = y - x; goto ST_3;
  Binding
(c) RTL description (figure): a controller (FSM) driving functional units (!=, <, -) and registers (reg_c0, reg_c1, reg_y1, reg_x, reg_y), with ports input_x, input_y and out
  Logic Synthesis
(d) Logic-level netlist
  Layout
(e) Transistor-level layout

Why Address Power at Higher Levels of Design Abstraction?
Design flow with high-level power analysis (figure):
■ System-level design, supported by system-level power analysis (power models for system-level components)
■ High-level synthesis and RTL optimizations, supported by architecture-level power analysis (power models for macroblocks, control logic)
■ Logic synthesis, supported by logic-level power analysis (power models for gates, cells, nets)
■ Transistor-level / layout synthesis, supported by transistor-level power analysis

Benefits: Estimation
■ Early feedback about power budget
■ Faster / fewer design iterations
Benefits: Optimization
■ Large power savings possible at higher levels

Power reduction opportunities and power analysis iteration times (figure):
■ Levels, highest to lowest: system, algorithm, register-transfer, logic, transistor, layout
■ Power savings increase toward the higher levels: 20-50% at the lowest levels, 2-5X in between, 10-20X at the highest
■ Design iteration times decrease toward the higher levels: hours-days at the lowest levels, minutes-hours in between, seconds-minutes at the highest

Fast Synthesis based Power Estimation
■ Map the design through "low-effort" synthesis to a netlist for power estimation [Llopis98]
■ Use gate-level power data to perform power estimation
■ Approach followed by some commercial tools
Flow: RTL -> Low-Effort Synthesis -> Gate-Level Power Estimation -> Power
Figure: RTL estimates vs. gate-level estimates, 15-20% deviation. Source: [Llopis98]

Analytical Methods
Correlate power consumption to simple measures of design complexity
■ Logic Structures: Use gate count [Glaser91]
  $P_{int} = GE \cdot (E_{typ} + V_{dd}^{2} \cdot C_L) \cdot f \cdot A_{int}$
  ● $GE$: Circuit size in NAND2 gate equivalents
  ● $E_{typ}$: Typical power dissipation per MHz for a NAND2 gate
  ● $C_L$: Estimated load capacitance per gate
  ● $f$, $V_{dd}$: Clock frequency, supply voltage
  ● $A_{int}$: Estimated activity factor per clock cycle (20-30%)

Analytical Methods
■ Memories [Liu94]
  $P_{mem} = P_{memcell} + P_{row} + P_{col} + P_{sense}$
  Dominant component:
  $P_{memcell} = \frac{2^{k}}{2} \, (c_{int} \, l_{column} + 2^{n-k} \, C_{tr}) \, V_{dd} \cdot V_{swing} \cdot f_{mem\_clock}$
  ● $2^{k}$: No. of memory cells (in each row), $2^{n-k}$: No. of rows
  ● $c_{int}$: Capacitance of unit wire length
  ● $l_{column}$: Column interconnect length
  ● $C_{tr}$: Drain diffusion capacitance on the bit / bit-bar line

Analytical Methods
Entropy based approach [Nemani96]
■ Entropy: Measure of uncertainty in a random variable
■ Entropy H of a random variable x is given by
  $H(x) = p \log_2 \frac{1}{p} + (1 - p) \log_2 \frac{1}{1 - p}$
■ p: Probability of x being 1
Recall that $P_{avg} \propto D_{avg} \cdot GE \cdot C_{avg}$
■ $D_{avg}$: Average node switching activity
■ $GE$: Gate equivalents, $C_{avg}$: Average gate capacitance

Analytical Methods
Hypothesis
■ Can $D_{avg}$ be estimated only from knowledge of input and output behavior?
Answer: Yes!
$P_{avg} \propto H \cdot GE \cdot C_{avg}$
Entropy H is given by
$H \approx \frac{2/3}{n + m} \, (H_i + 2 H_o)$
$H_i$ and $H_o$ are respectively the input and output entropies ($n$ and $m$ being the numbers of inputs and outputs)

Analytical Methods
Entropy Based Power Estimation Methodology:
■ Run a structural RTL simulation to measure input/output entropies
■ Using input/output entropies, estimate $P_{avg}$ for the combinational block
■ Use other techniques [Liu94] to estimate latch and clock power

Characterization Based Approaches
■ Characterization based power macro-models [Raghunathan-book, Ravi-aspdac05]
  ● Characterize a lower level implementation of an RTL block
  ● Construct a macromodel or power model: Power = f(I/O signal statistics)
  ● Applicable in behavioral synthesis environments
Flow (figure): RTL component library -> Macromodel template selection (complexity analysis, variable / parameter selection) -> Pattern generation -> Training sequences -> Logic- / transistor-level power simulator -> Power profiles -> Data fitting / coefficient extraction -> Power macromodels

Power Models
What does the power model implement?
  Power = coeff_0 +
          transition_count(in1[t], in1[t-1]) * coeff_1 +
          transition_count(in2[t], in2[t-1]) * coeff_2 +
          ... +
          transition_count(inN[t], inN[t-1]) * coeff_N
What does the power model contain?
■ Queues to store present and past values
■ Transition count function is a simple computation
■ Coefficients aggregated based on output of transition count function
Figure: component inputs/outputs in1 ... inN feed queues (D/Q registers) and the transition count function; coefficients Coeff_0[31:0] ... Coeff_N[31:0] are combined in the power summation to produce Power[31:0], triggered by POW_STROBE

Constructing Power Models: An Example
Figure: 16-bit ADDER with inputs In1[0:15], In2[0:15] and output Out[0:15]
(1) Macromodel template
  Power = coeff_0 +
          transition_count(in1_0[t], in1_0[t-1]) * coeff_1 +
          transition_count(in1_1[t], in1_1[t-1]) * coeff_2 +
          ... +
          transition_count(in1_15[t], in1_15[t-1]) * coeff_16 +
          transition_count(in2_0[t], in2_0[t-1]) * coeff_17 +
          transition_count(in2_1[t], in2_1[t-1]) * coeff_18 +
          ... +
          transition_count(in2_15[t], in2_15[t-1]) * coeff_32
(2) Training Sequence
  10101011011011011010010010110010;
  11101011101101101110011110001001;
  11110100011111100000100100101010;
  ...
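A macromodel template and a training sequence like the ones above are turned into coefficients by a data-fitting step. The sketch below illustrates that idea in Python and is not the authors' tool flow: it assumes one coefficient per input bit plus a constant, fits by ordinary least squares (normal equations), and substitutes a synthetic 4-bit training set with made-up coefficients for real gate-level power data. The names `transitions`, `solve` and `fit_macromodel` are hypothetical.

```python
import random

def transitions(prev, curr):
    """Per-bit transition indicators: 1 where a bit toggled between cycles."""
    return [int(a != b) for a, b in zip(prev, curr)]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_macromodel(vectors, power):
    """Least-squares fit of power[t] = c0 + sum_i c_i * toggled_i[t].

    vectors: list of input bit vectors, one per cycle.
    power:   one power sample per pair of consecutive vectors.
    """
    rows = [[1] + transitions(p, c) for p, c in zip(vectors, vectors[1:])]
    k = len(rows[0])
    # Normal equations: (X^T X) c = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, power)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic stand-in for steps (2) and (3): random 4-bit vectors and
# noise-free "measured" power generated from assumed coefficients.
rng = random.Random(0)
true_coeffs = [0.04, 0.001, 0.002, 0.003, 0.004]  # hypothetical values
vectors = [[rng.randint(0, 1) for _ in range(4)] for _ in range(201)]
power = [
    true_coeffs[0]
    + sum(c * t for c, t in zip(true_coeffs[1:], transitions(p, q)))
    for p, q in zip(vectors, vectors[1:])
]
coeffs = fit_macromodel(vectors, power)
```

Because the synthetic power data is noise-free, the fitted `coeffs` should reproduce `true_coeffs` up to floating-point error; with real gate-level power data the fit would only approximate the measurements.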
(3) Gate-Level Power Data
  0.079140
  0.030423
  0.126169
  ...
(4) Outputs from Regression, using (1), (2), and (3) as inputs
  coeff_0 = 0.04110908
  coeff_1 = 0.001006622
  coeff_2 = 0.001146324
  ...
(5) Putting it all together

  entity add_power IS
    port (in1        : IN std_logic_vector;
          in2        : IN std_logic_vector;
          POW_STROBE : in std_logic;
          power      : out real);
  end add_power;

  architecture VHDLgen OF add_power IS
    type queue1 is ARRAY (1 downto 0) of std_logic_vector(0 to (in1'high - in1'low));
    type queue2 is ARRAY (1 downto 0) of std_logic_vector(0 to (in2'high - in2'low));
  begin
    process(POW_STROBE)
      variable queue_in1 : queue1;
      variable queue_in2 : queue2;
      variable bw        : integer;
      variable flag      : integer;
    begin
      -- BIT-WIDTH INFERENCE: infer bitwidth of RTL component
      flag := 0;
      bw := (queue_in1(1)'high - queue_in1(1)'low) + 1;
      for i in 0 to bw loop
        if (flag = 0) then
          if (bw <= 2**i) then
            bw := 2**i; flag := 1;
          end if;
        end if;
      end loop;

      if POW_STROBE = '1' AND (POW_STROBE'event) then
        -- QUEUE MANAGEMENT: store current and previous I/O values
        queue_in1(1) := queue_in1(0);
        queue_in2(1) := queue_in2(0);
        queue_in1(0) := in1;
        queue_in2(0) := in2;

        -- MACROMODEL COMPUTATION: compute bit-level I/O switching activity
        -- and weigh it by the power coefficients
        -- (tc is the transition-count function; its definition is not shown)
        case bw IS
          when 2 =>
            power <= tc(queue_in1(0),queue_in1(1),0) * 7.88452e-05 +
                     tc(queue_in1(0),queue_in1(1),1) * 7.800038e-05 +
                     tc(queue_in2(0),queue_in2(1),0) * 0.0002803612 +
                     tc(queue_in2(0),queue_in2(1),1) * 5.245284e-05;
          when 4 =>
            power <= tc(queue_in1(0),queue_in1(1),0) * 0.0002173669 +
                     tc(queue_in1(0),queue_in1(1),1) * 0.0002525756 +
                     tc(queue_in1(0),queue_in1(1),2) * 0.00023067 +
                     tc(queue_in1(0),queue_in1(1),3) * 0.0001498218 +
                     tc(queue_in2(0),queue_in2(1),0) * 0.0001684765
                     . . . .
          . . . .
        end case;
      end if;
    end process;
  end VHDLgen;

Improvements to Macromodels
■ RTL components can exhibit significantly different power behavior for different parts of the input space [Potlapally00]
Example circuit C5:
■ C5 implements part of the GCD algorithm
■ C5 also implements operand gating for the subtractor
C5 behavior: if (x > y) z = x - y else z = y

Improvements to Macromodels
Figure: conventional approach vs. proposed approach
■ 98% of the points in the upper cluster satisfy the condition (x > y): Power Mode 1
■ All the points in the lower cluster satisfy the condition (x <= y): Power Mode 2

Improvements to Macromodels
■ A power mode identification function (PIF) deduces the power mode based on the input vectors
■ The appropriate macromodel is invoked based on the identified power mode

Characterization Based Power Estimation
Characterization flow (figure): RTL library (synthesizable spec. for each component) and synthesis conditions (speed: fast/medium/slow; output cap. load; input slew rate) -> Synthesis -> P&R -> Post-layout netlists -> Power characterization (characterization based macromodeling) -> Power macro-model database -> Power model library generator -> Simulate-able power model libraries (Powerlib.vhd, Powerlib.v, Powerlib.c)
Estimation flow (figure): RTL design (HDL) and testbench / stimuli -> RT-level design planning / mapping -> Structural (macro) netlist -> Power model inference and estimation code generation -> Enhanced RTL -> RTL simulation -> Power profiles (structural power profile, cycle-by-cycle power report)
■ Tightly coupled with RTL design planning
■ Supports relative and absolute accuracy

Enhanced RTL: Graphical View
Figure: an example power model enhanced RTL design. The controller (FSM), functional units (+, +/-, <, =, <=, -1, >> 1), registers (reg_c0, reg_c1, reg_mid, reg_first, reg_last, reg_out) and buses (Bus 1, Bus 2, Bus 3) each have a power model attached; a power strobe generator triggers the models and a power aggregator produces the total power.
Main components include
■ Power models for every component: Monitor component I/O values and compute power
■ Power strobe generator: Trigger power models (statistical sampling employed for improved efficiency, since RTL simulation can also be slow for large designs)
■ Power aggregator: Compute total power consumption

Enhanced RTL: An illustration

  ENTITY gcd is
    port ( RESET : IN std_logic;
           CLOCK : IN std_logic;
           yi    : IN std_logic_vector(0 to 7);
           xi    : IN std_logic_vector(0 to 7);
           . . . .);
  end gcd;

  ARCHITECTURE VHDLgen of gcd is
    signal M_39 : std_logic;
    signal M_38 : std_logic;
    signal VHDLgen_fu3000 : real := 0.0;
    . . . .
    component cmp_lt
      port (i1 : IN std_logic_vector(0 to 7);
            i2 : IN std_logic_vector(0 to 7);
            o1 : BUFFER std_logic);
    end component;
    . . . .
    -- POWER MODEL
    component cmp_lt_power
      port (in1        : in std_logic_vector;
            in2        : in std_logic_vector;
            out1       : in std_logic;
            POW_STROBE : in std_logic;
            power      : out real);
    end component;
    . . . .
  begin
    . . . .
    cmp_lt port map(cmp_lt1i1, cmp_lt1i2, cmp_lt1ot);
    cmp_lt_power port map ( cmp_lt1i1(0 to 7), cmp_lt1i2(0 to 7), POW_STROBE, VHDLgen_fu3000);
    . . . .

    -- POWER MODEL TRIGGERING
    POW_STROBE_GEN : process( CLOCK )
    begin
      POW_STROBE <= CLOCK after POW_STROBE_DELAY;
    end process POW_STROBE_GEN;

    -- POWER AGGREGATION FOR EACH COMPONENT CLASS
    POW_TOTAL : process
    begin
      wait until (POW_STROBE='1' AND POW_STROBE'event)
              OR (POW_STROBE_REG='1' AND POW_STROBE_REG'event);
      . . . .
      FU_power  <= VHDLgen_fu3000 + VHDLgen_fu3001 + . . . .;
      REG_power <= VHDLgen_reg3008 + VHDLgen_reg3009 + . . . .;
      . . . .
    end process;

    -- POWER AGGREGATION FOR COMPLETE DESIGN
    ENERGY_GEN : process
    begin
      wait until CLOCK'event OR POW_STROBE'event OR POW_STROBE_REG'event;
      if ( (POW_STROBE = '1') and POW_STROBE'event ) then
        main_cycle_energy := (GATE_power + FU_power + MUX_power) * characterization_period;
        main_energy := main_energy + (GATE_power + FU_power + MUX_power) * characterization_period;
      end if;
      if ( CLOCK = '1' and CLOCK'event ) then
        num_clocks := num_clocks + 1;
        main_power := main_energy / (real(num_clocks) * clock_period);
      end if;
    end process energy_gen;
  end VHDLgen;

The CPU time overheads of RTL power estimation
■ Need improvements in efficiency for large designs [Ravi03]
Figure: log(time in seconds) of functional simulation vs. RTL power estimation for circuits CKT1 to CKT6; the largest design has 1.25 million transistors. Simulation time data obtained using ModelSim 5.3 (ModelTech).

Observations
■ Power estimation time depends on the HDL constructs used in the power estimation code
  ● Solution: HDL-aware Optimizations
■ Computation can be traded off for storage to improve efficiency
  ● Solution: Computation versus Storage Trade-offs
■ Power estimation effort should be directed where needed: significant contributors and tough-to-estimate portions
  ● Solution: Partitioned Statistical Sampling

Solution 1: HDL-aware optimizations
• Convert operations with complex datatypes into operations with simpler datatypes
• Inline HDL functions to eliminate function maintenance overheads
• Minimize power model activations
• Reduce workload of a power model process
EXAMPLE: BIT-WIDTH INFERENCE CODE IN POWER MODEL
Before (inference runs on every activation):

  flag := 0;
  bw := (queue_in1(1)'high - queue_in1(1)'low) + 1;
  for i in 0 to bw loop
    if (flag = 0) then
      if (bw <= 2**i) then
        bw := 2**i; flag := 1;
      end if;
    end if;
  end loop;

After (inference runs only while flag is still 0):

  if (flag = 0) then
    bw := (queue_in1(1)'high - queue_in1(1)'low) + 1;
    for i in 0 to bw loop
      if (flag = 0) then
        if (bw <= 2**i) then
          bw := 2**i; flag := 1;
        end if;
      end if;
    end loop;
  end if;
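The guarded bit-width inference above is an instance of computing a value once and reusing it on later activations. The following Python sketch mirrors that pattern outside of VHDL, purely for illustration; the class `PowerModel`, its method names and the `inference_runs` counter are hypothetical and do not come from the lecture's tool.

```python
class PowerModel:
    """Toy stand-in for a power model process that needs a rounded bit-width."""

    def __init__(self, width):
        self.width = width        # actual port width of the RTL component
        self._bw = None           # cached result; None plays the role of flag = 0
        self.inference_runs = 0   # counts how often the inference loop executes

    def _infer_bitwidth(self):
        """Round the width up to the next power of two (smallest 2**i >= width)."""
        self.inference_runs += 1
        bw = 1
        while bw < self.width:
            bw *= 2
        return bw

    def on_strobe(self):
        """Called on every power strobe; infers the bit-width only the first time."""
        if self._bw is None:
            self._bw = self._infer_bitwidth()
        return self._bw


model = PowerModel(width=12)
results = [model.on_strobe() for _ in range(1000)]
```

After 1000 strobes the inference loop has executed once, which is the saving that the guarded VHDL version aims for on each power model activation.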
Low-Power Design and Test, Lecture 4 29 store compute store compute store compute store compute store store store compute store store compute • Compute average power consumption once in k cycles • Store observed signal bits of RTL component for k cycles • Compute transition counts and power consumption only in the kth cycle store Solution 2: Computation vs Storage Trade-offs Simulation Time 4600 4400 4200 4000 3800 Variations in simulation time with queue length 3600 3400 3200 100 104 101 102 103 Queue Length Copyright Agarwal & Srivaths, 2007 Low-Power Design and Test, Lecture 4 30 Partitioned Sampling Motivation: Smart “allocation of effort” during power estimation Components with low mean power, low variance (low impact on accuracy) Components with high mean power, high variance (deserve high estimation effort) Mean and Variance Scatter Plot (MVSP) for an example design Copyright Agarwal & Srivaths, 2007 Low-Power Design and Test, Lecture 4 31 Partitioned Sampling Algorithm Power model enhanced design in HDL HDL compilation Apply clustering algorithm on MVSP to group components with similar mean and variance 4 1 Fix sampling probabilities for clustered RTL components Simulate for a user-specified fraction of the overall simulation time 5 2 Transform HDL to incorporate sampled partitions 6 Power profiles of all the RTL components Determine Mean Variance Scatter Plot (MVSP) for the RTL design Copyright Agarwal & Srivaths, 2007 3 HDL compilation 7 Full Simulation 8 RTL power estimate Low-Power Design and Test, Lecture 4 32 Fixing the Sampling Probabilities Objective: Determine the sampling probabilities for n clusters C1, C2 … Cn n Obs. #1: The error due to sampling given by P Pi must be minimized i 1 Obs. #2: The error in sampling a cluster Ci that accounts for a greater fraction (fi) of the total power must be kept small. 
That is, n Minimize Pweighted fi * Pi i 1 Pi comp compCi Power estimation error due to sampling for a component comp scomp comp C i const * Ni Standard deviation of the power profile of a component comp Copyright Agarwal & Srivaths, 2007 Equation 1 Number of samples for the cluster Ci Low-Power Design and Test, Lecture 4 33 Fixing the Sampling Probabilities Formulation: Minimize Equation 1 subject to the following constraints Computational budget Number of component-samples n | C1 | .N1 | C2 | .N2 | Cn | .Nn n f * Ntot * |Ci| i 1 Ni 1, i 1.. n Formulation now a “Linearly constrained Optimization” problem -- Many solvers available (Excel, Ampl) Solution: Pri Copyright Agarwal & Srivaths, 2007 Ni Ntot Low-Power Design and Test, Lecture 4 34 RTL Power Estimation: Results Designs as large as 1.25 million transistors have been successfully evaluated using our RTL Power Estimator (RTL-PEST)* 0.8 3.8% Power (mW) 0.7 13.9% 2.9% 0.6 • RTL power estimates roughly 5 to 10% off gate-level power estimates • RTL power estimation 10-50X faster than gate-level power estimation 0.5 4.1% 0.4 12.2% 0.3 14.8% 1.2% 0.2 0.1 0 CKT1 CKT2 CKT3 CKT4 CKT5 CKT6 CKT7 Gate 0.3694 0.7308 0.503 0.5237 0.1518 0.1708 0.2332 RTL 0.3541 0.7038 0.433 0.5398 0.1783 0.1687 0.2622 * For further information, please see [Ravi03] 8 RTL-PEST 1000 6 Speedup over Comm 5 100 4 3 10 2 1 1 0 D1 Copyright Agarwal & Srivaths, 2007 7 Comm Speedup • Power Estimation speed better than the best available commercially Execution Time (sec) 10000 D2 D3 D4 Low-Power Design and Test, Lecture 4 D5 D6 D7 35 RTL Power Estimation: Results Percentage error versus CPU time trade-off for partitioned sampling and testbench reduction techniques Copyright Agarwal & Srivaths, 2007 Local power estimation errors for partitioned sampling and conventional sampling techniques Low-Power Design and Test, Lecture 4 36 Outline Background ■ CMOS Power Consumption Basics ■ Why Address Power Consumption Issues in High-Level Design High-Level 
Power Analysis ■ RTL Power Estimation ● Fast Synthesis ● Analytical Approaches ● Characterization ■ Accelerating RTL Power Estimation ● Power Emulation (Hardware Accelerated Power Estimation) ■ Beyond RTL Power Estimation ● Power Estimation at the Cycle-accurate Behavior Level ■ Architectural Power Estimation Copyright Agarwal & Srivaths, 2007 Low-Power Design and Test, Lecture 4 37 Power Emulation Technology Overview New paradigm for power estimation ! [Coburn05] fi r s t l a s t v al u e da ta Basic Observations 1. Power estimation uses power models for different components 2. Power models are themselves simple functions 3. Emulation is commonly used to speed up circuit simulation ●●● FSM 1 Power Model +/- + Power Model < = <= reg_c0 reg_c1 reg_c1 -1 >> 1 reg_mid reg_first reg_last reg_out Power Model fi r s t l a s t Power Aggregator + Power Strobe Generator v al da ta FSM adu dre ou t 1 Power Model < = reg_c0 reg_c1 Power Model <= Power +/Model Power Model ●●● -1 >> 1 reg_mid reg_c1 reg_first reg_last reg_out Total Power ad dr out Testbench Outputs 2 to 3 orders of magnitude Power speedup possible ! Host PC FPGA platform Copyright Agarwal & Srivaths, 2007 Low-Power Design and Test, Lecture 4 38 Power Emulation: Challenges Size of design enhanced with power models is very large! ■ Size increases on an average of 18.2X for MPEG4 sub-designs ■ Enhanced version exceeds capacity of largest Xilinx Virtex-II FPGA 20.4X 135000 Normal Design 120000 Design for Power Estimation Capacity of XC2V8000 FPGA FPGA Area (LUTs) 105000 20.6X 90000 75000 17.7X 16.3X 60000 14.7X 45000 17.5X Need to reduce the area requirements of power models ! 30000 15.0X 15000 ld V ea d _b it M v M c q Is p t_ R Id c D ct _c oe ff da 0 MPEG4 Design Module Copyright Agarwal & Srivaths, 2007 Low-Power Design and Test, Lecture 4 39 Power Emulation: Challenges Why area increase? ■ Resource-hungry power models used for every RTL component in the design How to reduce area? 
■ Optimize the number of power models used ■ Make the implementations of power models resourceefficient ■ Catch: Ensure minimum loss of estimation accuracy due to area reduction techniques Copyright Agarwal & Srivaths, 2007 Low-Power Design and Test, Lecture 4 40 Area Optimization Techniques Clustering of power models ■ Single power model servicing multiple components Changing component granularity ■ Constructing power models for complex components that subsume several smaller components Exploiting correlation ■ Using power correlation between components to reduce the number of monitored components Optimizing power model implementations ■ ■ Multi-cycling additions in power model computations Using FPGA block memories for efficient storage of power model coefficients Copyright Agarwal & Srivaths, 2007 Low-Power Design and Test, Lecture 4 41 Power Emulation: Results Evaluation on various NEC designs, Comparison with RTL-PEST, Comm-RTL CKT Sort Estimation Time (sec) RTL- Comm- PEST RTL 11.6 80.2 Emulation Acc Estimated Power (mW) FPGA Area (LUTs) RTL- RTL- Emulatio PEST n 1605 5665 Emulation Error PEST 1.2 9.7 X, 0.33 0.31 0.14 0.14 8.22 7.76 3.53% AO 3.53X 66.8X HVPeakF 120.3 136.8 1.7 70.8X, 80.5X 172.9 DCT 173.3 3.7 46.7 X, RTL-PEST 46.8X MPEG4 3300 2587 MPEG4 6.3 524X, 3300sec 411X 3192 9016 2.82X Nearly 500X speedup possible ! 
0% 5.6% 6121 4.9% 24907 Comm-RTL 4.74 4.51 2587sec 19242 Power 3.14X Emulation 72351 6.3sec 2.9X Upto 500X speedup compared to RTL power estimation 3% Loss of accuracy on an average •For further information, please see [Coburn05] Area overheads lowered to ≈3X Copyright Agarwal & Srivaths, 2007 Low-Power Design and Test, Lecture 4 42 Outline Background ■ CMOS Power Consumption Basics ■ Why Address Power Consumption Issues in High-Level Design High-Level Power Analysis ■ RTL Power Estimation ● Fast Synthesis ● Analytical Approaches ● Characterization ■ Accelerating RTL Power Estimation ● Power Emulation (Hardware Accelerated Power Estimation) ■ Beyond RTL Power Estimation ● Power Estimation at the Cycle-accurate Behavior Level ■ Architectural Power Estimation Copyright Agarwal & Srivaths, 2007 Low-Power Design and Test, Lecture 4 43 Cycle-Accurate Functional Descriptions (CAFDs) x = input_x; y = input_y; while (x != y) { if (x < y) { y = y - x; } else { x = x - y; } } out = x; Scheduling (a) Behavioral description out input_y FSM != < cmp Registers ST_3: if (c0) { if (c1) { y = y1; Increasingly } else {used in ST_2: x = x - y; System-level simulation c0 = x!=y; } c1 = x<y; goto ST_2; y1 = y –x; } goto ST_3 out = x; goto ST_1; (b) Cycle-accurate functional Binding description input_x Controller Functional units ST_1: x = input_x; y = input_y; goto ST_2; reg_c0 - sub lt_cmp reg_c1 reg_y1 reg_x reg_y Bus2 Bus1 Copyright Agarwal & Srivaths, 2007 (c) RTL implementation Low-Power Design and Test, Lecture 4 44 Cycle-Accurate Functional Descriptions (CAFDs) x = input_x; y = input_y; while (x != y) { if (x < y) { y = y - x; } else { x = x - y; } } out = x; Scheduling (a) Behavioral description out input_y FSM != < cmp Registers ST_3: if (c0) { if (c1) { y = y1; Challenge: } else { No structural ST_2: x = x - y; c0 = x!=y; } Information c1 = x<y; goto available ST_2; y1 = y –x; } goto ST_3 out = x; goto ST_1; (b) Cycle-accurate functional Binding description input_x Controller 
Functional units ST_1: x = input_x; y = input_y; goto ST_2; reg_c0 - sub lt_cmp reg_c1 reg_y1 reg_x reg_y Bus2 Bus1 Copyright Agarwal & Srivaths, 2007 (c) RTL implementation Low-Power Design and Test, Lecture 4 45 Overview of Power Estimation using CycleAccurate Functional Descriptions (CAFDs) (Scheduled Behavior) Objectives ■ Extract minimum RTL structural info. ■ Back-annotate RTL structural info. More information in (Zhong04) Simulation test bench CAFD Preprocessing Resource, timing constraints Synthesis RTL RTL information extraction Virtual component instantiation Idle cycle analysis (C++/SystemC) structure-aware CAFD Power model library Cycle-accurate functional simulation Power Power report Copyright Agarwal & Srivaths, 2007 Output Input Low-Power Design and Test, Lecture 4 Power vs. time 46 Structure-aware CAFD Virtual component • Stores I/O values for current & previous cycles • Invokes the power macro-model in each cycle Structure-AWARE CAFD SIMULATION TEST BENCH Original cycle-accurate functional description I/O mapping • Traces appropriate CAFD variables to capture component I/Os in each cycle • Generates idle cycle input values POWER MODEL LIBRARY add_power reg_power Copyright Agarwal & Srivaths, 2007 Power aggregation & reporting code Low-Power Design and Test, Lecture 4 47 Example Snippet of an Structure-aware CAFD ST_1: x = input_x; y = input_y; goto ST_2; ST_2: c0 = x!=y; c1 = x<y; y1 = y –x; goto ST_3 ST_3: if (c0) { if (c1) { y = y1; } else { x = x - y; } goto ST_2; } out = x; goto ST_1; VC<bus,8> bus1,bus2; VC<reg,8> reg_y1; VC<lt,1> lt_cmp; … ST_1: … ST_2: bus1.RecordInput(x) bus2.RecordInput(y); c0 = x!=y; reg_c0.RecordInput(c0); eq_cmp.RecordIO(x,y,c0); c1 = x<y; reg_c1.RecordInput(c1); lt_cmp.RecordIO(x,y,c1); Instantiate virtual components Update virtual component I/O values y1 = y -x; reg_y1.RecordInput(y1); sub.RecordIO(y,x,y1); CalculatePower(); goto ST_3 Copyright Agarwal & Srivaths, 2007 Low-Power Design and Test, Lecture 4 48 C-based 
HW Power Estimation: Results

Compared accuracy and efficiency vs. gate-level and RTL-PEST
■ 50-100X speedup (or more) for various designs, with less than 20% error w.r.t. POWERD

Circuit       Average power error   Cycle-level absolute error   Speedup vs. RTL estimation   Slowdown vs. functional simulation
DES           2.1%                  2.2%                         83X                          1.1X
HDTV-1        1.7%                  4.0%                         356X                         3.2X
JPEG          2.7%                  6.6%                         1,143X                       3.3X
MPEG4-IDCT    3.1%                  5.1%                         412X                         3.2X
MPEG4-ISPQ    1.5%                  2.4%                         438X                         2.1X
SORT          1.7%                  5.4%                         266X                         1.7X
VITERBI       1.4%                  6.5%                         305X                         3.0X
WAVELET       2.4%                  5.1%                         223X                         2.1X

For further information, please see (Zhong04).

Architectural Power Estimation

Requirements
■ Need to evaluate trade-offs in processor configuration
■ Need to evaluate trade-offs in the software running on the system
■ Must be very fast compared to HDL-based power estimators

Architectural Power Estimation (Wattch)

Overall structure of an architectural power estimator: Wattch [Brooks00]

Parameterized models for different CPU units
■ Can vary size or design style as needed
■ Use the fundamental equation for dynamic power consumption
  ● $P = C \cdot V^{2} \cdot A \cdot f$, where $C$ is the switched capacitance, $V$ the supply voltage, $A$ the activity factor, and $f$ the clock frequency

On each cycle, determine which units are accessed and accumulate energy consumption

Capacitance is modeled for various critical components

Activity factors
■ Runtime measurements using a cycle-accurate performance simulator called SimpleScalar (has been ported to many simulators)
■ Assume an activity factor of 0.5 wherever the simulator cannot report statistics

[Figure: Binary and HW Config feed the Cycle-Level Performance Simulator, which produces a Performance Estimate and Cycle-by-Cycle Hardware Access Counts; the access counts feed Parameterizable Power Models, which produce the Power Estimate.]

Architectural Power Estimation (source: Brooks_hpca2001)

■ Good relative accuracy even when absolute accuracy may be off
■ 10-15% accuracy variations with low-level industry data

Conclusions

High-level power analysis techniques are finally coming of age
■ Efficiency
■ Accuracy

What we could not cover today: using high-level power analysis for optimization
■ Power reports provide information about a design's "hotspots"
■ The presence of power analysis in a high-level design flow makes optimization and design space exploration easy

References

Books/Tutorials
■ [Raghunathan-book] A. Raghunathan, N. K. Jha, and S. Dey, "High-level power analysis and optimization," Kluwer Academic Publishers, 1997
■ [Ravi-aspdac05] "Power Analysis in C-based Design" (part of the tutorial entitled "C-based Design: Industrial Experience"), Asia-South Pacific Design Automation Conference (ASP-DAC), January 2005

Conference Papers
■ [Llopis98] R. Llopis and K. Goossens, "The petrol approach to high-level power estimation," ISLPED 1998, pp. 130-132
■ [Glaser91] K. D. Glaser, K. Kirsch, and K. Neusinger, "Estimating essential design characteristics to support project planning for ASIC design management," in Proc. Int. Conf. Computer-Aided Design, pp. 148-151, Nov. 1991
■ [Liu94] D.
Liu and C. Svensson, "Power consumption estimation in CMOS VLSI chips," IEEE J. Solid-State Circuits, vol. 29, pp. 663-670, June 1994
■ [Nemani96] M. Nemani and F. N. Najm, "High-level power estimation and the area complexity of Boolean functions," in Proc. Int. Symp. Low Power Electronics & Design, pp. 329-334, Aug. 1996
■ [Potlapally01] N. R. Potlapally, A. Raghunathan, G. Lakshminarayana, M. S. Hsiao, and S. T. Chakradhar, "Accurate power macro-modeling techniques for complex RTL circuits," IEEE International Conference on VLSI Design, January 2001
■ [Ravi03] S. Ravi, A. Raghunathan, and S. T. Chakradhar, "Efficient RTL power estimation for large designs," IEEE International Conference on VLSI Design, January 2003
■ [Zhong04] L. Zhong, S. Ravi, A. Raghunathan, and N. K. Jha, "Power estimation for cycle-accurate functional descriptions of hardware," IEEE/ACM International Conference on Computer-Aided Design, November 2004
■ [Coburn05] J. Coburn, S. Ravi, and A. Raghunathan, "Power emulation: A new paradigm for power estimation," ACM/IEEE Design Automation Conference, June 2005
■ [Brooks00] D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: A framework for architectural-level power analysis and optimizations," 27th International Symposium on Computer Architecture (ISCA), June 2000
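
Supplementary sketch: virtual components in a structure-aware CAFD

The structure-aware CAFD slides describe virtual components that store I/O values for the current and previous cycles and invoke a power macro-model once per cycle. The C++ below is a minimal sketch of that idea for the GCD example, not the implementation from (Zhong04). Everything except the name RecordIO is a hypothetical choice made for illustration: the VirtualComponent class, EndCycle, HammingDistance, SimulateGcd, GcdResult, the toy macro-model (energy proportional to the number of toggled I/O bits), the energy coefficients, the idle-cycle policy (inputs hold their previous values), and the binding of the ST_3 subtraction to the same subtractor.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Number of bit positions that differ between two 32-bit values.
inline int HammingDistance(std::uint32_t a, std::uint32_t b) {
    return static_cast<int>(std::bitset<32>(a ^ b).count());
}

// Hypothetical virtual component: keeps the I/O values of the previous and
// current cycle and evaluates a toy macro-model once per cycle.
class VirtualComponent {
public:
    explicit VirtualComponent(double energy_per_toggle)
        : energy_per_toggle_(energy_per_toggle) {
        assert(energy_per_toggle >= 0.0);
    }

    // Called from the CAFD in cycles where the component is active.
    void RecordIO(std::uint32_t in0, std::uint32_t in1, std::uint32_t out) {
        cur_[0] = in0; cur_[1] = in1; cur_[2] = out;
    }

    // Called once per simulated cycle. If RecordIO was not called, cur_ still
    // equals prev_, which models an idle cycle with held inputs (assumption).
    double EndCycle() {
        int toggles = 0;
        for (int i = 0; i < 3; ++i) {
            toggles += HammingDistance(prev_[i], cur_[i]);
            prev_[i] = cur_[i];
        }
        const double energy = energy_per_toggle_ * toggles;
        total_ += energy;
        return energy;
    }

    double total_energy() const { return total_; }

private:
    double energy_per_toggle_;
    std::uint32_t prev_[3] = {0, 0, 0};
    std::uint32_t cur_[3] = {0, 0, 0};
    double total_ = 0.0;
};

struct GcdResult {
    std::uint32_t gcd;
    int cycles;
    double energy;  // arbitrary units; coefficients below are made up
};

// Walks the three-state GCD schedule (ST_1, ST_2, ST_3) from the slides and
// accumulates energy for the functional units only.
GcdResult SimulateGcd(std::uint32_t input_x, std::uint32_t input_y) {
    assert(input_x > 0 && input_y > 0);
    VirtualComponent sub(1.0), lt_cmp(0.5), eq_cmp(0.5);
    auto end_cycle = [&] { sub.EndCycle(); lt_cmp.EndCycle(); eq_cmp.EndCycle(); };

    std::uint32_t x = input_x, y = input_y;  // ST_1: load inputs
    int cycles = 1;
    end_cycle();                             // functional units idle in ST_1

    for (;;) {
        // ST_2: both comparisons and the subtraction execute in one cycle.
        const bool c0 = (x != y);
        const bool c1 = (x < y);
        const std::uint32_t y1 = y - x;      // wraps when y < x; unused then
        eq_cmp.RecordIO(x, y, c0);
        lt_cmp.RecordIO(x, y, c1);
        sub.RecordIO(y, x, y1);
        end_cycle(); ++cycles;

        // ST_3: commit the result of the taken branch.
        if (c0) {
            if (c1) {
                y = y1;
            } else {
                const std::uint32_t diff = x - y;
                sub.RecordIO(x, y, diff);    // assumed to reuse the subtractor
                x = diff;
            }
        }
        end_cycle(); ++cycles;
        if (!c0) break;                      // out = x
    }
    const double energy =
        sub.total_energy() + lt_cmp.total_energy() + eq_cmp.total_energy();
    return GcdResult{x, cycles, energy};
}
```

Calling SimulateGcd(12, 18) walks ST_1 once and the ST_2/ST_3 pair three times. A real flow would replace the toggle-count model with characterized macro-models from the power model library and would also instrument registers and buses.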