Tamper Resistance Mechanisms for Secure Embedded Systems

High-level Power Analysis
Copyright Agarwal & Srivaths, 2007
Low-Power Design and Test, Lecture 4
Outline
• Background
■ CMOS Power Consumption Basics
■ Why Address Power Consumption Issues in High-Level Design
• High-Level Power Analysis
■ RTL Power Estimation
● Fast Synthesis
● Analytical Approaches
● Characterization
■ Accelerating RTL Power Estimation
● Power Emulation (Hardware Accelerated Power Estimation)
■ Beyond RTL Power Estimation
● Power Estimation at the Cycle-accurate Behavior Level
■ Architectural Power Estimation
CMOS Power Consumption Basics
• What are the various components of CMOS power consumption?
Levels of Design Abstraction
(a) Behavioral description
x = input_x;
y = input_y;
while (x != y)
{
  if (x < y) {
    y = y - x;
  } else {
    x = x - y;
  }
}
out = x;
Scheduling
(b) Cycle-accurate functional description
ST_1:
  x = input_x;
  y = input_y;
  goto ST_2;
ST_2:
  c0 = x != y;
  c1 = x < y;
  y1 = y - x;
  goto ST_3;
Binding
(c) RTL description
[Figure: controller (FSM) driving a datapath with functional units !=, <, - and registers reg_c0, reg_c1, reg_y1, reg_x, reg_y; inputs input_x, input_y; output out]
Logic Synthesis
(d) Logic-level netlist
Layout
(e) Transistor-level layout
Why Address Power at Higher Levels of Design Abstraction?
Benefits: Estimation
• Early feedback about power budget
• Faster / fewer design iterations
Benefits: Optimization
• Large power savings possible at higher levels
Design flow with high-level power analysis:
• System-level design <-> System-level power analysis (power models for system-level components)
• High-level synthesis, RTL optimizations <-> Architecture-level power analysis (power models for macroblocks, control logic)
• Logic synthesis <-> Logic-level power analysis (power models for gates, cells, nets)
• Transistor-level / layout synthesis <-> Transistor-level power analysis
[Figure: abstraction levels from highest to lowest (system, algorithm, register-transfer, logic, transistor, layout). Power reduction opportunities increase toward the higher levels, from 20-50% through 2-5X up to 10-20X. Power analysis iteration times decrease toward the higher levels, from hours-days through minutes-hours down to seconds-minutes.]
Fast Synthesis based Power Estimation
• Map the RTL design through "low-effort" synthesis to a gate-level netlist for power estimation [Llopis98]
• Use gate-level power data to perform power estimation
• Approach followed by some commercial tools
Flow: RTL -> Low-Effort Synthesis -> Gate-Level Power Estimation -> Power
[Figure: RTL estimates versus gate-level estimates, 15-20% deviation. Source: [Llopis98]]
Analytical Methods
• Correlate power consumption to simple measures of design complexity
■ Logic structures: use gate count [Glaser91]
$$P_{int} = GE \cdot (E_{typ} + V_{dd}^{2} \cdot C_L) \cdot f \cdot A_{int}$$
■ $GE$: circuit size in NAND2 gate equivalents
■ $E_{typ}$: typical power dissipation per MHz for a NAND2 gate
■ $C_L$: estimated load capacitance per gate
■ $f$, $V_{dd}$: clock frequency, supply voltage
■ $A_{int}$: estimated activity factor per clock cycle (20-30%)
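The gate-count model P_int = GE · (E_typ + Vdd² · C_L) · f · A_int can be evaluated directly. The sketch below is only an illustration: the function name, the unit choices (µW/MHz, pF, MHz) and every numeric value are assumptions, not data from [Glaser91].

```python
def gate_count_power_uw(gate_equivalents, e_typ_uw_per_mhz, v_dd, c_load_pf,
                        f_mhz, activity):
    """Evaluate P_int = GE * (E_typ + Vdd^2 * C_L) * f * A_int in microwatts.

    uW/MHz is numerically the same as pJ, and V^2 * pF is also pJ, so the
    bracketed term is an energy per gate per cycle in pJ.
    """
    energy_per_gate_pj = e_typ_uw_per_mhz + v_dd ** 2 * c_load_pf
    return gate_equivalents * energy_per_gate_pj * f_mhz * activity


# Hypothetical block: 10,000 NAND2 equivalents, 100 MHz, 25% activity.
power_uw = gate_count_power_uw(
    gate_equivalents=10_000,
    e_typ_uw_per_mhz=0.02,   # assumed value
    v_dd=1.8,                # assumed value
    c_load_pf=0.01,          # assumed value
    f_mhz=100.0,
    activity=0.25,
)
print(f"Estimated power: {power_uw / 1000:.2f} mW")
```

Because the model is linear in GE, f and A_int, doubling any one of them doubles the estimate, which makes it convenient for quick what-if comparisons.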
Analytical Methods
■ Memories [Liu94]
$$P_{mem} = P_{memcell} + P_{row} + P_{col} + P_{sense}$$
■ $P_{memcell}$ is the dominant component
$$P_{memcell} = \frac{2^{k}}{2} \left( c_{int} \cdot l_{column} + 2^{n-k} \cdot C_{tr} \right) V_{dd} \cdot V_{swing} \cdot f_{mem\_clock}$$
■ $2^{k}$: number of memory cells in a row, $2^{n-k}$: number of rows
■ $c_{int}$: capacitance of unit wire length
■ $l_{column}$: column interconnect length
■ $C_{tr}$: drain diffusion capacitance on the bit / bit-bar line
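As a rough numeric illustration of the memory-cell term, the sketch below evaluates P_memcell = (2^k / 2) · (c_int · l_column + 2^(n-k) · C_tr) · Vdd · Vswing · f. The function name, SI units and all parameter values are assumptions chosen for the example; they are not taken from [Liu94].

```python
def memcell_power_w(n, k, c_int_f_per_um, l_column_um, c_tr_f,
                    v_dd, v_swing, f_clock_hz):
    """Evaluate the memory-cell power term, in watts.

    2**(n - k) is the number of rows, so it scales the per-cell drain
    capacitance seen on each bit line; 2**k / 2 is the leading factor.
    """
    rows = 2 ** (n - k)
    bitline_cap_f = c_int_f_per_um * l_column_um + rows * c_tr_f
    return (2 ** k / 2) * bitline_cap_f * v_dd * v_swing * f_clock_hz


# Hypothetical 1 Kbit array: n = 10, k = 5 (32 rows).
p_memcell = memcell_power_w(
    n=10, k=5,
    c_int_f_per_um=0.2e-15,  # assumed wire capacitance per micron
    l_column_um=100.0,       # assumed column length
    c_tr_f=1e-15,            # assumed drain diffusion capacitance
    v_dd=1.8, v_swing=0.2,
    f_clock_hz=100e6,
)
print(f"P_memcell = {p_memcell * 1e6:.3f} uW")
```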
Analytical Methods
• Entropy based approach [Nemani96]
■ Entropy: measure of uncertainty in a random variable
■ Entropy H of a random variable x is given by
$$H(x) = p \log\frac{1}{p} + (1 - p) \log\frac{1}{1 - p}$$
■ p: probability of x being 1
• Recall that
$$P_{avg} \propto D_{avg} \cdot GE \cdot C_{avg}$$
■ $D_{avg}$: average node switching activity
■ $GE$: gate equivalents, $C_{avg}$: average gate capacitance
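The entropy of a single binary signal, H(x) = p log(1/p) + (1 - p) log(1/(1 - p)), is easy to measure from a simulation trace. The sketch below assumes base-2 logarithms (entropy in bits), which the slide does not state, and uses hypothetical helper names and a made-up trace.

```python
import math


def signal_probability(bits):
    """Fraction of cycles in which the signal is 1."""
    return sum(bits) / len(bits)


def binary_entropy(p):
    """H = p*log2(1/p) + (1-p)*log2(1/(1-p)); zero for a constant signal."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))


# Hypothetical trace of one signal over eight cycles.
trace = [1, 0, 0, 1, 0, 0, 0, 0]
p_one = signal_probability(trace)
print(f"p = {p_one:.2f}, H = {binary_entropy(p_one):.4f} bits")
```

A signal that is 1 half of the time has the maximum entropy of 1 bit, while a constant signal has entropy 0.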
Analytical Methods
• Hypothesis
■ Can $D_{avg}$ be estimated only from knowledge of input and output behavior?
• Answer: Yes!
$$P_{avg} \propto H \cdot GE \cdot C_{avg}$$
• Entropy H is given by
$$H \approx \frac{2/3}{n + m} \left( H_i + 2 H_o \right)$$
• $H_i$ and $H_o$ are respectively the input and output entropies; n and m are the numbers of inputs and outputs
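A small sketch of the entropy-based estimate, assuming the average entropy takes the form H ≈ (2/3) / (n + m) · (H_i + 2 H_o) with n inputs and m outputs. Since the slide only gives a proportionality for P_avg, the result is a relative figure of merit rather than a power in watts. Function names and numbers are hypothetical.

```python
def average_entropy(h_in, h_out, n_inputs, m_outputs):
    """H ~ (2/3) / (n + m) * (H_i + 2 * H_o)."""
    return (2.0 / 3.0) / (n_inputs + m_outputs) * (h_in + 2.0 * h_out)


def relative_power(h_in, h_out, n_inputs, m_outputs, gate_equivalents, c_avg):
    """Quantity proportional to P_avg: H * GE * C_avg (arbitrary units)."""
    h = average_entropy(h_in, h_out, n_inputs, m_outputs)
    return h * gate_equivalents * c_avg


# Hypothetical block: 16 inputs with 16 bits of input entropy,
# 9 outputs with 8 bits of output entropy.
h_avg = average_entropy(h_in=16.0, h_out=8.0, n_inputs=16, m_outputs=9)
print(f"H = {h_avg:.4f}")
print(f"relative power = {relative_power(16.0, 8.0, 16, 9, 500, 1.0):.1f}")
```

Comparing two candidate implementations of the same function then only requires their gate counts and average capacitances, since the input/output entropies are shared.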
Analytical Methods
• Entropy based power estimation methodology:
■ Run a structural RTL simulation to measure input/output entropies
■ Using input/output entropies, estimate $P_{avg}$ for the combinational block
■ Use other techniques [Liu94] to estimate latch and clock power
Characterization Based Approaches
■ Characterization based power macro-models [Raghunathan-book, Raviaspdac05]
● Characterize a lower level implementation of an RTL block
● Construct a macromodel or power model
  Power = f(I/O signal statistics)
● Applicable in behavioral synthesis environments
Macromodel construction flow: RTL component library -> macromodel template selection (complexity analysis, variable / parameter selection) -> pattern generation -> training sequences -> logic- / transistor-level power simulator -> power profiles -> data fitting / coefficient extraction -> power macromodels
Power Models
What does the power model implement?
Power = coeff_0 + transition_count(in1[t], in1[t-1]) * coeff_1 +
        transition_count(in2[t], in2[t-1]) * coeff_2 +
        ……………………. +
        transition_count(inN[t], inN[t-1]) * coeff_N
What does the power model contain?
■ Queues to store present and past values
■ Transition count function is a simple computation
■ Coefficients aggregated based on output of transition count function
[Figure: hardware view of a power model. Component inputs/outputs in1, in2, ..., inN feed queues (D/Q registers) and transition count functions; the results weight Coeff_1[31:0] ... Coeff_N[31:0], which are summed with Coeff_0[31:0] and registered on POW_STROBE to produce Power[31:0]]
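The macromodel template above is a linear function of per-input transition counts. A minimal software sketch, assuming inputs are held as unsigned integers and the transition count is the number of bit positions that changed (the coefficient values are made up):

```python
def transition_count(current, previous):
    """Number of bit positions that differ between two unsigned values."""
    return bin(current ^ previous).count("1")


def macromodel_power(coeffs, current_inputs, previous_inputs):
    """Power = coeff_0 + sum_i transition_count(in_i[t], in_i[t-1]) * coeff_i."""
    power = coeffs[0]
    for coeff, cur, prev in zip(coeffs[1:], current_inputs, previous_inputs):
        power += coeff * transition_count(cur, prev)
    return power


# Hypothetical two-input component with assumed coefficients.
coeffs = [0.04, 0.001, 0.002]
power = macromodel_power(coeffs,
                         current_inputs=[0b1010, 0b0001],
                         previous_inputs=[0b0110, 0b0001])
print(f"power = {power:.4f}")
```

In the example, in1 toggles two bits and in2 toggles none, so the estimate is coeff_0 + 2 * coeff_1.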
Constructing Power Models: An Example
[Figure: 16-bit ADDER with inputs In1[0:15] and In2[0:15] and output Out[0:15]]
(1) Macromodel template
Power = coeff_0+
transition_count(in1_0[t], in1_0[t-1]) * coeff_1 +
transition_count(in1_1[t], in1_1[t-1]) * coeff_2 +
……………………. +
transition_count(in1_15[t], in1_15[t-1]) * coeff_16 +
transition_count(in2_0[t], in2_0[t-1]) * coeff_17 +
transition_count(in2_1[t], in2_1[t-1]) * coeff_18 +
……………………. +
transition_count(in2_15[t], in2_15[t-1]) * coeff_32
(2) Training Sequence
10101011011011011010010010110010;
11101011101101101110011110001001;
11110100011111100000100100101010;
…………………
………………..
Constructing Power Models: An Example
(3) Gate-Level Power Data
0.079140
0.030423
0.126169
………………
………………
(4) Outputs from Regression – Inputs (1), (2), and (3)
coeff_0 = 0.04110908
coeff_1 = 0.001006622
coeff_2 = 0.001146324
………………
………………
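Step (4) is an ordinary least-squares fit of the template coefficients to the gate-level power data. The sketch below shows one way to do this with only the standard library, solving the normal equations by Gaussian elimination. The training data is synthetic (generated from assumed coefficients), and the helper names are hypothetical; a production flow would more likely use a numerical library.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_macromodel(toggle_rows, powers):
    """Least-squares fit of power = coeff_0 + sum_i toggles[i] * coeff_i."""
    rows = [[1.0] + [float(t) for t in toggles] for toggles in toggle_rows]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * p for r, p in zip(rows, powers)) for i in range(k)]
    return solve_linear(ata, atb)


# Synthetic training set for a 2-bit component: per-cycle toggle indicators
# and the power a gate-level simulator might report (assumed coefficients).
toggles = [(0, 0), (1, 0), (0, 1), (1, 1), (1, 0), (0, 1)]
true_coeffs = [0.04, 0.001, 0.002]
powers = [true_coeffs[0] + true_coeffs[1] * a + true_coeffs[2] * b
          for a, b in toggles]
fitted = fit_macromodel(toggles, powers)
print([round(c, 6) for c in fitted])
```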
Constructing Power Models: An Example
(5) Putting it all together
entity add_power IS
  port (in1 : IN std_logic_vector;
        in2 : IN std_logic_vector;
        POW_STROBE : IN std_logic;
        power : OUT real);
end add_power;

architecture VHDLgen OF add_power IS
  type queue1 is ARRAY (1 downto 0) of
    std_logic_vector(0 to (in1'high - in1'low));
  type queue2 is ARRAY (1 downto 0) of
    std_logic_vector(0 to (in2'high - in2'low));
begin
  process(POW_STROBE)
    variable queue_in1 : queue1;
    variable queue_in2 : queue2;
    variable bw : integer;
    variable flag : integer;
  begin
    -- BIT-WIDTH INFERENCE
    -- Infer bitwidth of RTL component (rounded up to a power of two)
    flag := 0;
    bw := (queue_in1(1)'high - queue_in1(1)'low) + 1;
    for i in 0 to bw loop
      if (flag = 0) then
        if (bw <= 2**i) then
          bw := 2**i; flag := 1;
        end if;
      end if;
    end loop;
    if POW_STROBE = '1' AND (POW_STROBE'event) then
      -- QUEUE MANAGEMENT
      -- Store current and previous I/O values
      queue_in1(1) := queue_in1(0);
      queue_in2(1) := queue_in2(0);
      queue_in1(0) := in1;
      queue_in2(0) := in2;
      -- MACROMODEL COMPUTATION
      -- Compute bit-level I/O switching activity and weigh it by the
      -- power coefficients
      case bw IS
        when 2 => power <=
            tc(queue_in1(0),queue_in1(1),0) * 7.88452e-05
          + tc(queue_in1(0),queue_in1(1),1) * 7.800038e-05
          + tc(queue_in2(0),queue_in2(1),0) * 0.0002803612
          + tc(queue_in2(0),queue_in2(1),1) * 5.245284e-05;
        when 4 => power <=
            tc(queue_in1(0),queue_in1(1),0) * 0.0002173669
          + tc(queue_in1(0),queue_in1(1),1) * 0.0002525756
          + tc(queue_in1(0),queue_in1(1),2) * 0.00023067
          + tc(queue_in1(0),queue_in1(1),3) * 0.0001498218
          + tc(queue_in2(0),queue_in2(1),0) * 0.0001684765
          . . . . . . . .
        . . . . . . . .
      end case;
    end if;
  end process;
end VHDLgen;
Improvements to Macromodels
• RTL components can exhibit significantly different power behavior for different parts of the input space [Potlapally00]
• See example circuit:
■ C5 implements part of the GCD algorithm
■ C5 also implements operand gating for the subtractor
C5 behavior:
if (x > y)
  z = x - y
else
  z = y
Improvements to Macromodels
[Figure: two plot panels, "Conventional Approach" and "Proposed Approach"]
• 98% of the points in the upper cluster satisfy the condition (x>y): Power Mode 1
• All the points in the lower cluster satisfy the condition (x<=y): Power Mode 2
Improvements to Macromodels
• Power mode identification function (PIF) deduces the power mode based on the input vectors
• Appropriate macromodel gets invoked based on the identified power mode
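A sketch of how a power mode identification function could select a macromodel for the C5 example, where mode 1 corresponds to x > y (subtractor active) and mode 2 to x <= y (operands gated). The per-mode coefficients and function names are invented for illustration and are not taken from [Potlapally00].

```python
def transition_count(current, previous):
    """Number of bit positions that differ between two unsigned values."""
    return bin(current ^ previous).count("1")


def power_mode(x, y):
    """PIF for the C5 example: mode 1 if x > y, otherwise mode 2."""
    return 1 if x > y else 2


# One macromodel (coeff_0, coeff_x, coeff_y) per power mode; assumed values.
MODE_COEFFS = {
    1: (0.05, 0.002, 0.002),
    2: (0.01, 0.0005, 0.0005),
}


def c5_power(x, y, prev_x, prev_y):
    """Invoke the macromodel that matches the identified power mode."""
    c0, cx, cy = MODE_COEFFS[power_mode(x, y)]
    return c0 + cx * transition_count(x, prev_x) + cy * transition_count(y, prev_y)


print(c5_power(5, 3, 4, 3))  # x > y: mode 1 model
print(c5_power(3, 5, 3, 5))  # x <= y: mode 2 model
```

Keeping one small model per mode avoids forcing a single linear fit across two clusters with very different power levels.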
Characterization Based Power Estimation
• Structural power profile
• Characterization based macromodeling
• Simulate-able power libraries
• Tightly coupled with RTL design planning
CHARACTERIZATION FLOW
Input: RTL library (synthesizable spec. for each component) and synthesis conditions
- Speed (fast/medium/slow)
- Output cap. load
- Input slew rate
Steps: Synthesis -> P&R -> post-layout netlists -> power characterization -> power macro-model database -> power model library generator
Output: simulateable power model libraries (Powerlib.vhd, Powerlib.v, Powerlib.c)
ESTIMATION FLOW
Input: RTL design (HDL), testbench / stimuli
Steps: RT-level design planning / mapping -> structural (macro) netlist -> power model inference and estimation code generation -> enhanced RTL -> simulation
Output: power profiles
- Cycle-by-cycle power report
- Support rel. and abs. accuracy
Enhanced RTL: Graphical View
Example: Power Model Enhanced RTL
[Figure: RTL design (controller FSM; functional units +, +/-, <, =, <=, -1, >> 1; registers reg_c0, reg_c1, reg_mid, reg_first, reg_last, reg_out; Bus 1, Bus 2, Bus 3; ports first, last, value, data, addr, out) with a power model attached to each component, a power strobe generator, and a power aggregator that outputs total power]
Main components include
■ Power models for every component: monitor component I/O values and compute power
■ Power strobe generator: trigger power models (statistical sampling is employed for improved efficiency, since RTL simulation can also be slow for large designs)
■ Power aggregator: compute total power consumption
Enhanced RTL: An illustration
ENTITY gcd is
  port ( RESET : IN std_logic;
         CLOCK : IN std_logic;
         yi : IN std_logic_vector(0 to 7);
         xi : IN std_logic_vector(0 to 7);
         . . . .);
end gcd;

ARCHITECTURE VHDLgen of gcd is
  signal M_39 : std_logic;
  signal M_38 : std_logic;
  signal VHDLgen_fu3000 : real := 0.0;
  . . . .
  component cmp_lt
    port (i1 : IN std_logic_vector(0 to 7);
          i2 : IN std_logic_vector(0 to 7);
          o1 : BUFFER std_logic);
  end component;
  . . . .
  -- POWER MODEL
  component cmp_lt_power
    port (in1 : in std_logic_vector;
          in2 : in std_logic_vector;
          out1 : in std_logic;
          POW_STROBE : in std_logic;
          power : out real);
  end component;
  . . . .
begin
  . . . .
  cmp_lt port map(cmp_lt1i1, cmp_lt1i2, cmp_lt1ot);
  cmp_lt_power port map ( cmp_lt1i1(0 to 7), cmp_lt1i2(0 to 7), POW_STROBE, VHDLgen_fu3000);
  . . . .

  -- POWER MODEL TRIGGERING
  POW_STROBE_GEN : process( CLOCK )
  begin
    POW_STROBE <= CLOCK after POW_STROBE_DELAY;
  end process POW_STROBE_GEN;

  -- POWER AGGREGATION FOR EACH COMPONENT CLASS
  POW_TOTAL : process
  begin
    wait until (POW_STROBE='1' AND POW_STROBE'event)
            OR (POW_STROBE_REG='1' AND POW_STROBE_REG'event);
    . . . .
    FU_power <= VHDLgen_fu3000 + VHDLgen_fu3001 + . . . .;
    REG_power <= VHDLgen_reg3008 + VHDLgen_reg3009 + . . . .;
    . . . .
  end process;

  -- POWER AGGREGATION FOR COMPLETE DESIGN
  ENERGY_GEN : process
  begin
    wait until CLOCK'event OR POW_STROBE'event OR POW_STROBE_REG'event;
    if( ( POW_STROBE = '1') and POW_STROBE'event ) then
      main_cycle_energy := (GATE_power + FU_power + MUX_power)*characterization_period;
      main_energy := main_energy + (GATE_power + FU_power + MUX_power)*characterization_period;
    end if;
    if( CLOCK = '1' and CLOCK'event ) then
      num_clocks := num_clocks + 1;
      main_power := main_energy / (real(num_clocks) * clock_period);
    end if;
  end process energy_gen;
end VHDLgen;
The CPU time overheads of RTL power estimation
• Need improvements in efficiency for large designs [Ravi03]
[Figure: bar chart of LOG (Time in Seconds) for functional simulation versus RTL power estimation on circuits CKT1 to CKT6; the largest design has 1.25 million transistors]
Simulation time data obtained using ModelSim 5.3 (ModelTech)
Observations
• Power estimation time depends on the HDL constructs used in the power estimation code
■ Solution: HDL-aware optimizations
• Computation can be traded off for storage to improve efficiency
■ Solution: computation versus storage trade-offs
• Power estimation effort should be directed where needed (significant contributors, tough-to-estimate portions)
■ Solution: partitioned statistical sampling
Solution 1: HDL-aware optimizations
• Convert operations with complex datatypes into operations with simpler datatypes
• Inline HDL functions to eliminate function maintenance overheads
• Minimize power model activations
• Reduce workload of a power model process
EXAMPLE: BIT-WIDTH INFERENCE CODE IN POWER MODEL
Before (bit-width is inferred on every activation):
flag := 0;
bw := (queue_in1(1)'high - queue_in1(1)'low) + 1;
for i in 0 to bw loop
  if (flag = 0) then
    if (bw <= 2**i) then
      bw := 2**i; flag := 1;
    end if;
  end if;
end loop;
After (inference is skipped once flag has been set):
if (flag = 0) then
  bw := (queue_in1(1)'high - queue_in1(1)'low) + 1;
  for i in 0 to bw loop
    if (flag = 0) then
      if (bw <= 2**i) then
        bw := 2**i; flag := 1;
      end if;
    end if;
  end loop;
end if;
Solution 2: Computation vs Storage Trade-offs
• Compute average power consumption once in k cycles
• Store observed signal bits of RTL component for k cycles
• Compute transition counts and power consumption only in the kth cycle
[Figure: timeline contrasting "store, compute" in every cycle with several "store" cycles followed by a single "compute"]
[Figure: variations in simulation time (axis from 3200 to 4600) with queue length (10^0 to 10^4)]
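The store-then-compute idea can be sketched as a power model that only buffers values on each strobe and evaluates the macromodel once every k cycles. The class name, the single-input linear model and the coefficient values are assumptions made for this illustration.

```python
class BufferedPowerModel:
    """Stores k observed values, then computes average power in the kth cycle."""

    def __init__(self, coeff_0, coeff_1, k):
        self.coeff_0 = coeff_0
        self.coeff_1 = coeff_1
        self.k = k
        self.queue = []        # values observed since the last computation
        self.last_value = 0    # value seen just before the current window
        self.reports = []      # one average-power figure per k cycles

    def strobe(self, value):
        """Cheap path: store only. Expensive path: runs once per k strobes."""
        self.queue.append(value)
        if len(self.queue) < self.k:
            return
        toggles = 0
        previous = self.last_value
        for current in self.queue:
            toggles += bin(current ^ previous).count("1")
            previous = current
        self.reports.append(self.coeff_0 + self.coeff_1 * toggles / self.k)
        self.last_value = previous
        self.queue.clear()


model = BufferedPowerModel(coeff_0=0.04, coeff_1=0.001, k=4)
for observed in [1, 3, 3, 0]:
    model.strobe(observed)
print(model.reports)
```

A larger k means fewer macromodel evaluations but a longer queue to store and scan, which is the trade-off the slide describes for simulation time versus queue length.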
Partitioned Sampling
Motivation: Smart "allocation of effort" during power estimation
[Figure: Mean and Variance Scatter Plot (MVSP) for an example design]
• Components with low mean power, low variance (low impact on accuracy)
• Components with high mean power, high variance (deserve high estimation effort)
Partitioned Sampling Algorithm
Input: power model enhanced design in HDL
1. HDL compilation
2. Simulate for a user-specified fraction of the overall simulation time (yields power profiles of all the RTL components)
3. Determine Mean Variance Scatter Plot (MVSP) for the RTL design
4. Apply clustering algorithm on MVSP to group components with similar mean and variance
5. Fix sampling probabilities for clustered RTL components
6. Transform HDL to incorporate sampled partitions
7. HDL compilation
8. Full simulation
Output: RTL power estimate
Fixing the Sampling Probabilities
Objective: Determine the sampling probabilities for n clusters $C_1, C_2, \ldots, C_n$
Obs. #1: The error due to sampling, given by $\Delta P = \sum_{i=1}^{n} \Delta P_i$, must be minimized
Obs. #2: The error in sampling a cluster $C_i$ that accounts for a greater fraction ($f_i$) of the total power must be kept small. That is,
$$\text{Minimize } \Delta P_{weighted} = \sum_{i=1}^{n} f_i \cdot \Delta P_i \qquad \text{(Equation 1)}$$
$$\Delta P_i = \sum_{comp \in C_i} \Delta P_{comp} = const \cdot \frac{\sum_{comp \in C_i} s_{comp}}{\sqrt{N_i}}$$
■ $\Delta P_{comp}$: power estimation error due to sampling for a component comp
■ $s_{comp}$: standard deviation of the power profile of a component comp
■ $N_i$: number of samples for the cluster $C_i$
Fixing the Sampling Probabilities
Formulation: Minimize Equation 1 subject to the following constraints
$$|C_1| \cdot N_1 + |C_2| \cdot N_2 + \cdots + |C_n| \cdot N_n \le f \cdot N_{tot} \cdot \sum_{i=1}^{n} |C_i|$$
■ Left-hand side: number of component-samples
■ Right-hand side: computational budget
$$N_i \ge 1, \quad \forall\, i = 1..n$$
Formulation is now a "linearly constrained optimization" problem
■ Many solvers available (Excel, Ampl)
Solution: $Pr_i = N_i / N_{tot}$
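A sketch of the allocation step, under two assumptions that go beyond the slide: the per-cluster error is taken to fall as 1/sqrt(N_i), so the objective is sum_i w_i / sqrt(N_i) with w_i = f_i times the summed standard deviations of cluster i, and the budget constraint is treated as an equality with real-valued N_i. With those assumptions a Lagrange-multiplier argument gives N_i proportional to (w_i / |C_i|)^(2/3). The N_i >= 1 bound and integer rounding are not handled here; a general-purpose solver, as the slide suggests, would cover them.

```python
def allocate_samples(weights, cluster_sizes, budget):
    """Minimize sum_i weights[i] / sqrt(N_i) s.t. sum_i cluster_sizes[i] * N_i = budget.

    Continuous relaxation: setting the derivative of the Lagrangian to zero
    gives N_i proportional to (weights[i] / cluster_sizes[i]) ** (2/3).
    """
    shares = [(w / c) ** (2.0 / 3.0) for w, c in zip(weights, cluster_sizes)]
    scale = budget / sum(c * s for c, s in zip(cluster_sizes, shares))
    return [scale * s for s in shares]


# Hypothetical design: two single-component clusters, 100 strobes in total,
# and a budget fraction of 0.5 (all values assumed).
n_tot = 100
fraction = 0.5
sizes = [1, 1]
weights = [8.0, 1.0]          # f_i * (sum of s_comp in the cluster), assumed
budget = fraction * n_tot * sum(sizes)

samples = allocate_samples(weights, sizes, budget)
probabilities = [n / n_tot for n in samples]   # Pr_i = N_i / N_tot
print(samples, probabilities)
```

The high-weight cluster receives four times as many samples as the low-weight one, rather than eight times, because the benefit of extra samples diminishes with the square root.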
RTL Power Estimation: Results
• Designs as large as 1.25 million transistors have been successfully evaluated using our RTL Power Estimator (RTL-PEST)*
• RTL power estimates roughly 5 to 10% off gate-level power estimates
• RTL power estimation 10-50X faster than gate-level power estimation
• Power estimation speed better than the best available commercially
Power (mW), gate-level versus RTL estimates:
Circuit   Gate     RTL      Difference (chart label)
CKT1      0.3694   0.3541   4.1%
CKT2      0.7308   0.7038   3.8%
CKT3      0.503    0.433    13.9%
CKT4      0.5237   0.5398   2.9%
CKT5      0.1518   0.1783   14.8%
CKT6      0.1708   0.1687   1.2%
CKT7      0.2332   0.2622   12.2%
[Figure: execution time (sec, log scale) of RTL-PEST and a commercial tool (Comm) for designs D1 to D7, together with the speedup of RTL-PEST over Comm]
* For further information, please see [Ravi03]
RTL Power Estimation: Results
[Figure: percentage error versus CPU time trade-off for partitioned sampling and testbench reduction techniques]
[Figure: local power estimation errors for partitioned sampling and conventional sampling techniques]
Power Emulation Technology Overview
• New paradigm for power estimation! [Coburn05]
Basic Observations
1. Power estimation uses power models for different components
2. Power models are themselves simple functions
3. Emulation is commonly used to speed up circuit simulation
[Figure: the power model enhanced RTL design (components with power models, power strobe generator, power aggregator producing total power) is mapped onto an FPGA platform; a host PC supplies the testbench and collects the outputs]
• 2 to 3 orders of magnitude speedup possible!
Power Emulation: Challenges
• Size of design enhanced with power models is very large!
■ Size increases by 18.2X on average for MPEG4 sub-designs
■ Enhanced version exceeds capacity of largest Xilinx Virtex-II FPGA
[Figure: FPGA area (LUTs) of the normal design versus the design for power estimation for each MPEG4 design module, with the capacity of the XC2V8000 FPGA marked; per-module area increases of 14.7X to 20.6X]
• Need to reduce the area requirements of power models!
Power Emulation: Challenges
• Why area increase?
■ Resource-hungry power models used for every RTL component in the design
• How to reduce area?
■ Optimize the number of power models used
■ Make the implementations of power models resource-efficient
■ Catch: Ensure minimum loss of estimation accuracy due to area reduction techniques
Area Optimization Techniques
• Clustering of power models
■ Single power model servicing multiple components
• Changing component granularity
■ Constructing power models for complex components that subsume several smaller components
• Exploiting correlation
■ Using power correlation between components to reduce the number of monitored components
• Optimizing power model implementations
■ Multi-cycling additions in power model computations
■ Using FPGA block memories for efficient storage of power model coefficients
Power Emulation: Results
• Evaluation on various NEC designs; comparison with RTL-PEST and Comm-RTL
Estimation time (sec); Acc is the speedup of emulation over RTL-PEST and over Comm-RTL:
CKT       RTL-PEST   Comm-RTL   Emulation   Acc
Sort      11.6       80.2       1.2         9.7X, 66.8X
HVPeakF   120.3      136.8      1.7         70.8X, 80.5X
DCT       172.9      173.3      3.7         46.7X, 46.8X
MPEG4     3300       2587       6.3         524X, 411X
Estimated power (mW):
CKT       RTL-PEST   Emulation   Error
Sort      0.33       0.31        3.53%
HVPeakF   0.14       0.14        0%
DCT       8.22       7.76        5.6%
MPEG4     4.74       4.51        4.9%
FPGA area (LUTs); AO is the area overhead:
CKT       Normal design   Design for power estimation   AO
Sort      1605            5665                          3.53X
HVPeakF   3192            9016                          2.82X
DCT       6121            19242                         3.14X
MPEG4     24907           72351                         2.9X
[Figure: MPEG4 estimation time: RTL-PEST 3300 sec, Comm-RTL 2587 sec, power emulation 6.3 sec. Nearly 500X speedup possible!]
• Up to 500X speedup compared to RTL power estimation
• 3% loss of accuracy on average
• Area overheads lowered to ≈3X
• For further information, please see [Coburn05]
Cycle-Accurate Functional Descriptions (CAFDs)
(a) Behavioral description
x = input_x;
y = input_y;
while (x != y)
{
  if (x < y) {
    y = y - x;
  } else {
    x = x - y;
  }
}
out = x;
Scheduling
(b) Cycle-accurate functional description (increasingly used in system-level simulation)
ST_1:
  x = input_x;
  y = input_y;
  goto ST_2;
ST_2:
  c0 = x != y;
  c1 = x < y;
  y1 = y - x;
  goto ST_3;
ST_3:
  if (c0) {
    if (c1) {
      y = y1;
    } else {
      x = x - y;
    }
    goto ST_2;
  }
  out = x;
  goto ST_1;
Binding
(c) RTL implementation
[Figure: controller (FSM); functional units cmp (!=), lt_cmp (<), sub (-); registers reg_c0, reg_c1, reg_y1, reg_x, reg_y; Bus1, Bus2; inputs input_x, input_y; output out]
Cycle-Accurate Functional Descriptions (CAFDs)
• Challenge: No structural information is available in the cycle-accurate functional description
Overview of Power Estimation using Cycle-Accurate Functional Descriptions (CAFDs)
• Objectives
■ Extract minimum RTL structural info.
■ Back-annotate RTL structural info.
• More information in (Zhong04)
Input: CAFD (scheduled behavior), simulation test bench, resource and timing constraints, power model library
Steps: synthesis of the CAFD to RTL -> preprocessing (RTL information extraction, virtual component instantiation, idle cycle analysis) -> structure-aware CAFD (C++/SystemC) -> cycle-accurate functional simulation
Output: power report (power vs. time)
Structure-aware CAFD
The structure-aware CAFD consists of:
• Original cycle-accurate functional description
• Virtual components
■ Store I/O values for current & previous cycles
■ Invoke the power macro-model in each cycle
• I/O mapping
■ Traces appropriate CAFD variables to capture component I/Os in each cycle
■ Generates idle cycle input values
• Power aggregation & reporting code
It is simulated together with the simulation test bench and the power model library (add_power, reg_power, ...).
Example Snippet of a Structure-aware CAFD
Original CAFD:
ST_1:
  x = input_x;
  y = input_y;
  goto ST_2;
ST_2:
  c0 = x != y;
  c1 = x < y;
  y1 = y - x;
  goto ST_3;
ST_3:
  if (c0) {
    if (c1) {
      y = y1;
    } else {
      x = x - y;
    }
    goto ST_2;
  }
  out = x;
  goto ST_1;
Structure-aware CAFD:
// Instantiate virtual components
VC<bus,8> bus1,bus2;
VC<reg,8> reg_y1;
VC<lt,1> lt_cmp;
…
ST_1:
  …
ST_2:
  // Update virtual component I/O values
  bus1.RecordInput(x);
  bus2.RecordInput(y);
  c0 = x != y;
  reg_c0.RecordInput(c0);
  eq_cmp.RecordIO(x,y,c0);
  c1 = x < y;
  reg_c1.RecordInput(c1);
  lt_cmp.RecordIO(x,y,c1);
  y1 = y - x;
  reg_y1.RecordInput(y1);
  sub.RecordIO(y,x,y1);
  CalculatePower();
  goto ST_3;
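The behavior of a virtual component (remember the current and previous I/O values, then apply a macromodel each cycle) can be mimicked in a few lines. This is a Python stand-in for illustration only, not the C++/SystemC classes referred to in the snippet; the class name, the single per-toggle coefficient and the 8-bit masking are assumptions.

```python
class VirtualComponent:
    """Keeps current and previous I/O values and applies a linear macromodel."""

    def __init__(self, name, coeff_0, coeff_per_toggle):
        self.name = name
        self.coeff_0 = coeff_0
        self.coeff_per_toggle = coeff_per_toggle
        self.previous = ()
        self.current = ()

    def record_io(self, *values):
        """Shift the current values into 'previous' and store the new ones."""
        self.previous, self.current = self.current, values

    def power(self):
        """Base power plus a cost per toggled I/O bit since the last cycle."""
        if len(self.previous) != len(self.current):
            return self.coeff_0          # first cycle: nothing to compare
        toggles = sum(bin(c ^ p).count("1")
                      for c, p in zip(self.current, self.previous))
        return self.coeff_0 + self.coeff_per_toggle * toggles


# Two visits to ST_2 of the GCD example, tracing the subtractor y1 = y - x.
sub = VirtualComponent("sub", coeff_0=0.01, coeff_per_toggle=0.001)
x, y = 12, 8
sub.record_io(y, x, (y - x) & 0xFF)      # 8-bit wrap-around assumed
first_cycle_power = sub.power()
x = x - y                                # ST_3 took the else branch
sub.record_io(y, x, (y - x) & 0xFF)
second_cycle_power = sub.power()
print(first_cycle_power, second_cycle_power)
```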
C-based HW Power Estimation: Results
• Compared accuracy and efficiency vs. gate-level and RTL-PEST
■ 50-100X speedup (or more) for various designs, less than 20% error w.r.t POWERD
Circuit       Average       Cycle-level       Speedup vs.       Slowdown vs.
              Power Error   Absolute Error    RTL Estimation    Functional Simulation
DES           2.1%          2.2%              83 X              1.1 X
HDTV-1        1.7%          4.0%              356 X             3.2 X
JPEG          2.7%          6.6%              1,143 X           3.3 X
MPEG4-IDCT    3.1%          5.1%              412 X             3.2 X
MPEG4-ISPQ    1.5%          2.4%              438 X             2.1 X
SORT          1.7%          5.4%              266 X             1.7 X
VITERBI       1.4%          6.5%              305 X             3.0 X
WAVELET       2.4%          5.1%              223 X             2.1 X
• For further information, please see (Zhong04)
Architectural Power Estimation
 Requirements
■ Need to evaluate trade-offs in processor configuration
■ Need to evaluate trade-offs in software running on system
■ Must be very fast compared to HDL-based power estimators
Architectural Power Estimation (Wattch)
 Overall structure of Wattch, an architectural power estimator [Brooks00]
 Parameterized models for different CPU units
■ Can vary size or design style as needed
■ Use the fundamental equation for dynamic power consumption
● $P = C V^2 A f$ (C: load capacitance, V: supply voltage, A: activity factor, f: clock frequency)
 On each cycle, determine which units are accessed and accumulate energy consumption
 Capacitance modeled for various critical components
 Activity factors
■ Runtime measurements using a cycle-accurate performance simulator called SimpleScalar (has been ported to many simulators)
■ Assume an activity factor of 0.5 for units for which the simulator cannot report statistics
[Figure: Wattch structure. A binary and a hardware configuration feed the cycle-level performance simulator, which produces a performance estimate and cycle-by-cycle hardware access counts. The access counts drive the parameterizable power models, which produce the power estimate.]
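The per-cycle accumulation described for Wattch can be sketched in C++ as follows. This is a simplified illustration, not Wattch's actual code. The names UnitModel, AccessEnergy, TotalEnergy and AveragePower are hypothetical, and each unit is assumed to have a single lumped capacitance. One access in one cycle is charged C·V²·A, which is the dynamic power equation divided by f.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-unit parameters for a Wattch-style model.
struct UnitModel {
    double capacitance;  // lumped switched capacitance C (assumed value)
    double activity;     // activity factor A in [0, 1]; 0.5 when unknown
};

// Energy of one access in one clock cycle: P / f = C * V^2 * A.
inline double AccessEnergy(const UnitModel& unit, double vdd) {
    assert(unit.activity >= 0.0 && unit.activity <= 1.0);
    return unit.capacitance * vdd * vdd * unit.activity;
}

// Accumulate energy over a trace of per-cycle access counts.
// access_counts[cycle][unit] is the number of accesses to that unit in that
// cycle, as a cycle-level performance simulator might report it.
inline double TotalEnergy(const std::vector<UnitModel>& units,
                          const std::vector<std::vector<unsigned>>& access_counts,
                          double vdd) {
    double energy = 0.0;
    for (const auto& cycle : access_counts) {
        assert(cycle.size() == units.size());
        for (std::size_t i = 0; i < units.size(); ++i) {
            energy += cycle[i] * AccessEnergy(units[i], vdd);
        }
    }
    return energy;
}

// Average power over the trace: total energy divided by elapsed time,
// where elapsed time = cycles / f.
inline double AveragePower(double total_energy, std::size_t cycles,
                           double freq_hz) {
    assert(cycles > 0);
    return total_energy * freq_hz / static_cast<double>(cycles);
}
```

For a unit accessed once in every cycle, AveragePower reduces to C·V²·A·f, so the sketch agrees with the equation on the slide. Units that are not accessed in a cycle contribute nothing here; modeling them as free is a simplification made for this example.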
Architectural Power Estimation
(source: Brooks_hpca2001)
 Good relative accuracy even when absolute accuracy may be off
 Estimates vary by 10-15% from low-level industry data
Conclusions
 High-level power analysis techniques are finally coming of age
 Efficiency
 Accuracy
 What we could not cover today: using high-level power analysis for optimization
■ Power reports provide information about a design’s “hotspots”
■ Presence of power analysis in a high-level design flow makes optimization and design space exploration easy
References
 Books/Tutorials
■ [Raghunathan-book] A. Raghunathan, N. K. Jha, and S. Dey, "High-level power analysis and optimization," Kluwer Academic Publishers, 1997
■ [Ravi-aspdac05] "Power Analysis in C-based Design" (part of tutorial entitled "C-based Design: Industrial Experience"), Asia-South Pacific Design Automation Conference (ASP-DAC), January 2005
 Conference Papers
■ [Llopis98] R. Llopis and K. Goossens, "The petrol approach to high-level power estimation," ISLPED, pp. 130-132, 1998
■ [Glaser91] K. D. Glaser, K. Kirsch, and K. Neusinger, "Estimating essential design characteristics to support project planning for ASIC design management," in Proc. Int. Conf. Computer-Aided Design, pp. 148-151, Nov. 1991
■ [Liu94] D. Liu and C. Svensson, "Power consumption estimation in CMOS VLSI chips," IEEE J. Solid-State Circuits, vol. 29, pp. 663-670, June 1994
■ [Nemani96] M. Nemani and F. N. Najm, "High-level power estimation and the area complexity of Boolean functions," in Proc. Int. Symp. Low Power Electronics & Design, pp. 329-334, Aug. 1996
■ [Potlapally01] N. R. Potlapally, A. Raghunathan, G. Lakshminarayana, M. S. Hsiao, and S. T. Chakradhar, "Accurate power macro-modeling techniques for complex RTL circuits," IEEE International Conference on VLSI Design, January 2001
■ [Ravi03] S. Ravi, A. Raghunathan, and S. T. Chakradhar, "Efficient RTL Power Estimation for Large Designs," IEEE International Conference on VLSI Design, January 2003
■ [Zhong04] L. Zhong, S. Ravi, A. Raghunathan, and N. K. Jha, "Power estimation for cycle-accurate functional descriptions of hardware," IEEE/ACM International Conference on Computer-Aided Design, November 2004
■ [Coburn05] J. Coburn, S. Ravi, and A. Raghunathan, "Power emulation: A new paradigm for power estimation," ACM/IEEE Design Automation Conference, June 2005
■ [Brooks00] D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: A Framework for Architectural-Level Power Analysis and Optimizations," 27th International Symposium on Computer Architecture (ISCA), June 2000