A Power Reduction Algorithm for Combinational CMOS
Circuits using Input Disabling
by
William John Rinderknecht
B.S.E.E. and B.S.C.E., Iowa State University (1992)
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Science in Electrical Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
December 1994
© Massachusetts Institute of Technology 1994. All rights reserved.
Author
Department of Electrical Engineering and Computer Science
January 20, 1995
Certified by
Srinivas Devadas
Associate Professor of Electrical Engineering and Computer Science
Thesis Supervisor
Accepted by
Frederic R. Morgenthaler
Chairman, Department Committee on Graduate Students
A Power Reduction Algorithm for Combinational CMOS
Circuits using Input Disabling
by
William John Rinderknecht
Submitted to the Department of Electrical Engineering and Computer Science
on January 20, 1995, in partial fulfillment of the
requirements for the degree of
Master of Science in Electrical Engineering
Abstract
With the advent of portable electronics, low power CMOS circuit design has become
increasingly important. To effectively design for low power, chip designers need a
wide variety of power estimation and power reduction tools. This thesis describes a
new CAD tool that reduces the power dissipation of combinational CMOS circuits.
This automated CAD tool attempts to reduce power by selectively "turning off"
inputs. It searches a gate-level description of a CMOS circuit for inputs or nodes
that are often unnecessary to determine the outputs. It will then add logic to dynamically "turn off" these nodes based on the values of other inputs. This reduces
switching activity within the circuit, thus reducing power consumption. The result is
a functionally equivalent CMOS circuit with reduced power, added area, and possibly
increased delay.
This thesis describes the technique and heuristic algorithms used by this CAD
tool to optimize circuits for low power. Experimental results are presented.
Thesis Supervisor: Srinivas Devadas
Title: Associate Professor of Electrical Engineering and Computer Science
Contents
1 Introduction
2 Terminology
2.1 Definitions
2.2 Binary Decision Diagrams
3 A Power Dissipation Model
3.1 Power Consumption of Nodes
3.2 Power Estimation of Dynamic Combinational Circuits
3.3 Power Estimation of Static Combinational Circuits
3.3.1 Switching Activity With a Zero Delay Model
3.3.2 Switching Activity With a General Delay Model
4 Previous Work
4.1 The Basic Technique
4.2 Deriving the Precomputation Logic
4.3 Input Selection
4.4 More on Multiple Output Circuits
4.5 Results
5 Architecture for Combinational Precomputation
5.1 The Hardware
5.2 The Disable Logic Requirements
6 Algorithms
6.1 Synthesis of the Disable Logic
6.2 Reduction of Disable Logic Costs
6.3 Selection of the Inputs
7 Subcircuit Selection
7.1 Motivation
7.2 Observations About Single-Output Subcircuits
7.3 The First Algorithm
7.4 The Second Algorithm
8 Results, Comparisons, and Future Work
8.1 Results
8.2 Comparison to Alidina's Technique
8.3 Future Work
9 Conclusion
List of Figures
2-1 Examples of Ordered Binary Decision Diagrams
4-1 The Original Circuit
4-2 Single Output Precomputation Architecture
4-3 Precomputation of a Multiple-Output Function
4-4 Procedure to Determine the Optimal Set of Inputs
4-5 Logic Duplication in a Multiple-Output Circuit
5-1 Original circuit
5-2 Circuit with input disabling circuit
5-3 Disabling inputs in combinational circuits
5-4 Disabling inputs in Domino circuits
5-5 Single output circuit with input disabling circuit
6-1 Procedure to reduce the cost of the disable logic
6-2 Procedure to Determine the Optimal Set of Inputs
7-1 Two candidates for power reduction. (a) A simple 8-bit comparator. (b) An 8-bit comparator preceded by two adders
7-2 Procedure to Find the Minimum Set of Single-Output Subcircuits
7-3 Dividing a circuit into a minimum number of single-output subcircuits. (a) The original circuit. (b) The single-output subcircuits
7-4 Example of branching through subcircuit combinations
7-5 Algorithm that evaluates different combinations of subcircuits
7-6 Possible groupings of adjacent MSSO subcircuits. (a) Subcircuits sharing an input. (b) A subcircuit output feeding a subcircuit input. (c) A subcircuit output and each of its fanouts.
List of Tables
4.1 Power Reductions for Datapath Circuits
4.2 Power Reductions for Random Logic Circuits
8.1 Power reductions of combinational circuits
Chapter 1
Introduction
Low power CMOS circuit design has become increasingly important. As die sizes and
clock frequencies increase at an astonishing rate, the average power consumption of
chips also increases. High-performance RISC microprocessors that are not designed
using low power techniques can require as much as 30 watts of power [7]. This
increased power consumption can lead to reduced performance, higher chip failure
rates, and higher costs due to power supplies and/or cooling mechanisms.
More importantly, the advent of portable electronics has fueled the need for low
power CMOS circuits. Because of the rate at which CMOS circuits have shrunk, the
weight and the volume of many portable electronic devices are now dominated by the
power supplies. The weight and volume of power supplies are decreasing slowly so
that advances in this area of research are unlikely to alleviate the problem. In order
to achieve smaller portable devices, the average power consumption of CMOS circuits
must be reduced.
To this date, research in power estimation of CMOS circuits has been fairly
complete. Static power dissipation of CMOS circuits has been shown to be negligible
in comparison to dynamic dissipation. Because of this, average power dissipation is
directly proportional to the average switching activity of a circuit [9]. Methods for
estimating power dissipation by first estimating switching activity are given in [12, 8].
In [16], Monteiro builds on these techniques to develop a computationally efficient
algorithm to estimate the switching activity and power consumption of sequential
circuits.
Designers use a great variety of techniques for power reduction. Most commonly,
designers use techniques at the circuit level to reduce power. For example, designers
often reduce the power supply voltage or use alternative circuit techniques, such as
Domino circuits [5, 17]. Designers also make architectural and functional changes to
reduce power consumption. For example, many chips disable local clocks when not
needed [14], and microprocessors often include wait commands that put the chips in
low-power states.
There have also been several techniques used to reduce power consumption at
the gate level. In [15], power is reduced using several methods including exploiting
don't cares, using disjoint covers, collapsing gates, and using new cost functions with
standard optimization routines. State encoding [13] and re-timing [11] algorithms
have also been developed for low power logic synthesis. In [2], Alidina presents a
technique that "turns off" certain inputs of a sequential circuit when their values are
not needed. This approach adds logic (to be evaluated in the previous clock cycle) to
determine when these inputs can be turned off. Thus this technique reduces power
consumption at the expense of chip area.
This thesis presents a new approach, based on Alidina's technique, that will selectively "turn off" nodes of combinational circuits to reduce switching activity. Effectively, this approach searches a combinational circuit for inputs or internal nodes
whose values are often unnecessary to determine the values of the outputs. It then
adds logic to "turn off" or gate these signals when they are not needed. This reduces
switching activity within the circuit without changing the values of the outputs. The
result is a functionally equivalent combinational circuit that requires less power.
Because this new approach is based closely on Alidina's technique, it uses modified
versions of the algorithms developed by Alidina in [2]. Specifically the algorithm that
determines which inputs will be disabled and the algorithm that synthesizes the added
logic are closely based on the algorithms produced by Alidina. But in the Alidina
algorithm, all added logic is evaluated on the previous clock cycle, alleviating any
possible complications due to timing. Because this algorithm works on combinational
circuits, timing is crucial to the success of the algorithm and cannot be ignored. The
Alidina paper also assumes that the cost of additional logic will be low compared to
the gross savings; therefore, no attempt is made to minimize this cost. In this new
algorithm, a procedure has been added that attempts to reduce the cost of added
logic without a large decrease in the gross savings.
The primary addition of this new technique is the selection of the subcircuit.
The Alidina algorithm works on sequential circuits and places its added logic in
the previous cycle. Although this simplifies the timing constraints, it limits the
choice of the disabled nodes to the circuit inputs only. In the algorithms described
in this thesis, all of the nodes in the circuit are considered. This is done by dividing
the combinational circuit into many combinational subcircuits. Then each of these
subcircuits and each combination of these subcircuits is considered for power reduction
in a computationally efficient manner. This results in an algorithm that is much
more computationally complex than the Alidina algorithm, but it is also capable of
considering many more possibilities, making it a more powerful algorithm.
This thesis describes the theory and the implementation of this new power reduction technique. Chapters 2 and 3 contain background information about Boolean
terminology and about power estimation techniques respectively. Chapter 4 describes
Alidina's power reduction technique. The new power reduction technique presented
in this paper is described in Chapters 5 through 8. Chapter 5 describes the hardware architecture used to implement this technique. Chapters 6 and 7 describe the
algorithms used to optimize the technique. Finally, Chapter 8 presents experimental
results.
Chapter 2
Terminology
This chapter introduces terminology needed to describe the power estimation and
reduction methods described in subsequent chapters. In Section 2.1, definitions pertaining to Boolean functions are given. Section 2.2 describes Binary Decision Diagrams - graphical representations of Boolean functions needed for efficient Boolean
manipulation.
2.1 Definitions
The definitions in this section are taken from [6] and [2].
A Boolean function $f$ of $n$ input variables, $x_1, \ldots, x_n$, and of $m$ output variables,
$f_1, \ldots, f_m$, is a mapping $f: B^n \rightarrow B^m$, where $B^n = \{0,1\}^n$ and $B^m = \{0,1\}^m$. For
each output $f_i$ of $f$, the ON-set can be defined to be the set of inputs $x$ such that
$f_i(x) = 1$. Similarly, the OFF-set is the set of inputs $x$ such that $f_i(x) = 0$. A
function in which $m = 1$ is a single-output function, and a function with $m > 1$ is a
multiple-output function.
The support of $f$, denoted as support($f$), is the set of all variables $x_i$ that occur
in $f$ as $x_i$ or $\overline{x}_i$. For example, if $f = x_1 \cdot \overline{x}_2 + x_3$, then support($f$) $= \{x_1, x_2, x_3\}$.
A literal is a Boolean variable or its complement. If $x_i$ is a Boolean variable, then
$x_i$ and $\overline{x}_i$ are literals.
A cube is a set of literals that represents the intersection of the given literals. If a,
b, and c are Boolean variables, then abc is a cube representing the intersection of a,
b, and c.
A cover is a set of cubes such that the union of the cubes exactly defines some
Boolean function, f. In other words, every member of the ON-set of f is included in at
least one cube of the cover, and no member of the OFF-set of f is included in any
of the cubes in the cover. A disjoint cover is a cover for a function such that each of
the cubes is mutually exclusive from all the others.
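The cofactor of a function $f$ with respect to a variable $x_i$, denoted as $f_{x_i}$, is defined
as:
$$f_{x_i} = f(x_1, \ldots, x_{i-1}, 1, x_{i+1}, \ldots, x_n) \tag{2.1}$$
Likewise, the cofactor of a function $f$ with respect to the complemented variable $\overline{x}_i$, denoted as $f_{\overline{x}_i}$, is
defined as:
$$f_{\overline{x}_i} = f(x_1, \ldots, x_{i-1}, 0, x_{i+1}, \ldots, x_n) \tag{2.2}$$
The Shannon expansion of a function around a variable $x_i$ is given by:
$$f = x_i \cdot f_{x_i} + \overline{x}_i \cdot f_{\overline{x}_i} \tag{2.3}$$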
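The cofactor and Shannon expansion definitions can be checked on a small example. The following Python sketch is illustrative only: it represents a Boolean function as an ordinary callable over 0/1 arguments, and the helper names `cofactor` and `shannon`, as well as the three-input example function, are assumptions made for this illustration rather than part of the original work.

```python
from itertools import product

def cofactor(f, i, value):
    """Return f with input position i fixed to `value` (0 or 1)."""
    def g(*x):
        y = list(x)
        y[i] = value          # the supplied value of x_i is ignored
        return f(*y)
    return g

def shannon(f, i):
    """Rebuild f from its two cofactors around input i (Shannon expansion)."""
    f1, f0 = cofactor(f, i, 1), cofactor(f, i, 0)
    return lambda *x: (x[i] & f1(*x)) | ((1 - x[i]) & f0(*x))

# Hypothetical example function: f = x1 * not(x2) + x3
f = lambda x1, x2, x3: (x1 & (1 - x2)) | x3

rebuilt = shannon(f, 1)       # expand around x2 (position 1)
same = all(f(*x) == rebuilt(*x) for x in product((0, 1), repeat=3))
print(same)
```

Exhaustive comparison over all input vectors is only practical for small functions; it is used here purely to confirm that the expansion reproduces the original function.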
2.2 Binary Decision Diagrams
A Binary Decision Diagram (BDD) [1, 10] is a rooted, directed graph with vertex
set, V, containing two types of vertices. A nonterminal vertex v has as attributes an
argument index, index(v) $\in \{1, \ldots, n\}$, and two children, low(v), high(v) $\in V$. A
terminal vertex, v, has as an attribute a value, value(v) $\in \{0, 1\}$.
The correspondence between BDDs and Boolean functions is defined as follows.
A BDD, G, having root vertex, v, denotes a function, $f_v$, defined recursively as:
1. If v is a terminal vertex:
(a) If value(v) = 1, then $f_v = 1$.
(b) If value(v) = 0, then $f_v = 0$.
2. If v is a nonterminal vertex with index(v) = i, then $f_v$ is the function:
$$f_v(x_1, \ldots, x_n) = \overline{x}_i \cdot f_{low(v)}(x_1, \ldots, x_n) + x_i \cdot f_{high(v)}(x_1, \ldots, x_n) \tag{2.4}$$
where $x_i$ is the decision variable for vertex v.
Ordered BDDs (OBDDs) have a restriction such that for any nonterminal vertex
v, if low(v) is also nonterminal, then index(v) < index(low(v)). Similarly, if high(v)
is also nonterminal, then index(v) < index(high(v)). The OBDDs for some simple
functions are shown in Figure 2-1. Terminal vertices are represented as squares, while
nonterminal vertices are represented as circles. The low child is pointed to by the
arrow marked 0, and the high child is pointed to by the arrow marked 1.
Reduced OBDDs (ROBDDs) as proposed in [3] are a minimal OBDD representation for a given function and are defined as follows:
Definition 2.1 An OBDD G is reduced if it contains no vertex v with low(v) =
high(v) nor does it contain distinct vertices v and w such that the subgraphs rooted
by v and w are isomorphic.
In [3], Bryant also proves that an ROBDD is a canonical representation of a
Boolean function. ROBDDs are used to represent logic functions in the following
power estimation and reduction techniques.
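As an illustration of the recursive correspondence between a BDD and the function it denotes, the following Python sketch builds the OBDD for f = a · b with the ordering a = 1, b = 2 and evaluates it by following low and high children. The `Vertex` class and the helper names are hypothetical and are not taken from any BDD package; the sketch performs no reduction or sharing of isomorphic subgraphs.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Vertex:
    index: Optional[int] = None       # decision variable index (nonterminal only)
    low: Optional["Vertex"] = None    # child followed when x_index = 0
    high: Optional["Vertex"] = None   # child followed when x_index = 1
    value: Optional[int] = None       # 0 or 1 (terminal only)

ZERO, ONE = Vertex(value=0), Vertex(value=1)

def evaluate(v, x):
    """Evaluate the function denoted by v; x maps variable index -> 0/1."""
    while v.value is None:
        v = v.high if x[v.index] else v.low
    return v.value

def is_ordered(v):
    """Check the OBDD restriction index(v) < index(child) on every edge."""
    if v.value is not None:
        return True
    for child in (v.low, v.high):
        if child.value is None and not v.index < child.index:
            return False
    return is_ordered(v.low) and is_ordered(v.high)

# f = a AND b with ordering a = 1, b = 2
node_b = Vertex(index=2, low=ZERO, high=ONE)
f_and = Vertex(index=1, low=ZERO, high=node_b)

print(evaluate(f_and, {1: 1, 2: 1}), is_ordered(f_and))
```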
[Figure 2-1: Examples of Ordered Binary Decision Diagrams. Panels: f = a · b (ordering: a = 1, b = 2); odd parity function; f = a + b (ordering: a = 1, b = 2).]
Chapter 3
A Power Dissipation Model
In order to evaluate power reduction algorithms, accurate and efficient power estimation algorithms are needed. The following sections briefly describe the model and
basic algorithms used to estimate power consumption for CMOS circuits.
3.1 Power Consumption of Nodes
The power dissipation at a node in a CMOS circuit is directly proportional to the
switching activity at the node. This requires the following three assumptions:
* The only capacitance in a CMOS logic-gate is at the output node of the gate.
* Either current is flowing through some path from VDD to the output capacitor,
or current is flowing from the output capacitor to ground.
* Any change in a gate output voltage is a change from VDD to ground or vice versa.
These are reasonably accurate assumptions for well-designed CMOS circuits. Further,
it can be shown that the energy dissipated by a CMOS logic gate each time its
output changes is roughly equal to the change in energy stored in the gate's output
capacitance [9]. If the gate is part of a synchronous digital system controlled by a
global clock, then the average power dissipated by the gate is given by:
$$P_{avg} = \frac{1}{2} \times C_{load} \times V_{dd}^2 \times f_{clock} \times E(transitions) \tag{3.1}$$
where $P_{avg}$ denotes the average power, $C_{load}$ is the load capacitance, $V_{dd}$ is the supply
voltage, $f_{clock}$ is the global clock frequency, and $E(transitions)$ is the expected number
of gate transitions per clock cycle [12]. All of the variables in (3.1) can be determined
from technology or circuit layout except $E(transitions)$, which depends on the logic
function being performed, the circuit style being used, and the statistical properties
of the primary inputs [8]. Because of this, the major difficulty in estimating the power
consumption of a circuit is determining the expected switching activity of the nodes
in the circuit.
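As a rough numerical illustration of the average power expression, the following Python sketch evaluates it for a single node, assuming the form P = (1/2) · C · V² · f · E(transitions). The function name and all parameter values (50 fF load, 3.3 V supply, 50 MHz clock, 0.5 expected transitions per cycle) are hypothetical and chosen only for the example.

```python
def average_power(c_load, v_dd, f_clock, expected_transitions):
    """Average dynamic power of one node in watts (SI units assumed)."""
    return 0.5 * c_load * v_dd ** 2 * f_clock * expected_transitions

# Hypothetical node: 50 fF load, 3.3 V supply, 50 MHz clock,
# 0.5 expected transitions per clock cycle.
p = average_power(c_load=50e-15, v_dd=3.3, f_clock=50e6, expected_transitions=0.5)
print(f"{p * 1e6:.3f} microwatts")
```

Only the last argument depends on the logic function and input statistics, which is why the remaining sections concentrate on estimating switching activity.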
3.2 Power Estimation of Dynamic Combinational Circuits
In dynamic circuits, such as Domino circuits, nodes are pre-charged to a 1 or a 0.
Then, during evaluation, the nodes switch only if the actual Boolean value is opposite
the pre-charge value. This means one logic level results in two transitions whereas the
other results in zero transitions independent of the node's value in previous cycles.
Therefore, in the case of dynamic circuits, Equation (3.1) can be simplified to:
$$P_{avg} = C_{load} \times V_{dd}^2 \times f_{clock} \times Prob(f = 1) \tag{3.2}$$
where f is the Boolean function of a particular node in terms of the circuit inputs,
Prob(f = 1) is the probability that the node will evaluate to a 1, and f is assumed
to be pre-charged to a 0. Once the Prob(f = 1) is known for each node, the power
can easily be summed over all nodes.
To determine Prob(f = 1), two assumptions must be made. First, the probability
of each input being a 1 is known and is denoted by $p_i^{one}$ for input $i$. Second, the input
probabilities, $p_1^{one}, \ldots, p_n^{one}$, are uncorrelated.
Given these assumptions, Prob(f = 1) can be determined easily using elementary
probability. If a particular cube is given by
$$c_i = i_1 \cdot i_2 \cdots i_j \cdot \overline{i_{j+1}} \cdot \overline{i_{j+2}} \cdots \overline{i_k} \tag{3.3}$$
then, because the $i_j$'s are uncorrelated, the probability of this cube being true is
$$P(c_i = 1) = p_{i_1}^{one} \cdot p_{i_2}^{one} \cdots p_{i_j}^{one} \cdot (1 - p_{i_{j+1}}^{one})(1 - p_{i_{j+2}}^{one}) \cdots (1 - p_{i_k}^{one}) \tag{3.4}$$
If a Boolean function is expressed as a disjoint cover (i.e., a mutually exclusive group
of cubes), then the probability that this function is true is simply the sum of the
probabilities that each cube is true. Therefore, prob(f = 1) can be found by expressing
f as a disjoint cover and then summing the probabilities of the cubes [8]. Because
Binary Decision Diagrams are closely related to disjoint covers, exact probabilities
of Boolean functions can be obtained in linear time using Binary Decision Diagrams
[4, 12]. Therefore, given circuit input probabilities, all node probabilities can be
determined and Equation 3.2 can be used to determine the average power dissipation
of dynamic circuits.
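The disjoint-cover idea can be sketched in a few lines of Python. The minterms of a function form a trivially disjoint cover, so summing their probabilities gives Prob(f = 1) exactly when the inputs are uncorrelated. This brute-force enumeration is exponential in the number of inputs and is only a stand-in for the BDD-based computation cited above; the function name and the example are assumptions for illustration.

```python
from itertools import product

def prob_one(f, p_one):
    """Prob(f = 1) for uncorrelated inputs; p_one[i] = Prob(input i = 1)."""
    total = 0.0
    for x in product((0, 1), repeat=len(p_one)):
        if f(*x):
            # Each minterm is a cube; its probability is a product of
            # p (uncomplemented literal) or 1 - p (complemented literal).
            weight = 1.0
            for bit, p in zip(x, p_one):
                weight *= p if bit else 1.0 - p
            total += weight
    return total

# Hypothetical node function f = a*b + c with every input equally likely 0 or 1
f = lambda a, b, c: (a & b) | c
p = prob_one(f, [0.5, 0.5, 0.5])
print(p)
```

For this example, five of the eight equally likely input vectors set f to 1, so the result is 5/8.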
3.3 Power Estimation of Static Combinational Circuits
For static circuits, Equation (3.1) is used directly by the power estimation techniques
such as [8, 12] to relate switching activity to power dissipation. The power estimation
algorithm need only determine the expected number of transitions at each node and
then sum the power for all nodes.
In combinational circuits, switching may occur whenever there is a change in the
circuit inputs. Because most combinational circuits exist within sequential systems,
inputs usually change together, and all switching is allowed to complete before the
inputs change again. So to determine the switching activity of a node, one needs to
know what state the nodes were in before the inputs change and what state the nodes
will be in after all switching is complete. This is equivalent to knowing a pair of
consecutive input vectors, $(I_t, I_{t+1})$.
Techniques for estimating switching activity of static circuits are reviewed in the
following sections.
3.3.1 Switching Activity With a Zero Delay Model
Assuming gates have zero delay, each node can make at most one transition for
each input vector. Assuming that consecutive input vectors are independent, the
probability that an input vector pair results in a $0 \rightarrow 1$ transition is $p_{node}^{one}(1 - p_{node}^{one})$,
where $p_{node}^{one}$ denotes the probability that the node evaluates to a 1. $p_{node}^{one}$ can be
determined as described in Section 3.2 assuming the circuit input probabilities are
known and independent. Similarly, the probability of a $1 \rightarrow 0$ transition is $(1 - p_{node}^{one})\,p_{node}^{one}$.
Therefore, the expected number of transitions per clock cycle is given by
$$E(transitions) = 2\,p_{node}^{one}(1 - p_{node}^{one}) \tag{3.5}$$
Equation 3.5 can be substituted directly into Equation 3.1 to determine the power
dissipation assuming gates have zero delay [8].
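The zero-delay expression is simple enough to tabulate directly. The short Python sketch below (the function name is hypothetical) evaluates E(transitions) = 2p(1 − p) for a few node probabilities and shows that switching activity peaks when a node is equally likely to be 0 or 1.

```python
def expected_transitions_zero_delay(p_one):
    """Expected transitions per cycle for a node with Prob(node = 1) = p_one,
    assuming zero gate delay and independent consecutive input vectors."""
    return 2.0 * p_one * (1.0 - p_one)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, expected_transitions_zero_delay(p))
```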
3.3.2 Switching Activity With a General Delay Model
Under the zero delay model, all nodes transition at most once per clock cycle. In
general this is not true. Because of timing delays, nodes can glitch, resulting in
multiple transitions in a clock cycle. To evaluate power consumption under a general
delay model, symbolic simulation is used.
In symbolic simulation, a Boolean function is constructed for every interval of
time during the clock cycle. For example, a node would be assigned the functions
$f_{i,0}, f_{i,1}, \ldots, f_{i,n}$ if there are $n$ time intervals in a clock cycle. The support of these
functions includes variables from both input vectors, $(I_t, I_{t+1})$. A $0 \rightarrow 1$ transition between intervals $j$ and $j + 1$ is represented by the function $\overline{f_{i,j}} \cdot f_{i,j+1}$. Therefore, the
probability of a $0 \rightarrow 1$ transition occurring between $j$ and $j + 1$ is the probability that
$\overline{f_{i,j}} \cdot f_{i,j+1}$ evaluates to a 1. Similarly, a $1 \rightarrow 0$ transition is represented by $f_{i,j} \cdot \overline{f_{i,j+1}}$.
Therefore the probability that any transition will occur between $j$ and $j + 1$ at this
node is equal to the probability that
$$\overline{f_{i,j}} \cdot f_{i,j+1} + f_{i,j} \cdot \overline{f_{i,j+1}} = f_{i,j} \oplus f_{i,j+1} \tag{3.6}$$
will evaluate to a 1, where $\oplus$ represents the exclusive-or operator. The average
switching activity can be determined by simply summing these probabilities over the
entire clock cycle:
$$E(transitions) = \sum_{j=0}^{n-1} Prob(f_{i,j} \oplus f_{i,j+1} = 1) \tag{3.7}$$
Once again, these probabilities can be evaluated as described in Section 3.2.
Equation 3.7 can be substituted directly into Equation 3.1 to determine the power
dissipation of static combinational circuits with general gate delays [8].
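The effect of glitching on switching activity can be illustrated with a small Python sketch. It assumes a hypothetical two-input XOR gate whose second input arrives one time interval after the first, so the node takes three successive values within a cycle. The per-interval functions, the helper name, and the uniform input probability are all assumptions made for the example; the expectation is computed by brute-force enumeration of input vector pairs rather than by symbolic manipulation.

```python
from itertools import product

def expected_transitions(waveform, n_inputs, p_one):
    """Sum over j of Prob(f_j XOR f_{j+1} = 1), by enumerating (I_t, I_t+1).

    waveform: list of functions f_j(prev, nxt) -> 0/1, one per time point.
    p_one: probability that any single input bit is 1 (bits independent).
    """
    total = 0.0
    for prev in product((0, 1), repeat=n_inputs):
        for nxt in product((0, 1), repeat=n_inputs):
            weight = 1.0
            for bit in prev + nxt:
                weight *= p_one if bit else 1.0 - p_one
            values = [f(prev, nxt) for f in waveform]
            toggles = sum(a ^ b for a, b in zip(values, values[1:]))
            total += weight * toggles
    return total

# Hypothetical XOR gate: input 0 changes first, input 1 one interval later.
waveform = [
    lambda prev, nxt: prev[0] ^ prev[1],   # old a, old b
    lambda prev, nxt: nxt[0] ^ prev[1],    # new a, old b (possible glitch)
    lambda prev, nxt: nxt[0] ^ nxt[1],     # new a, new b
]

general = expected_transitions(waveform, 2, 0.5)
zero_delay = expected_transitions([waveform[0], waveform[-1]], 2, 0.5)
print(general, zero_delay)
```

In this example the general delay estimate is twice the zero delay estimate, because each input change can toggle the output separately.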
Chapter 4
Previous Work
As was stated earlier, the power reduction technique presented in this thesis is based
closely on Alidina's work presented in [2]. Alidina presents a technique that "turns
off" certain inputs of a sequential circuit when the values are not needed. This
approach adds logic (to be evaluated in the previous clock cycle) to determine when
these inputs can be turned off.
This chapter describes the details of Alidina's technique.
4.1 The Basic Technique
Alidina's technique starts with a general sequential circuit as shown in Figure 4-1.
Block A is combinational logic whereas Blocks R1 and R2 are registers. Although
R1 and R2 are shown as separate registers, they could in fact be the same register.
Assume, for the time being, that Block A has only a single output, f.
[Figure 4-1: The Original Circuit]
[Figure 4-2: Single Output Precomputation Architecture]
To reduce the switching activity in Block A, circuitry is added to prevent some of
the inputs from switching when their values are not needed. This is accomplished by
using a register with a load enable signal as shown in Figure 4-2.
To ensure that the function of the circuit does not change, the precomputation logic
must be selected correctly. The precomputation logic is the logic that determines
when the inputs may be disabled. It is called precomputation logic because it is
evaluated in the previous clock cycle. This logic is shown as Block g1, Block g2,
and the NOR gate in Figure 4-2. To keep the operation of the circuit the same, the
inputs stored in R2 can be disabled only when the output of Block A is completely
determined by the remaining inputs (those in R1). To do this, the predictor functions,
$g_1$ and $g_2$, are defined as:
$$g_1 = 1 \Rightarrow f = 1 \tag{4.1}$$
$$g_2 = 1 \Rightarrow f = 0 \tag{4.2}$$
where support($g_1$) and support($g_2$) include only inputs that are not being disabled.
Under this definition, if $g_1$ is true, then the value of f is known to be a 1 regardless
of the values of the R2 inputs. Therefore, these inputs can be disabled without
affecting f. The same argument can be made for $g_2$, where we know that f will be a
0 independent of the disabled inputs. If neither $g_1$ nor $g_2$ is asserted, then nothing
can be determined about the value of f, and all the inputs must be allowed to
propagate. Therefore, the precomputation logic is defined to be $g = g_1 + g_2$, where
$g_1$ and $g_2$ must satisfy Equations 4.1 and 4.2.
[Figure 4-3: Precomputation of a Multiple-Output Function]
In general, circuits will have more than one output. Figure 4-3 shows the architecture
generalized for multiple outputs. In this case, the inputs can be disabled
only when all of the outputs are independent of the disabled inputs. The predictor
functions are defined for each output:
$$g_{1,i} = 1 \Rightarrow f_i = 1 \tag{4.3}$$
$$g_{2,i} = 1 \Rightarrow f_i = 0 \tag{4.4}$$
for all $i$ such that $1 \le i \le m$. Because every output must be independent of the
disabled inputs, the disable signal can be asserted only in the intersection of the
individual $g_1$ and $g_2$ signals:
$$g = \prod_{i=1}^{m} (g_{1,i} + g_{2,i}) \tag{4.5}$$
This condition is required to ensure that each output is implemented correctly.
Given this architecture, two details must be determined for a given circuit. First,
the subset of inputs that will be disabled must be determined. An algorithm to
perform this selection will be described in Section 4.3. Second, given a particular
subset of inputs to be disabled, the precomputation logic, g, must be determined.
This will be described in Section 4.2.
4.2 Deriving the Precomputation Logic
Given a particular sequential circuit, the precomputation logic, g, must be determined. This logic must be selected so that the function of the circuit does not change.
It should also be selected to maximize the probability of disabling the inputs.
First consider the simplified case in which there is one output, f, and in which
only one input will be disabled, $x_i$. The algorithm must determine the functions $g_1$
and $g_2$ such that neither $g_1$ nor $g_2$ is a function of $x_i$, such that Equations 4.1 and 4.2
are satisfied, and such that prob($g_1 + g_2 = 1$) is maximized.
This can be accomplished using the universal quantification of f. In a sense, the
cofactor of f with respect to $x_i$, $f_{x_i}$, defines the set of input vectors that make f true
given that $x_i$ is true. (The cofactor function was defined in Section 2.1.) Similarly,
$f_{\overline{x}_i}$ defines the set of input vectors that make f true given that $x_i$ is false. Therefore
the intersection of the two defines the set of input vectors that force f to be true
regardless of the value of $x_i$. This is exactly the condition needed to fulfill Equation
4.1, and is defined as the universal quantification of f with respect to the variable $x_i$:
$$U_{x_i} f = f_{x_i} \cdot f_{\overline{x}_i}$$
Because $U_{x_i} f$ includes all input vectors that fulfill Equation 4.1, it also maximizes
Prob($g_1 = 1$). Similarly, if $g_2$ is defined as:
$$g_2 = U_{x_i} \overline{f} = \overline{f}_{x_i} \cdot \overline{f}_{\overline{x}_i}$$
then $g_2$ satisfies Equation 4.2 with maximum Prob($g_2 = 1$).
Now consider the case in which many inputs will be disabled. Assume that the
set of inputs to be disabled is given by $D = \{x_{p+1}, \ldots, x_n\}$, the set of inputs that
will not be disabled is given by $S = \{x_1, \ldots, x_p\}$, and the total set of inputs is given
by $X = \{x_1, \ldots, x_n\}$ where $1 \le p < n$. To find the set of input vectors that force f
to be true regardless of the value of each variable in D, the universal quantification
of f must be taken with respect to each variable in succession. This is the universal
quantification of f with respect to D, defined as:
$$U_D f = U_{x_{p+1}} U_{x_{p+2}} \cdots U_{x_n} f$$
This was proven formally by Alidina in [2]:
Theorem 4.1 $g_1 = U_D f = U_{x_{p+1}} \cdots U_{x_n} f$ satisfies Equation 4.1. Further, no
function $h(x_1, \ldots, x_p)$ exists such that prob($h = 1$) > prob($g_1 = 1$) and such that
$h = 1 \Rightarrow f = 1$.
Similarly, the function $g_2$ that satisfies Equation 4.2 and maximizes Prob($g_2 = 1$) is:
$$g_2 = U_D \overline{f} = U_{x_{p+1}} U_{x_{p+2}} \cdots U_{x_n} \overline{f}$$
Therefore, the $g_1$ and $g_2$ that satisfy the above three requirements can be determined
by calculating $g_1 = U_D f$ and $g_2 = U_D \overline{f}$.
Finally, consider the case in which the circuit has multiple outputs. As was described in the previous section, every output must be independent of the disabled
inputs in order to assert the precomputation signal. Therefore, the inputs can be
turned off in the intersection of the individual precomputation signals:
$$g = \prod_{i=1}^{m} (g_{1,i} + g_{2,i}) = (U_D f_1 + U_D \overline{f_1})(U_D f_2 + U_D \overline{f_2}) \cdots (U_D f_m + U_D \overline{f_m}) \tag{4.6}$$
Using this logic for g ensures that the function of the circuit will not be changed,
satisfying Equations 4.3 and 4.4, and also maximizes prob(g = 1).
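The derivation of the predictor functions can be illustrated with a small Python sketch. It applies universal quantification to a hypothetical 2-bit comparator, disabling the two low-order bits, and represents Boolean functions as plain callables. The helper names (`forall`, `forall_set`, `precompute`) and the example circuit are assumptions for illustration; they are not the implementation described in [2], which operates on logic networks rather than enumerated truth tables.

```python
from itertools import product

def forall(f, i):
    """Universal quantification of f over input i: f with x_i=1 AND f with x_i=0."""
    def g(*x):
        y1, y0 = list(x), list(x)
        y1[i], y0[i] = 1, 0
        return f(*y1) & f(*y0)
    return g

def forall_set(f, disabled):
    """Quantify f over each disabled input in succession."""
    for i in disabled:
        f = forall(f, i)
    return f

def precompute(f, disabled):
    """Return (g1, g2): g1 = 1 implies f = 1, and g2 = 1 implies f = 0."""
    g1 = forall_set(f, disabled)
    g2 = forall_set(lambda *x: 1 - f(*x), disabled)
    return g1, g2

# Hypothetical 2-bit comparator f = (A > B) with inputs (a1, a0, b1, b0).
f = lambda a1, a0, b1, b0: int(2 * a1 + a0 > 2 * b1 + b0)

g1, g2 = precompute(f, disabled=[1, 3])      # disable a0 and b0
vectors = list(product((0, 1), repeat=4))
disable_count = sum(g1(*x) | g2(*x) for x in vectors)
print(disable_count, "of", len(vectors), "input vectors allow disabling")
```

For this comparator the low-order bits can be disabled whenever the high-order bits differ, which is half of all input vectors under uniform inputs.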
4.3 Input Selection
In addition to determining the precomputation logic, Alidina also presents an algorithm to select D, the subset of inputs that will be disabled. To maximize power
savings, it is desirable to select D such that the probability of disabling the inputs,
namely prob(g = 1), is maximized. Alidina presents an algorithm that finds the set
that maximizes this probability given a particular number of inputs, k.
This algorithm basically branches through a binary tree, where each node represents one input. The left branch from this node leads to all combinations of inputs
that include the given input and the right branch leads to all combinations that do not
include the given input. This branching continues until k inputs have been selected,
at which point the prob($g_1 + g_2 = 1$) can be determined. If allowed to follow all possible paths, this scheme would find the D of size k that maximizes prob($g_1 + g_2 = 1$)
simply because it would cover all possible combinations. Yet such an algorithm is too
computationally complex.
To make this algorithm feasible, Alidina's algorithm skips many possible branches
along the binary tree. Skipping branches is possible because of the following observation:
$$prob(U_{x_i} f) = prob(f_{x_i} \cdot f_{\overline{x}_i}) \le prob(f) \quad \forall x_i, f \tag{4.7}$$
This shows that prob(g = 1) decreases monotonically as new inputs are added to D.
Therefore, if the prob(g = 1) becomes too small during the branching, all succeeding
branches of the binary tree may be skipped.
The details of this algorithm, taken directly from [2], are shown in Figure 4-4.
In the pseudo-code, D represents the set of inputs that are currently selected for
disabling. Q represents the "active" inputs that may still be selected to be placed in
D. The parameter l is simply the number of inputs that will be placed in D.
Each call of SELECT_RECUR is analogous to a node in the binary branching
pattern. The two recursive calls are analogous to the left and right children. The
pruning condition suggested by Equation 4.7 is implemented as:
if (pr < BEST_PROB) return;
This algorithm efficiently finds the set of k inputs that maximize prob(g = 1). If
the algorithm is run several times with different values of k, the optimal solution will
be found.
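The branch-and-bound search can be sketched in Python as follows. This is an illustrative rendering of the procedure in Figure 4-4, not Alidina's implementation: Boolean functions are plain callables, probabilities are computed by brute-force enumeration under an assumed uniform input distribution, and all function names as well as the 2-bit comparator example are hypothetical.

```python
from itertools import product

def forall(f, i):
    """Universal quantification of f over input position i."""
    def g(*x):
        y1, y0 = list(x), list(x)
        y1[i], y0[i] = 1, 0
        return f(*y1) & f(*y0)
    return g

def prob_one(f, n):
    """Prob(f = 1), assuming uniform and independent inputs."""
    return sum(f(*x) for x in product((0, 1), repeat=n)) / 2 ** n

def select_inputs(f, n, k):
    """Keep k of the n inputs enabled; return (kept inputs, probability)."""
    best = {"prob": 0.0, "kept": None}
    target = n - k                       # number of inputs placed in D

    def recur(g1, g2, disabled, active):
        if len(disabled) + len(active) < target:
            return                       # D can no longer reach the target size
        pr = prob_one(g1, n) + prob_one(g2, n)
        if pr < best["prob"]:
            return                       # prune: pr cannot grow as D grows
        if len(disabled) == target:
            best["prob"] = pr
            best["kept"] = sorted(set(range(n)) - set(disabled))
            return
        i, rest = active[0], active[1:]
        recur(forall(g1, i), forall(g2, i), disabled + [i], rest)  # x_i in D
        recur(g1, g2, disabled, rest)                              # x_i not in D

    recur(f, lambda *x: 1 - f(*x), [], list(range(n)))
    return best["kept"], best["prob"]

# Hypothetical 2-bit comparator f = (A > B) with inputs (a1, a0, b1, b0).
f = lambda a1, a0, b1, b0: int(2 * a1 + a0 > 2 * b1 + b0)
kept, prob = select_inputs(f, n=4, k=2)
print(kept, prob)
```

For the comparator, the search keeps the two high-order bits, since they alone decide the output for half of all input vectors.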
4.4 More on Multiple Output Circuits
The algorithms presented so far assume that all the outputs of Block A will be used in
Equation 4.6 to determine the precomputation logic. This seems necessary to ensure
the function of each output remains the same. Yet this creates a severe limitation.
Each output effectively places a restriction on when it is allowable to disable the
30
SELECT-INPUTS(
{
f, k ):
BESTPROB= 0;
SELECTEDSET =
;
SELECTRECUR( f, f, 0, X, Ixl-k );
return( SELECTEDSET );
}
g1 , 92 , D, Q, ):
SELECTRECUR(
{
if( DI + IQI <
)
return;
pr = prob(gl= 1)+ prob(g2 = 1);
if( pr < BESTPROB)
return;
else if( ID] == l) {
BESTPROB = pr;
SELECTED-SET
= X- D;
return;
}
choose xi E Q such that i is minimum;
SELECT-RECUR( Uxigi, Uxi9g
2 , D U xi, Q- xi, I );
SELECT_RECUR( g9, 92, D, Q- xi, I );
return;
}
Figure 4-4: Procedure to Determine the Optimal Set of Inputs
31
I
;
inputs. As the number of outputs increase, the probability that the inputs can be
disabled decreases. For even a reasonable number of outputs, the probability can
become quite small, and thus the power reduction can become negligible.
To overcome this, Alidina suggests using logic duplication. The idea is to synthesize the precomputation logic, g, using only a subset of the outputs in Equation 4.6.
This results in two subsets of outputs - outputs whose values are unaffected if the
inputs are disabled and outputs whose function will change if the inputs are disabled.
To prevent the function of this second group from changing, any logic that is shared
by both subsets of outputs is duplicated. In this way, the inputs may be disabled
without affecting the outputs that did not contribute to the precomputation logic
synthesis.
For example, consider the circuit in Figure 4-5(a). In this circuit, it may be
desirable to disable the input x3, but the combination of the outputs f1 and f2
reduces prob(g = 1) too much. Therefore, the shared logic (shown as the shaded
area) is duplicated as in Figure 4-5(b). Now x3 may be disabled without affecting
f2.
Obviously, this is not a perfect solution. This technique creates a considerable
amount of overhead, including the duplicated logic and the extra register. But it does
enable the algorithm to work on circuits with a large number of outputs.
4.5 Results
Alidina implemented his techniques in C within the SIS logic optimization system. Using this implementation, he demonstrated very good results, attaining power reductions of as much
as 60%. Some of his results are shown in Tables 4.1 and 4.2, which
were taken directly from [2].
[Figure 4-5: Logic Duplication in a Multiple-Output Circuit. (a) Original Network. (b) Final Network.]
                 Original             Precompute Logic      Optimized
CKT           Lits  Levs   Pwr      Bits   Lits  Levs      Pwr   % Red
comp16         286     7  1281         2      4     2      965      25
                                       4      8     2      683      47
                                       6     12     2      550      57
                                       8     16     2      518      60
                                      10     20     2      538      58
priority16     126    16   455         1      1     1      381      16
                                       2      3     2      270      41
                                       3      6     2      209      54
                                       4     10     2      190      58
                                       5     15     2      187      59
                                       6     21     2      196      57
add_comp16    3026     8  6941       4/0      8     2     6346       9
                                     4/8     24     4     5711      18
                                     8/0     51     4     4781      31
                                     8/8     67     6     3933      43
max16          350     9  1744         8     16     2     1281      27
csa16          975    10  2945         2      4     2     2958       0
                                       4     11     4     2775       6
                                       6     18     4     2676       9
                                       8     25     5     2644      10
add_max16     3090     9  7370       4/0      8     2     7174       3
                                     4/8     24     4     6751       8
                                     8/0     51     4     6624      10
                                     8/8     67     6     6116      17
Table 4.1: Power Reductions for Datapath Circuits
               Original             Precompute Logic      Optimized
CKT         Lits  Levs   Pwr      Bits   Lits  Levs      Pwr   % Red
9symml       267     8  1452         7     41     8     1429       2
cm150a        61     5   744         1      1     1      552      26
cm152a        28     4   370         9      2     1      261      29
i2           230     4  5606        22     30     3     2324      59
majority      12     4   173         3      4     2      124      28
mux           54     6   715         1      1     1      533      25
parity        60     5   187         0      0     0      187       0
t481        1028    11  1562         8     16     3     1393      11
Table 4.2: Power Reductions for Random Logic Circuits
Chapter 5
Architecture for Combinational Precomputation
Chapter 4 described Alidina's technique for power reduction. In this technique, inputs
of a sequential circuit are selectively turned off to reduce switching activity.
The following chapters present a new technique for power reduction. This technique, based on Alidina's technique, turns off inputs to combinational circuits to
reduce switching activity. This chapter gives an overview of this new technique by
describing the hardware and requirements necessary to implement the technique.
5.1 The Hardware
This technique starts with a gate-level description of a combinational circuit. (This
circuit may be a complete circuit or it may be a subcircuit that was extracted from
a larger circuit. This will be explained in more detail in Chapter 7.) Assume that
this circuit has n inputs and m outputs as is shown in Figure 5-1. In an effort to
reduce switching activity, the algorithm will "turn off" a subset of the n inputs using
the circuit shown in Figure 5-2. The figure shows p inputs being turned off using
block B, where 1 ≤ p ≤ n. Assume that the set of inputs to be disabled is signified
by S = {x_1, x_2, ..., x_p}, and the set of inputs that will not be disabled is signified
by D = {x_{p+1}, x_{p+2}, ..., x_n}.
[Figure 5-1: Original circuit.]
[Figure 5-2: Circuit with input disabling circuit.]
The term "turn off" means different things according to the circuit style
that is being used. If the circuit is built using static logic gates, then "turn off" means
prevent changes at the inputs from propagating through block B to block A. In this
case block B may be implemented using one of the latches shown in Figure 5-3. If
the circuit is built using dynamic logic, then "turn off" means prevent the outputs of
block B from changing from the pre-charged value. Assuming the nodes pre-charge
to 0's, this can be implemented using 2-input AND gates as shown in Figure 5-4.
[Figure 5-3: Disabling inputs in combinational circuits]
[Figure 5-4: Disabling inputs in Domino circuits]
Block g, the disable logic, determines when it is appropriate to turn off the selected
inputs. The logic is selected so that the inputs are disabled as frequently as possible
without affecting the values at the outputs of the circuit. The next section presents
restrictions for Block g that ensure that the function of the circuit does not change.
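As a rough behavioral illustration of the two styles of block B (not a circuit-level model), the sketch below holds a static input with a latch and gates a domino input with an AND. The function names, the initial latch value, and the convention that disable = 1 means "turn off" are assumptions made for this example.

```python
def latch_block(inputs, disable):
    """Static style: a latch passes the input when enabled and holds its old value when disabled."""
    out, held = [], 0          # assume the latch initially holds 0
    for x, d in zip(inputs, disable):
        if not d:
            held = x
        out.append(held)
    return out

def domino_block(inputs, disable):
    """Domino style: AND each input with the enable so a disabled node stays at its pre-charged 0."""
    return [x & (0 if d else 1) for x, d in zip(inputs, disable)]

def toggles(seq):
    """Number of value changes between consecutive cycles."""
    return sum(a != b for a, b in zip(seq, seq[1:]))

x       = [0, 1, 0, 1, 1, 0]
disable = [0, 0, 1, 1, 0, 0]
latched = latch_block(x, disable)
gated   = domino_block(x, disable)
```

In this trace the raw input changes value four times, while the latched node seen by block A changes only twice.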
5.2 The Disable Logic Requirements
Block g of Figure 5-2 determines when it is appropriate to turn off the selected
inputs. The selected inputs may be "turned off" if the static value of all the outputs,
f1 through fm, can be completely determined by the inputs that are not turned off,
x_{p+1} ... x_n.
First consider the single-output case as shown in Figure 5-5. In the single output
case, this requirement is fulfilled if g1 and g2 satisfy:
$$g_1 = 1 \Rightarrow f = 1 \qquad (5.1)$$
$$g_2 = 1 \Rightarrow f = 0 \qquad (5.2)$$
If either g1 or g2 is true, the exact value of f can be determined from x_{p+1} ... x_n so
that the remaining inputs may be turned off. If both g1 and g2 are false, then all the
inputs are needed to determine the outputs, so the circuit must be allowed to work
normally. Therefore, the inputs can be disabled when g = g1 + g2 is true as is shown
in Figure 5-5.
[Figure 5-5: Single output circuit with input disabling circuit.]
In the case of multiple outputs, Equations 5.1 and 5.2 can be generalized as:
$$g_{1,i} = 1 \Rightarrow f_i = 1 \qquad (5.3)$$
$$g_{2,i} = 1 \Rightarrow f_i = 0 \qquad (5.4)$$
for all i such that 1 ≤ i ≤ m. For each output f_i, the inputs may be turned off if
g_{1,i} + g_{2,i} is true. To have all of the outputs evaluate properly, g_{1,i} + g_{2,i} must be true
for all i, 1 ≤ i ≤ m. In other words the inputs may be disabled if
$$g = \prod_{i=1}^{m} (g_{1,i} + g_{2,i}) = 1 \qquad (5.5)$$
where the g_{1,i} and the g_{2,i} are defined in Equations 5.3 and 5.4. Therefore, the logic
in block g of Figure 5-2 must satisfy Equation 5.5.
Given the architecture in Figure 5-2, there are two details that must be determined. First, a subset of the inputs must be selected to be turned off. Second, the
exact logic for block g must be determined such that Equation 5.5 is satisfied. In
particular, these details must be selected so that the power of block A is minimized
while keeping the overhead of the added logic to a minimum. Chapter 6 presents
algorithms for determining both of these given a particular circuit.
Chapter 6
Algorithms
As outlined in the previous section, algorithms are needed to accomplish two tasks
given the architecture shown in Figure 5-2. First, an algorithm is needed to select the
subset of inputs that will be disabled. Second, given this set of inputs, an algorithm
must produce the logic for block g. For each of these steps, the goal is to maximize
the savings function:
net savings = savings(A) - cost(B) - cost(g)    (6.1)
These algorithms are based on the algorithms developed by Alidina in [2], which
were described in Chapter 4 of this thesis. The thorough descriptions of these algorithms will not be repeated here. Instead, the algorithms will be explained briefly,
and any modifications to Alidina's algorithms will be noted.
6.1 Synthesis of the Disable Logic
This algorithm determines the logic needed for block g assuming that S (the subset
of inputs that will be disabled) has already been determined. This logic must be
completely defined by x_{p+1} ... x_n, it must maximize prob(g = 1), and it must satisfy
Equation 5.5 so that the outputs are not affected.
This problem is identical to the problem encountered in Alidina's technique. As
was shown in [2] and repeated in Section 4.2, the given constraints are satisfied using:
$$g = \prod_{i=1}^{m} \left( U_S f_i + U_S \overline{f_i} \right) \qquad (6.2)$$
This results in the maximum power savings in block A given a particular set of
disabled inputs, S.
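Equation 6.2 can be evaluated directly on small examples. The sketch below is a truth-table illustration only (the thesis implementation works inside SIS, and a practical tool would not enumerate all input vectors); the helper names are hypothetical and inputs are assumed to be equally likely.

```python
from itertools import product

def truth_table(fn, n):
    """Tabulate fn over all 2^n input vectors (input 0 is the most significant bit)."""
    return [bool(fn(*bits)) for bits in product((0, 1), repeat=n)]

def forall(tt, i, n):
    """U_xi f = f[xi=0] AND f[xi=1]."""
    mask = 1 << (n - 1 - i)
    return [tt[idx & ~mask] and tt[idx | mask] for idx in range(len(tt))]

def forall_set(tt, S, n):
    """U_S f: quantify every input in the disabled set S."""
    for i in S:
        tt = forall(tt, i, n)
    return tt

def disable_logic(outputs, n, S):
    """g = product over all outputs of (U_S f_i + U_S not f_i), as a truth table."""
    g = [True] * (2 ** n)
    for fn in outputs:
        f = truth_table(fn, n)
        g1 = forall_set(f, S, n)
        g2 = forall_set([not v for v in f], S, n)
        g = [a and (b or c) for a, b, c in zip(g, g1, g2)]
    return g

# Example: 2-bit comparator with two outputs, inputs ordered (a1, a0, b1, b0).
gt = lambda a1, a0, b1, b0: (2 * a1 + a0) > (2 * b1 + b0)
eq = lambda a1, a0, b1, b0: (a1, a0) == (b1, b0)
g = disable_logic([gt, eq], 4, {1, 3})      # disable the low-order bits a0 and b0
p = sum(g) / len(g)
```

Here both outputs are decided by the high-order bits whenever they differ, so g reduces to a1 XOR b1 and the low-order inputs can be turned off for half of the input vectors.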
6.2 Reduction of Disable Logic Costs
Although the algorithm described in Section 6.1 results in the maximum power savings
of the original subcircuit (savings(A)), it says nothing about the resulting cost of the
disable logic (cost(g)). The original goal was to maximize the net savings given by
Equation 6.1. To do so, the algorithm must consider reducing prob(g = 1) in order to
reduce the cost of the disable logic. In particular, this algorithm will look for some
function, g_reduced, such that g_reduced ⇒ g and such that Equation 6.1 is maximized.
This becomes a much simpler task by noting that the savings of block A is approximately proportional to prob(g_reduced = 1) and that the cost of block B is roughly
constant for a fixed S. Therefore, any component of the implementation of g
that requires a significant amount of power but does not contribute significantly to
prob(g = 1) should be eliminated.
This can be accomplished using the following algorithm. First, find the cube of
g that contributes the least probability of making g true. If this cube is removed,
the gross savings is reduced by (1 - prob(g_reduced = 1) / prob(g = 1)) × original savings. The cost is
reduced by cost(g) - cost(g_reduced). If the cost is reduced more than the savings, then
remove this cube from g and continue with the next cube. If the cost is not reduced
more than the savings, then leave this cube in g and discontinue.
The details of this algorithm are shown in Figure 6-1. In this pseudo-code, savings
refers to the savings(A), and cost refers to the cost(g).
REDUCE_G( g, savings ):
{
    orig_cost = ESTIMATE_COST( g );
    done = false;
    while ( not done ) {
        select a cube, cube, from g such that prob(g - cube = 1) is maximized;
        g_reduced = g - cube;
        cost = ESTIMATE_COST( g_reduced );
        if ( orig_cost - cost > (1 - prob(g_reduced = 1) / prob(g = 1)) × savings ) {
            savings = (prob(g_reduced = 1) / prob(g = 1)) × savings;
            g = g_reduced;
            orig_cost = cost;
        }
        else
            done = true;
    }
    return( g );
}
Figure 6-1: Procedure to reduce the cost of the disable logic
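The loop of Figure 6-1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: g is a list of cubes (each a dict mapping an input index to its required value), inputs are equally likely, and the literal-count cost model in estimate_cost is a hypothetical stand-in for ESTIMATE_COST.

```python
from itertools import product

def prob_cover(cubes, n):
    """prob(g = 1) for a sum of cubes, assuming equally likely input vectors."""
    hits = sum(
        any(all(bits[v] == val for v, val in cube.items()) for cube in cubes)
        for bits in product((0, 1), repeat=n)
    )
    return hits / 2 ** n

def estimate_cost(cubes, cost_per_literal=1.0):
    """Hypothetical cost model: power proportional to the literal count of g."""
    return cost_per_literal * sum(len(cube) for cube in cubes)

def reduce_g(cubes, savings, n):
    g = list(cubes)
    orig_cost = estimate_cost(g)
    while g:
        p_g = prob_cover(g, n)
        if p_g == 0:
            break
        # Pick the cube whose removal leaves prob(g_reduced = 1) as large as possible.
        j = max(range(len(g)), key=lambda i: prob_cover(g[:i] + g[i + 1:], n))
        g_reduced = g[:j] + g[j + 1:]
        ratio = prob_cover(g_reduced, n) / p_g
        cost = estimate_cost(g_reduced)
        if orig_cost - cost > (1 - ratio) * savings:
            savings *= ratio               # savings shrink with prob(g = 1)
            g, orig_cost = g_reduced, cost
        else:
            break
    return g, savings

# x0 alone covers half the input space; the 4-literal cube adds only 1/16.
cubes = [{0: 1}, {0: 0, 1: 1, 2: 1, 3: 1}]
g, remaining = reduce_g(cubes, savings=10.0, n=4)
```

With these numbers the four-literal cube costs more than the savings it contributes and is dropped, while the single-literal cube is kept.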
6.3 Selection of the Inputs
Given a particular combinational circuit, the set of inputs that will be turned off,
S, must be selected. In particular, these inputs should be selected so that the cost
function, Equation 6.1, is maximized.
In [2], Alidina develops an algorithm that performs a very similar task. This
algorithm is described in detail in Section 4.3. There are several shortcomings of this
procedure.
First, Alidina's algorithm is not fully automated. To use the algorithm, it must
be run several times using different values of k. Although this is not a problem if
only one circuit is being analyzed, it is a serious problem if many circuits are being
analyzed within a loop. (This is, in fact, the case in Chapter 7.)
Second, Alidina's algorithm has a very limited cost function. Alidina simply maximizes prob(g = 1). Although this is a very important part of maximizing the power
savings, it is not a complete measurement of the power savings. Other important factors include the number of inputs that are disabled and the cost of the added logic.
The true cost function of this technique is given in Equation 6.1. Although it is not
possible to evaluate this cost function perfectly through many iterations, it does lead
to a more accurate model.
To overcome the shortcomings of Alidina's algorithm, the new version of the algorithm uses a generalized cost function.
As is shown in Figure 6-2, this generalization is implemented using the functions ESTIMATE_SAVINGS
and ESTIMATE_COST. ESTIMATE_SAVINGS is a function that determines or estimates
the gross savings that are achieved in block A, denoted savings(A). This savings is
assumed to be directly proportional to prob(g = 1). This function may be implemented as a simple heuristic such as |S| × prob(g = 1), or it may be implemented
as a function that does a complex analysis of the savings achieved within block A.
For combinational circuits, it must consider timing relationships to be accurate. ESTIMATE_COST
is a function that determines or estimates the costs due to blocks
B and g. Once again this function can be implemented using a simple or a complex
heuristic.

SELECT_INPUTS( f, k ):
{
    BEST_SAVINGS = 0;
    SELECTED_SET = ∅;
    X = { xi | xi is an input of f };
    SELECT_RECUR( f, ¬f, ∅, X );
    return( SELECTED_SET );
}

SELECT_RECUR( g1, g2, S, Q ):
{
    g = g1 + g2;
    pr = prob(g = 1);
    savings = ESTIMATE_SAVINGS( S, g, pr );
    cost = ESTIMATE_COST( g );
    max_savings = savings + REMAINING_SAVINGS( Q, pr );
    g_reduced = REDUCE_G( g, savings, cost );
    if( max_savings < BEST_SAVINGS )
        return;
    else if ( savings - cost > BEST_SAVINGS ) {
        BEST_SAVINGS = savings - cost;
        SELECTED_SET = S;
    }
    choose xi ∈ Q such that i is minimum;
    SELECT_RECUR( U_xi g1, U_xi g2, S ∪ xi, Q - xi );
    SELECT_RECUR( g1, g2, S, Q - xi );
    return;
}
Figure 6-2: Procedure to Determine the Optimal Set of Inputs
Using these generalized functions gives the algorithm more power and more flexibility. Because the functions are a better representation of the actual cost function, it
is simple to make the algorithm fully automated. The algorithm does not need to be
run for many values of k because the algorithm knows which solution is the best. The
generalized functions also make the algorithm more efficient. When the old algorithm
is run for many values of k, it is branching over the same binary tree several times.
Because the new algorithm understands the real costs better, it only needs to branch
over the tree once. Finally, because the cost function is more accurate, the results
are more accurate. Therefore, using these generalized functions achieves both better
results and less computation time.
Even so, there is one shortcoming. Because the cost functions are more complex,
it is more difficult to prune the branching. Equation 4.7 still holds, but the
information actually needed is how Equation 6.1 is affected as more inputs are added
to S. To overcome this, the REMAINING_SAVINGS function is used to compute max_savings,
a bound on the savings that can be achieved if all of the remaining inputs are disabled. This
can be determined because prob(g = 1) is bounded according to Equation 4.7.
Having a bound on the maximum savings means that branching can be discontinued
if the maximum savings drops below the best savings achieved so far.
The resulting algorithm is shown in Figure 6-2. Except for the improvements
described above, it is basically the same algorithm presented by Alidina in [2]. This
algorithm results in the best possible set of inputs to turn off, assuming that ESTIMATE_SAVINGS, ESTIMATE_COST, and REMAINING_SAVINGS are
reasonably accurate.
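The generalized search of Figure 6-2 can be sketched in Python as follows. Everything beyond the structure of the figure is an assumption made for illustration: functions are stored as truth tables with equally likely inputs, savings are modeled as |S| × prob(g = 1), the cost model (a fixed latch cost per disabled input plus a cost per input of g) and its default values are hypothetical, and the REDUCE_G step is omitted. Under this savings model, REMAINING_SAVINGS can be bounded by |Q| × prob(g = 1) because prob(g = 1) cannot grow as inputs are added to S.

```python
from itertools import product

def truth_table(fn, n):
    return [bool(fn(*bits)) for bits in product((0, 1), repeat=n)]

def forall(tt, i, n):
    """U_xi f = f[xi=0] AND f[xi=1]."""
    mask = 1 << (n - 1 - i)
    return [tt[idx & ~mask] and tt[idx | mask] for idx in range(len(tt))]

def support_size(tt, n):
    """Number of inputs the tabulated function actually depends on."""
    return sum(
        any(tt[idx] != tt[idx ^ (1 << (n - 1 - i))] for idx in range(len(tt)))
        for i in range(n)
    )

def select_inputs(fn, n, latch_cost=0.05, literal_cost=0.05):
    f = truth_table(fn, n)
    best = {"net": 0.0, "S": frozenset()}

    def recur(g1, g2, S, Q):
        g = [a or b for a, b in zip(g1, g2)]
        pr = sum(g) / len(g)
        savings = len(S) * pr                          # ESTIMATE_SAVINGS (simple heuristic)
        cost = latch_cost * len(S) + literal_cost * support_size(g, n)  # assumed cost model
        max_savings = savings + len(Q) * pr            # savings + REMAINING_SAVINGS bound
        if max_savings < best["net"]:
            return                                     # no descendant can beat the best so far
        if savings - cost > best["net"]:
            best["net"], best["S"] = savings - cost, S
        if not Q:
            return
        x = min(Q)
        recur(forall(g1, x, n), forall(g2, x, n), S | {x}, Q - {x})
        recur(g1, g2, S, Q - {x})

    recur(f, [not v for v in f], frozenset(), frozenset(range(n)))
    return set(best["S"]), best["net"]

# Example: 2-bit magnitude comparator, inputs ordered (a1, a0, b1, b0).
greater = lambda a1, a0, b1, b0: (2 * a1 + a0) > (2 * b1 + b0)
S, net = select_inputs(greater, 4)
```

No value of k is supplied: a single pass over the tree picks the low-order bits a0 and b0 as the set to disable for this example.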
Chapter 7
Subcircuit Selection
In Chapters 5 and 6, a methodology has been described in which the power consumption of combinational circuits can be reduced by dynamically turning off select
inputs. This technique is closely based on the technique developed by Alidina in [2]
and described in Chapter 4 of this thesis. Although good results can be achieved with
this algorithm, there are limitations. One such limitation is the severe restriction
that occurs as the number of outputs increases. This chapter describes this limitation
and suggests a technique to overcome it.
7.1 Motivation
As was shown in Section 5.2, Equations 5.3 and 5.4 must be satisfied for each output.
As the number of outputs increases, this restriction becomes even tighter. In general,
this tends to reduce the probability that inputs will be turned off (i.e., decreases
prob(g = 1)). This, in turn, reduces the power savings that can occur.
For example, consider the two circuits shown in Figure 7-1. The circuit in part
(a) is simply an 8-bit comparator. Because there is only one output, it is relatively
easy to find a disable function, g, that satisfies the restrictions stated in Section 5.2
and still has a good probability of being true. For example, letting D = {a[7], b[7]}
results in g = U_S f = a[7] ⊕ b[7] with prob(g = 1) = 0.5. The circuit in part (b) is
the same 8-bit comparator, except that now the inputs are being fed from adders.
[Figure 7-1: Two candidates for power reduction. (a) A simple 8-bit comparator. (b) An 8-bit comparator preceded by two adders.]
Intuitively, it seems that the same savings should be achieved because both circuits
contain the same combinational comparator. But because of the additional outputs
of the adders, none of the inputs can be turned off. The tool is simply not intelligent
enough to consider turning off interior nodes.
Alidina suggests one solution to this problem in [2]. He considers deriving g using
a subset of the outputs. Then, to keep the function of the rest of the outputs correct,
he suggests duplicating the logic that is shared by both sets of outputs. This is
described in more detail in Section 4.4. For example, for the circuit in Figure 7-1(b), this algorithm may result in both adders being duplicated. As is shown by this
example, logic duplication can result in a tremendous amount of overhead that limits
power reduction.
As an alternative solution, this chapter presents a method based on division into
subcircuits. The best solution to the example in Figure 7-1 would be to run the
algorithm on just the comparator subcircuit. To give the algorithm this flexibility,
the circuit is first divided into subcircuits. Each subcircuit is considered for power
reductions. Then, these subcircuits are recombined into groups of subcircuits and
again the power reduction technique is considered. This is repeated until, eventually,
the entire circuit may be considered.
The rest of this chapter considers this technique in greater detail. First some
observations about the nature of this technique are given in Section 7.2. Based on
these observations, two algorithms are presented in Sections 7.3 and 7.4.
7.2 Observations About Single-Output Subcircuits
In order to develop algorithms that can efficiently search a circuit for the optimal
subcircuits, a few observations are needed.
First, a couple of terms must be defined. A single-output subcircuit is a combinational
subcircuit that has only one output. A maximum-sized, single-output subcircuit
(MSSO subcircuit) is a single-output subcircuit to which no gate or set of gates may
be added such that the subcircuit still has only one output.
Next, note that no two MSSO subcircuits can overlap. To show this, assume
two single-output subcircuits do overlap such that one output gate is within the
other subcircuit.
In this case, the two subcircuits can be joined to make a larger
single-output subcircuit implying that the two subcircuits were not maximum-sized.
Consider another case in which two subcircuits overlap, but their output gates are
separate.
In this case there exists some gate whose output drives gates in both
subcircuits. This gate output is also a subcircuit output so the original two subcircuits
have more than one output.
Because of these two cases, MSSO subcircuits cannot
overlap.
Next, note that every gate in a circuit is included in at least one MSSO subcircuit
simply because gates have only one output. In addition, because MSSO subcircuits
cannot overlap, no gate can be included in two MSSO subcircuits. These two facts
demonstrate that the complete set of MSSO subcircuits is a unique division of the
circuit into non-overlapping subcircuits. The non-overlapping property also forces this
unique set to be the minimum number of single-output subcircuits that completely
define the circuit.
This minimum number of single-output subcircuits can be determined in linear
time. To do so, simply walk through all of the gates in the circuit starting at the
outputs and working backwards to the inputs. If a gate drives a primary circuit output,
then it is the beginning of a new MSSO subcircuit. If a gate has fanouts that are
part of different MSSO subcircuits, then this gate is the beginning of a new MSSO
subcircuit. Otherwise, all of this gate's fanouts belong to the same MSSO subcircuit,
and therefore this gate also belongs to this MSSO subcircuit. Pseudo-code for this
algorithm is shown in Figure 7-2.

GET_SINGLE_OUTPUT_SUBCIRCUITS( circuit ):
{
    arrange nodes of circuit in depth order, outputs to inputs;
    foreach node in depth order ( node ) {
        if ( node is a primary output ) {
            subcircuit = create_new_subcircuit();
            mark node as part of subcircuit;
        }
        else {
            check every fanout of node;
            if ( all fanouts are part of the same subcircuit )
                subcircuit = subcircuit of the fanouts;
            else
                subcircuit = create_new_subcircuit();
            mark node as part of subcircuit;
        }
    }
}
Figure 7-2: Procedure to Find the Minimum Set of Single-Output Subcircuits
To see how this algorithm works, consider the contrived circuit shown in Figure
7-3. In this circuit, the algorithm would start at the three outputs of the circuit.
Each of these gates represents the start of a new MSSO subcircuit: A, D, and B. The
algorithm continues with the fan-ins of each of these three gates. The next level
of gates is labeled according to the rules explained above. This continues until the
inputs are reached. When the algorithm is complete, the circuit has been divided
into MSSO subcircuits as is shown in part (b) of Figure 7-3.
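The walk from outputs to inputs can be sketched in Python as follows. The netlist representation (a dict from each gate to its drivers, with any driver that is not itself a key treated as a primary input), the gate names, and the function name msso_partition are assumptions made for this example; the circuit is assumed to be acyclic.

```python
from collections import deque

def msso_partition(fanins, primary_outputs):
    """Map each gate to the output gate (root) of its MSSO subcircuit."""
    fanouts = {g: set() for g in fanins}
    for g, drivers in fanins.items():
        for d in drivers:
            if d in fanouts:
                fanouts[d].add(g)
    pending = {g: len(fanouts[g]) for g in fanins}   # fanouts not yet labeled
    ready = deque(sorted(g for g in fanins if pending[g] == 0))
    root = {}
    while ready:
        g = ready.popleft()
        owners = {root[h] for h in fanouts[g]}
        if g in primary_outputs or len(owners) != 1:
            root[g] = g                 # g starts a new MSSO subcircuit
        else:
            root[g] = owners.pop()      # all fanouts agree: join their subcircuit
        for d in set(fanins[g]):
            if d in pending:
                pending[d] -= 1
                if pending[d] == 0:     # every fanout of d has been labeled
                    ready.append(d)
    return root

# Hypothetical five-gate circuit; a..e are primary inputs, n4 and n5 are primary outputs.
fanins = {
    "n1": ["a", "b"],
    "n2": ["n1", "c"],
    "n3": ["n1", "d"],
    "n4": ["n2", "n3"],
    "n5": ["n3", "e"],
}
root = msso_partition(fanins, primary_outputs={"n4", "n5"})
```

In this example n2 joins the subcircuit rooted at n4, while n3 and n1 each start their own subcircuit because their fanouts belong to different subcircuits. Each gate and each connection is visited a constant number of times, which is what makes the walk linear.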
[Figure 7-3: Dividing a circuit into a minimum number of single-output subcircuits. (a) The original circuit. (b) The single-output subcircuits.]
Next, note that there is no need to analyze any subcircuit that is composed of
only a part of one of these MSSO subcircuits. Consider a single-output subcircuit
that is composed of a subset of the gates from a MSSO subcircuit including the
output gate. Because the outputs of these two circuits are the same, the restrictions
on g are identical. But the MSSO subcircuit has more internal nodes, implying that
more power savings can be had. Therefore it makes sense to consider only the MSSO
subcircuit.
Now, consider a single-output subcircuit that is composed of a subset of the gates
from a MSSO subcircuit not including the output gate. This subcircuit has a different
output that indirectly feeds into the original output gate. Because this output
feeds the original, it must be more restrictive than the first. That is, when the universal quantification is evaluated it must be true that g_new ⇒ g_original. Therefore,
prob(g_new = 1) ≤ prob(g_original = 1). Once again it makes sense to consider only the
MSSO subcircuit.
The conclusion of this argument is that there is no need to consider any subcircuit
that is a subset of a MSSO subcircuit. There is one notable exception to this rule. In
some cases, it may be desirable to consider turning off nodes that are in the interior
of a MSSO subcircuit to overcome timing restrictions. This will be discussed in more
detail in Chapter 8. Despite these exceptions due to timing, using MSSO subcircuits
is an excellent method to reduce the possible subcircuits to a manageable number.
The algorithms that are developed in the following two sections are based on this
idea.
7.3 The First Algorithm
The previous section showed how to divide a combinational circuit into MSSO subcircuits in order to narrow down the set of candidate subcircuits. Using this idea,
one possible approach could be: 1) create the set of MSSO subcircuits, 2) try every
possible combination of these subcircuits, and 3) determine the combinations that
yield the best net savings. Unfortunately, step 2 above cannot be executed for any
significant number of MSSO subcircuits (certainly no more than ten). Therefore a
more intelligent algorithm is needed to reduce the number of possibilities.
First note that not all possible combinations of subcircuits make sense to evaluate.
If a pair of subcircuits are completely unrelated, in other words they have no outputs
or inputs in common, then there is obviously no reason to evaluate them as a pair.
A better result can be obtained by evaluating the subcircuits separately.
Therefore, an algorithm is used that loops through only combinations of MSSO
subcircuits that are interconnected. The algorithm branches over several trees - one
starting with each MSSO subcircuit. Figure 7-4 shows the branching pattern for the
example circuit of Figure 7-3. In this figure, the lower-case letters represent the set of
neighbors for the current combination. Two MSSO subcircuits are neighbors if at least
one input or one output of one subcircuit is also an input or an output of the other
subcircuit. Starting with one MSSO subcircuit, the algorithm selects a second MSSO
subcircuit from the first's neighbors. The set of neighbors is updated to include the
neighbors of the second MSSO subcircuit. This continues as the algorithm branches
over all possible combinations of MSSO subcircuits.
Of course the algorithm should not evaluate the same combination of MSSO subcircuits multiple times. For example, the algorithm should not evaluate ABCD,
ACBD, and BCDA because these are actually the same combination constructed in a different order. For most combinations, the branching will occur only in
the order of the MSSO subcircuit names. For example, if the branching is currently
at AC, the algorithm will branch to ACD, but will not branch to ACB because this
is out of order.
[Figure 7-4: Example of branching through subcircuit combinations.]
This strategy will eliminate all duplications, but it may also skip some valid combinations. In the example, ABC cannot be created in order because subcircuits A
and B are not neighbors. To account for these cases, the following special rule is used.
Assume that, at a certain node along the branching tree, a new MSSO subcircuit is
added to the list of neighbors. Further assume that the index of this new neighbor is
greater than the index of the MSSO subcircuit that is at the root of the tree. In this
case, an extra branch should be made to include this new neighbor. For the example
shown in Figure 7-4, this rule requires that the node AC branch to the node ABC. But
this rule does not allow node BC to branch to ABC because index(A) < index(B).
The algorithm presented so far is better than trying all possible combinations,
but it is still too complex to run on many circuits. To reduce complexity, a pruning
condition has been introduced. First, note that as MSSO subcircuits are added,
more outputs of the circuit make Equations 5.3 and 5.4 more restrictive. Therefore
the quantity prob(g = 1) will tend to decrease. Assuming that this must be true, the
algorithm stops branching when the value of prob(g = 1) drops below a threshold.
This threshold is just an arbitrary number based on what the designer considers to be
significant savings. 10% may be a reasonable threshold.
The complete algorithm using these rules is shown in Figure 7-5.
Although there are contrived cases when this algorithm will fail to find the optimal
division of subcircuits, it finds very good divisions for most circuits.
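The goal of visiting each interconnected combination exactly once can be illustrated with the Python sketch below. It is not a transcription of Figure 7-5: instead of the first/last index bookkeeping, it uses a simpler, commonly used rule in which every combination is rooted at its smallest-named member and candidates skipped at one branch are excluded from the branches that follow. The neighbor relation, the names, and the keep_growing pruning hook are assumptions for this example.

```python
def connected_combinations(neighbors, keep_growing=lambda combo: True):
    """Yield every connected set of subcircuits exactly once.

    neighbors: dict name -> set of neighboring names (symmetric relation).
    keep_growing: pruning hook; return False to stop branching below a combination,
                  e.g. when prob(g = 1) for that combination falls below a threshold.
    """
    def grow(combo, frontier, excluded):
        yield combo
        if not keep_growing(combo):
            return
        ordered = sorted(frontier)
        for i, v in enumerate(ordered):
            skipped = excluded | set(ordered[:i])      # earlier candidates are ruled out here
            new_combo = combo | {v}
            new_frontier = (set(ordered[i + 1:]) | neighbors[v]) - new_combo - skipped
            yield from grow(new_combo, new_frontier, skipped)

    names = sorted(neighbors)
    for i, r in enumerate(names):
        smaller = set(names[:i])                       # each set is rooted at its smallest name
        yield from grow(frozenset([r]), neighbors[r] - smaller - {r}, smaller)

# Hypothetical neighbor relation: C touches A, B and D; A and B are not neighbors.
neighbors = {"A": {"C"}, "B": {"C"}, "C": {"A", "B", "D"}, "D": {"C"}}
combos = list(connected_combinations(neighbors))
pruned = list(connected_combinations(neighbors, lambda c: len(c) < 2))
```

In this example ABC is reached through AC even though A and B are not neighbors, and no combination is produced twice. The second call shows the pruning hook: once a combination is rejected, nothing below it is explored.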
7.4 The Second Algorithm
For some circuits, the algorithm presented in Section 7.3 may still involve too much
computation. A few more observations lead to another heuristic algorithm.
As was shown earlier, the number and the function of subcircuit outputs are the
key factors in determining the probability prob(g = 1). Each output acts like a
restriction on the set of input vectors for which inputs may be turned off. In general,
it is desirable to maximize the size of the subcircuits and still keep the number of
subcircuit outputs low in order to achieve high savings. Therefore, the subcircuit
selection algorithm should be written to maximize the ratio between internal nodes
and outputs.
Consider joining two MSSO subcircuits to create a new subcircuit. If the MSSO
subcircuits share inputs, as in Figure 7-6(a), then the ratio of internal nodes to outputs
is not increasing. Another way of looking at this is that the outputs of the MSSO
subcircuits are still outputs of the combined subcircuit, and, therefore, the restrictions
on g remain the same. Because of this, it is unlikely that this combination will lead
to increased power savings. The same argument can be made for the case shown
in Figure 7-6(b). Because the internal node to output ratio is not increasing, it is
unlikely that there will be any substantial increase in the power savings.
Now consider Figure 7-6(c). In this case, three MSSO subcircuits have been
grouped so that one output is no longer an output. This is like removing one restriction
on the disable logic, g. In this configuration, the internal node to output ratio has
been increased over any individual MSSO subcircuit. Therefore, it is reasonable that
the gross savings may actually increase compared to the sum of the three individual
MSSO subcircuits. At the same time, the disable logic will be shared, reducing overhead. Obviously this type of situation is much more likely to produce good results
than the previous two.

SUBCIRCUIT_SELECT( circuit ):
{
    let A be the array of single-output subcircuits so that each subcircuit
        is denoted A[i], 0 ≤ i ≤ |A| - 1
    i = 0;
    while ( i < |A| ) {
        N = NEIGHBORS( A[i] );
        SUB_SELECT_RECUR( {A[i]}, N, N, i, i );
        i = i + 1;
    }
}

SUB_SELECT_RECUR( B, N, M, f, l ):
{
    B = set of single-output subcircuits that comprise the current multi-output subcircuit
    N = set of neighboring single-output subcircuits
    M = set of new neighboring single-output subcircuits
    f, l = smallest and largest subcircuit indices found in B

    EVALUATE_SUBCIRCUIT( B, prob );
    if ( prob < ε )
        return;
    i = f;
    while ( i < l ) {
        if ( A[i] ∈ M ) {
            X = NEIGHBORS( A[i] );
            SUB_SELECT_RECUR( B ∪ {A[i]}, N ∪ X, X - N, f, l );
        }
        i = i + 1;
    }
    while ( i < |A| ) {
        if ( A[i] ∈ N ) {
            X = NEIGHBORS( A[i] );
            SUB_SELECT_RECUR( B ∪ {A[i]}, N ∪ X, X - N, f, i );
        }
        i = i + 1;
    }
}
Figure 7-5: Algorithm that evaluates different combinations of subcircuits.

[Figure 7-6: Possible groupings of adjacent MSSO subcircuits. (a) Subcircuits sharing an input. (b) A subcircuit output feeding a subcircuit input. (c) A subcircuit output and each of its fanouts.]
These observations lead to another heuristic algorithm. First, find all the MSSO
subcircuit outputs that are not primary circuit outputs.
Then, for each of these
nodes, group the adjoining MSSO subcircuits into a multiple-output subcircuit and
evaluate the power savings that can be achieved. Continue by trying combinations of
these nodes. This leads to a branching structure that is identical to the one described
in the previous section, except that each node in the tree now represents a circuit
node. In fact, the same pruning condition still holds: if prob(g = 1) decreases below
a threshold, the branching should discontinue. But, in this case, there are far fewer
possibilities to try so that the algorithm is much more efficient.
Once again, there are contrived cases when this algorithm will fail to find the
optimal division of subcircuits, but in general it is a very good heuristic.
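The first step of this heuristic, forming one candidate group per internal MSSO output, can be sketched in Python as follows. The netlist format, the gate names, and the root map (each gate mapped to the output gate of its MSSO subcircuit) are assumptions carried over for illustration; they are not part of the thesis implementation.

```python
def internal_output_groups(fanins, primary_outputs, root):
    """For each MSSO output that is not a primary output, collect the MSSO
    subcircuit that produces it together with every MSSO subcircuit it feeds."""
    groups = {}
    for gate, drivers in fanins.items():
        for d in drivers:
            is_msso_output = root.get(d) == d      # primary inputs have no entry in root
            if is_msso_output and d not in primary_outputs:
                groups.setdefault(d, {d}).add(root[gate])
    return groups

# Hypothetical circuit: a..e are primary inputs, n4 and n5 are primary outputs.
fanins = {
    "n1": ["a", "b"],
    "n2": ["n1", "c"],
    "n3": ["n1", "d"],
    "n4": ["n2", "n3"],
    "n5": ["n3", "e"],
}
root = {"n1": "n1", "n2": "n4", "n3": "n3", "n4": "n4", "n5": "n5"}
groups = internal_output_groups(fanins, {"n4", "n5"}, root)
```

Each entry corresponds to the situation of Figure 7-6(c): grouping the listed subcircuits turns the keyed node into an internal node. Combinations of these nodes would then be explored with the same branching and pruning as in Section 7.3.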
Chapter 8
Results, Comparisons, and Future Work
The algorithms described in Chapters 5, 6, and 7 have been implemented and executed
on example circuits. This chapter describes the experimental results that have been
achieved. In addition, the results are compared to the results of the Alidina technique,
and suggestions for future work are given.
8.1 Results
The algorithms described in this thesis have been implemented in C and have been
incorporated into the SIS logic synthesis and optimization platform. The power reduction technique has been used on many example circuits. Some of the results are
shown in Table 8.1.
Although some good results have been achieved, the majority of circuits evaluated
produced little or no power savings. There are two major reasons for this.
First, this technique relies on circuits that have certain functional properties. To
achieve savings, a circuit or a subcircuit must have inputs whose values are sometimes
unnecessary to determine the outputs. There exist circuits where this simply is not
true.
For example, an adder needs all of its input information all of the time to
determine the correct output values. These cases are just a shortcoming of the whole
technique: better results are simply not possible.

                  Original            Precompute Logic      Optimized
CKT           Lits  Levs    Pwr      Bits  Lits  Levs      Pwr   % Red
comp16          92    16  364.1         2     4     1    326.3    10.4
                                        4    16     1    252.8    30.6
                                        6    24     2    156.7    57.0
comp8           44     8  169.1         2     4     1    161.4     4.6
                                        4    16     1    153.1     9.5
                                        6    24     2    167.1     1.2
fcomp8          88     8  338.2         2     4     1    293.8    13.1
                                        4    16     1    251.4    25.7
                                        6    24     2    265.5    21.5
                                        8    36     2    300.8    11.1
priority15      60     4  150.8         1     0     0    132.1    12.4
                                        2     2     1     93.7    40.2
priority7       17     4   79.7         2     2     1     79.0     0.9

Table 8.1: Power reductions of combinational circuits.
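The functional property discussed above, that some inputs are sometimes unnecessary to determine the outputs, can be checked by brute force on small examples. The following Python sketch is an illustration under stated assumptions, not part of the implemented tool: inputs are taken to be uniformly distributed, and the names `prob_outputs_determined`, `comparator2`, and `adder2` are hypothetical. It computes the fraction of assignments to a kept subset of inputs for which the outputs no longer depend on the remaining inputs.

```python
from itertools import product


def prob_outputs_determined(func, n_inputs, kept):
    """Fraction of assignments to the `kept` inputs that fix the outputs.

    func     : maps a tuple of n_inputs bits to a hashable output value
    kept     : indices of the inputs that stay enabled
    Assumes every input bit is 0 or 1 with equal probability.
    """
    others = [i for i in range(n_inputs) if i not in kept]
    determined = 0
    for kept_vals in product((0, 1), repeat=len(kept)):
        seen = set()
        for other_vals in product((0, 1), repeat=len(others)):
            bits = [0] * n_inputs
            for idx, v in zip(kept, kept_vals):
                bits[idx] = v
            for idx, v in zip(others, other_vals):
                bits[idx] = v
            seen.add(func(tuple(bits)))
        if len(seen) == 1:           # remaining inputs never change the output
            determined += 1
    return determined / 2 ** len(kept)


def comparator2(bits):
    """2-bit comparator, inputs ordered (a1, a0, b1, b0); output is a > b."""
    a1, a0, b1, b0 = bits
    return (2 * a1 + a0) > (2 * b1 + b0)


def adder2(bits):
    """2-bit adder, inputs ordered (a1, a0, b1, b0); output is the full sum."""
    a1, a0, b1, b0 = bits
    return (2 * a1 + a0) + (2 * b1 + b0)


if __name__ == "__main__":
    msbs = [0, 2]                    # keep only a1 and b1
    print(prob_outputs_determined(comparator2, 4, msbs))
    print(prob_outputs_determined(adder2, 4, msbs))
```

For the comparator, the most significant bits alone settle the output whenever they differ, so the low-order inputs could be disabled in those cases. For the adder, no assignment to the most significant bits fixes the sum, which matches the observation that an adder needs all of its inputs all of the time.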
The other major problem encountered with combinational circuits is timing. For
this technique to be successful, the disable signal must arrive before the signals that
are being turned off. This requires two things: all the inputs in D must arrive before
all the inputs in S, and the disable logic must evaluate very quickly. For random
circuits, these conditions are unlikely to hold, so the savings are insignificant.
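The timing requirement can be stated as a simple arrival-time check. The Python sketch below is one possible formalization, assuming that D is the set of inputs feeding the disable logic, that S is the set of inputs being disabled, and that the disable logic has a single worst-case delay. The function name and the arrival-time dictionary are hypothetical, and latch setup time and wire delay are ignored.

```python
from typing import Dict, Iterable


def disable_arrives_in_time(arrival: Dict[str, float],
                            d_inputs: Iterable[str],
                            s_inputs: Iterable[str],
                            g_delay: float) -> bool:
    """True if the disable signal is ready before any disabled input arrives.

    arrival  : arrival time of each primary input (arbitrary time units)
    d_inputs : inputs used to compute the disable signal (the set D)
    s_inputs : inputs that the disable signal turns off (the set S)
    g_delay  : assumed worst-case delay of the disable logic
    """
    g_ready = max(arrival[x] for x in d_inputs) + g_delay
    earliest_s = min(arrival[x] for x in s_inputs)
    return g_ready < earliest_s


if __name__ == "__main__":
    arrival = {"a3": 0.0, "b3": 0.0, "a0": 2.0, "b0": 2.5}
    print(disable_arrives_in_time(arrival, ["a3", "b3"], ["a0", "b0"], 1.0))
```

When the check fails, the inputs in S begin switching before they can be held, and the added logic contributes only overhead.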
Because of these two problems, positive results can be obtained for only a select
group of circuits.
8.2 Comparison to Alidina's Technique
Comparing the results of Alidina in Section 4.5 with the results of this new technique
in the previous section reveals Alidina's technique to be more powerful. Alidina's
technique seems to find better power savings for a larger range of circuits.
The main advantage of Alidina's technique is its independence from timing constraints. As described in Section 4.1, all of the additional logic is added to the previous
clock cycle, avoiding any possible timing problems. In the technique described in this
thesis, the target circuits are combinational so that adding logic to the previous cycle
is not possible. The timing constraints were very restrictive so that very few circuits
achieved good results.
Another advantage of Alidina's technique is lower overhead. Because Alidina is
disabling inputs coming out of a register, no additional logic is required to store the
value at the disabled node. In the technique described in this thesis, a general circuit
requires the addition of an entire latch. The additional overhead reduces the net
savings considerably.
Even so, there is still a class of circuits where this new technique can find savings
whereas Alidina's cannot. This new technique is generally better at finding savings
within circuits with a large number of outputs because of the strategies described in
Chapter 7. This is best demonstrated by the example circuit shown in Figure 7-1(b).
For this circuit, Alidina's technique requires logic duplication resulting in overhead
that would overwhelm the possible savings. This new technique is capable of finding
and considering the comparator as a separate subcircuit.
All things considered, Alidina's technique is more powerful. But the technique
presented in this thesis is still valuable because of circuits with a large number of
outputs for which Alidina's technique fails.
8.3 Future Work
The major difficulty of the work presented in this thesis is timing. To successfully
reduce the power consumption of static combinational circuits, the disable signal must
arrive before the signals at the nodes that are being disabled. A late disable signal will
not prevent the signals from switching and will simply be wasted overhead. There are several
possible ways of overcoming this difficulty.
The most obvious way to do this is to speed up the arrival of the disable signal.
This may be accomplished by computing the disable signal from nodes that evaluate
earlier. The theory to do this does not yet exist, and it may be a very difficult
problem.
Another possibility is to disable signals that arrive later. For example, when using
such a technique on a comparator, it may be better to disable the carry signals instead
of the circuit inputs. Of course, this may also significantly reduce power savings.
The most promising way to overcome timing constraints is to implement g, the
disable logic, using Domino logic. Using Domino logic, g would always be pre-charged
to a 1. This disables the inputs at the beginning of the clock cycle. Then, if g evaluates
to a 0, the signals would be allowed to propagate normally. But, because g always
starts the clock cycle as a 1, there is no way the disabled signals could propagate before
g evaluates. Therefore, timing is no longer an issue. Effectively, this is a self-timing
strategy to allow switching to propagate only after it has been determined that the
signals need to propagate.
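The intended behavior can be sketched with a zero-delay behavioral model in Python. This is an illustration under stated assumptions, not a circuit-level simulation: the function name `count_toggles`, the per-cycle `(d, s)` representation, and the disable function `g_of` are hypothetical, with `d` standing for the inputs that feed g and `s` for the inputs that g may disable.

```python
from typing import Callable, Sequence, Tuple

Bits = Tuple[int, ...]


def count_toggles(cycles: Sequence[Tuple[Bits, Bits]],
                  g_of: Callable[[Bits], int]) -> Tuple[int, int]:
    """Return (gated, ungated) toggle counts on the disabled inputs.

    cycles : one (d, s) pair per clock cycle
    g_of   : hypothetical disable function; 1 means s is not needed
    """
    held = prev = cycles[0][1]       # latch contents and raw s values
    gated = ungated = 0
    for d, s in cycles[1:]:
        # Without gating, every change on s propagates into the circuit.
        ungated += sum(a != b for a, b in zip(prev, s))
        prev = s
        g = 1                        # pre-charge: s starts the cycle disabled
        if g_of(d) == 0:             # evaluate: g may only fall from 1 to 0
            g = 0
        if g == 0:                   # s propagates only after g has evaluated
            gated += sum(a != b for a, b in zip(held, s))
            held = s
    return gated, ungated


if __name__ == "__main__":
    # 2-bit comparator: d = (a1, b1), s = (a0, b0); s is unneeded if a1 != b1.
    cycles = [((0, 0), (0, 0)), ((1, 0), (1, 1)),
              ((0, 1), (0, 1)), ((1, 1), (1, 0))]
    print(count_toggles(cycles, lambda d: int(d[0] != d[1])))
```

Because g begins every cycle at 1 in this model, a change on the disabled inputs is counted only in cycles where g has fallen to 0, which is the self-timing property described above.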
There are difficulties with using this kind of mixed logic circuit. Domino logic
must be implemented within some kind of clocking methodology, so this could not
be used in any asynchronous circuits. Domino circuits require a certain amount of
overhead, including the pre-charge and evaluate signals. And the inputs to the Domino
circuits cannot be purely combinational signals because Domino signals are not
allowed to make 0 → 1 transitions except during the pre-charge. In addition, requiring
signals to wait before they propagate can increase the delay of the circuit considerably.
Although there are quite a few difficulties with the mixed logic technique, it
shows a great deal of promise for overcoming timing constraints.
In any case, any serious future work must address the timing problems of this
technique.
Chapter 9
Conclusion
This thesis presented a new technique for power optimization of combinational CMOS
circuits. The technique adds logic to dynamically "turn off" a subset of inputs. This
decreases switching activity within the circuit, reducing the required power.
To implement this technique, a standard architecture was presented along with
algorithms that optimize the power reduction for a particular circuit. Algorithms
used to select inputs and synthesize additional logic were updated from Alidina's
work on sequential circuits. In addition, new algorithms were developed that divide
the circuit into subcircuits in an effort to make the technique more versatile. These
algorithms were implemented, and the result was a completely automated CAD tool.
Experimental results show that the technique was not as successful as desired. For
some circuits, it is not possible to disable inputs because all of the input information
is needed all of the time. This problem is shared by Alidina's technique. For other
circuits, timing prevented the technique from reducing the power consumption. To
achieve savings, it is necessary that the disable signal arrive before the signals that are
being turned off. This is a severe restriction that makes it unlikely that power savings
will be found for random circuits. Alidina's technique avoids all timing problems by
evaluating the disable logic in the previous clock cycle. Even so, there is a class of
circuits for which this new technique outperforms Alidina's technique. If circuits have
a large number of outputs, Alidina's technique often fails to find savings. But this
new technique will divide the circuit into subcircuits and can find reasonable savings.
Overall, the power reduction CAD tool presented in this thesis has some shortcomings. But, for a class of circuits, it outperforms previous work in this area, and
therefore it is still a valuable CAD tool.
Bibliography
[1] S. B. Akers. Binary Decision Diagrams. IEEE Transactions on Computers,
C-27(6):509-516, June 1978.
[2] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, and M. Papaefthymiou.
Precomputation-Based Sequential Logic Optimization for Low Power. IEEE
Transactions on VLSI Systems, pages 426-436, December 1994.
[3] R. Bryant. Graph-Based Algorithms for Boolean Function Manipulation. IEEE
Transactions on Computers, C-35(8):677-691, August 1986.
[4] S. Chakravarty, T. Sheng, and R. W. Brodersen. On the Complexity of Using
BDDs for the Synthesis and Analysis of Boolean Circuits. In Proceedings of the
27th Annual Allerton Conference on Communications, Control, and Computing,
pages 730-739, September 1989.
[5] A. Chandrakasan, T. Sheng, and R. W. Brodersen. Low Power CMOS Digital
Design. In Journal of Solid State Circuits, pages 473-484, April 1992.
[6] S. Devadas, A. Ghosh, and K. Keutzer. Logic Synthesis. McGraw-Hill, 1994.
[7] D. Dobberpuhl et al. A 200MHz 64b Dual-Issue CMOS Microprocessor. In
IEEE Journal of Solid-State Circuits, pages 106-107, 1992.
[8] A. Ghosh, S. Devadas, K. Keutzer, and J. White. Estimation of Average Switching
Activity in Combinational and Sequential Circuits. In Proceedings of the 29th
Design Automation Conference, pages 253-259, June 1992.
[9] L. Glasser and D. Dobberpuhl. The Design and Analysis of VLSI Circuits.
Addison-Wesley, 1985.
[10] C. Y. Lee. Representation of Switching Circuits by Binary-Decision Programs.
Bell Systems Technical Journal, 38(4):985-999, July 1959.
[11] J. Monteiro, S. Devadas, and A. Ghosh. Retiming Sequential Circuits for Low
Power. In Proceedings of the Int'l Conference on Computer-Aided Design, pages
398-402, November 1993.
[12] F. Najm. Transition Density, A Stochastic Measure of Activity in Digital
Circuits. In Proceedings of the 28th Design Automation Conference, pages 644-649,
June 1991.
[13] K. Roy and S. Prasad. SYCLOP: Synthesis of CMOS Logic for Low Power
Applications. In Proceedings of the Int'l Conference on Computer Design: VLSI
in Computers and Processors, pages 464-467, October 1992.
[14] J. Schutz. A 3.3V 0.6µm BiCMOS Superscalar Microprocessor. In IEEE Journal
of Solid-State Circuits, pages 202-203, 1994.
[15] A. Shen, S. Devadas, A. Ghosh, and K. Keutzer. On Average Power Dissipation
and Random Pattern Testability of Combinational Logic Circuits. In Proceedings
of the Int'l Conference on Computer-Aided Design, pages 402-407, November
1992.
[16] C-Y. Tsui, J. Monteiro, M. Pedram, S. Devadas, A. Despain, and B. Lin. Exact
and Approximate Methods for Switching Activity Estimation in Sequential Logic
Circuits. IEEE Transactions on VLSI Systems, March 1995.
[17] E. Vittoz. Low-Power Design: Ways to Approach the Limits. In IEEE Journal
of Solid-State Circuits, pages 14-18, 1994.