Media:Project_Report_KamChong

advertisement
Power-Aware Circuits with Dynamic Voltage Scaling
Chong Hu, Kamaram Munira
ECE 563 – Fall 2006
University of Virginia
<chonghu, munira>@virginia.edu
ABSTRACT
Power-Aware systems are highly desirable in today’s VLSI world,
which focuses more and more on energy-constrained systems e.g.
wireless sensors, music players. Power-aware architecture scales
down energy as quality requirements of the devices are relaxed.
This paper introduces power-awareness in a 16-bit Baugh
Woolley Multiplier. Two models of power-aware 16, 8-bit Baugh
Wooley multipliers are constructed and studied in contrast to the
non power-aware 16-bit multiplier, in terms of power
consumption, delay and estimate area. Dynamic Voltage Scaling
is used to further control the power consumption in the poweraware multipliers. All the multipliers are implemented using 90nm
Predictive Technology Model (PTM) library.
1. INTRODUCTION
As we get closer to the limits of scaling in CMOS circuits, power
consumption and heat dissipation issues are becoming ever more
important. In recent years, the impact of ubiquitous computing
and internet technology has accelerated this trend. The
applications for these domains are “energy-constrained”, which
means that they are run on battery powered embedded systems.
The resultant constraints on the energy budget require design-forpower as well as design-for-performance at all layers of system
design. Even when battery based constraints do not exist, energy
must often be conserved, in general-purpose as well as in realtime systems.
A “power-aware” system is a system that is sensitive to energy
considerations, therefore maximizing the lifetime of the energyconstrained system. The system will scale the power consumption
in response to the changing scenarios in its operating point. These
changes occur as a result of variation in its input statistics, desired
output quality tolerable latency and throughput [1]. A “poweraware” system will: (a) allow its client to adjust the expected
quality and the tolerable latency constraint, and (b) After
adjustments are made, the energy consumption is expected to vary
accordingly. Laptops, for example, use power-aware algorithms
that are based on CPU (i.e., processor) utilization [2], making
them ideal for interactive use. That is, if a laptop user is reading a
document for an extended period of time while running on battery
power, the laptop would automatically scale down the frequency
and supply voltage of the CPU in order to reduce power
consumption, as power consumption is proportional to the CPU
frequency and to the square of the CPU supply voltage.
Multiplier is one of the most time-consuming units in a processor.
The performance of many computational problems often is
dominated at how a multiplication operation can be executed. In
section 2, a 16-bit Baugh-Wooley (BW) multiplier is studied. In
section 3, the BW multiplier from section 1 is scaled according to
the output bit precision required. The effect of the precision
scaling is studied in terms of power consumption, delay and
circuit area. In section 4, Dynamic Voltage Scaling is introduced
Figure 1. 16x16-bit Baugh-Wooley Multiplier
to the already scaled multipliers to further lower the power
consumption of the circuit.
2. 16-BIT BAUGH-WOOLEY MULTIPLIER
Multipliers are, in effect, complex adder arrays. Figure 1
represents a 16x16-bit BW multiplier. A BW multiplier is a
regular multiplier that is suited for 2’s-complement numbers.
Multiplication is done in two steps. First the partial products are
calculated from the logical AND of the multiplicand A and a
multiplier bit Bi. The second step accumulates the partial
products. An accumulation is done after every partial product
generation by arrays of adders. Each box in Figure 1 correspond
to an AND gate, followed by a 1-bit transmission-gate Full-adder.
The last row is essentially an N-bit Full-adder (in this case, N is
equal to 16), which combines all the partial product results.
The critical path (highlighted in red in Figure 1) for the BW
multiplier is the delay of the longest vertical chain of adders and
the last row of adders. The last row can be optimized by
implement an N-bit Carry-Select Adder. For the case of a 16x16bit array multiplier, power consumption and delay should be
proportional to the required precision bits of the output. The
worst-case power consumption will take place when the required
input precision is 16 bits and the best-case consumption will be
when the input precision is 1 bit. A non- “power-aware”
multiplier as the BW multiplier in Figure 1 will blindly operate
assuming the worst-case condition. Therefore, even when
computing a multiplication of 1-bit precision, the non-“poweraware” multiplier will be consuming the power required to
compute a 16-bit multiplication. The critical path for delay
remains the same for any output precision.
Figure 3. 16, 8-bit Scalable (1) BW Multiplier
Figure 2. Perfect Power-Aware 16x16-bit Multiplier
3. POWER-AWARE MULTIPLIERS
A perfect power-aware 16x16-bit multiplier, as shown in Figure 2,
will be an ensemble of 16 dedicated point multipliers and a Zero
Detection Circuit as a scenario detector. Routing of multiplicands
will be controlled by the inputs. The 1x1, 2x2… 16x16-bit
multipliers will be engineered to be as efficient as possible to
perform 1x1, 2x2… 16x16-bit multiplication respectively [3].
However, this particular system will have a considerable amount
of energy cost of routing the multiplicands to the right point
system and routing the result to the system output. Also, the
system is highly inefficient in terms of areas and number of
transistors.
Another flavor of power-aware multiplier is by using reuse-ofpoint method [3]. In other words, the 16x16-bit BW multiplier
will be modified to handle lower precision requirements without
consuming the power of the worst-case scenario. The next two
subsections will describe two ways in which the 16x16-bit BW
will be scaled. Subsection 3.1 will introduce scalability by
cutting-off the lower significant bits of the operands, while
subsection 3.2 will deal with a slightly modified routing of the
outputs in the multiplier circuit. The required precision will be
controlled by the user or other control circuits.
3.1 16, 8-bit Scalable (1) BW Multiplier
The 16, 8-bit Scalable (1) BW multiplier is the modified version
of the 16x16-bit BW multiplier from Section 2, to handle both 16
and 8-bit output precision. 16-bit precision multiplication would
use the whole circuit as in the case of the worst case scenario.
However, when 8-bit precision output is needed, 0 will be passed
as inputs for the all the lower eight bits of both the operands. This
can be done by using AND gates or multiplexers to control the
input of the lower 8 bits of the operands, as shown in Figure 3.
As there is no switching activity in the upper half of the circuit
during an 8-bit precision multiplication, energy consumed is
lower than the case when there is a 16-bit multiplication. The 16
multiplexers used to select the lower significant inputs will be
optimized to consume a minimal amount of power. Also, during
an 8-bit precision multiplication, the critical path will be
shortened as half the adders in the longest vertical chain of adders
Figure 4. 16, 8-bit Scalable (2) BW Multiplier
will not be used. The shortened critical path is highlighted in red
in Figure 3. The last row, consisting of the 16-bit Carry-Select
Adder, still falls in the critical path.
3.2 16, 8-bit Scalable (2) BW Multiplier
The 16, 8-bit Scalable (2) BW multiplier is further modification of
the 16, 8-bit Scalable (1) BW multiplier. 16x16 bit multiplication
has an output of 32 bits, the 16 most significant bits from the last
row adders, while the 16 least significant bits are coming from the
right of the rows. When doing a multiplication of 8-bit precision,
only the most significant 8 bits need to go through the final stage
of adders. Therefore, the results of the partial products in the less
significant 8 bits can be directly routed to the output, instead of
going through additional adder chain, as shown in Figure 4. Eight
additional multiplexers after the last row of adder will control the
output bits 23 to 16, based on the information whether the circuit
is performing a multiplication of 8-bit or 16-bit precision. During
8-bit precision multiplication, there is switching activity only in
the lower, left quarter of the circuit plus the control parts. The
modified routing of the output further lowers power consumption
in the circuit as we can see in
Table 1-Comparison of the Delay, Energy and Area
Normal 1616-bit
16/816/8-bit
Scalable (1)
16/816/8-bit
Scalable (2)
Precision
(bit)
Delay
(ns)
Energy
(pJ)
pJ)
Normalized
Total Width*
16
1.67
3.11
3157
16
1.71
3.18
8
0.98
0.79
16
1.76
3.26
8
0.64
0.54
3205
3279
the next subsection. The critical path is further shortened as the
lower 8 adders in the last row would not be used. The new critical
path is highlighted in red in Figure 4.
Figure 5 - Delay, Energy Consumed vs. Supply Voltage
during an 8-bit precision multiplication
3.3 Simulation and Result Comparison *
The three samples of multipliers, 16-bit BW multiplier, 16, 8-bit
Scalable (1) BW multiplier, and 16, 8-bit Scalable (2) BW
multiplier was implemented in Cadence using 90nm Predictive
Technology Model (PTM) library and simulated for the worstcase scenario for both 16 and 8-bit output precision. The results
are recorded in Table 1.
As seen from the simulation results in the table, despite the small
overhead due to scaling in 16-bit mode, both the scaled
multipliers are better in conserving power under the 8-bit mode,
Scalable (2) BW multiplier proving better than Scalable(1) BW
multiplier. During an 8-bit precision multiplication, Scalable (1)
BW multiplier saves approximately 75% of the energy compared
to when it is doing an 8-bit precision multiplication in the non
power-aware 16-bit BW multiplier and is 41% faster. For the
Scalable (2) BW multiplier, 83% of the energy drawn from supply
is saved and the computation is 61% faster. There is also an area
overhead associated with the multiplier scaling. However, the
overhead is very small, approximately 2% increase for the 16, 8bit Scalable (1) BW multiplier and 4% increase for the 16, 8-bit
Scalable (2) BW multiplier.
4. APPLYING DYNAMIC VOLTAGE
SCALING
Dynamic Voltage Scaling (DVS) is a standard technique for
managing the power consumption of a system. It is based on the
fact that dynamic power consumed by a circuit is dependant on i)
α0→1, the activity factor or the fraction of the circuit that is
switching, ii) CL, the total capacitance of the circuit, iii) VDD, the
supply voltage, and iv) fCLK, the clock frequency.
2
Pdynamic  01 fCLKCLVDD
*
Normalized by the width of single CMOS Inverter as the actual
multipliers were not laid out.
Figure 6 - Delay vs. Energy Consumed during
an 8-bit precision multiplication
With increasing complexity of the digital integrated circuits, it is
assumed that power problem will worsen in future technologies.
Therefore, lowering supply voltage is becoming more and more
attractive. Reducing VDD has a quadratic effect on dynamic power.
In order to further lower power consumption, voltage reduction
was applied to both 16, 8-bit Scalable (1) and Scalable (2) BW
multipliers. It is seen from Figure 5, lowering the supply voltage
reduces energy consumption. It is also evident from the graph that
the 16, 8-bit Scalable (2) BW multiplier is better at conserving
energy than the 16, 8-bit Scalable (1) BW multiplier.
On the other hand, voltage reduction has a negative effect on the
overall speed of the circuit. As seen in Figure 5, the output delay
is inversely proportional to the supply voltage. However, if the
clock period of the circuit of the Scalable circuits are fixed at a
point where they can accommodate the worst-case scenario, which
is 16-bit precision multiplication with supply voltage of 1Volt,
there is a time slack of about 0.7ns for Scalable (1) Multiplier and
1.1ns for Scalable (2) Multiplier (Figure 6), operating under the
same supply voltage, but performing a 8-bit precision
multiplication. Therefore, while performing an 8-bit precision
multiplication, the supply voltage can be reduced by at least 30%
and the scaled circuit will be able to produce the output within the
clock period. In 8-bit mode, having the same delay as in 16-bit
mode under 1Volt, the scaled multipliers save another 50% of
energy with Scalable (1) design under 0.7Volt, and more than
60% with Scalable (2) under 0.6Volt,.
5. CONCLUTION
A 16-bit BW Multiplier was modified to handle both 16 and 8-bit
precision multiplication using reuse-of-point method. The 16, 8bit Scalable (1) BW Multiplier performed an 8-bit multiplication
by cutting-off the 8 least significant bits of both the operands. The
16, 8-bit Scalable (1) BW Multiplier was further modified to form
16, 8-bit Scalable (2) BW Multiplier by directly routing the lower
output bits to the final output port while performing 8-bit
precision multiplication. Both the scaled multipliers were better at
conserving energy than the non power-aware 16-bit BW
Multiplier in 8-bit mode, Scalable (2) proving to be better than
Scalable (1). Dynamic Voltage Scaling was then applied to the
scaled multipliers to further lower power consumption during 8bit precision multiplication.
6. REFERENCES
[1] Bhardwaj, M., Min, R., and Chandrakasan, A. Power-aware
systems. In Proceedings of the 34th Asilomar Conference on
Signals, Systems, and Computers. Nov. 2000, vol. 2, pp. 16951701.
[2] Hsu, C. and Feng, W. A Power-Aware Run-Time System for
High-Performance Computing. In Proceedings of the 2005
ACM/IEEE conference on Supercomputing. Nov. 2005.
[3] Bhardwaj, M., Min, R., and Chandrakasan, A. Quantifying
and Enhancing Power Awareness of VLSI Systems. IEEE Trans.
Very Large Scale Integration (VLSI) Syst. Dec. 2001, vol. 9, no.
6, pp. 757—772.
[4] Min, R., and Chandrakasan, A. Energy-aware architectures
for a real-valued FFT implementation. In Proc. Int. Symp. Low
Power Electronics and Design, Aug. 2003, pp. 360-365.
[5] Nielsen, L., Niessen, C., Sparso, J., and Van Berkel, K. Low
power operation using self-timed circuits and adaptive scaling of
the supply voltage. IEEE Transactions on VLSI Systems. Dec.
1994, vol. 2, pp. 391—397.
Download