Trading Accuracy for Power in a Multiplier Architecture

Parag Kulkarni (paragk@ucla.edu)∗, Puneet Gupta (puneet@ee.ucla.edu)∗, Miloš D. Ercegovac (milos@cs.ucla.edu)†
∗ Department of Electrical Engineering, University of California, Los Angeles
† Department of Computer Science, University of California, Los Angeles
Abstract
Certain classes of applications are inherently capable of absorbing some error in computation, which allows quality to be traded off for power. Such a tradeoff is often achieved through voltage over-scaling. We propose a novel multiplier architecture with tunable error characteristics that leverages a modified, inaccurate 2x2 multiplier as its building block. Our inaccurate multipliers achieve an average power saving of 31.78% − 45.4% over corresponding accurate multiplier designs, for an average error of 1.39% − 3.32%. We compare our architecture with other approaches, such as voltage scaling, for introducing error in a multiplier. Using image filtering and JPEG compression as sample applications, we show that our architecture can achieve 2X - 8X better Signal-to-Noise Ratio (SNR) for the same power savings when compared to recent voltage over-scaling based power-error tradeoff methods. We project the multiplier power savings to bigger designs, highlighting the fact that the benefits are strongly design-dependent. We compare this circuit-centric approach to power-quality tradeoffs with a pure software adaptation approach for a JPEG example. Unlike recent design-for-error approaches for arithmetic logic, we also enhance the design to allow for correct operation of the multiplier, using a correction unit, for non-error-resilient applications which share the hardware resource.
Index Terms
Low-Power Circuits, Inexact Logic, Digital arithmetic, Very-large-scale integration, Voltage-Scaling.
I. INTRODUCTION
Energy consumption is a critical design criterion for today’s embedded and mobile systems. Significant effort has already been devoted to improving energy efficiency at various levels, from software to architecture, all the way down to the circuit and device levels. Recently there has also been some work in the
direction of trading accuracy for power. The work in [1] introduced the concept of a probabilistic switch,
which has a non-zero probability of producing an incorrect output. This form of a probabilistic CMOS
methodology is considered a promising basis for energy savings in Digital-Signal-Processing (DSP)
applications [2].
Techniques which trade energy for the quality of the final solution are typically applied at the algorithmic level, where parameters such as quantization levels and precision of coefficients are traded for the quality of the solution [3]–[5]. However, they utilize deterministic or accurate building blocks and primitives - that is, their constituent arithmetic operations are always correct - with energy or performance efficiency obtained through a reduction in the size and/or number of these building blocks. The work in [6] aims to improve the power-awareness of various blocks, including a multiplier. They do so by replacing a monolithic multiplier with multiple blocks, each optimized for a different input precision, along with logic to dynamically route the data to the correct block depending on input width. While this improves the power dissipation, it adversely affects the area of the implementation.
Any application which can withstand bounded and relatively small errors in its constituent components
stands to gain from inaccurate but low-power building blocks. For instance, [7] uses color interpolation
filtering to demonstrate graceful degradation of SNR during voltage-scaling by ensuring that important
computations are least affected. [8] demonstrated how correctness of arithmetic primitives themselves
can be traded for energy consumed. In [9], the authors scale the supply voltage below the minimum value needed, so as to trade accuracy for power. The authors in [1] improved on this by using
the observation that errors in the most-significant bits affect the quality of the solution more than errors in the lower bits; hence, they operate the adders for the more significant bits at a higher voltage and over-scale the voltages for the lower bits. [2] introduced the first methodology for a voltage scaling based inaccurate multiplier; their Monte-Carlo simulation based approach achieves a 50% reduction in power using four voltage domains, and they use a filtering application to evaluate their approach. Such advanced voltage over-scaling techniques, which require multiple voltage domains within a single arithmetic unit, are likely to be impractical in a realistic design flow due to layout, voltage-level-conversion and other overheads which are ignored by [2] and others.
Though most existing work [1], [2], [8] introduces errors by intelligently over-scaling voltage supplies, there has been some work in the direction of introducing error into a system via manipulation of its logic function, for adders [10]–[12] as well as generic combinational logic [13]. The focus of the optimization, though, is not power in either case, and the latter paper acknowledges poor results for multi-output logic
such as arithmetic units. The authors in [10] re-design various common adder architectures exploiting
the error-tolerance of underlying applications so as to improve worst case delays, with the objective of
improving functional yield. In [13] the objective is to maximally reduce the area of a combinational circuit
during logic synthesis (using literal count as the optimization objective), given an error rate threshold.
[14] uses inaccurate 4:2 counters to build adders with fewer stages of logic, with power savings of ∼3%−8%. [15] reports power savings of up to 66% without affecting the accuracy of programs that manipulate
low resolution data, by reducing the bitwidth of floating point multipliers. None of these works that
introduce error via design provide any way to correct the incorrect output if needed. This may be especially
important for general purpose computing hardware which runs a variety of applications. [16] describes a generic algorithm for the synthesis of approximate logic circuits; their objective is to create low-overhead solutions for error detection in the original circuit (which is accurate but susceptible to errors/faults). Power is not one of the optimization objectives, and an extension to provide correction logic is not discussed.
The majority of the work in probabilistic or inaccurate low-power design has focused on adders and their derivative systems. Multipliers, on the other hand, are one of the primary sources of power consumption
in DSP applications such as Finite-Impulse-Response (FIR) filters [17]. This work is an extension of our
previous work in [18] and focuses on low-power approximate multiplier architectures. Our contributions
are as follows.
• We present a 2x2 underdesigned multiplier block and show how it can be used to build arbitrarily
large power efficient inaccurate multipliers. The architecture lends itself to easy tunability of error
and we present methods to correct error (at a power cost) if needed.
• We evaluate the operation of this multiplier for image filtering and JPEG applications and compare
it with voltage scaling and bitwidth truncation based methods.
• For a complete study, we also project power savings from different software configurations and
compare with our approach.
The rest of this paper is organized as follows. The inaccurate multiplier is described in Section II; Section III overviews our experimental setup and presents circuit-level power results; Section IV details the impact on real applications; Section V introduces a correction mechanism; and we conclude in Section VI.
II. INACCURATE MULTIPLIER
In this section we introduce the building block for our inaccurate multipliers and show how larger
multipliers with tunable error characteristics can be built from it.
A. Building Block
Our objective is to introduce error into the multiplier by manipulating its logic function. We use the 2x2
multiplier as a building block. The modified Karnaugh Map is shown in Figure 1, with the changed entry
highlighted. The motivation behind this change was the observation that it is possible to represent the
output of two-bit by two-bit multiplication using just three bits instead of four (since the possible outputs
are seven in number). By representing the output of 3 ∗ 3 using three bits (111) instead of the usual four
(1001), we are able to significantly reduce the complexity of the circuit. The resulting simpler circuit is
shown in Figure 2a, and has an output that is correct for fifteen out of the sixteen possible inputs. Error occurs with a magnitude of (9 − 7) = 2, with a probability of 1/16 (assuming a uniform input distribution).
The inaccurate version has close to half the area of the accurate version (Figure 2b; see Section III-B), a shorter and faster critical path, and less interconnect. Since the inaccurate version of the 2x2 multiplier has
smaller switching capacitance than its accurate counterpart, it offers the potential for significant dynamic
power reduction for the same frequency of operation.
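As a concrete illustration, the following C++ fragment is a purely functional sketch of the modified 2x2 block (our implementations are Verilog netlists synthesized as described in Section III-A; the function name is illustrative):

#include <cassert>
#include <cstdint>

// Functional model of the inaccurate 2x2 block: identical to exact
// multiplication except for the single changed Karnaugh-map entry,
// which maps 3*3 to 7 (111) instead of 9 (1001).
uint8_t mul2x2_inaccurate(uint8_t a, uint8_t x) {
    assert(a < 4 && x < 4);
    uint8_t exact = a * x;
    return (exact == 9) ? 7 : exact;
}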
B. Building Larger Multipliers
Larger multipliers can easily be built using smaller multiplier blocks. We build multipliers of higher bit-width by using the inaccurate 2x2 block to produce partial products and then adding the shifted versions of the partial products [19]. Figure 3 shows an example of a single 4x4 multiplier built out of four 2x2 blocks, where AH, XH and AL, XL are the upper and lower two bits of inputs A and X, respectively. Such an
approach can sometimes be restrictive for logic optimization and lead to sub-optimal results [19]. But the
simplicity of the inaccurate 2x2 building block means it does not suffer from this restriction. As a result, larger multiplier blocks can be built out of the 2x2 building block and still perform better in terms of
power and area as compared to accurate architectures. The results presented in Section III-B will reflect
this.
Note that our baseline architectures are not built using 2x2 components, but are regular multipliers optimized for dynamic power by the commercial synthesis tool RTL-Compiler (RC) [20].
The optimization of the adders (that produce the final product from the partial products) is also left to RC, for both the accurate and inaccurate cases. It is important to note that when building larger multipliers, we introduce inaccuracy via the partial products; the adders remain accurate. This makes our error rates easily
computable and is the topic of the following sub-section.
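Functionally, the Figure 3 construction corresponds to the following sketch (reusing the 2x2 model above; only the partial products are inaccurate, the shifted additions are exact):

#include <cstdint>

uint8_t mul2x2_inaccurate(uint8_t a, uint8_t x);  // functional model from Section II-A

// 4x4 multiplication from four 2x2 partial products (Figure 3). The partial
// products use the inaccurate block; the adders that combine them are exact.
uint16_t mul4x4_inaccurate(uint8_t a, uint8_t x) {
    uint8_t aL = a & 0x3, aH = (a >> 2) & 0x3;
    uint8_t xL = x & 0x3, xH = (x >> 2) & 0x3;
    return (uint16_t)mul2x2_inaccurate(aL, xL)
         + ((uint16_t)mul2x2_inaccurate(aH, xL) << 2)
         + ((uint16_t)mul2x2_inaccurate(aL, xH) << 2)
         + ((uint16_t)mul2x2_inaccurate(aH, xH) << 4);
}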
C. Error Rates
The 2x2 multiplier introduced in Section II-A has a small and easily computable error probability of 1/16 with a maximum (relative) error magnitude of 22.22%. But building multipliers of higher bit-widths using
the inaccurate multiplier as a building block, leads to slightly more complicated relationships for their
error rates. We developed simulation models in C++ to compute the error probabilities and mean error
for higher bit widths. The results in Table I show that while the max-possible error percentage remains
constant at 22.22% (maximum error occurs when all partial products are erroneous), the probability
of error rises with increasing bit-width. But the mean (relative) error (averaged over all possible input
combinations) increases slowly and then almost saturates at about 3.3%. The graph in Figure 4 gives a
clearer understanding of why the mean-error saturates - for higher bit widths the less significant errors are
dominant and the larger errors are increasingly unlikely. This results in an almost static mean-error. We
will also see in later sections, that the ∼ 3.3% mean-error compares well with the mean-error achieved
via other methods.
It is important to note that the error models discussed in this section all assume a uniform distribution of
input vectors. While certain applications may have different distributions for the input vectors, Section IV
shows that our design still performs well for such realistic applications and their input distributions.
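For reference, the kind of exhaustive computation behind these statistics can be sketched as follows for the 4-bit case (our actual error models are larger C++ simulations; this fragment reuses the 4x4 sketch above and assumes a uniform input distribution):

#include <cstdint>
#include <cstdio>

uint16_t mul4x4_inaccurate(uint8_t a, uint8_t x);  // sketch from Section II-B

int main() {
    int error_count = 0;
    double rel_error_sum = 0.0, max_rel_error = 0.0;
    for (int a = 0; a < 16; ++a) {
        for (int x = 0; x < 16; ++x) {
            int exact = a * x;
            int approx = (int)mul4x4_inaccurate((uint8_t)a, (uint8_t)x);
            if (approx != exact) ++error_count;
            if (exact != 0) {  // zero products are always exact and contribute no error
                double rel = (double)(exact - approx) / exact;
                rel_error_sum += rel;
                if (rel > max_rel_error) max_rel_error = rel;
            }
        }
    }
    std::printf("error prob = %.4f, mean error = %.2f%%, max error = %.2f%%\n",
                error_count / 256.0, 100.0 * rel_error_sum / 256.0, 100.0 * max_rel_error);
    return 0;
}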
D. Tunable Error
The inaccurate multiplier we have introduced so far has a fixed mean-error and error-probability for a
given bit-width (Table I), but a designer may want to exploit other points on the accuracy-power curve.
Since our multiplier is built using 2x2 components, it is possible to replace these individual components
with accurate versions to reduce the error rate and mean-error. Such a replacement results in smaller power
savings, but provides a means to achieve different points on the error vs. power savings trade-off curve.
The resulting power vs. accuracy curve for our inaccurate multiplier is shown in Figure 5. As expected,
increasing the mean-error results in greater power savings.
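To make this knob explicit, a functional sketch for the 4-bit case might expose per-block accuracy flags (the block indexing and function name are illustrative):

#include <array>
#include <cstdint>

uint8_t mul2x2_inaccurate(uint8_t a, uint8_t x);  // functional model from Section II-A

// Each of the four 2x2 blocks can independently be made accurate, reducing
// mean-error and error probability at the cost of some of the power savings.
uint16_t mul4x4_tunable(uint8_t a, uint8_t x, const std::array<bool, 4>& accurate) {
    uint8_t aL = a & 0x3, aH = (a >> 2) & 0x3;
    uint8_t xL = x & 0x3, xH = (x >> 2) & 0x3;
    auto mul = [&](uint8_t p, uint8_t q, int i) -> uint16_t {
        return accurate[i] ? (uint16_t)(p * q) : mul2x2_inaccurate(p, q);
    };
    return mul(aL, xL, 0) + (mul(aH, xL, 1) << 2)
         + (mul(aL, xH, 2) << 2) + (mul(aH, xH, 3) << 4);
}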
III. EXPERIMENTAL SETUP AND CIRCUIT LEVEL RESULTS
A. Experimental Setup
In this section we present a brief overview of our power measurement methodology and experimental setup.
All architectures were written in Verilog and synthesized by RC [20] to meet the target frequencies. The inaccurate multipliers were built using the 2x2 inaccurate building blocks to generate the partial products, but the adder network to generate the final product was generated and optimized by the tool and was
completely accurate. The accurate versions were implemented in two different ways:
1) By generating partial products and adding shifted versions of them as was done for the inaccurate
case;
2) The entire architecture selection and optimization of the multiplier is left to the synthesis tool.
The best result of the two was used for comparison for each case. To obtain accurate power numbers as
well as error characteristics, the synthesized netlists were simulated in NCSIM [21], using all possible
input vectors with back-annotated delays [22]. Resulting switching activity information is extracted using
a Value-Change-Dump (VCD) file [23], and fed to RC for dynamic power computation. The designs are
synthesized using the 45nm Nangate open cell library [24]. For voltage scaled versions of the multipliers
(used in Section III-E and Section IV-A), the original library was re-characterized at different voltage
points. The following flow was then used to generate an error model for the over-scaled multipliers:
• Characterize the library for lower voltages that achieve power savings of 30% and 50%, using a commercial library characterization flow.
• Extract timing for the baseline multiplier design as timed in the voltage-scaled library, in the form of an SDF file [22].
• Perform simulation with the SDF delays annotated on the netlist, compare outputs with the expected (accurate) outputs, and generate a CSV file of input combinations and associated errors.
• Use the CSV file to build an error model for the over-scaled multiplier in MATLAB. This model is used in the filtering algorithm in Section IV-A.
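As an illustration of the last step, the error model amounts to a lookup table keyed on the input pair (the paper builds this model in MATLAB; the C++ structure below and its names are only a sketch, and CSV parsing is omitted):

#include <cstdint>
#include <unordered_map>

struct OverscaledMultiplierModel {
    // Maps the packed input pair (a << 8) | x to the signed error observed
    // in the SDF-annotated simulation; inputs absent from the CSV are exact.
    std::unordered_map<uint32_t, int32_t> error;

    int32_t multiply(uint8_t a, uint8_t x) const {
        int32_t exact = (int32_t)a * (int32_t)x;
        auto it = error.find(((uint32_t)a << 8) | x);
        return (it == error.end()) ? exact : exact + it->second;
    }
};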
B. Power and Area Results
Figure 7 shows the reduction in dynamic/leakage power as well as area for a 4-bit inaccurate multiplier.
Table II shows the dynamic power reduction (31.8%−45.4%) at higher bit-widths and varying frequencies. We take measurements at five different frequency values (by re-synthesizing the design with the appropriate timing constraints) between F and 2F, where 2F is the maximum achievable frequency of the accurate multiplier. We observe that the power benefits of the 2x2 multiplier are carried forward to higher bit-widths. Also, increasing the frequency of operation results in greater benefits (Figure 6). This is
because the inaccurate version is inherently faster, and needs less aggressive gate sizing to meet increasing
frequency constraints. Less gate sizing results in smaller switching capacitance.
C. Design Level Power Savings
To confirm power savings in a larger design that uses multiplier components, we used the inaccurate
multiplier in a variety of designs from [25]. The results are presented in Table III. As expected, the power savings are best in multiplier-intensive designs such as the FIR filter, and far less pronounced in other designs such as the mini RISC processor. These results highlight that approximate arithmetic approaches
may not be useful for all designs.
D. Partial Products vs. Adder Tree
Our inaccurate multiplier design introduces errors via the partial products. Alternatively, it is also possible
to introduce the inaccuracy via the adders (used in reduction of partial products) of the multiplier, using
an inaccurate adder like the one introduced in [10]. In this section we evaluate an approach based on
introducing errors through the adders (while keeping the partial products accurate) and compare it to our
technique.
One of the issues with such an approach is that it is hard to analyze the errors, as noted in [2], making it
difficult to build a correction unit. For a comparison of the power-accuracy trade-off for such a system,
we used the inaccurate adder introduced in [10] to build inaccurate multipliers. Using accurate partial products and placing these inaccurate adders (the best locations were found by exhaustive search) at different points in the adder network, we were able to obtain the error-power tradeoff. It can be seen from
Table IV that the mean and max error from this technique are relatively large. Moreover, the power savings are roughly in the same range as what we encountered before. The accuracy-power tradeoff (Figure 8) for the partial product technique is better than that of the inaccurate adder technique.
E. Comparison with Voltage Over-scaling
We use the flow described in Section III-A to generate a model for a simple voltage over-scaling based
multiplier. In this section we compare the error rates resulting from such an approach with that of our
multipliers (see Table I). The results shown in Table V for voltage values that result in power savings of 30% and 50% (which is the rough range of power savings using the inaccurate multiplier) show that:
• The maximum possible error for the voltage scaling approach can be significantly larger than that introduced by our design.
• The mean error is larger than for either the partial product or full adder based approaches. This, combined with the large maximum error, can lead to significant degradation in the quality of the final solution, as we will see in Section IV-A.
The comparison suggests that introducing error via design offers a better tradeoff of power vs. accuracy
as compared to simple voltage over-scaling techniques.
With shrinking feature sizes, leakage power has become an important contributor to power consumption, especially in mobile and hand-held devices. As a result, leakage reduction techniques such as multi-Vth optimization are integral parts of modern design flows [26]. Previous work [26], [27] suggests that
techniques such as Vth swapping result in a "wall of slack" that can have adverse effects in the presence
of variability or voltage over-scaling. [26] suggests a slack redistribution approach to counter this effect,
which centers around gate up-sizing on frequently excited paths resulting in an area overhead. We used
the leakage power optimization flow in RC along with a commercial 65nm low-power library to observe
the effect of multi-Vth optimization on the efficiency of voltage over-scaling as a means to trade accuracy
for power.
We re-synthesize our baseline eight-bit multiplier to create two netlists, one using our regular synthesis flow and the other with leakage power optimization turned on. We then again use the flow described in
Section III-A to generate error models for both these netlists.
The results of voltage over-scaling for both netlists are shown in Table VI (the voltage was scaled from the nominal 1.2V to 1.1V). The netlist optimized for leakage power shows significantly worse error characteristics under voltage over-scaling than the one from the baseline flow. This can be explained by the presence of a large number of high-Vth cells in the leakage-optimized netlist. As a result, there are more timing paths with close to zero slack, and over-scaling the voltage causes a larger number of paths to violate timing, leading to more input vectors causing errors as well as an increase in the magnitude of errors.
Using the flow described above, we generate multiple netlists with differing amounts of leakage power optimization between the two points discussed above. Figure 9 shows the resulting curve and highlights that as the focus on leakage power optimization increases, the design performs progressively worse after voltage-scaling. This experiment suggests that unless countermeasures such as those discussed in [26] are employed, voltage over-scaling is likely to become less attractive as a means to trade power for accuracy as the emphasis on leakage-power optimization increases.
F. Comparison with Truncation
Truncation has been a common approach used for reducing power of multiplier architectures [15]. In
this section we compare truncation with our approach. Using the same power analysis setup as before
coupled with MATLAB models for a truncated multiplier, we are able to derive the power vs. accuracy
relationship for truncation techniques. The resulting curve is shown in Figure 10. We can make the following observations from Figure 10:
• Least-Significant-Bit (LSB) truncation offers a better tradeoff than Most-Significant-Bit (MSB) truncation.
• Both LSB and MSB truncation offer a much poorer tradeoff than both our proposed method and introducing errors through the adders of the multiplier.
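For clarity, one simple functional model of LSB truncation (the MATLAB models used in our experiments are not reproduced here; this is only a sketch of the general idea) zeroes the k least-significant bits of each operand before multiplying:

#include <cstdint>

// 8-bit multiply with the k LSBs of each operand truncated (set to zero);
// MSB truncation can be modeled analogously by masking the top bits instead.
uint16_t mul8x8_lsb_truncated(uint8_t a, uint8_t x, int k) {
    uint8_t a_t = (uint8_t)((a >> k) << k);
    uint8_t x_t = (uint8_t)((x >> k) << k);
    return (uint16_t)a_t * (uint16_t)x_t;
}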
IV. IMPACT ON REAL APPLICATIONS
In this section we test our inaccurate multiplier on two image processing applications and then compare
software based power-quality tradeoff to our hardware based technique on the JPEG image compression
algorithm.
A. Image Filtering
The first application we use is a Gaussian smoothing based image sharpening filter, modeled in MATLAB,
similar to the one used in [2]. This is done by convolving the image with a matrix identical to the one
presented in [2]. For the inaccurate filter, the 8-bit multiplication in the convolution is performed by an
inaccurate multiplier, using its corresponding MATLAB model. Figure 11 shows the results for accurate
as well as various inaccurate multiplier approaches. Our underdesigned multiplier has an average power
saving of 41.48% (design/application-level power savings were presented in Section III-C) with an SNR of 20.36dB. In comparison, the authors in [8] report an SNR of 19.63dB for a 21.7% power saving (though
for a different technology) over baseline, using four different voltage domains. Figure 11 (e) and (d) show
that our approach results in 2X - 8X better SNR when compared to simple voltage over-scaling [9]. This
suggests that image processing/filtering applications could employ the presented inaccurate multiplier with
significant power savings and minimal loss in image quality. Note that the SNR for the filtering application is defined between the accurately filtered image and the inaccurately filtered image; this was done for uniform comparison with [8], which uses this convention. For the JPEG application described next we revert to the more common definition, where SNR is defined between the original noise-less image and the
filtered result.
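The structure of such an experiment is sketched below; the filtering in the paper is done in MATLAB with the kernel of [2], so the kernel handling and the multiplier hook here are purely illustrative:

#include <cstdint>
#include <vector>

// Any multiplier model with this signature can be plugged in, e.g. an 8-bit
// inaccurate multiplier built from 2x2 blocks as in Section II-B.
using Mul8 = uint32_t (*)(uint8_t, uint8_t);

// Convolve a row-major grayscale image (width w, height h) with a kw x kw
// kernel of unsigned fixed-point coefficients, routing every multiplication
// through the supplied model. Borders and normalization are omitted.
std::vector<uint32_t> convolve(const std::vector<uint8_t>& img, int w, int h,
                               const std::vector<uint8_t>& kernel, int kw, Mul8 mul) {
    std::vector<uint32_t> out(img.size(), 0);
    int r = kw / 2;
    for (int y = r; y < h - r; ++y)
        for (int x = r; x < w - r; ++x) {
            uint32_t acc = 0;
            for (int ky = 0; ky < kw; ++ky)
                for (int kx = 0; kx < kw; ++kx)
                    acc += mul(img[(y + ky - r) * w + (x + kx - r)], kernel[ky * kw + kx]);
            out[y * w + x] = acc;
        }
    return out;
}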
B. Comparison with Software-level Power-Quality Tradeoff
As a second application we use a JPEG compression algorithm to observe the effects of our inaccurate
multiplier on a more complex application and to compare software and hardware based quality tradeoffs.
As before, we replace the multiplication in the JPEG algorithm with the model of the inaccurate multiplier.
Table VII compares compression quality for four images.
The JPEG algorithm can trade accuracy for runtime by reducing the number of coefficients used for compression, allowing for a software based tradeoff. To compare with the software approach, we synthesized
the inaccurate multiplier again to consume the same power as the accurate one but operate at a greater
frequency. This would result in a speed-up of the JPEG application, assuming that the multiplier constitutes
the critical path of the implementation. We first run the baseline JPEG algorithm while increasing the number of coefficients (runtime) used, resulting in the SNR vs. runtime curve shown in Figure 12. We use the same set of coefficients as before, but with the tunable inaccurate multiplier (Section II-D), resulting in
a different (lower) set of SNR and runtime data points. Using the frequency scaling factor from our
synthesis results, we derive a SNR vs. runtime curve for when the multiplier is on the critical path (scaled
inaccurate case in Figure 12). Figure 12 shows that for the JPEG application, the hardware based approach has limited benefits and the software based approach yields a better tradeoff, especially at higher SNR values. Also, if the multiplier does not constitute the critical path, then the software based approach offers a much better tradeoff.
In Section II we showed that the inaccurate multiplier can be built to have different values of mean-error and power consumption. Using the resulting accuracy-power curve (Figure 8), the frequency-power table previously presented (Table II), and the SNR vs. runtime curve derived above (Figure 12), we are
able to compare the accuracy vs. power curves of the hardware and software based approaches. In our
experiments the total runtime for the JPEG compression is kept constant. We use various configurations of
the inaccurate multiplier, each with a different mean-error and power consumption (Figure 8), yielding a
power vs. SNR curve for the hardware based approach. From the runtime vs. SNR curve in Figure 12 we
know the amount of runtime (hence number of coefficients) the software approach would need to achieve
the same SNR. Since we keep the runtime constant, we scale the frequency of operation appropriately and
use our power-frequency tables to derive a SNR vs. power relationship for the software approach. The
comparison of the two in Figure 13 shows that the hardware based approach still consumes less power than the software one to achieve the same SNR in a fixed amount of runtime, though the difference in power consumption is significantly smaller than that of the stand-alone inaccurate multiplier over the baseline. These experiments hold under the assumption that the multiplier determines the frequency of operation and consumes the bulk of the power.
V. ACCURATE MODE OF OPERATION
One of the advantages of our approach is that simple decoder logic can be used to detect the presence
and magnitude of errors for any input vector. The inaccurate output can then be corrected if/when needed
using this information. In this section we look at schemes for both complete and partial correction.
While complete correction that allows for an exact output might be the most common use-case, the smaller overheads of partial correction make it an interesting alternative worth exploring. We also look
at three different configurations that the correction block can be used in, each requiring varying degrees
of architectural support in the form of frequency-scaling, voltage-scaling and power-shutoff.
A. Complete Correction
In our experiments we performed correction by using a combination of a decoder to detect the error magnitude and an adder to add the correcting amount. Figure 18(a) shows an example of an error-detection and correction unit for the simple 2x2 case. The AND gate acts as a simple decoder, detecting the 3 ∗ 3 input vector, and the correcting adder adds the required amount (2) when the error-triggering input pattern is detected (and adds zero otherwise). Non-adder based correction logic can also be used; Figure 18(b) shows an example of non-adder based correction for the 2x2 multiplier.
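Functionally, the Figure 18(a) scheme for the 2x2 block amounts to the following sketch (a C++ model of the behavior, not the synthesized logic):

#include <cstdint>

uint8_t mul2x2_inaccurate(uint8_t a, uint8_t x);  // functional model from Section II-A

// Decoder-plus-adder correction: the only erroneous pattern (3*3) is detected
// and the missing amount (2) is added back; all other inputs pass through.
uint8_t mul2x2_corrected(uint8_t a, uint8_t x) {
    uint8_t inexact = mul2x2_inaccurate(a, x);
    uint8_t fix = (a == 3 && x == 3) ? 2 : 0;  // decoder output
    return inexact + fix;                      // correcting adder
}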
Such correction mechanisms involve an overhead, and will be less efficient in terms of area than the
baseline architecture. Therefore we envision a system (Figure 14) with two modes of operation - a regular,
non-critical and inaccurate mode, and a mission-critical and hence accurate mode. In the non-critical mode,
the correction unit will be power gated, resulting in the basic inaccurate operation, with its significant
power savings. In the critical mode of operation, the system produces an accurate result at the cost of greater power, and works at a slightly slower frequency. We re-ran our
initial experiments to evaluate this overhead by synthesizing the new architecture to work at 0.85 times
the original frequency in the accurate-mode, and at the same frequency as the baseline for the inaccurate
mode. As before, we evaluate the design at multiple frequencies. We observed an average area overhead
of 4.6% − 10.5% and in the inaccurate mode an average power overhead of 4.8% − 8.56% (Table VIII).
The above setup requires support for both power-shutoff/gating (for the correcting adder in non-critical
mode) and frequency scaling (to enable slower operation in critical/accurate mode). In the rest of this
sub-section we look at other simpler configurations that allow for two modes of operation but do not
involve frequency scaling.
The simplest approach to allow for two modes of operation without the need for frequency scaling is to
run both modes (of the inaccurate multiplier) at the same frequency and power-gate/shutoff the correcting
adder in the non-critical mode (as before). This means that the upper path in Figure 14 and the baseline
accurate architecture are synthesized to the same frequency. Power savings are attained from the fact that
with the correcting unit turned off, the switching capacitance of the rest of the inaccurate multiplier is
smaller than that of the baseline. The graph in Figure 15 plots the power savings of such a technique
with increasing frequency for two different bit-widths (four and eight). As before designs are evaluated at
frequencies between F and 2F, where 2F is the maximum possible achievable frequency of the baseline
accurate multiplier. We can make the following observations about such a configuration.
• The power savings are in a much smaller range than with our previous setup (which involves both frequency scaling and power shut-off). For example, the average power reduction for the four-bit multiplier is now 12.44% as compared to 36% before.
• The power savings for this configuration reduce with increasing frequency, and at higher frequencies such a configuration no longer offers a power benefit over the baseline. This is the opposite behavior to the earlier configuration, where power savings increased with increasing frequency (see Figure 13).
While the average power benefits are significantly smaller than with the previous setup (which involves frequency scaling), this configuration might still be a viable choice for low-frequency applications, where a designer can benefit from the lower power without resorting to frequency scaling.
The main reason the power benefits are not as impressive in such a configuration is that in the inaccurate
mode, there is a significant amount of timing slack that remains unused since the correction unit is
turned off. One way to improve on this and take advantage of the available timing slack (in the inaccurate
mode) is to also scale the voltage in the inaccurate mode, effectively turning the excess slack into power
reduction. The results of such an approach that involves both power-shutoff and voltage-scaling are shown
in Figure 16 and can be summarized as follows.
• The power savings for such an approach are almost as good as for the approach that involves frequency scaling. For example, the average power reduction for the four-bit multiplier is now 33.22% as compared to 36% before.
B. Partial Correction
In this section we introduce a partial correction mechanism that does not fully correct the errors of the
inaccurate multiplier but offers the benefit of a minimal overhead. A closer look at the error distribution of
the inaccurate multiplier reveals that the errors introduced by the multiplication of Most-Significant-Bits
(MSB) of the inputs are the largest in terms of magnitude of error. Detecting and correcting only these
errors can significantly reduce the mean-error. The detection and correction logic in such a scenario is a
smaller piece of circuitry and represents a much smaller overhead.
Table IX compares the error probabilities and mean-error for the partially corrected multiplier with the
original inaccurate multiplier. While the change in error probability is small, the overall reduction in
mean-error is significant for all bit-widths. The max-error remains unchanged at 22.22%.
Figure 17 shows what such a partial correction unit would look like for all bit-widths except the trivial
2x2 case. The following points are worth noting:
• The adder used for partial correction (Figure 17) is significantly smaller than the logic used for complete correction (Figure 14).
• The hardware used for the partial correction is constant: the partial correction block adds a three-bit input to a one-bit input irrespective of the input bit-width of the multiplier. As a result of its constant size, the overhead of the partial correction block actually reduces with increasing bit-width and becomes close to negligible for bit-widths of sixteen and above.
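One plausible functional reading of this partial correction, shown for the 4-bit case and reusing the earlier sketches (a sketch only, not the exact Figure 17 logic), is:

#include <cstdint>

uint16_t mul4x4_inaccurate(uint8_t a, uint8_t x);  // sketch from Section II-B

// Only the error of the most-significant 2x2 partial product (aH * xH) is
// detected and corrected; the lower-order blocks are left inaccurate.
uint16_t mul4x4_partially_corrected(uint8_t a, uint8_t x) {
    uint16_t p = mul4x4_inaccurate(a, x);
    uint8_t aH = (a >> 2) & 0x3, xH = (x >> 2) & 0x3;
    if (aH == 3 && xH == 3)
        p += (uint16_t)(2u << 4);  // restore the MSB block's missing 2, weighted by 2^4
    return p;
}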
We synthesize the new architecture with partial correction to work at 0.85 times the original frequency
in the partially-corrected mode, and at the same frequency as the baseline for the inaccurate mode.
After synthesizing this modified design for multiple frequencies, we observed an average (across multiple frequencies) area overhead of 1.26% − 6.22% and, in the partially-corrected mode, an average power overhead of 1.44% − 5.63% (Table X).
VI. CONCLUSION
With a mean error of 1.39%−3.32% and power savings between 30% and 50%, the underdesigned multiplier
architecture presented allows for trading of accuracy for power. It achieves 2X - 8X better SNR than simple
voltage over-scaling techniques, and does not suffer from overheads associated with the multiple voltage
domains of advanced over-scaling techniques. Our experiments with leakage-optimized netlists suggest
that voltage over-scaling might become a less attractive option to reduce power in heavily power optimized
designs. A simple correction mechanism is proposed for usage in a critical mode, when correctness of
output is needed. We show that for inaccurate multipliers introducing errors via partial products is more
promising than via the adders. We also show that both of these approaches afford a better tradeoff (of
power vs. accuracy) than bit-width truncation. The results suggest that design-for-error based techniques
have significant potential for power savings, and can be easily integrated into today’s automated ASIC
design flow. Future work includes extending the approach to other arithmetic components and an algorithm
for finding the point of maximum power benefit for a given error rate.
ACKNOWLEDGEMENT
The authors would like to thank Pratyush Aditya from Cadence Design Systems for his valuable feedback
and contributions towards setting up power optimization for multiplier components generated by RC.
REFERENCES
[1] K. V. Palem, “Energy aware computing through probabilistic switching: A study of limits,” IEEE Trans. Comput., vol. 54, no. 9, pp.
1123–1137, 2005.
[2] M. S. Lau, K.-V. Ling, and Y.-C. Chu, “Energy-aware probabilistic multiplier: design and analysis,” in CASES ’09: Proceedings of the
2009 international conference on Compilers, architecture, and synthesis for embedded systems. New York, NY, USA: ACM, 2009,
pp. 281–290.
[3] M. Lamoureux, “The poorman’s transform: approximating the fourier transform without multiplication,” Signal Processing, IEEE
Transactions on, vol. 41, no. 3, pp. 1413 –1415, mar 1993.
[4] I. Reed, D. Tufts, X. Yu, T. Truong, M.-T. Shih, and X. Yin, “Fourier analysis and signal processing by use of the mobius inversion
formula,” Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 38, no. 3, pp. 458 –470, mar 1990.
[5] G. Boudreaux-Bartels and T. Parks, “Discrete fourier transform using summation by parts,” in Acoustics, Speech, and Signal Processing,
IEEE International Conference on ICASSP ’87., vol. 12, apr 1987, pp. 1827 – 1830.
[6] M. Bhardwaj, R. Min, and A. Chandrakasan, “Power-aware systems,” in Signals, Systems and Computers, 2000. Conference Record
of the Thirty-Fourth Asilomar Conference on, vol. 2, 2000, pp. 1695 –1701 vol.2.
[7] N. Banerjee, G. Karakonstantis, J. H. Choi, C. Chakrabarti, and K. Roy, “Design methodology for low power and parametric robustness
through output-quality modulation: application to color-interpolation filtering,” Trans. Comp.-Aided Des. Integ. Cir. Sys., vol. 28, no. 8,
pp. 1127–1137, 2009.
[8] J. George, B. Marr, B. E. S. Akgul, and K. V. Palem, “Probabilistic arithmetic and energy efficient embedded signal processing,” in
CASES ’06: Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems. New
York, NY, USA: ACM, 2006, pp. 158–168.
[9] R. Hegde and N. R. Shanbhag, “Energy-efficient signal processing via algorithmic noise-tolerance,” in ISLPED ’99: Proceedings of
the 1999 international symposium on Low power electronics and design. New York, NY, USA: ACM, 1999, pp. 30–35.
[10] D. Shin and S. K. Gupta, “A re-design technique for datapath modules in error tolerant applications,” Asian Test Symposium, vol. 0,
pp. 431–437, 2008.
[11] D. Kelly and B. Phillips, “Arithmetic data value speculation,” in Asia-Pacific Computer Systems Architecture Conference, 2005, pp.
353–366.
[12] S.-L. Lu, “Speeding up processing with approximation circuits,” Computer, vol. 37, no. 3, pp. 67–73, 2004.
[13] D. Shin and S. K. Gupta, “Approximate logic synthesis for error tolerant applications,” in Proc. 13th IEEE Design, Automation and Test in Europe, 2010.
[14] B. J. Phillips, D. R. Kelly, and B. W. Ng, “Estimating adders for a low density parity check decoder,” F. T. Luk, Ed., vol. 6313,
no. 1. SPIE, 2006, p. 631302. [Online]. Available: http://link.aip.org/link/?PSI/6313/631302/1
[15] J. Tong, D. Nagle, and R. Rutenbar, “Reducing power by optimizing the necessary precision/range of floating-point arithmetic,” Very
Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 8, no. 3, pp. 273 –286, jun. 2000.
[16] M. R. Choudhury and K. Mohanram, “Approximate logic circuits for low overhead, non-intrusive concurrent error detection,” in
Proceedings of the conference on Design, automation and test in Europe, ser. DATE ’08. New York, NY, USA: ACM, 2008, pp.
903–908. [Online]. Available: http://doi.acm.org/10.1145/1403375.1403593
[17] S. K. Sangjin, S. Hong, M. C. Papaefthymiou, and W. E. Stark, “Low power parallel multiplier design for DSP applications through coefficient optimization,” in Proc. 12th IEEE International ASIC/SOC Conference, pp. 286–290.
[18] P. Kulkarni, P. Gupta, and M. Ercegovac, “Trading accuracy for power with an underdesigned multiplier architecture,” VLSI Design,
International Conference on, vol. 0, pp. 346–351, 2011.
[19] I. Koren, Computer Arithmetic Algorithms, 2nd ed. A.K. Peters, 2002.
[20] “Cadence rtl-compiler,” http://www.cadence.com/products/ld/rtl_compiler/pages/default.aspx.
[21] “Cadence Incisive simulator,” http://www.cadence.com/products/ld/design_team_simulator/pages/default.aspx.
[22] “Standard delay format,” http://www.vhdl.org/sdf/sdf_3.0.pdf.
[23] “Value change dump file,” http://en.wikipedia.org/wiki/Value_change_dump.
[24] “Nangate open cell library,” http://www.nangate.com/index.php?option=com_content&task=view&id=137&Itemid=137.
[25] “Opencores,” http://www.opencores.org/.
[26] A. B. Kahng, S. Kang, R. Kumar, and J. Sartori, “Slack redistribution for graceful degradation under voltage overscaling,” in
Proceedings of the 2010 Asia and South Pacific Design Automation Conference, ser. ASPDAC ’10. Piscataway, NJ, USA: IEEE
Press, 2010, pp. 825–831. [Online]. Available: http://portal.acm.org/citation.cfm?id=1899721.1899911
[27] X. Bai, C. Visweswariah, and P. N. Strenski, “Uncertainty-aware circuit optimization,” in Proceedings of the 39th annual
Design Automation Conference, ser. DAC ’02. New York, NY, USA: ACM, 2002, pp. 58–63. [Online]. Available:
http://doi.acm.org/10.1145/513918.513935
Parag Kulkarni is currently an Advanced Design Engineer at Altera Corporation. He received the B.Tech degree
in Information and Communication Technology from DA-IICT (www.da-iict.ac.in) in 2007 and M.S degree in Electrical Engineering in 2011 from University of California, Los Angeles where he was part of the Nano-CAD lab
(http://nanocad.ee.ucla.edu/). His research interests include low-power VLSI Design and CAD. Prior to graduate school
he worked at Cadence Design Systems in the logic and physical synthesis teams.
Puneet Gupta (http://www.ee.ucla.edu/~puneet) is currently a faculty member of the Electrical Engineering Department at UCLA. He received the B.Tech degree in Electrical Engineering from the Indian Institute of
Technology, Delhi in 2000 and Ph.D. in 2007 from University of California, San Diego. He co-founded Blaze DFM
Inc. (acquired by Tela Inc.) in 2004 and served as its product architect till 2007.
He has authored over 60 papers, ten U.S. patents, and a book chapter. He is a recipient of NSF CAREER award,
ACM/SIGDA Outstanding New Faculty Award, European Design Automation Association Outstanding Dissertation
Award and IBM Ph.D. fellowship. Dr. Puneet Gupta has given tutorial talks at DAC, ICCAD, Intl. VLSI Design
Conference and SPIE Advanced Lithography Symposium. He has served on the Technical Program Committee of
DAC, ICCAD, ASPDAC, ISQED, ICCD, SLIP and VLSI Design. He served as the Program Chair of IEEE DFM&Y
Workshop 2009, 2010.
Dr. Gupta’s research has focused on building high-value bridges across application-architecture-implementation-fabrication interfaces for
lowered cost and power, increased yield and improved predictability of integrated circuits and systems.
Miloš D. Ercegovac is a professor and a former chair in the Computer Science Department of the Henry Samueli
School of Engineering and Applied Science, University of California at Los Angeles, where he has been on the
faculty since 1975. He earned his MS (’72) and PhD (’75) in computer science from the University of Illinois,
Urbana-Champaign, and BS in electrical engineering (’65) from the University of Belgrade, Serbia. Dr. Ercegovac has
specialized in research and teaching in digital arithmetic, digital and computer system design, and parallel architectures.
His dedication to teaching and research has also resulted in several co-authored books: two in the area of digital design
and two in digital arithmetic. He received the Lockheed-Martin Excellence in Teaching award in 2009. Dr. Ercegovac
has been involved in organizing the IEEE Symposia on Computer Arithmetic since 1978. He served as an associate
editor of the IEEE Transactions on Computers 1988 -1992 and as a subject area editor for the Journal of Parallel
and Distributed Computing 1986 -1993. Dr. Ercegovac’s work has been recognized by his election in 2003 to IEEE Fellow and to Foreign
Member of the Serbian Academy of Sciences and Arts in Belgrade, Serbia. He is a member of the ACM and of the IEEE Computer Society.
TABLES
Table I: Error probability and mean-error for varying bit-widths

Bit-Width | Error-Prob | Mean-Error | Max-Error
2         | 0.0625     | 1.39%      | 22.22%
4         | 0.19       | 2.60%      | 22.22%
8         | 0.46       | 3.25%      | 22.22%
12        | 0.675      | 3.31%      | 22.22%
16        | 0.81       | 3.32%      | 22.22%
Table II: Dynamic power reduction with increasing bit-widths for various frequencies

Bit-Width | F (%) | 1.25F (%) | 1.5F (%) | 1.75F (%) | 2F (%) | Avg. (%)
2         | 44.9  | 42.1      | 42.1     | 48.9      | 48.9   | 45.4
4         | 13.7  | 31.6      | 44.8     | 44.7      | 46.5   | 36.3
8         | 33.1  | 40.4      | 26.3     | 48.8      | 58.9   | 41.5
16        | 25.6  | 29.6      | 32.4     | 33.8      | 37.4   | 31.8
Table III: Design level power savings

Design | # Multipliers | # Gates (~) | Mult. Power Reduction | Total Power Reduction
FFT    | 32            | 158K        | 25.166%               | 13.98%
FIR    | 4             | 1.1K        | 31.09%                | 18.30%
RISC   | 1             | 10K         | 28.04%                | 1.51%
Table IV: Error rates and power savings for inaccurate adder based eight bit multiplier

Error Prob. | Mean Error | Max Error | Power Reduction
0.29        | 10.07%     | 62.22%    | 37.89%
0.23        | 6.01%      | 57.14%    | 32.35%
0.20        | 4.40%      | 57.14%    | 28.87%
0.15        | 3.40%      | 57.14%    | 20.16%
0.16        | 3.04%      | 57.14%    | 16.83%
0.19        | 2.92%      | 44.44%    | 19.81%
0.12        | 2.18%      | 44.44%    | 12.14%
Table V: Error probability and mean-error for Voltage-Scaling

Scaled-Voltage | Error-Prob | Mean-Error | Max-Error
0.90V          | 0.89       | 20.86%     | 100.0%
0.77V          | 0.99       | 38.07%     | 100.0%
Table VI: Effect of Leakage-Power optimization on Efficiency of Voltage-Scaling

Metric      | Leakage Optimized Netlist | Regular Netlist
Error Prob. | 0.98                      | 0.4577
Mean-Error  | 46.007%                   | 10.008%
Max-Error   | 100.0%                    | 100.0%
Table VII: JPEG compression using the inaccurate multiplier

Image      | Inaccurate SNR (dB) | Accurate SNR (dB) | SNR Reduction
Lenna      | 25.19               | 25.56             | 1.44%
Coins      | 19.31               | 19.55             | 1.22%
Sand-Dunes | 36.5                | 37.94             | 3.79%
Fireman    | 30.23               | 30.89             | 2.13%
Average    | -                   | -                 | 2.15%
Table VIII: Accurate operation area and power overhead

Bit Width | Average Area Overhead | Max Area Overhead | Average Power Overhead | Max Power Overhead
2         | 4.6%                  | 8.14%             | 4.8%                   | 8.32%
4         | 5.87%                 | 7.67%             | 7.5%                   | 16.14%
8         | 10.5%                 | 13.44%            | 8.56%                  | 12.88%
16        | 9.875%                | 13.85%            | 8.22%                  | 13.67%
Table IX: Mean-Error post partial correction

Bit Width | Error Prob. (Original) | Error Prob. (Partially Corrected) | Mean-Error (Original) | Mean-Error (Partially Corrected)
4         | 0.19                   | 0.15                              | 2.60%                 | 1.49%
8         | 0.46                   | 0.45                              | 3.25%                 | 2.21%
12        | 0.67                   | 0.67                              | 3.31%                 | 2.28%
16        | 0.81                   | 0.80                              | 3.32%                 | 2.29%
Table X: Partial correction area and power overhead

Bit Width | Average Area Overhead | Max Area Overhead | Average Power Overhead | Max Power Overhead
2         | 4.6%                  | 8.14%             | 4.8%                   | 8.32%
4         | 6.22%                 | 6.47%             | 5.63%                  | 6.44%
8         | 2.79%                 | 3.24%             | 3.48%                  | 4.33%
16        | 1.26%                 | 1.68%             | 1.44%                  | 1.57%
FIGURE CAPTIONS
Figure 1. Karnaugh-Map for the inaccurate 2x2 multiplier
Figure 2. The accurate (b) and inaccurate (a) 2x2 multipliers, with the critical paths highlighted
Figure 3. Building larger multipliers from smaller blocks
Figure 4. Error percentage distribution for 4-bit and 12-bit inaccurate multipliers
Figure 5. Accuracy vs. power tradeoff for the 4-bit inaccurate multiplier. Increasing mean-error leads to an increase in power savings.
Figure 6. Dynamic power vs. frequency for 4-bit accurate and inaccurate multipliers
Figure 7. Dynamic power, leakage power and area savings for a 4-bit inaccurate multiplier
Figure 8. Accuracy vs. power tradeoff comparison for partial product and adder based approaches. The proposed partial product based approach yields a much better tradeoff.
Figure 9. Accuracy vs. power tradeoff for truncation for an 8-bit multiplier. Both adder and partial-product based approaches offer better tradeoffs.
Figure 10. Accuracy vs. power tradeoff for truncation. Both adder and partial-product based approaches offer better tradeoffs.
Figure 11. Image sharpening: (a) original blurred image; (b) enhanced using accurate multiplier; (c) by inaccurate multiplier, power reduction 41.5%, SNR: 20.365dB; (d) voltage over-scaling for 30% power reduction, SNR: 9.16dB; (e) voltage over-scaling for 50% power reduction, SNR: 2.64dB; (f) by introducing errors via the adders, SNR: 7.3dB
Figure 12. Run-time-accuracy tradeoff for software and hardware based approaches for JPEG compression. Accurate: accurate hardware, purely software based tradeoff. Inaccurate: inaccurate multiplier, when not in critical path. Scaled Inaccurate: inaccurate multiplier and in critical path.
Figure 13. Power-accuracy tradeoff for software and hardware based approaches for JPEG compression
Figure 14. Accurate mode extension; the upper path is for accurate operation and the lower path is for inaccurate operation
Figure 15. Power savings as a function of increasing frequency for the configuration that utilizes only power-shutoff. With increasing target frequency this approach becomes less viable.
Figure 16. Power savings as a function of increasing frequency for the configuration that utilizes power-shutoff and voltage-scaling.
Figure 17. Partial correction, which fixes errors introduced by MSB multiplication.
Figure 18. Two types of correction logic for the 2x2 inaccurate multiplier: (a) error detection and correction logic using an adder; (b) non-adder based correction.