Trading Accuracy for Power in a Multiplier Architecture Parag Kulkarni(paragk@ucla.edu)∗ , Puneet Gupta(puneet@ee.ucla.edu)∗ , Miloš D. Ercegovac(milos@cs.ucla.edu)† ∗ Department of Electrical Engineering, University of California, Los Angeles † Department of Computer Science, University of California, Los Angeles Abstract Certain classes of applications are inherently capable of absorbing some error in computation, which allows for quality to be traded off for power. Such a tradeoff is often achieved through voltage over-scaling. We propose a novel multiplier architecture with tunable error characteristics, that leverages a modified inaccurate 2x2 multiplier as its building block. Our inaccurate multipliers achieve an average power saving of 31.78% − 45.4% over corresponding accurate multiplier designs, for an average error of 1.39% − 3.32%. We compare our architecture with other approaches, such as voltage scaling, for introducing error in a multiplier. Using image filtering and JPEG compression as sample applications we show that our architecture can achieve 2X - 8X better SignalNoise-Ratio (SNR) for the same power savings when compared to recent voltage over-scaling based power-error tradeoff methods. We project the multiplier power savings to bigger designs highlighting the fact that the benefits are strongly design-dependent. We compare this circuit-centric approach to power-quality tradeoffs with a pure software adaptation approach for a JPEG example. Unlike recent design-for-error approaches for arithmetic logic, we also enhance the design to allow for correct operation of the multiplier using a correction unit, for non error-resilient applications which share the hardware resource. Index Terms Low-Power Circuits, Inexact Logic, Digital arithmetic, Very-large-scale integration, Voltage-Scaling. I. I NTRODUCTION NERGY consumption is a critical design criterion for today’s embedded and mobile systems. Significant effort has already been devoted to improve energy efficiency at various levels, from software, to architecture all the way down to circuit and device levels. Recently there has also been some work in the direction of trading accuracy for power. The work in [1] introduced the concept of a probabilistic switch, which has a non-zero probability of producing an incorrect output. This form of a probabilistic CMOS methodology is considered to be a promising basis for energy savings for Digital-Signal-Processing(DSP) applications [2]. Techniques which trade energy for quality of final solution are typically at the algorithmic level, where parameters such as quantization levels and precision of coefficients are traded for the quality of solution [3]–[5]. However, they utilize deterministic or accurate building blocks and primitives - that is their constituent arithmetic operations are always correct - with energy or performance efficiency obtained through a reduction in size and/or number of these building blocks. The work in [6] aims to improve the power-awareness of various blocks, including a multiplier. They do so by replacing a monolithic multiplier by multiple blocks, each optimized for different input precisions along with logic to dynamically route the data to the correct block depending on input width. While this improves the power dissipation it adversely affects the area of the implementation. Any application which can withstand bounded and relatively small errors from their constituent components stands to gain from inaccurate but low-power building blocks. For instance, [7] uses color interpolation filtering to demonstrate graceful degradation of SNR during voltage-scaling by ensuring that important computations are least affected. [8] demonstrated how correctness of arithmetic primitives themselves can be traded for energy consumed. In [9], the authors scale the voltage below the minimum voltage E supply value needed, so as to trade accuracy for power. The authors in [1] improved on this by using the observation that errors in the most-significant bits affect the quality of the solution more as compared to lower bits, hence they operate adders at more significant bits with a higher voltage and over-scale the voltages for lower bits. [2] introduced the first methodology for a voltage scaling based inaccurate multiplier, their Monte-Carlo simulation based approach achieves a 50% reduction in power using four voltage domains and they use a filtering application to evaluate their approach. Such advanced voltage over-scaling techniques which require multiple voltage domains within a single arithmetic unit is likely to be impractical for a realistic design flow due to layout, voltage level conversion etc., overheads which are ignored by [2] and others. Though, most existing work [1], [2], [8] introduces errors by intelligently over-scaling voltage supplies, there has been some work in the direction of introducing error into a system via manipulation of its logicfunction, for adders [10]–[12] as well as generic combinational logic [13]. The focus of the optimization though is not power in either case and the latter paper acknowledges poor results for multi-output logic such as arithmetic units. The authors in [10] re-design various common adder architectures exploiting the error-tolerance of underlying applications so as to improve worst case delays, with the objective of improving functional yield. In [13] the objective is to maximally reduce the area of a combinational circuit during logic synthesis (using literal count as the optimization objective), given an error rate threshold. [14] uses inaccurate 4 : 2 counters to build adders with fewer stages of logic with power savings of ∼ 3% − 8%. [15] reports power savings of up to 66% without affecting accuracy of programs that manipulate low resolution data, by reducing the bitwidth of floating point multipliers. None of these works that introduce error via design provide any way to correct the incorrect output if needed. This may be especially important for general purpose computing hardware which runs a variety of applications. [16] describes a generic algorithm for the synthesis of approximate logic circuits, their objective is to create low-overhead solutions for error detection in the original circuit (which is accurate but susceptible to errors/faults). Power is not one of the optimization objectives and an extension to provide correction logic is not discussed. Majority of the work in probabilistic or inaccurate low-power design has focused on adders and their derivative systems. Multipliers on the other hand are one of the primary sources of power consumption in DSP applications such as Finite-Impulse-Response (FIR) filters [17]. This work is an extension of our previous work in [18] and focuses on low-power approximate multiplier architectures. Our contributions are as follows. • We present a 2x2 underdesigned multiplier block and show how it can be used to build arbitrarily large power efficient inaccurate multipliers. The architecture lends itself to easy tunability of error and we present methods to correct error (at a power cost) if needed. • We evaluate the operation of this multiplier for image filtering and JPEG applications and compare it with voltage scaling and bitwidth truncation based methods. • For a complete study, we also project power savings from different software configurations and compare with our approach. Rest of this paper is organized as follows. The inaccurate multiplier is described in Section II, Section III overviews our experimental setup and presents circuit level power results, Section IV details the impact on real applications and Section V introduces a correction mechanism and we conclude in Section VI. II. I NACCURATE M ULTIPLIER In this section we introduce the building block for our inaccurate multipliers and show how larger multipliers with tunable error characteristics can be built from it. A. Building Block Our objective is to introduce error into the multiplier by manipulating its logic function. We use the 2x2 multiplier as a building block. The modified Karnaugh Map is shown in Figure 1, with the changed entry highlighted. The motivation behind this change was the observation that it is possible to represent the output of two-bit by two-bit multiplication using just three bits instead of four (since the possible outputs are seven in number). By representing the output of 3 ∗ 3 using three bits (111) instead of the usual four (1001), we are able to significantly reduce the complexity of the circuit. The resulting simpler circuit is shown in Figure 2a, and has an output that is correct for fifteen out of the sixteen possible inputs. Error 1 occurs with a magnitude of (9 − 7) = 2, with a probability of 16 (assuming a uniform input distribution). The inaccurate version has close to half the area of the accurate (Figure 2b) version (see Section III-B); a shorter and faster critical path and less interconnect. Since the inaccurate version of the 2x2 multiplier has smaller switching capacitance than its accurate counterpart, it offers the potential for significant dynamic power reduction for the same frequency of operation. B. Building Larger Multipliers Larger multipliers can easily be built using smaller multiplier blocks. We build multipliers of higher bitwidth by using the inaccurate 2x2 block to produce partial products and then adding the shifted versions of the partial products [19]. Figure 3, shows an example of a single 4x4 multiplier built out of four 2x2 blocks, where AH , XH and AL , XL are the upper and lower two bits of inputs A, X respectively. Such an approach can sometimes be restrictive for logic optimization and lead to sub-optimal results [19]. But the simplicity of the inaccurate 2x2 building block means it does not suffer from this restriction. As a result larger multiplier blocks can be built out of the 2x2 building block, and still perform better in terms of power and area as compared to accurate architectures. The results presented in Section III-B will reflect this. Note that our baseline architectures are not built using 2x2 components, but are regular dynamic-power optimized multipliers which are optimized by a commercial synthesis tool RTL-Compiler (RC) [20]. The optimization of the adders (that produce final product from partial products) is also left to RC, for both the accurate and inaccurate cases. It is important to note that when building larger multipliers, we introduce inaccuracy via the partial products, the adders remain accurate. This makes our error rates easily computable and is the topic of the following sub-section. C. Error Rates The 2x2 multiplier introduced in Section II-A, has a small and easily computable error probability of 1 16 with a max (relative) error magnitude of 22.22%. But building multipliers of higher bit widths using the inaccurate multiplier as a building block, leads to slightly more complicated relationships for their error rates. We developed simulation models in C++ to compute the error probabilities and mean error for higher bit widths. The results in Table I show that while the max-possible error percentage remains constant at 22.22% (maximum error occurs when all partial products are erroneous), the probability of error rises with increasing bit-width. But the mean (relative) error (averaged over all possible input combinations) increases slowly and then almost saturates at about 3.3%. The graph in Figure 4 gives a clearer understanding of why the mean-error saturates - for higher bit widths the less significant errors are dominant and the larger errors are increasingly unlikely. This results in an almost static mean-error. We will also see in later sections, that the ∼ 3.3% mean-error compares well with the mean-error achieved via other methods. It is important to note that the error models discussed in this section all assume a uniform distribution of input vectors. While certain applications may have different distributions for the input vectors, Section IV shows that our design still performs well for such realistic applications and their input distributions. D. Tunable Error The inaccurate multiplier we have introduced so far has a fixed mean-error and error-probability for a given bit-width (Table I), but a designer may want to exploit other points on the accuracy-power curve. Since our multiplier is built using 2x2 components, it is possible to replace these individual components with accurate versions to reduce the error rate and mean-error. Such a replacement results in smaller power savings, but provides a means to achieve different points on the error vs. power savings trade-off curve. The resulting power vs. accuracy curve for our inaccurate multiplier is shown in Figure 5. As expected, increasing the mean-error results in greater power savings. III. E XPERIMENTAL SETUP AND CIRCUIT LEVEL RESULTS A. Experimental Setup In this section we present a brief overview of our power measurement methodology and experimental setup. All architectures were written in Verilog, and synthesized by RC [20] to meet the target frequencies. The inaccurate multipliers were built using the 2x2 inaccurate building blocks to generate the partial products; but the adder network to generate the final product was generated and optimized by the tool and was completely accurate. The accurate versions were implemented in two different ways: 1) By generating partial products and adding shifted versions of them as was done for the inaccurate case; 2) The entire architecture selection and optimization of the multiplier is left to synthesis tool. The best result of the two was used for comparison for each case. To obtain accurate power numbers as well as error characteristics, the synthesized netlists were simulated in NCSIM [21], using all possible input vectors with back-annotated delays [22]. Resulting switching activity information is extracted using a Value-Change-Dump (VCD) file [23], and fed to RC for dynamic power computation. The designs are synthesized using the 45nm Nangate open cell library [24]. For voltage scaled versions of the multipliers (used in Section III-E and Section IV-A), the original library was re-characterized at different voltage points. The following flow was then used to generate an error model for the over-scaled multipliers • • • • Characterize the library for lower voltages that achieve power savings of 30% and 50% using a commercial library characterization flow. Extract timing for the baseline multiplier design as timed in the voltage-scaled library in the form of an SDF file [22]. Perform simulation with the SDF delays annotated on the netlist and compare outputs with expected (accurate) outputs and generate a CSV file of input combinations and associated errors. Use the CSV file to build an error model for the over-scaled multiplier in MATLAB. This model is used in the filtering algorithm in Section IV-A B. Power and Area Results Figure 7 shows the reduction in dynamic/leakage power as well as area for a 4-bit inaccurate multiplier. Table II shows the dynamic power reduction (31.8%−45.4% ) at higher bit-widths and varying frequencies. We take measurements at five different frequency values (by re-synthesizing the design with the appropriate timing constraints) between F and 2F, where 2F is the maximum possible achievable frequency of the accurate multiplier. We observe that the power benefits of the 2x2 multiplier are carried forward to higher bit-widths. Also increasing the frequency of operation results in greater benefits (Figure 6). This is because the inaccurate version is inherently faster, and needs less aggressive gate sizing to meet increasing frequency constraints. Less gate sizing results in smaller switching capacitance. C. Design Level Power Savings To confirm power savings in a larger design that uses multiplier components, we used the inaccurate multiplier in a variety of designs from [25]. The results are presented in Table III. As expected the power savings are best in multiplier intensive designs such as the FIR filter, and far less pronounced on other designs such as the mini RISC processor. These results highlight that approximate arithmetic approaches may not be useful for all designs. D. Partial Products vs. Adder Tree Our inaccurate multiplier design introduces errors via the partial products. Alternatively it is also possible to introduce the inaccuracy via the adders (used in reduction of partial products) of the multiplier, using an inaccurate adder like the one introduced in [10]. In this section we evaluate an approach based on introducing errors through the adders (while keeping the partial products accurate) and compare it to our technique. One of the issues with such an approach is that it is hard to analyze the errors, as noted in [2], making it difficult to build a correction unit. For a comparison of the power-accuracy trade-off for such a system, we used the inaccurate adder introduced in [10], to build inaccurate multipliers. Using accurate partial products and by placing these inaccurate adders (best possible locations were exhaustively searched) at different points in the adder network we were able to obtain the error-power tradeoff. It can be seen from Table IV that the mean and max error from this technique is relatively large. Moreover, the power savings are roughly in the same range as what we encountered before. The accuracy-power tradeoff (Figure 8) for the partial product technique is better than the inaccurate adder technique. E. Comparison with Voltage Over-scaling We use the flow described in Section III-A to generate a model for a simple voltage over-scaling based multiplier. In this section we compare the error rates resulting from such an approach with that of our multipliers (see Table I). The results shown in Table V for voltage values that result in power savings of 30% and 50% (which is the rough range of power savings using the inaccurate multiplier) show that : • • The maximum possible error for the voltage scaling approach can be significantly larger than that introduced by our design. The mean error is larger than either the partial product or full adder based approaches. This combined with the large maximum error can lead to significant degradation in quality of final solution, as we will see in Section IV-A The comparison suggests that introducing error via design offers a better tradeoff of power vs. accuracy as compared to simple voltage over-scaling techniques. With shrinking feature size leakage power has become an important contributor to power consumption, especially in mobile and hand-held devices. As a result leakage reduction techniques such as multiVth optimization are integral parts of modern design flows [26]. Previous work [26], [27] suggests that techniques such as Vth swapping result in a "wall of slack" that can have adverse effects in the presence of variability or voltage over-scaling. [26] suggests a slack redistribution approach to counter this effect, which centers around gate up-sizing on frequently excited paths resulting in an area overhead. We used the leakage power optimization flow in RC along with an commercial 65nm low-power library to observe the effect of multi-Vth optimization on the efficiency of voltage over-scaling as a means to trade accuracy for power. We re-synthesize our baseline eight bit multiplier to create two netlists, using both our regular synthesis flow and the other with leakage power optimization turned on. We then again use the flow described in Section III-A to generate error models for both these netlists. The results of voltage over-scaling for both netlists are shown in Table VI (voltage was scaled from nominal voltage of 1.2V to 1.1V ). The netlist that is optimized for leakage-power shows significantly worse error characteristics on voltage over-scaling as compared to the baseline flow. This can be explained by the presence of a large number of high-Vth cells in the leakage-optimized netlist. As a result there are more timing-paths with close to zero slack and over-scaling of voltage results in a larger number of paths violating timing. Leading to a larger number of input vectors causing errors as well as an increase in magnitude of errors. Using the flow described above we generate multiple netlists with differing amounts of leakage power optimization between the two points discussed above. Figure 9 shows the resulting curve and highlights that as the focus on leakage power optimization increases the design performs progressively worse post voltage-scaling. This experiment suggests that unless counter measures such as those discussed in [26] are employed, voltage over-scaling is likely to become less attractive as a means to trade power with accuracy with increasing emphasis on leakage-power optimization. F. Comparison with Truncation Truncation has been a common approach used for reducing power of multiplier architectures [15]. In this section we compare truncation with our approach. Using the same power analysis setup as before coupled with MATLAB models for a truncated multiplier, we are able to derive the power vs. accuracy relationship for truncation techniques. The resulting curve is shown in Figure 10. We can make the following observations from Figure 10 • • Least-Significant-Bit (LSB) truncation offers a better tradeoff than that achieved by Most-SignificantBit (MSB) truncation. Both LSB and MSB truncation offer a much poorer tradeoff than both our proposed method as well as introducing errors through the adders of the multiplier. IV. I MPACT ON REAL APPLICATIONS In this section we test our inaccurate multiplier on two image processing applications and then compare software based power-quality tradeoff to our hardware based technique on the JPEG image compression algorithm. A. Image Filtering The first application we use is a Gaussian smoothing based image sharpening filter, modeled in MATLAB, similar to the one used in [2]. This is done by convolving the image with a matrix identical to the one presented in [2]. For the inaccurate filter, the 8-bit multiplication in the convolution is performed by an inaccurate multiplier, using its corresponding MATLAB model. Figure 11 shows the results for accurate as well as various inaccurate multiplier approaches. Our underdesigned multiplier has an average power saving of 41.48% (design/application level power savings were presented in Section III-C ) with a SNR of 20.36dB. In comparison, the authors in [8] report a SNR of 19.63dB for 21.7% power saving (though for a different technology) over baseline, using four different voltage domains. Figure 11 (e) and (d) show that our approach results in 2X - 8X better SNR when compared to simple voltage over-scaling [9]. This suggests that image processing/filtering applications could employ the presented inaccurate multiplier with significant power savings and minimal loss in image quality. Note that the SNR for the filtering application is defined between the accurately filtered image and inaccurately filtered image, this was done for uniform comparison with [8], which uses this convention. For the JPEG application described next we revert back to the more common definition, where SNR is defined between the original noise-less image and the filtered result. B. Comparison with Software-level Power-Quality Tradeoff As a second application we use a JPEG compression algorithm to observe the effects of our inaccurate multiplier on a more complex application and to compare software and hardware based quality tradeoffs. As before, we replace the multiplication in the JPEG algorithm with the model of the inaccurate multiplier. Table VII compares compression quality for four images. The JPEG algorithm can trade accuracy for runtime by reducing the number of coefficients used for compression, allowing for a software based tradeoff. To compare with the software approach, we synthesized the inaccurate multiplier again to consume the same power as the accurate one but operate at a greater frequency. This would result in speed up of the JPEG application assuming that the multiplier constitutes the critical path of the implementation. We first run the baseline JPEG algorithm increasing the number of coefficients (runtime) used, resulting in the SNR vs. runtime curve shown in Figure 12. We use the same set of coefficients as before, but with the tunable inaccurate multiplier (Section II-D) resulting in a different (lower) set of SNR and runtime data points. Using the frequency scaling factor from our synthesis results, we derive a SNR vs. runtime curve for when the multiplier is on the critical path (scaled inaccurate case in Figure 12). Figure 12 shows that for the JPEG application, hardware based approach has limited benefits and the software based approach yields a better tradeoff, especially at higher SNR values. Also if the multiplier does not constitute the critical path then the software based approach offers a much better tradeoff. In section II we showed that the inaccurate multiplier can be built to have different values of meanerror and power consumption. Using that resulting accuracy-power curve (Figure 8), the frequency power table previously presented (Table II) and the SNR vs. runtime curve derived above (Figure 12), we are able to compare the accuracy vs. power curves of the hardware and software based approaches. In our experiments the total runtime for the JPEG compression is kept constant. We use various configurations of the inaccurate multiplier, each with a different mean-error and power consumption (Figure 8), yielding a power vs. SNR curve for the hardware based approach. From the runtime vs. SNR curve in Figure 12 we know the amount of runtime (hence number of coefficients) the software approach would need to achieve the same SNR. Since we keep the runtime constant, we scale the frequency of operation appropriately and use our power-frequency tables to derive a SNR vs. power relationship for the software approach. The comparison of the two in Figure 13 shows that the hardware based approach still consumes less power than the software one, to achieve the same SNR in a fixed amount of runtime. Though the difference in power consumption is significantly smaller than that of the stand-alone inaccurate multiplier over the baseline. These experiments hold under the assumption that the multiplier determines the frequency of the operation and consumes the bulk of the power. V. ACCURATE M ODE OF O PERATION One of the advantages of our approach, is that simple decoder logic can be used to detect the presence and magnitude of errors for any input vector. The inaccurate output can then be corrected if/when needed using this information. In this section we look at schemes for both complete and partial correction. While complete correction that allows for an exact output might be the most common use-case the smaller overheads of partial correction make it an interesting alternative worth exploring. We also look at three different configurations that the correction block can be used in, each requiring varying degrees of architectural support in the form of frequency-scaling, voltage-scaling and power-shutoff. A. Complete Correction In our experiments we performed correction by using a combination of a decoder to detect the error magnitude and an adder to add the correcting amount. Figure 18(a) shows an example of an errordetection and correction unit specifically for the simple the 2x2 case. The AND gate acts as a simple decoder, detecting the 3 ∗ 3 input vector and the correcting adder adds the required amount (2) when the error triggering input pattern is detected (and adding zero when error is not detected). Non-adder based correction logic can also be used, Figure 18(b) shows an example of non-adder based correction for the 2x2 multiplier. Such correction mechanisms involve an overhead, and will be less efficient in terms of area than the baseline architecture. Therefore we envision a system (Figure 14) with two modes of operation - a regular, non-critical and inaccurate mode, and a mission-critical and hence accurate mode. In the non-critical mode, the correction unit will be power gated, resulting in the basic inaccurate operation, with its significant power savings. In the critical mode of operation, the system produces an accurate result at the cost of greater power in the critical mode of operation, and works at a slightly slower frequency. We re-ran our initial experiments to evaluate this overhead by synthesizing the new architecture to work at 0.85 times the original frequency in the accurate-mode, and at the same frequency as the baseline for the inaccurate mode. As before, we evaluate the design at multiple frequencies. We observed an average area overhead of 4.6% − 10.5% and in the inaccurate mode an average power overhead of 4.8% − 8.56% (Table VIII). The above setup requires support for both power-shutoff/gating (for the correcting adder in non-critical mode) and frequency scaling (to enable slower operation in critical/accurate mode). In the rest of this sub-section we look at other simpler configurations that allow for two modes of operation but do not involve frequency scaling. The simplest approach to allow for two modes of operation without the need for frequency scaling is to run both modes (of the inaccurate multiplier) at the same frequency and power-gate/shutoff the correcting adder in the non-critical mode (as before). This means that the upper path in Figure 14 and the baseline accurate architecture are synthesized to the same frequency. Power savings are attained from the fact that with the correcting unit turned off, the switching capacitance of the rest of the inaccurate multiplier is smaller than that of the baseline. The graph in Figure 15 plots the power savings of such a technique with increasing frequency for two different bit-widths (four and eight). As before designs are evaluated at frequencies between F and 2F, where 2F is the maximum possible achievable frequency of the baseline accurate multiplier. We can make the following observations about such a configuration. • • The power savings are in a much smaller range than with our previous setup (which involves both frequency scaling and power shut-off). For example the average power reduction for the four bit multiplier is now 12.44% as compared to 36% before. The power savings for this configuration reduce with increasing frequency and at higher frequencies such a configuration no longer offers a power benefit over the baseline. This is the opposite behavior to the earlier configuration, where power savings increased with increasing frequency (see Figure 13). While the average power benefits are significantly smaller than the previous setup (that involves frequencyscaling), this configuration might still be a viable choice for low-frequency applications, where a designer can benefit from the lower power without reverting to frequency scaling. The main reason the power benefits are not as impressive in such a configuration is that in the inaccurate mode, there is a significant amount of timing slack, that remains unused since the correction unit is turned off. One way to improve on this and take advantage of the available timing slack (in the inaccurate mode) is to also scale the voltage in the inaccurate mode, effectively turning the excess slack into power reduction. The results of such an approach that involves both power-shutoff and voltage-scaling are shown in Figure 16 and can be summarized as follows. • The power savings for such an approach are almost as good as the approach that involves frequencyscaling. For example the average power reduction for the four bit multiplier is now 33.22% as compared to 36% before. B. Partial Correction In this section we introduce a partial correction mechanism, that does not fully correct the errors of the inaccurate multiplier but offers the benefit of a minimal overhead. A closer look at the error distribution of the inaccurate multiplier reveals that the errors introduced by the multiplication of Most-Significant-Bits (MSB) of the inputs are the largest in terms of magnitude of error. Detecting and correcting only these errors can significantly reduce the mean-error. The detection and correction logic in such a scenario is a smaller piece of circuitry and represents a much smaller overhead. Table IX compares the error probabilities and mean-error for the partially corrected multiplier with the original inaccurate multiplier. While the change in error probability is small, the overall reduction in mean-error is significant for all bit-widths. The max-error remains unchanged at 22.22%. Figure 17 shows what such a partial correction unit would look like for all bit-widths except the trivial 2x2 case. The following points are worth noting • • The adder used for partial correction (Figure 17) is significantly smaller than the logic used for complete correction (Figure 14). The hardware used for the partial correction is constant. The partial correction block adds a three bit input with a one bit input irrespective of the input bit-width of the multiplier. As a result of its constant size, the overhead of the partial correction block actually reduces with increasing bit-width and becomes close to negligible for bit-widths of sixteen and above. We synthesize the new architecture with partial correction to work at 0.85 times the original frequency in the partially-corrected mode, and at the same frequency as the baseline for the inaccurate mode. After synthesizing this modified design for multiple frequencies we observed an average (across multiple frequencies) area overhead of 1.26% − 6.22% and in the partially correct mode an average power overhead of 1.44% − 5.63% (Table X). VI. C ONCLUSION With a mean error of 1.39%−3.35% and power savings between 30%−50%, the underdesigned multiplier architecture presented allows for trading of accuracy for power. It achieves 2X - 8X better SNR than simple voltage over-scaling techniques, and does not suffer from overheads associated with the multiple voltage domains of advanced over-scaling techniques. Our experiments with leakage-optimized netlists suggest that voltage over-scaling might become a less attractive option to reduce power in heavily power optimized designs. A simple correction mechanism is proposed for usage in a critical mode, when correctness of output is needed. We show that for inaccurate multipliers introducing errors via partial products is more promising than via the adders. We also show that both of these approaches afford a better tradeoff (of power vs. accuracy) than bit-width truncation. The results suggest that design-for-error based techniques have significant potential for power savings, and can be easily integrated into today’s automated ASIC design flow. Future work includes extending the approach to other arithmetic components and an algorithm for finding the point of maximum power benefit for a given error rate. ACKNOWLEDGEMENT The authors would like to thank Pratyush Aditya from Cadence Design Systems, for his valuable feedback and contributions towards setting up power optimization for multiplier components generated by RC. R EFERENCES [1] K. V. Palem, “Energy aware computing through probabilistic switching: A study of limits,” IEEE Trans. Comput., vol. 54, no. 9, pp. 1123–1137, 2005. [2] M. S. Lau, K.-V. Ling, and Y.-C. Chu, “Energy-aware probabilistic multiplier: design and analysis,” in CASES ’09: Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems. New York, NY, USA: ACM, 2009, pp. 281–290. [3] M. Lamoureux, “The poorman’s transform: approximating the fourier transform without multiplication,” Signal Processing, IEEE Transactions on, vol. 41, no. 3, pp. 1413 –1415, mar 1993. [4] I. Reed, D. Tufts, X. Yu, T. Truong, M.-T. Shih, and X. Yin, “Fourier analysis and signal processing by use of the mobius inversion formula,” Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 38, no. 3, pp. 458 –470, mar 1990. [5] G. Boudreaux-Bartels and T. Parks, “Discrete fourier transform using summation by parts,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP ’87., vol. 12, apr 1987, pp. 1827 – 1830. [6] M. Bhardwaj, R. Min, and A. Chandrakasan, “Power-aware systems,” in Signals, Systems and Computers, 2000. Conference Record of the Thirty-Fourth Asilomar Conference on, vol. 2, 2000, pp. 1695 –1701 vol.2. [7] N. Banerjee, G. Karakonstantis, J. H. Choi, C. Chakrabarti, and K. Roy, “Design methodology for low power and parametric robustness through output-quality modulation: application to color-interpolation filtering,” Trans. Comp.-Aided Des. Integ. Cir. Sys., vol. 28, no. 8, pp. 1127–1137, 2009. [8] J. George, B. Marr, B. E. S. Akgul, and K. V. Palem, “Probabilistic arithmetic and energy efficient embedded signal processing,” in CASES ’06: Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems. New York, NY, USA: ACM, 2006, pp. 158–168. [9] R. Hegde and N. R. Shanbhag, “Energy-efficient signal processing via algorithmic noise-tolerance,” in ISLPED ’99: Proceedings of the 1999 international symposium on Low power electronics and design. New York, NY, USA: ACM, 1999, pp. 30–35. [10] D. Shin and S. K. Gupta, “A re-design technique for datapath modules in error tolerant applications,” Asian Test Symposium, vol. 0, pp. 431–437, 2008. [11] D. Kelly and B. Phillips, “Arithmetic data value speculation,” in Asia-Pacific Computer Systems Architecture Conference, 2005, pp. 353–366. [12] S.-L. Lu, “Speeding up processing with approximation circuits,” Computer, vol. 37, no. 3, pp. 67–73, 2004. [13] D. Shin and S. K. Gupta, “Approximate logic synthesis for error tolerant applications,” in in Proc. 13th IEEE Design, Automation and Test in Europe, 2010. [14] B. J. Phillips, D. R. Kelly, and B. W. Ng, “Estimating adders for a low density parity check decoder,” F. T. Luk, Ed., vol. 6313, no. 1. SPIE, 2006, p. 631302. [Online]. Available: http://link.aip.org/link/?PSI/6313/631302/1 [15] J. Tong, D. Nagle, and R. Rutenbar, “Reducing power by optimizing the necessary precision/range of floating-point arithmetic,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 8, no. 3, pp. 273 –286, jun. 2000. [16] M. R. Choudhury and K. Mohanram, “Approximate logic circuits for low overhead, non-intrusive concurrent error detection,” in Proceedings of the conference on Design, automation and test in Europe, ser. DATE ’08. New York, NY, USA: ACM, 2008, pp. 903–908. [Online]. Available: http://doi.acm.org/10.1145/1403375.1403593 [17] S. K. Sangjin, S. Hong, M. C. Papaefthymiou, and W. E. Stark, “Low power parallel multiplier design for dsp applications through coefficient optimization,” in in Proc. 12th IEEE International ASIC/SOC Conference, pp. 286–290. [18] P. Kulkarni, P. Gupta, and M. Ercegovac, “Trading accuracy for power with an underdesigned multiplier architecture,” VLSI Design, International Conference on, vol. 0, pp. 346–351, 2011. [19] I. Koren, Computer Arithmetic Algorithms, 2nd ed. A.K. Peters, 2002. [20] “Cadence rtl-compiler,” http://www.cadence.com/products/ld/rtl_compiler/pages/default.aspx. [21] “Cadence incisive simulator,” http://www.cadence.com/products/ld/design_team_simulator/pages/default.%aspx. [22] “Standard delay format,” http://www.vhdl.org/sdf/sdf_3.0.pdf. [23] “Value change dump file,” http://en.wikipedia.org/wiki/Value_change_dump. [24] “Nangate open cell library,” http://www.nangate.com/index.php?option=com_content&task=view&id=137&It%emid=137. [25] “Opencores,” http://www.opencores.org/. [26] A. B. Kahng, S. Kang, R. Kumar, and J. Sartori, “Slack redistribution for graceful degradation under voltage overscaling,” in Proceedings of the 2010 Asia and South Pacific Design Automation Conference, ser. ASPDAC ’10. Piscataway, NJ, USA: IEEE Press, 2010, pp. 825–831. [Online]. Available: http://portal.acm.org/citation.cfm?id=1899721.1899911 [27] X. Bai, C. Visweswariah, and P. N. Strenski, “Uncertainty-aware circuit optimization,” in Proceedings of the 39th annual Design Automation Conference, ser. DAC ’02. New York, NY, USA: ACM, 2002, pp. 58–63. [Online]. Available: http://doi.acm.org/10.1145/513918.513935 Parag Kulkarni is currently an Advanced Design Engineer at Altera Corporation. He received the B.Tech degree in Information and Communication Technology from DA-IICT (www.da-iict.ac.in) in 2007 and M.S degree in Electrical Engineering in 2011 from University of California, Los Angeles where he was part of the Nano-CAD lab (http://nanocad.ee.ucla.edu/). His research interests include low-power VLSI Design and CAD. Prior to graduate school he worked at Cadence Design Systems in the logic and physical synthesis teams. Puneet Gupta Puneet Gupta (http://www.ee.ucla.edu/ puneet) is currently a faculty member of the Electrical Engineering Department at UCLA. He received the B.Tech degree in Electrical Engineering from Indian Institute of Technology, Delhi in 2000 and Ph.D. in 2007 from University of California, San Diego. He co-founded Blaze DFM Inc. (acquired by Tela Inc.) in 2004 and served as its product architect till 2007. He has authored over 60 papers, ten U.S. patents, and a book chapter. He is a recipient of NSF CAREER award, ACM/SIGDA Outstanding New Faculty Award, European Design Automation Association Outstanding Dissertation Award and IBM Ph.D. fellowship. Dr. Puneet Gupta has given tutorial talks at DAC, ICCAD, Intl. VLSI Design Conference and SPIE Advanced Lithography Symposium. He has served on the Technical Program Committee of DAC, ICCAD, ASPDAC, ISQED, ICCD, SLIP and VLSI Design. He served as the Program Chair of IEEE DFM&Y Workshop 2009, 2010. Dr. Gupta’s research has focused on building high-value bridges across application-architecture-implementation-fabrication interfaces for lowered cost and power, increased yield and improved predictability of integrated circuits and systems. Miloš D. Ercegovac is a professor and a former chair in the Computer Science Department of the Henry Samueli School of Engineering and Applied Science, University of California at Los Angeles, where he has been on the faculty since 1975. He earned his MS (’72) and PhD (’75) in computer science from the University of Illinois, Urbana-Champaign, and BS in electrical engineering (’65) from the University of Belgrade, Serbia. Dr. Ercegovac has specialized in research and teaching in digital arithmetic, digital and computer system design, and parallel architectures. His dedication to teaching and research has also resulted in several co-authored books: two in the area of digital design and two in digital arithmetic. He received the Lockheed-Martin Excellence in Teaching award in 2009. Dr. Ercegovac has been involved in organizing the IEEE Symposia on Computer Arithmetic since 1978. He served as an associate editor of the IEEE Transactions on Computers 1988 -1992 and as a subject area editor for the Journal of Parallel and Distributed Computing 1986 -1993. Dr. Ercegovac’s work has been recognized by his election in 2003 to IEEE Fellow and to Foreign Member of the Serbian Academy of Sciences and Arts in Belgrade, Serbia. He is a member of the ACM and of the IEEE Computer Society. TABLES Table I: Error probability and mean-error for varying bit-widths Bit- Error-Prob Width 2 0.0625 4 0.19 8 0.46 12 0.675 16 0.81 Mean-Error Max-Error 1.39% 2.60% 3.25% 3.31% 3.32% 22.22% 22.22% 22.22% 22.22% 22.22% Table II: Dynamic power reduction with increasing bit-widths for various frequencies BitWidth 2 4 8 16 F (%) 44.9 13.7 33.1 25.6 1.25F (%) 42.1 31.6 40.4 29.6 1.5F (%) 42.1 44.8 26.3 32.4 1.75F (%) 48.9 44.7 48.8 33.8 2F (%) 48.9 46.5 58.9 37.4 Avg. (%) 45.4 36.3 41.5 31.8 Table III: Design level power savings Design # Multipliers # Gates (~) FFT FIR RISC 32 4 1 158K 1.1K 10K Mult. Total Power Power Reduction Reduction 25.166% 13.98% 31.09% 18.30% 28.04% 1.51% Table IV: Error rates and power savings for inaccurate adder based eight bit multiplier Error Prob. 0.29 0.23 0.20 0.15 0.16 0.19 0.12 Mean Error Max Error Power Reduction 10.07% 6.01% 4.40% 3.40% 3.04% 2.92% 2.18% 62.22% 57.14% 57.14% 57.14% 57.14% 44.44% 44.44% 37.89% 32.35% 28.87% 20.16% 16.83% 19.81% 12.14% Table V: Error probability and mean-error for Voltage-Scaling Scaled- Error-Prob Voltage 0.90V 0.89 0.77V 0.99 Mean-Error Max-Error 20.86% 38.07% 100.0% 100.0% Table VI: Effect of Leakge-Power optimization on Efficiency of Voltage-Scaling Metric Error Prob. Mean-Error Max-Error Leakage Optimized Netlist 0.98 46.007% 100.0% Regular Netlist 0.4577 10.008% 100.0% Table VII: JPEG compression using the inaccurate multiplier Image Lenna Coins Sand-Dunes Fireman Average Inaccurate Accurate SNR SNR (dB) (dB) 25.19 25.56 19.31 19.55 36.5 37.94 30.23 30.89 - SNR Reduction 1.44% 1.22% 3.79% 2.13% 2.15% Table VIII: Accurate operation area and power overhead Bit Width 2 4 8 16 Average Area Overhead 4.6% 5.87% 10.5% 9.875% Max Area Overhead 8.14% 7.67% 13.44% 13.85% Average Power Overhead 4.8% 7.5% 8.56% 8.22% Max Power Overhead 8.32% 16.14% 12.88% 13.67% Table IX: Mean-Error post partial correction Bit Width Original 0.19 0.46 0.67 0.81 4 8 12 16 Error Prob. Partially Corrected 0.15 .45 .67 0.80 Original 2.60% 3.25% 3.31% 3.32% Mean-Error Partially Corrected 1.49% 2.21% 2.28% 2.29% Table X: Partial correction area and power overhead Bit Width 2 4 8 16 Average Area Overhead 4.6% 6.22% 2.79% 1.26% Max Area Overhead 8.14% 6.47% 3.24% 1.68% Average Power Overhead 4.8% 5.63% 3.48% 1.44% Max Power Overhead 8.32% 6.44% 4.33% 1.57% F IGURE C APTIONS Figure 1. Karnaugh-Map for the inaccurate 2x2 multiplier Figure 2. The accurate (b) and inaccurate (a) 2x2 multipliers, with the critical paths highlighted Figure 3. Building larger multipliers from smaller blocks Figure 4. Error percentage distribution for 4-bit and 12-bit inaccurate multipliers Figure 5. Accuracy vs. Power Tradeoff for the 4-bit inaccurate multiplier. Increasing mean-error leads to an increase in power savings. Figure 6. Dynamic power vs. frequency for 4-bit accurate and inaccurate multipliers Figure 7. Dynamic power, leakage power and area savings for a 4-bit inaccurate multiplier Figure 8. Accuracy vs. power tradeoff comparison for partial product and adder based approaches. The proposed partial product based approach yields a much better tradeoff. Figure 9. Accuracy vs. power tradeoff for truncation for an 8-bit multiplier. Both adder and partialproduct based approaches offer better tradeoffs. Figure 10. Accuracy vs. power tradeoff for truncation. Both adder and partial-product based approaches offer better tradeoffs. Figure 11. Image sharpening (a) original blurred image; (b) enhanced using accurate multiplier; (c) by inaccurate multiplier, power reduction 41.5%, SNR : 20.365dB; (d) voltage over-scaling for 30% power reduction, SNR : 9.16dB; (e) voltage over-scaling for 50% power reduction, SNR : 2.64dB; (f) by introducing errors via the adders, SNR : 7.3dB Figure 12. Run-time-accuracy tradeoff for software and hardware based approaches for JPEG compression. Accurate : accurate hardware, purely software based tradeoff. Inaccurate : inaccurate multiplier, when not in critical path. Scaled Inaccurate : inaccurate multiplier and in critical path. Figure 13. Power-accuracy tradeoff for software and hardware based approaches for JPEG compression Figure 14. Accurate mode extension, the upper path is for accurate operation and the lower path is for inaccurate operation Figure 15. Power savings as a function of increasing frequency for the configuration that utilizes only power-shutoff. With increasing target frequency this approach becomes less viable. Figure 16. Power savings as a function of increasing frequency for configuration that utilizes powershutoff and voltage-scaling. Figure 17. Partial-Correction, that fixes errors introduced by MSB multiplication. Figure 18. Two types of correction logic for the 2x2 inaccurate multiplier F IGURES Figure 1: Karnaugh-Map for the inaccurate 2x2 multiplier Figure 2: The accurate (b) and inaccurate (a) 2x2 multipliers, with the critical paths highlighted Figure 3: Building larger multipliers from smaller blocks Figure 4: Error percentage distribution for 4-bit and 12-bit inaccurate multipliers Figure 5: Accuracy vs. Power Tradeoff for the 4-bit inaccurate multiplier. Increasing mean-error leads to an increase in power savings. Figure 6: Dynamic power vs. frequency for 4-bit accurate and inaccurate multipliers Figure 7: Dynamic power, leakage power and area savings for a 4-bit inaccurate multiplier Figure 8: Accuracy vs. power tradeoff comparison for partial product and adder based approaches. The proposed partial product based approach yields a much better tradeoff. Figure 9: Accuracy vs. power tradeoff for truncation for an 8-bit multiplier. Both adder and partial-product based approaches offer better tradeoffs. Figure 10: Accuracy vs. power tradeoff for truncation. Both adder and partial-product based approaches offer better tradeoffs. Figure 11: Image sharpening (a) original blurred image; (b) enhanced using accurate multiplier; (c) by inaccurate multiplier, power reduction 41.5%, SNR : 20.365dB; (d) voltage over-scaling for 30% power reduction, SNR : 9.16dB; (e) voltage over-scaling for 50% power reduction, SNR : 2.64dB; (f) by introducing errors via the adders, SNR : 7.3dB Figure 12: Run-time-accuracy tradeoff for software and hardware based approaches for JPEG compression. Accurate : accurate hardware, purely software based tradeoff. Inaccurate : inaccurate multiplier, when not in critical path. Scaled Inaccurate : inaccurate multiplier and in critical path. Figure 13: Power-accuracy tradeoff for software and hardware based approaches for JPEG compression Figure 14: Accurate mode extension, the upper path is for accurate operation and the lower path is for inaccurate operation Figure 15: Power savings as a function of increasing frequency for the configuration that utilizes only power-shutoff. With increasing target frequency this approach becomes less viable. Figure 16: Power savings as a function of increasing frequency for configuration that utilizes power-shutoff and voltage-scaling. Figure 17: Partial-Correction, that fixes errors introduced by MSB multiplication. (a) The error detection and correction logic using an adder for the 2x2 case (b) Non-adder based correction for the 2x2 case Figure 18: Two types of correction logic for the 2x2 inaccurate multiplier