Design and Implementation of Low-Power Digit-Serial Multipliers

Yun-Nan Chang, Janardhan H. Satyanarayana, and Keshab K. Parhi
Department of Electrical and Computer Engineering
University of Minnesota, Minneapolis, MN 55455, USA
E-mail: {ynchang, jsatyana, parhi}@ece.umn.edu

This research was supported by Defense Advanced Research Project Agency under contract number DA/DABT63-96-C-0050.

Abstract

Digit-serial architectures obtained using traditional unfolding techniques cannot be pipelined beyond a certain level because of the presence of feedback loops. In this paper, a novel design methodology is presented which permits bit-level pipelining of digit-serial architectures, thereby achieving sample speeds close to those of the corresponding bit-parallel multipliers with significantly lower area. This increased sample speed can be traded for a reduction in power supply voltage, resulting in a significant reduction in power consumption. The results show that for transformed multipliers with smaller digit-sizes ($\le 4$), the singly-redundant multiplier consumes the least power, and for larger digit-sizes, the type-I multiplier consumes the least power. It is also found that the optimum digit-size for least power consumption in type-I and type-III multipliers is $\sqrt{2W}$, where W represents the word-length. The proposed digit-serial multipliers consume on average 20% lower power than the traditional digit-serial architectures for the non-pipelined case, and about 5 to 15 times lower power for the bit-level pipelined case. Also, modified Booth recoding is applied to transformed multipliers, and it is found that the recoded multipliers consume about 22% lower power than the transformed multipliers without recoding.

1. Introduction

Digital signal processing (DSP) is used in a wide range of applications such as telephone, radio, video, sonar, etc. The sample rate requirements vary from application to application and can range anywhere from
10 kHz to 400 MHz. Real-time implementation of these systems requires hardware architectures which can process input signal samples as they are received, as opposed to storing them in registers and processing them in batch mode. It is well known that bit-serial systems, which process one bit of the input sample in one clock-cycle, are area-efficient and ideal for low-speed applications [7][9]. On the other hand, bit-parallel systems, which process one whole word of the input sample in one clock-cycle, are ideal for high-speed applications [11][15]. However, in applications which require moderate sample rates, both these systems may be ineffective; bit-serial systems may be too slow, and bit-parallel systems may be faster than necessary and occupy a considerable amount of area. To this end, digit-serial systems [8][14] have become attractive for digital designers in the recent past.

Most DSP computations involve the use of multiply-accumulate operations, and therefore the design of fast and efficient multipliers is imperative. Moreover, the demand for portable applications of DSP architectures has dictated the need for low-power designs. Digit-serial multipliers are ideal for such designs and find many applications in heterogeneous high-level synthesis environments. Recently, it was found that digit-serial multipliers could be pipelined at the bit-level [1][2], thereby resulting in high processing speeds or low power. However, those designs were obtained in an ad hoc manner. This paper presents a systematic design methodology for low-power digit-serial multipliers.

Traditionally, digit-serial multipliers were obtained by either folding the corresponding bit-parallel architectures [16] or unfolding the bit-serial architectures [14]. However, architectures obtained in this manner cannot be pipelined at the bit-level.
The approach presented in this paper enables the direct design of digit-serial architectures which can be pipelined at the bit-level, thereby achieving sample speeds close to those of the corresponding bit-parallel multipliers. This increased sample speed can be traded for a reduction in power supply voltage, resulting in a significant reduction in power consumption. The proposed digit-serial multipliers consume on average 20% lower power than the traditional digit-serial architectures for the non-pipelined case, and about 5 to 15 times lower power for the bit-level pipelined case. Also, modified Booth recoding is applied to transformed multipliers, and it is found that the recoded multipliers consume about 22% lower power than the transformed multipliers without recoding.

The organization of this paper is as follows. Section 2 presents the proposed design methodology for unsigned multiplication. The design methodology is applied to various existing bit-serial multipliers, and corresponding digit-serial designs which can be pipelined at the bit-level are obtained. In Section 3, the extension of the proposed design methodology to handle modified Booth recoding is presented. In order to verify the correctness of the proposed architectures, the type-I digit-serial multiplier is implemented in 1.2 µm CMOS technology, and experimental results related to critical path and power consumption are presented in Section 4. Finally, the main conclusions of the paper are summarized in Section 5.

2. Design methodology for unsigned multiplication

In this section, a systematic design methodology is proposed which enables the direct design of digit-serial architectures from bit-serial architectures. Without loss of generality, this methodology is applied to multipliers (unsigned), which form the backbone of DSP computations. The proposed methodology can be easily extended to two's complement multiplication, although this is not shown here for the sake of brevity.
Consider the bit-serial multiplication of two W-bit numbers a and b to yield a product p as described by the algorithm below:

Algorithm
  INPUT: a, b
  OUTPUT: p
  INITIALIZE: $a_i, b_i = 0$ for $i > W-1$; $c_{i,j}, s_{i,j} = 0$ for all i, j
  begin
    for i = 0 to W-1
    begin
      for j = 0 to W
      begin
        $a_i b_j + c_{i,j-1} + s_{i-1,j+1} = 2 c_{i,j} + s_{i,j}$;    (1)
      end
      $p_i = s_{i,0}$;
    end
    for i = W to 2W-1
      $p_i = s_{W-1,\,i-W+1}$;
  end

The proposed design methodology involves treating the bits as digits. Consequently, the bit-serial algorithm is modified for a general digit-size N as follows:

Algorithm
  begin
    for i = 0 to W/N - 1
    begin
      for j = 0 to W/N
      begin
        $A_i B_j + C_{i,j-1} + S_{i-1,j+1} = 2^N C_{i,j} + S_{i,j}$;    (2)
      end
      $P_i = S_{i,0}$;
    end
    for i = W/N to 2W/N - 1
      $P_i = S_{W/N-1,\,i-W/N+1}$;
  end

In the above algorithm, capital letters are used to denote digits. For example, if the digit-size is N = 4, then $A_0$ represents a digit constituting the four bits $a_0$, $a_1$, $a_2$, and $a_3$, and has a value of $A_0 = 2^3 a_3 + 2^2 a_2 + 2^1 a_1 + a_0$.

The next step is to translate (2) into a hardware architecture in such a manner that the final architecture permits bit-level pipelining. This is achieved by splitting the product $(A_i B_j)$ in (2) into two parts in accordance with

$$(A_i B_j)_L + 2^N (A_i B_j)_H + C_{i,j-1} + S_{i-1,j+1} = 2^N C_{i,j} + S_{i,j}. \qquad (3)$$

Here, $(A_i B_j)_L$ represents the lower order terms of the product $(A_i B_j)$, and $(A_i B_j)_H$ represents the higher order terms. For example, consider the product $(A_0 B_0)$ for a digit-size of 4. After splitting, the product is given by

$$(A_0 B_0) = \{ a_3 b_3\, 2^2 + (a_3 b_2 + a_2 b_3)\, 2^1 + (a_3 b_1 + a_2 b_2 + a_1 b_3)\, 2^0 \}\, 2^4 + [\, (a_3 b_0 + a_2 b_1 + a_1 b_2 + a_0 b_3)\, 2^3 + (a_2 b_0 + a_1 b_1 + a_0 b_2)\, 2^2 + (a_1 b_0 + a_0 b_1)\, 2^1 + a_0 b_0\, 2^0 \,]. \qquad (4)$$

In the above equation, the terms inside $\{\cdot\}$ equal $(A_0 B_0)_H$, and the terms inside $[\cdot]$ equal $(A_0 B_0)_L$. Upon rearranging (3) one gets

$$(A_i B_j)_L + C_{i,j-1} + S_{i-1,j+1} = 2^N \{ C_{i,j} - (A_i B_j)_H \} + S_{i,j}. \qquad (5)$$

Define

$$C'_{i,j} = C_{i,j} - (A_i B_j)_H. \qquad (6)$$

Then, (5) can be rewritten as

$$[(A_i B_j)_L + (A_i B_{j-1})_H] + C'_{i,j-1} + S_{i-1,j+1} = 2^N C'_{i,j} + S_{i,j}. \qquad (7)$$

The advantage of rearranging the terms in this manner is that the lower order and the higher order partial products can be summed using a regular rectangular carry-save array, and the final architecture can be pipelined to the bit-level. This was not always possible in the previously obtained (using unfolding) digit-serial architectures. As a result, sample speeds comparable to bit-parallel architectures or very low power consumption can be achieved. In the following, the proposed design methodology is applied to various bit-serial multipliers to obtain new digit-serial multipliers.

2.1. Type-I multiplier

Consider the bit-serial type-I multiplier [10] shown in Fig. 1, where the coefficient word-length is four bits. This architecture contains four full adders, four multipliers, and some delay elements. In this multiplier the carry-out signal of every adder is fed back after a delay to the carry-in signal of the same adder. The critical path of this architecture is W full-adder delays. The traditional approach for designing the digit-serial architecture involves unfolding this structure by a factor equal to the digit-size N. However, the resulting critical path would be W + N - 1 full-adder delays, which can be further reduced to N full-adder delays after pipelining. Reduction in the critical path below N full-adder delays is not possible because of the presence of feedback loops.

Figure 1. Type-I bit-serial multiplier with word-length of 4 bits.

Based on (7), each cell inside the dashed boxes in Fig. 1 is now replaced by the corresponding structure shown in Fig. 2. It is found that the critical path of the resulting digit-serial architecture is N + 2(W/N - 1) full-adder delays if two sum outputs are generated, and N + W/N - 1 full-adder delays if three sum outputs are generated (see Fig. 2). The important fact to note is that this critical path can be reduced to one full-adder delay by suitable pipelining.

Figure 2. Digit-cell for type-I multiplier.

Each digit-cell consists of a partial product generator module and a carry save adder (CSA) module. The partial product generator computes the 16 partial products. It is clear from (7) that $A_i$ is multiplied by both $B_j$ and $B_{j-1}$. This is reflected in the partial product generator, where both the current and the delayed versions of the signal B are used, with (D) denoting the delayed version of B. For example, $a_3 b_1^{(D)}$ is used to denote the fact that $a_3$ is multiplied by a delayed version of $b_1$. The CSA module produces three sum output digits, sum_out0, sum_out1, and sum_out2, in order to enable bit-level pipelining. Therefore, in the final stage a digit-serial adder is required to sum all these outputs. A simple digit-serial 3:2 compressor adder can first be used to reduce these three output digits to two digits. A digit-serial carry look-ahead adder or any other fast carry-propagate adder is then used to add these two digits to generate the final result. A fast adder is necessary because this stage should not form a bottleneck during bit-level pipelining. The entire architecture of the digit-serial multiplier obtained using the proposed design methodology is shown in Fig. 3.
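The recurrences (2) and (7) from the beginning of this section can be checked against each other numerically. The following is a minimal behavioral sketch, not a model of the hardware in Figs. 1 to 3: the function names `split_product`, `digits`, and `digit_serial_multiply` are hypothetical, the low/high grouping follows the pattern of (4) generalized to any digit-size N, and W is assumed to be a multiple of N.

```python
def split_product(A, B, N):
    """Return (L, H) with A*B == L + (H << N), grouped as in (4).

    Partial-product bits a_m*b_n with m + n < N go to L (weight 2^(m+n));
    the rest go to H (weight 2^(m+n-N)). L and H may exceed one digit.
    """
    L = H = 0
    for m in range(N):
        for n in range(N):
            bit = ((A >> m) & 1) & ((B >> n) & 1)
            if m + n < N:
                L += bit << (m + n)
            else:
                H += bit << (m + n - N)
    return L, H


def digits(x, N, count):
    """Split x into `count` digits of N bits each, least significant first."""
    mask = (1 << N) - 1
    return [(x >> (N * k)) & mask for k in range(count)]


def digit_serial_multiply(a, b, W, N, transformed=True):
    """Evaluate recurrence (2), or its rearranged form (7), for unsigned a, b."""
    D = W // N                          # number of digits per word
    A = digits(a, N, D)
    B = digits(b, N, D) + [0]           # B_j = 0 for j > D - 1
    base = 1 << N
    S_prev = [0] * (D + 2)              # S_{i-1,j}; all zero before row 0
    P = []
    for i in range(D):
        S = [0] * (D + 2)
        carry = 0                       # C_{i,-1} (or C'_{i,-1}) = 0
        H_prev = 0                      # (A_i B_{-1})_H = 0
        for j in range(D + 1):
            if transformed:
                L, H = split_product(A[i], B[j], N)
                total = L + H_prev + carry + S_prev[j + 1]   # left side of (7)
                H_prev = H
            else:
                total = A[i] * B[j] + carry + S_prev[j + 1]  # left side of (2)
            carry, S[j] = divmod(total, base)
        P.append(S[0])                  # P_i = S_{i,0}
        S_prev = S
    P.extend(S_prev[1:D + 1])           # P_i = S_{D-1, i-D+1} for i = D..2D-1
    return sum(d << (N * k) for k, d in enumerate(P))
```

With `N = 1` the same routine reduces to the bit-serial recurrence (1). Both settings of `transformed` return the same product because (7) only moves the high-order partial products of $A_i B_{j-1}$ into the next time step; the sum accumulated in each row is unchanged.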
Other bit-serial multipliers similar to the type-I bit-serial multiplier can be obtained, for example, by changing the direction of the data flow of the sum signal, or by making the input signal broadcast to all multipliers [10]. The proposed methodology is quite general and can be applied to all these types of bit-serial multipliers to generate corresponding digit-serial multipliers which can be pipelined at the bit-level.

Figure 3. Type-I digit-serial multiplier with word-length 16 bits and digit size 4 bits, where the digit-cell is the one shown in Fig. 2.

2.2. Type-II multiplier

Consider the bit-serial type-II multiplier [3] shown in Fig. 4. The main difference between this multiplier and the type-I multiplier is that the critical path in this architecture is just two full-adder delays. Moreover, this architecture can be pipelined at the bit-level with an additional latency of only one clock-cycle, unlike the type-I multiplier, where the increase in latency would depend on the word-length. If this architecture is unfolded using the traditional technique, the critical path would be (N + 1) full-adder delays. However, as in the case of the type-I multiplier, reduction in the critical path below N is not possible due to the presence of feedback loops. Based on (7), each cell in dashed boxes in Fig. 4 is replaced with corresponding digit-cells [6], which are then cascaded to design the entire multiplier.

Figure 4. Bit-serial type-II multiplier with word-length of 4 bits.

2.3. Type-III multiplier

Consider the bit-serial type-III multiplier [10] shown in Fig. 5. The salient feature of this architecture is that the carry-out signal is not fed back as in the type-I multiplier. If this architecture is unfolded using the traditional technique, the critical path would be W full-adder delays. However, in this case, since there is no carry feedback, the unfolded architecture can also be pipelined at the bit-level. The bit-serial multiplication for this architecture is expressed by

$$a_i b_j + c_{i-1,j} + s_{i-1,j+1} = 2 c_{i,j} + s_{i,j}. \qquad (8)$$

Figure 5. Bit-serial type-III multiplier with word-length of 4 bits.

Going through the design methodology presented in the beginning of this section, the final digit-serial multiplication equation can be derived in accordance with

$$[(A_i B_j)_L + (A_{i-1} B_j)_H] + C'_{i-1,j} + S_{i-1,j+1} = 2^N C'_{i,j} + S_{i,j}. \qquad (9)$$

Based on (9), each dashed cell in Fig. 5 is replaced by the digit-cell shown in Fig. 6.

Figure 6. Digit-cell for the type-III multiplier.

Figure 7. Bit-serial architecture for a singly-redundant multiplier.

Figure 8. Digit-serial singly-redundant multiplier architecture.

It should be noted that the partial product generator uses two coefficient digits, $A_i$ and $A_{i-1}$, unlike the previous architectures where only one digit was used. It should also be noted that the carry-save portion generates four outputs at each stage. Therefore, at the output of the final digit-cell, a digit-serial 4:2 compressor and a fast carry look-ahead adder are required to convert the four digits to one digit.
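The step from (8) to (9) mirrors the step from (2) to (7). The derivation below is a sketch under one assumption that the text does not state explicitly: that the digit-level form of (8) is obtained by replacing bits with digits and the factor 2 with $2^N$, exactly as (2) was obtained from (1).

```latex
\begin{align*}
% Assumed digit-level analogue of (8):
A_i B_j + C_{i-1,j} + S_{i-1,j+1} &= 2^N C_{i,j} + S_{i,j} \\
% Split the product as in (3) and move the high part to the right:
(A_i B_j)_L + C_{i-1,j} + S_{i-1,j+1}
  &= 2^N \{ C_{i,j} - (A_i B_j)_H \} + S_{i,j} \\
% Define C'_{i,j} = C_{i,j} - (A_i B_j)_H as in (6), so that
% C_{i-1,j} = C'_{i-1,j} + (A_{i-1} B_j)_H, and substitute:
[(A_i B_j)_L + (A_{i-1} B_j)_H] + C'_{i-1,j} + S_{i-1,j+1}
  &= 2^N C'_{i,j} + S_{i,j}
\end{align*}
```

The last line is (9). Because the carry in (8) arrives from the previous row ($i-1$) rather than the previous time step ($j-1$), the high-order term that reappears on the left involves $A_{i-1}$ instead of $B_{j-1}$.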
The resulting architecture can be pipelined at the bit-level, thereby achieving word-level speeds comparable to a bit-parallel multiplier.

2.4. Singly-redundant multiplier

The "carry-save" data representation is the most common example of redundant representations, even though it is not always described as such. The use of redundant arithmetic in fast binary arithmetic was first published in [4]. The basic idea behind using redundant arithmetic is to avoid carry propagation in parallel arithmetic cells. This is possible because the use of a redundant number representation allows serial operations to proceed in a most-significant-bit mode. There are many flavors of redundant arithmetic, namely minimally redundant, maximally redundant, over-redundant, etc. The reader is referred to [12][10] for a more detailed discussion of redundant arithmetic.

The architecture of a bit-serial singly-redundant multiplier is shown in Fig. 7 [17]. The term singly arises from the fact that one input (redundant) and output operands (redundant) belong to the set $\{\bar{1}, 0, 1\}$, where $\bar{1}$ denotes $-1$, and another input belongs to the binary set $\{0, 1\}$. The key to achieving coding and area efficiency in the singly-redundant architecture lies in the hybrid representation of the input and output signals of the cell. This architecture has a critical path of 4 adder delays and a fixed latency of 1 clock cycle.

Based on (9), the bit-serial singly-redundant architecture shown in Fig. 7 is transformed, and the resulting digit-serial architecture is shown in Fig. 8, where D = W/N. This architecture is comprised of three sets of blocks. Each block, as before, consists of a partial product generator and a carry-save adder section. The architecture of the partial product generator for the singly-redundant multiplier is shown in Fig. 9. The structure of this generator is quite similar to that of the type-III multiplier.
The only difference is that the AND gate in the partial product generator of the type-III multiplier is now replaced by a combination of an AND and an XOR gate, as shown in Fig. 9. This is done to incorporate the sign-bit of the redundant digit.

Figure 9. Architecture of the partial product generator for the singly-redundant multiplier.

Figure 10. Digit-cell for the singly redundant multiplier for i = 0 to D-2.

The first D - 2 blocks in Fig. 8 are identical, and the general architecture for each one of them is shown in Fig. 10. The architectures for the last two blocks are slightly different from the first D - 2 blocks. The difference arises due to the fact that the signal a is in two's complement form, with the most significant bit representing the sign bit. Therefore, a small modification in the architecture of the last two modules is required to incorporate the sign bit. In all the digit-cells, i represents the space index and j represents the time index. For example, i is 0 for the first module, 1 for the second module, and D for the last module; j is 0 for the first clock cycle, 1 for the second clock cycle, and so on. The value of j equal to D - 1 denotes the end of transmission of the entire word. It should be noted that in Fig. 10, for i = 0, $A_{-1}$ is 0, and therefore the first cell can be simplified.

If the bit-serial architecture shown in Fig. 7 is unfolded using the traditional unfolding technique, the resulting digit-serial architecture will have a critical path equal to W full-adder delays. This architecture can, however, be pipelined to reduce the critical path to one full-adder delay at the expense of increased latency. The digit-serial architecture obtained using the proposed design methodology, however, has a critical path equal to N full-adder delays. This architecture can be pipelined to reduce the critical path to one full-adder delay at the expense of a slight increase in latency. This is the main advantage of the proposed design methodology over the unfolding technique.

3. Extension to modified Booth recoding

One salient feature of the proposed design methodology is that the resulting digit-serial multipliers can be further improved by incorporating the recoding technique of binary numbers. The recoding of binary numbers was first hinted at by Booth [5], and the most popular recoding algorithm currently used is the Modified Booth's Algorithm (MBA) [13]. The application of the MBA to multiplication can help reduce the number of partial products by half. Therefore, the number of full adders required to accumulate the partial products is reduced, along with the critical path and the latency of the multiplier. The MBA can be applied to the proposed design methodology to generate a variety of MBA digit-serial multipliers which can be pipelined at the bit-level. In order to illustrate the effect of recoding on the proposed design methodology, the type-I multiplier is considered as an example.

Figure 11. Modified Booth's Algorithm recoding module for a digit-size of 4 bits.

The first step in MBA is to recode the multiplicand to a string of digits in the set {-2, -1, 0, 1, 2}. Fig. 11 shows a digit-serial recoding module with digit-size 4 to implement the recoding. In this circuit, if the recoded number is zero, the signal zero will equal 0.
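The recoding step can be illustrated in software. The sketch below uses the standard radix-4 modified Booth rule, $d_k = -2 b_{2k+1} + b_{2k} + b_{2k-1}$ with $b_{-1} = 0$, applied to a W-bit two's complement value; it models only the recoded digit values, not the zero, sign, and mag control signals of Fig. 11, whose exact encoding is not reproduced here. The function name `booth_radix4_digits` is hypothetical.

```python
def booth_radix4_digits(x, W):
    """Recode a W-bit two's complement integer x (W even) into W/2
    radix-4 digits, least significant first, each in {-2, -1, 0, 1, 2}."""
    if W % 2 != 0:
        raise ValueError("W must be even")
    # Python's >> on negative ints is arithmetic, so this yields the
    # two's complement bit pattern of x for -2^(W-1) <= x < 2^(W-1).
    bits = [(x >> k) & 1 for k in range(W)]
    recoded = []
    prev = 0                                   # implicit bit b_{-1} = 0
    for k in range(0, W, 2):
        recoded.append(-2 * bits[k + 1] + bits[k] + prev)
        prev = bits[k + 1]
    return recoded


def booth_value(recoded):
    """Reassemble the integer represented by radix-4 Booth digits."""
    return sum(d * 4 ** k for k, d in enumerate(recoded))
```

A 4-bit digit of the input therefore yields two recoded digits per clock cycle, which is consistent with the halving of partial products described above: a W-bit operand gives W/2 recoded digits instead of W bits.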
Signal sign represents the sign of the recoded number, and the signal mag decides whether the absolute value of the recoded number is 1 or 2. After applying the MBA, the resulting digit-cell of the type-I multiplier is shown in Fig. 12, where the number of full adders required is only half of that shown in Fig. 2.

Figure 12. Digit-cell for type-I multiplier with recoding.

4. Experimental results

In this section, the different types of digit-serial architectures obtained using the proposed design methodology are compared with those obtained using the traditional unfolding approach in terms of speed and power consumption.

In order to verify the operation of the proposed digit-serial architectures, the type-I digit-serial multiplier (non-pipelined) has been implemented using 1.2 µm CMOS technology, and the resulting layout is shown in Fig. 13. The layout has 6 rows of standard cells occupying an area of 5.77 mm². This has been successfully simulated for various multiplier and multiplicand values. A bit-level pipelined version of this multiplier is found to operate at sample rates of up to 200 MHz.

Figure 13. Layout (using 1.2 µm CMOS technology parameters) of a 16-bit × 16-bit type-I digit-serial multiplier which has been designed using the proposed design methodology.

The critical paths of the various multiplier architectures are analyzed, and the results are presented in Figs. 14 and 15. Fig. 14 presents the comparison of the critical paths for a fixed digit-size of N = 4. Here, the first bar represents the critical path (in terms of full-adder delays) for the bit-serial case.
The next two bars represent the critical paths of the unfolded architecture for the non-pipelined and the pipelined case, respectively. It should be noted that the unfolded architecture has been pipelined to the maximum extent possible. Finally, the last two bars represent the critical paths of the transformed architectures. Here, transformed architectures are those obtained using the proposed design methodology. It is clear from the figure that transformation results in a reduction in the critical path in both the non-pipelined and the pipelined case, except for the type-II multiplier, where there is a slight increase. Another important observation is that by pipelining the transformed architecture, the critical path can always be reduced to one full-adder delay. However, this cannot always be achieved by pipelining the unfolded architecture.

Figure 14. Comparison of critical paths for different digit-serial multiplier architectures for a fixed digit-size of 4 and a word-length of 16 bits.

Fig. 15 shows the variation of the critical path with digit-size for both the unfolded and the transformed non-pipelined architectures. It is clear from the figure that for the unfolded architectures, the critical path increases linearly with digit-size (except for the type-III and the redundant multiplier). However, for the transformed architectures, the critical path decreases first, reaches a minimum, and then starts to increase (for larger digit-sizes). The exceptions are the type-II transformed architecture and the redundant architecture, where the critical path still increases linearly with digit-size. The advantage of using the proposed design methodology in these cases is that the critical path can be reduced to one full-adder delay after bit-level pipelining, which is not possible in the unfolded case.

Figure 15. Plot of the variation of the critical path with digit-size for both the unfolded and the transformed non-pipelined digit-serial architectures with the word-length fixed at 16 bits.

Fig. 16 shows the average power consumption values obtained using the HEAT tool [18] for different digit-serial multipliers obtained using both the unfolding technique and the proposed design methodology, without any pipelining or supply voltage reduction. Here, (I, trs.), for example, represents a type-I transformed multiplier and (I, unf.) represents a type-I unfolded digit-serial multiplier.

Figure 16. Power (mW) consumption values (obtained using 1.2 µm technology parameters) for different non-pipelined digit-serial multipliers. The word sample frequency (wsf) is 25 MHz and the word-length is 16 bits.

The results show that, for smaller digit-sizes, the singly-redundant multiplier consumes the least power among all transformed multipliers. This is because the critical path for this multiplier is directly proportional to the digit-size. For the type-I and type-III multipliers, the critical path is found to be N + 2(W/N - 1) full-adder delays. Therefore, the critical path actually decreases with digit-size, reaches a minimum between digit-sizes 4 and 8, and then starts to increase. We conclude that for an arbitrary word-length W, the optimum digit-size for minimum power consumption in these architectures is close to $\sqrt{2W}$. It can also be observed from the experimental results that for larger digit-sizes the type-I transformed multiplier consumes the least power.
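The $\sqrt{2W}$ figure follows from the critical-path expression quoted above: treating N as continuous, the derivative of $N + 2(W/N - 1)$ with respect to N is $1 - 2W/N^2$, which vanishes at $N = \sqrt{2W}$. The short sketch below evaluates the same expression at the digit-sizes that divide W; the function names are hypothetical, and it models only the critical path in full-adder delays, not measured power.

```python
import math


def transformed_critical_path(W, N):
    """Critical path N + 2(W/N - 1), in full-adder delays, quoted in the
    text for the non-pipelined type-I and type-III transformed multipliers."""
    return N + 2 * (W // N - 1)


def best_digit_sizes(W):
    """Digit-sizes dividing W that minimize the critical path above."""
    candidates = [N for N in range(1, W + 1) if W % N == 0]
    best = min(transformed_critical_path(W, N) for N in candidates)
    return [N for N in candidates if transformed_critical_path(W, N) == best]


W = 16
for N in (1, 2, 4, 8, 16):
    print(N, transformed_critical_path(W, N))
print("continuous optimum:", math.sqrt(2 * W))
print("best integer digit-sizes:", best_digit_sizes(W))
```

For W = 16 the digit-sizes 4 and 8 tie at 10 full-adder delays, on either side of $\sqrt{32} \approx 5.66$, which matches the statement that the minimum lies between digit-sizes 4 and 8.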
The singly-redundant multiplier consumes more power than the type-I multiplier for larger digit-sizes because of increased power consumption in the latches. We also observe that the proposed design methodology actually increased power consumption for the non-pipelined type-II multiplier. This is because the critical path in the transformed architecture was slightly longer than that of the unfolded architecture. The pipelined transformed type-II multiplier, however, consumed lower power than the unfolded pipelined type-II multiplier.

The various digit-serial architectures obtained using the proposed design methodology are pipelined at the bit-level and are compared with the corresponding unfolded architectures for power consumption (obtained using the HEAT tool). The results are presented in Table 1. Here, TFA and To represent, respectively, the propagation delays through a full-adder and a latch. Since the architectures shown have been fully pipelined, the critical path for all of them is the sum of TFA and To. The results for the type-I unfolded and type-II unfolded multipliers have not been presented because these architectures could not be pipelined at the bit-level. The results show that the bit-level pipelined transformed redundant architecture offers the best choice, taking into consideration both the latency and the power consumption. More importantly, its latency is independent of the word-length, which makes it attractive for larger designs.

Fig. 17 shows the comparison of average power consumption values for both the unfolded and the transformed type-I digit-serial multiplier. Three sets of results are shown for the transformed multiplier: the non-pipelined, the 2-bit-level pipelined, and the 1-bit-level pipelined multiplier. The pipelined multipliers are operated at a lower supply voltage (values at the top of the bars) to maintain the same word sample frequency. The results show that the non-pipelined transformed digit-serial multiplier consumes about 20% lower power than the unfolded multiplier. The bit-level pipelined multiplier operated at a lower supply voltage, however, consumes about 15 times lower power than the unfolded multiplier.

Figure 17. Comparison of power consumption for the type-I digit-serial multiplier obtained using both the traditional unfolding technique and the proposed design methodology. The word-length is 16 bits.

Fig. 18 shows the comparison of average power consumption for the type-I digit-serial multiplier with and without recoding. Note that there is no corresponding recoding architecture for the bit-serial case. The effect of 1-bit-level pipelining on the power consumption in this architecture is also shown in the figure. The 1-bit-level pipelined multipliers are operated at a lower supply voltage to maintain the same word sample frequency as the non-pipelined multiplier. The results show that the multipliers with recoding consume about 22% lower power than those without recoding. This is because the critical path and the number of full-adders in the recoded architectures are less than in those without recoding.

Figure 18. Comparison of power consumption for the Booth recoded (br.) and the non-recoded type-I transformed multiplier for a word-length of 16 bits.

Table 1. Comparison of power (mW) consumption for various bit-level pipelined digit-serial architectures.
The word sample frequency is 25 MHz.

Architecture   Design method   Tcritical path   # latches   latency    power (W=16, N=4)
type-I         trs.            TFA + To         2WN+3W      N+W/N-1    8.47
type-II        trs.            TFA + To         2WN+2W      N+2        8.46
type-III       trs.            TFA + To         2WN+4W      N+W/N      9.61
type-III       unf.            TFA + To         3WN+2W      W          12.90
red.           trs.            TFA + To         2WN         N          8.20
red.           unf.            TFA + To         4WN+W       W          13.08

5. Conclusion

This paper has presented a design methodology for a new class of digit-serial multiplier architectures. These architectures can be pipelined at the bit-level, and as a result power can be reduced. It should also be noted that for large digit-sizes, the CSA module can be implemented using the Wallace tree algorithm [19]. Experiments using the HEAT tool showed that about 35% lower power is obtained for the non-pipelined architecture using the Wallace tree approach when compared to the CSA-based architecture for a digit-size of 8 and a word-length of 16 bits. For a specified wsf, the clock speed required with a bit-serial design is much higher than with a digit-serial design with digit-size 4 or 8. As a result, the power consumed by a bit-serial design due to the high-speed clock is much higher, and this favors digit-serial architectures with respect to low power consumption. Note that the power consumed by the clock is not accounted for by the HEAT tool. Future work is directed towards the design of low-power digit-serial multipliers with saturation capabilities.

References

[1] A. Aggoun, A. Ashur, and M. K. Ibrahim. A novel cell architecture for high-performance digit-serial computation. Electronics Letters, 29(6):938-940, May 1993.
[2] A. Aggoun, A. Ashur, and M. K. Ibrahim. Systolic digit-serial multiplier. IEE Proceedings: Circuits, Devices, and Systems, 143(1):14-20, Feb. 1996.
[3] D. Ait-Boudaoud, M. K. Ibrahim, and B. R. Hayes-Gill. Novel cell architecture for bit level systolic arrays multiplication. In IEE Proceedings-E, volume 138, January 1991.
[4] A. Avizienis. Signed digit number representation for fast parallel arithmetic. IRE Trans. on Computers, EC-10:389-400, Sept. 1961.
[5] A. D. Booth. A signed binary multiplication technique. Quarterly J. Mech. Appl. Math., (4):236-240, 1951.
[6] Y.-N. Chang, J. H. Satyanarayana, and K. K. Parhi. Low-power digit-serial multipliers. In Proc. IEEE International Symp. on Circuits and Systems (ISCAS), pages 1023-1026, Hong Kong, June 1997.
[7] P. B. Denyer and D. Renshaw. VLSI Signal Processing: A Bit-Serial Approach. Addison Wesley, Reading, MA, 1986.
[8] R. I. Hartley and P. F. Corbett. A digit-serial silicon compiler. In Proc. Design Automation Conference, pages 646-649, 1988.
[9] R. I. Hartley and J. R. Jasica. Behavioral to structural translation in a bit-serial silicon compiler. IEEE Trans. Computer Aided Design, 7:877-886, Aug. 1988.
[10] R. I. Hartley and K. K. Parhi. Digit-Serial Computation. Kluwer Academic, Boston, MA, 1995.
[11] M. Hatamian and G. Cash. Parallel bit-level pipelined VLSI designs for high-speed signal processing. Proc. IEEE, 75:1192-1202, Sept. 1987.
[12] K. Hwang. Computer Arithmetic, Principles, Architectures, and Design. John Wiley, 1979.
[13] O. L. MacSorley. High speed arithmetic in binary computers. Proc. IRE, 49:67-91, 1961.
[14] K. K. Parhi. A systematic approach for design of digit-serial signal processing architectures. IEEE Trans. Circuits and Systems, 38(4):358-375, Apr. 1991.
[15] K. K. Parhi and M. Hatamian. A high sample rate recursive filter chip. IEEE Press, NY, 1988.
[16] K. K. Parhi, C.-Y. Wang, and A. P. Brown. Synthesis of control circuits in folded pipelined DSP architectures. IEEE Journal of Solid-State Circuits, 27(1):29-43, Jan. 1992.
[17] G. Privat. A novel class of serial-parallel redundant signed-digit multipliers. In Proc. IEEE Int. Symp. on Circuits and Systems, pages 2116-2119, New Orleans, LA, May 1990.
[18] J. H. Satyanarayana and K. K. Parhi. HEAT: Hierarchical energy analysis tool. In 33rd ACM/IEEE Design Automation Conference, pages 9-14, Las Vegas, NV, June 1996.
[19] C. S. Wallace. A suggestion for a fast multiplier. IEEE Trans. Electron. Comput., EC-13:14-17, 1964.