Design and Implementation of Low-Power Digit

advertisement
Design and Implementation of Low-Power
Digit-Serial Multipliers
Yun-Nan Chang, Janardhan H. Satyanarayana, and Keshab K. Parhi
Department of Electrical and Computer Engineering
University of Minnesota, Minneapolis, MN 55455, USA
E-mail: fynchang, jsatyana, parhig@ece.umn.edu
Abstract
Digit-serial architectures obtained using traditional
unfolding techniques cannot be pipelined beyond a certain level because of the presence of feedback loops.
In this paper, a novel design methodology is presented
which permits bit-level pipelining of the digit-serial architectures. This enables bit-level pipelining of digitserial architectures thereby achieving sample speeds
close to corresponding bit-parallel multipliers with signicantly lower area. This increased sample speed can
be traded with reduction in power supply voltage resulting in signicant reduction in power consumption.
The results show that for transformed multipliers with
smaller digit-sizes ( 4), the singly-redundant multiplier consumes the least power and for larger digitsizes, the type-I multiplier consumes the least power.
It is also found that the optimum digit-size for least
powerp consumption in type-I and type-III multipliers
is 2W , where W represents the word-length. The
proposed digit-serial multipliers consume on an average
20% lower power than the traditional digit-serial architectures for the non-pipelined case, and about 5 ? 15
times lower power for the bit-level pipelined case. Also,
modied Booth recoding is applied to transformed multipliers and it is found that the recoded multipliers consume about 22% lower power than the transformed multipliers without recoding.
1. Introduction
Digital signal processing (DSP) is used in a wide
range of applications such as telephone, radio, video,
sonar, etc. The sample rate requirements vary from application to application and can range anywhere from
This research was supported by Defense Advanced Research
Project Agency under contract number DA/DABT63-96-C-0050.
10 KHz to 400 MHz. Real time implementation of these
systems require hardware architectures which can process input signal samples as they are received, as opposed to storing them in registers and processing them
in batch mode. It is well known that bit-serial systems,
which process one bit of the input sample in one clockcycle, are area-ecient and ideal for low-speed applications [7][9]. On the other hand bit-parallel systems,
which process one whole word of the input sample in
one clock-cycle, are ideal for high speed applications
[11][15]. However, in applications which require moderate sample rates both these systems may be ineective;
bit-serial systems may be too slow and bit-parallel systems may be faster than necessary and occupy considerable amount of area. To this end, digit-serial systems
[8][14] have become attractive for digital designers in
the recent past.
Most of the DSP computations involve the use of
multiply accumulate operations and therefore the design of fast and ecient multipliers is imperative.
Moreover, the demand for portable applications of DSP
architectures has dictated the need for low power designs. Digit-serial multipliers are ideal for such designs
and nd many applications in heterogeneous high-level
synthesis environments. Recently, it was found that
digit-serial multipliers could be pipelined at the bitlevel [1][2] thereby resulting in high processing speeds
or low power. However, here the designs were obtained
in an ad hoc manner.
This paper presents a systematic design methodology for low power, digit-serial multipliers. Traditionally, digit-serial multipliers were obtained by either folding the corresponding bit-parallel architectures [16], or unfolding the bit-serial architectures [14].
However, architectures obtained in this manner cannot be pipelined at the bit-level. The approach presented in this paper enables the direct design of digitserial architectures which can be pipelined at the bitlevel thereby achieving sample speeds close to corre-
sponding bit-parallel multipliers. This increased sample speed can be traded with reduction in power supply
voltage resulting in signicant reduction in power consumption. The proposed digit-serial multipliers consume on an average 20% lower power than the traditional digit-serial architectures for the non-pipelined
case, and about 5 ? 15 times lower power for the bitlevel pipelined case. Also, modied Booth recoding
is applied to transformed multipliers and it is found
that the recoded multipliers consume about 22% lower
power than the transformed multipliers without recoding.
The organization of this paper is as follows. Section
2 presents the proposed design methodology for unsigned multiplication. The design methodology is applied to various existing bit-serial multipliers and corresponding digit-serial designs which can be pipelined
at the bit-level are obtained. In Section 3, extension
of the proposed design methodology to handle modied Booth recoding is presented. In order to verify the
correctness of the proposed architectures, the type-I
digit-serial multiplier is implemented in 1:2 CMOS
technology and experimental results related to critical
path and power consumption are presented in Section
4. Finally, the main conclusions of the paper are summarized in Section 5.
2. Design methodology for unsigned multiplication
In this section, a systematic design methodology is
proposed which enables the direct design of digit-serial
architectures from bit-serial architectures. Without
loss of generality, this methodology is applied to multipliers (unsigned) which form the backbone of DSP computations. The proposed methodology can be easily
extended to two's complement multiplication although
it is not been shown here for the sake of brevity.
Consider the bit-serial multiplication of two W -bit
numbers a and b to yield a product p as described by
the algorithm below:
Algorithm
INPUT: a, b
OUTPUT: p
INITIALIZE: ai , bi = 0 for i> W-1
ci;j , si;j = 0 8 i,j
begin
for i=0 to W-1
begin
for j=0 to W
begin
ai bj + ci;j?1 + si?1;j+1 = 2ci;j + si;j ; (1)
end
pi =si;0 ;
end
for i=W to 2W-1
pi =sW ?1;i?W +1 ;
end
The proposed design methodology involves treating
the bits as digits. Consequently, the bit-serial algorithm is modied for a general digit-size N as follows:
Algorithm
begin
for i=0 to WN -1
begin
for j=0 to WN
begin
Ai Bj + Ci;j?1 + Si?1;j+1 = 2N Ci;j + Si;j (2)
end
Pi =Si;0 ;
end
? for i= WN to 2 WN -1
Pi =S WN ?1;i? WN +1 ;
end
In the above algorithm the capital letters are used
to denote digits. For example, if the digit-size is N=4,
then A0 represents a digit constituting the four bits
a0 , a1 , a2 , and a3 , and has a value of A0 = 23a3 +
22 a2 + 21a1 + a0 . The next step is to translate (2)
into a hardware architecture in such a manner that the
nal architecture permits bit-level pipelining. This is
achieved by splitting the product (Ai Bj ) in (2) into
two parts in accordance with
(Ai Bj )L + 2N (Ai Bj )H + Ci;j?1 +
Si?1;j+1 = 2N Ci;j + Si;j :
(3)
Here, (Ai Bj )L represents the lower order terms of the
product (Ai Bj ), and (Ai Bj )H represents the higher
order terms. For example let us consider the product,
say, (A0 B0) for a digit-size of 4. Then, after splitting,
the product is given by
(A0 B0 ) =
a3 b3 22 + (a3 b2 + a2 b3 ) 21 + (4)
(a3 b1 + a2 b2 + a1 b3 ) 20 24 +
(a3 b0 + a2 b1 + a1 b2 + a0 b3 ) 23 +
(a2 b0 + a1 b1 + a0 b2 ) 22 +
(a1 b0 + a0 b1 ) 21 + a0 b0 20 :
In the above equation, the terms inside f:g equal (A0 B0 )H , and the terms inside [:] equal (A0 B0 )L . Upon
rearranging (3) one gets
(Ai Bj )L + Ci;j?1 + Si?1;j+1 =
2N fCi;j ? (Ai Bj )H g + Si;j :
(5)
0 = Ci;j ? (Ai Bj )H :
Ci;j
(6)
Dene
Then, (5) can be rewritten as
0
[(Ai Bj )L + (Ai Bj?1 )H ] + Ci;j
?1 +
N
0
Si?1;j+1 = 2 Ci;j + Si;j :
(7)
The advantage of rearranging the terms in this manner is that the lower order and the higher order partial products can be summed using a regular rectangular carry-save array and the nal architecture can be
pipelined to the bit-level. This was not always possible
in the previously obtained (using unfolding) digit-serial
architectures. As a result sample speeds comparable to
bit-parallel architectures or very low power consumption can be achieved. In the following, the proposed
design methodology is applied to various bit-serial multipliers to obtain new digit-serial multipliers.
after pipelining. Reduction in the critical path below
N full-adder delays is not possible because of the presence of feedback loops.
Based on (7), each cell inside the dashed boxes in
Fig. 1 is now replaced by the corresponding structure
shown in Fig. 2. It is found that the critical path
of the
resulting digit-serial architecture is N + 2 WN ? 1
full-adder
delays
if two sum outputs are generated, and
N + WN ? 1 full-adder delays if three sum outputs are
generated (see Fig. 2). The important fact to note is
that this critical path can be reduced to one full-adder
delay by suitable pipelining. Each digit-cell consists of
Partial Product Generator
D
a1
a2
a2 b1
a3
a3 b0
b2
(D)
A
4
(D)
(D)
4
Carry
Save
Array
4
4
Partial Products
a1b
a2b
2.1. Type-I multiplier
1
a0b
a2b
a0b
1
2
0
3
4
16
a1b
a2b(D)
a1b
0
3
FA
FA
a2b(D)
a0b
a1b(D)
3
2
1
FA
D
D
a1
a2
0
D
a3b
0
Sum_in
FA
a3b(D)
3
FA
a3b(D)
2
FA
a3b(D)
1
FA
Sum_out
0
4
Sum_in
Sum_in
D
a0b
FA
D
1
FA
FA
FA
FA
4
4
b3 b2 b1 b0
a0
(D)
b1
b0 b3
a0
a1 b2
b3
a0
2
Consider the bit-serial type-I multiplier [10] shown
in Fig. 1 where the coecient word-length is four bits.
This architecture contains four full adders, four multipliers, and some delay elements. In this multiplier
B
4
Sum_out
D
2
FA
FA
FA
1
4
4
Sum_out
FA
0
4
2
D
a3
Figure 2. Digit-cell for type-I multiplier.
D
D
D
D
Figure 1. Type-I bit-serial multiplier with wordlength of 4 bits.
the carry-out signal of every adder is fed back after
a delay to the carry-in signal of the same adder. The
critical path of this architecture is W full-adder delays.
The traditional approach for designing the digit-serial
architecture involves unfolding this structure by a factor equal to the digit-size N . However, the resulting
critical path would be W + N ? 1 full-adder delays;
which can be further reduced to N full-adder delays
a partial product generator module and a carry save
adder (CSA) module. The partial product generator
computes the 16 partial products. It is clear from (7)
that Ai is multiplied by both Bj and Bj?1 . This is
reected in the partial product generator where both
the current and the delayed versions of the signal B are
used, with a subscript denoting the delayed version of
B . For example, a3 b(1D) is used to denote the fact that
a3 is multiplied by a delayed version of b1. The CSA
module produces three sum output digits sum out0 ,
sum out1 and sum out2 in order to enable bit-level
pipelining. Therefore, in the nal stage a digit-serial
adder is required to sum all these outputs. A simple
digit-serial 3:2 compressor adder can be rst used to
reduce these three output digits to two digits. A digitserial carry look-ahead adder or any other fast carrypropagate adder is then used to add these two digits
to generate the nal result. A fast adder is necessary
because this stage should not form a bottleneck during bit-level pipelining. The entire architecture of the
digit-serial multiplier obtained using the proposed design methodology is shown in Fig. 3. Other bit-serial
b3 b2 b1 b0
D
a0
Full-Adder
D
a1
a3
a2
D
D
D
D
Carry
B3 B2 B1 B0
FA
A0
A1
FA
D
D
A2
A3
4
Digit
Digit
Digit
Digit
Cell
Cell
Cell
Cell
Digit-Serial
3 to 2
Compressor
D
FA
4
4
4
Sum
FA
4
Carry
Lookahead
Adder
4
Figure 4. Bit-serial type-II multiplier with
word-length of 4 bits.
FA
D
D
Figure 3. Type-I digit-serial multiplier with
word-length 16 bits and digit size 4 bits, where
the digit-cell is the one shown in Fig. 2.
multipliers similar to the type-I bit-serial multiplier can
obtained, for example, by changing the direction of the
data ow of the sum signal, or by making the input
signal broadcast to all multipliers [10]. The proposed
methodology is quite general and can be applied to all
these types of bit-serial multipliers to generate corresponding digit-serial multipliers which can be pipelined
at the bit-level.
2.3. Type-III multiplier
Consider the bit-serial type-III multiplier [10] shown
in Fig. 5. The salient feature of this architecture is
that the carry-out signal is not fed back as in the typeb3 b2 b1 b0
D
a0
D
D
a1
a2
a3
D
D
D
D
D
Figure 5. Bit-serial type-III multiplier with
word-length of 4 bits.
2.2. Type-II multiplier
Consider the bit-serial type-II multiplier [3] shown
in Fig. 4. The main dierence between this multiplier
and the type-I multiplier is that the critical path in
this architecture is just two full-adder delays. Moreover, this architecture can be pipelined at the bit-level
with an additional latency of only one clock-cycle unlike the type-I multiplier where the increase in latency
would depend on the word-length. If this architecture
is unfolded using the traditional technique, the critical
path would be (N +1) full-adder delays. However, as in
the case of the type-I multiplier, reduction in the critical path below N is not possible due to the presence
of feedback loops.
Based on (7), each cell in dashed boxes in Fig. 4
is replaced with corresponding digit-cells [6] which are
then cascaded to design the entire multiplier.
I multiplier. If this architecture is unfolded using the
traditional technique, the critical path would be W fulladder delays. However, in this case since there is no
carry feedback, the unfolded architecture can also be
pipelined at the bit-level. The bit-serial multiplication
for this architecture is expressed by
ai bj + ci?1;j + si?1;j+1 = 2ci;j + si;j : (8)
Going through the design methodology presented in
the beginning of this section, the nal digit-serial multiplication equation can be derived in accordance with
[(Ai Bj )L + (Ai?1 Bj )H ] + Ci0?1;j +
0 + Si;j :
Si?1;j+1 = 2N Ci;j
(9)
Based on (9), each dashed cell in Fig. 5 is replaced by
the digit-cell shown in Fig. 6. It should be noted that
Partial Product Generator
a4i b3
Ai
4
b2
b1
a4i+1b2
FA
b0
a4i-3 b
1
a4i+2
a4i-2 b
0
a4i+3
a4i-1
4
4
sign(bi )
Ai-1
Sum_in
sign(bi )
4
FA
FA
a4i-1 b2
FA
FA
a4i-1 b1
FA
FA
D
|bi |
sign(bi )
ai
D
|bi |
0
FA
FA
FA
Sum_out
1
block D
(i = 0)
4
4
FA
2
D
4
D
Sum_in
D
Figure 7. Bit-serial architecture for a singlyredundant multiplier.
Sum_out
1
4
4
a3
FA
D
Sum_in
a2
ai
16
0
4
a1
a4i+1b0
a4i+1b1
a b
a4i-2 b2 4i-3 3
a4i-2 b3
a4i+2b0
a
a4i b3
a4i b1
a4i b0
b
4i 2
a4i-1 b3
FA
a0
4
D
a4i+3b0
|bi |
a4i-4 b2
a4i+1
Partial Products
a4i+2b1
b+i
b-i
b3
4
Carry
Save
Array
... b0b1b2b3
B
4
FA
FA
FA
Sum_out
block D-1
(i = D-2)
block D-2
(i = D-1)
block 0
(i = D)
2
D
Carry_in
Carry_out
all identical blocks
4
4
Figure 8. Digit-serial singly-redundant multiplier architecture.
Figure 6. Digit-cell for the type-III multiplier.
the partial product generator uses two coecient digits Ai and Ai?1 unlike the previous architectures where
only one digit was used. It should also be noted that
the carry-save portion generates four outputs at each
stage. Therefore, at the output of the nal digit-cell, a
digit-serial 4:2 compressor and a fast carry look-ahead
adder are required to convert the four digits to one
digit. The resulting architecture can be pipelined at
the bit-level thereby achieving word-level speeds comparable to a bit-parallel multiplier.
2.4. Singly-redundant multiplier
The "carry-save" data representation is the most
common example of redundant representations, even
though it is not always described as such. The use of
redundant arithmetic in fast binary arithmetic was rst
published in [4]. The basic idea behind using redundant arithmetic is to avoid carry propagation in parallel
arithmetic cells. This is possible because the use of a
redundant number representation allows serial operations to proceed in a most-signicant-bit mode. There
are many avors of redundant arithmetic namely minimally redundant, maximally redundant, over redun-
dant, etc. The reader is referred to [12][10] for a more
detailed discussion of redundant arithmetic.
The architecture of a bit-serial singly redundant multiplier is shown in Fig. 7 [17]. The term singly arises
from the fact that one input (redundant) and output
operands (redundant) belong to the set f1, 0, 1g, and
another input belongs to the binary set f0, 1g. The key
to achieving coding and area eciency in the singlyredundant architecture lies in the hybrid representation of the input and output signals of the cell. This
architecture has a critical path of 4 adder delays and a
xed latency of 1 clock cycle.
Based on (9), the bit-serial singly-redundant architecture shown in Fig. 7 is transformed and the resulting digit-serial architecture is shown in Fig. 8, where
D=W/N. This architecture is comprised of three sets
of blocks. Each block, as before, consists of a partial
product generator and a carry-save adder section. The
architecture of the partial product generator for the
singly-redundant multiplier is shown in Fig. 9. The
structure of this generator is quite similar to that of
the type-III multiplier. The only dierence is that the
AND gate in the partial product generator of the typeIII multiplier is now replaced by a combination of an
AND and an XOR gate as shown in Fig. 9. This is
Bj
Ai
B j = bNj+3bNj+2bNj+1bNj
bNj+3
aNibNj+3
bNj+2
bNj+1
bNj
Partial-Product
Generator
A i-1
aNi+1b’Nj+1
aNi+1b’Nj+2
aNi+1b’Nj
aNi-3 b’Nj+3
aNi+2b’Nj+1aNi b’Nj+3 aNi+2b’Nj aNi b’Nj+2 aNi-2 b’Nj+3aNi b’Nj+1 aNi-2 b’Nj+2aNi b’Nj
aNi-4
bNj+2
Ai =
aNi+3aNi+2aNi+1aNi
FA
aNi-3
aNi+1
bNj+1
aNi-2
aNi+2
A i-1 =
aNi-1 aNi-2 aNi-3 aN-4
bNj
aNi+3
aNi+3b’Nj
sum_in 0
FA
aNi-1 b’Nj+3
FA
FA
aNi-1 b’Nj+2
aNi-1 b’Nj+1
FA
FA
FA
FA
FA
FA
FA
FA
4
aNi-1
D
sum_in 1 4
sum_in 2 4
bx
sum_out 1
sum_out 2
4
4
FA
sign(b x)
sum_out 0
4
FA
FA
FA
D
D
bx
ay
carry_in
carry_out
4
4
ay
a y b’x
Figure 10. Digit-cell for the singly redundant
multiplier for i=0 to D-2
Figure 9. Architecture of the partial product
generator for the singly-redundant multiplier
1
0
zero1
b 4i
b 4i+1
sign1
D
mag2
1
1
0
MUX
done to incorporate the sign-bit of the redundant digit.
The rst D ? 2 blocks in Fig. 8 are identical and the
general architecture for each one of them is shown in
Fig. 10. The architectures for the last two blocks are
slightly dierent from the rst D ? 2 blocks. The dierence arises due to the fact that the signal a is in two's
complement form with the most signicant bit representing the sign bit. Therefore, a small modication in
the architecture of the last two modules is required to
incorporate the sign bit.
In all the digit-cells, i represents the space index and
j represents the time index. For example, i is 0 for the
rst module, 1 for the second module, and D for the
last module. j is 0 for the rst clock cycle, 1 for the
second clock cycle, and so on. The value of j equal
to D ? 1 denotes the end of transmission of the entire
word. It should be noted that in Fig. 10 for i = 0, A?1
is 0 and therefore the rst cell can be simplied.
If the bit-serial architecture shown in Fig. 7 is unfolded using the traditional unfolding technique, the resulting digit-serial architecture will have a critical path
equal to W full-adder delays. This architecture can
however be pipelined to reduce the critical path to one
full-adder delay at the expense of increased latency.
The digit-serial architecture obtained using the proposed design methodology, however, has a critical path
equal to N full-adder delays. This architecture can be
pipelined to reduce the critical path to one full-adder
delay at the expense of a slight increase in latency. This
is the main advantage of the proposed design method-
MUX
mag1
1
zero2
b 4i+2
b 4i+3
sign2
Figure 11. Modified Booth’s Algorithm recoding module for a digit-size of 4 bits.
ology over the unfolding technique.
3. Extension to modied Booth recoding
One salient feature of the proposed design methodology is that the resulting digit-serial multipliers can be
further improved by incorporating the recoding technique of binary numbers. The recoding of binary numbers was rst hinted by Booth [5], and the most popular recoding algorithm currently used is the Modied
Booth's Algorithm (MBA) [13]. The application of the
MBA to multiplication can help reduce the number of
partial products by half. Therefore, the number of full
adders required to accumulate the partial products is
reduced, alongwith the critical path and the latency
of the multiplier. The MBA can be applied to the
proposed design methodology to generate a variety of
MBA digit-serial multipliers which can be pipelined at
the bit-level.
In order to illustrate the eect of recoding on the
proposed design methodology, the type-I multiplier is
considered as an example. The rst step in MBA is to
recode the multiplicand to a string of digits 2 f-2, -1,
0, 1, 2g. Fig. 11 shows a digit-serial recoding module
with digit-size 4 to implement the recoding. In this
circuit, if the recoded number is zero, the signal zero
will equal 0. Signal sign represents the sign of the recoded number and the signal mag decides whether the
absolute value of the recoded number is 1 or 2. After applying the MBA, the resulting digit-cell of type-I
multiplier is shown in Fig. 12 where the number of full
adders required is only half of that shown in Fig. 2.
Partial
Product
generator
a4i+3
1
mag0
a4i+2
0
1
MUX
a4i+1
0
1
MUX
0
MUX
CMOS technology and the resulting layout is shown in
Fig. 13. The layout has 6 rows of standard cells occupying an area of 5:77 mm2 . This has been successfully
simulated for various multiplier and multiplicand val-
a4i a4i-1
1
0
MUX
sign 0
zero0
4
a4i+1
1
mag1
0
MUX
a4i+3 a4i+2 a4i+1
a4i a4i-1
1
0
1
MUX
0
MUX
1
0
MUX
Figure 13. Layout (using 1:2 CMOS technology parameters) of a 16-bit16-bit type-I digitserial multiplier which has been designed using the proposed design methodology.
D
D
sign 1
zero0
D
sum_in0
FA
FA
4
FA
FA
sum_in1
sum_in2
D
D
FA
FA
FA
sum_out0
sum_out1
D
FA
sum_out2
D
D
Figure 12. Digit-cell for type-I multiplier with
recoding.
4. Experimental results
In this section, the dierent types of digit-serial
architectures obtained using the proposed design
methodology are compared with those obtained using
the traditional unfolding approach in terms of speed
and power consumption.
In order to verify the operation of the proposed
digit-serial architectures, the type-I digit-serial multiplier (non-pipelined) has been implemented using 1:2
ues. A bit-level pipelined version of this multiplier is
found to operate at sample rates of up to 200 MHz .
The critical paths of the various multiplier architectures are analyzed and the results are presented in
Figs. 14 and 15. Fig. 14 presents the comparison of
the critical paths for a xed digit-size of N = 4. Here,
the rst bar represents the critical path (in terms of
full-adder delays) for the bit-serial case. The next two
bars represent the critical paths of the unfolded architecture for the non-pipelined and the pipelined case,
respectively. It should be noted that the unfolded architecture has been pipelined to the maximum extent
possible. Finally, the last two bars represent the critical paths of the transformed architectures. Here, transformed architectures represent those architectures obtained using the proposed design methodology. It is
clear from the gure that transformation results in a
reduction in the critical path in both the non-pipelined
and the pipelined case except for the type-II multiplier
where there is a slight increase. Another important
observation is that by pipelining the transformed architecture, the critical path can always be reduced to
one full-adder delay. However, this cannot always be
bit-serial
unf. (non-pip.)
unf. (pip.)
trs. (non-pip.)
trs. (pip.)
25.0
20.0
15.0
ture where the critical path still increases linearly with
digit-size. The advantage of using the proposed design
methodology in these cases is that the critical path
can be reduced to one full-adder delay after bit-level
pipelining which is not possible in the unfolded case.
Fig. 16 shows the average power consumption values
obtained using the HEAT tool [18] for dierent digit-
10.0
5.0
0.0
I, trs.
I, unf.
II, trs.
II, unf.
III, trs.
III, unf.
red., trs.
red., unf.
50.0
type-I
type-II
type-III
architecture type
redundant
40.0
Figure 14. Comparison of critical paths for different digit-serial multiplier architectures for
a fixed digit-size of 4 and a word-length of 16
bits.
power (mW)
critical path (full-adder delays)
30.0
30.0
20.0
10.0
achieved by pipelining the unfolded architecture.
Fig. 15 shows the variation of the critical path with
digit-size for both the unfolded and the transformed
0.0
1
2
4
8
Digit - size
critical path (full-adder delays)
30.0
N=2
N=4
N=8
20.0
10.0
0.0
I(unf.) I(trs.) II(unf.) II(trs.) III(unf.) III(trs.) r(unf.) r(trs.)
architecture type
Figure 15. Plot of the variation of the critical
path with digit-size for both the unfolded and
the transformed non-pipelined digit-serial architectures with the word-length fixed at 16
bits.
non-pipelined architectures. It is clear from the gure
that for the unfolded architectures, the critical path
increases linearly (except for the type-III and the redundant multiplier) with digit-size. However, for the
transformed architectures, the critical path decreases
rst, reaches a minimum, and then starts to increase
(for larger digit-sizes). The exceptions are the type-II
transformed architecture and the redundant architec-
Figure 16. Power (mW) consumption values
(obtained using 1.2 micron (um) technology
parameters) for different non-pipelined digitserial multipliers.
The word sample frequency (wsf) is 25 MHz and the word-length
is 16 bits.
serial multipliers obtained both using the unfolding
technique and the proposed design methodology without any pipelining or supply voltage reduction. Here,
(I , trs.) for example, represents a type-I transformed
and (I , unf.) represents a unfolded digit-serial multiplier. The results show that, for smaller digit-sizes, the
singly-redundant multiplier consumes the least power
among all transformed multipliers. This is because the
critical path for this multiplier is directly proportional
to the digit-size. For the type-I and type-III multiplier, the critical path is found to be N + 2(W=N ? 1)
full-adder delays. Therefore, the critical path actually
decreases with digit-size, reaches a minimum between
digit-sizes 4 and 8 , and then starts to increase. Therefore, we conclude that for an arbitrary word-length W ,
the optimum digit-size for minimum power
p consumption in these architectures is close to 2W . It can
also be observed from the experimental results that
for larger digit-sizes the type-I transformed multiplier
consumes the least power. The singly-redundant multiplier consumes more power than the type-I multiplier
formed type-I digit-serial multiplier. Three sets of results are shown for the transformed multiplier including
the non-pipelined, 2-bit-level pipelined, and 1-bit-level
pipelined multiplier. The pipelined multipliers are operated at a lower supply voltage (values at the top of
the bars) to maintain the same word sample frequency.
The results show that the non-pipelined transformed
digit-serial multiplier consumes about 20% lower power
than the unfolded multiplier. The bit-level pipelined
multiplier operated at a lower supply voltage, however, consumes about 15 times lower power than the
unfolded multiplier.
Fig. 18 shows the comparison of average power consumption for the type-I digit-serial multiplier with and
50.0
5V
5V
40.0
unfolded
trs. non-pip.
trs. 2 bit pip.
trs, 1 bit pip.
5V
5V
5V
5V
5V
40.0
1
2
4
1.5
1.53V
3V
1.5
1.5 3V
3V
1
2
4
8
4V
1.
1.334V
4V
10.0
Figure 18. Comparison of power consumption for the Booth recoded (br.) and the nonrecoded type-I transformed multiplier for a
word-length of 16 bits.
5V
1.9
8
1.5 V
3V
4V
1.3
1.6
5V
20.0
1.9
1.5 8V
3V
5V
5V
30.0
1.6
1.3 5V
4V
Power (mW) at wsf = 25 MHz
5V
5V
20.0
Digit - size
50.0
0.0
5V
5V
30.0
0.0
10.0
trs. non-pip
trs. non-pip. br.
trs. 1 bit pip.
trs, 1 bit pip. br.
1.3
Power (mW) at wsf = 25 MHz
for larger digit-sizes because of increased power consumption in the latches. We also observe that the
proposed design methodology actually increased power
consumption for the non-pipelined type-II multiplier.
This is because the critical path in the transformed architecture was slightly longer than that of the unfolded
architecture. The pipelined transformed type-II multiplier, however, consumed lower power than that of the
unfolded pipelined type-II multiplier.
The various digit-serial architectures obtained using the proposed design methodology are pipelined
at the bit-level and are compared with corresponding unfolded architectures for power consumption (obtained using the HEAT tool). The results are presented in Table 1. Here, TFA and To represent, respectively, the propagation delays through a full-adder
and a latch. Since the architectures shown have been
fully pipelined, the critical path for all of them is
the sum of TFA and To. The results for the typeI unfolded and type-II unfolded multipliers have not
been presented because these architectures could not
be pipelined at the bit-level. The results show that the
bit-level pipelined transformed redundant architecture
oers the best choice taking into consideration both
the latency and the power consumption. More importantly, the latency is independent of the word-length
which makes it attractive for larger designs.
Fig. 17 shows the comparison of average power
consumption values for both the unfolded and trans-
8
Digit - size
Figure 17. Comparison of power consumption for the type-I digit-serial multiplier obtained using both the traditional unfolding
technique and the proposed design methodology. The word-length is 16 bits.
without recoding. Note that there is no corresponding recoding architecture for the bit-serial case. The
eect of 1-bit-level pipelining on the power consumption in this architecture is also shown in the gure.
It is observed that the 1-bit-level pipelined multipliers are operated at a lower supply voltage to maintain
the same word sample frequency as the non-pipelined
multiplier. The result shows that the multipliers with
recoding consumes about 22% lower power than those
without recoding. This is because the critical path and
the number of full-adders in the recoded architectures
are less than in those without recoding.
Table 1. Comparison of power (mW) consumption for various bit-level pipelined digit-serial architectures. The word sample frequency is 25MHz.
Architecture Design method Tcritical path
type-I
trs.
TFA + To
type-II
trs.
TFA + To
type-III
trs.
TFA + To
type-III
unf.
TFA + To
red.
trs.
TFA + To
red.
unf.
TFA + To
5. Conclusion
This paper has presented a design methodology for a
new class of digit-serial multiplier architectures. These
architectures can be pipelined at the bit-level and as a
result power can be reduced. It should also be noted
that for large digit-sizes, the CSA module can be implemented using the Wallace tree algorithm [19]. Experiments using HEAT tool showed that about 35%
lower power is obtained for the non-pipelined architecture using the Wallace tree approach when compared
to the CSA based architecture for a digit-size of 8 and
a word-length of 16 bits.
For a specied wsf, the clock speed required with a
bit-serial design is much higher than digit-serial with
digit-size 4 or 8. As a result, the power consumed by a
bit-serial design due to high-speed clock is much higher
and this favors digit-serial architectures with respect
to low-power consumption. Note that the power consumed by the clock is not accounted for by the HEAT
tool. Future work is directed towards the design of
low-power digit-serial multipliers with saturation capabilities.
References
[1] A. Aggoun, A. Ashur, and M. K. Ibrahim. A novel cell
architecture for high-performance digit-serial computation. Electronics Letters, 29(6):938{940, May 1993.
[2] A. Aggoun, A. Ashur, and M. K. Ibrahim. Systolic
digit-serial multiplier. IEE Proceedings: Circuits, Devices, and Systems, 143(1):14{20, Feb. 1996.
[3] D. Ait-Boudaoud, M. K. Ibrahim, and B. R. HayesGill. Novel cell architecture for bit level systolic arrays multiplication. In IEE Proceedings-E, volume 138,
January 1991.
[4] A. Avizienis. Signed digit number representation for
fast parallel arithmetic. IRE Trans. on Computers,
EC-10:389{400, Sept. 1961.
# latches
latency power (W=16, N=4)
2WN+3W N+W/N-1
8.47
2WN+2W
N+2
8.46
2WN+4W N+W/N
9.61
3WN+2W
W
12.90
2WN
N
8.20
4WN+W
W
13.08
[5] A. D. Booth. A signed binary multiplication technique. Quarterly J. Mech. Appl. Math., (4):236{240,
1951.
[6] Y.-N. Chang, J. H. Satyanarayana, and K. K. Parhi.
Low-power digit-serial multipliers. In Proc. IEEE International Symp. on Circuits and Systems (ISCAS),
pages 1023{1026, Hong Kong, June 1997.
[7] P. B. Denyer and D. Renshaw. VLSI Signal Processing: A Bit-Serial Approach. Addison Wesley, Reading,
MA, 1986.
[8] R. I. Hartley and P. F. Corbett. A digit-serial silicon compiler. In Proc. Design Automation Conference,
pages 646{649, 1988.
[9] R. I. Hartley and J. R. Jasica. Behavioral to structural translation in a bit-serial silicon compiler. IEEE
Trans. Computer Aided Design, 7:877{886, Aug. 1988.
[10] R. I. Hartley and K. K. Parhi. Digit-Serial Computation. Kluwer Academic, Boston, MA, 1995.
[11] M. Hatamian and G. Cash. Parallel bit-level pipelined
VLSI designs for high-speed signal proceessing. Proc.
IEEE, 75:1192{1202, Sept. 1987.
[12] K. Hwang. Computer Arithmetic, Principles, architectures, and design. John Wiley, 1979.
[13] O. L. MacSorley. High speed arithmetic in binary computers. Proc. IRE, 49:67{91, 1961.
[14] K. K. Parhi. A systematic approach for design of digitserial signal processing architectures. IEEE Trans.
Circuits and Systems, 38(4):358{375, Apr. 1991.
[15] K. K. Parhi and M. Hatamian. A high sample rate
recursive lter chip. IEEE Press, NY, 1988.
[16] K. K. Parhi, C.-Y. Wang, and A. P. Brown. Synthesis
of control circuits in folded pipelined DSP architectures. IEEE Journal of Solid-State Circuits, 27(1):29{
43, Jan. 1992.
[17] G. Privat. A novel class of serial-parallel redundant
signed-digit multipliers. In Proc. IEEE Int. Symp. on
Circuits and Systems, pages 2116{2119, New Orleans,
LA, May 1990.
[18] J. H. Satyanarayana and K. K. Parhi. HEAT: Hierarchical energy analysis tool. In 33rd ACM/IEE Design
Automation Conference, pages 9{14, Las Vegas, NV,
June 1996.
[19] C. S. Wallace. A suggestion for a fast multiplier. IEEE
Trans. Electron. Comput., EC-13:14{17, 1964.
Download