Solution - University of California, Berkeley

UNIVERSITY OF CALIFORNIA
College of Engineering
Department of Electrical Engineering and Computer Sciences
Last modified on March 30, 2004 by Gang Zhou (zgang@eecs.berkeley.edu)
Jan Rabaey
Homework #7 Solutions
EECS141
Problem 1: Variable-Block Carry-Skip Adder
The carry-skip adder is a pretty good circuit. However, upon closer inspection, you notice that if all the skip
blocks are the same size, the later blocks will finish switching quickly and then sit idle for a while waiting for
the carry signal to pass through all the bypass multiplexers. For example, in the diagram of a 32-bit carry-skip
adder below, the carry-out for bits 4-7 is ready at the same time as the carry-out for bits 0-3. This second
block then sits around doing nothing while MUX1 does its job.
[Figure: 32-bit carry-skip adder built from eight 4-bit blocks (bits 0-3, 4-7, 8-11, 12-15, 16-19, 20-23, 24-27, 28-31). Each block has a Setup stage and a Sum stage; Cin enters the first block, and the carry can bypass each block through a multiplexer.]
To speed up the circuit, we can vary the sizes of the skip blocks. Intuitively, we should shrink the first skip
block and make each subsequent block larger. Because the critical path also runs through the last skip block, the
block sizes must taper back down toward the end. As for the optimal sizes of all the skip blocks, you realize that
some really smart guy has already done all the mathematical derivations…which means that you don’t have to do it
yourself. After talking to this really smart guy, you learn that the optimal configuration for a 32-bit adder is
(under the assumption that tMUX = 2tprop):
[Figure: variable-block 32-bit carry-skip adder with block sizes 1, 3, 5, 7, 7, 5, 3, 1 (bit 0; bits 1-3; bits 4-8; bits 9-15; bits 16-22; bits 23-27; bits 28-30; bit 31). As before, each block has a Setup stage and a Sum stage, with Cin entering the first block and a bypass multiplexer at each block boundary.]
Estimate the worst-case delay for the simple 32-bit carry-skip adder in the first diagram and then estimate the
amount of delay improvement with this new variable-block scheme. Assume that the setup (creation of propagate
and generate signals) takes tsetup, each bit of carry propagation takes tprop (i.e. a skip block of m bits has a delay of
m*tprop), a MUX has a delay of tMUX, and the sum generation has a delay of tsum. Leave your answers in terms of
tsetup, tprop, tMUX, and tsum.
Solution:
For the standard carry-skip adder, the delay is just as was derived in lecture. In the worst case, we must pass the
carry-out signal from the first skip block (bits 0-3) all the way to the end. In that situation, the carry passes
through all 7 MUXes. We then ripple through the last skip block (bits 28-31) to calculate the sum for bit 31. At
that point, the carry only has to propagate through 3 bits (28, 29, and 30) to reach the carry-in of bit
31…which is needed to calculate the sum. Thus, the total delay of the simple carry-skip adder is:
Delay = tsetup + 4tprop + 7tMUX + 3tprop + tsum = tsetup + 7tprop + 7tMUX + tsum
For the variable-block carry-skip adder, the worst-case path is the same, but the block sizes are different. The
first block is only 1 bit, so a single tprop brings the carry to the first MUX; the last block is also only 1 bit,
so its carry-in feeds the final sum directly. The delay is thus:
Delay = tsetup + tprop + 7tMUX + tsum
We get a delay improvement of 6tprop.
Since we told you to assume that tMUX = 2tprop, it is also acceptable to express the answer with these terms traded off against each other.
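The delay bookkeeping above can be sanity-checked with a short script. This is a sketch under the problem’s assumptions (tMUX = 2tprop; tsetup and tsum are dropped since they are identical in both designs), with tprop = 1 as the unit:

```python
# Sketch: worst-case delay of a carry-skip adder, in units of tprop.
# Assumes tMUX = 2*tprop; tsetup and tsum are omitted (common to both designs).
def skip_delay(blocks, t_prop=1.0, t_mux=2.0):
    """Ripple out of the first block, cross every block boundary through
    a bypass MUX, then ripple to the carry-in of the most significant bit."""
    n_mux = len(blocks) - 1                    # one bypass MUX per boundary
    ripple = blocks[0] + (blocks[-1] - 1)      # tprop-units of in-block rippling
    return ripple * t_prop + n_mux * t_mux

fixed = skip_delay([4] * 8)                    # eight 4-bit blocks
variable = skip_delay([1, 3, 5, 7, 7, 5, 3, 1])
print(fixed, variable, fixed - variable)       # 21.0 15.0 6.0 -> saves 6*tprop
```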
Problem 2: Short Adders
In this question we are going to compare the speed of different types of adders.
a) Calculate the worst-case delay of an 8-bit ripple-carry adder consisting of the full-adder blocks shown below.
You can use tp for the AND and OR functions and 2tp for the XOR gates. Express your answer in terms of tp.
Solution:
In the first stage, we need one XOR delay (2tp) to obtain the propagate signal, then two gate delays (AND followed
by OR) to reach cout. For each of the next 6 stages we need 2 gate delays from cin to cout. In the final stage we
need the cin-to-sum XOR delay. The total delay is
ttotal = 4tp + 6*2tp + 2tp = 18tp
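The stage-by-stage count above can be written out as a small sketch (tp = 1 is an assumed unit):

```python
# Sketch: worst-case ripple-carry adder delay, in units of tp.
T_XOR, T_GATE = 2, 1          # XOR = 2tp; AND, OR = 1tp each

def rca_delay(n_bits):
    """Worst-case delay of an n-bit ripple-carry adder, in units of tp."""
    first = T_XOR + 2 * T_GATE            # propagate XOR, then AND + OR to cout
    middle = (n_bits - 2) * 2 * T_GATE    # cin -> cout (AND + OR) per inner stage
    last = T_XOR                          # cin -> sum XOR in the final stage
    return first + middle + last

print(rca_delay(8))  # 18, i.e. 18tp
```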
b) For the second part of this question you are to implement an 8-bit Carry-Look Ahead adder. For n-input
AND/OR gates use tp-n = 0.25*n^2*tp; similarly, for n-input XOR gates use tXOR-n = 0.5*n^2*tp. Find the
worst-case delay of this adder, again in terms of tp.
Solution:
In part b) we assume that the CLA logic functions implemented are
co3 = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0 + p3 p2 p1 p0 cin
co2 = g2 + p2 g1 + p2 p1 g0 + p2 p1 p0 cin
co1 = g1 + p1 g0 + p1 p0 cin
co0 = g0 + p0 cin
Ripple the carry between two blocks
We spend an XOR delay to obtain the g’s and p’s. Then from the inputs to co3 we go through one 5-input AND and one
5-input OR; co3 is passed to the next block, which generates co2 the same way and finally goes through an XOR gate
to produce the sum. Note that co2 sees a 4-input AND and a 4-input OR.
2tp + 0.25*5^2 tp + 0.25*5^2 tp + 0.25*4^2 tp + 0.25*4^2 tp + 2tp = (2 + 6.25 + 6.25 + 4 + 4 + 2)tp = 24.5tp
Select the sum out between two alternatives
As an alternative (faster) solution, the second block performs a carry-select operation. In this case both
candidate sums are generated in the second block, and we only need to choose between them using a MUX. A MUX
implements the function F = a s + b s’, so it has the delay of a 2-input AND plus a 2-input OR.
ttotal = 2tp + 0.25*5^2 tp + 0.25*5^2 tp + tp + tp = (2 + 6.25 + 6.25 + 1 + 1)tp = 16.5tp
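The two tallies can be reproduced with the assumed gate-sizing rules (a sketch, with tp = 1):

```python
# Sketch: the two 8-bit CLA tallies, using the assumed sizing rules
# tp(n-input AND/OR) = 0.25*n^2*tp and tp(n-input XOR) = 0.5*n^2*tp (tp = 1).
def t_and_or(n): return 0.25 * n * n
def t_xor(n):    return 0.5 * n * n

# Variant 1: ripple the carry between the two 4-bit CLA blocks.
ripple_blocks = t_xor(2) + 2 * t_and_or(5) + 2 * t_and_or(4) + t_xor(2)
# Variant 2: carry-select; both sums are precomputed and a MUX
# (2-input AND + 2-input OR) chooses between them.
carry_select = t_xor(2) + 2 * t_and_or(5) + 2 * t_and_or(2)
print(ripple_blocks, carry_select)  # 24.5 16.5
```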
c) Repeat the same calculations for 32-bit adders. Hint: Implement the Carry-Look Ahead Adder as a block
CLA of 4-bit block-length.
Solution:
In the CLA adder it takes an XOR delay (2tp) to generate the individual pi, gi, then a 4-input AND and a 4-input OR
to generate the block P’s and G’s (4tp + 4tp). From the outputs of the top level (in the diagram) it takes an
additional 4-input AND and 4-input OR delay (4tp + 4tp) to generate the mid-level block P, G. Then a 2-input AND
and 2-input OR take us through the bottom level (tp + tp). After another 2-input AND and 2-input OR we go through
the middle level (tp + tp) and arrive back at the top level of the diagram. At this top level, a 2-input AND and
2-input OR (tp + tp) are needed to generate the final carry, and a final XOR (2tp) is needed to obtain the sum.
(2 + 4 + 4 + 4 + 4 + 2 + 2 + 2 + 2)tp = 26tp
In the RCA (ripple-carry adder) case we again have 4tp + 30*2tp + 2tp = 66tp.
As we can clearly see, the carry-lookahead adder gains a distinct advantage as the number of bits increases. But
for adders with fewer than about 10 bits, it is usually wiser to use a simple ripple-carry implementation.
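For reference, the 32-bit tallies above, itemized in code (tp = 1, an assumed unit):

```python
# Sketch: the 32-bit block-CLA and RCA delay tallies, in units of tp.
cla_32 = sum([
    2,      # XOR: individual pi, gi
    4 + 4,  # 4-input AND + 4-input OR: first-level block P, G
    4 + 4,  # 4-input AND + 4-input OR: mid-level block P, G
    1 + 1,  # 2-input AND + 2-input OR: through the bottom level
    1 + 1,  # 2-input AND + 2-input OR: back through the middle level
    1 + 1,  # 2-input AND + 2-input OR: final carry at the top level
    2,      # XOR: final sum
])
rca_32 = 4 + 30 * 2 + 2   # first stage + 30 ripple stages + final sum XOR
print(cla_32, rca_32)     # 26 66
```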
More details:
The inputs of the top level are the individual pi, gi. As mentioned in part b) the equations implemented are
pi+3:i = pi+3 pi+2 pi+1 pi
gi+3:i = gi+3 + pi+3 gi+2 + pi+3 pi+2 gi+1 + pi+3 pi+2 pi+1 gi
We can see that the worst-case delay is a 4-input AND followed by a 4-input OR.
gi+3:i means a carry is generated within the “block encompassing bit positions i+3 to i”;
pi+3:i means the carry-in of the block is passed through to the carry-out of the block.
The mid level blocks implement
pi+15:i = pi+15:i+12 pi+11:i+8 pi+7:i+4 pi+3:i
gi+15:i = gi+15:i+12 + pi+15:i+12 gi+11:i+8 + pi+15:i+12 pi+11:i+8 gi+7:i+4 + pi+15:i+12 pi+11:i+8 pi+7:i+4 gi+3:i
Once we have the pi:k’s, gi:k’s, and co(k-1) (i.e. the carry-out at stage k-1), we can obtain the carry-out of
stage i using the relation
coi = gi:k + pi:k co(k-1)
(which has the delay of a 2-input AND and a 2-input OR, meaning that to get a carry-out at bit position i, the
block encompassing bits i down to k must either generate a carry itself or pass along the carry arriving as co(k-1)).
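As a sanity check (not asked for in the problem), a short script can verify that the 4-bit CLA equations listed in part b) reproduce the carries of ordinary addition. The propagate is taken as p = a XOR b and generate as g = a AND b, matching the full-adder setup of part a):

```python
# Sketch: verify the 4-bit CLA carry equations against plain addition.
def bits(x, n=4):
    return [(x >> i) & 1 for i in range(n)]

def cla_carries(a, b, cin):
    """co0..co3 from g, p per the CLA equations (p = a XOR b, g = a AND b)."""
    ba, bb = bits(a), bits(b)
    g = [ba[i] & bb[i] for i in range(4)]
    p = [ba[i] ^ bb[i] for i in range(4)]
    co0 = g[0] | (p[0] & cin)
    co1 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & cin)
    co2 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
           | (p[2] & p[1] & p[0] & cin))
    co3 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
           | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & cin))
    return [co0, co1, co2, co3]

# Exhaustive check: carry out of bit i equals bit i+1 of the low-bit sum.
for a in range(16):
    for b in range(16):
        for cin in (0, 1):
            ref = []
            for i in range(4):
                m = 1 << (i + 1)
                ref.append(((a % m + b % m + cin) >> (i + 1)) & 1)
            assert cla_carries(a, b, cin) == ref
print("CLA carries match ripple addition")
```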
Problem 3: Pipelined Multipliers
An array multiplier consists of rows of adders, each producing partial sums that are subsequently fed to the next
adder row. In this problem, we consider the effects of pipelining such a multiplier by inserting registers between
the adder rows.
a) Redraw Figure 11.30 (textbook, pg. 590) inserting word-level pipeline registers as required to achieve maximal
throughput for the 4x4 multiplier. Hint #1: you must use additional registers to keep the input bits
synchronized to the appropriate partial sums. Hint #2: just use little filled black rectangles to indicate
registers and assume all registers are clocked using the same clock.
b) Repeat part (a) for a carry-save, as opposed to ripple-carry, architecture.
c) For each of the two multiplier architectures, compare the critical path, throughput, and latency of the pipelined
and non-pipelined versions.
d) Which architecture is better suited to pipelining, and how does the choice of a vector-merging adder affect this
decision?
Solution:
Problem 4: Modified Booth Recoding
Start with an N×N array multiplier and notice that the number of partial products required is N. This implies N-1
additions and thus N-1 rows in the array. Modified Booth Recoding (MBR) is a technique for halving the
number of partial products produced during a multiplication. This is nice because fewer partial products means
fewer additions, ultimately resulting in a faster multiplication.
a) Two important number system principles are required to understand how MBR works. First, the base of the
number system is called the radix. Decimal is radix-10, binary is radix-2, hexadecimal is radix-16, and so on.
MBR uses a radix-4 number system. Since two binary bits can represent four numbers, we can take an ordinary
binary number and split it into two bit ‘groups’ to form a radix-4 number:
Ordinary radix-2 (binary) number:
[0 0 1 1 1 0 1 0]2 = 0*2^7 + 0*2^6 + 1*2^5 + 1*2^4 + 1*2^3 + 0*2^2 + 1*2^1 + 0*2^0 = 58
Radix-4 number:
[00 11 10 10]2 = [0 3 2 2]4 = 0*4^3 + 3*4^2 + 2*4^1 + 2*4^0 = 58
Note that in binary, the 8 bits mean that we will have 8 partial products. In radix-4, we have only four ‘bits’, hence
half the partial products. When multiplying X*Y, two steps are taken before the multiplication is performed. First,
we recode Y using radix-4. Second, we calculate the four possible unshifted partial products: 0*X, 1*X, 2*X, and
3*X. The radix-4 ‘bit’ tells us which of these partial products to select and how far to shift it (i.e. how many
zeros to append to the end). Demonstrate how this works by multiplying 94*121 using this technique.
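The recoding just described can be sketched in a few lines. The helper below is illustrative only; it returns the radix-4 digits least-significant first, so the digits of 121 read [1, 3, 2, 1] from the most significant end:

```python
# Sketch of the radix-4 scheme: recode Y in base 4, then select and
# shift partial products of X. Digits come out least significant first.
def radix4_digits(y, n_digits=4):
    return [(y >> (2 * i)) & 0b11 for i in range(n_digits)]

X, Y = 94, 121
digits = radix4_digits(Y)                         # [1, 2, 3, 1] (LSB first)
partials = [d * X * 4 ** i for i, d in enumerate(digits)]
print(digits, partials, sum(partials))            # sum is 11374 = 94 * 121
```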
b) Note that the biggest problem with this radix-4 multiplication is the partial product generation. 0*X, 1*X, and
2*X are easily generated using AND gates and a shifter. However, 3*X must be generated by adding 1*X + 2*X.
This addition is in the critical path of the multiplier, so we would like to remove it. We do this by getting rid of all
the 3*X partial products in the radix-4 calculation. Essentially, we need to remove radix-4 ‘bits’ that have the
value [3]4 or, equivalently, [11]2.
Consider a number system in which each ‘bit’ position can hold three values: {-1, 0, 1}. This is called a redundant
number system because there is more than one way to represent the same number. Numbers in this format can be
treated in the same way as ordinary binary numbers, e.g.:
Ordinary binary number:
[0 0 1 1 1 0 1 0]2 = 0*2^7 + 0*2^6 + 1*2^5 + 1*2^4 + 1*2^3 + 0*2^2 + 1*2^1 + 0*2^0 = 32 + 16 + 8 + 2 = 58
Redundant number system:
[0 1 0 -1 1 0 1 0]2 = 0*2^7 + 1*2^6 + 0*2^5 + (-1)*2^4 + 1*2^3 + 0*2^2 + 1*2^1 + 0*2^0 = 64 – 16 + 8 + 2 = 58
Note that all ordinary binary numbers are also included in this redundant number system, as well as a whole bunch
more numbers that contain –1 ‘bits’.
Convert the following redundant numbers into standard binary numbers and then into radix-4 numbers:
[0 0 1 0 0 –1 0 0]2, [0 1 0 –1 0 1 0 1]2, [0 1 0 1 1 –1 0 1]2. Note that a standard binary sequence of the form
{0, some ones} can be converted to a redundant sequence of the form {1, some zeros, –1}. By replacing each string
of 1’s this way, we can eliminate the possibility of two 1’s in a radix-4 group, thus eliminating the 3*X partial
product!
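The conversions can be checked with a short sketch. The helper below is mine, written for illustration; it evaluates a signed-digit (redundant) binary number given most-significant digit first, then prints the ordinary binary form and the radix-4 digits:

```python
# Sketch: evaluate signed-digit binary numbers and show binary/radix-4 forms.
def redundant_value(digits):
    """Evaluate signed digits {-1, 0, 1}, most significant first."""
    v = 0
    for d in digits:
        v = 2 * v + d
    return v

for ds in ([0, 0, 1, 0, 0, -1, 0, 0],
           [0, 1, 0, -1, 0, 1, 0, 1],
           [0, 1, 0, 1, 1, -1, 0, 1]):
    v = redundant_value(ds)
    binary = format(v, '08b')
    radix4 = [int(binary[i:i + 2], 2) for i in range(0, 8, 2)]
    print(v, binary, radix4)
# prints:
# 28 00011100 [0, 1, 3, 0]
# 53 00110101 [0, 3, 1, 1]
# 85 01010101 [1, 1, 1, 1]
```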
c) MBR basically searches for strings of one’s in the binary number, converts them into an equivalent redundant
number representation, treats the result in radix-4, then does the multiplication. This can be easily accomplished
by using the look-up table in Table 1. Since we are using radix-4, i = {0, 2, 4, 6, 8, …}. Also, Y-1 (the bit
position just below the LSB) is taken to be 0.
Now for X*Y, the partial products 0*X, 1*X, 2*X, -2*X, -1*X, -0*X are generated, Y is recoded according to the
table, and then the multiplication is performed. Recode Y = 121 into radix-4 bits of {-2, -1, 0, 1, 2} according to
the table. Now perform the multiplication 94*121.
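Since Table 1 itself is not reproduced in this document, the sketch below instead uses the digit formula -2*y[i+1] + y[i] + y[i-1] (with y[-1] = 0), which is what the standard MBR table implements; treat that equivalence as an assumption for illustration:

```python
# Sketch: modified Booth recoding of Y into radix-4 digits {-2,-1,0,1,2},
# then the multiplication as a sum of shifted partial products.
def booth_digits(y, n_bits=8):
    bits = [(y >> i) & 1 for i in range(n_bits)]
    prev = 0                                    # y[-1] = 0
    digits = []
    for i in range(0, n_bits, 2):
        # digit = -2*y[i+1] + y[i] + y[i-1]; always in {-2, -1, 0, 1, 2}
        digits.append(-2 * bits[i + 1] + bits[i] + prev)
        prev = bits[i + 1]
    return digits

X, Y = 94, 121
digits = booth_digits(Y)
print(digits)                                             # [1, -2, 0, 2]
print(sum(d * X * 4 ** i for i, d in enumerate(digits)))  # 11374 = 94 * 121
```

Note that no digit is 3, so the troublesome 3*X partial product never appears.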
d) Now let’s generate those partial products. A straightforward generation can be made using three signals: negate
(1: negate X, 0: no change), shift (1: shift left by one, 0: no change), and zero (1: force to zero, 0: no change).
Design a circuit that implements these three signals using standard gates (AND, OR, INVERTER, XOR, etc.).
e) So what does all this gain us? We’ve traded the 3*X partial product for –1*X and –2*X. Recall that negation in
two’s complement requires us to invert all the bits, then add 1. How can we add these ones in without putting an
entirely new adder in the critical path? Hint: Try to find “holes” in the multiplication (i.e. low-order bits that
are known to be zero and can be replaced with our negate signal).
f) Design a circuit that uses the three signals in Part d to generate –2*X, –1*X, 0*X, 1*X, and 2*X. Bear in mind
that the negation does not need to add one because that will be taken care of using the method in Part e.
g) Congratulations! You’ve created all the primary building blocks of a Booth recoded multiplier. Lastly, what
additional improvement can be made to make this one of the fastest multipliers available?
Solution: