UNIVERSITY OF CALIFORNIA
College of Engineering
Department of Electrical Engineering and Computer Sciences

EECS141 Homework #7 Solutions
Jan Rabaey
Last modified on March 30, 2004 by Gang Zhou (zgang@eecs.berkeley.edu)

Problem 1: Variable-Block Carry-Skip Adder

The carry-skip adder is a pretty good circuit. However, upon closer inspection, you notice that if all the skip blocks are the same size, the later blocks will finish switching quickly and then sit idle for a while, waiting for the carry signal to pass through all the bypass multiplexers. For example, in the diagram of a 32-bit carry-skip adder below, the carry-out for bits 4-7 will be ready at the same time as the carry-out for bits 0-3. This second block will sit around doing nothing while MUX1 does its job.

[Figure: 32-bit carry-skip adder built from eight equal 4-bit skip blocks (bits 0-3, 4-7, ..., 28-31), each with a Setup stage and a Sum stage; the carry travels from Cin through one bypass multiplexer per block.]

To speed up the circuit, we could vary the size of the skip blocks. Intuitively, we should then be able to reduce the size of the first skip block and make each subsequent block increasingly larger. Because the critical path includes the last skip block, we must also start to taper down the size of each block as we approach the end. To obtain the optimal sizes of all the skip blocks, you realize that some really smart guy has already done all the mathematical derivations... which means that you don't have to do it yourself. After talking to this really smart guy, you know that the optimal configuration for a 32-bit adder (under the assumption that tMUX = 2tprop) is:

[Figure: optimal variable-block 32-bit carry-skip adder; the first block covers only bit 0 and the last block only bit 31, with six larger skip blocks covering bits 1-30 in between.]

Estimate the worst-case delay for the simple 32-bit carry-skip adder in the first diagram, and then estimate the delay improvement with this new variable-block scheme.
Assume that the setup (creation of the propagate and generate signals) takes tsetup, each bit of carry propagation takes tprop (i.e. a skip block of m bits has a ripple delay of m*tprop), a MUX has a delay of tMUX, and the sum generation has a delay of tsum. Leave your answers in terms of tsetup, tprop, tMUX, and tsum.

Solution:

For the standard carry-skip adder, the delay is just as derived in lecture. In the worst case, we must pass the carry-out signal from the first skip block (bits 0-3) all the way to the end, which means going through all 7 MUXes. We then have to run through the last skip block (bits 28-31) in order to calculate the sum for bit 31. At that point, we only have to propagate the carry through 3 bits (28, 29, and 30) to get the carry-in to bit 31, which is what we need to calculate the sum. The total delay of the simple carry-skip adder is therefore:

Delay = tsetup + 4tprop + 7tMUX + 3tprop + tsum = tsetup + 7tprop + 7tMUX + tsum

For the variable-block carry-skip adder, the worst-case path is the same, but the block sizes differ: there is only 1 bit to ripple through in the first block and 1 bit in the last block. The delay is thus:

Delay = tsetup + tprop + 7tMUX + tsum

This is a delay improvement of 6tprop. Since we told you to assume that tMUX = 2tprop, it is also fine if you substituted one of these terms for the other.

Problem 2: Short Adders

In this question we are going to compare the speed of different types of adders.

a) Calculate the worst-case delay of an 8-bit ripple-carry adder consisting of the full-adder blocks shown below. You can use tp for the AND and OR functions and 2tp for the XOR gates. Express your answer in terms of tp.

Solution:

We need one XOR delay to obtain the first propagate signal, then two gate delays (AND plus OR) to reach the first cout. For each of the next 6 stages we need 2 gate delays from cin to cout. In the final stage we have the cin-to-sum XOR delay.
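The stage-by-stage accounting above can be sanity-checked with a short tally (a sketch; all delays are counted in units of tp, and the variable names are illustrative):

```python
# Worst-case delay of the 8-bit ripple-carry adder, in units of tp.
# Gate delays assumed from the problem: AND/OR gate = tp, XOR gate = 2*tp.
T_XOR, T_GATE = 2, 1

first_stage = T_XOR + 2 * T_GATE   # propagate via XOR, then AND + OR to cout: 4
middle_stages = 6 * 2 * T_GATE     # six stages, AND + OR from cin to cout: 12
last_stage = T_XOR                 # final cin -> sum XOR: 2

total = first_stage + middle_stages + last_stage
print(total)  # 18, i.e. 18*tp
```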
The total delay is ttotal = 4tp + 6*2tp + 2tp = 18tp.

b) For the second part of this question you are to implement an 8-bit carry-lookahead adder. For n-input AND/OR gates use tp-n = 0.25*n^2*tp; similarly, for n-input XOR gates use tXOR-n = 0.5*n^2*tp. Find the worst-case delay of this adder, again in terms of tp.

Solution:

In part b) we assume that the carry-lookahead logic implements:

co3 = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0 + p3 p2 p1 p0 cin
co2 = g2 + p2 g1 + p2 p1 g0 + p2 p1 p0 cin
co1 = g1 + p1 g0 + p1 p0 cin
co0 = g0 + p0 cin

Ripple the carry between the two blocks: We spend an XOR delay to obtain the g's and p's. Then, from the inputs to co3, we go through one 5-input AND and one 5-input OR; co3 is then passed to the next block, where it generates co2 (co2 sees a 4-input AND and a 4-input OR) and finally goes through an XOR gate to produce the sum:

ttotal = 2tp + 0.25*5^2*tp + 0.25*5^2*tp + 0.25*4^2*tp + 0.25*4^2*tp + 2tp = (2 + 6.25 + 6.25 + 4 + 4 + 2)tp = 24.5tp

Select the sum between two alternatives: As an alternative (faster) solution, the second block performs a carry-select operation. In this case, both possible sums are generated in the second block (one assuming carry-in 0, one assuming carry-in 1), and we only need to choose between them using a MUX. A MUX implements the function F = a s + b s', so it has the delay of a 2-input AND plus a 2-input OR:

ttotal = 2tp + 0.25*5^2*tp + 0.25*5^2*tp + tp + tp = (2 + 6.25 + 6.25 + 1 + 1)tp = 16.5tp

c) Repeat the same calculations for 32-bit adders. Hint: implement the carry-lookahead adder as a block CLA with a 4-bit block length.

Solution:

In the CLA adder it takes an XOR delay (2tp) to generate the individual pi and gi, then a 4-input AND and a 4-input OR to generate the block P's and G's (4tp + 4tp). From the outputs of the top level (in the diagram) it takes an additional 4-input AND and 4-input OR delay (4tp + 4tp) to generate the mid-level block P and G. Then a 2-input AND and a 2-input OR take us through the bottom level (tp + tp). After another 2-input AND and 2-input OR we go through the middle level (tp + tp) and arrive back at the top level of the diagram.
At this top level, a 2-input AND and a 2-input OR (tp + tp) are needed to generate the final carry, and a final XOR (2tp) is needed to obtain the sum:

ttotal = (2 + 4 + 4 + 4 + 4 + 2 + 2 + 2 + 2)tp = 26tp

In the RCA (ripple-carry adder) case we again have 4tp + 30*2tp + 2tp = 66tp.

As we can clearly see, as the number of bits increases the carry-lookahead adder has a distinct advantage. But for adders with fewer than about 10 bits, a simple ripple-carry implementation is usually the wiser choice.

More details: The inputs of the top level are the individual pi and gi. As mentioned in part b), the equations implemented are:

p(i+3:i) = p(i+3) p(i+2) p(i+1) p(i)
g(i+3:i) = g(i+3) + p(i+3) g(i+2) + p(i+3) p(i+2) g(i+1) + p(i+3) p(i+2) p(i+1) g(i)

from which we can see that the worst-case delay is a 4-input AND plus a 4-input OR. Here g(i+3:i) means a carry is generated within the block encompassing bit positions i+3 down to i, and p(i+3:i) means the carry-in of the block is passed to the carry-out of the block. The mid-level blocks implement:

p(i+15:i) = p(i+15:i+12) p(i+11:i+8) p(i+7:i+4) p(i+3:i)
g(i+15:i) = g(i+15:i+12) + p(i+15:i+12) g(i+11:i+8) + p(i+15:i+12) p(i+11:i+8) g(i+7:i+4) + p(i+15:i+12) p(i+11:i+8) p(i+7:i+4) g(i+3:i)

Once we have the p(i:k)'s, g(i:k)'s, and co(k-1) (the carry out at stage k-1), we can obtain the carry-out of stage i using the relation:

co(i) = g(i:k) + p(i:k) co(k-1)

which has the delay of a 2-input AND plus a 2-input OR. In other words, to get a carry out at bit position i, the block encompassing bits i down to k must either generate a carry itself or pass the carry coming in as co(k-1).

Problem 3: Pipelined Multipliers

An array multiplier consists of rows of adders, each producing partial sums that are subsequently fed to the next adder row. In this problem, we consider the effects of pipelining such a multiplier by inserting registers between the adder rows.

a) Redraw Figure 11.30 (textbook, pg. 590), inserting word-level pipeline registers as required to achieve maximal throughput for the 4x4 multiplier.
Hint #1: you must use additional registers to keep the input bits synchronized with the appropriate partial sums.
Hint #2: just use little filled black rectangles to indicate registers, and assume all registers are clocked using the same clock.

b) Repeat part (a) for a carry-save, as opposed to ripple-carry, architecture.

c) For each of the two multiplier architectures, compare the critical path, throughput, and latency of the pipelined and non-pipelined versions.

d) Which architecture is better suited to pipelining, and how does the choice of a vector-merging adder affect this decision?

Solution:

Problem 4: Modified Booth Recoding

Start with an NxN array multiplier and notice that the number of partial products required is N. This implies N-1 additions, and thus N-1 rows in the array. Modified Booth Recoding (MBR) is a technique for halving the number of partial products produced during a multiplication. This is nice because fewer partial products means fewer additions, ultimately resulting in a faster multiplication.

a) Two important number-system principles are required to understand how MBR works. First, the base of the number system is called the radix. Decimal is radix-10, binary is radix-2, hexadecimal is radix-16, and so on. MBR uses a radix-4 number system. Since two binary bits can represent four numbers, we can take an ordinary binary number and split it into two-bit 'groups' to form a radix-4 number:

Ordinary radix-2 (binary) number:
[0 0 1 1 1 0 1 0]2 = 0*2^7 + 0*2^6 + 1*2^5 + 1*2^4 + 1*2^3 + 0*2^2 + 1*2^1 + 0*2^0 = 58

Radix-4 number:
[00 11 10 10]2 = [0 3 2 2]4 = 0*4^3 + 3*4^2 + 2*4^1 + 2*4^0 = 58

Note that in binary, the 8 bits mean that we will have 8 partial products. In radix-4, we have only four 'bits', hence half the partial products. When multiplying X*Y, two steps are taken before the multiplication is performed. First, we recode Y using radix-4. Second, we calculate the four possible unshifted partial products: 0*X, 1*X, 2*X, and 3*X.
The radix-4 'bit' tells us which of these partial products to select and how far to shift it (i.e. how many zeros to append to the end). Demonstrate how this works by multiplying 94*121 using this technique.

b) Note that the biggest problem with this radix-4 multiplication is the partial-product generation. 0*X, 1*X, and 2*X are easily generated using AND gates and a shifter. However, 3*X must be generated by adding 1*X + 2*X. This addition is in the critical path of the multiplier, so we would like to remove it. We do this by getting rid of all the 3*X partial products in the radix-4 calculation. Essentially, we need to remove radix-4 'bits' that have the value [3]4 or, equivalently, [11]2.

Consider a number system in which each 'bit' position can hold three values: {-1, 0, 1}. This is called a redundant number system because there is more than one way to represent the same number. Numbers in this format can be evaluated in the same way as ordinary binary numbers, e.g.:

Ordinary binary number:
[0 0 1 1 1 0 1 0]2 = 0*2^7 + 0*2^6 + 1*2^5 + 1*2^4 + 1*2^3 + 0*2^2 + 1*2^1 + 0*2^0 = 32 + 16 + 8 + 2 = 58

Redundant number system:
[0 1 0 -1 1 0 1 0]2 = 0*2^7 + 1*2^6 + 0*2^5 + -1*2^4 + 1*2^3 + 0*2^2 + 1*2^1 + 0*2^0 = 64 - 16 + 8 + 2 = 58

Note that all ordinary binary numbers are included in this redundant number system, along with many additional representations that contain -1 'bits'. Convert the following redundant numbers into standard binary numbers and then into radix-4 numbers: [0 0 1 0 0 -1 0 0]2, [0 1 0 -1 0 1 0 1]2, [0 1 0 1 1 -1 0 1]2.

Note that standard binary sequences of the form {0, some ones} can be converted to redundant sequences of the form {1, some zeros, -1}. By replacing each string of 1's with 0's in this way, we can eliminate the possibility of two ones in a two-bit group, thus eliminating the 3*X partial product!
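Both expansions above (and the radix-4 grouping from part a) can be checked with a small positional-evaluation helper. This is a sketch; `eval_digits` is an illustrative name, and the digit vectors are the ones given in the text:

```python
def eval_digits(digits, radix=2):
    """Evaluate a most-significant-digit-first vector in the given radix.

    Digits may include -1, so the same function handles both ordinary
    binary numbers and the redundant {-1, 0, 1} system."""
    value = 0
    for d in digits:
        value = value * radix + d
    return value

# The two representations of 58 from the text, plus the radix-4 grouping:
print(eval_digits([0, 0, 1, 1, 1, 0, 1, 0]))   # ordinary binary -> 58
print(eval_digits([0, 1, 0, -1, 1, 0, 1, 0]))  # redundant form -> 58
print(eval_digits([0, 3, 2, 2], radix=4))      # radix-4 form -> 58
```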
c) MBR basically searches for strings of ones in the binary number, converts them into an equivalent redundant number representation, treats the result in radix-4, and then does the multiplication. This can easily be accomplished using the look-up table in Table 1. Since we are using radix-4, i = {0, 2, 4, 6, 8, ...}. Also, Y(-1) = 0. Now, for X*Y, the partial products 0*X, 1*X, 2*X, -1*X, and -2*X are generated, Y is recoded according to the table, and then the multiplication is performed. Recode Y = 121 into radix-4 'bits' of {-2, -1, 0, 1, 2} according to the table. Then perform the multiplication 94*121.

d) Now let's generate those partial products. A straightforward generation can be made using three signals: negate (1: negate X, 0: no change), shift (1: shift left by one, 0: no change), and zero (1: force to zero, 0: no change). Design a circuit that implements these three signals using standard gates (AND, OR, INVERTER, XOR, etc.).

e) So what does all this gain us? We've traded the 3*X partial product for -1*X and -2*X. Recall that negation in two's complement requires us to invert all the bits and then add 1. How can we add these ones in without putting an entirely new adder in the critical path? Hint: try to find "holes" in the multiplication (i.e. low-order bits that are known to be zero and can be replaced with our negate signal).

f) Design a circuit that uses the three signals in part d) to generate -2*X, -1*X, 0*X, 1*X, and 2*X. Bear in mind that the negation does not need to add one, because that will be taken care of using the method in part e).

g) Congratulations! You've created all the primary building blocks of a Booth-recoded multiplier. Lastly, what additional improvement can be made to make this one of the fastest multipliers available?

Solution:
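As an illustration of the recoding in parts (a) and (c): Table 1 is not reproduced here, so the sketch below assumes the standard modified-Booth rule, in which the radix-4 digit at even position i is Y(i-1) + Y(i) - 2*Y(i+1) with Y(-1) = 0; the function names are illustrative.

```python
def booth_digits(y, n=8):
    """Radix-4 modified Booth recoding of an n-bit number y.

    Assumed rule: at each even bit position i, the digit is
    Y(i-1) + Y(i) - 2*Y(i+1), with Y(-1) = 0. Returns digits in
    {-2, -1, 0, 1, 2}, least-significant digit first."""
    bits = [(y >> i) & 1 for i in range(n + 1)]  # bit n pads Y(i+1)
    digits = []
    for i in range(0, n, 2):
        y_prev = bits[i - 1] if i > 0 else 0     # Y(-1) = 0
        digits.append(y_prev + bits[i] - 2 * bits[i + 1])
    return digits

def booth_multiply(x, y, n=8):
    """Sum the shifted partial products d * x * 4**k."""
    return sum(d * x * 4 ** k for k, d in enumerate(booth_digits(y, n)))

print(booth_digits(121))        # [1, -2, 0, 2] -- no digit is 3
print(booth_multiply(94, 121))  # 11374
```

Note that only half as many digits as bits are produced, and the recoded digits for 121 avoid the value 3 entirely, which is exactly what eliminates the 3*X partial product.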