CMPEN 411 VLSI Digital Circuits Spring 2009 Lecture 20: Multiplier Design [Adapted from Rabaey’s Digital Integrated Circuits, Second Edition, ©2003 J. Rabaey, A. Chandrakasan, B. Nikolic] Sp09 CMPEN 411 L20 S.1 Review: Basic Building Blocks Datapath Execution units - Adder, multiplier, divider, shifter, etc. Control Finite state machines (PLA, ROM, random logic) Interconnect Register file and pipeline registers Multiplexers, decoders Switches, arbiters, buses Memory Caches (SRAMs), TLBs, DRAMs, buffers Sp09 CMPEN 411 L20 S.2 The Binary Multiplication 1 0 1 0 1 0 1 0 1 1 x Multiplicand Multiplier 1 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 + Partial products 1 0 1 0 1 0 1 1 1 0 0 1 1 1 0 Sp09 CMPEN 411 L20 S.3 Result Multiply Operation Multiplication is just a a lot of additions N multiplicand multiplier partial product array N can be formed in parallel double precision product 2N Sp09 CMPEN 411 L20 S.4 Multiplication Approaches Right shift and add Partial product array rows are accumulated from top to bottom on an N-bit adder - After each addition, right shift (by one bit) the accumulated partial product to align it with the next row to add Time for N bits Tserial_mult = O(N Tadder) = O(N2) for a RCA Making it faster Use a faster adder Use higher radix (e.g., base 4) multiplication – O(N/2 Tadder) - Use multiplier recoding to simplify multiple formation (booth) Sp09 Form the partial product array in parallel and add it in parallel Making it smaller (i.e., slower) Use serial-parallel mult Use an array multiplier - Very regular structure with only short wires to nearest neighbor cells. Thus, very simple and efficient layout in VLSI Can be easily CMPEN 411 L20 S.5 and efficiently pipelined Serial-parallel multiplier structure Sp09 CMPEN 411 L20 S.6 The Array Multiplier X3 Z7 X2 X1 X0 Y1 Z0 X3 X2 X1 X0 HA FA FA HA X3 X2 X1 X0 FA FA FA HA X3 X2 X1 X0 FA FA FA HA Z6 Sp09 CMPEN 411 L20 S.7 Z5 Z4 Y3 Z3 Y2 Z2 Z1 Y0 Booth multiplier Encoding scheme to reduce number of stages in multiplication. Performs two bits of multiplication at once—requires half the stages. Each stage is slightly more complex than simple multiplier, but adder/subtracter is almost as small/fast as adder. Sp09 CMPEN 411 L20 S.11 Booth encoding Two’s-complement form of multiplier: y = -2nyn + 2n-1yn-1 + 2n-2yn-2 + ... (first bit is the sign bit) (example, y=18=010010 y= -18 = 101110 ) Rewrite using 2a = 2a+1 - 2a: y = 2n(yn-1-yn) + 2n-1(yn-2 -yn-1) + 2n-2(yn-3 -yn-2) + ... Consider first two terms: by looking at three bits of y, we can determine whether to add x, 2x to partial product. Sp09 CMPEN 411 L20 S.12 Booth actions y = 2n(yn-1-yn) + 2n-1(yn-2 -yn-1) + 2n-2(yn-3 -yn-2) + ... Consider first two terms: by looking at three bits of y, we can determine whether to add x, 2x to partial product. yi yi-1 yi-2 increment 000 0 001 x 010 x 011 2x 100 -2x 101 -x 110 -x 111 0 Sp09 CMPEN 411 L20 S.13 Booth example x = 1001 (910), y = 0111 (710). P0 = 00000000 y3y2y1=011 y1y0y-1=11(0) y1y0y-1 = 110, P1 = P0 - (1001) = 11110111 x shift left for 2 bits to be 100100 y3y2y1 = 011, P2 = P1 (10*100100) = 11110111+01001000 = 001111111 (6310) An array multiplier needs N addtions, booth multiplier needs only N/2 additions Sp09 CMPEN 411 L20 S.14 Review: A 64-bit Adder/Subtractor add/subt Ripple Carry Adder (RCA) built out of 64 FAs Subtraction – complement all subtrahend bits (xor gates) and set the low order carry-in RCA advantage: simple logic, so small (low cost) disadvantage: slow (O(N) for N bits) and lots of glitching (so lots of energy consumption) Sp09 CMPEN 411 L20 S.15 A0 1-bit FA C1 S0 A1 1-bit FA C2 S1 A2 1-bit FA C3 S2 B0 B1 B2 ... C0=Cin C63 A63 B63 1-bit FA S63 C64=Cout Booth structure Sp09 CMPEN 411 L20 S.16 Wallace-Tree Multiplier Partial products 6 5 4 3 First stage 2 1 0 6 5 4 (a) 5 4 1 0 Bit position 2 1 0 Final adder 3 FA 2 1 0 6 5 4 3 HA (c) Sp09 CMPEN 411 L20 S.17 2 (b) Second stage 6 3 (d) Wallace-Tree Multiplier Full adder = (3,2) compressor Sp09 CMPEN 411 L20 S.18 Making it Faster: Tree Multiplier Structure 0 D Q (‘ier) 0 D 0 D 0 D (‘icand) partial product array reduction tree fast carry propagate adder (CPA) Sp09 CMPEN 411 L20 S.19 mux + reduction tree (log N) + CPA (log N) P (product) interconnect multiple forming circuits (4,2) Counter Built out of two (3,2) counters (just FA’s!) all of the inputs (4 external plus one internal) have the same weight (i.e., are in the same bit position) the internal carry output is fed to the next higher weight position (indicated by the ) (3,2) (3,2) Sp09 CMPEN 411 L20 S.20 Note: Two carry outs - one “internal” and one “external” Tiling (4,2) Counters (3,2) (3,2) (3,2) (3,2) (3,2) (3,2) Reduces columns four high to columns only two high Tiles with neighboring (4,2) counters Internal carry in at same “level” (i.e., bit position weight) as the internal carry out Sp09 CMPEN 411 L20 S.22 4x4 Partial Product Array Reduction Fast 4x4 multiplication using (4,2) counters How would you lay it out? multiplicand multiplier multiplicand multiplier partial product array reduced pp array (to CPA) double precision product Sp09 CMPEN 411 L20 S.24 five (4,2) counters 5-bit CPA 8-bit product 8x8 Partial Product Array Reduction Wallace tree multiplier ‘icand ‘ier partial product array two rows of nine (4,2) counters reduced partial product array one row of thirteen (4,2) counters to a 13-bit fast CPA Sp09 CMPEN 411 L20 S.25 An 8x8 Multiplier Layout How should it be laid out? multiplicand multiplier nine (4,2) counters nine (4,2) counters thirteen (4,2) counters 13-bit CPA Sp09 CMPEN 411 L20 S.26 Multipliers —Summary • Optimization Goals Different Vs Binary Adder • Once Again: Identify Critical Path • Other possible techniques - Logarithmic versus Linear (Wallace Tree Mult) - Data encoding (Booth) - Pipelining FIRST GLIMPSE AT SYSTEM LEVEL OPTIMIZATION Sp09 CMPEN 411 L20 S.32 Next Lecture and Reminders Next lecture Shifters, decoders, and multiplexers - Reading assignment – Rabaey, et al, 11.5-11.6 Sp09 CMPEN 411 L20 S.33