Lecture 18: Datapath Functional Units
CMOS VLSI Design 4th Ed.

Outline
  Multi-input Adders
  Comparators
  Shifters
  Multipliers
  More complex operations

Carry-Save Adders (CSA)
  A carry-save adder adds three n-bit operands A0, A1, and A2 without performing any carry propagation.
    $(C, S) = A_0 + A_1 + A_2$
    $2c_{i+1} + s_i = a_{0,i} + a_{1,i} + a_{2,i}$ ; $i = 0, 1, \ldots, n-1$
  It can also add one n-bit operand to an n-digit carry-save operand.
    $(C, S)_{out} = A + (C, S)_{in}$
  Result is in carry-save format.

Carry-Save Adders
  Parallel arrangement of full adders => constant delay.
    $A = 7n$, $T = 4$
  Multi-operand carry-save adders also possible (m > 3)
    – Array or tree arrangement.

Multi-Operand Adders
  Add three or more (m > 2) n-bit operands.
  Yield an $(n + \lceil \log m \rceil)$-bit result in irredundant number representation.
  Array adders
    – Linear arrangement of CPAs
    – Linear arrangement of CSAs and a final CPA
      • The final CPA has to be fast. If it is an RCA, the performances of the two alternatives are equal.

4-Operand CPA Array (figure)

4-Operand CSA Array (figure)

Multi-input Adders
  Suppose we want to add k N-bit words
    – Ex: 0001 + 0111 + 1101 + 0010 = 10111
  Straightforward solution: k-1 N-input CPAs
    – Large and slow
    – 0001 + 0111 = 1000; 1000 + 1101 = 10101; 10101 + 0010 = 10111

Carry Save Addition
  A full adder sums 3 inputs and produces 2 outputs
    – Carry output has twice the weight of the sum output
  N full adders in parallel are called a carry-save adder
    – Produce N sums and N carry outs
  (Figure: n-bit CSA with inputs $X_{N \ldots 1}$, $Y_{N \ldots 1}$, $Z_{N \ldots 1}$ and outputs $C_{N \ldots 1}$, $S_{N \ldots 1}$.)
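The carry-save idea above can be sketched in a few lines of Python. This is only an illustrative model, not a hardware description: the function names `carry_save_add` and `multi_operand_add` are hypothetical, and Python's unbounded integers stand in for fixed-width buses.

```python
def carry_save_add(x, y, z):
    """One CSA stage: three operands in, a (sum, carry) pair out.

    Every bit position is an independent full adder, so no carry
    ripples sideways inside this function.
    """
    s = x ^ y ^ z                             # per-column sum bits
    c = ((x & y) | (x & z) | (y & z)) << 1    # per-column carries, twice the weight
    return s, c


def multi_operand_add(operands):
    """Add k operands with k-2 CSA stages and one final carry-propagate add."""
    s, c = operands[0], operands[1]
    for op in operands[2:]:
        s, c = carry_save_add(s, c, op)       # result stays in carry-save form
    return s + c                              # '+' stands in for the final CPA


if __name__ == "__main__":
    words = [0b0001, 0b0111, 0b1101, 0b0010]
    print(bin(multi_operand_add(words)))
```

The only carry propagation happens in the last line of `multi_operand_add`, which is why the final CPA dominates the delay once the CSA stages are fast.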

CSA Application
  Use k-2 stages of CSAs
    – Keep result in carry-save redundant form
  Final CPA computes actual result
  Worked example (0001 + 0111 + 1101 + 0010):
    – 4-bit CSA: X = 0001, Y = 0111, Z = 1101 gives S = 1011, C = 0101_
    – 5-bit CSA: X = 0101_, Y = 1011, Z = 0010 gives S = 00011, C = 01010_
    – CPA: 01010_ + 00011 = 10111

(m,2)-Compressors
    $2\left(c + \sum_{l=0}^{m-4} c_{out}^{l}\right) + s = \sum_{i=0}^{m-1} a_i + \sum_{l=0}^{m-4} c_{in}^{l}$
  1-bit adders. Similar to (m,k)-counters.
  Compresses m bits down to 2 by forwarding (m-3) intermediate carries to the next higher position.
  No horizontal carry propagation.
  Built from full adders ((3,2) compressors) or (4,2) compressors arranged in linear or tree structures.

(m,2)-Compressors
  Example: 4-operand adder using (4,2) compressors. (figure)
    $A = 7(m-2)$, $T_{LIN} = 4(m-2)$, $T_{TREE} = 6(\lceil \log m \rceil - 1)$

(m,2)-Compressors
  Structure of a (4,2) compressor (figure)

(m,2)-Compressors
  Advantages of (4,2)-compressors over FAs for realizing (m,2)-compressors:
    – Higher compression rate.
    – Less deep and more regular trees.
  Example: (8,2) compressor. (figure)

Tree adders (Wallace Tree)
  Adder tree: n-bit m-operand carry-save adder composed of n tree-structured (m,2) compressors.
  Fastest multi-operand adders use an adder tree and a fast final CPA.
    $A = A_{(m,2)} \cdot n + A_{CPA} = O(mn + n \log n)$
    $T = T_{(m,2)} + T_{CPA} = O(\log m + \log n)$

Adder Arrays and Trees
  Some FAs can often be replaced by HAs or eliminated altogether.
  Number of FAs does not depend on adder structure, but number of HAs does.
  An m-operand adder accommodates (m - 1) carry inputs.
  Adder trees (T = O(log n)) are faster than adder arrays (T = O(n)) at the same amount of gates (A = O(mn)).
  Adder trees are less regular and have more complex routing than adder arrays => larger area, difficult layout => limited use in layout generators.

Sequential Adders
  Bit-serial adder
    $A = A_{FA} + A_{FF}$, $T = T_{FA} + T_{FF}$, $L = n$
  Accumulators
    – With CPA
      $A = A_{CPA} + A_{REG}$, $T = T_{CPA} + T_{REG}$, $L = m$
    – With CSA and final CPA
      $A = A_{CSA} + A_{CPA} + 4A_{REG}$, $T = T_{CSA} + T_{REG}$, $L = m$
      • Allows higher clock rates
      • Final CPA too slow
        – Pipelining or multiple cycles

Complement and Subtraction
  2's complement
    $-A = \overline{A} + 1$
  2's complement subtracter
    $A - B = A + (-B) = A + \overline{B} + 1$
  2's complement adder/subtracter
    $A \pm B = A + (-1)^{sub} B = A + (B \oplus sub) + sub$
  1's complement adder
    $(A + B) \bmod (2^n - 1) = A + B + c_{out}$ (end-around carry)

Subtraction (figure)

Increment/Decrement
  Adds a single bit $c_{in}$ to an n-bit operand A
    $(c_{out}, Z) = c_{out} 2^n + Z = A + c_{in}$
    $z_i = a_i \oplus c_i$
    $c_{i+1} = a_i c_i$ ; $i = 0, \ldots, n-1$
    $c_0 = c_{in}$, $c_{out} = c_n$ (r.m.a.)
  Corresponds to addition with B = 0 (FA -> HA)
  Example: Ripple-carry incrementer using half-adders
    $A = 3n$, $T = n + 1$, $AT \approx 3n^2$

Increment/Decrement
  Or, use incrementer slices
  Prefix problem $C_{i:k} = C_{i:j+1} C_{j:k}$ => AND prefix structure
    $A = \tfrac{1}{2} n \log n + 2n$, $T = \log n + 2$, $AT \approx \tfrac{1}{2} n \log^2 n$

Increment/Decrement
  Decrementer
    $(c_{out}, Z) = A - c_{in}$
  Incrementer-decrementer
    $(c_{out}, Z) = A \pm c_{in}$

Fast Incrementers
  4-bit incrementer using multi-input gates (figure)
  8-bit parallel-prefix incrementer (Sklansky-AND prefix structure) (figure)

Gray Incrementer
  Increments in Gray number system
    $c_0 = a_{n-1} \oplus a_{n-2} \oplus \cdots \oplus a_0$ (parity)
    $c_{i+1} = \overline{a_i}\, c_i$ ; $i = 0, \ldots, n-3$ (r.m.a.)
    $z_0 = a_0 \oplus \overline{c_0}$
    $z_i = a_i \oplus a_{i-1} c_{i-1}$ ; $i = 1, \ldots, n-2$
    $z_{n-1} = a_{n-1} \oplus c_{n-2}$
  Prefix problem => AND-prefix structure
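As a sanity check on the Gray incrementer equations, the sketch below evaluates them bit by bit and compares the result against the usual binary-to-Gray conversion `k ^ (k >> 1)`. It assumes that `c_0` is 1 for odd parity and that bit 0 toggles when the parity is even; the function name `gray_increment` and the list-based carry chain are illustrative choices, not part of the slides.

```python
def gray_increment(a, n):
    """Increment an n-bit Gray-coded word a (n >= 2) with the ripple equations."""
    def bit(i):
        return (a >> i) & 1

    c = [0] * (n - 1)
    c[0] = bin(a).count("1") & 1            # c_0: parity of the word (1 if odd)
    for i in range(n - 2):                  # c_{i+1} = NOT(a_i) AND c_i
        c[i + 1] = (1 - bit(i)) & c[i]

    z = [0] * n
    z[0] = bit(0) ^ (1 - c[0])              # even parity: toggle bit 0
    for i in range(1, n - 1):               # odd parity: toggle left of lowest 1
        z[i] = bit(i) ^ (bit(i - 1) & c[i - 1])
    z[n - 1] = bit(n - 1) ^ c[n - 2]        # top bit also handles the wrap-around
    return sum(zi << i for i, zi in enumerate(z))


def to_gray(k):
    """Standard binary-to-Gray conversion, used here only as a reference."""
    return k ^ (k >> 1)


if __name__ == "__main__":
    word = to_gray(0)
    for _ in range(8):
        print(format(word, "03b"))
        word = gray_increment(word, 3)
```

The carry chain is a running AND, which is why the slides classify it as a prefix problem that maps onto an AND-prefix structure.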

Counting
  Count clock cycles => counter
  Divide clock frequency => frequency divider ($c_{out}$)
  Binary counter
    – Sequential incrementer/decrementer
    – Incrementer speed-up techniques applicable
    – Down- and up-down counters using incrementers or incrementer-decrementers

Example
  Ripple-carry up-counter using counter slices (HA+FF); $c_{in}$ is count enable. (figure)

Synchronous Counters (figures)

Asynchronous Counters
  Uses toggle flip-flops.
    – Lower toggle rate => lower power

Gray Counter
  Counter using Gray incrementers (figure)

Fast Counters (figures)

Ring Counters
  Shift register connected to ring
  State is not encoded => n FF for counting n states.
  Must be initialized correctly.
  Applications:
    – Fast dividers (no logic between FF)
    – State counter for one-hot coded FSMs

Johnson Counter
  Inverted feedback
  n FF for counting 2n states.

3-bit LFSR (figure)

3-bit LFSR
  Cycle  Q0  Q1  Q2/Y
  0      1   1   1
  1      0   1   1
  2      0   0   1
  3      1   0   0
  4      0   1   0
  5      1   0   1
  6      1   1   0
  7      1   1   1

8-bit LFSR (figure)

Comparators
  Comparison operations
    $EQ = (A = B)$ (equal)
    $NE = (A \neq B) = \overline{EQ}$ (not equal)
    $GE = (A \geq B)$ (greater or equal)
    $LT = (A < B) = \overline{GE}$ (less than)
    $GT = (A > B) = GE \cdot \overline{EQ}$ (greater than)
    $LE = (A \leq B) = \overline{GT} = \overline{GE} + EQ$ (less or equal)

Comparators
  0's detector: A = 00…000
  1's detector: A = 11…111
  Equality comparator: A = B
  Magnitude comparator: A < B

1's & 0's Detectors
  1's detector: N-input AND gate
  0's detector: NOTs + 1's detector (N-input NOR)
  (Figure: allones and allzeros gate trees over A7…A0.)

Equality Comparison
    $EQ = (A = B)$
    $eq_{i+1} = (a_i = b_i)\, eq_i = \overline{a_i \oplus b_i}\; eq_i$ ; $i = 0, \ldots, n-1$
    $eq_0 = 1$, $EQ = eq_n$ (r.s.a.)

Equality Comparator
  Check if each bit is equal (XNOR, aka equality gate)
  1's detect on bitwise equality
  (Figure: XNOR gates on A[3:0], B[3:0] feeding an AND that produces A = B.)

Magnitude Comparison
    $GE = (A \geq B)$
    $ge_{i+1} = (a_i > b_i) + (a_i = b_i)\, ge_i = a_i \overline{b_i} + \overline{a_i \oplus b_i}\; ge_i$ ; $i = 0, \ldots, n-1$
    $ge_0 = 1$, $GE = ge_n$ (r.s.a.)

Magnitude Comparator
  Compute B – A and look at sign
  B – A = B + ~A + 1
  For unsigned numbers, carry out is sign bit
  (Figure: 4-bit subtractor with outputs C (A ≤ B), N, and Z (A = B).)

Comparators
  Subtractor (A – B)
    $GE = c_{out}$, $EQ = P_{n-1:0}$
    $A_{RCA} = 7n$, $T_{RCA} = 2n$ or $A_{PPA\text{-}KS} \approx \tfrac{3}{2} n \log n$, $T_{PPA\text{-}KS} = 2 \log n$
  Optimized comparator
    – Removing redundancies in subtractor (unused $s_i$)
    – Single-tree structure => speed-up at no cost
    $A = 6n$, $T_{LIN} = 2n$, $T_{TREE} = 2 \log n$

Comparators
  Example: ripple comparator using comparator slices (figure)

Signed vs. Unsigned
  For signed numbers, comparison is harder
    – C: carry out
    – Z: zero (all bits of A – B are 0)
    – N: negative (MSB of result)
    – V: overflow (inputs had different signs, output sign ≠ B)
    – S: N xor V (sign of result)
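The flag-based comparison can be illustrated with a small Python model. It follows the magnitude-comparator slide in computing B – A as B + ~A + 1; the helper names (`compare_flags`, `unsigned_le`, `signed_le`) and the choice of returning a tuple are assumptions made for this sketch only.

```python
def compare_flags(a, b, n):
    """Compute B - A as B + ~A + 1 on n bits; return flags (C, Z, N, V).

    a and b are n-bit patterns given as non-negative integers.
    """
    mask = (1 << n) - 1
    total = (b & mask) + (~a & mask) + 1
    result = total & mask
    c = total >> n                              # carry out of the MSB
    z = int(result == 0)                        # all result bits are zero
    neg = result >> (n - 1)                     # MSB of the result
    sign_a = (a >> (n - 1)) & 1
    sign_b = (b >> (n - 1)) & 1
    # Overflow: operands differ in sign and the result sign differs from B.
    v = int(sign_a != sign_b and neg != sign_b)
    return c, z, neg, v


def unsigned_le(a, b, n):
    """A <= B for unsigned operands: no borrow, i.e. carry out is 1."""
    c, _, _, _ = compare_flags(a, b, n)
    return c == 1


def signed_le(a, b, n):
    """A <= B for two's complement operands: S = N xor V is 0."""
    _, _, neg, v = compare_flags(a, b, n)
    return (neg ^ v) == 0


if __name__ == "__main__":
    # 1011 is 11 unsigned but -5 signed; 0010 is 2 either way.
    print(unsigned_le(0b1011, 0b0010, 4), signed_le(0b1011, 0b0010, 4))
```

The same subtractor serves both interpretations; only the flag that is inspected changes, which is the point of the C/Z/N/V list above.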

Decoder
  Decodes binary number $A_{n-1:0}$ to vector $Z_{m-1:0}$ ($m = 2^n$)
    $z_i = 1$ if $A = i$, $0$ else ; $i = 0, \ldots, m-1$
    $Z = 2^A$
    $A = (n-1)\,2^n$, $T = \log n$

Encoder
  Encodes vector $A_{m-1:0}$ to binary number $Z_{n-1:0}$ ($m = 2^n$)
    $\exists i\, \forall k$: if $k = i$ then $a_k = 1$ else $a_k = 0$
    $Z = i$ if $a_i = 1$ ; $i = 0, \ldots, m-1$
    $Z = \log_2 A$
    $A = n\,(2^{n-1} - 1)$, $T = n - 1$

Detection Operations
  All-zeroes detection
    $z = \overline{a_{n-1} + a_{n-2} + \cdots + a_0}$
  All-ones detection
    $z = a_{n-1} a_{n-2} \cdots a_0$
    $A = n$, $T = \log n$
  Leading zeroes detection
    – For scaling, normalization, priority encoding
    $A = 2n$, $T = n$

Shift, Extension, Saturation (figure)

Shift, Extension, Saturation
  Applications
    – Adaptation of magnitude or word length of operands
    – Multiplication/division by powers of 2
    – Logic bit/byte operations
    – Scaling of numbers for word length reduction
    – Reducing error after under-/overflow
  Implementation of shift/extension/rotation by
    – Constant values: hard-wired
    – Variable values: multiplexers
    – n possible values: n-by-n barrel-shifter/rotator

Shifters
  Logical Shift:
    – Shifts number left or right and fills with 0's
      • 1011 LSR1 = 0101, 1011 LSL1 = 0110
  Arithmetic Shift:
    – Shifts number left or right. Right shift sign extends
      • 1011 ASR1 = 1101, 1011 ASL1 = 0110
  Rotate:
    – Shifts number left or right and fills with lost bits
      • 1011 ROR1 = 1101, 1011 ROL1 = 0111

Funnel Shifter
  A funnel shifter can do all six types of shifts
  Selects N-bit field Y from 2N–1-bit input
    – Shift by k bits (0 ≤ k < N)
    – Logically involves N N:1 multiplexers

Funnel Source Generator
  (Figure: source generator settings for Rotate Right, Logical Right, Arithmetic Right, Rotate Left, and Logical/Arithmetic Left.)
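A funnel shifter can be modelled as building a (2N–1)-bit word Z from the operand and then selecting an N-bit window. The sketch below is one way to do that in Python. The source-generator settings used here (what fills the upper or lower N–1 bits, and the window offset of k for right shifts or N–1–k for left shifts) were derived for this example and are assumptions, since the slide's table did not survive extraction; the name `funnel_shift` is hypothetical.

```python
def funnel_shift(a, k, op, n=4):
    """Shift or rotate the n-bit word a by k (0 <= k < n) via a funnel.

    op is one of "lsr", "asr", "ror", "lsl", "asl", "rol".
    """
    mask = (1 << n) - 1
    low_mask = (1 << (n - 1)) - 1
    sign = (a >> (n - 1)) & 1

    if op in ("lsr", "asr", "ror"):
        if op == "lsr":
            high = 0                          # fill with zeros
        elif op == "asr":
            high = low_mask if sign else 0    # fill with copies of the sign bit
        else:
            high = a & low_mask               # rotate: low bits wrap around
        z = (high << n) | a                   # 2N-1 bits: fill above the operand
        offset = k
    elif op in ("lsl", "asl", "rol"):
        low = (a >> 1) if op == "rol" else 0  # rotate: high bits wrap around
        z = (a << (n - 1)) | low              # 2N-1 bits: operand above the fill
        offset = n - 1 - k
    else:
        raise ValueError("unknown operation: " + op)

    return (z >> offset) & mask               # select the N-bit field Y


if __name__ == "__main__":
    for op in ("lsr", "lsl", "asr", "asl", "ror", "rol"):
        print(op, format(funnel_shift(0b1011, 1, op), "04b"))
```

Every operation ends in the same window-select step, which is what lets one multiplexer array serve all six shift types.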

Array Funnel Shifter
  N N-input multiplexers
    – Use 1-of-N hot select signals for shift amount
    – nMOS pass transistor design (Vt drops!)
  (Figure: k[1:0] and left drive inverters and a decoder producing s3…s0; inputs Z6…Z0; outputs Y3…Y0.)

Logarithmic Funnel Shifter
  Log N stages of 2-input muxes
    – No select decoding needed

32-bit Logarithmic Funnel
  Wider multiplexers reduce delay and power
  Operands > 32 bits introduce datapath irregularity

Barrel Shifter
  Barrel shifters perform right rotations using wraparound wires.
  Left rotations are right rotations by $N - k = \overline{k} + 1$ bits.
  Shifts are rotations with the end bits masked off.

4-by-4 Barrel Rotator (figure)

Logarithmic Barrel Shifter
  (Figures: right shift only; right/left shift; right/left shift & rotate.)

32-bit Logarithmic Barrel
  Datapath never wider than 32 bits
  First stage preshifts by 1 to handle left shifts

Binary Shifter (figure)

Barrel Shifter
  Area dominated by wiring

4x4 barrel shifter (figure)

Logarithmic Shifter (figure)

0-7 bit Logarithmic Shifter (figure)

Addition Flags (figure)

Adder with Flags
  C, N: for free
  V: fast; $c_n$, $c_{n-1}$ computed by PPA => very cheap
  Z:
    – $c_{in} = 1$ (subtract.): $Z = (A = B) = P_{n-1:0}$ of PPA
    – $c_{in} = 0/1$:
      $Z = \overline{s_{n-1} + s_{n-2} + \cdots + s_0}$ (r.s.a.)
      $A = A_{CPA} + n$, $T_Z = T_{CPA} + \log n$
    – Faster without final sum:
      $z_0 = \overline{(a_0 \oplus b_0) \oplus c_{in}}$
      $z_i = \overline{(a_i \oplus b_i) \oplus (a_{i-1} + b_{i-1})}$
      $Z = z_{n-1} z_{n-2} \cdots z_0$ ; $i = 0, \ldots, n-1$ (r.s.a.)
      $A = A_{CPA} + 3n$, $T_Z = 4 + \log n$

Condition Flags
  Signed and unsigned addition/subtraction differ only with respect to condition flags

ALU
  Arithmetic Logic Unit (ALU) (figure)

ALU Operations (figure)

Multiplication
  Example:
          1100 : 12₁₀   multiplicand
          0101 :  5₁₀   multiplier
          ----
          1100          partial products
         0000
        1100
       0000
      --------
      00111100 : 60₁₀   product
  M x N-bit multiplication
    – Produce N M-bit partial products
    – Sum these to produce M+N-bit product

General Form
  Multiplicand: $Y = (y_{M-1}, y_{M-2}, \ldots, y_1, y_0)$
  Multiplier: $X = (x_{N-1}, x_{N-2}, \ldots, x_1, x_0)$
  Product:
    $P = \left(\sum_{j=0}^{M-1} y_j 2^j\right) \left(\sum_{i=0}^{N-1} x_i 2^i\right) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} x_i y_j 2^{i+j}$
  6 x 6 example:
                                y5   y4   y3   y2   y1   y0     multiplicand
                                x5   x4   x3   x2   x1   x0     multiplier
                                x0y5 x0y4 x0y3 x0y2 x0y1 x0y0   partial products
                           x1y5 x1y4 x1y3 x1y2 x1y1 x1y0
                      x2y5 x2y4 x2y3 x2y2 x2y1 x2y0
                 x3y5 x3y4 x3y3 x3y2 x3y1 x3y0
            x4y5 x4y4 x4y3 x4y2 x4y1 x4y0
       x5y5 x5y4 x5y3 x5y2 x5y1 x5y0
  p11  p10  p9   p8   p7   p6   p5   p4   p3   p2   p1   p0     product

Binary Multiplication (figure)

Dot Diagram
  Each dot represents a bit
  (Figure: dot diagram of partial products against multiplier bits x0…x15.)

Array Multiplier (figures)

Carry Save Array Multiplier (figure)

Array Multiplier
  (Figure: 4 x 4 array with inputs y3…y0 and x0…x3, a CSA array followed by a CPA producing p7…p0, with the critical path marked. Each cell combines an AND gate with a full adder on A, B, Sin, Cin and produces Sout, Cout.)
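The structure of the carry-save array multiplier (one row of full adders per partial product, then a single carry-propagate adder) can be mimicked in Python. This is a behavioural sketch only: `csa` and `array_multiply` are hypothetical names, integers replace wires, and the loop order simply mirrors the rows of the array.

```python
def csa(x, y, z):
    """Carry-save adder row: returns (sum word, carry word shifted left by one)."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1


def array_multiply(y, x, n_bits):
    """Multiply multiplicand y by an n_bits-wide multiplier x.

    Row i ANDs y with multiplier bit x_i, shifts it i places, and folds it
    into the running carry-save pair. Only the final step propagates carries.
    """
    s, c = 0, 0
    for i in range(n_bits):
        partial = (y << i) if (x >> i) & 1 else 0   # x_i AND y, weight 2^i
        s, c = csa(s, c, partial)
    return s + c                                     # final CPA


if __name__ == "__main__":
    print(array_multiply(0b1100, 0b0101, 4))         # the 12 x 5 example
```

Each loop iteration corresponds to one row of the array, so the row count (and the delay through the CSA part) grows linearly with the multiplier width.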

Rectangular Array
  Squash array to fit rectangular floorplan
  (Figure: 4 x 4 array with p0…p3 emerging on the right edge and p4…p7 along the bottom.)

Multiplier Floorplan (figure)

Sequential Multipliers
  Partial products generated and added sequentially using an accumulator.
    $A = O(n)$, $T = O(\log n)$, $L = n$

Array Multipliers
  Partial products generated and added simultaneously in linear array using array adder.
    $A = O(n^2)$, $T = O(n)$

Multiplication Algorithm
  Generation of partial products
  Adding up partial products
    – Sequentially (sequential shift and add)
    – Serially (combinational shift and add)
    – In parallel
  Speed-up techniques
    – Reduce the number of partial products
    – Accelerate addition of partial products

Parallel Multipliers
  Partial products generated in parallel and added subsequently in multi-operand adder (using tree adder)
    $A = O(n^2)$, $T = O(\log n)$

Signed Multipliers
  What about signed multiplication?
    – Complement operands before and result after multiplication => unsigned multiplication
    – Direct implementation (dedicated signed multipliers)
  An unsigned array multiplier using CSAs and a final CPA is sometimes called a Braun multiplier.
  The unit gate model yields for a CPA of type RCA
    $A = 8n^2 - 11n$, $T = 6n - 9$

Modified Braun Multiplier
  For multiplying two's complement numbers
  Sometimes called Pezaris multiplier
  Subtract bits with negative weight => special FAs
    – 1 neg. bit: $-a + b + c_{in} = 2c_{out} - s$
    – 2 neg. bits: $-a - b + c_{in} = -2c_{out} + s$
  Otherwise, exactly the same structure and complexity as the Braun multiplier => efficient and flexible
    $A = -a_7 2^7 + \sum_{i=0}^{6} a_i 2^i$
    $B = -b_7 2^7 + \sum_{i=0}^{6} b_i 2^i$

Modified Braun Multiplier (figure)

Modified Braun Multiplier
  Type 1 adder has one negative input; the sum output is negative.
  Type 2 adder has two negative inputs; the carry output is negative.
  You can also design an adder with three negative inputs and two negative outputs (Type 3 adder), but it is never used.
  Type 0 and Type 3 adders are identical. Type 1 and Type 2 adders are identical.
    $s = a \oplus b \oplus c_{in}$
    $c_{out} = \overline{a}\, b + \overline{a}\, c_{in} + b\, c_{in}$

Baugh-Wooley Multiplier
    $P = A \cdot B = \left(-a_4 2^4 + \sum_{i=0}^{3} a_i 2^i\right) \left(-b_4 2^4 + \sum_{j=0}^{3} b_j 2^j\right)$
    $\;= \underbrace{a_4 b_4 2^8}_{\text{MSB}} + \underbrace{\sum_{i=0}^{3} \sum_{j=0}^{3} a_i b_j 2^{i+j}}_{\text{ordinary multiplication}} \underbrace{- a_4 2^4 \sum_{j=0}^{3} b_j 2^j - b_4 2^4 \sum_{i=0}^{3} a_i 2^i}_{\text{extra terms}}$
    $\;= a_4 b_4 2^8 + \sum_{i=0}^{3} \sum_{j=0}^{3} a_i b_j 2^{i+j} + a_4 2^4 \left(-2^4 + \sum_{j=0}^{3} \overline{b_j} 2^j + 1\right) + b_4 2^4 \left(-2^4 + \sum_{i=0}^{3} \overline{a_i} 2^i + 1\right)$
    $\;= a_4 b_4 2^8 + \sum_{i=0}^{3} \sum_{j=0}^{3} a_i b_j 2^{i+j} + \left(\overline{a_4} - 1 + \overline{b_4} - 1\right) 2^8 + (a_4 + b_4)\, 2^4 + \sum_{i=0}^{3} \left(a_4 \overline{b_i} + \overline{a_i}\, b_4\right) 2^{i+4}$

Baugh-Wooley Multiplier
  Partial product bits by column weight (sum bits $s_9 \ldots s_0$):
    $2^9$: $1$
    $2^8$: $a_4 b_4$, $\overline{a_4}$, $\overline{b_4}$
    $2^7$: $a_4 \overline{b_3}$, $\overline{a_3} b_4$
    $2^6$: $a_4 \overline{b_2}$, $a_3 b_3$, $\overline{a_2} b_4$
    $2^5$: $a_4 \overline{b_1}$, $a_3 b_2$, $a_2 b_3$, $\overline{a_1} b_4$
    $2^4$: $a_4 \overline{b_0}$, $a_3 b_1$, $a_2 b_2$, $a_1 b_3$, $\overline{a_0} b_4$, $a_4$, $b_4$
    $2^3$: $a_3 b_0$, $a_2 b_1$, $a_1 b_2$, $a_0 b_3$
    $2^2$: $a_2 b_0$, $a_1 b_1$, $a_0 b_2$
    $2^1$: $a_1 b_0$, $a_0 b_1$
    $2^0$: $a_0 b_0$

Baugh-Wooley Multiplier (figures)

Fewer Partial Products
  Array multiplier requires N partial products
  If we looked at groups of r bits, we could form N/r partial products.
    – Faster and smaller?
    – Called radix-$2^r$ encoding
  Ex: r = 2: look at pairs of bits
    – Form partial products of 0, Y, 2Y, 3Y
    – First three are easy, but 3Y requires adder

Booth Encoding
  Let us first try out base 4 encoding
    $B = \sum_{i=0}^{n-1} b_i 2^i = \sum_{i=0}^{\frac{n-2}{2}} c_i 4^i$
  The $c_i$ have to be 0, 1, 2, or 3. However, 3 is problematic.
  Try the following:
    $B_{even} = \sum_{i=0}^{\frac{n-2}{2}} b_{2i}\, 2^{2i}$ for even bits
    $B_{odd} = \sum_{i=0}^{\frac{n-2}{2}} b_{2i+1}\, 2^{2i+1}$ for odd bits
    $B = B_{even} + B_{odd}$

Booth Encoding
  Numerical example:
    $18_{10} = 1 \cdot 2^4 + 0 \cdot 2^3 + 0 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0 = 10010_2$
    $18_{10} = (0 \cdot 2^0 + 0 \cdot 2^2 + 1 \cdot 2^4) + (1 \cdot 2^1 + 0 \cdot 2^3 + 0 \cdot 2^5)$
  Reordering terms,
    $B = B_{even} + \underbrace{2 B_{odd}}_{\text{shift left}} - 2 \cdot \underbrace{\tfrac{1}{2} B_{odd}}_{\text{shift right}}$
    $c_i = -2 b_{2i+1} + b_{2i} + b_{2i-1}$
    $c_i \in \{-2, -1, 0, 1, 2\}$

Booth Encoding
  The $c_i$ can be written as
    $c_0 = b_{-1} + b_0 - 2 b_1$
    $c_1 = b_1 + b_2 - 2 b_3$
    $c_2 = b_3 + b_4 - 2 b_5$
    $c_3 = b_5 + b_6 - 2 b_7$
    $c_4 = b_7 + b_8 - 2 b_9$
  Take $b_{-1}$ as 0. For an 8-bit unsigned number, take $b_8$ and $b_9$ as 0 as well.

Booth Encoding
  Take 18 as a numerical example again: $18_{10} = 10010_2$
    $c_0 = 0 \cdot 1 + 0 \cdot 1 + 1 \cdot (-2) = -2$
    $c_1 = 1 \cdot 1 + 0 \cdot 1 + 0 \cdot (-2) = 1$
    $c_2 = 0 \cdot 1 + 1 \cdot 1 + 0 \cdot (-2) = 1$
    $18_{10} = (1\ 1\ {-2})_4 = -2 \cdot 4^0 + 1 \cdot 4^1 + 1 \cdot 4^2 = 18$
  For two's complement signed numbers, extension to the left side should not be used.
    $-18_{10} = 101101 + 1 = 101110$
    $-18_{10} = ({-1}\ 0\ {-2})_4 = -2 \cdot 4^0 + 0 \cdot 4^1 - 1 \cdot 4^2 = -18$

Booth Encoding
  Note that Booth notation is redundant.
    $(0\ 2)_4 = (1\ {-2})_4 = 2$
  However, the method shown above always yields the same representation for the same binary number.

Booth Encoding
  Instead of 3Y, try –Y, then increment next partial product to add 4Y
  Similarly, for 2Y, try –2Y + 4Y in next partial product

Booth Encoding (figure)

Booth Hardware
  Booth encoder generates control lines for each PP
    – Booth selectors choose PP bits

Booth Multipliers
  Applicable to sequential, array, and parallel multipliers.
  Additional recoding logic and more complex partial product generation (+8n in terms of area and +7 in terms of delay)
  Adder array/tree cut in half.
    • Considerably smaller (array and tree)
    • Twice as fast for adder arrays
    • Slightly faster for adder trees.
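The radix-4 recoding rule $c_i = b_{2i-1} + b_{2i} - 2b_{2i+1}$ is easy to check in software. The following Python sketch recodes an n-bit two's complement pattern (n even) and rebuilds its value from the digits; the function names are hypothetical, and for an unsigned operand the caller is assumed to zero-extend by two bits first, as the slides suggest for $b_8$ and $b_9$.

```python
def booth_radix4_digits(b, n):
    """Recode the n-bit two's complement pattern b (n even) into radix-4 digits.

    Returns [c_0, c_1, ...] with every digit in {-2, -1, 0, 1, 2}.
    """
    def bit(i):
        return 0 if i < 0 else (b >> i) & 1   # b_{-1} is taken as 0

    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1)
            for i in range(n // 2)]


def digits_value(digits):
    """Value represented by a list of radix-4 digits, least significant first."""
    return sum(c * 4 ** i for i, c in enumerate(digits))


if __name__ == "__main__":
    print(booth_radix4_digits(0b010010, 6))   # +18
    print(booth_radix4_digits(0b101110, 6))   # -18 in 6-bit two's complement
```

Because the top digit carries the weight $-2 b_{n-1}$, the recoded value matches the two's complement interpretation directly, which is why no extra sign handling appears in the sketch.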

Booth Multipliers
  Negative partial products require sign extension.
  Suited for signed multiplication.
  Radix 8 (3-bit recoding) possible.
    – Reduces partial products 3 times.
    – Pre-computing 3B, … is difficult.
    – Sometimes used.

Sign Extension
  Partial products can be negative
    – Require sign extension, which is cumbersome
    – High fanout on most significant bit
  (Figure: PP0…PP8 against multiplier bits x-1 = 0, x0…x15, x16 = 0, x17 = 0, with each sign bit s replicated to the left edge.)

Simplified Sign Ext.
  Sign bits are either all 0's or all 1's
    – Note that all 0's is all 1's + 1 in proper column
    – Use this to reduce loading on MSB
  (Figure: PP0…PP8 with each run of sign bits replaced by a run of 1's plus a single s in the proper column.)

Even Simpler Sign Ext.
  No need to add all the 1's in hardware
    – Precompute the answer!
  (Figure: PP0…PP8 with the precomputed constant bits and sign bits s.)

Advanced Multiplication
  Signed vs. unsigned inputs
  Higher radix Booth encoding
  Array vs. tree CSA networks

Tree Addition
  Wallace trees: very irregular tree.
    – Irregular wiring and/or layout
    – Non-uniform bit-arrival times at the final adder.

Wallace Tree Multiplier (figures)

Dot Diagram for Array Multiplier (figure)

Dot Diagram for Tree Multiplier (figure)

4:2 Tree Multiplier (figure)

4:2 Compressor (figure)

Carry-Save Adder (figure)

4:2 Compressors (figures)

16x16 Booth Encoded Multipliers (figure)

TDM Multipliers (figure)

Vertical Compressor Slice (figure)

CPA Prefix Network
  Nonuniform arrival times

Multiplier Implementations
  Sequential Multipliers
    – Low performance, small area, resource sharing
  Braun or Baugh-Wooley Multiplier (array multiplier)
    – Medium performance, high area, high regularity
    – Layout generators => data paths and macro cells
    – Simple pipelining, faster CPA => higher speed
  Booth-Wallace Multiplier
    – High performance, high area, low regularity
    – Custom multipliers, netlist generators
    – Often pipelined (between CSA and CPA)

Composition from Smaller Multipliers
  A (2n x 2n)-bit multiplier can be composed from 4 (n x n)-bit multipliers (can be repeated recursively).
    $A \cdot B = (A_H 2^n + A_L)(B_H 2^n + B_L) = A_H B_H 2^{2n} + (A_H B_L + A_L B_H)\, 2^n + A_L B_L$
  This requires 4 (n x n)-bit multipliers, a (2n)-bit CSA, and a (3n)-bit CPA.
  Less efficient in terms of area and speed.

Squaring
  Squaring is actually multiplication: $P = A^2 = A \cdot A$
  Multiplier optimizations possible.
  (Figure: partial-product array of a 4-bit squarer, product bits p7…p0.)
  $n/2 + 1$ partial products => optimized squarer better than multiplier.
  Table lookup (ROM) less efficient.

Division
  Division basics
    $\frac{A}{B} = Q + \frac{R}{B}$, $A = Q \cdot B + R$ ; $R < B$
  Conditions on values:
    $A \in [0, 2^{2n} - 1]$, $B, Q, R \in [0, 2^n - 1]$, $B \neq 0$
  Algorithms
    – Subtract and shift
    – Sequential, recursive, non-associative

Division
  Basic Algorithm
    – Compare and conditionally subtract
    – Expensive comparison and CPA
  Restoring Division
    – Subtract and conditionally restore
    – Expensive CPA and restoring
  Non-restoring division
    – Detect sign, subtract/add, and correct by next steps.
    – Expensive CPA
  SRT Division
    – Estimate range, subtract/add (CSA), correct by next steps.
    – Inexpensive CSA

Restoring Division
  Put x in register A, d in register B, 0 in register P, and perform n divide steps (n is the quotient wordlength).
  Each step consists of
    – Shift the register pair (P,A) one bit left.
    – Subtract the contents of B from P, put the result back in P.
    – If the result is negative, set the low order bit of A to 0, otherwise to 1.
    – If the result is negative, restore the old value of P by adding the contents of B back into P.

Restoring Division (figures)

Non-restoring Division
  A variant that skips the restoring step and instead works with negative residuals.
  If P is negative,
    – Shift the register pair (P,A) one bit left.
    – Add the contents of register B to P.
  Otherwise,
    – Shift the register pair (P,A) one bit left.
    – Subtract the contents of register B from P.
  If P is negative, set the low-order bit of A to 0, otherwise set it to 1.
  After n cycles,
    – The quotient is in A
    – If P is positive, it is the remainder, otherwise it has to be restored (add B to it).

Non-restoring Division (figure)

Non-restoring Division
    $A = (n+1)\, A_{CPA} = O(n^2)$ or $O(n^2 \log n)$
    $T = (n+1)\, T_{CPA} = O(n^2)$ or $O(n \log n)$

Signed Division
  Example: Signed non-restoring array divider. B > 0, final correction step omitted (figure)
    $A = 9n^2$, $T = 2n^2 + 4n$

SRT Division
  Sweeney, Robertson, Tocher
    $q_i = 1$ if $B 2^i \leq R_{i+1}$
    $q_i = 0$ if $-B 2^i \leq R_{i+1} < B 2^i$
    $q_i = -1$ if $R_{i+1} < -B 2^i$
  If B is normalized, $2^{n-1} \leq B < 2^n$:
    $-B 2^i \leq -2^{n+i-1} \leq R_{i+1} < 2^{n+i-1} \leq B 2^i$
    $q_i = 1$ if $2^{n+i-1} \leq R_{i+1}$
    $q_i = 0$ if $-2^{n+i-1} \leq R_{i+1} < 2^{n+i-1}$
    $q_i = -1$ if $R_{i+1} < -2^{n+i-1}$

SRT Division
  Only 3 MSB are compared
    – $q_i'$ are estimated
    – CSA instead of CPA can be used
  Correction in the following steps + final correction step.
  Redundant representation of $q_i'$ (SD representation), final conversion necessary (CPA).
  Highly regular and fast O(n) SRT array dividers
    – Only slightly slower/larger than array multipliers

SRT Division
    $A = n A_{CSA} + 2 A_{CPA} = O(n^2)$
    $T = n T_{CSA} + T_{CPA} = O(n)$

SRT Division
  Pre-normalization of divisor ½ ≤ d ≤ 1 and dividend x < d.

SRT Division
  The quotient digit set plays a crucial role in the complexity of implementation.
    – Restoring algorithm: $0 \leq q_i \leq r - 1$
    – Non-restoring algorithm: $q_i \in \{-1, 1\}$
    – SRT: quotient digit selection function
      $q_{i+1} = 1$ if $\tfrac{1}{2} \leq 2w_i$
      $q_{i+1} = 0$ if $-\tfrac{1}{2} \leq 2w_i < \tfrac{1}{2}$
      $q_{i+1} = -1$ if $2w_i < -\tfrac{1}{2}$
  SRT division is very fast in the case of consecutive zeros in q.

SRT Division (figures)
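The radix-2 SRT digit selection above can be tried out with exact rational arithmetic. The sketch below is a behavioural model only, assuming a pre-normalized divisor with 1/2 ≤ d < 1 and a dividend with |x| ≤ d; it uses full-precision comparisons instead of the 3-MSB estimate and CSA residuals of a real divider, and the name `srt_radix2` is hypothetical.

```python
from fractions import Fraction


def srt_radix2(x, d, steps):
    """Radix-2 SRT division of x by d with digits in {-1, 0, 1}.

    Assumes 1/2 <= d < 1 and |x| <= d. Returns (digits, quotient, residual),
    where x == quotient * d + residual / 2**steps holds exactly.
    """
    half = Fraction(1, 2)
    w = Fraction(x)                    # partial remainder w_0 = x
    digits = []
    for _ in range(steps):
        shifted = 2 * w
        if shifted >= half:            # q = 1: subtract the divisor
            q = 1
        elif shifted < -half:          # q = -1: add the divisor
            q = -1
        else:                          # q = 0: shift only, no add/subtract
            q = 0
        w = shifted - q * d
        digits.append(q)
    # Convert the signed-digit quotient to an ordinary number.
    quotient = sum(Fraction(q, 2 ** (i + 1)) for i, q in enumerate(digits))
    return digits, quotient, w


if __name__ == "__main__":
    digits, quotient, w = srt_radix2(Fraction(3, 8), Fraction(5, 8), 12)
    print(digits, float(quotient))
```

The final sum over the digits plays the role of the signed-digit to binary conversion, which in hardware is the last CPA mentioned on the slides.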

High Radix Division
  Radix $\beta = 2^m$, $q_i \in \{-(\beta - 1), \ldots, -1, 0, 1, \ldots, \beta - 1\}$
  m quotient bits per step => fewer, but more complex steps.
  Suitable for SRT algorithm => faster
  Complex comparisons and decisions
  Table look-up (Pentium bug)

Pentium Bug
  March 1993: Intel introduces the Pentium.
  June 1994: Prof. Thomas Nicely, Lynchburg College, reports errors in calculating reciprocals of twin primes.
  October 1994: After considerable background discussion, word starts circulating on the Internet. Others confirm the error and find more instances.
  November 1994: Tim Coe, of Vitesse Semiconductor, proposes a [substantially correct] software model to explain the cause. An Intel internal report analyzes a flaw in the Pentium FDIV instruction.

Pentium Bug
  Intel CEO Andy Grove responds (Nov. 24, 1994):
    – Minor bug known at Intel since mid-94.
    – All micros have bugs.
    – "Average user" will never see the problem (MTBE: 27,000 years).
    – Most applications do fewer than 1,000 divisions a day (?!).
    – FDIV error rate is about $1.5 \times 10^{-9}$.
    – Error conditions guarantee small errors.

Pentium Bug
  Response (continued)
    – Many applications (e.g. graphics) can tolerate occasional small errors.
    – Offers replacement for justified need.
  Popular press generally accepts Intel's claims about "obscure error."
  Intel confirms 2 million defective chips have been shipped.

Pentium Bug
  December 1994: IBM disagrees (MTBE: 24 days); stops shipment of Pentium-based PCs.
    – Even casual spreadsheet users may do about $4.2 \times 10^6$ divides per day.
    – The error distribution is not uniform.
    – Under some reasonable conditions FDIV error rate can approach $10^{-2}$.
  Some question IBM's motives.
  A flurry of Internet communication condemns Intel's attitude and questions its evaluation of the problem.

Pentium Bug
  Intel revises replacement policy. Hard to interpret policy but easy to accomplish in practice.
  2% of home users and 10% of businesses eventually get replacements.
  Intel (Andy Grove) admits mishandling the problem, but stands by its evaluation.
  Public perception is that Intel was responsive ⇒ positive publicity.
  March 1995: Coe, et al. article appears in IEEE Journ. Computational Sci. and Eng.
  May 1995: Lamport article appears at TAPSOFT.

Pentium Bug
  Kahan posts should-have-known SRT test article.
  1996: Intel establishes the world's largest verification division, dominating industrial research through 20??.
  Cost of the Pentium affair reportedly $450 million; $15/$16 billion market in 1996.
  Intel Marketing Rep: ". . . wrote it off to advertising."
  1997–2000: All major μprocessor manufacturers adopt formal verification. Surge in CAD industry tool offerings.

Pentium Bug
  Significant research results appear in floating point verification.
  2000–2002: Articles, conference panel sessions on verification "culture." IC technology roadmap: looming "design crisis."
  Nice discussion of SRT and the bug in http://www.eng.utah.edu/~cs5830/handouts/lec-SRT.pdf

Pentium Humor
  Q: How many Pentium designers does it take to screw in a light bulb?
    – A: 1.99904274017, but that's close enough for non-technical people.
  Q: What's another name for the "Intel Inside" sticker they put on Pentiums?
    – A: The warning label.
  Q: What do you call a series of FDIV instructions on a Pentium?
    – A: Successive approximations.

Pentium Humor
  Q: Why didn't Intel call the Pentium the 586?
    – A: Because they added 486 and 100 on the first Pentium and got 585.999983605.
– Q: According to Intel, the Pentium conforms to the IEEE standards 754 and 854 for floating point arithmetic. If you fly in aircraft designed using a Pentium, what is the correct pronunciation of “IEEE”?
– A: Aaaaaaaiiiiiiiiieeeeeeeeeeeee!

Division by Multiplication
Division by convergence: scale numerator and denominator by the same factors until the denominator converges to 1.
$Q = \dfrac{A}{B} = \dfrac{A \cdot R_0 R_1 \cdots R_{m-1}}{B \cdot R_0 R_1 \cdots R_{m-1}} \to \dfrac{Q}{1}$
$B_{i+1} = B_i \cdot R_i = (1 - y)(1 + y) = 1 - y^2$, with $B_i = 1 - y$ and $R_i = 1 + y = 2 - B_i$ (the two's complement of $B_i$)
Algorithm:
$B_{i+1} = B_i \cdot R_i$, $A_{i+1} = A_i \cdot R_i$, $R_i = 2 - B_i$; $i = 0, \ldots, m-1$
$A_0 = A$, $B_0 = B$, $Q = A_m$

Division by Multiplication
Quadratic convergence: $L = O(\log n)$ iterations

Division by Reciprocation
Use the reciprocal: $Q = \dfrac{A}{B} = A \cdot \dfrac{1}{B}$
How to find the reciprocal?
– Newton-Raphson iteration method: find the root $f(X) = 0$ by the recursion $X_{i+1} = X_i - \dfrac{f(X_i)}{f'(X_i)}$
– $f(X) = \dfrac{1}{X} - B$, $f'(X) = -\dfrac{1}{X^2}$, $f\!\left(\dfrac{1}{B}\right) = 0$

Division by Reciprocation
Algorithm:
$X_{i+1} = X_i \cdot (2 - B \cdot X_i)$; $i = 0, \ldots, m-1$
$X_0 = B$, $Q = A \cdot X_m$
Quadratic convergence: $L = O(\log n)$
Speed-up: first approximation of $X_0$ from table.

Divider Implementations
Iterative dividers (through multiplication):
– Resource sharing of existing components (multiplier)
– Medium performance, medium area
– High efficiency if components are shared
Sequential dividers (restoring, non-restoring, SRT)
– Resource sharing of existing components (e.g. adder)
– Low performance, low area

Divider Implementations
Array dividers
– Dedicated hardware component
– High performance, high area
– High regularity -> layout generators, pipelining
– Square root extraction possible by minor changes
– Combination with multiplication and/or square root
No parallel dividers exist that are comparable to parallel multipliers.
– Sequential nature of division.
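
The two multiplicative schemes above (division by convergence and division by reciprocation) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: floating-point numbers stand in for the fixed-point datapath, the divisor is assumed to be pre-normalized to [0.5, 1), and the function names and default step counts are hypothetical, not taken from the slides.

```python
def divide_by_convergence(a, b, steps=6):
    """Division by convergence: multiply A and B by R_i = 2 - B_i.

    B_i converges quadratically to 1, so A_i converges to A / B.
    Assumes b is normalized to [0.5, 1).
    """
    if not 0.5 <= b < 1.0:
        raise ValueError("b must be normalized to [0.5, 1)")
    for _ in range(steps):
        r = 2.0 - b      # R_i = 2 - B_i
        a *= r           # A_{i+1} = A_i * R_i
        b *= r           # B_{i+1} = B_i * R_i
    return a             # Q = A_m


def reciprocal_newton(b, steps=8):
    """Newton-Raphson reciprocal: X_{i+1} = X_i * (2 - B * X_i).

    Starts from X_0 = B, which converges for b in [0.5, 1).
    A table look-up for X_0 would reduce the number of steps.
    """
    if not 0.5 <= b < 1.0:
        raise ValueError("b must be normalized to [0.5, 1)")
    x = b                # X_0 = B
    for _ in range(steps):
        x = x * (2.0 - b * x)
    return x             # X_m approximates 1 / B


def divide_by_reciprocal(a, b, steps=8):
    """Division by reciprocation: Q = A * (1 / B)."""
    return a * reciprocal_newton(b, steps)
```

Each iteration roughly doubles the number of correct bits, which is where the O(log n) iteration count comes from. The reciprocal version needs a couple more steps here only because X_0 = B is a crude starting value.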

Square Root
Algorithm:
$\sqrt{A - R} = Q$, $A = Q^2 + R$
$A \in [0, 2^{2n} - 1]$, $Q \in [0, 2^n - 1]$
$Q_i = Q_{i+1} + q_i 2^i = (q_{n-1}, \ldots, q_i, 0, \ldots, 0)$
$Q_i^2 = (Q_{i+1} + q_i 2^i)^2 = Q_{i+1}^2 + q_i 2^i \,(2 Q_{i+1} + q_i 2^i)$
$q_i = 1$ if $R_{i+1} \ge 2^i \,(2 Q_{i+1} + 2^i)$, else $q_i = 0$
$Q_i = Q_{i+1} + q_i 2^i$
$R_i = R_{i+1} - q_i 2^i \,(2 Q_{i+1} + q_i 2^i)$; $i = n-1, \ldots, 0$
$R_n = A$, $Q_n = 0$, $R = R_0$, $Q = Q_0$

Square Root
Implementation:
– Similar to division -> same algorithms applicable
– Restoring, non-restoring, SRT, high radix
Combination with division in same component possible
Only triangular array required: $A \approx A_{\mathrm{DIV}} / 2$, $T \approx T_{\mathrm{DIV}}$

Elementary Functions
Exponential function: $e^x$, exp(x)
Logarithm function: ln x, log x
Trigonometric functions: sin x, cos x, tan x
Inverse trigonometric functions: arcsin x, arccos x, arctan x
Hyperbolic functions: sinh x, cosh x, tanh x

Algorithms
Table lookup
– Inefficient for large word lengths
Taylor series expansion
– Complex implementation
Polynomial and rational approximations
Shift and add algorithms
Convergence algorithms
– Similar to division by convergence
– Two (or more) recursive formulas: one formula converges to a constant, the other to the result.

Algorithms
Coordinate rotation (CORDIC)
– 3 equations for x-, y-coordinate, and angle
– Computes all elementary functions by proper input settings and choice of modes and outputs
– Simple, universal hardware, small look-up table.

Design Levels
Transistor level design
– Circuit and layout designed by hand (full custom)
– Low design efficiency
– High circuit performance
– High flexibility: choice of architecture and logic style
– Transistor level circuit optimizations
  • Logic style (static/dynamic logic, complementary CMOS/pass-transistor logic)
  • Special arithmetic circuits better than with gates.
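
Returning to the square root recurrence on the Square Root slide: the digit-by-digit algorithm can be sketched in Python as follows. This is a minimal sketch of the restoring variant only, using Python integers in place of hardware registers; the function name `restoring_sqrt` and its argument names are hypothetical.

```python
def restoring_sqrt(a, n):
    """Digit-by-digit (restoring) integer square root.

    Returns (q, r) with a == q*q + r for 0 <= a < 2**(2*n).
    Follows R_n = A, Q_n = 0 and, for i = n-1 down to 0:
        q_i = 1 if R_{i+1} >= 2^i * (2*Q_{i+1} + 2^i), else 0
        Q_i = Q_{i+1} + q_i * 2^i
        R_i = R_{i+1} - q_i * 2^i * (2*Q_{i+1} + q_i * 2^i)
    """
    if not 0 <= a < (1 << (2 * n)):
        raise ValueError("a must fit in 2n bits")
    q, r = 0, a                                  # Q_n = 0, R_n = A
    for i in range(n - 1, -1, -1):
        trial = (1 << i) * (2 * q + (1 << i))    # 2^i * (2*Q_{i+1} + 2^i)
        if r >= trial:                           # q_i = 1: subtract, set bit i
            r -= trial
            q += 1 << i
        # q_i = 0: remainder and root stay unchanged (restoring step)
    return q, r                                  # Q = Q_0, R = R_0
```

For example, `restoring_sqrt(10, 2)` walks through i = 1 and i = 0 and ends with root 3 and remainder 1. The compare-and-subtract step has the same shape as a restoring division step, which is why the slides note that the same hardware structures apply.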

Design Levels
Gate level design
– Cell based design techniques: standard cells, gate-array/sea-of-gates, field programmable gate array (FPGA)
– Circuit implemented by hand or synthesis (library)
– Layout implemented by automated place and route
– Medium to high design efficiency
– Medium to low circuit performance
– Medium to low flexibility: full choice of architecture.

Design Levels
Block level design
– Layout blocks and netlists from parameterized automatic generators or compilers
– High design efficiency
– Medium to high circuit performance
– Low flexibility (limited choice of architectures)

Design Levels
Block level design
– Implementations:
  • Data-path: bit-sliced, bus oriented layout, implementation of entire data paths, medium performance, medium diversity
  • Macro-cells: tiled layout, fixed/single operation components, high performance, small diversity
  • Portable netlists: gate level design

Synthesis
High-level synthesis
– Synthesis from abstract, behavioral hardware description (e.g., data dependency graphs) using e.g. VHDL
– Involves architectural synthesis and arithmetic transformations
– High-level synthesis still not fully mature

Synthesis
Low-level synthesis
– Layout and netlist generators
– Included in libraries and synthesis tools
– Low-level synthesis is state of the art
– Basis for efficient ASIC design
– Limited diversity and flexibility of library components

Synthesis
Circuit optimization
– Efficient optimization of random logic is state of the art.
– Optimization of entire arithmetic circuits is not feasible
  • Only local optimizations possible
– Logic optimization cannot replace the synthesis of efficient arithmetic circuit structures using generators.

Low Power
High glitching activity due to high bit dependencies and large logic depth
Reduce the switched capacitance by choosing an area-efficient circuit architecture
Allow for lower supply voltage by speeding up the circuitry

Low Power
Reduce the transition activity
– Apply stable inputs when circuit not in use
  • Disable circuits
– Reduce glitching transitions by balancing signal paths (partly done by speed-up techniques, otherwise difficult to realize)
– Reduce glitching transitions by reducing logic depth
– Take advantage of correlated data streams
– Choose appropriate number representations (e.g. Gray codes for counters)

Testability
Testability goal: high fault coverage with few test vectors that are easy to generate/apply.
Random test vectors: easy to generate and apply/propagate; few vectors give high (but not perfect) fault coverage for most arithmetic circuits.
Special test vectors: sometimes hard to generate and apply; required for coverage of hard-to-detect faults, which are inherent in most arithmetic circuits.

Testability
Hard-to-detect faults are found in:
– Circuits of arithmetic operations with inherent special cases (arithmetic exceptions): detectors, comparators, incrementers, and counters (MSBs), adder flags.
– Circuits using redundant number representations (≠ redundant hardware): dividers (Pentium bug!)