MIT OpenCourseWare http://ocw.mit.edu 6.004 Computation Structures Spring 2009 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. Binary Multiplication Cost/Performance Tradeoffs: a case study Digital Systems Architecture 1.01 x a n bits b n bits a b EASY PROBLEM: design combinational circuit to multiply tiny (1-, 2-, 3-bit) operands... HARD PROBLEM: design circuit to multiply BIG (32-bit, 64-bit) numbers 2n bits since (2n-1)2 < 22n We can make big multipliers out of little ones! Engineering Principle: Exploit STRUCTURE in problem. Lab #3 due tonight! 3/5/09 6.004 – Spring 2009 modified 2/23/09 10:44 L09 - Multipliers 1 3/5/09 6.004 – Spring 2009 L09 - Multipliers 2 Our Basis: Making a 2n-bit multiplier using n-bit multipliers n=1: minimalist starting point Multiplying two 1-bit numbers is pretty simple: Given n-bit multipliers: a aH aL a x n bits b = ab x aLbL Synthesize 2n-bit multipliers: 0 ab aHbL 2n bits aHbH b x = Of course, we could start with optimized combinational multipliers for larger operands; e.g. aLbH a b bH bL 2n bits n bits x 2n bits ab ab a1a0 2 b1b0 2 2-bit Multiplier 4 c3c2c1c0 the logic gets more complex, but some optimizations are possible... 4n bits 6.004 – Spring 2009 3/5/09 L09 - Multipliers 3 6.004 – Spring 2009 3/5/09 L09 - Multipliers 4 Brick Wall view Our induction step: of partial products 2n-bit by 2n-bit multiplication: 1. Divide multiplicands into n-bit pieces 2. Form 2n-bit partial products, using n-bit by n-bit multipliers. 3. Align appropriately 4. Add. Making 4n-bit multipliers from n-bit ones: 2 “induction steps” aLbH aHbH aLbL aHbL aH aL x bH bL REGROUP partial products 2 additions rather than 3! a0b3 a1b3 a0b2 a3 a2 a1 a0 x b3 b2 b1 b0 a3b3 a2b2 a3b2 Induction: we can use the same structuring principle to build a 4n-bit multiplier from our newly-constructed 2n-bit ones... 3/5/09 L09 - Multipliers 5 x a1b3 "g(n) is of order f(n)" a0b2 a3b1 a2b0 a3b0 ADD Example: g(n) = (f(n)) g(n) = (f(n)) if there exist C 2 C 1 > 0, such that for all but finitely many integral n 0 a2b3 a1b2 a0b1 a3b3 a2b2 a1b1 a0b0 a3b2 a2b1 a1b0 Subassemblies: • Partial Products • Adders MULT "Order Of" notation: a0b3 b3 b2 b1 b0 c 1•f(n) g(n) c 2•f(n) L09 - Multipliers 7 (n 2) n 2 (n 2+2n+3) 2n 2 g(n) = O(f(n)) (...) implies both inequalities; O(...) implies only the second. Partial Products: Things to Add: Adder Width: Hardware Cost: Latency: 3/5/09 n 2+2n+3 = since "almost always" Step 2: Sum 6.004 – Spring 2009 L09 - Multipliers 6 Performance/Cost Analysis Step 1: Form (& arrange) Partial Products: a3 a2 a1 a0 3/5/09 6.004 – Spring 2009 Multiplier Cookbook: Chapter 1 Given problem: a0b1 a1b1 a0b0 a1b0 a2b1 a3b1 a2b0 a3b0 a•b 6.004 – Spring 2009 a1b2 a2b3 6.004 – Spring 2009 3/5/09 2 n 2n-2 2n ? = = = = 2 (n ) (n) (n) 2 (n ) (n2) ?? L09 - Multipliers 8 Observations: Repackaging Function ADD MULT a1b3 a0b2 (n ) partial products. 2 (n ) full adde rs. Put the Solution where the Problem is. 2 (n ) partial products. 2 (n ) full adde rs. Hmmm. a2b3 a1b2 a0b1 a3b3 a2b2 a1b1 a0b0 a3b2 a2b1 a1b0 a3b1 a2b0 a3b0 2 Engineering Principle #2: a0b3 MULT a0 b3 a1 b3 a0 b2 a2 b3 a1 b2 a0 b1 a3 b3 a2 b2 a1 b1 a0 b0 a3 b2 a2 b1 a1 b0 a3 b1 a2 b0 a3 b0 ADD How about n2 blocks, each doing a little multiplication and a little addition? 3/5/09 6.004 – Spring 2009 L09 - Multipliers 9 Goal: Sk+1 b2 a0 b3 b1 a1 b3 a0 b2 b0 a2 b3 a1 b2 a0 b1 a3 b3 a2 b2 a1 b1 a0 b0 a3 b2 a2 b1 a1 b0 a0 a3 b1 a2 b0 a1 a3 b0 a b3 Sk bi Single "brick" of brick-wall array... Ck+2 Ck • Forms partial product • Adds to accumulating sum along with carry aj 2 S’k+1 a3 (A+B)i 6.004 – Spring 2009 Ci Takes 2 addend bits plus carry bit. Produces sum and carry output bits. Sk+1 L09 - Multipliers 11 Sk 2 • AND gate forms 1x1 product • 2-bit sum propagates from top to bottom • Carry propagates to left 6.004 – Spring 2009 bi 0 Brick design: Ck+2 FA FA S’k+1 S’k Ck aj Wastes some gates… but consider (say) optimized 4x4-bit brick! CASCADE to form an n-bit adder. 3/5/09 a0 b3 b1 a1 b3 a0 b2 b0 a2 b3 a1 b2 a0 b1 a3 b3 a2 b2 a1 b1 a0 b0 a3 b2 a2 b1 a1 b0 a0 a3 b1 a2 b0 a1 a3 b0 a Array Layout: • operand bits bused diagonally • Carry bits propagate right-to-left • Sum bits propagate down a3 Ai Bi FA b2 S’k Necessary Component: Full Adder Ci+1 L09 - Multipliers 10 Design of 1-bit multiplier "Brick": Array of Identical Multiplier Cells b3 3/5/09 6.004 – Spring 2009 3/5/09 L09 - Multipliers 12 Multiplier Cookbook: Chapter 2 Latency revisited Here’s our combinational multiplier: a0 b3 a1 a2 a3 Combinational Multiplier: What’s its propagation delay? a0 b2 a0 b3 Naive (but valid) bound: b1 • O(n) additions a1 b3 a0 b2 b0 • O(n) time for each addition a2 b3 a1 b2 a0 b1 • Hence O(n2) time required a3 b3 a2 b2 a1 b1 a0 b0 a3 b2 a2 b1 a1 b0 On closer inspection: a3 b1 a2 b0 • Propagation only toward left, bottom a3 b0 • Hence longest path bounded by length + width of array: O(n+n) = O(n)! 3/5/09 6.004 – Spring 2009 L09 - Multipliers 13 b3 Hardware for b2 (n2) n by n bits: a0 b3 b1 a1 b3 a0 b2 Latency: (n) b0 a2 b3 a1 b2 a0 b1 Throughput: (1/n) a3 b3 a2 b2 a1 b1 a0 b0 a3 b2 a2 b1 a1 b0 a3 b1 a2 b0 Note: lots of tricks are a3 b0 available to make a faster combinational multiplier… a1 a2 a3 + 3/5/09 6.004 – Spring 2009 Combinational Multiplier: The Pipelining Bandwagon... best bang for the buck? where do I get on? WE HAVE: Suppose we have LOTS of multiplications. • Pipeline rules - "well formed pipelines" • Plenty of registers • Demand for higher throughput. PIPELINING Can we do better from a cost/performance standpoint? a0 3/5/09 L09 - Multipliers 15 6.004 – Spring 2009 b3 a1 a3 What do we do? Where do we define stages? 6.004 – Spring 2009 L09 - Multipliers 14 3/5/09 b2 a b 0 3 a2 b1 a1 b3 a0 b2 b0 a2 b3 a1 b2 a0 b1 a3 b3 a2 b2 a1 b1 a0 b0 a3 b2 a2 b1 a1 b0 a3 b1 a2 b0 a3 b0 L09 - Multipliers 16 Stupid Pipeline Tricks Even Stupider Pipeline Tricks gotta break that long carry chain! a0 b3 WORSE idea: a2 b3 a1 b2 a0 b1 a3 b3 a2 b2 a1 b1 a0 b0 a3 b2 a2 b1 a1 b0 Stages: (n) Clock Period: Hardware cost for n by n bits: (n) a3 b1 a2 b0 a3 b0 Latency: Throughput: 2 (n ) 2 (n ) Back to basics: what’s the point of pipelining, anyhow? ( 1/n ) 3/5/09 6.004 – Spring 2009 L09 - Multipliers 17 a0 b2 a0 b3 a0 b2 a1 • Segment every long combinational path a1 b3 a0 b1 a0 b0 a1 b2 a1 b1 a2 b3 a2 b2 a1 b0 a2 a3 a3 b3 a3 b2 b1 b0 (n) (1) (n2) (n) Throughput: (1) a3 a2 b1 a2 b0 • Well-formed pipeline (careful!) • Constant (high!) throughput, independently of operand size. a0 b3 b2 a0 b3 a0 b2 a1 b1 b0 a1 b3 a0 b1 a0 b0 a1 b2 a1 b1 a2 b3 a2 b2 a3 b3 a3 b2 a1 b0 a2 b1 a2 b0 a3 b1 a3 b0 ... but suppose we don’t need the throughput? GOAL: (n) stages; (1) clock period! 3/5/09 Stages: Clock Period: Hardware cost for n by n bits: Latency: a2 a3 b1 a3 b0 6.004 – Spring 2009 L09 - Multipliers 18 Multiplier Cookbook: Chapter 3 b3 • Break array into diagonal slices 3/5/09 6.004 – Spring 2009 Breaking O(n) combinational paths LONG PATHS go down, to left: • Doesn’t break long combinational paths • NOT a well-formed pipeline... ... different register counts on alternative paths ... data crosses stage boundaries in both directions! a0 b3 a1 b3 a0 b2 a2 b3 a1 b2 a0 b1 a3 b3 a2 b2 a1 b1 a0 b0 a3 b2 a2 b1 a1 b0 a3 b1 a2 b0 a3 b0 a1 b3 a0 b2 L09 - Multipliers 19 6.004 – Spring 2009 3/5/09 L09 - Multipliers 20 Multiplier Cookbook: Chapter 4 Moving down the cost curve... b3 Suppose we have INFREQUENT multiplications... pipelining doesn’t help us. a0 a0 b3 a0 b2 a1 a1 b3 a2 a2 b3 a2 b2 a3 a3 b3 a3 b2 a1 b2 a1 b1 Sequential Multiplier: b2 ai b2 Can we do better from a cost/ performance standpoint? Hmmm, do I really need all these extras? b3 b1 ai b3 ai b2 b1 b0 a0 b1 a0 b0 b0 • Re-uses a single n-bit “slice” to emulate each pipeline stage • a operand entered serially ai b1 ai b0 • Lots of details to be filled in... a1 b0 a2 b1 a2 b0 Stages: Clock Period: Hardware cost for n by n bits: Latency: a3 b1 a3 b0 Throughput: 3/5/09 6.004 – Spring 2009 L09 - Multipliers 21 Suppose we want to minimize hardware (at any cost)… b2 • Consider bit-serial! b1 ai b3 ai b2 b0 • Form and add 1-bit partial product per clock ai b1 ai b0 b3 Bit Serial multiplier: Cost minimization: how far can we go? ai • Re-uses a single brick to emulate an n-bit slice • both operands entered serially • O(n2) clock cycles required • Needs additional storage (typically from existing registers) • Reuse single “brick” for each bit bj of slice; • Re-use slice for each bit of a operand Stages: Clock Period: Hardware cost for n by n bits: Latency: Throughput: 6.004 – Spring 2009 3/5/09 L09 - Multipliers 22 Multiplier Cookbook: Chapter 5 Extremes Dept... b3 3/5/09 6.004 – Spring 2009 (Ridiculous?) 1 ( 1 ) ( constant!) ( n ) ( n ) (1/n) L09 - Multipliers 23 6.004 – Spring 2009 b2 ai b1 ai b3 ai b2 b0 ai b1 ai b0 ( 1 n ) ( 1 ) (constant) ( 1 ) + ? (n2) (1/n2) 3/5/09 L09 - Multipliers 24 Summary: Scheme: $ Latency Thruput Combinational (n2) (n) (1/n) N-pipe (n2) (n) (1) Slice-serial (n) (n) (1/n) Bit-serial (1)* (n2) (1/n2) Lots more multiplier technology: fast adders, Booth Encoding, column compression, ... 6.004 – Spring 2009 3/5/09 L09 - Multipliers 25