ECE 465 High-Level Design Strategies, Lecture Notes #9
Shantanu Dutt, Electrical & Computer Engineering, University of Illinois at Chicago

Outline
• Circuit Design Problem
• Solution Approaches:
– Truth Table (TT) vs. Computational/Algorithmic (yes, hardware, just like software, can implement any algorithm; after all, software runs on hardware!)
– Flat vs. Divide-&-Conquer
– Divide-&-Conquer:
  • Associative operations/functions
  • General operations/functions
– Other design strategies for fast circuits:
  • Speculative computation
  • Best of both worlds (best average and best worst-case)
  • Pipelining
• Summary

Circuit Design Problem
• Design an 8-bit comparator that compares two 8-bit #s available in two registers A[7..0] and B[7..0], and that o/ps F = 1 if A > B and F = 0 if A <= B.
• Approach 1: The TT approach: write down a 16-input TT, derive a logic expression from it, minimize it, obtain a gate-based realization, etc.!

  A        B        F
  00000000 00000000 0
  00000000 00000001 0
  ...
  00000001 00000000 1
  ...
  11111111 11111111 0

– Too cumbersome and time-consuming
– Fraught with the possibility of human error
– Difficult to formally prove correctness (i.e., proof w/o exhaustive testing)
– Will generally have high hardware cost (including wiring, which can be unstructured and messy) and delay

Circuit Design Problem (contd.)
• Approach 2: Think computationally/algorithmically about what the ckt is supposed to compute.
• Approach 2(a): Flat computational/programming approach:
– Note: a TT can be expressed as a sequence of "if-then-else"s:
  If A = 00000000 and B = 00000000 then F = 0
  else if A = 00000000 and B = 00000001 then F = 0
  ...
  else if A = 00000001 and B = 00000000 then F = 1
  ...
– Essentially a re-hashing of the TT; same problems as the TT approach

Circuit Design Problem: Strategy 1: Divide-&-Conquer
• Approach 2(b): Structured algorithmic approach:
– Be more innovative; think of the structure/properties of the problem that can be used to solve it in a hierarchical or divide-&-conquer (D&C) manner.

[Figure: D&C tree. Root problem A is broken into subprob. A1 and subprob. A2 (possibly with a data dependency between them), each broken in turn into A1,1, A1,2 and A2,1, A2,2; a stitch-up of the solns to A1 and A2 forms the complete soln to A. Legend: D&C breakup arrows; data/signal flow to solve a higher-level problem; possible data flow betw. sub-problems. Recurse until the subprob size is s.t. a TT-based design is doable.]

– D&C approach: see if the problem can be:
  • "broken up" into 2 or more smaller subproblems; two types of breaks are possible:
    – by # of operands: partition the set of n operands into 2 or more subsets of operands
    – by operand size: break a constant # of n-bit operands into smaller-size operands (this mainly applies when the # of operands is a constant, e.g., addition of 2 #s)
  • whose solns can be "stitched up" (stitch-up function) to give a soln to the parent prob
  • also, consider if there is dependency between the sub-probs (results of some required to solve the other(s))
– Do this recursively for each large subprob until the subprobs are small enough (the leaf problems) for TT solutions
– If the subproblems are of a similar kind (but of smaller size) to the root prob.
then the breakup and stitching will also be similar, but if not, they have to be broken up differently.

Circuit Design Problem: Strategy 1: Divide-&-Conquer
• Especially for D&C breakups in which (a) the subproblems are the same problem type as the root problem, and (b) there is no data dependency between subproblems, the final circuit will be a "tree" of stitch-up functions (of either the same size or different sizes at different levels; this depends on the problem being solved) with leaf functions at the bottom of the tree, as shown in the figure below for a 2-way breakup of each problem/sub-problem.

[Figure: a binary tree with stitch-up functions at the internal nodes and leaf functions at the bottom.]

• Solving a problem using D&C generally yields a fast, low-cost and streamlined design (the wiring required is structured, not all jumbled up and messy).

Shift Gears: Design of a Parity Detection Circuit: An n-input XOR
(a) A linearly-connected circuit of 2-i/p XORs: x(0), x(1), x(2), x(3), ..., x(15) feed a chain of XOR gates producing f.
• No concurrency in design (a); the actual problem has available concurrency, but it is not exploited well in the above "linear" design
• Complete sequentialization leads to a delay that is linear in the # of bits n (delay = (n-1)*td, where td = delay of 1 gate)
• All the available concurrency is exploited in design (b), a parity tree (see next slide).
• Question: When can we have a circuit for an operation/function on multiple operands built of "gates" performing the same operation for fewer (generally a small number betw. 2-5) operands?
• Answer: (1) It should be possible to break down the n-operand function into multiple operations w/ fewer operands. (2) When the operation is associative. An oper.
"x" is said to be associative if: a x b x c = (a x b) x c = a x (b x c), or, stated as a function, f(a, b, c) = f(f(a,b), c) = f(a, f(b,c)).
• This implies that, for example, if we have 4 operands a x b x c x d, we can either perform this as:
– a x (b x (c x d)) [getting a linear delay of 3 units, or in general n-1 units for n operands]; in function notation: f(a, f(b, f(c,d)))
– or as (a x b) x (c x d) [getting a logarithmic (base 2) delay of 2 units and exploiting the available concurrency due to the fact that "x" is associative]; in function notation: f(f(a,b), f(c,d))
• Is XOR associative? Yes.
• The parenthesization corresp. to the linear ckt (a) is:
– (.....((x(0) xor x(1)) xor x(2)) xor x(3)) xor .... xor x(15))
• All these Qs can be answered by the D&C approach.

Shift Gears: Design of a Parity Detection Circuit: A Series of XORs
• If we have 4 operations a x b x c x d, we can either perform this as a x (b x (c x d)) [getting a linear delay of 3 units] or as (a x b) x (c x d) [getting a logarithmic (base 2) delay of 2 units and exploiting the available concurrency due to the fact that "x" is associative].
• We can extend this idea to n operands (and n-1 operations) to perform as many of the pairwise operations as possible in parallel (and do this recursively for every level of remaining operations), similar to design (b) for the parity detector [xor is an associative operation!], and thus get a (log2 n) delay.
• In fact, any parenthesization of operands is correct for an associative operation/function, but the above one is fastest. Surprisingly, every parenthesization leads to the same h/w cost: n-1 2-i/p gates, i.e., 2(n-1) gate i/ps. Why? Analyze.

[Figure: (b) 16-bit parity tree, with internal signals w(level, position), root o/p w(0,0) = f. Delay = (# of levels in the tree) * td = log2(n) * td.]

An example of simple designer ingenuity.
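The associativity argument above can be sanity-checked in software. Below is a minimal Python sketch (the function names are mine, not from the notes) that computes parity both by the linear chain and by the pairwise reduction tree, and counts the tree's levels:

```python
from functools import reduce

def linear_parity(bits):
    """Linear chain: (((x0 ^ x1) ^ x2) ^ ...); n-1 gate levels for n inputs."""
    return reduce(lambda a, b: a ^ b, bits)

def tree_parity(bits):
    """Pairwise reduction tree; log2(n) levels for n inputs (n a power of 2).
    Returns (parity, # of levels)."""
    levels = 0
    while len(bits) > 1:
        # one level of 2-i/p XOR "gates" operating in parallel
        bits = [bits[i] ^ bits[i + 1] for i in range(0, len(bits), 2)]
        levels += 1
    return bits[0], levels

x = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]   # 16 inputs
p, levels = tree_parity(x)
assert p == linear_parity(x)   # any parenthesization gives the same result
assert levels == 4             # log2(16) = 4 levels vs. 15 in the linear chain
```

Both designs use the same 15 two-input XOR "gates"; only the depth differs, matching the (n-1)*td vs. log2(n)*td delay claim.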
A bad design would have resulted in a linear delay; an ingenious (though simple enough) & well-informed design results in a log delay, and both have the same gate-i/p cost.

Parenthesization of the tree circuit:
(((x(15) xor x(14)) xor (x(13) xor x(12))) xor ((x(11) xor x(10)) xor (x(9) xor x(8)))) xor (((x(7) xor x(6)) xor (x(5) xor x(4))) xor ((x(3) xor x(2)) xor (x(1) xor x(0))))

D&C for Associative Operations
• Let f(xn-1, ....., x0) be an associative function.
• Can the D&C approach be used to yield an efficient, streamlined n-bit xor/parity function? Can it lead automatically to a tree-based ckt?
• What is the D&C principle involved here?

[Figure: f(xn-1, .., x0) is broken into f(xn-1, .., xn/2) and f(xn/2-1, .., x0), whose o/ps a, b feed the stitch-up function f(a,b), which is the same as the original function for 2 inputs.]

• Using the D&C approach for an associative operation results in a breakup by # of operands, and the stitch-up function is the same as the original function (this is not the case for non-assoc. operations), but w/ a constant # of operands (2, if the original problem is broken into 2 subproblems)
• Also, there are no dependencies between sub-problems
• If the two sub-problems of the D&C approach are balanced (of the same size, or as close to it as possible), then unfolding the D&C results in a balanced operation tree of the type seen earlier for the xor/parity function, of (log n) delay

D&C for Associative Operations (cont'd)
• Parity detector example: [Figure: 16-bit parity broken up by operands into two 8-bit parities (o/ps w(1,1), w(1,0)), stitched up by a 2-bit parity/xor producing w(0,0) = f; delay = (# of levels in the tree) * td = log2(n) * td.]

D&C Approach for Non-Associative Opers: n-bit > Comparator
• O/P = 1 if A > B, else 0
• Is this associative? Not sure for breakup by bits in the 2 operands.
The issue of associativity mainly applies for n operands, not to the n bits of 2 operands.
• For a non-associative func, determine its properties that allow determining a break-up & a correct stitch-up function
• Useful property: at any level, the comp. of the MS (most significant) half determines the o/p if its result is > or <; else the comp. of the LS 1/2 determines the o/p
• Can thus break up the problem at any level into MS 1/2 and LS 1/2 comparisons &, based on their results, determine which o/p to choose for the higher-level (parent) result
• No sub-prob. dependency

[Figure: breakup by size/bits. A = Comp A[7..0], B[7..0] breaks into A1 = Comp A[7..4], B[7..4] and A2 = Comp A[3..0], B[3..0]; the stitch-up of the solns to A1 and A2 ("if A1's result is > or <, take A1's result, else take A2's result") forms the complete soln to A. A1 in turn breaks into A1,1 = Comp A[7..6], B[7..6] and A1,2 = Comp A[5..4], B[5..4] (with the analogous stitch-up), and A1,1 into A1,1,1 = Comp A[7], B[7] and A1,1,2 = Comp A[6], B[6], which are small enough to be designed using a TT.]

D&C Approach for Non-Associative Opers: n-bit > Comparator (cont'd)
[Figure: the same D&C tree as above.]
The leaf TT may be derived directly, or by first thinking of and expressing its computation in a high-level programming language and then converting it to a TT.
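The breakup and stitch-up just described can be prototyped in software before committing to gates. Below is a minimal Python sketch (my own function name and bit-string encoding, not from the notes): each node returns a pair (f1, f2), where f1 = 1 iff this slice of A is greater than the same slice of B, and f2 = 1 iff the two slices are equal (telling the parent to select the LS half's result):

```python
def gt(A, B):
    """Recursive > comparator on equal-length bit strings (MSB first).
    Returns (f1, f2): f1 = 1 iff A > B on this slice, f2 = 1 iff A == B."""
    if len(A) == 1:                      # leaf: 1-bit comparator TT
        a, b = int(A[0]), int(B[0])
        return (1 if a > b else 0, 1 if a == b else 0)
    mid = len(A) // 2
    ms = gt(A[:mid], B[:mid])            # MS-half subproblem
    ls = gt(A[mid:], B[mid:])            # LS-half subproblem
    return ls if ms[1] else ms           # stitch-up: 2:1 mux selected by MS f2

assert gt("10111001", "00100111") == (1, 0)   # 185 > 39, so F = f1 = 1
assert gt("01100110", "01100110") == (0, 1)   # equal halves propagate f2 = 1
assert gt("00000001", "10000000")[0] == 0     # 1 < 128, so F = 0
```

The final F output is the f1 bit of the root; note F = 0 for equal operands, matching the A <= B case of the spec.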
The leaf (1-bit) comparator is small enough to be designed using a TT. Its computation:

If A[i] = B[i] then { f1(i) = 0; f2(i) = 1 } /* f2(i) = 1 means the f1( ), f2( ) o/ps of the LS 1/2 of this subtree should be selected by the parent's stitch logic as its o/ps */
else if A[i] < B[i] then { f1(i) = 0; /* indicates < */ f2(i) = 0 } /* indicates f1(i), f2(i) o/ps should be selected by the stitch logic as its o/ps */
else if A[i] > B[i] then { f1(i) = 1; /* indicates > */ f2(i) = 0 }

The corresponding TT (2-bit 2-o/p comparator):

A[i] B[i] | f1(i) f2(i)
 0    0   |  0     1
 0    1   |  0     0
 1    0   |  1     0
 1    1   |  0     1

Comparator Circuit Design Using D&C (contd.)
• Once the D&C tree is formulated, it is easy to get the low-level & stitch-up designs
• Stitch-up logic details for subprobs i (MS 1/2) & i-1 (LS 1/2):
If f2(i) = 0 then { my_op1 = f1(i); my_op2 = f2(i) } /* select MS 1/2 comp. o/ps */
else { my_op1 = f1(i-1); my_op2 = f2(i-1) } /* select LS 1/2 comp. o/ps */

Compact TT for the stitch-up logic (X = don't care):

f1(i) f2(i) f1(i-1) f2(i-1) | my_op1   my_op2
  X     0     X       X     | f1(i)    f2(i)
  X     1     X       X     | f1(i-1)  f2(i-1)

• The stitch-up logic is thus a 2-bit 2:1 mux: i/p I0 = f1(i), f2(i); i/p I1 = f1(i-1), f2(i-1); select = f2(i); o/p my_op = my_op1, my_op2. (Either a direct design or the compact TT leads to this.)

Comparator Circuit Design Using D&C: Final Design
• H/W_cost(8-bit comp.) = 7*(H/W_cost(2:1 Muxes)) + 8*(H/W_cost(2-bit comp.))
• H/W_cost(n-bit comp.)
= (n-1)*(H/W_cost(2:1 Muxes)) + n*(H/W_cost(2-bit comp.))

[Figure: the final 8-bit comparator. Eight 1-bit comparators at the bottom (i/ps A[7],B[7] ... A[0],B[0]) produce the 2-bit pairs f(7) ... f(0); log n levels of 2-bit 2:1 muxes above them produce my(3) ... my(0), then my(5), my(4), and finally F = my(6)(1); each mux is selected by the f2 bit of its MS-half i/p. All paths in this ckt are critical.]

• Delay(8-bit comp.) = 3*(delay of 2:1 Mux) + delay of 2-bit comp.
• Note parallelism at work: multiple logic blocks are processing simultaneously
• Delay(n-bit comp.) = (log n)*(delay of 2:1 Mux) + delay of 2-bit comp.

D&C: Mux Design
[Figure: (a) Top-down design (D&C) of a 2^n:1 mux, with a breakup by operands (data) and a simultaneous breakup by bits (select). There are two sets of operands: the data operands (2^n of them) and the control/select operand (n bits). The data i/ps are partitioned into two groups of 2^(n-1): within each group, all combinations of the lower n-1 select bits occur while the msb is at a constant value (0 for one group, 1 for the other), and the msb value differs between the 2 groups. Each group feeds a 2^(n-1):1 mux selected by S(n-2)..S0; the stitch-up is a 2:1 mux selected by S(n-1).]

[Figure: opening up the 8:1 mux's hierarchical design in a top-down view: four 2:1 muxes selected by S0, feeding two 4:1 muxes (pairs of 2:1 muxes selected by S1), stitched up by a final 2:1 mux selected by S2 producing Z; e.g., i/p I6 is selected when S0 = 0, S1 = 1, S2 = 1.]
Top-Down vs Bottom-Up: Mux Design
[Figure: (b) Bottom-up ("divide-and-accumulate") design of a 2^n:1 mux: a first level of 2^(n-1) 2:1 muxes selected by S0, then successive levels selected by S1, ..., S(n-1). In the 8:1 mux example, the i/p pair at each first-level 2:1 mux must differ in their lsb (S0) values while all other select-bit values are the same (e.g., I0 selected when S0 = 0, I1 when S0 = 1); similarly for the i/p pairs at every level.]
• Generally better to try top-down (D&C) first.

Multiplier D&C
• Multiplication D&C idea: A x B = (2^(n/2)*Ah + Al) x (2^(n/2)*Bh + Bl), where Ah is the higher n/2 bits of A, and Al the lower n/2 bits
  = 2^n*Ah*Bh + 2^(n/2)*Ah*Bl + 2^(n/2)*Al*Bh + Al*Bl = 2^n*PH + 2^(n/2)*PM1 + 2^(n/2)*PM2 + PL
• Breakup by bits (operand size): four (n/2)-bit mults W = Ah X Bh, X = Ah X Bl, Y = Al X Bh, Z = Al X Bl; stitch-up: align and add (= 2^n*W + 2^(n/2)*X + 2^(n/2)*Y + Z)
• Example: 10111001 (= 185) X 00100111 (= 39) = 0001110000101111 (= 7215)
  D&C breakup: (10111001) X (00100111) = (2^4(1011) + 1001) X (2^4(0010) + 0111)
  = 2^8(1011 X 0010) + 2^4(1011 X 0111 + 1001 X 0010) + 1001 X 0111
  = 2^8(00010110) + 2^4(01001101 + 00010010) + 00111111

Stitch-Up Design 1 (inefficient)
• Add the four aligned 2n-bit partial products (b = padding bit, 0):

    bbbbbbbb00111111 = PL
  + bbbb01001101bbbb = PM1
  + bbbb00010010bbbb = PM2
  + 00010110bbbbbbbb = PH
  -----------------------
    0001110000101111 = 7215

• Cost = 3 2n-bit adders = 6n FAs (full adders), e.g., RCAs (ripple-carry adders)
• Critical-path delay (using RCAs):
  (a) too high-level an analysis: 2*((2n)-bit adder delay) = 4n*(FA delay)
  (b) more exact, considering the overall critical path: adding 3 n-bit numbers X, Y, Z using two RCAs takes only (n+1) FA delay units (the critical paths, 3 of n, go through n+1 FAs), i.e., 5(n+1) i/p delay units [an FA's carry o/p c(i+1) = ai*bi + ai*ci + bi*ci has a 5-i/p delay]; so the four aligned 2n-bit adds above take 2n+1 FA delay units.
• What is the delay of the entire n-bit multiplier using such a stitch-up?

Multiplier D&C (cont'd): Stitch-Up Design 2 (efficient)
• Use (n/2)-bit adders on the n/2-bit slices of PL, PM1, PM2, PH, with the carry-out of each lower-order add propagating as the carry-in (Cin) of the next higher-order add; the low n/2 bits of PL pass through directly.
[Figure: the slice-by-slice adds, with availability times of the intermediate sums and carries: del = n/2, n/2+1, n/2+2, 2[n/2]+1, 2[n/2]+2, and 3[n/2]+1 for the lsb of the MS half.]
• Cost = 5 (n/2)-bit adders = 2.5n FAs for RCAs
• Critical path: delay = 3*((n/2)-bit adder delay) = 1.5n*(FA delay) for RCAs

Multiplier D&C (cont'd)
• The (1.5n + 1) FA-delay figure assumes PL ... PH have already been computed. What is the delay of the entire multiplier?
• Does the stitch-up need to wait for all bits of PL ... PH to be available before it "starts"? No: the stitch-up of a level can start when the lsb of the msb half of the product bits of each of the 4 products PL ... PH is available; for the top level this is at n/4 + 2 after the previous level's such input is available (= when the lsb of the msb half of the i/p n-bit prod. is avail.; see the analysis for the 2n-bit product in the figure).
• Using RCAs, total delay = (n-1) [delay after the lsb of the msb half is available at the top level] + { (n/2 + 2) + (n/4 + 2) + ... + (2 + 2) (stopping at 4-bit mults) + 2 [the boundary-case 2-bit mult delay at bit 3, the lsb of the msb half of a 4-bit product] + 1/3 [the delay of a 2-i/p AND gate in FA-delay units; an FA, using 2-i/p gates, is 3 2-i/p gate delays] }
  = (n-1) + (1/2)[Sum_{i=0..log n} 2^i] + 2 log n - 1.17 [corrective term -[(1/2)(3) - 1/3] for taking the prev. summation down to i = 1, 0]
  = n - 1 + (1/2)[2n - 1] + 2 log n - 1.17 ~ 2(n + log n) FA delays
• We were able to obtain this design, of delay similar to the well-known array multiplier (Theta(2n) FA delays, using carry-save adders), using basic D&C guidelines; it did not require the extensive ingenuity it might have for the designers of the array multiplier.
• We can obtain an even faster multiplier (Theta(log n) delay) using D&C and carry-save adders for the stitch-up; see appendix.

[Figure: recursion tree of SU2(n) (SU2 = stitch-up design 2 for multiplication): SU2(n) over four SU2(n/2)'s, each over four SU2(n/4)'s, etc.]

• What is its cost in terms of # of FAs (using RCAs)? The level below the root (root = 1st level) has 4 (n/2)-bit multiplies generating the PL ... PH of the root, 16 (n/4)-bit multiplies at the next level, and so on down to 2-bit mults at level log n.
• Thus FAs used = 2.5[n + 4(n/2) + 16(n/4) + ...] + 4^(log n - 1)*(2) + 4^(log n)*(1/7)
  [the last two terms are the boundary cases: 2-bit and 1-bit multipliers require 2 and 1/7 FAs each, resp.; see the Qs below]
  = 2.5n(Sum_{i=0..log n - 2} 2^i) + 2(n/2)^2 + (1/7)n^2 = 2.5n(n/2 - 1) + 0.64n^2 = 1.25n^2 - 2.5n + 0.64n^2 ~ 1.89n^2 = Theta(n^2)
• Why do we need 2 FA delay units for a 2-bit mult after the 4 1-bit prods of its 1-bit mults are available?
• Why do we need 2 FA cost units for a 2-bit multiplication (with 4 1-bit products of 1-bit mults available)?
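As a sanity check of the breakup and the align-and-add stitch-up (not of the delay analysis), here is a minimal Python sketch of the recursive multiplier; the function name is mine, and Python's `+` stands in for the adder hardware:

```python
def dc_mult(a, b, n):
    """D&C multiply of two n-bit numbers (n a power of two), following
    A*B = 2^n*Ah*Bh + 2^(n/2)*(Ah*Bl + Al*Bh) + Al*Bl.
    The leaf is a 1-bit multiply, i.e. a single AND gate."""
    if n == 1:
        return a & b
    h = n // 2
    ah, al = a >> h, a & ((1 << h) - 1)   # split A into high/low halves
    bh, bl = b >> h, b & ((1 << h) - 1)
    PH  = dc_mult(ah, bh, h)              # four (n/2)-bit sub-multiplies
    PM1 = dc_mult(ah, bl, h)
    PM2 = dc_mult(al, bh, h)
    PL  = dc_mult(al, bl, h)
    # stitch-up: align and add the four partial products
    return (PH << n) + ((PM1 + PM2) << h) + PL

assert dc_mult(0b10111001, 0b00100111, 8) == 185 * 39 == 7215  # the example
```

Note this sketch performs all 4 sub-multiplies (quadratic cost, as derived above); the Karatsuba-style trick of using only 3 is a different trade-off not covered here.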
• Assuming we use only 2-input gates, why do we add 1/7 FA cost units for each 1-bit multiplier (which is a 2-i/p AND gate)?
• Using carry-save adders (CSvAs) [see appendix], the cost is similar (quadratic in n, i.e., Theta(n^2)).

D&C Example Where a "Straightforward" Breakup Does Not Work
• Problem: n-bit Majority Function (MF): output f = 1 when a majority of the bits is 1, else f = 0
[Figure: straightforward breakup of the root problem A = n-bit MF [MF(n)] into subprob. A1 = MF(LS n/2 bits), o/p f1, and subprob. A2 = MF(MS n/2 bits), o/p f2, with a stitch-up (SU) producing f.]
• Need to ask (general Qs for any problem): is the stitch-up function SU required in the above straightforward breakup of MF(n) into two MFs for the MS and LS n/2 bits (a) computable, and (b) efficient in both hardware and speed?
• Try all 4 combinations of f1, f2 values and check if it is possible for any function w/ i/ps f1, f2 to determine the correct f value:
– f1 = 0, f2 = 0: # of 1s in the minority (<= n/4) in both halves, so totally # of 1s <= n/2, and f = 0
– f1 = 1, f2 = 1: # of 1s in the majority (> n/4) in both halves, so totally # of 1s > n/2, and f = 1
– f1 = 0, f2 = 1: # of 1s <= n/4 in the LS n/2 and > n/4 in the MS n/2, but this does not imply whether the total # of 1s is <= n/2 or > n/2. So no function can determine the correct f value (it would need more info, like the exact count of 1s)
– f1 = 1, f2 = 0: same situation as the f1 = 0, f2 = 1 case.
Thus the stitch-up function is not even computable in the above breakup of MF(n).

D&C Example Where a "Straightforward" Breakup Does Not Work (contd.)
• Try another breakup, this time of MF(n) into functions that are different from MF:
– Subprob. A1: count the # of 1s in the n bits (a ((log n)+1)-bit o/p), solved by its own D&C tree
– Subprob. A2: a > compare of A1's o/p and floor(n/2), solved by its own D&C tree
• We have seen a (log n)-delay (>) comparator for two n-bit #s using D&C
• Can we do 1-counting using D&C? How much time will this take?
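The working breakup above (1-counting followed by a > compare against floor(n/2)) can be sketched as follows in Python; the names are mine, and the recursive sum mirrors a balanced adder tree rather than any particular circuit:

```python
def majority(bits):
    """n-bit majority: f = 1 iff (# of 1s) > floor(n/2).
    Subproblem A1 = 1-counting (a D&C tree of adds; + is associative),
    subproblem A2 = a > comparison against floor(n/2)."""
    def count_ones(b):
        if len(b) == 1:                       # leaf: the bit itself
            return b[0]
        m = len(b) // 2
        return count_ones(b[:m]) + count_ones(b[m:])   # stitch-up: add
    return 1 if count_ones(bits) > len(bits) // 2 else 0

assert majority([1, 1, 0, 1]) == 1      # three 1s out of 4
assert majority([1, 0, 0, 1]) == 0      # a tie is not a majority
```

Because addition is associative, the 1-counting subproblem unfolds into a balanced tree, consistent with the earlier observation that associative operations yield log-depth circuits.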
Dependency Resolution in D&C: (1) The Wait Strategy
• So far we have seen D&C breakups in which there is no data dependency between the two (or more) subproblems of the breakup
• Data dependency leads to increased delays
• We now look at various ways of speeding up designs that have subproblem dependencies in their D&C breakups
[Figure: root problem A with data flow from subprob. A1 to subprob. A2.]
• Strategy 1: Wait for the required o/p of A1 and then perform A2, e.g., as in a ripple-carry adder: A = n-bit addition, A1 = (n/2)-bit addition of the LS n/2 bits, A2 = (n/2)-bit addition of the MS n/2 bits
• No concurrency between A1 and A2: t(A) = t(A1) + t(A2) + t(stitch-up) = 2*t(A1) + t(stitch-up) if A1 and A2 are the same problem of the same size
• Note that w/ no dependency, the delay expression is: t(A) = max{t(A1), t(A2)} + t(stitch-up) = t(A1) + t(stitch-up) if A1 and A2 are the same problem of the same size

Adder Design using D&C
• Example: Ripple-Carry Adder (RCA), an example of the Wait strategy in adder design
– Stitching up: the carry from the LS n/2 bits is input to the carry-in of the MS n/2 bits at each level of the D&C tree
– Leaf subproblem: Full Adder (FA)
[Figure: (a) D&C for the ripple-carry adder: add n-bit #s X, Y breaks into adds of the MS and LS n/2 bits of X, Y, recursively down to FAs.]
• Note: gate delay is proportional to the # of inputs (since there is generally a series connection of transistors in either the pull-up or pull-down network, the Rs of the transistors in series add up in proportion to the # of inputs, and delay ~ RC (C is the capacitive load) is thus prop. to the # of inputs)
• The 5-i/p gate delay stated earlier for an FA is correct if we have 2-3-i/p gates available (why?); otherwise, if only 2-i/p gates are available, the delay will be 6 i/p-delay units (why?).
• Assume each gate i/p contributes 2 ps of delay
• For a 16-bit adder, the delay will be 160 ps
• For a 64-bit adder, the delay will be 640 ps

Adder Design using D&C: Lookahead Wait (not in syllabus)
• Example: Carry-Lookahead Adder (CLA)
– Division: 4 subproblems per level (add the ms n/4 bits, the 3rd n/4 bits, the 2nd n/4 bits, and the ls n/4 bits, each producing local P, G)
– Stitching up: a more complex stitching-up process (generation of global or "super" P, Gs to connect up the subproblems)
– Leaf subproblem: a 4-bit basic CLA with small p, g bits
• (a) A linear connection of the local P, Gs from each unit to determine the global or super P, G for each unit has linear delay, so it is not much better than an RCA
• (b) But the global P for each unit is an associative function, so a tree connection of the local P, Gs, performing a prefix computation, can be done in at most log n time (for the last unit; less time for earlier units)
• More intricate techniques (like the P, G generation in the CLA) for complex stitch-ups for fast designs may need to be devised and are not directly suggested by D&C. But D&C is a good starting point.

Dependency Resolution in D&C: (2) The "Design-for-all-cases-&-select (DAC)" Strategy
[Figure: root problem A; subprob. A1 feeds the select i/p of a 4-to-1 mux choosing among 4 copies of subprob. A2 with hardwired i/ps 00, 01, 10, 11.]
• Strategy 2: DAC: for a k-bit i/p from A1 to A2, design 2^k copies of A2, each with a different hardwired k-bit i/p to replace the one from A1.
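The DAC idea as it appears in a carry-select adder can be sketched in Python (names are mine; integer arithmetic stands in for the two hardwired adder copies and the selecting mux):

```python
def ripple_add(x, y, cin, n):
    """n-bit ripple-carry add; returns (sum, carry-out)."""
    s = x + y + cin
    return s & ((1 << n) - 1), s >> n

def carry_select_add(x, y, n):
    """Carry-select (DAC) adder sketch: the MS half is computed for BOTH
    possible carry-ins in parallel; the LS half's carry-out then drives
    a mux that picks the right copy."""
    h = n // 2
    xl, xh = x & ((1 << h) - 1), x >> h
    yl, yh = y & ((1 << h) - 1), y >> h
    sl, c   = ripple_add(xl, yl, 0, h)      # A1: LS-half add
    s0, c0  = ripple_add(xh, yh, 0, h)      # A2 copy w/ hardwired cin = 0
    s1, c1  = ripple_add(xh, yh, 1, h)      # A2 copy w/ hardwired cin = 1
    sh, cout = (s1, c1) if c else (s0, c0)  # mux selected by A1's carry
    return (sh << h) | sl, cout

assert carry_select_add(185, 39, 8) == (224, 0)
assert carry_select_add(200, 100, 8) == (44, 1)   # 300 mod 256, carry-out 1
```

In hardware the two MS-half adds run concurrently with the LS-half add, so the critical path is one half-width add plus a mux, matching the t(A) = max(t(A1), t(A2)) + t(Mux) expression.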
• Select the correct o/p from all the copies of A2 via a (2^k)-to-1 mux whose select i/p is the k-bit o/p from A1 when it becomes available (e.g., the carry-select adder)
• t(A) = max(t(A1), t(A2)) + t(Mux) + t(stitch-up) = t(A1) + t(Mux) + t(stitch-up) if A1 and A2 are the same problems

(2) The "Design-for-all-cases-&-select (DAC)" Strategy (cont'd)
• Note: the DAC-based replication will apply to the smallest subproblem (subcircuit) of A2 that is directly dependent on A1's o/ps, and not necessarily to all of A2. Similarly for lower-level DACs.
[Figure: a D&C tree w/ a mix of DAC and Wait strategies for dependency resolution between subproblems: DAC at the higher levels, and, generally, the Wait strategy at all lower levels after the 1st Wait level.]
• The DAC strategy has a mux delay involved, and at small subproblems the delay of a subproblem may be smaller than a mux delay, or may not be sufficiently large to warrant the extra replication and mux cost.
• Thus a mix of DAC and Wait strategies, as shown in the above figure, may be faster, w/ DAC used at higher levels and Wait at lower levels.

Example of the DAC Strategy in Adder Design
[Figure: carry-select adder stages, with a simplified mux for Cout.]
• For a 16-bit adder, the delay is (9*4 - 4)*2 = 64 ps (2 ps is the delay for a single i/p); a 60% improvement ((160 - 64)*100/160) over the RCA
• For a 64-bit adder, the delay is (9*8 - 4)*2 = 136 ps; a 79% improvement over the RCA

Dependency Resolution in D&C: (3) Speculative Strategy
[Figure: root problem A; subprob. A1 and a single copy of subprob. A2, whose k-bit i/p comes from a 2-to-1 mux choosing between A1's o/p and an estimate (guess) based on analysis or stats. An FSM controller compares A1's o/p to the guess (when A1 generates a completion signal): if they match, it generates a completion signal after a delay corresponding to the stitch-up delay; else it sets A2's i/p to A1's o/p, redoes A2, and generates the completion signal after the delay of A2 + stitch-up.]
• Strategy 3: Have a single copy of A2, but choose a highly likely value of its k-bit i/p and perform A1, A2 concurrently. If, once the k-bit i/p from A1 is available, the guessed value proves incorrect, redo A2 w/ the correct, now-available value.
• t(A) = p(correct-choice)*max(t(A1), t(A2)) + (1 - p(correct-choice))*(t(A1) + t(A2)) + t(stitch-up), where p(correct-choice) is the probability that our choice of the k-bit i/p for A2 is correct.
• For t(A1) = t(A2), this becomes: t(A) = p(correct-choice)*t(A1) + (1 - p(correct-choice))*2t(A1) + t(stitch-up) = t(A1) + (1 - p(correct-choice))*t(A1) + t(stitch-up)
• Need a completion signal to indicate when the final o/p is available for A; assuming worst-case time (when the choice is incorrect) is meaningless in such designs
• Need an FSM controller for determining if the guess is correct and, if not, for redoing A2 (allowing more time for generating the completion signal).

Dependency Resolution in D&C: (4) The "Independent Pre-Computation" Strategy
[Figure: root problem A with data flow a1 from subprob. A1 to subprob. A2, which is split into A2_indep and A2_dep. For an example of unstructured logic for A2, the critical path after a1 is available is 8 units; after restructuring, 4 units.]
• Strategy 4: Reconfigure the design of A2 so that it does as much processing as possible that is independent of the i/p from A1 (A2_indep). This "independent" computation prepares for the final computation of A2 (A2_dep), which can start once A2_indep and A1 are done.
• t(A) = max(t(A1), t(A2_indep)) + t(A2_dep) + t(stitch-up)
• E.g., let a1 be the i/p from A1 to A2, and let A2 contain the logic a2 = v'x' + uvx + w'xy + wz'a1 + u'xa1. If this were implemented using 2-i/p AND/OR gates, the delay would be 8 delay units (1 unit = delay for 1 i/p) after a1 is available. If the logic is restructured as a2 = (v'x' + uvx + w'xy) + (wz' + u'x)a1, and the logic in the 2 brackets is computed before a1 is available (it constitutes A2_indep), then the delay is only 4 delay units after a1 is available.
• Such a strategy requires factoring the external i/p a1 out of the logic for a2, grouping & implementing all the non-a1 logic, and then adding logic to "connect up" the non-a1 logic to a1 as the last stage.

D&C Summary
• For complex digital design, we need to think of the "computation" underlying the design in a structured manner: are there properties of this computation that can be exploited for a faster, less expensive, modular design; is it amenable to the D&C approach? Think of:
– Breakup into >= 2 subprobs via breakup of (# of operands) or (operand sizes [bits])
– Stitch-up (is it computable?)
– Leaf functions
– Dependencies between sub-problems and how to resolve them
• The design is then developed in a structured manner & the corresponding circuit may be synthesized by hand or described compactly using an HDL (e.g., structural VHDL)
• For an operation/func x on n operands (an-1 x an-2 x ... x a0), if x is associative, the D&C approach gives an "easy" stitch-up function, which is x on 2 operands (the o/ps of applying x on each half). As a result, a tree-structured circuit with (log n) delay, instead of a linearly-connected circuit with (n) delay, can be synthesized.
• If x is non-associative, more ingenuity and a determination of the properties of x are needed to determine the breakup and the stitch-up function.
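The restructuring in the independent pre-computation example above relies only on the distributive law, so the flat and factored forms of a2 must agree on all 2^7 input combinations. A quick exhaustive check in Python (the function names are mine):

```python
from itertools import product

def a2_flat(u, v, w, x, y, z, a1):
    # original form: a2 = v'x' + uvx + w'xy + wz'a1 + u'xa1
    return ((not v and not x) or (u and v and x) or (not w and x and y)
            or (w and not z and a1) or (not u and x and a1))

def a2_factored(u, v, w, x, y, z, a1):
    # restructured: a2 = (v'x' + uvx + w'xy) + (wz' + u'x)a1
    indep = (not v and not x) or (u and v and x) or (not w and x and y)  # A2_indep
    dep   = (w and not z) or (not u and x)                               # a1's cofactor
    return indep or (dep and a1)                                         # A2_dep

# exhaustive equivalence check over all 2^7 input combinations
for bits in product([0, 1], repeat=7):
    assert a2_flat(*bits) == a2_factored(*bits)
```

The `indep` and `dep` terms can both be evaluated before a1 arrives; only the final AND/OR stage waits on a1, which is exactly the A2_indep/A2_dep split.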
The resulting design may or may not be tree-structured.
• If there is dependency between the 2 subproblems, then we saw strategies for addressing these dependencies:
– Wait (slowest, least hardware cost)
– Design-for-all-cases (high speed, high hardware cost)
– Speculative (medium speed, medium hardware cost)
– Independent pre-computation (medium-slow speed, low hardware cost)

Strategy 2: A general view of DAC computations (w/ or w/o D&C)
• If there is a data dependency between two or more portions of a computation (which may be obtained w/ or w/o using D&C), don't wait for the "previous" computation to finish before starting the next one
• Assume all possible input values for the next computation/stage B (e.g., if it has 2 inputs from the prev. stage, there will be 4 possible input-value combinations) and perform it using a copy of the design for each possible input value. All the different o/ps of the diff. copies of B are mux'ed using prev. stage A's o/p
• E.g. design: the carry-select adder (each stage performs two additions, one for a carry-in of 0 and another for a carry-in of 1 from the previous stage)
[Figure: (a) original design A then B: time = T(A) + T(B). (b) Speculative (DAC) computation: copies B(0,0), B(0,1), B(1,0), B(1,1) feed a 4:1 mux selected by A's o/ps x, y; time = max(T(A), T(B)) + T(Mux). Works well when T(A) approx = T(B) and T(A) >> T(Mux).]

Strategy 3: Get the Best of Both Worlds (Average and Worst-Case Delays)!
• Approximate analysis of unary (repeated-subtraction) division: the avg. dividend D value = 2^(n-1). For divisor V values in the "lower half range" [1, 2^(n-1)], the average quotient Q value is the harmonic series (1 + 1/2 + 1/3 + ... + 1/2^(n-1)); this comes from a Q value of x for V in the range 2^(n-1)/x to (2^(n-1)/(x-1)) - 1, i.e., for approx. 2^(n-1)/x^2 #s, which have a probability of 1/x^2, giving a probabilistic value of x(1/x^2) = 1/x in the average-Q calculation. The above summation is ~ ln(2^(n-1)) ~ (n-1)/1.4 (integration of 1/k from k = 1 to 2^(n-1)). Q for divisors in the upper half range [2^(n-1) + 1, 2^n] is 0, so the overall avg.
quotient = (n-1)/2.8, and the avg. # of subtractions needed = 1 + (n-1)/2.8 = Θ(n/2.8).
[Schematic: input registers feed two division circuits in parallel: a Unary Division Ckt (good ave case: Θ(n/2.8) subs; bad worst case: Θ(2^n) subs) with a done1 signal, and a Non-Restoring Div. Ckt (bad ave case: Θ(n) subs; good worst case: Θ(n) subs) with a done2 signal. An external FSM uses done1/done2 to select the first available output via a mux into the output register.]
• Use 2 circuits with different worst-case and average-case behaviors
• Use the first available output
• Get the best of both (ave-case, worst-case) worlds
• In the above schematic, we get the good ave-case performance of unary division (assuming uniformly distributed inputs) w/o the disadvantage of its bad worst-case performance: ave. case = Θ(n/2.8) subs, worst case = Θ(n) subs.

Strategy 4a: Pipeline It! (Synchronous Pipeline)
[Figure: conversion of the original ckt/datapath into a simple level-partitioned pipeline of stages 1..k separated by clocked registers. A level partition may not always be possible, but other pipelineable partitions may be.]
• Throughput is defined as # of outputs / sec
• Non-pipelined throughput = 1/D, where D = delay of original ckt's datapath
• Pipeline throughput = 1/(max stage delay + register delay)
• Special case: If the original ckt's datapath is divided into n stages, each of equal delay, and dr is the delay of a register, then pipeline throughput = 1/((D/n) + dr)
• If dr is negligible compared to D/n, then pipeline throughput = n/D, n times that of the original ckt
• FSM controller may be needed for non-uniform stage delays; not needed otherwise

Strategy 4a: (Synchronous) Pipeline It! (contd.)
• Legend: Ij(t k) = processed version of i/p Ij @ time k. We assume the delay of each basic module below is 1 unit.
• Comparator o/p produced every 1 unit of time, instead of every (log n + 1) units of time, where 1 time unit here = delay of a mux or a 1-bit comparator (both will have the same or similar delay). We can reduce reg.
cost by inserting registers at every 2 levels instead; throughput then decreases to 1 o/p per every 2 units of time.
[Figure: the pipelined 8-bit comparator tree: 8 1-bit comparators at the leaves (i/ps A[7..0], B[7..0]) feed log n levels of 2:1 muxes, with a pipeline register after every level; successive i/p sets I1(t0), I2(t1), … advance one level per time unit along the time axis, and the final o/p F emerges at the root.]

Strategy 4a: (Synchronous) Pipeline It! (contd.)
[Figure: pipelined ripple-carry adder, with input registers and intermediate & output registers.]
• Problem: the I/P and O/P data direction is not the same as the computation direction; they are perpendicular! In other words, each stage has o/ps, rather than o/ps appearing only at the end of the pipeline (the S1, S0 o/ps, then S3, S2, then S5, S4, then S7, S6, each for i/ps recvd 4 cc back).
• Adder o/p produced every 2 units of FA delay instead of every n units of FA delay in an n-bit RCA.

Strategy 4b: Pipeline It!—Wave Pipelining
• Wave pipelining is essentially pipelining of combinational circuits w/o the use of registers and clocking, which are expensive and consume a significant amount of power. The min.
safe input rate (MSIR: the rate at which i/ps are given to the wave-pipelined circuit) is determined by the difference between the max delays at module or gate o/ps and the min delays at module or gate i/ps, and is chosen so that a later i/p's processing, however fast it might be, will never "overwrite" the data at the i/p of any gate or module that is still processing the previous i/p.
[Figure: modules m1 -> m2 in series, with tmin(i/p:m2) marked at m2's i/ps and tmax(o/p:m2) at m2's o/ps.]
• Consider the two modules m1 and m2 above (a module could be a subcircuit like a full adder, something more complex, or as simple as a single gate). Let tmin(i/p:mj) be the min delay (i.e., min arrival time) over all i/ps to mj, and let tmax(o/p:mj) be the max delay at the o/p(s) of mj.
• Certainly a safe i/p rate (SIR) period for wave pipelining the above configuration is tmax(o/p:m2). But can we do it safely at a higher rate (lower period)?
• The min safe i/p rate (MSIR) period for m2 corresponds to a situation in which any new i/p appears at m2 only after the processing of its current i/ps is over, so that while it is still processing, its current i/ps are held stable. Thus a 2nd i/p should appear at m2 no earlier than tmax(o/p:m2); hence the 2nd i/p to the circuit itself, i.e., at m1, should not appear before tmax(o/p:m2) - tmin(i/p:m2).
• Similarly, for safe operation of m1 on the 1st i/p, the 2nd i/p should not appear before tmax(o/p:m1) - tmin(i/p:m1) = tmax(o/p:m1), as tmin(i/p:m1) = 0.
• Thus for safe operation of both m1 and m2 on the 1st i/ps, the 2nd i/p should not appear before max(tmax(o/p:m1), tmax(o/p:m2) - tmin(i/p:m2)), the min. safe i/p rate period for the 1st i/p.

Strategy 4b: Pipeline It!—Wave Pipelining (contd.)
[Fig. 1: modules m1 -> m2 with tmin(i/p:m2) and tmax(o/p:m2) marked. Fig. 2: a chain of modules m1, …, mj, …: the fastest i'th i/p to mj arrives at (i-1)tsafe + tmin(i/p:mj) >= (i-2)tsafe + tmax(o/p:mj) - tmin(i/p:mj) + tmin(i/p:mj) = (i-2)tsafe + tmax(o/p:mj), the arrival time of the slowest (i-1)'th o/p of mj; thus safe.]
The Q.
now is whether tsafe(1) (the min safe i/p rate period for the 1st i/p) = max(tmax(o/p:m1), tmax(o/p:m2) - tmin(i/p:m2)) will also be a safe rate period for the ith i/p, for i > 1.
• Consider the 2nd o/p from m2, which appears at time tsafe(1) + tmax(o/p:m2), and the 3rd i/p to m2, which appears at 2tsafe(1) + tmin(i/p:m2) >= tsafe(1) + tmax(o/p:m2) (since tsafe(1) >= tmax(o/p:m2) - tmin(i/p:m2)), i.e., no earlier than when the 2nd o/p of m2 appears; it is thus safe.
• A similar safety analysis will show that tsafe(1) is a safe i/p rate period for any ith i/p for both m2 and m1, for any i. Since it is also the min. such period (at the module level) for the 1st i/p (and in fact for any ith i/p), as we established earlier, tsafe(1) is the min. safe i/p rate period for any ith i/p for both m2 and m1. We term this min. i/p rate period tsafe (= tsafe(1)).
• Thus if there are k modules m1, …, mk, a simple extension of the above analysis gives us that this min. i/p rate period tsafe = max over i = 1 to k of {tmax(o/p:mi) - tmin(i/p:mi)}; see Fig. 2 above.
• Interesting Qs:
1. Is tsafe a safe period not just at the module level but for every gate gj in the circuit (e.g., will tsafe be a safe period for every gate gj in m2), which it needs to be in order for this period to work?
2. Can we get a better (smaller) min. safe i/p rate period by carrying out the above analysis at the level of individual gates rather than at the level of modules?

Strategy 4b: Pipeline It!—Wave Pipelining: Example 1
• Let the max delay of a 2-i/p gate be 2 ps and the min delay be 1 ps. What is tsafe for this ckt for the two modules shown?
• tsafe = max(4 ps, 10 - 1 = 9 ps) = 9 ps. Thus the MSIR period = 9 ps, a little better than the 10 ps corresponding to the max o/p delay.
So this is not a circuit that can be effectively wave pipelined. Generally, a ckt that has more balanced max o/p and min i/p delays for each module and gate is one that can be effectively wave pipelined, i.e., one whose MSIR period is much lower than the max o/p delay of the circuit.
[Figure: the 2-i/p AND/OR implementation of a2 = v'x' + uvx + w'xy + wz'a1 + u'xa1, partitioned into module m1 with tmax(o/p:m1) = 4 ps and module m2 with tmin(i/p:m2) = 1 ps and tmax(o/p:m2) = 10 ps.]

Strategy 4b: Pipeline It!—Wave Pipelining: Example 2
[Figure: the 8-bit comparator tree (1-bit comparators feeding log n levels of 2:1 muxes) partitioned into two modules: m1 (the bottom levels, tmax(o/p:m1) = 6 ps) and m2 (the top levels, tmin(i/p:m2) = 4 ps, tmax(o/p:m2) = 12 ps).]
• Let the max delay of a basic unit (1-bit comparator, 2:1 mux) be 3 ps and the min delay be 2 ps. What is tsafe for this ckt for the two modules shown?
• tsafe = max(6 ps, 12 - 4 = 8 ps) = 8 ps. Thus the MSIR period = 8 ps, 33% lower than the 12 ps corresponding to the max o/p delay.
• So this is a circuit that can be reasonably effectively wave pipelined. This is due to the balanced nature of the circuit, where all i/p-o/p paths are of the same length (the diff. betw. max and min delays comes from the max and min delays of the components or gates themselves).
• What if we divide the circuit into 4 modules, each corresponding to a level of the circuit, and do the analysis for that? See next slide.
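The module-level formula tsafe = max over i of {tmax(o/p:mi) - tmin(i/p:mi)} is easy to mechanize. A small sketch, using the module delay figures from Examples 1 and 2 above:

```python
def msir_period(modules):
    """Min safe input-rate (MSIR) period for wave pipelining a chain of modules.
    Each module is a (t_max_output, t_min_input) delay pair in ps:
    t_safe = max over modules of (t_max(o/p:m_i) - t_min(i/p:m_i))."""
    return max(t_max_out - t_min_in for t_max_out, t_min_in in modules)

# Example 1: m1 = (4 ps, 0 ps), m2 = (10 ps, 1 ps) -> MSIR period 9 ps
assert msir_period([(4, 0), (10, 1)]) == 9
# Example 2: m1 = (6 ps, 0 ps), m2 = (12 ps, 4 ps) -> MSIR period 8 ps
assert msir_period([(6, 0), (12, 4)]) == 8
```

The same helper applied to a finer partition of the same circuit (one module per level) shows how granularity lowers the MSIR period further.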
Strategy 4b: Pipeline It!—Wave Pipelining: Example 3
[Figure: the same comparator tree partitioned into 4 modules, one per level: m1 (the 1-bit comparators; tmin(i/p) = 0 ps, tmax(o/p) = 3 ps), m2 (tmin(i/p) = 2 ps, tmax(o/p) = 6 ps), m3 (tmin(i/p) = 4 ps, tmax(o/p) = 9 ps), m4 (tmin(i/p) = 6 ps, tmax(o/p) = 12 ps).]
• Let the max delay of a basic unit (1-bit comparator, 2:1 mux) be 3 ps and the min delay be 2 ps. What is tsafe for this ckt for the 4 modules shown?
• tsafe = max(3 - 0, 6 - 2, 9 - 4, 12 - 6) ps = 6 ps. Thus the MSIR period = 6 ps, 50% lower than the 12 ps corresponding to the max o/p delay.
• So we get a better (lower) MSIR period if we analyze the ckt at a finer granularity.

Appendix: Θ(log n) delay multiplier
Multiplier D&C (cont'd): Carry-Save Addition
[Fig.: CSvA trees: adding 4 #s using CSvA's; adding 6 #s using CSvA's takes 3 delay units; adding 7 #s (7 lsb bits need to be added) takes 4 delay units. Stitch-up #3: adding 6 numbers in parallel using CSvA's takes 3 units of time and 4 CSvA's.]
Multiplier D&C (cont'd): Carry-Save Add. Based Stitch-Up
[Fig.: the sum/carry pairs (S(PL), C(PL)), (S(PM1), C(PM1)), (S(PM2), C(PM2)), (S(PH), C(PH)) of the four sub-products, i.e., separate (and thus parallel) carry-save adds for each of the 4 (n/2)-bit groups shown at the top level of multiplication, are combined by a tree of CSvA's into one final (C & S) pair.]
• Using CSvA's (carry-save adders): each sub-product, e.g., PL, is formed of 2 numbers (sum bits and carry bits), so there are 8 n-bit #s to be CSvA'ed in the final stitch-up; this takes a delay of approx. 5 units if done sequentially but only 4 units if done in parallel.
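The level counts quoted above (3 units for 6 numbers, 4 units in parallel for 8) can be checked with a small sketch that greedily schedules 3:2 CSvA's level by level. It counts levels only and ignores bit widths:

```python
def csa_tree_levels(k):
    """Levels of 3:2 carry-save adders needed to reduce k numbers down to 2,
    scheduling as many CSvA's in parallel as possible at each level."""
    levels = 0
    while k > 2:
        groups = k // 3      # CSvA's usable in parallel at this level
        k = k - groups       # each 3:2 CSvA turns 3 numbers into 2
        levels += 1
    return levels

assert csa_tree_levels(6) == 3   # "adding 6 #s: 3 delay units"
assert csa_tree_levels(7) == 4   # "adding 7 #s: 4 delay units"
assert csa_tree_levels(8) == 4   # the 8 sum/carry words of the stitch-up
```

A sequential chain instead reduces the count by only 1 per CSvA, which is why the same 8-number reduction costs more delay units when done in sequence.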
We then get 2 final nos. (a carries # and a sums #) that are added by a carry-propagate adder such as a CLA, which takes Θ(log n) time. The overall multiplier delay is Θ(4 log n) [4 time units at each of the (log n - 2) levels (we need at least 2-bit inputs for the above structure to be valid) + at most 2 time units for the bottom two levels (why?)] + Θ(log n) = Θ(log n), similar to a Wallace-tree multiplier. We were able to obtain this fast design using D&C (and did not need the extensive ingenuity that W-T multiplier designers must have needed)!
• Hardware cost (# of FAs) for the entire multiplier, ignoring the final carry-propagate adder? Exercise.
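To make the carry-save mechanics of the stitch-up concrete, here is a minimal sketch. The 8 values are arbitrary stand-ins for the 8 sum/carry words, and Python's built-in addition stands in for the final Θ(log n) carry-propagate adder:

```python
def csa(a, b, c):
    """One 3:2 carry-save addition over integers (a bank of full adders):
    returns (sum_word, carry_word) with sum_word + carry_word == a + b + c."""
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1

# Reduce 8 numbers (stand-ins for the 8 sum/carry words of the stitch-up)
# down to 2 without any carry propagation...
nums = [13, 7, 255, 96, 1, 58, 200, 77]
while len(nums) > 2:
    s, c = csa(*nums[:3])
    nums = nums[3:] + [s, c]

# ...then a single carry-propagate add finishes the job.
assert len(nums) == 2
assert sum(nums) == 13 + 7 + 255 + 96 + 1 + 58 + 200 + 77
```

Each csa call costs one full-adder delay regardless of operand width, which is what makes the Θ(log n)-level reduction tree (and hence the overall Θ(log n) multiplier delay) possible.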