ECE 465
High Level Design Strategies
Lecture Notes # 9
Shantanu Dutt
Electrical & Computer Engineering
University of Illinois at Chicago
Outline
• Circuit Design Problem
• Solution Approaches:
– Truth Table (TT) vs. Computational/Algorithmic –
Yes, hardware, just like software, can implement any
algorithm (after all, software runs on hardware)!
– Flat vs. Divide-&-Conquer
– Divide-&-Conquer:
• Associative operations/functions
• General operations/functions
– Other Design Strategies for fast circuits:
• Speculative computation
• Best of both worlds (best average and best worst-case)
• Pipelining
• Summary
Circuit Design Problem
• Design an 8-bit comparator that compares two 8-bit #s available in
two registers A[7..0] and B[7..0], and that o/ps F = 1 if A > B and F =
0 if A <= B.
• Approach 1: The TT approach -- Write down a 16-input TT, derive a logic
expression from it, minimize it, obtain a gate-based realization, etc.!
A        B        F
00000000 00000000 0
00000000 00000001 0
   …
00000001 00000000 1
   …
11111111 11111111 0
– Too cumbersome and time-consuming
– Fraught with possibility of human error
– Difficult to formally prove correctness (i.e., proof w/o exhaustive testing)
– Will generally have high hardware cost (including wiring, which can be
unstructured and messy) and delay
Circuit Design Problem (contd)
• Approach 2: Think computationally/algorithmically about
what the ckt is supposed to compute:
• Approach 2(a): Flat computational/programming
approach:
– Note: A TT can be expressed as a sequence of “if-then-else’s”
– If A = 00000000 and B = 00000000 then F = 0
else if A = 00000000 and B = 00000001 then F=0
……….
else if A = 00000001 and B = 00000000 then F=1
……….
– Essentially a re-hashing of the TT – same problems as the TT
approach
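The flat approach can be contrasted with the algorithmic view in a minimal software sketch (the function names here are ours, purely for illustration):

```python
# Hypothetical sketch contrasting the flat TT-style "if-then-else" approach
# with the algorithmic view of the same 8-bit comparator.

def compare_flat(A: int, B: int) -> int:
    # A literal re-hash of the truth table: one branch per input combination.
    # For two 8-bit operands this needs 2^16 branches; only a few are shown.
    if A == 0b00000000 and B == 0b00000000: return 0
    if A == 0b00000000 and B == 0b00000001: return 0
    if A == 0b00000001 and B == 0b00000000: return 1
    # ... (65533 more cases) ...
    return 1 if A > B else 0      # stand-in for the elided cases

def compare_algorithmic(A: int, B: int) -> int:
    # The computational view: F = 1 iff A > B.
    return 1 if A > B else 0

assert compare_flat(185, 39) == compare_algorithmic(185, 39) == 1
```

The flat version is just the TT transliterated; the algorithmic version is the starting point for the structured D&C designs that follow.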
Circuit Design Problem: Strategy 1: Divide-&-Conquer
• Approach 2(b): Structured algorithmic approach:
– Be more innovative, think of the structure/properties of the problem that can be used
to solve it in a hierarchical or divide-&-conquer (D&C) manner:
[Figure: D&C breakup of root problem A into subproblems A1 and A2, and recursively into A1,1, A1,2, A2,1, A2,2; the breakup is done recursively until the subprob size is s.t. TT-based design is doable. The stitch-up of the solns to A1 and A2 forms the complete soln to A. Legend: D&C breakup arrows; data/signal flow to solve a higher-level problem; possible data flow betw. sub-problems (data dependency?).]
– D&C approach: See if the problem can be:
• “broken up” into 2 or more smaller subproblems: two types of breaks possible
– by # of operands: partition the set of n operands into 2 or more subsets of operands
– by operand size: breaking a constant # of n-bit operands into smaller-size operands
(this mainly applies when the # of operands is a constant, e.g., add. of 2 #s)
• whose solns can be “stitched-up” (stitch-up function) to give a soln. to the parent prob
• also, consider if there is dependency between the sub-probs (results of some required
to solve the other(s))
– Do this recursively for each large subprob until subprobs are small enough (the leaf problem)
for TT solutions
– If the subproblems are of a similar kind (but of smaller size) to the root prob. then the
breakup and stitching will also be similar, but if not, they have to be broken up differently
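The recipe above can be sketched as a generic recursion; `leaf_fn`, `stitch_fn` and the size threshold below are illustrative assumptions, not part of the original notes:

```python
# A generic D&C skeleton: break the operand set, solve leaves with a small
# "TT-like" function, and stitch the sub-solutions back up.
from functools import reduce

def divide_and_conquer(operands, leaf_fn, stitch_fn, leaf_size=2):
    if len(operands) <= leaf_size:
        return leaf_fn(operands)      # small enough for a TT-based design
    mid = len(operands) // 2
    hi = divide_and_conquer(operands[:mid], leaf_fn, stitch_fn, leaf_size)
    lo = divide_and_conquer(operands[mid:], leaf_fn, stitch_fn, leaf_size)
    return stitch_fn(hi, lo)          # stitch-up of the two sub-solutions

# Example: n-input XOR (parity) -- an associative operation, so the stitch-up
# is just the 2-input version of the same function.
bits = [1, 0, 1, 1, 0, 0, 1, 0]
par = divide_and_conquer(bits,
                         lambda xs: reduce(lambda a, b: a ^ b, xs),
                         lambda a, b: a ^ b)
assert par == sum(bits) % 2
```

Unfolding this recursion is exactly the tree of stitch-up functions with leaf functions at the bottom described on the next slide.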
Circuit Design Problem: Strategy 1: Divide-&-Conquer
• Especially for D&C breakups in which: (a) the subproblems are the same problem type as the root problem, and (b) there is no data dependency between subproblems, the final circuit will be a “tree” of stitch-up functions (of either the same size or different sizes at different levels, depending on the problem being solved) with leaf functions at the bottom of the tree, as shown in the figure below for a 2-way breakup of each problem/sub-problem.

[Figure: a tree of stitch-up functions, with leaf functions at the bottom.]

• Solving a problem using D&C generally yields a fast, low-cost and streamlined design (wiring required is structured and not all jumbled up and messy).
Shift Gears: Design of a Parity Detection Circuit—An n-input XOR
[Figure (a): A linearly-connected circuit of 2-i/p XORs: inputs x(0), x(1), x(2), x(3), …, x(15) feed a chain of XOR gates producing f.]
• No concurrency in design (a)---the actual problem has available concurrency, though, and it is not
exploited well in the above “linear” design
• Complete sequentialization leads to a delay that is linear in the # of bits n (delay = (n-1)*td,
where td = delay of 1 gate)
• All the available concurrency is exploited in design (b)---a parity tree (see next slide).
• Question: When can we have a circuit for an operation/function on multiple operands built of “gates”
performing the same operation for fewer (generally a small number betw. 2-5) operands?
• Answer:
(1) It should be possible to break down the n-operand function into multiple operations w/ fewer
operands.
(2) When the operation is associative. An oper. “x” is said to be associative if:
a x b x c = (a x b) x c = a x (b x c) OR stated as a function f(a, b, c) = f(f(a,b), c) = f(a, f(b,c))
• This implies that, for example, if we have 4 operations a x b x c x d, we can either perform this as:
– a x (b x (c x d)) [getting a linear delay of 3 units or in general n-1 units for n operands]
i.e., in terms of function notation: f(a, f(b, f(c,d)))
– or as (a x b) x (c x d) [getting a logarithmic (base 2) delay of 2 units and exploiting the available
concurrency due to the fact that “x” is associative]
i.e., in terms of function notation: f(f(a,b), f(c,d))
• Is XOR associative? Yes.
• The parenthesisation corresp. to the above ckt is:
– (…(((x(0) xor x(1)) xor x(2)) xor x(3)) xor … xor x(15))
• All these Qs can be answered by the D&C approach
Shift Gears: Design of a Parity Detection Circuit—A Series of XORs
• if we have 4 operations a x b x c x d, we can
either perform this as a x (b x (c x d)) [getting a
linear delay of 3 units] or as (a x b) x (c x d)
[getting a logarithmic (base 2) delay of 2 units
and exploiting the available concurrency due to
the fact that “x” is associative].
• We can extend this idea to n operands (and
n-1 operations) to perform as many of the
pairwise operations as possible in parallel (and
do this recursively for every level of remaining
operations), similar to design (b) for the parity
detector [xor is an associative operation!] and
thus get a (log2 n) delay.
• In fact, any parenthesisation of operands is
correct for an associative operation/function,
but the above one is fastest. Surprisingly, any
parenthesisation leads to the same h/w cost: n-1 2-i/p gates, i.e., 2(n-1) gate i/ps. Why? Analyze.
x(1) x(0)
w(3,4)
w(3,6)
w(2,3)
w(3,1)
w(3,3)
w(3,5)
w(3,7)
w(2,2)
w(1,1)
w(3,2)
w(2,1)
w(3,0)
w(2,0)
w(1,0)
Delay = (# of levels in
AND-OR tree) * td =
log2 (n) *td
w(0,0) = f
(b) 16-bit parity tree
An example of simple designer ingenuity: a bad design would have resulted in a linear delay, while an ingenious (simple enough, though) & well-informed design results in a log delay, and both have the same gate i/p cost.
Parenthesization of tree-circuit: (((x(15) xor x(14)) xor (x(13) xor x(12))) xor ((x(11) xor x(10)) xor
(x(9) xor x(8)))) xor (((x(7) xor x(6)) xor (x(5) xor x(4))) xor ((x(3) xor x(2)) xor (x(1) xor x(0))))
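The linear chain of design (a) and the tree of design (b) can be modeled side by side; the functions below are our sketch, counting gate levels as the delay measure used in the slides:

```python
# Compare the linear XOR chain (design (a)) with the balanced parity tree
# (design (b)): same result, same gate count (n-1 two-i/p XORs), but
# (n-1) vs. log2(n) levels of delay.

def linear_parity(x):
    f, depth = x[0], 0
    for bit in x[1:]:
        f ^= bit
        depth += 1        # each 2-i/p XOR adds one gate delay: n-1 total
    return f, depth

def tree_parity(x):
    depth = 0
    while len(x) > 1:
        # one level of parallel 2-i/p XORs over adjacent pairs
        x = [a ^ b for a, b in zip(x[0::2], x[1::2])]
        depth += 1        # log2(n) levels total (n a power of 2)
    return x[0], depth

x = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1]   # x(15) .. x(0)
(f1, d1), (f2, d2) = linear_parity(x), tree_parity(x)
assert f1 == f2 and d1 == 15 and d2 == 4
```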
D&C for Associative Operations
• Let f(xn-1, ….., x0) be an associative function.
• Can the D&C approach be used to yield an efficient, streamlined n-bit
xor/parity function? Can it lead automatically to a tree-based ckt?
• What is the D&C principle involved here?
[Figure: D&C for f(xn-1, .., x0): the two subproblems are a = f(xn-1, .., xn/2) and b = f(xn/2-1, .., x0); the stitch-up function f(a, b) is the same as the original function for 2 inputs.]
• Using the D&C approach for an associative operation results in a breakup by #
of operands and the stitch up function being the same as the original function
(this is not the case for non-assoc. operations), but w/ a constant # of operands
(2, if the original problem is broken into 2 subproblems)
• Also, there are no dependencies between sub-problems
• If the two sub-problems of the D&C approach are balanced (of the same size or
as close to it as possible), then unfolding the D&C results in a balanced operation
tree of the type for the xor/parity function seen earlier of (log n) delay
D&C for Associative Operations (cont’d)
• Parity detector example
[Figure: the 16-bit parity detector as a D&C tree. Breakup by operands: the 16-bit parity splits into two 8-bit parities, whose o/ps are combined by the stitch-up function = 2-bit parity/xor; leaves are x(15) … x(0), internal wires w(3,7) … w(1,0), root w(0,0) = f. Delay = (# of levels in the XOR tree) * td = log2(n) * td.]
D&C Approach for Non-Associative Opers: n-bit > Comparator
• O/P = 1 if A > B, else 0
• Is this associative? Not clear for a breakup by bits of the 2 operands; the issue of associativity mainly applies
to n operands, not to the n bits of 2 operands
• For a non-associative func, determine its properties that allow determining a break-up & a
correct stitch-up function
• Useful property: at any level, the comp. of the MS (most significant) half determines the o/p if its result is > or <; else the
comp. of the LS ½ determines the o/p
• Can thus break up problem at any level into MS ½ and LS ½ comparisons & based on their results
determine which o/p to choose for the higher-level (parent) result
• No sub-prob. dependency
[Figure: D&C tree for the 8-bit comparator (breakup by size/bits). Root A: Comp A[7..0], B[7..0] breaks into A1: Comp A[7..4], B[7..4] and A2: Comp A[3..0], B[3..0]; the stitch-up “If A1 result is > or < take A1 result, else take A2 result” forms the complete soln to A. A1 breaks into A1,1: Comp A[7..6], B[7..6] and A1,2: Comp A[5..4], B[5..4] (stitch-up: if A1,1 result is > or < take it, else take A1,2’s); A1,1 breaks into A1,1,1: Comp A[7], B[7] and A1,1,2: Comp A[6], B[6], which are small enough to be designed using a TT.]
D&C Approach for Non-Associative Opers: n-bit > Comparator (cont’d)
[Figure: the same 8-bit comparator D&C tree as on the previous slide, with the leaf comparisons (Comp A[7], B[7], Comp A[6], B[6], etc.) small enough to be designed using a TT.]
The TT may be derived directly or by first thinking of and expressing its
computation in a high-level programming language and then converting
it to a TT.
Leaf (1-bit comparator) computation:

If A[i] = B[i] then { f1(i) = 0; f2(i) = 1 } /* the f2(i) o/p is an i/p to the stitch logic;
f2(i) = 1 means the f1( ), f2( ) o/ps of the parent should be those of the LS ½ of this subtree,
selected by the stitch logic as its o/ps */
else if A[i] < B[i] then { f1(i) = 0; /* indicates < */
f2(i) = 0 } /* indicates f1(i), f2(i) o/ps should be selected by stitch logic as its o/ps */
else if A[i] > B[i] then { f1(i) = 1; /* indicates > */
f2(i) = 0 } /* indicates f1(i), f2(i) o/ps should be selected by stitch logic as its o/ps */

The corresponding TT (2-bit i/p, 2-o/p comparator):

A[i] B[i] | f1(i) f2(i)
 0    0   |  0     1
 0    1   |  0     0
 1    0   |  1     0
 1    1   |  0     1
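The leaf TT and its recursive stitch-up can be modeled directly; this is our sketch of the slides' (f1, f2) encoding, with f1 the >-indicator and f2 the "equal, defer to LS half" indicator:

```python
def leaf_compare(a, b):
    # 1-bit comparator TT from the slide: f1 = 1 iff a > b; f2 = 1 iff a == b
    return (1 if a > b else 0, 1 if a == b else 0)

def stitch(ms, ls):
    # Stitch-up: if the MS half decided (its f2 = 0), pass its (f1, f2) up;
    # else pass up the LS half's pair.
    return ms if ms[1] == 0 else ls

def compare(A_bits, B_bits):
    # A_bits/B_bits are msb-first bit lists of equal power-of-2 length.
    if len(A_bits) == 1:
        return leaf_compare(A_bits[0], B_bits[0])
    m = len(A_bits) // 2
    return stitch(compare(A_bits[:m], B_bits[:m]),
                  compare(A_bits[m:], B_bits[m:]))

def to_bits(v, n=8):
    return [(v >> i) & 1 for i in range(n - 1, -1, -1)]

for A, B in [(185, 39), (39, 185), (200, 200)]:
    f1, _ = compare(to_bits(A), to_bits(B))
    assert f1 == (1 if A > B else 0)   # F = 1 iff A > B
```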
Comparator Circuit Design Using D&C (contd.)
• Once the D&C tree is formulated
it is easy to get the low-level &
stitch-up designs
• Stitch-up design shown here
[Figure: the 8-bit comparator D&C tree again (root Comp A[7..0], B[7..0], down through Comp A[7..4], B[7..4] / Comp A[3..0], B[3..0] to the 1-bit leaf comparators), with the stitch-up at each internal node: if the MS-half result is > or <, take it; else take the LS-half result.]
Stitch-up logic details for subprobs i & i-1:
If f2(i) = 0 then { my_op1 = f1(i); my_op2 = f2(i) } /* select MS ½ comp o/ps */
else { my_op1 = f1(i-1); my_op2 = f2(i-1) } /* select LS ½ comp. o/ps */
[Figure: direct design of the stitch-up logic as a 2-bit 2:1 Mux: I0 = f(i) = f1(i), f2(i); I1 = f(i-1) = f1(i-1), f2(i-1); select = f2(i); o/p my_op = (my_op1, my_op2), with an OR in the o/p path.]

Compact TT for the stitch-up logic:

f1(i) f2(i) | f1(i-1) f2(i-1) | my_op1  my_op2
  X    0    |    X      X     |  f1(i)   f2(i)
  X    1    |    X      X     | f1(i-1) f2(i-1)
Comparator Circuit Design Using D&C – Final Design
• H/W_cost(8-bit comp.) = 7*(H/W_cost(2:1 Muxes)) + 8*(H/W_cost(2-bit comp.))
• H/W_cost(n-bit comp.) = (n-1)*(H/W_cost(2:1 Muxes)) + n*(H/W_cost(2-bit comp.))
[Figure: the final 8-bit comparator circuit. Eight 1-bit comparators at the bottom (i/ps A[7] B[7] … A[0] B[0], o/ps f(7) … f(0)) feed log n levels of 2-bit 2:1 Muxes; each mux selects between its MS-half pair (I0) and LS-half pair (I1) using the MS half’s f2 bit, producing intermediate pairs my(5) … my(0) and the root o/p F = my1(6). All paths in this ckt are critical.]
• Delay(8-bit comp.) = 3*(delay of 2:1
Mux) + delay of 2-bit comp.
• Note parallelism at work – multiple
logic blocks are processing simult.
• Delay(n-bit comp.) = (log n)*(delay
of 2:1 Mux) + delay of 2-bit comp.
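The cost and delay formulas above follow from the D&C recurrences, which can be checked mechanically (the unit costs/delays below are arbitrary placeholders of ours):

```python
import math

# Verify the closed-form cost/delay claims for the mux-tree comparator
# against the D&C recurrences. n = number of leaf (1-bit) comparators.
MUX_COST, LEAF_COST = 1, 1
MUX_DELAY, LEAF_DELAY = 1, 1

def cost(n):
    if n == 1:
        return LEAF_COST
    return 2 * cost(n // 2) + MUX_COST    # two halves + one stitch-up mux

def delay(n):
    if n == 1:
        return LEAF_DELAY
    return delay(n // 2) + MUX_DELAY      # one mux level per halving

for n in (8, 16, 64):
    assert cost(n) == (n - 1) * MUX_COST + n * LEAF_COST
    assert delay(n) == int(math.log2(n)) * MUX_DELAY + LEAF_DELAY
```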
D&C: Mux Design
[Figure (a): top-down design (D&C) of a 2^n:1 MUX (data i/ps I0 … I(2^n − 1), select i/ps Sn-1 … S0). Two sets of operands: data operands (2^n) and the control/select operand (n bits). Breakup by operands (data) with a simultaneous breakup by bits (select): the i/ps whose msb select value is 0 feed one 2^(n-1):1 MUX and those whose msb value is 1 feed the other; within each group all bits except the msb take all different combinations, while the msb is at a constant value (0 in one group, 1 in the other), so the msb value differs among the 2 groups. Stitch-up: a 2:1 Mux selected by Sn-1.]
(a) Top-Down design (D&C)
Opening up the 8:1 MUX’s hierarchical design and a top-down view
[Figure: the 8:1 MUX (i/ps I0 … I7, selects S2 S1 S0, o/p Z) opened up top-down: two 4:1 Muxes (one for I0 I1 I2 I3, whose msb select value is 0, and one for I4 I5 I6 I7, whose msb value is 1; all bits except the msb take all different combinations within each group) stitched by a 2:1 MUX on S2; each 4:1 Mux is in turn two 2:1 MUXes on S0 stitched by a 2:1 MUX on S1. E.g., I6 is selected when S0 = 0, S1 = 1, S2 = 1; the i/p pair (I2, I6) is selected when S0 = 0, S1 = 1, so these i/ps should differ in S2.]
Top-Down vs Bottom-Up: Mux Design
[Figure (b): Bottom-Up (“Divide-and-Accumulate”) design of a 2^n:1 MUX: 2^(n-1) 2:1 MUXes selected by S0 at the first level (pairing I0/I1, I2/I3, …, I(2^n − 2)/I(2^n − 1)) feed a 2^(n-1):1 MUX selected by Sn-1 … S1.]
(b) Bottom-Up (“Divide-and-Accumulate”)
• Generally better to try top-down (D&C) first
An 8:1 MUX example (bottom-up)
[Figure: the 8:1 MUX built bottom-up: four 2:1 MUXes selected by S0 (pairing I0/I1, I2/I3, I4/I5, I6/I7; the first i/p of each pair is selected when S0 = 0, the second when S0 = 1) feed a 4:1 Mux selected by S2, S1, producing Z. The two i/ps of each first-level 2:1 Mux should have different lsb (S0) select values, since their selection is based on S0 (all other remaining, i.e., unselected, bit values should be the same); similarly for the other i/p pairs at the 2:1 Muxes at this level.]
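The top-down mux breakup maps directly onto a recursion; this is our sketch, with the data list ordered I0 upward and the select list msb-first:

```python
def mux(data, sel):
    # Top-down D&C: a 2^n:1 mux = two 2^(n-1):1 muxes stitched by a 2:1 mux
    # on the msb select bit. sel is msb-first: [S_{n-1}, ..., S0].
    if not sel:
        return data[0]
    half = len(data) // 2
    lo = mux(data[:half], sel[1:])   # i/ps whose msb select value is 0
    hi = mux(data[half:], sel[1:])   # i/ps whose msb select value is 1
    return hi if sel[0] else lo      # 2:1 stitch-up mux on S_{n-1}

I = list(range(8))                   # I0 .. I7 of an 8:1 mux
# I6 is selected when S2 = 1, S1 = 1, S0 = 0, as in the figure:
assert mux(I, [1, 1, 0]) == 6
# Exhaustive check: select value k picks input Ik.
assert all(mux(I, [(k >> 2) & 1, (k >> 1) & 1, k & 1]) == k for k in range(8))
```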
Multiplier D&C
[Figure: D&C for the multiplier A×B (n-bit mult), breakup by bits (operand size): four (n/2)-bit mults Ah×Bh (= W), Ah×Bl (= X), Al×Bh (= Y), Al×Bl (= Z), each producing an n-bit product. Stitch-up: align and add = 2^n·W + 2^(n/2)·X + 2^(n/2)·Y + Z.]
• Multiplication D&C idea:
A × B = (2^(n/2)·Ah + Al)(2^(n/2)·Bh + Bl), where Ah is the higher n/2 bits of A, and Al the lower n/2 bits
= 2^n·Ah·Bh + 2^(n/2)·Ah·Bl + 2^(n/2)·Al·Bh + Al·Bl = PH + PM1 + PM2 + PL
• Stitch-Up Design 1 (inefficient): add the four products PL(n→2n), PM1(n→2n), PM2(n→2n), PH(n→2n) (each aligned/extended from n to 2n bits) using 3 2n-bit adders. Cost = 3 2n-bit adders = 6n FAs (full adders) for RCAs (ripple-carry adders).
• Example: 10111001 = 185 × 00100111 = 39 = 0001110000101111 = 7215
D&C breakup: (10111001) × (00100111) = (2^4(1011) + 1001) × (2^4(0010) + 0111)
= 2^8(1011 × 0010) + 2^4(1011 × 0111 + 1001 × 0010) + 1001 × 0111
= 2^8(00010110) + 2^4(01001101 + 00010010) + 00111111
= bbbbbbbb00111111 = PL
+ bbbb01001101bbbb = PM1
+ bbbb00010010bbbb = PM2
+ 00010110bbbbbbbb = PH
_____________________
  0001110000101111 = 7215
• What is the delay of the n-bit multiplier using such a stitch-up (#1)? Critical path, Delay (using RCAs):
(a) too high-level analysis: 2*((2n)-bit adder delay) = 4n*(FA delay)
(b) more exact, considering the overall critical path: 2n+1 FA delays
[Figure: critical path through the RCAs of Stitch-Up Design 1 (o/p bits z7 … z0): an FA’s carry o/p c(i+1) = a(i)b(i) + a(i)c(i) + b(i)c(i) has a delay of 5 i/p delay units; critical paths (3 of n) go through (n+1) FAs.]
• Delay for adding 3 numbers X, Y, Z using two RCAs? Ans: (n+1) FA delay units, or 5(n+1) i/p delay units
z0
Multiplier D&C (cont’d)
• Stitch-Up Design 2 (efficient)
• Ex: 10111001 = 185 × 00100111 = 39 = 0001110000101111 = 7215
D&C breakup: (10111001) × (00100111) = (2^4(1011) + 1001) × (2^4(0010) + 0111)
= 2^8(1011 × 0010) + 2^4(1011 × 0111 + 1001 × 0010) + 1001 × 0111
= 2^8(00010110) + 2^4(01001101 + 00010010) + 00111111
= bbbbbbbb00111111 = PL
+ bbbb01001101bbbb = PM1
+ bbbb00010010bbbb = PM2
+ 00010110bbbbbbbb = PH
_____________________
  0001110000101111 = 7215
(Arrows in the adds on the left show Couts of lower-order adds propagating as Cin to the next higher-order adds)

[Figure: Stitch-Up Design 2: a network of (n/2)-bit adders combining PL, PM1, PM2, PH, with the Couts of lower-order adds propagating as Cins to the next higher-order adds; intermediate sums become available @ del = n/2, n/2+1, n/2+2, 2[n/2]+1, 2[n/2]+2, and the final o/p @ del = 3[n/2]+1 (del in FA delay units, lsb of the MS half marked).]
• Cost = 5 (n/2)-bit adders = 2.5n FAs for RCAs
• Critical path: Delay = 3*((n/2)-bit adder delay) = 1.5n*(FA delay) for RCAs
Multiplier D&C (cont’d)
• The (1.5n + 1) FA delay units is the delay of Stitch-Up Design 2 (efficient) assuming PL … PH have been computed. What is the delay of the entire multiplier?
• Does the stitch-up need to wait for all bits of PL … PH to be available before it “starts”? No: the stitch-up of a level can start when the lsb of the msb half of the product bits of each of the 4 products PL … PH is available; for the top level this is at n/4 + 2 after the previous level’s such input is avail (= when the lsb of the msb half of the i/p n-bit prod. is avail; see the analysis for the 2n-bit product in the figure)
• Using RCAs: (n-1) [this is the delay after the lsb of the msb half is avail. at the top level] + { (n/2 + 2) + (n/4 + 2) + … + (2 + 2) (stopping at the 4-bit mult) + 2 [this is the boundary-case 2-bit mult delay at bit 3, the lsb of the msb half of the 4-bit product] + 1/3 [this is the delay of a 2-i/p AND gate translated into FA delay units; an FA, using 2-i/p gates, has 3 2-i/p gate delays] }
= (n-1) + {(1/2)(Σ i=0..log n 2^i) + 2 log n − 1.17} [corrective term: −[½(3) − 1/3] for taking the prev. summation down to i = 1, 0] = n − 1 + (1/2)[2n − 1] + 2 log n − 1.17 ~ 2(n + log n) ~ Θ(2n) FA delays, similar to the well-known array multiplier that uses carry-save adders
• We were able to obtain this similar-to-array-multiplier design using basic D&C guidelines, and it did not require extensive ingenuity as it might have for the designers of the array multiplier
• We can obtain an even faster multiplier (Θ(log n) delay) using D&C and carry-save adders for the stitch-up; see appendix
• Why do we need 2 FA delay units for a 2-bit mult after the 4 1-bit prods of the 1-bit mults are avail?

[Figure: Stitch-Up Design 2 with its delay annotations, as on the previous slide.]
[Figure: the recursion of Stitch-Up Design 2 (SU2 = stitch-up design 2 for multiplication): SU2(n) at the root combines the 4 n-bit products of the (n/2)-bit multiplies, each of which is in turn computed by 4 SU2(n/2)’s, then SU2(n/4)’s, etc., producing the 2n-bit product.]
• What is its cost in terms of # of FAs (RCAs)?
• The level below the root (root = 1st level) has 4 (n/2)-bit multiplies to generate the PL … PH
of the root, 16 (n/4)-bit multiplies in the next level, up to the 2-bit mults at level log n.
• Thus FAs used = 2.5[n + 4(n/2) + 16(n/4) + …] + 4^(log n − 1)·(2) + 4^(log n)·(1/7) [the last two terms are
for the boundary cases of the 2-bit and 1-bit multipliers that each require 2 and 1/7 FAs, resp.; see
Qs below]
= 2.5n(Σ i=0..log n − 2 2^i) + 2(n/2)^2 + (1/7)n^2 = 2.5[n(n/2 − 1)/(2 − 1)] + 0.64n^2 = 1.25n^2 − 2.5n + 0.64n^2
~ 1.89n^2 = Θ(n^2)
• Why do we need 2 FA cost units for a 2-bit multiplication (with the 4 1-bit products of 1-bit mults
available)?
• Assuming we use only 2-input gates, why do we add (1/7) FA cost units for each 1-bit
multiplier (which is a 2-i/p AND gate)?
• Using carry-save adders or CSvA’s [see appendix], the cost is similar (quadratic in n, i.e.,
Θ(n^2)).
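The four-subproduct breakup and its align-and-add stitch-up can be checked functionally; this sketch models the arithmetic only (shifts stand in for the alignment wiring), not the adder structure:

```python
def dc_multiply(A, B, n):
    # D&C multiply of two n-bit numbers (n a power of 2): four (n/2)-bit
    # multiplies, stitched up by align-and-add.
    if n == 1:
        return A & B                  # 1-bit multiplier = a 2-i/p AND gate
    h = n // 2
    Ah, Al = A >> h, A & ((1 << h) - 1)
    Bh, Bl = B >> h, B & ((1 << h) - 1)
    PH  = dc_multiply(Ah, Bh, h)
    PM1 = dc_multiply(Ah, Bl, h)
    PM2 = dc_multiply(Al, Bh, h)
    PL  = dc_multiply(Al, Bl, h)
    # Stitch-up: 2^n*PH + 2^(n/2)*(PM1 + PM2) + PL
    return (PH << n) + ((PM1 + PM2) << h) + PL

# The slides' worked example: 185 * 39 = 7215
assert dc_multiply(0b10111001, 0b00100111, 8) == 185 * 39 == 7215
```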
D&C Example Where a “Straightforward” Breakup Does Not Work
• Problem: n-bit Majority Function (MF): Output f = 1 when a majority of bits is 1, else f =0
[Figure: straightforward breakup of root problem A, the n-bit MF [MF(n)] with o/p f, into Subprob. A1 = MF(LS n/2 bits) with o/p f1 and Subprob. A2 = MF(MS n/2 bits) with o/p f2, combined by a stitch-up function SU.]
• Need to ask (general Qs for any problem): Is the stitch-up function SU required in the
above straightforward breakup of MF(n) into two MF’s for the MS and LS n/2 bits:
– Computable?
– Efficient in both hardware and speed?
• Try all 4 combinations of f1, f2 values and check if it is possible for any function w/ i/ps
f1, f2 to determine the correct f value:
– f1 = 0, f2 = 0 ⇒ # of 1’s in the minority (<= n/4) in both halves, so totally # of 1’s <= n/2 ⇒ f = 0
– f1 = 1, f2 = 1 ⇒ # of 1’s in the majority (> n/4) in both halves, so totally # of 1’s > n/2 ⇒ f = 1
– f1 = 0, f2 = 1 ⇒ # of 1’s <= n/4 in the LS n/2 and > n/4 in the MS n/2, but this does not determine whether the total
# of 1’s is <= n/2 or > n/2. So no function can determine the correct f value (it would need
more info, like the exact count of 1’s)
– f1 = 1, f2 = 0: same situation as the f1 = 0, f2 = 1 case.
– Thus the stitch-up function is not even computable in the above breakup of MF(n).
D&C Example Where a “Straightforward” Breakup Does Not Work
(contd.)
• Try another breakup, this time of MF(n) into functions that are different from MF.
[Figure: alternative breakup of MF(n): Subprob. A1 counts the # of 1’s in the n bits (a (log n)+1-bit o/p f1, computed by a D&C tree for A1); Subprob. A2 is a > compare of A1’s o/p against floor(n/2) (computed by a D&C tree for A2), producing f.]
• Have seen (log n) delay (>) comparator for two n-bit #s using D&C
• Can we do 1-counting using D&C? How much time will this take?
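The second breakup can be modeled as two D&C stages; this is our sketch (1-counting as an associative '+' reduction, then the > compare):

```python
def count_ones(bits):
    # Subprob. A1: count the # of 1's by D&C ('+' is associative, so the
    # stitch-up is 2-operand addition and the circuit is a tree).
    if len(bits) == 1:
        return bits[0]
    m = len(bits) // 2
    return count_ones(bits[:m]) + count_ones(bits[m:])

def majority(bits):
    # Subprob. A2: '>' compare of the 1-count against floor(n/2).
    return 1 if count_ones(bits) > len(bits) // 2 else 0

assert majority([1, 0, 1, 1, 0, 1, 1, 0]) == 1   # five 1's out of 8
assert majority([1, 0, 0, 1, 0, 1, 0, 0]) == 0   # three 1's out of 8
```

Unlike the straightforward MF(n)-into-two-MF(n/2) breakup, both stages here have computable stitch-ups, at the price of a wider (count) intermediate result.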
Dependency Resolution in D&C:
(1) The Wait Strategy
• So far we have seen D&C breakups in which there is no data dependency
between the two (or more) subproblems of the breakup
• Data dependency leads to increased delays
• We now look at various ways of speeding up designs that have subproblem
dependencies in their D&C breakups
[Figure: root problem A broken into Subprob. A1 and Subprob. A2, with data flow from A1 to A2.]
• Strategy 1: Wait for the required o/p of A1 and then perform A2, e.g., as in a ripple-carry adder: A = n-bit addition, A1 = (n/2)-bit addition of the L.S. n/2 bits, A2 = (n/2)-bit addition of the M.S. n/2 bits
• No concurrency between A1 and A2:
t(A) = t(A1) + t(A2) + t(stitch-up)
= 2*t(A1) + t(stitch-up) if A1 and A2 are the same problems of the same size
• Note that w/ no dependency, the delay expression is:
t(A) = max{t(A1), t(A2)} + t(stitch-up) = t(A1) + t(stitch-up) if A1 and A2 are the
same problems of the same size
Adder Design using D&C
• Example: Ripple-Carry Adder (RCA)
– Stitching up: Carry from LS n/2 bits is
input to carry-in of MS n/2 bits at
each level of the D&C tree.
– Leaf subproblem: Full Adder (FA)
[Figure (a): D&C for the Ripple-Carry Adder: “Add n-bit #s X, Y” splits into “Add MS n/2 bits of X, Y” and “Add LS n/2 bits of X, Y”, recursively down to FAs at the leaves.]
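The Wait strategy in the RCA can be sketched directly: the MS-half recursion cannot start its carry chain until the LS half's carry-out arrives (bit lists below are lsb-first, a convention of ours):

```python
def rca(x_bits, y_bits, cin=0):
    # D&C ripple-carry adder: solve the LS half (A1) first; its carry-out
    # feeds the MS half (A2) -- the Wait strategy. Leaf = full adder.
    if len(x_bits) == 1:
        a, b = x_bits[0], y_bits[0]
        s = a ^ b ^ cin
        cout = (a & b) | (a & cin) | (b & cin)   # FA carry o/p
        return [s], cout
    m = len(x_bits) // 2
    ls_sum, c = rca(x_bits[:m], y_bits[:m], cin)    # A1: LS n/2 bits
    ms_sum, cout = rca(x_bits[m:], y_bits[m:], c)   # A2 waits for A1's carry
    return ls_sum + ms_sum, cout

def to_lsb_bits(v, n):
    return [(v >> i) & 1 for i in range(n)]

s, cout = rca(to_lsb_bits(185, 8), to_lsb_bits(39, 8))
assert sum(b << i for i, b in enumerate(s)) + (cout << 8) == 185 + 39
```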
Example of the Wait Strategy in Adder Design
• Note: Gate delay is proportional to the # of inputs (since, generally, there is a series connection of
transistors in either the up or down network = # of inputs ⇒ the R’s of the transistors in series
add up and are prop. to the # of inputs ⇒ delay ~ RC (C is the capacitive load) is prop. to the # of inputs)
• The 5-i/p gate delay stated above for a FA is correct if we have 2-3 i/p gates available
(why?); otherwise, if only 2-i/p gates are available, the delay will be 6 i/p gate delays
(why?).
• Assume each gate i/p contributes 2 ps of delay
• For a 16-bit adder the delay will be 160 ps
• For a 64 bit adder the delay will be 640 ps
Adder Design using D&C—Lookahead Wait (not in syllabus)
• Example: Carry-Lookahead Adder (CLA)
– Division: 4 subproblems per level (add ms n/4 bits, 3rd n/4 bits, 2nd n/4 bits, ls n/4 bits of X, Y), each producing local P, G
– Stitching up: a more complex stitching-up process (generation of global or “super” P, G’s to connect up the subproblems)
– Leaf subproblem: a 4-bit basic CLA with small p, g bits.
• More intricate techniques (like the P, G generation in CLA) for complex stitching up for fast designs may need to be devised that are not directly suggested by D&C. But D&C is a good starting point.

[Figure (a): D&C for the Carry-Lookahead Adder w/ a linear global P, G ckt: a linear connection of the local P, G’s from each unit determines the global or super P, G for each unit. But linear delay, so not much better than RCA.]

[Figure (b): D&C for the Carry-Lookahead Adder w/ a tree-like global P, G ckt: a tree connection of the local P, G’s from each unit determines the global P, G for each unit, i.e., does a prefix computation. The global P for each unit is an associative function, so it can be done in max log n time (for the last unit; less time for earlier units).]
Dependency Resolution in D&C:
(2) The “Design-for-all-cases-&-select (DAC)” Strategy
[Figure: root problem A: Subprob. A1 feeds 4 copies of Subprob. A2, each with a hardwired 2-bit i/p (I/p00, I/p01, I/p10, I/p11); a 4-to-1 Mux, with A1’s o/p as the select i/p, picks the correct A2 o/p.]
• Strategy 2: DAC: For a k-bit i/p from A1 to A2,
design 2k copies of A2 each with a different
hardwired k-bit i/p to replace the one from A1.
• Select the correct o/p from all the copies of A2
via a (2k)-to-1 Mux that is selected by the k-bit
o/p from A1 when it becomes available (e.g.,
carry-select adder)
• t(A) = max(t(A1), t(A2)) + t(Mux) + t(stitch-up)
= t(A1) + t(Mux) + t(stitch-up) if A1 and A2 are
the same problems
(2) The “Design-for-all-cases-&-select (DAC)” Strategy (cont’d)
• Note: The DAC-based replication will apply to the smallest subproblem (subcircuit) of A2 that is directly dependent on A1’s o/ps, and not necessarily to all of A2. Similarly for lower-level DACs.

[Figure: root problem A broken with DAC into Subprobs A1 and A2 (and again at the next level into A1,1/A1,2 and A2,1/A2,2), each pair combined by a stitch-up SUP; the Wait strategy is generally used at all lower levels after the 1st Wait level.]
Figure: A D&C tree w/ a mix of DAC and Wait strategies for dependency resolution between subproblems
• The DAC strategy has a MUX delay involved, and at small subproblems, the delay of a subproblem
may be smaller than a MUX delay or may not be sufficiently large to warrant extra replication or
mux cost.
• Thus a mix of DAC and Wait strategies, as shown in the above figure, may be faster, w/ DAC used at
higher levels and Wait at lower levels.
Example of the DAC Strategy in Adder Design
[Figure: DAC-based (carry-select style) adder stage with a simplified Mux on the block carry-out Cout.]
• For a 16-bit adder, the delay is (9*4 – 4)*2 = 64 ps (2 ps is the delay for a single
i/p); a 60% improvement ((160-64)*100/160) over RCA
• For a 64-bit adder, the delay is (9*8 – 4)*2 = 136 ps; a 79% improvement over RCA
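The carry-select idea referenced here can be sketched arithmetically: each MS block is computed for both possible carry-ins (the "all cases"), and the real carry then selects one (block width and names below are our choices):

```python
def carry_select_add(x, y, n, block=4):
    # DAC strategy on an adder: each block is computed for BOTH possible
    # carry-ins (2^k copies with k = 1 here), then the actual carry from the
    # previous block acts as the 2:1 mux select.
    mask = (1 << block) - 1
    total, carry = 0, 0
    for i in range(0, n, block):
        xb, yb = (x >> i) & mask, (y >> i) & mask
        sum0 = xb + yb          # copy of the block hardwired with carry-in 0
        sum1 = xb + yb + 1      # copy of the block hardwired with carry-in 1
        chosen = sum1 if carry else sum0      # the 2:1 mux
        total |= (chosen & mask) << i
        carry = chosen >> block
    return total + (carry << n)

assert carry_select_add(185, 39, 8) == 185 + 39 == 224
```

The blocks all compute in parallel; only the mux selects ripple, which is where the delay figures above come from.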
Dependency Resolution in D&C:
(3) Speculative Strategy
[Figure: root problem A: Subprob. A1 and Subprob. A2 run concurrently; a 2-to-1 Mux feeds A2’s dependent i/p, selecting between o/p(A1→A2) (I0) and an estimate (guess) based on analysis or stats (I1). FSM Controller: if o/p(A1→A2) = guess(A2) (compared when A1 generates a completion signal), then generate a completion signal after some delay corresponding to the stitch-up delay; else set the i/p to A2 = o/p(A1→A2) and generate the completion signal after the delay of A2 + stitch-up.]
• Strategy 3: Have a single copy of A2 but choose a highly likely value of the k-bit i/p and
perform A1, A2 concurrently. If, after the k-bit i/p from A1 is available, the selection turns out to be incorrect, redo A2 w/ the correct available value.
• t(A) = p(correct-choice)*max(t(A1), t(A2)) + (1 − p(correct-choice))*[t(A1) + t(A2)] + t(stitch-up), where p(correct-choice) is the probability that our choice of the k-bit i/p for A2 is correct.
• For t(A1) = t(A2), this becomes: t(A) = p(correct-choice)*t(A1) + (1 − p(correct-choice))*2t(A1) +
t(stitch-up) = t(A1) + (1 − p(correct-choice))*t(A1) + t(stitch-up)
• Need a completion signal to indicate when the final o/p is available for A; assuming worst-case time (when the choice is incorrect) is meaningless in such designs
• Need an FSM controller for determining if the guess is correct and, if not, redoing A2
(allowing more time for generating the completion signal).
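The expected-delay formula above is easy to evaluate for candidate guess probabilities; this small calculator is our sketch of that formula:

```python
def speculative_delay(t_A1, t_A2, t_stitch, p_correct):
    # Expected delay of the speculative strategy: a correct guess lets A1 and
    # A2 overlap (max); a wrong guess forces A2 to be redone after A1.
    return (p_correct * max(t_A1, t_A2)
            + (1 - p_correct) * (t_A1 + t_A2)
            + t_stitch)

# With t(A1) = t(A2) = t, this reduces to t + (1-p)*t + t_stitch:
t, ts = 10.0, 2.0
for p in (0.0, 0.5, 0.9, 1.0):
    assert abs(speculative_delay(t, t, ts, p) - (t + (1 - p) * t + ts)) < 1e-9
```

At p = 1 this matches the DAC strategy's overlap (without its replication cost); at p = 0 it degenerates to the Wait strategy plus the wasted first attempt.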
Dependency Resolution in D&C:
(4) The “Independent Pre-Computation” Strategy
[Figure: concept: root problem A with Subprob. A1 feeding Subprob. A2, where A2 is reorganized into A2_indep followed by A2_dep. Example of unstructured logic for A2 w/ i/ps u, v, w, x, y, z and a1 from A1: in the unstructured form the critical path after a1 is avail is 8 delay units; after restructuring into A2_indep then A2_dep, it is 4 delay units.]
• Strategy 4: Reconfigure the design of A2 so that it can do as much processing as possible that is
independent of the i/p from A1 (A2_indep). This is the “independent” computation that prepares for the final
computation of A2 (A2_dep) that can start once A2_indep and A1 are done.
• t(A) = max(t(A1), t(A2_indep)) + t(A2_dep) + t(stitch-up)
• E.g., let a1 be the i/p from A1 to A2, and let A2 have the logic a2 = v’x’ + uvx + w’xy + wz’a1 + u’xa1. If this were
implemented using 2-i/p AND/OR gates, the delay would be 8 delay units (1 unit = delay for 1 i/p) after a1 is
available. If the logic is re-structured as a2 = (v’x’ + uvx + w’xy) + (wz’ + u’x)a1, and if the logic in the 2
brackets is performed before a1 is available (these constitute A2_indep), then the delay is only 4 delay
units after a1 is available.
• Such a strategy requires factoring of the external i/p a1 in the logic for a2, and grouping & implementing all
the non-a1 logic, and then adding logic to “connect” up the non-a1 logic to a1 as the last stage.
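The factoring step can be verified exhaustively, since the equivalence must hold for every input combination; this check is our sketch of the example above:

```python
from itertools import product

# Check that the restructured logic equals the original over all inputs:
#   a2 = v'x' + uvx + w'xy + wz'a1 + u'xa1
#      = (v'x' + uvx + w'xy) + (wz' + u'x)*a1
def a2_original(u, v, w, x, y, z, a1):
    return ((not v and not x) or (u and v and x) or (not w and x and y)
            or (w and not z and a1) or (not u and x and a1))

def a2_restructured(u, v, w, x, y, z, a1):
    indep = (not v and not x) or (u and v and x) or (not w and x and y)
    dep_factor = (w and not z) or (not u and x)   # A2_indep: ready before a1
    return indep or (dep_factor and a1)           # A2_dep: one AND-OR after a1

for bits in product([0, 1], repeat=7):
    assert bool(a2_original(*bits)) == bool(a2_restructured(*bits))
```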
D&C Summary
• For complex digital design, we need to think of the “computation”
underlying the design in a structured manner---are there properties of
this computation that can be exploited for faster, less expensive,
modular design; is it amenable to the D&C approach? Think of:
– Breakup into >= 2 subprobs via breakup of (# of operands) or (operand sizes [bits])
– Stitch-up (is it computable?)
– Leaf functions
– Dependencies between sub-problems and how to resolve them
• The design is then developed in a structured manner & the
corresponding circuit may be synthesized by hand or described
compactly using a HDL (e.g., structural VHDL)
• For an operation/func x on n operands (an-1 x an-2 x …… x a0), if x is
associative, the D&C approach gives an “easy” stitch-up function,
which is x on 2 operands (the o/ps of applying x on each half). As a result,
a tree-structured circuit with (log n) delay instead of a linearly-connected
circuit with (n) delay can be synthesized.
• If x is non-associative, more ingenuity and determination of properties
of x is needed to determine the breakup and the stitch-up function. The
resulting design may or may not be tree-structured
• If there is dependency between the 2 subproblems, then we saw
strategies for addressing these dependencies:
– Wait (slowest, least hardware cost)
– Design-for-all-cases (high speed, high hardware cost)
– Speculative (medium speed, medium hardware cost)
– Independent pre-computation (medium-slow speed, low hardware cost)
Strategy 2: A general view of DAC
computations (w/ or w/o D&C)
• If there is a data dependency between two
or more portions of a computation (which
may be obtained w/ or w/o using D&C),
don’t wait for the “previous” computation
to finish before starting the next one
• Assume all possible input values for the
next computation/stage B (e.g., if it has 2
inputs from the prev. stage there will be 4
possible input value combinations) and
perform it using a copy of the design for each
possible input value.
• All the different o/ps of the diff. copies of B
are Mux’ed using prev. stage A’s o/p
• E.g. design: Carry-Select Adder (each
stage performs two additions, one for a carry-in of 0 and another for a carry-in of 1 from the
previous stage)
[Figure: (a) Original design – stage A (i/ps x, y) feeds stage B, which
produces o/p z. (b) Speculative design – four copies B(0,0), B(0,1),
B(1,0), B(1,1), one per combination of A's two o/p bits, compute in
parallel with A; A's o/p then selects the right result via a 4:1 mux.]
(a) Original design: Time = T(A)+T(B)
(b) Speculative computation: Time = max(T(A),T(B)) + T(Mux).
Works well when T(A) approx = T(B) and T(A) >> T(Mux)
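A minimal behavioral sketch of the carry-select idea (Python, not HDL; names and the 4-bit block size are my own choices): each block computes both a carry-in-0 and a carry-in-1 result speculatively, and the actual carry from the previous block merely selects (muxes) between the two precomputed answers:

```python
def carry_select_add(a, b, n=8, block=4):
    """n-bit carry-select adder model: every block of `block` bits is
    added twice in parallel (speculating carry-in = 0 and carry-in = 1);
    the real carry then just picks one of the two precomputed results."""
    def block_add(abits, bbits, cin):
        # plain ripple full-adder chain within one block
        s, c = [], cin
        for ai, bi in zip(abits, bbits):
            s.append(ai ^ bi ^ c)
            c = (ai & bi) | (c & (ai ^ bi))
        return s, c

    abits = [(a >> i) & 1 for i in range(n)]
    bbits = [(b >> i) & 1 for i in range(n)]
    carry, out = 0, []
    for lo in range(0, n, block):
        ab, bb = abits[lo:lo + block], bbits[lo:lo + block]
        s0, c0 = block_add(ab, bb, 0)   # speculative copy: carry-in 0
        s1, c1 = block_add(ab, bb, 1)   # speculative copy: carry-in 1
        s, carry = (s1, c1) if carry else (s0, c0)  # the "mux"
        out += s
    return sum(bit << i for i, bit in enumerate(out)) + (carry << n)
```

Because the blocks compute before their carry-ins arrive, the critical path is one block delay plus one mux delay per block, rather than a full ripple through all n bits.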
Strategy 3: Get the Best of Both Worlds
(Average and Worst Case Delays)!
Approximate analysis: Avg. dividend D value = 2^(n-1).
For divisor V values in the "lower half range" [1, 2^(n-1)], the average
quotient Q value is the Harmonic series (1 + 1/2 + 1/3 + … + 1/2^(n-1)) – this
comes from a Q value of x for V in the range 2^(n-1)/x to (2^(n-1)/(x-1)) - 1,
i.e., for approx. 2^(n-1)/x^2 #s, which have a probability of 1/x^2, giving a
probabilistic value of x(1/x^2) = 1/x in the average Q calculation.
The above summation is ~ ln(2^(n-1)) ~ (n-1)/1.4
(integration of 1/k from k = 1 to 2^(n-1)).
Q for divisors in the upper half range [2^(n-1) + 1, 2^n] is 0.
=> overall avg. quotient = (n-1)/2.8 => avg.
subtractions needed = 1 + (n-1)/2.8 = Θ(n/2.8)
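A rough numeric check of this approximation (a Python sketch under the same assumptions: the dividend fixed at its average value 2^(n-1), the divisor uniform over [1, 2^n]):

```python
def avg_quotient(n):
    """Exact average quotient floor(2^(n-1) / V) for V uniform over
    [1, 2^n], to compare against the notes' estimate (n-1)/2.8."""
    D = 2 ** (n - 1)            # dividend fixed at its average value
    V_max = 2 ** n
    return sum(D // V for V in range(1, V_max + 1)) / V_max

# estimated avg. subtractions from the notes: 1 + (n-1)/2.8
```

For moderate n the exact average lands close to the (n-1)/2.8 estimate, supporting the Θ(n/2.8) average-subtraction count.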
[Figure: two division circuits fed from the same input registers and
running in parallel – a unary division ckt (good ave case: Θ(n/2.8)
subs; bad worst case: Θ(2^n) subs) asserting done1, and a non-restoring
div. ckt (bad ave case: Θ(n) subs; good worst case: Θ(n) subs)
asserting done2. An external FSM uses start/done1/done2 to select the
first available output through a mux into the output register.]
• Use 2 circuits with different worst-case and average-case behaviors.
• Use the first available output.
• Get the best of both (ave-case, worst-case) worlds.
• In the above schematic, we get the good ave-case performance of unary division
(assuming uniformly distributed inputs) w/o the disadvantage of its bad worst-case
performance: ave. case = Θ(n/2.8) subs, worst case = Θ(n) subs.
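The race between the two circuits can be sketched behaviorally (Python; function names, and modeling "delay" as subtraction-step counts, are my own simplifications):

```python
def unary_div_steps(D, V):
    """Unary (repeated-subtraction) division: Q subtractions plus one
    final compare that detects the remainder is < V."""
    q, rem = 0, D
    while rem >= V:
        rem -= V
        q += 1
    return q, q + 1                     # (quotient, step count)

def nonrestoring_div_steps(D, V, n=8):
    """Non-restoring division: always exactly n add/sub steps for an
    n-bit dividend, regardless of the operand values."""
    return D // V, n

def best_of_both(D, V, n=8):
    """Run both circuits 'in parallel'; the external FSM takes the
    result of whichever finishes first.  Both produce the same
    quotient, so only the step count differs."""
    q, s1 = unary_div_steps(D, V)
    _, s2 = nonrestoring_div_steps(D, V, n)
    return q, min(s1, s2)
```

Small quotients finish via the unary circuit in a few steps; pathological ones (e.g., V = 1) are capped at the non-restoring circuit's n steps, giving the best of both average and worst case.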
Strategy 4a: Pipeline It! (Synchronous Pipeline)
[Figure: the original ckt or datapath is cut into stages Stage 1 …
Stage k, with clocked registers between consecutive stages. Conversion
to a simple level-partitioned pipeline (a level partition may not
always be possible, but other pipelineable partitions may be).]
• Throughput is defined as # of outputs / sec
• Non-pipelined throughput = (1 / D), where D = delay of original ckt’s datapath
• Pipeline throughput = 1/ (max stage delay + register delay)
• Special case: If original ckt’s datapath is divided into n stages, each of equal delay, and
dr is the delay of a register, then pipeline throughput = 1/((D/n)+dr).
• If dr is negligible compared to D/n, then pipeline throughput = n/D, n times that of the
original ckt
• FSM controller may be needed for non-uniform stage delays; not needed otherwise
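The throughput formulas above can be written out directly (a small Python sketch; the example numbers are illustrative, not from the notes):

```python
def throughputs(D, n, d_r):
    """Non-pipelined vs. pipelined throughput (outputs/unit time),
    assuming the datapath of delay D splits into n equal-delay stages
    with register delay d_r per stage."""
    non_pipelined = 1.0 / D            # one o/p per full datapath delay
    pipelined = 1.0 / (D / n + d_r)    # one o/p per (stage + register) delay
    return non_pipelined, pipelined

# e.g., D = 100 ns datapath, 5 stages, 2 ns registers:
# pipelined rate = 1/22 per ns, about 4.5x the non-pipelined 1/100
```

When d_r << D/n the pipelined rate approaches n/D, the ideal n-fold speedup stated above.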
Strategy 4a: (Synchronous) Pipeline It! (contd.)
• Legend: (shaded box) = register; Ij(t k) = processed version of i/p Ij
@ time k. We assume the delay of each basic module below is 1 unit.
• Comparator o/p produced every 1 unit of time, instead of every
(log n + 1) units of time, where 1 time unit here = delay
of a mux or a 1-bit comparator (both have the same or similar delay).
• We can reduce register cost by inserting registers at every 2 levels;
throughput then decreases to 1 o/p per every 2 units of time.
[Figure: pipelined 8-bit tree comparator. Bottom level: eight 1-bit
comparators on i/ps A[7] B[7] … A[0] B[0], producing 2-bit o/ps
f(7) … f(0). Above them, log n levels of 2-bit 2:1 muxes
(my(5) … my(0), then my(5)(1), my(4)(1), …), with pipeline registers
between levels; the final 1-bit mux o/p is F = my1(6). Along the time
axis, successive stages hold processed versions of successive i/ps,
e.g., stage-4 i/ps I1(t 4), I2(t 5) while the bottom level holds
I1(t 0) … I6(t 5).]
Strategy 4a: (Synchronous) Pipeline It! (contd.)
Legend: (dark box) = intermediate & output register; (light box) = input register
Pipelined Ripple Carry Adder
Problem: the i/p and o/p data direction is not the same as the computation direction –
they are perpendicular! In other words, each stage has o/ps, rather than o/ps appearing
only at the end of the pipeline.
[Figure: pipelined ripple-carry adder – in each clock cycle the stages
o/p, in parallel, the sum-bit pairs S7,S6 / S5,S4 / S3,S2 / S1,S0 for
the i/ps received 4 cc back, with the next S0, S1 o/ps queued behind.]
Adder o/p produced every 2 units of FA delay instead of every n units of FA
delay in an n-bit RCA
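A behavioral sketch of this 2-FA-per-stage pipelined RCA (Python, not HDL; names and the list-based pipeline model are my own): each in-flight addition advances one stage per cycle, so after the initial 4-cycle latency one completed sum emerges every cycle:

```python
def pipelined_rca(pairs, n_bits=8, bits_per_stage=2):
    """Pipelined ripple-carry adder model: stage k adds bit positions
    [k*bps, (k+1)*bps) of an operand pair, using the carry rippled in
    from stage k-1 on the previous cycle.  Returns (sums, total cycles);
    a new operand pair enters every cycle."""
    stages = n_bits // bits_per_stage
    in_flight = []        # each entry: [a, b, carry, partial_sum, next_stage]
    results, cycle, pending = [], 0, list(pairs)
    while pending or in_flight:
        for op in in_flight:                     # advance one stage per cycle
            a, b, carry, s, k = op
            lo = k * bits_per_stage
            for i in range(lo, lo + bits_per_stage):   # the stage's FA pair
                ai, bi = (a >> i) & 1, (b >> i) & 1
                s |= (ai ^ bi ^ carry) << i
                carry = (ai & bi) | (carry & (ai ^ bi))
            op[2], op[3], op[4] = carry, s, k + 1
        results += [op[3] | (op[2] << n_bits)
                    for op in in_flight if op[4] == stages]
        in_flight = [op for op in in_flight if op[4] < stages]
        if pending:                              # one new i/p pair per cycle
            a, b = pending.pop(0)
            in_flight.append([a, b, 0, 0, 0])
        cycle += 1
    return results, cycle
```

With three 8-bit operand pairs, the three sums complete on consecutive cycles, matching the one-o/p-per-2-FA-delays throughput claimed above.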
Strategy 4b: Pipeline It!—Wave Pipelining
• Wave pipelining is essentially pipelining of combinational circuits w/o the use of registers
and clocking, which are expensive and consume a significant amount of power.
• The min. safe input rate (MSIR: rate at which inputs are given to the wave-pipelined circuit)
is determined by the difference between the max and min delays across the various module or gate
outputs and inputs, respectively, and is chosen so that a later input's processing, however fast
it might be, never "overwrites" the data at the input of any gate or module that is still
processing the previous input.
[Figure: modules m1 → m2 in series, annotated with tmin(i/p:m2), the
min arrival time at m2's i/ps, and tmax(o/p:m2), the max delay at
m2's o/p(s).]
• Consider the two modules m1 and m2 above (a module could be a subcircuit like a full
adder, something more complex, or as simple as a single gate). Let tmin(i/p:mj) be the min
delay (i.e., min arrival time) over all i/ps to mj, and let tmax(o/p:m2) be the max delay at the
o/p(s) of m2.
• Certainly a safe i/p rate (SIR) period for wave pipelining the above configuration is
tmax(o/p:m2). But can we do it safely at a higher rate (lower period)?
• The min safe i/p rate (MSIR) period for m2 will correspond to a situation in which any new
i/p appears at m2 only after the processing of its current i/ps is over, so that while it is still
processing, its current i/ps are held stable. Thus a 2nd i/p should appear at m2 no earlier than
tmax(o/p:m2) => the 2nd i/p to the circuit itself, i.e., at m1, should not appear before
tmax(o/p:m2) - tmin(i/p:m2).
• Similarly, for safe operation of m1 on the 1st i/p, the 2nd i/p should not appear before
tmax(o/p:m1) - tmin(i/p:m1) = tmax(o/p:m1), as tmin(i/p:m1) = 0.
• Thus for safe operation of the 1st i/ps of both m1 and m2, the 2nd i/p should not appear before
max(tmax(o/p:m1), tmax(o/p:m2) - tmin(i/p:m2)), the min. safe i/p rate period for the 1st i/p.
Strategy 4b: Pipeline It!—Wave Pipelining (contd.)
[Fig. 1: m1 → m2, annotated with tmin(i/p:m2) and tmax(o/p:m2).
Fig. 2: a chain m1 → m2 → … → mj, annotated with the fastest i'th i/p
at mj, (i-1)tsafe + tmin(i/p:mj) >= (i-2)tsafe + tmax(o/p:mj) -
tmin(i/p:mj) + tmin(i/p:mj) = (i-2)tsafe + tmax(o/p:mj), and the
slowest (i-1)'th o/p of mj, (i-2)tsafe + tmax(o/p:mj); the i'th i/p
thus never arrives before the (i-1)'th o/p is done, so tsafe is safe.]
• The Q. now is whether tsafe(1) (the min safe i/p rate period for the 1st i/p) = max(tmax(o/p:m1),
tmax(o/p:m2) - tmin(i/p:m2)) will also be a safe rate period for the ith i/p for i > 1.
• Consider the 2nd o/p from m2, which appears at time tsafe(1) + tmax(o/p:m2), and the 3rd i/p to it,
which appears at 2 tsafe(1) + tmin(i/p:m2) >= tsafe(1) + tmax(o/p:m2) (since tsafe(1) =
max(tmax(o/p:m1), tmax(o/p:m2) - tmin(i/p:m2))), and thus no earlier than when the 2nd o/p of m2
appears; it is thus safe.
• A similar safety analysis shows that tsafe(1) is a safe i/p rate period for any ith i/p for both
m2 and m1, for any i. Since it is also the min. such period (at the module level) for the 1st i/p
(and in fact for any ith i/p), as we established earlier, tsafe(1) is the min. safe i/p rate
period for any ith i/p for both m2 and m1. We term this min. i/p rate period tsafe (= tsafe(1)).
• Thus if there are k modules m1, …, mk, a simple extension of the above analysis gives us that
this min. i/p rate period tsafe = max over i=1 to k of {tmax(o/p:mi) - tmin(i/p:mi)}; see Fig. 2 above.
Interesting Qs:
1. Is tsafe a safe period not just at the module level but for every gate gj in the circuit (e.g.,
will tsafe be a safe period for every gate gj in m2), which it needs to be in order for this
period to work?
2. Can we have a better (smaller) min. safe i/p rate period determination by considering
the above analysis at the level of individual gates than at the level of modules?
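The module-level formula is simple enough to compute directly (a Python sketch; the function name and the (tmax_out, tmin_in) pair representation are my own):

```python
def t_safe(modules):
    """Min. safe i/p rate (MSIR) period for a wave-pipelined chain.
    `modules` is a list of (tmax_out, tmin_in) pairs, one per module,
    and tsafe = max over i of { tmax(o/p:mi) - tmin(i/p:mi) }."""
    return max(t_out - t_in for t_out, t_in in modules)

# Example 1's two modules, m1 = (4 ps, 0) and m2 = (10 ps, 1 ps),
# give t_safe = max(4, 9) = 9 ps
```

Partitioning the same circuit into more, finer-grained modules can only lower (never raise) this maximum, which is exactly the effect Examples 2 and 3 illustrate.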
Strategy 4b: Pipeline It!—Wave Pipelining: Example 1
• Let the max delay of a 2-i/p gate be 2 ps and the min delay be 1 ps.
• What is tsafe for this ckt for the two modules shown?
• tsafe = max(4 ps, 10 - 1 = 9 ps) = 9 ps.
• Thus the MSIR period = 9 ps, only a little better than the 10 ps
corresponding to the max o/p delay. So this is not a circuit that
can be effectively wave pipelined.
• Generally, a ckt that has more balanced max o/p and min i/p delays
for each module and gate is one that can be effectively wave pipelined,
i.e., one whose MSIR period is much lower than the max o/p delay of the circuit.
[Figure: a two-module gate network on i/ps u, v, w, x, y, z, a1 and
their complements – module m1 with tmax(o/p:m1) = 4 ps feeds module m2
with tmin(i/p:m2) = 1 ps and tmax(o/p:m2) = 10 ps.]
Strategy 4b: Pipeline It!—Wave Pipelining: Example 2
[Figure: the same 8-bit tree comparator, split into two modules –
m1, the bottom two levels (the 1-bit comparators plus the first level
of 2-bit 2:1 muxes), with tmax(o/p:m1) = 6 ps and tmin(i/p:m2) = 4 ps;
and m2, the top two mux levels, with tmax(o/p:m2) = 12 ps at the final
o/p F = my1(6).]
• What if we divided the circuit into 4 modules, each
corresponding to a level of the circuit, and did the
analysis for that? See next slide.
• Let the max delay of a basic unit (1-bit comp., 2:1 mux) be 3 ps and the min delay be 2 ps.
• What is tsafe for this ckt for the two modules shown?
• tsafe = max(6 ps, 12 - 4 = 8 ps) = 8 ps.
• Thus the MSIR period = 8 ps, 33% lower than the 12 ps corresponding to the max o/p delay.
• So this is a circuit that can be reasonably effectively wave pipelined. This is due to the
balanced nature of the circuit, where all i/p → o/p paths are of the same length (the diff.
betw. max and min delays comes from the max and min delays of the components or gates themselves).
Strategy 4b: Pipeline It!—Wave Pipelining: Example 3
[Figure: the same 8-bit tree comparator, now split into 4 modules, one
per level – m1, the 1-bit comparators (i/ps arrive at 0 ps, tmax o/p =
3 ps); m2, the first mux level (tmin i/p = 2 ps, tmax o/p = 6 ps); m3,
the second mux level (tmin i/p = 4 ps, tmax o/p = 9 ps); m4, the final
mux (tmin i/p = 6 ps, tmax o/p = 12 ps, producing F = my1(6)).]
• Let the max delay of a basic unit (1-bit comp., 2:1 mux) be 3 ps and the min delay be 2 ps.
• What is tsafe for this ckt for the 4 modules shown?
• tsafe = max(3-0, 6-2, 9-4, 12-6) ps = 6 ps.
• Thus the MSIR period = 6 ps, 50% lower than the 12 ps corresponding to the max o/p delay.
So we get a better (lower) MSIR if we analyze the ckt at a finer granularity.
Appendix: Θ(log n) delay multiplier
Multiplier D&C (cont'd): Carry-Save Addition
Appendix: Θ(log n) delay multiplier
Multiplier D&C (cont'd): Carry-Save Add. Based Stitch-Up
[Figure: CSvA-based stitch-up. Each sub-product (PL, PM1, PM2, PH)
arrives as an n/2-bit (C & S) pair – S(PL)/C(PL), S(PM1)/C(PM1),
S(PM2)/C(PM2), S(PH)/C(PH). The 4 (n/2)-bit groups at the top level of
the multiplication are carry-save added separately (and thus in
parallel): one group needs no CSvA; one adds 4 #s using CSvA's; one
adds 6 #s using CSvA's in 3 delay units; one adds 7 #s (7 lsb bits need
to be added) in 4 delay units. Stitch-up #3: adding 6 numbers in
parallel using CSvA's takes 3 units of time and 4 CSvA's.]
• Using CSvAs (carry-save adders): each sub-prod., e.g., PL, is formed of 2 #s (sum bits and
carry bits), so there are 8 n-bit #s to be CSvA'ed in the final stitch-up, taking a delay of
approx. 5 units if done sequentially but only 4 units if done in parallel. We then get 2 final
#s (a carries # and a sums #) that are added by a carry-propagate adder like a CLA, which takes
Θ(log n) time. The overall multiplier delay is Θ(4 log n) [4 time units at each of the
(log n - 2) levels (we need at least 2-bit inputs for the above structure to be valid) + at most
2 time units for the bottom two levels (why?)] + Θ(log n) = Θ(log n) – similar to a Wallace-tree
multiplier.
• We were able to obtain this fast design using D&C (and did not need the extensive ingenuity
that the W-T multiplier designers must have needed)!
• Hardware cost (# of FAs), ignoring the final carry-prop. adder, for the entire mult.? Exercise.
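The carry-save step itself can be sketched in a few lines (Python; function names are my own, and the reduction below is a simple sequential grouping that preserves the sum rather than a minimal-depth layout):

```python
def csva(x, y, z):
    """One carry-save adder (CSvA) level: reduces 3 numbers to 2 (a sum
    word and a carry word) in a single full-adder delay, with no carry
    propagation across bit positions; invariant: x + y + z == s + c."""
    s = x ^ y ^ z                                # per-bit FA sum
    c = ((x & y) | (y & z) | (x & z)) << 1       # per-bit FA carry, shifted
    return s, c

def csva_reduce(nums):
    """Reduce k numbers to 2 by repeated CSvA application, then do one
    final carry-propagate add (the CLA's role in the notes).  Each step
    consumes 3 numbers and produces 2, so k shrinks by 1 per step."""
    nums = list(nums)
    while len(nums) > 2:
        s, c = csva(nums[0], nums[1], nums[2])
        nums = nums[3:] + [s, c]
    return sum(nums)
```

A hardware stitch-up would instead group the CSvA's into parallel levels (as in the figure) to get the 3-to-4-unit delays quoted above, but the sum-preserving invariant is the same.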