6.004 Computation Structures

advertisement
MIT OpenCourseWare
http://ocw.mit.edu
6.004 Computation Structures
Spring 2009
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
Binary Multiplication
Cost/Performance Tradeoffs:
a case study
Digital Systems Architecture 1.01
x
a
n bits
b
n bits
a b
EASY PROBLEM: design
combinational circuit to multiply
tiny (1-, 2-, 3-bit) operands...
HARD PROBLEM: design circuit to
multiply BIG (32-bit, 64-bit)
numbers
2n bits
since (2n-1)2 < 22n
We can make big
multipliers out of
little ones!
Engineering Principle:
Exploit STRUCTURE in problem.
Lab #3 due tonight!
3/5/09
6.004 – Spring 2009
modified 2/23/09 10:44
L09 - Multipliers 1
3/5/09
6.004 – Spring 2009
L09 - Multipliers 2
Our Basis:
Making a 2n-bit multiplier
using n-bit multipliers
n=1: minimalist starting point
Multiplying two 1-bit numbers is pretty simple:
Given n-bit multipliers:
a
aH aL
a
x
n bits
b
=
ab
x
aLbL
Synthesize 2n-bit multipliers:
0 ab
aHbL
2n bits
aHbH
b
x
=
Of course, we could start with optimized combinational
multipliers for larger operands; e.g.
aLbH
a
b
bH bL
2n bits
n bits
x
2n bits
ab
ab
a1a0
2
b1b0
2
2-bit
Multiplier
4
c3c2c1c0
the logic gets
more complex,
but some
optimizations
are possible...
4n bits
6.004 – Spring 2009
3/5/09
L09 - Multipliers 3
6.004 – Spring 2009
3/5/09
L09 - Multipliers 4
Brick Wall view
Our induction step:
of partial products
2n-bit by 2n-bit multiplication:
1. Divide multiplicands into n-bit pieces
2. Form 2n-bit partial products, using n-bit by n-bit
multipliers.
3. Align appropriately
4. Add.
Making 4n-bit multipliers from n-bit
ones: 2 “induction steps”
aLbH
aHbH aLbL
aHbL
aH aL x bH bL
REGROUP partial
products 2 additions
rather than 3!
a0b3
a1b3 a0b2
a3 a2 a1 a0
x
b3 b2 b1 b0
a3b3 a2b2
a3b2
Induction: we can use the same structuring
principle to build a 4n-bit multiplier from our
newly-constructed 2n-bit ones...
3/5/09
L09 - Multipliers 5
x
a1b3
"g(n) is of order f(n)"
a0b2
a3b1 a2b0
a3b0
ADD
Example:
g(n) = (f(n))
g(n) = (f(n)) if there exist C 2 C 1 > 0,
such that for all but finitely many integral n 0
a2b3 a1b2 a0b1
a3b3 a2b2 a1b1 a0b0
a3b2 a2b1 a1b0
Subassemblies:
• Partial Products
• Adders
MULT
"Order Of" notation:
a0b3
b3 b2 b1 b0
c 1•f(n) g(n) c 2•f(n)
L09 - Multipliers 7
(n 2)
n 2 (n 2+2n+3) 2n 2
g(n) = O(f(n))
(...) implies both
inequalities; O(...)
implies only the
second.
Partial Products:
Things to Add:
Adder Width:
Hardware Cost:
Latency:
3/5/09
n 2+2n+3 =
since
"almost always"
Step 2: Sum
6.004 – Spring 2009
L09 - Multipliers 6
Performance/Cost Analysis
Step 1: Form (& arrange)
Partial Products:
a3 a2 a1 a0
3/5/09
6.004 – Spring 2009
Multiplier Cookbook: Chapter 1
Given problem:
a0b1
a1b1 a0b0
a1b0
a2b1
a3b1 a2b0
a3b0
a•b
6.004 – Spring 2009
a1b2
a2b3
6.004 – Spring 2009
3/5/09
2
n
2n-2
2n
?
=
=
=
=
2
(n )
(n)
(n)
2
(n )
(n2) ??
L09 - Multipliers 8
Observations:
Repackaging Function
ADD
MULT
a1b3
a0b2
(n ) partial products.
2
(n ) full adde rs.
Put the Solution where the
Problem is.
2
(n ) partial products.
2
(n ) full adde rs.
Hmmm.
a2b3 a1b2 a0b1
a3b3 a2b2 a1b1 a0b0
a3b2 a2b1 a1b0
a3b1 a2b0
a3b0
2
Engineering Principle #2:
a0b3
MULT
a0 b3
a1 b3 a0 b2
a2 b3 a1 b2 a0 b1
a3 b3 a2 b2 a1 b1 a0 b0
a3 b2 a2 b1 a1 b0
a3 b1 a2 b0
a3 b0
ADD
How about n2 blocks, each doing a
little multiplication and a little
addition?
3/5/09
6.004 – Spring 2009
L09 - Multipliers 9
Goal:
Sk+1
b2
a0 b3
b1
a1 b3 a0 b2
b0
a2 b3 a1 b2 a0 b1
a3 b3 a2 b2 a1 b1 a0 b0
a3 b2 a2 b1 a1 b0
a0
a3 b1 a2 b0
a1
a3 b0
a
b3
Sk
bi
Single "brick" of brick-wall
array...
Ck+2
Ck
• Forms partial product
• Adds to accumulating sum
along with carry
aj
2
S’k+1
a3
(A+B)i
6.004 – Spring 2009
Ci
Takes 2 addend bits plus carry bit. Produces sum
and carry output bits.
Sk+1
L09 - Multipliers 11
Sk
2
• AND gate forms 1x1 product
• 2-bit sum propagates from top to
bottom
• Carry propagates to left
6.004 – Spring 2009
bi
0
Brick design:
Ck+2
FA
FA
S’k+1
S’k
Ck
aj
Wastes some gates… but consider
(say) optimized 4x4-bit brick!
CASCADE to form an n-bit adder.
3/5/09
a0 b3
b1
a1 b3 a0 b2
b0
a2 b3 a1 b2 a0 b1
a3 b3 a2 b2 a1 b1 a0 b0
a3 b2 a2 b1 a1 b0
a0
a3 b1 a2 b0
a1
a3 b0
a
Array Layout:
• operand bits bused diagonally
• Carry bits propagate right-to-left
• Sum bits propagate down
a3
Ai Bi
FA
b2
S’k
Necessary Component: Full Adder
Ci+1
L09 - Multipliers 10
Design of 1-bit multiplier "Brick":
Array of Identical Multiplier Cells
b3
3/5/09
6.004 – Spring 2009
3/5/09
L09 - Multipliers 12
Multiplier Cookbook:
Chapter 2
Latency revisited
Here’s our combinational multiplier:
a0
b3
a1
a2
a3
Combinational Multiplier:
What’s its propagation delay?
a0
b2
a0 b3
Naive (but valid) bound:
b1
• O(n) additions
a1 b3 a0 b2
b0
• O(n) time for each addition
a2 b3 a1 b2 a0 b1
• Hence O(n2) time required
a3 b3 a2 b2 a1 b1 a0 b0
a3 b2 a2 b1 a1 b0
On closer inspection:
a3 b1 a2 b0
• Propagation only toward
left, bottom
a3 b0
• Hence longest path bounded
by length + width of array:
O(n+n) = O(n)!
3/5/09
6.004 – Spring 2009
L09 - Multipliers 13
b3
Hardware for
b2
(n2)
n by n bits:
a0 b3
b1
a1 b3 a0 b2
Latency:
(n)
b0
a2 b3 a1 b2 a0 b1
Throughput:
(1/n)
a3 b3 a2 b2 a1 b1 a0 b0
a3 b2 a2 b1 a1 b0
a3 b1 a2 b0
Note: lots of tricks are
a3 b0
available to make a faster
combinational multiplier…
a1
a2
a3
+
3/5/09
6.004 – Spring 2009
Combinational Multiplier:
The Pipelining Bandwagon...
best bang for the buck?
where do I get on?
WE HAVE:
Suppose we have LOTS of
multiplications.
• Pipeline rules - "well
formed pipelines"
• Plenty of registers
• Demand for higher
throughput.
PIPELINING
Can we do better from a
cost/performance
standpoint?
a0
3/5/09
L09 - Multipliers 15
6.004 – Spring 2009
b3
a1
a3
What do we do? Where do we
define stages?
6.004 – Spring 2009
L09 - Multipliers 14
3/5/09
b2
a
b
0
3
a2
b1
a1 b3 a0 b2
b0
a2 b3 a1 b2 a0 b1
a3 b3 a2 b2 a1 b1 a0 b0
a3 b2 a2 b1 a1 b0
a3 b1 a2 b0
a3 b0
L09 - Multipliers 16
Stupid Pipeline Tricks
Even Stupider Pipeline Tricks
gotta break
that long
carry chain!
a0 b3
WORSE idea:
a2 b3 a1 b2 a0 b1
a3 b3 a2 b2 a1 b1 a0 b0
a3 b2 a2 b1 a1 b0
Stages:
(n)
Clock Period:
Hardware cost for n by n bits:
(n)
a3 b1 a2 b0
a3 b0
Latency:
Throughput:
2
(n )
2
(n )
Back to basics:
what’s the point of pipelining, anyhow?
( 1/n )
3/5/09
6.004 – Spring 2009
L09 - Multipliers 17
a0
b2
a0 b3
a0 b2
a1
• Segment every long
combinational path
a1 b3
a0 b1
a0 b0
a1 b2
a1 b1
a2 b3
a2 b2
a1 b0
a2
a3
a3 b3
a3 b2
b1
b0
(n)
(1)
(n2)
(n)
Throughput:
(1)
a3
a2 b1
a2 b0
• Well-formed pipeline
(careful!)
• Constant (high!)
throughput,
independently of
operand size.
a0
b3
b2
a0 b3
a0 b2
a1
b1
b0
a1 b3
a0 b1
a0 b0
a1 b2
a1 b1
a2 b3
a2 b2
a3 b3
a3 b2
a1 b0
a2 b1
a2 b0
a3 b1
a3 b0
... but suppose we don’t need
the throughput?
GOAL: (n) stages; (1) clock period! 3/5/09
Stages:
Clock Period:
Hardware cost for n by n bits:
Latency:
a2
a3 b1
a3 b0
6.004 – Spring 2009
L09 - Multipliers 18
Multiplier Cookbook: Chapter 3
b3
• Break array into diagonal
slices
3/5/09
6.004 – Spring 2009
Breaking O(n) combinational paths
LONG PATHS go down, to left:
• Doesn’t break long
combinational paths
• NOT a well-formed pipeline...
... different register
counts on alternative
paths
... data crosses stage
boundaries in both
directions!
a0 b3
a1 b3 a0 b2
a2 b3 a1 b2 a0 b1
a3 b3 a2 b2 a1 b1 a0 b0
a3 b2 a2 b1 a1 b0
a3 b1 a2 b0
a3 b0
a1 b3 a0 b2
L09 - Multipliers 19
6.004 – Spring 2009
3/5/09
L09 - Multipliers 20
Multiplier Cookbook: Chapter 4
Moving down the cost curve...
b3
Suppose we have INFREQUENT
multiplications... pipelining
doesn’t help us.
a0
a0 b3
a0 b2
a1
a1 b3
a2
a2 b3
a2 b2
a3
a3 b3
a3 b2
a1 b2
a1 b1
Sequential Multiplier:
b2
ai
b2
Can we do better from a cost/
performance standpoint?
Hmmm, do I
really need
all these
extras?
b3
b1
ai b3
ai b2
b1
b0
a0 b1
a0 b0
b0
• Re-uses a single n-bit “slice” to
emulate each pipeline stage
• a operand entered serially
ai b1
ai b0
• Lots of details to be filled in...
a1 b0
a2 b1
a2 b0
Stages:
Clock Period:
Hardware cost for n by n bits:
Latency:
a3 b1
a3 b0
Throughput:
3/5/09
6.004 – Spring 2009
L09 - Multipliers 21
Suppose we want to minimize
hardware (at any cost)…
b2
• Consider bit-serial!
b1
ai b3
ai b2
b0
• Form and add 1-bit
partial product per clock
ai b1
ai b0
b3
Bit Serial multiplier:
Cost minimization: how far can we go?
ai
• Re-uses a single brick to emulate
an n-bit slice
• both operands entered serially
• O(n2) clock cycles required
• Needs additional storage
(typically from existing
registers)
• Reuse single “brick” for
each bit bj of slice;
• Re-use slice for each bit
of a operand
Stages:
Clock Period:
Hardware cost for n by n bits:
Latency:
Throughput:
6.004 – Spring 2009
3/5/09
L09 - Multipliers 22
Multiplier Cookbook: Chapter 5
Extremes Dept...
b3
3/5/09
6.004 – Spring 2009
(Ridiculous?)
1
( 1 ) ( constant!)
( n )
( n )
(1/n)
L09 - Multipliers 23
6.004 – Spring 2009
b2
ai
b1
ai b3
ai b2
b0
ai b1
ai b0
( 1 n )
( 1 ) (constant)
( 1 ) + ?
(n2)
(1/n2)
3/5/09
L09 - Multipliers 24
Summary:
Scheme:
$
Latency
Thruput
Combinational
(n2)
(n)
(1/n)
N-pipe
(n2)
(n)
(1)
Slice-serial
(n)
(n)
(1/n)
Bit-serial
(1)*
(n2)
(1/n2)
Lots more multiplier technology: fast adders, Booth Encoding, column
compression, ...
6.004 – Spring 2009
3/5/09
L09 - Multipliers 25
Download