Lecture 4

ELEC 516 VLSI System Design and
Design Automation Spring 2010
Lecture 4 - Shifter and Multiplier
Design
Reading Assignment:
Weste: Chapter 8
Rabaey: Chapter 11
Note: some of the figures in this slide set are adapted from the slide set
of “Digital Integrated Circuits” by Rabaey et al., 2002
ELEC516/10 Lecture 4
Shifter Design
• Shifting operations are important and are used extensively for
– arithmetic shifting, logical shifting, rotation,
– floating point operations, scaling and multiplications by
constant number
– Data alignment
– Field extraction/combination
– Address generation
• Shifting a data word left or right by a constant amount is a
trivial hardware operation. A programmable shifter, however, is
more complex.
• E.g. shift left or right by a variable number of bits
• Design style
– Two dimension arrays
– Variable size
– Rotate
– Padding with zeros/ones
A simple shifter
[Figure: bit-slice i of a simple shifter. Inputs Ai and Ai-1, outputs Bi and Bi-1, control signals Right, nop, and Left.]
•The above design will...rapidly become complex and
slow for larger shift values
•A more structured approach is advisable. Two
commonly used shift structures are the barrel shifter
and the logarithmic shifter.
Barrel Shifter
• It consists of an array of transmission gates, where the
number of rows equals the word length of the data
and the number of columns equals the maximum
shift width.
• A major advantage of this shifter is that the signal
has to pass through at most one transmission gate,
and hence the delay is theoretically constant and
independent of the shift value or shifter size. This is
not true in reality, since the capacitance at the input
of the buffers rises linearly with the maximum shift width.
Barrel Shifter (2)
[Figure: 4-bit barrel shifter. Data wires A3..A0 and B3..B0, control wires Sh0..Sh3. Area dominated by wiring.]
Logarithmic Shifter
• While the barrel shifter implements the whole shifter as a single
array of pass transistors, the logarithmic shifter uses a staged approach. It
uses stages of multiplexers which decompose the shift into power-of-two stages.
• A shifter with a maximum shift width of M consists of log2(M) stages,
where the i-th stage either shifts over 2^i or passes the data
unchanged.
• The log. shifter is usually smaller than the barrel shifter. For larger
values of M, it is definitely the structure of choice.
• The speed depends on the shift width in a logarithmic way, since an n-bit
shifter requires log2(n) stages.
• Other shift options are frequently required, for instance shuffles,
bit reversals, and interchanges.
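A minimal Python sketch of the staged approach, for a logical right shift that pads with zeros. The function name `log_shift_right` and the 8-bit default width are assumptions made for illustration; the slides do not give an implementation.

```python
def log_shift_right(word, shift, width=8):
    """Logical right shift modelled as log2(width) multiplexer stages.

    Assumes width is a power of two and 0 <= shift < width.
    """
    mask = (1 << width) - 1
    data = word & mask
    stage = 0
    while (1 << stage) < width:
        # Stage `stage` either shifts by 2**stage or passes the data
        # unchanged, selected by bit `stage` of the shift amount.
        if (shift >> stage) & 1:
            data >>= (1 << stage)
        stage += 1
    return data


print(bin(log_shift_right(0b10110000, 5)))
```

Each loop iteration models one multiplexer stage, so an 8-bit word passes through three stages (shift by 1, 2, and 4) regardless of the shift amount.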
Logarithmic Shifter (2)
[Figure: 4-bit logarithmic shifter. Inputs A3..A0 pass through three multiplexer stages controlled by Sh1, Sh2, and Sh4 to outputs B3..B0.]
• In general, a barrel shifter is
appropriate for smaller shifters. For large shift values, the
log. shifter becomes more effective in terms of both area and
speed. The log. shifter is also more regular and hence can
easily be generated automatically.
Multiplexer-based shifter
Shifter design - Summary
• The design of a shifter is a trade-off between area and
delay.
• Barrel shifter: fastest but requires more transistors
Speed: O(1), area: n^2 transistors
• Logarithmic shifter: slower but requires fewer transistors
Speed: O(log n), area: n log n transistors
• The barrel shifter is a wire-dominated circuit
The Multiplier
• Very important operation. Often the speed of multiplication
limits the performance of the digital processor.
• Multiplications are used in many digital signal processing
applications:
– correlations, convolution, filtering, and frequency analysis.
– Vector product, matrix multiplication.
– Weighted sums required in many DSP applications such as neural
networks, filtering, etc.
• Multipliers are in fact complex adder arrays.
• The analysis of the multiplier gives us some further insight
on how to optimize the performance (or the area) of
complex circuit topologies.
Example
• Example: 10 x 5
Multiplicand:        1 0 1 0   (10)
Multiplier:          0 1 0 1   (5)
               -------------
                     1 0 1 0
                   0 0 0 0
                 1 0 1 0
               0 0 0 0         4 partial products
               -------------
               0 1 1 0 0 1 0   (50)
•The multiplication process may be viewed as consisting
of two steps:
•Evaluation of partial products
•Accumulation of the shifted partial products.
• Partial products can be generated using an array of AND gates.
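The two steps can be sketched in Python for the 10 x 5 example. This is an illustrative model only: the function name `partial_products` is an assumption, and each row is built bit by bit to mirror an array of AND gates.

```python
def partial_products(x, y, n):
    """Return the n shifted partial products of x * y (y is the multiplier)."""
    rows = []
    for j in range(n):
        y_j = (y >> j) & 1
        row = 0
        for i in range(n):
            x_i = (x >> i) & 1
            # One AND gate per (multiplicand bit, multiplier bit) pair.
            row |= (x_i & y_j) << i
        # Partial product j is weighted by 2**j, i.e. shifted left by j.
        rows.append(row << j)
    return rows


rows = partial_products(0b1010, 0b0101, 4)   # step 1: evaluation
product = sum(rows)                          # step 2: accumulation
print(rows, product)
```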
The Multiplier(II)
• Binary multiplication of two single bits is equivalent to the AND operation.
Evaluation of the partial products consists of the
logical ANDing of the multiplicand and the relevant
multiplier bit.
• Different techniques exist. The choice of technique
is based on factors such as speed, throughput,
numerical accuracy and area.
• An n*n multiplier has a 2n-bit output
– Integer multiplier: takes the n LSBs
– Floating point multiplier (or fixed point with the binary
point next to the MSB), e.g. FP, 1.XXX * 1.XXX: takes the n
MSBs
Simple multiplier
• Generates and adds one partial product in each
cycle.
• Takes n cycles.
[Figure: sequential multiplier datapath. Blocks: multiplicand, partial product generation, adder, shift, and the multiplier register, which is shifted right every cycle.]
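A register-level sketch of such a sequential multiplier for unsigned operands. The single combined register `reg` (multiplier in the low half, accumulated sum in the high half) is one common arrangement and is assumed here; the slide does not specify the register organisation.

```python
def shift_add_multiply(multiplicand, multiplier, n):
    """Unsigned n-bit shift-add multiplication, one partial product per cycle."""
    mask = (1 << n) - 1
    reg = multiplier & mask                      # low half holds the multiplier
    for _ in range(n):                           # takes n cycles
        if reg & 1:
            # Current multiplier bit is 1: add the multiplicand
            # into the upper half of the register.
            reg += (multiplicand & mask) << n
        reg >>= 1                                # shift right every cycle
    return reg


print(shift_add_multiply(10, 5, 4))
```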
Issues in designing a fast multiplier
• Reduce the number of partial products
• Use fast adder cells
• Reduce the number of additions required to sum
the partial products, e.g. use tree adders
The Array Multiplier
• Consider two unsigned binary numbers X and Y
that are M and N bits wide, respectively:
$$X = \sum_{i=0}^{M-1} X_i 2^i, \qquad Y = \sum_{j=0}^{N-1} Y_j 2^j$$
• The product P is then
$$P = X \times Y = \sum_{k=0}^{M+N-1} P_k 2^k = \left( \sum_{i=0}^{M-1} X_i 2^i \right) \left( \sum_{j=0}^{N-1} Y_j 2^j \right) = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} X_i Y_j 2^{i+j}$$
•The partial product terms X_i Y_j that make up the P_k are called summands. There are
M*N summands, which are generated in parallel by a set of
M*N AND gates
The Array Multiplier (II)
• An n*n multiplier requires n(n-2) full adders, n half
adders, and n^2 AND gates. The worst-case delay is
(2n+1)t_g, where t_g is the worst-case adder delay.
The Array Multiplier (III)
• The following is a basic cell used in array multiplier
[Figure: basic cell of the array multiplier. Labels: X, Y, B, C, adder (+), CO, PO.]
A 4*4 array multiplier
[Figure: 4*4 array multiplier. Inputs X3..X0 and Y3..Y0, three rows of adders (HA FA FA HA, FA FA FA HA, FA FA FA HA), outputs Z7..Z0.]
The MxN Array Multiplier - Critical Path
[Figure: array of HA and FA cells with two critical paths marked, Critical Path 1 and Critical Path 2.]
$$t_{mult} = [(M-1) + (N-2)]\, t_{carry} + (N-1)\, t_{sum} + t_{and}$$
Carry-Save Adder (old style)
• We don’t need to optimize the carry chain of each of
the rows. Postpone the carry to a later stage
$$\text{Delay} = N \cdot t_{carry} + t_{and} + t_{merge}$$
[Figure: M x N carry-save (CSA) array of HA and FA cells followed by a vector merging stage. [Rab96] p.411]
Booth Encoding
• The multipliers we studied before use radix-2
multiplication, i.e. they observe one bit of the
multiplier at a time.
• Higher-radix multipliers may be designed to reduce the
number of adders and hence the delay required to
compute the partial sums.
• Booth encoding performs two’s complement
multiplication and performs several steps of the
multiplication at once.
• It takes advantage of the fact that an adder/subtractor
is nearly as fast and small as a simple adder.
• The most common form of Booth’s algorithm looks at
three bits of the multiplier at a time to perform two
stages of multiplication.
Booth Multiplier: Example
• 2^a = 2^(a+1) - 2^a, and hence we can recode each 1 in the
multiplier as “+2 -1”
– Converts a sequence of 1’s into 1 0…0 (-1)
– Might reduce the number of 1’s
[Figure (© K. Bazaragan): the run 0 1 1 1 1 1 1 is recoded bit by bit as (+1 -1) and becomes 1 0 0 0 0 0 -1, so there are fewer 1’s in the recoded sequence.]
Booth Recoding: Multiplication Example
[Figure (© K. Bazaragan): 6 x 14 = 84 with Booth recoding. The multiplier 0 1 1 1 0 (14) is recoded as +1 0 0 -1 0, so there are only two rows of partial sums: (-6) shifted left by one position, with sign extension, and +6 shifted left by four positions.]
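The recoding rule can be checked with a short Python sketch, using the 6 x 14 example. The helper names `booth_recode` and `booth_multiply` are hypothetical; the digit rule used is digit_i = y[i-1] - y[i] with y[-1] = 0, which follows from 2^a = 2^(a+1) - 2^a.

```python
def booth_recode(y, n):
    """Radix-2 Booth digits (LSB first) of an n-bit two's complement y."""
    digits = []
    prev = 0                        # implicit bit y[-1] = 0
    for i in range(n):
        bit = (y >> i) & 1
        digits.append(prev - bit)   # each digit is -1, 0, or +1
        prev = bit
    return digits


def booth_multiply(x, y, n):
    """Multiply x by the n-bit two's complement y via its Booth digits."""
    return sum(d * (x << i) for i, d in enumerate(booth_recode(y, n)))


digits = booth_recode(14, 5)        # 14 = 0 1 1 1 0
print(digits[::-1], booth_multiply(6, 14, 5))
```

Only the nonzero digits need an add or subtract, so 14 costs two operations here instead of three.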
Booth Recoding: Advantages and
Disadvantages
• Major advantage: can reduce the number of 1’s
in the multiplier
• So far:
– We did not improve the speed of the multiplier, as we
still have to wait for the critical path, e.g. the shift-add
delay in a sequential multiplier.
– Booth recoding results in increased area, as we need
recoding circuitry and subtraction
Modified Booth Multiplier
• We can reduce the number of partial sums by grouping more bits
• Group pairs, leaving -2, -1, 0, 1, 2
– Grouping reduces the number of partial products by half
• Booth recoding gets rid of 3’s (sequences of 1’s in general)
[Figure (©Hauck): example bit strings in which each 1 is recoded as (+1 -1) and the resulting digits are grouped in pairs, giving radix-4 digits such as +2, -1, -2, and +1.]
Modified Booth Encoding (II)
• Consider the two’s complement representation of
the multiplier y:
$$y = -2^n y_n + 2^{n-1} y_{n-1} + 2^{n-2} y_{n-2} + \cdots$$
• We can rewrite 2^a = 2^{a+1} - 2^a and hence
$$y = 2^n (y_{n-1} - y_n) + 2^{n-1} (y_{n-2} - y_{n-1}) + 2^{n-2} (y_{n-3} - y_{n-2}) + \cdots$$
• Look at the first two terms
$$2^n (y_{n-1} - y_n) + 2^{n-1} (y_{n-2} - y_{n-1}) = 2^{n-1} (-2 y_n + y_{n-1} + y_{n-2})$$
Modified Booth Multiplier
• Can encode the digits by looking at three bits at a
time (reduce the partial sums)
• Booth recoding table:
i+1   i   i-1   add
 0    0    0     0*M
 0    0    1     1*M
 0    1    0     1*M
 0    1    1     2*M
 1    0    0    -2*M
 1    0    1    -1*M
 1    1    0    -1*M
 1    1    1     0*M
– Must be able to add
multiplicand times –2, -1,
0, 1 and 2
– Since Booth recoding got
rid of 3’s, generating
partial products is not
that hard (shifting and
negating)
[©Hauck]
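The recoding table corresponds to the digit value -2*y[i+1] + y[i] + y[i-1]. A short Python sketch under that reading follows; the function names are hypothetical, and an even word length n is assumed.

```python
def booth_radix4_digits(y, n):
    """Radix-4 Booth digits (LSB first) of an n-bit two's complement y.

    Assumes n is even. Each digit lies in {-2, -1, 0, 1, 2}.
    """
    digits = []
    for i in range(0, n, 2):
        y_im1 = (y >> (i - 1)) & 1 if i > 0 else 0   # y[-1] = 0
        y_i = (y >> i) & 1
        y_ip1 = (y >> (i + 1)) & 1
        digits.append(-2 * y_ip1 + y_i + y_im1)      # one table row
    return digits


def booth_radix4_multiply(m, y, n):
    """Sum the n/2 partial products digit * M, each weighted by 4**k."""
    return sum(d * m * 4 ** k for k, d in enumerate(booth_radix4_digits(y, n)))


print(booth_radix4_digits(-6, 6), booth_radix4_multiply(13, -6, 6))
```

Every partial product is 0, +/-M, or +/-2M, so generating it takes only a shift and an optional negation, and a 6-bit multiplier needs three partial products instead of six.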
Booth Multiplier: Example
• Retire two bits per shift operation
• Addition: signed
– Sign extend 2 bits if adding
two partial products at a time
• Example: 13 x (-6), recoding on bits i+1, i, i-1 [© K. Bazaragan]
        0 0 1 1 0 1    multiplicand M = 13
        1 1 1 0 1 0    multiplier = -6, recoded as 0 -1 -2
1 1 1 1 1 0 0 1 1 0    -2*M, sign extended
1 1 1 1 0 0 1 1        -1*M, shifted left by 2
0 0 0 0 0 0             0*M, shifted left by 4
-------------------
1 1 1 0 1 1 0 0 1 0    product = -78
Booth Multiplier
• The following shows a structure of a Booth multiplier
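[Figure: two stages (j and j+1) of a Booth multiplier. In each stage, recoding logic ("code") examines three multiplier bits (yi, yi+1, yi+2 in stage j and yi+2, yi+3, yi+4 in stage j+1) and drives a mux select that chooses 0, x, or 2x. An adder/subtractor combines the selected value with the partial product Pj, with a left shift by 2 between stages, to produce Pj+1.]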
Modified Booth Multiplier Summary
• Uses high-radix to reduce number of intermediate
addition operands
– Can go higher: radix-8, radix-16
– Radix-8 should implement *3, *-3, *4, *-4
– Recoding and partial product generation becomes
more complex
• Can automatically take care of signed multiplication
Wallace-Tree Based Multiplier
• Principle
– Sum N shifted partial products
– Do N-input addition efficiently
– Reduce N-input addition in steps
– Use counters, e.g. carry-save adder (CSA) (3/2
reduction)
• A CSA is simple: it is just a full adder
– At the end of the array you need to add the two parts
together.
– This takes a fast adder, but you only need one at the
end, not one for each partial product.
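A Python sketch of 3/2 reduction on whole words, assuming non-negative operands. The names `carry_save_add` and `reduce_and_add` are illustrative, and the grouping of operands into threes is one simple choice rather than the only possible tree.

```python
def carry_save_add(a, b, c):
    """3:2 reduction: three operands in, a sum word and a carry word out."""
    s = a ^ b ^ c                                 # per-bit sum, no propagation
    carry = ((a & b) | (a & c) | (b & c)) << 1    # per-bit carry, weight 2
    return s, carry


def reduce_and_add(operands):
    """Reduce operands with CSA levels, then do one carry-propagate add."""
    ops = list(operands)
    levels = 0
    while len(ops) > 2:
        full = len(ops) - len(ops) % 3            # operands that form triples
        nxt = []
        for k in range(0, full, 3):
            nxt.extend(carry_save_add(*ops[k:k + 3]))
        nxt.extend(ops[full:])                    # leftovers pass through
        ops = nxt
        levels += 1
    return sum(ops), levels                       # final fast adder


print(reduce_and_add([10, 0, 40, 0]))
```

The carry-propagate addition happens only once, in the final `sum`; every CSA level has the delay of a single full adder.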
Reduction by Carry-save adders
• Example: X(2,1,0)*Y(2,1,0). Let A0 = X(0)*Y(0), A1 =
X(1)*Y(0), A2 = X(2)*Y(0), etc.
[Figure: the shifted partial-product rows A2 A1 A0, B2 B1 B0, and C2 C1 C0 are reduced with two CSAs, and a final CPA produces the product.]
Carry-Save Multiplier
[Figure: carry-save multiplier array of HA and FA cells followed by a vector merging adder.]
Wallace Tree Multiplier
• The Wallace tree multiplier uses logic tricks to
speed up the required addition. It is an adder tree
built from carry save adders using 3-to-2 reduction
A B C   C S   No. of 1’s
0 0 0   0 0   0
0 0 1   0 1   1
0 1 0   0 1   1
0 1 1   1 0   2
1 0 0   0 1   1
1 0 1   1 0   2
1 1 0   1 0   2
1 1 1   1 1   3
A 1-bit full adder provides a 3:2
compression in the number
of bits. The addition of
partial products in a column
of an array multiplier may be
thought of as totaling up the
number of 1’s in that column,
with a carry being passed
to the next column to the
left.
Wallace Tree Multiplier
[Figure: Wallace tree multiplier block diagram. Multiplicand -> partial product generator -> partial products -> summation network -> two 2n-bit operands -> carry propagate adder -> final 2n-bit product.]
Wallace-Tree Multiplier
[Figure: Wallace-tree reduction shown as dot diagrams over bit positions 6..0, panels (a) to (d): partial products, first stage, second stage, and final adder. HA and FA mark the half and full adders applied in each stage.]
Wallace Tree Example
Delay = 4 CSA + 1 CLA
[Par00] p130
[© Oxford U Press]
Wallace-Tree Multiplier
Wallace-Tree Based Multiplier
[Figure: Wallace-tree adder built from full adders (FA). Inputs y0..y5, with carries Ci-1 in and Ci out between neighbouring bit slices, and final outputs C and S.]
The issues of sign extension
• When a partial product is negative, we need to do
sign extension.
• If we do it just by copying the sign bit, there is an impact on
the delay, since the fanout can be large.
• We can do some tricks
– Pre-add the triangle of 1’s
11111111
111111
1111
11
--------
10101011
– Then clear out the 1’s of a row by adding the inverted sign bit S’ to that row
11111111 + S’ =
0 0 0 0 0 0 0 0 (S=0)
or 1 1 1 1 1 1 1 1 (S=1)
The issues of sign extension
• Now you only need to add a few bits
SSS
1S
1S
1S
1 0 1 0 1 0 1 1
• Adding these few bits is equivalent to complete sign
extension
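The arithmetic behind the trick can be checked in Python. The sketch assumes an 8-bit extension field and four partial products, each shifted two bits further left, as in the triangle of 1's; all names are hypothetical.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1
SHIFTS = (0, 2, 4, 6)

# Sign-extension field of each row when its sign bit is 1:
# 11111111, 11111100, 11110000, 11000000 (the triangle of 1's).
triangle = [(MASK << shift) & MASK for shift in SHIFTS]
constant = sum(triangle) & MASK          # pre-added once


def extension_direct(signs):
    """Full sign extension: add a row's field of 1's when its sign is 1."""
    return sum(t for t, s in zip(triangle, signs) if s) & MASK


def extension_trick(signs):
    """Pre-added constant plus one inverted sign bit per row."""
    total = constant
    for shift, s in zip(SHIFTS, signs):
        # Adding 1 at the bottom of an all-ones field clears it (mod 2**WIDTH),
        # so a positive row (s = 0) removes its share of the constant.
        total += (1 - s) << shift
    return total & MASK


print(bin(constant), extension_trick((1, 0, 1, 0)) == extension_direct((1, 0, 1, 0)))
```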
Other Multiplier structures
• Serial Multiplier: Very compact but very slow: M+N bit product
requires Td= MN clock cycles
• Serial/Parallel Multiplier: Very modular, good trade-off: Td=M+N cycles
Multipliers —Summary
• Optimization Goals Different Vs Binary Adder
• Once Again: Identify Critical Path
• Other possible techniques
- Logarithmic versus Linear (Wallace Tree Mult)
- Data encoding (Booth)
- Pipelining
FIRST GLIMPSE AT SYSTEM LEVEL OPTIMIZATION
Floating-point units
• More complex operation/more time
• Fewer accesses
• Often designed outside the normal ALU
• Co-processor
• Floating point representation
• Data = (-1)^sign * 0.1Fraction * 2^exp
• Normalization:
– ½ <= Data < 1 (Exp = 0, Sign = 0)
– First digit after the binary point is one
– No need to represent it
• IEEE standard: sign – 1 bit, exponent – 11 bits,
fraction – 52 bits => total 64 bits
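The 1/11/52-bit split can be inspected with Python's standard `struct` module. The function name `decode_double` is an assumption. Note that IEEE 754 itself normalizes to 1.Fraction and stores the exponent with a bias of 1023, which differs from the 0.1Fraction form written above.

```python
import struct


def decode_double(x):
    """Split a 64-bit IEEE 754 value into (sign, exponent, fraction) fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                      # 1 bit
    exponent = (bits >> 52) & 0x7FF        # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)      # 52 bits, hidden leading 1 not stored
    return sign, exponent, fraction


sign, exponent, fraction = decode_double(-6.5)   # -6.5 = -1.625 * 2**2
print(sign, exponent - 1023, hex(fraction))
```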
Floating Point Addition
• Align operands
– Check exponents
– Shift data
• Add fractional bits
– Integer addition
• Normalization
– Shift data
– Increment or decrement exponents
• Rounding data
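The steps above can be sketched on a toy format. Everything here is an assumption for illustration: operands are positive (exponent, mantissa) pairs with value mantissa * 2**exponent, the mantissa is a normalized P-bit integer, and rounding is plain truncation.

```python
P = 8  # mantissa width of the hypothetical toy format


def fp_add(a, b):
    """Add two positive toy floats given as (exponent, mantissa) pairs."""
    (ea, ma), (eb, mb) = a, b
    # Align operands: check exponents, shift the smaller operand right.
    if ea < eb:
        ea, ma, eb, mb = eb, mb, ea, ma
    mb >>= (ea - eb)              # bits shifted out are dropped (truncation)
    # Add fractional bits: plain integer addition.
    exp, mant = ea, ma + mb
    # Normalization: on overflow, shift right and increment the exponent.
    while mant >= (1 << P):
        mant >>= 1                # dropped bit stands in for rounding
        exp += 1
    return exp, mant


exp, mant = fp_add((0, 192), (0, 128))
print(exp, mant, mant * 2 ** exp)
```

Subtraction of nearly equal operands would additionally need a left shift with an exponent decrement, which this sketch leaves out.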
Floating point adder
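[Figure: floating point adder block diagram. Inputs: sign, exponent, and mantissa of A and B, plus the +/- operation select. Blocks: sign unit, exponent difference, shift/align, mantissa adder, normalization (Norm), exponent update, and rounding. Outputs: sign, exponent, and mantissa of C.]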
Floating Point Multiplication
• Add exponents
– 11 bit addition
• Multiply the mantissa
– Integer multiplication
• Normalization
– Shift data (at most by one)
– Decrement exponent
• Rounding data
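A sketch of these steps on a toy format, again under stated assumptions: operands are (sign, exponent, mantissa) triples with value (-1)**sign * mantissa * 2**exponent, the mantissa is a normalized P-bit integer, and rounding is truncation. The exponent width is not modelled.

```python
P = 8  # mantissa width of the hypothetical toy format


def fp_mul(a, b):
    """Multiply two toy floats given as (sign, exponent, mantissa) triples."""
    (sa, ea, ma), (sb, eb, mb) = a, b
    sign = sa ^ sb                      # sign of the product
    exp = ea + eb + P                   # add exponents (+P: keep top P bits)
    prod = ma * mb                      # integer multiply, up to 2P bits
    # Normalization: the product of two normalized mantissas is short
    # by at most one bit, so shift left by at most one and decrement.
    if prod < (1 << (2 * P - 1)):
        prod <<= 1
        exp -= 1
    mant = prod >> P                    # rounding by truncation
    return sign, exp, mant


sign, exp, mant = fp_mul((0, 0, 128), (1, 0, 192))
print(sign, exp, mant, (-1) ** sign * mant * 2 ** exp)
```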
Floating Point Multiplier
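[Figure: floating point multiplier block diagram. Inputs: sign, exponent, and mantissa of A and B. Blocks: Ex-or, exponent add, mantissa multiplier, normalization (Norm), exponent update, and rounding. Outputs: sign, exponent, and mantissa of C.]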
Comparator
• A = B, A > B, A < B
High speed comparator
• A single-cycle comparator based on the priority-encoding
algorithm and dynamic circuit design technique [Huang
2002]
• 4 steps:
1. An XOR gate is used to determine whether each corresponding bit of
the two numbers is equal or not.
2. A priority encoder is used to set the most significant unequal bit
of the result from step 1 to ‘1’ and reset all other bits to ‘0’.
3. The result of step 2 is “ANDed” with the two input numbers.
4. All the bits of the results of step 3 are “ORed” together to
determine which number is greater.
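The four steps can be modelled in Python for unsigned n-bit inputs. This is a behavioural sketch only, with a hypothetical function name; it says nothing about the dynamic circuit implementation.

```python
def compare(a, b, n):
    """Return (a_greater, b_greater) for unsigned n-bit a and b."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    # Step 1: XOR marks every bit position where the inputs differ.
    diff = a ^ b
    # Step 2: priority encode, keeping only the most significant unequal bit.
    top = 1 << (diff.bit_length() - 1) if diff else 0
    # Step 3: AND the one-hot result with each input.
    a_sel, b_sel = a & top, b & top
    # Step 4: OR all bits of each result; the input holding a 1 is greater.
    return a_sel != 0, b_sel != 0


print(compare(0b1010, 0b1001, 4))
```

Both outputs are false exactly when the inputs are equal, which gives the A = B case.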
Dynamic Priority Encoder
Critical path: 7 transistors because of the NAND gate implementation
Wide bit width comparator – 64 bits
• Hierarchical, multistage
• Phase pipelining to achieve single-clock operation
New comparator not using Priority encoder
• The new algorithm uses a parallel MSB checking
method instead of priority encoding to determine the
location of the most significant bit in which the two inputs
differ.
• This method facilitates the use of NOR-type
logic gates and results in faster speed for dynamic
logic implementation
New algorithm
• 4 steps
1. Both AB’ and A’B are computed. Unlike the original PE algorithm, which
uses XOR gates to find the bits in which A and B differ, this keeps the information
of which number is larger at each bit location. E.g. AB’ = 4’b0010
indicates that at bit 1, A is larger than B.
2. A data conversion (calculating A* and B*) is done to determine the most
significant bit that is a ‘1’ in each result of step 1. Unlike the
priority encoder, which sets the most significant 1-bit to ‘1’ and
resets all the other bits to ‘0’, we set all the bits preceding the most
significant 1-bit (not including the most significant 1-bit itself) to ‘1’ and
reset all the other bits to ‘0’. By doing so, the implementation can be
done using NOR-type dynamic logic.
3. We calculate (A*)’B* and A*(B*)’. If A* has a longer run of zeros,
A*(B*)’ will be all zero and (A*)’B* will have some bits equal to 1, and
vice versa.
4. We check whether each result of step 3 is an all-zero vector or not by
ORing all its bits together. A corresponding zero vector means that the
other input is the greater one.
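A behavioural Python sketch of the four steps for unsigned n-bit inputs. It assumes that the "preceding" bits are the bits more significant than the leading 1, which is the reading consistent with step 3; the helper names are hypothetical.

```python
def leading_ones(d, n):
    """Step 2 conversion: bit i is 1 iff d has no 1 at position i or above.

    Each output bit is a NOR over the input bits at positions >= i.
    """
    out = 0
    for i in range(n):
        if (d >> i) == 0:
            out |= 1 << i
    return out


def compare_new(a, b, n):
    """Return (a_greater, b_greater) for unsigned n-bit a and b."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    # Step 1: AB' marks bits where A is larger, A'B bits where B is larger.
    ab_not = a & ~b & mask
    not_a_b = ~a & b & mask
    # Step 2: convert both vectors to runs of leading 1's (A* and B*).
    a_star = leading_ones(ab_not, n)
    b_star = leading_ones(not_a_b, n)
    # Step 3: (A*)'B* is nonzero when A* has the longer run of zeros.
    a_wins = ~a_star & b_star & mask
    b_wins = a_star & ~b_star & mask
    # Step 4: OR the bits of each vector together.
    return a_wins != 0, b_wins != 0


print(compare_new(0b1010, 0b1001, 4))
```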
Implementation