
ECE 471 Multiplier-Adder Project – High
Frequency Design
Robert T. Rice with Brian Miller

Abstract—A multiplier-adder performs the very function that
its name suggests: a multiplication followed by an addition. This
form is very useful in many areas of electrical engineering,
especially in digital signal processing. This particular
implementation is equivalent to a single-tap finite impulse
response digital filter, where the current value is added to a
weighted (multiplied) version of its previous value.
I. INTRODUCTION
Multiplication cannot be performed by electronic
components in the traditional, decimal sense that we
apply to it. Although many of the techniques of standard
decimal multiplication still apply, logical operations
must take the place of standard, single-digit multiplication.
The key functional unit of a multiplier or multiple-bit adder
is the full-adder cell. It has three inputs and two outputs,
and generates the sum and carry-out bits of its three inputs.
In this design, each bit stage was pipelined using buffered
transmission gates operating off alternate phases of the
system clock.
The end goal of the design process was a system able to run
at a maximum frequency of 10GHz. A secondary consideration
was minimizing the power dissipated per GHz of operating
frequency.
To minimize the latency of the data output, each clocking
stage of the design operates off an alternate clock level, so
stage computations must finish in a half cycle or less.
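The full-adder cell described above can be summarized by its logic equations. The following is a minimal behavioral sketch in Python, not the transistor-level cell used in the design; the function name `full_adder` is an illustrative assumption.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One-bit full adder: three input bits in, (sum, carry_out) out."""
    s = a ^ b ^ cin                      # sum bit is the XOR of all inputs
    cout = (a & b) | (cin & (a ^ b))     # carry out when two or more inputs are 1
    return s, cout


# Exhaustive check over all eight input combinations: the two output
# bits, read as a 2-bit number, must equal the count of ones at the input.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            assert 2 * cout + s == a + b + cin
```

Because the carry-out depends only on the three local inputs, a pipeline latch can be placed after every such cell, which is what limits each stage to a single full-adder delay.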
II. UNSIGNED MULTIPLIER
The unsigned, 4-bit multiplier functions exactly as its name
implies. Two unsigned, 4-bit values are applied to the inputs,
and the output is a single 8-bit number that is the product of
the two inputs.
The structure of the multiplier is reminiscent of standard
binary add-and-shift multiplication. Simple two-input AND
gates generate the partial products, and full adders sum them.
There are seven levels of these computations, with pipelining
between each bit stage. With pipelining, the worst-case stage
delay is that of a single full adder and transmission gate,
which allows for very high frequency operation.
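The add-and-shift structure described above can be sketched behaviorally. The Python below is a combinational model only, under the assumption that each partial-product bit comes from a two-input AND and each row is accumulated through a chain of full adders. It does not model the pipeline latches or the exact cell arrangement of the schematic, and the names `full_adder` and `multiply_4x4` are illustrative.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One-bit full adder returning (sum, carry_out)."""
    return a ^ b ^ cin, (a & b) | (cin & (a ^ b))


def multiply_4x4(x: int, y: int) -> int:
    """Unsigned 4-bit by 4-bit multiply built from AND gates and full adders."""
    assert 0 <= x < 16 and 0 <= y < 16
    acc = [0] * 8                          # product bits, LSB first
    for j in range(4):                     # one row per bit of y
        yj = (y >> j) & 1
        carry = 0
        for i in range(4):                 # one column per bit of x
            pp = ((x >> i) & 1) & yj       # partial-product bit (AND gate)
            acc[i + j], carry = full_adder(acc[i + j], pp, carry)
        acc[j + 4] = carry                 # row carry-out becomes the next bit
    return sum(bit << k for k, bit in enumerate(acc))


# Check the model against integer multiplication for all 256 input pairs.
assert all(multiply_4x4(x, y) == x * y for x in range(16) for y in range(16))
```

The largest product, 15 x 15 = 225, fits in the 8-bit output, so no overflow handling is needed.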
Schematic Plot of the Unsigned Multiplier.
III. 8-BIT ADDER
The adder is positioned after the multiplier stages to receive
the product of the two multiplied operands, and to add that
product to a delayed value of one of the 4-bit operands
according to the function
$$S = Y \cdot X[n-1] + X[n] \tag{1.1}$$
Since the X operand is only four bits, the upper four bits of
the second input of the adder are grounded. The structure is a
simple ripple-carry adder. Much like the multiplier, each bit
stage is pipelined using a transmission gate. Each transmission
gate is separated by a buffer made from two double-size
inverters. The transmission gates themselves are double-sized
to improve signal transmission. A series of six delay stages
was added to the second operand (X[n]) to ensure that it would
arrive at the appropriate time.
Schematic Plot of the 8-Bit Adder.
There are 8 full adders used in the construction of this
adder, one for each bit.
Schematic Plot of the Full Adder.
IV. FULL SYSTEM
The full system includes three main top-level components:
the unsigned multiplier, the adder, and the delay cell. The
delay cell was needed to provide the appropriate delay for the
second operand relative to the product of the multiplier.
This system effectively implements a single-tap FIR filter
in which the previous value is weighted by a given coefficient
(the Y vector). There are two inputs of 4 bits each (X, Y) and
a single 8-bit sum output vector (S).
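The input/output behavior of the full system can be sketched as a short reference model. The Python below assumes that Y is held constant as the filter coefficient, that the delayed sample starts at zero, and that pipeline latency is ignored. These are modeling assumptions, and the name `multiplier_adder` is illustrative.

```python
def multiplier_adder(xs: list[int], y: int) -> list[int]:
    """Reference model of S[n] = Y * X[n-1] + X[n] for 4-bit X and Y."""
    assert 0 <= y <= 15
    prev_x = 0                      # assumed initial value of X[n-1]
    out = []
    for x in xs:
        assert 0 <= x <= 15         # 4-bit unsigned operand
        s = y * prev_x + x
        assert s <= 0xFF            # worst case 15*15 + 15 = 240 fits in 8 bits
        out.append(s)
        prev_x = x                  # current sample becomes the delayed one
    return out


print(multiplier_adder([3, 5, 2], y=4))   # [3, 17, 22]
```

The worst-case sum of 240 confirms that an 8-bit output never overflows for 4-bit operands.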
Schematic Plot of the System.
V. RESULTS
Simulation of this design used many different CMOS process
sizes, both to minimize the power dissipation as a function of
frequency and to explore exactly how fast the system was
capable of operating. The 250nm models were supplied by the
design tool used to create the schematics. Models using a
smaller process dimension were taken from predictive
technology models. Operating characteristics extracted from
simulations using the different processes are shown below.

Process  Delay  Area    Clock  Logic  Total  Power/Speed  Max. Speed
         (ps)   (mm^2)  (mW)   (mW)   (mW)   (mW/GHz)     (GHz)
250nm    5200   0.0370  450    45.00  495    321.4        1.54
130nm    4000   0.0093  130    3.60   133.6  53.4         2.50
90nm     1840   0.0046  60     0.90   60.9   14.0         4.34
65nm     1640   0.0023  55     0.40   55.4   10.0         5.55
32nm     1040   0.0012  32     0.90   32.9   4.3          7.69
22nm     1000   0.0006  29     1.50   30.50  3.8          8.00
Operating Characteristics of the Multiplier-Adder.

A. Maximum Operating Frequency
Maximum operating frequencies increased exponentially, as
expected, when moving to newer generation processes. The
plot below shows this effect.
Maximum Operating Frequencies by Process.
With smaller processes the distances between the transistors
fall; paths become shorter and may have less capacitance and
resistance, allowing for the faster transition times a high
operating frequency requires. There appears to be an inverse
exponential relationship between process size and maximum
operating frequency.
The difference between the 32nm and 22nm maximum
frequencies is relatively small, which is expected since the
two processes have significant amounts of variation in the
parameters. Indeed, the only apparent advantage in using a
smaller process, like 22nm, is the ability to place more logic
on die.
B. Power Consumption
Total power consumption was dominated by the clock
power dissipation, due to the size of the clock buffers. These
buffers were sized up from a nominal inverter to 512x to ensure
a fast rise time no different from that of the input signal.
Power vs. Frequency: Power (mW) against Frequency (GHz), with Total, Clock, and Logic curves.
Total Power Consumption by Frequency.
Total power consumption decreases as the clock frequency
increases, primarily because the minimum process length
decreases. There is also something to be said for the lower
supply voltages and lower drain currents, which have an
exponential effect on the power reduction. There appears to be
an asymptote to this power curve, and indeed the data showed
clock-dissipated power remaining relatively constant from the
65nm to 22nm processes.
The total power consumption may be called into question,
however, due to the very high power density. The power
dissipated per area is calculated to be approximately 1340
W/cm^2, which is roughly the same power output per area as a
rocket nozzle. This metric is even higher for the 22nm process.
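The derived figures quoted above (power per GHz and power density) follow directly from the tabulated area, total power, and maximum frequency. The short Python sketch below recomputes them from those table values; the helper names are illustrative assumptions, and only the unit conversions are added.

```python
# (process, area in mm^2, total power in mW, max frequency in GHz),
# copied from the operating-characteristics table.
ROWS = [
    ("250nm", 0.0370, 495.0, 1.54),
    ("130nm", 0.0093, 133.6, 2.50),
    ("90nm",  0.0046, 60.9,  4.34),
    ("65nm",  0.0023, 55.4,  5.55),
    ("32nm",  0.0012, 32.9,  7.69),
    ("22nm",  0.0006, 30.5,  8.00),
]


def power_per_ghz(total_mw: float, f_ghz: float) -> float:
    """Total power divided by maximum clock frequency, in mW/GHz."""
    return total_mw / f_ghz


def power_density_w_per_cm2(total_mw: float, area_mm2: float) -> float:
    """Power density in W/cm^2 (1 W = 1000 mW, 1 cm^2 = 100 mm^2)."""
    return (total_mw / 1000.0) / (area_mm2 / 100.0)


for name, area, total, fmax in ROWS:
    print(f"{name}: {power_per_ghz(total, fmax):6.1f} mW/GHz, "
          f"{power_density_w_per_cm2(total, area):7.0f} W/cm^2")
```

For the 250nm row this gives roughly 1338 W/cm^2, matching the approximate 1340 W/cm^2 quoted in the text, and the 22nm row comes out several times higher.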
C. System Operation
The two figures below show the correct operation of the
system. The first figure shows that applying two operands
(Y = 8 and X = 1, in decimal) to the system results in the
correct output of first 1 and then, one cycle later, 9. The
second plot shows the correct functionality of the circuit at a
clock frequency of 8GHz.
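The two output values follow from the system function S = Y * X[n-1] + X[n]. As a worked check, assuming the delayed sample is initially zero and both operands are then held constant:

```latex
\begin{aligned}
S[0] &= Y \cdot X[-1] + X[0] = 8 \cdot 0 + 1 = 1 \\
S[1] &= Y \cdot X[0] + X[1] = 8 \cdot 1 + 1 = 9
\end{aligned}
```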
Waveform Proof of Output Functionality.
Proof of Operating Frequency.
D. Device Layout
The regular structure of the adder allows for a relatively
compact layout, with the 250nm process yielding a 135um by
240um die area. This does not include the clock tree, however,
and the multiplier and delay stage were not included in the
layout.
There is a large amount of space between components in the
layout, and almost half the space is simply wasted. The area
could easily be halved with correct orientation and placement
of components.
Adder Layout.
E. Expected Results
The expected results were actually exceeded as far as
maximum operating frequency is concerned. It was expected
that the device would never exceed 5GHz due to the propagation
delay incurred by using transmission gates with low impedance
as latches. Since it was necessary to clock the latches
alternately every delay stage, the maximum delay the system
could have between latches was half a clock cycle, which
further reduced the maximum operating frequency. The devices
scaled very well, however, down to 32nm. After this point the
only notable gain was the reduced size of the logic area.
It was expected that the power used would be much lower
than it actually was. So much attention was paid to providing
the latches with a near-ideal clock edge that nearly 95% of the
power was consumed by the clock tree. Previous estimates of
the clock tree power dissipation were in the 40%-60% range.
VI. CONCLUSION
The best implementation of this system uses a 65nm
process. This design has the lowest level of logic power
dissipation, while still maintaining a maximum operating
frequency of 5.55GHz. At this frequency, the clock edge rise
times may be long enough to merit reducing the size of the
clock tree buffers, and thus reducing some of the wasted
power. The area of the die is significantly larger than that of
the 32nm and 22nm designs, but the design would be less
affected by process variation, and thus could be made to
tighter specifications.
Future development of this design would necessitate a
redesign of the clock buffering system, since simply too much
of the total power is dissipated by the clock tree. Some work
may be devoted to finding the optimum sizing of the CMOS
pull-up and pull-down paths, as these were oversized in this
particular design, at the cost of using unnecessary power.
The sole purpose of this design was to perform at a very
high clock frequency, which was accomplished. A sobering
factor, though, was the huge amount of power dissipated by
the designs because of this design choice.
REFERENCES
[1] F. Lu and H. Samueli, "A 200-MHz CMOS Pipelined Multiplier-Accumulator Using a Quasi-Domino Dynamic Full-Adder Cell Design," IEEE Journal of Solid-State Circuits, vol. 28, no. 2, Feb. 1993.
[2] P. Chiang and B. Nikolic, "Adders," ECE471, Feb. 2007.
Schematics
8-bit Pipelined Adder
CMOS Full Adder
CMOS Mirror Adder
4x4 Pipelined Multiplier
LVS Report
Net-list summary for /nfs/rack/u2/m/millerb3/cadence/LVS/layout/netlist
   count
   511      nets
   28       terminals
   483      pmos
   483      nmos

Net-list summary for /nfs/rack/u2/m/millerb3/cadence/LVS/schematic/netlist
   count
   511      nets
   28       terminals
   483      pmos
   483      nmos

                                  layout  schematic
   instances
      un-matched                  0       0
      rewired                     0       0
      size errors                 0       0
      pruned                      0       0
      active                      966     966
      total                       966     966
   nets
      un-matched                  0       0
      merged                      0       0
      pruned                      0       0
      active                      511     511
      total                       511     511
   terminals
      un-matched                  0       0
      matched but different type  0       0
      total                       28      28