ECE 471 Multiplier-Adder Project – High Frequency Design

Robert T. Rice with Brian Miller

Abstract: A multiplier-adder performs the function its name suggests: a multiplication followed by an addition. This form is useful in many areas of electrical engineering, especially digital signal processing. This particular implementation is equivalent to a single-tap finite impulse response digital filter, in which the current value is added to a weighted (multiplied) version of the previous value.

I. INTRODUCTION
Multiplication cannot be performed by electronic components in the traditional, decimal sense that we apply to it. Although many of the techniques of standard decimal multiplication still apply, logical operations must take the place of single-digit multiplication. The key functional unit of a multiplier or multiple-bit adder is the full-adder cell, which takes three inputs and generates two outputs: the sum bit and the carry-out bit of the three inputs.

In this design, each bit stage was pipelined using buffered transmission gates operating off alternate phases of the system clock. The end goal of the design process was a system able to run at a maximum frequency of 10GHz. A secondary consideration was minimizing the power dissipated per GHz. To minimize the latency of the data output, each clocking stage of the design operates off an alternate clock level, so stage computations must finish in a half cycle or less.

II. UNSIGNED MULTIPLIER
The unsigned 4-bit multiplier functions exactly as its name implies. Two unsigned 4-bit values are applied to the inputs, and the output is a single 8-bit number that is the product of the two inputs. The structure of the multiplier is reminiscent of standard binary add-and-shift multiplication.
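The add-and-shift scheme can be sketched at the bit level as follows. This is a behavioral illustration only, not the pipelined circuit: the function names (`full_adder`, `ripple_add`, `multiply_4x4`) are hypothetical, and accumulating one partial-product row at a time with a ripple adder is an assumption chosen for readability rather than a description of the schematic.

```python
def full_adder(a, b, cin):
    """One-bit full adder: three input bits in, (sum, carry_out) out."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_add(x_bits, y_bits):
    """Add two equal-length little-endian bit lists with a chain of full adders."""
    out, carry = [], 0
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out + [carry]  # final carry-out becomes the top bit


def to_bits(value, width):
    """Little-endian bit list of an unsigned integer."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def multiply_4x4(x, y):
    """Unsigned 4-bit by 4-bit add-and-shift multiply giving an 8-bit product."""
    x_bits, y_bits = to_bits(x, 4), to_bits(y, 4)
    acc = [0] * 8
    for i, y_bit in enumerate(y_bits):
        # Partial-product row: AND every x bit with one y bit, shifted left by i.
        row = [0] * i + [xb & y_bit for xb in x_bits] + [0] * (4 - i)
        # The product of two 4-bit values fits in 8 bits, so the ninth bit is always 0.
        acc = ripple_add(acc, row)[:8]
    return from_bits(acc)
```

Each AND in the row construction corresponds to one two-input AND gate, and each call to `full_adder` corresponds to one full-adder cell; the hardware evaluates these in pipelined stages rather than in a software loop.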
Simple two-input AND gates and full adders are used to compute the propagations (partial products) and the sums of those propagations. There are seven levels of these computations, which are pipelined between each bit stage. The pipelining limits the worst-case delay to that of a single full adder and transmission gate, allowing for very high frequency design.

[Figure: Schematic Plot of the Unsigned Multiplier.]

III. 8-BIT ADDER
The adder is positioned after the multiplier stages to receive the product of the two multiplied operands and to add that product to a delayed value of one of the 4-bit operands according to the function

$$S = Y \cdot X[n-1] + X[n] \tag{1.1}$$

Since the X operand is only four bits, the upper four bits of the second input of the adder are grounded. The structure is a simple ripple carry adder. Much like the multiplier, each bit stage is pipelined using a transmission gate. Each transmission gate is separated by a buffer made from two double-size inverters. The transmission gates themselves are double-sized to improve signal transmission. A series of six delay stages was added to the second operand (X[n]) to ensure that it would arrive at the appropriate time. There are 8 full adders used in the construction of this adder, one for each bit.

[Figure: Schematic Plot of the 8-Bit Adder.]

[Figure: Schematic Plot of the Full Adder.]

IV. FULL SYSTEM
The full system includes three main top-level components: the unsigned multiplier, the adder, and the delay cell. A delay cell was needed to provide the appropriate delay for the second operand relative to the product of the multiplier. This system effectively implements a single-tap FIR filter in which the previous value is weighted by a given coefficient (the Y vector). There are two 4-bit inputs (X, Y) and a single 8-bit sum output vector (S).

[Figure: Maximum Operating Frequencies by Process.]
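The function in (1.1) can be checked with a small behavioral model. This is a sketch under stated assumptions, not the circuit itself: the name `multiplier_adder` is hypothetical, the delayed operand is assumed to start at zero, and the pipeline latency of the real hardware is ignored so that each output lines up with the input sample that produced it.

```python
def multiplier_adder(x_samples, y, x_prev=0):
    """Behavioral model of S = Y * X[n-1] + X[n] for unsigned 4-bit X and Y.

    x_prev is the assumed initial content of the delay cell (zero by default).
    """
    if not 0 <= y <= 15:
        raise ValueError("Y must be an unsigned 4-bit value")
    outputs = []
    for x in x_samples:
        if not 0 <= x <= 15:
            raise ValueError("X must be an unsigned 4-bit value")
        s = y * x_prev + x          # multiply the delayed sample, add the current one
        outputs.append(s & 0xFF)    # 8-bit output; the maximum, 15*15 + 15 = 240, fits
        x_prev = x                  # the current sample becomes the delayed one
    return outputs
```

With Y = 8 and X held at 1, `multiplier_adder([1, 1], 8)` returns 1 for the first sample and 9 one sample later, which is the sequence reported for the system under System Operation.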
[Figure: Schematic Plot of the System.]

V. RESULTS
Simulating this design required several different CMOS process sizes, both to minimize the power dissipation as a function of frequency and to explore exactly how fast the system was capable of operating. The 250nm models were supplied by the design tool used to create the schematics. Models for smaller process dimensions were taken from predictive technology models. Operating characteristics extracted from simulations using the different processes are shown below.

Process   Delay   Area     Clock Power   Logic Power   Total Power   Power/Speed   Max. Speed
          (ps)    (mm^2)   (mW)          (mW)          (mW)          (mW/GHz)      (GHz)
250nm     5200    0.0370   450           45.00         495           321.4         1.54
130nm     4000    0.0093   130           3.60          133.6         53.4          2.50
90nm      1840    0.0046   60            0.90          60.9          14.0          4.34
65nm      1640    0.0023   55            0.40          55.4          10.0          5.55
32nm      1040    0.0012   32            0.90          32.9          4.3           7.69
22nm      1000    0.0006   29            1.50          30.50         3.8           8.00

Operating Characteristics of the Multiplier-Adder.

A. Maximum Operating Frequency
Maximum operating frequencies increased exponentially, as expected, when moving to newer generation processes. The plot of maximum operating frequencies by process shows this effect.

With smaller processes the distances between transistors fall, so paths become shorter and may have less capacitance and resistance, allowing the faster transition times that a high operating frequency requires. There appears to be an inverse exponential relationship between process size and maximum operating frequency. The difference between the 32nm and 22nm maximum frequencies is relatively small, which is expected since the two processes have significant amounts of variation in the parameters. Indeed, the only apparent advantage of a smaller process such as 22nm is the ability to place more logic on the die.

B. Power Consumption
Total power consumption was dominated by the clock power dissipation, due to the size of the clock buffers. These buffers were sized from a nominal inverter up to 512x to ensure a fast rise time no different from that of the input signal.

[Figure: Total Power Consumption by Frequency. Plot titled "Power vs. Frequency" showing total, clock, and logic power (mW) against frequency (GHz).]

Total power consumption decreases as the clock frequency increases, primarily because the minimum process length decreases. There is also something to be said for the lower supply voltages and lower drain currents, which have an exponential effect on the power reduction. There appears to be an asymptote to this power curve; indeed, the data showed clock power dissipation remaining relatively constant from the 65nm to the 22nm process.

The total power consumption may be called into question, however, due to the very high power density. The power dissipated per area for the 250nm process is calculated to be approximately 1340 W/cm2, which is roughly the same power output per area as a rocket nozzle. This metric is even higher for the 22nm process.

C. System Operation
The two figures below show the correct operation of the system. The first shows that applying two operands (Y = 8 and X = 1, in decimal) to the system results in the correct output of first 1 and then, one cycle later, 9. The second shows the correct functionality of the circuit at a clock frequency of 8GHz.

[Figure: Waveform Proof of Output Functionality.]

[Figure: Proof of Operating Frequency.]

D. Device Layout
The regular structure of the adder allows for a relatively compact layout, with the 250nm process yielding a 135um by 240um die area. This does not include the clock tree, however, and the multiplier and delay stage were not included in the layout. There is a large amount of space between components in the layout, and almost half the space is simply wasted. The area could easily be halved with correct orientation and placement of components.

E. Expected Results
The expected results were actually exceeded as far as maximum operating frequency is concerned. It was expected that the device would never exceed 5GHz due to the propagation delay incurred by using transmission gates with low impedance as latches. Since it was necessary to clock the latches alternately at every delay stage, the maximum delay the system could have between latches was half a clock cycle, which further reduced the maximum operating frequency. The devices scaled very well, however, up to 32nm. After this point the only notable gain was the reduced size of the logic area.

It was expected that the power used would be much lower than it actually was. So much attention was paid to providing the latches with a near-ideal clock edge that nearly 95% of the power was consumed by the clock tree. Previous estimates of the clock tree power dissipation were in the 40%-60% range.

VI. CONCLUSION
The best implementation of this system uses a 65nm process. This design has the lowest logic power dissipation while still maintaining a maximum operating frequency of 5.55GHz. At this frequency, the clock edge rise times may be long enough to merit reducing the size of the clock tree buffers, thus reducing some of the wasted power. The die area is significantly larger than that of the 32nm and 22nm designs, but the design would be less affected by process variation, and thus could be made to tighter specifications.

Future development on this design would necessitate a redesign of the clock buffering system, since simply too much of the total power is dissipated by the clock tree. Some work may be devoted to finding the optimum performance point for the sizing of the CMOS pull-up and pull-down paths, as these were oversized in this particular design at the cost of unnecessary power.

The sole purpose of this design was to perform at a very high clock frequency, which was accomplished.
A sobering factor, though, was the huge amount of power dissipated by the designs because of this design choice.

REFERENCES
[1] F. Lu and H. Samueli, "A 200-MHz CMOS Pipelined Multiplier-Accumulator Using a Quasi-Domino Dynamic Full-Adder Cell Design," Journal of Solid-State Circuits, vol. 28, no. 2, Feb. 1993.
[2] P. Chiang and B. Nikolic, "Adders," ECE471, Feb. 2007.

[Figure: Adder Layout.]

Schematics

[Figure: 8-bit Pipelined Adder]

[Figure: CMOS Full Adder]

[Figure: CMOS Mirror Adder]

[Figure: 4x4 Pipelined Multiplier]

LVS Report

Net-list summary for /nfs/rack/u2/m/millerb3/cadence/LVS/layout/netlist
    count
    511     nets
    28      terminals
    483     pmos
    483     nmos

Net-list summary for /nfs/rack/u2/m/millerb3/cadence/LVS/schematic/netlist
    count
    511     nets
    28      terminals
    483     pmos
    483     nmos

                                  layout   schematic
instances
    un-matched                    0        0
    rewired                       0        0
    size errors                   0        0
    pruned                        0        0
    active                        966      966
    total                         966      966
nets
    un-matched                    0        0
    merged                        0        0
    pruned                        0        0
    active                        511      511
    total                         511      511
terminals
    un-matched                    0        0
    matched but different type    0        0
    total                         28       28
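The derived figures quoted in the Results section (power per GHz, the roughly 1340 W/cm2 power density, and the half-cycle timing budget implied by alternate-phase clocking) can be rechecked with a few lines of arithmetic. This is a minimal sketch: the function names and the `ROWS` layout are hypothetical, and the values are copied from the Operating Characteristics table, with 7.69 GHz taken as the 32nm maximum speed because it is the value consistent with the listed 4.3 mW/GHz.

```python
# (process, area in mm^2, total power in mW, maximum speed in GHz)
ROWS = [
    ("250nm", 0.0370, 495.0, 1.54),
    ("130nm", 0.0093, 133.6, 2.50),
    ("90nm", 0.0046, 60.9, 4.34),
    ("65nm", 0.0023, 55.4, 5.55),
    ("32nm", 0.0012, 32.9, 7.69),
    ("22nm", 0.0006, 30.5, 8.00),
]


def power_per_ghz(total_mw, f_ghz):
    """Total power divided by maximum speed, in mW/GHz."""
    return total_mw / f_ghz


def power_density_w_per_cm2(total_mw, area_mm2):
    """Power per unit area in W/cm^2 (1 W = 1000 mW, 1 cm^2 = 100 mm^2)."""
    return (total_mw / 1000.0) / (area_mm2 / 100.0)


def half_cycle_ps(f_ghz):
    """Time available to one pipeline stage clocked on alternate phases, in ps."""
    return 1000.0 / (2.0 * f_ghz)


if __name__ == "__main__":
    for process, area, total, speed in ROWS:
        print(
            process,
            round(power_per_ghz(total, speed), 1),
            round(power_density_w_per_cm2(total, area)),
            round(half_cycle_ps(speed), 1),
        )
```

For the 250nm row this gives about 321.4 mW/GHz and about 1338 W/cm2, in line with the reported values, and a 10GHz target leaves each stage a 50 ps half cycle.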