HIGH SPEED CONVOLUTION USING RESIDUE NUMBER SYSTEMS

by

KURT ANTHONY LOCHER

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degrees of Bachelor of Science in Electrical Engineering and Master of Science in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, January 1989.

© 1989 Kurt A. Locher. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute copies of this thesis document in whole or in part.

Signature of Author: Department of Electrical Engineering and Computer Science, January 21, 1989
Certified by: Bruce R. Musicus, Thesis Supervisor (Academic)
Certified by: Paul Hogan, Thesis Supervisor (Raytheon Corporation)
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students

HIGH SPEED CONVOLUTION USING RESIDUE NUMBER SYSTEMS

by

KURT ANTHONY LOCHER

Submitted to the Department of Electrical Engineering and Computer Science on January 23, 1988 in partial fulfillment of the requirements for the degrees of Bachelor of Science in Electrical Engineering and Master of Science in Electrical Engineering and Computer Science

Abstract

This thesis investigates VLSI architectures for high speed convolution using residue number system techniques. The focus is on hardware speed and size. Although the focus is the standard residue number system, the same concepts can be directly applied to quadratic residue number system designs. Two classes of residue number system architectures are presented in detail, with several designs developed in each class. One of the largest disadvantages of the residue number system is the computational overhead associated with the conversion into and out of the residue representation; detailed implementations of the conversion hardware are investigated as well. Using speed and size as the two parameters by which to judge a design, a design aid was developed that prunes the design space, leaving a core group of designs with optimal speed/size combinations in each class. To provide a comparison, several conventional binary implementations are developed that use the same architectural optimizations as the residue number system designs.

Thesis Supervisor: Bruce R. Musicus
Title: Assistant Professor of Electrical Engineering
Acknowledgements

First, I would like to thank Bruce Musicus for all of the great ideas that flowed during our discussions on residue architectures. I'm sure that I would not have gotten as far if not for his inspiration. I would also like to thank Frank Horrigan for introducing me to this subject and Paul Hogan for agreeing to supervise me on the Raytheon side. Finally, to Claire, who put up with me through the entire process: I'm finally finished...

Table of Contents

List of Abbreviations
List of Figures
Chapter 1  Introduction
Chapter 2  FIR Filter Background
    Conventional Implementations
        Direct Form
        Transpose Form
    Systolic Data Flow Graph Architectures
    Bitwise Decomposition
Chapter 3  RNS Background
    Basic Residue Arithmetic Units
        Residue Adders
        Residue Multiply by 2 Block
        Residue Multipliers
        Residue General Function Units
    Problems with RNS
        Conversion into and out of residue representation
        Scaling and Magnitude Comparison
    Processing Complex Quantities (QRNS)
Chapter 4  Modular Efficient RNS FIR Filter
    Residue FIR filter
        Brute Force
        Coefficient Decomposition
            Base 2 - Bitwise
            Balanced Ternary
            Offset Radix 4
            Balanced Quinary
            Subtap Summary
            Scaling and Summing
        New Algorithm with Fewer Buses
            The Algorithm
            The Hardware
                Balanced Ternary
                Balanced Quinary
                Modified Balanced Septary
                Subtap Summary
        Putting it all Together
    Binary to RNS Conversion Block
        Table Lookup Approach
        No Table Approach
    Residue to Binary Conversion
        Chinese Remainder Theorem
        Mixed Radix Conversion
            The Algorithm
            The Hardware
Chapter 5  Design Aid
    Moduli Selection Algorithm (Basic RNS)
    The Algorithm
    Discussion
Chapter 6  Standard Binary Arithmetic with Pipelining
    Development of architecture
        Filter Tap
        Binary Subtap
        Balanced Ternary
        Higher Radices
        Shift Add Reconstruction
    Hardware Required/Contrast with RNS architecture
    General Discussion
Chapter 7  Conclusion
Appendix 1  Dynamic Range for Optimum Moduli Sets
Appendix 2  Design Aid Code
References
List of Abbreviations

CRT     Chinese Remainder Theorem
FIR     finite impulse response (filter)
IIR     infinite impulse response (filter)
MSI     medium scale integration
MUX     multiplexor
RNS     residue number system
SSI     small scale integration
VLSI    very large scale integration
WSI     wafer scale integration
XNOR    exclusive nor (gate)
XOR     exclusive or (gate)

List of Figures

Figure 1   Direct Form FIR Filter (N=4)
Figure 2   Binary Tree of Adders
Figure 3   Transpose Form FIR Filter
Figure 4   Data Flow Graph for 4 point Convolution
Figure 5   Direct Form Systolic Sweep
Figure 6   Transpose Form Systolic Sweep
Figure 7   Multiply Accumulate Systolic Sweep
Figure 8   Bitwise FIR Filter
Figure 9   Subtap for Bitwise FIR Filter
Figure 10  ROM residue adder
Figure 11  b bit adder + ROM
Figure 12  Graphic Example of cases of residue sum
Figure 13  binary adder and conditional subtractor
Figure 14  final residue adder
Figure 15  preadd p to one of the inputs before adding
Figure 16  modulo accumulator
Figure 17  Multiply by 2 block
Figure 18  Binary shift+add multiplier
Figure 19  3x3 array multiplier
Figure 20  Enhanced Multiply by 2 Block (2x, 2x+p)
Figure 21  Residue Conditional Accumulator #1
Figure 22  Residue Conditional Accumulator #2
Figure 23  General Function of two variables (1 bit)
Figure 24  General Function of two variables (2 bit)
Figure 25  Block Diagram of an FIR Filter
Figure 26  Balanced Ternary Subtap
Figure 27  Balanced Quaternary Subtap (positive)
Figure 28  Balanced Quaternary Subtap (negative)
Figure 29  Balanced Quinary Subtap (positive)
Figure 30  Balanced Quinary Subtap (negative)
Figure 31  Size vs b -- Biased/Unbiased
Figure 32  Transistors vs b -- Biased/Unbiased
Figure 33  Computational Procedure #1
Figure 34  Computational Procedure #2
Figure 35  Horner's Algorithm #1
Figure 36  Horner's Algorithm #2
Figure 37  Latency 2, multiply by 3
Figure 38  General Block Diagram of New Algorithm
Figure 39  b+1 bit MUX
Figure 40  Balanced Ternary (New Algorithm)
Figure 41  Balanced Quinary (New Algorithm)
Figure 42  Balanced Septary (New Algorithm)
Figure 43  Size vs b -- New Algorithm
Figure 44  Transistors vs b -- New Algorithm
Figure 45  Normalization algorithm for final Algorithm
Figure 46  Block Diagram of a norm box
Figure 47  Binary to RNS, large table lookup
Figure 48  Top Level Block Diagram of no lookup conversion
Figure 49  Top Level Block Diagram of mixed radix conversion
Figure 50  Residue Subtractor
Figure 51  6 bit Adder/Register Combination
Figure 52  6 bit Pipelined Subtap Stage
Figure 53  Binary adder/register combination
Figure 54  Input Stage

Chapter 1

Introduction

Residue Number Systems (RNS) use an alternate number representation to add parallelism to certain computations. RNS was first investigated by mathematicians in the 1800's, and the basic ideas of applying the number theory to digital computation have been common since the early 1950's. The most conclusive book on the subject, by Szabo and Tanaka, was published in 1967. Recently there has been a renewed interest in RNS, not because of significant advances in the theory, but as a result of new circuit technology.
Current VLSI and WSI technologies allow very high circuit density, allowing the RNS concepts to be applied in advanced integrated circuit designs. These technologies also open the possibility of applying the same architectural concepts to other, non-RNS binary designs. Unfortunately, the conversion processes to and from binary add hardware overhead and latency, so the computation must be very large for the benefits of RNS to overcome the overhead of conversion into and out of the residue representation. There are a few applications for which RNS is ideally suited; the most common is the evaluation of a long convolution sum, or equivalently an FIR filter. Frequently, convolution is used for both filtering and correlation in radar, sonar, and communications applications. As a result of the speed limitations of digital circuitry, the operation is frequently performed using analog techniques. Maybe RNS can change this.

This thesis will focus on the RNS implementation of the FIR filter. A significant amount of effort is aimed toward creating a design aid that searches the possible architectures, finding an optimum design for a given convolution problem. The user will be able to input the dynamic range of the filter coefficients and the input, and the length of the filter, and the design aid will return the optimum set of designs. The design aid itself is a simple exercise in programming a search algorithm; most of the effort is in creating a sufficiently rich set of implementations that provide some variety in speed and hardware size.

Chapter 2

FIR Filter Background

A finite impulse response (FIR) filter performs a convolution of an input signal with a fixed (finite length) system impulse response. Mathematically, the expression for a length N discrete time convolution is written as follows:
y[n] = Σ_{i=0}^{N-1} h[i] x[n-i]        (1)

This operation is commonly denoted shorthand by the expression

y[n] = x[n] * h[n]

where x[n] is the input sequence, h[n] is the system impulse response, and y[n] is the output sequence. This convolution becomes an extremely computation intensive operation for large N because N multiplies are needed to compute each output point.

Finite impulse response filters have a number of advantages over their close cousins, infinite impulse response (IIR) filters.1 First, FIR filters can be realized nonrecursively, which guarantees a stable implementation. Although stability appears to be a filter design problem that could be solved on paper, IIR filters that are stable on paper can become unstable because of the limited dynamic range of finite register lengths or coefficient truncation. Second, FIR filters can be designed to have linear phase. In radar and communications applications the frequency dispersion caused by the nonlinear phase of an IIR filter can be harmful to the performance of the system. Finally, roundoff noise caused by finite register lengths can be minimized with an FIR structure.

In order to obtain the good properties of an FIR filter, however, one gives up the degrees of freedom afforded by the recursive coefficients of an IIR filter (see prior footnote; the FIR filter is actually a constrained version of the IIR filter). To obtain a similar frequency magnitude response from an FIR filter, large values of N are needed relative to the order of a similar IIR filter. As a result it has been proposed that FIR filters be implemented in the

1 The output of an IIR filter depends on past values of the output as well as past values of the input. The difference equation for an IIR filter is commonly written as follows:

y[n] = - Σ_{i=1}^{N-1} a_i y[n-i] + Σ_{i=0}^{M-1} b_i x[n-i]

The recursive nature of this computation forces any implementation to be recursive and causes stability to be an issue.
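As a concrete reference for equation (1), a direct evaluation can be sketched in a few lines (an illustrative Python sketch, not part of the thesis; the function name is my own):

```python
def fir_direct(x, h):
    """Evaluate y[n] = sum_{i=0}^{N-1} h[i] * x[n-i]  (equation 1).

    x is the input sequence and h is the length-N impulse response.
    Only fully-overlapped output points are produced, so each y[n]
    costs N multiplies and N-1 adds, as discussed in the text.
    """
    N = len(h)
    return [sum(h[i] * x[n - i] for i in range(N))
            for n in range(N - 1, len(x))]

# a length-4 (N=4) example, matching the filter length used in the figures:
# fir_direct([1, 0, 0, 0, 5], [1, 2, 3, 4])  ->  [4, 5]
```

The nested loop makes the N multiplies per output point explicit; every hardware structure in this chapter is a different scheduling of exactly these operations.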
frequency domain using an FFT algorithm to reduce the processing requirements. Unfortunately, several of the good properties of an FIR filter are not completely preserved if an FFT implementation is used.

Conventional Implementations

If an FIR filter is going to be implemented in the time domain, the convolution sum has to be computed. The computation requires at least N multiplies and N-1 adds to calculate each output point. In the following discussion I will assume that a sufficient number of multipliers and adders are available to perform the entire computation each time cycle. FIR filters could, however, be implemented with fewer arithmetic units by time multiplexing1; these designs follow directly from the complete designs and only add unnecessary complication to the discussion.

Because there are a large number of ways to compute a convolution sum, I will discuss the two most common general forms in some detail and then give an introduction to systolic architecture design to characterize the other possibilities.

Direct Form

The direct form FIR filter architecture is the result of bluntly translating equation (1) into hardware. A chain of delay registers forms a first-in-first-out (FIFO) shift register that stores the previous N-1 values of the input. These N-1 delayed values of the input, along with the current value of the input, are multiplied by the appropriate weights to form the necessary quantities, which are summed using a minimum of N-1 two input adders. The total design contains N multipliers and N-1 adders without a time multiplexing scheme. A length 4 (N=4) direct form filter is shown in figure 1.

1 A time multiplexed design uses the same arithmetic unit(s) more than once per time period with different data.
For example, an FIR filter of arbitrary length N could be designed using only one multiplier and one adder; each time period would then be, at minimum, greater than N multiply times. Because we are aiming for the highest throughput possible, it is safe to assume that time multiplexing is not a viable option.

[Figure 1: Direct Form FIR Filter -- four processing elements, each with a delay register and a coefficient multiplier (h[0] through h[3]), feeding a chain of adders that produces y[n].]

The figure shows N-1 adders in a linear chain for clarity; an actual design would probably use a binary tree of adders (figure 2). The obvious advantage of a tree structure is its reduced latency. The latency of a linear chain is N-1 adder delays; for a tree structure the latency is ⌈log2 N⌉, where ⌈x⌉ denotes the smallest integer greater than or equal to x. The less apparent advantage is that a tree structure is more easily pipelined. Pipeline registers can be placed on each level of the tree (where the dashed line crosses the data paths in the figure). If a register is placed anywhere between adders in the linear chain, an additional register is needed in all successive adders to delay multiplier results until the correct partial sum arrives.

[Figure 2: Binary Tree of Adders]

The tree adder appears to be an obvious choice; however, there is an equal and opposite penalty attached to the bonuses. The tree adder removes the regular structure that is shown in figure 1. To add an additional stage to the implementation in figure 1, an additional stage is merely attached to the end. To extend a design that uses a tree adder, the entire adder structure must be modified; in this case, a new level must be affixed to the tree. As discussed in the section on systolic architectures, pipelining with the direct form implementation is not necessarily optimal for all VLSI designs; however, it is interesting to note that LSI Logic chose the direct form to implement their new FIR filter chip.
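The ⌈log2 N⌉ latency claim can be checked with a small software model (an illustrative Python sketch; `tree_sum` is a hypothetical helper, not from the thesis):

```python
def tree_sum(vals):
    """Sum a list by pairwise (binary tree) reduction.

    Returns the total and the number of adder levels used, which is
    ceil(log2(N)) rather than the N-1 adder delays of a linear chain.
    """
    levels = 0
    while len(vals) > 1:
        # combine neighbours; an odd element passes through unchanged
        vals = [vals[i] + vals[i + 1] if i + 1 < len(vals) else vals[i]
                for i in range(0, len(vals), 2)]
        levels += 1
    return vals[0], levels

# five multiplier outputs need ceil(log2(5)) = 3 adder levels:
# tree_sum([3, 1, 4, 1, 5])  ->  (14, 3)
```

Each pass of the loop corresponds to one level of the tree in figure 2, which is also where a pipeline register would be placed.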
Transpose Form

The second most common FIR filter design is the transpose form architecture shown in figure 3. At first it is not at all obvious that the two architectures perform the same algorithm, but an easy way to show this is to assume the architecture as drawn is correct and reverse engineer the equation performed by it. Letting the partial sum out of the ith processing element be denoted p_i[n-1], we can form the following equation for each processing element:

p_i[n-1] = p_{i-1}[n-2] + h[N-i] * x[n]        (2)

Starting at the first processing element and proceeding inductively, we have

p_1[n-1] = h[N-1] * x[n]
p_2[n-1] = p_1[n-2] + h[N-2] * x[n] = h[N-1] * x[n-1] + h[N-2] * x[n]
...
p_i[n-1] = Σ_{j=0}^{i-1} h[N-i+j] * x[n-j]

Finally, setting y[n-1] = p_N[n-1] yields the convolution expression in equation (1). Overall, the transpose form uses more hardware than the basic direct form because the delay registers must be large enough to contain the sums of the scaled versions of the input. Assuming that the coefficients are represented by the same number of bits as the input, the registers will be twice as wide. If pipeline registers are included in the adder tree of a direct form implementation to achieve a similar throughput, the hardware requirements become very similar. The different connection scheme does, however, significantly change the properties of the design. The largest difference is that all additions are localized so that no "global" N input addition is needed.

[Figure 3: Transpose Form FIR Filter -- the input x[n] is broadcast to all four processing elements (coefficients h[3] down to h[0]); partial sums flow through the delay registers toward the output.]

One result is that the transpose form can be simply expanded by attaching additional stages to the end of the linear structure.
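The recursion in equation (2) can be modeled directly in software (an illustrative Python sketch under my own register-update conventions, not the thesis's hardware description):

```python
def fir_transpose(x, h):
    """Transpose-form FIR filter model.

    Tap 1 multiplies by h[N-1] and each following tap adds its product
    to the delayed partial sum from its neighbour, per equation (2).
    All additions are local; only the input x[n] is broadcast to every tap.
    With zero initial state the full convolution (including the start-up
    transient) appears at the last tap.
    """
    N = len(h)
    p = [0] * N                      # partial-sum delay registers
    out = []
    for xn in x:
        # update back-to-front so each tap reads its neighbour's
        # previous-cycle value, as a register chain would
        for i in range(N - 1, 0, -1):
            p[i] = p[i - 1] + h[N - 1 - i] * xn
        p[0] = h[N - 1] * xn
        out.append(p[N - 1])
    return out

# fir_transpose([1, 0, 0, 0, 5], [1, 2, 3, 4])  ->  [1, 2, 3, 4, 5]
```

The steady-state points agree with a direct evaluation of equation (1), which is the equivalence the induction above establishes.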
In addition, the transpose form can add pipeline registers without additional delay; the direct form can achieve this only by adding pipeline registers to the adder structure. As always there is a cost for these benefits; in this case it is that the input must be broadcast to each tap of the filter. However, as discussed later, this cost may not be too severe.

Systolic Data Flow Graph Architectures

Because of the high degree of regularity in the computations required, systolic architectures provide a good framework within which to investigate FIR filter designs. In addition, the concepts involving systolic design provide a general methodology for evaluating the structure of other possible architectures: a general technique for mapping a computational problem into a regular architecture, against which other designs can be compared.

Systolic architectures became very popular in the late 1970's and early 1980's because of work done at Carnegie-Mellon by H.T. Kung. The basic philosophy of a systolic design is a rhythmic flow of data through a series of
interconnected processing elements (PE's). A good analogy for a systolic architecture is a production line1 where partial results move regularly from one worker to another, each of whom adds something to the final result. The goal is to achieve 100% efficiency (each PE occupied at all times), a simple data flow between PE's, a simple control structure, and a modularly expandable design (preferably no control at all). All of these characteristics minimize design time and maximize performance. These properties are ideal for VLSI and wafer scale integration2 implementations where control and interconnection requirements become very important.

In the early days of systolic architectures there was a drive to push more and more complex problems into the systolic computational model; however, it is now generally recognized that systolic architectures are better applied to simple problems requiring very regular primitive computations. More complex problems, such as those requiring a more general dataflow structure, are better mapped into a wavefront processing system3 or some other selectively interconnected architecture. Because signal processing computations are generally simple and regular, there are a number of problems that can be examined within the systolic framework. One of these is the convolution sum.

Because there are an infinite number of systolic architectures that could be used to solve a particular problem, a data flow graph can be used to visualize the characteristics of the different possible designs. The idea behind a data flow graph is to list all of the primitive operations of the larger computation to be performed in a geometrically regular grid. As an example, the data flow graph for a four point convolution is shown in figure 4. All of the computation for a single output point is listed on one line; the computation for the next (chronological) output point is listed on the next (consecutive) line.

1 Analogy due to H.T. Kung, one of the fathers of systolic architectures
2 Wafer Scale Integration (WSI) is a fabrication technique that uses an entire silicon wafer to provide ultrahigh circuit densities. Because the yields for wafer size designs would be unacceptably low, extra processing elements are included, and the top level of metalization is configured to allow selective interconnection of processing elements.
3 Kung, S.Y. VLSI Array Processors, IEEE ASSP Magazine, July 1985, Vol. 2, No.
3, pp 4-22.

y[3] = h[0]x[3] + h[1]x[2] + h[2]x[1] + h[3]x[0]
y[4] = h[0]x[4] + h[1]x[3] + h[2]x[2] + h[3]x[1]
y[5] = h[0]x[5] + h[1]x[4] + h[2]x[3] + h[3]x[2]
y[6] = h[0]x[6] + h[1]x[5] + h[2]x[4] + h[3]x[3]
y[7] = h[0]x[7] + h[1]x[6] + h[2]x[5] + h[3]x[4]
y[8] = h[0]x[8] + h[1]x[7] + h[2]x[6] + h[3]x[5]

Figure 4  Data Flow Graph for 4 point Convolution

Once the primitive operations have been laid out in a geometrically regular manner, the available processors can be assigned to the operations required at each time step. Every acceptable pattern of processing includes the initial processor placement on the data flow graph and the sweep timing that defines which computations the processors perform at each step. If the processors map to computations on the data flow graph in a regular pattern and are swept across the graph in a linear pattern, the resulting architecture will have a simple structure and a simple data flow between processors. Each different placement and sweep defines a different architecture; some of these obviously may be more desirable than others.

[Figure 5: Direct Form Processor Arrangement and Sweep -- PE 1 through PE 4 placed horizontally on one row of the data flow graph, swept downward.]

An example of a processor arrangement and sweep is shown in figure 5. All of the processing for a single output point is performed in a single time period. Each of the PE's, multipliers in this case, computes one of the four multiplies; an additional adder is needed to sum up the results of the four PE's. Moving the row of processors down the data flow graph by one line (one time step) causes the coefficient, h[i], in each processor to remain the same and causes the delayed versions of the input to shift right one PE as a new input point enters PE #1. An architecture that has these properties has already been described as the direct form implementation.

[Figure 6: Transpose Form Processor Arrangement and Sweep -- PE 1 through PE 4 placed along a diagonal of the data flow graph, swept downward.]

Another example processor arrangement is shown in figure 6.
Because the processors are arranged in a diagonal and the sweep direction is downward, the processors shown in figure 6 are all operating on the same input point at the same step of time, and the coefficients remain attached to their respective processors. Also, the computation for a particular output point is spread across time: PE #1 does the first multiply at time i; PE #2 does the second multiply at time i+1; PE #3 does the third multiply at time i+2; and PE #4 does the final multiply at time i+3.1 If PE #1 passes the result of the first multiply to PE #2 at time i+1, the two products can be summed to form a partial result that is passed to PE #3 at time i+2. In this way the four PE's, which would be multiply-adders, operate on four different output points at one particular time, with results exiting PE #4. This architecture is the same as the transpose form that was described before and shown in figure 3.

At this point it is possible to characterize architectures by examining the data flow graph directly rather than by trial and error. The placement of the processors on the data flow graph determines how the input and/or delayed versions of the input must be made available to the computational elements. Two examples have already been seen in the previous processor organizations. A horizontal processor arrangement requires the previous N delayed versions of the input to be made available, with the current input always used by PE #1 and the oldest version of the input always used by PE #4. A diagonal processor arrangement with slope = -1 will result in the input being broadcast to all processing elements, PE #1 through PE #4. Other arrangements will have different requirements on the input. For example, a diagonal processor arrangement with slope = 1 would require 2N-1 = 7 delayed versions of the input to be available, with every other one used at any particular time.

1 At this point the reader should be thankful that a longer FIR filter is not being used as an example
While the processor layout determines the requirements on the input, the processor layout and the sweep direction in tandem determine the requirements on the coefficients. Both the direct form and the transpose form had the coefficients attached to their respective processing elements through each time step. In both cases the processors were swept downward, and therefore the layout determined the coefficient requirements. Other sweeps are possible, however. The easiest way to examine them is to focus on a particular output point and determine which computations for this point are performed at each step. As a final example of a systolic architecture, a design in which the coefficients are circularly shifted through the processor elements will be examined to show the general use of the data flow graph.

[Figure 7: Multiply/Accumulate Processor Arrangement and Sweep -- a diagonal arrangement with slope = -1, swept horizontally.]

A good exercise is to analyze the processor arrangement and sweep shown in figure 7. First, the processor layout, a diagonal with slope = -1, dictates that the input will be broadcast to all processing elements. The current sweep direction indicates that the coefficients will shift through the processing elements. Now, focusing on a single output point, it becomes apparent that all computation for this output point will be performed by the same processing element, in this case an enhanced multiply/accumulator. Each processing element will accumulate partial sums over four consecutive time periods and produce an output point consecutively (as the final product is accumulated). A version of the multiply/accumulate architecture is used in the Zoran Digital Filter Processor family.
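The schedule of figure 7 can be simulated in software to check that each PE really does complete one output point every N cycles (an illustrative Python sketch; the indexing of the circulating coefficients is my own assumption, not taken from the thesis):

```python
def fir_mac_systolic(x, h):
    """Model of the multiply/accumulate sweep: the input is broadcast to
    all N PEs, the coefficients circularly shift past the PEs, and each
    PE accumulates one complete output point over N consecutive cycles.
    Seeing h[0] (the 'tagged' coefficient) tells a PE to emit and clear.
    """
    N = len(h)
    acc = [0] * N
    out = []
    for t, xn in enumerate(x):
        for p in range(N):
            c = (p - t) % N          # coefficient index at PE p this cycle
            acc[p] += h[c] * xn
            if c == 0:               # tagged coefficient: point complete
                if t >= N - 1:       # discard the start-up transient
                    out.append(acc[p])
                acc[p] = 0
    return out

# steady-state points match the direct form:
# fir_mac_systolic([1, 0, 0, 0, 5], [1, 2, 3, 4])  ->  [4, 5]
```

Exactly one PE sees the tagged coefficient on any cycle, which is why the architecture needs the output-bus arbitration discussed next.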
The previous example architecture has interesting properties because the coefficients shift through the processing elements each time cycle. An adaptive filter that updates the filter coefficients every N time periods can be implemented by replacing the circulating coefficients as they shift; no change to the datapath is needed. If the output is to be downsampled, all but every Mth processing element can be ignored, so that only 1 out of M processing elements needs to be built; but every element that is built places the same requirement on the shifting coefficient path, so these advantages are not voided by the downsampling. A final advantage is that point failures in an arithmetic unit affect only the output points that would be computed by this element (1 of N output points).

But as always there are also disadvantages to this design. First, added control is necessary to determine which processing element should output on a particular clock cycle. In general, this would be solved by adding a tag bit to the coefficient path. A tag attached to h[0] would indicate to a PE to output and clear its accumulator. A second more serious problem would be routing the output from each PE to common output pins for the VLSI chip. Some form of tristate driver arrangement could be used which selectively drives the bus, but the loading on this bus could be excessive.

Although the problems of the multiply/accumulate architecture could be solved, the advantages are for specific applications that require downsampling or adaptability. For the residue FIR filter, a simple, modular design with maximum throughput and a minimum quantity of hardware is required; the added complication and hardware of a more specialized architecture is not warranted for this application. I will therefore focus exclusively on the transpose form architecture.

Bitwise Decomposition

As will become obvious later, it is advantageous to minimize the number of residue multipliers that are needed for the residue FIR filter.
It is possible to build a binary fixed point filter that uses only adders and no multipliers. A similar design, discussed later, can be used for a residue FIR filter.

[Figure 8: The Bitwise FIR Filter Architecture. The input x[n] feeds b parallel shift-and-add mini-convolution channels.]

The idea behind the adder only FIR filter is recognizing that a b x b bit multiply consists of b left shifts and b-1 conditional adds, as shown in equation (3). Let

  y = Σ_{i=0}^{b-1} 2^i y_i

then

  x*y = x * Σ_{i=0}^{b-1} 2^i y_i = Σ_{i=0}^{b-1} 2^i (x*y_i)     (3)

If the results of several multiplies are being summed, then the shifts and adds do not need to be performed until after the final sum. If b is the number of bits needed to represent the coefficients, the convolution can be broken into b mini-convolutions, the final results of which are shifted and added:

  y[n] = Σ_i x[n-i]*h[i] = Σ_i x[n-i] * Σ_{j=0}^{b-1} h_j[i]*2^j = Σ_{j=0}^{b-1} 2^j * Σ_i x[n-i]*h_j[i]

Because h_j[i] in the final equation is either 0 or 1, no general multiplies need to be performed in the mini-convolution chains. Using the transpose form for the mini-convolutions, the h_j[i]'s are employed to condition the adder in each processing element: if h_j[i] equals 1, the current value of the input is added to the previous result; if h_j[i] is 0, the previous result is passed on unchanged. A top level block diagram of the bitwise FIR filter is shown in figure 8; an individual processing element is shown in figure 9.

[Figure 9: Processing Element for the Bitwise FIR Filter. The current input and the result from the previous stage enter a conditional b bit adder (add/no change); the result passes to the next stage.]

The idea of breaking the FIR computation into parallel channels and accumulators can be extended to decompositions of the coefficients in bases other than 2. This concept is used in both the RNS designs and the conventional design that follow; a deeper discussion is included in the sections on the residue and conventional filter designs.
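As a concreteness check, the mini-convolution decomposition above can be sketched in a few lines of Python. The function name and the list-based delay-line indexing are illustrative choices, not from the thesis:

```python
# Bitwise FIR filter sketch: one mini-convolution per coefficient bit plane,
# using only conditional adds inside each chain, with the 2^j shifts applied
# after each mini-convolution completes.

def bitwise_fir(x, h, b):
    """y[n] = sum_i x[n-i]*h[i], with each h[i] split into b bits h_j[i]."""
    N = len(h)
    y = [0] * (len(x) - N + 1)
    for j in range(b):                            # bit plane j
        h_j = [(h[i] >> j) & 1 for i in range(N)]
        for n in range(len(y)):
            acc = 0
            for i in range(N):
                if h_j[i]:                        # conditional add: h_j[i] is 0 or 1
                    acc += x[n + N - 1 - i]       # x[n-i] in delay-line indexing
            y[n] += acc << j                      # shift and add the chain result
    return y
```

With x = [1, 2, 3, 4], h = [3, 5, 2], and b = 3, the routine reproduces the ordinary multiply-and-add convolution.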
Chapter 3
RNS Background

Residue arithmetic, based on simple principles of number theory, is a possible alternative to conventional arithmetic for several integer operations. Starting with r relatively prime¹ numbers m1, m2, ..., mr as a moduli set, it is possible to represent an integer x by its residues, or remainders, with respect to the members of the moduli set: x mod m1, x mod m2, ..., x mod mr. This new representation of x is unique for any integer x that is less than the product of the moduli in the moduli set (0 ≤ x ≤ M-1, where M = m1*m2*...*mr). The proof of this result follows from the Chinese Remainder Theorem.

A basic result of number theory is that the operations of addition, subtraction, and multiplication "commute" with the conversion to residue representation:

  (x • y) mod mi = ((x mod mi) • (y mod mi)) mod mi

where • is either addition, subtraction, or multiplication. Residue arithmetic is most useful where the operation is performed independently in each of the r channels; there is no coupling between the channels. Because the moduli can typically be selected to be much smaller than the integers x and y, the operation can be executed more rapidly in parallel than if x and y are in their conventional representation. It is because of these properties that residue arithmetic is appealing for FIR filters.

Unfortunately, residue arithmetic also has a number of disadvantages. Because the integers are not closed under division², division is not a straightforward uncoupled operation in residue representation. It is equally difficult to compare the magnitude of two numbers or to round or truncate a number in residue representation; the fundamental result is that there is no longer any significance that can be attached to a particular digit position. In general, to perform any of these operations a number must be converted out of residue representation back into a conventional representation. This leads to a final difficulty: the conversion into and out of residue representation.
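The representation and the channel-wise "commuting" property can be sketched in Python. The moduli set {3, 5, 7} and the helper names are illustrative choices, not values from the thesis:

```python
# Residue representation with the assumed moduli set {3, 5, 7}, so M = 105.
MODULI = (3, 5, 7)

def to_residues(x, moduli=MODULI):
    """Conversion into residue form: one remainder per modulus."""
    return tuple(x % m for m in moduli)

def from_residues(residues, moduli=MODULI):
    """Conversion out of residue form via the Chinese Remainder Theorem."""
    M = 1
    for m in moduli:
        M *= m
    x = 0
    for x_i, m_i in zip(residues, moduli):
        M_i = M // m_i
        x += M_i * pow(M_i, -1, m_i) * x_i    # M_i * <M_i^-1>_{m_i} * x_i
    return x % M

# Addition and multiplication commute with the conversion, channel by channel.
a, b = to_residues(17), to_residues(23)
add = tuple((ai + bi) % m for ai, bi, m in zip(a, b, MODULI))
mul = tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))
assert from_residues(add) == (17 + 23) % 105
assert from_residues(mul) == (17 * 23) % 105
```

Note that pow(M_i, -1, m_i) computes a modular inverse and requires Python 3.8 or later.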
The conversion into residue representation is usually done by a table lookup; the conversion out of residue representation uses a result of the Chinese Remainder Theorem and is not as simple. All of these problems are a topic of current research, but for now let's assume that these difficulties can be overcome and examine the possibilities for residue arithmetic units.

¹ Two numbers are relatively prime if they contain no common integer factors other than 1. For example, the numbers 10 and 21 are relatively prime; 10 and 14 are not.
² The set of integers is not closed under division. For example, what integer equals 5 divided by 3?

Basic Residue Arithmetic Units

In order to design FIR filters, it is necessary to implement some basic RNS arithmetic units as building blocks. One requirement that will be imposed on these units is that they be "programmable" with different moduli. The term programmable implies that the same hardware can be used for computations with any modulus less than a certain size, either by rewriting entries into a table or by asserting some constant(s) to one or more inputs. If there is a choice between a design that involves a table lookup and one that does not, all else equal, the latter would be preferred. In addition, there are some tricks that can be used with certain classes of moduli to optimize arithmetic computation; however, membership in these classes is fairly restrictive, and such specialized arithmetic units will not be practical if a custom design is required for each modulus.

Before the actual filter design is begun, it is worthwhile to develop a set of primitive residue arithmetic units. The overall goals of the arithmetic unit design, high throughput and minimum size, will not be fully apparent until the actual filter design is discussed later, but the techniques used in designing these primitive units will be helpful for the more complex units needed in a high throughput programmable filter. The residue adder will be addressed first. With the adder, a multiply by 2 block and a general multiply block can be designed. Finally, although it involves a table lookup, a general residue function unit will be discussed briefly.
Adders

One of the earliest proposals for a residue adder was to use a conventional ROM as a table lookup (figure 10). For moduli that can be represented within a b bit binary channel (i.e. m ≤ 2^b), it is necessary to have a 2^{2b} x b ROM. For example, a 6 bit modulus (m ≤ 64) would require a 4Kx6 ROM. There was some early research into exploiting the symmetry inherent in the addition tables to reduce the size of the ROM. A first stab is realizing that the operation of addition is commutative; this reduces the size of the ROM by a factor of two. Unfortunately, as the design is optimized to reduce the size of the ROM, the external circuitry required to implement it increases and the throughput decreases. In general, the large area and access time of these memories prohibit this approach for all but the smallest moduli¹. Although a table lookup would not be practical for a residue adder, it is important to note that any integer operation on two variables can be performed using a table lookup.

[Figure 10: Table Lookup Residue Adder. Inputs x and y address a ROM whose output is f(x, y).]

Focusing on the addition problem, the size of the ROM in the previous design can be significantly reduced by using a standard b bit binary adder as shown in figure 11. The output of the adder is b+1 bits wide, including both the b bit result and a carry bit. Because this b+1 bit output may exceed the modulus, the ROM is necessary to correct the result to lie in the normalized range [0, m-1]. In this case the size of the ROM is 2^{b+1} x b; for a 6 bit modulus, the ROM would be 128x6 bits.
Although this is significantly better than the previous design, decreasing the size of the ROM by a factor of 32, a closer examination of the ROM's contents shows that this design can also be improved.

¹ Chiang, C-L. & Jonsson, Lennart, "Residue Arithmetic and VLSI Design," 1983 IEEE Computer, pgs 80-83

[Figure 11: Residue Add using a Binary Adder with ROM Correction. Inputs x and y feed a b bit adder; the b+1 bit output (result plus carry) addresses a 2^{b+1} x b ROM.]

Assuming the inputs to our residue adder are in normalized residue form², the output of the binary adder falls into three cases (see figure 12). First, if the sum of the two numbers is greater than or equal to 0 and less than the modulus, the result is already in normalized residue form, and the ROM passes the result unchanged. Second, if the sum of the two numbers is greater than or equal to the modulus and less than 2^b (where b is the width of the binary adder), the b bit result exceeds its normalized representation by the value of the modulus, and the ROM subtracts the modulus from the output of the binary adder. Finally, if the sum of the two numbers is greater than or equal to 2^b (carry bit set), the b+1 bit result, including the carry bit as the b+1th bit, exceeds the normalized form by the value of the modulus, and the ROM subtracts the modulus from the b+1 bit result.

² A residue is in normalized residue form if its magnitude is between 0 and the modulus. A residue is not in normalized residue form if its magnitude is greater than or equal to the modulus or less than 0.
[Figure 12: Residue Addition with a Binary Adder. Example with 2^b = 64, m = 43, and modulus bias μ = 64 - 43 = 21. Case 1: 15 + 10 = 25, already normalized. Case 2: 25 + 27 = 52; 52 - 43 = 9, or equivalently 52 + 21 = 73, which is 9 with the carry dropped. Case 3: 41 + 25 = 66 = 2 + carry; 66 - 43 = 23, or equivalently 2 + 21 = 23.]

The ROM entries can be reduced to two operations: either the output of the binary adder is passed unchanged or the modulus is subtracted from it. The ROM can therefore be eliminated entirely as shown in figure 13. The first binary adder performs as before, producing an output that may or may not be normalized. A b+1 bit binary subtracter subtracts the modulus from the output of the first adder. This subtracter serves the dual purpose of providing the other possible final result and indicating (by its overflow bit) whether the output of the first adder exceeds m. If the overflow bit is set, the output of the binary adder was in the normalized range [0, m-1]; if no overflow is set, the output of the binary adder was greater than or equal to m. The overflow can be used to select between the outputs of the binary adder and the subtracter.

[Figure 13: Residue Adder without ROM. A b bit adder forms x + y; a b+1 bit subtracter forms (x + y) - m; the subtracter's overflow bit drives a 2-1 MUX that selects the final result <x + y>_m.]

One final wordlength optimization results from the finite, modulo 2^b nature of the binary channel. Adding two normalized residues in a binary adder yields a result, u, in the range [0, 2(m-1)]. If m is representable in a b bit channel, then 2(m-1) ≤ 2^{b+1}. When m is subtracted from u, a number v is obtained that is always less than m (consider only the case u - m ≥ 0) and is therefore representable in b bits. The number v can be obtained alternatively by adding μ = 2^b - m to the low order b bits of u and ignoring the carry; this is a result of the mod 2^b nature of the channel. The advantage of this approach is that a b bit binary adder can be used instead of a b+1 bit binary subtracter; one stage of carry propagation is saved. The final residue adder is shown in figure 14.
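The two-adder-plus-MUX structure of the final adder can be sketched in Python; the function name and software-style bit masking are illustrative:

```python
# Final residue adder sketch (the structure of figure 14): a first b-bit add,
# a second b-bit add of mu = 2^b - m, and a MUX driven by the OR of the carries.

def residue_add(x, y, m, b):
    mask = (1 << b) - 1
    mu = (1 << b) - m                      # modulus bias
    u = x + y                              # first b-bit adder
    carry1, u = u >> b, u & mask
    v = u + mu                             # second adder: subtract m mod 2^b
    carry2, v = v >> b, v & mask
    return v if (carry1 | carry2) else u   # either carry selects the corrected sum

# The worked example from figure 12 (2^b = 64, m = 43):
assert residue_add(15, 10, 43, 6) == 25    # case 1: already normalized
assert residue_add(25, 27, 43, 6) == 9     # case 2: 52 - 43
assert residue_add(41, 25, 43, 6) == 23    # case 3: 66 - 43
```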
[Figure 14: Final Residue Adder Design. <x + y>_m is formed from two b bit adders and a 2-1 MUX.]

Because the carry is not input into the second binary adder, the carry out of the second adder will only be set if u is in the range [m, 2^b - 1]. If u is greater than or equal to 2^b (the carry out of the first adder is set), the carry out of the second adder will not be set. The logical OR of the two carries is used to select the 2-1 multiplexer: if either carry is set, the subtracted version is chosen.

At this point it is useful to generate a hardware summary¹ for the final simple residue adder. Similar summaries will be included for the final versions of the more complex architectures in other sections to permit simple comparisons. The basic components that will go into the summaries are 1 bit full adders, simple MUX's, and gates. For the final simple residue adder the summary is as follows:

Part Type          Number    Transistors    Sizing (µm²)
1 bit Full Adder   2b        64b            19584b
2-1 MUX            b         10b + 4        4896b + 1632
OR gate            1         6              4896
Totals                       74b + 10       24480b + 6528

Hardware Summary for the Final Residue Adder

Examining the operation of the final adder design, we can gain some helpful insight into performing the modulo operation with binary arithmetic units. The basic result is that modulo addition is the same as binary addition unless the result exceeds the modulus, in which case the modulus, m, is subtracted from the binary sum, or, equivalently, μ = 2^b - m is added to the binary sum.

As a result of the previous discussion for the final modulo adder, we will focus on performing the correction, when necessary, by adding μ. Now, instead of possibly performing the correction later, preadd μ to one of the inputs and use a single binary adder as shown in figure 15. Because x1 is initially normalized, it falls in the range [0, m-1]; x1 + μ will correspondingly be in the range [μ, 2^b - 1]. Because it can be represented entirely in a

¹ The space estimates were derived from an existing standard cell library.
The transistor count numbers were derived from simple designs in CMOS and include both p and n type transistors. A more detailed discussion of these hardware estimates is included in the Appendix.

b bit binary channel, the carry out of the preadder can be ignored. By the previous result, the output of the main binary adder will now either equal the correct modulo sum or exceed this sum by μ. The carry out of the binary adder provides a flag to indicate which case the answer is in. If x1 + x2 is greater than or equal to m, then x1 + x2 + μ will be greater than or equal to 2^b and the carry will be set. Since x1 + x2 ≥ m is the case that needed correction, the output of the adder, ignoring the carry, is the proper normalized modulo sum. If x1 + x2 is less than m, then x1 + x2 + μ will be less than 2^b and the carry will not be set; the output of the binary adder will exceed the correct normalized modulo sum by μ.

[Figure 15: Preadding μ to one of the Inputs. x1 is preadded with μ in a b bit adder whose carry out is ignored (N/C); the main b bit adder then sums the biased x1 with x2.]

At first it appears that preadding one of the inputs trades one problem for another very similar problem. Without preadding, the binary sum, ignoring the carry, can fall short of the correct modulo sum by μ; with preadding, the binary sum can exceed the correct modulo sum by μ. However, if a series of numbers xi is being accumulated and the preadd of μ to each xi can be performed at minimal expense, we can increase the performance over that of the general residue adder. It is always possible to guarantee that one of the inputs to the binary adder is a biased residue (exceeds its normalized value by μ) and the other is a proper normalized residue. If the current partial sum is a biased residue, the carry out, 0, is used to select the normalized version of xi; if the current partial sum is normalized, the carry out, 1, is used to select the biased version of the input.
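The alternating biased/normalized bookkeeping just described can be sketched in Python. The function name and the software-style final correction are illustrative, not the hardware structure itself:

```python
# Residue accumulator sketch: the partial sum alternates between normalized
# and biased (+mu) form; each add's carry out says which form resulted.

def residue_accumulate(xs, m, b):
    mask = (1 << b) - 1
    mu = (1 << b) - m
    acc, normalized = 0, True
    for x in xs:
        # select the biased input when the partial sum is normalized, and vice versa
        addend = x + mu if normalized else x
        s = acc + addend
        normalized = bool(s >> b)          # carry out set => normalized result
        acc = s & mask
    # final correction stage: a biased sum exceeds the true sum by mu
    return acc if normalized else (acc + m) & mask

assert residue_accumulate([15, 10, 30], 43, 6) == (15 + 10 + 30) % 43
```

Only one b bit adder delay is incurred per input, at the cost of the correction stage at the output.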
The completed modulo accumulator, including a register, is shown in figure 16.

[Figure 16: Residue Accumulator]

A single addition in the final modulo adder requires two b bit adder delays, while an addition in the modulo accumulator requires only one b bit adder delay. On the surface it appears that the accumulator somehow improves on the modulo adder; however, it is important to realize that the accumulator is a rather constrained form of the addition problem. The one b bit adder delay assumes that the preadds can be performed with no overhead, and it is an average delay value. Also, a correction stage must be included at the output of the accumulator because the final sum may be in biased form. Nevertheless, even with all of these caveats, this configuration is very useful when several numbers are being accumulated (for example, in an FIR filter).

Residue Multiply by 2 Block

The next arithmetic unit to examine is a modulo multiply by 2 block that takes a normalized residue input and generates a normalized residue output. This block is very useful when building the more complex general modulo multiplier block. Now, in a standard binary number system nothing is as simple as multiplying a number by two: simply shift the number left one place. Unfortunately, in a residue number system computations are no longer as straightforward, and this is no exception.

The obvious way to implement a modulo multiply by two block is to build upon what we already know by using a modulo adder with both inputs tied together. Looking at the final modulo adder in figure 14, the output of the first b bit adder will just be a left shifted version of the input. The left shift function, however, can be hardwired, making the first adder unnecessary. To eliminate the adder, the high order bit of the input is routed to "carry out," and the remaining b-1 bits are left shifted with a 0 inserted as the low order bit to form the b bit "result."
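A sketch of the hardwired-shift doubler follows (names are illustrative); the correction stage is the same one used by the final adder:

```python
# Multiply-by-2 block sketch: the first adder of the residue adder is replaced
# by a hardwired left shift whose high-order input bit acts as the carry out.

def residue_double(x, m, b):
    mask = (1 << b) - 1
    mu = (1 << b) - m
    carry1 = x >> (b - 1)             # high-order bit routed to "carry out"
    u = (x << 1) & mask               # remaining bits shifted, 0 in the low bit
    v = u + mu                        # correction adder
    carry2, v = v >> b, v & mask
    return v if (carry1 | carry2) else u

assert residue_double(30, 43, 6) == 60 % 43
assert residue_double(10, 43, 6) == 20
```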
The basic multiply by two block is shown in figure 17.

[Figure 17: Residue Multiply by 2 Block. A hardwired left shift (bit [b-1] to carry out, bits [b-2..0] shifted with a 0 inserted) feeds the correction adder and 2-1 MUX to produce <2x>_m.]

Residue Multipliers

The final residue arithmetic block to be added to our toolbox is a general modulo multiplier. A multiplier block is considerably more complex than the adder block or the multiply by two block. Because several modulo multipliers are needed in the RNS to binary converter, the overall system design goals for the multipliers must be addressed. Hopefully, the multipliers can be designed in such a way as to prevent them from being the latency, throughput, and hardware real estate bottleneck of the system.

At this point it is instructive to investigate the design of standard binary multipliers¹ before tackling the modulo multiplier problem. Binary multiplier designs can be divided into two classes: shift and add multipliers and array multipliers. Shift and add designs are by their nature clocked and tend to be slower overall; array designs, which are not necessarily clocked (although pipeline registers could be inserted into the carry chains), are faster because carries are propagated more efficiently.

[Figure 18: Shift and Add Multiplication]

A shift and add multiplier forms its product exactly as its name implies by accumulating the following sum. Let

  y = Σ_{i=0}^{b-1} 2^i y_i

then

  x*y = x * Σ_{i=0}^{b-1} 2^i y_i = Σ_{i=0}^{b-1} (2^i*x)*y_i

Shifted values of the multiplicand x are accumulated, conditioned on the appropriate bits of the multiplier y.

¹ Material in this section was obtained from Rabiner and Gold, Theory and Application of Digital Signal Processing, pgs 514-540. See this reference for a more exhaustive discussion of binary multiplier design. Other ideas can be obtained by using the systolic design techniques discussed earlier.
An unwrapped nonrecursive version of the shift and add multiplier is shown in figure 18.¹ The major disadvantage of a shift and add multiplier is that the carry bits do not propagate efficiently. To solve this problem, array multipliers attempt to minimize the length of the longest carry propagation path. A simple 3x3 bit array multiplier with n² cells is shown in figure 19; in the figure the circles represent 1 bit full adders. Better array multipliers can be created using more complicated carry propagation schemes², but the basic structure remains the same, and the simpler version is easier to understand.

[Figure 19: 3x3 Array Multiplier. An array of 1 bit full adders combines the partial products x_i*y_j into the product bits p0 through p5.]

¹ A design that uses a smaller adder and is recursive is shown in Rabiner and Gold, pg 516.
² Again, see Rabiner and Gold for a discussion of various implementations.

With some knowledge of binary multipliers, we are ready to attack the modulo multiplier problem. Of the two classes of binary multipliers, the shift and add type seems most conducive to the modulo problem because the partial accumulations can be normalized by the modulus after each step. Although array versions would be faster, the result of a full b x b bit binary multiply could exceed the modulus by several times its value; in order to normalize this result, several stages of correction circuitry would be needed.

In order to implement a shift and add modulo multiplier, both multiply by two blocks and residue adder blocks must be available. Unfortunately, both of the blocks that have been designed provide only normalized versions of the output. Although a block could be designed that provided both biased and unbiased versions of the output, it would have almost twice the latency of the multiply by 2 block. If a structure similar to the residue accumulator (figure 16) could be used, it would have the reduced latency that we desire.
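A behavioral sketch of the shift-and-add modulo multiplier built from the doubler and conditional adder follows; it is written with plain % in place of the biased-carry hardware bookkeeping, so the names and structure are illustrative:

```python
# Shift-and-add modulo multiplier sketch: a chain of multiply-by-2 blocks
# produces 2^i * x mod m, and each stage conditionally accumulates that value
# when multiplier bit y_i is set.

def residue_multiply(x, y, m, b):
    acc = 0
    for i in range(b):
        if (y >> i) & 1:
            acc = (acc + x) % m       # conditional residue add of 2^i * x
        x = (2 * x) % m               # doubler feeding the next stage
    return acc

assert residue_multiply(30, 25, 43, 6) == (30 * 25) % 43
```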
An enhanced version of the multiply by two block can be designed (figure 20) with minimal hardware cost and no additional delay. To understand the biased side of the multiply by two block, two cases must be examined: 2x[n] ≥ m and 2x[n] < m. Regardless of the case, the binary output of the biased left shifter equals 2x[n] + 2μ. If 2x[n] ≥ m, then 2x[n] is an unnormalized residue, and one of the two μ's is needed to normalize it, yielding the desired result, 2x[n] + μ. If 2x[n] < m, then the output of the left shifter exceeds the desired result by μ; fortunately, in this case the output of the adder on the unbiased side of the doubler has the desired result. The additional components required for this modification of the design are a b bit 2-1 MUX and a b bit register. The only disadvantage is that both biased and unbiased versions of the input must be available, but if several blocks are chained together, this is only a problem for the first one.

[Figure 20: Modified Multiply by 2 Block. Parallel hardwired left shifts of x[n-1] and x[n-1] + μ produce <2x>_m and a biased version <2x>_m + μ.]

The modified accumulator will be very similar to the accumulator discussed previously, except that the accumulation will not be recursive and the add will be conditional. The accumulation will not occur in place; the accumulator will take the partial result from the previous stage, add another value to the partial sum, and pass the result to the next stage. The carry out from the previous stage must also be passed to indicate whether the partial result is biased. The appropriate bit of the multiplier, y_i, must also be asserted to determine whether to perform the add or pass the previous partial result to the next stage.

[Figure 21: Residue Conditional Accumulator #1. Inputs C_{i-1}[n-1] and S_{i-1}[n-1] from the previous stage combine with 0, μ, 2^i*x, or 2^i*x + μ to produce C_i[n] and S_i[n].]
A possible design for this accumulator is shown in figure 21. The 4-1 MUX in the figure performs the following function:

y_i   Cin   MUX output    Explanation
0     0     0             previous result biased, result stays biased
0     1     μ             previous result unbiased; the biased zero biases the result
1     0     Input         previous result biased, use the unbiased input
1     1     Input + μ     previous result unbiased, use the biased input

Both biased and unbiased versions of zero are needed to assure that the carry out of the accumulator properly indicates the state of the result. If the previous result is unbiased and a 0 is added to it, there will be no carry from the adder and the next stage will assume a biased input. Instead, a biased zero is added to the unbiased previous result, which biases the result; the carry out again will not be set, but this time it correctly indicates that the result is biased.

Part Type          Number   Transistors   Sizing (µm²)
1 bit Full Adder   b        32b           9792b
4-1 MUX            b        18b + 32      8160b + 9792
1 bit Register     b+1      24b + 24      8160b + 8160
Totals                      74b + 56      26112b + 17952

Hardware Summary for Residue Conditional Accumulator #1

Unfortunately, it is not aesthetically pleasing to use rungs of a multiplexor to decode zeros or to require a biased zero to be added to the previous result. If the carry out from the accumulator can be fixed up, the multiplexor can be reduced to a 2-1 multiplexor by placing an AND gate after the multiplexor to condition the add. The biased zero was only necessary in the case when no add is being performed (y_i = 0) and the previous result is unbiased (Cin = 1); if the previous result is passed to the next stage unchanged, the carry out must be forced to 1. With the addition of the necessary gates, the new accumulator design is shown in figure 22.

Part Type          Number   Transistors   Sizing (µm²)
1 bit Full Adder   b        32b           9792b
2-1 MUX            b        10b + 4       4896b + 1632
1 bit Register     b+1      24b + 24      8160b + 8160
AND gate           2        12            9792
OR gate            1        6             4896
Inverter           1        2             3264
Totals                      66b + 48      22848b + 27744

Hardware Summary for Residue Conditional Accumulator #2
[Figure 22: Residue Conditional Accumulator #2. Inputs C_{i-1}[n-1], S_{i-1}[n-1], x[n-1], and x[n-1] + μ produce C_i[n] and S_i[n].]

Finally, we are ready to put all of the blocks together to form the general purpose programmable residue multiplier. The first section of the multiplier is a hardwired "first stage" which contains only b AND gates, two b bit pipeline registers, and a b bit adder that generates a biased form of x[n]. Because there is no previous accumulator result, the first section does not need an accumulator; the AND gates determine whether an unbiased 0 or an unbiased x[n] is passed to the second section, and the carry in of the second section is accordingly hardwired to 1 to expect an unbiased previous result. A final section that unbiases the potentially biased output of the final accumulator has not been included, but may be needed.

The complete multiplier requires a large amount of circuitry. Using either accumulator design, the number of 1 bit full adders needed for the residue multiplier is 2b² - b. This can be compared to the b² 1 bit full adders needed for the b x b bit binary array multiplier discussed earlier. An abbreviated hardware summary of the residue multiplier is shown below.

Part Type          Number
1 bit Full Adder   2b² - b
2-1 MUX            3b² - 2b
1 bit Register     3.5b² + 1.5b - 1
AND gate           b² + b - 1
OR gate            3b - 2
Inverter           b - 1

Hardware Summary for the Residue Multiplier

On the positive side, the residue multiplier has fairly high performance. The latency through the complete multiplier is b+1 clock cycles, and the throughput is 1 result per clock cycle. The limiting factor on the clock cycle time is the delay through the accumulator, consisting of a b bit binary adder delay and a few gate delays. In addition to the first section, there are b-1 other sections, the ith one of which requires
a b bit accumulator, a b bit doubler, and a b-i bit register. Even with the performance that can be obtained, the amount of hardware required encourages any residue design to avoid multipliers if at all possible.

Residue General Function Units

To implement some general integer functions, it may be more efficient in some cases to use a table lookup rather than a more complex combination of binary arithmetic units. The general function of two variables has already been discussed as the first possible adder design. There a 2^{2b} x b ROM was used, which is the most general implementation of an integer function of two variables. If the function exhibits any special properties, the lookup table can be broken into several smaller pieces; in this case the hardware and throughput requirements of the table lookup approach may become competitive with those of custom designs.

[Figure 23: Four bit General Linear Function Evaluation (d=1). The bits of x select (through 2-1 MUX's) between 0 and the precalculated multiples 1*K, 2*K, 4*K, and 8*K, which a four input residue adder combines into <K*x>_m.]

First, let's examine the case of a function of one variable. If the function is linear¹, the b bit input can be broken into ⌈b/d⌉ groups of d bits, and smaller d bit lookups can be performed with the results added. For example, to scale a four bit residue, x, by a four bit constant, K, the bits of x can be used to select precalculated multiples of the constant K from tables or hardwired multiplexors.
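A sketch of the d = 1 scheme, with assumed values m = 43, K = 9, and a 4-bit input (none of these numbers are from the thesis):

```python
# Precalculated multiples <2^i * K>_m are selected by the bits of x and
# combined in residue adders; no general multiplier is needed.

m, K, b = 43, 9, 4
table = [(K << i) % m for i in range(b)]      # entries 1K, 2K, 4K, 8K mod m

def scale_by_K(x):
    acc = 0
    for i in range(b):
        if (x >> i) & 1:                      # bit i selects table[i] or 0
            acc = (acc + table[i]) % m        # residue adder
    return acc

assert scale_by_K(11) == (11 * K) % m
```

Because K is known ahead of time, the table entries replace the doubler chain of the general multiplier.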
This is shown more clearly in figure 23 for the case d = 1. Instead of selecting between 0 and a multiple of K, the 2-1 MUX's could be replaced with AND gates when d=1. If d>1, MUX's or some other table addressing scheme will have to be used; an example of d=2 is shown in figure 24. It is interesting to notice that with d=1 the design is very similar to the general residue multiplier except that since K is known, multiples of it can be precalculated.

The evaluation of a multivariable function could also use this trick. However, for an integer function of two variables to be performed by partial lookups, the function must also be linear.² Considering that a linear function of two variables has the property f(x, y) = f(x, 0) + f(0, y), the two inputs could be evaluated in parallel and the results added to obtain the final result. So, in general, the partial evaluation scheme is most useful for single variable functions.

[Figure 24: Four bit General Linear Function Evaluation (d=2). Two 2 bit groups of x address the precalculated multiples 0, 1*K, 2*K, 3*K and 0, 4*K, 8*K, 12*K, which a two input residue adder combines into <K*x>_m.]

Problems with RNS

The discussion to this point has focused on the advantages of RNS: the parallelism of the computation and how programmable arithmetic units can be developed using standard binary arithmetic units. Unfortunately, there are also several disadvantages. First, numbers must be converted into their residue form, and results must be converted back to binary. The substantial size and latency of the conversion units tend to rule out the use of RNS for all but very large computational tasks. Second, because the residue digits are uncoupled, a number in residue form cannot be scaled in any simple manner. Finally, also because the residue digits are uncoupled,

¹ A linear function is one which satisfies the equation f(ax + by) = af(x) + bf(y). An example of a linear function is f(x) = Kx; an example of a function which is not is f(x) = x + K.
² Actually, the function has to satisfy a slightly weaker requirement than linearity in two variables.
This requirement is as follows: f(x, y) = f(X, Y) + f(x - X, y - Y) for all (X, Y) within the range of f. The author challenges the reader to find a useful function satisfying the above condition that is not linear.

magnitude comparison is not possible with numbers in residue form. These problems drastically limit the number of possible applications of RNS and to some extent explain the reason that RNS has not been widely used.

Conversion into and out of residue representation

The conversion into residue representation is simple, and the conversion for each modulus can operate independently. The residue mod m of a number is the remainder obtained when the number is divided by the modulus. If r moduli are included in an RNS design, r similar conversion units can operate in parallel.

The conversion out of residue representation is not simple, and the conversion from each modulus cannot operate independently. The conversion is based on a classic theorem known as the Chinese Remainder Theorem (CRT).¹ Given the residue representation {x1, x2, ..., xr} of x, the value of x can be computed using the following identity:

  x = (M1*<M1^-1>_{m1}*x1 + M2*<M2^-1>_{m2}*x2 + ... + Mr*<Mr^-1>_{mr}*xr) mod M

where M = Π mi, Mi = M/mi, <Mi^-1>_{mi} is the multiplicative inverse of Mi modulo mi, and xi = x mod mi. Although conversion hardware will be addressed in section 4.3, the large mod M computations do not look very promising. To avoid mod M calculations, the Mixed Radix Conversion (MRC) algorithm is conventionally used. Although the details of the algorithm will be discussed in a later section, it is a two part conversion process that passes through an intermediate mixed radix representation.²

Scaling and Magnitude Comparison

Because RNS is not a weighted number system, it is not possible to round a number or to compare the magnitude of one number to another. For either of these operations the residue digits must be converted out of the residue

¹ In fact, the mathematical validity of the residue number system relies on the results of the theorem.
[2] Mixed radix number systems are similar to fixed radix except that the radix can vary from place to place. If xj are the digits of a mixed radix number with radices aj, the value of the number is computed as follows:

    x = Σ_j xj * ( Π_{i=0..j-1} ai )

representation. Rather than a full conversion, however, the first part of the MRC algorithm, which is itself a conversion to a mixed radix (weighted) number system, can be used. All of the operations can be performed within the modulo channels: the number is converted to a mixed radix representation, and the algorithm can be reversed to return to a residue representation.

Complex Quantities (QRNS)

To this point only real fixed point integer quantities have been considered; however, many signal filtering applications deal with complex data and/or coefficients. Several approaches to processing complex data in residue channels have been developed, but these approaches tend to fall into two general categories. The first is simply to use three parallel real residue channels. The second is to use one of the quadratic residue number systems (QRNS).

Processing with three real channels uses an innovative trick. If (a + bi) is the complex input and (c + di) is a complex coefficient, the inputs to the three parallel real channels are a, b, and a+b, and the coefficients of the three channels are c, d, and c+d, respectively. The output of the second channel (bd) is subtracted from the output of the first channel (ac) to form the real part of the result (ac - bd). The outputs of both the first and the second channels are subtracted from the output of the third channel (ac + bc + ad + bd) to form the imaginary part of the result (ad + bc).

The hardware expense of adding a third channel can be significant, however. To avoid this expense the quadratic residue number system (QRNS) has been developed, which can uncouple the real and imaginary parts of complex operations.
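The three-channel trick can be checked with ordinary integers (residue reduction omitted for clarity; `complex_mult_3ch` is an illustrative name):

```python
# Sketch of the three-channel trick for complex multiplication:
# the channels compute a*c, b*d, and (a+b)*(c+d); the real part is
# ac - bd and the imaginary part is (a+b)(c+d) - ac - bd = ad + bc.

def complex_mult_3ch(a, b, c, d):
    ch1 = a * c              # first channel
    ch2 = b * d              # second channel
    ch3 = (a + b) * (c + d)  # third channel: ac + ad + bc + bd
    real = ch1 - ch2
    imag = ch3 - ch1 - ch2
    return real, imag
```

Only three real multiplications are needed instead of the four of the direct form, at the cost of the extra channel and the final subtractions.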
QRNS is a number system isomorphic to the complex extension fields of primes of the form 4k + 1, where k is an integer.[1] For primes of this form, -1 is a quadratic residue.[2] Letting I represent the quadratic residue of -1 modulo p, the two quadratic residues of the input (a + bi) are formed as follows: A = (a + bI) mod p and B = (a - bI) mod p. With the coefficients in quadratic residue form also, C = (c + dI) mod p and D = (c - dI) mod p, multiplication and addition are uncoupled:

    (A, B) * (C, D) = (A*C, B*D)

with the computations A*C and B*D being performed in modulo p channels.

[1] For more detail on QRNS see the reference section for some interesting papers.
[2] A number r is a quadratic residue modulo p iff there is a solution to the equation x^2 = r mod p.

At first QRNS seems to be a significant advantage over using three residue channels, but deeper investigation reveals that the advantage is not as great as expected. Although only two channels are used to represent complex numbers, the moduli in each channel come from a significantly limited set: the moduli in a QRNS system must be primes of the form 4k+1, while the moduli in a standard RNS system need only be relatively prime to one another. This limitation causes a much smaller dynamic range from a set number of moduli. To avoid this problem, other number systems such as the modified quadratic residue number system (MQRNS) have been developed that allow a richer set of moduli, but these systems have other problems. For a more detailed discussion of QRNS and its extensions see the references listed at the end of the thesis.
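A small numeric sketch of the QRNS mapping (illustrative names; the linear search for I stands in for a real design's precomputed constant):

```python
# Sketch of the QRNS mapping for a prime p = 4k + 1. I is a square root
# of -1 mod p; (a + bi) maps to (A, B) = ((a + b*I) % p, (a - b*I) % p),
# and complex multiplication becomes componentwise multiplication mod p.

def quadratic_root_of_minus_one(p):
    """Find I with I*I = -1 mod p (exists for primes p = 4k + 1)."""
    for i in range(2, p):
        if (i * i) % p == p - 1:
            return i
    raise ValueError("p is not a prime of the form 4k + 1")

def to_qrns(a, b, I, p):
    return ((a + b * I) % p, (a - b * I) % p)

def from_qrns(A, B, I, p):
    inv2 = pow(2, -1, p)
    a = ((A + B) * inv2) % p
    b = ((A - B) * inv2 * pow(I, -1, p)) % p
    return a, b
```

For p = 13 (k = 3), I = 5 since 5*5 = 25 = -1 mod 13; the product (2 + 3i)(4 + i) = 5 + 14i then comes out of the uncoupled channels as (5, 1) mod 13.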
Chapter 4  Efficient Modular RNS FIR Filter

The discussion that follows focuses on a FIR filter architecture using real integer data and coefficients. Although the techniques of the previous chapter can include complex quantities, computing with complex numbers only slightly complicates the conversion into and out of residue representation and increases the number of channels; the resulting hardware designs for the real and complex cases are derived similarly, and the inclusion of complex implementations would only add unnecessary confusion.

It is also assumed that the designs will be implemented in VLSI or WSI; practically, there is no other hardware platform that could support the massive designs. However, this assumption places some restrictions on the design, several of which have been mentioned in the section on systolic architectures. The most significant is the interconnection constraint. A printed circuit card populated with SSI or MSI components can use several signal layers for interconnection between the discrete components; a typical VLSI process, by comparison, includes only two layers of metal for interconnection between the custom blocks. As a result, interconnects in VLSI or WSI tend to consume area, and a design that uses more gates but has simple interconnects may occupy less space than a design that has been optimized for gate count at the expense of a complex interconnection scheme.

Figure 25  Residue FIR Filter System

A top level block diagram of a real residue FIR filter is shown in figure 25. The complete system should require only three basic designs: a binary to RNS conversion, a FIR filter tap, and an RNS to binary conversion. As a system level consideration, the designs should be programmable so that any of the three designs can be used for any of the moduli of the system; the modulus for a particular design would be programmed either by asserting a value to an input or by loading a value(s) into a register(s). This limits the number of specialized designs and makes the system expandable by adding an additional filter tap to the chain. Ideally, the RNS to binary conversion block could also be programmed for any standard moduli set; but, because the very structured computation of the conversion includes a set number of moduli for the system, it may be worthwhile to design the optimal moduli set into this component.
Residue FIR Filter Tap

The major focus of the design is the residue FIR filter tap. It is assumed that any filter being implemented with residue techniques will have a large number of taps to justify the added overhead of the two conversion stages. As a result, the primary design constraint will be the throughput of the filter tap. Several different filter tap designs will be presented in this chapter, with hardware size and throughput estimated for each. Two general classes of architectures will be developed: one uses the biased/unbiased approach of chapter 3, normalizing the output of each tap so that it matches the result from a conventional filter; the other does not normalize the output of each tap but instead constrains the result to a limited range of values. The remaining sections of this chapter explore each design type in depth, progressing from the simpler to the more complex architectures.

Brute Force

The obvious first approach to a residue FIR filter chain is to replace all of the arithmetic units in the transpose FIR filter block diagram with the corresponding residue arithmetic units developed in chapter 3. Each tap will need one residue adder, one residue multiplier, and a fixup block. Although this design will work, a lot of hardware is required for each filter tap: not including auxiliary gates, MUXes, or registers, 2b^2 + b individual 1 bit adders are needed per tap.

Coefficient Decomposition

Most of the complexity of the brute force approach comes with the residue multiplier. In a multiplier, partial results (multiples of left shifted versions[1] of the input) are summed.
a multiplier output and diagram filter block FIR transpose residue arithmetic multiplier, one residue the in (multiplies of of the input) are results versions1 from the If the partial results are not added together at each tap, but instead summed. residue adds of the partial results need to be passed on to the next tap, the performed only once, at the end of the filter In chain. addition, because convolution is a linear operation, the left shifts of the input also only need to be performed once. approach This was already examined in chapter 2 as a way to build a The same top level design can be used for an RNS filter by replacing the binary arithmetic units with their binary FIR filter without explicit multipliers. residue equivalents. the result of the multiplies At each subtap either the current input or 0 is added to previous stage. because both the input The subtaps are and 0 are available able to operate without without computation. Looking closer, the basic idea exploited here is that the b individual bits of the coefficients select numbers (either 0 or current input) that are added to the b previous partial results. available such as: 1*input, Now, if more versions of the current input are 2*input, and 3*input, then the coefficient bits can be taken in groups of two to select between these three numbers could be added to the result of the prior stage. subtaps needed per tap is Fb/2 1. In general, and zero that In this case the number of it is possible to represent the In the binary 1 More exactly, versions of the input that have been recursively doubled. domain doubling is performed by a left shift; in the residue domain the result of the left shift must also be normalized. 50 coefficients in any radix 1 fixed number integer representation and precompute all multiples of the input that a single digit in this representation can span. For a base a of the coefficients decomposition Floga max(m)l subtaps 2 are needed per tap. 
To add some formalism to the development, it is useful to examine the mathematics involved. The coefficients are represented in base α notation as follows:

    h[i] = Σ_{j=0..J-1} h_j[i] α^j                                (1)

where J = ⌈log_α max(m)⌉ and h_j[i] is the jth digit in the base α representation of h[i]. Inserting equation (1) into the convolution equation yields

    y[n] = Σ_i [ Σ_{j=0..J-1} h_j[i] α^j ] x[n-i]

By reversing the order of summation, two different computational procedures are generated:

    y[n] = Σ_{j=0..J-1} α^j [ Σ_i h_j[i] x[n-i] ]                 (2)

    y[n] = Σ_{j=0..J-1} Σ_i h_j[i] [ α^j x[n-i] ]                 (3)

The first procedure (eq 2) dictates that the mini-convolutions are computed and then the results, scaled by powers of α, are added. The second procedure (eq 3) dictates that the input is prescaled by powers of α and the results of the mini-convolutions are directly added. The difference between these two procedures was not significant in the binary case with α = 2 because multiplying by factors of two in a binary representation is equivalent to left shifting. In the residue case, where computation must be performed to scale by any number, one procedure may be better than the other. This, however, will be addressed later when the scaling units are developed.

[2] A fixed radix (base α) integer number system is defined by the following rule: (... a3 a2 a1 a0) = ... + a3 α^3 + a2 α^2 + a1 α + a0, where the ai are the digits of the radix α representation.
[3] Remember, m is the modulus of the channel, and therefore a range of distinct numbers is needed for the representation of the coefficients that meets or exceeds the maximum modulus.
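Equation (2) can be checked numerically. The sketch below, with illustrative names, decomposes the coefficients into base-α digits, runs one mini-convolution per digit position, and scales by α^j; plain integers stand in for the residue channels.

```python
# Sketch of the base-alpha coefficient decomposition, eq (2): the filter
# is split into J mini-convolutions, one per coefficient digit, whose
# outputs are scaled by alpha**j and summed.

def digits(h, alpha, J):
    """The J base-alpha digits h_j of a coefficient h."""
    return [(h // alpha**j) % alpha for j in range(J)]

def decomposed_filter(x, h, alpha, J):
    """y[n] = sum_j alpha**j * sum_i h_j[i] * x[n-i]   (eq 2)."""
    N = len(h)
    hd = [digits(c, alpha, J) for c in h]
    y = []
    for n in range(len(x)):
        total = 0
        for j in range(J):
            mini = sum(hd[i][j] * x[n - i] for i in range(N) if n - i >= 0)
            total += alpha**j * mini
        y.append(total)
    return y
```

Since Σ_j α^j h_j[i] reconstructs h[i], the output matches the direct convolution; each mini-convolution needs only the small digit values rather than full multiplies.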
Base 2 - Bitwise

The simplest case of coefficient decomposition is α = 2, bitwise. The architecture is identical to the binary algorithm shown in figures 8 and 9, with the binary subtaps replaced by the residue conditional accumulators with the biased/unbiased fixup (figure 22). In the binary case the number of subtaps per tap was determined by the precision of the coefficients; in the residue channel case the number of subtaps is determined by the modulus (which effectively sets the magnitude of the coefficients within a particular channel). As shown above, ⌈log2 max(m)⌉ = b subtaps[1] are needed per channel. The complexity per tap for this design is b b-bit adders, or b^2 1 bit full adders, which is less than one-half the number needed by the brute force design. Unfortunately, there is the added expense of broadcasting both the b bit current input and the b bit biased current input to all subtaps. A complete summary of the hardware required per subtap is shown below.

    Part Type          Number   Sizing (μ²)      Transistors
    1 bit Full Adder      b     9792b            32b
    2-1 MUX               b     4896b + 1632     10b + 4
    1 bit Register       b+2    8160b + 16320    24b + 48
    AND gate             b+1    4896b + 4896     6b + 6
    OR gate               1     4896             6
    Inverter              1     3264             2
    Totals                      27744b + 31008   72b + 66

    Global Bus:     2b signal lines
    Critical Path:  2-1 MUX + AND gate + b bit adder + OR gate + register
    Throughput:     (1.9b + 8.06 ns)^-1

    Architecture summary for the base 2 subtap

[1] Because arithmetic is being performed using b bit binary arithmetic units, it seems logical to set the maximum modulus to the number which uses the full dynamic range of these units. In general, max(m) = 2^b, where b is the chosen width of the channel.

Balanced Ternary

The advantage of going to a higher radix decomposition of the coefficients is that fewer subtaps are needed; the disadvantage is that more scaled versions of the input must be broadcast to each subtap. For example, if the coefficients are represented in standard radix 3 (ternary) with the digits {0, 1, 2}, four b bit numbers must be broadcast to each subtap. Fortunately, there is no reason to use the standard digits. If the digits {-1, 0, 1} are used instead (balanced ternary), a simple trick permits a design that needs only two b bit numbers to be broadcast to each tap.
With the coefficients in a balanced ternary representation, there are four possible versions of the input (<x>m, <x>m + μ, <-x>m, or <-x>m + μ, selected by the coefficient and the carry out of the prior stage) that could be added to the prior result at each subtap. However, in the balanced case the latter two can be easily derived from the former two. If a normalized unbiased residue in a b bit binary channel is two's complemented[1], the result is the biased version of the negative of the residue; if a normalized biased residue is two's complemented, the result is the normalized version of the negative of the residue:[2]

    2^b - <x>m = m + μ - <x>m = <-x>m + μ
    2^b - (<x>m + μ) = m + μ - <x>m - μ = <-x>m

Two's complementing can be built into hardware by using XOR gates to invert the bits and using the carry in of the adder to perform the add 1. The final balanced ternary subtap using this technique is shown in figure 26. The coefficients are coded as follows:

    h1  h0   ternary digit
    0   0         0
    0   1         1
    1   1        -1
    1   0     undefined

Because three digits must be represented, two binary bits of coefficient register are necessary at each subtap. Since the coefficients are precalculated and can be placed in any chosen representation, the coefficient digit/radix definitions above were chosen specifically to simplify the decoding hardware.

[1] Two's complement is a convenient way to represent both positive and negative numbers in a binary system. The two's complement of a b bit number x is obtained by subtracting x from 2^b: -x = 2^b - x. In the binary system this computation is equivalent to inverting each digit of x (0->1, 1->0) and adding 1 to the result.
[2] Remember when looking at the equations that μ = 2^b - m, so 2^b = m + μ, and <-x>m = m - <x>m.

Figure 26  Balanced Ternary Subtap
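A quick numeric check of the balanced ternary digits and of the two complementing identities above (illustrative helper names; b = 4, m = 13, μ = 3 are arbitrary example values):

```python
# Sketch: balanced ternary coefficient digits {-1, 0, 1}, plus the b-bit
# two's complement identities used at the subtap. With mu = 2**b - m,
# complementing <x>_m yields <-x>_m + mu, and complementing <x>_m + mu
# yields <-x>_m.

def balanced_ternary(h):
    """Digits d_j in {-1, 0, 1} with h = sum_j d_j * 3**j."""
    out = []
    while h:
        r = h % 3
        if r == 2:           # represent 2 as 3 - 1: emit -1, carry 1
            out.append(-1)
            h += 1
        else:
            out.append(r)
        h //= 3
    return out or [0]

def twos_complement(v, b):
    """b-bit two's complement of v."""
    return (2**b - v) % 2**b
```

For b = 4, m = 13, μ = 3, x = 9: complementing <x>m = 9 gives 7 = (13 - 9) + 3 = <-x>m + μ, and complementing the biased 9 + 3 gives 4 = <-x>m.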
At this point a logical concern is that the range of the balanced ternary system includes negative numbers. Since multiples of the modulus can be added to a negative coefficient without changing any residue value, negative numbers can be used; all that needs to be guaranteed is that the span of the coefficient range is greater than or equal to the maximum modulus. The span of numbers for a J digit balanced ternary number system is [-(3^J - 1)/2, (3^J - 1)/2]. The number of subtaps needed per tap is ⌈log3 2^b⌉, where b is the number of bits in the binary channels. The chart below shows the number of subtaps needed for typical values of b.

    b   ⌈log3 2^b⌉
    2       2
    3       2
    4       3
    5       4
    6       4
    7       5
    8       6
    9       6

For b > 2, the balanced ternary representation of the coefficients does lower the number of subtaps per channel without increasing the number of global broadcast buses. This benefit is obtained at the expense of a slightly more complicated subtap, a two XOR gate increase in propagation delay, and a two bit coefficient register at each subtap. The summary of the design is shown below.

    Part Type          Number   Sizing (μ²)      Transistors
    1 bit Full Adder      b     9792b            32b
    2-1 MUX + XOR         b     4896b + 3264     16b + 6
    1 bit Register       b+3    8160b + 24480    24b + 72
    AND gate             b+1    4896b + 4896     6b + 6
    XOR gate              1     4896             10
    OR gate               1     4896             6
    Inverter              1     3264             2
    Totals                      27744b + 45696   78b + 102

    Global Bus:     2b signal lines
    Critical Path:  XOR + (2-1 MUX + XOR) + AND + b bit adder + OR + register
    Throughput:     (1.9b + 9.98 ns)^-1

    Architecture summary for the balanced ternary subtap

Offset Radix 4

If we are willing to broadcast additional versions of the input, we can achieve a radix 4 decomposition of the coefficients. As in the radix three case, the standard digit set {0, 1, 2, 3} is not optimal; one of the offset digit sets {-2, -1, 0, 1} or {-1, 0, 1, 2} should be used. For the design the former digit set will be (arbitrarily) chosen.
With the coefficients in the offset radix 4 representation, six versions of the input (<-2x>m, <-2x>m + μ, <-x>m, <-x>m + μ, <x>m, or <x>m + μ, selected by the coefficient and the carry out of the prior stage) could be added to the result of the prior stage. Using the two's complementing trick, only four need to be broadcast to each subtap.

Figure 27  Offset Quaternary Subtap

The design is shown in an easy to understand positive logic form in figure 27. The coefficients are coded as follows:

    h1  h0   quaternary digit
    0   0         0
    0   1         1
    1   0        -2
    1   1        -1

Using the same coefficient coding, the design can be slightly optimized (two inverters removed) by drawing the circuit in a less intuitive negative logic manner, as shown in figure 28.

Figure 28  Offset Quaternary Subtap (Negative Logic)

The span of the coefficients in a J digit offset radix 4 representation is [-(2/3)(4^J - 1), (1/3)(4^J - 1)]. The total span equals 4^J, as it does in any radix 4 system. The number of subtaps per tap is J = ⌈log4 2^b⌉ = ⌈b/2⌉. The chart below lists J for some typical values of b.

    b   ⌈b/2⌉
    2     1
    3     2
    4     2
    5     3
    6     3
    7     4
    8     4
    9     5
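The offset digit recoding can be sketched as follows (illustrative name; standard digits 2 and 3 are folded to -2 and -1 with a carry into the next digit position):

```python
# Sketch of the offset radix 4 digit set {-2, -1, 0, 1}: each digit then
# selects one of the broadcast versions of the input (x, 2x, -x, -2x).

def offset_radix4(h):
    """Digits d_j in {-2, -1, 0, 1} with h = sum_j d_j * 4**j."""
    out = []
    while h:
        r = h % 4
        if r >= 2:           # fold 2 -> -2 and 3 -> -1, carry 1 upward
            out.append(r - 4)
            h += 4
        else:
            out.append(r)
        h //= 4
    return out or [0]
```

For example, 7 recodes as -1 - 2*4 + 1*16, so a three-digit chain reproduces a coefficient of 7 using only the four broadcast input versions.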
The number of subtaps per tap decreases from the balanced ternary design. Unfortunately, the subtaps are more complicated, and the requisite number of global bus lines has doubled. The summary of the optimized hardware is shown below.

    Part Type          Number   Sizing (μ²)       Transistors
    1 bit Full Adder      b     9792b             32b
    4-1 MUX + XNOR        b     11424b + 11424    26b + 34
    1 bit Register       b+3    8160b + 24480     24b + 72
    NOR gate             b+1    4896b + 4896      4b + 4
    AND gate              1     4896              6
    XOR gate              1     4896              10
    OR gate               1     4896              6
    Totals                      34272b + 55488    86b + 134

    Global Bus:     4b signal lines
    Critical Path:  XOR + (4-1 MUX + XNOR) + NOR + b bit adder + OR + register
    Throughput:     (1.9b + 12.14 ns)^-1

    Architecture summary for the offset quaternary subtap

Balanced Quinary

The logical extension of the offset quaternary representation of the coefficients is a balanced quinary representation: radix 5 with the digit set {-2, -1, 0, 1, 2}. No new innovations are needed for the design, which is shown in positive logic form in figure 29. The coefficients for the positive logic form are coded as follows:

    h2  h1  h0   quinary digit
    0   0   0         0
    0   0   1         1
    0   1   1         2
    1   0   1        -1
    1   1   1        -2

Figure 29  Balanced Quinary Subtap

By modifying the design slightly, it is possible to eliminate the inverter. The final version is shown in figure 30. This version requires a different coefficient coding, as follows:

    h2  h1  h0   quinary digit
    0   0   1         0
    0   0   0         1
    0   1   0         2
    1   0   0        -1
    1   1   0        -2

The only difference between the two coefficient digit sets is that h0 has been inverted.

Figure 30  Balanced Quinary Subtap (Negative Logic)

The span of the coefficients in a J digit balanced radix 5 representation is [-(5^J - 1)/2, (5^J - 1)/2]. The total span equals 5^J, as it does in any radix 5 system. The number of subtaps per tap is J = ⌈log5 2^b⌉.
The chart below lists J for some typical values of b.

    b   ⌈log5 2^b⌉
    2       1
    3       2
    4       2
    5       3
    6       3
    7       4
    8       4
    9       4

Unfortunately, the advantage of using balanced quinary is not realized until b is greater than or equal to 9. The balanced quinary subtap requires the same number of global buses and has the same throughput as the offset quaternary design, so the architecture would not be warranted unless b >= 9.[1] For those truly massive dynamic range requirements, the hardware summary is shown below.

    Part Type          Number   Sizing (μ²)       Transistors
    1 bit Full Adder      b     9792b             32b
    4-1 MUX + XNOR        b     11424b + 11424    26b + 34
    1 bit Register       b+4    8160b + 32640     24b + 96
    NOR gate              b     4896b             4b
    AND gate              1     4896              6
    XOR gate              1     4896              10
    OR gate               1     4896              6
    Totals                      34272b + 58752    86b + 152

    Global Bus:     4b signal lines
    Critical Path:  XOR + (4-1 MUX + XNOR) + NOR + b bit adder + OR + register
    Throughput:     (1.9b + 12.14 ns)^-1

    Architecture summary for the balanced quinary subtap

[1] Proof that aesthetics and symmetry are not the only things that are important.

Subtap Summary

The coefficients could be decomposed into even higher radix representations, but the disadvantages of having larger multiplexors and even more global buses per tap would outweigh any advantages that would be obtained from having fewer subtaps. At this point it is most instructive to examine the four designs for different values of b. The tables below list the size and number of transistors per subtap and for each entire tap, calculated using the architecture summary for each design.

    Binary
    b   J   subtap size (μ²)   subtap transistors   tap size (μ²)   tap transistors
    2   2        86496               210                 172992            420
    3   3       114240               282                 342720            846
    4   4       141984               354                 567936           1416
    5   5       169728               426                 848640           2130
    6   6       197472               498                1184832           2988
    7   7       225216               570                1576512           3990
    8   8       252960               642                2023680           5136
    9   9       280704               714                2526336           6426

    Balanced Ternary
    b   J   subtap size (μ²)   subtap transistors   tap size (μ²)   tap transistors
    2   2       101184               258                 202368            516
    3   2       128928               336                 257856            672
    4   3       156672               414                 470016           1242
    5   4       184416               492                 737664           1968
    6   4       212160               570                 848640           2280
    7   5       239904               648                1199520           3240
    8   6       267648               726                1605888           4356
    9   6       295392               804                1772352           4824
architecture The values summary for 2 size (p ) entire tap Itransistors 86496 210 282 354 426 498 570 642 714 114240 141984 169728 197472 225216 252960 280704 size (p2) Itransistors 172992 342720 567936 848640 1184832 1576512 2023680 2526336 420 846 1416 2130 2988 3990 5136 6426 Binary per 2 subtap entire tap J size (p ) transistors size (pt2) transistors 2 2 101184 128928 156672 184416 212160 239904 267648 295392 258 336 414 492 570 648 726 804 202368 257856 470016 737664 848640 1199520 1605888 1772352 516 672 1242 1968 2280 3240 4356 4824 b 2 3 4 5 6 7 8 4 4 5 6 9 6 3 by The tables below list the size and number of per subtap J 2 3 4 5 6 7 8 9 and advantages that would be obtained design. b 2 3 4 5 6 7 8 9 radix At this point it is most instructive to examine the four designs for different values of b. transistors of into Balanced Ternary were each 62 per b J size (42) 2 3 4 5 6 7 1 2 2 3 3 4 4 5 124032 158304 192576 226848 261120 295392 329664 363936 8 9 subtap ent ire transistors 306 392 size (p 2 ) 124032 316608 478 385152 564 650 736 822 908 680544 783360 1181568 1318656 1819680 Offset per b 2 3 4 5 6 7 8 9 J 1 2 2 3 3 4 4 4 306 784 956 1692 1950 2944 3288 4540 en tire transistoirs 324 410 496 582 668 754 840 926 127296 161568 195840 230112 264384 298656 332928 367200 transistors Quaternary subtap size (p2) tap Balanced tap size (p2) transistors 127296 323136 391680 690336 793152 1194624 1331712 1468800 324 820 992 1746 2004 3016 3360 3704 Quinary Figures 31 and 32 provide a graphic comparison of the size and number of transistors, base 3 respectfully, for the appears to be marginally different designs. better for b = Although 3 and the balanced the balanced base 5 appears to be clearly better for b = 9, the offset base four design seems to have an advantage for all values of b in between. Assuming a large number of filter taps, it is possible to neglect the scaling and summing units needed for the different requirements, balanced quaternary designs. 
The differences between the global busing requirements, however, cannot be neglected. Both the binary and the balanced ternary implementations need 2b global bus lines[2]; the offset quaternary and the balanced quinary need 4b lines.

[1] Remember, the output of each mini-convolution (subtap) chain must be scaled and summed with the scaled outputs of the other chains.
[2] In reality, the clock must be globally broadcast also, but this is common to all designs.

Figure 31  Size (μ²) versus Number of Bits in Binary Channel
Figure 32  Transistors versus Number of Bits in Binary Channel

The delay through the subtaps increases monotonically as the radix increases or as the number of bits in the binary channel increases. A summary of the latency[3] through a single subtap is shown in the following table:

    b    binary     balanced    offset       balanced
                    ternary     quaternary   quinary
    2   11.86 ns   13.78 ns    15.94 ns     15.94 ns
    3   13.76 ns   15.68 ns    17.84 ns     17.84 ns
    4   15.66 ns   17.58 ns    19.74 ns     19.74 ns
    5   17.56 ns   19.48 ns    21.64 ns     21.64 ns
    6   19.46 ns   21.38 ns    23.54 ns     23.54 ns
    7   21.36 ns   23.28 ns    25.44 ns     25.44 ns
    8   23.26 ns   25.18 ns    27.34 ns     27.34 ns
    9   25.16 ns   27.08 ns    29.24 ns     29.24 ns

[3] As are the hardware sizing estimates, the latency estimates are derived from a standard cell library. See the appendix...

Scaling and Summing

Although the scaling and summing operations could be neglected for the comparison between the different coefficient decomposition designs, they are an essential part of the complete design. Earlier in this chapter two computational procedures were discussed.

Figure 33  Premultiplication by Powers of the Radix
Either the input can be premultiplied by powers of the radix, as shown in figure 33, or the results of the mini-convolution chains can be postmultiplied by powers of the radix (figure 34). To simplify the computation and decrease the necessary hardware, a form of Horner's Algorithm[1] can be used, as seen in figures 35 and 36. The use of Horner's Algorithm has the complication of requiring that the data arrival times be skewed so that the inputs to the final adder chain arrive at the proper times.[2] To minimize the number of registers needed to skew the data, the first computational procedure is used in the form shown in figure 35; the latency of the multiply by α blocks equalizes the delay between the outputs of consecutive subtaps. If the second computational procedure were used, a chain of registers would be needed to provide delayed versions of the input to each mini-convolution chain.

Figure 34  Postmultiplication by Powers of the Radix

With a computational procedure finally chosen, scaling units must be developed to multiply by the radices {2, 3, 4, 5} used in the designs. Obviously, minimum latency designs are desired, but the scaling blocks must also operate with a throughput equal to that of the filter subtaps or, equivalently, one clock cycle.

[1] Horner's Algorithm is usually associated with polynomial evaluation, where Σ_{i=0..n} a_i x^i is evaluated as a0 + x(a1 + x(a2 + x(a3 + x( ... )))).
[2] The ith subtap in each mini-convolution chain will be operating on data from different input times.
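Horner's form for combining the mini-convolution outputs can be sketched in a few lines (illustrative name; in hardware each iteration corresponds to one multiply-by-α block and one adder, and a modulo-m reduction would follow each step in a residue channel):

```python
# Sketch of Horner's form for the final summation of eq (2):
# sum_j alpha**j * c_j is evaluated as c0 + alpha*(c1 + alpha*(c2 + ...)),
# where c_j is the output of the jth mini-convolution chain.

def horner_sum(chain_outputs, alpha):
    acc = 0
    for c in reversed(chain_outputs):
        acc = alpha * acc + c
    return acc
```

Evaluated this way, no power of α is ever formed explicitly; only the single radix multiplier is needed at each stage.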
Figure 35 Horner's Algorithm for Premultiplication Unfortunately, a multiply by 3 unit is not as simple as the previous two. combination of one-half of a *2 unit multiply by 3. with additional hardware is needed A design with a latency of two is shown in figure 38. number of adders could be reduced by adding x to 2x and conditionally g to the result; A to The adding cost of this hardware optimization, however, is an increase in latency to 3 clock cycles. 1 Assuming both x and x+m are available. Otherwise, an extra stage would be necessary to generate both unbiased and biased versions of the multiplicand. This extra stage would have a throughput of one clock cycle, but would increase the total latency to two clock cycles. One half of the multiply by two block will operate on an unbiased input to produce only an unbiased result in one clock cycle. 67 Figure 36 Horner's Algorithm for Postmultiplication <3x> m Figure 37 Multiply by 3 Block 68 The multiply by 5 unit is very similar to the multiply by three block. A design with a latency of three is shown in can be implemented by adding an additional multiply by 2 block to the *3 design and an additional delay register for the input. As with the *3 block, the number of adders could be reduced by adding x to 4x and conditionally adding g to the result at the cost of increased latency. New Algorithm To reduce with the number radix decompositions, biased/unbiased used Buses of globally designs signals needed for larger in more depth. In each design the number of bits channels is equal to the maximum number of bits that a residue could occupy. The output of each subtap to be normalized1 be uniquely broadcast it is necessary to back up and examine the basis of the in the binary normalized Fewer represented. dynamic range because The normalization is restriction unnormalized achieved by forces numbers the could guaranteeing that a normalized unbiased residue is always added to a normalized biased one. 
output of the previous stage will always be in normalized be biased. Because of the uncertain unbiased and biased must be precalculated keep residues number The versions of each within The but may or may not state of the previous stage's output, both normalized and bused to the subtaps. normalized 2 a binary digit multiple of the input This clever procedure used to channel has led to the large of buses. Algorithm If more represent uniquely the are each subtap, as required included normalized represented. normalized, number bits in the residues, Instead of binary channels unnormalized requiring that than are residues the output of needed to can also be a subtap be the output can be restricted to some range of values modulo m. add in the new product plus positive or negative to keep the result within some restricted range. of bits in the binary channels, the sum of At multiples of m With a sufficient two unbiased residues, 1 Generally, normalized residue implies that the magnitude of the residue falls within the range [0, m-1]. In this case, because the dynamic range of the channel is equal to the dynamic range need for the maximum modulus, the "normalized range" could actually be any offset of the standard normalized range (ie [a, m+a-1] I a e Z ). The point is that the span of the output cannot exceed m uniquely. 2 Inductive reasoning 69 either of which may not be normalized, which also may not be normalized. always be unbiasedi, will previous If these digits are powers adder merely two's negative representation designs values in of negative all numbers were indicated complementing However, residue, of two, the unbiased digit at the subtap by a left shifter. can be calculated by two's complementing. support complement unbiased Since the output of the previous stage will multiples of the coefficient can be calculated To an we only need to provide the subtaps with unbiased digit multiples of the input. 
Negative values result in whether trick unbiased environment, binary numbers considered the used to an must be used. positive. sum was the In the The carry out of the biased or unbiased. invert residues two's generated Even the positive values 2 . with true negative numbers in the system a method exists to ensure that the output of a subtap lies within a certain range. In order to keep the temporary results form growing in multiples of m must be added or subtracted from the accumulation. magnitude, By adding and subtracting multiples of both x (the current input) and x - m (now a true negative number), subtracted to keep the result within a specified range. the multiples of m can is negative, a positive number is added to it; a negative number is added to it. following notation 3 be automatically added If the previous result if the previous result is positive, The subtap algorithm is listed below with the : pi[n] equal to the result of the ith stage at the nth time step and hj[N-i] equal to the jth digit in the balanced radix a decomposition the N-i th coefficient. if (hj[N-i] == 0) /* case 0*! pi[n] = pi.1[n-1] /* case 1 */ if (hj[N-i] > 0 && pi-l[n-1] > 0) pi[n] = pi.1[n-1] if (hj[N-i] > 0 or + hj[N-i] * (x-m) && pi-i[n-1] < 0) pi[n] = pi.1[n-1] /* case 2 */ + hj[N-i] * x 1 Inductive reasoning, again 2 When <x>m was inverted to generate <-x>m, the result <-x>m equaled m - <x>m not - <x>m. 3 Some of this notation was developed in section 2.2.2 in the discussion of the transpose filter of 70 if (hj[N-i] < 0 && pi.i[n-1] > 0) pi[n] = p-ii[n-1] /* case 3 */ + hj[N-i] * x /* case 4 */ if (hj[N-i] < 0 && pi-i[n-1] < 0) pi[n] = pi-i[n-1] Examining algorithm the magnitudes shows the * (x-m) + hj[N-i] of the quantities range that the output, pi[n], number of bits needed in the binary channels. 
The magnitude of pi[n] will be largest when the magnitude of pi-1[n-1] is close to zero and the magnitude of the coefficient hj[N-i] equals its maximum. Also, because x spans from 0 to m-1 and x-m spans from -m to -1, it is expected that the cases including x-m will contribute the largest results. Regardless, each case in the algorithm will be examined separately. For case #1, since a negative number is being added to pi-1[n-1], the maximum magnitude of pi[n] will equal -m*max(hj) when pi-1[n-1] = 0. For case #2, since a positive number is being added to pi-1[n-1], the maximum magnitude of pi[n] will equal (m-1)*max(hj) - 1 when pi-1[n-1] = -1. For case #3, the maximum magnitude of pi[n] will equal (m-1)*max(hj). Finally, for case #4, the maximum magnitude will equal -m*min(hj) - 1. Collecting these results, the output spans the range -m*max(hj) to -m*min(hj) - 1.

If h[n] is decomposed in a balanced radix system with all digits equal to powers of two and the maximum value of m equal to 2^b, the span of the output can be efficiently1 represented in a b+c+1 bit two's complement binary channel, where c = log2(max(hj)) and the extra bit is used for the sign.

Because the temporary values in each mini-convolution chain can span the range +/- m*max(hj), some method is needed to normalize the values after the final tap. Although the calculation, which consists of adding/subtracting multiples of m, will not require a significant amount of hardware, it is part of the original algorithm and therefore must be considered. Also, as in the biased/unbiased algorithm, the outputs of the final subtaps must be scaled and summed to form the final result for the residue channel. Again, this hardware will not be significant for a large number of taps but still must be incorporated into the system. For now these components will be ignored; we will return to them after the two different subtap designs have been completed.

1 Efficient implies unique representation with no wasted dynamic range for the case m = 2^b.
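The case-by-case bounds above can be checked exhaustively for a small channel. The sketch below is a hypothetical checking harness (not part of any design): it sweeps every previous result in the claimed invariant range, every power-of-two digit up to hmax, and every input residue.

```c
#include <assert.h>

/* Record the extreme subtap outputs over all inputs with the previous
 * result confined to the claimed range -m*max(h) .. -m*min(h)-1. */
void sweep_subtap(long m, long hmax, long *lo, long *hi)
{
    *lo = 0;
    *hi = 0;
    for (long p = -m * hmax; p <= m * hmax - 1; p++)
        for (long h = -hmax; h <= hmax; h++) {
            long a = (h < 0) ? -h : h;
            if (a && (a & (a - 1)))
                continue;               /* digits must be powers of two */
            for (long x = 0; x < m; x++) {
                long r;
                if (h == 0)     r = p;
                else if (h > 0) r = (p >= 0) ? p + h * (x - m) : p + h * x;
                else            r = (p >= 0) ? p + h * x : p + h * (x - m);
                if (r < *lo) *lo = r;
                if (r > *hi) *hi = r;
            }
        }
}
```

For m = 8 (b = 3) and the digit set {-2, -1, 0, 1, 2}, the observed extremes are exactly -16 and 15, that is, -m*max(hj) and -m*min(hj) - 1, which a b+c+1 = 5 bit two's complement channel holds with no wasted range.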
Hardware

The three implementations that will be examined are balanced ternary, balanced quinary, and balanced septary (base 7). The hardware required to implement the algorithm in the previous section is very similar to the coefficient decomposition subtap designs. The major differences are that the binary channels are slightly wider and that some method is needed to perform left shifts. A general block diagram of the component elements is shown in figure 38.

[Figure 38: New Algorithm Subtap]

Balanced Ternary

The first implementation, balanced ternary, does not need the left shift capability because the only digits are -1, 0, and 1. The basic design of the subtap is shown in figure 39. It is virtually the same as the coefficient decomposition subtap architecture except that no carry forwarding circuitry is needed. The coefficients are decoded as follows:

    h1  h0  |  ternary digit
     0   0  |   0
     0   1  |   1
     1   1  |  -1
     1   0  |  undefined

[Figure 39: New Balanced Ternary Subtap]

The high order bit of the previous result indicates whether the previous result is negative, and the high order bit of the coefficient indicates whether the coefficient is negative. The XNOR combination of these signals assures that a positive number is always added to a negative one and a negative always added to a positive. The design can be slightly optimized by eliminating one stage of the 2-1 MUX. Because the maximum value of m is 2^b, the maximum value of x is 2^b - 1 and the minimum value of x-m is -2^b. So, the high order bit of x is always set to 0, and the high order bit of x-m is always set to 1.
Because the output of the XNOR is 0 to select x and 1 to select x-m, it can provide the high order bit of x and x-m directly. In addition, the high order bits of x and x-m do not need to be globally broadcast. The final design is shown in figure 40.

[Figure 40: Improved New Balanced Ternary Subtap]

Although the carry forwarding circuitry has been removed, and with it an OR gate in the critical path, the overall design is probably worse than the corresponding unbiased/biased design. More gates were added than were removed, and the propagation delay of the additional carry stage in the adder is longer than the delay of the removed OR gate. Also, the same number of bus lines are needed. At least this design shows proof of concept. The architecture summary is listed below.

    Part Type         Number  Sizing              Transistors
    1 bit Full Adder  b+1     9792b + 9792 µ²     32b + 32
    2-1 MUX + XOR     b       4896b + 3264 µ²     16b + 6
    1 bit Register    b+3     8160b + 24480 µ²    24b + 72
    AND gate          b+1     4896b + 4896 µ²     6b + 6
    XOR gate          1       4896 µ²             10
    XNOR gate         1       4896 µ²             8
    Totals                    27744b + 52224 µ²   78b + 134

    Global Bus:     2b signal lines
    Critical Path:  XNOR + (2-1 MUX + XOR) + AND + b+1 bit adder + register
    Throughput:     (1.9b + 10.36 ns)^-1

    Architecture Summary for balanced ternary subtap #2

Balanced Quinary

The purpose of the new subtap algorithm is to reduce the number of global buses, not to reduce the amount of hardware per subtap. Because the balanced ternary implementation already requires only two numbers to be globally broadcast, it could not have been expected to be an improvement. The balanced quinary implementation, however, will lower the number of globally broadcast numbers from four to two.
[Figure 41: New Balanced Quinary Subtap]

The basic design will be very similar to the balanced ternary design except that hardware must be added to perform the left shifts. Because the standard cell library that I am using does not include a left shifter, a b+1 bit MUX with hardwired left shifted versions of x and x-m is used. The b+1th bit of the x input to the MUX is tied to 0, and the b+1th bit of the x-m input is tied to 1; the low order bits of both the 2x and 2x-2m inputs are tied to 0. The high order bit (the b+2 bit) of all inputs is added by the selector line after the MUX. The final design is shown in figure 41. The coefficients are decoded as follows:

    h2  h1  h0  |  quinary digit
     0   0   0  |   0
     0   0   1  |   1
     0   1   1  |   2
     1   0   1  |  -1
     1   1   1  |  -2

Once again, the balanced quinary subtap used with the new subtap algorithm results in more hardware and longer propagation delays. However, the number of globally broadcast lines has been reduced from 4b to 2b. Also, the left shifter could be implemented as a partially populated grid of transmission gates1, and a b bit 2-1 MUX could be used to choose between x and x-m. At this point a custom VLSI layout of both designs would be necessary to determine the relative cost of an additional bus line. Regardless, within the available standard cell technology, the architecture summary is as follows:

    Part Type         Number  Sizing              Transistors
    1 bit Full Adder  b+2     9792b + 19584 µ²    32b + 64
    4-1 MUX + XOR     b+1     8160b + 19584 µ²    24b + 58
    1 bit Register    b+5     8160b + 40800 µ²    24b + 120
    AND gate          b+2     4896b µ²            6b + 12
    XOR gate          1       4896 µ²             10
    XNOR gate         1       4896 µ²             8
    Totals                    31008b + 89760 µ²   86b + 272

    Global Bus:     2b signal lines
    Critical Path:  XNOR + (4-1 MUX + XOR) + AND + b+2 bit adder + register
    Throughput:     (1.9b + 14.06 ns)^-1

    Architecture Summary for the balanced quinary subtap #2

1 The grid would be rectangular with transmission gates on the two central diagonals to allow the rows (input) to be directly passed to the columns (output) or to allow the rows to be passed shifted left one place.
Once the select lines have been set, the propagation delay through the shifter would only be one transmission gate delay. A similar design with a fully populated grid is frequently used as a barrel shifter.

Modified Balanced Septary

Because the new subtap algorithm only requires two b bit numbers to be broadcast, a radix seven design can be implemented; with the biased/unbiased algorithm, six b bit numbers would have been needed. The standard balanced septary digits would be the set {-3, -2, -1, 0, 1, 2, 3}. Because we are trying to avoid real multiplications, the modified digit set {-4, -2, -1, 0, 1, 2, 4} will be used instead. Although the modified digit set allows all multiplications to be performed by left shifts, it does put some restrictions on the range of numbers spanned.

[Figure 42: New Modified Septary Subtap]

A good way to understand the modified digit set is to examine a number in the standard balanced septary representation and convert it to the modified digit set. A simple rule exists for the conversion: starting from right to left, whenever a 3 occurs replace it with a -4 and carry 1, and whenever a -3 occurs replace it with a 4 and carry -1. The nonnegative numbers of a two digit modified system are listed below:

    decimal  standard (a1 a0)  modified (b1 b0)
       0         0  0             0  0
       1         0  1             0  1
       2         0  2             0  2
       3         0  3             1 -4
       4         1 -3             0  4
       5         1 -2             1 -2
       6         1 -1             1 -1
       7         1  0             1  0
       8         1  1             1  1
       9         1  2             1  2
      10         1  3             2 -4
      11         2 -3             1  4
      12         2 -2             2 -2
      13         2 -1             2 -1
      14         2  0             2  0
      15         2  1             2  1
      16         2  2             2  2
      17         2  3             not possible

Examining the table of modified balanced radix 7 representations, the largest positive number that can be represented contiguously is (2...2)7, the number with all digits equal to 2; correspondingly, the largest magnitude negative number has all digits equal to -2. In general, the total span of a J digit modified radix 7 number system is

    span = sum(i=0 to J-1) 2(7^i) + sum(i=0 to J-1) 2(7^i) + 1 = (2/3)(7^J - 1) + 1

where the first term accounts for the positive numbers, the second term for the negative numbers, and the final term for zero. Unfortunately, we lose approximately one-third of the span to make all of the digits powers of two. The resulting number of subtaps per tap is listed in the following table for several values of b.

    b        2  3  4  5  6  7  8  9
    Subtaps  1  2  2  2  3  3  4  4

The modified radix 7 representation requires fewer subtaps per tap than the balanced radix 5 representation for b equal to 5 or 7. Once again, the actual hardware design is very similar to the previous two designs. Using the formula at the end of section 4.1.3.1, b+3 bits are needed in the binary channels. The left shifts are performed by a hardwired 8-1 MUX, although a custom left shifter1 would be more efficient in hardware size and speed. One small advantage of using the 8-1 MUX is that the delay of an AND gate in the critical path is removed, although the added delay of an 8-1 MUX and 3 additional adder stages more than compensates. The final design is shown in figure 42. The coefficients are decoded as follows:

    h2  h1  h0  |  coefficient
     0   0   0  |   0
     0   0   1  |   1
     0   1   0  |   2
     0   1   1  |   4
     1   0   1  |  -1
     1   1   0  |  -2
     1   1   1  |  -4

Any comparison of this design with other designs should consider that only two b bit busses are used. Limiting the number of global buses was the primary goal, and it has been achieved. The complete hardware summary is shown below.

1 In this case the rectangular grid would have transmission gates on three central diagonals.
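The right-to-left conversion rule from balanced to modified septary digits can be sketched in C. This is a toy software model, not the hardware; the digit array is least significant digit first and the function name is my own.

```c
#include <assert.h>

/* Convert balanced septary digits (-3..3) in place to the modified set
 * {-4,-2,-1,0,1,2,4}: a 3 becomes -4 with carry 1, a -3 becomes 4 with
 * carry -1.  Incoming carries may create new 3s, handled the same way. */
void to_modified(int d[], int n)
{
    int carry = 0;
    for (int i = 0; i < n; i++) {
        d[i] += carry;
        carry = 0;
        if (d[i] == 3)       { d[i] = -4; carry = 1;  }
        else if (d[i] == -3) { d[i] = 4;  carry = -1; }
    }
}
```

For example, 10 = (1 3) in balanced septary becomes (2 -4), matching the table above, since -4 + 2*7 = 10.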
    Part Type         Number  Sizing               Transistors
    1 bit Full Adder  b+3     9792b + 29376 µ²     32b + 96
    8-1 MUX           b+2     17952b + 84864 µ²    50b + 220
    1 bit Register    b+6     8160b + 48960 µ²     24b + 144
    XOR gate          b+3     4896b + 14688 µ²     10b + 30
    XNOR gate         1       4896 µ²              8
    Totals                    40800b + 182784 µ²   116b + 498

    Global Bus:     2b signal lines
    Critical Path:  XNOR + 8-1 MUX + b+3 bit adder + register
    Throughput:     (1.9b + 16.78 ns)^-1

    Architecture Summary for the modified balanced septary subtap

Subtap Summary

The tables below list the size in µ², the number of transistors needed, and the latency for each of the three subtap designs using the new algorithm. All of the numbers were obtained from a standard cell library. Figures 43 and 44 graphically summarize the size and transistor data.

    Balanced Ternary
                 per subtap              entire tap
    b   J   size (µ²)  transistors  size (µ²)  transistors
    2   2    107712       290         215424       580
    3   2    135456       368         270912       736
    4   3    163200       446         489600      1338
    5   4    190944       524         763776      2096
    6   4    218688       602         874752      2408
    7   5    246432       680        1232160      3400
    8   6    274176       758        1645056      4548
    9   6    301920       836        1811520      5016

    Balanced Quinary
                 per subtap              entire tap
    b   J   size (µ²)  transistors  size (µ²)  transistors
    2   1    151776       444         151776       444
    3   2    182784       530         365568      1060
    4   2    213792       616         427584      1232
    5   3    244800       702         734400      2106
    6   3    275808       788         827424      2364
    7   4    306816       874        1227264      3496
    8   4    337824       960        1351296      3840
    9   4    368832      1046        1475328      4184

    Modified Balanced Septary
                 per subtap              entire tap
    b   J   size (µ²)  transistors  size (µ²)  transistors
    2   1    264384       730         264384       730
    3   2    305184       846         610368      1692
    4   2    345984       962         691968      1924
    5   2    386784      1078         773568      2156
    6   3    427584      1194        1282752      3582
    7   3    468384      1310        1405152      3930
    8   4    509184      1426        2036736      5704
    9   4    549984      1542        2199936      6168

[Figure 43: Size (µ²) versus Number of Bits in the Binary Channels]

[Figure 44: Transistors versus Number of Bits in the Binary Channels]

    Latency through Subtap
    b   Balanced Ternary  Balanced Quinary  Modified Balanced Septary
    2       14.16 ns          17.86 ns          20.58 ns
    3       16.06 ns          19.76 ns          22.48 ns
    4       17.96 ns          21.66 ns          24.38 ns
    5       19.86 ns          23.56 ns          26.28 ns
    6       21.76 ns          25.46 ns          28.18 ns
    7       23.66 ns          27.36 ns          30.08 ns
    8       25.56 ns          29.26 ns          31.98 ns
    9       27.46 ns          31.16 ns          33.88 ns

In each case the new algorithm subtaps appear to be both larger and slower than the corresponding old algorithm designs. Unfortunately, the numbers can only be considered rough estimates of the actual hardware size and speed of the new algorithm designs. Since all of the RNS designs discussed in this paper are intended for full custom implementation, the actual implementations would not be restricted to the parts in a standard cell library, and the subtap designs, both old and new algorithm, would be both smaller and faster.1 However, the new algorithm designs are more disadvantaged by the standard cell library. Because there was no left shifter available, the left shifts were implemented by multiplexors. The largest discrepancy occurs for the modified radix 7 implementation, where an 8-1 MUX was used instead of a left shifter, a 2-1 MUX, and an AND gate. If the designs were implemented full custom, the new algorithm subtaps would be comparable or even superior in both hardware and speed.

Putting it all Together

Earlier in this chapter, the scaling and summing of each mini-convolution channel was discussed for the biased/unbiased algorithm. Fortunately, the same computation must be performed for the new algorithm, and the same scaling boxes can be used. The primary difference between the two cases is the form of the output at the final subtaps. Using the biased/unbiased algorithm, the output of a subtap is always normalized, but may or may not be biased; to compute an unbiased version of the output requires only one clock cycle. Using the new algorithm, the output of a subtap is always unbiased, but may or may not be normalized.
Depending on the range of values that the unnormalized output can span2, several clock cycles are needed to generate a normalized version of the output. Earlier, the output of a subtap was shown to vary within the range +/- m*max(hj). One of the fundamental assumptions of the new algorithm is that the members of the digit set are powers of two; therefore, max(hj) can be written as 2^k, and the output range written as +/- 2^k * m. Normalizing the output is performed by successively adding or subtracting decreasing powers of two times m. The top level block diagram of this algorithm is shown in figure 45.

1 The area given for a standard cell part includes a boundary around the edges of the actual part to prevent violations of design rules. The area for a custom design including several parts will be significantly less than the sum of the areas of each part and their respective boundary layers.
2 Determined by the maximum allowed digit in the coefficient decomposition.

[Figure 45: Normalizing Stage]

A block diagram of a norm box is shown in figure 46. Each of the norm boxes operates in a manner similar to the subtaps of the filter. If the input to a norm box is positive, a negative multiple of the modulus is added to it; if the input is negative, a positive multiple of the modulus is added to it. At each step the range of the output is reduced by a power of two until the output falls into the range +/- m. The fix block at the end of the chain operates in a similar manner, but adds the capability to output both unbiased and biased1 versions of the output.

[Figure 46: Norm Box]

1 The biased version of x is equal to x-m ignoring the b+1st bit.
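The norm-box chain of figures 45 and 46 can be modeled in a few lines of C. This is a behavioral sketch only; the function name and the use of plain integers are my own, and k >= 1 is assumed.

```c
#include <assert.h>

/* Normalize y, where |y| < 2^k * m, by adding +/- 2^(k-1)m, 2^(k-2)m,
 * ..., m with the sign opposite to y's (the norm boxes), then fold any
 * remaining negative value into [0, m) (the fix box, unbiased output). */
long normalize(long y, long m, int k)
{
    for (long step = m << (k - 1); step >= m; step >>= 1)
        y += (y >= 0) ? -step : step;   /* one norm box per power of two */
    if (y < 0)
        y += m;                          /* fix box */
    return y;
}
```

Each stage adds a multiple of m, so the residue is preserved while the range is halved; for instance, normalize(17, 5, 2) and normalize(-13, 5, 2) both yield 2.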
At the beginning of this section it was implied that the same scaling units designed for the biased/unbiased algorithm can be used here also. This is true; however, it should be mentioned that scaling units can also be designed within the philosophy of the new algorithm. The new scaling units would add x, x-m, and left shifted versions of each together to obtain an unnormalized scaled value that spans a certain range. A chain of norm boxes and a fix box would then return the scaled value to the normalized range. There may be slight hardware and aesthetic advantages to the new algorithm scaling units; however, they do exhibit an increased latency because of the added correction stages.

Binary to RNS Conversion Block

Now that architectures have been developed to compute a high-speed convolution sum within RNS, a method is needed to convert the data into residue form at an equally high throughput and a low latency. Because a filter chain is needed for each modulus, some form of programmability is necessary to allow the same design to operate for any modulus. Programmability can be obtained with different levels of programming. The simplest form of programming is a single b bit number for each modulus that could be loaded into a register or asserted to input pins. More complex programming would consist of loading several values into registers or blocks of memory. When comparing designs for the binary to RNS converter, the primary consideration becomes the level of programming: how much is it worth to eliminate tables?

Table Lookup Approach

The table lookup approach is useful because a binary to RNS conversion is a linear function, ignoring possible normalization. If the binary input to the filter is d bits and the residue channels are b bits, the residue value of the input is equal to the sum of the low order b bits of the binary input with the residue value of the high order (d-b) bits.
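The low/high split just described is easy to state in C. The sketch below uses hypothetical helper names; the final mod m performs the normalization that the text sets aside.

```c
#include <assert.h>

/* table[h] = <h * 2^b>m for each pattern of the high d-b bits */
void build_table(int d, int b, long m, long table[])
{
    for (long h = 0; h < (1L << (d - b)); h++)
        table[h] = (h << b) % m;
}

/* Residue of a d-bit value: low b bits plus the table entry for the
 * high bits, reduced mod m. */
long to_residue(unsigned long v, int b, long m, const long table[])
{
    return ((long)(v & ((1UL << b) - 1)) + table[v >> b]) % m;
}
```

With d = 8, b = 3, and m = 7, the table has 2^(d-b) = 32 entries and reproduces v mod 7 for every 8 bit input.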
The conversion of the high order bits can be performed by a table; the resulting conversion unit is shown in figure 47. Unfortunately, the 2^(d-b) x b bit table used could be very large and slow for large values of d. Because the large table is implementing a linear function, however, it could be replaced by a number of smaller tables. A similar approach for general linear functions of one variable has already been discussed in chapter 3.

[Figure 47: Table Lookup Approach for Binary to RNS Conversion]

Using several smaller tables seems to be the best way to achieve a high throughput conversion, but some method is needed to efficiently add the outputs of these tables. Since several versions of residue accumulators have been developed for the filter chains, these can be used as a starting point. In order to avoid drastic recoding of the high order d-b bits of the input, only the radix 2 and radix 4 accumulators1 will be considered.

Because the radix 4 accumulator designs used an offset digit set {-2, -1, 0, 1}, the high order (d-b) bits of the input must be recoded. The recoding algorithm is as follows: taking the input bits in pairs, obtain the standard radix four representation for the number; then, starting at the low order digit, if the digit is 2 replace it with -2 and carry 1 to the next place, if the digit is 3 replace it with -1 and carry 1, and otherwise let the digit remain unchanged. An alternate method to recode the input is to add (22...2)4, or (1010...10)2, and generate a carry whenever a digit is 2 or 3. Although this method does generate the desired carries, the result must be interpreted correctly. The conversion for a single radix four digit is shown below, where the primed digits denote the offset interpretation (recoded digit = sum digit - 2, with the carry out propagating to the next place):

    00 (0)  +  10  =    10 (0')
    01 (1)  +  10  =    11 (1')
    10 (2)  +  10  =  1 00 (-2')
    11 (3)  +  10  =  1 01 (-1')

1 A valid observation is that the radix 2 and 4 accumulators were only developed for the biased/unbiased algorithm. Versions can also be developed using the new algorithm. A radix 2 accumulator can be implemented by removing the two's complement circuitry from the new radix 3 design. A radix 4 accumulator has a data path that is identical to the radix 5 design; the only difference is a more complex coefficient decoding.

Examining either conversion algorithm, the high order bits in the offset radix 4 form may contain one digit more than the standard radix 4 representation. Using the biased/unbiased algorithm, four values are needed at each accumulator: {x, x+p, 2x, 2x+p}. Here x denotes the normalized mod m value of 2^i, where i is the place of the low order bit of the bit pair in the total input. These values must be loaded into addressable memory or, more practically, into a tapped shift register. For a d bit binary input and b bit residue channel, the total number of b bit stored values is 4*ceil((d-b)/2). If the new algorithm is used instead, only x and x-m are needed at each accumulator. These values would also be loaded into addressable memory or a tapped shift register, and the total number of b bit stored values is only 2*ceil((d-b)/2).

Unfortunately, this bonus of the new algorithm has an equal and opposite penalty. Although only one-half as many registers are needed, the new algorithm requires an extra correction stage, which not only adds hardware but also increases the latency of the conversion. Both of the offset radix 4 designs require a (d-b) bit binary adder to perform the conversion between the standard digit set and the offset digit set. The adder increases both the hardware size and the latency of the conversion. If d-b is on the order of b, the add can probably occur within a single clock cycle; otherwise, the adder must be pipelined, increasing the latency even more.

The digit set conversion can be avoided entirely if the standard digit set {0, 1, 2, 3} is used. Although either radix 4 accumulator can be modified to use the standard digit set, both designs will require more stored values and contain larger selectors. The decrease in latency is paid for in hardware size.
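The pairwise recoding into the offset digit set can be sketched as a software model (names are my own; the hardware would use the add-and-carry trick instead):

```c
#include <assert.h>

/* Recode the standard radix 4 digits of v into the offset set {-2,-1,0,1},
 * low order digit first: 2 -> -2 with carry 1, 3 -> -1 with carry 1.
 * Returns the number of offset digits, possibly one more than standard. */
int recode4(unsigned long v, int out[], int max_digits)
{
    int n = 0, carry = 0;
    while ((v || carry) && n < max_digits) {
        int d = (int)(v & 3) + carry;        /* standard digit plus carry in */
        v >>= 2;
        carry = 0;
        if (d >= 2) { d -= 4; carry = 1; }   /* 2 -> -2, 3 -> -1, 4 -> 0 */
        out[n++] = d;
    }
    return n;
}
```

For example, 11 = (2 3) in radix 4 recodes to the three offset digits (1 -1 -1), since 1*16 - 1*4 - 1 = 11, showing the possible extra high order digit mentioned above.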
The decrease radix 4 can be require more stored in latency is paid for in 87 No Table Approach If and simple binary latency, simple to programmable Because multiplier. is programmability RNS more conversion doubler residue the b bit important units can design both than be in residue value of 2b hardware developed an extended is knowni, using size the residue it can be multiplied by the d-b high order bits of the input using the residue multiplier design from chapter 3; the b low order bits of the input can be used as an 4 Input 0] 2 0 *2 .0 *2 Figure 48 No Table Binary to RNS 1 2b = g mod m, think about it... greater than 2 b- 1, Conversion Also, if the moduli set is chosen so that all moduli are the biased form of 2b can be formed by left shifting g. 88 the first accumulator for seed unbiased to RNS binary programmable simple p needs to be loaded or only converter; is a result The the multiplier. in asserted to an input. starting at bit 0, segmenting the binary input, can be designed by blocks *2 residue using unit conversion programmable simple Another into b bit Using *2 blocks the higher order segments can be multiplied by sections. Fd/bl design for primarily for small of the This design would be competitive 3 is shown in figure 48. = block diagram A summed. and the result of two powers appropriate the Fd/bl. Residue to Binary Conversion An RNS to binary conversion unit is needed to put the results from all of the algorithm conversion are that algorithms Two independently. literature that each allow would have digit extensively been and Theorem Remainder Chinese the Remainder Chinese In chapter 3 the mathematical to be converted discussed in the Mixed Radix the link Theorem Chinese Remainder between the weighted number representation. ( 1 x= where M = the Mi Theorem This equivalence -1 1>mIx1 + M 2<M 2 >x was presented to show the a conventional is as follows: -1 2 + ... + Mr<Mr mx algorithms which could be very large. 
must be mod M The computations required to CRT expression are multiply-accumulates 2 rb and representation residue H mi, Mi = M/mi, and xi = x mod mi. is on the order of other than the binary complex algorithm. conversion evaluate Unfortunately, number. Because the RNS digits are uncoupled and unordered1 , there is no to RNS case. simple more is significantly conversion to binary RNS to form a binary back together channels the residue modulo M; however, M To avoid modulo M arithmetic investigated. 1 If each residue in an RNS system is considered to be a digit, the digits are uncoupled because there are no carries. This feature allows us to perform smaller computations on each digit without any links between the digits. A more familiar weighted number system such as the decimal number system has coupled digits. 89 Mixed Radix Conversion A standard way to avoid the modulo M arithmetic in the CRT is to use the mixed radix conversion converted evaluated to to an With this algorithm the residues are first algorithm. intermediate find a standard radix1 mixed representation fixed radix value. that The calculations is then required to convert from residue to mixed radix are all b bit modulo mi and the remaining calculations The are conventional binary. Algorithm mixed radix representation are chosen to be equal to the moduli in the moduli set. The best motivate To simplify way following to the process conversion the algorithm is the to radices reverse in the engineer it. First, the notation must be defined Fi= ith mixed radix mi= ith modulus xi= ith residue coefficient Yij = mij =<mni>mj coefficient <Cinj Assume x is already in the mixed radix form + F mm + F1mIm m3 + x = IF + Frm 0 1 1 2 12 3 12 3 If there are r moduli in the moduli set, there will be r digits in the associated mixed radix representation. Taking residues of both sides of this equation for each modulus in the moduli set, the left side (x mod mi) equals xi, and the right side is some function of the Fi's. 
With the xi's known, we can solve for the Fi using residue arithmetic. The first three digits are shown below:

    x1 = <F0>m1
       F0 = x1

    x2 = <F0 + F1*m1>m2
       F1 = <(x2 - F0) * <m1^-1>m2>m2 = [(x2 - g12) * m12] mod m2

    x3 = <F0 + F1*m1 + F2*m1*m2>m3
       F2 = [((x3 - g13) * m13 - g23) * m23] mod m3

Much of the computation for the residue to mixed radix conversion can be performed in parallel. The example below shows the conversion process for a simple 3 moduli RNS.

    Example:  m1 = 3, m2 = 4, m3 = 5;  x1 = 1, x2 = 0, x3 = 4

    F0 = x1 = 1
    F1 = <(x2 - g12) * m12>m2 = <(0 - 1) * 3>4 = 1      (g12 = 1, m12 = <3^-1>4 = 3)
    F2 = <((x3 - g13) * m13 - g23) * m23>m3
       = <((4 - 1) * 2 - 1) * 4>5 = 0                   (g13 = 1, g23 = 1, m13 = <3^-1>5 = 2, m23 = <4^-1>5 = 4)

    x = F0 + F1*m1 + F2*m1*m2 = 1 + 1*3 + 0*12 = 4

A top level block diagram of the required computation is shown in figure 49.

[Figure 49: Mixed Radix Conversion Algorithm (for 4 residue channels)]
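The peel-off procedure above translates directly into C. This is a sketch: brute force inverses and plain integers stand in for the table lookups and b bit channels of the actual hardware.

```c
#include <assert.h>

/* <a^-1>m by exhaustive search */
long modinv(long a, long m)
{
    for (long t = 1; t < m; t++)
        if ((a * t) % m == 1)
            return t;
    return 0;
}

/* Mixed radix conversion: recover F0, F1, ... using only mod mi
 * arithmetic, then evaluate the mixed radix digits in binary. */
long mrc(const long m[], const long x[], int r)
{
    long F[16], val = 0, w = 1;
    for (int i = 0; i < r; i++) {
        long t = x[i] % m[i];
        for (int j = 0; j < i; j++)   /* t = <(t - Fj) * <mj^-1>mi>mi */
            t = ((t - F[j]) % m[i] + m[i]) * modinv(m[j] % m[i], m[i]) % m[i];
        F[i] = t;
    }
    for (int i = 0; i < r; i++) {     /* x = F0 + F1*m1 + F2*m1*m2 + ... */
        val += F[i] * w;
        w *= m[i];
    }
    return val;
}
```

Running the worked example, mrc on moduli {3, 4, 5} with residues {1, 0, 4} produces the digits F0 = 1, F1 = 1, F2 = 0 and the value 4.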
The Hardware

The primary consideration in the hardware design process is the level of programmability for the moduli. Because the conversion hardware includes residue arithmetic, the addition of a new allowable modulus requires another set of tables2. Although it would be nice for testing if the moduli set could be arbitrarily programmed, this feature is really not required. If programmability is omitted, the optimum moduli set1 can be hardwired into the design, with a limit on the number of moduli. The advantage of hardwiring the moduli set is that the residue multiplies can be performed by table lookup without having to load the tables. To simplify the following discussion, the new unbiased range algorithm will be used; the original biased/unbiased algorithm could also be applied, but it would only complicate the presentation.

Except for timing registers, the data flow graph in figure 49 is a top level block diagram for the mixed radix conversion hardware. Two hardware designs are needed, one a residue subtractor and the other a residue scaler. Because the inputs are available in unbiased form, the residue subtractor is a very simple design (figure 50). The subtraction can be performed by two's complementing the subtrahend residue, to form the biased form of its negative, and adding; the carry out of the adder will indicate the state (biased or unbiased) of the output. The only problem with this subtractor is that an overflow will cause undefined results. To guarantee that no overflow occurs, the moduli set must be ordered (ie m1 < m2 < m3 < ... < mr).

[Figure 50: Residue Subtractor (Unbiased/Biased Algorithm)]

1 The optimum moduli set is that set of relatively prime moduli representable in a b bit binary channel that has the maximum product. See chapter 5.
2 The number of multiplies that would occur in a practically sized MRC would require a large amount of memory in which to store the tables.
the individual Because adders the mixed radix digits are weighted 1 , the low order digits may be neglected. 1 Assuming that all moduli are on the order of 2b, the most significant mixed radix digit multiplies ~2 rb; the least significant multiplies 1. If several moduli are used, the low order digits will be noise. 93 Chapter 5 Now tool Design Aid some basic that can be developed friendly form. the length both old and input a to present Given of the architectures the number and designed size and of bits in the input speed analyzed, a data in more and coefficients, and will output the optimal architectures for Originally, the intention was to have the user new algorithms. cost been the hardware filter, the program size/latency have function. After some consideration, I decided to output all possible optimal size/speed designs, and allow the user to weight the alternatives. designs In addition, no distinction will be made between the two types of because it was not possible to accurately estimate the hardware or speed of these designs or characterize the advantage of having fewer buses. The compute algorithm Moduli The algorithm the consists minimum to prune number of moduli Algorithm Selection sections: an channels for each bitwidth (Basic range is requirement, range [3 , 9]. an algorithm and to an RNS) number of bits in the coefficients and input required dynamic range for the the coefficients dynamic distinct fairly the design space. filter determine the bits, of two e bits, and the length (d + e + log2 N ).1 optimum moduli N, filter. the length If the of the input is d the total number of bits of Starting set 2 and with this dynamic range can be chosen for each b within the The optimum moduli set itself is not so important; however, the number of moduli, r, in the sets for different number of residue channels that are needed. 
However, it is much simpler to solve the reversed problem: given b and r, find the optimum moduli set and its product. The moduli selection algorithm can then be used to solve for r given b and a target moduli product.

A first attempt at optimal moduli selection uses an exhaustive search3 of all sets of r relatively prime numbers requiring b or fewer bits. The search proved to be extensive for large b and r and required a significant amount of computer time.

1 The log2 N term can be significant for a long FIR filter; it is the price of exact calculation (no rounding).
2 Optimum moduli set implies the set of relatively prime moduli representable in b bits that has the highest product of any set of relatively prime moduli representable in b bits with the same number of moduli in the set.
3 See Appendix for code.

This prompted a search for a more efficient algorithm that used some of the properties of an optimal moduli set. To improve the speed of the algorithm, it is worthwhile to consider some of the properties of the optimum moduli set and a simple way to limit the search a bit. For a modulus mi to be included, the advantages of its being included must exceed the disadvantages of other potential relatively prime moduli being excluded. A first observation is that 2^b should always be included in any optimum moduli set. Since 2^b contains only factors of 2, and because 2^b is the highest even number in our potential set, it can only exclude other even numbers; and since only one even number can be included in the final moduli set, 2^b should be in this set. A second observation is that the remainder of the moduli set will contain odd numbers that are as large as possible.
These observations can be added to the exhaustive search algorithm to substantially reduce the search time; however, the search still becomes unwieldy for larger b and r.

To improve on the reduced exhaustive search, it is possible to directly choose the moduli set and avoid a lengthy search. As a first attempt at direct selection, start with 2^b - 1 as the first odd number in the set and step down through the remaining odd numbers, adding those that are relatively prime to the set until the requisite r moduli have been chosen. The inspiration for this algorithm is similar to the observation above, which included 2^b because it is the highest number with a factor of 2: only the highest number having a factor of 3 will be included, along with the highest number having a factor of each prime 5, 7, 11, etc. Unfortunately, this algorithm does not always give the best answer. It works for b = 2, 3, 4, 5, and 7, but for some cases of b = 6 or 8 it gives a suboptimal result.

A deeper investigation of optimum moduli sets shows that it is occasionally optimal to exclude a larger odd number in order to include two smaller ones. For example, if the highest number having a factor of 3 also contains a factor of 5 (which is the case for b = 8), then this number not only excludes all smaller factors of 3 but also all smaller factors of 5. The problem can be more easily visualized if a moduli set with r moduli is thought of as being generated by adding a modulus to the moduli set of r-1 moduli. At some point the product of the moduli will be increased more by excluding the highest factor of 3 and 5 and including the second highest factor of 3 and the second highest factor of 5 rather than by including the next smaller odd number that is relatively prime to the existing set.
A numerical example is shown below.

8 Bit Moduli --- 6 Moduli in Set:  256, 255, 253, 251, 247, 241
8 Bit Moduli --- 7 Moduli in Set:  256, 253, 251, 249, 247, 245, 241

In this example 255 contains both a factor of 3 and a factor of 5. When the seventh modulus is added, it becomes advantageous to omit 255 from the moduli set and add the next highest factor of three, 249, and the next highest factor of five, 245. This is because 255*239 = 60945 and 249*245 = 61005. The replacement could have been anticipated by calculating LOW, which equals the product of the second highest odds containing the factors in the double factor odd,1 divided by the double factor odd. In this case LOW = 249*245/255 = 239.2. If the next number that is relatively prime to the existing set (239 in this case) is less than LOW, then the double factor prime should be omitted instead.

A heuristic algorithm was developed to avoid double factor primes without explicit factoring. First, a moduli set consisting of 2^b and the largest r-1 relatively prime odds less than 2^b is generated. Then, each odd in the set is sequentially excluded, and the moduli set consisting of 2^b and the largest r-1 relatively prime odds omitting the excluded one is calculated. The product of each new moduli set is compared with the product of the current set. When a set generated by excluding an odd from the current set has a greater product, this set becomes the current set, and odds are again sequentially excluded from it. When no moduli set generated from the current set has a greater product, the search ends with the current set as the result.

The heuristic algorithm gives the correct moduli set for all practical values of r and b. For some very large values of r, the exclusions must be ordered for the algorithm to converge on the optimal set. The first error occurs for b = 8 and r = 38; the error can be ignored for all practical uses.1

1 More properly, 255 contains two small factors (3 and 5).
Now that a method to find the optimum moduli set given r and b has been developed, we need a way to find the number of b bit moduli needed to achieve a given dynamic range. Since all moduli within a b bit channel will be less than or equal to 2^b, an initial lower guess is r = (# bits of range needed) / b. Starting with this value of r, the heuristic selection algorithm can be called with successively higher values of r until a moduli set is found that has sufficient range.

The Design Aid

Given the number of moduli needed for the different bitwidth moduli and the size and latency data for the different bitwidth residue subtap designs, a set of (size, delay) pairs can be formed that can be searched to find potential optimum designs. First, the size data listed in chapter 4 is multiplied by the appropriate number of moduli for the respective bitwidths to obtain a size number for the entire filter tap. Because all of the subtaps within a tap operate in parallel, the filter latency is equal to the subtap latency for any number of moduli channels. Second, the scaled size and latency figures are grouped into pairs that make up the possible design space.

Four design types were investigated for implementations using the biased/unbiased algorithm: binary, balanced ternary, offset quaternary, and balanced quinary. Three design types were investigated using the new balanced algorithm: balanced ternary, balanced quinary, and modified balanced septary. For each design type, seven moduli widths are considered, one for each b from b=3 to b=9. So, the design space of the old algorithm contains up to 28 designs, and the design space of the new algorithm contains up to 21 designs.2

The principal search algorithm used to prune the space of designs operates on a principle of finding dominant designs that exclude others. When two designs are compared, if one has both a smaller size and a lower latency than the other, the first is said to dominate.
The dominant design would always be chosen, and the other can be pruned. If one design has a lower delay and the other has a smaller size, the two are said to coexist; neither excludes the other. If each design in the design space is compared to the others, a group of coexisting dominant designs will remain. These designs will have the optimum size/speed combinations. A good way to visualize the process is to imagine a 2-D scatter plot of the design space with size on one axis and delay on the other. We only want to keep those designs that plot closest to the origin.

1 The dynamic range obtained from 37 8 bit moduli is 274.65 bits of precision. It is unlikely that any application would need this precision.
2 In general, some of the lower values of b will be excluded because there are not enough moduli representable in b bits to achieve the desired dynamic range.

Discussion

Running the algorithm gives the results that would be anticipated from looking at the charts in chapter 4. For the optimum moduli set, the size charts show that the biased/unbiased offset quaternary design usually has the smallest size and the largest delay; the offset quaternary designs would be anticipated to have the smallest size for b=4 to 8. The other designs in the old algorithm that remain are the balanced ternary and the binary. Because the delay increases with the radix, this is also expected.

One design type in each algorithm is always dominated by others. For the old algorithm, the balanced radix 5 design is dominated by the offset radix 4 design for b less than 9. Although the subtap delays of the two are the same for all values of b, the size advantage of the radix 5 design is not realized until 9 bit moduli are used. For the new algorithm, the modified balanced septary design is dominated by the balanced quinary design for all values of b.
Chapter 6 Standard Binary Arithmetic with Pipelining

Now that the RNS designs have been presented and analyzed, another alternative using standard binary arithmetic will be presented. Using some of the techniques developed for the residue filter taps and extensive pipelining, a conventional design can be implemented that operates at a higher throughput than the residue designs. Although no general decisions will be made between the residue and standard implementations, this design would be a good starting point for any comparisons.

Development of Architecture

The architecture design is based on two concepts. The first is coefficient decomposition to avoid the long multiplies, and the second is pipeline registers inserted in the carry chain of the binary adders to increase throughput. The coefficient decomposition concept has been thoroughly discussed within the residue filter discussion. The primary difference here is that overflow cannot be permitted within standard binary arithmetic. A sufficient number of bits must be included in the binary channels to prevent overflow.

The pipelined adder is the real innovation of the binary design. The adders can be pipelined to a granularity of single bit adders for a throughput of (1 adder delay + 1 register delay)^-1. For the residue designs the adders can not be pipelined, because the high order carry out is needed to condition the next stage in the old algorithm designs and the high order bit is needed to condition the next stage in the new algorithm designs. For the residue designs, all of the calculation has to be performed within a single clock cycle.

Filter Tap

Using the transpose filter form, the temporary results move along a data path consisting of N adder/register pairs. A single adder/register combination is shown in figure 51. Performing the same register shift to all adder/register combinations in the data path preserves the overall data flow. By shifting some of the registers to before the adder, the adder becomes pipelined as in figure 52.
Figure 51: Throughput Limiting Data Path in a Transpose Form FIR Filter

Figure 52: Increased Throughput Binary Adder/Register Combination

In the figures, a six bit adder is pipelined into two three bit sections. A first observation is that the sections are operating on data from different times: while the current inputs are being clocked through the first section, the second section is operating on the two previous inputs. Another observation is that the number of one bit registers increases when they are shifted to the inputs of the adder. This should be expected, because increases to throughput are bought with increases in hardware: since there are two inputs for each output, the extra registers are needed to guarantee that the inputs to the adder are staggered in time. In general, a b bit adder/register combination pipelined into d bit sections, where d|b,1 requires b one bit adders and 2b + (b/d) - d - 1 one bit registers.2

1 d|b read d divides b (integer divide without remainder).
2 For d=1, the number of registers = 3b - 2; for d = b/2, the number of registers = 3b/2 + 1.

Binary Subtap

To actually design a filter tap, a coefficient decomposition must be chosen. The hardware is simplest if the binary decomposition is used. Each subtap consists of a b bit adder and c AND gates, as shown in figure 53. An input stage of registers to stagger the input is shown in figure 54. With (b/d) sections per adder, ((b/d)^2 + (b/d))/2 registers are needed for the input stage. Because all taps receive the staggered input, the adders will always receive the proper inputs.

Figure 53: Binary Subtap using Pipelined Binary Arithmetic

Balanced Ternary

The advantage of the higher radix is that fewer subtaps are needed per tap.
The balanced ternary design uses the same method used by the residue designs to two's complement the input. The design needs an additional c XOR gates to invert the input; the same signal gating the XOR gates is fed into the carry in of the adder/register combination.

Higher Radices

It is possible to go to even higher radix decompositions of the coefficients than the balanced ternary; however, the designs complicate slightly. Although the input is broadcast to the taps in different time slices, the coefficient remains the same for all time slices, and scaling all time slices by factors of two can be performed by shifting the input. A general form appropriate for the higher radix designs adds coefficient decoding, a left shifter, and the offset feedforward from the previous subtap to the balanced ternary design. Using this form, balanced quaternary, balanced quinary, and modified balanced septary subtap designs can be implemented.1 Because the additional hardware does not sit in the feedforward path from the previous subtap, it does not decrease the throughput of the subtap; the throughput is limited only by the d bit adder section and a register. There is some difficulty, however, in synchronizing the staggered versions of the input that have been scaled by the digits of the decomposition radix.

Shift Add Reconstruction

With the filter constructed out of the parallel subtaps described above, the output of the final stage will not be in a "friendly" form. Each of the J mini-convolution chains will output a b bit result that consists of b/d slices of different time results. There is the temptation to place an output stage2 after each mini-convolution chain to put the chains back together; however, when the scaling and summing of the mini-convolution chains is considered, the output stage does not seem as appealing. To use pipelined adders in the final summing, the inputs must be staggered in time, which is the form in which they are output.
The following discussion focuses on the simplest case of reconstruction, the binary coefficient decomposition with single bit adder sections. In this case the number of mini-convolution chains, J, is equal to the width of the coefficient, e, and the output of each mini-convolution chain consists of b bit time slices. Assuming that there are J mini-convolution chains, each with d bit sections of time delayed data and a b bit output, the data arrives at the outputs as b/d time slices.

1 Because no left shift block is available in the standard cell library, an explicit design is omitted. For all of these designs the number of subtaps needed for e bit coefficients can be found in the tables of chapter 4.
2 The output stage would consist of register chains similar to the input stage.

The 0th and e-1st channel outputs are shown below.

[channel h0:   y0[i-b],   y0[i-(b-1)],   ..., y0[i-2],   y0[i-1]
 channel he-1: ye-1[i-b], ye-1[i-(b-1)], ..., ye-1[i-2], ye-1[i-1]]

Focusing on the i-1st time slice, one bit can be grabbed from each channel's output. If these bits are grouped in descending order {ye-1[i-1], ye-2[i-1], ..., y1[i-1], y0[i-1]}, the first slice into the adder is obtained. Because the output of the hj channel is multiplied by 2^j, the group of bits is properly ordered in an e bit binary word. At the next output time, the i-2nd group contains results from this same time slice. The e bit binary word obtained from this output must be left shifted one place and added to the previously obtained e bit binary word. At the next output time, the i-3rd group contains results from this same time slice; the e bit word obtained must be left shifted two places and added to the previous partial sum. This process continues until the final group that contains results from the same time slice arrives at the high order bits. All total, b shifted e bit numbers are added together to generate a single binary result. The hardware required to perform the reconstruction is b e bit adders; the latency of this hardware is b+e clock cycles.
A similar process can be used to reconstruct the results with any other coefficient decomposition. Using the same algorithm with 2 bit adder sections, the offset quaternary results can be reconstructed with the same throughput as the filter chain. However, the offset case complicates the reconstruction process significantly, especially any scaling by numbers other than powers of two.

Hardware Required/Contrast with RNS

Similar to the comparison between the different residue designs, the comparison that follows will focus primarily on the filter taps because of the assumed large number of taps. In any case, the input stage and reconstruction hardware will be smaller than their RNS counterparts, the binary to RNS converter and the RNS to binary converter. One difference is that the size of the RNS hardware was determined by the total dynamic range; the size of the massively pipelined binary design is a function of both the coefficient width and the combination of the input width and the filter length.

For any exact1 FIR filter the total required dynamic range is d + e + log2 N, where d = input width, e = coefficient width, and N = length of the filter. The residue FIR filter uses r b bit moduli channels, with r and b chosen such that the product rb is just greater than d + e + log2 N. With the offset quaternary implementation, each residue channel tap uses a (b/2)b bit wide data path; the complete residue filter is r(b/2)b bits wide. The pipelined binary filter (binary implementation) uses e channels, each (d + log2 N) bits wide; the complete pipelined binary filter (binary implementation) is (d + log2 N)e bits wide.

For a rough comparison, assume d = 8, e = 8, and N = 10000. The optimum size RNS design has 5 6 bit moduli with the offset quaternary implementation. The total width of the RNS data path is r(b/2)b = 5(6/2)6 = 90 bits.
The binary filter would have 4 23 bit channels2 for a total width of 92 bits. Both designs would involve similar shift and add stages, but the RNS design would require 2 or 4 buses to broadcast versions of the input to the subtaps. In addition, the RNS design would require large conversion stages at the beginning and end of the filter.

General Discussion

More detailed comparisons would be needed to make any decisions between the RNS designs and the pipelined binary designs. Based on the very limited sizing comparison that was performed, it would seem that the pipelined binary designs would be competitive with the residue designs from a hardware standpoint. Also, the throughput of the pipelined binary implementation is significantly higher than that of the comparable3 RNS design. Whether further investigation shows the pipelined binary designs to be superior or inferior, they provide a yardstick with which to measure the RNS designs.

1 Exact implies that no rounding occurs.
2 The coefficients are decomposed into e/2 quaternary digits, and the width of the channel data paths must be increased by one because the maximum coefficient digit now has a magnitude of 2.
3 Using the same decomposition radix.

Chapter 7 Conclusion

With a cursory introduction to RNS, it is easy to become excited about the great potential for high speed computation of any signal processing algorithm requiring only addition and multiplication. After a deeper study of the topic, it rings clear, however, that RNS is only useful for a very limited number of applications. In part, the limitation stems from the problems with scaling and magnitude comparison, but the largest constraint on the general use of RNS is the huge overhead of the conversion units. Even for applications for which RNS is ideally suited, such as the FIR filter problem, the size of the problem must be sufficiently large to warrant applying RNS techniques. If more efficient conversion units can be designed, maybe the space of RNS applications could increase.
The designs in chapter 4 are a step in the proper direction, away from the infamous table lookup solution to any RNS problem. However, a substantial number of computations are required for either of the conversion units, the CRT or the MRC, and, regardless of the efficiency of the computational units, there is a large amount of computation to be performed.

The massively pipelined binary concept may prove to be superior to RNS techniques even on the problems that RNS is good at. The massively pipelined designs are faster in all cases, and the initial hardware estimates seem comparable. Obviously, more research should be done in this area before any global decisions are made concerning the two methods.

Assuming RNS does have merit, there is much more work to be done to extend the little that I have done. If custom VLSI versions of the designs are compared, I believe that the new algorithm designs would be better than the biased/unbiased designs in hardware size and possibly speed. More research should also be done on the new algorithm designs in general. Throughout the discussions, all arithmetic units in the architecture were assumed to operate on two's complement numbers. But what if a signed number representation is used for all numbers? As a result of the balanced ternary, offset quaternary, and balanced quinary form of the designs, could more efficient arithmetic units be developed? Another idea is to use redundant number representations for the arithmetic units. Although more logic is required, the shorter propagation delay could be used to increase the throughput of the RNS filters.
Maybe this technique could give RNS the needed edge in the battle with a pipelined conventional design.

Appendix 1  Dynamic Range for Optimum Moduli Sets

For each channel bitwidth b, the table lists the optimum moduli set product for r moduli, the resulting bits of precision (log2 of the product), the total channel bits used (rb), the efficiency (precision divided by bits used), and the factor by which the product grew when the rth modulus was chosen. Fractional entries in the last column mark steps where the heuristic replaced a double factor odd.

3 Bit Moduli
 r   Product          Bits of Precision   Bits Used   Efficiency   Increase/Modulus
 1   8.00000000E+00     3.0000               3         1.0000          8.00
 2   5.60000000E+01     5.8074               6         0.9679          7.00
 3   2.80000000E+02     8.1293               9         0.9033          5.00
 4   8.40000000E+02     9.7142              12         0.8095          3.00

4 Bit Moduli
 1   1.60000000E+01     4.0000               4         1.0000         16.00
 2   2.40000000E+02     7.9069               8         0.9884         15.00
 3   3.12000000E+03    11.6073              12         0.9673         13.00
 4   3.43200000E+04    15.0668              16         0.9417         11.00
 5   2.40240000E+05    17.8741              20         0.8937          7.00
 6   7.20720000E+05    19.4591              24         0.8108          3.00

5 Bit Moduli
 1   3.20000000E+01     5.0000               5         1.0000         32.00
 2   9.92000000E+02     9.9542              10         0.9954         31.00
 3   2.87680000E+04    14.8122              15         0.9875         29.00
 4   7.76736000E+05    19.5671              20         0.9784         27.00
 5   1.94184000E+07    24.2109              25         0.9684         25.00
 6   4.46623200E+08    28.7345              30         0.9578         23.00
 7   8.48584080E+09    32.9824              35         0.9424         19.00
 8   1.44259294E+11    37.0699              40         0.9267         17.00
 9   1.87537082E+12    40.7703              45         0.9060         13.00
10   2.06290790E+13    44.2297              50         0.8846         11.00
11   1.44403553E+14    47.0371              55         0.8552          7.00

6 Bit Moduli
 1   6.40000000E+01     6.0000               6         1.0000         64.00
 2   4.03200000E+03    11.9773              12         0.9981         63.00
 3   2.45952000E+05    17.9080              18         0.9949         61.00
 4   1.45111680E+07    23.7907              24         0.9913         59.00
 5   7.98114240E+08    29.5720              30         0.9857         55.00
 6   4.23000547E+10    35.2999              36         0.9806         53.00
 7   1.98810257E+12    40.8545              42         0.9727         47.00
 8   8.81392140E+13    46.3248              48         0.9651         44.33
 9   3.78998620E+15    51.7511              54         0.9584         43.00
10   1.55389434E+17    57.1087              60         0.9518         41.00
11   5.74940907E+18    62.3181              66         0.9442         37.00
12   1.78231681E+20    67.2723              72         0.9343         31.00
13   5.16871875E+21    72.1303              78         0.9247         29.00
14   1.18880531E+23    76.6539              84         0.9125         23.00
15   2.02096900E+24    80.7413              90         0.8971         17.00
16   2.62725974E+25    84.4418              96         0.8796         13.00

7 Bit Moduli
 1   1.28000000E+02     7.0000               7         1.0000        128.00
 2   1.62560000E+04    13.9887              14         0.9992        127.00
 3   2.03200000E+06    20.9545              21         0.9978        125.00
 4   2.49936000E+08    27.8970              28         0.9963        123.00
 5   3.02422560E+10    34.8158              35         0.9947        121.00
 6   3.59882846E+12    41.7107              42         0.9931        119.00
 7   4.06667616E+14    48.5308              49         0.9904        113.00
 8   4.43267702E+16    55.2990              56         0.9875        109.00
 9   4.74296441E+18    62.0405              63         0.9848        107.00
10   4.88525334E+20    68.7270              70         0.9818        103.00
11   4.93410588E+22    75.3852              77         0.9790        101.00
12   4.78608270E+24    81.9851              84         0.9760         97.00
13   4.25961360E+26    88.4609              91         0.9721         89.00
14   3.53547929E+28    94.8359              98         0.9677         83.00
15   2.79302864E+30   101.1397             105         0.9632         79.00
16   2.03891091E+32   107.3295             112         0.9583         73.00
17   1.44762674E+34   113.4792             119         0.9536         71.00
18   9.69909918E+35   119.5453             126         0.9488         67.00
19   5.91645050E+37   125.4761             133         0.9434         61.00
20   3.49070580E+39   131.3587             140         0.9383         59.00
21   1.85007407E+41   137.0866             147         0.9326         53.00
22   8.69534814E+42   142.6412             154         0.9262         47.00
23   3.73899970E+44   148.0675             161         0.9197         43.00
24   1.45820988E+46   153.3529             168         0.9128         39.00
25   5.39537657E+47   158.5623             175         0.9061         37.00
26   1.67256674E+49   163.5165             182         0.8984         31.00
27   4.85044353E+50   168.3745             189         0.8909         29.00
28   1.11560201E+52   172.8981             196         0.8821         23.00
29   2.11964382E+53   177.1460             203         0.8726         19.00

8 Bit Moduli
 1   2.56000000E+02     8.0000               8         1.0000        256.00
 2   6.52800000E+04    15.9944              16         0.9996        255.00
 3   1.65158400E+07    23.9773              24         0.9991        253.00
 4   4.14547584E+09    31.9489              32         0.9984        251.00
 5   1.02393253E+12    39.8973              40         0.9974        247.00
 6   2.46767740E+14    47.8101              48         0.9960        241.00
 7   5.90355529E+16    55.7124              56         0.9949        239.24
 8   1.41094972E+19    63.6133              64         0.9940        239.00
 9   3.28751284E+21    71.4775              72         0.9927        233.00
10   7.52840440E+23    79.3167              80         0.9915        229.00
11   1.70894780E+26    87.1432              88         0.9903        227.00
12   3.81095359E+28    94.9441              96         0.9890        223.00
13   8.04111207E+30   102.6652             104         0.9872        211.00
14   1.67370004E+33   110.3667             112         0.9854        208.14
15   3.33066308E+35   118.0033             120         0.9834        199.00
16   6.56140627E+37   125.6253             128         0.9814        197.00
17   1.26635141E+40   133.2178             136         0.9795        193.00
18   2.41873119E+42   140.7952             144         0.9777        191.00
19   4.37790346E+44   148.2951             152         0.9756        181.00
20   7.83644720E+46   155.7789             160         0.9736        179.00
21   1.35570536E+49   163.2135             168         0.9715        173.00
22   2.26402796E+51   170.5972             176         0.9693        167.00
23   3.69036557E+53   177.9460             184         0.9671        163.00
24   5.79387395E+55   185.2406             192         0.9648        157.00
25   8.74874967E+57   192.4790             200         0.9624        151.00
26   1.30356370E+60   199.6981             208         0.9601        149.00
27   1.81195354E+62   206.8171             216         0.9575        139.00
28   2.48237635E+64   213.9151             224         0.9550        137.00
29   3.25191302E+66   220.9485             232         0.9524        131.00
30   4.12992954E+68   227.9372             240         0.9497        127.00
31   4.66682038E+70   234.7574             248         0.9466        113.00
32   5.08683422E+72   241.5256             256         0.9435        109.00
33   5.44291260E+74   248.2671             264         0.9404        107.00
34   5.60619999E+76   254.9536             272         0.9373        103.00
35   5.66226199E+78   261.6118             280         0.9343        101.00
36   5.49239413E+80   268.2117             288         0.9313         97.00
37   4.77044208E+82   274.6522             296         0.9279         86.86
38   1.39678594E+84   279.5241             304         0.9195         29.28
39   1.01965374E+86   285.7139             312         0.9157         73.00
40   7.23954154E+87   291.8636             320         0.9121         71.00
41   4.85049283E+89   297.9297             328         0.9083         67.00
42   2.95880063E+91   303.8605             336         0.9043         61.00
43   1.74569237E+93   309.7431             344         0.9004         59.00
44   9.25216956E+94   315.4710             352         0.8962         53.00
45   4.34851969E+96   321.0256             360         0.8917         47.00
46   1.86986347E+98   326.4519             368         0.8871         43.00
47   7.66644022E+99   331.8094             376         0.8825         41.00
48   2.83658288E+101  337.0189             384         0.8777         37.00
49   8.22609036E+102  341.8769             392         0.8721         29.00
50   6.66313319E+104  348.2167             400         0.8705         81.00

Appendix 2 -- Final Design Aid Program

/**************************************************************/
/* Design Aid Code                                            */
/*                                                            */
/* This program is an attempt at creating an RNS design aid.  */
/* It combines the heuristic moduli selection algorithm with  */
/* the architecture design data from section 4 of the thesis. */
/*                                                            */
/* Lightspeed C                                               */
/* Written by Kurt A. Locher January 6, 1989                  */
/* Debugged by Kurt A. Locher until January 7, 1989 1:08am    */
/**************************************************************/

#include <stdio.h>
#include <math.h>

typedef char string[80];

/* maxmoduli indicates the maximum number of relatively prime moduli */
/* that can be selected from the set of numbers less than or equal   */
/* to 2^b.  b is the index of maxmoduli, which starts at 0.          */
int maxmoduli[] = {0, 0, 0, 4, 6, 11, 16, 29, 50, 80};

main()
{
    typedef struct {
        int type;
        int bits;
        double size;
        double delay;
        char *next;
    } rec;

    int inputbits;          /* number of bits in the input */
    int coeffbits;          /* number of bits in the coefficients */
    int b;                  /* number of bits in the binary channels */
    int i;
    int r_estimate;         /* initial guess at # of moduli needed */
    int r[10];              /* number of moduli needed for different b */
    int firstb;             /* lowest bitwidth with the required dynamic range */
    int *moduli;
    int *findmoduli();
    double filterlength;    /* length of the filter in taps */
    double drange;          /* dynamic range in bits of moduli sets */
    double product;
    double
r_estimate = (int)(totaldrange / b) - 1; do { r_estimate++; if (restimate > maxmoduli[b]) r_estimate = 0; { 111 firstb = b; break; /* do loop */ } moduli drange ) = = b, findmoduli(restimate, log(product)/log(2); &product); while (drange < totaldrange); r[b] = restimate; if (r[b] !=O) { printf("\nThe optimal moduli set for b = %d is /* ... ", b); for(i=O; i<r[b]; i++) { moduli[i]); printf('%d\t", printf("\nThe product is %.8e for %lf bits of precision\n\n", product, (log(product)/log(2))); } ' */ firstb++; /* Now, load the size and latency data from files */ if((osd = fopen("oldalgorithmsize-data", printf("Error opening "r")) == NULL) oldalgorithm sizedata\n"); for(i=O; i<4; i++) { fscanf(osd, "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s", t[3], t[4], t[5], t[6], t[7], t[8], t[9] ) otype[i], for (j=3; j<=9; j++) osize[i][j] = r[j] * atof(tU]); } fclose(osd); if ((old = fopen("oldalgorithm-latency-data", printf("Error opening "r")) == NULL) oldalgorithm latency-data\n"); for(i=O; i<4; i++) { fscanf(old, "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s", t[3], t[4], t[5], t[6], t[7], t[8], t[9] ) for (j=3; j<=9; j++) odelay[i][j] = atof(tU]); } fclose(old); /* Now lets prune the number of alternatives a bit */ start = (rec *)malloc(sizeof(rec)); start->size = 9e30; start->delay = 9e30; start->next = NULL; for(design=O; design<4; design++) { dummy, 112 for(b=firstb; b<=9; b++) { pointer = start; lastpointer = NULL; addflag = 1; do { tempflag = 0; if (pointer->size > osize[design][b]) if (pointer->delay > odelay[design][b]) tempflag++; tempflag++; addflag = 0; if (tempflag == 0) if (tempflag == 2) { /* delete current record */ if (lastpointer == NULL) { /* if first in list */ start = (rec *)pointer->next; free(pointer); pointer = start; } else { lastpointer->next = pointer->next; free(pointer); pointer = (rec *)pointer->next; } } else { /* otherwise, just step to the next record */ lastpointer = pointer; pointer = (rec *)pointer->next; } } while (pointer != NULL); if (addflag == 1) pointer = (rec 
*)malloc(sizeof(rec)); pointer->type = design; pointer->bits = b; pointer->size = osize[design][b]; pointer->delay = odelay [design] [b]; pointer->next = NULL; if (start == NULL) start = pointer; else lastpointer->next = (char *)pointer; } /* } And print the results */ pointer = start; printf("\n\nThe old combinations are\n"); designs which provide the best size/speed do I printf("%s design with %d %d bit moduli --> size %.Of, delay .2 f\n", otype[pointer->type], r[pointer->bits], pointer->bits, pointer->size, 113 pointer->delay); pointer = (rec *)pointer->next; } while (pointer != NULL); /* And now for something a little different */ if((nsd = fopen("newalgorithmsize data", "r")) == NULL) opening newalgorithm size-data\n"); printf("Error for(i=O; i<3; i++) { fscanf(nsd, "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s", t[3], t[4], t[5], t[6], t[7], t[8], t[9] ) ntype[i], for (j=3; j<=9; j++) nsize[i][j] = rU] * atof(t[j]); fclose(nsd); "r")) == NULL) if ((nld = fopen("newalgorithmjlatency_data", newalgorithm-latency-data\n"); opening printf("Error for(i=O; i<3; i++) { fscanf(nld, "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s", t[3], t[4], t[5], t[6], t[7], t[8], t[9] ) dummy, for (j=3; j<=9; j++) ndelay[i][j] = atof(t[j]); } fclose(nld); /* Now lets prune the number of alternatives a bit */ start = (rec *)malloc(sizeof(rec)); start->size = 9e30; start->delay = 9e30; start->next = NULL; for(design=0; design<3; design++) { for(b=firstb; b<=9; b++) { pointer = start; lastpointer = NULL; addflag = 1; do { tempflag = 0; if (pointer->size > nsize[design] [b]) if (pointer->delay > ndelay[design][b]) tempflag++; tempflag++; addflag = 0; if (tempflag == 0) if (tempflag == 2) { /* delete current record */ if (lastpointer == NULL) { /* if first in list */ start = (rec *)pointer- >next; free(pointer); pointer = start; } 114 else { lastpointer->next = pointer->next; free(pointer); pointer = (rec *)pointer->next; } else { /* otherwise, just step to the next record */ lastpointer = pointer; pointer = (rec 
*)pointer->next; } } while (pointer != NULL); if (addflag == 1) { pointer = (rec *)malloc(sizeof(rec)); pointer->type = design; pointer->bits = b; pointer->size = nsize[design][b]; pointer->delay = ndelay[design] [b]; pointer->next = NULL; if (start == NULL) start = pointer; else lastpointer->next } } /* = (char *)pointer; } And print the results */ pointer = start; printf("\n\nThe new combinations are\n"); designs which provide the best size/speed do { %.2fin", printf("%s design with %d ntype[pointer->type], %d bit moduli -- > size %.Of, r[pointer->bits], pointer->bits, delay pointer- >size, pointer->delay); pointer = (rec *)pointer->next; } while (pointer != NULL); } /* /* /* /* /* /* int int Function findmoduli returns the "optimal" set of moduli with the It operates by recursively desired number of bits and elements. calling a sub-optimal moduli selection function with a list of excluded numbers. Inputs: # of moduli, # of bits/moduli Outputs: moduli set, log2 (product of moduli set) *findmoduli(number, number; bits, product) /* number of moduli in set */ * / * / * / */ */ */ 115 int bits; double *product; /* maximum number of bits per modulus /* product of best moduli set, returned */ */ { char char char int int int int int int doitaga in; *mallo *reallo CO; /* done flag */ /* dynamic memory /* dynamic memory allocation */ allocation */ /* looping variables */ i, j; /* size of exclusion set passed */ numexc; /* exclusion set */ *exclude; *result; /* best set of moduli */ /* current set of moduli */ *mod; powerO; /* integer power function */ double temp; /* product of current moduli set */ /* Initialize stuff... 
*/ mod = (int *)malloc(number * sizeof(int)); result = (int *)malloc(number * sizeof(int)); exclude = (int *)malloc(sizeof(int)); numexc mod[0] = = 0; power(2, bits); /* /* start with no exclusions */ always include highest power of 2 */ /* Take first stab with sub-optimal algorithm findset(mod, number, exclude, numexc); *product = mod[O]; for(i=1; i<number; i++) *product = *product * mod[i]; /* /* */ recursively call suboptimal algorithm successively each member of the current best moduli set */ excluding */ do { doitagain = 0; for(i=O; i<number; i++) result[i] = mod[i]; numexc++; exclude = (int *)realloc(exclude, numexc*sizeof(int)); for(i=1; i<number; i++) { exclude[numexc - 1] = result[i]; findset(mod, number, exclude, numexc); temp = mod[0]; for(j=1; j<number; j++) temp = temp * modU]; /* Is the current moduli set better than the best */ if (temp > *product) { *product = temp; doitagain = 1; break; } /* if */ } /* for */ /* break out of the for loop */ 116 } while(doitagain); return(result); } /* /* /* /* /* /* /* /* /* * / * / * / * / * / * / * / * / */ Function findset executes the basic flawed algorithm for finding a moduli set with the addition of select number exclusion. The basic algorithm starts with the highest possible moduli in a b It then counts down binary representation (2 to the power of b). the odd numbers less than this highest modulus including those that are relatively prime to the currently existing set. By the exclusion feature allows a calling function to compensate for the the flaw in this algorithm by excluding multiple factor numbers. 
/********************************************************************/
/* Function findset executes the basic (flawed) algorithm for       */
/* finding a moduli set, with the addition of selective number      */
/* exclusion.  The basic algorithm starts with the highest possible */
/* modulus in a b-bit binary representation (2 to the power of b).  */
/* It then counts down through the odd numbers less than this       */
/* highest modulus, keeping those that are relatively prime to the  */
/* currently existing set.  The exclusion feature allows a calling  */
/* function to compensate for the flaw in this algorithm by         */
/* excluding numbers with multiple factors.                         */
/********************************************************************/

int findset(modset, lenset, exclude, lenexc)
int *modset;    /* predimensioned array to hold moduli, modset[0] = 2^n */
int lenset;     /* length of requested moduli set */
int *exclude;   /* predimensioned and initialized exclude set */
int lenexc;     /* length of exclude set */
{
    char success;     /* flag */
    int i, j;         /* looping variables */
    int newmod;       /* potential new member of moduli set */
    int gcd();        /* greatest common divisor function */

    /* find first odd number that does not conflict with exclude set */
    newmod = modset[0] - 1;
    if (lenexc != 0) {
        do {
            success = 1;
            for (j = 0; j < lenexc; j++) {
                if (newmod == exclude[j]) {
                    newmod -= 2;    /* go to next greatest odd */
                    success = 0;
                    break;          /* the for loop */
                }
            }
        } while (!success);
    }
    modset[1] = newmod;

    /* Using the first odd element as a seed, find the remaining      */
    /* relatively prime elements to complete the set.  For these      */
    /* elements not only must the exclude list be checked, but they   */
    /* must also be relatively prime to the existing elements.        */
    for (i = 2; i < lenset; i++) {
        newmod = modset[i-1] - 2;
        do {
            /* check exclude list */
            if (lenexc != 0) {
                do {
                    success = 1;
                    for (j = 0; j < lenexc; j++) {
                        if (newmod == exclude[j]) {
                            newmod -= 2;    /* go to next greatest odd */
                            success = 0;
                            break;          /* the for loop */
                        }
                    }
                } while (!success);
            }
            /* check for relative primality (modset[0] = 2^b may be */
            /* skipped since newmod is odd)                         */
            success = 1;
            for (j = 1; j < i; j++) {
                if (gcd(modset[j], newmod) != 1) {
                    newmod -= 2;    /* go to next greatest odd */
                    success = 0;
                    break;          /* the for loop */
                }
            }
        } while (!success);
        modset[i] = newmod;
    } /* for */
}

int power(num, pow)
int num;
int pow;
{
    int i;
    int res;

    res = num;
    for (i = 1; i < pow; i++)
        res = num * res;
    return(res);
}

/********************************************************************/
/* Function gcd returns the greatest common divisor of two integers */
/* using Euclid's Algorithm...                                      */
/********************************************************************/

int gcd(a, b)
int a;
int b;
{
    int temp;
    int r;

    /* make sure a > b */
    if (a < b) {
        temp = a;
        a = b;
        b = temp;
    }
    /* Euclid's Algorithm */
    do {
        r = a % b;
        a = b;
        b = r;
    } while (r > 0);
    return(a);
}

/********************************************************************/
/*                   Moduli Selection Algorithm                     */
/*                                                                  */
/* This program is an attempt at creating an RNS design aid.        */
/* It uses a recursive method that is based on some of the          */
/* basic characteristics of an optimal moduli set.                  */
/*                                                                  */
/* Written by Kurt A. Locher          August 29, 1988               */
/* Ported to the Macintosh            November 12, 1988             */
/********************************************************************/

#include <stdio.h>
#include <math.h>

main()
{
    int nummod;
    int numbits;
    int i;
    int *moduli;
    int *findmoduli();
    double product;

    do {
        printf("Enter bitlength of moduli : ");
        scanf("%d", &numbits);
        printf("Enter number of moduli : ");
        scanf("%d", &nummod);
        moduli = findmoduli(nummod, numbits);
        product = 1;
        printf("\nThe optimal moduli set is ... ");
        for (i = 0; i < nummod; i++) {
            product = product * moduli[i];
            printf("%d\t", moduli[i]);
        }
        printf("\nThe product is %.8e for %lf bits of precision\n\n",
               product, (log(product)/log(2)));
    } while (numbits > 2);
}
/********************************************************************/
/* Function findmoduli returns the "optimal" set of moduli with the */
/* desired number of bits and elements.  It operates by recursively */
/* calling a sub-optimal moduli selection function with a list of   */
/* excluded numbers.                                                */
/********************************************************************/

int *findmoduli(number, bits)
int number;   /* number of moduli in set */
int bits;     /* maximum number of bits per modulus */
{
    char doitagain;     /* done flag */
    char *malloc();     /* dynamic memory allocation */
    char *realloc();    /* dynamic memory allocation */
    int i, j;           /* looping variables */
    int numexc;         /* size of exclusion set passed */
    int *exclude;       /* exclusion set */
    int *mod;           /* current set of moduli */
    int *result;        /* best set of moduli */
    int power();        /* integer power function */
    double product;     /* product of best moduli set */
    double temp;        /* product of current moduli set */

    /* Initialize stuff... */
    mod = (int *)malloc(number * sizeof(int));
    result = (int *)malloc(number * sizeof(int));
    exclude = (int *)malloc(sizeof(int));
    numexc = 0;                  /* start with no exclusions */
    mod[0] = power(2, bits);     /* always include highest power of 2 */

    /* Take first stab with sub-optimal algorithm */
    findset(mod, number, exclude, numexc);
    product = mod[0];
    for (i = 1; i < number; i++)
        product = product * mod[i];

    /* Recursively call the suboptimal algorithm, successively */
    /* excluding each member of the current best moduli set    */
    do {
        doitagain = 0;
        for (i = 0; i < number; i++)
            result[i] = mod[i];
        numexc++;
        exclude = (int *)realloc(exclude, numexc * sizeof(int));
        for (i = 1; i < number; i++) {
            exclude[numexc - 1] = result[i];
            findset(mod, number, exclude, numexc);
            temp = mod[0];
            for (j = 1; j < number; j++)
                temp = temp * mod[j];
            /* Is the current moduli set better than the best? */
            if (temp > product) {
                product = temp;
                doitagain = 1;
                break;      /* break out of the for loop */
            } /* if */
        } /* for */
    } while (doitagain);
    return(result);
}

/********************************************************************/
/* Function findset executes the basic (flawed) algorithm for       */
/* finding a moduli set, with the addition of selective number      */
/* exclusion.  The basic algorithm starts with the highest possible */
/* modulus in a b-bit binary representation (2 to the power of b).  */
/* It then counts down through the odd numbers less than this       */
/* highest modulus, keeping those that are relatively prime to the  */
/* currently existing set.  The exclusion feature allows a calling  */
/* function to compensate for the flaw in this algorithm by         */
/* excluding numbers with multiple factors.                         */
/********************************************************************/

int findset(modset, lenset, exclude, lenexc)
int *modset;    /* predimensioned array to hold moduli, modset[0] = 2^n */
int lenset;     /* length of requested moduli set */
int *exclude;   /* predimensioned and initialized exclude set */
int lenexc;     /* length of exclude set */
{
    char success;    /* flag */
    int i, j;        /* looping variables */
    int newmod;      /* potential new member of moduli set */
    int gcd();       /* greatest common divisor function */

    /* find first odd number that does not conflict with exclude set */
    newmod = modset[0] - 1;
    if (lenexc != 0) {
        do {
            success = 1;
            for (j = 0; j < lenexc; j++) {
                if (newmod == exclude[j]) {
                    newmod -= 2;    /* go to next greatest odd */
                    success = 0;
                    break;          /* the for loop */
                }
            }
        } while (!success);
    }
    modset[1] = newmod;

    /* Using the first odd element as a seed, find the remaining  */
    /* relatively prime elements to complete the set.             */
    /* For these elements not only must the exclude list be checked, */
    /* but they must also be relatively prime to the existing        */
    /* elements.                                                     */
    for (i = 2; i < lenset; i++) {
        newmod = modset[i-1] - 2;
        do {
            /* check exclude list */
            if (lenexc != 0) {
                do {
                    success = 1;
                    for (j = 0; j < lenexc; j++) {
                        if (newmod == exclude[j]) {
                            newmod -= 2;    /* go to next greatest odd */
                            success = 0;
                            break;          /* the for loop */
                        }
                    }
                } while (!success);
            }
            /* check for relative primality */
            success = 1;
            for (j = 1; j < i; j++) {
                if (gcd(modset[j], newmod) != 1) {
                    newmod -= 2;    /* go to next greatest odd */
                    success = 0;
                    break;          /* the for loop */
                }
            }
        } while (!success);
        modset[i] = newmod;
    } /* for */
}

int power(num, pow)
int num;
int pow;
{
    int i;
    int res;

    res = num;
    for (i = 1; i < pow; i++)
        res = num * res;
    return(res);
}

/********************************************************************/
/* Function gcd returns the greatest common divisor of two integers */
/* using Euclid's Algorithm...                                      */
/********************************************************************/

int gcd(a, b)
int a;
int b;
{
    int temp;
    int r;

    /* make sure a > b */
    if (a < b) {
        temp = a;
        a = b;
        b = temp;
    }
    /* Euclid's Algorithm */
    do {
        r = a % b;
        a = b;
        b = r;
    } while (r > 0);
    return(a);
}

References

General

Herstein, I.N., Abstract Algebra, Macmillan Publishing, New York, 1986.

Knuth, D.E., The Art of Computer Programming, Vol 2: Seminumerical Algorithms, Addison-Wesley, Reading, MA, 1981 (pp 178-197 & 268-277).

McClellan, J.H. & Rader, C.M., Number Theory in Digital Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 1979.

Oppenheim, A.V. & Schafer, R.W., Digital Signal Processing, Prentice-Hall, New York, 1975.

Rabiner, L.R. & Gold, B., Theory and Application of Digital Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 1975.

Szabo, N.S. & Tanaka, R.I., Residue Arithmetic and its Application to Computer Technology, McGraw-Hill, New York, 1967.
Taylor, F.J., "Residue Arithmetic: A Tutorial with Examples", IEEE Computer, May 1984.

Architecture

Bayoumi, M.A., et al., "A VLSI Implementation of Residue Adders", IEEE Transactions on Circuits and Systems, Vol CAS-34, No 3, March 1987.

Chaing, C-L., "Residue Arithmetic and VLSI", IEEE, 1983.

Huang, C.H., "Implementation of a Fast Digital Processor Using the Residue Number System", IEEE Transactions on Circuits and Systems, Vol CAS-28, January 1981.

Jenkins, W.K., et al., "The Use of Residue Number Systems in the Design of Finite Impulse Response Digital Filters", IEEE Transactions on Circuits and Systems, April 1977.

Johnson, B.L., et al., "The Residue Number System for VLSI Signal Processing", SPIE Vol 696, Advanced Algorithms and Architectures for Signal Processing, 1986.

Jullien, G.A., "Residue Number Scaling and Other Operations Using ROM Arrays", IEEE Transactions on Computers, Vol C-27, No 4, April 1978.

Key, E.L., "Digital Signal Processing with Residue Number Systems", IEEE, 1983.

Kung, H.T., "Why Systolic Architectures?", IEEE Computer, January 1982.

Musicus, B.R., "Architectures for Signal Processing", MIT course handout.

Soderstrand, M.A., "A New Hardware Implementation of Modulo Adders for Residue Number Systems".

Taylor, F.J., et al., "An Efficient Residue-to-Decimal Converter", IEEE Transactions on Circuits and Systems, Vol CAS-28, No 12, December 1981.

Taylor, F.J., "A VLSI Residue Arithmetic Multiplier", IEEE Transactions on Computers, Vol C-31, No 6, June 1982.

ZORAN Corporation, "DFP Family Overview" and "ZR33881 Digital Filter Processor Data Sheet".

Complex Residue Arithmetic

Baranieka, A.Z., et al., "Residue Number System Implementations of Number Theoretic Transforms in Complex Residue Rings", IEEE Transactions on ASSP, Vol ASSP-28, No 3, June 1980.

Jenkins, W.K.,
"Quadratic Modular Number Codes for Complex Digital Signal Processing", IEEE ISCAS, 1984.

Jullien, G.A., et al., "Complex Digital Signal Processing over Finite Rings", IEEE Transactions on Circuits and Systems, Vol CAS-34, No 4, April 1987.

Krishnan, R., et al., "The Modified Quadratic Residue Number System (MQRNS) for Complex High-Speed Signal Processing", IEEE Transactions on Circuits and Systems, Vol CAS-33, No 3, March 1986.

Krishnan, R., et al., "Complex Digital Signal Processing Using Quadratic Residue Number Systems", IEEE Transactions on ASSP, Vol ASSP-34, No 1, February 1986.

Soderstrand, M.A., "Applications of Quadratic-Like Complex Residue Number System Arithmetic to Ultrasonics", IEEE International Conference on ASSP 1984, Vol 2, March 1984.