International Journal of Engineering Trends and Technology (IJETT) – Volume 5 Number 7 - Nov 2013 Design of Delay Efficient Distributed Arithmetic Based Split Radix FFT Nisha Laguri#1, K. Anusudha*2 #1 M.Tech Student, Electronics, Department of Electronics Engineering, Pondicherry University, Puducherry, India *2 Assistant professor, Department of Electronics Engineering, Pondicherry University, Puducherry, India Abstract— In this paper a Split Radix FFT without the use of multiplier is designed. All the complex multiplications are done by using Distributed Arithmetic (DA) technique. For faster calculation parallel prefix adder is used. Basically high radix algorithms are developed for efficient calculation of FFT. These algorithms reduces overall arithmetic operations in FFT, but increases the number of operations and complexity of each butterfly. In Split Radix FFT, mixed-radix approach helps to achieve low number of multiplications and additions. DA is basically a bitserial computational operation that forms an inner (dot) product of a pair of vectors in a single direct step. The advantage of DA is its efficiency of mechanization. A method is incorporated to overcome the overflow problem introduced by DA method. Keywords— Split Radix, Fast Fourier Transform (FFT), DA, Parallel Prefix Adder I. INTRODUCTION Digital Signal Processing (DSP) kernels such as Discrete Fourier Transform (DFT) are common in real time applications. Since DFT computation requires a large amount of arithmetic operations, Fast Fourier Transform (FFT) processors are required to meet real-time requirement. Fast Fourier Transform is a very common operation which is used in various signal processing units. It has also applications in the wide range of radar, image & speech processing etc. It is important to have architecture which performs FFT quickly. Different FFT algorithms, like the Radix-4 and the High radix algorithms are also been developed for efficient calculation of FFT. These algorithms reduces overall arithmetic operations in FFT, but increases the number of operations and complexity of each butterfly. Various implementations are reported with high radix algorithm. Among them, radix-4 algorithm is very popular due to its lesser complexity. Split Radix FFT calculates the even parts using the radix-2 algorithm and the odd parts using the radix-4 algorithm. This mixed-radix approach helped to achieve lower number of multiplications and additions. The resulting butterfly has simple structure. Every butterfly has two main operations i.e., complex multiplication and addition. Complex multiplication decides the speed, hardware cost and consumption of power. ISSN: 2231-5381 Usually there are three conventions always to tackle the complex multiplication: Booth-Wallace multiplier, CORDIC multiplier, and CSD multiplier. It is not easy to handle the constant twiddle factors in CSD arithmetic and it results in large area cost. Distributed Arithmetic (DA) with Modulo Arithmetic, are the computation algorithms that perform multiplication with look-up table based schemes. The commonly encountered form of computation in digital signal processing is a sum of products and it can be executed most efficiently by DA. The advantage of DA is its efficiency of mechanization. Since twiddle factors in any FFT algorithm are fixed for specific N-point FFT, DA can be used to replace complex multiplication in FFT. As a summary, this paper addresses the implementation aspects DA based multiplier and SRFFT. This paper further arranged as follows. Section II explains the Split Radix FFT algorithm and corresponding butterfly diagram. Section III presents how a complex multiplication can be substituted with DA operations. Sections IV gives detailed architecture for DA based complex multiplier. Section V explains about parallel prefix adder. Section VI and VII explain about Kogge-Stone Adder and Brent Kung Adder. Results are compared with other architectures in Section VIII and lastly, in section IX paper is concluded. Table I. Number of real multiplication and additions to compute an n-point complex DFT Real Multiplication N Radix-2 Radix-4 16 24 20 32 88 64 264 208 Real Addition Split Radix-2 Radix-4 20 152 148 68 408 196 1032 Radix Split Radix 148 388 976 964 II. SPLIT-RADIX FFT ALGORITHM While calculating FFT using Radix-2 method, it can be concluded that even-numbered points and odd-numbered points are computed independently. This leads to the http://www.ijettjournal.org Page 341 International Journal of Engineering Trends and Technology (IJETT) – Volume 5 Number 7 - Nov 2013 possibility of using different computational methods for different independent parts of the algorithm. This will reduce computational complexity. Split-radix algorithm uses above method by combining the simplicity of radix-2 algorithm and lesser computational complexity of radix-4 algorithm, achieving the lowest number of arithmetic operation count to compute DFT of power-of-two sizes N. Split-radix method recursively expresses DFT of length N in terms of one smaller DFT of length N/2 and two smaller DFTs of length N/4. Split-radix algorithm is only applicable when N is a multiple of 4, but we can combine this with other FFT algorithms. This section presents briefly the SRFFT algorithm and its butterfly structure. General equation of Discrete Fourier Transform is given as(1) Where, Radix-2 algorithm calculates odd and even components of X (k) using decimation in frequency. Corresponding equations for even and odd components are given by the equations: Fig.2. SRFFT butterfly structure with outputs The difference between part (1) and (2) is in the number of stages required to finish the butterfly operation. Compared to part (1), part (2) completes the operation in a single step and is also a symmetrical structure from a hardware point of view. III. DISTRIBUTED ARITHMETIC METHOD FOR COMPLEX MULTIPLICATION (2) (3) Distributed Arithmetic can be used to implement multiplication operation if either the multiplicand or the multiplier value is fixed. It stores the possible combinations of fixed operand in ROM and suitable combination is added and shifted with respect to bits of other operand. The method for DA based complex multiplication can be summarized as(4) It shows that 4 real multiplications and 2 real additions are required to compute and . But these equations can be considered as one „multiply and accumulate operation. (5) Let, are fixed coefficients and are the input words. If is M-bit fractional number in 2‟s complement form then it can be expressed in following form Fig.1. SRFFT butterfly structure (6) ISSN: 2231-5381 http://www.ijettjournal.org Page 342 International Journal of Engineering Trends and Technology (IJETT) – Volume 5 Number 7 - Nov 2013 IV. ARCHITECTURE OF DA BASED COMPLEX MULTIPLIER B. Carry generation network The signal from the first stage will proceed with the next stage, to yield all carry bits signal. The stage containing three main complex logic cells called as Black cell, Gray cell and buffer cell. Black cell compute both and as define in equation (9) and (10), whereas Gray cell only execute . The stage of Prefix carry tree is a part that differentiate or determine the adder used. In this stage, carry is compute corresponding to each bit. Execution of operations is carried out in parallel. After the computation of carries in parallel they are segmented into small pieces. It uses carry propagate and generate as intermediate signals which are given by the logic equations 9 & 10: (9) (10) Fig..3. DA based complex multiplier The detailed architecture for complex multiplier is shown in above Fig. The real and imaginary parts of incoming words and are stored in two 8 bits wide parallel in serial out register. Shifting is carried out starting from LSB to MSB. Each output bit of these two registers are used as address lines of the ROMs. The ROM stores precalculated outcomes for both and . The size of each ROM is 4×8. One of the input to the 2:1 MUX is directly fed from the output of ROM and the other input to MUX is inverted. Input and output bit width for MUX is also 8 bits. The select line of MUX is „cin‟ signal and it remains as „0‟ till the MSB arrives at output. If select line „cin‟ of Mux is 1, it selects inverted output from ROM and it is added to the value stored in the partial product register (PPR). The PPR is a 8 bit wide „parallel in parallel out‟ register which also performs 1-bit right shift operation. Finally the output is taken from the left shift register. V. PARALLEL PREFIX ADDER The parallel prefix adders are more flexible and are used to speed up the binary additions. Parallel prefix adders are fastest adders and these adders are used for high performance arithmetic circuits in many industries. The construction of parallel prefix adder involves three stages: A. Pre- processing stage B. Carry generation network C. Post processing Fig.4. Carry operator The operations given in the figure are as follows- C. Post Processing Complement the overall adder operation, carry bits that produced from the second stage shall pass through the last part known as Post-Processing stage. This is the final step to compute the summation of input bits. It is common for all adders and the bits of sum are computed by logic equation- A. Pre-possessing stage In this stage we compute, generate and propagate various signals to each pair of inputs A and B. Logic equations for these signals are given in equations 7 & 8: (11) (12) (7) (8) ISSN: 2231-5381 http://www.ijettjournal.org Page 343 International Journal of Engineering Trends and Technology (IJETT) – Volume 5 Number 7 - Nov 2013 VII. BRENT-KUNG PREFIX ADDER The Brent-Kung adder is a parallel prefix adder. Parallel prefix adders are special class of adders that are based on the use of generate and propagate signals. Simpler Brent-Kung adders was been proposed to solve the disadvantages of Kogge-Stone adders. The cost and wiring complexity is greatly reduced, but the logic depth of Brent-Kung adders increases to 2log (2n-1). The block diagram of 4-bit BrentKung adder is shown in Fig.7. HDL coded multiplier is carried out in Xilinx ISE simulator. The proposed multiplier is coded in Verilog language. Fig.5. Complex logic cells inside the Prefix Carry Tree Specific into pre-processing stage, it is obviously consumes two input ports and while producing generation signal and propagation signal. By refer to the equation (7) and (8), the pre-processing block has managed to create a circuit as shown in Figure5. The block of this stage should be put at every single input bits of the adder. VI. KOGGE-STONE PREFIX ADDER Kogge-Stone adder is a parallel prefix form carry look-ahead adder. A parallel prefix adder can be represented as a parallel prefix graph consisting of carry operator nodes. The time which is required to generate carry signals in this prefix adder is o(log n). It is a fastest adder design and common design for high performance adders in many industries. The KoggeStone adder was first developed by Peter M. Kogge and Harold S. Stone which they published in 1973. The better performances of Kogge-Stone adder are minimum logic depth and bounded fan-out. But it has large area. The block diagram of 4-bit Kogge-Stone adder is shown in Fig6. Fig.7. 4-bit Brent-Kung adder RESULT AND DISCUSSION This design was synthesized in Xilinx Verilog of version 13.2 ISE web pack. The simulation is done by integrated test bench in 13.2 versions. All the operation can be simulated at one time; the behavioural simulation is done by executing the test bench file. Table II. Comparison Table between different Distributed Arithmetic based Complex Multiplier Distributed Distributed Distributed Arithmetic1 Arithmetic2 Arithmetic3 No. of Slices 24 49 40 No.of bonded 8 15 16 Delay(ns) 10.516 16.405 17.305 Maximum 207.639 94.661 86.699 0.00195 0.00195 0.00129 IOBs Frequency(Mhz) Fig.6. 4-bit Kogge-Stone adder ISSN: 2231-5381 Power(w) http://www.ijettjournal.org Page 344 International Journal of Engineering Trends and Technology (IJETT) – Volume 5 Number 7 - Nov 2013 Table III. Comparison between different Butterfly Structure Butterfly 1 Butterfly 2 Butterfly 3 No. of slices 153 136 32 Bonded 144 144 72 Delay 13.598 16.365 9.901 DSP48A 1s 7 6 4 the pre-scaling of input data; instead, it pre-scaled the data and stored in ROM. In this paper a new overflow technique is proposed for the DA based complex multiplier. The Distributed Arithmetic based complex multiplier is proposed with parallel prefix adder i.e. Brent Kung Adder. It shows the less delay along with high frequency when compare with other adders. The simulated result also shows that the Split Radix FFT proposed with prefix adder reduces delay. So the proposed design is high in speed. IOBs REFERENCES [1] [2] Table IV. Comparison of SRFFT with different Adder Parameters [3] SRFFT with SRFFT with Brent Adder Kung Adder No. of slices 81 86 No. of Bonded IOBs 20 20 Delay (ns) 10.190 7.297 Bit Width 8X8 8X8 [4] [5] [6] [7] IX. CONCLUSION [8] This paper describes the design of 8 point Split Radix FFT. For the design of SRFFT, distributed arithmetic based complex multiplier is used. The proposed architecture avoided ISSN: 2231-5381 Sunil P. Joshi, Roy paily, “Distributed Arithmetic Based Split- Radix FFT”, Journal of Signal Processing System, Springer, Online Publication, 31 May 2013 M. Mohamed Ismail, Dr. M. J. S. Rangachar and Dr. Ch. D. V. Paradesi Rao “VLSI Implementation of OFDM using Efficient Mixed-Radix 8-2 FFT Algorithm with bit reversal for the output sequences”, International Journal of Elecctronics and Communication Engineering, IISN 09742166 vol. 5, no. 4, pp. 513-520, 2012 Ansuman Diptisankar Das, Abhishek Mankar, N. Prasad, K.K. Mahapatra, Avas Kanta Swain, “Efficient VLSI Architectures of Split Radix FFT using New Distributed Arithmetic”, International Journal of Soft Computing and Engineerimg (IJSCE) ISSN:2231-2307, volume-3, Issue-1, march 2013 K. Swarnalatha, S. Mohan Das, P. Uday Kumar, “An Efficient Carry Select Adder with less delay and reduced area using FPGA Quartus II Verilog Design” International Journal of Science, Engineering and Technology Research (IJSETR) Volume 2, Issue 8, August 2013 John G. Proakis, Dimitris, G. Manolakis, “Digital Signal Processing: Principles, Algorithms, And Applications, 4th edition, Published by Pearson Edition, inc. @ 2007, pp.519-532 Eleanor Chu, Alan George, “Inside the FFT Black Box: Serial and Parallel Fast Fourier Transform Algorithms, CRC Press LLC, N.W. Corporate Blvd. Boea Raton, Florida33431, pp.22-25 Joseph Cavanagh, “Computer Arithmetic and Verilog HDL Fundamentals”, 3rd Edition, 2010 CRC Press, Parkway N.W. Boea Raton, Florida, pp.329-334 Jonas Claeson, “Design And Implementation of an Asynchronous Pipelined FFT Processor”, M. Eng. Thesis, Avdelning, Institution, Linköping, June 12, 2003 http://www.ijettjournal.org Page 345