Reducing Memory References for FFT Calculation

Mokhtar A. Aboelaze
Dept. of Computer Science and Engineering, York University, Toronto, ON, Canada
aboelaze@cs.yorku.ca

Ayman I. Elnaggar
Dept. of Electrical and Computer Engineering, Sultan Qaboos University, Muscat, Oman
ayman@squ..edu

Abstract

The Fast Fourier Transform (FFT) is one of the most widely used algorithms in digital signal processing, and it appears in many signal processing and communication applications. Many FFT computations are performed on embedded systems. Embedded processors are small processors used in almost every type of appliance, from microwave ovens to cars, and many embedded systems are portable and depend on small batteries for power, so low-energy design is extremely important for them. One of the major sources of energy consumption in any processor is memory access. Since a memory access requires more energy than almost any other operation in a DSP (Digital Signal Processor) or embedded processor, reducing the number of memory accesses plays a very important role in reducing energy consumption. In this paper we concentrate on the energy consumed by memory accesses in calculating the FFT. We compare three different techniques for calculating the FFT with respect to the number of memory accesses, and we investigate the effect of the number of CPU registers on reducing those accesses.

1. Introduction

The Fast Fourier Transform (FFT) is probably one of the most heavily used signal processing algorithms in the world. It is used in digital communication and in digital signal processing in general, and it is widely used as a mathematical tool in many different areas. In the next few paragraphs, we briefly mention some of its applications.

The FFT is used to reduce the computation time required for solving the problem of electromagnetic scattering from wire antennas and conducting rectangular plates [18]. It is also used in solving a system of Toeplitz normal equations [19], in optimal frequency acquisition and measurement in Search and Rescue Satellite Aided Tracking (SARSAT) [12], and in interpolation techniques for resampling correlation pulse signals starting from a small number of Nyquist samples [3]. The FFT is used for spectral estimation in [7] and [9], for single tone detection and frequency estimation in [2], and for improving object detection in radar systems [17]. It is also the basis of Orthogonal Frequency Division Multiplexing (OFDM), which in turn is the basis for Multi-Carrier Code Division Multiple Access (MC-CDMA) and its direct-sequence spread spectrum version MC-DS-CDMA [6], [8]. The windowed FFT is used in electric power quality assessment in [11], while in [5] the authors introduced the generalized sliding FFT to efficiently implement the hopping FFT and showed how to use it to implement the block LMS adaptive filter. Finally, the FFT has been used in the 3-D induction well logging problem, where it can help in the characterization of oil reservoirs [21].

Power consumption is a very important factor in the design of special-purpose as well as general-purpose processors [14]. For wireless devices, power plays a crucial role since the device operates on a battery with a limited power supply capability; the hardware must therefore use as little power as possible to perform the job at hand. For general-purpose processors, increasing power consumption leads to sophisticated cooling techniques, which both increase the price and reduce the reliability of the processor.
One of the main sources of energy consumption in any processor chip is the memory (or the cache, if there is one). It was reported in [16] that instruction cache references account for 43% of the total power consumption of the chip. For small embedded processors without a cache, and since main memory consumes more energy than a cache, that figure could be even higher. Minimizing the number of times the processor goes to memory therefore helps to reduce the energy consumption of the chip. While there is little we can do to reduce the memory accesses required to fetch instructions (assuming a general-purpose processor architecture), data accesses can be reduced in two different ways. The first is to use algorithms that require fewer memory accesses, by reusing an accessed element as many times as possible before discarding it, or by reordering the memory accesses. The second is to use a large register set, so that accessed elements can be temporarily stored in registers for further processing without additional memory accesses.

In this paper, we present a comparative study of the memory accesses performed in calculating the in-place FFT. We consider accesses to both data points and coefficients. We investigate three different techniques for calculating the FFT and report the number of memory references of each technique, which is an indication of the power dissipated in the memory.

The organization of this paper is as follows. In Section 2, we present a brief overview of FFT techniques. Section 3 discusses related work. In Section 4, we present our comparative study of the number of memory references in calculating the in-place FFT using a variable number of registers. Section 5 presents conclusions and future work.

2. FFT Algorithm

The Discrete Fourier Transform (DFT) of a sequence of N data samples is defined by

    X_k = \sum_{i=0}^{N-1} x_i \cdot W_N^{ik},  k = 0, 1, ..., N-1        (1)

where W_N = e^{-j(2\pi/N)} is the Nth root of unity, x is the original sequence, and X is its DFT. In general, x, X, and W are complex numbers. The DFT can be calculated by a matrix-vector multiplication in which the W_N^{ik} are arranged as an N-by-N matrix; that method requires O(N^2) complex multiplications. A faster way to calculate the DFT is the Fast Fourier Transform (FFT). The FFT works by dividing the input points into 2 sets (k sets in general), calculating the FFT of the smaller sets recursively, and combining the results to obtain the Fourier transform of the original set. This reduces the number of multiplications from O(N^2) to O(N log N). Figure 1 shows the 8-point decimation-in-time FFT: the 8 points are divided into 2 sets of 4 points each, the FFT of each 4-point set is calculated, and the two results are combined to produce the 8-point FFT.
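To make the O(N^2) versus O(N log N) distinction concrete, the following minimal sketch (ours, not taken from [1] or any of the surveyed papers) first evaluates Eq. (1) directly and then computes the same transform with a recursive radix-2 decimation-in-time FFT; the function names dft_direct and fft_radix2 are purely illustrative.

/* Sketch: direct DFT of Eq. (1) vs. recursive radix-2 DIT FFT (C99).
 * N must be a power of two for fft_radix2(). */
#include <complex.h>
#include <stdlib.h>

#define PI 3.14159265358979323846

/* Direct evaluation of X_k = sum_i x_i * W_N^{ik}: O(N^2) multiplications. */
static void dft_direct(const double complex *x, double complex *X, int N)
{
    for (int k = 0; k < N; k++) {
        double complex acc = 0.0;
        for (int i = 0; i < N; i++)
            acc += x[i] * cexp(-2.0 * PI * I * i * k / N);
        X[k] = acc;
    }
}

/* Recursive radix-2 decimation-in-time FFT: O(N log N) multiplications. */
static void fft_radix2(double complex *x, int N)
{
    if (N < 2) return;
    double complex *even = malloc((N / 2) * sizeof *even);
    double complex *odd  = malloc((N / 2) * sizeof *odd);
    for (int i = 0; i < N / 2; i++) {          /* split into even/odd samples */
        even[i] = x[2 * i];
        odd[i]  = x[2 * i + 1];
    }
    fft_radix2(even, N / 2);                   /* transform the two halves */
    fft_radix2(odd,  N / 2);
    for (int k = 0; k < N / 2; k++) {          /* butterfly: combine the halves */
        double complex w = cexp(-2.0 * PI * I * k / N) * odd[k];
        x[k]         = even[k] + w;            /* A = a + W_N^k * b */
        x[k + N / 2] = even[k] - w;            /* B = a - W_N^k * b */
    }
    free(even);
    free(odd);
}

The recursion above uses temporary buffers only to keep the structure obvious; the in-place, iterative formulations discussed in the rest of the paper achieve the same O(N log N) operation count while overwriting the input array.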
The 4-point FFT is computed recursively in the same way. The 2-point FFT is known as the butterfly operation and is shown at the bottom of Figure 1. As we can see from Figure 1, both the data and the coefficients are accessed from memory. The intermediate results are computed and then stored back in the same memory locations they were read from (in-place FFT calculation). By the end of the calculations (the end of the log2 N stages), the memory contains the FFT of the input sequence.

As we mentioned earlier, memory access carries a heavy price from the energy consumption point of view, so minimizing the number of memory accesses goes a long way toward reducing the energy consumed in calculating the FFT. For example, in the first stage we access both x(0) and x(4), which results in new values for x(0) and x(4); we then proceed to access the rest of the data points. At the beginning of the second stage we again access x(0) and x(2). If the previously calculated values of x(0) and x(2) are stored in a pair of registers, we do not need to go to memory to access them again; if they are not stored in registers, we have to fetch them from memory. The same can be said about twiddle factor accesses.

Figure 1. 8-point decimation-in-time FFT (the butterfly operation computes A = a + W_N^r * b and B = a - W_N^r * b).

Whether the required data will be in memory or in registers depends on the number of registers in the CPU, the architecture of the CPU, its instruction set, and the compiler used to generate the machine code. Here we concentrate on the number of available registers in the CPU; we will also make some assumptions regarding the compiler. In this paper, we calculate the number of memory accesses performed in calculating the FFT for different algorithms and for different values of N. Some of these algorithms are designed specifically to reduce the number of memory accesses (by rearranging the calculations), while others are designed to minimize the number of calculations or to simplify the code. In the next section, we review some of the previous work on minimizing power consumption for FFT calculations. The power reduction techniques fall into three main categories: reducing memory accesses for data references, reducing memory accesses for address generation, and reducing the number of arithmetic operations.

3. Related Work

Low-power design for digital signal processing in general, and for the FFT in particular, has attracted a lot of attention. Low power can be pursued at many levels: the algorithm level, the architecture level, and the hardware level. At the algorithm level, many attempts have been made to reduce power consumption, such as higher-radix and mixed-radix FFTs [20]; other attempts target the architecture and circuit levels.

In [10], the authors considered the problem of coefficient address generation for special-purpose FFT processors. They designed the hardware required to access the Fourier coefficients from the coefficient memory, and showed that their scheme can result in a power saving of 70-80% compared to Cohen's scheme [4] (the saving is in the energy consumption of the address generation unit only, not of the entire processor). In [15], the authors proposed an address generation scheme for FFT processors whose hardware complexity is 50% lower than Cohen's scheme; their scheme also activates only half of the memory (they used memory banks), thus saving more energy than previously known techniques. A low-power, dynamically reconfigurable FFT fabric was proposed in [22]. The processor can be dynamically configured for 16-point to 1024-point FFT calculation; the overhead of dynamic configuration is minimal, while the power reduction compared to a general-purpose reconfigurable architecture is in the range of 30-90%.

In [13], the authors proposed a novel FFT algorithm that reduces the number of multiplications as well as the number of memory accesses. Their algorithm rearranges the computation in the different stages of the Cooley-Tukey algorithm in order to minimize twiddle factor accesses.
Their algorithm clusters together all the butterfly operations that use a given twiddle factor, accesses that twiddle factor once to perform those operations, and never accesses it again. This results in a 30% reduction in twiddle factor accesses compared to the conventional DIF FFT. They also implemented their algorithm on the TI TMS320C62x digital signal processor, and their implementation shows a significant reduction in memory accesses on that processor.

4. Comparison

The number of memory accesses depends on many factors. The first is the generated machine code, which depends both on the source code (the high-level source, or the assembly code if the program is written in assembly) and on the compiler used. It also depends on the instruction set of the target processor and on the number of available registers. The compiler plays a very important role in the speed of the executed code; efficient use of the resources (including the available registers) can speed up the execution.

In this paper, we simulated three different FFT algorithms. We did not produce assembly code; instead, we obtained the data access pattern from the C code. As mentioned before, there is no direct correspondence between the data access pattern and the memory access pattern: a data access goes to memory if the data is in memory, or to a register if the requested data resides in a register. Another important factor is writing to memory. When a data element is changed, we can either write the change back to memory immediately, or keep it in a register to be used in further computations and write it to memory at the end of the code, or when the register is needed to store another data item.

In this paper, we assumed that the compiler uses the available registers to access and store the data. Once a data element is stored in a register, it stays there until the compiler has to use the register to store new data; in that case, registers are released in a FIFO fashion. We also assumed a register-register architecture, in which operands must reside in registers, which is typical of RISC architectures. This puts an upper bound on the performance and depends to a great extent on the compiler producing optimal code.

The first algorithm is the regular radix-2 DIF FFT; the pseudocode in Figure 2 can be found in [1]. Note that the last stage is separated from the rest of the stages since it does not multiply by twiddle factors; by separating it from the other stages, we save memory accesses as well as multiplications. The radix-4 FFT algorithm is shown in Figure 3, also taken from [1]. Note that both figures show pseudocode in which the elements of x are complex numbers; in reality, a complex addition or multiplication takes more than a single real addition or multiplication. The third algorithm we considered is the one proposed in [13], which minimizes twiddle factor memory accesses.

fft_dif(x[], m)                  /* x: complex array of length n = 2^m, in place */
{
    n = 2^m;
    for (i = m; i > 1; i--) {
        m1 = 2^i;  mh = m1/2;
        for (j = 0; j < mh; j++) {
            e = exp(-2*PI*I*j/m1);            /* twiddle factor W_m1^j */
            for (r = 0; r < n; r = r + m1) {
                u = x[r+j];  v = x[r+j+mh];
                x[r+j]    = u + v;
                x[r+j+mh] = (u - v) * e;
            }
        }
    }
    for (r = 0; r < n; r = r + 2) {           /* last stage: no twiddle factors */
        u = x[r];  v = x[r+1];
        x[r]   = u + v;
        x[r+1] = u - v;
    }
}

Figure 2. Radix-2 DIF FFT algorithm
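The counting model described above can be made concrete with a small sketch of our own (it is not code from the original study): every data or twiddle factor reference produced by the FFT code is filtered through a fixed-size register file managed in FIFO order, and only the references that miss in the register file are counted as memory accesses. The type and function names (reg_file, ref) and the sample trace are illustrative assumptions.

/* Sketch of the memory-access counting model: a FIFO-managed register file.
 * A reference to an address already held in a register costs no memory
 * access; a miss costs one memory access and evicts the oldest entry. */
#include <stdio.h>

#define NUM_REGS 8                   /* number of CPU registers assumed */

typedef struct {
    int  tag[NUM_REGS];              /* address currently held by each register */
    int  valid[NUM_REGS];
    int  next;                       /* FIFO replacement pointer */
    long mem_accesses;               /* counted memory references */
} reg_file;

/* Record one data reference (read or write) to address addr. */
static void ref(reg_file *rf, int addr)
{
    for (int i = 0; i < NUM_REGS; i++)
        if (rf->valid[i] && rf->tag[i] == addr)
            return;                          /* hit: value already in a register */
    rf->mem_accesses++;                      /* miss: go to memory */
    rf->tag[rf->next]   = addr;              /* FIFO replacement */
    rf->valid[rf->next] = 1;
    rf->next = (rf->next + 1) % NUM_REGS;
}

int main(void)
{
    reg_file rf = {0};
    /* Hand-written fragment of the reference trace of an 8-point DIT FFT:
     * the first two butterflies read and then overwrite x(0),x(4),x(2),x(6). */
    int trace[] = { 0, 4, 0, 4, 2, 6, 2, 6 };
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
        ref(&rf, trace[i]);
    printf("memory accesses: %ld\n", rf.mem_accesses);
    return 0;
}

In the study, the data and coefficient access traces of each algorithm are evaluated in this manner for register counts ranging from 4 to 20; for simplicity, this sketch does not count the final write-back of modified registers.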
Table 1 shows the number of memory accesses for the three algorithms described above, for different numbers of registers ranging from 4 to 20. We also considered different sizes of the FFT sequence, with length N = 2^n points.

Table 1. Number of memory accesses for different numbers of registers and FFT sizes N = 2^n. Radix-4 values are shown only for even n (N a power of 4).

                          Number of registers
  n    Algorithm       4       8      12      16      20
  4    Radix-2       504     380     275     266     256
       Radix-4       567     390     339     215     187
       Reduced       645     364     279     253     248
  5    Radix-2      1312     988     711     676     672
       Radix-4         -       -       -       -       -
       Reduced      1681     942     731     673     668
  6    Radix-2      3232    2428    1731    1652    1648
       Radix-4      3407    2350    2067    1351    1187
       Reduced      4137    2352    1795    1657    1652
  7    Radix-2      7680    5756    4071    3892    3888
       Radix-4         -       -       -       -       -
       Reduced      9817    5600    4243    3913    3908
  8    Radix-2     17792   13308    9347    8948    8944
       Radix-4     18175   12542   11091    7335    6483
       Reduced     22713   12992    9779    9001    8996
  9    Radix-2     40448   30204   21095   20212   20208
       Radix-4         -       -       -       -       -
       Reduced     51577   29568   22131   20329   20324
  10   Radix-2     90624   67580   46979   45044   45040
       Radix-4     90879   62718   55635   37031   32851
       Reduced    115449   66304   49395   45289   45284

One can see that for architectures with a small number of registers, the regular radix-2 FFT performs best over a wide range of FFT sizes (from 16 to 1024 points). For architectures with a large number of registers, the radix-4 FFT performs best over the same range of sizes. For architectures with a moderate number of registers (8, 12, 16), the radix-2 FFT and the reduced-memory-access method perform about equally well, with a slight edge for the simple radix-2. In [13], the authors compared their algorithm with the radix-2 FFT running on the TI TMS320C62x and showed that their algorithm requires fewer memory accesses than radix-2. We believe that this may be due to the VLIW architecture of that processor, the number of registers available, or the compiler used. Our work, however, shows that with a very good compiler, the radix-2 and radix-4 FFTs can outperform the reduced-memory-access algorithm.

fft_rad4(x[], ldn)               /* x: complex array of length n = 2^ldn, in place */
{
    n = 2^ldn;
    for (ldm = ldn; ldm > 1; ldm = ldm - 2) {
        m = 2^ldm;  mr = m/4;
        for (j = 0; j < mr; j++) {
            W1 = exp(-2*PI*I*j/m);           /* twiddle factors W_m^j, W_m^2j, W_m^3j */
            W2 = W1*W1;  W3 = W1*W2;
            for (r = 0; r < n; r = r + m) {
                u0 = x[r+j];       u1 = x[r+j+mr];
                u2 = x[r+j+2*mr];  u3 = x[r+j+3*mr];
                t0 = (u0 + u2) + (u1 + u3);
                t1 = (u0 + u2) - (u1 + u3);
                t2 = (u0 - u2) - I*(u1 - u3);
                t3 = (u0 - u2) + I*(u1 - u3);
                x[r+j]      = t0;
                x[r+j+mr]   = t2 * W1;
                x[r+j+2*mr] = t1 * W2;
                x[r+j+3*mr] = t3 * W3;
            }
        }
    }
}

Figure 3. Radix-4 DIF FFT

5. Conclusions

In this paper, we compared three different algorithms with respect to the number of memory accesses required for FFT calculation. We generated data traces from the C code and used approximate measures to estimate the number of memory accesses required to complete the computation, assuming a variable number of registers available to the compiler. We did not assume a specific compiler, only the data trace and a register set managed in FIFO order. Our work shows that, although the results depend on the number of registers and on the problem size, the simpler radix-2 FFT performs well compared to the other two algorithms.

6. References

[1] Arndt, J. "Algorithms for Programmers", available at www.jjj.de/joerg.html, August 2005.
[2] Chan, Y. T.; Ma, Q.; Inkol, R. "Evaluation of various FFT methods for single tone detection and frequency estimation". IEEE Canadian Conference on Electrical and Computer Engineering, Vol. 1, 25-28 May 1997, pp. 211-214.
[3] Coenen, A.; De Vos, A. "FFT-based interpolation for multipath detection in GPS/GLONASS receivers". Electronics Letters, Vol. 29, No. 19, Sept. 1992, pp. 1787-1788.
[4] Cohen, D. "Simplified control of FFT hardware". IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 24, No. 6, Dec. 1976, pp. 577-579.
[5] Farhang-Boroujeny, B.; Gazor, S. "Generalized sliding FFT and its application to implementation of block LMS adaptive filters". IEEE Transactions on Signal Processing, Vol. 42, No. 3, March 1994, pp. 532-538.
[6] Frederiksen, F. B.; Prasad, R.
"An overview of OFDM and related techniques towards development of future wireless multimedia communications". IEEE Radio and Wireless Conference (RAWCON 2002), 11-14 Aug. 2002, pp. 19-22.
[7] Gan, R.; Eman, K.; Wu, S. "An extended FFT algorithm for ARMA spectral estimation". IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 32, No. 1, Feb. 1984, pp. 168-170.
[8] Ghorashi, S. A.; Allen, B.; Ghavami, M.; Aghvami, A. H. "An overview of MB-UWB OFDM". IEE Seminar on Ultra Wideband Communications Technologies and System Design, July 2004, pp. 107-110.
[9] Gough, P. T. "A fast spectral estimation algorithm based on the FFT". IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 44, No. 8, August 1996, pp. 1317-1322.
[10] Hasan, M.; Arslan, T. "Coefficients memory addressing scheme for high performance processors". Electronics Letters, Vol. 37, No. 22, Oct. 2001, pp. 1322-1324.
[11] Heydt, G. T.; Field, P. S.; Pierce, D.; Tu, L.; Hensley, G. "Applications of the windowed FFT to electric power quality assessment". IEEE Transactions on Power Delivery, Vol. 14, No. 4, Oct. 1999, pp. 1411-1416.
[12] Holm, S. "Optimum FFT-based frequency acquisition with application to COSPAS-SARSAT". IEEE Transactions on Aerospace and Electronic Systems, Vol. 29, No. 2, April 1993, pp. 464-475.
[13] Jiang, Y.; Zhou, T.; Tang, Y.; Wang, Y. "Twiddle-factor-based FFT algorithm with reduced memory access". Proc. of the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2002), April 2002, pp. 70-77.
[14] Lahiri, K.; Raghunathan, A.; Dey, S.; Panigrahi, D. "Battery-driven system design: a new frontier in low power design". Proc. of the 7th Asia and South Pacific Design Automation Conference (ASP-DAC 2002), 7-11 Jan. 2002, pp. 261-267.
[15] Ma, Y.; Wanhammar, L. "A hardware efficient control of memory addressing for high-performance FFT processors". IEEE Transactions on Signal Processing, Vol. 48, No. 3, March 2000, pp. 917-921.
[16] Montanaro, J. et al. "A 160-MHz, 32-b, 0.5-W CMOS RISC microprocessor". Proc. IEEE International Solid-State Circuits Conference, Feb. 1996, pp. 214-229.
[17] Prasad, N.; Shameem, V.; Desai, U. B.; Merchant, S. N. "Improvement in target detection performance of pulse coded Doppler radar based on multicarrier modulation with fast Fourier transform (FFT)". IEE Proceedings - Radar, Sonar and Navigation, Vol. 151, No. 1, Feb. 2004, pp. 11-17.
[18] Sarkar, T.; Arvas, E.; Rao, S. "Application of FFT and conjugate gradient method for the solution of electromagnetic radiation from electrically large and small conducting bodies". IEEE Transactions on Antennas and Propagation, Vol. AP-34, No. 5, May 1986, pp. 635-640.
[19] Yarlagadda, R.; Babu, B. "A note on the application of FFT to the solution of a system of Toeplitz normal equations". IEEE Transactions on Circuits and Systems, Vol. 27, No. 2, Feb. 1980, pp. 151-154.
[20] Yeh, W.-C.; Jen, C.-W. "High-speed and low-power split-radix FFT". IEEE Transactions on Signal Processing, Vol. 51, No. 3, March 2003, pp. 864-874.
[21] Zhang, Z. Q.; Liu, Q. H. "Applications of the BCGS-FFT method to 3-D induction well logging problems". IEEE Transactions on Geoscience and Remote Sensing, Vol. 41, No. 5, May 2003, pp. 998-1004.
[22] Zhao, Y.; Erdogan, A.; Arslan, T. "A low-power and domain-specific reconfigurable FFT fabric for system-on-chip applications". Proc. of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2005), April 2005, pp. 169a-172a.