International Journal of Engineering Trends and Technology (IJETT) – Volume 5 Number 6 - Nov 2013
ISSN: 2231-5381, http://www.ijettjournal.org

An Efficient Pipelined FFT Processor for OFDM Communication Systems

* M.Viswanadh1, M.Harshavardhan Reddy2, Naveena Boppana3
1Assistant Professor, Dept. Of ECE, MLRIT, Hyderabad, AP, India
2Assistant Professor, Dept. Of ECE, MLRIT, Hyderabad, AP, India
3Associate Professor, Dept. Of ECE, VVIT, Chevella, Hyderabad, AP, India

Abstract: Increasing speeds and complexity of wireless communication systems have necessitated the advancement of high-performance signal processing elements. Today's emerging technologies require fast processing and efficient use of resources, including power, memory, and chip area. Ongoing research seeks to optimize resource usage as well as performance, so design becomes a compromise among flexibility, performance, complexity, and cost. This paper specifically addresses the power-efficient design of an FFT processor as it relates to emerging OFDM communications such as cognitive radio.

Keywords: Fast Fourier transform (FFT), orthogonal frequency-division multiple access (OFDMA), very large scale integration (VLSI).

1. Introduction

Cognitive radio is a method of wireless communication that dynamically adapts the transmission of multiple subcarriers to changing conditions in the communication channels. These subcarriers are enabled by a modulation scheme known as orthogonal frequency division multiplexing (OFDM). OFDM modulation converts a high data rate signal into multiple lower data rate component signals for simultaneous transmission through numerous channels. The Fast Fourier Transform (FFT) processor is the heart of OFDM and enables its fast and efficient modulation of signals. The FFT algorithm is a fast computation of the Discrete Fourier Transform (DFT), which is an essential component of the modulation scheme used in OFDM.
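To make the relation between the DFT and the FFT concrete, the following minimal sketch compares a direct evaluation of the DFT definition with a recursive radix-2 decimation-in-time FFT. It is an illustration only, not the processor's algorithm: the function names (dft, fft, max_abs_diff) and the 64-sample test vector are assumptions introduced here, and floating-point arithmetic is used instead of the fixed-point hardware datapath.

```python
import cmath


def dft(x):
    """Direct O(N^2) evaluation of the DFT definition."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft(x):
    """Recursive radix-2 decimation-in-time FFT (length must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])  # DFT of the even-indexed samples
    odd = fft(x[1::2])   # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # One butterfly: multiply by the twiddle factor, then add and subtract.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def max_abs_diff(a, b):
    """Largest magnitude difference between two equal-length sequences."""
    return max(abs(p - q) for p, q in zip(a, b))


# Hypothetical 64-sample test vector; any power-of-two length works.
samples = [complex(i % 5, -(i % 3)) for i in range(64)]
error = max_abs_diff(dft(samples), fft(samples))
```

Both functions return the same spectrum up to floating-point rounding. For N = 64 the direct form needs N^2 = 4096 complex multiplications, whereas the radix-2 form needs (N/2) log2 N = 192 twiddle multiplications, which is the saving that makes the FFT attractive for OFDM hardware.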
As the FFT processor is the most computationally intensive component in OFDM communication, an improvement in the power efficiency of this component can have a great impact on the overall system. This impact is significant considering the number of mobile and remote communication devices that rely on limited battery-powered operation. This project serves as an exploration of current FFT processor algorithms and architectures, as well as optimization techniques that aim to reduce the power consumption of these devices.

Ultra-Wideband (UWB) technology brings the convenience and mobility of wireless high-speed communications to interconnects in devices throughout the digital home and office [1]. Instead of a wired connection, this technology enables a wireless connection for transmitting video, audio, and other data with high data speed and low power consumption. In February 2002, the Federal Communications Commission (FCC) in the USA issued the ruling that UWB could be used for data communication. Since then, UWB has become a hot research topic and plenty of research results have been published. The Multiband-OFDM standard is one solution for UWB technology. A proposal for the Multiband-OFDM UWB standard was published by the IEEE 802.15.3a study group [12]. After IEEE 802.15.3a was withdrawn in the spring of 2006, Multiband-OFDM has been controlled by ECMA International. In December 2007, the second revised version of 'Standard ECMA-368: High Rate Ultra Wideband PHY and MAC Standard' was released, which specifies the physical layer (PHY) and medium access control layer (MAC) of the UWB technology based on Multiband-OFDM [5].

However, some key issues need to be solved in designing a Multiband-OFDM UWB solution that supports the CMOS-based low-power requirement. One of the issues concerns its FFT (Fast Fourier Transform) block, which takes 25% of the design complexity of the total digital baseband transceiver [1]. Although many results have already been published in this research area in the past few years [4], some key problems still exist, and speed, area, and power consumption need to be improved. Based on ECMA-368, the FFT processor of a Multiband-OFDM system should work at a few hundred MHz, which makes it difficult to implement. Since this system targets wireless portable devices, small area and low power consumption are also imperative. Therefore, this work focuses on improving area and power consumption under the ECMA-368 standard requirements.

This paper aims at designing an ASIC (Application Specific Integrated Circuit) FFT processor for the Multiband-OFDM UWB system. In order to achieve this goal, several steps need to be followed. The first step is to find the specifications for this FFT processor, which are determined by the Multiband-OFDM UWB standard. This step requires analysis of OFDM UWB technology and the constraints on its FFT processor. After the specifications are defined, an optimized FFT algorithm and architecture should be chosen to meet them. There are a large number of FFT algorithms and architectures in the signal processing literature [7]. Therefore, the state-of-the-art algorithms and architectures should be analyzed and compared. Different algorithms and architectures lead to different power consumption, area, and speed of the processor, so their ASIC suitability should be analyzed and the effort should be focused on the optimization. Furthermore, the room for improvement should be analyzed and the chosen architecture should be further optimized. The proposed algorithm and architecture should be validated by Matlab simulation before implementation. After that, the FFT circuit needs to be implemented in VHDL. The synthesis step follows, using both Synplify Pro targeted for FPGA and Design Compiler for ASIC.

2. FFT Processor Design

In the OFDMA system, only partial subcarriers need to be computed at the user end. Thus, the FFT pruning algorithm is applicable to OFDMA demodulation in the low-power context of the battery-driven handset. In the literature, FFT pruning algorithms and architectures [9] are designed for computing selective frequency bins of the spectrum, but not for reflecting the resource-allocation scheme in the OFDMA system. Therefore, in this brief, we propose the partial cached-FFT algorithm for typical resource-block (RB) distributions in the OFDMA system. Furthermore, we design a mixed pipelined/cached-FFT processor with constellation and power awareness. Finally, we implement and measure the FFT processor chip to show its energy efficiency.

PARTIAL CACHED-FFT ALGORITHM FOR RESOURCE BLOCK ALLOCATION

In general, considering the actual hardware design, the accuracy of the FFT/IFFT module is an important design factor of system performance. In practice, fixed-point arithmetic is used to implement FFT algorithms in hardware because it is not possible to keep infinite resolution of coefficients and operations. All coefficients and input signals have to be represented with a finite number of bits in binary format, depending on the tradeoff between the hardware cost (memory usage) and the accuracy of output signals. Generally speaking, each multiplication may introduce an error due to rounding or truncation, which is referred to as arithmetic quantization error. Besides, all the twiddle factors are represented with a limited number of bits, and the loss due to the inexact coefficients is called coefficient quantization error. The theoretical performance evaluation has been given in previous works. Several previous works have analyzed the effect of fixed-point arithmetic for the radix-2 FFT. In this section, we derive the equivalent matrix form of both decimation-in-frequency (DIF) and decimation-in-time (DIT) FFT algorithms. Although the alternative DIT and DIF FFT algorithms have the same multiplicative complexity, the sequence of butterfly stages and twiddle factor stages is reversed. In other words, the signal flows of the two alternative representations are mirror images of each other.

Figure 1 2D FFT Processor Architecture

The proposed 64 to 1024-point cached-FFT processor, as shown in Fig. 1, consists of a radix-2/2^2 pipelined butterfly processor, cache sets, and main memory. We propose a pipelined dual-delay-feedback (DDF) butterfly architecture to avoid the idling of the butterfly unit (BU) in the traditional single-delay-feedback architecture. Two sequences share the radix-2 butterfly units by using a delay register after the first butterfly. In the time schedule for the dual-stream processing, the gray and slash blocks of time slots represent the operations of the BF2_D and BF2_2D butterfly units, respectively. Thus, the butterfly processor can achieve a 100% utilization rate. If a radix-2^3 butterfly processor is used for the dual streams, the latency will be increased, because more delay registers must be inserted for 100% utilization. If radix-4 and four streams are used, the cost will grow significantly. Therefore, the radix-2/2^2 dual-stream architecture is a compromise between cost and throughput.
If the radix23 butterfly processor is used for the dual streams, the latency will ISSN: 2231-5381 CACHE/MEMORY ARCHITECTURE: We use facilitate two the cache data sets to transfers between butterfly processor and the cache sets and between the cache ts and the main memory. The cache sets must perform two rite operations and two read operations in one clock cycle r the dual-stream processing. To avoid the utilization of the ur-port cache and simplify the control complexity, we divide ach cache set into two banks for even and odd addresses. The dd/even address detector serves to detect whether the data e accessed in the correct bank and to exchange the access positions if necessary. we can carefully Thereby, manipulate the computational time schedule to ensure that one even address and one odd address are accessed for two read ports http://www.ijettjournal.org Page 308 International Journal of Engineering Trends and Technology (IJETT) – Volume 5 Number 6 - Nov 2013 and two write ports. Similar to the carefully manage the read and cache write addresses of the cache sets architecture, the main memory needs to perform two read to and two write operations in one summarize four types of butterfly cycle. Therefore, we divide the computation sequences in Fig. 6 main memory into two banks of from the control tables of the 128- single-port to 1024-poin partial cached-FFT 512-word static random-access memory (SRAM) for avoid data conflicts. We schemes. even and odd memory addresses. Figure 3 Architectures of the (a) one 16-word cache set and (b) main memory Type 1 computes one complete group. In the radix-22 butterfly, data are read from {0, 1, 2, 3} cache addresses but are written back to the {0, 4, 8, 12} cache Figure 2 Proposed partial cached-FFT processor Computation Time Schedule: Since the cache flush/refresh time is smaller than the one- ass processing time, processor can the butterfly continuously compute without an idle time slot. 
However, there are timing overlaps between the cache flushing/refreshing and the butterfly processing. Thus, we must ISSN: 2231-5381 addresses. The cache-address exchanges seem to violate the inplace property of the FFT algorithm. However, before being overwritten, the original contents in cache addresses {4, 8, 12} are skillfully read into the butterfly processor. This strategy is applied to the arrangement of the cache read/write control throughout the schedules of all types. In addition, data- conflict avoidance requires http://www.ijettjournal.org Page 309 International Journal of Engineering Trends and Technology (IJETT) – Volume 5 Number 6 - Nov 2013 that there be one even and one An RB that consists of multiple odd read/write addresses in a symbols cycle all the time. To avoid an transmission unit in the OFDMA inevitable cache conflict in the system. The data of the user are eleventh cycle, we add a bypass logically packed as many virtual register. the RBs (VRBs) and are then mapped epoch 1 of the 128-point FFT. In to the physical RBs (PRBs) in the pass 0, four radix-22 butterflies OFDMA. Type 2 processes is the basic data are computed without the cache- The area of a chip depends on address exchange, whereas eight the number of components to be radix-2 butterflies are computed placed on it. Effective placement with exchanges. techniques are being used in the Type 3 deals with the epoch 2 of industry to minimize area and the 512-point FFT when only one- maximize functionality. Effective pass operation is necessary in the place and route can not only aid in last control reducing the area of the chip but complexity and to maintain the also reduce the delay due to wiring regularity, and interconnect capacitance. cache-address epoch. To we reduce idle the pass-0 operation and compute radix-2 The proposed FFT processor butterflies in pass 1 without the was fabricated using TSMC 0.18- cache-address exchange. Type 4 μm handles the epoch 2 of the 1024- technology. 
point FFT. This case is similar to 2100 × 2100 μm2 with 1210 × Type 3, but pass 1 consists of 1210 μm2 core area, including radix-22 operations and the cache- four 512 × 13 single-port SRAMs, address exchange is required. In four 8 × 13 caches, and four 128 × Table and 13 ROMs. The function of the chip FFT is verified, and its performance is III, the proposed architectures traditional cached- are compared in one-poly measured six-metal The using chip a CMOS occupies digital test terms of cost, number of trivial station. This chip can operate at multiplication, maximum 51 MHz with 33.3 mW. and FFT. ISSN: 2231-5381 latency per We use the worst- and best-case http://www.ijettjournal.org Page 310 International Journal of Engineering Trends and Technology (IJETT) – Volume 5 Number 6 - Nov 2013 patterns for measuring the energy- FFT processor for minimal RB saving capability of different RB transmission. allocation schemes, as shown in outperforms other Fig. 4. Each curve represents a memory-based FFT series of {1, 3/4, 1/2, 1/4, 1/8, latency for processing 1024-point 1/16, 1/32, 1/64} computed RBs full FFT is 61 μs, which meets the in the whole OFDMA spectrum. We 66.7- and 91.5-μs OFDM symbol can see that the energy dissipation durations of the 3GPP-LTE and scales to the modulation order and Mobile the respectively. amount of the allocated resource. This result cached- chips. WiMAX The standards, 4. Results and Conclusions The input data length of our proposed pipelined FFT processor is a parameter which can be decided at the range of 128, 512, 1024 and 2048 points. Take 1024 points FFT as an example. At first, the 1024 points FFT is coded using MATLAB software. After the chosen FFT algorithm is Figure 4 Energy dissipation versus FFT-demodulated resource bit rate (best case) The comb-distributed scheme has the best performance, whereas the localized scheme has the worst performance. 
Note that the RBlevel distributed scheme is a compromise between localized and comb-distributed schemes. Thus, its energy dissipation also lies between them. The energy scaling ranges from 1.90 to 0.64 nJ/FFT point for the 1024-point FFT, i.e., valid, the processor language architecture was and modeled of in functionally the VHDL verified using Xilinx 12.3i software and timing simulation using ISIM SE software. During the timing simulation, a test bench file included the TEXTIO package was written to read input data and write FFT result. behavioral simulation waveforms for the Processor are shown in figure 5. 67% energy can be saved in the ISSN: 2231-5381 The http://www.ijettjournal.org Page 311 International Journal of Engineering Trends and Technology (IJETT) – Volume 5 Number 6 - Nov 2013 [4] C. P. Fan and G. A. Su, “A grouped fast Fourier transform algorithm design for selective transformed outputs,” in Proc. IEEE APCCAS, 2006, pp. 1939– 1942. Figure 5 Simulation of 1024 point FFT In this Paper, We have presented the fabrication and testing of the FFT processor integrated circuit to show its superior energy efficiency. Therefore, we found that this FFT processor is significantly beneficial to the realization of the OFDMA receiver. Acknowledgements The authors would like to thank the anonymous reviewers for their comments which were very helpful in improving the quality and presentation of this paper. References: [1] Y. W. Lin, H. Y. Liu, and C. Y. Lee, “A dynamic scaling FFT processor for DVB-T applications,” IEEE J. Solid-State Circuits, vol. 39, no. 11, pp. 2005–2013, Nov. 2004. [2] T. H. Yu, C. Z. Zhan, Y. J. Cho, C. L. Yu, and A. Y. Wu, “Efficient fast Fourier transform processor design for DVB-H system,” in Proc. 18th VLSI/CAD Symp., Aug. 2007, pp. 65–68. [3] Z. Hu and H. Wan, “A novel generic fast Fourier transform pruning technique and complexity analysis,” IEEE Trans. Signal Process., vol. 53, no. 1, pp. 274– 282, Jan. 2005. ISSN: 2231-5381 [5] R. 
Min, M. Bhardwaj, and A. Chandrakasan, “A partially operated FFT/IFFT processor for low complexity OFDM modulation and demodulation of WiBro in-car entertainment system,” IEEE Trans. Consum. Electron., vol. 54, no. 2, pp. 431–436, May 2008.
[5] C80216m-08_503, Motorola IEEE 802.16 m Downlink Resource Mapping, IEEE, May 2008.
[6] 3GPP, R1-071091, Philips Resource-Block Mapping of Distributed Transmissions in E-UTRA Downlink, Feb. 2007.
[7] R. Min, M. Bhardwaj, and A. Chandrakasan, “Quantifying and enhancing power awareness of VLSI systems,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 6, pp. 757–772, Dec. 2001.
[8] T. Lenart and V. Owall, “Architectures for dynamic data scaling in 2/4/8 K pipeline FFT cores,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 11, pp. 1286–1290, Nov. 2006.
[9] A. Sinha, A. Wang, and A. P. Chandrakasan, “Algorithmic transforms for efficient energy scalable computation,” in Proc. Int. Symp. Low Power Electronics and Design, Italy, Aug. 2000, pp. 31–36.
[10] S. H. Nawab, et al., “Approximate signal processing,” J. VLSI Signal Processing Syst. Signal, Image, and Video Technol., vol. 15, no. 1/2, pp. 177–200, Jan. 1997.
[11] L. McMillan and L. A. Westover, “A forward-mapping realization of the inverse discrete cosine transform,” in Proc. Data Compression Conf., Snowbird, UT, Mar. 1992, pp. 219–228.
[12] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A System Perspective. Reading, MA: Addison-Wesley, 1994.
[13] V. Gutnik and A. P. Chandrakasan, “Embedded power supply for low-power DSP,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 5, pp. 425–435, Dec. 1997.

Authors Profile:

M.Viswanadh is working as an assistant professor in MLRIT, Hyderabad, in the Department of Electronics & Comm. Engineering. He completed his masters (M.Tech) in VLSI Design. He is planning for his research. He is interested in the fields of VLSI, image processing, and wireless communications.

M.Harshavardhan Reddy is working as an assistant professor in MLRIT, Hyderabad, in the Department of Electronics & Comm. Engineering. He completed his masters (M.Tech) in VLSI Design. He is planning for his research. He is interested in the developments in wireless communications.

Naveena Boppana is an Associate Professor in VVIT, Chevella, Hyderabad, in the Department of Electronics & Comm. Engineering. She completed her masters (M.Tech) in VLSI Design. She is planning for her research under JNTU Hyderabad.