International Journal of Engineering Trends and Technology (IJETT) – Volume 10 Number 8 - Apr 2014
ISSN: 2231-5381 http://www.ijettjournal.org Page 411

Fast Fourier Transform Implementation on FPGA Using Soft-Core Processor NIOS II

Poonam S. Isasare1, Mahesh T. Kolte2
1 Student of Department of Electronics and Telecommunication, University of Pune, Pune, Maharashtra, India
2 Professor & Head of Department of Electronics and Telecommunication, University of Pune, Pune, Maharashtra, India

Abstract- FPGAs with soft-core processors offer the opportunity to test and implement various trade-offs between hardware and software implementations of a function. With the Altera Nios II, the processor can be customized through the addition of new instructions, and application-specific functions can be implemented as coprocessors. Altera provides a C-to-Hardware (C2H) compiler that can speed up selected C functions within a C program by using a hardware acceleration block. When working with the C2H compiler, the code is written in C, and the chosen block of code is then accelerated in hardware as needed. The design tool automatically integrates that block with the Nios II processor. This methodology saves much of the time required to implement and validate a complex design. In this work we present a preliminary performance evaluation of the C2H compiler on the FFT algorithm. We compare the compiler results with results obtained with an HDL and calculate the time efficiency. After code transformation, speedups between 6 and 10 have been obtained. For loops with a recurrence, a speedup greater than 2 has been obtained; we show the basic C transformations that provide the best C2H results. In this work we use the Altera C2H compiler for the automatic VHDL synthesis of the FFT algorithm.

Keywords- FFT, C2H, CYCLONE III, DMA, FPGA

I. INTRODUCTION

The FFT algorithm was proposed by Cooley and Tukey in 1965. The FFT is the basis of Digital Signal Processing (DSP), and it is the bottleneck in some DSP applications. Several FFT algorithms have been developed, such as radix-2 algorithms, radix-2^m algorithms, the split-radix FFT algorithm, the Prime Factor algorithm, the Winograd Fast Fourier Transform Algorithm (WFTA), and the Fast Hartley Transform algorithm (FHT). The Discrete Fourier Transform (DFT) is the most straightforward mathematical method for determining the frequency content X(k) of a time-domain sequence x(n). The N-point DFT is defined as:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1$$

where the twiddle factor

$$W_N = e^{-j 2\pi / N}$$

denotes the N-point primitive root of unity.

In this work the FFT algorithm is implemented on the Altera Nios II soft-core processor on a Cyclone III FPGA. We implement the FFT algorithm in embedded C and then apply hardware acceleration using the Altera C2H compiler and DMA, which maximizes the throughput in terms of clock cycles. The remainder of the paper is organized as follows. Section II presents the system design for the FFT algorithm implementation. Section III lists the required tools. Section IV describes the FFT implementation technique. The comparison results are discussed in Section V. Concluding remarks are given in Section VI.

II. SYSTEM DESIGN

The architecture of the FFT algorithm implementation is shown below.

Fig 1: System design of FFT

1. FFT UNIT

The fast Fourier transform (FFT) is a highly efficient method for calculating the discrete Fourier transform (DFT). The DFT is used in signal processing applications for a range of purposes, such as analyzing the frequency components of signals and data compression. The DFT is a computationally intensive function.
A naïve (non-FFT) implementation of an n-point DFT requires n^2 complex multiplications. These are some basic facts about the FFT algorithm to be aware of:
- The FFT operates on complex data. It performs calculations simultaneously on the real and imaginary components of the data.
- The algorithm implements a complex multiplication as four real multiplications, one addition and one subtraction.
- One of the fundamental operations in the FFT algorithm is the butterfly calculation. The butterfly calculation either breaks a larger DFT into smaller DFTs, or recombines smaller DFTs into a larger DFT. The name butterfly comes from the shape of the dataflow diagram describing the operation.
- The FFT function uses a technique called bit reversal to rearrange the input points so that the outputs are in the correct order.
- Conventional software FFT implementations obtain some of their speed by pre-calculating the sine and cosine terms used in the butterfly calculations. These sine and cosine terms are called twiddle factors.

2. NIOS II PROCESSOR

The Nios II processor is a general-purpose 32-bit RISC soft-core processor. Soft-core processors are generally described in a hardware description language such as VHDL or Verilog and can be downloaded to FPGA hardware; several such processors can be implemented in parallel on one FPGA. The Nios II processor can be used with a variety of components to form a complete system. Nios II processors have a full 32-bit instruction set, data path, and address space, together with general-purpose registers. The system also includes the interfaces needed to connect to other chips on the FPGA board. These components are interconnected by the Avalon Switch Fabric. Memory blocks in the Cyclone III device provide on-chip memory for the soft-core processor. A JTAG UART interface connects to the circuitry that provides a Universal Serial Bus (USB) link to the host computer to which the FPGA board is connected. This circuitry and its associated software are called the USB-Blaster.
Another module, called the JTAG Debug module, is provided to allow the host computer to control the Nios II processor.

3. SYSTEM INTERCONNECT AVALON SWITCH FABRIC BUS

The Avalon bus connects the peripherals and the processor of a system. It is a synchronous bus system with master and slave components: the processor acts as a master and can initiate bus transfers, whereas memory is a slave component that only accepts transfers initiated by a master. Multiple masters and slaves are allowed on the system interconnect Avalon Switch Fabric bus. A master gets access to a slave according to its priority. The user defines the master-slave connections and arbitration priorities, and the bus logic is generated automatically. Arbitration is based on a slave-side arbitration scheme. If both the instruction master and the data master of the Nios processor connect to a single slave, the data master should be assigned the higher arbitration priority for improved performance. Since Altera FPGAs do not support tri-state buffers for the implementation of general logic, multiplexers are used to route signals between masters and slaves. Although peripherals may reside on-chip or off-chip, all bus logic is implemented on-chip.

The Avalon bus is not a shared bus structure. Each master-slave pair has a dedicated connection, so multiple masters can perform bus transactions simultaneously as long as they are not accessing the same slave. The Avalon bus provides several features to facilitate the use of simple peripherals. Peripherals that produce interrupts only need to implement a single interrupt request signal. The Avalon bus logic automatically forwards the interrupt request to the master, along with the interrupt number defined at design time. Arbitration logic also handles interrupt priorities when multiple peripherals request an interrupt from a single master, so the interrupt with the highest priority is forwarded first.
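The slave-side arbitration described above can be modelled in a few lines of C. This is only a simplified fixed-priority sketch of the behaviour stated in the text, not the logic that the Altera tools generate: the function `arbitrate`, the constants, and the convention that a larger priority value wins are all assumptions made for illustration.

```c
#define NUM_MASTERS 3      /* hypothetical system size */
#define NUM_SLAVES  2
#define NO_REQUEST  (-1)   /* marks an idle master or an ungranted slave */

/*
 * One arbitration cycle, decided independently at each slave.
 * request[m]  : index of the slave that master m addresses, or NO_REQUEST
 * priority[m] : arbitration priority of master m (larger value wins here)
 * grant[s]    : receives the master granted access to slave s, or NO_REQUEST
 */
static void arbitrate(const int request[NUM_MASTERS],
                      const int priority[NUM_MASTERS],
                      int grant[NUM_SLAVES])
{
    for (int s = 0; s < NUM_SLAVES; s++) {
        grant[s] = NO_REQUEST;
        for (int m = 0; m < NUM_MASTERS; m++) {
            if (request[m] != s)
                continue;               /* master m is not addressing slave s */
            if (grant[s] == NO_REQUEST || priority[m] > priority[grant[s]])
                grant[s] = m;           /* highest-priority requester wins */
        }
    }
}
```

For instance, if an instruction master (index 0, priority 1) and a data master (index 1, priority 2) both address slave 0 while a third master (index 2) addresses slave 1, the data master is granted slave 0 and the third master is granted slave 1 in the same cycle, because the two slaves arbitrate independently.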
Separate data, address, and control lines are used, so the peripherals do not have to decode address and data bus cycles. I/O peripherals are memory mapped, and the address mapping is defined at design time.

4. SDRAM CONTROLLER

The SDRAM controller core supports double data rate (DDR) and low-power DDR2 (LPDDR2) SDRAM. The SDRAM controller provides high-performance data access and run-time programmability. Together with the Avalon interconnect, the controller provides an Avalon memory-mapped interface. The controller reorders data to reduce row conflicts and bus turn-around time by grouping read and write transactions together, allowing for efficient traffic patterns and reduced latency.

5. ALTERA NIOS II JTAG DEBUG MODULE

The Nios II supports a JTAG debug module that provides on-chip emulation features to control the processor from a host PC. PC-based software debugging tools communicate with the JTAG debug module and provide facilities such as the following:
- Downloading programs to memory
- Starting and stopping execution
- Setting breakpoints and watchpoints
- Analyzing registers and memory
- Collecting real-time execution trace data

Fig 2: JTAG debug module

III. REQUIREMENT OF TOOL

1. HARDWARE: Altera Cyclone III FPGA board
2. SOFTWARE: Altera Quartus II software, Altera C2H compiler, Altera Nios II Embedded Design Suite

IV. SYSTEM IMPLEMENTATION TECHNIQUE

1. SOFTWARE IMPLEMENTATION

In this method we implemented the FFT algorithm with the Nios II IDE and the C2H compiler tool chain so that it runs on the Nios II soft processor on the FPGA board. We initialize a performance counter for the algorithm and obtain the number of clock cycles for the software approach.

2. HARDWARE ACCELERATION USING DMA

The Nios II C2H Compiler automates a significant portion of the design flow by generating coprocessors that offload and enhance the performance of a microprocessor running software written in pure ANSI C. It is tightly integrated into the software build flow and the SOPC Builder system generation tool. The tool automatically integrates the accelerator into the hardware and software projects, providing a pure-software development environment for managing hardware/software partitioning. The Nios II C2H Compiler uses SOPC Builder to connect the accelerator to the processor and any other peripherals in the system. This gives the accelerator direct access to a memory map identical to that of the CPU, allowing seamless support for pointers and arrays when migrating from software to hardware. The GUI for the compiler is the CPU's software integrated development environment (IDE).

The Nios II C2H Compiler also provides a unique solution with full support for pointers and array accesses. This is possible due to the integration with SOPC Builder, which gives the accelerated function access to the same memory map that it had when running in software. This is also necessary for easy transfer of data between the accelerator and the CPU, as well as the other peripherals in the system.

Using DMA for the FFT can improve the performance of some hardware/software applications: the processor needs a large number of memory accesses to execute the butterfly operations, which is a time-consuming process.

Fig. 3: SOPC Builder system design
Fig. 4: Accelerator integration and interface

V. RESULT

Hardware acceleration with DMA:

Hello from Nios II!
--Performance Counter Report--
Total Time: 8.308E-05 seconds (4154 clock-cycles)
+---------------+-----+-----------+---------------+-----------+
| Section       |  %  | Time (sec)| Time (clocks) |Occurrences|
+---------------+-----+-----------+---------------+-----------+
|Hardware + DMA | 99.8|    0.00008|           4144|          1|
+---------------+-----+-----------+---------------+-----------+
Real data index 0 = 36, Imaginary Data index 0 = 0
Real data index 1 = -40, Imaginary Data index 1 = 96
Real data index 2 = -40, Imaginary Data index 2 = 40
Real data index 3 = -40, Imaginary Data index 3 = 16
Real data index 4 = -40, Imaginary Data index 4 = 0
Real data index 5 = -40, Imaginary Data index 5 = -16
Real data index 6 = -40, Imaginary Data index 6 = -40
Real data index 7 = -40, Imaginary Data index 7 = -96

Software only:

--Performance Counter Report--
Total Time: 0.0001295 seconds (6475 clock-cycles)
+---------------+-----+-----------+---------------+-----------+
| Section       |  %  | Time (sec)| Time (clocks) |Occurrences|
+---------------+-----+-----------+---------------+-----------+
|Software Only  | 99.5|    0.00013|           6445|          1|
+---------------+-----+-----------+---------------+-----------+
Real data index 0 = 36, Imaginary Data index 0 = 0
Real data index 1 = -40, Imaginary Data index 1 = 96
Real data index 2 = -40, Imaginary Data index 2 = 40
Real data index 3 = -40, Imaginary Data index 3 = 16
Real data index 4 = -40, Imaginary Data index 4 = 0
Real data index 5 = -40, Imaginary Data index 5 = -16
Real data index 6 = -40, Imaginary Data index 6 = -40
Real data index 7 = -40, Imaginary Data index 7 = -96

Both runs produce identical output data. The measured section takes 4144 clock cycles with hardware acceleration and DMA against 6445 clock cycles in software only, a reduction by a factor of about 1.56.

VI. CONCLUSION

This paper presents a Fast Fourier Transform (FFT) implementation on a Field Programmable Gate Array (FPGA)-based Nios II soft-core processor working in combination with custom hardware accelerators generated through high-level synthesis. The proposed system architecture, synthesized on a Cyclone III FPGA board, was developed through an iterative design space exploration methodology using Altera's C2H compiler. Compared with the software-only algorithm, the hardware and DMA approach reduces the number of clock cycles, which improves design efficiency and throughput and reduces latency.

ACKNOWLEDGMENT

I wish to acknowledge all my colleagues from the Electronics & Telecommunication Department and other contributors.