The Fastest FFT in the West The incorporation of a large FFT [1] in a single FPGA, while noteworthy, may evoke a “so what” response. Again its speed will be compared to the more standard single chip DSP design. We propose to compare Xilinx FPGA performance with an exhaustive list of DSP devices. The test benchmark (fig. 1), established in 1995, is the execution time of a 256 point FFT. Speed in the FPGA design is set by the computation time of the radix 2 butterfly. For 16 bit data and a 50 Mhz system clock the computation time indicated in [1] is 320 nsecs. The number of butterfly computations ((N/2)log2N) for a 256 point FFT is: (256/2)log2 256 = 128 x 8 = 1024 The resulting execution time is (320nsecs) x 1024 or 312.5 usecs - respectable but not spectacular. However, with a 70 Mhz system clock the time drops to 223 usecs which is better than most, but not the fastest. The next step is to realize that FFT operations can be partitioned by pass. Indeed, if butterfly computation nodes were provided for each of the 8 passes (a pipelined FFT), the time could be reduced to one eighth or 28 usecs which would make it the uncontested champion. But the contest rules require all 8 butterfly nodes be embedded in a single FPGA device. According to [1] the butterfly processor data path (fig. 2) for N = 256 requires 201 CLBs, and 8 x 201 = 1608, which exceeds the capacity of the largest Xilinx device (the 4025 with 1024 CLBs). Fortunately, partial partitioning schemes are possible. Referring to the signal flow graph of an 8 point FFT (fig. 3), it may be observed that after the first pass two identical signal flows occur with each handling half the data, i.e., a pair of 128 point FFTs. It would thus be possible to partition the tasks so that the first pass is an N-256 butterfly processor while the remaining 7 passes (fig. 4) are performed by a pair of N = 127 processors. The CLB count for the data path is: first pass 7 passes Total 201 (3 stages) 2 x 186 = 372 (2 stages) 573 CLBs 573 CLBs will easily fit on the 4025 device. The computation time has now been reduced to that of a 128 point FFT: (128/2) x log2 128 = 64 x 7 = 448 CLBs The FFT execution time is 448 x 320 x 5/7 = 102.4 usecs - not quite as fast as the floating point C80 WARP Processor with 4 on-chip processing nodes and cache memory. It is certainly the fastest fixedpoint FFT processor. [1] Mintzer, L., “Large FFTs in a single FPGA”. Proc. of ICSPAT ’96. Fig. 1. Execution Times for 256 Point FFT Fig. 2. Distributed Arithmetic Data Path Blocks for Computing Radix 2 Butterfly, N=8192 Fig. 3. FFT Signal Flow for N=8 (Normal In, Reverse Binary Out) Fig. 4. Partitioning for Higher Speed