The Fastest FFT in the West (11/96)

advertisement
The Fastest FFT in the West
The incorporation of a large FFT [1] in a single FPGA, while noteworthy, may evoke a “so what”
response. Again its speed will be compared to the more standard single chip DSP design. We propose to
compare Xilinx FPGA performance with an exhaustive list of DSP devices. The test benchmark (fig. 1),
established in 1995, is the execution time of a 256 point FFT.
Speed in the FPGA design is set by the computation time of the radix 2 butterfly. For 16 bit data and a 50
Mhz system clock the computation time indicated in [1] is 320 nsecs. The number of butterfly
computations ((N/2)log2N) for a 256 point FFT is:
(256/2)log2 256 = 128 x 8 = 1024
The resulting execution time is (320nsecs) x 1024 or 312.5 usecs - respectable but not spectacular.
However, with a 70 Mhz system clock the time drops to 223 usecs which is better than most, but not the
fastest.
The next step is to realize that FFT operations can be partitioned by pass. Indeed, if butterfly computation
nodes were provided for each of the 8 passes (a pipelined FFT), the time could be reduced to one eighth or
28 usecs which would make it the uncontested champion. But the contest rules require all 8 butterfly
nodes be embedded in a single FPGA device. According to [1] the butterfly processor data path (fig. 2)
for N = 256 requires 201 CLBs, and 8 x 201 = 1608, which exceeds the capacity of the largest Xilinx
device (the 4025 with 1024 CLBs). Fortunately, partial partitioning schemes are possible.
Referring to the signal flow graph of an 8 point FFT (fig. 3), it may be observed that after the first pass two
identical signal flows occur with each handling half the data, i.e., a pair of 128 point FFTs. It would thus
be possible to partition the tasks so that the first pass is an N-256 butterfly processor while the remaining 7
passes (fig. 4) are performed by a pair of N = 127 processors. The CLB count for the data path is:
first pass
7 passes
Total
201 (3 stages)
2 x 186 = 372 (2 stages)
573 CLBs
573 CLBs will easily fit on the 4025 device. The computation time has now been reduced to that of a 128
point FFT:
(128/2) x log2 128 = 64 x 7 = 448 CLBs
The FFT execution time is 448 x 320 x 5/7 = 102.4 usecs - not quite as fast as the floating point C80
WARP Processor with 4 on-chip processing nodes and cache memory. It is certainly the fastest fixedpoint FFT processor.
[1] Mintzer, L., “Large FFTs in a single FPGA”. Proc. of ICSPAT ’96.
Fig. 1. Execution Times for 256 Point FFT
Fig. 2. Distributed Arithmetic Data Path Blocks for Computing Radix 2 Butterfly, N=8192
Fig. 3. FFT Signal Flow for N=8 (Normal In, Reverse Binary Out)
Fig. 4. Partitioning for Higher Speed
Download