ELEC692 VLSI Signal Processing Architecture
Lecture 8
Architecture for Fourier Transform

Usage of FFT
• Frequency transformation
• Applications
  – OFDM wireless systems
  – Speech/Multimedia data processing
  – Satellite wireless transmission
  – DTV, DAB broadcasting using OFDM
• Real-time requirements need special hardware:
  – E.g. COFDM for DTV
    • Signal bandwidth = 7.5 MHz
    • Useful symbol duration = 1 ms
    • Number of parallel subcarriers = 7.5 MHz × 1 ms = 7500
    • Need an 8K-point complex FFT
    • Compute one 8K-point complex FFT every 1 ms, i.e. about 8M complex samples transformed per second
• Not efficient or practical to implement in software; special HW is needed for the FFT
  – In fact there are quite a few off-the-shelf FFT processors available on the market, but it is better to integrate the hardware within your chip

DFT review
• The N-point discrete Fourier transform Y(k) of an N-point sequence x(n) (and the inverse DFT) is given by

$$Y(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk} \quad \text{for } k = 0, 1, \ldots, N-1, \qquad \text{with } W_N = e^{-j(2\pi/N)}$$

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} Y(k)\, W_N^{-nk} \quad \text{for } n = 0, 1, \ldots, N-1$$

DFT
Useful properties of the twiddle factors:

$$W_N^{k+N/2} = -W_N^{k}; \qquad W_N^{k+N} = W_N^{k}$$
$$W_N^{0} = -W_N^{N/2} = 1$$
$$W_N^{N/4} = -W_N^{3N/4} = -j$$
$$W_N^{N/8} = -W_N^{5N/8} = \frac{\sqrt{2}}{2}(1 - j)$$
$$W_N^{3N/8} = -W_N^{7N/8} = -\frac{\sqrt{2}}{2}(1 + j)$$
$$(a + jb)\, W_8^{1} = \frac{\sqrt{2}}{2}\,[(a + b) + j(b - a)]$$
$$(a + jb)\, W_8^{3} = \frac{\sqrt{2}}{2}\,[(b - a) - j(b + a)]$$

Direct Implementation of DFT
• Product of a matrix (W) and a vector (x): y = Wx
• An 8-point DFT example (with $w = W_8$ and exponents reduced modulo 8):

$$\begin{pmatrix} y(0)\\ y(1)\\ y(2)\\ y(3)\\ y(4)\\ y(5)\\ y(6)\\ y(7) \end{pmatrix} =
\begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1\\
1 & w^1 & w^2 & w^3 & w^4 & w^5 & w^6 & w^7\\
1 & w^2 & w^4 & w^6 & 1 & w^2 & w^4 & w^6\\
1 & w^3 & w^6 & w^1 & w^4 & w^7 & w^2 & w^5\\
1 & w^4 & 1 & w^4 & 1 & w^4 & 1 & w^4\\
1 & w^5 & w^2 & w^7 & w^4 & w^1 & w^6 & w^3\\
1 & w^6 & w^4 & w^2 & 1 & w^6 & w^4 & w^2\\
1 & w^7 & w^6 & w^5 & w^4 & w^3 & w^2 & w^1
\end{pmatrix}
\begin{pmatrix} x(0)\\ x(1)\\ x(2)\\ x(3)\\ x(4)\\ x(5)\\ x(6)\\ x(7) \end{pmatrix}$$

[Figure: 1D array for DFT for N=8]

Complex multiplications

$$u = xa = (x_r + jx_i)(a_r + ja_i) = (x_r a_r - x_i a_i) + j(x_i a_r + x_r a_i)$$
$$\phantom{u} = [x_r(a_r - a_i) + (x_r - x_i)a_i] + j[x_i(a_r + a_i) + (x_r - x_i)a_i]$$

The second form shares the product $(x_r - x_i)a_i$ between the real and imaginary parts, so it needs three real multiplications instead of four.

Fast DFT
• Fast DFT (Discrete Fourier Transform) algorithm
  – Cooley-Tukey decomposition (1965)
    • Radix-2 Decimation-in-Time (DIT) or Decimation-in-Frequency (DIF)
    • Divide the problem size into two interleaved halves at each recursive stage
Radix-2 decomposition first transforms the even-indexed samples $x_0, x_2, \ldots, x_{n-2}$ and then the odd-indexed samples $x_1, x_3, \ldots, x_{n-1}$, and then combines these two results.
  – The sequence can be decomposed recursively to reduce the overall runtime to O(n log n)

Radix-2 DIF DFT

$$Y(k) = \sum_{n=0}^{N/2-1} W_N^{nk}\, x(n) + W_N^{(N/2)k} \sum_{n=0}^{N/2-1} W_N^{nk}\, x(n + N/2)$$

Since $W_N^{N/2}$ corresponds to a rotation of 180°, the factor of the second sum can be reduced even further: $W_N^{(N/2)k} = (-1)^k$. We have

$$Y(k) = \sum_{n=0}^{N/2-1} W_N^{nk}\, [x(n) + (-1)^k x(n + N/2)]$$

The division of k into even and odd values leads to the following, for $k = 0, 1, \ldots, N/2 - 1$:

$$Y(2k) = \sum_{n=0}^{N/2-1} [x(n) + x(n + N/2)]\, W_N^{2nk}$$
$$Y(2k+1) = \sum_{n=0}^{N/2-1} [x(n) - x(n + N/2)]\, W_N^{n}\, W_N^{2nk}$$

With

$$x_0'(n) = x(n) + x(n + N/2), \qquad x_1'(n) = [x(n) - x(n + N/2)]\, W_N^{n}$$

this becomes two DFTs of length N/2:

$$Y(2k) = \sum_{n=0}^{N/2-1} x_0'(n)\, W_N^{2nk}, \qquad Y(2k+1) = \sum_{n=0}^{N/2-1} x_1'(n)\, W_N^{2nk}$$

[Figure: butterfly operation. Inputs $x(n)$ and $x(n+N/2)$; outputs $x_0'(n)$ and, after the $-1$ branch and the multiplication by $W_N^n$, $x_1'(n)$]

Radix-2 decomposition of 8-point FFT
[Figure: signal flow graph of the radix-2 DIF 8-point FFT. Inputs x(0) ... x(7) in normal order, outputs in bit-reversed order y(0), y(4), y(2), y(6), y(1), y(5), y(3), y(7); twiddle factors W^0 ... W^3]

Implementation of Radix 2 FFT
• Two extreme methods
• Reuse a single butterfly
  – Slower
  – Smaller area
  – More complicated control
• Fully multi-stage straight implementation
  – Faster
  – Larger area
  – More regular control
• Trade-off between the two ends based on
  – Speed, area, power

Comparison of calculation

        MUL                        ADD
DFT     (N-2)^2                    N^2
FFT     (N/2) log2 N - (N-1)       N log2 N

[Figure: butterfly operation hardware]

Data transport
• One problem for FFT is its less regular data transport.
If the butterfly PEs are configured such that PEs with lower exponents of W come first in each stage, the result is a configuration with identical communication networks between stages (perfect shuffle).

Conventional single butterfly FFT implementation
• Strong speed limitation
• Large storage area needed for intermediate results (N complex words)
• If the memory is not partitioned, the number of R/W accesses needed to perform the FFT creates a bottleneck
• An N-point FFT requires (N/r) log_r N radix-r butterfly computations and 2N log_r N R/W RAM accesses

Single-stage (1-D) implementation: horizontal projection
• Horizontal projection: provide PEs for a single stage
• Use only N/2 PEs, i.e. one stage only
  – Reduces throughput by a factor of log2 N compared with a 2-D array
  – Need to take care of the complex communication structure
• PEs do not have fixed coefficients; the coefficients need to change after each cycle, and the global communication network is a disadvantage
• Pipelining the PEs does not allow a direct increase in throughput for this architecture, since the results of the current processing step are required for the next processing step
• However, sequential data blocks of length N can be processed independently of one another, so several data blocks can be processed by interleaving
• This requires an increase in the number of registers
• If N is large, we cannot implement all N/2 PEs
• Project the N/2 butterfly PEs onto M PEs, where M is also a power of 2 and M < N/2
• Special registers for input data, intermediate results and result data are required
  – The registers cyclically read and write a particular sequence of 2M complex data

Single-stage (1-D) implementation: vertical projection
• Vertical projection: have 1 PE for each stage (total # = log2 N PEs)
• Need circuitry between PEs to prepare the correct data input
• From stage to stage, the length of the sequence onto which the FFT is applied is halved.
• Given that the previous stage led to a DFT of length 2n, in accordance with the perfect shuffle, the sequence of length 2n must be halved and the 1st and (n+1)th values must be fed to the following PE. Then the 2nd and (n+2)th values are fed to it.
• Hence the sequence must be delayed by n clock cycles in accordance with the position of the midpoint.

Data formatting/sorting for vertical projection
• The block $u_{n-1}, \ldots, u_0$ must be delayed by n clock cycles.
• When $u_n$ is available, the values from the stream u must be fed to the new lower stream v'. The values of u are input in parallel into the next butterfly stage for n clock cycles.
• So the values of v are fed in parallel to the next butterfly PE for n clock cycles, with $v_{n-1}, \ldots, v_0$ delayed by 2n cycles and $v_{2n-1}, \ldots, v_n$ delayed by n cycles.
• A special circuit is necessary for the data input of the 1st stage.
• The incoming data stream of N data is divided into 2 parts of N/2 data. The clock rate is hence halved. We need a demultiplexer followed by a FIFO register.

Overall architecture of linear FFT array based on butterfly PEs and delay commutators
Consists of log2 N PEs, with delay commutators located between the PEs.
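Due to the continued halving, control signals are extracted using frequency dividers.

Higher radix FFT
• Radix-4 DIF algorithm

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{kn}$$
$$= \sum_{n=0}^{N/4-1} x(n)\, W_N^{kn} + \sum_{n=N/4}^{N/2-1} x(n)\, W_N^{kn} + \sum_{n=N/2}^{3N/4-1} x(n)\, W_N^{kn} + \sum_{n=3N/4}^{N-1} x(n)\, W_N^{kn}$$
$$= \sum_{n=0}^{N/4-1} x(n)\, W_N^{kn} + W_N^{kN/4} \sum_{n=0}^{N/4-1} x\!\left(n + \tfrac{N}{4}\right) W_N^{kn} + W_N^{kN/2} \sum_{n=0}^{N/4-1} x\!\left(n + \tfrac{N}{2}\right) W_N^{kn} + W_N^{3kN/4} \sum_{n=0}^{N/4-1} x\!\left(n + \tfrac{3N}{4}\right) W_N^{kn}$$

We have

$$W_N^{kN/4} = (-j)^k, \qquad W_N^{kN/2} = (-1)^k, \qquad W_N^{3kN/4} = (j)^k$$

Thus

$$X(k) = \sum_{n=0}^{N/4-1} \left[ x(n) + (-j)^k x\!\left(n + \tfrac{N}{4}\right) + (-1)^k x\!\left(n + \tfrac{N}{2}\right) + (j)^k x\!\left(n + \tfrac{3N}{4}\right) \right] W_N^{nk}$$

Radix-4 DIF algorithm

$$X(4k) = \sum_{n=0}^{N/4-1} \left[ x(n) + x\!\left(n + \tfrac{N}{4}\right) + x\!\left(n + \tfrac{N}{2}\right) + x\!\left(n + \tfrac{3N}{4}\right) \right] W_N^{0}\, W_{N/4}^{kn}$$
$$X(4k+1) = \sum_{n=0}^{N/4-1} \left[ x(n) - j\,x\!\left(n + \tfrac{N}{4}\right) - x\!\left(n + \tfrac{N}{2}\right) + j\,x\!\left(n + \tfrac{3N}{4}\right) \right] W_N^{n}\, W_{N/4}^{kn}$$
$$X(4k+2) = \sum_{n=0}^{N/4-1} \left[ x(n) - x\!\left(n + \tfrac{N}{4}\right) + x\!\left(n + \tfrac{N}{2}\right) - x\!\left(n + \tfrac{3N}{4}\right) \right] W_N^{2n}\, W_{N/4}^{kn}$$
$$X(4k+3) = \sum_{n=0}^{N/4-1} \left[ x(n) + j\,x\!\left(n + \tfrac{N}{4}\right) - x\!\left(n + \tfrac{N}{2}\right) - j\,x\!\left(n + \tfrac{3N}{4}\right) \right] W_N^{3n}\, W_{N/4}^{kn}$$

• Butterfly of Radix-4 Algorithm
[Figure: radix-4 butterfly]
[Figure: Radix-4 signal flow graph]

Higher radix FFT
• Radix-8 algorithm

$$A(8k + l) = \sum_{n=0}^{N-1} x(n)\, W_N^{(8k+l)n}$$
$$= \sum_{m=0}^{7} \sum_{n=0}^{N/8-1} x\!\left(n + \tfrac{mN}{8}\right) W_N^{(8k+l)(mN/8 + n)}$$
$$= \sum_{n=0}^{N/8-1} \left[ \sum_{m=0}^{7} x\!\left(n + \tfrac{mN}{8}\right) W_8^{lm} \right] W_N^{nl}\, W_{N/8}^{nk}$$
$$= \sum_{n=0}^{N/8-1} \Big\{ \left[ x(n) + x\!\left(n + \tfrac{2N}{8}\right) W_4^{l} + x\!\left(n + \tfrac{4N}{8}\right) W_4^{2l} + x\!\left(n + \tfrac{6N}{8}\right) W_4^{3l} \right]$$
$$\qquad + \left[ x\!\left(n + \tfrac{N}{8}\right) + x\!\left(n + \tfrac{3N}{8}\right) W_4^{l} + x\!\left(n + \tfrac{5N}{8}\right) W_4^{2l} + x\!\left(n + \tfrac{7N}{8}\right) W_4^{3l} \right] W_8^{l} \Big\}\, W_N^{nl}\, W_{N/8}^{nk}$$

with $l = 0, 1, \ldots, 7$ and $k = 0, 1, \ldots, N/8 - 1$.

Some pipeline FFT Processor Architecture
• Assume the input sequence is in normal order and the output is allowed to be in digit-reversed (radix-2 or radix-4) order.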
• Assume DIF type of decomposition
• Here we assume the additive butterfly has been separated from the multiplier to show the hardware requirements distinctly

Radix-2 Multi-path Delay Commutator (R2MDC), N=16
The input sequence is broken into 2 parallel data streams flowing forward, with the correct "distance" between data elements entering the butterfly scheduled by proper delays.
# of multipliers: log2 N – 2
# of butterflies: log2 N
# of registers: (3/2)N – 2

Radix-2 Single-path Delay Feedback (R2SDF), N=16
Stores the butterfly output in feedback shift registers. A single data stream goes through the multiplier at every stage.
# of multipliers: log2 N – 2
# of butterflies: log2 N
# of registers: N – 1

Radix-4 Single-path Delay Feedback (R4SDF), N=256
Uses radix-4 and CORDIC iterations. Utilization of the multipliers is increased to 75% because 3 out of the 4 radix-4 butterfly outputs are stored. Utilization of the radix-4 butterfly (which is more complicated than the radix-2 butterfly, containing at least 8 complex adders) drops to 25%.
# of multipliers: log4 N – 1
# of butterflies: log4 N
# of registers: N – 1

Radix-4 Multi-path Delay Commutator (R4MDC), N=256
Utilization rate: butterflies 25%, multipliers 25%
# of multipliers: 3 log4 N
# of butterflies: log4 N
# of registers: (5/2)N – 4

Some observations
• Delay-feedback architectures are more efficient than the corresponding delay-commutator architectures in terms of memory utilization, since the stored butterfly output can be directly used by the multipliers
• Radix-4 based single-path architectures have higher multiplier utilization, but radix-2 architectures have simpler butterflies which are better utilized.
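The hardware counts quoted for the four pipeline architectures can be tabulated side by side. The following is a minimal sketch that simply evaluates the formulas as they appear on the slides; the function name `pipeline_fft_costs` is hypothetical, and N is assumed to be a power of 4 so that both the radix-2 and radix-4 entries are defined.

```python
def pipeline_fft_costs(N):
    """Evaluate the slide formulas for an N-point pipeline FFT.

    Returns {architecture: (multipliers, butterflies, registers)}.
    Assumption: N is a power of 4, so log2(N) and log4(N) are integers.
    """
    s2 = N.bit_length() - 1          # log2(N) for a power of two
    if N < 4 or (1 << s2) != N or s2 % 2:
        raise ValueError("N must be a power of 4")
    s4 = s2 // 2                     # log4(N)
    return {
        "R2MDC": (s2 - 2, s2, 3 * N // 2 - 2),
        "R2SDF": (s2 - 2, s2, N - 1),
        "R4SDF": (s4 - 1, s4, N - 1),
        "R4MDC": (3 * s4, s4, 5 * N // 2 - 4),
    }


if __name__ == "__main__":
    for name, (mul, bf, reg) in pipeline_fft_costs(256).items():
        print(f"{name}: multipliers={mul}, butterflies={bf}, registers={reg}")
```

For N = 256 the two single-path delay feedback variants need 255 registers each, against 382 and 636 for the delay commutator variants, which is the memory advantage noted in the observations.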
Comparison

Radix / Speed:               Low ------ High
Control scheme:              Simple --- Complex
Processing ability / unit:   Low ------ High

• Combine the advantages
• Further decompose the high-radix PE

Radix-2^2 DIF FFT
• Optimal hardware
  – Same number of non-trivial multiplications, at the same positions in the SFG, as radix-4 algorithms
  – The same butterfly structure as radix-2 algorithms
  – Radix-2^2 DIF FFT (S. He, M. Torkelson, "A New Approach to Pipeline FFT Processor", in Proceedings of IPPS, 1996, pp. 766-780.)

$$Y(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk} \quad \text{for } k = 0, 1, \ldots, N-1$$

Apply a 3-dimensional linear index map

$$n = \left\langle \tfrac{N}{2} n_1 + \tfrac{N}{4} n_2 + n_3 \right\rangle_N, \qquad k = \left\langle k_1 + 2k_2 + 4k_3 \right\rangle_N$$

where $n_1, n_2 \in \{0, 1\}$ and $n_3 \in \{0, \ldots, N/4 - 1\}$.

The common factor algorithm (CFA) has the form

$$Y(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{N/4-1} \sum_{n_2=0}^{1} \sum_{n_1=0}^{1} x\!\left(\tfrac{N}{2} n_1 + \tfrac{N}{4} n_2 + n_3\right) W_N^{\left(\frac{N}{2} n_1 + \frac{N}{4} n_2 + n_3\right)(k_1 + 2k_2 + 4k_3)}$$

Summation over $n_1$ gives

$$Y(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{N/4-1} \sum_{n_2=0}^{1} \left\{ B_{N/2}^{k_1}\!\left(\tfrac{N}{4} n_2 + n_3\right) W_N^{\left(\frac{N}{4} n_2 + n_3\right) k_1} \right\} W_N^{\left(\frac{N}{4} n_2 + n_3\right)(2k_2 + 4k_3)}$$

where

$$B_{N/2}^{k_1}\!\left(\tfrac{N}{4} n_2 + n_3\right) = x\!\left(\tfrac{N}{4} n_2 + n_3\right) + (-1)^{k_1}\, x\!\left(\tfrac{N}{4} n_2 + n_3 + \tfrac{N}{2}\right)$$

• Apply the second step of the decomposition to the remaining DFT coefficients, including the "twiddle factor", to exploit the exceptional values in the multiplication before the next butterfly is constructed:

$$W_N^{\left(\frac{N}{4} n_2 + n_3\right)(k_1 + 2k_2 + 4k_3)} = W_N^{N n_2 k_3}\, W_N^{\frac{N}{4} n_2 (k_1 + 2k_2)}\, W_N^{n_3 (k_1 + 2k_2)}\, W_N^{4 n_3 k_3} = (-j)^{n_2 (k_1 + 2k_2)}\, W_N^{n_3 (k_1 + 2k_2)}\, W_N^{4 n_3 k_3}$$

After substitution and simplification, we have

$$Y(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{N/4-1} \left[ H(k_1, k_2, n_3)\, W_N^{n_3 (k_1 + 2k_2)} \right] W_{N/4}^{n_3 k_3}$$

where

$$H(k_1, k_2, n_3) = \underbrace{\left[ x(n_3) + (-1)^{k_1} x\!\left(n_3 + \tfrac{N}{2}\right) \right]}_{\text{BF I}} + (-j)^{(k_1 + 2k_2)} \underbrace{\left[ x\!\left(n_3 + \tfrac{N}{4}\right) + (-1)^{k_1} x\!\left(n_3 + \tfrac{3N}{4}\right) \right]}_{\text{BF I}}$$

and the outer sum of the two BF I outputs, including the trivial factor $(-j)^{(k_1 + 2k_2)}$, forms BF II.

[Figure: butterfly with decomposed twiddle factors]
Full multipliers are required to compute the product with the decomposed twiddle factor $W_N^{n_3 (k_1 + 2k_2)}$. The order of the twiddle factors is different from that of the radix-4 algorithm.
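The radix-2^2 decomposition can be checked numerically against the direct DFT. The sketch below is a plain software recursion, not a model of the pipeline hardware: the names `dft` and `radix22_fft` are hypothetical, N is assumed to be a power of 4, and the twiddle convention $W_N = e^{-j 2\pi/N}$ is assumed. It computes H(k1, k2, n3) from the two BF I outputs, applies the twiddle factor, and recurses on the length-N/4 DFT over n3.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]


def radix22_fft(x):
    """Radix-2^2 DIF FFT sketch; len(x) is assumed to be a power of 4.

    Output is returned in natural order by writing index k1 + 2*k2 + 4*k3.
    """
    N = len(x)
    if N == 1:
        return list(x)
    q = N // 4
    Y = [0j] * N
    for k1 in (0, 1):
        for k2 in (0, 1):
            sign = (-1) ** k1                  # (-1)^k1 used in both BF I terms
            rot = (-1j) ** (k1 + 2 * k2)       # trivial factor (-j)^(k1+2k2) in BF II
            h = []
            for n3 in range(q):
                bf1_a = x[n3] + sign * x[n3 + N // 2]
                bf1_b = x[n3 + q] + sign * x[n3 + 3 * q]
                H = bf1_a + rot * bf1_b        # BF II output H(k1, k2, n3)
                tw = cmath.exp(-2j * cmath.pi * n3 * (k1 + 2 * k2) / N)
                h.append(H * tw)               # full multiplier: W_N^(n3 (k1+2k2))
            sub = radix22_fft(h)               # remaining DFT of length N/4 over n3
            for k3 in range(q):
                Y[k1 + 2 * k2 + 4 * k3] = sub[k3]
    return Y


if __name__ == "__main__":
    x = [complex(n % 5, (3 * n) % 7) for n in range(16)]
    err = max(abs(a - b) for a, b in zip(radix22_fft(x), dft(x)))
    print("max abs error vs direct DFT:", err)
```

Only the multiplication by `tw` is a general complex multiplication; the factors `sign` and `rot` take values in {1, -1, j, -j}, which is why BF I and BF II need no multipliers.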
Complete Radix-2^2 DIF FFT
• Apply the CFA recursively to the remaining DFTs of length N/4.
[Figure: block diagram with butterfly blocks labelled BF4, BF2 I and BF2 II, and control]

Radix-2^2 Single-path Delay Feedback (R2^2SDF)
• 2 types of butterflies: one identical to that of R2SDF, the other also containing the logic to implement the trivial twiddle factor multiplication
• A log2 N-bit binary counter serves two purposes:
  – Synchronization controller
  – Address generation counter for twiddle factor reading in each stage
• Structure for BF2I and BF2II
[Figure: structure of BF2I and BF2II]

Operation scheduling
• During the first N/2 cycles, the 2-to-1 mux in BF2I switches to "0" and the butterfly is idle. Input data is directed to the shift registers until they are filled.
• During the next N/2 cycles, the mux turns to "1" and the butterfly computes a 2-point DFT with the incoming data and the data stored in the shift registers:

$$Z1(n) = x(n) + x(n + N/2), \qquad Z1(n + N/2) = x(n) - x(n + N/2), \qquad 0 \le n < N/2$$
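The BF2I schedule can be illustrated with a small cycle-by-cycle behavioural model. This is only a sketch under stated assumptions: the function name `bf2i_stage` is hypothetical, the feedback shift register is modelled as a FIFO of length N/2 initialised to zero, the differences are assumed to be written back into the feedback register and read out during the following N/2 cycles (as in a single-path delay feedback stage), and the trivial twiddle logic of BF2II is not modelled.

```python
from collections import deque


def bf2i_stage(samples, N):
    """Behavioural model of one BF2I stage with an N/2-deep feedback FIFO.

    Cycles 0 .. N/2-1 of each block (mux = "0"): the butterfly is idle, the
    input goes into the FIFO and the FIFO output is passed through.
    Cycles N/2 .. N-1 (mux = "1"): the sum goes to the output and the
    difference is written back into the FIFO.
    """
    half = N // 2
    fifo = deque([0] * half)
    out = []
    for t, x_in in enumerate(samples):
        delayed = fifo.popleft()
        if t % N < half:
            out.append(delayed)            # pass through previously stored values
            fifo.append(x_in)
        else:
            out.append(delayed + x_in)     # Z1(n)       = x(n) + x(n + N/2)
            fifo.append(delayed - x_in)    # Z1(n + N/2) = x(n) - x(n + N/2)
    return out


if __name__ == "__main__":
    N = 8
    block = [1, 2, 3, 4, 5, 6, 7, 8]
    # Append N/2 zeros so the stored differences are flushed out of the FIFO.
    print(bf2i_stage(block + [0] * (N // 2), N))
```

In this model the output stream lags the input by N/2 cycles: the sums appear while the second half of the block arrives, and the differences appear while the first half of the next block is being loaded.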