FFT

ELEC692 VLSI Signal
Processing Architecture
Lecture 8
Architecture for Fourier Transform
Usage of FFT
• Frequency transformation
• Applications
– OFDM wireless systems
– Speech/Multimedia data processing
– Satellite wireless transmission
– DTV, DAB broadcasting using OFDM
• Real-time requirements call for special hardware:
– E.g. COFDM for DTV
• Signal bandwidth 7.5 MHz
• Useful symbol duration = 1 ms
• Number of parallel subcarriers = 7.5 MHz × 1 ms = 7500
• Need an 8K-point complex FFT
• Compute one 8K-point complex FFT every 1 ms, i.e. about 8M complex points per second
• Not efficient or practical to implement in software; special hardware is needed for the FFT
– In fact there are quite a few off-the-shelf FFT processors available on the market, but it is better to integrate the hardware within your chip
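The sizing arithmetic in the COFDM example can be checked with a few lines of Python. This is only a sketch; the function name `fft_size_for` is illustrative and not part of the lecture.

```python
def fft_size_for(bandwidth_hz, symbol_duration_s):
    """Return (subcarriers, fft_points) for an OFDM symbol.

    subcarriers = bandwidth * useful symbol duration;
    fft_points  = the next power of two that holds all subcarriers.
    """
    subcarriers = round(bandwidth_hz * symbol_duration_s)
    fft_points = 1 << (subcarriers - 1).bit_length()
    return subcarriers, fft_points


subcarriers, fft_points = fft_size_for(7.5e6, 1e-3)
# Complex points that must be transformed per second for real-time operation.
points_per_second = fft_points / 1e-3
```

With the numbers from the slide this gives 7500 subcarriers, an 8192-point (8K) FFT, and roughly 8.2 million complex points per second.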
DFT review
• The N-point discrete Fourier transform Y(k) of an N-point sequence x(n) (and the inverse DFT) is given by
$$Y(k) = \sum_{n=0}^{N-1} x(n)\,W_N^{nk} \quad \text{for } k = 0,1,\dots,N-1$$
$$x(n) = \frac{1}{N}\sum_{k=0}^{N-1} Y(k)\,W_N^{-nk} \quad \text{for } n = 0,1,\dots,N-1$$
with $W_N = e^{-j(2\pi/N)}$
DFT
$$W_N^{k+N/2} = -W_N^{k}; \qquad W_N^{k+N} = W_N^{k}$$
$$W_N^{0} = -W_N^{N/2} = 1$$
$$W_N^{N/4} = -W_N^{3N/4} = -j$$
$$W_N^{N/8} = -W_N^{5N/8} = \frac{\sqrt{2}}{2}(1 - j)$$
$$W_N^{3N/8} = -W_N^{7N/8} = -\frac{\sqrt{2}}{2}(1 + j)$$
$$(a + jb)\,W_8^{1} = \frac{\sqrt{2}}{2}\,[(a + b) + j(b - a)]$$
$$(a + jb)\,W_8^{3} = \frac{\sqrt{2}}{2}\,[(b - a) - j(b + a)]$$
Direct Implementation of DFT
• Product of a matrix (W) and a vector (x)
$$y = Wx$$
• An 8-point example (with $w = W_8$)
$$\begin{bmatrix} y(0)\\ y(1)\\ y(2)\\ y(3)\\ y(4)\\ y(5)\\ y(6)\\ y(7) \end{bmatrix} =
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1\\
1 & w^1 & w^2 & w^3 & w^4 & w^5 & w^6 & w^7\\
1 & w^2 & w^4 & w^6 & 1 & w^2 & w^4 & w^6\\
1 & w^3 & w^6 & w^1 & w^4 & w^7 & w^2 & w^5\\
1 & w^4 & 1 & w^4 & 1 & w^4 & 1 & w^4\\
1 & w^5 & w^2 & w^7 & w^4 & w^1 & w^6 & w^3\\
1 & w^6 & w^4 & w^2 & 1 & w^6 & w^4 & w^2\\
1 & w^7 & w^6 & w^5 & w^4 & w^3 & w^2 & w^1
\end{bmatrix}
\begin{bmatrix} x(0)\\ x(1)\\ x(2)\\ x(3)\\ x(4)\\ x(5)\\ x(6)\\ x(7) \end{bmatrix}$$
1D array for DFT for N=8
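As a software reference for the matrix-vector form y = Wx, the following minimal sketch builds W with entries w^(nk mod N) and multiplies it by x. The names `dft_matrix` and `direct_dft` are illustrative; the point is only to make the N² multiply-accumulate cost visible.

```python
import cmath


def dft_matrix(N):
    """Return the N x N DFT matrix with entries w**((n*k) % N), w = exp(-j*2*pi/N)."""
    w = cmath.exp(-2j * cmath.pi / N)
    return [[w ** ((n * k) % N) for n in range(N)] for k in range(N)]


def direct_dft(x):
    """Compute y = W x directly: N multiply-accumulates for each of the N outputs."""
    N = len(x)
    W = dft_matrix(N)
    return [sum(W[k][n] * x[n] for n in range(N)) for k in range(N)]
```

For example, `direct_dft([1, 0, 0, 0, 0, 0, 0, 0])` returns eight values equal to 1, because the first column of W is all ones.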
Complex multiplications
xa  u
( xr  jxi )(ar  jai )  ( xr ar  xi ai )  j ( xi ar  xr ai )
 [ xr (ar  ai )  ( xr  xi )ai ]  j[ xi (ar  ai )  ( xr  xi )ai ]
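The regrouped form shares the term (xr + xi)·ai between the real and imaginary parts, so one complex product needs three real multiplications instead of four. A minimal sketch (the function name is illustrative):

```python
def complex_mul_3(xr, xi, ar, ai):
    """Multiply (xr + j*xi) by (ar + j*ai) with three real multiplications."""
    shared = (xr + xi) * ai          # multiplication 1, used in both parts
    ur = xr * (ar + ai) - shared     # multiplication 2 -> real part
    ui = xi * (ar - ai) + shared     # multiplication 3 -> imaginary part
    return ur, ui
```

When a = ar + j·ai is a fixed twiddle factor, the sums (ar + ai) and (ar - ai) can be precomputed, so the saving in multipliers costs only three additions per product at run time.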
Fast DFT
• Fast DFT (Discrete Fourier Transform) algorithm
– Cooley-Tukey decomposition (1965)
• Radix-2 Decimation-in-Time (DIT) or Decimation-in-Frequency (DIF)
• Divide the problem into two interleaved halves at each recursive stage
• Radix-2 decomposition first computes the transform of the even-indexed samples x0, x2, …, xn-2, then that of the odd-indexed samples x1, x3, …, xn-1, and then combines these two results.
– The sequence can be decomposed recursively to reduce the overall runtime to O(n log n)
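The even/odd split described above can be sketched as a short recursive routine. This is a plain software illustration of radix-2 decimation-in-time, assuming the input length is a power of two; the name `fft_dit` is illustrative.

```python
import cmath


def fft_dit(x):
    """Recursive radix-2 decimation-in-time FFT (len(x) must be a power of two)."""
    N = len(x)
    if N == 1:
        return list(x)
    even = fft_dit(x[0::2])   # transform of x0, x2, ..., x(N-2)
    odd = fft_dit(x[1::2])    # transform of x1, x3, ..., x(N-1)
    out = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * odd[k]   # W_N^k * odd[k]
        out[k] = even[k] + t
        out[k + N // 2] = even[k] - t                    # uses W_N^(k+N/2) = -W_N^k
    return out
```

Each level of recursion does N/2 twiddle multiplications and there are log2 N levels, which is where the O(n log n) bound comes from.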
Radix-2 DIF DFT
$$Y(k) = \sum_{n=0}^{N/2-1} W_N^{nk}\,x(n) + W_N^{(N/2)k} \sum_{n=0}^{N/2-1} W_N^{nk}\,x(n+N/2)$$
Since $W_N^{N/2}$ corresponds to a rotation of 180°, the factor of the second sum can be reduced further: $W_N^{(N/2)k} = (-1)^k$. We have
$$Y(k) = \sum_{n=0}^{N/2-1} W_N^{nk}\,[x(n) + (-1)^k\,x(n+N/2)]$$
The division of k into even and odd values leads to the following:
$$Y(2k) = \sum_{n=0}^{N/2-1} [x(n) + x(n+N/2)]\,W_N^{2nk}$$
$$Y(2k+1) = \sum_{n=0}^{N/2-1} [x(n) - x(n+N/2)]\,W_N^{n}\,W_N^{2nk}, \qquad k = 0,1,\dots,N/2-1$$
With
$$x'_0(n) = x(n) + x(n+N/2), \qquad x'_1(n) = [x(n) - x(n+N/2)]\,W_N^{n}$$
this becomes
$$Y(2k) = \sum_{n=0}^{N/2-1} x'_0(n)\,W_N^{2nk}, \qquad Y(2k+1) = \sum_{n=0}^{N/2-1} x'_1(n)\,W_N^{2nk}$$
[Figure: Butterfly operation. Inputs x(n) and x(n+N/2); outputs x'0(n) and x'1(n); the lower branch is subtracted (-1) and then multiplied by W_N^n.]
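The butterfly above maps directly onto a recursive decimation-in-frequency routine: one butterfly per pair (x(n), x(n+N/2)), then two half-length transforms that produce the even and odd outputs. A minimal sketch, assuming a power-of-two length; `dif_butterfly` and `fft_dif` are illustrative names.

```python
import cmath


def dif_butterfly(a, b, w):
    """Radix-2 DIF butterfly: returns (a + b, (a - b) * w)."""
    return a + b, (a - b) * w


def fft_dif(x):
    """Recursive radix-2 decimation-in-frequency FFT, output in natural order."""
    N = len(x)
    if N == 1:
        return list(x)
    half = N // 2
    x0, x1 = [], []
    for n in range(half):
        w = cmath.exp(-2j * cmath.pi * n / N)        # W_N^n
        s, d = dif_butterfly(x[n], x[n + half], w)
        x0.append(s)                                  # x'_0(n)
        x1.append(d)                                  # x'_1(n)
    out = [0j] * N
    out[0::2] = fft_dif(x0)                           # Y(2k)
    out[1::2] = fft_dif(x1)                           # Y(2k+1)
    return out
```

The interleaving in the last two assignments is what a hardware flow graph skips: without it, the outputs simply appear in bit-reversed order.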
Radix-2 decomposition of 8-point
FFT
[Figure: signal flow graph of the radix-2 8-point FFT. Inputs x(0) to x(7), three butterfly stages with -1 branches and twiddle factors W0 to W3, outputs y(0) to y(7) in bit-reversed order.]
Implementation of Radix 2 FFT
• Two extreme methods
• Reuse single Butterfly
– Slower
– Smaller area
– More complicated control
• Fully multi-stage straight implementation
– Faster
– Larger area
– More regular control
• Trade-off between the two ends based on
– Speed, area, power
Comparison of calculation
        MUL                        ADD
DFT     (N-2)^2                    N^2
FFT     (N/2)log2 N - (N-1)        N log2 N
[Figure: butterfly operation hardware, with inputs x(n) and x(n+N/2), twiddle factor W_N^n, and outputs x'0(n) and x'1(n)]
Data transport
• One problem with the FFT is its irregular data transport.
If the butterfly PEs are arranged so that the PEs with lower exponents of W come first in each stage, the communication networks between stages become identical (perfect shuffle).
Conventional single butterfly FFT implementation
Strong speed limitation
Large storage area needed for intermediate results (N complex words)
If the memory is not partitioned, the number of R/W accesses needed to perform the FFT creates a bottleneck
An N-point FFT requires (N/r) log_r N radix-r butterfly computations and 2N log_r N R/W RAM accesses
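To get a feel for the bottleneck, the two counts quoted above can be evaluated for a concrete size. The helper below is only a sketch; its name and return format are illustrative.

```python
def single_butterfly_cost(N, r=2):
    """Return (butterfly_count, ram_accesses) for an N-point radix-r FFT.

    butterfly_count = (N / r) * log_r(N)
    ram_accesses    = 2 * N * log_r(N)
    """
    stages, size = 0, 1
    while size < N:
        size *= r
        stages += 1
    if size != N:
        raise ValueError("N must be a power of r")
    return (N // r) * stages, 2 * N * stages
```

For the 8K-point DTV example with radix 2, `single_butterfly_cost(8192)` gives 53248 butterflies and 212992 RAM accesses per transform. Meeting the 1 ms symbol time with a single butterfly would therefore need on the order of 2 × 10^8 memory accesses per second.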
Single-stage (1-D) implementation: horizontal projection
• Horizontal projection: provide PEs for a single stage
• Use only N/2 PEs, i.e. one stage only
– Reduces throughput by a factor of log2 N compared with a 2-D array.
– Need to take care of the complex communication structure
PEs do not have fixed coefficients; the coefficients need to change after each cycle, and the global communication network is a disadvantage
Single-stage (1-D) implementation: horizontal projection
• Pipelining the PEs does not directly increase the throughput of this architecture, since the results of the current processing step are required for the next one.
• However, sequential data blocks of length N can be processed independently of one another, so several data blocks can be processed by interleaving
• This requires a larger number of registers
Single-stage (1-D) implementation: horizontal projection
• If N is large, we cannot implement the full number of PEs.
• Project the N/2 butterfly PEs onto M PEs, where M is also a power of 2 and M < N/2
• Special registers for input data, intermediate results and result data are required.
– The registers cyclically read and write a particular sequence of 2M complex data
Single-stage (1-D) implementation
Vertical projection
• Vertical projection: one PE for each stage (total number = log2 N PEs)
• Circuitry is needed between PEs to prepare the correct data input
• From stage to stage, the length of the sequence to which the FFT is applied is halved.
• Given that the previous stage led to a DFT of length 2n, in accordance with the perfect shuffle the sequence of length 2n must be halved, and the 1st and (n+1)th values must be fed to the following PE. Then the 2nd and (n+2)th values are fed to it, and so on.
• Hence the sequence must be delayed by n clock cycles in accordance with the position of the midpoint:
Data formatting/sorting for Vertical projection
• The block u_{n-1}, …, u_0 must be delayed by n clock cycles.
• When u_n is available, the values from the stream u must be fed to the new lower stream v'. The values of u are input in parallel into the next butterfly stage for n clock cycles.
• So the values of v are fed in parallel to the next butterfly PE for n clock cycles; v_{n-1}, …, v_0 are delayed by 2n cycles and v_{2n-1}, …, v_n are delayed by n cycles.
Data formatting/sorting for Vertical
projection
• Special circuit is necessary for the data input of the 1st
stage.
• The incoming data stream of N data is divided into 2 parts of N/2 data. The clock rate is hence halved. We need a demultiplexer followed by a FIFO register
Overall architecture of the linear FFT array based on butterfly PEs and delay commutators
Consists of N PEs, with delay commutators located between the PEs.
Due to the continued halving, control signals are extracted
using frequency dividers
Higher radix FFT
• Radix-4 DIF algorithm
$$X(k) = \sum_{n=0}^{N-1} x(n)\,W_N^{kn}$$
$$= \sum_{n=0}^{N/4-1} x(n)\,W_N^{kn} + \sum_{n=N/4}^{N/2-1} x(n)\,W_N^{kn} + \sum_{n=N/2}^{3N/4-1} x(n)\,W_N^{kn} + \sum_{n=3N/4}^{N-1} x(n)\,W_N^{kn}$$
$$= \sum_{n=0}^{N/4-1} x(n)\,W_N^{kn} + W_N^{kN/4}\sum_{n=0}^{N/4-1} x\!\left(n+\tfrac{N}{4}\right)W_N^{kn} + W_N^{kN/2}\sum_{n=0}^{N/4-1} x\!\left(n+\tfrac{N}{2}\right)W_N^{kn} + W_N^{3kN/4}\sum_{n=0}^{N/4-1} x\!\left(n+\tfrac{3N}{4}\right)W_N^{kn}$$
We have
$$W_N^{kN/4} = (-j)^k, \qquad W_N^{kN/2} = (-1)^k, \qquad W_N^{3kN/4} = (j)^k$$
Thus
$$X(k) = \sum_{n=0}^{N/4-1}\left[x(n) + (-j)^k\,x\!\left(n+\tfrac{N}{4}\right) + (-1)^k\,x\!\left(n+\tfrac{N}{2}\right) + (j)^k\,x\!\left(n+\tfrac{3N}{4}\right)\right]W_N^{nk}$$
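Radix-4 DIF algorithm
$$X(4k) = \sum_{n=0}^{N/4-1}\left[x(n) + x\!\left(n+\tfrac{N}{4}\right) + x\!\left(n+\tfrac{N}{2}\right) + x\!\left(n+\tfrac{3N}{4}\right)\right]W_N^{0}\,W_{N/4}^{kn}$$
$$X(4k+1) = \sum_{n=0}^{N/4-1}\left[x(n) - j\,x\!\left(n+\tfrac{N}{4}\right) - x\!\left(n+\tfrac{N}{2}\right) + j\,x\!\left(n+\tfrac{3N}{4}\right)\right]W_N^{n}\,W_{N/4}^{kn}$$
$$X(4k+2) = \sum_{n=0}^{N/4-1}\left[x(n) - x\!\left(n+\tfrac{N}{4}\right) + x\!\left(n+\tfrac{N}{2}\right) - x\!\left(n+\tfrac{3N}{4}\right)\right]W_N^{2n}\,W_{N/4}^{kn}$$
$$X(4k+3) = \sum_{n=0}^{N/4-1}\left[x(n) + j\,x\!\left(n+\tfrac{N}{4}\right) - x\!\left(n+\tfrac{N}{2}\right) - j\,x\!\left(n+\tfrac{3N}{4}\right)\right]W_N^{3n}\,W_{N/4}^{kn}$$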
• Butterfly of Radix-4 Algorithm
Radix-4 Signal flow graph
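The four output branches X(4k), X(4k+1), X(4k+2), X(4k+3) can be sketched in software as one multiplier-free 4-point butterfly followed by the twiddle factors W_N^(ln) and four quarter-length transforms. This is a minimal illustration, assuming the length is a power of 4; the function names are not from the lecture.

```python
import cmath


def radix4_butterfly(a, b, c, d):
    """4-point DFT of (a, b, c, d); only additions and multiplications by +-1, +-j."""
    return (a + b + c + d,
            a - 1j * b - c + 1j * d,
            a - b + c - d,
            a + 1j * b - c - 1j * d)


def fft_radix4_dif(x):
    """Recursive radix-4 DIF FFT, natural-order output (len(x) must be a power of 4)."""
    N = len(x)
    if N == 1:
        return list(x)
    if N % 4:
        raise ValueError("length must be a power of 4")
    q = N // 4
    branches = [[], [], [], []]
    for n in range(q):
        outs = radix4_butterfly(x[n], x[n + q], x[n + 2 * q], x[n + 3 * q])
        for l in range(4):
            # Twiddle factor W_N^(l*n) for branch l (trivial for l = 0).
            branches[l].append(outs[l] * cmath.exp(-2j * cmath.pi * l * n / N))
    out = [0j] * N
    for l in range(4):
        out[l::4] = fft_radix4_dif(branches[l])   # X(4k + l)
    return out
```

Only three of the four branches need a real multiplier per butterfly, which is the source of the lower multiplication count compared with radix-2.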
Higher radix FFT
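• Radix-8 algorithm
$$A(8k+l) = \sum_{n=0}^{N-1} x(n)\,W_N^{(8k+l)n} = \sum_{m=0}^{7}\sum_{n=0}^{N/8-1} x\!\left(n+\tfrac{mN}{8}\right)W_N^{(8k+l)(mN/8+n)}$$
$$= \sum_{n=0}^{N/8-1}\left[\sum_{m=0}^{7} x\!\left(n+\tfrac{mN}{8}\right)W_8^{lm}\right]W_N^{nl}\,W_{N/8}^{nk}$$
$$= \sum_{n=0}^{N/8-1}\Big\{\left[x(n) + x\!\left(n+\tfrac{2N}{8}\right)W_4^{l} + x\!\left(n+\tfrac{4N}{8}\right)W_4^{2l} + x\!\left(n+\tfrac{6N}{8}\right)W_4^{3l}\right]$$
$$\quad + \left[x\!\left(n+\tfrac{N}{8}\right) + x\!\left(n+\tfrac{3N}{8}\right)W_4^{l} + x\!\left(n+\tfrac{5N}{8}\right)W_4^{2l} + x\!\left(n+\tfrac{7N}{8}\right)W_4^{3l}\right]W_8^{l}\Big\}\,W_N^{nl}\,W_{N/8}^{nk}$$
$$l = 0,1,\dots,7; \qquad k = 0,1,\dots,N/8-1$$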
Some pipeline FFT Processor
Architecture
• Assume the input sequence is in normal order and the output is allowed to be in digit-reversed (radix-2 or radix-4) order.
• Assume a DIF type of decomposition
• Here we assume the additive butterfly has been separated from the multiplier, to show the hardware requirements distinctly
Radix-2 Multi-path Delay
Commutator (R2MDC)
N=16
The input sequence is broken into 2 parallel data streams flowing forward, with the correct “distance” between data elements entering the butterfly scheduled by proper delays
# of multipliers: log2 N - 2
# of butterflies: log2 N
# of registers: (3/2)N - 2
Radix-2 Single-path Delay
Feedback (R2SDF)
N=16
The butterfly output is stored in feedback shift registers. A single data stream goes through the multiplier at every stage.
# of multipliers: log2 N - 2
# of butterflies: log2 N
# of registers: N - 1
Radix-4 Single-path Delay
Feedback (R4SDF)
N=256
Uses radix-4 and CORDIC iterations. Utilization of the multipliers is increased to 75% because 3 out of the 4 radix-4 butterfly outputs are stored. Utilization of the radix-4 butterfly (which is more complicated than the radix-2 butterfly, containing at least 8 complex adders) drops to 25%.
# of multipliers: log4 N - 1
# of butterflies: log4 N
# of registers: N - 1
Radix-4 Multi-path Delay
Commutator (R4MDC)
N=256
Utilization rate: butterflies 25%, multipliers 25%
# of multipliers: 3 log4 N
# of butterflies: log4 N
# of registers: (5/2)N - 4
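The resource formulas quoted on the four architecture slides can be tabulated for a given N. The sketch below simply evaluates those formulas as written on the slides; the function name and dictionary layout are illustrative, and N is assumed to be a power of 4 so that all four architectures apply.

```python
def pipeline_fft_resources(N):
    """Evaluate the slide formulas for multipliers, butterflies and registers."""
    log2n = N.bit_length() - 1
    if N < 4 or (1 << log2n) != N or log2n % 2:
        raise ValueError("N must be a power of 4")
    log4n = log2n // 2
    return {
        "R2MDC": {"multipliers": log2n - 2, "butterflies": log2n,
                  "registers": 3 * N // 2 - 2},
        "R2SDF": {"multipliers": log2n - 2, "butterflies": log2n,
                  "registers": N - 1},
        "R4SDF": {"multipliers": log4n - 1, "butterflies": log4n,
                  "registers": N - 1},
        "R4MDC": {"multipliers": 3 * log4n, "butterflies": log4n,
                  "registers": 5 * N // 2 - 4},
    }
```

For N = 256 this gives, for example, 255 registers for both single-path delay-feedback designs against 382 (R2MDC) and 636 (R4MDC) for the delay commutators.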
Some observations
• Delay-feedback architectures are more efficient than the corresponding delay-commutator architectures in terms of memory utilization, since the stored butterfly output can be used directly by the multipliers
• Single-path architectures based on the radix-4 algorithm have higher multiplier utilization, but radix-2 algorithms have simpler butterflies, which are better utilized.
Comparison
Radix / Speed
Low  ----------------------------------- High
Control Theme
Simple  ----------------------------------- Complex
Processing Ability / Unit
Low  ----------------------------------- High
Combine the advantages
→ Further decompose the high-radix PE
Radix-2^2 DIF FFT
• Optimal hardware
– Same number of non-trivial multiplications, at the same positions in the SFG, as radix-4 algorithms
– The same butterfly structure as radix-2 algorithms.
– Radix-2^2 DIF FFT (S. He, M. Torkelson, “A New Approach to Pipeline FFT Processor”, in Proceedings of IPPS, 1996, pp. 766-780.)
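Radix-2^2 DIF FFT
$$Y(k) = \sum_{n=0}^{N-1} x(n)\,W_N^{nk} \quad \text{for } k = 0,1,\dots,N-1$$
Apply a 3-dimensional linear index map
$$n = \left\langle \tfrac{N}{2}n_1 + \tfrac{N}{4}n_2 + n_3 \right\rangle_N \quad \text{where } n_1, n_2 \in \{0,1\} \text{ and } n_3 \in \{0,\dots,N/4-1\}$$
$$k = \left\langle k_1 + 2k_2 + 4k_3 \right\rangle_N$$
The Common Factor Algorithm (CFA) has the form
$$Y(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{N/4-1}\sum_{n_2=0}^{1}\sum_{n_1=0}^{1} x\!\left(\tfrac{N}{2}n_1 + \tfrac{N}{4}n_2 + n_3\right) W_N^{\left(\frac{N}{2}n_1 + \frac{N}{4}n_2 + n_3\right)(k_1 + 2k_2 + 4k_3)}$$
Carrying out the summation over $n_1$:
$$= \sum_{n_3=0}^{N/4-1}\sum_{n_2=0}^{1}\left\{ B_{N/2}^{k_1}\!\left(\tfrac{N}{4}n_2 + n_3\right) W_N^{\left(\frac{N}{4}n_2 + n_3\right)k_1} \right\} W_N^{\left(\frac{N}{4}n_2 + n_3\right)(2k_2 + 4k_3)}$$
where
$$B_{N/2}^{k_1}\!\left(\tfrac{N}{4}n_2 + n_3\right) = x\!\left(\tfrac{N}{4}n_2 + n_3\right) + (-1)^{k_1}\,x\!\left(\tfrac{N}{4}n_2 + n_3 + \tfrac{N}{2}\right)$$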
Radix-2^2 DIF FFT
• Proceed to the second step of decomposition on the remaining DFT coefficients, including the “twiddle factor”, to exploit the exceptional values in multiplication before the next butterfly is constructed.
$$W_N^{\left(\frac{N}{4}n_2 + n_3\right)(k_1 + 2k_2 + 4k_3)} = W_N^{N n_2 k_3}\,W_N^{\frac{N}{4}n_2(k_1 + 2k_2)}\,W_N^{n_3(k_1 + 2k_2)}\,W_N^{4 n_3 k_3}$$
$$= (-j)^{n_2(k_1 + 2k_2)}\,W_N^{n_3(k_1 + 2k_2)}\,W_N^{4 n_3 k_3}$$
After substituting and simplification, we have
$$Y(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{N/4-1}\left[H(k_1, k_2, n_3)\,W_N^{n_3(k_1 + 2k_2)}\right]W_{N/4}^{n_3 k_3}$$
where
$$H(k_1, k_2, n_3) = \underbrace{\overbrace{\left[x(n_3) + (-1)^{k_1}x\!\left(n_3 + \tfrac{N}{2}\right)\right]}^{\text{BF I}} + (-j)^{(k_1 + 2k_2)}\overbrace{\left[x\!\left(n_3 + \tfrac{N}{4}\right) + (-1)^{k_1}x\!\left(n_3 + \tfrac{3N}{4}\right)\right]}^{\text{BF I}}}_{\text{BF II}}$$
Butterfly with decomposed twiddle factors
Full multipliers are required to compute the product of the decomposed twiddle
factor. The order of the twiddle factors is different from that of radix-4 algorithm.
Complete Radix-2^2 DIF FFT
• Apply the CFA recursively to the remaining DFTs of length N/4.
[Figure: complete radix-2^2 DIF FFT structure; block labels BF4, BF2 I, BF2 II and Control]
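The recursion can be sketched in software directly from the expression for H(k1, k2, n3): BF I forms the sums and differences over N/2, BF II combines them with the trivial factor (-j)^(k1+2k2), and the only full multiplication is by W_N^(n3(k1+2k2)). This is a minimal sketch, assuming the length is a power of 4; the function name is illustrative and the code models the arithmetic only, not the pipeline timing.

```python
import cmath


def fft_radix22_dif(x):
    """Recursive radix-2^2 DIF FFT, natural-order output (len(x) a power of 4)."""
    N = len(x)
    if N == 1:
        return list(x)
    if N % 4:
        raise ValueError("length must be a power of 4")
    q = N // 4
    out = [0j] * N
    for k1 in (0, 1):
        for k2 in (0, 1):
            seq = []
            for n3 in range(q):
                # BF I: radix-2 butterflies over a distance of N/2.
                top = x[n3] + (-1) ** k1 * x[n3 + 2 * q]
                bot = x[n3 + q] + (-1) ** k1 * x[n3 + 3 * q]
                # BF II: combine with the trivial factor (-j)^(k1 + 2*k2).
                h = top + (-1j) ** (k1 + 2 * k2) * bot
                # Full multiplier: twiddle factor W_N^(n3*(k1 + 2*k2)).
                seq.append(h * cmath.exp(-2j * cmath.pi * n3 * (k1 + 2 * k2) / N))
            # Remaining DFT of length N/4 over n3 yields Y(k1 + 2*k2 + 4*k3).
            out[k1 + 2 * k2::4] = fft_radix22_dif(seq)
    return out
```

Every butterfly here is a two-input add/subtract, as in radix-2, while a non-trivial multiplication occurs only once per two butterfly stages, as in radix-4.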
Radix-2^2 Single-path Delay Feedback (R2^2SDF)
2 types of butterflies: one identical to that in R2SDF, the other also contains the logic to implement the trivial twiddle factor multiplication
A log2 N-bit binary counter serves two purposes:
- Synchronization controller
- Address generation counter for twiddle factor reading in each stage
Radix-2^2 Single-path Delay Feedback (R2^2SDF)
• Structure for BF2I and BF2II
[Figure: structure of the BF2I and BF2II butterflies]
Operation scheduling
First N/2 cycles: the 2-to-1 mux in BF2I switches to “0” and the butterfly is idle. Input data is directed to the shift registers until they are filled.
Next N/2 cycles: the mux turns to “1”, and the butterfly computes a 2-point DFT with the incoming data and the data stored in the shift registers:
$$Z1(n) = x(n) + x(n+N/2)$$
$$Z1(n+N/2) = x(n) - x(n+N/2), \qquad 0 \le n < N/2$$
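The schedule can be mimicked in software with a FIFO standing in for the N/2-deep feedback shift register. This is a behavioural sketch for a single block of N samples; the function name is illustrative, and it assumes (consistent with the delay-feedback description earlier in the lecture) that the sum is passed on immediately while the difference is written back into the shift register and drained during the following N/2 cycles.

```python
from collections import deque


def bf2i_stage(samples):
    """Simulate the first single-path delay-feedback butterfly on one block."""
    N = len(samples)
    half = N // 2
    fifo = deque()
    out = []
    # First N/2 cycles: mux at "0", butterfly idle, inputs fill the shift register.
    for n in range(half):
        fifo.append(samples[n])
    # Next N/2 cycles: mux at "1", 2-point DFT of stored and incoming data.
    for n in range(half, N):
        stored = fifo.popleft()          # x(n - N/2)
        incoming = samples[n]            # x(n)
        out.append(stored + incoming)    # Z1(n), passed on to the next stage
        fifo.append(stored - incoming)   # Z1(n + N/2), fed back (assumption)
    # Following N/2 cycles: the stored differences leave the shift register.
    out.extend(fifo)
    return out
```

For the block `[1, 2, 3, 4]` the stage emits the two sums 4 and 6 followed by the two differences -2 and -2, which is the order a downstream stage of half the length expects.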