Pipelined Parallel FFT Architectures Using Folding Transformation

Mr. R. V. Talathi
Department of E&TC Engineering, VLSI & Embedded Systems
PVPIT, Pune, India
rohitvtalathi@gmail.com

Prof. S. M. Kulkarni
Department of E&TC Engineering
PVPIT, Bavdhan, Pune, India
smk_1@rediffmail.com
Abstract - In this paper, an approach to developing parallel pipelined architectures for the Fast Fourier Transform (FFT) is presented. The folding transformation and register minimization techniques are used to design the FFT architectures. Novel parallel pipelined architectures for the computation of the FFT are derived. The proposed architectures exploit redundancy in the computation of FFT samples to reduce the hardware complexity. A comparison is drawn between the proposed designs and previous architectures. The power consumption can be reduced by up to 37% and 50% in the 2-parallel CFFT and RFFT architectures, respectively.
Keywords - Fast Fourier Transform (FFT), folding, register minimization, low power.
1. INTRODUCTION
The Discrete Fourier Transform (DFT) is one of the most important tools in the field of digital signal processing. Several Fast Fourier Transform (FFT) algorithms have been developed over the years to reduce its computational complexity. The FFT plays a critical role in modern digital communication systems such as Digital Video Broadcasting (DVB) and Orthogonal Frequency Division Multiplexing (OFDM). Algorithms such as radix-4 [1], split-radix [2], and radix-2^2 [3] have been developed based on the basic radix-2 FFT approach. One of the most classical approaches for pipelined implementation of the radix-2 FFT is the radix-2 multi-path delay commutator (R2MDC) [4]. A more efficient usage of the storage buffer in R2MDC leads to the radix-2 single-path delay feedback (R2SDF) [5] architecture with reduced memory.
In addition, most of these hardware architectures are not fully utilized and require high hardware complexity. In the era of high-speed digital communications, high-throughput and low-power designs are essential to meet the speed and power requirements while keeping the hardware overhead to a minimum. In this paper, a new approach to designing architectures from FFT flow graphs is presented. The folding transformation [6] and register minimization techniques [7], [8] are used to derive several known FFT architectures.
2. FOLDING TRANSFORMATION
In the folding transformation, many butterflies in the same column can be mapped to one butterfly unit. If the FFT size is N, a folding factor of N/2 leads to a 2-parallel architecture, while a folding factor of N/4 leads to a 4-parallel architecture in which four samples are processed in the same clock cycle. Different choices of folding sets lead to a family of FFT architectures [9].
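As a simple numeric illustration of this relationship, the short Python sketch below (the helper folded_parameters is ours, not part of the paper) computes, for a given FFT size N and folding factor, how many butterfly units each column needs and how many samples are processed per clock cycle.

from math import log2

def folded_parameters(N, folding_factor):
    # Radix-2 FFT: log2(N) butterfly columns with N/2 butterflies each.
    stages = int(log2(N))
    butterflies_per_stage = N // 2
    # Folding maps 'folding_factor' butterflies onto one butterfly unit.
    units_per_stage = butterflies_per_stage // folding_factor
    # Each butterfly unit consumes two new samples per clock cycle.
    samples_per_cycle = 2 * units_per_stage
    return stages, units_per_stage, samples_per_cycle

print(folded_parameters(8, 8 // 2))   # (3, 1, 2): one unit per column, 2-parallel
print(folded_parameters(8, 8 // 4))   # (3, 2, 4): two units per column, 4-parallel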
2.1 FFT Architectures Design Techniques
In this section, the folding transformation method and register minimization technique used to derive several known FFT architectures are illustrated in general [9]. The process is described using an 8-point radix-2 DIF FFT as an example; it can be extended to other radices in a similar fashion. Fig. 1 shows the flow graph of a radix-2 8-point DIF FFT [9].
Fig. 1 Data flow graph (DFG) of a radix-2 8-point DIF FFT
This algorithm can be represented as a data flow graph (DFG), as shown in Fig. 1. The nodes in the DFG represent tasks or computations; in this case, all the nodes represent the butterfly computations of the radix-2 FFT algorithm [11]. To transform the DFG, a folding set is required, which is an ordered set of operations executed by the same functional unit. Each folding set contains K entries, some of which may be null operations; K is called the folding factor, i.e., the number of operations folded onto a single functional unit.
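For reference, the arithmetic performed by the DFG nodes can be written out directly. The Python sketch below (the helpers butterfly and dif_fft8 are ours, not from the paper) evaluates the flow graph of Fig. 1 column by column; it models only the butterfly arithmetic, not the folded hardware.

import cmath

def butterfly(a, b, w):
    # Radix-2 DIF butterfly: sum on the upper output,
    # difference scaled by the twiddle factor w on the lower output.
    return a + b, (a - b) * w

def dif_fft8(x):
    # 8-point radix-2 DIF FFT evaluated column by column as in Fig. 1.
    N = 8
    W = [cmath.exp(-2j * cmath.pi * k / N) for k in range(N)]

    # First column (nodes A0..A3): butterflies of span N/2 = 4.
    s1 = list(x)
    for k in range(4):
        s1[k], s1[k + 4] = butterfly(x[k], x[k + 4], W[k])

    # Second column (nodes B0..B3): span 2, twiddle factors W^0 and W^2.
    s2 = list(s1)
    for half in (0, 4):
        for k in range(2):
            s2[half + k], s2[half + k + 2] = butterfly(
                s1[half + k], s1[half + k + 2], W[2 * k])

    # Third column (nodes C0..C3): span 1, trivial twiddle factor.
    s3 = list(s2)
    for pair in range(0, N, 2):
        s3[pair], s3[pair + 1] = butterfly(s2[pair], s2[pair + 1], 1)

    # DIF produces the outputs in bit-reversed order.
    bitrev = [0, 4, 2, 6, 1, 5, 3, 7]
    return [s3[i] for i in bitrev]

# Example: the FFT of an impulse is all ones.
print(dif_fft8([1, 0, 0, 0, 0, 0, 0, 0]))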
3. FFT DESIGN TECHNIQUES
Fig. 2 Pipelined DFG of an 8-point DIF FFT as a preprocessing step for folding
Fig. 3 shows the block diagram of the FFT folding design techniques.
For example, consider the folding set A = {ϕ, ϕ, ϕ, ϕ, A0, A1, A2, A3} for K = 8. The operation A0 belongs to the folding set A with folding order 4. The functional unit executes the operations A0, A1, A2, and A3 at the respective time instances and is idle during the null operations. The systematic folding technique is used to derive the 8-point FFT architecture. Consider an edge e connecting the nodes U and V with w(e) delays. The folding equation for the edge e is

DF (U → V) = K w(e) - PU + v - u … (1)

where PU is the number of pipeline stages in the hardware unit that executes the node U, and u and v are the folding orders of U and V. Using the folding sets, the folding equations are derived; without pipelining some of them may have negative delays, whereas with pipelining or retiming all delays become non-negative. Consider folding of the DFG in Fig. 2 with the folding sets

A = {ϕ, ϕ, ϕ, ϕ, A0, A1, A2, A3}
B = {B2, B3, ϕ, ϕ, ϕ, ϕ, B0, B1}
C = {C1, C2, C3, ϕ, ϕ, ϕ, ϕ, C0}.
Assume that the butterfly operations do not have any pipeline stages, i.e., PA = PB = PC = 0. Retiming and/or pipelining [10] can be used to either satisfy DF (U → V) ≥ 0 or determine that the folding sets are not feasible. Negative delays can be observed on some edges. The folding equations are

DF (A0 → B0) = 2
DF (A0 → B2) = -4
DF (A1 → B1) = 2
DF (A1 → B3) = -4
DF (A2 → B0) = 0
DF (A2 → B2) = -6
DF (A3 → B1) = 0
DF (A3 → B3) = -6
DF (B0 → C0) = 1
DF (B0 → C1) = -6
DF (B1 → C0) = 0
DF (B1 → C1) = -7
DF (B2 → C2) = 1
DF (B2 → C3) = 2
DF (B3 → C2) = 0
DF (B3 → C3) = 1 … (2)
The DFG can be pipelined, as shown in Fig. 2, to ensure that the folded hardware has non-negative folded delays. The folded delays for the pipelined DFG are

DF (A0 → B0) = 2
DF (A0 → B2) = 4
DF (A1 → B1) = 2
DF (A1 → B3) = 4
DF (A2 → B0) = 0
DF (A2 → B2) = 2
DF (A3 → B1) = 0
DF (A3 → B3) = 2
DF (B0 → C0) = 1
DF (B0 → C1) = 2
DF (B1 → C0) = 0
DF (B1 → C1) = 1
DF (B2 → C2) = 1
DF (B2 → C3) = 2
DF (B3 → C2) = 0
DF (B3 → C3) = 1 … (3)
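The folding equation (1) can be checked mechanically. In the sketch below (the folding-order table and the helper folded_delay are ours), DF(U → V) = K w(e) - PU + v - u is evaluated for every edge of the DFG; with w(e) = 0 it reproduces the negative delays of (2), and with one delay added on each offending edge it reproduces the non-negative delays of (3).

K = 8          # folding factor
P = 0          # butterfly units are assumed to have no pipeline stages

# Folding orders taken from A = {.,.,.,., A0,A1,A2,A3},
# B = {B2,B3,.,.,.,., B0,B1} and C = {C1,C2,C3,.,.,.,., C0}.
order = {
    "A0": 4, "A1": 5, "A2": 6, "A3": 7,
    "B0": 6, "B1": 7, "B2": 0, "B3": 1,
    "C0": 7, "C1": 0, "C2": 1, "C3": 2,
}

# Edges of the DFG in Fig. 1.
edges = [("A0", "B0"), ("A0", "B2"), ("A1", "B1"), ("A1", "B3"),
         ("A2", "B0"), ("A2", "B2"), ("A3", "B1"), ("A3", "B3"),
         ("B0", "C0"), ("B0", "C1"), ("B1", "C0"), ("B1", "C1"),
         ("B2", "C2"), ("B2", "C3"), ("B3", "C2"), ("B3", "C3")]

def folded_delay(u, v, w):
    return K * w - P + order[v] - order[u]

# Without pipelining (w(e) = 0) the negative delays of (2) appear.
for u, v in edges:
    print(f"DF({u} -> {v}) = {folded_delay(u, v, 0)}")

# Placing one delay on each edge with a negative folded delay makes all
# folded delays non-negative, giving the values listed in (3).
for u, v in edges:
    w = 1 if folded_delay(u, v, 0) < 0 else 0
    print(f"DF({u} -> {v}) = {folded_delay(u, v, w)}")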
Fig. 3 Block diagram of FFT design techniques
The technique used for register minimization is lifetime analysis, which analyzes the time at which a datum is produced (Tinput) and the time at which it is finally consumed (Toutput):

Tinput = u + PU … (4)
Toutput = u + PU + maxv {DF (U → V)} … (5)

where u is the folding order of U and PU is the number of pipelining stages in the functional unit that executes U. From (3), 24 registers are required to implement the folded architecture.
4. REGISTER MINIMIZATION TECHNIQUES
In lifetime analysis, the number of live variables at each time unit is computed, and the maximum number of live variables at any time unit is determined; this is the minimum number of registers required to implement the DSP program.
The lifetime analysis technique is used to design the folded architecture with the minimum possible number of registers. For example, in the current 8-point FFT design, consider the variables y0, y1, ..., y7, i.e., the outputs at the nodes A0, A1, A2, and A3, respectively. It takes 16 registers to synthesize these edges in the folded architecture. The minimum number of registers required to implement this DSP program is the maximum number of live variables at any time unit.
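Equations (4) and (5) can be evaluated directly for this example. The sketch below (variable names are ours) derives the lifetime of each of the eight outputs of A0-A3 from the folded delays in (3) and confirms the two figures quoted here: 16 registers for the straightforward folded implementation and a maximum of 4 simultaneously live variables.

P_A = 0                                          # no pipeline stages in the butterfly unit
produce = {"A0": 4, "A1": 5, "A2": 6, "A3": 7}   # folding orders of A0..A3
out_delays = {                                   # folded delays of each node's two output edges, from (3)
    "A0": [2, 4], "A1": [2, 4], "A2": [0, 2], "A3": [0, 2],
}

# One lifetime interval [Tinput, Toutput] per produced variable.
lifetimes = []
for node, t_in in produce.items():
    for d in out_delays[node]:
        lifetimes.append((t_in + P_A, t_in + P_A + d))

total_register_cycles = sum(t_out - t_in for t_in, t_out in lifetimes)
print(total_register_cycles)   # 16: registers in a straightforward implementation

# Minimum registers = maximum number of variables live at any time unit.
max_live = max(sum(1 for t_in, t_out in lifetimes if t_in <= t < t_out)
               for t in range(4, 10))
print(max_live)                # 4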
Node   Tinput   Toutput
y0     4        6
y1     5        7
y2     4        8
y3     5        9
y4     6        6
y5     7        7
y6     6        8
y7     7        9
Fig. 4 Linear lifetime table
Fig. 6 Register allocation table
The linear lifetime table and the lifetime chart for these variables are shown in Fig. 4 and Fig. 5. From the lifetime chart [4], it can be seen that the folded architecture requires only 4 registers, as opposed to 16 registers in a straightforward implementation. The next step is to perform forward-backward register allocation. From the allocation table in Fig. 6 and the folding equations, the final architecture in Fig. 7 can be synthesized; it is derived by minimizing the registers over all variables at once. The hardware utilization is only 50% in the derived architecture. The pipelined parallel FFT architectures in this paper are derived using this methodology.
Fig.5 Linear lifetime chart
Register allocation can be performed using an allocation table. The allocation scheme dictates how the variables are assigned to registers in the allocation table. An allocation table that uses the forward-backward scheme to allocate the data for a 3 x 3 matrix transposer is shown in Fig. 6 [6], [8]. The steps of the scheme are as follows (a simplified sketch is given after the list):
1) Determine the minimum number of registers using lifetime analysis.
2) Input each variable at the time step corresponding to the beginning of its lifetime.
3) Each variable is allocated in a forward manner until it is dead or it reaches the last register.
4) Since the allocation is periodic, the allocation of the current iteration also repeats itself in subsequent iterations.
5) For variables that reach the last register and are not yet dead, the remaining life period is calculated, and these variables are allocated to a register in a backward manner on a first-come, first-served basis.
6) Repeat steps 4 and 5 as required until the allocation is complete.
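The simplified sketch referred to above is given here. It is not the forward-backward allocation itself (which also produces the actual register-to-register moves recorded in Fig. 6); it is only a greedy interval binding, with names of our own choosing, used to confirm that the eight lifetimes of this example fit into 4 registers.

# Lifetimes (Tinput, Toutput) of y0..y7 from the linear lifetime table (Fig. 4).
lifetimes = [(4, 6), (5, 7), (4, 8), (5, 9), (6, 6), (7, 7), (6, 8), (7, 9)]

registers = []                      # per register: time at which it becomes free
for t_in, t_out in sorted(lifetimes):
    for i, free_at in enumerate(registers):
        if free_at <= t_in:         # reuse a register whose variable is dead
            registers[i] = t_out
            break
    else:                           # otherwise open a new register
        registers.append(t_out)

print(len(registers))               # 4, matching the lifetime chart of Fig. 5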
Fig.7 Folded Architecture
5. COMPARISON AND ANALYSIS
The hardware complexity of the architectures depends on the required number of multipliers, adders, and delay elements. Performance is measured by throughput. The number of multipliers required in the radix-2^4 architecture is lower than in previous designs. The proposed FFT architectures therefore lead to low hardware complexity.
Architecture     Multipliers       Adders         Delays         Throughput
R2MDC            2(log4 N - 1)     4 log4 N       2(3N/2 - 2)    1
R2SDF            2(log4 N - 1)     4 log2 N       2(N - 1)       1
R2^2SDF          log4 N - 1        4 log2 N       2(N - 1)       1
R2^3SDF          log8 N - 1        4 log2 N       2(N - 1)       1
Radix 2^3        2(log8 N - 1)     4 log2 N - 2   < 2N           4
Radix 2^4        2(log16 N - 1)    4 log2 N - 2   < 2N           4
Fig. 8 Comparison of pipelined FFT architectures for an N-point FFT
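As a quick numeric check of a few multiplier expressions from Fig. 8 as reconstructed above, the lines below (the helper log_base is ours) evaluate them for N = 256, which is a power of both 4 and 16, so the expressions are integers.

from math import log

def log_base(b, n):
    # Integer logarithm of n to base b (n is assumed to be a power of b).
    return round(log(n) / log(b))

N = 256
print(2 * (log_base(4, N) - 1))    # R2MDC / R2SDF : 6 complex multipliers
print(log_base(4, N) - 1)          # R2^2SDF       : 3 complex multipliers
print(2 * (log_base(16, N) - 1))   # Radix 2^4     : 2 complex multipliers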
6. RESULTS
6.1 Device Utilization Summary:
The folded FFT and the conventional (normal) FFT designs were implemented and simulated using Xilinx 13.2 software, targeting a Spartan-3 device. The device utilization report shows that the required power is lower, and it also provides the power information.
6.2. Timing Summary:
Timing constraints communicate all design requirements to the implementation tools. This also implies that all paths are covered by appropriate constraints. The considerations below explain the strategy for identifying and constraining the most common timing paths.
Minimum period: 10.666 ns (maximum frequency: 93.758 MHz)
Minimum input arrival time before clock: 5.224 ns
Maximum output required time after clock: 4.063 ns
Maximum combinational path delay: No path found
6.3. Place & Route Report:
This report confirms that the design was placed and routed to completion and that the timing constraints were achieved.
Number of External IOBs: 90
Number of External Input IOBs: 40
Number of External Input IBUFs: 40
Number of External Output IOBs: 50
6.4. Power Report:
The XPower Analyzer (XPA) tool performs power estimation post-implementation. It is the most accurate estimate since it reads the exact logic and routing resources used from the implemented design database. The summary power report provides different views for navigating the design: by clock domain, by type of resource, and by design hierarchy. XPA also allows the environment settings and design activity to be adjusted so that ways of reducing supply and thermal power consumption can be evaluated.
6.5. Schematic View of the Folded Transformation:
The schematic shows the folded architecture of the 256-point FFT. Because the folded architecture reduces the number of multipliers and adders, it requires less power.
These architectures are optimized for the case where the inputs are real-valued. For the RFFT, two different scheduling approaches have been proposed: one has less complex control logic, while the other has fewer delay elements. With the capability of processing two input samples in parallel, the operating frequency can be reduced by a factor of 2, which in turn reduces the power consumption by up to 50%. These architectures are therefore well suited to implantable applications. The real-valued FFT architectures are not fully utilized; future work will address FFT architectures for real-valued signals with full hardware utilization.
7. CONCLUSION
A novel four-parallel 256-point radix-2^8 FFT architecture has been developed using the proposed method. The hardware cost in terms of delay elements and complex adders, as well as the number of complex multipliers, is reduced by using a higher-radix FFT algorithm within the proposed approach. The throughput can be further increased by adding more pipeline stages, which is possible due to the feed-forward nature of the design. The power consumption is also reduced, and the proposed architectures have lower hardware complexity than previous architectures.
8. REFERENCES
[1] J. A. C. Bingham, "Multicarrier modulation for data transmission: an idea whose time has come," IEEE Communications Magazine, vol. 28, no. 5, pp. 5-14, May 1990.
[2] P. Duhamel, "Implementation of split-radix FFT algorithms for complex, real, and real-symmetric data," IEEE Trans. Acoust., Speech, Signal Process., vol. 34, no. 2, pp. 285-295, Apr. 1986.
[3] S. He and M. Torkelson, "A new approach to pipeline FFT processor," in Proc. of IPPS, 1996, pp. 766-770.
[4] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1975.
[5] E. H. Wold and A. M. Despain, "Pipeline and parallel-pipeline FFT processors for VLSI implementation," IEEE Trans. Comput., vol. C-33, no. 5, pp. 414-426, May 1984.
[6] K. K. Parhi, C. Y. Wang, and A. P. Brown, "Synthesis of control circuits in folded pipelined DSP architectures," IEEE J. Solid-State Circuits, vol. 27, no. 1, pp. 29-43, Jan. 1992.
[7] K. K. Parhi, "Systematic synthesis of DSP data format converters using lifetime analysis and forward-backward register allocation," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 39, no. 7, pp. 423-440, Jul. 1992.
[8] K. K. Parhi, "Calculation of minimum number of registers in arbitrary life time chart," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 41, no. 6, pp. 434-436, Jun. 1995.
[9] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1975.
[10] J. Palmer and B. Nelson, "A parallel FFT architecture for FPGAs," Lecture Notes in Computer Science, vol. 3203, pp. 948-953, 2004.
[11] R. Storn, "A novel radix-2 pipeline architecture for the computation of the DFT," in Proc. IEEE ISCAS, 1988, pp. 1899-1902.
[12] W. W. Smith and J. M. Smith, Handbook of Real-Time Fast Fourier Transforms. Piscataway, NJ: Wiley-IEEE Press, 1995.
[13] M. Ayinala, M. Brown, and K. K. Parhi, "Pipelined parallel FFT architectures via folding transformation," IEEE Transactions on VLSI Systems, vol. 20, no. 6, pp. 1068-1081, Jun. 2012.