An Efficient Pipelined FFT Processor for OFDM Communication Systems

International Journal of Engineering Trends and Technology (IJETT) – Volume 5 Number 6 - Nov 2013
ISSN: 2231-5381, http://www.ijettjournal.org
M. Viswanadh1, M. Harshavardhan Reddy2, Naveena Boppana3
1Assistant Professor, Dept. of ECE, MLRIT, Hyderabad, AP, India
2Assistant Professor, Dept. of ECE, MLRIT, Hyderabad, AP, India
3Associate Professor, Dept. of ECE, VVIT, Chevella, Hyderabad, AP, India
Abstract: Increasing speeds and complexity of wireless communication
systems have necessitated the progress and advancement of high performance
signal processing elements. Today's emerging technologies require fast
processing and efficient use of resources. These resources include power,
memory, and chip area. Ongoing research seeks to optimize resource usage as
well as performance. Design becomes a balance and compromise of flexibility,
performance, complexity, and cost. This paper will specifically address the
power-efficient design of an FFT processor as it relates to emerging OFDM
communications such as cognitive radio.
Keywords: Fast Fourier transform (FFT), orthogonal frequency-division
multiple access (OFDMA), very large scale integration (VLSI).
1. Introduction
Cognitive radio is a method of wireless communication by way of
dynamically adapting the transmission of multiple subcarriers to changing
conditions in the communication channels. These subcarriers are enabled by a
modulation scheme known as orthogonal frequency division multiplexing
(OFDM). OFDM modulation converts a high data rate signal into multiple lower
data rate signals for simultaneous transmission through numerous channels.
The Fast Fourier Transform (FFT) processor is the heart of OFDM that enables
its fast and efficient modulation of signals. The FFT algorithm is a fast
computation of the Discrete Fourier Transform (DFT), which is an essential
component of the modulation scheme used in OFDM. As the FFT processor is the
most computationally intensive component in OFDM communication, an improvement
in the power efficiency of this component can have great impacts on the
overall system. These impacts are significant considering the number of
mobile and remote communication devices that rely on limited
battery-powered operation. This work will serve as an exploration of current
FFT processor algorithms and architectures as well as optimization
techniques that aim to reduce the power consumption of these devices.
Ultra-Wideband (UWB) technology brings the convenience and mobility of
wireless high-speed communications to interconnects in devices throughout
the digital home and office [1]. Instead of a wired connection, this
technology enables a wireless connection for transmitting video, audio, and
other data with high data speed and low power consumption. In February 2002,
the Federal Communications Commission (FCC) in the USA issued the ruling
that UWB could be used for data communication. Since then, UWB has become a
hot research topic and plenty of research results have been published. The
Multiband-OFDM standard is one solution for UWB technology. A proposal for
the Multiband-OFDM UWB standard was published by the IEEE 802.15.3a study
group [12]. After IEEE 802.15.3a was withdrawn in the spring of 2006,
Multiband-OFDM has been controlled by ECMA International. In December 2007,
the second revised version of 'Standard ECMA-368: High Rate Ultra Wideband
PHY and MAC Standard' was released, which specified the physical layer (PHY)
and medium access control layer (MAC) of the UWB technology based on
Multiband-OFDM [5]. However, some key issues need to be solved for designing
a CMOS-based Multiband-OFDM UWB solution that supports the low-power
requirement. One of the issues focuses on its FFT (Fast Fourier Transform)
block, which accounts for 25% of the design complexity of the total digital
baseband transceiver [1]. Although many results have already been published
in this research area in the past few years [4], some key problems still exist
and need to be improved for the speed, area and power consumption
consideration. Based on ECMA-368, for the requirement of the Multiband-OFDM
system, this FFT processor should work at a few hundred MHz, which makes it
difficult to implement. And since this system targets wireless portable
devices, small area and low power consumption are also imperative.
Therefore, this work focuses on the area and power consumption improvement
under the ECMA-368 standard requirements.
This paper aims at designing an ASIC (Application Specific Integrated
Circuit) FFT processor for the Multiband-OFDM UWB system. In order to achieve
this goal, several steps need to be followed. The first step is to find the
specifications for this FFT processor, which are determined by the OFDM UWB
standard. This step requires the analysis of Multiband-OFDM UWB technology
and the constraints of its FFT processor. After defining the specifications,
an optimized FFT algorithm and architecture should be used for these
specifications. There are a large number of FFT algorithms and architectures
in the signal processing literature [7]. Therefore, the state of the art
algorithms and architectures should be analyzed and compared. Based on
different algorithms and architectures, different power consumptions, area
and speed of the processor will be achieved. So their ASIC suitability should
be analyzed and the effort should be focused on choosing the algorithms and
architectures and on optimization. Furthermore, the improvement space should
be analyzed and the architecture should be further optimized. The proposed
algorithm and architecture should be validated by Matlab simulation before
implementation. After that, the circuit needs to be implemented in VHDL. The
synthesis step follows, using both Synplify Pro targeted for FPGA and Design
Compiler for ASIC.
2. FFT Processor Design
In the OFDMA system, only partial subcarriers need to be computed in the
user end. Thus,
the FFT pruning algorithm is applicable to the OFDMA demodulation under the
low-power context of the battery-driven handset. In the literature, FFT
pruning algorithms and architectures [9] are designed for computing
selective frequency bins on the spectrum but not for reflecting the
resource-allocation scheme in the OFDMA system. Therefore, in this paper, we
propose the partial cached-FFT algorithm for typical resource-block (RB)
distributions in the OFDMA system. Furthermore, we design a mixed
pipelined/cached-FFT processor with constellation and power awareness.
Finally, we implement and measure the FFT processor chip to show its energy
efficiency.
PARTIAL CACHED-FFT ALGORITHM FOR RESOURCE BLOCK ALLOCATION
In general, considering the actual hardware design, the accuracy of the
FFT/IFFT module is an important design factor of system performance. In
practice, fixed-point arithmetic is used to implement FFT algorithms in
hardware because it is not possible to keep infinite resolution of
coefficients and operations. All coefficients and input signals have to be
represented with a finite number of bits in binary format, depending on the
tradeoff between the hardware cost (memory usage) and the accuracy of the
output signals. Generally speaking, each multiplication may introduce an
error due to rounding operations or truncations, which is referred to as
arithmetic quantization error. Besides, all the twiddle factors are
represented with a limited number of bits, and the loss due to the inexact
coefficients is called coefficient quantization error. The theoretical
performance evaluation has been given in previous works. Several previous
works have analyzed the effect of fixed-point arithmetic for the radix-2
FFT. In this section, we derive the equivalent matrix form of both DIF and
DIT FFT algorithms. Although the alternative DIT and DIF FFT algorithms have
the same multiplicative complexity, the sequence of butterfly stages and
twiddle factor stages is reversed. In other words, the signal flow of
two alternative representations is actually the mutual mirroring of each
other.
Figure 1 2D FFT Processor Architecture
The proposed 64 to 1024-point cached-FFT processor, as shown in Fig. 1,
consists of a radix-2/2^2 pipelined butterfly processor, cache sets, and
main memory. We propose a pipelined dual-delay-feedback (DDF) butterfly
architecture, to avoid the idling of the butterfly unit (BU) in the
traditional radix-2 single-delay-feedback architecture. Two sequences share
the butterfly units by using a delay register after the first butterfly. In
the time schedule for the dual-stream processing, the gray and slash blocks
of time slots represent the operations of the BF2_D and BF2_2D butterfly
units, respectively. Thus, the butterfly processor can achieve a 100%
utilization rate. If the radix-2^3 butterfly processor is used for the dual
streams, the latency will be increased, because more delay registers must be
inserted for 100% utilization. If radix-4 and four streams are used, the
cost will significantly grow. Therefore, the radix-2/2^2 dual-stream
architecture is a compromise of the cost and the throughput.
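The utilization argument above can be illustrated with a toy occupancy model. The sketch below is not the paper's DDF datapath: it only assumes the textbook behavior of a radix-2 single-delay-feedback stage, in which the butterfly performs arithmetic during the second half of each block while the first half merely fills the feedback delay line. The function names `busy_cycles` and `utilization`, the block length of 16, and the half-block offset between the two streams are illustrative assumptions.

```python
def busy_cycles(block_len, offset, start, stop):
    """Cycles in [start, stop) where the butterfly does arithmetic for one stream.

    Assumed model of a radix-2 single-delay-feedback stage: the first half of
    each block only fills the feedback delay line; in the second half the
    butterfly combines incoming samples with the delayed ones.
    """
    half = block_len // 2
    return {t for t in range(start, stop)
            if t >= offset and (t - offset) % block_len >= half}


def utilization(block_len, offsets, n_blocks=4):
    """Fraction of steady-state cycles in which the shared butterfly is busy."""
    start = block_len                      # skip the start-up block
    stop = start + n_blocks * block_len
    busy = set()
    for off in offsets:
        slots = busy_cycles(block_len, off, start, stop)
        # Two streams must never need the butterfly in the same cycle.
        assert not (busy & slots), "butterfly conflict between streams"
        busy |= slots
    return len(busy) / (stop - start)


single = utilization(16, [0])     # one stream: butterfly idles half the time
dual = utilization(16, [0, 8])    # second stream delayed by half a block
print(single, dual)               # prints: 0.5 1.0
```

Under these assumptions a single stream keeps the butterfly busy in half of the cycles, and a second stream shifted by half a block fills exactly the idle slots, which is the intuition behind sharing one butterfly between two sequences.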
CACHE/MEMORY ARCHITECTURE:
We use two cache sets to facilitate the data transfers between the butterfly
processor and the cache sets and between the cache sets and the main memory.
The cache sets must perform two write operations and two read operations in
one clock cycle for the dual-stream processing. To avoid the utilization of
a four-port cache and to simplify the control complexity, we divide each
cache set into two banks for even and odd addresses. The odd/even address
detector serves to detect whether the data are accessed in the correct bank
and to exchange the access positions if necessary. Thereby, we can carefully
manipulate the computational time schedule to ensure that one even address
and one odd address are accessed for two read ports
and two write ports. Similar to the cache architecture, the main memory
needs to perform two read and two write operations in one cycle. Therefore,
we divide the main memory into two banks of single-port 512-word static
random-access memory (SRAM) for even and odd memory addresses.
Figure 2 Proposed partial cached-FFT processor
Computation Time Schedule:
Since the cache flush/refresh time is smaller than the one-pass processing
time, the butterfly processor can continuously compute without an idle time
slot. However, there are timing overlaps between the cache
flushing/refreshing and the butterfly processing. Thus, we must carefully
manage the read and write addresses of the cache sets to avoid data
conflicts. We summarize four types of butterfly computation sequences in
Fig. 6 from the control tables of the 128- to 1024-point partial cached-FFT
schemes.
Figure 3 Architectures of the (a) one 16-word cache set and (b) main memory
Type 1 computes one complete group. In the radix-2^2 butterfly, data are
read from the {0, 1, 2, 3} cache addresses but are written back to the
{0, 4, 8, 12} cache addresses. The cache-address exchanges seem to violate
the in-place property of the FFT algorithm. However, before being
overwritten, the original contents in cache addresses {4, 8, 12} are read
into the butterfly processor. This strategy is applied to the arrangement of
the cache read/write control throughout the schedules of all types. In
addition, data-conflict avoidance requires
that there be one even and one odd read/write address in a cycle all the
time. To avoid an inevitable cache conflict in the eleventh cycle, we add a
bypass register. Type 2 processes the epoch 1 of the 128-point FFT. In pass
0, four radix-2^2 butterflies are computed without the cache-address
exchange, whereas eight radix-2 butterflies are computed with cache-address
exchanges. Type 3 deals with the epoch 2 of the 512-point FFT when only
one-pass operation is necessary in the last epoch. To reduce the control
complexity and to maintain the regularity, we idle the pass-0 operation and
compute radix-2 butterflies in pass 1 without the cache-address exchange.
Type 4 handles the epoch 2 of the 1024-point FFT. This case is similar to
Type 3, but pass 1 consists of radix-2^2 operations and the cache-address
exchange is required. In Table III, the proposed and traditional cached-FFT
architectures are compared in terms of cost, number of trivial
multiplications, and latency per FFT.
An RB that consists of multiple symbols is the basic data transmission unit
in the OFDMA system. The data of the user are logically packed as many
virtual RBs (VRBs) and are then mapped to the physical RBs (PRBs) in the
OFDMA.
The area of a chip depends on the number of components to be placed on it.
Effective placement techniques are being used in the industry to minimize
area and maximize functionality. Effective place and route can not only aid
in reducing the area of the chip but also reduce the delay due to wiring and
interconnect capacitance.
The proposed FFT processor was fabricated using TSMC 0.18-μm one-poly
six-metal CMOS technology. The chip occupies 2100 × 2100 μm^2 with a
1210 × 1210 μm^2 core area, including four 512 × 13 single-port SRAMs, four
8 × 13 caches, and four 128 × 13 ROMs. The function of the chip is verified,
and its performance is measured using a digital test station. This chip can
operate at a maximum of 51 MHz with 33.3 mW. We use the worst- and best-case
patterns for measuring the energy-saving capability of different RB
allocation schemes, as shown in Fig. 4. Each curve represents a series of
{1, 3/4, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64} computed RBs in the whole OFDMA
spectrum. We can see that the energy dissipation scales with the modulation
order and the amount of the allocated resource.
Figure 4 Energy dissipation versus FFT-demodulated resource bit rate (best case)
The comb-distributed scheme has the best performance, whereas the localized
scheme has the worst performance. Note that the RB-level distributed scheme
is a compromise between localized and comb-distributed schemes. Thus, its
energy dissipation also lies between them. The energy scaling ranges from
1.90 to 0.64 nJ/FFT point for the 1024-point FFT, i.e., 67% energy can be
saved in the FFT processor for minimal RB transmission. This result
outperforms other cached-memory-based FFT chips. The latency for processing
a 1024-point full FFT is 61 μs, which meets the 66.7- and 91.5-μs OFDM symbol
durations of the 3GPP-LTE and Mobile WiMAX standards, respectively.
4. Results and Conclusions
The input data length of our proposed pipelined FFT processor is a parameter
which can be chosen from 128, 512, 1024 and 2048 points. Take the 1024-point
FFT as an example. At first, the 1024-point FFT is coded using MATLAB
software. After the chosen FFT algorithm is validated, the architecture of
the processor is modeled in the VHDL language and functionally verified
using Xilinx 12.3i software, with timing simulation using ISIM SE software.
During the timing simulation, a test bench file that includes the TEXTIO
package was written to read the input data and write the FFT result. The
behavioral simulation waveforms for the processor are shown in Figure 5.
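The flow described above relies on a software reference model whose output is compared with the simulated hardware result. A minimal sketch of such a golden model is given here in Python rather than MATLAB; it is not the authors' code. The names `fft_radix2`, `dft_direct` and `max_abs_error` are hypothetical, and a textbook recursive radix-2 decimation-in-time FFT is assumed in place of the processor's radix-2/2^2 schedule, which computes the same transform.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])            # transform of even-indexed samples
    odd = fft_radix2(x[1::2])             # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: twiddle factor W_n^k applied to the odd branch.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def dft_direct(x):
    """O(n^2) DFT straight from the definition, used only as a cross-check."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


def max_abs_error(a, b):
    """Largest magnitude difference between two equally long sequences."""
    return max(abs(p - q) for p, q in zip(a, b))


# Cross-check on a short deterministic test vector.
stimulus = [complex((i * 37) % 11, (i * 17) % 5) for i in range(64)]
error = max_abs_error(fft_radix2(stimulus), dft_direct(stimulus))

# 1024-point run: the spectrum of a unit impulse is flat (all ones).
impulse = [1 + 0j] + [0j] * 1023
spectrum = fft_radix2(impulse)
```

In a flow like the one described, the stimulus and the reference spectrum would be written to text files that the VHDL test bench reads back through TEXTIO; the file format is left open here because the paper does not specify it.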
Figure 5 Simulation of 1024 point FFT
In this paper, we have presented the fabrication and testing of the FFT
processor integrated circuit to show its superior energy efficiency. We
found that this FFT processor is significantly beneficial to the
realization of the OFDMA receiver.
Acknowledgements
The authors would like to thank the anonymous reviewers for their comments,
which were very helpful in improving the quality and presentation of this
paper.
References:
[1] Y. W. Lin, H. Y. Liu, and C. Y. Lee, “A dynamic scaling FFT processor
for DVB-T applications,” IEEE J. Solid-State Circuits, vol. 39, no. 11,
pp. 2005–2013, Nov. 2004.
[2] T. H. Yu, C. Z. Zhan, Y. J. Cho, C. L. Yu, and A. Y. Wu, “Efficient fast
Fourier transform processor design for DVB-H system,” in Proc. 18th VLSI/CAD
Symp., Aug. 2007, pp. 65–68.
[3] Z. Hu and H. Wan, “A novel generic fast Fourier transform pruning
technique and complexity analysis,” IEEE Trans. Signal Process., vol. 53,
no. 1, pp. 274–282, Jan. 2005.
[4] C. P. Fan and G. A. Su, “A grouped fast Fourier transform algorithm
design for selective transformed outputs,” in Proc. IEEE APCCAS, 2006,
pp. 1939–1942.
[5] R. Min, M. Bhardwaj, and A. Chandrakasan, “A partially operated FFT/IFFT
processor for low complexity OFDM modulation and demodulation of WiBro
in-car entertainment system,” IEEE Trans. Consum. Electron., vol. 54, no. 2,
pp. 431–436, May 2008.
[5] C80216m-08_503, Motorola IEEE 802.16 m Downlink Resource Mapping, IEEE,
May 2008.
[6] 3GPP, R1-071091, Philips Resource-Block Mapping of Distributed
Transmissions in E-UTRA Downlink, Feb. 2007.
[7] R. Min, M. Bhardwaj, and A. Chandrakasan, “Quantifying and enhancing
power awareness of VLSI systems,” IEEE Trans. Very Large Scale Integr. (VLSI)
Syst., vol. 9, no. 6, pp. 757–772, Dec. 2001.
[8] T. Lenart and V. Owall, “Architectures for dynamic data scaling in 2/4/8 K
pipeline FFT cores,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst.,
vol. 14, no. 11, pp. 1286–1290, Nov. 2006.
[9] A. Sinha, A. Wang, and A. P. Chandrakasan, “Algorithmic transforms for
efficient energy scalable computation,” in Proc. Int. Symp. Low Power
Electronics and Design, Italy, Aug. 2000, pp. 31–36.
[10] S. H. Nawab, et al., “Approximate signal processing,” J. VLSI Signal
Processing Syst. Signal, Image, and Video Technol., vol. 15, no. 1/2,
pp. 177–200, Jan. 1997.
[11] L. McMillan and L. A. Westover, “A forward-mapping realization of the
inverse discrete cosine transform,” in Proc. Data Compression Conf.,
Snowbird, UT, Mar. 1992, pp. 219–228.
[12] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A System
Perspective. Reading, MA: Addison-Wesley, 1994.
[13] V. Gutnik and A. P. Chandrakasan, “Embedded power supply for low-power
DSP,” IEEE Trans. Very Large Scale Integration (VLSI) Syst., vol. 5,
pp. 425–435, Dec. 1997.
Authors Profile:
M. Viswanadh is working as an assistant professor in MLRIT, Hyderabad, in the
Department of Electronics & Comm. Engineering. He completed his masters
(M.Tech) in VLSI Design. He is planning for his research. He is interested in
the fields of VLSI, image processing, and wireless communications.
M. Harshavardhan Reddy is working as an assistant professor in MLRIT,
Hyderabad, in the Department of Electronics & Comm. Engineering. He
completed his masters (M.Tech) in VLSI Design. He is planning for his
research. He is interested in the developments in wireless communications.
Naveena Boppana is Associate Professor in VVIT, Chevella, Hyderabad, in the
Department of Electronics & Comm. Engineering. She completed her masters
(M.Tech) in VLSI Design. She is planning for her research under JNTU
Hyderabad.
ISSN: 2231-5381
http://www.ijettjournal.org