Introduction to OFDM and MIMO with emphasis on 802.16

advertisement
FPGAs for Signal Processing and
Communication Systems
Raghu Rao
Wireless and Signal Processing Group,
Xilinx Inc.
05/14/2010
Agenda
• Overview of FPGAs
– Building DSP sub-systems on FPGAs
– Digital baseband
• The Platform FPGA
• Communication systems and DSP on FPGAs
• Architectural tradeoffs for FPGAs
– The Matrix inversion problem
• FPGA tools and design methodology
2
R. M. Rao, 2008
What are FPGAs?
• An array of configurable logic blocks with
configurable interconnects between them.
• Each logic block can implement any 6-input
combinatorial function.
• Logic blocks can be connected to generate larger
circuits.
• Additional DSP specific resources (multiply
accumulate units).
3
R. M. Rao, 2008
Virtex-4/5 FPGA Arhitecture
High-Level View
• FPGA family with 3 members
tailored for specific classes of
processing
– SX: DSP
– LX: Logic centric
– FX: Full featured
• Embedded PowerPC hard IP
• Giga-bit serial connectivity
• DSP processing tiles “DSP48”
4
R. M. Rao, 2008
Virtex-5 FPGA Platform
Can be configured as a
shift register
• 2 slices per CLB, 4 LUTs per CLB
• Can be configured as a shift
register
• Can be configured as distributed
memory
Can be configured as RAM
5
R. M. Rao, 2008
Virtex-5 DSP48E
Full Custom Design Enabling Efficient DSP
Cascade routing enables
scalable performance
ACOUT
Wider internal data-path
and 96-accumulated output
enable higher precision
Pipeline registers enable
550Mhz performance
BCOUT
PCOUT
ACIN
New 25x18 input increases
precision and efficiency
=
BCIN
New second stage enables SIMD
and bitwise logic operations
Optional Pipeline Register/
Routing Logic
Optional Register
Routing Logic
C (48-bit)
Multiplier
B (18-bit)
A (25-bit)
Optional Pipeline Register/
Routing Logic
48-bit
P (48-bit)
Optional P(96-bit)
PCIN
Pattern detect circuitry
increases functionality
6
R. M. Rao, 2008
Dynamically Reconfigurable
DSP OPMODEs
OpMode
Zero
Hold P
A:B Select
Multiply
C Select
Feedback Add
36-Bit Adder
P Cascade Select
P Cascade Feedback Add
P Cascade Add
P Cascade Multiply Add
P Cascade Add
P Cascade Feedback Add Add
P Cascade Add Add
Hold P
Double Feedback Add
Feedback Add
Multiply-Accumulate
Feedback Add
Double Feedback Add
Feedback Add Add
C Select
Feedback Add
36-Bit Adder
Multiply-Add
17-Bit Shift P Cascade Select
17-Bit Shift P Cascade Feedback Add
17-Bit Shift P Cascade Add
17-Bit Shift P Cascade Multiply Add
17-Bit Shift P Cascade Add
17-Bit Shift P Cascade Add Add
17-Bit Shift Feedback
17-Bit Shift Feedback Feedback Add
17-Bit Shift Feedback Add
17-Bit Shift Feedback Multiply Add
17-Bit Shift Feedback Add
6
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
Z
5
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
1
1
1
1
1
4
0
0
0
0
0
0
0
1
1
1
1
1
1
1
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
Y
3 2
0 0
0 0
0 0
0 1
1 1
1 1
1 1
0 0
0 0
0 0
0 1
1 1
1 1
1 1
0 0
0 0
0 0
0 1
1 1
1 1
1 1
0 0
0 0
0 0
0 1
0 0
0 0
0 0
0 1
1 1
1 1
0 0
0 0
0 0
0 1
1 1
X
1 0
0 0
1 0
1 1
0 1
0 0
1 0
1 1
0 0
1 0
1 1
0 1
0 0
1 0
1 1
0 0
1 0
1 1
0 1
0 0
1 0
1 1
0 0
1 0
1 1
0 1
0 0
1 0
1 1
0 1
0 0
1 1
0 0
1 0
1 1
0 1
0 0
Output
+/- Cin
+/- (P + Cin)
+/- (A:B + Cin)
+/- (A * B + Cin)
+/- (C + Cin)
+/- (C + P + Cin)
+/- (A:B + C + Cin)
PCIN +/- Cin
PCIN +/- (P + Cin)
PCIN +/- (A:B + Cin)
PCIN +/- (A * B + Cin)
PCIN +/- (C + Cin)
PCIN +/- (C + P + Cin)
PCIN +/- (A:B + C + Cin)
P +/- Cin
P +/- (P + Cin)
P +/- (A:B + Cin)
P +/- (A * B + Cin)
P +/- (C + Cin)
P +/- (C + P + Cin)
P +/- (A:B + C + Cin)
C +/- Cin
C +/- (P + Cin)
C +/- (A:B + Cin)
C +/- (A * B + Cin)
Shift(PCIN) +/- Cin
Shift(PCIN) +/- (P + Cin)
Shift(PCIN) +/- (A:B + Cin)
Shift(PCIN) +/- (A * B + Cin)
Shift(PCIN) +/- (C + Cin)
Shift(PCIN) +/- (A:B + C + Cin)
Shift(P) +/- Cin
Shift(P) +/- (P + Cin)
Shift(P) +/- (A:B + Cin)
Shift(P) +/- (A * B + Cin)
Shift(P) +/- (C + Cin)
– Over 40 Different Modes
- Each XtremeDSP Slice
individually controllable
- Change operation in a single
clock cycle
- Enables resource sharing for
maximum utilization
7
R. M. Rao, 2008
Reconfigurability
Waveform identification module
Waveform 1
Waveform 2
Waveform 3 can be reconfigured
into this region of the FPGA.
Waveform 2 can be “reloaded” into its
region when Waveform identification
module detects waveform 2 being
received.
8
R. M. Rao, 2008
Virtex-6 resources
9
R. M. Rao, 2008
GMACs Performance DSP48 slices
10
R. M. Rao, 2008
Processing capabilities of FPGAs
BDTI Certified(tm) Results (c) 2008 BDTI. For more info and results see www.BDTI.com.
11
R. M. Rao, 2008
Processing capabilities of FPGAs
BDTI Certified(tm) Results (c) 2008 BDTI. For more info and results see www.BDTI.com.
12
R. M. Rao, 2008
Pipelined Multiplier
C
To Adjacent DSP48 Tile
48
BCOUT
18
3 delay latency
PCOUT
18
48
18
MS Word
LS Word
A
B
18
X
36
72
36
Y
48
48
A
B
ZERO 48
18
z-3
18
48
48
36
18
18
48
CIN

48
SUB
Z
48
P (PCOUT)
48
36b product sign extended to 48b
Register
18
Wire Shift Right By 17b
48
BCIN
PCIN
13
R. M. Rao, 2008
P
Pipelined Complex 18x18 MPY
Ai
S4
18
Bi
Ar
18
18
Pr
48
18
‘0’
S2
18
Bi
Ar
48
S3
Br
Ar
-
48
48
18
48
Pi
S1
Bi
sn = Slice n
18
18
Register
36
48
‘0’
Sign Extension
14
R. M. Rao, 2008
Wide Filters At Full Speed
Within the Virtex-4 DSP Slice Column
• Systolic N-tap FIR
– Scalable N-levels deep implementation
– N-levels deep at 500MHz performance
• Uses Integrated Pipeline Registers to
Synchronize Filter Inputs
• Utilizes Input and Output Cascade Routing
Build Massively Parallel 512-TAP FIR Filter
In a Single Device Achieving
256 GMACCs/s Performance
Equivalent Implementation Would Consume
444 Embedded Multipliers and 77,008 LCs
And Would Only Achieve ½ The Performance
15
R. M. Rao, 2008
Xilinx FFT IP (4)
• FFT fully utilizes FPGA arithmetic hardware resources
• FFT viewed as a recursion using a butterfly kernel
a
(a + b)
CADD1
CADD2
b
(a + b) e-j2pk/N
CMPY
Phase factors: e-j2pk/N
•
•
CADD{1|2}: complex adder
CMPY: complex multiplier
16
R. M. Rao, 2008
Virtex-4 DSP Slice
• DSP slice key for
implementing highperformance arithmetic
• Embedded 18x18 MPY
and 48b adder
– Butterfly phase rotator
– Cross-addition
17
R. M. Rao, 2008
Butterfly CMPLX MPY
Pr + jPi = (Ar+jAi) x (Br + jBi)
• Complex MPY used in FFT
butterfly
• Optimized to employ Virtex-4
DSP Slice
Pr
DSP Slice 2
DSP Slice 1
Ar
– 4 and 3 MPY option
•
Br
Complex MPY available as IP Bi
module†
DSP Slice 4
Ai
DSP Slice 3
Pi
† Available: 6.2i IP Update 2
18
R. M. Rao, 2008
Performance/Parallelism/Area
• FPGA: highly parallel computing machine
• Achieve performance using functional unit parallelism
• Butterfly array to produce highperformance FFT processor
• High computation rate using (possibly)
hundreds of DSP slices
– Allocate resources as appropriate to meet
system requirements
• Large memory bandwidth using multiport memory constructed from BRAMs
Mem read BW: 320 x 36 x 500e6 = 5.76 Tera-bps
• Area/throughput tradeoff delivered via Xilinx IP library
19
R. M. Rao, 2008
FFT Architecture
• For small number of carriers and modest data rates single
butterfly (I)FFT is probably suitable - Small FPGA footprint
Input Data
Data
Ram 0
switch
Iteration Engine
switch
Phase
Factor ROM
Data
Ram 1
Output Data
20
R. M. Rao, 2008
Block boundary detection/Fine
timing acquisition
1 OFDM block of
repeated data
Half an OFDM block
SAMPLES
Z-1
Z-1
Z-1
Z-1
Z-1
Z-1
Z-1
Z-1
||2
KNOWN
SEQUENCE
ave
Timing Est
()*
Freq Est
Z-1
Z-1
Z-1
Z-1
Z-1
Z-1
Z-1
Z-1
arg
F. Tufvesson, O. Edfors, M. Faulkner, “Time and Frequency Synchronization for OFDM
using PN-Sequence Preambles”, VTC-1999/Fall, vol 4, pp.2203-7, New Jersey, 1999.
21
R. M. Rao, 2008
Fine-timing acquisition using a
clipped correlator
10 time multiplexed
correlators
in0
out0
Full precision correlators :
32 embedded multipliers
896 flipflops
in1
Register1
2
xnz
1
1
d
xn
2
addr
sysgenq
DAddr
coef f
yn
R
-1
addr
sysgen
z
bsysgen
ab
a
2
a
sysgen
cast
coeff
bc2
1
3
sysgen
cast
yn
ld
bc3
en
3
a
-1
sysgen
z
Delay2
Data Addr
BaudClkCoef Addr
BaudClk
AddSub
1
yn
en
1-bit correlator
MAC
ROM1
LD
2
-1
sysgen
z q
ld
CAddr
4
d
sub
-1
sysgen
z
-1
sysgen
z
Delay2
Delay4
-2
sysgen
z
Delay
load
FSM
1
xn
x
DAddr
DAddr
DAddr
DAddr
DAddr
DAddr
DAddr
CAddr
CAddr
CAddr
CAddr
CAddr
CAddr
CAddr
xn
xnz
LD
xnz
yn
xnz
LD
C4
C6
C7
Delay5
a
-1
sysgen
z
AddSub1
-1
sysgen
z
-8
sysgen
zen
a+b b
en
a
a+b b
en
a+b b
en
-1
sysgen
zen
AddSub12
a
xnz
LD
C5
-8
sysgen
zen
Delay1
AddSub
yn
xnz
LD
C3
Each 1-bit correlator :
10 slices
Total for clipped correlator :
589 slices
xn
yn
xnz
LD
C2
xn
yn
LD
C1
xn
yn
-1
sysgen
z
-1
sysgen
z
en
Delay3
a
AddSub13
a+b b
en
Delay7
a
xnz
LD
xn
yn
a+b b
en
xn
yn
-1
AddSub2 sysgen
z
-1
sysgen
z
-7
sysgen
zen
a
Bank of correlators
AddSub4
a+b b
en
Delay6
-1
sysgen
z
1
y
22
R. M. Rao, 2008
High Performance
Processing
Embedded Software
Serial Gigabit
OBSAI/CPRI
Proprietary serial
backplane
Inter-chip connectivity
The Platform
DUC,DDC
CFR,DPD
RACH
Searcher
OFDM PHY
TCC
MIMO
Connectivity
MAC (Media Access)
Decision oriented
tasks
CORBA
RTOS
NBAP
SCA (JTRS radios)
High MIPs tasks
Radio PHY
Supported by embedded
DSP tiles, distributed
memory, block memory and
logic fabric
DAC
DAC
ADC
ADC
EMIF
SRIO
Logic & IO
OBSAI/CPRI
SRIO
AD/DA interface
EMIF
23
R. M. Rao, 2008
Digital Receiver Architecture:
Abstracted Architecture
• Common model of abstraction for digital receiver is inner/outer receiver
Receiver Abstraction
Digital IF Processing
Ø
Ø
Ø
Ø
Up-Conversion
Down-Conversion
Channelizer
Fast AGC
Inner Receiver
Ø
Ø
Ø
Ø
Ø
Ø
Ø
Ø
Ø
Outer Receiver
Frequency Offset Estimation/Correction
Sample Clock Offset Correction
Channel Estimation/Equalization
Frame detection
AGC
Successive Interference Cancellation
Space-Time-Coding
IFFT/FFT
Per sub-carrier processing
q Beamforming
q QRD-RLS
Ø
Channel Coding
q LDPC
q TPC
q CTC
q Viterbi
q (De-) Interleave
Control, Protocol and Link Layer processing
Ø
Ø
Ø
Ø
Ø
Ø
Medium Access Control (MAC)
Link Layer Processing
CPRI
OBSAI
Ø
System Initialization, Control and Monitoring
Application
Ethernet
Ø
Ø
PCI Express
SRIO
24
R. M. Rao, 2008
Receiver Abstraction and
Projection on to Platform FPGA
Receiver
Characteristics
Function
Digital IF
Ø MAC Intensive
Processing
Inner Receiver Ø MAC intensive
Ø Some functions LUT
intensive
CORDIC in QRD-RLS
Ø FFT processing for OFDM
Ø Correlation processing for
timing
Ø Per-carrier complexity
processing (MIMO-OFDM)
NTX  N RX  Num. Sub-carriers
FPGA
Platform
SX
SX/LX
Outer
Receiver
Ø Symbol rate tasks
Ø Channel coding
LX
Control/
Protocol
Ø
Ø
Ø
Ø
FX
Gigabit connectivity
Linux
OS “heavy” tasks
TCP/IP
Comments
Ø DSP48 main
requirement
Ø DSP48 leveraged
FFT
Ø FPGA fabric for
CORDIC
FFT
Ø ACS/ACSO dominated
by low bit precision
add/multiplexors
Good match for
fabric
Lots of memory
required
Ø Embedded PPC used
Ø Rocket IO for
PCI Express
SRIO
Receiver Abstraction
SX
SX/LX
LX
FX
FPGA product portfolio
Tailored for various
processing Tasks in
communications
receiver
25
R. M. Rao, 2008
Digital Frontend
Digital upconversion (downconversion)
Crest factor reduction
Digital pre-distortion
26
R. M. Rao, 2008
Wired Communications
•
•
•
Flexible serial transceivers support multi-rate applications.
GTX transceivers run at 150Mbps to 6.5Gbps with 25% lower power consumption.
GTH transceivers support line rates beyond 11Gbps to enable 40G and 100G
protocols and more.
27
R. M. Rao, 2008
Magnitude
Orthogonal Frequency Division
Multiplexing (OFDM)
Frequency
OFDM divides a frequency selective channel into a number
of flat fading channels
28
R. M. Rao, 2008
OFDM Modulation
• A QAM symbol is modulated onto each subcarrier
• IFFT/FFT are used for efficient modulation and demodulation
Frequency Domain
QAM
Mapping
S/P
Strip
cyclic
prefix
Cyclic
Prefix
IFFT
Time Domain
RF
and
A/D
Time Domain
S/P
P/S
D/A
and
RF
(a)
Frequency Domain
FFT
FEQ
P/S
QAM
decoding
(b)
29
R. M. Rao, 2008
MIMO Systems
•
•
•
MIMO systems:
• Multiple Antennas at the transmitter and
receiver.
3 types of MIMO Systems:
• STBC MIMO systems
• Diversity gain.
• Spatial Multiplexing MIMO systems
• Capacity/throughput gain.
• Feedback MIMO systems
• Higher performance thru interference
reduction.
MISO (multiple input single output) Systems:
• STBC can be used with just 1 receive antenna.
• Provides diversity gain.
• To achieve array gain, need knowledge of
channel at the transmitter (feedback).
Rx Antenna 1
Tx Antenna 1
Tx Antenna 2
Tx Antenna M
H
Rx Antenna 2
Rx Antenna N
30
R. M. Rao, 2008
Spatial Multiplexing
• A spatial multiplexing MIMO system transmits different data symbols from each
transmitter.
• The signals from each transmitter combine over the air and are received by multiple
receive antennas.
• SM systems have a rate=M (num transmit antennas). The diversity order depends on
the type of encoding and receiver (uncoded SM with ML decoding has diversity
order=N (num receive antennas)).
r1(t) = a11x(t)+a12y(t)+a13z(t)
x(n)
y(n)
MODULATOR
x(t)
MODULATOR
y(t)
z(n)
MODULATOR
z(t)
MIMO
MIMO
Receiver
Receiver
x(n)
y(n)
z(n)
r3(t) = a31x(t)+a32y(t)+a33z(t)
31
R. M. Rao, 2008
MIMO and OFDM
• MIMO – Multiple Input Multiple Output
Communication System. Employs multiple
antennas at both transmitter and receiver.
• OFDM – Orthogonal Frequency Division
Multiplexing. Breaks up a broadband channel into
many parallel narrowband channels (subcarriers).
• MIMO-OFDM – A Combination of MIMO and
OFDM. Appears like many parallel MIMO systems
on orthogonal subcarriers.
32
R. M. Rao, 2008
MIMO-OFDM System
OFDM
DEMODULATOR 1
MIMO DECODER
OFDM
TRANSMITTER N
RICH SCATTERING
ENVIRONMENT
OFDM
TRANSMITTER 1
OFDM
DEMODULATOR N
Each transmitter is an independent OFDM modulator.
The source symbols could be space-time block coded or just QAM modulated
for spatial multiplexing.
Each receiver is an OFDM demodulator combined with a MIMO decoder to
invert the channel on each subcarrier and extract the source symbols.
33
R. M. Rao, 2008
Spatial Multiplexing Receivers
Zero Forcing receiver:
h11
Rx Antenna 1
Tx Antenna 1
h12
h21
Tx Antenna 2
h22
y1  h11 x1 + h12 x2 + n1
y2  h21 x1 + h22 x2 + n2
 y1   h11
 y   h
 2   21
 xˆ1  
 xˆ   
 2 
W
Rx Antenna 2
h12   x1   n1 
+ 



h22   x2   n2 
  y1 
y 
 2
1
For ZF receivers W  H 1
 xˆ1   h11 h12   y1 
 xˆ    h
 y 
h
 2   21 22   2 
Significant increase in noise when the channel is in a deep fade.
34
R. M. Rao, 2008
Spatial Multiplexing Receivers
• MMSE MIMO Decoders:
– Cancels interference and minimizes noise.
– Minimizes the over all error (mean squared error).
E[( xˆ  x) ]
2
WMMSE
M

Es
1
M
 H

H
H
H
+
I
H
M 

SNR 

35
R. M. Rao, 2008
QRD
• One of the popular methods of matrix inversion is
based on QRD.
H  QR
H 1  R 1Q H
• Q is Unitary and R is upper triangular
• A Unitary matrix has a trival inverse, Q  Q
• An upper triangular matrix can be inverted by
back-substitution
1
H
36
R. M. Rao, 2008
Architectures for QRD
• There are many architectures to get the QR
decomposition of any matrix.
– Givens Rotations and its variations
– Householder transformations, etc.
• A systolic structure makes implementation
straightforward and scalable.
• Givens rotations based QRD has a nice and easy
systolic structure.
37
R. M. Rao, 2008
Givens Rotations
• For a 2x1 vector of real numbers
 c s  a   a 2 + b2 

s c  b   
0

   

c
a
a +b
2
2
,s 
b
a 2 + b2
• For a NxM matrix, repeat the process 2 cells at a time.
 a11 a12
a
 21 a22
 a31 a32
a13   a11 a12

a23    a21 a22
a33   0 a32
a13   a11 a12
 
a23    0 a22

a33   0 a32

a13   a11 a12
 
a23    0 a22
 
a33   0
0
 
a13 

a23 

a33 

38
R. M. Rao, 2008
Systolic Arrays
• Structured arrays with identical cells. Usually a
“boundary” cell and an “internal” cell for the QRD
process.
Internal cell
Boundary cell
1. The boundary cell generates the
rotations.
2. Internal cell applies the rotations to all
the cells in the row.
3. The systolic array in this figure can
handle any matrix below 3x3.
39
R. M. Rao, 2008
Boundary and Internal Cell
a/x
x
)

2
Z 1

(


r
Z
1
mode -ve in mode 0,
+ve in mode 1


1
ve
c/1

0


s/w

z
-ve
This negative is needed since
W12=-(W11a12)W22
1
c
s/w
Z 1
mode
c/1
mode
This register needs to be
initialized to 1, since in the
first cycle the output needs to
be +1
40
R. M. Rao, 2008
Triangularization mode
a31
a21
a11
a32
a22
a12
a33
a23
a13
The rotation factors for
zeroing out cell A(2,1)
are stored in cell
A(1,2), etc.
 a11 a12
a
 21 a22
 a31 a32
a13   a11 a12

a23    0 a22
a33   a31 a32

a13   a11 a12
 
a23    0 a22

a33   0 a32
 
a13   a11 a12

a23    0 a22
 
a33   0
0
 
a13 

a23 

a33 

• For QRD of upto a 3x3
matrix we need 3 boundary
cells and 3 internal cells.
• Boundary cells calculate
rotation vectors and
internal cells store them.
• Data is fed column-wise
into the systolic array.
• This may have to be
staggered depending on
the pipelining delays thru
the boundary cell and
internal cell.
41
R. M. Rao, 2008
Back-substitution mode
•
1
0
if (i  j )
Wij  0
0
a
a
 11 12
0 a
22

0
0

elseif (i  j )
a13  W W W 
11
12
13



a23   0 W22 W23 

0 W33 
a33   0

Wij  rjj1
elseif (i  j )
 j 1

Wij     Wim rmj  rjj1
 m 1

end
•
( )
  (W a + W a )W
W12   W11 a12 W22
W13
11 13
Computing R-1 with backsubstitution
12
23
The Wij  rjj1 is already
computed in the boundary cell
and stored away. So just use it.
33
42
R. M. Rao, 2008
Q-matrix computation mode
0
0
1
0
1
0
1
0
0
s  x * I .s + s * I .c
z  x * I .c*  s * I .s*
c  c;
QH A  R
QH I  QH
first column of Q matrix
second column of Q matrix
third column of Q matrix
0
1
0 c
32

0  s32
0   c21 s21 0  c31 0 s31   a11 a12
s32    s21 c21 0  0 1 0   a21 a22
c32   0
0 1    s31 0 c31   a31 a32
QH
A
a13   a11 a12
a23    0 a22

a33   0
0

a13 

a23 

a33 

R
43
R. M. Rao, 2008
Scalability
•
4x1 matrix
•
A 4x4 systolic array needs 4
boundary cells and 6 internal cells
and can handle all matricies
below 4x4. (i.e. 1x1, 1x2, .. 2x2,
…, 3x4, 4x4)
But if your design is restricted to
only a 2x2, you need only a 2x2
systolic array. With this you can
handle 1x1, 1x2 and 2x2.
4x4 matrix
2x2 matrix
3x3 matrix
44
R. M. Rao, 2008
FPGA Tools for DSP Systems
Design
• Higher level tools are raising the level of
abstraction.
• Allows non-hardware engineers (algorithm
designers) to get a first look at hardware.
• System Generator
– Simulink to Hardware
• C-to-Gates tools
– C or “higher” level languages to gates
45
R. M. Rao, 2008
Xilinx DSP Tools and Flows
Accelerate DSP Design
C/C++
ESL
Partners
MATLAB
Language Based Flow
RTL
RTL
MATLAB /
Simulink
Simulink
Mixed Flow
Graphical
Based Flow
RTL
RTL
FPGA Implementation with ISE
46
R. M. Rao, 2008
System Generator
System Level Modeling & Simulation Framework
HDL
C
Work in the language of your problem
47
R. M. Rao, 2008
System Generator Flow
DSP Development Flow
1. Develop Algorithm &
System Model
HDL Simulation Flow
Simulink MDL
2. Automatic Code
Generation
RTL VHDL & Cores
HDL Test Bench
Test Vectors
3. Xilinx Implementation
Flow
Bitstream
Download to FPGA
FPGA
48
R. M. Rao, 2008
Hardware/Software Co-simulation
•Encapsulates HDL semantics
•Simulink as verification framework
Hardware
co-simulation
HDL co-simulation
49
R. M. Rao, 2008
FlexOFDM
• A Configurable MIMO-OFDM Technology Demonstrator.
• Not specific to any standard, but can be configured (with some
effort) to showcase technologies that are part of some of the
Wireless standards.
• Provides an architecture for the PHY and MAC layers, which can
act as a starting point or spring board for product development.
• Investigate communication algorithms and architectures as they
efficiently map to Xilinx FPGAs.
This is not a product/IP from Xilinx, but is available to partners, to
speed up their MIMO-OFDM development efforts, on an AS IS basis.
ADVANCED SYSTEMS TECHNOLOGY
GROUP (ASTG) 50
R. M. Rao, 2008
Configurable MIMO-OFDM
Transmitter
Add Cyclic
Extension/Block
Shaping
Time shared
FFT across
antennas
Pilot insertion and
data loading
1
RealOut1
double
1
RealIn
DataIn
DataIn
SampleClk
3
RealOut
double
double
DataDone
Bool
2
sysgen
not
Inverter
Bool
RealOut
double
SampleClk
and
sysgen
-0
z
Bdata
double
Start
rfd
Logical2
Zeroblks
double
DataSubc
ImagOut
Preamble
Bool
double
xk_im
xn_im
double
SampleClk
Bdata
double
Enable
xk_index
FFT
double
rfd
double
RealIn
RealOut
DataRequest
Bool
vout
RealIn
double
RealOut1
ImagOut1
ImagIn
UFix_6_0
double
ImagOut
Fix_16_10
RealOut2
ImagIn
ImagOut2
Addr
RealOut3
start
Preamble
Fix_16_10
double
WriteFIFO
ReadFIFO
double
WriteFIFO
ImagOut3
BFrame
DataEnable
and
sysgen
-0
z
double
Preamble
Bdata
DataEnable
ImagOut
Zeroblks
xk_re
xn_re
ImagIn
Packetization
and Encoding
DataSubcarrier
FFTbusy
Pilot Insertion
and Data loading
double
enable
Busy
FFT
double
RealOut4
Add Cyclic Extension
BaudClk
ImagOut4
double
double
double
double
4
double
ImagOut2
double
double
Clock Generator
BaudClk
double
6
ImagOut3
RealOut4
BFrame
SampleClk
5
RealOut3
7
Logical
double
3
RealOut2
double
Spatial Demultiplexing
double
2
ImagOut1
Packet Controller
Packet
Controller
double
8
Packetization and
configurable STBC
encoding
ImagOut4
Spatial
Demultiplexing
and Interpolation
Clock
Generator
Resource sharing (folding factor)
Ratio of System clock rate to symbol rate > 8 needed for a 4 transmit antenna system
51
R. M. Rao, 2008
MIMO Receiver Architecture
Combine
PD
Block
Boundary
Detection
Packet
Detection
Block
Boundary
Rx 1
Strip
CP
Input
FIFO
FFT
Output
FIFO
Strip
CP
Input
FIFO
FFT
Output
FIFO
Rx 2
Packet
Detection
Coarse CFO
estimate
Rx 3
Packet
Detection
CFO
estimator
CFO Compensator
Packet
Detection
MIMO
Decode
Strip
CP
Input
FIFO
FFT
Output
FIFO
Strip
CP
Input
FIFO
FFT
Output
FIFO
Soft
Decisions
Coarse CFO
estimate
Rx 4
Pilot based CFO
estimator
Preamble
Packet
Controller
Payload
Samples processed at sample clock rate
Channel
Estimator
Samples processed
at system clock rate
MIMO
Decoder
FIFO
MIMO
Decoder
Matrix
(MMSE, etc)
52
R. M. Rao, 2008
Fine-timing acquisition using a
clipped correlator
10 time multiplexed
correlators
in0
out0
Full precision correlators :
32 embedded multipliers
896 flipflops
in1
Register1
2
xnz
1
1
d
xn
2
addr
sysgenq
DAddr
coef f
yn
R
-1
addr
sysgen
z
bsysgen
ab
a
2
a
sysgen
cast
coeff
bc2
1
3
sysgen
cast
yn
ld
bc3
en
3
a
-1
sysgen
z
Delay2
Data Addr
BaudClkCoef Addr
BaudClk
AddSub
1
yn
en
1-bit correlator
MAC
ROM1
LD
2
-1
sysgen
z q
ld
CAddr
4
d
sub
-1
sysgen
z
-1
sysgen
z
Delay2
Delay4
-2
sysgen
z
Delay
load
FSM
1
xn
x
DAddr
DAddr
DAddr
DAddr
DAddr
DAddr
DAddr
CAddr
CAddr
CAddr
CAddr
CAddr
CAddr
CAddr
xn
xnz
LD
xnz
yn
xnz
LD
C4
C6
C7
Delay5
a
-1
sysgen
z
AddSub1
-1
sysgen
z
-8
sysgen
zen
a+b b
en
a
a+b b
en
a+b b
en
-1
sysgen
zen
AddSub12
a
xnz
LD
C5
-8
sysgen
zen
Delay1
AddSub
yn
xnz
LD
C3
Each 1-bit correlator :
10 slices
Total for clipped correlator :
589 slices
xn
yn
xnz
LD
C2
xn
yn
LD
C1
xn
yn
-1
sysgen
z
-1
sysgen
z
en
Delay3
a
AddSub13
a+b b
en
Delay7
a
xnz
LD
xn
yn
a+b b
en
xn
yn
-1
AddSub2 sysgen
z
-1
sysgen
z
-7
sysgen
zen
a
Bank of correlators
AddSub4
a+b b
en
Delay6
-1
sysgen
z
1
y
53
R. M. Rao, 2008
MIMO-OFDM Receiver
Packet Detection
1
-1
en
RealIn1
z
RealIn1
9
PacketDetect
Delay
-1
z
2
PacketDetect
en
ImagIn1
Delay1
-1
z
3
ImagIn1
en
RealIn2
Delay2
-1
z
en
4
ImagIn2
CFO_Est
Delay4
-1
z
en
Carrier Frequency
Offset Correction
Display2
en
6
0
Fine Timing Acq
RealIn2
Delay3
-1
z
5
RealIn3
ImagIn3
ImagIn2
Delay5
-1
z
7
en
RealIn4
PktDetPulse
Delay6
-1
z
8
Baud_clk
en
ImagIn4
Delay7
MIMO Packet Detect1
9
Reset
a
b
z
-1
Delay8
a -b
Output FIFO
AddSub
RealIn
Out2
Cyclic prefix
removal
0
FFT
Display1
ImagIn
SampleClk
Clock Generator
BBDValid
BaudClk
BaudClk
Clock
Generator
Fine T iming Acquisition
BlkBounDetect
Rxreal1
ReadEnable
Out_real1
WriteFIFO
RealIn1
RxReal1
RxOut1
RxStream1
ImagIn1
Rxreal1
Rximag1
RxStream1
Out_imag1
Rximag1
RxImag1
RealIn2
RxStream1
RxStream2
Rxreal2
Out_real2
Rxreal2
RxOut2
ImagIn2
RealIn3
RxStream2
RxReal2
Rximag2
RxStream2
ImagIn3
RxOut3
RxStream3
Rxreal3
Enable
Rximag3
ReadFIFO
RxStream4
Rxreal3
Rximag3
Out_imag3
RxImag3
RxOut4
Rxreal4
Rxreal4
PacketDetect
Out_real4
Out_imag1
CFO_est
RxStream4
BaudClk
RxReal4
Rximag4
FIFO_status_f lag
FFT_Start
1
SoftDecReal1
Rximag2
Out_real3
RxReal3
RxStream3
ImagIn4
Out_imag2
RxImag2
RxStream4
RealIn4
Out_real1
RxStream3
Rximag4
Out_imag4
ValidData
ReadFIFO
FIFO_status_f lag
Cyclic Prefix Removal
Input Buffer
2
SoftDecImag1
ReadFIFO
RxImag4
CFO_Valid
Enable
Addr
Valid out
Addr
Addr
CFO_Valid
AddrOut
wreal_1_1
v alid_out
Output FIFO
wimag_1_1
10
ValidOut
FFT_RFD
Channel
Estimation
wreal_1_2
Reset
FFT_Start
wimag_1_2
FFT
wreal_1_3
wimag_1_3
ReadWeightMatrix
wreal_1_4
wimag_1_4
wreal_2_1
Rxreal1
Ch_1_1
Ch_tx1rx1
Ch_1_2
Ch_tx1rx2
wreal_1_1
Ch_1_3
Ch_tx1rx3
Ch_1_4
Ch_tx1rx4
Ch_2_1
Ch_tx2rx1
Ch_2_2
Ch_tx2rx2
Ch_2_3
Ch_tx2rx3
Ch_2_4
Ch_tx2rx4
Ch_3_1
Ch_tx3rx1
Ch_3_2
Ch_tx3rx2
Ch_3_3
Ch_tx3rx3
Ch_3_4
Ch_tx3rx4
wreal_1_4
wreal_2_4
Ch_tx4rx3
Ch_4_4
Ch_tx4rx4
wreal_4_1
Out_imag3
wreal_3_4
MIMO Decoder
6
SoftDecImag3
wimag_4_2
wreal_4_3
wimag_3_4
wimag_4_3
wreal_4_4
En
wimag_3_3
wimag_4_1
wreal_4_2
CFO_Est
wreal_3_3
wimag_3_4
ValidData
Addr
5
SoftDecReal3
wimag_3_2
wimag_3_3
wreal_3_4
Ch_tx4rx2
Out_real3
wreal_3_2
wimag_3_2
wreal_3_3
Ch_tx4rx1
wimag_3_1
wimag_3_1
wreal_3_2
Ch_4_3
wreal_3_1
wimag_2_4
wreal_3_1
Ch_4_2
4
SoftDecImag2
wimag_2_4
wimag_2_3
Rximag3
Ch_4_1
Out_imag2
wreal_2_4
wimag_2_2
wreal_2_3
Rximag4
wimag_2_3
wimag_2_1
wreal_2_2
Rxreal4
wreal_2_3
wimag_1_4
wreal_2_1
Rxreal3
3
SoftDecReal2
wimag_2_2
wimag_1_3
Rxreal2
Rximag2
Out_real2
wreal_2_2
wimag_1_2
wreal_1_3
Weight Matrix
Computation
wimag_2_1
wimag_1_1
wreal_1_2
Rximag1
wreal_4_1
wimag_4_4
wimag_4_1
ReadAddr
CFO_Est_Valid
Addr
Weight Matrix Computation
Out_real4
wreal_4_2
7
SoftDecReal4
Channel Estimation
wimag_4_2
wreal_4_3
wimag_4_3
Out_imag4
wreal_4_4
8
SoftDecImag4
wimag_4_4
BaudClk
MIMO Decoder
54
R. M. Rao, 2008
Channel Estimation
Channel Estimation Pilots
for Tx1
rst
UFix_6_0
sysgen
-2 UFix_6_0
out
sysgen
z
en
Enable
Pilots
double
Counter1
Channel Estimation Pilots
for Tx4
Enable
Enable
Delay2
Pilot_real
double
Enable
Pilot_real
Fix_2_0
Fix_2_0
Pilot_real
sel
rst
UFix_6_0
sysgen
out
-1
sysgen
Delay6en
z
Bool
Counter2
UFix_6_0
10
-2
sysgen
z
sysgen
d0
UFix_6_0
Reset
sel
Reset
Addr
d1
sysgen
d0
Mux1
Training Symbols
Tx1
UFix_6_0
Reset
Reset
double
Training Symbols
Tx3
Training Symbols
Tx2
d1
Training Symbols
Tx4
Mux
7
Addr
Delay3
5
addr
addr
Fix_16_10 -2 Fix_16_10
sysgen
1
-2
sysgen
z
Delay4
Fix_16_10
2
Fix_16_10
Fix_32_20
x sysgen
0.3535
imag_out
Real
Imag
Fix_16_10
CMult
Imag
WE
Delay5
Rximag1
real_out
Real
z
Rxreal1
real_out
Pilots1
imag_out
Fix_16_10
Real_in
double
addr
double
Imag
Imag_in
Fix_32_20
VDATA
ChEst Tx1-Rx1
CMult1
Real_in
double
1
Single Port RAM
Imag_in
addr
Real
Fix_32_20
Fix_32_20
Imag
imag_out
Real_in
Fix_16_10
Fix_16_10
Chreal3
6
addr
Chimag3
Real
Fix_32_20
VDATA
real_out
Pilots1
Imag
WE
3
Chreal2
4
ChEst Tx2-Rx1
real_out
Pilots1
double
WE
VDATA
EN
imag_out
Real
Fix_32_20
WE
Fix_32_20
x sysgen
0.3535
real_out
Pilots1
imag_out
Fix_16_10
Chreal4
8
Fix_16_10
Chimag4
Real_in
WE
Imag_in
Fix_32_20
VDATA
ChEst Tx3-Rx1
Imag_in
ChEst Tx4-Rx1
Chimag2
Chreal1
2
Chimag1
addr
addr
Fix_16_10 -2 Fix_16_10
sysgen
3
Rxreal2
Delay7
-2
sysgen
z
Fix_16_10
double
x sysgen
0.3535
imag_out
Real
Imag
Fix_16_10
CMult2
Imag
Fix_16_10
4
real_out
Pilots1
real_out
Real
z
WE
Delay8
imag_out
Rximag2
Fix_16_10
Real_in
double
double
double
double (8)
9
Chreal5
10
addr
Chimag5
Real
Imag
WE
Fix_32_20
VDATA
Imag_in
VDATA
ChEst Tx1-Rx2
CMult3
imag_out
Real_in
double
Imag_in
addr
Real
double
Fix_32_20
Imag
11
Chreal6
ChEst Tx2-Rx2
12
Single Port RAM1
real_out
Pilots1
double
WE
Fix_32_20
x sysgen
0.3535
EN
real_out
Pilots1
imag_out
Real_in
Chreal7
double
addr
real_out
Pilots1
Fix_16_10
14
Chimag7
WE
VDATA
simout11
13
Fix_16_10
Real
Imag
imag_out
Fix_16_10
15
To Workspace2
Chreal8
Fix_16_10
16
Real_in
Chimag8
WE
Imag_in
Fix_32_20
VDATA
ChEst Tx3-Rx2
Imag_in
ChEst Tx4-Rx2
Chimag6
19
17
addr
addr
5
Fix_16_10-2
sysgen
z
Fix_16_10
Rxreal3
Fix_16_10
Delay9
6
-2
sysgen
z
Rximag3
Real
Fix_16_10
Fix_32_20
x sysgen
0.3535
Real
Imag
Fix_16_10
CMult4
Imag
WE
Delay10
real_out
imag_out
Fix_16_10
Fix_16_10
Chreal9
18
Pilots1
real_out
imag_out
Real_in
Fix_16_10
x sysgen
0.3535
EN
VDATA
Imag_in
real_out
imag_out
Real_in
Fix_16_10
Chreal10
20
Fix_16_10
Chimag10
VDATA
addr
real_out
Pilots1
Real
Fix_32_20
Imag
WE
Fix_32_20
imag_out
Real_in
Fix_16_10
21
Chreal11
Fix_16_10
Fix_32_20
VDATA
addr
real_out
Pilots1
22
Fix_32_20 Chimag11
WE
Imag_in
ChEst Tx2-Rx3
ChEst Tx1-Rx3
CMult5
Real
Imag
Fix_32_20
WE
Fix_32_20
addr
Pilots1
Chimag9
Real
Imag
imag_out
Fix_16_10
Fix_16_10
23
Chreal12
24
Chimag12
Real_in
WE
Imag_in
Fix_32_20
VDATA
ChEst Tx3-Rx3
Imag_in
ChEst Tx4-Rx3
Single Port RAM2
25
27
Chreal13
addr
addr
7
Fix_16_10
-2
sysgen
z
Fix_16_10
Rxreal4
Delay11
8
-2
sysgen
z
Real
Fix_16_10
Fix_32_20
x sysgen
0.3535
Real
Imag
Fix_16_10
CMult6
Imag
Fix_16_10
WE
Delay12
imag_out
Rximag4
real_out
Pilots1
real_out
Fix_16_10
imag_out
Real_in
Fix_16_10
VDATA
addr
Fix_32_20
real_out
Pilots1
Real
WE
Fix_32_20
Chreal14
26
Chimag13
Fix_16_10
Imag
imag_out
Real_in
Fix_16_10
Fix_16_10
Fix_32_20
VDATA
28
Chimag14
Chreal15
addr
Fix_32_20
real_out
Pilots1
Real
WE
Imag_in
29
Imag
imag_out
Real_in
Fix_16_10
Fix_16_10
Fix_32_20
VDATA
addr
real_out
Pilots1
Real
Fix_32_20
WE
Imag_in
30
Chimag15
Imag
imag_out
Fix_16_10
31
Chreal16
Fix_16_10
32
Chimag16
Real_in
WE
Imag_in
Fix_32_20
VDATA
Imag_in
x sysgen
0.3535
EN
CMult7
ChEst Tx1-Rx4
ChEst Tx2-Rx4
ChEst Tx3-Rx4
ChEst Tx4-Rx4
Single Port RAM3
9
double
ValidData
11
double
and
sysgen
-2
z
Bool
Logical
ChEstPilots
-3
sysgen
z
ValidData
ChEstRst
En
Rst
ChEstPilots
En2
ChEstPilots1
Bool
Delay1
ChEstEn
double
double
double
double
double
ControlSignals
Input FIFO
12
4x4 Channel Estimation
Memory
double
ReadAddr
Control Signals
55
R. M. Rao, 2008
Packet Detection
Identical halves of 1 OFDM symbol
Schmidl and Cox algorithm for Packet
Detection and coarse carrier frequency
offset estimation.
T. M. Schmidl, D. C. Cox, “Low Overhead Low
Complexity Synchronization for OFDM”,
ICC 1996, Vol 3, pp 1301-1306.
r(n)
*
Z-D
*
C
c(n)
P
p(n)( )2

2
m(n)
2
RealIn1
double
1
RealIn
Fix_16_8
sysgen
force
sysgen
cast
Reinterpret1
Fix_8_0
sysgen
z-32
en
Convert
RealOut
ImagIn1
Fix_8_0
RealIn2
ImagOut
ImagIn
sysgen
force
Fix_16_8
Reinterpret
sysgen
cast
BaudClk
Convert1
Fix_8_0
sysgen
z-32
en
Delay1
3
Bool
Fix_16_0 RealIn1
ImagIn
Correlate
double
RealIn
ImagIn1
Fix_8_0
BaudClk
2
CorrMetric
RealOut
Complex
Multiply
ImagIn2
Delay
Fix_8_0
Fix_8_0
ImagOut
Complex
double
RealOut
Multiply
a
Fix_16_0
a>b
sysgen
z-1
En
BaudClk
1
PacketDetect
Complex Sliding
Window Averager
Power Calculator1
b
Relational
RealIn1
ImagIn1
double
Complex
Fix_8_0
PwrOut
Multiply
DecisionMetric = (CorrelationPeak >= 0.5(AveragePower))
In
In1
Out
Fix_16_0
Out1
BaudClk
BaudClk
Power Calculator
Fix_32_0 X >> 1
sysgen
z-0
double
Shift
En
3
Sliding Window
Averager
Squarer
AvePower
56
R. M. Rao, 2008
Pre-FFT Carrier Frequency Offset
Estimation
z-14
en
1
RealIn 1
ˆ
Delay5
2
ImagIn1
N 
2p  s 
 2 
RealIn 1
ImagIn 1
3
RealIn 2
CorrMetric _ real
In 1
Out 1
CorrMetric _ imag
In 2
Out 2
In 3
Out 3
x
ImagIn 2
BaudClk
5
Baud_clk
Rising edge
detector
z -17
RealIn 2
4
ImagIn2
mag
Out 1In 1
7
BBD
AvePwr
Rst
Packet Detection 3
z-24
en
y
atan
x 0 . 003906
-2
z
cast
Convert
CORDIC ATAN
Truncate
CMult8
d
rstz - 1 q
en
1
CFO_Est
Register1
Delay6
6
Rst
The angle of the correlation metric is
proportional to the Carrier frequency offset.
Right size the number of bits before the
CORDIC operation.
CORDIC ATAN from the Xilinx Math library
calculates the angle.
57
R. M. Rao, 2008
Carrier Frequency Offset
Correction
9
Direct digital synthesizer (DDS) from
the Xilinx DSP SysGen library.
Fix_16_12
Fix_16_16
x 0 . 01563
CFO_Est
freq_off
cos_out
11
CMult
Fix_16_15
Enable
CFO_Est_valid
12
double
sin_out
Reset
Fix_16_15 Fix_17_15
x(- 1 )
Reset
DDS
1
Fix_16_10
z
Fix_16_10
-1
RealIn 1
2
z
Fix_16_10
Negate 1
RealIn 1
Fix_16_10
-1
RealIn 2
Delay 1
ImagIn 1
RealOut
ImagIn 1
Delay
1
RealOut 1
ImagIn 2
BaudClk
Fix_16_10
Complex
Multiply
ImagOut
Fix_16_10
2
ImagOut 1
Complex Multiply
3
Fix_16_10
z
Fix_16_10
-1
RealIn 2
4
z
Fix_16_10
RealIn 1
Fix_16_10
-1
RealIn 2
Delay 3
ImagIn 2
RealOut
ImagIn 1
Delay 2
3
RealOut 2
ImagIn 2
BaudClk
Fix_16_10
Complex
Multiply
ImagOut
Fix_16_10
4
ImagOut 2
Complex Multiply 1
5
Fix_16_10
z
Fix_16_10
-1
RealIn 3
6
z
RealIn 1
Fix_16_10
-1
Fix_16_10
RealIn 2
Delay 5
ImagIn 3
RealOut
ImagIn 1
Delay 4
5
RealOut 3
ImagIn 2
BaudClk
Fix_16_10
Complex
Multiply
ImagOut
Fix_16_10
6
ImagOut 3
Complex Multiply 2
7
Fix_16_10
z
Fix_16_10
-1
RealIn 4
8
z
Fix_16_10
-1
RealIn 1
Fix_16_10
RealIn 2
ImagIn 4
RealOut
ImagIn 1
Delay 6
Delay 7
ImagIn 2
1
Constant 3
Bool
BaudClk
Fix_16_10
ImagOut
7
RealOut 4
Complex
Multiply
Fix_16_10
8
ImagOut 4
Complex Multiply 3
Bool
or
z -0
Logical 1
0
UFix_16_0
a a <b
Constant 1
bz
10
Bool
In1
Out1
Bool
rst
out
UFix_16_0
Bool
-0
Relational
and
-0
z
Bool
FFT_Start
Rising edge
detector
aa < = b
-0
bz
Counter
78
UFix_16_0
Bool
Logical
Relational 1
Constant 2
58
R. M. Rao, 2008
Design methodology issues
• FPGA tools
– Where to from here?
• C-to-gates
– Higher level design languages to gates
– Raising the level of abstraction
59
R. M. Rao, 2008
‘C’ or higher level language to
Gates
• There is interest in higher level design
methodologies, such as C-to-Gates from the
design community.
• ESL (Electronic system level) tools/design
methodologies are being explored.
• But, extracting all the concurrency from a
sequential description is not an easy problem.
60
R. M. Rao, 2008
C to Gates evaluation flow
Source: BDTI. For more info and results see www.BDTI.com.
61
R. M. Rao, 2008
C to Gates evaluation by BDTI
Source: BDTI. For more info and results see www.BDTI.com.
62
R. M. Rao, 2008
Conclusion
• FPGAs are finding wide use in infrastructure
communication systems and signal processing
systems.
• FPGA are an efficient choice for exploring VLSI
architectures.
• FPGA tools are raising the level of abstraction to
allow algorithm designers the ability to explore
h/w architectures without learning “h/w design
tools/languages”.
63
R. M. Rao, 2008
Questions?
64
R. M. Rao, 2008
Download