FPGAs for Signal Processing and Communication Systems Raghu Rao Wireless and Signal Processing Group, Xilinx Inc. 05/14/2010 Agenda • Overview of FPGAs – Building DSP sub-systems on FPGAs – Digital baseband • The Platform FPGA • Communication systems and DSP on FPGAs • Architectural tradeoffs for FPGAs – The Matrix inversion problem • FPGA tools and design methodology 2 R. M. Rao, 2008 What are FPGAs? • An array of configurable logic blocks with configurable interconnects between them. • Each logic block can implement any 6-input combinatorial function. • Logic blocks can be connected to generate larger circuits. • Additional DSP specific resources (multiply accumulate units). 3 R. M. Rao, 2008 Virtex-4/5 FPGA Arhitecture High-Level View • FPGA family with 3 members tailored for specific classes of processing – SX: DSP – LX: Logic centric – FX: Full featured • Embedded PowerPC hard IP • Giga-bit serial connectivity • DSP processing tiles “DSP48” 4 R. M. Rao, 2008 Virtex-5 FPGA Platform Can be configured as a shift register • 2 slices per CLB, 4 LUTs per CLB • Can be configured as a shift register • Can be configured as distributed memory Can be configured as RAM 5 R. M. Rao, 2008 Virtex-5 DSP48E Full Custom Design Enabling Efficient DSP Cascade routing enables scalable performance ACOUT Wider internal data-path and 96-accumulated output enable higher precision Pipeline registers enable 550Mhz performance BCOUT PCOUT ACIN New 25x18 input increases precision and efficiency = BCIN New second stage enables SIMD and bitwise logic operations Optional Pipeline Register/ Routing Logic Optional Register Routing Logic C (48-bit) Multiplier B (18-bit) A (25-bit) Optional Pipeline Register/ Routing Logic 48-bit P (48-bit) Optional P(96-bit) PCIN Pattern detect circuitry increases functionality 6 R. M. Rao, 2008 Dynamically Reconfigurable DSP OPMODEs OpMode Zero Hold P A:B Select Multiply C Select Feedback Add 36-Bit Adder P Cascade Select P Cascade Feedback Add P Cascade Add P Cascade Multiply Add P Cascade Add P Cascade Feedback Add Add P Cascade Add Add Hold P Double Feedback Add Feedback Add Multiply-Accumulate Feedback Add Double Feedback Add Feedback Add Add C Select Feedback Add 36-Bit Adder Multiply-Add 17-Bit Shift P Cascade Select 17-Bit Shift P Cascade Feedback Add 17-Bit Shift P Cascade Add 17-Bit Shift P Cascade Multiply Add 17-Bit Shift P Cascade Add 17-Bit Shift P Cascade Add Add 17-Bit Shift Feedback 17-Bit Shift Feedback Feedback Add 17-Bit Shift Feedback Add 17-Bit Shift Feedback Multiply Add 17-Bit Shift Feedback Add 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 Z 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 4 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 Y 3 2 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 X 1 0 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 0 0 Output +/- Cin +/- (P + Cin) +/- (A:B + Cin) +/- (A * B + Cin) +/- (C + Cin) +/- (C + P + Cin) +/- (A:B + C + Cin) PCIN +/- Cin PCIN +/- (P + Cin) PCIN +/- (A:B + Cin) PCIN +/- (A * B + Cin) PCIN +/- (C + Cin) PCIN +/- (C + P + Cin) PCIN +/- (A:B + C + Cin) P +/- Cin P +/- (P + Cin) P +/- (A:B + Cin) P +/- (A * B + Cin) P +/- (C + Cin) P +/- (C + P + Cin) P +/- (A:B + C + Cin) C +/- Cin C +/- (P + Cin) C +/- (A:B + Cin) C +/- (A * B + Cin) Shift(PCIN) +/- Cin Shift(PCIN) +/- (P + Cin) Shift(PCIN) +/- (A:B + Cin) Shift(PCIN) +/- (A * B + Cin) Shift(PCIN) +/- (C + Cin) Shift(PCIN) +/- (A:B + C + Cin) Shift(P) +/- Cin Shift(P) +/- (P + Cin) Shift(P) +/- (A:B + Cin) Shift(P) +/- (A * B + Cin) Shift(P) +/- (C + Cin) – Over 40 Different Modes - Each XtremeDSP Slice individually controllable - Change operation in a single clock cycle - Enables resource sharing for maximum utilization 7 R. M. Rao, 2008 Reconfigurability Waveform identification module Waveform 1 Waveform 2 Waveform 3 can be reconfigured into this region of the FPGA. Waveform 2 can be “reloaded” into its region when Waveform identification module detects waveform 2 being received. 8 R. M. Rao, 2008 Virtex-6 resources 9 R. M. Rao, 2008 GMACs Performance DSP48 slices 10 R. M. Rao, 2008 Processing capabilities of FPGAs BDTI Certified(tm) Results (c) 2008 BDTI. For more info and results see www.BDTI.com. 11 R. M. Rao, 2008 Processing capabilities of FPGAs BDTI Certified(tm) Results (c) 2008 BDTI. For more info and results see www.BDTI.com. 12 R. M. Rao, 2008 Pipelined Multiplier C To Adjacent DSP48 Tile 48 BCOUT 18 3 delay latency PCOUT 18 48 18 MS Word LS Word A B 18 X 36 72 36 Y 48 48 A B ZERO 48 18 z-3 18 48 48 36 18 18 48 CIN 48 SUB Z 48 P (PCOUT) 48 36b product sign extended to 48b Register 18 Wire Shift Right By 17b 48 BCIN PCIN 13 R. M. Rao, 2008 P Pipelined Complex 18x18 MPY Ai S4 18 Bi Ar 18 18 Pr 48 18 ‘0’ S2 18 Bi Ar 48 S3 Br Ar - 48 48 18 48 Pi S1 Bi sn = Slice n 18 18 Register 36 48 ‘0’ Sign Extension 14 R. M. Rao, 2008 Wide Filters At Full Speed Within the Virtex-4 DSP Slice Column • Systolic N-tap FIR – Scalable N-levels deep implementation – N-levels deep at 500MHz performance • Uses Integrated Pipeline Registers to Synchronize Filter Inputs • Utilizes Input and Output Cascade Routing Build Massively Parallel 512-TAP FIR Filter In a Single Device Achieving 256 GMACCs/s Performance Equivalent Implementation Would Consume 444 Embedded Multipliers and 77,008 LCs And Would Only Achieve ½ The Performance 15 R. M. Rao, 2008 Xilinx FFT IP (4) • FFT fully utilizes FPGA arithmetic hardware resources • FFT viewed as a recursion using a butterfly kernel a (a + b) CADD1 CADD2 b (a + b) e-j2pk/N CMPY Phase factors: e-j2pk/N • • CADD{1|2}: complex adder CMPY: complex multiplier 16 R. M. Rao, 2008 Virtex-4 DSP Slice • DSP slice key for implementing highperformance arithmetic • Embedded 18x18 MPY and 48b adder – Butterfly phase rotator – Cross-addition 17 R. M. Rao, 2008 Butterfly CMPLX MPY Pr + jPi = (Ar+jAi) x (Br + jBi) • Complex MPY used in FFT butterfly • Optimized to employ Virtex-4 DSP Slice Pr DSP Slice 2 DSP Slice 1 Ar – 4 and 3 MPY option • Br Complex MPY available as IP Bi module† DSP Slice 4 Ai DSP Slice 3 Pi † Available: 6.2i IP Update 2 18 R. M. Rao, 2008 Performance/Parallelism/Area • FPGA: highly parallel computing machine • Achieve performance using functional unit parallelism • Butterfly array to produce highperformance FFT processor • High computation rate using (possibly) hundreds of DSP slices – Allocate resources as appropriate to meet system requirements • Large memory bandwidth using multiport memory constructed from BRAMs Mem read BW: 320 x 36 x 500e6 = 5.76 Tera-bps • Area/throughput tradeoff delivered via Xilinx IP library 19 R. M. Rao, 2008 FFT Architecture • For small number of carriers and modest data rates single butterfly (I)FFT is probably suitable - Small FPGA footprint Input Data Data Ram 0 switch Iteration Engine switch Phase Factor ROM Data Ram 1 Output Data 20 R. M. Rao, 2008 Block boundary detection/Fine timing acquisition 1 OFDM block of repeated data Half an OFDM block SAMPLES Z-1 Z-1 Z-1 Z-1 Z-1 Z-1 Z-1 Z-1 ||2 KNOWN SEQUENCE ave Timing Est ()* Freq Est Z-1 Z-1 Z-1 Z-1 Z-1 Z-1 Z-1 Z-1 arg F. Tufvesson, O. Edfors, M. Faulkner, “Time and Frequency Synchronization for OFDM using PN-Sequence Preambles”, VTC-1999/Fall, vol 4, pp.2203-7, New Jersey, 1999. 21 R. M. Rao, 2008 Fine-timing acquisition using a clipped correlator 10 time multiplexed correlators in0 out0 Full precision correlators : 32 embedded multipliers 896 flipflops in1 Register1 2 xnz 1 1 d xn 2 addr sysgenq DAddr coef f yn R -1 addr sysgen z bsysgen ab a 2 a sysgen cast coeff bc2 1 3 sysgen cast yn ld bc3 en 3 a -1 sysgen z Delay2 Data Addr BaudClkCoef Addr BaudClk AddSub 1 yn en 1-bit correlator MAC ROM1 LD 2 -1 sysgen z q ld CAddr 4 d sub -1 sysgen z -1 sysgen z Delay2 Delay4 -2 sysgen z Delay load FSM 1 xn x DAddr DAddr DAddr DAddr DAddr DAddr DAddr CAddr CAddr CAddr CAddr CAddr CAddr CAddr xn xnz LD xnz yn xnz LD C4 C6 C7 Delay5 a -1 sysgen z AddSub1 -1 sysgen z -8 sysgen zen a+b b en a a+b b en a+b b en -1 sysgen zen AddSub12 a xnz LD C5 -8 sysgen zen Delay1 AddSub yn xnz LD C3 Each 1-bit correlator : 10 slices Total for clipped correlator : 589 slices xn yn xnz LD C2 xn yn LD C1 xn yn -1 sysgen z -1 sysgen z en Delay3 a AddSub13 a+b b en Delay7 a xnz LD xn yn a+b b en xn yn -1 AddSub2 sysgen z -1 sysgen z -7 sysgen zen a Bank of correlators AddSub4 a+b b en Delay6 -1 sysgen z 1 y 22 R. M. Rao, 2008 High Performance Processing Embedded Software Serial Gigabit OBSAI/CPRI Proprietary serial backplane Inter-chip connectivity The Platform DUC,DDC CFR,DPD RACH Searcher OFDM PHY TCC MIMO Connectivity MAC (Media Access) Decision oriented tasks CORBA RTOS NBAP SCA (JTRS radios) High MIPs tasks Radio PHY Supported by embedded DSP tiles, distributed memory, block memory and logic fabric DAC DAC ADC ADC EMIF SRIO Logic & IO OBSAI/CPRI SRIO AD/DA interface EMIF 23 R. M. Rao, 2008 Digital Receiver Architecture: Abstracted Architecture • Common model of abstraction for digital receiver is inner/outer receiver Receiver Abstraction Digital IF Processing Ø Ø Ø Ø Up-Conversion Down-Conversion Channelizer Fast AGC Inner Receiver Ø Ø Ø Ø Ø Ø Ø Ø Ø Outer Receiver Frequency Offset Estimation/Correction Sample Clock Offset Correction Channel Estimation/Equalization Frame detection AGC Successive Interference Cancellation Space-Time-Coding IFFT/FFT Per sub-carrier processing q Beamforming q QRD-RLS Ø Channel Coding q LDPC q TPC q CTC q Viterbi q (De-) Interleave Control, Protocol and Link Layer processing Ø Ø Ø Ø Ø Ø Medium Access Control (MAC) Link Layer Processing CPRI OBSAI Ø System Initialization, Control and Monitoring Application Ethernet Ø Ø PCI Express SRIO 24 R. M. Rao, 2008 Receiver Abstraction and Projection on to Platform FPGA Receiver Characteristics Function Digital IF Ø MAC Intensive Processing Inner Receiver Ø MAC intensive Ø Some functions LUT intensive CORDIC in QRD-RLS Ø FFT processing for OFDM Ø Correlation processing for timing Ø Per-carrier complexity processing (MIMO-OFDM) NTX N RX Num. Sub-carriers FPGA Platform SX SX/LX Outer Receiver Ø Symbol rate tasks Ø Channel coding LX Control/ Protocol Ø Ø Ø Ø FX Gigabit connectivity Linux OS “heavy” tasks TCP/IP Comments Ø DSP48 main requirement Ø DSP48 leveraged FFT Ø FPGA fabric for CORDIC FFT Ø ACS/ACSO dominated by low bit precision add/multiplexors Good match for fabric Lots of memory required Ø Embedded PPC used Ø Rocket IO for PCI Express SRIO Receiver Abstraction SX SX/LX LX FX FPGA product portfolio Tailored for various processing Tasks in communications receiver 25 R. M. Rao, 2008 Digital Frontend Digital upconversion (downconversion) Crest factor reduction Digital pre-distortion 26 R. M. Rao, 2008 Wired Communications • • • Flexible serial transceivers support multi-rate applications. GTX transceivers run at 150Mbps to 6.5Gbps with 25% lower power consumption. GTH transceivers support line rates beyond 11Gbps to enable 40G and 100G protocols and more. 27 R. M. Rao, 2008 Magnitude Orthogonal Frequency Division Multiplexing (OFDM) Frequency OFDM divides a frequency selective channel into a number of flat fading channels 28 R. M. Rao, 2008 OFDM Modulation • A QAM symbol is modulated onto each subcarrier • IFFT/FFT are used for efficient modulation and demodulation Frequency Domain QAM Mapping S/P Strip cyclic prefix Cyclic Prefix IFFT Time Domain RF and A/D Time Domain S/P P/S D/A and RF (a) Frequency Domain FFT FEQ P/S QAM decoding (b) 29 R. M. Rao, 2008 MIMO Systems • • • MIMO systems: • Multiple Antennas at the transmitter and receiver. 3 types of MIMO Systems: • STBC MIMO systems • Diversity gain. • Spatial Multiplexing MIMO systems • Capacity/throughput gain. • Feedback MIMO systems • Higher performance thru interference reduction. MISO (multiple input single output) Systems: • STBC can be used with just 1 receive antenna. • Provides diversity gain. • To achieve array gain, need knowledge of channel at the transmitter (feedback). Rx Antenna 1 Tx Antenna 1 Tx Antenna 2 Tx Antenna M H Rx Antenna 2 Rx Antenna N 30 R. M. Rao, 2008 Spatial Multiplexing • A spatial multiplexing MIMO system transmits different data symbols from each transmitter. • The signals from each transmitter combine over the air and are received by multiple receive antennas. • SM systems have a rate=M (num transmit antennas). The diversity order depends on the type of encoding and receiver (uncoded SM with ML decoding has diversity order=N (num receive antennas)). r1(t) = a11x(t)+a12y(t)+a13z(t) x(n) y(n) MODULATOR x(t) MODULATOR y(t) z(n) MODULATOR z(t) MIMO MIMO Receiver Receiver x(n) y(n) z(n) r3(t) = a31x(t)+a32y(t)+a33z(t) 31 R. M. Rao, 2008 MIMO and OFDM • MIMO – Multiple Input Multiple Output Communication System. Employs multiple antennas at both transmitter and receiver. • OFDM – Orthogonal Frequency Division Multiplexing. Breaks up a broadband channel into many parallel narrowband channels (subcarriers). • MIMO-OFDM – A Combination of MIMO and OFDM. Appears like many parallel MIMO systems on orthogonal subcarriers. 32 R. M. Rao, 2008 MIMO-OFDM System OFDM DEMODULATOR 1 MIMO DECODER OFDM TRANSMITTER N RICH SCATTERING ENVIRONMENT OFDM TRANSMITTER 1 OFDM DEMODULATOR N Each transmitter is an independent OFDM modulator. The source symbols could be space-time block coded or just QAM modulated for spatial multiplexing. Each receiver is an OFDM demodulator combined with a MIMO decoder to invert the channel on each subcarrier and extract the source symbols. 33 R. M. Rao, 2008 Spatial Multiplexing Receivers Zero Forcing receiver: h11 Rx Antenna 1 Tx Antenna 1 h12 h21 Tx Antenna 2 h22 y1 h11 x1 + h12 x2 + n1 y2 h21 x1 + h22 x2 + n2 y1 h11 y h 2 21 xˆ1 xˆ 2 W Rx Antenna 2 h12 x1 n1 + h22 x2 n2 y1 y 2 1 For ZF receivers W H 1 xˆ1 h11 h12 y1 xˆ h y h 2 21 22 2 Significant increase in noise when the channel is in a deep fade. 34 R. M. Rao, 2008 Spatial Multiplexing Receivers • MMSE MIMO Decoders: – Cancels interference and minimizes noise. – Minimizes the over all error (mean squared error). E[( xˆ x) ] 2 WMMSE M Es 1 M H H H H + I H M SNR 35 R. M. Rao, 2008 QRD • One of the popular methods of matrix inversion is based on QRD. H QR H 1 R 1Q H • Q is Unitary and R is upper triangular • A Unitary matrix has a trival inverse, Q Q • An upper triangular matrix can be inverted by back-substitution 1 H 36 R. M. Rao, 2008 Architectures for QRD • There are many architectures to get the QR decomposition of any matrix. – Givens Rotations and its variations – Householder transformations, etc. • A systolic structure makes implementation straightforward and scalable. • Givens rotations based QRD has a nice and easy systolic structure. 37 R. M. Rao, 2008 Givens Rotations • For a 2x1 vector of real numbers c s a a 2 + b2 s c b 0 c a a +b 2 2 ,s b a 2 + b2 • For a NxM matrix, repeat the process 2 cells at a time. a11 a12 a 21 a22 a31 a32 a13 a11 a12 a23 a21 a22 a33 0 a32 a13 a11 a12 a23 0 a22 a33 0 a32 a13 a11 a12 a23 0 a22 a33 0 0 a13 a23 a33 38 R. M. Rao, 2008 Systolic Arrays • Structured arrays with identical cells. Usually a “boundary” cell and an “internal” cell for the QRD process. Internal cell Boundary cell 1. The boundary cell generates the rotations. 2. Internal cell applies the rotations to all the cells in the row. 3. The systolic array in this figure can handle any matrix below 3x3. 39 R. M. Rao, 2008 Boundary and Internal Cell a/x x ) 2 Z 1 ( r Z 1 mode -ve in mode 0, +ve in mode 1 1 ve c/1 0 s/w z -ve This negative is needed since W12=-(W11a12)W22 1 c s/w Z 1 mode c/1 mode This register needs to be initialized to 1, since in the first cycle the output needs to be +1 40 R. M. Rao, 2008 Triangularization mode a31 a21 a11 a32 a22 a12 a33 a23 a13 The rotation factors for zeroing out cell A(2,1) are stored in cell A(1,2), etc. a11 a12 a 21 a22 a31 a32 a13 a11 a12 a23 0 a22 a33 a31 a32 a13 a11 a12 a23 0 a22 a33 0 a32 a13 a11 a12 a23 0 a22 a33 0 0 a13 a23 a33 • For QRD of upto a 3x3 matrix we need 3 boundary cells and 3 internal cells. • Boundary cells calculate rotation vectors and internal cells store them. • Data is fed column-wise into the systolic array. • This may have to be staggered depending on the pipelining delays thru the boundary cell and internal cell. 41 R. M. Rao, 2008 Back-substitution mode • 1 0 if (i j ) Wij 0 0 a a 11 12 0 a 22 0 0 elseif (i j ) a13 W W W 11 12 13 a23 0 W22 W23 0 W33 a33 0 Wij rjj1 elseif (i j ) j 1 Wij Wim rmj rjj1 m 1 end • ( ) (W a + W a )W W12 W11 a12 W22 W13 11 13 Computing R-1 with backsubstitution 12 23 The Wij rjj1 is already computed in the boundary cell and stored away. So just use it. 33 42 R. M. Rao, 2008 Q-matrix computation mode 0 0 1 0 1 0 1 0 0 s x * I .s + s * I .c z x * I .c* s * I .s* c c; QH A R QH I QH first column of Q matrix second column of Q matrix third column of Q matrix 0 1 0 c 32 0 s32 0 c21 s21 0 c31 0 s31 a11 a12 s32 s21 c21 0 0 1 0 a21 a22 c32 0 0 1 s31 0 c31 a31 a32 QH A a13 a11 a12 a23 0 a22 a33 0 0 a13 a23 a33 R 43 R. M. Rao, 2008 Scalability • 4x1 matrix • A 4x4 systolic array needs 4 boundary cells and 6 internal cells and can handle all matricies below 4x4. (i.e. 1x1, 1x2, .. 2x2, …, 3x4, 4x4) But if your design is restricted to only a 2x2, you need only a 2x2 systolic array. With this you can handle 1x1, 1x2 and 2x2. 4x4 matrix 2x2 matrix 3x3 matrix 44 R. M. Rao, 2008 FPGA Tools for DSP Systems Design • Higher level tools are raising the level of abstraction. • Allows non-hardware engineers (algorithm designers) to get a first look at hardware. • System Generator – Simulink to Hardware • C-to-Gates tools – C or “higher” level languages to gates 45 R. M. Rao, 2008 Xilinx DSP Tools and Flows Accelerate DSP Design C/C++ ESL Partners MATLAB Language Based Flow RTL RTL MATLAB / Simulink Simulink Mixed Flow Graphical Based Flow RTL RTL FPGA Implementation with ISE 46 R. M. Rao, 2008 System Generator System Level Modeling & Simulation Framework HDL C Work in the language of your problem 47 R. M. Rao, 2008 System Generator Flow DSP Development Flow 1. Develop Algorithm & System Model HDL Simulation Flow Simulink MDL 2. Automatic Code Generation RTL VHDL & Cores HDL Test Bench Test Vectors 3. Xilinx Implementation Flow Bitstream Download to FPGA FPGA 48 R. M. Rao, 2008 Hardware/Software Co-simulation •Encapsulates HDL semantics •Simulink as verification framework Hardware co-simulation HDL co-simulation 49 R. M. Rao, 2008 FlexOFDM • A Configurable MIMO-OFDM Technology Demonstrator. • Not specific to any standard, but can be configured (with some effort) to showcase technologies that are part of some of the Wireless standards. • Provides an architecture for the PHY and MAC layers, which can act as a starting point or spring board for product development. • Investigate communication algorithms and architectures as they efficiently map to Xilinx FPGAs. This is not a product/IP from Xilinx, but is available to partners, to speed up their MIMO-OFDM development efforts, on an AS IS basis. ADVANCED SYSTEMS TECHNOLOGY GROUP (ASTG) 50 R. M. Rao, 2008 Configurable MIMO-OFDM Transmitter Add Cyclic Extension/Block Shaping Time shared FFT across antennas Pilot insertion and data loading 1 RealOut1 double 1 RealIn DataIn DataIn SampleClk 3 RealOut double double DataDone Bool 2 sysgen not Inverter Bool RealOut double SampleClk and sysgen -0 z Bdata double Start rfd Logical2 Zeroblks double DataSubc ImagOut Preamble Bool double xk_im xn_im double SampleClk Bdata double Enable xk_index FFT double rfd double RealIn RealOut DataRequest Bool vout RealIn double RealOut1 ImagOut1 ImagIn UFix_6_0 double ImagOut Fix_16_10 RealOut2 ImagIn ImagOut2 Addr RealOut3 start Preamble Fix_16_10 double WriteFIFO ReadFIFO double WriteFIFO ImagOut3 BFrame DataEnable and sysgen -0 z double Preamble Bdata DataEnable ImagOut Zeroblks xk_re xn_re ImagIn Packetization and Encoding DataSubcarrier FFTbusy Pilot Insertion and Data loading double enable Busy FFT double RealOut4 Add Cyclic Extension BaudClk ImagOut4 double double double double 4 double ImagOut2 double double Clock Generator BaudClk double 6 ImagOut3 RealOut4 BFrame SampleClk 5 RealOut3 7 Logical double 3 RealOut2 double Spatial Demultiplexing double 2 ImagOut1 Packet Controller Packet Controller double 8 Packetization and configurable STBC encoding ImagOut4 Spatial Demultiplexing and Interpolation Clock Generator Resource sharing (folding factor) Ratio of System clock rate to symbol rate > 8 needed for a 4 transmit antenna system 51 R. M. Rao, 2008 MIMO Receiver Architecture Combine PD Block Boundary Detection Packet Detection Block Boundary Rx 1 Strip CP Input FIFO FFT Output FIFO Strip CP Input FIFO FFT Output FIFO Rx 2 Packet Detection Coarse CFO estimate Rx 3 Packet Detection CFO estimator CFO Compensator Packet Detection MIMO Decode Strip CP Input FIFO FFT Output FIFO Strip CP Input FIFO FFT Output FIFO Soft Decisions Coarse CFO estimate Rx 4 Pilot based CFO estimator Preamble Packet Controller Payload Samples processed at sample clock rate Channel Estimator Samples processed at system clock rate MIMO Decoder FIFO MIMO Decoder Matrix (MMSE, etc) 52 R. M. Rao, 2008 Fine-timing acquisition using a clipped correlator 10 time multiplexed correlators in0 out0 Full precision correlators : 32 embedded multipliers 896 flipflops in1 Register1 2 xnz 1 1 d xn 2 addr sysgenq DAddr coef f yn R -1 addr sysgen z bsysgen ab a 2 a sysgen cast coeff bc2 1 3 sysgen cast yn ld bc3 en 3 a -1 sysgen z Delay2 Data Addr BaudClkCoef Addr BaudClk AddSub 1 yn en 1-bit correlator MAC ROM1 LD 2 -1 sysgen z q ld CAddr 4 d sub -1 sysgen z -1 sysgen z Delay2 Delay4 -2 sysgen z Delay load FSM 1 xn x DAddr DAddr DAddr DAddr DAddr DAddr DAddr CAddr CAddr CAddr CAddr CAddr CAddr CAddr xn xnz LD xnz yn xnz LD C4 C6 C7 Delay5 a -1 sysgen z AddSub1 -1 sysgen z -8 sysgen zen a+b b en a a+b b en a+b b en -1 sysgen zen AddSub12 a xnz LD C5 -8 sysgen zen Delay1 AddSub yn xnz LD C3 Each 1-bit correlator : 10 slices Total for clipped correlator : 589 slices xn yn xnz LD C2 xn yn LD C1 xn yn -1 sysgen z -1 sysgen z en Delay3 a AddSub13 a+b b en Delay7 a xnz LD xn yn a+b b en xn yn -1 AddSub2 sysgen z -1 sysgen z -7 sysgen zen a Bank of correlators AddSub4 a+b b en Delay6 -1 sysgen z 1 y 53 R. M. Rao, 2008 MIMO-OFDM Receiver Packet Detection 1 -1 en RealIn1 z RealIn1 9 PacketDetect Delay -1 z 2 PacketDetect en ImagIn1 Delay1 -1 z 3 ImagIn1 en RealIn2 Delay2 -1 z en 4 ImagIn2 CFO_Est Delay4 -1 z en Carrier Frequency Offset Correction Display2 en 6 0 Fine Timing Acq RealIn2 Delay3 -1 z 5 RealIn3 ImagIn3 ImagIn2 Delay5 -1 z 7 en RealIn4 PktDetPulse Delay6 -1 z 8 Baud_clk en ImagIn4 Delay7 MIMO Packet Detect1 9 Reset a b z -1 Delay8 a -b Output FIFO AddSub RealIn Out2 Cyclic prefix removal 0 FFT Display1 ImagIn SampleClk Clock Generator BBDValid BaudClk BaudClk Clock Generator Fine T iming Acquisition BlkBounDetect Rxreal1 ReadEnable Out_real1 WriteFIFO RealIn1 RxReal1 RxOut1 RxStream1 ImagIn1 Rxreal1 Rximag1 RxStream1 Out_imag1 Rximag1 RxImag1 RealIn2 RxStream1 RxStream2 Rxreal2 Out_real2 Rxreal2 RxOut2 ImagIn2 RealIn3 RxStream2 RxReal2 Rximag2 RxStream2 ImagIn3 RxOut3 RxStream3 Rxreal3 Enable Rximag3 ReadFIFO RxStream4 Rxreal3 Rximag3 Out_imag3 RxImag3 RxOut4 Rxreal4 Rxreal4 PacketDetect Out_real4 Out_imag1 CFO_est RxStream4 BaudClk RxReal4 Rximag4 FIFO_status_f lag FFT_Start 1 SoftDecReal1 Rximag2 Out_real3 RxReal3 RxStream3 ImagIn4 Out_imag2 RxImag2 RxStream4 RealIn4 Out_real1 RxStream3 Rximag4 Out_imag4 ValidData ReadFIFO FIFO_status_f lag Cyclic Prefix Removal Input Buffer 2 SoftDecImag1 ReadFIFO RxImag4 CFO_Valid Enable Addr Valid out Addr Addr CFO_Valid AddrOut wreal_1_1 v alid_out Output FIFO wimag_1_1 10 ValidOut FFT_RFD Channel Estimation wreal_1_2 Reset FFT_Start wimag_1_2 FFT wreal_1_3 wimag_1_3 ReadWeightMatrix wreal_1_4 wimag_1_4 wreal_2_1 Rxreal1 Ch_1_1 Ch_tx1rx1 Ch_1_2 Ch_tx1rx2 wreal_1_1 Ch_1_3 Ch_tx1rx3 Ch_1_4 Ch_tx1rx4 Ch_2_1 Ch_tx2rx1 Ch_2_2 Ch_tx2rx2 Ch_2_3 Ch_tx2rx3 Ch_2_4 Ch_tx2rx4 Ch_3_1 Ch_tx3rx1 Ch_3_2 Ch_tx3rx2 Ch_3_3 Ch_tx3rx3 Ch_3_4 Ch_tx3rx4 wreal_1_4 wreal_2_4 Ch_tx4rx3 Ch_4_4 Ch_tx4rx4 wreal_4_1 Out_imag3 wreal_3_4 MIMO Decoder 6 SoftDecImag3 wimag_4_2 wreal_4_3 wimag_3_4 wimag_4_3 wreal_4_4 En wimag_3_3 wimag_4_1 wreal_4_2 CFO_Est wreal_3_3 wimag_3_4 ValidData Addr 5 SoftDecReal3 wimag_3_2 wimag_3_3 wreal_3_4 Ch_tx4rx2 Out_real3 wreal_3_2 wimag_3_2 wreal_3_3 Ch_tx4rx1 wimag_3_1 wimag_3_1 wreal_3_2 Ch_4_3 wreal_3_1 wimag_2_4 wreal_3_1 Ch_4_2 4 SoftDecImag2 wimag_2_4 wimag_2_3 Rximag3 Ch_4_1 Out_imag2 wreal_2_4 wimag_2_2 wreal_2_3 Rximag4 wimag_2_3 wimag_2_1 wreal_2_2 Rxreal4 wreal_2_3 wimag_1_4 wreal_2_1 Rxreal3 3 SoftDecReal2 wimag_2_2 wimag_1_3 Rxreal2 Rximag2 Out_real2 wreal_2_2 wimag_1_2 wreal_1_3 Weight Matrix Computation wimag_2_1 wimag_1_1 wreal_1_2 Rximag1 wreal_4_1 wimag_4_4 wimag_4_1 ReadAddr CFO_Est_Valid Addr Weight Matrix Computation Out_real4 wreal_4_2 7 SoftDecReal4 Channel Estimation wimag_4_2 wreal_4_3 wimag_4_3 Out_imag4 wreal_4_4 8 SoftDecImag4 wimag_4_4 BaudClk MIMO Decoder 54 R. M. Rao, 2008 Channel Estimation Channel Estimation Pilots for Tx1 rst UFix_6_0 sysgen -2 UFix_6_0 out sysgen z en Enable Pilots double Counter1 Channel Estimation Pilots for Tx4 Enable Enable Delay2 Pilot_real double Enable Pilot_real Fix_2_0 Fix_2_0 Pilot_real sel rst UFix_6_0 sysgen out -1 sysgen Delay6en z Bool Counter2 UFix_6_0 10 -2 sysgen z sysgen d0 UFix_6_0 Reset sel Reset Addr d1 sysgen d0 Mux1 Training Symbols Tx1 UFix_6_0 Reset Reset double Training Symbols Tx3 Training Symbols Tx2 d1 Training Symbols Tx4 Mux 7 Addr Delay3 5 addr addr Fix_16_10 -2 Fix_16_10 sysgen 1 -2 sysgen z Delay4 Fix_16_10 2 Fix_16_10 Fix_32_20 x sysgen 0.3535 imag_out Real Imag Fix_16_10 CMult Imag WE Delay5 Rximag1 real_out Real z Rxreal1 real_out Pilots1 imag_out Fix_16_10 Real_in double addr double Imag Imag_in Fix_32_20 VDATA ChEst Tx1-Rx1 CMult1 Real_in double 1 Single Port RAM Imag_in addr Real Fix_32_20 Fix_32_20 Imag imag_out Real_in Fix_16_10 Fix_16_10 Chreal3 6 addr Chimag3 Real Fix_32_20 VDATA real_out Pilots1 Imag WE 3 Chreal2 4 ChEst Tx2-Rx1 real_out Pilots1 double WE VDATA EN imag_out Real Fix_32_20 WE Fix_32_20 x sysgen 0.3535 real_out Pilots1 imag_out Fix_16_10 Chreal4 8 Fix_16_10 Chimag4 Real_in WE Imag_in Fix_32_20 VDATA ChEst Tx3-Rx1 Imag_in ChEst Tx4-Rx1 Chimag2 Chreal1 2 Chimag1 addr addr Fix_16_10 -2 Fix_16_10 sysgen 3 Rxreal2 Delay7 -2 sysgen z Fix_16_10 double x sysgen 0.3535 imag_out Real Imag Fix_16_10 CMult2 Imag Fix_16_10 4 real_out Pilots1 real_out Real z WE Delay8 imag_out Rximag2 Fix_16_10 Real_in double double double double (8) 9 Chreal5 10 addr Chimag5 Real Imag WE Fix_32_20 VDATA Imag_in VDATA ChEst Tx1-Rx2 CMult3 imag_out Real_in double Imag_in addr Real double Fix_32_20 Imag 11 Chreal6 ChEst Tx2-Rx2 12 Single Port RAM1 real_out Pilots1 double WE Fix_32_20 x sysgen 0.3535 EN real_out Pilots1 imag_out Real_in Chreal7 double addr real_out Pilots1 Fix_16_10 14 Chimag7 WE VDATA simout11 13 Fix_16_10 Real Imag imag_out Fix_16_10 15 To Workspace2 Chreal8 Fix_16_10 16 Real_in Chimag8 WE Imag_in Fix_32_20 VDATA ChEst Tx3-Rx2 Imag_in ChEst Tx4-Rx2 Chimag6 19 17 addr addr 5 Fix_16_10-2 sysgen z Fix_16_10 Rxreal3 Fix_16_10 Delay9 6 -2 sysgen z Rximag3 Real Fix_16_10 Fix_32_20 x sysgen 0.3535 Real Imag Fix_16_10 CMult4 Imag WE Delay10 real_out imag_out Fix_16_10 Fix_16_10 Chreal9 18 Pilots1 real_out imag_out Real_in Fix_16_10 x sysgen 0.3535 EN VDATA Imag_in real_out imag_out Real_in Fix_16_10 Chreal10 20 Fix_16_10 Chimag10 VDATA addr real_out Pilots1 Real Fix_32_20 Imag WE Fix_32_20 imag_out Real_in Fix_16_10 21 Chreal11 Fix_16_10 Fix_32_20 VDATA addr real_out Pilots1 22 Fix_32_20 Chimag11 WE Imag_in ChEst Tx2-Rx3 ChEst Tx1-Rx3 CMult5 Real Imag Fix_32_20 WE Fix_32_20 addr Pilots1 Chimag9 Real Imag imag_out Fix_16_10 Fix_16_10 23 Chreal12 24 Chimag12 Real_in WE Imag_in Fix_32_20 VDATA ChEst Tx3-Rx3 Imag_in ChEst Tx4-Rx3 Single Port RAM2 25 27 Chreal13 addr addr 7 Fix_16_10 -2 sysgen z Fix_16_10 Rxreal4 Delay11 8 -2 sysgen z Real Fix_16_10 Fix_32_20 x sysgen 0.3535 Real Imag Fix_16_10 CMult6 Imag Fix_16_10 WE Delay12 imag_out Rximag4 real_out Pilots1 real_out Fix_16_10 imag_out Real_in Fix_16_10 VDATA addr Fix_32_20 real_out Pilots1 Real WE Fix_32_20 Chreal14 26 Chimag13 Fix_16_10 Imag imag_out Real_in Fix_16_10 Fix_16_10 Fix_32_20 VDATA 28 Chimag14 Chreal15 addr Fix_32_20 real_out Pilots1 Real WE Imag_in 29 Imag imag_out Real_in Fix_16_10 Fix_16_10 Fix_32_20 VDATA addr real_out Pilots1 Real Fix_32_20 WE Imag_in 30 Chimag15 Imag imag_out Fix_16_10 31 Chreal16 Fix_16_10 32 Chimag16 Real_in WE Imag_in Fix_32_20 VDATA Imag_in x sysgen 0.3535 EN CMult7 ChEst Tx1-Rx4 ChEst Tx2-Rx4 ChEst Tx3-Rx4 ChEst Tx4-Rx4 Single Port RAM3 9 double ValidData 11 double and sysgen -2 z Bool Logical ChEstPilots -3 sysgen z ValidData ChEstRst En Rst ChEstPilots En2 ChEstPilots1 Bool Delay1 ChEstEn double double double double double ControlSignals Input FIFO 12 4x4 Channel Estimation Memory double ReadAddr Control Signals 55 R. M. Rao, 2008 Packet Detection Identical halves of 1 OFDM symbol Schmidl and Cox algorithm for Packet Detection and coarse carrier frequency offset estimation. T. M. Schmidl, D. C. Cox, “Low Overhead Low Complexity Synchronization for OFDM”, ICC 1996, Vol 3, pp 1301-1306. r(n) * Z-D * C c(n) P p(n)( )2 2 m(n) 2 RealIn1 double 1 RealIn Fix_16_8 sysgen force sysgen cast Reinterpret1 Fix_8_0 sysgen z-32 en Convert RealOut ImagIn1 Fix_8_0 RealIn2 ImagOut ImagIn sysgen force Fix_16_8 Reinterpret sysgen cast BaudClk Convert1 Fix_8_0 sysgen z-32 en Delay1 3 Bool Fix_16_0 RealIn1 ImagIn Correlate double RealIn ImagIn1 Fix_8_0 BaudClk 2 CorrMetric RealOut Complex Multiply ImagIn2 Delay Fix_8_0 Fix_8_0 ImagOut Complex double RealOut Multiply a Fix_16_0 a>b sysgen z-1 En BaudClk 1 PacketDetect Complex Sliding Window Averager Power Calculator1 b Relational RealIn1 ImagIn1 double Complex Fix_8_0 PwrOut Multiply DecisionMetric = (CorrelationPeak >= 0.5(AveragePower)) In In1 Out Fix_16_0 Out1 BaudClk BaudClk Power Calculator Fix_32_0 X >> 1 sysgen z-0 double Shift En 3 Sliding Window Averager Squarer AvePower 56 R. M. Rao, 2008 Pre-FFT Carrier Frequency Offset Estimation z-14 en 1 RealIn 1 ˆ Delay5 2 ImagIn1 N 2p s 2 RealIn 1 ImagIn 1 3 RealIn 2 CorrMetric _ real In 1 Out 1 CorrMetric _ imag In 2 Out 2 In 3 Out 3 x ImagIn 2 BaudClk 5 Baud_clk Rising edge detector z -17 RealIn 2 4 ImagIn2 mag Out 1In 1 7 BBD AvePwr Rst Packet Detection 3 z-24 en y atan x 0 . 003906 -2 z cast Convert CORDIC ATAN Truncate CMult8 d rstz - 1 q en 1 CFO_Est Register1 Delay6 6 Rst The angle of the correlation metric is proportional to the Carrier frequency offset. Right size the number of bits before the CORDIC operation. CORDIC ATAN from the Xilinx Math library calculates the angle. 57 R. M. Rao, 2008 Carrier Frequency Offset Correction 9 Direct digital synthesizer (DDS) from the Xilinx DSP SysGen library. Fix_16_12 Fix_16_16 x 0 . 01563 CFO_Est freq_off cos_out 11 CMult Fix_16_15 Enable CFO_Est_valid 12 double sin_out Reset Fix_16_15 Fix_17_15 x(- 1 ) Reset DDS 1 Fix_16_10 z Fix_16_10 -1 RealIn 1 2 z Fix_16_10 Negate 1 RealIn 1 Fix_16_10 -1 RealIn 2 Delay 1 ImagIn 1 RealOut ImagIn 1 Delay 1 RealOut 1 ImagIn 2 BaudClk Fix_16_10 Complex Multiply ImagOut Fix_16_10 2 ImagOut 1 Complex Multiply 3 Fix_16_10 z Fix_16_10 -1 RealIn 2 4 z Fix_16_10 RealIn 1 Fix_16_10 -1 RealIn 2 Delay 3 ImagIn 2 RealOut ImagIn 1 Delay 2 3 RealOut 2 ImagIn 2 BaudClk Fix_16_10 Complex Multiply ImagOut Fix_16_10 4 ImagOut 2 Complex Multiply 1 5 Fix_16_10 z Fix_16_10 -1 RealIn 3 6 z RealIn 1 Fix_16_10 -1 Fix_16_10 RealIn 2 Delay 5 ImagIn 3 RealOut ImagIn 1 Delay 4 5 RealOut 3 ImagIn 2 BaudClk Fix_16_10 Complex Multiply ImagOut Fix_16_10 6 ImagOut 3 Complex Multiply 2 7 Fix_16_10 z Fix_16_10 -1 RealIn 4 8 z Fix_16_10 -1 RealIn 1 Fix_16_10 RealIn 2 ImagIn 4 RealOut ImagIn 1 Delay 6 Delay 7 ImagIn 2 1 Constant 3 Bool BaudClk Fix_16_10 ImagOut 7 RealOut 4 Complex Multiply Fix_16_10 8 ImagOut 4 Complex Multiply 3 Bool or z -0 Logical 1 0 UFix_16_0 a a <b Constant 1 bz 10 Bool In1 Out1 Bool rst out UFix_16_0 Bool -0 Relational and -0 z Bool FFT_Start Rising edge detector aa < = b -0 bz Counter 78 UFix_16_0 Bool Logical Relational 1 Constant 2 58 R. M. Rao, 2008 Design methodology issues • FPGA tools – Where to from here? • C-to-gates – Higher level design languages to gates – Raising the level of abstraction 59 R. M. Rao, 2008 ‘C’ or higher level language to Gates • There is interest in higher level design methodologies, such as C-to-Gates from the design community. • ESL (Electronic system level) tools/design methodologies are being explored. • But, extracting all the concurrency from a sequential description is not an easy problem. 60 R. M. Rao, 2008 C to Gates evaluation flow Source: BDTI. For more info and results see www.BDTI.com. 61 R. M. Rao, 2008 C to Gates evaluation by BDTI Source: BDTI. For more info and results see www.BDTI.com. 62 R. M. Rao, 2008 Conclusion • FPGAs are finding wide use in infrastructure communication systems and signal processing systems. • FPGA are an efficient choice for exploring VLSI architectures. • FPGA tools are raising the level of abstraction to allow algorithm designers the ability to explore h/w architectures without learning “h/w design tools/languages”. 63 R. M. Rao, 2008 Questions? 64 R. M. Rao, 2008