DSP for FPGA
SYSC5603 (ELG6163) Digital Signal Processing Microprocessors, Software and Applications
Miodrag Bolic

Objectives
• Comparison between PDSP and FPGA
• Virtex II Pro
• Altera Stratix FPGA
• Stratix DSP Block and its configuration
• Altera design flow

What Is an FPGA?
• Field Programmable Gate Array
• Device that Has a Regular Architecture (Set of Blocks) that Can Be Programmed for Various Functions
• “Glue” Logic
• Customizable Hardware Solution
• Configurable Processors

Why Use FPGAs in DSP Applications?
• 10x More DSP Throughput Than DSP Processors
  – Parallel vs. Serial Architecture
• Cost-Effective for Multi-Channel Applications
• Flexible Hardware Implementation
• Single-Chip Solution
  – System (Hardware/Software) Integration Benefits
[Figure: DSP System built from Software, DSP and FPGA, next to a single FPGA containing Software and Embedded Processor]

DSP Processors vs. FPGAs
High Speed DSP Processor
[Figure: MAC units, 1-8 Multipliers]
Needs looping for more than 8 multiplications
Needs multiple clock cycles because of serial computation
200 Tap FIR Filter would need 25+ clock cycles per sample with an 8 MAC unit processor

High Level of Parallel Processing in FPGA
[Figure: array of MAC units]
Can implement hundreds of MAC functions in an FPGA
Parallel implementation allows for faster throughput
  – 200 Tap FIR Filter would need 1 clock cycle per sample

Extending Range of Altera Reconfigurable DSP Solutions
[Figure: Performance (MMACs/sec), axis marks at 100 and 600. Labels: Embedded Processors; Embedded Processors, Hardware Acceleration; Complete Hardware Implementation]

Comparison of DSP Devices
Benefits: Programmable DSP Processors
• Easy to Use
• Programmed Via C-Code or Assembly
• Fast Development Time
Benefits: Reconfigurable DSP
• Easy to Use
• Programmed via C-Code, Assembly, or HDL
• Efficient for Recursive Algorithms Using DSP IP Cores
• Higher Levels of Integration
Weaknesses: Programmable DSP Processors
• Fixed Architecture
• Inefficient for Highly Recursive Algorithms Unless Hardware Accelerated
• Potential Bus Bottlenecks
• Other Devices (FPGAs) Often Used on Board for Other Functions
Weaknesses: Reconfigurable DSP
• Longer Development Time (But Getting Shorter!)

Stratix EP1S10 [2]
[Figure: EP1S10 device]

TriMatrix™ Memory [1]
• M512 Blocks: 512 bits per block + parity
  – Small FIFOs
  – Shift Register
  – Rake Receiver Correlator
  – FIR Filter Delay Line
• M4K Blocks: 4 Kbits per block + parity
  – Header / Cell Storage
  – Channelized Functions
  – ATM cell–packet processing
  – Nios Program Memory
  – Look-Up Schemes
  – Packet & Cell Buffering
  – Cache
• M-RAM: 512 Kbits per block + parity
  – Packet / Data Storage
  – Nios Program Memory
  – System Cache
  – Video Frame Buffers
  – Echo Canceller Data Storage
• Dedicated External Memory Interface
More Bits For Larger Memory Buffering
More Data Ports for Greater Memory Bandwidth

Memory Bandwidth Summary
Stratix Device Family [1]
Device   Total RAM Bits   M-RAM Blocks   M4K Blocks   M512 Blocks   Maximum Bandwidth (Mbps)
EP1S10   920,448          1              60           94            1,245,024
EP1S20   1,669,248        2              82           194           2,096,928
EP1S25   1,944,576        2              138          224           2,894,400
EP1S30   3,317,184        4              171          295           3,750,192
EP1S40   3,423,744        4              183          384           4,384,800
EP1S60   5,215,104        6              292          574           6,762,528
EP1S80   7,427,520        9              364          767           8,784,720

Logic Element (LE) [2]
[Figure: LE functional diagram. Labels: LUT Chain Input, Register Chain Input, addnsub (2), Register Control Signals, cin, data1, data2, data3, data4, Sync Load & Clear Logic, 4-Input LUT, register, Row, Column & DirectLink Routing, Local Routing, Register Feedback, LUT Chain Output, Register Chain Output]
Note:
1) Functional Diagram Only. Please See Datasheet for More Details.
2) addnsub & data1 connected via XOR logic

Dynamic Arithmetic Mode
[Figure: LE in dynamic arithmetic mode. Labels: LAB Carry-In, Carry-In0, Carry-In1, Register Chain Input, Register Control Signals, Carry-In Logic, addnsub, data1, data2, data3, Sync Load & Clear Logic, Sum Calculator, Carry Calculator, Carry-Out Logic, register, Row, Column & DirectLink Routing, Local Routing, Register Chain Output, Carry-Out0, Carry-Out1]
Note: Functional Diagram Only. Please See Datasheet for More Details.

Logic Array Blocks (LAB) [2]
• 10 LEs
• Local Interconnect
• LAB-Wide Control Signals
[Figure: 30 LAB Input Lines and 10 LE Feedback Lines enter the Local Interconnect, which feeds LE1 to LE10 together with the Control Signals]

Avalon Switch Fabric Contents
• Avalon Switch Fabric provides the following to the peripherals it connects
  – Data-Path Multiplexing
  – Address Decoding
  – Wait-State Generation
  – Dynamic Bus Sizing
  – Interrupt-Priority Assignment
  – Latent Transfer Capabilities
  – Streaming Read and Write Capabilities
• Avalon Switch Fabric tailors transactions to the characteristics of the peripherals that are attached

SOPC Design Example
Allows Masters and Slaves to communicate without knowledge of each other's interface details
[Figure: Avalon Switch Fabric connecting masters and slaves.
Masters: CPU 32 Bit (Inst Master, Data Master); DMA Controller With Streaming (Control Port (Slave), Read Port (Master – Streaming), Write Port (Master – Streaming)).
Slaves: Instruction Memory (32-bit Data Path); Data Memory (32-bit Data Path); UART; Avalon Tri-State Bridge to External FLASH (1 MB, 16-bit Datapath) and External SRAM (256 KB, 32-bit Datapath); VGA Controller]

Data Path Multiplexing & Slave Arbitration
[Figure: the SOPC design example with the internals of the Avalon Switch Fabric shown]
1. Data-Path Multiplexing (MUX)
2. Slave Arbitration (Arbiter)
3. Address Decoding

DSP Blocks
• Eight 9 × 9 bit multipliers
• Four 18 × 18 bit multipliers
• One 36 × 36 bit multiplier

DSP Blocks (cont.)
The DSP block consists of
• A multiplier block
• An adder/subtractor/accumulator block
• A summation block
• An output interface
• Output registers
• Routing and control signals

Stratix DSP Blocks
• High Performance Dedicated Multiplier Circuitry
  – 18x18 Functions at 280 MHz
• Add, Accumulate or Subtract
  – Signed & Unsigned Operations
  – Dynamically Change between Add & Subtract
  – Supports DSP Requirements Including Complex Numbers
• Variable Operand Widths with Full Precision Outputs
  – 9x9 (8 Max.)
  – 18x18 (4 Max.)
  – 36x36 (1 Max.)
[Figure: DSP block data path. Labels: Input Register Unit, Optional Pipelining, add/subtract/accumulate units, summation unit, Output Multiplexer, Output Register Unit]

DSP Block for 18 x 18-bit Mode
[Figure: Shift Register Chain feeding the multipliers, followed by the Adder/Output Block]

Time-Domain Multiplexed FIR Filters
[Figure: TDM FIR filter structure]

Operation of TDM Filter
[Figure: TDM filter operation]

Resource Savings with DSP Blocks
• DSP Block
  – Reduces LE Usage
  – Reduces Routing Congestion
  – Reduces Power
  – Maintains Performance
90% of your problems are hidden under the surface!
[Figure: 18-bit inputs into two multipliers, 36-bit products into adders, 38-bit result]
SAVES 652 ROUTING NETS!
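
Sketch: time-multiplexed vs. fully parallel FIR
The cycle counts quoted earlier (25 clock cycles per sample for a 200 tap filter on 8 MAC units, 1 clock cycle when every tap has its own multiplier) and the idea behind a TDM filter can be illustrated with a small Python model. This is only a behavioural sketch, not Altera's implementation: the names fir_direct and fir_tdm, the coefficients and the input samples are made up for illustration, and the model assumes one clock cycle per group of multiplications that fits into the available multipliers.

```python
import math
from collections import deque


def fir_direct(samples, coeffs):
    """Fully parallel reference: every tap is multiplied in the same cycle."""
    delay = deque([0] * len(coeffs), maxlen=len(coeffs))
    out = []
    for x in samples:
        delay.appendleft(x)  # newest sample enters, oldest one drops off
        out.append(sum(c * d for c, d in zip(coeffs, delay)))
    return out


def fir_tdm(samples, coeffs, multipliers):
    """Time-multiplexed model: `multipliers` MACs are reused over several cycles.

    Returns (outputs, cycles_per_sample).
    """
    if multipliers <= 0:
        raise ValueError("multipliers must be positive")
    delay = deque([0] * len(coeffs), maxlen=len(coeffs))
    out = []
    for x in samples:
        delay.appendleft(x)
        taps = list(delay)
        acc = 0
        # Each pass of this loop stands for one clock cycle of the shared MACs.
        for start in range(0, len(coeffs), multipliers):
            stop = start + multipliers
            acc += sum(c * d for c, d in zip(coeffs[start:stop], taps[start:stop]))
        out.append(acc)
    return out, math.ceil(len(coeffs) / multipliers)


coeffs = list(range(1, 201))            # 200 illustrative integer taps
samples = [3, -1, 4, 1, -5, 9, 2, -6]   # illustrative input samples

parallel = fir_direct(samples, coeffs)
shared, cycles_8 = fir_tdm(samples, coeffs, multipliers=8)
_, cycles_200 = fir_tdm(samples, coeffs, multipliers=200)

print(shared == parallel)    # True: same results, only the schedule differs
print(cycles_8, cycles_200)  # 25 1
```

The trade-off is the one the slides describe: fewer multipliers cost more clock cycles per sample, so a time-multiplexed filter has to run its multipliers faster than the sample rate.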
Design Flow

Design Flow Overview
1) Create Design in Simulink Using Altera Libraries
2) Simulate in Simulink
3) Add SignalCompiler to Model
4) Create HDL Code & Generate Testbench
5) Perform RTL Simulation
6) Synthesize HDL Code & Place & Route
7) Program Device
8) SignalTap II Logic Analyzer

Step 1 - Create Design in Simulink Using Altera Libraries
• Drag & Drop Library Blocks into Simulink Design & Parameterize Each Block

Parameterization of IP Megacores
[Figure: IP MegaCore parameter dialog]

Step 2 - Simulate in Simulink
[Figure: Simulink simulation results]

Step 3 - Add “SignalCompiler” to Model to Generate HDL Code
[Figure: SignalCompiler dialog. Callouts: device list, synthesis tool list, Speed vs. Area, Testbench Generation, Message Window]
• APEX20K/E/C
• APEX II
• Stratix & Stratix GX
• Cyclone & ACEX 1K
• Mercury
• FLEX10K & FLEX 6000
• DSP Boards

• Leonardo Spectrum
• Synplify
• Quartus II

Step 4 - Create HDL Code & Generate Testbench
Enable "Generate Stimuli for VHDL Testbench" Button
[Figure: HDL Code Generation from AltrFir32.mdl to AltrFir32.vhd]

DSP Builder Report File
• Lists All Converted Blocks
  – Port Widths
  – Sampling Frequencies
  – Warnings & Messages

Step 5 - Perform RTL Simulation (ModelSim)
1) Set working directory (File => Change Directory)
2) Run TCL file (Tools => Execute Macro)

Perform Verification
[Figure: ModelSim vs Simulink results]

Step 6 - Synthesize HDL & Place & Route
• Leonardo Spectrum
• Synplify
• Quartus II
  – Synthesis
  – Quartus II Fitter

Step 7 - Program Device
Download Design to DSP Development Kits

Stratix DSP Development Board
[Figure: board photo. Labels: Nios Expansion Prototype Connector, MAX 7000 Device, Prototyping Area, D/A Converters, Mictor-Type Connectors for HP Logic Analyzers, A/D Converters, Analog SMA Connectors, Texas Instruments Connectors on Underside of Board, 40-Pin Connectors for Analog Devices]

Stratix DSP Board – Key Features
• Stratix EP1S25F780C5 Device (Starter Version)
• Stratix EP1S80B956C7 Device (Professional Version)
• Analog I/O
  – Two 12-bit, 125 MHz A/D Converters
  – Two 14-bit, 165 MHz D/A Converters
• Digital I/O
  – Two 40-pin Connectors for Analog Devices A/D Converter Evaluation Boards
  – Connector for TI TMS320 Cross-Platform Daughter Card
  – 3.3V Expansion/Prototype Headers
  – RS-232 Serial Port
• Memory
  – 2 Mbytes of 7.5-ns Synchronous SRAM
  – 32 Mbytes of FLASH

Step 8 - SignalTap II Logic Analyzer
• Embedded Logic Analyzer
  – Downloads into Device with Design
  – Captures State of Internal Nodes
  – Uses JTAG for Communication

SignalTap II Logic Analyzer: Analysis of Imported Data
[Figure: Imported Data and Imported Plot]
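
Sketch: checking imported data against a reference
Once captured node values have been imported for analysis, a typical first check is to compare them with the expected samples from simulation. The Python sketch below shows one way this could look. The slides do not describe the export format, so it simply assumes the captured values are already available as unsigned integers of a known bus width; the helper names to_signed and max_abs_error and all sample values are hypothetical.

```python
def to_signed(word, width):
    """Interpret an unsigned bus value as a two's-complement number."""
    if not 0 <= word < (1 << width):
        raise ValueError("word does not fit in the given width")
    return word - (1 << width) if word & (1 << (width - 1)) else word


def max_abs_error(captured_words, reference, width):
    """Largest absolute difference between captured and expected samples."""
    if len(captured_words) != len(reference):
        raise ValueError("length mismatch")
    signed = [to_signed(w, width) for w in captured_words]
    return max((abs(s - r) for s, r in zip(signed, reference)), default=0)


captured = [0x000, 0x7FF, 0x800, 0xFFF]  # made-up 12-bit bus values
expected = [0, 2047, -2048, 0]           # made-up reference samples

print([to_signed(w, 12) for w in captured])   # [0, 2047, -2048, -1]
print(max_abs_error(captured, expected, 12))  # 1
```

A non-zero error is not necessarily a bug: the hardware works in fixed point, so small differences from a floating-point reference are expected and the acceptable bound depends on the design.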