ENG6530 Reconfigurable Computing Systems
Digital Signal Processing using FPGAs
ENG6530 RCS

Topics
• Digital Signal Processing (DSP): definition, advantages and disadvantages, applications, …
• DSP vs. GPP vs. ASIC vs. FPGA
• Why use reconfigurable computing
• Xilinx System Generator

References
I. http://www.xilinx.com
II. "Reconfigurable Computing for DSP: A Survey", by R. Tessier and W. Burleson, 2001
III. "Optimization Techniques for Efficient Implementation of DSP in FPGAs", by J. Wang
IV. "Reconfigurable Computing: The Theory and Practice of FPGA-Based Computing", Chapter 24: Distributed Arithmetic

Introduction
The term Digital Signal Processing, or DSP, refers to the branch of electronics concerned with the representation and manipulation of signals in digital form. Application areas include:
i. Telecommunication (switches, …)
ii. Medical (images, equipment, …)
iii. Military (radar, missiles, …)
iv. Consumer (cell phones, TVs, …)

DSP Flow
• The data to be processed starts out as a signal in the real (analog) world.
• This analog signal is then sampled by means of an analog-to-digital converter.
• These samples are then processed in the digital domain.
• The digital samples are subsequently converted into an analog equivalent by means of a digital-to-analog converter.
[Figure: analog input signal → A/D → digital input samples → DSP → modified output samples → D/A → analog output signal; the converters separate the analog domain from the digital domain]

DSP Flow
[Figure: digital system design flow: signal analysis, system analysis, filter design; ADC (sampling + quantization) → DSP]
[Figure, continued: DSP → DAC; architecture topics: fixed-point arithmetic, architecture types, selection criteria]

Transition from Analog to Digital
The transition from analog to digital techniques has been driven by the many advantages of DSP:
• The main advantage of digital signals over analog signals is that the precise signal level of the former is not vital (they are immune to imperfections).
• Digital signals can be saved in memory and then recalled.
• Digital signals can convey information with greater noise immunity.
• Digital signals can be processed by digital circuit components, which are cheap and easily produced.
• Digital signals can be encrypted so that only the intended receiver can decode them.
• Precision is flexible: word lengths and/or number representation (e.g., fixed point vs. floating point) can be changed.
• A single processing element can process multiple incoming signals through multiplexing.
• Signals can be transmitted over longer distances and at higher rates.
• Digital approaches can easily adjust their processing parameters, as in adaptive filtering.

Transition from Analog to Digital
The main disadvantages of DSP:
i. Increased system complexity: DSP requires that signals be converted between analog and digital forms using a sample-and-hold circuit, analog-to-digital converters (ADCs), digital-to-analog converters (DACs), and analog filtering.
ii. Power consumption: DSP tends to require more power since a dedicated processor is used.
iii. Frequency range limitation: analog hardware can naturally work with higher-frequency signals than DSP hardware, because of the limits of analog-to-digital conversion.
For many applications, the advantages of DSP far outweigh these disadvantages.
DSP: Common Operations
Some of the most common operations performed on signals using digital or analog techniques include:
• Elementary time-domain operations: amplification, attenuation, integration, differentiation, addition of signals, multiplication of signals, etc.
• Filtering (FIR, IIR)
• Transforms (FFT, IFFT)
• Convolution (integral of the product of two functions)
• Error correction (transmission)
• Compression and decompression (audio, video)
• Modulation and demodulation (BPSK, QAM, FSK, ASK, …)
• Multiplexing and de-multiplexing
• Signal generation

DSP Applications
• Audio applications: MPEG audio, portable audio
• Wireless applications: WiFi, WiMax, Bluetooth, cellular phones, base stations, GSM, LTE
• Networking: switches, classifiers, cable modems, ADSL, VDSL
• Medical equipment: hearing aids, heart pacers
• Photography: digital cameras, CAM
• Military applications: radar

Main DSP Operations
• DSP is the arithmetic processing of digital signals sampled at regular intervals.
• DSP can be reduced to three trivial operations:
  – Delay
  – Add
  – Multiply
• Accumulate = Add + Delay
• MAC = Multiply + Accumulate
• The MAC is the engine behind DSP.
  – More MACs = higher performance, better signal quality
  – MACs vs. MIPS: not always equal
[Figure: example filters requiring 3 MACs, 50* MACs, and 100 MACs]

Alternative DSP Implementations
DSP tasks can be implemented in a number of different ways:
i. A general-purpose processor (GPP): the processor performs DSP by running an appropriate DSP algorithm.
ii. A digital signal processor (PDSP): a specialized microprocessor chip designed to perform DSP tasks much faster and more efficiently than a GPP.
iii. Dedicated ASIC hardware: a custom hardware implementation that executes the DSP task.
iv. Dedicated FPGA hardware: similar to an ASIC, except that it offers:
  – Flexibility in terms of reconfiguration
  – Embedded microprocessor cores on the FPGA

The Performance Gap
• Algorithmic complexity increases as application demands increase.
• In order to process these new algorithms, higher-performance signal processing engines are required.

Traditional DSP Approaches
• Digital signal processor IC
  – Software programmable, like a microprocessor
  – Single MAC unit
  – All processing done sequentially
  – Fit the algorithm to the architecture
[Figure: 'traditional' DSP processor: analog input → ADC → MAC, memory, data controller → DAC → analog output, plus a digital output]
• ASIC (gate array)
  – Fit the architecture to the algorithm
  – Significantly higher performance than a DSP processor
  – High cost and high risk to develop
  – Usually only for high-volume applications

The Promise of Programmable Logic
• ASIC
  – Pros: high performance, efficient IC architecture, high density, one-chip solution
  – Cons: high design risk, long design cycle
• DSP processor
  – Pros: high flexibility, good adaptability, short design cycle, low design risk
  – Cons: performance, hardware complexity
• FPGA: best from both worlds, plus:
  – System features
  – Automatic migration to low-cost HardWire

Why FPGAs?
• The most commonly used DSP functions are FIR (Finite Impulse Response) filters, IIR (Infinite Impulse Response) filters, the FFT (Fast Fourier Transform), the DCT (Discrete Cosine Transform), encoders/decoders, and error correction/detection functions.
• All of these blocks perform intensive arithmetic operations (data-path-intensive operations) such as add, subtract, multiply, multiply-add, or multiply-accumulate.

Why Use FPGAs in DSP Applications?
• 10x more DSP throughput than DSP processors
  – Parallel vs. serial architecture
• Cost-effective for multi-channel applications
• Flexible hardware implementation
• Single-chip solution
• System (hardware/software) integration benefits
[Figure: a DSP system (software on a DSP, plus an FPGA) compared with a single FPGA running software on an embedded processor]

DSP-related embedded FPGA resources
• Many FPGAs incorporate dedicated multiplier blocks (Virtex-5/6/7).
• Similarly, some FPGAs offer dedicated adder blocks.
• One operation that is very common in DSP-type applications is the multiply-and-accumulate (MAC).
• To make DSP easier to implement on FPGAs, some devices provide an entire MAC as an embedded function (Virtex-4).
[Figure: MAC built from a multiplier, an adder, and an accumulator: inputs A[n:0] and B[n:0], output Y[(2n - 1):0]]

DSP Functions are Parallel in Nature
8-Bit, 16-Tap Finite Impulse Response (FIR) Filter
[Figure: data input X[7:0] feeds a chain of registers REG 0 to REG 15 (the filter taps); taps are multiplied by the filter coefficients C0 to C7, which are shared by symmetric tap pairs; the products are accumulated into data output Y[9:0]]
Equation (symmetrical coefficients):
$Y_j = \sum_{k=1}^{n} c_k x_{kj} = c_0 x_0 + c_1 x_1 + c_2 x_2 + c_3 x_3 + \dots + c_3 x_{12} + c_2 x_{13} + c_1 x_{14} + c_0 x_{15}$

DSP and FPGA
• The FPGA's parallel approach to DSP enables higher computational throughput.
• Consider a 256-tap FIR filter:
  – Conventional DSP processor: serial implementation
  – FPGA: fully parallel implementation

Multiply Accumulate: Multiple Engines
• Parallel processing maximizes data throughput
  – Supports any level of parallelism
  – Optimal performance/cost tradeoff
• 256-tap FIR filter
  – 256 multiply-and-accumulate (MAC) operations per data sample
  – One output every clock cycle
• Flexible architecture
  – Distributed DSP resources (LUTs, registers, multipliers, and memory)
[Figure: Data In → Reg0, Reg1, Reg2, …, Reg255, multiplied by C0, C1, C2, …, C255 → Data Out; all 256 MAC operations in 1 clock cycle]

FPGAs Outperform 'Traditional' DSP Processors
[Chart: 8-bit, 16-tap FIR filter performance comparisons; vertical axis: performance relative to a single 50 MHz fixed-point DSP (0 to 25); bar values 0.24, 1.00, 2.60, 4.00, 16.00, and 22.00; FPGA bars are labelled serial distributed arithmetic (SDA), MCM, and parallel distributed arithmetic (PDA, estimated)]
Implementations and their external performance:
– 133 MHz Pentium processor: 750 kHz
– Single 50 MHz DSP: 3 MHz
– XC4003E-3 FPGA (68% util.): 8 MHz
– Four 50 MHz DSPs: 12 MHz
– XC4010E-3 FPGA (98% util.): 56 MHz
– XC4013E-2 FPGA (75% util.):
  66 MHz

Case Study: Viterbi Decoder (FPGA-based DSP Co-Processor)
[Figure: two parallel 24-bit add/compare/select data paths on the I/O bus. Each adds an increment (INC) to the old metrics Old_1 and Old_2, forms the differences Diff_1 and Diff_2, and uses the MSB to drive a multiplexer that selects New_1 and New_2; a prestate buffer and optional pipelining registers are included]
[Chart: relative performance. Two 66 MHz DSPs with six 15 ns RAMs take 360 ns; one 66 MHz DSP + FPGA with three 15 ns RAMs takes 135 ns: 2.67 times better performance with FPGA-assisted DSP]
• DSP-only: 8 devices
  – Two 66 MHz DSPs
  – Six 15 ns SRAMs
  – System logic
• DSP + FPGA: 4 devices
  – One 66 MHz DSP
  – XC4013E-3 FPGA (44%)
  – Three 15 ns SRAMs

What to Look for in Your DSP Application
• Identify parallel data paths
• Find operations that require multiple clock cycles
• Processor bottlenecks
• Flexibility
[Table: parallel data paths, scalable bandwidth, design modification, and device expansion, each marked YES or NO; the marks are not recoverable]

When to Use FPGAs for DSP
[Chart: data rate (with a 50 MHz system clock, 0 to 50) against arithmetic operations per sample (1 to 48), with curves for 1, 2, 3, and 4 DSPs; the area above the curves is the FPGA region and the area below is the DSP region]
• High sample rates: up to 500 MHz with Virtex 5/6/7
• Low sample rates: integrate DSP + system logic in a low-cost DSP using a serial sequential algorithm
• Short word lengths: the DA algorithm gets faster with shorter word length
• Lots of filter taps: the FPGA processes all taps in parallel, faster than a DSP
• Fast correlators
• Single-chip solution required
• HardWire gate array migration path for high-volume designs

Co-processing with an FPGA
FPGA co-processors are an extremely cost-effective means of off-loading computationally intensive algorithms from a DSP processor.
[Figure: FPGA coprocessor for WiMAX baseband processing]
[Figure: FPGA coprocessor for high-definition H.264 encoding]

Digital Filters
• Digital filters are one of the main elements of DSP and are implemented using only MAC operations.
• A digital filter performs a filtering function on data by attenuating or reducing bands of frequencies.
  – Remove high-frequency noise from a speech signal
  – Remove low-frequency noise for some sensors
  – Emphasize a particular frequency in a music signal
  – Remove 50 Hz mains hum from an ECG signal

Low Pass Digital Filter
• An example of the operation of a low pass filter is: [figure not recoverable]
• The weights W0 to WN-1 must be appropriately chosen.

Digital Filters: Types
• Finite Impulse Response (FIR): non-recursive linear filter (i.e. no feedback present)
• Infinite Impulse Response (IIR): recursive linear filter (i.e. with feedback)
• Adaptive Digital Filter (ADF): a self-learning filter that adapts itself to a desired signal
• Non-linear filters: filters that perform non-linear operations, e.g. median filters and min/max filters

FIR Filters
A Finite Impulse Response (FIR) filter performs a weighted average (convolution) on a window of N data samples.

FIR Filters
[Figure: finite impulse response filter: a chain of registers (Z^-1 delays), multipliers with coefficients C1, C2, …, CN-1, CN, and an adder]

Frequency Response
The frequency/phase response of a digital filter is found by taking the Discrete Fourier Transform (DFT) of the impulse response.

FPGA Implementations
1. Hardware Description Language:
  – VHDL
  – Verilog
2. Electronic System Level:
  – Handel-C, Vivado HLS (Lab #7)
  – Impulse-C
3. Core Generator (IP selection)
4. System Generator (Lab #6)
  – Matlab, Simulink, System Generator

FIR Filter: VHDL Implementation
Simple VHDL design example of an 8-tap FIR filter.
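For reference, the weighted average computed by an N-tap FIR filter, and the frequency response obtained from its impulse response, are usually written as shown below. These are the standard textbook forms rather than equations taken from the slides: the symbols x(n), y(n), h(k), and the normalized frequency ω are notation assumed here, and w_k stands for the weights W0 to WN-1 named on the low pass filter slide.

```latex
% N-tap FIR filter: weighted sum of the N most recent input samples
\[ y(n) = \sum_{k=0}^{N-1} w_k \, x(n-k) \]

% Impulse response: with a single unit impulse as input, the output replays the weights
\[ h(k) = w_k \quad (0 \le k \le N-1), \qquad h(k) = 0 \ \text{otherwise} \]

% Frequency response: Fourier transform of the impulse response
\[ H(e^{j\omega}) = \sum_{k=0}^{N-1} w_k \, e^{-j\omega k} \]
```

Sampling H at ω = 2πm/M for m = 0, …, M-1 gives the M-point DFT of the zero-padded impulse response, which is how the response is normally evaluated numerically.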
Hardware Description Languages
Full VHDL/Verilog (RTL code)
• Advantages:
  – Portability and efficient implementation
  – Complete control of the design implementation and tradeoffs
  – Easier to debug and understand code that you own
• Disadvantages:
  – Can be time consuming
  – You don't always have control over the synthesis tool
  – You need to be familiar with the algorithm and how to write it

Abstraction: Advantages
[Figure not recoverable]

CORE Generator
• Instantiate optimized IP within the HDL code
[Figure: design flow: HDL with COREGen IP → synthesis → implementation → download, accompanied by behavioral simulation, functional simulation, timing simulation, and in-circuit verification]

Xilinx CORE Generator
• List of available IP
• Fully parameterizable

Xilinx IP Solutions
Legend: $ - license fee, P - parameterized, S - project license available. (In the original slide, bold entries are available in the Xilinx Blockset for System Generator for DSP; the bold marking is not preserved here.)
IP Center: http://www.xilinx.com/ipcenter
• DSP functions
  – $P Reed Solomon
  – $ 3GPP Turbo Code
  – $P Viterbi Decoder
  – $P Convolution Encoder
  – $P Interleaver/De-interleaver
  – P LFSR
  – P 1D DCT
  – P DA FIR
  – P MAC
  – P MAC-based FIR filter
  – Fixed FFTs: 16, 64, 256, 1024 points
  – P FFT - 32 point
  – P Sine Cosine
  – P Direct Digital Synthesizer
  – P Cascaded Integrator Comb
  – P Bit Correlator
  – P Digital Down Converter
• Math functions
  – P Multiplier Generator: parallel multiplier, dynamic constant coefficient multiplier, serial sequential multiplier, multiplier enhancements
  – P Divider
  – P CORDIC
• Base functions
  – P Binary Decoder
  – P Two's Complement
  – P Shift Register RAM/FF
  – P Gate modules
  – P Multiplexer functions
  – P Registers, FF & latch based
  – P Adder/Subtractor
  – P Accumulator
  – P Comparator
  – P Binary Counter
• Memory functions
  – P Asynchronous FIFO
  – P Block Memory modules
  – P Distributed Memory
  – P Distributed Mem Enhance
  – P Sync FIFO (SRL16)
  – P Sync FIFO (Block RAM)
  – P CAM (SRL16)
• PCI
  – $P PCI 64/66
  – $PS PCI 32/33
  – $P PCI-X 64/66
• Networking
  – 8B/10B Encoder/Decoder
  – $ POS-PHY L3
  – $ POS-PHY L4
  – $ Flexbus 4
  – $ RapidIO PHY Layer
  – $S HDLC 1 and 32 channel
  – $S G.711 PCM
    Cores
  – $S ADPCM 32 & 64 channel

Core Generator: Summary
• Advantages
  – Can quickly access and generate existing functions
  – No need to reinvent the wheel and redesign a block if it meets specifications
  – IP is optimized for the specified architecture
• Disadvantages
  – IP doesn't always do exactly what you are looking for
  – You need to understand the signals and parameters and match them to your specification
  – You are dealing with a black box and have little information on how the function is implemented

Xilinx System Generator for DSP
• Industry's first system-level design environment (IDE) for FPGAs
• Simulink library of arithmetic, logic operators, and DSP functions (Xilinx blockset)
• Arithmetic abstraction
• VHDL code generation for most Spartan-based FPGAs and Virtex 4/5/6/7 FPGAs
• Enables hardware-in-the-loop co-simulation

MATLAB
• MATLAB, the most popular system design tool, is a programming language, interpreter, and modeling environment
  – Extensive libraries for math functions, signal processing, DSP, communications, and much more
  – Visualization: a large array of functions to plot and visualize your data and system/design
  – Open architecture: software model based on a base system and domain-specific plug-ins

System Level Evaluation
• Irrespective of the final implementation technology (GPP, DSP, ASIC, FPGA), if one is creating a product based on a new DSP algorithm, it is common practice to first perform system-level evaluation and algorithmic verification using an appropriate environment.
• The de facto industry standard for DSP algorithmic verification is MATLAB.
[Figure: original concept → algorithmic verification → either auto C/C++ generation or handcrafted C/C++ → compile/assemble → machine code, with handcrafted assembly as an alternative path]

System/Algorithmic level to RTL
• Many DSP design teams commence by performing their system-level evaluation and algorithmic validation in MATLAB using floating-point representation.
• Alternatively, they may first transition the floating-point representations into their fixed-point counterparts at the system level.
• At this point, many design teams jump directly into hand-coding fixed-point RTL equivalents of the design in VHDL.
[Figure: (a) original concept → system/algorithmic verification (floating-point) → handcrafted Verilog/VHDL RTL (fixed-point); (b) the same flow with an added system/algorithmic verification (fixed-point) step before the RTL; both lead to standard RTL-based simulation and synthesis]

Simulink
• Simulink: a visual data flow environment for modeling and simulation of dynamical systems
  – Fully integrated with the MATLAB engine
  – Graphical block editor
  – Event-driven simulator
  – Models parallelism
  – Extensive library of parameterizable functions
    • Simulink Blockset: math, sinks, sources
    • DSP Blockset: filters, transforms, etc.
    • Communications Blockset: modulation, DPCM, etc.

Traditional Simulink FPGA Flow
[Figure: the system architect performs system verification in Simulink; a gap separates this from the FPGA designer's flow (HDL → synthesis → implementation → download, with functional simulation, timing simulation, and in-circuit verification); equivalence between the two has to be verified]

System Generator
[Figure: MATLAB/Simulink system verification → System Generator, which produces VHDL, IP, a testbench, and a constraints file → synthesis → implementation → download, with functional simulation, timing simulation, and in-circuit verification]

Creating a System Generator Design
• The Xilinx Blockset is listed in the Simulink Library Browser.
• Create a design by dragging and dropping components from the Xilinx Blockset onto a new sheet.

Finding Blocks
• Use the Find feature to search ALL Simulink libraries.
• The Xilinx blockset has nine major sections:
  – Basic elements: counters, delays
  – Communication: error correction blocks
  – Control Logic: MCode, Black Box
  – Data Types: Convert, Slice
  – DSP: FDATool, FFT, FIR
  – Index: all Xilinx blocks (a quick way to view all blocks)
  – Math: multiply, accumulate, inverter
  – Memory: Dual Port RAM, Single Port RAM
  – Tools: ModelSim, Resource Estimator

Configure Your Blocks
• Double-click or go to
Block Parameters to view a block's configurable parameters:
  – Arithmetic Type: unsigned or two's complement
  – Implement with Xilinx Smart-IP Core (if possible) / Generate Core
  – Latency: specify the delay through the block
  – Overflow and Quantization: overflow can saturate or wrap; quantization can truncate or round
  – Override with Doubles: simulation only
  – Precision: full, or the user can define the number of bits and where the binary point is for the block
  – Sample Period: can be inherited with a "-1" or must be an integer value
• Note: while all parameters can be simulated, not all are realizable.

Values Can Be Equations
• You can also enter equations in the block parameters, which can aid calculation and your own understanding of the model parameters.
• The equations are calculated at the beginning of a simulation.
• Useful MATLAB operators:
  – + add
  – - subtract
  – * multiply
  – / divide
  – ^ power
  – pi (3.1415926535897…)
  – exp(x) exponential ($e^x$)

Important Concept 1: The Numbers Game
• Simulink uses a "double" to represent numbers in a simulation. A double is a 64-bit IEEE 754 floating-point number.
  – Because the binary point can move, a double covers an enormous range (up to about ±1.8 × 10^308) with roughly 15 to 17 significant decimal digits: a wide, desirable range, but not efficient or realistic for FPGAs.
• The Xilinx Blockset uses n-bit fixed-point numbers (two's complement optional).
[Figure: 16-bit example. Integer part: bits 1 0 1 with weights -2^2, 2^1, 2^0. Fraction part: bits 1 0 1 1 1 1 0 1 0 0 1 0 1 with weights 2^-1 down to 2^-13. Value = -2.261108…, Format = Fix_16_13]
• Format = Sign_Width_BinaryPoint, where the last field is the position of the binary point counted from the LSB (Sign: Fix = signed value, UFix = unsigned value).
• Design hint: always try to maximize the dynamic range of the design while using only the required number of bits.
• Thus, a conversion is required when Xilinx blocks communicate with Simulink blocks (Xilinx blockset → MATLAB I/O → Gateway In/Out).

What About All Those Other Bits?
• The Gateway In and Out blocks support parameters to control the conversion from double precision to N-bit fixed-point precision.
[Figure: a DOUBLE value with bits 1 0 1 1 0 1 1 1 1 0 1 0 0 1 0 1 at weights -2^2 down to 2^-13 (further bits extend above 2^2 and below 2^-13) is converted to FIX_12_9, keeping bits 1 0 1 1 0 1 1 1 1 0 1 0 at weights -2^2 down to 2^-9]
  – Overflow options (excess integer bits): Wrap, Saturate, Flag Error
  – Quantization options (excess fraction bits): Truncate, Round

Creating a System Generator Design
• I/O blocks are used as the interface between the Xilinx blockset and other Simulink blocks.
[Figure: Simulink sources → Gateway In → SysGen blocks realizable in hardware → Gateway Out → Simulink sinks and library functions]

Using the Scope
• Click Properties to change the number of axes displayed and the time range value (X-axis).
• Use Data history to control how many values are stored and displayed on the scope.
• Click Autoscale to let the tool quickly configure the display to the correct axis values.
• Right-click on the Y-axis to set its value.

Design & Simulate in Simulink
• Simulate the design by pushing "play."
• Go to "Simulation Parameters" under the "Simulation" menu to control the length of simulations.

Resource Estimator
• The block provides fast estimates of the FPGA resources required to implement the subsystem.
• Most of the blocks in the System Generator Blockset carry resource information:
  – LUTs
  – FFs
  – BRAM
  – Embedded multipliers
  – 3-state buffers
  – I/Os

Resource Estimator
• Three types of estimation:
  – Estimate Area: computes resources for the current level and all sub-levels.
  – Quick Sum: uses the resources stored in each block directly and sums them up (no sub-level functions are invoked).
  – Post-Map Area: opens a file browser and lets the user select a map report file. The design should have been generated and gone through the synthesis, translate, and mapping phases.
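As a worked check of the fixed-point formats used in the Gateway In conversion earlier in this section, the 16-bit pattern shown for Fix_16_13 can be evaluated by hand and then truncated to Fix_12_9. This is only an arithmetic sketch: the names b_16 and b_12 are labels introduced here for the raw two's-complement integers, and truncation is assumed as the quantization mode.

```latex
% Raw 16-bit two's-complement integer (MSB set, so subtract 2^16)
\[ b_{16} = \mathtt{1011\,0111\,1010\,0101}_2 - 2^{16} = 47013 - 65536 = -18523 \]

% Fix_16_13: binary point 13 places from the LSB
\[ \text{value}_{16,13} = \frac{-18523}{2^{13}} = -2.26110839\ldots \]

% Truncating to Fix_12_9 drops the 4 least significant bits (floor division by 2^4)
\[ b_{12} = \left\lfloor \frac{-18523}{2^{4}} \right\rfloor = -1158 = \mathtt{1011\,0111\,1010}_2 - 2^{12} \]

% Fix_12_9: binary point 9 places from the LSB
\[ \text{value}_{12,9} = \frac{-1158}{2^{9}} = -2.26171875 \]
```

Both formats keep three integer bits (sign included), so this conversion cannot overflow; only the quantization setting matters. The error introduced here is about 0.0006, below the Fix_12_9 resolution of 2^-9 ≈ 0.00195, and rounding would give the same result for this particular value.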
The Black Box
Use the Black Box when:
• You need a function that cannot be created with the Xilinx Blockset
• You already have a piece of VHDL you wish to use for a section of the design
The Black Box creates a placeholder in the generated VHDL. Use the Black Box parameters to control the VHDL placeholder's features.

Generate the VHDL Code
• Once complete, double-click the System Generator token
• Select the target device
• Select to generate the testbench
• Set the desired system clock period
• Generate the VHDL

Hardware-in-the-Loop Reduces Design Time & Cost
• Configure any development board for hardware-in-the-loop using a JTAG header in < 20 minutes
  – Automatically create the FPGA bit-stream from Simulink
  – Transparent use of FPGA implementation tools
  – Accelerate and verify the Simulink design using FPGA hardware
  – Mirrors traditional DSP processor design flows
• Combine with the black box to simulate HDL & EDIF

Create Bit-stream
• Step 1: Select the target H/W platform
• Step 2: Generate the bit-stream

Co-Simulate in Hardware
• Step 3: A post-generation script creates a new library containing a parameterized run-time co-simulation block.
• Step 4: Copy the co-simulation run-time block into the original model.
• Step 5: Simulate for verification.

Hardware in the Loop Performance Results
Single Step Clock Mode (bit and cycle accurate)

Application                              Software simulation time (s)   Hardware simulation time (s)   Speed-up
Image Filtering                          676                            6                              112X
QAM Demodulator + Extension              1203                           18                             67X
5 x 5 Image Filter                       170                            4                              43X
Cordic Arc Tangent                       187                            27                             7X
Additive White Gaussian Noise Channel    600                            80                             7.5X

Free Running Clock Mode
• A free-running clock is provided to the design, so the hardware no longer runs in lockstep with the software.
• The test is started, and after some time a 'done' flag is set to read the results from the FPGA and display them in Simulink.
• Using this hardware co-simulation method, designers can achieve up to 6 orders of magnitude performance enhancement over the original software simulation.

DSP System Generator: Summary
• System Generator for DSP
  – Advantages
    • Ability to simulate the design at a system level
    • High level of abstraction: very attractive for FPGA novices
    • Optimize area, speed, or a combination
    • Estimate resources easily
    • Hardware co-simulation (FPGA in the loop)
    • Test-bench and golden data written automatically
  – Disadvantages
    • Cost of abstraction: doesn't always give the best result from an area usage point of view
    • Only as good as the IP support

FPGAs versus DSP
• FPGAs can outperform DSP processors on certain DSP tasks: computation-intensive, highly parallelizable tasks.
• DSP processors have the advantage in development infrastructure, time-to-market, and developer familiarity.
  – DSP processors are still easier to use
  – Many engineers possess DSP processor development skills
• Ultimate speed is not always the first priority.
• A combination of FPGA and DSP processor is an excellent solution if performance requirements cannot be met by the processor alone.
• The "best" architecture depends on the requirements of the application.

Problem with this flow?
• There is a significant conceptual and representational divide between the system architects working at the system/algorithmic level and the hardware design engineers working with RTL representations in VHDL.
• Manual translation from one to the other is time consuming and prone to error.
• Any changes made to the original specification during the course of the project are painful and time consuming to translate again to RTL.
[Figure: (a) original concept → system/algorithmic verification (floating-point) → handcrafted Verilog/VHDL RTL (fixed-point); (b) the same flow with an added system/algorithmic verification (fixed-point) step before the RTL; both lead to standard RTL-based simulation and synthesis]

Direct RTL Generation
• Some system/algorithmic-level design environments offer direct VHDL code generation.
• An example of this type of environment is offered by AccelChip Inc., whose environment can accept floating-point MATLAB M-files, output their fixed-point equivalents for verification, and then use these new M-files to auto-generate RTL.
[Figure: (a) within the system/algorithmic environment: original concept → system/algorithmic verification (floating-point) → system/algorithmic verification (fixed-point) → auto-generated Verilog/VHDL RTL (fixed-point); (b) floating-point verification in the system/algorithmic environment, then a third-party environment performs auto-interactive quantization (fixed-point) and auto-generates Verilog/VHDL RTL (fixed-point); both lead to standard RTL-based simulation and synthesis]

Transposed FIR with Multiplier Block
[Figure not recoverable]

DSP Processors vs. FPGAs
• High-speed DSP processor: 1 to 8 multipliers (MAC units)
  – Needs looping for more than 8 multiplications
  – Needs multiple clock cycles because of serial computation
  – A 200-tap FIR filter would need 25+ clock cycles per sample with an 8-MAC-unit processor
• High level of parallel processing in an FPGA: an array of MAC units
  – Can implement hundreds of MAC functions in an FPGA
  – Parallel implementation allows for faster throughput
  – A 200-tap FIR filter would need 1 clock cycle per sample

Multiply Accumulate: Single Engine
• Sequential processing limits data throughput
  – Time-shared MAC unit
  – Data width is fixed!
  – High clock frequency creates a difficult system challenge
• 256-tap FIR filter
  – 256 multiply-and-accumulate (MAC) operations per data sample
  – One output every 256 clock cycles
[Figure: Data In → Reg → MAC unit → Data Out, with the algorithm looped 256 times]

Filters: Applications
[Figure not recoverable]

Impulse Response
The impulse response of an FIR filter is the output of the filter when a single unit impulse is applied at the input.

Solution: Building a MAC with System Generator
$c = \sum_i a_i b_i$
[Figure: inputs a and b feed a multiplier; the product is added to the accumulated value c]
• MAC using a slice-based multiplier
  – Slice count: 70 slices
  – Performance: ~130 MHz (2v1000 -4)
• MAC using an embedded multiplier
  – Slice count: 22 slices, 1 embedded multiplier
  – Performance: ~126 MHz (2v1000 -4)

FIR (cont.): VHDL Implementation
• For convenience, the selected coefficients are powers of 2.
• To operate, the filter must have eight register stages, each of which is eight bits wide. At each clock cycle, each coefficient is multiplied by the eight-bit value in the appropriate register.
• Therefore, for the register or memory portion of the design, 64 flip-flops are required.
• Because the coefficients are powers of two, multiplication is achieved by a simple shifting operation.
• The coefficient values may be stored as constants.
• The coefficients used in the example are given below:
  $a_0 = 2^{-3},\ a_1 = 2^{-2},\ a_2 = 2^{-1},\ a_3 = 1,\ a_4 = 1,\ a_5 = 2^{-1},\ a_6 = 2^{-2},\ a_7 = 2^{-3}$

VHDL Description of FIR Filter

    library ieee;
    use ieee.std_logic_1164.all;

    entity FIR1 is
      port (
        clk : in  std_logic;
        x   : in  integer range 0 to 255;
        -- Worst case is 2*(31 + 63 + 127 + 255) = 952, which would
        -- violate a range of 0 to 511, so the output range is 0 to 1023.
        y   : out integer range 0 to 1023
      );
    end entity FIR1;

VHDL Description of FIR Filter

    architecture arch1 of FIR1 is
    begin
      process (clk)
        -- Eight register stages of eight bits each (64 flip-flops).
        type RegType is array (7 downto 0) of integer range 0 to 255;
        variable Reg : RegType := (others => 0);
      begin
        if (clk'event and clk = '1') then
          -- multiply/accumulate (MAC) operation; dividing by 8, 4, or 2
          -- corresponds to a right shift by 3, 2, or 1 bits
          y <= Reg(0)/8 + Reg(1)/4 + Reg(2)/2 + Reg(3)
             + Reg(4) + Reg(5)/2 + Reg(6)/4 + Reg(7)/8;
          -- update register values by shifting; the new sample enters at Reg(7)
          Reg(0) := Reg(1);
          Reg(1) := Reg(2);
          Reg(2) := Reg(3);
          Reg(3) := Reg(4);
          Reg(4) := Reg(5);
          Reg(5) := Reg(6);
          Reg(6) := Reg(7);
          Reg(7) := x;
        end if;
      end process;
    end architecture arch1;
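The slides note that power-of-two coefficients reduce each multiplication to a shift, while the listing above expresses this as integer division. As a hedged illustration, the same 8-tap filter can be sketched with explicit shifts on `unsigned` vectors from `ieee.numeric_std`, together with a small self-checking testbench that applies an impulse of amplitude 8. The names `fir8_shift`, `fir8_shift_tb`, `tap`, and `tick`, the 10-bit output width, and the 10 ns clock period are assumptions made for this sketch, not part of the original design.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: same taps and power-of-two coefficients as FIR1.
entity fir8_shift is
  port (
    clk : in  std_logic;
    x   : in  unsigned(7 downto 0);   -- 8-bit input sample
    y   : out unsigned(9 downto 0)    -- 10 bits: worst case 952 < 1024
  );
end entity fir8_shift;

architecture rtl of fir8_shift is
  type reg_array is array (0 to 7) of unsigned(7 downto 0);
  signal reg : reg_array := (others => (others => '0'));  -- 8 x 8 = 64 flip-flops

  -- Widen a tap to 10 bits, then divide by 2**n with a right shift.
  function tap(v : unsigned(7 downto 0); n : natural) return unsigned is
  begin
    return shift_right(resize(v, 10), n);
  end function;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      y <= tap(reg(0), 3) + tap(reg(1), 2) + tap(reg(2), 1) + tap(reg(3), 0)
         + tap(reg(4), 0) + tap(reg(5), 1) + tap(reg(6), 2) + tap(reg(7), 3);
      -- Shift the delay line; the newest sample enters at reg(7).
      reg(0 to 6) <= reg(1 to 7);
      reg(7)      <= x;
    end if;
  end process;
end architecture rtl;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fir8_shift_tb is
end entity fir8_shift_tb;

architecture sim of fir8_shift_tb is
  signal clk : std_logic := '0';
  signal x   : unsigned(7 downto 0) := (others => '0');
  signal y   : unsigned(9 downto 0);
  type int_array is array (natural range <>) of integer;
  -- Impulse of amplitude 8 scaled by 1/8, 1/4, 1/2, 1, 1, 1/2, 1/4, 1/8.
  constant expected : int_array(0 to 7) := (1, 2, 4, 8, 8, 4, 2, 1);
begin
  dut : entity work.fir8_shift port map (clk => clk, x => x, y => y);

  stimulus : process
    -- One clock period: rising edge after 5 ns, falling edge 5 ns later.
    procedure tick is
    begin
      wait for 5 ns;
      clk <= '1';
      wait for 5 ns;
      clk <= '0';
    end procedure;
  begin
    x <= to_unsigned(8, 8);
    tick;                          -- impulse is captured in reg(7)
    x <= (others => '0');
    tick;                          -- first non-zero output appears
    for i in expected'range loop
      assert to_integer(y) = expected(i)
        report "tap " & integer'image(i) & ": got " & integer'image(to_integer(y))
        severity failure;
      tick;
    end loop;
    report "impulse response matches the coefficients";
    wait;
  end process;
end architecture sim;
```

The testbench relies on the impulse response property described earlier: a single non-zero sample walks through the delay line, so the outputs replay the coefficients in order. An amplitude of 8 is used so that the smallest coefficient, 2^-3, still yields a non-zero integer.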