RCS for DSP

ENG6530
Reconfigurable
Computing Systems
Digital Signal Processing
using FPGAs
ENG6530 RCS
Topics

• Digital Signal Processing (DSP):
  – Definition
  – Advantages and disadvantages
  – Applications, …
• DSP vs. GPP vs. ASIC vs. FPGA
• Why use Reconfigurable Computing
• Xilinx System Generator
References
I. http://www.xilinx.com
II. "Reconfigurable Computing for DSP: A Survey", by R. Tessier and W. Burleson, 2001
III. "Optimization Techniques for Efficient Implementation of DSP in FPGAs", by J. Wang
IV. "Reconfigurable Computing: The Theory and Practice of FPGA-Based Computing", Chapter 24: Distributed Arithmetic
Introduction

The term Digital Signal Processing, or DSP, refers to the branch of electronics concerned with the representation and manipulation of signals in digital form.

Applications include:
i. Telecommunication (switches, …)
ii. Medical (images, equipment, …)
iii. Military (radar, missiles, …)
iv. Consumer (cell phones, TVs, …)
DSP Flow

• The data to be processed starts out as a signal in the real (analog) world.
• This analog signal is then sampled by means of an analog-to-digital converter.
• These samples are then processed in the digital domain.
• The digital samples are subsequently converted into an analog equivalent by means of a digital-to-analog converter.

[Figure: analog input signal (analog domain) → A/D → digital input samples → DSP → modified output samples (digital domain) → D/A → analog output signal (analog domain)]
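The flow described on this slide can be sketched in a few lines of Python. This is only an illustrative model: the 1 kHz tone, the 8 kHz sample rate, the 8-bit converter width, and the function names are assumptions chosen for the example and do not come from the course material.

```python
import math

def sample_and_quantize(signal, sample_rate_hz, duration_s, bits, full_scale=1.0):
    """Model an A/D converter: sample signal(t) and round to signed integer codes."""
    levels = 2 ** (bits - 1)                     # codes span -levels .. levels - 1
    n_samples = int(round(sample_rate_hz * duration_s))
    codes = []
    for n in range(n_samples):
        t = n / sample_rate_hz                   # sampling instant in seconds
        code = int(round(signal(t) / full_scale * levels))
        codes.append(max(-levels, min(levels - 1, code)))  # clip to converter range
    return codes

def reconstruct(codes, bits, full_scale=1.0):
    """Model a D/A converter: map integer codes back to analog levels."""
    levels = 2 ** (bits - 1)
    return [c * full_scale / levels for c in codes]

def tone(t):
    """Hypothetical analog input: a 1 kHz sine with amplitude 0.5."""
    return 0.5 * math.sin(2 * math.pi * 1000.0 * t)

codes = sample_and_quantize(tone, 8000.0, 0.001, bits=8)   # analog -> digital
processed = [c // 2 for c in codes]                        # digital-domain step: attenuate by 2
analog_out = reconstruct(processed, bits=8)                # digital -> analog
```

The middle step stands in for whatever processing the DSP performs; here it is a simple attenuation so that the effect on the reconstructed output is easy to follow.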
DSP Flow

[Figure: a digital system. Design activities: signal analysis, system analysis, filter design. Signal path: ADC (sampling + quantization, 1010..) → DSP → DAC (1001..). Architecture topics: fixed-point arithmetic, architecture types, selection criteria]
Transition from Analog to Digital

The transition from analog to digital techniques has been driven by the many advantages of DSP:

• The main advantage of digital signals over analog signals is that the precise signal level of the former is not vital (digital signals are immune to small imperfections).
• Digital signals can be saved in memory and then recalled.
• Digital signals can convey information with greater noise immunity.
• Digital signals can be processed by digital circuit components, which are cheap and easily produced.
• Digital signals can be encrypted so that only the intended receiver can decode them.
• Precision is flexible: word lengths and/or number representation (e.g., fixed point vs. floating point) can be changed.
• A single processing element can process multiple incoming signals through multiplexing.
• Signals can be transmitted over long distances and at higher rates.
• Digital approaches can easily adjust their processing parameters, as in adaptive filtering.
Transition from Analog to Digital

The main disadvantages of DSP:
i. Increased system complexity: DSP requires that signals be converted between analog and digital forms using a sample-and-hold circuit, analog-to-digital converters (ADCs), digital-to-analog converters (DACs), and analog filtering.
ii. Power consumption: DSP tends to require more power since a dedicated processor is used.
iii. Frequency range limitation: analog hardware can naturally work with higher-frequency signals than DSP hardware because of the limitations of analog-to-digital conversion.

For many applications, the advantages of DSP far outweigh these disadvantages.
DSP: Common Operations

Some of the most common operations performed on signals using digital or analog techniques include:

• Elementary time-domain operations:
  – amplification, attenuation
  – integration, differentiation
  – addition of signals, multiplication of signals, etc.
• Filtering (FIR, IIR)
• Transforms (FFT, IFFT)
• Convolution (integral of the product of two functions)
• Error correction (transmission)
• Compression and decompression (audio, video)
• Modulation and demodulation (BPSK, QAM, FSK, ASK, …)
• Multiplexing and de-multiplexing
• Signal generation
DSP Applications

• Audio Applications:
  – MPEG Audio
  – Portable audio
• Photography:
  – Digital cameras
  – CAM
• Wireless Applications:
  – WiFi
  – WiMax
  – Bluetooth
• Networking:
  – Switches
  – Classifiers
• Medical Equipment:
  – Hearing aids
  – Heart pacers
• Cable modems:
  – ADSL
  – VDSL
• Cellular Phones:
  – Base stations
  – GSM
  – LTE
• Military Applications:
  – Radar
Main DSP Operations

• DSP is the arithmetic processing of digital signals sampled at regular intervals.
• DSP can be reduced to three trivial operations:
  – Delay
  – Add
  – Multiply
• Accumulate = Add + Delay
• MAC = Multiply + Accumulate
• The MAC is the engine behind DSP:
  – More MACs = higher performance, better signal quality
  – MACs vs. MIPS: not always equal

[Figure: filter responses labelled 3 MACs, 50* MACs, and 100 MACs]
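The decomposition on this slide (accumulate = add + delay, MAC = multiply + accumulate) can be written as a small behavioural model. This is a minimal sketch in Python; the class name `Mac` and the helper `dot_product` are hypothetical names chosen for illustration.

```python
class Mac:
    """Multiply-accumulate unit: a multiplier, an adder, and a register (the delay)."""

    def __init__(self):
        self.acc = 0                     # accumulator register holds the delayed sum

    def step(self, a, b):
        product = a * b                  # multiply
        self.acc = self.acc + product    # add the product to the delayed sum
        return self.acc

def dot_product(a_values, b_values):
    """Run one MAC step per pair of operands, as a sequential DSP would."""
    mac = Mac()
    for a, b in zip(a_values, b_values):
        mac.step(a, b)
    return mac.acc

result = dot_product([1, 2, 3], [4, 5, 6])   # 1*4 + 2*5 + 3*6
```

Each call to `step` corresponds to one MAC operation, so the number of calls per output sample is the MAC count that the slide uses as its performance measure.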
Alternative DSP Implementations

DSP tasks can be implemented in a number of different ways:
i. A general purpose processor (GPP): the processor can perform DSP by running an appropriate DSP algorithm.
ii. A digital signal processor (PDSP): a specialized form of microprocessor chip that has been designed to perform DSP tasks much faster and more efficiently than a GPP.
iii. Dedicated ASIC hardware: a custom hardware implementation that executes the DSP task.
iv. Dedicated FPGA hardware: similar to an ASIC except that it offers:
  – Flexibility in terms of reconfiguration.
  – Embedded microprocessor cores on the FPGA.
The Performance Gap


Algorithmic complexity increases as application demands increase.
In order to process these new algorithms, higher performance signal
processing engines are required
Traditional DSP Approaches

• Digital Signal Processor IC
  – Software programmable, like a microprocessor
  – Single MAC unit
  – All processing done sequentially
  – Fit the algorithm to the architecture
• ASIC (gate array)
  – Fit the architecture to the algorithm
  – Significantly higher performance than a DSP processor
  – High cost and high risk to develop
  – Usually only for high-volume applications

[Figure: 'traditional' DSP processor: analog input → ADC → MAC, memory, and data controller → DAC → analog output, with a digital output]
The Promise of Programmable Logic

ASIC
  Pros: high performance; efficient IC architecture; high density; system features; one-chip solution
  Cons: high design risk; long design cycle
DSP Processor
  Pros: high flexibility; good adaptability; short design cycle; low design risk
  Cons: performance; hardware complexity
FPGA
  Best from both worlds, plus: automatic migration to low-cost HardWire
Why FPGAs?


• The most commonly used DSP functions are:
  – FIR (Finite Impulse Response) filters
  – IIR (Infinite Impulse Response) filters
  – FFT (Fast Fourier Transform)
  – DCT (Discrete Cosine Transform)
  – Encoder/decoder and error correction/detection functions
• All of these blocks perform intensive arithmetic operations (data-path-intensive operations) such as:
  – add, subtract
  – multiply, multiply-add
  – multiply-accumulate
Why Use FPGAs in DSP Applications?

• 10x more DSP throughput than DSP processors
  – Parallel vs. serial architecture
• Cost-effective for multi-channel applications
• Flexible hardware implementation
• Single-chip solution
• System (hardware/software) integration benefits

[Figure: a DSP system built from a DSP running software plus an FPGA, compared with a single FPGA containing an embedded processor running software]
DSP-related embedded FPGA resources

• Many FPGAs incorporate dedicated multiplier blocks (Virtex-5/6/7).
• Similarly, some FPGAs offer dedicated adder blocks.
• One operation that is very common in DSP-type applications is the multiply-and-accumulate (MAC).
• To make it easier to implement DSP on FPGAs, some devices provide an entire MAC as an embedded function (Virtex-4).

[Figure: MAC built from a multiplier, an adder, and an accumulator; inputs A[n:0] and B[n:0], output Y[(2n - 1):0]]
DSP Functions are Parallel in Nature

8-Bit, 16-Tap Finite Impulse Response (FIR) Filter

[Figure: data input X[7:0] feeds a chain of 16 registers (filter taps 0 to 15). Taps that share a coefficient are paired (0 with 15, 1 with 14, ..., 7 with 8), multiplied by the filter coefficients C0 to C7, and the values are accumulated into data output Y[9:0]]

Equation (symmetrical coefficients):
$$Y_j = \sum_{k=1}^{n} c_k \, x_{k-j} = c_0 x_0 + c_1 x_1 + c_2 x_2 + c_3 x_3 + \cdots + c_3 x_{12} + c_2 x_{13} + c_1 x_{14} + c_0 x_{15}$$
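Because the coefficients are symmetrical, the two samples that share a coefficient can be added first and multiplied once, which halves the number of multipliers. The sketch below checks this against the direct sum. It is an illustration only: the coefficient values and the sample window are made-up integers, and the function names are hypothetical.

```python
def fir_direct(window, coeffs):
    """Direct form: one multiply per tap."""
    return sum(c * x for c, x in zip(coeffs, window))

def fir_symmetric(window, half_coeffs):
    """Symmetric form: pre-add the paired taps, then one multiply per pair."""
    n = len(window)
    if n != 2 * len(half_coeffs):
        raise ValueError("window must hold twice as many samples as half_coeffs")
    total = 0
    for k, c in enumerate(half_coeffs):
        total += c * (window[k] + window[n - 1 - k])   # x_k and x_(n-1-k) share c_k
    return total

half = [1, 2, 3, 4, 5, 6, 7, 8]          # c0 .. c7 (made-up values)
full = half + half[::-1]                 # c0 .. c7, c7 .. c0
window = list(range(10, 26))             # x0 .. x15 (made-up samples)

direct = fir_direct(window, full)        # 16 multiplies
folded = fir_symmetric(window, half)     # 8 multiplies, same result
```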
DSP and FPGA

The FPGA's parallel approach to DSP enables higher computational throughput. Consider a 256-tap FIR filter, implemented serially on a conventional DSP processor or fully in parallel on an FPGA.

Multiply Accumulate Multiple Engines

• Parallel processing maximizes data throughput
  – Supports any level of parallelism
  – Optimal performance/cost tradeoff
• 256-tap FIR filter
  – 256 multiply and accumulate (MAC) operations per data sample
  – One output every clock cycle
• Flexible architecture
  – Distributed DSP resources (LUTs, registers, multipliers, and memory)

[Figure: Data In shifts through Reg0, Reg1, Reg2, ..., Reg255; each register is multiplied by its coefficient C0 ... C255 and the products are summed into Data Out. All 256 MAC operations complete in 1 clock cycle]
FPGAs Outperform 'Traditional' DSP Processors

8-Bit, 16-Tap FIR Filter Performance Comparisons
Performance relative to a 50 MHz fixed-point DSP:

Device                          Data rate   Relative performance
133 MHz Pentium™ processor      750 kHz     0.24
Single 50 MHz DSP               3 MHz       1.00
XC4003E-3 FPGA (68% util.)      8 MHz       2.60
Four 50 MHz DSPs                12 MHz      4.00
XC4010E-3 FPGA (98% util.)      56 MHz      16.00
XC4013E-2 FPGA (75% util.)      66 MHz      22.00

Chart annotations on the FPGA bars: Serial Distributed Arithmetic (SDA), Parallel Distributed Arithmetic (PDA), MCM, (est.), (External Performance).
Case Study: Viterbi Decoder
(FPGA-based DSP Co-Processor)

[Figure: datapath between two I/O buses. Inputs Old_1 and Old_2 pass through adders and subtractors to form Diff_1 and Diff_2; the MSB of each difference drives a MUX that selects New_1 and New_2. The datapath also includes an INC block, a prestate buffer, 24-bit paths, and optional pipelining registers]

Relative performance: 2.67 times better performance with FPGA-assisted DSP.

• DSP-only (8 devices, 360 ns): two 66 MHz DSPs, six 15 ns SRAMs, system logic
• DSP + FPGA (4 devices, 135 ns): one 66 MHz DSP, XC4013E-3 FPGA (44%), three 15 ns SRAMs
What to Look for in Your DSP Application

• Identify parallel data paths
• Find operations that require multiple clock cycles
• Processor bottlenecks
• Flexibility: parallel data paths, scalable bandwidth, design modification, device expansion

[Figure: checklist marking each flexibility item YES or NO]
When to Use FPGAs for DSP

• High sample rates
  – Up to 500 MHz with Virtex 5/6/7
• Low sample rates
  – Integrate DSP + system logic in a low-cost DSP using serial sequential algorithm
• Short word lengths
  – DA algorithm gets faster with shorter word length
• Lots of filter taps
  – FPGA processes all taps in parallel, faster than DSP
• Fast correlators
• Single-chip solution required
• HardWire gate array migration path for high-volume designs

[Figure: data rate (with a 50 MHz system clock, 0 to 50) versus arithmetic operations per sample (1 to 48), with curves for 1, 2, 3, and 4 DSPs; the FPGA region covers the higher data rates and the DSP region the lower ones]
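The remark that the distributed arithmetic (DA) algorithm gets faster with shorter word lengths follows from how DA works: the inner product is computed one input bit at a time, so the cycle count depends on the word length rather than on the number of taps. The sketch below is a simplified bit-serial model that assumes unsigned input samples and fixed integer coefficients; the names `build_da_lut` and `da_dot_product` are hypothetical.

```python
def build_da_lut(coeffs):
    """Precompute every possible sum of coefficients, indexed by a bit vector."""
    taps = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (address >> k) & 1)
            for address in range(1 << taps)]

def da_dot_product(lut, samples, word_bits):
    """Bit-serial inner product: one LUT lookup and one shift-add per input bit."""
    acc = 0
    for bit in range(word_bits):                 # one "clock cycle" per bit
        address = 0
        for k, x in enumerate(samples):
            address |= ((x >> bit) & 1) << k     # gather this bit of every sample
        acc += lut[address] << bit               # weight the partial sum by 2**bit
    return acc

coeffs = [3, -1, 4, 2]                           # made-up fixed coefficients
lut = build_da_lut(coeffs)                       # 2**4 = 16 entries
samples = [5, 200, 17, 255]                      # unsigned 8-bit samples
result = da_dot_product(lut, samples, word_bits=8)
expected = sum(c * x for c, x in zip(coeffs, samples))
```

With 8-bit samples the loop runs 8 times whatever the number of taps; with 4-bit samples it would run 4 times. The cost is the lookup table, which grows as 2 to the power of the tap count, so practical designs split long filters into several smaller tables.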
Co-processing with an FPGA
FPGA co-processors are an extremely cost-effective means of off-loading
computationally intensive algorithms from a DSP processor.
FPGA Coprocessor for WiMAX Baseband
Processing
FPGA Coprocessor for High-Definition
H.264 Encoding
Digital Filters

• Digital filters are one of the main elements of DSP and can be implemented using only MAC operations.
• A digital filter performs a filtering function on data by attenuating or reducing bands of frequencies.

Examples:
• Remove high-frequency noise from a speech signal
• Remove low-frequency noise for some sensors
• Emphasize a particular frequency in a music signal
• Remove 50 Hz mains hum from an ECG signal
Low Pass Digital Filter


An example of the operation of a low pass filter is:
The weights W0 to WN-1 must be appropriately chosen.
Digital Filters: Types

• Finite Impulse Response (FIR):
  – Non-recursive linear filter (i.e. no feedback present).
• Infinite Impulse Response (IIR):
  – Recursive linear filter (i.e. with feedback).
• Adaptive Digital Filter (ADF):
  – A self-learning filter that adapts itself to a desired signal.
• Non-Linear Filters:
  – A filter that can perform non-linear operations
  – e.g. median filter, min/max filters
FIR Filters

A Finite Impulse Response (FIR) filter performs a weighted
average (convolution) on a window of N data samples:
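As a rough illustration of the weighted average, the sketch below keeps a window of the N most recent samples in a delay line and sums the weighted values for every new input. The function name `fir_filter`, the four equal weights, and the input samples are assumptions made for the example.

```python
def fir_filter(samples, weights):
    """Weighted average over a sliding window of len(weights) samples."""
    delay_line = [0] * len(weights)              # window starts out empty (zeros)
    outputs = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]       # newest sample in, oldest sample out
        outputs.append(sum(w * d for w, d in zip(weights, delay_line)))
    return outputs

# Four equal weights of 1/4 give a simple moving average (a crude low-pass filter).
smoothed = fir_filter([4, 8, 12, 16], [0.25, 0.25, 0.25, 0.25])
```

The first outputs are smaller than the steady-state average because the window is still filling with samples.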
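FIR Filters

[Figure: finite impulse response filter. The input passes through a chain of registers (Z^-1 delays); the taps are multiplied by the coefficients C1, C2, ..., CN-1, CN and the products are combined by adders]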
Frequency Response

The frequency/phase response of a digital filter is found by taking the Discrete Fourier Transform (DFT) of the impulse response.
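A short sketch of this calculation, using only the Python standard library, is given below. For an FIR filter the impulse response is simply the list of coefficients; the values used here are the eight power-of-two coefficients from the VHDL example later in these notes. The function name `frequency_response` and the choice of 16 frequency points are assumptions.

```python
import cmath
import math

def frequency_response(impulse_response, n_points):
    """Evaluate the DFT of the impulse response at n_points frequencies in [0, 2*pi)."""
    response = []
    for m in range(n_points):
        omega = 2 * math.pi * m / n_points       # normalized frequency in rad/sample
        response.append(sum(h * cmath.exp(-1j * omega * k)
                            for k, h in enumerate(impulse_response)))
    return response

h = [0.125, 0.25, 0.5, 1, 1, 0.5, 0.25, 0.125]
H = frequency_response(h, 16)
gain_dc = abs(H[0])          # magnitude at 0 Hz
gain_nyquist = abs(H[8])     # magnitude at half the sample rate
phase = [cmath.phase(value) for value in H]
```

The gain is largest at DC (the sum of the coefficients, 3.75) and falls to zero at half the sample rate, so this set of coefficients behaves as a low-pass filter.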
FPGA Implementations
1. Hardware Description Language:
   – VHDL
   – Verilog
2. Electronic System Level:
   – Handel-C
   – Vivado HLS (Lab #7)
   – Impulse-C
3. Core Generator (IP Selection)
4. System Generator (Lab #6)
   – Matlab, Simulink, System Generator
FIR FILTER: VHDL Implementation

Simple VHDL design example of an 8-tap FIR filter.
Hardware Description Languages

Full VHDL/Verilog (RTL code)
• Advantages:
  – Portability and efficient implementation
  – Complete control of the design implementation and tradeoffs
  – Easier to debug and understand code that you own
• Disadvantages:
  – Can be time consuming
  – Don't always have control over the synthesis tool
  – Need to be familiar with the algorithm and how to write it
Abstraction: Advantages
CORE Generator

[Figure: design flow: HDL → synthesis → implementation → download, verified at each stage by behavioral simulation, functional simulation, timing simulation, and in-circuit verification. COREGen feeds the HDL stage: instantiate optimized IP within the HDL code]
Xilinx CORE Generator

[Figure: CORE Generator screenshot. Callouts: list of available IP; fully parameterizable]
Xilinx IP Solutions
DSP Functions
$P Reed Solomon
$3GPP Turbo Code
$P Viterbi Decoder
$P Convolution Encoder
$P Interleaver/De-interleaver
P LFSR
P 1D DCT
P DA FIR
P MAC
P MAC-based FIR filter
Fixed FFTs 16, 64, 256, 1024 points
P FFT - 32 Point
P Sine Cosine
P Direct Digital Synthesizer
P Cascaded Integrator Comb
P Bit Correlator
P Digital Down Converter
IP CENTER
http://www.xilinx.com/ipcenter
Math Functions
P Multiplier Generator
- Parallel Multiplier
- Dyn Constant Coefficient Mult
- Serial Sequential Multiplier
- Multiplier Enhancements
P Divider
P CORDIC
Base Functions
P Binary Decoder
P Two's Complement
P Shift Register RAM/FF
P Gate modules
P Multiplexer functions
P Registers, FF & latch based
P Adder/Subtractor
P Accumulator
P Comparator
P Binary Counter
$ - License Fee, P - Parameterized, S - Project License Available,
BOLD – Available in the Xilinx Blockset for the System Generator for DSP
Memory Functions
P Asynchronous FIFO
P Block Memory modules
P Distributed Memory
P Distributed Mem Enhance
P Sync FIFO (SRL16)
P Sync FIFO (Block RAM)
P CAM (SRL16)
PCI
$P PCI 64/66
$PS PCI 32/33
$P PCI-X 64/66
Networking
8B/10B Encoder/Decoder
$ POS-PHY L3
$ POS-PHY L4
$ Flexbus 4
$ RapidIO PHY Layer
$S HDLC 1 and 32 channel
$S G.711 PCM Cores
$S ADPCM 32 & 64 channel
Core Generator: Summary

• CORE Generator
  – Advantages
    • Can quickly access and generate existing functions
    • No need to reinvent the wheel and redesign a block if it meets specifications
    • IP is optimized for the specified architecture
  – Disadvantages
    • IP doesn't always do exactly what you are looking for
    • Need to understand signals and parameters and match them to your specification
    • Dealing with a black box, with little information on how the function is implemented
Xilinx System Generator for DSP
• Industry's first system-level design environment (IDE) for FPGAs
• Simulink library of arithmetic, logic operators, and DSP functions (Xilinx blockset)
• Arithmetic abstraction
• VHDL code generation for most Spartan-based FPGAs and Virtex 4/5/6/7 FPGAs
• Enables hardware-in-the-loop co-simulation
MATLAB
• MATLAB™, the most popular system design tool, is a programming language, interpreter, and modeling environment
  – Extensive libraries for math functions, signal processing, DSP, communications, and much more
  – Visualization: large array of functions to plot and visualize your data and system/design
  – Open architecture: software model based on base system and domain-specific plug-ins
System Level Evaluation

• Irrespective of the final implementation technology (GPP, DSP, ASIC, FPGA), if one is creating a product that is to be based on a new DSP algorithm, it is common practice to first perform system-level evaluation and algorithmic verification using an appropriate environment.
• The de facto industry standard for DSP algorithmic verification is MATLAB.

[Figure: original concept → algorithmic verification → auto C/C++ generation, handcrafted C/C++, or handcrafted assembly → compile/assemble → machine code]
System/Algorithmic level to RTL

• Many DSP design teams commence by performing their system-level evaluation and algorithmic validation in MATLAB using floating-point representation.
• Alternatively, they may first transition the floating-point representation into its fixed-point counterpart at the system level.
• At this point, many design teams move directly into hand-coding fixed-point RTL equivalents of the design in VHDL.

[Figure: original concept → (a) system/algorithmic verification (floating-point), optionally followed by (b) system/algorithmic verification (fixed-point) → handcraft Verilog/VHDL RTL (fixed-point) → standard RTL-based simulation and synthesis]
Simulink
• Simulink™: visual data flow environment for modeling and simulation of dynamical systems
  – Fully integrated with the MATLAB engine
  – Graphical block editor
  – Event-driven simulator
  – Models parallelism
  – Extensive library of parameterizable functions
    • Simulink Blockset: math, sinks, sources
    • DSP Blockset: filters, transforms, etc.
    • Communications Blockset: modulation, DPCM, etc.
Traditional Simulink FPGA Flow

[Figure: the system architect performs system verification in Simulink; the FPGA designer separately takes HDL through synthesis, implementation, and download, with functional simulation, timing simulation, and in-circuit verification. A GAP separates the two sides, so their equivalence has to be verified]

System Generator

[Figure: MATLAB/Simulink system verification feeds System Generator, which produces the HDL (VHDL, IP, testbench, constraints file) for synthesis, implementation, and download, with functional simulation, timing simulation, and in-circuit verification]
Creating a System
Generator Design
• Xilinx Block-set listed in Simulink Library Browser
• Create Design by Dragging and Dropping
components from the Xilinx Block-set onto your
new sheet to create design
Finding Blocks
• Use the Find feature to search ALL Simulink libraries
• Xilinx blockset has nine major sections
  – Basic elements
    • Counters, delays
  – Communication
    • Error correction blocks
  – Control Logic
    • MCode, Black Box
  – Data Types
    • Convert, Slice
  – DSP
    • FDATool, FFT, FIR
  – Index
    • All Xilinx blocks – quick way to view all blocks
  – Math
    • Multiply, accumulate, inverter
  – Memory
    • Dual Port RAM, Single Port RAM
  – Tools
    • ModelSim, Resource Estimator
Configure Your Blocks
• Double-click or go to Block Parameters to view a block's configurable parameters
  – Arithmetic Type: unsigned or twos complement
  – Implement with Xilinx Smart-IP Core (if possible) / Generate Core
  – Latency: specify the delay through the block
  – Overflow and Quantization: users can saturate or wrap on overflow, and truncate or round on quantization
  – Override with Doubles: simulation only
  – Precision: full, or the user can define the number of bits and where the binary point is for the block
  – Sample Period: can be inherited with a "-1" or must be an integer value
• Note: While all parameters can be simulated, not all are realizable
Values Can Be Equations
• You can also enter equations in the block parameters, which can aid calculation and your own understanding of the model parameters
• The equations are calculated at the beginning of a simulation
• Useful MATLAB operators
  – + add
  – - subtract
  – * multiply
  – / divide
  – ^ power
  – pi (3.1415926535897…)
  – exp(x) exponential (e^x)
Important Concept 1: The Numbers Game
• Simulink uses a "double" to represent numbers in a simulation. A double is a 64-bit IEEE 754 floating-point number.
  – Because the binary point can move, a double covers magnitudes up to about 1.8 x 10^308 with roughly 15 to 16 significant decimal digits: a wide, desirable range, but not efficient or realistic for FPGAs.
• Xilinx Blockset uses n-bit fixed-point numbers (twos complement optional).

Format = Sign_Width_Binary point position from the LSB
(Sign: Fix = signed value, UFix = unsigned value)

Example: Format = Fix_16_13, Value = -2.261108…

  Weights:  -2^2  2^1  2^0 . 2^-1  2^-2  ...  2^-13
  Bits:      1    0    1   . 1 0 1 1 1 1 0 1 0 0 1 0 1
            (integer part)   (fraction part)

• Design hint: always try to maximize the dynamic range of the design by using only the required number of bits.
• Thus, a conversion is required when Xilinx blocks communicate with Simulink blocks (Xilinx blockset → MATLAB I/O → Gateway In/Out).
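The Fix_16_13 example on this slide can be checked numerically. The sketch below interprets a bit string as a twos complement integer and then scales it by the number of fraction bits. The helper name `fixed_to_float` is hypothetical; the bit pattern is the 16-bit word shown on the slides.

```python
def fixed_to_float(bits, width, frac_bits, signed=True):
    """Interpret a bit string as a fixed-point number (Fix/UFix _width_frac_bits)."""
    if len(bits) != width:
        raise ValueError("bit string length must equal the declared width")
    raw = int(bits, 2)
    if signed and raw >= 1 << (width - 1):
        raw -= 1 << width                # the MSB carries weight -2**(width - 1)
    return raw / (1 << frac_bits)        # move the binary point frac_bits places left

value = fixed_to_float("1011011110100101", width=16, frac_bits=13)
unsigned_value = fixed_to_float("1011011110100101", 16, 13, signed=False)
```

Read as a signed Fix_16_13 number the word evaluates to about -2.261108, matching the slide; read as UFix_16_13 the same bits give a positive value, which is why the sign prefix is part of the format name.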
What About All Those Other Bits?
• The Gateway In and Out blocks support parameters to control the conversion from double precision to N-bit fixed-point precision

[Figure: a DOUBLE value (bits ... 1 0 1 1 0 1 1 1 1 0 1 0 0 1 0 1 ..., weights down to 2^-13) is converted to FIX_12_9 (bits 1 0 1 1 0 1 1 1 1 0 1 0, weights -2^2 down to 2^-9)]

• OVERFLOW (bits above the MSB): Wrap, Saturate, or Flag Error
• QUANTIZATION (bits below the LSB): Truncate or Round
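The overflow and quantization options can be modelled in a few lines. This is a behavioural sketch, not the Gateway block's actual implementation: truncation is modelled as dropping the low-order bits (rounding toward minus infinity), rounding as adding half an LSB before truncating, and the function name `to_fixed` and its parameter names are hypothetical.

```python
import math

def to_fixed(value, width, frac_bits, quantization="truncate", overflow="saturate"):
    """Convert a float to a signed fixed-point integer code of the given format."""
    scaled = value * (1 << frac_bits)
    if quantization == "round":
        raw = math.floor(scaled + 0.5)           # round to the nearest code
    else:
        raw = math.floor(scaled)                 # truncate: drop the extra LSBs
    lowest = -(1 << (width - 1))
    highest = (1 << (width - 1)) - 1
    if raw < lowest or raw > highest:
        if overflow == "saturate":
            raw = max(lowest, min(highest, raw)) # clamp to the nearest limit
        elif overflow == "wrap":
            raw = (raw - lowest) % (1 << width) + lowest   # discard the extra MSBs
        else:
            raise OverflowError("value does not fit in the target format")
    return raw

x = -18523 / 8192                                # the Fix_16_13 example value
code = to_fixed(x, width=12, frac_bits=9)        # FIX_12_9, truncated
saturated = to_fixed(5.0, 12, 9, overflow="saturate")
wrapped = to_fixed(5.0, 12, 9, overflow="wrap")
```

The truncated code corresponds to the bit pattern 101101111010, the top twelve bits of the 16-bit example word. Saturation keeps an out-of-range value at the largest representable code, while wrapping silently changes its sign.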
Creating a System
Generator Design
IO blocks used as interface between the Xilinx
blockset and other Simulink blocks
Simulink sources
SysGen blocks
realizable in Hardware
Simulink sinks &
library functions
Using the Scope
• Click Properties to change the number of axes displayed and the time range value (X-axis)
• Use Data history to control how many values are stored and displayed on the scope
• Click Autoscale to quickly let the tools configure the display to the correct axis values
• Right-click on the Y-axis to set its value
Design & Simulate in
Simulink
Simulate the design by pushing “play.” Go to
“Simulation Parameters” under the “Simulation”
menu to control the length of simulations
Resource Estimator
• The block provides fast estimates of the FPGA resources required to implement the subsystem
• Most of the blocks in the System Generator Blockset carry resource information for:
  – LUTs
  – FFs
  – BRAM
  – Embedded multipliers
  – 3-state buffers
  – I/Os
Resource Estimator
• Three types of estimation
  – Estimate Area
    • This option computes resources for the current level and all sub-levels
  – Quick Sum
    • Uses the resources stored in each block directly and sums them up (no sub-level functions are invoked)
  – Post-Map Area
    • Opens a file browser and lets the user select a map report file. The design should have been generated and gone through the synthesis, translate, and mapping phases.
The Black Box
Use the Black Box when:
• You need a function that cannot be created with the Xilinx Blockset
• You already have a piece of VHDL you wish to use for a section of
the design
Creates a place holder for the ‘Black
box’ in generated VHDL
Use Black Box parameters to control
the VHDL placeholder’s features
Generate the VHDL Code
Once complete, double click
the System Generator token
Select the target device
Select to generate the
testbench
Set the System clock
period desired
Generate the VHDL
Hardware-in-the-Loop Reduces Design Time & Cost
• Configure any development board for hardware-in-the-loop using a JTAG header in < 20 minutes
  – Automatically create FPGA bit-stream from Simulink
  – Transparent use of FPGA implementation tools
  – Accelerate and verify the Simulink design using FPGA hardware
  – Mirrors traditional DSP processor design flows
• Combine with black box to simulate HDL & EDIF
Create Bit-stream
• Step 1: Select target H/W platform
• Step 2: Generate bit-stream

Co-Simulate in Hardware
• Step 3 (contd.): Post-generation script creates a new library containing a parameterized run-time co-simulation block.
• Step 4: Copy the co-simulation run-time block into the original model.
• Step 5: Simulate for verification
Hardware in the Loop Performance Results

Single Step Clock Mode (bit and cycle accurate)

Application                             Software simulation time (s)   Hardware simulation time (s)   Speed-up
Image Filtering                         676                            6                              112X
QAM Demodulator + Extension             1203                           18                             67X
5 x 5 Image Filter                      170                            4                              43X
Cordic Arc Tangent                      187                            27                             7X
Additive White Gaussian Noise Channel   600                            80                             7.5X

Free Running Clock Mode
A free-running clock is provided to the design, so the hardware no longer runs in lockstep with the software. The test is started, and after some time a 'done' flag is set to read the results from the FPGA and display them in Simulink. Using this hardware co-simulation method, designers can achieve up to 6 orders of magnitude performance enhancement over the original software simulation.
DSP System Generator: Summary
• System Generator for DSP
  – Advantages
    • Ability to simulate the design at a system level
    • High level of abstraction: very attractive for FPGA novices
    • Optimize for area, speed, or a combination
    • Estimate resources easily
    • Hardware co-simulation (FPGA in the loop)
    • Test bench and golden data written automatically
  – Disadvantages
    • Cost of abstraction: doesn't always give the best result from an area usage point of view
    • Only as good as the IP support
FPGAs versus DSP

• FPGAs can outperform DSP processors on certain DSP tasks:
  – computation-intensive tasks
  – highly parallelizable tasks
• DSP processors have the advantage in:
  – development infrastructure
  – time-to-market
  – developer familiarity
• DSP processors are still easier to use
• Many engineers possess DSP processor development skills
• Ultimate speed is not always the first priority
• A combination of FPGA and DSP processor is an excellent solution if performance requirements cannot be met by the processor alone
• The "best" architecture depends on the requirements of the application
Problem with this flow?

• There is a significant conceptual and representational divide between the system architects working at the system/algorithmic level and the hardware design engineers working with RTL representation in VHDL.
• Manual translation from one to the other is time consuming and prone to error.
• Any change made to the original specs during the course of the project is painful and time consuming to translate again to RTL.
Direct RTL Generation

• Some system/algorithmic level design environments offer direct VHDL code generation.
• An example of this type of environment is offered by AccelChip Inc., whose environment can accept floating-point MATLAB M-files, output their fixed-point equivalents for verification, and then use these new M-files to auto-generate RTL.

[Figure: two flows from the original concept. (a) Within one system/algorithmic environment: verification (floating-point) → verification (fixed-point) → auto-generate Verilog/VHDL RTL (fixed-point). (b) Verification (floating-point) in the system/algorithmic environment, then a third-party environment performs auto-interactive quantization (fixed-point) and auto-generates Verilog/VHDL RTL (fixed-point). Both lead to standard RTL-based simulation and synthesis]
Transposed FIR with Multiplier Block
DSP Processors vs. FPGAs

High-Speed DSP Processor
• 1-8 multipliers
• Needs looping for more than 8 multiplications
• Needs multiple clock cycles because of serial computation
• A 200-tap FIR filter would need 25+ clock cycles per sample with an 8-MAC-unit processor

High Level of Parallel Processing in FPGA
• Can implement hundreds of MAC functions in an FPGA
• Parallel implementation allows for faster throughput
  – A 200-tap FIR filter would need 1 clock cycle per sample

[Figure: a DSP processor with a few MAC units beside an FPGA filled with an array of MAC units]
Multiply Accumulate Single Engine

• Sequential processing limits data throughput:
  – Time-shared MAC unit
  – Data width is fixed
  – High clock frequency creates a difficult system challenge
• 256-tap FIR filter
  – 256 multiply and accumulate (MAC) operations per data sample
  – One output every 256 clock cycles

[Figure: Data In → Reg → single MAC unit → Data Out, with the loop algorithm executed 256 times]
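The cycle counts quoted for serial and parallel implementations follow from dividing the number of taps by the number of MAC units available in each clock cycle. The sketch below is a simple idealized model that ignores loop overhead, memory access, and pipelining latency; the function names and the 50 MHz clock used in the example are assumptions.

```python
def cycles_per_sample(taps, mac_units):
    """Clock cycles needed to compute one output when taps share mac_units MACs."""
    return -(-taps // mac_units)                 # integer ceiling division

def max_sample_rate_hz(clock_hz, taps, mac_units):
    """Highest input sample rate the filter can sustain at the given clock."""
    return clock_hz / cycles_per_sample(taps, mac_units)

single_engine = cycles_per_sample(256, 1)        # one time-shared MAC unit
eight_macs = cycles_per_sample(200, 8)           # DSP processor with 8 MAC units
fully_parallel = cycles_per_sample(256, 256)     # one MAC per tap in an FPGA
serial_rate = max_sample_rate_hz(50e6, 256, 1)   # samples per second at 50 MHz
```

In this model the single-engine 256-tap filter sustains under 200 thousand samples per second from a 50 MHz clock, while the fully parallel version accepts a new sample on every clock cycle.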
Filters: Applications
Impulse Response

The Impulse Response of an FIR filter is obtained from the
output of a filter when a single unit impulse is input:
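A quick way to see this is to feed a unit impulse into a small FIR model and compare the output with the weights. The sketch below assumes a hypothetical `fir_filter` helper and uses the eight power-of-two coefficients from the VHDL example in these notes.

```python
def fir_filter(samples, weights):
    """Direct-form FIR: weighted sum over the most recent len(weights) samples."""
    delay_line = [0] * len(weights)
    outputs = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]       # shift the new sample in
        outputs.append(sum(w * d for w, d in zip(weights, delay_line)))
    return outputs

weights = [0.125, 0.25, 0.5, 1, 1, 0.5, 0.25, 0.125]
impulse = [1] + [0] * 9                          # a single 1 followed by zeros
response = fir_filter(impulse, weights)
```

The output reproduces the weights one after another and then returns to zero, which is why the response is called finite: it lasts exactly as many samples as the filter has taps.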
Solution: Building a MAC
with System Generator
MAC using Slice-Based Multiplier
Slice Count: 70 Slices
MAC using Embedded Multiplier
Slice Count: 22 Slices,
1 embedded multiplier
Performance: ~130 MHz
(2v1000 -4)
Performance: ~126 MHz
(2v1000 -4)
$$c = \sum_i a_i b_i$$

[Figure: inputs a and b are multiplied and the product is added to the accumulated output c]
FIR: Cont … VHDL Implementation

• For convenience the selected coefficients are powers of 2.
• To operate, the filter must have eight register stages, each of which is eight bits wide.
  – Therefore, for the register or memory portion of the design, 64 flip-flops are required.
• At each clock cycle, each coefficient is multiplied by the eight-bit value in the appropriate register.
  – Due to the selection of "powers of two" coefficients, multiplication is achieved by a simple shifting operation.
• The coefficient values may be stored as constants.
  – The coefficients used in the example are given below:
    a0 = 2^-3, a1 = 2^-2, a2 = 2^-1, a3 = 1, a4 = 1, a5 = 2^-1, a6 = 2^-2, a7 = 2^-3
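The shift-based multiplication can be modelled in software before writing any HDL. The sketch below is a behavioural Python model of the eight-tap filter described above, assuming unsigned 8-bit inputs; each power-of-two coefficient becomes a right shift, and the names `SHIFTS` and `fir_shift_model` are hypothetical.

```python
# a0 .. a7 = 2^-3, 2^-2, 2^-1, 1, 1, 2^-1, 2^-2, 2^-3 expressed as right shifts
SHIFTS = [3, 2, 1, 0, 0, 1, 2, 3]

def fir_shift_model(samples):
    """Eight-tap FIR with power-of-two coefficients, using shifts instead of multiplies."""
    reg = [0] * 8                                # eight 8-bit register stages
    outputs = []
    for x in samples:
        # The output is formed from the current register contents ...
        outputs.append(sum(r >> s for r, s in zip(reg, SHIFTS)))
        # ... and then the registers shift, with the new sample entering last.
        reg = reg[1:] + [x]
    return outputs

outputs = fir_shift_model([255] * 9)             # hold the input at full scale
worst_case = outputs[-1]
```

Holding the input at 255 until all eight registers are full gives the largest possible output, 952, so the output needs at least 10 bits. For non-negative integers a right shift by k gives the same result as integer division by 2^k.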
VHDL Description of FIR Filter
library ieee;
use ieee.std_logic_1164.all;

entity FIR1 is
  port (clk : in  std_logic;
        x   : in  integer range 0 to 255;
        -- worst case with all registers at 255 is
        -- 31 + 63 + 127 + 255 + 255 + 127 + 63 + 31 = 952, so 10 bits are needed
        y   : out integer range 0 to 1023);
end entity FIR1;

architecture arch1 of FIR1 is
begin
  process (clk)
    -- eight register stages, each holding one 8-bit sample
    type RegType is array (7 downto 0) of integer range 0 to 255;
    variable Reg : RegType := (others => 0);
  begin
    if (clk'event and clk = '1') then
      -- multiply/accumulate (MAC) operation; division by a power of two is a shift
      y <= Reg(0)/8 + Reg(1)/4 + Reg(2)/2 + Reg(3)
         + Reg(4) + Reg(5)/2 + Reg(6)/4 + Reg(7)/8;
      -- update register values by shifting
      Reg(0) := Reg(1); Reg(1) := Reg(2); Reg(2) := Reg(3); Reg(3) := Reg(4);
      Reg(4) := Reg(5); Reg(5) := Reg(6); Reg(6) := Reg(7); Reg(7) := x;
    end if;
  end process;
end architecture arch1;