Using Programmable Logic to
Accelerate DSP Functions
“A Tutorial”
Greg Goslin
Digital Signal Processing Applications Manager
Corporate Applications Group
15OCT95
Agenda

When to use FPGAs for DSP, an Overview
– What is Digital Signal Processing (DSP)?
– Where is DSP Used?
– Traditional DSP Approaches.

The Promise of Programmable Logic
– Case Study: Finite Impulse Response Filter.
– Case Study: Viterbi Decoder.

Design Methodologies for DSP in FPGAs
– Design Entry and Third Party Software Tools.

Building Fast Filters in FPGAs, a Tutorial
– Efficient Algorithms for FPGAs.
– Using Distributed Arithmetic for Filter Designs.
– How to Use an FPGA to Build Filter Designs.
When to Use FPGAs for DSP

High Sample Rates
– Up to 66 MHz (off-chip) with XC4000E-2

Short Word Lengths
– DA algorithm is faster with shorter word length

Lots of Filter Taps with DA
– FPGA processes all taps in parallel, faster than DSP

Low Sample Rates
– Integrate DSP + system logic in a low-cost DSP using serial sequential Distributed Arithmetic algorithms

Fast Correlators

Single-Chip Solution Required

HardWire Gate Array Migration path for high-volume designs

[Figure: Data Rate (with 50 MHz system clock), 0 to 50, versus Arithmetic Operations Per Sample (MACs), 1 to 48, showing the FPGA Region and the DSP Region, with curves for 1 DSP to 4 DSPs]
Constraint Driven Design Methodology

Constraints
– System Requirements
– Hardware Limitations

Data Rate
– Inputs
– Outputs
– Multi-Channel I/O

Quality
– Number of Bits/Taps
– Number of Operations
– Error Tolerance

Processor Power

Clock Rate

[Figure: Constraint Driven Design methodologies: the Data Rate, Quality, Processor Power, and Clock Rate options balanced between Performance and Efficiency]
Constraints

Data Rate
– Functional Algorithms must operate at system speed.
– Below System Frequency, the design has NO Value.
– Above System Frequency, the design has NO added Value.

Quality
– Data and Coefficient Bandwidth, m-Bits.
– Number of operations within Function, n-Taps.
– Error Tolerance, +/- LSB.
Design Implementation

Algorithm Evaluation:
– Data Flow Structure
– Parallel/Serial Operation
– Variable/Constant Operators
– Single/Multiple Data Path

Processor Power
– Maximum Processing Rate, Device Dependent
– Number of Clock Cycles to Perform Algorithm
– Bandwidth
– Data, Coefficients, Input/Output

Clock Rate
– Subdivision of Data Rate Clock
Case Study - Viterbi Decoder

Design Evaluation
– Multi-Path Processes
– Repeated Independent Functions
– While(), For() Loops
– Symmetrical Design

Performance
– Programmable DSP
– 24 clock cycles
– 360nsec @66MHz

[Figure: Viterbi decoder data path on the I/O Bus: Old_1 and Old_2 from the Prestate Buffer, INC, adders and subtractors forming Diff_1 and Diff_2, MSB-selected MUXes, and outputs New_1 and New_2; bus labels 24-bit and 1 0 Bit]
DSP Design Implementation

Algorithm Evaluation:
– Data Flow Structure
– Parallel/Serial Operation
– Single/Multiple Data Path
– Variable/Constant Operators
– While() and For() Loops

Processor Power
– Maximum Processing Rate, Device Dependent
– Number of Clock Cycles to Perform Algorithm
– Bandwidth
– Data, Coefficients, Output

Clock Rate
– Subdivision of Data Rate Clock

[Figure: sequential DSP flow from Data_In to Data_Out: Begin; Fetch - A, Fetch - B, Fetch - C, Fetch - D, Fetch - E; f(A,B) = C; f(D,E) = F; g(C,F) = G; Send - C; Send - G; Adaptive Changes]
FPGA-Based DSP Coprocessor
Design Implementation

Performance
– Programmable DSP
– 24 clock cycles
– 360nsec @66MHz
– FPGA-Based Coprocessor
– 9 clock cycles
– 135nsec @66MHz

Results:
– 37.5% of original processing time
– 2.67X Increase in throughput
– System Requirements:
– Before: 4-DSPs, 12-RAMs
– After: 2-DSPs, 6-RAMs, 1-XC4013E

[Figure: pipelined FPGA data path on the I/O Bus: Old_1 and Old_2 from the Prestate Buffer, INC, adders and subtractors forming Diff_1 and Diff_2, MSB-selected MUXes, and outputs New_1 and New_2, with REG stages between blocks; bus labels 24-bit and 1 0 Bit]

[Figure: Relative Performance, scale 0 to 3: Two 66 MHz DSPs with Six 15 ns RAMs, 360 ns; 66 MHz DSP+FPGA with Three 15 ns RAMs, 135 ns; 2.67 times better performance with FPGA-assisted DSP]
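The timing figures on this slide follow from the two cycle counts. The short Python sketch below reproduces the arithmetic, assuming a 15 ns clock period (the nominal 66 MHz period that yields the quoted 360 ns and 135 ns); the function name is illustrative only.

```python
CLOCK_PERIOD_NS = 15  # assumption: nominal period of the 66 MHz clock

def processing_time_ns(clock_cycles, period_ns=CLOCK_PERIOD_NS):
    """Time for one pass of the algorithm, in nanoseconds."""
    return clock_cycles * period_ns

dsp_ns = processing_time_ns(24)   # programmable DSP alone
fpga_ns = processing_time_ns(9)   # FPGA-based coprocessor

fraction = fpga_ns / dsp_ns       # share of the original processing time
speedup = dsp_ns / fpga_ns        # increase in throughput

print(f"{dsp_ns} ns -> {fpga_ns} ns: {fraction:.1%} of original, {speedup:.2f}X")
```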
Building Fast and Efficient Filters in FPGAs

Efficient Filter Algorithms for FPGAs
– Distributed Arithmetic:
– Serial Sequential
– Serial
– Parallel

Using Distributed Arithmetic for Filter Designs
– Serial FIR Filter Example
– Two-Bit Parallel FIR Example
– Full Parallel FIR Example

How to Use an FPGA to Build Filter Designs
– 8-Tap, 8-Bit FIR Filter SLICE
FIR FILTER EXAMPLE
Sum of Products Equation

• K Multiplies
• K Sums
• CLOCK = Multiply Time
• Sample Rate = Clock Rate

[Figure: K TAPS LONG FIR filter: SAMPLE DATA X0, X1, X2, ... (N BITS WIDE) multiplied by K COEFFICIENTS C0, C1, C2, ...; each PRODUCT is added by K SUMs to form the OUTPUT DATA]

IMPLEMENTATION ???
FIR FILTER EXAMPLE
PROGRAMMABLE DSP CHIP IMPLEMENTATION

SOFTWARE SOLUTION: FIR FILTER
• 1 Parallel Multiplier, Accumulator
• Time Share through Microcoding
• Relatively Low Sample Rates
• Multiple Chip Solution
• No Migration Path
• Complex Real Time Programming

FOR EACH SAMPLE DATA WORD
    FOR EACH TAP
        MULTIPLY C(i) TIMES X(i)
        ADD RESULT TO ACCUMULATOR

[Figure: K TAPS LONG FIR filter: SAMPLE DATA X0, X1, X2, ... (N BITS WIDE) multiplied by C0, C1, C2, ... and added by K SUMs to form the OUTPUT DATA]
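The software loop on this slide can be sketched in Python as below. This is a minimal illustration of the multiply-accumulate structure, not DSP-chip code; the function name `fir_mac` and the sample history layout are assumptions made for the example.

```python
def fir_mac(samples, coeffs):
    """Direct sum-of-products FIR: one multiply-accumulate per tap per sample."""
    taps = len(coeffs)
    history = [0] * taps              # history[0] is the newest sample
    out = []
    for x in samples:                 # FOR EACH SAMPLE DATA WORD
        history = [x] + history[:-1]  # shift the new sample into the delay line
        acc = 0
        for i in range(taps):         # FOR EACH TAP
            acc += coeffs[i] * history[i]  # MULTIPLY C(i) TIMES X(i), ADD TO ACCUMULATOR
        out.append(acc)
    return out

print(fir_mac([1, 0, 0, 0], [3, 2, 1]))  # impulse response: [3, 2, 1, 0]
```

Each output costs K multiplies and K sums, which is why a single time-shared multiplier limits the sample rate as the tap count grows.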
Distributed Arithmetic
Made Easy
8-Bit X 8-Bit Signed Multiply

                  B7B6B5B4B3B2B1B0
              X   A7A6A5A4A3A2A1A0

  A0(B7B6B5B4B3B2B1B0)
  A1(B7B6B5B4B3B2B1B0)
  A2(B7B6B5B4B3B2B1B0)
  A3(B7B6B5B4B3B2B1B0)
  A4(B7B6B5B4B3B2B1B0)
  A5(B7B6B5B4B3B2B1B0)
  A6(B7B6B5B4B3B2B1B0)
+ A7(B7B6B5B4B3B2B1B0)
  S15S14S13S12S11S10S9S8S7S6S5S4S3S2S1S0

Each partial product is SIGN EXTENDed (S) and shifted one bit position to the left of the one above it.
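The partial-product picture can be checked with a short Python sketch. It assumes two's-complement operands, in which the A7 row carries weight -2^7 and is therefore subtracted rather than added; Python's unbounded integers make the sign extension implicit. The function name is hypothetical.

```python
def signed_multiply_8x8(a, b):
    """Multiply two signed 8-bit integers by summing shifted partial products."""
    assert -128 <= a <= 127 and -128 <= b <= 127
    a_bits = a & 0xFF                 # two's-complement bit pattern A7..A0
    product = 0
    for i in range(8):
        if (a_bits >> i) & 1:
            partial = b << i          # Ai * (B7..B0), shifted i places left
            if i == 7:
                product -= partial    # A7 is the sign bit: weight -2**7
            else:
                product += partial
    return product

print(signed_multiply_8x8(-3, 100))
```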
D.A. ONE TAP FIR FILTER = D0 C0
REDUCES TO MULTIPLYING A VARIABLE TIMES A CONSTANT

2 WORD X N BIT LOOK UP TABLE
A[0]    DATA
0       ...000000
1       C0

[Figure: SAMPLE DATA Xn (N BITS WIDE, DIN) is shifted out serially, bits X0, X1, X2, X3, ..., into LOOK UP TABLE address A0 (ADRS); the table DATA feeds input A of the Scaling Accum. (+/-), whose REGISTER output returns through a 2^-1 scaling to input B and provides the FILTERED DATA OUT]

Accumulation, one sample bit per clock (B7...B0 is the word read from the look up table):
  X0(B7B6B5B4B3B2B1B0)
+ X1(B7B6B5B4B3B2B1B0)  = S9S8S7S6S5S4S3S2S1S0
+ X2(B7B6B5B4B3B2B1B0)  = S10S9S8S7S6S5S4S3S2S1S0
+ X3(B7B6B5B4B3B2B1B0)  = S11S10S9S8S7S6S5S4S3S2S1S0
+ X4(B7B6B5B4B3B2B1B0)  = S12S11S10S9S8S7S6S5S4S3S2S1S0
+ X5(B7B6B5B4B3B2B1B0)  = S13S12S11S10S9S8S7S6S5S4S3S2S1S0
+ X6(B7B6B5B4B3B2B1B0)  = S14S13S12S11S10S9S8S7S6S5S4S3S2S1S0
+ X7(B7B6B5B4B3B2B1B0)  = S15S14S13S12S11S10S9S8S7S6S5S4S3S2S1S0
D.A. TWO TAP FIR FILTER = D0 C0 + D1 C1

4 WORD X N BIT LOOK UP TABLE
A[10]   DATA
00      ...000000
01      C0
10      C1
11      C0 + C1

[Figure: SAMPLE DATA words D0 and D1 (N BITS WIDE) are shifted out serially, bits X0, X1, X2, ..., XN, into LOOK UP TABLE addresses A0 and A1 (ADRS); the table DATA feeds input A of the Scaling Accum. (+/-), whose REGISTER output returns through a 2^-1 scaling to input B and provides the FILTERED DATA OUT]

Accumulation, one bit of each sample per clock:
  (X0,0,X1,0)(B7B6B5B4B3B2B1B0)
+ (X0,1,X1,1)(B7B6B5B4B3B2B1B0)  = S9S8S7S6S5S4S3S2S1S0
+ (X0,2,X1,2)(B7B6B5B4B3B2B1B0)  = S10S9S8S7S6S5S4S3S2S1S0
+ (X0,3,X1,3)(B7B6B5B4B3B2B1B0)  = S11S10S9S8S7S6S5S4S3S2S1S0
+ (X0,4,X1,4)(B7B6B5B4B3B2B1B0)  = S12S11S10S9S8S7S6S5S4S3S2S1S0
+ (X0,5,X1,5)(B7B6B5B4B3B2B1B0)  = S13S12S11S10S9S8S7S6S5S4S3S2S1S0
+ (X0,6,X1,6)(B7B6B5B4B3B2B1B0)  = S14S13S12S11S10S9S8S7S6S5S4S3S2S1S0
+ (X0,7,X1,7)(B7B6B5B4B3B2B1B0)  = S15S14S13S12S11S10S9S8S7S6S5S4S3S2S1S0
D.A. THREE TAP FIR FILTER

8 WORD X N BIT LOOK UP TABLE
A[210]  DATA
000     ...000000
001     C0
010     C1
011     C1 + C0
100     C2
101     C2 + C0
110     C2 + C1
111     C2 + C1 + C0

[Figure: SAMPLE DATA words D0, D1, and D2 (N BITS WIDE) are shifted out serially, bits X0, X1, X2, ..., XN, into LOOK UP TABLE addresses A0, A1, and A2 (ADRS); the table DATA feeds input A of the Scaling Accum. (+/-), whose REGISTER output returns through a 2^-1 scaling to input B and provides the FILTERED DATA OUT]

Accumulation, one bit of each sample per clock:
  (X0,0,X1,0,X2,0)(B7B6B5B4B3B2B1B0)
+ (X0,1,X1,1,X2,1)(B7B6B5B4B3B2B1B0)  = S9S8S7S6S5S4S3S2S1S0
+ (X0,2,X1,2,X2,2)(B7B6B5B4B3B2B1B0)  = S10S9S8S7S6S5S4S3S2S1S0
  ...
+ (X0,N,X1,N,X2,N)(B7B6B5B4B3B2B1B0)  = S(N+M) ... S13S12S11S10S9S8S7S6S5S4S3S2S1S0
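The one, two, and three tap examples generalise to K taps: one bit from every sample forms the look up table address, and the table output is accumulated once per bit. The Python sketch below is a behavioural model under stated assumptions, not the hardware: it weights each table output with a left shift instead of feeding the accumulator back through the 2^-1 scaling (the two differ only by a fixed scale factor), it assumes signed two's-complement samples so the final (sign) bit is subtracted, and the function names are hypothetical.

```python
def build_da_lut(coeffs):
    """Entry `addr` holds the sum of the coefficients whose address bit is 1."""
    k = len(coeffs)
    return [sum(c for t, c in enumerate(coeffs) if (addr >> t) & 1)
            for addr in range(1 << k)]

def da_dot_product(samples, coeffs, n_bits=8):
    """Sum of samples[t] * coeffs[t] for signed n_bits-wide samples, bit-serially."""
    lut = build_da_lut(coeffs)
    acc = 0
    for bit in range(n_bits):                  # one clock per sample bit, LSB first
        addr = 0
        for t, x in enumerate(samples):
            addr |= ((x >> bit) & 1) << t      # bit `bit` of tap t drives address line At
        term = lut[addr] << bit                # weight of this bit position
        if bit == n_bits - 1:
            acc -= term                        # sign bits carry negative weight
        else:
            acc += term
    return acc

print(build_da_lut([1, 2, 4]))                       # addresses 000..111
print(da_dot_product([3, -5, 100], [7, -2, 9]))      # 3*7 + (-5)*(-2) + 100*9
```

Note that the loop runs once per sample bit regardless of the number of taps; only the table size grows with K.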
The Development of a
Distributed Arithmetic FIR Filter
10 Bit 10 Tap - XC4000 Family Example
10 BIT 10 TAP SYMMETRICAL FIR FILTER
Look Up Table is only 32 words by 10 bits

[Figure: SAMPLE DATA (10 bits, PARALLEL IN SERIAL OUT) enters a 100 BIT SHIFT REGISTER made of 10 BIT SHIFT REGISTERs with outputs D0 to D9. Five Serial Adders (ADD) combine the pairs D0/D9, D1/D8, D2/D7, D3/D6, and D4/D5 to drive addresses A0 to A4 of the 32 X 10 MEMORY LOOK UP TABLE (320 BITS). The table DATA passes through an XOR stage (COMPLEMENT ON LAST BIT & ADD 1) into the Scaling Accum. REGISTER (B(9:0), SIGN EXT B10, C_I, LD, LOAD ON FIRST BIT), which produces the 11-bit FILTERED DATA OUT, SUM(10,1), as the Most Significant BYTE; an OPTIONAL DOUBLE PRECISION Shift Reg. collects SUM(0) through DIN as the Least Significant BYTE]
SERIAL TIME SKEW BUFFER
SAMPLE DATA WORD SIZE = N BITS
NUMBER OF TAPS = K
# OUTPUTS = # TAPS
• One N Bit Shift Register Per Tap
• Use 4000 RAM to build Shift Register
• One 16 Bit Shift Register Per 1/2 CLB

[Figure: SAMPLE DATA (N bits, PARALLEL IN, SERIAL OUT) feeds an N • K BIT SHIFT REGISTER made of N BIT SHIFT REGISTERs with serial outputs D_0, D_1, ..., D_k-1. 10 BIT 10 TAP = 50 CLBs]

[Figure: SHIFT REGISTER IMPLEMENTED IN RAM: each tap uses a RAM16X1R (DATA_I, A3 to A0, WR, CLK, DATA_O), with outputs D_0, D_1, ..., D_k-1. 10 BIT 10 TAP = 10 CLBs]
Serial Adder
1 CLB Per 2 Taps

[Figure: one-bit ADD cell: A + B + Carry In gives the SUM, registered in a D FF on Clk; the Carry is held in a second D FF and cleared (CLR) at CNT=10. Five Serial Adders combine D0/D9, D1/D8, D2/D7, D3/D6, and D4/D5]
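A serial adder handles one bit pair per clock and keeps the carry in a flip-flop between clocks. The Python sketch below models that behaviour on LSB-first bit lists; the helper names and the choice of an 11-bit stream (one sign-extension bit beyond the 10-bit samples, so the sum cannot overflow) are assumptions made for the example.

```python
def serial_add(a_bits, b_bits):
    """Add two LSB-first bit streams with a one-bit adder and a carry flip-flop."""
    carry = 0                            # carry flip-flop, cleared at the start of a word
    sum_bits = []
    for a, b in zip(a_bits, b_bits):     # one bit pair per clock
        total = a + b + carry
        sum_bits.append(total & 1)       # SUM output for this clock
        carry = total >> 1               # carry held for the next clock
    return sum_bits

def to_bits(value, width):
    """LSB-first two's-complement bits of `value`."""
    return [(value >> i) & 1 for i in range(width)]

def from_bits(bits):
    """Interpret LSB-first bits as a two's-complement integer."""
    value = sum(bit << i for i, bit in enumerate(bits))
    return value - (1 << len(bits)) if bits[-1] else value

# Two 10-bit samples, sign-extended to 11 bits before being added serially.
print(from_bits(serial_add(to_bits(300, 11), to_bits(-45, 11))))
```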
DISTRIBUTED ARITHMETIC
LOOK-UP TABLE
• HOLDS ALL PARTIAL PRODUCTS
• LUT IS AS WIDE AS COEFF
• CAN USE MEMGEN TO BUILD LUT

[Figure: 32 X 10 MEMORY LOOK UP TABLE (320 BITS) with address inputs A0 to A4 and a DATA output]
1’s COMPLEMENTER
• INVERTS DATA ON LAST CYCLE
• 2 BITS PER CLB

[Figure: an INVERT control applied to data bits D0, D1, ... ahead of D/Q flip-flops]
SCALING ACCUMULATOR
• ADDS DATA TO (1/2) * (SUM OUT)
• 2 BITS PER CLB
• NEED N+1 BITS
• DOUBLE PRECISION WITH SR
• CAN USE XBLOX FOR RPM

[Figure: the 10-bit DATA enters adder input A; the REGISTER output (SUM OUT, 11 bits, Most Significant BYTE) returns to input B as B(9:0) with SIGN EXT on B10; C_I: FORCE CARRY-IN ON LAST BIT; LD: LOAD ON FIRST BIT; SUM(0) feeds DIN of an OPTIONAL DOUBLE PRECISION Shift Reg. holding the Least Significant BYTE]
10 BIT 10 TAP SYMMETRICAL FIR FILTER

[Figure: SAMPLE DATA (10 bits, PARALLEL IN SERIAL OUT) enters a 100 BIT SHIFT REGISTER made of 10 BIT SHIFT REGISTERs with outputs D0 to D9. Five Serial Adders (ADD) combine D0/D9, D1/D8, D2/D7, D3/D6, and D4/D5 to drive addresses A0 to A4 of the 32 X 10 MEMORY LOOK UP TABLE (RAM), 320 BITS. The table DATA passes through an XOR stage (COMPLEMENT ON LAST BIT & ADD 1) into the Scaling Accum. REGISTER (LOAD ON FIRST BIT), which produces the 11-bit FILTERED DATA OUT as the Most Significant BYTE; an OPTIONAL DOUBLE PRECISION Shift Reg. holds the Least Significant BYTE]

[Figure: the same filter with CLB counts, 50 MHz CLK: TIMING AND CONTROL (CNTEQ9, CNTEQ10); RAM BASED SHIFT REGISTER (SERIAL TIME SKEW BUFFER) for the SAMPLE DATA; FIVE 2 BIT ADDERS (2 TO 1 REDUCTION DUE TO SYMMETRY); 32 X 10 RAM OR ROM LOOK UP TABLE (FIR FILTER COEFFICIENTS AND MULTIPLY LOOK UP); XOR 1’S COMPLEMENT (COMPLEMENT ON LAST CYCLE); SCALING ACCUMULATOR (ADDER and REGISTER); FILTER OUT, 9 Most Significant Bits. Block sizes shown: 10, 5, 7, 5, 10, and 7 CLBs]

• TOTAL OF 44 CLBS: FITS IN A 4002A (WITH 20 CLBS EXTRA FOR SYSTEM DESIGN)
• ABOUT 1300 EQUIVALENT GATES - LITTLE INTERCONNECT BETWEEN BLOCKS
NUMBER OF 10 BIT 10 TAP SYMMETRICAL FIR FILTERS PER XC4000 DEVICE

XC4000 PART            4002A  4003A  4004A  4005A  4006  4008  4010  4013  4025
NUMBER OF INSTANCES    1      2      3      5      6     8     10    15    23
10 BIT 10 TAP
FIR FILTER
PERFORMANCE
• FIR10B10T MACRO CAN BE CLOCKED AT 50 MHZ
• 10 BIT WORD REQUIRES 11 CLOCKS
• 8 BIT WORD REQUIRES 9 CLOCKS, ETC
• 10 BIT SAMPLE WORD RATE IS 4.5 MHZ

WORD SIZE (BITS)     6     8     10    12    14    16
SAMPLE RATE (MHZ)    7.1   5.5   4.5   3.8   3.3   2.9

[Figure: FIR Filter Macro FIR10B10T (Relatively Placed Macro) with pins DATA IN, DIN_, BIT_CLK, 10X_CLK, DATA OUT, DOUT_, CLK_OUT, WORD_CLK]
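The sample rates in this table follow from the clock count per word. A minimal Python sketch, assuming every N-bit word takes N + 1 clocks of the 50 MHz bit clock, as the bullets state for 10-bit and 8-bit words; the function name is illustrative.

```python
def serial_da_sample_rate_mhz(word_bits, clock_mhz=50.0):
    """Sample rate when each N-bit word takes N + 1 bit clocks."""
    return clock_mhz / (word_bits + 1)

for n in (6, 8, 10, 12, 14, 16):
    print(f"{n:2d} bits: {serial_da_sample_rate_mhz(n):.2f} MHz")
```

Truncating the computed values to one decimal place gives the figures listed on the slide.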
Double-Rate DA FIR Filters
Two Bit Parallel Distributed Arithmetic FIR Filter

Process 2 Bits per Clock
# of Clocks = (N/2) + 1
Twice as fast

16 WORD X N BIT LOOK UP TABLE
A[3210]   DATA
0000      ...000000
0001      C0
0010      2C0
0011      3C0
0100      C1
0101      C1 + C0
0110      C1 + 2C0
0111      C1 + 3C0
1000      2C1
1001      2C1 + C0
1010      2C1 + 2C0
1011      2C1 + 3C0

[Figure: SAMPLE DATA words D0 and D1 (N BITS WIDE) are shifted out two bits at a time; D0 drives LOOK UP TABLE addresses A1 and A0, D1 drives A3 and A2 (ADRS); the table DATA feeds input A of the Scaling Accum. (+/-), whose REGISTER output returns through a 2^-1 scaling to input B and provides the FILTERED DATA OUT]
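The 16-word table for the two-bit-parallel case can be generated rather than written by hand. The Python sketch below assumes that A1 and A0 carry two adjacent bits of the tap-0 sample and A3 and A2 carry two adjacent bits of the tap-1 sample, so each pair selects 0 to 3 times its coefficient; sign handling for the most significant bit pair is not modelled, and the function names are hypothetical.

```python
def two_bit_parallel_lut(c0, c1):
    """16-word LUT for two taps processed two bits per clock."""
    lut = []
    for addr in range(16):
        d0_pair = addr & 0b11           # A1 A0: 0..3 times C0
        d1_pair = (addr >> 2) & 0b11    # A3 A2: 0..3 times C1
        lut.append(d1_pair * c1 + d0_pair * c0)
    return lut

def clocks_per_word(word_bits):
    """Clocks per sample at two bits per clock: (N/2) + 1."""
    return word_bits // 2 + 1

lut = two_bit_parallel_lut(c0=1, c1=100)   # distinct values make the entries easy to read
print(lut[0b0011], lut[0b0101], lut[0b1011])
print(clocks_per_word(8))
```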
Double Sample Rate D.A. FIR Filters

Two Taps Requires 4 Input LUT without Symmetry

Four Taps Requires 4 Level LUT with Symmetrical FIR

Time Skew Buffer uses Twice as many CLBs

Twice the I/O Data Sample Rate

Both LUTs are the same
Full Parallel DA FIR Filters
Full Parallel Distributed Arithmetic FIR Filter

16 WORD X N BIT LOOK UP TABLE
A[3210]   DATA
0000      ...000000
0001      C0
0010      2C0
0011      3C0
0100      4C0
0101      5C0
0110      6C0
0111      7C0
1000      8C0
1001      9C0
1010      10C0
1011      11C0

[Figure: each SAMPLE DATA word (D0, D1, ..., N BITS WIDE) presents bits X7 to X0 in parallel; X3 to X0 and X7 to X4 each address a LUT-A (A3 to A0, ADRS), and the two LUT DATA outputs are added (A + B) through REG stages]
Full Parallel D.A. FIR Filters

One Tap Requires two 4 Input LUTs and an ADDER

Time Skew Buffer must use REGs

Maximum I/O Data Sample Rate
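For a single tap, the full parallel scheme amounts to splitting the sample into two 4-bit halves, looking each half up in the same 16-word table, and adding the results with the upper half weighted by 2^4. The Python sketch below assumes an unsigned 8-bit sample, which matches table entries running from 0 to 15 times C0; the function names are hypothetical.

```python
def nibble_lut(c0):
    """16-word LUT holding 0*C0 .. 15*C0, addressed by four sample bits."""
    return [k * c0 for k in range(16)]

def parallel_multiply(sample, c0):
    """sample * c0 for an unsigned 8-bit sample, using two lookups and one add."""
    assert 0 <= sample <= 255
    lut = nibble_lut(c0)
    low = lut[sample & 0xF]             # LUT addressed by X3..X0
    high = lut[(sample >> 4) & 0xF]     # identical LUT addressed by X7..X4
    return low + (high << 4)            # the adder, upper result weighted by 2**4

print(parallel_multiply(200, 37))
```

All bits of the word are consumed in one clock, which is why this structure reaches the maximum sample rate at the cost of more look up tables and adders.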
Large Number of TAPs:
8X - TAP FIR using an
8 - TAP SLICE

[Figure: two 8 tap slices (IN, TSB, OUT, ADD, LUT, REGISTER, 1’s COM, REGISTER, each N bits wide); their outputs are combined by an ADD and REGISTER (N+1 bits) and passed to a SCAL ACC and REGISTER (N+2 bits)]
8 Tap FIR Filter SLICE
Number of CLBs per Slice (up to 16 Bit Word)

[Figure: one 8 tap slice: IN, TSB, OUT, ADD, LUT, REGISTER (N), 1’s COM, REGISTER (N), ADD, REGISTER (N+1), SCAL ACC, REGISTER (N+2)]
N
4 +
4 + 1/2N + 1/2N + ((N+1)/2+1) + ((N+2)/2+1)
32 Tap Filter Using Four
8 Tap FIR Filter SLICE

[Figure: Sample Data (N) is loaded (New_word, Load) into the PSC and clocked by Bit_Clk through four 8 tap slices connected TSB IN to TSB (each with SER ADD, LUT, REGISTER, 1’s COM, REGISTER, 8 bits wide); ADD and REGISTER stages (9 bits) combine the slice outputs ahead of a single SCAL ACC and REGISTER that drives Data Out]
8 Tap FIR Filter SLICE Building Blocks

Parallel to Serial Converter (PSC), clocked by Byte_Clk and Bit_Clk: N/2 CLBs
Time Skew Buffer (Quad) (TSB), outputs Bit3 to Bit0: 2 CLBs (Up to 16 bit word)
Look Up Table (LUT) with REGISTER: N CLBs
N Bit ADDer with REGISTER (N+1): (N/2)+1 CLBs
N Bit SCAL ACCUM with REGISTER (N+1): (N/2)+1 CLBs
Serial Adder (ADD): 1 CLB
1’s Complementer (1’s COM): 1/2 CLB
8 Tap FIR Filter SLICE
APPROXIMATE NUMBER OF XC4000 CLBs

SAMPLE DATA WORD SIZE (N)     6     8    10    12    14    16    18    20    22    24
 8 TAPS                      40    44    48    52    56    60    84    88    92    96
16 TAPS                      64    70    78    84    92   100   118   126   132   140
24 TAPS                      90   100   112   122   134   144   200   214   228   242
32 TAPS                     108   122   142   148   160   174   238   256   272   290
40 TAPS                     138   156   174   192   210   228   316   340   362   386
48 TAPS                     158   178   198   218   238   258   364   388   414   440
56 TAPS                     180   204   226   250   272   296   414   444   474   504
8 Tap FIR Filter SLICE
PERFORMANCE with XC4000-4

SAMPLE DATA WORD SIZE       6     8     10    12    14    16    18    20    22    24
MEGA SAMPLES PER SECOND     7.1   5.5   4.5   3.8   3.3   2.9   2.6   2.3   2.1   2.0
DOUBLE RATE PERFORMANCE     12.5  10    8.3   6.9   6.2   5.5   5.0   4.5   4.1   3.8

Sample Rate is Independent of the Number of Taps
8 Bit Word FIR Filter Sample Rates

[Figure: Word Sample Rate (1 MHz to 5 MHz) versus Number of TAPS (16 to 80) for Distributed Arithmetic]
8 Bit Word FIR Filter Structures

[Figure: # CLBs (100 to 300) versus Number of TAPS (16 to 80) for four structures: Parallel Distributed Arithmetic (55 MHz), Two-Bit Parallel Distributed Arithmetic (16 MHz), Serial Distributed Arithmetic (8 MHz), and Serial Sequential Distributed Arithmetic (1000 to 50 kHz)]
FIR Filter Implementation Options
8 Bit Word Example

            Serial Sequential     Distributed Arithmetic    Parallel
8 Taps      36 CLBs, 1080 kHz     44 CLBs, 8.1 MHz          250 CLBs, 60 MHz
16 Taps     36 CLBs, 462 kHz      70 CLBs, 8.1 MHz          400 CLBs, 55 MHz
32 Taps     44 CLBs, 231 kHz      122 CLBs, 8.1 MHz
48 Taps     62 CLBs, 154 kHz      178 CLBs, 8.1 MHz
64 Taps     70 CLBs, 115 kHz      228 CLBs, 8.1 MHz
Lower Sample Rate Applications:
Efficient CLB Counts
Large Number of TAPs
Moderate Sample Rates
Non Symmetrical FIR OK
Serial Sequential Architecture
Serial Sequential - FIR Filter
32 Tap 8 Bit Example

[Figure: Sample Data enters the SAMPLE DATA BUFFER; a 5-BIT CNTR (3 CLBs) provides the Coefficient Select for the Coefficient Table, a 32 x 8 LUT holding 32 - 8 Bit Coefficients (8 CLBs); SDB Out passes through the PSR Parallel to Serial Converter (4 CLBs) to the SERIAL MULTIPLY block (Serial Multiplier, 24 CLBs Total), built from a Select, a 2^-1 Scale, and an ADD REGISTER (9 bits, 5 CLBs); the ACC REG produces the Filtered Data Out. Clk 50 MHz]
64-TAP Serial
Sequential FIR
Filter

[Figure: Sample Data feeds two chained SAMPLE DATA BUFFERs, each with its own Coefficient Select, SERIAL MULTIPLY, and ACC REG; the two results are combined by an ADD and REGISTER]
Serial Sequential - FIR Filter
Number CLBs vs. Taps / Word Size

            8 Bit   10 Bit   12 Bit   14 Bit   16 Bit
8 Tap        36      43       50       57       64
16 Tap       36      43       50       57       64
32 Tap       44      53       62       71       80
48 Tap       62      77       92      107      122
64 Tap       70      85      100      115      130
80 Tap       97     115      133      151      169
96 Tap       97     115      133      151      169
128 Tap     112     137      162      187      212

• 4002 = 64 CLBs
• 4005 = 196 CLBs
• 4013 = 576 CLBs
• 4025 = 1024 CLBs

[Figure: Sample Data, SAMPLE DATA BUFFER, Coefficient Select, SERIAL MULTIPLY, ACC REG, Filtered Data Out]
Serial Sequential - FIR Filter
Maximum Sample Rate / Word Size

TAPS        8 Bit     10 Bit    16 Bit
8 Tap       781 kHz   625 kHz   390 kHz
16 Tap      390 kHz   312 kHz   195 kHz
32 Tap      195 kHz   156 kHz   97 kHz
48 Tap      130 kHz   104 kHz   65 kHz
64 Tap      97 kHz    78 kHz    48 kHz
80 Tap      78 kHz    62 kHz    39 kHz
96 Tap      65 kHz    52 kHz    32 kHz
128 Tap     48 kHz    39 kHz    24 kHz

• Serial Mult. Limitations
• Can Use Multiple 16 Tap Building Blocks
• 8X Faster at 128 Taps

[Figure: Sample Data, SAMPLE DATA BUFFER, Coefficient Select, SERIAL MULTIPLY, ACC REG, Filtered Data Out]
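The values in this table are consistent with one shared serial multiplier spending one clock per bit on every tap, that is, clock / (taps × word size) at 50 MHz. That relation is inferred from the numbers rather than stated on the slide, so the Python sketch below should be read as a plausible model; the function name is hypothetical.

```python
def serial_sequential_rate_khz(taps, word_bits, clock_mhz=50.0):
    """Sample rate if each tap costs `word_bits` clocks on one shared serial multiplier."""
    return clock_mhz * 1000.0 / (taps * word_bits)

for taps in (8, 16, 32, 48, 64, 80, 96, 128):
    rates = [int(serial_sequential_rate_khz(taps, bits)) for bits in (8, 10, 16)]
    print(f"{taps:3d} taps: {rates} kHz")
```

Under the same model, splitting a 128 tap filter into eight 16 tap blocks that run in parallel keeps the 16 tap rate, which matches the "8X Faster at 128 Taps" bullet.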
Serial Sequential
16 Tap Slice
FIR Filter
Maximum Sample Rate / Word Size

TAPS        8 Bit     10 Bit    16 Bit
16 Tap      390 kHz   312 kHz   195 kHz
32 Tap      390 kHz   312 kHz   195 kHz
48 Tap
64 Tap
80 Tap
96 Tap
128 Tap

• 16-Tap Slice Used
• 32-Tap Slice Uses Fewer CLBs

[Figure: Sample Data feeds two chained SAMPLE DATA BUFFERs, each with its own Coefficient Select, SERIAL MULTIPLY, and ACC REG; the two results are combined by an ADD and REGISTER]
DESIGN METHODOLOGY

[Figure: design flow. SCHEMATIC CAPTURE, with the XBLOX PROCESSOR: CONVERT TO XNF. THIRD-PARTY FILTER DESIGN SOFTWARE: CONVERT COEFFICIENTS (FORMAT COEFFICIENTS INTO LOOK UP TABLE), then MEMGEN (GENERATE ROM) to XNF. The XNF files go to PARTITION PLACE AND ROUTE, then POST ROUTE SIMULATION, then BIT STREAM FOR DOWN LOAD CABLE, OR EPROM]
DESIGN METHODOLOGY
SCHEMATIC CAPTURE
• Filter Blocks can be Embedded in Complete design
• XBLOX Can Synthesize the Data Path Logic
• Filter Design Software used to design filter Coefficients
• Complete System Level Design in a Single Chip
• Incremental Filter Design Using XACT 5.0
FPGA
The Right Solution for Most Applications
Audio Sample Rates:
Don’t need Special DSP Chip
Serial Sequential Architecture is efficient
RF Sample Rates:
Programmable DSP Chip is too slow
FPGA is a single chip configurable solution
XILINX VS. D.S.P. CHIP COMPARISON
When Does It Make Sense To Use FPGAs?
• High Sample Rate Systems
• Low Sample Rates
• Small Word Length
• Lots of Taps
• Single Chip Solution Required
• Low Cost Migration Path (HardWire)
• Incremental Cost of DSP Chip
“Design Once”
DISTRIBUTED ARITHMETIC
FPGA Applications, Coming Attractions:
• Signal Synthesis
• Modulation, De-modulation
• FFTs
• Neural Networks
• Half Band FIR Filters
• Video Signal Processing
POSSIBILITIES
X.D.S.P.
XILINX Hardware Digital Signal Processing
• There is an Alternative to Software DSP Chip Solutions Today
• Existing Xilinx 3100, 4000, 4000A,E, & H can Efficiently do Signal Processing
• System Level Application Specific Solution on a Single Chip
• Standard Product Configurable Solution
• Automatic Migration Path to a Lower Cost/High Volume Solution