Using Programmable Logic to Accelerate DSP Functions
“A Tutorial”
Greg Goslin
Digital Signal Processing Applications Manager
Corporate Applications Group
15OCT95

Agenda
When to use FPGAs for DSP, an Overview
– What is Digital Signal Processing (DSP)?
– Where is DSP Used?
– Traditional DSP Approaches.
The Promise of Programmable Logic
– Case Study: Finite Impulse Response Filter.
– Case Study: Viterbi Decoder.
Design Methodologies for DSP in FPGAs
– Design Entry and Third Party Software Tools.
Building Fast Filters in FPGAs, a Tutorial
– Efficient Algorithms for FPGAs.
– Using Distributed Arithmetic for Filter Designs.
– How to use an FPGA to Build Filter Designs.

When to Use FPGAs for DSP
[Figure: Data Rate (with 50 MHz system clock), 0 to 50, versus Arithmetic Operations Per Sample (MACs), 1 to 48. Curves for Number of DSPs (1 DSP to 4 DSPs) separate an FPGA Region from a DSP Region.]
High Sample Rates
– Up to 66 MHz (off-chip) with XC4000E-2
Short Word Lengths
– DA algorithm is faster with shorter word length
Lots of Filter Taps with DA
– FPGA processes all taps in parallel, faster than DSP
Low Sample Rates
– Integrate DSP + system logic in a low-cost DSP using serial sequential Distributed Arithmetic algorithms
Fast Correlators
Single-Chip Solution Required
HardWire Gate Array
– Migration path for high-volume designs

Constraint Driven Design Methodology
Constraints
– System Requirements
– Hardware Limitations
Data Rate
– Inputs
– Outputs
– Multi-Channel I/O
Quality
– Number of Bits/Taps
– Number of Operations
– Error Tolerance
Processor Power
Clock Rate
Options
[Figure: Constraint Driven Design Methodologies. Data Rate, Quality, Processor Power and Clock Rate lead to Performance and Efficiency.]

Constraints
Data Rate
– Functional algorithms must operate at system speed.
– Below System Frequency, the design has NO Value.
– Above System Frequency, the design has NO added Value.
Quality
– Data and Coefficient Bandwidth, m-Bits.
– Number of operations within Function, n-Taps.
– Error Tolerance, +/- LSB.

Design Implementation
Algorithm Evaluation:
– Data Flow Structure
– Parallel/Serial Operation
– Variable/Constant Operators
– Single/Multiple Data Path
Processor Power
– Maximum Processing Rate, Device Dependent
– Number of Clock Cycles to Perform Algorithm
– Bandwidth – Data, Coefficients, Input/Output
Clock Rate
– Subdivision of Data Rate Clock

Case Study - Viterbi Decoder
Design Evaluation
– Multi-Path Processes
– Repeated Independent Functions
– While(), For() Loops
– Symmetrical Design
Performance
– Programmable DSP
  – 24 clock cycles
  – 360 ns @ 66 MHz
[Figure: Viterbi data path. Old_1, Old_2 and INC arrive on the I/O Bus; adders and subtractors form Diff_1 and Diff_2; the MSB of each difference drives a MUX that selects New_1 and New_2, which return to the I/O Bus; a 1/0 bit goes to the Prestate Buffer. Data paths are 24-bit.]

DSP Design Implementation
Algorithm Evaluation:
– Data Flow Structure
– Parallel/Serial Operation
– Single/Multiple Data Path
– Variable/Constant Operators
– While() and For() Loops
Processor Power
– Maximum Processing Rate, Device Dependent
– Number of Clock Cycles to Perform Algorithm
– Bandwidth – Data, Coefficients, Output
Clock Rate
– Subdivision of Data Rate Clock
[Figure: sequential flow from Data_In to Data_Out. Begin; Fetch - A; Fetch - B; f(A,B) = C; Send - C; Fetch - D; Fetch - E; f(D,E) = F; g(C,F) = G; Send - G; Adaptive Changes.]

FPGA-Based DSP Coprocessor Design Implementation
Performance
– Programmable DSP
  – 24 clock cycles
  – 360 ns @ 66 MHz
– FPGA-Based Coprocessor
  – 9 clock cycles
  – 135 ns @ 66 MHz
Results:
– 37.5% of original processing time
– 2.67X Increase in throughput
– System Requirements:
  – Before: 4-DSPs, 12-RAMs
  – After: 2-DSPs, 6-RAMs, 1-XC4013E
[Figure: the Viterbi data path of the previous slide with pipeline registers (REG) added after each adder and MUX stage.]
[Figure: Relative Performance bar chart. Two 66 MHz DSPs with six 15 ns RAMs: 360 ns. 66 MHz DSP+FPGA with three 15 ns RAMs: 135 ns. Caption: 2.67 times better performance with FPGA-assisted DSP.]

Building Fast and Efficient Filters in FPGAs
Efficient Filter Algorithms for FPGAs
– Distributed Arithmetic:
  – Serial Sequential
  – Serial
  – Parallel
Using Distributed Arithmetic for Filter Designs
– Serial FIR Filter Example
– Two-Bit Parallel FIR Example
– Full Parallel FIR Example
How to use an FPGA to Build Filter Designs
– 8-Tap, 8-Bit FIR Filter SLICE

FIR FILTER EXAMPLE
Sum of Products Equation
[Figure: N-bit-wide sample data X0, X1, X2, ... is multiplied by coefficients C0, C1, C2, ... (K taps long); the K products are summed to give the output data.]
K Multiplies
K Sums
CLOCK = Multiply Time
K X Sample Rate = Clock Rate
IMPLEMENTATION ???

FIR FILTER EXAMPLE
PROGRAMMABLE DSP CHIP IMPLEMENTATION
SOFTWARE SOLUTION: FIR FILTER
[Figure: the same sum-of-products structure, N bits wide and K taps long, computed by one shared multiplier and accumulator.]
• 1 Parallel Multiplier, Accumulator
• Time Share through Microcoding
• Relatively Low Sample Rates
• Multiple Chip Solution
• No Migration Path
• Complex Real Time Programming
FOR EACH SAMPLE DATA WORD
  FOR EACH TAP
    MULTIPLY C(i) TIMES X(i)
    ADD RESULT TO ACCUMULATOR

Distributed Arithmetic Made Easy
8-Bit X 8-Bit Signed Multiply
Each partial product row is shifted one bit position to the left of the row above it; B is sign extended (S).
      S B7B6B5B4B3B2B1B0
  X     A7A6A5A4A3A2A1A0
  ----------------------
    A0(B7B6B5B4B3B2B1B0)
    A1(B7B6B5B4B3B2B1B0)
    A2(B7B6B5B4B3B2B1B0)
    A3(B7B6B5B4B3B2B1B0)
    A4(B7B6B5B4B3B2B1B0)
    A5(B7B6B5B4B3B2B1B0)
    A6(B7B6B5B4B3B2B1B0)
  + A7(B7B6B5B4B3B2B1B0)
  ----------------------
    S15S14S13S12S11S10S9S8S7S6S5S4S3S2S1S0

D.A. ONE TAP FIR FILTER = D0 C0
REDUCES TO MULTIPLYING A VARIABLE TIMES A CONSTANT
2 WORD X N BIT LOOK UP TABLE
  A[0]   DATA
  0      ...000000
  1      C0
[Figure: N-bit-wide sample data (Xn ... X1 X0) is shifted out one bit at a time into look up table address A0. The table output feeds input A of a scaling accumulator; input B (+/-) is the accumulator register fed back through a 2^-1 scaling. The register output is the filtered data out.]
Accumulation, one sample bit Xi per clock (B7...B0 is the coefficient):
    X0(B7B6B5B4B3B2B1B0)
  + X1(B7B6B5B4B3B2B1B0)  ->  S9 ... S0
  + X2(B7B6B5B4B3B2B1B0)  ->  S10 ... S0
  + X3(B7B6B5B4B3B2B1B0)  ->  S11 ... S0
  + X4(B7B6B5B4B3B2B1B0)  ->  S12 ... S0
  + X5(B7B6B5B4B3B2B1B0)  ->  S13 ... S0
  + X6(B7B6B5B4B3B2B1B0)  ->  S14 ... S0
  + X7(B7B6B5B4B3B2B1B0)  ->  S15 ... S0

D.A. TWO TAP FIR FILTER = D0 C0 + D1 C1
4 WORD X N BIT LOOK UP TABLE
  A[10]  DATA
  00     ...000000
  01     C0
  10     C1
  11     C0 + C1
[Figure: two N-bit-wide sample words D0 and D1 (XN ... X2 X1 X0) are shifted out serially into look up table addresses A0 and A1. The table output feeds the scaling accumulator (2^-1 feedback, +/-, register), which produces the filtered data out.]
Accumulation, one bit from each sample per clock (Xk,i is bit i of sample k; B7...B0 is the table word):
    (X0,0,X1,0)(B7B6B5B4B3B2B1B0)
  + (X0,1,X1,1)(B7B6B5B4B3B2B1B0)  ->  S9 ... S0
  + (X0,2,X1,2)(B7B6B5B4B3B2B1B0)  ->  S10 ... S0
  + (X0,3,X1,3)(B7B6B5B4B3B2B1B0)  ->  S11 ... S0
  + (X0,4,X1,4)(B7B6B5B4B3B2B1B0)  ->  S12 ... S0
  + (X0,5,X1,5)(B7B6B5B4B3B2B1B0)  ->  S13 ... S0
  + (X0,6,X1,6)(B7B6B5B4B3B2B1B0)  ->  S14 ... S0
  + (X0,7,X1,7)(B7B6B5B4B3B2B1B0)  ->  S15 ... S0

D.A. THREE TAP FIR FILTER
8 WORD X N BIT LOOK UP TABLE
  A[210]  DATA
  000     ...000000
  001     C0
  010     C1
  011     C1 + C0
  100     C2
  101     C2 + C0
  110     C2 + C1
  111     C2 + C1 + C0
[Figure: three N-bit-wide sample words D0, D1 and D2 are shifted out serially into look up table addresses A0, A1 and A2. The table output feeds the scaling accumulator, which produces the filtered data out.]
Accumulation:
    (X0,0,X1,0,X2,0)(B7B6B5B4B3B2B1B0)
  + (X0,1,X1,1,X2,1)(B7B6B5B4B3B2B1B0)  ->  S9 ... S0
  + (X0,2,X1,2,X2,2)(B7B6B5B4B3B2B1B0)  ->  S10 ... S0
  ...
  + (X0,N,X1,N,X2,N)(B7B6B5B4B3B2B1B0)  ->  S(N+M) ... S13 ... S0

The Development of a Distributed Arithmetic FIR Filter
10 Bit 10 Tap - XC4000 Family Example

10 BIT 10 TAP SYMMETRICAL FIR FILTER
[Figure: block diagram. 10-bit sample data enters a parallel in, serial out 10 bit shift register and a 100 bit shift register with taps D0 to D9. Five serial adders form D0+D9, D1+D8, D2+D7, D3+D6 and D4+D5 and drive addresses A0 to A4 of a 32 X 10 memory look up table (320 bits). The table data passes through an XOR stage (complement on last bit and add 1) into the B input of the scaling accumulator, with sign extension (B10), carry-in (C_I) and load (LD). SUM(10,1) is fed back to the A input; SUM(0) shifts into an optional double precision 10-bit shift register (least significant byte). Load on first bit. The 11-bit filtered data out is the most significant byte.]
Look Up Table is only 32 words by 10 bits

SERIAL TIME SKEW BUFFER
N • K BIT SHIFT REGISTER
SAMPLE DATA WORD SIZE = N BITS
NUMBER OF TAPS = K
[Figure: N-bit sample data enters a parallel in, serial out N bit shift register, followed by a chain of N bit shift registers with outputs D_0, D_1, ... D_k-1.]
• One N Bit Shift Register Per Tap
• # OUTPUTS = # TAPS
10 BIT 10 TAP = 50 CLBs

SHIFT REGISTER IMPLEMENTED IN RAM
[Figure: RAM16X1R cells (DATA_I, A3 A2 A1 A0, DATA_O, WR CLK) replace the shift registers, again with outputs D_0, D_1, ... D_k-1.]
• Use 4000 RAM to build Shift Register
• One 16 Bit Shift Register Per 1/2 CLB
10 BIT 10 TAP = 10 CLBs

Serial Adder
[Figure: inputs A and B plus Carry In produce A+B+Carry. A D flip-flop holds the SUM and a second D flip-flop holds the Carry; the carry is cleared (CLR) when CNT=10.]
Serial Adders: D0+D9, D1+D8, D2+D7, D3+D6, D4+D5
1 CLB Per 2 Taps

DISTRIBUTED ARITHMETIC LOOK-UP TABLE
32 X 10 MEMORY, addresses A0 to A4, 320 BITS
• HOLDS ALL PARTIAL PRODUCTS
• LUT IS AS WIDE AS COEFF
• CAN USE MEMGEN TO BUILD LUT

1’s COMPLEMENTER
[Figure: an INVERT control is applied to data bits D0, D1, ... ahead of D flip-flops.]
• INVERTS DATA ON LAST CYCLE
• 2 BITS PER CLB

SCALING ACCUMULATOR
[Figure: adder with inputs A10 to A0 and B10 to B(9:0) (sign extended DATA), carry-in C_I and load LD; register outputs S10 to S1 feed back to A. The 11-bit SUM OUT is the most significant byte; SUM(0) shifts into an optional double precision 10-bit shift register (least significant byte). Load on first bit.]
• FORCE CARRY-IN ON LAST BIT
• ADDS DATA TO (1/2) * (SUMOUT)
• 2 BITS PER CLB
• NEED N+1 BITS
• DOUBLE PRECISION WITH SR
• CAN USE XBLOX FOR RPM

10 BIT 10 TAP SYMMETRICAL FIR FILTER
[Figure: the block diagram shown earlier, repeated with the look up table marked as RAM.]

10 BIT 10 TAP SYMMETRICAL FIR FILTER: CLB BUDGET
[Figure: block diagram with CLB counts. Blocks: RAM based shift register (serial time skew buffer); timing and control (CNTEQ10, CNTEQ9, 50 MHz CLK); five 2 bit adders (2 to 1 reduction due to symmetry); 32 X 10 RAM or ROM look up table (FIR filter coefficients and multiply look up); XOR 1's complement (complement on last cycle); scaling accumulator (adder and register); filter out, 9 most significant bits. The CLB counts labelled on the blocks are 10, 10, 7, 7, 5 and 5.]
• TOTAL OF 44 CLBS: FITS IN A 4002A (WITH 20 CLBS EXTRA FOR SYSTEM DESIGN)
• ABOUT 1300 EQUIVALENT GATES - LITTLE INTERCONNECT BETWEEN BLOCKS

NUMBER OF 10 BIT 10 TAP SYMMETRICAL FIR FILTERS PER XC4000 DEVICE
  XC4000 PART   NUMBER OF INSTANCES
  4002A         1
  4003A         2
  4004A         3
  4005A         5
  4006          6
  4008          8
  4010          10
  4013          15
  4025          23

10 BIT 10 TAP FIR FILTER PERFORMANCE
• FIR10B10T MACRO CAN BE CLOCKED AT 50 MHZ
• 10 BIT WORD REQUIRES 11 CLOCKS
• 8 BIT WORD REQUIRES 9 CLOCKS, ETC
• 10 BIT SAMPLE WORD RATE IS 4.5 MHZ
FIR Filter Macro
Relatively Placed Macro
[Figure: FIR10B10T macro symbol. Labels: DATA IN, DIN_, BIT_CLK, 10X_CLK, DOUT_, DATA OUT, CLK_OUT, WORD_CLK.]
  WORD SIZE (BITS)   SAMPLE RATE (MHZ)
  6                  7.1
  8                  5.5
  10                 4.5
  12                 3.8
  14                 3.3
  16                 2.9
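The serial distributed arithmetic scheme described above (a look up table holding every sum of coefficients, addressed by one bit from each tap per clock) can be checked in software. What follows is only a minimal sketch, not the Xilinx macro: the function names `build_lut`, `da_fir_output` and `direct_fir_output` are hypothetical, the symmetry pre-adders are left out, and the bit weights are applied with left shifts on Python integers, whereas the hardware scaling accumulator halves its running sum each clock. The sign bit is handled as in the slides, by subtracting the table word on the last bit.

```python
def build_lut(coeffs):
    """Look up table with 2**K words: the word at address a is the sum of
    coeffs[k] for every bit k that is set in a."""
    taps = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << taps)]


def da_fir_output(samples, coeffs, n_bits):
    """One FIR output computed bit-serially, LSB first.

    samples: K signed integers that fit in n_bits two's complement.
    coeffs:  K integer coefficients.
    """
    if len(samples) != len(coeffs):
        raise ValueError("need exactly one sample per tap")
    lut = build_lut(coeffs)
    mask = (1 << n_bits) - 1
    words = [s & mask for s in samples]  # two's complement bit patterns
    acc = 0
    for bit in range(n_bits):            # one table access per sample bit
        addr = 0
        for tap, word in enumerate(words):
            addr |= ((word >> bit) & 1) << tap
        partial = lut[addr] << bit       # assumption: weight applied by left shift
        if bit == n_bits - 1:
            acc -= partial               # sign bit carries negative weight
        else:
            acc += partial
    return acc


def direct_fir_output(samples, coeffs):
    """Reference: plain sum of products."""
    return sum(c * x for c, x in zip(coeffs, samples))


if __name__ == "__main__":
    x = [3, -5, 7, -128]
    c = [2, -3, 4, 1]
    print(da_fir_output(x, c, 8), direct_fir_output(x, c))
```

Both functions should agree for any samples that fit in `n_bits`. The sketch uses one loop pass per sample bit; the N+1 clocks quoted in the slides include one more clock that this model does not represent.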
Double-Rate DA FIR Filters
Two Bit Parallel Distributed Arithmetic FIR Filter
16 WORD X N BIT LOOK UP TABLE
  A[3210]  DATA
  0000     ...000000
  0001     C0
  0010     2C0
  0011     3C0
  0100     C1
  0101     C1 + C0
  0110     C1 + 2C0
  0111     C1 + 3C0
  1000     2C1
  1001     2C1 + C0
  1010     2C1 + 2C0
  1011     2C1 + 3C0
  (entries 1100 to 1111 are not shown on the slide)
[Figure: two N-bit-wide sample words D0 and D1 are each shifted out two bits at a time; D0 drives addresses A1 A0 and D1 drives A3 A2 of the look up table. The table output feeds the scaling accumulator (2^-1 feedback, +/-, register), which produces the filtered data out.]
Process 2 Bits per Clock
# of Clocks = (N/2) + 1
Twice as fast

Double Sample Rate D.A. FIR Filters
Two Taps Require a 4 Input LUT without Symmetry
Four Taps Require a 4 Level LUT with Symmetrical FIR
Time Skew Buffer uses Twice as many CLBs
Twice the I/O Data Sample Rate
Both LUTs are the same

Full Parallel DA FIR Filters
Full Parallel Distributed Arithmetic FIR Filter
16 WORD X N BIT LOOK UP TABLE
  A[3210]  DATA
  0000     ...000000
  0001     C0
  0010     2C0
  0011     3C0
  0100     4C0
  0101     5C0
  0110     6C0
  0111     7C0
  1000     8C0
  1001     9C0
  1010     10C0
  1011     11C0
  (entries 1100 to 1111 are not shown on the slide)
[Figure: each N-bit-wide sample word (D0, D1, ...) presents all of its bits X7 ... X0 at once. The two 4-bit halves each address a copy of LUT-A (A3 A2 A1 A0); the table outputs are registered and combined by adders (A, B) with pipeline registers.]

Full Parallel D.A. FIR Filters
One Tap Requires Two 4 Input LUTs and an ADDER
Time Skew Buffer must use REGs
Maximum I/O Data Sample Rate

Large Number of TAPs:
8X - TAP FIR using an 8 - TAP SLICE
[Figure: several slices, each consisting of a time skew buffer (TSB, IN/OUT), serial adders (ADD), look up table (LUT), register (N), 1's complementer (1's COM) and register (N). Slice outputs are combined by adders with registers (N+1) and passed to a single scaling accumulator (SCAL ACC) with registers (N+2).]

8 Tap FIR Filter SLICE
Number of CLBs per Slice (up to 16 Bit Word)
[Figure: one slice. TSB, ADD, LUT, register (N), 1's COM, register (N), ADD, register (N+1), SCAL ACC, registers (N+2).]
4 + 4 + 1/2N + 1/2N + ((N+1)/2+1) + ((N+2)/2+1)

32 Tap Filter Using Four 8 Tap FIR Filter SLICE
[Figure: New_word and N-bit Sample Data are loaded (Load) into a parallel to serial converter (PSC) clocked by Bit_Clk. Four slices are chained through their time skew buffers (TSB IN, TSB SER); each contains serial adders, an 8-bit LUT, a register and a 1's complementer with register. The slice outputs are summed in pairs (9-bit registers) and passed to one scaling accumulator with register; Data Out.]

8 Tap FIR Filter SLICE Building Blocks
  Parallel to Serial Converter (PSC; Byte_Clk, Bit_Clk)   N/2 CLBs
  Time Skew Buffer, Quad (TSB; Bit3 Bit2 Bit1 Bit0)       2 CLBs (up to 16 bit word)
  Look Up Table (LUT, with N-bit register)                N CLBs
  N Bit Adder (ADD, with N+1-bit register)                (N/2)+1 CLBs
  N Bit Scaling Accumulator (SCAL ACC, N+1 bits)          (N/2)+1 CLBs
  Serial Adder (ADD)                                      1 CLB
  1's Complementer (1's COM)                              1/2 CLB

8 Tap FIR Filter SLICE
APPROXIMATE NUMBER OF XC4000 CLBs
                 SAMPLE DATA WORD SIZE (N)
             6    8   10   12   14   16   18   20   22   24
   8 TAPS   40   44   48   52   56   60   84   88   92   96
  16 TAPS   64   70   78   84   92  100  118  126  132  140
  24 TAPS   90  100  112  122  134  144  200  214  228  242
  32 TAPS  108  122  142  148  160  174  238  256  272  290
  40 TAPS  138  156  174  192  210  228  316  340  362  386
  48 TAPS  158  178  198  218  238  258  364  388  414  440
  56 TAPS  180  204  226  250  272  296  414  444  474  504

8 Tap FIR Filter SLICE
PERFORMANCE with XC4000-4
  SAMPLE DATA WORD SIZE       6     8    10    12    14    16    18    20    22    24
  MEGA SAMPLES PER SECOND    7.1   5.5   4.5   3.8   3.3   2.9   2.6   2.3   2.1   2.0
  DOUBLE RATE PERFORMANCE   12.5  10     8.3   6.9   6.2   5.5   5.0   4.5   4.1   3.8
Sample Rate is Independent of the Number of Taps

8 Bit Word FIR Filter Sample Rates
[Figure: Word Sample Rate (1 MHz to 5 MHz) versus Number of TAPS (16 to 80) for Distributed Arithmetic.]

8 Bit Word FIR Filter Structures
[Figure: # CLBs (100 to 300) versus Number of TAPS (16 to 80). Curves: Parallel Distributed Arithmetic (55 MHz), Two-Bit Parallel Distributed Arithmetic (16 MHz), Serial Distributed Arithmetic (8 MHz), Serial Sequential Distributed Arithmetic (1000 to 50 kHz).]

FIR Filter Implementation Options
8 Bit Word Example
             Serial Sequential     Distributed Arithmetic   Parallel
   8 Taps    36 CLBs, 1080 kHz     44 CLBs, 8.1 MHz         250 CLBs, 60 MHz
  16 Taps    36 CLBs, 462 kHz      70 CLBs, 8.1 MHz         400 CLBs, 55 MHz
  32 Taps    44 CLBs, 231 kHz      122 CLBs, 8.1 MHz
  48 Taps    62 CLBs, 154 kHz      178 CLBs, 8.1 MHz
  64 Taps    70 CLBs, 115 kHz      228 CLBs, 8.1 MHz

Lower Sample Rate Applications:
• Efficient CLB Counts
• Large Number of TAPs
• Moderate Sample Rates
• Non Symmetrical FIR OK
Serial Sequential Architecture

Serial Sequential - FIR Filter
32 Tap 8 Bit Example
[Figure: Sample Data enters the SAMPLE DATA BUFFER (SDB Out), addressed by Coefficient Select. The SERIAL MULTIPLY block feeds an accumulator (ACC) and registers (REG); Filtered Data Out. Labels in source order: Coefficient Select, 3 CLBs, 5-bit counter; Coefficient Table; PSR Parallel to Serial Converter, 4 CLBs; Serial Multiplier, 24 CLBs Total; Clk 50 MHz; 32 8-bit coefficients, 8 CLBs; 32 x 8 LUT; Select; 2^-1 Scale; ADD; REGISTER; 5 CLBs.]

64-TAP Serial Sequential FIR Filter
[Figure: Sample Data feeds two chains, each with a SAMPLE DATA BUFFER, Coefficient Select, SERIAL MULTIPLY, ACC and REG. The two results are combined by an ADD and a REGISTER.]

Serial Sequential - FIR Filter
Number CLBs vs. Taps / Word Size
              8 Bit   10 Bit   12 Bit   14 Bit   16 Bit
    8 Tap      36      43       50       57       64
   16 Tap      36      43       50       57       64
   32 Tap      44      53       62       71       80
   48 Tap      62      77       92      107      122
   64 Tap      70      85      100      115      130
   80 Tap      97     115      133      151      169
   96 Tap      97     115      133      151      169
  128 Tap     112     137      162      187      212
• 4002 = 64 CLBs
• 4005 = 196 CLBs
• 4013 = 576 CLBs
• 4025 = 1024 CLBs

Serial Sequential - FIR Filter
Maximum Sample Rate / Word Size
  TAPS       8 Bit     10 Bit    16 Bit
    8 Tap    781 kHz   625 kHz   390 kHz
   16 Tap    390 kHz   312 kHz   195 kHz
   32 Tap    195 kHz   156 kHz    97 kHz
   48 Tap    130 kHz   104 kHz    65 kHz
   64 Tap     97 kHz    78 kHz    48 kHz
   80 Tap     78 kHz    62 kHz    39 kHz
   96 Tap     65 kHz    52 kHz    32 kHz
  128 Tap     48 kHz    39 kHz    24 kHz
• Serial Mult. Limitations
• Can Use Multiple 16 Tap Building Blocks
• 8X Faster at 128 Taps

Serial Sequential 16 Tap Slice FIR Filter
Maximum Sample Rate / Word Size
[Figure: two chains of SAMPLE DATA BUFFER, Coefficient Select, SERIAL MULTIPLY, ACC and REG, combined by an ADD and a REGISTER.]
  TAPS       8 Bit     10 Bit    16 Bit
   16 Tap    390 kHz   312 kHz   195 kHz
   32 Tap    390 kHz   312 kHz   195 kHz
  (rows for 48, 64, 80, 96 and 128 Tap carry no values on the slide)
• 16-Tap Slice Used
• 32-Tap Slice Uses Fewer CLBs

DESIGN METHODOLOGY
[Figure: design flow.
  SCHEMATIC CAPTURE -> CONVERT TO XNF
  THIRD-PARTY FILTER DESIGN SOFTWARE -> CONVERT COEFFICIENTS (FORMAT COEFFICIENTS INTO LOOK UP TABLE) -> MEMGEN (GENERATE ROM XNF)
  LOOK UP TABLE, XBLOX PROCESSOR, XNF
  -> PARTITION, PLACE AND ROUTE
  -> POST ROUTE SIMULATION
  -> BIT STREAM FOR DOWNLOAD CABLE, OR EPROM]

DESIGN METHODOLOGY
SCHEMATIC CAPTURE
• Filter Blocks can be Embedded in Complete Design
• XBLOX Can Synthesize the Data Path Logic
• Filter Design Software used to design filter Coefficients
• Complete System Level Design in a Single Chip
• Incremental Filter Design Using XACT 5.0

FPGA
The Right Solution for Most Applications
Audio Sample Rates:
  Don’t need Special DSP Chip
  Serial Sequential Architecture is efficient
RF Sample Rates:
  Programmable DSP Chip is too slow
  FPGA is a single chip configurable solution

XILINX VS. D.S.P. CHIP COMPARISON
When Does It Make Sense To Use FPGAs?
• High Sample Rate Systems
• Low Sample Rates
• Small Word Length
• Lots of Taps
• Single Chip Solution Required
• Low Cost Migration Path (HardWire)
• Incremental Cost of DSP Chip
“Design Once”

DISTRIBUTED ARITHMETIC POSSIBILITIES
FPGA Applications, Coming Attractions:
• Signal Synthesis
• Modulation, De-modulation
• FFTs
• Neural Networks
• Half Band FIR Filters
• Video Signal Processing

X.D.S.P.
XILINX Hardware Digital Signal Processing
• There is an Alternative to Software DSP Chip Solutions Today
• Existing Xilinx 3100, 4000, 4000A,E, & H can Efficiently do Signal Processing
• System Level Application Specific Solution on a Single Chip
• Standard Product Configurable Solution
• Automatic Migration Path to a Lower Cost/High Volume Solution
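
Worked arithmetic for the sample-rate tables
The sample rates quoted throughout this tutorial follow from simple clock arithmetic. The sketch below is illustrative only and its function names are hypothetical. The N+1 and (N/2)+1 clock counts come from the slides; the serial sequential rule (one clock per coefficient bit per tap) is an assumption inferred from the table entries, for example 50 MHz / (8 x 8) = 781 kHz. With a 50 MHz clock these rules reproduce the XC4000-4 and serial sequential tables after rounding; the figures in the Implementation Options table do not follow from a 50 MHz clock and are not modelled here.

```python
def serial_da_rate(clock_hz, n_bits):
    """Serial distributed arithmetic: N + 1 clocks per sample word."""
    return clock_hz / (n_bits + 1)


def two_bit_da_rate(clock_hz, n_bits):
    """Two-bit parallel distributed arithmetic: (N/2) + 1 clocks per word.
    Assumes an even word size."""
    return clock_hz / (n_bits // 2 + 1)


def serial_sequential_rate(clock_hz, n_bits, taps):
    """Serial sequential filter. Assumption inferred from the tables:
    N clocks for each of the K taps, so N * K clocks per sample."""
    return clock_hz / (n_bits * taps)


if __name__ == "__main__":
    clock = 50e6  # 50 MHz system clock, as in the slides
    for n in (6, 8, 10, 12, 14, 16):
        print(n, "bits:",
              round(serial_da_rate(clock, n) / 1e6, 2), "MHz serial,",
              round(two_bit_da_rate(clock, n) / 1e6, 2), "MHz two-bit")
    for k in (8, 16, 32, 64, 128):
        print(k, "taps, 8 bits:",
              round(serial_sequential_rate(clock, 8, k) / 1e3, 1), "kHz")
```

Neither distributed arithmetic rate depends on the number of taps, which is the point made on the performance slide; only the serial sequential rate falls as taps are added.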