Engineer-to-Engineer Note a EE-215 Technical notes on using Analog Devices DSPs, processors and development tools Contact our technical support at dsp.support@analog.com and at dsptools.support@analog.com Or visit our on-line resources http://www.analog.com/ee-notes and http://www.analog.com/processors A 16-bit IIR Filter on the ADSP-TS20x TigerSHARC® Processor Contributed by Rickard Fahlqvist November 6, 2003 Introduction This document describes an implementation of a 2nd order 16-bit IIR filter on the ADSP-TS20x TigerSHARC® processor. The implementation uses the direct form II. The structure of the 2nd order filter is shown in Figure 1. To achieve filters of orders higher than two, several 2nd order filters can be cascaded or connected in a parallel fashion. Note that the resulting filter will have an order that is a multiple of 2. For an odd number of ordered filters, a 1st order filter is needed in addition to one or several 2nd order filters. This is not implemented in the example code presented in Listing 3. w[i] + y[i] b0 + b1 + z-1 + a1 z-1 a2 2 w[i ] = x[i ] + ∑ ak w[i − k ] k =1 2 y[i ] = ∑ bk w[i − k ] k =0 Filter Structure x[i] The equations that characterizes this type of filter structure are described in Equation 1. b2 Figure 1. 2nd order direct form II Equation 1. Characterizing 2nd order direct form II IIR The first numerator coefficient (b0) can always be set to unity by scaling the input, x[i], with b0 and dividing b1 and b2 by b0. This has been used in the example implementation in listing 1 below. Implementation Implementing small recursive algorithms on pipelines presents a difficult problem. In this filter, the value of w at iteration i depends on the value of w at iteration i-1. This loop-carried dependence limits the degree to which the operations may be pipelined. In this program, all multiplications required to produce w[i] and y[i] are computed in a single cycle. However, to add the input with the partial results in order to get the final output takes 5 cycles. During this time, there is no available parallelism to keep the multiplier busy. As a result, this program can only use one compute unit. The operations used to administer the delay line (as well as the memory fetches) are associated with the ALU, so filling multiple instruction slots per instruction line becomes an issue. As mentioned above, if a Copyright 2003, Analog Devices, Inc. All rights reserved. Analog Devices assumes no responsibility for customer product design or the use or application of customers’ products or for any infringements of patents or rights of others which may result from Analog Devices assistance. All trademarks and logos are property of their respective holders. Information furnished by Analog Devices Applications and Development Tools Engineers is believed to be accurate and reliable, however no responsibility is assumed by Analog Devices regarding technical accuracy and topicality of the content provided in Analog Devices’ Engineer-to-Engineer Notes. a higher order filter is required, you can implement it as parallel 2nd order filters. In this case, two filtered outputs are calculated simultaneously, one in each compute block, with a summation of the results at the end. The 2-element state is stored in register yR5, and during every iteration each newly computed w[i] is inserted into the state register, which is then shifted left. The delay line is duplicated so that the computation of the 4 multiplications on the feedback and feedforward paths can be performed with one 4-way 16x16-bit multiplication instruction as shown in Figure 2. 47 63 31 15 D1 D2 D1 * * * * b1 a2 Interface The C style prototype for the filter in the example implementation is noted in Listing 1. void iir_16(int2x16 input[], int output[], int input_len, iir_state_t *f_state) 0 D2 b2 Because the SDAB (Short Word Data Alignment Buffer) is used, eight 16-bit input words are loaded per iteration where only the first one is used. Memory bandwidth is not the bottleneck, so this seemingly wasteful treatment of the fetched words is appropriate. a1 Listing 1. IIR function prototype The typedef “iir_state_t” is a structure that holds coefficients, delay line and the input scaling factor. The structure definition is listed in Listing 2. Dx denotes Delay line element X. Figure 2. Multiplication of delay line and coefficients The partial results are collected with the sideways summation instructions. If the coefficients are mixed fraction and integers, the 4-way multiplication must be divided into two separate multiplications, which will impair efficiency. typedef struct { const int2x16 c; /* coefficients */ int2x16 d; /* start of delay line */ int2x16 s; /* Input value scaling */ } iir_state_t; Listing 2. Filter state structure A 16-bit IIR Filter on the ADSP-TS20x TigerSHARC® Processor (EE-215) Page 2 of 4 a Appendix In this appendix the assembly source code for the example file is presented. IIR_16.asm /* ******************************************************************************** * • Copyright © 2003 Analog Devices Inc. All rights reserved. * * *******************************************************************************/ .section program; .global _iir_16; // Local defines #define Arg0 j4 #define Yout j5 #define Arg2 j6 #define s_p j7 #define Xin j0 #define Xin_tmp j1 #define s_d j2 #define #define #define #define #define c_offs 0 d_offs 1 k_offs 2 s_offs 3 coeff_p j10 _iir_16: //PROLOGUE J26 = J27 - 64; K26 = K27 - 64;; [J27 += -28] = CJMP; K27 = K27 - 20;; Q[J27 + 24] = XR27:24; Q[K27 + 16] = YR27:24;; Q[J27 + 20] = XR31:28; Q[K27 + 12] = YR31:28;; Q[J27 + 16] = J19:16; Q[K27 + 8 ] = K19:16;; Q[J27 + 12] = J23:20; Q[K27 + 4 ] = K23:20;; //PROLOGUE ENDS coeff_p = [s_p + c_offs];; yR24 = [s_p + s_offs]; yR25 = R25 XOR R25;; // Get scale factor yR21:20 = l[coeff_p += 2];; // Load ai’s and bi’s Xin = Arg0 + j31;; Xin = Xin + Xin; LC0 = Arg2;; // Double it to access shorts; Get # of input samples yR31 = 0x0010;; // used for FDEP with length=16, position=0 s_d = [s_p + d_offs]; yR4 = R4 XOR R4;; yR5 = [s_d += j31];; // Load delay line jl0 = jl0 – jl0;; // Avoid unintended circular buffer Xin_tmp = Xin;; yR3:0 = sDAB q[Xin += 8];; yR4 = R4 OR R5;; // Replicate the two 16-bit states yR3:0 = sDAB q[Xin += 8];; Xin_tmp = Xin_tmp + 1; yR11:10 = R5:4 * R21:20;;// Do a1*D1, a2*D2, b1*D1, b2*D2 // and store in R10:11 Xin = Xin_tmp + j31; yR1:0 = R1:0 * R25:24;; // scale A 16-bit IIR Filter on the ADSP-TS20x TigerSHARC® Processor (EE-215) Page 3 of 4 a yR15 = SUM sR10;; yR29:28 = EXPAND sR0(I);; yR16 = SUM sR11;; yR15 = R28 + R15;; // // // // get a1*D1 + a2*D2 LSS(R0) -> R28 b1*D1 + b2*D2 w[i] = x[i] + a1*D1 + a2*D2 .align_code 4; loop_: yR5 = LSHIFT R5 by 16; yR4 = R4 XOR R4;; // Shift out oldest element in delay // line yR3:0 = sDAB q[Xin += 8]; yR5 += FDEP R15 by R31;; // Get x[i+1] and store the // new w[i] in delay line yR3:0 = sDAB q[Xin += 8]; yR17 = R16 + R15;; // y[i] = w[i] + b1*D1 + b2*D2 Xin_tmp = Xin_tmp + 1; yR4 = R4 OR R5;; yR1:0 = R1:0 * R25:24;; // Scale i/p Xin = Xin_tmp + j31; yR11:10 = R5:4 * R21:20;; // Do a1*D1, a2*D2, b1*D1, b2*D2 and // store in R10:11 yR29:28 = EXPAND sR0(I);; // LSS(R0) -> R28 yR15 = SUM sR10;; // get a1*D1 + a2*D2 yR16 = SUM sR11;; // b1*D1 + b2*D2 if NLC0E, jump loop_; yR15 = R28 + R15; [Yout += 1] = yR17;; // w[i+1] = x[i+1] + // a1*D1 + a2*D2 // EPILOGUE STARTS CJMP = [J26 + 64];; YR27:24 = q[K27 + 16]; XR27:24 = q[J27 + 24];; YR31:28 = q[K27 + 12]; XR31:28 = q[J27 + 20];; K19:16 = q[K27 + 8 ]; J19:16 = q[J27 + 16];; K23:20 = q[K27 + 4 ]; J23:20 = q[J27 + 12];; CJMP (ABS); J27:24=q[J26+68]; K27:24=q[K26+68]; nop;; // EPILOGUE ENDS _iir_16.end: Listing 3. Source file iir_16.asm Document History Version Description November 06, 2003 by R. Fahlqvist First version A 16-bit IIR Filter on the ADSP-TS20x TigerSHARC® Processor (EE-215) Page 4 of 4