Tracking Radar Digital Matched-Filter ASIC Design

Zhenyu Liu, Zhimei Zhou
Electronic Engineering Department, Beijing Institute of Technology
Beijing 100081, People’s Republic of China

Abstract: The matched-filter is widely used in real-time signal processing, especially in radar signal processing. This paper presents a novel structure for a digital matched-filter used in tracking radar systems. The design applies block-floating-point arithmetic to improve precision. The whole digital matched-filter is implemented in a single FPGA chip. This ASIC has two work modes: 512-point pulse compression and 256-point pulse compression. It completes three channels of 512-point complex-signal pulse compression in 102 µs.

Key-Words: matched-filter, parallel processing, low power dissipation, FFT, FPGA

1 Introduction

Modern radar systems generally apply FM signals to obtain a large time-bandwidth product, so a matched-filter must be used in the system to perform pulse compression [1]. Compared with the analog matched-filter, the digital one is programmable, more accurate and more compact. The digital matched-filter, which is a crucial component of modern radar systems, is therefore receiving more and more attention.

In a tracking radar system, three parameters must be estimated to track a target: the relative target position, the elevation angle and the azimuth angle of the target. In order to realize the pulse compression of these signals, a bank of matched-filters is needed. In this paper, a novel digital matched-filter that can process these three signals simultaneously is designed. The chip has two work modes: the first is a 512-point matched-filter and the second is a 256-point matched-filter. The design is implemented with a Virtex-II XC2V500.

2 Matched-filter basic concept

The basic concept of the matched-filter evolved from the effort to obtain a better theoretical understanding of the factors leading to optimum performance of a radar system [1].
The technique of matched filtering constitutes the optimum linear processing of radar signals. This form of signal processing transforms the raw radar data, available at the receiver input and assumed to be corrupted by white Gaussian noise, into a form that is suitable for performing optimum detection decisions (i.e., target or no target), for estimating target parameters (e.g., range, velocity) with minimum rms errors, or for obtaining maximum resolution among a group of targets.

In a tracking radar system, a single matched-filter is no longer sufficient. In order to track a target, the relative range, the elevation and the azimuth of the target are needed. Three channels of complex signal are fed into the matched-filters and are processed simultaneously in one PRT interval. The system block diagram is shown in figure 1. The required parameters are derived from $y_1(n)$, $y_2(n)$ and $y_3(n)$.

$$y_i(n) = \mathrm{IFFT}\big(\mathrm{FFT}(x_i(n)) \cdot S^*(\omega)\big), \quad i = 1, 2, 3$$

Fig. 1 Tracking radar matched-filter bank

We should notice two characteristics of this procedure. First, the three matched-filters have the same frequency coefficients, denoted $S^*(\omega)$. Second, the complex signals $x_1(n)$, $x_2(n)$ and $x_3(n)$ are sampled and processed simultaneously, and they are subject to the identical processing algorithm. These characteristics are exploited in the design to reduce hardware complexity.

3 Matched-filter ASIC Design

The original design [2][4] consists of two identical processors for the direct and inverse fast Fourier transforms, and a complex multiplier for fast convolution in the frequency domain. In this design, a single dual-butterfly unit completes all of these operations. The hardware is reduced to one third while the throughput is doubled.
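The fast-convolution bank of figure 1 can be sketched in software as follows. This is a minimal NumPy illustration, not the chip's data path: the function name `matched_filter_bank` is hypothetical, the chirp follows the 512-point test vector of section 4, and the three channel contents (the pulse, a scaled copy and a copy delayed by 10 samples) are assumptions chosen only to show that all channels share one set of coefficients $S^*(\omega)$.

```python
import numpy as np

def matched_filter_bank(channels, s_conj):
    """Compute y_i = IFFT(FFT(x_i) * S*(w)) for every row of `channels`.

    All channels share the same frequency coefficients `s_conj`.
    """
    spectra = np.fft.fft(channels, axis=-1)
    return np.fft.ifft(spectra * s_conj, axis=-1)

N = 512
n = np.arange(N)

# Linear FM pulse occupying samples 0..383, zero elsewhere.
chirp = np.exp(1j * 2 * np.pi * (-3 * n / 8 + n**2 / 1024))
x = np.where(n < 384, chirp, 0)

# Matched-filter coefficients: conjugate spectrum of the reference pulse.
s_conj = np.conj(np.fft.fft(x))

# Three illustrative channels (assumed contents, see lead-in).
channels = np.stack([x, 0.5 * x, np.roll(x, 10)])
y = matched_filter_bank(channels, s_conj)

# Index of the compressed peak in each channel.
peaks = np.argmax(np.abs(y), axis=-1)

# First sidelobe of channel 0 relative to its peak, in dB.
sidelobe_db = 20 * np.log10(np.abs(y[0, 1]) / np.abs(y[0, 0]))
```

The peak of the delayed channel moves by the delay, and `sidelobe_db` can be compared with the calculated X(1) entry of table 1.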
3.1 Improved Radix-2 Butterfly Unit

In this design, a “butterfly operation processor” performs the FFT, the complex multiplication and the IFFT. The butterfly unit is therefore the core of the whole design, and we first introduce the architecture of the improved radix-2 butterfly unit. The radix-2 butterfly algorithm in the Decimation In Time (DIT) form of the FFT is expressed as follows:

$$A_m(i) = A_{m-1}(i) + A_{m-1}(j)\,W_N^k$$
$$A_m(j) = A_{m-1}(i) - A_{m-1}(j)\,W_N^k \qquad (1)$$

where:
$A_{m-1}(i)$, $A_{m-1}(j)$: stage (m-1) data
$A_m(i)$, $A_m(j)$: stage m data
$W_N^k$: rotation coefficient

Fig. 2 Block Diagram of Radix-2 Butterfly

In every butterfly operation there is only one complex multiplication. We can use this property to reduce the number of multipliers. The block diagram of the improved butterfly unit is illustrated in figure 2 and the related timing diagram of the control signals is shown in figure 3. It can be seen from this architecture that $A_{m-1}(j)$ is multiplied with $W_N^k$, while $A_{m-1}(i)$ just passes through. This is realized by controlling the phases of “Phi1”, “Phi2” and “Phi3”, as illustrated in figure 3. In this way, one complex multiplication is completed in two clock periods with just two multipliers: in the first clock period, $A_{m-1}(j)$ is multiplied with $W_N^{kR}$; in the second clock period, $A_{m-1}(j)$ is multiplied with $W_N^{kI}$. (Note: $W_N^{kR}$ represents the real part of $W_N^k$ and $W_N^{kI}$ represents the imaginary part of $W_N^k$.)

After $W_N^k$ is replaced with $W_N^{-k}$, the butterfly unit performs the inverse fast Fourier transform. It can also be noticed that the butterfly unit can perform complex multiplication: if $A_{m-1}(i)$ is set to zero and the “ADD/SUB” unit is controlled to perform only addition, the output of the butterfly unit is the complex product $A_{m-1}(j)\,W_N^k$.
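The behaviour described above can be modelled with a short sketch. It is a functional model under stated assumptions, not a description of the actual circuit: the function names are hypothetical, the two real multipliers are assumed to act on the real and imaginary parts of $A_{m-1}(j)$ in parallel, and the small FFT routine reorders its input by bit reversal purely for software convenience (the chip's addressing scheme is discussed in section 3.3). The inverse transform is left unscaled.

```python
import cmath

def two_cycle_multiply(a, w):
    """Complex product a*w using two real multipliers over two cycles."""
    # Cycle 1: both multipliers take the real part of the coefficient.
    p_rr = a.real * w.real
    p_ir = a.imag * w.real
    # Cycle 2: both multipliers take the imaginary part of the coefficient.
    p_ri = a.real * w.imag
    p_ii = a.imag * w.imag
    return complex(p_rr - p_ii, p_ir + p_ri)

def butterfly(a_i, a_j, w):
    """Radix-2 DIT butterfly of equation (1); a_i bypasses the multipliers."""
    t = two_cycle_multiply(a_j, w)
    return a_i + t, a_i - t

def complex_multiply_via_butterfly(a, w):
    """Reuse the butterfly as a multiplier: a_i = 0, addition output only."""
    out_add, _ = butterfly(0j, a, w)
    return out_add

def fft_dit(x, inverse=False):
    """Radix-2 DIT FFT built only from `butterfly` (length must be 2**n, n >= 1)."""
    n = len(x)
    bits = n.bit_length() - 1
    # Bit-reversed input order (software convenience, see lead-in).
    data = [x[int(format(k, f"0{bits}b")[::-1], 2)] for k in range(n)]
    sign = 1 if inverse else -1  # W^-k turns the FFT into an IFFT
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for k in range(half):
                w = cmath.exp(sign * 2j * cmath.pi * k / (2 * half))
                i, j = start + k, start + k + half
                data[i], data[j] = butterfly(data[i], data[j], w)
        half *= 2
    return data

# Check against a direct DFT on a small example.
x = [complex(k, -k) for k in range(8)]
spectrum = fft_dit(x)
direct = [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / 8) for m in range(8))
          for k in range(8)]
max_err = max(abs(a - b) for a, b in zip(spectrum, direct))
```

Because the same `butterfly` routine serves the FFT, the coefficient multiplication and the IFFT, the sketch mirrors the hardware-sharing idea of this section.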
One complex multiplication is completed in two clock periods. This idea is applied in this design: the butterfly units are used to perform the FFT, the complex multiplication and the IFFT, which reduces the hardware overhead to one third of that of the original design.

Fig. 3 Timing Diagram of Control Signal

3.2 Dual-Butterfly Operation Unit

The bottleneck of the whole system is the butterfly unit, because it implements most of the arithmetic operations. We relieve this bottleneck by using two improved butterfly units to construct a dual-butterfly operation unit. Its block diagram is shown in figure 4. There is a 180-degree phase difference between CLK0 and CLK1; only in this way can the raw data be dispatched to the correct butterfly processor. CLK0 and CLK1 have the same frequency, which is only half of the address unit clock. This is the advantage of this architecture, for the following reason. All of the arithmetic components in the system, such as adders, subtracters and multipliers, are in the butterfly units, so this architecture reduces the working clock frequency of these components. Reducing the clock frequency helps to reduce power dissipation [7]. Therefore, we could also reduce the supply voltage of the dual-butterfly unit when the ASIC is fabricated. From reference [7], we know that this will greatly contribute to reducing the system power consumption.

Fig. 4 Block Diagram of Dual Butterfly Unit (Butterfly_0 clocked by Clk0 and Butterfly_1 clocked by Clk1, each with In_Real, In_Img, Out_Real and Out_Img ports; the inputs $A_{m-1}^R$, $A_{m-1}^I$ and the outputs $A_m^R$, $A_m^I$ are routed through multiplexers)

From the timing diagram (figure 5), we can understand how the source data flow into these two butterfly units. “CLK_SYS” is the system clock; the “address generator” and the dual-port SRAM blocks all work under this clock. The subscripts ‘R’ and ‘I’ denote the real and imaginary parts of the data respectively. The subscripts ‘0’ and ‘1’ denote which butterfly unit the data should be sent to, “butterfly_0” or “butterfly_1”.
Fig. 5 Timing Diagram of the input data (under CLK_SYS, $A_{m-1}(j)_{R0}$ and $A_{m-1}(j)_{I0}$ are sent to Butterfly_0 on CLK0, and $A_{m-1}(j)_{R1}$ and $A_{m-1}(j)_{I1}$ are sent to Butterfly_1 on CLK1)

Though the radix-4 butterfly [5][6] performs fewer multiplications, in this application the radix-2 dual-butterfly unit has the following advantages. First, it can process any $2^n$-point data, while radix-4 can only process $4^n$-point data. In this application, we need 512-point pulse compression, and $512 = 2^9$ is not a power of 4. Second, as pointed out previously, the dual radix-2 butterfly unit can be devised to perform not only the FFT and IFFT, but also the complex multiplication. The radix-4 butterfly cannot realize this hardware sharing. Third, compared with the radix-4 butterfly block diagram, radix-2 has a more compact structure, and its interconnections between blocks are fewer and shorter. In deep-submicrometer technology, reducing interconnection delay is crucial to enhancing performance. So applying the dual-butterfly structure is a great advantage for this application.

3.3 FFT_IFFT Address Unit

In order for this architecture to work properly, the data addresses and the rotation coefficient addresses must be generated correctly. It is assumed that the input raw data are stored in the dual-port RAM in normal sequence. First, these data are transformed by the FFT. After the FFT, the spectrum data are stored in the dual-port RAM in reverse sequence. Then the signal spectrum data are multiplied with $S^*(\omega)$, and the data are still stored in reverse sequence. After the IFFT, the data are stored in the dual-port RAM in normal sequence again. Input and output data are thus both in normal sequence, which makes the post-matched-filter processing more convenient. This is another merit of this design.

The address unit structures are depicted in figures 7 to 10. The methods used to derive the correct data and rotation coefficient addresses have been discussed in detail in references [5][6]. Compared with the original design, the address unit fetches data and coefficients for two butterfly units, so some modifications have been made to let the two butterfly units work simultaneously.

From figure 2, it can be seen that the real part and the imaginary part of the coefficients are fed into the processor serially. So all complex coefficients are stored serially in one 4k × 12 bits ROM. Figure 6 depicts the map of the coefficient ROM.

Fig. 6 The map of the coefficient ROM (addresses 000H to FFFH, divided into four segments at 400H, 800H and C00H; the segments hold the FFT coefficients, the IFFT coefficients, the 512-point $S^*(\omega)$ and the 256-point $S^*(\omega)$; within a segment, real and imaginary parts alternate as $W_{512}^{kR}$, $W_{512}^{kI}$, $W_{512}^{(k+1)R}$, $W_{512}^{(k+1)I}$, ...)

The 256-point FFT and IFFT coefficients are expressed as $W_{256}^{\pm k}$. This can be written in another form: $W_{256}^{\pm k} = W_{512}^{\pm 2k}$. That means the 256-point FFT and IFFT use the even coefficients of the 512-point FFT and IFFT, so they can share the same rotation coefficients. In figure 8 and figure 10, only the least significant bits of the rotation coefficient address are depicted. The current operation and work mode decide the segment address of the coefficients.

In our design, the ROM that contains the coefficients is implemented with the built-in dual-port block RAM of the Virtex2 FPGA. At the startup stage, the content of the ROM is initialized from an EPROM chip outside the FPGA. For each chosen FM radar pulse, the related compressor spectrum is stored in the EPROM. So this system is fully programmable.

Fig. 7 512 data address generator

Fig. 10 256 coefficients address generator

As pointed out before, because the three channels of signal are processed simultaneously and their coefficients are identical, they can share a single address generator and the coefficient ROM.
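The coefficient sharing between the two work modes, $W_{256}^{\pm k} = W_{512}^{\pm 2k}$, can be checked numerically. The sketch below also models a serially stored coefficient table; the interleaved real/imaginary layout follows the description above, but the scale factor 2047 for 12-bit two's complement words, the table length and the function names are assumptions made for illustration, not details taken from the design.

```python
import numpy as np

SCALE = 2047  # assumed full-scale value for 12-bit two's complement words

def build_fft_segment(n_points=512):
    """Interleaved table: word 2k holds Re(W^k), word 2k+1 holds Im(W^k)."""
    k = np.arange(n_points // 2)
    w = np.exp(-2j * np.pi * k / n_points)
    rom = np.empty(n_points, dtype=np.int16)
    rom[0::2] = np.round(w.real * SCALE).astype(np.int16)
    rom[1::2] = np.round(w.imag * SCALE).astype(np.int16)
    return rom

def read_coefficient(rom, k, mode_points):
    """Fetch W_mode^k from the 512-point table.

    In 256-point mode every second coefficient is used, so the
    coefficient index is doubled before the address is formed.
    """
    step = 512 // mode_points      # 1 for 512-point, 2 for 256-point mode
    addr = 2 * (k * step)          # real word; imaginary word follows
    return complex(int(rom[addr]), int(rom[addr + 1])) / SCALE

# Identity behind the sharing: W_256^k equals W_512^(2k).
k = np.arange(128)
w256 = np.exp(-2j * np.pi * k / 256)
w512_even = np.exp(-2j * np.pi * (2 * k) / 512)

rom = build_fft_segment()
w_from_rom = read_coefficient(rom, 5, 256)
```

Only the address step changes with the work mode, which is consistent with one coefficient table serving both the 512-point and the 256-point transforms.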
Sharing the address generator and the coefficient ROM among the three channels dramatically reduces the hardware overhead.

Fig. 8 512 coefficients address generator

Fig. 9 256 data address generator

3.4 Block-Floating-Point Arithmetic

In order to trade off between precision and performance, block-floating-point arithmetic [5][8] is applied in this design. From reference [9], we can conclude that block-floating-point has a much higher SNR than fixed-point, and its implementation is very simple.

The block-floating-point logic comprises two parts, the “overflow-detector” and the “scale-counter”. The “overflow-detector” is a finite-state machine. If the word length of the data fed into the “dual-butterfly-unit” is N bits, the word length of the output is N+2 bits. The state of the “overflow-detector” is decided by the most significant 3 bits of the output, bits N+2 to N. They have the following possibilities:

000 (or 111): no overflow, output of “overflow-detector” = 0
001 (or 110): 1-bit overflow, output of “overflow-detector” = 1
01x (or 10x): 2-bit overflow, output of “overflow-detector” = 2

The state diagram of the “overflow-detector” is shown in figure 11.

Fig. 11 State Diagram of “Overflow-Detector” (states S1, S2 and S3, with transitions labelled ①, ② and ③)

At the beginning of every stage, the “overflow-detector” is reset to S1. At the end of the stage, the “scale-counter” is active, so the result of the “overflow-detector” is accumulated. At the same time, the “overflow-detector” output is registered to control the input data shifter of the next stage, indicating by how many bits the data should be shifted right.

4 Design Verification

This design is implemented with a Virtex2 FPGA (XC2V500). The work clock frequency is 100 MHz, and 512-point pulse compression is completed within 102 µs.
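The 102 µs figure is consistent with a simple operation count. The following back-of-the-envelope sketch rests on assumptions that are not spelled out in the text: the 100 MHz clock is taken to be the system (address unit) clock, each butterfly unit runs at half that rate and needs two of its own clock periods per butterfly or per complex multiplication, the two units work in parallel, and pipeline fill and memory overhead are ignored. The function name is hypothetical.

```python
import math

def pulse_compression_time_us(n_points, f_sys_hz=100e6, n_butterfly_units=2):
    """Rough time estimate for FFT + coefficient multiply + IFFT, in microseconds."""
    stages = int(math.log2(n_points))
    butterflies_per_fft = stages * n_points // 2
    # FFT and IFFT butterflies, plus one complex multiplication per point.
    operations = 2 * butterflies_per_fft + n_points
    butterfly_clock_hz = f_sys_hz / 2          # CLK0/CLK1 run at half the system clock
    seconds_per_operation = 2 / butterfly_clock_hz
    return operations * seconds_per_operation / n_butterfly_units * 1e6

t512 = pulse_compression_time_us(512)
t256 = pulse_compression_time_us(256)
```

For 512 points this gives 2 × 2304 butterflies plus 512 multiplications, that is 5120 operations at an effective 20 ns each, or about 102.4 µs, in agreement with the reported time.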
The input and output signals are both in 12-bit 2’s complement representation. In addition, the result has a 5-bit group exponent.

In order to verify the correctness of this design, we choose two typical groups of data as the test vectors and compare the processed results with the “Matlab” calculation results. The first group of test vectors is used to test the 512-point work mode of the matched-filter. The signal is linear FM:

$$x(n) = \begin{cases} \exp\!\left(j \cdot 2\pi\left(-\tfrac{3}{8}n + \tfrac{1}{512}\,n^2/2\right)\right), & 0 \le n \le 383 \\ 0, & 384 \le n \le 511 \end{cases}$$

$S^*(\omega)$ is set as $\mathrm{FFT}^*[x(n)]$. The result processed by the matched-filter is shown in figure 12. Because $512 < 384 \times 2$, overlap occurs during convolution. But because $x(n)$ is a chirp signal, the correlation between the beginning and the end of the signal is weak, and this overlap does not affect the pulse compression.

Fig. 12 512-points pulse compression result

To further verify the result, the twenty-one most important points of the processed result are listed in table 1. Comparing these results with the “Matlab” calculation results, it can be seen that the processing precision of this design is very high.

Table 1
X(n)     process (dB)   calculate (dB)
X(-10)   -28.9965       -29.1881
X(-9)    -27.0478       -26.8910
X(-8)    -33.7113       -33.8457
X(-7)    -30.8718       -30.9762
X(-6)    -23.2767       -23.2189
X(-5)    -23.1839       -23.2864
X(-4)    -40.5492       -39.6585
X(-3)    -20.5537       -20.5036
X(-2)    -13.4264       -13.4672
X(-1)    -10.4001       -10.4015
X(0)       0              0
X(1)     -10.3836       -10.4015
X(2)     -13.4377       -13.4672
X(3)     -20.4741       -20.5036
X(4)     -39.9253       -39.6585
X(5)     -23.2885       -23.2864
X(6)     -23.1968       -23.2189
X(7)     -31.1193       -30.9762
X(8)     -33.8152       -33.8457
X(9)     -26.8170       -26.8910
X(10)    -29.1364       -29.1881

The second group of test vectors is applied to test the 256-point work mode. The input signal is:

$$x(n) = \begin{cases} \exp\!\left(j \cdot 2\pi\left(-\tfrac{3}{8}n + \tfrac{3}{256}\,n^2/2\right)\right), & 0 \le n \le 63 \\ 0, & 64 \le n \le 255 \end{cases}$$

The processed result is shown in figure 13 and the corresponding comparison is listed in table 2. In this test, there is no overlap in the convolution.

Fig. 13 256-points pulse compression result

Table 2
X(n)     process (dB)   calculate (dB)
X(-10)   -28.6051       -28.5801
X(-9)    -30.8746       -31.0146
X(-8)    -28.3943       -28.3904
X(-7)    -25.6623       -25.6476
X(-6)    -35.3313       -35.2251
X(-5)    -21.3956       -21.4539
X(-4)    -24.6855       -24.5592
X(-3)    -24.1825       -24.1273
X(-2)    -13.4969       -13.5515
X(-1)    -10.1513       -10.1443
X(0)       0              0
X(1)     -10.1301       -10.1443
X(2)     -13.5157       -13.5515
X(3)     -24.0544       -24.1273
X(4)     -24.5103       -24.5592
X(5)     -21.4588       -21.4539
X(6)     -34.8020       -35.2251
X(7)     -25.5586       -25.6476
X(8)     -28.5885       -28.3904
X(9)     -31.1649       -31.0146
X(10)    -28.3975       -28.5801

5 Conclusion

The matched-filter presented in this paper is specially designed for tracking radar systems. It can process three channels of radar backscattering signal simultaneously. This design has the following characteristics:
1) It applies block-floating-point arithmetic and achieves very high precision.
2) Through parallel processing, its performance is improved significantly. The 512-point pulse compression operation takes only 102 µs.
3) Its peripheral circuitry is very simple. Except for an EPROM for the coefficient storage, no other auxiliary chips are needed. So it is very suitable for the airborne environment.
4) This chip has low power dissipation. The leak power consumption of the chip, estimated with the XPOWER tool provided by Xilinx, is 405 mW. After processing a group of data, the chip turns to idle status automatically.

References:
[1] Charles E. Cook and Marvin Bernfeld, Radar Signals: An Introduction to Theory and Application, Artech House, Inc., 1993.
[2] Tortoli, P., Guidi, F.,
Atzeni, C., Digital vs. SAW matched filter implementation for radar pulse compression, Proceedings of the IEEE Ultrasonics Symposium, vol. 1, Nov. 1-4, 1994, pp. 199-202, ISSN 1051-0117.
[3] Sarkar, Tapan K., Brown, Russell D., Ultra-low sidelobe pulse compression technique for high performance radar systems, IEEE National Radar Conference - Proceedings, May 13-15, 1997, pp. 111-114.
[4] Huang Ruojian, The Design of Match Filter in Pulse Compression, 1997 (Master's dissertation, Beijing Institute of Technology).
[5] Liu Zhaohui, The Design of Application Specific FFT Processors and The Study of CFAR Detectors Based on Systolic Array, 1999 (Ph.D. dissertation, Beijing Institute of Technology).
[6] Brigham, E. Oran, The Fast Fourier Transform and Its Applications, Englewood Cliffs, N.J., Prentice Hall, 1988.
[7] Rabaey, Jan M., Digital Integrated Circuits: A Design Perspective, Prentice Hall Inc., 1996, pp. 522-533.
[8] Hu Guorong and Lee Tak Kwan, Asynchronous FFT-ASIC Architecture, Chinese Journal of Electronics, Vol. 7, No. 4, Oct. 1998, pp. 333-337.
[9] A. V. Oppenheim and C. J. Weinstein, Effects of finite register length in digital filtering and the fast Fourier transform, Proc. of the IEEE, 1972, pp. 957-976.