CFAR performance for coherent sound source detection with distributed microphones

Kevin D. Donohue, Sayed M. SaghaianNejadEsfahani, and Jingjing Yu
Visualization Center and Department of Electrical and Computer Engineering, University of Kentucky, Lexington, KY 40506

Abstract

Auditory scene analyses with distributed microphone systems are often initiated with the detection and location of sound sources. This paper introduces a novel method for the automatic detection of sound sources using a steered response power (SRP) algorithm. The method exploits the near-symmetric noise distribution of the coherent power in the SRP image to adaptively estimate thresholds that achieve a constant false alarm rate (CFAR). Statistics based on the microphone geometry and the field of view (FOV) are derived for determining the frequency range that results in near-symmetric distributions and accurate CFAR threshold estimation. Analyses and simulations show that low-frequency components (relative to the microphone distribution geometry) are primarily responsible for degrading CFAR threshold performance, but this can be offset by partial whitening or by increasing the inter-path distances from the microphones to the sources in the FOV. Experimental noise-only recordings are used to assess CFAR performance for several microphone geometries in conjunction with variations in the source frequency and degree of partial whitening. Results for linear, perimeter, and planar microphone geometries demonstrate that a Weibull distribution is a reasonably accurate model for the negative and positive coherent power values. Experimental false-alarm probabilities corresponding to CFAR thresholds ranging from 10^-1 to 10^-6 were estimated from over 9.2 million pixels and showed that deviations from the desired false-alarm probability were limited to within one order of magnitude.

Introduction:

Automatic sound source detection and location with distributed microphone systems is relevant for enhancing applications such as teleconferencing [1-5], speech recognition [6-12], talker tracking [13], and beamforming for SNR enhancement [14]. Many of these applications involve the detection and location of a sound source. For example, a minute-taking application for meetings requires detecting and locating all voices before beamforming on each voice to effectively create independent channels for each speaker. Failures to detect an active sound source, as well as false detections, degrade performance. This paper introduces a method for automatically detecting active sound sources using a variant of the steered response power (SRP) algorithm and applying a constant false-alarm rate (CFAR) threshold algorithm. Recent work on sound source location algorithms in closed (immersive) spaces has focused on enhancements for detecting and locating targets. One of the most robust algorithms for detecting multiple speakers is the SRP with a phase transform (PHAT) [15,16]. This technique has shown promise based on images created from the SRP likelihood function. A detailed analysis based on detection performance showed that a variant of the PHAT, referred to as partial whitening [17,18], outperforms the PHAT for a variety of signal source types. These works analyzed overall detection performance with receiver operating characteristic (ROC) curves, which consider the overall detection and false-alarm probabilities without regard to a threshold. Threshold design based on a false-alarm probability was considered in [19].
The SRP images for the work in [19] were created using only the cross-correlations between different microphone signals (the autocorrelation contribution of each channel was eliminated), resulting in both positive and negative values, which were referred to as coherent power. It was observed that for noise-only regions the distribution of coherent power tended to be symmetric, while for target regions the distributions were skewed in the positive direction. These properties were used to develop an adaptive thresholding scheme, where the negative values in a local neighborhood were used to characterize the noise and estimate a constant false-alarm rate (CFAR) threshold for the center pixel based on a specified false-alarm probability. The results in [19] showed that the CFAR thresholds worked well for some microphone distributions and poorly for others. In addition, a limited model for the noise distribution was used.

The work in this paper further develops the work introduced in [19] with a more detailed analysis of the coherent power values and a broader examination of distributions to fit them. In particular, the Weibull distribution with various shape parameters is used to fit the positive and negative coherent power values (the CFAR threshold is then taken as the quantile of the fitted noise model corresponding to the specified false-alarm probability) to achieve consistent CFAR performance over a wide range of false-alarm probabilities. The primary source of performance degradation is the inability of a given microphone distribution to effectively decorrelate the low-frequency components of noise sources. This paper derives a statistic that rates the ability of an array to decorrelate the low frequencies such that they do not degrade CFAR performance.

The next section presents equations for creating an acoustic image based on the steered-response coherent power (SRCP) algorithm and derives a statistic based on the inter-path distances between microphone pairs and the FOV. Section 3 presents a statistical analysis of the positive and negative noise distributions for different arrays and source frequency ranges. Simulation and experimental results demonstrate the sources of asymmetry between the positive and negative distributions and suggest methods for addressing it. Section 4 presents the CFAR algorithm with a performance analysis using data recorded from 3 different arrays. Finally, Section 5 summarizes the results and presents conclusions.

Statistical Signal Models

This section formulates the SRP image process using coherent power and derives a statistic related to the microphone distribution's ability to decorrelate noise sources. Consider microphones and sound sources distributed in a 3-D space. Let u_i(t) be the pressure wave denoting the ith source of interest located at position r_i, where r_i is a vector denoting the x, y, and z axis coordinates. The waveform received by the pth microphone is given by:

$$v_p(t;\mathbf{r}_p,\mathbf{r}_i) = \int h_{ip}(\tau)\,u_i(t-\tau)\,d\tau + \sum_{k=1}^{K}\int h_{kp}(\tau)\,n_k(t-\tau)\,d\tau, \qquad (1)$$

where h_ip(·) represents the microphone and propagation path impulse response (including multi-path) from r_i to r_p, and n_k(t) represents noise sources located at r_k. The noise sources n_k(t) arise from ambient room noises and sources not at the position of interest. For reverberant rooms the impulse response can be separated into a signal (direct path) and a noise (reflected path) component to result in:

$$h_{ip}(t) = a_{ip0}\,\delta(t-\tau_{ip0}) + \sum_{n=1}^{\infty} a_{ipn}\,\delta(t-\tau_{ipn}), \qquad (2)$$

where a_ipn denotes the amplitude of the nth path of the effective impulse response associated with the source at r_i and the microphone at r_p, and τ_ipn is the corresponding delay. The component corresponding to n = 0 is the direct path between the source and microphone.
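To make this model concrete, the following minimal sketch synthesizes one microphone signal according to Eqs. (1) and (2), with the impulse responses reduced to a discrete direct path plus a few reflections per source. The function name and the flat-spectrum ambient noise term are illustrative assumptions, not part of the paper's experimental setup.

```python
import numpy as np

def synthesize_mic_signal(sources, paths, noise_std, n_samples, fs):
    """Synthesize one microphone signal per Eqs. (1)-(2) (illustrative sketch).

    sources   : list of 1-D arrays, the source waveforms u_i(t).
    paths     : one list per source of (amplitude a_ipn, delay tau_ipn) tuples;
                the n = 0 entry is the direct path, the rest are reflections.
    noise_std : standard deviation of additive ambient noise.
    n_samples : length of the output frame in samples.
    fs        : sample rate in Hz.
    """
    v = np.zeros(n_samples)
    for u, source_paths in zip(sources, paths):
        for a, tau in source_paths:
            k = int(round(tau * fs))          # delay converted to samples
            if k >= n_samples:
                continue                      # path arrives after the frame
            seg = u[: n_samples - k]
            v[k : k + len(seg)] += a * seg    # a_ipn * u_i(t - tau_ipn)
    return v + noise_std * np.random.randn(n_samples)
```

Reflected paths are handled by the same delay-and-scale mechanism as the direct path, mirroring the separation in Eq. (2).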
An SRP estimate is based on sound events limited to those received over a finite time frame denoted by Δ_l. Therefore, a single SRP frame of Eq. (1) can be expressed in the frequency domain as:

$$\hat{V}_p(\omega,\Delta_l) = \hat{U}_i(\omega)\hat{A}_{ip0}(\omega)e^{-j\omega\tau_{ip0}} + \hat{U}_i(\omega)\!\!\sum_{\substack{m\neq 0\\ \tau_{ipm}\in\Delta_l}}\!\!\hat{A}_{ipm}(\omega)e^{-j\omega\tau_{ipm}} + \sum_{k=1}^{K}\hat{N}_k(\omega)\!\!\sum_{\tau_{kpm}\in\Delta_l}\!\!\hat{A}_{kpm}(\omega)e^{-j\omega\tau_{kpm}}, \qquad (3)$$

where the summation indices denote summing only those scatterer delays within the interval Δ_l. The SRP image is computed from the power in the filtered sum of microphone signals received over all positions:

$$G(\mathbf{r}_i,\omega) = \sum_{p=1}^{P}\hat{B}_{ip}\hat{V}_p(\omega,\Delta_l), \qquad (4)$$

where B̂_ip is a complex filter coefficient for the microphone at r_p and source at r_i. The phase of B̂_ip is selected to undo the shift introduced by the propagation from the point of interest, and the magnitude is typically selected to emphasize changes closer to the point of interest. The power in the filtered sum is computed by multiplying G in Eq. (4) by its conjugate and integrating over ω to obtain the SRP image. The multiplication of all terms in Eq. (4) results in the sum of products for all possible paired terms. Since the product pairs consisting of the same channel (autocorrelation terms) do not vary with spatial position r_i, they act only as a bias that keeps the power computation positive. Therefore, this bias can be subtracted out to result in the coherent power given by:

$$S_C(\mathbf{r}_i) = \int \sum_{p=1}^{P}\sum_{q\neq p}\hat{B}_{ip}\hat{B}_{iq}^{*}\hat{V}_p(\omega,\Delta_l)\hat{V}_q^{*}(\omega,\Delta_l)\,d\omega. \qquad (5)$$

Imaging algorithms based on this quantity are referred to as steered-response coherent power (SRCP). If the expected value over all microphone pairs is taken in the integrand of Eq. (5), with the assumptions that distinct sources are uncorrelated and letting the filter delay parameter correspond to the direct path to the actual target (i.e., $\hat{B}_{ip} = B_{ip}\exp(j\omega\tau_{ip0})$), the result is:

$$E\left[\left\langle \hat{B}_{ip}\hat{B}_{iq}^{*}\hat{V}_p(\omega,\Delta_l)\hat{V}_q^{*}(\omega,\Delta_l)\right\rangle\right] = \left|\hat{U}_i(\omega)\right|^2\left\langle B_{ip}B_{iq}A_{ip0}A_{iq0}^{*}\right\rangle + \left|\hat{U}_i(\omega)\right|^2\left\langle B_{ip}B_{iq}\!\!\sum_{\tau_{ipm},\tau_{iqm}\in\Delta_l}\!\!A_{ipm}A_{iqm}^{*}\,E\!\left[e^{j\omega\left((\tau_{ip0}-\tau_{ipm})-(\tau_{iq0}-\tau_{iqm})\right)}\right]\right\rangle + \sum_{k=1}^{K}\left|\hat{N}_k(\omega)\right|^2\left\langle B_{ip}B_{iq}\!\!\sum_{\tau_{kpm},\tau_{kqm}\in\Delta_l}\!\!A_{kpm}A_{kqm}^{*}\,E\!\left[e^{j\omega\left((\tau_{ip0}-\tau_{kpm})-(\tau_{iq0}-\tau_{kqm})\right)}\right]\right\rangle, \qquad (6)$$

where the angular brackets denote the average value over all microphone pairs. Note that the complex exponential terms for the target are in phase with the filter coefficients, so the direct-path target term results in a coherent addition. The other terms, which relate to the noise, are expected to diminish due to the decorrelation from the (ideally) incoherent phases of the complex exponentials over all microphones. To investigate the statistics of the SRCP image values for noise only, the sound source at r_i is set to 0, time delays are converted to spatial distances d, and frequencies are converted to wavelengths (λ) to obtain:

$$E\left[\left\langle B_{ip}B_{iq}\hat{V}_p(\omega,\Delta_l)\hat{V}_q^{*}(\omega,\Delta_l)\right\rangle\right] = E\left[\exp\!\left(j2\pi\frac{d_{ip0}-d_{iq0}}{\lambda}\right)\right]\sum_{k=1}^{K}\left|\hat{N}_k(\omega)\right|^2\left\langle B_{ip}B_{iq}\sum_{m,n}A_{kpm}A_{kqn}^{*}\,E\!\left[\exp\!\left(-j2\pi\frac{d_{kpm}-d_{kqn}}{\lambda}\right)\right]\right\rangle. \qquad (7)$$

Equation (7) shows that the level of incoherence, or decorrelation, results from the 2 complex exponential arguments (one inside the summation, due to the multi-path of the reverberations, and the other factored out of the noise source summation). If the microphone distribution is such that the inter-path distances in the first exponential term are on average much smaller than the wavelengths of the source, the phases of the complex exponential arguments are limited to a small range about 0, resulting in coherent sums independent of the source location. Ideally, if the complex exponential arguments uniformly span −π to π over all microphone pairs for non-source locations, the expected value becomes zero.
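Before analyzing these exponential terms further, it is worth noting how Eq. (5) can be evaluated in practice. The sketch below computes the coherent power for one FOV point in the discrete frequency domain by forming the full power of the steered sum and subtracting the same-channel (bias) terms; the function name and the partial-whitening magnitude weight |V|^(-β) (β = 1 giving the PHAT, in the spirit of [17,18]) are assumptions for illustration.

```python
import numpy as np

def srcp_pixel(frames, delays, fs, beta=1.0):
    """Steered-response coherent power for one FOV point (sketch of Eqs. (4)-(5)).

    frames : (P, N) array of windowed time frames, one row per microphone.
    delays : (P,) direct-path delays (seconds) from the FOV point to each mic.
    fs     : sample rate in Hz.
    beta   : partial-whitening exponent (0 = unweighted SRP, 1 = PHAT).
    """
    V = np.fft.rfft(frames, axis=1)                     # \hat{V}_p(omega)
    omega = 2 * np.pi * np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)

    # Filter coefficients B: the phase undoes the direct-path propagation
    # delay; the magnitude applies partial whitening to the spectrum.
    B = np.exp(1j * np.outer(delays, omega)) / (np.abs(V) ** beta + 1e-12)

    Y = B * V
    total = np.abs(Y.sum(axis=0)) ** 2      # power of the filtered sum, |G|^2
    auto = (np.abs(Y) ** 2).sum(axis=0)     # same-channel (bias) terms
    return (total - auto).sum()             # cross terms only, as in Eq. (5)
```

Because the autocorrelation bias is removed, the returned value can be negative, which is precisely what the thresholding scheme of [19] exploits.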
Note that the 2 exponential terms of Eq. (7) responsible for decorrelating the noise sources are related to the FOV, the microphone positions, and the frequency content of the sources. Since the first exponential depends only on the microphone geometry, and it scales all the noise components in the summation, this factor will be analyzed to determine the degree to which the microphone geometries generate a zero-mean distribution for off-target/noise sources. The differential path length distributions can be examined with histograms and characterized with statistics. Once the distributions are known, the expected value in Eq. (7) can be obtained. Closed-form expressions can be derived for cases using normal and uniform distributions for the inter-path distances. Let Δ_ipq be the random variable of the inter-path distance over all microphones. In the case of a zero-mean normal distribution with standard deviation σ_Δ the expected value becomes:

$$E\left[\exp\!\left(j2\pi\frac{\Delta_{ipq}}{\lambda}\right)\right] = \exp\!\left(-2\left(\frac{\pi\sigma_\Delta}{\lambda}\right)^{2}\right). \qquad (8)$$

In the case of a zero-mean uniform distribution with standard deviation σ_Δ the expected value becomes:

$$E\left[\exp\!\left(j2\pi\frac{\Delta_{ipq}}{\lambda}\right)\right] = \operatorname{sinc}\!\left(\frac{\sqrt{12}\,\sigma_\Delta}{\lambda}\right). \qquad (9)$$

The relationships in Eqs. (8) and (9) indicate that the distribution mean can never be exactly 0 over a range of frequencies. However, it can be driven to a sufficiently small value by increasing the inter-path standard deviation relative to the source wavelength. A zero-mean condition is necessary for symmetry, but not sufficient; the distribution can be skewed as well. In order to determine the frequency range over which an effective zero-mean symmetric noise distribution is likely, σ_Δ must be evaluated for a given microphone geometry and FOV. It must also be determined which distribution (uniform or Gaussian) most accurately describes the distribution of inter-path distances.

Experimental Systems

For the thresholding procedure to perform well, the symmetry of the noise distribution must be established. Equations (8) and (9) show that a mean offset can exist; in addition, skewness can also contribute to the asymmetry of the distribution. To more fully explore the deviations from symmetry for noise-only data and the factors that influence them, simulation and experimental data are generated to examine the resulting distributions and test the CFAR performance of thresholds designed with this procedure. Figure 1 shows the 3 microphone distributions used in this study. Each consists of 16 microphones equally spaced over a particular geometry, with the FOV being a planar region 1.57 m above the floor with square dimensions of 3 m by 3 m. This plane was sampled at 4 cm intervals in the X and Y directions to create the SRP image. Figure 1a shows a linear array placed 1.52 m above the floor, 0.5 m away from the FOV edge, with a spacing of 0.23 m between microphones. The array was placed symmetrically along the y-axis relative to the FOV. Figure 1b shows a perimeter array with microphones placed 1.52 m above the floor, 0.5 m away from the FOV, with a microphone spacing of 0.848 m along the perimeter. Figure 1c shows the planar array with microphones placed in a plane 1.985 m above the ground, arranged in a rectangular grid starting in a corner directly above the FOV with a 1 m spacing in the X and Y directions.
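The statistic σ_Δ can be evaluated numerically for any geometry. A minimal sketch (the function name is illustrative) computing the inter-path distance standard deviation for one FOV point follows; sweeping it over the FOV grid locates the minimum- and maximum-variance points marked in Fig. 1.

```python
import numpy as np
from itertools import combinations

def interpath_std(mics, fov_point):
    """Std. deviation of inter-path distances d_p - d_q over all mic pairs.

    mics      : (P, 3) array of microphone coordinates in meters.
    fov_point : (3,) array, coordinates of the FOV point of interest.
    """
    d = np.linalg.norm(mics - fov_point, axis=1)        # path lengths d_p
    diffs = np.array([d[p] - d[q]
                      for p, q in combinations(range(len(d)), 2)])
    # Mirror to cover both pair orderings; for 16 microphones this gives the
    # 240 ordered pairs per point mentioned below, symmetric about zero.
    diffs = np.concatenate([diffs, -diffs])
    return diffs.std()
```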
For the experimental measurements, a cage of aluminum struts around the FOV held the microphones in place; positions were measured with a laser meter and a tape measure. Based on the difficulty of measuring the tip-to-tip distances of the microphones, the level of precision was estimated to be on the order of 1 cm. The speed of sound was measured on the day of each recording and was 347 m/s for the linear array and 346 m/s for the perimeter and planar arrays.

Figure 1. Microphone distributions and FOV (shaded plane) for simulation and experimental recordings, with axes in meters. Square and star markers denote the smallest and largest (respectively) microphone inter-distance standard deviation over all pairs: (a) linear, (b) perimeter, and (c) planar.

In order to determine the nature of the inter-distances for each configuration, the histogram over all microphone pairs (240 for each point) was plotted for the FOV positions that corresponded to the maximum and minimum inter-distance standard deviations. These positions are indicated with the square (minimum) and star (maximum) markers on the FOVs in Fig. 1. Note that the minimum variance corresponds to the center of the arrays, where the array gain is typically the highest. Equations (8) and (9) predict that the symmetry (especially at lower frequencies) will be poorest at these minimum-variance points and degrade the CFAR performance there; they also imply significant interference from low-frequency noise sources located away from the point of interest. Figure 2 shows the histograms of the microphone inter-distances for the minimum and maximum variance points.

Figure 2. Histograms and standard deviations of inter-distances over microphone pairs for a point in the FOV for each distribution: (a) linear minimum variance (σ = 0.21 m), (b) linear maximum variance (σ = 1.42 m), (c) perimeter minimum variance (σ = 0.38 m), (d) perimeter maximum variance (σ = 1.88 m), (e) planar minimum variance (σ = 0.67 m), (f) planar maximum variance (σ = 1.48 m).

The experiments in this paper consider 3 microphone spatial distributions: linear, perimeter, and planar. The FOV for all cases was a 3 m by 3 m plane located 1.57 m above the floor. The linear array consisted of 16 microphones in the FOV plane, along a line parallel to the edge of the FOV and 5 cm outside the edge; microphones were equally spaced at 0.5 m. The perimeter array consisted of 16 microphones within the FOV plane, symmetrically distributed outside the FOV and forming the vertices of an equilateral octagon with 1.27 m sides. The planar array consisted of 16 microphones located in a plane parallel to and 0.58 m above the FOV; the microphones were placed on the vertices of 2 concentric rectangles with sides parallel to those of the FOV, the inner rectangle measuring 1.81 m by 1.63 m and the outer rectangle 3.4 m by 3.54 m. The histogram of inter-path distances for all microphone pairs to all points in the FOV plane (taken at intervals of 4 cm in the x and y directions) is plotted in Fig. 3. The histograms suggest the distributions are closer to uniform than Gaussian; therefore, Eq. (9) is used for the expected value. For the linear, perimeter, and planar geometries the standard deviations of the inter-path distances were 0.69, 1.24, and 1.08 m, respectively. This suggests the superiority of the planar array in limiting partial coherences at lower frequencies, resulting in a more symmetric noise distribution for the coherent power.
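Equations (8) and (9) can be evaluated directly to predict the residual mean for a given geometry and frequency. A brief sketch follows; the function name and the default speed of sound are illustrative assumptions.

```python
import numpy as np

def residual_mean(sigma, freq, c=346.0, dist="uniform"):
    """Expected value of exp(j*2*pi*Delta/lambda) from Eq. (8) or (9).

    sigma : inter-path distance standard deviation in meters.
    freq  : frequency in Hz (scalar or array).
    c     : assumed speed of sound in m/s.
    """
    lam = c / np.asarray(freq, dtype=float)                # wavelength
    if dist == "normal":
        return np.exp(-2.0 * (np.pi * sigma / lam) ** 2)   # Eq. (8)
    # np.sinc is the normalized sinc, sin(pi*x)/(pi*x), as in Eq. (9)
    return np.sinc(np.sqrt(12.0) * sigma / lam)            # Eq. (9)
```

For example, residual_mean(1.08, 300) evaluates the uniform-distribution prediction for the planar geometry at the 300 Hz cutoff frequency discussed below.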
Figure 3. Histograms of inter-path distances for the 3 microphone and FOV geometries (linear, perimeter, and planar).

For the experiments presented in this paper, a high-pass filter with a 300 Hz cutoff is applied to remove low-frequency noise (room modes) and limit the partial coherences predicted by Eq. (7). For frequencies greater than 300 Hz, Eq. (9) indicates the expected deviation of the distribution mean from zero is 0.06 for the linear, 0.05 for the perimeter, and 0.02 for the planar geometry. While these values appear small, they may significantly impact the FA threshold depending on how small an FA probability is required. A higher cutoff frequency on the sources can drive the mean smaller, at the expense of losing significant portions of the target signal. For the sake of illustrating the relative performance differences between the geometries, a cutoff at 300 Hz is used in all cases.

References:

[1] P. L. Chu, "Desktop mic array for teleconferencing," in Proceedings of ICASSP95, IEEE, 1995.
[2] J. L. Flanagan, D. A. Berkley, G. W. Elko, J. E. West, and M. M. Sondhi, "Autodirective microphone systems," Acustica, vol. 73, pp. 58-71, 1991.
[3] W. Kellermann, "A self-steering digital microphone array," in Proceedings of ICASSP91, pages 3581-3584, IEEE, May 1991.
[4] F. Khalil, J. P. Jullien, and A. Gilloire, "Microphone array for sound pickup in teleconferencing systems," J. Audio Eng. Soc., vol. 42, no. 9, September 1994.
[5] H. Wang and P. Chu, "Voice source localization for automatic camera pointing system in videoconferencing," in Proceedings of ICASSP, volume 1, pages 187-190, IEEE, 1997.
[6] J. E. Adcock, Y. Gotoh, D. J. Mashao, and H. F. Silverman, "Microphone-array speech recognition via incremental MAP training," in Proceedings of ICASSP96, Atlanta, GA, May 1996.
[7] C. Che, Q. Lin, J. Pearson, B. deVries, and J. Flanagan, "Microphone arrays and neural networks for robust speech recognition," in Proceedings of the Human Language Technology Workshop, pages 342-347, Plainsboro, NJ, March 8-11, 1994.
[8] C. Che, M. Rahim, and J. Flanagan, "Robust speech recognition in a multimedia teleconferencing environment," J. Acoust. Soc. Am., 92(4):2476, 1992.
[9] D. Giuliani, M. Omologo, and P. Svaizer, "Talker localization and speech recognition using a microphone array and a cross-power spectrum phase analysis," in Proceedings of ICSLP, volume 3, pages 1243-1246, September 1994.
[10] T. B. Hughes, H. Kim, J. H. DiBiase, and H. F. Silverman, "Performance of an HMM speech recognizer using a real-time tracking microphone array as input," IEEE Trans. Speech Audio Proc., 7(3):346-349, May 1999.
[11] T. B. Hughes, H. Kim, J. H. DiBiase, and H. F. Silverman, "Using a real-time, tracking microphone array as input to an HMM speech recognizer," in Proceedings of ICASSP98, IEEE, 1998.
[12] H. F. Silverman, "Some analysis of microphone arrays for speech data acquisition," IEEE Trans. on Acoustics, Speech, and Signal Processing, 35(12):1699-1711, December 1987.
[13] S. M. Yoon and S. C. Kee, "Speaker detection and tracking at mobile robot platform," in Proc. 2004 Intl. Symp. on Intelligent Signal Processing and Communication Systems, 2004, pp. 596-600.
[14] T. S. Huang, "Multimedia/multimodal signal processing, analysis, and understanding," in First International Symposium on Control, Communications and Signal Processing, 2004, p. 1.
[15] J. H. DiBiase, H. F. Silverman, and M. S. Brandstein, "Robust localization in reverberant rooms," in Microphone Arrays: Signal Processing Techniques and Applications, Springer-Verlag, Berlin, 2001, pp. 157-180.
[16] T. Gustafsson, B. D. Rao, and M. Trivedi, "Source localization in reverberant environments: modeling and statistical analysis," IEEE Trans. on Speech and Audio Proc., vol. 11, no. 6, Jun. 2003, pp. 791-803.
[17] K. D. Donohue, J. Hannemann, and H. G. Dietz, "Performance of phase transform for detecting sound sources in reverberant and noisy environments," Signal Processing, vol. 87, no. 7, pp. 1677-1691, July 2007.
[18] A. Ramamurthy, H. Unnikrishnan, and K. D. Donohue, "Experimental performance analysis of sound source detection with SRP PHAT-β," in Proceedings of IEEE Southeastcon, March 2009.
[19] K. D. Donohue, K. S. McReynolds, and A. Ramamurthy, "Sound source detection threshold estimation using negative coherent power," in Proceedings of IEEE Southeastcon 2008, pp. 575-580, April 2008.