CFAR performance for coherent sound source detection with distributed microphones
Kevin D. Donohue, Sayed M. SaghaianNejadEsfahani, and Jingjing Yu
Visualization Center and Department of Electrical and Computer Engineering, University of Kentucky,
Lexington, KY 40506
Abstract
Auditory scene analyses with distributed microphone systems are often initiated with the detection and
location of sound sources. This paper introduces a novel method for the automatic detection of sound
sources using a steered response power (SRP) algorithm. The method exploits the near-symmetric noise
distribution of the coherent power in the SRP image to adaptively estimate thresholds that achieve a
constant false alarm rate (CFAR). Statistics based on the microphone geometry and the field of view
(FOV) are derived for determining the frequency range that results in near-symmetric distributions and
accurate CFAR threshold estimation. Analyses and simulation show that low frequency components
(relative to the microphone distribution geometry) are primarily responsible for degrading CFAR
threshold performance, but this degradation can be offset by partial whitening or by increasing the
inter-path distances from the microphones to the sources in the FOV. Experimental noise-only recordings are used to assess CFAR
performance for several microphone geometries in conjunction with variations in the source frequency
and degree of partial whitening. Results for linear, perimeter, and planar microphone geometries
demonstrate that a Weibull distribution is a reasonably accurate model for the negative and positive
coherent power values. Experimental false-alarm probabilities corresponding to CFAR thresholds
ranging from 10⁻¹ to 10⁻⁶ were estimated using over 9.2 million pixels and showed that the deviations
from the desired false-alarm probability were limited to within one order of magnitude.
Introduction
Automatic sound source detection and location with distributed microphone systems is relevant for
enhancing applications such as teleconferencing [1-5], speech recognition [6-12], talker tracking [13],
and beamforming for SNR enhancement [14]. Many of these applications involve the detection and
location of a sound source. For example, a minute-taking application for meetings requires detecting
and locating all voices before beamforming on each voice to effectively create independent channels for
each speaker. Failures to detect an active sound source, as well as false detections, degrade performance.
This paper introduces a method for automatically detecting active sound sources using a variant of the
steered response power (SRP) algorithm and applying a constant false-alarm rate (CFAR) threshold
algorithm.
Recent work on sound source location algorithms in closed (immersive) spaces has focused on
enhancements for detecting and locating targets. One of the most robust algorithms for detecting
multiple speakers is the SRP with a Phase Transform (PHAT) [15,16]. This technique has shown promise
based on images created from the SRP likelihood function. A detailed analysis based on detection
performance showed that a variant of the PHAT, referred to as partial whitening [17, 18], outperforms
the PHAT for a variety of signal source types. These works analyzed overall detection performance with
receiver operating characteristic (ROC) curves, which consider the overall detection and false-alarm
probabilities without regard to a threshold. Threshold design based on a false-alarm probability was
considered in [19]. The SRP images in [19] were created using only the cross-correlations between
different microphone signals (the autocorrelation contribution of each channel was eliminated),
resulting in both positive and negative values, which were referred to as coherent power. It was
observed that for noise-only regions the coherent power distribution tended to be symmetric, while for
target regions it was skewed in the positive direction. These properties were used to develop an
adaptive thresholding scheme, where the negative values in a local neighborhood were used to
characterize the noise and estimate a constant false-alarm rate (CFAR) threshold for the center pixel
based on a specified false-alarm rate. The results in [19] showed that the CFAR threshold performed
well for some microphone distributions and poorly for others. In addition, a limited model for the
noise distribution was used.
The work in this paper further develops the work introduced in [19] with a more detailed analysis of
the coherent power values and a broader examination of distributions to fit them. In particular, a
Weibull distribution with varying shape parameter is fit to the positive and negative coherent power
values to achieve consistent CFAR performance over a wide range of false-alarm probabilities. The
primary source of performance degradation is the inability of a given microphone distribution to
effectively decorrelate the low frequency components of noise sources. This paper derives a statistic
that rates the ability of an array to effectively decorrelate the low frequencies such that they do not
degrade the CFAR performance.
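To make the thresholding idea concrete, the following is a minimal sketch of how such a CFAR threshold could be computed with SciPy, assuming the SRCP pixel values are available in a NumPy array and that the negative values are noise-only samples whose mirrored distribution matches the positive noise tail (the near-symmetry property described above); the function and variable names are illustrative, not from the paper:

```python
import numpy as np
from scipy.stats import weibull_min

def cfar_threshold(srcp_pixels, pfa=1e-3):
    """Estimate a CFAR threshold from the negative coherent-power values.

    Assumes the noise-only negative values mirror the positive noise tail
    (the near-symmetry property exploited in the text).
    """
    neg = srcp_pixels[srcp_pixels < 0]           # noise-only samples (by assumption)
    flipped = -neg                               # mirror onto the positive axis
    shape, loc, scale = weibull_min.fit(flipped, floc=0)  # Weibull fit, location fixed at 0
    # Threshold exceeded by noise with probability pfa under the fitted model
    return weibull_min.ppf(1.0 - pfa, shape, loc=loc, scale=scale)

# Hypothetical usage on a simulated noise-only SRCP image:
rng = np.random.default_rng(0)
noise_image = rng.normal(size=(75, 75))          # stand-in for an SRCP image
print(cfar_threshold(noise_image, pfa=1e-3))
```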
The next section presents equations for creating an acoustic image based on the steered-response
coherent power (SRCP) algorithm and derives a statistic based on the inter-path distances
between microphone pairs and the FOV. Section 3 presents a statistical analysis of the positive and
negative noise distribution for different arrays and source frequency ranges. Simulation and
experimental results demonstrate the sources of non-symmetry between the positive and negative
distributions and suggest methods for addressing this. Section 4 presents the CFAR algorithm with a
performance analysis using data recorded from 3 different arrays. Finally, Section 5 summarizes the
results and presents conclusions.
Statistical Signal Models
This section formulates the SRP image process using coherent power and derives a statistic related to
the microphone distribution’s ability to decorrelate noise sources. Consider microphones and sound
sources distributed in a 3-D space. Let ui(t) be the pressure wave denoting the ith source of interest
located at position ri, where ri is a vector denoting the x, y, and z axis coordinates. The waveform
received by the pth microphone is given by:

v_p(t; \mathbf{r}_p, \mathbf{r}_i) = \int h_{ip}(\tau)\, u_i(t-\tau)\, d\tau + \sum_{k=1}^{K} \int h_{kp}(\tau)\, n_k(t-\tau)\, d\tau,     (1)

where h_ip(τ) represents the microphone and propagation path impulse response (including multi-path)
from r_i to r_p, and n_k(t) represents noise sources located at r_k. The noise sources n_k(t) arise from ambient
room noises and sources not at the position of interest.
For reverberant rooms, the impulse response can be separated into a signal (direct path) component and
a noise (reflected paths) component:

h_{ip}(t) = a_{ip0}\,\delta(t - \tau_{ip0}) + \sum_{n=1}^{\infty} a_{ipn}\,\delta(t - \tau_{ipn}),     (2)
where a_ipn denotes the gain of the nth path of the effective impulse response associated with the source
at r_i and the microphone at r_p, and τ_ipn is the corresponding delay. The component corresponding to
n = 0 is the direct path between the source and microphone.
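As a concrete illustration of the direct-path term of Eqs. (1) and (2), the sketch below simulates received microphone signals for a single source, using integer-sample delays and a simple spherical-spreading gain; all names and parameter choices are illustrative assumptions, not the paper's simulation code:

```python
import numpy as np

def direct_path_signals(source, mics, s, fs, c=346.0, noise_std=0.01):
    """Direct-path component of Eq. (1): delay and attenuate the source
    waveform for each microphone, then add independent sensor noise.
    Delays are rounded to whole samples for simplicity."""
    out = []
    for m in mics:
        d = np.linalg.norm(np.asarray(source) - np.asarray(m))   # path length (m)
        delay = int(round(fs * d / c))                            # tau_ip0 in samples
        a = 1.0 / max(d, 1e-3)                                    # spherical spreading gain
        v = np.zeros(len(s) + delay)
        v[delay:] = a * s                                         # a_ip0 * u_i(t - tau_ip0)
        out.append(v + noise_std * np.random.randn(len(v)))       # ambient noise term
    n = min(len(v) for v in out)
    return np.stack([v[:n] for v in out])                         # trim to common length
```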
An SRP estimate is based on sound events limited to those received over a finite time frame denoted
by Δ_l. Therefore, for a single SRP frame, Eq. (1) can be expressed in the frequency domain as:
\hat{V}_p(\omega, \Delta_l) = \hat{U}_i(\omega)\,\hat{A}_{ip0}(\omega)\,e^{-j\omega\tau_{ip0}} + \hat{U}_i(\omega)\!\!\sum_{\substack{m \neq 0 \\ \tau_{ipm} \in \Delta_l}}\!\! \hat{A}_{ipm}(\omega)\,e^{-j\omega\tau_{ipm}} + \sum_{k=1}^{K} \hat{N}_k(\omega)\!\!\sum_{\tau_{kpm} \in \Delta_l}\!\! \hat{A}_{kpm}(\omega)\,e^{-j\omega\tau_{kpm}},     (3)
where the summation indices denote summing only those scatterer delays within the interval Δ_l. The
SRP image is computed from the power in the filtered sum of the microphone signals for each position
of interest:
G(\mathbf{r}_i) = \sum_{p=1}^{P} \hat{B}_{ip}\,\hat{V}_p(\omega, \Delta_l),     (4)
where \hat{B}_{ip} is a complex filter coefficient for the microphone at r_p and source at r_i. The phase of \hat{B}_{ip} is
selected to undo the shift introduced by the propagation from the point of interest, and the magnitude is
typically selected to emphasize channels closer to the point of interest. The power in the filtered sum is
computed by multiplying G in Eq. (4) by its conjugate and integrating over ω to obtain the SRP image.
The multiplication of all terms in Eq. (4) results in the sum of products over all possible term pairs.
Since the product pairs consisting of the same channel (the autocorrelation terms) do not vary with
spatial position r_i, they act only as a bias that keeps the power computation positive. They can
therefore be subtracted out to yield the coherent power:
S_C(\mathbf{r}_i) = \int \sum_{p=1}^{P} \sum_{q>p}^{P} \hat{B}_{ip}\,\hat{B}^*_{iq}\,\hat{V}_p(\omega, \Delta_l)\,\hat{V}^*_q(\omega, \Delta_l)\, d\omega.     (5)
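A minimal numerical sketch of Eq. (5) at a single candidate point follows; it removes the autocorrelation bias by subtracting the per-channel powers from the power of the steered sum (which equals twice the real part of the p &lt; q cross-term sum), and uses phase-only weights for simplicity. The paper's actual weighting and windowing choices may differ, so treat this as an assumption-laden sketch:

```python
import numpy as np

def srcp_point(frames, delays, fs):
    """Coherent power of Eq. (5) at one candidate point.

    frames : (P, N) array of windowed microphone signals
    delays : (P,) direct-path delays tau_p (seconds) from the point to each mic
    Phase-only weights (|B| = 1, PHAT-like) are used here for simplicity."""
    P, N = frames.shape
    V = np.fft.rfft(frames, axis=1)                      # V_p(omega)
    w = 2 * np.pi * np.fft.rfftfreq(N, d=1.0 / fs)       # omega grid
    B = np.exp(1j * np.outer(delays, w))                 # undo the propagation phase
    S = B * V                                            # steered spectra
    total = np.abs(S.sum(axis=0)) ** 2                   # |sum_p B_p V_p|^2
    auto = (np.abs(S) ** 2).sum(axis=0)                  # autocorrelation bias terms
    return np.real((total - auto).sum())                 # cross terms only, summed over omega
```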
Imaging algorithms based on this quantity are referred to as steered-response coherent power (SRCP). If
the expected value over all microphone pairs is taken in the integrand of Eq. (5), with the assumptions
that distinct sources are uncorrelated and letting the filter delay parameter correspond to the direct
path to the actual target (i.e. \hat{B}_{ip} = B_{ip}\exp(j\omega\tau_{ip}) with \tau_{ip} = \tau_{ip0}), the result is:
E\left[\hat{B}_{ip}\hat{B}^*_{iq}\hat{V}_p(\omega,\Delta_l)\hat{V}^*_q(\omega,\Delta_l)\right] =
\left|\hat{U}_i(\omega)\right|^2 \left( \left\langle B_{ip}B_{iq}A_{ip0}A^*_{iq0} \right\rangle + \sum_{\tau_{ipm},\,\tau_{iqm}\in\Delta_l} \left\langle B_{ip}B_{iq}A_{ipm}A^*_{iqm}\, E\!\left[\exp\big(j\omega((\tau_{ip0}-\tau_{ipm})-(\tau_{iq0}-\tau_{iqm}))\big)\right] \right\rangle \right)
+ \sum_{k=1}^{K} \left|\hat{N}_k(\omega)\right|^2 \sum_{\tau_{kpm},\,\tau_{kqm}\in\Delta_l} \left\langle B_{ip}B_{iq}A_{kpm}A^*_{kqm}\, E\!\left[\exp\big(j\omega((\tau_{ip0}-\tau_{kpm})-(\tau_{iq0}-\tau_{kqm}))\big)\right] \right\rangle,     (6)
where the angular brackets denote the average value over all microphone pairs. Note that the complex
exponential terms for the target are in phase with the filter coefficients, so the direct-path target term
results in a coherent addition. The other terms, related to the noise, are expected to diminish due to the
decorrelation from the (ideally) incoherent phases of the complex exponentials over all microphone pairs.
To investigate the statistics of the SRCP image values for noise only, the sound source at ri is set to 0,
time delays are converted to spatial distances d, and frequencies to wavelengths (λ) to obtain:
E\left[B_{ip}B_{iq}\hat{V}_p(\omega,\Delta_l)\hat{V}^*_q(\omega,\Delta_l)\exp\big(j\omega(\tau_{ip0}-\tau_{iq0})\big)\right] =
E\left[\exp\!\left(j2\pi\,\frac{d_{ip0}-d_{iq0}}{\lambda}\right)\right] \sum_{k=1}^{K} \left|\hat{N}_k(\lambda)\right|^2 \sum_{d_{kqn}\in\Delta_l}\; \sum_{d_{kpm}\in\Delta_l} \left\langle B_{ip}B_{iq}A_{kpm}A^*_{kqn}\, E\left[\exp\!\left(j2\pi\,\frac{d_{kqn}-d_{kpm}}{\lambda}\right)\right] \right\rangle.     (7)
Equation (7) shows that the level of incoherence, or decorrelation, results from the two complex
exponential arguments (one inside the summation, due to the multi-path of the reverberations, and the
other factored out of the noise source summation). If the microphone distribution is such that the
inter-path distances in the first exponential term are on average much smaller than the wavelengths of the
source, the phases of the complex exponential arguments are limited to a small range about 0, resulting
in coherent sums independent of the source location. Ideally, if the complex exponential arguments
uniformly span from -π to π over all microphone pairs for non-source locations, the expected value
becomes zero. Note that the two exponential terms of Eq. (7), responsible for decorrelating the noise
sources, are related to the FOV, the microphone positions, and the frequency content of the sources.
Since the first exponential depends only on the microphone geometry, and it scales all the noise
components in the summation, this factor will be analyzed to determine the degree to which the
microphone geometries generate a zero-mean distribution for off-target/noise sources.
The differential path length distributions can be examined with histograms and characterized with
statistics. Once the distributions are known, the expected value of Eq. (7) can be obtained. Closed-form
expressions can be derived for the cases of normal and uniform distributions for the inter-path
distances. Let Δ_ipq be the random variable of the inter-path distance over all microphone pairs. In the
case of a normal distribution with standard deviation σ_Δ the expected value becomes:

E\left[\exp\!\left(-j2\pi\,\frac{\Delta_{ipq}}{\lambda}\right)\right] = \exp\!\left(-2\left(\frac{\pi\sigma_\Delta}{\lambda}\right)^{2}\right).     (8)
In the case of a zero-mean uniform distribution with standard deviation σ_Δ the expected value becomes:

E\left[\exp\!\left(-j2\pi\,\frac{\Delta_{ipq}}{\lambda}\right)\right] = \mathrm{sinc}\!\left(\frac{\sqrt{12}\,\sigma_\Delta}{\lambda}\right).     (9)
The relationships in Eqs. (8) and (9) indicate that the distribution mean can never be exactly 0 over a
range of frequencies. However, it can be driven to a sufficiently small value by increasing the inter-path
standard deviation relative to the source wavelength. A zero-mean condition is necessary for symmetry,
but not sufficient; the distribution can be skewed as well. In order to determine the frequency range over
which an effectively zero-mean symmetric noise distribution is likely, σ_Δ must be evaluated for a given
microphone geometry and FOV. It must also be determined which distribution (uniform or Gaussian)
most accurately describes the distribution of inter-path distances.
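Equation (9) can be checked numerically; the short script below draws zero-mean uniform inter-path distances with a chosen standard deviation and compares the empirical mean of the complex exponential with the sinc prediction. The values of σ_Δ and λ are illustrative choices, not results from the paper:

```python
import numpy as np

# Monte Carlo check of Eq. (9): for zero-mean uniform inter-path distances
# with standard deviation sigma, E[exp(-j*2*pi*Delta/lam)] = sinc(sqrt(12)*sigma/lam),
# where np.sinc is the normalized sinc, sin(pi*x)/(pi*x).
rng = np.random.default_rng(1)
sigma, lam = 0.69, 346.0 / 300.0                 # example sigma; wavelength at 300 Hz
half_width = np.sqrt(3.0) * sigma                # uniform on [-a, a] has std a/sqrt(3)
delta = rng.uniform(-half_width, half_width, 200_000)
empirical = np.mean(np.exp(-2j * np.pi * delta / lam)).real
predicted = np.sinc(np.sqrt(12.0) * sigma / lam)
print(empirical, predicted)                      # should agree to a few decimal places
```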
Experimental Systems
For the thresholding procedure to perform well, the symmetry of the noise distribution must be
established. Equations (8) and (9) show that a mean offset can exist. In addition, skewness can also
contribute to the non-symmetric nature of the distribution. To more fully explore the deviations from
symmetry for noise-only data and the factors that influence them, simulation and experimental data are
generated to examine the resulting distributions and to test the CFAR performance of thresholds
designed with this procedure. Figure 1 shows the 3 microphone distributions used in this study. Each
configuration included 16 microphones equally spaced over a particular geometry, with the FOV being a
square planar region, 3 meters on a side, located 1.57 meters above the floor. This plane was sampled at
4 centimeter intervals in the X and Y directions to create the SRP image. Figure 1a shows a linear array
placed 1.52 meters above the floor, 0.5 meters away from the FOV edge, with a spacing of 0.23 meters
between microphones. The array was symmetrically placed along the y-axis relative to the FOV.
Figure 1b shows a perimeter array with microphones placed 1.52 meters above the floor, 0.5 m away
from the FOV, with a microphone spacing of 0.848 m along the perimeter. Figure 1c shows the planar
array with microphones placed in a plane 1.985 m above the ground, arranged in a rectangular grid
starting in a corner directly above the FOV with a 1 m spacing in the X and Y directions.
For the experimental measurements, a cage of aluminum struts around the FOV held the
microphones in place; positions were measured with a laser meter and tape measure. Based on the
difficulty in measuring the tip-to-tip distances of the microphones, the level of precision was estimated
to be on the order of 1 cm. The speed of sound was measured on the day of each recording and was
347 m/s for the linear array and 346 m/s for the perimeter and planar arrays.
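A sketch of how σ_Δ can be mapped over the FOV for a given geometry is given below; the linear-array coordinates are approximate reconstructions from the description above (FOV plane taken as z = 0, microphones 5 cm below it), so they should be treated as assumptions rather than the measured positions:

```python
import numpy as np
from itertools import combinations

def interpath_std_map(mics, fov_points):
    """Standard deviation of the inter-path distances d_p - d_q over all
    microphone pairs, evaluated at each candidate FOV point (see Eqs. 7-9)."""
    mics = np.asarray(mics)                       # (P, 3)
    pts = np.asarray(fov_points)                  # (M, 3)
    d = np.linalg.norm(pts[:, None, :] - mics[None, :, :], axis=2)   # (M, P)
    # 120 unordered pairs for P = 16; both orderings give the 240 mirrored
    # samples used in the text's histograms
    pairs = list(combinations(range(len(mics)), 2))
    dipq = np.stack([d[:, p] - d[:, q] for p, q in pairs], axis=1)   # (M, 120)
    return dipq.std(axis=1)

# Hypothetical linear array: 16 mics, 0.23 m spacing, 0.5 m beyond the FOV edge
mics = [(-1.725 + 0.23 * p, -2.0, -0.05) for p in range(16)]
xs = np.arange(-1.5, 1.5 + 1e-9, 0.04)            # 4 cm FOV grid
fov = [(x, y, 0.0) for x in xs for y in xs]
sig = interpath_std_map(mics, fov)
print(sig.min(), sig.max())                       # smallest/largest sigma over the FOV
```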
Figure 1. Microphone distributions and FOV (shaded plane) for simulation and experimental recordings
with axes in meters. Square and star markers denote the smallest and largest (respectively) microphone
inter-distance standard deviation over all pairs: (a) linear, (b) perimeter, and (c) planar.
In order to determine the nature of the inter-path distances for each configuration, the histogram over
all microphone pairs (240 for each point) was plotted for the FOV positions corresponding to the
maximum and minimum inter-distance standard deviations. These positions are indicated with the
square (minimum) and star (maximum) markers on the FOVs in Fig. 1. Note that the minimum variance
corresponds to the center of the arrays, where the array gain is typically the highest. Equations (8) and
(9) predict that the noise symmetry at these points (especially for lower frequencies) will be poorest,
degrading the CFAR performance. This also implies significant interference from low-frequency noise
sources located away from the point of interest.
Figure 2 shows the histograms of the microphone inter-path distances for the minimum and maximum
variance points.

[Figure 2: six histogram panels over ±5 m, with standard deviations of 0.21, 1.42, 0.38, 1.48, 0.67, and 1.88 m.]
Figure 2. Histograms and standard deviation of inter-distance microphone pairs for a point in the FOV
for each distribution. (a) linear minimum variance (b) linear maximum variance, (c) perimeter minimum
variance, (d) perimeter maximum variance, (e) planar minimum variance, (f) planar maximum variance.
The experiment in this paper considers 3 microphone spatial distributions: linear, perimeter and
planar. The FOV for all cases was a 3m by 3m plane located 1.57m above the floor. The linear array
consisted of 16 microphones in the FOV plane, along a line parallel to the edge of the FOV and 5 cm
outside the edge. Microphones were equally spaced at 0.5m. The perimeter array consisted of 16
microphones within the FOV plane and symmetrically distributed outside the FOV forming the vertices
of an equilateral octagon with 1.27m sides. The planar array consisted of 16 microphones located in a
plane parallel to and 0.58m above the FOV. The microphones were placed on the vertices of 2
concentric rectangles with sides parallel to those of the FOV. The inner rectangle was 1.81m by 1.63m
and the outer rectangle was 3.4m by 3.54m.
The histograms of inter-path distances for all microphone pairs to all points in the FOV plane (taken at
intervals of 4 cm in the x and y directions) are plotted in Fig. 3. The histograms suggest the distributions
are closer to uniform than Gaussian; therefore, Eq. (9) is used for the expected value. For the linear,
perimeter, and planar geometries the standard deviations of the inter-path distances were 0.69, 1.24,
and 1.08 m, respectively. This suggests the superiority of the planar array in limiting partial coherences
at lower frequencies, resulting in a more symmetric noise distribution for the coherent power.
Figure 3. Histograms of inter-path distances for the 3 microphone and FOV geometries.
For the experiments presented in this paper, a high-pass filter with a 300 Hz cutoff is applied to remove
low-frequency noise (room modes) and to limit the partial coherences predicted by Eq. (7). For
frequencies greater than 300 Hz, Eq. (9) indicates the expected deviation of the distribution mean from
zero is 0.06 for the linear, 0.05 for the perimeter, and 0.02 for the planar geometry. While these values
appear small, they may significantly impact the FA threshold depending on how small an FA probability
is required. A higher cutoff frequency can drive the mean smaller at the expense of losing significant
portions of the target signal. For the sake of illustrating the relative performance differences
between the geometries, a cutoff at 300 Hz is used in all cases.
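The paper does not specify the high-pass filter's type or order; one plausible realization is a zero-phase Butterworth high-pass, sketched below with an assumed sampling rate:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

# A possible realization of the 300 Hz pre-filter: a 4th-order Butterworth
# high-pass (the filter type and order here are assumptions, not the paper's).
fs = 16000                                   # assumed sampling rate
sos = butter(4, 300, btype='highpass', fs=fs, output='sos')

def prefilter(x):
    """Zero-phase high-pass to suppress room modes below 300 Hz."""
    return sosfiltfilt(sos, x, axis=-1)
```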
References
[1] P. L. Chu. Desktop mic array for teleconferencing. In Proceedings of ICASSP95, IEEE, 1995.
[2] J. L. Flanagan, D. A. Berkley, G. W. Elko, J. E. West, and M. M. Sondhi. Autodirective microphone
systems. Acustica, vol. 73, pp. 58-71, 1991.
[3] W. Kellermann. A self-steering digital microphone array. In Proceedings of ICASSP91, pages 3581-3584, IEEE, May 1991.
[4] F. Khalil, J. P. Jullien and A. Gilloire. Microphone array for sound pickup in teleconferencing systems. J.
Audio Eng. Soc., vol. 42, no. 9, September 1994.
[5] H. Wang and P. Chu. Voice source localization for automatic camera pointing system in
videoconferencing. In Proceedings of ICASSP, volume 1, pages 187-190. IEEE, 1997.
[6] J. E. Adcock, Y. Gotoh, D. J. Mashao and H. F. Silverman. Microphone-array speech recognition via
incremental MAP training. In Proceeding of ICASSP96, Atlanta, GA, May 1996.
[7] C. Che, Q. Lin, J. Pearson, B. deVries and J. Flanagan. Microphone arrays and neural networks for
robust speech recognition. In Proceedings of the Human Language Technology Workshop, pages 342-347, Plainsboro, NJ, March 8-11, 1994.
[8] C. Che, M. Rahim and J. Flanagan. Robust speech recognition in a multimedia teleconferencing
environment. J. Acoust. Soc. Am., 92(4):2476, 1992.
[9] D. Giuliani, M. Omologo and P. Svaizer. Talker localization and speech recognition using a microphone
array and a cross-power spectrum phase analysis. In Proceedings of ICSLP, volume 3, pages 1243-1246,
September 1994.
[10] T. B. Hughes, H. Kim, J. H. DiBiase, and H. F. Silverman. Performance of an HMM speech recognizer
using a real-time tracking microphone array as input. IEEE Trans. Speech Audio Proc., 7(3):346-349, May
1999.
[11] T. B. Hughes, H. Kim, J. H. DiBiase, and H. F. Silverman. Using a real-time, tracking microphone array
as input to an HMM speech recognizer. In Proceedings of ICASSP98, IEEE, 1998.
[12] H. F. Silverman. Some analysis of microphone arrays for speech data acquisition. IEEE Trans. on
Acoustics, Speech, and Signal Processing, 35(12):1699-1711, December 1987.
[13] S. M. Yoon, S. C. Kee, Speaker detection and tracking at mobile robot platform, Proc. 2004 Intl.
Symp. on Intelligent Signal Processing and Communication Systems, 2004, pp. 596–600.
[14] T. S. Huang, "Multimedia/multimodal signal processing, analysis, and understanding," First
International Symposium on Control, Communications and Signal Processing, 2004, p. 1.
[15] J. H. DiBiase, H. F. Silverman, M. S. Brandstein, “Robust Localization in Reverberant Rooms, ”
Microphone Arrays, Signal Processing Techniques and Applications, Springer Verlag, Berlin, 2001, pp.
157–180.
[16] T. Gustafsson, B. D. Rao, M. Trivedi, "Source localization in reverberant environments: modeling
and statistical analysis,” IEEE Trans. on Speech and Audio Proc., vol. 11, no. 6, Jun. 2003, pp. 791–803.
[17] K. D. Donohue, J. Hannemann, and H. G. Dietz, "Performance of Phase Transform for Detecting
Sound Sources in Reverberant and Noisy Environments," Signal Processing, vol. 87, no. 7, pp. 1677-1691, July 2007.
[18] A. Ramamurthy, H. Unnikrishnan, K. D. Donohue, "Experimental Performance Analysis of Sound
Source Detection with SRP PHAT-β," Proceedings of the IEEE Southeastcon, March 2009.
[19] K.D. Donohue, K.S. McReynolds, A. Ramamurthy, “Sound Source Detection Threshold Estimation
using Negative Coherent Power," Proceedings of the IEEE Southeastcon 2008, pp. 575-580, April 2008.