Automatic Speech Processing Methods For Bioacoustic Signal Analysis: A Case Study Of Cross-Disciplinary Acoustic Research

Mark D. Skowronski and John G. Harris
Computational Neuro-Engineering Lab, University of Florida, Gainesville, FL

ABSTRACT

Automatic speech processing research has produced many advances in the analysis of time series. Knowledge of the production and perception of speech has guided the design of many useful algorithms, and automatic speech recognition has been at the forefront of the machine learning paradigm. In contrast to the advances made in automatic speech processing, analysis of other bioacoustic signals, such as those from dolphins and bats, has lagged behind. In this paper, we demonstrate how techniques from automatic speech processing can significantly impact bioacoustic analysis, using echolocating bats as our model animal. Compared to conventional techniques, machine learning methods reduced detection and species classification error rates by an order of magnitude. Furthermore, the signal-to-noise ratio of an audible monitoring signal was improved by 12 dB using techniques from noise-robust feature extraction and speech synthesis. The work demonstrates the impact that speech research can have across disciplines.

Conventional method: Features [2,4-8]: min frequency, max frequency, frequency at peak amplitude, and duration, extracted from hand-labeled calls using noise-robust methods [3].
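To make the four conventional features concrete, here is a minimal Python sketch that computes them from a per-frame track of one hand-labeled call. The input format (a list of amplitude and frequency pairs), the function name `call_features`, and the constant frame duration are assumptions made for illustration; the noise-robust extraction of [3] is not reproduced here.

```python
def call_features(track, frame_dur):
    """Return the four conventional features for one hand-labeled call.

    track: list of (amplitude, frequency_hz) pairs, one per frame
           (hypothetical input format, not taken from the poster).
    frame_dur: frame duration in seconds (assumed constant).
    """
    if not track:
        raise ValueError("track must contain at least one frame")
    freqs = [freq for _, freq in track]
    # Frame with the largest amplitude gives the frequency at peak amplitude.
    _, peak_freq = max(track, key=lambda pair: pair[0])
    return {
        "min_frequency": min(freqs),
        "max_frequency": max(freqs),
        "frequency_at_peak_amplitude": peak_freq,
        "duration": len(track) * frame_dur,
    }


# Example with made-up values: a three-frame downward sweep.
example = call_features([(0.2, 60000.0), (1.0, 45000.0), (0.5, 40000.0)], 0.001)
```

A feature vector of this kind is what a discriminant-function classifier would consume, one vector per call.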
\[ E(k) = \frac{1}{L} \sum_{n=1}^{L} x_k^2(n) \]

\[ d(k) = \begin{cases} 1, & E(k) > \theta \\ 0, & E(k) \le \theta \end{cases} \]

xk(n) - frame k of raw signal x(n)
E(k) - energy in frame k
L - frame length (~1 ms)

\[ p(x_k \mid \omega_i) = \sum_{m=1}^{M} w_{i,m}\, G(x_k, \mu_{i,m}, \Sigma_{i,m}) \]

\[ d(k) = \begin{cases} 1, & \log p(x_k \mid \omega_2) - \log p(x_k \mid \omega_1) > \theta \\ 0, & \log p(x_k \mid \omega_2) - \log p(x_k \mid \omega_1) \le \theta \end{cases} \]

xk - input features for frame k: spectral peak amplitude, frequency at peak amplitude, first- and second-order temporal derivatives
ωi - class of signal: i = 1 for background frames, i = 2 for call frames
p(xk|ωi) - class-conditional probability density for frame k of input feature vector x given class ωi
G - Gaussian kernel with mean vector μ and covariance matrix Σ, estimated from hand-labeled data
wi,m, μi,m, Σi,m - mixture weight, mean, and covariance of mth kernel for class ωi
d(k) - detection decision for frame k
θ - likelihood threshold

[Figure panels: Pipistrellus bodenheimeri; Molossus molossus]

        Pb            Mm            Lb            Lc            Tb            Total
Pb      99.8 ± 0.3    0 ± 0         0 ± 0         0 ± 0         0.2 ± 0.3     99.8 ± 0.3
Mm      0.03 ± 0.2    95.6 ± 1.6    0 ± 0         0 ± 0         4.3 ± 1.6     95.6 ± 1.6
Lb      0 ± 0         0 ± 0         99.8 ± 0.1    0.2 ± 0.1     0 ± 0         99.8 ± 0.1
Lc      0 ± 0         0 ± 0         0.2 ± 0.2     99.8 ± 0.2    0 ± 0         99.8 ± 0.2
Tb      0 ± 0         0.2 ± 0.2     0 ± 0         0 ± 0         99.8 ± 0.2    99.8 ± 0.2
Total                                                                         99.4 ± 0.2

DFA
        Pb            Mm            Lb            Lc            Tb            Total
Pb      97.1 ± 0.8    0.2 ± 0.2     2.7 ± 0.8     0 ± 0         0 ± 0         97.1 ± 0.8
Mm      0.6 ± 0.5     76.7 ± 3      4.1 ± 2       17.3 ± 3      1.3 ± 0.6     76.7 ± 3
Lb      1.2 ± 0.4     16.9 ± 1.5    79.6 ± 1.3    0.3 ± 0.3     2.1 ± 0.5     79.6 ± 1.3
Lc      0 ± 0         1.1 ± 0.9     0.3 ± 0.5     89.7 ± 1.4    8.8 ± 0.9     89.7 ± 1.4
Tb      0 ± 0         6.6 ± 1.5     5.4 ± 1.4     16.5 ± 3      71.4 ± 3      71.4 ± 3
Total                                                                         83.1 ± 1.1

Detector                 T        F        Sensitivity
GMM                 1    4314     176      0.96
                    0    3985     96615
Peak energy         1    4132     358      0.92
                    0    7819     92781
Broadband energy    1    3047     1443     0.68
                    0    31891    68709

Conventional methods: Frequency division, time expansion, zero crossings, heterodyne [11].
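The two frame-level decision rules (energy threshold and GMM log-likelihood ratio) and the sensitivity figures in the detection table can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: it assumes diagonal covariance matrices for simplicity (the poster specifies a covariance matrix Σ per kernel), and all function names, model parameters, and thresholds are hypothetical.

```python
import math


def frame_energy(frame):
    """E(k): mean of the squared samples in one frame."""
    return sum(s * s for s in frame) / len(frame)


def energy_detector(frames, theta):
    """d(k) = 1 if E(k) > theta, else 0, for each frame."""
    return [1 if frame_energy(f) > theta else 0 for f in frames]


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with DIAGONAL covariance (an assumption)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | class) for a mixture, using log-sum-exp for stability."""
    terms = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def gmm_detector(x, background, call, theta):
    """d(k) = 1 if log p(x|call) - log p(x|background) > theta, else 0."""
    llr = gmm_log_likelihood(x, *call) - gmm_log_likelihood(x, *background)
    return 1 if llr > theta else 0


def sensitivity_specificity(hits, misses, false_alarms, correct_rejections):
    """Sensitivity and specificity from the four confusion-matrix counts."""
    sensitivity = hits / (hits + misses)
    specificity = correct_rejections / (correct_rejections + false_alarms)
    return sensitivity, specificity


# GMM row of the detection table: both rates come out near 0.96.
gmm_sens, gmm_spec = sensitivity_specificity(4314, 176, 3985, 96615)
```

Applying `sensitivity_specificity` to the peak-energy and broadband-energy rows gives values near 0.92 and 0.68, and each detector's four counts sum to the 105,090 detection blocks reported on the poster.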
SYNTHESIS

Synthetic method: For each frame i of features [amplitude(i), frequency(i)]:

\[ A_i(t) = \mathrm{amplitude}(i) + \bigl(\mathrm{amplitude}(i+1) - \mathrm{amplitude}(i)\bigr)\, t \]

\[ f_i = \mathrm{frequency}(i), \qquad K = \text{frequency division factor} \]

\[ s_i(t) = A_i(t) \sin\!\left( 2\pi f_i t / K + \phi_i \right) \]

\[ \phi_{i+1} = \phi_i + 2\pi f_i T / K \]

CLASSIFICATION

Classifier [2,7-9]: discriminant function analysis (DFA) with stratified covariance matrices (quadratic).

Gaussian mixture model (GMM) classifier: Same as the GMM detector, except that the ωi represent the species. The log likelihood averaged over all K frames of a call was calculated for each class, and the classifier output was the label of the class with the maximum averaged log likelihood.

Hidden Markov model (HMM) classifier [10]: State model of a nonstationary signal; each state represents a pseudostationary probability density function with a GMM. One model for each species was trained using the Baum-Welch algorithm on hand-labeled calls. Testing was performed using the Viterbi dynamic programming algorithm, which determines the log likelihood of the single most likely state sequence through a model.

Classification confusion matrices (HMM and DFA): Average and st. dev. over 20 trials of randomly selected test and train calls, 50% test, 50% train. The GMM and HMM results were statistically indistinguishable (t-test, p>0.9).

DETECTION

Conventional method [1,2]:
d(k) - detection decision
θ - energy threshold

Gaussian mixture model (GMM) [3]

Detector output examples: Each gray column is a hand-labeled call from a pass of 25 calls from L. borealis. The black horizontal line represents θ for equal sensitivity and specificity.

Confusion matrices at equal sensitivity and specificity: 105,090 detection blocks (20 ms).

[Figure panels: Spectral mean subtraction; Tadarida brasiliensis; Lasiurus borealis]

BIBLIOGRAPHY

[1] M. K. Obrist, “Flexible bat echolocation: the influence of individual, habitat and conspecifics on sonar signal design,” Behav. Ecol. Sociobiol., vol. 36, pp. 207-219, 1995.
[2] S. Parsons and G.
Jones, “Acoustic identification of twelve species of echolocating bat by discriminant function analysis and artificial neural networks,” J. Exp. Biol., vol. 203, pp. 2641-2656, 2000.
[3] M. D. Skowronski and J. G. Harris, “Acoustic detection and classification of microchiroptera using machine learning: lessons learned from automatic speech recognition,” J. Acoust. Soc. Am., 2005, submitted.
[4] M. B. Fenton and G. P. Bell, “Recognition of species of insectivorous bats by their echolocation calls,” J. Mammal., vol. 62, no. 2, pp. 233-243, May 1981.
[5] M. J. O'Farrell, B. W. Miller, and W. L. Gannon, “Qualitative identification of free-flying bats using the Anabat detector,” J. Mammal., vol. 80, no. 1, pp. 11-23, Jan. 1999.
[6] M. K. Obrist, “Flexible bat echolocation: the influence of individual, habitat and conspecifics on sonar signal design,” Behav. Ecol. Sociobiol., vol. 36, pp. 207-219, 1995.
[7] M. K. Obrist, R. Boesch, and P. F. Fluckiger, “Variability in echolocation call design of 26 Swiss bat species: consequences, limits and options for automated field identification with a synergetic pattern recognition approach,” Mammalia, vol. 68, no. 4, pp. 307-322, Dec. 2004.
[8] R. F. Lance, B. Bollich, C. L. Callahan, and P. L. Leberg, “Surveying forest-bat communities with Anabat detectors,” in Bats and Forests Symposium, R. M. R. Barclay and R. M. Brigham, eds., Res. Br., B.C. Min. For., Victoria, B.C., CA, pp. 175-184, 1996.
[9] D. Russo and G. Jones, “Identification of twenty-two bat species (Mammalia: Chiroptera) from Italy by analysis of time-expanded recordings of echolocation calls,” J. Zool., Lond., vol. 258, no. 1, pp. 91-103, Sept. 2002.
[10] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” in Readings in Speech Recognition, A. Waibel and K.-F. Lee, eds., Kaufmann, San Mateo, CA, pp. 267-296, 1990.
[11] S. Parsons, A. M. Boonman, and M. K. Obrist, “Advantages and disadvantages of techniques for transforming and analyzing chiropteran echolocation calls,” J. Mammal., vol. 81, no. 4, pp. 927-938, Nov. 2000.
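The synthetic method of the SYNTHESIS panel (a frequency-divided sinusoid per frame, with interpolated amplitude and a phase term carried across frame boundaries) can be illustrated with the following Python sketch. It rests on several assumptions that are not stated on the poster: the amplitude is interpolated over the fraction of the frame elapsed, T is taken to be the frame duration rounded to a whole number of output samples, the last frame holds its amplitude, and the function name and parameters are hypothetical.

```python
import math


def synthesize(amplitudes, frequencies, frame_dur, division, sample_rate):
    """Synthesize an audible monitoring signal from per-frame features.

    amplitudes, frequencies: one value per frame i.
    frame_dur: nominal frame duration in seconds.
    division: frequency division factor K.
    sample_rate: output sample rate in Hz.
    """
    n_per_frame = max(1, round(frame_dur * sample_rate))
    # Effective frame duration T, so the phase update matches the sample grid.
    T = n_per_frame / sample_rate
    samples = []
    phase = 0.0  # phi_i, carried across frames to avoid discontinuities
    for i, (a, f) in enumerate(zip(amplitudes, frequencies)):
        # Assumption: the final frame keeps a constant amplitude.
        a_next = amplitudes[i + 1] if i + 1 < len(amplitudes) else a
        for n in range(n_per_frame):
            t = n / sample_rate
            # Assumed interpolation over the elapsed fraction of the frame.
            amp = a + (a_next - a) * (n / n_per_frame)
            samples.append(amp * math.sin(2.0 * math.pi * f * t / division + phase))
        # phi_{i+1} = phi_i + 2*pi*f_i*T/K
        phase += 2.0 * math.pi * f * T / division
    return samples


# Example with made-up values: a constant 40 kHz tone divided by 16 (2.5 kHz).
tone = synthesize([1.0, 1.0, 1.0], [40000.0] * 3, 0.001, 16, 44100)
```

With constant amplitude and frequency, the phase update makes the output identical to one continuous sinusoid at f/K, which is the point of carrying the phase term from frame to frame.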