Automatic Speech Processing Methods For Bioacoustic Signal Analysis:

A Case Study Of Cross-Disciplinary Acoustic Research
Mark D. Skowronski and John G. Harris
Computational Neuro-Engineering Lab, University of Florida, Gainesville, FL
ABSTRACT
Automatic speech processing research has produced many advances
in the analysis of time series. Knowledge of the production and
perception of speech has guided the design of many useful
algorithms, and automatic speech recognition has been at the
forefront of the machine learning paradigm. In contrast to the
advances made in automatic speech processing, analysis of other
bioacoustic signals, such as those from dolphins and bats, has lagged
behind. In this paper, we demonstrate how techniques from
automatic speech processing can significantly impact bioacoustic
analysis, using echolocating bats as our model animal. Compared to
conventional techniques, machine learning methods reduced
detection and species classification error rates by an order of
magnitude. Furthermore, the signal-to-noise ratio of an audible
monitoring signal was improved by 12 dB using techniques from
noise-robust feature extraction and speech synthesis. The work
demonstrates the impact that speech research can have across
disciplines.
Conventional method:
Features [2,4-8]: min frequency, max frequency, frequency at peak
amplitude, and duration, extracted from hand-labeled calls using
noise-robust methods [3].
$$E(k) = \frac{1}{L} \sum_{n=1}^{L} x_k^2(n)$$

$$d(k) = \begin{cases} 1, & E(k) > \theta \\ 0, & E(k) \le \theta \end{cases}$$
xk(n) - frame k of raw signal x(n)
E(k) - energy in frame k
L - frame length (~1ms)
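The frame-energy detector defined by E(k), d(k), and L above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the non-overlapping framing, and the toy signal are assumptions, since the poster does not say whether frames overlap.

```python
def frame_energies(x, frame_len):
    """Mean squared amplitude E(k) of each non-overlapping frame of x."""
    n_frames = len(x) // frame_len
    return [
        sum(v * v for v in x[k * frame_len:(k + 1) * frame_len]) / frame_len
        for k in range(n_frames)
    ]


def energy_detector(x, frame_len, theta):
    """d(k) = 1 when E(k) exceeds the energy threshold theta, else 0."""
    return [1 if e > theta else 0 for e in frame_energies(x, frame_len)]


# Toy signal (assumed data): two quiet frames followed by one loud frame.
signal = [0.0, 0.1, -0.1, 0.0, 0.1, 0.0, 0.0, -0.1, 1.0, -1.0, 1.0, -1.0]
energies = frame_energies(signal, frame_len=4)
decisions = energy_detector(signal, frame_len=4, theta=0.25)
```

With a real recording, frame_len would be chosen so that one frame spans roughly 1 ms at the recording's sample rate.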
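$$p(x_k \mid \omega_i) = \sum_{m=1}^{M} w_{i,m}\, G(x_k, \mu_{i,m}, \Sigma_{i,m})$$

$$d(k) = \begin{cases} 1, & \log p(x_k \mid \omega_2) - \log p(x_k \mid \omega_1) > \theta \\ 0, & \log p(x_k \mid \omega_2) - \log p(x_k \mid \omega_1) \le \theta \end{cases}$$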
xk - input features for frame k: spectral peak amplitude, frequency at
peak amplitude, first- and second-order temporal derivatives
ωi - class of signal: i = 1 for background frames, i = 2 for call frames
p (xk|ωi) - class-conditional probability density for frame k of input
feature vector x given class ωi
G - Gaussian kernel with mean vector μ and covariance matrix Σ,
estimated from hand-labeled data
wi,m, μi,m, Σi,m - mixture weight, mean, and covariance of mth kernel for
class ωi
d(k) - detection decision for frame k
θ - likelihood threshold
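The log-likelihood-ratio decision described by these definitions can be sketched in a few lines. The sketch below makes several assumptions that are not in the poster: diagonal covariance matrices (the poster uses full covariance matrices Σ), a two-feature input instead of the full feature vector with temporal derivatives, and hypothetical model parameters chosen only for illustration.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (an assumption)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_gmm(x, weights, means, variances):
    """log p(x | class) for a Gaussian mixture, using log-sum-exp for stability."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def gmm_detect(x, background, call, theta=0.0):
    """d(k) = 1 when log p(x | call) - log p(x | background) exceeds theta."""
    llr = log_gmm(x, *call) - log_gmm(x, *background)
    return 1 if llr > theta else 0


# Hypothetical models over (peak amplitude in dB, peak frequency in kHz),
# each given as (weights, means, variances).
background = ([1.0], [[10.0, 20.0]], [[25.0, 400.0]])
call = ([0.5, 0.5], [[40.0, 30.0], [40.0, 50.0]], [[25.0, 25.0], [25.0, 25.0]])

loud_frame = gmm_detect([42.0, 31.0], background, call)
quiet_frame = gmm_detect([9.0, 22.0], background, call)
```

In practice the mixture parameters would be estimated from hand-labeled frames, and θ would be swept to trade sensitivity against specificity.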
Pipistrellus bodenheimeri:
Molossus molossus:
HMM
        Pb           Mm           Lb           Lc           Tb           Total
Pb      99.8 ± 0.3   0 ± 0        0 ± 0        0 ± 0        0.2 ± 0.3    99.8 ± 0.3
Mm      0.03 ± 0.2   95.6 ± 1.6   0 ± 0        0 ± 0        4.3 ± 1.6    95.6 ± 1.6
Lb      0 ± 0        0 ± 0        99.8 ± 0.1   0.2 ± 0.1    0 ± 0        99.8 ± 0.1
Lc      0 ± 0        0 ± 0        0.2 ± 0.2    99.8 ± 0.2   0 ± 0        99.8 ± 0.2
Tb      0 ± 0        0.2 ± 0.2    0 ± 0        0 ± 0        99.8 ± 0.2   99.8 ± 0.2
Total                                                                    99.4 ± 0.2
DFA
        Pb           Mm           Lb           Lc           Tb           Total
Pb      97.1 ± 0.8   0.2 ± 0.2    2.7 ± 0.8    0 ± 0        0 ± 0        97.1 ± 0.8
Mm      0.6 ± 0.5    76.7 ± 3     4.1 ± 2      17.3 ± 3     1.3 ± 0.6    76.7 ± 3
Lb      1.2 ± 0.4    16.9 ± 1.5   79.6 ± 1.3   0.3 ± 0.3    2.1 ± 0.5    79.6 ± 1.3
Lc      0 ± 0        1.1 ± 0.9    0.3 ± 0.5    89.7 ± 1.4   8.8 ± 0.9    89.7 ± 1.4
Tb      0 ± 0        6.6 ± 1.5    5.4 ± 1.4    16.5 ± 3     71.4 ± 3     71.4 ± 3
Total                                                                    83.1 ± 1.1
                         T       F       Sensitivity
GMM                1     4314    176     0.96
                   0     3985    96615
Peak energy        1     4132    358     0.92
                   0     7819    92781
Broadband energy   1     3047    1443    0.68
                   0     31891   68709
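As a consistency check on the detection counts listed above, sensitivity and specificity can be recomputed directly. The sketch assumes one reading of the layout that the poster does not label explicitly: row 1 holds blocks that contain a hand-labeled call, row 0 holds background blocks, and columns T and F hold blocks where the detector did or did not fire. The function and variable names are illustrative.

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# Counts as (TP, FP, FN, TN), under the assumed reading of the table.
detectors = {
    "GMM": (4314, 3985, 176, 96615),
    "Peak energy": (4132, 7819, 358, 92781),
    "Broadband energy": (3047, 31891, 1443, 68709),
}

rates = {name: sensitivity_specificity(*c) for name, c in detectors.items()}
totals = {name: sum(c) for name, c in detectors.items()}

for name, (sens, spec) in rates.items():
    print(f"{name}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Under this reading, each detector's counts sum to 105,090 blocks and its sensitivity and specificity agree to within 0.01, which matches the equal-sensitivity-and-specificity operating point stated for these confusion matrices.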
Conventional methods:
Frequency division, time expansion, zero crossings, heterodyne [11].
Synthetic method:
For each frame i of features [amplitude(i), frequency(i)]:
$$A_i(t) = \mathrm{amplitude}(i) + \big(\mathrm{amplitude}(i+1) - \mathrm{amplitude}(i)\big) \cdot t$$

$$f_i = \mathrm{frequency}(i)$$

K = frequency division factor

$$s_i(t) = A_i(t) \cdot \sin(2\pi f_i t / K + \phi_i)$$

$$\phi_{i+1} = \phi_i + 2\pi f_i T / K$$
State model of a nonstationary signal: each state represents a pseudo-stationary probability density function modeled with a GMM. One model for
each species was trained using the Baum-Welch algorithm on hand-labeled calls. Testing was performed using the Viterbi dynamic
programming algorithm, which determines the log likelihood of the
single most likely state sequence through a model.
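The Viterbi scoring step can be sketched in the log domain as below. This is an illustration under stated assumptions, not the authors' code: the per-frame emission scores are supplied directly (in the poster they would come from each state's GMM), and the two-state left-to-right model and its probabilities are hypothetical.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """Natural log that maps zero probability to negative infinity."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi_log_likelihood(log_init, log_trans, log_emit):
    """Log likelihood of the single best state path through one HMM.

    log_init[s]     : log probability of starting in state s
    log_trans[r][s] : log probability of moving from state r to state s
    log_emit[k][s]  : log density of frame k under state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    for frame in log_emit[1:]:
        score = [
            max(score[r] + log_trans[r][s] for r in range(n_states)) + frame[s]
            for s in range(n_states)
        ]
    return max(score)


# Hypothetical 2-state left-to-right model and three frames of emission scores.
log_init = [safe_log(1.0), safe_log(0.0)]
log_trans = [[safe_log(0.5), safe_log(0.5)], [safe_log(0.0), safe_log(1.0)]]
log_emit = [
    [safe_log(0.9), safe_log(0.1)],
    [safe_log(0.2), safe_log(0.8)],
    [safe_log(0.1), safe_log(0.9)],
]

best = viterbi_log_likelihood(log_init, log_trans, log_emit)
```

To classify a call, this score would be computed once per species model and the label of the highest-scoring model returned.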
DETECTOR
SYNTHESIS
Hidden Markov model (HMM) classifier [10]:
Confusion matrices at equal sensitivity and
specificity: 105,090 detection blocks (20 ms)
CLASSIFICATION
Same as the GMM detector, except that each ωi represents a species.
Averaged log likelihood over all K frames of a call was calculated for
each class, and the classifier output was the label of the class with
the maximum averaged log likelihood.
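The decision rule just described, averaging the frame log likelihoods of a call and picking the best class, can be sketched as follows. The function name and the per-frame scores are hypothetical; in practice each score would be log p(xk|ωi) from the species model for frame k.

```python
def classify_call(frame_log_likelihoods):
    """Pick the species whose model gives the highest frame-averaged log likelihood.

    frame_log_likelihoods maps a species label to the list of per-frame
    log likelihoods for the K frames of one call.
    """
    averaged = {
        label: sum(scores) / len(scores)
        for label, scores in frame_log_likelihoods.items()
    }
    return max(averaged, key=averaged.get), averaged


# Hypothetical per-frame scores for one three-frame call.
scores = {
    "Pb": [-4.0, -5.0, -6.0],
    "Mm": [-2.0, -3.0, -2.5],
    "Tb": [-3.0, -2.0, -4.0],
}
label, averaged = classify_call(scores)
```

Every model scores the same K frames of the call, so the averaged and summed log likelihoods rank the classes identically; the average simply keeps scores comparable across calls of different lengths.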
Average and st. dev. over 20 trials of randomly selected test and train calls,
50% test, 50% train. The GMM and HMM results were statistically
indistinguishable (t-test, p>0.9).
Each gray column is a hand-labeled call from a pass of 25 calls from L. borealis.
The black horizontal line represents θ for equal sensitivity and specificity.
Gaussian mixture model (GMM) [3]:
Gaussian mixture model (GMM) classifier:
Classification confusion matrices:
Detector output examples:
d(k) - detection decision
θ - energy threshold
Classifier [2,7-9]: discriminant function analysis (DFA) with stratified
covariance matrices (quadratic)
DETECTION
Conventional method [1,2]:
Spectral mean
subtraction:
Tadarida brasiliensis:
Lasiurus borealis:
BIBLIOGRAPHY
[1] M. K. Obrist, “Flexible bat echolocation: the influence of individual, habitat and conspecifics on sonar signal design,” Behav. Ecol. Sociobiol., vol. 36, pp. 207-219, 1995
[2] S. Parsons and G. Jones, “Acoustic identification of twelve species of echolocating bat by discriminant function analysis and artificial neural networks,” J. Exp. Biol., vol. 203, pp. 2641-2656, 2000
[3] M. D. Skowronski and J. G. Harris, “Acoustic detection and classification of microchiroptera using machine learning: lessons learned from automatic speech recognition,” J. Acoust. Soc. Am., 2005, submitted
[4] M. B. Fenton and G. P. Bell, “Recognition of species of insectivorous bats by their echolocation calls,” J. Mammal., vol. 62, no. 2, pp. 233-243, May 1981
[5] M. J. O'Farrell, B. W. Miller, and W. L. Gannon, “Qualitative identification of free-flying bats using the Anabat detector,” J. Mammal., vol. 80, no. 1, pp. 11-23, Jan. 1999
[6] M. K. Obrist, “Flexible bat echolocation: the influence of individual, habitat and conspecifics on sonar signal design,” Behav. Ecol. Sociobiol., vol. 36, pp. 207-219, 1995
[7] M. K. Obrist, R. Boesch, and P. F. Fluckiger, “Variability in echolocation call design of 26 Swiss bat species: consequences, limits and options for automated field identification with a synergetic pattern recognition approach,” Mammalia, vol. 68, no. 4, pp. 307-322, Dec. 2004
[8] R. F. Lance, B. Bollich, C. L. Callahan, and P. L. Leberg, “Surveying forest-bat communities with Anabat detectors,” in Bats and Forests Symposium, R. M. R. Barclay and R. M. Brigham, eds., Res. Br., B.C. Min. For., Victoria, B.C., CA, pp. 175-184, 1996
[9] D. Russo and G. Jones, “Identification of twenty-two bat species (Mammalia: Chiroptera) from Italy by analysis of time-expanded recordings of echolocation calls,” J. Zool., Lond., vol. 258, no. 1, pp. 91-103, Sept. 2002
[10] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” in Readings in Speech Recognition, A. Waibel and K.-F. Lee, eds., Kaufmann, San Mateo, CA, pp. 267-296, 1990
[11] S. Parsons, A. M. Boonman, and M. K. Obrist, “Advantages and disadvantages of techniques for transforming and analyzing chiropteran echolocation calls,” J. Mammal., vol. 81, no. 4, pp. 927-938, Nov. 2000