Statistical automatic identification of microchiroptera from echolocation calls

Lessons learned from human automatic speech recognition
Mark D. Skowronski and John G. Harris
Computational Neuro-Engineering Lab
Electrical and Computer Engineering
University of Florida
Gainesville, FL, USA
November 19, 2004
Overview
• Motivations for bat acoustic research
• Review bat call classification methods
• Contrast with 1970s human ASR
• Experiments
• Conclusions
Bat research motivations
• Bats are among:
– the most diverse,
– the most endangered,
– and the least studied mammals.
• 1000 species, ~25% of all mammal species
• Close relationship with insects, agricultural
impact, disease vectors
• Acoustical research non-invasive, significant
domain (echolocation)
• Simplified biological acoustic communication
system (compared to human speech)
Bat echolocation
• Ultrasonic, brief chirps
• Determine range, velocity of nearby objects
(clutter, prey, conspecifics)
• Tailored for task, environment
Tadarida brasiliensis (Mexican free-tailed bat)
Listen to 10x time-expanded search calls:
Echolocation calls
• Two characteristics
– Frequency modulated -- range
– Constant frequency -- velocity
• Features (holistic; sketched after this slide)
– Freq. extrema
– Duration
– Shape
– # harmonics
– Call interval
Mexican free-tailed calls, concatenated
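As a rough illustration of how the holistic features listed on this slide (frequency extrema, duration) can be measured, here is a minimal sketch that thresholds a spectrogram of an isolated call. The scipy front-end, FFT size, and -30 dB threshold are illustrative assumptions, not the authors' method.

```python
import numpy as np
from scipy.signal import spectrogram

def holistic_features(call, fs, nfft=256, thresh_db=-30.0):
    """Estimate Fmin, Fmax, and duration of one isolated call by keeping
    spectrogram bins within thresh_db of the peak (illustrative settings)."""
    f, t, S = spectrogram(call, fs=fs, nperseg=nfft, noverlap=nfft // 2)
    S_db = 10.0 * np.log10(S + 1e-12)
    active_f, active_t = np.nonzero(S_db > S_db.max() + thresh_db)
    return {"Fmin_Hz": f[active_f].min(),
            "Fmax_Hz": f[active_f].max(),
            "duration_s": t[active_t].max() - t[active_t].min()}

# Hypothetical usage: a 3 ms downward FM sweep, 60 kHz to 25 kHz, at 200 kS/s.
fs = 200_000
t = np.arange(int(0.003 * fs)) / fs
call = np.cos(2 * np.pi * (60_000 * t - 0.5 * (35_000 / 0.003) * t ** 2))
print(holistic_features(call, fs))
```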
Current classification methods
• Expert sonogram readers
– Manual or automatic feature extraction
– Comparison with exemplar sonograms
• Automatic classification
– Decision trees
– Discriminant function analysis
– Artificial neural networks
– Spectrogram correlation
Parallels the knowledge-based approach to human ASR from the
1970s (acoustic phonetics, expert systems, cognitive approach).
Acoustic phonetics
[Figure: phoneme-labeled utterance "the football game is over": DH AH F UH T B AO L G EY M IH Z OW V ER]
• Bottom up paradigm
– Frames, boundaries, groups, phonemes, words
• Manual or automatic feature extraction
– Formants, voicing, duration, intensity, transitions
• Classification
– Decision tree, discriminant functions, neural
network, Gaussian mixture model, Viterbi path
Acoustic phonetics limitations
• Variability of conversational speech
– Complex rules, difficult to train
• Boundaries difficult to define
– Coarticulation
• Feature estimates brittle
– Variable noise robustness
• Hard decisions, errors accumulate
Shifted to information theoretic paradigm of human ASR,
better able to account for variability of speech, noise.
Information theoretic ASR
• Data-driven models from computer
science
– Non-parametric: dynamic time warp (DTW; sketched below)
– Parametric: hidden Markov model (HMM)
• Frame-based
– Expert information in feature extraction
– Models account for feature, temporal
variability
Information theoretic ASR dominates state-of-the-art speech
understanding systems.
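For reference, a minimal sketch of the non-parametric model named above: dynamic time warping between two frame-based feature sequences, followed by nearest-template classification. This is a generic textbook formulation under assumed choices (Euclidean frame distance, basic symmetric step pattern), not the authors' implementation.

```python
import numpy as np

def _as_frames(x):
    """Treat a 1-D sequence as a column of single-dimension frames."""
    x = np.asarray(x, dtype=float)
    return x[:, None] if x.ndim == 1 else x

def dtw_distance(a, b):
    """DTW distance between feature sequences a (N x d) and b (M x d),
    Euclidean frame distance, basic symmetric step pattern."""
    a, b = _as_frames(a), _as_frames(b)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)          # normalize for unequal sequence lengths

def dtw_classify(test_seq, train_seqs, train_labels):
    """Label a test call with the class of its nearest training template."""
    dists = [dtw_distance(test_seq, ref) for ref in train_seqs]
    return train_labels[int(np.argmin(dists))]
```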
Data collection
• UF Bat House, home to 60,000 bats
– Mexican free-tailed bat (vast majority)
– Evening bat
– Southeastern myotis
• Continuous recording
– 90 minutes around sunset
– ~20,000 calls
• Equipment:
– B&K mic (4939), 100 kHz
– B&K preamp (2670)
– Custom amp/AA filter
– NI 6036E 200 kS/s A/D card
– Laptop, Matlab
Experiment design
• Designs and assumptions
– All recorded bats are Mexican free-tailed
– Calls divided into distinct intraspecies call types
– All calls are search phase
– Hand-labeled call detection is complete (no discarded calls)
• Hand labels
– Narrowband spectrogram
– Endpoints, class label
– 436 calls in 261 0.5-sec sequences (2% of data)
– Four classes, a priori: 34, 40, 20, 6%
– All experiments on hand-labeled data only
Experiments
• Baseline (sketched after this slide)
– Features: Fmin, Fmax, Fmax_energy, and duration, from zero crossings and MUSIC
– Classifier: Discriminant function analysis, quadratic boundaries
• DTW and HMM
– Frame-based features: fundamental frequency (MUSIC
super-resolution estimate), log energy, temporal derivatives
(HMM only)
– DTW: MUSIC frequencies, 10% endpoint range
– HMM: 5 states/model, 4 Gaussian mixtures/state, diagonal
covariances
• Tests
– Leave one out
– 75% train, 25% test, 1000 trials
– Test on train (HMM only)
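A minimal sketch of the baseline path on this slide: quadratic discriminant analysis over the four holistic features, scored with leave-one-out and repeated 75/25 splits. It assumes scikit-learn rather than the authors' Matlab code, and X and y below are random placeholders for the hand-labeled feature set.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, ShuffleSplit, cross_val_score

# Placeholder data: rows = calls, columns = [Fmin, Fmax, Fmax_energy, duration].
rng = np.random.default_rng(0)
X = rng.normal(size=(436, 4))        # stand-in for the 436 hand-labeled calls
y = rng.integers(0, 4, size=436)     # stand-in for the four class labels

qda = QuadraticDiscriminantAnalysis()  # discriminant functions, quadratic boundaries

# Leave-one-out accuracy.
loo_acc = cross_val_score(qda, X, y, cv=LeaveOneOut()).mean()

# Repeated 75% train / 25% test trials (1000 in the slides; fewer here for speed).
splits = ShuffleSplit(n_splits=100, test_size=0.25, random_state=0)
trial_acc = cross_val_score(qda, X, y, cv=splits)
print(loo_acc, trial_acc.mean(), trial_acc.std())
```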
Results
• Baseline, zero crossing
– Leave one out: 72.5% correct
– Repeated trials: 72.5 ± 4% (mean ± std)
• Baseline, MUSIC
– Leave one out: 79.1%
– Repeated trials: 77.5 ± 4%
• DTW, MUSIC
– Leave one out: 74.5 %
– Repeated trials: 74.1 ± 4%
• HMM, MUSIC
– Test on train: 85.3%
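The difference between the two baselines above comes from the frequency front-end: zero crossings versus a MUSIC super-resolution estimate. Below is a minimal single-frame sketch of both estimators; the snapshot length m, signal-subspace dimension p, and grid density are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def zero_crossing_freq(frame, fs):
    """Coarse estimate: half the zero-crossing rate (assumes one dominant tone)."""
    crossings = np.count_nonzero(np.diff(np.signbit(frame).astype(np.int8)))
    return 0.5 * crossings * fs / len(frame)

def music_freq(frame, fs, m=32, p=2, n_grid=4096):
    """Super-resolution estimate from the peak of the MUSIC pseudospectrum
    (p = 2 covers the two complex exponentials of one real sinusoid)."""
    snaps = np.lib.stride_tricks.sliding_window_view(frame, m)
    R = snaps.T @ snaps / snaps.shape[0]          # m x m autocorrelation estimate
    _, vecs = np.linalg.eigh(R)                   # eigenvalues in ascending order
    noise = vecs[:, : m - p]                      # noise-subspace eigenvectors
    freqs = np.linspace(0.0, fs / 2, n_grid)
    steer = np.exp(-2j * np.pi * np.outer(np.arange(m), freqs) / fs)
    pseudo = 1.0 / np.sum(np.abs(noise.T @ steer) ** 2, axis=0)
    return freqs[np.argmax(pseudo)]

# Quick check on a synthetic 40 kHz tone sampled at 200 kS/s (hypothetical frame).
fs = 200_000
t = np.arange(512) / fs
frame = np.cos(2 * np.pi * 40_000 * t)
print(zero_crossing_freq(frame, fs), music_freq(frame, fs))
```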
Confusion matrices
(rows: true class; columns: classified class; right column: per-class accuracy)

Baseline, zero crossing
       1    2    3    4
  1  107   38    1    2   72.3%
  2   21  134   16    4   76.6%
  3    2   29   57    0   64.8%
  4    4    3    0   18   72.0%
Overall: 72.5%

Baseline, MUSIC
       1    2    3    4
  1  110   36    1    1   74.3%
  2   12  149   12    2   85.1%
  3    4   18   66    0   75.0%
  4    3    2    0   20   80.0%
Overall: 79.1%

DTW, MUSIC
       1    2    3    4
  1  115   29    0    4   77.7%
  2   32  131   11    1   74.9%
  3    5   20   63    0   71.6%
  4    5    4    0   16   64.0%
Overall: 74.5%

HMM, MUSIC (test on train)
       1    2    3    4
  1  118   25    0    5   79.7%
  2   10  154    5    6   88.0%
  3    1   12   75    0   85.2%
  4    0    0    0   25   100%
Overall: 85.3%
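The percentages above are per-class accuracies (diagonal entry over row sum) and overall accuracy (trace over total count). A small helper that reproduces them, shown here on the HMM matrix, is sketched below.

```python
import numpy as np

def accuracies(cm):
    """Per-class accuracy (diagonal / row sum) and overall accuracy (trace / total)."""
    cm = np.asarray(cm, dtype=float)
    return np.diag(cm) / cm.sum(axis=1), np.trace(cm) / cm.sum()

# HMM, MUSIC confusion matrix (rows: true class, columns: classified class).
hmm_cm = [[118,  25,  0,  5],
          [ 10, 154,  5,  6],
          [  1,  12, 75,  0],
          [  0,   0,  0, 25]]
per_class, overall = accuracies(hmm_cm)
print(np.round(100 * per_class, 1), round(100 * overall, 1))  # [79.7 88. 85.2 100.] 85.3
```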
Conclusions
• Human ASR algorithms applicable to bat
echolocation calls
• Experiments
– Weakness: accuracy of class labels
– No labeled calls excluded
– HMM most accurate, undertrained
– MUSIC frequency estimate robust, slow
• Machine learning
– DTW: fast training, slow classification
– HMM: slow training, fast classification
Future work
• Find robust features of bat echolocation calls
that match assumptions of machine learning
algorithms
– Noise robust
– Distribution modeled by Gaussian mixtures
• Use hand-labeled subset of data to create call
detection algorithm
• Explore unsupervised learning
– Self-organized maps
– Clustering
• Real-time portable detection/classification
system on laptop PC
Further information
• http://www.cnel.ufl.edu/~markskow
• markskow@cnel.ufl.edu
• DTW reference:
– L. Rabiner and B. Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, NJ, 1993.
• HMM reference:
– L. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” in Readings in Speech Recognition, A. Waibel and K.F. Lee, Eds., pp. 267–296, Morgan Kaufmann, San Mateo, CA, 1990.