Statistical automatic identification of microchiroptera from echolocation calls
Lessons learned from human automatic speech recognition

Mark D. Skowronski and John G. Harris
Computational Neuro-Engineering Lab
Electrical and Computer Engineering
University of Florida, Gainesville, FL, USA
November 19, 2004

Overview
• Motivations for bat acoustic research
• Review bat call classification methods
• Contrast with 1970s human ASR
• Experiments
• Conclusions

Bat research motivations
• Bats are among:
  – the most diverse,
  – the most endangered,
  – and the least studied mammals.
• 1000 species, ~25% of all mammal species
• Close relationship with insects, agricultural impact, disease vectors
• Acoustical research non-invasive, significant domain (echolocation)
• Simplified biological acoustic communication system (compared to human speech)

Bat echolocation
• Ultrasonic, brief chirps
• Determine range, velocity of nearby objects (clutter, prey, conspecifics)
• Tailored for task, environment
[Figure: Tadarida brasiliensis (Mexican free-tailed bat)]
[Audio: 10x time-expanded search calls]

Echolocation calls
• Two characteristics
  – Frequency modulated: range
  – Constant frequency: velocity
• Features (holistic)
  – Freq. extrema
  – Duration
  – Shape
  – # harmonics
  – Call interval
[Figure: Mexican free-tailed calls, concatenated]

Current classification methods
• Expert sonogram readers
  – Manual or automatic feature extraction
  – Comparison with exemplar sonograms
• Automatic classification
  – Decision trees
  – Discriminant function analysis
  – Artificial neural networks
  – Spectrogram correlation
Parallels the knowledge-based approach to human ASR from the 1970s (acoustic phonetics, expert systems, cognitive approach).

Acoustic phonetics
[Figure: phoneme labels DH AH F UH T B AO L G EY EM IH Z OW V ER]
• Bottom up paradigm
  – Frames, boundaries, groups, phonemes, words
• Manual or automatic feature extraction
  – Formants, voicing, duration, intensity, transitions
• Classification
  – Decision tree, discriminant functions, neural network, Gaussian mixture model, Viterbi path

Acoustic phonetics limitations
• Variability of conversational speech
  – Complex rules, difficult to train
• Boundaries difficult to define
  – Coarticulation
• Feature estimates brittle
  – Variable noise robustness
• Hard decisions, errors accumulate
Shifted to information theoretic paradigm of human ASR, better able to account for variability of speech, noise.

Information theoretic ASR
• Data-driven models from computer science
  – Non-parametric: dynamic time warp (DTW)
  – Parametric: hidden Markov model (HMM)
• Frame-based
  – Expert information in feature extraction
  – Models account for feature, temporal variability
Information theoretic ASR dominates state-of-the-art speech understanding systems.
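
To make the non-parametric DTW idea concrete, here is a minimal Python sketch of template matching with dynamic time warping over frame-based feature sequences. It is an illustration only, not the implementation used in these experiments: the function names (`dtw_distance`, `classify`), the absolute-difference frame cost, the fixed endpoints, and the toy frequency contours (in kHz) are all assumptions made for the example.

```python
import math

def dtw_distance(a, b):
    """Accumulated DTW cost between two 1-D feature sequences.

    Assumption: one scalar feature per frame (e.g. a frequency estimate in kHz)
    and an absolute-difference local cost. Endpoints are fixed, so the first and
    last frames of both sequences are forced to align.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # allowed moves: advance in a only, in b only, or in both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def classify(call, templates):
    """Return the label of the stored template nearest to `call` under DTW."""
    return min(templates, key=lambda item: dtw_distance(call, item[1]))[0]

# Hypothetical frequency contours (kHz), one value per frame.
templates = [
    ("steep", [50, 40, 30, 25]),
    ("shallow", [30, 29, 28, 27, 26]),
]
call = [50, 50, 40, 30, 30, 25]   # time-stretched version of the "steep" contour
print(classify(call, templates))  # prints "steep"
```

The sketch shows why DTW training is cheap (templates are simply stored) while classification is comparatively costly: every test call is warped against every stored template, at a cost proportional to the product of the two sequence lengths.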

Data collection
• UF Bat House, home to 60,000 bats
  – Mexican free-tailed bat (vast majority)
  – Evening bat
  – Southeastern myotis
• Continuous recording
  – 90 minutes around sunset
  – ~20,000 calls
• Equipment:
  – B&K mic (4939), 100 kHz
  – B&K preamp (2670)
  – Custom amp/AA filter
  – NI 6036E 200 kS/s A/D card
  – Laptop, Matlab

Experiment design
• Designs and assumptions
  – All recorded bats are Mexican free-tailed
  – Calls divided into different intraspecies calls
  – All calls are search phase
  – Hand-labeled call detection is complete (no discarded calls)
• Hand labels
  – Narrowband spectrogram
  – Endpoints, class label
  – 436 calls in 261 0.5-sec sequences (2% of data)
  – Four classes, a priori: 34, 40, 20, 6%
  – All experiments on hand-labeled data only
• Baseline experiments
  – Features: Fmin, Fmax, Fmax_energy, and duration, from zero crossings and MUSIC
  – Classifier: discriminant function analysis, quadratic boundaries
• DTW and HMM
  – Frame-based features: fundamental frequency (MUSIC super-resolution estimate), log energy, temporal derivatives (HMM only)
  – DTW: MUSIC frequencies, 10% endpoint range
  – HMM: 5 states/model, 4 Gaussian mixtures/state, diagonal covariances
• Tests
  – Leave one out
  – 75% train, 25% test, 1000 trials
  – Test on train (HMM only)

Results
• Baseline, zero crossing
  – Leave one out: 72.5% correct
  – Repeated trials: 72.5 ± 4% (mean ± std)
• Baseline, MUSIC
  – Leave one out: 79.1%
  – Repeated trials: 77.5 ± 4%
• DTW, MUSIC
  – Leave one out: 74.5%
  – Repeated trials: 74.1 ± 4%
• HMM, MUSIC
  – Test on train: 85.3%

Confusion matrices
(Rows: hand-labeled class; columns: assigned class; last column: percent correct per class.)

Baseline, zero crossing
       1     2     3     4
1    107    38     1     2    72.3%
2     21   134    16     4    76.6%
3      2    29    57     0    64.8%
4      4     3     0    18    72.0%
Overall: 72.5%

Baseline, MUSIC
       1     2     3     4
1    110    36     1     1    74.3%
2     12   149    12     2    85.1%
3      4    18    66     0    75.0%
4      3     2     0    20    80.0%
Overall: 79.1%

DTW, MUSIC
       1     2     3     4
1    115    29     0     4    77.7%
2     32   131    11     1    74.9%
3      5    20    63     0    71.6%
4      5     4     0    16    64.0%
Overall: 74.5%

HMM, MUSIC
       1     2     3     4
1    118    25     0     5    79.7%
2     10   154     5     6    88.0%
3      1    12    75     0    85.2%
4      0     0     0    25    100%
Overall: 85.3%

Conclusions
• Human ASR algorithms applicable to bat echolocation calls
• Experiments
  – Weakness: accuracy of class labels
  – No labeled calls excluded
  – HMM most accurate, undertrained
  – MUSIC frequency estimate robust, slow
• Machine learning
  – DTW: fast training, slow classification
  – HMM: slow training, fast classification

Future work
• Find robust features of bat echolocation calls that match assumptions of machine learning algorithms
  – Noise robust
  – Distribution modeled by Gaussian mixtures
• Use hand-labeled subset of data to create call detection algorithm
• Explore unsupervised learning
  – Self-organizing maps
  – Clustering
• Real-time portable detection/classification system on laptop PC

Further information
• http://www.cnel.ufl.edu/~markskow
• markskow@cnel.ufl.edu
• DTW reference:
  – L. Rabiner and B. Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, NJ, 1993.
• HMM reference:
  – L. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” in Readings in Speech Recognition, A. Waibel and K.F. Lee, Eds., pp. 267–296. Kaufmann, San Mateo, CA, 1990.