Current work at ICSI
Dan Ellis
International Computer Science Institute, Berkeley CA
<dpwe@icsi.berkeley.edu>

Outline
1. Broadcast News MLP recognizer
2. Topic modeling
3. Acoustic segment classification
4. Thisl demonstrator front-end

Thisl ICSI Status - Dan Ellis 1999feb03 - 1


The modulation-filtered spectrogram (Brian Kingsbury)

• Goal: invariance to variable acoustics
  - filter out irrelevant modulations
  - channel adaptation (on-line auto. gain control)
  - multiple representations

[Figure: MSG feature block diagram. Speech → Bark-scale power-spectral filterbank → envelope filtering (lowpass 0-16 Hz; bandpass 2-16 Hz) → automatic gain control (AGC) stages (lowpass branch: τ = 160 ms and τ = 320 ms; bandpass branch: τ = 160 ms and τ = 640 ms) → lowpass features and bandpass features.]

• Results (small vocabulary):

  Feature   Clean test WER   Reverb test WER
  plp       5.9%             22.2%
  msg       6.1%             13.8%


Broadcast News recognizer

• 1998 evaluation
  - RNN + MLP
• 8000 HU nets trained for MLP-only system:

  combo WER%   RNN98   MSG-8kHz   PLP-16kHz
  RNN98        27.2    24.9       24.5
  MSG-8kHz             29.7       24.4
  PLP-16kHz                       25.5

  - RNN+MSG+PLP: 23.7%
  - plp 8000HU forward-pass ~0.7x real time (spert)
• Gender-dependent versions:

  net set      WERF%   WERM%   WER%
  plp-GD       20.3    27.2    24.6
  msg-GD
  plp+msg-GD


Broadcast News: ongoing

• Dynamic pronunciations (Eric Fosler)
  - data-derived rules for context-dependent pronunciations: phones, syllables, words, rate ...
  - rescored N-best output from 1st pass
  - ~3% RER improvement
• Multiband (Adam Janin / Nikki Mirghafori)
  - 20% RER for small-vocabulary (Numbers)
  - no significant improvement yet for BN
  - features: MSG, cepstra, KLT, plp
  - all-way possible combinations & weights


Multiband for Broadcast News (Adam Janin / Nikki Mirghafori)

• Scheme that worked best for small vocab:
  - 4-way frequency split
  - plp cepstra+deltas within each band
  - MLP classifier for each band + MLP combiner

[Figure: multiband system block diagram. Speech Signal → Front end → per-band Power → per-band Prob. estimator → MLP Merger → Viterbi Decoder → Recognized Words.]

• Weighted average of all possible combos
  - p(q | a,b,c,d) = ∑_S p(q | S,a,b,c,d) · p(S)
    S ranges over 16 possible combinations
  - p(S) from? constant, local feature (entropy)
  - oracle best p(S) → WER=19% (25% RER)


Topic modeling (Dan Gildea & Thomas Hofmann)

• Bayesian model:
  - p(word | doc) = ∑_t p(word | topic) p(topic | doc)
  - EM modeling of p(word | topic) & p(topic | doc) over training set
  - p(topic | doc) estimated from context in recognition
• Use to modify language model weights
  - p(word) ∝ p_tri(word) p_top(word) / p_uni(word)
  - WSJ: trigram perplexity of 109 reduced 17%
  - use for BN recognition?
• Use for topic segmentation?


Acoustic Segment Classification (Gethin Williams (SU) & Dan Ellis)

• Features from posteriors show utterance type:
  - average per-frame entropy
  - 'dynamism' - mean squared 1st-order difference
  - average energy of 'silence' label
  - covariance matrix distance to clean speech

[Figure: posterior displays (phone index vs. time in 16 ms frames) for Speech, Speech+Music, and Music segments, and a segment feature scatter plot of entropy vs. dynamism.]

• 100% on Scheirer/Slaney speech-music testset
• Use for acoustic segmentation?


Thisl demo development

  - Stand-alone Tcl/Tk implementation
  - doesn't require httpd
  - speech-input ready
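
Sketch: weighted average over band combinations

The combination rule on the multiband slide, p(q | a,b,c,d) = ∑_S p(q | S,a,b,c,d) · p(S), can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the function name `combine_band_posteriors`, the three-class toy posteriors, and the constant weighting p(S) = 1/16. In a real system each p(q | S, ...) would come from a trained estimator for that band subset.

```python
from itertools import product


def combine_band_posteriors(subset_posteriors, subset_weights):
    """Return sum over S of p(q | S, x) * p(S), with p(S) renormalized to sum to 1."""
    total_weight = sum(subset_weights.values())
    n_classes = len(next(iter(subset_posteriors.values())))
    combined = [0.0] * n_classes
    for subset, posterior in subset_posteriors.items():
        weight = subset_weights[subset] / total_weight
        for q, p in enumerate(posterior):
            combined[q] += weight * p
    return combined


# All 16 on/off combinations of the 4 frequency bands (a, b, c, d).
subsets = list(product((0, 1), repeat=4))

# Toy posteriors over 3 classes (hypothetical): the more bands a subset
# uses, the more strongly it favors class 0; the empty subset is uniform.
subset_posteriors = {}
for s in subsets:
    k = sum(s)
    subset_posteriors[s] = [(1 + k) / (3 + k), 1 / (3 + k), 1 / (3 + k)]

# Constant p(S), one of the options listed on the slide.
subset_weights = {s: 1.0 for s in subsets}

combined = combine_band_posteriors(subset_posteriors, subset_weights)
```

Replacing the constant weights with a per-frame quantity, such as a function of each estimator's entropy, would correspond to the "local feature" option on the slide. How that mapping should look is left open there and is not shown here.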
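
Sketch: topic-based rescaling of trigram probabilities

The language-model adjustment on the topic modeling slide, p(word) ∝ p_tri(word) p_top(word) / p_uni(word) with p_top(word) = ∑_t p(word | topic) p(topic | doc), can be sketched as below. The vocabulary, the two topics, and all probability values are made up for illustration, and the function names are hypothetical. The EM training of the topic model is not shown.

```python
def topic_unigram(word, p_word_given_topic, p_topic_given_doc):
    """Mixture unigram: sum over topics of p(word | topic) * p(topic | doc)."""
    return sum(
        p_topic_given_doc[t] * p_word_given_topic[t].get(word, 0.0)
        for t in p_topic_given_doc
    )


def adapt_trigram(p_tri, p_uni, p_word_given_topic, p_topic_given_doc):
    """Scale each trigram probability by p_top / p_uni, then renormalize."""
    scores = {}
    for word, p in p_tri.items():
        p_top = topic_unigram(word, p_word_given_topic, p_topic_given_doc)
        scores[word] = p * p_top / p_uni[word]
    norm = sum(scores.values())
    return {word: s / norm for word, s in scores.items()}


# Toy distributions (assumed values, three-word vocabulary).
p_uni = {"stock": 0.1, "game": 0.1, "the": 0.8}
p_tri = {"stock": 0.2, "game": 0.2, "the": 0.6}  # for one fixed history
p_word_given_topic = {
    "finance": {"stock": 0.30, "game": 0.02, "the": 0.68},
    "sports": {"stock": 0.02, "game": 0.30, "the": 0.68},
}
p_topic_given_doc = {"finance": 0.9, "sports": 0.1}

adapted = adapt_trigram(p_tri, p_uni, p_word_given_topic, p_topic_given_doc)
```

With the document weighted toward the finance topic, the adapted distribution raises "stock" and lowers "game" relative to the trigram values, while still summing to one.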
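
Sketch: entropy and dynamism from posteriors

Two of the segment features on the acoustic segment classification slide, average per-frame entropy and 'dynamism' (mean squared 1st-order difference), can be computed from a sequence of posterior vectors roughly as follows. The slide does not give the exact normalization, so this sketch assumes natural-log entropy and a squared difference summed over classes and averaged over frame transitions. The function names and toy inputs are hypothetical.

```python
import math


def mean_entropy(posteriors):
    """Average over frames of -sum_q p(q) * log p(q)."""
    total = 0.0
    for frame in posteriors:
        total += -sum(p * math.log(p) for p in frame if p > 0.0)
    return total / len(posteriors)


def dynamism(posteriors):
    """Mean over frame transitions of the squared first-order difference,
    summed over classes (assumed normalization)."""
    if len(posteriors) < 2:
        return 0.0
    total = 0.0
    for prev, cur in zip(posteriors, posteriors[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, cur))
    return total / (len(posteriors) - 1)


# Toy posteriors over 3 classes: confident and rapidly changing,
# versus maximally uncertain and static.
crisp = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
flat = [[1 / 3, 1 / 3, 1 / 3]] * 3

crisp_features = (mean_entropy(crisp), dynamism(crisp))
flat_features = (mean_entropy(flat), dynamism(flat))
```

The two toy inputs land at opposite corners of the entropy/dynamism plane: low entropy with high dynamism for the confident, changing sequence, and high entropy with zero dynamism for the flat one.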