ICSI / ThisL status report
December 1997

Dan Ellis
International Computer Science Institute, Berkeley CA
<dpwe@icsi.berkeley.edu>

1 Software tools & packages
2 Speech/nonspeech separation
3 Speech in reverberation

ThisL; ICSI status - Dan Ellis 1997dec10 - 1


1 Software tools & packages: ICSI Speech Recognition system

• New components:
  - feacalc: enhanced RASTA (I/O formats, options)
  - pfile_utils: comprehensive manipulations (editing, stats, etc.)
• Portable package:
  - first test: bring up recognizer at IDIAP
• Visualization
  - [incr Tcl] classes for display
  - recognizer visualization...


Software tools & packages: recogui visualization

  - modular objects for reuse
  - simple configuration files
  - broad use within ICSI


2 Speech vs. nonspeech: Comp. Aud. Scene Analysis for ASR

• For handling sound mixtures, attempt to estimate individual sound sources
  - listeners do this transparently
• Previous approach (Weintraub...)
  - ‘enhance-then-recognize’: extract by periodicity, resynthesize, recognize
• But...
  - problems with ‘holes’
  - which cues to separate speech?
  ...doesn’t exploit knowledge of speech structure

[Figure: two spectrogram panels, brn1h.aif and brn1h.fi.aif; axes frq/Hz (100 to 3000) vs. time/s (0.2 to 1.0)]


Prediction-driven CASA and ASR

• Prediction-driven CASA:
  - don’t derive, but construct an explanation consistent with observations

[Figure dd-vs-pd-1997oct: block diagram with blocks Front end, Compare & reconcile, Hypothesis management, Predict & combine; hypotheses comprising Noise elements, Periodic elements, Speech elements; signals labeled input mixture, signal features, prediction errors, predicted features]

  - need to express ‘predictions’ in signal domain
  - iterate over each component
• Components make projections
  - into e.g.
    ‘the space of all speech sounds’


A Speech Hypothesis module

[Figure modif-ASR (dpwe 1997apr09): block diagram of a conventional speech recognizer and its inversion. Labels: spectral feature vectors, RASTA normalization, normalized spectral features, PLP cepstrum, cepstral feature vectors, Neural net classifier, phonetic label probabilities, HMM decoder, words, phonetic label sequence, Ideal reconstitution, cepstral feature vectors, Inverse cepstrum, partial residual spectrum, slowly-varying components, Denormalize, estimated spectral envelope]

• Want to exploit constraints of decoder
• Invert each stage of speech recognizer
  - classifier by: trained estimator
  - normalization by: recovering from input


Preliminary results

[Figure: spectrogram panels for 223cl, labeled Speech1, Click5, Clicks1−7; axes f/Hz (200 to 4000) vs. time/s (0.0 to 1.4); level scale −10 to 20 dB]

• Prediction shortfall dominates result
  - improve inverse classification
  - more normalization
• To complete iteration:
  - need p(q|X,M)
• Initial separation by f0?


3 Speech-in-reverberation

• Modest reverb has severe impact (RT = 0.5 s, D/R ≈ 0 dB)
• Information/combination at various timescales
  - modulation spectral features, syllable units
  - combine results at utterance level (N-bests)
  - combine results at syllable level (HMM decomposition, [Dupont & Bourlard ’97])

  WER%                        Clean speech   Reverb (6 dB SNR)
  Baseline (Rasta-PLP8)            6.8             27.8
  ModSpec Syllable base            9.8             30.9
  Utterance-level combin’n         5.5             19.6
  Syllable-level combin’n          5.4             18.6

• 2-pass decoder to avoid state explosion
  - lattice output for compatibility
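
A minimal sketch of what "combine results at utterance level (N-bests)" could look like. The slides do not say how the two N-best lists were merged, so everything here is an assumption for illustration: the function name combine_nbest, the weighted sum of log scores, the weight_a and floor parameters, and the toy hypotheses and scores.

```python
def combine_nbest(nbest_a, nbest_b, weight_a=0.5, floor=-50.0):
    """Merge two N-best lists of (hypothesis, log_score) pairs.

    Assumed scheme (not taken from the slides): each hypothesis gets a
    weighted sum of its log scores from the two recognizers. A hypothesis
    missing from one list receives the `floor` score for that recognizer.
    Returns (hypothesis, combined_score) pairs, best first.
    """
    scores_a = dict(nbest_a)
    scores_b = dict(nbest_b)
    combined = {}
    for hyp in set(scores_a) | set(scores_b):
        combined[hyp] = (weight_a * scores_a.get(hyp, floor)
                         + (1.0 - weight_a) * scores_b.get(hyp, floor))
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


# Toy N-best lists standing in for the baseline and modulation-spectrum
# syllable systems; the strings and scores are made up.
baseline_nbest = [("call home now", -10.0), ("call hone now", -11.0)]
modspec_nbest = [("call hone now", -9.0), ("call home now", -9.5),
                 ("tall home now", -12.0)]

ranked = combine_nbest(baseline_nbest, modspec_nbest)
best_hypothesis = ranked[0][0]
print(best_hypothesis)
```

In this toy case each system alone ranks a different hypothesis first, and the combined score decides between them. A real system would tune the weight on held-out data.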
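
To read the WER table on the reverberation slide as relative changes against the Rasta-PLP8 baseline, the arithmetic can be sketched as below. The numbers are copied from the table; the dictionary layout and the helper name relative_reduction are illustrative choices, not part of the original slides.

```python
# WER% from the table: (clean speech, reverb)
WER = {
    "Baseline (Rasta-PLP8)": (6.8, 27.8),
    "ModSpec Syllable base": (9.8, 30.9),
    "Utterance-level combin'n": (5.5, 19.6),
    "Syllable-level combin'n": (5.4, 18.6),
}


def relative_reduction(baseline, system):
    """Relative WER reduction in percent (negative means worse than baseline)."""
    return 100.0 * (baseline - system) / baseline


base_clean, base_reverb = WER["Baseline (Rasta-PLP8)"]
reductions = {
    name: (relative_reduction(base_clean, clean),
           relative_reduction(base_reverb, reverb))
    for name, (clean, reverb) in WER.items()
}

for name, (clean_red, reverb_red) in reductions.items():
    print(f"{name:28s} clean {clean_red:6.1f}%   reverb {reverb_red:6.1f}%")
```

On these figures, utterance-level combination cuts WER by about 19% (clean) and 29% (reverb) relative to the baseline, and syllable-level combination by about 21% and 33%. The modulation-spectrum syllable system on its own is worse than the baseline in both conditions.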