Biologically Inspired Noise-Robust Speech Recognition for Both Man and Machine
Mark D. Skowronski
Computational NeuroEngineering Lab, University of Florida
March 26, 2004

Speech Recognition Motivation
Speech is the #1 real-time communication medium among humans. Advantages of a voice interface to machines:
• Hands-free operation
• Speed
• Ease of use

Man vs. Machine
Man is a high-performance existence proof for speech processing in noisy environments. Can we emulate man's performance by leveraging expert information into our systems?
Wall Street Journal / Broadcast News readings, 5000 words: untrained human listeners vs. the Cambridge HTK LVCSR system.

Biologically Inspired Algorithms
Expert information is added in three applications:
• Speech enhancement for human listeners
• Feature extraction for automatic speech recognition
• Classification for automatic speech recognition

Speech Enhancement
Motivations:
• Noisy cell phone conversations
• Public address systems
• Aircraft cockpit
What can we do to increase intelligibility when turning up the volume is not an option? Biology: the Lombard effect.
This work funded by the iDEN Technology Group of Motorola.

The Lombard Effect
Psychophysical changes in vocal characteristics, produced by a speaker in the presence of background acoustic noise:
• Vocal effort (amplitude) increases
• Duration increases
• Pitch increases
• Formant frequencies increase
• Energy center of gravity increases
• Consonant-to-noise ratio increases
Result: intelligibility increases.

Psychoacoustic Experiments
Miller and Nicely (1955): adding white Gaussian noise to speech affects place of articulation and frication most, and voicing and nasality less.
Furui (1986): truncated vowels in consonant-vowel pairs decreased dramatically in intelligibility beyond a certain point of truncation. These truncation points correspond to spectrally dynamic regions.
Bottom line: speech contains regions of relatively high phonetic information, and emphasizing these regions increases intelligibility.
Solution: Energy Redistribution
We redistribute energy from regions of low information content to regions of high information content while conserving overall energy. From Miller and Nicely: energy redistribution for Voiced/Unvoiced (ERVU) regions.
[Figure: SFM of “clarification”]
Voicing is determined by the Spectral Flatness Measure (SFM):

    SFM_j = [ Π_{k=1..N} X_j(k) ]^{1/N} / [ (1/N) Σ_{k=1..N} X_j(k) ]

where X_j(k) is the magnitude of the short-term Fourier transform of the jth speech window of length N.
M. D. Skowronski, J. G. Harris, and T. Reinke, J. Acoust. Soc. Am., 2002

Listening Tests
Confusable set test, from Junqua:
I: f, s, x, yes
II: a, h, k, 8
III: b, c, d, e, g, p, t, v, z, 3
IV: m, n
• 500 trials, forced decision
• 3 algorithms (control, ERVU, HPF)
• 0 dB and -10 dB SNR, AWGN
• Unlimited playback over headphones
• 25 participants, 30-45 minutes
J. C. Junqua, J. Acoust. Soc. Am., 1993

Listening Test Results
-10 dB SNR, white noise: errors decreased 20% compared to control.
[Figure: per-set results for “S”, “A”, “E”, “M”]

Energy Redistribution Summary
• Developed a real-time algorithm for cell phone applications using biological inspiration,
• Increased intelligibility while maintaining naturalness and conserving energy,
• Effective because everyday speech is not clearly enunciated,
• ERVU is a novel approach to speech enhancement that works on either clean speech or noise-reduced speech.
M. D. Skowronski and J. G. Harris, J. Acoust. Soc. Am., 2004b (in preparation)

Feature Extraction
ASR pipeline: Input → Feature Extraction → Classification.
Information streams in speech: phonetic, gender, age, emotion, pitch, accent, physical state, additive/channel noise.
Goal: emphasize phonetic information over the other information streams.
[Figure: HFCC filter bank]

Existing Algorithms
Feature algorithms:
• Acoustic: formant frequencies, bandwidths
• Model based: linear prediction
• Filter bank based: mel frequency cepstral coefficients (MFCC)
Each provides dimensionality reduction on quasi-stationary windows.
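A minimal numerical sketch of the Spectral Flatness Measure used earlier for voicing detection: the ratio of the geometric to the arithmetic mean of the short-term magnitude spectrum. The function and variable names here are mine, not from the slides.

```python
import numpy as np

def spectral_flatness(frame):
    """SFM_j: geometric mean over arithmetic mean of the magnitude
    spectrum of one analysis window. Near 1 for noise-like (unvoiced)
    frames, near 0 for tonal (voiced) frames."""
    X = np.abs(np.fft.rfft(frame))
    X = X[X > 0]                      # guard the log against zero bins
    geometric = np.exp(np.mean(np.log(X)))
    arithmetic = np.mean(X)
    return geometric / arithmetic

# White noise is spectrally flat; a pure tone is not.
rng = np.random.default_rng(0)
noise = rng.standard_normal(512)
tone = np.sin(2 * np.pi * 0.1 * np.arange(512))
print(spectral_flatness(noise) > spectral_flatness(tone))  # → True
```

By the AM-GM inequality the measure always lies in (0, 1], which is what makes it a convenient voicing score.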
Features
[Figure: feature trajectories of “seven” vs. time]

MFCC Filter Bank
• Design parameters: filter bank frequency range, number of filters
• Center frequencies equally spaced in mel frequency
• Triangle endpoints set by center frequencies of adjacent filters
Although filter spacing is determined by the perceptual mel frequency scale, bandwidth is set more for convenience than by biological arguments.

HFCC Filter Bank
HFCC: human factor cepstral coefficients.
• Decouples filter bandwidth from filter spacing,
• Sets filter width according to the critical bandwidth of the human auditory system,
• Uses the Moore and Glasberg approximation of critical bandwidth, defined in Equivalent Rectangular Bandwidth (ERB):

    ERB = 6.23 f_c^2 + 93.39 f_c + 28.52 (Hz)

where f_c is the critical band center frequency in kHz.
M. D. Skowronski and J. G. Harris, ICASSP, 2002

HFCC with E-Factor
A linear ERB scale factor (E-factor) controls filter bandwidth.
[Figure: HFCC filter banks for E-factor = 1 and E-factor = 3]
• Controls the tradeoff between local SNR and spectral resolution,
• Exemplifies the benefits of decoupling filter bandwidth from filter spacing.
M. D. Skowronski and J. G. Harris, J. Acoust. Soc. Am., 2004a (submitted)

ASR Experiments
• Isolated English digits “zero” through “nine” from the TI-46 corpus, 8 male speakers,
• HMM word models, 8 states per model, diagonal covariance matrices,
• Three MFCC versions (different filter banks),
• Linear ERB scale factor (E-factor),
• HFCC with E-factor (HFCC-E).
Total: 37.9 million frames of speech (>100 hours).

ASR Results
White noise (global SNR): HFCC-E vs. D&M, linear ERB scale factor (E-factor).
M. D. Skowronski and J. G. Harris, ISCAS, 2003

HFCC Summary
• Adds biologically inspired bandwidth to the filter bank of a popular speech feature extractor,
• Provides superior noise-robust performance over MFCC and its variants,
• Allows for further filter bank design modifications, demonstrated by HFCC with E-factor,
• HFCC has the same computational cost as MFCC (only the filter bank coefficients are adjusted), so it is easy to implement.
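The Moore and Glasberg approximation and the E-factor scaling above reduce to a couple of lines; the function names are illustrative, since the slides specify only the formula and the linear scaling.

```python
def erb_moore_glasberg(fc_khz):
    """Equivalent Rectangular Bandwidth (Hz) for a critical band
    centered at fc_khz (in kHz), per the Moore-Glasberg approximation."""
    return 6.23 * fc_khz**2 + 93.39 * fc_khz + 28.52

def hfcc_e_bandwidth(fc_khz, e_factor=1.0):
    """HFCC-E filter bandwidth: the ERB scaled linearly by the E-factor,
    independently of the mel-spaced center frequencies."""
    return e_factor * erb_moore_glasberg(fc_khz)

print(erb_moore_glasberg(1.0))            # ≈ 128.14 Hz at 1 kHz
print(hfcc_e_bandwidth(1.0, e_factor=3))  # ≈ 384.42 Hz
```

Because bandwidth is now a separate function of center frequency, widening every filter (E-factor > 1) never disturbs the mel spacing, which is the decoupling the slides emphasize.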
Classification
Outline:
• HMM limitations & variations
• Freeman model introduction
• Model hierarchy
• Associative memory
• ASR experiments
[Figure: Freeman's reduced KII network]
This work funded by the Office of Naval Research, grant N00014-1-1-0405.

HMM Limitations & Variations
Limitations:
• The HMM is piecewise stationary; speech is nonstationary,
• Assumes frames are i.i.d.; speech is coarticulated,
• State PDFs are data-driven; curse of dimensionality.
Variations (ranging from HMM-like to nonlinear dynamic):
• Deng (1992): trended HMM
• Rabiner (1986): autoregressive HMM
• Morgan & Bourlard (1995): HMM/MLP hybrid
• Robinson (1994): context-dependent RNN
• Herrmann (1993): transient attractor network
• Liaw & Berger (1996): dynamic synapse RNN
• Freeman (1997): non-convergent dynamic biological model

Freeman Model
A hierarchical nonlinear dynamic model of cortical signal processing from the rabbit olfactory neocortex.
K0 cell: H(s), a 2nd-order lowpass filter.
Reduced KII (RKII) cell (a stable oscillator):

    (1/(ab)) (d²m/dt² + (a+b) dm/dt + ab·m) = -K_gm Q(g) + I
    (1/(ab)) (d²g/dt² + (a+b) dg/dt + ab·g) =  K_mg Q(m)

RKII Network
A high-dimensional, scalable network of stable oscillators. Fully connected M-cell and G-cell weight matrices (zero diagonal). Capable of several dynamic behaviors:
• Stable attractors (limit cycle, fixed point)
• Chaos
• Spatio-temporal patterns
• Synchronization
• Generalization

Associative Memory
Two regimes of operation of the oscillator network as an associative memory of binary patterns:
• Energy
• Synchronization Through Stimulation (STS)
Network weights for each regime are set by a variation of the outer product rule and by hand.
M. D. Skowronski and J. G. Harris, Phys. Rev. E, 2004 (in preparation)

Associative Memory
[Figure: input/output pattern pairs for full, partial, noisy, and spurious inputs]

ASR with RKII Network: Two-Class Case
• \IY\ from “she”
• \AA\ from “dark”
• 10 HFCC-E coefficients converted to binary
• Energy-based RKII associative memory
• No overlap between learned centroids

Classifier           \IY\→\IY\  \IY\→\AA\  \AA\→\IY\  \AA\→\AA\  % Correct
Bayes, continuous       2705        0          8        4340       99.9
Bayes, binary           2701        4        110        4238       98.4
Hamming distance        2658       47        394        3954       93.7
RKII, exact             2593        6        202        3564       87.3
RKII, Hamming           2666       39        479        3869       92.7

(The RKII exact row sums to fewer decisions than the others because exact matching can return no stored pattern; % correct is computed over all trials, so unmatched trials count as errors.)

ASR with RKII Network: Three-Class Case
• \IY\ from “she”
• \AA\ from “dark”
• \AE\ from “ask”
• 18 HFCC-E coefficients converted to binary
• Energy-based RKII associative memory
• Variable overlap between learned centroids
Overlap is controlled by the binary feature conversion; more overlap produces more spurious outputs.

Freeman Model Summary
Contributions:
• Documented impulse-invariance discretization,
• Developed software tools, enabling large-scale experiments,
• Demonstrated stable attractors in the Freeman model,
• Explained attractor instability by transient chaos,
• Proposed two regimes of associative memory,
• Invented a novel synchronization mechanism (STS),
• Devised a variation of the outer product rule as an oscillator network learning rule,
• Proved practical probabilities concerning overlap in the three-class case,
• Applied a novel static pattern classifier to ASR.

Conclusions
Developed a novel speech enhancement algorithm:
- The Lombard effect indicates what to modify,
- Psychoacoustic experiments indicate where to modify,
- ERVU reduces human recognition error 20-40% in noisy environments.
Extended an existing speech feature extraction algorithm:
- Critical bandwidth is used to decouple filter bandwidth and spacing,
- HFCC-E demonstrates a research tangent for independent filter bandwidth,
- HFCC-E improves ASR by 7 dB SNR.
Advanced knowledge of nonlinear dynamic (NLD) models for information processing:
- Applied the model to ASR of static speech features,
- Achieved near-optimum performance of the RKII network associative memory using first-order statistics.
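As a closing sketch, the RKII cell equations from the classification section can be integrated with a simple Euler scheme. Everything numeric here is an illustrative assumption, not taken from the slides: the rate constants a and b, the coupling gains, the input level, and the use of np.tanh as a stand-in for Freeman's asymmetric sigmoid Q.

```python
import numpy as np

def simulate_rkii(steps=5000, dt=1e-3, a=220.0, b=720.0,
                  K_mg=1.0, K_gm=1.0, I=1.0, Q=np.tanh):
    """Euler integration of one RKII pair: an excitatory M cell driven
    by input I and inhibited by the G cell, which the M cell excites.
    Each cell is a 2nd-order section:
        (1/ab)(x'' + (a+b) x' + ab x) = forcing
    =>  x'' = ab * (forcing - x) - (a+b) * x'
    """
    m = dm = g = dg = 0.0
    trace = np.empty(steps)
    for t in range(steps):
        ddm = a * b * (-K_gm * Q(g) + I - m) - (a + b) * dm
        ddg = a * b * ( K_mg * Q(m)     - g) - (a + b) * dg
        dm += dt * ddm
        m  += dt * dm
        dg += dt * ddg
        g  += dt * dg
        trace[t] = m
    return trace

trace = simulate_rkii()  # bounded M-cell output of the coupled pair
```

Whether the pair settles to a fixed point or sustains an oscillation depends on the coupling gains and the sigmoid; with the bounded Q used here the output stays finite, which is what the test below checks.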