Noise robustness in Automatic Speech Recognition

Michael I Mandel
mandelm@cse.ohio-state.edu
Ohio State University, Computer Science and Engineering
Jelinek Speech and Language Technologies Tutorial
June 29, 2015

Michael I Mandel (OSU CSE) Robust ASR June 29, 2015

Outline
1 Overview
2 Types of noise and their effects
3 Robust ASR basics
4 Approaches to noise robust ASR
5 Results in recently organized challenges
6 Summary

1 Overview

Noisy speech
[Example slide; media not recovered]

Noisy speech contains lots of (extraneous) information
Information in the target speech
  Lexical information: what words were said
  Prosodic information: how they were said, emphasis
  Speaker information: age, gender, mood
Channel information
  Location: angular position and distance
  Reverberation: room size, materials
Noise information
  Stationary: HVAC, machinery, 60 Hz hum, road noise
  Non-stationary: other talkers, music, environment

ASR works pretty well in matched conditions
When train and test match in all respects, error rates are low
  match in talker, vocabulary, style, mic, reverb, noise type
But can't see all possibilities (esp. all combinations) in training
So how to generalize best from training to test?

How to generalize best from training to test?
Make system components invariant to irrelevant differences
Adapt training data/models to be more like test observations
Adapt test observations to be more like training data/models
Combine compact, better-trained models of separate "components"
  e.g., speech, noise, channel
All of the above...

A lot of progress in reducing word error rates
[Figure not recovered]

But errors are still 30-40% in real noisy situations
Meeting recordings¹: 40.0% with 1 mic, 34.7% with 8 mics
YouTube videos²: 40.9% trained on 2 years of audio
¹ Takuya Yoshioka, Xie Chen, and Mark J. F. Gales. Impact of single-microphone dereverberation on DNN-based meeting transcription systems. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5527–5531. IEEE, May 2014
² Hank Liao, Erik McDermott, and Andrew Senior. Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 368–373. IEEE, December 2013

Recent surveys on noise-robust ASR
Books
  Tuomas Virtanen, Bhiksha Raj, and Rita Singh, editors. Techniques for Noise Robustness in Automatic Speech Recognition. John Wiley & Sons, Ltd, 2012
  D. Kolossa and R. Haeb-Umbach. Robust Speech Recognition of Uncertain or Missing Data: Theory and Applications. Springer Berlin Heidelberg, 2011
Review articles
  J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach. An overview of noise-robust automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(4):745–777, April 2014
  K. Kumatani, J. McDonough, and B. Raj. Microphone Array Processing for Distant Speech Recognition: From Close-Talking Microphones to Far-Field Sensors. IEEE Signal Processing Magazine, 29(6):127–140, November 2012
  T. Yoshioka, A. Sehr, M. Delcroix, K. Kinoshita, R. Maas, T. Nakatani, and W. Kellermann. Making machines understand us in reverberant rooms: Robustness against reverberation for automatic speech recognition. IEEE Signal Processing Magazine, 29(6):114–126, November 2012

2 Types of noise and their effects
  Types of noise
  Effects on spectrograms

Types of noise

Noise types and effects summary
Easy    Stationary noise: "constant" background noise
Medium  Reverberation: correlates observations across time
Hard    Non-stationary noise: most common

Reverberation & non-stationary noise
[Example slides; media not recovered]

Reverberation
Clean
Stationary noise: fighter jet cockpit noise
Stationary noise: car noise
[Example slides; media not recovered]

A noisy observation can be modeled as
\[ x[t] = s[t] \ast h[t] + n[t] \]
  x[t]: noisy signal
  s[t]: clean target signal
  h[t]: channel impulse response
  n[t]: additive noise
More on this later...

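The observation model above can be simulated directly. The following is a minimal sketch, assuming NumPy, random stand-ins for the speech and noise, and a toy exponentially decaying impulse response; the function name `simulate_noisy_observation` and the SNR-based noise scaling are illustrative choices, not part of the tutorial.

```python
import numpy as np

def simulate_noisy_observation(s, h, n, snr_db):
    """Return x[t] = (s * h)[t] + g * n[t], with gain g chosen to reach snr_db."""
    reverberant = np.convolve(s, h)[: len(s)]      # s[t] * h[t], truncated to len(s)
    noise = n[: len(reverberant)]
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    # Solve 10*log10(speech_power / (g^2 * noise_power)) == snr_db for g
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return reverberant + scaled_noise, reverberant, scaled_noise

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)                     # stand-in for 1 s of speech at 16 kHz
h = np.exp(-np.arange(800) / 100.0) * rng.standard_normal(800)  # toy decaying impulse response
n = rng.standard_normal(16000)                     # stand-in for stationary noise
x, target, noise = simulate_noisy_observation(s, h, n, snr_db=5.0)
measured_snr_db = 10 * np.log10(np.mean(target ** 2) / np.mean(noise ** 2))
```

Here the SNR is measured against the reverberant speech; measuring it against the dry speech instead is an equally reasonable convention.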
Reverberation: the channel
Reflection causes additional wavefronts
  + scattering, absorption
  many paths → many echoes
Reverberant effect
  causal 'smearing' of signal energy
[Diagram labels: reflection; diffraction & shadowing]
[Figure: spectrograms of dry speech 'airvib16' and of the same speech + reverb from hlwy16; axes: time / sec, freq / Hz]

Reverberation impulse response
Exponential decay of reflections: $h_{room}(t) \sim e^{-t/T}$
[Figure: spectrogram of hlwy16, 128pt window; axes: time / s, freq / Hz]
Frequency-dependent
  greater absorption at high frequencies → faster decay
Size-dependent
  larger rooms → longer delays → slower decay
Important parameters for ASR
  RT60: time for reverb to decay 60 dB (lower = easier)
  DRR: direct-to-reverberant energy ratio (higher = easier)

Effects on spectrograms

Reverberation and non-stationary noise
[Figures: spectrograms at SNR +∞, +20, +15, +10, +5, 0, −5, −10, −∞ dB]

Noise types and effects summary
Easy    Stationary noise: "constant" background noise
Medium  Reverberation: correlates observations across time
Hard    Non-stationary noise: most common

3 Robust ASR basics

Standard ASR pipeline
x (audio) → Feature extraction → f (features) → Acoustic model → p(k | x) (HMM state) → Decoder → p(w | x) (words)

Fundamental equation of ASR
\[
\begin{aligned}
\hat{w} &= \arg\max_{w} \, p(w \mid f) && \text{best word sequence} \\
        &= \arg\max_{w} \, p(f \mid w)\, p(w) && \text{by Bayes' rule} \\
        &= \arg\max_{w} \sum_{k} p(f \mid k)\, p(k \mid w)\, p(w) && \text{break words into states}
\end{aligned}
\]
  Acoustic model: p(f | k)
  Decoder (HMM): p(k | w) p(w)

Standard features: Mel frequency cepstral coefficients (MFCCs)
Waveform → Spectrogram → |·|² → Mel binning → log(·) → DCT → MFCC → ∆, ∆∆ in time → Mean var norm → To acoustic model

Computation of ASR features (MFCCs)
[Figures: power spectrum; mel spectrum; log mel spectrum; static MFCCs; MFCCs, deltas]

Effect of noise on MFCCs
[Figures: MFCCs at SNR +∞, +20, +15 dB]

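The MFCC chain above (power spectrum, mel binning, log, DCT, deltas, mean and variance normalization) can be sketched in a few lines of NumPy. This is an illustrative sketch only: the frame length, hop, number of mel bands, the mel formula, the unnormalized DCT-II, and the use of `np.gradient` as a crude delta estimate are all assumptions, and the function names are hypothetical rather than taken from any toolkit.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters with centres equally spaced on the mel scale."""
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mfcc_features(x, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(n_fft)                    # windowed frames
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # |.|^2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T      # mel binning
    log_mel = np.log(mel + 1e-10)                          # log(.)
    k = np.arange(n_ceps)[:, None]
    m = np.arange(n_mels)[None, :]
    dct_basis = np.cos(np.pi * k * (m + 0.5) / n_mels)     # unnormalized DCT-II
    static = log_mel @ dct_basis.T                         # static MFCCs
    delta = np.gradient(static, axis=0)                    # crude delta in time
    delta2 = np.gradient(delta, axis=0)                    # crude delta-delta
    feats = np.hstack([static, delta, delta2])
    # Per-utterance mean and variance normalization
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-10)

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)      # placeholder for 1 s of audio at 16 kHz
feats = mfcc_features(x)            # frames x (13 static + 13 delta + 13 delta-delta)
```

Production front ends typically compute deltas with a regression window over several frames; the gradient here only keeps the sketch short.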
[Figures: effect of noise on MFCCs at SNR +10, +5, 0, −5, −10, −∞ dB]

4 Approaches to noise robust ASR
  Make system components invariant to irrelevant differences
  Adapt training data/models to be more like test observations
  Adapt test observations to be more like training data/models
  Combine compact models of separate components
  Summary

Standard ASR pipeline
x (audio) → Feature extraction → f (features) → Acoustic model → p(k | x) (HMM state) → Decoder → p(w | x) (words)

Robust ASR pipeline
x (noisy audio) → Enhancement → p(s | x) (clean speech) → Feature extraction → p(f | x) (clean features) → Acoustic model → p(k | x) (HMM state) → Decoder → p(w | x) (words)

Make system components invariant to irrelevant differences

Make features invariant to irrelevant differences
[Diagram: robust ASR pipeline, repeated]

Cepstral mean normalization removes time-invariant filters
Convolution in time domain is addition in cepstral domain
  subtracting cepstral mean removes time-invariant filters
\[
\begin{aligned}
x[n] &= s[n] \ast h[n] \\
|X_{\omega t}|^2 &= |S_{\omega t}|^2 \, |H_{\omega}|^2 \\
\log |X_{\omega t}|^2 &= \log |S_{\omega t}|^2 + \log |H_{\omega}|^2 \\
\log |X_{\omega t}|^2 - \frac{1}{|T|} \sum_{t \in T} \log |X_{\omega t}|^2 &= \log |S_{\omega t}|^2 - \frac{1}{|T|} \sum_{t \in T} \log |S_{\omega t}|^2
\end{aligned}
\]
The $\log |H_{\omega}|^2$ term is constant over $t$, so it equals its own mean over $T$ and cancels.
Can remove different filters for different $T$s
  channel, talker, gender, ...

RASTA-PLP and PNCC are robust to certain distortions³
³ Richard M. Stern and Nelson Morgan. Features based on auditory physiology and perception. In Virtanen et al. (2012), chapter 8, pages 193–227

PLP-cepstra without RASTA
[Figures: PLP-cepstra without RASTA at SNR +∞, +20, +15, +10, +5 dB]

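To make the cepstral mean normalization argument above concrete, the sketch below checks numerically that subtracting the per-frequency mean over time removes a time-invariant filter in the log power spectral domain. It assumes NumPy and random stand-in spectra; the helper name `mean_normalize` is hypothetical. Because the DCT that maps log spectra to cepstra is linear, the same cancellation carries over to cepstral coefficients.

```python
import numpy as np

def mean_normalize(log_spec):
    """Subtract the mean over time (rows are frames) from every frequency bin."""
    return log_spec - log_spec.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
log_S = rng.normal(size=(200, 40))     # stand-in for log|S|^2: 200 frames x 40 bins
log_H = rng.normal(size=(1, 40))       # time-invariant channel: one value per bin
log_X = log_S + log_H                  # log|X|^2 = log|S|^2 + log|H|^2

# After normalization the channel term is gone, up to floating-point error
max_diff = np.max(np.abs(mean_normalize(log_X) - mean_normalize(log_S)))
```

The cancellation relies on the filter being short relative to the analysis window, so that convolution becomes multiplication per frame; long reverberation tails violate that assumption.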
[Figures: PLP-cepstra without RASTA at SNR 0, −5, −10, −∞ dB]

PLP-cepstra with RASTA
[Figures: PLP-cepstra with RASTA at SNR +∞, +20, +15, +10, +5, 0, −5, −10, −∞ dB]

Deep neural network acoustic models⁴,⁵
[Figure not recovered]
⁴ Michael L. Seltzer, Dong Yu, and Yongqiang Wang. An investigation of deep neural networks for noise robust speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 7398–7402. IEEE, 2013
⁵ Arun Narayanan and DeLiang Wang. Joint noise adaptive training for robust automatic speech recognition. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2504–2508. IEEE, May 2014

Adapt training data/models to be more like test observations

Robust acoustic modeling
[Diagram: robust ASR pipeline, repeated]

Multi-condition and noise adaptive training⁶
To match train and test sets, train on noisy speech
Perform same processing of training data as will be performed on test
Modeling "canonical" speech, noisy speech compensated for irrelevant variations
⁶ Michael L. Seltzer. Acoustic model training for robust speech recognition. In Virtanen et al. (2012), chapter 13, pages 347–368

Maximum likelihood linear regression
Find linear transformation of model parameters that maximizes likelihood
Can be supervised or unsupervised
Complexity can be scaled to fit amount of adaptation data

Constrained maximum likelihood linear regression
Normal GMM training finds parameters $\Theta = \{\pi_k, \mu_k, \Sigma_k\}$ using EM
\[
\hat{\Theta} = \arg\max_{\Theta} \sum_{i \in I_{tr}} L(\Theta; x_i)
             = \arg\max_{\Theta} \sum_{i \in I_{tr}} \log \sum_{k} \pi_k \, \mathcal{N}(x_i; \mu_k, \Sigma_k)
\]
Imagine that we want to adapt a trained model to a new observation

Constrained maximum likelihood linear regression
Why not learn mappings of the Gaussians' parameters
\[
\tilde{\mu}_k = A_r \mu_k + b_r, \qquad \tilde{\Sigma}_k = A_r \Sigma_k A_r^T
\]
Find $\{\hat{A}_r, \hat{b}_r\}$ that maximize the likelihood on the adaptation set
\[
\{\hat{A}_r, \hat{b}_r\} = \arg\max_{A_r, b_r} \sum_{i \in I_{te}} L(\tilde{\Theta}; x_i)
 = \arg\max_{A_r, b_r} \sum_{i \in I_{te}} \log \sum_{k} \pi_k \, \mathcal{N}\!\left(x_i; A_{r(k)} \mu_k + b_{r(k)}, A_{r(k)} \Sigma_k A_{r(k)}^T\right)
\]

Model complexity can be scaled to amount of training data
Need enough data to accurately estimate $\{A_r, b_r\}$
$A_r$ can be a full matrix, diagonal, or the identity
Gaussians that are "similar" can share $A$s and $b$s
  use more groups as adaptation data grows

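The simplest configuration listed above, a single regression class with $A$ fixed to the identity, reduces to estimating one shared bias $b$, and can be sketched with a few EM iterations. Everything below is an illustrative assumption rather than the tutorial's implementation: a two-component diagonal-covariance GMM, synthetic adaptation data shifted by a known offset, and hypothetical function names. With $A = I$ the covariances are unchanged, so only the means move.

```python
import numpy as np

def log_gauss_diag(x, mu, var):
    """Log N(x; mu_k, diag(var_k)) for x of shape (N, D) against K components -> (N, K)."""
    diff = x[:, None, :] - mu[None, :, :]
    return -0.5 * np.sum(diff ** 2 / var + np.log(2 * np.pi * var), axis=2)

def avg_loglik(x, pi, mu, var):
    """Average per-observation GMM log-likelihood, using a stable log-sum-exp."""
    lp = np.log(pi) + log_gauss_diag(x, mu, var)
    m = lp.max(axis=1, keepdims=True)
    return float(np.mean(m[:, 0] + np.log(np.exp(lp - m).sum(axis=1))))

def estimate_bias(x, pi, mu, var, n_iter=20):
    """EM for one shared bias b with A = I, so adapted means are mu_k + b."""
    b = np.zeros(x.shape[1])
    prec = 1.0 / var                                         # (K, D) diagonal precisions
    for _ in range(n_iter):
        lp = np.log(pi) + log_gauss_diag(x, mu + b, var)
        lp -= lp.max(axis=1, keepdims=True)
        gamma = np.exp(lp)
        gamma /= gamma.sum(axis=1, keepdims=True)            # E-step: responsibilities (N, K)
        resid = x[:, None, :] - mu[None, :, :]               # x_i - mu_k, shape (N, K, D)
        num = np.einsum('nk,kd,nkd->d', gamma, prec, resid)
        den = np.einsum('nk,kd->d', gamma, prec)
        b = num / den                                        # M-step: precision-weighted mean residual
    return b

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.5])
mu = np.array([[-2.0, 0.0], [2.0, 1.0]])
var = np.array([[1.0, 0.5], [0.5, 1.0]])
true_shift = np.array([1.0, -0.5])                           # mismatch between train and test
comp = rng.choice(2, size=500, p=pi)
x_adapt = mu[comp] + np.sqrt(var[comp]) * rng.standard_normal((500, 2)) + true_shift

b_hat = estimate_bias(x_adapt, pi, mu, var)
ll_before = avg_loglik(x_adapt, pi, mu, var)
ll_after = avg_loglik(x_adapt, pi, mu + b_hat, var)
```

Estimating a full or diagonal $A_r$ follows the same E-step but needs a more involved M-step, which is one reason the number of transforms is tied to the amount of adaptation data.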
Adapt test observations to be more like training data/models

Remove noise from audio, recognize
[Diagram: robust ASR pipeline, repeated]

Voice activity detection, end-pointing⁷
Only recognize sequences of frames that contain speech
⁷ Rainer Martin and Dorothea Kolossa. Voice activity detection, noise estimation, and adaptive filters for acoustic signal enhancement. In Virtanen et al. (2012), chapter 4, pages 51–85

Spectral subtraction of stationary noise⁸
Estimate noise distribution when speech is inactive
Only pass mixture signal that is louder than the noise estimate
⁸ Rainer Martin and Dorothea Kolossa. Voice activity detection, noise estimation, and adaptive filters for acoustic signal enhancement. In Virtanen et al. (2012), chapter 4, pages 51–85

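The two steps of spectral subtraction above can be sketched on a power spectrogram as follows. This is a minimal illustration under stated assumptions: the noise-only frames are known in advance (in practice a voice activity detector would supply them), speech and noise powers are treated as additive, and the function name and the spectral floor value are hypothetical.

```python
import numpy as np

def spectral_subtraction(power_spec, noise_frames, floor=0.01):
    """Subtract a per-bin noise power estimate and keep a small spectral floor."""
    # Noise estimate: average power in frames assumed to contain no speech
    noise_est = power_spec[noise_frames].mean(axis=0, keepdims=True)
    cleaned = power_spec - noise_est
    # Flooring avoids negative powers where the noise estimate exceeds the mixture
    return np.maximum(cleaned, floor * power_spec)

rng = np.random.default_rng(0)
n_frames, n_bins = 200, 64
clean = np.zeros((n_frames, n_bins))
clean[50:, 10:20] = 20.0                                      # toy "speech" energy after frame 50
noise = rng.exponential(scale=1.0, size=(n_frames, n_bins))   # stationary noise power
noisy = clean + noise                                         # powers assumed additive

enhanced = spectral_subtraction(noisy, noise_frames=slice(0, 50))
err_before = np.mean(np.abs(noisy - clean))
err_after = np.mean(np.abs(enhanced - clean))
```

Because the noise estimate is an average, individual bins are over- or under-subtracted from frame to frame, which is the source of the residual artifacts commonly associated with this method.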
(2012), chapter 4, pages 51–85 Michael I Mandel (OSU CSE) Robust ASR June 29, 2015 49 / 74 Approaches to noise robust ASR Adapt test observations to be more like training data/models Time-frequency masking9 Only recognize time-frequency points that contain speech 9 Arun Narayanan and Deliang Wang. Computational auditory scene analysis and automatic speech recognition. In Virtanen et al. (2012), chapter 16, pages 433–462 Michael I Mandel (OSU CSE) Robust ASR June 29, 2015 50 / 74 Approaches to noise robust ASR Adapt test observations to be more like training data/models Microphone array processing10 Only recognize spatial directions that contain speech 10 Andrew Greensted. The lab book pages: delay sum beamforming, 2012. Online: http://www.labbookpages.co.uk/audio/beamforming/delaySum.html Michael I Mandel (OSU CSE) Robust ASR June 29, 2015 51 / 74 Approaches to noise robust ASR Adapt test observations to be more like training data/models Microphone array processing10 Only recognize spatial directions that contain speech 10 Andrew Greensted. The lab book pages: delay sum beamforming, 2012. Online: http://www.labbookpages.co.uk/audio/beamforming/delaySum.html Michael I Mandel (OSU CSE) Robust ASR June 29, 2015 51 / 74 Approaches to noise robust ASR Adapt test observations to be more like training data/models Adapt test features to fit the trained model x noisy audio Enhancement p(s | x) clean speech Feature extraction p(f | x) clean features Acoustic model p(k | x) HMM state Decoder Michael I Mandel (OSU CSE) Robust ASR p(w | x) words June 29, 2015 52 / 74 Approaches to noise robust ASR Adapt test observations to be more like training data/models Feature-space maximum likelihood linear regression (FMLLR) Consider CMLLR with a single A and b X X Â, b̂ = argmax log πk N (xi ; Aµk + b, AΣk AT ) A,b i∈Ite k This is equivalent to the likelihood of transformed features with fixed parameters si = Axi + b Â, b̂ = argmax A,b X L(Θ; si ) i∈Ite ! 
= argmax A,b Michael I Mandel (OSU CSE) X log |A| + log i∈Ite X πk N (Axi + b; µk , Σk ) k Robust ASR June 29, 2015 53 / 74 Approaches to noise robust ASR Combine compact models of separate components Outline 1 Overview 2 Types of noise and their effects 3 Robust ASR basics 4 Approaches to noise robust ASR Make system components invariant to irrelevant differences Adapt training data/models to be more like test observations Adapt test observations to be more like training data/models Combine compact models of separate components Summary 5 Results in recently organized challenges Michael I Mandel (OSU CSE) Robust ASR June 29, 2015 54 / 74 Approaches to noise robust ASR Combine compact models of separate components Explicitly modeling mixing11 Allows the use of more compact, separate speech and noise models require less data to train But must be derived for each type of feature 11 Jasha Droppo. Feature compensation. In Virtanen et al. (2012), chapter 9, pages 229–250 Michael I Mandel (OSU CSE) Robust ASR June 29, 2015 55 / 74 Approaches to noise robust ASR Combine compact models of separate components Factorial HMMs12,13 Model two sources individually with independent HMMs Combine states into joint states, k 0 = {k1 , k2 } Then model p(f | k1 , k2 ) But leads to K 2 joint states (exponential in number of chains) 12 AP Varga and Roger K Moore. Simultaneous recognition of concurrent speech signals using hidden markov model decomposition. In Second European Conference on Speech Communication and Technology, 1991 13 Zoubin Ghahramani and Michael I Jordan. Factorial hidden markov models. 
Machine learning, 29(2-3):245–273, 1997

No one feature is good for both separation and recognition
  Separation is easy in the time domain and complex spectrum
    simultaneous sources mix linearly
    but retains too many irrelevant variations for robust ASR
  Recognition is easy in the cepstral domain
    removes many sources of variation irrelevant to lexical content
    simultaneous sources mix very non-linearly
  Need some way to link the two domains

Humans perceive loudness logarithmically
  Let s be one complex TF unit of a speech signal, n noise, x mixture
  s by itself is perceived as $\tilde{s} = \log |s|^2$, so $|s|^2 = e^{\tilde{s}}$, then
  $$x = s + n$$
  $$|x|^2 = |s|^2 + |n|^2 + 2\alpha |s||n|$$
  $$e^{\tilde{x}} \approx e^{\tilde{s}} + e^{\tilde{n}}$$
  $$\tilde{x} = \log\left(e^{\tilde{s}} + e^{\tilde{n}}\right)$$
  If we model $\tilde{s}$ and $\tilde{n}$ as Gaussians, what is $\tilde{x}$?

Gaussians mixed in log spectrum become non-Gaussian [14]
  $$\tilde{x} = \log\left(e^{\tilde{s}} + e^{\tilde{n}}\right) \quad \text{where } \tilde{s} \sim \mathcal{N}(25, 10^2), \; \tilde{n} \sim \mathcal{N}(\mu_n, 2^2)$$

[14] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach. An overview of noise-robust automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(4):745–777, April 2014

$p(|x|, |s|)$ is also non-Gaussian

Modeling phase interaction makes it more complicated [15]
  Mix simulated Gaussian speech and noise, compute $p(|s|, |n| \mid |x|)$
  Figure panels: Single TF unit (log magnitude); Avg of 5 TF units (log mel magnitude)

[15] John R. Hershey, Steven J.
Rennie, and Jonathan Le Roux. Factorial models for noise robust speech recognition. In Virtanen et al. (2012), chapter 12, pages 311–345

Vector Taylor series approximates combination [16]
  Model speech and noise as separate mixtures of Gaussians
  Use a Taylor series approximation to $\tilde{x}$ around a particular SNR
  Compute an updated SNR estimate from the approximation
  Repeat until convergence

[16] Hank Liao. Uncertainty decoding. In Virtanen et al. (2012), chapter 17, pages 463–486

Approaches to noise robust ASR summary
  Mismatch between training and test data degrades accuracy but is inevitable
  Noise-robust ASR attempts to reduce this mismatch by
    making system components invariant to irrelevant differences
    adapting training data/models to be more like test observations
    adapting test observations to be more like training data/models
    combining small, better-trained models of separate “components”

Results in recently organized challenges

There are lots of noisy speech datasets out there [17]

[17] https://wiki.inria.fr/rosp/Datasets

Noisy ASR challenges
  2002  AURORA-2 [18]
  2006  Monaural speech separation & recognition challenge [19]
  2011  (1st) PASCAL CHiME Speech Separation & Recognition Challenge [20]
  2013  2nd PASCAL CHiME Speech Separation & Recognition Challenge [21]
  2014  REVERB challenge [22]
  2015  3rd PASCAL CHiME Speech Separation & Recognition Challenge

[18] Hans-Gunter Hirsch and David Pearce. The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In ASR-2000, pages 181–188, 2000
[19] Martin Cooke, John R. Hershey, and Steven J. Rennie. Monaural speech separation and recognition challenge. Computer Speech & Language, 24(1):1–15, January 2010
[20] Jon Barker, Emmanuel Vincent, Ning Ma, Heidi Christensen, and Phil Green. The PASCAL CHiME speech separation and recognition challenge. Computer Speech & Language, 27(3):621–633, May 2013
[21] Emmanuel Vincent, Jon Barker, Shinji Watanabe, Jonathan Le Roux, Francesco Nesta, and Marco Matassoni. The second ’CHiME’ speech separation and recognition challenge: An overview of challenge systems and outcomes. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2013 - Proceedings, pages 162–167. IEEE, December 2013
[22] Keisuke Kinoshita, Marc Delcroix, Takuya Yoshioka, Tomohiro Nakatani, Armin Sehr, Walter Kellermann, and Roland Maas. The reverb challenge: A common evaluation framework for dereverberation and recognition of reverberant speech. In 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 1–4.
IEEE, October 2013

Noisy ASR challenges

  Dataset       Vocab  Ch  Noise       Reverb  Motion
  SSC           50     1   Talker      No      No
  CHiME         50     2   Household   RIRs    No
  CHiME2-trk1   50     2   Household   RIRs    Sim
  CHiME2-trk2   5k     2   Household   RIRs    No
  CHiME3        5k     6   Various     Live    Yes
  Reverb-sim    5k     8   Stationary  RIRs    No
  Reverb-real   5k     8   Ambient     Live    No

2nd CHiME Challenge [23]: small vocab baseline accuracies
2nd CHiME Challenge [23]: small vocab all accuracies
2nd CHiME Challenge [23]: medium vocab baseline WERs
2nd CHiME Challenge [23]: medium vocab all WERs

[23] Emmanuel Vincent, Jon Barker, Shinji Watanabe, Jonathan Le Roux, Francesco Nesta, and Marco Matassoni. The second ’CHiME’ speech separation and recognition challenge: An overview of challenge systems and outcomes. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2013 - Proceedings, pages 162–167. IEEE, December 2013

2nd CHiME Challenge successful techniques
  Both tracks
    spatial diversity-based enhancement
    noise adaptive training
    combinations of many systems / strategies
  Small vocab only
    spectral diversity-based enhancement
  Medium vocab only
    careful design of ASR back-end

REVERB Challenge [24]: baseline systems
REVERB Challenge [24]: all systems
REVERB Challenge [24]: by num mics, training data, ASR
  Source: Keisuke Kinoshita “Challenge summary”

[24] http://reverb2014.dereverberation.com/result_asr.html

Recent challenge summary
  Challenges are getting more realistic but erring on the side of solvability
  The best systems are complex combinations of diverse approaches
  ASR is still far from human-level noise robustness in general
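The medium-vocabulary results above are reported as word error rates (WERs). As a reminder of what that metric measures, here is a minimal sketch of the usual definition: the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. It assumes whitespace-tokenized transcripts and unit costs; the function name `word_error_rate` is illustrative, and this is not the scoring tool used by any of the challenges.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative usage with made-up transcripts.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, WER can exceed 1, which is one reason it is reported as an error rate rather than as an accuracy.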
Summary
  Noisy speech contains lots of (extraneous) information
  ASR works pretty well in matched conditions
  Models need help generalizing to new conditions
    make system components invariant to irrelevant differences
    adapt training data/models to be more like test observations
    adapt test observations to be more like training data/models
    combine compact, better-trained models of separate “components”
  No one representation is good for both separation and recognition
  Still work to be done to achieve human-level noise robustness

Thanks! Any questions?
Outline
  7 More challenge results
  8 Full(er) MFCC mixing derivation

More challenge results

Speech separation challenge [25]: baseline accuracy
Speech separation challenge [25]: all accuracies

[25] Martin Cooke, John R. Hershey, and Steven J. Rennie. Monaural speech separation and recognition challenge. Computer Speech & Language, 24(1):1–15, January 2010

1st CHiME Challenge [26]: baseline accuracy
1st CHiME Challenge [26]: all accuracies

[26] http://spandh.dcs.shef.ac.uk/projects/chime/PCC/results.html

2nd CHiME Challenge [27]: small vocab baseline accuracy
2nd CHiME Challenge [27]: small vocab all accuracies

[27] http://spandh.dcs.shef.ac.uk/chime_challenge/chime2013/track1_results.html

2nd CHiME Challenge [28]: medium vocab baseline WER
2nd CHiME Challenge [28]: medium vocab all WERs

[28] http://spandh.dcs.shef.ac.uk/chime_challenge/chime2013/track2_results.html

Full(er) MFCC mixing derivation

Standard features: Mel frequency cepstral coefficients (MFCCs)
  Waveform → Pre-emphasis → Framing, windowing → DFT → $|\cdot|^2$ → Mel binning → $\log(\cdot)$ → DCT → MFCCs

Mixing of energy spectra
  In the time domain, a source is reverberated and mixed with noise
  $$x[n] = s[n] * h[n] + n[n]$$
  In the short-time Fourier transform domain, this becomes
  $$X_{\omega t} = S_{\omega t} H_{\omega} + N_{\omega t}$$
  The energy of which is
  $$|X_{\omega t}|^2 = |S_{\omega t}|^2 |H_{\omega}|^2 + |N_{\omega t}|^2 + 2\alpha_{\omega t} |S_{\omega t}||H_{\omega}||N_{\omega t}|$$
  where $\alpha_{\omega t} = \cos(\angle S_{\omega t} H_{\omega} - \angle N_{\omega t})$
  If S and N are uncorrelated, then $E[\alpha_{\omega t}] = 0$

Mixing of mel-frequency energy spectra
  Mixing of linear-frequency spectra
  $$|X_{\omega t}|^2 = |S_{\omega t}|^2 |H_{\omega}|^2 + |N_{\omega t}|^2 + 2\alpha_{\omega t} |S_{\omega t}||H_{\omega}||N_{\omega t}|$$
  This relation approximately holds for mel-frequency spectra as well
  $$|X^{(m)}_{\ell t}|^2 \approx |S^{(m)}_{\ell t}|^2 |H^{(m)}_{\ell}|^2 + |N^{(m)}_{\ell t}|^2 + 2\alpha^{(m)}_{\ell t} |S^{(m)}_{\ell t}||H^{(m)}_{\ell}||N^{(m)}_{\ell t}|$$

Mixing of MFCCs
  To simplify notation, assume a single time t, and define energy vectors, stacked over all mel bands $\ell$
  $$x = |X^{(m)}_{\ell t}|^2 \qquad s = |S^{(m)}_{\ell t}|^2 |H^{(m)}_{\ell}|^2 \qquad n = |N^{(m)}_{\ell t}|^2$$
  and define MFCC vectors $f_x, f_s, f_n$ so that, e.g.,
  $$f_x = C \log x \qquad x = e^{C^\dagger f_x}$$
  where C is the DCT matrix, $C C^\dagger = I$ (the identity), log & exp operate per-element

  Then the mel-frequency mixing equation
  $$|X^{(m)}_{\ell t}|^2 \approx |S^{(m)}_{\ell t}|^2 |H^{(m)}_{\ell}|^2 + |N^{(m)}_{\ell t}|^2 + 2\alpha^{(m)}_{\ell t} |S^{(m)}_{\ell t}||H^{(m)}_{\ell}||N^{(m)}_{\ell t}|$$
  becomes (assuming $\alpha_{\omega t} \approx 0$)
  $$x \approx s + n$$
  $$e^{C^\dagger f_x} = e^{C^\dagger f_s} + e^{C^\dagger f_n}$$
  $$e^{C^\dagger f_x} = e^{C^\dagger f_s} \odot \left(1 + e^{C^\dagger (f_n - f_s)}\right)$$
  $$f_x = f_s + C \log\left(1 + e^{C^\dagger (f_n - f_s)}\right)$$
  where $\odot$ is the per-element product

Speech-noise interaction function behaves as expected
  When $s \gg n$, $C^\dagger (f_n - f_s)$ is very negative, so
  $$f_x = f_s + C \log\left(1 + e^{C^\dagger (f_n - f_s)}\right) \approx f_s + C \log\left(1 + e^{-\infty}\right) = f_s + C \log(1) = f_s$$
  When $s \ll n$, $C^\dagger (f_n - f_s)$ is very positive, so
  $$f_x = f_s + C \log\left(1 + e^{C^\dagger (f_n - f_s)}\right) \approx f_s + C \log\left(e^{C^\dagger (f_n - f_s)}\right) = f_s + C C^\dagger (f_n - f_s) = f_s + f_n - f_s = f_n$$
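The interaction function above can be checked numerically. The following is a minimal sketch, not part of the original tutorial, under these assumptions: C is a square orthonormal DCT-II matrix (so both C C† and C† C are the identity; practical MFCCs truncate the DCT, in which case the relation only holds approximately), there are 8 mel bands, the cross term α is ignored, and the mel energies are random numbers. The names `dct_matrix` and `mix_mfcc` are illustrative.

```python
import numpy as np


def dct_matrix(num_bands):
    """Square orthonormal DCT-II matrix C, so that C @ C.T is the identity."""
    k = np.arange(num_bands)[:, None]    # cepstral index (rows)
    ell = np.arange(num_bands)[None, :]  # mel-band index (columns)
    C = np.sqrt(2.0 / num_bands) * np.cos(np.pi * (ell + 0.5) * k / num_bands)
    C[0, :] /= np.sqrt(2.0)              # rescale the first row for orthonormality
    return C


def mix_mfcc(f_s, f_n, C):
    """f_x = f_s + C log(1 + exp(C^T (f_n - f_s))), computed stably."""
    # np.logaddexp(0, z) equals log(1 + exp(z)) without overflow for large z.
    return f_s + C @ np.logaddexp(0.0, C.T @ (f_n - f_s))


rng = np.random.default_rng(0)
n_bands = 8                              # assumed number of mel bands
C = dct_matrix(n_bands)

s = rng.uniform(1.0, 2.0, n_bands)       # made-up speech mel energies
n = rng.uniform(1.0, 2.0, n_bands)       # made-up noise mel energies
f_s, f_n = C @ np.log(s), C @ np.log(n)

# The MFCC-domain formula should agree with mixing the energies directly.
f_x = mix_mfcc(f_s, f_n, C)
print(np.allclose(f_x, C @ np.log(s + n)))

# Limiting cases: very quiet noise gives f_x ~ f_s, very loud noise gives f_x ~ f_n.
f_quiet = C @ np.log(1e-12 * n)
f_loud = C @ np.log(1e12 * n)
print(np.allclose(mix_mfcc(f_s, f_quiet, C), f_s))
print(np.allclose(mix_mfcc(f_s, f_loud, C), f_loud))
```

Using `np.logaddexp` rather than a literal `log(1 + exp(z))` is a design choice for numerical safety: when the noise dominates, `exp(z)` can overflow even though the final result is well behaved.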