Noise Reduction in Speech Recognition
Professor: Jian-Jiun Ding
Student: Yung Chang
2011/05/06

Outline
Mel Frequency Cepstral Coefficients (MFCC)
Mismatch in speech recognition
Feature-based: CMS, CMVN, HEQ
Feature-based: RASTA, data-driven
Speech enhancement: spectral subtraction, Wiener filtering
Conclusions and applications

Mel Frequency Cepstral Coefficients (MFCC)
39 dimensions
The most commonly used feature in speech recognition
Advantages: high accuracy and low complexity

Mel Frequency Cepstral Coefficients (MFCC)
The framework of feature extraction:
Speech signal $x(n)$ → Pre-emphasis → $x'(n)$ → Window → $x_t(n)$ → DFT → $A_t(k)$ → Mel filter-bank → $Y_t(m)$ → $\log(|\cdot|^2)$ → $Y_t'(m)$ → IDFT → $y_t(j)$
Energy: $e_t$
Derivatives: $\Delta y_t(j), \Delta e_t$ and $\Delta^2 y_t(j), \Delta^2 e_t$
MFCC: $y_t(j), e_t, \Delta y_t(j), \Delta e_t, \Delta^2 y_t(j), \Delta^2 e_t$

Pre-emphasis
Pre-emphasis of the spectrum at higher frequencies
$x[n]$ → Pre-emphasis → $x'[n]$

End-point Detection (Voice Activity Detection)
[Figure: waveform segmented into noise (silence) and speech]

Windowing
Rectangular window
Hamming window

Mel Filter Bank
After the DFT we get the spectrum
[Figure: spectrum, amplitude vs. frequency]
[Figure: mel filter bank, amplitude vs. frequency]
Triangular shape in frequency (overlapped)
Uniformly spaced below 1 kHz
Logarithmic scale above 1 kHz

Delta Coefficients
1st- and 2nd-order differences
13 dimensions → 39 dimensions (static, 1st order, 2nd order)

Mismatch in Statistical Speech Recognition
[Figure: original speech $x[n]$, plus additive noise $n_1(t)$, passes through convolutional noise $h[n]$ (acoustic reception, microphone distortion, phone/wireless channel), then picks up additive noise $n_2(t)$ to give the input signal $y[n]$. Feature extraction turns $y[n]$ into feature vectors $O = o_1 o_2 \ldots o_T$. The search, using acoustic models (trained on a speech corpus), a lexicon, and a language model (trained on a text corpus), produces the output sentence $W = w_1 w_2 \ldots w_R$.]

Possible Approaches for Acoustic Environment Mismatch
[Figure: training path: $x[n]$ → Feature Extraction → Model Training → Acoustic Models; recognition path: $y[n]$ → Feature Extraction → Search and Recognition, using the Acoustic Models]
Speech enhancement
Feature-based approaches
Model-based approaches

Feature-based Approach: Cepstral Moment Normalization (CMS, CMVN)
Cepstral Mean Subtraction (CMS): convolutional noise
Convolutional noise in the time domain becomes additive in the cepstral domain
$y[n] = x[n] * h[n]$ → $y = x + h$, with $x, y, h$ in the cepstral domain
Most convolutional noise changes only very slightly over a reasonable time interval
$x = y - h$
Assuming $E[x] = 0$, then $E[y] = h$
$x_{CMS} = y - E[y]$
[Figure: distributions $P(x)$ and $P(y)$ before and after CMS]

Feature-based Approach: Cepstral Moment Normalization (CMS, CMVN)
CMVN: the variance is normalized as well
$x_{CMVN} = x_{CMS} / [\mathrm{Var}(x_{CMS})]^{1/2}$
[Figure: distributions $P(x)$ and $P(y)$, original, after CMS, and after CMVN]

Feature-based Approach: HEQ (Histogram Equalization)
The whole distribution is equalized
$y = CDF_y^{-1}[CDF_x(x)]$
[Figure: $CDF_x$ and $CDF_y$; at $P = 0.2$, $x = 3$ maps to $y = 3.5$]

Feature-based Approach: RASTA
[Figure: spectra (amplitude vs. f) of successive frames]
Perform filtering on these signals (temporal filtering)
Modulation frequency

Feature-based Approach: RASTA (Relative Spectral Temporal Filtering)
Assume the rate of change of the noise often lies outside the typical rate of change of the vocal tract shape
A specially designed temporal filter:
$B(z) = z^{4} \, \dfrac{a_0 + a_1 z^{-1} + a_3 z^{-3} + a_4 z^{-4}}{1 - b_1 z^{-1}}$
[Figure: filter response vs. modulation frequency (Hz); the pass band emphasizes speech]

Data-driven Temporal Filtering
PCA (Principal Component Analysis)
[Figure: PCA illustration with axes x, y and principal direction e]

Data-driven Temporal Filtering
We should not guess our filter, but get it from data
[Figure: the original feature stream $y_t$ over the frame index is cut into segments $z_k(1), z_k(2), z_k(3)$ of length $L$; the derived filters $B_1(z), B_2(z), \ldots, B_n(z)$ are applied by convolution]

Speech Enhancement: Spectral Subtraction (SS)
Produce a better signal by trying to remove the noise, for listening or recognition purposes
Noise $n[n]$ changes fast and unpredictably in the time domain, but relatively slowly in the frequency domain, $N(\omega)$
[Figure: speech and noise, amplitude vs. t and amplitude vs. f]

Conclusions
We gave a general framework for extracting speech features
We introduced the mainstream robustness techniques
Many other noise reduction methods exist (see the references)

References

Q&A
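As a companion to the Pre-emphasis and Delta Coefficients slides, here is a minimal sketch of the first and last steps of the feature pipeline. The coefficient `alpha = 0.97` and the use of `np.gradient` as the difference operator are assumptions, not values taken from the slides, and the function names are hypothetical.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """x'[n] = x[n] - alpha * x[n-1]; alpha = 0.97 is an assumed, commonly used value."""
    x = np.asarray(x, dtype=float)
    # The first sample has no predecessor, so it is passed through unchanged.
    return np.concatenate(([x[0]], x[1:] - alpha * x[:-1]))

def add_deltas(static):
    """Stack static features with their 1st- and 2nd-order differences along the frame axis."""
    static = np.asarray(static, dtype=float)
    delta = np.gradient(static, axis=0)    # 1st-order difference (central differences)
    delta2 = np.gradient(delta, axis=0)    # 2nd-order difference
    return np.hstack([static, delta, delta2])

# 13 static coefficients per frame become 39 dimensions per frame.
static = np.random.default_rng(0).normal(size=(50, 13))
feats = add_deltas(static)
```

A constant input is almost cancelled by the pre-emphasis step, which is the high-frequency boost the slide describes: slowly varying content is attenuated relative to fast changes.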
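The CMS and CMVN formulas from the Cepstral Moment Normalization slides can be sketched as follows. This assumes the expectation and variance are estimated per utterance and per coefficient; the small `eps` term and the toy data are illustrative additions, not part of the slides.

```python
import numpy as np

def cms(features):
    """Cepstral mean subtraction: x_CMS = y - E[y], per coefficient."""
    # features: array of shape (num_frames, num_coeffs) for one utterance
    features = np.asarray(features, dtype=float)
    return features - features.mean(axis=0, keepdims=True)

def cmvn(features, eps=1e-8):
    """CMS followed by variance normalization: x_CMS / sqrt(Var(x_CMS))."""
    centered = cms(features)
    std = centered.std(axis=0, keepdims=True)
    return centered / (std + eps)   # eps (an assumption) guards constant coefficients

# Toy check: a constant cepstral offset h (convolutional noise) is removed by CMS.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 13))      # stand-in for clean cepstra
h = np.linspace(-1.0, 1.0, 13)      # hypothetical channel offset
y = x + h                           # y = x + h in the cepstral domain
```

Because `h` is the same in every frame, `cms(y)` and `cms(x)` coincide, which is the point of the slide: the channel term disappears without ever being estimated explicitly.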
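The HEQ mapping $y = CDF_y^{-1}[CDF_x(x)]$ can be illustrated with empirical distributions. This is a sketch under assumptions: both CDFs are estimated from samples, one feature dimension is handled at a time, and the function name `heq` and the toy data are hypothetical.

```python
import numpy as np

def heq(values, reference):
    """Map `values` so their empirical distribution matches that of `reference`."""
    values = np.asarray(values, dtype=float)
    reference = np.sort(np.asarray(reference, dtype=float))
    # Empirical CDF_x(x): rank of each value, scaled into (0, 1)
    ranks = np.argsort(np.argsort(values))
    cdf_x = (ranks + 0.5) / values.size
    # CDF_y^{-1}: interpolate the sorted reference samples at those probabilities
    ref_probs = (np.arange(reference.size) + 0.5) / reference.size
    return np.interp(cdf_x, ref_probs, reference)

rng = np.random.default_rng(0)
reference = rng.normal(size=5000)             # stand-in for training-condition features
noisy = 2.0 * rng.normal(size=1000) + 3.0     # shifted and scaled test-condition features
equalized = heq(noisy, reference)
```

Unlike CMS and CMVN, which match only the first one or two moments, this mapping matches the whole distribution while preserving the ordering of the input values.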
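The RASTA temporal filter can be sketched with concrete coefficients. The slides give only the general form of $B(z)$; the values used here (numerator taps 0.1 × [2, 1, 0, −1, −2] and a pole at 0.98) are the commonly cited ones and should be read as an assumption. The $z^4$ factor is a pure time advance and is omitted, so the output is delayed by four frames.

```python
import numpy as np

def rasta_filter(traj, pole=0.98):
    """Band-pass filter one feature trajectory along the frame axis."""
    num = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])   # assumed numerator taps
    traj = np.asarray(traj, dtype=float)
    # FIR part: causal convolution with the numerator taps
    fir = np.convolve(traj, num)[: traj.size]
    # IIR part: out[t] = fir[t] + pole * out[t-1]
    out = np.zeros_like(fir)
    prev = 0.0
    for t, v in enumerate(fir):
        prev = v + pole * prev
        out[t] = prev
    return out

# At 100 frames per second: a constant offset is suppressed, a 4 Hz modulation passes.
frames = np.arange(1000) / 100.0
dc_out = rasta_filter(np.ones(1000))
speech_rate_out = rasta_filter(np.sin(2 * np.pi * 4 * frames))
```

The numerator taps sum to zero, so any component that does not change over time (such as a fixed channel offset) decays away, while modulations at typical speech rates are kept.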
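The idea on the Spectral Subtraction slide, that the noise spectrum $N(\omega)$ changes slowly and can therefore be estimated and removed, can be sketched as magnitude subtraction on short-time spectra. Everything specific here is an assumption: the first `noise_frames` frames are taken to be noise only, the frame length, 50% overlap, and spectral floor are illustrative, and the noisy phase is reused unchanged.

```python
import numpy as np

def spectral_subtraction(noisy, noise_frames=10, frame_len=256, floor=0.01):
    noisy = np.asarray(noisy, dtype=float)
    hop = frame_len // 2
    # Periodic Hann window: overlapping copies at 50% hop sum to one
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)
    starts = range(0, len(noisy) - frame_len + 1, hop)
    spectra = np.array([np.fft.rfft(window * noisy[s:s + frame_len]) for s in starts])
    mag, phase = np.abs(spectra), np.angle(spectra)
    # Assumption: the first `noise_frames` frames contain noise only
    noise_mag = mag[:noise_frames].mean(axis=0)
    # Subtract the noise magnitude, keeping a small spectral floor
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    cleaned = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(len(noisy))
    for s, frame in zip(starts, cleaned):
        out[s:s + frame_len] += frame   # overlap-add
    return out

# Toy signal: 0.2 s of noise only, then a 440 Hz tone in white noise (8 kHz sampling).
rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
clean = np.sin(2 * np.pi * 440 * t)
clean[:1600] = 0.0
noisy = clean + 0.3 * rng.normal(size=t.size)
enhanced = spectral_subtraction(noisy)
```

The spectral floor is one simple way to limit the isolated residual peaks that plain subtraction leaves behind; other choices, such as over-subtraction factors, are equally possible.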