Progress Report

Noise Reduction in Speech Recognition
Professor: Jian-Jiun Ding
Student: Yung Chang
2011/05/06
Outline



Mel Frequency Cepstral Coefficients (MFCC)
Mismatch in speech recognition
  Feature-based: CMS, CMVN, HEQ
  Feature-based: RASTA, data-driven
  Speech enhancement: spectral subtraction, Wiener filtering
Conclusions and applications
Mel Frequency Cepstral Coefficients(MFCC)


39 dimensions
The most commonly used feature in speech recognition
Advantages: high accuracy and low complexity
Mel Frequency Cepstral Coefficients(MFCC)

The framework of feature extraction:
speech signal x(n) -> pre-emphasis -> x'(n) -> window -> x_t(n) -> DFT -> A_t(k) -> Mel filter-bank -> Y_t(m) -> log(| |^2) -> Y_t'(m) -> IDFT -> y_t(j) (MFCC)
The frame energy e_t is computed from the windowed frame x_t(n).
Derivatives: from y_t(j) and e_t, compute Δy_t(j), Δe_t and Δ²y_t(j), Δ²e_t.
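The last two blocks of the chain (log, then the inverse transform) can be sketched as follows. This is only an illustration: it assumes the inverse transform is implemented as an unnormalized DCT-II, a common choice because the log filter-bank energies are real. The function name `log_mel_to_cepstrum`, the small log floor, and the default of 13 coefficients are assumptions, not taken from the slides.

```python
import math

def log_mel_to_cepstrum(mel_energies, num_ceps=13):
    """Map filter-bank energies Y_t(m) of one frame to cepstral coefficients y_t(j).

    Assumption: the inverse transform is an unnormalized DCT-II.
    """
    # Log compression; a tiny floor avoids log(0).
    log_e = [math.log(max(e, 1e-10)) for e in mel_energies]
    m_count = len(log_e)
    ceps = []
    for j in range(num_ceps):
        # DCT-II basis function: cos(pi * j * (m + 0.5) / M)
        c = sum(log_e[m] * math.cos(math.pi * j * (m + 0.5) / m_count)
                for m in range(m_count))
        ceps.append(c)
    return ceps
```

For a perfectly flat filter-bank output, all coefficients except the 0th come out as (numerically) zero, which matches the idea that the higher coefficients describe the shape of the spectrum.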
Pre-emphasis

Pre-emphasis of the spectrum at higher frequencies
x[n] -> pre-emphasis -> x'[n]
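Pre-emphasis is commonly realized as a first-order high-pass filter. The sketch below assumes the form x'[n] = x[n] - alpha * x[n-1] with alpha = 0.97; both the exact form and the value of `alpha` are assumptions, since the slide only names the step.

```python
def pre_emphasis(x, alpha=0.97):
    """Return x'[n] = x[n] - alpha * x[n-1]; the first sample is passed through."""
    if not x:
        return []
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]
```

A constant (zero-frequency) input is almost cancelled after the first sample, while rapid sample-to-sample changes are kept, which is the intended boost of the higher frequencies.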
End-point Detection (Voice Activity Detection)
[Figure: waveform segmented into noise (silence) and speech regions]
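One simple way to separate speech from noise (silence) is a short-time energy threshold. The slide does not say which detector is used, so the sketch below is an assumption throughout: it assumes the first few frames contain only noise and marks a frame as speech when its energy exceeds a multiple of the average noise energy. The names `frame_energies` and `detect_speech` and the parameters `frame_len`, `ratio`, and `noise_frames` are hypothetical.

```python
def frame_energies(x, frame_len):
    """Short-time energy of consecutive non-overlapping frames."""
    return [sum(s * s for s in x[i:i + frame_len])
            for i in range(0, len(x) - frame_len + 1, frame_len)]

def detect_speech(x, frame_len=160, ratio=4.0, noise_frames=5):
    """Label each frame True (speech) or False (noise/silence).

    Assumptions: the first `noise_frames` frames contain only noise, and a
    frame is speech when its energy exceeds `ratio` times the noise energy.
    """
    energies = frame_energies(x, frame_len)
    head = energies[:noise_frames]
    noise_level = sum(head) / max(len(head), 1)
    return [e > ratio * noise_level for e in energies]
```

The first and last frames labeled True then give the end points of the utterance.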
Windowing
Rectangular window
Hamming window
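The windowing step can be sketched as cutting the signal into overlapping frames and multiplying each frame by a Hamming window, w[n] = 0.54 - 0.46 cos(2πn / (N - 1)). The frame length and hop size are left as parameters because the slides do not state them; the function names are hypothetical.

```python
import math

def hamming(n_points):
    """Hamming window w[n] = 0.54 - 0.46 * cos(2 * pi * n / (N - 1))."""
    if n_points == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def frame_signal(x, frame_len, hop):
    """Cut x into overlapping frames and apply a Hamming window to each."""
    w = hamming(frame_len)
    return [[x[start + n] * w[n] for n in range(frame_len)]
            for start in range(0, len(x) - frame_len + 1, hop)]
```

Compared with the rectangular window (plain truncation), the Hamming window tapers the frame edges, which reduces spectral leakage in the following DFT.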
Mel-filter bank

After the DFT we get the spectrum.
[Figure: amplitude vs. frequency, the spectrum and the Mel filter bank]
Triangular shape in frequency (overlapped)
Uniformly spaced below 1 kHz
Logarithmic scale above 1 kHz
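A bank of overlapping triangular filters with this spacing can be sketched as below. The sketch assumes the widely used mapping mel = 2595 log10(1 + f / 700), which is roughly linear below 1 kHz and logarithmic above it; the slides do not give the formula, and the function names and parameters are hypothetical.

```python
import math

def hz_to_mel(f):
    """Assumed mel mapping: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(num_filters, n_fft, sample_rate):
    """Return num_filters triangular filters over the n_fft // 2 + 1 DFT bins."""
    num_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2.0)
    # num_filters + 2 edge points, equally spaced on the mel axis.
    edges_hz = [mel_to_hz(top * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for k in range(num_bins):
            f = k * sample_rate / n_fft
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising edge
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank
```

The filter-bank output Y_t(m) of a frame is then the weighted sum of the spectrum A_t(k) under each triangle.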
Delta Coefficients

1st/2nd order differences
13 dimensions (static) -> 39 dimensions (static + 1st order + 2nd order)
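The step from 13 to 39 dimensions can be sketched as follows. The slides only say "differences"; this sketch assumes the common regression form d_t = Σ_k k (c_{t+k} - c_{t-k}) / (2 Σ_k k²) with k up to 2 and replicated edge frames. The names `delta` and `add_deltas` and the parameter `k_max` are hypothetical.

```python
def delta(features, k_max=2):
    """Delta features via the regression formula
    d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2), edges replicated."""
    num_frames = len(features)
    denom = 2.0 * sum(k * k for k in range(1, k_max + 1))
    out = []
    for t in range(num_frames):
        row = []
        for j in range(len(features[t])):
            acc = 0.0
            for k in range(1, k_max + 1):
                ahead = features[min(t + k, num_frames - 1)][j]
                behind = features[max(t - k, 0)][j]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out

def add_deltas(static):
    """Stack static, 1st-order and 2nd-order features: 13 -> 39 dimensions."""
    d1 = delta(static)
    d2 = delta(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]
```

For a feature that grows by one per frame, the 1st-order value is 1 and the 2nd-order value is 0 away from the edges.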
Mismatch in Statistical Speech Recognition
[Figure: the original speech x[n] passes through additive noise n1(t), convolutional noise h[n] (acoustic reception, microphone distortion, phone/wireless channel), and additive noise n2(t) to give the input signal y[n]. Feature extraction turns y[n] into the feature vectors O = o1 o2 ... oT. The search uses the acoustic models (trained from a speech corpus), the lexicon, and the language model (trained from a text corpus) to output the sentences W = w1 w2 ... wR.]
Possible Approaches for Acoustic Environment Mismatch
[Figure: training path x[n] -> feature extraction -> model training -> acoustic models; recognition path y[n] -> feature extraction -> search and recognition, using the acoustic models]
Speech Enhancement
Feature-based Approaches
Model-based Approaches
Feature-based Approach - Cepstral Moment Normalization (CMS, CMVN)

Cepstral Mean Subtraction (CMS) for convolutional noise
[Figure: distributions P(x) and P(y) before and after CMS]
Convolutional noise in the time domain becomes additive in the cepstral domain:
y[n] = x[n] * h[n]  =>  y = x + h, so x = y - h   (x, y, h in the cepstral domain)
Most convolutional noise changes only very slightly over a reasonable time interval, so h can be treated as constant.
Cepstral Mean Subtraction (CMS)
Assuming E[x] = 0, then E[y] = h, and
x_CMS = y - E[y]
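The CMS formula can be sketched as below, with E[y] estimated as the per-dimension mean over the frames of one utterance. Estimating the mean per utterance is an assumption, and the function name is hypothetical.

```python
def cepstral_mean_subtraction(frames):
    """x_CMS = y - E[y], with E[y] estimated as the mean over the frames."""
    num_frames = len(frames)
    dim = len(frames[0])
    mean = [sum(f[j] for f in frames) / num_frames for j in range(dim)]
    return [[f[j] - mean[j] for j in range(dim)] for f in frames]
```

If the clean features have zero mean and a constant channel term h is added to every frame, the output equals the clean features again, which is the point of the assumption E[x] = 0.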
Feature-based Approach - Cepstral Moment Normalization (CMS, CMVN)

CMVN: the variance is normalized as well
x_CMVN = x_CMS / [Var(x_CMS)]^(1/2)
[Figure: P(x) and P(y) originally, after CMS, and after CMVN]
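CMVN can be sketched in the same way as CMS, dividing each mean-subtracted dimension by its standard deviation. Estimating the statistics per utterance and leaving constant dimensions unscaled are assumptions of this sketch; the function name `cmvn` is hypothetical.

```python
import math

def cmvn(frames):
    """Per-dimension mean and variance normalization over the given frames."""
    num_frames = len(frames)
    dim = len(frames[0])
    mean = [sum(f[j] for f in frames) / num_frames for j in range(dim)]
    var = [sum((f[j] - mean[j]) ** 2 for f in frames) / num_frames
           for j in range(dim)]
    # Guard against zero variance for constant dimensions.
    std = [math.sqrt(v) if v > 0 else 1.0 for v in var]
    return [[(f[j] - mean[j]) / std[j] for j in range(dim)] for f in frames]
```

After this step every dimension has zero mean and unit variance over the utterance, so training and test features are matched in their first two moments.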
Feature-based Approach - HEQ (Histogram Equalization)

The whole distribution is equalized:
y = CDF_y^(-1)[CDF_x(x)]
[Figure: CDF_x and CDF_y; for example, x = 3 with CDF_x(3) = 0.2 maps to y = 3.5, where CDF_y(3.5) = 0.2]
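The mapping y = CDF_y^(-1)[CDF_x(x)] can be sketched with empirical distributions: rank each value among the samples of x, then pick the reference sample at the same cumulative probability. Using empirical CDFs built from sample lists is an assumption of this sketch, and the function and parameter names are hypothetical.

```python
import math
from bisect import bisect_right

def histogram_equalize(values, reference):
    """Map each value x to y = CDF_y^(-1)[CDF_x(x)] using empirical CDFs.

    `values` are samples of x; `reference` are samples of the target variable y.
    """
    sorted_x = sorted(values)
    sorted_ref = sorted(reference)
    n_x, n_ref = len(sorted_x), len(sorted_ref)
    out = []
    for v in values:
        # Empirical CDF_x(v): fraction of samples <= v, a value in (0, 1].
        p = bisect_right(sorted_x, v) / n_x
        # Empirical inverse CDF_y: smallest reference sample whose CDF is >= p.
        idx = math.ceil(p * n_ref - 1e-9) - 1
        idx = min(n_ref - 1, max(0, idx))
        out.append(sorted_ref[idx])
    return out
```

The mapping is monotonic, so the order of the input values is preserved while their distribution is reshaped to follow the reference.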
Feature-based Approach - RASTA (Relative Spectral Temporal filtering)

[Figure: amplitude vs. frequency f for successive frames; the trajectory of each feature over time is treated as a signal]
Perform filtering on these signals (temporal filtering); the frequency axis of such a filter is the modulation frequency.
Assume the rate of change of the noise often lies outside the typical rate of change of the vocal tract shape.
A specially designed temporal filter:
B(z) = (a0 + a1 z^(-1) + a3 z^(-3) + a4 z^(-4)) / (z^(-4) (1 - b1 z^(-1)))
[Figure: filter response vs. modulation frequency (Hz), emphasizing speech]
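A temporal filter of this shape can be sketched as below. The slide gives the coefficients only symbolically; the numbers used here (numerator 0.1 * (2 + z^-1 - z^-3 - 2 z^-4), pole at 0.98) are the values commonly quoted for the RASTA filter and are an assumption with respect to this presentation. The z^4 advance is dropped so that the filter is causal, which delays the output by four frames.

```python
def rasta_filter(trajectory, pole=0.98):
    """Band-pass filter one feature trajectory (one value per frame).

    Assumed coefficients: numerator 0.1 * (2 + z^-1 - z^-3 - 2 z^-4),
    denominator 1 - pole * z^-1. Causal form, so the output lags by 4 frames.
    """
    numer = [0.2, 0.1, 0.0, -0.1, -0.2]
    out = []
    prev = 0.0
    for t in range(len(trajectory)):
        # FIR part: weighted sum of the current and the four previous frames.
        fir = sum(numer[k] * trajectory[t - k] for k in range(5) if t - k >= 0)
        # IIR part: leaky integrator with the given pole.
        prev = fir + pole * prev
        out.append(prev)
    return out
```

Because the numerator coefficients sum to zero, a constant offset in the trajectory (such as a fixed channel term in the log or cepstral domain) decays to zero at the output, while changes at speech-like modulation rates pass through.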
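Data-driven Temporal filtering

We should not guess our filter, but get it from data.
PCA (Principal Component Analysis)
[Figure: data points in the x-y plane with the principal direction e]
[Figure: the original feature stream y_t is cut along the frame index into segments of length L, z_k(1), z_k(2), z_k(3), ...; the filters B1(z), B2(z), ..., Bn(z) obtained from these segments are applied to the stream by convolution]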
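One way to read this slide is sketched below: collect all length-L segments of one feature trajectory, compute their covariance matrix, and use the first principal component as the FIR filter coefficients. This reading, the use of power iteration to find the principal component, and all function names are assumptions; the sketch also assumes the trajectory is not constant, so the covariance matrix is non-zero.

```python
import math

def segment_covariance(trajectory, length):
    """Covariance matrix of all length-`length` segments of the trajectory."""
    segments = [trajectory[i:i + length]
                for i in range(len(trajectory) - length + 1)]
    n = len(segments)
    mean = [sum(s[j] for s in segments) / n for j in range(length)]
    centered = [[s[j] - mean[j] for j in range(length)] for s in segments]
    return [[sum(c[a] * c[b] for c in centered) / n for b in range(length)]
            for a in range(length)]

def dominant_eigenvector(cov, iterations=300):
    """First principal component by power iteration (unit norm)."""
    size = len(cov)
    w = [1.0 + 0.01 * j for j in range(size)]  # arbitrary non-zero start
    for _ in range(iterations):
        w = [sum(cov[a][b] * w[b] for b in range(size)) for a in range(size)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w

def apply_filter(trajectory, coeffs):
    """Convolve the trajectory with the filter (valid part only)."""
    length = len(coeffs)
    return [sum(coeffs[k] * trajectory[i + length - 1 - k] for k in range(length))
            for i in range(len(trajectory) - length + 1)]
```

The filter obtained this way keeps the temporal pattern that carries the most variance in the training trajectories, instead of relying on a hand-designed response.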
Speech Enhancement - Spectral Subtraction (SS)

Produce a better signal by trying to remove the noise, for listening purposes or recognition purposes.
Noise n[n] changes fast and unpredictably in the time domain, but relatively slowly in the frequency domain, N(w).
[Figure: amplitude of speech and noise vs. time t and vs. frequency f]
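Because the noise spectrum changes slowly, it can be estimated once and subtracted from every frame. The sketch below works on magnitude spectra and makes several assumptions not stated on the slide: the first few frames contain noise only, the noise estimate is their average, and results that would go negative are replaced by a small fraction of the original magnitude. The function name and the parameters `noise_frames` and `floor` are hypothetical.

```python
def spectral_subtraction(frames_mag, noise_frames=5, floor=0.01):
    """Subtract an average noise magnitude spectrum from every frame.

    Assumptions: `frames_mag` holds one magnitude spectrum |Y(w)| per frame,
    the first `noise_frames` frames contain noise only, and values that would
    fall below `floor` times the original magnitude are set to that floor.
    """
    num_bins = len(frames_mag[0])
    head = frames_mag[:noise_frames]
    # Noise estimate |N(w)|: average over the leading noise-only frames.
    noise = [sum(f[k] for f in head) / len(head) for k in range(num_bins)]
    cleaned = []
    for frame in frames_mag:
        cleaned.append([max(frame[k] - noise[k], floor * frame[k])
                        for k in range(num_bins)])
    return cleaned
```

To get a waveform back, the cleaned magnitudes are typically combined with the phase of the noisy signal before the inverse transform.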
Conclusions



We give a general framework for extracting speech features.
We introduce the mainstream robustness techniques.
There are still numerous other noise reduction methods (left to the references).
References
Q&A