Noise robustness in Automatic Speech Recognition

Michael I Mandel
mandelm@cse.ohio-state.edu
Ohio State University
Computer Science and Engineering
Jelinek Speech and Language Technologies Tutorial
June 29, 2015
Outline
1. Overview
2. Types of noise and their effects
3. Robust ASR basics
4. Approaches to noise robust ASR
5. Results in recently organized challenges
6. Summary
Overview
Noisy speech
Noisy speech contains lots of (extraneous) information
Information in the target speech
Lexical information: what words were said
Prosodic information: how they were said, emphasis
Speaker information: age, gender, mood
Channel information
Location: angular position and distance
Reverberation: room size, materials
Noise information
Stationary: HVAC, machinery, 60 Hz hum, road noise
Non-stationary: Other talkers, music, environment
ASR works pretty well in matched conditions
When train and test match in all respects, error rates are low
match in talker, vocabulary, style, mic, reverb, noise type
But can’t see all possibilities (esp. all combinations) in training
So how to generalize best from training to test?
How to generalize best from training to test?
Make system components invariant to irrelevant differences
Adapt training data/models to be more like test observations
Adapt test observations to be more like training data/models
Combine compact, better-trained models of separate “components”
e.g., speech, noise, channel
All of the above. . .
A lot of progress in reducing word error rates
But errors are still 30-40% in real noisy situations
Meeting recordings [1]: 40.0% with 1 mic, 34.7% with 8 mics
YouTube videos [2]: 40.9% trained on 2 years of audio

[1] Takuya Yoshioka, Xie Chen, and Mark J. F. Gales. Impact of single-microphone dereverberation on DNN-based meeting transcription systems. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5527-5531. IEEE, May 2014.
[2] Hank Liao, Erik McDermott, and Andrew Senior. Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 368-373. IEEE, December 2013.
Recent surveys on noise-robust ASR
Books
Tuomas Virtanen, Bhiksha Raj, and Rita Singh, editors. Techniques for Noise Robustness in Automatic Speech Recognition. John Wiley & Sons, Ltd, 2012.
D. Kolossa and R. Haeb-Umbach. Robust Speech Recognition of Uncertain or Missing Data: Theory and Applications. Springer Berlin Heidelberg, 2011.
Review articles
J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach. An overview of noise-robust automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(4):745-777, April 2014.
K. Kumatani, J. McDonough, and B. Raj. Microphone Array Processing for Distant Speech Recognition: From Close-Talking Microphones to Far-Field Sensors. IEEE Signal Processing Magazine, 29(6):127-140, November 2012.
T. Yoshioka, A. Sehr, M. Delcroix, K. Kinoshita, R. Maas, T. Nakatani, and W. Kellermann. Making machines understand us in reverberant rooms: Robustness against reverberation for automatic speech recognition. IEEE Signal Processing Magazine, 29(6):114-126, November 2012.
Types of noise and their effects
Types of noise
Noise types and effects summary
Easy: Stationary noise ("constant" background noise)
Medium: Reverberation (correlates observations across time)
Hard: Non-stationary noise (most common)
Audio examples: reverberation & non-stationary noise, reverberation alone, and clean speech
Audio examples of stationary noise: fighter jet cockpit noise, car noise
A noisy observation can be modeled as

    x[t] = s[t] ∗ h[t] + n[t]

where x[t] is the noisy signal, s[t] is the clean target signal, h[t] is the channel impulse response, and n[t] is the additive noise. More on this later...
Reverberation: the channel
Reflection causes additional wavefronts, plus scattering, absorption, diffraction & shadowing
Many paths → many echoes
Reverberant effect: causal 'smearing' of signal energy
[Figure: spectrograms of dry speech 'airvib16' and the same signal with reverb from 'hlwy16', 0-8000 Hz over about 1.5 s]
Reverberation impulse response
Exponential decay of reflections: h_room(t) ∝ e^(−t/T)
[Figure: room impulse response h_room(t) and spectrogram of 'hlwy16' (128-pt window), decaying to about −70 dB over roughly 0.7 s]
Frequency-dependent: greater absorption at high frequencies → faster decay
Size-dependent: larger rooms → longer delays → slower decay
Important parameters for ASR:
RT60: time for reverb to decay by 60 dB (lower = easier)
DRR: direct-to-reverberant energy ratio (higher = easier)
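For illustration, a small numpy sketch of one common way to estimate RT60 from a measured impulse response (Schroeder backward integration with a linear fit to part of the decay curve); the synthetic impulse response and the −5 to −25 dB fit range are assumptions of the example, not something taken from the slides.

```python
import numpy as np

def rt60_schroeder(h, sr):
    """Estimate RT60 from an impulse response via Schroeder backward
    integration: fit the -5 dB to -25 dB decay and extrapolate to 60 dB."""
    energy = np.cumsum(h[::-1] ** 2)[::-1]          # Schroeder integral
    edc_db = 10 * np.log10(energy / energy[0])      # energy decay curve in dB
    t = np.arange(len(h)) / sr
    fit = (edc_db <= -5) & (edc_db >= -25)          # usable decay region
    slope, _ = np.polyfit(t[fit], edc_db[fit], 1)   # dB per second
    return -60.0 / slope

# Synthetic exponentially decaying impulse response as a stand-in
sr = 16000
t = np.arange(sr) / sr
h = np.random.default_rng(1).standard_normal(sr) * np.exp(-t / 0.1)
print(rt60_schroeder(h, sr))   # ~0.7 s for this decay constant
```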
Effects on spectrograms
Reverberation and non-stationary noise at SNRs from +∞ dB down to −∞ dB
[Spectrograms of the same reverberant, noisy utterance at SNR +∞, +20, +15, +10, +5, 0, −5, −10, and −∞ dB]
Noise types and effects summary
Easy: Stationary noise ("constant" background noise)
Medium: Reverberation (correlates observations across time)
Hard: Non-stationary noise (most common)
Robust ASR basics
Standard ASR pipeline
    x (audio) → Feature extraction → f (features) → Acoustic model → p(k | x) (HMM state) → Decoder → p(w | x) (words)
Fundamental equation of ASR
    ŵ = argmax_w p(w | f)                        (best word sequence)
       = argmax_w p(f | w) p(w)                  (by Bayes' rule)
       = argmax_w Σ_k p(f | k) p(k | w) p(w)     (break words into states)

where p(f | k) is the acoustic model and p(k | w) p(w) is handled by the decoder (HMM).
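A toy sketch of the decision rule ŵ = argmax_w p(f | w) p(w), hypothetically reducing each "word" to a single Gaussian acoustic model over a 1-D feature so the argmax can be computed directly; real systems sum or maximize over HMM state sequences instead, and all the numbers here are made up.

```python
import numpy as np
from scipy.stats import norm

# Two "words", each modeled by one Gaussian p(f | w) and a prior p(w).
words = ["yes", "no"]
prior = {"yes": 0.5, "no": 0.5}                  # p(w), language model
acoustic = {"yes": norm(loc=1.0, scale=0.5),     # p(f | w)
            "no": norm(loc=-1.0, scale=0.5)}

f = np.array([0.8, 1.2, 0.9])                    # observed features
log_post = {w: np.log(prior[w]) + acoustic[w].logpdf(f).sum() for w in words}
w_hat = max(log_post, key=log_post.get)
print(w_hat)   # "yes": the features are closer to that word's model
```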
Standard features: Mel frequency cepstral coefficients (MFCCs)

    Waveform → | · |² (spectrogram) → Mel binning → log(·) → DCT → MFCC → Δ, ΔΔ in time → Mean/variance normalization → To acoustic model
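A minimal numpy sketch of this pipeline; framing, filterbank shape, and normalization details are simplified relative to production front ends, so treat it as an illustration rather than a reference implementation.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(x, sr, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC pipeline: frame -> |FFT|^2 -> mel binning -> log -> DCT,
    then deltas and per-utterance mean/variance normalization."""
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    frames = frames * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2             # | . |^2

    # Triangular mel filterbank (simplified)
    hz_to_mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_to_hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fb[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)

    logmel = np.log(power @ fb.T + 1e-10)                        # log mel spectrum
    ceps = dct(logmel, type=2, axis=1, norm='ortho')[:, :n_ceps] # static MFCCs
    feats = np.hstack([ceps, np.gradient(ceps, axis=0)])         # add deltas
    return (feats - feats.mean(0)) / (feats.std(0) + 1e-10)      # mean/var norm
```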
Computation of ASR features (MFCCs)
[Plots of the intermediate representations: power spectrum, mel spectrum, log mel spectrum, static MFCCs, and MFCCs with deltas]
Effect of noise on MFCCs at SNRs from +∞ dB down to −∞ dB
[MFCC features of the same utterance at SNR +∞, +20, +15, +10, +5, 0, −5, −10, and −∞ dB]
Approaches to noise robust ASR
Standard ASR pipeline
    x (audio) → Feature extraction → f (features) → Acoustic model → p(k | x) (HMM state) → Decoder → p(w | x) (words)
Robust ASR pipeline
    x (noisy audio) → Enhancement → p(s | x) (clean speech) → Feature extraction → p(f | x) (clean features) → Acoustic model → p(k | x) (HMM state) → Decoder → p(w | x) (words)
Make system components invariant to irrelevant differences
Make features invariant to irrelevant differences
[Robust ASR pipeline diagram with the feature extraction stage highlighted]
Cepstral mean normalization removes time-invariant filters

Convolution in the time domain is addition in the cepstral domain, so subtracting the cepstral mean removes time-invariant filters:

    x[n] = s[n] ∗ h[n]
    |X_ωt|² = |S_ωt|² |H_ω|²
    log |X_ωt|² = log |S_ωt|² + log |H_ω|²
    log |X_ωt|² − mean_{t∈T} log |X_ωt|² = log |S_ωt|² − mean_{t∈T} log |S_ωt|²

Can remove different filters for different choices of T: channel, talker, gender, ...
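A small sketch of cepstral mean normalization as described above; the optional `groups` argument stands in for the different choices of T (per utterance, per channel, per talker) and is an assumption of this example.

```python
import numpy as np

def cepstral_mean_norm(feats, groups=None):
    """Subtract the per-group cepstral mean from a (frames x coeffs) array.

    groups: optional per-frame labels (e.g. utterance, channel, or talker
    IDs) so that a different mean is removed for each group."""
    feats = np.asarray(feats, dtype=float)
    if groups is None:
        return feats - feats.mean(axis=0, keepdims=True)
    out = feats.copy()
    for g in np.unique(groups):
        idx = groups == g
        out[idx] -= feats[idx].mean(axis=0, keepdims=True)
    return out
```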
RASTA-PLP and PNCC are robust to certain distortions [3]

[3] Richard M. Stern and Nelson Morgan. Features based on auditory physiology and perception. In Virtanen et al. (2012), chapter 8, pages 193-227.
PLP-cepstra without RASTA at SNRs from +∞ dB down to −∞ dB
[Feature plots at SNR +∞, +20, +15, +10, +5, 0, −5, −10, and −∞ dB]

PLP-cepstra with RASTA at the same SNRs
[Feature plots at SNR +∞, +20, +15, +10, +5, 0, −5, −10, and −∞ dB]
Deep neural network acoustic models [4,5]

[4] Michael L. Seltzer, Dong Yu, and Yongqiang Wang. An investigation of deep neural networks for noise robust speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 7398-7402. IEEE, 2013.
[5] Arun Narayanan and DeLiang Wang. Joint noise adaptive training for robust automatic speech recognition. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2504-2508. IEEE, May 2014.
Adapt training data/models to be more like test observations
Robust acoustic modeling
[Robust ASR pipeline diagram with the acoustic model stage highlighted]
Multi-condition and noise adaptive training [6]
To match train and test sets, train on noisy speech
Perform the same processing on training data as will be performed on test data
Model "canonical" speech, i.e., noisy speech compensated for irrelevant variations

[6] Michael L. Seltzer. Acoustic model training for robust speech recognition. In Virtanen et al. (2012), chapter 13, pages 347-368.
Maximum likelihood linear regression
Find a linear transformation of model parameters that maximizes likelihood
Can be supervised or unsupervised
Complexity can be scaled to fit the amount of adaptation data
Constrained maximum likelihood linear regression
Normal GMM training finds parameters Θ = {π_k, µ_k, Σ_k} using EM:

    Θ̂ = argmax_Θ Σ_{i∈Itr} L(Θ; x_i)
       = argmax_Θ Σ_{i∈Itr} log Σ_k π_k N(x_i; µ_k, Σ_k)

Imagine that we want to adapt a trained model to a new observation.
Why not learn mappings of the Gaussians' parameters:

    µ̃_k = A_r µ_k + b_r
    Σ̃_k = A_r Σ_k A_r^T

Find {Â_r, b̂_r} that maximize the likelihood on the adaptation set:

    {Â_r, b̂_r} = argmax_{A_r, b_r} Σ_{i∈Ite} L(Θ̃; x_i)
                = argmax_{A_r, b_r} Σ_{i∈Ite} log Σ_k π_k N(x_i; A_r(k) µ_k + b_r(k), A_r(k) Σ_k A_r(k)^T)
Model complexity can be scaled to the amount of training data
Need enough data to accurately estimate {A_r, b_r}
A_r can be a full matrix, diagonal, or the identity
Gaussians that are "similar" can share A's and b's
use more groups as adaptation data grows
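A minimal sketch of the (C)MLLR objective for a single regression class: map each Gaussian's mean and covariance through a candidate {A, b} and evaluate the adaptation-data log-likelihood. The closed-form or row-wise iterative estimation of {A, b} is not shown, and the toy GMM and data are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal

def adapted_loglik(A, b, weights, means, covs, X):
    """Log-likelihood of adaptation data X under a GMM whose Gaussians have
    been mapped to mean A @ mu + b and covariance A @ Sigma @ A.T
    (a single regression class, i.e. one shared A, b)."""
    ll = np.zeros((len(X), len(weights)))
    for k, (w, mu, S) in enumerate(zip(weights, means, covs)):
        ll[:, k] = np.log(w) + multivariate_normal(A @ mu + b, A @ S @ A.T).logpdf(X)
    # log sum_k pi_k N(x_i; A mu_k + b, A Sigma_k A^T), summed over frames i
    return np.logaddexp.reduce(ll, axis=1).sum()

# In MLLR/CMLLR, {A, b} would be chosen to maximize this objective on the
# adaptation set; here we only evaluate it for one candidate transform.
rng = np.random.default_rng(0)
weights = np.array([0.5, 0.5])
means = [np.zeros(2), np.ones(2) * 3]
covs = [np.eye(2), np.eye(2)]
X = rng.normal(loc=1.0, size=(100, 2))       # "adaptation" data, shifted
print(adapted_loglik(np.eye(2), np.ones(2), weights, means, covs, X))
```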
Adapt test observations to be more like training data/models
Remove noise from audio, recognize
[Robust ASR pipeline diagram with the enhancement stage highlighted]
Voice activity detection, end-pointing [7]
Only recognize sequences of frames that contain speech

[7] Rainer Martin and Dorothea Kolossa. Voice activity detection, noise estimation, and adaptive filters for acoustic signal enhancement. In Virtanen et al. (2012), chapter 4, pages 51-85.
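A toy energy-based voice activity detector in the spirit of this idea; the frame sizes, percentile noise-floor estimate, and threshold are arbitrary choices of the sketch, not the chapter's algorithm.

```python
import numpy as np

def energy_vad(x, sr, frame_ms=25, hop_ms=10, threshold_db=15.0):
    """Toy energy-based VAD: mark frames whose energy is more than
    threshold_db above the quietest frames (assumed to be noise-only)."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    frames = np.lib.stride_tricks.sliding_window_view(x, frame)[::hop]
    energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    noise_floor = np.percentile(energy_db, 10)      # rough noise estimate
    return energy_db > noise_floor + threshold_db   # boolean mask per frame
```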
Spectral subtraction of stationary noise [8]
Estimate the noise distribution when speech is inactive
Only pass the part of the mixture signal that is louder than the noise estimate

[8] Rainer Martin and Dorothea Kolossa. Voice activity detection, noise estimation, and adaptive filters for acoustic signal enhancement. In Virtanen et al. (2012), chapter 4, pages 51-85.
Time-frequency masking [9]
Only recognize time-frequency points that contain speech

[9] Arun Narayanan and DeLiang Wang. Computational auditory scene analysis and automatic speech recognition. In Virtanen et al. (2012), chapter 16, pages 433-462.
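A sketch of an oracle (ideal binary) mask, which requires the separate speech and noise signals and so is only available in simulation; an estimated mask would replace the oracle SNR comparison in practice.

```python
import numpy as np
from scipy.signal import stft, istft

def ideal_binary_mask(speech, noise, sr, lc_db=0.0):
    """Oracle time-frequency masking: keep only T-F points where the local
    SNR exceeds lc_db, given separate speech and noise signals (so this is
    an upper bound, not something available at test time)."""
    _, _, S = stft(speech, fs=sr, nperseg=512)
    _, _, N = stft(noise, fs=sr, nperseg=512)
    snr = 10 * np.log10(np.abs(S) ** 2 / (np.abs(N) ** 2 + 1e-12) + 1e-12)
    mask = snr > lc_db
    _, masked = istft((S + N) * mask, fs=sr, nperseg=512)   # mask the mixture
    return mask, masked
```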
Microphone array processing [10]
Only recognize spatial directions that contain speech

[10] Andrew Greensted. The lab book pages: delay sum beamforming, 2012. Online: http://www.labbookpages.co.uk/audio/beamforming/delaySum.html
Adapt test features to fit the trained model
[Robust ASR pipeline diagram with the features highlighted]
Feature-space maximum likelihood linear regression (FMLLR)

Consider CMLLR with a single A and b:

    Â, b̂ = argmax_{A,b} Σ_{i∈Ite} log Σ_k π_k N(x_i; A µ_k + b, A Σ_k A^T)

This is equivalent to the likelihood of transformed features s_i = A x_i + b with fixed model parameters:

    Â, b̂ = argmax_{A,b} Σ_{i∈Ite} L(Θ; s_i)
          = argmax_{A,b} Σ_{i∈Ite} ( log |A| + log Σ_k π_k N(A x_i + b; µ_k, Σ_k) )
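A small sketch of evaluating the feature-space objective above for a candidate transform, including the log |A| Jacobian term; the actual FMLLR estimation uses an iterative row-by-row update that is not shown here.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fmllr_objective(A, b, weights, means, covs, X):
    """Objective from the slide: log|A| per frame plus the GMM log-likelihood
    of the transformed features s_i = A x_i + b under the unadapted model."""
    S = X @ A.T + b                                   # transformed features
    ll = np.zeros((len(X), len(weights)))
    for k, (w, mu, C) in enumerate(zip(weights, means, covs)):
        ll[:, k] = np.log(w) + multivariate_normal(mu, C).logpdf(S)
    return len(X) * np.log(abs(np.linalg.det(A))) + np.logaddexp.reduce(ll, axis=1).sum()
```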
Combine compact models of separate components
Explicitly modeling mixing [11]
Allows the use of more compact, separate speech and noise models (which require less data to train)
But must be derived for each type of feature

[11] Jasha Droppo. Feature compensation. In Virtanen et al. (2012), chapter 9, pages 229-250.
Factorial HMMs [12,13]
Model two sources individually with independent HMMs
Combine states into joint states, k′ = {k₁, k₂}
Then model p(f | k₁, k₂)
But this leads to K² joint states (exponential in the number of chains)

[12] A. P. Varga and Roger K. Moore. Simultaneous recognition of concurrent speech signals using hidden Markov model decomposition. In Second European Conference on Speech Communication and Technology, 1991.
[13] Zoubin Ghahramani and Michael I. Jordan. Factorial hidden Markov models. Machine Learning, 29(2-3):245-273, 1997.
No one feature is good for both separation and recognition
Separation is easy in the time domain and complex spectrum
simultaneous sources mix linearly
but retains too many irrelevant variations for robust ASR
Recognition is easy in the cepstral domain
removes many sources of variation irrelevant to lexical content
simultaneous sources mix very non-linearly
Need some way to link the two domains
Humans perceive loudness logarithmically
Let s be one complex T-F unit of a speech signal, n noise, x mixture
s by itself is perceived as s̃ = log |s|², so |s|² = e^s̃, then

    x = s + n
    |x|² = |s|² + |n|² + 2α|s||n|
    e^x̃ ≈ e^s̃ + e^ñ
    x̃ = log(e^s̃ + e^ñ)

If we model s̃ and ñ as Gaussians, what is x̃?
Gaussians mixed in the log spectrum become non-Gaussian [14]

    x̃ = log(e^s̃ + e^ñ),   where s̃ ∼ N(25, 10²), ñ ∼ N(µ_n, 2²)

[14] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach. An overview of noise-robust automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(4):745-777, April 2014.
p(|x|, |s|) is also non-Gaussian
Modeling phase interaction makes it more complicated [15]
Mix simulated Gaussian speech and noise, compute p(|s|, |n| | |x|)
[Plots: a single T-F unit (log magnitude) and the average of 5 T-F units (log mel magnitude)]

[15] John R. Hershey, Steven J. Rennie, and Jonathan Le Roux. Factorial models for noise robust speech recognition. In Virtanen et al. (2012), chapter 12, pages 311-345.
Vector Taylor series approximates the combination [16]
Model speech and noise as separate mixtures of Gaussians
Use a Taylor series approximation to x̃ around a particular SNR
Compute an updated SNR estimate from the approximation
Repeat until convergence

[16] Hank Liao. Uncertainty decoding. In Virtanen et al. (2012), chapter 17, pages 463-486.
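A sketch of the first-order expansion that VTS is built on: linearize x̃ = log(e^s̃ + e^ñ) around an expansion point and propagate Gaussian moments through it. The iteration that re-estimates the expansion point is omitted, and the numbers are illustrative.

```python
import numpy as np

def vts_first_order(s0, n0):
    """First-order vector Taylor series expansion of x = log(e^s + e^n)
    around the point (s0, n0): returns the constant term and the two
    partial derivatives, so x ~ x0 + a*(s - s0) + b*(n - n0)."""
    x0 = np.logaddexp(s0, n0)
    a = np.exp(s0 - x0)          # dx/ds = e^s / (e^s + e^n)
    b = np.exp(n0 - x0)          # dx/dn = e^n / (e^s + e^n), with a + b = 1
    return x0, a, b

# With Gaussian s ~ N(mu_s, var_s) and n ~ N(mu_n, var_n), the linearization
# gives a Gaussian approximation for x (illustrative values):
mu_s, var_s, mu_n, var_n = 25.0, 100.0, 15.0, 4.0
x0, a, b = vts_first_order(mu_s, mu_n)
mu_x = x0
var_x = a**2 * var_s + b**2 * var_n
print(mu_x, var_x)
```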
Approaches to noise robust ASR summary
Mismatch between training and test data degrades accuracy, but is inevitable
Noise-robust ASR attempts to reduce this mismatch by
making system components invariant to irrelevant differences
adapting training data/models to be more like test observations
adapting test observations to be more like training data/models
combining small, better-trained models of separate “components”
Results in recently organized challenges
There are lots of noisy speech datasets out there [17]

[17] https://wiki.inria.fr/rosp/Datasets
Noisy ASR challenges
2002  AURORA-2 [18]
2006  Monaural speech separation & recognition challenge [19]
2011  (1st) PASCAL CHiME Speech Separation & Recognition Challenge [20]
2013  2nd PASCAL CHiME Speech Separation & Recognition Challenge [21]
2014  REVERB challenge [22]
2015  3rd PASCAL CHiME Speech Separation & Recognition Challenge

[18] Hans-Gunter Hirsch and David Pearce. The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In ASR-2000, pages 181-188, 2000.
[19] Martin Cooke, John R. Hershey, and Steven J. Rennie. Monaural speech separation and recognition challenge. Computer Speech & Language, 24(1):1-15, January 2010.
[20] Jon Barker, Emmanuel Vincent, Ning Ma, Heidi Christensen, and Phil Green. The PASCAL CHiME speech separation and recognition challenge. Computer Speech & Language, 27(3):621-633, May 2013.
[21] Emmanuel Vincent, Jon Barker, Shinji Watanabe, Jonathan Le Roux, Francesco Nesta, and Marco Matassoni. The second 'CHiME' speech separation and recognition challenge: An overview of challenge systems and outcomes. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 162-167. IEEE, December 2013.
[22] Keisuke Kinoshita, Marc Delcroix, Takuya Yoshioka, Tomohiro Nakatani, Armin Sehr, Walter Kellermann, and Roland Maas. The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech. In 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 1-4. IEEE, October 2013.
Noisy ASR challenges
SSC
CHiME
CHiME2-trk1
CHiME2-trk2
CHiME3
Reverb-sim
Reverb-real
Michael I Mandel (OSU CSE)
Vocab
Ch
Noise
Reverb
Motion
50
50
50
5k
5k
5k
5k
1
2
2
2
6
8
8
Talker
Household
Household
Household
Various
Stationary
Ambient
No
RIRs
RIRs
RIRs
Live
RIRs
Live
No
No
Sim
No
Yes
No
No
Robust ASR
June 29, 2015
68 / 74
Results in recently organized challenges
2nd CHiME Challenge [23]: small vocab baseline and all-system accuracies; medium vocab baseline and all-system WERs
[Results charts]

[23] Emmanuel Vincent, Jon Barker, Shinji Watanabe, Jonathan Le Roux, Francesco Nesta, and Marco Matassoni. The second 'CHiME' speech separation and recognition challenge: An overview of challenge systems and outcomes. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 162-167. IEEE, December 2013.
2nd CHiME Challenge successful techniques
Both tracks
spatial diversity-based enhancement
noise adaptive training
combinations of many systems / strategies
Small vocab only
spectral diversity based enhancement
Medium vocab only
careful design of ASR back-end
REVERB Challenge [24]: baseline systems, all systems, and results broken down by number of mics, training data, and ASR back-end
[Results charts; source: Keisuke Kinoshita, "Challenge summary"]

[24] http://reverb2014.dereverberation.com/result_asr.html
Recent challenge summary
Challenges are getting more realistic
but erring on the side of solvability
The best systems are complex combinations of diverse approaches
ASR is still far from human-level noise robustness in general
Summary
Noisy speech contains lots of (extraneous) information
ASR works pretty well in matched conditions
Models need help generalizing to new conditions
make system components invariant to irrelevant differences
adapt training data/models to be more like test observations
adapt test observations to be more like training data/models
combine compact, better-trained models of separate “components”
No one representation is good for both separation and recognition
Still work to be done to achieve human-level noise robustness
Thanks!
Any questions?
More challenge results
Speech separation challenge [25]: baseline and all-system accuracies
[Results charts]

[25] Martin Cooke, John R. Hershey, and Steven J. Rennie. Monaural speech separation and recognition challenge. Computer Speech & Language, 24(1):1-15, January 2010.
1st CHiME Challenge [26]: baseline and all-system accuracies
[Results charts]

[26] http://spandh.dcs.shef.ac.uk/projects/chime/PCC/results.html
2nd CHiME Challenge [27]: small vocab baseline and all-system accuracies
[Results charts]

[27] http://spandh.dcs.shef.ac.uk/chime_challenge/chime2013/track1_results.html
2nd CHiME Challenge [28]: medium vocab baseline and all-system WERs
[Results charts]

[28] http://spandh.dcs.shef.ac.uk/chime_challenge/chime2013/track2_results.html
Full(er) MFCC mixing derivation
Standard features: Mel frequency cepstral coefficients (MFCCs)

    Waveform → Pre-emphasis → Framing, windowing → DFT → | · |² → Mel binning → log(·) → DCT → MFCCs
Mixing of energy spectra
In the time domain, a source is reverberated and mixed with noise:

    x[n] = s[n] ∗ h[n] + n[n]

In the short-time Fourier transform domain, this becomes

    X_ωt = S_ωt H_ω + N_ωt

the energy of which is

    |X_ωt|² = |S_ωt|² |H_ω|² + |N_ωt|² + 2α_ωt |S_ωt| |H_ω| |N_ωt|,   where α_ωt = cos(∠S_ωt H_ω − ∠N_ωt)

If S and N are uncorrelated, then E[α_ωt] = 0
Mixing of mel-frequency energy spectra
Mixing of linear-frequency spectra:

    |X_ωt|² = |S_ωt|² |H_ω|² + |N_ωt|² + 2α_ωt |S_ωt| |H_ω| |N_ωt|

This relation approximately holds for mel-frequency spectra as well:

    |X^(m)_ℓt|² ≈ |S^(m)_ℓt|² |H^(m)_ℓ|² + |N^(m)_ℓt|² + 2α^(m)_ℓt |S^(m)_ℓt| |H^(m)_ℓ| |N^(m)_ℓt|
Mixing of MFCCs
To simplify notation, assume a single time t, and define energy vectors

    x = |X^(m)_ℓt|²,   s = |S^(m)_ℓt|² |H^(m)_ℓ|²,   n = |N^(m)_ℓt|²,   for all mel channels ℓ

and define MFCC vectors f_x, f_s, f_n so that, e.g.,

    f_x = C log x,   x = e^(C† f_x)

where C is the DCT matrix, C C† = I, and log and exp operate per element.
Then the mel-frequency mixing equation

    |X^(m)_ℓt|² ≈ |S^(m)_ℓt|² |H^(m)_ℓ|² + |N^(m)_ℓt|² + 2α^(m)_ℓt |S^(m)_ℓt| |H^(m)_ℓ| |N^(m)_ℓt|

becomes (assuming α_ℓt ≈ 0)

    x ≈ s + n
    e^(C† f_x) = e^(C† f_s) + e^(C† f_n) = e^(C† f_s) (1 + e^(C† (f_n − f_s)))
    f_x = f_s + C log(1 + e^(C† (f_n − f_s)))
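A numeric check of this interaction formula using a full (untruncated) orthogonal DCT, for which C† is exactly the inverse transform; with truncated cepstra the relation only holds approximately.

```python
import numpy as np
from scipy.fftpack import dct, idct

# Verify f_x = f_s + C log(1 + exp(C_dag (f_n - f_s))) on random mel energies.
rng = np.random.default_rng(0)
n_mels = 26
s = rng.uniform(0.5, 2.0, n_mels)           # mel-domain speech energy
n = rng.uniform(0.1, 1.0, n_mels)           # mel-domain noise energy
x = s + n                                   # energies add (alpha ~ 0)

C = lambda v: dct(v, type=2, norm='ortho')           # cepstral analysis
C_dag = lambda v: idct(v, type=2, norm='ortho')      # its inverse (full DCT)

f_s, f_n, f_x = C(np.log(s)), C(np.log(n)), C(np.log(x))
f_x_pred = f_s + C(np.log(1 + np.exp(C_dag(f_n - f_s))))
print(np.allclose(f_x, f_x_pred))           # True: exact when no coefficients are dropped
```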
Speech-noise interaction function behaves as expected
When s ≫ n, C†(f_n − f_s) is very negative, so

    f_x = f_s + C log(1 + e^(C† (f_n − f_s)))
        ≈ f_s + C log(1 + e^(−∞))
        = f_s + C log(1)
        = f_s

When s ≪ n, C†(f_n − f_s) is very positive, so

    f_x = f_s + C log(1 + e^(C† (f_n − f_s)))
        ≈ f_s + C log(e^(C† (f_n − f_s)))
        = f_s + C C† (f_n − f_s)
        = f_s + f_n − f_s
        = f_n