Tandem acoustic modeling:
Neural nets for mainstream ASR?
Dan Ellis
International Computer Science Institute
Berkeley CA
dpwe@icsi.berkeley.edu
Outline
1. Tandem acoustic modeling
2. Inside Tandem systems: What's going on?
3. Future directions
Tandem acoustic modeling
• ETSI Aurora ‘noisy digits’ evaluation
- new features (for distributed speech recognition)
- Gaussian mixture HTK back-end provided
• How to use hybrid-connectionist tricks (multistream posterior combination etc.)?
→ Use posterior outputs as features for HTK...
= Tandem connection of two large statistical models:
a Neural Net (NN) and a Gaussian Mixture Model (GMM)
The Tandem structure
[Figure: block diagrams of the recognizer architectures]
• Hybrid Connectionist-HMM ASR:
Input sound → Feature calculation → Speech features → Neural net classifier
→ Phone probabilities → Noway decoder → Words
• Conventional ASR (HTK):
Input sound → Feature calculation → Speech features → Gauss mix models
→ Subword likelihoods → HTK decoder → Words
• Tandem modeling:
Input sound → Feature calculation → Speech features → Neural net classifier
→ Phone probabilities → Gauss mix models → HTK decoder → Words
• Refined system:
Input sound → Feature calculation (base + alternate features)
→ Neural net classifiers → Pre-nonlinearity outputs, summed
→ PCA orthog'n → Gauss mix models → Subword likelihoods
→ HTK decoder → Words
• Better results when posteriors are made more ‘Gaussian’
• Tandem allows posterior combination for HTK (a sketch follows)
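As a rough illustration of the refined front-end, here is a minimal numpy sketch of turning pre-nonlinearity net outputs into back-end features: the log-domain outputs of two nets (e.g. plp- and msg-based) are summed, then PCA-orthogonalized before being handed to the Gaussian mixtures. The function and argument names, and the exact combination rule, are illustrative assumptions rather than the system's actual code.

```python
import numpy as np

def tandem_features(linear_plp, linear_msg, basis=None):
    """Combine two nets' pre-softmax outputs into GMM-ready features.

    linear_plp, linear_msg: (frames, phones) pre-nonlinearity activations
    from nets trained on different base features (hypothetical inputs).
    """
    # Sum the log-domain activations: roughly a product-of-posteriors
    # stream combination, and more 'Gaussian' than the posteriors themselves.
    combined = linear_plp + linear_msg
    centered = combined - combined.mean(axis=0)

    if basis is None:
        # PCA orthogonalization: decorrelate the dimensions so they
        # better suit the back-end's diagonal-covariance Gaussians.
        _, vecs = np.linalg.eigh(np.cov(centered, rowvar=False))
        basis = vecs[:, ::-1]            # descending eigenvalue order

    # Each projected frame replaces the cepstra as an HTK observation.
    return centered @ basis, basis
```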
Training a tandem model
• Tandem modeling uses two feature spaces
- the NN estimates phone posteriors (discriminant)
- the GMM models subword likelihoods (distributions)
• Training procedure (sketched below)
- NN trained (backprop) on base features against forced-alignment phone targets
- GMM trained on the modified NN outputs via EM to maximise subword model likelihoods
- the HTK back-end knows nothing of the phone models
• Decoupled (good) but sequential
• Training sets?
- can use the same set for both - the two stages learn different information
- could use different sets - for cross-task robustness
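A compact sketch of this sequential procedure, with scikit-learn's MLPClassifier and GaussianMixture standing in for the actual hybrid net and HTK back-end; the argument names and layer sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def train_tandem(base_feats, phone_targets, state_ids, n_mix=8):
    """Two-stage tandem training (sketch).

    base_feats:    (frames, dims) acoustic features, e.g. cepstra
    phone_targets: per-frame phone labels from forced alignment
    state_ids:     per-frame subword-state labels for the back-end
    """
    # Stage 1 (discriminant): backprop the net to classify each
    # frame's phone against the forced-alignment targets.
    net = MLPClassifier(hidden_layer_sizes=(500,), max_iter=50)
    net.fit(base_feats, phone_targets)

    # 'Gaussianize' the outputs: log posteriors (a proxy for the
    # pre-nonlinearity values), then PCA-orthogonalize.
    feats = PCA().fit_transform(np.log(net.predict_proba(base_feats) + 1e-8))

    # Stage 2 (distributions): EM fits one Gaussian mixture per subword
    # state; this stage never sees the phone targets themselves.
    gmms = {s: GaussianMixture(n_components=n_mix).fit(feats[state_ids == s])
            for s in np.unique(state_ids)}
    return net, gmms
```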
Tandem system results
• It works very well:
[Figure: WER (%, log scale) vs. SNR (dB, averaged over 4 noises) for the
Aurora99 systems: HTK GMM baseline, Hybrid connectionist, Tandem, and
Tandem + PC. Average WER ratio to baseline: HTK GMM 100%, Hybrid 84.6%,
Tandem 64.5%, Tandem + PC 47.2%.]

System-features   Avg. WER 20-0 dB   Baseline WER ratio
HTK-mfcc          13.7%              100%
Neural net-mfcc   9.3%               84.5%
Tandem-mfcc       7.4%               64.5%
Tandem-msg+plp    6.4%               47.2%
Inside Tandem systems: What's going on?
• Visualizations of the net outputs
[Figure: utterance “one eight three” (MFP_183A), clean vs. 5 dB SNR in
‘Hall’ noise. Panels, top to bottom: spectrogram (freq / kHz),
cepstral-smoothed mel spectrum (freq / mel chan, dB), phone posterior
estimates, and hidden-layer linear outputs, each plotted against
time / s with phones labeled.]
• Neural net normalizes away noise
Feature space ‘magnification’
• Neural net performs a nonlinear remapping of the feature space
[Figure: smoothed spectrum, linear outputs, and posteriors for
/r/, /iy/, /w/ over the 0.7-0.8 s region of the utterance]
- small changes across critical boundaries result in large output
changes (see the sketch below)
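To make the ‘magnification’ concrete, a tiny illustrative demo (not from the slides): a single unit with a steep learned decision surface turns a 0.1 shift in the input feature into a near-flip of the output, which is the qualitative behavior plotted above. The weight and boundary values are invented for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A unit with a steep trained boundary at x = 0.5 (weight 50);
# numbers are illustrative, not fitted to the slide's data.
w, b = 50.0, 0.5
for x in (0.45, 0.55):                  # only 0.1 apart in feature space
    print(x, sigmoid(w * (x - b)))      # ~0.08 vs ~0.92 at the output
```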
Relative contributions
• Approximate relative impact on the baseline WER ratio for the different components:
[Figure: refined-system block diagram (plp and msg nets, pre-nonlinearity
summation, PCA orthog'n, Gauss mix models, HTK decoder) annotated with
the gains below]
- Combo over msg: +20%
- Combo over plp: +20%
- Combo over mfcc: +25%
- Pre-nonlinearity over posteriors: +12%
- KLT over direct: +8%
- Combo-into-HTK over combo-into-Noway: +15%
- NN over HTK: +15%
- Tandem over HTK: +35%
- Tandem over hybrid: +25%
- Tandem combo over HTK mfcc baseline: +53%
Omitting the GMMs
• “Tied posteriors” (Rottland & Rigoll, ICASSP 2000):
[Figure: Conventional ASR (HTK): Input sound → Feature calculation →
Speech features → Gauss mix models → Per-mixture likelihoods →
HMM decoder (sub-phone states via mixture weights) → Words]
- EM training of the GMMs and the HMM mixture weights
[Figure: Rottland & Rigoll (2000): Input sound → Feature calculation →
Speech features → Neural net classifier → Phone probabilities →
HMM decoder (sub-phone states via mixture weights) → Words]
- only the mixture weights are trained by EM (see the sketch below)

System             WSJ0 WER
Hybrid baseline    15.8%
“Tied posteriors”  9.4%
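Based on the description above, a minimal sketch of tied-posteriors state scoring: every sub-phone state shares the net's phone posteriors and differs only in its EM-trained mixture weights. The shapes and names here are illustrative assumptions, not Rottland & Rigoll's code.

```python
import numpy as np

def state_scores(phone_posteriors, mix_weights):
    """Score frames against sub-phone states, tied-posteriors style.

    phone_posteriors: (frames, phones) shared net outputs
    mix_weights:      (states, phones) per-state weights, rows sum to 1
    returns:          (frames, states) values for the HMM decoder
    """
    # Each state's score is a mixture-weighted sum over the shared
    # posteriors; only mix_weights are re-estimated by EM.
    return phone_posteriors @ mix_weights.T
```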
Discussion
• Key limitation: task-specific
- the NN is not like ordinary features (it's part of the trained system)
• Aurora1999 was a ‘matched condition’ task
- same noises added in training and test
- Aurora2000 has mismatched conditions
- Tandem modeling works just as well there
• How to relax the specificity?
- train on an alternative task?
- use articulatory targets
Future developments
• How to optimize the NN for this structure?
- integrated training... previous work
- HMM states as targets?
• Understanding the gains
- better analysis of each piece's contribution
- strengths of the different modeling approaches
- effects of model/training-set size variation
- “tied posteriors”?
• Other speech corpora
- need both NN and GMM systems...
- Switchboard is the next goal