Tandem connectionist feature extraction for conventional HMM systems

Dan Ellis • International Computer Science Institute, Berkeley • dpwe@icsi.berkeley.edu
Hynek Hermansky & Sangita Sharma* • Oregon Graduate Institute • {hynek,sangita}@ece.ogi.edu
Summary: In hybrid connectionist-HMM speech recognition, the acoustic classifier is a neural network rather than a set of Gaussian mixture models (GMMs).
Here, we use the outputs of such a network as feature inputs to a conventional GMM-based recognizer, obtaining >30% relative error rate reduction.
Introduction
• In a hybrid connectionist-HMM speech recognizer, a neural net estimates the posterior probability p(q_k | X) of a context-independent phone label q_k, given a window of acoustic features X (see the figure and sketch below):
[Figure: input sound → feature calculation → neural network acoustic classifier, applied to a window of feature frames t_n ... t_{n+w}.]
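As a rough illustration of this posterior estimator, here is a minimal numpy sketch of a one-hidden-layer net mapping a window of feature frames to phone posteriors; the layer sizes, window length and tanh nonlinearity are illustrative assumptions, not the actual Aurora network:

import numpy as np

# Illustrative dimensions only - not the network used in these experiments.
N_FEATS = 15      # acoustic features per frame (assumption)
WINDOW = 9        # frames of context presented to the net (assumption)
N_HIDDEN = 500    # hidden units (assumption)
N_PHONES = 24     # context-independent phone classes in TIDIGITS

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(N_FEATS * WINDOW, N_HIDDEN))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(scale=0.1, size=(N_HIDDEN, N_PHONES))
b2 = np.zeros(N_PHONES)

def phone_posteriors(window_feats):
    # Map a (WINDOW, N_FEATS) block of features X to p(q_k | X).
    x = window_feats.reshape(-1)          # concatenate the context window
    h = np.tanh(x @ W1 + b1)              # hidden layer
    a = h @ W2 + b2                       # linear (pre-softmax) outputs
    e = np.exp(a - a.max())               # softmax: posteriors sum to 1
    return e / e.sum()

post = phone_posteriors(rng.normal(size=(WINDOW, N_FEATS)))
print(post.shape, round(post.sum(), 3))   # (24,) 1.0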
• Neural network classifiers have several attractions. The nets are:
  - discriminant - the output is trained to choose between phones
  - able to learn correlated features and strange distributions, since no distribution assumptions are made
However, because of the opacity of the model represented in the
weights, certain operations (such as adaptation) can be harder than
with the more common Gaussian mixture models (GMMs).
• For the ETSI DSR “Aurora” evaluation (TIDIGITS with different noises
added at various SNRs), we were researching both hybrid recognizers
(using our in-house systems) and conventional GMM-based
recognizers (using the standard HTK toolkit).
• We wanted to transfer some of the strengths of the connectionist
system to the HTK setup, so we tried using the phone posterior
probability estimates as input features for our HTK training.
The Tandem architecture
• We start with our normal hybrid connectionist-HMM system, trained to estimate posterior probabilities for each of the 24 phones present in TIDIGITS.
• Rather than using this probability stream as input to a posterior-based HMM decoder, we treat the outputs as features for HTK.
• Preprocessing the posteriors before passing them to HTK greatly improved the results:
• Firstly, we take the network's linear outputs from before the final ‘softmax’ nonlinearity. These have a more Gaussian distribution than the very skewed posterior probabilities.
• Secondly, we apply Principal Component Analysis
(full-rank) to orthogonalize the feature dimensions.
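To make these two preprocessing steps concrete, here is a minimal numpy sketch, assuming linear_out holds the network's pre-softmax outputs for the training frames (the array shapes and data below are placeholders):

import numpy as np

def fit_pca(feats):
    # Full-rank PCA: centre the data and find the orthonormal eigenvectors
    # of its covariance; no dimensions are discarded.
    mean = feats.mean(axis=0)
    cov = np.cov(feats - mean, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)
    basis = eigvecs[:, ::-1]              # order by decreasing variance
    return mean, basis

def apply_pca(feats, mean, basis):
    # Rotate onto the PCA axes; dimensionality is unchanged.
    return (feats - mean) @ basis

# Placeholder "pre-softmax" frames standing in for real network outputs.
linear_out = np.random.default_rng(1).normal(size=(1000, 24))
mean, basis = fit_pca(linear_out)
tandem_feats = apply_pca(linear_out, mean, basis)   # features handed to HTK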
• The features are passed to the standard HTK recognizer defined for this task by ETSI.
• The features are modeled by Gaussian mixtures for each of 16 states in 11 whole-word models (1-9, “zero” & “oh”).
[Figure: tandem system block diagram - input sound → feature calculation → speech features → neural net classifier → pre-nonlinearity outputs → PCA orthogonalization → orthogonal features → HTK GM model → subword likelihoods → HTK decoder → tandem system output. The hybrid system instead feeds the posterior probabilities for context-independent phone classes to a posterior decoder, giving the hybrid system output.]
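The 11-word, 16-state-per-word structure above can be pictured with a small sketch, assuming the usual left-to-right whole-word topology; the 0.5/0.5 transition split is only a placeholder, not the trained values:

import numpy as np

WORDS = ["one", "two", "three", "four", "five", "six",
         "seven", "eight", "nine", "zero", "oh"]
N_STATES = 16

def left_to_right_transitions(n_states, p_stay=0.5):
    # Each state either repeats (self-loop) or advances to the next state.
    A = np.zeros((n_states, n_states))
    for s in range(n_states - 1):
        A[s, s] = p_stay
        A[s, s + 1] = 1.0 - p_stay
    A[-1, -1] = 1.0          # word-final state; exit is handled by the decoder
    return A

word_models = {w: left_to_right_transitions(N_STATES) for w in WORDS}
print(len(word_models), word_models["zero"].shape)   # 11 (16, 16)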
Training procedure
• The tandem architecture is so named because it uses two statistical models - neural network and Gaussian mixture - in series.
• First, the neural network is trained by back-propagation, minimizing cross-entropy against force-aligned phone targets.
• Next, the Gaussian mixtures are trained by EM to relearn the relationship between the phone estimates and the utterances.
• The HTK GMM system is not informed about the phones used in the neural net training; it independently learns the appropriate patterns.
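As a simplified stand-in for this second stage (in practice HTK's embedded EM re-estimation), the sketch below fits one diagonal-covariance Gaussian mixture per HMM state from the tandem features; tandem_feats and the frame-level state_align are assumed to exist, and the mixture size is illustrative:

import numpy as np
from sklearn.mixture import GaussianMixture

def train_state_gmms(tandem_feats, state_align, n_components=3):
    # Fit an EM-trained diagonal-covariance mixture to the frames of each state.
    gmms = {}
    for state in np.unique(state_align):
        frames = tandem_feats[state_align == state]
        gmms[state] = GaussianMixture(n_components=n_components,
                                      covariance_type="diag").fit(frames)
    return gmms

# At recognition time each state's mixture supplies a per-frame log-likelihood:
#   loglik = gmms[state].score_samples(frame[None, :])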
Results
• The ETSI Aurora task defines 4 noise types and 7 SNR levels for a 28-condition test set. (Training has 5 SNR levels.)
• Average word-error rates (WERs) by SNR show a consistent 30-40% relative WER advantage of using tandem modeling.
[Figure: WER as a function of SNR for various MFCC-based Aurora99 systems - HTK GMM baseline, hybrid connectionist, and tandem; WER / % (log scale, averaged over 4 noises) against SNR / dB.]
• Average WER ratio to baseline:
    HTK GMM: 100%
    Hybrid:   84.6%
    Tandem:   64.5%
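These ratios are what the ">30% relative error rate reduction" in the summary refers to; as a quick check:

# Relative WER reduction implied by the tandem/baseline ratio above.
baseline, tandem = 100.0, 64.5
relative_reduction = 100.0 * (baseline - tandem) / baseline
print(f"{relative_reduction:.1f}% relative WER reduction")   # 35.5%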
Why should it work?
• Nets model discriminant posteriors via nonlinear weighted sums; GMMs model distributions with parameterized kernels. These very different approaches can extract complementary information even from limited training data.
• As the visualization below shows, the neural net does a good job of magnifying smooth changes as features cross critical class boundaries.

Visualization
• Comparing the spectrum, MFCC feature basis, network linear outputs and posterior estimates for clean and noisy examples illustrates the relative robustness of the network outputs to high noise levels.
[Figure: “one eight three” (MFP_183A), clean and at 5 dB SNR in ‘Hall’ noise - cepstral-smoothed mel spectrum, hidden-layer linear outputs and phone posterior estimates over the 24-phone set (q, h#, ax, uw, ow, ao, ah, ay, ey, eh, ih, iy, w, r, n, v, th, f, z, s, kcl, tcl, k, t), with detail of the linear outputs and posteriors for /r/, /iy/, /w/.]
Discussion & Future work
• Feeding posteriors into HTK also allowed us to use posterior combination of four feature streams to achieve under 40% of baseline WER (see Sharma et al., “Feature extraction .. Aurora database”); a sketch of this kind of stream combination follows the list.
• Phone targets were a somewhat arbitrary choice for network training.
Targets related to the HMM states used in the HTK model might
improve results still further.
• We need to investigate how well GMM techniques such as MLLR will work with nonlinearly-transformed features.
• Unlike conventional features, the net is highly task- and language-specific. Training to articulatory targets on a large corpus might help.
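The stream combination mentioned in the first point above can be sketched as follows; averaging per-stream log posteriors (a geometric mean) is one common rule, offered here only as an illustration rather than the exact combination used in Sharma et al.:

import numpy as np

def combine_posteriors(stream_posts):
    # stream_posts: array of shape (n_streams, n_frames, n_phones).
    # Average the log posteriors across streams, then renormalize per frame.
    log_avg = np.mean(np.log(np.clip(stream_posts, 1e-12, None)), axis=0)
    combined = np.exp(log_avg)
    return combined / combined.sum(axis=1, keepdims=True)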
Related work
• Several researchers have previously investigated neural nets as feature preprocessors in speech recognition. See for example:
Y. Bengio, R. De Mori, G. Flammia and R. Kompe, “Global optimization of a neural network-hidden Markov model hybrid,” IEEE Trans. on Neural Networks, 3:252-258, 1992.
V. Fontaine, C. Ris and J.M. Boite, “Nonlinear Discriminant Analysis for improved speech recognition,” Proc. Eurospeech-97, Rhodes, 4:2071-2074, 1997.
G. Rigoll and D. Willett, “A NN/HMM hybrid for continuous speech recognition with a discriminant nonlinear feature extraction,” Proc. ICASSP-98, Seattle, April 1998.
Acknowledgments
Work at ICSI was supported by the European Union under ESPRIT project RESPITE (28149). Work at OGI was supported by DoD under MDA904-98-1-0521, NSF under IRI-9712579, and by an industrial grant from Intel Corporation.
* Sangita Sharma is now with Intel Corporation