Tandem connectionist feature extraction for conventional HMM systems
Dan Ellis • International Computer Science Institute, Berkeley • dpwe@icsi.berkeley.edu
Hynek Hermansky & Sangita Sharma* • Oregon Graduate Institute • {hynek,sangita}@ece.ogi.edu
for ICASSP-2000, Istanbul • 2000-05-30

Summary: In hybrid connectionist-HMM speech recognition, the acoustic classifier is a neural network rather than Gaussian mixture models (GMMs). Here, we use the outputs of such a network as feature inputs to a conventional GMM-based recognizer, obtaining a relative error-rate reduction of more than 30%.

Introduction
• In a hybrid connectionist-HMM speech recognizer, a neural net estimates the posterior probability of a context-independent phone label, p(q_k | X), given a window of acoustic features X.
[Figure: input sound → feature calculation → neural network acoustic classifier over the feature window t_n .. t_n+w → posterior probabilities for context-independent phone classes (h#, pcl, bcl, tcl, dcl, ...).]
• Neural network classifiers have several attractions. The nets are:
  - discriminant - the outputs are trained to choose between phones;
  - able to learn correlated features and unusual distributions - since no distributional assumptions are made.
However, because the model is represented opaquely in the network weights, certain operations (such as adaptation) can be harder than with the more common Gaussian mixture models (GMMs).
• For the ETSI DSR “Aurora” evaluation (TIDIGITS with different noises added at various SNRs), we were investigating both hybrid recognizers (using our in-house systems) and conventional GMM-based recognizers (using the standard HTK toolkit).
• We wanted to transfer some of the strengths of the connectionist system to the HTK setup, so we tried using the phone posterior probability estimates as input features for HTK training.

The Tandem architecture
• We start with our normal hybrid connectionist-HMM system, trained to estimate posterior probabilities for each of the 24 phones present in TIDIGITS.
• Rather than using this probability stream as input to a posterior-based HMM decoder, we treat the network outputs as features for HTK.
• The tandem architecture is so named because it uses two statistical models - neural network and Gaussian mixture - in series.
[Figure: hybrid system output - speech features → neural net classifier → phone probabilities → posterior decoder → subword likelihoods; tandem system output - speech features → neural net classifier → pre-nonlinearity outputs → PCA orthogonalization → orthogonal features → HTK GM model → HTK decoder.]

Training procedure
• First, the neural network is trained by back-propagation under a cross-entropy criterion, against forced-alignment phone targets.
• Next, the Gaussian mixtures are trained by EM to learn the relationship between the network's phone estimates and the training utterances.
• The HTK GMM system is not informed about the phones used in the neural-net training; it independently learns the appropriate patterns.
• Preprocessing the posteriors before passing them to HTK greatly improved the results (see the sketch after this section):
  - Firstly, we take the network's linear outputs from before the final ‘softmax’ nonlinearity. These have a more Gaussian distribution than the heavily skewed posterior probabilities.
  - Secondly, we apply a full-rank Principal Component Analysis to orthogonalize the feature dimensions.
• The features are then passed to the standard HTK recognizer defined for this task by ETSI, where they are modeled by Gaussian mixtures for each of 16 states in 11 whole-word models (the digits 1-9, “zero” and “oh”).
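To make the tandem feature path concrete, here is a minimal NumPy sketch of the preprocessing described above - the network's pre-softmax linear outputs decorrelated by a full-rank PCA estimated on the training data. This is an illustrative sketch, not the code used for the reported experiments; the names (TandemPCA, train_acts, test_acts) are ours.

```python
import numpy as np

class TandemPCA:
    """Full-rank PCA over pre-softmax network outputs (illustrative sketch)."""

    def fit(self, linear_outputs):
        # linear_outputs: (n_frames, n_phones) activations *before* the softmax.
        self.mean_ = linear_outputs.mean(axis=0)
        centered = linear_outputs - self.mean_
        # Eigen-decomposition of the covariance; full rank, so no dimensions
        # are discarded - the transform only rotates (orthogonalizes) the axes.
        cov = np.cov(centered, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1]          # descending variance
        self.components_ = eigvecs[:, order]
        return self

    def transform(self, linear_outputs):
        # Rotated outputs become the feature vectors modeled by the HTK GMMs.
        return (linear_outputs - self.mean_) @ self.components_

# Usage (hypothetical variable names): train_acts / test_acts would hold the
# per-frame pre-softmax outputs of the phone classifier (24 dimensions for the
# TIDIGITS phone set).
#   pca = TandemPCA().fit(train_acts)
#   tandem_features = pca.transform(test_acts)   # written out as HTK features
```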
Why should it work?
• Nets model discriminant posteriors via nonlinear weighted sums; GMMs model distributions with parameterized kernels. These very different approaches can extract complementary information even from limited training data.
• As shown in the visualization below, the neural net does a good job of magnifying smooth changes as features cross critical class boundaries.

Results
• The ETSI Aurora task defines 4 noise types and 7 SNR levels for a 28-condition test set. (Training uses 5 SNR levels.)
• Average word error rates (WERs) by SNR show a consistent 30-40% relative WER advantage for tandem modeling.
[Figure: WER as a function of SNR for various MFCC-based Aurora99 systems - WER / % (log scale, averaged over the 4 noises) against SNR / dB (-5 dB to clean) for the HTK GMM baseline, the hybrid connectionist system and the tandem system.]
• Average WER ratio to baseline: HTK GMM 100%, hybrid 84.6%, tandem 64.5%.

Visualization
• Comparing the spectrum, the MFCC features, the network's linear outputs and the posterior estimates for clean and noisy examples illustrates the relative robustness of the network outputs to high noise levels.
[Figure: “one eight three” (MFP_183A) - spectrogram, cepstral-smoothed mel spectrum, hidden layer linear outputs and phone posterior estimates over the 24 phone classes (h# .. k, t, q), shown for the clean utterance and with ‘Hall’ noise added at 5 dB SNR; detail panels show the linear outputs and posteriors for /r/, /iy/ and /w/ around 0.7-0.8 s.]

Discussion & Future work
• Feeding posteriors into HTK also allowed us to use posterior combination of four feature streams to achieve under 40% of the baseline WER (see Sharma et al., “Feature extraction .. Aurora database”); a schematic sketch of this kind of combination follows this list.
• Phone targets were a somewhat arbitrary choice for network training. Targets related to the HMM states used in the HTK model might improve results still further.
• We need to investigate how well GMM techniques such as MLLR will work with nonlinearly transformed features.
• Unlike conventional features, the net is highly task- and language-specific. Training to articulatory targets on a large corpus might help.
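The first bullet above refers to posterior combination of several feature streams. The sketch below shows one simple, commonly used combination rule - a per-frame weighted average of log posteriors (a geometric mean), renormalized per frame. This is an assumption for illustration, not necessarily the exact rule used in Sharma et al.; the function name is ours.

```python
import numpy as np

def combine_posterior_streams(stream_posteriors, weights=None):
    """Frame-level combination of phone-posterior streams.

    stream_posteriors: list of (n_frames, n_phones) arrays, one per
    acoustic feature stream, each row summing to 1.
    Combination here is a weighted average in the log domain
    (a geometric mean), renormalized per frame - one simple rule;
    the evaluated system may have used a different one.
    """
    if weights is None:
        weights = np.ones(len(stream_posteriors)) / len(stream_posteriors)
    log_combined = sum(w * np.log(p + 1e-10)
                       for w, p in zip(weights, stream_posteriors))
    combined = np.exp(log_combined)
    return combined / combined.sum(axis=1, keepdims=True)

# The combined posteriors can then be preprocessed (log / PCA) and used as
# tandem features for the HTK system in the same way as a single stream.
```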
Related work
• Several researchers have previously investigated neural nets as feature preprocessors in speech recognition. See for example:
Y. Bengio, R. De Mori, G. Flammia and R. Kompe, “Global optimization of a neural network-hidden Markov model hybrid,” IEEE Trans. on Neural Networks, 3:252-258, 1992.
V. Fontaine, C. Ris and J.M. Boite, “Nonlinear discriminant analysis for improved speech recognition,” Proc. Eurospeech-97, Rhodes, 4:2071-2074, 1997.
G. Rigoll and D. Willett, “A NN/HMM hybrid for continuous speech recognition with a discriminant nonlinear feature extraction,” Proc. ICASSP-98, Seattle, April 1998.

Acknowledgments
Work at ICSI was supported by the European Union under ESPRIT project RESPITE (28149). Work at OGI was supported by DoD under MDA904-98-1-0521, NSF under IRI-9712579, and by an industrial grant from Intel Corporation.
* Sangita Sharma is now with Intel Corporation.