Improved recognition by combining different features and different systems

Dan Ellis
International Computer Science Institute, Berkeley CA
dpwe@icsi.berkeley.edu

Outline
  1 The power of combination
  2 Different ways to combine
  3 Examples & results
  4 Conclusions

1. The power of combination

• Combination is a general approach in statistics
  - several models → several estimates
  - if the 'correct' parts are more consistent than the 'wrong' parts, averaging reduces the error
• Intuition: choose the 'right' model for each case
  - need to know when each is 'right'
  - need models that are 'right' at different times
• Continuum of combination schemes
  - conservative, ignorant of the models
  - domain-specific, trained on the models

Relevant aspects of the speech signal

• Speech is highly redundant
  - intelligible despite large distortions
  - multiple cues for each phoneme
• Speech is very variable
  - redundancy leaves room for variability
  - speakers can use different subsets of cues

[Figure: spectrograms (0-4000 Hz vs time) of three utterances of "five four" (FAG_54A.08, FEE_54A.08, FEI_54A.08), with aligned phone labels h# f ay v f ao r, showing the same cues realized differently by different speakers]

Combinations for speech recognition

• Speech recognition abounds with different models
  - different feature processing, statistical models, search techniques ...
• Redundancy in the signal → many different ways to estimate, each making different kinds of errors
• General combination is easier than figuring out what is good in each estimator
  - the training data & training process are usually the limiting factors

Outline
  1 The power of combination
  2 Different ways to combine
    - Feature combination
    - Posterior combination
    - Hypothesis combination
    - System hybrids
  3 Examples & results
  4 Conclusions

2. Speech recognizer components

[Diagram: sound → feature calculation → feature vectors → acoustic classifier (using network weights) → phone probabilities → HMM decoder (using word models and a language model) → phone & word labeling]

Different ways to combine

• Combine after each stage of the recognizer:
  - Feature combination: calculate two feature streams from the input sound, merge the speech features, then use a single acoustic classifier and HMM decoder
  - Posterior combination: give each feature stream its own acoustic classifier, merge the phone probabilities, then use a single HMM decoder
  - Hypothesis combination: give each stream a complete classifier + decoder chain, then merge the word hypotheses into the final hypotheses
• Lots of other ways...

Feature combination (FC)

• Concatenate the different feature vectors
• Train a single acoustic model

[Figure: PLP log-spectral features and MSG (msg1a) features, freq/Bark vs time/s, for the same utterance]

• It helps:

    Features         Avg. WER
    plp              8.9%
    msg              9.5%
    FC: plp + msg    8.1%
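To make the FC step concrete, here is a minimal sketch of frame-synchronous feature concatenation. All names, dimensions, and the random stand-in data are illustrative assumptions, not details of the original system:

    # Minimal sketch of feature combination (FC): concatenate per-frame
    # feature vectors from two streams, then train ONE acoustic model on
    # the result. Array names and dimensions are illustrative assumptions.
    import numpy as np

    def combine_features(a: np.ndarray, b: np.ndarray) -> np.ndarray:
        """Frame-synchronous concatenation of two (n_frames, dim) streams."""
        n = min(len(a), len(b))          # guard against off-by-one frame counts
        return np.hstack([a[:n], b[:n]])

    plp = np.random.randn(200, 9)        # stand-in for real PLP features
    msg = np.random.randn(200, 28)       # stand-in for real MSG features
    fc = combine_features(plp, msg)      # shape (200, 37): one model sees both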
Posterior combination (PC)

[Figure: per-frame phone posteriors for PLP cepstra (plp12ddN), MSG features (msg3N), and their combination on a noisy digit utterance "three seven" (N3_SNR5/FIL_3O37A); the combined stream gives a cleaner phone labeling than either stream alone]

• Sometimes better than FC:

    Features         Avg. WER
    plp              8.9%
    msg              9.5%
    PC: plp + msg    7.1%

Hypothesis combination (HC)
(J. Fiscus, NIST)

• ROVER: Recognizer Output Voting Error Reduction
• Final outputs from several recognizers, each word tagged with a confidence
• Align & vote for the resulting word sequence
• 25% relative improvement over the best single Broadcast News system (14.1% → 10.6%)

Other combinations: 'Tandem' acoustic modeling
(with Hermansky et al., OGI)

• Can we combine with conventional models?

[Diagram: feature calculation → neural net model → phone probabilities → posterior decoder → words; alternatively, the net's pre-nonlinearity outputs are orthogonalized with PCA and fed as subword likelihoods to an HTK Gaussian-mixture (GM) model and HTK decoder → words]

• Result: better performance than either alone!
  - neural net & Gaussian mixture models extract different information from the training data

    System (features)    Avg. WER
    HTK (mfcc)           13.7%
    Neural net (mfcc)    9.3%
    Tandem (mfcc)        7.4%

How & what to combine
(with Jeff Bilmes, U. Washington)

• Combination is good, but...
  - which streams should we combine?
  - which combination methods should we use?
• The best streams to combine carry complementary information
  - can be measured as conditional mutual information, I(X ; Y | C)
  - low CMI suggests combination potential
• The choice of combination method depends on the situation:
  - FC for streams with complex interdependence
  - PC makes the best use of limited training data
  - HC allows different subword units

Outline
  1 The power of combination
  2 Different ways to combine
  3 Examples & results
    - The SPRACH system for Broadcast News
    - OGI-ICSI-Qualcomm Aurora recognizer
  4 Conclusions

3. The SPRACH Broadcast News recognizer
(with Cambridge & Sheffield)

• Multiple feature streams
• MLP and RNN models, including PC
• ROVER for hypothesis combination

[Diagram: three parallel streams — Perceptual Linear Prediction → CI RNN, Modulation Spectrogram → CI MLP, Perceptual Linear Prediction → CD RNN — each feeding a Chronos decoder, with the three outputs merged by ROVER into the utterance hypothesis]

• 20% WER overall (compared with ~27% per stream)

The OGI-ICSI-Qualcomm system for Aurora noisy digits

• PC of feature streams + Tandem combination of neural nets and Gaussian mixture models:

[Diagram: plp, msg, and traps feature calculations each feed their own neural net model; the nets' phone probabilities are combined and passed as subword likelihoods to an HTK GM model and HTK decoder, producing the word hypotheses]

• 60% reduction in word errors compared to the MFCC-HTK baseline

Aurora evaluation results

[Chart: ETSI Aurora 1999 evaluation — average WER reduction relative to the baseline for submitted systems S1-S6 and the two Tandem systems; the Tandem systems show the largest reductions, around 60%]
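To make the combination stages in these systems concrete, here is a minimal sketch of posterior combination followed by a tandem step. The slides do not specify the averaging rule, so the renormalized geometric mean (log-domain averaging) and the PCA recipe below are assumptions for illustration, as are all names and the toy data:

    # Minimal sketch of posterior combination (PC) plus a "tandem" step.
    # The geometric-mean rule and the PCA recipe are assumptions; the real
    # systems' exact recipes may differ.
    import numpy as np

    def combine_posteriors(*streams: np.ndarray) -> np.ndarray:
        """Merge (n_frames, n_phones) posterior arrays from several nets."""
        log_mean = np.mean([np.log(p + 1e-10) for p in streams], axis=0)
        post = np.exp(log_mean)                        # geometric mean per cell
        return post / post.sum(axis=1, keepdims=True)  # renormalize each frame

    # Toy posteriors over 3 phone classes from two feature streams:
    p_plp = np.array([[0.7, 0.2, 0.1], [0.4, 0.5, 0.1]])
    p_msg = np.array([[0.6, 0.3, 0.1], [0.1, 0.8, 0.1]])
    combined = combine_posteriors(p_plp, p_msg)  # single stream for the decoder

    # "Tandem" use: decorrelate the log posteriors with PCA and feed them
    # as features to a conventional Gaussian-mixture (HTK-style) system.
    logp = np.log(combined + 1e-10)
    logp -= logp.mean(axis=0)                    # center before PCA
    _, _, vt = np.linalg.svd(logp, full_matrices=False)
    tandem_feats = logp @ vt.T                   # decorrelated tandem features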
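Similarly, here is a toy version of the ROVER-style voting used for hypothesis combination. Real ROVER builds a word transition network and can weight votes by per-word confidence; this simplified sketch merely aligns each hypothesis against the first with difflib and takes a plain majority vote per slot — an illustrative assumption, not NIST's implementation:

    # Toy ROVER-style hypothesis combination: align word strings from
    # several recognizers and take a majority vote per aligned slot.
    from collections import Counter
    import difflib

    def rover_vote(hyps: list[list[str]]) -> list[str]:
        base = hyps[0]
        slots = [Counter([w]) for w in base]     # one vote tally per base word
        for hyp in hyps[1:]:
            matcher = difflib.SequenceMatcher(a=base, b=hyp)
            for op, i1, i2, j1, j2 in matcher.get_opcodes():
                if op in ("equal", "replace"):   # vote where words align
                    for i, j in zip(range(i1, i2), range(j1, j2)):
                        slots[i][hyp[j]] += 1
        return [c.most_common(1)[0][0] for c in slots]

    hyps = [["five", "four", "nine"],
            ["five", "oar", "nine"],
            ["five", "four", "wine"]]
    print(rover_vote(hyps))                      # -> ['five', 'four', 'nine']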
4. Conclusions

• Combination is a simple way to leverage multiple models
  - speech recognition has lots of models
  - redundancy in the speech signal allows each model to find different 'information'
• Lots of ways to combine
  - after feature extraction, classification, decoding
  - 'tandem' hybrids etc.
• Significant gains from simple approaches
  - e.g. 25% relative improvement from posterior averaging
• Better insight into the sources of these benefits will lead to even greater gains