Improved recognition by combining different features and different systems

Dan Ellis
International Computer Science Institute, Berkeley CA
dpwe@icsi.berkeley.edu

Outline
  1 The power of combination
  2 Different ways to combine
  3 Examples & results
  4 Conclusions

1. The power of combination

• Combination is a general approach in statistics
  - several models → several estimates
  - if the 'correct' parts are more consistent than the 'wrong' parts, averaging reduces the error
• Intuition: choose the 'right' model for each case
  - need to know when each is 'right'
  - need models that are 'right' at different times
• Continuum of combination schemes
  - conservative, ignorant of the models
  - domain-specific, trained on the models

Relevant aspects of the speech signal

• Speech is highly redundant
  - intelligible despite large distortions
  - multiple cues for each phoneme
• Speech is very variable
  - redundancy leaves room for variability
  - speakers can use different subsets of cues

[Figure: spectrograms (0-4000 Hz vs time) of three utterances of "five four" (FAG_54A.08, FEE_54A.08, FEI_54A.08), with aligned phone labels h# f ay v f ao r, showing the same cues realized differently by different speakers]

Combinations for speech recognition

• Speech recognition abounds with different models
  - different feature processing, statistical models, search techniques ...
• Redundancy in the signal → many different ways to estimate, each making different kinds of errors
• General combination is easier than figuring out what is good in each estimator
  - the training data & training process are usually the limiting factors

Outline
  1 The power of combination
  2 Different ways to combine
    - Feature combination
    - Posterior combination
    - Hypothesis combination
    - System hybrids
  3 Examples & results
  4 Conclusions

2. Speech recognizer components

[Diagram: sound → feature calculation → feature vectors → acoustic classifier (using network weights) → phone probabilities → HMM decoder (using word models and a language model) → phone & word labeling]

Different ways to combine

• Combine after each stage of the recognizer:
  - Feature combination: calculate two feature streams from the input sound, merge the speech features, then use a single acoustic classifier and HMM decoder
  - Posterior combination: give each feature stream its own acoustic classifier, merge the phone probabilities, then use a single HMM decoder
  - Hypothesis combination: give each stream a complete classifier + decoder chain, then merge the word hypotheses into the final hypotheses
• Lots of other ways...

Feature combination (FC)

• Concatenate the different feature vectors
• Train a single acoustic model

[Figure: PLP log-spectral features and MSG (msg1a) features, freq/Bark vs time/s, for the same utterance]

• It helps:

    Features         Avg. WER
    plp              8.9%
    msg              9.5%
    FC: plp + msg    8.1%
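To make the FC step concrete, here is a minimal sketch of frame-synchronous feature concatenation. All names, dimensions, and the random stand-in data are illustrative assumptions, not details of the original system:

    # Minimal sketch of feature combination (FC): concatenate per-frame
    # feature vectors from two streams, then train ONE acoustic model on
    # the result. Array names and dimensions are illustrative assumptions.
    import numpy as np

    def combine_features(a: np.ndarray, b: np.ndarray) -> np.ndarray:
        """Frame-synchronous concatenation of two (n_frames, dim) streams."""
        n = min(len(a), len(b))          # guard against off-by-one frame counts
        return np.hstack([a[:n], b[:n]])

    plp = np.random.randn(200, 9)        # stand-in for real PLP features
    msg = np.random.randn(200, 28)       # stand-in for real MSG features
    fc = combine_features(plp, msg)      # shape (200, 37): one model sees both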
Posterior combination (PC)

[Figure: per-frame phone posteriors for PLP cepstra (plp12ddN), MSG features (msg3N), and their combination on a noisy digit utterance "three seven" (N3_SNR5/FIL_3O37A); the combined stream gives a cleaner phone labeling than either stream alone]

• Sometimes better than FC:

    Features         Avg. WER
    plp              8.9%
    msg              9.5%
    PC: plp + msg    7.1%

Hypothesis combination (HC)
(J. Fiscus, NIST)

• ROVER: Recognizer Output Voting Error Reduction
• Final outputs from several recognizers, each word tagged with a confidence
• Align & vote for the resulting word sequence
• 25% relative improvement over the best single Broadcast News system (14.1% → 10.6%)

Other combinations: 'Tandem' acoustic modeling
(with Hermansky et al., OGI)

• Can we combine with conventional models?

[Diagram: feature calculation → neural net model → phone probabilities → posterior decoder → words; alternatively, the net's pre-nonlinearity outputs are orthogonalized with PCA and fed as subword likelihoods to an HTK Gaussian-mixture (GM) model and HTK decoder → words]

• Result: better performance than either alone!
  - neural net & Gaussian mixture models extract different information from the training data

    System (features)    Avg. WER
    HTK (mfcc)           13.7%
    Neural net (mfcc)    9.3%
    Tandem (mfcc)        7.4%

How & what to combine
(with Jeff Bilmes, U. Washington)

• Combination is good, but...
  - which streams should we combine?
  - which combination methods should we use?
• The best streams to combine carry complementary information
  - can be measured as conditional mutual information, I(X ; Y | C)
  - low CMI suggests combination potential
• The choice of combination method depends on the situation:
  - FC for streams with complex interdependence
  - PC makes the best use of limited training data
  - HC allows different subword units

Outline
  1 The power of combination
  2 Different ways to combine
  3 Examples & results
    - The SPRACH system for Broadcast News
    - OGI-ICSI-Qualcomm Aurora recognizer
  4 Conclusions

3. The SPRACH Broadcast News recognizer
(with Cambridge & Sheffield)

• Multiple feature streams
• MLP and RNN models, including PC
• ROVER for hypothesis combination

[Diagram: three parallel streams — Perceptual Linear Prediction → CI RNN, Modulation Spectrogram → CI MLP, Perceptual Linear Prediction → CD RNN — each feeding a Chronos decoder, with the three outputs merged by ROVER into the utterance hypothesis]

• 20% WER overall (compared with ~27% per stream)

The OGI-ICSI-Qualcomm system for Aurora noisy digits

• PC of feature streams + Tandem combination of neural nets and Gaussian mixture models:

[Diagram: plp, msg, and traps feature calculations each feed their own neural net model; the nets' phone probabilities are combined and passed as subword likelihoods to an HTK GM model and HTK decoder, producing the word hypotheses]

• 60% reduction in word errors compared to the MFCC-HTK baseline

Aurora evaluation results

[Chart: ETSI Aurora 1999 evaluation — average WER reduction relative to the baseline for submitted systems S1-S6 and the two Tandem systems; the Tandem systems show the largest reductions, around 60%]
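To make the combination stages in these systems concrete, here is a minimal sketch of posterior combination followed by a tandem step. The slides do not specify the averaging rule, so the renormalized geometric mean (log-domain averaging) and the PCA recipe below are assumptions for illustration, as are all names and the toy data:

    # Minimal sketch of posterior combination (PC) plus a "tandem" step.
    # The geometric-mean rule and the PCA recipe are assumptions; the real
    # systems' exact recipes may differ.
    import numpy as np

    def combine_posteriors(*streams: np.ndarray) -> np.ndarray:
        """Merge (n_frames, n_phones) posterior arrays from several nets."""
        log_mean = np.mean([np.log(p + 1e-10) for p in streams], axis=0)
        post = np.exp(log_mean)                        # geometric mean per cell
        return post / post.sum(axis=1, keepdims=True)  # renormalize each frame

    # Toy posteriors over 3 phone classes from two feature streams:
    p_plp = np.array([[0.7, 0.2, 0.1], [0.4, 0.5, 0.1]])
    p_msg = np.array([[0.6, 0.3, 0.1], [0.1, 0.8, 0.1]])
    combined = combine_posteriors(p_plp, p_msg)  # single stream for the decoder

    # "Tandem" use: decorrelate the log posteriors with PCA and feed them
    # as features to a conventional Gaussian-mixture (HTK-style) system.
    logp = np.log(combined + 1e-10)
    logp -= logp.mean(axis=0)                    # center before PCA
    _, _, vt = np.linalg.svd(logp, full_matrices=False)
    tandem_feats = logp @ vt.T                   # decorrelated tandem features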
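Similarly, here is a toy version of the ROVER-style voting used for hypothesis combination. Real ROVER builds a word transition network and can weight votes by per-word confidence; this simplified sketch merely aligns each hypothesis against the first with difflib and takes a plain majority vote per slot — an illustrative assumption, not NIST's implementation:

    # Toy ROVER-style hypothesis combination: align word strings from
    # several recognizers and take a majority vote per aligned slot.
    from collections import Counter
    import difflib

    def rover_vote(hyps: list[list[str]]) -> list[str]:
        base = hyps[0]
        slots = [Counter([w]) for w in base]     # one vote tally per base word
        for hyp in hyps[1:]:
            matcher = difflib.SequenceMatcher(a=base, b=hyp)
            for op, i1, i2, j1, j2 in matcher.get_opcodes():
                if op in ("equal", "replace"):   # vote where words align
                    for i, j in zip(range(i1, i2), range(j1, j2)):
                        slots[i][hyp[j]] += 1
        return [c.most_common(1)[0][0] for c in slots]

    hyps = [["five", "four", "nine"],
            ["five", "oar", "nine"],
            ["five", "four", "wine"]]
    print(rover_vote(hyps))                      # -> ['five', 'four', 'nine']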
4. Conclusions

• Combination is a simple way to leverage multiple models
  - speech recognition has lots of models
  - redundancy in the speech signal allows each model to find different 'information'
• Lots of ways to combine
  - after feature extraction, classification, decoding
  - 'tandem' hybrids etc.
• Significant gains from simple approaches
  - e.g. 25% relative improvement from posterior averaging
• Better insight into the sources of these benefits will lead to even greater gains