Tandem acoustic modeling: Neural nets for mainstream ASR?

Dan Ellis
International Computer Science Institute, Berkeley CA
dpwe@icsi.berkeley.edu

Outline
1 Tandem acoustic modeling
2 Inside Tandem systems: What's going on?
3 Future directions


1. Tandem acoustic modeling

• ETSI Aurora 'noisy digits' evaluation
  - new features (for distributed speech recognition)
  - Gaussian mixture HTK back-end provided
• How to use hybrid-connectionist tricks (multistream posterior combination etc.)?
  → use the posterior outputs as features for HTK...
  = tandem connection of two large statistical models:
    Neural Net (NN) and Gaussian Mixture Model (GMM)


The Tandem structure

Hybrid connectionist-HMM ASR:
  input sound → feature calculation → neural net classifier → phone probabilities → Noway decoder → words

Conventional ASR (HTK):
  input sound → feature calculation → Gaussian mixture models → subword likelihoods → HTK decoder → words

Tandem modeling:
  input sound → feature calculation → neural net classifier → phone probabilities → Gaussian mixture models → subword likelihoods → HTK decoder → words

Refined system:
  input sound → base + alternate feature calculation → neural net classifiers → pre-nonlinearity outputs, combined (+) → PCA orthog'n → Gaussian mixture models → HTK decoder → words

• Better results when posteriors are made more 'Gaussian'
• Tandem allows posterior combination for HTK


Training a tandem model

• Tandem modeling uses two feature spaces:
  - NN estimates phone posteriors (discriminant)
  - GMM models subword likelihoods (distributions)
• Training procedure (sketched below):
  - NN trained (backprop) on base features to forced-alignment phone targets
  - GMM trained on modified NN outputs via EM, to maximise subword model likelihoods
  - HTK back-end knows nothing of phone models
• Decoupled (good) but sequential
• Training sets?
  - can use the same set for both: they learn different information
  - could use different sets, for cross-task robustness
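As a concrete illustration, here is a minimal numpy sketch of the feature side of this pipeline, assuming a single-hidden-layer MLP. All names, and the tanh/softmax choices, are illustrative rather than details of the actual Aurora system:

```python
# Minimal sketch of the tandem feature pipeline (illustrative names,
# not the actual Aurora system code).
import numpy as np

def mlp_linear_outputs(feats, w_hid, b_hid, w_out, b_out):
    """Forward pass up to, but not including, the output softmax.

    feats: (n_frames, n_dims) acoustic features (e.g. stacked PLP frames).
    Returns pre-nonlinearity phone activations, (n_frames, n_phones).
    """
    hid = np.tanh(feats @ w_hid + b_hid)   # hidden layer
    return hid @ w_out + b_out             # skip the softmax: more 'Gaussian'

def fit_pca(x, n_keep):
    """Estimate a PCA (KLT) orthogonalizing transform from training frames."""
    mu = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mu, full_matrices=False)
    return mu, vt[:n_keep].T               # mean and projection matrix

def tandem_features(feats, net, mu, proj):
    """Decorrelated pre-softmax outputs: the 'features' for the GMM back end."""
    lin = mlp_linear_outputs(feats, *net)
    return (lin - mu) @ proj
```

Typical use would be to fit the PCA on the net's linear outputs over the training set, then apply tandem_features to every utterance; the GMMs are then re-estimated on these features with standard EM in HTK, exactly as for MFCCs, and the net itself is never retrained.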
Tandem system results

• It works very well:

[Figure: WER (%, log scale) as a function of SNR (dB, averaged over 4 noises) for the Aurora99 systems: HTK GMM baseline, hybrid connectionist, Tandem, and Tandem + posterior combination (PC). Average WER ratio to baseline: HTK GMM 100%, hybrid 84.6%, Tandem 64.5%, Tandem + PC 47.2%.]

  System-features     Avg. WER 20-0 dB    Baseline WER ratio
  HTK-mfcc            13.7%               100%
  Neural net-mfcc      9.3%                84.5%
  Tandem-mfcc          7.4%                64.5%
  Tandem-msg+plp       6.4%                47.2%


2. Inside Tandem systems: What's going on?

Visualizations of the net outputs

[Figure: for the utterance "one eight three" (MFP_183A), clean and at 5 dB SNR in 'Hall' noise: spectrogram (freq/kHz), cepstral-smoothed mel spectrum (freq/mel chan), phone posterior estimates, and hidden-layer linear outputs, each plotted against time.]

• Neural net normalizes away the noise


Feature space 'magnification'

• Neural net performs a nonlinear remapping of the feature space

[Figure: smoothed spectrum, linear outputs, and posteriors for /r/, /iy/, /w/ over 0.7-0.8 s.]

  - small changes across critical boundaries result in large output changes


Relative contributions

• Approx. relative impact on baseline WER ratio of the different components (plp and msg feature streams, each through a neural net classifier, combined at the pre-nonlinearity outputs, PCA-orthogonalized, then into the HTK GMM back end):
  - NN over HTK: +15%
  - Tandem over HTK: +35%
  - Tandem over hybrid: +25%
  - pre-nonlinearity outputs over posteriors: +12%
  - KLT over direct: +8%
  - combo over msg: +20%
  - combo over plp: +20%
  - combo over mfcc: +25%
  - combo-into-HTK over combo-into-Noway: +15%
  - Tandem combo over HTK mfcc baseline: +53%


Omitting the GMMs

• "Tied posteriors" (Rottland & Rigoll, ICASSP 2000):

Conventional ASR (HTK):
  input sound → feature calculation → Gaussian mixture models → per-mixture likelihoods → HMM decoder (sub-phone states, mixture weights) → words
  - EM training of GMMs and HMM mixture weights

Rottland & Rigoll (2000):
  input sound → feature calculation → neural net classifier → phone probabilities → HMM decoder (sub-phone states, mixture weights) → words
  - only the mixture weights are trained by EM (see the sketch after the final slide)

  System              WSJ0 WER
  Hybrid baseline     15.8%
  "Tied posteriors"    9.4%


3. Discussion

• Key limitation: task-specific
  - the NN is not like features: it is part of the trained system
• Aurora1999 was a 'matched condition' task
  - same noises added in training and test
  - Aurora2000 has mismatched conditions
  - Tandem modeling works just as well
• How to relax the specificity?
  - train on an alternative task?
  - use articulatory targets


Future developments

• How to optimize the NN for this structure?
  - integrated training... previous work
  - HMM states as targets?
• Understanding the gains
  - better analysis of each piece's contribution
  - strengths of the different modeling approaches
  - effects of model/training-set size variation
  - "tied posteriors"?
• Other speech corpora
  - need both NN and GMM systems...
  - Switchboard is the next goal
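For reference, a rough sketch of the tied-posteriors scoring mentioned on the "Omitting the GMMs" slide. This is one reading of the Rottland & Rigoll scheme; the scaled-likelihood form (posterior divided by prior) is an assumption on our part, not code from their paper:

```python
# Rough sketch of tied-posteriors emission scoring (our reading of
# Rottland & Rigoll 2000; the posterior/prior scaling is an assumption).
import numpy as np

def tied_posterior_scores(posteriors, priors, weights):
    """Per-state emission scores from shared NN phone posteriors.

    posteriors: (n_frames, n_phones) NN outputs for one utterance.
    priors:     (n_phones,) phone priors from the training alignment.
    weights:    (n_states, n_phones) state-specific mixture weights
                (rows sum to 1) -- the only emission parameters that
                EM re-estimates; the net itself is shared and fixed.
    Returns (n_frames, n_states) scores for the HMM decoder.
    """
    scaled = posteriors / priors   # posteriors -> scaled likelihoods
    return scaled @ weights.T      # mix the shared 'components' per state
```

Decoding would then proceed with these scores in place of per-state GMM likelihoods, which is what lets the GMMs be dropped entirely.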