Broadcast News: Features & acoustic modelling
Dan Ellis
International Computer Science Institute, Berkeley CA <dpwe@icsi.berkeley.edu>

Outline
1. The modulation-filtered spectrogram
2. Features and combinations
3. Net size and training size
4. Results by condition
5. Whole-utterance features
6. Gender dependence

SPRACH/Broadcast News - Dan Ellis - 1998dec15

SPRACH BN System Overview
• Abbot + 2nd acoustic model + ...
  [Block diagram: Audio → feature calculation (PLP, MSG) → acoustic models (CI RNNs, CD RNNs, MLP) → frame-level combination → Noway decode → ROVER hypothesis combination → result]

The modulation-filtered spectrogram (Brian Kingsbury)
• Goal: invariance to variable acoustics
  - filter out modulations irrelevant to speech
  - channel adaptation (on-line automatic gain control)
  - multiple representations
  [Diagram: Bark-scale power-spectral filterbank → envelope filtering (lowpass 0-16 Hz or bandpass 2-16 Hz) → cascaded AGC stages (τ = 160 ms, 320 ms on the lowpass path; τ = 160 ms, 640 ms on the bandpass path) → lowpass and bandpass feature streams]
• Results (small vocabulary):

  Feature   Clean test WER   Reverb test WER
  plp       5.9%             22.2%
  msg       6.1%             13.8%

Feature choice
• Base ABBOT system: normalized PLP
• Additional features band-limited to 4 kHz
  - help with telephone speech
  - just to be ‘different’
• Searched over features, deltas, context window
  - plp12N-8k best alone
  - rasta performed poorly (16 ms windows)
  - msg1N-8k best for combination with the RNN

  Feature    Elements   WER% alone   WER% in combo with RNN
  (RNN baseline alone: 33.2)
  plp12N     13         36.7         31.1
  ras12+dN   26         44.4         32.5
  msg1N      28         39.4         29.9
  (2000HU, 37h trainset, align2 labels, 7hyp decode)

Net and training set: Size matters
• Huge amount of training data available
  - 142h = 32M training patterns @ 16 ms
• Search over net size / training set size
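The pattern count quoted above follows directly from the 16 ms frame rate; a quick arithmetic check (plain Python, not from the original slides, with illustrative variable names):

```python
# One training pattern per 16 ms frame over 142 hours of audio.
hours, frame_s = 142, 0.016
patterns = hours * 3600 / frame_s        # 511,200 s / 0.016 s per frame
print(round(patterns / 1e6), "million")  # roughly the 32M quoted on the slide
```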
  [Figure: WER for PLP12N nets vs. net size and training-set size; axes: hidden-layer size (log2 re 500 units) and training-set size (log2 re 9.25 hours); WER% scale 32-44]

Evolution of the acoustic model
• Increasing size of classifier and training set
• Iterative re-alignment of target labels
  - in combination with the RNN base
• Steady improvement:

  HUs          2000     2000     4000     8000
  Trnset       37h      37h      74h      142h
  Labels       align1   align2   align2   align4
  Trn time     4 days   4 days   7 days   21 days
  WER%         39.4     38.6     35.3     31.6
  Combo WER%   29.9     30.4     29.3     26.8
  (msg1N, 7hyp decode)

Results by acoustic condition
• Evaluation results broken down by focus condition (F0-F5 plus FX)
• 4 kHz audio should help F2 (telephone)
• msg features might help poor acoustics

  System        ALL    F0     F1     F2     F3     F4     F5     FX
  RNN           29.9   15.2   29.3   51.6   33.0   32.8   19.3   56.7
  MSG (8000)    29.7   17.7   31.9   39.4   32.5   33.3   29.8   49.4
  RNN+MSG       25.4   14.3   24.4   38.0   31.0   28.7   18.5   49.0
  Sprach’98-1   21.7   11.6   24.7   32.4   33.8   15.5   27.9   29.6
  Sprach’98-2   20.0   13.6   23.8   28.4   18.9   23.0   15.7   48.3
  (full decode)

Whole-utterance features
• BN groups have focused on adaptation & normalization
  - VTLN, MLLR, SAT
• Maybe do a similar thing with extra net inputs
  → whole-utterance pre-normalization: feature-dimension variances as constant inputs to the net
  [Figure: ‘Wacky’ feature benefit vs. utterance-duration threshold (- = h4e-97-sub, -- = h4ee-97); threshold 0-60 s, WER% 36-38]

Gender dependence (GD)
• Train separate nets on Female/Male data
  - males represented 2:1 in BN
  - oracle labels?
  Net                  F% (2224)   M% (3711)   WER% (5938)
  2000HU/25h UF        28.3        53.1        43.8
  2000HU/25h UM        42.1        33.3        36.6
  4000HU/50h U         29.6        33.6        32.1
  Oracle best          25.7        30.5        28.7
  Combination scheme   27.7        32.2        30.5
  (plp12N-8k, 7hyp decode)

• Best practical scheme
  - use classifier entropy to choose the M or F net
  - use decoder likelihood to choose GD or GI
• Now training on the full set
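The entropy-based net choice above can be sketched in a few lines: run the utterance through both gender-dependent nets and keep the one whose phone posteriors are least uncertain on average. A minimal numpy illustration, not the SPRACH implementation; the function names and the 54-class phone set are assumptions for the example:

```python
import numpy as np

def mean_entropy(posteriors):
    """Average per-frame entropy (bits) of a (frames x classes) posterior matrix."""
    p = np.clip(posteriors, 1e-12, 1.0)   # guard against log(0)
    return float(np.mean(-np.sum(p * np.log2(p), axis=1)))

def pick_gender_net(post_female, post_male):
    """Choose the net whose output distribution is most confident (lowest entropy)."""
    return "F" if mean_entropy(post_female) < mean_entropy(post_male) else "M"

# Toy check: a peaked (confident) output should beat a flat (uncertain) one.
confident = np.full((100, 54), 1e-3)
confident[:, 0] = 1.0 - 53e-3             # each row still sums to 1
diffuse = np.full((100, 54), 1.0 / 54)    # uniform over 54 phone classes
print(pick_gender_net(confident, diffuse))
```

A flat distribution over 54 classes has entropy log2(54) ≈ 5.75 bits per frame, so any net producing reasonably peaked posteriors wins the comparison.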