THISL & RESPITE progress reports Dan Ellis International Computer Science Institute, Berkeley CA <dpwe@icsi.berkeley.edu> Outline 1 Thisl progress: GUI, SQI evaluation, exotic data 2 Respite progress: Choosing streams, NN feature extraction, Aurora evaluation results ICSI: Thisl & Respite progress - Dan Ellis 2000-01-23 - 1 Thisl: GUI 1 • Automatic voice detection - option to trigger on any utterance • Thomson NLP? • Ship to Thomson/BBC? ICSI: Thisl & Respite progress - Dan Ellis 2000-01-23 - 2 Spoken Query Interface (SQI) Evaluation • Full IR processing from spoken queries - 1-best hypotheses - all terms from (smallish) lattice (KW prec=16%) - (syntax not covered by Thomson NLP) Spoken Query Performance by Speaker 0.6 Average Precision 0.5 0.4 1-best lattice 0.3 0.2 0.1 0 0 10 20 30 40 50 Word Error Rate % • Spoken queries work OK - dumb lattice processing does not ICSI: Thisl & Respite progress - Dan Ellis 2000-01-23 - 3 Exotic data evaluation • Range of non-news programmes from BBC - natural history - interviews - features • Evaluation plan - basic WER%: (not so bad) Data set WER OOV Avg.Fr. Entropy Euro99Eval (6 news shows, 31k words) 29.2 ± 7.6 % 0.84% 1.14 ± 0.11 1999dec Exotic (13 varied files, 44k words) 38.9 ± 8.4 % 0.70% 1.25 ± 0.09 - separate acoustic & language model scores (for recognized & forced alignments) - Chase-style blame assignment ICSI: Thisl & Respite progress - Dan Ellis 2000-01-23 - 4 Combining feature streams 2 • How to allocate feature dimensions to models? - lower-dimension models train more quickly - higher-dimension models find more interactions Feature 1 calculation Feature 1 calculation Feature concatenate Input sound Feature 2 calculation Acoustic classifier Speech features Acoustic classifier Posterior multiply Speech features Phone probabilities to decoder Input sound Feature 2 calculation • ^ • Phone probabilities to decoder Acoustic classifier Variations of PLP & MSG for Aurora: Features Parameters baseline WER ratio plp12•dplp12 136k 97.6% plp12^dplp12 124k 89.6% msg3a•msg3b 145k 101.1% msg3a^msg3b 133k 85.8% plp12•dplp12•msg3a•msg3b 281k 76.5% plp12^dplp12^msg3a^msg3b 245k 74.1% plp12^dplp12•msg3a^msg3b 257k 63.0% ICSI: Thisl & Respite progress - Dan Ellis 2000-01-23 - 5 Choosing streams & combinations • Which combination methods? - structural co-dependence is better modeled in a single feature space - orthogonal variability generalizes better with later combination • Which feature streams? - best pairwise system was plp12^msg3b i.e. best single system plus worst! - combine streams with complementary information... ... look at conditional mutual information? ... of statistical model outputs? ... compensating for baseline performance? ICSI: Thisl & Respite progress - Dan Ellis 2000-01-23 - 6 Tandem connectionist models • Posterior combination for HTK systems? • Answer: use posteriors as HTK input features Feature calculation Input sound Neural net model Speech features (Hybrid system output) (Posterior decoder) (Phone probabilities) Pre-nonlinearity outputs PCA orthogn'n Subword likelihoods Othogonal features HTK GM model Tandem system output HTK decoder - (GMM system does not know they are phones) • Result: better performance than either alone! - neural net has trained discriminatively - GMM HMMs learn context-dependent structure →extract complementary info from training data System-features baseline WER ratio HTK-mfcc 100.0% Hybrid-mfcc 84.6% Tandem-mfcc 64.5% Tandem-plp+msg 47.2% ICSI: Thisl & Respite progress - Dan Ellis 2000-01-23 - 7 Aurora “Distributed SR” evaluation • 7 telecoms company submissions: Aurora DSR Evaluation 1999 Results Avg. WER -20-0dB Baseline improvement 100.00% 80.00% 60.00% 40.00% 20.00% Ta nd em 2 S6 S5 S4 S3 Ta nd em 1 -20.00% S2 S1 Ba se lin e 0.00% - Tandem systems from OGI-ICSI-Qualcomm • Best features for transmission? - (filtered) subband energies may be sufficient ICSI: Thisl & Respite progress - Dan Ellis 2000-01-23 - 8