RESPITE progress report

Dan Ellis
International Computer Science Institute, Berkeley CA
<dpwe@icsi.berkeley.edu>

Outline
1 Hybrid AURORA system
2 Using hybrid results with HTK
3 Multifeature design
4 Multistream pronunciation modeling

ICSI: RESPITE progress - Dan Ellis 1999sep13


1 Hybrid AURORA system

• AURORA noisy digits task
  - TIDIGITS + 4 kinds of noise x 7 SNR levels
  - standard HTK back-end provided
  - objective: standard features for mobile phones
• ICSI’s small-vocab techniques
  - modulation-filtered spectrogram (MSG) features
  - posterior probability combination (multistream)
• Can we combine them?
  - hybrid NN-HMM baseline system for AURORA
  - use a TIDIGITS lexicon & phone models
  - bootstrap labels from NUMBERS95 network
  - use 480 hidden-unit net as N95


Baseline AURORA results

• AURORA test has 28 numbers...
• ...report just a few
  - mean WER % for ∞, 15, 5, -5 dB SNR + overall mean ratio to HTK MFCC baseline

System   Feature     Clean   SNR15   SNR5    SNR-5   Avg. ratio
HTK      MFCC+d      1.4%    3.7%    15.9%   68.0%   100.0%
Hybrid   MFCC+d      2.2%    2.6%    9.9%    49.1%   82.1%
Hybrid   plp12N+d    2.6%    2.8%    10.6%   47.9%   89.6%
Hybrid   msg3N       2.1%    2.9%    11.6%   49.2%   87.1%
HTK      msg3NKG     5.6%    6.4%    21.5%   66.8%   184.5%

[Figure: WER % (log axis, 10^0 to 10^2) against SNR (clean, 15dB, 5dB, -5dB); curves: HTK MFCC, Hybrid MFCC, Hybrid plp, Hybrid MSG, HTK MSG]


Combination systems

• Posterior combination has worked well

[Figure: block diagram. Labels: Input sound; Feature 1 calculation; Feature 2 calculation; Speech features; Acoustic classifier (x2); Phone probabilities; Posterior combination; HMM decoder; Word hypotheses]

  P(qi|X1,X2) ∝ P(qi|X1)·P(qi|X2) / P(qi)   ... if X1⊥X2|q

• But it depends on features

Features                          Clean   SNR15   SNR5    SNR-5   Avg. ratio
plp12Nd                           2.6%    2.8%    10.6%   47.9%   89.6%
msg3N                             2.1%    2.9%    11.6%   49.2%   87.1%
plp12Nd-msg3N                     1.7%    2.4%    9.5%    47.3%   74.1%
plp12N-msg3aN • dplp12N-msg3bN    1.7%    2.1%    8.8%    46.9%   70.1%
plp12Nd • msg3N                   1.5%    1.9%    8.2%    43.0%   63.0%


2 Using hybrid results with HTK

• AURORA specification: use HTK recognizer
• How to put combinations into HTK
  - feature combination (with LDA?)
  - posteriors as features (only 24 phone classes)

[Figure: block diagram. Labels: Input sound; plp calculation; msg calculation; Speech features; Neural net model (x2); x; Phone probabilities; Noway decoder; HTK GM model; Subword likelihoods; HTK decoder; Word hypotheses]

• HTK handles it!

System   Feature       Clean   SNR15   SNR5    SNR-5   Avg. ratio
Hybrid   plp • msg     1.5%    1.9%    8.2%    43.0%   63.0%
HTK      posteriors    1.1%    1.9%    8.2%    46.1%   59.1%


Tailoring posteriors for HTK

• Posteriors are very un-Gaussian
  - log-transform doesn’t help much
• A linear output layer helps a lot
  - remove softmax: yi = exp(xi)/Σj(exp(xj))

[Figure, left: Histograms for elements 1, 2 and 23 (=h#) of lna1L (logprob) feature set]
[Figure, right: Histograms for elements 1,2,3 & 23 of lin out plp12Nd ftrs lin1]

• Do combinations by summing linear outputs

System   Feature        Clean   SNR15   SNR5    SNR-5   Avg. ratio
HTK      posteriors     1.1%    1.9%    8.2%    46.1%   59.1%
HTK      log(p)         0.9%    1.8%    8.9%    48.8%   58.6%
HTK      Σ(lin. o/p)    0.9%    1.6%    7.7%    44.1%   51.6%


3 Multifeature design (Mike Shire)

• ‘Optimal’ features for different conditions
  - subband envelope domain
  - linear-discriminant analysis (LDA) for filter coeffs
• Modulation-frequency domain responses for clean, reverb, mixture:

[Figure: two panels, "1st Discriminant Filter" and "2nd Discriminant Filter"; response in dB (0 to -30) against modulation frequency in Hz (log axis, 10^0 to 10^1); curves: Clean, Light Reverb, Severe Reverb, Clean+Severe Reverb]


4 Multistream pronunciation models (Barry Chen)

• Combine streams in the decoder
  - ‘HMM combination’
  - separate state assignment for each stream
  - constrain (disallow?) asynchrony
• Are particular asynchronies important?
  - between certain bands?
  - between certain sounds?
  - in particular directions?
• Re-estimate transition probabilities in 1-state asynchrony 4-band models
  - no improvement yet
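
Illustrative sketch (not part of the original slides): the posterior combination rule on the "Combination systems" slide, P(qi|X1,X2) ∝ P(qi|X1)·P(qi|X2) / P(qi), can be written out in a few lines. The code below is a minimal Python sketch under stated assumptions: the three-class posteriors, priors and pre-softmax activations are made-up toy values, and the function names `combine_posteriors` and `softmax` are hypothetical helpers, not taken from the ICSI or HTK software. It also checks one mathematical identity that relates to "Do combinations by summing linear outputs": summing two nets' pre-softmax activations and then applying the softmax gives the same distribution as multiplying their posteriors and renormalizing (that is, the rule above without the division by the prior).

```python
import math


def combine_posteriors(p1, p2, prior):
    """Combine two streams' class posteriors, assuming the streams are
    conditionally independent given the class:
        P(q | X1, X2)  is proportional to  P(q | X1) * P(q | X2) / P(q)
    The result is renormalized so that it sums to one."""
    unnorm = [a * b / pq for a, b, pq in zip(p1, p2, prior)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def softmax(x):
    """Numerically stable softmax: y_i = exp(x_i) / sum_j exp(x_j)."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]


# Toy example with 3 classes (the slides use 24 phone classes).
# All numbers here are invented for illustration only.
p_plp = [0.7, 0.2, 0.1]        # hypothetical posteriors from a plp-based net
p_msg = [0.5, 0.4, 0.1]        # hypothetical posteriors from an msg-based net
prior = [1 / 3, 1 / 3, 1 / 3]  # assumed flat class priors

combined = combine_posteriors(p_plp, p_msg, prior)

# Hypothetical pre-softmax ("linear") outputs of the two nets.
x_plp = [2.0, 0.5, -1.0]
x_msg = [1.0, 1.5, -0.5]

# Summing linear outputs, then softmax ...
summed = softmax([a + b for a, b in zip(x_plp, x_msg)])
# ... matches the renormalized product of the two softmax outputs.
# A constant "prior" cancels in the normalization, so no prior is applied.
product = combine_posteriors(softmax(x_plp), softmax(x_msg), [1.0, 1.0, 1.0])
```

When both streams agree on the most likely class, the combined posterior for that class is sharper than either input. Note that the systems reported in the slides pass the summed linear outputs to HTK as features rather than through a softmax; the identity above is only meant to show why a sum in the linear domain plays the role of a product in the posterior domain.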