Columbia: Recent + Future
• More information
  - FDLP / PLP2 features
• Better classifiers
  - MI-based broad-class experts
• Reducing variability
  - Temporal variation
  - Formant "automatic gain control" (AGC)
• Signal model
  - "Deformable spectrograms"
EARS Retreat - Dan Ellis 2004-08-05

Broad-Class Experts (Patricia Scanlon)
• MI-based feature masks make superior class-specific classifiers (vowels, stops...)
• Smaller models: good for data-limited case
• Apply to ASR by 'patching in' probabilities via separate broad-class center detector

MI-Based Class Experts
• Idea: Different speech sounds have different information distribution
  - .. as identified by MI to phone | class
• Good for reducing model complexity
  - benefits disappear given enough data
• Not measuring joint MI
  - quick hack: checkerboard

Broad-Class Detector
• Expert gives Pr(phone | class, features)
  - still need Pr(class | features)
• Repeat same approach
  - separate detectors for each broad class
  - measure MI from TF cell to class
  - train MLP from those features
• False accept/false detect tradeoff
  - try to detect only center of phone
  - reasonable vowel recognition with 10% insertions (6.3% deletions) of centers

Overall System
• 'Patch in' expert's posteriors:
  $P(q_i \mid X) = \sum_{\text{class}} P(q_i \mid \text{class}, X) \cdot P(\text{class} \mid X)$
  - 'non-expert' MLP for when P(class|X) are small
• TIMIT phone err rate:
    Baseline:      28.4%
    Oracle P(VC):  26.9%
    Real P(VC):    28.0%
    Vowels+Frics:  27.6%
• Still looking at:
  - using more experts
  - better P(class|X)

Temporal Variation (Sambarta Bhattacharjee, Banky Omodunbi)
• Idea: Normalize phone durations by averages
  - .. to reveal per-speaker bias
  - .. and timing variation within phrases
• Focus on vowels
  - per-phone deviations are very noisy
• Use to vary sampling/modeling?
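
The 'patch in' sum on the Overall System slide can be sketched for a single frame as follows. This is a minimal sketch under stated assumptions, not the system's implementation: the function name `patch_in`, the dictionary layout, the example probabilities, and in particular the choice to hand the probability mass not claimed by any expert class to the 'non-expert' MLP are all hypothetical.

```python
def patch_in(expert_post, class_post, baseline_post):
    """Combine expert and baseline phone posteriors for one frame.

    expert_post:   {class_name: {phone: P(phone | class, X)}}
    class_post:    {class_name: P(class | X)} for classes that have an expert
    baseline_post: {phone: P(phone | X)} from the 'non-expert' MLP
    """
    # ASSUMPTION: mass not assigned to any expert class goes to the baseline.
    leftover = 1.0 - sum(class_post.values())
    if leftover < -1e-9:
        raise ValueError("class posteriors sum to more than 1")
    leftover = max(leftover, 0.0)

    # Start from the down-weighted baseline posteriors.
    combined = {phone: leftover * p for phone, p in baseline_post.items()}

    # Add each expert's term: P(phone | class, X) * P(class | X).
    for cls, p_cls in class_post.items():
        for phone, p in expert_post[cls].items():
            combined[phone] = combined.get(phone, 0.0) + p_cls * p
    return combined


# Illustrative numbers only: one vowel expert, detector says P(vowel | X) = 0.8.
frame = patch_in(
    expert_post={"vowel": {"aa": 0.7, "iy": 0.3}},
    class_post={"vowel": 0.8},
    baseline_post={"aa": 0.4, "iy": 0.1, "s": 0.5},
)
print(frame)
```

With a confident class detector the expert dominates the vowel posteriors; as P(class|X) shrinks, the result falls back toward the baseline MLP, which is one plausible reading of "'non-expert' MLP for when P(class|X) are small".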

Formant AGC (Eric Fuller, Sambarta Bhattacharjee)
• Hypothesis: Casual speech has 'compressed' formant motion
  - can we 'enhance' formant motions to make speech more canonical / read-like?
[Figure: "Rhonda Speaking"; three panels (freq / kHz, freq / Bark, freq / Bark) against time / s]

Read vs. Spontaneous
• Speaker-dependent means, vars of PLP pole locations in read vs. spontaneous speech
  - variance of angle of pole 3 discriminates well for red and green speakers
  - but opposite changes!

Deformable Spectra (Nebojsa Jojic (MSR), Manuel Reyes)
• Accurate spectral modeling in conventional HMMs requires 1000s of states
  - cumbersome, especially transition matrices
• Observation: Speech spectra undergo minor deformations
  - suggests a different generative model:
[Figure: a patch of N_C = 3 bins of frame X_t is obtained from a patch of N_P = 5 bins of frame X_{t-1} through transformation matrix T, shown as]
  $T = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}$

States+Transformation Model
• Time-frequency state grid
• State →
  - explicit prototype
  - or a transformation on prior frame
[Figure: a) Signal → infer underlying states → b) Transformation Map, on a frequency vs. time grid of states X and transformations T. Green: identity transform; Yellow/Orange: upward motion (darker is steeper); Blue: downward motion (darker is steeper)]

Two-layer model
• Source-filter decomposition
  - pitch and formants have different dynamics
• Apply transformation models for both
  - log-spectra: sum of excitation & filter
  - inference does separation
[Figure: Signal (selected bin) = Harmonics (harmonic tracking) + Formants (formant tracking)]
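
The transformation matrix on the Deformable Spectra slide can be read as copying N_C = 3 of the N_P = 5 prior-frame bins with a one-bin offset. The sketch below only illustrates that reading: the helper names `shift_matrix` and `predict_patch`, the centering convention, and the patch values are assumptions, and the slides do not say which direction of offset corresponds to upward motion.

```python
def shift_matrix(n_c, n_p, shift):
    """Build an n_c x n_p 0/1 matrix that copies prior-frame bins with an offset.

    ASSUMPTION: shift = 0 aligns the n_c-bin patch with the middle of the
    n_p-bin patch (the identity transform); other shifts move the copied
    window by that many bins.
    """
    center_offset = (n_p - n_c) // 2
    rows = []
    for r in range(n_c):
        col = r + center_offset + shift
        if not 0 <= col < n_p:
            raise ValueError("shift moves the patch outside the prior frame")
        rows.append([1 if c == col else 0 for c in range(n_p)])
    return rows


def predict_patch(T, prior_patch):
    """Matrix-vector product: current patch predicted from the prior patch."""
    return [sum(t * x for t, x in zip(row, prior_patch)) for row in T]


# shift=1 reproduces the 3 x 5 matrix shown on the slide.
T = shift_matrix(n_c=3, n_p=5, shift=1)
for row in T:
    print(row)

# Illustrative prior-frame patch (5 bins); values are made up.
prior = [0.1, 0.2, 0.9, 0.4, 0.3]
print(predict_patch(T, prior))
```

Under this convention a 3 x 5 patch pair allows only shifts of -1, 0, and +1, so each time-frequency cell chooses among a small set of transformations instead of among thousands of spectral prototypes.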