“Pushing the Envelope”
A six-month report
By the Novel Approaches team, with site leaders:
Nelson Morgan, ICSI
Hynek Hermansky, OGI
Dan Ellis, Columbia
Kemal Sönmez, SRI
Mari Ostendorf, UW
Hervé Bourlard, IDIAP/EPFL
George Doddington, NA-sayer

Overview
Nelson Morgan, ICSI

The Current Cast of Characters
• ICSI: Morgan, Q. Zhu, B. Chen, G. Doddington
• UW: M. Ostendorf, Ö. Çetin
• OGI: H. Hermansky, S. Sivadas, P. Jain
• Columbia: D. Ellis, M. Athineos
• SRI: K. Sönmez
• IDIAP: H. Bourlard, J. Ajmera, V. Tyagi

Rethinking Acoustic Processing for ASR
• Escape dependence on the spectral envelope
• Use multiple front-ends across time/frequency
• Modify statistical models to accommodate the new front-ends
• Design optimal combination schemes for multiple models

Task 1: Pushing the Envelope (aside)
• Problem: The spectral envelope is a fragile information carrier
  [Diagram: OLD: a single 10 ms estimate of sound identity. PROPOSED: ith, kth, ..., nth estimates over spans of up to 1 s, combined by information fusion into an estimate of sound identity over time]
• Solution: Probabilities from multiple time-frequency patches

Task 2: Beyond Frames…
• Problem: Features & models interact; new features may require different models
  [Diagram: OLD: short-term features feeding a conventional HMM. PROPOSED: advanced features feeding a multi-rate, dynamic-scale classifier]
• Solution: Advanced features require advanced models, free of the fixed-frame-rate paradigm

Today’s presentation
• Infrastructure: training, testing, software
• Initial Experiments: pilot studies
• Directions: where we’re headed

Infrastructure
Kemal Sönmez, SRI (SRI/UW/ICSI effort)

Initial Experimental Paradigm
• Focus on a small task to facilitate exploratory work (later move to CTS)
• Choose a task where the LM is fixed & plays a minor role (to focus on acoustics)
• Use mismatched train/test data:
  - To avoid tuning to the task
  - To facilitate the later move to CTS
• Task: OGI Numbers / Train: swbd + macrophone

Hub5 “Short” Training Set
• Composition (total ~60 hours):

  Corpus         Hours (Male / Female)
  callhome        2.8 / 13.8
  switchboard*    5.9 /  6.7
  credit-card    12.4 /  4.3
  macrophone      7.1 /  5.8

  * subset of SWB-1 hand-checked at SRI for accuracy of transcriptions and segmentations
• WER 2-4% higher vs. the full 250+ hour training set

Reduced UW Training Set
• A reduced training set to shorten experiment turn-around time
• Choose training utterances with per-frame likelihood scores close to the training-set average
• 1/4th of the original training set
• Statistics (gender, data set constituencies) are similar to those of the full training set:

  Data set constituencies
                macrophone  callhome  creditcard  other switchboard  male/female
  “short”          32%        32%        12%           24%            45/55%
  Reduced (UW)     38%        28%        12%           22%            48/52%

• For OGI Numbers, no significant WER sacrifice in the baseline HMM system (worse for Hub5)

Development Test Sets
• A “Core-Subset” of OGI’s Numbers95 corpus: telephone speech of people reciting addresses, telephone numbers, zip codes, or other miscellaneous items
• The “Core-Subset” or “CS” consists of utterances that were phonetically hand-transcribed, intelligible, and contained only numbers
• Vocabulary size: 32 words (digits + eleven, twelve… twenty… hundred…thousand, etc.)
  Data Set Name                  Total Utterances  Total Words  Duration (hours)
  Numbers95-CS Cross Validation        357             1353          ~0.2
  Numbers95-CS Development            1206             4673          ~0.6
  Numbers95-CS Test                   1227             4757          ~0.6

Statistical Modeling Tools
• HTK (Hidden Markov Toolkit) for establishing an HMM baseline and for debugging
• GMTK (Graphical Models Toolkit) for implementing advanced models with multiple feature/state streams:
  - Allows direct dependencies across streams
  - Not limited by the single-rate, single-stream paradigm
  - Rapid model specification/training/testing
• SRI Decipher system for providing lattices to rescore (later, in CTS experiments)
• Neural network tools from ICSI for posterior probability estimation; other statistical software from IDIAP

Baseline SRI Recognizer for the Numbers Task
• Bottom-up state-clustered Gaussian mixture HMMs for acoustic modeling
• Acoustic adaptation to speakers using affine mean and variance transforms [not used for Numbers]
• Vocal-tract length normalization using maximum likelihood estimation [not helpful for Numbers]
• Progressive search with lattice recognition and N-best rescoring [to be used in later work]
• Bigram LM

Initial Experiments
Barry Chen, ICSI
Hynek Hermansky, OHSU (OGI)
Özgür Çetin, UW

Goals of Initial Experiments
• Establish performance baselines:
  - HMM + standard features (MFCC, PLP)
  - HMM + current best from ICSI/OGI
• Develop infrastructure for new models:
  - GMTK for multi-stream & multi-rate features
  - Novel features based on large time spans
  - Novel features based on temporal fine structure
• Provide fodder for future error analysis

ICSI Baseline Experiments
• PLP-based SRI system
• “Tandem” PLP-based ANN + SRI system
• Initial combination approach

Development Baseline: Gender-Independent PLP System

  Training Set                                    Word, Sentence Error Rate on Numbers95-CS Test Set
  Full “Short” Hub5 (85k utterances, ~64.9 hrs)   3.4%, 10.2%
  UW Reduced Hub5 (20k utterances, ~18.8 hrs)     3.8%, 11.4%

Phonetically Trained Neural Net
• Multi-Layer Perceptron (input, hidden, and output layers)
• Trained using the error-backpropagation technique; outputs interpreted as posterior probabilities of the target classes
• Training targets: 47 monophone targets from forced alignment using the SRI Eval 2002 system
• Training utterances: UW Reduced Hub5 set
• Training features: PLP12+e+d+dd, mean & variance normalized on a per-conversation-side basis
• MLP topology: 9-frame context window (4 frames in the past + current frame + 4 frames in the future); 351 input units, 1500 hidden units, and 47 output units; total number of parameters: ~600k

Baseline ICSI Tandem
• Outputs of the neural net before the final softmax non-linearity used as inputs to PCA
• PCA without dimensionality reduction
• 4.1% word and 11.7% sentence error rate on the Numbers95-CS test set

Baseline ICSI Tandem+PLP
• PLP stream concatenated with the neural-net posteriors stream
• PCA reduces the dimensionality of the posteriors stream to 16 (keeping 95% of the overall variance)
• 3.3% word and 9.5% sentence error rate on the Numbers95-CS test set

Word and String Error Rates on Numbers95-CS Test Set
[Chart: word and string error rates for the systems above]

OGI Experiments: New Features in EARS
• Develop on the home-grown ASR system (phoneme-based HTK)
• Pass the most promising features to ICSI for running in the SRI LVCSR system
• So far the new features match the performance of the baseline PLP features but do not exceed it; an advantage is seen in combination with the baseline

Looking to the Human Auditory System for Design Inspiration
• Psychophysics:
  - Components within a certain frequency range (several critical bands) interact [e.g. frequency masking]
  - Components within a certain time span (a few hundreds of ms) interact [e.g. temporal masking]
• Physiology:
  - 2-D (time-frequency) matched filters for activity in the auditory cortex [cortical receptive fields]

TRAP-based HMM-NN Hybrid ASR
[Diagram: mean & variance normalized, Hamming-windowed critical-band trajectories (101-point inputs) each feed a band-specific multilayer perceptron (MLP); a merging MLP combines them into posterior probabilities of phonemes, which drive the search for the best match]

Feature Estimation from Linearly Transformed Temporal Patterns
[Diagram: per-band temporal patterns pass through a linear transform and an MLP (TANDEM), feeding an HMM ASR system; the choice of transform is an open question (???)]

Preliminary TANDEM/TRAP Results (OGI-HTK)
WER% on OGI Numbers, training on the UW reduced training set, monophone models:

  BASELINE          4.5
  TANDEM            4.1
  TANDEM with TRAP  3.9

Features from More Than One Critical-Band Temporal Trajectory
• Studying KLT-derived basis functions, we observe: cosine transform + average frequency derivative

UW Baseline Experiments
• Constructed an HTK-based HMM system that is competitive with the SRI system
• Replicated the HMM system in GMTK
• Move on to models which integrate information from multiple sources in a principled manner:
  - Multiple feature streams (multi-stream models)
  - Different time scales (multi-rate models)
• Focus on statistical models, not on feature extraction

HTK HMM Baseline
• An HTK-based standard HMM system:
  - 3-state triphones with decision-tree clustering
  - Mixtures of diagonal Gaussians as state output distributions
  - No adaptation, fixed LM
• Dimensions explored:
  - Front-end: PLP vs. MFCC, VTLN
  - Gender-dependent vs. gender-independent modeling
• Conclusions:
  - No significant performance differences
  - Decided on PLPs, no VTLN, gender-independent models for simplicity

HMM Baselines (cont.)
• Replicated the HTK baseline with equivalent results in GMTK:

  Tool    WER % dev   WER % test
  HTK        3.7         3.2
  GMTK       3.7         3.0

• To reduce experiment turn-around time, wanted to reduce the training set
• For HMMs and Numbers95, 3/4th of the training data can be safely ignored:

  Training set        WER % dev   WER % test
  Full “short”           3.7         3.2
  1/4th (“reduced”)      3.4         3.4

Multi-stream Models
• Information fusion from multiple streams of features
• Partially asynchronous state sequences
[Diagram: state topology and the equivalent graphical model, with separate but coupled state sequences for feature streams X and Y]
• Comparison of the HMM (PLP) and multi-stream (PLP+MFCC) models; multi-stream WER %: 3.9 (dev), 4.2 (test)

Temporal Envelope Features (Columbia)
• Temporal fine structure is lost (deliberately) in STFT features
[Figure: waveform and spectrogram (10 ms windows) of utterance mpgr1-sx419; time in sec, frequency 0-8000 Hz, level in dB down to -60]
• Need a compact, parametric description...

Frequency-Domain Linear Prediction (FDLP)
• Extend LPC with an LP model of the spectrum:
  - TD-LP: y[n] = Σᵢ aᵢ y[n-i]
  - FD-LP (on the DFT): Y[k] = Σᵢ bᵢ Y[k-i]
• ‘Poles’ represent temporal peaks
[Figure: FDLP temporal envelope of mpgr1-sx419 (60 poles / 300 ms)]
• Features ~ pole bandwidth, ‘frequency’

Preliminary FDLP Results
• Distribution of pole magnitudes for different phone classes (in 4 bands)
[Figure: histograms of -log(1-|pole magnitude|) for /ah/ vs. /p/ in the 0-500 Hz, 500-1000 Hz, 1-2 kHz, and 2-4 kHz bands]
• NN classifier frame accuracies:

  plp12N          57.0%
  plp12N+FDLP4    58.2%

Directions
Dan Ellis, Columbia (SRI/UW/Columbia work)
Nelson Morgan, ICSI (OGI/IDIAP/ICSI work + summary)

Multi-rate Models (UW)
• Integrate acoustic information from different time scales
• Account for dependencies across scales
• Better robustness against time- and/or frequency-localized interference
• Reduced redundancy gives better confidence estimates

Cross-scale Dependencies (example)
[Diagram: long-term features over a coarse state chain, coupled to short-term features over a fine state chain]

SRI Directions
• Task 1: Signal-adaptive weighting of time-frequency patches
  - Basis-entropy based representation
  - Matching-pursuit search for optimal weighting of patches
  - Optimality based on a minimum-entropy criterion
• Task 2: Graphical models of patch combinations
  - Tiling-driven dependency modeling
  - GM combines across patch selections
  - Optimality based on information in the representation

Data-Derived Phonetic Features (Columbia)
• Find a set of independent attributes to account for phonetic (lexical) distinctions:
  - phones replaced by feature streams
• Will require new pronunciation models:
  - asynchronous feature transitions (no phones)
  - mapping from phonetics (for unseen words)
• Joint work with Eric Fosler-Lussier

ICA for Feature Bases
• PCA finds decorrelated bases; ICA finds independent bases
[Figure: ICA basis activations over time for TIMIT utterance test/dr1/faks0/sa2, aligned with its phone labels (d ow n ae s m iy t ix k eh r iy ix n oy l iy r ae g l ay k dh ae tcl); basis vectors plotted against frequency in Bark]
• Lexically-sufficient ICA basis set?

OGI Directions: Targets in Sub-bands
• Initially context-independent and band-specific phonemes
• Gradually shifted to 6 band-specific broad phonetic classes (stops, fricatives, nasals, vowels, silence, flaps)
• Moving towards band-independent speech classes (vocalic-like, fricative-like, plosive-like, ???)

More Than One Temporal Pattern?
[Diagram: mean & variance normalized, Hamming-windowed critical-band trajectories (101-dim) projected onto several KLT bases (KLT1 ... KLTn) before the MLP]

Pre-processing by 2-D Operators with Subsequent TRAP-TANDEM
• Sobel-like 3x3 operators on the time-frequency plane:

  differentiate in f, average in t:
     1  2  1
     0  0  0
    -1 -2 -1

  differentiate in t, average in f:
    -1  0  1
    -2  0  2
    -1  0  1

  diff upwards, av downwards:
     0  1  2
    -1  0  1
    -2 -1  0

  diff downwards, av upwards:
    -2 -1  0
    -1  0  1
     0  1  2

IDIAP Directions: Phase AutoCorrelation Features
• Traditional features are autocorrelation based: very sensitive to additive noise and other variations
• Phase AutoCorrelation (PAC): if R_k, k = 0, 1, ..., N-1, represents the autocorrelation coefficients derived from a frame of length N, the PACs are

    P_k = cos⁻¹(R_k / R_0),

  where R_0 is the frame energy.

Entropy-Based Multi-Stream Combination
• Combination of evidence from more than one expert to improve performance
• Entropy as a measure of confidence
• Experts with low entropy are more reliable than experts with high entropy
• Inverse-entropy weighting criterion
• Relationship between the entropy of the resulting (recombined) classifier and the recognition rate

ICSI Directions: Posterior Combination Framework
• Combination of several discriminative probability streams

Improvement of the Combo Infrastructure
• Improve basic features:
  - Add prosodic features: voicing level, energy continuity
  - Improve PLP by further removing the pitch differences among speakers
• Tandem:
  - Different targets, different training features, e.g. word boundary
• Improve TRAP (OGI)
• Combination:
  - Entropy-based or accuracy-based stream weighting or stream selection

New Types of Tandem Features
[Diagram: input features, optionally with word/syllable-boundary processing, feed an NN that outputs target posteriors]
• Input features:
  - Traditional or improved PLP
  - Spectral continuity
  - Voicing, voicing continuity
  - Formant continuity feature
  - …more
• Target posteriors:
  - Phonemes
  - Word/syllable boundaries
  - Broad phoneme classes
  - Manner / place of articulation… etc.

Data-Driven Subword Unit Generation (IDIAP/ICSI)
• Motivation: Phoneme-based units may not be optimal for ASR
• Approach (based on a speaker segmentation method): start from an initial segmentation with a large number of clusters, then iteratively test a thresholdless BIC-like merging criterion, merging, re-segmenting, and re-estimating until the procedure stops

Summary
• Staff and tools in place to proceed with the core experiments
• Pilot experiments provided a coherent substrate for cooperation between the 6 sites
• Future directions for the individual sites are all over the map, which is what we want
• Possible exploration of collaborations w/ MS in this meeting
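The PAC features described in the IDIAP direction can be sketched in a few lines. This is a minimal toy illustration of the formula P_k = cos⁻¹(R_k / R_0), not the IDIAP implementation; the function name, the number of coefficients, and the synthetic sine frame are all made up for the example:

```python
import math

def pac_features(frame, num_coeffs):
    """Phase AutoCorrelation (PAC) features for one speech frame.

    R_k are the (biased) autocorrelation coefficients of the frame;
    the PACs are P_k = arccos(R_k / R_0), where R_0 is the frame
    energy.  By Cauchy-Schwarz |R_k| <= R_0, but we clamp the ratio
    anyway to guard against floating-point round-off.
    """
    n = len(frame)
    R = [sum(frame[i] * frame[i + k] for i in range(n - k))
         for k in range(num_coeffs)]
    R0 = R[0]  # frame energy
    return [math.acos(max(-1.0, min(1.0, rk / R0))) for rk in R]

# Hypothetical 20 ms frame at 8 kHz (160 samples) of a pure tone.
frame = [math.sin(0.3 * i) for i in range(160)]
pac = pac_features(frame, 13)
```

Since R_0 / R_0 = 1, the first coefficient P_0 is always arccos(1) = 0, and every P_k lies in [0, π]; the arccos mapping is what reduces the sensitivity to additive noise relative to raw autocorrelations.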
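The inverse-entropy weighting criterion from the entropy-based multi-stream combination slides can be sketched as follows. This is a minimal sketch under the assumption that each expert emits a per-frame posterior vector over the same classes; the function names and the two example posterior vectors are hypothetical, not project data:

```python
import math

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of one expert's posterior vector."""
    return -sum(max(pi, eps) * math.log(max(pi, eps)) for pi in p)

def inverse_entropy_combine(streams):
    """Combine posterior vectors from several experts.

    Each expert is weighted proportionally to the inverse of the
    entropy of its posteriors, so confident (low-entropy) experts
    dominate the recombined classifier.
    """
    inv = [1.0 / entropy(p) for p in streams]
    total = sum(inv)
    weights = [x / total for x in inv]          # sum to 1
    n = len(streams[0])
    combined = [sum(w * p[i] for w, p in zip(weights, streams))
                for i in range(n)]
    return combined, weights

# Two hypothetical experts over three classes: one confident, one not.
confident = [0.90, 0.05, 0.05]
uncertain = [0.40, 0.35, 0.25]
post, w = inverse_entropy_combine([confident, uncertain])
```

Because the weights sum to one and each stream's posteriors sum to one, the combined vector is itself a valid posterior distribution; in this example the low-entropy expert receives the larger weight, as the criterion intends.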