“Pushing the Envelope”
A six-month report
By the Novel Approaches team, with site leaders:
Nelson Morgan, ICSI
Hynek Hermansky, OGI
Dan Ellis, Columbia
Kemal Sönmez, SRI
Mari Ostendorf, UW
Hervé Bourlard, IDIAP/EPFL
George Doddington, NA-sayer

Overview
Nelson Morgan, ICSI

The Current Cast of Characters
• ICSI: Morgan, Q. Zhu, B. Chen, G. Doddington
• UW: M. Ostendorf, Ö. Çetin
• OGI: H. Hermansky, S. Sivadas, P. Jain
• Columbia: D. Ellis, M. Athineos
• SRI: K. Sönmez
• IDIAP: H. Bourlard, J. Ajmera, V. Tyagi

Rethinking Acoustic Processing for ASR
• Escape dependence on the spectral envelope
• Use multiple front-ends across time/frequency
• Modify statistical models to accommodate the new front-ends
• Design optimal combination schemes for multiple models

Task 1: Pushing the Envelope (aside)
• Problem: The spectral envelope is a fragile information carrier
  [Diagram: OLD: a single 10 ms estimate of sound identity. PROPOSED: ith, kth, ..., nth estimates over spans of up to 1 s, combined by information fusion into an estimate of sound identity over time]
• Solution: Probabilities from multiple time-frequency patches

Task 2: Beyond Frames…
• Problem: Features & models interact; new features may require different models
  [Diagram: OLD: short-term features feeding a conventional HMM. PROPOSED: advanced features feeding a multi-rate, dynamic-scale classifier]
• Solution: Advanced features require advanced models, free of the fixed-frame-rate paradigm

Today’s presentation
• Infrastructure: training, testing, software
• Initial Experiments: pilot studies
• Directions: where we’re headed

Infrastructure
Kemal Sönmez, SRI (SRI/UW/ICSI effort)

Initial Experimental Paradigm
• Focus on a small task to facilitate exploratory work (later move to CTS)
• Choose a task where the LM is fixed & plays a minor role (to focus on acoustics)
• Use mismatched train/test data:
  - To avoid tuning to the task
  - To facilitate the later move to CTS
• Task: OGI Numbers / Train: swbd + macrophone

Hub5 “Short” Training Set
• Composition (total ~60 hours):

  Corpus         Hours (Male / Female)
  callhome        2.8 / 13.8
  switchboard*    5.9 /  6.7
  credit-card    12.4 /  4.3
  macrophone      7.1 /  5.8

  * subset of SWB-1 hand-checked at SRI for accuracy of transcriptions and segmentations
• WER 2-4% higher vs. the full 250+ hour training set

Reduced UW Training Set
• A reduced training set to shorten experiment turn-around time
• Choose training utterances with per-frame likelihood scores close to the training-set average
• 1/4th of the original training set
• Statistics (gender, data set constituencies) are similar to those of the full training set:

  Data set constituencies
                macrophone  callhome  creditcard  other switchboard  male/female
  “short”          32%        32%        12%           24%            45/55%
  Reduced (UW)     38%        28%        12%           22%            48/52%

• For OGI Numbers, no significant WER sacrifice in the baseline HMM system (worse for Hub5)

Development Test Sets
• A “Core-Subset” of OGI’s Numbers95 corpus: telephone speech of people reciting addresses, telephone numbers, zip codes, or other miscellaneous items
• The “Core-Subset” or “CS” consists of utterances that were phonetically hand-transcribed, intelligible, and contained only numbers
• Vocabulary size: 32 words (digits + eleven, twelve… twenty… hundred…thousand, etc.)
  Data Set Name                  Total Utterances  Total Words  Duration (hours)
  Numbers95-CS Cross Validation        357             1353          ~0.2
  Numbers95-CS Development            1206             4673          ~0.6
  Numbers95-CS Test                   1227             4757          ~0.6

Statistical Modeling Tools
• HTK (Hidden Markov Toolkit) for establishing an HMM baseline and for debugging
• GMTK (Graphical Models Toolkit) for implementing advanced models with multiple feature/state streams:
  - Allows direct dependencies across streams
  - Not limited by the single-rate, single-stream paradigm
  - Rapid model specification/training/testing
• SRI Decipher system for providing lattices to rescore (later, in CTS experiments)
• Neural network tools from ICSI for posterior probability estimation; other statistical software from IDIAP

Baseline SRI Recognizer for the Numbers Task
• Bottom-up state-clustered Gaussian mixture HMMs for acoustic modeling
• Acoustic adaptation to speakers using affine mean and variance transforms [not used for Numbers]
• Vocal-tract length normalization using maximum likelihood estimation [not helpful for Numbers]
• Progressive search with lattice recognition and N-best rescoring [to be used in later work]
• Bigram LM

Initial Experiments
Barry Chen, ICSI
Hynek Hermansky, OHSU (OGI)
Özgür Çetin, UW

Goals of Initial Experiments
• Establish performance baselines:
  - HMM + standard features (MFCC, PLP)
  - HMM + current best from ICSI/OGI
• Develop infrastructure for new models:
  - GMTK for multi-stream & multi-rate features
  - Novel features based on large time spans
  - Novel features based on temporal fine structure
• Provide fodder for future error analysis

ICSI Baseline Experiments
• PLP-based SRI system
• “Tandem” PLP-based ANN + SRI system
• Initial combination approach

Development Baseline: Gender-Independent PLP System

  Training Set                                    Word, Sentence Error Rate on Numbers95-CS Test Set
  Full “Short” Hub5 (85k utterances, ~64.9 hrs)   3.4%, 10.2%
  UW Reduced Hub5 (20k utterances, ~18.8 hrs)     3.8%, 11.4%

Phonetically Trained Neural Net
• Multi-Layer Perceptron (input, hidden, and output layers)
• Trained using the error-backpropagation technique; outputs interpreted as posterior probabilities of the target classes
• Training targets: 47 monophone targets from forced alignment using the SRI Eval 2002 system
• Training utterances: UW Reduced Hub5 set
• Training features: PLP12+e+d+dd, mean & variance normalized on a per-conversation-side basis
• MLP topology: 9-frame context window (4 frames in the past + current frame + 4 frames in the future); 351 input units, 1500 hidden units, and 47 output units; total number of parameters: ~600k

Baseline ICSI Tandem
• Outputs of the neural net before the final softmax non-linearity used as inputs to PCA
• PCA without dimensionality reduction
• 4.1% word and 11.7% sentence error rate on the Numbers95-CS test set

Baseline ICSI Tandem+PLP
• PLP stream concatenated with the neural-net posteriors stream
• PCA reduces the dimensionality of the posteriors stream to 16 (keeping 95% of the overall variance)
• 3.3% word and 9.5% sentence error rate on the Numbers95-CS test set

Word and String Error Rates on Numbers95-CS Test Set
[Chart: word and string error rates for the systems above]

OGI Experiments: New Features in EARS
• Develop on the home-grown ASR system (phoneme-based HTK)
• Pass the most promising features to ICSI for running in the SRI LVCSR system
• So far the new features match the performance of the baseline PLP features but do not exceed it; an advantage is seen in combination with the baseline

Looking to the Human Auditory System for Design Inspiration
• Psychophysics:
  - Components within a certain frequency range (several critical bands) interact [e.g. frequency masking]
  - Components within a certain time span (a few hundreds of ms) interact [e.g. temporal masking]
• Physiology:
  - 2-D (time-frequency) matched filters for activity in the auditory cortex [cortical receptive fields]

TRAP-based HMM-NN Hybrid ASR
[Diagram: mean & variance normalized, Hamming-windowed critical-band trajectories (101-point inputs) each feed a band-specific multilayer perceptron (MLP); a merging MLP combines them into posterior probabilities of phonemes, which drive the search for the best match]

Feature Estimation from Linearly Transformed Temporal Patterns
[Diagram: per-band temporal patterns pass through a linear transform and an MLP (TANDEM), feeding an HMM ASR system; the choice of transform is an open question (???)]

Preliminary TANDEM/TRAP Results (OGI-HTK)
WER% on OGI Numbers, training on the UW reduced training set, monophone models:

  BASELINE          4.5
  TANDEM            4.1
  TANDEM with TRAP  3.9

Features from More Than One Critical-Band Temporal Trajectory
• Studying KLT-derived basis functions, we observe: cosine transform + average frequency derivative

UW Baseline Experiments
• Constructed an HTK-based HMM system that is competitive with the SRI system
• Replicated the HMM system in GMTK
• Move on to models which integrate information from multiple sources in a principled manner:
  - Multiple feature streams (multi-stream models)
  - Different time scales (multi-rate models)
• Focus on statistical models, not on feature extraction

HTK HMM Baseline
• An HTK-based standard HMM system:
  - 3-state triphones with decision-tree clustering
  - Mixtures of diagonal Gaussians as state output distributions
  - No adaptation, fixed LM
• Dimensions explored:
  - Front-end: PLP vs. MFCC, VTLN
  - Gender-dependent vs. gender-independent modeling
• Conclusions:
  - No significant performance differences
  - Decided on PLPs, no VTLN, gender-independent models for simplicity

HMM Baselines (cont.)
• Replicated the HTK baseline with equivalent results in GMTK:

  Tool    WER % dev   WER % test
  HTK        3.7         3.2
  GMTK       3.7         3.0

• To reduce experiment turn-around time, wanted to reduce the training set
• For HMMs and Numbers95, 3/4th of the training data can be safely ignored:

  Training set        WER % dev   WER % test
  Full “short”           3.7         3.2
  1/4th (“reduced”)      3.4         3.4

Multi-stream Models
• Information fusion from multiple streams of features
• Partially asynchronous state sequences
[Diagram: state topology and the equivalent graphical model, with separate but coupled state sequences for feature streams X and Y]
• Comparison of the HMM (PLP) and multi-stream (PLP+MFCC) models; multi-stream WER %: 3.9 (dev), 4.2 (test)

Temporal Envelope Features (Columbia)
• Temporal fine structure is lost (deliberately) in STFT features
[Figure: waveform and spectrogram (10 ms windows) of utterance mpgr1-sx419; time in sec, frequency 0-8000 Hz, level in dB down to -60]
• Need a compact, parametric description...

Frequency-Domain Linear Prediction (FDLP)
• Extend LPC with an LP model of the spectrum:
  - TD-LP: y[n] = Σᵢ aᵢ y[n-i]
  - FD-LP (on the DFT): Y[k] = Σᵢ bᵢ Y[k-i]
• ‘Poles’ represent temporal peaks
[Figure: FDLP temporal envelope of mpgr1-sx419 (60 poles / 300 ms)]
• Features ~ pole bandwidth, ‘frequency’

Preliminary FDLP Results
• Distribution of pole magnitudes for different phone classes (in 4 bands)
[Figure: histograms of -log(1-|pole magnitude|) for /ah/ vs. /p/ in the 0-500 Hz, 500-1000 Hz, 1-2 kHz, and 2-4 kHz bands]
• NN classifier frame accuracies:

  plp12N          57.0%
  plp12N+FDLP4    58.2%

Directions
Dan Ellis, Columbia (SRI/UW/Columbia work)
Nelson Morgan, ICSI (OGI/IDIAP/ICSI work + summary)

Multi-rate Models (UW)
• Integrate acoustic information from different time scales
• Account for dependencies across scales
• Better robustness against time- and/or frequency-localized interference
• Reduced redundancy gives better confidence estimates

Cross-scale Dependencies (example)
[Diagram: long-term features over a coarse state chain, coupled to short-term features over a fine state chain]

SRI Directions
• Task 1: Signal-adaptive weighting of time-frequency patches
  - Basis-entropy based representation
  - Matching-pursuit search for optimal weighting of patches
  - Optimality based on a minimum-entropy criterion
• Task 2: Graphical models of patch combinations
  - Tiling-driven dependency modeling
  - GM combines across patch selections
  - Optimality based on information in the representation

Data-Derived Phonetic Features (Columbia)
• Find a set of independent attributes to account for phonetic (lexical) distinctions:
  - phones replaced by feature streams
• Will require new pronunciation models:
  - asynchronous feature transitions (no phones)
  - mapping from phonetics (for unseen words)
• Joint work with Eric Fosler-Lussier

ICA for Feature Bases
• PCA finds decorrelated bases; ICA finds independent bases
[Figure: ICA basis activations over time for TIMIT utterance test/dr1/faks0/sa2, aligned with its phone labels (d ow n ae s m iy t ix k eh r iy ix n oy l iy r ae g l ay k dh ae tcl); basis vectors plotted against frequency in Bark]
• Lexically-sufficient ICA basis set?

OGI Directions: Targets in Sub-bands
• Initially context-independent and band-specific phonemes
• Gradually shifted to 6 band-specific broad phonetic classes (stops, fricatives, nasals, vowels, silence, flaps)
• Moving towards band-independent speech classes (vocalic-like, fricative-like, plosive-like, ???)

More Than One Temporal Pattern?
[Diagram: mean & variance normalized, Hamming-windowed critical-band trajectories (101-dim) projected onto several KLT bases (KLT1 ... KLTn) before the MLP]

Pre-processing by 2-D Operators with Subsequent TRAP-TANDEM
• Sobel-like 3x3 operators on the time-frequency plane:

  differentiate in f, average in t:
     1  2  1
     0  0  0
    -1 -2 -1

  differentiate in t, average in f:
    -1  0  1
    -2  0  2
    -1  0  1

  diff upwards, av downwards:
     0  1  2
    -1  0  1
    -2 -1  0

  diff downwards, av upwards:
    -2 -1  0
    -1  0  1
     0  1  2

IDIAP Directions: Phase AutoCorrelation Features
• Traditional features are autocorrelation based: very sensitive to additive noise and other variations
• Phase AutoCorrelation (PAC): if R_k, k = 0, 1, ..., N-1, represents the autocorrelation coefficients derived from a frame of length N, the PACs are

    P_k = cos⁻¹(R_k / R_0),

  where R_0 is the frame energy.

Entropy-Based Multi-Stream Combination
• Combination of evidence from more than one expert to improve performance
• Entropy as a measure of confidence
• Experts with low entropy are more reliable than experts with high entropy
• Inverse-entropy weighting criterion
• Relationship between the entropy of the resulting (recombined) classifier and the recognition rate

ICSI Directions: Posterior Combination Framework
• Combination of several discriminative probability streams

Improvement of the Combo Infrastructure
• Improve basic features:
  - Add prosodic features: voicing level, energy continuity
  - Improve PLP by further removing the pitch differences among speakers
• Tandem:
  - Different targets, different training features, e.g. word boundary
• Improve TRAP (OGI)
• Combination:
  - Entropy-based or accuracy-based stream weighting or stream selection

New Types of Tandem Features
[Diagram: input features, optionally with word/syllable-boundary processing, feed an NN that outputs target posteriors]
• Input features:
  - Traditional or improved PLP
  - Spectral continuity
  - Voicing, voicing continuity
  - Formant continuity feature
  - …more
• Target posteriors:
  - Phonemes
  - Word/syllable boundaries
  - Broad phoneme classes
  - Manner / place of articulation… etc.

Data-Driven Subword Unit Generation (IDIAP/ICSI)
• Motivation: Phoneme-based units may not be optimal for ASR
• Approach (based on a speaker segmentation method): start from an initial segmentation with a large number of clusters, then iteratively test a thresholdless BIC-like merging criterion, merging, re-segmenting, and re-estimating until the procedure stops

Summary
• Staff and tools in place to proceed with the core experiments
• Pilot experiments provided a coherent substrate for cooperation between the 6 sites
• Future directions for the individual sites are all over the map, which is what we want
• Possible exploration of collaborations w/ MS in this meeting
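The PAC features described in the IDIAP direction can be sketched in a few lines. This is a minimal toy illustration of the formula P_k = cos⁻¹(R_k / R_0), not the IDIAP implementation; the function name, the number of coefficients, and the synthetic sine frame are all made up for the example:

```python
import math

def pac_features(frame, num_coeffs):
    """Phase AutoCorrelation (PAC) features for one speech frame.

    R_k are the (biased) autocorrelation coefficients of the frame;
    the PACs are P_k = arccos(R_k / R_0), where R_0 is the frame
    energy.  By Cauchy-Schwarz |R_k| <= R_0, but we clamp the ratio
    anyway to guard against floating-point round-off.
    """
    n = len(frame)
    R = [sum(frame[i] * frame[i + k] for i in range(n - k))
         for k in range(num_coeffs)]
    R0 = R[0]  # frame energy
    return [math.acos(max(-1.0, min(1.0, rk / R0))) for rk in R]

# Hypothetical 20 ms frame at 8 kHz (160 samples) of a pure tone.
frame = [math.sin(0.3 * i) for i in range(160)]
pac = pac_features(frame, 13)
```

Since R_0 / R_0 = 1, the first coefficient P_0 is always arccos(1) = 0, and every P_k lies in [0, π]; the arccos mapping is what reduces the sensitivity to additive noise relative to raw autocorrelations.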
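The inverse-entropy weighting criterion from the entropy-based multi-stream combination slides can be sketched as follows. This is a minimal sketch under the assumption that each expert emits a per-frame posterior vector over the same classes; the function names and the two example posterior vectors are hypothetical, not project data:

```python
import math

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of one expert's posterior vector."""
    return -sum(max(pi, eps) * math.log(max(pi, eps)) for pi in p)

def inverse_entropy_combine(streams):
    """Combine posterior vectors from several experts.

    Each expert is weighted proportionally to the inverse of the
    entropy of its posteriors, so confident (low-entropy) experts
    dominate the recombined classifier.
    """
    inv = [1.0 / entropy(p) for p in streams]
    total = sum(inv)
    weights = [x / total for x in inv]          # sum to 1
    n = len(streams[0])
    combined = [sum(w * p[i] for w, p in zip(weights, streams))
                for i in range(n)]
    return combined, weights

# Two hypothetical experts over three classes: one confident, one not.
confident = [0.90, 0.05, 0.05]
uncertain = [0.40, 0.35, 0.25]
post, w = inverse_entropy_combine([confident, uncertain])
```

Because the weights sum to one and each stream's posteriors sum to one, the combined vector is itself a valid posterior distribution; in this example the low-entropy expert receives the larger weight, as the criterion intends.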