Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology
Mark Hasegawa-Johnson, jhasegaw@uiuc.edu
University of Illinois at Urbana-Champaign, USA

Lecture 5: Generalization Error; Support Vector Machines
• Observation vectors and summary statistics; Principal Components Analysis (PCA)
• Risk Minimization
  – If the posterior probability is known: MAP is optimal
  – Example: Linear Discriminant Analysis (LDA)
  – When the true posterior is unknown: generalization error
  – VC dimension, and bounds on generalization error
• Lagrangian Optimization
• Linear Support Vector Machines
  – The SVM optimality metric
  – Lagrangian optimization of the SVM metric
  – Hyper-parameters & over-training
• Kernel-Based Support Vector Machines
  – Kernel-based classification & optimization formulas
  – Hyper-parameters & over-training
  – The entire regularization path of the SVM
• High-Dimensional Linear SVMs
  – Text classification using indicator functions
  – Speech acoustic classification using redundant features

What is an Observation?
An observation can be:
• A vector created by “vectorizing” many consecutive MFCC or mel-spectrum frames
• A vector including MFCCs, formants, pitch, PLP, auditory-model features, …

Normalized Observations

Plotting the Observations, Part 1: Scatter Plots and Histograms

Problem: Where is the Information in a 1000-Dimensional Vector?

Statistics that Summarize a Training Corpus

Summary Statistics: Matrix Notation
(Figure: example observations from the two classes, y = −1 and y = +1.)

Eigenvectors and Eigenvalues of R

Plotting the Observations, Part 2: Principal Components Analysis

What Does PCA Extract from the Spectrogram? Plot: “PCAGram”
Each 1024-dimensional principal component is reshaped into a 32×32 spectrogram and plotted as an image:
• The 1st principal component (not shown) measures the total energy of the spectrogram
• The 2nd principal component measures E(after the landmark) − E(before the landmark)
• The 3rd principal component measures E(at the landmark) − E(surrounding syllables)

Minimum-Risk Classifier Design

True Risk, Empirical Risk, and Generalization

When the PDF is Known: Maximum A Posteriori (MAP) is Optimal

Another Way to Write the MAP Classifier: Test the Sign of the Log Likelihood Ratio

MAP Example: Gaussians with Equal Covariance

Linear Discriminant Projection of the Data

Other Linear Classifiers: Empirical Risk Minimization (Choose v, b to Minimize R_emp(v, b))

A Serious Problem: Over-Training
(Figures: the minimum-error projection of the training data, and the same projection applied to new test data.)

When the True PDF is Unknown: Upper Bounds on True Risk

The VC Dimension of a Hyperplane Classifier

Schematic Depiction: |w| Controls the Expressiveness of the Classifier (and a less expressive classifier is less prone to over-train)

The SVM = An Optimality Criterion

Lagrangian Optimization: Inequality Constraint
Consider minimizing f(v), subject to the constraint g(v) ≥ 0. Two solution types exist (diagram from Osborne, 2004):
• Case 1 (g(v*) = 0): the constraint is active; the curve g(v) = 0 is tangent to the contour f(v) = f_min at v = v*
• Case 2 (g(v*) > 0): the constraint is inactive; v* is the unconstrained minimum of f(v)

Training an SVM

Differentiate the Lagrangian…

…now Simplify the Lagrangian…

…and Impose the Kuhn-Tucker Conditions…

Three Types of Vectors (from Hastie et al., NIPS 2004)
• Interior vector: α = 0
• Margin support vector: 0 < α < C
• Error or partial error: α = C

…and Finally, Solve the SVM Quadratic Program
(Figure: one iteration of the dual update. α_{i2} is off the margin, so it is truncated to α_{i2} = 0; α_{i1} is still a margin candidate, so it is solved for again in iteration i+1.)
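The three types of vectors above can be read directly off a trained linear SVM by inspecting its dual coefficients. The sketch below is a minimal illustration and not part of the lecture: it uses scikit-learn’s SVC as the quadratic-programming solver and synthetic two-dimensional Gaussian data in place of real acoustic observations, then counts how many training vectors are interior (α = 0), on the margin (0 < α < C), or errors/partial errors (α = C).

```python
# Minimal sketch (not from the lecture): train a linear SVM on synthetic data
# and sort the training vectors into the three types above by their dual
# coefficients alpha_m.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
M = 200                                                 # number of training tokens
X = np.vstack([rng.normal(-1.0, 1.0, (M // 2, 2)),      # class y = -1
               rng.normal(+1.0, 1.0, (M // 2, 2))])     # class y = +1
y = np.hstack([-np.ones(M // 2), np.ones(M // 2)])

C = 1.0 / M                                             # the lecture's recommended starting point
svm = SVC(kernel="linear", C=C).fit(X, y)

# SVC stores y_m * alpha_m for the support vectors only; all other alpha_m = 0.
alpha = np.zeros(M)
alpha[svm.support_] = np.abs(svm.dual_coef_.ravel())

eps = 1e-8
interior = np.sum(alpha < eps)                          # alpha = 0
margin   = np.sum((alpha > eps) & (alpha < C - eps))    # 0 < alpha < C
at_bound = np.sum(alpha > C - eps)                      # alpha = C (errors / partial errors)
print(f"interior: {interior}, margin SVs: {margin}, alpha = C: {at_bound}")
print("margin 1/|v| =", 1.0 / np.linalg.norm(svm.coef_))
```

Sweeping C in this sketch also previews the point made on the next slides: a smaller C forces a wider margin and pushes more training vectors to α = C.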
Linear SVM Example

Choosing the Hyper-Parameter to Avoid Over-Training (Wang, presentation at CLSP workshop WS04)
(Figure: SVM test-corpus error vs. λ = 1/C, for classification of nasal vs. non-nasal vowels.)

Choosing the Hyper-Parameter to Avoid Over-Training
• Recall that v = Σ_m α_m y_m x_m
• Therefore, |v| ≤ (C Σ_m |x_m|)^(1/2) ≤ (C M max|x_m|)^(1/2)
• Therefore, the width of the margin is constrained to 1/|v| ≥ (C M max|x_m|)^(-1/2), so the SVM is not allowed to make the margin arbitrarily small in its quest to fix individual errors
• Recommended solution:
  – Normalize x_m so that max|x_m| ≈ 1 (e.g., using libsvm)
  – Set C ≈ 1/M
  – If desired, adjust C up or down by a factor of 2, to see whether the error rate on independent development test data decreases

From Linear to Nonlinear SVM

Example: RBF Classifier

An RBF Classification Boundary

Two Hyperparameters

Choosing Hyperparameters is Much Harder (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)

Optimum Value of C Depends on γ (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
(Figure from Hastie et al., NIPS 2004.)

SVM is a “Regularized Learner” (λ = 1/C)

SVM Coefficients are a Piece-Wise Linear Function of λ = 1/C (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)

The Entire Regularization Path of the SVM: Algorithm (Hastie, Zhu, Tibshirani, and Rosset, NIPS 2004)
• Start with λ large enough (C small enough) that all training tokens are partial errors (α_m = C). Compute the solution to the quadratic programming problem in this case, including inversion of X^T X or X X^T.
• Reduce λ (increase C) until the initial event occurs: two partial-error points enter the margin, i.e., in the QP problem, α_m = C becomes the unconstrained solution rather than just the constrained solution. This is the first breakpoint. The slopes dα_m/dλ change, but only for the two training vectors on the margin; all other training vectors continue to have α_m = C. Calculate the new values of dα_m/dλ for these two training vectors.
• Iteratively find the next breakpoint. The next breakpoint occurs when one of the following happens:
  – A value of α_m that was on the margin leaves the margin, i.e., the piece-wise-linear function α_m(λ) hits α_m = 0 or α_m = C.
  – One or more interior points enter the margin, i.e., in the QP problem, α_m = 0 becomes the unconstrained solution rather than just the constrained solution.
  – One or more partial-error points enter the margin, i.e., in the QP problem, α_m = C becomes the unconstrained solution rather than just the constrained solution.

One Method for Using SVMPath (WS04, Johns Hopkins, 2004)
• Download the SVMPath code from Trevor Hastie’s web page
• Test several values of γ, including values within a few orders of magnitude of γ = 1/K
• For each candidate value of γ, use SVMPath to find the C-breakpoints. Choose a few dozen C-breakpoints for further testing, and write out the corresponding values of α_m
• Test the SVMs on a separate development test database: for each combination (C, γ), find the development-test error. Choose the combination that gives the least development-test error. (A rough grid-search sketch in the same spirit follows the results below.)

Results, RBF SVM (Wang, WS04 student presentation, 2004)
(Figure: SVM test-corpus error vs. λ = 1/C, for classification of nasal vs. non-nasal vowels.)
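SVMPath computes the exact C-breakpoints described above; as a rough, non-equivalent stand-in, the sketch below simply scans a log-spaced grid of C for several γ values near 1/K and keeps the (C, γ) pair with the lowest error on a held-out development set. The synthetic data, the grid values, and the split are illustrative assumptions, not the WS04 nasal-vowel task.

```python
# Rough stand-in for the SVMPath recipe above (illustrative assumptions only):
# scan a log-spaced grid of C for several gamma values near 1/K and keep the
# (C, gamma) pair with the lowest error on a held-out development set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
M, K = 400, 10                                # training tokens, feature dimension
X = rng.normal(size=(M, K))
y = np.sign(X[:, :3].sum(axis=1) + 0.3 * rng.normal(size=M))

X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.25, random_state=0)

best = None
for gamma in (1.0 / K) * np.array([0.01, 0.1, 1.0, 10.0, 100.0]):
    for C in np.logspace(-3, 3, 13):          # coarse substitute for the exact C-breakpoints
        err = 1.0 - SVC(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train).score(X_dev, y_dev)
        if best is None or err < best[0]:
            best = (err, C, gamma)

print("best dev error = %.3f at C = %.3g, gamma = %.3g" % best)
```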
High-Dimensional Linear SVMs

Motivation: “Project it Yourself”
• The purpose of a nonlinear SVM:
  – f(x) contains higher-order polynomial terms in the elements of x.
  – By combining these higher-order polynomial terms, Σ_m y_m α_m K(x, x_m) can create a more flexible boundary than Σ_m y_m α_m x^T x_m can.
  – The flexibility of the boundary does not have to cause generalization error: the regularization term λ|v|² keeps it in check.
• A different approach:
  – Augment x with higher-order terms, up to a very large dimension. These terms can include:
    • Polynomial terms, e.g., x_i x_j
    • N-gram terms, e.g., (x_i at time t AND x_j at time t)
    • Other features suggested by knowledge-based analysis of the problem
  – Then apply a linear SVM to the higher-dimensional problem

Example #1: Acoustic Classification of Stop Place of Articulation
• Feature dimension: K = 483 per 10 ms
  – MFCCs + Δ + ΔΔ, 25 ms window: K = 39/10 ms
  – Spectral shape: energy, spectral tilt, and spectral compactness, once per millisecond: K = 40/10 ms
  – Noise-robust MUSIC-based formant frequencies, amplitudes, and bandwidths: K = 10/10 ms
  – Acoustic-phonetic parameters (formant-based relative spectral measures and time-domain measures): K = 42/10 ms
  – Rate-place model of neural response fields in the cat auditory cortex: K = 352/10 ms
• Observation = concatenation of up to 17 frames, for a total of K = 17 × 483 = 8211 dimensions
• Results: accuracy improves as more features are added, up to 7 frames (one per 10 ms; a 3381-dimensional x). Adding more frames did not help.
• The RBF SVM still outperforms the linear SVM, but only by 1%

Example #2: Text Classification
• Goal:
  – Utterances were recorded by physical-therapy patients, specifying their physical activity once per half hour for seven days.
  – Example utterance: “I ate breakfast for twenty minutes, then I walked to school for ten minutes.”
  – Goal: for each time period, determine the type of physical activity, from among 2000 possible type categories.
• Indicator features:
  – 50,000 features: one per word in a 50,000-word dictionary
  – x = [d_1, d_2, d_3, …, d_50000]^T
  – d_i = 1 if the i-th dictionary word is contained in the utterance, zero otherwise
  – X is very sparse: most sentences contain only a few words
  – A linear SVM is very efficient (a minimal sketch of this setup follows the summary below)

Example #2: Text Classification (continued)
• Result:
  – 85% classification accuracy
  – Most incorrect classifications were reasonable to a human
    • “I played hopscotch with my daughter”: “playing a game”, or “light physical exercise”?
  – Some categories were never observed in the training data; therefore no test data were assigned to those categories
• Conclusion: the SVM is learning keywords & keyword combinations

Summary
• Plotting the data: use PCA, LDA, or any other discriminant
• If the PDF is known: use the MAP classifier
• If the PDF is unknown: structural risk minimization
• “SVM” is a training criterion – a particular upper bound on the structural risk of a hyperplane classifier
• Choosing hyperparameters:
  – Easy for a linear classifier
  – For a nonlinear classifier: use the entire-regularization-path algorithm
• High-dimensional linear SVMs: the human user acts as an “intelligent kernel”
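As a closing illustration of Example #2, the sketch below builds sparse 0/1 indicator features with scikit-learn’s CountVectorizer(binary=True) and trains a LinearSVC on them. The four-utterance toy corpus and the activity labels are invented placeholders, not the physical-therapy data described in the lecture.

```python
# Minimal sketch of the indicator-feature idea in Example #2: each utterance
# becomes a sparse 0/1 vector (one dimension per dictionary word), and a
# linear SVM is trained directly on the sparse matrix.  The toy corpus and
# activity labels are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

utterances = [
    "I ate breakfast for twenty minutes",
    "I walked to school for ten minutes",
    "I played hopscotch with my daughter",
    "I watched television for an hour",
]
labels = ["eating", "walking", "light exercise", "resting"]   # placeholder categories

# binary=True gives d_i = 1 if word i occurs in the utterance and 0 otherwise;
# the result is a scipy sparse matrix, so a very high-dimensional x stays cheap.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(utterances)

clf = LinearSVC().fit(X, labels)
test = vectorizer.transform(["I walked to the park for twenty minutes"])
print(clf.predict(test))                      # prints the predicted activity category
```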