Lecture 5 - University of Illinois at Urbana

Landmark-Based Speech Recognition:
Spectrogram Reading,
Support Vector Machines,
Dynamic Bayesian Networks,
and Phonology
Mark Hasegawa-Johnson
University of Illinois at Urbana-Champaign, USA
Lecture 5: Generalization Error;
Support Vector Machines
Observation Vector Summary Statistic; Principal Components Analysis (PCA)
Risk Minimization
If Posterior Probability is known: MAP is optimal
Example: Linear Discriminant Analysis (LDA)
When true Posterior is unknown: Generalization Error
VC Dimension, and bounds on Generalization Error
Lagrangian Optimization
Linear Support Vector Machines
– The SVM Optimality Metric
– Lagrangian Optimization of SVM Metric
– Hyper-parameters & Over-training
Kernel-Based Support Vector Machines
– Kernel-based classification & optimization formulas
– Hyperparameters & Over-training
– The Entire Regularization Path of the SVM
High-Dimensional Linear SVM
– Text classification using indicator functions
– Speech acoustic classification using redundant features
What is an Observation?
Observation can be:
• A vector created by “vectorizing” many consecutive MFCC or mel-spectra
• A vector including MFCC, formants, pitch, PLP, auditory model features, …
Normalized Observations
Plotting the Observations, Part I:
Scatter Plots and Histograms
Problem: Where is the Information
in a 1000-Dimensional Vector?
Statistics that Summarize a
Training Corpus
Summary Statistics: Matrix
Examples of y=-1
Examples of y=+1
Eigenvectors and Eigenvalues of R
Plotting the Observations, Part 2:
Principal Components Analysis
What Does PCA Extract from the
Spectrogram? Plot: “PCAGram”
1024-dimensional principal component → 32×32 spectrogram, plotted as an image:
• 1st principal component (not shown) measures total energy of the spectrogram
• 2nd principal component: E(after landmark) – E(before landmark)
• 3rd principal component: E(at the landmark) – E(surrounding syllables)
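The PCA projection described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming R is the sample covariance matrix of the mean-centered training observations; the function names `pca` and `project`, the toy data, and the 16-dimensional (4×4) size are hypothetical stand-ins for the 1024-dimensional (32×32) case.

```python
import numpy as np

def pca(X, n_components):
    """Return (mean, eigenvalues, eigenvectors) of the top principal components.

    X is an (M, K) array: M training observations of dimension K.
    """
    mean = X.mean(axis=0)
    Xc = X - mean                          # center every dimension
    R = Xc.T @ Xc / X.shape[0]             # (K, K) sample covariance (assumed form of R)
    eigvals, eigvecs = np.linalg.eigh(R)   # R is symmetric; eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvals[order], eigvecs[:, order]

def project(X, mean, eigvecs):
    """Project observations onto the retained principal components."""
    return (X - mean) @ eigvecs

# Toy data: 200 observations, 16 dimensions with decreasing variance.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16)) * np.linspace(3.0, 0.1, 16)

mean, eigvals, eigvecs = pca(X, n_components=2)
Z = project(X, mean, eigvecs)              # (200, 2): coordinates for a scatter plot

# "PCAGram": reshape one principal component back into a time-frequency image.
pcagram = eigvecs[:, 0].reshape(4, 4)
```

The two columns of `Z` are what a scatter plot of the first two principal components would show, and `pcagram` is the array one would pass to an image-plotting routine.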
Minimum-Risk Classifier
True Risk, Empirical Risk, and
When PDF is Known: Maximum A
Posteriori (MAP) is Optimal
Another Way to Write the MAP
Classifier: Test the Sign of the Log
Likelihood Ratio
MAP Example: Gaussians with
Equal Covariance
Linear Discriminant Projection of
the Data
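For two Gaussian classes with equal covariance, the sign of the log likelihood ratio is a linear function of x, so the MAP classifier reduces to a projection v and an offset b. The sketch below is one way to estimate them, assuming a pooled covariance estimate and class priors taken from the training counts; the names `fit_lda` and `classify` and the toy data are illustrative only.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate v and b so that sign(v.x + b) is the MAP decision.

    Assumes both classes are Gaussian with a shared covariance matrix.
    X is (M, K); y holds labels +1 / -1.
    """
    Xp, Xn = X[y == 1], X[y == -1]
    mu_p, mu_n = Xp.mean(axis=0), Xn.mean(axis=0)
    # Pooled within-class covariance.
    Sp = (Xp - mu_p).T @ (Xp - mu_p)
    Sn = (Xn - mu_n).T @ (Xn - mu_n)
    Sigma = (Sp + Sn) / (len(X) - 2)
    v = np.linalg.solve(Sigma, mu_p - mu_n)
    # Offset: midpoint between the class means plus the log prior ratio.
    b = -0.5 * (mu_p + mu_n) @ v + np.log(len(Xp) / len(Xn))
    return v, b

def classify(X, v, b):
    """Test the sign of the log likelihood ratio."""
    return np.where(X @ v + b >= 0, 1, -1)

# Toy data: two well-separated classes in 2 dimensions.
rng = np.random.default_rng(1)
Xp = rng.normal(loc=[2.0, 0.0], size=(100, 2))
Xn = rng.normal(loc=[-2.0, 0.0], size=(100, 2))
X = np.vstack([Xp, Xn])
y = np.array([1] * 100 + [-1] * 100)

v, b = fit_lda(X, y)
accuracy = np.mean(classify(X, v, b) == y)
```

Projecting the data onto v (that is, computing `X @ v`) gives the one-dimensional linear discriminant projection that can be plotted as a histogram per class.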
Other Linear Classifiers: Empirical
Risk Minimization (Choose v, b to
Minimize $R_{emp}(v,b)$)
A Serious Problem: Over-Training
Minimum-Error projection of training data
The same projection, applied to
new test data
When the True PDF is Unknown:
Upper Bounds on True Risk
The VC Dimension of a Hyperplane
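One commonly quoted bound of this kind is Vapnik's, given here as a sketch rather than as the exact expression used in the lecture. The symbols $h$ (VC dimension of the classifier family) and $\eta$ (confidence parameter) are notation introduced for this illustration; M is the number of training tokens. With probability at least $1-\eta$:

```latex
R(v,b) \;\le\; R_{emp}(v,b) + \sqrt{\frac{h\left(\ln\frac{2M}{h} + 1\right) - \ln\frac{\eta}{4}}{M}}
```

For the family of all hyperplanes in a K-dimensional observation space, $h = K+1$, so the second term grows with the dimension of the observation unless the family of hyperplanes is restricted in some other way.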
Schematic Depiction: |w| Controls the
Expressiveness of the Classifier
(and a less expressive classifier is less prone to overtrain)
The SVM = An Optimality Criterion
Lagrangian Optimization: Inequality Constraints
Consider minimizing f(v), subject to the constraint g(v) ≥ 0.
Two solution types:
• g(v*) = 0: the g(v) = 0 curve is tangent to the f(v) = fmin curve at v*
• g(v*) > 0: v* minimizes f(v), i.e., v* is the unconstrained minimum
(Diagram from Osborne, 2004: regions g(v) < 0 and g(v) > 0, the boundary g(v) = 0, and the unconstrained minimum)
Case 1: $g_m(v^*) = 0$
Case 2: $g_m(v^*) > 0$
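A one-dimensional worked example may make the two cases concrete. The objective, the constraint, and the constant a below are chosen purely for illustration; the multiplier is written $\alpha$ to match the SVM notation.

```latex
\begin{align*}
&\text{Minimize } f(v) = (v - a)^2 \quad \text{subject to } g(v) = v - 1 \ge 0. \\
&L(v,\alpha) = (v - a)^2 - \alpha\,(v - 1), \qquad \alpha \ge 0, \qquad \alpha\, g(v^*) = 0. \\
&\frac{\partial L}{\partial v} = 2(v - a) - \alpha = 0. \\[4pt]
&\text{Case 1 } (a = 0):\ \text{the unconstrained minimum } v = 0 \text{ violates } g(v) \ge 0, \\
&\qquad \text{so } g(v^*) = 0,\quad v^* = 1,\quad \alpha = 2(v^* - a) = 2 > 0. \\[4pt]
&\text{Case 2 } (a = 3):\ \text{the unconstrained minimum } v = 3 \text{ satisfies } g(3) = 2 > 0, \\
&\qquad \text{so } v^* = 3,\quad \alpha = 0.
\end{align*}
```

In Case 1 the constraint is active and its multiplier is positive; in Case 2 the constraint is inactive and its multiplier is zero.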
Training an SVM
Differentiate the Lagrangian
… now Simplify the Lagrangian…
… and impose Kuhn-Tucker…
Three Types of Vectors
• Interior vector: $\alpha_m = 0$
• Margin support vector: $0 < \alpha_m < C$
• Error or partial error: $\alpha_m = C$
(Figure from Hastie et al., NIPS 2004)
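Given the multipliers from a trained SVM, the three types can be read off directly. The helper below is a minimal sketch; the function name `vector_type` and the numerical tolerance `tol` are assumptions made for illustration, since a numerical solver rarely returns exactly 0 or exactly C.

```python
def vector_type(alpha, C, tol=1e-8):
    """Classify a training vector by the value of its multiplier alpha."""
    if alpha <= tol:
        return "interior"        # alpha = 0: correctly classified, outside the margin
    if alpha >= C - tol:
        return "error"           # alpha = C: error or partial error
    return "margin"              # 0 < alpha < C: support vector on the margin

# Hypothetical multipliers for five training tokens, with C = 1.
C = 1.0
alphas = [0.0, 0.3, 1.0, 0.999999999999, 0.0]
types = [vector_type(a, C) for a in alphas]
```

Only the "margin" and "error" vectors contribute to the classifier, because interior vectors have zero weight in the sum over training tokens.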
… and finally, Solve the SVM
Quadratic Programming
(Figure annotations: "ai2 is off the … truncate to …"; "ai1 is still a … solve for it again in iteration i+1.")
Linear SVM Example
Choosing the Hyper-Parameter to
Avoid Over-Training
(Wang, Presentation at CLSP workshop WS04)
SVM test corpus error vs.
$\lambda = 1/C$, classification of
nasal vs. non-nasal vowels.
Choosing the Hyper-Parameter to
Avoid Over-Training
• Recall that $v = \sum_m \alpha_m y_m x_m$
• Therefore, $|v| < (C \sum_m |x_m|)^{1/2} < (C M \max|x_m|)^{1/2}$
• Therefore, the width of the margin is constrained to
$1/|v| > (C M \max|x_m|)^{-1/2}$, so the SVM is
not allowed to make the margin very small in its
quest to fix individual errors
• Recommended solution:
– Normalize $x_m$ so that $\max|x_m| \approx 1$ (e.g., using libsvm)
– Set $C \approx 1/M$
– If desired, adjust C up or down by a factor of 2, to see if the
error rate on independent development test data will decrease
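The recommended recipe can be sketched as follows. This is an illustration in plain Python, not the libsvm scaling tool: `normalize`, `candidate_C`, and `pick_C` are hypothetical helper names, and `dev_error` stands for whatever routine trains an SVM with a given C and returns its error rate on the development test data.

```python
import math

def normalize(X):
    """Scale all observations by one constant so that max |x_m| = 1."""
    max_norm = max(math.sqrt(sum(xi * xi for xi in x)) for x in X)
    return [[xi / max_norm for xi in x] for x in X]

def candidate_C(M, n_steps=2):
    """Start from C = 1/M and move up or down by factors of 2."""
    return [2.0 ** k / M for k in range(-n_steps, n_steps + 1)]

def pick_C(candidates, dev_error):
    """Keep the candidate with the lowest development-test error."""
    return min(candidates, key=dev_error)

# Toy usage: two observations, four candidate-free training tokens assumed (M = 4).
X = [[3.0, 4.0], [0.0, 1.0]]
Xn = normalize(X)
grid = candidate_C(M=4, n_steps=1)

# Stand-in for a real train-and-evaluate routine (assumption for illustration).
best_C = pick_C(grid, dev_error=lambda C: abs(C - 0.5))
```

Because the search is over a single parameter and moves in factors of 2, only a handful of SVMs need to be trained.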
From Linear to Nonlinear SVM
Example: RBF Classifier
An RBF Classification Boundary
Two Hyperparameters → Choosing
Hyperparameters is Much Harder
(Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
Optimum Value of C Depends on $\gamma$
(Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
From Hastie et al.,
NIPS 2004
SVM is a “Regularized Learner”
SVM Coefficients are a Piece-Wise
Linear Function of $\lambda = 1/C$
(Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
The Entire Regularization Path of
the SVM: Algorithm
(Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
Start with $\lambda$ large enough (C small enough) that all training tokens are partial
errors ($\alpha_m = C$). Compute the solution to the quadratic programming problem in
this case, including inversion of $X^T X$ or $X X^T$.
Reduce $\lambda$ (increase C) until the initial event occurs: two partial error points enter
the margin, i.e., in the QP problem, $\alpha_m = C$ becomes the unconstrained solution
rather than just the constrained solution. This is the first breakpoint. The slopes
$d\alpha_m/d\lambda$ change, but only for the two training vectors on the margin; all other training
vectors continue to have $\alpha_m = C$. Calculate the new values of $d\alpha_m/d\lambda$ for these two
training vectors.
Iteratively find the next breakpoint. The next breakpoint occurs when one of the
following occurs:
– A value of $\alpha_m$ that was on the margin leaves the margin, i.e., the piece-wise-linear
function $\alpha_m(\lambda)$ hits $\alpha_m = 0$ or $\alpha_m = C$.
– One or more interior points enter the margin, i.e., in the QP problem, $\alpha_m = 0$ becomes the
unconstrained solution rather than just the constrained solution.
– One or more partial error points enter the margin, i.e., in the QP problem, $\alpha_m = C$ becomes the
unconstrained solution rather than just the constrained solution.
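Because the multipliers are piece-wise linear between breakpoints, storing their values at the breakpoints is enough to recover the solution at any intermediate value of the regularization parameter. The sketch below shows only that interpolation step, not the breakpoint search itself; the function name `alpha_at` and the toy breakpoint table are assumptions made for illustration.

```python
import numpy as np

def alpha_at(lam, breakpoints, alphas):
    """Evaluate every multiplier at regularization value lam.

    breakpoints: increasing values of lambda, shape (B,)
    alphas:      multipliers at each breakpoint, shape (B, M)
    Values of lam outside the stored range are clamped to the end points.
    """
    breakpoints = np.asarray(breakpoints, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    return np.array([
        np.interp(lam, breakpoints, alphas[:, m])   # linear between breakpoints
        for m in range(alphas.shape[1])
    ])

# Toy path with three breakpoints and two training tokens (hypothetical values).
breakpoints = [0.5, 1.0, 2.0]
alphas = [[0.0, 0.2],
          [0.5, 0.6],
          [1.0, 1.0]]

a_mid = alpha_at(0.75, breakpoints, alphas)   # halfway between the first two breakpoints
a_late = alpha_at(1.5, breakpoints, alphas)   # halfway between the last two breakpoints
```

This is why the whole path costs little more than a single fit: between breakpoints nothing needs to be re-solved.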
One Method for Using SVMPath
(WS04, Johns Hopkins, 2004)
• Download SVMPath code from Trevor Hastie’s web page
• Test several values of $\gamma$, including values within a few orders
of magnitude of $\gamma = 1/K$.
• For each candidate value of $\gamma$, use SVMPath to find the C-breakpoints. Choose a few dozen C-breakpoints for further
testing, and write out the corresponding values of $\alpha_m$.
• Test the SVMs on a separate development test database: for
each combination $(C, \gamma)$, find the development test error.
Choose the combination that gives the least development test error.
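The selection loop in the steps above can be sketched as follows. This does not call the SVMPath code; `breakpoints_for` and `dev_error` are hypothetical callables standing in for "run SVMPath for this kernel parameter" and "measure development test error for this pair", and the choice of powers of ten around 1/K is one reading of "within a few orders of magnitude".

```python
def candidate_gammas(K, orders=2):
    """Kernel parameter candidates spanning a few orders of magnitude around 1/K."""
    return [10.0 ** k / K for k in range(-orders, orders + 1)]

def select(gammas, breakpoints_for, dev_error):
    """Return (error, C, gamma) for the pair with the least development test error."""
    best = None
    for gamma in gammas:
        for C in breakpoints_for(gamma):          # C-breakpoints chosen for this gamma
            err = dev_error(C, gamma)
            if best is None or err < best[0]:
                best = (err, C, gamma)
    return best

# Toy stand-ins (assumptions for illustration only).
gammas = candidate_gammas(K=10, orders=1)
toy_breakpoints = lambda gamma: [0.1, 1.0, 10.0]
toy_error = lambda C, gamma: abs(C - 1.0) + abs(gamma - 0.1)

best_err, best_C, best_gamma = select(gammas, toy_breakpoints, toy_error)
```

The development test data must be separate from both the training data and the final test data, otherwise the selected pair is tuned to the test set.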
Results, RBF SVM
SVM test corpus error vs.
$\lambda = 1/C$, classification of
nasal vs. non-nasal vowels.
Wang, WS04 Student
Presentation, 2004
High-Dimensional Linear SVMs
Motivation: “Project it Yourself”
• The purpose of a nonlinear SVM:
– f(x) contains higher-order polynomial terms in the elements of x.
– By combining these higher-order polynomial terms, $\sum_m y_m \alpha_m K(x, x_m)$ can
create a more flexible boundary than can $\sum_m y_m \alpha_m x^T x_m$.
– The flexibility of the boundary does not by itself lead to over-training:
the regularization term $\lambda |v|^2$ limits generalization error.
• A different approach:
– Augment x with higher-order terms, up to a very large dimension.
These terms can include:
• Polynomial terms, e.g., $x_i x_j$
• N-gram terms, e.g., ($x_i$ at time t AND $x_j$ at time t)
• Other features suggested by knowledge-based analysis of the problem
– Then: apply a linear SVM to the higher-dimensional problem
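As a small illustration of the "project it yourself" approach, the sketch below augments an observation with all second-order polynomial terms before it would be handed to a linear SVM. The function name `augment_quadratic` is hypothetical, and only the polynomial terms from the list above are shown.

```python
from itertools import combinations_with_replacement

def augment_quadratic(x):
    """Append every product x_i * x_j (i <= j) to the original features."""
    pairs = [x[i] * x[j]
             for i, j in combinations_with_replacement(range(len(x)), 2)]
    return list(x) + pairs

x = [1.0, 2.0, 3.0]
x_aug = augment_quadratic(x)
# K original features plus K(K+1)/2 products: 3 + 6 = 9 dimensions.
```

The dimension grows quadratically with K, which is why this approach relies on the regularization term rather than on a small feature count to control over-training.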
Example #1: Acoustic Classification
of Stop Place of Articulation
• Feature Dimension: K=483/10ms
– MFCCs + deltas + delta-deltas, 25ms window: K=39/10ms
– Spectral shape: energy, spectral tilt, and spectral compactness,
once/millisecond: K=40/10ms
– Noise-robust MUSIC-based formant frequencies, amplitudes,
and bandwidths: K=10/10ms
– Acoustic-phonetic parameters (Formant-based relative spectral
measures and time-domain measures): K=42/10ms
– Rate-place model of neural response fields in the cat auditory
cortex: K=352/10ms
• Observation = concatenation of up to 17 frames, for a
total of K = 17 × 483 = 8211 dimensions
• Results: Accuracy improves as you add more features,
up to 7 frames (one/10ms; 3381-dimensional x). Adding
more frames didn’t help.
• RBF SVM still outperforms linear SVM, but only by 1%
Example #2: Text Classification
• Goal:
– Utterances were recorded by physical therapy patients,
specifying their physical activity once/half hour for seven days.
– Example utterance: “I ate breakfast for twenty minutes, then I
walked to school for ten minutes.”
– Goal: for each time period, determine the type of physical
activity, from among 2000 possible type categories.
• Indicator features
– 50000 features: one per word, in a 50000-word dictionary
– $x = [d_1, d_2, d_3, \ldots, d_{50000}]^T$
– $d_i = 1$ if the ith dictionary word was contained in the utterance,
zero otherwise
– X is very sparse: most sentences contain only a few words
– Linear SVM is very efficient
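The efficiency comes from sparsity: an utterance can be stored as the set of indices whose indicator is 1, and a linear score then touches only those indices. The sketch below assumes a very simple tokenizer and a tiny made-up dictionary; `indicator_features`, `linear_score`, and the weights are illustrative, not the system described in the slides.

```python
def indicator_features(utterance, vocab_index):
    """Return the set of dictionary indices i for which d_i = 1."""
    words = utterance.lower().replace(",", " ").replace(".", " ").split()
    return {vocab_index[w] for w in words if w in vocab_index}

def linear_score(active, v, b):
    """Compute v.x + b using only the nonzero entries of x.

    v is a dict mapping feature index to weight (missing entries are zero).
    """
    return sum(v.get(i, 0.0) for i in active) + b

# Tiny hypothetical dictionary and weights.
vocab_index = {"walked": 0, "ate": 1, "school": 2, "swam": 3}
utterance = "I ate breakfast for twenty minutes, then I walked to school for ten minutes."

active = indicator_features(utterance, vocab_index)
score = linear_score(active, v={0: 1.5, 2: 0.5}, b=-1.0)
```

The cost of scoring depends on the number of words in the utterance, not on the 50000-word dictionary size.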
Example #2: Text Classification
• Result
– 85% classification accuracy
– Most incorrect classifications were reasonable to a
• “I played hopscotch with my daughter” = “playing a
game”, or “light physical exercise”?
– Some categories were never observed in the training
data, therefore no test data were assigned to those
categories
• Conclusion: SVM is learning keywords &
keyword combinations
• Plotting the Data: Use PCA, LDA, or any other
• If PDF is known: Use MAP classifier
• If PDF unknown: Structural Risk Minimization
• “SVM” is a training criterion – a particular upper
bound on structural risk of hyperplane
• Choosing hyperparameters
– Easy for a linear classifier
– For a nonlinear classifier: use the Entire
Regularization Path algorithm
• High-dimensional Linear SVMs: human user
acts as an “intelligent kernel”