Over-fitting, Regularization, and Review

Over-fitting and Regularization
Chapter 4 of the textbook
Lectures 11 and 12 on amlbook.com
Over-fitting is easy to recognize in 1D
Parabolic target function
4th order hypothesis
5 data points -> Ein = 0
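A quick numerical illustration of this (hypothetical data, using NumPy): with 5 points and a 4th-order hypothesis the fit interpolates the data exactly, so Ein = 0 no matter how noisy the sample is.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 5)                  # 5 data points
    y = x**2 + 0.1 * rng.standard_normal(5)    # parabolic target plus noise
    coeffs = np.polyfit(x, y, deg=4)           # 4th-order hypothesis: 5 coefficients fit 5 points exactly
    E_in = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(E_in)                                # numerically zero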
The origin of over-fitting can be analyzed in 1D:
Bias/variance dilemma. How does this apply to the
case on the previous slide?
Shape of the fit is very sensitive to noise in the data
Out-of-sample error will vary greatly from
one dataset to another.
Sum of squared deviations
Over-fitting is easy to avoid in 1D:
Results from HW1
[Plot from HW1: Eval and Ein vs. degree of polynomial]
Using Eval to avoid over-fitting works in all dimensions but
computation grows rapidly for large d
[Plot: Ein, Ecv, and Eval as terms are added]
Digit recognition one vs not one
d = 2 (intensity and symmetry)
Terms in Φ5(x) added successively
500 pts in training set
Validation set needs to be large; 8798 points in this case
What if we want to add higher order terms to a linear model
but don’t have enough data for a validation set?
Solution: Augment the error function used to optimize weights
Example
Penalizes choices with large |w|. Called “weight decay”
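A sketch of the augmented error in the weight-decay form used in the amlbook lectures (normalization conventions vary between texts):

    E_{aug}(w) = E_{in}(w) + \frac{\lambda}{N} w^T w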
Normal equations with weight decay essentially unchanged
(Z^T Z + \lambda I)\, w_{reg} = Z^T y
The best value of λ is subjective
In this case λ = 0.0001 is large enough to suppress the swings,
but the data are still important in determining the optimum weights.
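A minimal NumPy sketch of the fit, assuming Z is the transformed data matrix and y the target vector (the polynomial transform and data below are hypothetical):

    import numpy as np

    def weight_decay_fit(Z, y, lam=1e-4):
        """Solve the weight-decay normal equations (Z^T Z + lam*I) w = Z^T y."""
        d = Z.shape[1]
        return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

    # hypothetical usage: 5th-order polynomial transform of a 1D input
    x = np.linspace(-1, 1, 20)
    Z = np.vander(x, N=6, increasing=True)   # columns 1, x, ..., x^5
    y = x**2 + 0.1 * np.random.default_rng(1).standard_normal(20)
    w_reg = weight_decay_fit(Z, y, lam=1e-4)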
Review for Quiz 2
Topics:
linear models
extending linear models by transformation
dimensionality reduction
over-fitting and regularization
Two classes are distinguished by a threshold value of a linear combination
of d attributes. Explain how h(w|x) = sign(w^T x) becomes a hypothesis set
for linear binary classification
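One way to state it (with the bias folded in as x_0 = 1):

    h_w(x) = sign(w^T x), \quad x = (1, x_1, \ldots, x_d)

Each choice of w in R^{d+1} gives one hypothesis; the hypothesis set is H = { h_w : w in R^{d+1} }.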
More Review for Quiz 2
Topics:
linear models
extending linear models by transformation
dimensionality reduction
over-fitting and regularization
We have used 1-step optimization in 4 ways:
polynomial regression in 1D (curve fitting)
multivariate linear regression
extending linear models by transformation
regularization by weight decay
Two of these are equivalent; which ones?
More Review for Quiz 2
Topics:
linear models
extending linear models by transformation
dimensionality reduction
over-fitting and regularization
1-step optimization requires the in-sample error to be the sum of squared
residuals. Define the in-sample error for the following:
multivariate linear regression,
extending linear models by transformation
regularization by weight decay
For multivariate linear regression, E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} (w^T x_n - y_n)^2 = \frac{1}{N} \|Xw - y\|^2
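The other two cases follow the same pattern; a sketch, using Z = Φ(X) for the transformed data and the weight-decay form from above:

    E_{in}(w) = \frac{1}{N} \|Zw - y\|^2                                   (extended linear model)
    E_{aug}(w) = \frac{1}{N} \|Zw - y\|^2 + \frac{\lambda}{N} w^T w        (weight decay)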
Derive the normal equations for extended linear regression with weight decay
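A sketch of that derivation, setting the gradient of the augmented error to zero:

    \nabla E_{aug}(w) = \frac{2}{N} Z^T (Zw - y) + \frac{2\lambda}{N} w = 0
    \Rightarrow (Z^T Z + \lambda I)\, w_{reg} = Z^T y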
Interpret the “learning curve” for multivariate linear regression when
training data has normally distributed noise
• Why does Eout approach σ² from above?
• Why does Ein approach σ² from below?
• Why is Ein not defined for N < d+1?
What do these learning curves say about
simple vs. complex models?
Still larger than the bound set by noise
How do we estimate a good level of complexity without
sacrificing training data?
Why choose 3 rather than 4?
Review: Maximum Likelihood Estimation
• Estimate parameters θ of a probability
distribution given a sample X drawn from that
distribution
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
Form the likelihood function
• Likelihood of θ given the sample X
l(θ|X) = p(X|θ) = \prod_t p(x^t|θ)
• Log likelihood
L(θ|X) = \log l(θ|X) = \sum_t \log p(x^t|θ)
• Maximum likelihood estimator (MLE)
θ* = \arg\max_θ L(θ|X)
the value of θ that maximizes L(θ|X)
How was MLE used in logistic regression to derive an expression
for in-sample error?
In logistic regression, the parameters are the weights
Likelihood of w given the sample X
l(w|X) = p(X|w) = \prod_n p(x_n|w)
Log likelihood
L(w|X) = \log l(w|X) = \sum_n \log p(x_n|w)
In logistic regression, p(x_n|w) = θ(y_n w^T x_n), where θ(s) = e^s / (1 + e^s)
Since log is a monotone increasing function, maximizing the
log-likelihood is equivalent to minimizing the negative log-likelihood.
The text also normalizes by dividing by N; hence the error function becomes
E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right)
How?
Derive the log-likelihood function for a 1D Gaussian distribution
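For reference, a sketch of the standard result for an i.i.d. sample x^1, ..., x^N drawn from N(μ, σ²):

    L(\mu, \sigma \mid X) = \sum_{t=1}^{N} \log p(x^t \mid \mu, \sigma)
                          = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=1}^{N}(x^t - \mu)^2

Setting the derivatives with respect to μ and σ² to zero gives the familiar estimates m = (1/N) Σ_t x^t and s² = (1/N) Σ_t (x^t - m)².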
Stochastic gradient descent: correct the weights using the error at each data point
given
e_{in}(h(x_n), y_n) = \ln\left(1 + e^{-y_n w^T x_n}\right)
derive
\frac{\partial e_{in}}{\partial w} = \frac{-y_n x_n}{1 + e^{\,y_n w^T x_n}}
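A minimal NumPy sketch of one stochastic-gradient-descent step built from this gradient (the data point, label, and learning rate are hypothetical):

    import numpy as np

    def sgd_step(w, x_n, y_n, eta=0.1):
        """One SGD update for e_in = ln(1 + exp(-y_n w^T x_n))."""
        grad = -y_n * x_n / (1.0 + np.exp(y_n * np.dot(w, x_n)))
        return w - eta * grad

    w = np.zeros(3)                     # d+1 weights, bias folded in as x_0 = 1
    x_n = np.array([1.0, 0.5, -1.2])    # one data point
    y_n = 1                             # label in {-1, +1}
    w = sgd_step(w, x_n, y_n)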
PCA
I want to perform PCA on a dataset. What must I assume about the noise in
the data?
More PCA
Correlation coefficients of the normally distributed attributes x are zero. What
can we say about the covariance matrix of x?
More PCA
Attributes x are normally distributed with mean μ and
covariance Σ.
z = Mx is a linear transformation to feature space defined by
matrix M.
What are the mean and covariance of these features?
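For reference, the standard result for any fixed matrix M:

    E[z] = M\mu, \qquad Cov(z) = M \Sigma M^T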
More PCA
z_k is the feature defined by the projection of the attributes in the
direction of the eigenvector w_k of the covariance matrix.
Prove that the eigenvalue λ_k is the variance of z_k.
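A sketch of the argument, using Cov(x) = Σ, Σ w_k = λ_k w_k, and ||w_k|| = 1:

    Var(z_k) = Var(w_k^T x) = w_k^T \Sigma\, w_k = w_k^T (\lambda_k w_k) = \lambda_k w_k^T w_k = \lambda_k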
Constrained optimization
How do we find values of x1 and x2 that minimize f(x1, x2)
subject to the constraint g(x1, x2) = c?
Find the stationary points of f(x_1, x_2) = 1 - x_1^2 - x_2^2
subject to the constraint g(x_1, x_2) = x_1 + x_2 = 1
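A sketch of the solution with a Lagrange multiplier λ:

    \mathcal{L}(x_1, x_2, \lambda) = 1 - x_1^2 - x_2^2 - \lambda (x_1 + x_2 - 1)
    \partial\mathcal{L}/\partial x_1 = -2x_1 - \lambda = 0, \quad \partial\mathcal{L}/\partial x_2 = -2x_2 - \lambda = 0 \;\Rightarrow\; x_1 = x_2
    x_1 + x_2 = 1 \;\Rightarrow\; x_1 = x_2 = \tfrac{1}{2} \quad (\lambda = -1)

The stationary point (1/2, 1/2) is a constrained maximum of f, with f = 1/2.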