Over-fitting and Regularization
Chapter 4 of the textbook; Lectures 11 and 12 on amlbook.com

Over-fitting is easy to recognize in 1D
• Parabolic target function, 4th-order hypothesis, 5 data points -> Ein = 0

The origin of over-fitting can be analyzed in 1D: the bias/variance dilemma
• How does this apply to the case on the previous slide?
• The shape of the fit is very sensitive to noise in the data, so the out-of-sample error will vary greatly from one dataset to another.
• Measured by the sum of squared deviations.

Over-fitting is easy to avoid in 1D: results from HW1
(Plot: Eval and Ein vs. degree of polynomial)

Using Eval to avoid over-fitting works in all dimensions, but the computation grows rapidly for large d
(Plot: Ein, Ecv, and Eval vs. number of terms added)
• Digit recognition, one vs. not-one, d = 2 (intensity and symmetry)
• Terms in Φ5(x) added successively
• 500 points in the training set
• The validation set needs to be large; 8798 points in this case

What if we want to add higher-order terms to a linear model but don't have enough data for a validation set?
• Solution: augment the error function used to optimize the weights.
• Example: add a penalty term λ wᵀw to the sum of squared residuals, which penalizes choices with large |w|; called "weight decay".

Normal equations with weight decay are essentially unchanged:
(Zᵀ Z + λ I) w_reg = Zᵀ y
• The best value of λ is subjective. In this case λ = 0.0001 is large enough to suppress the swings, but the data are still important in determining the optimum weights. (A code sketch of these regularized normal equations appears after the review questions below.)

Review for Quiz 2
Topics: linear models, extending linear models by transformation, dimensionality reduction, over-fitting and regularization
• Two classes are distinguished by a threshold value of a linear combination of d attributes. Explain how h(w|x) = sign(wᵀx) becomes a hypothesis set for linear binary classification.

More Review for Quiz 2
• We have used 1-step optimization in 4 ways: polynomial regression in 1D (curve fitting), multivariate linear regression, extending linear models by transformation, and regularization by weight decay. Two of these are equivalent; which ones?

More Review for Quiz 2
• 1-step optimization requires the in-sample error to be the sum of squared residuals. Define the in-sample error for the following: multivariate linear regression, extending linear models by transformation, and regularization by weight decay.
• Starting from the in-sample error for multivariate linear regression, derive the normal equations for extended linear regression with weight decay.

Interpret the "learning curve" for multivariate linear regression when the training data has normally distributed noise:
• Why does Eout approach σ² from above?
• Why does Ein approach σ² from below?
• Why is Ein not defined for N < d+1?

What do these learning curves say about simple vs. complex models?
• Still larger than the bound set by the noise.

How do we estimate a good level of complexity without sacrificing training data?
• Why choose 3 rather than 4?
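The regularized normal equations above are easy to try numerically. The following is a minimal sketch, not code from the lecture: it assumes a noisy parabolic target, the 4th-order hypothesis and 5 data points from the slides, and λ = 0.0001; the function names and the noise level are my own choices for illustration.

```python
import numpy as np

def poly_features(x, degree):
    """Z matrix for 1D polynomial regression: rows are (1, x, x^2, ..., x^degree)."""
    return np.vander(x, degree + 1, increasing=True)

def fit_weight_decay(Z, y, lam):
    """Solve the weight-decay normal equations (Z^T Z + lambda*I) w_reg = Z^T y."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

# Assumed toy setup: parabolic target, 5 noisy points, 4th-order hypothesis.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 5)
y = x**2 + 0.1 * rng.standard_normal(x.size)

Z = poly_features(x, degree=4)
w_unreg = fit_weight_decay(Z, y, lam=0.0)   # interpolates all 5 points: Ein = 0, wild swings
w_reg = fit_weight_decay(Z, y, lam=1e-4)    # lambda = 0.0001 suppresses the swings
print("unregularized w:", w_unreg)
print("weight-decay  w:", w_reg)
```

Sweeping lam over several orders of magnitude and comparing Eval (or a cross-validation error) is one way to settle the "subjective" choice of λ mentioned above without giving up training data.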
Review: Maximum Likelihood Estimation
(Lecture Notes for E. Alpaydın, Introduction to Machine Learning 2e, © The MIT Press)
• Estimate parameters θ of a probability distribution given a sample X drawn from that distribution.
• Form the likelihood function. Likelihood of θ given the sample X: l(θ|X) = p(X|θ) = ∏t p(xt|θ)
• Log likelihood: L(θ|X) = log l(θ|X) = ∑t log p(xt|θ)
• Maximum likelihood estimator (MLE): θ* = argmaxθ L(θ|X), the value of θ that maximizes L(θ|X)

How was MLE used in logistic regression to derive an expression for the in-sample error?
• In logistic regression, the parameters are the weights w.
• Likelihood of w given the sample X: l(w|X) = p(X|w) = ∏n p(xn|w)
• Log likelihood: L(w|X) = log l(w|X) = ∑n log p(xn|w)
• In logistic regression, p(xn|w) = θ(yn wᵀxn), where θ(s) = eˢ/(1+eˢ) is the logistic function.
• Since log is a monotone increasing function, maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood.
• The text also normalizes by dividing by N; hence the error function becomes Ein(w) = (1/N) ∑n ln(1 + exp(−yn wᵀxn)). How?

Derive the log-likelihood function for a 1D Gaussian distribution.

Stochastic gradient descent: correct the weights using the error in each data point.
• Given e_in(h(xn), yn) = ln(1 + exp(−yn wᵀxn)), derive
  ∇e_in = −yn xn / (1 + exp(yn wᵀxn))
(A code sketch of this update appears at the end of these notes.)

PCA
• I want to perform PCA on a dataset. What must I assume about the noise in the data?

More PCA
• The correlation coefficients of normally distributed attributes x are zero. What can we say about the covariance of x?

More PCA
• Attributes x are normally distributed with mean m and covariance S. z = Mx is a linear transformation to a feature space defined by the matrix M. What are the mean and covariance of these features?

More PCA
• zk is the feature defined by projecting the attributes in the direction of the eigenvector wk of the covariance matrix. Prove that the eigenvalue λk is the variance of zk. (A numerical check appears at the end of these notes.)

Constrained optimization
• How do we find values of x1 and x2 that minimize f(x1, x2) subject to the constraint g(x1, x2) = c?
• Find the stationary points of f(x1, x2) = 1 − x1² − x2² subject to the constraint g(x1, x2) = x1 + x2 = 1.
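As a concrete check of the stochastic gradient descent update derived in the logistic-regression review above, here is a minimal sketch. It is not code from the lecture: the learning rate, the number of passes, and the small synthetic dataset are assumptions, and the function name sgd_logistic is mine.

```python
import numpy as np

def sgd_logistic(X, y, eta=0.1, epochs=100, seed=0):
    """Stochastic gradient descent for logistic regression.

    X: (N, d+1) matrix whose first column is all ones (x0 = 1)
    y: (N,) labels in {-1, +1}
    Per-point error:    e_in(w) = ln(1 + exp(-y_n w^T x_n))
    Per-point gradient: grad    = -y_n x_n / (1 + exp(y_n w^T x_n))
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(y)):           # visit points in random order
            s = y[n] * (w @ X[n])
            grad = -y[n] * X[n] / (1.0 + np.exp(s)) # gradient of e_in at point n
            w -= eta * grad                         # correct weights using this one point
    return w

# Assumed tiny synthetic example: two attributes plus a bias term.
rng = np.random.default_rng(1)
N = 200
x_raw = rng.standard_normal((N, 2))
y = np.sign(x_raw[:, 0] + 0.5 * x_raw[:, 1] + 0.1 * rng.standard_normal(N))
X = np.hstack([np.ones((N, 1)), x_raw])
print("learned weights:", sgd_logistic(X, y))
```

Each inner step uses only one data point, which is exactly the "correct the weights using the error in each data point" idea on the slide.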
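For the PCA question above (prove that the eigenvalue λk is the variance of zk), a quick numerical check can build intuition before writing the proof. This sketch assumes a small synthetic Gaussian dataset; the mean, covariance, and sample size are arbitrary choices, not values from the lecture.

```python
import numpy as np

# Numerical check: the eigenvalues of the covariance matrix equal the variances
# of the projected features z_k = w_k^T x, where w_k are the eigenvectors.
rng = np.random.default_rng(2)
mean = np.array([1.0, -2.0, 0.5])                       # assumed toy parameters
cov = np.array([[2.0, 0.6, 0.2],
                [0.6, 1.0, 0.3],
                [0.2, 0.3, 0.5]])
X = rng.multivariate_normal(mean, cov, size=100_000)

S = np.cov(X, rowvar=False)                  # sample covariance of the attributes
eigvals, eigvecs = np.linalg.eigh(S)         # columns of eigvecs are the directions w_k

Z = (X - X.mean(axis=0)) @ eigvecs           # project centered data onto the eigenvectors
print("eigenvalues:      ", eigvals)
print("variances of z_k: ", Z.var(axis=0, ddof=1))  # matches the eigenvalues
```

The printed variances match the eigenvalues, which is the claim the proof establishes analytically: var(zk) = wkᵀ S wk = λk wkᵀ wk = λk.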