Lecture 12 – Model Assessment and Selection
Rice ECE697, Farinaz Koushanfar, Fall 2006

Summary
• Bias, variance, and model complexity
• Optimism of the training error rate
• Estimates of in-sample prediction error, AIC
• Effective number of parameters
• The Bayesian approach and BIC
• Vapnik–Chervonenkis dimension
• Cross-validation
• Bootstrap method

Model Selection Criteria
• Loss function
• Training error
• Generalization error

Training Error vs. Test Error

Model Selection and Assessment
• Model selection:
  – Estimating the performance of different models in order to choose the best one
• Model assessment:
  – Having chosen a final model, estimating its prediction error (generalization error) on new data
• If we were rich in data, we would split it into three parts: Train | Validation | Test

Bias-Variance Decomposition
• As we have seen before, for Y = f(X) + ε with Var(ε) = σ_ε²,
  Err(x0) = E[(Y − f̂(x0))² | X = x0] = σ_ε² + [E f̂(x0) − f(x0)]² + E[f̂(x0) − E f̂(x0)]²
• The first term is the variance of the target around its true mean f(x0); the second term is the squared bias, the amount by which the average of our estimate differs from the true mean; the last term is the variance of f̂(x0)
• The more complex the model f̂, the lower the bias, but the higher the variance

Bias-Variance Decomposition (cont’d)
• For k-nearest neighbors:
  Err(x0) = σ_ε² + [f(x0) − (1/k) Σ_{ℓ=1}^k f(x_(ℓ))]² + σ_ε²/k
• For linear regression with p inputs:
  Err(x0) = σ_ε² + [f(x0) − E f̂_p(x0)]² + ||h(x0)||² σ_ε²

Bias-Variance Decomposition (cont’d)
• For linear regression, h(x0) = X(XᵀX)⁻¹x0 is the vector of weights that produces the fit f̂_p(x0) = x0ᵀ(XᵀX)⁻¹Xᵀy = h(x0)ᵀy, and hence Var[f̂_p(x0)] = ||h(x0)||² σ_ε²
• This variance changes with x0, but its average over the sample values xi is (p/N) σ_ε²

Example
• 50 observations and 20 predictors, uniformly distributed in the hypercube [0,1]²⁰
• Left: Y is 0 if X1 ≤ 1/2 and 1 if X1 > 1/2; apply k-nearest neighbors
• Right: Y is 1 if Σ_{j=1}^{10} Xj > 5 and 0 otherwise; apply best-subset linear regression of size p
[Figure: prediction error, squared bias, and variance as a function of model complexity for the two examples]

Example – Loss Function
[Figure: prediction error, squared bias, and variance for the same examples under a different loss function]

Optimism of the Training Error
• The training error
  err = (1/N) Σ_{i=1}^N L(yi, f̂(xi))
  is typically less than the true error
• In-sample error:
  Err_in = (1/N) Σ_{i=1}^N E_{Y^new}[L(Y_i^new, f̂(xi))]
• Optimism: op ≡ Err_in − err
• For squared error, 0–1, and other loss functions, one can show quite generally that
  E_y(op) = (2/N) Σ_{i=1}^N Cov(ŷi, yi)

Optimism (cont’d)
• Thus, the amount by which err underestimates the true error depends on how much yi affects its own prediction
• For a linear fit with d inputs or basis functions,
  Σ_{i=1}^N Cov(ŷi, yi) = d σ_ε²
• For the additive model Y = f(X) + ε, this gives
  E_y(Err_in) = E_y(err) + 2 (d/N) σ_ε²
• Optimism increases linearly with the number of inputs or basis functions d, and decreases as the training sample size increases

How to Account for Optimism?
• Estimate the optimism and add it to the training error, e.g., AIC, BIC, etc.
• Cross-validation and the bootstrap, in contrast, are direct estimates of the prediction error

Estimates of In-Sample Prediction Error
• The general form of the in-sample estimate is
  Êrr_in = err + ôp,
  where ôp is an estimate of the optimism
• Cp statistic: for an additive error model, when d parameters are fit under squared-error loss,
  Cp = err + 2 (d/N) σ̂_ε²,
  where σ̂_ε² is an estimate of the noise variance (e.g., from the mean squared error of a low-bias model)
• Using this criterion, we adjust the training error by a factor proportional to the number of basis functions
• The Akaike Information Criterion (AIC) is a similar but more generally applicable estimate of Err_in, used when a log-likelihood loss function is adopted

Akaike Information Criterion (AIC)
• AIC relies on a relationship that holds asymptotically as N → ∞:
  −2 E[log Pr_θ̂(Y)] ≈ −(2/N) E[loglik] + 2 (d/N)
• Pr_θ(Y) is a family of densities for Y (containing the “true” density), θ̂ is the maximum likelihood estimate of θ, and “loglik” is the maximized log-likelihood:
  loglik = Σ_{i=1}^N log Pr_θ̂(yi)

AIC (cont’d)
• For the Gaussian model (with σ_ε² assumed known), AIC ≡ Cp
• For logistic regression, using the binomial log-likelihood, we have
  AIC = −(2/N)·loglik + 2·(d/N)
• Choose the model that produces the smallest AIC
• What if we don’t know d?
• What if the model has tuning parameters?

AIC (cont’d)
• Given a set of models f_α(x) indexed by a tuning parameter α, denote by err(α) and d(α) the training error and the number of parameters for each model; then
  AIC(α) = err(α) + 2 (d(α)/N) σ̂_ε²
• The function AIC(α) provides an estimate of the test error curve, and we find the tuning parameter α̂ that minimizes it
• Note, however, that if we choose the best-fitting model among all models with d inputs (as in best-subset selection), the effective number of parameters fit is more than d
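As a small, self-contained illustration of the AIC(α) recipe above, the sketch below (not from the original slides; the synthetic data, the polynomial model family, and the choice of estimating σ̂_ε² from the most complex fit are all assumptions made here) fits polynomials of increasing degree by least squares and picks the degree that minimizes AIC(α) = err(α) + 2 (d(α)/N) σ̂_ε².

```python
import numpy as np

# Minimal sketch of AIC-based selection of a tuning parameter (polynomial degree).
# Synthetic data; every setting here is an illustrative assumption, not from the slides.
rng = np.random.default_rng(0)
N = 50
x = rng.uniform(-1.0, 1.0, N)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)   # "true" f plus noise

def fit_poly(x, y, degree):
    """Least-squares polynomial fit; returns training error and d = degree + 1 parameters."""
    X = np.vander(x, degree + 1)                  # design matrix with d = degree + 1 columns
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    err = np.mean((y - X @ beta) ** 2)            # training error under squared loss
    return err, degree + 1

degrees = list(range(1, 11))
fits = [fit_poly(x, y, deg) for deg in degrees]

# Estimate the noise variance from the most complex (low-bias) fit.
err_big, d_big = fits[-1]
sigma2_hat = err_big * N / (N - d_big)

# AIC(alpha) = err(alpha) + 2 * (d(alpha)/N) * sigma2_hat; choose the minimizer.
aic = [err + 2.0 * d / N * sigma2_hat for err, d in fits]
best_degree = degrees[int(np.argmin(aic))]
print("AIC per degree:", np.round(aic, 4))
print("selected degree:", best_degree)
```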
AIC – Example: Phoneme Recognition

The Effective Number of Parameters
• Generalize the notion of “number of parameters” to regularized fits that are linear in the outcomes, ŷ = Sy
• The effective number of parameters is d(S) = trace(S)
• The in-sample error estimate becomes
  Êrr_in = err + 2 (trace(S)/N) σ̂_ε²

The Effective Number of Parameters (cont’d)
• Thus, for a regularized linear fit ŷ = Sy with additive errors,
  Σ_{i=1}^N Cov(ŷi, yi) = trace(S) σ_ε²
• Hence trace(S) plays exactly the role that d plays in the optimism formula,
• and AIC/Cp carry over with d replaced by d(S) = trace(S)

The Bayesian Approach and BIC
• The Bayesian information criterion (BIC) is
  BIC = −2·loglik + (log N)·d
• Under the Gaussian model with known variance σ_ε², this is proportional to
  (N/σ_ε²)[err + (log N)(d/N) σ_ε²]
• BIC/2 is also known as the Schwarz criterion
• BIC is proportional to AIC (Cp), with the factor 2 replaced by log N; since log N > 2 for N > 7, BIC penalizes complex models more heavily, preferring simpler models

BIC (cont’d)
• BIC is asymptotically consistent as a selection criterion: given a family of models that includes the true one, the probability of selecting the true model approaches 1 as N → ∞
• Suppose we have a set of candidate models M_m, m = 1,…,M, with corresponding model parameters θ_m, and we wish to choose the best model
• Assuming a prior distribution Pr(θ_m | M_m) for the parameters of each model M_m, compute the posterior probability of each model

BIC (cont’d)
• The posterior probability is
  Pr(M_m | Z) ∝ Pr(M_m)·Pr(Z | M_m),
  where Z represents the training data
• To compare two models M_m and M_l, form the posterior odds
  Pr(M_m | Z) / Pr(M_l | Z) = [Pr(M_m)/Pr(M_l)] · [Pr(Z | M_m)/Pr(Z | M_l)]
• If the posterior odds are greater than one, choose model m; otherwise choose model l

BIC (cont’d)
• Bayes factor: the rightmost term in the posterior odds, Pr(Z | M_m)/Pr(Z | M_l)
• We need to approximate
  Pr(Z | M_m) = ∫ Pr(Z | θ_m, M_m) Pr(θ_m | M_m) dθ_m
• A Laplace approximation to the integral gives
  log Pr(Z | M_m) ≈ log Pr(Z | θ̂_m, M_m) − (d_m/2) log N + O(1)
• θ̂_m is the maximum likelihood estimate and d_m is the number of free parameters of model M_m
• If the loss function is set to −2 log Pr(Z | θ̂_m, M_m), this is equivalent to the BIC criterion

BIC (cont’d)
• Thus, choosing the model with minimum BIC is equivalent to choosing the model with the largest (approximate) posterior probability
• If we compute the BIC criterion for a set of M models, BIC_m, m = 1,…,M, then the posterior of each model is estimated as
  Pr(M_m | Z) ≈ e^(−0.5·BIC_m) / Σ_{ℓ=1}^M e^(−0.5·BIC_ℓ)
• Thus, we can estimate not only the best model, but also assess the relative merits of the models considered
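The BIC posterior weights above are easy to compute once each model's maximized log-likelihood is available. The following is a minimal sketch, assuming Gaussian linear models fit by least squares on made-up data (the candidate models, data, and settings are illustrative assumptions, not from the slides): it evaluates BIC_m = −2·loglik_m + (log N)·d_m for a nested sequence of models and converts the values into approximate posterior probabilities e^(−0.5·BIC_m) / Σ_ℓ e^(−0.5·BIC_ℓ).

```python
import numpy as np

# Minimal sketch: BIC for a nested sequence of Gaussian linear models, plus the
# approximate posterior model probabilities.  Data and candidate models are
# illustrative assumptions, not taken from the slides.
rng = np.random.default_rng(1)
N, p = 100, 5
X = rng.standard_normal((N, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * rng.standard_normal(N)   # only 2 predictors matter

def gaussian_bic(X_sub, y):
    """BIC = -2*loglik + log(N)*d for a least-squares fit with unknown noise variance."""
    n, k = X_sub.shape
    beta, *_ = np.linalg.lstsq(X_sub, y, rcond=None)
    resid = y - X_sub @ beta
    sigma2 = np.mean(resid ** 2)                          # ML estimate of the noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)  # maximized Gaussian log-likelihood
    d = k + 1                                             # regression coefficients plus the variance
    return -2.0 * loglik + np.log(n) * d

# Candidate models M_m: use the first m predictors, m = 1,...,p.
bics = np.array([gaussian_bic(X[:, :m], y) for m in range(1, p + 1)])

# Approximate posterior Pr(M_m | Z) proportional to exp(-0.5 * BIC_m), as on the BIC slide.
w = np.exp(-0.5 * (bics - bics.min()))   # subtract the minimum for numerical stability
posterior = w / w.sum()
print("BIC:", np.round(bics, 1))
print("posterior model probabilities:", np.round(posterior, 3))
print("selected model: first", int(np.argmin(bics)) + 1, "predictors")
```

Subtracting the minimum BIC before exponentiating leaves the normalized weights unchanged but avoids numerical overflow.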
Vapnik–Chervonenkis Dimension
• For a general class of functions it is difficult to specify the number of parameters
• The Vapnik–Chervonenkis (VC) theory provides a general measure of complexity and associated bounds on the optimism
• Consider a class of functions {f(x, α)} indexed by a parameter vector α, with x ∈ R^p
• Assume f is an indicator function, taking values 0 or 1
• If α = (α0, α1) and f is the linear indicator function I(α0 + α1ᵀx > 0), then it seems reasonable to say that its complexity is the number of parameters, p + 1
• How about f(x, α) = I(sin(α·x) > 0)? It has only a single parameter, yet it can be made arbitrarily wiggly by increasing α

VC Dimension (cont’d)
• The Vapnik–Chervonenkis dimension is a way of measuring the complexity of a class of functions by assessing how wiggly its members can be
• The VC dimension of the class {f(x, α)} is defined to be the largest number of points (in some configuration) that can be shattered by members of {f(x, α)}

VC Dimension (cont’d)
• A set of points is shattered by a class of functions if, no matter how we assign a binary label to each point, a member of the class can perfectly separate them
• Example: the VC dimension of linear indicator functions in the plane is 3 (three points in general position can always be shattered, but four cannot)

VC Dimension (cont’d)
• Using the concept of VC dimension, one can prove results about the optimism of the training error when fitting a class of functions. For example:
• If we fit N data points using a class of functions {f(x, α)} having VC dimension h, then with probability at least 1 − η over training sets,
  Err_T ≤ err + (ε/2)(1 + √(1 + 4·err/ε))   (binary classification),
  where ε = a1 [h(log(a2·N/h) + 1) − log(η/4)] / N
• For regression, a1 = a2 = 1 (Cherkassky and Mulier, 1998)

VC Dimension (cont’d)
• The bounds suggest that the optimism increases with h and decreases with N, in qualitative agreement with the AIC correction d/N
• The results of the VC dimension bounds are stronger: they give probabilistic upper bounds for all functions f(x, α), and hence allow for searching over the class

VC Dimension (cont’d)
• Vapnik’s Structural Risk Minimization (SRM) is built around the bounds described above
• SRM fits a nested sequence of models of increasing VC dimension h1 < h2 < …, and then chooses the model with the smallest value of the upper bound
• A drawback is the difficulty of computing the VC dimension of a class of functions; often only a crude upper bound is available, which may not be adequate

Example – AIC, BIC, SRM

Cross-Validation (CV)
• The most widely used method for estimating prediction error
• Directly estimates the generalization error by applying the model to a test sample
• K-fold cross-validation:
  – Use part of the data to build the model and a different part to test it
  – Do this for k = 1, 2,…,K, each time computing the prediction error on the kth part

CV (cont’d)
• Let κ: {1,…,N} → {1,…,K} be an indexing function that divides the data into K groups
• Let f̂^(−k)(x) denote the fitted function computed with the kth part of the data removed
• The CV estimate of prediction error is
  CV = (1/N) Σ_{i=1}^N L(yi, f̂^(−κ(i))(xi))
• If K = N, this is called leave-one-out CV
• Given a set of models f(x, α), with f̂^(−k)(x, α) the αth model fit with the kth part removed, we have
  CV(α) = (1/N) Σ_{i=1}^N L(yi, f̂^(−κ(i))(xi, α))

CV (cont’d)
• CV(α) should be minimized over α
• What should we choose for K?
• With K = N, CV is approximately unbiased, but it can have high variance because the N training sets are almost identical to one another; the computational burden is also considerable
• With lower K, CV has lower variance, but bias could be a problem!
• The most common choices are 5-fold and 10-fold CV!
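The CV(α) formula above translates almost line for line into code. The sketch below is illustrative only (the k-nearest-neighbour model, the data, the squared-error loss, and K = 5 are assumptions, not from the slides): it builds a random partition κ, fits with each fold removed, and chooses the neighbourhood size k that minimizes the cross-validated error.

```python
import numpy as np

# Minimal sketch of K-fold cross-validation written directly from the CV(alpha) formula.
# The k-nearest-neighbour model and the data are illustrative assumptions, not from the slides.
rng = np.random.default_rng(2)
N, K = 120, 5
x = rng.uniform(0.0, 1.0, N)
y = np.cos(4 * x) + 0.2 * rng.standard_normal(N)

def knn_predict(x_train, y_train, x_test, k):
    """Predict each test point as the mean response of its k nearest training points."""
    dist = np.abs(x_test[:, None] - x_train[None, :])
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

# kappa: {1,...,N} -> {1,...,K}, a random partition of the observations into K folds.
kappa = rng.permutation(np.repeat(np.arange(K), int(np.ceil(N / K)))[:N])

def cv_error(k):
    """CV(alpha) = (1/N) sum_i L(y_i, f^(-kappa(i))(x_i, alpha)) with squared-error loss."""
    losses = np.empty(N)
    for fold in range(K):
        test = kappa == fold
        pred = knn_predict(x[~test], y[~test], x[test], k)
        losses[test] = (y[test] - pred) ** 2
    return losses.mean()

neighbour_sizes = [1, 3, 5, 7, 11, 15, 25]
scores = {k: cv_error(k) for k in neighbour_sizes}
best_k = min(scores, key=scores.get)
print("CV error per k:", {k: round(v, 4) for k, v in scores.items()})
print("selected k:", best_k)
```

Setting K = N in the same code performs leave-one-out CV; as noted above, that estimate is nearly unbiased but more variable and far more expensive to compute.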
CV (cont’d)
• Generalized cross-validation (GCV) provides a convenient approximation to leave-one-out CV for linear fitting under squared-error loss, ŷ = Sy
• For linear fits (S_ii is the ith diagonal element of S),
  (1/N) Σ_{i=1}^N [yi − f̂^(−i)(xi)]² = (1/N) Σ_{i=1}^N [(yi − f̂(xi)) / (1 − S_ii)]²
• The GCV approximation is
  GCV = (1/N) Σ_{i=1}^N [(yi − f̂(xi)) / (1 − trace(S)/N)]²
• GCV may sometimes be advantageous in settings where the trace of S can be computed more easily than the individual elements S_ii

Bootstrap
• Denote the training set by Z = (z1,…,zN), where zi = (xi, yi)
• Randomly draw a dataset of size N with replacement from the training data
• This is done B times (e.g., B = 100), producing B bootstrap datasets
• Refit the model to each of the bootstrap datasets and examine the behavior of the fits over the B replications
• From the bootstrap samples we can estimate any aspect of the distribution of S(Z), where S(Z) is any quantity computed from the data

Bootstrap – Schematic
• For example, its variance:
  Var[S(Z)] ≈ (1/(B − 1)) Σ_{b=1}^B [S(Z^(*b)) − S̄*]²,
  where S̄* = (1/B) Σ_{b=1}^B S(Z^(*b))

Bootstrap (cont’d)
• A first attempt at using the bootstrap to estimate the prediction error:
  Êrr_boot = (1/B)(1/N) Σ_{b=1}^B Σ_{i=1}^N L(yi, f̂^(*b)(xi))
• Êrr_boot does not provide a good estimate:
  – Each bootstrap dataset acts as the training set while the original training set acts as the test set, and the two have observations in common
  – The overfit predictions will look unrealistically good
• By mimicking cross-validation, better bootstrap estimates can be obtained:
  – Only keep track of predictions from bootstrap samples that do not contain the observation being predicted

Bootstrap (cont’d)
• The leave-one-out bootstrap estimate of prediction error is
  Êrr^(1) = (1/N) Σ_{i=1}^N (1/|C^(−i)|) Σ_{b ∈ C^(−i)} L(yi, f̂^(*b)(xi))
• C^(−i) is the set of indices of the bootstrap samples b that do not contain observation i
• We either have to choose B large enough to ensure that all of the |C^(−i)| are greater than zero, or simply leave out the terms for which |C^(−i)| is zero

Bootstrap (cont’d)
• The leave-one-out bootstrap solves the overfitting problem, but it has a training-set-size bias
• The average number of distinct observations in each bootstrap sample is about 0.632·N
• Thus, if the learning curve has considerable slope at sample size N/2, the leave-one-out bootstrap will be biased upward as an estimate of the true error
• There are a number of proposed methods to alleviate this problem, e.g., the .632 estimator and refinements based on the no-information error rate and the relative overfitting rate

Bootstrap (Example)
• Five-fold CV and the .632 estimator applied to the same problems as before
• Any of these measures could be biased, but the bias does not matter for model selection as long as the relative performance of the models is preserved
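To make the leave-one-out bootstrap concrete, here is a minimal sketch (the simple linear model, the data, and B = 200 are illustrative assumptions, not from the slides): it draws B bootstrap samples, refits a least-squares line on each, averages every observation's loss only over the fits whose bootstrap sample did not contain it (the sets C^(−i) above), and finally forms the .632 combination 0.368·err + 0.632·Êrr^(1).

```python
import numpy as np

# Minimal sketch of the leave-one-out bootstrap error estimate and the .632 estimator
# for a least-squares linear fit.  Data and settings are illustrative assumptions,
# not from the slides.
rng = np.random.default_rng(3)
N, B = 80, 200
X = np.column_stack([np.ones(N), rng.uniform(-1.0, 1.0, N)])
y = 1.0 + 2.0 * X[:, 1] + 0.5 * rng.standard_normal(N)

def fit_predict(X_train, y_train, X_test):
    """Ordinary least squares fit on the training block, predictions on the test block."""
    beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return X_test @ beta

# For every observation i, collect losses from bootstrap fits whose sample
# did not contain i (the sets C^(-i) from the slide).
losses = [[] for _ in range(N)]
for b in range(B):
    idx = rng.integers(0, N, N)               # bootstrap sample Z^(*b), drawn with replacement
    left_out = np.setdiff1d(np.arange(N), idx)
    if left_out.size == 0:
        continue
    pred = fit_predict(X[idx], y[idx], X[left_out])
    for i, p in zip(left_out, pred):
        losses[i].append((y[i] - p) ** 2)

# Leave-one-out bootstrap: average only over observations with |C^(-i)| > 0.
err1 = np.mean([np.mean(obs) for obs in losses if obs])

# Training error and the .632 combination of the two estimates.
err_train = np.mean((y - fit_predict(X, y, X)) ** 2)
err_632 = 0.368 * err_train + 0.632 * err1
print("training error:         ", round(err_train, 4))
print("leave-one-out bootstrap:", round(err1, 4))
print(".632 estimate:          ", round(err_632, 4))
```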