Chapter 7: Model Assessment and Selection
Presentation date: July 15, 2004
Presenter: 정보혜

Contents
1. Introduction
2. Bias, Variance and Model Complexity
3. The Bias-Variance Decomposition
4. Optimism of the Training Error Rate
5. Estimates of In-Sample Prediction Error
6. The Effective Number of Parameters
7. The Bayesian Approach and BIC
8. Minimum Description Length
9. Vapnik-Chervonenkis Dimension
10. Cross-Validation
11. Bootstrap Methods

1. Introduction
◆ Model Assessment and Selection
- Model selection: estimating the performance of different models in order to choose the (approximately) best one.
- Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.
Assessment of performance guides the choice of learning method or model, and gives us a measure of the quality of the ultimately chosen model.

2. Bias, Variance and Model Complexity
◆ Test error (generalization error): the expected prediction error over an independent test sample, $Err = E[L(Y, \hat f(X))]$.
◆ Training error: the average loss over the training sample, $\overline{err} = \frac{1}{N}\sum_{i=1}^N L(y_i, \hat f(x_i))$.
- As the model becomes more complex, the bias decreases and the variance increases. In between there is an optimal model complexity that gives minimum test error.

3. The Bias-Variance Decomposition
The expected prediction error of a regression fit $\hat f(X)$ under squared-error loss, at an input point $X = x_0$:
  $Err(x_0) = E[(Y - \hat f(x_0))^2 \mid X = x_0] = \sigma_\varepsilon^2 + [E\hat f(x_0) - f(x_0)]^2 + E[\hat f(x_0) - E\hat f(x_0)]^2$
  $= \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}$.
- For the k-nearest-neighbor regression fit:
  $Err(x_0) = \sigma_\varepsilon^2 + \big[f(x_0) - \tfrac{1}{k}\sum_{\ell=1}^k f(x_{(\ell)})\big]^2 + \sigma_\varepsilon^2 / k$.
The number of neighbors k is inversely related to the model complexity: as k increases (complexity decreases), the bias typically increases and the variance decreases (a small simulation sketch follows at the end of this section).
- For a linear model fit $\hat f_p(x) = x^T\hat\beta$ by least squares:
  $Err(x_0) = \sigma_\varepsilon^2 + [f(x_0) - E\hat f_p(x_0)]^2 + \|h(x_0)\|^2 \sigma_\varepsilon^2$.
Here $h(x_0)$ is the N-vector of linear weights that produce the fit $\hat f_p(x_0) = x_0^T (X^T X)^{-1} X^T y$, and hence $\mathrm{Var}[\hat f_p(x_0)] = \|h(x_0)\|^2 \sigma_\varepsilon^2$. This variance changes with $x_0$; its average over the sample values $x_i$ is $(p/N)\sigma_\varepsilon^2$, and so the in-sample error is
  $\frac{1}{N}\sum_{i=1}^N Err(x_i) = \sigma_\varepsilon^2 + \frac{1}{N}\sum_{i=1}^N [f(x_i) - E\hat f(x_i)]^2 + \frac{p}{N}\sigma_\varepsilon^2$.
Model complexity is directly related to the number of parameters p.
- For a ridge regression fit $\hat f_\alpha(x_0)$: the variance term has the same form, but the linear weights $h(x_0)$ are different.
Bias term: let $\beta_*$ denote the parameters of the best-fitting linear approximation to f,
  $\beta_* = \arg\min_\beta E[f(X) - X^T\beta]^2$.
The average squared bias decomposes as
  $E_{x_0}[f(x_0) - E\hat f_\alpha(x_0)]^2 = E_{x_0}[f(x_0) - x_0^T\beta_*]^2 + E_{x_0}[x_0^T\beta_* - E\,x_0^T\hat\beta_\alpha]^2 = \text{Ave[Model Bias]}^2 + \text{Ave[Estimation Bias]}^2$.
The average squared model bias is the error between the best-fitting linear approximation and the true function; the average squared estimation bias is the error between the average estimate $E[x_0^T\hat\beta_\alpha]$ and the best-fitting linear approximation.

The figure shows the bias-variance tradeoff schematically. The model space is the set of all linear predictions from p inputs; the "closest fit" is $x_0^T\beta_*$. The large yellow circle indicates the variance of the least-squares fit. A shrunken or regularized fit (fitting a model with fewer predictors, or regularizing the coefficients by shrinking them toward zero) has an additional estimation bias, but a smaller variance.

3.1 Example: Bias-Variance Tradeoff
Prediction error (red), squared bias (green) and variance (blue). The top row is regression with squared-error loss; the bottom row is classification with 0-1 loss. The bias and variance curves are the same in regression and classification, but the prediction error curve differs. This means that the best choice of tuning parameter may differ in the two settings.
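To make the decomposition concrete, here is a minimal NumPy sketch (not from the slides; the true function f(x) = sin(2πx), the noise level, the test point x0 and the hand-rolled k-NN fit are all illustrative assumptions). It simulates many training sets, estimates the squared bias and the variance of the k-NN fit at x0, and prints σ²/k for a rough comparison with the variance term above.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                      # assumed true regression function (illustration only)
    return np.sin(2 * np.pi * x)

def knn_predict(x_train, y_train, x0, k):
    """Average the responses of the k nearest training points to x0."""
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[idx].mean()

N, k, sigma, x0 = 100, 7, 0.5, 0.3
reps = 2000
preds = np.empty(reps)

for r in range(reps):
    x = rng.uniform(0.0, 1.0, N)          # fresh training set each repetition
    y = f(x) + rng.normal(0.0, sigma, N)
    preds[r] = knn_predict(x, y, x0, k)

bias_sq  = (preds.mean() - f(x0)) ** 2     # [E f_hat(x0) - f(x0)]^2
variance = preds.var()                     # Var[f_hat(x0)], roughly sigma^2/k for large N
err_x0   = sigma**2 + bias_sq + variance   # expected prediction error at x0

print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f} "
      f"(sigma^2/k = {sigma**2 / k:.4f}), Err(x0) = {err_x0:.4f}")
```

Rerunning with larger k should show the bias term growing and the variance term shrinking, which is the tradeoff pictured in the example above.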
4. Optimism of the Training Error Rate
◆ Typically $\overline{err} < Err$, because the same data are being used both to fit the method and to assess its error.
- In-sample error:
  $Err_{in} = \frac{1}{N}\sum_{i=1}^N E_{Y^{new}}\, L\big(Y_i^{new}, \hat f(x_i)\big)$.
The $Y^{new}$ notation indicates that we observe N new response values at each of the training points $x_i$, i = 1, 2, ..., N; Err itself is the "extra-sample" error.
- Optimism: $op \equiv Err_{in} - \overline{err}$. For squared error, 0-1, and other loss functions, one can show quite generally that
  $E_y(op) = \frac{2}{N}\sum_{i=1}^N \mathrm{Cov}(\hat y_i, y_i)$.
In summary,
  $Err_{in} = E_y(\overline{err}) + \frac{2}{N}\sum_{i=1}^N \mathrm{Cov}(\hat y_i, y_i)$.
- Example (least squares linear fit with d inputs or basis functions): $\sum_i \mathrm{Cov}(\hat y_i, y_i) = d\,\sigma_\varepsilon^2$, and so
  $Err_{in} = E_y(\overline{err}) + 2\frac{d}{N}\sigma_\varepsilon^2$.
Thus the optimism grows as d grows and shrinks as N grows.
- An obvious way to estimate prediction error is to estimate the optimism and then add it to the training error rate $\overline{err}$; AIC, BIC and related criteria work in this way.

5. Estimates of In-Sample Prediction Error
◆ The general form of the in-sample estimates is
  $\widehat{Err}_{in} = \overline{err} + \widehat{op}$.
- Cp statistic (fit under squared-error loss):
  $C_p = \overline{err} + 2\frac{d}{N}\hat\sigma_\varepsilon^2$.
Here $\hat\sigma_\varepsilon^2$ is an estimate of the noise variance, obtained from the mean squared error of a low-bias model.
- AIC (fit under a log-likelihood loss):
  $AIC = -\frac{2}{N}\,\mathrm{loglik} + 2\frac{d}{N}$.
Here $\Pr_\theta(Y)$ is a family of densities for Y (containing the "true" density), $\hat\theta$ is the maximum-likelihood estimate of $\theta$, and "loglik" is the maximized log-likelihood
  $\mathrm{loglik} = \sum_{i=1}^N \log \Pr_{\hat\theta}(y_i)$.
AIC relies on a relationship similar to (7.20) that holds asymptotically as $N \to \infty$.
- For the Gaussian model (with the variance $\sigma_\varepsilon^2 = \hat\sigma_\varepsilon^2$ assumed known), the AIC statistic is equivalent to Cp.
▶ Choose the model giving the smallest AIC over the set of models considered. For nonlinear and other complex models, replace d by some measure of model complexity.
- Given a set of models $f_\alpha(x)$ indexed by a tuning parameter $\alpha$, denote by $\overline{err}(\alpha)$ and $d(\alpha)$ the training error and the number of parameters for each model. We define
  $AIC(\alpha) = \overline{err}(\alpha) + 2\frac{d(\alpha)}{N}\hat\sigma_\varepsilon^2$,
and choose the value of $\alpha$ that minimizes it (a small numerical sketch comparing Cp, AIC and BIC follows after Section 8).
- Caveat: if we have a total of p inputs and we choose the best-fitting linear model with d < p inputs, the optimism will exceed $2\frac{d}{N}\sigma_\varepsilon^2$, because choosing the best-fitting model effectively fits more than d parameters.

6. The Effective Number of Parameters
◆ A linear fitting method:
  $\hat{\mathbf y} = \mathbf S\,\mathbf y$,
where S is an N x N matrix depending on the input vectors $x_i$ but not on the $y_i$. Then the effective number of parameters is defined as
  $d(\mathbf S) = \mathrm{trace}(\mathbf S)$.
- If S is an orthogonal-projection matrix onto a basis set spanned by M features, then trace(S) = M.
- trace(S) is exactly the correct quantity to replace d as the number of parameters in the Cp statistic.

7. The Bayesian Approach and BIC
◆ BIC (the Bayesian information criterion):
  $BIC = -2\,\mathrm{loglik} + (\log N)\, d$.
- Under the Gaussian model (assuming the variance $\sigma_\varepsilon^2$ is known), $-2\,\mathrm{loglik}$ equals $\sum_i (y_i - \hat f(x_i))^2/\sigma_\varepsilon^2$ up to a constant, so
  $BIC = \frac{N}{\sigma_\varepsilon^2}\Big[\overline{err} + (\log N)\frac{d}{N}\sigma_\varepsilon^2\Big]$
(this is for squared-error loss). BIC is thus proportional to AIC (Cp), with the factor 2 replaced by log N. Since log N > 2 for N > 7, BIC tends to penalize complex models more heavily than AIC.

◆ The Bayesian approach to model selection
Suppose we have a set of candidate models $M_m$, m = 1, ..., M, with corresponding model parameters $\theta_m$, and we wish to choose a best model from among them. Assume we have a prior distribution $\Pr(\theta_m \mid M_m)$ for the parameters of each model $M_m$.
- The posterior probability of a given model is
  $\Pr(M_m \mid Z) \propto \Pr(M_m)\,\Pr(Z \mid M_m) = \Pr(M_m)\int \Pr(Z \mid \theta_m, M_m)\,\Pr(\theta_m \mid M_m)\, d\theta_m$,
where Z represents the training data $\{x_i, y_i\}_1^N$.
- The posterior odds:
  $\frac{\Pr(M_m \mid Z)}{\Pr(M_\ell \mid Z)} = \frac{\Pr(M_m)}{\Pr(M_\ell)} \cdot \frac{\Pr(Z \mid M_m)}{\Pr(Z \mid M_\ell)}$.
If the odds are greater than one we choose model m, otherwise we choose model $\ell$. The rightmost factor, the Bayes factor $\Pr(Z \mid M_m)/\Pr(Z \mid M_\ell)$, is the contribution of the data toward the posterior odds.

◆ Approximating $\Pr(Z \mid M_m)$ (a Laplace approximation) gives
  $\log \Pr(Z \mid M_m) \approx \log \Pr(Z \mid \hat\theta_m, M_m) - \frac{d_m}{2}\log N + O(1)$
as $N \to \infty$, under some regularity conditions; here $\hat\theta_m$ is the maximum-likelihood estimate and $d_m$ the number of free parameters of model $M_m$. With the loss function $-2\log \Pr(Z \mid \hat\theta_m, M_m)$, this is equivalent to the BIC criterion.
∴ Choosing the model with minimum BIC is equivalent to choosing the model with the largest (approximate) posterior probability.

◆ AIC vs BIC
- As $N \to \infty$, BIC will select the correct model: the probability that it does so approaches one. AIC, by contrast, tends to choose models which are too complex.
- But for finite samples, BIC often chooses models that are too simple, because of its heavy penalty on complexity.

8. Minimum Description Length
◆ The theory of coding for data: if messages $z_i$ are sent with probabilities $\Pr(z_i)$, i = 1, 2, ..., use code lengths $l_i = -\log_2 \Pr(z_i)$; the average message length then attains the entropy lower bound $E(\text{length}) \ge -\sum_i \Pr(z_i)\log_2 \Pr(z_i)$.
◆ Model selection: we have a model M with parameters $\theta$, and data Z = (X, y) consisting of both inputs and outputs. Let the probability of the outputs under the model be $\Pr(y \mid \theta, M, X)$, and assume the receiver knows the inputs. The message length required to transmit the outputs is
  $\text{length} = -\log \Pr(y \mid \theta, M, X) - \log \Pr(\theta \mid M)$.
- Choose the model that minimizes the length. Minimizing description length is equivalent to maximizing posterior probability.
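As a concrete illustration of the in-sample criteria of Sections 5-7, the sketch below (the synthetic quadratic data and the nested polynomial models are my own assumptions, not the book's examples) computes the training error, Cp, AIC and BIC for Gaussian linear models of increasing size; the noise variance is estimated from the largest, low-bias model, as the Cp definition suggests.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
x = rng.uniform(-1, 1, N)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 0.3, N)   # true model is quadratic (assumed)

def design(x, d):
    # polynomial basis with d columns: 1, x, ..., x^(d-1)
    return np.vander(x, d, increasing=True)

max_d = 8
# estimate sigma^2 from the mean squared error of a low-bias (largest) model
X_big = design(x, max_d)
beta_big, *_ = np.linalg.lstsq(X_big, y, rcond=None)
resid_big = y - X_big @ beta_big
sigma2_hat = resid_big @ resid_big / (N - max_d)

for d in range(1, max_d + 1):
    X = design(x, d)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    err = np.mean((y - X @ beta) ** 2)                 # training error
    cp  = err + 2 * d / N * sigma2_hat                 # Cp = err + 2 d/N sigma^2
    # Gaussian log-likelihood with known variance, constants dropped:
    # -2 loglik = N * err / sigma^2, so AIC and BIC are on the same scale
    aic = N * err / sigma2_hat + 2 * d
    bic = N * err / sigma2_hat + np.log(N) * d
    print(f"d={d}: err={err:.4f}  Cp={cp:.4f}  AIC={aic:.1f}  BIC={bic:.1f}")
```

On this kind of data the training error keeps falling with d, while Cp, AIC and BIC typically bottom out near the true model size, with BIC's log N penalty favoring the smaller models, as discussed in Section 7.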
9. Vapnik-Chervonenkis Dimension
◆ Definition of the VC dimension
The VC dimension of the class $\{f(x, \alpha)\}$ is defined to be the largest number of points (in some configuration) that can be shattered by members of $\{f(x, \alpha)\}$.
- The VC dimension is a way of measuring the complexity of a class of functions by assessing how wiggly its members can be.
- Example: $\sin(\alpha x)$ is a very wiggly function that gets even rougher as the frequency $\alpha$ increases, yet it has only one parameter; its VC dimension is infinite.
- In general, a linear indicator function in p dimensions has VC dimension p + 1, which is also the number of free parameters.
- If we fit N training points using a class of functions $\{f(x, \alpha)\}$ having VC dimension h, then with probability at least $1 - \eta$ over training sets, bounds of the form (7.41) hold; for binary classification, for example,
  $Err \le \overline{err} + \frac{\epsilon}{2}\Big(1 + \sqrt{1 + 4\,\overline{err}/\epsilon}\Big)$, with $\epsilon = a_1\,\frac{h[\log(a_2 N/h) + 1] - \log(\eta/4)}{N}$.
These bounds grow with h and shrink with N, in qualitative agreement with the AIC correction; but the results in (7.41) are stronger, since they hold simultaneously for all members of the class rather than only in expectation.
- The SRM (structural risk minimization) approach fits a nested sequence of models of increasing VC dimensions $h_1 < h_2 < \cdots$, and then chooses the model with the smallest value of the upper bound.
- A drawback of the SRM approach is the difficulty of calculating the VC dimension of a class of functions; often only a crude upper bound is available.

9.1 Example
Boxplots show the distribution of the relative error over the four scenarios of Figure 7.3; this is the error incurred by using the chosen model relative to the best model. AIC seems to work well in all four scenarios, despite the lack of theoretical support with 0-1 loss. BIC does nearly as well, while the performance of SRM is mixed.

10. Cross-Validation
◆ A method for estimating the extra-sample error Err directly.
◆ K-fold cross-validation
Split the data into K roughly equal-sized parts. Let $\kappa : \{1, \dots, N\} \to \{1, \dots, K\}$ be an indexing function that indicates the partition to which observation i is allocated by the randomization, and let $\hat f^{-k}(x)$ denote the fitted function with the kth part of the data removed.
- The cross-validation estimate of prediction error:
  $CV = \frac{1}{N}\sum_{i=1}^N L\big(y_i, \hat f^{-\kappa(i)}(x_i)\big)$.
- The case K = N is known as leave-one-out cross-validation.
- What value should we choose for K? With smaller K the estimate has more bias but less variance; with larger K it has less bias but more variance. (A short code sketch of the K-fold loop follows at the end of this section.)

The figure shows the prediction error and the tenfold cross-validation curve estimated from a single training set. Both curves have minima at p = 10, although the CV curve is rather flat beyond 10.
"One-standard-error" rule: choose the most parsimonious model whose error is no more than one standard error above the error of the best model. Here it looks like a model with about p = 9 predictors would be chosen, while the true model uses p = 10.

◆ Generalized cross-validation
- For many linear fitting methods $\hat{\mathbf y} = \mathbf S\,\mathbf y$, the leave-one-out (K = N) cross-validation under squared-error loss can be written
  $\frac{1}{N}\sum_{i=1}^N \big[y_i - \hat f^{-i}(x_i)\big]^2 = \frac{1}{N}\sum_{i=1}^N \Big[\frac{y_i - \hat f(x_i)}{1 - S_{ii}}\Big]^2$.
- The GCV approximation replaces each $S_{ii}$ by its average, $\mathrm{trace}(\mathbf S)/N$:
  $GCV = \frac{1}{N}\sum_{i=1}^N \Big[\frac{y_i - \hat f(x_i)}{1 - \mathrm{trace}(\mathbf S)/N}\Big]^2$.
- The similarity between GCV and AIC can be seen from the approximation $1/(1 - x)^2 \approx 1 + 2x$.
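A minimal K-fold cross-validation loop in the same NumPy-only style (the quadratic data-generating model and the polynomial least-squares fit are illustrative assumptions), showing the indexing function κ and the per-fold refitting described above:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 150, 10
x = rng.uniform(-1, 1, N)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.4, N)   # assumed data-generating model

def fit_predict(x_tr, y_tr, x_te, d):
    """Least-squares polynomial fit with d basis columns, evaluated on x_te."""
    X_tr = np.vander(x_tr, d, increasing=True)
    X_te = np.vander(x_te, d, increasing=True)
    beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return X_te @ beta

# kappa: randomized indexing function assigning each observation to one of K folds
kappa = rng.permutation(np.arange(N) % K)

def cv_error(d):
    sq_err = np.empty(N)
    for k in range(K):
        test = (kappa == k)
        pred = fit_predict(x[~test], y[~test], x[test], d)   # f^{-k}: fitted without fold k
        sq_err[test] = (y[test] - pred) ** 2
    return sq_err.mean()          # CV = (1/N) sum_i L(y_i, f^{-kappa(i)}(x_i))

for d in range(1, 7):
    print(f"model size d={d}: CV error = {cv_error(d):.4f}")
```

Setting K = N in this loop gives leave-one-out cross-validation; the one-standard-error rule would additionally compute the per-fold standard error of the CV estimates before picking d.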
11. Bootstrap Methods
◆ The bootstrap process
We wish to assess the statistical accuracy of a quantity S(Z) computed from our dataset $Z = (z_1, \dots, z_N)$. B training sets $Z^{*1}, \dots, Z^{*B}$, each of size N, are drawn with replacement from the original dataset. The quantity of interest is computed from each bootstrap training set, and the values $S(Z^{*1}), \dots, S(Z^{*B})$ are used to assess the statistical accuracy of S(Z).
- For example, the estimate of the variance of S(Z) is
  $\widehat{\mathrm{Var}}[S(Z)] = \frac{1}{B - 1}\sum_{b=1}^B \big(S(Z^{*b}) - \bar S^*\big)^2$, where $\bar S^* = \frac{1}{B}\sum_b S(Z^{*b})$.
This is a Monte Carlo estimate of the variance of S(Z) under sampling from the empirical distribution function $\hat F$ for the data $(z_1, \dots, z_N)$.

◆ Bootstrapping the prediction error
If $\hat f^{*b}(x_i)$ is the predicted value at $x_i$ from the model fitted to the bth bootstrap dataset, a naive estimate is
  $\widehat{Err}_{boot} = \frac{1}{B}\frac{1}{N}\sum_{b=1}^B\sum_{i=1}^N L\big(y_i, \hat f^{*b}(x_i)\big)$.
- The bootstrap samples act as the "training" samples and the original sample acts as the "test" sample.
- This underestimates Err, because of the overlap between the "training" and the "test" samples: many observations appear in both.
◆ Leave-one-out bootstrap
  $\widehat{Err}^{(1)} = \frac{1}{N}\sum_{i=1}^N \frac{1}{|C^{-i}|}\sum_{b \in C^{-i}} L\big(y_i, \hat f^{*b}(x_i)\big)$.
Here $C^{-i}$ is the set of indices of the bootstrap samples b that do not contain observation i, and $|C^{-i}|$ is the number of such samples.
- $\widehat{Err}^{(1)}$ is biased upward as an estimate of Err, because the average number of distinct observations in each bootstrap sample is only about 0.632 N.
◆ 0.632 bootstrap estimate
  $\widehat{Err}^{(.632)} = 0.368\,\overline{err} + 0.632\,\widehat{Err}^{(1)}$.
(A sketch implementing these three estimators follows below.)

The figure shows the results of fivefold cross-validation and the 0.632+ bootstrap estimate on the same four problems as Figure 7.7. Both measures perform well overall, perhaps the same as or slightly worse than the AIC in Figure 7.7.
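The sketch below (same synthetic setup and least-squares fit as in the earlier examples; purely illustrative) implements the naive bootstrap estimate, the leave-one-out bootstrap $\widehat{Err}^{(1)}$, and the 0.632 combination for squared-error loss.

```python
import numpy as np

rng = np.random.default_rng(3)
N, B, d = 100, 200, 3                  # B assumed large enough that every i is left out of some sample
x = rng.uniform(-1, 1, N)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.4, N)
X = np.vander(x, d, increasing=True)

def lstsq_fit(X_tr, y_tr):
    beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return beta

# training error of the fit on the full data (err-bar in the 0.632 formula)
beta_full = lstsq_fit(X, y)
err = np.mean((y - X @ beta_full) ** 2)

loss = np.empty((B, N))                # loss[b, i] = L(y_i, f*b(x_i))
in_sample = np.zeros((B, N), bool)     # does observation i appear in bootstrap sample b?

for b in range(B):
    idx = rng.integers(0, N, N)        # draw N indices with replacement
    in_sample[b, np.unique(idx)] = True
    beta_b = lstsq_fit(X[idx], y[idx])
    loss[b] = (y - X @ beta_b) ** 2

err_boot = loss.mean()                 # naive estimate: optimistic (train/test overlap)

# leave-one-out bootstrap: for each i, average only over samples b not containing i
keep = ~in_sample
err1 = np.mean([loss[keep[:, i], i].mean() for i in range(N)])

err632 = 0.368 * err + 0.632 * err1    # 0.632 estimate
print(f"err={err:.4f}  Err_boot={err_boot:.4f}  Err(1)={err1:.4f}  Err(.632)={err632:.4f}")
```

Typically err is the smallest of the three, Err_boot sits between err and Err(1) because of the training/test overlap, and the 0.632 estimate pulls the upward-biased Err(1) back toward err, which is the motivation given on the slides above.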