POMS.6120 Statistics for Predictive Analytics
Nichalin Summerfield, Ph.D., Fall 2016

Chapter 5: Resampling Methods

Resampling methods involve repeatedly drawing samples from a training set and refitting a model on each sample, in order to obtain additional information about the fitted model.

Two resampling methods:
1. Cross-validation (leave-one-out and k-fold)
2. The bootstrap

Purposes:
◦ Model assessment = the process of evaluating a model's performance
  ◦ E.g., to estimate the test error or its variability
◦ Model selection = the process of selecting the proper level of flexibility for a model
  ◦ E.g., to select K for KNN, or to select the degree of a polynomial in regression

Resampling methods can be computationally expensive.

Chapter 5 outline
Cross-Validation
o The Validation Set Approach
o Leave-One-Out Cross-Validation
o k-Fold Cross-Validation
o Bias-Variance Trade-Off for k-Fold CV
o Cross-Validation on Classification Problems
The Bootstrap

(Review figure slides omitted: training error versus test error, and training- versus test-set performance as a function of model flexibility; more on prediction-error estimates.)

The Validation Set Approach

Try this using the Auto data:

library(ISLR)
set.seed(1)
train=sample(392,196)        # randomly choose 196 of the 392 rows as the training set
train                        # print the training indices
lm.fit=lm(mpg~horsepower,data=Auto,subset=train)
attach(Auto)
Pred=predict(lm.fit,Auto)
mean((mpg-Pred)[-train]^2)   # validation MSE, computed on the observations not in train

set.seed(2)                  # repeat with a different random split
train=sample(392,196)
train
lm.fit=lm(mpg~horsepower,data=Auto,subset=train)
mean((mpg-predict(lm.fit,Auto))[-train]^2)

Example: Auto data
Suppose that we want to predict mpg from horsepower.
Two models:
◦ mpg ~ horsepower
◦ mpg ~ horsepower + horsepower^2
Which model gives a better fit?
◦ Randomly split the Auto data set into training data (196 obs.) and validation data (196 obs.)
◦ Fit both models using the training data
◦ Then evaluate both models using the validation data
◦ The model with the lowest validation (test) MSE is the winner!

Now try this! We want to compare the performance of different polynomial degrees.

set.seed(1)
train=sample(392,196)
lm.fit=lm(mpg~horsepower,data=Auto,subset=train)
mean((mpg-predict(lm.fit,Auto))[-train]^2)
lm.fit2=lm(mpg~poly(horsepower,2),data=Auto,subset=train)
mean((mpg-predict(lm.fit2,Auto))[-train]^2)
lm.fit3=lm(mpg~poly(horsepower,3),data=Auto,subset=train)
mean((mpg-predict(lm.fit3,Auto))[-train]^2)

set.seed(2)                  # same comparison on a different random split
train=sample(392,196)
lm.fit=lm(mpg~horsepower,data=Auto,subset=train)
mean((mpg-predict(lm.fit,Auto))[-train]^2)
lm.fit2=lm(mpg~poly(horsepower,2),data=Auto,subset=train)
mean((mpg-predict(lm.fit2,Auto))[-train]^2)
lm.fit3=lm(mpg~poly(horsepower,3),data=Auto,subset=train)
mean((mpg-predict(lm.fit3,Auto))[-train]^2)

Example: Auto data (figures)
Left: the validation error curve for a single random split.
Right: the validation approach repeated 10 times, each time with a different random split.
There is a lot of variability among the MSEs... not good! We need more stable methods.

The Validation Set Approach
Advantages:
◦ Simple
◦ Easy to implement
Disadvantages:
◦ The validation MSE can be highly variable
◦ Only a subset of the observations (the training data) is used to fit the model. Statistical methods tend to perform worse when trained on fewer observations.
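To see the variability claim in numbers, the random split can be repeated many times and the validation MSEs compared. A minimal sketch (not from the slides; the loop and the object name val.mse are illustrative), assuming ISLR's Auto data is available:

# Repeat the 196/196 validation split 10 times and record the validation MSE
# for polynomial degrees 1-3 (illustrative sketch)
library(ISLR)
val.mse=matrix(0,nrow=10,ncol=3)
for (s in 1:10){
  set.seed(s)
  train=sample(392,196)
  for (d in 1:3){
    fit=lm(mpg~poly(horsepower,d),data=Auto,subset=train)
    val.mse[s,d]=mean((Auto$mpg-predict(fit,Auto))[-train]^2)
  }
}
val.mse               # each row is one random split; note how much the MSEs move around
apply(val.mse,2,sd)   # spread of the validation MSE for each polynomial degree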
Leave-One-Out Cross-Validation (LOOCV)
• LOOCV also splits the set of observations into two parts.
• However, instead of creating two subsets of comparable size, a single observation is used as the validation set, and the remaining $n-1$ observations make up the training set.
• The process is repeated $n$ times, once for each observation. There is no randomness!

Measuring MSE:
First data point as the test set: $\mathrm{MSE}_1 = (y_1 - \hat{y}_1)^2$
Second data point as the test set: $\mathrm{MSE}_2 = (y_2 - \hat{y}_2)^2$
And so on...
The LOOCV estimate of the test MSE is the average of these MSEs:
$CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{MSE}_i$

Major advantages of LOOCV over the validation set approach:
1. It has far less bias, i.e., LOOCV tends not to overestimate the test error rate as much as the validation set approach does.
2. The validation set approach yields different results when applied repeatedly because of the random split, but performing LOOCV multiple times always yields the same results (no randomness).

Disadvantages:
1. LOOCV can be expensive to implement, since the model has to be fit $n$ times.
• Worse if $n$ is large
• Worse if each individual model is slow to fit
Exception: for least squares linear or polynomial regression there is a formula for the LOOCV test MSE, $CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2$, where $h_i$ is the leverage of observation $i$, so you do not have to fit the model $n$ times. There is no such formula for other models, though.

Try this:

glm.fit=glm(mpg~horsepower,data=Auto)   # glm with the default Gaussian family is the same fit as lm
coef(glm.fit)
lm.fit=lm(mpg~horsepower,data=Auto)
coef(lm.fit)

library(boot)
glm.fit=glm(mpg~horsepower,data=Auto)
cv.err=cv.glm(Auto,glm.fit)             # cv.glm with no K argument performs LOOCV
cv.err$delta                            # LOOCV estimate of the test MSE

cv.error=rep(0,5)
for (i in 1:5){
  glm.fit=glm(mpg~poly(horsepower,i),data=Auto)
  cv.error[i]=cv.glm(Auto,glm.fit)$delta[1]
}
cv.error                                # LOOCV MSE for polynomial degrees 1 through 5

k-fold Cross-Validation
LOOCV is computationally intensive, so we can run k-fold cross-validation instead. Very popular!
With k-fold cross-validation, we divide the data set into $k$ different parts (e.g., $k = 5$ or $k = 10$).
We then remove the first part, fit the model on the remaining $k-1$ parts (combined), and see how good the predictions are on the left-out part (i.e., compute the MSE on the first part).
We repeat this $k$ times, taking out a different part each time.
Averaging the $k$ MSEs gives an estimated validation (test) error rate for new observations:
$CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} \mathrm{MSE}_i$
LOOCV is the same as k-fold cross-validation with $k = n$.

Auto data revisited (figures)
Left: the LOOCV error curve.
Right: 10-fold CV run many times; the figure shows the slightly different CV error curves.
Both are stable, but LOOCV is more computationally intensive!

Auto data: Validation Set Approach vs. k-fold CV (figures)
Left: the validation set approach. Right: 10-fold cross-validation.
Indeed, 10-fold CV is much more stable!
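Before handing the fold mechanics to cv.glm (next code block), here is a minimal sketch of 10-fold CV done by hand for mpg ~ horsepower. The fold construction and object names are illustrative, not from the slides.

# Manual 10-fold CV for mpg ~ horsepower (illustrative sketch)
library(ISLR)
set.seed(10)
k=10
folds=sample(rep(1:k,length.out=nrow(Auto)))   # assign each row to one of k folds at random
fold.mse=rep(0,k)
for (j in 1:k){
  fit=lm(mpg~horsepower,data=Auto[folds!=j,])      # fit on the other k-1 folds
  pred=predict(fit,Auto[folds==j,])                # predict on the held-out fold
  fold.mse[j]=mean((Auto$mpg[folds==j]-pred)^2)    # MSE_j on the held-out fold
}
mean(fold.mse)   # CV_(k); compare with cv.glm(Auto,glm.fit,K=10)$delta[1]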
Try this:

set.seed(17)
cv.error.12=rep(0,12)
for (i in 1:12){
  glm.fit=glm(mpg~poly(horsepower,i),data=Auto)
  cv.error.12[i]=cv.glm(Auto,glm.fit,K=10)$delta[1]   # 10-fold CV MSE for degree i
}
cv.error.12

Bias-Variance Trade-off for k-fold CV
Putting aside the fact that LOOCV is more computationally intensive than k-fold CV, which is better: LOOCV or k-fold CV?
◦ LOOCV is less biased than k-fold CV (when k < n).
◦ But LOOCV has higher variance than k-fold CV (when k < n).
◦ The mean of many highly correlated quantities has higher variance than the mean of many quantities that are not as highly correlated.
◦ In LOOCV, the n fitted models are trained on nearly identical data sets, so their outputs are highly correlated with each other.
◦ When we perform k-fold CV with k < n, we average the outputs of k fitted models that are somewhat less correlated with each other, since the overlap between the training sets is smaller.
◦ Thus there is a bias-variance trade-off in the choice of k.
Conclusion:
◦ We tend to use k-fold CV with K = 5 or K = 10.
◦ These are the "magical" K's.
◦ It has been shown empirically that they yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

Cross-Validation on Classification Problems
So far we have been using CV on regression problems. We can use cross-validation in a classification setting in a similar manner:
◦ Divide the data into K parts.
◦ Hold out one part, fit the model using the remaining data, and compute the error rate on the held-out part.
◦ Repeat K times.
◦ The CV error rate is the average of the K error rates we have computed.

We can use cross-validation to help choose the right level of flexibility (see the sketch after this list):
◦ Logistic regression, linear vs. quadratic terms:
  $p(x) = \frac{e^{\beta_0+\beta_1 X_1+\beta_2 X_2}}{1+e^{\beta_0+\beta_1 X_1+\beta_2 X_2}}$  vs.  $p(x) = \frac{e^{\beta_0+\beta_1 X_1+\beta_2 X_1^2+\beta_3 X_2+\beta_4 X_2^2}}{1+e^{\beta_0+\beta_1 X_1+\beta_2 X_1^2+\beta_3 X_2+\beta_4 X_2^2}}$
◦ KNN classification: K = 1, or K = 2, or K = ???
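As a concrete illustration (not shown in the slides): create a binary outcome from the Auto data (mpg above its median) and use cv.glm with a misclassification cost function to compare logistic regressions with linear and quadratic terms. The mpg01 variable, the choice of horsepower and weight as the two predictors, and the cost function are assumptions made for this sketch, not part of the original lab code.

# 10-fold CV error rate for logistic regression, linear vs. quadratic terms
# (illustrative sketch; mpg01 and the cost function are not from the slides)
library(ISLR)
library(boot)
Auto$mpg01=as.numeric(Auto$mpg>median(Auto$mpg))   # hypothetical binary response
cost=function(y,pi) mean(abs(y-pi)>0.5)            # misclassification rate
set.seed(3)
glm.lin=glm(mpg01~horsepower+weight,data=Auto,family=binomial)
glm.quad=glm(mpg01~poly(horsepower,2)+poly(weight,2),data=Auto,family=binomial)
cv.glm(Auto,glm.lin,cost,K=10)$delta[1]    # CV error rate, linear terms
cv.glm(Auto,glm.quad,cost,K=10)$delta[1]   # CV error rate, quadratic terms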
The Bootstrap
The bootstrap is primarily used to obtain standard errors of an estimate.
E.g., when you use lm to estimate $\beta_0$, R gives you an SE in addition to the p-value, but that SE rests on model assumptions. You can use the bootstrap to estimate the SE directly from your data.
It is generally not used to estimate the test MSE or error rate (see below).

Where does the name come from?
"Bootstrap" as a metaphor, meaning to better oneself by one's own unaided efforts, was in use by 1922.

(Worked-example figure slides omitted: a simple example, the example continued, its results, a version with just 3 observations, a general picture of the bootstrap, and the bootstrap results.)

The bootstrap in general
In more complex data situations, figuring out the appropriate way to generate bootstrap samples can require some thought. For example, if the data form a time series, we cannot simply sample the observations with replacement.
The bootstrap is primarily used to obtain standard errors of an estimate. It also provides approximate confidence intervals for a population parameter.

Can the bootstrap estimate prediction error?
• In cross-validation, each of the K validation folds is distinct from the other K-1 folds used for training: there is no overlap.
• To estimate prediction error using the bootstrap, we could think about using each bootstrap dataset as the training sample and the original sample as the validation sample.
• But each bootstrap sample has significant overlap with the original data: about two-thirds of the original data points appear in each bootstrap sample.
• This causes the bootstrap to seriously underestimate the true prediction error.
• Swapping the roles (original sample as the training sample, bootstrap dataset as the validation sample) is even worse!
• Cross-validation provides a simpler, more attractive approach for estimating prediction error.

Try this:

boot.fn=function(data,index)
  return(coef(lm(mpg~horsepower,data=data,subset=index)))
boot.fn(Auto,1:392)                         # coefficient estimates on the full data
set.seed(1)
boot.fn(Auto,sample(392,392,replace=T))     # estimates on one bootstrap sample
boot.fn(Auto,sample(392,392,replace=T))     # ...and on another
boot(Auto,boot.fn,1000)                     # bootstrap SEs from 1,000 bootstrap samples (boot package)
summary(lm(mpg~horsepower,data=Auto))$coef  # compare with the formula-based SEs

boot.fn=function(data,index)
  coefficients(lm(mpg~horsepower+I(horsepower^2),data=data,subset=index))
set.seed(1)
boot(Auto,boot.fn,1000)
summary(lm(mpg~horsepower+I(horsepower^2),data=Auto))$coef
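The deck notes that the bootstrap also provides approximate confidence intervals for a population parameter. A minimal sketch of how that could be done with boot.ci from the boot package (not part of the original lab code; the object name boot.out is illustrative, and ISLR and boot are assumed to be loaded as above):

# Approximate bootstrap confidence intervals for the horsepower slope
# (illustrative sketch; not from the slides)
boot.fn=function(data,index)
  coef(lm(mpg~horsepower,data=data,subset=index))
set.seed(1)
boot.out=boot(Auto,boot.fn,1000)
boot.ci(boot.out,index=2,type=c("norm","perc"))   # index=2 selects the horsepower coefficient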