
Class 7 Cross validation

POMS.6120
Statistics for Predictive Analytics
Nichalin Summerfield, Ph.D.
Fall 2016
Chapter 5 Resampling Methods
Resampling methods involve repeatedly drawing samples from
a training set and refitting a model on each sample in order to
obtain additional information about the fitted model.
Two resampling methods:
1. Cross-validation methods (leave-one-out and k-fold)
2. Bootstrap
Purposes:
◦ Model assessment = The process of evaluating a model’s performance
◦ E.g. to estimate the test error or its variability
◦ Model selection = The process of selecting the proper level of flexibility
for a model.
◦ E.g. to select k for KNN, or to select the degree of polynomial in regression.
Resampling methods can be computationally expensive.
Chapter 5 Resampling Methods
 Cross-Validation
o The Validation Set Approach
o Leave-One-Out Cross-Validation
o k-Fold Cross-Validation
o Bias-Variance Trade-Off for k-Fold CV
o Cross-Validation on Classification Problems
 The Bootstrap
Training Error versus Test Error
Training- versus Test-Set Performance
[figure: prediction error versus model complexity or flexibility]
More on Prediction-Error Estimates
Chapter 5 Resampling Methods
 Cross-Validation
o The Validation Set Approach
o Leave-One-Out Cross-Validation
o k-Fold Cross-Validation
o Bias-Variance Trade-Off for k-Fold CV
o Cross-Validation on Classification Problems
 The Bootstrap
Validation-set approach
Try this using Auto Data
library(ISLR)                                      # Auto data
set.seed(1)
train=sample(392,196)                              # indices of the training half
train
lm.fit=lm(mpg~horsepower,data=Auto,subset=train)   # fit on the training half only
attach(Auto)
Pred=predict(lm.fit,Auto)
mean((mpg-Pred)[-train]^2)                         # validation MSE on the held-out half
set.seed(2)                                        # a different random split
train=sample(392,196)
train
lm.fit=lm(mpg~horsepower,data=Auto,subset=train)
mean((mpg-predict(lm.fit,Auto))[-train]^2)         # a noticeably different validation MSE
Example: Auto Data
Suppose that we want to predict mpg from horsepower
Two models:
◦ mpg ~ horsepower
◦ mpg ~ horsepower + horsepower²
Which model gives a better fit?
◦ Randomly split the Auto data set into a training set (196 obs.) and a validation
set (196 obs.)
◦ Fit both models using the training data set
◦ Then, evaluate both models using the validation data set
◦ The model with the lowest validation (testing) MSE is the winner!
Now try this!
We want to compare the performance of different degrees of polynomial.
set.seed(1)
train=sample(392,196)
lm.fit=lm(mpg~horsepower,data=Auto,subset=train)              # degree 1
mean((mpg-predict(lm.fit,Auto))[-train]^2)
lm.fit2=lm(mpg~poly(horsepower,2),data=Auto,subset=train)     # degree 2
mean((mpg-predict(lm.fit2,Auto))[-train]^2)
lm.fit3=lm(mpg~poly(horsepower,3),data=Auto,subset=train)     # degree 3
mean((mpg-predict(lm.fit3,Auto))[-train]^2)
set.seed(2)                                                   # repeat with a different split
train=sample(392,196)
lm.fit=lm(mpg~horsepower,data=Auto,subset=train)
mean((mpg-predict(lm.fit,Auto))[-train]^2)
lm.fit2=lm(mpg~poly(horsepower,2),data=Auto,subset=train)
mean((mpg-predict(lm.fit2,Auto))[-train]^2)
lm.fit3=lm(mpg~poly(horsepower,3),data=Auto,subset=train)
mean((mpg-predict(lm.fit3,Auto))[-train]^2)
Example: Auto Data
Left: Validation error rate for a single split
Right: Validation method repeated 10 times, each time the split is done randomly!
There is a lot of variability among the MSEs… Not good!
We need more stable methods! (A quick way to see this variability yourself is sketched below.)
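(Not on the original slide) A minimal sketch of how you might see this variability yourself: repeat the random split ten times and watch the validation MSE move around. The seed values and the use of sapply are just one convenient choice.
library(ISLR)                                   # Auto data
val.mse=sapply(1:10, function(s){
  set.seed(s)                                   # a different random split each time
  train=sample(392,196)
  fit=lm(mpg~horsepower,data=Auto,subset=train)
  mean((Auto$mpg-predict(fit,Auto))[-train]^2)  # validation MSE for this split
})
val.mse                                         # the spread illustrates the variability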
The Validation Set Approach
Advantages:
◦ Simple
◦ Easy to implement
Disadvantages:
◦ The validation MSE can be highly variable
◦ Only a subset of the observations is used to fit the model
(the training data), and statistical methods tend to perform worse
when trained on fewer observations
Chapter 5 Resampling Methods
 Cross-Validation
o The Validation Set Approach
o Leave-One-Out Cross-Validation
o k-Fold Cross-Validation
o Bias-Variance Trade-Off for k-Fold CV
o Cross-Validation on Classification Problems
 The Bootstrap
Leave-One-Out Cross-Validation
(LOOCV)
• LOOCV involves splitting the set of observations into two parts.
• However, instead of creating two subsets of comparable size, a single
observation is used for the validation set, and the remaining
observations (𝑛 − 1) make up the training set.
• The process repeats 𝑛 times → no randomness!
Leave-One-Out Cross-Validation
(LOOCV)
Measuring MSE
First data point as a test set:
$$MSE_1 = (y_1 - \hat{y}_1)^2$$
Second data point as a test set:
$$MSE_2 = (y_2 - \hat{y}_2)^2$$
And so on…
The LOOCV estimate for the test MSE is the average of these MSEs:
$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n} MSE_i$$
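(Not on the original slide) A minimal by-hand version of LOOCV for mpg ~ horsepower, assuming the Auto data from ISLR, to show where each MSE_i comes from:
library(ISLR)
n=nrow(Auto)
mse=rep(0,n)
for (i in 1:n){
  fit=lm(mpg~horsepower,data=Auto[-i,])   # fit on all observations except i
  pred=predict(fit,newdata=Auto[i,])      # predict the single held-out observation
  mse[i]=(Auto$mpg[i]-pred)^2             # MSE_i
}
mean(mse)                                 # CV_(n), the LOOCV estimate of the test MSE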
Leave-One-Out Cross-Validation
(LOOCV)
Major advantages of LOOCV over the validation set approach:
1. It has far less bias, i.e., the LOOCV approach tends not to overestimate the
test error rate as much as the validation set approach does.
2. The validation set approach will yield different results when applied
repeatedly due to randomness, but performing LOOCV multiple times will
always yield the same results (no randomness).
Disadvantages:
1. LOOCV has the potential to be expensive to implement, since the model has
to be fit 𝑛 times.
• Worse if 𝑛 is large
• Worse if each individual model is slow to fit
Exception: There is a formula to calculate the LOOCV test MSE for least squares
linear or polynomial regression, so you don't have to fit the model 𝑛 times (shown below).
No such formula exists for other models though.
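For reference, the shortcut formula for least squares fits (not written out on the slide) is
$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2,$$
where $\hat{y}_i$ is the i-th fitted value from the single least squares fit on all 𝑛 observations and $h_i$ is the leverage of observation i.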
Try this
glm.fit=glm(mpg~horsepower,data=Auto)   # glm() with no family argument = linear regression
coef(glm.fit)
lm.fit=lm(mpg~horsepower,data=Auto)     # same fit with lm(); coefficients match
coef(lm.fit)
library(boot)                           # for cv.glm()
glm.fit=glm(mpg~horsepower,data=Auto)
cv.err=cv.glm(Auto,glm.fit)             # K defaults to n, i.e. LOOCV
cv.err$delta                            # delta[1] = raw CV estimate, delta[2] = adjusted estimate
cv.error=rep(0,5)
for (i in 1:5){                         # LOOCV for polynomial degrees 1 to 5
  glm.fit=glm(mpg~poly(horsepower,i),data=Auto)
  cv.error[i]=cv.glm(Auto,glm.fit)$delta[1]
}
cv.error
Chapter 5 Resampling Methods
 Cross-Validation
o The Validation Set Approach
o Leave-One-Out Cross-Validation
o k-Fold Cross-Validation
o Bias-Variance Trade-Off for k-Fold CV
o Cross-Validation on Classification Problems
 The Bootstrap
𝑘-fold Cross Validation
 LOOCV is computationally intensive, so we can run 𝑘-fold Cross
Validation instead. Very popular!
 With 𝑘-fold Cross Validation, we divide the data set into 𝑘
different parts (e.g. 𝑘 = 5, or 𝑘 = 10, etc.)
 We then remove the first part, fit the model on the remaining
𝑘 − 1 parts (combined), and see how good the predictions are
on the left out part (i.e. compute the MSE on the first part)
 We then repeat this 𝑘 different times taking out a different part
each time.
 By averaging the 𝑘 different MSEs, we get an estimated
validation (test) error rate for new observations:
$$CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} MSE_i$$
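(Not on the original slide) A minimal by-hand sketch of 10-fold CV for mpg ~ horsepower, assuming the Auto data; the fold assignment via sample(rep(...)) is just one convenient choice:
library(ISLR)
set.seed(1)
k=10
folds=sample(rep(1:k,length.out=nrow(Auto)))    # randomly assign each observation to a fold
fold.mse=rep(0,k)
for (j in 1:k){
  fit=lm(mpg~horsepower,data=Auto[folds!=j,])   # train on the other k-1 folds
  pred=predict(fit,newdata=Auto[folds==j,])     # predict the held-out fold
  fold.mse[j]=mean((Auto$mpg[folds==j]-pred)^2) # MSE on fold j
}
mean(fold.mse)                                  # CV_(k)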
𝑘-fold Cross Validation
LOOCV is the same as 𝑘-fold Cross Validation when 𝑘 = 𝑛.
Auto data revisited
Left: LOOCV error curve
Right: 10-fold CV was run many times, and the figure shows the slightly different CV error rates
They are both stable, but LOOCV is more computationally intensive!
Auto Data: Validation Set
Approach vs. K-fold CV Approach
Left: Validation Set Approach
Right: 10-fold Cross Validation Approach
Indeed, 10-fold CV is more stable!
Try this
set.seed(17)
cv.error.12=rep(0,12)
for (i in 1:12){                        # 10-fold CV for polynomial degrees 1 to 12
  glm.fit=glm(mpg~poly(horsepower,i),data=Auto)
  cv.error.12[i]=cv.glm(Auto,glm.fit,K=10)$delta[1]
}
cv.error.12
Chapter 5 Resampling Methods
 Cross-Validation
o The Validation Set Approach
o Leave-One-Out Cross-Validation
o k-Fold Cross-Validation
o Bias-Variance Trade-Off for k-Fold CV
o Cross-Validation on Classification Problems
 The Bootstrap
Bias-Variance Trade-off for k-fold CV
Putting aside that LOOCV is more computationally intensive
than k-fold CV… which is better, LOOCV or k-fold CV?
◦ LOOCV is less biased than k-fold CV (when k < n)
◦ But LOOCV has higher variance than k-fold CV (when k < n)
◦ The mean of many highly correlated quantities has higher variance than the mean of many
quantities that are not as highly correlated. In LOOCV, the n fitted models are trained on nearly
identical data sets, so their outputs are highly correlated.
◦ When we perform k-fold CV with k < n, we are averaging the outputs of k fitted models that are
somewhat less correlated with each other, since the overlap between the training sets in each
model is smaller.
◦ Thus, there is a bias-variance trade-off in choosing between them
Conclusion:
◦ We tend to use k-fold CV with K = 5 or K = 10
◦ These are the "magic" values of K
◦ It has been empirically shown that they yield test error rate
estimates that suffer neither from excessively high bias nor from
very high variance
Chapter 5 Resampling Methods
 Cross-Validation
o The Validation Set Approach
o Leave-One-Out Cross-Validation
o k-Fold Cross-Validation
o Bias-Variance Trade-Off for k-Fold CV
o Cross-Validation on Classification Problems
 The Bootstrap
Cross Validation on Classification
Problems
So far, we have been dealing with CV on regression problems.
We can use cross-validation in a classification setting in a
similar manner (a sketch follows below):
◦ Divide the data into K parts
◦ Hold out one part, fit the model using the remaining data, and compute the error
rate on the held-out part
◦ Repeat K times
◦ The CV error rate is the average over the K error rates we have computed
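(Not on the original slide) A minimal sketch of k-fold CV for a classifier, assuming the Default data from ISLR and a simple misclassification cost function passed to cv.glm:
library(ISLR)                                     # Default data (default ~ balance + income)
library(boot)                                     # cv.glm()
cost=function(r,pi) mean(abs(r-pi)>0.5)           # misclassification rate: 0/1 response vs. fitted probability
glm.fit=glm(default~balance+income,data=Default,family=binomial)
set.seed(1)
cv.glm(Default,glm.fit,cost=cost,K=10)$delta[1]   # estimated 10-fold CV error rate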
Cross Validation on Classification
Problems
We can use cross-validation to help choose the right level
of flexibility (a sketch for choosing K in KNN follows below).
◦ Logistic regression:
$$p(x) = \frac{e^{\beta_0+\beta_1 X_1+\beta_2 X_2}}{1+e^{\beta_0+\beta_1 X_1+\beta_2 X_2}} \quad \text{vs.} \quad p(x) = \frac{e^{\beta_0+\beta_1 X_1+\beta_2 X_1^2+\beta_3 X_2+\beta_4 X_2^2}}{1+e^{\beta_0+\beta_1 X_1+\beta_2 X_1^2+\beta_3 X_2+\beta_4 X_2^2}}$$
◦ KNN classification:
◦ K = 1, or K = 2, or K = ???
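(Not on the original slide) A sketch of choosing K for KNN classification by 10-fold CV, using the base iris data and knn() from the class package purely as a stand-in example:
library(class)                                  # knn()
set.seed(1)
folds=sample(rep(1:10,length.out=nrow(iris)))   # assign each row to one of 10 folds
k.values=1:15
cv.err=rep(0,length(k.values))
for (k in k.values){
  fold.err=rep(0,10)
  for (f in 1:10){
    pred=knn(train=iris[folds!=f,1:4],test=iris[folds==f,1:4],
             cl=iris$Species[folds!=f],k=k)     # classify the held-out fold
    fold.err[f]=mean(pred!=iris$Species[folds==f])
  }
  cv.err[k]=mean(fold.err)                      # average error rate over the 10 folds
}
which.min(cv.err)                               # K with the lowest CV error rate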
Chapter 5 Resampling Methods
 Cross-Validation
o The Validation Set Approach
o Leave-One-Out Cross-Validation
o k-Fold Cross-Validation
o Bias-Variance Trade-Off for k-Fold CV
o Cross-Validation on Classification Problems
 The Bootstrap
The Bootstrap
 Primarily used to obtain standard errors of an estimate.
 E.g. when you use lm to estimate 𝛽0, R gives you an SE in
addition to a p-value, but that SE relies on many assumptions.
 You can use the bootstrap to verify the SE from your own data.
 Cannot be used to obtain an estimated MSE or error rate.
Where does the name come from?
"Bootstrap" as a metaphor, meaning to better oneself by one's own unaided
efforts, was in use in 1922.
[Figure-only slides: a simple worked bootstrap example and its results, an example with just 3 observations, and a general picture for the bootstrap]
The bootstrap in general
 In more complex data situations, figuring out the
appropriate way to generate bootstrap samples can
require some thought.
 For example, if the data is a time series, we can't simply
sample the observations with replacement.
 Primarily used to obtain standard errors of an estimate.
 Also provides approximate confidence intervals for a
population parameter (see the sketch below).
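(Not on the original slide) A minimal sketch of turning bootstrap output into an approximate confidence interval with boot.ci(), reusing the boot.fn idea from the "Try This" slide later on:
library(ISLR)
library(boot)
boot.fn=function(data,index)
  coef(lm(mpg~horsepower,data=data,subset=index))
set.seed(1)
boot.out=boot(Auto,boot.fn,R=1000)              # 1,000 bootstrap replicates of the coefficients
boot.ci(boot.out,index=2,type="perc")           # percentile CI for the horsepower slope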
Can the bootstrap estimate
prediction error?
• In cross-validation, each of the K validation folds is distinct
from the other K-1 folds used for training: there is no
overlap.
• To estimate prediction error using the bootstrap, we could
think about using each bootstrap dataset as our training
sample, and the original sample as our validation sample.
• But each bootstrap sample has significant overlap with the original
data.
• About two-thirds of the original data points appear in each
bootstrap sample.
• This will cause the bootstrap to seriously underestimate the true
prediction error.
• The other way around (original sample = training
sample, bootstrap dataset = validation sample) is worse!
Cross-validation provides a simpler, more attractive approach for
estimating prediction error.
Try This
boot.fn=function(data,index)
  return(coef(lm(mpg~horsepower,data=data,subset=index)))   # coefficients for a given set of rows
boot.fn(Auto,1:392)                                         # full-sample estimates
set.seed(1)
boot.fn(Auto,sample(392,392,replace=T))                     # one bootstrap replicate
boot.fn(Auto,sample(392,392,replace=T))                     # another bootstrap replicate
library(boot)                                               # for boot()
boot(Auto,boot.fn,1000)                                     # bootstrap SEs from 1,000 replicates
summary(lm(mpg~horsepower,data=Auto))$coef                  # compare with the formula-based SEs
boot.fn=function(data,index)                                # now the quadratic model
  coefficients(lm(mpg~horsepower+I(horsepower^2),data=data,subset=index))
set.seed(1)
boot(Auto,boot.fn,1000)
summary(lm(mpg~horsepower+I(horsepower^2),data=Auto))$coef