Handout #18: Jackknife and Cross-Validation in R
Section 18.1: The “Leave-One-Out” Concept for Simple Mean
The “leave-one-out” notion in regression involves understanding the effect of a single
observation on your model. The “leave-one-out” approach can be used to identify
observations with large leverage by investigating an observation’s effect on the estimated
regression coefficients.
It should be noted that the “leave-one-out” notion extends beyond regression problems. This
notion is more commonly known as jackknife resampling. A snippet of the wiki entry for
jackknife resampling is provided here. The natural extension of jackknife resampling would be
“leave-several-out.” This is known as cross-validation when the goal is understanding the
predictive ability of a regression model.
Source: http://en.wikipedia.org/wiki/Jackknife_resampling.
“Leave-one-out” for a Mean
To begin, type the following into R. This creates a vector whose elements are (2,3,5,8,10).
> y=c(2,3,5,8,10)
Recall, the mean() function is used to obtain the mean of y.
> mean(y)
[1] 5.6
A negative index can be used to temporarily withhold an observation from calculations done
on an object. For example, the following calculates the mean of y without the 1st
observation. The outcome here is the same as calculating the mean using elements 2
through 5, which would be accomplished in R as mean( y[2:5] ).
> mean(y[-1])
[1] 6.5
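Negative indexing also accepts a vector of positions, which is the basis for the “leave-several-out” idea discussed later in this handout. A quick demonstration (these commands are my addition, not part of the original sequence):
> #Dropping the first two observations at once
> y[-c(1,2)]
[1]  5  8 10
> mean(y[-c(1,2)])
[1] 7.666667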
The following shows the effect of the “leave-one-out” method applied to y for a mean
calculation.
> mean(y[-2])
[1] 6.25
> mean(y[-3])
[1] 5.75
> mean(y[-4])
[1] 5
> mean(y[-5])
[1] 4.5
The mean of the complete y vector was 5.6. Leaving out the 1st observation appears to have the
largest impact on the average. The following can be used to store the mean from each “leave-one-out” iteration.
Step #1: Create an initial vector to store the “leave-one-out” mean from each iteration.
> output = rep(0,5)
>
> #Looking at output
> output
[1] 0 0 0 0 0
Step #2: Use a for() loop to cycle through each of the five observations.
> for(i in 1:5){
+   output[i]=mean( y[-i] )
+ }
Step #3: Viewing the output vector allows you to observe the effect of removing each
observation on the mean.
> #Looking at output after for() loop
> output
[1] 6.50 6.25 5.75 5.00 4.50
Writing a function for “leave-one-out” for a mean
There are a variety of methods to create or build a function in R. The function to be
created will be named mean.jackknife, and I will use the edit() function to create it. The
edit() function opens a separate window.
> mean.jackknife=edit()
Some comments regarding functions in R.
• Arguments that need to be passed into a function are done so within the parentheses
attached to function, i.e. function( ). The labeling of arguments within functions is
separate from outside the function, e.g. in the following “x” is an argument only within
my mean.jackknife() function.
• The code for a function must be contained within a set of curly brackets, i.e. {}.
• return() will return a single object from a function. The list() function can be used
when more than a single object is to be returned from the function (a minimal sketch of this is given below).
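To illustrate the last point, here is a minimal sketch of a function returning two objects via list(). The function name both.stats is hypothetical and used only for illustration.

both.stats = function(x){
  #Return both the mean and the standard deviation of x in a single list
  return( list(mean=mean(x), sd=sd(x)) )
}

> both.stats(y)
$mean
[1] 5.6

$sd
[1] 3.361547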
The finished mean.jackknife() function in R. The following can be used to cut-and-paste
this function into R.

mean.jackknife = function(x){
  #Find the length of x
  n = length(x)
  #Setup output vector
  output = rep(0,n)
  #Loop for iterations
  for(i in 1:n){
    output[i]=mean( x[-i] )
  }
  #Return the output vector
  return(output)
}

Note: If pasting this into the edit() window, delete mean.jackknife and the equal sign.
After the function has been successfully created, you are able to use the function just like any
other function in R. The following produces the same outcomes as above.
> mean.jackknife(y)
[1] 6.50 6.25 5.75 5.00 4.50
The outcomes from this function can be put into a vector, say outcomes. Note: The outcomes
vector does *not* need to be set up ahead of time in this situation.
> outcomes=mean.jackknife(y)
> outcomes
[1] 6.50 6.25 5.75 5.00 4.50
Creating a function to process the “leave-one-out” computations does take a bit more
time; however, a function is more flexible in that this function will work with any vector.
Furthermore, assuming you save the workspace image upon exit, this function will be
permanently saved in your workspace so that the “leave-one-out” function written
here is available later.
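For example, the same function applies directly to a different vector without any changes. A quick check (the vector below is made up for illustration):
> mean.jackknife(c(4,6,11))
[1] 8.5 7.5 5.0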
Section 18.2: The “Leave-One-Out” Approach for Regression Coefficients (i.e. DFBETAs)
Example 18.2.1 Again, for the sake of understanding, let’s create a simple response vector, y,
and a predictor variable, x.

$$y = \begin{bmatrix} 2 \\ 3 \\ 5 \\ 8 \\ 10 \end{bmatrix} \quad \text{and} \quad x = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{bmatrix}$$
Putting these into R can be done as follows.
> y=c(2,3,5,8,10)
> x=c(1,2,3,4,5)
Next, put each of these vectors into a data frame. A data frame will allow us to easily refer to
particular elements of the data within various R functions.
> mydata=data.frame(y,x)
Fitting a simple linear regression model in R is done through the use of the lm() function. The
data being used for this fit is contained in the data frame named mydata. The y and x used
within the lm() function must be variable names in the mydata data frame.
• Mean Function: $E(Y|x) = \beta_0 + \beta_1 x$
• Variance Function: $Var(Y|x) = \sigma^2$
> lm(y~x,data=mydata)
Call:
lm(formula = y ~ x, data = mydata)
Coefficients:
(Intercept)            x  
       -0.7          2.1  
The following can be used to save the regression output into an R object, say myfit.
> myfit=lm(y~x,data=mydata)
> myfit
Call:
lm(formula = y ~ x, data = mydata)
Coefficients:
(Intercept)            x  
       -0.7          2.1  
Next, let’s remove the 1st row from the mydata data frame and refit the model. This is easily done
in R using mydata[-1,]. The output here is being saved into an object called myfit.minus1. The
regression coefficients from this model can easily be obtained from this object using
myfit.minus1$coefficients, as is shown here.
> myfit.minus1=lm(y~x,data=mydata[-1,])
> myfit.minus1$coefficients
(Intercept)           x 
       -1.9         2.4 
The coefficients component of myfit.minus1 is a vector in R; thus, the following will return its
second element, i.e. the estimated slope or $\hat{\beta}_1$ from the model.
> myfit.minus1$coefficients[2]
x
2.4
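As an aside (not used in the original sequence of commands), the coef() extractor function returns the same vector of estimated coefficients and is often preferred over the $coefficients notation.
> coef(myfit.minus1)[2]
  x 
2.4 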
Understanding the effect of removing additional observations on the estimated regression
coefficients through the “leave-one-out” process follows.
> myfit.minus2=lm(y~x,data=mydata[-2,])
> myfit.minus2$coefficients[2]
x
2.028571
> myfit.minus3=lm(y~x,data=mydata[-3,])
> myfit.minus3$coefficients[2]
x
2.1
> myfit.minus4=lm(y~x,data=mydata[-4,])
> myfit.minus4$coefficients[2]
x
2.057143
> myfit.minus5=lm(y~x,data=mydata[-5,])
> myfit.minus5$coefficients[2]
x
2
We can again automate this process through the use of a for() loop.
Step #1: Create a vector to store the estimated slope from the model from each iteration.
> output = rep(0,5)
>
> #Looking at output
> output
[1] 0 0 0 0 0
Step #2: Using a for() loop to cycle through each observation
> for(i in 1:5){
+   fit = lm(y~x,data=mydata[-i,])
+   output[i]=fit$coefficients[2]
+ }
Step #3: Observe the effect of removing each observation on the estimated slope from the
simple linear regression model.
> #Looking at output after for() loop
> output
[1] 2.400000 2.028571 2.100000 2.057143 2.000000
Recall that $\hat{\beta}_1$ from the model with all the observations was 2.1. The following simple
subtraction can be used to understand how much the estimated slope changes when the “leave-one-out” approach is used.
> 2.1-output
[1] -0.30000000  0.07142857  0.00000000  0.04285714  0.10000000
Comments:
• The “leave-one-out” approach to understanding the effect on the estimated regression
coefficients is called DFBETAs. For our model, a DFBETA exists for the estimated y-intercept and the estimated slope.
• DFBETAs are often standardized with respect to the standard error. The DFBETA for $\hat{\beta}_1$
would be calculated as follows, where the $(-i)$ notation simply implies the $i^{th}$
observation has been removed.

$$DFBETA_{1,(-i)} = \frac{\hat{\beta}_1 - \hat{\beta}_{1(-i)}}{\text{Standard Error}(\hat{\beta}_{1(-i)})}$$

• The following function was written in R to conduct the “leave-one-out” procedure for a
simple linear regression model.
betahat.jackknife=function(slr.object,data){
  #Getting the number of rows in data
  n = dim(data)[1]
  #Creating the output data frame, column 1 for beta0hat and
  # column 2 for beta1hat
  output = data.frame(beta0hat=rep(0,n),beta1hat=rep(0,n))
  #Looping through data, Beta0hat will be put into column 1
  # and Beta1hat will be put into column 2
  for(i in 1:n){
    fit=lm(formula(slr.object),data=data[-i,])
    output[i,1]=fit$coefficients[1]
    output[i,2]=fit$coefficients[2]
  }
  #Return the output data frame
  return(output)
}
Using this function in R
> fit=lm(y~x,data=mydata)
> betahat.jackknife(fit,mydata)
beta0hat beta1hat
1 -1.9000000 2.400000
2 -0.3428571 2.028571
3 -0.5500000 2.100000
4 -0.6571429 2.057143
5 -0.5000000 2.000000
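As an aside (not part of the original handout), base R also provides the dfbeta() and dfbetas() functions, which perform these leave-one-out coefficient computations directly; dfbetas() gives the standardized version described above. These can be used to check the output of betahat.jackknife().
> dfbeta(fit)    #unstandardized change in each estimated coefficient
> dfbetas(fit)   #standardized DFBETAs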
Section 18.3: The “Leave-One-Out” Approach for Prediction
Example 18.3.1 Consider again the response and predictor variable from the previous section.

$$y = \begin{bmatrix} 2 \\ 3 \\ 5 \\ 8 \\ 10 \end{bmatrix} \quad \text{and} \quad x = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{bmatrix}$$
First, let’s fit the standard simple linear regression model in JMP.

[JMP output: the data table and fitted model summaries]

Mean and Variance Functions
• $E(Y|x) = \beta_0 + \beta_1 x$
• $Var(Y|x) = \sigma^2$

Theoretical Model Setup vs. Model Estimates
• Mean: $E(Y|x) = \beta_0 + \beta_1 x$, estimated by $\hat{E}(Y|x) = \hat{\beta}_0 + \hat{\beta}_1 x = -0.70 + 2.1x$
• Variance (MSE): $Var(Y|x) = \sigma^2$, estimated by $\widehat{Var}(Y|x) = \hat{\sigma}^2 = 0.3667$
• Standard Deviation (RMSE): $\text{Std Dev}(Y|x) = \sqrt{\sigma^2} = \sigma$, estimated by $\widehat{\text{Std Dev}}(Y|x) = \sqrt{0.3667} = 0.61$
The goal of using the “leave-one-out” approach in this section is to understand the predictive
ability of a model. The preferred measure for understanding predictive ability is Root Mean Square
Error.

$$\text{Standard Deviation} \leftrightarrow \text{“Average” Distance to Mean}$$

There are shortcomings in using the Root Mean Square Error from a regression model to understand
the “average” distance to the mean or the “average” residual. The most significant are discussed here.
1. The purpose of the RMSE value is to provide the best possible unbiased estimate for the
standard deviation of the response distribution after conditioning, i.e. $\text{Std Dev}(Y|x) = \sigma$.

$$RMSE \leftrightarrow \text{Best Unbiased Estimate of } \text{Std Dev}(Y|x)$$
2. Using the RMSE value from the model is likely to underestimate a model’s true ability to
predict because the observations for which predictions are being made were used to build the
model.

Issue #1: The purpose of RMSE is to attain the best possible estimate of $\text{Std Dev}(Y|x)$.
Issue #2: Using residuals from observations used to build the model results in an underestimate of
a model’s true predictive ability.
Goal of “Leave-one-out” procedure for Prediction
Obtain a reasonable measure of RMSE which reflects the true predictive ability of a model.
Putting the response and predictor vectors from above into R and creating a data frame.
> y=c(2,3,5,8,10)
> x=c(1,2,3,4,5)
> mydata=data.frame(y,x)
Fitting the simple linear regression model in R and placing the output into an R object called fit.
The summary() function displays much of the regression output.
> fit=lm(y~x,data=mydata)
> summary(fit)

Call:
lm(formula = y ~ x, data = mydata)

Residuals:
   1    2    3    4    5 
 0.6 -0.5 -0.6  0.3  0.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -0.7000     0.6351  -1.102  0.35086   
x             2.1000     0.1915  10.967  0.00162 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6055 on 3 degrees of freedom
Multiple R-squared: 0.9757, Adjusted R-squared: 0.9676
F-statistic: 120.3 on 1 and 3 DF, p-value: 0.001623
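As a side note (not in the original sequence of commands), the residual standard error shown above can be pulled out of the summary object directly; this is convenient for the comparisons made later in this handout.
> summary(fit)$sigma
[1] 0.6055301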
The estimated regression equation for this example is given by the following.
𝐸̂ (π‘Œ|π‘₯) = 𝛽̂0 + 𝛽̂1 ∗ π‘₯
= −0.70 + 2.1 ∗ π‘₯
Recall, from Handout #8, the predicted values can be obtained through the following matrix
multiplication.
$$\hat{E}(Y|x) = \boldsymbol{X}\hat{\boldsymbol{\beta}} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \end{bmatrix} \begin{bmatrix} -0.70 \\ 2.1 \end{bmatrix} = \begin{bmatrix} 1.4 \\ 3.5 \\ 5.6 \\ 7.7 \\ 9.8 \end{bmatrix}$$
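A quick check of this matrix multiplication in R (these commands are my addition, not part of the original sequence), using model.matrix() to extract the design matrix:
> X = model.matrix(fit)   #the design matrix X (a column of 1s and the x column)
> X %*% coef(fit)         #returns the predicted values 1.4, 3.5, 5.6, 7.7, 9.8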
In R, this vector can be obtained directly using the predict() function. The predict function needs
two arguments: i) the model object and ii) the new data for which predictions should be made.
Note: The column names for the predictors (in the newdata data frame) should match those in
the data frame used to fit the model. In the following, predictions are being made on the
original data – which is the mydata data frame.
> predict(fit,newdata=mydata)
  1   2   3   4   5 
1.4 3.5 5.6 7.7 9.8 
You can simply save these predicted values into its own vector as follows.
> predictedy = predict(fit,newdata=mydata)
Once the predicted values are obtained, residuals can easily be computed in R.
> y-predictedy
   1    2    3    4    5 
 0.6 -0.5 -0.6  0.3  0.2 
π‘…π‘’π‘ π‘–π‘‘π‘’π‘Žπ‘™π‘  = (𝑦 − 𝐸̂ (π‘Œ|π‘₯))
2
1.4
0.6
3
3.5
−0.5
π‘…π‘’π‘ π‘–π‘‘π‘’π‘Žπ‘™π‘  = (𝑦 − 𝐸̂ (π‘Œ|π‘₯)) = ( 5 − 5.6 = −0.6
8
7.7
0.3
[10] [ 9.8 ] [ 0.2 ]
Finding the Root Mean Square Error by way of brute force in R is shown next. It should be
noted that the denominator is given by $df_{error} = 3$ and not the number of residuals. Using
$df_{error}$ provides an unbiased estimate of $\text{Std Dev}(Y|x)$.

$$RMSE = \sqrt{\frac{\sum(\text{Residuals})^2}{df_{error}}} = \sqrt{\frac{1.1}{3}} = 0.61$$
> sqrt(sum((y-predictedy)^2)/3)
[1] 0.6055301
Obtaining “Leave-one-out” Predictions in R
The “leave-one-out” approach for predictions can be accomplished easily in R through the
following sequence of commands.
Step 1: Fit the model withholding the 1st observation
> fit.minus1=lm(y~x,data=mydata[-1,])
Step 2: Make the prediction for the 1st observation
> predictedy1=predict(fit.minus1,newdata=mydata[1,])
Step 3: Obtain the squared residual for this prediction
> (y[1]-predictedy1)^2
1
2.25
The above steps need to be done for each of the five observations in our data frame. This can
be done easily using a for() loop in R.
First, setup an output vector to store the results.
> output=rep(0,5)
> output
[1] 0 0 0 0 0
Using a for() loop to cycle through the data frame one row at a time.
> for(i in 1:5){
+   fit.minus = lm(y~x,data=mydata[-i,])
+   predictedy = predict(fit.minus,newdata=mydata[i,])
+   output[i] = (y[i]-predictedy)^2
+ }
The desired output has been placed in the output vector.
> output
[1] 2.2500000 0.5102041 0.5625000 0.1836735 0.2500000
The average squared residual via the “leave-one-out” method is about 0.75. The estimated Root
Mean Square Error via “leave-one-out” is about 0.87.
> mean(output)
[1] 0.7512755
> sqrt(mean(output))
[1] 0.8667615
The RMSE value for the “leave-one-out” approach is somewhat higher, as expected; i.e., making a
prediction for an observation is harder when that observation is not used to build the model.
The degree of difference, i.e. a 43% increase, is somewhat exaggerated in this simple example due
to having only 5 observations in the dataset.
RMSE
• Original Model: 0.61
• “Leave-one-out” Method: 0.87
• % increase with “leave-one-out”: $(0.87 - 0.61)/0.61 = 0.426 \approx 43\%$
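As an aside (not in the original handout), for linear models the “leave-one-out” residuals can be obtained without refitting the model, using the identity $e_{(-i)} = e_i/(1 - h_{ii})$, where $h_{ii}$ is the leverage of the $i^{th}$ observation. A minimal sketch that reproduces the jackknife RMSE computed above:
> e = resid(fit)            #ordinary residuals
> h = hatvalues(fit)        #leverages h_ii
> sqrt(mean((e/(1-h))^2))   #jackknife RMSE without refitting
[1] 0.8667615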
The following predict.jackknife() function was written in R to automate the process developed
above. The list() function is used to return output from this function as more than one object is
being returned.
predict.jackknife=function(lm.object,data){
  #Getting the number of rows in data
  n = dim(data)[1]
  #Keeping a copy of the original y (used in computing the residuals)
  originaly = lm.object$model[,1]
  #Creating the output vector to save the squared residuals
  output = rep(0,n)
  #Looping through data
  for(i in 1:n){
    fit.minus=lm(formula(lm.object),data=data[-i,])
    predictedy = predict(fit.minus,newdata=data[i,])
    output[i]=(originaly[i]-predictedy)^2
  }
  #Return the squared residuals and the jackknife RMSE
  list(SquaredResids=output,Jackknife_RMSE=sqrt(mean(output)))
}
The following produces the same output as obtained above.
> predict.jackknife(fit,mydata)
$SquaredResids
[1] 2.2500000 0.5102041 0.5625000 0.1836735 0.2500000
$Jackknife_RMSE
[1] 0.8667615
Example 18.3.2 Consider once again the Grandfather Clocks dataset, which was considered in
Handout #12. The jackknife estimate of the root mean square error is computed below. The
jackknife estimate of the RMSE provides a more realistic measure of a model’s ability to make
valid predictions.
Model Setup
• Response Variable: Price
• Predictor Variables: Age and Number of Bidders
• Assume the following structure for the mean and variance functions
  o $E(Price \mid Age, Num\_Bidders) = \beta_0 + \beta_1 Age + \beta_2 Num\_Bidders$
  o $Var(Price \mid Age, Num\_Bidders) = \sigma^2$
Fitting the model and getting summaries in R.
> fit=lm(Price~(Age + Number_Bidders),data=Grandfather_Clocks)
> summary(fit)

Call:
lm(formula = Price ~ (Age + Number_Bidders), data = Grandfather_Clocks)

Residuals:
   Min     1Q Median     3Q    Max 
-207.2 -117.8   16.5  102.7  213.5 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -1336.7221   173.3561  -7.711 1.67e-08 ***
Age               12.7362     0.9024  14.114 1.60e-14 ***
Number_Bidders    85.8151     8.7058   9.857 9.14e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 133.1 on 29 degrees of freedom
Multiple R-squared: 0.8927, Adjusted R-squared: 0.8853
F-statistic: 120.7 on 2 and 29 DF, p-value: 8.769e-15
Getting the “leave-one-out” or jackknife estimate of the Root Mean Square Error.
> predict.jackknife(fit,Grandfather_Clocks)
$SquaredResids
 [1] 31000.9962  7165.9569  1584.9082 33111.7095 16505.5745  3697.9257 23297.9888  2746.9212 14620.6329  6029.6136   432.7774
[12]  6187.0226  3015.9419  1434.9896 16781.6905 48435.3809 14878.0611 16813.4050 51526.3234 38840.7995 25337.9795   540.9022
[23] 46611.6946  1314.8238 45335.0691  6959.0187 52011.4549 24101.0321 15241.3391   302.2358 63774.8714 23201.3986
$Jackknife_RMSE
[1] 141.7348
For this example, the “leave-one-out” estimate of the RMSE is about 6.5% larger than the RMSE
estimate from the model using all the data. Again, the increase in the RMSE using the “leave-one-out”
approach is expected, and this estimate is a better measure of a model’s true ability to
predict.
RMSE
• Original Model: 133.1
• “Leave-one-out”, i.e. Jackknife Estimate: 141.7
Section 18.4: Cross-Validation for Prediction
The natural extension to the “leave-one-out” approach would be to “leave-several-out.” This
more general notion is called cross-validation. There are a variety of cross-validation methods
used for prediction. The Train / Test and Monte Carlo approaches are discussed here. The Train
/ Test approach is considered a 2-fold cross-validation procedure. The k-fold cross-validation is
an extension of the Train / Test approach and is more commonly used in practice. However, k-fold
is not discussed in this handout.
• “Leave-one-out”, i.e. Jackknife approach, see previous section
• “Leave-several-out”, i.e. Train / Test or Split Sample approach, discussed here
• Monte Carlo approach, discussed below
• K-fold approach, not discussed in this handout
The following procedure is used for the Train / Test or Split Sample approach to cross-validation.
Train / Test Cross-Validation Procedure
Step 1: Randomly divide the dataset into a training set and a test set. A typical division puts
2/3 of the data into the training set and sets aside the remaining 1/3 for the test set.
Step 2: Build the model using the observations from the training set.
Step 3: Compute the RMSE on the observations in the test set using the model from Step 2.
A visual depiction of the Train / Test cross-validation procedure is provided here. The “hat”
contains all the observations, which are randomly split into two groups (i.e. 2 folds): the
training dataset and the test dataset. The model is built using the training dataset, and a
measure of predictive ability is computed using the test dataset.

[Figure: observations drawn from a “hat” and randomly split into training and test datasets]
Example 18.4.1 Consider the Grandfather Clocks dataset again where Price is the response of
interest and Age and Number of Bidders are the predictor variables of interest.
Model Setup
• Response Variable: Price
• Predictor Variables: Age and Number of Bidders
• Assume the following structure for the mean and variance functions
  o $E(Price \mid Age, Num\_Bidders) = \beta_0 + \beta_1 Age + \beta_2 Num\_Bidders$
  o $Var(Price \mid Age, Num\_Bidders) = \sigma^2$
Train / Test Cross-Validation in R
Step 1: Randomly divide the dataset into a training dataset and a test dataset. A typical division
puts 2/3 of the data into the training set and sets aside the remaining 1/3 for the test set.
Determine the number of observations, i.e. rows in our dataset.
> dim(Grandfather_Clocks)
[1] 32 4
There are 32 observations in our data set. If 1/3 of the observations are to be set aside
for the test dataset, then about 10 of the 32 observations should be randomly selected
and placed into the test dataset.
> 0.33*32
[1] 10.56
The sample() function is used to randomly select the observations to be placed into the
test dataset. The following identifies the arguments used in the sample() function.
o 1:32 -- specifies the sequence 1, 2, 3, … , 32, where each value represents a row in the
dataset
o 10 -- specifies that 10 observations are to be randomly set aside for the test
dataset
o replace = F -- specifies that sampling of the rows should be done without
replacement, i.e. an observation cannot be taken out twice
> holdout=sample(1:32,10,replace=F)
The following observations will *not* be used when building the model.
> holdout
 [1] 26  6 30 13 17 12 28 10  5  4
Step 2: Build the model using the observations from the training set only.
> fit.train = lm(Price~(Age+Number_Bidders),data=Grandfather_Clocks[-holdout,])
The summary output from this model.
> summary(fit.train)

Call:
lm(formula = Price ~ (Age + Number_Bidders), data = Grandfather_Clocks[-holdout, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-206.27 -129.96  -31.32  116.09  233.69 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -1270.929    253.086  -5.022 7.57e-05 ***
Age               12.507      1.303   9.598 1.01e-08 ***
Number_Bidders    81.376     11.368   7.158 8.38e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 147.3 on 19 degrees of freedom
Multiple R-squared: 0.8467, Adjusted R-squared: 0.8305
F-statistic: 52.46 on 2 and 19 DF, p-value: 1.834e-08
Step 3: Compute the RMSE on the observations in the test set using the model from Step 2.
First, get the predicted values for observations in the test dataset. Compute their
corresponding residuals as well.
> predict.test = predict(fit.train,newdata=Grandfather_Clocks[holdout,])
> resid.test = (Grandfather_Clocks[holdout,2] - predict.test)
Computing the Root Mean Square Error on the test dataset.
> sqrt(mean((resid.test^2)))
[1] 104.8246
The RMSE value from the model using all the data was 133.1. The RMSE from the observations
in the test dataset using a model built from the observations in training set is somewhat smaller
at 104.8. This is somewhat unexpected as the observations in the test dataset were not used to
build the model – yet according to the RMSE value, I have an increased ability to make
predictions for these observations.
If the train / test process were repeated a second time, we’d get a different RMSE. A second
iteration produces an RMSE value of 132.8.
> holdout=sample(1:32,10,replace=F)
> fit.train = lm(Price~(Age+Number_Bidders),data=Grandfather_Clocks[-holdout,])
> predict.test = predict(fit.train,newdata=Grandfather_Clocks[holdout,])
> resid.test = (Grandfather_Clocks[holdout,2] - predict.test)
> sqrt(mean((resid.test^2)))
[1] 132.7861
This happens because the observations are randomly split into the training and test datasets.
Understanding the amount of variation in the RMSE via cross-validation is important. A k-fold
cross-validation typically provides a better estimate, as the RMSE is averaged over each of the k
folds. The Monte Carlo cross-validation approach presented below also alleviates this problem.
As we’ve seen above, the RMSE computed using the Train / Test approach is dependent on
which observations are placed into the training dataset and which are placed into the test
dataset. The Monte Carlo cross-validation approach alleviates this problem.
Monte Carlo Cross-Validation Procedure
Step 1: Randomly divide the dataset into a training set and a test set.
Step 2: Build the model using the observations from the training set.
Step 3: Compute RMSE on the observations in the test set using the model from Step 2.
Step 4: Repeat Steps 1 - 3 a large number of times, say b times. Record the RMSE from Step 3
on each iteration.
(Steps 1 - 3 are the same as the Train / Test approach; Step 4 is the one additional step.)
Monte Carlo Cross-Validation in R
The following function was written to conduct a Monte Carlo cross-validation procedure in R.
mc.cv = function(lm.object,data,p=0.33,b=100){
  #Getting the number of rows in data
  n=dim(data)[1]
  #How many observations should be in the holdout sample
  np=floor(p*n)
  #Getting a copy of the original response vector
  originaly = lm.object$model[,1]
  #Setting up an output vector to store the RMSE from each of the b iterations
  output = rep(0,b)
  #The loop for repeated iterations
  for(i in 1:b){
    #Getting the observations for the holdout sample
    holdout=sample(1:n,np,replace=F)
    #Fitting the model on the training dataset
    fit = lm(formula(lm.object),data=data[-holdout,])
    #Getting the predicted values for the test dataset
    predict.test = predict(fit,newdata=data[holdout,])
    #Getting resid^2 for the test dataset
    resid2 = (originaly[holdout]-predict.test)^2
    #Computing RMSE and placing the result into the output vector
    output[i] = sqrt(mean(resid2))
  }
  #Return RMSE values and their average over the b iterations
  list(RMSE_Vector=output,Avg_RMSE=mean(output))
}
Using the mc.cv function on the Grandfather_Clocks dataset.
> MC_RMSE=mc.cv(fit,Grandfather_Clocks)
Outcomes from the Monte Carlo cross-validation procedure.
> MC_RMSE
$RMSE_Vector
 [1] 116.6287 159.9629 113.4629 137.5389 115.3287 135.0001 121.4914
 [8] 105.6085 148.4967 194.0042 127.4927 183.8086 132.6002 161.4672
[15] 165.4827 180.6907 130.8478 131.2208 146.0145 111.5009 146.2957
[22] 128.0354 144.8986 148.8463 100.8981 133.7067 141.3693 150.2426
[29] 110.1783 151.9054 137.6646 145.9771 159.5572 119.1783 155.9115
[36] 103.5771 147.6365 142.6218 161.9792 100.2995 166.0258 162.3593
[43] 167.5278 155.8323 146.2666 164.4329 154.7935 116.9076 139.6966
[50] 128.5601 151.7445 148.9523 144.3970 159.6166  84.6834 168.7590
[57] 139.8218 167.9250 121.7352 173.9314 194.6839 145.8279 146.3267
[64] 171.2468 111.7259 132.1239 107.8448 142.3572 176.9573 135.8692
[71] 160.3456 156.2246 145.5988 181.3618 134.2481 138.2101 162.4199
[78] 168.4703 112.3097 139.8134 163.5863 140.4604 215.7054 153.8596
[85] 154.2188 170.9216 127.2419 102.3846 164.5099 179.0822 120.4892
[92] 115.8677 118.3454 136.8542 135.3349 131.1051 104.6982 146.6755
[99] 186.1349 136.4803
$Avg_RMSE
[1] 143.8132
The average RMSE over the 100 repeated iterations is about 143. Once again, this is a bit larger
than the RMSE from the model that included all the observations, i.e. 133.1. A plot of the
Monte Carlo RMSE values over these 100 iterations is shown below.
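A minimal sketch that reproduces such a plot (the axis labels below are my own choices, not from the original handout):
> plot(MC_RMSE$RMSE_Vector, type="l", xlab="Iteration", ylab="RMSE")
> abline(h=MC_RMSE$Avg_RMSE)   #horizontal line at the average RMSE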
The arguments in the mc.cv function can easily be modified. For example, the following will use
a 25% holdout sample and a total of 1000 repeated iterations. The average RMSE
returned under this setup was about 141.4. The distribution of the Monte Carlo RMSE values
over the 1000 iterations is shown below.
> mc.cv(fit,Grandfather_Clocks,p=0.25,b=1000)
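A histogram is one way to display this distribution; a minimal sketch (the object name mc1000 is hypothetical, and re-running the procedure will give slightly different values due to the random splits):
> mc1000 = mc.cv(fit,Grandfather_Clocks,p=0.25,b=1000)
> hist(mc1000$RMSE_Vector, xlab="RMSE")   #distribution of the 1000 RMSE values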