Handout #18: Jackknife and Cross-Validation in R

Section 18.1: The "Leave-One-Out" Concept for a Simple Mean

The "leave-one-out" notion in regression involves understanding the effect of a single observation on your model. The "leave-one-out" approach can be used to identify influential observations by investigating each observation's effect on the estimated regression coefficients. It should be noted that the "leave-one-out" notion extends beyond regression problems. This notion is more commonly known as jackknife resampling (see http://en.wikipedia.org/wiki/Jackknife_resampling). The natural extension of jackknife resampling would be "leave-several-out." This is known as cross-validation when the goal is understanding the predictive ability of a regression model.

"Leave-one-out" for a Mean

To begin, type the following into R. This creates a vector whose elements are (2, 3, 5, 8, 10).

> y=c(2,3,5,8,10)

Recall, the mean() function is used to obtain the mean of y.

> mean(y)
[1] 5.6

A minus sign inside the square brackets can be used to temporarily withhold an observation from calculations done on an object. For example, the following calculates the mean of y without the 1st observation. The outcome here is the same as calculating the mean using elements 2 through 5, which would be accomplished in R as mean( y[2:5] ).

> mean(y[-1])
[1] 6.5

The following shows the effect of the "leave-one-out" method applied to y for a mean calculation.

> mean(y[-2])
[1] 6.25
> mean(y[-3])
[1] 5.75
> mean(y[-4])
[1] 5
> mean(y[-5])
[1] 4.5

The mean of the complete y vector was 5.6. Leaving out the 1st observation appears to have the largest impact on the average. The following can be used to store the mean from each "leave-one-out" iteration.

Step #1: Create an initial vector to store the "leave-one-out" mean from each iteration.

> output = rep(0,5)
>
> #Looking at output
> output
[1] 0 0 0 0 0

Step #2: Use a for() loop to cycle through each of the five observations.

> for(i in 1:5){
+   output[i]=mean( y[-i] )
+ }

Step #3: Observing the output vector allows you to see the effect of removing each observation on the mean.

> #Looking at output after for() loop
> output
[1] 6.50 6.25 5.75 5.00 4.50
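As an aside, the same leave-one-out means can be obtained without writing an explicit loop. The one-liner below is a sketch using sapply(), which applies the "drop the i-th element" calculation to each of the five positions; it assumes the y vector created above.

> sapply(1:5, function(i) mean(y[-i]))   #loop-free version of the for() loop above
[1] 6.50 6.25 5.75 5.00 4.50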
Writing a function for "leave-one-out" for a mean

There are a variety of methods to create or build a function in R. The function to be created here is called mean.jackknife, and I will use the edit() function to create it. The edit() function produces a separate window.

> mean.jackknife=edit()

Some comments regarding functions in R.

• Arguments that need to be passed into a function are specified within the parentheses attached to function, i.e. function( ). The labeling of arguments within a function is separate from labels outside the function; e.g., in the following, "x" is an argument only within my mean.jackknife() function.
• The code for a function must be contained within a set of curly braces, i.e. {}.
• return() will return a single object from a function. The list() function can be used when more than a single object is to be returned from the function.

The finished mean.jackknife() function in R. The following can be used to cut-and-paste this function into R.

mean.jackknife = function(x){
  #Find the length of x
  n = length(x)
  #Set up the output vector
  output = rep(0,n)
  #Loop through the iterations
  for(i in 1:n){
    output[i] = mean( x[-i] )
  }
  #Return the output vector
  return(output)
}

Note: If pasting this into the edit() window, delete mean.jackknife and the equal sign.

After the function has been successfully created, you are able to use it just as any other function in R. The following produces the same outcomes as above.

> mean.jackknife(y)
[1] 6.50 6.25 5.75 5.00 4.50

The outcomes from this function can be put into a vector, say outcomes. Note: The outcomes vector does *not* need to be set up ahead of time in this situation.

> outcomes=mean.jackknife(y)
> outcomes
[1] 6.50 6.25 5.75 5.00 4.50

Creating a function to process the "leave-one-out" computations does take a bit more time; however, a function is more flexible in that it will work with any numeric vector. Furthermore, assuming you save the workspace image upon exit, this function will be permanently saved in your workspace so that the "leave-one-out" function written here is available later.

Section 18.2: The "Leave-One-Out" Approach for Regression Coefficients (i.e. DFBETAs)

Example 18.2.1

Again, for the sake of understanding, let's create a simple response vector, y, and a predictor variable, x.

y = (2, 3, 5, 8, 10) and x = (1, 2, 3, 4, 5)

Putting these into R can be done as follows.

> y=c(2,3,5,8,10)
> x=c(1,2,3,4,5)

Next, put each of these vectors into a data frame. A data frame will allow us to easily refer to particular elements of the data within various R functions.

> mydata=data.frame(y,x)

Fitting a simple linear regression model in R is done through the use of the lm() function. The data being used for this fit is contained in the data frame named mydata. The y and x used within the lm() function must be variable names in the mydata data frame.

• Mean Function: E(Y|x) = β0 + β1*x
• Variance Function: Var(Y|x) = σ²

> lm(y~x,data=mydata)

Call:
lm(formula = y ~ x, data = mydata)

Coefficients:
(Intercept)            x
       -0.7          2.1

The following can be used to save the regression output into an R object, say myfit.

> myfit=lm(y~x,data=mydata)
> myfit

Call:
lm(formula = y ~ x, data = mydata)

Coefficients:
(Intercept)            x
       -0.7          2.1

Next, let's remove the 1st row from the mydata data frame and refit the model. This is easily done in R using mydata[-1,]. The output here is being saved into an object called myfit.minus1. The regression coefficients from this model can easily be obtained from this object using myfit.minus1$coefficients, as is shown here.

> myfit.minus1=lm(y~x,data=mydata[-1,])
> myfit.minus1$coefficients
(Intercept)           x
       -1.9         2.4

The coefficients component of myfit.minus1 is a vector in R; thus, the following will return its second element, i.e. the estimated slope or β̂1 from the model.

> myfit.minus1$coefficients[2]
  x
2.4

Understanding the effect of removing additional observations on the estimated regression coefficients through the "leave-one-out" process follows.

> myfit.minus2=lm(y~x,data=mydata[-2,])
> myfit.minus2$coefficients[2]
       x
2.028571
> myfit.minus3=lm(y~x,data=mydata[-3,])
> myfit.minus3$coefficients[2]
  x
2.1
> myfit.minus4=lm(y~x,data=mydata[-4,])
> myfit.minus4$coefficients[2]
       x
2.057143
> myfit.minus5=lm(y~x,data=mydata[-5,])
> myfit.minus5$coefficients[2]
x
2

We can again automate this process through the use of a for() loop.

Step #1: Create a vector to store the estimated slope from the model from each iteration.

> output = rep(0,5)
>
> #Looking at output
> output
[1] 0 0 0 0 0

Step #2: Use a for() loop to cycle through each observation.

> for(i in 1:5){
+   fit = lm(y~x,data=mydata[-i,])
+   output[i]=fit$coefficients[2]
+ }

Step #3: Observe the effect of removing each observation on the estimated slope from the simple linear regression model.

> #Looking at output after for() loop
> output
[1] 2.400000 2.028571 2.100000 2.057143 2.000000

Recall that β̂1 from the model with all the observations was 2.1. The following simple subtraction can be used to understand how much the estimated slope changes when the "leave-one-out" approach is used.

> 2.1-output
[1] -0.30000000  0.07142857  0.00000000  0.04285714  0.10000000

Comments:

• The "leave-one-out" approach to understanding the effect on the estimated regression coefficients is called DFBETAs. For our model, a DFBETA exists for the estimated y-intercept and for the estimated slope.
• DFBETAs are often standardized with respect to the standard error. The standardized DFBETA for β̂1 would be calculated as follows, where the (-i) notation simply implies the i-th observation has been removed.

  DFBETA_1,(-i) = ( β̂1 - β̂1(-i) ) / Standard Error( β̂1(-i) )
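As an aside, base R provides built-in helpers for these quantities: dfbeta() returns, for each observation, the change in each estimated coefficient when that observation is deleted, and dfbetas() returns the standardized version. This sketch can serve as a cross-check on the loop above; the slope column of dfbeta(myfit) should match the 2.1-output subtraction shown earlier (verify the sign convention on your own output).

> dfbeta(myfit)    #per-observation change in each coefficient
> dfbetas(myfit)   #standardized DFBETAs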
The following function was written in R to conduct the "leave-one-out" procedure for a simple linear regression model.

betahat.jackknife = function(slr.object,data){
  #Getting the number of rows in data
  n = dim(data)[1]
  #Creating the output data frame, column 1 for beta0hat and
  # column 2 for beta1hat
  output = data.frame(beta0hat=rep(0,n), beta1hat=rep(0,n))
  #Looping through data; beta0hat will be put into column 1
  # and beta1hat will be put into column 2
  for(i in 1:n){
    fit=lm(formula(slr.object),data=data[-i,])
    output[i,1]=fit$coefficients[1]
    output[i,2]=fit$coefficients[2]
  }
  #Return the output data frame
  return(output)
}

Using this function in R

> fit=lm(y~x,data=mydata)
> betahat.jackknife(fit,mydata)
    beta0hat beta1hat
1 -1.9000000 2.400000
2 -0.3428571 2.028571
3 -0.5500000 2.100000
4 -0.6571429 2.057143
5 -0.5000000 2.000000
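Combining this output with the full-data fit reproduces the (unstandardized) change in the slope computed earlier. A short sketch, where jk is simply an assumed name for the saved output:

> jk = betahat.jackknife(fit,mydata)
> fit$coefficients[2] - jk$beta1hat   #same values as 2.1-output above
[1] -0.30000000  0.07142857  0.00000000  0.04285714  0.10000000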
Section 18.3: The "Leave-One-Out" Approach for Prediction

Example 18.3.1

Consider again the response and predictor variable from the previous section.

y = (2, 3, 5, 8, 10) and x = (1, 2, 3, 4, 5)

First, let's fit the standard simple linear regression model in JMP. (The JMP data table and Fit Model output are not reproduced here.)

Mean and Variance Functions
• E(Y|x) = β0 + β1*x
• Var(Y|x) = σ²

Theoretical Model Setup                                      Model Estimates
Mean:                E(Y|x) = β0 + β1*x                      Ê(Y|x) = β̂0 + β̂1*x = -0.70 + 2.1*x
Variance (MSE):      Var(Y|x) = σ²                           σ̂² = 0.3667
Std Dev (RMSE):      Std Dev(Y|x) = √σ² = σ                  σ̂ = √0.3667 ≈ 0.61

The goal of using the "leave-one-out" approach in this section is to understand the predictive ability of a model. The preferred measure for understanding predictive ability is the Root Mean Square Error.

Standard Deviation ↔ "Average" Distance to Mean

There are shortcomings in using the Root Mean Square Error from a regression model to understand the "average" distance to the mean, i.e. the "average" residual. The most significant are discussed here.

1. The purpose of the RMSE value is to provide the best possible unbiased estimate of the standard deviation in the response distribution after conditioning, i.e. Std Dev(Y|x) = σ.

   RMSE ↔ Best Possible Estimate of Std Dev(Y|x)

2. Using the RMSE value from the model is likely to underestimate a model's true ability to predict, because the observations for which predictions are being made were used to build the model.

Issue #1: The purpose of RMSE is to attain the best possible estimate of Std Dev(Y|x).
Issue #2: Using residuals from observations used to build the model results in an underestimate of a model's true predictive ability.

Goal of the "Leave-one-out" procedure for Prediction: Obtain a reasonable measure of RMSE which reflects the true predictive ability of a model.

Putting the response and predictor vectors from above into R and creating a data frame.

> y=c(2,3,5,8,10)
> x=c(1,2,3,4,5)
> mydata=data.frame(y,x)

Fitting the simple linear regression model in R and placing the output into an R object called fit. The summary() function displays much of the regression output.

> fit=lm(y~x,data=mydata)
> summary(fit)

Call:
lm(formula = y ~ x, data = mydata)

Residuals:
   1    2    3    4    5
 0.6 -0.5 -0.6  0.3  0.2

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.7000     0.6351  -1.102  0.35086
x             2.1000     0.1915  10.967  0.00162 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6055 on 3 degrees of freedom
Multiple R-squared:  0.9757,    Adjusted R-squared:  0.9676
F-statistic: 120.3 on 1 and 3 DF,  p-value: 0.001623

The estimated regression equation for this example is given by the following.

Ê(Y|x) = β̂0 + β̂1*x = -0.70 + 2.1*x

Recall, from Handout #8, the predicted values can be obtained through the matrix multiplication Ê(Y|x) = Xβ̂.

          [1 1]                [1.4]
          [1 2]   [-0.70]      [3.5]
Ê(Y|x) =  [1 3] * [ 2.1 ]  =   [5.6]
          [1 4]                [7.7]
          [1 5]                [9.8]

In R, this vector can be obtained directly using the predict() function. The predict() function needs two arguments: i) the model object and ii) the new data for which predictions should be made. Note: The column names for the predictors (in the newdata data frame) should match those in the data frame used to fit the model. In the following, predictions are being made on the original data -- which is the mydata data frame.

> predict(fit,newdata=mydata)
  1   2   3   4   5
1.4 3.5 5.6 7.7 9.8

You can simply save these predicted values into their own vector as follows.

> predictedy = predict(fit,newdata=mydata)

Once the predicted values are obtained, residuals can easily be computed in R.

> y-predictedy
   1    2    3    4    5
 0.6 -0.5 -0.6  0.3  0.2

                          [ 2]   [1.4]   [ 0.6]
                          [ 3]   [3.5]   [-0.5]
Residual = (y - Ê(Y|x)) = [ 5] - [5.6] = [-0.6]
                          [ 8]   [7.7]   [ 0.3]
                          [10]   [9.8]   [ 0.2]

Finding the Root Mean Squared Error by way of brute force in R is shown next. It should be noted that the denominator is given by df_Error = 3 and not the number of residuals; using df_Error makes the mean squared error an unbiased estimate of Var(Y|x) = σ².

RMSE = √( Σ(Residual)² / df_Error ) = √( 1.1 / 3 ) ≈ 0.61

> sqrt(sum((y-predictedy)^2)/3)
[1] 0.6055301
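As a quick cross-check, both pieces of this calculation are available directly from the fitted model object: residuals() extracts the residual vector, and summary(fit)$sigma holds the residual standard error reported in the summary output above.

> residuals(fit)
   1    2    3    4    5
 0.6 -0.5 -0.6  0.3  0.2
> summary(fit)$sigma
[1] 0.6055301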
Obtaining "Leave-one-out" Predictions in R

The "leave-one-out" approach for predictions can be accomplished easily in R through the following sequence of commands.

Step 1: Fit the model withholding the 1st observation.

> fit.minus1=lm(y~x,data=mydata[-1,])

Step 2: Make the prediction for the 1st observation.

> predictedy1=predict(fit.minus1,newdata=mydata[1,])

Step 3: Obtain the squared residual for this prediction.

> (y[1]-predictedy1)^2
   1
2.25

The above steps need to be done for each of the five observations in our data frame. This can be done easily using a for() loop in R. First, set up an output vector to store the results.

> output=rep(0,5)
> output
[1] 0 0 0 0 0

Use a for() loop to cycle through the data frame one row at a time.

> for(i in 1:5){
+   fit.minus = lm(y~x,data=mydata[-i,])
+   predictedy = predict(fit.minus,newdata=mydata[i,])
+   output[i] = (y[i]-predictedy)^2
+ }

The desired output has been placed in the output vector.

> output
[1] 2.2500000 0.5102041 0.5625000 0.1836735 0.2500000

The average squared residual via the "leave-one-out" method is about 0.75. The estimated Root Mean Squared Error via "leave-one-out" is about 0.87.

> mean(output)
[1] 0.7512755
> sqrt(mean(output))
[1] 0.8667615

The RMSE value from the "leave-one-out" approach is somewhat higher, as expected: predicting an observation is harder when that observation was not used to build the model. The degree of difference, i.e. a 43% increase, is somewhat exaggerated in this simple example due to having only 5 observations in the dataset.

         Original Model    "Leave-one-out" Method    % increase with "leave-one-out"
RMSE     0.61              0.87                      (0.87 - 0.61)/0.61 = 0.426 ≈ 43%
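For linear models there is a well-known shortcut that avoids refitting altogether: the leave-one-out residual equals the ordinary residual divided by one minus that observation's leverage, i.e. e_i/(1 - h_ii). The sketch below uses hatvalues() to apply this identity; in addition, boot::cv.glm() packages the same leave-one-out calculation for models fit with glm().

> (residuals(fit)/(1-hatvalues(fit)))^2   #squared LOO residuals via the leverage identity
        1         2         3         4         5
2.2500000 0.5102041 0.5625000 0.1836735 0.2500000
> library(boot)
> cv.glm(mydata, glm(y~x,data=mydata))$delta[1]   #mean squared LOO residual; should be about 0.751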
The following predict.jackknife() function was written in R to automate the process developed above. The list() function is used to return the output from this function, as more than one object is being returned.

predict.jackknife = function(lm.object,data){
  #Getting the number of rows in data
  n = dim(data)[1]
  #Keeping a copy of the original y (used in computing residuals)
  originaly = lm.object$model[,1]
  #Creating the output vector to save the squared residuals
  output = rep(0,n)
  #Looping through data
  for(i in 1:n){
    fit.minus=lm(formula(lm.object),data=data[-i,])
    predictedy = predict(fit.minus,newdata=data[i,])
    output[i]=(originaly[i]-predictedy)^2
  }
  #Return the squared residuals and the jackknife RMSE
  list(SquaredResids=output, Jackknife_RMSE=sqrt(mean(output)))
}

The following produces the same output as obtained above.

> predict.jackknife(fit,mydata)
$SquaredResids
[1] 2.2500000 0.5102041 0.5625000 0.1836735 0.2500000

$Jackknife_RMSE
[1] 0.8667615

Example 18.3.2

Consider once again the Grandfather Clocks dataset, which was considered in Handout #12. The jackknife estimate of the root mean square error is computed below. The jackknife estimate of the RMSE provides a more realistic measure of a model's ability to make valid predictions.

Model Setup
• Response Variable: Price
• Predictor Variables: Age and Number of Bidders
• Assume the following structure for the mean and variance functions
  o E(Price | Age, Num_Bidders) = β0 + β1*Age + β2*Num_Bidders
  o Var(Price | Age, Num_Bidders) = σ²

Fitting the model and getting summaries in R.

> fit=lm(Price~(Age + Number_Bidders),data=Grandfather_Clocks)
> summary(fit)

Call:
lm(formula = Price ~ (Age + Number_Bidders), data = Grandfather_Clocks)

Residuals:
   Min     1Q Median     3Q    Max
-207.2 -117.8   16.5  102.7  213.5

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    -1336.7221   173.3561  -7.711 1.67e-08 ***
Age               12.7362     0.9024  14.114 1.60e-14 ***
Number_Bidders    85.8151     8.7058   9.857 9.14e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 133.1 on 29 degrees of freedom
Multiple R-squared:  0.8927,    Adjusted R-squared:  0.8853
F-statistic: 120.7 on 2 and 29 DF,  p-value: 8.769e-15

Getting the "leave-one-out" or jackknife estimate of the Root Mean Square Error.

> predict.jackknife(fit,Grandfather_Clocks)
$SquaredResids
 [1] 31000.9962  7165.9569  1584.9082 33111.7095 16505.5745  3697.9257 23297.9888  2746.9212 14620.6329  6029.6136   432.7774
[12]  6187.0226  3015.9419  1434.9896 16781.6905 48435.3809 14878.0611 16813.4050 51526.3234 38840.7995 25337.9795   540.9022
[23] 46611.6946  1314.8238 45335.0691  6959.0187 52011.4549 24101.0321 15241.3391   302.2358 63774.8714 23201.3986

$Jackknife_RMSE
[1] 141.7348

For this example, the "leave-one-out" estimate of the RMSE is about 6.5% larger than the RMSE estimate from the model using all the data. Again, the increase in the RMSE using the "leave-one-out" approach is expected, and this estimate is a better measure of a model's true ability to predict.

         Original Model    "Leave-one-out", i.e. Jackknife, Estimate
RMSE     133.1             141.7
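The cv.glm() cross-check sketched earlier applies here as well: refitting the same model with glm() and taking the square root of the raw cross-validation error should reproduce the jackknife RMSE above (about 141.7).

> library(boot)
> glmfit = glm(Price~(Age + Number_Bidders),data=Grandfather_Clocks)
> sqrt(cv.glm(Grandfather_Clocks,glmfit)$delta[1])   #should be about 141.7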
Section 18.4: Cross-Validation for Prediction

The natural extension to the "leave-one-out" approach would be to "leave-several-out". This more general notion is called cross-validation. There are a variety of cross-validation methods used for prediction. The Train / Test and Monte Carlo approaches are discussed here. The Train / Test approach is considered a 2-fold cross-validation procedure. The k-fold cross-validation is an extension of the Train / Test approach and is more commonly used in practice; however, k-fold is not discussed in this handout.

• "Leave-one-out", i.e. the Jackknife approach -- see the previous section
• "Leave-several-out", i.e. the Train / Test or Split Sample approach -- discussed here
• Monte Carlo approach -- discussed below
• K-fold approach -- not discussed in this handout

The following procedure is used for the Train / Test or Split Sample approach to cross-validation.

Train / Test Cross-Validation Procedure
Step 1: Randomly divide the dataset into a training set and a test set. A typical division puts 2/3 of the data into the training set and sets aside the remaining 1/3 for the test set.
Step 2: Build the model using the observations from the training set.
Step 3: Compute the RMSE on the observations in the test set using the model from Step 2.

To visualize the Train / Test cross-validation procedure, picture a "hat" containing all the observations, which are randomly split into two groups (i.e. 2 folds) -- the training dataset and the test dataset. The model is built using the training dataset, and a measure of predictive ability is computed using the test dataset.

Example 18.4.1

Consider the Grandfather Clocks dataset again, where Price is the response of interest and Age and Number of Bidders are the predictor variables of interest.

Model Setup
• Response Variable: Price
• Predictor Variables: Age and Number of Bidders
• Assume the following structure for the mean and variance functions
  o E(Price | Age, Num_Bidders) = β0 + β1*Age + β2*Num_Bidders
  o Var(Price | Age, Num_Bidders) = σ²

Train / Test Cross-Validation in R

Step 1: Randomly divide the dataset into a training dataset and a test dataset. A typical division puts 2/3 of the data into the training set and sets aside the remaining 1/3 for the test set.

Determine the number of observations, i.e. rows, in our dataset.

> dim(Grandfather_Clocks)
[1] 32  4

There are 32 observations in our dataset. If 1/3 of the observations are to be saved out for the test dataset, then about 10 of the 32 observations should be randomly selected and placed into the test dataset.

> 0.33*32
[1] 10.56

The sample() function is used to randomly select the observations to be placed into the test dataset. The following identifies the arguments used in the sample() function.

o 1:32 -- specifies the sequence 1, 2, 3, ..., 32, where each value represents a row in the dataset
o 10 -- specifies that 10 observations are to be randomly set aside for the test dataset
o replace=F -- specifies that sampling of the rows should be done without replacement, i.e. an observation cannot be taken out twice

> holdout=sample(1:32,10,replace=F)

The following observations will *not* be used when building the model.

> holdout
[1] 26  6 30 13 17 12 28 10  5  4
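Because sample() draws the holdout rows at random, the split (and every result computed from it) will change from run to run. If a reproducible split is wanted, a seed can be set before sampling; a short sketch, where the seed value 101 is arbitrary.

> set.seed(101)   #any fixed seed makes the sample() draw reproducible
> holdout=sample(1:32,10,replace=F)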
Step 2: Build the model using the observations from the training set only.

> fit.train = lm(Price~(Age+Number_Bidders),data=Grandfather_Clocks[-holdout,])

The summary output from this model.

> summary(fit.train)

Call:
lm(formula = Price ~ (Age + Number_Bidders), data = Grandfather_Clocks[-holdout, ])

Residuals:
    Min      1Q  Median      3Q     Max
-206.27 -129.96  -31.32  116.09  233.69

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    -1270.929    253.086  -5.022 7.57e-05 ***
Age               12.507      1.303   9.598 1.01e-08 ***
Number_Bidders    81.376     11.368   7.158 8.38e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 147.3 on 19 degrees of freedom
Multiple R-squared:  0.8467,    Adjusted R-squared:  0.8305
F-statistic: 52.46 on 2 and 19 DF,  p-value: 1.834e-08

Step 3: Compute the RMSE on the observations in the test set using the model from Step 2.

First, get the predicted values for the observations in the test dataset, and compute their corresponding residuals as well. Note: Grandfather_Clocks[holdout,2] selects the response, Price, for the holdout rows.

> predict.test = predict(fit.train,newdata=Grandfather_Clocks[holdout,])
> resid.test = (Grandfather_Clocks[holdout,2] - predict.test)

Computing the Root Mean Square Error on the test dataset.

> sqrt(mean((resid.test^2)))
[1] 104.8246

The RMSE value from the model using all the data was 133.1. The RMSE from the observations in the test dataset, using a model built from the observations in the training set, is somewhat smaller at 104.8. This is somewhat unexpected, as the observations in the test dataset were not used to build the model -- yet according to the RMSE value, I have an increased ability to make predictions for these observations. If the train / test process were repeated, we'd get a different RMSE; a second iteration produces an RMSE value of 132.8.

> holdout=sample(1:32,10,replace=F)
> fit.train = lm(Price~(Age+Number_Bidders),data=Grandfather_Clocks[-holdout,])
> predict.test = predict(fit.train,newdata=Grandfather_Clocks[holdout,])
> resid.test = (Grandfather_Clocks[holdout,2] - predict.test)
> sqrt(mean((resid.test^2)))
[1] 132.7861

This happens because the observations are randomly split into the training and test datasets: the RMSE computed from the Train / Test approach depends on which observations are placed into the training dataset and which are placed into the test dataset. Understanding the amount of variation in the RMSE via cross-validation is therefore important. A k-fold cross-validation typically provides a better estimate, as the RMSE is averaged over each of the k folds. The Monte Carlo cross-validation approach presented below also alleviates this problem.

Monte Carlo Cross-Validation Procedure
Step 1: Randomly divide the dataset into a training set and a test set. (same as the Train / Test approach)
Step 2: Build the model using the observations from the training set. (same as the Train / Test approach)
Step 3: Compute the RMSE on the observations in the test set using the model from Step 2. (same as the Train / Test approach)
Step 4: Repeat Steps 1 - 3 a large number of times, say b times. Record the RMSE from Step 3 on each iteration. (the one additional step)

Monte Carlo Cross-Validation in R

The following function was written to conduct the Monte Carlo cross-validation procedure in R.

mc.cv = function(lm.object, data, p=0.33, b=100){
  #Getting the number of rows in data
  n=dim(data)[1]
  #How many observations should be in the holdout sample
  np=floor(p*n)
  #Getting a copy of the original response vector
  originaly = lm.object$model[,1]
  #Output vector to store the RMSE from each of the b iterations
  output = rep(0,b)
  #The loop for repeated iterations
  for(i in 1:b){
    #Getting the observations for the holdout sample
    holdout=sample(1:n,np,replace=F)
    #Fitting the model on the training dataset
    fit = lm(formula(lm.object),data=data[-holdout,])
    #Getting the predicted values for the test dataset
    predict.test = predict(fit,newdata=data[holdout,])
    #Getting resid^2 for the test dataset
    resid2 = (originaly[holdout]-predict.test)^2
    #Computing the RMSE and placing the result into the output vector
    output[i] = sqrt(mean(resid2))
  }
  #Return the RMSE values and their average over the b iterations
  list(RMSE_Vector=output, Avg_RMSE=mean(output))
}

Using the mc.cv() function on the Grandfather_Clocks dataset.

> MC_RMSE=mc.cv(fit,Grandfather_Clocks)

Outcomes from the Monte Carlo cross-validation procedure.

> MC_RMSE
$RMSE_Vector
  [1] 116.6287 159.9629 113.4629 137.5389 115.3287 135.0001 121.4914
  [8] 105.6085 148.4967 194.0042 127.4927 183.8086 132.6002 161.4672
 [15] 165.4827 180.6907 130.8478 131.2208 146.0145 111.5009 146.2957
 [22] 128.0354 144.8986 148.8463 100.8981 133.7067 141.3693 150.2426
 [29] 110.1783 151.9054 137.6646 145.9771 159.5572 119.1783 155.9115
 [36] 103.5771 147.6365 142.6218 161.9792 100.2995 166.0258 162.3593
 [43] 167.5278 155.8323 146.2666 164.4329 154.7935 116.9076 139.6966
 [50] 128.5601 151.7445 148.9523 144.3970 159.6166  84.6834 168.7590
 [57] 139.8218 167.9250 121.7352 173.9314 194.6839 145.8279 146.3267
 [64] 171.2468 111.7259 132.1239 107.8448 142.3572 176.9573 135.8692
 [71] 160.3456 156.2246 145.5988 181.3618 134.2481 138.2101 162.4199
 [78] 168.4703 112.3097 139.8134 163.5863 140.4604 215.7054 153.8596
 [85] 154.2188 170.9216 127.2419 102.3846 164.5099 179.0822 120.4892
 [92] 115.8677 118.3454 136.8542 135.3349 131.1051 104.6982 146.6755
 [99] 186.1349 136.4803

$Avg_RMSE
[1] 143.8132

The average RMSE over the 100 repeated iterations is about 143.8. Once again, this is a bit larger than the RMSE from the model that included all the observations, i.e. 133.1. Plotting the Monte Carlo RMSE values over these 100 iterations shows how much the estimate varies from split to split.

The arguments in the mc.cv() function can easily be modified. For example, the following will use a 25% holdout sample and a total of 1000 repeated iterations. The average RMSE returned under this setup was about 141.4.

> mc.cv(fit,Grandfather_Clocks,p=0.25,b=1000)
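As with the default run, the result can be saved into an object so that the average RMSE and the full RMSE vector are available for further work; the hist() call below sketches the distribution of the 1000 RMSE values described above. The object name MC2 is arbitrary.

> MC2 = mc.cv(fit,Grandfather_Clocks,p=0.25,b=1000)
> MC2$Avg_RMSE   #about 141.4, per the discussion above
> hist(MC2$RMSE_Vector, xlab="RMSE", main="Monte Carlo RMSE (p=0.25, b=1000)")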