1 - Multiple Regression in R using OLS 1.1 – “Review” of OLS Load the comma-delimited file bodyfat.csv into R > Bodyfat = read.table(file.choose(),header=T,sep=",") Read 3528 items > Bodyfat = Bodyfat[,-1] first column density is redundant Response is in column 1, the candidate predictors are in columns 2 – 14. > X <- Bodyfat[,2:14] > y <- Bodyfat[,1] > dim(X) [1] 252 13 > dim(y) [1] 252 1 > pairs.plus(Bodyfat) Examine a scatterplot matrix with the “bells and whistles”… > bodyfat.ols = lm(bodyfat~.,data=Bodyfat) 1 > summary(bodyfat.ols) Call: lm(formula = bodyfat ~ ., data = Bodyfat) Residuals: Min 1Q -11.1966 -2.8824 Median -0.1111 3Q 3.1901 Max 9.9979 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -21.35323 22.18616 -0.962 0.33680 age 0.06457 0.03219 2.006 0.04601 * weight -0.09638 0.06185 -1.558 0.12047 height -0.04394 0.17870 -0.246 0.80599 neck -0.47547 0.23557 -2.018 0.04467 * chest -0.01718 0.10322 -0.166 0.86792 abdomen 0.95500 0.09016 10.592 < 2e-16 *** hip -0.18859 0.14479 -1.302 0.19401 thigh 0.24835 0.14617 1.699 0.09061 . knee 0.01395 0.24775 0.056 0.95516 ankle 0.17788 0.22262 0.799 0.42505 biceps 0.18230 0.17250 1.057 0.29166 forearm 0.45574 0.19930 2.287 0.02309 * wrist -1.65450 0.53316 -3.103 0.00215 ** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 4.309 on 238 degrees of freedom Multiple R-squared: 0.7486, Adjusted R-squared: 0.7348 F-statistic: 54.5 on 13 and 238 DF, p-value: < 2.2e-16 Regression diagnostics using a variety of functions written by Chris Malone for his senior capstone project while an undergraduate at WSU. > Diagplot1(bodyfat.ols) Look at Cook’s Distances & Leverages > Diagplot2(bodyfat.ols) DFBETA’s primarily > Diagplot3(bodyfat.ols,dfbet=T) AVP’s, DFBETAS, and VIFS 2 > Resplot Various diagnostic plots examining the residuals 3 > MLRdiag(bodyfat.ols) Inverse response plots with case diagnostics added > VIF(bodyfat.ols) returns table of VIF’s for each predictor. This table is returned by Diagplot3 as well Variance Inflation Factor Table age weight height neck chest abdomen hip thigh knee ankle biceps forearm wrist Variable age weight height neck chest abdomen hip thigh knee ankle biceps forearm wrist VIF 2.224469 44.652515 2.939110 4.431923 10.234694 12.775528 14.541932 7.958662 4.825304 1.924098 3.670907 2.191933 3.348404 Rsquared 0.5504545 0.9776048 0.6597610 0.7743643 0.9022931 0.9217253 0.9312333 0.8743507 0.7927592 0.4802760 0.7275877 0.5437817 0.7013503 There is clearly evidence of collinearity suggesting that a reduced model should be considered. Model “selection” is the focus of this handout. We will first consider using standard stepwise selection methods – forward, backward, mixed, or potentially all possible subsets. 4 1.2 – C + R Plots and CERES Plots in R These plots are used to visualize the functional form for a predictor in a OLS multiple regression setting. We can formulate an OLS regression with a response Y and potential predictors 𝑋1 , 𝑋2 , … , 𝑋𝑝 as follows: 𝑌 = 𝜂𝑜 + 𝜂1 𝜏1 (𝑋1 ) + ⋯ + 𝜂𝑝 𝜏𝑝 (𝑋𝑝 ) + 𝜀 where the 𝜏𝑖 (𝑋𝑖 )′𝑠 represent the functional form of the 𝑖 𝑡ℎ predictor in the model. For example 𝜏𝑖 (𝑋𝑖 ) = ln(𝑋𝑖 ) or 𝜏𝑖 (𝑋𝑖 ) = 𝑝𝑜𝑙𝑦𝑛𝑜𝑚𝑖𝑎𝑙 𝑜𝑓 𝑑𝑒𝑔𝑟𝑒𝑒 2 𝑖𝑛 𝑋𝑖 (i.e. add 𝑋𝑖 and 𝑋𝑖2 ) terms to the model. The model above is an example of what we call an additive model. Later in the course we will look at the other methods for developing flexible additive models in a regression setting. 
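To make the notation concrete, here is a minimal sketch (not from the handout) of how a chosen functional form τj(Xj) is specified in an lm() call; the log and degree-2 polynomial terms below are arbitrary illustrative choices for these data, not recommendations.

> bodyfat.tau = lm(bodyfat ~ age + weight + log(abdomen) + poly(wrist,2), data=Bodyfat)
> summary(bodyfat.tau)

Terms created this way are still fit by OLS; the C+R and CERES plots described next are the graphical tools used to decide which predictors warrant such terms.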
The package car contains functions for regression that are similar to those available in Arc which the software developed to accompany Applied Regression: Including Computing and Graphics by Cook & Weisberg (text from STAT 360). Although not as interactive as Arc, the crPlots() & ceresPlots() functions in the car library will construct C+R and CERES plots respectively for each term in a regression model. As stated earlier, both C+R plots and CERES Plots are used to visualize the predictors that might benefit from the creation of nonlinear terms based on the predictor. CERES plots are better when there are nonlinear relationships amongst the predictors themselves. The nonlinear relationships between the predictors can “bleed” into the C+R Plots, resulting in an inaccurate representation of the potential terms. Component + Residual Plots (C+R Plots) > crPlots(bodyfat.ols) 5 CERES Plots (Conditional Expectation RESidual plots) > ceresPlots(bodyfat.ols) 6 1.3 - Standard Stepwise Selection Methods for OLS Regression These methods seek to minimize a penalized version of the RSS = residual sum of squares of the regression model. These statistics are Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), adjusted R-square (adj-R2), and Mallow’s Ck and presented below: 𝐴𝐼𝐶 = 𝑛𝑙𝑜𝑔 ( 𝐶𝑘 = 𝑅𝑆𝑆𝑘 1 (𝑅𝑆𝑆 + 2𝑘𝜎̂ 2 ) ) + 2𝑘 = 𝑛 𝑛𝜎̂ 2 𝑅𝑆𝑆𝑘 1 + 2𝑘 − 𝑛 = (𝑅𝑆𝑆 + 2𝑘𝜎̂ 2 ) 2 𝜎̂ 𝑛 𝐵𝐼𝐶 = 1 (𝑅𝑆𝑆 + log(𝑛) 𝑘𝜎̂ 2 ) 𝑛 𝐴𝑑𝑗𝑢𝑠𝑡𝑒𝑑 𝑅 2 = 1 − 𝑅𝑆𝑆/(𝑛 − 𝑘 − 1) 𝑆𝑆𝑇𝑜𝑡 /(𝑛 − 1) where k = the number of parameters in the candidate model and 𝜎̂ 2 = estimated residual variance from the “full” model. Minimizing AIC, BIC, or Ck in the case of OLS yields the “best” model according to that criterion. In contrast, the adjusted R2 is maximized to find the “best” model. Backward Elimination > bodyfat.back = step(bodyfat.ols,direction="backward") Backward elimination results are displayed (not shown) > anova(bodyfat.back) Analysis of Variance Table Response: bodyfat Df Sum Sq Mean Sq F value Pr(>F) age 1 1493.3 1493.3 81.4468 < 2.2e-16 weight 1 6674.3 6674.3 364.0279 < 2.2e-16 neck 1 182.5 182.5 9.9533 0.001808 abdomen 1 4373.0 4373.0 238.5125 < 2.2e-16 hip 1 6.9 6.9 0.3747 0.541022 thigh 1 136.6 136.6 7.4523 0.006799 forearm 1 90.1 90.1 4.9164 0.027528 wrist 1 166.8 166.8 9.1002 0.002827 Residuals 243 4455.3 18.3 --Signif. codes: *** *** ** *** ** * ** 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 7 > bodyfat.back$anova Step Df Deviance Resid. Df Resid. 
Dev 1 NA NA 238 4420.064 2 - knee 1 0.05885058 239 4420.123 3 - chest 1 0.52286065 240 4420.646 4 - height 1 0.68462867 241 4421.330 5 - ankle 1 13.28231735 242 4434.613 6 - biceps 1 20.71159705 243 4455.324 AIC 749.8491 747.8524 745.8822 743.9212 742.6772 741.8514 Forward Selection (painful due to the fact candidate predictors need to be listed explicitly) > bodyfat.base = lm(bodyfat~1,data=Bodyfat) Model with intercept only > bodyfat.forward step(bodyfat.base,~.+age+weight+height+ neck+chest+abdomen+hip+thigh+knee+ankle+biceps+forearm+ wrist,direction="forward") Start: AIC=1071.75 bodyfat ~ 1 + abdomen + chest + hip + weight + thigh + knee + biceps + neck + forearm + wrist + age + ankle <none> + height Df Sum of Sq RSS AIC 1 11631.5 5947.5 800.65 1 8678.3 8900.7 902.24 1 6871.2 10707.8 948.82 1 6593.0 10986.0 955.29 1 5505.0 12073.9 979.08 1 4548.4 13030.6 998.30 1 4277.3 13301.7 1003.49 1 4230.9 13348.1 1004.36 1 2295.8 15283.2 1038.48 1 2111.5 15467.5 1041.50 1 1493.3 16085.7 1051.38 1 1243.5 16335.5 1055.26 17579.0 1071.75 1 11.2 17567.7 1073.59 Etc… Both or Mixed Selection > bodyfat.mixed = step(bodyfat.ols) default=”both”, feeds in full model > anova(bodyfat.mixed) Analysis of Variance Table Response: bodyfat Df Sum Sq Mean Sq F value Pr(>F) age 1 1493.3 1493.3 81.4468 < 2.2e-16 *** weight 1 6674.3 6674.3 364.0279 < 2.2e-16 *** neck 1 182.5 182.5 9.9533 0.001808 ** abdomen 1 4373.0 4373.0 238.5125 < 2.2e-16 *** hip 1 6.9 6.9 0.3747 0.541022 thigh 1 136.6 136.6 7.4523 0.006799 ** forearm 1 90.1 90.1 4.9164 0.027528 * wrist 1 166.8 166.8 9.1002 0.002827 ** Residuals 243 4455.3 18.3 --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 8 > bodyfat.mixed$anova Step Df Deviance Resid. Df Resid. Dev 1 NA NA 238 4420.064 2 - knee 1 0.05885058 239 4420.123 3 - chest 1 0.52286065 240 4420.646 4 - height 1 0.68462867 241 4421.330 5 - ankle 1 13.28231735 242 4434.613 6 - biceps 1 20.71159705 243 4455.324 AIC 749.8491 747.8524 745.8822 743.9212 742.6772 741.8514 Stepwise Methods Using the leaps package in R The package leaps available through CRAN will perform forward, backward, and mixed approaches as well, but offer some improvements over the default step function in base R. 
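Before turning to leaps, note that the AIC column reported by step() can be reproduced by hand, which ties the output above back to the AIC formula in Section 1.3. A minimal sketch (assuming bodyfat.back and Bodyfat from above): step() uses extractAIC(), which for lm() computes n·log(RSS/n) + 2k and therefore differs from AIC() only by an additive constant, so only differences in AIC between candidate models matter.

> extractAIC(bodyfat.back)       # (k, AIC) for the backward-selected model
> n = nrow(Bodyfat)
> rss = sum(resid(bodyfat.back)^2)
> k = length(coef(bodyfat.back))
> n*log(rss/n) + 2*k             # matches the final AIC shown in bodyfat.back$anova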
> library(leaps) > bodyfat.full = regsubsets(bodyfat~.,data=Bodyfat,nvmax=13) > summary(bodyfat.full) Subset selection object Call: regsubsets.formula(bodyfat 13 Variables (and intercept) Forced in Forced out age FALSE FALSE weight FALSE FALSE height FALSE FALSE neck FALSE FALSE chest FALSE FALSE abdomen FALSE FALSE hip FALSE FALSE thigh FALSE FALSE knee FALSE FALSE ankle FALSE FALSE biceps FALSE FALSE forearm FALSE FALSE wrist FALSE FALSE 1 subsets of each size up to 13 Selection Algorithm: exhaustive age weight height neck 1 ( 1 ) " " " " " " " " 2 ( 1 ) " " "*" " " " " 3 ( 1 ) " " "*" " " " " 4 ( 1 ) " " "*" " " " " 5 ( 1 ) " " "*" " " "*" 6 ( 1 ) "*" "*" " " " " 7 ( 1 ) "*" "*" " " "*" 8 ( 1 ) "*" "*" " " "*" 9 ( 1 ) "*" "*" " " "*" 10 ( 1 ) "*" "*" " " "*" 11 ( 1 ) "*" "*" "*" "*" 12 ( 1 ) "*" "*" "*" "*" 13 ( 1 ) "*" "*" "*" "*" ~ ., data = Bodyfat, nvmax = 13) chest " " " " " " " " " " " " " " " " " " " " " " "*" "*" abdomen "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" hip " " " " " " " " " " " " " " "*" "*" "*" "*" "*" "*" thigh " " " " " " " " " " "*" "*" "*" "*" "*" "*" "*" "*" knee " " " " " " " " " " " " " " " " " " " " " " " " "*" ankle " " " " " " " " " " " " " " " " " " "*" "*" "*" "*" biceps " " " " " " " " " " " " " " " " "*" "*" "*" "*" "*" forearm " " " " " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" wrist " " " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" > reg.summary = summary(bodyfat.full) > names(reg.summary) [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj" 9 > par(mfrow=c(2,2)) set up a 2 X 2 grid of plots > plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="b") > plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted Rsquare",type="b") > plot(reg.summary$cp,xlab="Number of Variables",ylab="Mallow's Cp",type="b") > plot(reg.summary$bic,xlab="Number of Variables",ylab="Bayesian Information Criterion (BIC)",type="b") > par(mfrow=c(1,1)) restore to 1 plot per page Find “optimal” model size using adjusted-R2, Mallow’s Ck, and BIC > which.max(reg.summary$adjr2) [1] 9 > which.min(reg.summary$cp) [1] 7 > which.min(reg.summary$bic) [1] 4 10 The regsubsets() function has a built-in plot command which can display the selected variables for the “best” model with a given model selection statistic. The top row of each plot contains a black square for each variable selected according to the optimal model associated with that statistic. Examples using the R2 (unadjusted), adjusted R2, Mallow’s Ck, and the BIC are shown on the following page. > plot(bodyfat.full,scale="r2") > plot(bodyfat.full,scale="adjr2") > plot(bodyfat.full,scale="Cp") > plot(bodyfat.full,scale="bic") Automatic Selection via All Possible Subsets The package bestglm uses will return the “best” model using user-specified model selection criterion such as AIC (basically Mallow’s Ck for OLS), BIC, and crossvalidation schemes. The PDF documentation for this package is excellent with several complete examples and details on how to use the various options. The output below shows the use of the bestglm function to find the “best” OLS model using the AIC/Ck criterion. 
> library(bestglm) > Xy = cbind(x,y) > bodyfat.best = bestglm(Xy,IC="AIC") 11 > attributes(bodyfat.best) $names [1] "BestModel" "BestModels" "ModelReport" "Bestq" "qTable" "Subsets" "Title" $class [1] "bestglm" > bodyfat.best$Subsets > bodyfat.best$BestModel Call: lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE), drop = FALSE], y = y)) Coefficients: (Intercept) -22.65637 age 0.06578 weight -0.08985 neck -0.46656 > summary(bodyfat.best$BestModel) Residuals: Min 1Q Median 3Q -10.9757 -2.9937 -0.1644 2.9766 abdomen 0.94482 hip -0.19543 thigh 0.30239 forearm 0.51572 wrist -1.53665 Max 10.2244 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -22.65637 11.71385 -1.934 0.05426 . age 0.06578 0.03078 2.137 0.03356 * weight -0.08985 0.03991 -2.252 0.02524 * neck -0.46656 0.22462 -2.077 0.03884 * abdomen 0.94482 0.07193 13.134 < 2e-16 *** hip -0.19543 0.13847 -1.411 0.15940 thigh 0.30239 0.12904 2.343 0.01992 * forearm 0.51572 0.18631 2.768 0.00607 ** wrist -1.53665 0.50939 -3.017 0.00283 ** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 4.282 on 243 degrees of freedom Multiple R-squared: 0.7466, Adjusted R-squared: 0.7382 F-statistic: 89.47 on 8 and 243 DF, p-value: < 2.2e-16 Save the “best” OLS model to an object named appropriately. We can then examine various regression diagnostics for this model as considered above. > bodyfat.bestols = lm(formula(bodyfat.best$BestModel)) > MLRdiag(bodyfat.bestols) etc… 12 > VIF(bodyfat.bestols) Variance Inflation Factor Table age weight neck abdomen hip thigh forearm wrist Variable VIF Rsquared age 2.059194 0.5143731 weight 18.829990 0.9468932 neck 4.081562 0.7549958 abdomen 8.236808 0.8785937 hip 13.471431 0.9257688 thigh 6.283117 0.8408433 forearm 1.940309 0.4846180 wrist 3.096051 0.6770079 The presence of collinearity issues even after reducing the model size suggests some problems with the “in/out” selection strategies regardless of the criterion used to select them. 13 1.4 - Cross-Validation Functions for OLS In this section I will show some sample R code that can be used (and altered) to perform cross-validation to estimate the “average” PSE (prediction squared error), MPSE (mean prediction squared error), or the MSEP (mean squared error for prediction). Note these are all the same thing! We could also take the square root of any of these to obtain root squared error of prediction (RMSEP). I won’t even consider listing other associated acronyms. Suppose we have m observations to predict the value of response (y) for. These m observations must NOT have been used to develop/train the model! 1 MSE for prediction = 𝑚 ∑𝑚 𝑖=1 (𝑦𝑖 − 𝑦𝑝𝑟𝑒𝑑 (𝑖)) 2 As discussed in class there are several schemes that can be used to estimate the predictive abilities of a regression model. The list of methods we discussed is: 1) 2) 3) 4) 5) Test/Train Samples or Split Sample approach K-fold Cross-Validation (k-fold CV) Leave-out-one Cross-validation (LOOCV) Monte Carlo Cross-validation (MCCV) .632 Bootstrap In this handout I will demonstrate these different methods of cross-validation using the Bodyfat example. In the textbook (section 6.5.3 pg. 248-251), the authors demonstrate how to use k-fold cross-validation to determine the optimal number of predictors in the OLS model using the Hitters data found in the ISLR package. The approach I will take with the body fat data is a little different. 
I will assume that we have a model chosen and we wish to estimate the predictive performance of this model using the MSEP estimated via crossvalidation. Test/Train or Split Sample Cross-Validation for an OLS Regression Model The code below will construct a split sample cross-validation for an OLS model. We first choose a fraction of the available data p to form our training data (e.g. p = .67). The remaining data is then used to form our test cases. You fit the model to training data and then predict the response value for the test cases. > dim(Bodyfat) [1] 252 14 > n = nrow(Bodyfat) > n [1] 252 14 > p = .67 > m = floor(n*(1-p)) > m [1] 83 > sam = sample(1:n,m,replace=F) > Bodyfat.test = Bodyfat[sam,] > Bodyfat.train = Bodyfat[-sam,] > Bodyfat.ols = lm(bodyfat~.,data=Bodyfat.train) > bodyfat.pred = predict(Bodyfat.ols,newdata=Bodyfat.test) > pred.err = Bodyfat.test$bodyfat – bodyfat.pred > pred.err > pred.err 109 6.16872468 211 -5.80966602 197 5.61137864 102 0.04633038 47 2.68099394 101 2.71919034 234 2.20765427 10 2.33085273 42 3.56701914 49 -4.26635609 72 -3.89057224 71 4.82964948 206 2.53708347 97 -7.65387694 27 -1.29145551 252 5.19305658 118 -0.20192157 241 2.17346001 41 -2.47471287 195 5.97086308 78 3.02314968 143 4.87661670 1 -3.66664960 15 -1.67430723 204 -8.82900506 88 2.17247349 5 1.13773495 185 -0.47892211 148 7.27411561 84 194 223 199 129 247 5.55960734 -2.26666371 -5.76919180 1.29122863 2.29403342 0.84380506 212 116 55 142 3 183 2.78002236 0.16611536 -4.26941892 -3.01050196 7.11824662 -4.41990034 91 83 134 222 110 58 -1.72264289 -4.54710906 5.50948186 -3.43478744 0.16872091 1.35967343 45 240 51 151 173 207 -3.82375057 4.43252304 -5.70578266 1.00073463 3.66326700 10.55889964 53 192 170 119 196 50 -6.68075831 8.27135710 -3.41991050 7.41856162 2.75480323 -2.12010334 107 123 163 153 67 232 -7.31113211 2.05349309 -2.67587344 5.74096473 5.40488472 -5.18731001 177 2 184 133 103 157 -2.32761863 -3.24592972 -4.80316777 -1.70083102 2.46434656 2.33830994 17 61 166 235 56 100 5.91822645 1.14124567 1.53410161 4.48022867 -0.81379728 3.74242033 40 251 210 233 169 43 0.85255665 1.77435585 -4.14641758 -0.90216282 -3.40791216 -2.22336196 > mean(pred.err^2) [1] 17.8044 If did this yourself you would most likely obtain a different PSE, because your random sample of the indices would produce different test and training samples. We can guarantee our results will match by using the command set.seed()to obtain the same random samples. > set.seed(1) if we all used this value before any command that utilizes randomization we get the same results. > set.seed(1) > sam = sample(1:n,m,replace=F) > Bodyfat.test = Bodyfat[sam,] > Bodyfat.train = Bodyfat[-sam,] > Bodyfat.ols = lm(bodyfat~.,data=Bodyfat.train) > bodyfat.pred = predict(Bodyfat.ols,newdata=Bodyfat.test) > pred.err = Bodyfat.test$bodyfat - bodyfat.pred > mean(pred.err^2) [1] 14.20143 > set.seed(1000) > sam = sample(1:n,m,replace=F) > Bodyfat.test = Bodyfat[sam,] > Bodyfat.train = Bodyfat[-sam,] > Bodyfat.ols = lm(bodyfat~.,data=Bodyfat.train) > bodyfat.pred = predict(Bodyfat.ols,newdata=Bodyfat.test) > pred.err = Bodyfat.test$bodyfat - bodyfat.pred > mean(pred.err^2) [1] 20.6782 > set.seed(1111) > sam = sample(1:n,m,replace=F) > Bodyfat.test = Bodyfat[sam,] > Bodyfat.train = Bodyfat[-sam,] > Bodyfat.ols = lm(bodyfat~.,data=Bodyfat.train) > bodyfat.pred = predict(Bodyfat.ols,newdata=Bodyfat.test) > pred.err = Bodyfat.test$bodyfat - bodyfat.pred > mean(pred.err^2) [1] 22.39968 Notice the variation in the MPSE estimates! 
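The sensitivity to the particular split can be quantified by simply repeating the procedure; a minimal sketch (assuming Bodyfat, n, and m are defined as above; the seed and B = 200 are arbitrary choices):

> set.seed(123)
> B = 200
> mpse = numeric(B)
> for (b in 1:B) {
    sam = sample(1:n,m,replace=F)
    fit = lm(bodyfat~.,data=Bodyfat[-sam,])
    pred = predict(fit,newdata=Bodyfat[sam,])
    mpse[b] = mean((Bodyfat$bodyfat[sam]-pred)^2)
  }
> summary(mpse)    # the spread of these estimates is the point being made above
> hist(mpse,xlab="Split-sample MPSE",main="")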
15 Here is a slight variation on the code that will produce the results. > set.seed(1) > test = sample(n,m) > Bodyfat.ols = lm(bodyfat~.,data=Bodyfat,subset=-test) > mean((bodyfat-predict(Bodyfat.ols,Bodyfat))[test]^2) [1] 14.20143 > set.seed(1111) > test = sample(n,m) > Bodyfat.ols = lm(bodyfat~.,data=Bodyfat,subset=-test) > mean((bodyfat-predict(Bodyfat.ols,Bodyfat))[test]^2) [1] 22.39968 > set.seed(1000) > test = sample(n,m) > Bodyfat.ols = lm(bodyfat~.,data=Bodyfat,subset=-test) > mean((bodyfat-predict(Bodyfat.ols,Bodyfat))[test]^2) [1] 20.6782 k-Fold Cross-Validation To perform a k-fold cross-validation we first need to divide our available data into k roughly equal sample size sets of observations. We then our model using (k – 1) sets of the observations and predict the set of observations not used. This is done k times with each set being left out in turn. Typical values used in practice are k = 5 and k = 10. The function below will take an OLS model to be cross-validated using k-fold cross-validation and return the MSEP. > kfold.cv = function(fit,k=10) { sum.sqerr <- rep(0,k) y = fit$model[,1] x = fit$model[,-1] data = fit$model n = nrow(data) folds = sample(1:k,nrow(data),replace=T) for (i in 1:k) { fit2 <- lm(formula(fit),data=data[folds!=i,]) ypred <- predict(fit2,newdata=data[folds==i,]) sum.sqerr[i] <- sum((y[folds==i]-ypred)^2) } cv = sum(sum.sqerr)/n cv } > kfold.cv(Bodyfat.ols,k=10) [1] 21.02072 16 Leave-Out-One Cross-Validation (LOOCV) Using the fact the predicted value for 𝑦𝑖 when the 𝑖 𝑡ℎ case is deleted from the model is equal to 𝑒̂𝑖 𝑦𝑖 − 𝑦̂(𝑖) = = 𝑒̂(𝑖) = (𝑦𝑖 − 𝑦𝑝𝑟𝑒𝑑 (𝑖)) (1 − ℎ𝑖 ) This is also called the 𝑖 𝑡ℎ jackknife residual and the sum of these squared residuals is called the PRESS statistic, one of the first measures of prediction error. In R you can obtain the prediction errors as follows: > pred.err = resid(fit)/(1-lm.influence(fit)$hat) where fit is the OLS model we want to estimate the prediction error for. > pred.err = resid(Bodyfat.ols)/(1-lm.influence(Bodyfat.ols)$hat) > mean(pred.err^2) [1] 20.29476 Monte Carlo Cross-validation (MCCV) for an OLS Regression Model This function performs Monte Carlo Cross-validation for an arbitrary OLS model. Main argument is the fitted model from the lm()function. Optional arguments are the fraction of observations to use in the training set (default is p = .667 or approximately two-thirds of the original data) and the number of replications (default is B = 100, which is rather small actually). > ols.mccv = function(fit,p=.667,B=100) { cv <- rep(0,B) y = fit$model[,1] x = fit$model[,-1] data = fit$model n = nrow(data) for (i in 1:B) { ss <- floor(n*p) sam <- sample(1:n,ss,replace=F) fit2 <- lm(formula(fit),data=data[sam,]) ypred <- predict(fit2,newdata=x[-sam,]) cv[i] <- mean((y[-sam]-ypred)^2) } cv } 17 Here is a different version using a cleaner approach for dealing with the train/test data. 
> ols.mccv2 = function(fit,p=.667,B=100) { cv <- rep(0,B) y = fit$model[,1] x = fit$model[,-1] data = fit$model n = nrow(data) for (i in 1:B) { ss <- floor(n*p) sam <- sample(n,ss,replace=F) fit2 <- lm(formula(fit),subset=sam) ypred <- predict(fit2,data) cv[i] <- mean((y - ypred)[-sam]^2) } cv } MCCV Example: Bodyfat OLS – using dataframe with standardized X’s > Bodyfat.x = scale(Bodyfat[,-1]) > Bodyfat.scale = data.frame(bodyfat=Bodyfat$bodyfat,Bodyfat.x) > names(Bodyfat.scale) [1] "bodyfat" "age" "ankle" [12] "biceps" "weight" "height" "neck" "chest" "abdomen" "hip" "thigh" "knee" "forearm" "wrist" > bodyfat.ols = lm(bodyfat~.,data=Bodyfat.scale) note this is the full model > summary(bodyfat.ols) Call: lm(formula = bodyfat ~ ., data = Bodyfat.scale) Residuals: Min 1Q Median 3Q Max -11.1966 -2.8824 -0.1111 3.1901 9.9979 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 19.15079 0.27147 70.544 < 2e-16 *** age 0.81376 0.40570 2.006 0.04601 * weight -2.83261 1.81766 -1.558 0.12047 height -0.11466 0.46633 -0.246 0.80599 neck -1.15582 0.57264 -2.018 0.04467 * chest -0.14488 0.87021 -0.166 0.86792 abdomen 10.29781 0.97225 10.592 < 2e-16 *** hip -1.35104 1.03729 -1.302 0.19401 thigh 1.30382 0.76738 1.699 0.09061 . knee 0.03364 0.59752 0.056 0.95516 ankle 0.30150 0.37731 0.799 0.42505 biceps 0.55078 0.52117 1.057 0.29166 forearm 0.92091 0.40272 2.287 0.02309 * wrist -1.54462 0.49775 -3.103 0.00215 ** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 18 Residual standard error: 4.309 on 238 degrees of freedom Multiple R-squared: 0.7486, Adjusted R-squared: 0.7348 F-statistic: 54.5 on 13 and 238 DF, p-value: < 2.2e-16 > results = ols.mccv(bodyfat.ols) > mean(results) Avg. PSE or MPSE [1] 21.12341 > results [1] [13] [25] [37] [49] [61] [73] [85] [97] 23.05984 20.90526 17.92298 23.44665 24.07194 19.71538 20.98620 22.67341 20.77255 18.67317 25.05236 22.44382 23.61154 17.43179 18.75761 20.75209 17.84598 20.66017 16.19726 21.92504 19.99677 23.99787 27.06074 24.37032 20.18027 19.57925 19.95717 20.80143 19.57326 21.53336 19.85491 20.57398 17.28482 24.02314 21.03370 24.41063 19.03076 22.51138 19.05944 18.90459 21.50898 21.68807 14.90204 22.04792 23.04956 18.41306 23.18343 23.63645 17.12468 20.82153 23.22266 21.77679 19.51525 19.44615 23.86428 20.71451 22.13966 22.52111 23.30630 22.90814 25.96077 20.58773 14.57523 20.67572 22.57043 24.41738 21.74893 23.08865 23.53890 19.99501 22.69167 17.69072 19.62440 20.89708 20.02030 22.26177 23.21706 24.28484 18.88682 19.93048 20.08109 25.00203 19.84644 19.02380 25.40393 21.15600 19.61813 21.76474 23.17060 15.67878 22.31727 21.20719 20.72045 26.98161 21.38962 17.93562 17.34390 22.44576 19.65249 18.50420 > sum(resid(bodyfat.ols)^2)/252 RSS/n < MPSE as it should be! [1] 17.53994 > results = ols.mccv(bodyfat.ols,B=1000) > mean(results) [1] 20.95795 > bodyfat.step = step(bodyfat.ols) find the “best” OLS model using mixed selection. > results = ols.mccv(bodyfat.step,B=500) > mean(results) [1] 19.4979 Q: The MPSE is smaller for the simpler model, but is this the best we can do? 19 Bootstrap Estimate of the Mean Squared Error for Prediction The bootstrap in statistics is a method for approximating the sampling distribution of a statistic by resampling from our observed random sample. To put it simply, a bootstrap sample is a sample of size n drawn with replacement from our original sample. A bootstrap sample for regression (or classification) problems is illustrated below. 
𝐷𝑎𝑡𝑎: (𝒙1 , 𝑦1 ), (𝒙𝟐 , 𝑦2 ), … , (𝒙𝒏 , 𝑦𝑛 ) here the 𝒙′𝒊 𝑠 are the p-dimensional predictor vectors. 𝐵𝑜𝑜𝑡𝑠𝑡𝑟𝑎𝑝 𝑆𝑎𝑚𝑝𝑙𝑒: (𝒙∗𝟏 , 𝑦1∗ ), (𝒙∗𝟐 , 𝑦2∗ ), … , (𝒙∗𝒏 , 𝑦𝑛∗ ) where (𝒙∗𝒊 , 𝑦𝑖∗ ) is a random selected observation from the original data drawn with replacement. We can use the bootstrap sample to calculate any statistic of interest. This process is then repeated a large number of times (B = 500, 1000, 5000, etc.). For estimating prediction error we fit a model to our bootstrap sample and use it to predict the observations not selected in our bootstrap sample. One can show that about 63.2% of the original observations will represented in the bootstrap sample and about 36.8% of the original observations will not be selected. Thus we will almost certainly have some observations that are not represented in our bootstrap sample to serve as a “test” set, with the selected observations in our bootstrap sample serving as our “training” set. For each bootstrap sample we can predict the response for the cases Estimating the prediction error via the .632 Bootstrap Again our goal is to estimate the mean prediction squared error (MPSE or PSE for short) or mean squared error for prediction (MSEP). Another alternative to those presented above is to use the .632 bootstrap for estimating the PSE. The algorithm is given below: 1) First calculate the average squared residual (ASR) from your model ASR = 𝑅𝑆𝑆/𝑛. 2) Take B bootstrap samples drawn with replacement, i.e. we draw a sample with replacement from the numbers 1 to n and use those observations as our “new data”. 3) Fit the model to each of the B bootstrap samples, computing the 𝐴𝑆𝑅(𝑗) for predicting the observations not represented in the bootstrap sample. 𝐴𝑆𝑅(𝑗) = average squared residual for prediction in the jth bootstrap sample, j = 1,…,B. 4) Compute ASR0 = the average of the bootstrap ASR values 5) Compute the optimism (OP) = .632*(ASR0 – ASR) 6) The .632 bootstrap estimate of mean PSE = ASR + OP. 20 The bootstrap approach has been shown to be better than K-fold cross-validation in many cases. Here is an example/function of the .632 bootstrap estimate of the mean PSE again using the body fat dataset (Bodyfat). > bootols.cv = function(fit,B=100) { ASR = mean(fit$residuals^2) boot.err <- rep(0,B) y = fit$model[,1] x = fit$model[,-1] data = fit$model n = nrow(data) for (i in 1:B) { sam = sample(1:n,n,replace=T) samind = sort(unique(sam)) temp = lm(formula(fit),data=data[sam,]) ypred = predict(temp,newdata=data[-samind,]) boot.err[i] = mean((y[-samind]-ypred)^2) } ASR0 = mean(boot.err) OP = .632*(ASR0 – ASR) PSE = ASR + OP PSE } Again we perform cross-validation on the full OLS model for the body fat data. > Bodyfat.ols = lm(bodyfat~.,data=Bodyfat) > set.seed(1111) > bootols.cv(Bodyfat.ols,B=100) [1] 20.16974 > bootols.cv(Bodyfat.ols,B=100) [1] 19.87913 > bootols.cv(Bodyfat.ols,B=100) [1] 19.80591 > bootols.cv(Bodyfat.ols,B=1000) increasing the number of bootstrap samples (B = 1000) [1] 19.89335 21 More on Prediction Error and the Variance-Bias Tradeoff For any regression problem we assume that the response has the following model: 𝑌 = 𝑓(𝒙) + 𝜀 where 𝒙 = 𝑐𝑜𝑙𝑙𝑒𝑐𝑡𝑖𝑜𝑛 𝑜𝑓 𝑝 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑜𝑟𝑠 = (𝑥1 , 𝑥2 , … , 𝑥𝑝 ) and 𝑉𝑎𝑟(𝜀) = 𝜎𝜀2 . Our goal in modeling is to approximate or estimate 𝑓(𝒙) using a random sample of size n: (𝒙1 , 𝑦1 ), (𝒙𝟐 , 𝑦2 ), … , (𝒙𝒏 , 𝑦𝑛 ) here the 𝒙′𝒊 𝑠 are the p-dimensional predictor vectors. 
\[
\mathrm{PSE}(Y) = E\!\left[\left(Y-\hat f(\mathbf{x})\right)^{2}\right]
= \left[E\!\left(\hat f(\mathbf{x})\right)-f(\mathbf{x})\right]^{2}
+ E\!\left[\left(\hat f(\mathbf{x})-E\!\left(\hat f(\mathbf{x})\right)\right)^{2}\right]
+ \sigma_{\varepsilon}^{2}
= \mathrm{Bias}^{2} + \mathrm{Var}\!\left(\hat f(\mathbf{x})\right) + \text{Irreducible Error}
\]

The cross-validation methods discussed above are all acceptable ways to estimate PSE(Y), but some are certainly better than others. This is still an active area of research and there is no definitive best method for every situation. Some methods are better at estimating the variance component of the PSE while others are better at estimating the bias. Ideally we would like to use a method of cross-validation that does a reasonable job of estimating each component. In the sections to follow we will introduce alternatives to OLS, or variations of OLS, for developing models for f(x). Some of these modeling strategies have the potential to be very flexible (i.e. have small bias) but at the expense of being highly variable, i.e. have large variance, Var(f̂(x)). Balancing these two components of prediction error is critical, and cross-validation is one of the main tools we will use to strike this balance in our model development.

2 - Shrinkage Methods ("Automatic" Variable Selection Methods)

In our review of OLS we covered the classic stepwise model selection methods: forward, backward, and mixed. All three of these methods either include or exclude terms starting from an appropriate base model. Other model selection methods have been developed that are viable alternatives to these in/out strategies. These include ridge regression (an old method that has found new life), the LASSO (newer), LARS (newest), PCR, and PLS. We will discuss the idea behind each of these modeling methods in the sections below. Aside from model selection, these methods have also been used extensively in high-dimensional regression problems. A high-dimensional problem is one in which n < p or n << p. The text authors present two examples where this might be the case, but there are certainly many others.

Predicting blood pressure – rather than use standard predictors such as age, gender, and BMI, one might also collect measurements for half a million single nucleotide polymorphisms (SNPs) for inclusion in the model. Thus we might have n ≈ 300 and p ≈ 500,000!

Predicting purchasing behavior of online shoppers – using a table of 50,000 key words (coded 0/1) that potential customers might use in the process of searching for products (e.g. Amazon.com), we might try to predict their purchasing behavior. We might gather information from 5,000 randomly selected visitors to the website, in which case n ≈ 5,000 and p ≈ 50,000!

Ridge and Lasso regression models will allow us to fit models in these situations where n ≪ p, where OLS mathematically cannot!

2.1 - Ridge Regression or Regularized Regression

Ridge regression chooses parameter estimates, \(\hat\beta^{\,ridge}\), to minimize the residual sum of squares subject to a penalty on the size of the coefficients. After standardizing all potential terms in the model, the ridge coefficients minimize

\[
\hat\beta^{\,ridge} = \min_{\beta}\left\{\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{k}u_{ij}\beta_j\Big)^{2} + \lambda\sum_{j=1}^{k}\beta_j^{2}\right\}
\]

Here λ > 0 is a complexity parameter that controls the amount of shrinkage: the larger λ, the greater the amount of shrinkage. The intercept is not included in the shrinkage and will be estimated as the mean of the response. An equivalent way to write the ridge regression criterion is

\[
\hat\beta^{\,ridge} = \min_{\beta}\left\{\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{k}u_{ij}\beta_j\Big)^{2}\right\}
\quad\text{subject to}\quad \sum_{j=1}^{k}\beta_j^{2}\le s
\]

This clearly shows how the size of the parameter estimates is constrained.
Also this formulation of the problem also leads to a nice geometric interpretation of how the penalized least squares estimation works (see figure next page). Important Question: Why is it essential to standardize the terms in our model? 24 Visualization of Ridge Regression Usual OLS Estimate = (𝛽̂1 , 𝛽̂2 ) Contours of the OLS criterion 𝑟𝑖𝑑𝑔𝑒 Ridge regression estimate = (𝛽̂1 2 ∑ 𝛽𝑗2 = 𝛽12 + 𝛽22 ≤ 𝑠 𝑗=1 In matrix notation the ridge regression criterion is given by 𝑅𝑆𝑆(𝜆) = (𝑦 − 𝑈𝛽)𝑇 (𝑦 − 𝑈𝛽) + 𝜆𝛽 𝑇 𝛽 with the resulting parameter estimates being very similar to those for OLS 𝛽̂ 𝑟𝑖𝑑𝑔𝑒 = (𝑈 𝑇 𝑈 + 𝜆𝐼)−1 𝑈 𝑇 𝑦 I is the k x k identity matrix. There are several packages in R that contain functions that perform ridge regression. One we will use is lm.ridge in the package MASS. The MASS package actually contains a variety of very useful functions. MASS stands for Modern Applied Statistics in S-Plus (expensive R) by Venables & Ripley, this is an excellent reference if you are so inclined. The function call using lm.ridge is very similar to the lm() function. The other function we will use is the function ridge in the genridge package. The genridge package contains a number of plotting functions to help visualize the coefficient shrinkage that takes place by using ridge regression. Using the bodyfat dataset we will conduct a ridge regression analysis In order to fairly compare the parameter estimates obtained via ridge regression to those from OLS we will first run the OLS regression using the standardized predictors. 25 𝑟𝑖𝑑𝑔𝑒 , 𝛽̂2 ) > bodyfat.scaled = lm(bodyfat~.,data=Bodyfat.scale) > summary(bodyfat.scaled) Call: lm(formula = bodyfat ~ ., data = Bodyfat.scale) Residuals: Min 1Q -11.1966 -2.8824 Median -0.1111 3Q 3.1901 Max 9.9979 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 19.15079 0.27147 70.544 < 2e-16 *** age 0.81376 0.40570 2.006 0.04601 * weight -2.83261 1.81766 -1.558 0.12047 height -0.11466 0.46633 -0.246 0.80599 neck -1.15582 0.57264 -2.018 0.04467 * chest -0.14488 0.87021 -0.166 0.86792 abdomen 10.29781 0.97225 10.592 < 2e-16 *** hip -1.35104 1.03729 -1.302 0.19401 thigh 1.30382 0.76738 1.699 0.09061 . knee 0.03364 0.59752 0.056 0.95516 ankle 0.30150 0.37731 0.799 0.42505 biceps 0.55078 0.52117 1.057 0.29166 forearm 0.92091 0.40272 2.287 0.02309 * wrist -1.54462 0.49775 -3.103 0.00215 ** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 4.309 on 238 degrees of freedom Multiple R-squared: 0.7486, Adjusted R-squared: 0.7348 F-statistic: 54.5 on 13 and 238 DF, p-value: < 2.2e-16 > mean(bodyfat) [1] 19.15079 To run ridge regression we first need to choose an optimal value for the penalty parameter . The size of reasonable values varies vastly from one ridge model to the next so using some form of automated selection method like cross-validation to help find one is a good idea. Another approach is use the effective degrees of freedom of the model which is given by the trace (sum of the diagonal elements) of the matrix 𝑑𝑓(𝜆) = 𝑡𝑟[𝑈(𝑈 𝑇 𝑈 + 𝜆𝐼)−1 𝑈 𝑇 ] 26 which as we can see is a function of . Note that when = 0, i.e. OLS, this matrix is the hat matrix whose trace is always k. To fit ridge models and choose an appropriate we will use the function lm.ridge from the MASS package. > args(lm.ridge) function (formula, data, subset, na.action, lambda = 0, model = FALSE, x = FALSE, y = FALSE, contrasts = NULL, ...) The args command is an easy way to see what arguments a function takes to run. 
Of course some functions are quite complex so using the command >?lm.ridge will bring up the help file with additional details on the arguments and generally simple examples of the functions use. We first we will use a wide range of values and let the built-in optimal selection algorithms choose good candidates. > bodyfat.ridge = lm.ridge(bodyfat~.,data=Bodyfat.scale, lambda=seq(0,1000,.1)) > select(bodyfat.ridge) modified HKB estimator is 1.664046 modified L-W estimator is 3.91223 smallest value of GCV at 1.1 cross-validation choice for Using the ridge function from the genridge package along with some different plotting features we can see the shrinkage in the parameter estimates. > bodyfat.ridge2 = ridge(bodyfat,bodyfat.Xs, lambda=seq(0,1000,.1)) > traceplot(bodyfat.ridge2) 27 > traceplot(bodyfat.ridge2,X=”df”) We can narrow the range on the choices to take a closer look at the optimal shrinkage parameter values. > bodyfat.ridge3 = ridge(bodyfat,bodyfat.Xs, lambda=seq(0,4,.001)) > traceplot(bodyfat.ridge3) 28 > traceplot(bodyfat.ridge3,X=”df”) > > > > bodyfat.xs = Bodyfat.scale[,-1] bodyfat.y = Bodyfat.scale[,1] bodyfat.ridge = ridge(bodyfat.y,bodyfat.xs,lambda=seq(0,10,2)) pairs(bodyfat.ridge) This plot shows the shrinkage in the estimated coefficients occurring as lambda increases from 0 to 10 by increments of 2. Most of the shrinkage occurs in the first 3 terms: age, weight, and height. 29 > plot3d(bodyfat.ridge,variables=1:3) A 3-D look at the shrinkage of the coefficients of age, weight, and height. Fit ridge regression model using the HKB optimal value for > bodyfat.ridge4 = lm.ridge(bodyfat~.,data=Bodyfat.scale,lambda=1.66) > attributes(bodyfat.ridge4) $names [1] "coef" "scales" "Inter" "lambda" "ym" "xm" "GCV" [8] "kHKB" "kLW" $class [1] "ridgelm" Compare the OLS coefficients to the ridge coefficients side-by-side. > cbind(coef(bodyfat.scaled),coef(bodyfat.ridge4)) [,1] [,2] (Intercept) 19.15079365 19.150793651 age 0.81375776 0.941990017 weight -2.83261161 -1.944588412 height -0.11466232 -0.313666216 neck -1.15582043 -1.182543415 chest -0.14487500 -0.009673795 abdomen 10.29780784 9.416114940 hip -1.35104126 -1.197685531 thigh 1.30382219 1.227323244 knee 0.03363573 0.027303926 ankle 0.30149592 0.235719800 biceps 0.55078084 0.461889816 forearm 0.92090523 0.891302127 wrist -1.54461619 -1.592169696 The decreases in the parameter estimates for most notably abdomen and weight allow for nominal increases in some of the parameter estimates for the other predictors. 30 Unfortunately the ridge regression routines in these packages do not allow for easy extraction of the fitted values and residuals from the fit. It is not hard to write to a simple function that will return the fitted values from a lm.ridge fit. ridgefitted = function(fit,xmat) { p = length(coef(fit)) fitted = coef(fit)[1] + xmat%*%coef(fit)[2:p] fitted } > ridge4fit = ridgefitted(bodyfat.ridge4,bodyfat.Xs) > plot(bodyfat,ridge4fit,xlab="Bodyfat",ylab="Fitted Values from Ridge Regression") > ridge4resid = bodyfat - ridge4fit > plot(ridge4fit,ridge4resid,xlab="Fitted Values",ylab="Ridge Residuals") 31 Ridge Regression using glmnet() (Friedman, Hastie, Tibshirani 2013) The glmnet package contains the function glmnet()which can be used to fit both the ridge regression and the Lasso model discussed in the next section. This function has a natural predict() function so obtaining fitted values and making predictions is easier than in the functions used above. We again return to the body fat example. 
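As a quick illustration of that predict() interface (a sketch, not the text's worked example), fitted values and residuals for a ridge fit of the body fat data can be pulled directly from a glmnet object. The value λ = 1.66 is simply reused from the HKB choice above for illustration; glmnet() parameterizes and scales its penalty differently than lm.ridge(), so the two fits are not numerically identical.

> library(glmnet)
> Xmat = model.matrix(bodyfat~.,data=Bodyfat)[,-1]
> yvec = Bodyfat$bodyfat
> rfit = glmnet(Xmat,yvec,alpha=0,lambda=1.66)   # alpha = 0 gives ridge
> fitted.ridge = predict(rfit,newx=Xmat)         # fitted values
> resid.ridge = yvec - fitted.ridge              # residuals
> plot(fitted.ridge,resid.ridge,xlab="Fitted Values",ylab="Ridge Residuals")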
The author’s also present another example of ridge regression in Lab 2 of Chapter 6 beginning on pg. 251 using data on baseball hitters and their salaries. The function glmnet()does not use standard formula conventions for developing models. Instead we form the a model matrix (X) that contains the predictors/terms as columns and the response vector 𝑦, and use them as arguments to the function. The columns of X must be numeric, so any categorical variables will need to converted to dummy variables 1st. This is easily achieved by using the model.matrix()function. For this example we will use a driver seat position data set found in the faraway package from CRAN. Response is a numeric measurement of their hip position when sitting in the driver seat. > library(faraway) you need to install it first! > names(seatpos) [1] "Age" "Weight" [9] "hipcenter" "HtShoes" "Ht" "Seated" "Arm" "Thigh" "Leg" > summary(seatpos) Age Min. :19.00 1st Qu.:22.25 Median :30.00 Mean :35.26 3rd Qu.:46.75 Max. :72.00 Thigh Min. :31.00 1st Qu.:35.73 Median :38.55 Mean :38.66 3rd Qu.:41.30 Max. :45.50 Weight Min. :100.0 1st Qu.:131.8 Median :153.5 Mean :155.6 3rd Qu.:174.0 Max. :293.0 Leg Min. :30.20 1st Qu.:33.80 Median :36.30 Mean :36.26 3rd Qu.:38.33 Max. :43.10 HtShoes Min. :152.8 1st Qu.:165.7 Median :171.9 Mean :171.4 3rd Qu.:177.6 Max. :201.2 hipcenter Min. :-279.15 1st Qu.:-203.09 Median :-174.84 Mean :-164.88 3rd Qu.:-119.92 Max. : -30.95 Ht Min. :150.2 1st Qu.:163.6 Median :169.5 Mean :169.1 3rd Qu.:175.7 Max. :198.4 Seated Min. : 79.40 1st Qu.: 85.20 Median : 89.40 Mean : 88.95 3rd Qu.: 91.62 Max. :101.60 Arm Min. :26.00 1st Qu.:29.50 Median :32.00 Mean :32.22 3rd Qu.:34.48 Max. :39.60 32 > pairs.plus(seatpos) > hip.ols = lm(hipcenter~.,data=seatpos) > summary(hip.ols) Call: lm(formula = hipcenter ~ ., data = seatpos) Residuals: Min 1Q -73.827 -22.833 Median -3.678 3Q 25.017 Max 62.337 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 436.43213 166.57162 2.620 0.0138 * Age 0.77572 0.57033 1.360 0.1843 Weight 0.02631 0.33097 0.080 0.9372 HtShoes -2.69241 9.75304 -0.276 0.7845 Ht 0.60134 10.12987 0.059 0.9531 Seated 0.53375 3.76189 0.142 0.8882 Arm -1.32807 3.90020 -0.341 0.7359 Thigh -1.14312 2.66002 -0.430 0.6706 Leg -6.43905 4.71386 -1.366 0.1824 --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 37.72 on 29 degrees of freedom Multiple R-squared: 0.6866, Adjusted R-squared: 0.6001 F-statistic: 7.94 on 8 and 29 DF, p-value: 1.306e-05 > attach(seatpos) > VIF(hip.ols) Variance Inflation Factor Table Age Weight HtShoes Ht Seated Variable VIF Rsquared Age 1.997931 0.4994823 Weight 3.647030 0.7258043 HtShoes 307.429378 0.9967472 Ht 333.137832 0.9969982 Seated 8.951054 0.8882813 33 Arm Thigh Leg Arm Thigh Leg 4.496368 0.7775983 2.762886 0.6380596 6.694291 0.8506190 As stated above it is imperative when performing ridge regression (or any other regularized regression method) that we scale the terms to have mean 0 and variance 1. We will form a new data frame in R containing the seat position data with the predictors scaled. > X = model.matrix(hipcenter~.,data=seatpos)[,-1] > X = scale(X) > summary(X) Age Min. :-1.0582 1st Qu.:-0.8467 Median :-0.3425 Mean : 0.0000 3rd Qu.: 0.7474 Max. : 2.3904 Arm Min. :-1.8436 1st Qu.:-0.8055 Median :-0.0640 Mean : 0.0000 3rd Qu.: 0.6701 Max. : 2.1902 Weight Min. :-1.55477 1st Qu.:-0.66743 Median :-0.05957 Mean : 0.00000 3rd Qu.: 0.51335 Max. : 3.83913 Thigh Min. 
:-1.97556 1st Qu.:-0.75620 Median :-0.02716 Mean : 0.00000 3rd Qu.: 0.68252 Max. : 1.76639 HtShoes Min. :-1.66748 1st Qu.:-0.50810 Median : 0.05028 Mean : 0.00000 3rd Qu.: 0.55484 Max. : 2.67401 Leg Min. :-1.78135 1st Qu.:-0.72367 Median : 0.01082 Mean : 0.00000 3rd Qu.: 0.60577 Max. : 2.00866 Ht Min. :-1.69012 1st Qu.:-0.49307 Median : 0.03721 Mean : 0.00000 3rd Qu.: 0.59434 Max. : 2.62373 Seated Min. :-1.93695 1st Qu.:-0.76091 Median : 0.09071 Mean : 0.00000 3rd Qu.: 0.54187 Max. : 2.56446 > var(X) Age Weight HtShoes Ht Seated Arm Thigh Leg Age 1.00000000 0.08068523 -0.07929694 -0.09012812 -0.17020403 0.35951115 0.09128584 -0.04233121 Weight HtShoes Ht Seated Arm Thigh Leg 0.08068523 -0.07929694 -0.09012812 -0.1702040 0.3595111 0.09128584 -0.04233121 1.00000000 0.82817733 0.82852568 0.7756271 0.6975524 0.57261442 0.78425706 0.82817733 1.00000000 0.99814750 0.9296751 0.7519530 0.72486225 0.90843341 0.82852568 0.99814750 1.00000000 0.9282281 0.7521416 0.73496041 0.90975238 0.77562705 0.92967507 0.92822805 1.0000000 0.6251964 0.60709067 0.81191429 0.69755240 0.75195305 0.75214156 0.6251964 1.0000000 0.67109849 0.75381405 0.57261442 0.72486225 0.73496041 0.6070907 0.6710985 1.00000000 0.64954120 0.78425706 0.90843341 0.90975238 0.8119143 0.7538140 0.64954120 1.00000000 > seatpos.scale = data.frame(hip=seatpos$hipcenter,X) > names(seatpos.scale) [1] "hip" "Age" "Weight" "HtShoes" "Ht" "Seated" "Arm" "Thigh" "Leg" > hip.ols = lm(hip~.,data=seatpos.scale) > summary(hip.ols) Call: lm(formula = hip ~ ., data = seatpos.scale) Residuals: Min 1Q -73.827 -22.833 Median -3.678 3Q 25.017 Max 62.337 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -164.8849 6.1190 -26.946 <2e-16 *** Age 11.9218 8.7653 1.360 0.184 Weight 0.9415 11.8425 0.080 0.937 HtShoes -30.0157 108.7294 -0.276 0.784 Ht 6.7190 113.1843 0.059 0.953 Seated 2.6324 18.5529 0.142 0.888 Arm -4.4775 13.1494 -0.341 0.736 Thigh -4.4296 10.3076 -0.430 0.671 Leg -21.9165 16.0445 -1.366 0.182 --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 37.72 on 29 degrees of freedom Multiple R-squared: 0.6866, Adjusted R-squared: 0.6001 F-statistic: 7.94 on 8 and 29 DF, p-value: 1.306e-05 34 > attach(seatpos.scale) > VIF(hip.ols) Variance Inflation Factor Table Age Weight HtShoes Ht Seated Arm Thigh Leg Variable VIF Rsquared Age 1.997931 0.4994823 Weight 3.647030 0.7258043 HtShoes 307.429378 0.9967472 Ht 333.137832 0.9969982 Seated 8.951054 0.8882813 Arm 4.496368 0.7775983 Thigh 2.762886 0.6380596 Leg 6.694291 0.8506190 > detach(seatpos.scale) Rescaling the X’s does not change the model performance in any way. The p-values, R2, RSS, VIF’s, etc. are all the same. The only changes are the estimated regression coefficients. We now consider fitting a ridge regression model to these data. 
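Before calling glmnet(), the matrix formula from the previous section can be applied directly as a sanity check. A minimal sketch (λ = 5 is an arbitrary illustrative value; the predictors in seatpos.scale are already standardized, and the intercept is handled by centering the response). lm.ridge() rescales the predictors internally (n versus n − 1 divisor in the variance), so expect small numerical differences rather than exact agreement.

> library(MASS)
> U = as.matrix(seatpos.scale[,-1])                  # standardized predictors
> yc = seatpos.scale$hip - mean(seatpos.scale$hip)   # centered response
> lam = 5
> solve(t(U)%*%U + lam*diag(ncol(U)), t(U)%*%yc)     # closed-form ridge coefficients
> coef(lm.ridge(hip~.,data=seatpos.scale,lambda=5))  # compare slopes; lm.ridge also reports an intercept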
> X = model.matrix(hip~.,data=seatpos.scale)[,-1] > y = seatpos.scale$hip > library(glmnet) > grid = 10^seq(10,-2,length=100) <- set up a wide range of values > grid [1] [8] [15] [22] [29] [36] [43] [50] [57] [64] [71] [78] [85] [92] 1.000000e+10 1.417474e+09 2.009233e+08 2.848036e+07 4.037017e+06 5.722368e+05 8.111308e+04 1.149757e+04 1.629751e+03 2.310130e+02 3.274549e+01 4.641589e+00 6.579332e-01 9.326033e-02 7.564633e+09 1.072267e+09 1.519911e+08 2.154435e+07 3.053856e+06 4.328761e+05 6.135907e+04 8.697490e+03 1.232847e+03 1.747528e+02 2.477076e+01 3.511192e+00 4.977024e-01 7.054802e-02 5.722368e+09 8.111308e+08 1.149757e+08 1.629751e+07 2.310130e+06 3.274549e+05 4.641589e+04 6.579332e+03 9.326033e+02 1.321941e+02 1.873817e+01 2.656088e+00 3.764936e-01 5.336699e-02 4.328761e+09 6.135907e+08 8.697490e+07 1.232847e+07 1.747528e+06 2.477076e+05 3.511192e+04 4.977024e+03 7.054802e+02 1.000000e+02 1.417474e+01 2.009233e+00 2.848036e-01 4.037017e-02 3.274549e+09 4.641589e+08 6.579332e+07 9.326033e+06 1.321941e+06 1.873817e+05 2.656088e+04 3.764936e+03 5.336699e+02 7.564633e+01 1.072267e+01 1.519911e+00 2.154435e-01 3.053856e-02 2.477076e+09 3.511192e+08 4.977024e+07 7.054802e+06 1.000000e+06 1.417474e+05 2.009233e+04 2.848036e+03 4.037017e+02 5.722368e+01 8.111308e+00 1.149757e+00 1.629751e-01 2.310130e-02 1.873817e+09 2.656088e+08 3.764936e+07 5.336699e+06 7.564633e+05 1.072267e+05 1.519911e+04 2.154435e+03 3.053856e+02 4.328761e+01 6.135907e+00 8.697490e-01 1.232847e-01 1.747528e-02 > ridge.mod = glmnet(X,y,alpha=0,lambda=grid) alpha = 0 for ridge alpha = 1 for Lasso > dim(coef(ridge.mod)) has 100 columns of parameter estimates, one for [1] 9 100 each lambda in our sequence. > coef(ridge.mod)[,1] (Intercept) Age Weight HtShoes Ht Seated Arm -1.648849e+02 7.202949e-08 -2.248007e-07 -2.796599e-07 -2.804783e-07 -2.567201e-07 -2.054084e-07 Thigh Leg -2.075522e-07 -2.763501e-07 When lambda is very large we see that the parameter estimates are near 0 and the intercept estimate is approximately equal to the mean of the response (𝑦̅). > mean(y) [1] -164.8849 35 > coef(ridge.mod)[,100] (Intercept) Age -164.8848684 11.7788069 Thigh Leg -4.2792243 -21.9175012 Weight HtShoes 0.9953047 -23.0463404 Ht -0.2996308 Seated 2.4814234 Arm -4.4305306 When lambda is near 0, we see that the coefficients do not differ much from the OLS regression parameter estimates which are shown below. > coef(hip.ols) (Intercept) -164.8848684 Thigh -4.4295690 Age 11.9218052 Leg -21.9165049 Weight 0.9415132 HtShoes -30.0156578 Ht 6.7190129 Seated 2.6323517 Arm -4.4775359 We can see this shrinkage of the coefficients graphically by plotting the results. > plot(ridge.mod,xvar="lambda") What value of should we use to obtain the “best” ridge regression model? In the code below we form a train data set consisting of 75% of the original data set and use the remaining cases as test cases. We then look at the mean PSE for various choices of by setting the parameter s = in the glmnet()function call. 
> train = sample(n,floor(n*p)) > train [1] 15 23 4 28 20 2 9 35 21 25 22 31 34 18 32 7 16 27 26 36 29 5 8 19 12 13 17 11 > test = (-train) > ridge.mod = glmnet(X[train,],y[train],alpha=0,lambda=grid) > ridge.pred = predict(ridge.mod,s=1000,newx=X[test,]) > PSE = mean((ridge.pred-y[test])^2) > PSE [1] 3226.455 > ridge.pred = predict(ridge.mod,s=100,newx=X[test,]) > PSE = mean((ridge.pred-y[test])^2) > PSE [1] 1556.286 36 > ridge.pred = predict(ridge.mod,s=10,newx=X[test,]) > PSE = mean((ridge.pred-y[test])^2) > PSE [1] 1334.643 > ridge.pred = predict(ridge.mod,s=5,newx=X[test,]) > PSE = mean((ridge.pred-y[test])^2) > PSE [1] 1336.778 > ridge.pred = predict(ridge.mod,s=1,newx=X[test,]) > PSE = mean((ridge.pred-y[test])^2) > PSE [1] 1353.926 It appears a 𝜆 value between 5 and 10 appears optimal for this particular train/test set combination. What if we use different train/test sets? > set.seed(1) > train = sample(n,floor(n*p)) > ridge.mod glmnet(X[train,],y[train],lambda=grid,alpha=0) > ridge.pred = predict(ridge.mod,s=1000,newx=X[test,]) > PSE = mean((ridge.pred - y[test])^2) > PSE [1] 2983.198 > ridge.pred = predict(ridge.mod,s=100,newx=X[test,]) > PSE = mean((ridge.pred - y[test])^2) > PSE [1] 1321.951 > ridge.pred = predict(ridge.mod,s=50,newx=X[test,]) > PSE = mean((ridge.pred - y[test])^2) > PSE [1] 1304.09 > ridge.pred = predict(ridge.mod,s=25,newx=X[test,]) > PSE = mean((ridge.pred - y[test])^2) > PSE [1] 1375.791 > ridge.pred = predict(ridge.mod,s=10,newx=X[test,]) > PSE = mean((ridge.pred - y[test])^2) > PSE [1] 1527.045 Now it appears that the “optimal” is somewhere between 25 and 50? We can use cross-validation to choose an “optimal” for prediction purposes. The function cv.glmnet()uses 10-fold cross-validation to find an optimal value for . > cv.out = cv.glmnet(X[train,],y[train],alpha=0) Warning message: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold This dataset is too small to use 10-fold cross-validation on as the sample size n = 38! 37 > plot(cv.out) > cv.out$lambda.min [1] 36.58695 > bestlam = cv.out$lambda.min > ridge.best = glmnet(X[train,],y[train],alpha=0,lambda=bestlam) > ridge.pred = predict(ridge.best,newx=X[test,]) > PSE = mean((ridge.pred-y[test])^2) > PSE [1] 1328.86 > coef(ridge.best) 9 x 1 sparse Matrix of class "dgCMatrix" s0 (Intercept) -165.669959 Age 9.954727 Weight -1.937062 HtShoes -10.000879 Ht -10.249189 Seated -4.228238 Arm -4.404729 Thigh -4.954140 Leg -10.749589 > coef(hip.ols) (Intercept) Age -164.8848684 11.9218052 Thigh Leg -4.4295690 -21.9165049 Weight HtShoes 0.9415132 -30.0156578 Ht 6.7190129 Seated 2.6323517 Arm -4.4775359 38 2.2 - The Lasso The lasso is another shrinkage method like ridge, but uses an L1-norm based penalty. The parameter estimates are chosen according to the following 𝑛 𝑘 𝑘 𝛽̂ 𝑙𝑎𝑠𝑠𝑜 = min {∑(𝑦𝑖 − 𝛽𝑜 − ∑ 𝑢𝑖𝑗 𝛽𝑗 )2 } 𝑠𝑢𝑏𝑗𝑒𝑐𝑡 𝑡𝑜 ∑|𝛽𝑗 | ≤ 𝑡 𝛽 𝑖=1 𝑗=1 𝑗=1 Here t > 0 is the complexity parameter that controls the amount of shrinkage, the smaller t the greater the amount of shrinkage. As with ridge regression, the intercept is not included in the shrinkage and will be estimated as the mean of the response. If t is chosen larger than 𝑡𝑜 = ∑𝑘𝑗=1|𝛽̂𝑗𝑙𝑠 | then there will be no shrinkage and the lasso estimates will be the same as the OLS estimates. If 𝑡 = 𝑡𝑜 /2 then the OLS estimates will 𝑙𝑎𝑠𝑠𝑜 be shrunk by about 50%, however this is not to say that 𝛽̂𝑗 = 𝛽̂𝑗𝑙𝑠 /2 . The shrinkage can result in some parameters being zeroed, essentially dropping the associated predictor from the model as the figure below shows. 
Here the lasso estimate for 𝛽̂1𝑙𝑎𝑠𝑠𝑜 = 0. Usual OLS Estimate = (𝛽̂1 , 𝛽̂2 ) Contours of the OLS criterion Lasso regression estimate = (𝛽̂1𝑙𝑎𝑠𝑠𝑜 , 𝛽̂2𝑙𝑎𝑠𝑠𝑜 ) 2 ∑|𝛽𝑗 | = |𝛽1 | + |𝛽2 | ≤ 𝑠 𝑗=1 39 We return again to the body fat example and look at the use of the Lasso to build a model for the body fat. > > > > > > > > > > X = model.matrix(bodyfat~.,data=Bodyfat)[,-1] y = Bodyfat$bodyfat n = nrow(X) p = .667 set.seed(1) train = sample(n,floor(n*p)) test = (-train) grid = 10^seq(10,-2,length=100) lasso.mod = glmnet(X[train,],y[train],alpha=1,lambda=grid) plot(lasso.mod) > plot(lasso.mod,xvar=”lambda”) 40 > set.seed(1) > cv.out = cv.glmnet(X[train,],y[train],alpha=1) > plot(cv.out) > bestlam.lasso = cv.out$lambda.min > bestlam.lasso [1] 0.1533247 Use test set to obtain and estimate of the PSE for the Lasso =========================================================================================== > lasso.mod = glmnet(X[train,],y[train],alpha=1,lambda=bestlam.lasso) > lasso.pred = predict(lasso.mod,newx=X[test,]) > PSE = mean((lasso.pred-y[test])^2) > PSE [1] 23.72193 Use the same 10-fold cross-validation to estimate optimal for ridge regression. Then estimate the PSE using the same test data as for the Lasso. Compare the mean PSE values. ============================================================================================ > set.seed(1) > cv.out = cv.glmnet(X[train,],y[train],alpha=0) > bestlam.ridge = cv.out$lambda.min > bestlam.ridge [1] 0.6335397 > ridge.mod = glmnet(X[train,],y[train],alpha=0,lambda=bestlam.ridge) > ridge.pred = predict(ridge.mod,newx=X[test,]) > PSE = mean((ridge.pred - y[test])^2) > PSE [1] 26.11665 41 Comparing the coefficient estimates from Lasso, ridge regression, and OLS. Also compare PSE for test data. =============================================================================================== > coef(lasso.mod) s0 (Intercept) 1.27888249 age 0.08947089 weight . height -0.28803077 neck -0.39922361 chest . abdomen 0.67740803 hip . thigh . knee . ankle . biceps . 
forearm 0.34448133 wrist -1.27946216 > coef(ridge.mod) (Intercept) age weight height neck chest abdomen hip thigh knee ankle biceps forearm wrist s0 -4.956145707 0.130174182 -0.005247158 -0.310172813 -0.452885891 0.159678718 0.467277929 0.003963329 0.189565205 0.057918646 0.043846187 0.022254664 0.348491036 -1.470360498 > temp = data.frame(bodyfat = y[train],X[train,]) > head(temp) 67 94 144 227 51 222 bodyfat age weight height neck chest abdomen hip thigh knee ankle biceps forearm wrist 21.5 54 151.50 70.75 35.6 90.0 83.9 93.9 55.0 36.1 21.7 29.6 27.4 17.4 24.9 46 192.50 71.75 38.0 106.6 97.5 100.6 58.9 40.5 24.5 33.3 29.6 19.1 9.4 23 159.75 72.25 35.5 92.1 77.1 93.9 56.1 36.1 22.7 30.5 27.2 18.2 14.8 55 169.50 68.25 37.2 101.7 91.1 97.1 56.6 38.5 22.6 33.4 29.3 18.8 10.2 47 158.25 72.25 34.9 90.2 86.7 98.3 52.6 37.2 22.4 26.0 25.8 17.3 26.0 54 230.00 72.25 42.5 119.9 110.4 105.5 64.2 42.7 27.0 38.4 32.0 19.6 > ols.mod = lm(bodyfat~.,data=temp) > coef(ols.mod) (Intercept) -40.65178764 knee -0.03849608 age 0.09175572 ankle 0.36585984 weight -0.15569221 biceps 0.11606918 height 0.06515098 forearm 0.44247339 neck -0.41393595 wrist -1.54993981 chest 0.10173785 abdomen 0.92607342 abdomen hip 0.93852739 -0.24119508 thigh 0.38608812 hip -0.18562568 thigh 0.37387418 > ols.step = step(ols.mod) > coef(ols.step) (Intercept) -24.91558645 age weight neck 0.09187304 -0.10466396 -0.46132959 forearm wrist 0.51997961 -1.33498663 > ols.pred = predict(ols.mod,newdata=Bodyfat[test,]) > PSE = mean((ols.pred-y[test])^2) [1] 23.39602 > ols.pred2 = predict(ols.step,newdata=Bodyfat[test,]) > PSE = mean((ols.pred2-y[test])^2) [1] 22.8308 42 For these data we see the three approaches differ in their results. Lasso is zeroes out some coefficients, thus does completely eliminate some terms from the model. Ridge will shrink coefficients down to very near zero, effectively eliminating them, but technically will zero none of them. Stepwise selection in OLS is either in or out, so some get zeroed some don’t, however there is no shrinkage of the estimated coefficients. A good question to ask would be “how do these methods cross-validate for making future predictions?” We can use cross-validation methods to compare these competing models via estimates of the PSE. 
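As a quicker first pass before the Monte Carlo comparison below (a sketch, assuming X and y are the full body-fat model matrix and response created above), cv.glmnet() already reports a 10-fold CV estimate of the PSE at each λ, so ridge and lasso can be compared directly:

> set.seed(1)
> cv.ridge = cv.glmnet(X,y,alpha=0)
> cv.lasso = cv.glmnet(X,y,alpha=1)
> min(cv.ridge$cvm)    # 10-fold CV estimate of PSE at the best ridge lambda
> min(cv.lasso$cvm)    # 10-fold CV estimate of PSE at the best lasso lambda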
Monte Carlo Cross-Validation of OLS Regression Models > ols.mccv = function(fit,p=.667,B=100) { cv <- rep(0,B) y = fit$model[,1] x = fit$model[,-1] data = fit$model n = nrow(data) for (i in 1:B) { ss <- floor(n*p) sam <- sample(1:n,ss,replace=F) fit2 <- lm(formula(fit),data=data[sam,]) ypred <- predict(fit2,newdata=x[-sam,]) cv[i] <- mean((y[-sam]-ypred)^2) } cv } Monte Carlo Cross-Validation of Ridge and Lasso Regression > glmnet.mccv = function(X,y,alpha=0,lambda=1,p=.667,B=100) { cv <- rep(0,B) n = nrow(X) for (i in 1:B) { ss <- floor(n*p) sam <- sample(n,ss,replace=F) fit <- glmnet(X[sam,],y[sam],lambda=lambda) ypred <- predict(fit,newx=X[-sam,]) cv[i] <- mean((y[-sam]-ypred)^2) } cv } > set.seed(1) > rr.cv = glmnet.mccv(X,y,alpha=0,lambda=.634) > Statplot(rr.cv) > mean(rr.cv) > sd(rr.cv) [1] 21.65482 [1] 2.847533 43 > set.seed(1) > lass.cv = glmnet.mccv(X,y,alpha=1,lambda=.153) > mean(lass.cv) > sd(lass.cv) [1] 20.30297 [1] 2.601356 > ols.scale = lm(bodyfat~.,data=Bodyfat.scale) > ols.results = ols.mccv(ols.scale) > mean(ols.results) [1] 20.68592 > sd(ols.results) [1] 2.737272 > Statplot(ols.results) > ols.scalestep = step(ols.scale) > ols.results = ols.mccv(ols.scalestep) > mean(ols.results) [1] 19.72026 > sd(ols.results) [1] 2.185153 > Statplot(ols.results) 44 1.3 - Least Angle Regression (LAR) – FYI only! The lars function in the library of the same name will perform least angle regression which is another shrinkage method for fitting regression models. lars(x, y, type = c("lasso", "lar", "forward.stagewise", "stepwise")) lar = Least Angle Regression (LAR) – see algorithm and diagram next page forward.stagewise = Forward Stagewise selection stepwise = forward stepwise selection (classic method) For Lasso regression use the glmnet function versus the lars implementation. http://www-stat.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf LAR Algorithm 45 As we can see the LAR and Forward Stagewise selection methods produce very similar models to the lasso for these data as seen below. Good advice would be to try them all, plot the results, and examine them for any large differences. The usability of the results from lars is an issue. Extracting fitted values, residuals and making predictions using lars is very cumbersome, but definitely doable. > > > > > X = model.matrix(bodyfat~.,data=Bodyfat)[,-1] y = Bodyfat$bodyfat bodyfat.lars = lars(X,y,type="lar") plot(bodyfat.lars) summary(bodyfat.lars) LARS/LAR Call: lars(x = X, y = y, type = "lar") Df Rss Cp 0 1 17579.0 696.547 1 2 6348.3 93.824 2 3 5999.8 77.062 3 4 5645.1 59.963 4 5 5037.4 29.241 5 6 4998.8 29.164 6 7 4684.9 14.262 7 8 4678.3 15.905 8 9 4658.4 16.831 9 10 4644.8 18.099 10 11 4516.3 13.183 11 12 4427.8 10.416 12 13 4421.5 12.079 13 14 4420.1 14.000 > fit = predict.lars(bodyfat.lars,X,s=11) > fit = predict.lars(bodyfat.lars,X,s=6) 46 1.5 - Principal Component Regression (PCR) & Partial Least Squares (PLS) Multivariate regression methods like principal component regression (PCR) and partial least squares regression (PLSR) enjoy large popularity in a wide range of fields, including the natural sciences. The main reason is that they have been designed to confront the situation where there are many, generally correlated, predictor variables, and relatively few samples – a situation that is common, especially in chemistry where developments in spectroscopy allow for obtaining hundreds of spectra readings on single sample. In these situations n << p, thus some form of dimension reduction in the predictor space is necessary. 
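A small simulated illustration of the n < p problem (hypothetical data, not from the handout): once p ≥ n, lm() cannot estimate all of the coefficients, while a penalized fit such as ridge is still defined.

> set.seed(42)
> n.sim = 20; p.sim = 50
> Xbig = matrix(rnorm(n.sim*p.sim),n.sim,p.sim)
> ybig = rnorm(n.sim)
> fit.ols = lm(ybig~Xbig)
> sum(is.na(coef(fit.ols)))                  # many coefficients cannot be estimated (NA)
> fit.ridge = glmnet(Xbig,ybig,alpha=0,lambda=1)
> dim(coef(fit.ridge))                       # all p.sim + 1 coefficients are estimated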
Principal components analysis is a dimension reduction technique in which p independent (orthogonal) linear combinations of the numeric input variables X_1, X_2, …, X_p are formed so that the first linear combination accounts for as much of the total variation in the original data as possible. The 2nd linear combination accounts for as much of the remaining variation as possible, subject to the constraint that it is orthogonal to the first linear combination, and so on. Generally the variables are all scaled to have mean 0 and variance 1 (denoted X_j^*), so the total variation in the scaled data is given by

\sum_{j=1}^{p} V(X_j^*) = p = \sum_{j=1}^{p} V(Z_j)

where

Z_1 = a_{11} X_1^* + a_{12} X_2^* + \cdots + a_{1p} X_p^*
Z_2 = a_{21} X_1^* + a_{22} X_2^* + \cdots + a_{2p} X_p^*
\vdots
Z_p = a_{p1} X_1^* + a_{p2} X_2^* + \cdots + a_{pp} X_p^*

and

Cov(Z_i, Z_j) = Corr(Z_i, Z_j) = 0 \quad \text{for } i \neq j.

The linear combinations are determined by the spectral decomposition (i.e. finding the eigenvalues and eigenvectors) of the sample correlation matrix R. The variance of the jth principal component Z_j is

V(Z_j) = \lambda_j = \text{the jth largest eigenvalue of } R

and the coefficients of the linear combination (a_{j1}, a_{j2}, …, a_{jp}) are the elements of the eigenvector corresponding to \lambda_j.

Ideally the first k principal components will account for a sizeable percentage of the total variation in these data. We can then use these k principal components, Z_1, Z_2, …, Z_k, as the predictors in the multiple regression model below:

E(Y | X_1, X_2, …, X_p) = \beta_0 + \sum_{j=1}^{k} \beta_j Z_j \quad \text{and} \quad V(Y | X_1, …, X_p) = \sigma^2

Yarn Data
These data were obtained from a calibration study of polyethylene terephthalate (PET) yarns, which are used for textile and industrial purposes. PET yarns are produced by a process of melt-spinning whose settings largely determine the final semi-crystalline structure of the yarn, which, in turn, determines its physical structure; the physical structure parameters of PET yarns are important quality parameters for the end use of the yarn. Raman near-infrared (NIR) spectroscopy has recently become an important tool in the pharmaceutical and semiconductor industries for investigating structural information on polymers; in particular, it is used to reveal information about the chemical nature, conformational order, state of the order, and orientation of polymers. Thus, Raman spectra can be used to predict the physical characteristics of polymers. In this example, we study the relationship between the overall density of a PET yarn and its NIR spectrum. The data consist of a sample of n = 21 PET yarns having known mechanical and structural properties. For each PET yarn, the Y-variable is the density (kg/m3) of the yarn, and the p = 268 X-variables (measured at 268 frequencies in the range 598 – 1900 cm-1) are selected from the NIR spectrum of that yarn. Thus n << p!! Many of the X-variables are highly correlated, as the scatterplot matrices below clearly show.

Scatterplot matrices: Variables X_1, …, X_10 and Variables X_31, …, X_40

Obviously all 268 variables contain similar information, and therefore we should be able to effectively use principal components to reduce the dimensionality of the spectral data.

Form principal components for the PET yarn data
First load the package pls, which contains the yarn data and routines for both principal component regression (PCR) and partial least squares (PLS) regression.

> library(pls)
> Yarn = yarn[1:21,]
> R = cor(Yarn$NIR)
> eigenR = eigen(R)
> attributes(eigenR)
$names
[1] "values"  "vectors"

Only the first four PC's have variance larger than a single scaled variable, i.e. there are four eigenvalues greater than 1.0.
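A quick numeric check of that claim (a sketch; the handout only states the conclusion) is to count the eigenvalues of R that exceed 1 and look at how much of the total variation the leading components capture.

# The eigenvalues of the correlation matrix are the variances of the principal
# components, and for these scaled data they sum to p = 268.
sum(eigenR$values > 1)                                # how many PCs have variance > 1
round(cumsum(eigenR$values)[1:6]/ncol(Yarn$NIR), 3)   # cumulative proportion of total variation

The cumulative proportions here should echo the "% variance explained" rows reported by the pcr() fit below.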
We will now form Z_1, Z_2, Z_3, Z_4 using the corresponding eigenvectors.

> z1 = scale(Yarn$NIR)%*%eigenR$vectors[,1]
> z2 = scale(Yarn$NIR)%*%eigenR$vectors[,2]
> z3 = scale(Yarn$NIR)%*%eigenR$vectors[,3]
> z4 = scale(Yarn$NIR)%*%eigenR$vectors[,4]

> YarnPC = data.frame(density=Yarn$density,z1,z2,z3,z4)
> yarn.pcr = lm(density~.,data=YarnPC)
> summary(yarn.pcr)

Call:
lm(formula = density ~ ., data = YarnPC)

Residuals:
   Min     1Q Median     3Q    Max 
-2.106 -0.522  0.246  0.632  1.219 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  33.6200     0.2249   149.5  < 2e-16 ***
z1           -2.3062     0.0194  -118.7  < 2e-16 ***
z2            0.6602     0.0261    25.3  2.5e-14 ***
z3            1.8044     0.0341    52.8  < 2e-16 ***
z4            0.9759     0.1434     6.8  4.2e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.03 on 16 degrees of freedom
Multiple R-squared: 0.999,     Adjusted R-squared: 0.999
F-statistic: 4.39e+03 on 4 and 16 DF,  p-value: <2e-16

> pairs.plus(YarnPC)

The marginal response plots look rather interesting (see the highlighted rectangle in the plot above). The residuals look surprisingly good considering the nonlinear relationships displayed in the marginal response plots.

Using the pcr command from the pls package

> yarn.pcr2 = pcr(density~scale(NIR),data=yarn[1:21,],ncomp=6,validation="CV")
> summary(yarn.pcr2)
Data:   X dimension: 21 268
        Y dimension: 21 1
Fit method: svdpc
Number of components considered: 6

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV           31.31    15.38    14.29    2.392    1.312    1.167   0.9461
adjCV        31.31    15.39    14.46    2.358    1.288    1.145   0.9336

TRAINING: % variance explained
         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
X          52.49    81.59    98.57    99.54    99.75    99.86
density    80.13    83.77    99.65    99.91    99.93    99.95

> loadingplot(yarn.pcr2,comps=1:4,legendpos="topright")

The plot above shows the weight assigned to each variable on each of the first four PC's. The solid line shows the weights assigned to the variables on the first principal component. Identifying important individual frequencies will be very difficult, but you can identify ranges of the spectrum that appear important for each component.

Extract fitted values from the 4-component fit

> fit = fitted(yarn.pcr2)[,,4]
> plot(Yarn$density,fit)
> predplot(yarn.pcr2,ncomp=1:6)

> corrplot(yarn.pcr2,comps=1:4)

This plot displays the correlation of each of the 268 variables with each of the first four principal components.

> YarnTest = yarn[22:28,]
> predict(yarn.pcr2,ncomp=4,newdata=YarnTest)
, , 4 comps

    density
110   50.95
22    50.97
31    31.92
41    34.77
51    30.72
61    19.93
71    19.37

> YarnTest$density
[1] 51.04 50.32 32.14 34.69 30.30 20.45 20.06

Partial Least Squares (PLS) Algorithm
While PCR focuses on the covariance structure of the X's independent of the response Y, partial least squares (PLS) looks at the covariance structure of the X's and the response Y jointly. The algorithm for PLS is shown below. To better understand the PLS algorithm, consider the simple example that follows.
Generate some data

y = rnorm(100)
y = y - mean(y)                      # center the response
x1 = rnorm(100)
x1 = (x1 - mean(x1))/sd(x1)          # standardize x1
x2 = y + x1 + rnorm(100)
x2 = (x2 - mean(x2))/sd(x2)          # standardize x2
phi1 = sum(y*x1)                     # weights = inner products of the response with each predictor
phi2 = sum(y*x2)
z1 = phi1*x1 + phi2*x2               # first PLS component
z1 = (z1 - mean(z1))/sd(z1)
th1 = lsfit(z1,y,int=F)$coef         # regress the response on z1 with no intercept
y1 = y + th1*z1                      # working response carried into the second iteration
pairs(cbind(y,x1,x2,z1,y1))

Now we do the second iteration:

x11 = x1 - sum(x1*z1)*z1/sum(z1*z1)  # deflate x1: remove the part explained by z1
x21 = x2 - sum(x2*z1)*z1/sum(z1*z1)  # deflate x2
phi1 = sum(y1*x11)
phi2 = sum(y1*x21)
z2 = phi1*x11 + phi2*x21             # second PLS component, orthogonal to z1
z2 = (z2 - mean(z2))/sd(z2)
th2 = lsfit(z2,y1,int=F)$coef
y2 = y1 + th2*z2
pairs(cbind(y,z2,y2,y1))

Ultimately the final fitted values are a linear combination of the z-components, and thus they can be expressed as

\hat{Y} = \bar{y} + \sum_{j=1}^{k} \hat{\theta}_j z_j .

Interpretation of the results is done in a similar fashion to PCR, by examining plots of the cross-validation results, variable loadings, and correlations with the original predictors. We now examine the results from PLS regression for the yarn data.

> yarn.pls = plsr(density~NIR,ncomp=10,data=Yarn,validation="CV")
> summary(yarn.pls)
Data:   X dimension: 21 268
        Y dimension: 21 1
Fit method: kernelpls
Number of components considered: 10

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps
CV           31.31    6.473    4.912    2.164   0.8847   0.6041   0.6550   0.3983   0.3244   0.2890    0.2822
adjCV        31.31    5.840    4.862    2.150   0.8552   0.5908   0.6367   0.3816   0.3115   0.2759    0.2687

TRAINING: % variance explained
         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps
X          47.07    98.58    99.50    99.72    99.87    99.98    99.98    99.99    99.99     99.99
density    98.19    98.29    99.71    99.97    99.99    99.99   100.00   100.00   100.00    100.00

> plot(RMSEP(yarn.pls),legendpos="topright")

The 4-component model is suggested by cross-validation.

> predplot(yarn.pls,ncomp=4,line=T)
> plot(yarn.pls,plottype="scores",comps=1:4)

> plot(yarn.pls,"loadings",comps=1:4,legendpos="topright")
> abline(h=0)

> predict(yarn.pls,ncomp=4,newdata=YarnTest)
, , 4 comps

    density
110   51.05
22    50.72
31    32.01
41    34.29
51    30.36
61    20.58
71    19.08

> YarnTest$density
[1] 51.04 50.32 32.14 34.69 30.30 20.45 20.06

> sum((predict(yarn.pcr2,ncomp=4,newdata=YarnTest)-YarnTest$density)^2)
[1] 1.407
> sum((predict(yarn.pls,ncomp=4,newdata=YarnTest)-YarnTest$density)^2)
[1] 1.320

The 4-component PLS model does a slightly better job of predicting the test yarn densities.
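As a small follow-up sketch (not in the original handout), the two test-set sums of squared errors above can be converted to root mean squared errors of prediction so the 4-component PCR and PLS fits are compared on the original density scale.

# Test-set RMSEP for the 4-component PCR and PLS fits (n = 7 test yarns).
pcr.sse = sum((predict(yarn.pcr2,ncomp=4,newdata=YarnTest)-YarnTest$density)^2)
pls.sse = sum((predict(yarn.pls,ncomp=4,newdata=YarnTest)-YarnTest$density)^2)
sqrt(c(PCR=pcr.sse, PLS=pls.sse)/nrow(YarnTest))   # roughly 0.45 vs. 0.43 here

Both errors are small relative to the spread of the test densities, and the ordering matches the conclusion above.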