Multiple Regression in R using LARS

1 - Multiple Regression in R using OLS
1.1 – “Review” of OLS
Load the comma-delimited file bodyfat.csv into R
> Bodyfat = read.table(file.choose(),header=T,sep=",")
Read 3528 items
> Bodyfat = Bodyfat[,-1]   ← the first column (density) is redundant
Response is in column 1, the candidate predictors are in columns 2 – 14.
> X <- Bodyfat[,2:14]
> y <- Bodyfat[,1]
> dim(X)
[1] 252 13
> length(y)
[1] 252
> pairs.plus(Bodyfat)
Examine a scatterplot matrix with the “bells and whistles”…
> bodyfat.ols = lm(bodyfat~.,data=Bodyfat)
> summary(bodyfat.ols)
Call:
lm(formula = bodyfat ~ ., data = Bodyfat)

Residuals:
     Min       1Q   Median       3Q      Max
-11.1966  -2.8824  -0.1111   3.1901   9.9979

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -21.35323   22.18616  -0.962  0.33680
age           0.06457    0.03219   2.006  0.04601 *
weight       -0.09638    0.06185  -1.558  0.12047
height       -0.04394    0.17870  -0.246  0.80599
neck         -0.47547    0.23557  -2.018  0.04467 *
chest        -0.01718    0.10322  -0.166  0.86792
abdomen       0.95500    0.09016  10.592  < 2e-16 ***
hip          -0.18859    0.14479  -1.302  0.19401
thigh         0.24835    0.14617   1.699  0.09061 .
knee          0.01395    0.24775   0.056  0.95516
ankle         0.17788    0.22262   0.799  0.42505
biceps        0.18230    0.17250   1.057  0.29166
forearm       0.45574    0.19930   2.287  0.02309 *
wrist        -1.65450    0.53316  -3.103  0.00215 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.309 on 238 degrees of freedom
Multiple R-squared: 0.7486,    Adjusted R-squared: 0.7348
F-statistic: 54.5 on 13 and 238 DF,  p-value: < 2.2e-16
Regression diagnostics using a variety of functions written by Chris Malone for his senior
capstone project while an undergraduate at WSU.
> Diagplot1(bodyfat.ols)   ← Look at Cook's Distances & Leverages
> Diagplot2(bodyfat.ols)   ← DFBETA's primarily
> Diagplot3(bodyfat.ols,dfbet=T)   ← AVP's, DFBETAS, and VIFs

> Resplot   ← Various diagnostic plots examining the residuals

> MLRdiag(bodyfat.ols)   ← Inverse response plots with case diagnostics added
> VIF(bodyfat.ols)   ← returns a table of VIFs for each predictor. This table is
                       returned by Diagplot3 as well.
Variance Inflation Factor Table
         Variable        VIF  Rsquared
age           age   2.224469 0.5504545
weight     weight  44.652515 0.9776048
height     height   2.939110 0.6597610
neck         neck   4.431923 0.7743643
chest       chest  10.234694 0.9022931
abdomen   abdomen  12.775528 0.9217253
hip           hip  14.541932 0.9312333
thigh       thigh   7.958662 0.8743507
knee         knee   4.825304 0.7927592
ankle       ankle   1.924098 0.4802760
biceps     biceps   3.670907 0.7275877
forearm   forearm   2.191933 0.5437817
wrist       wrist   3.348404 0.7013503
There is clearly evidence of collinearity suggesting that a reduced model should be
considered. Model “selection” is the focus of this handout. We will first consider using
standard stepwise selection methods – forward, backward, mixed, or potentially all
possible subsets.
1.2 – C + R Plots and CERES Plots in R
These plots are used to visualize the functional form for a predictor in a OLS multiple
regression setting. We can formulate an OLS regression with a response Y and potential
predictors 𝑋1 , 𝑋2 , … , 𝑋𝑝 as follows:
𝑌 = 𝜂𝑜 + 𝜂1 𝜏1 (𝑋1 ) + ⋯ + 𝜂𝑝 𝜏𝑝 (𝑋𝑝 ) + 𝜀
where the 𝜏𝑖 (𝑋𝑖 )′𝑠 represent the functional form of the 𝑖 𝑡ℎ predictor in the model. For
example 𝜏𝑖 (𝑋𝑖 ) = ln(𝑋𝑖 ) or 𝜏𝑖 (𝑋𝑖 ) = 𝑝𝑜𝑙𝑦𝑛𝑜𝑚𝑖𝑎𝑙 𝑜𝑓 𝑑𝑒𝑔𝑟𝑒𝑒 2 𝑖𝑛 𝑋𝑖 (i.e. add 𝑋𝑖 and
𝑋𝑖2 ) terms to the model. The model above is an example of what we call an additive
model. Later in the course we will look at the other methods for developing flexible
additive models in a regression setting.
The package car contains functions for regression that are similar to those available in
Arc, the software developed to accompany Applied Regression: Including
Computing and Graphics by Cook & Weisberg (the text from STAT 360). Although not as
interactive as Arc, the crPlots() & ceresPlots() functions in the car library
will construct C+R and CERES plots, respectively, for each term in a regression model.
As stated earlier, both C+R plots and CERES Plots are used to visualize the predictors
that might benefit from the creation of nonlinear terms based on the predictor. CERES
plots are better when there are nonlinear relationships amongst the predictors themselves.
The nonlinear relationships between the predictors can “bleed” into the C+R Plots,
resulting in an inaccurate representation of the potential terms.
Component + Residual Plots (C+R Plots)
> crPlots(bodyfat.ols)
CERES Plots (Conditional Expectation RESidual plots)
> ceresPlots(bodyfat.ols)
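If one of these plots did suggest a nonlinear functional form, a natural next step is to add the
corresponding term and refit. The snippet below is a minimal sketch of that workflow; the choice of
abdomen and a quadratic term is purely illustrative here, not a conclusion drawn from these plots.

# Hypothetical follow-up: add a quadratic term for one predictor and test whether it helps.
bodyfat.quad = lm(bodyfat ~ . + I(abdomen^2), data = Bodyfat)
anova(bodyfat.ols, bodyfat.quad)   # partial F-test for the added quadratic term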
1.3 - Standard Stepwise Selection Methods for OLS Regression
These methods seek to minimize a penalized version of the RSS (residual sum of
squares) of the regression model. The statistics used are the Akaike Information Criterion (AIC),
the Bayesian Information Criterion (BIC), the adjusted R-squared (adj-R2), and Mallow's Ck, which are
presented below:

$$AIC = n\log\left(\frac{RSS_k}{n}\right) + 2k = \frac{1}{n\hat{\sigma}^2}\left(RSS + 2k\hat{\sigma}^2\right)$$

$$C_k = \frac{RSS_k}{\hat{\sigma}^2} + 2k - n = \frac{1}{n}\left(RSS + 2k\hat{\sigma}^2\right)$$

$$BIC = \frac{1}{n}\left(RSS + \log(n)\,k\hat{\sigma}^2\right)$$

$$Adjusted\ R^2 = 1 - \frac{RSS/(n-k-1)}{SS_{Tot}/(n-1)}$$

where k = the number of parameters in the candidate model and $\hat{\sigma}^2$ = the estimated residual variance
from the "full" model. Minimizing AIC, BIC, or Ck in the case of OLS yields the "best" model
according to that criterion. In contrast, the adjusted R2 is maximized to find the "best" model.
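As a concrete illustration of these formulas, the sketch below computes AIC, Ck, BIC, and adjusted R2
by hand for one candidate submodel and checks the AIC against extractAIC(), which is what step()
uses internally. The particular subset of predictors is arbitrary and chosen only for illustration.

# Selection criteria computed "by hand" for one candidate model.
full.sigma2 = summary(bodyfat.ols)$sigma^2        # sigma-hat^2 from the full model
cand = lm(bodyfat ~ age + weight + abdomen + wrist, data = Bodyfat)   # illustrative subset
n = nrow(Bodyfat)
k = length(coef(cand))                            # number of estimated parameters
RSSk = sum(resid(cand)^2)
AIC.cand = n*log(RSSk/n) + 2*k
Ck       = RSSk/full.sigma2 + 2*k - n
BIC.cand = n*log(RSSk/n) + log(n)*k
adjr2    = 1 - (RSSk/df.residual(cand)) /
               (sum((Bodyfat$bodyfat - mean(Bodyfat$bodyfat))^2)/(n - 1))
c(AIC = AIC.cand, Ck = Ck, BIC = BIC.cand, adjR2 = adjr2)
extractAIC(cand)                                  # step() uses this form of the AIC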
Backward Elimination
> bodyfat.back = step(bodyfat.ols,direction="backward")
Backward elimination results are displayed (not shown)
> anova(bodyfat.back)
Analysis of Variance Table

Response: bodyfat
           Df Sum Sq Mean Sq  F value    Pr(>F)
age         1 1493.3  1493.3  81.4468 < 2.2e-16 ***
weight      1 6674.3  6674.3 364.0279 < 2.2e-16 ***
neck        1  182.5   182.5   9.9533  0.001808 **
abdomen     1 4373.0  4373.0 238.5125 < 2.2e-16 ***
hip         1    6.9     6.9   0.3747  0.541022
thigh       1  136.6   136.6   7.4523  0.006799 **
forearm     1   90.1    90.1   4.9164  0.027528 *
wrist       1  166.8   166.8   9.1002  0.002827 **
Residuals 243 4455.3    18.3
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> bodyfat.back$anova
      Step Df    Deviance Resid. Df Resid. Dev      AIC
1          NA          NA       238   4420.064 749.8491
2   - knee  1  0.05885058       239   4420.123 747.8524
3  - chest  1  0.52286065       240   4420.646 745.8822
4 - height  1  0.68462867       241   4421.330 743.9212
5  - ankle  1 13.28231735       242   4434.613 742.6772
6 - biceps  1 20.71159705       243   4455.324 741.8514
Forward Selection (painful because the candidate predictors need to be listed explicitly)

> bodyfat.base = lm(bodyfat~1,data=Bodyfat)   ← model with the intercept only
> bodyfat.forward = step(bodyfat.base,~.+age+weight+height+
     neck+chest+abdomen+hip+thigh+knee+ankle+biceps+forearm+
     wrist,direction="forward")
Start:  AIC=1071.75
bodyfat ~ 1

          Df Sum of Sq     RSS     AIC
+ abdomen  1   11631.5  5947.5  800.65
+ chest    1    8678.3  8900.7  902.24
+ hip      1    6871.2 10707.8  948.82
+ weight   1    6593.0 10986.0  955.29
+ thigh    1    5505.0 12073.9  979.08
+ knee     1    4548.4 13030.6  998.30
+ biceps   1    4277.3 13301.7 1003.49
+ neck     1    4230.9 13348.1 1004.36
+ forearm  1    2295.8 15283.2 1038.48
+ wrist    1    2111.5 15467.5 1041.50
+ age      1    1493.3 16085.7 1051.38
+ ankle    1    1243.5 16335.5 1055.26
<none>                 17579.0 1071.75
+ height   1      11.2 17567.7 1073.59

Etc…
Both or Mixed Selection
> bodyfat.mixed = step(bodyfat.ols)   ← default direction="both", starting from the full model
> anova(bodyfat.mixed)
Analysis of Variance Table

Response: bodyfat
           Df Sum Sq Mean Sq  F value    Pr(>F)
age         1 1493.3  1493.3  81.4468 < 2.2e-16 ***
weight      1 6674.3  6674.3 364.0279 < 2.2e-16 ***
neck        1  182.5   182.5   9.9533  0.001808 **
abdomen     1 4373.0  4373.0 238.5125 < 2.2e-16 ***
hip         1    6.9     6.9   0.3747  0.541022
thigh       1  136.6   136.6   7.4523  0.006799 **
forearm     1   90.1    90.1   4.9164  0.027528 *
wrist       1  166.8   166.8   9.1002  0.002827 **
Residuals 243 4455.3    18.3
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> bodyfat.mixed$anova
      Step Df    Deviance Resid. Df Resid. Dev      AIC
1          NA          NA       238   4420.064 749.8491
2   - knee  1  0.05885058       239   4420.123 747.8524
3  - chest  1  0.52286065       240   4420.646 745.8822
4 - height  1  0.68462867       241   4421.330 743.9212
5  - ankle  1 13.28231735       242   4434.613 742.6772
6 - biceps  1 20.71159705       243   4455.324 741.8514
Stepwise Methods Using the leaps package in R
The package leaps, available through CRAN, will perform forward, backward, and
mixed approaches as well, but offers some improvements over the default step function
in base R.
> library(leaps)
> bodyfat.full = regsubsets(bodyfat~.,data=Bodyfat,nvmax=13)
> summary(bodyfat.full)
Subset selection object
Call: regsubsets.formula(bodyfat ~ ., data = Bodyfat, nvmax = 13)
13 Variables  (and intercept)
        Forced in Forced out
age         FALSE      FALSE
weight      FALSE      FALSE
height      FALSE      FALSE
neck        FALSE      FALSE
chest       FALSE      FALSE
abdomen     FALSE      FALSE
hip         FALSE      FALSE
thigh       FALSE      FALSE
knee        FALSE      FALSE
ankle       FALSE      FALSE
biceps      FALSE      FALSE
forearm     FALSE      FALSE
wrist       FALSE      FALSE
1 subsets of each size up to 13
Selection Algorithm: exhaustive
          age weight height neck chest abdomen hip thigh knee ankle biceps forearm wrist
1  ( 1 )  " " " "    " "    " "  " "   "*"     " " " "   " "  " "   " "    " "     " "
2  ( 1 )  " " "*"    " "    " "  " "   "*"     " " " "   " "  " "   " "    " "     " "
3  ( 1 )  " " "*"    " "    " "  " "   "*"     " " " "   " "  " "   " "    " "     "*"
4  ( 1 )  " " "*"    " "    " "  " "   "*"     " " " "   " "  " "   " "    "*"     "*"
5  ( 1 )  " " "*"    " "    "*"  " "   "*"     " " " "   " "  " "   " "    "*"     "*"
6  ( 1 )  "*" "*"    " "    " "  " "   "*"     " " "*"   " "  " "   " "    "*"     "*"
7  ( 1 )  "*" "*"    " "    "*"  " "   "*"     " " "*"   " "  " "   " "    "*"     "*"
8  ( 1 )  "*" "*"    " "    "*"  " "   "*"     "*" "*"   " "  " "   " "    "*"     "*"
9  ( 1 )  "*" "*"    " "    "*"  " "   "*"     "*" "*"   " "  " "   "*"    "*"     "*"
10 ( 1 )  "*" "*"    " "    "*"  " "   "*"     "*" "*"   " "  "*"   "*"    "*"     "*"
11 ( 1 )  "*" "*"    "*"    "*"  " "   "*"     "*" "*"   " "  "*"   "*"    "*"     "*"
12 ( 1 )  "*" "*"    "*"    "*"  "*"   "*"     "*" "*"   " "  "*"   "*"    "*"     "*"
13 ( 1 )  "*" "*"    "*"    "*"  "*"   "*"     "*" "*"   "*"  "*"   "*"    "*"     "*"
> reg.summary = summary(bodyfat.full)
> names(reg.summary)
[1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
> par(mfrow=c(2,2))   ← set up a 2 x 2 grid of plots
> plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="b")
> plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted R-square",type="b")
> plot(reg.summary$cp,xlab="Number of Variables",ylab="Mallow's Cp",type="b")
> plot(reg.summary$bic,xlab="Number of Variables",ylab="Bayesian Information Criterion (BIC)",type="b")
> par(mfrow=c(1,1))   ← restore to 1 plot per page
Find “optimal” model size using adjusted-R2, Mallow’s Ck, and BIC
> which.max(reg.summary$adjr2)
[1] 9
> which.min(reg.summary$cp)
[1] 7
> which.min(reg.summary$bic)
[1] 4
The regsubsets() function has a built-in plot command which can display the
selected variables for the “best” model with a given model selection statistic. The top
row of each plot contains a black square for each variable selected according to the
optimal model associated with that statistic. Examples using the R2 (unadjusted),
adjusted R2, Mallow’s Ck, and the BIC are shown on the following page.
> plot(bodyfat.full,scale="r2")
> plot(bodyfat.full,scale="adjr2")
> plot(bodyfat.full,scale="Cp")
> plot(bodyfat.full,scale="bic")
Automatic Selection via All Possible Subsets
The package bestglm will return the "best" model using a user-specified model
selection criterion such as AIC (basically Mallow's Ck for OLS), BIC, or cross-validation
schemes. The PDF documentation for this package is excellent, with several
complete examples and details on how to use the various options. The output below
shows the use of the bestglm function to find the "best" OLS model using the AIC/Ck
criterion.
> library(bestglm)
> Xy = cbind(X,y)
> bodyfat.best = bestglm(Xy,IC="AIC")
> attributes(bodyfat.best)
$names
[1] "BestModel"   "BestModels"  "ModelReport" "Bestq"       "qTable"      "Subsets"     "Title"

$class
[1] "bestglm"
> bodyfat.best$Subsets
> bodyfat.best$BestModel
Call:
lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
    drop = FALSE], y = y))

Coefficients:
(Intercept)          age       weight         neck      abdomen          hip        thigh      forearm        wrist
  -22.65637      0.06578     -0.08985     -0.46656      0.94482     -0.19543      0.30239      0.51572     -1.53665

> summary(bodyfat.best$BestModel)

Residuals:
     Min       1Q   Median       3Q      Max
-10.9757  -2.9937  -0.1644   2.9766  10.2244

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -22.65637   11.71385  -1.934  0.05426 .
age           0.06578    0.03078   2.137  0.03356 *
weight       -0.08985    0.03991  -2.252  0.02524 *
neck         -0.46656    0.22462  -2.077  0.03884 *
abdomen       0.94482    0.07193  13.134  < 2e-16 ***
hip          -0.19543    0.13847  -1.411  0.15940
thigh         0.30239    0.12904   2.343  0.01992 *
forearm       0.51572    0.18631   2.768  0.00607 **
wrist        -1.53665    0.50939  -3.017  0.00283 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.282 on 243 degrees of freedom
Multiple R-squared: 0.7466,    Adjusted R-squared: 0.7382
F-statistic: 89.47 on 8 and 243 DF,  p-value: < 2.2e-16
Save the “best” OLS model to an object named appropriately. We can then examine
various regression diagnostics for this model as considered above.
> bodyfat.bestols = lm(formula(bodyfat.best$BestModel))
> MLRdiag(bodyfat.bestols) etc…
> VIF(bodyfat.bestols)
Variance Inflation Factor Table
         Variable       VIF  Rsquared
age           age  2.059194 0.5143731
weight     weight 18.829990 0.9468932
neck         neck  4.081562 0.7549958
abdomen   abdomen  8.236808 0.8785937
hip           hip 13.471431 0.9257688
thigh       thigh  6.283117 0.8408433
forearm   forearm  1.940309 0.4846180
wrist       wrist  3.096051 0.6770079
The presence of collinearity issues even after reducing the model size suggests some
problems with the “in/out” selection strategies regardless of the criterion used to select
them.
1.4 - Cross-Validation Functions for OLS
In this section I will show some sample R code that can be used (and altered) to
perform cross-validation to estimate the "average" PSE (prediction squared
error), MPSE (mean prediction squared error), or the MSEP (mean squared error
for prediction). Note these are all the same thing! We could also take the square
root of any of these to obtain the root mean squared error of prediction (RMSEP). I won't
even consider listing other associated acronyms. Suppose we have m
observations for which we want to predict the value of the response (y). These m
observations must NOT have been used to develop/train the model!
$$\text{MSE for prediction} = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - y_{pred}(i)\right)^2$$
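In R this is just the mean of the squared differences between the held-out responses and their
predictions. A minimal sketch, where ytest and ypred are placeholder names for the m held-out
responses and their corresponding predictions:

MSEP  = mean((ytest - ypred)^2)   # mean squared error for prediction
RMSEP = sqrt(MSEP)                # root mean squared error for prediction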
As discussed in class there are several schemes that can be used to estimate the predictive
abilities of a regression model. The list of methods we discussed is:

1) Test/Train Samples or Split Sample approach
2) K-fold Cross-Validation (k-fold CV)
3) Leave-out-one Cross-validation (LOOCV)
4) Monte Carlo Cross-validation (MCCV)
5) .632 Bootstrap
In this handout I will demonstrate these different methods of cross-validation
using the Bodyfat example. In the textbook (Section 6.5.3, pp. 248-251), the
authors demonstrate how to use k-fold cross-validation to determine the optimal
number of predictors in the OLS model using the Hitters data found in the
ISLR package. The approach I will take with the body fat data is a little
different. I will assume that we have a model chosen and we wish to estimate
the predictive performance of this model using the MSEP estimated via cross-validation.
Test/Train or Split Sample Cross-Validation for an OLS Regression Model
The code below will construct a split sample cross-validation for an OLS model. We first
choose a fraction p of the available data to form our training data (e.g. p = .67). The
remaining data are then used to form our test cases. We fit the model to the training data and
then predict the response value for the test cases.
> dim(Bodyfat)
[1] 252 14
> n = nrow(Bodyfat)
> n
[1] 252
> p = .67
> m = floor(n*(1-p))
> m
[1] 83
> sam = sample(1:n,m,replace=F)
> Bodyfat.test = Bodyfat[sam,]
> Bodyfat.train = Bodyfat[-sam,]
> Bodyfat.ols = lm(bodyfat~.,data=Bodyfat.train)
> bodyfat.pred = predict(Bodyfat.ols,newdata=Bodyfat.test)
> pred.err = Bodyfat.test$bodyfat - bodyfat.pred
> pred.err
        109         211         197         102          47         101
 6.16872468 -5.80966602  5.61137864  0.04633038  2.68099394  2.71919034
        234          10          42          49          72          71
 2.20765427  2.33085273  3.56701914 -4.26635609 -3.89057224  4.82964948
        206          97          27         252         118         241
 2.53708347 -7.65387694 -1.29145551  5.19305658 -0.20192157  2.17346001
         41         195          78         143           1          15
-2.47471287  5.97086308  3.02314968  4.87661670 -3.66664960 -1.67430723
        204          88           5         185         148          84
-8.82900506  2.17247349  1.13773495 -0.47892211  7.27411561  5.55960734
        194         223         199         129         247         212
-2.26666371 -5.76919180  1.29122863  2.29403342  0.84380506  2.78002236
        116          55         142           3         183          91
 0.16611536 -4.26941892 -3.01050196  7.11824662 -4.41990034 -1.72264289
         83         134         222         110          58          45
-4.54710906  5.50948186 -3.43478744  0.16872091  1.35967343 -3.82375057
        240          51         151         173         207          53
 4.43252304 -5.70578266  1.00073463  3.66326700 10.55889964 -6.68075831
        192         170         119         196          50         107
 8.27135710 -3.41991050  7.41856162  2.75480323 -2.12010334 -7.31113211
        123         163         153          67         232         177
 2.05349309 -2.67587344  5.74096473  5.40488472 -5.18731001 -2.32761863
          2         184         133         103         157          17
-3.24592972 -4.80316777 -1.70083102  2.46434656  2.33830994  5.91822645
         61         166         235          56         100          40
 1.14124567  1.53410161  4.48022867 -0.81379728  3.74242033  0.85255665
        251         210         233         169          43
 1.77435585 -4.14641758 -0.90216282 -3.40791216 -2.22336196
> mean(pred.err^2)
[1] 17.8044
If you did this yourself you would most likely obtain a different PSE, because your random
sample of the indices would produce different test and training samples. We can
guarantee our results will match by using the command set.seed() to obtain the same
random samples.

> set.seed(1)   ← if we all use this value before any command that utilizes
                  randomization, we will get the same results.
> set.seed(1)
> sam = sample(1:n,m,replace=F)
> Bodyfat.test = Bodyfat[sam,]
> Bodyfat.train = Bodyfat[-sam,]
> Bodyfat.ols = lm(bodyfat~.,data=Bodyfat.train)
> bodyfat.pred = predict(Bodyfat.ols,newdata=Bodyfat.test)
> pred.err = Bodyfat.test$bodyfat - bodyfat.pred
> mean(pred.err^2)
[1] 14.20143
> set.seed(1000)
> sam = sample(1:n,m,replace=F)
> Bodyfat.test = Bodyfat[sam,]
> Bodyfat.train = Bodyfat[-sam,]
> Bodyfat.ols = lm(bodyfat~.,data=Bodyfat.train)
> bodyfat.pred = predict(Bodyfat.ols,newdata=Bodyfat.test)
> pred.err = Bodyfat.test$bodyfat - bodyfat.pred
> mean(pred.err^2)
[1] 20.6782
> set.seed(1111)
> sam = sample(1:n,m,replace=F)
> Bodyfat.test = Bodyfat[sam,]
> Bodyfat.train = Bodyfat[-sam,]
> Bodyfat.ols = lm(bodyfat~.,data=Bodyfat.train)
> bodyfat.pred = predict(Bodyfat.ols,newdata=Bodyfat.test)
> pred.err = Bodyfat.test$bodyfat - bodyfat.pred
> mean(pred.err^2)
[1] 22.39968
Notice the variation in the MPSE estimates!
Here is a slight variation on the code that will produce the same results.
> set.seed(1)
> test = sample(n,m)
> Bodyfat.ols = lm(bodyfat~.,data=Bodyfat,subset=-test)
> mean((bodyfat-predict(Bodyfat.ols,Bodyfat))[test]^2)
[1] 14.20143
> set.seed(1111)
> test = sample(n,m)
> Bodyfat.ols = lm(bodyfat~.,data=Bodyfat,subset=-test)
> mean((bodyfat-predict(Bodyfat.ols,Bodyfat))[test]^2)
[1] 22.39968
> set.seed(1000)
> test = sample(n,m)
> Bodyfat.ols = lm(bodyfat~.,data=Bodyfat,subset=-test)
> mean((bodyfat-predict(Bodyfat.ols,Bodyfat))[test]^2)
[1] 20.6782
k-Fold Cross-Validation
To perform a k-fold cross-validation we first need to divide our available data
into k roughly equal-sized sets of observations. We then fit our model using
(k - 1) of the sets and predict the set of observations not used. This
is done k times, with each set being left out in turn. Typical values used in
practice are k = 5 and k = 10.

The function below will take an OLS model, cross-validate it using k-fold
cross-validation, and return the MSEP.
> kfold.cv = function(fit,k=10) {
    sum.sqerr <- rep(0,k)
    y = fit$model[,1]                         # response used in the fitted model
    x = fit$model[,-1]
    data = fit$model
    n = nrow(data)
    folds = sample(1:k,nrow(data),replace=T)  # randomly assign each case to a fold
    for (i in 1:k) {
       fit2 <- lm(formula(fit),data=data[folds!=i,])    # fit on all folds but fold i
       ypred <- predict(fit2,newdata=data[folds==i,])   # predict the held-out fold
       sum.sqerr[i] <- sum((y[folds==i]-ypred)^2)
    }
    cv = sum(sum.sqerr)/n
    cv
  }
> kfold.cv(Bodyfat.ols,k=10)
[1] 21.02072
Leave-Out-One Cross-Validation (LOOCV)
Using the fact that the predicted value for $y_i$ when the $i^{th}$ case is deleted from the model satisfies

$$y_i - \hat{y}_{(i)} = \frac{\hat{e}_i}{(1 - h_i)} = \hat{e}_{(i)} = \left(y_i - y_{pred}(i)\right)$$

This is also called the $i^{th}$ jackknife residual, and the sum of these squared residuals is
called the PRESS statistic, one of the first measures of prediction error.
In R you can obtain the prediction errors as follows:
> pred.err = resid(fit)/(1-lm.influence(fit)$hat)   ← where fit is the OLS model we want to
                                                      estimate the prediction error for.
> pred.err = resid(Bodyfat.ols)/(1-lm.influence(Bodyfat.ols)$hat)
> mean(pred.err^2)
[1] 20.29476
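Because this shortcut comes up repeatedly, it is convenient to wrap it in a small function. A minimal
sketch (the function name loocv is my own, not from the handout's library of functions):

# LOOCV estimate of the MSEP for any lm fit via the hat-value shortcut
# (equivalent to PRESS/n); no refitting is required.
loocv = function(fit) {
  press.resid = resid(fit)/(1 - lm.influence(fit)$hat)   # jackknife residuals
  mean(press.resid^2)
}
loocv(Bodyfat.ols)   # reproduces the value computed above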
Monte Carlo Cross-validation (MCCV) for an OLS Regression Model
This function performs Monte Carlo Cross-validation for an arbitrary OLS model. Main
argument is the fitted model from the lm()function. Optional arguments are the fraction
of observations to use in the training set (default is p = .667 or approximately two-thirds
of the original data) and the number of replications (default is B = 100, which is rather
small actually).
> ols.mccv = function(fit,p=.667,B=100) {
    cv <- rep(0,B)
    y = fit$model[,1]
    x = fit$model[,-1]
    data = fit$model
    n = nrow(data)
    for (i in 1:B) {
       ss <- floor(n*p)                       # training-set size
       sam <- sample(1:n,ss,replace=F)        # random training indices
       fit2 <- lm(formula(fit),data=data[sam,])
       ypred <- predict(fit2,newdata=x[-sam,])
       cv[i] <- mean((y[-sam]-ypred)^2)       # PSE for this split
    }
    cv
  }
Here is a different version using a cleaner approach for dealing with the train/test data.
> ols.mccv2 = function(fit,p=.667,B=100) {
    cv <- rep(0,B)
    y = fit$model[,1]
    x = fit$model[,-1]
    data = fit$model
    n = nrow(data)
    for (i in 1:B) {
       ss <- floor(n*p)
       sam <- sample(n,ss,replace=F)
       fit2 <- lm(formula(fit),subset=sam)
       ypred <- predict(fit2,data)
       cv[i] <- mean((y - ypred)[-sam]^2)
    }
    cv
  }
MCCV Example: Bodyfat OLS – using a data frame with standardized X's

> Bodyfat.x = scale(Bodyfat[,-1])
> Bodyfat.scale = data.frame(bodyfat=Bodyfat$bodyfat,Bodyfat.x)
> names(Bodyfat.scale)
 [1] "bodyfat" "age"     "weight"  "height"  "neck"    "chest"   "abdomen" "hip"     "thigh"   "knee"    "ankle"
[12] "biceps"  "forearm" "wrist"
> bodyfat.ols = lm(bodyfat~.,data=Bodyfat.scale)   ← note this is the full model
> summary(bodyfat.ols)
Call:
lm(formula = bodyfat ~ ., data = Bodyfat.scale)

Residuals:
     Min       1Q   Median       3Q      Max
-11.1966  -2.8824  -0.1111   3.1901   9.9979

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.15079    0.27147  70.544  < 2e-16 ***
age          0.81376    0.40570   2.006  0.04601 *
weight      -2.83261    1.81766  -1.558  0.12047
height      -0.11466    0.46633  -0.246  0.80599
neck        -1.15582    0.57264  -2.018  0.04467 *
chest       -0.14488    0.87021  -0.166  0.86792
abdomen     10.29781    0.97225  10.592  < 2e-16 ***
hip         -1.35104    1.03729  -1.302  0.19401
thigh        1.30382    0.76738   1.699  0.09061 .
knee         0.03364    0.59752   0.056  0.95516
ankle        0.30150    0.37731   0.799  0.42505
biceps       0.55078    0.52117   1.057  0.29166
forearm      0.92091    0.40272   2.287  0.02309 *
wrist       -1.54462    0.49775  -3.103  0.00215 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.309 on 238 degrees of freedom
Multiple R-squared: 0.7486,    Adjusted R-squared: 0.7348
F-statistic: 54.5 on 13 and 238 DF,  p-value: < 2.2e-16
> results = ols.mccv(bodyfat.ols)
> mean(results)   ← Avg. PSE or MPSE
[1] 21.12341
> results
  [1] 23.05984 18.67317 16.19726 20.80143 19.03076 23.04956 19.51525 25.96077 23.53890 23.21706 25.40393 20.72045
 [13] 20.90526 25.05236 21.92504 19.57326 22.51138 18.41306 19.44615 20.58773 19.99501 24.28484 21.15600 26.98161
 [25] 17.92298 22.44382 19.99677 21.53336 19.05944 23.18343 23.86428 14.57523 22.69167 18.88682 19.61813 21.38962
 [37] 23.44665 23.61154 23.99787 19.85491 18.90459 23.63645 20.71451 20.67572 17.69072 19.93048 21.76474 17.93562
 [49] 24.07194 17.43179 27.06074 20.57398 21.50898 17.12468 22.13966 22.57043 19.62440 20.08109 23.17060 17.34390
 [61] 19.71538 18.75761 24.37032 17.28482 21.68807 20.82153 22.52111 24.41738 20.89708 25.00203 15.67878 22.44576
 [73] 20.98620 20.75209 20.18027 24.02314 14.90204 23.22266 23.30630 21.74893 20.02030 19.84644 22.31727 19.65249
 [85] 22.67341 17.84598 19.57925 21.03370 22.04792 21.77679 22.90814 23.08865 22.26177 19.02380 21.20719 18.50420
 [97] 20.77255 20.66017 19.95717 24.41063
> sum(resid(bodyfat.ols)^2)/252   ← RSS/n < MPSE, as it should be!
[1] 17.53994
> results = ols.mccv(bodyfat.ols,B=1000)
> mean(results)
[1] 20.95795
> bodyfat.step = step(bodyfat.ols)   ← find the "best" OLS model using mixed selection
> results = ols.mccv(bodyfat.step,B=500)
> mean(results)
[1] 19.4979  Q: The MPSE is smaller for the simpler model, but is this the best we can do?
Bootstrap Estimate of the Mean Squared Error for Prediction
The bootstrap in statistics is a method for approximating the sampling distribution of a
statistic by resampling from our observed random sample. To put it simply, a bootstrap
sample is a sample of size n drawn with replacement from our original sample. A
bootstrap sample for regression (or classification) problems is illustrated below.
Data: $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)$, where the $\mathbf{x}_i$'s are the p-dimensional predictor vectors.

Bootstrap Sample: $(\mathbf{x}_1^*, y_1^*), (\mathbf{x}_2^*, y_2^*), \ldots, (\mathbf{x}_n^*, y_n^*)$, where each $(\mathbf{x}_i^*, y_i^*)$ is a randomly
selected observation from the original data, drawn with replacement.
We can use the bootstrap sample to calculate any statistic of interest. This
process is then repeated a large number of times (B = 500, 1000, 5000, etc.).
For estimating prediction error we fit a model to our bootstrap sample and use it
to predict the observations not selected in our bootstrap sample. One can show
that about 63.2% of the original observations will be represented in the bootstrap
sample and about 36.8% of the original observations will not be selected. Thus
we will almost certainly have some observations that are not represented in our
bootstrap sample to serve as a "test" set, with the selected observations in our
bootstrap sample serving as our "training" set. For each bootstrap sample we
can predict the response for the cases that were not selected.
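The 63.2% figure comes from the fact that the chance a given observation is never drawn in n draws
with replacement is $(1 - 1/n)^n \approx e^{-1} \approx 0.368$. A quick simulation sketch (using n = 252
to match the body fat data) confirms it:

# Average proportion of original cases that appear in a bootstrap sample.
set.seed(1)
n = nrow(Bodyfat)
prop.in = replicate(1000, length(unique(sample(1:n, n, replace = TRUE)))/n)
mean(prop.in)    # will be very close to 0.632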
Estimating the prediction error via the .632 Bootstrap
Again our goal is to estimate the mean prediction squared error (MPSE or PSE for short)
or mean squared error for prediction (MSEP).
Another alternative to those presented above is to use the .632 bootstrap for estimating
the PSE. The algorithm is given below:
1) First calculate the average squared residual (ASR) from your model
ASR = 𝑅𝑆𝑆/𝑛.
2) Take B bootstrap samples drawn with replacement, i.e. we draw a sample with
replacement from the numbers 1 to n and use those observations as our “new
data”.
3) Fit the model to each of the B bootstrap samples, computing the 𝐴𝑆𝑅(𝑗) for
predicting the observations not represented in the bootstrap sample.
𝐴𝑆𝑅(𝑗) = average squared residual for prediction in the jth bootstrap sample,
j = 1,…,B.
4) Compute ASR0 = the average of the bootstrap ASR values
5) Compute the optimism (OP) = .632*(ASR0 – ASR)
6) The .632 bootstrap estimate of mean PSE = ASR + OP.
The bootstrap approach has been shown to be better than K-fold cross-validation in many
cases.
Here is an example/function of the .632 bootstrap estimate of the mean PSE again using
the body fat dataset (Bodyfat).
> bootols.cv = function(fit,B=100) {
    ASR = mean(fit$residuals^2)          # average squared residual for the original fit
    boot.err <- rep(0,B)
    y = fit$model[,1]
    x = fit$model[,-1]
    data = fit$model
    n = nrow(data)
    for (i in 1:B) {
       sam = sample(1:n,n,replace=T)     # bootstrap sample of case indices
       samind = sort(unique(sam))        # cases represented in the bootstrap sample
       temp = lm(formula(fit),data=data[sam,])
       ypred = predict(temp,newdata=data[-samind,])
       boot.err[i] = mean((y[-samind]-ypred)^2)   # ASR(j) for the out-of-bootstrap cases
    }
    ASR0 = mean(boot.err)
    OP = .632*(ASR0 - ASR)               # optimism
    PSE = ASR + OP                       # .632 bootstrap estimate of the mean PSE
    PSE
  }
Again we perform cross-validation on the full OLS model for the body fat data.
> Bodyfat.ols = lm(bodyfat~.,data=Bodyfat)
> set.seed(1111)
> bootols.cv(Bodyfat.ols,B=100)
[1] 20.16974
> bootols.cv(Bodyfat.ols,B=100)
[1] 19.87913
> bootols.cv(Bodyfat.ols,B=100)
[1] 19.80591
> bootols.cv(Bodyfat.ols,B=1000)   ← increasing the number of bootstrap samples (B = 1000)
[1] 19.89335
More on Prediction Error and the Variance-Bias Tradeoff
For any regression problem we assume that the response has the following model:

$$Y = f(\mathbf{x}) + \varepsilon$$

where $\mathbf{x} = (x_1, x_2, \ldots, x_p)$ is the collection of p predictors and $Var(\varepsilon) = \sigma_\varepsilon^2$.
Our goal in modeling is to approximate or estimate $f(\mathbf{x})$ using a random sample
of size n: $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)$, where the $\mathbf{x}_i$'s are the p-dimensional predictor
vectors.

$$PSE(Y) = E\left[\left(Y - \hat{f}(\mathbf{x})\right)^2\right] = \left[E\left(\hat{f}(\mathbf{x})\right) - f(\mathbf{x})\right]^2 + E\left[\left(\hat{f}(\mathbf{x}) - E\left(\hat{f}(\mathbf{x})\right)\right)^2\right] + \sigma_\varepsilon^2$$
$$= Bias^2 + Var\left(\hat{f}(\mathbf{x})\right) + \text{Irreducible Error}$$
The cross-validation methods discussed above are all acceptable ways to estimate
PSE(Y), but some are certainly better than others. This is still an active area of research
and there is no definitive best method for every situation. Some methods are better at
estimating the variance component of the PSE while others are better at estimating the bias.
Ideally we would like to use a method of cross-validation that does a reasonable job of
estimating each component.
In the sections to follow we will be introducing alternatives to OLS or variations of OLS
for developing models for 𝑓(𝒙). Some of these modeling strategies have the potential to
be very flexible (i.e. have small Bias) but at the expense of being highly variable, i.e.
have large variation, 𝑉𝑎𝑟(𝑓̂(𝒙)). Balancing these two components of prediction error is
critical and cross-validation is one of the main tools we will use to create this balance in
our model development.
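To make the tradeoff concrete, the sketch below simulates repeated training samples from a known
f(x) and compares a rigid fit to a very flexible one. This is an illustrative simulation of my own, not
part of the handout's body fat analysis, and the particular f(x), sample size, and flexibility levels
are arbitrary choices.

# Bias-variance illustration: estimate f(x) = sin(2*pi*x) at x0 = 0.25 with a
# straight-line fit (high bias, low variance) and a degree-10 polynomial
# (low bias, higher variance), over many simulated training sets.
set.seed(1)
f = function(x) sin(2*pi*x)
x0 = 0.25; B = 500; nsim = 50; sigma = 0.5
pred.lin = pred.poly = numeric(B)
for (b in 1:B) {
  xsim = runif(nsim)
  ysim = f(xsim) + rnorm(nsim, sd = sigma)
  pred.lin[b]  = predict(lm(ysim ~ xsim), newdata = data.frame(xsim = x0))
  pred.poly[b] = predict(lm(ysim ~ poly(xsim, 10)), newdata = data.frame(xsim = x0))
}
c(bias.lin  = mean(pred.lin)  - f(x0), var.lin  = var(pred.lin),
  bias.poly = mean(pred.poly) - f(x0), var.poly = var(pred.poly))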
2 - Shrinkage Methods (“Automatic” Variable Selection Methods)
In our review of OLS we considered the classic stepwise model selection methods: forward,
backward, and mixed. All three of these methods either include or exclude terms
starting from an appropriate base model.

Other model selection methods have been developed that are viable alternatives to these
in/out strategies. These include ridge regression (an old one, but with new-found life), the
LASSO (newer), LARS (newest), PCR, and PLS. We will discuss the idea
behind each of these modeling methods in the sections below.
Aside from model selection, these methods have also been used extensively in high-dimensional
regression problems. A high-dimensional problem is one in which n < p or
n << p. The text authors present two examples where this might be the case, but there are
certainly many others.
•  Predicting blood pressure – rather than use standard predictors such as age,
   gender, and BMI, one might also collect measurements for half a million single
   nucleotide polymorphisms (SNP's) for inclusion in the model. Thus we might
   have n ≈ 300 and p ≈ 500,000!

•  Predicting purchasing behavior of online shoppers – using a table of 50,000 key
   words (coded 0/1) potential customers might use in the process of searching for
   products (i.e. Amazon.com), we might try to predict their purchasing behavior.
   We might gather information from 5,000 randomly selected visitors to the
   website, in which case n ≈ 5,000 and p ≈ 50,000!
Ridge and Lasso regression models will allow us to fit models to these situations where
𝑛 ≪ 𝑝, where OLS mathematically cannot!
2.1 - Ridge Regression or Regularized Regression
Ridge regression chooses parameter estimates, $\hat{\beta}^{ridge}$, to minimize the residual sum of
squares subject to a penalty on the size of the coefficients. After standardizing all
potential terms in the model, the ridge coefficients minimize

$$\hat{\beta}^{ridge} = \min_{\beta}\left\{\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{k} u_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{k}\beta_j^2\right\}$$

Here $\lambda > 0$ is a complexity parameter that controls the amount of shrinkage: the larger $\lambda$,
the greater the amount of shrinkage. The intercept is not included in the shrinkage and
will be estimated as the mean of the response. An equivalent way to write the ridge
regression criterion is

$$\hat{\beta}^{ridge} = \min_{\beta}\left\{\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{k} u_{ij}\beta_j\right)^2\right\} \quad subject\ to \quad \sum_{j=1}^{k}\beta_j^2 \le s$$

This clearly shows how the size of the parameter estimates is constrained. This
formulation of the problem also leads to a nice geometric interpretation of how the
penalized least squares estimation works (see the figure on the next page).

Important Question: Why is it essential to standardize the terms in our model?
Visualization of Ridge Regression

[Figure: contours of the OLS criterion centered at the usual OLS estimate $(\hat{\beta}_1, \hat{\beta}_2)$,
shown with the circular constraint region $\sum_{j=1}^{2}\beta_j^2 = \beta_1^2 + \beta_2^2 \le s$.
The ridge regression estimate $(\hat{\beta}_1^{ridge}, \hat{\beta}_2^{ridge})$ lies where the contours meet the constraint region.]
In matrix notation the ridge regression criterion is given by

$$RSS(\lambda) = (y - U\beta)^T(y - U\beta) + \lambda\beta^T\beta$$

with the resulting parameter estimates being very similar to those for OLS

$$\hat{\beta}^{ridge} = (U^TU + \lambda I)^{-1}U^Ty$$

where I is the k x k identity matrix.
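The closed form above can be verified directly in R. The sketch below computes the ridge estimates
by brute force for the standardized body fat predictors (assuming Bodyfat.scale exists as built
earlier, with the intercept handled by centering the response). The values will be close to, though
not exactly equal to, those returned by lm.ridge further below, since lm.ridge uses a slightly
different internal scaling.

# Brute-force ridge estimates from the closed-form solution, lambda = 1.66.
U  = as.matrix(Bodyfat.scale[,-1])                            # standardized predictors
yc = Bodyfat.scale$bodyfat - mean(Bodyfat.scale$bodyfat)      # centered response
lambda = 1.66
k = ncol(U)
beta.ridge = solve(t(U) %*% U + lambda*diag(k)) %*% t(U) %*% yc
round(t(beta.ridge), 4)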
There are several packages in R that contain functions that perform ridge regression. One
we will use is lm.ridge in the package MASS. The MASS package actually contains a
variety of very useful functions. MASS stands for Modern Applied Statistics in S-Plus
(expensive R) by Venables & Ripley, this is an excellent reference if you are so inclined.
The function call using lm.ridge is very similar to the lm() function. The other
function we will use is the function ridge in the genridge package. The genridge
package contains a number of plotting functions to help visualize the coefficient
shrinkage that takes place by using ridge regression.
Using the bodyfat dataset we will conduct a ridge regression analysis. In order to fairly
compare the parameter estimates obtained via ridge regression to those from OLS, we will
first run the OLS regression using the standardized predictors.
> bodyfat.scaled = lm(bodyfat~.,data=Bodyfat.scale)
> summary(bodyfat.scaled)
Call:
lm(formula = bodyfat ~ ., data = Bodyfat.scale)

Residuals:
     Min       1Q   Median       3Q      Max
-11.1966  -2.8824  -0.1111   3.1901   9.9979

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.15079    0.27147  70.544  < 2e-16 ***
age          0.81376    0.40570   2.006  0.04601 *
weight      -2.83261    1.81766  -1.558  0.12047
height      -0.11466    0.46633  -0.246  0.80599
neck        -1.15582    0.57264  -2.018  0.04467 *
chest       -0.14488    0.87021  -0.166  0.86792
abdomen     10.29781    0.97225  10.592  < 2e-16 ***
hip         -1.35104    1.03729  -1.302  0.19401
thigh        1.30382    0.76738   1.699  0.09061 .
knee         0.03364    0.59752   0.056  0.95516
ankle        0.30150    0.37731   0.799  0.42505
biceps       0.55078    0.52117   1.057  0.29166
forearm      0.92091    0.40272   2.287  0.02309 *
wrist       -1.54462    0.49775  -3.103  0.00215 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.309 on 238 degrees of freedom
Multiple R-squared: 0.7486,    Adjusted R-squared: 0.7348
F-statistic: 54.5 on 13 and 238 DF,  p-value: < 2.2e-16
> mean(bodyfat)
[1] 19.15079
To run ridge regression we first need to choose an optimal value for the penalty
parameter λ. The size of reasonable λ values varies vastly from one ridge model to the
next, so using some form of automated selection method like cross-validation to help find
one is a good idea. Another approach is to use the effective degrees of freedom of the
model, which is given by the trace (sum of the diagonal elements) of the matrix

$$df(\lambda) = tr\left[U(U^TU + \lambda I)^{-1}U^T\right]$$
which, as we can see, is a function of λ. Note that when λ = 0, i.e. OLS, this matrix is the
hat matrix, whose trace is always k. To fit ridge models and choose an appropriate λ we
will use the function lm.ridge from the MASS package.
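The effective degrees of freedom are easy to compute directly from this formula. A minimal sketch
for the standardized body fat predictors (dfridge is my own helper name), showing that df shrinks
from k = 13 toward 0 as λ grows:

# Effective degrees of freedom of a ridge fit as a function of lambda.
dfridge = function(U, lambda) {
  U = as.matrix(U)
  H = U %*% solve(t(U) %*% U + lambda*diag(ncol(U))) %*% t(U)
  sum(diag(H))                                   # trace of the "ridge hat" matrix
}
U = as.matrix(Bodyfat.scale[,-1])
sapply(c(0, 1.66, 10, 100, 1000), function(l) dfridge(U, l))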
> args(lm.ridge)
function (formula, data, subset, na.action, lambda = 0,
model = FALSE, x = FALSE, y = FALSE, contrasts = NULL, ...)
The args command is an easy way to see what arguments a function takes to run. Of
course some functions are quite complex, so using the command ?lm.ridge will
bring up the help file with additional details on the arguments and generally simple
examples of the function's use. We will first use a wide range of λ values and let the
built-in optimal selection algorithms choose good candidates.
> bodyfat.ridge = lm.ridge(bodyfat~.,data=Bodyfat.scale,
lambda=seq(0,1000,.1))
> select(bodyfat.ridge)
modified HKB estimator is 1.664046
modified L-W estimator is 3.91223
smallest value of GCV  at 1.1   ← cross-validation (GCV) choice for λ
Using the ridge function from the genridge package along with some different plotting
features we can see the shrinkage in the parameter estimates.
> bodyfat.ridge2 = ridge(bodyfat,bodyfat.Xs,
lambda=seq(0,1000,.1))
> traceplot(bodyfat.ridge2)
> traceplot(bodyfat.ridge2,X="df")
We can narrow the range on the  choices to take a closer look at the optimal shrinkage
parameter values.
> bodyfat.ridge3 = ridge(bodyfat,bodyfat.Xs,
lambda=seq(0,4,.001))
> traceplot(bodyfat.ridge3)
> traceplot(bodyfat.ridge3,X="df")
> bodyfat.xs = Bodyfat.scale[,-1]
> bodyfat.y = Bodyfat.scale[,1]
> bodyfat.ridge = ridge(bodyfat.y,bodyfat.xs,lambda=seq(0,10,2))
> pairs(bodyfat.ridge)
This plot shows the shrinkage in the estimated coefficients occurring as lambda increases from 0
to 10 by increments of 2. Most of the shrinkage occurs in the first 3 terms: age, weight, and
height.
> plot3d(bodyfat.ridge,variables=1:3)
A 3-D look at the shrinkage of the coefficients of age, weight, and height.
Fit a ridge regression model using the HKB optimal value for λ.
> bodyfat.ridge4 = lm.ridge(bodyfat~.,data=Bodyfat.scale,lambda=1.66)
> attributes(bodyfat.ridge4)
$names
[1] "coef"   "scales" "Inter"  "lambda" "ym"     "xm"     "GCV"    "kHKB"   "kLW"

$class
[1] "ridgelm"
Compare the OLS coefficients to the ridge coefficients side-by-side.
> cbind(coef(bodyfat.scaled),coef(bodyfat.ridge4))
                   [,1]         [,2]
(Intercept) 19.15079365 19.150793651
age          0.81375776  0.941990017
weight      -2.83261161 -1.944588412
height      -0.11466232 -0.313666216
neck        -1.15582043 -1.182543415
chest       -0.14487500 -0.009673795
abdomen     10.29780784  9.416114940
hip         -1.35104126 -1.197685531
thigh        1.30382219  1.227323244
knee         0.03363573  0.027303926
ankle        0.30149592  0.235719800
biceps       0.55078084  0.461889816
forearm      0.92090523  0.891302127
wrist       -1.54461619 -1.592169696
The decreases in the parameter estimates, most notably for abdomen and weight, allow for
nominal increases in some of the parameter estimates for the other predictors.
Unfortunately the ridge regression routines in these packages do not allow for easy
extraction of the fitted values and residuals from the fit. It is not hard to write a simple
function that will return the fitted values from a lm.ridge fit.

ridgefitted = function(fit,xmat) {
   p = length(coef(fit))
   fitted = coef(fit)[1] + xmat%*%coef(fit)[2:p]   # intercept + X %*% slope estimates
   fitted
 }
> ridge4fit = ridgefitted(bodyfat.ridge4,bodyfat.Xs)
> plot(bodyfat,ridge4fit,xlab="Bodyfat",ylab="Fitted Values from Ridge Regression")
> ridge4resid = bodyfat - ridge4fit
> plot(ridge4fit,ridge4resid,xlab="Fitted Values",ylab="Ridge Residuals")
Ridge Regression using glmnet()
(Friedman, Hastie, Tibshirani 2013)
The glmnet package contains the function glmnet(), which can be used to fit both the
ridge regression and the Lasso models discussed in the next section. This function has a
natural predict() method, so obtaining fitted values and making predictions is easier
than in the functions used above.

We again return to the body fat example. The authors also present another example of
ridge regression in Lab 2 of Chapter 6, beginning on pg. 251, using data on baseball hitters
and their salaries.

The function glmnet() does not use standard formula conventions for developing
models. Instead we form a model matrix (X) that contains the predictors/terms as
columns and the response vector y, and use them as arguments to the function. The
columns of X must be numeric, so any categorical variables will need to be converted to
dummy variables first. This is easily achieved by using the model.matrix() function.
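As a quick illustration of what model.matrix() does with a categorical variable, the sketch below
builds a tiny data frame with a factor; the data are made up purely to show the dummy coding.

# Toy example: model.matrix() expands the factor 'sex' into a 0/1 dummy column.
toy = data.frame(y = c(10, 12, 9, 15),
                 age = c(34, 41, 29, 50),
                 sex = factor(c("M", "F", "F", "M")))
model.matrix(y ~ ., data = toy)        # columns: (Intercept), age, sexM
model.matrix(y ~ ., data = toy)[,-1]   # drop the intercept column, as done below for glmnet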
For this example we will use a driver seat position data set found in the faraway package from
CRAN. The response is a numeric measurement of the driver's hip position when sitting in the driver's seat.
> library(faraway)   ← you need to install it first!
> names(seatpos)
[1] "Age"
"Weight"
[9] "hipcenter"
"HtShoes"
"Ht"
"Seated"
"Arm"
"Thigh"
"Leg"
> summary(seatpos)
      Age            Weight         HtShoes            Ht             Seated           Arm            Thigh            Leg          hipcenter
 Min.   :19.00   Min.   :100.0   Min.   :152.8   Min.   :150.2   Min.   : 79.40   Min.   :26.00   Min.   :31.00   Min.   :30.20   Min.   :-279.15
 1st Qu.:22.25   1st Qu.:131.8   1st Qu.:165.7   1st Qu.:163.6   1st Qu.: 85.20   1st Qu.:29.50   1st Qu.:35.73   1st Qu.:33.80   1st Qu.:-203.09
 Median :30.00   Median :153.5   Median :171.9   Median :169.5   Median : 89.40   Median :32.00   Median :38.55   Median :36.30   Median :-174.84
 Mean   :35.26   Mean   :155.6   Mean   :171.4   Mean   :169.1   Mean   : 88.95   Mean   :32.22   Mean   :38.66   Mean   :36.26   Mean   :-164.88
 3rd Qu.:46.75   3rd Qu.:174.0   3rd Qu.:177.6   3rd Qu.:175.7   3rd Qu.: 91.62   3rd Qu.:34.48   3rd Qu.:41.30   3rd Qu.:38.33   3rd Qu.:-119.92
 Max.   :72.00   Max.   :293.0   Max.   :201.2   Max.   :198.4   Max.   :101.60   Max.   :39.60   Max.   :45.50   Max.   :43.10   Max.   : -30.95
> pairs.plus(seatpos)
> hip.ols = lm(hipcenter~.,data=seatpos)
> summary(hip.ols)
Call:
lm(formula = hipcenter ~ ., data = seatpos)

Residuals:
    Min      1Q  Median      3Q     Max
-73.827 -22.833  -3.678  25.017  62.337

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 436.43213  166.57162   2.620   0.0138 *
Age           0.77572    0.57033   1.360   0.1843
Weight        0.02631    0.33097   0.080   0.9372
HtShoes      -2.69241    9.75304  -0.276   0.7845
Ht            0.60134   10.12987   0.059   0.9531
Seated        0.53375    3.76189   0.142   0.8882
Arm          -1.32807    3.90020  -0.341   0.7359
Thigh        -1.14312    2.66002  -0.430   0.6706
Leg          -6.43905    4.71386  -1.366   0.1824
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 37.72 on 29 degrees of freedom
Multiple R-squared: 0.6866,    Adjusted R-squared: 0.6001
F-statistic: 7.94 on 8 and 29 DF,  p-value: 1.306e-05
> attach(seatpos)
> VIF(hip.ols)
Variance Inflation Factor Table
        Variable        VIF  Rsquared
Age          Age   1.997931 0.4994823
Weight    Weight   3.647030 0.7258043
HtShoes  HtShoes 307.429378 0.9967472
Ht            Ht 333.137832 0.9969982
Seated    Seated   8.951054 0.8882813
Arm          Arm   4.496368 0.7775983
Thigh      Thigh   2.762886 0.6380596
Leg          Leg   6.694291 0.8506190
As stated above it is imperative when performing ridge regression (or any other
regularized regression method) that we scale the terms to have mean 0 and variance 1.
We will form a new data frame in R containing the seat position data with the predictors
scaled.
> X = model.matrix(hipcenter~.,data=seatpos)[,-1]
> X = scale(X)
> summary(X)
      Age               Weight            HtShoes              Ht               Seated             Arm               Thigh              Leg
 Min.   :-1.0582   Min.   :-1.55477   Min.   :-1.66748   Min.   :-1.69012   Min.   :-1.93695   Min.   :-1.8436   Min.   :-1.97556   Min.   :-1.78135
 1st Qu.:-0.8467   1st Qu.:-0.66743   1st Qu.:-0.50810   1st Qu.:-0.49307   1st Qu.:-0.76091   1st Qu.:-0.8055   1st Qu.:-0.75620   1st Qu.:-0.72367
 Median :-0.3425   Median :-0.05957   Median : 0.05028   Median : 0.03721   Median : 0.09071   Median :-0.0640   Median :-0.02716   Median : 0.01082
 Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000
 3rd Qu.: 0.7474   3rd Qu.: 0.51335   3rd Qu.: 0.55484   3rd Qu.: 0.59434   3rd Qu.: 0.54187   3rd Qu.: 0.6701   3rd Qu.: 0.68252   3rd Qu.: 0.60577
 Max.   : 2.3904   Max.   : 3.83913   Max.   : 2.67401   Max.   : 2.62373   Max.   : 2.56446   Max.   : 2.1902   Max.   : 1.76639   Max.   : 2.00866
> var(X)
                Age     Weight     HtShoes          Ht     Seated        Arm      Thigh         Leg
Age      1.00000000 0.08068523 -0.07929694 -0.09012812 -0.1702040 0.3595111 0.09128584 -0.04233121
Weight   0.08068523 1.00000000  0.82817733  0.82852568  0.7756271 0.6975524 0.57261442  0.78425706
HtShoes -0.07929694 0.82817733  1.00000000  0.99814750  0.9296751 0.7519530 0.72486225  0.90843341
Ht      -0.09012812 0.82852568  0.99814750  1.00000000  0.9282281 0.7521416 0.73496041  0.90975238
Seated  -0.17020403 0.77562705  0.92967507  0.92822805  1.0000000 0.6251964 0.60709067  0.81191429
Arm      0.35951115 0.69755240  0.75195305  0.75214156  0.6251964 1.0000000 0.67109849  0.75381405
Thigh    0.09128584 0.57261442  0.72486225  0.73496041  0.6070907 0.6710985 1.00000000  0.64954120
Leg     -0.04233121 0.78425706  0.90843341  0.90975238  0.8119143 0.7538140 0.64954120  1.00000000
> seatpos.scale = data.frame(hip=seatpos$hipcenter,X)
> names(seatpos.scale)
[1] "hip"
"Age"
"Weight"
"HtShoes" "Ht"
"Seated"
"Arm"
"Thigh"
"Leg"
> hip.ols = lm(hip~.,data=seatpos.scale)
> summary(hip.ols)
Call:
lm(formula = hip ~ ., data = seatpos.scale)

Residuals:
    Min      1Q  Median      3Q     Max
-73.827 -22.833  -3.678  25.017  62.337

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -164.8849     6.1190 -26.946   <2e-16 ***
Age           11.9218     8.7653   1.360    0.184
Weight         0.9415    11.8425   0.080    0.937
HtShoes      -30.0157   108.7294  -0.276    0.784
Ht             6.7190   113.1843   0.059    0.953
Seated         2.6324    18.5529   0.142    0.888
Arm           -4.4775    13.1494  -0.341    0.736
Thigh         -4.4296    10.3076  -0.430    0.671
Leg          -21.9165    16.0445  -1.366    0.182
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 37.72 on 29 degrees of freedom
Multiple R-squared: 0.6866,    Adjusted R-squared: 0.6001
F-statistic: 7.94 on 8 and 29 DF,  p-value: 1.306e-05
> attach(seatpos.scale)
> VIF(hip.ols)
Variance Inflation Factor Table
        Variable        VIF  Rsquared
Age          Age   1.997931 0.4994823
Weight    Weight   3.647030 0.7258043
HtShoes  HtShoes 307.429378 0.9967472
Ht            Ht 333.137832 0.9969982
Seated    Seated   8.951054 0.8882813
Arm          Arm   4.496368 0.7775983
Thigh      Thigh   2.762886 0.6380596
Leg          Leg   6.694291 0.8506190
> detach(seatpos.scale)
Rescaling the X’s does not change the model performance in any way. The p-values, R2,
RSS, VIF’s, etc. are all the same. The only changes are the estimated regression
coefficients. We now consider fitting a ridge regression model to these data.
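One quick way to convince yourself of this is to check that the two fits produce identical fitted
values. A minimal sketch (hip.ols.orig is my own name for a refit of the unscaled model, since
hip.ols now refers to the scaled fit):

# Fitted values (and hence residuals, R-squared, F, etc.) are unchanged by
# rescaling the predictors; only the coefficients change.
hip.ols.orig = lm(hipcenter ~ ., data = seatpos)
all.equal(as.numeric(fitted(hip.ols.orig)), as.numeric(fitted(hip.ols)))   # TRUE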
> X = model.matrix(hip~.,data=seatpos.scale)[,-1]
> y = seatpos.scale$hip
> library(glmnet)
> grid = 10^seq(10,-2,length=100)   ← set up a wide range of λ values
> grid
  [1] 1.000000e+10 7.564633e+09 5.722368e+09 4.328761e+09 3.274549e+09 2.477076e+09 1.873817e+09
  [8] 1.417474e+09 1.072267e+09 8.111308e+08 6.135907e+08 4.641589e+08 3.511192e+08 2.656088e+08
 [15] 2.009233e+08 1.519911e+08 1.149757e+08 8.697490e+07 6.579332e+07 4.977024e+07 3.764936e+07
 [22] 2.848036e+07 2.154435e+07 1.629751e+07 1.232847e+07 9.326033e+06 7.054802e+06 5.336699e+06
 [29] 4.037017e+06 3.053856e+06 2.310130e+06 1.747528e+06 1.321941e+06 1.000000e+06 7.564633e+05
 [36] 5.722368e+05 4.328761e+05 3.274549e+05 2.477076e+05 1.873817e+05 1.417474e+05 1.072267e+05
 [43] 8.111308e+04 6.135907e+04 4.641589e+04 3.511192e+04 2.656088e+04 2.009233e+04 1.519911e+04
 [50] 1.149757e+04 8.697490e+03 6.579332e+03 4.977024e+03 3.764936e+03 2.848036e+03 2.154435e+03
 [57] 1.629751e+03 1.232847e+03 9.326033e+02 7.054802e+02 5.336699e+02 4.037017e+02 3.053856e+02
 [64] 2.310130e+02 1.747528e+02 1.321941e+02 1.000000e+02 7.564633e+01 5.722368e+01 4.328761e+01
 [71] 3.274549e+01 2.477076e+01 1.873817e+01 1.417474e+01 1.072267e+01 8.111308e+00 6.135907e+00
 [78] 4.641589e+00 3.511192e+00 2.656088e+00 2.009233e+00 1.519911e+00 1.149757e+00 8.697490e-01
 [85] 6.579332e-01 4.977024e-01 3.764936e-01 2.848036e-01 2.154435e-01 1.629751e-01 1.232847e-01
 [92] 9.326033e-02 7.054802e-02 5.336699e-02 4.037017e-02 3.053856e-02 2.310130e-02 1.747528e-02
 [99] 1.321941e-02 1.000000e-02
> ridge.mod = glmnet(X,y,alpha=0,lambda=grid)   ← alpha = 0 for ridge, alpha = 1 for the Lasso
> dim(coef(ridge.mod))   ← 100 columns of parameter estimates, one for each lambda in our sequence
[1]   9 100
> coef(ridge.mod)[,1]   ← coefficients for the largest λ in grid (10^10)
  (Intercept)           Age        Weight       HtShoes            Ht        Seated           Arm         Thigh           Leg
-1.648849e+02  7.202949e-08 -2.248007e-07 -2.796599e-07 -2.804783e-07 -2.567201e-07 -2.054084e-07 -2.075522e-07 -2.763501e-07
When lambda is very large we see that the parameter estimates are near 0 and the
intercept estimate is approximately equal to the mean of the response (𝑦̅).
> mean(y)
[1] -164.8849
> coef(ridge.mod)[,100]   ← coefficients for the smallest λ in grid (0.01)
 (Intercept)          Age       Weight      HtShoes           Ht       Seated          Arm        Thigh          Leg
-164.8848684   11.7788069    0.9953047  -23.0463404   -0.2996308    2.4814234   -4.4305306   -4.2792243  -21.9175012
When lambda is near 0, we see that the coefficients do not differ much from the OLS
regression parameter estimates which are shown below.
> coef(hip.ols)
 (Intercept)          Age       Weight      HtShoes           Ht       Seated          Arm        Thigh          Leg
-164.8848684   11.9218052    0.9415132  -30.0156578    6.7190129    2.6323517   -4.4775359   -4.4295690  -21.9165049
We can see this shrinkage of the coefficients graphically by plotting the results.
> plot(ridge.mod,xvar="lambda")
What value of λ should we use to obtain the "best" ridge regression model? In the code
below we form a training data set consisting of 75% of the original data set and use the
remaining cases as test cases. We then look at the mean PSE for various choices of λ by
setting the argument s = λ in the predict() function call.
> train = sample(n,floor(n*p))
> train
 [1] 15 23  4 28 20  2  9 35 21 25 22 31 34 18 32  7 16 27 26 36 29  5  8 19 12 13 17 11
> test = (-train)
> ridge.mod = glmnet(X[train,],y[train],alpha=0,lambda=grid)
> ridge.pred = predict(ridge.mod,s=1000,newx=X[test,])   ← λ = 1000
> PSE = mean((ridge.pred-y[test])^2)
> PSE
[1] 3226.455
> ridge.pred = predict(ridge.mod,s=100,newx=X[test,])
> PSE = mean((ridge.pred-y[test])^2)
> PSE
[1] 1556.286
> ridge.pred = predict(ridge.mod,s=10,newx=X[test,])   ← λ = 10
> PSE = mean((ridge.pred-y[test])^2)
> PSE
[1] 1334.643
> ridge.pred = predict(ridge.mod,s=5,newx=X[test,])   ← λ = 5
> PSE = mean((ridge.pred-y[test])^2)
> PSE
[1] 1336.778
> ridge.pred = predict(ridge.mod,s=1,newx=X[test,])
> PSE = mean((ridge.pred-y[test])^2)
> PSE
[1] 1353.926
It appears a λ value between 5 and 10 is optimal for this particular train/test set
combination. What if we use different train/test sets?
> set.seed(1)
> train = sample(n,floor(n*p))
> ridge.mod = glmnet(X[train,],y[train],lambda=grid,alpha=0)
> ridge.pred = predict(ridge.mod,s=1000,newx=X[test,])
> PSE = mean((ridge.pred - y[test])^2)
> PSE
[1] 2983.198
> ridge.pred = predict(ridge.mod,s=100,newx=X[test,])
> PSE = mean((ridge.pred - y[test])^2)
> PSE
[1] 1321.951
> ridge.pred = predict(ridge.mod,s=50,newx=X[test,])
> PSE = mean((ridge.pred - y[test])^2)
> PSE
[1] 1304.09
> ridge.pred = predict(ridge.mod,s=25,newx=X[test,])
> PSE = mean((ridge.pred - y[test])^2)
> PSE
[1] 1375.791
> ridge.pred = predict(ridge.mod,s=10,newx=X[test,])
> PSE = mean((ridge.pred - y[test])^2)
> PSE
[1] 1527.045
Now it appears that the "optimal" λ is somewhere between 25 and 50?

We can use cross-validation to choose an "optimal" λ for prediction purposes. The
function cv.glmnet() uses 10-fold cross-validation to find an optimal value for λ.
> cv.out = cv.glmnet(X[train,],y[train],alpha=0)
Warning message:
Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold
This dataset is too small to use 10-fold cross-validation on as the sample size n = 38!
> plot(cv.out)
> cv.out$lambda.min
[1] 36.58695
> bestlam = cv.out$lambda.min
> ridge.best = glmnet(X[train,],y[train],alpha=0,lambda=bestlam)
> ridge.pred = predict(ridge.best,newx=X[test,])
> PSE = mean((ridge.pred-y[test])^2)
> PSE
[1] 1328.86
> coef(ridge.best)
9 x 1 sparse Matrix of class "dgCMatrix"
                     s0
(Intercept) -165.669959
Age            9.954727
Weight        -1.937062
HtShoes      -10.000879
Ht           -10.249189
Seated        -4.228238
Arm           -4.404729
Thigh         -4.954140
Leg          -10.749589
> coef(hip.ols)
 (Intercept)          Age       Weight      HtShoes           Ht       Seated          Arm        Thigh          Leg
-164.8848684   11.9218052    0.9415132  -30.0156578    6.7190129    2.6323517   -4.4775359   -4.4295690  -21.9165049
2.2 - The Lasso
The lasso is another shrinkage method like ridge, but it uses an L1-norm based penalty.
The parameter estimates are chosen according to the following:

$$\hat{\beta}^{lasso} = \min_{\beta}\left\{\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{k} u_{ij}\beta_j\right)^2\right\} \quad subject\ to \quad \sum_{j=1}^{k}|\beta_j| \le t$$

Here t > 0 is the complexity parameter that controls the amount of shrinkage: the smaller
t, the greater the amount of shrinkage. As with ridge regression, the intercept is not
included in the shrinkage and will be estimated as the mean of the response. If t is
chosen larger than $t_0 = \sum_{j=1}^{k}|\hat{\beta}_j^{ls}|$ then there will be no shrinkage and the lasso
estimates will be the same as the OLS estimates. If $t = t_0/2$ then the OLS estimates will
be shrunk by about 50%, however this is not to say that $\hat{\beta}_j^{lasso} = \hat{\beta}_j^{ls}/2$. The shrinkage
can result in some parameters being zeroed, essentially dropping the associated predictor
from the model, as the figure below shows. Here the lasso estimate is $\hat{\beta}_1^{lasso} = 0$.
[Figure: contours of the OLS criterion centered at the usual OLS estimate $(\hat{\beta}_1, \hat{\beta}_2)$,
shown with the diamond-shaped constraint region $\sum_{j=1}^{2}|\beta_j| = |\beta_1| + |\beta_2| \le s$.
The lasso estimate $(\hat{\beta}_1^{lasso}, \hat{\beta}_2^{lasso})$ occurs at a corner of the diamond, where $\hat{\beta}_1^{lasso} = 0$.]
We return again to the body fat example and look at the use of the Lasso to build a
model for the body fat.

> X = model.matrix(bodyfat~.,data=Bodyfat)[,-1]
> y = Bodyfat$bodyfat
> n = nrow(X)
> p = .667
> set.seed(1)
> train = sample(n,floor(n*p))
> test = (-train)
> grid = 10^seq(10,-2,length=100)
> lasso.mod = glmnet(X[train,],y[train],alpha=1,lambda=grid)
> plot(lasso.mod)
> plot(lasso.mod,xvar="lambda")
> set.seed(1)
> cv.out = cv.glmnet(X[train,],y[train],alpha=1)
> plot(cv.out)
> bestlam.lasso = cv.out$lambda.min
> bestlam.lasso
[1] 0.1533247
Use the test set to obtain an estimate of the PSE for the Lasso
===========================================================================================
> lasso.mod = glmnet(X[train,],y[train],alpha=1,lambda=bestlam.lasso)
> lasso.pred = predict(lasso.mod,newx=X[test,])
> PSE = mean((lasso.pred-y[test])^2)
> PSE
[1] 23.72193
Use the same 10-fold cross-validation to estimate the optimal λ for ridge regression. Then estimate the PSE
using the same test data as for the Lasso. Compare the mean PSE values.
============================================================================================
> set.seed(1)
> cv.out = cv.glmnet(X[train,],y[train],alpha=0)
> bestlam.ridge = cv.out$lambda.min
> bestlam.ridge
[1] 0.6335397
> ridge.mod = glmnet(X[train,],y[train],alpha=0,lambda=bestlam.ridge)
> ridge.pred = predict(ridge.mod,newx=X[test,])
> PSE = mean((ridge.pred - y[test])^2)
> PSE
[1] 26.11665
Comparing the coefficient estimates from Lasso, ridge regression, and OLS. Also compare PSE for test data.
===============================================================================================
> coef(lasso.mod)
                     s0
(Intercept)  1.27888249
age          0.08947089
weight       .
height      -0.28803077
neck        -0.39922361
chest        .
abdomen      0.67740803
hip          .
thigh        .
knee         .
ankle        .
biceps       .
forearm      0.34448133
wrist       -1.27946216
> coef(ridge.mod)
                      s0
(Intercept) -4.956145707
age          0.130174182
weight      -0.005247158
height      -0.310172813
neck        -0.452885891
chest        0.159678718
abdomen      0.467277929
hip          0.003963329
thigh        0.189565205
knee         0.057918646
ankle        0.043846187
biceps       0.022254664
forearm      0.348491036
wrist       -1.470360498
> temp = data.frame(bodyfat = y[train],X[train,])
> head(temp)
    bodyfat age weight height neck chest abdomen   hip thigh knee ankle biceps forearm wrist
67     21.5  54 151.50  70.75 35.6  90.0    83.9  93.9  55.0 36.1  21.7   29.6    27.4  17.4
94     24.9  46 192.50  71.75 38.0 106.6    97.5 100.6  58.9 40.5  24.5   33.3    29.6  19.1
144     9.4  23 159.75  72.25 35.5  92.1    77.1  93.9  56.1 36.1  22.7   30.5    27.2  18.2
227    14.8  55 169.50  68.25 37.2 101.7    91.1  97.1  56.6 38.5  22.6   33.4    29.3  18.8
51     10.2  47 158.25  72.25 34.9  90.2    86.7  98.3  52.6 37.2  22.4   26.0    25.8  17.3
222    26.0  54 230.00  72.25 42.5 119.9   110.4 105.5  64.2 42.7  27.0   38.4    32.0  19.6
> ols.mod = lm(bodyfat~.,data=temp)
> coef(ols.mod)
 (Intercept)          age       weight       height         neck        chest      abdomen          hip        thigh         knee        ankle       biceps      forearm        wrist
-40.65178764   0.09175572  -0.15569221   0.06515098  -0.41393595   0.10173785   0.92607342  -0.18562568   0.37387418  -0.03849608   0.36585984   0.11606918   0.44247339  -1.54993981

> ols.step = step(ols.mod)
> coef(ols.step)
 (Intercept)          age       weight         neck      abdomen          hip        thigh      forearm        wrist
-24.91558645   0.09187304  -0.10466396  -0.46132959   0.93852739  -0.24119508   0.38608812   0.51997961  -1.33498663
> ols.pred = predict(ols.mod,newdata=Bodyfat[test,])
> PSE = mean((ols.pred-y[test])^2)
> PSE
[1] 23.39602
> ols.pred2 = predict(ols.step,newdata=Bodyfat[test,])
> PSE = mean((ols.pred2-y[test])^2)
> PSE
[1] 22.8308
For these data we see the three approaches differ in their results. The lasso zeroes out
some coefficients, thus completely eliminating some terms from the model. Ridge
will shrink coefficients down to very near zero, effectively eliminating them, but
technically zeroes none of them. Stepwise selection in OLS is either in or out, so some
terms get zeroed and some don't; however, there is no shrinkage of the estimated coefficients. A
good question to ask is "how do these methods cross-validate for making future
predictions?" We can use cross-validation methods to compare these competing
models via estimates of the PSE.
Monte Carlo Cross-Validation of OLS Regression Models
> ols.mccv = function(fit,p=.667,B=100) {
    cv <- rep(0,B)
    y = fit$model[,1]
    x = fit$model[,-1]
    data = fit$model
    n = nrow(data)
    for (i in 1:B) {
       ss <- floor(n*p)
       sam <- sample(1:n,ss,replace=F)
       fit2 <- lm(formula(fit),data=data[sam,])
       ypred <- predict(fit2,newdata=x[-sam,])
       cv[i] <- mean((y[-sam]-ypred)^2)
    }
    cv
  }
Monte Carlo Cross-Validation of Ridge and Lasso Regression
> glmnet.mccv = function(X,y,alpha=0,lambda=1,p=.667,B=100) {
    cv <- rep(0,B)
    n = nrow(X)
    for (i in 1:B) {
       ss <- floor(n*p)
       sam <- sample(n,ss,replace=F)
       fit <- glmnet(X[sam,],y[sam],alpha=alpha,lambda=lambda)   # pass alpha through to glmnet
       ypred <- predict(fit,newx=X[-sam,])
       cv[i] <- mean((y[-sam]-ypred)^2)
    }
    cv
  }
> set.seed(1)
> rr.cv = glmnet.mccv(X,y,alpha=0,lambda=.634)
> Statplot(rr.cv)
> mean(rr.cv)
[1] 21.65482
> sd(rr.cv)
[1] 2.847533
> set.seed(1)
> lass.cv = glmnet.mccv(X,y,alpha=1,lambda=.153)
> mean(lass.cv)
[1] 20.30297
> sd(lass.cv)
[1] 2.601356
> ols.scale = lm(bodyfat~.,data=Bodyfat.scale)
> ols.results = ols.mccv(ols.scale)
> mean(ols.results)
[1] 20.68592
> sd(ols.results)
[1] 2.737272
> Statplot(ols.results)
> ols.scalestep = step(ols.scale)
> ols.results = ols.mccv(ols.scalestep)
> mean(ols.results)
[1] 19.72026
> sd(ols.results)
[1] 2.185153
> Statplot(ols.results)
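To see the four PSE distributions side by side, a simple boxplot comparison is one option. This is only a sketch; it assumes the two OLS cross-validation runs above are saved under separate names (here ols.cv and step.cv) rather than both being stored in ols.results.
> ols.cv  = ols.mccv(ols.scale)        # full OLS model
> step.cv = ols.mccv(ols.scalestep)    # stepwise-reduced model
> boxplot(list(Ridge=rr.cv,Lasso=lass.cv,OLS=ols.cv,Step=step.cv),
+         ylab="MCCV estimate of PSE")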
1.3 - Least Angle Regression (LAR) – FYI only!
The lars function in the library of the same name will perform least angle regression, which is another shrinkage method for fitting regression models.
lars(x, y, type = c("lasso", "lar", "forward.stagewise", "stepwise"))
lar = Least Angle Regression (LAR) – see the algorithm and diagram on the next page
forward.stagewise = Forward Stagewise selection
stepwise = forward stepwise selection (classic method)
For lasso regression use the glmnet function rather than the lars implementation.
http://www-stat.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf
LAR Algorithm
As seen below, the LAR and Forward Stagewise selection methods produce models very similar to the lasso for these data. Good advice would be to try them all, plot the results, and examine them for any large differences.
The usability of the results from lars is an issue: extracting fitted values and residuals and making predictions from a lars fit is cumbersome, but definitely doable.
> X = model.matrix(bodyfat~.,data=Bodyfat)[,-1]
> y = Bodyfat$bodyfat
> bodyfat.lars = lars(X,y,type="lar")
> plot(bodyfat.lars)
> summary(bodyfat.lars)
LARS/LAR
Call: lars(x = X, y = y, type = "lar")
   Df     Rss      Cp
0   1 17579.0 696.547
1   2  6348.3  93.824
2   3  5999.8  77.062
3   4  5645.1  59.963
4   5  5037.4  29.241
5   6  4998.8  29.164
6   7  4684.9  14.262
7   8  4678.3  15.905
8   9  4658.4  16.831
9  10  4644.8  18.099
10 11  4516.3  13.183
11 12  4427.8  10.416
12 13  4421.5  12.079
13 14  4420.1  14.000
> fit = predict.lars(bodyfat.lars,X,s=11)
> fit = predict.lars(bodyfat.lars,X,s=6)
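Pulling coefficients, fitted values, and residuals out of the lars object takes a few extra steps. The lines below are a sketch using predict.lars and coef from the lars package; s = 11 follows the choice above, and mode = "step" (the default) indexes positions along the LAR path.
> coef(bodyfat.lars,s=11,mode="step")                      # coefficients at that point on the path
> yhat = predict(bodyfat.lars,newx=X,s=11,mode="step")$fit # fitted values
> resid = y - yhat                                         # residuals
For new cases, newx would be a matrix of new predictor values with the same columns as X.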
1.5 - Principal Component Regression (PCR) & Partial Least Squares (PLS)
Multivariate regression methods like principal component regression (PCR) and partial least squares regression (PLSR) enjoy large popularity in a wide range of fields, including the natural sciences. The main reason is that they have been designed to confront the situation where there are many, generally correlated, predictor variables and relatively few samples – a situation that is common, especially in chemistry, where developments in spectroscopy allow hundreds of spectral readings to be obtained on a single sample. In these situations n << p, thus some form of dimension reduction in the predictor space is necessary.
Principal components analysis is a dimension reduction technique where p independent (orthogonal) linear combinations of the numeric input variables 𝑋1 , 𝑋2 , … , 𝑋𝑝 are formed so that the first linear combination accounts for as much of the total variation in the original data as possible. The 2nd linear combination accounts for as much of the remaining variation in the data as possible subject to the constraint that it is orthogonal to the first linear combination, etc. Generally the variables are all scaled to have mean 0 and variance 1 (denoted 𝑋𝑗∗ ), thus the total variation in the scaled data is given by
$$\sum_{j=1}^{p} V(X_j^{*}) \;=\; p \;=\; \sum_{j=1}^{p} V(Z_j)$$
where,
$$Z_1 = a_{11}X_1^{*} + a_{12}X_2^{*} + \cdots + a_{1p}X_p^{*}$$
$$Z_2 = a_{21}X_1^{*} + a_{22}X_2^{*} + \cdots + a_{2p}X_p^{*}$$
$$\vdots$$
$$Z_p = a_{p1}X_1^{*} + a_{p2}X_2^{*} + \cdots + a_{pp}X_p^{*}$$
and
$$Cov(Z_i, Z_j) = Corr(Z_i, Z_j) = 0 \text{ for } i \neq j$$
The linear combinations are determined by the spectral decomposition (i.e. finding the eigenvalues and eigenvectors) of the sample correlation matrix $R$, and the variance of the $j^{th}$ principal component $Z_j$ is
$$V(Z_j) = \lambda_j = \text{the } j^{th} \text{ largest eigenvalue of } R$$
and the coefficients of the linear combination are
$$(a_{j1}, a_{j2}, \ldots , a_{jp}) = \text{the eigenvector of } R \text{ corresponding to } \lambda_j$$
Ideally the first k principal components will account for a sizeable percentage of the total variation in these data. We can then use these k principal components, $Z_1, Z_2, \ldots , Z_k$, as predictors in the multiple regression model below:
$$E(Y|X_1, X_2, \ldots , X_p) = \beta_0 + \sum_{j=1}^{k} \beta_j Z_j \quad \text{and} \quad V(Y|X_1, \ldots , X_p) = \sigma^2$$
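Before turning to the yarn data, a minimal PCR sketch on the bodyfat predictors shows the idea (X and y as defined earlier; keeping k = 3 components is purely illustrative):
> pc = prcomp(X,scale.=TRUE)   # principal components of the standardized predictors
> summary(pc)                  # proportion of variance explained by each component
> Zk = pc$x[,1:3]              # scores on the first k = 3 components
> pcr.fit = lm(y~Zk)           # regress the response on the retained components
> summary(pcr.fit)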
Yarn Data
These data were obtained from a calibration study of polyethylene terephthalate (PET) yarns, which are used for textile and industrial purposes. PET yarns are produced by a process of melt-spinning, whose settings largely determine the final semi-crystalline structure of the yarn, which, in turn, determines its physical properties; these physical properties are important quality parameters for the end use of the yarn.
Raman near-infrared (NIR) spectroscopy has recently become an important tool in the
pharmaceutical and semiconductor industries for investigating structural information on
polymers; in particular, it is used to reveal information about the chemical nature,
conformational order, state of the order, and orientation of polymers. Thus, Raman
spectra are used to predict the physical characteristics of polymers.
In this example, we study the relationship between the overall density of a PET yarn and its NIR spectrum. The data consist of a sample of n = 21 PET yarns having known
mechanical and structural properties. For each PET yarn, the Y-variable is the density
(kg/m3) of the yarn, and the p = 268 X-variables (measured at 268 frequencies in the
range 598 – 1900 cm-1 ) are selected from the NIR spectrum of that yarn. Thus n << p!!
Many of the X-variables are highly correlated as the scatterplot matrices on the following
pages clearly show.
Scatterplot matrices of variables 𝑋1 , … , 𝑋10 and 𝑋31 , … , 𝑋40
Obviously all 268 variables contain similar information, and therefore we should be able to use principal components effectively to reduce the dimensionality of the spectral data.
Form principal components for the PET yarn data
First load package pls which contains the yarn data and routines for both principal
component regression (PCR) and partial least squares (PLS) regression.
> library(pls)
> Yarn = yarn[1:21,]
> R = cor(Yarn$NIR)
> eigenR = eigen(R)
> attributes(eigenR)
$names
[1] "values" "vectors"
Only the first four PC’s have variance larger than a single scaled variable, i.e. there are four eigenvalues greater than 1.0. We will now form 𝑍1 , 𝑍2 , 𝑍3 , 𝑍4 using the corresponding eigenvectors.
> z1 = scale(Yarn$NIR)%*%eigenR$vectors[,1]
> z2 = scale(Yarn$NIR)%*%eigenR$vectors[,2]
> z3 = scale(Yarn$NIR)%*%eigenR$vectors[,3]
> z4 = scale(Yarn$NIR)%*%eigenR$vectors[,4]
> YarnPC = data.frame(density=Yarn$density,z1,z2,z3,z4)
> yarn.pcr = lm(density~.,data=YarnPC)
> summary(yarn.pcr)
Call:
lm(formula = density ~ ., data = YarnPC)
Residuals:
   Min     1Q Median     3Q    Max
-2.106 -0.522  0.246  0.632  1.219

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  33.6200     0.2249   149.5  < 2e-16 ***
z1           -2.3062     0.0194  -118.7  < 2e-16 ***
z2            0.6602     0.0261    25.3  2.5e-14 ***
z3            1.8044     0.0341    52.8  < 2e-16 ***
z4            0.9759     0.1434     6.8  4.2e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.03 on 16 degrees of freedom
Multiple R-squared: 0.999,     Adjusted R-squared: 0.999
F-statistic: 4.39e+03 on 4 and 16 DF, p-value: <2e-16
> pairs.plus(YarnPC)
The marginal response plots look rather interesting (see the highlighted rectangle in the scatterplot matrix above). The residuals look surprisingly good considering the nonlinear relationships displayed in the marginal response plots.
Using the pcr command from the pls package.
> yarn.pcr2 = pcr(density~scale(NIR),data=yarn[1:21,],ncomp=6,validation="CV")
> summary(yarn.pcr2)
Data:   X dimension: 21 268
        Y dimension: 21 1
Fit method: svdpc
Number of components considered: 6

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV           31.31    15.38    14.29    2.392    1.312    1.167   0.9461
adjCV        31.31    15.39    14.46    2.358    1.288    1.145   0.9336

TRAINING: % variance explained
         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
X          52.49    81.59    98.57    99.54    99.75    99.86
density    80.13    83.77    99.65    99.91    99.93    99.95
> loadingplot(yarn.pcr2,comps=1:4,legendpos="topright")
The plot above shows the weight assigned to each variable on the first four PC’s. The solid line shows the weights assigned to the variables on the first principal component. Identifying important individual spectra will be very difficult, but you can identify ranges that appear important for each component.
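One way to hunt for such ranges numerically rather than visually is sketched below; it uses the loadings extractor from the pls package, and looking at the 10 largest loadings is an arbitrary choice.
> L1 = loadings(yarn.pcr2)[,1]                # loadings of the 268 frequencies on the 1st component
> head(order(abs(L1),decreasing=TRUE),10)     # indices of the 10 largest |loadings|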
Extract fitted values from 4-component fit
> fit = fitted(yarn.pcr2)[,,4]
> plot(Yarn$density,fit)
> predplot(yarn.pcr2,ncomp=1:6)
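A quick numerical check of the 4-component fit on the training data (a sketch using the fitted values extracted above):
> sqrt(mean((Yarn$density - fit)^2))   # training RMSE of the 4-component fit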
> corrplot(yarn.pcr2,comps=1:4)
This plot displays the correlation of each of the 268 variables with each of the first four principal components.
> YarnTest = yarn[22:28,]
> predict(yarn.pcr2,ncomp=4,newdata=YarnTest)
, , 4 comps

        density
          50.95
          50.97
          31.92
          34.77
          30.72
          19.93
          19.37
> YarnTest$density
[1] 51.04 50.32 32.14 34.69 30.30 20.45 20.06
Partial Least Squares (PLS) Algorithm
While PCR focuses on the covariance structure of the X’s independent of the response Y, partial least squares (PLS) looks at the covariance structure of the X’s and the response Y jointly. The algorithm for PLS is shown below.
To better understand the PLS algorithm consider the simple example below.
Generate some data
y = rnorm(100)
y = y - mean(y)                      # center the response
x1 = rnorm(100)
x1 = (x1 - mean(x1))/sd(x1)          # standardize the first predictor
x2 = y + x1 + rnorm(100)
x2 = (x2 - mean(x2))/sd(x2)          # standardize the second predictor
phi1 = sum(y*x1)                     # weights: inner products of y with each predictor
phi2 = sum(y*x2)
z1 = phi1*x1 + phi2*x2               # first PLS component
z1 = (z1 - mean(z1))/sd(z1)
th1 = lsfit(z1,y,int=F)$coef         # regress y on z1 (no intercept)
y1 = y + th1*z1
pairs(cbind(y,x1,x2,z1,y1))
Now we do the second iteration
x11 = x1 - sum(x1*z1)*z1/sum(z1*z1)   # orthogonalize x1 with respect to z1
x21 = x2 - sum(x2*z1)*z1/sum(z1*z1)   # orthogonalize x2 with respect to z1
phi1 = sum(y1*x11)                    # new weights from the orthogonalized predictors
phi2 = sum(y1*x21)
z2 = phi1*x11 + phi2*x21              # second PLS component (orthogonal to z1)
z2 = (z2 - mean(z2))/sd(z2)
th2 = lsfit(z2,y1,int=F)$coef         # regress on z2 (no intercept)
y2 = y1 + th2*z2
pairs(cbind(y,z2,y2,y1))
Ultimately the final fitted values are a linear combination of the z-components and thus can be expressed as
$$\hat{Y} = \bar{y} + \sum_{j=1}^{k} \hat{\theta}_j \mathbf{z}_j$$
Interpretation of the results is done in a similar fashion to PCR by examining plots of the cross-validation results, variable loadings, and correlations with the original predictors.
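As a quick check of this expression with the toy example above (a sketch; recall that y was centered, so the mean term is 0):
yhat2 = th1*z1 + th2*z2   # accumulated fit after two components
plot(y,yhat2)             # compare the 2-component fit to the response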
We now examine the results from PLS regression for the yarn data.
> yarn.pls = plsr(density~NIR,ncomp=10,data=Yarn,validation="CV")
> summary(yarn.pls)
Data:   X dimension: 21 268
        Y dimension: 21 1
Fit method: kernelpls
Number of components considered: 10

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps
CV           31.31    6.473    4.912    2.164   0.8847   0.6041   0.6550   0.3983   0.3244   0.2890    0.2822
adjCV        31.31    5.840    4.862    2.150   0.8552   0.5908   0.6367   0.3816   0.3115   0.2759    0.2687

TRAINING: % variance explained
         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps
X          47.07    98.58    99.50    99.72    99.87    99.98    99.98    99.99    99.99     99.99
density    98.19    98.29    99.71    99.97    99.99    99.99   100.00   100.00   100.00    100.00
> plot(RMSEP(yarn.pls),legendpos="topright")
The 4 component model is
suggested by cross-validation.
> predplot(yarn.pls,ncomp=4,line=T)
> plot(yarn.pls,plottype="scores",comps=1:4)
> plot(yarn.pls,"loadings",comps=1:4,legendpos="topright")
> abline(h=0)
> predict(yarn.pls,ncomp=4,newdata=YarnTest)
, , 4 comps

        density
          51.05
          50.72
          32.01
          34.29
          30.36
          20.58
          19.08
> YarnTest$density
[1] 51.04 50.32 32.14 34.69 30.30 20.45 20.06
> sum((predict(yarn.pcr2,ncomp=4,newdata=YarnTest)-YarnTest$density)^2)
[1] 1.407
> sum((predict(yarn.pls,ncomp=4,newdata=YarnTest)-YarnTest$density)^2)
[1] 1.320
The 4-component PLS model does a slightly better job of predicting the test yarn
densities.
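The same comparison can also be made on the RMSEP scale using the pls package's RMSEP function with a test set (a sketch; output not shown):
> RMSEP(yarn.pcr2,newdata=YarnTest,ncomp=4)
> RMSEP(yarn.pls,newdata=YarnTest,ncomp=4)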