
Chapter 6: Exercise 2
a (Lasso)
iii. Less flexible; it gives improved prediction accuracy when the resulting increase in bias is outweighed by the decrease in variance.
b (Ridge regression)
Same as lasso. iii.
c (Non-linear methods)
ii. More flexible, less bias, more variance
Chapter 6: Exercise 3
a
(iv) Steadily decreases: as we increase s from 0, all the β's grow from 0 toward their least squares estimates. Training RSS is largest when all β's are 0 and steadily decreases to the ordinary least squares RSS.
b
(ii) Decreases initially, then eventually starts increasing in a U shape: when s = 0, all β's are 0, the model is extremely simple, and test RSS is high. As we increase s, the β's take non-zero values and the model starts to fit the test data better, so test RSS decreases. Eventually, as the β's approach their full OLS values, the model begins to overfit the training data and test RSS increases.
c
(iii) Steadily increases: when s = 0, the model effectively predicts a constant and has almost no variance. As we increase s, more β's become non-zero and their values grow; the β's then depend more and more on the training data, so variance increases.
d
(iv) Steadily decreases: when s = 0, the model effectively predicts a constant, so the predictions are far from the true values and the bias is high. As s increases, more β's become non-zero and the model captures the underlying relationship better, so bias decreases.
e
(v) Remains constant: By definition, irreducible error is model independent and hence irrespective of the choice
of s, remains constant.
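As an aside (not part of the exercise), the answers above can be illustrated with a small lasso simulation. glmnet is parameterized by λ rather than the budget s, with small λ corresponding to large s, so the λ axis is read in reverse; the data below are simulated purely for illustration.
library(glmnet)
set.seed(1)
n = 100; p = 20
x = matrix(rnorm(n * p), n, p)
beta = c(rep(2, 5), rep(0, p - 5))        # only 5 truly non-zero coefficients
y = drop(x %*% beta + rnorm(n, sd = 3))
x.new = matrix(rnorm(n * p), n, p)        # independent test set
y.new = drop(x.new %*% beta + rnorm(n, sd = 3))
lambdas = 10 ^ seq(2, -2, length = 50)
fit = glmnet(x, y, alpha = 1, lambda = lambdas)
train.mse = colMeans((predict(fit, x) - y)^2)
test.mse = colMeans((predict(fit, x.new) - y.new)^2)
# Reading right to left (s increasing), training MSE falls steadily while
# test MSE typically traces the U shape described in (b).
plot(log(lambdas), train.mse, type = "l", ylim = range(c(train.mse, test.mse)),
     xlab = "log(lambda) (large lambda = small s)", ylab = "MSE")
lines(log(lambdas), test.mse, col = "red")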
Chapter 6: Exercise 9
a
Load and split the College data
library(ISLR)
set.seed(11)
sum(is.na(College))
## [1] 0
train.size = dim(College)[1] / 2
train = sample(1:dim(College)[1], train.size)
test = -train
College.train = College[train, ]
College.test = College[test, ]
b
The number of applications is the Apps variable.
lm.fit = lm(Apps~., data=College.train)
lm.pred = predict(lm.fit, College.test)
mean((College.test[, "Apps"] - lm.pred)^2)
## [1] 1538442
The test MSE for OLS is 1538442.
c
Pick λ using College.train and report error on College.test
library(glmnet)
## Warning: package 'glmnet' was built under R version 2.15.2
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 2.15.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 2.15.3
## Loaded glmnet 1.9-3
train.mat = model.matrix(Apps~., data=College.train)
test.mat = model.matrix(Apps~., data=College.test)
grid = 10 ^ seq(4, -2, length=100)
mod.ridge = cv.glmnet(train.mat, College.train[, "Apps"], alpha=0,
lambda=grid, thresh=1e-12)
lambda.best = mod.ridge$lambda.min
lambda.best
## [1] 18.74
ridge.pred = predict(mod.ridge, newx=test.mat, s=lambda.best)
mean((College.test[, "Apps"] - ridge.pred)^2)
## [1] 1608859
The test MSE, 1608859, is slightly higher than that of OLS.
d
Pick λ using College.train and report error on College.test
mod.lasso = cv.glmnet(train.mat, College.train[, "Apps"], alpha=1,
lambda=grid, thresh=1e-12)
lambda.best = mod.lasso$lambda.min
lambda.best
## [1] 21.54
lasso.pred = predict(mod.lasso, newx=test.mat, s=lambda.best)
mean((College.test[, "Apps"] - lasso.pred)^2)
## [1] 1635280
Again, the test MSE, 1635280, is slightly higher than that of OLS.
The lasso coefficients are:
mod.lasso = glmnet(model.matrix(Apps~., data=College), College[, "Apps"],
alpha=1)
predict(mod.lasso, s=lambda.best, type="coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) -6.038e+02
## (Intercept)  .
## PrivateYes  -4.235e+02
## Accept       1.455e+00
## Enroll      -2.004e-01
## Top10perc    3.368e+01
## Top25perc   -2.403e+00
## F.Undergrad  .
## P.Undergrad  2.086e-02
## Outstate    -5.782e-02
## Room.Board   1.246e-01
## Books        .
## Personal     1.833e-05
## PhD         -5.601e+00
## Terminal    -3.314e+00
## S.F.Ratio    4.479e+00
## perc.alumni -9.797e-01
## Expend       6.968e-02
## Grad.Rate    5.160e+00
e
Fit PCR on the training data, using cross-validation to choose the number of components.
library(pls)
##
## Attaching package: 'pls'
##
## The following object(s) are masked from 'package:stats':
##
##     loadings
pcr.fit = pcr(Apps~., data=College.train, scale=T, validation="CV")
validationplot(pcr.fit, val.type="MSEP")
pcr.pred = predict(pcr.fit, College.test, ncomp=10)
mean((College.test[, "Apps"] - data.frame(pcr.pred))^2)
## [1] 3014496
The test MSE for PCR is about 3014496.
f
Fit PLS on the training data, using cross-validation to choose the number of components.
pls.fit = plsr(Apps~., data=College.train, scale=T, validation="CV")
validationplot(pls.fit, val.type="MSEP")
pls.pred = predict(pls.fit, College.test, ncomp=10)
mean((College.test[, "Apps"] - data.frame(pls.pred))^2)
## [1] 1508987
The test MSE for PLS is about 1508987.
g
The results for OLS, lasso, and ridge are comparable. The lasso shrinks the F.Undergrad and Books coefficients to exactly zero and shrinks the coefficients of the other variables. Here are the test R2 values for all models.
test.avg = mean(College.test[, "Apps"])
lm.test.r2 = 1 - mean((College.test[, "Apps"] - lm.pred)^2) / mean((College.test[, "Apps"] - test.avg)^2)
ridge.test.r2 = 1 - mean((College.test[, "Apps"] - ridge.pred)^2) / mean((College.test[, "Apps"] - test.avg)^2)
lasso.test.r2 = 1 - mean((College.test[, "Apps"] - lasso.pred)^2) / mean((College.test[, "Apps"] - test.avg)^2)
pcr.test.r2 = 1 - mean((College.test[, "Apps"] - data.frame(pcr.pred))^2) / mean((College.test[, "Apps"] - test.avg)^2)
pls.test.r2 = 1 - mean((College.test[, "Apps"] - data.frame(pls.pred))^2) / mean((College.test[, "Apps"] - test.avg)^2)
barplot(c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2,
pls.test.r2), col="red", names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
main="Test R-squared")
The plot shows that the test R2 values for all models except PCR are around 0.9, with PLS slightly higher than the others. PCR has a smaller test R2, below 0.8. All models except PCR predict the number of college applications with high accuracy.
Chapter 7: Exercise 5
a
We'd expect g2 to have the smaller training RSS. Because its penalty involves a higher-order derivative, as λ → ∞ it is only forced to satisfy g'''' = 0 (a cubic), while g1 must satisfy g''' = 0 (a quadratic). The cubic is more flexible, so it fits the training data at least as well as the quadratic.
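A quick check of this on simulated data (purely illustrative; the data and the sine curve below are made up and not part of the exercise):
set.seed(1)
x = runif(100)
y = sin(2 * pi * x) + rnorm(100, sd = 0.3)
g1.limit = lm(y ~ poly(x, 2))    # limiting form of g1 (quadratic)
g2.limit = lm(y ~ poly(x, 3))    # limiting form of g2 (cubic)
sum(residuals(g1.limit)^2)
sum(residuals(g2.limit)^2)       # never larger than the quadratic's training RSS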
b
We'd expect g1 to have the smaller test RSS, because the extra flexibility of g2 makes it more prone to overfitting.
c
Something of a trick question: when λ = 0 the penalty term vanishes, so g1 = g2 and both have the same training and test RSS.
Chapter 7: Exercise 9
Load the Boston dataset
set.seed(1)
library(MASS)
attach(Boston)
a
lm.fit = lm(nox ~ poly(dis, 3), data = Boston)
summary(lm.fit)
##
## Call:
## lm(formula = nox ~ poly(dis, 3), data = Boston)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.12113 -0.04062 -0.00974  0.02338  0.19490
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.55470    0.00276  201.02  < 2e-16 ***
## poly(dis, 3)1 -2.00310    0.06207  -32.27  < 2e-16 ***
## poly(dis, 3)2  0.85633    0.06207   13.80  < 2e-16 ***
## poly(dis, 3)3 -0.31805    0.06207   -5.12  4.3e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0621 on 502 degrees of freedom
## Multiple R-squared:  0.715,  Adjusted R-squared:  0.713
## F-statistic:  419 on 3 and 502 DF,  p-value: <2e-16
dislim = range(dis)
dis.grid = seq(from = dislim[1], to = dislim[2], by = 0.1)
lm.pred = predict(lm.fit, list(dis = dis.grid))
plot(nox ~ dis, data = Boston, col = "darkgrey")
lines(dis.grid, lm.pred, col = "red", lwd = 2)
The summary shows that all polynomial terms are significant in predicting nox from dis. The plot shows a smooth curve that fits the data fairly well.
b
We fit polynomials of degrees 1 to 10 and record the training RSS.
all.rss = rep(NA, 10)
for (i in 1:10) {
lm.fit = lm(nox ~ poly(dis, i), data = Boston)
all.rss[i] = sum(lm.fit$residuals^2)
}
all.rss
## [1] 2.769 2.035 1.934 1.933 1.915 1.878 1.849 1.836 1.833 1.832
As expected, training RSS decreases monotonically with the degree of the polynomial.
c
We use a 10-fold cross validation to pick the best polynomial degree.
library(boot)
all.deltas = rep(NA, 10)
for (i in 1:10) {
glm.fit = glm(nox ~ poly(dis, i), data = Boston)
all.deltas[i] = cv.glm(Boston, glm.fit, K = 10)$delta[2]
}
plot(1:10, all.deltas, xlab = "Degree", ylab = "CV error", type = "l", pch = 20, lwd = 2)
The 10-fold CV error decreases as we increase the degree from 1 to 3, stays roughly constant up to degree 5, and then starts increasing for higher degrees. We pick 4 as the best polynomial degree.
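As a sanity check on that choice, the degree with the smallest estimated CV error can be read off directly (a one-liner reusing the all.deltas vector from the loop above; because CV estimates are noisy, the exact minimizer can vary with the seed):
which.min(all.deltas)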
d
We see that dis has limits of about 1 and 13 respectively. We split this range in roughly equal 4 intervals
and establish knots at [4,7,11]. Note: bs function in R expects either df or knots argument. If both are
specified, knots are ignored.
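A quick way to confirm which argument wins (this assumes the Boston data are still attached, as above):
library(splines)
dim(bs(dis, df = 4, knots = c(4, 7, 11)))    # 6 basis columns: the knots are used
dim(bs(dis, df = 4))                         # only 4 columns when df alone is given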
library(splines)
sp.fit = lm(nox ~ bs(dis, df = 4, knots = c(4, 7, 11)), data = Boston)
summary(sp.fit)
##
## Call:
## lm(formula = nox ~ bs(dis, df = 4, knots = c(4, 7, 11)), data = Boston)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.1246 -0.0403 -0.0087  0.0247  0.1929
##
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)                             0.7393     0.0133   55.54  < 2e-16 ***
## bs(dis, df = 4, knots = c(4, 7, 11))1  -0.0886     0.0250   -3.54  0.00044 ***
## bs(dis, df = 4, knots = c(4, 7, 11))2  -0.3134     0.0168  -18.66  < 2e-16 ***
## bs(dis, df = 4, knots = c(4, 7, 11))3  -0.2662     0.0315   -8.46  3.0e-16 ***
## bs(dis, df = 4, knots = c(4, 7, 11))4  -0.3980     0.0465   -8.56  < 2e-16 ***
## bs(dis, df = 4, knots = c(4, 7, 11))5  -0.2568     0.0900   -2.85  0.00451 **
## bs(dis, df = 4, knots = c(4, 7, 11))6  -0.3293     0.0633   -5.20  2.9e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0619 on 499 degrees of freedom
## Multiple R-squared:  0.718,  Adjusted R-squared:  0.715
## F-statistic:  212 on 6 and 499 DF,  p-value: <2e-16
sp.pred = predict(sp.fit, list(dis = dis.grid))
plot(nox ~ dis, data = Boston, col = "darkgrey")
lines(dis.grid, sp.pred, col = "red", lwd = 2)
The summary shows that all terms in the spline fit are significant. The plot shows that the spline fits the data well except at the extreme values of dis (especially dis > 10).
e
We fit regression splines with df between 3 and 16 and record the training RSS.
all.cv = rep(NA, 16)
for (i in 3:16) {
lm.fit = lm(nox ~ bs(dis, df = i), data = Boston)
all.cv[i] = sum(lm.fit$residuals^2)
}
all.cv[-c(1, 2)]
## [1] 1.934 1.923 1.840 1.834 1.830 1.817 1.826 1.793 1.797 1.789 1.782
## [12] 1.782 1.783 1.784
Training RSS generally decreases with df (with small upticks at df=9 and df=11), reaching its minimum around df=13-14 before increasing slightly at df=15 and df=16.
f
Finally, we use 10-fold cross-validation to find the best df, trying all integer values of df between 3 and 16.
all.cv = rep(NA, 16)
for (i in 3:16) {
lm.fit = glm(nox ~ bs(dis, df = i), data = Boston)
all.cv[i] = cv.glm(Boston, lm.fit, K = 10)$delta[2]
}
## Warning: some 'x' values beyond boundary knots may cause ill-conditioned bases
plot(3:16, all.cv[-c(1, 2)], lwd = 2, type = "l", xlab = "df", ylab = "CV error")
The CV error is jumpier in this case, but it attains its minimum at df=10, so we pick 10 as the optimal degrees of freedom.
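The chosen df can also be read off numerically, as a one-line check reusing the all.cv vector from the CV loop above (it should agree with the df = 10 read off the plot):
(3:16)[which.min(all.cv[3:16])]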
Chapter 7: Exercise 10
a
set.seed(1)
library(ISLR)
library(leaps)
attach(College)
train = sample(length(Outstate), length(Outstate)/2)
test = -train
College.train = College[train, ]
College.test = College[test, ]
reg.fit = regsubsets(Outstate ~ ., data = College.train, nvmax = 17, method = "forward")
reg.summary = summary(reg.fit)
par(mfrow = c(1, 3))
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
min.cp = min(reg.summary$cp)
std.cp = sd(reg.summary$cp)
abline(h = min.cp + 0.2 * std.cp, col = "red", lty = 2)
abline(h = min.cp - 0.2 * std.cp, col = "red", lty = 2)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
min.bic = min(reg.summary$bic)
std.bic = sd(reg.summary$bic)
abline(h = min.bic + 0.2 * std.bic, col = "red", lty = 2)
abline(h = min.bic - 0.2 * std.bic, col = "red", lty = 2)
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R2",
type = "l", ylim = c(0.4, 0.84))
max.adjr2 = max(reg.summary$adjr2)
std.adjr2 = sd(reg.summary$adjr2)
abline(h = max.adjr2 + 0.2 * std.adjr2, col = "red", lty = 2)
abline(h = max.adjr2 - 0.2 * std.adjr2, col = "red", lty = 2)
The Cp, BIC, and adjusted R2 scores all show that 6 is the smallest subset size for which the score is within 0.2 standard deviations of its optimum. We pick 6 as the best subset size and find the best 6 variables using the entire data set.
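The same "smallest size within 0.2 standard deviations of the optimum" rule can be applied programmatically; a short sketch reusing the summary statistics computed above (each line should return 6 if the reading of the plots is right):
min(which(reg.summary$cp <= min.cp + 0.2 * std.cp))
min(which(reg.summary$bic <= min.bic + 0.2 * std.bic))
min(which(reg.summary$adjr2 >= max.adjr2 - 0.2 * std.adjr2))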
reg.fit = regsubsets(Outstate ~ ., data = College, method = "forward")
coefi = coef(reg.fit, id = 6)
names(coefi)
## [1] "(Intercept)" "PrivateYes"  "Room.Board"  "PhD"         "perc.alumni"
## [6] "Expend"      "Grad.Rate"
b
library(gam)
## Loading required package: splines
## Loaded gam 1.09
gam.fit = gam(Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, df = 2) +
    s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate, df = 2),
    data = College.train)
par(mfrow = c(2, 3))
plot(gam.fit, se = T, col = "blue")
c
gam.pred = predict(gam.fit, College.test)
gam.err = mean((College.test$Outstate - gam.pred)^2)
gam.err
## [1] 3745460
gam.tss = mean((College.test$Outstate - mean(College.test$Outstate))^2)
test.rss = 1 - gam.err/gam.tss
test.rss
## [1] 0.7697
We obtain a test R-squared of 0.77 using a GAM with 6 predictors. This is a slight improvement over the test R-squared of 0.74 obtained using OLS.
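The OLS baseline quoted above is not shown; here is a minimal sketch of how such a figure could be computed on the same split, assuming the baseline uses the six predictors selected in part (a) (an assumption, since the original text does not say which OLS model was used):
ols.fit = lm(Outstate ~ Private + Room.Board + PhD + perc.alumni + Expend + Grad.Rate,
    data = College.train)
ols.pred = predict(ols.fit, College.test)
1 - mean((College.test$Outstate - ols.pred)^2) / gam.tss    # OLS test R-squared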
d
summary(gam.fit)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(PhD,
##     df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate,
##     df = 2), data = College.train)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## -4977.7 -1184.5    58.3  1220.0  7688.3
##
## (Dispersion Parameter for gaussian family taken to be 3300711)
##
##     Null Deviance: 6.222e+09 on 387 degrees of freedom
## Residual Deviance: 1.231e+09 on 373 degrees of freedom
## AIC: 6942
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
##                         Df   Sum Sq  Mean Sq F value  Pr(>F)
## Private                  1 1.78e+09 1.78e+09   539.1 < 2e-16 ***
## s(Room.Board, df = 2)    1 1.22e+09 1.22e+09   370.2 < 2e-16 ***
## s(PhD, df = 2)           1 3.82e+08 3.82e+08   115.9 < 2e-16 ***
## s(perc.alumni, df = 2)   1 3.28e+08 3.28e+08    99.5 < 2e-16 ***
## s(Expend, df = 5)        1 4.17e+08 4.17e+08   126.2 < 2e-16 ***
## s(Grad.Rate, df = 2)     1 5.53e+07 5.53e+07    16.8 5.2e-05 ***
## Residuals              373 1.23e+09 3.30e+06
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
##                        Npar Df Npar F  Pr(F)
## (Intercept)
## Private
## s(Room.Board, df = 2)        1   3.56  0.060 .
## s(PhD, df = 2)               1   4.34  0.038 *
## s(perc.alumni, df = 2)       1   1.92  0.167
## s(Expend, df = 5)            4  16.86  1e-12 ***
## s(Grad.Rate, df = 2)         1   3.72  0.055 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA for nonparametric effects shows strong evidence of a non-linear relationship between the response and Expend, and moderately strong evidence (using a p-value cutoff of about 0.05) of non-linear relationships between the response and each of Grad.Rate and PhD.
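One way to probe the Expend non-linearity a bit further (a sketch, not part of the original analysis) is to refit the GAM with a linear Expend term and compare the two fits with anova():
gam.lin = gam(Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, df = 2) +
    s(perc.alumni, df = 2) + Expend + s(Grad.Rate, df = 2), data = College.train)
anova(gam.lin, gam.fit, test = "F")    # a small p-value would favour the smooth term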