Chapter 6: Exercise 2

a (Lasso)
iii. The lasso is less flexible than least squares and gives improved prediction accuracy when its decrease in variance outweighs its increase in bias.

b (Ridge regression)
iii. Same as the lasso: less flexible, trading a small increase in bias for a larger decrease in variance.

c (Non-linear methods)
ii. Non-linear methods are more flexible than least squares: less bias, more variance. They give improved prediction accuracy when the decrease in bias outweighs the increase in variance.

Chapter 6: Exercise 3

a
(iv) Steadily decreases. As we increase s from 0, the coefficients grow from 0 toward their least squares values. Training RSS is largest when all coefficients are 0 (s = 0) and steadily decreases to the ordinary least squares RSS.

b
(ii) Decreases initially, then eventually starts increasing in a U shape. When s = 0, all coefficients are 0; the model is extremely simple and has a high test RSS. As s increases, the coefficients take non-zero values and the model starts fitting the test data better, so test RSS decreases. Eventually, as the coefficients approach their unconstrained OLS values, the model starts overfitting the training data and test RSS increases.

c
(iii) Steadily increases. When s = 0 the model effectively predicts a constant and has almost no variance. As s increases, more coefficients enter the model and their values grow; they become increasingly dependent on the particular training data, so variance increases.

d
(iv) Steadily decreases. When s = 0 the model effectively predicts a constant, so its predictions are far from the true values and bias is high. As s increases, more coefficients become non-zero and the model fits the data better, so bias decreases.

e
(v) Remains constant. By definition, the irreducible error does not depend on the model and hence is unaffected by the choice of s.

Chapter 6: Exercise 9

a
Load and split the College data.

library(ISLR)
set.seed(11)

sum(is.na(College))

## [1] 0

train.size = dim(College)[1] / 2
train = sample(1:dim(College)[1], train.size)
test = -train
College.train = College[train, ]
College.test = College[test, ]

b
Number of applications is the Apps variable.

lm.fit = lm(Apps ~ ., data = College.train)
lm.pred = predict(lm.fit, College.test)
mean((College.test[, "Apps"] - lm.pred)^2)

## [1] 1538442

Test MSE for OLS is 1538442.

c
Pick λ by cross-validation on College.train and report the error on College.test.

library(glmnet)

train.mat = model.matrix(Apps ~ ., data = College.train)
test.mat = model.matrix(Apps ~ ., data = College.test)
grid = 10 ^ seq(4, -2, length = 100)
mod.ridge = cv.glmnet(train.mat, College.train[, "Apps"], alpha = 0, lambda = grid, thresh = 1e-12)
lambda.best = mod.ridge$lambda.min
lambda.best

## [1] 18.74

ridge.pred = predict(mod.ridge, newx = test.mat, s = lambda.best)
mean((College.test[, "Apps"] - ridge.pred)^2)

## [1] 1608859

Test MSE for ridge regression, 1608859, is slightly higher than that of OLS.
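As a quick check on this choice of λ, one could also look at cv.glmnet's cross-validation curve and at the more conservative lambda.1se value it stores. A minimal sketch reusing mod.ridge and test.mat from above (ridge.pred.1se is a name introduced here for illustration):

plot(mod.ridge)                        # CV error versus log(lambda), with one-standard-error bars
mod.ridge$lambda.1se                   # largest lambda whose CV error is within one SE of the minimum
ridge.pred.1se = predict(mod.ridge, newx = test.mat, s = "lambda.1se")
mean((College.test[, "Apps"] - ridge.pred.1se)^2)   # test MSE at lambda.1se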
d
Pick λ by cross-validation on College.train and report the error on College.test.

mod.lasso = cv.glmnet(train.mat, College.train[, "Apps"], alpha = 1, lambda = grid, thresh = 1e-12)
lambda.best = mod.lasso$lambda.min
lambda.best

## [1] 21.54

lasso.pred = predict(mod.lasso, newx = test.mat, s = lambda.best)
mean((College.test[, "Apps"] - lasso.pred)^2)

## [1] 1635280

Again, the test MSE, 1635280, is slightly higher than that of OLS. The coefficients look like this:

mod.lasso = glmnet(model.matrix(Apps ~ ., data = College), College[, "Apps"], alpha = 1)
predict(mod.lasso, s = lambda.best, type = "coefficients")

## 19 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) -6.038e+02
## (Intercept)  .
## PrivateYes  -4.235e+02
## Accept       1.455e+00
## Enroll      -2.004e-01
## Top10perc    3.368e+01
## Top25perc   -2.403e+00
## F.Undergrad  .
## P.Undergrad  2.086e-02
## Outstate    -5.782e-02
## Room.Board   1.246e-01
## Books        .
## Personal     1.833e-05
## PhD         -5.601e+00
## Terminal    -3.314e+00
## S.F.Ratio    4.479e+00
## perc.alumni -9.797e-01
## Expend       6.968e-02
## Grad.Rate    5.160e+00

e
Use cross-validation to fit PCR.

library(pls)

pcr.fit = pcr(Apps ~ ., data = College.train, scale = T, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred = predict(pcr.fit, College.test, ncomp = 10)
mean((College.test[, "Apps"] - data.frame(pcr.pred))^2)

## [1] 3014496

Test MSE for PCR (with 10 components) is about 3014496.

f
Use cross-validation to fit PLS.

pls.fit = plsr(Apps ~ ., data = College.train, scale = T, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")
pls.pred = predict(pls.fit, College.test, ncomp = 10)
mean((College.test[, "Apps"] - data.frame(pls.pred))^2)

## [1] 1508987

Test MSE for PLS (with 10 components) is about 1508987.

g
Results for OLS, lasso and ridge are comparable. The lasso shrinks the coefficients of the other variables and sets those of F.Undergrad and Books exactly to zero. Here are the test R2 values for all models.

test.avg = mean(College.test[, "Apps"])
lm.test.r2 = 1 - mean((College.test[, "Apps"] - lm.pred)^2) / mean((College.test[, "Apps"] - test.avg)^2)
ridge.test.r2 = 1 - mean((College.test[, "Apps"] - ridge.pred)^2) / mean((College.test[, "Apps"] - test.avg)^2)
lasso.test.r2 = 1 - mean((College.test[, "Apps"] - lasso.pred)^2) / mean((College.test[, "Apps"] - test.avg)^2)
pcr.test.r2 = 1 - mean((College.test[, "Apps"] - data.frame(pcr.pred))^2) / mean((College.test[, "Apps"] - test.avg)^2)
pls.test.r2 = 1 - mean((College.test[, "Apps"] - data.frame(pls.pred))^2) / mean((College.test[, "Apps"] - test.avg)^2)
barplot(c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2), col = "red", names.arg = c("OLS", "Ridge", "Lasso", "PCR", "PLS"), main = "Test R-squared")

The plot shows that the test R2 for all models except PCR is around 0.9, with PLS slightly higher than the others. PCR has a smaller test R2 of less than 0.8. All models except PCR predict the number of college applications with reasonably high accuracy.
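As an aside, the choice of ncomp = 10 for PCR and PLS above was read off the validation plots by eye; the pls package's RMSEP() accessor lets us pick it programmatically. A minimal sketch for the PCR fit (cv.rmsep, best.ncomp and pcr.pred.best are names introduced here; it assumes the CV minimum does not occur at the intercept-only model):

cv.rmsep = RMSEP(pcr.fit, estimate = "CV")$val   # CV RMSE for 0, 1, ..., 17 components
best.ncomp = which.min(cv.rmsep) - 1             # subtract 1: the first entry is the 0-component model
best.ncomp
pcr.pred.best = predict(pcr.fit, College.test, ncomp = best.ncomp)
mean((College.test[, "Apps"] - data.frame(pcr.pred.best))^2)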
Chapter 7: Exercise 5

a
We'd expect g2 to have the smaller training RSS: its penalty involves a higher-order derivative, so as λ grows it is allowed to be a higher-order polynomial than g1 and is therefore more flexible.

b
We'd expect g1 to have the smaller test RSS, because g2 could overfit with its extra degree of freedom.

c
Trick question: when λ = 0 the penalty terms vanish, so g1 = g2 and the two fits have identical training and test RSS.

Chapter 7: Exercise 9

Load the Boston data set.

set.seed(1)
library(MASS)
attach(Boston)

a

lm.fit = lm(nox ~ poly(dis, 3), data = Boston)
summary(lm.fit)

## 
## Call:
## lm(formula = nox ~ poly(dis, 3), data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12113 -0.04062 -0.00974  0.02338  0.19490 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.55470    0.00276  201.02  < 2e-16 ***
## poly(dis, 3)1 -2.00310    0.06207  -32.27  < 2e-16 ***
## poly(dis, 3)2  0.85633    0.06207   13.80  < 2e-16 ***
## poly(dis, 3)3 -0.31805    0.06207   -5.12  4.3e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0621 on 502 degrees of freedom
## Multiple R-squared: 0.715, Adjusted R-squared: 0.713
## F-statistic: 419 on 3 and 502 DF, p-value: <2e-16

dislim = range(dis)
dis.grid = seq(from = dislim[1], to = dislim[2], by = 0.1)
lm.pred = predict(lm.fit, list(dis = dis.grid))
plot(nox ~ dis, data = Boston, col = "darkgrey")
lines(dis.grid, lm.pred, col = "red", lwd = 2)

The summary shows that all polynomial terms are significant when predicting nox from dis. The plot shows a smooth curve that fits the data fairly well.

b
We fit polynomials of degrees 1 to 10 and save the training RSS.

all.rss = rep(NA, 10)
for (i in 1:10) {
    lm.fit = lm(nox ~ poly(dis, i), data = Boston)
    all.rss[i] = sum(lm.fit$residuals^2)
}
all.rss

## [1] 2.769 2.035 1.934 1.933 1.915 1.878 1.849 1.836 1.833 1.832

As expected, the training RSS decreases monotonically with the degree of the polynomial.

c
We use 10-fold cross-validation to pick the best polynomial degree.

library(boot)
all.deltas = rep(NA, 10)
for (i in 1:10) {
    glm.fit = glm(nox ~ poly(dis, i), data = Boston)
    all.deltas[i] = cv.glm(Boston, glm.fit, K = 10)$delta[2]
}
plot(1:10, all.deltas, xlab = "Degree", ylab = "CV error", type = "l", pch = 20, lwd = 2)

The 10-fold CV error decreases as the degree increases from 1 to 3, stays almost constant up to degree 5, and then starts increasing for higher degrees. We pick 4 as the best polynomial degree.

d
dis ranges from about 1 to 13. We split this range into four roughly equal intervals and place knots at 4, 7 and 11. Note: bs() in R expects either a df or a knots argument; if both are supplied, the specified knots are used and df is effectively ignored.

library(splines)
sp.fit = lm(nox ~ bs(dis, df = 4, knots = c(4, 7, 11)), data = Boston)
summary(sp.fit)

## 
## Call:
## lm(formula = nox ~ bs(dis, df = 4, knots = c(4, 7, 11)), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.1246 -0.0403 -0.0087  0.0247  0.1929 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                             0.7393     0.0133   55.54  < 2e-16 ***
## bs(dis, df = 4, knots = c(4, 7, 11))1  -0.0886     0.0250   -3.54  0.00044 ***
## bs(dis, df = 4, knots = c(4, 7, 11))2  -0.3134     0.0168  -18.66  < 2e-16 ***
## bs(dis, df = 4, knots = c(4, 7, 11))3  -0.2662     0.0315   -8.46  3.0e-16 ***
## bs(dis, df = 4, knots = c(4, 7, 11))4  -0.3980     0.0465   -8.56  < 2e-16 ***
## bs(dis, df = 4, knots = c(4, 7, 11))5  -0.2568     0.0900   -2.85  0.00451 ** 
## bs(dis, df = 4, knots = c(4, 7, 11))6  -0.3293     0.0633   -5.20  2.9e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0619 on 499 degrees of freedom
## Multiple R-squared: 0.718, Adjusted R-squared: 0.715
## F-statistic: 212 on 6 and 499 DF, p-value: <2e-16

sp.pred = predict(sp.fit, list(dis = dis.grid))
plot(nox ~ dis, data = Boston, col = "darkgrey")
lines(dis.grid, sp.pred, col = "red", lwd = 2)

The summary shows that all terms in the spline fit are significant. The plot shows that the spline fits the data well except at the extreme values of dis (especially dis > 10).
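As a side note, when only df is supplied, bs() places its internal knots at quantiles of the predictor rather than at hand-picked values. A minimal sketch of how to inspect those default knots and refit (sp.fit2 is a name introduced here for illustration):

# With a cubic basis and df = 4, bs() uses df - 3 = 1 internal knot, placed at the median of dis.
attr(bs(dis, df = 4), "knots")
sp.fit2 = lm(nox ~ bs(dis, df = 4), data = Boston)
summary(sp.fit2)$adj.r.squared   # compare with the hand-picked-knot fit above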
e
We fit regression splines with df between 3 and 16 and record the training RSS.

all.rss = rep(NA, 16)
for (i in 3:16) {
    lm.fit = lm(nox ~ bs(dis, df = i), data = Boston)
    all.rss[i] = sum(lm.fit$residuals^2)
}
all.rss[-c(1, 2)]

## [1] 1.934 1.923 1.840 1.834 1.830 1.817 1.826 1.793 1.797 1.789 1.782
## [12] 1.782 1.783 1.784

Training RSS generally decreases as df increases (with small bumps at df = 9 and 11), is lowest around df = 13 and 14, and ticks up very slightly at df = 15 and 16.

f
Finally, we use 10-fold cross-validation to find the best df, trying all integer values of df between 3 and 16.

all.cv = rep(NA, 16)
for (i in 3:16) {
    lm.fit = glm(nox ~ bs(dis, df = i), data = Boston)
    all.cv[i] = cv.glm(Boston, lm.fit, K = 10)$delta[2]
}

(This produces many repeated warnings that "some 'x' values beyond boundary knots may cause ill-conditioned bases": the boundary knots are chosen from each training fold, so some dis values in the held-out fold fall outside them.)

plot(3:16, all.cv[-c(1, 2)], lwd = 2, type = "l", xlab = "df", ylab = "CV error")

The CV error is more jumpy in this case, but it attains its minimum at df = 10, so we pick 10 as the optimal degrees of freedom.
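As a final aside on this exercise, a smoothing spline avoids choosing df on a grid altogether by selecting its effective degrees of freedom via leave-one-out cross-validation. A minimal sketch using the base stats function smooth.spline() (ss.fit is a name introduced here; dis has tied values, so smooth.spline() may warn about cross-validation with non-unique x values):

ss.fit = smooth.spline(dis, nox, cv = TRUE)   # LOOCV choice of the smoothing parameter
ss.fit$df                                     # effective degrees of freedom selected
plot(nox ~ dis, data = Boston, col = "darkgrey")
lines(ss.fit, col = "blue", lwd = 2)          # overlay the smoothing spline fit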
Chapter 7: Exercise 10

a

set.seed(1)
library(ISLR)
library(leaps)
attach(College)
train = sample(length(Outstate), length(Outstate)/2)
test = -train
College.train = College[train, ]
College.test = College[test, ]
reg.fit = regsubsets(Outstate ~ ., data = College.train, nvmax = 17, method = "forward")
reg.summary = summary(reg.fit)
par(mfrow = c(1, 3))
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
min.cp = min(reg.summary$cp)
std.cp = sd(reg.summary$cp)
abline(h = min.cp + 0.2 * std.cp, col = "red", lty = 2)
abline(h = min.cp - 0.2 * std.cp, col = "red", lty = 2)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
min.bic = min(reg.summary$bic)
std.bic = sd(reg.summary$bic)
abline(h = min.bic + 0.2 * std.bic, col = "red", lty = 2)
abline(h = min.bic - 0.2 * std.bic, col = "red", lty = 2)
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R2", type = "l", ylim = c(0.4, 0.84))
max.adjr2 = max(reg.summary$adjr2)
std.adjr2 = sd(reg.summary$adjr2)
abline(h = max.adjr2 + 0.2 * std.adjr2, col = "red", lty = 2)
abline(h = max.adjr2 - 0.2 * std.adjr2, col = "red", lty = 2)

The Cp, BIC and adjusted R2 plots all show that 6 is the smallest subset size whose score is within 0.2 standard deviations of the optimum. We pick 6 as the best subset size and find the best 6 variables using the entire data set.

reg.fit = regsubsets(Outstate ~ ., data = College, method = "forward")
coefi = coef(reg.fit, id = 6)
names(coefi)

## [1] "(Intercept)" "PrivateYes"  "Room.Board"  "PhD"         "perc.alumni"
## [6] "Expend"      "Grad.Rate"
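Since we already hold out College.test, the subset size could also be checked directly against validation-set error. A minimal sketch using the usual hand-rolled prediction step for regsubsets (leaps provides no predict method; reg.fit.tr, test.mat2, coefs and val.errors are names introduced here for illustration):

reg.fit.tr = regsubsets(Outstate ~ ., data = College.train, nvmax = 17, method = "forward")
test.mat2 = model.matrix(Outstate ~ ., data = College.test)
val.errors = rep(NA, 17)
for (i in 1:17) {
    coefs = coef(reg.fit.tr, id = i)   # coefficients of the size-i forward-selection model
    val.errors[i] = mean((College.test$Outstate - test.mat2[, names(coefs)] %*% coefs)^2)
}
which.min(val.errors)   # subset size with the lowest validation MSE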
b

library(gam)

gam.fit = gam(Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate, df = 2), data = College.train)
par(mfrow = c(2, 3))
plot(gam.fit, se = T, col = "blue")

c

gam.pred = predict(gam.fit, College.test)
gam.err = mean((College.test$Outstate - gam.pred)^2)
gam.err

## [1] 3745460

gam.tss = mean((College.test$Outstate - mean(College.test$Outstate))^2)
test.rss = 1 - gam.err/gam.tss
test.rss

## [1] 0.7697

We obtain a test R-squared of 0.77 using a GAM with 6 predictors, a slight improvement over the test R-squared of 0.74 obtained using OLS.

d

summary(gam.fit)

## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(PhD,
##     df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate,
##     df = 2), data = College.train)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -4977.7 -1184.5    58.3  1220.0  7688.3 
## 
## (Dispersion Parameter for gaussian family taken to be 3300711)
## 
##     Null Deviance: 6.222e+09 on 387 degrees of freedom
## Residual Deviance: 1.231e+09 on 373 degrees of freedom
## AIC: 6942 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##                         Df   Sum Sq  Mean Sq F value  Pr(>F)    
## Private                  1 1.78e+09 1.78e+09   539.1 < 2e-16 ***
## s(Room.Board, df = 2)    1 1.22e+09 1.22e+09   370.2 < 2e-16 ***
## s(PhD, df = 2)           1 3.82e+08 3.82e+08   115.9 < 2e-16 ***
## s(perc.alumni, df = 2)   1 3.28e+08 3.28e+08    99.5 < 2e-16 ***
## s(Expend, df = 5)        1 4.17e+08 4.17e+08   126.2 < 2e-16 ***
## s(Grad.Rate, df = 2)     1 5.53e+07 5.53e+07    16.8 5.2e-05 ***
## Residuals              373 1.23e+09 3.30e+06                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                        Npar Df Npar F  Pr(F)    
## (Intercept)                                     
## Private                                         
## s(Room.Board, df = 2)        1   3.56  0.060 .  
## s(PhD, df = 2)               1   4.34  0.038 *  
## s(perc.alumni, df = 2)       1   1.92  0.167    
## s(Expend, df = 5)            4  16.86  1e-12 ***
## s(Grad.Rate, df = 2)         1   3.72  0.055 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA for non-parametric effects shows strong evidence of a non-linear relationship between the response and Expend, and moderately strong evidence (at a p-value of around 0.05) of non-linear relationships between the response and Grad.Rate and PhD.
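To corroborate the non-parametric ANOVA result for Expend, one could compare the fitted GAM against an otherwise identical model that enters Expend linearly. A minimal sketch, assuming the gam package's anova method for model comparison (gam.lin is a name introduced here for illustration):

gam.lin = gam(Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, df = 2) + s(perc.alumni, df = 2) + Expend + s(Grad.Rate, df = 2), data = College.train)
anova(gam.lin, gam.fit, test = "F")   # a small p-value favours the smooth term for Expend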