STAT 425 – Modern Methods of Data Analysis
Assignment 3 – Lasso, LAR, PCR, and PLS (38 points)

PROBLEM 1 – NEAR INFRARED (NIR) SPECTRA & GASOLINE OCTANE

In this regression problem your goal is to model gasoline octane as a function of NIR spectra values. The NIR spectra were measured using diffuse reflectance as log(1/R) from 900 nm to 1700 nm in 2 nm intervals, giving 401 wavelengths, i.e. p = 401. We will use 50 of the original 60 observations to train the model and compare the predictive performance of the different methods when predicting the 10 test cases. The data frame is gasoline and comes with the pls package.

Some preliminary plots to get a sense of the nature of these data:

> library(pls)
> gasoline.x = gasoline$NIR
> dim(gasoline.x)
[1]  60 401
> matplot(t(gasoline.x),type="l",xlab="Variable",ylab="Spectral Intensity")
> title(main="Spectral Readings for Gasoline Data")
> pairs.plus(gasoline.x[,1:10])
> pairs.plus(gasoline.x[,201:210])
> pairs.plus(gasoline.x[,301:310])
etc.

a) Use the eigen() function to obtain the spectral decomposition of the correlation matrix R. Some code to help you out is given below (see notes as well):

> R = cor(gasoline.x)
> gasoline.xs = scale(gasoline.x)
> S = var(gasoline.xs)

BONUS QUESTION: Why are R and S the same matrix? (2 pts.)

> eigenR = eigen(R)
> eigenR$values

How many eigenvalues are above 1? (1 pt.)

> z1 = gasoline.xs%*%eigenR$vectors[,1]
> z2 = gasoline.xs%*%eigenR$vectors[,2]
etc.
> OctPC = data.frame(octane=gasoline$octane,z1,z2,...)
> pairs.plus(OctPC)

Form as many zi's as you have eigenvalues above 1. Examine a scatterplot matrix of the principal components for eigenvalues above 1 and the response octane. What do you see? (3 pts.)

b) Use these zi's as terms in an OLS regression model for octane and discuss the model. (3 pts.)

c) What is the squared correlation between the fitted values and the actual octane values for this model? (1 pt.)

d) Now use the pcr() function, as shown in the yarn example in your notes, to fit a PCR model for these data. What is the "optimal" number of components to use in your model? (3 pts.)

> oct.pcr = pcr(octane~scale(NIR),data=gasoline,ncomp=12,validation="CV")
> summary(oct.pcr)

e) Examine the loadings on the components you used in your model. Use this plot to determine which NIR spectra load heavily on the first two principal components. Summarize your findings. (3 pts.)

> loadingplot(oct.pcr,comps=1:2,legendpos="topright")

f) Form a training subset of the gasoline data as follows:

> gasoline.train = gasoline[1:50,]
> gasoline.test = gasoline[51:60,]
> attributes(gasoline.train)
$names
[1] "octane" "NIR"

$row.names
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20"
[21] "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40"
[41] "41" "42" "43" "44" "45" "46" "47" "48" "49" "50"

$class
[1] "data.frame"

> dim(gasoline.train$NIR)
[1]  50 401

Using the optimal number of components chosen above, fit the model to these training data and predict the octane of the test cases using their NIR spectra. What is the RMSEP using the training/test set approach? (2 pts.)

Assuming you have already built a training model called mymodel, do the following to obtain the predicted octanes for the observations in the test set:

> ypred = predict(mymodel,ncomp=??,newdata=gasoline.test)
> cbind(ypred,gasoline.test$octane)
> sqrt(sum((ypred - gasoline.test$octane)^2)/10)    # RMSEP

Note: ?? = the number of components you think should be used. RMSEP = root mean squared error of prediction.
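To make part (f) concrete, here is a minimal sketch of the whole train/test calculation. The value k = 4 is only a placeholder for the number of components you chose in part (d), and scale=TRUE is used instead of scale(NIR) in the formula so that the test spectra are standardized with the training means and standard deviations rather than their own:

> k = 4                                              # placeholder: your chosen number of components
> mymodel = pcr(octane~NIR,data=gasoline.train,ncomp=k,scale=TRUE)
> ypred = drop(predict(mymodel,ncomp=k,newdata=gasoline.test))   # drop() flattens the prediction array
> cbind(ypred,gasoline.test$octane)                  # predicted vs. actual octane
> sqrt(sum((ypred - gasoline.test$octane)^2)/10)     # RMSEP on the 10 test cases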
g) Estimate the prediction error (PSE) using Monte Carlo cross-validation (MCCV) with p = .80. (2 pts.)

The code for the function pcr.cv is shown below. It takes the X's, the response y, and the number of components to use in the PCR fit as its required arguments. Note that the function computes the RMSEP for each MC sample, rather than the average squared prediction error. Copy and paste this code into R.

pcr.cv = function(x,y,ncomp=2,p=.667,B=100) {
  n = length(y)
  data = data.frame(x=I(x),y=y)              # I() keeps x as a single matrix column
  cv <- rep(0,B)
  for (i in 1:B) {
    ss <- floor(n*p)                         # training set size
    sam <- sample(1:n,ss,replace=F)          # random training indices
    fit2 <- pcr(y~x,ncomp=ncomp,data=data[sam,])
    ynew <- predict(fit2,ncomp=ncomp,newdata=data[-sam,])
    cv[i] <- sqrt(sum((y[-sam]-ynew)^2)/(n - ss))   # RMSEP for this MC sample
  }
  cv
}

PROBLEM 2 – PROSTATE CANCER DATA

Using the prostate data that you used for the ridge regression assignment, fit Lasso and LAR regression models to these data and compare the results to both OLS and ridge regression. The prostate cancer data are described below:

lcavol    log cancer volume
lweight   log prostate weight
age       age in years
lbph      log of the amount of benign prostatic hyperplasia
svi       seminal vesicle invasion
lcp       log of capsular penetration
gleason   a numeric vector (Gleason score)
pgg45     percent of Gleason score 4 or 5
lpsa      response, log PSA level

The data frame is contained in the ElemStatLearn package.

> Prostate = prostate[,-10]    # remove the column specifying the train/test split
> names(Prostate)
[1] "lcavol"  "lweight" "age"     "lbph"    "svi"     "lcp"     "gleason" "pgg45"   "lpsa"
> dim(Prostate)
[1] 97  9
> pro.x = as.matrix(scale(Prostate[,1:8]))
> pro.y = as.matrix(Prostate[,9])

a) Fit a Lasso regression model to these data and obtain the summary plot that shows the coefficient shrinkage as a function of the L1-norm fraction relative to that of the OLS estimates. Discuss your results. (3 pts.)

> pro.lasso = lars(pro.x,pro.y,type="lasso",trace=T)
> plot(pro.lasso)

b) Use the cv.lars() function to find a range of optimal fractions to use for a Lasso model. Include the CV results, both in tabular and graphical form, and choose the smallest fraction you are comfortable using based on your CV results. (3 pts.)

c) Using your "optimal" fraction choice, conduct an MCCV analysis of the predictive abilities of your final Lasso model. The code for the function is shown below. (2 pts.)

Lasso MCCV function:

lasso.cv = function(x,y,fraction=.5,p=.667,B=100) {
  n = length(y)
  cv <- rep(0,B)
  for (i in 1:B) {
    ss <- floor(n*p)
    sam <- sample(1:n,ss,replace=F)
    fit2 <- lars(x[sam,],y[sam],type="lasso")
    ynew <- predict(fit2,s=fraction,newx=x[-sam,],mode="fraction")
    cv[i] <- sqrt(sum((y[-sam]-ynew$fit)^2)/(n - ss))
  }
  cv
}

d) Fit a LAR regression model to these data and obtain the summary plot that shows the coefficient shrinkage. Discuss your results. (3 pts.)

e) Use the cv.lars() function to find a range of step (model) sizes to use for a LAR model. Include the CV results, both in tabular and graphical form, and choose the smallest model you are comfortable using based on your CV results. (3 pts.)

f) Using your "optimal" model-size choice, conduct an MCCV analysis of the predictive abilities of your final LAR model. Modify the lasso.cv function above to cross-validate a LAR model of a given size s. To do this in R, edit the lasso.cv function:

> lars.cv = edit(lasso.cv)

Edit the lasso.cv function in the R Editor window, then select Save followed by Close script from the File pull-down menu. If you make a syntax mistake, R will ask you to re-edit the function; type lars.cv = edit() to edit the last script you had open. (2 pts.)
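For reference, one possible version of the edited function is sketched below. It assumes the model size s is given as a number of LAR steps, so type="lar" replaces type="lasso" and mode="step" replaces mode="fraction"; the default s = 5 is an arbitrary placeholder for the size you chose in part (e).

lars.cv = function(x,y,s=5,p=.667,B=100) {
  n = length(y)
  cv <- rep(0,B)
  for (i in 1:B) {
    ss <- floor(n*p)
    sam <- sample(1:n,ss,replace=F)
    fit2 <- lars(x[sam,],y[sam],type="lar")                # LAR instead of Lasso
    ynew <- predict(fit2,s=s,newx=x[-sam,],mode="step")    # s = number of steps
    cv[i] <- sqrt(sum((y[-sam]-ynew$fit)^2)/(n - ss))
  }
  cv
}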
g) Which cross-validates better: ridge, Lasso, or LAR? Explain. (2 pts.) One way to set up this comparison is sketched at the end of this handout.

PROBLEM 3 – CRYSTAL MELTING POINT DATA (forthcoming…)
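For part 2(g), here is a minimal sketch of one way to put the three methods side by side. It assumes a ridge.cv function from the ridge regression assignment that, like lasso.cv and lars.cv above, returns a vector of B RMSEP values; fraction=.5 and s=5 are placeholders for the values you settled on in parts (b) and (e):

> results = data.frame(Ridge=ridge.cv(pro.x,pro.y),
+                      Lasso=lasso.cv(pro.x,pro.y,fraction=.5),
+                      LAR=lars.cv(pro.x,pro.y,s=5))
> sapply(results,mean)                        # average RMSEP for each method
> boxplot(results,ylab="RMSEP",main="MCCV Comparison")

The method whose RMSEP distribution is centered lowest (and is tightest) cross-validates best.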