STAT 425 – Modern Methods of Data Analysis
Assignment 3 – Lasso, LAR, PCR, and PLS (38 points)
PROBLEM 1 – NEAR INFRARED (NIR) SPECTRA & GASOLINE OCTANE
In this regression problem your goal is to model gasoline octane as a function of NIR spectra
values. The NIR spectra were measured using diffuse reflectance as log(1/R) from 900 nm to
1700 nm in 2 nm intervals, giving 401 wavelengths, i.e. p = 401. We will use 50 of the original
60 observations to train the model and compare the predictive performance of different methods
when predicting back the 10 test cases. The data frame is gasoline and comes with the pls
package.
Some preliminary plots to get a sense of the nature of these data:
> library(pls)
Attaching package: ‘pls’
> gasoline.x = gasoline$NIR
> dim(gasoline.x)
[1] 60 401
> matplot(t(gasoline.x),type="l",xlab="Variable",ylab="Spectral Intensity")
> title(main="Spectral Readings for Gasoline Data")
> pairs.plus(gasoline.x[,1:10])
> pairs.plus(gasoline.x[,201:210])
> pairs.plus(gasoline.x[,301:310])
etc…
a) Use the eigen() function to obtain the spectral decomposition of the correlation matrix
R. Some code to help you out is given below (see notes as well):
> R = cor(gasoline.x)
> gasoline.xs = scale(gasoline.x)
> S = var(gasoline.xs)
BONUS QUESTION: Why are R and S the same matrix?? (2 pts.)
> eigenR = eigen(R)
> eigenR$values  # How many eigenvalues are above 1?? (1 pt.)
> z1 = gasoline.xs%*%eigenR$vectors[,1]
> z2 = gasoline.xs%*%eigenR$vectors[,2]
etc…
> OctPC = data.frame(octane=gasoline$octane,z1,z2,...)
> pairs.plus(OctPC)
Form as many zi’s as you have eigenvalues above 1. Examine a scatterplot matrix of the
principal components corresponding to eigenvalues above 1 together with the response
octane; what do you see? (3 pts.)
b) Use these zi’s as terms in an OLS regression model for octane; discuss the model. (3 pts.)
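A minimal sketch of this fit, assuming OctPC was built as in part (a) with columns octane, z1, z2, etc. (the model object name oct.pc.lm is just a suggestion):

```r
# Assumes OctPC from part (a): octane plus the retained principal components
oct.pc.lm = lm(octane ~ ., data = OctPC)
summary(oct.pc.lm)  # examine the coefficients, R-squared, and overall F-test
```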
c) What is the squared correlation between the fitted values and the actual octane values for
this model? (1 pt.)
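One way to compute this, assuming the OLS fit from part (b) is stored in an object called oct.pc.lm:

```r
# Squared correlation between the fitted values and the actual octane values
cor(fitted(oct.pc.lm), OctPC$octane)^2
```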
d) Now use the pcr() function, as shown in the yarn example in your notes, to fit a PCR model
for these data. What is the “optimal” # of components to use in your model? (3 pts.)
> oct.pcr = pcr(octane~scale(NIR),data=gasoline,ncomp=12,validation="CV")
> summary(oct.pcr)
e) Examine the loadings on the components you used in your model. Use this plot to
determine which NIR spectra load heavily on the 1st two principal components.
Summarize your findings. (3 pts.)
> loadingplot(oct.pcr,comps=1:2,legendpos="topright")
f) Form a training subset of the gasoline data as follows:
> gasoline.train = gasoline[1:50,]
> gasoline.test = gasoline[51:60,]
> attributes(gasoline.train)
$names
[1] "octane" "NIR"

$row.names
 [1] "1"  "2"  "3"  ...  "50"

$class
[1] "data.frame"
> dim(gasoline.train$NIR)
[1] 50 401
Using the optimal number of components chosen above, fit the model to these training
data and predict the octane of the test cases using their NIR spectra. What is the RMSEP
using the training/test set approach? (2 pts.)
Assuming you have already built a training model called mymodel, do the following to
obtain the predicted octanes for the observations in the test set.
> ypred = predict(mymodel,ncomp=??,newdata=gasoline.test)
> cbind(ypred,gasoline.test$octane)
> sqrt(sum((ypred - gasoline.test$octane)^2)/10)  # RMSEP
Note: ?? = the number of components you think should be used.
RMSEP = root mean squared error of prediction
g) Estimate the prediction error (PSE) using Monte Carlo Cross-Validation (MCCV) with
p = .80. (2 pts.)
The code for the function pcr.cv is shown below. It takes the X’s, the response y, and
the number of components to use in the PCR fit as the required arguments. Note the
function computes RMSEP for each MC sample, rather than the average squared
prediction error. Copy and paste this code into R.
pcr.cv = function(x,y,ncomp=2,p=.667,B=100) {
  n = length(y)
  data = data.frame(y=y,x=I(x))  # I() keeps the spectra together as a single matrix column
  cv <- rep(0,B)
  for (i in 1:B) {
    ss <- floor(n*p)
    sam <- sample(1:n,ss,replace=F)
    fit2 <- pcr(y~x,ncomp=ncomp,data=data[sam,])
    ynew <- predict(fit2,ncomp=ncomp,newdata=data[-sam,])
    cv[i] <- sqrt(sum((y[-sam]-ynew)^2)/(n - ss))
  }
  cv
}
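A possible call for part (g); here ncomp = 4 is only a placeholder for the number of components you settled on in part (d):

```r
# MCCV estimate of prediction error with p = .80 (ncomp = 4 is a placeholder)
cv.results = pcr.cv(gasoline.train$NIR, gasoline.train$octane, ncomp = 4, p = .80, B = 100)
mean(cv.results)  # average RMSEP over the B Monte Carlo splits
```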
PROBLEM 2 – PROSTATE CANCER DATA
Using the prostate data that you used for the ridge regression assignment, fit the LASSO and
LAR regression models to these data and compare the results to both OLS and ridge regression.
The prostate cancer data is described below:

lcavol    log cancer volume
lweight   log prostate weight
age       age in years
lbph      log of the amount of benign prostatic hyperplasia
svi       seminal vesicle invasion
lcp       log of capsular penetration
gleason   Gleason score (a numeric vector)
pgg45     percent of Gleason score 4 or 5
lpsa      response, log PSA level
The data frame is contained in the ElemStatLearn package.
> Prostate = prostate[,-10]  # remove column specifying train/test data
> names(Prostate)
[1] "lcavol"  "lweight" "age"     "lbph"    "svi"     "lcp"     "gleason" "pgg45"   "lpsa"
> dim(Prostate)
[1] 97 9
> pro.x = as.matrix(scale(Prostate[,1:8]))
> pro.y = as.matrix(Prostate[,9])
a) Fit a Lasso regression model to these data and obtain the summary plot that shows the
coefficient shrinkage as a function of the L1-norm fraction relative to that of the OLS
estimates. Discuss your results. (3 pts.)
> pro.lasso = lars(pro.x,pro.y,type="lasso",trace=T)
> plot(pro.lasso)
b) Use the cv.lars() function to find a range of optimal fractions to use for a lasso
model. Include the CV results, both in tabular and graphical form, and choose the
smallest fraction you are comfortable using based on your CV results. (3 pts.)
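A sketch of the cross-validation call; cv.lars() returns the fractions tried ($index) and the CV error at each ($cv), and plots the CV curve by default:

```r
# 10-fold CV over a grid of L1-norm fractions
pro.cv = cv.lars(pro.x, pro.y, type = "lasso", K = 10)
cbind(fraction = pro.cv$index, cv = pro.cv$cv)  # tabular form of the CV results
```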
c) Using your “optimal” fraction choice conduct a MCCV analysis of the predictive abilities
of your final Lasso model. The code for the function is shown below. (2 pts.)
Lasso MCCV Function
lasso.cv = function(x,y,fraction=.5,p=.667,B=100) {
  n = length(y)
  cv <- rep(0,B)
  for (i in 1:B) {
    ss <- floor(n*p)
    sam <- sample(1:n,ss,replace=F)
    fit2 <- lars(x[sam,],y[sam],type="lasso")
    ynew <- predict(fit2,s=fraction,newx=x[-sam,],mode="fraction")
    cv[i] <- sqrt(sum((y[-sam]-ynew$fit)^2)/(n - ss))
  }
  cv
}
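A possible call, with fraction = 0.5 standing in as a placeholder for the optimal fraction you chose in part (b):

```r
# MCCV of the final Lasso model (fraction = 0.5 is a placeholder)
lasso.results = lasso.cv(pro.x, pro.y, fraction = 0.5, p = .667, B = 100)
mean(lasso.results)  # average RMSEP over the Monte Carlo splits
```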
d) Fit an LAR regression model to these data and obtain the summary plot that shows the
coefficient shrinkage. Discuss your results. (3 pts.)
e) Use the cv.lars() function to find a range of step (model) sizes to use for a LAR model.
Include the CV results, both in tabular and graphical form, and choose the smallest model
you are comfortable using based on your CV results. (3 pts.)
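For LAR, cv.lars() can cross-validate over the number of steps instead of the L1 fraction; a sketch:

```r
# CV over model size (number of LAR steps) rather than L1-norm fraction
lar.cv = cv.lars(pro.x, pro.y, type = "lar", mode = "step")
cbind(step = lar.cv$index, cv = lar.cv$cv)  # CV error at each step
```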
f) Using your “optimal” step choice, conduct a MCCV analysis of the predictive abilities
of your final LAR model. Modify the lasso.cv function above to cross-validate a LAR
model of a given size s. To do this in R, edit the lasso.cv function above. (2 pts.)
> lars.cv = edit(lasso.cv)
Edit the lasso.cv function in the R Editor window and then select Save followed by Close script from the
File pull-down menu. If you make a syntax mistake, you will be asked to re-edit the function; type
lars.cv = edit() to edit the last script you had open.
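One possible version of the edited function; the only changes from lasso.cv are type="lar" in the lars() call and predicting at step s with mode="step":

```r
lars.cv = function(x, y, s = 5, p = .667, B = 100) {
  n = length(y)
  cv <- rep(0, B)
  for (i in 1:B) {
    ss <- floor(n * p)
    sam <- sample(1:n, ss, replace = F)
    fit2 <- lars(x[sam,], y[sam], type = "lar")                   # LAR instead of lasso
    ynew <- predict(fit2, s = s, newx = x[-sam,], mode = "step")  # predict at step s
    cv[i] <- sqrt(sum((y[-sam] - ynew$fit)^2)/(n - ss))
  }
  cv
}
```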
g) Which cross-validates better: ridge, Lasso, or LAR? Explain. (2 pts.)
Problem 3 – CRYSTAL MELTING POINT DATA
(forthcoming…)