Stat 401B Fall 2015 Lab #11 (Due 12/10/15)

There is an R code set at the end of this lab (also posted on the course web page) that will prove useful
for doing this lab. Use it in answering the following.
1. The "Airfoil Self-Noise Data Set" of Lopez at
http://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise will be used here. There are 5 predictor variables in that data set that can be used to predict a 6th.
1503 cases are provided.
a) Fit an ordinary MLR model to these data via ordinary least squares. Also fit Lasso, Ridge, and Elastic
Net (with, say, α = .5) linear models to these data (i.e., use penalized least squares and "shrink" the
predictions toward ȳ and the sizes of the regression coefficients toward 0). Use glmnet and cross-validation to
choose the values of λ you employ.
b) How do the fits in a) compare? Do residual plots make you think that a predictor linear in the 5 input
variables will be effective? Explain. Does it look like some aspects of the relationship between inputs
and the response are unaccounted for in the fitting? Why or why not?
c) Standardize the input variables and fit k-nearest neighbor predictors to the data for k = 1, 3, 4, 5 (I don't
know why the routine crashes when one tries k = 2) using knn.reg() from the FNN package. The
"PRESS" it produces is the LOO cross-validation error sum of squares and can thus be used to choose a
value of the complexity parameter k. Which value seems best? Large k corresponds to which, a
"simple" or a "complex" predictor? How does knn prediction seem to compare to linear prediction in this
"fairly large N and small p" context?
d) Use the tree package and fit a good regression tree to these data. Employ cost-complexity pruning
and cross-validation to pick this tree. How many final nodes do you suggest? How do the predictions for
this tree compare to the ones in a) and c)?
e) Fit a random forest to these data using m = 2 (which is approximately the standard default of the
number of predictors divided by 3) using the randomForest package. How do these predictions
compare to all the others?
f) People sometimes try to use "ensembles" of predictors in big predictive analytics problems. This
means combining predictors by using a weighted average. See if you can find positive constants
c_OLS, c_LASSO, c_RIDGE, c_ENET, c_KNN, c_TREE, and c_RF
adding to 1 so that the corresponding linear combination of predictions has a larger correlation with the
response variable than any single predictor. (In practice, one would have to look for these constants inside
and not after cross-validation.) One rough way to search for such weights is sketched at the end of the code set.
2. Redo everything above using the Ames House Price data set. (Change parameters of methods as
appropriate.) Overall, what do you recommend as a predictor for this situation? Explain.
Code Set for Stat 401B Laboratory #11
#Here is code for Stat 401B Lab #11
#Load the psych package
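#(read.clipboard() below assumes the tab-delimited airfoil data have been
#copied to the clipboard from the UCI page before this line is run)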
Airfoil<-read.clipboard(header=FALSE,sep='\t')
Air<-data.frame(Airfoil)
names(Air)<-c("Freq","Angle","Chord",
"Velocity","Displace","Pressure")
summary(Air)
cor(Air)
hist(Air$Freq)
hist(Air$Angle)
hist(Air$Chord)
hist(Air$Velocity)
hist(Air$Displace)
hist(Air$Pressure)
x=as.matrix(Air[,1:5])
y=as.matrix(Air[,6])
#Load the glmnet,stats, and graphics packages and do some
#cross-validations to pick best lambda values
AirLasso<-cv.glmnet(x,y,alpha=1)
AirRidge<-cv.glmnet(x,y,alpha=0)
AirENet<-cv.glmnet(x,y,alpha=.5)
plot(AirLasso)
plot(AirRidge)
plot(AirENet)
#The next commands give lambdas usually recommended .... they are not
#exact minimizers of CVSSE, but ones "close to" the minimizers
#and associated with somewhat less flexible predictors
AirLasso$lambda.1se
AirRidge$lambda.1se
AirENet$lambda.1se
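#(if desired, the exact CV-error minimizers are also stored by cv.glmnet
#and can be compared to the "1se" choices above)
AirLasso$lambda.min
AirRidge$lambda.min
AirENet$lambda.min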
#We may plot residuals for fits to the whole data set based on the
#above lambdas against the original pressures and look at the
#fitted coefficients
plot(y,y-predict(AirLasso,newx=x))
abline(a=0,b=0)
coef(AirLasso)
plot(y,y-predict(AirRidge,newx=x))
abline(a=0,b=0)
coef(AirRidge)
plot(y,y-predict(AirENet,newx=x))
abline(a=0,b=0)
coef(AirENet)
#Here is an ordinary least squares (MLR) fit
AirLM<-lm(y~x)
plot(y,y-predict(AirLM))
abline(a=0,b=0)
AirLM$coef
#Here is some code for putting the coefficients for the
#fits side-by-side for purposes of examining the differences
compcoef<-cbind(as.matrix(AirLM$coeff),coef(AirLasso),
coef(AirRidge),coef(AirENet))
colnames(compcoef)<-c("OLS","Lasso","Ridge","ENet(.5)")
compcoef
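#(optional check for parts a)/b): a small helper sse() defined here sums
#squared residuals so the in-sample error sums of squares of the four
#linear fits can be compared side by side; these describe fit to the
#training data only, not prediction error)
sse<-function(r) sum(r^2)
c(OLS=sse(y-predict(AirLM)),
Lasso=sse(y-predict(AirLasso,newx=x)),
Ridge=sse(y-predict(AirRidge,newx=x)),
ENet=sse(y-predict(AirENet,newx=x)))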
#Now do some nearest neighbor regressions
#First, load the FNN package and standardize the predictors
scale.x<-scale(x)
x[1:10,]
scale.x[1:10,]
Air.knn1<-knn.reg(scale.x,test=NULL,y,k=1)
Air.knn1
Air.knn1$pred[1:10]
plot(y,y-Air.knn1$pred)
abline(a=0,b=0)
Air.knn3<-knn.reg(scale.x,test=NULL,y,k=3)
Air.knn3
Air.knn3$pred[1:10]
plot(y,y-Air.knn3$pred)
abline(a=0,b=0)
Air.knn4<-knn.reg(scale.x,test=NULL,y,k=4)
Air.knn4
Air.knn4$pred[1:10]
plot(y,y-Air.knn4$pred)
abline(a=0,b=0)
Air.knn5<-knn.reg(scale.x,test=NULL,y,k=5)
Air.knn5
Air.knn5$pred[1:10]
plot(y,y-Air.knn5$pred)
abline(a=0,b=0)
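#(the PRESS values returned above are the LOO cross-validation error sums
#of squares; collecting them makes the comparison across k direct)
c(k1=Air.knn1$PRESS,k3=Air.knn3$PRESS,k4=Air.knn4$PRESS,k5=Air.knn5$PRESS)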
#Now load the tree package and do fitting of a big tree
Airtree<-tree(Pressure ~.,Air,
control=tree.control(nobs=1503,mincut=5,minsize=10,mindev=.005))
summary(Airtree)
Airtree
plot(Airtree)
text(Airtree,pretty=0)
#One can try to find an optimal sub-tree of the tree just grown
#We'll use cross-validation based on cost-complexity pruning
#Each alpha (in the lecture notation) has a favorite number of
#nodes and not all numbers of final nodes are ones optimizing
#the cost-complexity
#The code below finds the alphas at which optimal subtrees change
cv.Airtree<-cv.tree(Airtree,FUN=prune.tree)
names(cv.Airtree)
cv.Airtree
#We can plot SSE versus size of the optimizing trees
plot(cv.Airtree$size,cv.Airtree$dev,type="b")
#And we can plot versus what the program calls "k," which is the
#reciprocal of alpha from class
plot(cv.Airtree$k,cv.Airtree$dev,type="b")
#Or we can plot versus "1/k" (i.e. alpha)
plot(1/cv.Airtree$k,cv.Airtree$dev,type="b")
#We can see what a pruned tree will look like for a size
#identified by the cross-validation
Airgoodtree<-prune.tree(Airtree,best=15)
Airgoodtree
summary(Airgoodtree)
plot(Airgoodtree)
text(Airgoodtree,pretty=0)
Airpred<-predict(Airgoodtree,Air,type="vector")
plot(y,y-Airpred)
abline(a=0,b=0)
#Next is some code for Random Forest Fitting
#First Load the randomForest package
#(regression is inferred from the numeric response, so no "type" argument is needed)
Air.rf<-randomForest(Pressure~.,data=Air,
ntree=500,mtry=2)
Air.rf
predict(Air.rf)[1:10]
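#(note: predict() on a randomForest object with no new data returns
#out-of-bag predictions rather than ordinary fitted values)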
#This code produces a scatterplot matrix for all the predictions
#with the "45 degree line" drawn in
comppred<-cbind(y,predict(AirLM),predict(AirLasso,newx=x),
predict(AirRidge,newx=x),predict(AirENet,newx=x),
Air.knn3$pred,Air.knn5$pred,Airpred,predict(Air.rf))
colnames(comppred)<-c("y","OLS","Lasso","Ridge",
"ENet(.5)","3NN","5NN","Tree","RF")
pairs(comppred,panel=function(x,y,...){
points(x,y)
abline(0,1)},xlim=c(100,145),
ylim=c(100,145))
round(cor(as.matrix(comppred)),2)
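#A rough way to look for the ensemble weights in part 1f: a random search
#over nonnegative weights summing to 1, keeping the set whose weighted
#average of predictions is most highly correlated with y. This is only a
#sketch (it uses the full-data predictions, not cross-validated ones, and
#searches 8 weights because comppred carries both 3NN and 5NN)
preds<-comppred[,-1]
set.seed(401)
best.cor<- -Inf
best.w<-NULL
for(i in 1:5000){
w<-runif(ncol(preds))
w<-w/sum(w)
r<-as.numeric(cor(preds%*%w,y))
if(r>best.cor){best.cor<-r;best.w<-w}
}
names(best.w)<-colnames(preds)
round(best.w,3)
best.cor
#compare best.cor to the individual correlations with y in the matrix above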