Stat 401B Fall 2015 Lab #11 (Due 12/10/15)

There is an R code set at the end of this lab (also posted on the course web page) that will prove useful for doing this lab. Use it in answering/doing the following.

1. There is an "Airfoil Self-Noise Data Set" of Lopez at http://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise that will be used here. There are 5 predictor variables in that data set that can be used to predict a 6th. 1503 cases are provided.

a) Fit an ordinary MLR model to these data via ordinary least squares. Also fit Lasso, Ridge, and Elastic Net (with, say, α = .5) linear models to these data (i.e., use penalized least squares and "shrink" the predictions toward ȳ and the size of the regression coefficients). Use glmnet and cross-validation to choose the values of λ you employ.

b) How do the fits in a) compare? Do residual plots make you think that a predictor linear in the 5 input variables will be effective? Explain. Does it look like some aspects of the relationship between the inputs and the response are unaccounted for in the fitting? Why or why not?

c) Standardize the input variables and fit k-nearest neighbor predictors to the data for k = 1, 3, 4, 5 (I don't know why the routine crashes when one tries k = 2) using knn.reg() from the FNN package. The "PRESS" it produces is the LOO cross-validation error sum of squares and can thus be used to choose a value of the complexity parameter k. Which value seems best? Large k corresponds to which, a "simple" or a "complex" predictor? How does knn prediction seem to compare to linear prediction in this "fairly large N and small p" context?

d) Use the tree package and fit a good regression tree to these data. Employ cost-complexity pruning and cross-validation to pick this tree. How many final nodes do you suggest? How do the predictions for this tree compare to the ones in a) and c)?

e) Fit a random forest to these data using m = 2 (which is approximately the standard default of the number of predictors divided by 3) using the randomForest package. How do these predictions compare to all the others?

f) People sometimes try to use "ensembles" of predictors in big predictive analytics problems. This means combining predictors by using a weighted average. See if you can find positive constants c_OLS, c_LASSO, c_RIDGE, c_ENET, c_KNN, c_TREE, and c_RF adding to 1 so that the corresponding linear combination of predictions has a larger correlation with the response variable than any single predictor. (In practice, one would have to look for these constants inside, and not after, cross-validation.) A sketch of one way to search for such weights appears after the code set below.

2. Redo everything above using the Ames House Price data set. (Change parameters of the methods as appropriate.) Overall, what do you recommend as a predictor for this situation? Explain.

Code Set for Stat 401B Laboratory #11

#Here is code for Stat 401B Lab 11

#Load the psych package (it provides read.clipboard())
library(psych)

Airfoil<-read.clipboard(header=FALSE,sep='\t')
Air<-data.frame(Airfoil)
names(Air)<-c("Freq","Angle","Chord","Velocity","Displace","Pressure")
summary(Air)
cor(Air)
hist(Air$Freq)
hist(Air$Angle)
hist(Air$Chord)
hist(Air$Velocity)
hist(Air$Displace)
hist(Air$Pressure)

x=as.matrix(Air[,1:5])
y=as.matrix(Air[,6])

#Load the glmnet, stats, and graphics packages and do some
#cross-validations to pick best lambda values
library(glmnet)

AirLasso<-cv.glmnet(x,y,alpha=1)
AirRidge<-cv.glmnet(x,y,alpha=0)
AirENet<-cv.glmnet(x,y,alpha=.5)
plot(AirLasso)
plot(AirRidge)
plot(AirENet)

#The next commands give lambdas usually recommended ...
#they are not exact minimizers of CVSSE, but ones "close to" the
#minimizers and associated with somewhat less flexible predictors
AirLasso$lambda.1se
AirRidge$lambda.1se
AirENet$lambda.1se

#We may plot residuals for fits to the whole data set based on the
#above lambdas against the original pressures and look at the
#fitted coefficients
plot(y,y-predict(AirLasso,newx=x))
abline(a=0,b=0)
coef(AirLasso)
plot(y,y-predict(AirRidge,newx=x))
abline(a=0,b=0)
coef(AirRidge)
plot(y,y-predict(AirENet,newx=x))
abline(a=0,b=0)
coef(AirENet)

#Here is an ordinary least squares (MLR) fit
AirLM<-lm(y~x)
plot(y,y-predict(AirLM))
abline(a=0,b=0)
AirLM$coef

#Here is some code for putting the coefficients for the
#fits side-by-side for purposes of examining the differences
compcoef<-cbind(as.matrix(AirLM$coef),coef(AirLasso),
  coef(AirRidge),coef(AirENet))
colnames(compcoef)<-c("OLS","Lasso","Ridge","ENet(.5)")
compcoef

#Now do some nearest neighbor regressions
#First, load the FNN package and standardize the predictors
library(FNN)

scale.x<-scale(x)
x[1:10,]
scale.x[1:10,]

Air.knn1<-knn.reg(scale.x,test=NULL,y,k=1)
Air.knn1
Air.knn1$pred[1:10]
plot(y,y-Air.knn1$pred)
abline(a=0,b=0)

Air.knn3<-knn.reg(scale.x,test=NULL,y,k=3)
Air.knn3
Air.knn3$pred[1:10]
plot(y,y-Air.knn3$pred)
abline(a=0,b=0)

Air.knn4<-knn.reg(scale.x,test=NULL,y,k=4)
Air.knn4
Air.knn4$pred[1:10]
plot(y,y-Air.knn4$pred)
abline(a=0,b=0)

Air.knn5<-knn.reg(scale.x,test=NULL,y,k=5)
Air.knn5
Air.knn5$pred[1:10]
plot(y,y-Air.knn5$pred)
abline(a=0,b=0)

#Now load the tree package and do fitting of a big tree
library(tree)

Airtree<-tree(Pressure~.,Air,
  control=tree.control(nobs=1503,mincut=5,minsize=10,mindev=.005))
summary(Airtree)
Airtree
plot(Airtree)
text(Airtree,pretty=0)

#One can try to find an optimal sub-tree of the tree just grown
#We'll use cross-validation based on cost-complexity pruning
#Each alpha (in the lecture notation) has a favorite number of
#nodes, and not all numbers of final nodes are ones optimizing
#the cost-complexity
#The code below finds the alphas at which the optimal subtrees change
cv.Airtree<-cv.tree(Airtree,FUN=prune.tree)
names(cv.Airtree)
cv.Airtree

#We can plot SSE versus size of the optimizing trees
plot(cv.Airtree$size,cv.Airtree$dev,type="b")

#And we can plot versus what the program calls "k," which is the
#reciprocal of alpha from class
plot(cv.Airtree$k,cv.Airtree$dev,type="b")

#Or we can plot versus "1/k" (i.e., alpha)
plot(1/cv.Airtree$k,cv.Airtree$dev,type="b")

#We can see what a pruned tree will look like for a size
#identified by the cross-validation
Airgoodtree<-prune.tree(Airtree,best=15)
Airgoodtree
summary(Airgoodtree)
plot(Airgoodtree)
text(Airgoodtree,pretty=0)

Airpred<-predict(Airgoodtree,Air,type="vector")
plot(y,y-Airpred)
abline(a=0,b=0)

#Next is some code for random forest fitting
#First load the randomForest package
library(randomForest)

Air.rf<-randomForest(Pressure~.,data=Air,
  type="regression",ntree=500,mtry=2)
Air.rf
predict(Air.rf)[1:10]

#This code produces a scatterplot matrix for all the predictions
#with the "45 degree line" drawn in
comppred<-cbind(y,predict(AirLM),predict(AirLasso,newx=x),
  predict(AirRidge,newx=x),predict(AirENet,newx=x),
  Air.knn3$pred,Air.knn5$pred,Airpred,predict(Air.rf))
colnames(comppred)<-c("y","OLS","Lasso","Ridge",
  "ENet(.5)","3NN","5NN","Tree","RF")
pairs(comppred,panel=function(x,y,...){
  points(x,y)
  abline(0,1)},xlim=c(100,145),ylim=c(100,145))
round(cor(as.matrix(comppred)),2)
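The code set above stops short of part f). What follows is a minimal sketch of one way to hunt for ensemble weights, assuming the comppred matrix and y from the end of the code set are still in the workspace; the object names preds, wt, bestwt, and bestcor are introduced here for illustration and are not part of the posted code. It simply draws many random positive weight vectors that sum to 1 and keeps the one whose weighted average of the fitted predictions correlates most strongly with the response.

#A crude random search for the ensemble weights asked about in part f)
preds<-comppred[,-1]           #drop the y column, keep the prediction columns
set.seed(401)
bestcor<- -Inf
bestwt<-rep(1/ncol(preds),ncol(preds))
for(i in 1:10000){
  wt<-rexp(ncol(preds))        #positive draws
  wt<-wt/sum(wt)               #rescale so the weights add to 1
  r<-cor(as.vector(y),as.vector(preds%*%wt))
  if(r>bestcor){
    bestcor<-r
    bestwt<-wt
    }
  }
names(bestwt)<-colnames(preds)
bestwt
bestcor
round(cor(y,preds),4)          #single-predictor correlations for comparison

Note that comppred carries both the 3-nn and 5-nn predictions, so this search uses eight weights rather than the seven constants named in f); drop one of the kNN columns from preds if you want exactly that list. And, as the problem statement warns, weights chosen this way are tuned to the same data that produced the fits, so a serious version would carry out the search inside cross-validation rather than after it.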