Lab 10: Model Selection

Model Selection
- Goal: find the best group of X variables that can predict Y.
- *Don't use model selection when you have a specific hypothesis that you are testing!
- Four ways to select variables for a prediction model:
  - Best subsets
  - Backward elimination
  - Forward selection
  - Stepwise selection

Best Subsets
- Considers all possible models with 1 covariate, then 2 covariates, then 3 covariates, and so on, up to the maximum model with all possible covariates.
- Total number of models = 2^k - 1, where k is the number of potential covariates (for example, k = 4 gives 15 models).
- Only available within proc reg.
- Syntax: [code not recovered]
- Use Mallows' Cp or an F test to assess which model is best!

Backward Elimination
- Start with the maximum model (the model with all variables) and eliminate the non-significant predictors.
- Non-significant predictors are those with a p-value above the specified SLS (significance level to stay).
- Available within proc reg or proc glmselect.
- Glmselect allows for easy inclusion of class variables and/or interaction terms.
- Syntax: [code not recovered]. Annotations on the code:
  - Any class variables here, as in proc glm. Keeps all parts of the class variable together (all parts will be removed together).
  - Use significance level to remove variables from the model.
  - Type of selection procedure.
  - More liberal cutoff…
  - Shows statistics for removed variables.
  - Stops when all variables in the model have p < 0.15.
  - Shows p-values for all models.

Forward Selection
- Start with a model with no covariates, and add X variables one at a time if they are significant at the SLE (significance level for entry).
- Again, can use proc reg or proc glmselect.
- Syntax: [code not recovered]. Annotations on the code:
  - Any class variables here, as in proc glm. Keeps all parts of the class variable together (all parts will be entered together).
  - Use significance level to enter variables into the model.
  - Type of selection procedure.
  - More liberal cutoff…
  - Shows statistics for entered variables.
  - Stops when all variables in the model have p < 0.15.
  - Shows p-values for all models.

Stepwise Selection
- Combination of forward and backward selection.
- Specify an entrance significance level and a significance level for variables to remain in the model.
- SAS will always remove a variable before adding a new one.
- Syntax: [code not recovered]. Annotations on the code:
  - Any class variables here, as in proc glm. Keeps all parts of the class variable together (all parts will be removed together).
  - Use significance level to enter or remove variables from the model.
  - Type of selection procedure.
  - More liberal cutoff…
  - Shows statistics for entered or removed variables.
  - Stops when all variables in the model have p < 0.15.
  - Shows p-values for all models.

Additional points
- Remember that these are automated processes, and must be checked by hand!
- Be aware of entry of some dummy variables and not others (e.g., white race but not Asian race); this is only a problem in proc reg model selection.
- Be aware of entry of higher-order variables without their lower-order terms (e.g., height^2 without height).
- Be sure to evaluate the assumptions of linear regression for the final model.
- Consider possible interactions, and assess the model for any collinearity issues.
- Remember that you should NOT use model selection procedures when you are testing a specific hypothesis!
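
The four selection procedures described above might be requested in SAS roughly as follows. This is only a sketch, not the code from the original slides: the dataset `work.lab10`, the outcome `y`, the covariates `age`, `height`, and `weight`, and the class variable `race` are all hypothetical names. The 0.15 cutoffs mirror the p < 0.15 stopping rule mentioned in the annotations.

```sas
/* Hypothetical names throughout: dataset work.lab10, outcome y,
   numeric covariates age, height, weight, and class variable race. */

/* Best subsets (proc reg): rank candidate models by Mallows' Cp.
   proc reg has no class statement, so only numeric covariates appear. */
proc reg data=work.lab10;
   model y = age height weight / selection=cp;
run;
quit;

/* Backward elimination: start from the full model and drop
   predictors whose p-value is above SLS = 0.15. */
proc glmselect data=work.lab10;
   class race;                       /* levels of race enter or leave together */
   model y = age height weight race
         / selection=backward(select=sl sls=0.15)
           details=all               /* statistics for each selection step */
           showpvalues;              /* request p-values in the output tables */
run;

/* Forward selection: start from the empty model and add
   predictors that are significant at SLE = 0.15. */
proc glmselect data=work.lab10;
   class race;
   model y = age height weight race
         / selection=forward(select=sl sle=0.15)
           details=all showpvalues;
run;

/* Stepwise selection: uses both an entry level (SLE)
   and a stay level (SLS). */
proc glmselect data=work.lab10;
   class race;
   model y = age height weight race
         / selection=stepwise(select=sl sle=0.15 sls=0.15)
           details=all showpvalues;
run;
```

`select=sl` is set explicitly so that selection is driven by significance levels, as in the notes, rather than by an information criterion. Whichever procedure is used, the selected model still needs to be checked by hand as described under Additional points.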