Chapter 9

Types of Studies
i. Controlled experiments – the X's are controlled through a designed experiment, so no selection methods are needed.
ii. Controlled experiments with supplemental variables – most of the X's come from the designed experiment, but supplemental variables are also recorded, e.g. gender, which was not originally included. The goal is to reduce the error variance (MSE).
iii. Confirmatory observational studies – based on an observational study rather than an experiment; intended to test hypotheses derived from previous studies. The data consist of covariates from the previous studies, called controlled variables or risk factors, plus new variables of interest, called primary variables. E.g., suppose Y is the incidence of a type of cancer (categorical), the risk factors are Age, Gender, Race, etc., and the primary covariate is the amount of vitamin E taken daily.
iv. Exploratory observational studies – investigators do not know ahead of time what they are looking for but have a large amount of data. They search for covariates that may be related to Y.
v. Preliminary model investigation – identify the functional forms of the covariates (that is, do we want X, X^2, or log X?) and look for important interactions.

Reduction of Explanatory Variables (corresponding to the study types above)
i. Not a concern.
ii. May want to reduce the supplemental variables.
iii. Do NOT want to reduce the controlled variables or primary variables – no reduction is used.
iv. Covariate reduction.

Methods of Model Selection
1. All-possible-regressions procedures, which use criteria to identify a small group of good regression models.
2. Automatic subset selection methods, which identify a single "best" overall model. This best model may vary with the selection method invoked.

Criteria Method – Minitab
Using Stat > Regression > Best Subsets, Minitab produces four criteria:
1. R2
2. R2adj
3. Mallows Cp
4. S
If you use Stat > Regression > Stepwise, under Options you can also get the following criteria:
5. PRESS – Prediction Sum of Squares
6. Predicted R2

R2 = SSR/SST, or equivalently 1 – SSE/SST
R2adj = 1 – [(n–1)/(n–p)](SSEp/SST), or equivalently 1 – MSEp/[SST/(n–1)]
Mallows Cp = (SSEp/MSEk) – (n – 2p), where MSEk is the MSE with all predictors in the model and p is the number of parameter estimates in the candidate model (i.e., the number of predictors plus the intercept)
S = sqrt(MSE)
PRESS = Σ [ei/(1 – hi)]^2, where ei is the residual and hi is the leverage value for the ith observation. Leverage, hi, is the ith diagonal element of the hat matrix H = X(X^T X)^(-1) X^T
Predicted R2 = 1 – (PRESS/SST)
(A small computational sketch of these criteria follows the selection guidelines below.)

Model Selection Based on Criteria
If using criteria:
a. 1, 2 or 6: select the model(s) with the highest value(s). Keep in mind that as variables are added R2 will always at least stay the same, if not increase, but R2adj may decrease as variables are added.
b. 4: select the model(s) with the lowest value(s); these are the models providing the smallest error.
c. 3: select a model where Cp ≈ p (the number of parameters in the model) and the Cp value is small. Cp > p reflects possible bias in the model, while Cp < p reflects random error.
d. 5: select the model(s) with low PRESS value(s).
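To make the formulas above concrete, here is a minimal sketch in Python/NumPy (an illustration only, not part of Minitab; the function name subset_criteria and its arguments are assumptions) that computes all six criteria for one candidate subset of predictors against the full model:

import numpy as np

def subset_criteria(X_full, y, cols):
    """X_full: n x k matrix of all candidate predictors (no intercept column);
    cols: list of column indices making up the candidate subset."""
    n = len(y)

    def fit(Xmat):
        X1 = np.column_stack([np.ones(n), Xmat])                   # add intercept
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)               # least-squares fit
        resid = y - X1 @ beta
        hat_diag = np.diag(X1 @ np.linalg.inv(X1.T @ X1) @ X1.T)    # leverages h_i
        return resid, hat_diag

    sst = np.sum((y - y.mean()) ** 2)
    resid_k, _ = fit(X_full)                                        # full model, k predictors
    mse_k = np.sum(resid_k ** 2) / (n - X_full.shape[1] - 1)

    resid_p, h = fit(X_full[:, cols])                               # candidate model
    p = len(cols) + 1                                               # predictors plus intercept
    sse_p = np.sum(resid_p ** 2)
    r2 = 1 - sse_p / sst
    r2_adj = 1 - (n - 1) / (n - p) * sse_p / sst
    cp = sse_p / mse_k - (n - 2 * p)                                # Mallows Cp
    s = np.sqrt(sse_p / (n - p))                                    # S = sqrt(MSE)
    press = np.sum((resid_p / (1 - h)) ** 2)                        # PRESS
    r2_pred = 1 - press / sst                                       # predicted R2
    return {"R2": r2, "R2adj": r2_adj, "Cp": cp, "S": s, "PRESS": press, "R2pred": r2_pred}

For the cement data introduced below, cols = [0, 1] (i.e., X1 and X2) should reproduce, up to rounding, the R-Sq = 97.9, Mallows C-p = 2.7 row of the best-subsets output.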
Example data. Percent weights:
X1 = Tricalcium Aluminate
X2 = Tricalcium Silicate
X3 = Tetracalcium Alumino Ferrite
X4 = Dicalcium Silicate
Y  = amount of heat evolved during curing, in calories per gram of cement

 X1   X2   X3   X4      Y
  7   26    6   60   78.5
  1   29   15   52   74.3
 11   56    8   20  104.3
 11   31    8   47   87.6
  7   52    6   33   95.9
 11   55    9   22  109.2
  3   71   17    6  102.7
  1   31   22   44   72.5
  2   54   18   22   93.1
 21   47    4   26  115.9
  1   40   23   34   83.8
 11   66    9   12  113.3
 10   68    8   12  109.4

Best Subsets Regression: Y versus X1, X2, X3, X4

MTB > BReg 'Y' 'X1'-'X4';
SUBC>   NVars 1 4;
SUBC>   Best 2;        (this option prints the best 2 models of each size)
SUBC>   Constant.

Response is Y

                                               X X X X
Vars   R-Sq   R-Sq(adj)   Mallows C-p       S  1 2 3 4
   1   67.5        64.5         138.7  8.9639        X
   1   66.6        63.6         142.5  9.0771    X
   2   97.9        97.4           2.7  2.4063  X X
   2   97.2        96.7           5.5  2.7343  X     X
   3   98.2        97.6           3.0  2.3087  X X   X
   3   98.2        97.6           3.0  2.3121  X X X
   4   98.2        97.4           5.0  2.4460  X X X X

Based on the criteria, the best models are those including X1, X2 and X4 or X1, X2 and X3.

Forward Selection: Y versus X1, X2, X3, X4

MTB > Stepwise 'Y' 'X1'-'X4';
SUBC>   Forward;
SUBC>   AEnter 0.25;        (0.25 is the default alpha-to-enter value)
SUBC>   Best 0;
SUBC>   Constant;
SUBC>   Press.

Forward selection.  Alpha-to-Enter: 0.25

Response is Y on 4 predictors, with N = 13

Step              1         2         3
Constant     117.57    103.10     71.65

X4           -0.738    -0.614    -0.237
T-Value       -4.77    -12.62     -1.37
P-Value       0.001     0.000     0.205

X1                       1.44      1.45
T-Value                 10.40     12.41
P-Value                 0.000     0.000

X2                                 0.42
T-Value                            2.24
P-Value                           0.052

S              8.96      2.73      2.31
R-Sq          67.45     97.25     98.23
R-Sq(adj)     64.50     96.70     97.64
Mallows C-p   138.7       5.5       3.0
PRESS       1194.22   121.224   85.3511
R-Sq(pred)    56.03     95.54     96.86

The best model is the one in place after Step 3 concludes: Y = B0 + B1X1 + B2X2 + B4X4. (Remember that the alpha-to-enter is 0.25, which is why X4 is entered.)

Step 1 starts by calculating the p-value for regressing Y on each variable separately and keeping the variable with the lowest p-value that is also less than the alpha-to-enter value (here 0.25); the initial variable is X4. Step 2 computes the partial statistics for adding each remaining predictor to the model containing the variable from Step 1. That is, Minitab finds the partial statistics for adding X1 to a model containing X4, then for adding X2 to a model containing X4, and finally for adding X3 to a model containing X4; from these partials it selects the variable whose p-value is lowest and less than the criterion alpha of 0.25. For Step 2 this variable is X1. Step 3 finds the partial statistics for the remaining variables added individually to the model containing the previously selected variables: partials for adding X2 to a model containing X4, X1 and for adding X3 to a model containing X4, X1. The variable whose p-value is lowest and less than 0.25 is entered; here that is X2. Step 4 repeats the previous step using the model with the three variables already selected, i.e., it finds the partials for adding X3 to the model containing X1, X2, and X4. The p-value for this partial F is not less than 0.25, so the process stops at Step 3 with the best forward-selected model.
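The forward-selection logic just described can be sketched in a few lines of Python (using statsmodels for the fits; this is an illustration of the algorithm, not Minitab's actual implementation, and the function name forward_select is an assumption):

import numpy as np
import statsmodels.api as sm

def forward_select(X, y, alpha_enter=0.25):
    """X: n x k predictor matrix, y: response; returns selected column indices in entry order."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        # p-value of each candidate's coefficient when added to the current model
        pvals = {}
        for j in remaining:
            Xj = sm.add_constant(X[:, selected + [j]])
            pvals[j] = sm.OLS(y, Xj).fit().pvalues[-1]    # partial (conditional) t-test p-value
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha_enter:
            break                                         # no candidate qualifies; stop
        selected.append(best)
        remaining.remove(best)
    return selected

With alpha_enter = 0.25 on the cement data this should trace the same path as the output above: X4 first, then X1, then X2, then stop.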
Backward Selection Regression: Y versus X1, X2, X3, X4

MTB > Stepwise 'Y' 'X1'-'X4';
SUBC>   Backward;
SUBC>   ARemove 0.1;        (0.1 is the default alpha-to-remove value)
SUBC>   Best 0;
SUBC>   Constant;
SUBC>   Press.

Backward elimination.  Alpha-to-Remove: 0.1

Response is Y on 4 predictors, with N = 13

Step              1         2         3
Constant      62.41     71.65     52.58

X1             1.55      1.45      1.47
T-Value        2.08     12.41     12.10
P-Value       0.071     0.000     0.000

X2            0.510     0.416     0.662
T-Value        0.70      2.24     14.44
P-Value       0.501     0.052     0.000

X3             0.10
T-Value        0.14
P-Value       0.896

X4            -0.14     -0.24
T-Value       -0.20     -1.37
P-Value       0.844     0.205

S              2.45      2.31      2.41
R-Sq          98.24     98.23     97.87
R-Sq(adj)     97.36     97.64     97.44
Mallows C-p     5.0       3.0       2.7
PRESS       110.347   85.3511   93.8825
R-Sq(pred)    95.94     96.86     96.54

The best model is the one in place after Step 3 concludes: Y = B0 + B1X1 + B2X2.

Step 1 starts with the full model; X3 is removed first since it has the largest p-value and that p-value is greater than the criterion of alpha = 0.1. In Step 2 the regression is of Y on X1, X2, X4; the p-value for X4 is the largest and is greater than 0.1, so X4 is removed. The next step regresses Y on X1, X2; in this model no p-values are greater than 0.1, so the process stops.

Stepwise Selection Regression: Y versus X1, X2, X3, X4

MTB > Stepwise 'Y' 'X1'-'X4';
SUBC>   AEnter 0.15;        (0.15 is the default alpha criterion for both AEnter and ARemove)
SUBC>   ARemove 0.15;
SUBC>   Best 0;
SUBC>   Constant;
SUBC>   Press.             (PRESS is another criterion measure, like R2, Cp, S and R2adj)

STEPWISE.  Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15

Response is Y on 4 predictors, with N = 13

Step              1         2         3         4
Constant     117.57    103.10     71.65     52.58

X4           -0.738    -0.614    -0.237
T-Value       -4.77    -12.62     -1.37
P-Value       0.001     0.000     0.205

X1                       1.44      1.45      1.47
T-Value                 10.40     12.41     12.10
P-Value                 0.000     0.000     0.000

X2                                0.416     0.662
T-Value                            2.24     14.44
P-Value                           0.052     0.000

S              8.96      2.73      2.31      2.41
R-Sq          67.45     97.25     98.23     97.87
R-Sq(adj)     64.50     96.70     97.64     97.44
Mallows C-p   138.7       5.5       3.0       2.7
PRESS       1194.22   121.224   85.3511   93.8825
R-Sq(pred)    56.03     95.54     96.86     96.54

The best model is the one in place after Step 4 concludes: Y = B0 + B1X1 + B2X2.

Step 1 starts by calculating the p-value for regressing Y on each variable separately and keeping the variable with the lowest p-value that is also less than the alpha-to-enter value (here 0.15); the initial variable is X4. Step 2 computes the partial statistics for adding each remaining predictor to the model containing the variable from Step 1: Minitab finds the partials for adding X1 to a model containing X4, then for adding X2 to a model containing X4, and finally for adding X3 to a model containing X4, and from these it enters the variable whose p-value is lowest and less than the criterion alpha of 0.15. At the same time the procedure compares the p-values of the variables already in the model and removes a variable if its p-value is the highest and greater than 0.15. From Step 2 this results in X1 being added to the model containing X4; with X1 added, the procedure checks whether any previously entered variable (here X4) should be dropped from the new model. This is analogous to running the conditional t-tests, which is why you see the T-values in the output. Since the resulting p-values satisfy the condition for keeping a variable in the model (less than 0.15), both variables remain. Step 3 continues by checking whether one of the remaining unused variables can be added to the model from Step 2 and, if so, repeats the conditional t-tests. In this step X2 is added to the model, but the p-value for X4 is now greater than 0.15 and X4 is therefore removed, resulting in Step 4. This final step leaves a model containing X1 and X2: adding either of the other variables does not produce a p-value less than 0.15 that would allow it to enter, nor does any previously selected variable have a partial-F p-value greater than 0.15 that would indicate it be removed.
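Building on the forward-selection sketch above, the stepwise procedure adds a removal pass after each entry step. A rough Python illustration (again an assumption, not Minitab's code; stepwise_select is a made-up name) is:

import numpy as np
import statsmodels.api as sm

def stepwise_select(X, y, alpha_enter=0.15, alpha_remove=0.15):
    """Forward entry with backward removal, as in the Minitab stepwise output above."""
    selected, remaining = [], list(range(X.shape[1]))
    changed = True
    while changed:
        changed = False
        # entry pass: add the candidate with the smallest qualifying p-value
        entry = {j: sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().pvalues[-1]
                 for j in remaining}
        if entry:
            best = min(entry, key=entry.get)
            if entry[best] < alpha_enter:
                selected.append(best)
                remaining.remove(best)
                changed = True
        # removal pass: drop the worst variable already in the model if it no longer qualifies
        if selected:
            fit = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
            pvals = fit.pvalues[1:]                      # skip the intercept
            worst = int(np.argmax(pvals))
            if pvals[worst] > alpha_remove:
                remaining.append(selected.pop(worst))
                changed = True
    return selected

(A production version would also guard against the entry and removal passes cycling indefinitely; on the cement data the loop should terminate with X1 and X2 selected, matching Step 4 above.)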
SPECIAL NOTE: If centering is done on the data set used to select the model, then the validation set needs to be centered using the means from the model-building set, in order to maintain the definitions of the variables as used in the fitted model.

Model Validation
Once a model or set of models has been determined, the final step is to validate the selected model(s). The best method is to gather new data, apply the model(s) to this new set, and compare the estimates and other measures. However, gathering new data is often not feasible because of constraints such as time and money. A more popular technique is to split the data into two sets: a model-building set and a model-validation set. The data are typically split randomly into two equal sets, but if that is not possible the model-building set is usually made the larger one. The random split should be done so as to meet any study requirements. For instance, if it is known that a certain percentage of the population is male and gender is a variable on which the data are gathered, then the two data sets should be split to reflect these percentages. Once the data are separated into the model-building and validation sets, the researcher uses the model-building set to develop the best model(s). Once the model(s) are selected, they are applied to the validation set. From here one considers measures of internal and external validity of the best model(s).

Internal Validity
The models are fit to the model-building data set. Comparisons are made by comparing PRESS to SSE within each model (they should be reasonably close) and each Cp value to its respective p.

External Validity
The models are fit to the validation data set. The beta estimates and their respective standard errors are compared between the two sets. Problem indicators include an estimate or standard error from the model-building set that is much larger or smaller than the corresponding value from the validation set, or a change in the sign of an estimate. A good measure for gauging the predictive validity of the selected model(s) is to compare the Mean Squared Prediction Error (MSPR) from the validation set to the MSE from the model-building set:

MSPR = Σ (Yi – Ŷi)^2 / n*

where Yi are the observed values from the validation set, Ŷi are the values calculated by running the selected model's x-variables from the validation set through the model fitted to the model-building set (i.e., as predicted values for new observations), and n* is the number of observations in the validation set.

The process in Minitab is to:
1. Copy and paste the x-variables for the selected model from the validation set into the model-building worksheet.
2. Perform a regression for one of the selected model(s) and use the Options feature to enter the newly copied x-variables into "Prediction intervals for new observations" and BE SURE TO CLICK THE STORE FITS OPTION IN THIS WINDOW!
3. The stored fits (PFITS), copied and pasted into the validation set, are the Ŷi in the formula above.
4. Calculate MSPR as shown and compare it to the MSE that resulted from fitting the selected model to the model-building set.
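The MSPR calculation can also be sketched in Python as a stand-in for the Minitab steps above (the function name mspr and the array names are assumptions):

import numpy as np

def mspr(X_build, y_build, X_valid, y_valid):
    """Fit on the model-building set, then compute MSPR on the validation set."""
    n, k = X_build.shape
    Xb = np.column_stack([np.ones(n), X_build])
    beta, *_ = np.linalg.lstsq(Xb, y_build, rcond=None)       # fit on model-building set
    mse = np.sum((y_build - Xb @ beta) ** 2) / (n - k - 1)    # MSE from model-building set
    Xv = np.column_stack([np.ones(len(y_valid)), X_valid])
    y_hat = Xv @ beta                                         # predictions for the validation set
    out = np.mean((y_valid - y_hat) ** 2)                     # MSPR = sum(Yi - Yhat_i)^2 / n*
    return out, mse

If MSPR is reasonably close to the MSE from the model-building set, that supports the predictive validity of the selected model.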
Points to Consider

Consider 1: Many statistical packages will do a line-item removal of missing data prior to beginning a stepwise procedure. That is, the system creates a complete data set prior to analysis by deleting any row of observations in which a missing data point exists. Variable(s) with several missing data points can therefore greatly influence the final model selected. If your data do contain many missing data points, you may want to run your model, then return to the original data set and create a second data set consisting of only those variables selected, and repeat your stepwise procedure using this new data set.

Consider 2: When your data set has a small number of observations relative to the number of variables (one rule of thumb: the number of observations is less than ten times the number of explanatory variables), the stepwise procedure may select a model containing all the variables, i.e., the full model. This can also occur if a high degree of multicollinearity exists among the variables. This outcome is especially likely when either or both of these conditions exist and you opt to use low threshold values for F-to-enter and F-to-remove. (A quick check for both conditions is sketched at the end of this section.)

Consider 3: Of great importance is to remember that the computer is just a tool to help you solve a problem. The statistical package has no intuition to apply to the situation. Do not forget this: the model selected by the software should not be treated as absolute. Apply your knowledge of the data and the research to the selected model and revise it if you deem necessary.

Choosing between Forward and Backward: When your data set is very large and you are not sure what to expect, use the Forward method to select a model. On the other hand, if you have a medium-sized set of variables from which you would prefer to eliminate a few, apply Backward. Do not be afraid to try both, especially if you started with Forward; in that case the Forward method will supply a set of variables that you may then want to run through a Backward process.

All-possible regressions: Different software packages will supply you with various criteria for making a selection, but adjusted R2 and Mallows Cp are two very common statistics. The two are related, with Cp being more "punishing" toward models in which the additional variable is of little consequence. When selecting a model using the criteria, remember that very rarely will you get a model that ranks best on all of the indicators. Temper your decision by considering models whose values rank high on several of the criteria presented. Again, remember to apply your judgment. The computer does not know the difficulty of gathering the data for the different variables. For instance, the variable Weight in humans, especially when self-reported, can lead to reporting errors (men tend to over-report and women to under-report) or missing data (women are typically less likely to report their weight than men). A model selected by any of the methods discussed here might involve Weight, while another model that is similar in its statistics but slightly less simple (i.e., includes more variables) could be easier in terms of data gathering.
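As a rough check for the two conditions flagged in Consider 2, here is a small Python sketch (my own helper, not from the text or Minitab) that tests the ten-observations-per-predictor rule of thumb and flags multicollinearity using variance inflation factors:

import numpy as np

def selection_warnings(X, vif_cutoff=10.0):
    """X: n x k matrix of candidate explanatory variables."""
    n, k = X.shape
    if n < 10 * k:
        print(f"Rule of thumb violated: {n} observations for {k} predictors (< 10 per predictor).")
    for j in range(k):
        # VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing X_j on the remaining X's
        others = np.delete(np.arange(k), j)
        Xo = np.column_stack([np.ones(n), X[:, others]])
        beta, *_ = np.linalg.lstsq(Xo, X[:, j], rcond=None)
        sse = np.sum((X[:, j] - Xo @ beta) ** 2)
        sst = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2_j = 1 - sse / sst
        vif = 1 / (1 - r2_j)
        if vif > vif_cutoff:
            print(f"Possible multicollinearity: VIF for column {j} is about {vif:.1f}.")

A VIF above 10 is a common informal warning level, but any such cutoff is a judgment call.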