Stat 301 – Lecture 23

Model Selection
In multiple regression we often have many explanatory variables.
How do we find the “best” model, or at least some good models?

Model Selection
We might want to select the set of explanatory variables that explains the most variation in the response while each variable adds significantly to the model.

Model Selection
Requiring all variables to add significantly to the model could be a problem if there are polynomial or interaction terms.

Cruising Timber
Response: Mean Diameter at Breast Height (MDBH) of a tree.
Explanatory:
X1 = Mean Height of Pines
X2 = Age of Tract times the Number of Pines
X3 = Mean Height of Pines divided by the Number of Pines

Forward Selection
Begin with no variables in the model.
At each step, check whether you can add a variable to the model.
If you can, add the variable.
If not, stop.

Forward Selection – Step 1
Select the variable that has the highest correlation with the response.
If this correlation is statistically significant, add the variable to the model.

JMP
Multivariate Methods > Multivariate
Put MDBH, X1, X2, and X3 in the Y, Columns box.

Multivariate
Correlations
          MDBH      X1      X2      X3
MDBH    1.0000  0.7731  0.2435  0.8404
X1      0.7731  1.0000  0.7546  0.6345
X2      0.2435  0.7546  1.0000  0.0557
X3      0.8404  0.6345  0.0557  1.0000

[Scatterplot Matrix of MDBH, X1, X2, and X3]

Correlation with response
Multivariate
Pairwise Correlations
Variable  by Variable  Correlation  Count  Signif Prob
X1        MDBH              0.7731     20      <.0001*
X2        MDBH              0.2435     20       0.3008
X2        X1                0.7546     20      0.0001*
X3        MDBH              0.8404     20      <.0001*
X3        X1                0.6345     20      0.0027*
X3        X2                0.0557     20       0.8156

Comment
The explanatory variable X3 has the highest correlation with the response MDBH.
r = 0.8404
The correlation between X3 and MDBH is statistically significant.
Signif Prob < 0.0001, a small P-value.

Step 1 – Action
Fit the simple linear regression of MDBH on X3.
Predicted MDBH = 3.896 + 32.937*X3
R² = 0.7063
RMSE = 0.4117

SLR of MDBH on X3
Test of Model Utility: F = 43.2886, P-value < 0.0001
Statistical Significance of X3: t = 6.58, P-value < 0.0001
Exactly the same as the test for significant correlation.

Can we do better?
Can we explain more variation in MDBH by adding one of the other variables to the model with X3?
Will that addition be statistically significant?

Forward Selection – Step 2
Which variable should we add, X1 or X2?
How can we decide?

Correlation among explanatory variables
Multivariate
Pairwise Correlations
Variable  by Variable  Correlation  Count  Signif Prob
X1        MDBH              0.7731     20      <.0001*
X2        MDBH              0.2435     20       0.3008
X2        X1                0.7546     20      0.0001*
X3        MDBH              0.8404     20      <.0001*
X3        X1                0.6345     20      0.0027*
X3        X2                0.0557     20       0.8156

Multicollinearity
Because some explanatory variables are correlated, they may carry overlapping information about the response.
You can’t rely on the simple correlations between the explanatory variables and the response to tell you which variable to add.

Forward Selection – Step 2
Look at partial residual plots.
Determine statistical significance.

Partial Residual Plots
Look at the residuals from the SLR of Y on X3 plotted against the other variables once the overlapping information with X3 has been removed.

How is this done?
Fit MDBH versus X3 and obtain residuals – Resid(Y on X3)
Fit X1 versus X3 and obtain residuals – Resid(X1 on X3)
Fit X2 versus X3 and obtain residuals – Resid(X2 on X3)

[Scatterplot matrix of Resid(Y on X3), Resid(X1 on X3), and Resid(X2 on X3)]

Correlations
                  Resid(Y on X3)  Resid(X1 on X3)  Resid(X2 on X3)
Resid(Y on X3)            1.0000           0.5726           0.3636
Resid(X1 on X3)           0.5726           1.0000           0.9320
Resid(X2 on X3)           0.3636           0.9320           1.0000

Comment
The residuals (unexplained variation in the response) from the SLR of MDBH on X3 have the highest correlation with X1 once we have adjusted for the overlapping information with X3.

Statistical Significance
Does X1 add significantly to the model that already contains X3?
t = 2.88, P-value = 0.0104
F = 8.29, P-value = 0.0104
Because the P-value is small, X1 adds significantly to the model with X3.

Summary
Step 1 – add X3: R² = 0.706
Step 2 – add X1 to X3: R² = 0.803
Can we do better?

Forward Selection – Step 3
Does X2 add significantly to the model that already contains X3 and X1?
t = -2.79, P-value = 0.0131
F = 7.78, P-value = 0.0131
Because the P-value is small, X2 adds significantly to the model with X3 and X1.

Summary
Step 1 – add X3: R² = 0.706
Step 2 – add X1 to X3: R² = 0.803
Step 3 – add X2 to X1 and X3: R² = 0.867

Summary
At each step the variable being added is statistically significant.
Has the forward selection procedure found a model that has a high R² and all variables add significantly to the model?

“Best” Model?
The model with all three variables is useful.
F = 34.83, P-value < 0.0001
The variable X3 does not add significantly to the model with just X1 and X2.
t = 0.41, P-value = 0.6844

Remove X3?
Because X3 does not add significantly to the model, if we remove it, we will still have good predictions with a simpler model.
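
The forward selection quantities reported in these slides (R² at each step, the partial residual correlations, and the F statistic for each added variable) can be approximately reproduced from the pairwise correlation matrix alone. The sketch below is not part of the lecture, which uses JMP; it is a Python illustration, and the function names r_squared, partial_f, and partial_corr are hypothetical helpers introduced here. Because the correlations are rounded to four decimals, the results should land close to, but not exactly on, the JMP values.

```python
import numpy as np

# Pairwise correlations copied from the JMP output (order: MDBH, X1, X2, X3).
names = ["MDBH", "X1", "X2", "X3"]
R = np.array([
    [1.0000, 0.7731, 0.2435, 0.8404],
    [0.7731, 1.0000, 0.7546, 0.6345],
    [0.2435, 0.7546, 1.0000, 0.0557],
    [0.8404, 0.6345, 0.0557, 1.0000],
])
n = 20  # number of observations (the Count column)


def r_squared(predictors):
    """R^2 of MDBH regressed on the named predictors, using correlations only."""
    if not predictors:
        return 0.0
    idx = [names.index(p) for p in predictors]
    r_xy = R[idx, 0]              # correlations of each predictor with MDBH
    R_xx = R[np.ix_(idx, idx)]    # correlations among the predictors
    return float(r_xy @ np.linalg.solve(R_xx, r_xy))


def partial_f(current, candidate):
    """F statistic for adding `candidate` to a model that contains `current`."""
    r2_small = r_squared(current)
    r2_big = r_squared(current + [candidate])
    df_error = n - len(current) - 2  # n minus (predictors in larger model + intercept)
    return (r2_big - r2_small) / ((1.0 - r2_big) / df_error)


def partial_corr(candidate, given):
    """Correlation of Resid(Y on given) with Resid(candidate on given)."""
    c, g = names.index(candidate), names.index(given)
    num = R[0, c] - R[0, g] * R[c, g]
    return float(num / np.sqrt((1 - R[0, g] ** 2) * (1 - R[c, g] ** 2)))


# The order of additions follows the slides: X3, then X1, then X2.
steps = [([], "X3"), (["X3"], "X1"), (["X3", "X1"], "X2")]
for current, candidate in steps:
    r2 = r_squared(current + [candidate])
    f = partial_f(current, candidate)
    print(f"add {candidate} to {current or 'empty model'}: R^2 = {r2:.3f}, F = {f:.2f}")

# Step 2 decision: which remaining variable lines up best with Resid(Y on X3)?
for candidate in ["X1", "X2"]:
    print(candidate, "partial correlation given X3:", round(partial_corr(candidate, "X3"), 4))
```

Working from the correlation matrix keeps the example tied to numbers that appear in the slides; with the raw tract data one would fit the regressions directly instead.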

Comparison of Models

Model (X1, X2, X3)
Response MDBH
Summary of Fit
RSquare                      0.867207
RSquare Adj                  0.842309
Root Mean Square Error       0.29359
Mean of Response             6.265
Observations (or Sum Wgts)   20

Parameter Estimates
Term        Estimate   Std Error  t Ratio  Prob>|t|
Intercept   3.2357323   0.346656     9.33   <.0001*
X1          0.0974056   0.025398     3.84   0.0015*
X2          -0.000169   6.052e-5    -2.79   0.0131*
X3          3.4668135   8.373792     0.41    0.6844

Effect Tests
Source  Nparm  DF  Sum of Squares  F Ratio  Prob > F
X1          1   1       1.2678300  14.7089   0.0015*
X2          1   1       0.6709668   7.7843   0.0131*
X3          1   1       0.0147740   0.1714    0.6844

Model (X1, X2)
Response MDBH
Summary of Fit
RSquare                      0.865785
RSquare Adj                  0.849995
Root Mean Square Error       0.286345
Mean of Response             6.265
Observations (or Sum Wgts)   20

Parameter Estimates
Term        Estimate   Std Error  t Ratio  Prob>|t|
Intercept   3.2605137   0.333024     9.79   <.0001*
X1          0.1069135   0.010578    10.11   <.0001*
X2           -0.00019   3.256e-5    -5.83   <.0001*

Effect Tests
Source  Nparm  DF  Sum of Squares   F Ratio  Prob > F
X1          1   1       8.3755800  102.1490   <.0001*
X2          1   1       2.7845615   33.9607   <.0001*
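
The comparison above can be checked by hand: RSquare Adj penalizes R² for the number of explanatory variables, and the F ratio for X3 follows from the drop in R² when X3 is removed. The short Python sketch below is an illustration added to these notes, not JMP output; the helper name adjusted_r2 and the dictionary layout are hypothetical, and the inputs are the RSquare values and sample size from the two Summary of Fit tables.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p explanatory variables and n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


n = 20  # Observations (or Sum Wgts)

# RSquare values copied from the two Summary of Fit tables.
full = {"r2": 0.867207, "p": 3}     # Model (X1, X2, X3)
reduced = {"r2": 0.865785, "p": 2}  # Model (X1, X2)

adj_full = adjusted_r2(full["r2"], n, full["p"])
adj_reduced = adjusted_r2(reduced["r2"], n, reduced["p"])

# F statistic for X3 given X1 and X2: gain in R^2 from adding X3,
# divided by the unexplained share of the full model per error degree of freedom.
df_error_full = n - full["p"] - 1
f_x3 = (full["r2"] - reduced["r2"]) / ((1.0 - full["r2"]) / df_error_full)

print(f"RSquare Adj, full model:    {adj_full:.6f}")
print(f"RSquare Adj, reduced model: {adj_reduced:.6f}")
print(f"F ratio for X3:             {f_x3:.4f}")
```

Although R² falls slightly when X3 is dropped, the adjusted R² rises and the root mean square error shrinks, which is the sense in which the two-variable model is the simpler model with equally good predictions.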