Model selection: Stepwise regression

Statement of problem
• A common problem is that there is a large set of candidate predictor variables.
• (Note: The examples herein are really not that large.)
• The goal is to choose a small subset from the larger set so that the resulting regression model is simple, yet has good predictive ability.

Example: Cement data
• Response y: heat evolved in calories during hardening of cement, on a per gram basis
• Predictor x1: % of tricalcium aluminate
• Predictor x2: % of tricalcium silicate
• Predictor x3: % of tetracalcium alumino ferrite
• Predictor x4: % of dicalcium silicate

Example: Cement data
[Scatterplot matrix of y, x1, x2, x3, and x4.]

Two basic methods of selecting predictors
• Stepwise regression: Enter and remove predictors, in a stepwise manner, until there is no justifiable reason to enter or remove any more.
• Best subsets regression: Select the subset of predictors that does the best at meeting some well-defined objective criterion.

Stepwise regression: the idea
• Start with no predictors in the "stepwise model."
• At each step, enter or remove a predictor based on the partial F-tests (that is, the t-tests).
• Stop when no more predictors can be justifiably entered into or removed from the stepwise model.

Stepwise regression: Preliminary steps
1. Specify an Alpha-to-Enter significance level (αE = 0.15).
2. Specify an Alpha-to-Remove significance level (αR = 0.15).

Stepwise regression: Step #1
1. Fit each of the one-predictor models, that is, regress y on x1, regress y on x2, ..., regress y on xp-1.
2. The first predictor put in the stepwise model is the predictor that has the smallest t-test P-value (below αE = 0.15).
3. If no P-value < 0.15, stop.

Stepwise regression: Step #2
1. Suppose x1 was the "best" one predictor.
2. Fit each of the two-predictor models with x1 in the model, that is, regress y on (x1, x2), regress y on (x1, x3), ..., and regress y on (x1, xp-1).
3. The second predictor put in the stepwise model is the predictor that has the smallest t-test P-value (below αE = 0.15).
4. If no P-value < 0.15, stop.

Stepwise regression: Step #2 (continued)
1. Suppose x2 was the "best" second predictor.
2. Step back and check the P-value for β1 = 0. If the P-value for β1 = 0 has become not significant (above αR = 0.15), remove x1 from the stepwise model.

Stepwise regression: Step #3
1. Suppose both x1 and x2 made it into the two-predictor stepwise model.
2. Fit each of the three-predictor models with x1 and x2 in the model, that is, regress y on (x1, x2, x3), regress y on (x1, x2, x4), ..., and regress y on (x1, x2, xp-1).

Stepwise regression: Step #3 (continued)
1. The third predictor put in the stepwise model is the predictor that has the smallest t-test P-value (below αE = 0.15).
2. If no P-value < 0.15, stop.
3. Step back and check the P-values for β1 = 0 and β2 = 0. If either P-value has become not significant (above αR = 0.15), remove the corresponding predictor from the stepwise model.

Stepwise regression: Stopping the procedure
• The procedure is stopped when adding an additional predictor does not yield a t-test P-value below αE = 0.15. (A code sketch of this enter/remove loop follows below.)
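The enter/remove loop described in these steps can be sketched in code. Below is a minimal, illustrative Python version using pandas and statsmodels; the function name stepwise_select and its arguments are my own, not part of any package, and the sketch omits refinements a production implementation would need (for example, guarding against a predictor cycling in and out).

```python
import pandas as pd
import statsmodels.api as sm

def stepwise_select(X, y, alpha_enter=0.15, alpha_remove=0.15):
    """Forward stepwise selection with backward removal, as outlined above.

    X : DataFrame of candidate predictors; y : response Series.
    Returns the list of predictor names left in the stepwise model.
    """
    selected = []                    # predictors currently in the stepwise model
    remaining = list(X.columns)      # candidates not yet entered
    while remaining:
        # Forward step: find the candidate with the smallest t-test P-value.
        pvals = {}
        for cand in remaining:
            model = sm.OLS(y, sm.add_constant(X[selected + [cand]])).fit()
            pvals[cand] = model.pvalues[cand]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha_enter:
            break                    # no candidate is significant -> stop
        selected.append(best)
        remaining.remove(best)
        # Step back: drop any predictor whose P-value has risen above alpha_remove.
        while True:
            model = sm.OLS(y, sm.add_constant(X[selected])).fit()
            slope_pvals = model.pvalues.drop("const")
            if slope_pvals.max() <= alpha_remove:
                break
            worst = slope_pvals.idxmax()
            selected.remove(worst)
            remaining.append(worst)
    return selected
```

The inner while loop is the "step back" check: after each entry, any predictor whose P-value has drifted above αR is removed before the next forward step is attempted.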
Example: Cement data
[Scatterplot matrix of y, x1, x2, x3, and x4, repeated from above.]

Step #1: regress y on each predictor separately.

y versus x1:
Predictor      Coef  SE Coef      T      P
Constant     81.479    4.927  16.54  0.000
x1           1.8687   0.5264   3.55  0.005

y versus x2:
Predictor      Coef  SE Coef      T      P
Constant     57.424    8.491   6.76  0.000
x2           0.7891   0.1684   4.69  0.001

y versus x3:
Predictor      Coef  SE Coef      T      P
Constant    110.203    7.948  13.87  0.000
x3          -1.2558   0.5984  -2.10  0.060

y versus x4:
Predictor      Coef  SE Coef      T      P
Constant    117.568    5.262  22.34  0.000
x4          -0.7382   0.1546  -4.77  0.001

Step #2: x4 has the smallest P-value, so fit each two-predictor model containing x4.

y versus x4, x1:
Predictor      Coef   SE Coef       T      P
Constant    103.097     2.124   48.54  0.000
x4         -0.61395   0.04864  -12.62  0.000
x1           1.4400    0.1384   10.40  0.000

y versus x4, x2:
Predictor      Coef  SE Coef      T      P
Constant      94.16    56.63   1.66  0.127
x4          -0.4569   0.6960  -0.66  0.526
x2           0.3109   0.7486   0.42  0.687

y versus x4, x3:
Predictor      Coef   SE Coef       T      P
Constant    131.282     3.275   40.09  0.000
x4         -0.72460   0.07233  -10.02  0.000
x3          -1.1999    0.1890   -6.35  0.000

Step #3: x1 joins x4, so fit each three-predictor model containing x4 and x1.

y versus x4, x1, x2:
Predictor      Coef  SE Coef      T      P
Constant      71.65    14.14   5.07  0.001
x4          -0.2365   0.1733  -1.37  0.205
x1           1.4519   0.1170  12.41  0.000
x2           0.4161   0.1856   2.24  0.052

y versus x4, x1, x3:
Predictor      Coef   SE Coef       T      P
Constant    111.684     4.562   24.48  0.000
x4         -0.64280   0.04454  -14.43  0.000
x1           1.0519    0.2237    4.70  0.001
x3          -0.4100    0.1992   -2.06  0.070

Step #4: x2 enters the model, but the P-value for x4 has risen above αR = 0.15, so x4 is removed.

y versus x1, x2:
Predictor      Coef  SE Coef      T      P
Constant     52.577    2.286  23.00  0.000
x1           1.4683   0.1213  12.10  0.000
x2          0.66225  0.04585  14.44  0.000

Check whether x3 or x4 can enter the model alongside x1 and x2.

y versus x1, x2, x4:
Predictor      Coef  SE Coef      T      P
Constant      71.65    14.14   5.07  0.001
x1           1.4519   0.1170  12.41  0.000
x2           0.4161   0.1856   2.24  0.052
x4          -0.2365   0.1733  -1.37  0.205

y versus x1, x2, x3:
Predictor      Coef  SE Coef      T      P
Constant     48.194    3.913  12.32  0.000
x1           1.6959   0.2046   8.29  0.000
x2          0.65691  0.04423  14.85  0.000
x3           0.2500   0.1847   1.35  0.209

Neither P-value is below αE = 0.15, so the procedure stops with x1 and x2 in the model:

Predictor      Coef  SE Coef      T      P
Constant     52.577    2.286  23.00  0.000
x1           1.4683   0.1213  12.10  0.000
x2          0.66225  0.04585  14.44  0.000

Minitab's summary of the whole procedure:

Stepwise Regression: y versus x1, x2, x3, x4

Alpha-to-Enter: 0.15   Alpha-to-Remove: 0.15

Response is y on 4 predictors, with N = 13

Step             1        2        3        4
Constant    117.57   103.10    71.65    52.58

x4          -0.738   -0.614   -0.237
T-Value      -4.77   -12.62    -1.37
P-Value      0.001    0.000    0.205

x1                     1.44     1.45     1.47
T-Value               10.40    12.41    12.10
P-Value               0.000    0.000    0.000

x2                             0.416    0.662
T-Value                         2.24    14.44
P-Value                        0.052    0.000

S             8.96     2.73     2.31     2.41
R-Sq         67.45    97.25    98.23    97.87
R-Sq(adj)    64.50    96.70    97.64    97.44
C-p          138.7      5.5      3.0      2.7

Caution about stepwise regression!
• Do not jump to the conclusion ...
  – that all the important predictor variables for predicting y have been identified, or
  – that all the unimportant predictor variables have been eliminated.

Caution about stepwise regression!
• Many t-tests for βk = 0 are conducted in a stepwise regression procedure.
• The probability is high ...
  – that we included some unimportant predictors, and
  – that we excluded some important predictors.

Drawbacks of stepwise regression
• The final model is not guaranteed to be optimal in any specified sense.
• The procedure yields a single final model, although in practice there are often several equally good models.
• It doesn't take into account a researcher's knowledge about the predictors.
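Assuming the cement data are available as a CSV file (the file name cement.csv is an assumption; the column names follow the example above), the stepwise_select sketch given earlier could reproduce this search:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file name; columns assumed to be y, x1, x2, x3, x4 as above.
cement = pd.read_csv("cement.csv")
X = cement[["x1", "x2", "x3", "x4"]]
y = cement["y"]

chosen = stepwise_select(X, y, alpha_enter=0.15, alpha_remove=0.15)
print(chosen)  # the Minitab run above ends with x1 and x2 in the model

# Refit the final stepwise model and compare with the output above.
final = sm.OLS(y, sm.add_constant(X[chosen])).fit()
print(final.params)  # cf. Constant 52.577, x1 1.4683, x2 0.66225
```

With αE = αR = 0.15, this run should mirror the Minitab output: x4 enters first, then x1, then x2, and x4 is removed once its P-value rises above 0.15.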
Example: Modeling PIQ
[Scatterplot matrix of PIQ, MRI, Height, and Weight.]

Stepwise Regression: PIQ versus MRI, Height, Weight

Alpha-to-Enter: 0.15   Alpha-to-Remove: 0.15

Response is PIQ on 3 predictors, with N = 38

Step             1         2
Constant     4.652   111.276

MRI           1.18      2.06
T-Value       2.45      3.77
P-Value      0.019     0.001

Height                 -2.73
T-Value                -2.75
P-Value                0.009

S             21.2      19.5
R-Sq         14.27     29.49
R-Sq(adj)    11.89     25.46
C-p            7.3       2.0

The regression equation is
PIQ = 111 + 2.06 MRI - 2.73 Height

Predictor      Coef  SE Coef      T      P
Constant     111.28    55.87   1.99  0.054
MRI          2.0606   0.5466   3.77  0.001
Height      -2.7299   0.9932  -2.75  0.009

S = 19.51   R-Sq = 29.5%   R-Sq(adj) = 25.5%

Analysis of Variance

Source          DF       SS      MS     F      P
Regression       2   5572.7  2786.4  7.32  0.002
Error           35  13321.8   380.6
Total           37  18894.6

Source    DF   Seq SS
MRI        1   2697.1
Height     1   2875.6

Example: Modeling BP
[Scatterplot matrix of BP, Age, Weight, BSA, Duration, Pulse, and Stress.]

Stepwise Regression: BP versus Age, Weight, BSA, Duration, Pulse, Stress

Alpha-to-Enter: 0.15   Alpha-to-Remove: 0.15

Response is BP on 6 predictors, with N = 20

Step             1         2         3
Constant     2.205   -16.579   -13.667

Weight       1.201     1.033     0.906
T-Value      12.92     33.15     18.49
P-Value      0.000     0.000     0.000

Age                    0.708     0.702
T-Value                13.23     15.96
P-Value                0.000     0.000

BSA                                4.6
T-Value                           3.04
P-Value                          0.008

S             1.74     0.533     0.437
R-Sq         90.26     99.14     99.45
R-Sq(adj)    89.72     99.04     99.35
C-p          312.8      15.1       6.4

The regression equation is
BP = - 13.7 + 0.702 Age + 0.906 Weight + 4.63 BSA

Predictor      Coef  SE Coef      T      P
Constant    -13.667    2.647  -5.16  0.000
Age         0.70162  0.04396  15.96  0.000
Weight      0.90582  0.04899  18.49  0.000
BSA           4.627    1.521   3.04  0.008

S = 0.4370   R-Sq = 99.5%   R-Sq(adj) = 99.4%

Analysis of Variance

Source          DF      SS      MS       F      P
Regression       3  556.94  185.65  971.93  0.000
Error           16    3.06    0.19
Total           19  560.00

Source      DF   Seq SS
Age          1   243.27
Weight       1   311.91
BSA          1     1.77

Stepwise regression in Minitab
• Stat >> Regression >> Stepwise ...
• Specify the response and all possible predictors.
• If desired, specify predictors that must be included in every model. (This is where the researcher's knowledge helps!!)
• Select OK. Results appear in the session window.
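Outside Minitab, the same kind of run can be scripted with the earlier stepwise_select sketch; here is a hypothetical run on the blood-pressure example (the file name bloodpress.csv is an assumption, while the column names follow the output above):

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file name; column names taken from the blood-pressure output above.
bp = pd.read_csv("bloodpress.csv")
X = bp[["Age", "Weight", "BSA", "Duration", "Pulse", "Stress"]]
y = bp["BP"]

# Reuse the stepwise_select() sketch with the same 0.15 / 0.15 alpha levels.
chosen = stepwise_select(X, y, alpha_enter=0.15, alpha_remove=0.15)
print(chosen)  # the Minitab run above selects Weight, then Age, then BSA

# Refit and summarize the final stepwise model.
final = sm.OLS(y, sm.add_constant(X[chosen])).fit()
print(final.summary())
```

Note that this plain sketch has no way to force predictors into every model; that option, available in Minitab's dialog, would require an extra argument that seeds the selected list before the search begins.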