Stat 301 – Lecture 23 Model Selection

Model Selection
In multiple regression we
often have many explanatory
variables.
 How do we find the “best”
model, or at least some good
models?

Model Selection

We might want to select the set
of explanatory variables that
explains the most variation in
the response, with each
variable adding significantly to
the model.
Model Selection

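Requiring every variable to add
significantly to the model could
be a problem if there are
polynomial or interaction terms:
a lower-order term is usually kept
whenever its higher-order term is
in the model, even if the lower-order
term is not significant on its own.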
Cruising Timber


Response: Mean Diameter at
Breast Height (MDBH) of a tree.
Explanatory:
X1 = Mean Height of Pines
X2 = Age of Tract times the Number
of Pines
 X3 = Mean Height of Pines divided by
the Number of Pines


Forward Selection


Begin with no variables in the
model.
At each step check to see if you
can add a variable to the model.
If you can, add the variable.
 If not, stop.

Forward Selection – Step 1
Select the variable that has
the highest correlation with
the response.
 If this correlation is
statistically significant, add
the variable to the model.

JMP
Multivariate Methods – Multivariate
Put MDBH, X1, X2, and X3 in the Y, Columns box.

Multivariate
Correlations
         MDBH      X1      X2      X3
MDBH   1.0000  0.7731  0.2435  0.8404
X1     0.7731  1.0000  0.7546  0.6345
X2     0.2435  0.7546  1.0000  0.0557
X3     0.8404  0.6345  0.0557  1.0000
[Scatterplot Matrix of MDBH, X1, X2, and X3]
Correlation with response
Multivariate
Pairwise Correlations
Variable  by Variable  Correlation  Count  Signif Prob
X1        MDBH              0.7731     20      <.0001*
X2        MDBH              0.2435     20       0.3008
X2        X1                0.7546     20      0.0001*
X3        MDBH              0.8404     20      <.0001*
X3        X1                0.6345     20      0.0027*
X3        X2                0.0557     20       0.8156
Comment

The explanatory variable X3 has the
highest correlation with the
response MDBH.
r = 0.8404
The correlation between X3 and
MDBH is statistically significant.
Signif Prob < 0.0001, small P-value.
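
As a rough check on Step 1, the choice of X3 and its test statistic can be reproduced from the three correlations with MDBH. This is a minimal sketch: the 5% two-sided cutoff (2.101 for 18 degrees of freedom, a standard t-table value) is an assumption, since the slides only say the P-value is small, and the variable names are illustrative.

```python
import math

n = 20  # number of observations (the Count column in the JMP output)

# Correlation of each explanatory variable with MDBH, copied from the JMP output.
r_with_response = {"X1": 0.7731, "X2": 0.2435, "X3": 0.8404}

# Step 1: pick the variable with the largest absolute correlation.
best = max(r_with_response, key=lambda v: abs(r_with_response[v]))
r = r_with_response[best]

# t statistic for H0: correlation = 0, with n - 2 degrees of freedom.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

# Assumed cutoff: two-sided 5% critical value of t with 18 df.
T_CRIT_18 = 2.101
add_variable = abs(t_stat) > T_CRIT_18

print(best, round(t_stat, 2), add_variable)
```

The t statistic computed this way agrees, up to rounding of r, with the t Ratio JMP reports for X3 in the simple linear regression.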
Step 1 - Action
Fit the simple linear regression
of MDBH on X3.
Predicted MDBH = 3.896 + 32.937*X3
R2 = 0.7063
RMSE = 0.4117

SLR of MDBH on X3

Test of Model Utility
F = 43.2886, P-value < 0.0001
Statistical Significance of X3
t = 6.58, P-value < 0.0001
Exactly the same as the test
for significant correlation.
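
The equivalence can be seen numerically. In a simple linear regression R2 is the squared correlation, the model-utility F is (n - 2) R2 / (1 - R2), and the slope's t ratio is the square root of F in absolute value. The sketch below uses only r = 0.8404 and n = 20 from the output, so small differences from JMP's values come from r being rounded to four decimals.

```python
import math

n = 20       # observations
r = 0.8404   # correlation between X3 and MDBH

r_squared = r**2                                   # R2 of the SLR of MDBH on X3
f_ratio = (n - 2) * r_squared / (1 - r_squared)    # model-utility F with 1 and n - 2 df
t_ratio = math.sqrt(f_ratio)                       # |t| for the slope of X3

print(round(r_squared, 4), round(f_ratio, 2), round(t_ratio, 2))
```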
Can we do better?
Can we explain more variation
in MDBH by adding one of the
other variables to the model
with X3?
 Will that addition be statistically
significant?

Forward Selection – Step 2
Which variable should we add,
X1 or X2?
 How can we decide?

Correlation among
explanatory variables
Multivariate
Pairwise Correlations
Variable  by Variable  Correlation  Count  Signif Prob
X1        MDBH              0.7731     20      <.0001*
X2        MDBH              0.2435     20       0.3008
X2        X1                0.7546     20      0.0001*
X3        MDBH              0.8404     20      <.0001*
X3        X1                0.6345     20      0.0027*
X3        X2                0.0557     20       0.8156
Multicollinearity


Because some explanatory
variables are correlated, they may
carry overlapping information about
the response.
You can’t rely on the simple
correlations between explanatory
and response to tell you which
variable to add.
Forward selection – Step 2
Look at partial residual plots.
 Determine statistical
significance.

Partial Residual Plots

Look at the residuals from the
SLR of Y on X3 plotted against
the other variables once the
overlapping information with X3
has been removed.
How is this done?
Fit MDBH versus X3 and obtain
residuals – Resid(Y on X3)
 Fit X1 versus X3 and obtain
residuals - Resid(X1 on X3)
 Fit X2 versus X3 and obtain
residuals - Resid(X2 on X3)
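
The three fits above can be sketched in a few lines. This is only an illustration: the helper names `slr_residuals` and `pearson` are hypothetical, and the six data values are made up for the example (they are NOT the lecture's timber data, which the slides do not list).

```python
import math

def slr_residuals(y, x):
    """Residuals from the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def pearson(u, v):
    """Pearson correlation between two equal-length lists."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

# Made-up illustrative values, NOT the lecture's timber data.
mdbh = [5.2, 6.1, 6.8, 5.9, 7.0, 6.3]
x1 = [30.0, 41.0, 50.0, 38.0, 52.0, 44.0]
x3 = [0.05, 0.07, 0.10, 0.06, 0.09, 0.08]

resid_y = slr_residuals(mdbh, x3)    # Resid(Y on X3)
resid_x1 = slr_residuals(x1, x3)     # Resid(X1 on X3)

# Correlation between the two sets of residuals: what X1 still says
# about the response once the information shared with X3 is removed.
partial_r = pearson(resid_y, resid_x1)
print(round(partial_r, 4))
```

Each set of residuals is uncorrelated with X3 by construction, which is what "removing the overlapping information" means here.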

[Scatterplot Matrix of Resid(YonX3), Resid(X1onX3), and Resid(X2onX3)]
Correlations
               Resid(YonX3)  Resid(X1onX3)  Resid(X2onX3)
Resid(YonX3)         1.0000         0.5726         0.3636
Resid(X1onX3)        0.5726         1.0000         0.9320
Resid(X2onX3)        0.3636         0.9320         1.0000
Comment

The residuals (unexplained
variation in the response) from
the SLR of MDBH on X3 have
the highest correlation with X1
once we have adjusted for the
overlapping information with X3.
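
The residual correlations can also be obtained without refitting anything, because the correlation between two sets of residuals that were both adjusted for X3 is the first-order partial correlation. A minimal sketch using the pairwise correlations from the JMP output (the function name `partial_r` is illustrative):

```python
import math

def partial_r(r_ab, r_ac, r_bc):
    """Correlation of a and b after both are adjusted for c."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac**2) * (1 - r_bc**2))

# Pairwise correlations from the JMP output.
r_y_x1, r_y_x2, r_y_x3 = 0.7731, 0.2435, 0.8404
r_x1_x3, r_x2_x3 = 0.6345, 0.0557

# Resid(YonX3) versus Resid(X1onX3), and versus Resid(X2onX3).
p_x1 = partial_r(r_y_x1, r_y_x3, r_x1_x3)
p_x2 = partial_r(r_y_x2, r_y_x3, r_x2_x3)

print(round(p_x1, 3), round(p_x2, 3))
```

Both values agree with the residual correlation table to about three decimals; the small discrepancy comes from the inputs being rounded.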
Statistical Significance

Does X1 add significantly to the
model that already contains X3?
t = 2.88, P-value = 0.0104
 F = 8.29, P-value = 0.0104
 Because the P-value is small, X1
adds significantly to the model
with X3.
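
The t and F values for X1 follow from the residual correlation 0.5726, now with n - 3 degrees of freedom because two slopes and an intercept are estimated. The same correlation gives the gain in R2: the new model explains the old R2 plus the squared partial correlation times the variation X3 left unexplained. A sketch with illustrative variable names, using rounded inputs from the output:

```python
import math

n = 20
r_partial = 0.5726   # Resid(YonX3) versus Resid(X1onX3)
r2_x3 = 0.7063       # R2 of the model with X3 alone

df = n - 3           # error degrees of freedom with X3 and X1 in the model
t_stat = r_partial * math.sqrt(df) / math.sqrt(1 - r_partial**2)
f_stat = t_stat**2

# R2 after adding X1: old R2 plus the share of the leftover variation X1 explains.
r2_x3_x1 = r2_x3 + (1 - r2_x3) * r_partial**2

print(round(t_stat, 2), round(f_stat, 2), round(r2_x3_x1, 3))
```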

Summary

Step 1 – add X3
R2 = 0.706
Step 2 – add X1 to X3
R2 = 0.803
Can we do better?
Forward Selection – Step 3

Does X2 add significantly to the
model that already contains X3 and
X1?
t = –2.79, P-value = 0.0131
F = 7.78, P-value = 0.0131
 Because the P-value is small, X2 adds
significantly to the model with X3 and
X1.


Summary

Step 1 – add X3
R2 = 0.706
Step 2 – add X1 to X3
R2 = 0.803
Step 3 – add X2 to X1 and X3
R2 = 0.867
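
The three steps can be retraced from the pairwise correlations alone. The sketch below is one possible way to automate the procedure, not JMP's implementation. At each step it picks the candidate with the largest absolute partial correlation with MDBH given the variables already chosen, which is the same as picking the largest |t|, and adds it if |t| exceeds a cutoff. The 5% two-sided cutoffs are an assumption (the slides only call the P-values small), and `corr`, `partial_corr`, `forward_select`, and `T_CRIT` are hypothetical names.

```python
import math

# Pairwise correlations from the JMP output (n = 20).
R = {
    ("MDBH", "X1"): 0.7731, ("MDBH", "X2"): 0.2435, ("MDBH", "X3"): 0.8404,
    ("X1", "X2"): 0.7546, ("X1", "X3"): 0.6345, ("X2", "X3"): 0.0557,
}

# Assumed two-sided 5% critical values of t (standard table values);
# only the degrees of freedom needed for this example are listed.
T_CRIT = {18: 2.101, 17: 2.110, 16: 2.120}

def corr(a, b):
    """Look up a pairwise correlation in either order."""
    return R[(a, b)] if (a, b) in R else R[(b, a)]

def partial_corr(a, b, given):
    """Correlation of a and b after adjusting both for every variable in `given`."""
    if not given:
        return corr(a, b)
    z, rest = given[-1], given[:-1]
    r_ab = partial_corr(a, b, rest)
    r_az = partial_corr(a, z, rest)
    r_bz = partial_corr(b, z, rest)
    return (r_ab - r_az * r_bz) / math.sqrt((1 - r_az**2) * (1 - r_bz**2))

def forward_select(response, candidates, n):
    """Return the variables in the order forward selection adds them."""
    selected = []
    remaining = list(candidates)
    while remaining:
        best = max(remaining,
                   key=lambda v: abs(partial_corr(response, v, selected)))
        r = partial_corr(response, best, selected)
        df = n - len(selected) - 2          # error df once `best` is in the model
        t = r * math.sqrt(df) / math.sqrt(1 - r**2)
        if abs(t) < T_CRIT[df]:
            break                           # nothing adds significantly: stop
        selected.append(best)
        remaining.remove(best)
    return selected

order = forward_select("MDBH", ["X1", "X2", "X3"], n=20)
print(order)
```

With the correlations above, the procedure adds X3, then X1, then X2, matching the steps in the summary.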
Summary
At each step the variable being
added is statistically significant.
Has the forward selection
procedure found a model that
has a high R2 and in which all
variables add significantly?

“Best” Model?

The model with all three variables
is useful.
F = 34.83, P-value < 0.0001
The variable X3 does not add
significantly to the model with just
X1 and X2.
t = 0.41, P-value = 0.6844
Remove X3?

Because X3 does not add
significantly to the model, if we
remove it, we will still have
good predictions with a simpler
model.
Comparison of Models

Model (X1, X2, X3)
Response MDBH
Summary of Fit
RSquare                      0.867207
RSquare Adj                  0.842309
Root Mean Square Error       0.29359
Mean of Response             6.265
Observations (or Sum Wgts)   20
Parameter Estimates
Term       Estimate   Std Error  t Ratio  Prob>|t|
Intercept  3.2357323  0.346656      9.33  <.0001*
X1         0.0974056  0.025398      3.84  0.0015*
X2         -0.000169  6.052e-5     -2.79  0.0131*
X3         3.4668135  8.373792      0.41  0.6844
Effect Tests
Source  Nparm  DF  Sum of Squares  F Ratio  Prob > F
X1          1   1       1.2678300  14.7089  0.0015*
X2          1   1       0.6709668   7.7843  0.0131*
X3          1   1       0.0147740   0.1714  0.6844

Model (X1, X2)
Response MDBH
Summary of Fit
RSquare                      0.865785
RSquare Adj                  0.849995
Root Mean Square Error       0.286345
Mean of Response             6.265
Observations (or Sum Wgts)   20
Parameter Estimates
Term       Estimate   Std Error  t Ratio  Prob>|t|
Intercept  3.2605137  0.333024      9.79  <.0001*
X1         0.1069135  0.010578     10.11  <.0001*
X2         -0.00019   3.256e-5     -5.83  <.0001*
Effect Tests
Source  Nparm  DF  Sum of Squares   F Ratio  Prob > F
X1          1   1       8.3755800  102.1490  <.0001*
X2          1   1       2.7845615   33.9607  <.0001*
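
The two Summary of Fit tables show why dropping X3 costs almost nothing. R2 falls only slightly, while adjusted R2, which penalizes each extra term, rises. A short sketch of the adjusted R2 calculation, with an illustrative function name and the RSquare values copied from the output:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R2 for a model with p explanatory variables and n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 20
adj_full = adjusted_r2(0.867207, n, p=3)      # model with X1, X2, X3
adj_reduced = adjusted_r2(0.865785, n, p=2)   # model with X1, X2

print(round(adj_full, 4), round(adj_reduced, 4))
```

The simpler model has the higher adjusted R2 and the smaller Root Mean Square Error, so by these measures it predicts at least as well as the three-variable model.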