Stat 301 – Lecture 25: Model Selection

Model Selection
- In multiple regression we often have many explanatory variables.
- How do we find the “best” model?
Model Selection

How can we select the set of explanatory variables that explains the most variation in the response, with each variable adding significantly to the model?
Cruising Timber


Response: Mean Diameter at Breast Height (MDBH) of a tree.
Explanatory:
- X1 = Mean Height of Pines
- X2 = Age of Tract times the Number of Pines
- X3 = Mean Height of Pines divided by the Number of Pines
Forward Selection


- Begin with no variables in the model.
- At each step, check whether a variable can be added to the model.
- If it can, add it; if not, stop.
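The steps above can be sketched as a loop. This is a minimal sketch on synthetic data (the lecture's timber data are not reproduced here); the 0.05 entry threshold, the variable names, and the helper function are illustrative assumptions, not part of the lecture:

```python
# Forward selection sketch on synthetic data (hypothetical example).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 20
X = rng.normal(size=(n, 3))                               # candidate predictors
y = 3.9 + 2.0 * X[:, 2] + rng.normal(scale=0.4, size=n)   # only X[:, 2] matters

def p_value_of_last(y, cols):
    """Two-sided t-test P-value for the last predictor in an OLS fit."""
    Z = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    df = len(y) - Z.shape[1]
    s2 = resid @ resid / df
    se = np.sqrt(s2 * np.linalg.inv(Z.T @ Z)[-1, -1])
    return 2 * stats.t.sf(abs(beta[-1] / se), df)

selected, remaining = [], [0, 1, 2]
while remaining:
    # P-value each remaining variable would have if added to the current model
    pvals = {j: p_value_of_last(y, [X[:, k] for k in selected + [j]])
             for j in remaining}
    best = min(pvals, key=pvals.get)
    if pvals[best] >= 0.05:        # nothing adds significantly: stop
        break
    selected.append(best)          # add the most significant variable
    remaining.remove(best)

print(selected)                    # the truly informative predictor enters first
```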
Forward Selection – Step 1
- Select the variable that has the highest correlation with the response.
- If this correlation is statistically significant, add the variable to the model.
JMP
- Multivariate Methods → Multivariate
- Put MDBH, X1, X2, and X3 in the Y, Columns box.
Stat 301 – Lecture 25
Multivariate – Correlations

        MDBH     X1       X2       X3
MDBH    1.0000   0.7731   0.2435   0.8404
X1      0.7731   1.0000   0.7546   0.6345
X2      0.2435   0.7546   1.0000   0.0557
X3      0.8404   0.6345   0.0557   1.0000

[Scatterplot matrix of MDBH, X1, X2, and X3]
Correlation with response

Multivariate – Pairwise Correlations

Variable   by Variable   Correlation   Count   Signif Prob
X1         MDBH          0.7731        20      <.0001*
X2         MDBH          0.2435        20      0.3008
X2         X1            0.7546        20      0.0001*
X3         MDBH          0.8404        20      <.0001*
X3         X1            0.6345        20      0.0027*
X3         X2            0.0557        20      0.8156
Comment

- The explanatory variable X3 has the highest correlation with the response MDBH: r = 0.8404.
- The correlation between X3 and MDBH is statistically significant: Signif Prob < 0.0001, a small P-value.
Step 1 – Action
Fit the simple linear regression of MDBH on X3:
- Predicted MDBH = 3.896 + 32.937*X3
- R² = 0.7063
- RMSE = 0.4117
SLR of MDBH on X3

- Test of model utility: F = 43.2886, P-value < 0.0001
- Statistical significance of X3: t = 6.58, P-value < 0.0001
- This is exactly the same as the test for a significant correlation.
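For a single predictor these two tests are linked by t = r·√(n − 2)/√(1 − r²) and F = t². A quick check with the slides' values (r = 0.8404, n = 20):

```python
# Recover the slope t-statistic from the correlation alone.
import math

r, n = 0.8404, 20                            # corr(X3, MDBH) and sample size
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
F = t ** 2                                   # one predictor: F is t squared
print(round(t, 2))                           # 6.58, as reported on the slide
```

F comes out near the reported 43.2886; the tiny gap is the rounding of r on the slide.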
Can we do better?
- Can we explain more variation in MDBH by adding one of the other variables to the model with X3?
- Will that addition be statistically significant?
Forward Selection – Step 2
- Which variable should we add, X1 or X2?
- How can we decide?
Correlation among explanatory variables

Multivariate – Pairwise Correlations

Variable   by Variable   Correlation   Count   Signif Prob
X1         MDBH          0.7731        20      <.0001*
X2         MDBH          0.2435        20      0.3008
X2         X1            0.7546        20      0.0001*
X3         MDBH          0.8404        20      <.0001*
X3         X1            0.6345        20      0.0027*
X3         X2            0.0557        20      0.8156
Multicollinearity


- Because some explanatory variables are correlated, they may carry overlapping information about the response.
- You can’t rely on the simple correlations between each explanatory variable and the response to tell you which variable to add.
Forward Selection – Step 2
- Look at partial residual plots.
- Determine statistical significance.
Partial Residual Plots

Look at the residuals from the SLR of Y on X3 plotted against the other variables once the overlapping information with X3 has been removed.
How is this done?
- Fit MDBH versus X3 and obtain residuals – Resid(Y on X3)
- Fit X1 versus X3 and obtain residuals – Resid(X1 on X3)
- Fit X2 versus X3 and obtain residuals – Resid(X2 on X3)
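The three fits can be mimicked in code. A sketch on synthetic data standing in for the timber variables (the real data are not reproduced; the coefficients are invented): correlating the two residual series gives the partial correlation of Y and X1 adjusting for X3.

```python
# Residual-on-residual ("partial residual") calculation on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
n = 20
x3 = rng.normal(size=n)
x1 = 0.8 * x3 + rng.normal(scale=0.5, size=n)                # x1 overlaps with x3
y = 4.0 + 2.0 * x3 + 1.0 * x1 + rng.normal(scale=0.3, size=n)

def resid(v, x):
    """Residuals from the simple linear regression of v on x."""
    Z = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(Z, v, rcond=None)
    return v - Z @ beta

r_y = resid(y, x3)                  # Resid(Y on X3)
r_x1 = resid(x1, x3)                # Resid(X1 on X3)
pc = np.corrcoef(r_y, r_x1)[0, 1]   # partial correlation of Y and X1 given X3
print(round(pc, 2))
```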
[Plots of Resid(Y on X3) versus Resid(X1 on X3) and versus Resid(X2 on X3)]
Correlations

                   Resid(Y on X3)   Resid(X1 on X3)   Resid(X2 on X3)
Resid(Y on X3)     1.0000           0.5726            0.3636
Resid(X1 on X3)    0.5726           1.0000            0.9320
Resid(X2 on X3)    0.3636           0.9320            1.0000
Comment

The residuals (the unexplained variation in the response) from the SLR of MDBH on X3 have the highest correlation with X1 once we have adjusted for the overlapping information with X3.
Statistical Significance

- Does X1 add significantly to the model that already contains X3?
- t = 2.88, P-value = 0.0104
- F = 8.29, P-value = 0.0104
- Because the P-value is small, X1 adds significantly to the model with X3.
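The F statistic here is the partial F-test; for a single added variable it equals t² (2.88² ≈ 8.29). The same F can also be recovered, up to the rounding of the slides' R² values, from F = (R²_full − R²_reduced) / ((1 − R²_full)/(n − p − 1)):

```python
# Partial F-test for adding X1 to the model with X3, from slide values.
n = 20
r2_reduced, r2_full = 0.706, 0.803   # R^2 with X3 only, then X3 + X1
p = 2                                # predictors in the full model
F = (r2_full - r2_reduced) / ((1 - r2_full) / (n - p - 1))
t = 2.88                             # reported t for X1
print(round(F, 2), round(t ** 2, 2)) # both near the reported F = 8.29
```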
Summary

- Step 1 – add X3: R² = 0.706
- Step 2 – add X1 to X3: R² = 0.803
- Can we do better?
Forward Selection – Step 3

- Does X2 add significantly to the model that already contains X3 and X1?
- t = –2.79, P-value = 0.0131
- F = 7.78, P-value = 0.0131
- Because the P-value is small, X2 adds significantly to the model with X3 and X1.
Summary

- Step 1 – add X3: R² = 0.706
- Step 2 – add X1 to X3: R² = 0.803
- Step 3 – add X2 to X1 and X3: R² = 0.867
Summary
- At each step the variable being added is statistically significant.
- Has the forward selection procedure found the “best” model?
“Best” Model?

- The model with all three variables is useful: F = 34.83, P-value < 0.0001.
- However, the variable X3 does not add significantly to the model with just X1 and X2: t = 0.41, P-value = 0.6844.
- So forward selection has not necessarily found the “best” model.