Chapter 9 Building the Regression Model I: Model Selection and Validation

9.3 Criteria for Model Selection
Two opposing criteria govern model selection:
- Include as many covariates as possible, so that the fitted values are reliable.
- Include as few covariates as possible, so that the cost of obtaining information and monitoring is low.
Note:
There is no unique statistical procedure for selecting the best regression model.
Note:
Common sense, basic knowledge of the data being analyzed, and considerations related to the invariance principle (shift and scale invariance) can never be set aside.
Motivating example:
The "Hald" regression data (13 observations in total):

      Y    X1    X2    X3    X4
   78.5     7    26     6    60
   74.3     1    29    15    52
  104.3    11    56     8    20
   87.6    11    31     8    47
   95.9     7    52     6    33
  109.2    11    55     9    22
  102.7     3    71    17     6
   72.5     1    31    22    44
   93.1     2    54    18    22
  115.9    21    47     4    26
   83.8     1    40    23    34
  113.3    11    66     9    12
  109.4    10    68     8    12
Several methods which can be used are:
(a) using the value of $R^2$;
(b) using the value of $s^2$, the mean residual sum of squares;
(c) using Mallows' $C_p$ statistic;
(d) using the $AIC_p$ and $SBC_p$ criteria;
(e) using the $PRESS_p$ criterion.
(a) $R^2$:
Example (continued):
In the "Hald" data there are 4 covariates, $X_1, X_2, X_3, X_4$. All possible models are divided into 5 sets:

Set A: $Y = \beta_0 + \varepsilon$; $\binom{4}{0} = 1$ possible model.
Set B: $Y = \beta_0 + \beta_i X_i + \varepsilon$, $i = 1, 2, 3, 4$; $\binom{4}{1} = 4$ possible models.
Set C: $Y = \beta_0 + \beta_i X_i + \beta_j X_j + \varepsilon$, $i \neq j$, $i, j = 1, 2, 3, 4$; $\binom{4}{2} = 6$ possible models.
Set D: $Y = \beta_0 + \beta_i X_i + \beta_j X_j + \beta_k X_k + \varepsilon$, $i \neq j \neq k$, $i, j, k = 1, 2, 3, 4$; $\binom{4}{3} = 4$ possible models.
Set E: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon$; $\binom{4}{4} = 1$ possible model.

In total there are $2^4 = 16$ models.
For every set, one or two models with large $R^2$ are picked. They are the following:

Set   Model                                                                                   $R^2$
B     $Y = \beta_0 + \beta_2 X_2 + \varepsilon$                                               0.666
B     $Y = \beta_0 + \beta_4 X_4 + \varepsilon$                                               0.675
C     $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$                                 0.979
C     $Y = \beta_0 + \beta_1 X_1 + \beta_4 X_4 + \varepsilon$                                 0.972
D     $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_4 X_4 + \varepsilon$                   0.982
D     $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$                   0.982
E     $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon$     0.982
Principle based on $R^2$:
A model with a large $R^2$ and a small number of covariates should be a good choice, since a large $R^2$ implies reliable fitted values and a small number of covariates reduces the cost of obtaining information and monitoring.
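
To make this concrete, here is a minimal sketch in Python (NumPy only) that fits all $2^4 = 16$ possible models to the Hald data by least squares and prints each model's $R^2$; the printed values match the table above. The names `hald` and `rss` are our own choices for illustration.

```python
import numpy as np
from itertools import combinations

# Hald data from the table above: columns Y, X1, X2, X3, X4 (13 observations).
hald = np.array([
    [ 78.5,  7, 26,  6, 60], [ 74.3,  1, 29, 15, 52],
    [104.3, 11, 56,  8, 20], [ 87.6, 11, 31,  8, 47],
    [ 95.9,  7, 52,  6, 33], [109.2, 11, 55,  9, 22],
    [102.7,  3, 71, 17,  6], [ 72.5,  1, 31, 22, 44],
    [ 93.1,  2, 54, 18, 22], [115.9, 21, 47,  4, 26],
    [ 83.8,  1, 40, 23, 34], [113.3, 11, 66,  9, 12],
    [109.4, 10, 68,  8, 12],
])
y, X = hald[:, 0], hald[:, 1:]
n = len(y)

def rss(subset):
    """Residual sum of squares of the least-squares fit on the given covariate subset."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    e = y - Z @ beta
    return e @ e

sst = np.sum((y - y.mean()) ** 2)   # total sum of squares about the mean
for k in range(5):                  # sets A (k = 0) through E (k = 4)
    for subset in combinations(range(4), k):
        label = ",".join(f"X{j + 1}" for j in subset) or "intercept only"
        print(f"({label}): R^2 = {1 - rss(subset) / sst:.3f}")
```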
Example (continued):
Based on the above principle, the two models
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$$
and
$$Y = \beta_0 + \beta_1 X_1 + \beta_4 X_4 + \varepsilon$$
are sensible choices.
Note:
$X_2$ and $X_4$ are highly correlated; the correlation coefficient is $-0.973$. Therefore, it is not surprising that the two models have very close $R^2$ values.
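
This correlation is easy to verify with the arrays defined in the sketch above:

```python
# Correlation between X2 and X4 (columns 1 and 3 of X); prints about -0.973.
print(np.corrcoef(X[:, 1], X[:, 3])[0, 1])
```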
(b) Mean residual sum of squares $s^2$:

A useful result:
As more and more covariates are added to an already overfitted model, the mean residual sum of squares will tend to stabilize and approach the true value of $\sigma^2$, provided that all important variables have been included. That is:

[Figure: $s^2$ plotted against the number of covariates $p$; the curve decreases and then levels off near $\sigma^2$.]
Example (continued):
Again, we compute $s^2$ for all 16 possible models. We have the following table:

Set   $s^2$ values                                                           Average $s^2$
B     115.06 (X1), 82.39 (X2), 176.31 (X3), 80.35 (X4)                       113.53
C     5.79 (X1,X2), 122.71 (X1,X3), 7.48 (X1,X4), 41.54 (X2,X3),             47.00
      86.89 (X2,X4), 17.59 (X3,X4)
D     5.35 (X1,X2,X3), 5.33 (X1,X2,X4), 5.65 (X1,X3,X4), 8.20 (X2,X3,X4)     6.13
E     5.98 (X1,X2,X3,X4)                                                     5.98
The plot of average $s^2$ against $p$ (the number of parameters, including $\beta_0$) is:

[Figure: average $s^2$ plotted against $p$ for $p = 2, \ldots, 5$; the curve drops sharply and then levels off.]
Principle based on $s^2$:
A model with mean residual sum of squares $s^2$ close to the estimate of $\sigma^2$ (the horizontal line) and with the fewest covariates might be a sensible model.
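
Reusing `y`, `X`, `n`, `rss`, and `combinations` from the first sketch, the per-set averages of $s^2$ in the table above can be reproduced in a few lines:

```python
# s^2 = RSS / (n - p), where p counts the intercept plus the k covariates.
for k in range(1, 5):
    s2 = [rss(sub) / (n - (k + 1)) for sub in combinations(range(4), k)]
    print(f"{k} covariate(s): average s^2 = {np.mean(s2):.2f}")
```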
Example:
The estimate of $\sigma^2$ could be about 6. The model
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$$
is sensible, since its mean residual sum of squares $s^2 = 5.79$ is close to 6 and its number of covariates is small compared with the other models whose $s^2$ is close to 6.
(c) Mallows' $C_p$:

Suppose the full model (containing all possible covariates) is
$$\text{model } r: \quad Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_{r-1} X_{r-1} + \varepsilon.$$
Then
$$\text{Mallows' } C_p = \frac{RSS(\text{model } p)}{s^2} - (n - 2p),$$
where $n$ is the sample size, $p$ is the number of parameters, $RSS(\text{model } p)$ is the residual sum of squares from a model containing $p$ parameters, and $s^2$ is the mean residual sum of squares from the model containing all possible covariates.
Intuition of Mallows' $C_p$:

Suppose $s^2$ is the mean residual sum of squares from the full model (containing all possible covariates)
$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_{r-1} X_{r-1} + \varepsilon,$$
and
$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_{p-1} X_{p-1} + \varepsilon$$
is the true model, $p \le r$. Then $RSS(\text{model } p)/(n - p)$, the mean residual sum of squares from model $p$, should estimate $\sigma^2$ accurately. That is, $RSS(\text{model } p)/(n - p) \approx \sigma^2$, so $RSS(\text{model } p) \approx (n - p)\sigma^2$. Also, the mean residual sum of squares from the overfitted full model satisfies $s^2 \approx \sigma^2$. Therefore
$$C_p = \frac{RSS(\text{model } p)}{s^2} - (n - 2p) \approx \frac{(n - p)\sigma^2}{\sigma^2} - (n - 2p) = (n - p) - (n - 2p) = p.$$
Thus, for an adequate model, the point $(p, C_p)$ falls close to the line $C_p = p$ (the 45-degree line).
[Figure: sketch of $C_p$ against $p$; points $(p, C_p)$ for adequate models lie near the line $C_p = p$, and the full model's point $(r, C_r)$ lies exactly on it.]

Note: $C_r = r$ always, by construction.
Principle based on Mallows' $C_p$:
Plot $C_p$ versus $p$ for every possible model, then choose models with few covariates whose points lie close to the line $C_p = p$.
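
Again reusing the definitions from the first sketch, $C_p$ for all 16 models takes only a few lines; the printed values match the table that follows.

```python
# s^2 from the full model (all four covariates, p = 5 parameters).
s2_full = rss((0, 1, 2, 3)) / (n - 5)

for k in range(5):
    for subset in combinations(range(4), k):
        p = k + 1
        cp = rss(subset) / s2_full - (n - 2 * p)
        label = ",".join(f"X{j + 1}" for j in subset) or "intercept only"
        print(f"({label}): p = {p}, C_p = {cp:.1f}")
```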
Example (continued):
For the motivating example, we calculate $C_p$ for all 16 possible models. We then have the following table:

Set   $C_p$ values
A     443.2
B     202.5 (X1), 142.5 (X2), 315.2 (X3), 138.7 (X4)
C     2.7 (X1,X2), 198.1 (X1,X3), 5.5 (X1,X4), 62.4 (X2,X3), 138.2 (X2,X4), 22.4 (X3,X4)
D     3.0 (X1,X2,X3), 3.0 (X1,X2,X4), 3.5 (X1,X3,X4), 7.3 (X2,X3,X4)
E     5.0 (X1,X2,X3,X4)
[Figure: $C_p$ versus $p$ for the models with small $C_p$: $(X_1, X_2)$ at $p = 3$, and $(X_1, X_2, X_3)$, $(X_1, X_2, X_4)$, $(X_1, X_3, X_4)$ at $p = 4$, all near the line $C_p = p$.]
The point $(p, C_p)$ for the model $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$ is close to the line $C_p = p$, and the model also has fewer parameters than the other candidates. Therefore, we recommend this model as a sensible choice.
(d) $AIC_p$ and $SBC_p$ criteria:
$$AIC_p = n \ln RSS(\text{model } p) - n \ln n + 2p$$
$$SBC_p = n \ln RSS(\text{model } p) - n \ln n + (\ln n)\,p$$
Principle based on the $AIC_p$ and $SBC_p$ criteria:
Search for models with small values of $AIC_p$ and $SBC_p$.
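
Both criteria are direct transcriptions of the formulas above; here is a sketch reusing `rss`, `n`, `np`, and `combinations` from the first snippet:

```python
# AIC_p and SBC_p differ only in the penalty term: 2p versus (ln n) * p.
for k in range(5):
    for subset in combinations(range(4), k):
        p = k + 1
        base = n * np.log(rss(subset)) - n * np.log(n)
        label = ",".join(f"X{j + 1}" for j in subset) or "intercept only"
        print(f"({label}): AIC = {base + 2 * p:.2f}, SBC = {base + np.log(n) * p:.2f}")
```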
Example (continued):
For the motivating example, we calculate $AIC_p$ and $SBC_p$ for all 16 possible models. We then have the following tables:
$AIC_p$:
Set B: 63.51 (X1), 59.17 (X2), 69.09 (X3), 58.85 (X4)
Set C: 25.41 (X1,X2), 65.11 (X1,X3), 28.74 (X1,X4), 51.03 (X2,X3), 60.62 (X2,X4), 39.86 (X3,X4)
Set D: 25.02 (X1,X2,X3), 24.97 (X1,X2,X4), 25.73 (X1,X3,X4), 30.57 (X2,X3,X4)
Set E: 26.93 (X1,X2,X3,X4)
$SBC_p$:
Set B: 64.64 (X1), 60.30 (X2), 70.22 (X3), 59.98 (X4)
Set C: 27.11 (X1,X2), 66.81 (X1,X3), 30.44 (X1,X4), 52.73 (X2,X3), 62.32 (X2,X4), 41.56 (X3,X4)
Set D: 27.28 (X1,X2,X3), 27.23 (X1,X2,X4), 27.99 (X1,X3,X4), 32.83 (X2,X3,X4)
Set E: 29.75 (X1,X2,X3,X4)
According to these two selection criteria, several models,
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon,$$
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon,$$
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_4 X_4 + \varepsilon,$$
$$Y = \beta_0 + \beta_1 X_1 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon,$$
are sensible models.
(e) $PRESS_p$ (prediction sum of squares) criterion:
$$PRESS_p = \sum_{i=1}^{n} \left( y_i - \hat{y}_{i(i)}(\text{model } p) \right)^2,$$
where $\hat{y}_{i(i)}(\text{model } p)$ is the predicted value of $y_i$ from the current model (model $p$) fitted without observation $i$. For example, in the "Hald" data, if the current model is
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon,$$
then $\hat{y}_{2(2)}(\text{model } 3)$ is the predicted value of $y_2$ obtained by using the data $y_1, y_3, y_4, \ldots, y_{13}$ to fit the model.

Principle based on the $PRESS_p$ criterion:
Search for models with small values of $PRESS_p$.
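
A sketch of the computation, reusing `y`, `X`, `n`, `np`, and `combinations` from the first snippet. Instead of refitting the model $n$ times, it uses the standard leave-one-out identity $y_i - \hat{y}_{i(i)} = e_i / (1 - h_{ii})$, where $e_i$ is the ordinary residual and $h_{ii}$ is the $i$-th diagonal element of the hat matrix.

```python
def press(subset):
    """PRESS via the leave-one-out identity e_{i(i)} = e_i / (1 - h_ii)."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    H = Z @ np.linalg.solve(Z.T @ Z, Z.T)   # hat matrix H = Z (Z'Z)^{-1} Z'
    e = y - H @ y                           # ordinary residuals
    return np.sum((e / (1 - np.diag(H))) ** 2)

for k in range(1, 5):
    for subset in combinations(range(4), k):
        label = ",".join(f"X{j + 1}" for j in subset)
        print(f"({label}): PRESS = {press(subset):.1f}")
```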