Best subsets regression

Model selection
Best subsets regression
Statement of problem
• A common problem is that there is a large
set of candidate predictor variables.
• The goal is to choose a small subset from the
larger set so that the resulting regression
model is simple, yet has good predictive
ability.
Example: Cement data
• Response y: heat evolved in calories during
hardening of cement on a per gram basis
• Predictor x1: % of tricalcium aluminate
• Predictor x2: % of tricalcium silicate
• Predictor x3: % of tetracalcium alumino ferrite
• Predictor x4: % of dicalcium silicate
Example: Cement data
[Scatterplot matrix of y, x1, x2, x3, and x4]
Two basic methods
of selecting predictors
• Stepwise regression: Enter and remove
predictors, in a stepwise manner, until there is
no justifiable reason to enter or remove any more.
• Best subsets regression: Select the subset
of predictors that does the best job of meeting
some well-defined objective criterion.
Why best subsets regression?
# of predictors (p-1)   # of regression models
1                       2: ( ) (x1)
2                       4: ( ) (x1) (x2) (x1, x2)
3                       8: ( ) (x1) (x2) (x3) (x1, x2) (x1, x3) (x2, x3) (x1, x2, x3)
4                       16: 1 none, 4 one, 6 two, 4 three, 1 four
Why best subsets regression?
• If there are p-1 possible predictors, then
there are 2^(p-1) possible regression models
containing subsets of those predictors.
• For example, 10 predictors yields 2^10 = 1024
possible regression models.
• A best subsets algorithm determines the
best subsets of each size, so that the choice of
the final model can be made by the researcher
(a brute-force sketch follows below).
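Because every subset must be fit, the idea is easy to imitate in code. The following Python sketch is purely illustrative (the DataFrame `X`, Series `y`, function name, and use of statsmodels are my assumptions, not part of the slides): it fits all 2^(p-1) models and keeps the highest-R-squared model of each size.

```python
# Brute-force "best subsets" sketch: fit every subset of the candidate
# predictors and keep the highest-R-squared model of each size.
# Assumes predictors in a pandas DataFrame X and the response in a Series y.
from itertools import combinations

import pandas as pd
import statsmodels.api as sm


def best_subsets(X: pd.DataFrame, y: pd.Series):
    """Return {subset size: (R-squared, predictor names)} for the best
    subset of each size, searching all 2**(number of predictors) models."""
    best = {}
    predictors = list(X.columns)
    for size in range(len(predictors) + 1):
        for combo in combinations(predictors, size):
            if combo:
                design = sm.add_constant(X[list(combo)])
            else:
                # Intercept-only model for the empty subset.
                design = pd.DataFrame({"const": 1.0}, index=y.index)
            fit = sm.OLS(y, design).fit()
            if size not in best or fit.rsquared > best[size][0]:
                best[size] = (fit.rsquared, combo)
    return best
```

With 10 candidate predictors this loop fits 1024 models, which is still cheap; the point of a best subsets routine is that the researcher only has to compare the winners of each size.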
What is used to judge “best”?
• R-squared
• Adjusted R-squared
• MSE (or S = square root of MSE)
• Mallows' Cp
R-squared
$$R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}$$
Use the R-squared values to find the point
where adding more predictors is not
worthwhile because it leads to a very small
increase in R-squared.
Adjusted R-squared or MSE
$$R_a^2 = 1 - \left(\frac{n-1}{n-p}\right)\frac{SSE}{SSTO} = 1 - \left(\frac{n-1}{SSTO}\right)MSE$$
Adjusted R-squared increases only if MSE decreases,
so adjusted R-squared and MSE provide equivalent
information.
Find a few subsets for which MSE is smallest (or
adjusted R-squared is largest) or so close to the
smallest (largest) that adding more predictors is not
worthwhile.
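All of these criteria are simple functions of SSE, SSTO, n, and the number of parameters p, so they can be computed side by side. The helper below is a minimal illustrative sketch (the function and its name are mine, not from the slides):

```python
# Compute R-squared, adjusted R-squared, MSE, and S from the building
# blocks used above. Here p counts ALL parameters, including the intercept.
import math


def summary_criteria(sse: float, ssto: float, n: int, p: int) -> dict:
    r_sq = 1.0 - sse / ssto
    mse = sse / (n - p)
    # Equivalent forms: 1 - ((n-1)/(n-p))*(SSE/SSTO) = 1 - (n-1)*MSE/SSTO
    adj_r_sq = 1.0 - (n - 1) * mse / ssto
    return {"R-Sq": r_sq, "R-Sq(adj)": adj_r_sq, "MSE": mse, "S": math.sqrt(mse)}
```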
Mallows' Cp criterion
The goal is to minimize the total standardized
mean square error of prediction:
$$\Gamma_p = \frac{1}{\sigma^2}\sum_{i=1}^{n} E\left\{\left[\hat{Y}_{ip} - E(Y_i)\right]^2\right\}$$
which equals:
$$\Gamma_p = \frac{1}{\sigma^2}\left\{\sum_{i=1}^{n}\left[E(\hat{Y}_{ip}) - E(Y_i)\right]^2 + \sum_{i=1}^{n}\operatorname{Var}(\hat{Y}_{ip})\right\}$$
which in English is:
$$\Gamma_p = (\text{some bias}) + (\text{some variance})$$
Mallows' Cp criterion
Mallows' Cp statistic:
$$C_p = \frac{SSE_p}{MSE(X_1, \ldots, X_{p-1})} - (n - 2p)$$
which estimates $\Gamma_p$,
where:
• SSE_p is the error sum of squares for the fitted (subset) regression
model with p parameters.
• MSE(X1, …, Xp-1) is the MSE of the model containing all p-1
candidate predictors. It is an unbiased estimator of σ².
• p is the number of parameters in the (subset) model, including the intercept.
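The arithmetic is a one-liner, shown here as a hedged sketch (the function name is mine). Using values consistent with the cement output shown later (the (x1, x2) model has n = 13, p = 3, SSE of roughly 57.9, and the full model's MSE of roughly 5.98), it returns a Cp of about 2.7, matching the table.

```python
# Mallows' Cp from the pieces defined above: SSE of the subset model,
# MSE of the full model (the estimate of sigma^2), sample size n, and the
# subset model's parameter count p (including the intercept).
def mallows_cp(sse_subset: float, mse_full: float, n: int, p: int) -> float:
    return sse_subset / mse_full - (n - 2 * p)


# e.g. mallows_cp(57.9, 5.98, n=13, p=3) -> about 2.7 (the (x1, x2) cement model)
```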
Facts about Mallows' Cp
• Subset models with small Cp values have a small
total standardized MSE of prediction.
• When the Cp value is …
– near p, the bias is small (next to none),
– much greater than p, the bias is substantial,
– below p, it is due to sampling error; interpret as no bias.
• For the largest model with all possible predictors,
Cp= p (always).
Using the Cp criterion
• So, identify subsets of predictors for which:
– the Cp value is smallest, and
– the Cp value is near p (if possible)
• In general, though, don’t always choose the
largest model just because it yields Cp= p.
Best Subsets Regression: y versus x1, x2, x3, x4
Response is y
Vars   R-Sq   R-Sq(adj)    C-p        S   x1  x2  x3  x4
   1   67.5        64.5  138.7   8.9639                 X
   1   66.6        63.6  142.5   9.0771        X
   2   97.9        97.4    2.7   2.4063    X   X
   2   97.2        96.7    5.5   2.7343    X            X
   3   98.2        97.6    3.0   2.3087    X   X        X
   3   98.2        97.6    3.0   2.3121    X   X   X
   4   98.2        97.4    5.0   2.4460    X   X   X    X
Stepwise Regression: y versus x1, x2, x3, x4
Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15

Response is y on 4 predictors, with N = 13

Step             1         2         3         4
Constant    117.57    103.10     71.65     52.58

x4          -0.738    -0.614    -0.237
T-Value      -4.77    -12.62     -1.37
P-Value      0.001     0.000     0.205

x1                      1.44      1.45      1.47
T-Value                10.40     12.41     12.10
P-Value                 0.000     0.000     0.000

x2                                0.416     0.662
T-Value                            2.24     14.44
P-Value                           0.052     0.000

S             8.96      2.73      2.31      2.41
R-Sq         67.45     97.25     98.23     97.87
R-Sq(adj)    64.50     96.70     97.64     97.44
C-p          138.7       5.5       3.0       2.7
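In the output above, x4 enters first, x1 and x2 follow, and x4 is then removed at the last step (P = 0.205 > 0.15), leaving x1 and x2 in the final model. For comparison, a rough Python sketch of this kind of enter/remove procedure (alpha-to-enter = alpha-to-remove = 0.15) is below; it imitates the logic rather than reproducing Minitab's exact implementation, and it assumes the predictors sit in a pandas DataFrame `X` with the response in a Series `y`.

```python
# Stepwise selection sketch: repeatedly enter the excluded predictor with
# the smallest p-value (if below alpha_enter), then drop any included
# predictor whose p-value exceeds alpha_remove. Stops when nothing changes.
import pandas as pd
import statsmodels.api as sm


def stepwise(X: pd.DataFrame, y: pd.Series,
             alpha_enter: float = 0.15, alpha_remove: float = 0.15):
    included = []
    while True:
        changed = False

        # Entry step: best candidate among the excluded predictors.
        excluded = [c for c in X.columns if c not in included]
        entry_pvalues = {}
        for c in excluded:
            fit = sm.OLS(y, sm.add_constant(X[included + [c]])).fit()
            entry_pvalues[c] = fit.pvalues[c]
        if entry_pvalues:
            best = min(entry_pvalues, key=entry_pvalues.get)
            if entry_pvalues[best] < alpha_enter:
                included.append(best)
                changed = True

        # Removal step: worst predictor currently in the model.
        if included:
            fit = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvalues = fit.pvalues.drop("const")
            worst = pvalues.idxmax()
            if pvalues[worst] > alpha_remove:
                included.remove(worst)
                changed = True

        if not changed:
            return included
```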
Example: Modeling PIQ
[Scatterplot matrix of PIQ, MRI, Height, and Weight]
Best Subsets Regression: PIQ versus MRI, Height, Weight
Response is PIQ
Vars   R-Sq   R-Sq(adj)    C-p        S   MRI  Height  Weight
   1   14.3        11.9    7.3   21.212    X
   1    0.9         0.0   13.8   22.810          X
   2   29.5        25.5    2.0   19.510    X     X
   2   19.3        14.6    6.9   20.878    X              X
   3   29.5        23.3    4.0   19.794    X     X        X
Stepwise Regression: PIQ versus MRI, Height, Weight
Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15

Response is PIQ on 3 predictors, with N = 38

Step             1         2
Constant     4.652   111.276

MRI           1.18      2.06
T-Value       2.45      3.77
P-Value      0.019     0.001

Height                 -2.73
T-Value                -2.75
P-Value                0.009

S             21.2      19.5
R-Sq         14.27     29.49
R-Sq(adj)    11.89     25.46
C-p            7.3       2.0
Example: Modeling BP
[Scatterplot matrix of BP, Age, Weight, BSA, Duration, Pulse, and Stress]
Best Subsets Regression: BP versus Age, Weight, ...
Response is BP
Vars   R-Sq   R-Sq(adj)    C-p        S
   1   90.3        89.7  312.8   1.7405
   1   75.0        73.6  829.1   2.7903
   2   99.1        99.0   15.1  0.53269
   2   92.0        91.0  256.6   1.6246
   3   99.5        99.4    6.4  0.43705
   3   99.2        99.1   14.1  0.52012
   4   99.5        99.4    6.4  0.42591
   4   99.5        99.4    7.1  0.43500
   5   99.6        99.4    7.0  0.42142
   5   99.5        99.4    7.7  0.43078
   6   99.6        99.4    7.0  0.40723

[Each row of the output also carries X marks indicating which of Age, Weight, BSA, Duration, Pulse, and Stress are in that model.]
Stepwise Regression: BP versus Age, Weight, BSA, Duration,
Pulse, Stress
Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15

Response is BP on 6 predictors, with N = 20

Step             1         2         3
Constant     2.205   -16.579   -13.667

Weight       1.201     1.033     0.906
T-Value      12.92     33.15     18.49
P-Value      0.000     0.000     0.000

Age                    0.708     0.702
T-Value                13.23     15.96
P-Value                0.000     0.000

BSA                              4.6
T-Value                          3.04
P-Value                          0.008

S             1.74     0.533     0.437
R-Sq         90.26     99.14     99.45
R-Sq(adj)    89.72     99.04     99.35
C-p          312.8      15.1       6.4
Best subsets regression
• Stat >> Regression >> Best subsets …
• Specify response and all possible predictors.
• If desired, specify predictors that must be
included in every model (researcher's
knowledge!); a rough scripted analogue of this
option is sketched below.
• Select OK. Results appear in the session
window.
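Outside Minitab, the "must include" option amounts to enumerating only the free predictors and appending the forced ones to every candidate subset. The hypothetical helper below (a companion to the earlier best_subsets sketch; the names are mine) shows the idea:

```python
# Enumerate candidate subsets while forcing certain predictors into every
# model, mirroring the "include in all models" option described above.
from itertools import combinations


def subsets_with_forced(predictors, forced):
    """Yield every subset of `predictors` that contains all of `forced`."""
    free = [p for p in predictors if p not in forced]
    for size in range(len(free) + 1):
        for combo in combinations(free, size):
            yield tuple(forced) + combo
```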