Stat 512 Class 1

Topic 18: Model Selection and Diagnostics
Variable Selection
• We want to choose a “best” model
that is a subset of the available
explanatory variables
• Two separate problems
– How many explanatory variables
should we use (i.e., subset size)
– Given the subset size, which
variables should we choose
KNNL Example
• Page 350, Section 9.2
• Y : survival time of patient (liver op)
• X’s (explanatory variables) are
– Blood clotting score
– Prognostic index
– Enzyme function test
– Liver function test
KNNL Example cont.
• n = 54 patients
• Start with the usual plots and
descriptive statistics
• Time-to-event data is often
heavily skewed and typically
transformed with a log
Tab-delimited data
data a1;
  infile 'U:\.www\datasets512\CH09TA01.txt' delimiter='09'x;
  input blood prog enz liver age gender alcmod alcheavy surv lsurv;
run;
Note: lsurv = ln(surv); alcmod and alcheavy are dummy variables for alcohol use.
Data

Obs  blood  prog  enz  liver  age  gender  alcmod  alcheavy  surv  lsurv
  1    6.7    62   81   2.59   50       0       1         0   695  6.544
  2    5.1    59   66   1.70   39       0       0         0   403  5.999
  3    7.4    57   83   2.16   55       0       0         0   710  6.565
  4    6.5    73   41   2.01   48       0       0         0   349  5.854
  5    7.8    65  115   4.30   45       0       0         1  2343  7.759
  6    5.8    38   72   1.42   65       1       1         0   348  5.852
Log Transform of Y
• Recall that the regression model does not require Y itself to be Normally distributed
• In this case, the transform reduces the influence of the long right tail and often stabilizes the variance of the residuals
Scatterplots
proc corr plots=matrix;
  var blood prog enz liver;
run;
proc corr plots=scatter;
  var blood prog enz liver;
  with lsurv;
run;
Correlation Summary

Pearson Correlation Coefficients, N = 54
Prob > |r| under H0: Rho=0

         blood     prog      enz       liver
lsurv    0.24619   0.46994   0.65389   0.64926
         0.0727    0.0003    <.0001    <.0001
The Two Problems in
Variable Selection
1. To determine an appropriate subset size
– Might use adjusted R2, Cp, MSE,
PRESS, AIC, SBC (BIC)
2. To determine best model of this fixed
size
– Might use R2
Adjusted R2
• R2 by its construction is guaranteed
to increase with p
– SSE cannot decrease with
additional X and SSTO constant
• Adjusted R2 uses df to account for p
$$R_a^2 = 1 - \left(\frac{n-1}{n-p}\right)\frac{SSE_p}{SSTO} = 1 - \frac{MSE_p}{MSTO}$$
Adjusted R2
• Want to find the model that maximizes $R_a^2$
• Since MSTO will remain constant for
a given data set
– Depends only on Y
• Equivalent information to MSE
• Thus could also find choice of model
that minimizes MSE
• Details on pages 354-356
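As a minimal sketch of the computation, the criterion can be evaluated in a data step; the sse and ssto values below are assumed placeholders, not figures from the text:

data adjr2;
  n = 54; p = 4;                          /* p = coefficients incl. intercept */
  sse = 3.11; ssto = 12.81;               /* assumed placeholder values       */
  r2    = 1 - sse/ssto;                   /* ordinary R-square                */
  r2adj = 1 - ((n-1)/(n-p))*(sse/ssto);   /* = 1 - MSE/MSTO                   */
run;
proc print data=adjr2; run;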
Cp Criterion
• The basic idea is to compare subset
models with the full model
• A subset model is good if there is not
substantial “bias” in the predicted values
(relative to the full model)
• Looks at the ratio of total mean squared
error and the true error variance
• See pages 357-359 for details
Cp Criterion

$$C_p = \frac{SSE_p}{MSE(\mathrm{Full})} - (n - 2p)$$

• $SSE_p$ is based on a specific choice of p−1 variables
• MSE(Full) is based on the full set of variables
• If we select the full set, $C_p = (n-p) - (n-2p) = p$
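As a quick sketch, C(p) can be computed by hand from the two quantities above; the sse_sub and mse_full values are assumed placeholders:

data cpcalc;
  n = 54; p = 4;                      /* p = coefficients incl. intercept  */
  sse_sub  = 3.11;                    /* SSE of the subset model (assumed) */
  mse_full = 0.063;                   /* MSE of the full model (assumed)   */
  cp = sse_sub/mse_full - (n - 2*p);
run;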
Use of Cp
• p is the number of regression
coefficients including the intercept
• A model is good according to this
criterion if Cp ≤ p
• Rule: Pick the smallest model for
which Cp is smaller than p or pick the
model that minimizes Cp, provided
the minimum Cp is smaller than p
SBC (BIC) and AIC
Criteria based on log(likelihood) plus a penalty for model complexity
• AIC – minimize $n\log(SSE_p/n) + 2p$
• SBC – minimize $n\log(SSE_p/n) + p\log(n)$
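A minimal data-step sketch of both formulas (the sse value is an assumed placeholder):

data crit;
  n = 54; p = 4;
  sse = 3.11;                     /* assumed placeholder SSE */
  aic = n*log(sse/n) + 2*p;
  sbc = n*log(sse/n) + p*log(n);
run;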
Other approaches
• PRESS (prediction SS)
– For each case i
– Delete the case and predict Y using
a model based on the other n-1
cases
– Look at the SS for observed minus
predicted
– Want to minimize the PRESS
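One way to compute PRESS in SAS is the press= keyword on the output statement of proc reg, which stores the deleted residuals; summing their squares gives the statistic. A sketch (data set and variable names are arbitrary):

proc reg data=a1;
  model lsurv = blood prog enz;
  output out=prout press=pres;   /* deleted residual e(i)/(1 - h(ii)) */
run;
proc sql;
  select sum(pres*pres) as press_stat from prout;
quit;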
Variable Selection
• Additional proc reg model statement
options useful in variable selection
– INCLUDE=n forces the first n
explanatory variables into all models
– BEST=n limits the output to the best n
models of each subset size or total
– START=n limits output to models that
include at least n explanatory
variables
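A sketch combining these options (assuming we want blood, listed first, forced in):

proc reg data=a1;
  model lsurv = blood prog enz liver /
        selection=rsquare include=1 best=2 start=2;
run;
* include=1 forces blood into every model, best=2 keeps the best
  2 models of each size, start=2 begins at 2-variable subsets;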
Variable Selection
• Step type procedures
– Forward selection (Step up)
– Backward elimination (Step down)
– Stepwise (forward selection with a
backward glance)
• Very popular but now have much
better search techniques like BEST
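The step-type procedures are requested through selection= as well; a sketch with conventional entry/stay significance levels:

proc reg data=a1;
  model lsurv = blood prog enz liver /
        selection=stepwise slentry=0.10 slstay=0.10;
run;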
Ordering models of the
same subset size
• Use R2 or SSE
• This approach can lead us to consider
several models that give us
approximately the same predicted
values
• May need to apply knowledge of the
subject matter to make a final selection
• Not that important if prediction is the
key goal
Proc Reg
proc reg data=a1;
  model lsurv = blood prog enz liver /
        selection=rsquare cp aic sbc b best=3;
run;
Selection Results

Number in Model   R-Square      C(p)        AIC          SBC
1                  0.4276     66.4889   -103.8269    -99.84889
1                  0.4215     67.7148   -103.2615    -99.28357
1                  0.2208    108.5558    -87.1781    -83.20011
2                  0.6633     20.5197   -130.4833   -124.51634
2                  0.5995     33.5041   -121.1126   -115.14561
2                  0.5486     43.8517   -114.6583   -108.69138
3                  0.7573      3.3905   -146.1609   -138.20494
3                  0.7178     11.4237   -138.0232   -130.06723
3                  0.6121     32.9320   -120.8442   -112.88823
4                  0.7592      5.0000   -144.5895   -134.64461
Selection Results

Parameter Estimates
Number in Model   Intercept    blood      prog       enz       liver
1                  5.26426     .          .         0.01512    .
1                  5.61218     .          .         .          0.29819
1                  5.56613     .         0.01367    .          .
2                  4.35058     .         0.01412    0.01539    .
2                  5.02818     .          .         0.01073    0.20945
2                  4.54623    0.10792     .         0.01634    .
3                  3.76618    0.09546    0.01334    0.01645    .
3                  4.40582     .         0.01101    0.01261    0.12977
3                  4.78168    0.04482     .         0.01220    0.16360
4                  3.85195    0.08368    0.01266    0.01563    0.03216
Proc Reg
proc reg data=a1;
  model lsurv = blood prog enz liver /
        selection=cp aic sbc b best=3;
run;
Selection Results

Number in Model     C(p)     R-Square      AIC          SBC
3                  3.3905     0.7573    -146.1609   -138.20494
4                  5.0000     0.7592    -144.5895   -134.64461
3                 11.4237     0.7178    -138.0232   -130.06723
WARNING: “selection=cp” just lists the models in order of lowest C(p), regardless of whether the model is good or not.
How to Choose with C(p)
1. Want small C(p)
2. Want C(p) near p
In the original paper, it was suggested to plot C(p) versus p and consider the smallest model that satisfies these criteria.
Can be somewhat subjective when determining “near”.
Proc Reg
* outest= creates a data set with the estimates & criteria;
proc reg data=a1 outest=b1;
  model lsurv = blood prog enz liver /
        selection=rsquare cp aic sbc b;
run; quit;
symbol1 v=circle i=none;
symbol2 v=none i=join;
proc gplot data=b1;
  plot _Cp_*_P_ _P_*_P_ / overlay;
run;
Mallows C(p)
[Figure: C(p) plotted against the number of parameters in the model (p = 2 to 5), with the C(p) = p reference line overlaid; the points start to approach the C(p) = p line around p = 4.]
Model Validation
• Since the data were used to generate the parameter estimates, you’d expect the model to predict the fitted Y’s well
• Want to check model predictive
ability for a separate data set
• Various techniques of cross
validation (data split, leave one out)
are possible
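A minimal sketch of a random data split; the 2/3 training fraction and the seed are arbitrary choices:

data train valid;
  set a1;
  if ranuni(12345) < 2/3 then output train;
  else output valid;
run;
* fit the candidate model on train, then compare its predictions
  against the held-out valid data;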
Regression Diagnostics
• Partial regression plots
• Studentized deleted residuals
• Hat matrix diagonals
• Dffits, Cook’s D, DFBETAS
• Variance inflation factor
• Tolerance
KNNL Example
• Page 386, Section 10.1
• Y is amount of life insurance
• X1 is average annual income
• X2 is a risk aversion score
• n = 18 managers
Read in the data set
data a1;
  infile '../data/ch10ta01.txt';
  input income risk insur;
run;
Partial regression plots
• Also called added variable
plots or adjusted variable plots
• One plot for each Xi
Partial regression plots
• These plots show the strength of the
marginal relationship between Y and Xi in
the full model.
• They can also detect
– Nonlinear relationships
– Heterogeneous variances
– Outliers
Partial regression plots
• Consider plot for X1
–Use the other X’s to predict Y
–Use the other X’s to predict X1
–Plot the residuals from the first
regression vs the residuals from
the second regression
The partial option with proc reg and plots=
proc reg data=a1 plots=partialplot;
  model insur = income risk / partial;
run;
Output

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2        173919          86960      542.33   <.0001
Error             15     2405.1476       160.3431
Corrected Total   17        176324

Root MSE         12.66267    R-Square   0.9864
Dependent Mean  134.44444    Adj R-Sq   0.9845
Coeff Var         9.41851
Output

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Tolerance
Intercept    1        -205.71866         11.39268       -18.06    <.0001     .
income       1           6.28803          0.20415        30.80    <.0001     0.93524
risk         1           4.73760          1.37808         3.44    0.0037     0.93524
Output
• The partial option on the model statement in proc reg generates graphs in the output window
• These are OK for some purposes, but we prefer better-looking plots
• To generate these plots we follow the regression steps outlined earlier and use gplot, or use plots=partialplot
Partial regression plots
*partial regression plot for risk;
proc reg data=a1;
  model insur risk = income;
  output out=a2 r=resins resris;
run;
symbol1 v=circle i=sm70;
proc gplot data=a2;
  plot resins*resris;
run;
The plot for risk
[Figure: partial regression plot for risk (resins vs resris). The partial plot for income is similar; code not shown.]
Residual plot (vs risk)
proc reg data=a1;
  model insur = risk income;
  output out=a2 r=resins;
run;
symbol1 v=circle i=sm70;
proc sort data=a2; by risk;
proc gplot data=a2;
  plot resins*risk;
run;
Residuals vs Risk
Residual plot (vs income)
proc sort data=a2; by income;
proc gplot data=a2;
plot resins*income;
run;
Residuals vs Income
[Figures: residuals plotted against income and against risk, from the fit insur = -205.72 + 6.288 income + 4.7376 risk (N = 18, Rsq = 0.9864, AdjRsq = 0.9845, RMSE = 12.663).]
Other “Residuals”
• There are several versions of residuals
1. Our usual residuals: $e_i = Y_i - \hat{Y}_i$
2. Studentized residuals: $e_i^* = e_i / \sqrt{MSE(1 - h_{ii})}$
• Studentized means dividing a residual by its standard error
• These are distributed t(n−p) (≈ Normal)
Other “Residuals”
– Studentized deleted residuals
• Delete case i and refit the model
• Compute the predicted value for
case i using this refitted model
• Compute the “studentized
residual”
• Don’t do this literally but this is
the concept
Studentized Deleted Residuals
• We use the notation (i) to indicate that case i has been deleted from the model fit computations
• $d_i = Y_i - \hat{Y}_{i(i)}$ is the deleted residual
• Turns out $d_i = e_i / (1 - h_{ii})$
• Also $\mathrm{Var}(d_i) = \mathrm{Var}(e_i)/(1 - h_{ii})^2 = MSE_{(i)}/(1 - h_{ii})$
• $t_i = e_i / \sqrt{MSE_{(i)}(1 - h_{ii})}$
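All of these residual versions are available from the output statement of proc reg; a sketch (output data set and variable names are arbitrary):

proc reg data=a1;
  model insur = income risk;
  output out=diag r=e h=lev student=stud rstudent=tdel;
run;
* r= usual residual, h= leverage h(ii), student= studentized residual,
  rstudent= studentized deleted residual t(i);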
Residuals
• When we examine the residuals,
regardless of version, we are looking
for
– Outliers
– Non-normal error distributions
– Influential observations
The r option and
studentized residuals
proc reg data=a1;
model insur=income risk/r;
run;
Output

Obs   Student Residual
1       -1.206
2       -0.910
3        2.121
4       -0.363
5       -0.210
The influence option and
studentized deleted
residuals
proc reg data=a1;
model insur=income risk
/influence;
run;
Output

Obs    Residual    RStudent
1      -14.7311     -1.2259
2      -10.9321     -0.9048
3       24.1845      2.4487
4       -4.2780     -0.3518
5       -2.5522     -0.2028
6       10.3417      1.0138
Hat matrix diagonals
• $h_{ii}$ is a measure of how much $Y_i$ is contributing to the prediction of $\hat{Y}_i$
• $\hat{Y}_i = h_{i1}Y_1 + h_{i2}Y_2 + h_{i3}Y_3 + \cdots$
• $h_{ii}$ is sometimes called the leverage of the ith observation
• It is a measure of the distance between the X values for the ith case and the means of the X values
Hat matrix diagonals
• 0 ≤ hii ≤ 1
• Σ(hii) = p
• Large value of hii suggests that ith
case is distant from the center of all
X’s
• The average value is p/n
• Values far from this average point to
cases that should be examined
carefully
Influence option gives hat diagonals

Obs   Hat Diag H
1       0.0693
2       0.1006
3       0.1890
4       0.1316
5       0.0756
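A common rule of thumb flags cases with leverage more than twice the average p/n; a sketch using the diag data set and lev variable assumed from the earlier output statement sketch:

data highlev;
  set diag;           /* from the earlier output statement sketch */
  if lev > 2*3/18;    /* 2p/n with p = 3 coefficients, n = 18     */
run;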
DFFITS
• A measure of the influence of case i on $\hat{Y}_i$ (a single case)
• Thus, it is closely related to $h_{ii}$
• It is a standardized version of the difference between $\hat{Y}_i$ computed with and without case i
• Concern if greater than 1 for small data sets or greater than $2\sqrt{p/n}$ for large data sets
Cook’s Distance
• A measure of the influence of case i on all of the $\hat{Y}_i$’s (all the cases)
• It is a standardized version of the sum of squares of the differences between the predicted values computed with and without case i
• Compare with F(p, n−p)
• Concern if the distance is above the 50th percentile
DFBETAS
• A measure of the influence of case i on each of the regression coefficients
• It is a standardized version of the difference between the regression coefficient computed with and without case i
• Concern if DFBETA is greater than 1 in small data sets or greater than $2/\sqrt{n}$ for large data sets
Variance Inflation Factor
• The VIF is related to the variance of
the estimated regression coefficients
• We calculate it for each explanatory
variable
• One suggested rule is that a value of
10 or more for VIF indicates
excessive multicollinearity
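VIFs come from the vif option on the model statement; a sketch (tol shown as well, matching the later output):

proc reg data=a1;
  model insur = income risk / vif tol;
run;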
Tolerance
• TOL = $1 - R_k^2$, where $R_k^2$ is the squared multiple correlation obtained in a regression where all other explanatory variables are used to predict $X_k$
• TOL = 1/VIF
• Described in a comment on p 410
Output

Variable    Tolerance
Intercept    .
income      0.93524
risk        0.93524
Full diagnostics
proc reg data=a1;
model insur=income risk
/r partial influence tol;
id income risk;
plot r.*(income risk);
run;
Plot statement inside Reg
• Can generate several plots within Proc Reg
• Need to know the statistic names; they are listed in Table 1 of the PLOT statement section of the REG syntax documentation
  – r. represents the usual residuals
  – rstudent. represents deleted residuals
  – p. represents predicted values
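A sketch using two of these statistic names:

proc reg data=a1;
  model insur = income risk;
  plot rstudent.*p. r.*obs.;   /* deleted residuals vs fitted values,
                                  residuals vs observation number    */
run;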
Last slide
• We went over KNNL Chapters 9 and 10
• We used program topic18.sas to
generate the output