BUILDING THE REGRESSION MODEL II: DIAGNOSTICS
A plot of residuals against a predictor variable (one already in the regression model or one
not yet in it) can be used to check whether a curvature effect exists or whether an extra
variable should be added to the current model. However, such a plot may not properly show the
marginal effect of a predictor variable, given the other predictor variables in the model.
Partial Regression Residual Plots
Definitions (first-order model with response Y and predictors X1, X2):
$$\hat{Y}_i(X_2) = b_0 + b_2 X_{i2}, \qquad e_i(Y \mid X_2) = Y_i - \hat{Y}_i(X_2)$$
$$\hat{X}_{i1}(X_2) = b_0^* + b_1^* X_{i2}, \qquad e_i(X_1 \mid X_2) = X_{i1} - \hat{X}_{i1}(X_2)$$
Residuals $e_i(Y \mid X_2)$ and $e_i(X_1 \mid X_2)$ reflect the part of Y and X1, respectively, that is not linearly associated with X2.
Plot of $e_i(Y \mid X_2)$ vs. $e_i(X_1 \mid X_2)$:
• Reveals hidden marginal relation (linear or curvilinear) between Y and X1
• Reveals strength of this relationship (Fig 10.2 of ALSM)
• Helps to uncover outlying points that may have a strong influence on least squares estimates
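The construction of the two sets of residuals can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the data are hypothetical and noise-free (Y = 1 + 2 X1 + 3 X2 exactly), and the helper names `simple_ols` and `residuals` are invented for the example. With such data the least squares line through the origin in the partial regression residual plot recovers the coefficient of X1 from the two-predictor model.

```python
def simple_ols(x, y):
    """Least squares intercept and slope for regressing y on a single x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residuals(x, y):
    """Residuals from the simple regression of y on x."""
    b0, b1 = simple_ols(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Hypothetical, noise-free data: Y = 1 + 2*X1 + 3*X2 exactly.
x1 = [1.0, 2.0, 4.0, 3.0, 6.0, 5.0]
x2 = [2.0, 1.0, 3.0, 5.0, 4.0, 6.0]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

e_y = residuals(x2, y)    # e_i(Y | X2)
e_x1 = residuals(x2, x1)  # e_i(X1 | X2)

# Slope of the least squares line through the origin in the plot
# of e(Y | X2) against e(X1 | X2).
slope = sum(a * b for a, b in zip(e_x1, e_y)) / sum(a * a for a in e_x1)
print(round(slope, 6))
```

With real (noisy) data the same slope equals the fitted coefficient b1 of the model containing both X1 and X2, and the scatter around that line shows how much X1 adds once X2 is in the model.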
Example 1:
• Y = Amount of life insurance carried
• X1 = Average annual income
• X2 = Risk aversion score
$$\hat{Y} = -205.72 + 6.2880 X_1 + 4.738 X_2$$
$$\hat{Y}(X_2) = 50.70 + 15.54 X_2$$
$$\hat{X}_1(X_2) = 40.779 + 1.718 X_2$$
The plots suggest that a linear relation for X1 is not appropriate in the model already containing X2. The partial regression residual plot shows:
(1) The curvilinear relation (slight concave upward shape) is strongly positive, but the deviations from linearity appear to be modest.
(2) The scatter of the points around the least squares line through the origin with slope b1 = 6.2880 is much smaller than the scatter around the horizontal line e(Y|X2) = 0, indicating that adding X1 to the regression model with a linear relation will substantially reduce the error sum of squares.
(3) Incorporating a curvilinear effect for X1 will lead to only a modest further reduction in the error sum of squares.
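As a quick consistency check on Example 1 (a worked sketch, assuming the three fitted equations are $\hat{Y} = -205.72 + 6.2880 X_1 + 4.738 X_2$, $\hat{Y}(X_2) = 50.70 + 15.54 X_2$ and $\hat{X}_1(X_2) = 40.779 + 1.718 X_2$), substituting the fitted $\hat{X}_1(X_2)$ into the two-predictor equation should reproduce the simple regression of Y on X2 up to rounding:

$$
\begin{aligned}
\text{slope on } X_2 &: \quad 4.738 + 6.2880 \times 1.718 = 4.738 + 10.803 = 15.541 \approx 15.54 \\
\text{intercept} &: \quad -205.72 + 6.2880 \times 40.779 = -205.72 + 256.418 = 50.698 \approx 50.70
\end{aligned}
$$

Both values agree with the fitted line for Y on X2. This is the algebraic reason why the line through the origin in the partial regression residual plot has slope b1 = 6.2880.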
Example 2:
• Y = Body fat
• X1 = Triceps skinfold thickness
• X2 = Thigh circumference
$$\hat{Y} = -19.174 + 0.2224 X_1 + 0.6594 X_2$$
X1 is of little additional help when X2 is already present.
X2 may be helpful even when X1 is already present.
Use partial regression residual plots with caution (pages 389-390 of ALSM).
Outlying Cases (Figure 10.5 of ALSM)
(1) Outlying cases are cases that are extreme relative to the rest of the data set.
(2) A case may be outlying or extreme with respect to its Y value, its X value(s), or both.
(3) Outlying cases should be carefully studied to decide whether they should be retained or
eliminated.
(4) If retained, carefully decide whether their influence should be reduced in the fitting
process and/or the regression model should be revised.
Identifying Outlying Y Observations - Studentized Deleted Residuals

Residuals and Semistudentized Residuals
$$e_i = Y_i - \hat{Y}_i, \qquad e_i^* = \frac{e_i}{\sqrt{MSE}}$$

Need for improved residuals
• Residuals do not have the same variance. The true variance of the residuals involves the hat matrix:
$$H = X(X'X)^{-1}X', \qquad \hat{Y} = HY, \qquad e = (I - H)Y$$
$$\sigma^2\{e\} = \sigma^2 (I - H)$$
$$\sigma^2\{e_i\} = \sigma^2 (1 - h_{ii}), \qquad \sigma\{e_i, e_j\} = -h_{ij}\,\sigma^2, \quad i \neq j$$
$$s^2\{e_i\} = MSE\,(1 - h_{ii}), \qquad s\{e_i, e_j\} = -h_{ij}\,MSE, \quad i \neq j$$
• The ith observation affects the ith fitted value, distorting the ordinary residual.

Studentized Residual
$$r_i = \frac{e_i}{s\{e_i\}}, \qquad s\{e_i\} = \sqrt{MSE\,(1 - h_{ii})}$$

Deleted Residual
$$d_i = Y_i - \hat{Y}_{i(i)} = \frac{e_i}{1 - h_{ii}}, \qquad s^2\{d_i\} = \frac{MSE_{(i)}}{1 - h_{ii}}$$
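Studentized Deleted Residual
$$t_i = \frac{d_i}{s\{d_i\}} \sim t(n - p - 1)$$
$$t_i = e_i \left[ \frac{n - p - 1}{SSE\,(1 - h_{ii}) - e_i^2} \right]^{1/2}$$

Test for Outliers
Bonferroni test procedure: compare |t_i| with the critical value t(1 - α/2n; n - p - 1).

SAS code for the critical value:
data t;
/* example values; replace alpha, n and p with those of the data set at hand */
alpha=0.10; n=20; p=3;
/* these values reproduce t(.9975; 16) */
tvalue=tinv(1-alpha/(2*n),n-p-1);
run;
proc print data=t;
run;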
Body Fat Example (two predictors X1 and X2; p = 3 parameters)
Case Summaries

Case   Unstandardized   Leverage   Studentized
       Residual         Value      Deleted Residual
  1      -1.683           .201        -.730
  2       3.643           .059        1.534
  3      -3.176           .372       -1.656
  4      -3.158           .111       -1.348
  5        .000           .248         .000
  6       -.361           .129        -.148
  7        .716           .156         .298
  8       4.015           .096        1.760
  9       2.655           .115        1.117
 10      -2.475           .110       -1.034
 11        .336           .120         .137
 12       2.226           .109         .923
 13      -3.947           .178       -1.825
 14       3.447           .148        1.524
 15        .571           .333         .267
 16        .642           .095         .258
 17       -.851           .106        -.344
 18       -.783           .197        -.335
 19      -2.857           .067       -1.176
 20       1.040           .050         .409

• Cases 3, 8 and 13 have the largest absolute studentized deleted residuals.
• The ordinary residuals identify cases 2, 8 and 13 as the most outlying, but not case 3.
• Test whether case 13 is an outlier (Bonferroni, family α = .10, n = 20, p = 3):
$$t(1 - \alpha/2n;\; n - p - 1) = t(.9975;\, 16) = 3.252$$
$$|t_{13}| = 1.825 \le 3.252$$
• Case 13 is not an outlier.
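The formula for the studentized deleted residual needs only the ordinary residual, the leverage value and SSE, so the table entries can be reproduced without refitting the model n times. The sketch below uses three rows of the Case Summaries table. SSE = 109.95 is an assumption here, since it is not stated in these notes (it is the value ALSM reports for the two-predictor body fat model). The function name is made up for illustration.

```python
import math


def studentized_deleted_residual(e_i, h_ii, sse, n, p):
    """t_i = e_i * sqrt((n - p - 1) / (SSE * (1 - h_ii) - e_i**2))."""
    return e_i * math.sqrt((n - p - 1) / (sse * (1.0 - h_ii) - e_i ** 2))


n, p = 20, 3      # body fat example: 20 cases, 3 regression parameters
sse = 109.95      # ASSUMPTION: not given in these notes (value reported in ALSM)
t_crit = 3.252    # Bonferroni critical value t(.9975; 16) quoted in the notes

# (e_i, h_ii) for cases 3, 8 and 13, copied from the Case Summaries table
cases = {3: (-3.176, 0.372), 8: (4.015, 0.096), 13: (-3.947, 0.178)}

t_values = {
    case: studentized_deleted_residual(e, h, sse, n, p)
    for case, (e, h) in cases.items()
}
outliers = [case for case, t in t_values.items() if abs(t) > t_crit]

for case, t in t_values.items():
    print(case, round(t, 3))
print("outliers:", outliers)
```

Under that SSE assumption, the computed values agree with the tabulated ones to within rounding, and no case exceeds the Bonferroni critical value.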
Identifying Outlying X Observations - Hat Matrix Leverage Values
• The hat matrix plays a major role in identifying outlying Y observations.
• The hat matrix is also useful in identifying outlying X observations.
• The value $h_{ii}$ measures the distance between observation $X_i$ and the centroid of the X's (Figure 10.6 of ALSM).
• Useful properties:
$$0 \le h_{ii} \le 1, \qquad \sum_{i=1}^{n} h_{ii} = p, \qquad \bar{h} = \frac{\sum_{i=1}^{n} h_{ii}}{n} = \frac{p}{n}$$
• Values larger than 2p/n signify outlying X.

proc reg data=dataname;
/* obtain studentized deleted residuals and hat matrix diagonal */
model y=x1 x2 / influence;
run;
Cases 15 and 3 appear to be
outlying, also 1 and 5.
[Figure: Body Fat data, scatter plot of Thigh Circumference against Triceps Skinfold Thickness, points labelled by case number. Leverage values are shown for cases 1 (.201), 3 (.372), 5 (.248) and 15 (.333). Guideline: 2p/n = 6/20 = .3]
• Cases 3 and 15 are outlying and are potentially influential on the fitted model.
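The leverage rule is easy to apply once the $h_{ii}$ values are available. A minimal sketch, using the twenty leverage values from the body fat Case Summaries table (variable names are illustrative only):

```python
# Leverage values h_ii for cases 1..20, copied from the Case Summaries table
h = [0.201, 0.059, 0.372, 0.111, 0.248, 0.129, 0.156, 0.096, 0.115, 0.110,
     0.120, 0.109, 0.178, 0.148, 0.333, 0.095, 0.106, 0.197, 0.067, 0.050]

n = len(h)   # 20 cases
p = 3        # intercept plus two predictors

threshold = 2 * p / n  # 2p/n guideline for outlying X observations

# Case numbers (1-based) whose leverage exceeds the guideline
flagged = [i + 1 for i, h_ii in enumerate(h) if h_ii > threshold]

print("sum of leverages:", round(sum(h), 3))  # should be close to p
print("threshold:", threshold)
print("flagged cases:", flagged)
```

The leverages sum to p, as the properties above require, and only cases 3 and 15 exceed 2p/n = .3. Cases 1 and 5 have moderately large leverage but fall below the guideline.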
Identifying Influential Cases - DFFITS, Cook's Distance, and DFBETAS Measures

Influence on Single Fitted Value - DFFITS (Standardized)
$$(DFFITS)_i = \frac{\hat{Y}_i - \hat{Y}_{i(i)}}{\sqrt{MSE_{(i)}\,h_{ii}}} = t_i \left( \frac{h_{ii}}{1 - h_{ii}} \right)^{1/2}$$
The denominator is the estimated standard deviation of $\hat{Y}_i$, with $\sigma^2$ estimated by $MSE_{(i)}$.
• Must exceed 1 in small data sets to be influential
• Must exceed $2\sqrt{p/n}$ in large data sets
Case 3 in Body Fat data is influential.

Influence on All Fitted Values - Cook's Distance (Standardized)
$$D_i = \frac{\sum_{j=1}^{n} \left( \hat{Y}_j - \hat{Y}_{j(i)} \right)^2}{p\,MSE} = \frac{e_i^2}{p\,MSE} \left[ \frac{h_{ii}}{(1 - h_{ii})^2} \right]$$
• Must exceed F(.5; p, n-p), i.e. the 50th percentile, to be influential
Case 3 in Body Fat data is at the 30.6th percentile (influential but not large enough).

Influence on the Regression Coefficients - DFBETAS (Standardized)
$$(DFBETAS)_{k(i)} = \frac{b_k - b_{k(i)}}{\sqrt{MSE_{(i)}\,c_{kk}}}, \qquad c_{kk} = k\text{th diagonal element of } (X'X)^{-1}$$
• Must exceed 1 in small data sets to be influential
• Must exceed $2/\sqrt{n}$ in large data sets
Case 3 in Body Fat data is influential, but not large enough to require remedial action.

Case Summaries

Case   Standardized   Cook's     Standardized   Standardized   Standardized
       DFFIT          Distance   DFBETA         DFBETA X1      DFBETA X2
                                 Intercept
  1     -.36615        .04595     -.30518        -.13149         .23203
  2      .38381        .04548      .17257         .11503        -.14261
  3    -1.27307        .49016     -.84710       -1.18252        1.06690
  4     -.47635        .07216     -.10161        -.29352         .19607
  5     -.00007        .00000     -.00006        -.00003         .00005
  6     -.05669        .00114      .03968         .04008        -.04427
  7      .12794        .00576     -.07753        -.01561         .05432
  8      .57452        .09794      .26143         .39113        -.33245
  9      .40216        .05313     -.15135        -.29466         .24691
 10     -.36387        .04396      .23775         .24460        -.26881
 11      .05055        .00090     -.00902         .01706        -.00248
 12      .32334        .03515     -.13049         .02246         .07000
 13     -.85078        .21215      .11941         .59242        -.38949
 14      .63551        .12489      .45174         .11317        -.29770
 15      .18885        .01258     -.00300        -.12476         .06877
 16      .08377        .00247      .00931         .04311        -.02512
 17     -.11837        .00493      .07951         .05504        -.07609
 18     -.16553        .00964      .13205         .07533        -.11610
 19     -.31507        .03236     -.12960        -.00407         .06443
 20      .09400        .00310      .01019         .00229        -.00331
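Both DFFITS and Cook's distance can be computed from quantities already produced by a single fit. The sketch below applies the two shortcut formulas to case 3 of the body fat data. The residual, leverage and studentized deleted residual come from the Case Summaries tables. MSE = 109.95/17 is an assumption, since SSE is not stated in these notes (109.95 is the value ALSM reports for this model). The function names are illustrative.

```python
import math


def dffits(t_i, h_ii):
    """DFFITS_i = t_i * sqrt(h_ii / (1 - h_ii))."""
    return t_i * math.sqrt(h_ii / (1.0 - h_ii))


def cooks_distance(e_i, h_ii, mse, p):
    """D_i = e_i**2 / (p * MSE) * h_ii / (1 - h_ii)**2."""
    return (e_i ** 2 / (p * mse)) * h_ii / (1.0 - h_ii) ** 2


n, p = 20, 3
mse = 109.95 / (n - p)  # ASSUMPTION: SSE = 109.95 is not given in these notes

# Case 3 of the body fat example: residual, leverage, studentized deleted residual
e3, h3, t3 = -3.176, 0.372, -1.656

dffits_3 = dffits(t3, h3)
cook_3 = cooks_distance(e3, h3, mse, p)

small_sample_flag = abs(dffits_3) > 1           # guideline for small data sets
large_sample_cutoff = 2 * math.sqrt(p / n)      # guideline for large data sets

print(round(dffits_3, 3), round(cook_3, 3), round(large_sample_cutoff, 3))
```

Up to rounding of the inputs, the results match the tabulated values for case 3 (DFFITS -1.27307, Cook's distance .49016). Judging Cook's distance still requires the percentile of the F(p, n-p) distribution, which the SAS `probf` call in these notes provides.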
proc reg data=dataname;
/* obtain studentized deleted residuals, hat matrix diagonal, DFFITS and DFBETAS */
model y=x1 x2 / influence;
/* save Cook's distance and DFFITS */
output out=result1 cookd=cookd dffits=dffits;
ods output outputstatistics=result2;
run;
/* print Cook's distance values */
proc print data=result1;
var cookd;
run;
/* percentile of the F(p, n-p) distribution at each Cook's distance */
data result1;
set result1;
/* set p and n for the data set at hand (body fat example: p=3, n=20) */
p=3; n=20;
percent1=100*probf(cookd,p,n-p);
run;
proc print data=result1;
var percent1;
run;
Multicollinearity Diagnostics - Variance Inflation Factors

Recall the variance-covariance matrix of the coefficients of the standardized model resulting from the correlation transformation:
$$\sigma^2\{b^*\} = (\sigma^*)^2\, r_{XX}^{-1}$$
$$\sigma^2\{b_k^*\} = (\sigma^*)^2\, (VIF)_k, \qquad (VIF)_k = \left( 1 - R_k^2 \right)^{-1}$$
where $R_k^2$ is the coefficient of determination obtained when $X_k$ is regressed on the other X's.
• VIF is called the Variance Inflation Factor.
• VIF must exceed 10 to indicate serious multicollinearity.
• A mean VIF considerably larger than 1 also indicates serious multicollinearity:
$$\overline{(VIF)} = \frac{\sum_{k=1}^{p-1} (VIF)_k}{p - 1}$$

proc reg data=dataname;
model y=x1 x2 / vif;
run;

Coefficients (Dependent Variable: Body Fat), three predictors

                              Unstandardized        Standardized
                              Coefficients          Coefficients                    Collinearity Statistics
Model 1                       B          Std. Error Beta           t        Sig.    Tolerance   VIF
(Constant)                    117.085    99.782                    1.173    .258
Triceps Skinfold Thickness      4.334     3.016      4.264         1.437    .170    .001        708.843
Thigh Circumference            -2.857     2.582     -2.929        -1.106    .285    .002        564.343
Midarm Circumference           -2.186     1.595     -1.561        -1.370    .190    .010        104.606

Coefficients (Dependent Variable: Body Fat), two predictors

                              Unstandardized        Standardized
                              Coefficients          Coefficients                    Collinearity Statistics
Model 1                       B          Std. Error Beta           t        Sig.    Tolerance   VIF
(Constant)                    -19.174     8.361                   -2.293    .035
Triceps Skinfold Thickness       .222      .303       .219          .733    .474    .147          6.825
Thigh Circumference              .659      .291       .676         2.265    .037    .147          6.825
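The VIF guidelines can be checked directly from the tabulated values. The sketch below uses the three VIFs reported for the three-predictor body fat model. It also inverts the definition VIF = 1/(1 - R²) to show how close to 1 each $R_k^2$ must be. The dictionary keys and function name are illustrative only.

```python
def vif_from_r2(r2_k):
    """VIF_k = 1 / (1 - R_k^2), where R_k^2 comes from regressing X_k on the other X's."""
    return 1.0 / (1.0 - r2_k)


# VIF values from the three-predictor body fat coefficients table
vifs = {"triceps": 708.843, "thigh": 564.343, "midarm": 104.606}

mean_vif = sum(vifs.values()) / len(vifs)

# Guideline from the notes: a VIF above 10 indicates serious multicollinearity
serious = [name for name, v in vifs.items() if v > 10]

# R_k^2 implied by each VIF: R_k^2 = 1 - 1/VIF_k
implied_r2 = {name: 1.0 - 1.0 / v for name, v in vifs.items()}

print("mean VIF:", round(mean_vif, 3))
print("serious:", serious)
for name, r2 in implied_r2.items():
    print(name, round(r2, 4))
```

All three predictors exceed the guideline of 10 and the mean VIF is far above 1, so both criteria point to serious multicollinearity in the three-predictor model. By contrast, each VIF in the two-predictor model is 6.825.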
Surgical Unit Example - Continued
$$\ln \hat{Y}_i = 3.8524 + 0.0733 X_{i1} + 0.0142 X_{i2} + 0.0155 X_{i3} + 0.353 X_{i8}$$
Fit summary: N = 54, R-square = 0.8299, adjusted R-square = 0.8160, RMSE = 0.2109.

[Figure: residuals plotted against predicted values.]
Residual plot against predicted values: no evidence of serious departures from the model.

[Figure: residuals plotted against X5.]
Residual plot against X5: no need to include X5.

[Figure: added-variable plot for X5.]
Added-variable plot for X5: the marginal relationship between X5 and log Y is weak (additional support for dropping X5).

[Figure: normal probability plot of the residuals.]
The normal probability plot of the residuals shows little departure from linearity. Conclusion?
Tests for Normality

Test                  Statistic            p Value
Shapiro-Wilk          W      0.968175      Pr < W      0.1600
Kolmogorov-Smirnov    D      0.110905      Pr > D      0.0949
Cramer-von Mises      W-Sq   0.097053      Pr > W-Sq   0.1233
Anderson-Darling      A-Sq   0.635756      Pr > A-Sq   0.0942
Variable   (VIF)k
X1         1.10
X2         1.02
X3         1.05
X8         1.09

Multicollinearity among the four predictor variables is not a problem.
1. Case 17 was identified as outlying with regard to its Y value. Formal test: t(1 - 0.05/(2×54); 54 - 5 - 1) = t(0.99954; 48) = 3.528. Since |t17| = 3.3696 ≤ 3.528, the formal outlier test indicates that case 17 is not an outlier. Still, t17 is very close to the critical value, so we may wish to investigate the influence of case 17.
2. Cases 23, 28, 32, 38, 42 and 52 were identified as outlying with regard to their X values, since their leverage values exceed the critical value 2p/n = 2×5/54 = 0.185.
3. To determine the influence of cases 17, 23, 28, 32, 38, 42 and 52, we consider their Cook's distance and DFFITS values. Case 17 is the most influential, with Cook's distance D17 = 0.3306 and (DFFITS)17 = 1.4151. The value 0.3306 corresponds to the 11th percentile of the F(5, 49) distribution.
4. In summary, the diagnostic analyses identified a number of potential problems, but none of these was considered serious enough to require further remedial action.
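For reference, the cutoffs used in the surgical unit analysis follow from n = 54 and p = 5 by simple arithmetic. This is a worked sketch only. The DFFITS large-sample guideline $2\sqrt{p/n}$ is taken from the DFFITS section of these notes, not from the surgical unit discussion itself.

$$
\begin{aligned}
1 - \frac{\alpha}{2n} &= 1 - \frac{0.05}{108} \approx 0.99954, \qquad n - p - 1 = 54 - 5 - 1 = 48 \\
\frac{2p}{n} &= \frac{10}{54} \approx 0.185 \\
2\sqrt{\frac{p}{n}} &= 2\sqrt{\frac{5}{54}} \approx 0.609 \\
n - p &= 49 \quad \text{(denominator degrees of freedom for the Cook's distance percentile)}
\end{aligned}
$$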