ch3.3

1. Model diagnostics:
The following plots or quantities can be used for assessing the fit or examining the
assumptions made in the model fitting process.
(a) Residuals versus fitted values, square root of absolute residuals versus fitted values,
and normal quantile plot of residuals:
Let $e_i = Y_i - \hat{Y}_i = Y_i - x_i b$ be the residual for the i'th observation, $i = 1, 2, \ldots, n$, and
let $\hat{Y}_i = x_i b$ be the fitted value. Since $\varepsilon_i \overset{i.i.d.}{\sim} N(0, \sigma^2)$ and $e_i$ is the "estimate"
of $\varepsilon_i$, the behavior of the $e_i$'s could reflect the possible behavior of the $\varepsilon_i$'s. The plot of $e_i$
can help us identify outliers and visualize any structure in the residuals. If the
residuals spread randomly, this suggests that the assumptions of independence and
equal variance are not violated. The reason to plot $e_i$ versus $\hat{Y}_i$, not versus $Y_i$,
is that $\mathrm{Cov}(e_i, \hat{Y}_i) = 0$ but $\mathrm{Cov}(e_i, Y_i) \neq 0$.
Example:
>plot(ozonelm3)
# page 1: e_i versus Yhat_i
# page 2: sqrt(|e_i|) versus Yhat_i
>resid<-ozonelm3$residuals           # e_i
>par(mfrow=c(1,2))
>plot(yhat4,resid)                   # e_i versus Yhat_i
>plot(ozonelm3,ask=T)
Selection: 2
Selection: 0
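The reason given above for plotting the residuals against the fitted values can be checked numerically. The following is a minimal sketch in R on simulated data; the variables `x1`, `x2`, `y` and the model `fit` are made up for illustration and are not the ozone data, and the model is assumed to contain an intercept.

```r
# Simulated data (hypothetical, not the ozone data used in the notes)
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- runif(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit  <- lm(y ~ x1 + x2)
e    <- residuals(fit)   # e_i
yhat <- fitted(fit)      # Yhat_i

# Sample covariance of the residuals with the fitted values and with the
# observed responses
cov_e_yhat <- cov(e, yhat)
cov_e_y    <- cov(e, y)

# The two residual-versus-fitted displays described in (a)
par(mfrow = c(1, 2))
plot(yhat, e, xlab = "fitted value", ylab = "residual")
abline(h = 0, lty = 2)
plot(yhat, sqrt(abs(e)), xlab = "fitted value", ylab = "sqrt(|residual|)")
```

Since $Y_i = \hat{Y}_i + e_i$, the sample covariance of $e_i$ with $Y_i$ equals the sample variance of the residuals, which is why a plot of $e_i$ against $Y_i$ shows a trend even when the model fits well.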
The normal quantile plot of the residuals provides a visual check of the assumption that the $\varepsilon_i$'s are
normally distributed. If the points in the quantile-quantile plot fall close to a straight line, then we might have
evidence that the errors are indeed normal.
>plot(ozonelm3,ask=T)
Selection: 5.
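For readers working in R rather than with the menu shown above, a normal quantile plot can be drawn with `qqnorm` and `qqline`. This is only a sketch on simulated data (the names `x`, `y`, `fit` are hypothetical); the correlation computed at the end is an informal numerical companion to the plot, not a formal test.

```r
# Simulated data with truly normal errors (hypothetical example)
set.seed(2)
n   <- 100
x   <- rnorm(n)
y   <- 3 + 1.5 * x + rnorm(n)
fit <- lm(y ~ x)
e   <- residuals(fit)

# qqnorm() draws the plot and returns the plotted coordinates:
# qq$x = theoretical normal quantiles, qq$y = sorted-by-position residuals
qq <- qqnorm(e, main = "Normal Q-Q plot of residuals")
qqline(e)   # reference line through the first and third quartiles

# Correlation between theoretical and sample quantiles; values near 1
# correspond to points lying close to a straight line
qq_cor <- cor(qq$x, qq$y)
```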
(b) $Y_i$ versus $\hat{Y}_i$:
This plot provides evidence of how well the model has captured the broad
outlines of the data.
>plot(ozonelm3,ask=T)
Selection: 4.
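A plot of the response against the fitted values can also be summarized by a single number. The sketch below, on simulated data with hypothetical names, draws the plot in R and checks that the squared correlation between $Y_i$ and $\hat{Y}_i$ agrees with the $R^2$ reported by `summary`, assuming a least squares fit with an intercept.

```r
# Simulated data (hypothetical example)
set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + x1 + 0.5 * x2 + rnorm(n)

fit  <- lm(y ~ x1 + x2)
yhat <- fitted(fit)

plot(yhat, y, xlab = "fitted value", ylab = "observed response")
abline(0, 1, lty = 2)   # points on this line are fitted exactly

# Squared correlation between Y and Yhat versus the R^2 of the fit
r2_from_plot <- cor(y, yhat)^2
r2_from_fit  <- summary(fit)$r.squared
```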
(c) Outliers and influential observations:

Outliers:
Two diagnostics are commonly used for identifying the outliers. They are
(i) Internally studentized residuals:
$$t_i = \frac{e_i}{\mathrm{s.e.}(e_i)} = \frac{e_i}{s\sqrt{1 - p_{ii}}},$$
where $p_{ii}$ is the i'th diagonal element of $P = X(X^t X)^{-1} X^t$.
Note: $\mathrm{Var}(e_i) = (1 - p_{ii})\sigma^2$.
(ii) Externally studentized residuals:
$$t_i^* = \frac{e_i}{s_{(i)}\sqrt{1 - p_{ii}}} = \frac{t_i}{\left(\dfrac{n - p - t_i^2}{n - p - 1}\right)^{1/2}},$$
where $s_{(i)}^2$ is the residual mean square obtained by fitting the linear regression with the i'th
observation deleted.
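The variance stated in the note under (i) follows from writing the residual vector as a projection of $Y$. The short derivation below is a sketch under the usual assumptions $Y = X\beta + \varepsilon$ with $\mathrm{Var}(\varepsilon) = \sigma^2 I$; the symbols $e$ (the vector of residuals), $I$ (the identity matrix) and $\beta$ are introduced here only for this derivation.

```latex
\begin{align*}
e &= Y - \hat{Y} = (I - P)Y, \\
\mathrm{Var}(e) &= (I - P)\,\sigma^2 I\,(I - P)^t = \sigma^2 (I - P), \\
\mathrm{Var}(e_i) &= \sigma^2 (1 - p_{ii}).
\end{align*}
```

The second line uses the fact that $I - P$ is symmetric and idempotent. Replacing $\sigma^2$ by $s^2$ gives the standard error $s\sqrt{1 - p_{ii}}$ that appears in the denominator of the internally studentized residual.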
Example:
>help(lm.influence)
>lminflu<-lm.influence(ozonelm3)
>hat<-lminflu$hat                              # p_ii
>resid<-ozonelm3$residuals                     # e_i
>yhat4<-ozonelm3$fitted.values                 # Yhat
>s2<-sum((ozone-yhat4)^2)/107                  # s^2, with n - p = 111 - 4 = 107
>itresid<-resid/sqrt(s2*(1-hat))               # t_i = e_i/sqrt(s^2 (1 - p_ii))
>itresid
>plot(itresid)
>etresid<-itresid/sqrt((107-itresid^2)/106)    # t_i* = t_i/[(n-p-t_i^2)/(n-p-1)]^(1/2)
>etresid
>plot(etresid)
>si<-lminflu$sigma                             # s_(i)
>etresid2<-resid/(si*sqrt(1-hat))              # t_i* = e_i/(s_(i) sqrt(1 - p_ii))
>etresid-etresid2                              # the two versions of t_i* should agree
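In R, the same two diagnostics are available as `rstandard` and `rstudent` in the stats package. The sketch below repeats the hand computation on simulated data (all variable names are hypothetical, and the data are not the ozone data) and compares it with those built-in functions.

```r
# Simulated data (hypothetical example)
set.seed(4)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + x1 - 2 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

e   <- residuals(fit)       # e_i
pii <- hatvalues(fit)       # p_ii, diagonal of the hat matrix
p   <- length(coef(fit))    # number of regression coefficients
s2  <- sum(e^2) / (n - p)   # s^2

# Internally and externally studentized residuals from the formulas
t_int <- e / sqrt(s2 * (1 - pii))
t_ext <- t_int / sqrt((n - p - t_int^2) / (n - p - 1))

# Externally studentized residuals from the deleted-case estimates s_(i)
s_i    <- lm.influence(fit)$sigma
t_ext2 <- e / (s_i * sqrt(1 - pii))

# Largest absolute differences from the built-in versions
max_diff_int <- max(abs(t_int - rstandard(fit)))
max_diff_ext <- max(abs(t_ext - rstudent(fit)))
max_diff_two <- max(abs(t_ext - t_ext2))
```

All three differences should be zero up to rounding error, which confirms that the two expressions for the externally studentized residual describe the same quantity.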

Influential Observations:
A commonly used diagnostic for identifying influential observations is
Cook's distance,
$$C_i = \frac{(b - b_{(i)})^t X^t X (b - b_{(i)})}{p s^2}
= \frac{\| Xb - Xb_{(i)} \|^2}{p s^2}
= \frac{(\hat{Y} - \hat{Y}_{(i)})^t (\hat{Y} - \hat{Y}_{(i)})}{p s^2}
= \frac{\| \hat{Y} - \hat{Y}_{(i)} \|^2}{p s^2},$$
where $b_{(i)}$ is the least squares estimate with observation i deleted and $\hat{Y}_{(i)}$ is the
vector of fitted values computed without the contribution of observation i.
Example:
>bi<-lminflu$coefficients       # n x p matrix whose rows are b_(1)^t, b_(2)^t, ..., b_(n)^t
>bi[5,]
>dellm5<-lm(ozone[-5]~radi[-5]+temper[-5]+wind[-5])
>dellm5$coefficients
>x<-cbind(1,air[,2:4])          # X
>b<-ozonelm3$coefficients       # b = (b0, b1, b2, b3)^t
>cook<-rep(0,111)
>for (j in 1:111){
cook[j]<-sum(((b-bi[j,])%*%t(x))^2)/(4*s2)
}
>par(mfrow=c(1,3))
>plot(cook)
>plot(ozonelm3,ask=T)
Selection: 7
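The definition of Cook's distance can also be checked by refitting the model with each observation left out in turn. The sketch below does this in R on simulated data (the data frame `dat` and all other names are hypothetical) and compares the result with R's built-in `cooks.distance`. It refits the model explicitly instead of reading $b_{(i)}$ from `lm.influence`, so it does not depend on how that function reports the deleted-case coefficients.

```r
# Simulated data (hypothetical example)
set.seed(5)
n   <- 40
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + dat$x1 + 0.5 * dat$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = dat)
X   <- model.matrix(fit)                  # design matrix X, including the intercept
b   <- coef(fit)                          # b
p   <- length(b)
s2  <- sum(residuals(fit)^2) / (n - p)    # s^2

cook_loop <- numeric(n)
for (i in seq_len(n)) {
  b_i <- coef(lm(y ~ x1 + x2, data = dat[-i, ]))   # b_(i): refit without case i
  d   <- X %*% (b - b_i)                           # Yhat - Yhat_(i), all n cases
  cook_loop[i] <- sum(d^2) / (p * s2)              # ||Yhat - Yhat_(i)||^2 / (p s^2)
}

# Largest absolute difference from the built-in computation
max_diff <- max(abs(cook_loop - cooks.distance(fit)))
```

Refitting n times is wasteful for large data sets, but it follows the definition literally and so is a useful check on faster formulas.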
Note: a simpler formula for Cook's distance without using the loop is
$$C_i = \frac{t_i^2}{p} \cdot \frac{p_{ii}}{1 - p_{ii}}.$$
Example (continued):
>cook2<-(itresid^2/4)*(hat/(1-hat))
>plot(cook2)
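The loop-free formula can be verified against R's built-in `cooks.distance`. This is a minimal sketch on simulated data with hypothetical names; it takes the internally studentized residuals $t_i$ from `rstandard` and the $p_{ii}$ from `hatvalues`.

```r
# Simulated data (hypothetical example)
set.seed(6)
n   <- 40
x   <- rnorm(n)
y   <- 2 - x + rnorm(n)
fit <- lm(y ~ x)

pii   <- hatvalues(fit)      # p_ii
p     <- length(coef(fit))   # number of regression coefficients
t_int <- rstandard(fit)      # internally studentized residuals t_i

# C_i = (t_i^2 / p) * (p_ii / (1 - p_ii)), computed for all cases at once
cook_closed <- (t_int^2 / p) * (pii / (1 - pii))

# Largest absolute difference from the built-in computation
max_diff <- max(abs(cook_closed - cooks.distance(fit)))

# Index of the case with the largest Cook's distance
most_influential <- which.max(cook_closed)
```

The formula shows that a case is influential when it combines a large studentized residual with a large $p_{ii}$; either one alone need not produce a large $C_i$.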