1. Model diagnostics: The following plots and quantities can be used to assess the fit or to examine the assumptions made in the model fitting process.

(a) Residuals versus fitted values, square root of absolute residuals versus fitted values, and normal quantile plot of residuals:

Let $e_i = Y_i - \hat{Y}_i = Y_i - x_i b$ be the residual for the i'th observation, $i = 1, 2, \ldots, n$, and let $\hat{Y}_i = x_i b$ be the fitted value. Since $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$ and $e_i$ is the "estimate" of $\epsilon_i$, the behavior of the $e_i$ should reflect the behavior of the $\epsilon_i$'s. The plot of the $e_i$ helps us identify outliers and visualize any structure in the residuals. If the residuals scatter randomly with no visible pattern, there is no evidence against the assumptions of independence and equal variance. The reason to plot $e_i$ versus $\hat{Y}_i$, not versus $Y_i$, is that $\mathrm{Cov}(e_i, \hat{Y}_i) = 0$ but $\mathrm{Cov}(e_i, Y_i) \neq 0$.

Example:
>plot(ozonelm3)                # page 1: e_i versus Yhat_i
                               # page 2: sqrt(|e_i|) versus Yhat_i
>resid<-ozonelm3$residuals     # e_i
>par(mfrow=c(1,2))
>plot(yhat4,resid)             # e_i versus Yhat_i
>plot(ozonelm3,ask=T)
Selection: 2
Selection: 0

The normal quantile plot of the residuals provides a visual check of the assumption that the $\epsilon_i$'s are normally distributed. If the points of the quantile-quantile plot lie close to a straight line, the plot is consistent with normal errors.

>plot(ozonelm3,ask=T)
Selection: 5

(b) $Y_i$ versus $\hat{Y}_i$: This plot shows how well the model has captured the broad outlines of the data.

>plot(ozonelm3,ask=T)
Selection: 4

(c) Outliers and influential observations:

Outliers: Two diagnostics are commonly used for identifying outliers.

(i) Internally studentized residuals:
$$t_i = \frac{e_i}{\mathrm{s.e.}(e_i)} = \frac{e_i}{s\sqrt{1 - p_{ii}}},$$
where $s^2 = \sum_{i=1}^{n} e_i^2 / (n - p)$ is the residual mean square and $p_{ii}$ is the i'th diagonal element of $P = X (X^t X)^{-1} X^t$. Note: $\mathrm{Var}(e_i) = (1 - p_{ii})\sigma^2$.

(ii) Externally studentized residuals:
$$t_i^* = \frac{e_i}{s_{(i)}\sqrt{1 - p_{ii}}} = t_i \left( \frac{n - p - 1}{n - p - t_i^2} \right)^{1/2},$$
where $s_{(i)}^2$ is the residual mean square from fitting the linear regression with the i'th observation deleted.
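The link between the two studentized residuals can be checked in a few lines. The sketch below assumes the usual model $Y = X\beta + \epsilon$ with $\mathrm{Var}(\epsilon) = \sigma^2 I$ and $X$ of full rank $p$, writes $t_i^*$ for the externally studentized residual (notation assumed here), and uses the standard deletion identity for the residual sum of squares, which is quoted rather than derived:

```latex
\begin{align*}
e &= (I - P)Y, \qquad \mathrm{Var}(e) = \sigma^2 (I - P)
  \;\Rightarrow\; \mathrm{Var}(e_i) = (1 - p_{ii})\sigma^2, \\
(n - p - 1)\, s_{(i)}^2 &= (n - p)\, s^2 - \frac{e_i^2}{1 - p_{ii}}
  = s^2 \left( n - p - t_i^2 \right), \\
t_i^* &= \frac{e_i}{s_{(i)} \sqrt{1 - p_{ii}}}
  = t_i \, \frac{s}{s_{(i)}}
  = t_i \left( \frac{n - p - 1}{n - p - t_i^2} \right)^{1/2}.
\end{align*}
```

The first line uses the fact that $I - P$ is symmetric and idempotent. The second line is why the externally studentized residuals can be computed from a single fit, without refitting the model $n$ times.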
Example:
>help(lm.influence)
>lminflu<-lm.influence(ozonelm3)
>hat<-lminflu$hat                             # p_ii
>resid<-ozonelm3$residuals                    # e_i
>yhat4<-ozonelm3$fitted.values                # Yhat
>s2<-sum((ozone-yhat4)^2)/107                 # s^2, with n - p = 111 - 4 = 107
>itresid<-resid/sqrt(s2*(1-hat))              # t_i = e_i / sqrt(s^2 (1 - p_ii))
>itresid
>plot(itresid)
>etresid<-itresid/sqrt((107-itresid^2)/106)   # t_i* = t_i [(n-p-t_i^2)/(n-p-1)]^(-1/2)
>etresid
>plot(etresid)
>si<-lminflu$sigma                            # s_(i)
>etresid2<-resid/(si*sqrt(1-hat))             # t_i* = e_i / (s_(i) sqrt(1 - p_ii))
>etresid-etresid2                             # the two versions should agree

Influential observations: The commonly used diagnostic for identifying influential observations is Cook's distance,
$$C_i = \frac{(b - b_{(i)})^t X^t X (b - b_{(i)})}{p s^2} = \frac{\| Xb - Xb_{(i)} \|^2}{p s^2} = \frac{(\hat{Y} - \hat{Y}_{(i)})^t (\hat{Y} - \hat{Y}_{(i)})}{p s^2} = \frac{\| \hat{Y} - \hat{Y}_{(i)} \|^2}{p s^2},$$
where $b_{(i)}$ is the least squares estimate with observation i deleted and $\hat{Y}_{(i)}$ is the vector of fitted values without the contribution from observation i.

Example:
>bi<-lminflu$coefficients     # n x p matrix with rows b_(1)^t, b_(2)^t, ..., b_(n)^t
>bi[5,]
>dellm5<-lm(ozone[-5]~radi[-5]+temper[-5]+wind[-5])
>dellm5$coefficients          # should match bi[5,]
>x<-cbind(1,air[,2:4])        # X
>b<-ozonelm3$coefficients     # b = (b0, b1, b2, b3)^t
>cook<-rep(0,111)
>for (j in 1:111){
   cook[j]<-sum(((b-bi[j,])%*%t(x))^2)/(4*s2)   # ||Xb - Xb_(j)||^2 / (p s^2)
 }
>par(mfrow=c(1,3))
>plot(cook)
>plot(ozonelm3,ask=T)
Selection: 7

Note: a simpler formula for Cook's distance, which avoids the loop, is
$$C_i = \frac{t_i^2}{p} \cdot \frac{p_{ii}}{1 - p_{ii}}.$$

Example (continued):
>cook2<-(itresid^2/4)*(hat/(1-hat))
>plot(cook2)
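The closed-form expressions above can be cross-checked against built-in diagnostics. The following is a minimal sketch in R (not S-PLUS), assuming the built-in `airquality` data as a stand-in for the ozone data of these notes. The names `air2`, `fit`, `t_int`, `t_ext`, `cook` and `max_diff` are illustrative and do not come from the notes. The loop-free Cook's distance rests on the standard identity $b - b_{(i)} = (X^t X)^{-1} x_i^t e_i / (1 - p_{ii})$.

```r
# Sketch under stated assumptions: 'airquality' replaces the notes' data set,
# and all object names below are hypothetical.
air2 <- na.omit(airquality[, c("Ozone", "Solar.R", "Temp", "Wind")])
fit  <- lm(Ozone ~ Solar.R + Temp + Wind, data = air2)

n  <- nrow(air2)             # number of complete observations
p  <- length(coef(fit))      # number of coefficients, intercept included
e  <- residuals(fit)         # e_i
h  <- hatvalues(fit)         # p_ii, diagonal of the projection matrix P
s2 <- sum(e^2) / (n - p)     # s^2, residual mean square

# Internally studentized residuals: t_i = e_i / (s sqrt(1 - p_ii))
t_int <- e / sqrt(s2 * (1 - h))

# Externally studentized residuals from t_i alone, no refitting
t_ext <- t_int * sqrt((n - p - 1) / (n - p - t_int^2))

# Cook's distance via the loop-free formula
cook <- (t_int^2 / p) * (h / (1 - h))

# Largest absolute discrepancy from R's own diagnostics
max_diff <- c(
  internal = max(abs(t_int - rstandard(fit))),
  external = max(abs(t_ext - rstudent(fit))),
  cook     = max(abs(cook - cooks.distance(fit)))
)
print(max_diff)
```

All three discrepancies should be at the level of rounding error. One caution when porting the session above to R: as far as the R documentation describes it, `lm.influence()$coefficients` holds the change in the coefficients when a case is dropped rather than the deleted-case estimates themselves, so the loop version of Cook's distance would need adjusting there.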