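Chapter 8 Model Checking

8.1 Introduction

The process of statistical analysis might take the form

[Flow diagram with boxes: Data, Select Model Class, Some Models, Summarize, Conclusions, Stop]

In the above process, however, even after a careful selection of the model class, the data themselves may indicate that the particular model is unsuitable. It is therefore reasonable to add model checking to the original process. The new process of statistical analysis is

[Flow diagram with boxes: Data, Select Model Class, Some Models, Summarize, Model Checking, Conclusions, Stop]

The inadequacy indicated by model checking can take two forms. It may be that the data as a whole show some systematic departure from the fitted values, or it may be that a few data values are discrepant from the rest. The detection of both systematic and isolated discrepancies is part of the technique of model checking.

8.2 Basic Quantities in Model Checking

In the linear model, model checking mainly uses the following statistics from the fit:

- The fitted values: $\hat{\mu}_{n \times 1} = X_{n \times p}\,\hat{\beta}_{p \times 1}$.
- The mean residual sum of squares: $s^2 = \dfrac{(Y - X\hat{\beta})^t (Y - X\hat{\beta})}{n - p}$.
- The residuals: $e = Y - \hat{\mu}$.
- The hat matrix: $P = X (X^t X)^{-1} X^t$.

In the generalized linear model, the statistics used in model checking are:

- The fitted values:
$$\hat{\mu}_{n \times 1} = \begin{pmatrix} \hat{\mu}_1 \\ \hat{\mu}_2 \\ \vdots \\ \hat{\mu}_n \end{pmatrix} = \begin{pmatrix} g^{-1}(\hat{\eta}_1) \\ g^{-1}(\hat{\eta}_2) \\ \vdots \\ g^{-1}(\hat{\eta}_n) \end{pmatrix} = \begin{pmatrix} g^{-1}(x_1 \hat{\beta}) \\ g^{-1}(x_2 \hat{\beta}) \\ \vdots \\ g^{-1}(x_n \hat{\beta}) \end{pmatrix}.$$
- The variance estimate: $V_i = V(\hat{\mu}_i) = b''(\hat{\theta}_i)$, where $\hat{\mu}_i = b'(\hat{\theta}_i)$, $i = 1, \ldots, n$.
- The hat matrix:
$$H = W^{1/2} X (X^t W X)^{-1} X^t W^{1/2}, \qquad W = \mathrm{diag}(w_1, w_2, \ldots, w_n), \qquad w_i = \left( \frac{\partial \mu_i}{\partial \eta_i} \right)^2 \Big/ V_i,$$
with the derivative evaluated at the fit.
- The residuals: the standardized Pearson residuals
$$r'_{P,i} = \frac{y_i - \hat{\mu}_i}{\sqrt{\hat{\phi}\, V_i\, (1 - h_{ii})}} = \frac{r_{P,i}}{\sqrt{\hat{\phi}\,(1 - h_{ii})}},$$
and the standardized deviance residuals
$$r'_{D,i} = \frac{r_{D,i}}{\sqrt{\hat{\phi}\,(1 - h_{ii})}},$$
where
$$r_{P,i} = \frac{y_i - \hat{\mu}_i}{\sqrt{V_i}}, \qquad r_{D,i} = \mathrm{sign}(y_i - \hat{\mu}_i) \sqrt{2 \left[ y_i (\tilde{\theta}_i - \hat{\theta}_i) - b(\tilde{\theta}_i) + b(\hat{\theta}_i) \right]},$$
$h_{ii}$ is the $i$'th diagonal element of $H$, and $\tilde{\theta}_i = (b')^{-1}(y_i)$.

8.3 Checks for Systematic Departure from Model

(a) Informal check using residuals

Standardized deviance residuals $r'_{D,i}$ are recommended, plotted either against $\hat{\eta}_i$ or against the fitted value $\hat{\mu}_i$ transformed to the constant-information scale of the error distribution.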
Thus, we use

- $\hat{\mu}$: Normal errors
- $2\sqrt{\hat{\mu}}$: Poisson errors
- $2 \sin^{-1} \sqrt{\hat{\mu}}$: Binomial errors
- $2 \log \hat{\mu}$: Gamma errors
- $-2 \hat{\mu}^{-1/2}$: Inverse Gaussian errors

The null pattern of this plot is a distribution of residuals for varying $\hat{\mu}$ with mean 0 and constant range. The plot is given in (i). Typical deviations are

- the appearance of curvature in the mean;
- a systematic change of range with fitted value.

The plots corresponding to the above deviations are given in (ii) and (iii).

[Figure: residual plots, panels (i), (ii) and (iii); vertical axis: Residuals]

Curvature may arise from the following causes:

- wrong choice of link function;
- wrong choice of scale of one or more covariates;
- omission of a quadratic term in a covariate.

Note: The standardized residuals can also be plotted against an explanatory variable.

(b) Checking the variance function

A plot of the absolute residuals against fitted values gives an informal check on the adequacy of the assumed variance function. The null pattern shows no trend, but an ill-chosen variance function will result in a trend in the mean. A positive trend indicates that the current variance function is increasing too slowly with the mean. For example, an original choice of $V(\mu) \propto \mu$ may need to be replaced by $V(\mu) \propto \mu^2$.

(c) Checking the link function

A plot of the adjusted dependent variable
$$z_i = \hat{\eta}_i + (y_i - \hat{\mu}_i) \left. \frac{\partial \eta_i}{\partial \mu_i} \right|_{\hat{\mu}_i}$$
against $\hat{\eta}_i$ gives an informal check on the link function. The null pattern is a straight line. For link functions of the power family, an upwards curvature in the plot points to a link with higher power than that used, and downwards curvature to a lower power.

(d) Checking the scale of covariates

The plot of the partial residuals
$$e_i = z_i - \hat{\eta}_i + \hat{\beta}_j x_{ij}, \qquad j = 1, \ldots, p,$$
against $x_{ij}$ provides an informal check on the scale of covariates. The null pattern should be a straight line.

8.4 Checks for Isolated Departures from the Model

(a) Measure of leverage

In standard regression, with $P = X (X^t X)^{-1} X^t = [p_{ij}]_{n \times n}$, the diagonal element $p_{ii}$ can be used as a measure of leverage (i.e., the effect of the change of the covariate value on the fitted value).
If $p_{ii} > \dfrac{2p}{n}$, the $i$'th observation can be regarded as a high-leverage point. Similarly, in the generalized linear model, $h_{ii}$, the $i$'th diagonal element of $H$, can be used as a measure of leverage.

Note: Usually, in the standard linear model, a point at the extreme of the x-range will have high leverage. However, for the generalized linear model, a point at the extreme of the x-range will not necessarily have high leverage if its weight is very small.

(b) Measure of influence

In the standard linear model, Cook's distance
$$C_i = \frac{(\hat{\beta} - \hat{\beta}_{(i)})^t X^t X (\hat{\beta} - \hat{\beta}_{(i)})}{p s^2} = \frac{(\hat{\beta} - \hat{\beta}_{(i)})^t \left[ \widehat{\mathrm{Var}}(\hat{\beta}) \right]^{-1} (\hat{\beta} - \hat{\beta}_{(i)})}{p} = \frac{(\hat{Y} - \hat{Y}_{(i)})^t (\hat{Y} - \hat{Y}_{(i)})}{p s^2} = \frac{\| \hat{Y} - \hat{Y}_{(i)} \|^2}{p s^2}, \qquad i = 1, \ldots, n,$$
can be used to assess the influence of the $i$'th observation, where $\hat{\beta}_{(i)}$ is the parameter estimate without the contribution from the $i$'th observation and $\hat{Y}_{(i)} = X \hat{\beta}_{(i)}$. Similarly, in the generalized linear model, the modified Cook's distance
$$D_i = \frac{(\hat{\beta} - \hat{\beta}_{(i)})^t X^t W X (\hat{\beta} - \hat{\beta}_{(i)})}{p s^2}$$
can be used to assess the influence of the $i$'th observation.
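The leverage and influence measures of Section 8.4 can be illustrated numerically. The following is a minimal sketch in Python with NumPy, not part of the original notes: the data, the helper names `hat_diag` and `cooks_distance`, and the assumed linear predictor `eta_assumed` are all hypothetical. It computes $p_{ii}$ for a linear model, applies the $2p/n$ rule, obtains Cook's distance by refitting without each observation, and then computes $h_{ii}$ for a Poisson model with log link, where $w_i = \mu_i$, using weights taken from an assumed (not fitted) linear predictor.

```python
import numpy as np

def hat_diag(X, w=None):
    """Diagonal of P = X (X^t X)^{-1} X^t, or of
    H = W^{1/2} X (X^t W X)^{-1} X^t W^{1/2} when weights w are given."""
    if w is not None:
        X = X * np.sqrt(w)[:, None]          # rows of W^{1/2} X
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.einsum("ij,jk,ik->i", X, XtX_inv, X)

def cooks_distance(X, y):
    """Cook's distance C_i = ||Y_hat - Y_hat_(i)||^2 / (p s^2),
    computed by explicitly refitting without observation i."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)             # mean residual sum of squares
    C = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        d = X @ (beta - beta_i)              # Y_hat - Y_hat_(i)
        C[i] = d @ d / (p * s2)
    return C, resid, s2

# Hypothetical data: nine points plus one at the extreme of the x-range.
x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8, 9, 30])
y = np.array([2.6, 3.1, 3.4, 4.2, 4.4, 5.1, 5.4, 6.1, 6.4, 17.5])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

# Linear model: leverage and the 2p/n rule.
p_ii = hat_diag(X)
flagged = np.flatnonzero(p_ii > 2 * p / n)
print("high-leverage indices (linear model):", flagged)

# Linear model: Cook's distance for each observation.
C, resid, s2 = cooks_distance(X, y)
print("Cook's distances:", np.round(C, 3))

# Poisson/log-link weights w_i = mu_i at an ASSUMED linear predictor
# (illustrative only; in practice W is evaluated at the fitted beta).
eta_assumed = 2.0 - 0.3 * x
w = np.exp(eta_assumed)
h_ii = hat_diag(X, w)
print("leverage of the extreme point: linear", round(p_ii[-1], 3),
      "weighted", round(h_ii[-1], 4))
```

In this sketch the extreme point exceeds the $2p/n$ threshold in the unweighted case, while its weighted leverage stays below it because its weight is very small, which is the situation described in the note to part (a). Refitting $n$ times is used only for transparency; for the linear model the same $C_i$ can be obtained from the residuals and $p_{ii}$ without refitting.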