BUILDING THE REGRESSION MODEL II: DIAGNOSTICS A plot of residuals vs. a predictor variable (in the regression model/not yet in the regression model) can be used to check whether a curvature effect exists or an extra variable should be added to the current model. However, it may not properly show the marginal effect of a predictor variable, given the other predictor variables in the model. Partial Regression Residual Plots Definitions: Partial Regression Residuals Plots: Y , X1 , X 2 Yi X 2 b0 b2 X i 2 ei Y | X 2 Yi Yi X 2 X i1 X 2 b0* b1* X i 2 ei X1| X 2 X i1 X i1 X 2 Residuals ei(Y|X2) and ei(X1|X2) reflect the part of Y and X1, respectively, that is not linearly associated with X2. ei Y| X 2 vsei X1| X 2 Reveals hidden marginal relation (linear or curvilinear) between Y and X1 Reveals strength of this relationship (Fig 10.2 of ALSM) Helps to uncover outlying points that may have a strong influence on Least Squares estimates 1 Example 1: Y = Amount of Life Insurance Carried Y 205.72 6.2880 X1 4.738 X 2 Y X 50.70 15.54 X 2 2 X 1 X 2 40.779 1718 . X2 X1 = Average Annual Income X2 = Risk Aversion Score a linear relation for X1 is not appropriate in the model already containing X2 (1) The curvilinear relation (slight concave upward shape) is strongly positive. But the deviations from linearity appear to be modest (2) The scatter of the points around the least squares line through the origin with slop b1=6.2880 is much smaller than is the scatter around the horizontal line e(Y|X2)=0, indicating that adding X1 to the regression model with a linear relation will substantially reduce the error sum of squares. (3) Incorporating a curvilinear effect for X1 will lead to only a modest further reduction in the error sum of squares. 2 Example 2: Ŷ 19.174 0.2224 X1 0.6594 X 2 Y = Body fat X1 = Triceps skinfold thickness X2 = Thigh circumference X1 is of little additional help when X2 is already present X2 may be helpful even when X1 is already present Use partial regression residual plot with cautions (Page 389-390 of ALSM) Outlying Cases (Figure 10.5 of ALSM) (1) Cases that are outlying or extreme in a data set (2) A case may be outlying or extreme with respect to its Y value, its X value(s), or both. (3) Outlying cases should be carefully studied to decide whether they should be retained or eliminated. (4) If retained, carefully decide whether their influence should be reduced in the fitting process and/or the regression model should be revised. 3 Identifying Outlying Y Observations - Studentized Deleted Residuals Residuals and Semistudentized Residuals True variance of residuals involves the Hat matrix ei Yi Ŷi H X XX X HY Y e*i ei MSE Need for improved residuals Residuals do not have the same variance 1 e I H Y 2 e 2 I H 2 e i 2 1 hii e i , e j hij 2 i j s 2 e i MSE1 hii The ith observation affects the ith fitted value distorting the ordinary residual e i , e j hij MSE i j Studentized Residual Deleted Residual s ei d i Yi Yi i ri ei s ei MSE 1 hii di ei 1 hii s di 2 MSE i 1 hii 4 Studentized Deleted Residual ti Test for Outliers Bonferroni test procedure: t(1-/2n;n-p-1) di sd i SAS symbolic CODE: t i ~ t n p 1 data t; n p 1 t i ei 2 SSE 1 h ii e i 12 tvalue=tinv(1-/2n,n-p-1); run; proc print data=t; run; Body Fat Example (three predictors) Case Summaries Unstandardized Residual Leverage Value Standardized Deleted Residual 1 -1.683 .201 -.73 2 3.643 .059 1.534 3 -3.176 .372 -1.656 4 -3.158 .111 -1.348 5 0.000 .248 .000 6 -.361 .129 -.148 7 .716 .156 .298 8 4.015 .096 1.760 9 2.655 .115 1.117 10 -2.475 .110 -1.034 11 .336 .120 .137 12 2.226 .109 .923 13 -3.947 .178 -1.825 14 3.447 .148 1.524 15 .571 .333 .267 16 .642 .095 .258 17 -.851 .106 .344 18 -.783 .197 .335 19 -2.857 .067 -1.176 20 1.040 .050 .409 Cases 3, 8, 13 have the largest absolute studentized deleted residuals The ordinary residuals identify as most outlying cases 2, 8, 13 but not 3. Test if case 13 is an outlier: (Bonferroni .05 family n = 20) t 1 2n; n p 1 t .997516 ; 3.252 t13 1826 . 3.252 Case 13 not an outlier 5 Identifying Outlying X Observation - Hat Matrix Leverage Values Hat matrix plays a major role in identifying outlying Y observations The value hii measures the distance between observation Xi and the centroid of the X’s. ( Figure 10.6 of ALSM) Hat matrix also useful in identifying outlying X observations n h Useful properties: 0 hii 1 n h i1 i 1 n ii p n Values larger than 2p/n signify outlying X. p ii h proc reg data=dataname; /*obtain studentized deleted residuals and hat matrix*/ model y=x1 x2 /influence; run; Cases 15 and 3 appear to be outlying, also 1 and 5. Body Fat 60 18 7 Case Centered Leverages 1211 17 164 6 10 8 3 20 9 50 .201 3 .372 5 .248 2 19 Thigh Circumference 1 13 15 .333 14 15 40 10 1 5 20 Triceps Skinfold Thickness 2p/n = 6/20 = .3 30 40 Cases 3 and 15 are outlying and are potentially influential on the fitted model. 6 Identifying Influential Cases - DFFITS, Cook’s Distance, and DFBETAS Measures Influence on Single Fitted Value DFFITS (Standardized) DFFITSi Ŷi Ŷi i Di 12 ji j1 pMSE e i2 pMSE h t i ii 1 h ii h ii 2 1 h ii Must exceed F(.5;p,n-p), i.e. 50th percentile to be influential Must exceed 1 in small sets to be influential 2 j MSE i h ii 2 p/n Ŷ Ŷ n estimated standard deviationof Ŷi , with 2 estimated by MSE i Must exceed sets Influence on All Fitted Values Cook’s Distance (Standardized) Case 3 in Body Fat data is at the 30.6th percentile (influential but not large enough) in large data Influence on the Regression Coefficients - DFBETAS (Standardized) Case 3 in Body Fat Data is influential DFBETAS k i Ca se S um ma ries 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 St andardiz ed DFFIT -.36615 .38381 -1. 27307 -.47635 -.00007 -.05669 .12794 .57452 .40216 -.36387 .05055 .32334 -.85078 .63551 .18885 .08377 -.11837 -.16553 -.31507 .09400 Cook's Distance .04595 .04548 .49016 .07216 .00000 .00114 .00576 .09794 .05313 .04396 .00090 .03515 .21215 .12489 .01258 .00247 .00493 .00964 .03236 .00310 St andardiz ed DFBETA Int ercept -.30518 .17257 -.84710 -.10161 -.00006 .03968 -.07753 .26143 -.15135 .23775 -.00902 -.13049 .11941 .45174 -.00300 .00931 .07951 .13205 -.12960 .01019 St andardiz ed DFBETA X1 -.13149 .11503 -1. 18252 -.29352 -.00003 .04008 -.01561 .39113 -.29466 .24460 .01706 .02246 .59242 .11317 -.12476 .04311 .05504 .07533 -.00407 .00229 St andardiz ed DFBETA X2 .23203 -.14261 1.06690 .19607 .00005 -.04427 .05432 -.33245 .24691 -.26881 -.00248 .07000 -.38949 -.29770 .06877 -.02512 -.07609 -.11610 .06443 -.00331 ckk DIAGk X X bk bk i MSE i ckk 1 Must exceed 1 in small sets to be influential Must exceed 2/ n in large data sets Case 3 in Body Fat Data is influential, but not large enough to require remedial action 7 proc reg data=dataname; /*obtain studentized deleted residuals, hat matrix* and DFBETAS/ model y=x1 x2/influence; /*output Cook distance, DFFITS*/ output out=result1 cookd=cookd dffits=dffits; ods output outputstatistics=result2; /*print out cook’s distance values*/ proc print data=result1; var cookd; /*F percentile based on cookd*/ data result1; set result1; percent1=100*probf(cookd,p,n-p); proc print data=result1; var percent1; run; 8 Multicollinearity Diagnostics - Variance Inflation Factors Coefficients a Recall the variance - covariance matrix of the coefficients of the model with resulting from the correlation transformation 1 2 b rXX (Constant) Triceps Skinfold Thickness Thigh Circumference Midarm Circumference Beta Collinearity Statistics t 1.173 Sig. .258 Tolerance VIF 4.334 3.016 4.264 1.437 .170 .001 708.843 -2.857 2.582 -2.929 -1.106 .285 .002 564.343 -2.186 1.595 -1.561 -1.370 .190 .010 104.606 t -2. 293 Sig. .035 Tolerance Coeffi cientsa bk VIFk 2 2 Standar dized Coeffici ents a. Dependent Variable: Body Fat 2 Model 1 VIFk 1 2 1 R k (Const ant) Triceps Sk infold Thickness Thigh Circumference Unstandardized Coeffic ients St d. B Error -19.174 8.361 St andar diz ed Coeffic i ents Beta Collinearity Statistic s VIF .222 .303 .219 .733 .474 .147 6.825 .659 .291 .676 2.265 .037 .147 6.825 a. Dependent Variable: Body Fat Re gress X k on otherXs proc reg data=dataname; R 2k VIF called the Variance Inflation Factor Model 1 Unstandardized Coefficients Std. B Error 117.085 99.782 model y=x1 x2/VIF; run; VIF must exceed 10 to indicate serious multicollinearity p -1 VIF k avg(VIF) k 1 p -1 1 considerably 9 Surgical Unit Example – Continued ln Yi 3.85 0.083X i1 0.014 X i 2 0.015 X i 3 0.353X i 4 r esi d 0. 6 r esi d 0. 6 0. 5 0. 5 0. 4 0. 4 0. 3 0. 3 0. 2 0. 2 0. 1 0. 1 0. 0 0. 0 - 0. 1 - 0. 1 - 0. 2 - 0. 2 - 0. 3 - 0. 3 - 0. 4 - 0. 4 - 0. 5 - 0. 5 5 6 7 8 30 40 50 pr ed 60 70 x5 Residual plot against predicted. No evidence of serious departures from the model. residual plot against X5. No need to include X5. Re s i d u a l 0. 6 l ogy = 3. 8524 +0. 0733 x 1 +0. 0142 x 2 +0. 0155 x 3 +0. 353 x 8 1. 0 N 54 0. 5 Rs q 0. 8299 A d j Rs q 0. 8160 0. 4 RMS E 0. 2109 0. 8 0. 3 0. 2 0. 1 0. 6 0. 0 - 0. 1 0. 4 - 0. 2 - 0. 3 0. 2 - 0. 4 - 0. 5 - 30 - 20 - 10 0 10 20 Re s i d u a l 0. 0 0. 0 Added-variable plot for X5. the marginal relationship between X5 and logY is weak. (Additional Support for dropping X5) 0. 1 0. 2 0. 3 0. 4 0. 5 0. 6 Cu mu l a t i v e Di s t r i b u t i o n o f 0. 7 0. 8 0. 9 1. 0 Re s i d u a l The normal probability plot of the residuals Little departure from linearity, conclusion? Tests for Normality Test --Statistic--- -----p Value------ Shapiro-Wilk Kolmogorov-Smirnov Cramer-von Mises W D W-Sq 0.968175 0.110905 0.097053 Pr < W Pr > D Pr > W-Sq 0.1600 0.0949 0.1233 Anderson-Darling A-Sq 0.635756 Pr > A-Sq 0.0942 10 Variable (VIF)k X1 1.10 X2 1.02 X3 1.05 X8 1.09 Multicollinearity among the four predictor variables is not a problem. 11 1. Case 17 was identified as outlying with regard to its Y value. Formal test: t(1-0.05/2*54;54-5-1) =t(0.99954;49)=3.528 Since |t17|=3.36963.528. the formal outlier test indicates that case 17 is not an outlier. Still t17 is very close to the critical value, we may still wish to investigate the influence of case 17. 2. Cases 23,28,32,38, 42 and 52 were identified as outlying with regard to X values since their leverage values exceed the critical value 2p/n=2*5/54=0.185 3. Determine the influence of cases 17, 23,28,32,38, 42 and 52, we consider their Cook’s distance and DFFITS values. Case 17 is most influential, with Cook’s distance D17=0.3306 and (DFFITS)17=1.4151. F(0.3306, 5, 49) is corresponding to 11th percentile. 4. In summary, the diagnostic analyses identified a number of potential problems, but none of these was considered to be serious enough to require further remedial action. 12 13