Multiple Linear Regression

Regression Diagnostics

Find Scores That
• Contribute to violation of assumptions.
• Are suspect because they are far removed from the centroid (multidimensional mean).
• Have undue influence on the solution.

Outliers Among the Predictors
• Leverage, h_i, or Hat Diagonal
• The larger this statistic, the greater the distance between the data point and the centroid in p-dimensional space.
• Investigate cases with h_i greater than 2p/N.
• p is the number of parameters in the model, including the intercept.

Distance from the Regression Surface
• Standardized Residual (aka Studentized Residual)
  – Difference between actual Y and predicted Y, divided by an appropriate standard error.
• Rstudent (aka Studentized Deleted Residual)
  – The same, except that for each case the regression surface is the one obtained when that individual case is removed.
• Investigate if the absolute value is greater than 2.

Influence on the Solution
• Cook's D: how much would the regression surface change if this case were removed?
  – Investigate cases with D > 1.
• Dfbetas: how much would one parameter (slope or intercept) change if this case were removed?
  – Investigate cases with absolute values > 2.

Simple Example
• Y = sperm count
• X1 = % time recently spent with mate
• X2 = time since last ejaculation

Output Statistics

      Student  Cook's            Hat Diag  ------------ DFBETAS ------------
Obs  Residual       D  RStudent         H  Intercept  Together  SR_Last Ejac
  7     0.310   0.038    0.2921    0.5405     0.2920   -0.2869       -0.1987
  8    -0.183   0.006   -0.1715    0.3605    -0.0959    0.1083        0.0437
  9    -1.240   0.098   -1.2906    0.1600    -0.0398   -0.2265        0.0999
 10    -1.270   0.261   -1.3296    0.3270    -0.2614   -0.2321        0.4657
 11     2.643   1.183    6.9409    0.3369     1.6194    1.0137       -2.6903

Leverage
• Investigate cases with values greater than 2(3)/11 ≈ .55.
• Case 7 (h = .5405) is close to this cutoff.
• It is a univariate outlier on the time together variable.
• Further investigation indicates the case is valid, so we retain it.

Residuals
• Case 11 has large residuals and should be investigated.
• Notice that Rstudent (6.94) is much larger than the standardized residual (2.64).
• This indicates that removing this case has a large effect on the solution.

Output Statistics

      Student  Cook's            Hat Diag  ------------ DFBETAS ------------
Obs  Residual       D  RStudent         H  Intercept  Together  SR_Last Ejac
 11     2.643   1.183    6.9409    0.3369     1.6194    1.0137       -2.6903

Influence
• Case 11 has a high value of Cook's D (1.183, above the cutoff of 1).
• It has a high Dfbeta (-2.69) for the time since last ejaculation predictor, even after I transformed that variable to reduce skewness.
• Upon investigation, it was found that this subject did not follow the instructions for gathering the data. His scores were deleted.

Plots of Residuals
• These can also be useful, but it takes some practice to get good at detecting problems from such plots.
• Plot the residual versus predicted Y.

Heteroscedasticity

Trying Squaring One Predictor

Residuals not Normal and Variance not Constant
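
Checking the Example Numerically

The cutoffs and the output statistics from the sperm count example can be cross-checked with a short script. The sketch below is illustrative only and is not part of the original analysis. It takes the standardized residuals, hat diagonals, and DFBETAS as printed for cases 7 to 11, assumes N = 11 and p = 3 (read off the 2(3)/11 leverage cutoff), and applies two standard ordinary least squares identities: D = (r^2 / p) * h / (1 - h) and Rstudent = r * sqrt((N - p - 1) / (N - p - r^2)). The helper names cooks_d, rstudent, and flag_case are hypothetical.

```python
import math

N = 11  # number of cases, as implied by the 2(3)/11 leverage cutoff
P = 3   # parameters: intercept plus two predictors

# Values copied from the Output Statistics table:
# obs -> (standardized residual, hat diagonal,
#         DFBETAS for Intercept, Together, SR_Last Ejac)
CASES = {
    7: (0.310, 0.5405, (0.2920, -0.2869, -0.1987)),
    8: (-0.183, 0.3605, (-0.0959, 0.1083, 0.0437)),
    9: (-1.240, 0.1600, (-0.0398, -0.2265, 0.0999)),
    10: (-1.270, 0.3270, (-0.2614, -0.2321, 0.4657)),
    11: (2.643, 0.3369, (1.6194, 1.0137, -2.6903)),
}


def cooks_d(r, h, p=P):
    """Cook's D from the standardized residual r and leverage h."""
    return (r ** 2 / p) * h / (1.0 - h)


def rstudent(r, n=N, p=P):
    """Studentized deleted residual from the standardized residual r."""
    return r * math.sqrt((n - p - 1) / (n - p - r ** 2))


def flag_case(r, h, dfbetas, n=N, p=P):
    """Return the names of the cutoffs that a case exceeds."""
    flags = []
    if h > 2 * p / n:                               # leverage cutoff 2p/N
        flags.append("leverage")
    if max(abs(r), abs(rstudent(r, n, p))) > 2:     # either residual > 2
        flags.append("residual")
    if cooks_d(r, h, p) > 1:                        # Cook's D > 1
        flags.append("cooks_d")
    if max(abs(b) for b in dfbetas) > 2:            # any |DFBETAS| > 2
        flags.append("dfbetas")
    return flags


flagged = {obs: flag_case(r, h, b) for obs, (r, h, b) in CASES.items()}

for obs, (r, h, b) in CASES.items():
    print(obs, round(cooks_d(r, h), 3), round(rstudent(r), 4), flagged[obs])
```

Because the inputs are the rounded values from the table, the recomputed Cook's D and Rstudent differ from the printed output only by small rounding error. Under these cutoffs only case 11 is flagged; case 7 falls just below the leverage cutoff.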