Diagnostics – Part II
Using statistical tests to check to see if the assumptions we made about the model are realistic

Diagnostic methods
• Some simple (but subjective) plots. (Then)
• Some formal statistical tests. (Now)

Simple linear regression model
The response $Y_i$ is a function of a systematic linear component and a random error component:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

with assumptions that:
• Error terms have mean 0, i.e., $E(\varepsilon_i) = 0$.
• $\varepsilon_i$ and $\varepsilon_j$ are uncorrelated (independent).
• Error terms have same variance, i.e., $\mathrm{Var}(\varepsilon_i) = \sigma^2$.
• Error terms $\varepsilon_i$ are normally distributed.

Why should we keep NAGGING ourselves about the model?
• All of the estimates, confidence intervals, prediction intervals, hypothesis tests, etc. have been developed assuming that the model is correct.
• If the model is incorrect, then the formulas and methods we use are at risk of being incorrect. (Some are more forgiving than others.)

Summary of the tests we’ll learn …
• Durbin-Watson test for detecting correlated (adjacent) error terms.
• Modified Levene test for constant error variance.
• (Ryan-Joiner) correlation test for normality of error terms.

The Durbin-Watson test for uncorrelated (adjacent) error terms
Durbin-Watson test statistic:

$$D = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$$

Compare D to Durbin-Watson test bounds in Table B.7:
• If D > upper bound ($d_U$), conclude no correlation.
• If D < lower bound ($d_L$), conclude positive correlation.
• If D is between the two bounds, the test is inconclusive.

Example: Blaisdell Company
[Figure: Regression plot of Company Sales ($ millions) versus Industry Sales ($ millions). Seasonally adjusted quarterly data, 1988 to 1992.]
Company = -1.45475 + 0.176283 Industry
S = 0.0860563   R-Sq = 99.9 %   R-Sq(adj) = 99.9 %
Reasonable fit, but are the error terms positively auto-correlated?

Blaisdell Company Example: Durbin-Watson test
• Stat >> Regression >> Regression. Under Options…, select Durbin-Watson statistic.
• Durbin-Watson statistic = 0.73
• Table B.7 with level of significance α = 0.01, (p-1) = 1 predictor variable, and n = 20 (5 years, 4 quarters each) gives $d_L = 0.95$ and $d_U = 1.15$.
• Since D = 0.73 < $d_L$ = 0.95, conclude error terms are positively auto-correlated.

For completeness’ sake … one more thing about Durbin-Watson test
• If a test for negative auto-correlation is desired, use D* = 4 - D instead. If D* < $d_L$, then conclude error terms are negatively auto-correlated.
• If a two-sided test is desired (both positive and negative auto-correlation possible), conduct both one-sided tests, D and D*, separately. Level of significance is then 2α.

Modified Levene Test for nonconstant error variance
• Divide the data set into two roughly equal-sized groups, based on the level of X.
• If the error variance is either increasing or decreasing with X, the absolute deviations of the residuals around their group median will be larger for one of the two groups.
• Use a two-sample t* statistic to test whether the mean of the absolute deviations for one group differs significantly from the mean of the absolute deviations for the second group.

Modified Levene Test in Minitab
• Use Manip >> Code >> Numeric to numeric … to create a GROUP variable based on the values of X.
• Stat >> Regression >> Regression. Under Storage …, select residuals.
• Stat >> Basic statistics >> 2 Variances … Specify Samples (RESI1) and Subscripts (GROUP). Select OK. Look in session window for Levene P-value.

Example: How is plutonium activity related to alpha particle counts?
[Figure: Regression plot of alpha versus plutonium.]
alpha = 0.0070331 + 0.0055370 plutonium
S = 0.0125713   R-Sq = 91.6 %   R-Sq(adj) = 91.2 %

A residual versus fits plot suggesting non-constant error variance
[Figure: residuals versus fitted values for the plutonium and alpha example.]

Plutonium Alpha Example: Modified Levene’s Test
Levene's Test (any continuous distribution)
Test Statistic: 9.452
P-Value       : 0.006
It is highly unlikely (P=0.006) that we’d get such an extreme Levene statistic (L=9.452) if the variances of the two groups were equal.
Reject the null hypothesis at the 0.01 level, and conclude that the error variances are not constant.

(Ryan-Joiner) Correlation test for normality of error terms in Minitab
• H0: Error terms are normally distributed vs. HA: Error terms are not normally distributed
• Stat >> Regression >> Regression. Under Storage…, select residuals.
• Stat >> Basic statistics >> Normality Test. Select residuals (RESI1) and request Ryan-Joiner test. Select OK.

100 chi-square (1 df) data values
[Figure: histogram of 100 chi-square (1 df) data values (chi versus Percent).]
[Figure: Normal probability plot and test for 100 chi-square (1 df) data values.]

100 normal(0,1) data values
[Figure: histogram of 100 normal(0,1) data values (normal versus Percent).]
[Figure: Normal probability plot and test for 100 normal(0,1) data values.]

Normal probability plot for Tree diameter (X) and C-dating Age (Y)
[Figure: Normal Probability Plot of the Residuals (response is Age); Normal Score versus Residual.]

Tree diameter and Age Example: Ryan-Joiner Correlation Test
[Figure: Ryan-Joiner correlation test output for the tree diameter and age example.]

Some closing comments
• Checking of assumptions is important, but be aware of the “robustness” of your methods, so you don’t get too hung up.
• Model checking is an art as well as a science.
• Do not think that there is some definitive correct answer “in the back of the book.”
• Use your knowledge of the subject matter.
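
Computing the test statistics by hand (a sketch)
• The Durbin-Watson D and the modified Levene t* from the earlier slides are simple functions of the residuals, so they can be checked outside Minitab.
• Below is a minimal Python sketch using only the standard library. The function names and the toy residuals are hypothetical illustrations, not data from the Blaisdell or plutonium examples. The t* here assumes the usual pooled-variance two-sample form applied to absolute deviations from each group's median.

```python
from math import sqrt
from statistics import mean, median


def durbin_watson(residuals):
    """D = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


def modified_levene_t(resid_group1, resid_group2):
    """Pooled two-sample t* on absolute deviations from each group's median.

    Assumption: the groups were already formed by splitting on the level of X.
    """
    med1, med2 = median(resid_group1), median(resid_group2)
    d1 = [abs(e - med1) for e in resid_group1]
    d2 = [abs(e - med2) for e in resid_group2]
    n1, n2 = len(d1), len(d2)
    m1, m2 = mean(d1), mean(d2)
    # Pooled variance of the absolute deviations, with n1 + n2 - 2 df.
    pooled_var = (
        sum((d - m1) ** 2 for d in d1) + sum((d - m2) ** 2 for d in d2)
    ) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled_var * (1 / n1 + 1 / n2))


# Toy residuals (made up for illustration only).
runs = [1, 1, 1, -1, -1, -1]      # long runs of the same sign: D well below 2
alternating = [1, -1, 1, -1]      # sign flips every step: D well above 2
print(durbin_watson(runs), durbin_watson(alternating))

low_x = [-1, 0, 1]                # small spread in the low-X group
high_x = [-3, 0, 3]               # larger spread in the high-X group
print(modified_levene_t(low_x, high_x))
```

• D still has to be compared with the bounds in Table B.7, and t* with a t distribution on approximately n1 + n2 - 2 degrees of freedom; the sketch only produces the statistics.
• With two groups, t* squared equals the one-way ANOVA F statistic computed on the same absolute deviations.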