Topic 8: Model Diagnostics

Outline
• Diagnostics to check model assumptions
  – Diagnostics concerning X
  – Diagnostics using the residuals

Diagnostics and remedial measures
• Diagnostics: look at the data to diagnose situations where the assumptions of our model are violated
• Remedies: changes in analytic strategy to fix these problems

Look at the data
• Before trying to describe the relationship between a response variable (Y) and an explanatory variable (X), we should look at the distributions of these variables
• We should always look at X
• If Y depends on X, looking at Y alone may not be very informative

Diagnostics for X
• If X has many values, use Proc Univariate to get numerical summaries (e.g., mean, median, quartiles)
• If X has only a few values, use Proc Freq or the Freq option in Proc Univariate to get summaries (e.g., percentages, counts)

Diagnostics for X
• Examine the distribution of X
  – Is it skewed?
  – Are there outliers?
• Do the values of X depend on time (i.e., the order in which they were collected)?

What’s the concern?
• Model estimates are based on means and sums of squares
• These numerical summaries are not robust to outliers
• Outliers can inflate the variance or influence the trend
• Observations that show a pattern over time are not independent

Important Statistics
• Mean
• Standard deviation
• Skewness
• Kurtosis
• Range

Example: Toluca lot size
    data toluca;
      infile '../data/CH01TA01.txt';
      input lotsize hours;
      seq=_n_;
    proc univariate data=toluca plot;
      var lotsize;
    run;

Crude Plots
    Stem Leaf                     #  Boxplot
      12 0                        1     |
      11 00                       2     |
      10 00                       2     |
       9 0000                     4  +-----+
       8 000                      3  |     |
       7 000                      3  *--+--*
       6 0                        1  |     |
       5 000                      3  +-----+
       4 00                       2     |
       3 000                      3     |
       2 0                        1     |
         ----+----+----+----+
     Multiply Stem.Leaf by 10**+1

Moments
    N                        25    Sum Weights              25
    Mean                     70    Sum Observations       1750
    Std Deviation    28.7228132    Variance                825
    Skewness         -0.1032081    Kurtosis         -1.0794107
    Uncorrected SS       142300    Corrected SS          19800
    Coeff Variation  41.0325903    Std Error Mean   5.74456265

Location and Spread
    Basic Statistical Measures
    Location                 Variability
    Mean      70.00000       Std Deviation          28.72281
    Median    70.00000       Variance              825.00000
    Mode      90.00000       Range                 100.00000
                             Interquartile Range    40.00000

Quantiles (Definition 5)
    Quantile      Estimate
    100% Max           120
    99%                120
    95%                110
    90%                110
    75% Q3              90
    50% Median          70
    25% Q1              50
    10%                 30
    5%                  30
    1%                  20
    0% Min              20

Extreme Observations
    ----Lowest----    ----Highest---
    Value    Obs      Value    Obs
       20     14        100      9
       30     21        100     16
       30     17        110     15
       30      2        110     20
       40     23        120      7

SAS CODE FOR “TREND IN ORDER?”
    symbol1 v=circle i=sm70;
    proc gplot data=toluca;
      plot lotsize*seq;
    run;

Normal distributions
• Our model does not state that X comes from a single normal population
• The same comment applies to Y
• In some cases, X and/or Y may be normal and it can be useful to know this

Normal quantile plots
• Consider n=5 observations iid N(0,1)
• From Table B.1, we find
  – P(z ≤ -.84) = .20
  – P(-.84 < z ≤ -.25) = .20
  – P(-.25 < z ≤ .25) = .20
  – P(.25 < z ≤ .84) = .20
  – P(z > .84) = .20

Normal quantile plots
• So we expect
  – One observation ≤ -.84
  – One observation in (-.84, -.25)
  – One observation in (-.25, .25)
  – One observation in (.25, .84)
  – One observation > .84

Normal quantile plots
• Zi = Φ⁻¹((i-.375)/(n+.25)), i=1 to n, where Φ is the standard normal CDF
• Plot the order statistics X(i) vs Zi
• KNNL plots X(i) vs s·Zi
• Multiplying by s only rescales one axis; it doesn’t affect the nature of the plot

Normal quantile plots
• The standardized X variable is z = (X - μ)/σ
• So, X = μ + σz
• If the data are approximately normal, the relationship will be approximately linear with slope close to σ and intercept close to μ

SAS CODE
    proc univariate data=toluca plot;
      var lotsize;
      qqplot lotsize;
    run;

Diagnostics for residuals
• Model: Yi = β0 + β1Xi + εi
• Predicted values: Ŷi = b0 + b1Xi
• Residuals: ei = Yi – Ŷi
• So, Yi = Ŷi + ei
• The residuals ei should be similar to the errors εi
• The model assumes εi iid N(0, σ²)

Plot Plot PLOT PLOT PLOT Plot Plot

Questions addressed by diagnostics for residuals
• Is the relationship linear?
• Does the variance depend on X?
• Are there outliers?
• Do the errors depend on order?
• Are the errors normal?
• Are the errors dependent?

Is the Relationship Linear?
• Plot Y vs X
• Plot e vs X (residual plot)
• The residual plot better emphasizes deviations from a linear pattern

SAS CODE: Fake #1
    libname xxx '../data';
    Data xxx.a100;
      do x=1 to 30;
        y=x*x-10*x+30+25*normal(0);
        output;
      end;
    run;
Generates data set where Y=X²-10X+30
Errors are normally distributed with σ=25

SAS CODE
    proc reg data=xxx.a100;
      model y=x;
      output out=a2 r=resid;
    run;

OUTPUT
    Analysis of Variance
    Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model               1          1032098       1032098    170.95   <.0001
    Error              28           169048    6037.41596
    Corrected Total    29          1201145

    Parameter Estimates
    Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
    Intercept     1           -145.37495         29.09684     -5.00     <.0001
    x             1             21.42943          1.63899     13.07     <.0001

A significant positive relationship!!
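A large F statistic only says that a straight line explains much of the variation in Y; it does not confirm that a straight line is the right shape. As a rough numeric companion to the plots, the sketch below regenerates data like Fake #1 and correlates the residuals with the squared, centered X. It is only an illustration under stated assumptions: the data set names fake1 and fake1res, the variable xc2, and the seed are hypothetical, and rand('normal', ...) with call streaminit replaces normal(0) so that the run is repeatable.

```sas
/* Sketch only: fake1, fake1res, xc2 and the seed are hypothetical choices. */
data fake1;
  call streaminit(2024);                          /* fixed seed, repeatable run */
  do x = 1 to 30;
    y = x*x - 10*x + 30 + rand('normal', 0, 25);  /* same mean function, sigma = 25 */
    output;
  end;
run;

/* Straight-line fit, saving the residuals */
proc reg data=fake1;
  model y = x;
  output out=fake1res r=resid;
run;
quit;

/* Add the squared, centered predictor */
data fake1res;
  set fake1res;
  xc2 = (x - 15.5)**2;                            /* 15.5 is the mean of x = 1, ..., 30 */
run;

/* Correlation between the residuals and the curvature term */
proc corr data=fake1res;
  var resid;
  with xc2;
run;
```

Least squares residuals are always uncorrelated with x itself, so a correlation near zero with xc2 would be consistent with a linear mean function. For data generated from a quadratic, one would expect a clearly positive correlation instead. This is a supplement to the residual plot, not a replacement for it.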
SAS CODE: Visual Checks
    symbol1 v=circle i=rl;
    proc gplot data=a2;
      plot y*x;                 /* scatterplot with regression line */
    run;
    symbol1 v=circle i=sm60;
    proc gplot data=a2;
      plot y*x;                 /* scatterplot with smoothed curve */
    proc gplot data=a2;
      plot resid*x / vref=0;    /* residual plot */
    run;

Does not appear to be linear

Nonlinear behavior easier to see here?!

Does the variance depend on X?
• Plot Y vs X
• Plot e vs X
• Plot of e vs X will emphasize problems with the variance assumption

SAS CODE: Fake #2
    libname xxx '../data';
    Data xxx.a100a;
      do x=1 to 100;
        y=30+100*x+10*x*normal(0);
        output;
      end;
    run;
Generates data set where Y=30 + 100X
Errors are normally distributed with σ=10X

SAS CODE
    proc reg data=xxx.a100a;
      model y=x;
      output out=a2 r=resid;
    run;

OUTPUT
    Analysis of Variance
    Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model               1        856723171     856723171   1682.55   <.0001
    Error              98         49899722        509181
    Corrected Total    99        906622893

    Parameter Estimates
    Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
    Intercept     1             13.80557        143.79092      0.10     0.9237
    x             1            101.39875          2.47200     41.02     <.0001

A significant positive relationship!!

SAS CODE: Visual Checks
    symbol1 v=circle i=sm60;
    proc gplot data=a2;
      plot y*x;                 /* scatterplot with smoothed curve */
    proc gplot data=a2;
      plot resid*x / vref=0;    /* residual plot */
    run;

Are the errors normal?
• The real question is whether the distribution of the errors is far enough away from normal to invalidate our confidence intervals and significance tests
• Look at the residuals’ distribution
• Use a normal quantile plot

SAS CODE
    data a1;
      infile '..\data\CH01TA01.txt';
      input lotsize hours;
    proc reg data=a1;
      model hours=lotsize;
      output out=a2 r=resid;
    proc univariate data=a2 plot normal;
      var resid;
      histogram resid / normal kernel;
      qqplot resid;
    run;

Univariate Output
    Fitted Normal Distribution for resid

    Parameters for Normal Distribution
    Parameter   Symbol   Estimate
    Mean        Mu              0
    Std Dev     Sigma    47.79534

    Goodness-of-Fit Tests for Normal Distribution
    Test                  ----Statistic-----    ------p Value------
    Kolmogorov-Smirnov    D       0.09571960    Pr > D       >0.150
    Cramer-von Mises      W-Sq    0.03326349    Pr > W-Sq    >0.250
    Anderson-Darling      A-Sq    0.20714170    Pr > A-Sq    >0.250

No obvious deviations from normality: all three p-values are greater than 0.05

Dependent Errors
• Usually we see this in a plot of residuals vs time order (KNNL) or seq (our SAS variable)
• We can have trends and/or cyclical effects in the residuals
• If you are interested, read KNNL pp. 108-110

Are there outliers?
• Plot Y vs X
• Plot e vs X
• Plot of e vs X should emphasize an outlier

SAS CODE: Fake #3
    Data xxx.a100b1;
      do x=1 to 100 by 5;
        y=30+50*x+200*normal(0);
        output;
      end;
      x=50; y=30+50*50+10000; d='out'; output;
    run;
Generates data set where Y=30+50X
Errors are normally distributed with σ=200

SAS CODE
    proc reg data=xxx.a100b1;
      model y=x;
      where d ne 'out';
    run;
    proc reg data=xxx.a100b1;
      model y=x;
      output out=a2 r=resid;
    run;

Without Outlier
    Analysis of Variance
    Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model               1         42426770      42426770    894.59   <.0001
    Error              18           853668         47426
    Corrected Total    19         43280438

    Parameter Estimates
    Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
    Intercept     1             -2.54677         95.29715     -0.03     0.9790
    x             1             50.51719          1.68899     29.91     <.0001

s=217.8

With Outlier
    Analysis of Variance
    Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model               1         43888843      43888843      8.67   0.0083
    Error              19         96206895       5063521
    Corrected Total    20        140095738

    Parameter Estimates
    Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
    Intercept     1            432.20263        979.57661      0.44     0.6640
    x             1             51.37694         17.45089      2.94     0.0083

s=2250.2

SAS CODE: Visual Checks
    symbol1 v=circle i=rl;
    proc gplot data=a2;
      plot y*x;
    proc gplot data=a2;
      plot resid*x / vref=0;
    run;

Different kinds of outliers
• The outlier in the last example influenced the intercept but not the slope
• It inflated all of our standard errors
• Here is an example of an outlier that influences the slope

SAS CODE
    Data xxx.a100c1;
      do x=1 to 100 by 5;
        y=30+50*x+200*normal(0);
        output;
      end;
      x=100; y=30+50*100-10000; d='out'; output;
    run;

SAS CODE
    proc reg data=xxx.a100c1;
      model y=x;
      where d ne 'out';
    run;
    proc reg data=xxx.a100c1;
      model y=x;
      output out=a2 r=resid;
    run;

Without Outlier
    Analysis of Variance
    Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model               1         41233447      41233447    901.15   <.0001
    Error              18           823612         45756
    Corrected Total    19         42057060

    Parameter Estimates
    Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
    Intercept     1             73.28061         93.60451      0.78     0.4439
    x             1             49.80168          1.65899     30.02     <.0001

With Outlier
    Analysis of Variance
    Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model               1         11151297      11151297      2.53   0.1285
    Error              19         83888277       4415172
    Corrected Total    20         95039574

    Parameter Estimates
    Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
    Intercept     1            903.97793        899.32018      1.01     0.3274
    x             1             24.13057         15.18374      1.59     0.1285

SAS CODE: Visual Checks
    symbol1 v=circle i=rl;
    proc gplot data=a2;
      plot y*x;
    proc gplot data=a2;
      plot resid*x / vref=0;
    run;

Background Reading
• Program topic8.sas has code for the proc univariate diagnostics of X
• Program residualchecks.sas has the residual analysis
• The permanent SAS data sets are a100.sas7bdat, a100a.sas7bdat, a100b1.sas7bdat, and a100c1.sas7bdat
• Read sections 3.8 and 3.9
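The Dependent Errors slide describes plotting the residuals against time order (seq) but gives no code for it. A minimal sketch of that check for the Toluca data might look like the following. It assumes that the rows of CH01TA01.txt are stored in the order in which they were collected; the data set names a1seq and a2seq are hypothetical, and the file path is the one used earlier in these notes.

```sas
/* Sketch only: a1seq and a2seq are hypothetical data set names. */
data a1seq;
  infile '../data/CH01TA01.txt';   /* path as used earlier in these notes */
  input lotsize hours;
  seq = _n_;                       /* assumed collection order = row order */
run;

/* Fit the regression and keep the residuals together with seq */
proc reg data=a1seq;
  model hours = lotsize;
  output out=a2seq r=resid;
run;
quit;

/* Residuals vs order, with a smoothed curve and a reference line at 0 */
symbol1 v=circle i=sm70;
proc gplot data=a2seq;
  plot resid*seq / vref=0;
run;
quit;
```

A smoothed curve that drifts steadily away from the zero line, or that cycles above and below it, would suggest a trend or cyclical effect in the residuals. If the file order does not reflect collection order, the plot says nothing about dependence over time.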