Regression Diagnostics

Multiple Linear Regression
Regression Diagnostics
Find Scores That
• Contribute to violation of assumptions.
• Are suspect because they are far removed
from the centroid (multidimensional mean)
• Have undue influence on the solution.
Outliers Among the Predictors
• Leverage, h_i, or Hat Diagonal
• The larger this statistic, the greater the
distance between the data point and the
centroid in p-dimensional space.
• Investigate cases with h_i greater than
2p/N.
• p is the number of parameters in the
model, including the intercept.
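As a rough illustration of the leverage bullets above, the sketch below computes the hat diagonal h_i = x_i'(X'X)^-1 x_i for a small design matrix and flags cases above 2p/N, the form of the cutoff that the worked example later in these slides applies (2(3)/11). The data and the helper names `invert` and `hat_diagonal` are hypothetical, chosen only for this example; in practice a statistics package would report these values directly.

```python
def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    # Augment the matrix with the identity; the right half becomes the inverse.
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def hat_diagonal(X):
    """Return the leverage h_i = x_i' (X'X)^-1 x_i for every row of X."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    inv = invert(xtx)
    return [sum(row[i] * inv[i][j] * row[j] for i in range(p) for j in range(p))
            for row in X]


# Hypothetical data: an intercept column plus one predictor with one extreme value.
X = [[1.0, x] for x in (1, 2, 3, 4, 20)]
h = hat_diagonal(X)

p = len(X[0])          # number of parameters, including the intercept
N = len(X)             # number of cases
cutoff = 2 * p / N
flagged = [i for i, h_i in enumerate(h) if h_i > cutoff]

print("leverages:", [round(v, 3) for v in h])
print("cutoff:", cutoff, "flagged row indices:", flagged)
```

The leverages always sum to p, so their mean is p/N and the 2p/N rule flags cases with more than twice the average leverage.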
Distance from the Regression Surface
• Standardized Residual (aka Studentized
Residual)
– Difference between actual Y and predicted Y
divided by an appropriate standard error
• Rstudent (aka Studentized Deleted
Residual) – the same, except that for each
case the regression surface is the one
obtained with that case removed.
• Investigate if the absolute value is greater
than 2.
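The two residuals described above can be written out as follows. The notation is an assumption made for this sketch rather than the slides' own: e_i is the raw residual (actual Y minus predicted Y), h_i is the leverage, s is the root mean square error from the fit to all N cases, and s_(i) is the same quantity from the fit with case i removed.

```latex
% Standardized (internally studentized) residual
r_i = \frac{e_i}{s\,\sqrt{1 - h_i}}

% Rstudent (studentized deleted residual)
t_i = \frac{e_i}{s_{(i)}\,\sqrt{1 - h_i}}
```

The only difference is the error estimate in the denominator: a case that inflates s when it is included will have a smaller s_(i), and therefore an Rstudent that is larger in magnitude than its standardized residual.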
Influence on the Solution
• Cook’s D – how much would the
regression surface change if this case
were removed
– Investigate cases with D > 1.
• Dfbetas – how much would one parameter
(slope or intercept) change if this case
were removed
– Investigate cases with absolute values > 2.
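In deletion form, the two influence statistics above are commonly defined as shown below. The symbols are assumptions for this sketch: a subscript (i) marks a quantity computed with case i removed, b_k is the k-th parameter estimate, s is the root mean square error, p is the number of parameters including the intercept, and X is the design matrix.

```latex
% Cook's D: total squared change in the predicted values when case i is removed
D_i = \frac{\sum_{j=1}^{N} \left( \hat{Y}_j - \hat{Y}_{j(i)} \right)^2}{p\, s^2}

% DFBETAS: standardized change in parameter k when case i is removed
\mathrm{DFBETAS}_{k(i)} = \frac{b_k - b_{k(i)}}{s_{(i)} \sqrt{\left[ (X^\top X)^{-1} \right]_{kk}}}
```

Cook's D summarizes the shift in the whole fitted surface with one number per case, while DFBETAS gives one number per case per parameter, with a sign showing the direction of the change.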
Simple Example
• Y = sperm count
• X1 = % time recently spent with mate
• X2 = time since last ejaculation
Output Statistics
Obs  Student   Cook's  RStudent  Hat Diag  DFBETAS    DFBETAS   DFBETAS
     Residual  D                 H         Intercept  Together  SR_Last Ejac
  7    0.310   0.038    0.2921   0.5405     0.2920    -0.2869   -0.1987
  8   -0.183   0.006   -0.1715   0.3605    -0.0959     0.1083    0.0437
  9   -1.240   0.098   -1.2906   0.1600    -0.0398    -0.2265    0.0999
 10   -1.270   0.261   -1.3296   0.3270    -0.2614    -0.2321    0.4657
 11    2.643   1.183    6.9409   0.3369     1.6194     1.0137   -2.6903
Leverage
• Investigate cases with values greater than
2(3)/11 = .55.
• Case 7 is close to this cutoff.
• It is a univariate outlier on the time
together variable.
• Further investigation indicates the case is
valid, so we retain it.
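The leverage check above is simple arithmetic, sketched here with the Hat Diag H values copied from the Output Statistics table for the cases shown (7 through 11). The variable names are illustrative only.

```python
# Hat Diag H for the cases shown in the Output Statistics table.
hat_diag = {7: 0.5405, 8: 0.3605, 9: 0.1600, 10: 0.3270, 11: 0.3369}

p = 3    # parameters: intercept plus two predictors
N = 11   # number of cases, as used in the slide's 2(3)/11
cutoff = 2 * p / N

over_cutoff = [case for case, h in hat_diag.items() if h > cutoff]
highest = max(hat_diag, key=hat_diag.get)

print(f"cutoff = {cutoff:.3f}")
print("cases over cutoff:", over_cutoff)
print("highest leverage among cases shown:", highest, hat_diag[highest])
```

None of the cases shown exceeds the cutoff, and case 7 has the highest leverage among them, falling just under it, which matches the slide's description of it as close to the cutoff.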
Residuals
• Case 11 has large residuals, so it should be
investigated.
• Notice that Rstudent is much larger than
the standardized residual.
• This indicates that removing this case has
a large effect on the solution.
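The gap between the two residuals can be checked from the table alone. A standard identity gives Rstudent from the standardized residual r without refitting: t = r * sqrt((N - p - 1) / (N - p - r^2)). The sketch below applies it with N = 11 and p = 3, the values used in the leverage cutoff; the function name is hypothetical, and small discrepancies are expected because the table rounds r to three decimals.

```python
import math


def rstudent_from_student(r, n, p):
    """Studentized deleted residual implied by the standardized residual r."""
    return r * math.sqrt((n - p - 1) / (n - p - r * r))


# Values copied from the Output Statistics table.
student = {7: 0.310, 8: -0.183, 9: -1.240, 10: -1.270, 11: 2.643}
reported_rstudent = {7: 0.2921, 8: -0.1715, 9: -1.2906, 10: -1.3296, 11: 6.9409}

N, p = 11, 3
computed = {case: rstudent_from_student(r, N, p) for case, r in student.items()}

for case in student:
    print(case, round(computed[case], 3), reported_rstudent[case])
```

For case 11, r^2 is close to N - p = 8, so the denominator is nearly zero and the deleted residual becomes very large. That is the arithmetic behind the jump from 2.643 to about 6.94.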
Output Statistics
Obs  Student   Cook's  RStudent  Hat Diag  DFBETAS    DFBETAS   DFBETAS
     Residual  D                 H         Intercept  Together  SR_Last Ejac
 11    2.643   1.183    6.9409   0.3369     1.6194     1.0137   -2.6903
Influence
• Case 11 has a high value of Cook’s D.
• It has a high Dfbetas value for the time since last
ejaculation predictor, even after I
transformed that variable to reduce
skewness.
• Upon investigation, it was found that this
subject did not follow the instructions for
gathering the data. His scores were
deleted.
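Cook's D can also be recovered from the other columns of the Output Statistics table through the standard shortcut D = (r^2 / p) * h / (1 - h), where r is the standardized (Student) residual and h is the hat diagonal. The sketch below recomputes it for the cases shown; the function name is hypothetical and p = 3 is taken from the leverage cutoff 2(3)/11.

```python
def cooks_d(r, h, p):
    """Cook's D from the standardized residual r and leverage h."""
    return (r * r / p) * (h / (1.0 - h))


# (Student Residual, Hat Diag H, reported Cook's D) copied from the table.
cases = {
    7: (0.310, 0.5405, 0.038),
    8: (-0.183, 0.3605, 0.006),
    9: (-1.240, 0.1600, 0.098),
    10: (-1.270, 0.3270, 0.261),
    11: (2.643, 0.3369, 1.183),
}

p = 3  # parameters: intercept plus two predictors
computed = {case: cooks_d(r, h, p) for case, (r, h, _) in cases.items()}
influential = [case for case, d in computed.items() if d > 1]

for case, (_, _, reported) in cases.items():
    print(case, round(computed[case], 3), reported)
print("cases with D > 1:", influential)
```

The formula shows why case 11 stands out even though its leverage is unremarkable: D grows with the square of the residual, so a large residual at moderate leverage is enough to push it past 1.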
Plots of Residuals
• These can also be useful, but it takes some
practice to get good at detecting problems
from such plots.
• Plot the residual versus predicted Y
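A minimal sketch of how the points for such a plot might be prepared, using a one-predictor least-squares fit on hypothetical data (the numbers and the helper `fit_line` are invented for illustration, not taken from the slides). Each (predicted, residual) pair would then be passed to whatever plotting tool is at hand.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.5, 11.2, 14.9, 15.1]

intercept, slope = fit_line(x, y)
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# Predicted Y goes on the horizontal axis, the residual on the vertical axis.
points = list(zip(predicted, residuals))
for pred, res in points:
    print(f"{pred:7.3f}  {res:7.3f}")
```

In a well-behaved fit the points scatter evenly around zero across the whole range of predicted Y. A fan shape suggests non-constant variance, and a curve suggests a missing nonlinear term.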
Heteroscedasticity
Trying Squaring One Predictor
Residuals not Normal and
Variance not Constant