Influence Statistics and Outliers

advertisement
Influence Statistics, Outliers, and Collinearity Diagnostics

Standardized Residuals – Residuals divided by their estimated standard errors (like t-statistics).
Observations with values larger than t 
2n
ri 

ei
s 1  vii
,n  p '
in absolute value are considered outliers.
s  MS Re sidual
vii  i thh diagonal element of P
Studentized Residuals – Similar to standardized residuals, except s ( i )  MS Re sidual (i ) is computed on
the regression fit on the remaining n-1 cases. Observations with values larger than t 
2n
, n  p ' 1
in absolute
value are considered outliers
ri* 
ei
s(i ) 1  vii
s  MS Re sidual (i )
vii  i thh diagonal element of P
These are labelled as RSTUDENT by SAS.

Leverage Values (Hat Diag) – Measure of how far an observation is from the others in terms of the levels
of the independent variables (not the dependent variable). Observations with values larger than 2p’/n are
considered to be potentially highly influential.
vii  i th diagonal element of P

DFFITS – Measure of how much an observation has effected its fitted value from the regression model.
p'
Values larger than 2
in absolute value are considered highly influential.
n
^
^


ei
e2


(n  p'1) s(2i )  (n  p' ) s 2  i
 s 1 v 
1  vii
s (i ) vii
ii 
 (i )
Note that DFFITSi measures the number of standard errors that the fitted value for the ith case has shifted
when it was not used in the regression fit.
DFFITS i 

Y i  Y i (i )

vii
1  vii
DFBETAS – Measure of how much an observation has effected the estimate of a regression coefficient
2
(there is one DFBETA for each regression coefficient, including the intercept). Values larger than
in
n
absolute value are considered highly influential.
^
DFBETAS j (i ) 
^
 j   j (i )
s(i ) c jj
c jj  ( j  1) st diagonal element of ( X ' X ) 1
Note that DFBETASj(i) measures the number of standard errors that the regression coefficient for the jth
predictor variable has shifted when the ith case was not used in the regression fit.

Cook’s D – Measure of aggregate impact of each observation on the group of regression coefficients, as
well as the group of fitted values. Values larger than F.50, p ',n  p , are considered highly influential.
'
^
Di 
^
^
^
(  (i )   )' ( X ' X )(  (i )   )
p' s 2

^
^
^
^

Y

Y
Y

Y
(
i
)
(
i
)




2
ri  vii 





 
2
p'  1  vii 
p' s
Note that Di is like an F-statistic used for testing K '   m where K’=I. Can also be thought of as the shift in
a 100(1-)100% confidence ellipse when Di  Fa , p ',n  p ' when the ith case is not used to fit the regression model.

COVRATIO – Measure of the impact of each observation on the variances (and standard errors) of the
3 p'
regression coefficients and their covariances. Values outside interval 1 
considered highly influential.
n
det( s(2i ) ( X (' i ) X (i ) ) 1 )
COVRATIO i 
det( s 2 ( X ' X ) 1 )

Variance Inflation Factor (VIF) – Measure of how highly correlated each independent variable is with
the other predictors in the model. Values larger than 10 for a predictor imply large inflation of standard
errors of regression coefficients due to this variable being in model.
1
where Rk2 is the coefficient of multiple determination when Xk is regressed on the p-1 remaining
2
1  Rk
independent variables.
VIFk 
Obtaining Influence Statistics and Studentized Residuals in SPSS
A. Choose ANALYZE, REGRESSION, LINEAR, and input the Dependent variable and set of Independent
variables from your model of interest (possibly having been chosen via an automated model selection
method).
B. Under STATISTICS, select Collinearity Diagnostics, Casewise Diagnostics and All Cases and
CONTINUE
C. Under PLOTS, select Y:*SRESID and X:*ZPRED. Also choose HISTOGRAM. These give a plot of
studentized residuals versus standardized predicted values, and a histogram of standardized residuals
(residual/sqrt(MSE)). Select CONTINUE.
D. Under SAVE, select Studentized Residuals, Cook’s, Leverage Values, Covariance Ratio, Standardized
DFBETAS, Standardized DFFITS. Select CONTINUE. The results will be added to your original data
worksheet.
Obtaining Influence Statistics and Studentized Residuals in SAS
PROC REG;
MODEL Y = X1 X2 … XP / R INFLUENCE VIF;
RUN;
Obtaining Influence Statistics and Studentized Residuals in R
reg1.reg <- lm(y ~ x1 + … + xp)
reg1.rstudent <- rstudent(reg1.reg)
reg1.inf <- influence.measures(reg1.reg)
Download