Outliers
Outliers are data points that lie outside the general linear pattern whose midline is the
regression line. A rule of thumb is that outliers are points whose standardized residual exceeds 3.3 in
absolute value (corresponding to the .001 alpha level). Removing outliers from the data set under
analysis can at times dramatically affect the performance of a regression model. Outliers should be
removed if there is reason to believe that other variables not in the model explain why the outlier
cases are unusual; that is, these cases need a separate model. Alternatively, outliers may suggest that
additional explanatory variables need to be brought into the model (that is, the model needs
respecification). Another alternative is to use robust regression, whose algorithm gives less weight to
outliers but does not discard them.
The leverage statistic, h, also called the hat-value, is available to identify cases which
influence the regression model more than others.
•
Belsley, Kuh, and Welsch (1980) define the leverage ($h_i$) of the ith observation as
$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{(n-1)\,S_x^2}$$
Leverage assesses how far an observation's value of the independent variable is from the mean value: the
farther away the observation, the more leverage it has.
From the definition you can see that leverage is mitigated by a larger sample size (any single point
should have less influence) and by a larger variance of the independent variable (again, any single point
should have less influence).
•
0 ≤ h ≤ 1. The leverage statistic varies from 0 (no influence on the model) to 1 (completely
determines the model). With the definition above, the smallest attainable value is 1/n.
•
A rule of thumb is that cases with leverage under .2 are not a problem, but if a case has
leverage over .5, the case has undue leverage and should be examined for the possibility of
measurement error or the need to model such cases separately.
STATA command predict h, hat
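The leverage definition above can be checked by hand for a one-predictor regression. The following is a minimal Python sketch, not part of the Stata workflow in these notes; the data values and the function name `leverage` are made up for illustration. It uses the fact that (n-1) times the sample variance of x equals the sum of squared deviations from the mean.

```python
def leverage(x):
    """Hat-values for a one-predictor regression:
    h_i = 1/n + (x_i - mean)^2 / ((n - 1) * S_x^2)."""
    n = len(x)
    mean_x = sum(x) / n
    # (n - 1) * S_x^2 is the sum of squared deviations from the mean
    ssx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / ssx for xi in x]


# Hypothetical data: the last value lies far from the mean of x
x = [1.0, 2.0, 3.0, 4.0, 10.0]
h = leverage(x)

for xi, hi in zip(x, h):
    # Rule of thumb from the text: leverage over .5 deserves a closer look
    note = "examine" if hi > 0.5 else ""
    print(f"x = {xi:5.1f}   h = {hi:.3f}   {note}")
```

In this made-up sample the point at x = 10 is the only one that crosses the .5 rule of thumb, and the hat-values add up to 2, the number of estimated coefficients (intercept and slope).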
Cook's distance, D, is another measure of the influence of a case. Cook's distance measures the
effect of deleting a given observation.
•
Observations with larger D values than the rest of the data are those which have unusual
influence on the fitted model.
•
D > 4/n is the criterion to indicate a possible problem.
STATA command predict D, cooksd
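Because Cook's distance measures the effect of deleting an observation, it can be illustrated by literally refitting the model without each case. The sketch below does this in plain Python for a one-predictor regression; the data and the helper names `fit_line` and `cooks_distance` are hypothetical, and the scaling by p times the mean squared error (p = 2 coefficients here) follows the usual textbook definition rather than anything stated in these notes.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def cooks_distance(x, y):
    """Cook's D by deletion: how far all fitted values move when case i is dropped."""
    n, p = len(x), 2                      # p = intercept + slope
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - p)
    d = []
    for i in range(n):
        # Refit without case i, then predict for every case (including i)
        a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        shift = sum((fi - (a0 + a1 * xi)) ** 2 for fi, xi in zip(fitted, x))
        d.append(shift / (p * mse))
    return d


# Hypothetical data: the last case has high leverage and is off the trend
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [2.0, 3.0, 4.0, 5.0, 4.0]
d = cooks_distance(x, y)
cutoff = 4 / len(x)                       # the D > 4/n criterion

for i, di in enumerate(d):
    note = "possible problem" if di > cutoff else ""
    print(f"case {i + 1}: D = {di:.4f}   {note}")
```

With these made-up numbers only the last case exceeds 4/n, because dropping it changes the fitted line for every other case.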
DFBETAS is another statistic for assessing the influence of a case. It measures how much a
coefficient changes when the case is dropped.
•
If DFBETAS > 0, the case increases the slope; if < 0, the case decreases the slope.
•
The case may be considered an influential outlier if $|\mathrm{DFBETAS}| > 2/\sqrt{n}$.
STATA command dfbeta creates DFBETAs for all variables,
or predict DFx1, dfbeta(x1) for individual variables
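As a rough illustration of what the dfbeta statistic captures, the Python sketch below drops each case in turn from a one-predictor regression and scales the change in the slope by a standard error computed without that case. This scaling is the Belsley, Kuh, and Welsch convention as I understand it; the data and function names are hypothetical, and the sketch is not a substitute for Stata's own calculation.

```python
import math


def fit_line(x, y):
    """Least-squares intercept, slope, and sum of squares of x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1, sxx


def dfbetas_slope(x, y):
    """Scaled change in the slope when each case is deleted."""
    _, b1, sxx = fit_line(x, y)
    out = []
    for i in range(len(x)):
        xd, yd = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        a0, a1, _ = fit_line(xd, yd)
        # Residual standard deviation with case i left out
        sse = sum((yj - a0 - a1 * xj) ** 2 for xj, yj in zip(xd, yd))
        s_del = math.sqrt(sse / (len(xd) - 2))
        # Positive value: keeping the case increases the slope
        out.append((b1 - a1) / (s_del / math.sqrt(sxx)))
    return out


# Hypothetical data: the last response is unusually high
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 9.0]
dfb = dfbetas_slope(x, y)
cutoff = 2 / math.sqrt(len(x))            # the 2/sqrt(n) rule

for i, value in enumerate(dfb):
    note = "influential" if abs(value) > cutoff else ""
    print(f"case {i + 1}: DFBETAS(slope) = {value:8.3f}   {note}")
```

The last case has a large positive value, meaning that including it pulls the slope upward, which matches the sign rule given above.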
DFITS measures how much the fitted value for a case changes as a result of that observation being
dropped from analysis.
•
DFITS is defined as
$$\mathrm{DFITS}_i = \mathrm{Rstudent}_i \sqrt{\frac{h_i}{1 - h_i}}$$
where Rstudent is the studentized residual.
•
The case may be considered an influential outlier if $|\mathrm{DFITS}| > 2\sqrt{k/n}$.
STATA command predict DFITS, dfits
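The DFITS definition combines the studentized residual with leverage, so a case can be flagged either because it fits badly or because it sits far out in x. A minimal Python sketch of the formula follows; the (Rstudent, h) pairs are invented for illustration, and the cutoff used is the conventional 2*sqrt(k/n) with k = 4 and n = 153 taken from the example later in these notes.

```python
import math


def dfits(rstudent, h):
    """DFITS_i = Rstudent_i * sqrt(h_i / (1 - h_i))."""
    return rstudent * math.sqrt(h / (1 - h))


def dfits_cutoff(k, n):
    """Conventional size-adjusted cutoff 2 * sqrt(k / n)."""
    return 2 * math.sqrt(k / n)


k, n = 4, 153
cutoff = dfits_cutoff(k, n)

# Hypothetical (Rstudent, leverage) pairs
cases = [(0.5, 0.02), (2.5, 0.02), (2.5, 0.30)]
for rstudent, h in cases:
    value = dfits(rstudent, h)
    note = "influential" if abs(value) > cutoff else ""
    print(f"Rstudent = {rstudent:4.1f}  h = {h:.2f}  DFITS = {value:6.3f}  {note}")
```

The second and third rows share the same residual; raising the leverage from .02 to .30 alone multiplies DFITS several times over.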
Studentized residuals and deleted studentized residuals are also used to detect outliers with high
leverage. A "studentized residual" is the observed residual divided by its estimated standard deviation. The
"studentized deleted residual," also called the "jackknife residual," is the observed residual divided by the
standard deviation computed with the given observation left out of the analysis. Analysis of outliers
usually focuses on deleted residuals. Other synonyms include externally studentized residual or,
misleadingly, standardized residual. There will be a t value for each residual, with df = n - k - 1, where
k is the number of independent variables. When t exceeds the critical value for a given alpha level (e.g.,
.05), the case is considered an outlier. In a plot of deleted studentized residuals versus ordinary
residuals, one may draw lines at plus and minus two standard units to highlight cases outside the range
where 95% of the cases normally lie; points substantially off the straight line are potential leverage
problems.
STATA command predict student, rstudent
or predict standard, rstandard
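The difference between the two residuals is only which standard deviation goes in the denominator. The Python sketch below computes both for a one-predictor regression, obtaining the deleted version by actually refitting without each case. The data and helper names are hypothetical; in Stata the corresponding quantities come from rstandard and rstudent.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def residual_sd(x, y):
    """Residual standard deviation of the fitted line (n - 2 df)."""
    b0, b1 = fit_line(x, y)
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (len(x) - 2))


def studentized(x, y):
    """Return (standardized, deleted studentized) residual pairs."""
    n = len(x)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    s_all = residual_sd(x, y)
    pairs = []
    for i in range(n):
        e = y[i] - (b0 + b1 * x[i])
        h = 1 / n + (x[i] - mx) ** 2 / sxx
        # Standard deviation re-estimated with case i left out
        s_del = residual_sd(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        scale = math.sqrt(1 - h)
        pairs.append((e / (s_all * scale), e / (s_del * scale)))
    return pairs


# Hypothetical data: the last response is unusually high
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 9.0]
pairs = studentized(x, y)

for i, (r_std, r_del) in enumerate(pairs):
    print(f"case {i + 1}: standardized = {r_std:7.3f}   deleted = {r_del:7.3f}")
```

For the outlying last case the deleted residual is much larger than the standardized one, because the outlier no longer inflates the standard deviation it is judged against. This is why outlier analysis usually focuses on deleted residuals.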
Partial regression plots, also called partial regression leverage plots or added variable plots, are
yet another way of detecting influential sets of cases. Partial regression plots are a series of bivariate
regression plots of the dependent variable with each of the independent variables in turn. The plots
show cases by number or label instead of dots. One looks for cases which are outliers on all or many of
the plots.
STATA command avplots
Example using the Murder.dta data set.
reg mrdrte exec unem d90 d93

      Source |       SS       df       MS              Number of obs =     153
-------------+------------------------------           F(  4,   148) =    3.05
       Model |  977.390644     4  244.347661           Prob > F      =  0.0190
    Residual |  11867.9475   148  80.1888343           R-squared     =  0.0761
-------------+------------------------------           Adj R-squared =  0.0511
       Total |  12845.3381   152  84.5088034           Root MSE      =  8.9548

------------------------------------------------------------------------------
      mrdrte |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        exec |   .1627547   .1939295     0.84   0.403    -.2204738    .5459832
        unem |   1.390786   .4508653     3.08   0.002     .4998207    2.281751
         d90 |   2.675335   1.816934     1.47   0.143      -.91515     6.26582
         d93 |   1.607317   1.774768     0.91   0.367    -1.899842    5.114476
       _cons |  -1.864393   3.069517    -0.61   0.545    -7.930134    4.201349
------------------------------------------------------------------------------

predict e, residual
predict yhat
predict standard, rstandard
predict student, rstudent
predict h, hat
predict D, cooksd
predict DFITS, dfits
predict W, welsch
dfbeta
    DFexec: DFbeta(exec)
    DFunem: DFbeta(unem)
    DFd90: DFbeta(d90)
    DFd93: DFbeta(d93)
rvfplot
[Figure: residuals versus fitted values]
To create a standardized residual plot
graph twoway scatter standard yhat, yline(0)
[Figure: standardized residuals versus fitted values]
To identify the outliers.
graph twoway scatter standard yhat, yline(0) mlabel(state)
[Figure: standardized residuals versus fitted values, with points labeled by state]
avplots, mlabel(state)
[Figure: added-variable plots, with points labeled by state, in four panels]
e( mrdrte | X ) versus e( exec | X ): coef = .16275467, se = .19392954, t = .84
e( mrdrte | X ) versus e( unem | X ): coef = 1.3907856, se = .45086525, t = 3.08
e( mrdrte | X ) versus e( d90 | X ): coef = 2.6753348, se = 1.8169343, t = 1.47
e( mrdrte | X ) versus e( d93 | X ): coef = 1.6073174, se = 1.774768, t = .91
To graphically measure the influence of observations
graph twoway scatter standard yhat [aweight=D], msymbol(oh) yline(0)
[Figure: standardized residuals versus fitted values, with markers weighted by Cook's D]
The leverage versus residual-squared plot marks the means of leverage and squared residuals. Leverage
tells us how much potential an observation has for influencing the regression.
lvr2plot
To examine the numerical measures for outliers, compare each statistic with its cutoff:
leverage h > 2k/n
Cook's D > 4/n
$|\mathrm{DFITS}| > 2\sqrt{k/n}$
Welsch's $|W| > 3\sqrt{k}$
$|\mathrm{DFBETA}| > 2/\sqrt{n}$
For example,
sort D
list state yhat D DFITS W in -5/l

       state       yhat          D       DFITS           W
149.   WV      13.15609   .0150447   -.2742131   -3.513339
150.   DC      6.897557   .0452811    .4927538    6.137678
151.   TX      15.01208   .0504226    -.500824   -8.794935
152.   DC      9.990127   .2905912    1.547051     19.3077
153.   DC       11.5646   .4116747    1.832155     22.9866
Recall that
D > 4/n indicates a possible problem.
With n=153, D > .02614379
$|\mathrm{DFITS}| > 2\sqrt{k/n}$ may be considered an influential outlier.
With n=153, k=4, |DFITS| > .3233808
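The five listed cases can be checked against the cutoffs mechanically. The short Python sketch below copies the D, DFITS, and W values from the listing above and applies the conventional cutoffs 4/n, 2*sqrt(k/n), and 3*sqrt(k), with n = 153 and k = 4 as used in these notes. The variable names are illustrative, and the cutoffs are rules of thumb rather than formal tests.

```python
import math

n, k = 153, 4
cutoffs = {
    "D": 4 / n,                     # Cook's distance
    "DFITS": 2 * math.sqrt(k / n),  # size-adjusted DFITS cutoff
    "W": 3 * math.sqrt(k),          # Welsch distance
}

# (state, D, DFITS, W) copied from the listing of the five largest D values
cases = [
    ("WV", 0.0150447, -0.2742131, -3.513339),
    ("DC", 0.0452811, 0.4927538, 6.137678),
    ("TX", 0.0504226, -0.500824, -8.794935),
    ("DC", 0.2905912, 1.547051, 19.3077),
    ("DC", 0.4116747, 1.832155, 22.9866),
]

flagged = []
for state, d, dfits, w in cases:
    hits = []
    if d > cutoffs["D"]:
        hits.append("D")
    if abs(dfits) > cutoffs["DFITS"]:
        hits.append("DFITS")
    if abs(w) > cutoffs["W"]:
        hits.append("W")
    if hits:
        flagged.append((state, hits))
    print(f"{state}: exceeds {', '.join(hits) if hits else 'no cutoff'}")
```

On these values the WV case stays under all three cutoffs, while TX and the three DC observations exceed every one of them.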