1 732A35/732G28

How to:
- Identify outliers
- Identify influential observations
To identify outliers:
- Scatterplot of X or Y
- Boxplot of X or Y
- Residuals vs X
- Residuals vs Y

Better tools?

Why bother about outliers?
- They may indicate errors in the data
- They may strongly affect the regression parameters, leading to wrong conclusions
Types:
- Outlying Y (1)
- Outlying X (2)
- Both (3)

[Figure: scatterplot of Y (0 to 45) against X (0 to 12) with three outlying cases labeled 1, 2 and 3]

Comment on which are influential.
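We studied before:
- Plots with residuals
- Plots with semistudentized residuals (these assume that all residuals have the same variance, estimated by MSE)

However, the variance may differ between residuals. The estimated standard deviation of the i-th residual is

$$ s(e_i) = \sqrt{MSE\,(1 - h_{ii})} $$

where $h_{ii}$ is the i-th diagonal element of $H = X(X'X)^{-1}X'$.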
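A minimal sketch of how $s(e_i)$ could be computed, assuming a simple linear regression (intercept and slope, so p = 2). The toy data below are hypothetical and are not the house data from these slides. For this special case the diagonal of H has the closed form $h_{ii} = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$, so no matrix library is needed.

```python
import math

# Hypothetical toy data (NOT the house data from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
y = [2.0, 4.1, 5.9, 8.2, 9.9, 30.0]
n, p = len(x), 2  # assumption: simple regression, p = 2 (intercept + slope)

# Least-squares fit of Y = b0 + b1 * X.
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# Residuals e_i and leverages h_ii (diagonal of the hat matrix H).
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]

# MSE and the estimated standard deviation of each residual.
mse = sum(ei ** 2 for ei in e) / (n - p)
s_e = [math.sqrt(mse * (1.0 - hi)) for hi in h]

for xi, hi, si in zip(x, h, s_e):
    print(f"x = {xi:5.1f}   h_ii = {hi:.3f}   s(e_i) = {si:.3f}")
```

The case at x = 12 lies far from the other X values, so it has the largest leverage and therefore the smallest $s(e_i)$.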
Studentized residuals

$$ r_i = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}} = \frac{e_i}{s(e_i)} $$

Rule of thumb: if $|r_i| > 2$, the residual is large.

- Studentized residuals = standardized residuals
- In Minitab, cases with $|r_i| > 2$ are labeled with an 'R' in the table of unusual observations
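The rule of thumb can be sketched as follows, again assuming a simple linear regression (p = 2) and hypothetical toy data in which one case has been shifted upward in Y. The function name `studentized_residuals` is illustrative and does not come from the slides or from Minitab.

```python
import math

P = 2  # assumption: simple regression with intercept and slope


def studentized_residuals(x, y):
    """Return r_i = e_i / sqrt(MSE * (1 - h_ii)) for a simple linear regression."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    mse = sum(ei ** 2 for ei in e) / (n - P)
    return [ei / math.sqrt(mse * (1.0 - hi)) for ei, hi in zip(e, h)]


# Hypothetical toy data: roughly Y = 1 + 2X, with the fifth case shifted up by 8.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [3.1, 4.9, 7.0, 9.1, 19.0, 12.9, 15.1, 16.9, 19.0, 21.1]

r = studentized_residuals(x, y)
flagged = [i for i, ri in enumerate(r) if abs(ri) > 2]  # 0-based case indices
print("cases with |r_i| > 2:", flagged)
```

Because the outlier itself inflates MSE, $|r_i|$ can never exceed $\sqrt{n - p}$, which is one motivation for the deleted residuals discussed next.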
Price (Y) and size (X) for a sample of 150 houses.

Case   Y (Price)   X (Square_Footage)
1      121.87      20.5
2      150.25      22
3      122.78      15.9
4      144.35      18.6
5      116.2       12.1
6      139.49      17.1
7      115.73      16.7
8      140.59      17.8
9      120.29      15.2
...    ...         ...
150    114.92      14.2

[Figure: scatterplot of the house data, axes labeled "Y (Price)" and "X (Square footage)"]
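Deleted residuals
Idea: an outlier may affect the regression coefficients, so remove observation i, refit, and compute the deleted residual (for each i):

$$ d_i = Y_i - \hat{Y}_{i(i)} $$

where $\hat{Y}_{i(i)}$ is the fitted value for case i from the regression without case i. Equivalently, without refitting:

$$ d_i = \frac{e_i}{1 - h_{ii}} $$

Compute the deleted residuals and search for outliers as before.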
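The equivalence of the two formulas can be checked numerically. The sketch below assumes a simple linear regression and hypothetical toy data; the helper names (`fit_line`, `deleted_residuals_refit`, `deleted_residuals_shortcut`) are illustrative only.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


def deleted_residuals_refit(x, y):
    """d_i = Y_i - Yhat_i(i): refit n times, each time leaving out case i."""
    d = []
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        d.append(y[i] - (b0 + b1 * x[i]))
    return d


def deleted_residuals_shortcut(x, y):
    """d_i = e_i / (1 - h_ii): one fit on all n cases."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    d = []
    for xi, yi in zip(x, y):
        e_i = yi - (b0 + b1 * xi)
        h_ii = 1.0 / n + (xi - xbar) ** 2 / sxx
        d.append(e_i / (1.0 - h_ii))
    return d


# Hypothetical toy data: roughly Y = 1 + 2X, with the fifth case shifted up by 8.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [3.1, 4.9, 7.0, 9.1, 19.0, 12.9, 15.1, 16.9, 19.0, 21.1]

d_refit = deleted_residuals_refit(x, y)
d_short = deleted_residuals_shortcut(x, y)
for a, b in zip(d_refit, d_short):
    print(f"refit: {a:8.4f}   shortcut: {b:8.4f}")
```

The shortcut needs a single fit, which matters when n is large.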
Studentized deleted residuals
Idea: studentize the deleted residuals to produce residuals with equal variance:

$$ t_i = e_i \left[ \frac{n - p - 1}{SSE\,(1 - h_{ii}) - e_i^2} \right]^{1/2} $$

Further:
- Examine large deviations
- Test whether the observation with the largest $|t_i|$ is an outlier:
  H0: the observation with the largest $|t_i|$ is not an outlier
  Ha: the observation with the largest $|t_i|$ is an outlier
  Test statistic $T = t_i$, critical value $t(1 - \alpha/(2n);\, n - p - 1)$, two-sided: reject H0 if $|t_i|$ exceeds the critical value.
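A sketch of the computation, assuming a simple linear regression (p = 2) and hypothetical toy data. The critical value $t(1 - \alpha/(2n);\, n - p - 1)$ is not computed here because the Python standard library has no t quantile function; it would have to come from a t table or a statistics package.

```python
import math

P = 2  # assumption: simple regression with intercept and slope


def studentized_deleted_residuals(x, y):
    """t_i = e_i * sqrt((n - p - 1) / (SSE * (1 - h_ii) - e_i^2))."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    sse = sum(ei ** 2 for ei in e)
    return [
        ei * math.sqrt((n - P - 1) / (sse * (1.0 - hi) - ei ** 2))
        for ei, hi in zip(e, h)
    ]


# Hypothetical toy data: roughly Y = 1 + 2X, with the fifth case shifted up by 8.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [3.1, 4.9, 7.0, 9.1, 19.0, 12.9, 15.1, 16.9, 19.0, 21.1]

t = studentized_deleted_residuals(x, y)
i_max = max(range(len(t)), key=lambda i: abs(t[i]))  # candidate outlier (0-based)
print(f"largest |t_i| is at case index {i_max}: t = {t[i_max]:.2f}")
# Compare abs(t[i_max]) with t(1 - alpha/(2n); n - p - 1) taken from a t table.
```

Unlike $r_i$, the statistic $t_i$ is not bounded by $\sqrt{n - p}$, because case i no longer contributes to the variance estimate used to scale it.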
Smoother matrix (hat matrix) $H = X(X'X)^{-1}X'$
Comments:
- $0 \le h_{ii} \le 1$
- $\sum_{i=1}^{n} h_{ii} = p$
- $h_{ii}$ (leverage) is a measure of the distance between $X_i$ and mean(X), so it is a measure for detecting outliers!
- $\hat{Y} = HY$, so $h_{ii}$ defines the contribution of the i-th observation $Y_i$ to the i-th fitted value $\hat{Y}_i$
- $s(e_i) = \sqrt{MSE\,(1 - h_{ii})}$: a large $h_{ii}$ gives a small $s(e_i)$, so the fitted value $\hat{Y}_i$ is close to $Y_i$
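For the special case of simple linear regression (one predictor plus an intercept, so p = 2, which is an assumption and not stated on this slide), the leverage has a closed form that makes the distance interpretation and the sum property explicit:

```latex
h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2},
\qquad
\sum_{i=1}^{n} h_{ii} = n \cdot \frac{1}{n} + \frac{\sum_{i}(x_i - \bar{x})^2}{\sum_{j}(x_j - \bar{x})^2} = 1 + 1 = 2 = p
```

The first term is the same for every case; only the squared distance from $\bar{x}$ distinguishes high-leverage cases from the rest.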
To find outliers:
1. Look for $h_{ii}$ larger than 2 * mean leverage, where the mean leverage is p/n:

$$ h_{ii} > \frac{2p}{n} $$

2. Alternative: 0.2 to 0.5 is moderate leverage, above 0.5 is high leverage

Ex. Define the threshold for our data.
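As a worked instance for the house data (n = 150), and assuming that the model is a simple linear regression of price on size so that p = 2 (the slides do not state p explicitly), the first rule gives:

```latex
\frac{2p}{n} = \frac{2 \cdot 2}{150} = \frac{4}{150} \approx 0.027
```

Under that assumption the mean leverage is only about 0.013, so the first rule flags cases well below the fixed 0.2 cutoff of the second rule.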
Hidden extrapolations
- When predicting Y for a new X, is the new X within the domain of the data?
  - Easy to check in two dimensions
- In general, for a new observation

$$ X_N = \begin{pmatrix} 1 \\ x_1^{new} \\ \vdots \\ x_{p-1}^{new} \end{pmatrix}, \qquad h_{NN} = X_N' (X'X)^{-1} X_N $$

- Check whether $h_{NN}$ is much larger than the other leverages
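A minimal sketch of this check for a simple linear regression, where $X_N = (1, x^{new})'$ and $h_{NN}$ reduces to $1/n + (x^{new} - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$. The X values used are the ten visible on the house-data slide, not the full sample of 150, and the function names and the new values 17.0 and 40.0 are hypothetical.

```python
def leverages(x):
    """Leverages h_ii of the observed cases in a simple linear regression."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]


def leverage_new(x, x_new):
    """h_NN = X_N'(X'X)^{-1} X_N for X_N = (1, x_new)' in simple regression."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return 1.0 / n + (x_new - xbar) ** 2 / sxx


# The ten X values shown on the house-data slide (a subset of the 150 cases).
x = [20.5, 22.0, 15.9, 18.6, 12.1, 17.1, 16.7, 17.8, 15.2, 14.2]

h_max = max(leverages(x))
for x_new in (17.0, 40.0):  # hypothetical new X values
    h_nn = leverage_new(x, x_new)
    verdict = "possible extrapolation" if h_nn > h_max else "within the range of leverages"
    print(f"x_new = {x_new:5.1f}   h_NN = {h_nn:.3f}   max h_ii = {h_max:.3f}   {verdict}")
```

With a single predictor the check is equivalent to looking at the range of X; the leverage form is what generalizes to several predictors, where a new point can be inside every marginal range and still be far from the data.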
Influential = a case that, if added, changes the fitted model considerably
Influence on a single fitted value: DFFITS
- Measure of the influence that case i has on the fitted value $\hat{Y}_i$.
Defined as

$$ (DFFITS)_i = \frac{\hat{Y}_i - \hat{Y}_{i(i)}}{\sqrt{MSE_{(i)}\, h_{ii}}} $$

Computed as ($t_i$ = studentized deleted residual)

$$ (DFFITS)_i = t_i \left( \frac{h_{ii}}{1 - h_{ii}} \right)^{1/2} $$
To find influential cases:
1. Small and medium data sets: $|DFFITS| > 1$
2. Large data sets: $|DFFITS| > 2\sqrt{p/n}$

Ex. Define the threshold for our data.
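A sketch of DFFITS and its large-sample cutoff, assuming a simple linear regression (p = 2). The toy data are hypothetical, and the names `dffits` and `dffits_threshold` are illustrative. The slides do not say where "medium" ends and "large" begins, so the choice of cutoff is left to the caller.

```python
import math

P = 2  # assumption: simple regression with intercept and slope


def dffits(x, y):
    """(DFFITS)_i = t_i * sqrt(h_ii / (1 - h_ii)), with t_i from a single fit."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    sse = sum(ei ** 2 for ei in e)
    out = []
    for ei, hi in zip(e, h):
        t_i = ei * math.sqrt((n - P - 1) / (sse * (1.0 - hi) - ei ** 2))
        out.append(t_i * math.sqrt(hi / (1.0 - hi)))
    return out


def dffits_threshold(n, p):
    """Cutoff 2 * sqrt(p / n) quoted above for large data sets."""
    return 2.0 * math.sqrt(p / n)


# Hypothetical toy data: roughly Y = 1 + 2X, except the last case,
# which is outlying in both X and Y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 15.0]
y = [3.1, 4.9, 7.0, 9.1, 10.9, 13.1, 14.9, 17.0, 19.1, 20.0]

values = dffits(x, y)
influential = [i for i, v in enumerate(values) if abs(v) > 1.0]  # small-data rule
print("cases with |DFFITS| > 1:", influential)
print("large-sample cutoff for n = 150, p = 2:", round(dffits_threshold(150, 2), 3))
```

For the house data with n = 150, and under the assumption p = 2, the large-sample cutoff is $2\sqrt{2/150} \approx 0.231$.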
Influence on all fitted values: Cook's distance
Measures the influence of the i-th case on all fitted values.
Defined as

$$ D_i = \frac{\sum_{j=1}^{n} \left( \hat{Y}_j - \hat{Y}_{j(i)} \right)^2}{p \cdot MSE} $$

Computed as:

$$ D_i = \frac{e_i^2}{p \cdot MSE} \left[ \frac{h_{ii}}{(1 - h_{ii})^2} \right] $$

Look at how $h_{ii}$ and $e_i$ influence $D_i$.
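The two expressions for $D_i$ can be compared numerically. The sketch below assumes a simple linear regression (p = 2) and hypothetical toy data; the function names are illustrative.

```python
P = 2  # assumption: simple regression with intercept and slope


def fit_line(x, y):
    """Least-squares intercept and slope for a simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


def cooks_distance_refit(x, y):
    """Definition: sum over j of (Yhat_j - Yhat_j(i))^2, divided by p * MSE."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - P)
    out = []
    for i in range(n):
        c0, c1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        num = sum(((b0 + b1 * xj) - (c0 + c1 * xj)) ** 2 for xj in x)
        out.append(num / (P * mse))
    return out


def cooks_distance_shortcut(x, y):
    """Computational form: e_i^2 / (p * MSE) * h_ii / (1 - h_ii)^2."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    mse = sum(ei ** 2 for ei in e) / (n - P)
    return [ei ** 2 / (P * mse) * hi / (1.0 - hi) ** 2 for ei, hi in zip(e, h)]


# Hypothetical toy data: roughly Y = 1 + 2X, except the last case,
# which is outlying in both X and Y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 15.0]
y = [3.1, 4.9, 7.0, 9.1, 10.9, 13.1, 14.9, 17.0, 19.1, 20.0]

d_def = cooks_distance_refit(x, y)
d_fast = cooks_distance_shortcut(x, y)
worst = max(range(len(x)), key=lambda i: d_fast[i])
print(f"largest Cook's distance at case index {worst}: D = {d_fast[worst]:.3f}")
```

In the computational form, $D_i$ grows with the squared residual and with $h_{ii}/(1 - h_{ii})^2$, so a case needs neither an extreme residual nor an extreme leverage on its own to be influential; a moderate amount of both is enough.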
Decision rule:
- If $D_i \ge F(0.5;\, p,\, n - p)$, the case is influential.

Ex. Define the threshold for our data.
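As a worked instance for the house data (n = 150), and assuming a simple linear regression so that p = 2 (an assumption, since the slides do not state p), the median of the F distribution can be written down without tables, because with 2 numerator degrees of freedom the F distribution function has a closed form:

```latex
P(F_{2,\nu} \le x) = 1 - \left( 1 + \frac{2x}{\nu} \right)^{-\nu/2}
\quad\Longrightarrow\quad
F(0.5;\, 2,\, \nu) = \frac{\nu}{2} \left( 2^{2/\nu} - 1 \right)

F(0.5;\, 2,\, 148) = 74 \left( 2^{1/74} - 1 \right) \approx 0.696
```

Under that assumption, a case with $D_i$ of roughly 0.7 or more would be called influential. For other values of p the quantile has to be taken from an F table or a statistics package.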
Ch 10.2-10.4, 10.6