Stat 301 – Lecture 27 Unusual Points Box Plot – Potential Outliers

Stat 301 – Lecture 27
Unusual Points

Outlier box plot identifies
potential outliers.
[Figure: Normal Quantile Plot with histogram; vertical axis: Count.]
Box Plot – Potential Outliers
Vehicle Name                       Highway MPG   Predicted MPG   Residual
Honda Civic HX 2dr                     44            35.4           8.6
Toyota Echo 2dr manual                 43            35.5           7.5
Toyota Prius 4dr (gas/electric)        51            35.3          15.7
Volkswagen Jetta GLS TDI 4dr           46            33.4          12.6
Outlier

How do we determine if a
potential outlier identified on
the box plot is statistically
significant?
Unusual Points in Regression
Outlier – a point with an
unusually large residual.
High leverage point – a point
with an extreme value for one
or more of the explanatory
variables.

Influential Points

Does a point influence where
the regression line goes?



An outlier can.
A high leverage point can.
Is that point statistically
significant in terms of
influence?
Simple Linear Regression
Example – 50 mammals
Response variable: gestation
(length of pregnancy), in days
Explanatory variable: brain weight, in grams

[Figure: Distributions – Gestation. Histogram of gestation (days); vertical axis: Count.]
Gestation (days)
Skewed to the right.
Two potential outliers.
Mean = 117.4 days
Median = 65.5 days
Values from 12 days to 440
days.

[Figure: Distributions – BrainWgt. Histogram of brain weight (g); vertical axis: Count.]
Brain Weight
Highly skewed to the right with
several mounds.
Six potential outliers.
Mean = 107.25 g
Median = 16.25 g
Values from 0.14 g to 1320 g

Simple Linear Regression

Trying to explain variation in
the response (gestation) by
relating the response to the
explanatory variable (brain
weight).
Regression Residuals
$\text{residual} = y - \hat{y}$

Those observations that do not
follow the general trend will have
residuals that are far from zero,
either positive or negative.
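
As a small illustration of this definition, the sketch below computes residual = observed - predicted for the four vehicles in the highway MPG table earlier in the lecture. The variable names are made up for this example.

```python
# Observed and predicted highway MPG, copied from the vehicle table earlier
# in the lecture: name -> (observed, predicted).
vehicles = {
    "Honda Civic HX 2dr": (44, 35.4),
    "Toyota Echo 2dr manual": (43, 35.5),
    "Toyota Prius 4dr (gas/electric)": (51, 35.3),
    "Volkswagen Jetta GLS TDI 4dr": (46, 33.4),
}

# residual = y - y_hat, rounded to one decimal place like the table.
residuals = {
    name: round(observed - predicted, 1)
    for name, (observed, predicted) in vehicles.items()
}

for name, residual in residuals.items():
    print(f"{name:34s} residual = {residual:5.1f}")
```

All four residuals are positive, meaning each vehicle gets better highway mileage than the fitted trend predicts.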
Regression Outlier


A residual very far from zero,
either negative or positive, will
be called an outlier for
regression.
An outlier for regression
corresponds to a value of the
response that does not match
the overall trend.
Simple Linear Regression



Predicted Gestation = 85.25 +
0.30*Brain Weight
R² = 0.372, so only 37.2% of the
variation in gestation is explained
by the linear relationship with brain
weight.
RMSE = 85.1 days
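
The fitted line can be used directly as a prediction rule. The sketch below is only an illustration: the function name is hypothetical, and it uses the rounded coefficients printed on the slide, so predictions may differ in the last digit from software output based on unrounded coefficients.

```python
def predicted_gestation(brain_weight_g):
    """Predicted gestation (days) from brain weight (g).

    Uses the rounded coefficients from the slide:
    Predicted Gestation = 85.25 + 0.30 * Brain Weight.
    """
    return 85.25 + 0.30 * brain_weight_g


# Example: a mammal with a 169 g brain.
example_weight = 169
print(f"{example_weight} g -> {predicted_gestation(example_weight):.2f} days")
```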
Simple Linear Regression

The model is useful.


F = 28.49, P-value < 0.0001
This also indicates that there is
a statistically significant linear
relationship between brain
weight and gestation.
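
In simple linear regression the F statistic is tied to R² by F = (R² / (1 − R²)) × (n − 2). The sketch below checks this against the reported values. It uses the rounded R² = 0.372, so it reproduces F = 28.49 only approximately; the variable names are illustrative.

```python
n = 50             # number of mammals
r_squared = 0.372  # rounded value from the slide

# F statistic for a simple linear regression (1 model df, n - 2 error df).
f_statistic = (r_squared / (1 - r_squared)) * (n - 2)

print(f"F computed from rounded R^2: {f_statistic:.2f}")
```

The small gap from the reported 28.49 is consistent with rounding in R².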
[Figure: plot of Residual versus BrainWgt.]
Unusual Points
The mammal with a brain
weight around 1300 g has the
residual furthest from zero on
the negative side.
There are other mammals with
residuals of similar
magnitude on the positive side.

Outlier Box Plot

Start with the five-number summary:
Minimum = –214.1
25% Quartile = –57.9
50% Median = –31.1
75% Quartile = 36.7
Maximum = 256.1


InterQuartile Range (IQR)

IQR = 75% Quart – 25% Quart
IQR = 36.7 – (–57.9) = 94.6

Upper = 75% Quart + 1.5*IQR
Upper = 36.7 + 141.9 = 178.6

Lower = 25% Quart – 1.5*IQR
Lower = –57.9 – 141.9 = –199.8
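
The arithmetic above can be reproduced in a few lines. This is a minimal sketch using the quartiles from the five-number summary; the variable names are chosen for this example.

```python
# Quartiles of the regression residuals, from the five-number summary.
q1 = -57.9  # 25% quartile
q3 = 36.7   # 75% quartile

iqr = q3 - q1       # interquartile range
step = 1.5 * iqr    # distance from each quartile to its bound

upper = q3 + step   # upper bound for flagging potential outliers
lower = q1 - step   # lower bound for flagging potential outliers

print(f"IQR = {iqr:.1f}, 1.5*IQR = {step:.1f}")
print(f"Upper = {upper:.1f}, Lower = {lower:.1f}")
```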
Outlier Box Plot
Any point above the Upper bound or
below the Lower bound will be flagged
as a potential outlier.
Lines extend to the most
extreme points inside the Lower
and Upper bounds.
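
The flagging rule is a simple comparison against the two bounds. The sketch below applies it to the five-number summary of the residuals; the helper name `is_potential_outlier` is hypothetical.

```python
LOWER = -199.8  # 25% quartile - 1.5*IQR
UPPER = 178.6   # 75% quartile + 1.5*IQR


def is_potential_outlier(value, lower=LOWER, upper=UPPER):
    """Return True if the value falls outside the box plot bounds."""
    return value < lower or value > upper


# Five-number summary of the regression residuals.
five_number = {
    "Minimum": -214.1,
    "25% Quartile": -57.9,
    "Median": -31.1,
    "75% Quartile": 36.7,
    "Maximum": 256.1,
}

flags = {label: is_potential_outlier(v) for label, v in five_number.items()}
for label, flagged in flags.items():
    print(f"{label:13s} {five_number[label]:7.1f}  flagged: {flagged}")
```

Both the minimum and the maximum residual fall outside the bounds, so each end of the distribution has at least one potential outlier.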

[Figure: Distributions – Residual Gestation. Histogram of the residuals; vertical axis: Count.]
Regression Outliers
                  Brain Weight   Gestation    Pred     Resid
Brazilian Tapir      169 g       392 days    135.9     256.1
“Man”               1320 g       267 days    481.1    –214.1
Okapi                490 g       440 days    232.2     207.8
Comments
The residual for “Man” is not
the most extreme.
The residual for the Brazilian
Tapir is the furthest from zero.

Text Definition

Our text defines an outlier in
regression as a value with a
residual that is more than 3
standard deviations away from
zero.
Standardized Residual
$z = \dfrac{\text{residual}}{\text{RMSE}}$
A standardized residual should
follow a standard normal
distribution.
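
As a quick check of the formula, the sketch below standardizes the three flagged residuals using RMSE = 85.1 days. The dictionary layout is only for illustration.

```python
RMSE = 85.1  # days, from the fitted regression

# Residuals of the three potential outliers flagged on the box plot.
residuals = {
    "Brazilian Tapir": 256.1,
    "Man": -214.1,
    "Okapi": 207.8,
}

# Standardized residual: z = residual / RMSE.
z_scores = {name: r / RMSE for name, r in residuals.items()}

for name, z in z_scores.items():
    print(f"{name:16s} z = {z:6.2f}")
```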
Text Definition

Our text defines an outlier in
regression as a value with a
standardized residual less than
–3 or greater than +3.
Standardized Residual
                  Residual      z      Outlier?
Brazilian Tapir     256.1      3.01    Yes
“Man”              –214.1     –2.52    No
Okapi               207.8      2.44    No
Computing a P-value
JMP – Col – Formula
(1 – Normal Distribution(|z|))*2
where |z| is the absolute value of
z.
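
For readers working outside JMP, the same two-sided P-value can be sketched in Python. This assumes, as the formula implies, that JMP's Normal Distribution function returns the standard normal cumulative probability; the function name `two_sided_p_value` is made up for this example.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Two-sided P-value for a standardized residual z.

    Mirrors the JMP formula (1 - Normal Distribution(|z|)) * 2,
    using the standard normal CDF from the Python standard library.
    """
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Example: the standardized residual for the Brazilian Tapir.
print(f"P-value for z = 3.01: {two_sided_p_value(3.01):.4f}")
```

Taking the absolute value first means positive and negative residuals of the same size get the same P-value.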

Standardized Residual
                  Residual      z      P-value
Brazilian Tapir     256.1      3.01    0.0026
“Man”              –214.1     –2.52    0.0119
Okapi               207.8      2.44    0.0146
Caution
We are essentially doing 50
tests of hypothesis.
If each test has a chance of
error of 5%, then one would
expect to see some P-values
less than 0.05 just by chance.
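
To put a rough number on "some", the sketch below computes how many of the 50 tests would be expected to fall below 0.05 by chance alone. The second calculation additionally assumes the 50 tests are independent, which is only an approximation for regression residuals.

```python
n_tests = 50   # one test per residual
alpha = 0.05   # chance of error for each individual test

# Expected number of P-values below alpha when no residual is truly unusual.
expected_false_flags = n_tests * alpha

# Chance of at least one P-value below alpha, ASSUMING independent tests
# (an approximation; residuals from one regression are not fully independent).
prob_at_least_one = 1 - (1 - alpha) ** n_tests

print(f"Expected false flags: {expected_false_flags:.1f}")
print(f"P(at least one false flag): {prob_at_least_one:.3f}")
```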

Bonferroni Correction

Adjust what is a small P-value.

$\dfrac{0.05}{\text{\# of residuals}} = \dfrac{0.05}{50} = 0.001$

If a P-value is less than 0.001, then
the standardized residual is
statistically significant.
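
Applied to the three P-values computed earlier, the correction can be sketched as follows. The names are illustrative, and the P-values are the rounded ones from the standardized residual table.

```python
alpha = 0.05
n_residuals = 50

# Bonferroni-adjusted cutoff for what counts as a small P-value.
cutoff = alpha / n_residuals

# P-values from the standardized residual table (based on RMSE = 85.1).
p_values = {
    "Brazilian Tapir": 0.0026,
    "Man": 0.0119,
    "Okapi": 0.0146,
}

significant = {name: p < cutoff for name, p in p_values.items()}

print(f"Cutoff = {cutoff:.3f}")
for name, p in p_values.items():
    print(f"{name:16s} P-value = {p:.4f}  significant: {significant[name]}")
```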
Conclusion

Although some of the residuals were
flagged on the outlier box plot, and
the residual for the Brazilian Tapir
meets the text definition, none were
deemed statistically significant once
we corrected for doing 50
simultaneous tests.
Comment


Because the 3 potential outliers
have residuals far from zero, they
inflate the value of RMSE.
Is there a way to evaluate these
potential outliers using a value of
RMSE more representative of the
remaining values?
Adjust RMSE

Variance(residuals) = 7091.2704
Adjust by subtracting off the
squared residuals for the potential
outliers.
Adjust RMSE
                  Residual    Residual²
Brazilian Tapir     256.1     65587.21
“Man”              –214.1     45838.81
Okapi               207.8     43180.84
Adjust RMSE
Sum of Squares Residual = Variance(residuals) × (n – 1):
7091.2704(49) = 347472.2496
Subtract off
(65587.21+45838.81+43180.84) =
154606.86
Adjusted Sum of Squares Residual
347472.25 – 154606.86 = 192865.39
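
The same bookkeeping, written out as a short sketch with illustrative variable names:

```python
n = 50                          # number of mammals
variance_residuals = 7091.2704  # sample variance of the residuals

# Sum of squared residuals, recovered from the sample variance.
ss_residual = variance_residuals * (n - 1)

# Residuals of the three potential outliers.
outlier_residuals = [256.1, -214.1, 207.8]
ss_outliers = sum(r ** 2 for r in outlier_residuals)

# Sum of squares with the three potential outliers taken out.
ss_adjusted = ss_residual - ss_outliers

print(f"SS residual = {ss_residual:.4f}")
print(f"SS outliers = {ss_outliers:.2f}")
print(f"SS adjusted = {ss_adjusted:.2f}")
```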

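Adjust RMSE

$\text{New RMSE} = \sqrt{\dfrac{192865.39}{45}} = 65.467$

The divisor 45 is the original 48 degrees of freedom (50 – 2) less the 3 potential outliers.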
Calculate new standardized
residuals.
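
The new RMSE can be checked numerically. The divisor is an inference rather than something stated outright: dividing the adjusted sum of squares by 45 (50 observations, minus 2 estimated coefficients, minus the 3 set-aside points) reproduces the reported 65.467.

```python
import math

ss_adjusted = 192865.39  # adjusted sum of squares residual

n = 50          # observations
n_coeffs = 2    # intercept and slope
n_removed = 3   # potential outliers set aside

# ASSUMPTION: degrees of freedom inferred from the reported New RMSE.
df_adjusted = n - n_coeffs - n_removed

new_rmse = math.sqrt(ss_adjusted / df_adjusted)
print(f"df = {df_adjusted}, New RMSE = {new_rmse:.3f}")
```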
New Standardized Residual
                  Residual      z      P-value
Brazilian Tapir     256.1      3.91    0.00009
“Man”              –214.1     –3.27    0.00108
Okapi               207.8      3.17    0.00150
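
As a sketch of how these values compare with the Bonferroni cutoff of 0.05/50 = 0.001 from earlier in the lecture, the code below recomputes z and the two-sided P-value with the new RMSE. The names are illustrative, and the P-values may differ from the table in the last digit because of rounding.

```python
from statistics import NormalDist

NEW_RMSE = 65.467   # adjusted RMSE
CUTOFF = 0.05 / 50  # Bonferroni cutoff for 50 residuals

residuals = {
    "Brazilian Tapir": 256.1,
    "Man": -214.1,
    "Okapi": 207.8,
}

results = {}
for name, r in residuals.items():
    z = r / NEW_RMSE
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided P-value
    results[name] = (z, p, p < CUTOFF)
    print(f"{name:16s} z = {z:6.2f}  P-value = {p:.5f}  below cutoff: {p < CUTOFF}")
```

With the adjusted RMSE, only the Brazilian Tapir's P-value falls below 0.001; the other two remain slightly above it.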