Module Five: Outlier Detection for One Sample Case, Box Plot and h

Module Five: Outlier Detection for One
Sample Case
In Module Four, we discussed methods for assessing the normality of a response variable and ways of dealing with extremes, if any exist.
In this unit, we discuss both numerical and modern graphical methods for detecting extremes.
We start with the one-variable case in inter-laboratory testing studies, and extend to two-sample cases in Module Six.
Detecting outliers for one variable case
Consider the TAPPI inter-laboratory testing study, in which 87 labs participated to test Sample GR35. The data reported were the lab averages.
NOTE: Often each lab tests the same sample twice or more, so that within-lab variability can be investigated as well as between-lab variability. Before within- and between-lab variability can be analyzed adequately, it is critical that the testing procedure for each lab is standardized and that the testing process is under statistical control. If very unusual testing results are found, one should look for possible causes and decide whether to keep or delete the outliers in further analysis.
First thing to do in detecting outliers
The detection of outliers is usually a preliminary analysis to ensure the reliability of the data. Before applying any numerical or graphical approach, it is common practice to do the following to identify obvious mistakes from sampling or testing:
1. A quick visual check through the data values to identify obvious typos or values that are impossible in the context of the study, for example, a misplaced decimal point, or a value completely outside the possible range of the testing results.
2. A quick computation of descriptive statistics; the minimum and maximum help us quickly spot typos or impossible data values as well.
3. Once these are done, we can apply numerical and graphical methods to investigate ‘not-so-obvious’ outliers.
Graphical and Numerical methods for Detecting Outliers
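1. The use of the Empirical Rule for identifying outliers
Empirical Rule: When the distribution of the data is mound-shaped, a data value more than two s.d. away from the mean may be called a possible outlier (extreme), since there is only about a 2.5% chance of falling more than two s.d. below the mean, and about a 2.5% chance of falling more than two s.d. above it.
[Figure: mound-shaped curve centered at $\bar{x}$, with tail areas of .025 below $\bar{x} - 2s$ and .025 above $\bar{x} + 2s$.]
Note: The Empirical Rule is stated in terms of the population mean $\mu$ and s.d. $\sigma$. Since $(\mu, \sigma)$ are not known, we estimate them from the sample and replace $\mu$ by $\bar{x}$ and $\sigma$ by $s$.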
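The two-s.d. screen can be sketched in a few lines of Python. This is only an illustration: the function name `empirical_rule_flags` and the `readings` list are made up for the example and are not part of the course data.

```python
from statistics import mean, stdev

def empirical_rule_flags(values, k=2.0):
    """Return (lower, upper, flagged): limits x_bar -/+ k*s and the values outside them."""
    x_bar = mean(values)
    s = stdev(values)  # sample s.d. (divisor n - 1)
    lower, upper = x_bar - k * s, x_bar + k * s
    flagged = [v for v in values if v < lower or v > upper]
    return lower, upper, flagged

# Hypothetical readings, invented for illustration only
readings = [112, 108, 115, 120, 110, 118, 105, 114, 116, 109, 111, 113, 117, 107, 180]
lower, upper, flagged = empirical_rule_flags(readings)
print(f"limits: ({lower:.1f}, {upper:.1f}); possible outliers: {flagged}")
```

Note that a single extreme value inflates both the mean and s, so this screen can miss outliers in small samples; that is one motivation for the box plot and the deleted residuals discussed later.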
2. Box-Plot for detecting outliers
A popular graphical tool for detecting outliers of a
variable is the Box-Plot:
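[Figure: box plot on a horizontal axis with the marks Q1 - 3 IQR, Q1 - 1.5 IQR, Q1, median, mean $\bar{x}$, Q3, Q3 + 1.5 IQR and Q3 + 3 IQR.]
where the inter-quartile range, IQR = Q3 - Q1, is the range of the middle 50% of the data values.
NOTE: Points marked * lie between 1.5 IQR and 3 IQR beyond the quartiles and are possible outliers. Points more than 3 IQR beyond the quartiles are very likely outliers. These data values are very far away from the center (mean and median).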
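The fences marked on the box plot can be computed directly. The sketch below is a minimal illustration with invented data; the helper names are hypothetical, and the quartile convention (`method="exclusive"` in Python's `statistics.quantiles`) is an assumption, since software packages differ slightly in how they interpolate quartiles.

```python
from statistics import quantiles

def box_plot_fences(values):
    """Return Q1, Q3, IQR and the inner (1.5 IQR) and outer (3 IQR) fences."""
    q1, _median, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return q1, q3, iqr, inner, outer

def classify_outliers(values):
    """Split values into possible outliers (between fences) and likely outliers (beyond outer fences)."""
    _q1, _q3, _iqr, inner, outer = box_plot_fences(values)
    possible = [v for v in values
                if (outer[0] <= v < inner[0]) or (inner[1] < v <= outer[1])]
    likely = [v for v in values if v < outer[0] or v > outer[1]]
    return possible, likely

# Hypothetical blood-pressure-like values, invented for illustration only
sample = [100, 104, 106, 108, 110, 112, 112, 114, 116, 118, 120, 150, 210]
possible, likely = classify_outliers(sample)
print("possible outliers:", possible)
print("likely outliers:", likely)
```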
Revisit the blood pressure data for 15-20 years old
young adults
[Figure: Box plot of systolic blood pressure for 15-20 year-old young adults. Y-axis: Systolic Blood Pressure. Sample mean = 114.590, median = 112. The value 210 is marked as a likely outlier; values such as 170 and 70 are marked as possible outliers.]
Box plot of the systolic blood pressure shows that 210 is
a likely outlier, and several others are possible outliers
(such as 170, and 70).
This plot can easily be drawn by hand. Minitab also provides it. Here are the steps for constructing a box plot using Minitab:
1. Go to the Graph menu and choose Box Plot.
2. In the dialog box, enter the variable name in Y. To draw separate box plots by gender, enter Gender in X.
3. In Data Display, one can add displays beyond the default ones by adding a row, say row 3 with Mean Symbol, to show the sample mean on the plot.
4. Annotation allows outlier values, the mean and the median to be shown on the plot.
5. Frame allows more than one plot to be displayed on the same page.
Box Plots of Blood Pressure, Comparing Male and Female
[Figure: Box plots for systolic blood pressure, separated by gender. X-axis: SEX (Male, Female); Y-axis: Systolic Blood Pressure. Labelled points include the two group means (118.706 and 110.390) and extreme values such as 210, 170 and 70.]
The likely outlier, 210, is a male. The distribution for males is somewhat skewed to the right; excluding 210, it is roughly symmetric, although there are still some potential extremes on either end. The distribution for females is approximately symmetric, so it is reasonable to assume normality for females.
Hands-on activity
Use the inter-laboratory testing data (the TAPPI data) to construct a box plot for the GR-Lab35-Mean variable and for the GR-Lab35-Mean-1 variable, and identify the likely outliers for each variable.
3. Numerical Methods for Detecting Outliers
a. Studentized Residuals (also known as CPVs (Comparative Performance Values) or h-statistics in the literature on inter-laboratory testing studies).
b. Deleted Studentized Residuals
Consider the TAPPI lab data of sample GR35:
Use the notation $\{y_1, y_2, y_3, \ldots, y_n\}$ to represent the n data values, one from each lab.
When the same testing procedure is applied and each lab's process is under statistical control, the expected testing result should be the same for every lab.
We will use the notation $\mu$ for the expected measurement.
A simple model for describing one-sample testing
As we demonstrated in the ‘2 cm drawing activity’, there is always some uncertainty above and below the true measurement. If there are no special causes or systematic biases, the deviations between each lab's testing result and the expected measurement, $e_i = y_i - \mu$, should behave in a random fashion.
This suggests that each testing result can be expressed by the following model:
$y_i = \mu + e_i$ for $i = 1, 2, 3, \ldots, n$ labs
This describes the expected situation in one-sample testing. We then use the observed lab testing data to estimate the expected testing result and to investigate the random deviations. Using the sample data, this is what we have:
$y_i = \bar{y} + e_i$.
$\bar{y}$ is the average of all included labs (also known as the grand mean).
$e_i$ is what we call the residual. We also see that the average of the $e_i$'s is zero.
If the testing result $y_i$ from a lab is likely an outlier, its corresponding $e_i$ will be far away from the average, 0. Therefore, one can use the residual to detect labs with extreme testing results.
Instead of using the residual $e_i$ itself (whose value depends on the measurement units), we use a standardized form of $e_i$ to detect outliers, so that the result does not depend on the measurement units.
A classical one is the standardized $e_i$ (also known as the CPV or the h-statistic in inter-laboratory studies):
How to compute the standardized $e_i$ for each lab?
1. Compute $\bar{y}$, the grand mean of all included labs.
2. Compute $e_i = y_i - \bar{y}$.
3. Compute the between-lab variance, $s^2$, and standard deviation, $s$:
$s^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 / (n - 1)$ and $s = \sqrt{s^2}$
4. Standardized $e_i = e_i / s$
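The four steps translate almost line by line into code. The following is a minimal sketch; the function name and the `lab_means` values are hypothetical and chosen only so that one lab stands out.

```python
from math import sqrt

def standardized_residuals(y):
    """Return (grand mean, s, list of standardized residuals e_i / s)."""
    n = len(y)
    y_bar = sum(y) / n                            # step 1: grand mean
    residuals = [yi - y_bar for yi in y]          # step 2: e_i = y_i - y_bar
    s2 = sum(e * e for e in residuals) / (n - 1)  # step 3: between-lab variance
    s = sqrt(s2)
    return y_bar, s, [e / s for e in residuals]   # step 4: e_i / s

# Hypothetical lab averages, invented for illustration only
lab_means = [77.2, 77.6, 77.4, 77.9, 77.5, 77.3, 77.7, 77.5, 77.6, 79.8]
y_bar, s, h = standardized_residuals(lab_means)
for lab, value in enumerate(h, start=1):
    print(f"lab {lab}: h = {value:+.2f}")
```

Because the residuals sum to zero, the printed h-values also average to zero; only the last (invented) lab comes out well above 2.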
How to use standardized residual (CPV or h-statistic) to detect outliers?
A quick rule :
•If the standardized residual is > 2 or < -2, the lab is a possible outlier, since, based on the normal distribution, there is only approximately a 2.5% chance of a standardized residual > 2, and likewise of one < -2.
•If the standardized residual is > 2.6 or < -2.6, the lab is a likely outlier. There is only approximately a 0.5% chance of a value > 2.6, and likewise of one < -2.6.
NOTE: 2.0 and 2.6 are rounded versions of the values 1.96 and 2.576 from the Z-distribution, N(0,1).
[Figure: standard normal density f(z), centered at 0, with tail areas of .025 beyond -1.96 and 1.96, and .005 beyond -2.576 and 2.576.]
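The tail areas behind the quick rule can be checked with Python's standard library. This is a small sketch; `upper_tail` is a helper name introduced here for illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, N(0, 1)

def upper_tail(cutoff):
    """P(Z > cutoff) for a standard normal variable."""
    return 1.0 - Z.cdf(cutoff)

for cutoff in (1.96, 2.0, 2.576, 2.6):
    print(f"P(Z > {cutoff}) = {upper_tail(cutoff):.4f}")

# Going the other way: cutoffs that leave .025 and .005 in the upper tail
print(f"z(.025) = {Z.inv_cdf(0.975):.3f}, z(.005) = {Z.inv_cdf(0.995):.3f}")
```

The rounded cutoffs 2.0 and 2.6 leave slightly less than 2.5% and 0.5% in each tail, which is why the quick rule describes the chances as approximate.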
A more precise rule:
•If the standardized residual is > t(.025, n-1) or < -t(.025, n-1), the lab is a possible outlier.
•If the standardized residual is > t(.005, n-1) or < -t(.005, n-1), the lab is a likely outlier.
NOTE: t(a, n-1) is the value of the t-distribution with n-1 degrees of freedom that has upper-tail area a. The standardized residual is treated as following a t-distribution with n-1 degrees of freedom in this case. The t-distribution is very similar to the Z-distribution, but it depends on the sample size; as the sample size grows, t becomes practically the same as Z.
[Figure: t density f(t), centered at 0, with tail areas of .025 beyond -t(.025,n-1) and t(.025,n-1), and .005 beyond -t(.005,n-1) and t(.005,n-1).]
A more sensitive measure for detecting outliers:
Deleted Standardized Residual, $d_j$.
The steps for computing this measure:
1. Delete the jth case.
2. Compute the mean of the remaining cases, $\bar{y}_{(j)}$, and the residuals $e_{i(j)} = y_i - \bar{y}_{(j)}$ for every case, including the jth case.
3. Compute $s^2_{(j)}$ and $s_{(j)}$ using the (n-1) residuals, excluding the jth case.
4. Compute the deleted standardized residual, $d_j = e_{j(j)} / s_{(j)}$.
5. Repeat steps 1-4 for cases j = 1, 2, 3, ..., n.
Since the deleted standardized residual for the jth observation estimates all quantities with this observation deleted from the data set, the jth observation cannot influence these estimates. Therefore, unusual y values stand out clearly. It is more sensitive than the classical standardized residual.
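A leave-one-out loop makes the procedure concrete. The sketch below follows the steps as stated on the slide; the function name and the data are hypothetical. One assumption to flag: regression software typically divides by an extra factor of $\sqrt{n/(n-1)}$ when it reports deleted studentized residuals, so its values will be slightly smaller than the $d_j$ computed here.

```python
from statistics import mean, stdev

def deleted_standardized_residuals(y):
    """d_j = (y_j - mean of the other labs) / (s.d. of the other labs), for each j."""
    d = []
    for j, y_j in enumerate(y):
        rest = y[:j] + y[j + 1:]   # step 1: delete the jth case
        y_bar_j = mean(rest)       # step 2: mean without case j
        s_j = stdev(rest)          # step 3: s.d. of the remaining n-1 cases
        d.append((y_j - y_bar_j) / s_j)  # step 4
    return d                       # step 5: the loop repeats this for every j

# Hypothetical lab averages, invented for illustration only
lab_means = [77.2, 77.6, 77.4, 77.9, 77.5, 77.3, 77.7, 77.5, 77.6, 79.8]
d = deleted_standardized_residuals(lab_means)
for lab, value in enumerate(d, start=1):
    print(f"lab {lab}: d = {value:+.2f}")
```

With these invented values, the last lab has an ordinary standardized residual of roughly 2.7 but a deleted value above 10, which illustrates why the deleted version is described as more sensitive: the suspect lab no longer inflates the s.d. it is judged against.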
How to use the Deleted Standardized Residual to detect
outliers?
The same quick rule as for the standardized residual applies here.
However, to be more precise, we need to use the t-distribution. In applying the t-distribution, the degrees of freedom are now (n-2).
For most applications, the quick rule is sufficient, unless the sample size n is very small. Common wisdom is that n < 30 is small. However, for practical purposes in outlier detection, it is appropriate to consider n < 20 small, in which case the t-distribution should be applied.
The key issue after detecting outliers is to find their possible causes.
The h-plot for Inter-laboratory Testing
The h-plot plots the CPV values on a two-dimensional plot with a center line and upper and lower limits. The X-axis is the Lab ID. The CPV values of replications within each lab, if any, are grouped together. The Y-axis is the standardized residual (or the deleted studentized residual). An example is given in the following:
[Figure: h-plot with labs 1 to 12 on the X-axis, a center line at 0 and limits at 2 and -2; each lab has two lines, one per replication.]
One may use the more precise t-values for the upper and lower bounds: $\pm t(.025, df)$.
In this plot, there are 12 labs. Each lab has two replications. The length of each line is the standardized residual (h-value or CPV) or the deleted studentized residual.
The h-plot is a graphical view of the standardized residuals or deleted studentized residuals. The same plot is not available in Minitab. However, Minitab does provide all the numerical measures needed, and we can create a similar graph using Minitab as well.
Outlier detection using residuals is a very useful tool. In the above case, we considered the simplest model that describes one-sample data, y = μ + e (expected measurement plus random deviation). This model assumes:
• Each lab is similar in its operation,
• The testing procedure is standardized,
• The operators are of similar quality,
• The testing material is similar.
If any of these assumptions is seriously violated, this model is not adequate and a more complicated model should be considered. Outlier detection should not be applied to the response variable directly if we know in advance that these assumptions are violated.
Use Minitab to compute numerical measurements for
conducting outlier detection for one sample case
NOTE: This process involves a lot of computation, so we do not do it by hand. Here are the steps for using Minitab to compute residuals, standardized residuals, and deleted standardized residuals.
The TAPPI study is used for demonstration here.
1. Create a column of 1's, say, in C7:
a. Go to Calc, choose ‘Make Patterned Data’, select ‘Arbitrary Set of Numbers’. In the dialog box, enter C7 to store the data, enter ‘1’ in ‘Arbitrary set of Numbers’, list each value ‘87’ times (the sample size), and list the whole sequence ‘1’ time.
2. Go to Stat, choose Regression, then select ‘Regression’.
3. In the dialog box, enter the response variable, say C5, and enter as predictor C7, the column with all 1's.
Steps (continued):
5. Click on ‘Storage’, and select Residuals, Standardized Residuals, Deleted Studentized Residuals, and Fits. Each of these will appear as a column in the worksheet.
The residuals are named RESI1,
the standardized residuals are named SRES1,
the deleted studentized residuals are named TRES1, and
the fitted values are named FITS1. In the one-sample case, the fitted value is exactly the grand mean of all included labs.
The number at the end of each name increases by one, as in RESI2, SRES2, each time results are stored in later analyses.
We can change the variable names as we wish.
There are two additional selections in the Regression procedure: Graphs and Results.
6. Click on ‘Graphs’ to examine the residuals graphically. Choose whichever graphs you wish to see. For example, one may choose ‘Standardized’ and then ‘Normal Plot of Residuals’ to produce a normal probability plot of the standardized residuals.
The graphs will appear in the Graph window.
7. Click on ‘Results’ to choose the amount of output needed. The last option gives the most extensive output.
The results will appear in the Session window.
Use Minitab to construct the h-plot
Since Minitab does not have the h-plot shown before, I will demonstrate how to use another procedure to construct a similar plot using the TAPPI data.
1. Go to Stat, choose Control Charts, then select ‘Individuals’.
2. In the dialog box, enter ‘SRES1’ into the Variable box (or any variable of interest, such as the deleted studentized residuals).
3. Enter 0 for Historical Mean. This will be the center line on the plot.
4. There are five additional selections and three graph-editing selections. Leave Test and Estimate at their defaults. Click on the ‘S-Limit’ selection, and enter 2 for the upper sigma limit and -2 for the lower sigma limit. You can also change the line color and line type.
5. Click on the ‘Stamp’ selection and enter C1 as the Tick Labels. This labels the ticks on the X-axis with the laboratory names.
6. Click on the ‘Options’ selection to change the symbol attributes and connection line attributes.
Case Example: TAPPI Inter-laboratory Study
Let’s start with the SAMPLE GR35.
1. A quick visual check immediately suggests that the following cases are clear outliers, and they are removed from the outlier detection analysis immediately:
U3438: Lab mean = 80.55,
U3531: Lab mean = 85.75
2. Now, we follow the procedure described above to compute the standardized residuals and deleted studentized residuals using the remaining data, together with a normal plot analysis.
The unusual observations are:
Unusual Observations
Lab Code   Predictor   GR35-Lab       Fit   SE Fit   Residual   St Resid
U2415           1.00    76.0630   77.5273   0.0652    -1.4643     -2.45R
U3154           1.00    79.5500   77.5273   0.0652     2.0227      3.39R
U3185           1.00    79.1000   77.5273   0.0652     1.5727      2.63R
U3216           1.00    79.1620   77.5273   0.0652     1.6347      2.74R
U3249           1.00    76.2630   77.5273   0.0652    -1.2643     -2.12R
U3292           1.00    79.1380   77.5273   0.0652     1.6107      2.70R
U3334           1.00    78.7750   77.5273   0.0652     1.2477      2.09R
The Predictor column is the column of 1's entered in the regression. R marks an observation with a large standardized residual.
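As a consistency check, the St Resid column can be approximately reproduced from the Residual and SE Fit columns. The sketch below rests on assumptions that are not stated on the slide: that n = 85 labs remain (the N reported with the normal probability plot), that SE Fit equals $s/\sqrt{n}$ for a model with only a constant, and that the software divides each residual by $s\sqrt{1 - 1/n}$ rather than by $s$ alone. Because SE Fit is rounded to four decimals, agreement is only expected to about two decimals.

```python
from math import sqrt

n = 85           # assumed number of labs remaining after removing the two clear outliers
se_fit = 0.0652  # "SE Fit" from the table, taken here as s / sqrt(n)

# Lab code -> (Residual, reported St Resid), copied from the table
unusual = {
    "U2415": (-1.4643, -2.45),
    "U3154": (2.0227, 3.39),
    "U3185": (1.5727, 2.63),
    "U3216": (1.6347, 2.74),
    "U3249": (-1.2643, -2.12),
    "U3292": (1.6107, 2.70),
    "U3334": (1.2477, 2.09),
}

s = se_fit * sqrt(n)              # implied between-lab s.d.
scale = s * sqrt(1.0 - 1.0 / n)   # assumed divisor used by the regression output

recomputed = {lab: resid / scale for lab, (resid, _) in unusual.items()}
for lab, (resid, reported) in unusual.items():
    print(f"{lab}: recomputed {recomputed[lab]:+.2f}, reported {reported:+.2f}")
```

With n = 85 the factor $\sqrt{1 - 1/n}$ is about 0.994, so these values differ only slightly from the simpler $e_i / s$ defined earlier.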
The normal probability plot and Normality test for the
Standardized Residuals
[Figure: Normal probability plot of the standardized residuals. X-axis: Standardized Residual (about -2 to 3); Y-axis: Probability (.001 to .999). Average: -0.0000000, StDev: 1.00593, N: 85. Anderson-Darling Normality Test: A-Squared: 3.044, P-Value: 0.000.]
The points do not follow a straight line well. The normality test suggests that the lab testing results clearly do not follow a normal distribution.
•The quick rule is used to detect the outliers in this case, since the sample size is large.
•Both standardized residuals and deleted studentized residuals identify the same group of unusual labs.
•The labs whose testing results are found to be unusual will be notified. Further analysis is then undertaken to find out whether there are any special causes or reasons for these unusual lab results.
•NOTE: The result from the one-sample detection technique is somewhat different from that of the two-sample plot approach, since some labs that show no outliers for this sample may show outliers when testing another sample. This is one reason why we should also construct two-sample plots.
h-Plot for Standardized Residuals
[Figure: Minitab Individuals chart of the CPV values (standardized residuals) by lab code, with Mean=0, UCL=2 and LCL=-2. One point above 3 is flagged with the mark ‘1’.]
This chart was created by Minitab. It is not quite the same as the h-plot, but it serves the same function and more. The mark ‘1’ indicates the lab whose value is over 3, a definite outlier. The labs outside the upper and lower limits of 2 and -2 are considered outliers. One can choose to use different upper and lower bounds.
Hands-on Activity
Detect the labs whose results are outliers in testing Sample GR36 of the TAPPI study.
Use of Basic Quality Control Chart Techniques for monitoring
laboratory performances
Quality control charts were originally developed to monitor shifts in the mean and changes in the variation over time in manufacturing processes.
For the inter-laboratory performance of testing a given material, we can apply the same charting method to monitor the performance of laboratories based on two measurements:
1. laboratory measurement means and
2. within-lab measurement variations.
The control charts to be discussed are called
X-bar chart, for monitoring between-laboratory measurement means.
R-chart, for monitoring within-laboratory measurement variations.
Example: A study of a chromatographic method was conducted for determining
malathion. Ten labs participated in the study; each lab received a subsample of a technical
grade malathion (Tech), two wettable powders (25% WP and 50% WP), and an
emulsifiable concentrate (58% EC), and a dust. Each participant also received an
internally tested standard of malathion (99.1%) along with the analytical method.
(Wernimont, 1985).
Row   Lab   Rep    WP25    WP50
  1     1     1   26.17   50.76
  2     1     2   26.22   50.67
  3     1     3   25.85   50.81
  4     1     4   25.80   50.72
  5     2     1   26.44   50.82
  6     2     2   26.57   50.90
  7     2     3   25.80   51.04
  8     2     4   26.06   50.96
  9     3     1   26.95   52.53
 10     3     2   26.91   52.54
 11     3     3   26.98   52.55
 12     3     4   26.91   52.47
 13     5     1   26.23   50.20
 14     5     2   26.00   50.47
 15     5     3   26.22   50.39
 16     5     4   26.18   50.43
 17     6     1   25.45   51.65
 18     6     2   25.62   51.67
 19     6     3   27.01   51.72
 20     6     4   25.72   52.07
 21     7     1   26.14   50.53
 22     7     2   26.78   50.75
 23     7     3   26.04   49.99
 24     7     4   25.97   50.92
 25     8     1   25.70   50.00
 26     8     2   25.90   50.30
 27     8     3   25.80   50.50
 28     8     4   25.70   50.60
 29     9     1   26.13   50.26
 30     9     2   26.13   50.36
 31     9     3   25.91   50.97
 32     9     4   25.86   50.44
 33    10     1   26.22   50.23
 34    10     2   26.20   50.27
 35    10     3   25.84   50.29
 36    10     4   25.84   49.97
Construction of X-bar chart and R-chart
Consider the above malathion testing study. Ten labs participated in the study. Each lab tested material WP50 with four replications. Lab 4 was excluded since it did not complete the testing.
Lab ID    Rep1         Rep2         Rep3         Rep4         Sample mean, $\bar{x}_i$   Range, $R_i$
1         $x_{11}$     $x_{12}$     $x_{13}$     $x_{14}$     $\bar{x}_1$                $R_1$
2         $x_{21}$     $x_{22}$     $x_{23}$     $x_{24}$     $\bar{x}_2$                $R_2$
3, 5, 6, 7, 8, 9 (same layout)
10        $x_{10,1}$   $x_{10,2}$   $x_{10,3}$   $x_{10,4}$   $\bar{x}_{10}$             $R_{10}$
Average                                                       $\bar{\bar{x}}$            $\bar{R}$
Range = Largest – Smallest in each Lab.
An X-bar chart monitors the laboratory means. If the labs are consistent, the averages of the labs should be close to each other. If all of them are equal, the grand average is the same as each lab average. If the lab averages are very different (that is, some systematic lab biases exist), there will be deviations between the grand mean and the lab means. This provides the basis of the X-bar chart.
[Figure: X-bar chart layout. Horizontal lines at the Upper Limit $\mu + 3\sigma_{\bar{x}}$, the Grand Mean $\mu$ and the Lower Limit $\mu - 3\sigma_{\bar{x}}$; X-axis: Lab order (or Time order).]
The lab averages are then plotted in lab order.
The multiple ‘3’ is commonly applied in process control. Under the normality assumption, there is a 99.7% chance that a lab sample mean falls within the interval.
As the chart indicates, we need to estimate the grand mean and the SE of the lab mean. Since the range is usually easier to compute, the population variance, and hence the SE of the lab mean, can also be estimated using the distribution of the range.
The expected value of the range is $E(R) = d_2 \sigma_x$, where $d_2$ depends on the sample size (in the lab testing case, the number of replications conducted by each lab). The values of $d_2$ will be provided in class.
Therefore, the estimate of $\sigma_x$ is given by $\hat{\sigma}_x = \bar{R} / d_2$,
and the SE of the sample mean is $\hat{\sigma}_{\bar{x}} = \hat{\sigma}_x / \sqrt{r} = \bar{R} / (d_2 \sqrt{r})$.
Upper Control Limit: $\bar{\bar{x}} + \frac{3}{d_2 \sqrt{r}} \bar{R} = \bar{\bar{x}} + A_2 \bar{R}$
Center Line: $\bar{\bar{x}}$
Lower Control Limit: $\bar{\bar{x}} - \frac{3}{d_2 \sqrt{r}} \bar{R} = \bar{\bar{x}} - A_2 \bar{R}$
The multiples $A_2$ and $d_2$ depend on the sample size, that is, the number of replications, r. They will be provided in class.
Construction of R-chart
Similar to the X-bar chart, the R-chart has the form $\bar{R} \pm 3\sigma_R$.
To estimate $\sigma_R$, we use the following fact about the range: $\sigma_R = d_3 \sigma_x$.
An estimate from data is $\hat{\sigma}_R = d_3 \hat{\sigma}_x = \frac{d_3}{d_2} \bar{R}$.
The Upper Limit of the R-chart is $\bar{R} + 3 \left( \frac{d_3}{d_2} \right) \bar{R} = D_4 \bar{R}$.
The Lower Limit of the R-chart is $\bar{R} - 3 \left( \frac{d_3}{d_2} \right) \bar{R} = D_3 \bar{R}$.
The multiples $D_3$ and $D_4$ depend on the sample size, and will be provided in class.
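To see how the chart multiples follow from $d_2$ and $d_3$, here is a short sketch. The $d_2$ and $d_3$ values are the commonly tabulated constants for subgroup sizes 2 to 5, entered here as an assumption since the course hands them out separately; the function name is hypothetical. Setting a negative lower multiple to zero is the usual convention, because a range cannot be negative.

```python
from math import sqrt

# Commonly tabulated control chart constants (assumed; indexed by number of replications r)
D2_TABLE = {2: 1.128, 3: 1.693, 4: 2.059, 5: 2.326}
D3_TABLE = {2: 0.853, 3: 0.888, 4: 0.880, 5: 0.864}

def chart_multipliers(r):
    """Return (A2, D3, D4) for subgroups of size r."""
    d2, d3 = D2_TABLE[r], D3_TABLE[r]
    a2 = 3.0 / (d2 * sqrt(r))                     # X-bar limits: grand mean -/+ A2 * R_bar
    d4_mult = 1.0 + 3.0 * d3 / d2                 # R-chart upper limit: D4 * R_bar
    d3_mult = max(0.0, 1.0 - 3.0 * d3 / d2)       # R-chart lower limit, floored at 0
    return a2, d3_mult, d4_mult

a2, d3_mult, d4_mult = chart_multipliers(4)
print(f"r = 4: A2 = {a2:.3f}, D3 = {d3_mult:.3f}, D4 = {d4_mult:.3f}")
```

For r = 4 this gives values close to the A2 = 0.729 and D4 = 2.282 found in standard tables, with D3 = 0.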
Analyzing the malathion data – the WP50 variable
[Figure: Xbar/R chart for WP50, by laboratory (labs 1, 2, 3, 5, 6, 7, 8, 9, 10). Sample Mean chart: UCL=51.18, Mean=50.88, LCL=50.58, with several points flagged ‘1’. Sample Range chart: UCL=0.9353, R=0.41, LCL=0.]
•The X-bar chart suggests that there are very large differences among the lab means. This is an indication of systematic lab bias. Compared with the standard proportion of 50%, Lab 3 shows a much higher lab average than the others. Some attention should be given to Lab 3.
•The R-chart indicates that, in general, no lab has dramatically high within-lab variation. However, Lab 7 has somewhat higher within-lab variation.
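The WP50 limits can be reproduced from the malathion data table. The sketch below uses the WP50 values as listed there and assumes the standard constants for four replications (A2 = 0.729, D3 = 0, D4 = 2.282); variable names are illustrative only.

```python
# WP50 results by lab (four replications each), copied from the data table
WP50 = {
    1: [50.76, 50.67, 50.81, 50.72],
    2: [50.82, 50.90, 51.04, 50.96],
    3: [52.53, 52.54, 52.55, 52.47],
    5: [50.20, 50.47, 50.39, 50.43],
    6: [51.65, 51.67, 51.72, 52.07],
    7: [50.53, 50.75, 49.99, 50.92],
    8: [50.00, 50.30, 50.50, 50.60],
    9: [50.26, 50.36, 50.97, 50.44],
    10: [50.23, 50.27, 50.29, 49.97],
}
A2, D3, D4 = 0.729, 0.0, 2.282  # assumed standard constants for subgroup size 4

lab_means = {lab: sum(v) / len(v) for lab, v in WP50.items()}
lab_ranges = {lab: max(v) - min(v) for lab, v in WP50.items()}

grand_mean = sum(lab_means.values()) / len(lab_means)
r_bar = sum(lab_ranges.values()) / len(lab_ranges)

x_ucl, x_lcl = grand_mean + A2 * r_bar, grand_mean - A2 * r_bar
r_ucl, r_lcl = D4 * r_bar, D3 * r_bar

print(f"X-bar chart: LCL={x_lcl:.2f}, center={grand_mean:.2f}, UCL={x_ucl:.2f}")
print(f"R chart:     LCL={r_lcl:.2f}, center={r_bar:.2f}, UCL={r_ucl:.4f}")
print("means outside limits:", [lab for lab, m in lab_means.items() if not x_lcl <= m <= x_ucl])
print("ranges above UCL:   ", [lab for lab, r in lab_ranges.items() if r > r_ucl])
```

The small difference between the computed R-chart upper limit and the charted 0.9353 is presumably due to the rounding of the tabulated constants.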
Analyzing the Malathion Data – 25% Variable
[Figure: Xbar/R chart for WP25, by lab (labs 1, 2, 3, 5, 6, 7, 8, 9, 10). Sample Mean chart: UCL=26.53, Mean=26.15, LCL=25.76, with one point flagged ‘1’. Sample Range chart: UCL=1.194, R=0.5233, LCL=0, with one point flagged ‘1’.]
The X-bar chart for the WP25 variable also shows that Lab 3 has a significantly high lab average. A closer check is necessary.
The R-chart indicates that the within-lab variation of Lab 6 exceeds the upper limit. A review of Lab 6 for special causes is recommended.
Some general comments on applying control charts for monitoring laboratory means and within-lab variations
The X-bar and R-chart technique is valid under the following assumptions:
(1) The response variable follows a normal distribution.
(2) The same or very similar material is tested by every participating lab.
(3) The operation of each lab is independent of the others.
In most laboratory studies, condition (3) is usually satisfied.
Condition (2) may be satisfied if the material is prepared and distributed, and the lab testing conducted, within a reasonable time period.
If more than one material is tested by the participating labs, we can construct a series of control charts to monitor each material. There are also multivariate control charts that can monitor more than one material at a time and take the laboratory systematic biases into account.
Youden's two-sample plots (to be discussed later) can be applied to diagnose lab performance based on two samples at a time.
Other control charts that may be useful for monitoring inter-laboratory testing studies
X-bar chart, for monitoring between-laboratory measurement means.
R-chart, for monitoring within-laboratory measurement variations.
There are many other types of control charts, each developed for some specific purpose.
X-bar and S-charts are similar to X-bar and R-charts, for continuous, normal responses.
p-charts are for binomial data, particularly for monitoring the proportion of defectives in a sample over time.
u- or c-charts are for Poisson data, particularly for monitoring the number of defects in sampled parts over time.
These charts can be handy for monitoring laboratory testing.
If the measurement is the proportion of a certain property, a p-chart can be applied.
If the measurement is the number of times that a targeted property is observed in each lab, u- or c-charts can be applied.
How to use Minitab to conduct control chart analysis?
Constructing X-bar and R-charts is straightforward even by hand. However, Minitab can do the charting, and much more, for us. Here are the steps for constructing the X-bar and R-charts:
1. Go to Stat, choose Control Charts, select Xbar-R…
2. What to enter in the dialog box depends on the data arrangement in the worksheet. If the response is in one column and the lab number in another column, enter the response and lab ID columns into ‘single column’ and ‘sub-group size’, respectively.
3. There are four selections. We have shown these before. Click on the ‘Stamp’ selection, and enter the column that contains the correct lab ID or name. The correct lab ID or name will then show on the X-axis ticks for easier reading.
Hands-on Activity
Analyze the other variables in the malathion data, and draw your final conclusions about lab consistency with regard to
(a) Lab averages,
(b) Within-lab variations