L2_ch_8_ANOVA__2014.doc

Lecture 2:
ANOVA – Review and Refresh
Last class we discussed comparing two means to each other. Now we
want to generalize to the case where we have more than two means.
Examples:
1. Are there differences in mean yield for three varieties of wheat?
2. Is there a difference in the mean hourly wages for three different
ethnic groups?
3. Is there a difference in the mean Pb content in the three main lakes in
Eastern WA (Coeur d'Alene, Liberty Lake and Newman Lake)?
4. Is there a difference in the cholesterol ratio among the 4 groups:
young males, young females, older males, older females?
In each case we are interested in comparing multiple means to each
other.
In each case we are assuming that the treatments were randomly
assigned to homogeneous units, or that units are selected at random from
homogeneous groups. Hence the DESIGN that goes along with One-Way ANOVA is the COMPLETELY RANDOMIZED DESIGN (CRD).
So the hypothesis of interest is:
$H_0: \mu_1 = \mu_2 = \cdots = \mu_t$ (t is the number of groups)
$H_a$: at least one is unequal.
So we are interested in finding whether or not at least one of the
treatments is different from another.
We are also interested in identifying WHICH ones are different.
First we will discuss the ANALYSIS aspect of the problem and then we
will address the design aspect of the problem.
The logic behind ANOVA:
The idea is we decide whether the means are the same or not based on
their variability.
The general idea is that we don't know the population means, but can
only observe the sample means and the standard deviations. Obviously
the sample means are not going to be exactly equal to each other. So the
idea is figuring out how different the means are based on their variability.
So if the variability across samples is a lot bigger than the variability within
the samples (normal variation), we decide the population means are different.
Let's consider the following example from your book.
Assume that we wish to compare the mean hourly wages of three ethnic
groups based on samples of five workers selected from each group.
Table 1
Group    Hourly wages                        Mean
1        5.90   5.92   5.91   5.89   5.88    5.90
2        5.51   5.50   5.50   5.49   5.50    5.50
3        5.01   5.00   4.99   4.98   5.02    5.00
What can you say about the variability in the three groups for Table 1 and
2?
Table 2
Group    Hourly wages                        Mean
1        5.90   4.42   7.51   7.89   3.78    5.90
2        6.31   3.54   4.73   7.20   5.72    5.50
3        4.52   6.93   4.48   5.55   3.52    5.00
Do the data in Table 1 present sufficient evidence to indicate differences
among the three population means? An inspection of the data indicates
very little variation within a sample, whereas the variability among the
sample means is much larger. Since the variability among the sample
means is so large in comparison to the within-sample variation, we might
conclude that the corresponding population means are different.
Table 2 illustrates a situation in which the sample means are the same as
given in Table 1 but the variability within a sample is much larger. In
contrast to the data in Table 1, the between-sample variation is small
relative to the within sample variability. We would be less likely to
conclude that the corresponding population means differ based on these
data.
Let's look at the mean and standard deviation of each group:
         Table 1            Table 2
Group    Mean    SD         Mean    SD
1        5.9     .016       5.9     1.82
2        5.5     .007       5.5     1.42
3        5.0     .016       5.0     1.30
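The summary values above can be checked directly from the raw wages. The following is a minimal sketch in Python using only the standard library; the variable and function names (`table1`, `table2`, `summarize`) are illustrative choices, not part of the notes.

```python
from statistics import mean, stdev

# Hourly wages from Table 1 and Table 2 (five workers per group).
table1 = {
    1: [5.90, 5.92, 5.91, 5.89, 5.88],
    2: [5.51, 5.50, 5.50, 5.49, 5.50],
    3: [5.01, 5.00, 4.99, 4.98, 5.02],
}
table2 = {
    1: [5.90, 4.42, 7.51, 7.89, 3.78],
    2: [6.31, 3.54, 4.73, 7.20, 5.72],
    3: [4.52, 6.93, 4.48, 5.55, 3.52],
}

def summarize(table):
    """Return {group: (sample mean, sample SD)}; stdev uses the n-1 divisor."""
    return {g: (mean(y), stdev(y)) for g, y in table.items()}

summary1 = summarize(table1)
summary2 = summarize(table2)

for name, summary in (("Table 1", summary1), ("Table 2", summary2)):
    for g, (m, s) in summary.items():
        print(f"{name} group {g}: mean = {m:.2f}, SD = {s:.4f}")
```

The group means agree between the two tables, while the standard deviations differ by roughly two orders of magnitude.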
So in ANOVA what we do is find the WITHIN sample variance (the samples
are sometimes called treatments) and compare that to the ACROSS sample
variance. If the variation across the samples is MUCH greater than the
variation within the samples, we can conclude that the population means
are different from each other. If the across sample variation is not big
compared to the within sample variance, we cannot conclude that the means
are different.
So the main work is to calculate the WITHIN and ACROSS sample
variation. In ANOVA we call the WITHIN variation the ERROR variance
and the ACROSS variation the TREATMENT variance.
Measuring Variability within and across:
So how do we measure variability within a sample?
Let's go back to the two sample situation. How did we measure variability
within the samples? We used the pooled variance (assuming equal
variances for both populations). It's the same idea extended to the case
of multiple means:
We pool it across all the samples.
So for the data in Table 1:
Let's first define some notation:
Level (i)            1         2         3
$n_i$                5         5         5
$\bar{y}_{i.}$       5.9000    5.5000    5.0000
$s_i$                0.0158    0.0071    0.0158
Here t = # of treatments = 3
$N = n_1 + n_2 + n_3 = 15$
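So for the variation within we use the notation $s_W^2$:
$s_W^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2 + (n_3-1)s_3^2}{N - t}$
that is, we pool the variances across the three groups.
Now, how do we get variability across samples?
We find the variance of the 3 sample means:
$s_B^2 = \frac{\sum_{i=1}^{t} n_i (\bar{y}_{i.} - \bar{y}_{..})^2}{t - 1}$
where $\bar{y}_{..} = \frac{1}{N}\sum_{i=1}^{t} n_i \bar{y}_{i.}$ is the overall mean.
Let's calculate the terms for our example:
$\bar{y}_{..} = 5.4667$
$s_W^2 = \frac{4(0.0158)^2 + 4(0.0071)^2 + 4(0.0158)^2}{12} = .000183$
$s_B^2 = \frac{5(5.9 - 5.4667)^2 + 5(5.5 - 5.4667)^2 + 5(5.0 - 5.4667)^2}{2} = 1.0167$
So a logical choice for testing is the ratio $s_B^2 / s_W^2$. In our case,
is this bigger than 1? How much?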
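As a rough check of the arithmetic, the within and across variances for Table 1 can be computed from the group sizes, means and standard deviations alone. This is a sketch only; the names `s2_within`, `s2_between` and `ratio` are labels chosen here for the pooled variance, the variance across sample means, and their ratio.

```python
# Summary statistics for Table 1: group sizes, sample means, sample SDs.
n = [5, 5, 5]
ybar = [5.90, 5.50, 5.00]
s = [0.0158114, 0.0070711, 0.0158114]

t = len(n)   # number of treatments
N = sum(n)   # total number of observations

# Within-sample variance: pool the group variances, weighting by n_i - 1.
s2_within = sum((ni - 1) * si ** 2 for ni, si in zip(n, s)) / (N - t)

# Between-sample variance: spread of the group means about the grand mean.
grand_mean = sum(ni * yi for ni, yi in zip(n, ybar)) / N
s2_between = sum(ni * (yi - grand_mean) ** 2 for ni, yi in zip(n, ybar)) / (t - 1)

ratio = s2_between / s2_within

print(f"within variance  = {s2_within:.6f}")
print(f"between variance = {s2_between:.4f}")
print(f"ratio            = {ratio:.1f}")
```

Because the standard deviations are entered to seven decimals, the ratio agrees with the value computed from the raw data only up to rounding.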
A bit of distribution THEORY:
Before we find a mathematical way to look at how different is different, we
will digress a bit and talk about distributions and their relationships.
I will assume you ALL have used and know what the NORMAL
distribution is.
Normal:
Let Y follow a Normal distribution with mean $\mu$ and standard deviation $\sigma$:
$f(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^2\right)$
If Y* = a + bY, Y* follows a Normal distribution with $E(Y^*) = a + b\mu$
and $Var(Y^*) = b^2\sigma^2$.
A normal distribution with mean 0 and variance 1 is called the
STANDARD normal distribution.
Chi-square distribution:
Let $z_1, \ldots, z_n$ be iid N(0,1). Then $X^2 = z_1^2 + \cdots + z_n^2$ follows a chi-square
distribution with n degrees of freedom.
If there are some linear constraints upon the z’s like: z1+….+zn = a (some
constant) then the degrees of freedom DECREASE by the number of
constraints in the model.
Student's t-distribution:
Let z follow a N(0,1) and $\chi^2(\nu)$ follow a chi-square distribution with $\nu$
degrees of freedom, independent of z. Then:
$t = \frac{z}{\sqrt{\chi^2(\nu)/\nu}}$
follows a t distribution with $\nu$ degrees of freedom.
F-distribution:
Let $\chi_1^2(\nu_1)$ and $\chi_2^2(\nu_2)$ follow INDEPENDENT chi-square distributions
with $\nu_1$ and $\nu_2$ degrees of freedom. Then:
$F = \frac{\chi_1^2(\nu_1)/\nu_1}{\chi_2^2(\nu_2)/\nu_2}$
follows an F distribution with $\nu_1$ and $\nu_2$ degrees of freedom.
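The chi-square and F constructions above can be imitated by simulation. The sketch below builds chi-square draws as sums of squared standard normals and F draws as a ratio of two such sums, each divided by its degrees of freedom. The seed, the number of draws and the function names are arbitrary choices made here; the comparison values are the known means of the two distributions (df for chi-square, and df2/(df2 - 2) for F when df2 > 2).

```python
import random

random.seed(2014)  # fixed seed so the sketch is reproducible

def chi_square(df):
    """One chi-square draw: the sum of df squared independent N(0,1) draws."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))

def f_draw(df1, df2):
    """One F draw: ratio of two independent chi-squares, each over its df."""
    return (chi_square(df1) / df1) / (chi_square(df2) / df2)

df1, df2 = 2, 12
n_draws = 20000

chi_draws = [chi_square(df2) for _ in range(n_draws)]
f_draws = [f_draw(df1, df2) for _ in range(n_draws)]

mean_chi = sum(chi_draws) / n_draws   # theoretical mean: df2
mean_f = sum(f_draws) / n_draws       # theoretical mean: df2 / (df2 - 2)

print(f"mean of chi-square({df2}) draws: {mean_chi:.3f} (theory {df2})")
print(f"mean of F({df1},{df2}) draws: {mean_f:.3f} (theory {df2 / (df2 - 2):.3f})")
```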
So Back to our Problem:
We want to figure out how large is large for the ratio of the
between-sample variance to the within-sample variance.
So we need to look at the model of ANOVA and assumptions to
understand the distribution of the statistic.
Model Based Approach To ANOVA (AOV)
Cell-means model
$y_{ij} = \mu_i + \epsilon_{ij}$
OR (the most commonly used form)
Treatment-effect model
$y_{ij} = \mu + \tau_i + \epsilon_{ij}$
Where: $y_{ij}$ is our response variable
$\mu$: our overall mean (grand mean)
$\tau_i$: our effect from treatment i
$\epsilon_{ij}$: our error terms.
Our assumption is that the $\epsilon_{ij}$ are independent and follow a Normal
distribution with mean 0 and variance $\sigma^2$.
So our assumptions are:
1. Normality of the Y’s
2. Independence among the Y’s
3. All the Y’s have equal variance (variance does not depend upon the
group they belong to).
So the hypotheses we are testing are:
$H_0: \mu_1 = \mu_2 = \cdots = \mu_t$
$H_a$: at least one inequality
or, equivalently,
$H_0: \tau_1 = \tau_2 = \cdots = \tau_t = 0$
$H_a$: at least one inequality
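Let's consider the following:
$y_{ij} - \bar{y}_{..} = (y_{ij} - \bar{y}_{i.}) + (\bar{y}_{i.} - \bar{y}_{..})$
So,
$\sum_{i}\sum_{j} (y_{ij} - \bar{y}_{..})^2 = \sum_{i}\sum_{j} (y_{ij} - \bar{y}_{i.})^2 + \sum_{i} n_i (\bar{y}_{i.} - \bar{y}_{..})^2$
Total Sum Squares = Within Sum Squares + Between Sum Squares
Now let's reason this out. Y is normal, so under $H_0$ (Total Sum Squares)/$\sigma^2$
is chi-square with N-1 degrees of freedom. Similarly $\bar{y}_{i.}$ is also normal, so
under $H_0$ (Between Sum Squares)/$\sigma^2$ is a chi-square with t-1 degrees of
freedom, and (Within Sum Squares)/$\sigma^2$ is a chi-square with N-t degrees of
freedom. So the ratio
$F = \frac{\text{Between Sum Squares}/(t-1)}{\text{Within Sum Squares}/(N-t)}$
follows F with ((t-1), (N-t)) degrees of freedom. So we
determine how large is large using the F table.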
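The partition of the total sum of squares can be verified numerically on the Table 1 wages. This is a small sketch; `tss`, `ssw` and `ssb` are names chosen here for the total, within and between sums of squares.

```python
from statistics import mean

# Table 1 wages, one list per group.
groups = [
    [5.90, 5.92, 5.91, 5.89, 5.88],
    [5.51, 5.50, 5.50, 5.49, 5.50],
    [5.01, 5.00, 4.99, 4.98, 5.02],
]

all_y = [y for g in groups for y in g]
grand = mean(all_y)

# Total: every observation about the grand mean.
tss = sum((y - grand) ** 2 for y in all_y)
# Within: every observation about its own group mean.
ssw = sum((y - mean(g)) ** 2 for g in groups for y in g)
# Between: every group mean about the grand mean, weighted by group size.
ssb = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)

print(f"TSS       = {tss:.5f}")
print(f"SSW + SSB = {ssw:.5f} + {ssb:.5f} = {ssw + ssb:.5f}")
```

The two printed totals agree up to floating-point rounding, which is the identity Total = Within + Between.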
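This leads us to the ANOVA table:
Source                        Degrees of    Sums of    Mean         F test              p-value
                              Freedom       Squares    Squares      statistic
Between Samples (Treatment)   dff = t-1     SSF        SSF/dff      (SSF/dff)/(SSE/dfe)
Within Samples (Error)        dfe = N-t     SSE        SSE/dfe
Total                         dft = N-1     TSS
where
$SSF = \sum_{i} n_i (\bar{y}_{i.} - \bar{y}_{..})^2$
$SSE = \sum_{i}\sum_{j} (y_{ij} - \bar{y}_{i.})^2$
$TSS = \sum_{i}\sum_{j} (y_{ij} - \bar{y}_{..})^2$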
This is the ANOVA table used in practice. In theory we generally add a
column for the Expected(MS).
Let's look at the ANOVA table for the data in Table 1 and how we
complete it.
Source                        Degrees of       Sums of          Mean                 F test           p-value
                              Freedom          Squares          Squares              statistic
Between Samples (Treatment)   dff = t-1        SSF = 2.03333    SSF/dff = 1.0167     5545.45          <.001 *
Ethnicity                     = 3-1 = 2
Within Samples (Error)        dfe = 15-3 = 12  SSE = .002200    SSE/dfe = .000183
Total                         dft = 15-1 = 14  TSS = 2.03553
The F test statistic is the ratio of the two mean squares, 1.0167/.000183 = 5545.45.
*For finding the p-value we need to look at the F distribution with degrees of
freedom 2 and 12.
If you look at the F-table, F(.001,2,12)=12.97. 5545.45 is bigger than that,
so p-value <.001
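When the numerator has exactly 2 degrees of freedom, the F upper tail has a simple closed form, P(F > x) = (1 + 2x/df2)^(-df2/2), so the table lookup can be reproduced without a statistics package. The sketch below relies on that special case only; it is not a general F routine, and the function names are made up for this illustration.

```python
def f_upper_tail_2df(x, df2):
    """P(F > x) for an F distribution with 2 and df2 degrees of freedom.

    This closed form is valid only when the numerator df equals 2.
    """
    return (1.0 + 2.0 * x / df2) ** (-df2 / 2.0)

def f_critical_2df(alpha, df2):
    """Upper-alpha critical value, found by inverting the tail formula."""
    return (df2 / 2.0) * (alpha ** (-2.0 / df2) - 1.0)

# Mean squares for Table 1 (treatment and error).
f_stat = 1.01666667 / 0.00018333
crit_001 = f_critical_2df(0.001, 12)
p_value = f_upper_tail_2df(f_stat, 12)

print(f"F statistic     = {f_stat:.2f}")
print(f"F(.001, 2, 12)  = {crit_001:.2f}")
print(f"p-value         = {p_value:.3e}")
```

For other numerator degrees of freedom a table or statistical software is still needed.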
As we are using SAS in our labs, let's look at what the SAS output gives us
for Data set 1:
The SAS System
The MEANS Procedure

factor=A
Analysis Variable : wage
N    Mean         Std Dev      Minimum      Maximum
5    5.9000000    0.0158114    5.8800000    5.9200000

factor=B
Analysis Variable : wage
N    Mean         Std Dev      Minimum      Maximum
5    5.5000000    0.0070711    5.4900000    5.5100000

factor=C
Analysis Variable : wage
N    Mean         Std Dev      Minimum      Maximum
5    5.0000000    0.0158114    4.9800000    5.0200000

The GLM Procedure
Class Level Information
Class     Levels    Values
factor    3         A B C
Number of Observations Read 15
Number of Observations Used 15

The GLM Procedure
Dependent Variable: wage
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              2     2.03333333        1.01666667     5545.45    <.0001
Error              12    0.00220000        0.00018333
Corrected Total    14    2.03553333

R-Square    Coeff Var    Root MSE    wage Mean
0.998919    0.247684     0.013540    5.466667

Source    DF    Type I SS      Mean Square    F Value    Pr > F
factor    2     2.03333333     1.01666667     5545.45    <.0001

Source    DF    Type III SS    Mean Square    F Value    Pr > F
factor    2     2.03333333     1.01666667     5545.45    <.0001
Let's look at the ANOVA Table for Data Set 2:
The SAS System
The MEANS Procedure

factor=A1
Analysis Variable : wage
N    Mean         Std Dev      Minimum      Maximum
5    5.9000000    1.8191344    3.7800000    7.8900000

factor=B1
Analysis Variable : wage
N    Mean         Std Dev      Minimum      Maximum
5    5.5000000    1.4167745    3.5400000    7.2000000

factor=C1
Analysis Variable : wage
N    Mean         Std Dev      Minimum      Maximum
5    5.0000000    1.2960131    3.5200000    6.9300000

The GLM Procedure
Class Level Information
Class     Levels    Values
factor    3         A1 B1 C1
Number of Observations Read 15
Number of Observations Used 15

The GLM Procedure
Dependent Variable: wage
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              2     2.03333333        1.01666667     0.44       0.6565
Error              12    27.98460000       2.33205000
Corrected Total    14    30.01793333

R-Square    Coeff Var    Root MSE    wage Mean
0.067737    27.93485     1.527105    5.466667

Source    DF    Type I SS      Mean Square    F Value    Pr > F
factor    2     2.03333333     1.01666667     0.44       0.6565

Source    DF    Type III SS    Mean Square    F Value    Pr > F
factor    2     2.03333333     1.01666667     0.44       0.6565
So our findings from the analysis are:
1. For Data set 1 there is a significant difference across the ethnicities
with a p-value <.001
2. For data set 2 there is NOT a significant difference between the three
ethnicities.
How did these results emerge?
1. The SSF was the SAME for both data sets, however the SSE was
MUCH smaller in Data set 1 than 2.
2. So in the first data set almost all the total variability was explained by
the difference among the groups and very little by error. Not the case
for the second data set.
3. So essentially ANOVA detects differences looking at how the total
variability is broken up into variability for treatments and variability
for errors.
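The three points above can be seen side by side by computing the sums of squares for both data sets and the share of the total variability attributed to the groups (the R-Square value in the SAS output). This is an illustrative sketch; the function name `partition` and the `results` dictionary are choices made here.

```python
from statistics import mean

def partition(groups):
    """Return (SSF, SSE, TSS) for a list of groups of observations."""
    all_y = [y for g in groups for y in g]
    grand = mean(all_y)
    ssf = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    sse = sum((y - mean(g)) ** 2 for g in groups for y in g)
    return ssf, sse, ssf + sse

data1 = [
    [5.90, 5.92, 5.91, 5.89, 5.88],
    [5.51, 5.50, 5.50, 5.49, 5.50],
    [5.01, 5.00, 4.99, 4.98, 5.02],
]
data2 = [
    [5.90, 4.42, 7.51, 7.89, 3.78],
    [6.31, 3.54, 4.73, 7.20, 5.72],
    [4.52, 6.93, 4.48, 5.55, 3.52],
]

results = {}
for name, data in (("Data set 1", data1), ("Data set 2", data2)):
    ssf, sse, tss = partition(data)
    df_f = len(data) - 1                          # t - 1
    df_e = sum(len(g) for g in data) - len(data)  # N - t
    f_stat = (ssf / df_f) / (sse / df_e)
    results[name] = {"SSF": ssf, "SSE": sse, "R2": ssf / tss, "F": f_stat}
    print(f"{name}: SSF = {ssf:.4f}, SSE = {sse:.4f}, "
          f"share explained = {ssf / tss:.4f}, F = {f_stat:.2f}")
```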
One thing we need to check now: was ANOVA appropriate in this
scenario? Were the assumptions met?
Let's discuss assumptions:
1. Normality
2. Independence
3. Equal Variance
The procedure to check whether the assumptions are met is called diagnostics.
It's similar to being a doctor and looking at symptoms to see if the model is
“well” or not.
In each case we can “test” or look at plots.
We will start with the easier ways first looking at plots.
Before we do that, let's talk about what these assumptions are about:
1. Normality: here we assume that our response variable (and in turn our
errors) is normally distributed.
2. Independence: here again we assume that each error term is
independent of the others.
3. Equal Variance: we assume that each error term comes from a
distribution with the same variance.
So let's think about how to diagnose problems. Let us take a long hard
look at our model.
Model: for ONE-WAY ANOVA fixed effects:
(we will discuss random effects in Lecture 4)
$y_{ij} = \mu + \tau_i + \epsilon_{ij}$
observed value = overall mean + deviation for the i-th group + random error
$y_{ij}$ is random
$\mu + \tau_i$ is NOT random (fixed)
$\epsilon_{ij}$ is random
$\epsilon_{ij} = y_{ij} - (\mu + \tau_i)$
Let's think about $\epsilon_{ij}$: as we don't know the true value of the mean
of the $y_{ij}$'s, we can never observe $\epsilon_{ij}$.
How do we estimate $\epsilon_{ij}$?
$y_{ij} - \bar{y}_{i.}$, which is our observed error, seems reasonable. Actually we call
our OBSERVED error our RESIDUAL and represent it by $e_{ij}$,
so that $e_{ij} = y_{ij} - \bar{y}_{i.}$.
REMEMBER all our assumptions are on the ERROR but we use the
RESIDUAL as a proxy to check these assumptions out.
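A residual is just an observation minus its own group mean, so it is easy to compute by hand or in a few lines of code. The sketch below does this for the Table 1 wages, using the group labels A, B, C from the SAS output; the dictionary names are illustrative.

```python
from statistics import mean

# Table 1 wages keyed by the group labels used in the SAS output.
groups = {
    "A": [5.90, 5.92, 5.91, 5.89, 5.88],
    "B": [5.51, 5.50, 5.50, 5.49, 5.50],
    "C": [5.01, 5.00, 4.99, 4.98, 5.02],
}

# The fitted value for every observation in a group is the group mean,
# so each residual is the observation minus its group mean.
residuals = {
    label: [y - mean(ys) for y in ys]
    for label, ys in groups.items()
}

for label, res in residuals.items():
    print(label, [f"{e:+.2f}" for e in res])
```

Within each group the residuals sum to zero (up to rounding), which is one of the linear constraints that reduces the error degrees of freedom from N to N - t.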
So, to check for normality we check to see if the residuals are normally
distributed. We do this by the normal probability plot.
What is the normal probability plot?
It’s a plot of the ranked residuals and the expected values of the ranked
residuals if we assume the distribution is normal. So if the distribution is
normal we expect the plot to look like a STRAIGHT Line. Any deviation
from the straight line would mean departures from normality.
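The coordinates of a normal probability plot can be sketched as follows for the Data Set 2 residuals. The expected values of the ranked residuals are approximated with the plotting positions (i - 0.375)/(n + 0.25); that particular formula (Blom's) is an assumption made for this sketch, since the notes do not say which approximation the software uses.

```python
from statistics import NormalDist, mean

def normal_scores(n, a=0.375):
    """Approximate expected standard normal order statistics for size n.

    Uses plotting positions (i - a) / (n + 1 - 2a); a = 0.375 is an
    assumed choice, not something specified in the notes.
    """
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - a) / (n + 1 - 2 * a))
            for i in range(1, n + 1)]

# Data Set 2 wages, one list per group.
groups = [
    [5.90, 4.42, 7.51, 7.89, 3.78],
    [6.31, 3.54, 4.73, 7.20, 5.72],
    [4.52, 6.93, 4.48, 5.55, 3.52],
]

# Ranked residuals: each observation minus its group mean, sorted.
ranked_residuals = sorted(y - mean(g) for g in groups for y in g)
scores = normal_scores(len(ranked_residuals))

# Each pair is one point on the plot; a roughly straight pattern
# is what we expect under normality.
for z, e in zip(scores, ranked_residuals):
    print(f"{z:+.3f}  {e:+.3f}")
```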
For equality of Variance we look at the plot of Residuals versus the
Explanatory Variable. The idea is if the variances are equal, the residuals
should be of equal spread for the different X’s.
For non-independence: we look at the plot of residuals versus predicted
and do not want to see any patterns in general.
Consider the plots from Data Set 1:
Consider the plots for data set 2:
Some words of advice:
1. We can do tests for these assumptions and we will talk about them in
LAB.
2. While we are making an assumption on the error we test on the
residuals, so use these as guidelines to diagnose problems but not
something irrevocable.
If assumptions are not met, we have options. More and more new
techniques are coming up which deal with non-standard situations, so it's
just a matter of getting the right procedure.
We can always transform our data. This is not ALWAYS the best option, as the
analysis is then done on the transformed response and might not have
much physical meaning. The book talks about some standard techniques
in section 8.5. You can read that.
My advice is: transformations are best avoided unless the transformed
response is physically interpretable.
An Example: A development engineer is interested in determining if
varying the cotton content in a synthetic fiber affects the tensile strength.
% Cotton    Observed Tensile Strength
15           7     7    15    11     9
20          12    17    12    18    18
25          14    18    18    19    19
30          19    25    22    19    23
35           7    10    11    15    11
One-Way Analysis of Variance
Analysis of Variance for resp
Source    DF    SS        MS        F        P
Trt       4     475.76    118.94    14.76    0.000
Error     20    161.20    8.06
Total     24    636.96
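The entries of this table can be reproduced from the tensile strength data with the same sums-of-squares formulas used for the wage example. The sketch below assumes the row layout of five observations per cotton percentage; the p-value is left to the software output, because the simple tail formula for 2 numerator degrees of freedom does not apply with 4.

```python
from statistics import mean

# Tensile strength by percent cotton (five observations per level).
strength = {
    15: [7, 7, 15, 11, 9],
    20: [12, 17, 12, 18, 18],
    25: [14, 18, 18, 19, 19],
    30: [19, 25, 22, 19, 23],
    35: [7, 10, 11, 15, 11],
}

groups = list(strength.values())
t = len(groups)                   # number of treatments
N = sum(len(g) for g in groups)   # total observations
grand = mean([y for g in groups for y in g])

ss_trt = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
ss_err = sum((y - mean(g)) ** 2 for g in groups for y in g)
ms_trt = ss_trt / (t - 1)
ms_err = ss_err / (N - t)
f_stat = ms_trt / ms_err

print(f"Trt    DF = {t - 1:2d}  SS = {ss_trt:.2f}  MS = {ms_trt:.2f}  F = {f_stat:.2f}")
print(f"Error  DF = {N - t:2d}  SS = {ss_err:.2f}  MS = {ms_err:.2f}")
print(f"Total  DF = {N - 1:2d}  SS = {ss_trt + ss_err:.2f}")
```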
[Figure: Residuals Versus the Fitted Values (response is strength). Horizontal axis: Fitted Value; vertical axis: Residual.]
[Figure: Normal Probability Plot. Horizontal axis: RESI1; vertical axis: Probability. Average: -0.0000000, StDev: 2.59165, N: 25. Anderson-Darling Normality Test: A-Squared: 0.519, P-Value: 0.170.]
Example: The given observations are tomato yields (kg/plot) for four
different levels of electrical conductivity (EC) of the soil. The chosen EC
levels were 1.6, 3.8, 6.0, and 10.2 nmhos/cm.
EC level    Yield
1.6         59.5    53.3    56.8    63.1
3.8         55.2    59.1    52.8    54.5
6.0         51.7    48.8    53.9    49.0
10.2        44.6    48.5    41.0    47.3
Construct the Analysis of Variance table and use it to test the null hypothesis
of no difference in true mean yield for the four EC levels.
One-way ANOVA: yeild versus trt
Analysis of Variance for yeild
Source    DF    SS       MS       F        P
trt       3     377.8    125.9    12.20    0.001
Error     12    123.8    10.3
Total     15    501.6
[Figure: Normal Probability Plot of the Residuals (response is yeild). Horizontal axis: Residual; vertical axis: Normal Score.]
[Figure: Residuals Versus the Fitted Values (response is yeild). Horizontal axis: Fitted Value; vertical axis: Residual.]
So our next step would be to look at the follow-up to ANOVA, i.e.,
Multiple Comparison.
Review:
Things we learned in this set of notes:
1. Completely Randomized Design
2. Logic of ANOVA
3. Partitioning of the Sums of Squares
4. Model for ANOVA
5. Diagnostics