Analysis of Variance (ANOVA)
In previous weeks, we’ve covered tests of claims about the population mean μ and about the difference in the population means of two populations. But what if you want to test whether there are differences among the population means of more than two populations?
The Challenge of Comparing Multiple Means
Why don’t we just pair off the various populations and test for differences in means for all the pairs? There’s a problem with that. Say you have four populations you’re interested in. Four populations (say A, B, C, and D) make 6 pairs (AB, AC, AD, BC, BD, and CD), so you have to do six tests. What if you’re conducting the tests at the 5% significance level? Then there’s a 95% chance that you won’t make a Type I error on any given test. But the chance that you won’t make a Type I error on any of the tests is 0.95⁶ ≈ 0.74, because you have to multiply the chance that you don’t make a Type I error on the first test by the chance that you don’t make a Type I error on the second test, and so on. (Here we are assuming that all the tests are independent, so the multiplication rule in Chapter 4 lets us calculate the probability that we don’t make any Type I errors in all 6 tests.) The significance level for these tests as a group is therefore 1 − 0.74 = 0.26, or 26%. Not what we want: the risk of having made a Type I error somewhere (anywhere in one of the 6 tests) is too great. In addition, such a battery of tests provides more than you need if what you are interested in is simply whether any of the four groups is different from the rest.
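To see this arithmetic in code, here is a minimal Python sketch; the group count and per-test significance level are simply the values from the example above:

```python
from math import comb

k = 4            # number of groups
alpha = 0.05     # significance level for each pairwise test

n_pairs = comb(k, 2)                 # C(4, 2) = 6 pairwise comparisons
p_no_error = (1 - alpha) ** n_pairs  # chance of no Type I error in any of the tests
familywise_alpha = 1 - p_no_error    # chance of at least one Type I error overall

print(n_pairs)                      # 6
print(round(p_no_error, 2))         # 0.74
print(round(familywise_alpha, 2))   # 0.26
```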
To counter this difficulty, a technique called analysis of variance was invented. It first appeared in 1918 in a paper by R. A. Fisher, a British statistician and geneticist, and it is commonly known by the acronym ANOVA.
Example: Comparing 3 Groups
Here’s an example to give you an idea of the concepts involved. Let’s say you have three different populations, A, B, and C, and you take a sample of size 3 from each population. You’re interested in the means of these populations, and you’re claiming that they are not all the same.
Compare two scenarios, which I call Set 1 and Set 2 (each row shows the three samples):

Set 1:  A: 5, 10, 15    B: 10, 15, 20    C: 15, 20, 25
Set 2:  A: 9, 10, 11    B: 14, 15, 16    C: 19, 20, 21
In which case would you be more likely to conclude that the population means aren’t the same? If you
said Set 2, you have the right idea.
As you can see, the sample means are the same for Sets 1 and 2: in both sets, the mean of A is 10, the mean of B is 15, and the mean of C is 20.
However, the numbers in Set 1 are all spread out and overlapping, whereas those in Set 2 are tightly
grouped around the means and make you believe that they might actually come from populations with
different population means.
Here are two boxplots that make it clearer why we are more likely to conclude that the population means differ in Set 2 than in Set 1:

[Boxplots of Set 1 and Set 2]
The basic idea of analysis of variance is to compare the variability of the sample means (which we call
the variance between groups) to the variability of the samples themselves (which we call the variance
within groups). If the former is large compared to the latter, as in Set 2, we feel that there really are
differences among the population means, but if the variability between groups is not large compared to
the variability within groups, we’re not going to conclude that there are differences among the
population means. In the second case we say that there is too much “noise” to draw a conclusion about
the differences.
Let’s use the range (the difference between the largest number and the smallest number) as a measure
of variability. For both Set 1 and Set 2, the sample means have a range of 10. That is a measure of the
variability between groups. But for Set 1, the variability within the groups, if measured by the range, is
10 (e.g. 15 - 5), whereas for Set 2, the variability within the groups measured this way is 2 (e.g. 11 - 9).
So compared to the variability within groups, the variability between the groups is much larger (five
times larger) for Set 2 than it is for Set 1 (where the two are the same).
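Here is a short Python sketch of this range comparison; it assumes the Set 1 and Set 2 values shown earlier:

```python
# Samples for the two scenarios: three groups of three values each
set1 = {"A": [5, 10, 15], "B": [10, 15, 20], "C": [15, 20, 25]}
set2 = {"A": [9, 10, 11], "B": [14, 15, 16], "C": [19, 20, 21]}

def range_comparison(groups):
    means = [sum(g) / len(g) for g in groups.values()]
    between = max(means) - min(means)                        # range of the group means
    within = max(max(g) - min(g) for g in groups.values())   # within-group range
    return between, within

print(range_comparison(set1))  # (10.0, 10): between equals within
print(range_comparison(set2))  # (10.0, 2):  between is five times the within range
```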
Introducing the F-distribution
Of course, the range is not a very good measure of variability. Much better is the standard deviation, or its square, the variance. The comparison of the variance between groups and the variance within groups is done by using a ratio, which can be roughly stated as follows:

F = (variation between groups) / (variation within groups)
The test statistic and its distribution are now denoted by the letter F in Fisher's honor (the notation was introduced by the statistician George Snedecor), a grander fate than that of Gosset, who published his own ground-breaking work under the humble pseudonym "Student," which is why its distribution is known as Student's t.
Fisher's ingenuity was in discovering a way to condense all the data from several groups into one test statistic. By working out the mathematical properties of this new statistic, he was able to apply the same mechanism of hypothesis testing to this rather complex decision problem involving multiple groups.
We’ll get into the calculation of Fisher's F statistic briefly below. For now, it helps to keep in mind that the shape of the F-distribution (see the graph below) means that the P-value is always right-tailed. In addition, since F is expressed as a ratio, the numerator and the denominator each have their own degrees of freedom.
When the groups have different means (evidence against H₀), the numerator tends to be much larger than the denominator, which leads to a large F statistic. On the other hand, if the sample means of the groups were all identical, the numerator, and hence the whole ratio, would equal zero.
This interpretation of the F statistic may help you see why the P-value needs to be right-tailed: it characterizes how "extreme" the test statistic is, assuming that H₀ is true (all the means are equal). "Extreme" here means a large F value, since the F-distribution does not allow any negative values (unlike the normal or t-distributions). Specifically, the larger the F value, the stronger the evidence that the means differ from each other.
Hypotheses for ANOVA / F-test
Now let's state the hypotheses for ANOVA. Since there are three groups, we will use:

H₀: μ₁ = μ₂ = μ₃
Hₐ: at least one group has a mean different from the rest.
In the alternative hypothesis we are not saying that the three means must all differ from each other. As long as one of them is different, the alternative hypothesis is supported.
Interpreting the Result of ANOVA
Continuing with Set 1 as a working example, there are k = 3 groups, each containing 3 values, so the total number of values is n = 9. The F statistic will have k − 1 = 2 degrees of freedom in the numerator and n − k = 6 in the denominator. If we put these data into GeoGebra, we get the following output:

[GeoGebra one-way ANOVA output for Set 1]

All the essential pieces are included in the GeoGebra output shown above: the F statistic is evaluated as the ratio of the Mean Square (MS) between groups to the MS within groups, F = 75/25 = 3, while the P-value is 0.125. You can also check the P-value by using the F calculator in GeoGebra to evaluate P(F > 3) = 0.125:

[GeoGebra F-distribution calculator showing P(F > 3) = 0.125]
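If you would like to verify GeoGebra's numbers in code, here is a minimal Python sketch of the same calculation; it assumes the Set 1 data shown earlier and uses SciPy only for the right-tail probability of the F-distribution:

```python
from scipy import stats

# Set 1 samples: three groups of three values
groups = [[5, 10, 15], [10, 15, 20], [15, 20, 25]]

k = len(groups)                   # number of groups: 3
n = sum(len(g) for g in groups)   # total number of values: 9
grand_mean = sum(sum(g) for g in groups) / n

# Sum of squares between groups: group size times the squared deviation
# of each group mean from the grand mean
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# Sum of squares within groups: squared deviations of each value from its group mean
ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

ms_between = ss_between / (k - 1)       # 150 / 2 = 75
ms_within = ss_within / (n - k)         # 150 / 6 = 25
F = ms_between / ms_within              # 75 / 25 = 3
p_value = stats.f.sf(F, k - 1, n - k)   # right-tail area: P(F > 3) = 0.125

# stats.f_oneway(*groups) performs the same calculation in a single call
print(F, p_value)   # 3.0 0.125
```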
In research articles, when ANOVA is used, it’s customary to report the F statistic as well as the degrees
of freedom. So in our example, we should say F(2, 6) = 3.00.
If we use a significance level of 0.10, we fail to reject H₀ (since 0.125 > 0.10) and conclude that there is not enough evidence to show a difference in the group means.
If we use Set 2 instead, the outcome is very different, as we would expect given that the groups do not overlap at all:

[GeoGebra one-way ANOVA output for Set 2]

In this case, although the group means are the same as in Set 1, the huge F statistic (F(2, 6) = 75) indicates that the between-group variance is much larger than the within-group variance. Hence the P-value is far below the significance level, leading to the rejection of H₀. Consequently, we conclude that there is indeed a difference among the group means.
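A quick check of the Set 2 numbers, using SciPy's one-way ANOVA routine on the same assumed data:

```python
from scipy import stats

set2 = [[9, 10, 11], [14, 15, 16], [19, 20, 21]]
F, p = stats.f_oneway(*set2)
print(F, p)   # F = 75.0, p ≈ 5.7e-05: far below alpha = 0.10, so reject H0
```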
Another way to draw the same conclusions is to use the traditional (critical value) method of hypothesis testing. Using the F calculator with the degrees of freedom mentioned above, the critical value corresponding to α = 0.10 is 3.46, as shown in the following graph:

[F-distribution with df (2, 6), showing the critical value 3.46 for α = 0.10]

Since the F-test is always right-tailed, we draw the same conclusions as with the P-value method: F = 3 is less than the critical value, leading to a failure to reject H₀; however, F = 75 lies in the critical region, guaranteeing the rejection of H₀.
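The critical value can also be checked in code; here is a brief sketch using SciPy's inverse CDF for the F-distribution:

```python
from scipy import stats

alpha = 0.10
critical = stats.f.ppf(1 - alpha, 2, 6)   # 90th percentile of F(2, 6)
print(round(critical, 2))                 # 3.46

# Compare each test statistic with the critical value
for F in (3, 75):
    decision = "reject H0" if F > critical else "fail to reject H0"
    print(F, decision)   # 3 -> fail to reject H0, 75 -> reject H0
```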
After you have studied the F-test, you may wonder why it is used so widely. The answer is that multiple categories naturally emerge in many situations where you may want to compare means. For example, if you were trying to find out whether a fertilizer helps increase the yield of a crop, you may not know how much fertilizer to use. So you would plant several plots with varying amounts of fertilizer and observe whether any of them yields more than the rest. (This was essentially Fisher's original problem: he worked at Rothamsted, a British agricultural research station.) The same scenario applies to the testing of new pharmaceuticals, where it is difficult to determine the optimal dosage. ANOVA has also been a popular choice for term projects in the past. For example, in the folder of sample projects, you may find a project investigating whether the seating of a customer (bar, table, take-out) has an effect on the amount of the tip. So I hope you will keep ANOVA in your toolbox when you consider the different ways your data can be analyzed.