Analysis of Variance (ANOVA)

In previous weeks, we've covered tests of claims about the population mean and about the difference between the means of two populations. But what if you want to test whether there are differences among the population means of more than two populations?

The Challenge of Comparing Multiple Means

Why don't we just pair off the various populations and test for differences in means for all the pairs? There's a problem with that. Say you have four populations you're interested in. Four populations (say A, B, C, and D) make 6 pairs (AB, AC, AD, BC, BD, and CD), so you have to do six tests. What if you conduct the tests at the 5% significance level? Then there's a 95% chance that you won't make a Type I error on any given test. But the chance that you won't make a Type I error on any of the tests is $0.95^6 \approx 0.74$, because you have to multiply the chance that you don't make a Type I error on the first test by the chance that you don't make one on the second test, and so on. (Here we are assuming that all the tests are independent, so the multiplication rule from Chapter 4 lets us calculate the probability that we make no Type I errors in all 6 tests.) The significance level for these tests as a group is therefore $1 - 0.74 = 0.26$, or 26%. That is not what we want: the risk of having made a Type I error somewhere (in any one of the 6 tests) is too great. In addition, even if you were willing to do all this work, such a battery of tests provides more than you need if what you are interested in is simply whether any of the four groups differs from the rest.

To overcome this difficulty, a technique called analysis of variance was invented. It first appeared in 1918 in a paper by R. A. Fisher, a British statistician and geneticist. It goes by the nickname ANOVA, which stands for ANalysis Of VAriance.

Example: Comparing 3 Groups

Here's an example to give you an idea of the concepts involved. Let's say you have three different populations, A, B, and C, and you take a sample of size 3 from each population. You're interested in the population means of these populations, and you're claiming that these population means are not all the same. Compare two scenarios, which I call Set 1 and Set 2:

[Data table: the three samples in Set 1 and in Set 2]

In which case would you be more likely to conclude that the population means aren't the same? If you said Set 2, you have the right idea. As you can see, the sample means are the same for Sets 1 and 2. However, the numbers in Set 1 are all spread out and overlapping, whereas those in Set 2 are tightly grouped around their means, which makes you believe that they might actually come from populations with different population means. Here are two boxplots that make it clearer why the means in Set 2 are more likely to be different than those in Set 1:

[Boxplots of Set 1 and Set 2]

The basic idea of analysis of variance is to compare the variability of the sample means (which we call the variance between groups) to the variability of the samples themselves (which we call the variance within groups). If the former is large compared to the latter, as in Set 2, we conclude that there really are differences among the population means; but if the variability between groups is not large compared to the variability within groups, we're not going to conclude that there are differences among the population means. In the latter case we say that there is too much "noise" to draw a conclusion about the differences.
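As a quick check of the arithmetic above, here is a minimal Python sketch of the pairwise-testing problem. It assumes, as the text does, that the six tests are independent; the variable names are mine.

```python
from itertools import combinations

# The six pairs formed by four populations A, B, C, D
pairs = list(combinations("ABCD", 2))
print(len(pairs))                        # 6 pairwise tests

alpha = 0.05                             # significance level of each test
p_no_error = (1 - alpha) ** len(pairs)   # 0.95**6, about 0.74
family_alpha = 1 - p_no_error            # about 0.26, the 26% from the text
print(round(p_no_error, 2), round(family_alpha, 2))
```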
Let's use the range (the difference between the largest number and the smallest number) as a measure of variability. For both Set 1 and Set 2, the sample means have a range of 10; that is a measure of the variability between groups. But for Set 1 the variability within the groups, measured by the range, is 10 (e.g., $15 - 5$), whereas for Set 2 the variability within the groups measured this way is 2 (e.g., $11 - 9$). So compared to the variability within groups, the variability between the groups is much larger (five times larger) for Set 2 than it is for Set 1 (where the two are the same).

Introducing the F-distribution

Of course, the range is not a very good measure of variability. Much better are the standard deviation and its square, the variance. The comparison of the variance between groups and the variance within groups is done with a ratio, which can be roughly stated as follows:

$$F = \frac{\text{variation between groups}}{\text{variation within groups}}$$

The statistic came to be called F after Fisher himself (the letter was chosen in his honor by George Snedecor), a name that does not sound very modest in comparison with Gosset, who published under a pseudonym and gave us the much humbler name "Student's t distribution" for another ground-breaking piece of work. Fisher's ingenuity was to discover a way to condense all the data in several groups into one test statistic. By working out the mathematical properties of this new statistic, he was able to apply the same mechanism of hypothesis testing to this rather complex decision problem involving multiple groups.

We'll get into the calculation of Fisher's F statistic briefly below. For now, it helps to keep in mind that the nature of the F-distribution (see the graph below) means that the P-value is always right-tailed. In addition, since F is expressed as a ratio, the numerator and the denominator each have their own degrees of freedom. When the groups have different means (evidence against $H_0$), the numerator tends to be much larger than the denominator, which leads to a large F statistic. On the other hand, when the sample means of the groups are all equal, the ratio is zero, since the numerator is zero. This interpretation of the F statistic may help you see why the P-value has to be right-tailed: it characterizes how "extreme" the test statistic is, assuming that $H_0$ is true (all the means are equal). "Extreme" here means a large F value, since the F-distribution does not allow any negative values (unlike the normal or the t-distributions). Roughly speaking, the larger the F value, the stronger the evidence that the means differ from one another.

Hypotheses for ANOVA / F-test

Now let's state the hypotheses for ANOVA. Since there are three groups, we will use:

$H_0$: $\mu_1 = \mu_2 = \mu_3$
$H_a$: at least one group has a mean different from the rest.

In the alternative hypothesis we are not saying that the three means must all be different from each other. As long as one of them is different, that supports the alternative hypothesis.

Interpreting the Result of ANOVA

Continuing with Set 1 as a working example, there are $k = 3$ groups, each containing 3 values, so the total number of values is $n = 9$. The F statistic therefore has $k - 1 = 2$ degrees of freedom in the numerator and $n - k = 6$ in the denominator. If we put these data into GeoGebra, we get the following output:

[GeoGebra one-way ANOVA output for Set 1]

All the essential pieces are included in the GeoGebra screenshot above: the F statistic is evaluated as the ratio of the Mean Square (MS) between groups to the MS within groups, $F = 75/25 = 3$, and the P-value is 0.125.
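The original data tables are not reproduced here, but the following values are consistent with every number quoted in this example (sample means of 10, 15, and 20 in both sets; within-group ranges of 10 and 2; $F(2, 6) = 3$ and $75$). Treat the specific values as my reconstruction, not the original data. A minimal sketch in Python using scipy:

```python
from scipy import stats

# Hypothetical data consistent with the numbers quoted in the text;
# in both sets the three sample means are 10, 15, and 20.
set1 = [[5, 10, 15], [10, 15, 20], [15, 20, 25]]   # spread out, overlapping
set2 = [[9, 10, 11], [14, 15, 16], [19, 20, 21]]   # tightly grouped

# One-way ANOVA: F = MS between / MS within, and the P-value is right-tailed
f1, p1 = stats.f_oneway(*set1)
f2, p2 = stats.f_oneway(*set2)
print(f"Set 1: F(2, 6) = {f1:.2f}, P = {p1:.3f}")   # F = 3.00, P = 0.125
print(f"Set 2: F(2, 6) = {f2:.2f}, P = {p2:.5f}")   # F = 75.00, P = 0.00006
```

Note that the same between-group variability produces a very different F statistic depending on the within-group "noise", which is exactly the comparison ANOVA formalizes.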
You can also check the P-value by using the F calculator in GeoGebra to evaluate $P(F > 3) = 0.125$:

[GeoGebra F-distribution calculator showing P(F > 3) = 0.125]

In research articles, when ANOVA is used, it's customary to report the F statistic together with its degrees of freedom. So in our example we would write $F(2, 6) = 3.00$. If we use a significance level of 0.10, we fail to reject $H_0$ and conclude that there is not enough evidence to show a difference in the group means.

If we use Set 2 instead, the outcome is very different, as we expected from the fact that the groups do not overlap at all:

[GeoGebra one-way ANOVA output for Set 2]

In this case, although the group means are the same as in Set 1, the huge F statistic ($F(2, 6) = 75$) indicates that the between-group variance is much larger than the within-group variance. Hence the P-value is far below the significance level, leading to the rejection of $H_0$. Consequently, we conclude that there is indeed a difference among the group means.

Another way to draw the same conclusions is to use the traditional method of hypothesis testing. Using the F calculator with the degrees of freedom mentioned above, the critical value corresponding to $\alpha = 0.10$ is 3.46, as shown in the following graph (and reproduced in the code sketch at the end of this section):

[F-distribution graph with the critical region to the right of 3.46]

Since the F-test is always right-tailed, we draw the same conclusions as with the P-value method: $F = 3$ is less than the critical value, which leads to a failure to reject $H_0$; $F = 75$, however, falls in the critical region and guarantees the rejection of $H_0$.

After you have studied the F-test, you may wonder why it is used so widely. The answer is that multiple categories naturally emerge in many situations where you may want to compare means. For example, if you were trying to find out whether a fertilizer helps increase the yield of a crop, you may not know how much fertilizer to use. So you would plant several plots with varying amounts of fertilizer and observe whether any of them provides a better yield than the rest. (This was actually Fisher's original problem: he was working at Rothamsted, an agricultural research station in Britain.) The same scenario applies to the testing of new pharmaceuticals, where it is difficult to determine the optimal dosage.

ANOVA has also been a popular choice for term projects in the past. For example, in the folder of sample projects you may find a project investigating whether the seating of a customer (bar, table, take-out) has an effect on the amount of the tip. So I hope you will keep ANOVA in your toolbox when you consider the different ways your data can be analyzed.
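As promised above, here is a minimal Python sketch of the traditional (critical-value) method, using scipy and the same degrees of freedom as in the example; the 3.46 and 0.125 it prints match the values quoted in the text.

```python
from scipy import stats

alpha = 0.10
df_num, df_den = 2, 6                        # k - 1 and n - k from the example

# Critical value: the point leaving area alpha in the right tail of F(2, 6)
f_crit = stats.f.ppf(1 - alpha, df_num, df_den)
print(f"critical value = {f_crit:.2f}")      # 3.46

# The F-test is right-tailed: reject H0 only if F exceeds the critical value
for name, f in [("Set 1", 3.0), ("Set 2", 75.0)]:
    decision = "reject H0" if f > f_crit else "fail to reject H0"
    print(f"{name}: F = {f} -> {decision}")

# Cross-check against the P-value method: P(F > 3) is the right-tail area
print(stats.f.sf(3.0, df_num, df_den))       # 0.125
```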