Error Rates for Multiple Comparisons

When testing hypotheses, a Type I error occurs when the null hypothesis, H0, is rejected when it is actually true. In the case of comparing two means, µ1 and µ2, if the null hypothesis H0: µ1 = µ2 is rejected in favor of H1: µ1 ≠ µ2 as the result of a two-sample t-test when the two means are actually equal, a Type I error is said to be committed. That is, the two means are found to be different when they are not: a so-called "false positive." Suppose this two-treatment experiment is repeated a large number of times and the above t-test is carried out separately for each data set, using α = .05 (say). We define the Type I error rate as the proportion of tests in which a Type I error is made. Since α = .05 is set to control the Type I error, the Type I error rate is expected to be close to .05 (or whatever α-level is used for the tests).

When comparing a treatments, there are a(a − 1)/2 possible pairwise tests (or comparisons) of the type H0: µp = µq vs. H1: µp ≠ µq. If two-sample t-tests are performed one comparison at a time at a selected α level, the Type I error rate for each individual test is still controlled at α; this is the so-called comparison-wise error rate, or CER. Note that if a = 4, the number of pairwise t-tests turns out to be 4(3)/2 = 6; that is, 6 pairwise tests are performed. This means that anywhere from 0 up to 6 Type I errors may be committed (if µp = µq for some pairs). If the experiment is repeated many times and the proportion of experiments in which at least one Type I error is made is determined, it will be quite different from the α-level used for each test (in fact, it increases as a, and thus the number of comparisons made, gets larger). For use with pairwise tests, an alternative measure of the Type I error rate is needed. The experiment-wise error rate, or EER, is defined as the probability of making at least one Type I error among all pairwise tests in the experiment.
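The distinction between the CER and the EER can be illustrated with a small simulation. The sketch below (not part of the notes; the sample size n = 30, the number of repetitions, and the normal-approximation critical value 1.96 for α = .05 are illustrative assumptions) generates many a = 4 treatment experiments in which all population means are truly equal, performs all 6 pairwise t-tests in each, and estimates both error rates.

```python
import math
import random
from itertools import combinations

random.seed(42)
a, n, reps = 4, 30, 2000      # treatments, per-group sample size, simulated experiments
crit = 1.96                    # normal-approximation critical value for alpha = .05 (simplifying assumption)

pairs = list(combinations(range(a), 2))   # the a(a-1)/2 = 6 pairwise comparisons

def mean(x):
    return sum(x) / len(x)

def var(x):  # sample variance with divisor (len - 1)
    m = mean(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)

total_rejections = 0
experiments_with_error = 0

for _ in range(reps):
    # All treatment means truly equal, so every rejection is a Type I error
    groups = [[random.gauss(0, 1) for _ in range(n)] for _ in range(a)]
    rejected = 0
    for p, q in pairs:
        se = math.sqrt(var(groups[p]) / n + var(groups[q]) / n)
        t = (mean(groups[p]) - mean(groups[q])) / se
        if abs(t) > crit:
            rejected += 1
    total_rejections += rejected
    experiments_with_error += (rejected > 0)

cer = total_rejections / (reps * len(pairs))   # comparison-wise error rate
eer = experiments_with_error / reps            # experiment-wise error rate
print(f"estimated CER = {cer:.3f}, estimated EER = {eer:.3f}")
```

The estimated CER stays near the nominal .05 per test, while the estimated EER, the proportion of experiments with at least one false positive, is several times larger.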
If the experiment is repeated many times as above, the CER may be calculated as the ratio of the number of Type I errors made to the total number of comparisons. Tukey's method controls the EER at the α-level selected for the tests, while the LSD method controls the CER. An upper bound on the EER can be derived using a probability inequality as EER ≤ 1 − (1 − α)^k for a specified α, where k is the number of tests being done. For a = 4, and thus k = 6, with α = .05 we find that the upper bound on the EER is ≈ .265. This means that if the LSD method with α = .05 is used to perform the pairwise tests in this experiment, the EER could be as large as .265! If we would still like to use the LSD method and keep the EER low, we may use a smaller α value, say .01, in which case the upper bound on the EER for this experiment would be ≈ .0585; however, the power of each test for detecting a difference is diminished at the same time. The Bonferroni procedure (see notes) suggests performing the pairwise t-tests at an α value divided by the number of comparisons. For example, for the experiment with a = 4 and α = .05, a critical value of t.05/12 (that is, α/2 = .025 divided by the k = 6 comparisons) must be used for the t-tests if the Bonferroni method is adopted.

We know that when a procedure such as Tukey's HSD is used, the EER is controlled. The effect of this is that fewer comparisons are declared significantly different. It is easy to see this by comparing the numerical values of the HSD (the quantity used to decide whether a difference is significant under Tukey's method) and the LSD for a specified α. The HSD value is larger, meaning that the difference in the sample means has to be larger for the test to detect a difference. As a result, Tukey's method declares fewer differences significant. Thus we say that Tukey's method is more conservative than performing pairwise t-tests with the LSD method.
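The numbers quoted above can be verified directly from the bound. A minimal check (the function name is my own, not from the notes):

```python
def eer_upper_bound(alpha, k):
    """Upper bound on the experiment-wise error rate for k tests: 1 - (1 - alpha)^k."""
    return 1 - (1 - alpha) ** k

a = 4
k = a * (a - 1) // 2                # 6 pairwise comparisons

print(round(eer_upper_bound(0.05, k), 4))   # bound for alpha = .05, about .265
print(round(eer_upper_bound(0.01, k), 4))   # bound for alpha = .01, about .0585

bonferroni_alpha = 0.05 / k         # per-comparison level under Bonferroni
print(round(bonferroni_alpha, 5))   # roughly .00833 per test
```

Note how quickly the bound grows with k: a modest number of comparisons already inflates the worst-case EER well beyond the nominal .05.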
Thus, depending on whether the confidence interval approach or the underlining approach (displaying the sorted means with lines under groups that are not significantly different) is used, the experimenter may end up drawing a different set of conclusions with each method.