Error Rates for Multiple Comparisons

When testing hypotheses, a Type I error occurs when the null hypothesis, H0, is incorrectly rejected when it is actually true. In the case of comparing two means, µ1 and µ2, if the
null hypothesis H0 : µ1 = µ2 is rejected in favor of H1 : µ1 ≠ µ2 as the result of a two-sample
t-test when the two means are actually equal, a Type I error is said to be committed. That
is, the two means are found to be different when they are not, a so-called "false positive."
Suppose that this two-treatment experiment is repeated a large number of times and the
above t-test is carried out separately for each data set, using α = .05 (say). The Type I
error rate is then defined as the proportion of those tests in which a Type I error is made. Since
each test is carried out at α = .05 to control Type I error, the Type I error rate is expected to be close to
.05 (or whatever α-level is used for the tests).
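This repeated-experiment idea can be checked with a short simulation. The sketch below (not part of the original notes) assumes normal data with equal means, n = 10 per group, and hard-codes the two-sided .05 critical value t ≈ 2.101 for df = 18; the rejection rate should come out near .05.

```python
import random
import statistics

random.seed(1)

def two_sample_t(x, y):
    # Pooled two-sample t statistic (equal variances assumed).
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * statistics.variance(x)
           + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / (sp2 * (1 / nx + 1 / ny)) ** 0.5

T_CRIT = 2.101  # two-sided .05 critical value for df = 18 (n = 10 per group)
reps, rejections = 2000, 0
for _ in range(reps):
    x = [random.gauss(0, 1) for _ in range(10)]  # mu1 = 0
    y = [random.gauss(0, 1) for _ in range(10)]  # mu2 = 0, so H0 is true
    if abs(two_sample_t(x, y)) > T_CRIT:
        rejections += 1  # each rejection here is a Type I error

print(round(rejections / reps, 3))  # should be near .05
```

Because H0 is true in every replicate, every rejection is a false positive, and the proportion of rejections estimates the Type I error rate.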
When comparing a treatments, there are a(a − 1)/2 possible pairwise tests (or comparisons) of
the type H0 : µp = µq vs. H1 : µp ≠ µq. If two-sample t-tests are performed
one comparison at a time at a selected α level, the Type I error rate for each individual test is
still controlled at α; this is the so-called comparison-wise error rate, or CER. Note that if a = 4,
the number of pairwise t-tests turns out to be 4(3)/2 = 6. That is, 6 pairwise tests are
performed, which means there is a possibility of committing anywhere from 0 up to 6 Type
I errors (if µp = µq for some pairs). If the experiment is repeated many times and the
proportion of experiments in which at least one Type I error is made is determined, it will be quite
different from the α-level used for each test; in fact, it increases as a, and thus the number of
comparisons made, gets larger.
For use with pairwise tests, an alternative measure of Type I error rate is therefore needed.
The experiment-wise error rate, or EER, is defined as the probability of making at least one Type
I error among all pairwise tests in the experiment. If the experiment is repeated many times
as above, the EER may be estimated as the proportion of experiments in which at least one Type I
error is made, while the CER may be estimated as the ratio of the number of Type I errors made
to the total number of comparisons. Tukey's method controls the EER at the α-level selected for
the tests, while the LSD method controls the CER.
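The distinction between the two rates can be seen by simulation. This sketch (an illustration added here, not from the original notes) assumes a = 4 treatments with equal means, n = 10 per group, and unadjusted pairwise t-tests at α = .05 with the df = 18 critical value hard-coded; the per-comparison rate stays near .05 while the experiment-wise rate is much larger.

```python
import itertools
import random
import statistics

random.seed(2)

def two_sample_t(x, y):
    # Pooled two-sample t statistic (equal variances assumed).
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * statistics.variance(x)
           + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / (sp2 * (1 / nx + 1 / ny)) ** 0.5

T_CRIT = 2.101  # two-sided .05 critical value for df = 18
a, n, reps = 4, 10, 1000
bad_experiments = 0  # experiments with at least one Type I error
total_errors = 0     # Type I errors over all comparisons
for _ in range(reps):
    groups = [[random.gauss(0, 1) for _ in range(n)] for _ in range(a)]  # all means equal
    errors = sum(1 for g1, g2 in itertools.combinations(groups, 2)
                 if abs(two_sample_t(g1, g2)) > T_CRIT)
    total_errors += errors
    if errors > 0:
        bad_experiments += 1

print(round(bad_experiments / reps, 3))    # estimated EER: well above .05
print(round(total_errors / (reps * 6), 3))  # estimated CER: still near .05
```

Since all four population means are equal, every declared difference is a Type I error; counting per experiment estimates the EER, and counting per comparison estimates the CER.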
An upper bound on the EER can be derived using a probability inequality as EER ≤ 1 − (1 − α)^k
for a specified α, where k is the number of tests being done. Then for a = 4, and thus k = 6,
with α = .05 we find that the upper bound on the EER is 1 − (.95)^6 ≈ .265. This means that for this
experiment, if the LSD with α = .05 is used to perform the pairwise tests, the EER could be as large
as .265! If we would still like to use the LSD method and keep the EER low, we may use a
smaller α value, say .01, in which case the upper bound on the EER for this experiment would
be 1 − (.99)^6 ≈ .0585; however, the power of each test for detecting a difference is also diminished at
the same time.
The Bonferroni procedure (see notes) suggests doing the pairwise t-tests using an α value
divided by the number of comparisons. For example, for the experiment with a = 4 and
α = .05, each two-sided test is carried out at level .05/6, so the critical value t.05/12
(the upper .05/12 quantile of the t-distribution) must be used if the Bonferroni method is adopted.
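The adjusted levels work out as follows (a small arithmetic sketch for the a = 4 case):

```python
# Bonferroni adjustment for a = 4 treatments -> 6 pairwise comparisons.
alpha, k = 0.05, 6
per_test = alpha / k  # each two-sided test is done at level .05/6
tail = per_test / 2   # upper-tail area for the critical value t_{.05/12}
print(round(per_test, 5), round(tail, 5))
# per-test level 0.00833, tail area 0.00417
```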
We know that when a procedure such as Tukey's HSD is used, the EER is controlled. The
effect of this is that fewer comparisons are declared to be significantly different.
It is easy to see this by comparing the numerical values of the HSD (the quantity used to decide
whether a difference is significant under Tukey's method) and the LSD for a specified α. The
HSD value is larger, meaning that the difference in the sample means has to be much larger
for the test to find a difference. This results in Tukey's method declaring fewer
differences significant. Thus we say that Tukey's method is more conservative than performing
pairwise t-tests or using the LSD method.
Thus, depending on whether the confidence interval approach or the underlining approach
is used, the experimenter may end up reaching a different set of conclusions with each method.