Another Look at Multiple Testing

• We know that H0 : Cµ = 0 has been rejected, where C is an m × p matrix of full row rank.
• Suppose we are interested in determining which of the m hypotheses on Cµ are significantly different from zero.
• Let H0k denote the null hypothesis corresponding to the kth coordinate of Cµ, i.e. H0k : c_k'µ = 0, where c_k' is the kth row of C.
• Let p1, p2, . . . , pm denote the p-values corresponding to the m tests.

397

Another Look at Multiple Testing (contd.)

• Suppose m0 of the H0k's are true and m1 = m − m0 are false. For now, we shall assume these counts are known.
• Let c denote a value between 0 and 1 that serves as the common critical value for all of the tests:
  – We reject H0k if pk ≤ c, in which case we declare the difference between c_k'µ and zero to be significant.

398

All Possible Outcomes of the Hypothesis Tests

                H0 true   H0 false   Total
  Accept H0        U          T        W
  Reject H0        V          S        R
  Total           m0         m1        m

• U = # true negatives, V = # false positives
• T = # Type II errors (false negatives), S = # true positives
• W = # of k such that H0k is not rejected, R = # of k such that H0k is rejected
• Note that W and R are observed, but the rest are not.

399

Familywise Error Rate (FWER)

• The FWER is defined as the probability of having at least one false positive:

  FWER = P(V > 0)

• Traditionally, statisticians have controlled the FWER in multiple testing.
• Controlling the FWER is the same as choosing the rejection level c such that the FWER is no more than some desired α.
• A simple approach to controlling the FWER is the Bonferroni correction, with c = α/m, for which the FWER is bounded above by α for any family of m tests.

400

Holm's Method

• Let p(1) ≤ p(2) ≤ . . . ≤ p(m) denote the m ordered p-values obtained from the m tests.
• Find the largest k such that p(i) ≤ α/(m − i + 1) for all i ≤ k.
• If no such k exists, declare that no tests are significant; otherwise, reject the null hypotheses corresponding to the k smallest p-values.
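As an illustration (not part of the original slides), the Bonferroni cutoff and Holm's step-down rule can be sketched in Python; `pvals` is a list of the m p-values and `alpha` is the desired FWER bound:

```python
def bonferroni(pvals, alpha=0.05):
    """Bonferroni: reject H0k whenever p_k <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def holm(pvals, alpha=0.05):
    """Holm's step-down method: reject H0 for the k smallest p-values,
    where k is the largest index with p_(i) <= alpha / (m - i + 1)
    for all i <= k."""
    m = len(pvals)
    # sort p-values, remembering the original test indices
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for step, idx in enumerate(order):        # step = 0, 1, ..., m - 1
        # threshold alpha / (m - i + 1) with i = step + 1
        if pvals[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break                             # step down: stop at first failure
    return reject
```

Since Holm's thresholds α/(m − i + 1) are never smaller than the Bonferroni cutoff α/m, Holm rejects every hypothesis that Bonferroni does, and possibly more.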
401

An Example

• Suppose we conducted m = 5 tests with the following p-values:

  Test   p-value
   1     0.042
   2     0.001
   3     0.031
   4     0.014
   5     0.007

• Suppose we want to control the FWER at level 0.05.
  – Using Bonferroni, the cutoff would be 0.05/5 = 0.01, so H02 and H05 would be the only hypotheses rejected.
  – The ordered p-values are as follows:

    Order   p-value   α/(m − i + 1)
     (1)    0.001     0.01
     (2)    0.007     0.0125
     (3)    0.014     0.0167
     (4)    0.031     0.025
     (5)    0.042     0.05

  – So Holm's method would reject H0 for tests 2, 5 and 4.

402

Strong and Weak Control of Error Rates

• A testing method provides strong control of an error rate for a family of m tests if the error rate is controlled regardless of how many (or which) of the m null hypotheses are true.
• A testing method provides weak control of an error rate for a family of m tests if the error rate is controlled whenever all null hypotheses are true.
• Bonferroni and Holm's methods are both examples of methods that provide strong control.

403

A Closer Look at FWER

• Suppose a scientist conducts N independent experiments, each involving m tests of hypotheses, and obtains a list of significant results for each experiment.
• Each list that contains one or more falsely significant results is considered to be in error for FWER calculations.
• The FWER is approximated by the proportion of these N lists that contain one or more false positives.
• Even if only one of a list's m tests is not truly significant while all the rest are, the list is still counted as being in error. Such lists can make up only a small proportion of the total number of lists if the FWER is to be controlled.
• Thus, controlling the FWER may seem overly conservative.

404

Controlling the False Discovery Rate (FDR)

• The FDR is an alternative error rate introduced by Benjamini and Hochberg (1995), formally defined as E(Q), where Q = V/R if R > 0 and Q = 0 otherwise.
• Controlling the FDR amounts to choosing the cutoff c such that FDR ≤ α.
• Conceptually, the FDR averages the ratio of the number of false positives to the total number of rejections (discoveries).

405

Conceptual Description of FDR

• When N independent experiments are conducted, each containing m simultaneous tests as before, consider the N lists of tests declared significant in each experiment.
• For each list, consider the ratio of the number of false positives to its cardinality (set to zero if the list is empty).
• The average of these ratios (over the N experiments) approximates the FDR.
• Note: Some of the above lists may contain many false positives, yet the FDR may still be controlled by the method we are using, because the FDR is all about the average performance across replicated experiments.

406

B-H Procedure for Strong Control of FDR at Level α

• Let p(1) ≤ p(2) ≤ . . . ≤ p(m) denote the m ordered p-values.
• Find the largest integer k such that p(k) ≤ (k/m)α.
• If no such k exists, declare nothing significant; otherwise, reject the H0j's corresponding to the k smallest p-values.

407

Example Revisited

  Test   p-value          Order   p-value   (i/m)α
   1     0.042             (1)    0.001     0.01
   2     0.001             (2)    0.007     0.02
   3     0.031             (3)    0.014     0.03
   4     0.014             (4)    0.031     0.04
   5     0.007             (5)    0.042     0.05

• The B-H procedure would reject H0i for all 5 tests.

408

Example Slightly Modified

  Test   p-value          Order   p-value   (i/m)α
   1     0.042             (1)    0.001     0.01
   2     0.001             (2)    0.007     0.02
   3     0.041             (3)    0.014     0.03
   4     0.014             (4)    0.041     0.04
   5     0.007             (5)    0.042     0.05

• The B-H procedure would still reject H0i for all 5 tests, even though p(4) = 0.041 > (4/m)α = 0.04, because p(5) = 0.042 ≤ (5/m)α = 0.05 makes k = 5 the largest passing rank.

409
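As with the FWER procedures, the B-H step-up rule can be sketched in Python (an illustration, not part of the original slides); `pvals` is the list of m p-values and `alpha` is the desired FDR bound:

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """B-H step-up procedure: find the largest k with
    p_(k) <= (k/m) * alpha and reject the hypotheses having
    the k smallest p-values."""
    m = len(pvals)
    # sort p-values, remembering the original test indices
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= (rank / m) * alpha:
            k = rank                 # step up: keep the LARGEST passing rank
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject
```

Note the step-up logic: unlike Holm's step-down rule, the loop does not stop at the first failing rank, which is exactly why the modified example above still rejects all 5 hypotheses even though p(4) exceeds its threshold.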