Another Look at Multiple Testing

• We know that $H_0 : C\mu = 0$ has been rejected, where $C$ is an $m \times p$ matrix of full row rank.
• Suppose we are interested in testing which of the $m$ coordinates of $C\mu$ are significantly different from zero.
• Let $H_{0k}$ denote the null hypothesis corresponding to the $k$th coordinate of $C\mu$, i.e. $H_{0k} : c_k' \mu = 0$, where $c_k'$ is the $k$th row of $C$.
• Let $p_1, p_2, \ldots, p_m$ denote the p-values corresponding to the $m$ tests.
Another Look at Multiple Testing (contd.)
• Suppose that $m_0$ of the $H_{0k}$'s are true and $m_1 = m - m_0$ are false. At this point, we shall assume that this is known.
• Suppose that $c$ denotes the value between 0 and 1 that will serve as the cutoff defining the critical region for all of the tests:
– Thus, we reject $H_{0k}$ if $p_k \le c$, in which case we declare the difference between $c_k' \mu$ and zero to be significant.
All Possible Outcomes of the Hypothesis Tests
               $H_0$ true   $H_0$ false   Total
Accept $H_0$       $U$          $T$         $W$
Reject $H_0$       $V$          $S$         $R$
Total              $m_0$        $m_1$       $m$
• $U$ = # true negatives, $V$ = # false positives
• $T$ = # false negatives (Type II errors), $S$ = # true positives
• $W$ = # of $k$ such that $H_{0k}$ is not rejected, $R$ = # of $k$ such that $H_{0k}$ is rejected
• Note that $W$ and $R$ are observed, but the rest are not.
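The table entries can be tallied in code when the truth is known, as in a simulation. The following is a minimal sketch; the function name `tally_outcomes`, the argument `null_is_true`, and the truth labels in the usage lines are all hypothetical and introduced here only for illustration.

```python
def tally_outcomes(p_values, null_is_true, c):
    """Count the table entries U, V, T, S, W, R for a fixed cutoff c.

    null_is_true[k] says whether H0k holds. This is known only in a
    simulation, never in a real analysis (an assumption of this sketch).
    """
    counts = {"U": 0, "V": 0, "T": 0, "S": 0}
    for p, is_null in zip(p_values, null_is_true):
        rejected = p <= c
        if is_null:
            counts["V" if rejected else "U"] += 1
        else:
            counts["S" if rejected else "T"] += 1
    counts["W"] = counts["U"] + counts["T"]  # not rejected (observed)
    counts["R"] = counts["V"] + counts["S"]  # rejected (observed)
    return counts


# Hypothetical data: 5 tests, cutoff c = 0.01, made-up truth labels.
p_values = [0.042, 0.001, 0.031, 0.014, 0.007]
null_is_true = [True, False, True, False, False]
print(tally_outcomes(p_values, null_is_true, c=0.01))
```

Only the row totals `W` and `R` could be computed from real data; the other four counts need the truth labels.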
Familywise Error Rate (FWER)
• FWER is defined as the probability of having at least one false positive:
$\mathrm{FWER} = P(V > 0)$
• Traditionally, statisticians have controlled the FWER in multiple testing.
• Controlling the FWER amounts to choosing the rejection cutoff $c$ such that the FWER is no more than some desired level $\alpha$.
• A simple approach to controlling the FWER is the Bonferroni method, for which $c = \alpha/m$; the FWER is then bounded above by $\alpha$ for any family of $m$ tests.
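The Bonferroni bound can be seen from the union bound. The short derivation below assumes, as is standard for valid p-values, that $P(p_k \le t) \le t$ whenever $H_{0k}$ is true; $I_0$ is notation introduced here for the index set of the $m_0$ true null hypotheses.

```latex
\mathrm{FWER} = P(V > 0)
  = P\Big( \bigcup_{k \in I_0} \{ p_k \le \alpha/m \} \Big)
  \le \sum_{k \in I_0} P(p_k \le \alpha/m)
  \le m_0 \cdot \frac{\alpha}{m}
  \le \alpha .
```

No independence among the $m$ tests is used in this argument.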
Holm’s Method
• Let $p_{(1)} \le p_{(2)} \le \ldots \le p_{(m)}$ denote the $m$ ordered p-values obtained from the $m$ tests.
• Find the largest $k$ such that $p_{(i)} \le \frac{\alpha}{m-i+1}$ for all $i \le k$.
• If no such $k$ exists, declare that no tests are significant; otherwise, reject the null hypotheses corresponding to the $k$ smallest p-values.
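The steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation; the function name `holm_reject` is hypothetical, and the p-values in the usage line are the five used in the example in this document.

```python
def holm_reject(p_values, alpha=0.05):
    """Return the 1-based test numbers rejected by Holm's method."""
    m = len(p_values)
    # Sort test indices by p-value, smallest first.
    order = sorted(range(m), key=lambda k: p_values[k])
    rejected = []
    for i, k in enumerate(order, start=1):
        # Step down: stop at the first ordered p-value above alpha/(m-i+1).
        if p_values[k] > alpha / (m - i + 1):
            break
        rejected.append(k + 1)
    return sorted(rejected)


print(holm_reject([0.042, 0.001, 0.031, 0.014, 0.007]))  # [2, 4, 5]
```

Stopping at the first failure is what makes the condition hold for all $i \le k$, not just for $i = k$.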
An Example
• Suppose we conducted $m = 5$ tests with the p-values:

Test   p-value
1      0.042
2      0.001
3      0.031
4      0.014
5      0.007

• Suppose we want to control the FWER at level 0.05.
– Using Bonferroni, the cutoff would be 0.05/5 = 0.01, so $H_{02}$ and $H_{05}$ would be the only ones rejected.
– The ordered p-values are as follows:

Order   p-value   $\alpha/(m-i+1)$
(1)     0.001     0.01
(2)     0.007     0.0125
(3)     0.014     0.0167
(4)     0.031     0.025
(5)     0.042     0.05

– Since $p_{(4)} = 0.031 > 0.025$, Holm's method would reject $H_0$ for tests 2, 5 and 4 only.
Strong and Weak Control of Error Rates
• A testing method provides strong control of an error rate for a family of $m$ tests if the error rate is controlled regardless of how many (or which) of the $m$ null hypotheses are true.
• A testing method provides weak control of an error rate for a family of $m$ tests if the error rate is controlled whenever all null hypotheses are true.
• Bonferroni's and Holm's methods are both examples of methods that provide strong control of the FWER.
A Closer Look at FWER
• Suppose a scientist conducts $N$ independent experiments, each involving $m$ tests of hypotheses, and obtains a list of significant results for each experiment.
• Each list that contains one or more falsely significant results is considered to be in error for FWER calculations.
• The FWER is approximated by the proportion of these $N$ lists that contain one or more false positives.
• Suppose that in some list only one of the results declared significant is a false positive and the rest are true positives; this list is still considered to be in error. Such lists must make up only a small proportion of the total number of lists if the FWER is to be controlled.
• Thus, controlling the FWER may seem overly conservative.
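The "proportion of lists in error" view can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not in the slides: all $m$ null hypotheses are true, the tests are independent, each p-value is Uniform(0, 1) under its null, and the Bonferroni cutoff is used. The function name `estimate_fwer` is hypothetical.

```python
import random


def estimate_fwer(n_experiments, m, alpha, seed=0):
    """Proportion of simulated experiments with at least one false positive.

    Assumptions of this sketch: all m nulls are true, the tests are
    independent, and each p-value is Uniform(0, 1) under its null.
    """
    rng = random.Random(seed)
    c = alpha / m  # Bonferroni cutoff
    lists_in_error = 0
    for _ in range(n_experiments):
        p_values = [rng.random() for _ in range(m)]
        # Every null is true here, so any rejection is a false positive.
        if any(p <= c for p in p_values):
            lists_in_error += 1
    return lists_in_error / n_experiments


print(estimate_fwer(n_experiments=10_000, m=5, alpha=0.05))
```

Under these assumptions the exact FWER is $1 - (1 - \alpha/m)^m \approx 0.049$ for $m = 5$ and $\alpha = 0.05$, so the printed proportion should land close to, and typically just below, 0.05.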
Controlling the False Discovery Rate (FDR)
• FDR is an alternative error rate introduced by Benjamini and Hochberg (1995) and formally defined as
$\mathrm{FDR} = E(Q)$, where $Q = V/R$ if $R > 0$ and $Q = 0$ otherwise.
• Controlling the FDR amounts to choosing the cutoff $c$ such that $\mathrm{FDR} \le \alpha$.
• Conceptually, FDR averages the ratio of the number of false positives to the total number of tests declared significant.
Conceptual Description of FDR
• When there are $N$ independent experiments conducted, each containing $m$ simultaneous tests as before, consider the $N$ lists containing the tests that are declared significant in each experiment.
• For each list, consider the ratio of the number of false positives to its cardinality (set to zero if the list is empty).
• The average of these ratios (over the $N$ experiments) approximates the FDR.
• Note: Some of the above lists may contain many false positives, yet the FDR may still be controlled by the method we are using, because FDR is about the average performance across replicated experiments.
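The averaging described above can be mimicked in a simulation where the truth is known. This is a rough sketch under assumptions that are not in the slides: independent tests, Uniform(0, 1) p-values under true nulls, p-values drawn as `u**4` under false nulls (an arbitrary choice that pushes them toward zero), and a fixed cutoff `c`. The function name `estimate_fdr` is hypothetical.

```python
import random


def estimate_fdr(n_experiments, m0, m1, c, seed=0):
    """Average of V/R over simulated experiments (0 when R = 0).

    Assumptions of this sketch: independent tests, Uniform(0, 1)
    p-values under true nulls, and p-values drawn as u**4 under false
    nulls (an arbitrary choice that pushes them toward zero).
    """
    rng = random.Random(seed)
    total_q = 0.0
    for _ in range(n_experiments):
        v = sum(rng.random() <= c for _ in range(m0))       # false positives
        s = sum(rng.random() ** 4 <= c for _ in range(m1))  # true positives
        r = v + s
        total_q += v / r if r > 0 else 0.0
    return total_q / n_experiments


print(estimate_fdr(n_experiments=10_000, m0=3, m1=2, c=0.05))
```

Because $Q \le 1$ and $Q > 0$ only when $V > 0$, the average of $Q$ can never exceed the proportion of lists with at least one false positive, which is why controlling the FDR is a less stringent requirement than controlling the FWER.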
B-H Procedure for Strong Control of FDR at Level $\alpha$
• Let $p_{(1)} \le p_{(2)} \le \ldots \le p_{(m)}$ denote the $m$ ordered p-values.
• Find the largest integer $k$ such that $p_{(k)} \le \frac{k\alpha}{m}$.
• If no such $k$ exists, declare nothing significant; otherwise, reject the $H_{0j}$'s corresponding to the $k$ smallest p-values.
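The procedure above can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation; the function name `bh_reject` is hypothetical, and the p-values in the usage line are the five from the earlier FWER example.

```python
def bh_reject(p_values, alpha=0.05):
    """Return the 1-based test numbers rejected by the B-H procedure."""
    m = len(p_values)
    # Sort test indices by p-value, smallest first.
    order = sorted(range(m), key=lambda j: p_values[j])
    # Step up: find the largest k with p_(k) <= k * alpha / m, even if
    # some smaller-ranked p-value exceeds its own threshold.
    k_max = 0
    for k, j in enumerate(order, start=1):
        if p_values[j] <= k * alpha / m:
            k_max = k
    return sorted(j + 1 for j in order[:k_max])


print(bh_reject([0.042, 0.001, 0.031, 0.014, 0.007]))  # [1, 2, 3, 4, 5]
```

Unlike Holm's method, the loop does not stop at the first p-value that exceeds its threshold; only the largest index satisfying the condition matters.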
Example Revisited
Test   p-value
1      0.042
2      0.001
3      0.031
4      0.014
5      0.007

Order   p-value   $i\alpha/m$
(1)     0.001     0.01
(2)     0.007     0.02
(3)     0.014     0.03
(4)     0.031     0.04
(5)     0.042     0.05

• The B-H procedure (with $\alpha = 0.05$) would reject $H_{0i}$ for all 5 tests.
Example Slightly Modified
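Test   p-value
1      0.042
2      0.001
3      0.041
4      0.014
5      0.007

Order   p-value   $i\alpha/m$
(1)     0.001     0.01
(2)     0.007     0.02
(3)     0.014     0.03
(4)     0.041     0.04
(5)     0.042     0.05

• The B-H procedure would still reject $H_{0i}$ for all 5 tests, even though $p_{(4)} > \frac{4\alpha}{m}$: because $p_{(5)} = 0.042 \le 0.05$, the largest $k$ satisfying the condition is still $k = 5$.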