Multiple comparisons, multiple testing, permutation testing, bootstrap

Multiple comparisons

“The more you look, the more you discover ... that’s actually false!”
Data dredging, data snooping, fishing expedition.

Suppose you do K hypothesis tests. Let’s evaluate the strategy: “report the best P-value”.
Suppose all the H0’s are true! (No “discoveries” to be made.)
Suppose they are all statistically independent. (Assume continuous data and continuous P-values.)
Then the P-values, $P_1, \ldots, P_K$, are i.i.d. Uniform(0,1).
What is the actual distribution of the reported best P-value? Hint: it’s not Uniform(0,1).

For K = 2:

\[
\Pr(\text{at least one P-value} \le 0.05) = \Pr(P_{\min} \le 0.05) = \Pr(P_1 \le 0.05 \text{ or } P_2 \le 0.05) = 2 \times 0.05 - 0.05^2 = 0.0975
\]

In general, under these assumptions, $\Pr(P_{\min} \le p) = 1 - (1 - p)^K$.

[Figure: unit square with axes $P_1$ and $P_2$, each marked at 0.05.]

The classical view of multiple testing (independent case)

If you test H0 versus $H_{Ai}$ for $i = 1, \ldots, k$, and then you report the best of the k P-values (“nominal”),

\[
P_{\min} = P_{\text{best}} = \min\{P_i : i = 1, \ldots, k\},
\]

then
a) the true Type I error for this procedure is bigger than the nominal Type I error,
b) the true P-value is bigger than the nominal P-value.

For (a), the reason is that the actual Type I error = Pr(rejection region | H0), and the overall rejection region is the union of the individual ones: $\bigcup_i \text{rejection region}_i$.
For (b), the reason is that the true “tail of surprise” includes tails for other hypotheses, not just the tail for $H_{i_{\text{best}}}$ (where $i_{\text{best}} = \arg\min\{P_i : i = 1, \ldots, k\}$).

Disturbing examples with multiple testing adjustments

Example 1: Response of cancer patients to IL-2: effect of immunologic HLA type (Rubin et al, 1995).

  Data = one 2x2 table: HLA type DQ1 = present or absent, response to IL-2 = yes or no. Fisher “exact” test P = 0.01.
  Data = three 2x2 tables, for HLA types DQ1…DQ3. Minimum P-value = 0.01 for DQ1. Times 3 = 0.03.
  Data = 3 groups of 2x2 tables, including 3 DQ types, 5 DP types, and 7 DR types. Minimum P-value = 0.01 for DQ1. Times 15 = 0.15.
  Data = 2 groups of groups of 2x2 tables, including MHC1 (A, B, C) and MHC2 (DP, DQ, DR). Total # tables = 120. Minimum P-value = 0.01 for DQ1. Times 120 = 1.2. Sidak: 0.70.

What is the “proper” collection of tests to control the Type I error over? Just the DQ1 test? All DQ tests? All MHC2 tests? All HLA tests?

Example 2: “Comparisons of a priori interest” (Cohen Anwar, Day 1983).

Testing 6 methods of measuring echocardiograms: are they equivalent? 6 x 5 / 2 = 15 comparisons.
Nominal P for method B versus method C = 0.005. But not “of a priori interest”, so P = 15 * 0.005 = 0.075, “not significant”.
But now the investigator states “the comparisons of a priori interest were: A versus D, A versus E, A versus F, D versus E, D versus F, E versus F”.
Now the adjusted P-value for B versus C is P = (15 - 6) * 0.005 = 0.045, “significant”.
So the inference on B versus C changed depending on how many other comparisons were “of a priori interest”.

Example 3: ECOG 5592 cooperative group clinical trial.

Arms: (A) etoposide + cisplatin, (B) taxol + cisplatin + G-CSF, (C) taxol + cisplatin.
Should multiple comparisons adjustments be made? Which ones?
The mystery answer: “There are four comparisons: B > A, C > A, B > C, C > B so the required significance level will be 0.05 / 4 = 0.0125”.

Issues with multiple comparisons methods

The adjustment is sometimes too cautious (“conservative”), if the tests are positively correlated.
Sometimes it’s difficult to decide what collection of tests to throw together into one “bag”.
For huge numbers of tests (for example, high-throughput biological data, where n < K and the degrees of freedom are negative), controlling the “family-wise error rate” (FWER) may be far too conservative. One would accept a high probability of at least one “false positive” (Type I error) in exchange for making some true discoveries.

Classic methods for multiple comparisons

Bonferroni and Sidak

The simplest multiple comparison method is the Bonferroni correction: $P_{\text{adjusted}} = k \, P_{\min}$.
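The arithmetic so far (the 0.0975 for two tests, and the "times 3 / 15 / 120" adjustments in Example 1) can be checked numerically. What follows is a minimal sketch in Python using only the standard library, assuming independent, continuous P-values with every null hypothesis true. The function names `prob_any_below`, `bonferroni` and `simulate_prob_any_below` are hypothetical, introduced here for illustration, and capping the Bonferroni value at 1 is a common convention assumed here rather than something the notes state.

```python
import random


def prob_any_below(alpha, k):
    """Exact Pr(at least one of k independent Uniform(0,1) P-values <= alpha)."""
    return 1.0 - (1.0 - alpha) ** k


def bonferroni(p_min, k, cap=True):
    """Bonferroni-adjusted P-value k * p_min.

    cap=True truncates at 1 (an assumed convention); cap=False returns the
    raw product, which can exceed 1 as in the 120-table case of Example 1.
    """
    adjusted = k * p_min
    return min(1.0, adjusted) if cap else adjusted


def simulate_prob_any_below(alpha, k, n_sim=100_000, seed=0):
    """Monte Carlo estimate of Pr(P_min <= alpha) when all k nulls are true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        # Under the null, each P-value is an independent Uniform(0,1) draw.
        p_min = min(rng.random() for _ in range(k))
        if p_min <= alpha:
            hits += 1
    return hits / n_sim


if __name__ == "__main__":
    print(prob_any_below(0.05, 2))           # exact value of 2*0.05 - 0.05**2
    print(simulate_prob_any_below(0.05, 2))  # simulation, close to the exact value
    for k in (3, 15, 120):                   # the three families in Example 1
        print(k, bonferroni(0.01, k, cap=False), bonferroni(0.01, k))
```

The simulation is only a sanity check on the closed form; it stops being valid as soon as the tests are correlated, which is the situation the later sections worry about.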
Almost identical, and slightly more justified perhaps, is the Sidak correction:

\[
P_{\text{adjusted}} = 1 - (1 - P_{\min})^k ,
\]

which is roughly $k \, P_{\min} - \frac{k(k-1)}{2} P_{\min}^2$.
This is exactly correct when the k P-values are statistically independent U(0,1) given H0.
Since the P-values are generally positively correlated, it is usually conservative (too big).

Non-independent cases

There are many, many special adjustments in special cases to avoid being too conservative:
Tukey LSD, Duncan range test, Newman-Keuls, randomization tests, permutation tests, …

Permutation and randomization methods

Randomization tests are often used as a basis of these methods, or to evaluate their performance.
These tests condition on the collection of numbers, but break a connection in order to ensure that the null hypothesis is true.
In the simplest case there is just ONE test, and one wants to check a questionable test.

Example: checking a chi-square test.

                    Responder (R)   Nonresponder (N)   TOTAL
  Dark Hair (D)           3                 2              5
  Light Hair (L)          5                90             95
  TOTAL                   8                92            100

Is hair color associated with response?
Chi-square P-value: see randomizationTest-example.R

Benjamini-Hochberg method (1995) for “false discovery rate”

http://www.jstor.org/stable/2346101 ; implemented in Statistical Analysis of Microarrays (SAM) and BRBTOOLS from the NCI.

B-H is a classical frequentist technique. With V = # false discoveries and R = # declared significant, the false discovery rate could be defined as

\[
Q_e = E\left( V / R \right).
\]

But what if R = # declared significant equals 0? Then V/R is 0/0 and $Q_e$ is undefined. And when all null hypotheses are true, then all discoveries are false and FDR = 1.
Instead we define

\[
\text{FDR} = E\left( V / R \mid R > 0 \right) \Pr(R > 0).
\]

The BH procedure is: order the m P-values, $P_{(1)} \le P_{(2)} \le \cdots \le P_{(m)}$; let k be the largest i for which

\[
P_{(i)} \le \frac{i}{m} \, q^* ;
\]

then declare the hypotheses with the k smallest P-values “significant”. Then FDR is no bigger than q*.

Example: An RCT of rt-PA versus APSAC when myocardial infarction occurs. There are 15 different clinical endpoints. The 15 P-values, ranked, and the critical values with q* = 0.05, are

  Rank      1       2       3       4       5       6       7     ...   15
  P-value   0.0001  0.0004  0.0019  0.0095  0.0201  0.0278  0.0298  ...   1
  cutoff    0.0033  0.0067  0.0100  0.0133  0.0167  0.0200  0.0233  ...   0.05

The first 4 hypotheses are “significant” with this rule.

This method does not really help with the problems listed before. Suppose we tweak the results:

  Rank      1       2       3       4       5       6       7     ...   15
  P-value   0.0001  0.0004  0.0019  0.0095  0.0151  0.0278  0.0298  ...   1
  cutoff    0.0033  0.0067  0.0100  0.0133  0.0167  0.0200  0.0233  ...   0.05

OK, the 5th is significant. Now tweak the 4th.

  Rank      1       2       3       4       5       6       7     ...   15
  P-value   0.0001  0.0004  0.0019  0.0135  0.0151  0.0278  0.0298  ...   1
  cutoff    0.0033  0.0067  0.0100  0.0133  0.0167  0.0200  0.0233  ...   0.05

The 4th now misses its own cutoff (0.0135 > 0.0133). If the rule is read step-down (stop at the first P-value that misses its cutoff), the 5th is now out of the running too: suddenly it is not significant, even though the data for that test has not changed at all. Under the step-up rule as stated (go up to the largest i that meets its cutoff), the 5th still qualifies and carries the 4th along with it, so the 4th stays “significant” despite missing its own cutoff. Either way, the verdict on one test depends on the P-values of the others.
That’s a typical result for frequentist approaches to multiple comparisons. It’s unsatisfying to many people.

Storey-Tibshirani method for “Qvalues”: conditional false discovery rate

http://www.pnas.org/content/100/16/9440.abstract

Define pFDR = E(V/R | R > 0). (If R = 0, why would we care about FDR?)
(In this paper, V/R is written as F/S.)
Write FDR as a posterior probability: for a significance threshold t,

\[
\text{pFDR}(t) = \Pr(H_0 \text{ true} \mid P \le t) = \frac{\pi_0 \, t}{\Pr(P \le t)} ,
\]

where $\pi_0$ = “prior probability a null hypothesis is true”.
We estimate $\pi_0$ from the observed distribution of all the P-values, assuming a mixture of a uniform multiplied by $\pi_0 = m_0 / (m_0 + m_1)$ and a portion near zero multiplied by $1 - \pi_0$. Here $m_0$ and $m_1$ are the numbers of true and false null hypotheses, and $m = m_0 + m_1$.
Different methods can estimate $\pi_0$. In the Figure, the LOWER dotted line corresponds to $\hat{\pi}_0$.
See the R package qvalue by Dabney and Storey.

For each threshold t, define

\[
\widehat{\text{FDR}}(t) = \frac{\hat{\pi}_0 \, m \, t}{\#\{ p_i \le t \}} .
\]

Finally, for each null hypothesis i,

\[
\text{qvalue}_i = \min_{t \ge p_i} \widehat{\text{FDR}}(t) .
\]

The interpretation is the minimum FDR you get if $p_i$ is called “significant”.

The Bootstrap

The bootstrap is a technique for checking calculations of standard errors and confidence intervals.
It is a non-parametric, non-model-based version of the classical rationale for estimating a property of a statistic S. (S must be a continuous statistic.)

$\hat{F}$ “is near” $F_{\text{True}}$.
Therefore, distribution $(S \mid F = \hat{F})$ is “near” distribution $(S \mid F = F_{\text{True}})$.
And likewise for any feature of the distribution, for example $\text{var}(S \mid F = \hat{F})$ is “near” $\text{var}(S \mid F = F_{\text{True}})$.

THE RESAMPLING IDEA

Motivation

Our sample = $\{Y_1, \ldots, Y_n\}$. Our statistic $S = S(Y_1, \ldots, Y_n)$.
Our question: what is var S?
If we had many other independent samples of size n, we could calculate S on each one to get $S_1, \ldots, S_B$, and we’d use:

\[
\widehat{\text{var}}(S) = \frac{1}{B - 1} \sum_{b=1}^{B} \left( S_b - \bar{S} \right)^2 , \qquad \bar{S} = \frac{1}{B} \sum_{b=1}^{B} S_b .
\]

But you only have one sample, so RESAMPLE to create the samples -- WITH replacement.
Again we are assuming that $\hat{F}$ “is near” $F_{\text{True}}$ -- but this time $\hat{F}$ is the empirical c.d.f.

[Diagram: ORIGINAL SAMPLE, then BOOTSTRAP SAMPLES, then BOOTSTRAPPED STATISTICS, then various summaries.]

Then for example we estimate var S by the sample variance of the bootstrapped statistics $S^*_1, \ldots, S^*_B$.

There is MUCH MUCH more to the bootstrap, but this is the central idea.
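
The resampling idea can be sketched in a few lines. This is a minimal illustration in Python (standard library only), not a full bootstrap toolkit: it resamples the original sample with replacement B times, recomputes the statistic on each resample, and summarizes the bootstrapped statistics by their sample variance. The function names `bootstrap_statistics` and `bootstrap_variance`, the default B = 2000, and the data in the usage lines are all assumptions made here for illustration.

```python
import random
import statistics


def bootstrap_statistics(sample, statistic, n_boot=2000, seed=0):
    """Return n_boot values of `statistic`, each computed on a resample of
    the original sample drawn WITH replacement (same size n as the sample)."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistic(rng.choices(sample, k=n)) for _ in range(n_boot)]


def bootstrap_variance(sample, statistic, n_boot=2000, seed=0):
    """Estimate var(S) by the sample variance of the bootstrapped statistics."""
    return statistics.variance(bootstrap_statistics(sample, statistic, n_boot, seed))


if __name__ == "__main__":
    # Made-up data, for illustration only.
    y = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 2.5]
    print(bootstrap_variance(y, statistics.mean))    # bootstrap estimate of var(mean)
    print(bootstrap_variance(y, statistics.median))  # same recipe, different statistic
```

For the sample mean the result can be compared with the plug-in value (population variance of the sample divided by n), which is one way to check a standard-error calculation. For a statistic such as the median no simple formula is available, and the same recipe still applies.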