False Discovery Rate (FDR)
= proportion of false positive results out of all positive results
(positive result = statistically significant result)

Ladislav Pecen

Outline
- Introduction
- Familywise Error Rate
- False Discovery Rate
- Positive False Discovery Rate
- Comparison of approaches
- Example: microarray study
- Estimate of FDR and pFDR
- Bayes approach to FDR and pFDR

Background
The aim is to find an approach to the problem of multiple testing.
Problems that occur with multiple testing:
- each single test of one null hypothesis has a probability of type I error equal to α
- when several tests are performed, each with probability of type I error equal to α, the overall probability of a type I error increases
- this leads to a high number of false positive results
- in the worst case the type I error probabilities add up (upper bound k * α for k tests); for independent tests the probability of at least one type I error is 1 - (1 - α)^k, which is close to this bound when k * α is small
- assume 100 independent tests, each with probability of type I error α = 0.05
- if all null hypotheses are true, in about 5% of the tests (= 5 tests) you will make a type I error, i.e. you will reject the null hypothesis although it is true
Typical areas where multiple testing arises:
- microarray studies
- genetics and all lab testing

Theory of hypothesis testing
We test the null hypothesis H0 against the alternative hypothesis H1.
One of the following four possibilities will happen:

                     Testing result
  Truth              reject H0          accept H0
  H0 is true         type I error       correct decision
  H1 is true         correct decision   type II error

Usually we require restrictions on the two errors:
- the probability of type I error, i.e.
false positive rate, should be limited by α:
  P(rejection of H0 | H0 is true) = P(type I error) ≤ α
- the power of the test should not be lower than 1 - β:
  P(rejection of H0 | H0 is not true) = 1 - P(acceptance of H0 | H0 is not true) = 1 - P(type II error) ≥ 1 - β

Multiple hypotheses testing
Testing of multiple hypotheses:

                               Testing result
  Truth                        # of rejected H0   # of accepted H0   Total
  # of true null hypotheses    S                  T                  m0
  # of true alternatives       U                  W                  m1 = m - m0
  Total                        R0                 R1 = m - R0        m

Here too we put restrictions on the false results; these can be similar to those for a single test:
- control the false positive rate: (# of incorrectly rejected H0) / (# of true H0) = S / m0
  - this is the classical approach; in its multiple-testing form it is the "familywise error rate" (FWER) approach, which controls the probability of at least one false rejection, P(S ≥ 1)
- or approach the problem from a different point of view and control the false discovery rate: (# of incorrectly rejected H0) / (# of rejected H0) = S / R0
Connected characteristics (DE = differentially expressed):
- sensitivity: proportion of correctly identified DE genes, U / m1
- specificity: proportion of correctly identified non-DE genes, T / m0

Example
Example from genetics, microarray studies:
- 10 000 genes examined
- search for differentially expressed genes

                        Results of testing
  Reality               Determined as DE   Determined as non-DE   Total
  Truly DE genes        400                100                    500
  Truly non-DE genes    475                9 025                  9 500
  Total                 875                9 125                  10 000

- type I error rate (false positive rate) = 475 / 9 500 = 5%
- type II error rate (false negative rate) = 100 / 500 = 20%
- sensitivity of the test (power) = 400 / 500 = 80%
- false discovery rate = 475 / 875 = 54%
  - more than half of the discovered DE genes are false discoveries
- false non-discovery rate = 100 / 9 125 = 1.1%

Common approach
Control of the familywise error rate (FWER):
- FWER = probability of making one or more false discoveries (type I errors)
- estimate: number of false discoveries out of all tests done
- assume we perform k independent tests, each at significance level α = 0.05, i.e.
in each test we have a 5% probability of making a false positive decision
- using the Bonferroni inequality, we can bound the overall probability of type I error, i.e. the probability of making at least one false positive decision, by k * α = k * 0.05
- already for 10 tests the upper bound on the probability of at least one false positive result is 50%
- to keep the overall significance level controlled (e.g. equal to 5%), one has to decrease the significance level of each individual test
- to be sure that the overall significance level is α = 0.05, each test has to be performed at significance level α / k = 0.05 / k
- for 10 tests, each has to have significance level 0.005; for thousands of tests, ...
Such an approach leads to highly conservative results:
- with thousands of tests it is very difficult to prove a truly positive result
- the more tests, the lower the power

False Discovery Rate
Definition
- False Discovery Rate (FDR) = proportion of false positive results out of all positive (= statistically significant) results
Advantages
- if a null hypothesis is rejected, we know the probability that it is correctly rejected
- FDR = 5% means that out of 100 positive tests about 5 are expected to be false positives and the remaining 95 true positives
- usually it is more powerful than the traditional FWER approach
- convenient especially when testing a large number of null hypotheses
Disadvantages
- it does not necessarily keep the probability of at least one wrongly rejected null hypothesis below α
- one has to take care of the situation when the number of rejected H0 is zero
Factors determining the False Discovery Rate (FDR)
- proportion of truly DE genes: m1 / m
- distribution of true differences
- variability
- sample size

False Discovery Rate
The FDR as defined above works nicely when at least one of the null hypotheses is rejected, i.e.
when R0 > 0.
When P(R0 = 0) > 0, three definitions are possible:
- FDR1 = E(S / R0 | R0 > 0) * P(R0 > 0)
- FDR2 = E(S / R0 | R0 > 0)
- FDR3 = E(S) / E(R0)
The second and third alternatives are equal to 1 if m0 = m:
- FDR2 and FDR3 cannot be controlled (limited by α < 1) whenever m0 = m, hence Benjamini and Hochberg decided to work with FDR1
- in what follows, FDR1 is called the False Discovery Rate (FDR)
Controlling FDR1 at level α means that FDR2 = E(S / R0 | R0 > 0) is controlled only at level α / P(R0 > 0):
- hence Storey decided to work with FDR2
- the fact that FDR2 = 1 when m0 = m is not a problem, since this result is obvious
- in what follows, FDR2 is called the positive False Discovery Rate (pFDR)

Benjamini-Hochberg - FDR
Procedure controlling the false discovery rate
- consider testing m null hypotheses H1, H2, H3, ..., Hm
- order the respective p-values such that p(1) ≤ p(2) ≤ p(3) ≤ ... ≤ p(m) and denote the null hypothesis corresponding to p(i) as H(i)
- choose k* as the largest index k for which p(k) ≤ α * k / m, i.e. k* = max{k: p(k) ≤ α * k / m; 1 ≤ k ≤ m}
- then reject all hypotheses H(i) with i ≤ k*, i.e. those for which p(i) ≤ p(k*); if no such k exists, reject nothing
Properties
- for independent test statistics and any configuration of false null hypotheses, the procedure controls the FDR at level α:
  E(S / R0) = E(# falsely rejected / # rejected) ≤ α * m0 / m ≤ α   (with S / R0 taken as 0 when R0 = 0)

Benjamini, Y. and Hochberg, Y. (1995): Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57, 289-300

Positive False Discovery Rate
Definition
- the false discovery rate given that at least one test has a positive result, i.e.
the proportion of false positive results among all positive results, given that at least one positive result occurs:
  pFDR = E(S / R0 | R0 > 0)
Additional characteristics can be defined
- q-value: a natural pFDR counterpart to the common p-value
  - the p-value is the probability, given the null hypothesis, that the test statistic is greater than or equal to the observed value: P(T ≥ t | H0) (here T denotes the test statistic, not the count T from the table of testing results)
  - the q-value is the Bayesian analogue of the p-value
  - the q-value is the posterior probability of the null hypothesis given that the test statistic is greater than or equal to the observed value: P(H0 | T ≥ t)
  - for more hypotheses: q-value(t) = inf {pFDR(Γα); t ∈ Γα}
    - the minimum pFDR that can occur when rejecting a statistic with value t
    - the minimum posterior probability of the null hypothesis over all significance regions containing the statistic
    - the q-value minimizes the ratio of the type I error to the power over all significance regions that contain the statistic
- pFNR (positive False Non-discovery Rate): a natural counterpart to the pFDR
  - rate of false negatives among all negative results
  - first we define the False Non-discovery Rate (FNR), with W the number of accepted false null hypotheses:
    FNR = E(W / (m - R0) | (m - R0) > 0) * P((m - R0) > 0)
  - the positive False Non-discovery Rate (pFNR) is then defined as
    pFNR = E(W / (m - R0) | (m - R0) > 0)

Comparison of FWER, FDR and pFDR
- FWER: control of the multiple error rate
  - we fix the error rate and estimate the rejection region
- FDR: control of false positives among all positives
  - we fix the rejection region and estimate the error rate
Interpretation of FDR and pFDR
- FDR: rate at which false discoveries occur
- pFDR: rate at which discoveries are false
- when all the null hypotheses are true (i.e.
when all genes are non-DE and m0 = m), the pFDR is always equal to 1 (and hence cannot be controlled at any prespecified level α < 1)
- when the FDR is controlled at level α and positive findings have occurred, the rate among those findings (the pFDR) has really only been controlled at level α / P(R0 > 0)

Comparison of FWER, FDR and pFDR
Two approaches to the false discovery rate
- fix the acceptable rate α and estimate the corresponding significance threshold
  - available only when using the FDR, since the pFDR cannot be controlled
- fix the significance threshold and estimate the corresponding rate α
  - both the FDR and the pFDR can be used ... the latter leads to "stronger" results

Example
Back to the genetics and the microarray study
- the problems come from the high percentage of truly non-DE genes
- lowering the significance level also decreases the FDR

                        Results of testing
  Reality               Determined as DE   Determined as non-DE   Total
  Truly DE genes        400                100                    500
  Truly non-DE genes    475                9 025                  9 500
  Total                 875                9 125                  10 000

Assume the following situation
- about 10 000 genes are evaluated to decide whether they are DE or non-DE
- the genes are independent or only slightly dependent
- for each gene, two independent groups with equal variance are compared
- n arrays per group
- the standard t-statistic with pooled variance is used

Example
- α denotes the significance level of each single test, not an overall significance level for all the tests together
Any formal statistical testing procedure
- compute the relevant test statistic for each gene
- sort the statistics (or p-values)
- determine the cut-off point dividing the genes into DE and non-DE
In such a situation it makes sense to care about
- FDR: how many of the rejected null hypotheses are rejected wrongly
- FNR: the proportion of true alternatives missed by the test (in this example the false negative rate, 1 - sensitivity)

Example
Genetics, microarray studies, particular situations
- the FDR varies depending on the sample size per group (n), the significance level (α) and the proportion of true null hypotheses (π0 = m0 / m)
n = 5 microarrays per group
- at significance level
α = 5% the sensitivity of the test is about 35%
- FDR at π0 = 0.9 is greater than 60% and at π0 = 0.995 it is around 95%
- at sensitivity equal to 80% the significance level is around 0.45
- FDR at π0 = 0.9 is around 82% and at π0 = 0.99 it is almost 99%
- hence, n = 5 leads to an underpowered study

Example
n = 20 microarrays per group
- at significance level α = 5% the sensitivity of the test is around 90%
- FDR at π0 = 0.9 is around 35% and at π0 = 0.99 it is more than 80%
n = 30 microarrays per group
- at significance level α = 5% the results are still poor
- at significance level α = 0.4% the sensitivity of the test is around 80%
- FDR at π0 = 0.9 is slightly above 20% and at π0 = 0.99 it is around 72%
- for any n, the best results attainable at significance level α = 5% are a minimal FDR of 18% at π0 = 0.9 and of 71% at π0 = 0.99

Example
Genetics, microarray study
- the FDR enables another approach to sample size estimation
- the sample size depends on the number and distribution of truly DE genes and on the tolerated value of the FDR
- if n = 5, we need about π0 = 80% of truly DE genes to obtain a reasonable FDR
  - another possibility is to hope for large differences
  - in both cases we use a really small significance level
- if π0 = 0.9 and we want an FDR below 10%, we should have at least n = 30 observations per group
- if π0 = 0.99 and we classify the top 1% of genes as DE, we need a sample size of n = 45 to obtain an FDR around 10% (a sample size of n = 35 is necessary for an FDR below 20%)
- if we can estimate π0, it makes sense to reject the top (1 - π0) * 100% of hypotheses, since then FDR = FNR = 1 - sensitivity
  - we control both these statistics together

Typical examples
Situations where controlling the FWER (i.e.
the probability of at least one false rejection of H0) is not needed, so that controlling the FDR is meaningful
- multiple endpoints
  - typical goal: to recommend a new test or treatment over the standard one
  - the aim is to find as many endpoints as possible in which the new treatment exceeds the standard one
  - the limit on false positive results is not so strict, but too many false discoveries are also bad
- multiple separate decisions without an overall decision
  - multiple subgroups problem, where two treatments are compared in various subgroups of patients
  - we want to find as many subgroups as possible with a potentially different reaction to the two treatments
  - but we want to control the rate of false discoveries
- screening of multiple potential effects
  - multiple potential effects are screened to weed out the null effects
  - screening of various chemicals in potential drug development
  - again, we want as many discoveries as possible while controlling the FDR

Thank you
Sometimes I would like to exchange all my knowledge for a bottle of Whisky
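
Supplementary sketch: Benjamini-Hochberg procedure
The step-up rule from the Benjamini-Hochberg slide above can be sketched in Python roughly as follows. This is only a minimal illustration under the slide's assumption of independent test statistics, not the author's implementation: the function name benjamini_hochberg and the four p-values are made up for this example.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return one reject flag per hypothesis, in the order of the input p-values."""
    m = len(p_values)
    # Indices of the hypotheses sorted by increasing p-value: p(1) <= ... <= p(m).
    order = sorted(range(m), key=lambda i: p_values[i])
    # k* = largest rank k (1-based) with p(k) <= alpha * k / m; stays 0 if none exists.
    k_star = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k_star = rank
    # Reject H(1), ..., H(k*), even if some of them missed their own threshold.
    reject = [False] * m
    for idx in order[:k_star]:
        reject[idx] = True
    return reject


# Hypothetical p-values, chosen only to show the step-up behaviour.
example_p = [0.02, 0.021, 0.03, 0.9]
print(benjamini_hochberg(example_p, alpha=0.05))
```

With α = 0.05 the thresholds α * k / m are 0.0125, 0.025, 0.0375 and 0.05, so k* = 3 and the three smallest p-values are rejected, although 0.02 exceeds its own threshold of 0.0125. A Bonferroni cut-off of 0.05 / 4 = 0.0125 would reject none of them.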
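
Supplementary sketch: error rates of the 10 000-gene example
The rates quoted for the microarray example table, and the Bonferroni bound k * α from the familywise error rate slide, can be checked with a few lines of Python. The function names error_rates and fwer_bounds are hypothetical helpers introduced here. The formula 1 - (1 - α)^k for independent tests with all null hypotheses true is a standard result added for comparison; it does not appear on the slides.

```python
def error_rates(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Rates for a 2x2 table of truth (DE / non-DE) versus test decision."""
    return {
        "type_I_error_rate": fp / (fp + tn),         # false positives among truly non-DE
        "type_II_error_rate": fn / (tp + fn),        # misses among truly DE
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "false_discovery_rate": fp / (tp + fp),      # false among genes declared DE
        "false_non_discovery_rate": fn / (fn + tn),  # misses among genes declared non-DE
    }


def fwer_bounds(k: int, alpha: float = 0.05) -> tuple:
    """Bonferroni upper bound and FWER for k independent tests (assumes all nulls true)."""
    bonferroni = min(1.0, k * alpha)
    independent = 1.0 - (1.0 - alpha) ** k
    return bonferroni, independent


# Counts taken from the example table: 400 / 100 truly DE, 475 / 9 025 truly non-DE.
rates = error_rates(tp=400, fn=100, fp=475, tn=9025)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")

for k in (1, 10, 100):
    bound, independent = fwer_bounds(k)
    print(f"k={k}: Bonferroni bound {bound:.2f}, independent tests {independent:.2f}")
```

For k = 10 the Bonferroni bound is 0.50, as stated on the slide, while independent tests give about 0.40, so the bound is conservative but of the right order.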