Multiple testing in high-throughput biology

Petter Mostad

Overview
• Review of hypothesis testing
• Multiple testing: What is the issue?
• Types of errors to control
• Adjusting p-values
• A Bayesian alternative

Statistical hypothesis testing
• Fix a null hypothesis H0, and a model for what the data would look like if H0 is true.
• Find a test statistic (something computed from the data) which will be more extreme when H0 is not true.
• Compute the test statistic from the data.
• Compute its p-value: the probability of observing this or a more extreme test statistic if H0 is true.
• Reject H0 if p is very small (e.g., less than 0.05).

Example: testing the mean of a sample
• Assume you have observed values $x_1, x_2, \ldots, x_n$.
• Assume they come from a normal distribution with expectation $\mu$ and variance $\sigma^2$. $H_0: \mu = \mu_0$.
• Test statistic: $t = \dfrac{\bar{x} - \mu_0}{\sqrt{s_x^2 / n}}$
• Under the null hypothesis, $t$ has a $t_{n-1}$ distribution, i.e., a t-distribution with $n - 1$ degrees of freedom.
• The p-value is the probability of values more extreme than $t$ in this distribution.
• The p-value is below 0.05 if and only if $\mu_0$ is outside the 95% confidence interval for $\mu$.

Example: Non-parametric tests
• Tests where the null hypothesis does not specify a particular probability model for the data.
• Example: Are the values in two groups of values different? To find out, rank all the values together.
• Wilcoxon rank-sum test statistic: add the ranks of all items in one group, and compare to the sum of the other ranks.
• Compare with the rank sums one would get if one set of items were selected at random. This gives a p-value.

Example: Finding differentially expressed genes
• Assume we want to find differentially expressed genes using 4 cDNA arrays with 6000 genes, comparing "treated" samples with "untreated".
• Assume there are NO differentially expressed genes, only noise, and that all 6000 × 4 values are normally distributed.
• Of the 6000 t-test statistics computed, roughly 300 will have values "more extreme" than the 0.05 cutoff value, just by chance. These are false positives. (See the simulation sketch below.)
• This problem is a consequence of the hypothesis testing approach and of multiple testing, not of the particular method chosen.

Multiple testing in general
• Multiple testing is in fact always a problem: when using a significance level of 5%, 1 in 20 true null hypotheses will be falsely rejected, whether the tests occur in the same or in different research projects.
• The problem seems more acute when a large number of formalized statistical tests are performed, as in modern high-throughput biology.
• What to do?

Ways to deal with the problem
• Adjusting p-values:
  – Adjusting the levels of p-values to account for multiple tests.
  – Changing the interpretation of p-values.
• Reformulating statistics away from hypothesis testing, towards (Bayesian) model choices.

Types of errors

                     # not rejected   # rejected   SUM
  # true null hyp.         U              V         M0
  # false null hyp.        T              S         M1
  SUM                      W              R         M

• V counts the type 1 errors (false positives); T counts the type 2 errors (false negatives).
• Traditionally, we want a procedure controlling the size of V, while at the same time keeping T as small as possible.

Some error rates
• Per-comparison error rate: $\mathrm{PCER} = E(V)/M$
• Family-wise error rate: $\mathrm{FWER} = \Pr(V > 0)$
• False discovery rate: $\mathrm{FDR} = E\!\left[\frac{V}{R} \,\middle|\, R > 0\right] \Pr(R > 0)$
• Positive false discovery rate: $\mathrm{pFDR} = E\!\left[\frac{V}{R} \,\middle|\, R > 0\right]$

Types of control
• Exact control: controlling error rates conditionally on the set of true null hypotheses.
• Weak control: controlling error rates under the condition that all null hypotheses are true.
• Strong control: controlling error rates no matter which hypotheses are true or false.
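The false-positive count in the differential expression example above is easy to check by simulation. Here is a minimal sketch in Python; the gene and array counts follow the example, while the use of scipy's one-sample t-test for the per-gene tests is an assumption of the sketch (the slides do not specify the exact test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# 6000 "genes" on 4 arrays of pure N(0, 1) noise: every null hypothesis
# is true, so every rejection is a false positive (a contribution to V).
n_genes, n_arrays = 6000, 4
log_ratios = rng.standard_normal((n_genes, n_arrays))

# One-sample t-test per gene of H0: expected log-ratio = 0.
t_stats, p_values = stats.ttest_1samp(log_ratios, popmean=0.0, axis=1)

n_rejected = int((p_values < 0.05).sum())
print(f"False positives at the 0.05 level: {n_rejected} (expect about 300)")
```

Since all null hypotheses are true here, V = R, and the chance of seeing at least one false positive (the FWER) is essentially 1.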
Controlling FWER with Bonferroni
• Adjust the p-values $p_1, p_2, \ldots, p_M$ using the Bonferroni method: $\tilde{p}_i = \min(M p_i,\, 1)$
• Provides strong control of FWER.
• Very conservative; it becomes hard to get any hypotheses rejected.

Alternative: Šidák adjustment
• Given p-values $p_1, p_2, \ldots, p_M$, the adjusted p-values are now $\tilde{p}_i = 1 - (1 - p_i)^M$
• Provides strong control of FWER if the test statistics are independent (rarely the case).

Holm step-down adjusted p-values
• Order the tests so that the raw p-values are ordered: $p_1 \le p_2 \le \ldots \le p_M$.
• Compute adjusted p-values: $\tilde{p}_i = \max_{k=1,\ldots,i} \min\{(M - k + 1)\, p_k,\, 1\}$
• Provides strong control of FWER.
• Somewhat less conservative than Bonferroni.

Permutation testing
• Given items divided into two or more groups:
• Compute a test statistic capturing the way in which your data seem to be extreme.
• Compare with the test statistics computed when the items are permuted. This gives a p-value.
• The Wilcoxon rank-sum test can be seen as a permutation test.
• Can be a good way to find statistically significant differences without using parametric assumptions.

Combining permutation testing with adjusting p-values
• Given a data matrix, with some columns corresponding to "treated" samples and some to "untreated":
• Permute the columns to get permutation-derived p-values.
• Adjust these, using for example the Holm step-down method.

Correcting for multiple testing with permutations (Dudoit/Speed)
• Assumes treatment and control samples have been compared to a common reference sample.
• Use the test statistic $t = \dfrac{\bar{m}_2 - \bar{m}_1}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$
• By permuting data from treated and control samples we can approximate the null distribution of the test statistic.
• Use permutation for the multiple testing problem.

Adjusting p-values
• Order the genes so that the $|t_j|$ values are non-increasing.
• Compute $|t_j^{(b)}|$ values by permutations ($b = 1, \ldots, B$). The unadjusted p-value $p_j^*$ is the proportion of these that exceed $|t_j|$.
• For each permutation $b$, set $u_j^{(b)}$ as the non-increasing envelope of the $|t_j^{(b)}|$.
• Set the adjusted p-value $\tilde{p}_j^*$ as the proportion of these that exceed $|t_j|$.
• Make the $\tilde{p}_j^*$ non-decreasing by possibly increasing values. (A permutation sketch in code is given below, after the BioConductor slide.)

QQ-plots
• We may also compute t-values by computing differences within groups instead of between groups.
• Find extreme quantiles of the null distribution, and compare with the first (most extreme) t-values.
• Compare the null t-values, or a normal distribution, with the first t-values in a QQ-plot.
• Single-gene p-values may be computed from quantiles of the null distribution.

Controlling FDR
• Start by ordering the tests so that the raw p-values are ordered: $p_1 \le p_2 \le \ldots \le p_M$.
• Adjusted p-values can then be computed: $\tilde{p}_i = \min_{k=i,\ldots,M} \min\left\{\frac{M}{k}\, p_k,\, 1\right\}$
• This can be interpreted as a step-up procedure, starting with investigating the largest p-value.
• Provides strong control of FDR under independence of the test statistics, and under certain more general conditions.

Interpreting FDR-adjusted p-values
• Remember that FDR-adjusted p-values must be interpreted differently:
• If you select all genes with FDR-adjusted p-values less than 0.05, you expect a proportion of 0.05 of these to be false discoveries.
• In contrast, if you select all genes with Holm-adjusted p-values less than 0.05, the probability of seeing any false positive gene among them is 0.05.

Multiple testing using BioConductor software
• In the limma package, the topTable function can be used to adjust p-values.
• It offers a number of different procedures, for example "bonferroni", "holm", and "fdr".
• The package multtest can be used to perform permutation-based computations.
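The adjustment formulas above are short enough to implement directly. Here is a minimal Python sketch covering the Bonferroni, Šidák, Holm, and FDR (Benjamini-Hochberg) adjustments; in R, the same adjustments are available through p.adjust or, as noted above, limma's topTable:

```python
import numpy as np

def adjust_pvalues(p, method="holm"):
    """Adjusted p-values following the formulas above; returns values in the
    original order of p."""
    p = np.asarray(p, dtype=float)
    M = len(p)
    order = np.argsort(p)              # sort so that p_(1) <= ... <= p_(M)
    ps = p[order]
    if method == "bonferroni":
        adj = np.minimum(M * ps, 1.0)
    elif method == "sidak":
        adj = 1.0 - (1.0 - ps) ** M
    elif method == "holm":
        # step-down: running maximum of (M - k + 1) * p_(k)
        adj = np.minimum(np.maximum.accumulate((M - np.arange(M)) * ps), 1.0)
    elif method == "fdr":
        # Benjamini-Hochberg step-up: running minimum from the largest p-value
        adj = np.minimum((M / np.arange(1, M + 1)) * ps, 1.0)
        adj = np.minimum.accumulate(adj[::-1])[::-1]
    else:
        raise ValueError(method)
    out = np.empty(M)
    out[order] = adj
    return out

raw = [0.0001, 0.004, 0.019, 0.095, 0.201]
for m in ("bonferroni", "holm", "fdr"):
    print(m, np.round(adjust_pvalues(raw, m), 4))
```

On these example p-values, Holm is less conservative than Bonferroni, and the FDR adjustment less conservative than both, matching the discussion above.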
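Permutation-based adjustment in the style of the Dudoit/Speed slides can be sketched as follows. For simplicity this shows the single-step maxT variant rather than the step-down envelope procedure described above; the statistic matches the slide, while the toy data, group sizes, and number of permutations are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=2)

def t_stats(x, y):
    """Per-gene statistics (m2_bar - m1_bar) / sqrt(s1^2/n1 + s2^2/n2)."""
    n1, n2 = x.shape[1], y.shape[1]
    num = y.mean(axis=1) - x.mean(axis=1)
    den = np.sqrt(x.var(axis=1, ddof=1) / n1 + y.var(axis=1, ddof=1) / n2)
    return num / den

def maxT_adjusted(data, treated, B=1000):
    """Single-step maxT permutation-adjusted p-values.

    data: genes x samples matrix; treated: boolean mask of "treated" columns.
    """
    t_obs = np.abs(t_stats(data[:, ~treated], data[:, treated]))
    exceed = np.zeros(len(t_obs))
    for _ in range(B):
        g = treated[rng.permutation(data.shape[1])]   # permute column labels
        t_perm = np.abs(t_stats(data[:, ~g], data[:, g]))
        exceed += t_perm.max() >= t_obs               # max |t| over all genes
    return exceed / B

# Toy data: 100 genes, 4 treated + 4 untreated arrays; gene 0 has a real shift.
data = rng.standard_normal((100, 8))
treated = np.array([True] * 4 + [False] * 4)
data[0, treated] += 3.0
print(maxT_adjusted(data, treated, B=500)[:5])
```

The multtest package's mt.maxT function performs this kind of computation in R, including the step-down version.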
Criticism of the FDR approach
• It is "possible to cheat":
  – You have a number of hypotheses you want to reject, but the p-values are not quite good enough.
  – Add to your hypotheses a number of untrue hypotheses with low p-values.
  – The number of rejections will rise, but not the number of false rejections, so your FDR improves, and you "prove" the hypotheses you care about. (A numerical sketch of this is given at the end of this section.)

Criticism of hypothesis testing in general
• Assume a pharmaceutical company wants to prove that a new drug is effective.
• If they compute new p-values continuously, as new data come in, they can just stop whenever the p-value dips below 0.05.
• This gives an "unfair advantage" compared to testing only after the end of the entire study, as they perform multiple tests.
• "Solution": decide before the study at what times to compute p-values, and adjust them accordingly.
• Result: two companies can have exactly identical data, but one gets its drug approved and the other does not.

Example: A Bayesian approach (Scott and Berger, 2003)
• Objective: given a large number of values of normally distributed noise with expectation zero, and some values which deviate substantially more from zero, how do we find these deviating values, limiting errors of types 1 and 2?
• Crucial ingredient: use the data to estimate the probability that values are deviating.

Hierarchical model used:

$$X_i \mid \mu_i, \sigma^2, \gamma_i \sim N(\gamma_i \mu_i,\, \sigma^2)$$
$$\mu_i \mid \tau^2 \sim N(0,\, \tau^2)$$
$$\pi(\sigma^2, \tau^2) \propto (\sigma^2 + \tau^2)^{-2}$$
$$\gamma_i \mid p \sim \mathrm{Ber}(1 - p)$$
$$p \sim \mathrm{Unif}(0, 1)$$

• The model above specifies "noise values", for which $\gamma_i = 0$, and "signal values", where $\gamma_i = 1$.
• We would like to fit the model to data, and thus find the probability for each $\gamma_i$ to be equal to 1. (A simplified sketch of this computation is given at the end of this section.)

Example computations
• We assume we have 10 "signal observations": -8.48, -5.43, -4.81, -2.64, 2.40, 3.32, 4.07, 4.81, 5.81, 6.24.
• We have n = 10, 50, 500, or 5000 "noise observations", normally distributed N(0, 1).
• We find the posterior probability for each observation to be noise or signal: can we fish out the signals from the noise?

Results

Posterior probabilities for the seven central "signal" observations, and the number of false positives (noise observations with posterior probability above 0.6), for each noise sample size n:

  n      -5.4   -4.8   -2.6    2.4    3.3    4.1    4.8    # false pos. ($p_i > .6$)
  10      1      1     .94    .89    .99     1      1      1
  50      1      1     .71    .59    .94     1      1      0
  500     1      1     .26    .17    .67    .96     1      1
  5000   1.0    .98    .03    .02    .16    .67    .98     2

• Note: the three most extreme signal observations (not shown) always have probability 1 of being signals.
• Note: the number of "false positives" does not change much with the amount of noise: adjustment for multiple testing is automatic!
• Note: the most central observations are "drowned out" when the noise increases.

Multiple testing for microarray data in practice
• Often, too few repetitions (2 or 3) of each sample are used for differentially expressed genes to be found reliably.
• Instead of a definite list of differentially expressed genes, the output may be a ranking list, ordering genes according to how likely they are to be differentially expressed. The top of the list is chosen for further validation.
• Combination with other sources of information may be used to find interesting genes.
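The "cheating" criticism above can be seen numerically. A small sketch, with all p-values made up for illustration (bh_adjust repeats the FDR adjustment from the earlier sketch):

```python
import numpy as np

def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values (the "fdr" method above)."""
    p = np.asarray(p, dtype=float)
    M = len(p)
    order = np.argsort(p)
    adj = np.minimum(p[order] * M / np.arange(1, M + 1), 1.0)
    adj = np.minimum.accumulate(adj[::-1])[::-1]
    out = np.empty(M)
    out[order] = adj
    return out

# Hypotheses we "want" to reject: p-values not quite good enough on their own.
own = np.array([0.045, 0.20, 0.30, 0.40])
print(bh_adjust(own))            # 0.045 adjusts to 0.18: not rejected

# "Cheat": add 96 hypotheses known to be false, with very low p-values.
padded = np.concatenate([own, np.full(96, 1e-8)])
print(bh_adjust(padded)[:4])     # 0.045 now adjusts to about 0.046: rejected
```

The padded hypotheses are genuinely false nulls, so the number of false rejections does not grow while R jumps from 0 to 97; the FDR stays controlled, and yet the borderline hypothesis now clears the 0.05 threshold.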
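The flavor of the Scott and Berger computation can be reproduced with an empirical-Bayes simplification. In this sketch, sigma^2 is fixed at 1 and (p, tau^2) are chosen by a grid search over the marginal likelihood; the actual paper instead integrates over the priors on sigma^2, tau^2, and p, so the numbers below are only indicative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=3)

# Toy data mimicking the example: the 10 "signal" values plus n = 500
# noise observations from N(0, 1).
signals = np.array([-8.48, -5.43, -4.81, -2.64, 2.40,
                    3.32, 4.07, 4.81, 5.81, 6.24])
x = np.concatenate([signals, rng.standard_normal(500)])

# Marginalizing mu_i out of the hierarchical model gives a two-component mixture:
#   X_i | gamma_i = 0  ~  N(0, sigma^2)           ("noise")
#   X_i | gamma_i = 1  ~  N(0, sigma^2 + tau^2)   ("signal")
# Simplification (an assumption of this sketch): fix sigma^2 = 1 and pick
# (p, tau^2) by maximizing the marginal likelihood on a grid.
sigma2 = 1.0
best, best_ll = (0.5, 1.0), -np.inf
for p in np.linspace(0.5, 0.999, 100):      # p = prior probability of "noise"
    for tau2 in np.linspace(1.0, 50.0, 100):
        lik = p * norm.pdf(x, 0.0, np.sqrt(sigma2)) \
            + (1 - p) * norm.pdf(x, 0.0, np.sqrt(sigma2 + tau2))
        ll = np.log(lik).sum()
        if ll > best_ll:
            best, best_ll = (p, tau2), ll
p_hat, tau2_hat = best

# Posterior probability that each observation is a signal (gamma_i = 1).
f0 = p_hat * norm.pdf(x, 0.0, np.sqrt(sigma2))
f1 = (1 - p_hat) * norm.pdf(x, 0.0, np.sqrt(sigma2 + tau2_hat))
print(np.round((f1 / (f0 + f1))[:10], 2))   # the ten planted signal values
```

Re-running with more or fewer noise observations reproduces the qualitative pattern of the results table above: the extreme signals keep posterior probability near 1, while the central ones are gradually drowned out.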