Multiple testing

Given the ease with which data can be collected and statistical tests can be performed with computational statistics packages, the problem in contemporary molecular biology is that we often have too many statistical tests that we might want to do. For example, in gene set enrichment analysis, we might not know in advance which kinase the genes we predicted were likely to be targets of. We could try testing all of them. Or maybe we think the genes are either targets of a kinase or a transcription factor. In that case, we could try testing for enrichment of targets of all possible kinases and transcription factors. More typically, we are in the situation where we have no idea what our list of genes is doing and we will just want to test for all possible enrichments of anything we can get our hands on: GO categories, MIPS functions, KEGG pathways, ChIP-seq compendiums, phospho-proteomics datasets, you get the picture. In bioinformatics jargon, these types of functional assignments to genes are known as "annotations."

Table – some common annotations of gene function

Of course, this seems like a good idea, and we can often discover interesting biology this way. However, let's think about the statistical situation here. In Chapter 2, when we tested for enrichment of kinase substrates in our gene list, the P-value was the probability of getting a gene list as extreme or more extreme than the one we got, under the null hypothesis that our gene list was randomly drawn from the pool of genes. Let's say we want the significance threshold to be P < 0.05. This threshold means that we want there to be a 1 in 20 chance of seeing this extreme a gene list under the null hypothesis. Consider what happens if we now perform 20 tests: we expect about one of them to have P < 0.05 even when there is no enrichment. More generally, for any P-value threshold α, if we do m tests, we expect about m × α tests to pass it under the null hypothesis. Another way to think about this is to remember that the distribution of P-values under the null hypothesis is uniform between 0 and 1; 0.05 covers 1/20 of this distance. This means that if we randomly draw m P-values from the null distribution, about m/20 will fall below 0.05. You can see immediately that in the context of many tests (or multiple testing) there will be a problem with our conventional use of the P-value.

The simplest solution to this problem is the so-called Bonferroni correction to the P-value. The idea here is to use the laws of probability to correct the P-values. Since we know the distribution of P-values under the null hypothesis is uniform, we simply *rescale* the region that corresponds to a "significant" P-value so that it still covers a total probability of 0.05, *taking into account* the fact that we have done m tests. The new P-value threshold is simply 0.05/m, and now there will only be a 5% chance of observing a P-value that small under the null hypothesis. Alternatively, we can achieve the same effect by rescaling, or "correcting," the P-values themselves, simply multiplying by the number of tests: P becomes P × m. To keep the P-values as probabilities, any P-value that gets bigger than 1 is simply set to one. Formally, the correction can therefore be written as min(1, m × P). Now, no matter how many tests we do, we will only have a 5% chance of observing a P-value below 0.05 under the null hypothesis.
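This is easy to verify by simulation. Here is a minimal sketch, assuming nothing beyond NumPy, that draws m P-values under the null hypothesis and counts how many fall below 0.05 before and after the min(1, m × P) correction:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000                          # number of tests performed
p = rng.uniform(0, 1, size=m)     # under the null, P-values are uniform on [0, 1]

print((p < 0.05).sum())           # about m * 0.05 = 50 tests "pass" by chance
p_bonf = np.minimum(1, m * p)     # Bonferroni correction: min(1, m * P)
print((p_bonf < 0.05).sum())      # usually 0: only a 5% chance that any test passes
```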
The Bonferroni correction is widely used in practice because it is so simple to calculate. The only tricky part is knowing exactly how many tests you really did (this is not always strictly equal to the number of tests you actually performed using your statistics software). Figure X illustrates multiple testing and the Bonferroni correction for gene set enrichment analysis.

Multiple testing in Differential Expression analysis

Another very well-studied example where multiple testing is important in modern molecular biology is the identification of differentially expressed genes. Let's consider trying to find the genes that are differentially expressed between T-cells and all other cells in the Immgen data. We have ~25,000 genes whose expression has been measured in hundreds of cells. More formally, we want to find for which of m genes the expression level is different in the T-cells. For each gene j, we have a sample of observations X_j1, X_j2, … and Y_j1, Y_j2, …, where the X are the expression levels in the T-cells and the Y are the expression levels in the other cells. For each gene, we compute a test statistic t* = f(X_j, Y_j) and the corresponding P-value, P_j = P(t ≥ t* | H0). Under the null hypothesis, for any P-value threshold α that we choose, we expect to find m × α genes beyond that threshold. The Bonferroni correction for multiple testing says that we should choose that threshold to be α/m, so that we expect fewer than one gene to pass under the null hypothesis. Here's the result of actually doing 24,922 t-tests on the Immgen data.

Figure – identifying differentially expressed T-cell genes. Histograms of the number of genes (log-scale y-axis) falling in each P-value bin, for uncorrected P-values (left) and Bonferroni-corrected P-values (right).

First of all, notice that the y-axis of these graphs is in *log* scale, so the number of genes with P < 0.05 is very large. In fact, more than 50% of the 25,000 genes are expressed differently in the T-cells compared to all other cells – obviously more than the 5% expected with P < 0.05 by chance. By looking at the distribution of P-values, we can guess that around 300 genes (halfway between 100 and 1,000 on a log scale) are expected in each bin – at least, that's the number we see in the bins for P-values greater than about 0.3. Based on this, you'd guess that of the more than 14,000 genes with P-value < 0.05 (remember, log scale), there are 14,000 - 300 = 13,700 more than expected by chance. Alternatively, you could argue that you would have expected to see 5% of the 25,000 genes in each bin under the null hypothesis, or about 1,250. This suggests that there are 14,000 - 1,250 = 12,750 more than expected under the null hypothesis. So, based on the uncorrected P-values, in either case we'd guess that more than 12,000 genes have different expression patterns in the T-cells relative to the other cells. Now let's consider the effect of the Bonferroni correction (right panel). You can see that the correction has shifted many P-values into the largest bin: these were all the P-values that were set to 1 because m × P was greater than 1. But more importantly, look at the 0.05 bin (furthest to the left). There are now 4,500 genes with P < 0.05. Since the P-values are now corrected, we can say that these are the genes that have significantly different expression in T-cells compared to all other cells. Yay!
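For readers who want to try this kind of analysis, here is a minimal sketch, with simulated arrays standing in for the Immgen measurements (in the real analysis, X would hold each gene's expression values in the T-cells and Y its values in all other cells):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulated stand-ins for the Immgen data: one row per gene,
# one column per cell sample (X = T-cells, Y = all other cells).
rng = np.random.default_rng(1)
X = rng.normal(size=(24_922, 20))
Y = rng.normal(size=(24_922, 200))

pvals = stats.ttest_ind(X, Y, axis=1, equal_var=False).pvalue  # one t-test per gene
p_bonf = np.minimum(1, len(pvals) * pvals)                     # Bonferroni correction

plt.hist(pvals, bins=20, log=True)   # log-scale y-axis, as in the figure
plt.xlabel("P-value (uncorrected)")
plt.ylabel("Number of genes")
plt.show()
```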
4,500 genes is a lot of genes. However, you might have noticed that this is a lot less than the more than 12,000 genes we estimated by looking at the uncorrected P-value distribution. Essentially, what's going on here is that the Bonferroni correction is giving us the 4,500 *most* significantly different genes, but in fact it's being *way* too stringent. The Bonferroni correction was designed for situations where only one or two or a handful of tests would be significant – it really doesn't make much sense in the case where thousands of tests are significant. It's not really reasonable to expect to do thousands of tests and not include *any* false rejections of the null hypothesis. Because of the requirement that *everything* is significant, when we use the Bonferroni correction in a case like this, we're going to miss thousands of significant things – in this case, the bulk of our signal. Here we still found plenty of significant genes, so we were happy, but there are many situations where, using the Bonferroni correction, we would conclude that there is no signal in our data when in fact there are thousands of bona fide associations. It is for this reason that it's always advisable to look at the distribution of the raw P-values for at least a few examples before deciding on the multiple testing correction procedure.

The FDR

From a probabilistic perspective, the interpretation of the Bonferroni correction is that we choose a new threshold for the P-value where we don't expect anything to pass by chance. In practice, however, with more than 25,000 GO terms used in gene set enrichment analysis and 25,000 genes measured in each experiment, there are situations where we just won't be able to get a small enough P-value to pass the Bonferroni-corrected threshold. In this case, we might choose another threshold where we know that some of the tests that pass are actually false discoveries (expected under the null hypothesis), as long as the fraction of tests that are false discoveries is known (and small). Changing the P-values according to this strategy is known as controlling the false discovery rate, or performing an FDR correction. The most famous and widely used FDR correction is the so-called Benjamini-Hochberg (BH) correction, which is applied as follows:

1. Calculate the P-values for each of the tests that are performed.
2. Rank the tests by their P-values, so that the most significant (smallest P-value) is 1st, the second most significant is 2nd, etc.
3. Move down the ranked list and find the largest rank r for which the P-value is less than r × α / m, where m is the number of tests and α is the FDR threshold; reject the null hypothesis for the r most significant tests.

In this case, α is no longer a P-value, but rather an estimate of the fraction of the tests that are falsely rejected, or P(H0 was true | H0 was rejected). Notice the similarity between the formula used to choose the FDR P-value threshold and the Bonferroni correction. For r = 1, when there is only one test with a small P-value, the FDR threshold is equivalent to the Bonferroni correction. However, if there are 100 tests with small P-values, the threshold will be 100 times less stringent than the Bonferroni correction. In exchange, with FDR we accept that about 100 × α of those tests will actually be false discoveries.
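Here is a minimal sketch of this procedure in code, assuming only that the raw P-values have been collected into a NumPy array:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up BH procedure: return a boolean array, True where the
    null hypothesis is rejected at FDR level alpha."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)          # rank the tests, most significant first
    ranked = p[order]
    # compare the r-th smallest P-value to its threshold r * alpha / m
    passes = ranked <= np.arange(1, m + 1) * alpha / m
    reject = np.zeros(m, dtype=bool)
    if passes.any():
        r = passes.nonzero()[0].max() + 1   # largest rank passing its threshold
        reject[order[:r]] = True            # reject the r most significant tests
    return reject
```

For example, benjamini_hochberg(pvals, alpha=0.05).sum() gives the number of tests passing the FDR threshold.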
Applying the BH procedure to the search for genes that are differentially expressed between T-cells and other cells, at α = 0.05, we have more than 13,000 genes passing the threshold. Since 5% of these are false discoveries, our estimate for the number of differentially expressed genes is actually closer to 12,000. Notice that this is very close to what we guessed by looking at the P-value distribution above, and more than 2.5× the estimate we got using the Bonferroni correction. There are other interesting ways of controlling the false discovery rate (e.g., Storey's Q-values), but the BH procedure is widely used because it is simple to calculate.

eQTLs – a very complicated multiple-testing problem

As a final example of how important multiple testing can be in practice, let's consider one of the most complicated statistical problems in modern molecular biology: the search for eQTLs. The goal of this kind of experiment is to systematically find associations between gene expression levels and genotypes. For example, in their classic eQTL experiment in yeast, Brem et al. measured the expression levels of thousands of genes in more than 100 segregants of a cross, and also genotyped the segregants at thousands of markers covering nearly the whole genome. There was a very strong eQTL for AMN1 expression just upstream of the AMN1 gene, such that strains with the RM allele had much higher expression levels than those with the BY allele. This is exactly the sort of thing we're looking for in an eQTL study, as it strongly suggests that there's an important genetic difference between the two strains that controls expression of the nearby gene.

Figure – multiple testing in a yeast eQTL experiment. Left: histograms of AMN1 and DSE1 gene expression (log ratio) in strains carrying the BY or RM allele at chr2 position 555787 (y-axis: fraction of strains). Right: the distribution of uncorrected P-values for a random sample of the tests, and the number of tests (log scale) implied by testing one marker, all markers, or all pairs of markers.

As is obvious from the histogram, there's a very strong statistical difference between these two expression patterns. Since the expression levels look reasonably Gaussian, we can go ahead and do a t-test, which gives a P-value around 10^-30: strong evidence that the genotype and expression level are not independent. However, AMN1 might not be the only gene whose expression is linked to this locus. Sure enough, DSE1's expression level is also very strongly linked to the SNP near the AMN1 locus, and as you can guess from the histogram, it's highly statistically significant. And this is the great thing about the eQTL experiment: we've measured expression levels for all ~5,800 yeast genes, so we can go ahead and test them all against the genotype at this locus. It's only 5,800 tests: clearly these will still be significant even after a stringent multiple testing correction.
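Here is a sketch of what this single-locus scan might look like, assuming a hypothetical expression matrix expr (one row per gene, one column per strain) and a 0/1 genotype vector for the marker near AMN1:

```python
import numpy as np
from scipy import stats

def eqtl_scan_one_marker(expr, genotype):
    """Test each gene's expression against the genotype at a single locus."""
    pvals = np.empty(expr.shape[0])
    for j in range(expr.shape[0]):
        by = expr[j, genotype == 0]   # expression in strains with the BY allele
        rm = expr[j, genotype == 1]   # expression in strains with the RM allele
        # Welch's t-test; stats.mannwhitneyu(by, rm) is the drop-in
        # alternative for the WMW (rank-based) version of the scan
        pvals[j] = stats.ttest_ind(by, rm, equal_var=False).pvalue
    return pvals
```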
However, in general, we don't know in advance which loci will be important: remember, we also have the genotypes at ~3,000 loci. We need to try them all, and now we're up to more than 17 million tests. The P-values for 10,000 of those WMW tests (randomly chosen from the 17 million) are shown in the figure. You can see that, as with the differentially expressed genes, there is already a large number of tests piling up in the bin for P < 0.05, so some kind of FDR correction will probably be appropriate here.

In general, however, geneticists don't believe that genotypes affect quantitative traits independently: typically, the expression level of a given gene is expected to be affected by multiple loci in the genome (for example, genetic differences in transcription factor binding sites in the promoter, as well as genetic differences in the transcription factors that bind to them). What we would really like is to test for associations between combinations of genotypes and expression levels. I hope it's clear that doing this naively, even just for pairs of loci, will yield more than 10 billion tests (see the sketch after the exercises), not to mention the computer power needed to actually compute all the tests. Obviously, smarter methods to search for these types of effects are needed.

Finally, to add one more layer of complexity to the eQTL picture: so far we've been considering only differences in gene expression level that are associated with genotype. In fact, it's quite possible that the variability of gene expression levels, or the covariance between the expression levels of two or more genes, is also under genetic control. I hope it's no surprise that the development of statistical methods for eQTL analysis has been an area of intense research interest.

Exercises

1. Assuming you were trying to do an eQTL study using the expression of all the genes in the Immgen dataset and all 1.5 million SNPs in the Phase 3 HapMap, what would be the multiple testing correction?
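To make the test counts above concrete, here is a quick back-of-the-envelope sketch (assuming ~5,800 genes and ~3,000 markers, as in the yeast experiment); the same arithmetic, with the Immgen and HapMap numbers swapped in, is a natural starting point for exercise 1:

```python
from math import comb

genes, markers = 5_800, 3_000
scenarios = {
    "one marker, all genes": genes,
    "all markers, all genes": genes * markers,
    "all pairs of markers, all genes": genes * comb(markers, 2),
}
for label, m in scenarios.items():
    # with m tests, the Bonferroni-corrected threshold is 0.05 / m
    print(f"{label}: {m:.2e} tests, Bonferroni threshold {0.05 / m:.1e}")
# one marker: 5.80e+03 tests; all markers: 1.74e+07 (the ~17 million above);
# all pairs of markers: 2.61e+10 (the "more than 10 billion" above)
```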