Chapter 3

Multiple testing.
Given the ease with which data can be collected and statistical tests can be performed with
computational statistics packages, the problem in contemporary molecular biology is that we
often have too many statistical tests that we might want to do. For example, in gene set
enrichment analysis, we might not know in advance which kinase the genes we predicted were
likely to be targets of. We could try testing all of them. Or maybe we think the genes are either
targets of a kinase or a transcription factor. In that case, we could try testing for enrichment of
targets of all possible kinases and transcription factors. More typically, we are in the situation
where we have no idea what our list of genes is doing and we will just want to test for all
possible enrichments of anything we can get our hands on: GO categories, MIPS functions,
KEGG pathways, ChIP-seq compendiums, phospho-proteomics datasets, you get the picture. In
bioinformatics jargon these types of functional assignments to genes are known as “annotations.”
Table. Some common annotations of gene function
Of course, this seems like a good idea, and we can often discover interesting biology this way.
However, let’s think about the statistical situation here. In Chapter 2, when we tested for
enrichment of kinase substrates in our gene list, the P-value was the probability of getting a gene
list as extreme as or more extreme than the one we got under the null hypothesis that our gene list was
randomly drawn from the pool of genes. Let's say we want the significance threshold to be
P<0.05. This threshold means we accept a 1 in 20 chance of seeing a gene list this extreme
under the null hypothesis. Consider what happens if we now perform 20 tests: we
expect about one of them to have P<0.05 even when there is no enrichment. More generally, for
any P-value threshold, α, if we do m tests, we expect about m × α tests to pass it under the null
hypothesis. Another way to think about this is to remember that the distribution of P-values
under the null hypothesis is uniform between 0 and 1. 0.05 covers 1/20 of this distance. This
means that if we randomly draw m P-values from the null distribution, m/20 will fall below 0.05.
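To make this concrete, here is a minimal simulation in Python (a sketch, assuming numpy and scipy are installed) in which every null hypothesis is true, so any P-value below 0.05 is a false positive:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    m = 10000  # number of tests, all performed under the null hypothesis

    # Simulate m two-sample t-tests where both samples come from the
    # same distribution, so every null hypothesis is true.
    x = rng.normal(size=(m, 10))
    y = rng.normal(size=(m, 10))
    pvals = stats.ttest_ind(x, y, axis=1).pvalue

    # Under the null, P-values are uniform on [0, 1], so we expect
    # about m * 0.05 = 500 of them to fall below 0.05.
    print((pvals < 0.05).sum())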
You can see immediately that in the context of many tests (or multiple testing) there will be a
problem with our conventional use of the P-value. The simplest solution to this problem is the
so-called Bonferroni correction to the P-value. The idea here is to use the laws of probability to
correct the P-values. Since we know the probability of the P-values under the null hypothesis is
uniform, we simply *rescale* the region that corresponds to a “significant” P-value to be 0.05
*taking into account* the fact that we have done m tests. The new P-value threshold is simply
0.05/m and now there will only be a 5% chance of observing a P-value that small under the null
hypothesis. Alternatively, we can achieve the same effect by rescaling the P-values themselves
or “correct” the P-values by simply multiplying them by the number of tests: P becomes P × m. To
keep the P-values as probabilities, any P-value that gets bigger than 1 is simply set to 1.
Formally, the correction can therefore be written as min(1, m × P). Now, no matter how many tests
we do, we will only have a 5% chance of observing a P-value below 0.05 under the null
hypothesis.
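In code, the correction is a one-liner; here is a sketch (again in Python with numpy) of the min(1, m × P) rescaling:

    import numpy as np

    def bonferroni(pvals):
        # Multiply each P-value by the number of tests m, capping the
        # result at 1 so that it remains a valid probability.
        pvals = np.asarray(pvals, dtype=float)
        return np.minimum(1.0, len(pvals) * pvals)

    # With 20 tests, a raw P-value of 0.01 is corrected to 0.2,
    # which is no longer significant at the 0.05 level.
    print(bonferroni([0.01] + [0.5] * 19))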
The Bonferroni correction is widely used in practice because it is so simple to calculate. The only
tricky part is knowing exactly what number of tests you really did (this is not always strictly
equal to the number of tests you actually performed using your statistics software).
Figure X illustrates multiple testing and the Bonferroni correction for Gene Set Enrichment
Analysis.
Multiple testing in Differential Expression analysis
Another very well-studied example where multiple testing is important in modern molecular
biology is the attempt to identify differentially expressed genes, known as Differential Expression
analysis. Let's consider trying to find the genes that are differentially expressed between T-cells
and all other cells in the Immgen data. We have ~25,000 genes whose expression has been
measured in hundreds of cells. More formally,
1. We want to find for which of m genes the expression level is different in the T-cells.
2. For each gene, j, we have a sample of n observations, Xj1, Xj2, … and Yj1, Yj2, …, where
the X's are the expression levels in the T-cells and the Y's are the expression levels in the
other cells.
3. For each gene, we compute a test statistic, t* = f(Xj, Yj), and compute the P-value,
Pj = P(t ≥ t* | H0).
Under the null hypothesis, for any P-value threshold α we choose, we expect to find m × α
genes beyond that threshold. The Bonferroni correction for multiple
testing says that we should choose that threshold to be α/m, so that we expect fewer than 1 gene
to pass it under the null hypothesis. Here's the result of actually doing 24,922 t-tests
on the Immgen data.
Figure – identifying differentially expressed T-cell genes. [Two histograms showing the number
of genes (log scale) falling in each P-value bin: left panel, uncorrected P-values; right panel,
Bonferroni-corrected P-values.]
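Since the Immgen expression matrix itself isn't included here, the following sketch generates synthetic data with the same flavor (half the genes truly differ between the two groups of cells) and plots histograms of the raw and Bonferroni-corrected P-values on a log scale:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(1)
    m, n = 25000, 50  # genes and samples per group (synthetic stand-in)

    # Shift the mean expression of the first half of the genes, so that
    # half of the null hypotheses are actually false.
    shift = np.where(np.arange(m) < m // 2, 1.0, 0.0)[:, None]
    tcells = rng.normal(loc=shift, size=(m, n))
    others = rng.normal(size=(m, n))

    pvals = stats.ttest_ind(tcells, others, axis=1).pvalue
    corrected = np.minimum(1.0, m * pvals)

    bins = np.arange(0, 1.05, 0.05)
    fig, axes = plt.subplots(1, 2, sharey=True)
    for ax, p, label in zip(axes, [pvals, corrected],
                            ["uncorrected", "Bonferroni corrected"]):
        ax.hist(p, bins=bins, log=True)  # log-scale y-axis, as in the figure
        ax.set_xlabel("P-value (%s)" % label)
    axes[0].set_ylabel("Number of genes")
    plt.show()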
First of all, notice that the y-axis of these graphs is in *log* scale, so the number of genes with
P<0.05 is very large. In fact, more than 50% of the 25,000 genes are expressed differently in the
T-cells compared to all other cells. Obviously this is more than the 5% expected with P<0.05 by
chance. In fact, by looking at the distribution of P-values we can guess that probably around 300
genes (halfway between 100 and 1000 on a log scale) are expected in each bin – at least that’s
the number we see in the bins for P-values greater than about 0.3. Based on this, you’d guess that
of the more than 14,000 genes with P-value < 0.05 (remember, log scale) there are 14,000 – 300
= 13,700 more than expected by chance. Alternatively, you could argue that you would have
expected to see 5% of the 25,000 under the null hypothesis in each bin, or about 1,250. This
suggests that there are 14,000 – 1,250 = 12,750 more than expected under the null hypothesis. So
based on the uncorrected P-values, in either case we'd guess that more than 12,000 genes have
different expression patterns in the T-cells relative to the other cells.
Now let’s consider the effect of the Bonferroni correction (right panel). You can see that the
correction has shifted many P-values into the largest bin: these were all the P-values that were set
to 1 because m*P was greater than 1. But more importantly, look at the 0.05 bin (furthest to the
left). There are now 4,500 genes with P<0.05. Since the P-values are now corrected, we can say
that these are the genes that have significantly different expression in T-cells compared to all
other cells. Yay! 4,500 genes is a lot of genes. However, you might have noticed that this is a
lot less than the more than 12,000 genes we estimated by looking at the uncorrected P-value
distribution. Essentially, what’s going on here is that the Bonferroni correction is giving us the
4,500 *most* significantly different genes, but in fact it’s being *way* too stringent. The
Bonferroni correction was designed for situations where only 1 or 2 or a handful of tests would
be significant – it really doesn’t make much sense in the case where 1000s of tests are
significant. It's not really reasonable to expect to do 1000s of tests and not include *any* false
rejections of the null hypothesis. Because of the requirement that *everything* that passes is truly significant, when
we use the Bonferroni correction in a case like this, we’re going to miss 1000s of significant
things – in this case, the bulk of our signal. Here, we still found plenty of significant
genes, so we were happy, but there are many situations where, using the Bonferroni correction, we
would conclude that there is no signal in our data when in fact there are thousands of bona fide
associations. It is for this reason that it's always advisable to look at the distribution of the raw
P-values, at least for a few examples, before deciding on the multiple testing correction procedure.
The FDR
From a probabilistic perspective, the interpretation of the Bonferroni correction is that we choose
a new threshold for the P-value where we don’t expect anything to pass by chance. In practice,
however, with more than 25,000 GO terms used in gene set enrichment analysis and 25,000
genes measured in each experiment, there are situations where we just won’t be able to get a
small enough P-value to pass the Bonferroni-corrected threshold.
In this case, we might choose another threshold where we know that some of the tests that pass
are actually false discoveries (expected under the null hypothesis) as long as the fraction of tests
that are false discoveries is known (and small). Changing the P-values according to this strategy
is known as controlling the false discovery rate or performing an FDR correction.
The most famous and widely-used FDR correction is the so-called Benjamini-Hochberg
correction. This correction is applied as follows:
1. Calculate the P-values for each of the tests that are performed.
2. Rank the tests by their P-values, so that the most significant is 1st, the second most
significant is 2nd, etc.
3. Move down the ranked list and reject the null hypothesis for the r most significant tests,
where r is the largest rank at which the P-value is still less than r × α / m, m is the number
of tests, and α is the FDR threshold.
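A minimal sketch of this procedure in Python is below; the multipletests function in the statsmodels package (with method="fdr_bh") implements the same correction:

    import numpy as np

    def benjamini_hochberg(pvals, alpha=0.05):
        # Returns a boolean array that is True for the tests whose null
        # hypothesis is rejected at FDR level alpha.
        pvals = np.asarray(pvals, dtype=float)
        m = len(pvals)
        order = np.argsort(pvals)  # most significant test first
        ranked = pvals[order]
        # Find the largest rank r with P_(r) <= r * alpha / m ...
        below = ranked <= np.arange(1, m + 1) * alpha / m
        reject = np.zeros(m, dtype=bool)
        if below.any():
            r = below.nonzero()[0].max() + 1
            reject[order[:r]] = True  # ... and reject the r most significant
        return reject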
In this case, α is no longer a P-value, but rather an estimate of the fraction of the tests that are
falsely rejected, or P( H0 was true | H0 was rejected ). Notice the similarity between the formula
used to choose the FDR P-value threshold and the Bonferroni correction. When r=1, so that there is only one test with a
small P-value, the FDR threshold is equivalent to the Bonferroni correction. However, if there are 100
tests with small P-values, now the threshold will be 100 times less stringent than the Bonferroni
correction. However, with the FDR, we accept that about 100 × α of those tests will actually be false.
Applying the BH procedure to the search for genes that are differentially expressed between
T-cells and other cells, at α=0.05, we have more than 13,000 genes passing the threshold. Since
about 5% of these are expected to be false discoveries, our estimate for the number of truly
differentially expressed genes is actually closer to 12,000.
Notice that this is very close to what we guessed by looking at the P-value distribution above,
and more than 2.5x the estimate we got using the Bonferroni correction.
There are other interesting ways of controlling the false discovery rate (Storey's Q-values), but
the BH procedure is widely used because it is simple to calculate.
eQTLs – a very complicated multiple-testing problem.
As a final example of how important multiple testing can be in practice, let’s consider one of the
most complicated statistical problems in modern molecular biology: the search for eQTLs. The
goal of this experiment is to systematically find associations between gene expression levels and
genotypes. For example, in their classic eQTL experiment in yeast, Brem et al. measured the
expression levels for 1000s of genes in more than 100 segregants of a cross, and also genotyped
the segregants at 1000s of markers covering nearly the whole genome. In their data, there was a
very strong eQTL for AMN1 expression just upstream of the AMN1 gene, so that strains with
the RM allele had much higher gene expression levels than those with the BY allele. This is
exactly the sort of thing that we’re looking for in an eQTL study, as it strongly suggests that
there’s an important genetic difference between the two strains that controls expression of the
nearby gene.
Figure – multiple testing in a yeast eQTL experiment. [Left: histograms of the fraction of strains
with each AMN1 and DSE1 expression level (log ratio), separated by the BY and RM alleles at
chr2 position 555787. Right: a histogram of uncorrected P-values for 10,000 randomly chosen
tests (100 markers × 100 genes), and the number of tests required for 1 marker, all markers, and
all pairs of markers.]
As is obvious from the histogram, there’s a very strong statistical difference between these two
expression patterns. Since the expression levels look reasonably Gaussian, we can go ahead and
do a t-test, which gives a P-value around 10^-30 – strong evidence that the genotype and expression
level are not independent.
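Here is a sketch of the test itself, with synthetic stand-in arrays for the genotype at the marker and the gene's expression levels (the real values would come from the Brem et al. dataset):

    import numpy as np
    from scipy import stats

    # Synthetic stand-ins: the allele carried by each of 112 segregants
    # at the marker, and the gene's expression level (log ratio), with
    # a large shift between alleles as in the AMN1 example.
    rng = np.random.default_rng(2)
    genotype = rng.choice(["BY", "RM"], size=112)
    expression = np.where(genotype == "RM", 1.5, -0.5) + rng.normal(size=112)

    by = expression[genotype == "BY"]
    rm = expression[genotype == "RM"]

    # Two-sample t-test for a difference in mean expression between alleles
    print(stats.ttest_ind(by, rm))
    # A nonparametric alternative: the Wilcoxon-Mann-Whitney (WMW) test
    print(stats.mannwhitneyu(by, rm))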
However, AMN1 might not be the only gene whose expression is linked to this locus. Sure
enough, DSE1’s expression level is also very strongly linked to the SNP near the AMN1 locus,
and as you can guess from the histogram, it’s highly statistically significant. And this is the great
thing about the eQTL experiment: we’ve measured expression levels for all ~5,800 yeast genes.
So we can go ahead and test them all against the genotype at this locus – it's only 5,800 tests: clearly
these will still be significant even after a stringent multiple testing correction. However, in
general, we don’t know which will be the important loci to look at: remember, we also have the
genotypes at ~3,000 loci. We need to try them all: now we’re up to more than 17 million tests.
The P-values for 10,000 of those WMW tests (randomly chosen from the 17 million) are shown
in the figure. You can see that, as with the differentially expressed genes, there is already a large
number of tests piling up in the bin for P<0.05, so that some kind of FDR correction will
probably be appropriate here.
In general, however, geneticists don’t believe that genotypes affect quantitative traits
independently: typically the expression level of a given gene is expected to be affected by
multiple loci in the genome (for example, genetic differences in transcription factor binding sites
in the promoter, as well as genetic differences in the transcription factors that bind to them).
What we would really like is to test for associations between combinations of genotypes and the
expression levels. I hope it’s clear that doing this naively, even just for pairs of loci will yield
more than 10 billion tests, not to mention the computer power need to actually compute all the
tests. Obviously, smarter methods to search for these types of effects are needed.
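The arithmetic behind these counts is easy to check (a quick sketch using the ~5,800 genes and ~3,000 markers from the text):

    from math import comb

    n_genes, n_markers = 5800, 3000
    print(n_genes * n_markers)           # 17,400,000 single-marker tests
    print(n_genes * comb(n_markers, 2))  # ~26 billion tests for marker pairs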
Finally, to add one more layer of complexity to the eQTL picture: so far we've been considering
only differences in gene expression level that are associated with genotype. In fact,
it's quite possible that the variability in gene expression levels, or the covariance between the
expression levels of one or more genes, is actually under genetic control. I hope that it's no surprise that
development of statistical methods for eQTL analysis has been an area of intense research
interest.
Exercises
1. Assuming you were trying to do an eQTL study using expression of all the genes in the
Immgen dataset on all the 1.5 million SNPs in the Phase 3 HapMap, what would be the
multiple testing correction?