Statistical Tests for Gene Clusters Spanning Three Genomic Regions
Gene clusters: evidence of common ancestry?
Many analyses use gene clusters (distinct chromosomal regions that share homologous
gene pairs, but for which neither gene order nor gene content is preserved) as evidence
of shared ancestry. However, it is necessary to first rule out the possibility that the regions
are unrelated and simply share homologous genes by chance.
[Figure: a gene cluster spanning windows W1 and W2. Are W1 and W2 homologous regions?]
[Figure: windows W1, W2 and W3. Given a third region W3, are W1 and W2 homologous?]
Current statistical approaches
primarily focus on comparisons of
two regions only. With the rapid rate
of whole genome sequencing,
analysis of gene clusters that span
three or more chromosomal regions
is of increasing interest. However,
the statistical questions are more
difficult.
To design statistical tests for three regions we need to model:
• the number of genes shared among the three regions
• the extent of gene content overlap among the genomes
Narayanan Raghupathy*, Rose Hoberman*, and Dannie Durand
Carnegie Mellon University, Pittsburgh, PA
When comparing two regions, the number of shared genes, x, is a natural test statistic: the more
genes that are shared, the less likely it is that they are shared by chance. In contrast, when
comparing three regions, there are many quantities that provide evidence of homology:
• the number of genes shared among all three regions (x123)
• the number of genes shared between exactly two regions (x12, x13, x23)
• the number of genes unique to one window (x1, x2, x3)
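The quantities listed above can be tallied directly from the gene content of three windows. The following is a minimal sketch, assuming each window is given as a collection of gene family labels and that a family occurs at most once per window; the function name `venn_counts` and the example windows are hypothetical and chosen for illustration only.

```python
def venn_counts(w1, w2, w3):
    """Count gene families by the exact set of windows in which they occur.

    Assumption: each window is a collection of gene family labels, and a
    family is counted once per window even if it is listed more than once.
    """
    s1, s2, s3 = set(w1), set(w2), set(w3)
    x123 = len(s1 & s2 & s3)
    return {
        "x123": x123,                    # shared by all three windows
        "x12": len(s1 & s2) - x123,      # shared by W1 and W2 only
        "x13": len(s1 & s3) - x123,      # shared by W1 and W3 only
        "x23": len(s2 & s3) - x123,      # shared by W2 and W3 only
        "x1": len(s1 - s2 - s3),         # unique to W1
        "x2": len(s2 - s1 - s3),         # unique to W2
        "x3": len(s3 - s1 - s2),         # unique to W3
    }


# Hypothetical windows; each letter stands for one gene family.
w1 = ["a", "b", "c", "d", "e", "f"]
w2 = ["a", "b", "c", "d", "g", "h"]
w3 = ["a", "e", "g", "i"]
counts = venn_counts(w1, w2, w3)
print(counts)
```

Subtracting x123 from each pairwise intersection keeps the seven counts disjoint, so every gene family contributes to exactly one region of the Venn diagram.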
Our statistical approach tests the hypothesis that a gene cluster is
evidence of shared ancestry against a null hypothesis of random gene
order. We try to rule out the null hypothesis by showing that the
probability of the observed cluster is small under the null hypothesis.
Given a set of three windows, each containing r consecutive genes, we
wish to determine whether the windows share more homologous genes
than expected by chance.
Previous attempts to test the significance of three or more regions have either used multiple
pairwise comparisons (reviewed by Simillion et al. [2]) or considered only genes shared
between all regions (x123) [1]. How best to combine evidence from different subsets of regions
remains an unsolved problem.
The significance of a cluster depends not only on the properties of windows, but also on the size of the
genomes and the number of genes in common between the genomes. We design statistical tests for
genome models that are appropriate for two common types of comparative genomics problems.
We propose a novel test that takes into account both the genes conserved in all
three regions (x123) and in only pairs of regions (x12, x13 and x23). We use a
combinatorial approach to obtain expressions for each genome model for the
probability $P(\mathbf{X} \ge \mathbf{x})$ under the null hypothesis of random gene order
(equations omitted for brevity), where $\mathbf{X} = (X_{123}, X_{12}, X_{13}, X_{23})$ denotes the
random variables drawn from the distribution given by the null hypothesis.
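Since the combinatorial expressions are omitted here, the tail probability can be approximated by simulation instead. The sketch below is a Monte Carlo stand-in, not the combinatorial method used on this poster. It assumes the orthology model, three windows at fixed positions (so that under random gene order each window is a uniformly random set of r genes), and that "at least as large" is applied to each of the four counts separately. The function names, the trial count and the two example clusters are illustrative assumptions.

```python
import random


def window_counts(windows):
    """Return (x123, x12, x13, x23) for three windows given as sets."""
    s1, s2, s3 = windows
    x123 = len(s1 & s2 & s3)
    return (x123,
            len(s1 & s2) - x123,
            len(s1 & s3) - x123,
            len(s2 & s3) - x123)


def estimate_tail_probability(x_obs, n123, singletons, r, trials, seed=0):
    """Monte Carlo estimate of P(X >= x_obs) under random gene order.

    Assumed setup (orthology model): each genome holds n123 shared genes,
    labelled 0..n123-1, plus `singletons` genes with no homolog elsewhere.
    A window of r consecutive genes at a fixed position in a randomly
    ordered genome is a uniform random r-subset of that genome.
    """
    rng = random.Random(seed)
    genome_size = n123 + singletons
    hits = 0
    for _ in range(trials):
        windows = []
        for _genome in range(3):
            sampled = rng.sample(range(genome_size), r)
            # Labels below n123 are shared genes; singletons never match.
            windows.append({g for g in sampled if g < n123})
        x = window_counts(windows)
        if all(xi >= oi for xi, oi in zip(x, x_obs)):
            hits += 1
    return hits / trials


# Two illustrative clusters with n123 = 5000, no singletons, r = 100:
# two genes shared by all three windows, versus two genes per pair.
p_a = estimate_tail_probability((2, 0, 0, 0), 5000, 0, 100, trials=2000)
p_b = estimate_tail_probability((0, 2, 2, 2), 5000, 0, 100, trials=2000)
print(p_a, p_b)
```

Simulation only resolves probabilities down to roughly one over the number of trials, which is one reason closed-form expressions are preferable for the very small probabilities of interest in significance testing.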
We present the first attempt to evaluate the significance of clusters spanning exactly
three regions, taking into account both the genes conserved in all regions and in only
pairs of regions. We
• develop genome models appropriate for common comparative genomics problems.
• develop statistical tests for clusters spanning three regions, for each model.
• study the relative importance of the above quantities to cluster significance.
• investigate how the genome model affects cluster significance.
• compare our proposed tests to previous statistical approaches.
A gene cluster spanning two regions can be characterized by the
following quantities:
• the number of shared genes (x)
• the number of genes unique to each window
We illustrate these by a Venn diagram
representation of a gene cluster, where each
circle represents a window, and the number of
shared genes (x) is given in the intersection.
Does the proportion of singleton genes in the genome matter?
Genomes under comparison often contain singletons,
genes which do not have homologs in any of the other
genomes (n1, n2, n3 in the orthology model).
As the proportion of singletons in the genomes increases,
cluster significance increases substantially. This is because
as fewer homologs are shared between the genomes, it is
more surprising to find them clustered together.
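The effect of singletons can be illustrated in the simpler two-region setting, where an exact calculation is short. The sketch below is a two-genome simplification, not the three-region expressions used on this poster. It assumes each genome has `n_shared` genes with a homolog in the other genome plus `s` singletons, and that one window of r genes sits at a fixed position in each genome. The function names and the values of s are illustrative.

```python
from math import comb


def hypergeom_pmf(k, population, marked, draws):
    """P(exactly k marked items among `draws` draws without replacement)."""
    if k < 0 or k > marked or draws - k < 0 or draws - k > population - marked:
        return 0.0
    return (comb(marked, k) * comb(population - marked, draws - k)
            / comb(population, draws))


def two_window_tail(x, n_shared, s, r):
    """P(two fixed windows of r genes share at least x homologs).

    Assumed two-genome setting under random gene order: each genome has
    n_shared genes with a homolog in the other genome and s singletons.
    """
    size = n_shared + s
    total = 0.0
    for h in range(min(r, n_shared) + 1):
        # h = number of homolog-bearing genes that land in the first window.
        p_h = hypergeom_pmf(h, size, n_shared, r)
        # Given h, the second window contains k of those h homologs.
        p_tail = sum(hypergeom_pmf(k, size, h, r) for k in range(x, h + 1))
        total += p_h * p_tail
    return total


# Genome size fixed at 5000 genes, window size r = 100, x = 3 shared genes.
singleton_counts = [0, 1000, 2500, 4000]
tails = [two_window_tail(3, 5000 - s, s, 100) for s in singleton_counts]
for s, p in zip(singleton_counts, tails):
    print(s, p)
```

With the genome size held fixed, the probability of sharing three or more genes by chance shrinks as s grows, which is the two-region analogue of the trend described above.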
How much more does a gene shared by all three
regions contribute to significance?
Which cluster is less likely to occur by chance, when genes are arranged randomly?
Statistical Tests
Orthology model: n123 genes are shared between all three
genomes. The remaining genes in each genome (n1, n2, n3)
are singletons, genes which do not have homologs in any of
the other genomes.
[Figure parameters: n1 = n2 = n3 = s, n123 + s = 5000, r = 100]
Duplication model: Gpost is a genome that has undergone a
whole genome duplication (WGD) and Gpre is a related genome
that diverged from a common ancestor before the WGD.
a) n1,2 genes appear twice in Gpost and once in Gpre. These are
the genes that are retained in duplicate.
b) n1,1 genes appear once in Gpost and once in Gpre. These are
the genes for which one of the duplicate copies was lost.
c) n0,1 genes appear once in Gpost but do not appear in Gpre.
How do retained duplicates after WGD affect cluster significance?
Following a WGD, in many cases there is no immediate selective advantage for retaining a gene in duplicate,
so one of the duplicates is often lost. Therefore, paralogous regions may share few paralogous genes. Thus,
these duplicated regions are often detected by comparison to a related pre-duplication genome.
We computed cluster probabilities for the duplication model using the following parameters:
n1,1 = 3600, n1,2 = 450 and n0,1 = 500. This is consistent with a recent study of pre- and post-duplication yeast
species [3], in which only 16% of duplicates were retained following WGD in S. cerevisiae.
[Figure: two duplication-model clusters spanning windows Wpre, Wpost1 and Wpost2]
a) Wpre shares three genes with Wpost1, and three other genes with Wpost2
b) Wpre shares only two genes each with Wpost1 and
Wpost2, but Wpost1 and Wpost2 share an additional gene
[Figure: two orthology-model clusters spanning windows W1, W2 and W3; n123 = 5000, n1 = n2 = n3 = 0, r = 100]
a) Two genes are shared by all three
windows (x123 = 2, x12 = x13 = x23 = 0)
b) Two distinct genes are shared
by each pair of windows
(x123 = 0, x12 = x13 = x23 = 2)
In both cases, each pair of windows shares
two genes. However, the total number of
genes shared in (b) is twice as large as in
(a). Nonetheless, as the figure at right
shows, the scenario shown in cluster (a) is
much less likely to occur by chance under
the orthology model. This illustrates the
importance of x123 to cluster significance.
Using these expressions, we computed cluster probabilities in Mathematica for
typical genome parameters and window sizes. These computations were used to
investigate the following questions.
[Figure parameters: n1,1 = 3600, n1,2 = 450, n0,1 = 500, r = 100]
Are pairwise statistical tests sufficient?
The most common strategy for testing significance
of multiple regions is to conduct multiple pairwise comparisons
(reviewed in [2]). For example, if region W1 is significantly similar to
region W2, and W2 is significantly similar to region W3, then homology
between all three regions is inferred, even if W1 and W3 share few or no genes.
This approach allows the use of existing statistical methods, which are
designed for comparing two regions. However, the pairwise approach
• requires at least two of the three pairwise comparisons to be
independently significant
• does not consider the greater impact of genes shared among all three
regions.
We compared the pairwise probabilities to our three-way probabilities for various
cluster parameter values. The figure below shows that, even when x123= 0,
pairwise tests underestimate the significance, when compared to our three-way
test, which considers all three regions jointly.
Which cluster is less likely to occur by chance, if 84% of duplicates were lost following WGD?
The expression $\mathbf{X} \ge \mathbf{x}$ is shorthand for $X_{123} \ge x_{123}$, $X_{12} \ge x_{12}$, $X_{13} \ge x_{13}$ and $X_{23} \ge x_{23}$;
that is, each of the quantities $(X_{123}, X_{12}, X_{13}, X_{23})$ is at least as large as the
observed quantity.
Our goals:
Hypothesis Testing Approach
Gene content overlap models
The first model is designed for analyses of conserved linkage of genes in three regions from three distinct
genomes. The second model is for detection of segments duplicated by a whole genome duplication
(WGD), via comparison with the genome of a related, pre-duplication species. We again use a Venn
diagram representation to illustrate the extent of gene content overlap among the genomes.
In the cluster at left, x123 = 1, x12 = 3, x13 = 1, x23 = 1
* Contributed equally
The figure at right shows that the two scenarios shown above are
actually quite close in significance, even though the second scenario
shares fewer homologous matches. Current approaches typically
compare the pre-duplication region independently with each of the
post-duplication regions, and thus ignore the values of x23 and x123.
These methods could fail to detect clearly significant clusters.
[Figure: pairwise versus three-way cluster probabilities; n123 = 5000, n1 = n2 = n3 = 0, r = 100, x123 = 0; legend entries: x12 = x13 = x23 and x12 = x13]
For example, given a significance
threshold of $\alpha = 0.01$, the pairwise
approach requires two of the three
pairs of regions to share at least seven
genes each. In contrast, using our three-way
test, a cluster is significant when
each pair of regions shares only
four genes.
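The pairwise threshold can be checked with a short exact calculation. The sketch below is a simplified pairwise test, not necessarily the one used for the figure. It assumes two genomes containing the same n = 5000 genes in random order, one window of r = 100 genes at a fixed position in each genome, and no correction for multiple comparisons. Under these assumptions the number of shared genes follows a hypergeometric distribution. The function names are illustrative.

```python
from math import comb


def pairwise_tail(x, n, r):
    """P(two fixed windows of r genes share at least x genes).

    Assumes both genomes contain the same n genes in random order, so the
    shared-gene count is hypergeometric (population n, r marked, r draws).
    """
    favourable = sum(comb(r, k) * comb(n - r, r - k) for k in range(x, r + 1))
    return favourable / comb(n, r)


def min_significant_count(alpha, n, r):
    """Smallest shared-gene count whose tail probability is at most alpha."""
    for x in range(r + 1):
        if pairwise_tail(x, n, r) <= alpha:
            return x
    return None


threshold = min_significant_count(0.01, 5000, 100)
print(threshold)
```

Under these assumptions the search returns seven, in line with the pairwise requirement quoted above. The corresponding three-way threshold depends on the joint probability of all four counts and is not reproduced by this pairwise sketch.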
Our results suggest that pairwise tests are not
always sufficient and multi-region tests will be
able to identify more distantly related
homologous regions.
References
[1] D. Durand and D. Sankoff, J. Comput. Biol. 10, 2003.
[2] C. Simillion et al., Bioessays 26, 2004.
[3] K.P. Byrne and K.H. Wolfe, Genome Res. 10, 2005.