Statistical Tests for Gene Clusters Spanning Three Genomic Regions

Narayanan Raghupathy*, Rose Hoberman*, and Dannie Durand
Carnegie Mellon University, Pittsburgh, PA

Gene clusters: evidence of common ancestry?

Many analyses use gene clusters (distinct chromosomal regions that share homologous gene pairs, but for which neither gene order nor gene content is preserved) as evidence of shared ancestry. However, it is first necessary to rule out the possibility that the regions are unrelated and simply share homologous genes by chance.

[Figure: a gene cluster spanning two regions, W1 and W2. Are W1 and W2 homologous regions?]
[Figure: three regions, W1, W2, and W3. Given a third region W3, are W1 and W2 homologous?]

Current statistical approaches primarily focus on comparisons of two regions only. With the rapid rate of whole genome sequencing, analysis of gene clusters that span three or more chromosomal regions is of increasing interest. However, the statistical questions are more difficult. To design statistical tests for three regions we need to model:
• the number of genes shared among the three regions
• the extent of gene content overlap among the genomes

When comparing two regions, the number of shared genes, x, is a natural test statistic: the more genes that are shared, the less likely it is that they are shared by chance. In contrast, when comparing three regions, there are many quantities that provide evidence of homology:
• the number of genes shared among all three regions (x123)
• the number of genes shared between exactly two regions (x12, x13, x23)
• the number of genes unique to one window (x1, x2, x3)

Our statistical approach tests the hypothesis that a gene cluster is evidence of shared ancestry against a null hypothesis of random gene order. We try to rule out the null hypothesis by showing that the probability of the observed cluster is small under the null hypothesis.
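As an illustration of the three-region quantities listed above, the following minimal sketch counts x123, x12, x13, x23, x1, x2, and x3 for three windows. It assumes each window is given as a set of gene-family identifiers and that two genes are homologous exactly when their identifiers match; the function name cluster_quantities and the toy windows are hypothetical and do not come from the poster.

```python
def cluster_quantities(w1, w2, w3):
    """Count genes by the set of windows in which they occur.

    Assumption: each window is a collection of gene-family identifiers,
    and genes with the same identifier are treated as homologous.
    """
    w1, w2, w3 = set(w1), set(w2), set(w3)
    x123 = len(w1 & w2 & w3)  # shared by all three windows
    return {
        "x123": x123,
        # shared by exactly two windows: pairwise overlap minus the triple overlap
        "x12": len(w1 & w2) - x123,
        "x13": len(w1 & w3) - x123,
        "x23": len(w2 & w3) - x123,
        # unique to a single window
        "x1": len(w1 - w2 - w3),
        "x2": len(w2 - w1 - w3),
        "x3": len(w3 - w1 - w2),
    }


# Hypothetical toy windows (identifiers are made up for illustration).
w1 = {"a", "b", "c", "d", "e", "u1"}
w2 = {"a", "b", "c", "d", "f"}
w3 = {"a", "e", "f", "u3"}
counts = cluster_quantities(w1, w2, w3)
print(counts)
```

The toy windows were chosen so that x123 = 1, x12 = 3, x13 = 1, and x23 = 1. Representing windows as sets keeps the Venn-diagram regions explicit, at the cost of ignoring multiple copies of the same family within one window.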
Given a set of three windows, each containing r consecutive genes, we wish to determine whether the windows share more homologous genes than expected by chance.

Previous attempts to test the significance of three or more regions have either used multiple pairwise comparisons (reviewed by Simillion et al. [2]) or considered only genes shared between all regions (x123) [1]. How best to combine evidence from different subsets of regions remains an unsolved problem.

The significance of a cluster depends not only on the properties of the windows, but also on the size of the genomes and the number of genes in common between the genomes. We design statistical tests for genome models that are appropriate for two common types of comparative genomics problems.

We propose a novel test that takes into account both the genes conserved in all three regions (x123) and the genes conserved in only pairs of regions (x12, x13, and x23). We use a combinatorial approach to obtain expressions, for each genome model, for the probability P(X ≥ x) under the null hypothesis of random gene order, where X = (X123, X12, X13, X23) denotes the random variables drawn from the distribution given by the null hypothesis. (Equations omitted for brevity.)

We present the first attempt to evaluate the significance of clusters spanning exactly three regions, taking into account both the genes conserved in all regions and in only pairs of regions. We
• Develop genome models appropriate for common comparative genomics problems.
• Develop statistical tests for clusters spanning three regions, for each model.
• Study the relative importance of the above quantities to cluster significance.
• Investigate how the genome model affects cluster significance.
• Compare our proposed tests to previous statistical approaches.
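The probability P(X ≥ x) described above is obtained on the poster from combinatorial expressions that are not reproduced here. As a rough cross-check only, it can be approximated by Monte Carlo simulation. The sketch below is an assumption-laden stand-in and not the authors' method: it treats a single, fixed triple of windows under the orthology model (n_shared genes common to all three genomes plus n_single singletons per genome) and models a window of r consecutive genes in a randomly ordered genome as a uniformly random subset of r genes. All function and parameter names are hypothetical.

```python
import random


def sample_window(n_shared, n_single, r, genome_id, rng):
    """Draw r distinct genes uniformly at random from one genome."""
    picks = rng.sample(range(n_shared + n_single), r)
    # Indices below n_shared are families present in all three genomes.
    # Singletons are tagged with the genome id so they never match elsewhere.
    return {g if g < n_shared else (genome_id, g) for g in picks}


def venn_counts(w1, w2, w3):
    """Return (x123, x12, x13, x23) for three windows given as sets."""
    x123 = len(w1 & w2 & w3)
    return (x123,
            len(w1 & w2) - x123,
            len(w1 & w3) - x123,
            len(w2 & w3) - x123)


def estimate_p(observed, n_shared, n_single, r, trials=10000, seed=0):
    """Estimate P(X123 >= x123, X12 >= x12, X13 >= x13, X23 >= x23)."""
    rng = random.Random(seed)  # seeded so repeated runs agree
    hits = 0
    for _ in range(trials):
        windows = [sample_window(n_shared, n_single, r, i, rng)
                   for i in range(3)]
        counts = venn_counts(*windows)
        # Count the trial if every quantity is at least the observed value.
        if all(c >= o for c, o in zip(counts, observed)):
            hits += 1
    return hits / trials


# Example call: no singletons, 5000 shared genes, windows of 100 genes,
# and an observed cluster with x123 = 0 and x12 = x13 = x23 = 2.
p_hat = estimate_p((0, 2, 2, 2), n_shared=5000, n_single=0, r=100, trials=2000)
print(p_hat)
```

Simulation is practical only for probabilities that are not too small: an event of probability p needs on the order of 1/p trials before it is observed at all, which is one reason closed-form expressions are preferable for highly significant clusters.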
A gene cluster spanning two regions can be characterized by the following quantities:
• the number of shared genes (x)
• the number of genes unique to each window
We illustrate these by a Venn diagram representation of a gene cluster, where each circle represents a window, and the number of shared genes (x) is given in the intersection.

[Poster panel title: Statistical Tests]

Orthology model: n123 genes are shared between all three genomes. The remaining genes in each genome (n1, n2, n3) are singletons, genes which do not have homologs in any of the other genomes.

Duplication model: Gpost is a genome that has undergone a whole genome duplication (WGD) and Gpre is a related genome that diverged from a common ancestor before the WGD.
a) n1,2 genes appear twice in Gpost and once in Gpre. These are the genes that are retained in duplicate.
b) n1,1 genes appear once in Gpost and once in Gpre. These are the genes that were preferentially lost.
c) n0,1 genes appear once in Gpost but do not appear in Gpre.

Does the proportion of singleton genes in the genome matter?

Genomes under comparison often contain singletons, genes which do not have homologs in any of the other genomes (n1, n2, n3 in the orthology model). As the proportion of singletons in the genomes increases, cluster significance increases substantially. This is because as fewer homologs are shared between the genomes, it is more surprising to find them clustered together.
[Plot parameters: n1 = n2 = n3 = s, n123 + s = 5000, r = 100]

How much more does a gene shared by all three regions contribute to significance?

Which cluster is less likely to occur by chance, when genes are arranged randomly?

How do retained duplicates after WGD affect cluster significance?

Following a WGD, in many cases there is no immediate selective advantage for retaining a gene in duplicate, so one of the duplicates is often lost. Therefore, paralogous regions may share few paralogous genes.
Thus, these duplicated regions are often detected by comparison to a related pre-duplication genome. We computed cluster probabilities for the duplication model using the following parameters: n1,1 = 3600, n1,2 = 450, and n0,1 = 500. This is consistent with a recent study of pre- and post-duplication yeast species [3], in which only 16% of duplicates were retained following WGD in S. cerevisiae.

[Figure: two example clusters spanning the windows Wpre, Wpost1, and Wpost2]
a) Wpre shares three genes with Wpost1, and three other genes with Wpost2
b) Wpre shares only two genes each with Wpost1 and Wpost2, but Wpost1 and Wpost2 share an additional gene
[Plot parameters: n1,1 = 3600, n1,2 = 450, n0,1 = 500, r = 100]

[Figure: two example clusters spanning the windows W1, W2, and W3]
a) Two genes are shared by all three windows (x123 = 2, x12 = x13 = x23 = 0)
b) Two distinct genes are shared by each pair of windows (x123 = 0, x12 = x13 = x23 = 2)
[Plot parameters: n123 = 5000, n1 = n2 = n3 = 0, r = 100]

In both cases, each pair of windows shares two genes. However, the total number of genes shared in (b) is twice as large as in (a). Nonetheless, as the figure at right shows, the scenario shown in cluster (a) is much less likely to occur by chance under the orthology model. This illustrates the importance of x123 to cluster significance.

Using these expressions, we computed cluster probabilities in Mathematica for typical genome parameters and window sizes. These simulations were used to investigate the following questions.

Are pairwise statistical tests sufficient?

The most common strategy for testing significance of multiple regions is to conduct multiple pairwise comparisons (reviewed in [2]). For example, if region W1 is significantly similar to region W2, and W2 is significantly similar to region W3, then homology between all three regions is inferred, even if W1 and W3 share few or no genes. This approach allows the use of existing statistical methods, which are designed for comparing two regions.
However, the pairwise approach
• requires at least two of the three pairwise comparisons to be independently significant
• does not consider the greater impact of genes shared among all three regions.

Which cluster is less likely to occur by chance, if 84% of duplicates were lost following WGD?

The figure at right shows that the two scenarios shown above are actually quite close in significance, even though the second scenario shares fewer homologous matches. Current approaches typically compare the pre-duplication region independently with each of the post-duplication regions, and thus ignore the values of x23 and x123. These methods could fail to detect clearly significant clusters.

The expression X ≥ x is shorthand for X123 ≥ x123, X12 ≥ x12, X13 ≥ x13, and X23 ≥ x23; that is, each of the quantities (X123, X12, X13, X23) is at least as large as the observed quantity.

[Poster panel titles: Our goals; Hypothesis Testing Approach]

Gene content overlap models

The first model is designed for analyses of conserved linkage of genes in three regions from three distinct genomes. The second model is for detection of segments duplicated by a whole genome duplication (WGD), via comparison with the genome of a related, pre-duplication species. We again use a Venn diagram representation to illustrate the extent of gene content overlap among the genomes.

[Figure: Venn diagram of a cluster spanning windows W1, W2, and W3. In the cluster at left, x123 = 1, x12 = 3, x13 = 1, x23 = 1]

* Contributed equally

We compared the pairwise probabilities to our three-way probabilities for various cluster parameter values. The figure below shows that, even when x123 = 0, pairwise tests underestimate significance compared to our three-way test, which considers all three regions jointly.
[Plot parameters: n123 = 5000, n1 = n2 = n3 = 0, r = 100, x123 = 0]
For example, given a significance threshold of 0.01, the pairwise approach requires two of the three regions to share at least seven genes.
In contrast, using our three-way test, a cluster is significant when each pair of regions shares only four genes.
[Plot axis labels: x12 = x13 = x23; x12 = x13]

Our results suggest that pairwise tests are not always sufficient and that multi-region tests will be able to identify more distantly related homologous regions.

References
[1] D Durand and D Sankoff, J. Comput. Biol., 10, 2003.
[2] C Simillion et al., Bioessays, 26, 2004.
[3] KP Byrne and KH Wolfe, Genome Res., 15, 2005.