Genome Complexity and Splicing: An Evolutionary Perspective

Introduction

It has become widely appreciated over the last couple of years that the perceived biological complexity of an organism may not correlate well with the number of genes in its genome. Humans do not seem to possess any more genes than mice, and have only about twice as many as Drosophila and C. elegans. Errors in gene annotation aside, organismal complexity may instead arise from differences in regulatory programs and from the potential increase in proteome complexity due to alternative splicing (Maniatis and Tasic 2002). The differential removal and inclusion of exons and the use of alternative splice sites facilitate the generation of different protein isoforms. Numerous examples of alternative splicing are known, and a recent study (Johnson et al. 2003) estimated that 74% of human genes undergo alternative splicing.

To gain further understanding of this phenomenon, I propose to study how alternative splicing could contribute to biological complexity from the viewpoint of evolution. The focus will be on the evolution of gene structure across the phylogeny of sequenced genomes, mainly exon-creation events. In addition, I would like to explore whether the conservation patterns of splicing regulatory signals can lead to predictions about the prevalence of alternative splicing in different lineages.

Goal 1: Exon Evolutionary Dynamics across Phylogeny

Recently it was estimated that about 5% of the exons in the human genome came from Alu elements, and that such elements have high potential to form new exons de novo in intronic regions (Kreahling and Graveley 2004). Modrek and Lee (2003) found that less frequently included alternatively spliced exons tend not to be conserved between human and mouse, implying that such exons could be new to the human lineage. They proposed that new exons can evolve by starting out with weak splice signals. Consistent with this hypothesis, Sorek et al. (2004) found that most of the alternatively spliced exons in the human genome that are not conserved in mouse tend to code for proteins that may not be functional (i.e. containing early stop codons), suggesting that such exons may be in evolutionary transition. In addition, tandem exon duplication provides another means of evolving new exons (Kondrashov et al. 2001). In general, new exons increase the coding and alternative splicing potential of a gene.

A key question is whether exon-creation events are prevalent across phylogeny and, if so, whether such patterns are lineage specific. While some preliminary studies (Rogozin et al. 2003, Kondrashov and Koonin 2003) on a small number of proteins have shown evidence of exon creation within ancient introns in the eukaryote lineage (extending as far as yeast), a genome-wide study across different branches of the phylogeny is needed to understand the global dynamics of exon creation. One of the goals is to see whether the lineage leading to more complex organisms such as human is associated with accelerated exon creation.

To understand short-term evolutionary dynamics, I will look at pairs of closely related species [1], including human/chimp, human/mouse, mouse/rat, C. elegans/C. briggsae, D. melanogaster/D. simulans, and D. melanogaster/D. pseudoobscura. This will be followed by comparisons of human/D. melanogaster, human/C. elegans, D. melanogaster/C. elegans and human/yeast to understand long-term exon-creation dynamics. To detect exon-creation events for each pair of species, the following procedure will be used:

1. Infer orthologs between pairs of species using HomoloGene (Wheeler et al. 2002).
2. Extract the corresponding genomic region for each gene, concatenate all annotated exons, and align the translated sequences to look for gaps longer than N (to be determined empirically) in the alignment. Such gaps could correspond to exon gain/loss events.
   Although by definition intron insertions would create new exons, they do not introduce novel coding sequence that could confer new function. That is why I focus only on gaps at the coding level.
3. Map each gap back to genomic DNA to make sure it is flanked by introns. This also distinguishes exon creation/loss from elongation (i.e. splice-site extension of an existing exon) events.
4. If the gap corresponds to an elongation event, try to map the exon onto the genomic DNA of the other species. If it is conserved, it can be ignored, since the difference is likely due to incomplete annotation.
5. Use the closest out-group genome available to infer the ancestral state if possible.
6. Compute the rate of exon creation/loss and elongation for each species pair.

[1] The official release of the chimp genome should be coming soon.

One issue with using existing genome annotations for the above analysis is that rarely used but functional exons could be missed, due to the bias towards highly abundant transcripts in cDNA/EST libraries and incomplete tissue coverage. More importantly, such exons are more likely to be unique in organisms like human. Also, the high level of conservation between human and chimp implies that the differences in annotated gene structures may be subtle (chimp cDNA and EST data are also scarce). It is therefore important to conduct the above analysis with inferred exons as well. Genscan (Burge and Karlin 1997) is known to be substantially more accurate when used on small genomic regions. Hence it can be run on the annotated genic regions of the above genomes to detect putative exons with high coding and splicing potential. As a first pass, I will only analyze the human/chimp and human/mouse pairs. Potential exons in introns could correspond to functional exons or to exons under evolutionary transition.
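
As a rough illustration of steps 2 and 3 of the procedure above, the following Python sketch scans a pairwise protein alignment for gap runs longer than N and then checks whether the extra residues line up with exon junctions. It is only a sketch under stated assumptions: the function names, the representation of exon junctions as residue offsets, and the one-residue tolerance are all hypothetical choices, not part of the proposal.

```python
from typing import List, NamedTuple, Sequence


class Gap(NamedTuple):
    gapped_seq: str  # "a" or "b": the aligned sequence that contains the gap run
    start: int       # 0-based alignment column where the run begins
    length: int      # number of consecutive gap columns


def find_long_gaps(aligned_a: str, aligned_b: str, min_len: int) -> List[Gap]:
    """Return gap runs strictly longer than min_len (the N of step 2)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    gaps = []
    for label, seq in (("a", aligned_a), ("b", aligned_b)):
        run_start = None
        # The sentinel character closes a run that reaches the alignment end.
        for i, ch in enumerate(seq + "X"):
            if ch == "-":
                if run_start is None:
                    run_start = i
            elif run_start is not None:
                if i - run_start > min_len:
                    gaps.append(Gap(label, run_start, i - run_start))
                run_start = None
    return sorted(gaps, key=lambda g: g.start)


def ungapped_pos(aligned: str, col: int) -> int:
    """Number of residues (non-gap characters) before alignment column col."""
    return col - aligned[:col].count("-")


def classify_gap(gap: Gap, aligned_a: str, aligned_b: str,
                 exon_junctions: Sequence[int], tol: int = 1) -> str:
    """Crude stand-in for step 3 (assumed scheme, not the author's).

    exon_junctions holds residue offsets of exon boundaries in the protein
    that carries the extra residues, including 0 and the protein length.
    """
    other = aligned_b if gap.gapped_seq == "a" else aligned_a
    start = ungapped_pos(other, gap.start)
    end = ungapped_pos(other, gap.start + gap.length)

    def near_junction(pos: int) -> bool:
        return any(abs(pos - j) <= tol for j in exon_junctions)

    hits = int(near_junction(start)) + int(near_junction(end))
    # Both ends on junctions: whole exon(s); one end: extension of an exon.
    return {2: "exon_gain_or_loss", 1: "elongation", 0: "internal_indel"}[hits]
```

For example, with `a = "MKTAYIAKQR------LLGDE"` and `b = "MKTAYIAKQRPPPPPPLLGDE"`, calling `find_long_gaps(a, b, 5)` reports one gap in `a`, and `classify_gap` labels it as an exon gain/loss when both ends of the extra segment in `b` coincide with exon junctions. Distinguishing gain from loss would still require the out-group comparison of step 5.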
It would be extremely interesting to see whether the human genome contains substantially more potential coding exons in introns than chimp and mouse [2], which would suggest accelerated exon creation in the human lineage [3]. It has also been shown that exons tend to code for protein domains (Kriventseva et al. 2003), hence we can scan the potential exons using InterPro. This will give an indication of the potential increase in proteome complexity in functional terms and provide another confidence measure for potential exons. Finally, GO annotations will be used to determine the functional enrichment of genes with accelerated exon-creation events. It would be exciting to see enrichment for neuronal/CNS and membrane proteins [4], as these are more likely to be alternatively spliced and positive selection could have played an important role.

[2] Also compare the coding potential and splice signal strength between putative exons.
[3] Computational experiments to control for false positive rates when Genscan is used in introns, as well as for intron length effects, will be implemented (due to length restrictions, details are omitted here).
[4] Based on TMHMM predictions and expression data from UniGene and existing microarray studies.

Goal 2: Gauging the Extent of Alternative Splicing across Phylogeny by Conservation Pattern Analysis

In addition to the basal splicing machinery that recognizes splice sites and branch-point signals, a variety of signals near the splice sites in both introns and exons have been shown to regulate splicing (Maniatis and Tasic 2002). Combinations of such sequence motifs could control the spatial and temporal specificity of exon inclusion. While SELEX experiments and computational methods (Tacke and Manley 1999; Fairbrother et al. 2002) have provided numerous candidates, studying the evolution of such elements remains difficult.
However, an intriguing study (Sorek and Ast 2003) comparing human and mouse showed that alternatively spliced exons tend to be flanked by highly conserved intronic sequences extending for ~100 bp on average, while only a small percentage (5-fold less) of constitutively spliced exons are flanked by conserved regions (~32 bp on average). Even though it is uncertain whether such conserved elements are entirely splicing specific, it makes sense that heavily regulated exons carry more functional information in their flanks, suggesting combinatorial control by multiple enhancers and repressors. I suspect that the same might be true for other metazoans; if so, this observation can be employed to predict the extent of alternative splicing in different lineages across phylogeny [5]. That is, if an exon is flanked by long and highly conserved intronic sequences, it is more likely to be alternatively spliced (assuming that the alternative splicing machinery is relatively fixed, so this would work best over relatively short evolutionary distances). Note that such predictions are independent of cDNA/EST data.

Due to length restrictions, the details of how this will be done are omitted. The main idea is simple: given two genomes, infer orthologous genes as usual, then determine the amount (A) and length (L) of flanking sequence conservation between orthologous internal exons, and finally cluster the exons into groups based on A and L. Clusters with high A and L indicate a higher potential for alternative splicing. When comparing across different lineages, the differences in intron length need to be controlled for. For genome pairs with substantial EST/cDNA evidence, such that exons can be classified as constitutively or alternatively spliced, a supervised learning approach can also be used.

Using EST/cDNA data, Brett et al. (2002) showed that there is no detectable difference in the extent of alternative splicing between human, mouse, fly, worm and plants.
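
To make the A/L clustering idea concrete, here is a minimal Python sketch. Since the proposal omits the details, everything specific in it is an assumption made for illustration: A is taken as the fraction of identical columns in an aligned intronic flank, L as the length of the run of consecutive windows (starting at the splice site) whose identity stays above a threshold, and the grouping uses a plain two-cluster k-means as a stand-in for whatever clustering method would actually be chosen. The function names and default parameters are hypothetical.

```python
from math import dist
from typing import List, Sequence, Tuple


def flank_conservation(flank_a: str, flank_b: str,
                       window: int = 10,
                       min_identity: float = 0.8) -> Tuple[float, int]:
    """Return (A, L) for one aligned flank, ordered from the splice site outward.

    A: fraction of columns that are identical and not gaps.
    L: columns covered by consecutive, non-overlapping windows (starting at
       the splice site) whose identity is at least min_identity.
    """
    if len(flank_a) != len(flank_b):
        raise ValueError("aligned flanks must have equal length")
    match = [x == y and x != "-"
             for x, y in zip(flank_a.upper(), flank_b.upper())]
    if not match:
        return 0.0, 0
    amount = sum(match) / len(match)
    length = 0
    for start in range(0, len(match) - window + 1, window):
        if sum(match[start:start + window]) / window < min_identity:
            break
        length = start + window
    return amount, length


def two_means(points: Sequence[Tuple[float, float]],
              iterations: int = 20) -> List[int]:
    """Split a non-empty list of (A, L) points into two clusters.

    Label 1 marks the cluster seeded from the point with the highest
    scaled A + L, label 0 the cluster seeded from the lowest.
    """
    def scale(values: List[float]) -> List[float]:
        # Rescale to [0, 1] so that L (in bp) does not dominate A (a fraction).
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

    scaled = list(zip(scale([p[0] for p in points]),
                      scale([p[1] for p in points])))
    centers = [min(scaled, key=sum), max(scaled, key=sum)]
    labels = [0] * len(scaled)
    for _ in range(iterations):
        labels = [int(dist(p, centers[1]) < dist(p, centers[0]))
                  for p in scaled]
        for k in (0, 1):
            members = [p for p, lab in zip(scaled, labels) if lab == k]
            if members:
                centers[k] = tuple(sum(axis) / len(members)
                                   for axis in zip(*members))
    return labels
```

Exons that land in the cluster labelled 1 would be the candidates for alternative splicing. The sketch does not address the intron-length control mentioned above, which would have to be applied before A and L are compared across lineages.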
We can re-evaluate the conclusion of Brett et al. using the proposed method, by comparing the extent of potential alternative splicing in human/mouse, D. melanogaster/D. pseudoobscura, and C. elegans/C. briggsae. It would be very interesting to see whether the conservation pattern points to increasing splicing regulatory complexity in mammals.

[5] The first step would be to confirm this pattern in the fly and worm lineages.

References

Maniatis T. and Tasic B. (2002) Alternative pre-mRNA splicing and proteome expansion in metazoans, Nature 418:236-243
Tacke R. and Manley J. (1999) Determinants of SR protein specificity, Current Opinion in Cell Biology 11:358-362
Modrek B. and Lee CJ. (2003) Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss, Nature Genetics 34:177-180
Sorek R. and Ast G. (2003) Intronic sequences flanking alternatively spliced exons are conserved between human and mouse, Genome Research 13:1631-1637
Kondrashov FA et al. (2001) Origin of alternative splicing by tandem exon duplication, Human Mol. Genet. 10:2661-2669
Brett et al. (2002) Alternative splicing and genome complexity, Nature Genetics 30:29-30
Johnson et al. (2003) Genome-wide scan of human alternative pre-mRNA splicing with exon junction microarrays, Science 302:2141-2144
Kondrashov F. and Koonin E. (2003) Evolution of alternative splicing: deletions, insertions and origin of functional parts of proteins from intron sequences, Trends in Genetics 19:115-119
Rogozin IB. et al. (2003) Remarkable interkingdom conservation of intron positions and massive, lineage-specific intron loss and gain in eukaryotic evolution, Current Biology 13:1512-1517
Kreahling J. and Graveley BR. (2004) The origins and implications of Aluternative splicing, Trends in Genetics 20:1-4
Sorek R. et al. (2004) How prevalent is functional alternative splicing in the human genome?, Trends in Genetics 20:68-71
Fairbrother W. et al. (2002) Predictive identification of exonic splicing enhancers in human genes, Science 297:1007-1013
Burge C. and Karlin S. (1997) Prediction of complete gene structures in human genomic DNA, J. Mol. Biol. 268:78-94