Stephen M. Mount
Cell Biology and Molecular Genetics
H. J. Patterson Hall
University of Maryland
College Park, MD 20742-5815
Phone: 301-405-6934
FAX: 301-314-9081
Permanent email address: sm193@umail.umd.edu

This is Steve Mount's web page for gene annotation and splice site selection. Much of the material here is relevant to a review article in the American Journal of Human Genetics.

Annotation

Gene annotation incorporates cDNA data (including ESTs), sequence similarity, and computational predictions based on the recognition of probable splice sites and coding regions. The state of the art was recently surveyed by the Gene Annotation Assessment Project (GASP1), the results of which were published in a special issue of Genome Research.

Ensembl - Ensembl is a joint project between EMBL-EBI and the Sanger Centre to develop a software system which produces and maintains automatic annotation on eukaryotic genomes.
Gene Ontology Consortium - The goal of the Gene Ontology Consortium is to produce a dynamic controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing.
Oak Ridge National Laboratory Computational Biosciences Section - A project whose stated mission is to address fundamental questions in the life sciences and provide information and analytical resources to the wider biology research community.
TIGR Databases - The Institute for Genomic Research (TIGR) Databases are a collection of curated databases containing DNA and protein sequence, gene expression, cellular role, protein family, and taxonomic data for microbes, plants and humans.
Celera - Celera Genomics, a division of PE Corporation, which produced the fruit fly and human genomes. They are currently working on the mouse, and have announced a plan to move into proteomics.

Genefinding

Many genefinding servers are available, and the following list is not complete.

GENSCAN - GENSCAN is a genefinder developed by Chris Burge.
GlimmerM - GlimmerM is a genefinder developed specifically for small eukaryotes with a gene density of around 20%.
Genie - Genie is a gene identification tool developed at the University of California, Santa Cruz, that uses Hidden Markov Models to find genes.
FGENES - This genefinder is available through the Sanger Centre's Computational Genomics Group.
GRAIL - The Gene Recognition and Assembly Internet Link is available through the Oak Ridge National Laboratory Computational Biosciences Section.
HMMGene - This genefinder is available through the Center for Biological Sequence Analysis at the Department of Biotechnology, The Technical University of Denmark.

Splice site prediction

Again, there are other sites, but the following sites are known to me.

Splice Site Prediction by Neural Network - Hosted by the Berkeley Drosophila Genome Project and written by Martin Reese.
NetGene and NetPlantGene - Both of these are available through the Center for Biological Sequence Analysis at the Department of Biotechnology, The Technical University of Denmark.

cDNA alignment

SIM4 - SIM4 is described by Florea et al.
The Intronerator - A collection of tools for exploring the molecular biology and genomics of C. elegans with a special emphasis on alternative splicing. This is specific to C. elegans, and does more than align cDNAs.

Alternative Splicing

PALS - Putative Alternative Splicing database. Searchable, limited to mouse and human.
ASDB - Alternative Splicing Database, based on GenBank entries.
HASDB - Human Alternative Splicing Database (Chris Lee, UCLA). 6201 alternative splice relationships in human genes identified through a genome-wide analysis of expressed sequence tags (ESTs).
Splicing anomalies in Arabidopsis - Sorted into categories that include alternative splicing, based on full-length cDNA sequences.

Splice Site Consensus

It is well-established that nearly all splice sites conform to consensus sequences.
These consensus sequences include nearly invariant dinucleotides at each end of the intron (GT at the 5' end and AG at the 3' end), and generally resemble MAG|GTRAGT at the 5' splice site and CAG|G at the 3' splice site.

The most common class of nonconsensus splice sites consists of 5' splice sites with a GC dinucleotide (Wu and Krainer 1999). GC sites conform extremely well to the standard consensus sequences at other positions: 42 of 44 sites have a consensus G residue at both position -1 and position 5. It is reasonable to assume that GC sites are recognized by the standard (U2-dependent) spliceosome.

The second class of exceptions to the splice site consensus consists of U12 introns, a minor class of rare introns with splice site sequences that are very different from the standard consensus, but which are very similar to each other (reviewed by Burge et al. 1999 and Tarn and Steitz 1997). U12 introns can be identified by highly conserved sequences at the 5' splice site (RTATCCTY; R = A or G; Y = C or T) and at the branch site (TCCTRAY). U12 introns are found in many eukaryotes, including Drosophila melanogaster and Arabidopsis, but not C. elegans.

Finally, there are a small number of nonconsensus sites that fit into neither of the two categories mentioned above. Many reports of such variant splice sites can be traced to errors in annotation or interpretation, polymorphic differences between the sources of cDNA and genomic sequence, inclusion of pseudogene sequences, or failure to account for somatic mutation. However, there are many examples of sites that match the consensus very poorly, and experimental work has established that 5' splice sites do not absolutely require GT, and 3' splice sites do not absolutely require AG, to be recognized in vivo.

Microexons

A list of selected documented microexons is available.

Very small exons, or microexons, pose special problems for gene annotation.
They are difficult to recognize using computational genefinding methods, and can even confound the alignment of cDNA and genomic sequences. Furthermore, because microexons are very often the site of alternative splicing, an understanding of how they are recognized (and regulated) is key to understanding gene expression.
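
The splice site classes described under Splice Site Consensus can be illustrated with a short Python sketch. It is only a rough illustration, not a published method. It labels an intron as GT-AG, GC-AG, U12-type, or nonconsensus, using the terminal dinucleotides and the two U12 motifs quoted above. The function name classify_intron and the branch_window parameter are hypothetical. The sketch assumes that the RTATCCTY motif begins at the first nucleotide of the intron and that the branch site lies within the last 40 nucleotides; neither assumption is stated on this page.

```python
import re

# Degenerate codes used in the text: R = A or G, Y = C or T.
DEGENERATE = {"R": "[AG]", "Y": "[CT]"}


def motif_to_regex(motif):
    """Translate a motif with R/Y codes into a regular expression string."""
    return "".join(DEGENERATE.get(base, base) for base in motif)


# U12 motifs as quoted in the text.
U12_DONOR = re.compile(motif_to_regex("RTATCCTY"))
U12_BRANCH = re.compile(motif_to_regex("TCCTRAY"))


def classify_intron(intron, branch_window=40):
    """Return a rough class label for an intron given 5' to 3' as DNA.

    branch_window is an illustrative assumption: the number of nucleotides
    at the 3' end of the intron that are searched for the branch site motif.
    """
    seq = intron.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to classify")

    # Assumption: the U12 5' splice site motif starts at intron position 1.
    if U12_DONOR.match(seq) and U12_BRANCH.search(seq[-branch_window:]):
        return "U12-type"

    donor, acceptor = seq[:2], seq[-2:]
    if acceptor == "AG" and donor == "GT":
        return "GT-AG"
    if acceptor == "AG" and donor == "GC":
        return "GC-AG"
    return "nonconsensus"


if __name__ == "__main__":
    examples = {
        "standard": "GTAAGT" + "A" * 20 + "CAG",
        "GC donor": "GCAAGT" + "A" * 20 + "CAG",
        "U12-like": "ATATCCTT" + "A" * 20 + "TCCTAAC" + "A" * 9 + "C",
    }
    for name, intron in examples.items():
        print(name, classify_intron(intron))
```

A label of "nonconsensus" from a sketch like this is best treated as a prompt to recheck the annotation, since many apparent variant sites trace back to the error sources listed above.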