Splice Site Consensus

Stephen M. Mount
Cell Biology and Molecular Genetics
H. J. Patterson Hall
University of Maryland
College Park, MD 20742-5815
Phone: 301-405-6934
FAX: 301-314-9081
Permanent email address: sm193@umail.umd.edu
This is Steve Mount's web page for gene annotation and splice site selection. Much of the
material here is relevant to a review article in the American Journal of Human Genetics.
Annotation
Gene annotation incorporates cDNA data (including ESTs), sequence similarity, and
computational predictions based on the recognition of probable splice sites and coding
regions. The state of the art was recently surveyed by the Gene Annotation Assessment
Project (GASP1), the results of which were published in a special issue of Genome Research.
Ensembl - Ensembl is a joint project between EMBL-EBI and the Sanger Centre to develop a
software system which produces and maintains automatic annotation on eukaryotic genomes.
Gene Ontology Consortium - The goal of the Gene Ontology Consortium is to produce a
dynamic controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene
and protein roles in cells is accumulating and changing.
Oak Ridge National Laboratory Computational Biosciences Section - A project whose stated
mission is to address fundamental questions in the life sciences and provide information and
analytical resources to the wider biology research community.
TIGR Databases - The Institute for Genomic Research TIGR Databases are a collection of
curated databases containing DNA and protein sequence, gene expression, cellular role, protein
family, and taxonomic data for microbes, plants and humans.
Celera - Celera Genomics is a division of PE Corporation that produced the fruit fly and human
genome sequences. They are currently working on the mouse, and have announced a plan to move
into proteomics.
Genefinding
Many genefinding servers are available, and the following list is not complete.
GENSCAN - GENSCAN is a genefinder developed by Chris Burge.
GlimmerM - GlimmerM is a genefinder developed specifically for small eukaryotes with a gene
density of around 20%.
Genie - Genie is a gene identification tool developed at the University of California, Santa Cruz,
that uses Hidden Markov Models to find genes.
FGENES - This genefinder is available through the Sanger Centre's Computational Genomics
Group.
GRAIL - The Gene Recognition and Assembly Internet Link is available through the Oak Ridge
National Laboratory Computational Biosciences Section.
HMMGene - This genefinder is available through the Center for Biological Sequence Analysis at
the Department of Biotechnology, The Technical University of Denmark.
Splice site prediction
Again, there are other sites, but the following sites are known to me.
Splice Site Prediction by Neural Network - Hosted by the Berkeley Drosophila Genome Project
and written by Martin Reese.
NetGene and NetPlantGene - Both of these are available through the Center for Biological
Sequence Analysis at the Department of Biotechnology, The Technical University of Denmark.
cDNA alignment
SIM4 - SIM4 is described by Florea et al.
The Intronerator - A collection of tools for exploring the molecular biology and genomics of C.
elegans, with a special emphasis on alternative splicing. It is specific to C. elegans, and does
more than align cDNAs.
Alternative Splicing
PALS - Putative Alternative Splicing database. Searchable, limited to mouse and human.
ASDB - Alternative Splicing Database, based on GenBank entries.
HASDB - Human Alternative Splicing Database (Chris Lee, UCLA). 6201 alternative splice
relationships in human genes identified through a genome-wide analysis of expressed sequence
tags (ESTs).
Splicing anomalies in Arabidopsis - Anomalies sorted into categories that include alternative
splicing, based on full-length cDNA sequences.
Splice Site Consensus
It is well established that nearly all splice sites conform to consensus sequences. These consensus
sequences include nearly invariant dinucleotides at each end of the intron (GT at the 5' end and
AG at the 3' end) and generally resemble MAG|GTRAGT at the 5' splice site and CAG|G at the 3'
splice site (M = A or C; R = A or G; | marks the exon-intron boundary).
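As a rough illustration of what conforming to these consensus sequences means, the following Python sketch counts how many positions of a candidate site agree with the consensus strings quoted above. The function names and example sequences are made up for illustration, and a plain match count is an assumption chosen for simplicity; it is not how any of the prediction servers listed on this page score sites.

```python
# IUPAC ambiguity codes needed for the consensus strings quoted above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC",  # M = A or C
    "R": "AG",  # R = A or G
    "Y": "CT",  # Y = C or T
}

DONOR_CONSENSUS = "MAGGTRAGT"  # MAG|GTRAGT: 3 exon bases, then 6 intron bases
ACCEPTOR_CONSENSUS = "CAGG"    # CAG|G: 3 intron bases, then 1 exon base


def count_matches(site, consensus):
    """Count positions at which `site` agrees with an IUPAC consensus string."""
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    return sum(base in IUPAC[code] for base, code in zip(site.upper(), consensus))


def has_canonical_dinucleotides(intron):
    """Return True if the intron begins with GT and ends with AG."""
    intron = intron.upper()
    return intron.startswith("GT") and intron.endswith("AG")


# Hypothetical example sites: one perfect match and one poor match.
strong = count_matches("CAGGTAAGT", DONOR_CONSENSUS)
weak = count_matches("TTGGTCTGC", DONOR_CONSENSUS)
```

Here `strong` counts all nine positions as matches, while `weak` still carries the GT dinucleotide but agrees with the consensus at fewer than half of its positions.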
The most common class of nonconsensus splice sites consists of 5' splice sites with a GC
dinucleotide in place of GT (Wu and Krainer 1999). GC sites conform extremely well to the
standard consensus sequences at other positions: 42 of 44 sites have a consensus G residue at both
position -1 and position 5. It is reasonable to assume that GC sites are recognized by the standard
(U2-dependent) spliceosome.
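The positions mentioned above can be made concrete with a small Python sketch. It assumes a 5' splice site is supplied as nine bases (three exon bases followed by six intron bases), so that position -1 is the last exon base and position 5 is the fifth intron base. The function name, the labels it returns, and the example sequence are hypothetical.

```python
def describe_donor(site):
    """Summarize a 9-base 5' splice site: 3 exon bases, then 6 intron bases."""
    site = site.upper()
    if len(site) != 9:
        raise ValueError("expected 3 exon bases followed by 6 intron bases")

    dinucleotide = site[3:5]  # intron positions +1 and +2
    if dinucleotide == "GT":
        kind = "GT (standard)"
    elif dinucleotide == "GC":
        kind = "GC (nonconsensus)"
    else:
        kind = "other"

    g_at_minus_1 = site[2] == "G"  # last exon base, position -1
    g_at_plus_5 = site[7] == "G"   # intron position +5
    return kind, g_at_minus_1 and g_at_plus_5


# Hypothetical GC site that keeps the consensus G at both -1 and +5.
example = describe_donor("AAGGCAAGT")
```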
The second class of exception to splice site consensus is U12 introns, a minor class of rare introns
with splice site sequences that are very different from the standard consensus, but which are very
similar to each other (reviewed by Burge et al. 1999 and Tarn and Steitz 1997). U12 introns can be
identified by highly conserved sequences at the 5' splice site (RTATCCTY; R = A or G; Y = C or
T) and branch site (TCCTRAY). U12 introns are found in many eukaryotes, including Drosophila
melanogaster and Arabidopsis, but not C. elegans.
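A minimal Python sketch of screening an intron for the two U12 motifs given above follows. It rests on two assumptions that are not stated in the text: that the 5' splice site motif begins at the first base of the intron, and that the branch site lies within the last 40 bases of the intron (the `branch_window` parameter is a hypothetical default). The function name and test sequences are also invented for illustration, and exact motif matching like this would miss any U12 intron that deviates from the consensus.

```python
import re

# RTATCCTY and TCCTRAY written as regular expressions (R = A or G; Y = C or T).
U12_DONOR = re.compile(r"[AG]TATCCT[CT]")
U12_BRANCH = re.compile(r"TCCT[AG]A[CT]")


def looks_like_u12(intron, branch_window=40):
    """Return True if an intron carries both U12 motifs.

    The donor motif must sit at the very start of the intron; the branch
    site is searched for within the last `branch_window` bases (an assumed
    window, not a value taken from the literature).
    """
    intron = intron.upper()
    donor_ok = U12_DONOR.match(intron) is not None
    branch_ok = U12_BRANCH.search(intron[-branch_window:]) is not None
    return donor_ok and branch_ok


# Hypothetical introns: one with both U12 motifs, one with a standard 5' end.
u12_like = "ATATCCTT" + "G" * 60 + "TCCTAAC" + "TTTTTTTTTTAC"
u2_like = "GTAAGT" + "C" * 60 + "TTTTTTCAG"
```

Calling `looks_like_u12(u12_like)` returns True, while `looks_like_u12(u2_like)` returns False because the intron does not begin with the RTATCCTY motif.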
Finally, there are a small number of nonconsensus sites that fit into neither of the two categories
mentioned above. Many reports of such variant splice sites can be traced to errors in annotation or
interpretation, polymorphic differences between the sources of cDNA and genomic sequence,
inclusion of pseudogene sequences, or failure to account for somatic mutation. However, there are
many examples of sites that match the consensus very poorly, and experimental work has
established that 5' splice sites do not absolutely require GT, and 3' splice sites do not absolutely
require AG, to be recognized in vivo.
Microexons
A list of selected documented microexons is available. Very small exons,
or microexons, pose special problems for gene annotation. They are
difficult to recognize using computational genefinding methods, and can
even confound the alignment of cDNA and genomic sequences. Furthermore,
because microexons are very often the site of alternative splicing, an
understanding of how they are recognized (and regulated) is key to
understanding gene expression.