The challenge of annotating a complete eukaryotic genome: A case study in Drosophila melanogaster

Martin G. Reese (mgreese@lbl.gov)
Nomi L. Harris (nlharris@lbl.gov)
George Hartzell (hartzell@cs.berkeley.edu)
Suzanna E. Lewis (suzi@fruitfly.berkeley.edu)

Drosophila Genome Center
Department of Molecular and Cell Biology
539 Life Sciences Addition
University of California, Berkeley

Reese et al., Tutorial #3, ISMB ‘99

Abstract

Many of the technical issues involved in sequencing complete genomes are essentially solved. Technologies already exist that provide sufficient solutions for ascertaining sequencing error rates and for assembling sequence data. Currently, however, standards or rules for the annotation process are still an outstanding problem. How shall the genomes be annotated, what shall be annotated, which computational tools are most effective, how reliable are these annotations, how organism-specific do the tools have to be and ultimately how should the computational results be presented to the community? All these questions are unsolved.

This tutorial will give an overview and assessment of the current state of annotation based upon experiences gained at the Drosophila melanogaster genome project. In the tutorial we will do three things. First, we will break down the annotation process and discuss the various aspects of the problem. This will serve to clarify the term "annotation", which is often used to collectively describe a process that has a number of discrete steps. Second, with the participation of computational biologists from the community we will compare existing tools for sequence annotation. We will do this by providing a 3 megabase sequence that has already been well-characterized at our center as a testbed for evaluating other feature-finding algorithms.
This is similar to what has been done at the CASP (critical assessment of techniques for protein structure prediction) conferences (http://predictioncenter.llnl.gov) for protein structure prediction. Third, we will discuss which annotation problems are essentially solved and which problems remain.

Tutorial goals
  Review the algorithms currently used in annotation
  Assess existing methods under “field” conditions
  Identify open issues in annotation

Tutorial organization
  Definitions
  Annotation
    “Biological” issues
    “Engineering” issues
  Application of tools within an existing annotation system
  Break (20 minutes)
  Review of existing tools
  Our annotation experiment
  Conclusions and outstanding issues

What is a gene?
  Definition: An inheritable trait associated with a region of DNA that codes for a polypeptide chain or specifies an RNA molecule, which in turn has an influence on some characteristic phenotype of the organism.

What are annotations?
  Definition: Features on the genome derived through the transformation of raw genomic sequences into information by integrating computational tools, auxiliary biological data, and biological knowledge.

How does an annotation differ from a gene?
  Many annotations are the same as ‘genes’
    The annotation describes an inheritable trait associated with a region of DNA.
  But an annotation may not always correspond in this way, e.g.
    an STS, or sequence overlap
    Region of genomic DNA or RNA is not translated or transcribed

Transcription and translation

Schematic gene structure
  [Figure: a gene shown at successive levels. DNA: promoter, TSS, exons 1-3 and introns 1-2, with the ATG start, GT/AG splice sites, and TAA stop. Transcription gives the preRNA; splicing gives the mRNA (5'UTR, ORF from ATG to TAA, 3'UTR, polyA tail); translation gives the primary translation (amino acid sequence MPYCPLTW...GFL); modification (cleavage product, glycosylation site) gives the active protein (CPLTW...G).]

Sequence feature types
  Transcribed region
    mRNA, tRNA, snoRNA, snRNA, rRNA
  Structural region
    Exon, intron, 5’ UTR, 3’ UTR, ORF, cleavage product
    Mutations: insertion, deletion, substitution, inversion, translocation
  Functional or signal region
    Promoter, enhancer, DNA/RNA binding site, splice site signal, polyadenylation signal
    Protein processing: glycosylation, methylation, phosphorylation site
  Similarity
    Homolog, paralog, genomic overlap (syntenic region)
  Other feature types
    Transposable element, repetitive element
    Pseudogene
    STS, insertion site

DNA transcription unit features
  Promoter elements
    Core promoter elements
      TATA box
      Initiator (Inr)
      Downstream promoter element (DPE)
    Transcription factor (“TF”) binding sites
      CAAT boxes
      GC boxes
      SP-1 sites
      GAGA boxes
  Enhancer site(s)

mRNA features
  Exon
  Initial, internal, terminal
  Intron
  5’ splice site (“GT”), branchpoint (lariat), 3’ splice site (“AG”)
  Repeat elements
  “Kozak” rule
  5’ UTR
  Start codon (translation start site)
  UTR (untranslated regions)
  Codon usage, preference
  Control elements (e.g. splice enhancers)
  Translation regulatory elements
  RNA binding sites
  Control elements (e.g.
    splice enhancers)
  RNA binding sites (cis-acting elements)
  Initial, internal, terminal
  3’ UTR
  Stop codon
  Poly-adenylation signal and site
  RNA destabilization signal

Definitions for data modeling
  Feature: An interval or an ordered set of intervals on a sequence that describes some biological attribute and is justified by evidence.
  Sequence: A linear molecule of DNA, RNA or amino acids.
  Evidence: A computational or experimental result coming out of an analysis of a sequence.
  Annotation: A set of features.

Depth of knowledge
  Annotation
    Detailed analysis (typically biological) of single genes
  Annotated genome
    Large-scale analysis (typically computational) of entire genome
Breadth of knowledge

Annotation process overview
  Methods
  Data
  Genome Sequence
  Auxiliary Data
  Computational Tools
  Database Resources
  Annotation Systems
  Understanding of a Genome

Types of sequence data
  Chromosomal sequence
    Euchromatic
    Heterochromatic
  mRNA sequences
    Full length cDNA
    5’ EST
    3’ EST
  Protein sequences
  Insertion site flanking sequences

Auxiliary data
  Maps
    Genetic, physical, radiation hybrid map (RH), deletion, cytogenetic
  Expression data
    Tissue, stage
  Phenotypes
    Lethality, sterility

Computational annotation tools
  Gene finding
  Repeat finding
  EST/cDNA alignment
  Homology searching
    BLAST, FASTA, HMM-based methods, etc.
  Protein family searching
    PFAM, Prosite, etc.

Database resources
  Curated sequence feature data sets
    Repeat elements
    Transposons
    Non-redundant mRNA
    STSs and other sequence markers
  Genome sequence from related species
    D. melanogaster vs. D. virilis, D.
    hydei
  Genome sequence from more distant species
  Protein sequences from distant species

Biological issues in annotation
  Common
    Genes within genes
    Alternative splicing
    Alternative poly-adenylation sites
  Rare
    Translational frame shifting
    mRNA editing
    Eukaryotic operons
    Alternative initiation

Engineering issues in annotation
  What sequence to start with?
    Because features are intervals on a sequence, problems can be caused by gaps, frameshifts, and other changes to the sequence.
    How do you track these changes over time and model features that span gaps?
  When to annotate?
    Feature identification can aid in sequencing.
    It may be advisable to carry out sequencing and annotation in parallel, thus enabling them to complement one another.
  What analyses need to be run and how?
    What dependencies are there between various analysis programs?
    What parameter settings to use?

Engineering issues in annotation
  What public sequence data sets are needed?
    What are the mechanics of obtaining public sequence databases?
    Are curated data sets available or do you need to set up a means of maintaining your own (for repeats, insertions, organism of interest)?
  How do you achieve computational throughput?
    Workstation farm, or simply a big, powerful box?
    Job flow control
  What do you do with the results?
    Homogenize results into single format?
    Filter results for significance and redundancy

Engineering issues in annotation
  Interpreting the results
    Is human curation needed?
    How can you achieve consistency between curators?
    How do you design the user interface so that it is simple enough to get the task completed speedily but complex enough to deal with biology?
    How do you capture curations?
  How are annotation translations to be described?
    EC terminology
    ProSite families
    Pfam domains
    Is function distinguishable from process?
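
The data-modeling definitions given earlier (feature, evidence, annotation) and the concern that changes to the underlying sequence shift feature coordinates can be illustrated with a small sketch. This is not the BDGP data model: the class names, the half-open coordinate convention, and the helper apply_insertion are all hypothetical, chosen only to show one way a feature's intervals might be remapped, and flagged for a curator, when bases are inserted into the sequence.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed convention: half-open [start, end) intervals in sequence coordinates.
Interval = Tuple[int, int]


@dataclass
class Evidence:
    """A computational or experimental result supporting a feature."""
    source: str
    score: float


@dataclass
class Feature:
    """An ordered set of intervals on a sequence, justified by evidence."""
    kind: str
    intervals: List[Interval]
    evidence: List[Evidence] = field(default_factory=list)


def apply_insertion(feature: Feature, position: int, length: int) -> Tuple[Feature, bool]:
    """Remap a feature after `length` bases are inserted before `position`.

    Returns the remapped feature and a flag that is True when the insertion
    falls strictly inside one of the intervals, so the feature can be
    re-examined (for example, for a possible frameshift).
    """
    remapped: List[Interval] = []
    disrupted = False
    for start, end in feature.intervals:
        if position <= start:
            # Insertion is upstream: the whole interval shifts.
            remapped.append((start + length, end + length))
        elif position < end:
            # Insertion is inside the interval: it grows and is flagged.
            remapped.append((start, end + length))
            disrupted = True
        else:
            # Insertion is downstream: coordinates are unchanged.
            remapped.append((start, end))
    return Feature(feature.kind, remapped, list(feature.evidence)), disrupted


gene = Feature("exon set", [(10, 20), (30, 40)], [Evidence("sim4", 0.99)])
between_exons, flag_between = apply_insertion(gene, 25, 5)
inside_exon, flag_inside = apply_insertion(gene, 15, 3)
```

In this toy case an insertion between the two exons only shifts the second one, while an insertion inside the first exon lengthens it and sets the flag. A production system would also need to handle deletions, substitutions, and features that span gaps.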

Engineering issues in annotation
  How do you manage data?
    What is the appropriate database schema design?
    How is the database to be kept up to date? Will it be directly from programs running user interfaces and analyses or via a middleware layer?
    Is a flat file format needed and what should it be?
    What query and retrieval support is needed?
  How do you distribute data?
    For bulk downloads what is the format of the data?
    What information is best summarized in tables?
    What information requires an integrated graphical view?

Engineering issues in annotation
  How do you update the annotations?
    How frequently are they re-evaluated?
    How can re-evaluation be minimized (only subsets of the databanks, only modified sequences)?
    How can differences between old and new computational results be detected?
    Changes in computational results may need to trigger changes in curated annotations

Drosophila melanogaster
  Drosophila is the most important model organism*
  Drosophila genome:
    4 chromosomes
    180 Mb total sequence
    140 Mb euchromatic sequence
    12-14,000 genes
  * source: G.M. Rubin

Drosophila Genome Project
  Laboratories working on Drosophila sequencing:
    BDGP (Berkeley Drosophila Genome Project)
    EDGP (European Drosophila Genome Project)
    Celera Genomics Inc.
  “Complete” D. melanogaster sequence will be finished by the end of 1999
  Comprehensive database - FlyBase

Goals of the Drosophila Genome Project
  Complete genome sequence
  Structure of all transcripts
  Expression pattern of all genes
  Phenotype resulting from mutation of all ORFs
  And more...

Sequencing at the BDGP
  Genomic sequence
    P1 and BAC clones
    24Mb of completed sequence (as of July 22, 1999)
    18Mb unfinished sequence in process
    Complete tiling path in BACs
    1.5x-path draft sequencing
  ESTs and cDNAs
    80,942 ESTs finished (as of March 19, 1999)
    Over 800 full-length cDNAs

The BDGP sequence annotation process

What sequence to start with?
  Unit of sequencing at the BDGP
    Completed high-quality clone sequences
  Reassembling the genomic sequence
    Need to place clones in correct genomic positions
    Need to integrate genes that span multiple clones
    Solved by using genomic overlaps to reconstitute full genomic sequence

Which analyses need to be run?
  Similarity searches
    BLAST (Altschul et al., 1990)
      BLASTN (nucleotide databases)
      BLASTX (amino acid databases)
      TBLASTX (amino acid databases, six-frame translation)
    sim4 (Miller et al., 1998)
      Sequence alignment program for finding near-perfect matches between nucleotide sequences containing introns
  Gene predictors
    Genefinder (Green, unpublished)
    GenScan (Burge and Karlin, 1997)
    Genie (Reese et al., 1997)
  Other analyses
    tRNAscan-SE (Lowe and Eddy, 1996)

Which analyses need to be run and how?
  mRNAs
    ORFFinder (Frise, unpublished)
  Protein translations
    HMMPFAM 2.1 (Eddy 1998) against PFAM (v 2.1.1 Sonnhammer et al. 1997, Bateman et al. 1999)
    Ppsearch (Fuchs 1994) against ProSite (release 15.0) filtered with EMOTIF (Nevill-Manning et al. 1998)
    Psort II (Horton and Nakai 1997)
    ClustalW (Higgins et al. 1996)

What public sequence data sets are needed?
  Automating updates of public databases:
    Genbank, SwissProt, trEMBL, BLOCKS, dbEST, EDGP
  Curated data sets
    D.
    melanogaster genes (FlyBase)
    Transposable elements (EDGP)
    Repeat elements (EDGP)
    STSs (BDGP)

Which analyses need to be run and how?

How do you achieve computational throughput?
  BDGP computing power
    Sun Ultra 450 (3 machines, 4 processors each)
    Sun Enterprise (1 machine, 8 processors)
    Used these directly, without any system for distributed computing.
  Job flow control: the Genomic Daemon
    Automatic batch analysis of genomic clones
    Berkeley Fly Database is used for queuing system and storage of results
    Many clones can be analyzed simultaneously
    Results are processed and saved in XML format for interactive browsing

What do you do with the results?
  Berkeley Output Parser (BOP)
    Input to BOP:
      Genomic sequence
      Results of computational analyses
      Filtering preferences
    Parses results from BLAST, sim4, GeneFinder, GenScan, and tRNAscan-SE analyses
    Filters BLAST and sim4 results
      Eliminates redundant or insignificant hits
      Merges hits that represent single region of homology
    Homogenizes results into single format
    Output: sequence and filtered results in XML format

Is human curation needed?
  Not for everything
    Some features are obvious and can be identified computationally
      Known D.
      melanogaster genes are detected automatically by GeneSkimmer
      Repetitive elements
  But still for many things
    Annotating complete gene structure is still hard
    We use CloneCurator (BDGP’s Java graphical editor) for curation

Gene Skimmer
  Quick way of identifying genes in new sequence before curation
  Start with XML output from BOP
  Look for sim4 hits with known Drosophila genes
  Find gene hits with sequence identity >98%, coverage >30%
  Verify that hits represent real genes

Gene Skimmer
  URL: http://www.fruitfly.org/sequence/genomic-clones.html

CloneCurator
  Displays computational results and annotations on a genomic clone
  Interactive browsing
    Zoom/scroll
    Change cutoffs for display of results
    Analyze GC content, restriction sites, etc.
  Interactive annotation editing
    Expert “endorses” selected results
  Presents annotations to community via Web site

How do we annotate gene/protein function?
  Gene Ontology Project
    Controlled hierarchical vocabulary for multiple-genome annotations and comparisons
    Standardized vocabulary facilitates collaboration
    Good data modeling allows better database querying
    Ontology browser provides interactive search of hierarchical terms
    “GO” project (http://www.ebi.ac.uk/~ashburn/GO)

Ontology browser

Ontology browser: searching for terms

How do you distribute the data?
  Bulk downloads
    FASTA at http://www.fruitfly.org/sequence/download.html
    Curated data sets
  Tabular data
    At http://www.fruitfly.org/sequence/
    Sequenced genomic clones
    Clone contigs sorted by genomic location
    Clone contigs sorted by size
  Ribbon provides integrated graphical view of annotations on physical contigs

Ribbon
  Human curator annotates individual clones (~100Kb)
  Clones are assembled into physical contigs (regions of physical map)
  Clone annotations are merged and renumbered for display on whole physical contigs
  Ribbon is our Java display tool for displaying curated annotations on physical contigs
  Will soon be available on Web

Ribbon

How do you manage the data?
  Using Informix as our database server
  Updated via Perl dbi.pm module
  Development underway in
    Schema revisions
    GAME DTD (Genome Annotation Markup Entities)
    Perl module for annotation objects
    http://www.bioxml.org/ (Ewan Birney)

How do you maintain annotations?
  Open questions
    How frequently are annotations re-evaluated?
    How can re-evaluation be minimized (only subsets of the databanks, only modified sequences)?
    How can differences between old and new computational results be detected?
    Changes in computational results may need to trigger changes in curated annotations

Integrated annotation systems
  ACeDB
  Genotator
  Magpie
  GAIA
  TIGR

Integrated annotation systems: ACeDB
  Developed for analysis of the C.
    elegans genome
  Sophisticated database designed for storing annotations and related information
  New Java and Web-based versions available
  Written by Jean Thierry-Mieg and Richard Durbin
  http://www.sanger.ac.uk/Software/Acedb/

ACeDB

Genotator
  Back end automates sequence analysis; browser provides interactive viewing and editing of annotations
  Nomi Harris (1997), Genome Research 7(7), 754-762.
  http://www-hgc.lbl.gov/inf/annotation.html

Magpie
  Expert system based (PROLOG)
  Data collection daemon
  Data analysis and report daemon
  “Intelligent” integration of various individual feature prediction systems
  Allows human interactions
  Gaasterland and Sensen (1996), TIG, 12, 76-78.
  http://genomes.rockefeller.edu/magpie/magpie.html

GAIA
  Web-based system
  Results displayed as Java applets
  Bailey, L.C., J. Schug, S. Fischer, M. Gibson, J. Crabtree, D.B. Searls, and G.C. Overton (1998), Genome Research.
  http://daphne.humgen.upenn.edu:1024/gaia/

TIGR
  Human Gene Index
  Gene Indices for various organisms
  Databases for transcribed genes linked into external/internal genomic databases
  Internal backend analysis software
  http://www.tigr.org/tdb/tdb.html

Computational analysis tools
  Gene finding
  Repeat finding
  EST/cDNA alignment
  Homology searching
    BLAST, FASTA, HMM-based methods, etc.
  Protein family searching
    PFAM, Prosite, etc.

Gene finding: Prokaryotes vs. Eukaryotes
  Prokaryotes
    Contiguous open reading frames (ORF)
    Short intergenic sequences
    Good method: detecting large ORFs
    Complications:
      Partial sequences
      Sequencing errors
      Start codon prediction
      Overlapping genes on both strands

Gene finding: Prokaryotes vs. Eukaryotes
  Eukaryotes
    Complex gene structures (exon/introns)
      D.
      melanogaster has an average of 4 introns/gene
    Very long genes (D. melanogaster X gene 160 kb)
    Very long introns
    Many introns
    “Nested”, overlapping, and alternatively spliced genes
    5’ UTRs with non-coding exons
    Long 3’ UTRs
    Complex transcription machinery
    ORF-finding alone is not adequate

Integrated gene finding
  Assumptions
    Signals and content method sensors alone are not sufficient for predicting gene structure
    Gene structure is hierarchical
    Each component (exon, intron, splice site, etc.) can be modeled independently
  The approach
    Generate a list of candidates for each component (with scores)
    Assemble the components into a “gene model”

Integrated gene finding: Dynamic programming
  Determines the best combination of components
  Two-part problem:
    Develop an “optimal” scoring function
    Use dynamic programming to find an “optimal” alignment through scoring matrix

Integrated gene finding: Dynamic programming

Integrated gene finding: Linear and Quadratic Discriminant Analysis (LDA/QDA)
  LDA
    Deterministic calculation of thresholds
    n-class discrimination
    Example: HSPL, Solovyev et al. (1997), ISMB, 5, 294-302.
  QDA
    Can represent a great improvement over LDA
    Example: MZEF, Michael Zhang (1997), PNAS, 94, 565-568.

Integrated gene finding: Feed-forward neural networks
  Supervised learning
  Training to discriminate between several feature classes
  Computing units
  Gradient descent optimization
  Multi-layer networks
  Limitations
    Black-box predictions
    Local minima
  Example: GRAIL, Uberbacher et al. (1991), PNAS, 88, 11261-11265.
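
The dynamic-programming step described under "Integrated gene finding" (generate scored candidates, then assemble the best combination) can be illustrated with a deliberately simplified sketch. It is not the algorithm of any of the programs cited here: it treats a gene model as a chain of non-overlapping exon candidates and maximizes the sum of their scores, ignoring reading frame, splice-site compatibility, and intron length, which real gene finders must model. The names Candidate and best_chain, and the half-open coordinates, are assumptions made for illustration.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    """A scored exon candidate on half-open coordinates [start, end)."""
    start: int
    end: int
    score: float


def best_chain(candidates: List[Candidate]) -> Tuple[float, List[Candidate]]:
    """Return the highest-scoring set of mutually non-overlapping candidates."""
    cands = sorted(candidates, key=lambda c: c.end)
    ends = [c.end for c in cands]
    n = len(cands)
    best = [0.0] * (n + 1)   # best[i]: optimum using only the first i candidates
    taken = [False] * (n + 1)
    prev = [0] * (n + 1)
    for i, cand in enumerate(cands, start=1):
        # j = number of earlier candidates that end at or before cand.start
        j = bisect_right(ends, cand.start, 0, i - 1)
        with_cand = best[j] + cand.score
        if with_cand > best[i - 1]:
            best[i], taken[i], prev[i] = with_cand, True, j
        else:
            best[i] = best[i - 1]
    # Trace back through the table to recover the chosen candidates.
    chain: List[Candidate] = []
    i = n
    while i > 0:
        if taken[i]:
            chain.append(cands[i - 1])
            i = prev[i]
        else:
            i -= 1
    chain.reverse()
    return best[n], chain


exons = [
    Candidate(0, 100, 5.0),
    Candidate(50, 150, 6.0),
    Candidate(120, 200, 4.0),
    Candidate(210, 300, 3.0),
]
total, model = best_chain(exons)
```

Here the second candidate has the highest individual score but overlaps two others, so the optimal chain skips it. That is the point of the two-part problem on the slide: the scoring function and the search for the best combination are separate concerns.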

Approaches to gene finding: Hidden Markov models
  Model
    A finite model describing a probability distribution over all possible sequences of equal length
  Markov
    k-order Markov chain: current state dependent on k previous states
    The next state in a 1st-order Markov model depends on current state
  Hidden
    Hidden states generate visible symbols
  “Natural” scoring function
  (Conditional) Maximum likelihood “training”
  Assumptions
    Independence of states
    No long range correlation
  Example: HMMgene, A. Krogh (1998), In Guide to Human Genome Computing, 261-274.

Approaches to gene finding: Generalized hidden Markov models
  Each HMM state can be a probabilistic sub-model
  Complex hierarchical system
  Requires care in modeling state overlaps
  Example:
    Genie, Kulp et al. (1996), ISMB, 4, 134-142
    GenScan, Burge and Karlin (1997), JMB, 268(1), 78-94

Gene finding software
  Signal recognition
    Promoter prediction
    Splice site prediction
    Start codon prediction
    Poly-adenylation site prediction
  Coding potential
  Coding exons
  Gene structure prediction
    Spliced alignment
    LDA/QDA
    Neural networks
    HMMs and GHMMs

Promoter recognition
  PromoterScan
    Identify potential promoter regions
    Based on databases of known TF binding sites
      TFD (Ghosh (1991), TIBS, 16, 445-447)
      TRANSFAC (Heinemeyer et al. (1999), NAR, 27, 318-322)
    Prestridge (1995), JMB, 249, 923-932
    http://bimas.dcrt.nih.gov/molbio/proscan/
  MatInd and MatInspector
    Finding consensus matches to known TF binding sites
    Based on TRANSFAC
    Heinemeyer et al. (1999), NAR, 27, 318-322
    Quandt et al. (1995), NAR, 23, 4878-4884.
    http://transfac.gbf.de/TRANSFAC/

Promoter recognition (cont.)
  TSSG/TSSW
    LDA based combination of several features (TATA-box, Inr signal, upstream regions)
    Solovyev et al. (1997), ISMB, 5, 294-302.
    http://genomic.sanger.ac.uk/gf/gf.shtml
  Transcription Element Search Software
    Identify TF binding sites
    Based on TRANSFAC
    http://agave.humgen.upenn.edu/tess/index.html

Promoter recognition (cont.)
  CBS Promoter 2.0 Prediction Server
    Simulated transcription factors
    Principles common to neural networks and genetic algorithms
    Knudsen (1999), Bioinformatics 15(5), 356-361.
    http://genome.cbs.dtu.dk/services/promoter/
  CorePromoter
    Position dependent 5-tuple QDA
    Michael Zhang (1998), Genome Research, 8, 319-326.
    http://scislio.cshl.org/genefinder/CPROMOTER/

Promoter recognition (cont.)
  Neural network promoter prediction (NNPP)
    Time-delay neural network
    Combining TATA box and initiator
    Reese (1999), in preparation.
    http://www-hgc.lbl.gov/projects/promoter.html

Example: NNPP

Promoter recognition (cont.)
  Markov chain promoter finder
    Competing interpolated Markov chains for promoters, exons, introns
    Promoter model consists of five states representing the core promoter parts
    Ohler, Reese et al., Bioinformatics 15(5), 362-369.

Splice site prediction
  Nakata, 1985
    Nakata (1985), NAR, 13(14), 5327-5340.
  BCM GeneFinder
    HSPL - Prediction of splice sites in human DNA sequences
    Triplet frequencies in various functional parts of splice site regions
    Combined with codon statistics
    Solovyev et al. (1994), NAR, 22(24), 5156-5163.
    http://genomic.sanger.ac.uk/gf/gf.shtml

Splice site prediction (cont.)
  Neural Network splice site predictor (NNSPLICE)
    Multi-layered feed-forward neural network
    Modeled after Brunak et al. (1991), JMB, 220, 49-65.
    Reese et al. (1997), JCB, 4(3), 311-323.
    http://www-hgc.lbl.gov/projects/splice.html
  NetGene2
    Combination of neural networks and rule-based system
    Splice site signal neural network combined with coding potential
    Hebsgaard et al.
      (1996), NAR, 24(17), 3439-3452.
    Brunak et al. (1991), JMB, 220, 49-65.
    http://www.cbs.dtu.dk/services/NetGene2/

Splice site prediction (cont.)
  SplicePredictor
    Logitlinear models for splice site regions
    Degree of matching to the splice site consensus
    Local compositional contrast
    Brendel and Kleffe (1998), NAR, 26(20), 4748-4757.
    http://gnomic.stanford.edu/~volker/SplicePredictor.html

Start codon prediction
  NetStart
    Trained on cDNA-like sequences
    Neural network based
    Local start codon information
    Global sequence information
    Pedersen and Nielsen (1997), ISMB, 5, 226-233.
    http://www.cbs.dtu.dk/services/NetStart/

Poly-adenylation signal prediction
  BCM GeneFinder
    POLYAH - Recognition of 3'-end cleavage and polyadenylation region
    Triplet frequencies in various functional parts in polyadenylation regions
    LDA
    Solovyev et al. (1994), NAR, 22(24), 5156-5163.
    http://genomic.sanger.ac.uk/gf/gf.shtml

Prediction of coding potential
  Periodicity detection
    Coding sequences have an inherent periodicity of three
    Especially good on long coding sequences
  Auto-correlation
    Seeking the strongest response when shifted sequence is compared with original
    Michel (1986), J. Theor. Biol. 120, 223-236.
  Fourier transformation: Spectral analysis
    Detection of peak at position corresponding to 1/3 of the frequency
    Silverman and Linsker (1986), J. Theor. Biol. 118, 295-300.

Prediction of coding potential (cont.)
  Trifonov (1980;1987)
    G-notG-U periodicity
    JMB, 194, 643-652.
  Fickett (1982)
    Position asymmetry in the three codon positions
    NAR 10(17), 5303-5318.
  Staden (1984)
    Codon usage in tables
    NAR 12, 551-567.

Prediction of coding potential (cont.)
  Claverie and Bougueleret (1987)
    Hexamer frequency differentials
    NAR 14, 179-196.
  Fichant and Gautier (1987)
    Codon usage homogeneity
    CABIOS, 3(4), 287-295.
  GRAIL I (1991)
    Neural network using a shifting fixed size window
    7 sensors as input, 2 hidden layers and 1 unit as output
    Uberbacher et al. (1991), PNAS, 88(24), 11261-11265.

Prediction of coding potential (cont.)
  GeneMark (1986)
    Inhomogeneous Markov chain models
    Easily trainable (closed solution for Maximum Likelihood)
    Used extensively in prokaryotic genomes
    Borodovsky et al. (1993), Computers & Chemistry, 17, 123-133.
  Glimmer (1998)
    Interpolated Markov chains from first to eighth order
    Salzberg et al. (1998), NAR, 26(2), 544-548.
    http://www.tigr.org/softlab/glimmer/glimmer.html

Prediction of coding potential (cont.)
  Review by Fickett (1992)
    “Assessment of protein coding measures”, NAR, 20, 6441-6450.

Prediction of coding exons
  SorFind
    Detection of “spliceable” ORFs
    Hutchinson, NAR, 20(13), 3453-3462.
  BCM GeneFinder
    FEXD, FEXN, FEXA, FEXY, FEXH, HEXON
    LDA
    Solovyev et al. (1994), NAR, 22(24), 5156-5163.
    http://genomic.sanger.ac.uk/gf/gf.shtml
  GRAIL II
    Exon candidates, heuristic integration, learning with neural network
    Uberbacher et al., Genet. Eng., 16, 241-253.
    http://compbio.ornl.gov/

“Integrated” gene models: LDA/QDA
  FGene
    LDA based
    Dynamic programming for the integration of LDA output
    Solovyev et al. (1995), ISMB, 3, 367-375.
    http://genomic.sanger.ac.uk/gf/gf.shtml

“Integrated” gene models: NN
  GeneParser
    “Gene-parsing” approach
    Potential alternative splicing recognized
    Neural network and dynamic programming
    Snyder and Stormo (1995), JMB, 248, 1-18.

“Integrated” gene models: Artificial intelligence approaches
  GeneID
    Rule-based system
    Homology integration
    Guigó et al. (1992), JMB, 226, 141-157.
    http://www1.imim.es/geneid.html
  GeneID using DP
    DP to combine a set of potential exons
    Guigó et al. (1998), JCB, 5, 681-702.

“Integrated” gene models: Artificial intelligence approaches
  GenLang
    Syntactic pattern recognition system
    Formal grammar
    Tools from computational linguistics
    Dong and Searls (1994), Genomics, 23, 540-551.
    http://cbil.humgen.upenn.edu/~sdong/genlang_home.html

“Integrated” gene models: HMMs
  HMMGene
    Several genes per sequence possible
    User constraints possible
    Krogh (1997), ISMB, 5, 179-186.
    http://www.cbs.dtu.dk/services/HMMgene/
  GeneMark.hmm
    Based on GeneMark program for bacterial sequences
    Can predict frame shifts
    Trained for various organisms
    Lukashin and Borodovsky (1998), NAR, 26, 1107-1115.
    http://genemark.biology.gatech.edu/GeneMark/hmmchoice.html

“Integrated” gene models: GHMMs
  Genie
    Generalized hidden Markov model with length distribution
    Integration of multiple content and signal sensors
      Content: codon statistics, repeats, intron, intergenic, database homology hits
      Signal: promoter, start codon, splice sites, stop codon
    Dynamic programming to find optimal parse
    Several genes per sequence possible
    Kulp et al. (1996), ISMB, 4, 134-142.
    Reese et al. (1997), JCB, 4(3), 311-323.
    http://www.cse.ucsc.edu/~dkulp/cgi-bin/genie

Example: Genie

“Integrated” gene models: GHMMs
  GenScan
    Multiple content and signal models
    Semi-hidden Markov model sensors with length distribution
    Takes GC content into account (separate models)
    Several genes per sequence possible
    Burge and Karlin (1997), JMB, 268(1), 78-94.
    http://CCR-081.mit.edu/GENSCAN.html

EST/cDNA alignment for gene finding: Spliced alignments
  PROCRUSTES
    Spliced alignment algorithm
    Dynamic programming to combine a set of potential exons
    Frame conservation
    Homologous sequence needed
    Gelfand et al. (1996), PNAS, 93, 9061-9066.
    http://hto-13.usc.edu/software/procrustes/

EST/cDNA alignment
  Sim4
    Aligns cDNA to genomic sequence
    Uses local similarity
    Florea et al. (1998), Genome Research, 8, 967-974.
  GeneWise
    Dynamic programming
    Partial genes allowed
    Based on Pfam and statistical splice site models
    Birney (1999), unpublished
    http://www.sanger.ac.uk/Software/Wise2

EST/cDNA alignment (cont.)
  ACEMBLY
    Aligns ESTs to genomic sequence
    Identifies alternative splicing
    Integrated in ACeDB
    Jean Thierry-Mieg (unpublished)

Repeat finders
  Censor
    Uses database of repeat sequences
    Jurka et al. (1996), Comp. and Chem., 20(1), 119-122.
  BLAST
    Integrated masking operations
    XBLAST procedure
    Claverie (1994), In Automated DNA Sequencing and Analysis Techniques, M. D. Adams, C. Fields and J. C. Venter, eds., 267-279.
    http://www.ncbi.nlm.nih.gov/BLAST

Repeat finders (cont.)
  RepeatMasker
    Detection of interspersed repeats
    Smit and Green, unpublished results
    http://ftp.genome.washington.edu/RM/RepeatMasker.html

Homology searching
  BLAST suite
    BLASTN, BLASTX, TBLASTX, PSI-BLAST
    Altschul et al. (1990), JMB, 215, 403-410.
    http://www.ncbi.nlm.nih.gov/BLAST
  FASTA suite
    FASTA, TFASTA
    Pearson and Lipman (1988), PNAS, 85, 2444-2448.
  HMM-based searching
    SAM (UCSC group)
      http://www.cse.ucsc.edu/research/compbio/sam.html
    HMMER, Sean Eddy
      http://hmmer.wustl.edu/

Gene family searching
  BLOCKS
    http://www.blocks.fhcrc.org
  PROSITE
    http://www.expasy.ch/prosite/
  PFAM
    http://pfam.wustl.edu/
  SCOP
    http://scop.mrc-lmb.cam.ac.uk/scop/

The genome annotation experiment (GASP1)
  Genome Annotation Assessment Project (GASP1)
  Annotation of 2.9 Mb of Drosophila melanogaster genomic DNA
  Open to everybody, announced on several mailing lists
  Participants can use any analysis methods they like (gene finding programs, homology searches, by-eye assessment, combination methods, etc.) and should disclose their methods.
  “CASP” like
  12 participating groups

URL: http://www.fruitfly.org/GASP1

Goals of the experiment
  Compare and contrast various genome annotation methods
  Objective assessment of the state of the art in gene finding and functional site prediction
  Identify outstanding problems in computational methods for the annotation process

Adh contig
  2.9 Mb contiguous Drosophila sequence from the Adh region, one of the best studied genomic regions
  From chromosome 2L (34D-36A)
  Ashburner et al., (to appear in Genetics)
  222 gene annotations (as of July 22, 1999)
  375,585 bases are coding (12.95%)
  We chose the Adh region because it was thought to be typical. A representative test bed to evaluate annotation techniques.
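
The coding percentage quoted for the Adh contig is coding bases divided by contig length. The short sketch below shows how such a figure might be computed from a set of coding intervals, merging overlaps so that no base is counted twice. The function names and half-open coordinates are illustrative assumptions, and the contig length of exactly 2,900,000 bases is a rounding of the "2.9 Mb" on the slide, not a figure from the source.

```python
from typing import Iterable, Tuple


def merged_length(intervals: Iterable[Tuple[int, int]]) -> int:
    """Total bases covered by half-open [start, end) intervals, overlaps counted once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Disjoint from the current run: close it and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def coding_fraction(intervals: Iterable[Tuple[int, int]], contig_length: int) -> float:
    """Fraction of the contig covered by the given coding intervals."""
    return merged_length(intervals) / contig_length


# Toy example: two overlapping exons and one separate exon on a 100-base contig.
toy_fraction = coding_fraction([(0, 10), (5, 15), (20, 30)], 100)

# Consistency check of the slide's figures, assuming a 2,900,000-base contig.
adh_percent = round(100 * 375_585 / 2_900_000, 2)
```

Under that assumed length, 375,585 coding bases give a value that rounds to the 12.95% stated on the slide.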
Adh paper (to appear in Genetics)
URL: http://www.fruitfly.org/publications/PDF/ADH.pdf

Raw sequence: Adh.fa
[Slide shows an excerpt of the raw nucleotide sequence, beginning GAATTCCCGGTTCAATCTCGTAGAACTTGCCCTTGGTGGACAGTGGGACG...]

Drosophila data sets provided to participants
• Curated Drosophila nuclear DNA "coding sequences" (CDS)
• Curated non-redundant Drosophila genomic DNA data (275 “multi”- and 144 “single”-exon sequence entries from Genbank)
• Drosophila 5' and 3' splice sites
• Drosophila start codon sites
• Drosophila promoter sequences
• Drosophila repeat sequences
• Drosophila transposon sequences
• Drosophila cDNA sequences
• Drosophila EST sequences
• URL: http://www.fruitfly.org/GASP1/data/data.html

Timetable
• May 13, 1999 - June 30, 1999: Distribution of the sample sequence and associated data to the predictors. Collection of predictions.
• June 30, 1999 - July 31, 1999: Evaluation of the predictions by the Drosophila Genome Center.
• August 4, 1999: External expert assessment of the prediction results (HUGO meeting, EMBL)
• August 6, 1999: Tutorial #3 at the ISMB ‘99 conference in Heidelberg, Germany

Resources for assessing predictions
• 80 cDNA sequences NOT in Genbank before experiment deadline
  • Sequenced from 5 different cDNA libraries
  • 3 paralogs to other genes in the genome
  • 19 cDNAs with cloning artifacts
  • 2 apparently representing unspliced RNA
  • Multiple inserts (2 cDNAs cloned in the same vector)
  • 58 “usable” cDNAs
• 33 cDNA sequences in Genbank during experiment
• Annotations from Adh paper

Curated data sets for assessing predictions
Standard 1 (Adh.std1.gff): “conservative gene set”
• 43 gene structures (7 single- and 36 multi-coding-exon genes)
• Criteria for inclusion:
  • >=95% (most >=99%) of the cDNA aligned to genomic DNA (using sim4)
  • “GT”/“AG” splice site consensus sequences
  • Splice site score from neural net
    • 5’ splice sites: >=0.35 threshold (98% True Positive score)
    • 3’ splice sites: >=0.25 threshold (92% True Positive score)
  • Start codon and stop codon annotations from Standard 3 (derived from Adh paper)
• These 43 genes represent “typical” genes

Curated data sets for assessing predictions
Standard 2 (Adh.std2.gff)
• Superset of Standard 1
• 15 additional gene structures
• Same alignment criteria as Standard 1 but no splice site consensus requirement
• Not used in the experiment

Curated data sets for assessment
Standard 3 (Adh.std3.gff): “more complete gene set”
• 222 gene structures (39 single- and 183 multi-coding-exon genes)
• Criteria:
  • Annotated as described in Ashburner et al.
  • cDNA to genomic alignment using sim4
  • Start codons predicted by ORFFinder (Frise et al., unpublished)
  • ~182 genes have similarity to a homologous protein sequence in another organism or have a Drosophila EST hit
    • Edge verification by partial EST/cDNA alignments
    • BLASTX, TBLASTX homology results
    • PFAM alignments
    • Gene structure verification using GenScan (human)
  • 14 genes had EST/homology hits but no gene finding predictions
  • ~40 genes only have “strong” GenScan predictions

Submission format
• GFF (Durbin and Haussler, 1998, unpublished)
• http://www.sanger.ac.uk/Software/GFF/

Sample submission
# organism: Drosophila melanogaster
# std1

seqname  source  feature          start  end
Gene 1
Adh      std1    TFBS             32002  32006
Adh      std1    TATA_signal      32009  32012
Adh      std1    TSS              32033  32034
Adh      std1    prim_transcript  32034  33122
Adh      std1    exon             32034  32277
Adh      std1    start_codon      32122  32124
Adh      std1    CDS              32122  32277
Adh      std1    splice5          32277  32278
Adh      std1    splice3          32332  32333
Adh      std1    exon             32785  32830
Adh      std1    CDS              32785  32830
Adh      std1    splice5          32830  32831
Adh      std1    splice3          32825  32826
Adh      std1    CDS              32826  33003
Adh      std1    exon             32826  33122
Adh      std1    stop_codon       33001  33003
Adh      std1    polyA_signal     33090  33095
Adh      std1    polyA_site       33101  33102
Gene 2
Adh      std1    prim_transcript  38100  41973
Adh      std1    exon             38100  41973
Adh      std1    polyA_site       39620  39621
Adh      std1    polyA_signal     39685  39690
Adh      std1    stop_codon       40125  40127
Adh      std1    CDS              40125  40390
Adh      std1    start_codon      40388  40390
Adh      std1    TSS              41973  41974
Adh      std1    TATA_signal      41998  42001
Adh      std1    TFBS             42187  42193
Adh      std1    TFBS             42211  42216
[Remaining GFF columns (score, strand, frame, and group: transcript "1" / transcript "2") not shown.]

Submissions
MAGPIE Team
Credit: Terry Gaasterland, Alexander Sczyrba, Elizabeth Thomas, Gulriz Kurban, Paul Gordon, Christoph Sensen
Laboratory for Computational Genomics, Rockefeller and Institute for Marine Biosciences, Canada
Method: Automatic genome analysis system integrating Drosophila Genscan predictions, confirming exon boundaries using database searches, repeat finding (Calypso, REPuter) and gene function annotations.

Submissions (cont.)
References:
• “Multigenome MAGPIE” poster at ISMB ‘99.
• Gaasterland and Ragan (1998), J. of Microbial and Comparative Genomics, 3, 305-312.
• Gaasterland and Sensen (1996), Biochimie, 78, 302-310.
• REPuter: Kurtz and Schleiermacher (1999), Bioinformatics, 15(5), 426-427.

Submissions (cont.)
Computational Genomics Group, The Sanger Centre
Credit: Victor Solovyev, Asaf Salamov
Method: Discriminant analysis based gene prediction programs FGenes (trained for human) and FGenesH (trained for Drosophila); the output of FGenes, FGenesH and BLAST is combined using FGenesH+. Three different “threshold” annotations were submitted. The programs' running time is linear in the sequence length. Automatic, plus additional user interactive screening. Non-redundant NCBI database used for BLAST.
URL/References: http://genomic.sanger.ac.uk/gf/gf.shtml

Submissions (cont.)
Genome Annotation Group, The Sanger Centre
Credit: Ewan Birney
Method: Protein family based gene identification using Wise2 (previously GeneWise) and PFAM.
URL: http://www.sanger.ac.uk/Software/Wise2

Submissions (cont.)
Pattern Recognition, The University of Erlangen
Credit: Uwe Ohler, Georg Stemmer, Stefan Harbeck, Heinrich Niemann
Method: Promoter recognition based on interpolated Markov chains; “Genscan”-like promoter model (MCPromoter); maximal mutual information based estimation of interpolated Markov chains. Automatic. Promoter training data set from http://www.fruitfly.org/data/genesets

Submissions (cont.)
References:
• Ohler, Harbeck, Niemann, Noeth and Reese (1999), Bioinformatics, 15(5), 362-369.
• Ohler, Harbeck and Niemann (1999), Proc. EUROSPEECH, to appear.
URL: http://www5.informatik.uni-erlangen/HTML/English/Research/Promoter

Submissions (cont.)
Computational Biosciences, Oak Ridge National Laboratory
Credit: Richard J. Mural, Douglas Hyatt, Frank Larimer, Manesh Shah, Morey Parang
Method: Integrated neural network based system including gene assembly using EST and homology information (GRAILexp).
URL: http://compbio.ornl.gov/droso

Submissions (cont.)
Center for Biological Sequence Analysis, Technical University of Denmark
Credit: Anders Krogh
Method: Modular HMM incorporating database hits (proteins and ESTs/cDNAs) and other “external information” probabilistically (HMMGene); the HMM has modules for coding regions, splice sites, translation start/stop, etc. It will be a fully automated system.
Trained on Drosophila data:
• http://www.fruitfly.org/GSAC1/data/data.html
• Victor Solovyev (personal communication)

Submissions (cont.)
References:
• Krogh (1998), In S.L. Salzberg et al., eds., Computational Methods in Molecular Biology, 45-63, Elsevier.
• Krogh (1997), Gaasterland et al., eds., Proc. ISMB 97, 179-186.
• http://www.cbs.dtu.dk/krogh/refs.html
URL: http://www.cbs.dtu.dk/services/HMMgene/ (not yet for Drosophila)

Submissions (cont.)
BLOCKS group, Fred Hutchinson Cancer Research Center in Seattle, Washington
Credit: Jorja Henikoff, Steve Henikoff
Method: DNA translation in 6 frames and search against BLOCKS+ and against BLOCKS extracted from Smart3.0 (http://coot-emblheidelberg.de/SMART/) using BLIMPS; automatic post-processing to join multiple predictions from the same block. Automatic with some user interactive screening of results.

Submissions (cont.)
References:
• Henikoff, Henikoff and Pietrokovski (1999), Nucl. Acids Res., 27, 226-228.
• Henikoff and Henikoff (1994), Proc. 27th Ann. Hawaii Intl. Conf. On System Sciences, 265-274.
• Henikoff and Henikoff (1994), Genomics, 19, 97-107.
URL:
• http://blocks.fhcrc.org
• http://blocks.fhcrc.org/blocks-bin/getblock.sh?<block name>

Submissions (cont.)
Genome Informatics Team, IMIM, Barcelona, Spain
Credit: Roderic Guigó, Josep F. Abril, Enrique Blanco, Moises Burset, Genis Parra
Method: Dynamic programming based system to combine potential exon candidates modeled as a fifth order Markov model and functional sequence sites modeled as a position weight matrix (Geneid version 3). Fully automatic, very fast.
Trained on Drosophila data:
• http://www.fruitfly.org/GSAC1/data/data.html

Submissions (cont.)
References:
• Guigó et al. (1998), JCB, 5, 681-702.
URL:
• Information on training process: http://www1.imim.es/~rguigo/AnnotationExperiment/index.html
• http://www1.imim.es/geneid.html

Submissions (cont.)
Mark Borodovsky's Lab, School of Biology, Georgia Institute of Technology
Credit: Mark Borodovsky, John Besemer
Method: Markov chain models combined with HMM technology (Genemark.hmm).
URL: http://genemark.biology.gatech.edu/GeneMark/hmmchoice.html

Submissions (cont.)
Biodivision, GSF Forschungszentrum für Umwelt und Gesundheit, Neuherberg, Germany
Credit: Matthias Scherf, Andreas Klingenhoff, Thomas Werner
Method: Universal sequence classifier based on a correlated word analysis to predict initiators and promoter-associated TATA boxes (CoreInspector V1.0 beta). Sequences of 100 bp are classified at once. Trained on Eukaryotic Promoter Database (EPD version 5.9). Fully automatic, 2 seconds per 1 kb.
References: Scherf et al. (1999), in preparation.
URL: http://www.gsf.de/biodv/

Submissions (cont.)
The Department of Biomathematical Sciences, Mount Sinai School of Medicine, New York
Credit: Gary Benson
Method: Tandem repeats finder (TRF v2.02) uses a theoretical model of the similarity between adjacent copies of a pattern (patterns from 1-500 bp recognized); dynamic programming for candidate validation. Fully automatic; very fast (seconds per 1 Mb).
http://c3.biomath.mssm.edu/trf/Adh.fa.2.7.7.80.10.50.500.1.html
References: Benson (1999), Nucl. Acids Res., 27(2), 573-580.
URL: http://c3.biomath.mssm.edu/trf.html

Submissions (cont.)
Genie, UC Berkeley/UC Santa Cruz/Neomorphic Inc.
Credit: Martin G. Reese, David Kulp, Hari Tammana, David Haussler
Method: Generalized hidden Markov model with optional integration of EST hits and homology searches (Genie).
Trained on Drosophila data:
• http://www.fruitfly.org/GSAC1/data/data.html
Semi-automatic, in that the overlaps of the analyzed sequence contigs (110 kb) were manually run again with Genie to resolve conflicts. BLAST used for homology searches on the non-redundant protein database (nr).

Submissions (cont.)
References:
• Reese et al. (1997), JCB, 4(3), 311-323.
• Kulp et al. (1997), Biocomputing: Proc. of the 1997 PSB conference, 232-244.
• Kulp et al. (1996), ISMB, 4, 134-142.
URL: http://www.neomorphic.com/genie

Submission classes
Columns: Gene finding | Promoter recognition | EST/cDNA alignment | Protein similarity | Repeat | Gene function
Group                                         Program name
Mural et al. (Oak Ridge, US)                  GRAILexp
Guigó et al. (Barcelona, ES)                  GeneID
Krogh (Copenhagen, DK)                        HMMGene
Borodovsky et al. (Georgia, US)               GeneMark.hmm
Henikoff et al. (Fred Hutchinson, Seattle, US)  BLOCKS
Solovyev et al. (Sanger, UK)                  FGenes/FGenesH
[Per-program check marks (X) not shown.]

Submission classes (cont.)
Columns: Gene finding | Promoter recognition | EST/cDNA alignment | Protein similarity | Repeat | Gene function
Group                                         Program name
Gaasterland et al. (Rockefeller, US)          MAGPIE
Benson et al. (Mount Sinai, US)               TRF
Werner et al. (Munich, GER)                   CoreInspector
Reese et al. (Berkeley/Santa Cruz, US)        Genie
Ohler et al. (Nuremberg, GER)                 MCPromoter
Birney (Sanger, UK)                           Wise2
[Per-program check marks (X) not shown.]

Gene finding techniques
Columns: Statistics | Promoter | EST/cDNA alignment | Protein similarity
Group                                         Program name
Mural et al. (Oak Ridge, US)                  GRAILexp
Guigó et al. (Barcelona, ES)                  GeneID
Krogh (Copenhagen, DK)                        HMMGene
Borodovsky et al. (Georgia, US)               GeneMark.hmm
Solovyev et al. (Sanger, UK)                  FGenes/FGenesH
Gaasterland et al. (Rockefeller, US)          MAGPIE
Reese et al. (Berkeley/Santa Cruz, US)        Genie
[Per-program check marks (X) not shown.]

Measuring success
• By nucleotide
  • Sensitivity/Specificity (Sn/Sp)
• By exon
  • Sn/Sp
  • Missed exons (ME), wrong exons (WE)
• By gene
  • Sn/Sp
  • Missed genes (MG), wrong genes (WG)
• Average overlap statistics
• Based on Burset and Guigó (1996), “Evaluation of gene structure prediction programs”. Genomics, 34(3), 353-367.
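The nucleotide-level and exon-level measures listed above can be sketched in a few lines. This is a minimal illustration, not the evaluation code used in the experiment: it assumes exons are given as inclusive (start, end) pairs on a single strand, that an exon counts as a true positive only when both boundaries match exactly, that "missed" and "wrong" mean no overlap at all with the other set, and that both exon lists are non-empty. All function names and the example coordinates are hypothetical.

```python
from typing import List, Set, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) coordinates, as in GFF


def sn_sp(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Sn = TP/(TP+FN), Sp = TP/(TP+FP); 0.0 when a denominator is zero."""
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tp / (tp + fp) if (tp + fp) else 0.0
    return sn, sp


def covered(intervals: List[Interval]) -> Set[int]:
    """Set of base positions covered by a list of inclusive intervals."""
    bases: Set[int] = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def base_level(actual: List[Interval], predicted: List[Interval]) -> Tuple[float, float]:
    """Sn/Sp counted per nucleotide."""
    a, p = covered(actual), covered(predicted)
    return sn_sp(tp=len(a & p), fp=len(p - a), fn=len(a - p))


def exon_level(actual: List[Interval], predicted: List[Interval]) -> Tuple[float, float]:
    """Sn/Sp counted per exon; both boundaries must match exactly."""
    a, p = set(actual), set(predicted)
    return sn_sp(tp=len(a & p), fp=len(p - a), fn=len(a - p))


def overlaps(x: Interval, y: Interval) -> bool:
    """True if two inclusive intervals share at least one base."""
    return x[0] <= y[1] and y[0] <= x[1]


def missed_and_wrong(actual: List[Interval], predicted: List[Interval]) -> Tuple[float, float]:
    """Fraction of actual exons with no overlapping prediction (ME) and
    fraction of predicted exons with no overlapping actual exon (WE)."""
    missed = sum(1 for a in actual if not any(overlaps(a, p) for p in predicted))
    wrong = sum(1 for p in predicted if not any(overlaps(p, a) for a in actual))
    return missed / len(actual), wrong / len(predicted)


# Hypothetical example: two annotated exons, three predicted exons.
actual_exons = [(100, 199), (300, 399)]
predicted_exons = [(100, 199), (310, 399), (500, 549)]

base_sn, base_sp = base_level(actual_exons, predicted_exons)
exon_sn, exon_sp = exon_level(actual_exons, predicted_exons)
me, we = missed_and_wrong(actual_exons, predicted_exons)
print(base_sn, base_sp, exon_sn, exon_sp, me, we)
```

In this example the second prediction covers most of the second exon but misses its start by ten bases, so it scores well at the base level and counts as both a false positive and a false negative at the exon level. That difference is why exon-level figures are typically lower than base-level figures for the same prediction.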
Definitions and formulae
Sn = TP/(TP+FN)
Sp = TP/(TP+FP)
TP = True positive
FP = False positive
FN = False negative

Genes: True positives (TP)
Genes: False positives (FP)
Genes: False Negatives (FN)

Toy example 1 (1)
Standard  Prediction  TP  FP  FN  Sn   Sp
Std1      Pred1       2   1   1   2/3  2/3
Std1      Pred2       2   5   1   2/3  2/7

Genes: Missing Genes (MG)
Genes: Wrong Genes (WG)

Toy example 1 (2)
Standard  Prediction  TP  FP  FN  Sn   Sp   MG  WG
Std1      Pred1       2   1   1   2/3  2/3  1   1
Std1      Pred2       2   5   1   2/3  2/7  0   4

Genes: Std 1 versus Std 3
Std1: “conservative gene set”
Std3: “more complete gene set”

Toy example 1 (3)
Standard  Prediction  TP  FP  FN  Sn   Sp   MG  WG
Std1      Pred1       2   1   1   2/3  2/3  1   1
Std1      Pred2       2   5   1   2/3  2/7  0   4
Std3      Pred1       2   1   2   2/4  2/3  2   1
Std3      Pred2       3   4   1   3/4  3/7  0   3

Genes: Std1 and Std3 versus “real” gene structure

Toy example 1 (4)
Standard  Prediction  TP  FP  FN  Sn   Sp   MG  WG
Std1      Pred1       2   1   1   2/3  2/3  1   1
Std1      Pred2       2   5   1   2/3  2/7  0   4
Std3      Pred1       2   1   2   2/4  2/3  2   1
Std3      Pred2       3   4   1   3/4  3/7  0   3
"Real"    Pred1       3   0   1   3/4  3/3  1   0
"Real"    Pred2       3   4   1   3/4  3/7  0   3

Toy example 1 (5): Exon level
Standard  Prediction  TP  FP  FN  Sn   Sp    ME  WE
Std1      Pred1       5   2   1   5/6  5/7   1   2
Std1      Pred2       4   8   2   2/3  1/3   1   7
Std3      Pred1       5   2   2   5/7  5/7   2   2
Std3      Pred2       5   7   2   5/7  5/12  1   6
"Real"    Pred1       7   0   2   7/9  7/7   1   0
"Real"    Pred2       6   6   3   2/3  1/2   1   5

Genes: Joined genes (JG)
Genes: Split genes (SG)

Definition: “Joined” and “split” genes
JG = (# Actual genes that overlap predicted genes) / (# Predicted genes that overlap one or more actual genes)
SG = (# Predicted genes that overlap actual genes) / (# Actual genes that overlap one or more predicted genes)
JG > 1: tendency to join multiple actual genes into one prediction
SG > 1: tendency to split actual genes into separate gene predictions
Inspired by Hayes and Guigó (1999), unpublished.

Toy example 2 (1)
Standard  Prediction  TP  FP  FN  Sn   Sp   MG  WG  JG  SG
Std1      Pred1       0   2   3   0    0    1   1   2   1
Std1      Pred2       1   7   2   1/3  1/8  0   4   1   1.33

Annotation experiment results
Results available during tutorial and at http://www.fruitfly.org/GASP1/results/

Results: Base level
Program         Sn (Std1)  Sp (Std3)
Fgenes CGG1     0.89       0.77
Fgenes CGG2     0.49       0.86
Fgenes CGG3     0.93       0.60
GeneID v1       0.48       0.84
GeneID v2       0.86       0.83
GeneMark HMM    0.96       0.86
Genie           0.96       0.92
Genie EST       0.97       0.91
Genie EST HOM   0.97       0.83
HMMGene         0.97       0.91
MAGPIE          0.96       0.63
Grail exp       0.81       0.86

• Sensitivity:
  • Low variability among predictors
  • ~95% coverage of the proteome
• Specificity:
  • ~90%
  • Programs that are more like Genscan (used for original annotation) might do better?
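The joined/split definitions above can be sketched as follows. One reading of the wording is assumed here: the numerators count overlapping (actual, predicted) gene pairs, so JG is the average number of actual genes touched by each overlapping prediction and SG is the average number of predictions touching each overlapped actual gene. This reading is chosen because it reproduces the JG and SG values of Toy example 2; it is an interpretation, not the assessment code itself. The gene coordinates are hypothetical and only mimic the overlap pattern of the toy example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) span of a gene


def overlaps(x: Interval, y: Interval) -> bool:
    """True if two inclusive intervals share at least one base."""
    return x[0] <= y[1] and y[0] <= x[1]


def joined_split(actual: List[Interval], predicted: List[Interval]) -> Tuple[float, float]:
    """Return (JG, SG) under the pair-counting reading described above."""
    pairs = [(a, p) for a in actual for p in predicted if overlaps(a, p)]
    if not pairs:
        return float("nan"), float("nan")  # nothing overlaps: ratios undefined
    actual_hit = {a for a, _ in pairs}      # actual genes with >= 1 overlapping prediction
    predicted_hit = {p for _, p in pairs}   # predictions with >= 1 overlapping actual gene
    jg = len(pairs) / len(predicted_hit)
    sg = len(pairs) / len(actual_hit)
    return jg, sg


# Hypothetical coordinates for three annotated genes.
actual_genes = [(100, 200), (300, 400), (600, 700)]

# Pred1-like case: one prediction joins two genes, one prediction is wrong,
# and the third gene is missed.
pred1 = [(100, 400), (800, 900)]

# Pred2-like case: one gene is split into two predictions, two genes get one
# prediction each, and four further predictions overlap nothing.
pred2 = [(100, 200), (300, 340), (360, 400), (600, 650),
         (800, 850), (900, 950), (1000, 1050), (1100, 1150)]

jg1, sg1 = joined_split(actual_genes, pred1)
jg2, sg2 = joined_split(actual_genes, pred2)
print(jg1, sg1, jg2, round(sg2, 2))
```

Note that wrong genes and missed genes drop out of both ratios, so JG and SG describe only how overlapping predictions and annotations are partitioned, independently of WG and MG.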
Results: Exon level
Program         Sn (Std1)  Sp (Std3)  ME(%) (Std1)  WE(%) (Std3)
Fgenes CGG1     0.65       0.49       10.5          31.6
Fgenes CGG2     0.44       0.68       45.5          17.2
Fgenes CGG3     0.75       0.24       5.6           53.3
GeneID v1       0.27       0.29       54.4          47.9
GeneID v2       0.58       0.34       21.1          47.4
GeneMark HMM    0.70       0.47       8.1           28.9
Genie           0.70       0.57       8.1           17.4
Genie EST       0.77       0.55       4.8           20.1
Genie EST HOM   0.79       0.52       3.2           22.8
HMMGene         0.68       0.53       4.8           20.2
MAGPIE          0.63       0.41       12.1          50.2
Grail exp       0.42       0.41       24.3          28.7

• Higher variability among predictors
• Up to ~75% sensitivity (both exon boundaries correct)
• 55% specificity
• Low specificity because partial exon overlaps do not count
• Missing exons below 5%
• Many wrong exons (~20%)

Results: Gene level
Program         Sn (Std1)  Sp (Std3)  MG(%) (Std1)  WG(%) (Std3)  SG    JG
Fgenes CGG1     0.51       0.36       27.9          50.3          1.10  1.06
Fgenes CGG2     0.16       0.32       81.3          33.8          1.10  1.09
Fgenes CGG3     0.60       0.14       13.9          74.5          2.11  1.08
GeneID v1       0.07       0.07       81.3          85.4          1.06  1.62
GeneID v2       0.35       0.14       46.5          72.2          1.06  1.11
GeneMark HMM    0.56       0.31       20.9          53.5          1.07  1.11
Genie           0.56       0.37       18.6          39.0          1.17  1.08
Genie EST       0.65       0.38       11.6          41.8          1.15  1.09
Genie EST HOM   0.65       0.34       9.3           45.7          1.16  1.09
HMMGene         0.56       0.39       11.6          42.0          1.04  1.12
MAGPIE          0.47       0.25       27.9          67.0          1.22  1.06
Grail exp       0.33       0.21       37.2          52.0          1.23  1.08

Results: Gene level
• 60% of actual genes predicted completely correctly
• Specificity only 30-40%
• 5-10% missed genes (comparable to Sanger Centre)
• 40% wrong genes, a lot of short genes over-predicted (possibly not annotated in Standard 3)
• Splitting genes is a bigger problem than joining genes

Results (protein homology): Base level
Program           Sn (Std1)  Sp (Std3)
BLOCKS            0.04       0.80
Wise2             0.12       0.82
MAGPIE cDNA       0.02       0.55
MAGPIE EST        0.31       0.32
GRAIL Similarity  0.31       0.81

Results (protein homology): Exon level
Program           Sn (Std1)  Sp (Std3)  ME(%) (Std3)  WE(%) (Std3)
BLOCKS            0.00       0.00       86.1          13.2
Wise2             0.06       0.09       77.2          14.2
MAGPIE cDNA       0.00       0.04       98.3          25.4
MAGPIE EST        0.02       0.00       64.2          56.4
GRAIL Similarity  0.07       0.35       54.4          12.4

Results (protein homology): Gene level
Program           Sn (Std1)  Sp (Std3)  MG(%) (Std3)  WG(%) (Std3)
BLOCKS            0.00       0.00       95.3          17.5
Wise2             0.00       0.00       90.6          15.7
MAGPIE cDNA       0.00       0.00       97.6          52.6
MAGPIE EST        0.00       0.00       88.3          58.5
GRAIL Similarity  0.07       0.18       74.4          29.7

Transcription Start Site (TSS): Standard 1
TSS: Standard 3

Results: TSS recognition
Program        Likely (7.7%)  Unlikely (6.5%)  Possible (86.8%)
MAGPIE         153 (36.3%)    29 (6.8%)        239 (56.7%)
Genie          143 (61.1%)    62 (26.4%)       29 (12.3%)
MCPromoter     80 (9.2%)      170 (19.5%)      619 (71.2%)
CoreInspector  3 (13.0%)      3 (13.0%)        17 (74.0%)

Interesting gene examples: bubblegum
Adh/Adhr (Alcohol dehydrogenase/Adh related)
Adh/Adhr (cont.)
osp (outspread)
• Contains Adh and Adhr embedded in an intron
cact (cactus)
kuz (kuzbanian)
beat (beaten path)
Idgf1, Idgf2, Idgf3 (Imaginal Disc Growth Factor)
Idgf1, Idgf2, Idgf3 (cont.)
• Chitinase-related
• Gene function has changed (now a growth factor)

Conclusion of GASP1
• 95% coverage of the proteome
• Base level prediction is easier, exon level prediction is harder
• Small genes over-predicted (?)
• Long introns
• The high number of “wrong genes” indicates possible incomplete annotation in Standard 3 (Are there more genes?)
• HMMs currently seem to be the best approach
• Major improvements in multiple gene regions

Conclusion GASP1 (cont.)
• Much lower false positive rates
• Methods optimized for the organism of interest do better
• Including homology in gene finding does not always improve prediction
• Split genes are more of a problem than joined genes
• No program is perfect

Discussion GASP1
• Genes in introns
• Alternative splicing
• Genomic contamination in cDNA libraries
• Translation start prediction
• Biological verification of predictions needed
• Improve test bed by cDNA sequencing
• More regulation data needed to confirm promoter assessment
• Combining methods
• Better methods needed
• GASP 2?

Conclusions on annotating complete eukaryotic genomes
• Throughput has to improve dramatically
• Not only genes but also their relationships have to be elucidated
• Complete transcript cDNAs are a very powerful tool for annotation, including alternative transcripts
• Comparative genomics as well as expression analysis improves/completes genome annotation
• Standardization efforts needed (ontology working group, OMG, OiB, NCBI/EBI, Bioxml, etc.)
• Standards for description of gene products
• Exchange format (GFF, Genbank, EMBL, XML)

Conclusions on annotating complete eukaryotic genomes (cont.)
• Maintenance requires even more effort than the original development
• Automated methods are not good enough
• Human curators can cause problems too
• Functional assignment by homology is sometimes unreliable

Discussion on annotating complete eukaryotic genomes
• Re-annotation: updating results and annotations over time
• Genomic sequence changes (indels, point mutations)
• Analysis software changes
• New entries in public sequence databases
• Entries removed from sequence databases
• Audit trail for annotations
• Master copy of genome annotations should reside in the model organism databases where the expertise resides
• Community collaborative annotation

Acknowledgments
• Uwe Ohler (University of Erlangen, Germany)
• Gerry Rubin (UC Berkeley)
• Sima Misra (UC Berkeley)
• Erwin Frise (UC Berkeley)
• Roderic Guigó (Barcelona)
• GFF team (headed by Richard Bruskiewich, Sanger Centre)
• Assessment team: Michael Ashburner (EBI), Peer Bork (EMBL), Richard Durbin (Sanger), Roderic Guigó (Barcelona), Tim Hubbard (Sanger)
• Annotation experiment participants