BCB 444/544 F07 ISU Dobbs
#27 - Gene Prediction II (10/24/07)

Required Reading (before lecture)

Mon Oct 22 - Lecture 26: Gene Prediction
• Chp 8 - pp 97 - 112
Wed Oct 24 - Lecture 27: Gene Prediction II (will not be covered on Exam 2)
  Promoter & Regulatory Element Prediction
• Chp 9 - pp 113 - 126
Thurs Oct 25 - Review Session & Project Planning
Fri Oct 26 - EXAM 2

Assignments & Announcements

Mon Oct 22 - Study Guide for Exam 2 was posted, finally…
Mon Oct 22 - HW#4 Due (no "correct" answer to post)
Thu Oct 25 - no Lab => Optional Review Session for Exam; 544 Project Planning/Consult with DD & MT
Fri Oct 26 - Exam 2 - Will cover:
• Lectures 13-26 (thru Mon Oct 22)
• Labs 5-8
• HW# 3 & 4
• All assigned reading:
  Chps 6 (beginning with HMMs), 7-8, 12-16
  Eddy: What is an HMM
  Ginalski: Practical Lessons…

BCB 544 "Team" Projects

• 544 Extra HW#2 is next step in Team Projects
  • Write ~ 1 page outline
  • Schedule meeting with Michael & Drena to discuss topic
  • Read a few papers
  • Write a more detailed plan
• You may work alone if you prefer
• Last week of classes will be devoted to Projects
  • Written reports due: Mon Dec 3 (no class that day)
  • Oral presentations (15-20') will be: Wed-Fri Dec 5,6,7
  • 1 or 2 teams will present during each class period
See Guidelines for Projects posted online

BCB 544 Only: New Homework Assignment

544 Extra#2 (posted online Thurs?) No - sorry! Sent by email on Sat…
Due: PART 1 - ASAP; PART 2 - Fri Nov 2 by 5 PM
• Part 1 - Brief outline of Project, email to Drena & Michael
  After response/approval, then:
• Part 2 - More detailed outline of project
  Read a few papers and summarize status of problem
  Schedule meeting with Drena & Michael to discuss ideas

Seminars this Week

BCB List of URLs for Seminars related to Bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html
• Oct 25 Thur - BBMB Seminar 4:10 in 1414 MBB
  Dave Segal, UC Davis: Zinc Finger Protein Design
• Oct 19 Fri - BCB Faculty Seminar 2:10 in 102 ScI
  Guang Song, ComS, ISU: Probing functional mechanisms by structure-based modeling and simulations

Chp 8 - Gene Prediction

SECTION III GENE AND PROMOTER PREDICTION
Xiong: Chp 8 Gene Prediction
• Categories of Gene Prediction Programs
• Gene Prediction in Prokaryotes
• Gene Prediction in Eukaryotes

What is a Gene?

What is a gene? A segment of DNA, some of which is "structural," i.e., transcribed to give a functional RNA product, & some of which is "regulatory"
• Genes can encode:
  • mRNA (for protein)
  • other types of RNA (tRNA, rRNA, miRNA, etc.)
• Genes differ in eukaryotes vs prokaryotes (& archaea), in both structure & regulation

Synthesis & Processing of Eukaryotic mRNA

[Figure: gene in DNA (exons 1-3 separated by introns), 5' to 3'. Steps labeled: transcription to give the primary (1°) RNA transcript; capping (7MeG) & polyadenylation (AAAAA 3'); splicing (remove introns); mature mRNA; export to cytoplasm]

What are cDNAs & ESTs?

cDNA libraries are important for determining gene structure & studying regulation of gene expression
• Isolate RNA (always from a specific organism, region, and time point)
• Convert RNA to complementary DNA (with reverse transcriptase)
• Clone into cDNA vector
• Sequence the cDNA inserts
• Short cDNAs are called ESTs or Expressed Sequence Tags
• ESTs are strong evidence for genes
• Full-length cDNAs can be difficult to obtain
[Figure: cDNA insert cloned in a vector]

UniGene: Unique genes via ESTs

• Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
• UniGene clusters contain many ESTs
• UniGene data come from many cDNA libraries. When you look up a gene in UniGene, you can obtain information re: level & tissue distribution of expression

Gene Prediction in Prokaryotes vs Eukaryotes

Prokaryotes
• Small genomes: 0.5 - 10 × 10^6 bp
• About 90% of genome is coding
• Simple gene structure
• Prediction success ~99%
[Figure: promoter, start codon (ATG), open reading frame (ORF), stop codon (TAA)]

Eukaryotes
• Large genomes: 10^7 - 10^10 bp
• Often less than 2% coding
• Complicated gene structure (splicing, long introns)
• Prediction success 50-95%
[Figure: promoter, 5' UTR, ATG, exons, introns, splice sites, TAA, 3' UTR]

Gene Prediction - The Problem

Problem: Given a new genomic DNA sequence, identify coding regions and their predicted RNA and protein sequences

ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT

Prediction is Easier in Microbial Genomes

Why?
• Smaller genomes
• Simpler gene structures
• Many more sequenced genomes! (for comparative approaches)
Many microbial genomes have been fully sequenced & whole-genome "gene structure" and "gene function" annotations are available
e.g., GeneMark.hmm, Glimmer
TIGR Comprehensive Microbial Resource (CMR)
NCBI Microbial Genomes

Computational Gene Prediction: Approaches

• Ab initio methods
  • Search by signal: find DNA sequences involved in gene expression
  • Search by content: test statistical properties distinguishing coding from non-coding DNA
• Similarity-based methods
  • Database search: exploit similarity to proteins, ESTs, cDNAs
  • Comparative genomics: exploit aligned genomes
    • Do other organisms have similar sequence?
• Hybrid methods - best

Computational Gene Prediction: Algorithms

1. Neural Networks (NNs) (more on these later…)
   e.g., GRAIL
2. Linear discriminant analysis (LDA) (see text)
   e.g., FGENES, MZEF
3. Markov Models (MMs) & Hidden Markov Models (HMMs)
   e.g., GeneSeqer - uses MMs
   GENSCAN - uses 5th order HMMs (see text)
   HMMgene - uses conditional maximum likelihood (see text)

Signals Search

Approach: Build models (PSSMs, profiles, HMMs, …) and search against DNA. Detected instances provide evidence for genes

Gene Prediction Strategies

What sequence signals can be used?
• Transcription: TF binding sites, promoter, initiation site, terminator, GC islands, etc.
• Processing signals: splice donor/acceptors, polyA signal
• Translation: start (AUG = Met) & stop (UGA, UAA, UAG) codons, ORFs, codon usage
What other types of information can be used?
• Homology (sequence comparison, BLAST)
• cDNAs & ESTs (experimental data, pairwise alignment)

Content Search

Observation: Encoding a protein affects statistical properties of DNA sequence:
• Nucleotide composition
• Hexamer frequency
• GC content (CpG islands, exon/intron)
• Uneven usage of synonymous codons (codon bias)
Method: Evaluate these differences (coding statistics) to differentiate between coding and non-coding regions

DNA Signals Used in Gene Prediction

1. Exploit the regular gene structure
   ATG-Exon1-Intron1-Exon2-…-ExonN-STOP
2. Recognize "coding bias"
   CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA-…
3. Recognize splice sites
   Intron-cAGt-Exon-gGTgag-Intron
4. Model the duration of regions
   Introns tend to be much longer than exons, in mammals
   Exons are biased to have a given minimum length
5. Use cross-species comparison
   Gene structure is conserved in mammals
   Exons are more similar (~85%) than introns

Predicting Genes based on Codon Usage Differences

Algorithm: Process sliding window
• Use codon frequencies to compute probability of coding versus non-coding
• Plot log-likelihood ratio:
  $$\log\left(\frac{P(S \mid \text{coding})}{P(S \mid \text{non-coding})}\right)$$
[Figure: Coding Profile of β-globin gene, with exons marked]

Human Codon Usage

[Figure: human codon usage table]

Similarity-Based Methods: Database Search & Comparative Genomics

Database search
In different genomes: Translate DNA into all 6 reading frames and search against proteins (TBLASTX, BLASTX, etc.)
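The six-frame translation step mentioned above can be sketched in a few lines. This is a minimal illustration, not what BLASTX or TBLASTX do internally: it assumes the standard genetic code, uppercase or lowercase A/C/G/T input, and drops incomplete trailing codons. The function names (`reverse_complement`, `translate`, `six_frame_translation`) are hypothetical helpers introduced here; the example sequence is the one shown on the "Gene Prediction - The Problem" slide.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = stop codon)
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons starting at the first base.

    Codons containing letters other than A/C/G/T are reported as 'X'.
    """
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def six_frame_translation(dna):
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


# Example sequence from the "Gene Prediction - The Problem" slide
for frame, protein in six_frame_translation("ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT").items():
    print(frame, protein)
```

Each of the six protein strings would then be compared against a protein database; stretches between stop codons (`*`) are the candidate ORFs.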
Comparative genomics
Idea: Functional regions are more conserved than non-functional ones; high similarity in an alignment indicates a gene

  human  GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
  mouse  C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-

Advantages:
• May find uncharacterized or RNA genes
Problems:
• Finding suitable evolutionary distance
• Finding limits of high similarity (functional regions)

Database search (continued)
[Figure: double-stranded DNA, ATTGCGTAGGGCGCT / TAACGCATCCCGCGA]
Within same genome: Search with EST/cDNA database (EST2genome, BLAT, etc.)
Problems:
• Will not find "new" or RNA genes (non-coding genes)
• Limits of similarity are hard to define
• Small exons might be overlooked

Human-Mouse Homology

Comparison of 1196 orthologous genes
• Sequence identity between genes in human vs mouse
  Exons: 84.6%
  Protein: 85.4%
  Introns: 35%
  5' UTRs: 67%
  3' UTRs: 69%

Gene Prediction Flowchart

[Figure: Fig 5.15, Baxevanis & Ouellette 2005]

Predicting Genes - Basic steps:

• Obtain genomic sequence
• BLAST it!
  • Perform database similarity search (with EST & cDNA databases, if available)
• Use Gene Prediction software to locate genes
• Compare results obtained using different programs
• Analyze regulatory sequences, too
• Refine gene prediction

Predicting Genes - a few Details:

1. 1st, mask to "remove" repetitive elements (ALUs, etc.)
2. Perform database search on translated DNA
   • Translate in all 6 reading frames (i.e., "6-frame translation")
   • Compare with protein sequence databases (BlastX, TFasta)
3. Use several programs to predict genes & find ORFs (GENSCAN, GeneSeqer, GeneMark.hmm, GRAIL)
4. Search for functional motifs in translated ORFs & in neighboring DNA sequences (InterPro, Transfac)
5. Repeat

Thanks to Volker Brendel, ISU for the following Figs & Slides

Slightly modified from: BBSI Genome Informatics Module
http://www.bioinformatics.iastate.edu/BBSI/course_desc_2005.html#moduleB
V Brendel vbrendel@iastate.edu

GeneSeqer

[Figure (Brendel 2005), labels: Genomic Sequence; EST or protein database; Fast Search (Suffix Array/Suffix Tree); Spliced Alignment; Assembly; Output]
Brendel et al (2004) Bioinformatics 20: 1157

Signals: Pre-mRNA Splicing

[Figure (Brendel 2005): genomic DNA with start and stop codons; transcription gives pre-mRNA (Cap- ... -Poly(A)); splicing gives mRNA (Cap- ... -Poly(A)); translation gives protein. Exon and intron are marked, with splice sites: donor (GT) and acceptor]

GeneSeqer - Brendel et al. - ISU

http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
Brendel et al (2004) Bioinformatics 20: 1157
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/20/7/1157

Spliced Alignment Algorithm
• Perform pairwise alignment with large gaps in one sequence (due to introns)
  • Align genomic DNA with cDNA, ESTs, protein sequences
• Score semi-conserved sequences at splice junctions
  • Using Bayesian probability model & 1st order MM
• Score coding constraints in translated exons
  • Using Bayesian model
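The core idea of spliced alignment, pairwise alignment in which the genomic sequence may contain long intron gaps, can be illustrated with a toy dynamic program. This is a simplified sketch and not the GeneSeqer algorithm: it assumes every cDNA base pairs with exactly one genomic base (no indels inside exons), charges a fixed `intron_penalty` instead of Bayesian splice-site scores, and only allows canonical GT...AG introns of at least `min_intron` bases. The function name, parameters, and example sequences are all hypothetical.

```python
def spliced_align(genome, cdna, match=1, mismatch=-1, intron_penalty=2, min_intron=4):
    """Toy spliced alignment of a cDNA against genomic DNA.

    Returns (score, exons), where exons are 0-based half-open genomic
    intervals. Assumes 1 <= len(cdna) <= len(genome).
    """
    n, m = len(cdna), len(genome)
    neg = float("-inf")
    # score[i][j]: best score with cdna[i-1] paired to genome[j-1]
    score = [[neg] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if cdna[i - 1] == genome[j - 1] else mismatch
            if i == 1:
                best, prev = 0, None  # the alignment may start anywhere in the genome
            else:
                best, prev = score[i - 1][j - 1], j - 1  # extend the current exon
                if j >= 3 and genome[j - 3:j - 1] == "AG":
                    # try closing an intron genome[k:j-1] that starts with GT
                    for k in range(1, j - min_intron):
                        cand = score[i - 1][k] - intron_penalty
                        if genome[k:k + 2] == "GT" and cand > best:
                            best, prev = cand, k
            score[i][j] = best + s
            back[i][j] = prev
    end = max(range(1, m + 1), key=lambda j: score[n][j])
    if score[n][end] == neg:
        return neg, []
    # Trace back the genomic position paired with each cDNA base
    positions, i, j = [], n, end
    while i >= 1:
        positions.append(j - 1)
        j = back[i][j]
        i -= 1
    positions.reverse()
    # Merge consecutive positions into exon intervals
    exons, start = [], positions[0]
    for a, b in zip(positions, positions[1:]):
        if b != a + 1:
            exons.append((start, a + 1))
            start = b
    exons.append((start, positions[-1] + 1))
    return score[n][end], exons


# Hypothetical example: two exons separated by a GT...AG intron
genome = "CC" + "ATGGCC" + "GTAAGTTTTCAG" + "TTCGAA" + "TT"
cdna = "ATGGCC" + "TTCGAA"
print(spliced_align(genome, cdna))
```

On this example the best alignment matches all 12 cDNA bases and pays for one intron, so the reported exons are the genomic intervals (2, 8) and (20, 26). A realistic tool would replace the fixed intron penalty with position-specific donor and acceptor scores, as described on the slide.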
Brendel - Spliced Alignment I & II

Compare genomic DNA with cDNA or EST probes, and with protein probes
[Figures (Brendel 2005): genomic DNA with start and stop codons aligned against (a) an mRNA probe (Cap-, 5'-UTR, 3'-UTR, -Poly(A)) and (b) a protein probe; introns are bounded by splice sites: donor site (GT) and acceptor site (AG)]

Splice Site Detection

Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal? YES
• Information Content I_i:
  $$I_i = 2 + \sum_{B \in \{U,C,A,G\}} f_{iB} \log_2(f_{iB})$$
• Extent of Splice Signal Window:
  $$I_i \ge \bar{I} + 1.96\,\sigma_{\bar{I}}$$
  i: ith position in sequence
  Ī: avg information content over all positions >20 nt from splice site
  σ_Ī: avg sample standard deviation of Ī
(Brendel 2005)

Information Content vs Position

[Figure (Brendel 2005): information content (0.0 to 0.8) vs position (-50 to +50) around the splice site; panels: Human T2_GT, Human T2_AG]
Which sequences are exons & which are introns? How can you tell?

Donor (GT) & Acceptor (AG) Sites Used for Model Training

Number of True Splice Sites / Phase (Brendel 2005)

Species               Type   Phase 1   Phase 2   Phase 3
Homo sapiens          GT        6586      5277      3037
                      AG        6555      5194      2979
Mus musculus          GT        1212      1185       521
                      AG        1194      1139       504
Rattus norvegicus     GT         450       408       147
                      AG         442       386       140
Gallus gallus         GT         288       238       107
                      AG         284       228       103
Drosophila            GT         989       670       524
                      AG        1001       671       536
C. elegans            GT       37029     20500     20789
                      AG       36864     20325     20626
S. pombe              GT         170       118       119
                      AG         179       122       118
Aspergillus           GT         221       176       157
                      AG         217       172       163
Arabidopsis thaliana  GT       23019      9297      8653
                      AG       22929      9247      8611
Zea mays              GT         316       107        88
                      AG         311       104        83

Markov Model for Spliced Alignment

[Figure (Brendel 2005): state diagram with exon states e_n, e_n+1 and intron states i_n, i_n+1; transition labels: P_ΔG, (1-P_ΔG)(1-P_D(n+1)), (1-P_ΔG)P_D(n+1), 1-P_A(n), P_A(n)P_ΔG]

Evaluation of Predictions

                    Actual True    Actual False
Predicted True      TP             FP              PP = TP + FP
Predicted False     FN             TN              PN = FN + TN
                    AP = TP + FN   AN = FP + TN

• Misclassification rates: $$\alpha = \frac{FN}{AP}, \qquad \beta = \frac{FP}{AN}$$
• Sensitivity (coverage, recall): $$S_n = \frac{TP}{AP} = 1 - \alpha$$
• Specificity: $$S_p = \frac{TP}{PP} = \frac{1-\alpha}{1-\alpha+r\beta}, \qquad r = \frac{AN}{AP}$$
• Normalized specificity: $$\sigma = \frac{1-\alpha}{1-\alpha+\beta}$$
Do not memorize this!

Evaluation of Predictions - in English

• Sensitivity: S_n = TP / AP
  In English? Sensitivity is the fraction of all positive instances having a true positive prediction.
  IMPORTANT: Sensitivity alone does not tell us much about performance because a 100% sensitivity can be trivially achieved by labeling all test cases positive!
• Specificity: S_p = TP / PP
  In English? Specificity is the fraction of all predicted positives that are, in fact, true positives.
  IMPORTANT: in medical jargon, Specificity is sometimes defined differently (what we define here as "Specificity" is sometimes referred to as "Positive predictive value")

Best Measures for Comparison?

• ROC curves (Receiver Operating Characteristic (?!!))
  http://en.wikipedia.org/wiki/Roc_curve
  In signal detection theory, a receiver operating characteristic (ROC), or ROC curve, is a plot of sensitivity vs (1 - specificity) for a binary classifier system as its discrimination threshold is varied. The ROC can also be represented equivalently by plotting fraction of true positives (TPR = true positive rate) vs fraction of false positives (FPR = false positive rate)
• Correlation Coefficient: Matthews correlation coefficient (MCC)
  MCC = 1 for a perfect prediction
  MCC = 0 for a completely random assignment
  MCC = -1 for a "perfectly incorrect" prediction
Do not memorize this!

GeneSeqer Performance?

[Figure (Brendel 2005): Sn and σ (0.00 to 1.00) plotted against the Bayes factor threshold (-10 to 20); panels: Human GT site, Human AG site, A. thaliana GT site, A. thaliana AG site]
• Plots such as these (& ROCs) are much better than using a "single number" to compare different methods
• Such plots illustrate the trade-off: Sn vs Sp
Note: the above are not ROC curves (plots of Sn vs 1-Sp)

GeneSeqer Results on Different Genomes

(Brendel 2005)

                            Test Site Set     Bayes     Sn      σ      Sp
Species       Model  Site   True    False     Factor   (%)    (%)    (%)
Homo sapiens  2C     GT      921    44411       0      98.5   90.5   16.4
                                                3      91.7   96.3   34.8
                                                6      66.3   98.5   57.6
                     AG      920    65103       0      96.3   88.4    9.7
                                                3      90.3   92.9   15.7
                                                6      76.1   96.1   25.6
Drosophila    2C     GT      329    11501       0      95.4   94.8   34.1
                                                3      90.0   97.6   53.6
                                                6      83.9   99.1   75.0
                     AG      329    14920       0      95.7   94.8   28.7
                                                3      92.1   97.0   41.4
                                                6      85.1   98.5   59.4
C. elegans    7C     GT      400     7460       0      97.8   92.7   40.4
                                                3      94.2   97.1   64.3
                                                6      84.8   99.1   85.4
                     AG      400    10132       0      98.8   97.2   58.2
                                                3      96.2   98.8   76.9
                                                6      90.2   99.5   88.5
A. thaliana   7C     GT      613     9027       0      99.5   93.2   48.1
                                                3      95.6   97.6   73.2
                                                6      87.1   99.3   91.0
                     AG      614    10196       0      99.2   92.3   41.9
                                                3      96.4   96.4   62.0
                                                6      87.1   98.6   81.2

Performance of GeneSeqer vs Others?

• Comparison with ab initio gene prediction: vs GENSCAN, an HMM-based ab initio method
• "Winner" depends on:
  • Availability of ESTs
  • Level of similarity to protein homologs
(Brendel 2005)

GeneSeqer vs GENSCAN (Exon prediction)

[Figure (Brendel 2005): Exon (Sn + Sp) / 2, from 0.00 to 1.00, vs target protein alignment score (0 to 100) for GeneSeqer, NAP, and GENSCAN]
GENSCAN - Burge, MIT

GeneSeqer vs GENSCAN (Intron prediction)

[Figure (Brendel 2005): Intron (Sn + Sp) / 2, from 0.00 to 1.00, vs target protein alignment score (0 to 100) for GeneSeqer, NAP, and GENSCAN]
GENSCAN - Burge, MIT

GeneSeqer: Input

http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
[Screenshot (Brendel 2005)]

GeneSeqer: Output

[Screenshot (Brendel 2005)]

GeneSeqer: Gene Evidence Summary

[Screenshot (Brendel 2005)]

Gene Prediction - Problems & Status?

Common errors?
• False positive intergenic regions:
  • 2 annotated genes actually correspond to a single gene
• False negative intergenic region:
  • One annotated gene structure actually contains 2 genes
• False negative gene prediction:
  • Missing gene (no annotation)
• Other:
  • Partially incorrect gene annotation
  • Missing annotation of alternative transcripts
Current status?
• For ab initio prediction in eukaryotes: HMMs have better overall performance for detecting intron/exon boundaries
  • Limitation? Training data: predictions are organism specific
• Combined ab initio/homology based predictions: improved accuracy
  • Limitation? Availability of identifiable sequence homologs in databases
(Brendel 2005)

Recommended Gene Prediction Software

• Ab initio
  • GENSCAN: http://genes.mit.edu/GENSCAN.html
  • GeneMark.hmm: http://exon.gatech.edu/GeneMark/
  • others: GRAIL, FGENES, MZEF, HMMgene
• Similarity-based
  • BLAST, GenomeScan, EST2Genome, Twinscan
• Combined:
  • GeneSeqer, ROSETTA
• Consensus: because results depend on organisms & specific task, always use more than one program!
  • Two servers that report consensus predictions
    • GeneComber
    • DIGIT

Other Gene Prediction Resources: at ISU

http://www.bioinformatics.iastate.edu/bioinformatics2go/

Other Gene Prediction Resources: GaTech, MIT, Stanford, etc.

Lists of Gene Prediction Software
http://www.bioinformaticsonline.org/links/ch_09_t_1.html
http://cmgm.stanford.edu/classes/genefind/

Current Protocols in Bioinformatics (BCB/ISU owns a copy - currently in my lab!)
Chapter 4 Finding Genes
4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations
4.2 Using MZEF To Find Internal Coding Exons
4.3 Using GENEID to Identify Genes
4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes
4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm
4.6 Eukaryotic Gene Prediction Using GeneMark.hmm
4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome
4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences
4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation
4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences
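As a closing illustration of the advice to always use more than one program, the sketch below combines per-nucleotide coding calls from several predictors by majority vote and scores the result with the lecture's measures (Sn = TP/AP, Sp = TP/PP) plus the standard Matthews correlation coefficient. It is a minimal sketch under stated assumptions: the 0/1 label vectors are made up for illustration, `majority_vote` and `evaluate` are hypothetical helper names, and this is not how GeneComber or DIGIT compute their consensus.

```python
from math import sqrt


def majority_vote(predictions):
    """Combine equal-length 0/1 label lists (1 = coding), one per program."""
    n = len(predictions)
    return [1 if 2 * sum(column) > n else 0 for column in zip(*predictions)]


def evaluate(predicted, actual):
    """Return (Sn, Sp, MCC) using the lecture's definitions.

    Sp here is TP/PP (positive predictive value), as in the lecture.
    Assumes every denominator is non-zero.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sn, sp, mcc


# Made-up per-nucleotide annotations (1 = coding) for a 10-nt region
actual = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
program_a = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
program_b = [0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
program_c = [0, 0, 0, 1, 1, 1, 0, 0, 1, 0]

consensus = majority_vote([program_a, program_b, program_c])
print("program A:", evaluate(program_a, actual))
print("consensus:", evaluate(consensus, actual))
```

In this contrived case each program makes different mistakes, so the vote recovers the true annotation. Real predictors often share biases, so a consensus is usually an improvement rather than a guarantee.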