BCB 444/544 Lecture 26 Gene Prediction #26_Oct22
BCB 444/544 F07 ISU Dobbs #26 - Gene Prediction 10/22/07

Required Reading (before lecture)
Mon Oct 22 - Lecture 26 Gene Prediction
• Chp 8 - pp 97 - 112
Wed Oct 24 - Lecture 27 (will not be covered on Exam 2) Regulatory Element Prediction
• Chp 9 - pp 113 - 126
Thurs Oct 25 - Review Session & Project Planning
Fri Oct 26 - EXAM 2

Assignments & Announcements
Sun Oct 21 - Study Guide for Exam 2 was posted
Mon Oct 22 - HW#4 Due (no "correct" answer to post)
Thu Oct 25 - Lab = Optional Review Session for Exam
  544 Project Planning/Consult with DD & MT
Fri Oct 26 - Exam 2 - Will cover:
• Lectures 13-26 (thru Mon Oct 22)
• Labs 5-8
• HW# 3 & 4
• All assigned reading:
  Chps 6 (beginning with HMMs), 7-8, 12-16
  Eddy: What is an HMM
  Ginalski: Practical Lessons…

BCB 544 "Team" Projects
• 544 Extra HW#2 is next step in Team Projects
  • Write ~ 1 page outline
  • Schedule meeting with Michael & Drena to discuss topic
  • Read a few papers
  • Write a more detailed plan
• You may work alone if you prefer
• Last week of classes will be devoted to Projects
• Written reports due: Mon Dec 3 (no class that day)
• Oral presentations (15-20') will be: Wed-Fri Dec 5,6,7
• 1 or 2 teams will present during each class period
See Guidelines for Projects posted online

BCB 544 Only: New Homework Assignment
544 Extra#2 (posted online Thurs?) No - sorry!
sent by email on Sat…
Due: PART 1 - ASAP
     PART 2 - Fri Nov 2 by 5 PM
Part 1 - Brief outline of Project, email to Drena & Michael
after response/approval, then:
Part 2 - More detailed outline of project
  Read a few papers and summarize status of problem
  Schedule meeting with Drena & Michael to discuss ideas

Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics: http://www.bcb.iastate.edu/seminars/index.html
• Oct 25 Thur - BBMB Seminar 4:10 in 1414 MBB
  • Dave Segal, UC Davis: Zinc Finger Protein Design
• Oct 19 Fri - BCB Faculty Seminar 2:10 in 102 ScI
  • Guang Song, ComS, ISU: Probing functional mechanisms by structure-based modeling and simulations

Chp 16 - RNA Structure Prediction
SECTION V STRUCTURAL BIOINFORMATICS
Xiong: Chp 16 RNA Structure Prediction (Terribilini)
• RNA Function
• Types of RNA Structures
• RNA Secondary Structure Prediction Methods
• Ab Initio Approach
• Comparative Approach
• Performance Evaluation

Covalent & non-covalent bonds in RNA (This is a new slide)
Primary: Covalent bonds
Secondary/Tertiary: Non-covalent bonds
• H-bonds (base-pairing)
• Base stacking
Fig 6.2 Baxevanis & Ouellette 2005

RNA Pseudoknots & Tetraloops (This is a new slide)
• Often have important regulatory or catalytic functions
Pseudoknot: http://www.lbl.gov/Science-Articles/ResearchReview/Annual-Reports/1995/images/rna.gif
Tetraloop: http://academic.brooklyn.cuny.edu/chem/zhuang/QD/mckay_hr.gif

Base Pairing in RNA (This slide has been changed)
G-C, A-U, G-U ("wobble") & many variants
See: IMB Image Library of Biological Molecules
http://www.fli-leibniz.de/ImgLibDoc/nana/IMAGE_NANA.html#basepairs

RNA Secondary Structure Prediction Methods (This slide has been changed)
Two (three, recently) main types of methods:
1. Ab initio - based on calculating most energetically favorable secondary structure(s)
   Energy minimization (thermodynamics)
2. Comparative approach - based on comparisons of multiple evolutionarily-related RNA sequences
   Sequence comparison (co-variation)
3. Combined computational & experimental
   Use experimental constraints when available

RNA Secondary structure prediction - 3 (This is a new slide)
3) Combined experimental & computational
• Experiments: Map single-stranded vs double-stranded regions in folded RNA
• How?
  Enzymes: S1 nuclease, T1 RNase
  Chemicals: kethoxal, DMS, OH
• Software: Mfold, Sfold, RNAstructure, RNAfold, RNAalifold
[Figure: folded RNA (positions 200-240 labeled) with legend: Kethoxal modification (mild, strong); DMS modification (mild, strong)]

Ab Initio Prediction: Clarifications (This slide has been changed)
• Free energy is calculated based on parameters determined in the wet lab
• Correction: Use known energy associated with each type of nearest-neighbor pair (base-stacking), not with each base-pair
• Base-pair formation is not independent: multiple base-pairs adjacent to each other are more favorable than individual base-pairs - cooperative because of base-stacking interactions
• Bulges and loops adjacent to base-pairs have a free energy penalty

Energy minimization: What are the rules? (This is a new slide)
Stack AA/UU (base pairs A=U, A=U): ΔG = -1.2 kcal/mole
Stack AU/UA (base pairs A=U, U=A): ΔG = -1.6 kcal/mole
What gives here?
C Staben 2005

Energy minimization calculations: Base-stacking is critical (This is a new slide)
Stacking free energies (kcal/mole):
  AA/UU                        -1.2    |  CG/GC            -3.0
  AU/UA or UA/AU               -1.6    |  GC/CG            -4.3
  AG,AC,CA,GA / UC,UG,GU,CU    -2.1    |  GU/UG            -0.3
  CC/GG                        -4.8    |  XG,GX / YU,UY     0
- Tinoco et al.
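A minimal Python sketch of how the stacking values in the table above combine: the free energy of a short helix is approximated as the sum of its nearest-neighbor stack terms. The function names (stack_energy, helix_energy), the dictionary layout, and the strand convention (top strand written 5'->3', bottom strand written 3'->5') are assumptions made for illustration; only the numbers come from the slide. Wobble (GU) stacks and loop or bulge penalties are left out.

```python
# Stacking free energies in kcal/mole, transcribed from the slide's table.
# Key = (top strand 5'->3', bottom strand 3'->5') for two adjacent base pairs.
# The orientation convention is an assumption, not stated on the slide.
STACK = {
    ("AA", "UU"): -1.2,
    ("AU", "UA"): -1.6, ("UA", "AU"): -1.6,
    ("AG", "UC"): -2.1, ("AC", "UG"): -2.1,
    ("CA", "GU"): -2.1, ("GA", "CU"): -2.1,
    ("CG", "GC"): -3.0,
    ("GC", "CG"): -4.3,
    ("CC", "GG"): -4.8,
}


def stack_energy(top2, bottom2):
    """Energy of one stack of two adjacent Watson-Crick base pairs."""
    if (top2, bottom2) in STACK:
        return STACK[(top2, bottom2)]
    # The same stack read from the other strand: swap strands and reverse both.
    swapped = (bottom2[::-1], top2[::-1])
    return STACK[swapped]  # raises KeyError for stacks not in the table


def helix_energy(top, bottom):
    """Sum the stack terms along a fully paired helix.

    top is written 5'->3', bottom is written 3'->5', so top[i] pairs with bottom[i].
    """
    if len(top) != len(bottom):
        raise ValueError("strands must have the same length")
    return sum(
        stack_energy(top[i:i + 2], bottom[i:i + 2])
        for i in range(len(top) - 1)
    )


# Example: helix 5'-GCAU-3' / 3'-CGUA-5' has stacks GC/CG, CA/GU, AU/UA.
print(helix_energy("GCAU", "CGUA"))
```

For the example helix the three stack terms are -4.3, -2.1 and -1.6, so the total is about -8.0 kcal/mole. Note that a helix of n base pairs contributes n-1 stack terms, which is why adjacent base pairs are more favorable than isolated ones.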
C Staben 2005

Ab Initio Energy Calculation (This slide has been changed)
• Search for all possible base-pairing patterns
• Calculate total energy of each structure based on all stabilizing and destabilizing forces
Total free energy for a specific RNA conformation = Sum of incremental energy terms for:
• helical stacking (sequence dependent)
• loop initiation
• unpaired stacking
(favorable "increments" are < 0)
Fig 6.3 Baxevanis & Ouellette 2005

Dynamic Programming (This slide has been changed)
• Finding optimal secondary structure is difficult - lots of possibilities
• Compare RNA sequence with itself
• Apply scoring scheme based on energy parameters for base stacking, cooperativity, and penalties for destabilizing forces (loops, bulges)
• Find path that represents most energetically favorable secondary structure

3 - Popular Programs that use Combined Computational & Experimental Approaches
• Mfold
• Sfold
• RNAstructure
• RNAfold
• RNAalifold

Comparison of Predictions for Single RNA using Different Methods
[Figure: four predicted structures, each with stem-loops SL X, SL Y, SL Z]
• Sfold: -51.14 kcal/mol
• Mfold: -54.84 kcal/mol
• RNAstructure: -71.3 kcal/mol
• RNAfold: -80.16 kcal/mol
JH Lee 2007

Comparison of Mfold Predictions: -/+ Constraints
• Mfold: -126.05 kcal/mol
• Mfold plus constraints: -54.84 kcal/mol
JH Lee 2007

Performance Evaluation (This slide has been changed)
• Ab initio methods? correlation coefficient = 20-60%
• Comparative approaches?
  correlation coefficient = 20-80%
• Programs that require user to supply MSA are more accurate
• Comparative programs are consistently more accurate than ab initio
• Base-pairs predicted by comparative sequence analysis for large & small subunit rRNAs are 97% accurate when compared with high resolution crystal structures! - Gutell, Pace
• BEST APPROACH? Methods that combine computational prediction (ab initio & comparative) with experimental constraints (from chemical/enzymatic modification studies)

Chp 8 - Gene Prediction
SECTION III GENE AND PROMOTER PREDICTION
Xiong: Chp 8 Gene Prediction
• Categories of Gene Prediction Programs
• Gene Prediction in Prokaryotes
• Gene Prediction in Eukaryotes

What is a Gene?
What is a gene? A segment of DNA, some of which is "structural," i.e., transcribed to give a functional RNA product, & some of which is "regulatory"
• Genes can encode:
  • mRNA (for protein)
  • other types of RNA (tRNA, rRNA, miRNA, etc.)
• Genes differ in eukaryotes vs prokaryotes (& archaea), in both structure & regulation

Gene Finding
Problem: Given a new genomic DNA sequence, identify coding regions and their predicted RNA and protein sequences
ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT
Steps:
1. Search against protein / EST database
2. Apply gene prediction programs (many programs available)
3. Analyze regulatory regions

Gene Prediction in Prokaryotes vs Eukaryotes
Eukaryotes
• Large genomes: 10^7 - 10^10 bp
• Often less than 2% coding
• Complicated gene structure (splicing, long introns)
• Prediction success 50-95%
Prokaryotes
• Small genomes: 0.5 - 10 x 10^6 bp
• About 90% of genome is coding
• Simple gene structure
• Prediction success ~99%
[Figure: eukaryotic gene with promoter, 5' UTR, start codon (ATG), exons and introns with splice sites, stop codon (TAA), 3' UTR; prokaryotic gene with promoter and open reading frame (ORF) from ATG to TAA]

DNA "Signals" Used by Gene Finding Algorithms
1. Exploit the regular gene structure
   ATG-Exon1-Intron1-Exon2-…-ExonN-STOP
2. Recognize "coding bias"
   CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA-…
3. Recognize splice sites
   Intron-cAGt-Exon-gGTgag-Intron
4. Model the duration of regions
   Introns tend to be much longer than exons, in mammals
   Exons are biased to have a given minimum length
5. Use cross-species comparison
   Gene structure is conserved in mammals
   Exons are more similar (~85%) than introns

Computational Gene Finding Approaches
• Ab initio methods
  • Search by signal: find DNA sequences involved in gene expression
  • Search by content: test statistical properties distinguishing coding from non-coding DNA
• Similarity based methods
  • Database search: exploit similarity to proteins, ESTs, and cDNAs
  • Comparative genomics: exploit aligned genomes
    • Do other organisms have similar sequence?
• Hybrid methods - best

Examples of Gene Prediction Software
• Ab initio: GENSCAN, GeneMark.hmm, Genie, GeneID…
• Similarity-based: BLAST, Procrustes…
• Hybrids: GeneSeqer, GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM
BEST?
Ab initio - GENSCAN (according to some assessments)
Hybrid - GeneSeqer
But depends on organism & specific task
Lists of Gene Prediction Software
http://www.bioinformaticsonline.org/links/ch_09_t_1.html
http://cmgm.stanford.edu/classes/genefind/

Synthesis & Processing of Eukaryotic mRNA
[Figure: gene in DNA (5' exon 1, intron, exon 2, intron, exon 3 3') -> Transcription -> 1° transcript (RNA) -> Splicing (remove introns) -> Capping (5' 7MeG) & polyadenylation (AAAAA 3') -> mature mRNA -> Export to cytoplasm]

What are cDNAs & ESTs?
cDNA libraries are important for determining gene structure & studying regulation of gene expression
• Isolate RNA (always from a specific organism, region, and time point)
• Convert RNA to complementary DNA (with reverse transcriptase)
• Clone into cDNA vector
• Sequence the cDNA inserts
• Short (partial) cDNA sequences are called ESTs or Expressed Sequence Tags
  ESTs are strong evidence for genes
• Full-length cDNAs can be difficult to obtain
[Figure: cDNA insert in vector]

UniGene: Unique genes via ESTs
• Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
• UniGene clusters contain many ESTs
• UniGene data come from many cDNA libraries. When you look up a gene in UniGene, you can obtain information re: level & tissue distribution of expression

Gene Prediction
• Overview of steps & strategies
  • What sequence signals can be used?
  • What other types of information can be used?
• Algorithms
  • HMMs, Bayesian models, neural nets
• Gene prediction software
  • 3 major types
  • many, many programs!

Overview of Gene Prediction Strategies
What sequence signals can be used?
• Transcription: TF binding sites, promoter, initiation site, terminator, CpG islands, etc.
• Processing signals: Splice donor/acceptors, polyA signal
• Translation: Start (AUG = Met) & stop (UGA, UAA, UAG) codons, ORFs, codon usage
What other types of information can be used?
• Homology (sequence comparison, BLAST)
• cDNAs & ESTs (experimental data, pairwise alignment)

Gene prediction: Eukaryotes vs prokaryotes
Gene prediction is easier in microbial genomes
Why?
• Smaller genomes
• Simpler gene structures
• Many more sequenced genomes! (for comparative approaches)
Many microbial genomes have been fully sequenced & whole-genome "gene structure" and "gene function" annotations are available
e.g., GeneMark.hmm
      TIGR Comprehensive Microbial Resource (CMR)
      NCBI Microbial Genomes

Predicting Genes - Basic steps:
• Obtain genomic sequence
• BLAST it!
  • Perform database similarity search (with EST & cDNA databases, if available)
  • Translate in all 6 reading frames (i.e., "6-frame translation")
  • Compare with protein sequence databases
• Use Gene Prediction software to locate genes
• Analyze regulatory sequences
• Refine gene prediction

Predicting Genes - Details:
1. 1st, mask to "remove" repetitive elements (ALUs, etc.)
2. Perform database search on translated DNA (BlastX, TFasta)
3. Use several programs to predict genes (GENSCAN, GeneMark.hmm, GeneSeqer)
4. Search for functional motifs in translated ORFs (Blocks, Motifs, etc.) & in neighboring DNA sequences
5. Repeat

GeneSeqer - Brendel et al. - ISU
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
Spliced Alignment Algorithm
Brendel et al (2004) Bioinformatics 20: 1157
• Perform pairwise alignment with large gaps in one sequence (due to introns)
  • Align genomic DNA with cDNA, ESTs, protein sequences
• Score semi-conserved sequences at splice junctions
  • Using Bayesian model or MM (Markov model)
• Score coding constraints in translated exons
  • Using a Bayesian model or MM
[Figure: intron bounded by splice sites, GT (donor) at the 5' end and AG (acceptor) at the 3' end]
Brendel 2005

Brendel - Spliced Alignment II: Compare with protein probes
[Figure: protein aligned to genomic DNA between start codon and stop codon]
Brendel 2005

Splice Site Detection
Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal? YES
• Information Content $I_i$:
  $I_i = 2 + \sum_{B \in \{U,C,A,G\}} f_{iB} \log_2 f_{iB}$
• Extent of Splice Signal Window:
  $I_i \ge \bar{I} + 1.96\,\sigma_{\bar{I}}$
  i: ith position in sequence
  $f_{iB}$: frequency of base B at position i
  $\bar{I}$: avg information content over all positions >20 nt from splice site
  $\sigma_{\bar{I}}$: avg sample standard deviation of $\bar{I}$
Brendel 2005

Information content vs position
[Figure: information content (0.0-0.8) vs position (-50 to +50) relative to the splice site; panels: Human T2_GT, Human T2_AG]
Which sequences are exons & which are introns? How can you tell?
Brendel et al (2004) Bioinformatics 20: 1157
Brendel 2005

Markov Model for Spliced Alignment
[Figure: state diagram with exon states e_n, e_n+1 and intron states i_n, i_n+1; transition labels: P_G, (1-P_G)(1-P_D(n+1)), (1-P_G)P_D(n+1), P_A(n)P_G, 1-P_A(n)]
Brendel 2005
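
The information content measure from the Splice Site Detection slide, I_i = 2 + sum over bases B of f_iB * log2(f_iB), can be sketched in Python as follows. This is only an illustration, not GeneSeqer code: the function names (information_content, signal_positions), the toy aligned sequences, and the convention that a base with zero frequency contributes nothing to the sum are all assumptions. The background mean and standard deviation are passed in by the caller rather than estimated from positions >20 nt away.

```python
import math
from collections import Counter


def information_content(aligned_seqs):
    """Per-position information content (bits) for equal-length RNA sequences.

    I_i = 2 + sum_B f_iB * log2(f_iB), with f_iB the frequency of base B
    at position i. Bases that never occur at a position are skipped,
    which treats 0 * log2(0) as 0.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all sequences must have the same length")
    n = len(aligned_seqs)
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned_seqs)
        info = 2.0 + sum((c / n) * math.log2(c / n) for c in counts.values())
        profile.append(info)
    return profile


def signal_positions(profile, background_mean, background_sd):
    """Positions whose information content reaches mean + 1.96 * sd."""
    cutoff = background_mean + 1.96 * background_sd
    return [i for i, info in enumerate(profile) if info >= cutoff]


# Toy example (hypothetical data): first column is uniform, the rest are invariant.
sites = ["AGGU", "CGGU", "GGGU", "UGGU"]
profile = information_content(sites)
print(profile)
print(signal_positions(profile, background_mean=0.1, background_sd=0.05))
```

A column where all four bases are equally frequent carries 0 bits, and a fully conserved column carries the maximum of 2 bits, so the conserved GU-like columns in the toy data stand out against the background cutoff.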