BCB 444/544 F07 ISU Dobbs
#26 - Gene Prediction
10/22/07

BCB 444/544
Lecture 26
Gene Prediction
#26_Oct22

Required Reading (before lecture)
Mon Oct 22 - Lecture 26
  Gene Prediction
  • Chp 8 - pp 97 - 112
Wed Oct 24 - Lecture 27 (will not be covered on Exam 2)
  Regulatory Element Prediction
  • Chp 9 - pp 113 - 126
Thurs Oct 25 - Review Session & Project Planning
Fri Oct 26 - EXAM 2

Assignments & Announcements
Sun Oct 21 - Study Guide for Exam 2 was posted
Mon Oct 22 - HW#4 Due (no "correct" answer to post)
Thu Oct 25 - Lab = Optional Review Session for Exam
             544 Project Planning/Consult with DD & MT
Fri Oct 26 - Exam 2 - Will cover:
  • Lectures 13-26 (thru Mon Sept 17)
  • Labs 5-8
  • HW# 3 & 4
  • All assigned reading:
    Chps 6 (beginning with HMMs), 7-8, 12-16
    Eddy: What is an HMM
    Ginalski: Practical Lessons…

BCB 544 "Team" Projects
• 544 Extra HW#2 is next step in Team Projects
  • Write ~ 1 page outline
  • Schedule meeting with Michael & Drena to discuss topic
  • Read a few papers
  • Write a more detailed plan
• You may work alone if you prefer
• Last week of classes will be devoted to Projects
  • Written reports due: Mon Dec 3 (no class that day)
  • Oral presentations (15-20') will be: Wed-Fri Dec 5,6,7
  • 1 or 2 teams will present during each class period
See Guidelines for Projects posted online

Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html

BCB 544 Only: New Homework Assignment
544 Extra#2 (posted online Thurs?) No - sorry!
sent by email on Sat…
Due:
  PART 1 - ASAP
  PART 2 - Fri Nov 2 by 5 PM
• Part 1 - Brief outline of Project, email to Drena & Michael
  after response/approval, then:
• Part 2 - More detailed outline of project
  Read a few papers and summarize status of problem
  Schedule meeting with Drena & Michael to discuss ideas

Seminars this Week (continued)
• Oct 25 Thur - BBMB Seminar 4:10 in 1414 MBB
  • Dave Segal, UC Davis: Zinc Finger Protein Design
• Oct 19 Fri - BCB Faculty Seminar 2:10 in 102 ScI
  • Guang Song, ComS, ISU: Probing functional mechanisms by structure-based modeling and simulations

Chp 16 - RNA Structure Prediction
SECTION V STRUCTURAL BIOINFORMATICS
Xiong: Chp 16 RNA Structure Prediction (Terribilini)
• RNA Function
• Types of RNA Structures
• RNA Secondary Structure Prediction Methods
• Ab Initio Approach
• Comparative Approach
• Performance Evaluation

Covalent & non-covalent bonds in RNA
Primary:
• Covalent bonds
Secondary/Tertiary:
• Non-covalent bonds
  • H-bonds (base-pairing)
  • Base stacking
[Figure: RNA structure. http://www.lbl.gov/Science-Articles/ResearchReview/Annual-Reports/1995/images/rna.gif]

Base Pairing in RNA
G-C, A-U, G-U ("wobble") & many variants
See: IMB Image Library of Biological Molecules
http://www.fli-leibniz.de/ImgLibDoc/nana/IMAGE_NANA.html#basepairs

RNA Pseudoknots & Tetraloops
• Often have important regulatory or catalytic functions
[Figure: Pseudoknot; Tetraloop. Fig 6.2 Baxevanis & Ouellette 2005; http://academic.brooklyn.cuny.edu/chem/zhuang/QD/mckay_hr.gif]

RNA Secondary Structure Prediction Methods
Two (three, recently) main types of methods:
1. Ab initio - based on calculating most energetically favorable secondary structure(s)
   Energy minimization (thermodynamics)
2. Comparative approach - based on comparisons of multiple evolutionarily-related RNA sequences
   Sequence comparison (co-variation)
3. Combined computational & experimental
   Use experimental constraints when available

RNA Secondary structure prediction - 3
3) Combined experimental & computational
• Experiments: Map single-stranded vs double-stranded regions in folded RNA
• How?
  Enzymes: S1 nuclease, T1 RNase
  Chemicals: kethoxal, DMS, OH•
• Software: Mfold, Sfold, RNAstructure, RNAfold, RNAalifold
[Figure: modification map of a folded RNA (positions 200-240) marking kethoxal modification (mild, strong) and DMS modification (mild, strong)]

Ab Initio Energy Calculation (sequence dependent)
• loop initiation
• unpaired stacking
(favorable "increments" are < 0)

Energy minimization calculations: Base-stacking is critical
• Free energy is calculated based on parameters determined in the wet lab
• Correction: Use known energy associated with each type of nearest-neighbor pair (base-stacking) (not base-pair)
• Base-pair formation is not independent: multiple base-pairs adjacent to each other are more favorable than individual base-pairs - cooperative - because of base-stacking interactions
• Bulges and loops adjacent to base-pairs have a free energy penalty
[Table: free energies (kcal/mol) of stacked nearest-neighbor base pairs. Stacks listed: CG/GC; AU/UA or UA/AU; GC/CG; AG, AC, CA, GA / UC, UG, GU, CU; GU/UG; XG, GX / YU, UY; CC/GG. Values listed: -1.2, -3.0, -1.6, -4.3, -2.1, -0.3, -4.8. C Staben 2005]
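A minimal sketch of the dynamic-programming idea behind ab initio prediction. It is a simplified stand-in, not the method used by Mfold or the other programs named in these slides: instead of minimizing nearest-neighbor free energy, it maximizes the number of nested base pairs (Nussinov-style), counting G-C, A-U and G-U pairs. The function name `max_base_pairs` and the default of 3 unpaired bases per hairpin loop are assumptions made for illustration.

```python
def max_base_pairs(seq, min_loop=3):
    """Nussinov-style DP: maximum number of nested base pairs in an RNA sequence.

    Simplification (assumption): every allowed pair scores +1; real ab initio
    programs score stacking, loops and bulges with measured free energies.
    """
    allowed = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]  # case 1: base i stays unpaired
            # case 2: base i pairs with base k, leaving >= min_loop bases between them
            for k in range(i + min_loop + 1, j + 1):
                if (seq[i], seq[k]) in allowed:
                    inside = best[i + 1][k - 1]
                    outside = best[k + 1][j] if k < j else 0
                    score = max(score, 1 + inside + outside)
            best[i][j] = score
    return best[0][n - 1]


# Comparing the sequence with itself: a short hairpin-forming example
print(max_base_pairs("GGGAAAUCCC"))
```

Swapping the +1 per pair for stacking and loop energy terms, and the maximum for a minimum, would move this sketch toward the energy-minimization approach described in the slides.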
Ab Initio Energy Calculation (continued)
Total free energy for a specific RNA conformation = Sum of incremental energy terms for:
• helical stacking
Fig 6.3 Baxevanis & Ouellette 2005

Dynamic Programming
• Finding optimal secondary structure is difficult: lots of possibilities
• Compare RNA sequence with itself
• Apply scoring scheme based on energy parameters for base stacking, cooperativity, and penalties for destabilizing forces (loops, bulges)
• Find path that represents most energetically favorable secondary structure

Ab Initio Prediction: Clarifications
• Search for all possible base-pairing patterns
• Calculate total energy of each structure based on all stabilizing and destabilizing forces

Energy minimization: What are the rules?
[Figure: stacked base pairs with their free energies. A=U / U=A: ΔG = -1.6 kcal/mole; AA / UU: ΔG = -1.2 kcal/mole. Tinoco et al.; C Staben 2005]
What gives here?

3 - Popular Programs that use Combined Computational Experimental Approaches
• Mfold
• Sfold
• RNAstructure
• RNAfold
• RNAalifold

Comparison of Predictions for Single RNA using Different Methods
[Figure: predicted structures with stem-loops SL X, SL Y, SL Z. JH Lee 2007]
• Sfold: -51.14 kcal/mol
• Mfold: -54.84 kcal/mol
• RNAstructure: -71.3 kcal/mol
• RNAfold: -80.16 kcal/mol

Comparison of Mfold Predictions: -/+ Constraints
[Figure: predicted structures with stem-loops SL X, SL Y, SL Z. JH Lee 2007]
• Mfold: -126.05 kcal/mol
• Mfold plus constraints: -54.84 kcal/mol

Performance Evaluation
• Ab initio methods? correlation coefficient = 20-60%
• Comparative approaches?
  correlation coefficient = 20-80%
• Programs that require user to supply MSA are more accurate
• Comparative programs are consistently more accurate than ab initio
• Base-pairs predicted by comparative sequence analysis for large & small subunit rRNAs are 97% accurate when compared with high resolution crystal structures! - Gutell, Pace
• BEST APPROACH? Methods that combine computational prediction (ab initio & comparative) with experimental constraints (from chemical/enzymatic modification studies)

Chp 8 - Gene Prediction
SECTION III GENE AND PROMOTER PREDICTION
Xiong: Chp 8 Gene Prediction
• Categories of Gene Prediction Programs
• Gene Prediction in Prokaryotes
• Gene Prediction in Eukaryotes

What is a Gene?
What is a gene? segment of DNA, some of which is "structural," i.e., transcribed to give a functional RNA product, & some of which is "regulatory"
• Genes can encode:
  • mRNA (for protein)
  • other types of RNA (tRNA, rRNA, miRNA, etc.)

Gene Finding Problem:
Given a new genomic DNA sequence, identify coding regions and their predicted RNA and protein sequences
ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT
Steps:
1. Search against protein / EST database
2. Apply gene prediction programs (many programs available)
3. Analyze regulatory regions

• Genes differ in eukaryotes vs prokaryotes (& archaea), in both structure & regulation

Gene Prediction in Prokaryotes vs Eukaryotes
Prokaryotes
• Small genomes: 0.5 - 10 × 10^6 bp
• About 90% of genome is coding
• Simple gene structure
• Prediction success ~99%
[Figure: prokaryotic gene. Promoter, Start codon (ATG), Open reading frame (ORF), Stop codon (TAA)]
Eukaryotes
• Large genomes: 10^7 - 10^10 bp
• Often less than 2% coding
• Complicated gene structure (splicing, long introns)
• Prediction success 50-95%
[Figure: eukaryotic gene. Promoter, 5' UTR, ATG, Exons, Introns, Splice sites, TAA, 3' UTR]

DNA "Signals" Used by Gene Finding Algorithms
1. Exploit the regular gene structure
   ATG-Exon1-Intron1-Exon2-…-ExonN-STOP
2. Recognize "coding bias"
   CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA-…
3. Recognize splice sites
   Intron-cAGt-Exon-gGTgag-Intron
4. Model the duration of regions
   Introns tend to be much longer than exons, in mammals
   Exons are biased to have a given minimum length
5. Use cross-species comparison
   Gene structure is conserved in mammals
   Exons are more similar (~85%) than introns

Computational Gene Finding Approaches
• Ab initio methods
  • Search by signal: find DNA sequences involved in gene expression.
  • Search by content: Test statistical properties distinguishing coding from non-coding DNA
• Similarity based methods
  • Database search: exploit similarity to proteins, ESTs, and cDNAs
  • Comparative genomics: exploit aligned genomes
    • Do other organisms have similar sequence?
• Hybrid methods - best

Examples of Gene Prediction Software
Ab initio
• Genscan, GeneMark.hmm, Genie, GeneID…
Similarity-based
• BLAST, Procrustes…
Hybrids
• GeneSeqer, GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM.
BEST?
• Ab initio - Genscan (according to some assessments)
• Hybrid - GeneSeqer
• But depends on organism & specific task
Lists of Gene Prediction Software
http://www.bioinformaticsonline.org/links/ch_09_t_1.html
http://cmgm.stanford.edu/classes/genefind/

Synthesis & Processing of Eukaryotic mRNA
[Figure: Gene in DNA (5' exon 1 - intron - exon 2 - intron - exon 3 3'), 1° transcript (RNA), and Mature mRNA (5' 7MeG cap, 3' AAAAA tail). Step labels: Transcription; Splicing (remove introns); Capping & polyadenylation; Export to cytoplasm]

What are cDNAs & ESTs?
cDNA libraries are important for determining gene structure & studying regulation of gene expression
• Isolate RNA (always from a specific organism, region, and time point)
• Convert RNA to complementary DNA (with reverse transcriptase)
• Clone into cDNA vector
• Sequence the cDNA inserts
• Short cDNAs are called ESTs or Expressed Sequence Tags
• ESTs are strong evidence for genes
• Full-length cDNAs can be difficult to obtain

UniGene: Unique genes via ESTs
• Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
• UniGene clusters contain many ESTs
• UniGene data come from many cDNA libraries.

Gene Prediction
• Overview of steps & strategies
  • What sequence signals can be used?
  • What other types of information can be used?
• Algorithms
  • HMMs, Bayesian models, neural nets
• Gene prediction software
  • 3 major types
  • many, many programs!

When you look up a gene in UniGene, you can obtain information re: level & tissue distribution of expression

Overview of Gene Prediction Strategies
What sequence signals can be used?
• Transcription: TF binding sites, promoter, initiation site, terminator, GC islands, etc.
• Processing signals: Splice donor/acceptors, polyA signal
• Translation: Start (AUG = Met) & stop (UGA, UAA, UAG), ORFs, codon usage
What other types of information can be used?
• Homology (sequence comparison, BLAST)
• cDNAs & ESTs (experimental data, pairwise alignment)

Gene prediction: Eukaryotes vs prokaryotes
Gene prediction is easier in microbial genomes
Why?
• Smaller genomes
• Simpler gene structures
• Many more sequenced genomes! (for comparative approaches)
Many microbial genomes have been fully sequenced & whole-genome "gene structure" and "gene function" annotations are available
e.g., GeneMark.hmm
      TIGR Comprehensive Microbial Resource (CMR)
      NCBI Microbial Genomes

Predicting Genes - Basic steps:
• Obtain genomic sequence
• BLAST it!
• Perform database similarity search (with EST & cDNA databases, if available)
• Translate in all 6 reading frames (i.e., "6-frame translation")
• Compare with protein sequence databases
• Use Gene Prediction software to locate genes
• Analyze regulatory sequences
• Refine gene prediction

Predicting Genes - Details:
1. 1st, mask to "remove" repetitive elements (ALUs, etc.)
2. Perform database search on translated DNA (BlastX, TFasta)
3. Use several programs to predict genes (GENSCAN, GeneMark.hmm, GeneSeqer)
4. Search for functional motifs in translated ORFs (Blocks, Motifs, etc.) & in neighboring DNA sequences
5. Repeat
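A minimal sketch of the "6-frame translation" step from the basic steps, assuming the standard genetic code and plain A/C/G/T input (any other codon is written as X). The names `CODON_TABLE`, `six_frame_translation` and `find_orfs` are illustrative and are not taken from any tool named in these slides. The example input is the short sequence shown with the Gene Finding Problem.

```python
import re
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_translation(dna):
    """Translate both strands in all three offsets; keys are '+1'..'+3', '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
            frames[f"{strand}{offset + 1}"] = "".join(
                CODON_TABLE.get(codon, "X") for codon in codons
            )
    return frames


def find_orfs(protein):
    """Return peptide stretches that start at M and run up to (not including) a stop."""
    return re.findall(r"M[^*]*(?=\*)", protein)


seq = "ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT"
for name, protein in six_frame_translation(seq).items():
    print(name, protein, find_orfs(protein))
```

Each translated frame would then be compared with protein sequence databases, as in the next basic step; a real pipeline would also filter ORFs by a minimum length.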
GeneSeqer - Brendel et al. - ISU
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi

Spliced Alignment Algorithm
Brendel et al (2004) Bioinformatics 20: 1157
• Perform pairwise alignment with large gaps in one sequence (due to introns)
  • Align genomic DNA with cDNA, ESTs, protein sequences
• Score semi-conserved sequences at splice junctions
  • Using Bayesian model or MM
• Score coding constraints in translated exons
  • Using a Bayesian model or MM
[Figure: intron flanked by splice sites, GT donor and AG acceptor. Brendel 2005]

Brendel - Spliced Alignment II: Compare with protein probes
[Figure: genomic DNA, with start codon and stop codon marked, aligned to a protein. Brendel 2005]

Splice Site Detection
Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal? YES
• Information Content $I_i$:
  $I_i = 2 + \sum_{B \in \{U,C,A,G\}} f_{iB} \log_2 f_{iB}$
• Extent of Splice Signal Window:
  $I_i \geq \bar{I} + 1.96\,\sigma_{\bar{I}}$
$i$: ith position in sequence
$f_{iB}$: frequency of base $B$ at position $i$
$\bar{I}$: avg information content over all positions >20 nt from splice site
$\sigma_{\bar{I}}$: avg sample standard deviation of $\bar{I}$
Brendel 2005

Information content vs position
[Figure: information content (y-axis, 0.0 to 0.8) vs position relative to the splice site (x-axis, -50 to 50); panels: Human T2_GT, Human T2_AG. Brendel et al (2004) Bioinformatics 20: 1157; Brendel 2005]
Which sequences are exons & which are introns? How can you tell?

Markov Model for Spliced Alignment
[Figure: state diagram with exon states e_n, e_n+1 and intron states i_n, i_n+1; transition labels shown: PΔG, (1-PΔG)(1-PD(n+1)), (1-PΔG)PD(n+1), PA(n)PΔG, 1-PA(n). Brendel 2005]
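A minimal sketch of the information content calculation from the Splice Site Detection slide, computed per position from a set of aligned sites. The function names and the toy donor-site alignment are made up for illustration. The window test uses the mean and sample standard deviation of the background positions, which is one plausible reading of the slide's definition of the threshold and should be treated as an assumption.

```python
import math
import statistics
from collections import Counter


def information_content(aligned_sites, alphabet="ACGU"):
    """Per-position information content: I_i = 2 + sum_B f_iB * log2(f_iB)."""
    profile = []
    for i in range(len(aligned_sites[0])):
        counts = Counter(site[i] for site in aligned_sites)
        total = sum(counts[b] for b in alphabet)
        if total == 0:  # column holds no recognized bases
            profile.append(0.0)
            continue
        info = 2.0
        for b in alphabet:
            f = counts[b] / total
            if f > 0:  # a term with f = 0 contributes nothing
                info += f * math.log2(f)
        profile.append(info)
    return profile


def signal_window(profile, background):
    """Indices where I_i >= mean(background) + 1.96 * stdev(background).

    Assumption: 'background' holds I values from positions >20 nt away
    from the splice site, and needs at least two values.
    """
    threshold = statistics.mean(background) + 1.96 * statistics.stdev(background)
    return [i for i, value in enumerate(profile) if value >= threshold]


# Toy alignment of donor-site-like sequences (made-up data, not from the slides)
sites = ["GUAAGU", "GUGAGU", "GUCAGU", "GUUAGU"]
profile = information_content(sites)
print(profile)
print(signal_window(profile, background=[0.0, 0.1, 0.0, 0.1]))
```

A fully conserved position scores 2 bits and a position with all four bases equally frequent scores 0, which matches the range implied by the formula.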