Gene Prediction
Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar

Gene Prediction
• Introduction
• Protein-coding gene prediction
• RNA gene prediction
• Modification and finishing
• Project schema

Why gene prediction?
• Experimental way?
• Exponential growth of sequences
• New sequencing technology
• Metagenomics: only ~1% grow in the lab

How to do it?
• It is a complicated task, so let's break it into parts
• Genome
• Protein-coding gene prediction
  • Homology search: Phillip Lee & Divya Anjan Kumar
  • ab initio approach: Nadeem Bulsara & Neha Gupta
• RNA gene prediction (tRNA, rRNA, sRNA): Amanda McCook & Chengwei Luo

Homology Search
• Strategy
• Open reading frame (ORF)
• How/Why find ORF?
• Protein database searches
• Domain searches
• Limits of extrinsic prediction

ab initio Prediction

Homology Search is not Enough!
• Biased and incomplete database
• Sequenced genomes are not evenly distributed across the tree of life, so they do not reflect its diversity either.

ab initio Gene Prediction
• Features: ORFs (6 frames), codon statistics
• Features (contd.)
• Probabilistic view
• Supervised techniques
• Unsupervised techniques

Commonly Used Tools
• GeneMark
• Glimmer
• EasyGene
• PRODIGAL

GeneMark
• GeneMark.hmm
• GeneMarkS

Glimmer
• Glimmer journey
• Glimmer3.02

PRODIGAL
• Prokaryotic Dynamic Programming Gene Finding Algorithm
• Developed at Oak Ridge National Laboratory and the University of Tennessee
• Features

EasyGene
• Developed at University of Copenhagen
• Statistical significance is the measure for gene prediction.
• A high-quality data set, based on similarity in Swiss-Prot, is extracted from the genome.
• This data set is used to estimate the HMM; statistical significance is then calculated from ORF score and length.
Problem:
• No standalone version available

Comparison of Different Tools

RNA Gene Prediction
• Why predict RNA?
• Regulatory sRNA
• sRNA challenges
• Fundamental methodology
• RFAM
• What is covariance? (Fig: Christian Weile et al. BMC Genomics (2007) 8:244)
• Noncomparative prediction (Fig: James A. Goodrich & Jennifer F. Kugel, Nature Rev. Mol. Cell Biol. (2006) 7:612)
• Noncomparative prediction (*Rolf Backofen & Wolfgang R. Hess, RNA Biol. (2010) 7:1)

Comparative+Noncomparative
• Effective sRNA prediction in V. cholerae
• Non-enterobacteria
• sRNAPredict2
• 32 novel sRNAs predicted
• 9 tested
• 6 confirmed
Jonathan Livny et al. Nucleic Acids Res. (2005) 33:4096

Software
*Rolf Backofen & Wolfgang R. Hess, RNA Biol. (2010) 7:1
Eva K. Freyhult et al. Genome Res. (2007) 17:117

Modification & finishing
• Consensus strategy to integrate ab initio results
• Broken gene recruiting
• TIS correcting
• IS calling
• Operon annotating
• Gene presence/absence analysis

Modification & finishing: consensus strategy, broken gene recruiting
[Flowchart labels: ab initio results, pass, fail, candidate fragments, homology search]

Modification & finishing: TIS correcting
• Start codon redundancy: ATG, GTG, TTG, CTG
• Leaderless genes
• Markov iteration, experimentally verified data

Modification & finishing: IS calling, operon annotating
• IS calling: IS Finder DB
• Operon annotating

Modification & finishing: gene presence/absence analysis

Schema (proposed)
• assembly group
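
Worked sketch: six-frame ORF scan

The slides list "ORFs (6 frames)" as an ab initio feature and give the start codon set ATG, GTG, TTG, CTG under TIS correcting. The sketch below shows one simple way such a scan could look. It is an illustration only, not the method used by GeneMark, Glimmer, PRODIGAL, or EasyGene. The function names, the `min_len` threshold, the standard stop codon set (TAA, TAG, TGA), and the choice to keep the first start codon after the previous stop are all assumptions made for this example.

```python
from typing import List, Tuple

# Start codons as listed on the TIS slide; stop codons are the standard set (assumption).
START_CODONS = {"ATG", "GTG", "TTG", "CTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 90) -> List[Tuple[str, int, int, int]]:
    """Scan all six reading frames for start-to-stop ORFs.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based,
    end-exclusive, and relative to the strand that was scanned (for "-"
    they refer to the reverse-complemented sequence). min_len is a
    hypothetical length cutoff in nucleotides, stop codon included.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first start codon since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3
                    if end - start >= min_len:
                        orfs.append((strand, frame, start, end))
                    start = None
    return orfs


if __name__ == "__main__":
    # Toy sequence: one short ORF on the forward strand, frame 2.
    print(find_orfs("CCATGAAAAAAAAATAAG", min_len=15))
```

Keeping the first start codon after a stop yields the longest candidate per stop codon; picking the true translation initiation site among alternative starts is the separate problem that the TIS correcting step addresses.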