Gene Prediction Preliminary Results
Computational Genomics
February 20, 2012

ab initio Gene Prediction Using Glimmer3, RAST, Prodigal and GeneMarkS

Prodigal
• Low complexity (no hidden Markov model, no interpolated Markov model).
• Based on dynamic programming.
• Remains accurate on high-GC-content genomes.
• Tends to predict longer genes rather than more genes.

[Slides: Prodigal Protocol; Prodigal Options; Build Training File; Running Prodigal; Screenshot of Results]

GeneMarkS
Gene prediction in prokaryotic genomes with unsupervised model parameter estimation.
Available as a web-based version and a command-line version.

Syntax: runGeneMarkS <input_file> <output folder>

The output folder contains 3 types of files:
• .out file: contains the default output
• .faa file: contains the amino acid sequences of the corresponding ORFs in FASTA format
• .fnn file: contains the nucleotide sequences of the corresponding ORFs in FASTA format

[Screenshot of the .out file]
Strand: + is the normal strand, - is the reverse strand. Left end: begin position. Right end: end position.
[Screenshot of the .faa file]
[Screenshot of the .fnn file]

Glimmer3
• A system for finding genes in microbial DNA.
• Works by creating a variable-length Markov model from a training set of genes.
• Uses the model to identify all genes in a DNA sequence.

Running Glimmer3
• Two-step process
• 1. A probability model of coding sequences, called an interpolated context model, must be built from a set of training sequences:
  – 1. genes identified by homology, or known genes
  – 2. genes from long, non-overlapping ORFs
  – 3. genes from a highly similar species
• 2. The Glimmer3 program is run to analyze the sequences and make gene predictions.
  – Best results require the longest possible training set of genes.

Glimmer3 programs
• long-orfs: uses an amino-acid distribution model to filter the set of ORFs
• extract: builds the training set from long, non-overlapping ORFs
• build-icm: builds the interpolated context model from the training sequences
• glimmer3: analyzes sequences and makes predictions

[Figure: Interpolated Context Model]

RAST
• RAST (Rapid Annotation using Subsystem Technology) is a system for annotating bacterial and archaeal genomes.
• Pipeline: tRNAscan-SE, Glimmer2, and comparison against prokaryotic genes that are universal across species.

Number of Genes Predicted

ID        Glimmer3  Prodigal  RAST  GeneMark
M19107    1728      1728      1784  1808
M19501    1914      1867      2015  1933
M21127    2370      2317      2456  2413
M21621    1937      1914      1838  1972
M21639    2698      2665      2823  2797
M21709    1924      1881      2004  1925
Average   2095      2062      2153  2141

Gene Length of Predicted Genes

ID        Glimmer3  RAST    GeneMark
M19107    791.43    793.56  801.50
M19501    806.71    809.12  840.52
M21127    987.09    692.20  708.70
M21621    851.47    900.93  885.61
M21639    740.28    751.85  762.46
M21709    840.49    843.18  873.15
Average   836.25    798.47  811.99

Homology-based Gene Prediction using BLAT

[Workflow diagram]
Query: 1709 protein-coding genes from Haemophilus influenzae
Targets: Haemophilus haemolyticus assemblies (number of contigs): M19107.fasta (99), M19501.fasta (17), M21127.fasta (29), M21621.fasta (24), M21639.fasta (49), M21709.fasta (31)
Steps: BLAT (UCSC) → Output.pslx → predicted genes and query coverage (%) → frequency graphs → define cutoff
[Plot: frequency vs. query coverage (%), with the cut-off marked]

Homology-based Gene Prediction using BLAT: Results

Strain    Contigs  Query-coverage cutoff (%)  Predicted genes  Average length
M19107    99       90                         787              1049
M19501    17       90                         1063             996
M21127    29       90                         901              963
M21621    24       90                         930              685
M21639    49       90                         970              1277
M21709*   31       90                         1515             813

Gene Calling Protocol

[Bar chart: number of predicted genes (≥ 90% query coverage) per strain: M19107 787, M19501 1063, M21127 901, M21621 930, M21639 970, M21709* 1515]

Gene Scoring System
Presence / Absence: ≥ 4/5, = 3/5, ≤ 2/5
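The presence/absence tally in the scoring system can be sketched roughly as follows. This is only an illustration under stated assumptions: the slides show "n/5" but do not list the five evidence sources, so the tool list below (the four ab initio predictors plus BLAT) is a guess, the gene identifiers are hypothetical, and the sketch assumes predictions from different tools have already been matched to a common identifier. The tier labels only repeat the thresholds from the diagram; what is done with each tier is not specified.

```python
from collections import Counter

# Assumed evidence sources: the slides give "n/5" without naming the five.
TOOLS = ("glimmer3", "prodigal", "rast", "genemarks", "blat")


def tier(votes: int) -> str:
    """Map a presence count (0..5) to the tier shown in the scoring diagram."""
    if not 0 <= votes <= len(TOOLS):
        raise ValueError(f"votes must be between 0 and {len(TOOLS)}")
    if votes >= 4:
        return ">=4/5"
    if votes == 3:
        return "=3/5"
    return "<=2/5"


def score_genes(calls: dict[str, set[str]]) -> dict[str, str]:
    """Tally, for each gene id, how many tools predicted it, then assign a tier.

    `calls` maps a tool name to the set of gene ids that tool predicted.
    """
    counts = Counter(gene for tool in TOOLS for gene in calls.get(tool, set()))
    return {gene: tier(n) for gene, n in counts.items()}


# Hypothetical gene ids, for illustration only.
example = {
    "glimmer3": {"g1", "g2", "g3"},
    "prodigal": {"g1", "g2"},
    "rast": {"g1", "g2"},
    "genemarks": {"g1", "g3"},
    "blat": {"g1"},
}
tiers = score_genes(example)
print(tiers)
```

In practice the hard part is deciding when two predictions from different tools count as the same gene (for example, same stop coordinate on the same strand); that matching step is left out of this sketch.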
[Flowchart steps: ? → Multiple alignment (MUSCLE) → Consensus sequence → Final set of homology-based predicted genes]

RNA Prediction

First-pass filters identify "candidate" tRNA regions of the sequence.
• tRNAscan and EufindtRNA
Further analysis confirms the initial tRNA prediction.
• Cove

tRNAscan-SE -B <inputfile> -o <outputfile1> -f <outputfile2> -m <outputfile3>

-B : search for bacterial tRNAs
• This option selects the bacterial covariance model for tRNA analysis, and loosens the search parameters for EufindtRNA to improve detection of bacterial tRNAs.
-o <file> : save final results in <file>
• Specify this option to write results to <file>.
-f <file> : save results and tRNA secondary structures to <file>
-m <file> : save statistics summary for the run
• Contains the run options selected, as well as statistics on the number of tRNAs detected at each phase of the search, search speed, and other statistics.

[Screenshots: output using the "-o" parameter; output using the "-f" parameter; output using the "-m" parameter]

                               M19107  M19501  M21127  M21621  M21639  M21709
No. of contigs                 99      17      29      23      49      29
Contigs with at least 1 tRNA   45      12      22      19      33      21
First-pass tRNAs predicted     103     124     114     123     137     113
Cove-confirmed tRNAs           41      51      50      52      51      51

[Figure: Isotype and anticodon count (M19107)]

RNAmmer

How it works
• It uses two levels of hidden Markov models.
• The spotter model is constructed from highly conserved loci within a structural alignment of known rRNA sequences.
• Once the spotter model detects the approximate position of a gene, the flanking regions are extracted and passed to the full model, which matches the entire gene.
• The two-level approach avoids running the full model across the entire genome sequence, which allows faster predictions.

Command line options

rnammer -S (species) -m (molecules) -xml (xml file) -gff (gff file) -h (hmm report file) -f (fasta file)

• -S : specify the species to use. In our case, bacterial.
• -m : molecules to search for (i.e. large subunit or small subunit).

Results

##gff-version2
##source-version RNAmmer-1.2
##date 2012-02-19
##Type DNA
# seqname  source       feature  start   end     score   frame  attribute
# -----------------------------------------------------------------------
84         RNAmmer-1.2  rRNA     28110   31006   3556.4  .      23s_rRNA
84         RNAmmer-1.2  rRNA     31127   31241   82.9    .      5s_rRNA
1          RNAmmer-1.2  rRNA     116969  117083  82.9    .      5s_rRNA
60         RNAmmer-1.2  rRNA     338     452     82.9    .      5s_rRNA
29         RNAmmer-1.2  rRNA     198     312     82.9    .      5s_rRNA
84         RNAmmer-1.2  rRNA     25977   27507   1872.9  .      16s_rRNA
# -----------------------------------------------------------------------
# +/- (strand) column: only five values, all "+", survive for the six rows, so the column is not shown per row.

rRNA genes predicted per strain (the column headers are missing from the source; the M19107 row agrees with the RNAmmer output above, which lists four 5S, one 16S and one 23S rRNA)

M19107    4  1  1
M19501    7  1  1
M21127    4  1  0
M21621    4  0  0
M21639    7  2  1
M21709    8  2  2

sRNA Prediction

Rfam Database Homology Search
• Rfam is a collection of RNA families:
  – Non-coding RNA genes
  – Structured cis-regulatory elements
  – Self-splicing RNAs
• The search uses WU-BLAST and keeps hits with E-value < 1e-5.

Rfam Preliminary Results

The output format is:
<rfam acc> <rfam id> <seq id> <seq start> <seq end> <strand> <score>

Results:
84 Rfam similarity 25970 27512 1477.28 + . evalue=2.08e-50;gccontent=52;id=SSU_rRNA_bacteria.1;model_end=1518;model_start=1;rfam-acc=RF00177;rfamid=SSU_rRNA_bacteria

Accession  Total ncRNA  rRNA  tRNA / tmRNA  sRNA  Others (RNase P)  Sequencing coverage
M19107     65           10    43            11    1                 12X
M19501     85           14    53            17    1                 53X
M21127     79           9     52            17    1                 20X
M21621     81           10    54            16    1                 25X
M21639     95           12    53            29    1                 78X
M21709     92           16    54            21    1                 34X

Things to be done
• Get GenePRIMP to work: we are having problems with the installation, and the web server takes a long time to process.
• Get the further information required to run other RNA prediction software.
• Compare specific RNA prediction software with the Rfam predictions.

Leading Biocomputational Tools
• eQRNA (Rivas and Eddy 2001)
• RNAz (Washietl et al. 2005; Gruber et al. 2010)
• sRNAPredict3/SIPHT (Livny et al. 2006, 2008)
• NAPP (Marchais et al. 2009)
All four approaches use comparative genomics.

Lu, X., H. Goodrich-Blair, et al. (2011). "Assessing computational tools for the discovery of small RNA genes in bacteria." RNA 17(9): 1635-1647.

[Figure: sRNAPredict3 Pipeline]