2nd_pres_Geneprediction

Gene Prediction
Preliminary Results
Computational Genomics
February 20, 2012
ab initio Gene Prediction
Using Glimmer3, RAST, Prodigal, and GeneMarkS
Prodigal
• Low complexity (no hidden Markov model, no interpolated Markov model).
• Based on dynamic programming.
• Remains accurate in high-GC-content genomes.
• Tends to predict longer genes rather than more genes.
Prodigal
Protocol
Prodigal Options
Build Training File
Running Prodigal
Screenshot of Results
GeneMarkS
Gene prediction in prokaryotic genomes with unsupervised model parameter estimation
Web based version
Command line version
Syntax:
runGeneMarkS <input_file> <output folder>
The output folder contains three types of files:
• .out file: contains the default output
• .faa file: contains the amino acid sequences of the corresponding ORFs in FASTA format
• .fnn file: contains the nucleotide sequences of the corresponding ORFs in FASTA format
Screenshot of the .out file
Strand: + is the forward strand, - is the reverse strand
Left end: start position; Right end: end position
Screenshot of the .faa file
Screenshot of the .fnn file
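The .faa and .fnn files described above are plain FASTA, so the number of predicted ORFs and their mean length can be checked with a few lines of Python. This is a minimal sketch; the function names `read_fasta` and `summarize_orfs` are hypothetical helpers, not part of GeneMarkS.

```python
from statistics import mean


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize_orfs(lines):
    """Return (number of records, mean sequence length) for a FASTA stream."""
    lengths = [len(seq) for _, seq in read_fasta(lines)]
    return len(lengths), (mean(lengths) if lengths else 0.0)
```

Passing an open file handle, for example `summarize_orfs(open("M19107.fnn"))` with a hypothetical file name, gives the ORF count and the mean nucleotide length for one genome.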
Glimmer3
• A system for finding genes in microbial DNA
• Works by creating a variable-length Markov model from a training set of genes
• Uses the model to identify all genes in a DNA sequence
Running Glimmer3
• Two-step process
• 1. A probability model of coding sequences, called an interpolated context model (ICM), must be built from a set of training sequences. Possible sources:
– genes identified by homology, or known genes
– long, non-overlapping ORFs
– genes from a highly similar species
• 2. The program is run to analyze the sequences and make gene predictions.
– Best results require the longest possible training set of genes.
Glimmer3 programs
• long-orfs: uses an amino-acid distribution model to filter the set of ORFs
• extract: builds the training set from long, non-overlapping ORFs
• build-icm: builds the interpolated context model from the training sequences
• glimmer3: analyzes sequences and makes predictions
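The four programs above can be chained from Python. The sketch below only assembles the command lines and, optionally, runs them. The flags are an assumption: they follow the "from scratch" example script distributed with Glimmer3 as far as I recall it, and should be checked against the installed version. The function names are hypothetical.

```python
import subprocess
from contextlib import ExitStack


def glimmer3_commands(genome, tag):
    """Return the pipeline steps as (argv, stdin_path, stdout_path) tuples.

    Flags are assumed from the Glimmer3 'from scratch' example script;
    verify them against the local installation before use.
    """
    return [
        # 1. Find long, non-overlapping ORFs to use as training candidates.
        (["long-orfs", "-n", "-t", "1.15", genome, f"{tag}.longorfs"], None, None),
        # 2. Extract their sequences into a training set (written to stdout).
        (["extract", "-t", genome, f"{tag}.longorfs"], None, f"{tag}.train"),
        # 3. Build the interpolated context model (training set read from stdin).
        (["build-icm", "-r", f"{tag}.icm"], f"{tag}.train", None),
        # 4. Predict genes with the trained model.
        (["glimmer3", "-o50", "-g110", "-t30", genome, f"{tag}.icm", tag], None, None),
    ]


def run_pipeline(genome, tag):
    """Run each step in order, wiring stdin/stdout redirections to files."""
    for argv, stdin_path, stdout_path in glimmer3_commands(genome, tag):
        with ExitStack() as stack:
            stdin = stack.enter_context(open(stdin_path)) if stdin_path else None
            stdout = stack.enter_context(open(stdout_path, "w")) if stdout_path else None
            subprocess.run(argv, stdin=stdin, stdout=stdout, check=True)
```

Keeping command construction separate from execution makes it easy to print and inspect the commands before running them on all six genomes.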
Interpolated Context Model
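The idea of interpolating between context lengths can be illustrated with a toy model. This is a simplified sketch, not Glimmer3's actual ICM: the weighting rule `total / (total + trust)` and the class name `SimpleIMM` are assumptions chosen for illustration, and Glimmer's own method for choosing and weighting context positions is more involved.

```python
from collections import defaultdict


class SimpleIMM:
    """Toy interpolated Markov model over nucleotides (illustrative only)."""

    def __init__(self, max_order=3, trust=10.0):
        self.max_order = max_order
        self.trust = trust  # assumed constant: larger means longer contexts need more data
        # counts[k][context][base] = times `base` followed the k-mer `context`
        self.counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]

    def train(self, sequences):
        for seq in sequences:
            for i, base in enumerate(seq):
                for k in range(self.max_order + 1):
                    if i >= k:
                        self.counts[k][seq[i - k:i]][base] += 1

    def prob(self, context, base):
        """Blend estimates from order 0 up to max_order, trusting well-observed contexts more."""
        p = 0.25  # uniform starting point over A, C, G, T
        for k in range(min(self.max_order, len(context)) + 1):
            table = self.counts[k].get(context[len(context) - k:])
            total = sum(table.values()) if table else 0
            if total == 0:
                continue  # context never seen: keep the lower-order estimate
            weight = total / (total + self.trust)
            p = weight * (table.get(base, 0) / total) + (1 - weight) * p
        return p
```

Because each step is a convex combination of two distributions, the probabilities over A, C, G, and T still sum to 1 for any context.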
RAST
• RAST (Rapid Annotation using Subsystem
Technology) is a system for annotating
bacterial and archaeal genomes.
• Pipeline: tRNAscan-SE, Glimmer2, and comparison against prokaryotic genes that are universal across species.
Number of Genes Predicted
ID        Glimmer3   Prodigal   RAST   GeneMark
M19107    1728       1728       1784   1808
M19501    1914       1867       2015   1933
M21127    2370       2317       2456   2413
M21621    1937       1914       1838   1972
M21639    2698       2665       2823   2797
M21709    1924       1881       2004   1925
Average   2095       2062       2153   2141
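The per-tool averages above can be recomputed from the per-genome counts. A minimal sketch, assuming the column order Glimmer3, Prodigal, RAST, GeneMark; the names `GENE_COUNTS` and `column_means` are illustrative.

```python
# Predicted gene counts per genome, copied from the table above.
GENE_COUNTS = {
    "M19107": {"Glimmer3": 1728, "Prodigal": 1728, "RAST": 1784, "GeneMark": 1808},
    "M19501": {"Glimmer3": 1914, "Prodigal": 1867, "RAST": 2015, "GeneMark": 1933},
    "M21127": {"Glimmer3": 2370, "Prodigal": 2317, "RAST": 2456, "GeneMark": 2413},
    "M21621": {"Glimmer3": 1937, "Prodigal": 1914, "RAST": 1838, "GeneMark": 1972},
    "M21639": {"Glimmer3": 2698, "Prodigal": 2665, "RAST": 2823, "GeneMark": 2797},
    "M21709": {"Glimmer3": 1924, "Prodigal": 1881, "RAST": 2004, "GeneMark": 1925},
}


def column_means(table):
    """Return the mean value per tool across all genomes."""
    tools = list(next(iter(table.values())))
    return {tool: sum(row[tool] for row in table.values()) / len(table) for tool in tools}
```

Rounding the result of `column_means(GENE_COUNTS)` to whole genes reproduces the Average row.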
Gene Length of Predicted Genes
ID        Glimmer3   RAST     GeneMark
M19107    791.43     793.56   801.50
M19501    806.71     809.12   840.52
M21127    987.09     692.20   708.70
M21621    851.47     900.93   885.61
M21639    740.28     751.85   762.46
M21709    840.49     843.18   873.15
Average   836.25     798.47   811.99
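Homology-based Gene Prediction using BLAT
• Query: 1709 protein-coding genes from Haemophilus influenzae
• Targets: Haemophilus haemolyticus assemblies
– M19107.fasta (99 contigs)
– M19501.fasta (17 contigs)
– M21127.fasta (29 contigs)
– M21621.fasta (24 contigs)
– M21639.fasta (49 contigs)
– M21709.fasta (31 contigs)
• BLAT (UCSC) aligns the query genes against each target and writes Output.pslx
• Predicted genes are summarized as frequency graphs of query coverage (%)
• A query-coverage cut-off is defined from these graphs
[Figure: frequency vs. query coverage (%), with the cut-off marked]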
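Query coverage can be computed directly from the BLAT output. The sketch below assumes the standard tab-separated PSL column layout (pslx appends sequence columns after the first 21) and assumes that query coverage means the aligned query span divided by the query length. The slides do not give the exact definition, so treat both the formula and the function names as illustrative.

```python
def query_coverage(psl_line):
    """Return (query name, target name, coverage %) for one PSL/PSLX record."""
    fields = psl_line.rstrip("\n").split("\t")
    q_name, q_size = fields[9], int(fields[10])
    q_start, q_end = int(fields[11]), int(fields[12])
    t_name = fields[13]
    # Assumed definition: aligned span on the query as a percentage of its length.
    return q_name, t_name, 100.0 * (q_end - q_start) / q_size


def filter_hits(lines, cutoff=90.0):
    """Keep alignments whose query coverage is at least `cutoff` percent."""
    hits = []
    for line in lines:
        if not line[:1].isdigit():
            continue  # skip the PSL header block and blank lines
        q_name, t_name, coverage = query_coverage(line)
        if coverage >= cutoff:
            hits.append((q_name, t_name, coverage))
    return hits
```

With the default `cutoff=90.0`, this mirrors the 90% threshold used in the results.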
Homology-based Gene Prediction using BLAT: Results
Strain    Contigs   Query-coverage cutoff (%)   Predicted genes   Average length
M19107    99        90                          787               1049
M19501    17        90                          1063              996
M21127    29        90                          901               963
M21621    24        90                          930               685
M21639    49        90                          970               1277
M21709*   31        90                          1515              813
Gene Calling Protocol
• Number of predicted genes (≥ 90% query coverage)
– M19107: 787
– M19501: 1063
– M21127: 901
– M21621: 930
– M21639: 970
– M21709*: 1515
• Gene scoring system (presence / absence)
– ≥ 4/5
– = 3/5
– ≤ 2/5
– ?
• Multiple alignment (MUSCLE)
• Consensus sequence
• Final set of homology-based predicted genes
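The presence/absence scoring step can be sketched as a simple binning function. The thresholds (≥ 4/5, = 3/5, ≤ 2/5) come from the slide; the bin labels and the function name are hypothetical, since the slide does not say what is done with each class.

```python
def score_bin(presence, total=5):
    """Classify a gene by the number of genomes in which it was predicted.

    `presence` is a sequence of booleans, one per genome.
    The labels returned here are placeholders, not the authors' terminology.
    """
    if len(presence) != total:
        raise ValueError(f"expected {total} presence flags, got {len(presence)}")
    n_present = sum(1 for flag in presence if flag)
    if n_present >= 4:
        return "present_in_4_or_more"
    if n_present == 3:
        return "present_in_3"
    return "present_in_2_or_fewer"
```

For example, `score_bin([True, True, True, False, False])` falls into the middle class.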
RNA Prediction
First-pass filters identify "candidate" tRNA regions of the sequence.
• tRNAscan and EufindtRNA
Further analysis confirms the initial tRNA predictions.
• Cove
tRNAscan-SE -B <inputfile> -o <outputfile1> -f <outputfile2> -m <outputfile3>
-B : search for bacterial tRNAs
• This option selects the bacterial covariance model for tRNA analysis and loosens the search parameters for EufindtRNA to improve detection of bacterial tRNAs.
-o <file> : save final results in <file>
• Specify this option to write results to <file>.
-f <file> : save results and tRNA secondary structures to <file>
-m <file> : save a statistics summary for the run
• Contains the run options selected as well as statistics on the number of tRNAs detected at each phase of the search, search speed, and other statistics.
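The same command can be applied to all six genomes in a loop. A minimal sketch using only the options listed above; the output file names are hypothetical, and the input file is placed last, which assumes the usual options-then-sequence-file calling convention.

```python
import subprocess

GENOMES = ["M19107", "M19501", "M21127", "M21621", "M21639", "M21709"]


def trnascan_command(genome_id):
    """Build the tRNAscan-SE argument list for one genome (output names are made up)."""
    return [
        "tRNAscan-SE",
        "-B",                                # bacterial search mode
        "-o", f"{genome_id}.trna.out",       # final results
        "-f", f"{genome_id}.trna.struct",    # results with secondary structures
        "-m", f"{genome_id}.trna.stats",     # run statistics summary
        f"{genome_id}.fasta",
    ]


def run_all(genomes=GENOMES):
    """Run tRNAscan-SE on each genome; raises if any run fails."""
    for genome_id in genomes:
        subprocess.run(trnascan_command(genome_id), check=True)
```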
Output using the "-o" parameter
Output using the "-f" parameter
Output using the "-m" parameter
                               M19107   M19501   M21127   M21621   M21639   M21709
No. of contigs                 99       17       29       23       49       29
Contigs with at least 1 tRNA   45       12       22       19       33       21
First-pass tRNAs predicted     103      124      114      123      137      113
Cove-confirmed tRNAs           41       51       50       52       51       51
Isotype and Anticodon Count (M19107)
RNAmmer
Working
• It uses two levels of hidden Markov models (HMMs).
• The spotter model is constructed from highly conserved loci within a structural alignment of known rRNA sequences.
• Once the spotter model detects the approximate position of a gene, the flanking regions are extracted and passed to the full model, which matches the entire gene.
• The two-level approach avoids running the full model across the entire genome sequence, which allows faster predictions.
Command line options
rnammer -S (species) -m (molecules) -xml (xml file) -gff (gff file) -h (hmm report file) -f (fasta file)
• -S : specifies the species to use. In our case, it is bacterial.
• -m : molecules to search for (i.e., large subunit or small subunit).
Results
##gff-version2
##source-version RNAmmer-1.2
##date 2012-02-19
##Type DNA
# seqname   source        feature   start    end      score    frame   attribute
# ------------------------------------------------------------------------------
84          RNAmmer-1.2   rRNA      28110    31006    3556.4   .       23s_rRNA
84          RNAmmer-1.2   rRNA      31127    31241    82.9     .       5s_rRNA
1           RNAmmer-1.2   rRNA      116969   117083   82.9     .       5s_rRNA
60          RNAmmer-1.2   rRNA      338      452      82.9     .       5s_rRNA
29          RNAmmer-1.2   rRNA      198      312      82.9     .       5s_rRNA
84          RNAmmer-1.2   rRNA      25977    27507    1872.9   .       16s_rRNA
# ------------------------------------------------------------------------------
+/- column: five values survive for the six rows, all "+".
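Per-genome rRNA counts can be tallied from the RNAmmer GFF output. A minimal sketch, assuming the standard nine-column GFF layout with the rRNA type in the last (attribute) column; the function name is hypothetical.

```python
from collections import Counter


def count_rrna_types(gff_lines):
    """Count rRNA predictions by type (e.g. 5s_rRNA, 16s_rRNA, 23s_rRNA)."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF header/comment lines
        fields = line.split()
        if len(fields) < 9:
            continue  # incomplete record: not enough columns to read the attribute
        counts[fields[8]] += 1
    return counts
```

Calling it on the open GFF file for one genome returns a `Counter` keyed by rRNA type.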
M19107   4   1   1
M19501   7   1   1
M21127   4   1   0
M21621   4   0   0
M21639   7   2   1
M21709   8   2   2
sRNA Prediction
Rfam Database Homology Search
• A collection of RNA families
– Non-coding RNA genes
– Structured cis-regulatory elements
– Self-splicing RNAs
• Searches with WU-BLAST and keeps hits with E-value < 1e-5
Rfam Preliminary Results
The output format is: <rfam acc> <rfam id> <seq id> <seq start> <seq end> <strand> <score>
Results:
84 Rfam similarity 25970 27512 1477.28 + . evalue=2.08e-50;gccontent=52;id=SSU_rRNA_bacteria.1;model_end=1518;model_start=1;rfam-acc=RF00177;rfamid=SSU_rRNA_bacteria
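A result record like the one above (read as a single whitespace-separated line) can be split into its coordinates and its `key=value` attributes, and the E-value cutoff applied again afterwards. A minimal sketch; the function names are hypothetical, and the attribute keys are taken as they appear in the record.

```python
def parse_rfam_hit(line):
    """Parse one GFF-style Rfam hit into coordinates plus an attribute dict."""
    fields = line.split()
    if len(fields) < 9:
        raise ValueError("expected at least 9 whitespace-separated fields")
    attributes = dict(
        item.split("=", 1) for item in fields[8].split(";") if "=" in item
    )
    return {
        "seq_id": fields[0],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "score": float(fields[5]),
        "strand": fields[6],
        "attributes": attributes,
    }


def passes_evalue_cutoff(hit, cutoff=1e-5):
    """True if the hit's E-value is below the cutoff (1e-5 on the slide)."""
    return float(hit["attributes"]["evalue"]) < cutoff
```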
Accession #   Total ncRNA   # of rRNA   # of tRNA / tmRNA   # of sRNA   Others (RNase P)   Sequencing coverage
M19107        65            10          43                  11          1                  12 X
M19501        85            14          53                  17          1                  53 X
M21127        79            9           52                  17          1                  20 X
M21621        81            10          54                  16          1                  25 X
M21639        95            12          53                  29          1                  78 X
M21709        92            16          54                  21          1                  34 X
Things to be done
• Get GenePRIMP to work: the local installation is failing, and the web server takes a long time to process submissions.
• Gather the further information required to run other RNA prediction tools.
• Compare specific RNA prediction tools with the Rfam predictions.
Leading Biocomputational Tools
• eQRNA (Rivas and Eddy 2001)
• RNAz (Washietl et al. 2005; Gruber et al. 2010)
• sRNAPredict3/SIPHT (Livny et al. 2006, 2008)
• NAPP (Marchais et al. 2009)
All four approaches use comparative genomics.
Lu, X., H. Goodrich-Blair, et al. (2011). "Assessing computational tools for the discovery of small
RNA genes in bacteria." RNA 17(9): 1635-1647
sRNAPredict3 Pipeline