MethodsStats_EVM_annotation_Odegus

Author list
Francisco Câmara1,2, Emilio Palumbo1,2, Barbara Uszczynska1,2, Anna Vaslova1,2 and
Roderic Guigo1,2
1 Centre for Genomic Regulation (CRG), Dr. Aiguader 88, 08003 Barcelona, Spain.
2 Universitat Pompeu Fabra (UPF), Barcelona, Spain
METHODS FOR THE EVM-DERIVED (PROTEIN-CODING)
GENOME ANNOTATION OF OCTODON DEGUS AND
STATISTICS OF THE RESULTING CONSENSUS GENE MODELS
(EVM)
1 Protein-coding ab initio or evidence-based gene predictions
on the O. degus genome (Assembly OctDeg1.0; 2012/05/01)
using four programs.
1.1 Geneid gene predictions on the O. degus v1.0 genome
assembly.
1.2 SGP2 gene predictions on the O. degus v1.0 genome
assembly.
1.3 Augustus gene predictions on the O. degus v1.0 genome
assembly.
1.4 SNAP gene predictions on the O. degus v1.0 genome
assembly.
2 Evidence-Modeler (EVM)-based genome annotation of the O.
degus v1.0 assembly by combining different sources of
evidence using weights.
2.1 PASA transcript alignments
2.2 Protein alignments
2.3 Combining the different EVM sources
2.4 EVM consensus annotation statistics
3. References
1 Protein-coding gene annotation of the O. degus
genome
1.1 Obtaining geneid gene predictions using an H. sapiens/mammalian-specific
parameter file
Geneid [1,2] is an ab initio gene prediction program used to find potential
protein-coding genes in anonymous genomic sequences. In the context of geneid,
training basically consists of computing position weight matrices (PWMs) or
Markov models of order 1 for splice sites and start codons, and deriving a model
of coding DNA (generally a Markov model of order 5). Once a preliminary
species-specific matrix is obtained, it is further optimized by adjusting
two internal matrix parameters: the cutoff of the scores of the predicted exons
(eWF) and the ratio of signal to coding statistics information to be used (oWF).
Geneid with its H. sapiens/mammal-specific parameter file has been used in the
past to accurately generate gene predictions in several mammalian
genomes (e.g. M. musculus, Consortium MGS, 2002; R. norvegicus, Gibbs et
al., 2004). The accuracy of the geneid generic mammal parameter file for predicting
sequences in O. degus was tested on an “artificial scaffold” of 13.5 Mbases
consisting of the 238 concatenated evaluation-set gene models with 800
nucleotides of intervening sequence between consecutive genes (Table 1). This
artificial scaffold was built using one of the modules of a recently developed
geneid training tool. The protein-coding gene sequences embedded in the
artificial scaffold were selected from the set of more than 26,000 NCBI
GNOMON-annotated O. degus transcripts
(http://www.ncbi.nlm.nih.gov/bioproject/193441).
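The construction of such an artificial scaffold can be sketched as follows. This is only an illustration, not the module of the geneid training tool: the function name is hypothetical, and the use of “N” characters as intervening sequence is an assumption, since the text does not state what the 800 intervening nucleotides consist of.

```python
def build_artificial_scaffold(gene_seqs, spacer_len=800, spacer_char="N"):
    """Concatenate gene sequences with a fixed-length spacer between neighbours.

    Returns the scaffold string and the 1-based inclusive (start, end)
    coordinates of every gene inside the scaffold, so that predictions
    can later be compared against the known gene positions.
    """
    spacer = spacer_char * spacer_len  # assumption: spacer content is not given in the text
    parts = []
    coords = []
    offset = 0  # number of scaffold bases written so far
    for index, seq in enumerate(gene_seqs):
        if index > 0:
            parts.append(spacer)
            offset += spacer_len
        parts.append(seq)
        coords.append((offset + 1, offset + len(seq)))
        offset += len(seq)
    return "".join(parts), coords


# Toy example with two short "genes" and a 5-nt spacer instead of 800 nt.
scaffold, coords = build_artificial_scaffold(["ATGAAATAG", "ATGCCCTGA"], spacer_len=5)
```

Keeping the coordinates of each embedded gene is what makes the scaffold usable as an evaluation set.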
Geneid can also use external information, such as the coordinates of known
introns, to improve the accuracy of its predictions. To take advantage of
this feature, we 1) generated a set of 25,621 transcripts split-mapping
to the genome assembly of O. degus using the PASA pipeline (Program to
Assemble Spliced Alignments; r2014-04-17 [3]; refer to section 2 for
additional information) and 2) extracted all introns corresponding to the
protein-coding sequences from these 25,621 PASA-generated gene models using the
PASA module designed to produce potential training-set models. This process
resulted in a final set of 159,789 introns. To measure the accuracy
of the geneid predictions given this intronic evidence, we mapped all available
transcript evidence (refer to section 1.3) to the artificial scaffold with the PASA
pipeline, used the program’s “training set”-building module to mimic the
creation of a set of PASA transcripts mapping to the test gene models (within the
artificial scaffold), and extracted the introns corresponding to the
generated protein-coding transcript sequences. We then calculated the accuracy
of geneid (+introns) on the test scaffold, which showed a marked improvement
in the performance of the program when using introns as evidence (i.e. sensitivity
9-18 and specificity 7-10 percentage points higher than geneid without introns at
nucleotide/exon levels; refer to Table 1).
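The accuracy measures reported in Table 1 can be illustrated with a minimal sketch. It assumes the usual convention in gene-prediction evaluation, where sensitivity is the fraction of annotated features that are predicted and “specificity” is the fraction of predicted features that are annotated; the function names and the toy coordinates are hypothetical, and the evaluation software actually used is not described in this document.

```python
def covered_positions(intervals):
    """Return the set of positions covered by 1-based inclusive intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_accuracy(annotated_exons, predicted_exons):
    """Sensitivity and specificity at nucleotide level."""
    truth = covered_positions(annotated_exons)
    predicted = covered_positions(predicted_exons)
    true_positives = len(truth & predicted)
    sensitivity = true_positives / len(truth) if truth else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


def exon_accuracy(annotated_exons, predicted_exons):
    """Sensitivity and specificity at exon level (both boundaries must match)."""
    truth, predicted = set(annotated_exons), set(predicted_exons)
    exact = len(truth & predicted)
    sensitivity = exact / len(truth) if truth else 0.0
    specificity = exact / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Toy example: the second predicted exon is shifted by 20 nt.
annotated = [(1, 100), (201, 300)]
predicted = [(1, 100), (221, 320)]
sn, sp = nucleotide_accuracy(annotated, predicted)
sne, spe = exon_accuracy(annotated, predicted)
```

In the toy example the shifted exon still contributes to nucleotide-level accuracy but counts as a miss at exon level, which is why exon-level values in Table 1 are lower than nucleotide-level ones.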
1.2 Obtaining SGP2 gene predictions using a pre-existing
mammalian-specific parameter file.
SGP2 [4] is a syntenic gene prediction tool that combines ab initio gene
prediction (geneid) with TBLASTX searches between two or more genome
sequences to provide both sensitive and specific gene predictions. It usually
improves on geneid’s performance, especially by reducing the
number of false-positive predictions. SGP2 requires a reference genome against
which the target genome (in this case O. degus) is compared with TBLASTX. We
decided to use the genome of H. sapiens (assembly hg38) as our “reference
genome” on the premise that O. degus (like M. musculus) is likely to be
at an appropriate evolutionary distance from H. sapiens for
optimal SGP2 performance. An “appropriate” evolutionary distance in the context
of SGP2 means that the coding portions of the target and reference genomes being
aligned are more highly conserved than their intergenic and intronic
regions. This in turn contributes to a higher accuracy of SGP2
compared to geneid, especially in terms of specificity. Furthermore, if our
assumption proved correct, an existing SGP2 parameter file previously
developed and optimized for predicting gene models in the M. musculus/H. sapiens
genomes [4] could be used to obtain the gene predictions. To test our
assumption, this pre-existing “Mmus-Hsap” SGP2 parameter file was evaluated
on the same “artificial scaffold” used to test geneid (refer to the previous section).
The SGP2 predictions on the “artificial scaffold” (using H. sapiens as the
reference genome) show that this program is 7-12 percentage points more sensitive
and 4-7 percentage points more specific than geneid at nucleotide and exon levels
(refer to Table 1).
SGP2 (an extension of geneid) can also use external information (i.e. introns)
to improve the accuracy of its predictions. We used the same set of introns as
external evidence for SGP2 as we had for geneid (section 1.1). Again, the
accuracy of SGP2 (+introns) on the test scaffold confirmed a marked
improvement in the performance of the program when introns were used as
evidence (i.e. 12-25% higher sensitivity than geneid at nucleotide/exon levels;
refer to Table 1).
1.3 Obtaining augustus gene predictions using a pre-existing
mammalian-specific parameter file
We also produced O. degus protein-coding gene annotations using the
gene prediction tool Augustus. Augustus is a program that predicts genes in
eukaryotic genomic sequences and is “re-trainable”. The program is based on a
Hidden Markov Model and integrates a number of known methods and
submodels [5]. To predict gene sequences in O. degus using augustus
we used the program’s pre-existing mammal/H. sapiens parameter file. We
estimated the program’s accuracy in predicting O. degus sequences by
evaluating it on the same 13.5 Mbase artificial scaffold, consisting of the 238
concatenated gene models with 800 nucleotides of intervening sequence, used to
evaluate the in-house geneid and SGP2 programs (Table 1).
We also took advantage of augustus’ ability to use external evidence to
improve its performance. We obtained a set of predictions that used
the mammal/H. sapiens augustus parameter file in combination with PASA-derived
transcript evidence obtained for this species (refer to section 2 for
additional information). Briefly, for O. degus this transcript evidence set consisted
of 25,621 PASA transcripts built from an initial set of 1,767,640 sequences from
various sources: 1) O. degus mRNA transcripts (deemed to be protein-coding)
obtained at the NCBI O. degus project site
(ftp://ftp.ncbi.nlm.nih.gov/genomes/Octodon_degus/RNA/) and 2) ESTs and mRNA
sequences from five species of rodents (mouse, guinea pig, chinese hamster and
rat), also found at the NCBI project pages for each of these rodents.
Refer to section 2.1 of this document for further details on how the PASA
transcripts were generated.
Our strategy for taking advantage of the large set of transcripts followed the
methodology described in the Augustus documentation
(http://augustus.gobics.de/binaries/readme.rnaseq.html) and in an article by
Stanke et al. [5], and allowed us to obtain a higher-accuracy evidence-based set
of Augustus(+hints) predictions on the O. degus assembly (refer to Table 1). For
augustus to take advantage of the external data on O. degus, we
first had to adjust some internal parameters of the augustus parameter
file. The exonpart bonus for hints of source “E” (PASA evidence) was
set to 1e3 (i.e. 10^3). For every exonpart that was not supported by the
PASA evidence, the probability of the gene structure was given a “malus”
(penalty) of 0.995. Complete exons predicted by augustus that
perfectly matched the exons in the external hints were given a bonus of 1e4.
The intron bonus for (PASA) hints of source E was also set to 1e4, meaning that a
predicted intron received this bonus when it matched the PASA “hint” exactly.
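Hints are passed to Augustus as GFF lines whose last column names the hint source (here “E”). The sketch below formats PASA-derived introns as such lines. The helper name, the “PASA” label in the second column, the priority value and the toy coordinates are all assumptions made for illustration; this is not the script used to produce the hints in this work.

```python
def intron_hint(seqid, start, end, strand, source="E", priority=4):
    """Format one intron as a 9-column, tab-separated GFF hint line.

    `source` must correspond to a source letter declared in the Augustus
    extrinsic configuration; `priority` is an illustrative value.
    """
    attributes = f"src={source};pri={priority}"
    fields = [seqid, "PASA", "intron", str(start), str(end), ".", strand, ".", attributes]
    return "\t".join(fields)


# Toy introns as (scaffold, start, end, strand); coordinates are made up.
introns = [("scaffold_1", 1201, 1850, "+"), ("scaffold_1", 2010, 2977, "+")]
hint_lines = [intron_hint(*intron) for intron in introns]
```

Each line would then be written to a hints file that is supplied to Augustus together with the extrinsic configuration holding the bonus and malus values described above.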
1.4 Obtaining SNAP predictions using a pre-existing mammalian-specific matrix
Our final source of ab initio gene predictions to be used by the EVidence
Modeler combiner (EVM r2012-06-25) [7] was obtained with the program
SNAP [6] and its pre-existing mammalian-specific matrix. SNAP was developed
by Ian Korf and is a general-purpose gene-finding program that can be
used on both eukaryotic and prokaryotic genomes. SNAP is an acronym for
“Semi-HMM-based Nucleic Acid Parser”. SNAP’s performance was the poorest
among the sources of ab initio/homology-based gene
predictions used in this project (refer to Table 1), but it was still used as a
source for EVM, albeit with a fraction of the weight given to the other ab initio tools.
Geneid and SGP2 (both with and without introns), augustus (with and without “hints”)
and SNAP, with their pre-existing mammal-specific parameter files, were
subsequently used to predict genes on the latest repeat-masked assembly of the
O. degus genome (OctDeg1.0; 2012/05/01), which is made up of 7,134
scaffolds. With the mammalian/H. sapiens parameter file, geneid predicted
68,023 protein-coding genes without external evidence and 61,839
when using intronic data. SGP2 generated 55,558 predictions on the assembly
and 49,619 gene models when using introns as external evidence
(SGP2+introns). Augustus predicted 32,581 genes on the assembly,
while its evidence-based variant [augustus(+hints)] produced 26,405 gene
models. SNAP generated 167,786 predictions on this
assembly.
The predictions from these programs were then used as input to a “combiner”
(Evidence Modeler; EVM r2012-06-25) [7], the program employed to obtain the reference
annotation for this genome.
2 EVM-based genome annotation of the O. degus assembly by
combining different sources of evidence using weights.
A combination of the Program to Assemble Spliced Alignments (PASA
r2014-04-17) [3] and Evidence Modeler (EVM r2012-06-25) [7] was used to
obtain consensus coding sequence (CDS) models using three main sources of
evidence: aligned transcripts, aligned proteins, and gene predictions (refer to
section 1).
2.1 PASA transcript alignments
The O. degus RNA sequences processed by the PASA pipeline (r2014-04-17)
were obtained as described in section 1.3. As previously mentioned, this
process resulted in 25,621 PASA transcript assemblies. The transcriptome was
subsequently added to the PASA database (which uses GMAP/BLAT as the
alignment engines). PASA was run with stringent settings: input sequences with
less than 95% identity to the genomic sequence over 90% of their length were
discarded.
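The stringency criterion above amounts to a simple two-threshold filter, sketched below. PASA applies these thresholds internally; the record type, field names and toy values here are hypothetical and only illustrate the rule.

```python
from dataclasses import dataclass


@dataclass
class TranscriptAlignment:
    name: str
    percent_identity: float  # identity to the genome over the aligned region (0-100)
    percent_aligned: float   # share of the transcript length that is aligned (0-100)


def passes_thresholds(aln, min_identity=95.0, min_aligned=90.0):
    """Keep an alignment only if it meets both the identity and the length criterion."""
    return aln.percent_identity >= min_identity and aln.percent_aligned >= min_aligned


alignments = [
    TranscriptAlignment("t1", 98.7, 99.0),  # kept
    TranscriptAlignment("t2", 93.2, 99.0),  # discarded: identity below 95%
    TranscriptAlignment("t3", 99.1, 72.5),  # discarded: less than 90% of length aligned
]
kept = [aln.name for aln in alignments if passes_thresholds(aln)]
```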
2.2 Protein alignments
Uniprot-90 [8] and 26,259 highly curated Uniprot-Swissprot protein sequences
(ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/uniprot_sprot_rodents.dat.gz)
were split-mapped to the O. degus
genome using the program SPALN2 [9] with M. musculus-specific parameters.
We also used the spliced-protein alignment tool exonerate [10] to map the
Uniprot-Swissprot protein sequences to the scaffolds of O. degus (Table 2).
2.3 Combining the different EVM sources
The resulting alignments were then filtered as suggested in the EVM
documentation (http://evidencemodeler.sourceforge.net/).
Gene predictions were obtained as previously described (sections 1.1 - 1.4),
modified as recommended (http://evidencemodeler.sourceforge.net/) and
added to the EVM pipeline.
Subsequently the transcript alignments, protein alignments and ab initio gene
models were combined into consensus CDS models by EVM using different
weights. These weights (shown in Table 2) were selected following the EVM
documentation (http://evidencemodeler.sourceforge.net/) and, for
the ab initio predictions, also on the basis of the accuracy of the different programs in
predicting sequences in the evaluation “artificial scaffold” for this species (refer to
section 1 and Table 1).
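EVM reads its weights from a plain-text file with three columns: evidence class, source name and weight. The sketch below writes the values of Table 2 in that layout. The source labels are hypothetical identifiers (labels containing spaces, as in Table 2, are replaced by underscore-separated names); in a real run they must match the source column of the corresponding input files.

```python
# Weights taken from Table 2; source labels are illustrative assumptions.
EVM_WEIGHTS = [
    ("ABINITIO_PREDICTION", "Augustus", 1),
    ("ABINITIO_PREDICTION", "AugustusHints", 1.75),
    ("ABINITIO_PREDICTION", "geneid", 1),
    ("ABINITIO_PREDICTION", "sgp2", 1.25),
    ("ABINITIO_PREDICTION", "geneid_introns", 1.5),
    ("ABINITIO_PREDICTION", "sgp2_introns", 1.75),
    ("ABINITIO_PREDICTION", "SNAP", 0.3),
    ("PROTEIN", "spaln2_uniprot90", 5),
    ("PROTEIN", "spaln2_swissprot", 4),
    ("PROTEIN", "exonerate_swissprot", 4),
    ("TRANSCRIPT", "PASA", 10),
]


def format_weights(rows):
    """Return the weights as tab-separated text, one evidence source per line."""
    return "".join(
        f"{evidence_class}\t{source}\t{weight}\n"
        for evidence_class, source, weight in rows
    )


weights_text = format_weights(EVM_WEIGHTS)
```

The resulting text can be saved to a file and passed to EVM as its weights file. The ordering of the weights mirrors the reasoning in the text: transcript alignments are trusted most, protein alignments next, and ab initio predictions least, with SNAP far below the others.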
We also determined which sources of evidence (ab initio, protein or transcript)
EVM used to build each of the 35,524 consensus gene models (Table 3) and
removed weakly supported EVM consensus gene models from the initial reference
set. We filtered out 1) the 1,596 reference gene models supported exclusively by
SNAP predictions (taking into consideration the low accuracy of this prediction
tool; refer to Table 1) and 2) the single-exon consensus gene models
supported solely by geneid, sgp2 or augustus predictions that were
shorter than 300 nucleotides and had an EVM score below 5. This resulted in a final set of
31,945 consensus CDS models, which were then updated with UTRs and
alternative exons through five rounds of PASA’s routine to update annotations.
The resulting transcripts were grouped into genes, and a pre-selected
species-specific identifier was assigned to the genes, transcripts and protein
products derived from them.
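The two filtering rules can be written as a single predicate, sketched below. The record type, field names and source labels are hypothetical, and the sketch assumes that the length and score conditions of rule 2 must both hold for a model to be removed, which is how the sentence above reads.

```python
from dataclasses import dataclass

# Source labels are illustrative; they stand for the non-SNAP gene predictors.
NON_SNAP_PREDICTORS = {
    "geneid", "geneid+introns", "sgp2", "sgp2+introns", "Augustus", "AugustusHints",
}


@dataclass
class ConsensusModel:
    model_id: str
    sources: frozenset  # evidence sources supporting the model
    n_exons: int
    cds_length: int     # nucleotides
    evm_score: float


def keep_model(model):
    """Apply the two removal rules; return True if the model is retained."""
    # Rule 1: supported exclusively by SNAP predictions.
    if model.sources == {"SNAP"}:
        return False
    # Rule 2: short, low-scoring single-exon model with gene-predictor support only.
    predictor_only = len(model.sources) > 0 and model.sources <= NON_SNAP_PREDICTORS
    weak_single_exon = (
        model.n_exons == 1
        and predictor_only
        and model.cds_length < 300
        and model.evm_score < 5
    )
    return not weak_single_exon


models = [
    ConsensusModel("m1", frozenset({"SNAP"}), 3, 900, 12.0),           # removed (rule 1)
    ConsensusModel("m2", frozenset({"geneid"}), 1, 240, 3.1),          # removed (rule 2)
    ConsensusModel("m3", frozenset({"geneid", "PASA"}), 1, 240, 3.1),  # kept: transcript support
    ConsensusModel("m4", frozenset({"sgp2"}), 1, 240, 7.0),            # kept: score not below 5
]
retained = [m.model_id for m in models if keep_model(m)]
```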
2.4 EVM consensus annotation statistics
Finally, as a quality control, the protein products obtained from the
reference annotation of this species were aligned against the exhaustive
NCBI non-redundant (NR-201402) database using BLASTP, the “protein vs. protein”
“flavor” of the sequence comparison tool BLAST (E = 10^-2 with a
minimum identity of 25%), to determine what percentage of the annotated protein products
matched a sequence in this large public biological-sequence database. Results
showed that ~97% of the proteins in our consensus EVM reference annotation matched an
NR protein given the criteria above (Table 4). Table 4 also contains a
wide range of statistics obtained both from the O. degus assembly and from the
consensus EVM protein-coding annotation, as well as statistics on the NCBI GNOMON-based
(ftp://ftp.ncbi.nlm.nih.gov/genomes/Octodon_degus/RNA/) protein-coding
annotation whose gene models were used as the source of transcript evidence
for EVM.
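The percentage of proteins with a database match can be computed from BLAST results as sketched below. The sketch assumes that BLASTP output is available in the standard 12-column tabular format (query ID in column 1, percent identity in column 3, E-value in column 11); the output format actually used is not stated in this document, and the function name and toy records are hypothetical.

```python
def fraction_with_hit(blast_lines, query_ids, max_evalue=1e-2, min_identity=25.0):
    """Fraction of query proteins with at least one hit passing both thresholds."""
    queries = set(query_ids)
    matched = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        query = fields[0]
        identity = float(fields[2])
        evalue = float(fields[10])
        if evalue <= max_evalue and identity >= min_identity:
            matched.add(query)
    return len(matched & queries) / len(queries) if queries else 0.0


# Toy tabular records: only p1 passes both thresholds.
toy_hits = [
    "p1\tNR_a\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-30\t200",
    "p2\tNR_b\t20.0\t50\t40\t0\t1\t50\t1\t50\t1e-5\t40",  # identity below 25%
    "p3\tNR_c\t60.0\t50\t20\t0\t1\t50\t1\t50\t0.5\t30",   # E-value above 1e-2
]
fraction = fraction_with_hit(toy_hits, ["p1", "p2", "p3", "p4"])
```

Proteins with no hit at all (p4 in the toy example) are counted in the denominator, which is what a statement such as “~97% of the proteins matched an NR protein” requires.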
Program/param               | SN   | SP   | SNe  | SPe  | SNg  | SPg
Geneid mam/hs               | 0.83 | 0.75 | 0.65 | 0.69 | 0.09 | 0.06
Geneid+intron mam/hs        | 0.92 | 0.82 | 0.83 | 0.79 | 0.24 | 0.17
SGP2 odegus / Hs            | 0.90 | 0.82 | 0.77 | 0.73 | 0.12 | 0.08
SGP2+intron odegus / mam/hs | 0.95 | 0.86 | 0.86 | 0.79 | 0.26 | 0.17
Augustus+hints mam/hs       | 0.87 | 0.94 | 0.81 | 0.90 | 0.33 | 0.35
Augustus mam/hs             | 0.81 | 0.84 | 0.68 | 0.75 | 0.06 | 0.07
SNAP mam/hs                 | 0.83 | 0.45 | 0.57 | 0.32 | 0.03 | 0.01
Table 1. Accuracy of gene prediction on an O. degus “artificial scaffold” consisting of 238
concatenated O. degus test sequences (with 800 nucleotides of sequence between each of the
gene models) using the ab initio programs geneid, augustus and SNAP with their pre-existing
Mammalian/H. sapiens parameter files (i.e. “mam/hs”). SGP2 (homology
evidence-based prediction tool that used the genome of H. sapiens as reference) and
Augustus using RNASeq and transcript evidence (i.e. “augustus+hints”) were also
evaluated on the same set of sequences, as were geneid (geneid+introns) and SGP2 (SGP2+introns)
using introns as external evidence. (SN & SP: sensitivity & specificity at
nucleotide level; SNe & SPe: sensitivity & specificity at exon level; SNg & SPg: sensitivity &
specificity at gene level).
Type                | Source                      | Weight
ABINITIO_PREDICTION | Augustus                    | 1
ABINITIO_PREDICTION | AugustusHints               | 1.75
ABINITIO_PREDICTION | geneid                      | 1
ABINITIO_PREDICTION | sgp2                        | 1.25
ABINITIO_PREDICTION | geneid+introns              | 1.5
ABINITIO_PREDICTION | sgp2+introns                | 1.75
ABINITIO_PREDICTION | SNAP                        | 0.3
PROTEIN             | SPALN2 uniprot90            | 5
PROTEIN             | SPALN2 uniprot-swissprot    | 4
PROTEIN             | exonerate uniprot-swissprot | 4
TRANSCRIPT          | PASA                        | 10
Table 2. Weights used by EVM to create a consensus CDS model for O. degus. (SPALN2
uniprot90: SPALN2 against Uniprot90 proteins; SPALN2 uniprot-swissprot: SPALN2 against rodent
uniprot/swissprot curated proteins; exonerate uniprot-swissprot: exonerate against rodent
uniprot/swissprot curated proteins).
Type of source of evidence | Number of consensus gene models supported by the type of source of evidence (% of total number of EVM reference gene models supported)
PASA transcript alignments | 15,165 (42.6894%)
Protein alignments | 18,266 (51.4188%)
Protein OR PASA alignments | 18,880 (53.1472%)
Protein AND PASA alignments | 14,551 (40.961%)
Protein, PASA and at least one source of ab initio predictions | 14,550 (40.9582%)
Exclusively ab initio evidence (geneid, geneidi, sgp2, sgp2i, augustus, augustushints or snap) | 16,643 (46.85%)
Only one source of ab initio predictions (No protein or transcript evidence) | 9,382 (26.4103%)
At least two sources of ab initio evidence (No protein or transcript evidence) | 7,261 (20.4397%)
All sources of ab initio evidence (No protein or transcript evidence) | 2,218 (6.24367%)
just geneid, geneidi/sgp2, sgp2i (No protein or transcript evidence) | 6,096 (17.1602%)
just geneid/geneidi (No protein or transcript evidence) | 1,273 (3.58349%)
just sgp2/sgp2i (No protein or transcript evidence) | 2,196 (6.18174%)
just augustus/augustus+hints (No protein or transcript evidence) | 1,690 (4.75735%)
just SNAP (No protein or transcript evidence) | 1,596 (4.49274%)
singleEXON genes (with ab initio evidence from more than 1 program and/or protein/PASA evidence) | 5,952 (16.7549%)
Table 3. Breakdown of the types of evidence used to build the 35,524 gene-model EVM
consensus set (prior to filtering for weakly supported gene models).
Annotation versions | O.degus 2a (EVM-generated) | O.degus (ncbi GNOMON) - protein-coding only
Genome length (Mbases) | 2,995.89 |
Number of scaffolds | 7,134 |
Number of protein-coding genes | 31,739 | 20,779
Gene density (genes/Kbase) | 0.0106 | 0.007
Number of protein-coding transcripts | 36,866 | 26,248
Transcripts/gene (range) (% genes with more than 1 transcript) | 1.16 (SD 0.72) (1 – 32) (9.24%) | 1.26 (SD 0.94) (1 – 31) (15%)
Number of transcripts with UTRs | 10,648 | -
Number of proteins | 36,575 | 26,248
Number of complete proteins (%) | 33,858 (92.57%) | -
Number/(%) proteins with similarity to sequences in the NCBI NR database (E=10^-2; min. identity=25%) | 35,475 (97%) | -
Avg. length of proteins (range) | 461.96 aa. (SD 593.73) (25 – 34,458) | 577.56 aa. (SD 641.03) (23 – 34,357)
Avg. length of full-length proteins (range) | 478.57 aa. (SD 602.27) (25 – 34,458) | -
Number of partial proteins (not starting with "M") | 1,842 (5.04%) | 259 (0.98%)
Avg. length of partial proteins (not starting with "M") | 253.11 aa. (SD 431.8) | -
Number of partial proteins (no terminal STOP codon) | 1,589 (4.34%) | (can’t determine as gnomon protein set has no clear STOP signal)
Avg. length of partial proteins (no terminal STOP codon) | 213.91 aa. (SD 350.85) | -
Number of partial proteins (not starting with an M -and- no terminal STOP codon) | 714 (1.95%) | -
Avg. length of partial proteins (not starting with an M -and- no terminal STOP codon) | 158.83 aa. (SD 261.08) | -
Number of partial proteins (not starting with an M -or- no terminal STOP codon) | 2,717 (7.43%) | -
Avg. length of partial proteins (not starting with an M -or- no terminal STOP codon) | 254.96 aa. (SD 423.14) | -
Number of protein-coding exons | 288,884 | 268,660
Number of introns | 252,018 | 242,412
Number of UTRs (spliced) | 19,003 | -
Number of single-exon genes | 10,114 | 3,156
Number of multi-exonic transcripts (genes) | 26,752 (21,740) | 23,092 (17,623)
Exons/transcript (range) (excludes single-exon genes) | 10.42 (SD 10.50) (2 – 313) | 11.49 (SD 10.35) (2 – 313)
Introns/transcript (range) | 9.42 (SD 10.50) (1 – 312) | 10.49 (SD 10.35) (1 – 312)
“spliced” UTRs/transcript (range) | 1.785 (SD 0.74) (1 – 5) | -
Avg. length of introns (range) | 5,998 (SD 19,994.1) (21 – 734,060) | 5,613.03 (SD 19,909.6) (30 – 1,116,408)
Avg. length of mono-exonic genes | 519.27 (SD 430.56) | 872.88 (SD 618.70)
Avg. length of exons (excludes mono-exonic genes) | 165.37 (SD 233.34) | 161.25 (SD 230.37)
Avg. length of first exons | 230.78 (SD 336.07) | -
Avg. length of internal exons | 149.24 (SD 194.59) | -
Avg. length of terminal exons | 235.72 (SD 352.41) | -
Avg. length of CDS (range) | 1,392.9 (SD 1,782.42) (75 – 103,074) | 1,736.06 (SD 1,923.46) (69 – 103,074)
Avg. length of UTRs (range) | 653.40 (SD 942.07) (1 – 11,857) | -
Avg. length of primary transcripts | 43,714.8 (SD 107,530) | 56,055.1 (SD 117,349)
G+C content exonic (mono-exonic genes) | 49.72% (SD 7.63%) | 51.85% (SD 8.56%)
G+C content exonic (excludes mono-exonic genes) | 52.53% (SD 7.47%) | 53.52% (SD 7.42%)
G+C content exonic (first exons) | 53.62% (SD 10.85%) | -
G+C content exonic (internal exons) | 51.27% (SD 9.67%) | -
G+C content exonic (terminal exons) | 53.59% (SD 10.84%) | -
G+C content intronic | 45.05% (SD 11.54%) | 45.45% (SD 11.61%)
G+C content genomic | 40.16% (SD 5.63%) |
G+C content UTRs | 53.76% (SD 5%) | -
Table 4. Statistics for the EVM-generated and GNOMON-based protein-coding
reference annotations for O. degus
3 References
1. Blanco, E., Parra, G., Guigó, R. Using geneid to identify genes. Curr Protoc
Bioinformatics. 4.3. (2007)
2. Parra, G., Blanco, E. & Guigó, R. GeneID in Drosophila. Genome Res 10 511-515
(2000)
3. Haas, B.J., et al. Improving the Arabidopsis genome annotation using maximal
transcript alignment assemblies. Nucleic Acids Res 31 5654-5666 (2003)
4. Parra, G. et al. Comparative gene prediction in human and mouse. Genome Res
13 108-117 (2003)
5. Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts.
Nucleic Acids Res 34 W435-439 (2006)
6. Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5 59 (2004)
7. Haas, B. J. et al. Automated eukaryotic gene structure annotation using
EVidenceModeler and the Program to Assemble Spliced Alignments. Genome
Biol 9 R7 (2008)
8. Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets
of protein or nucleotide sequences. Bioinformatics 22 1658-1659 (2006)
9. Iwata, H. & Gotoh, O. Benchmarking spliced alignment programs including
Spaln2, an extended version of Spaln that incorporates additional species-specific
features. Nucleic Acids Res 40 e161 (2012)
10. Slater, G. & Birney, E. Automated generation of heuristics for biological sequence
comparison. BMC Bioinformatics 6 31 (2005)
Acknowledgments
We thank Tyler Alioto of the CNAG (http://www.cnag.cat/institution/) in Barcelona for
insightful technical assistance and for the development of some of the scripts used in
the analysis.
This work was partially supported by the «Instituto Nacional de Bioinformática»
(INB) from ISCIII in Spain.