Author list Francisco Câmara1,2, Emilio Palumbo1,2, Barbara Uszczynska1,2, Anna Vaslova1,2 and Roderic Guigo1,2 1 Centre for Genomic Regulation (CRG), Dr. Aiguader 88, 08003 Barcelona, Spain. 2 Universitat Pompeu Fabra (UPF), Barcelona, Spain METHODS FOR THE EVM-DERIVED (PROTEIN-CODING) GENOME ANNOTATION OF OCTODON DEGUS AND STATISTICS OF THE RESULTING CONSENSUS GENE MODELS (EVM) 1 Protein-coding ab initio or evidence-based gene predictions on the O. degus genome (Assembly OctDeg1.0; 2012/05/01) using four programs. 1.1 Geneid gene predictions on the O. degus v1.0 genome assembly. 1.2 SGP2 gene predictions on the O. degus v1.0 genome assembly. 1.3 Augustus gene predictions on the O. degus v1.0 genome assembly. 1.4 SNAP gene predictions on the O. degus v1.0 genome assembly. 2 Evidence-Modeler (EVM)-based genome annotation of the O. degus v1.0 assembly by combining different sources of evidence using weights. 2.1 PASA transcript alignments 2.2 Protein alignments 2.3 Combining the different EVM sources 2.4 EVM consensus annotation statistics 3. References 1 Protein-coding gene annotation of the O. degus genome 1.1 Obtaining geneid gene predictions H.sapiens/mammalian-specific parameter file using an Geneid [1,2] is an ab initio gene prediction program used to find potential protein-coding genes in anonymous genomic sequences. In the context of geneid training basically consists of computing position weight matrices (PWMs) or Markov models of order 1 for splice sites and start codons, and deriving a model of coding DNA (generally a Markov model of order 5). Furthermore, once a preliminary species-specific matrix is obtained it is further optimized by adjusting two internal matrix parameters: the cutoff of the scores of the predicted exons (eWF) and the ratio of signal to coding statistics information to be used (oWF). Geneid using its H. sapiens/mammal-specific parameter file has been used in the past to accurately generate gene predictions in several different mammalian genomes (i.e. M. musculus –Consortium MGS, 2002-, R. norvegicus –Gibbs et al., 2004). Accuracy of the Geneid generic mammal parameter file for predicting sequences in O. degus was tested on an “artificial scaffold” of 13.5 Mbases consisting of the 238 evaluation-set concatenated gene models with 800 nucleotides of intervening sequence between each of the genes (Table 1). This artificial scaffold was built using one of the modules of a recently developed Geneid training tool. The protein-coding gene sequences embedded into the artificial scaffold were selected from within the set of more than 26,000 NCBI GNOMON annotated O. degus transcripts (http://www.ncbi.nlm.nih.gov/bioproject/193441). Geneid can also use external information, such as the coordinates of known introns, to improve the accuracy of its predictions. In order to take advantage of this feature of geneid, we 1) generated a set of 25,621 transcripts split-mapping to the genome assembly of O. degus using the PASA pipeline (Program to Assemble Spliced Alignments; r2014-04-217 [3] –refer to section two for additional information) and 2) extracted all introns corresponding to the proteincoding sequences from these 25,621 PASA-generated gene models using the PASA module designed to produce potential training-set models. This process resulted in a final set of 159,789 introns. Furthermore, to measure the accuracy of the geneid predictions given this intronic evidence we mapped all available transcript evidence (refer to section 1.3) to the artificial scaffold with the PASA pipeline, and then used the program’s “training set”-building module to mimic creating a set of PASA transcripts mapping to the test gene models (within the artificial scaffold) and then proceeded to extract the introns corresponding to the generated protein-coding transcript sequences. We then calculated the accuracy of geneid (+introns) on the test scaffold, which showed a significant improvement in the performance of the program when using introns as evidence (i.e. 9-18% and 7-10% higher sensitivity and specificity respectively, than geneid at nucleotide/exon levels; refer to table 1refer to table 1). 1.2 Obtaining SGP2 gene predictions using a pre-existing mammalian-specific parameter file. SGP2 [4] is a syntenic gene prediction tool that combines ab initio gene prediction (geneid) with TBLASTX searches between two or more genome sequences to provide both sensitive and specific gene predictions, usually showing an improvement to geneid’s performance, especially by reducing the number of false-positive predictions. SGP2 requires a reference genome to which the target genome (in this case O. degus) is TBLASTX-compared. We decided to use the genome of H. sapiens (assembly hg38) as our “reference genome” based on the premise that O. degus (as M. musculus) has a high likelihood of being at an appropriate evolutionary distance from H. sapiens for optimal SGP2 performance. An “appropriate” evolutionary distance in the context of SGP2 means that the coding portions of the target/reference genomes being aligned would have a higher conservation than their intergenic and intronic regions. This feature would in turn contribute to a higher accuracy of SGP2 when compared to geneid, especially at the level of specificity. Furthermore, if our assumption proved correct using an existing SGP2 parameter file previously developed and optimized for predicting gene models in M. musculus/H. sapiens genomes [4] could be used to obtain the gene predictions. In order to test our assumption this pre-existing “Mmus-Hsap” SGP2 parameter file was evaluated on the same “artificial scaffold” used to test geneid (refer to the previous section). The SGP2 predictions on the “artificial scaffold” (using H. sapiens as the reference genome) show this is program is 7-12% more sensitive, and 4-7% more specific than geneid at nucleotide and exon levels (refer to Table 1). SGP2 (an extension of geneid) can also use external information (i.e. introns) improve the accuracy of its predictions. We used the same set of introns as external evidence for SGP2 as we had for geneid (section 1.1). Again the accuracy of SGP2 (+introns) on the test scaffold confirmed a marked improvement in the performance of the program when introns were used as evidence (i.e. 12-25% higher sensitivity than geneid at nucleotide/exon levels; refer to table 1). 1.3 Obtaining augustus gene predictions using a pre-existing mammalian-specific parameter file We also produced O. degus protein-coding gene annotations using the gene prediction tool Augustus. Augustus is a program that predicts genes in eukaryotic genomic sequences and is “re-trainable”. The program is based on a Hidden Markov Model and integrates a number of known methods and submodels (5). In order to predict gene sequences in O. degus using augustus we used the program’s pre-existing mammal/H. sapiens parameter file. We estimated the program’s accuracy in predicting O. degus sequences by evaluating it on the same 13.5 Mbase artificial scaffold consisting of the 238 concatenated gene models with 800 nucleotides of intervening sequence used to evaluate the in-house geneid and SGP2 programs (Table 1). We also took advantage of augustus’ potential to use external evidence to improve its performance. We did this by obtaining a set of predictions that used the mammal/H. sapiens augustus parameter file in combination with PASAderived transcript evidence obtained for this species (refer to section 2 for additional information). Briefly, for O. degus this transcript evidence set consisted of 25,621 PASA transcripts built from an initial set of 1,767,640 sequences from various different sources: 1) O.degus mRNA transcripts (deemed to be proteincoding) obtained at the NCBI O. degus project site (fftp://ftp.ncbi.nlm.nih.gov/genomes/Octodon_degus/RNA/) 2) ESTs and mRNA sequences from five species of rodents (mouse, guinea pig, chinese hamster and rat) also found at the project pages for each of these rodents also at the NCBI. Refer to section 3 of this document for further details on how the PASA transcripts were generated. Our strategy for taking advantage of the large set of transcripts followed the methodology described in (http://augustus.gobics.de/binaries/readme.rnaseq.html) and in an article by Stanke et al. [5] and allowed us to obtain a higher-accuracy evidence-based set of Augustus(+hints) predictions on the O. degus assembly (refer to table 1). In order for the O. degus augustus matrix to take advantage of the external data we first had to optimize some internal parameters of the new augustus parameter file; the exonpart bonus for hints corresponding to PASA-evidence (“E”) was given a bonus of 1xE3. Also, for every exonpart that was not supported by the PASA evidence, the probability of the gene structure was given a “malus” or a penalization of 0.995. Furthermore, complete exons predicted by augustus that perfectly matched the exons in the external hints were given a bonus of 1xE4. The intron bonus for (PASA) hints of source E was set to 1xE4, meaning that a predicted intron would get this bonus when being exactly as in the PASA “hint”. 1.4 Obtaining SNAP predictions using a pre-existing mammalianspecific matrix Our final source of ab initio gene predictions to be used by the EVidence Modeler combiner (EVM r2012-06-25) [7)) was obtained using the program SNAP [6] using its pre-existing mammalian-specific matrix. SNAP was developed by Ian Korf and consists of a general-purpose gene finding program that can be used both on eukaryotic and prokaryotic genomes. SNAP is an acronym for “Semi-HMM-based Nucleic Aid Parser”. SNAP’s performance was the poorest from among the three different sources of ab initio/homology-based gene predictions used in this project but was still used as a source for EVM, albeit with a fraction of the weight given to the other ab initio tools. Geneid/SGP2 (both with or without introns), augustus (with or without “hints”), and SNAP, using their pre-existing set of mammal-specific parameter files were subsequently used to predict genes on the latest repeat-masked assembly of the O. degus genome (OctDeg1.0; 2012/05/01). This assembly is made up of 7,134 scaffolds. Given the mammalian/H.sapiens parameter file geneid predicted 68,023 protein-coding genes without external evidence and 61,839 sequences when using intronic data. SGP2 generated 55,558 predictions on the assembly and 49,619 gene models when using introns as external evidence (SGP2+introns). The program augustus predicted 32,581 genes on the assembly while its evidence-based variation [augustus(+hints)] produced 26,405 gene models. The gene predictor SNAP generated 167,786 predictions on this assembly. These programs were then used as input to a “combiner” (Evidence Modeler; EVM r2012-06-25) [7], which was the program employed to obtain the reference annotation for this genome. 2 EVM-based genome annotation of the O. degus assembly by combining different sources of evidence using weights. A combination of the Program to Assemble Spliced Alignments (PASA r2014-04-17) [3] and Evidence Modeler (EVM r2012-06-25) [7] was used to obtain consensus coding sequence (CDS) models using three main sources of evidence: aligned transcripts, aligned proteins, and gene predictions (refer to section 1). 2.1 PASA transcript alignments The O. degus RNA sequences processed by the PASA pipeline (r2014-0417) were obtained as described in section 1.3. As previously mentioned this process resulted in 25,621 PASA transcript assemblies. The transcriptome was subsequently added to the PASA database (which uses GMAP/BLAT as the alignment engines). PASA was set to be quite stringent. Input sequences with less than 95% identity to the genomic sequence over 90% of their length were discarded. 2.2 Protein alignments Furthermore, Uniprot-90 [8] and 26,259 Uniprot-Swissprot highly curated protein sequences (ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/taxon omic_divisions/uniprot_sprot_rodents.dat.gz) were split-mapped to the O. degus genome by using the program SPALN2 [9] with M. musculus-specific parameters. We also used the spliced-protein alignment tool exonerate [10] to map the Uniprot-Swissprot protein sequences to the scaffolds of O. degus (table 2). 2.3 Combining the different EVM sources The resulting alignments were then filtered as suggested in the EVM documentation (http://evidencemodeler.sourceforge.net/). Gene predictions were obtained as previously described (sections 1.1 - 1.3) and also modified as recommended (http://evidencemodeler.sourceforge.net/)) and added to the EVM pipeline. Subsequently the transcript alignments, protein alignments and the ab initio gene models were combined into consensus CDS models by EVM using different weights. These weights (shown in table 2) were selected following the EVM documentation (i.e. http://evidencemodeler.sourceforge.net/) and, with regard to the ab initio predictions, also based on the accuracy of the different programs in predicting sequences in the evaluation “artificial scaffold” for this species (refer to sections 1.1 and table 1). We also determined which sources of evidence (ab initio, protein or transcript) EVM used to build each of the 35,524 consensus gene models (table 3) and removed several of the EVM consensus gene models from the initial reference set. We filtered-out 1) the 1,596 reference gene models supported exclusively by SNAP predictions (taking into consideration the low accuracy of this prediction tool – refer to table 1) as well as the 2) single-exon consensus gene models derived from predictions solely supported by geneid, sgp or augustus predictions, with <300 nucleotides and an EVM score <5. This resulted in a final set of 31,945 consensus CDS models which were then updated with UTRs and alternative exons through five rounds of PASA’s routine to update annotations. The resulting transcripts were grouped into genes and then a pre-selected species-specific identifier was assigned to the genes, transcripts and protein products derived from them. 2.4 EVM consensus annotation statistics Finally, and as a quality control, the protein products obtained from the reference annotation of this species was aligned against either the exhaustive NCBI non-redundant (NR-201402) database using the “protein vs. protein” BLASTP “flavor” of the sequence comparison tool BLAST (E=10-2 with a minimum identity of 25%) to determine what percentage of the annotated genes matched a sequence of this large biological-sequence public databases. Results showed that ~97% of our consensus EVM reference of this species matched an NR protein given the criteria above (table 4). Furthermore, table 4 contains a wide-range of statistics obtained both from the O. degus assembly and the consensus EVM protein-coding consensus protein-coding annotation. Furthermore table 4 also contains statistics on the NCBI GNOMON-based (ftp://ftp.ncbi.nlm.nih.gov/genomes/Octodon_degus/RNA/) protein coding annotation whose gene-models were used as the source of transcript evidence for EVM. Program/param SN SP SNe SPe SNg SPg Geneid man/hs 0.83 0.75 0.65 0.69 0.09 0.06 Geneid+intron mam/hs 0.92 0.82 0.83 0.79 0.24 0.17 SGP2 odegus / Hs 0.90 0.82 0.77 0.73 0.12 0.08 SGP2+intron odegus / mam/hs 0.95 0.86 0.86 0.79 0.26 0.17 Augustus+hints mam/hs 0.87 0.94 0.81 0.90 0.33 0.35 Augustus mam/hs 0.81 0.84 0.68 0.75 0.06 0.07 SNAP mam/hs 0.83 0.45 0.57 0.32 0.03 0.01 Table 1. Accuracy of gene prediction on an O. degus “artificial scaffold” consisting of 238 concatenated O. degus test sequences (with 800 nucleotides of sequence between each of the gene models) using the ab initio programs geneid, augustus and SNAP with their pre-existing Mammalian/H. sapiens parameter files (i.e. “mam/hs”). The accuracy of SGP2 (homology evidence-based prediction tool that used the genome of H. sapiens as reference) and that of Augustus (using RNASeq and transcript evidence i.e. “augustus+hints”) were also tested for accuracy on the same set of sequences. Geneid (geneid+introns) and SGP2 (SGP2+introns) using introns as external evidence were also evaluated. (SN & SP: sensitivity & specificity at nucleotide level; SNe & SPe: sensitivity & specificity at exon level; SNg & SPg: sensitivity & specificity at gene level). Type Source Weight ABINITIO_PREDICTION Augustus 1 ABINITIO_PREDICTION AugustusHints 1.75 ABINITIO_PREDICTION geneid 1 ABINITIO_PREDICTION sgp2 1.25 ABINITIO_PREDICTION geneid+introns 1.5 ABINITIO_PREDICTION sgp2+introns 1.75 ABINITIO_PREDICTION SNAP 0.3 PROTEIN SPALN2 uniprot90 5 PROTEIN SPALN2 uniprot-swissprot 4 PROTEIN exonerate uniprot-swissprot 4 TRANSCRIPT PASA 10 Table 2. Weights used by EVM to create a consensus CDS model for O. degus. (SPLAN2 against Uniprot90 proteins; SPALN2 uniprot-swissprot: SPALN2 against rodent uniprot/swissprot curated proteins; Exonerate uniprot-swissprot: exonerate against against rodent uniprot/swissprot curated proteins; Type of source of evidence Number of consensus gene models supported by the type of source of evidence (% of total number of EVM reference gene models supported) PASA transcript alignments 15,165 (42.6894%) Protein alignments 18,266 (51.4188%) Protein OR PASA alignments 18,880 (53.1472%) Protein AND PASA alignments 14,551 (40.961%) Protein, PASA and at least one source of ab 14,550 (40.9582%) initio predictions Exclusively ab initio evidence (geneid, 16,643 (46.85%) geneidi,sgp2,sgp2i,augustus,augustushints or snap) Only one source of ab initio predictions (No 9,382 (26.4103%) protein or transcript evidence) At least two sources of ab initio evidence 7,261 (20.4397%) (No protein or transcript evidence) All sources of ab initio evidence (No protein 2,218 (6.24367%) or transcript evidence) just geneid,geneidi/spg2,sgp2i (No protein 6,096 (17.1602%) or transcript evidence) just geneid/geneidi (No protein or transcript 1,273 (3.58349%) evidence) just sgp2/sgp2i (No protein or transcript 2,196 (6.18174%) evidence) just augustus/augustus+hints (No protein or 1,690 (4.75735%) transcript evidence) just SNAP (No protein or transcript 1,596 (4.49274%) evidence) singleEXON genes (with ab initio evidence from more than 1 program and/or 5,952 (16.7549%) protein/PASA evidence Table 3. Break down of the types of evidence used to build the 35,524 gene-model EVM consensus set (prior to filtering for weakly supported gene models). Annotation versions O.degus 2a (EVMgenerated) Genome length (Mbases) O.degus (ncbi GNOMON) -proteincoding only2,995.89 number of scaffolds 7,134 Number of proteincoding genes 31,739 20,779 Gene density (genes/Kbase) 0.0106 0.007 Number of proteincoding transcripts 36,866 26,248 1.16 (SD 0.72) (1 – 32) (9.24%) 1.26 (SD 0.94) (1 – 31)(15%) Number of transcripts with UTRs 10,648 - Number of proteins 36,575 26,248 Number of complete proteins (%) 33,858 (92.57%) - 35,475 (97%) - Transcripts/gene (range) (% genes with more than 1 transcript) Number/(%) proteins with similarity to sequences in the NCBI NR database (E=10-2; min. identity=25%) Avg. length of proteins (range) 461.96 aa. (SD 593.73) (25 – 34,458) 577.56 aa. (SD 641.03) (23 – 34,357) Avg. length of fulllength proteins (range) 478.57 aa. (SD 602.27) (25 – 34,458) - Number of partial proteins (not starting with "M") 1842 (5.04%) 259 (0.98%) 253.11 aa. (SD 431.8) - 1589 (4.34%) (can’t determine as gnomon protein set has no clear STOP signal) Avg. length of partial proteins (no terminal STOP codon) 213.91 aa. (SD 350.85) - Number of partial proteins (not starting with an M and- no terminal STOP codon) 714 (1.95%) - Avg. length of partial proteins (not starting with an M and- no terminal STOP codon) 158.83 aa. (SD 261.08) - Number of partial proteins (not starting with an M - 2,717 (7.43%) - Avg. length of partial proteins (not starting with "M") Number of partial proteins (no terminal STOP codon) or- no terminal STOP codon) Avg. length of partial proteins (not starting with an M or- no terminal STOP codon) 254.96 aa. (SD 423.14) - Number of proteincoding exons 288,884 268,660 Number of introns 252,018 242,412 Number of UTRs (spliced) 19,003 - Number of singleexon genes 10,114 3,156 Number of multiexonic transcripts (genes) 26,752 (21,740) 23,092 (17,623) Exons/transcript (range) (excludes single-exon genes) 10.42 (SD 10.50) (2 – 313) 11.49 (SD 10.35) (2 – 313) Introns/transcript (range) 9.42 (SD 10.50) (1 – 312) 10.49 (SD 10.35) (1 – 312) “spliced” UTRs/transcript (range) 1.785 (SD 0.74) (1 - 5) - Avg. length of introns (range) 5,998 (SD 19,994.1) (21 – 734,060) 5,613.03 (SD 19,909.6) (30 – 1,116,408) Avg. length of monoexonic genes 519.27 (SD 430.56) 872.88 (SD 618.70) Avg. length of exons (excludes monoexonic genes) 165.37 (SD 233.34) 161.25 (SD 230.37) Avg. length of first exons 230.78 (SD 336.07) - Avg. length of internal exons 149.24 (SD 194.59) - Avg. length of terminal exons 235.72 (SD 352.41) - Avg. length of CDS (range) 1,392.9 (SD 1,782.42) (75 – 103,074) 1,736.06 (SD 1,923.46) (69 – 103,074) Avg. length of UTRs (range) 653.40 (SD 942.07) (1 11,857) - Avg. length of primary transcripts 43,714.8 (SD 107,530) 56,055.1 (SD 117,349) G+C content exonic (mono-exonic genes) 49.72% (SD 7.63%) 51.85% (SD 8.56%) G+C content exonic (excludes monoexonic genes) 52.53% (SD 7.47%) 53.52% (SD 7.42%) G+C content exonic (first exons) 53.62% (SD 10.85%) - G+C content exonic (internal exons) 51.27% (SD 9.67%) - G+C content exonic (terminal exons) 53.59% (SD 10.84%) - G+C content intronic 45.05% (SD 11.54%) 45.45% (SD 11.61%) G+C content genomic G+C content UTRs 40.16% (SD 5.63%) 53,76% (SD 5%) - Table 4. Statistics for the EVM-generated and GNOMON-based protein-coding reference annotations for O. degus 3 References 1. Blanco, E., Parra, G., Guigó, R. Using geneid to identify genes. Curr Protoc Bioinformatics. 4.3. (2007) 2. Parra, G., Blanco, E. & Guigó, R. GeneID in Drosophila. Genome Res 10 511-515 (2000) 3. Haas, B.J., et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31 5654-5666 (2003) 4. Parra, G. et al. Comparative gene prediction in human and mouse. Genome Res 13 108-117 (2003) 5. Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res 34 W435-439 (2006) 6. Korf I. Gene finding in novel Genomes. BMC Bioinformatics 5 59 (2004) 7. Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol 9 R7 (2008) 8. Li W. and Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22 1658-1659 (2006) 9. Iwata, H. and Gotoh, O., " Benchmarking spliced alignment programs including Spaln2, an extended version of Spaln that incorporates additional species-specific features" Nucleic Acids Research 40 (20) e161 (2012) 10. Slater, G. & Birney,E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6:31 1–11 (2005) Acknowledgments Tyler Alioto of the CNAG i(http://www.cnag.cat/institution/) in Barcelona for insightful technical assistance and development of some of the scripts used in the analysis. This work was partially supported by the «Instituto Nacional de Bioinformática» (INB) from ISCIII in Spain.