§S0.1 Gene Prediction Methodology

Gene structures were predicted using a combination of FGENESH, FGENESH+, and GENEWISE (Figure S0.1). FGENESH and FGENESH+ are gene prediction programs acquired from Softberry.com; GENEWISE is part of the WISE2 package developed by Ewan Birney and is available from the Sanger Centre. Both FGENESH and FGENESH+ use a statistical model of gene structure that must be trained on each organism for accurate prediction. We acquired these programs already trained by Softberry on Neurospora sequences. GENEWISE, as we ran it, performs a spliced alignment of a protein sequence to genomic sequence to predict a gene structure. Although GENEWISE does use some species-specific parameters, most notably for intron nucleotide statistics and splice site consensus sequences, these were set to non-species-specific defaults.

We post-processed incomplete GENEWISE protein alignments by extending the first exon upstream to the nearest start codon and by extending the last exon downstream to the first stop codon. If a stop codon was encountered upstream of a gene before a start codon could be found, the gene call was not used. An assessment of the accuracy of GENEWISE, FGENESH, and FGENESH+ is described in §S0.2. Briefly, these three gene callers were combined in the following manner:

1. FGENESH was run on the entire genomic sequence to provide an initial set of predicted genes. Each FGENESH prediction was put into a set of EVIDENCE_GENES.
2. The genome was also searched against the non-redundant protein database using BLASTX.
3. Regions of the genome with BLASTX homology spanning over 80% of a protein (when sub-alignments are stitched together in a consistent fashion) were considered "Homologous Gene Regions" (HGRs).
4. HGRs were clustered into groups of HGRs that all implicated the same gene structure.
5. For each cluster of HGRs, the protein showing the most sequence similarity to the genome was passed to both FGENESH+ and GENEWISE to produce two gene predictions, provided the protein had >80% amino acid identity to the translated genome (cumulative across sub-alignments).
6. If the protein used in step 5 had >90% amino acid identity to the translated genome (cumulative across sub-alignments), then the GENEWISE call was favored over the FGENESH+ call, was used as the EVIDENCE_GENE for the HGR, and was added to the set of EVIDENCE_GENES.
7. If the protein used in step 5 had >80% but less than 90% amino acid identity to the translated genome (cumulative across sub-alignments), then the FGENESH+ call was favored over the GENEWISE call, was used as the EVIDENCE_GENE for the HGR, and was added to the set of EVIDENCE_GENES.
8. When EVIDENCE_GENES overlapped in their exons, the EVIDENCE_GENE with the least homology support (as measured by the sequence similarity of the protein used to make the call, or zero for FGENESH calls) was removed from the set of EVIDENCE_GENES.
9. All remaining EVIDENCE_GENES were then called as our official ANNOTATED_GENES. (Since EVIDENCE_GENES represent potential alternate gene predictions that may be based on homology, these genes are available to the user on the Neurospora website.)

§S0.2 Accuracy of Gene Callers on Test Set

The strategy for combining gene prediction programs was based on an assessment of the performance of these programs on a test set of 191 genomic sequences for Neurospora genes generously provided by Dr. Chuck Staben of the University of Kentucky. It is important to note, however, that Softberry used some of the proteins in this test set in the training of FGENESH and FGENESH+. Thus the performance of these programs on the test set is likely inflated relative to their performance on a random set of N. crassa genes. This is not an issue for GENEWISE, as it required no training.
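The combination strategy referred to here (steps 5 to 8 of §S0.1) can be illustrated with a minimal Python sketch. It is not the pipeline code used for the annotation. The GeneCall class, the function names, and the coordinate convention (inclusive start and end positions) are hypothetical, and the sketch assumes that step 7 applies to proteins with between 80% and 90% identity, consistent with the results in this section. How a protein at exactly 90% identity was treated is not stated; here it falls into the FGENESH+ branch. Strand is ignored for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GeneCall:
    caller: str                     # "FGENESH", "FGENESH+", or "GENEWISE"
    exons: List[Tuple[int, int]]    # inclusive (start, end) genome coordinates
    support: float = 0.0            # % identity of guiding protein; 0 for FGENESH


def pick_homology_call(identity: float,
                       genewise: GeneCall,
                       fgenesh_plus: GeneCall) -> Optional[GeneCall]:
    """Steps 5-7: choose the homology-based call for one HGR cluster."""
    if identity > 90.0:
        return genewise             # step 6: GENEWISE favored above 90%
    if identity > 80.0:
        return fgenesh_plus         # step 7: FGENESH+ favored for 80-90%
    return None                     # step 5: protein too dissimilar to use


def exons_overlap(a: GeneCall, b: GeneCall) -> bool:
    """True if any exon of a shares at least one base with an exon of b."""
    return any(s1 <= e2 and s2 <= e1
               for s1, e1 in a.exons
               for s2, e2 in b.exons)


def resolve_overlaps(calls: List[GeneCall]) -> List[GeneCall]:
    """Step 8, as a greedy pass: visit calls from most to least supported
    and keep a call only if it overlaps no call already kept."""
    kept: List[GeneCall] = []
    for call in sorted(calls, key=lambda c: c.support, reverse=True):
        if not any(exons_overlap(call, k) for k in kept):
            kept.append(call)
    return kept
```

The greedy ordering is one possible reading of step 8. The text does not specify how ties in support or chains of overlapping calls were handled, so other resolution orders are equally compatible with the description.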
In assessing the performance of the gene callers we asked two questions: How well does FGENESH perform in predicting genes in the absence of homology? How well do FGENESH+ and GENEWISE perform when given protein sequences of varying amino acid identity to the actual N. crassa translated gene?

A number of metrics were used to compare predicted gene structures against trusted gene structures. These metrics are described in detail elsewhere (ref. 1). We present the results here for the correlation coefficient only, defined as

CC = (Tp*Tn - Fp*Fn) / SQRT( (Tp+Fp)*(Tn+Fn)*(Tp+Fn)*(Tn+Fp) )

where Tp, Tn, Fp, and Fn are the numbers of true positives, true negatives, false positives, and false negatives respectively, all defined at the nucleotide level relative to the trusted gene. A CC of 1 represents a perfect prediction relative to the trusted gene.

The results of the analysis are shown in Figure S0.2, which presents histograms of correlation coefficients for all the genes predicted by FGENESH, FGENESH+, and GENEWISE on the test set of 191 genes. For FGENESH+ and GENEWISE, separate histograms are shown for predictions based on different levels of protein similarity. FGENESH produces relatively accurate predictions on this test set. FGENESH+ and GENEWISE, on the other hand, show very poor performance when proteins with less than 80% AA identity to the translated genome sequence are used as the basis for the gene prediction. For proteins with >80% AA identity, these gene callers do appear to improve gene prediction accuracy compared with FGENESH.

In the case of GENEWISE, for proteins with >90% AA identity, GENEWISE performs very well and appears to offer a significant improvement over FGENESH in prediction accuracy. For proteins with <90% AA identity, however, GENEWISE does not perform much better than FGENESH.
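As an illustration of the correlation coefficient defined above, the following Python sketch computes CC at the nucleotide level from two sets of exon coordinates. The function names, the inclusive coordinate convention, and the choice to return NaN when the denominator is zero are assumptions made for this example, not details taken from the evaluation actually performed.

```python
import math
from typing import Iterable, Set, Tuple


def coding_positions(exons: Iterable[Tuple[int, int]]) -> Set[int]:
    """Expand inclusive (start, end) exons into a set of nucleotide positions."""
    return {p for start, end in exons for p in range(start, end + 1)}


def nucleotide_cc(predicted: Set[int], trusted: Set[int], length: int) -> float:
    """Correlation coefficient between predicted and trusted coding positions.

    `length` is the number of nucleotides in the evaluated sequence, so that
    true negatives are the positions that are non-coding in both annotations.
    """
    tp = len(predicted & trusted)
    fp = len(predicted - trusted)
    fn = len(trusted - predicted)
    tn = length - tp - fp - fn
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    if denom == 0:
        return float("nan")  # CC is undefined when a marginal total is zero
    return (tp * tn - fp * fn) / denom


# A prediction that recovers only the first of two trusted exons.
trusted = coding_positions([(10, 19), (30, 39)])
partial = coding_positions([(10, 19)])
cc_partial = nucleotide_cc(partial, trusted, length=100)
cc_perfect = nucleotide_cc(trusted, trusted, length=100)
```

In this toy case the perfect prediction gives CC = 1, while the prediction missing one exon gives Tp = 10, Fp = 0, Fn = 10, Tn = 80 and therefore CC = 800/1200, or about 0.67, which would fall among the poor predictions (CC < 0.8).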
In the case of FGENESH+, for proteins with >90% AA identity, FGENESH+ performs slightly better than FGENESH but worse than GENEWISE. For proteins with between 80% and 90% AA identity, however, FGENESH+ does outperform GENEWISE and appears to perform slightly better than FGENESH in that the fraction of poor predictions (CC<0.8) diminishes.

§S0.3 Gene Validation Using ESTs

The Neurospora automated gene predictions were validated against a set of previously characterized ESTs. The ESTs were not used as evidence during the automated gene calling and could thus be used as an independent measure of the accuracy of the gene calls. To assess gene call accuracy, EST alignments were compared with predicted gene structures to detect potential errors.

Publicly available EST sequences were used from tissue-specific libraries (ref. 2; 5,136 sequences), time-of-day-specific libraries (ref. 3; 19,932 sequences), and a library derived from nitrogen mycelial mats (1,012 sequences). In addition, a set of 1,536 additional sequences from the perithecial library produced by Mary Anne Nelson was generated by the WICGR. ESTs were aligned to the genome sequence using BlastN. Each EST was then "assigned" to the locus where it produced the highest scoring alignment with greater than 95% nucleotide identity.

EST assignments were compared to predicted gene structures to detect the following types of prediction errors: (1) Spurious Predicted Introns: predicted gene structures with EST alignments completely spanning an intron (without corresponding gaps in the alignment); (2) Incorrect Splice Sites: predicted gene structures with EST alignments covering >50 bp of an intron and an adjacent exon; (3) Missing Exons: EST alignments not in predicted coding regions.

The results of these comparisons are shown in Figure S0.3. ESTs were assigned to a total of 3,090 different predicted protein-coding genes, or 31% of the total 10,082.
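The three conflict types described above can be sketched in Python as follows. This is an illustrative reading of the criteria, not the code used for the validation. The function names, the representation of an EST alignment as a list of ungapped genomic blocks with inclusive coordinates, and the requirement that a spanning block extend into both flanking exons are assumptions. Treating every block outside predicted exons as a missing exon is a simplification, since untranslated regions would also fall outside coding regions.

```python
from typing import List, Set, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) genome coordinates


def introns(exons: List[Interval]) -> List[Interval]:
    """Introns implied by consecutive exons of one predicted gene."""
    exons = sorted(exons)
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(exons, exons[1:])]


def overlap(a: Interval, b: Interval) -> int:
    """Number of bases shared by two inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def classify_est(blocks: List[Interval], exons: List[Interval],
                 min_intron_overlap: int = 50) -> Set[str]:
    """Return the set of conflict labels for one EST against one gene."""
    conflicts: Set[str] = set()
    gene_introns = introns(exons)
    for block in blocks:
        # (3) aligned block that touches no predicted exon
        if all(overlap(block, ex) == 0 for ex in exons):
            conflicts.add("missing_exon")
        for intron in gene_introns:
            spans = block[0] < intron[0] and block[1] > intron[1]
            enters_exon = block[0] < intron[0] or block[1] > intron[1]
            if spans:
                # (1) ungapped block runs from exon to exon across the intron
                conflicts.add("spurious_intron")
            elif overlap(block, intron) > min_intron_overlap and enters_exon:
                # (2) block covers >50 bp of intron plus an adjacent exon
                conflicts.add("incorrect_splice_site")
    return conflicts


gene = [(100, 200), (301, 400)]                       # one intron: 201-300
consistent = classify_est([(150, 200), (301, 350)], gene)
spanning = classify_est([(150, 350)], gene)
run_on = classify_est([(150, 260)], gene)
outside = classify_est([(500, 600)], gene)
```

For the toy gene above, the spliced alignment yields no conflict, the ungapped alignment across the intron is flagged as a spurious intron, the alignment reading 60 bp into the intron is flagged as an incorrect splice site, and the alignment outside the gene is flagged as a missing exon. As a check on the assignment figures quoted above, 3,090 of 10,082 genes is 30.6%, which rounds to the stated 31%.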
Of these genes, 6% appeared to include spurious introns, while 6% appeared to have incorrectly predicted splice sites. Only a small proportion (<1%) of regions with EST alignments appeared to be in predicted non-coding regions.

§S0.4 Gene Count

A total of 10,082 protein-coding genes were predicted using the described methods. Eliminating genes shorter than 100 aa that lack protein or EST similarity reduces this number to 9,200. Compared with the gene counts of other sequenced organisms, this number is roughly what is expected for the genome size (Figure S0.4). The number of genes is also consistent with previous estimates. Kelkar et al. (2001; ref. 4) estimated a total of 11,000 genes. Kupfer et al. (1997; ref. 5) provided an estimate of 8,000-9,000 genes for "an average 36 Mb ascomycetous fungi" and an estimate of 9,200 genes for Neurospora. Finally, both Bean et al. (2001; ref. 6) and Nelson et al. (1997; ref. 2) estimated 13,000 genes for Neurospora. Thus the number of genes predicted by our method is within the range of 9,200-13,000 estimated by previous authors.

Figure S0.1 Schematic of methodology used for Neurospora gene prediction. [Flowchart labels: Genome; Proteins (nr); Blastx; Genewise; Fgenesh+; Fgenesh; Gene Model; Model Selection; ESTs; Blat Alignment; Validation (Correction).]

Figure S0.2 Accuracy of gene calling on test gene set. See text for more details.

Figure S0.3 Distribution of detected conflicts between EST alignments and Neurospora predicted genes. [Chart values: genes with no conflict, 87%; genes with spurious introns, 6%; genes with incorrect splice site, 6%; missing exons, <1%.]
Figure S0.4 The number of predicted genes as a function of genome size for selected sequenced and annotated prokaryotic (circles) and eukaryotic (squares) organisms. [Plot title: Number of Genes Vs Genome Size. Axes: Genome Size (Mb), 0 to 160; Number of Genes, 0 to 30,000. Organisms plotted: P.abyssi, M.jannaschii, T.maritima, M.thermoautotrophicum, E.cuniculi, P.horikoshii, A.fulgidus, Synechocystis, S.solfataricus, V.cholerae, B.subtilis, E.coli, M.acetivorans, S.pombe, P.aeruginosa, K.yarowii, S.cerevisiae, M.loti, N.crassa, D.melanogaster, C.elegans*, A.thaliana*.]

References

1. Guigo, R., Agarwal, P., Abril, J. F., Burset, M. & Fickett, J. W. An assessment of gene prediction accuracy in large DNA sequences. Genome Res 10, 1631-42 (2000).
2. Nelson, M. A. et al. Expressed sequences from conidial, mycelial, and sexual stages of Neurospora crassa. Fungal Genet Biol 21, 348-63 (1997).
3. Bell-Pedersen, D., Shinohara, M. L., Loros, J. J. & Dunlap, J. C. Circadian clock-controlled genes isolated from Neurospora crassa are late night- to early morning-specific. Proc Natl Acad Sci U S A 93, 13096-101 (1996).
4. Kelkar, H. S. et al. The Neurospora crassa genome: cosmid libraries sorted by chromosome. Genetics 157, 979-90 (2001).
5. Kupfer, D. M., Reece, C. A., Clifton, S. W., Roe, B. A. & Prade, R. A. Multicellular ascomycetous fungal genomes contain more than 8000 genes. Fungal Genet Biol 21, 364-72 (1997).
6. Bean, L. E. et al. Analysis of the pdx-1 (snz-1/sno-1) region of the Neurospora crassa genome: correlation of pyridoxine-requiring phenotypes with mutations in two structural genes. Genetics 157, 1067-75 (2001).