§S0.1 Gene Prediction Methodology
Gene structures were predicted using a combination of FGENESH, FGENESH+,
and GENEWISE (Figure S0.1). Both FGENESH and FGENESH+ are gene prediction
programs acquired from Softberry.com and GENEWISE is part of the WISE2 package
developed by Ewan Birney and is available from the Sanger Center.
Both FGENESH and FGENESH+ utilize a statistical model of gene structure that
requires training on each organism for accurate prediction. We acquired these programs
already trained by Softberry on Neurospora sequences.
GENEWISE, as we ran it, splices and aligns a protein sequence with genomic
sequence to predict a gene structure. Although GENEWISE does use some species-specific
parameters, most notably for intron nucleotide statistics and splice site consensus
sequences, these were set to non-species-specific defaults. We post-processed
GENEWISE incomplete protein alignments by extending the first exon upstream to the
nearest start codon, and by extending the last exon downstream to the first stop codon. If
a stop codon was encountered upstream of a gene before a start could be found, the gene
call was not used.
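The extension rule can be illustrated with a minimal Python sketch. It assumes a forward-strand gene, 0-based coordinates, and a flanking region that is read in frame with the first and last aligned codons without crossing an intron. The function names and the coordinate convention are illustrative assumptions and are not taken from the actual post-processing scripts.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def extend_to_start(seq, cds_start):
    """Walk upstream in frame from cds_start (0-based index of the first
    aligned codon). Return the index of the nearest start codon, or None
    if a stop codon (or the sequence edge) is reached first, in which
    case the gene call would be discarded."""
    pos = cds_start
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon == START_CODON:
            return pos
        if codon in STOP_CODONS:
            return None
        pos -= 3
    return None


def extend_to_stop(seq, cds_end):
    """Walk downstream in frame from cds_end (exclusive end of the last
    aligned codon). Return the exclusive end of the first stop codon,
    or None if the sequence ends before a stop codon is found."""
    pos = cds_end
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in STOP_CODONS:
            return pos + 3
        pos += 3
    return None
```

For example, with the toy sequence ATGAAACCCGGGTAGTT and a partial alignment covering only the CCC codon (positions 6 to 9), the sketch extends the call to the ATG at position 0 and to the end of the TAG stop codon at position 15.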
An assessment of the accuracy of GENEWISE, FGENESH, and FGENESH+ is
described in §S0.2. Briefly, these three gene callers were combined in the
following manner:
1. FGENESH was run on the entire genomic sequence to provide an initial set of
predicted genes. Each FGENESH prediction was put into a set of
EVIDENCE_GENES.
2. The genome was also searched against the non-redundant protein database using
BLASTX.
3. Regions of the genome with BLASTX homology spanning over 80% of a protein (when
sub-alignments are stitched together in a consistent fashion) were considered
"Homologous Gene Regions" (HGRs).
4. HGRs were clustered into groups of HGRs that all implicated the same gene structure.
5. For each cluster of HGRs, the protein showing the most sequence similarity to the
genome was passed to both FGENESH+ and GENEWISE to produce 2 gene
predictions, if the protein had >80% amino acid identity to the translated genome
(cumulative across sub-alignments).
6. If the protein used in step 5 had >90% amino acid identity to the translated genome
(cumulative across sub-alignments), then the GENEWISE call was favored over the
FGENESH+ call, and was used as the EVIDENCE_GENE for the HGR and added to
the set of EVIDENCE_GENES.
7. If the protein used in step 5 had >80% but less than 90% amino acid identity to the
translated genome (cumulative across sub-alignments), then the FGENESH+ call was
favored over the GENEWISE call, and was used as the EVIDENCE_GENE for the
HGR and added to the set of EVIDENCE_GENES.
8. When EVIDENCE_GENES overlapped in their exons, the EVIDENCE_GENE with
the least amount of homology support (as measured by the sequence similarity of the
protein used to make the call or zero for FGENESH calls) was removed from the set
of EVIDENCE_GENES.
9. All remaining EVIDENCE_GENES were then called as our official
ANNOTATED_GENES.
(Since EVIDENCE_GENES represent potential alternate gene predictions that may be
based on homology, these genes are available to the user on the Neurospora website.)
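The selection and overlap-resolution steps of the list above can be sketched in Python as follows. This is a simplified illustration, assuming the 80% and 90% amino acid identity thresholds discussed in §S0.2. The `EvidenceGene` class, the function names, the treatment of an identity of exactly 90% (given here to FGENESH+), the greedy highest-support-first ordering, and the omission of strand information are all assumptions made for the example rather than details of the actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class EvidenceGene:
    caller: str     # "FGENESH", "FGENESH+", or "GENEWISE"
    exons: list     # list of (start, end) tuples, end exclusive
    support: float  # % identity of the supporting protein; 0 for FGENESH


def pick_homology_call(identity, fgenesh_plus_call, genewise_call):
    """Choose between the two homology-based calls for one HGR cluster."""
    if identity > 90:
        return genewise_call
    if identity > 80:
        return fgenesh_plus_call
    return None  # protein too dissimilar; no homology-based call is made


def exons_overlap(a, b):
    """True if any exon of gene a shares at least one base with an exon of b."""
    return any(s1 < e2 and s2 < e1
               for s1, e1 in a.exons
               for s2, e2 in b.exons)


def resolve_overlaps(genes):
    """Keep the best-supported gene wherever exons overlap.

    Genes are visited from highest to lowest support, and a gene is kept
    only if it does not overlap a gene that has already been kept."""
    kept = []
    for gene in sorted(genes, key=lambda g: g.support, reverse=True):
        if not any(exons_overlap(gene, k) for k in kept):
            kept.append(gene)
    return kept
```

In this sketch an FGENESH call (support 0) that shares exon sequence with a GENEWISE call built from a 95% identical protein is dropped, while a non-overlapping FGENESH call elsewhere on the sequence is retained.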
§S0.2 Accuracy of Gene Callers on Test Set
The strategy for combining gene prediction programs was based on an assessment of the
performance of these programs on a test set of 191 genomic sequences for Neurospora
genes generously provided by Dr. Chuck Staben of the University of Kentucky. It is
important to note, however, that Softberry used some of the proteins in this test set in the
training of FGENESH and FGENESH+. Thus the performance of these programs on the
test set is likely inflated relative to the performance on a random set of N. crassa genes.
This is not an issue for GENEWISE as it required no training. In assessing the
performance of the gene callers we asked two questions:

How well does FGENESH perform in predicting genes in the absence of
homology?

How well do FGENESH+ and GENEWISE perform when given protein
sequences of varying amino acid identity to the actual N. crassa translated gene?
A number of metrics were used to compare predicted gene structures against trusted
gene structures. These metrics are described in detail elsewhere1. We will present the
results here for the correlation coefficient only, defined as
CC = (Tp*Tn - Fp*Fn) / SQRT( (Tp+Fp)*(Tn+Fn)*(Tp+Fn)*(Tn+Fp) )
where Tp, Tn, Fp, and Fn are the number of true positives, true negatives, false positives,
and false negatives respectively, all defined at the nucleotide level relative to the trusted
gene. A CC=1 represents a perfect prediction relative to the trusted gene.
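As an illustration of the formula above, the following Python sketch computes CC from nucleotide-level counts. Representing the predicted and trusted coding nucleotides as sets of positions, and returning NaN when the denominator is zero, are assumptions made for this example.

```python
import math


def nucleotide_counts(predicted, trusted, length):
    """Count Tp, Tn, Fp, Fn at the nucleotide level.

    predicted and trusted are sets of coding positions within a genomic
    sequence of the given length."""
    tp = len(predicted & trusted)
    fp = len(predicted - trusted)
    fn = len(trusted - predicted)
    tn = length - tp - fp - fn
    return tp, tn, fp, fn


def correlation_coefficient(tp, tn, fp, fn):
    """CC = (Tp*Tn - Fp*Fn) / sqrt((Tp+Fp)(Tn+Fn)(Tp+Fn)(Tn+Fp))."""
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    if denom == 0:
        return float("nan")  # undefined, e.g. when nothing is predicted
    return (tp * tn - fp * fn) / denom
```

A prediction that matches the trusted gene exactly gives CC = 1, while a prediction that adds 5 false-positive nucleotides to a 10-nucleotide trusted gene in a 100-nucleotide sequence gives a CC of about 0.79.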
The results of the analysis are shown in Figure S0.2 below. This figure shows a
histogram of Correlation Coefficients for all the genes predicted by FGENESH,
FGENESH+, and GENEWISE on the test set of 191 genes. In each of these figures,
histograms of gene prediction correlation coefficients are shown for predictions based on
different levels of protein similarity.
FGENESH produces relatively accurate predictions on this test set. FGENESH+ and
GENEWISE, on the other hand, show very poor performance when proteins with less than
80% AA identity to the translated genome sequence are used as a basis for the gene
prediction. For proteins with >80% AA identity, these gene callers do appear to allow an
improvement in gene prediction accuracy as compared with FGENESH.
In the case of GENEWISE, it can be seen that for proteins with >90% AA identity,
GENEWISE performs very well and appears to offer significant improvement over
FGENESH in prediction accuracy. For proteins with <90% AA identity, however,
GENEWISE does not perform much better than FGENESH.
In the case of FGENESH+, it can be seen that for proteins with >90% AA identity,
FGENESH+ performs slightly better than FGENESH but worse than GENEWISE. For
proteins with between 80% and 90% AA identity however, FGENESH+ does outperform
GENEWISE and appears to perform slightly better than FGENESH in that the fraction of
poor predictions (CC<0.8) diminishes.
§S0.3 Gene Validation Using ESTs
The Neurospora automated gene predictions were validated against a set of previously
characterized ESTs. The ESTs were not used as evidence during the automated gene
calling, and could thus be used as an independent measure of the accuracy of the gene
calls. To assess gene call accuracy, EST alignments were compared with predicted gene
structures to detect potential errors.
Publicly available EST sequences were used from tissue-specific libraries2 (5,136
sequences), time-of-day-specific libraries3 (19,932 sequences), and a library derived from
nitrogen mycelial mats (1,012 sequences). In addition, a set of 1,536 additional
sequences from the perithecial library produced by Mary Anne Nelson was generated by
the WICGR. ESTs were aligned to the genome sequence using BlastN. Each EST was
then “assigned” to the locus where it produced the highest scoring alignment with greater
than 95% nucleotide identity.
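The assignment rule can be sketched in Python as follows. The tuple layout of a hit (EST name, locus, alignment score, percent identity) is hypothetical, and the sketch reads the rule as "the highest-scoring alignment among those with more than 95% identity", which is one possible interpretation of the description above.

```python
def assign_ests(hits, min_identity=95.0):
    """Assign each EST to the locus of its best qualifying alignment.

    hits is an iterable of (est, locus, score, identity) tuples. Only
    alignments with identity strictly greater than min_identity are
    considered; among those, the highest-scoring one wins."""
    best = {}
    for est, locus, score, identity in hits:
        if identity <= min_identity:
            continue
        if est not in best or score > best[est][1]:
            best[est] = (locus, score)
    return {est: locus for est, (locus, score) in best.items()}
```

ESTs with no alignment above the identity threshold are left unassigned and do not appear in the returned dictionary.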
EST assignments were compared to predicted gene structure to detect the following
types of prediction errors: (1) Spurious Predicted Introns: predicted gene structures with
EST alignments completely spanning an intron (without corresponding gaps in the
alignment); (2) Incorrect Splice Sites: predicted gene structures with EST alignments
covering >50bp of an intron and an adjacent exon; (3) Missing Exons: EST alignments
not in predicted coding regions.
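A simplified Python sketch of the three checks is given below. It assumes that each EST alignment has been reduced to ungapped blocks given as (start, end) genomic intervals with exclusive ends, that exons of a predicted gene are given in the same form, and that the checks are applied in the order listed. The function names and the "no_conflict" label are assumptions made for the example.

```python
def overlap(a, b):
    """Number of bases shared by two (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def classify_block(block, exons, min_intron_overlap=50):
    """Classify one ungapped EST alignment block against a predicted gene."""
    exons = sorted(exons)
    # Predicted introns are the gaps between consecutive exons.
    introns = [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]

    # (1) The block runs straight through an entire predicted intron.
    if any(block[0] <= s and e <= block[1] for s, e in introns):
        return "spurious_intron"

    # (2) The block covers >50 bp of an intron and an adjacent exon.
    for i, intron in enumerate(introns):
        touches_exon = (overlap(block, exons[i]) > 0
                        or overlap(block, exons[i + 1]) > 0)
        if overlap(block, intron) > min_intron_overlap and touches_exon:
            return "incorrect_splice_site"

    # (3) The block lies entirely outside predicted coding regions.
    if all(overlap(block, exon) == 0 for exon in exons):
        return "missing_exon"

    return "no_conflict"
```

With exons at (100, 200) and (300, 400), a block at (150, 350) spans the whole intron, a block at (150, 260) covers 60 bp of the intron plus part of the first exon, and a block at (500, 600) lies outside all predicted exons.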
The results of these comparisons are shown in Figure S0.3. ESTs were assigned to
a total of 3,090 different predicted protein-coding genes or 31% of the total 10,082. Of
these genes, 6% appeared to include spurious introns, while 6% appeared to have incorrect
predicted splice sites. Only a small proportion (<1%) of regions with EST alignments
appeared to be in predicted non-coding regions.
§S0.4 Gene Count
A total of 10,082 protein-coding genes were predicted using the described methods.
Eliminating genes shorter than 100aa that lack protein or EST similarity reduces this
number to 9,200. Based on comparisons to gene counts of other sequenced organisms,
this number is roughly what is expected based on genome size (Figure S0.4).
The number of genes is also consistent with previous estimates. In (Kelkar et al., 2001)4 a
total of 11,000 genes were estimated. In (Kupfer et al., 1997)5, an estimate of 8000-9000
genes was provided for “an average 36 Mb ascomycetous fungi”, and an estimate of 9200
genes was provided for Neurospora. Finally, both (Bean et al., 2001)6 and (Nelson et al.,
1997)2 estimated 13,000 genes for Neurospora. Thus the number of genes predicted by
our method is within the range of 9200-13000 estimated by previous authors.
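The filter used in §S0.4 to arrive at the reduced gene count (eliminating genes shorter than 100aa that lack protein or EST similarity) can be expressed as a short Python sketch. The dictionary keys used to describe a gene are hypothetical and chosen only for this example.

```python
def filter_genes(genes, min_length_aa=100):
    """Drop predicted genes shorter than min_length_aa that have neither
    protein similarity nor EST similarity; keep everything else."""
    return [
        g for g in genes
        if g["length_aa"] >= min_length_aa
        or g["has_protein_hit"]
        or g["has_est_hit"]
    ]
```

A short gene is therefore retained whenever at least one form of similarity evidence supports it.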
[Figure S0.1 (flowchart). Labeled elements: Genome; Proteins (nr); Blastx; Genewise,
Fgenesh+, Fgenesh; Gene Model; Model Selection; ESTs; Blat Alignment; Validation
(Correction).]
Figure S0.1 Schematic of methodology used for Neurospora gene prediction.
Figure S0.2 Accuracy of gene calling on test gene set. See text for more details.
[Figure S0.3 (chart). Genes w/No conflict: 87%; Genes w/Spurious Introns: 6%;
Genes w/Incorrect Splice Site: 6%; Missing Exons: <1%.]
Figure S0.3 Distribution of detected conflicts between EST alignments and Neurospora predicted
genes.
[Figure S0.4 (plot). Title: Number of Genes Vs Genome Size. Y-axis: Number of Genes
(0 to 30000). X-axis: Genome Size (Kb) (0 to 160). Labeled organisms: P.abyssi,
M.jannaschii, T.maritima, M.thermoautotrophicum, E.cuniculi, P.horikoshii, A.fulgidus,
Synechocystis, S.solfataricus, V.cholerae, B.subtilis, E.coli, M.acetivorans, S.pombe,
P.aeruginosa, K.yarowii, S.cerevisiae, M.loti, N.crassa, D.melanogaster, C.elegans*,
A.thaliana*.]
Figure S0.4 The number of predicted genes as a function of genome size for selected sequenced and
annotated prokaryotic (circles) and eukaryotic (squares) organisms.
References
1. Guigo, R., Agarwal, P., Abril, J. F., Burset, M. & Fickett, J. W. An assessment of
gene prediction accuracy in large DNA sequences. Genome Res 10, 1631-42 (2000).
2. Nelson, M. A. et al. Expressed sequences from conidial, mycelial, and sexual
stages of Neurospora crassa. Fungal Genet Biol 21, 348-63 (1997).
3. Bell-Pedersen, D., Shinohara, M. L., Loros, J. J. & Dunlap, J. C. Circadian
clock-controlled genes isolated from Neurospora crassa are late night- to early
morning-specific. Proc Natl Acad Sci U S A 93, 13096-101 (1996).
4. Kelkar, H. S. et al. The Neurospora crassa genome: cosmid libraries sorted by
chromosome. Genetics 157, 979-90 (2001).
5. Kupfer, D. M., Reece, C. A., Clifton, S. W., Roe, B. A. & Prade, R. A.
Multicellular ascomycetous fungal genomes contain more than 8000 genes.
Fungal Genet Biol 21, 364-72 (1997).
6. Bean, L. E. et al. Analysis of the pdx-1 (snz-1/sno-1) region of the Neurospora
crassa genome: correlation of pyridoxine-requiring phenotypes with mutations in
two structural genes. Genetics 157, 1067-75 (2001).