Materials and Methods
Data collection
Protein sequences deduced from 57 completed fungal genomes were retrieved from the Broad
Institute, the DOE Joint Genome Institute, the Saccharomyces Genome Database, the Génolevures
Consortium, and NCBI (Table 1). The UniRef90 database was retrieved from UniProt.
Selection of beta oxidation enzymes via similarity search
We conducted a BLASTP search of the fungal proteins against the UniRef90 database, using an
E-value threshold of 10^-10 for significant matches. If the top hit in UniRef90 was a beta
oxidation enzyme, the query protein was annotated as such. To infer remote homologs, the
enzymes identified were used in a second ‘transitive’ screen to annotate sequences without any
matches in the first run, a procedure employed with success by the annotation tool AutoFACT
(Koski et al., 2005). For species in which no beta oxidation enzymes were found, we also
searched the genome sequences by TBLASTN to avoid false negatives caused by gene prediction
errors, and we scrutinized other available data sources such as ESTs. Nevertheless, it cannot be
completely ruled out that some enzymes appear to be missing only because of incomplete
genome data.
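The two-pass annotation described above can be sketched in Python as follows. This is a minimal illustration, not the pipeline actually used: it assumes BLAST tabular output (query ID, subject ID, and E-value in columns 1, 2, and 11), treats E-values at or below the threshold as significant, and the function names `top_hits` and `annotate` as well as the lookup table `known_enzymes` are hypothetical.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-10  # threshold stated in the text


def top_hits(blast_tabular, cutoff=E_VALUE_CUTOFF):
    """Return {query: subject} for the lowest-E-value hit per query.

    Assumes tab-separated BLAST output with query ID, subject ID and
    E-value in columns 1, 2 and 11. Hits above the cutoff are ignored.
    """
    best = {}
    for row in csv.reader(io.StringIO(blast_tabular), delimiter="\t"):
        if not row:
            continue
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue > cutoff:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def annotate(queries, first_pass_hits, known_enzymes, second_pass_hits):
    """Annotate queries by top hit, then by a transitive second screen.

    known_enzymes maps reference IDs to a beta oxidation enzyme name
    (hypothetical lookup table). second_pass_hits maps a query to its top
    hit among the proteins annotated in the first pass.
    """
    first = {}
    for query in queries:
        hit = first_pass_hits.get(query)
        if hit in known_enzymes:
            first[query] = known_enzymes[hit]
    annotated = dict(first)
    # Transitive screen: only queries with no match at all in the first run.
    for query in queries:
        if query in first_pass_hits:
            continue
        hit = second_pass_hits.get(query)
        if hit in first:
            annotated[query] = first[hit]
    return annotated
```

Keeping the first-pass annotations in a separate dictionary means a second-pass protein can only inherit an annotation from a protein identified directly, which matches the single transitive step described in the text.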
Mitochondrial protein prediction
The subcellular localization prediction of the proteins was based on nine predictors: TargetP
(Emanuelsson et al., 2000), Subloc (Hua and Sun, 2001), SherLoc (Shatkay et al., 2007),
pTARGET (Guda and Subramaniam, 2005), Predotar (Small et al., 2004), Protein Prowler
(PProwler) (Boden and Hawkins, 2005), PASUB (Lu et al., 2004), MitoProt (Claros and
Vincens, 1996), and CELLO (Yu et al., 2006). All the results were obtained from online servers
of the predictors, except for MitoProt, which was installed and run locally.
For most proteins, the predictors gave contradictory results. We integrated these predictions by
employing the tool YimLOC (Shen and Burger, 2007), which has been shown to be more
accurate than any of the individual predictors. We used the YimLOC result as the final
localization prediction.
Peroxisomal protein prediction
Six predictors include the class “peroxisomes”: five of the nine above (SherLoc, pTARGET,
PProwler, PASUB, and CELLO) and the PTS1 predictor (Neuberger et al., 2003). Since these
predictors have low to medium sensitivity for peroxisomal proteins (22%-77%), we maximized
sensitivity by combining functional annotation with localization prediction. A query protein was
predicted as peroxisome-destined if 1) it was annotated as a peroxisomal beta oxidation enzyme
by similarity search; 2) any of these predictors classified it as peroxisomal; and 3) it was not
targeted to mitochondria according to YimLOC. If conditions 1 and 2 were met but YimLOC also
recognized the protein as mitochondrial, dual localization was assigned.
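The decision scheme above can be written as a small Python function. This is a sketch under stated assumptions: the function name, the argument names, and the label strings are hypothetical, and a return value of `None` simply means the scheme does not apply to the protein.

```python
def predict_peroxisomal(is_perox_beta_ox_enzyme, predictor_calls, yimloc_mitochondrial):
    """Apply the peroxisomal prediction scheme to one query protein.

    is_perox_beta_ox_enzyme: True if similarity search annotated the protein
        as a peroxisomal beta oxidation enzyme (condition 1).
    predictor_calls: dict mapping predictor name to its predicted class;
        the label "peroxisome" is an assumed convention (condition 2).
    yimloc_mitochondrial: True if YimLOC targets the protein to mitochondria.
    """
    any_peroxisomal = any(call == "peroxisome" for call in predictor_calls.values())
    if not (is_perox_beta_ox_enzyme and any_peroxisomal):
        return None  # scheme does not apply
    if yimloc_mitochondrial:
        return "dual"  # peroxisomal and mitochondrial
    return "peroxisome"
```

For example, a protein annotated as a peroxisomal enzyme, called peroxisomal by a single predictor, and not mitochondrial according to YimLOC would be returned as `"peroxisome"`. Requiring only one positive predictor is what raises sensitivity in this scheme.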
Phylogenetic inference
We chose eight proteins that have been successfully used in previous fungal phylogenetic
analyses. These are the two largest subunits of RNA polymerase II (RPB1 and RPB2), and III
(RPC1 and RPC2), mitochondrial and cytoplasmic heat shock proteins 70 (HSP70_mit and
HSP70_cyt), elongation factor II (EF2), and 60 kDa chaperonin (CPN60). We used Homo sapiens
and Monosiga brevicollis as outgroups (not shown). Homologs of these proteins were initially
searched in each species with a BLASTP cutoff of E = 10^-100. Only unambiguous orthologs were
retained, identified by phylogenetic analyses of individual proteins (Supplementary Table 5).
Protein sequence alignments were generated using MUSCLE (Edgar, 2004). Aligned sequences
were concatenated, and Gblocks (Castresana, 2000) was employed to remove ambiguously
aligned regions and highly divergent parts of the alignment. Three independent phylogenetic runs
were performed with the Bayesian program PhyloBayes (Lartillot and Philippe, 2004), using the
CAT+Γ model with eight discrete gamma rate categories. After the convergence of likelihood
values across generations, the final consensus tree was built based on the combined runs.
Bayesian posterior probabilities were obtained from the majority rule consensus of the trees
sampled after 1,000 generations. We also constructed maximum likelihood trees with RAxML
(Stamatakis, 2006) under the WAG+Γ model with four discrete gamma rate categories, and we
evaluated statistical support with 100 bootstrap replicates.
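The concatenation step described above, joining the per-protein alignments into one supermatrix before trimming with Gblocks, can be sketched in Python as follows. The function name and data layout are hypothetical, and filling a species that lacks a given protein with gap characters is an assumption of this sketch, not something stated in the text.

```python
def concatenate_alignments(alignments, species):
    """Concatenate per-protein alignments into one sequence per species.

    alignments: list of dicts mapping species name to aligned sequence;
        all sequences within one dict must have the same length.
    species: ordered list of species to include in the supermatrix.
    A species missing from an alignment is padded with gaps (assumption).
    """
    parts = {name: [] for name in species}
    for alignment in alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in an alignment must have equal length")
        width = lengths.pop()
        for name in species:
            parts[name].append(alignment.get(name, "-" * width))
    return {name: "".join(chunks) for name, chunks in parts.items()}
```

Every species ends up with a sequence of identical total length, which is what downstream trimming and tree-building programs expect from a concatenated alignment.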
References:
Boden, M., and Hawkins, J. (2005) Prediction of subcellular localization using sequence-biased recurrent networks.
Bioinformatics 21: 2279-2286.
Castresana, J. (2000) Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis.
Mol Biol Evol 17: 540-552.
Claros, M.G., and Vincens, P. (1996) Computational method to predict mitochondrially imported proteins and their
targeting sequences. Eur J Biochem 241: 779-786.
Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids
Res 32: 1792-1797.
Emanuelsson, O., Nielsen, H., Brunak, S., and von Heijne, G. (2000) Predicting subcellular localization of proteins
based on their N-terminal amino acid sequence. J Mol Biol 300: 1005-1016.
Guda, C., and Subramaniam, S. (2005) pTARGET [corrected] a new method for predicting protein subcellular
localization in eukaryotes. Bioinformatics 21: 3963-3969.
Hua, S., and Sun, Z. (2001) Support vector machine approach for protein subcellular localization prediction.
Bioinformatics 17: 721-728.
Karev, G.P., Wolf, Y.I., and Koonin, E.V. (2003) Simple stochastic birth and death models of genome evolution: was
there enough time for us to evolve? Bioinformatics 19: 1889-1900.
Kondrashov, F.A., and Koonin, E.V. (2003) Evolution of alternative splicing: deletions, insertions and origin of
functional parts of proteins from intron sequences. Trends Genet 19: 115-119.
Koski, L.B., Gray, M.W., Lang, B.F., and Burger, G. (2005) AutoFACT: an automatic functional annotation and
classification tool. BMC Bioinformatics 6: 151.
Lartillot, N., and Philippe, H. (2004) A Bayesian mixture model for across-site heterogeneities in the amino-acid
replacement process. Mol Biol Evol 21: 1095-1109.
Leipe, D.D., Koonin, E.V., and Aravind, L. (2003) Evolution and classification of P-loop kinases and related
proteins. J Mol Biol 333: 781-815.
Lu, Z., Szafron, D., Greiner, R., Lu, P., Wishart, D.S., Poulin, B., Anvik, J., Macdonell, C., and Eisner, R. (2004)
Predicting subcellular localization of proteins using machine-learned classifiers. Bioinformatics 20: 547-556.
Makarova, K.S., and Koonin, E.V. (2003) Filling a gap in the central metabolism of archaea: prediction of a novel
aconitase by comparative-genomic analysis. FEMS Microbiol Lett 227: 17-23.
Neuberger, G., Maurer-Stroh, S., Eisenhaber, B., Hartig, A., and Eisenhaber, F. (2003) Prediction of peroxisomal
targeting signal 1 containing proteins from amino acid sequence. J Mol Biol 328: 581-592.
Rogozin, I.B., Aravind, L., and Koonin, E.V. (2003) Differential action of natural selection on the N and C-terminal
domains of 2'-5' oligoadenylate synthetases and the potential nuclease function of the C-terminal domain. J
Mol Biol 326: 1449-1461.
Shatkay, H., Hoglund, A., Brady, S., Blum, T., Donnes, P., and Kohlbacher, O. (2007) SherLoc: high-accuracy
prediction of protein subcellular localization by integrating text and protein sequence data. Bioinformatics.
Shen, Y.Q., and Burger, G. (2007) 'Unite and conquer': enhanced prediction of protein subcellular localization by
integrating multiple specialized tools. BMC Bioinformatics 8: 420.
Small, I., Peeters, N., Legeai, F., and Lurin, C. (2004) Predotar: A tool for rapidly screening proteomes for
N-terminal targeting sequences. Proteomics 4: 1581-1590.
Stamatakis, A. (2006) RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa
and mixed models. Bioinformatics 22: 2688-2690.
Yu, C.S., Chen, Y.C., Lu, C.H., and Hwang, J.K. (2006) Prediction of protein subcellular localization. Proteins.