Materials and Methods Data collection Protein sequences deduced from 57 completed fungi genomes was retrieved from Broad Institute, DOE Joint Genome Institute, Saccharomyces Genome Database, Génolevures Consortium, and NCBI (Table 1). The Uniref90 database was retrieved from UniProt. Selection of beta oxidation enzymes via similarity search We conducted a BLASTP search between the fungal proteins and sequences from UniRef90 database, using the threshold e=10-10 for significant matches. If the top hit against UniRef90 database was a beta oxidation enzyme, the query protein was annotated as such. To infer remote homologs, the enzymes identified were used for a second ‘transitive’ screen to annotate sequences without any matches in the first run, a procedures employed with success by the annotation tool AutoFACT (Koski et al., 2005). For species in which beta oxidation enzymes were not found, we also searched by TBLASTN in their genome sequences to avoid false negatives caused by gene prediction error. However, it cannot be completely ruled out that some enzymes appear to be missing only because of incomplete genome data. For species in which beta oxidation enzymes were not found in the genome sequence, we scrutinized other available data sources such as ESTs. Mitochondrial protein prediction The subcellular localization prediction of the proteins was based on nine predictors: TargetP (Emanuelsson et al., 2000), Subloc (Hua and Sun, 2001), SherLoc (Shatkay et al., 2007), 1 pTARGET (Guda and Subramaniam, 2005), Predotar (Small et al., 2004), Protein Prowler (PProwler) (Boden and Hawkins, 2005), PASUB (Lu et al., 2004), MitoProt (Claros and Vincens, 1996), and CELLO (Yu et al., 2006). All the results were obtained from online servers of the predictors, except for MitoProt, which was installed and run locally. For most proteins, the predictors gave contradictory results. We integrated these predictions by employing the tool YimLOC (Shen and Burger, 2007), which has been shown to be more accurate than any of the individual predictors. We used the YimLOC result as the final localization prediction. Peroxisomal protein prediction Six of the nine predictors above include the class “peroxisomes” (SherLoc, pTARGET, PProwler, PASUB, PST1 (Neuberger et al., 2003), and CELLO). Since the predictors have low to medium sensitivity for peroxisomal proteins (22%-77%), we employed the following scheme to maximize the sensitivity by combing function annotation with localization prediction: for a query protein, if 1) by similarity search it is annotated as a peroxisomal beta oxidation enzyme; 2) any of these predictors classifies it into peroxisomes; 3) it is not targeted to mitochondria according to YimLOC, this protein will be predicted as peroxisome-destined. Dual-localization will be assigned if this protein is also recognized as a mitochondrial one by YimLOC. Phylogenetic inference We chose eight proteins that have been successfully used in previous fungal phylogenetic analyses. These are the two largest subunits of RNA polymerase II (RPB1 and RPB2), and III 2 (RPC1 and RPC2), mitochondrial and cytoplasmic heat shock proteins 70 (HSP70_mit and HSP70_cyt), elongation factor II (EF2), and 60kDa chaperonin (CPN60). We used Homo sapiens and Monosiga brevicollis as outgroup (not shown). Homologs of these proteins were initially searched in each species with a BLASTP cutoff of e=10-100). Only unambiguous orthologs were retained, identified by phylogenetic analyses of individual proteins (Supplementary Table 5). Protein sequence alignments were generated using MUSCLE (Edgar, 2004). Aligned sequences were concatenated, and Gblocks (Castresana, 2000) was employed to remove ambiguously aligned regions and highly divergent parts of the alignment. Three independent phylogenetic runs were performed with the Bayesian program PhyloBayes (Lartillot and Philippe, 2004), using the CAT+Γ model with eight discrete gamma rate categories. After the convergence of likelihood values across generations, the final consensus tree was built based on the combined runs. Bayesian posterior probabilities were obtained from the majority rule consensus of the tree sampled after 1,000 generations. We also constructed maximum likelihood trees using RAxML (Stamatakis, 2006) using the WAG+ Γ model with four discrete gamma rate categories, and we evaluated the statistical support by 100 bootstrap replicates. Reference: Boden, M., and Hawkins, J. (2005) Prediction of subcellular localization using sequence-biased recurrent networks. Bioinformatics 21: 2279-2286. Castresana, J. (2000) Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol 17: 540-552. Claros, M.G., and Vincens, P. (1996) Computational method to predict mitochondrially imported proteins and their targeting sequences. Eur J Biochem 241: 779-786. Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32: 1792-1797. Emanuelsson, O., Nielsen, H., Brunak, S., and von Heijne, G. (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 300: 1005-1016. Guda, C., and Subramaniam, S. (2005) pTARGET [corrected] a new method for predicting protein subcellular localization in eukaryotes. Bioinformatics 21: 3963-3969. Hua, S., and Sun, Z. (2001) Support vector machine approach for protein subcellular localization prediction. Bioinformatics 17: 721-728. Karev, G.P., Wolf, Y.I., and Koonin, E.V. (2003) Simple stochastic birth and death models of genome evolution: was there enough time for us to evolve? Bioinformatics 19: 1889-1900. Kondrashov, F.A., and Koonin, E.V. (2003) Evolution of alternative splicing: deletions, insertions and origin of 3 functional parts of proteins from intron sequences. Trends Genet 19: 115-119. Koski, L.B., Gray, M.W., Lang, B.F., and Burger, G. (2005) AutoFACT: an automatic functional annotation and classification tool. BMC Bioinformatics 6: 151. Lartillot, N., and Philippe, H. (2004) A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol 21: 1095-1109. Leipe, D.D., Koonin, E.V., and Aravind, L. (2003) Evolution and classification of P-loop kinases and related proteins. J Mol Biol 333: 781-815. Lu, Z., Szafron, D., Greiner, R., Lu, P., Wishart, D.S., Poulin, B., Anvik, J., Macdonell, C., and Eisner, R. (2004) Predicting subcellular localization of proteins using machine-learned classifiers. Bioinformatics 20: 547556. Makarova, K.S., and Koonin, E.V. (2003) Filling a gap in the central metabolism of archaea: prediction of a novel aconitase by comparative-genomic analysis. FEMS Microbiol Lett 227: 17-23. Neuberger, G., Maurer-Stroh, S., Eisenhaber, B., Hartig, A., and Eisenhaber, F. (2003) Prediction of peroxisomal targeting signal 1 containing proteins from amino acid sequence. J Mol Biol 328: 581-592. Rogozin, I.B., Aravind, L., and Koonin, E.V. (2003) Differential action of natural selection on the N and C-terminal domains of 2'-5' oligoadenylate synthetases and the potential nuclease function of the C-terminal domain. J Mol Biol 326: 1449-1461. Shatkay, H., Hoglund, A., Brady, S., Blum, T., Donnes, P., and Kohlbacher, O. (2007) SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data. Bioinformatics. Shen, Y.Q., and Burger, G. (2007) 'Unite and conquer': enhanced prediction of protein subcellular localization by integrating multiple specialized tools. BMC Bioinformatics 8: 420. Small, I., Peeters, N., Legeai, F., and Lurin, C. (2004) Predotar: A tool for rapidly screening proteomes for Nterminal targeting sequences. Proteomics 4: 1581-1590. Stamatakis, A. (2006) RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22: 2688-2690. Yu, C.S., Chen, Y.C., Lu, C.H., and Hwang, J.K. (2006) Prediction of protein subcellular localization. Proteins. 4