Supplementary Materials (doc 30K)

Supplemental Materials, analysis and methods

The sequences of LFG proteins across eukaryotic species were identified using Blastp and tBlastn [1], beginning with a query set of previously identified genes [9] and those from the recent literature. To identify more remote homologues while excluding non-LFG proteins, profiles were built with Selenoprofiles [2] and used to search the available genomes again. All LFG sequences from the Blast results, the Selenoprofiles searches and manual identification were combined into a single sequence set. This set was inspected to remove identical sequences, partial sequences, and probable assembly or gene prediction errors. In a few cases, manual review allowed assembly errors to be identified and corrected. In a number of other cases, intron-exon boundary errors were found in automated gene predictions. These were corrected by comparison with the genomic sequences of the nearest homologues and by examination of available ESTs. The resulting set contains sequences from all sequenced eukaryotic lineages.

This set was then aligned using T-Coffee [3]. Two alignments were generated. Alignment 1 contains the assumed best set of full-length LFG genes (with the exception that highly repetitive N-termini with large variation within a clear LFG subfamily were trimmed back to a common length). Alignment 2 contains the full set (Suppl. Material), including partial matches and, probably, some incomplete sequences. LFG subfamilies were assigned based on the branching topology of a full phylogenetic tree analysis, and checked for distinctive sequence patterns.

ML trees were reconstructed using the best-fitting evolutionary model (BestML). To select the evolutionary model best fitting each protein subfamily, a phylogenetic tree was first reconstructed using a Neighbour Joining (NJ) approach as implemented in BioNJ [4].
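The removal of identical and partial sequences from the combined set, described above, can be illustrated with a minimal sketch. It is not the procedure used here, which relied on manual inspection: the function name `filter_sequences`, the `min_fraction` threshold and the use of the median length as a reference are all assumptions made for illustration.

```python
def filter_sequences(records, min_fraction=0.5):
    """Drop exact duplicates and probable partial sequences.

    records      -- list of (name, sequence) tuples
    min_fraction -- hypothetical cutoff: sequences shorter than this
                    fraction of the median length are treated as partial
    """
    seen = set()
    unique = []
    for name, seq in records:
        # Normalise case and strip gap characters before comparing.
        key = seq.upper().replace("-", "")
        if key in seen:
            continue  # identical to a sequence already kept
        seen.add(key)
        unique.append((name, key))

    if not unique:
        return []

    # Use the median length of the unique set as the reference length.
    lengths = sorted(len(seq) for _, seq in unique)
    median = lengths[len(lengths) // 2]

    return [(name, seq) for name, seq in unique
            if len(seq) >= min_fraction * median]


# Toy input: "b" duplicates "a", and "c" is much shorter than the rest.
example = [("a", "MKTAYIAK"), ("b", "MKTAYIAK"),
           ("c", "MK"), ("d", "MKTAYIAR")]
kept = filter_sequences(example)
```

A filter of this kind only flags candidates. Assembly and gene prediction errors still require comparison with genomic sequence and ESTs, as described above.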
The likelihood of this topology was computed, allowing branch-length optimization, using seven different models (JTT, LG, WAG, Blosum62, MtREV, VT and Dayhoff), as implemented in PhyML version 3.0 [5]. The two evolutionary models best fitting the data were determined by comparing the likelihoods of the tested models according to the Akaike information criterion (AIC) [6]. ML trees were then derived using these two models with the default tree topology search method, NNI (Nearest Neighbor Interchange). A similar approach, based on NJ topologies to select the best-fitting model for a subsequent ML analysis, has been shown previously to be highly accurate [7]. Branch support was computed using an aLRT (approximate likelihood ratio test) parametric test based on a chi-square distribution, as implemented in PhyML. It should be noted that the branch support obtained correlates well with bootstrap values obtained with the simpler ClustalW software.

Phylogenetic reconstruction was carried out as described in a study of vertebrate and mammalian selenoproteomes [7,8]. Finally, tree images were generated using programs based on ETE2 [8]. This procedure generated tree1 (Suppl. Fig. S3). Tree1 was inspected and correlated with a probable species tree to derive the history of the LFG family, as depicted in figure 1.

Additional phylogenetic analysis was carried out on taxonomic subdivisions and on the distinct LFG subfamilies identified. In particular, this was done for the mammals in all five subfamilies and for the primates separately.
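The model comparison described above can be sketched in a few lines, assuming the log-likelihood of the fixed NJ topology under each model has already been obtained (for example from the PhyML output). The function names, the log-likelihood values and the parameter count below are illustrative assumptions, not results from this analysis. AIC is computed as 2k - 2 lnL, where k is the number of free parameters and lnL the log-likelihood; lower values indicate a better fit.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def best_models(log_likelihoods, n_params, top=2):
    """Return the names of the `top` models with the lowest AIC.

    log_likelihoods -- dict mapping model name to lnL of the fixed topology
    n_params        -- dict mapping model name to number of free parameters
    """
    scores = {model: aic(lnl, n_params[model])
              for model, lnl in log_likelihoods.items()}
    return sorted(scores, key=scores.get)[:top]


# Made-up values for illustration only.
lnl = {"JTT": -10250.0, "LG": -10100.0, "WAG": -10180.0}
# Assumed: with fixed empirical matrices on the same topology, every model
# has the same number of free parameters (here an arbitrary 57).
k = {model: 57 for model in lnl}

selected = best_models(lnl, k)
```

When all candidate models have the same number of free parameters, as assumed here, ranking by AIC is equivalent to ranking by likelihood. The penalty term only changes the ranking when the models differ in parameter count.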
The resulting alignment at the DNA level, when combined with the predicted protein amino acid sequences, allowed detailed analysis of synonymous versus non-synonymous mutations along the lines of assumed descent [9], as shown in figure 4 of the main text.

References

1. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402 (1997).
2. Mariotti, M. & Guigó, R., Selenoprofiles: profile-based scanning of eukaryotic genome sequences for selenoprotein genes. Bioinformatics 26(21):2656-63 (2010).
3. Notredame, C., Higgins, D.G. & Heringa, J., T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302:205-17 (2000).
4. Gascuel, O., BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. 14:685-95 (1997).
5. Guindon, S., Dufayard, J.F., Lefort, V., Anisimova, M., Hordijk, W. & Gascuel, O., New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59:307-21 (2010).
6. Akaike, H., Information theory and extension of the maximum likelihood principle. Proceedings of the 2nd international symposium on information theory, Budapest, Hungary, pp. 267-281 (1973).
7. Huerta-Cepas, J., Capella-Gutierrez, S., Pryszcz, L.P., Denisov, I., Kormes, D., Marcet-Houben, M. et al., PhylomeDB v3.0: An expanding repository of genome-wide collections of trees, alignments and phylogeny-based orthology and paralogy predictions. Nucleic Acids Res. 39:D556-60 (2011).
8. Huerta-Cepas, J., Dopazo, J. & Gabaldón, T., ETE: a python Environment for Tree Exploration. BMC Bioinformatics 11:24 (2010).
9. Mariotti, M., Ridge, P.G., Zhang, Y., Lobanov, A.V., Pringle, T.H., Guigó, R. et al., Composition and Evolution of the Vertebrate and Mammalian Selenoproteomes. PLoS ONE 7(3):e33066 (2012).
