Supplementary Materials (doc 30K)

Supplemental Materials, analysis and methods
The sequences of LFG proteins across eukaryotic species were identified
using Blastp and tBlastn [1], beginning with a query set of previously identified
genes [9] and those from the recent literature. To identify more remote
homologues while excluding non-LFG proteins, profiles were built with
Selenoprofiles [2] and used to search the available genomes again. All LFG sequences
from the Blast results, Selenoprofiles searches and manual identification were
combined into a single sequence set. This set was inspected to remove identical
sequences, partial sequences, and probable assembly or gene prediction errors.
In a few cases, manual review allowed assembly errors to be
identified and corrected. In a number of other cases, intron-exon boundary errors were found in automated gene predictions; these were
corrected by comparison with the genomic sequences of the nearest
homologues and by examination of available ESTs.
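The automated part of this clean-up (removing identical and partial sequences) can be sketched as below. This is a minimal illustration, not the procedure used in the study: the function names and the `min_length` cutoff are hypothetical, and the manual review of assembly and gene prediction errors described above has no automated counterpart here.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def clean_sequence_set(records, min_length=0):
    """Drop exact duplicate sequences and sequences shorter than min_length.

    min_length is a hypothetical cutoff standing in for the judgement of
    what counts as a partial sequence; the first copy of a duplicate is kept.
    """
    seen, kept = set(), []
    for header, seq in records:
        if len(seq) < min_length or seq in seen:
            continue
        seen.add(seq)
        kept.append((header, seq))
    return kept
```

For example, `clean_sequence_set(parse_fasta(text), min_length=100)` would keep one copy of each distinct sequence of at least 100 residues; the value 100 is arbitrary.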
The resulting set contains sequences from all sequenced eukaryotic
lineages. This set was then aligned using T-Coffee [3]. Two alignments were
generated. Alignment 1 contains the assumed best set of full-length LFG genes
(with the exception that highly repetitive N-termini with large variation within
a clear LFG subfamily were trimmed back to a common length). Alignment 2
contains the full set (Suppl. Material), including partial matches and likely some
incomplete sequences.
LFG subfamilies were assigned using the branching topology of a full
phylogenetic tree analysis and checked for distinctive sequence patterns. ML
trees were reconstructed using the best-fitting evolutionary model (BestML). To
select the evolutionary model best fitting each protein subfamily, a phylogenetic
tree was reconstructed using a Neighbour Joining (NJ) approach as implemented
in BioNJ [4]. The likelihood of this topology was computed, allowing branch-length optimization, using seven different models (JTT, LG, WAG, Blosum62,
MtREV, VT and Dayhoff), as implemented in PhyML version 3.0 [5]. The two
evolutionary models best fitting the data were determined by comparing the
likelihoods of the models tested according to the Akaike information criterion (AIC) [6]. Then, ML trees
were derived using these two models with the default tree topology search
method NNI (Nearest Neighbor Interchange). A similar approach based on NJ
topologies to select the best-fitting model for a subsequent ML analysis has been
shown previously to be highly accurate [7]. Branch support was computed using an
aLRT (approximate likelihood ratio test), a parametric test based on a chi-square
distribution, as implemented in PhyML. It should be noted that the branch
support obtained correlates well with bootstrap values obtained with the simpler
ClustalW software.
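The AIC-based choice of the two best-fitting models can be illustrated with a short sketch. It assumes that the log-likelihood of the fixed NJ topology under each model has already been obtained (for example from PhyML output); the function names, the log-likelihood values and the parameter count below are invented for illustration and are not results from this study.

```python
def aic(log_likelihood, n_free_params):
    """Akaike information criterion: AIC = 2k - 2 ln L (lower is better)."""
    return 2 * n_free_params - 2 * log_likelihood


def best_models(log_likelihoods, n_free_params, top=2):
    """Return the names of the `top` models with the lowest AIC.

    log_likelihoods: dict mapping model name -> ln L on the fixed topology.
    n_free_params:   dict mapping model name -> number of free parameters k.
    """
    scores = {m: aic(lnl, n_free_params[m]) for m, lnl in log_likelihoods.items()}
    return sorted(scores, key=scores.get)[:top]


# Made-up log-likelihoods, for illustration only.
example_lnl = {
    "JTT": -10250.4, "LG": -10198.7, "WAG": -10221.9, "Blosum62": -10310.2,
    "MtREV": -10480.0, "VT": -10275.3, "Dayhoff": -10290.8,
}
# Assumption: only branch lengths are free, e.g. 2 * 30 - 3 = 57 branches
# for an unrooted tree of 30 sequences, so k is the same for every model.
example_k = {model: 57 for model in example_lnl}

print(best_models(example_lnl, example_k))
```

When every candidate model has the same number of free parameters, as assumed here, ranking by AIC is equivalent to ranking by likelihood. The penalty term only matters if some models estimate additional parameters.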
Phylogenetic reconstruction was carried out as described in a study
of Vertebrate and Mammalian Selenoproteomes [7,8]. Finally, tree images were
generated using programs based on ETE2 [8]. This procedure generated tree1
(Suppl. Fig. S3). Tree1 was inspected, and correlated with a probable species
tree to derive the history of the LFG family, as depicted in figure 1. Additional phylogenetic analysis
was carried out on taxonomic subdivisions and on the distinct LFG subfamilies
identified. In particular, this was done for the mammals in all five subfamilies
and for the primates separately. The resulting alignment at the DNA level, when
combined with the predicted protein amino acid sequences, allowed detailed
analysis of synonymous versus non-synonymous mutations along the lines of
assumed descent [9], shown in figure 4 of the main text.
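The distinction between synonymous and non-synonymous changes can be illustrated with a simple codon-by-codon comparison of two aligned coding sequences. This is only a sketch under stated assumptions: it uses the standard genetic code, counts differing codons rather than estimating substitution rates per site, and is not the method of reference 9. The function name `count_substitutions` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_substitutions(cds_a, cds_b):
    """Count (synonymous, non_synonymous) codon differences.

    Both inputs must be in-frame coding sequences aligned to equal length.
    Codons containing gaps or ambiguous bases are skipped.
    """
    if len(cds_a) != len(cds_b):
        raise ValueError("sequences must be aligned to the same length")
    syn = nonsyn = 0
    for i in range(0, len(cds_a) - len(cds_a) % 3, 3):
        codon_a = cds_a[i:i + 3].upper()
        codon_b = cds_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # gap or ambiguous base
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1      # same amino acid: synonymous change
        else:
            nonsyn += 1   # different amino acid: non-synonymous change
    return syn, nonsyn
```

For instance, comparing `ATGCTTGGA` with `ATGCTCGAA` gives one synonymous change (CTT to CTC, both leucine) and one non-synonymous change (GGA, glycine, to GAA, glutamate).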
References.
1. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z.,
Miller, W. et al., Gapped BLAST and PSI-BLAST: a new generation
of protein database search programs. Nucleic Acids Res. 25:3389-3402 (1997).
2. Mariotti, M. & Guigó, R., Selenoprofiles: profile-based scanning of
eukaryotic genome sequences for selenoprotein genes. Bioinformatics
26(21):2656-63 (2010).
3. Notredame, C., Higgins, D.G. & Heringa, J., T-Coffee: A novel
method for fast and accurate multiple sequence alignment. J. Mol.
Biol. 302:205-17 (2000).
4. Gascuel, O., BIONJ: an improved version of the NJ algorithm based
on a simple model of sequence data. Mol Biol Evol 14:685-95 (1997).
5. Guindon, S., Dufayard, J.F., Lefort, V., Anisimova, M., Hordijk, W.
& Gascuel, O., New algorithms and methods to estimate maximum-
likelihood phylogenies: assessing the performance of PhyML 3.0.
Syst Biol 59:307-21 (2010).
6. Akaike, H., Information theory and an extension of the maximum
likelihood principle. Proceedings of the 2nd international symposium
on information theory, Budapest, Hungary, pp. 267-281 (1973).
7. Huerta-Cepas, J., Capella-Gutierrez, S., Pryszcz, L.P., Denisov, I.,
Kormes, D., Marcet-Houben, M. et al., PhylomeDB v3.0: An
expanding repository of genome-wide collections of trees, alignments
and phylogeny-based orthology and paralogy predictions. Nucleic
Acids Res 39:D556-60 (2011).
8. Huerta-Cepas, J., Dopazo, J. & Gabaldón, T., ETE: a python
Environment for Tree Exploration. BMC Bioinformatics 11:24
(2010).
9. Mariotti, M., Ridge, P.G., Zhang, Y., Lobanov, A.V., Pringle, T.H.,
Guigó, R. et al., Composition and Evolution of the Vertebrate and
Mammalian Selenoproteomes. PLoS ONE 7(3):e33066 (2012).