S1. Recommendations for phylogenetic analyses

A number of useful resources provide step-by-step guides to state-of-the-art
phylogenetic analyses, e.g. [1, 8]. Although many researchers have converged on nearly
indistinguishable strategies, many variations on this theme are possible. The following
should be considered only as an example.
In the specific case of family-level analyses of polyomavirus LTag sequences, we advise
using amino acid sequences. Deriving these sequences depends on the proper identification of
splice sites in the nucleotide sequences. In most cases, publicly available genomes come with
straightforward, unambiguous annotations, which now often arise from splice site predictions.
We encourage the community to regularly check and curate novel as well as old genomes.
LTag amino acid sequences can be aligned with any of the most popular multiple sequence
aligners, e.g. Clustal Omega or MUSCLE [5, 11]. These algorithms are implemented in a
number of multi-task platforms equipped with a graphical user interface, e.g. SeaView or
Geneious [6, 9].
Although LTag amino acid sequences generally align quite well, a number of sections of the
alignment will look relatively shaky, i.e. comprise many gaps. These are regions where local
site homology is more difficult to ascertain. They may contribute some phylogenetic
signal but will most certainly add unnecessary noise. Ambiguous columns may be
removed manually or, even better, by applying a reproducible rule such as the one implemented
in Gblocks [12]. Gblocks is also available within SeaView.
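To illustrate what a "reproducible rule" for column removal can look like, here is a minimal Python sketch that drops every alignment column whose proportion of gaps exceeds a threshold. It is a deliberately simplified stand-in and not the Gblocks algorithm, which also considers conservation and block length. The function name, the 0.5 threshold and the toy sequences are assumptions made for the example.

```python
from typing import Dict


def drop_gappy_columns(alignment: Dict[str, str],
                       max_gap_fraction: float = 0.5) -> Dict[str, str]:
    """Keep only the columns whose gap fraction is <= max_gap_fraction.

    Simplified illustration only: this is NOT the Gblocks procedure.
    """
    if not alignment:
        return {}
    sequences = list(alignment.values())
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(sequences)
    # Indices of the columns that pass the gap-fraction rule.
    kept = [
        i for i in range(length)
        if sum(seq[i] == "-" for seq in sequences) / n_seqs <= max_gap_fraction
    ]
    return {name: "".join(seq[i] for i in kept)
            for name, seq in alignment.items()}


# Toy alignment (made-up sequences, not real LTag data).
toy = {
    "virus_A": "MDK-VLNRE",
    "virus_B": "MDR-VLN-E",
    "virus_C": "MEKAVL--E",
}
trimmed = drop_gappy_columns(toy, max_gap_fraction=0.5)
for name, seq in trimmed.items():
    print(name, seq)
```

Because the rule is explicit and parameterized, the same trimming can be repeated exactly by anyone given the alignment and the threshold, which is the point of preferring it over manual editing.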
Probabilistic phylogenetic inference methods, i.e. maximum likelihood (ML) and Bayesian
analyses, require a model of amino acid substitution to be specified. It is a good idea to first
determine which model might best capture the processes that resulted in your own
sequence data. Model selection in an ML framework is an efficient and popular way to
identify the “best model”. For amino acid alignments, this can be done using ProtTest [3].
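As a rough illustration of likelihood-based model selection, the Python sketch below ranks candidate models by the Akaike information criterion (AIC = 2k - 2 lnL), one of the criteria commonly used for this purpose. The log-likelihoods and parameter counts are invented placeholders, not results for any real LTag alignment. In practice these values would come from the model selection software itself.

```python
from typing import Dict, Tuple


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood


def best_model_by_aic(
    fits: Dict[str, Tuple[float, int]]
) -> Tuple[str, Dict[str, float]]:
    """Return the model name with the lowest AIC and all AIC scores.

    `fits` maps a model name to (maximized log-likelihood, free parameters).
    """
    scores = {name: aic(lnl, k) for name, (lnl, k) in fits.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical values, for illustration only.
fits = {
    "LG": (-12500.0, 57),
    "LG+G": (-12310.0, 58),
    "WAG+G": (-12355.0, 58),
}
best, scores = best_model_by_aic(fits)
for name in sorted(scores, key=scores.get):
    print(f"{name}: AIC = {scores[name]:.1f}")
print("selected:", best)
```

An extra parameter is only rewarded if it improves the likelihood enough to offset the penalty term, which is why the model with the highest raw likelihood is not automatically selected.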
With an LTag alignment and a reasonable model of amino acid substitution in hand, it is now
possible to proceed with the phylogenetic analyses per se. It is better to run analyses in both ML
and Bayesian frameworks. Branches that receive good statistical support in both
analyses are considerably more credible.
ML analyses can be performed with a number of software packages, including PhyML and MEGA [7,
13]. One should pay some attention to the algorithm used to propose new topologies during
the optimization process: subtree-pruning-regrafting (SPR), or a combination of SPR and
nearest-neighbor-interchange (NNI), is usually seen as reasonably efficient at exploring
topological space. The end result of an ML analysis is a single tree, the ML tree. Branch
support can be estimated using non-parametric bootstrapping, in which case several
hundred pseudo-replicates of the original dataset are analyzed: the frequency of
appearance of any given branch in this set of pseudo-ML trees is routinely referred to as the
bootstrap value. Bootstrap values can be plotted above the corresponding branch of the ML
tree.
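The pseudo-replicates used in non-parametric bootstrapping are obtained by resampling alignment columns with replacement until the replicate is as long as the original alignment. The Python sketch below shows only that resampling step, on a made-up toy alignment. Tree inference on each replicate is left to the ML software, and the function name and seed are assumptions made for the example.

```python
import random
from typing import Dict


def bootstrap_replicate(alignment: Dict[str, str],
                        rng: random.Random) -> Dict[str, str]:
    """Resample alignment columns with replacement (one pseudo-replicate)."""
    if not alignment:
        raise ValueError("alignment is empty")
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")
    # Draw as many column indices as there are columns, with replacement.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][i] for i in columns)
            for name in names}


# Toy alignment (made-up sequences, not real LTag data).
toy = {
    "virus_A": "MDKAVLNRE",
    "virus_B": "MDRAVLNKE",
    "virus_C": "MEKAVLSRE",
}
rng = random.Random(42)  # fixed seed so that the run is repeatable
replicates = [bootstrap_replicate(toy, rng) for _ in range(100)]
print(len(replicates), "pseudo-replicates generated")
```

Each pseudo-replicate would then be analyzed exactly like the original dataset, and the share of the resulting trees containing a given branch gives that branch's bootstrap value.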
Bayesian analyses are usually performed using BEAST or MrBayes [2, 4, 10]. There is
significant amino acid rate variation across the LTag tree, so it may be wise to use an
evolutionary model comprising a relaxed clock component. The model of evolution specified
in BEAST should also include a component describing the tree shape. We strongly
suggest avoiding coalescent models and opting for one of the speciation models, e.g. the
birth-death model. Unlike ML, Bayesian analyses do not aim at identifying the “best tree”.
Instead, they generate a set of “plausible trees”. The frequency of appearance of any
branch in this set of trees is a good approximation of its posterior probability (which
cannot be directly estimated), i.e. a measure of its statistical robustness. Bayesian sets of
trees are usually summarized onto a single tree, the maximum clade credibility tree (MCC
tree), which is the “best representative tree” of the entire set under consideration
(considering branch posterior probabilities). Posterior probabilities can be plotted above the
corresponding branch of the MCC tree.
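The two summaries described above can be sketched in a few lines of Python: clade frequencies across a set of sampled trees, and the choice of the sampled tree with the highest product of clade frequencies as the MCC tree. This is a toy illustration under simplifying assumptions: trees are rooted and written as nested tuples of tip names, burn-in has already been discarded, and branch lengths and node heights are ignored. The four trees are invented. Real analyses would rely on the summarizing tools that accompany the Bayesian software.

```python
import math
from collections import Counter
from typing import Dict, FrozenSet, List, Set, Tuple, Union

# A rooted tree is either a tip name or a tuple of subtrees (toy encoding).
Tree = Union[str, Tuple["Tree", ...]]
Clade = FrozenSet[str]


def clades(tree: Tree) -> Set[Clade]:
    """Return the tip sets of all internal nodes of a rooted tree."""
    found: Set[Clade] = set()

    def visit(node: Tree) -> Clade:
        if isinstance(node, str):
            return frozenset([node])
        tips = frozenset().union(*(visit(child) for child in node))
        found.add(tips)
        return tips

    visit(tree)
    return found


def clade_frequencies(trees: List[Tree]) -> Dict[Clade, float]:
    """Fraction of sampled trees in which each clade appears."""
    counts: Counter = Counter()
    for tree in trees:
        counts.update(clades(tree))
    return {clade: n / len(trees) for clade, n in counts.items()}


def mcc_tree(trees: List[Tree]) -> Tree:
    """Sampled tree maximizing the product of its clade frequencies."""
    freqs = clade_frequencies(trees)
    # Sum of logs is equivalent to the product and numerically safer.
    return max(trees,
               key=lambda t: sum(math.log(freqs[c]) for c in clades(t)))


# Four hypothetical sampled trees on tips A to D.
sample: List[Tree] = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    ((("A", "B"), "C"), "D"),
    (("A", "C"), ("B", "D")),
]
freqs = clade_frequencies(sample)
print("support for (A,B):", freqs[frozenset({"A", "B"})])
print("MCC tree:", mcc_tree(sample))
```

In this toy sample the clade (A,B) appears in three of the four trees, so its estimated posterior probability is 0.75, and the first topology is retained as the MCC tree.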
It is usual to present only the ML tree or the MCC tree in publications, in which case both
bootstrap values and posterior probabilities can be co-plotted above the appropriate
branches.
Properly setting up, running and analyzing the output of Bayesian analyses requires some
learning. It is far beyond the scope of this short document to provide general guidelines
about these steps but a number of excellent resources and tutorials are available online, e.g.
at http://beast2.org/.
The SG will only consider taxonomical claims that rely on properly performed phylogenetic
analyses. In summary, these should include: 1) a meaningful amino acid alignment,
2) a model selection procedure, 3) the implementation of at least two phylogenetic
inference methods, at least one of which is character-based and probabilistic (claims
backed only by distance-based methods will not be considered), and 4) a statistical
assessment of branch support.
References
1. Anisimova M, Liberles DA, Philippe H, Provan J, Pupko T, von Haeseler A (2013) State-of the art methodologies dictate new standards for phylogenetic analysis. BMC Evol Biol 13:161
2. Bouckaert R, Heled J, Kuhnert D, Vaughan T, Wu CH, Xie D, Suchard MA, Rambaut A, Drummond AJ (2014) BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Comput Biol 10:e1003537
3. Darriba D, Taboada GL, Doallo R, Posada D (2011) ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics 27:1164-1165
4. Drummond AJ, Suchard MA, Xie D, Rambaut A (2012) Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol 29:1969-1973
5. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792-1797
6. Gouy M, Guindon S, Gascuel O (2010) SeaView version 4: A multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Mol Biol Evol 27:221-224
7. Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O (2010) New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 59:307-321
8. Hall BG (2013) Building phylogenetic trees from molecular data with MEGA. Mol Biol Evol 30:1229-1235
9. Kearse M, Moir R, Wilson A, Stones-Havas S, Cheung M, Sturrock S, Buxton S, Cooper A, Markowitz S, Duran C, Thierer T, Ashton B, Meintjes P, Drummond A (2012) Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28:1647-1649
10. Ronquist F, Teslenko M, van der Mark P, Ayres DL, Darling A, Hohna S, Larget B, Liu L, Suchard MA, Huelsenbeck JP (2012) MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol 61:539-542
11. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Soding J, Thompson JD, Higgins DG (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7:539
12. Talavera G, Castresana J (2007) Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst Biol 56:564-577
13. Tamura K, Stecher G, Peterson D, Filipski A, Kumar S (2013) MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Mol Biol Evol 30:2725-2729