S1. Recommendations for phylogenetic analyses

There are a number of useful resources providing step-by-step guides to state-of-the-art phylogenetic analyses, e.g. [1, 8]. Although many researchers have converged on nearly indistinguishable strategies, many variations on this theme are possible. The following should only be considered as an example.

In the specific case of family-level analyses of polyomavirus LTag sequences, we advise using amino acid sequences. Amino acid sequences depend on the proper identification of splice sites in nucleotide sequences. In most cases, publicly available genomes come with straightforward, unambiguous annotations, now often arising from splice site predictions. We encourage the community to regularly check and curate novel as well as old genomes.

LTag amino acid sequences can be aligned with any of the most popular multiple sequence aligners, e.g. Clustal Omega or MUSCLE [5, 11]. These algorithms are implemented in a number of multi-task platforms equipped with a graphical user interface, e.g. SeaView or Geneious [6, 9]. Although LTag amino acid sequences generally align quite well, a number of sections of the alignment will look relatively shaky, i.e. comprise many gaps. These are regions where local site homology is more difficult to ascertain. They may contribute some phylogenetic signal but they will most certainly bring in unnecessary noise. Ambiguous columns may be removed manually or, even better, by using a reproducible rule that one can implement with e.g. Gblocks [12]. Gblocks is also implemented in SeaView.

Probabilistic phylogenetic inference methods, i.e. maximum likelihood (ML) and Bayesian analyses, require a model of amino acid substitution to be specified. It is a good idea to first determine which model might best capture the processes that resulted in your own sequence data.
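
As a rough illustration of what comparing candidate substitution models can involve, the sketch below ranks models by two widely used information criteria, AIC and BIC, computed from each model's maximized log-likelihood and number of free parameters. The model names, log-likelihoods, parameter counts and alignment length are placeholders invented for the example, not results from any LTag alignment, and the function names are hypothetical. Dedicated software should be used for real analyses.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: k ln(n) - 2 lnL (lower is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


def rank_models(fits, n_sites, criterion="aic"):
    """Return (model name, score) pairs sorted from best to worst.

    fits maps a model name to (maximized log-likelihood, free parameters).
    """
    scored = []
    for name, (lnl, k) in fits.items():
        if criterion == "aic":
            score = aic(lnl, k)
        else:
            score = bic(lnl, k, n_sites)
        scored.append((name, score))
    return sorted(scored, key=lambda item: item[1])


# Placeholder values for illustration only, NOT results from a real alignment.
example_fits = {
    "model_A": (-12050.0, 40),
    "model_B": (-12010.0, 41),
    "model_C": (-12008.5, 60),
}
aic_ranking = rank_models(example_fits, n_sites=500, criterion="aic")
bic_ranking = rank_models(example_fits, n_sites=500, criterion="bic")
```

Lower scores indicate a better trade-off between fit and model complexity. With these placeholder values both criteria prefer model_B, but BIC penalizes the parameter-rich model_C more heavily than AIC does, so the two rankings differ below first place.
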
Model selection in an ML framework is an efficient and popular way to identify the “best model”. For amino acid alignments, this can be done using ProtTest [3].

With an LTag alignment and a reasonable model of amino acid substitution in hand, it is now possible to proceed with the phylogenetic analyses per se. It is better to run analyses in both ML and Bayesian frameworks: branches that receive good statistical support in both analyses are a lot more credible.

ML analyses can be performed with a number of software packages, including PhyML and MEGA [7, 13]. One should pay some attention to the algorithm used to generate new topologies during the optimization process; e.g. subtree pruning and regrafting (SPR), or a combination of SPR and nearest-neighbor interchange (NNI), is usually seen as reasonably efficient at exploring topological space. The end result of an ML analysis is a single tree, the ML tree. Branch support can be estimated using non-parametric bootstrapping, in which case several hundred pseudo-replicates of the original dataset are analyzed: the frequency of appearance of any given branch in this set of pseudo-ML trees is routinely referred to as the bootstrap value. Bootstrap values can be plotted above the corresponding branch of the ML tree.

Bayesian analyses are usually performed using BEAST or MrBayes [2, 4, 10]. There is significant amino acid rate variation across the LTag tree, so it may be wise to use an evolutionary model comprising a relaxed clock component. The model of evolution specified in BEAST should also include a component describing the tree shape. We strongly suggest not using coalescent models and opting instead for one of the speciation models, e.g. the birth-death model. Unlike ML, Bayesian analyses do not aim at identifying the “best tree”. Instead, they generate a set of “plausible trees”.
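
Support values of the kind described above are frequencies of a branch across a collection of trees. The following is a minimal sketch of that counting step, assuming rooted trees encoded as nested Python tuples of tip names. This encoding and the function names are hypothetical and chosen only to keep the example self-contained. Inference programs typically work with unrooted bipartitions and report these values themselves, so this only illustrates the principle.

```python
from collections import Counter


def clades(tree):
    """Return the clades of a nested-tuple tree as frozensets of tip names."""
    found = set()

    def tips(node):
        # A string is a tip; a tuple is an internal node with child subtrees.
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(tips(child) for child in node))
        found.add(members)
        return members

    all_tips = tips(tree)
    found.discard(all_tips)  # the root clade is trivially present in every tree
    return found


def branch_support(reference_tree, replicate_trees):
    """Fraction of replicate trees containing each clade of the reference tree."""
    counts = Counter()
    for tree in replicate_trees:
        counts.update(clades(tree))
    n_trees = len(replicate_trees)
    return {clade: counts[clade] / n_trees for clade in clades(reference_tree)}


# Toy four-taxon trees, for illustration only.
reference = ((("A", "B"), "C"), "D")
replicates = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "D"), "C"),
    ((("A", "C"), "B"), "D"),
    ((("A", "B"), "C"), "D"),
]
support = branch_support(reference, replicates)
```

Here the clade grouping A and B is found in three of the four toy replicates, so its support is 0.75. The same counting applies to any collection of trees sharing the same set of tips.
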
The frequency of appearance of any branch in this set of trees is a good approximation of its posterior probability (which cannot be directly estimated), i.e. a measure of its statistical robustness. Bayesian sets of trees are usually summarized onto a single tree, the maximum clade credibility tree (MCC tree), which is the “best representative tree” of the entire set under consideration (considering branch posterior probabilities). Posterior probabilities can be plotted above the corresponding branch of the MCC tree. It is usual to only present the ML tree or the MCC tree in publications, in which case both bootstrap values and posterior probabilities can be co-plotted above the appropriate branches.

Properly setting up, running and analyzing the output of Bayesian analyses requires some learning. It is far beyond the scope of this short document to provide general guidelines about these steps, but a number of excellent resources and tutorials are available online, e.g. at http://beast2.org/.

The SG will only consider taxonomical claims that rely on properly performed phylogenetic analyses. To quickly summarize, this should include: 1) a meaningful amino acid alignment, 2) a model selection procedure, 3) the implementation of at least two phylogenetic inference methods, at least one of which is character-based and probabilistic, i.e. claims only backed by distance-based methods will not be considered, and 4) a statistical assessment of branch support.

References

1. Anisimova M, Liberles DA, Philippe H, Provan J, Pupko T, von Haeseler A (2013) State of the art methodologies dictate new standards for phylogenetic analysis. BMC Evol Biol 13:161
2. Bouckaert R, Heled J, Kuhnert D, Vaughan T, Wu CH, Xie D, Suchard MA, Rambaut A, Drummond AJ (2014) BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Comput Biol 10:e1003537
3. Darriba D, Taboada GL, Doallo R, Posada D (2011) ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics 27:1164-1165
4. Drummond AJ, Suchard MA, Xie D, Rambaut A (2012) Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol 29:1969-1973
5. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792-1797
6. Gouy M, Guindon S, Gascuel O (2010) SeaView version 4: A multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Mol Biol Evol 27:221-224
7. Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O (2010) New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 59:307-321
8. Hall BG (2013) Building phylogenetic trees from molecular data with MEGA. Mol Biol Evol 30:1229-1235
9. Kearse M, Moir R, Wilson A, Stones-Havas S, Cheung M, Sturrock S, Buxton S, Cooper A, Markowitz S, Duran C, Thierer T, Ashton B, Meintjes P, Drummond A (2012) Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28:1647-1649
10. Ronquist F, Teslenko M, van der Mark P, Ayres DL, Darling A, Hohna S, Larget B, Liu L, Suchard MA, Huelsenbeck JP (2012) MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol 61:539-542
11. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Soding J, Thompson JD, Higgins DG (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7:539
12. Talavera G, Castresana J (2007) Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst Biol 56:564-577
13. Tamura K, Stecher G, Peterson D, Filipski A, Kumar S (2013) MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Mol Biol Evol 30:2725-2729