Advanced in Phylogenetic Analysis Group 4 53A36 Group 3 Group 5 ** ** ** ** Group 6 ** 53A26 ** Group 1 Group 2 53A37 .2 Hsin-Fu Liu, Ph.D. Department of Medical Research, Mackay Memorial Hospital, Taipei, Taiwan E-mail: hsinfu@ms1.mmh.org.tw Application of Phylogenetic Analysis • • • • • • • • Evolutionary studies Classifications Phylogeography Phylogenomics Molecular Epidemiology Phylodynamics Forensic Sciences, Criminal Investigation etc.……. Some of the applications need to be very precise! Some key points for performing a phylogenetic analysis Select an informative region to analyze coding or non-coding region length of the fragment phylogenetic signal Make an optimal sequence alignment the bias in selecting the taxa sequence homology insertions or deletions Use different methods to construct the trees Neighbour-Joining Maximum Parsimony Maximum Likelihood Bayesian Statistical test for phylogenetic trees bootstrapping (1000 replicates) http://evolution.genetics.washington.edu/phylip/software.html This site lists more than 392 phylogeny packages and 54 free servers. Perhaps the best-known programs are PAUP* (David Swofford and colleagues) and PHYLIP (Joe Felsenstein). Distance matrix methods 1 TCAAGTCAGGTTCGA 2 TCCAGTTAGACTCGA 3 TTCAATCAGGCCCGA Convert dissimilarity to evolutionary distance by correcting for multiple events per site according to a certain model of evolution, e.g. Jukes and Cantor. 2 3 1 0.266 0.333 2 3 1 0.33 0.44 2 3 dissimilarity 0.333 2 0.44 3 evolutionary distance A first idea about the relationship among different taxa can be viewed by their evolutionary distance. The simplest way to estimate the evolutionary distance is to count the total number of nucleotide differences between them and divide by the total number of available sites. An evolutionary distance obtained in such a way is called uncorrected distance or Hamming distance. A more realistic estimation can be achieved by applying a particular model of evolution that is making assumptions on how nucleotides change during the evolutionary process. The simplest approach to measuring distances between sequences is to align pairs of sequences, and then to count the number of differences. The degree of divergence is called the Hamming distance. For an alignment of length N with n sites at which there are differences, the degree of divergence D is: D=n/N But observed differences do not equal genetic distance! Genetic distance involves mutations that are not observed directly. Uncorrected distance vs. evolutionary distance Jukes and Cantor (1969) proposed a corrective formula: D = (- 3 ) ln (1 – 4 p) 3 4 Consider an alignment where 3/60 aligned residues differ. The normalized Hamming distance is 3/60 = 0.05. The Jukes-Cantor correction is 4 D = (- 3 ) ln (1 – 0.05) = 0.052 3 4 When 30/60 aligned residues differ, the Jukes-Cantor correction is more substantial: D = (- 3 ) ln (1 – 4 0.5) = 0.82 4 3 Ancestral Sequence Present Sequences 1 A C T G A A C G T A A C G C A C T G G A G G A A T C G C 2 A A T G A A A G A A T C G C Although only 3 nucleotide differences can be observed, 13 mutations have occurred during evolution. During Evolution Sequence 1 A C T G T A C A C G G T A A T A C C G C G Sequence 2 A A C T G A A A C G A T A A T C G T C C Single substitution Sequential substitutions Coincidental substitutions Parallel substitutions Convergent substitutions Back substitutions Correcting for unobserved mutations Expected difference Sequence difference Correction Observed difference Time/Evolutionary distance How to correct the difference? Apply a substitution model that tries to estimate the correct number of substitutions. Nucleotide substitution models Jukes and Cantor’s A C T pyrimidines equal substitution rate Jukes-Cantor (JC or JC69) Kimura 2-Parameter (K2P or K80) A purines G Kimura’s 2 parameter C G T unequal substitution rate transition transversion one rate of substitution equal base frequencies two types of substitution equal base frequencies Felsenstein 1981 (F81) one rate of substitution unequal base frequencies Hasegawa-Kishino-Yano, or Felsenstein (HKY85 or F84) two types of substitution unequal base frequencies Tamura and Nei (TN93) three types of substitution unequal base frequencies General Time Reversible (GTR) six types of substitution unequal base frequencies Among-site rate variation (Γ distribution) α<1 is L-shaped: most of the sites are considered to have a low substitution rate, or they are virtually invariable, while few of them evolve at very high rate. α>1 is bell-shaped: most of the sites have intermediate rates, while few of them evolve at very low or very high rate. All the models described in the previous section assume each site in the aligned nucleotide sequences mutates exactly at the same rate. However, different selective pressure at different positions of the genome can be responsible for some sites to evolve slower or faster than others. In coding regions, mutations in first and second position generally lead to amino acid replacement and they occur normally at slower rate due to purifying selection. On the other hand, non-coding regions tend to change faster, because mutations can occur with less constraints than in coding ones. In a classical approach among-site rate heterogeneity can be approximated by a statistical function called g-distribution, which represents the nucleotide substitution rate as a function of the proportion of sites having a particular rate. The use of g-distribution can account for two different levels of rate variation in real data. The inclusion of g-distributed nucleotide substitution rates in the models described above is a more realistic description of the evolutionary process and can improve the estimation of the evolutionary distance. Genetic Distances: Example Genetic distance between SIVcpz and HIV-1 (p24 gene) • observed # changes / # sites = 0.406 • Jukes-Cantor = 0.586 • Kimura 2 parameter = 0.602 • Hasegawa-Kishino-Yano = 0.611 • General reversible = 0.620 • General reversible + gamma = 1.017 Likelihood Ratio Test (LRT) The performances of the different substitution models can be tested using the likelihood ratio test •For particular questions (e.g. molecular clock analysis) it is important to have the correct branch length and thus it is worth to put effort in investigating the most appropriate evolutionary model. •Whether or not the different lineages in a phylogenetic trees are evolving at constant rate it is possible to evaluate employing the LRT. •Analyses of real data suggest that while different models of nucleotide substitutions often have led to drastic changes in the likelihood of a tree or in the distance matrix of the taxa, they have only minor effect on tree topology. •The LRT rejects a simpler model in favor of a more complex one when the double of the difference between the two log likelihood, calculated employing the two different models, is significant in a 2 test. The degrees of freedom for the 2 test is given by the difference between the number of free parameters in the more complex model and the number of constraints in the simpler one. Model selection Akaike information criterion The Akaike information criterion is a measure of the relative goodness of fit of a statistical model. It was developed by Hirotsugu Akaike, under the name of "an information criterion" (AIC) first published in 1974. It is grounded in the concept of information entropy, in effect offering a relative measure of the information lost when a given model is used to describe reality. It can be said to describe the tradeoff between bias and variance in model construction, or loosely speaking between accuracy and complexity of the model. Where k is the number of parameters in the statistical model, and L is the maximized value of the likelihood function for the estimated model. AICc AICc is AIC with a correction for finite sample sizes. AICc was first proposed by Hurvich & Tsai in 1989. When all the models in the candidate set have the same k, then AICc and AIC will give identical (relative) valuations. In that situation, then, AIC can always be used. Model selection Bayesian information criterion Bayesian information criterion (BIC) is a criterion for model selection among a finite set of models. It is based, in part, on the likelihood function, and it is closely related to AIC. When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in over-fitting. The BIC resolves this problem by introducing a penalty term for the number of parameters in the model. The penalty term is larger in BIC than in AIC. The BIC was developed by Gideon E. Schwarz. In fact, Akaike was so impressed with Schwarz's Bayesian formalism that he developed his own Bayesian formalism, now often referred to as the ABIC for "a Bayesian Information Criterion" or more casually "Akaike's Bayesian Information Criterion". Parameter in the model More parameters the model uses, more powerful it does? Bias: distance between the average estimate and truth Variance: spread of the estimates around the truth Statistical test for phylogenetic trees The tree topology obtained by any of the tree-making methods should be viewed as only an estimate of the phylogenetic relation among the taxa. Therefore, it is essential to know how much confidence can be associated with the branch length or the appearance of a set of taxa as a monophyletic group in a given tree. Bootstrap analysis Bootstrap analysis is the most often used method for statistical evaluation of phylogenies. It is used to test the effects of sampling error on tree inference and the stability of tree nodes by calculating how often a particular cluster in a tree appears when nucleotide sites are re-sampled with replacements many times. In general: Bootstrap value > 95% : often close to 100% confidence in that branch Bootstrap value > 75% : often close to 95% confidence in that branch Branch can be considered as a monophyletic group. Note: If the original data set is biased for some reasons, a clade may be regarded as statistically significant even if it is a wrong one. Conversely, a clade may be a correct one even if its bootstrap value is less than 75%. This is because the original bias cannot be corrected by the re-sampling process. Possible consequence of insufficient bootstrapping samples Random number seed Alignment not modified Modified alignment Bootstrapping times Bootstrapping times 100 times 1000 times 100 times 1000 times 5 62% 69.4% 67% 70.2% 15 68% 70.1% 77% 72.7% 23 71% 67.8% 85% 74.0% Neighbor-joining trees Random number seed Alignment not modified Modified alignment Bootstrapping times Bootstrapping times 100 times 1000 times 100 times 1000 times 5 37.6% 35.1% 85% 80% 15 35.9% 37.2% 84% 80.5% 23 31.8% 33.7% 81% 80.5% Maximum parsimony trees Likelihood-mapping analysis (quartet puzzling method) Likelihood-mapping analyses can be used as a complementary approach to solve the controversial phylogenies. •The method is based on an analysis of the maximum likelihoods for the three fully resolved tree topologies that can be computed for four sequences. •The three likelihoods are represented as points inside an equilateral triangle. •The triangle is partitioned into different regions. •The centre of the triangle represents a star-like evolution whereas the three corners represent wellresolved phylogeny and the three intermediate regions between the corners reflect the difficulty in distinguishing between two of the three trees. •For more than four sequences, the different strains can be grouped into four different subsets and all possible quartets generated by drawing one sequence from each subset can be evaluated. •The more points distribute in a certain region of a particular corner, the bigger the support for the tree topology joining the four subsets represented by that corner. •If most points locate in the center of the triangle, the four subsets are independent and related by a star-like tree. L1 A C B D L3 L2 C A D L2 B L1 B A C L3 D Quartet puzzling method The three corners of the triangle represent the three possible unrooted tree topologies for four taxa. L1, L2, and L3 represent the three likelihoods of the three trees, respectively. Each length of the perpendicular from point P to the triangle side is equal to the likelihood of the tree represented by the opposite corner. Quartet Puzzling Support Values The quartet puzzling support values can be interpreted in a similar way as bootstrap values (though they should not confused with them). Branches showing a quartet puzzling reliability > 90% can be considered strongly supported. Branches with lower reliability (> 70%) can in principle be also trusted but in this case it is advisable to check how well the respective internal branch does in comparison to other branches in the tree (i.e. check relative reliability). If you are interested in a branch with a low confidence it is also important to check the alternative groupings that are not included in the quartet puzzling tree. Analysis of the phylogenetic signal The quartet puzzling approach can be used also to visualize the phylogenetic content of a particular data set of n aligned sequences. For n sequences (n!/4!) possible quartets, exist. When n≥11, a random sample of 10000 quartets is sufficient to obtain a comprehensive picture of the kind of phylogenetic signal present. The more the dots in the center of the triangle, the more the phylogenetic noise, reflecting star-like evolution, in the data set. 1 ALLELE FREQUENCY fixed mutation polymorphis m maintained lost mutation 0 TIME • • • • • Positive selective pressure: mutant is more fit Negative selective pressure: mutant is less fit Balancing selection: heterozygote is more fit Most synonymous mutations can be considered neutral Non-synonymous mutations are always subject to selective pressure dN/dS •Are the substitutions we observe neutral or beneficial? •First, classify mutations as synonymous or non-synonymous •Synonymous substitutions are likely to be neutral •Non-synonymous substitutions may be neutral, beneficial or deleterious •Beneficial mutations spread faster than neutral ones •dN/dS=ratio of non-synonymous & synonymous substitution rates. •dN/dS = 1 if all non-synonymous substitutions are neutral •dN/dS < 1 if some non-synonymous substitutions are deleterious •dN/dS = 0 if all non-synonymous substitutions are deleterious •dN/dS > 1 only if some non-synonymous substitutions are beneficial Synonymous substitutions are not always neutral •Overlapping genes and reading frames Very common in viruses (e.g. HTLVs) •Secondary RNA or DNA structure e.g. stem-loops •Codons for the same amino acid differ in fitness May result from different translational efficiencies of tRNAs •Untranslated or regulatory regions Detecting recombination in viral sequences Consistent genome-wide HIV-1 M-group clustering Phylogenetic analysis results in similar general clustering on HIV strains irrespective of which region of the genome is used for the analysis. Jumping of recombinant isolates Recombinant isolate KAL153. Similarity and bootscanning analysis followed by phylogenetic analysis (K2P+NJ) of the identified segments. Networks for analysis of recombination Split decomposition method Evolutionary relationships between taxa are most often represented as phylogenetic trees and this is justified by the assumption that evolution is a branching or tree-like process. However, a set of real data often contains a number of different and sometimes conflicting signals and thus does not always clearly support a unique tree. The split decomposition method is a transformation-based approach. Essentially, evolutionary data is transformed or, more precisely, "canonically decomposed", into a sum of "weakly compatible splits" and then represented by a so-called splits graph. For ideal data, this is a tree, whereas less ideal data will give rise to a tree-like network that can be interpreted as possible evidence for different and conflicting phylogenies. Neighbor-Net Neighbor-Net, a distance based method for constructing phylogenetic networks that is based on the Neighbor-Joining (NJ) algorithm of Saitou and Nei. Neighbor-Net provides a snapshot of the data that can guide more detailed analysis. Unlike split decomposition, Neighbor-Net scales well and can quickly produce detailed and informative networks for several hundred taxa. Multilocus sequence typing (MSLT) to classify several hundreds Salmonella isolates. J Clin Microbiol 2002; 40: 1627-35. Comparison of K80-based NJ tree and split-decomposition phylogenetic analysis Split decomposition of the Human Bocavirus in Taiwan References Graur D and Li W-H. 2000. Fundamentals of Molecular Evolution, 2nd edition, Sinauer Associates, Sunderland, MA. Hillis D, Moritz C, and Mable BK (eds). 1996. Molecular Systematics, 2nd edition, Sinauer Associates, Sunderland, MA. Masatoshi Nei and Sudhir Kumar. 2000. Molecular Evolution and Phylogenetics, Oxford University Press, New York, NY. Philippe Lemey, Marco Salemi and Anne-Mieke Vandamme (eds). 2009. The Phylogenetic Handbook : A Practical Approach to phylogenetic analysis and hypothesis testing, Second Edition, Cambridge University Press. Pevsner J. 2009. Bioinformatics and Functional Genomics, 2nd edition, Wiley-Blackwell. International Bioinformatics Workshop on Virus Evolution and Molecular Epidemiology, 1995-2009, Rega Institute for Medical Research, Katholieke Universiteit Leuven, Leuven, Belgium. Workshop on Molecular Evolution, 1998-2008, Marine Biological Laboratory, Woods Hole, Massachusetts, USA. Liu Hsin-Fu. 1996. Genomic diversity and molecular phylogeny of human and simian T-cell lymphotropic viruses. Thesis submitted for the degree of “Doctor in Medical Sciences”, Faulty of Medicine, Katholieke Universiteit Leuven, Leuven, Belgium. Salemi Macro. 1999. Molecular investigation of the origin and genetic stability of the human T-cell lymphotropic viruses. Thesis submitted for the degree of “Doctor in Sciences”, Faulty of Science, Katholieke Universiteit Leuven, Leuven, Belgium. Note: This teaching material was prepared for the workshop only, not for formal publication, therefore did not specifically address to the references. They were all taken from the list above.