An Introduction to Phylogenetic Methods
Part two
Dr Laura Emery
Laura.Emery@ebi.ac.uk
www.ebi.ac.uk

Objectives
After this tutorial you should be able to:
• Discuss a range of methods for phylogenetic inference, their advantages, assumptions and limitations
• Implement some phylogenetic methods using publicly available software
• Appreciate some approaches for assessing branch support and selecting an appropriate substitution model
• Know where to look for further information

Outline
• Alignment for phylogenetics
• Phylogenetics: The general approach
• Phylogenetic Methods (1 – simple methods)
• Assessing Branch Support
BREAK
• Substitution Models
• Phylogenetic Methods (2 - statistical inference)
• Deciding which model to use (hypothesis testing)
• Software

The problem of multiple substitutions
[Figure: substitutions (*) along lineages, including hidden mutations]
• More likely to have occurred between distantly related species
> We need an explicit model of evolution to account for these

Methodological approaches
What is a substitution model?
1. Distance matrix methods (pre-computed distances)
• UPGMA, which assumes a perfect molecular clock: Sokal & Michener (1958)
• Minimum evolution (e.g. Neighbor-joining, NJ): Saitou & Nei (1987)
2. Maximum parsimony: Fitch (1971)
• Minimises number of mutational steps
3. Maximum likelihood, ML
• Evaluates statistical likelihood of alternative trees, based on an explicit model of substitution
4.
Bayesian methods
• Like ML but can incorporate prior knowledge
Methods 3 and 4 are statistical phylogenetic inference
Figure: Brian Moore

Models of sequence evolution
• We use models of substitution to ‘roughly’ describe the way that we believe the sequences have evolved
• They are necessarily highly simplified descriptions of more complex biological processes
• Parameters can be added to build more sophisticated models if we believe this is relevant for our data

Substitution Models
• Common nested models
  • Jukes and Cantor (JC) 1969
  • Kimura 2 Parameter (K2P) 1980
  • Felsenstein 1981 (F81)
  • Hasegawa, Kishino and Yano 1985 (HKY85)
  • Generalised time-reversible (GTR or REV) Tavaré 1986
• Accounting for rate heterogeneity
• Other substitution models

The Jukes and Cantor (JC) 1969 model
• 1 parameter
  • μ = mutation rate
• Assumptions
  • Equal base frequencies
  • All mutations equally likely
  • All sites evolve at the same rate
  • All sites evolve independently
  • Time reversibility
[Figure: substitution diagram for A, C, G and T; every arrow has rate μ]
d = estimated nucleotide distance
p = observed distance in sequence data

But not all substitutions are equally likely…
Transitions are more likely to occur than transversions
Figures: Andrew Rambaut

Kimura 2 Parameter (K2P) 1980
• 2 parameters
  • μ = mutation rate
  • κ = transition/transversion ratio
• Assumptions
  • Equal base frequencies
  • All mutations equally likely (no longer assumed: transitions and transversions have different rates)
  • All sites evolve at the same rate
  • All sites evolve independently
  • Time reversibility
[Figure: substitution diagram for A, C, G and T; arrows labelled μ or κ]
d = estimated nucleotide distance
p = observed distance in sequence data
q = proportion of sites with transversional differences

But base frequencies are often not equal...
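The distance formulas on the JC and K2P slides were images and are not reproduced in the text. As a minimal sketch, assuming the standard published corrections (JC: d = -(3/4) ln(1 - (4/3)p); K2P: d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of sites showing transitional and transversional differences), the calculation could look like this. The function names are hypothetical and are not taken from any package used in the tutorial.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def observed_proportions(seq1, seq2):
    """Return (transition, transversion) proportions for two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    return transitions / compared, transversions / compared


def jc69_distance(p):
    """JC corrected distance from the observed proportion of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("JC distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


def k2p_distance(ts, tv):
    """K2P corrected distance from transition (ts) and transversion (tv) proportions."""
    a = 1 - 2 * ts - tv
    b = 1 - 2 * tv
    if a <= 0 or b <= 0:
        raise ValueError("K2P distance is undefined for these proportions")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


# Toy example: one transition (A/G) and one transversion (T/A) in 8 sites.
ts, tv = observed_proportions("AACCGGTT", "GACCGGTA")
print(jc69_distance(ts + tv), k2p_distance(ts, tv))
```

Both corrections return a distance larger than the observed proportion of differences, which is how they account for hidden multiple substitutions. When transversional differences are exactly twice as common as transitional ones, the two formulas give the same value.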
[Figure: Base frequencies vary among and within genomes (A, C, T, G)]

Felsenstein 1981 (F81)
• 4 symbols (3 parameters)
  • πA, πC, πG, πT = base frequencies
  • πA + πC + πG + πT = 1 (so 3 parameters)
• Assumptions
  • Equal base frequencies (no longer assumed)
  • All mutations equally likely
  • All sites evolve at the same rate
  • All sites evolve independently
  • Time reversibility
[Figure: diagram of A, C, G and T labelled with base frequencies πA, πC, πG, πT]

Hasegawa, Kishino and Yano 1985 (HKY85)
• 6 symbols (5 parameters)
  • μ = mutation rate
  • κ = transition/transversion ratio
  • πA, πC, πG, πT = base frequencies
• Assumptions
  • Equal base frequencies (no longer assumed)
  • All mutations equally likely (no longer assumed)
  • All sites evolve at the same rate
  • All sites evolve independently
  • Time reversibility
[Figure: substitution diagram for A, C, G and T; arrows labelled μ or κ, bases labelled πA, πC, πG, πT]

But there are also differences among the other nucleotide substitution rates...
1. A ↔ C
2. A ↔ G
3. A ↔ T
4. C ↔ G
5. C ↔ T
6. G ↔ T
Figures: Andrew Rambaut

Generalised time-reversible (GTR) Tavaré 1986
• 10 symbols (9 parameters)
  • rAC, rAG, rAT, rCG, rCT, rGT = mutation rates
  • πA, πC, πG, πT = base frequencies
• Assumptions
  • Equal base frequencies (no longer assumed)
  • All mutations equally likely (no longer assumed)
  • All sites evolve at the same rate
  • All sites evolve independently
  • Time reversibility
• Widely used
[Figure: substitution diagram for A, C, G and T; arrows labelled rAC, rAG, rAT, rCG, rCT, rGT, bases labelled πA, πC, πG, πT]

But some sites evolve faster than others...
[Figure: 973 mtDNA CR; parsimony analysis (with known pedigree). Heyer et al. (2001)]

Gamma distributed rates
• Rate variation among sites is often shown to be well approximated by a gamma distribution
• To use: add an alpha (α) parameter to an existing model, e.g. GTR+G
• Assumptions
  • Equal base frequencies (no longer assumed)
  • All mutations equally likely (no longer assumed)
  • All sites evolve at the same rate (no longer assumed)
  • All sites evolve independently
  • Time reversibility
[Figure: frequency against substitution rate (0 to 2) for gamma distributions with α = 0.5, α = 2, α = 50 and α = 200]

Other substitution models
• Amino acid substitution models
  • Dayhoff 1972
  • Whelan and Goldman 2001 (WAG)
  • Le & Gascuel 2008 (LG)
• Codon models e.g. Yang 2000
• Relaxed molecular clock e.g.
Drummond et al. 2006
• Mixture models
• And many more!

Methodological approaches (recap)
1. Distance matrix methods
2. Maximum parsimony
3. Maximum likelihood, ML
4. Bayesian methods
Methods 3 and 4 (statistical phylogenetic inference) are the recommended methods
Figure: Brian Moore

Maximum Likelihood
1. Calculate the probability of the observed sequence data under a given model (including tree structure, branch lengths, and transition parameters). [The likelihood is proportional to this probability.]
2. Search for the tree(s) which maximize(s) the likelihood.
Likelihood(topology, branch lengths, model parameters) = constant × Probability(data (alignment) | topology, branch lengths, model parameters)

Maximum Likelihood
• Advantages:
  • statistically consistent
  • requires the use of an explicit model of evolution
• Disadvantages:
  • slow (especially if all possible trees are evaluated)
  • produces a single ML tree
• Usage: widely used and recommended method

Bayesian Inference
1. Calculate the probability of the specified model given the observed sequence data (using an equation derived from Bayes' theorem)
2.
Search the tree-space using MCMC (or equivalent) to approximate the joint posterior probability density

$$\Pr[H_i \mid X] = \frac{\Pr[X \mid H_i]\,\Pr[H_i]}{\sum_{j=1}^{n} \Pr[X \mid H_j]\,\Pr[H_j]}$$

where Pr[H_i | X] is the posterior probability, Pr[X | H_i] is the likelihood function, Pr[H_i] is the prior probability, and the denominator is the marginal likelihood.

Bayesian Inference
• Advantages:
  • the option to incorporate prior knowledge
  • produces a probability distribution of possible trees
  • unlike ML, treats model parameters as random variables
• Disadvantages:
  • very slow
  • heuristic methods of tree searching do not guarantee you find the best tree
• Usage: widely used and recommended method

Heuristic searches do not guarantee you find the best tree
Figure: Andrew Rambaut

Methodological approaches (recap)
1. Distance matrix methods
2. Maximum parsimony
3. Maximum likelihood, ML
4. Bayesian methods

How do I choose a substitution model?
Biological intuition, develop hypothesis:
• Identify the most appropriate assumptions, and thus model, for your data
• Will a complex model with fewer assumptions better explain your data than a simple model?
Test hypothesis:
• Likelihood ratio test
• Bayes factor test

Not sure where to start?
Empirical data show that GTR+G (nucleotide) or LG (protein) is a good bet for standard, large datasets

A more complex model with more parameters will always fit the data at least as well as a simpler one
> We want to know if the fit is significantly better
[Figure: the same data fitted by increasingly complex curves; R² = 0.78, R² = 0.86, R² = 1]

Likelihood ratio test
• Requires models to be nested
• Uses the likelihood ratio to evaluate whether our hypothesis (H1) is significantly better than our null hypothesis (H0):
  Likelihood ratio = L(H1)/L(H0)
  where L(H1) is the likelihood of the hypothesis and L(H0) is the likelihood of the null hypothesis
• Twice the logarithm of this ratio (2Δ) approximates a chi-squared distribution under the null hypothesis H0:
  2Δ = 2[ln(L(H1)) - ln(L(H0))]
  i.e. twice the difference between the log likelihood of the hypothesis and the log likelihood of the null hypothesis
• with d degrees of freedom, corresponding to the difference in the number of free parameters between models

Likelihood ratio test example
• Question: Do the rates of transitions and transversions in my sequence data differ significantly?
• H1: K2P better explains my data (2 rate parameters, transitions different to transversions)
• H0: JC is adequate (1 rate parameter for all substitutions)
• Draw trees, find out ln(L(K2P)) = -23345; ln(L(JC)) = -23368
• Calculate: 2Δ = 2[ln(L(H1)) - ln(L(H0))] = 2[-23345 - (-23368)] = 2 × 23 = 46
• d = difference in number of free parameters = 2 - 1 = 1
• Next we look this up on a χ² distribution…

Likelihood ratio test example
• Is our 2Δ (twice the log of the likelihood ratio) greater than we would expect by chance (p = 0.05)?
• 2Δ = 46 (d = 1)
YES: 46 is much larger than 3.84, the χ² critical value for d = 1 at p = 0.05
> We can reject H0 (JC) and accept H1 (K2P)

Software
• Sequence searching: BLAST, FASTA, PSI-Search (http://www.ebi.ac.uk/services)
• Multiple sequence alignment: Clustal Omega, MUSCLE, Prank (phylogenetically aware) (http://www.ebi.ac.uk/services)
• Distance-based phylogenetic methods: ClustalW2, PAUP (http://www.ebi.ac.uk/Tools/phylogeny/)
• Maximum likelihood phylogenetics: RAxML (coming soon to EBI tools), PhyML, SeaView, PAUP, PAML
• Bayesian phylogenetics: MrBayes, BEAST
• Model testing: ModelTest, PAML
• And lots, lots more; see: http://evolution.genetics.washington.edu/phylip/software.html

Now it is your turn…
• Open your course manuals and begin Tutorial 2 (page 13)
• Also available to download from: http://www.ebi.ac.uk/training/course/scuola-dibioinformatica-2013
• You will require the alignment file Rodents.txt
• You will require the software SeaView 4.4.2 http://pbil.univ-lyon1.fr/software/seaview.html
• There are answers available online but it is much better to ask for help!

Thank you!
www.ebi.ac.uk
Twitter: @emblebi
Facebook: EMBLEBI
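
Supplementary sketch: the likelihood ratio test example from this tutorial (JC as H0, K2P as H1) can be reproduced in a few lines. This is a minimal illustration, not the procedure used by ModelTest or PAML. It assumes exactly one degree of freedom, for which the chi-squared upper-tail probability equals erfc(sqrt(x/2)), so only the Python standard library is needed. The function names are hypothetical.

```python
import math


def chi2_upper_tail_df1(x):
    """P(X > x) for a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def likelihood_ratio_test_df1(lnl_null, lnl_alt):
    """Return (2*delta, p-value) for two nested models differing by one free parameter."""
    two_delta = 2.0 * (lnl_alt - lnl_null)
    if two_delta < 0:
        # A nested, more general model should not have a lower maximised likelihood.
        raise ValueError("alternative model has lower log likelihood than the null")
    return two_delta, chi2_upper_tail_df1(two_delta)


# Log likelihoods from the tutorial example: ln L(JC) = -23368, ln L(K2P) = -23345.
two_delta, p_value = likelihood_ratio_test_df1(lnl_null=-23368.0, lnl_alt=-23345.0)
print(two_delta, p_value)
```

The 5% critical value for one degree of freedom is about 3.84, so any 2Δ above that gives p < 0.05. For models that differ by more than one free parameter, a general chi-squared routine from a statistics library would be needed instead of the erfc shortcut.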