Tree Inference Methods

• Methods to infer phylogenetic trees: introduction
• There is no one correct method
• Methods are grouped according to two criteria
  – Does the method use discrete character states or distance matrices?
  – Does it cluster OTUs in a stepwise manner or evaluate a number of possible trees?

• Discrete character state methods
  – Character data include sequences, morphological characters, physiological characters, restriction maps, etc.
  – Each character is (usually) analyzed separately and independently
  – The best tree is deduced from a set of possible trees using the character state data
  – Information about individual characters is retained throughout the analysis and can be used to reconstruct ancestral states if necessary
  – Extremely computer intensive
  – Beyond a certain number of taxa, it is impossible to evaluate all possible trees
• Distance matrix methods
  – Calculate a measure of dissimilarity and abandon any information about the actual character states
  – The distance matrix is then used to build a tree from the ground up
  – The distance matrix represents the genetic or evolutionary distance
  – No need to evaluate multiple trees; computationally simple
  – Information is lost
  – No way to reconstruct ancestral states

• Tree evaluation methods
  – These methods use some criterion for selecting a ‘best’ tree based on the data
  – If possible, perform an exhaustive search of all possible trees, evaluate each one using the criterion, and choose the best
  – Not possible for large numbers of OTUs
  – Algorithms allow us to evaluate subsets of trees, but we risk never identifying the best tree
  – Many ‘best’ trees are possible (even likely)
• Clustering methods
  – Construct a tree from nothing using specific algorithms
  – Cluster the two most closely related taxa
  – Then add the third most closely related, and so on
  – Fast
  – Produce only one tree

Models of DNA Evolution

• Clustering methods: obtaining genetic distances
• Nucleotide substitution models
• In order to calculate a genetic distance, we must have some model of DNA evolution on which to “hang our hat”
• General assumptions of most models (often violated at least slightly)
  – All sites are independent of one another (violated by, e.g., compensatory changes)
  – Sites are homogeneous in their rates of change
  – Markovian: given the present state, future changes are unaffected by past states
  – Temporal homogeneity
• Strictly speaking, these assumptions apply only to regions undergoing little or no selection
• Our task is to determine a mathematical method to model the (presumed) stochastic processes that introduced the observed differences among sequences

• A model should:
  – Provide a consistent measure of dissimilarity among sequences
  – Provide distances that are linearly proportional to the time since divergence (if a molecular clock is assumed)
  – Provide distances representing the branch lengths on an evolutionary tree
• The basic model is just counting the number of differences: the p-distance (p = number of differences / number of sites)
• Intuitively simple but probably accurate in very few cases because of homoplasy
• Homoplasy: a character state shared by a set of sequences but not present in the common ancestor; a misleading phylogenetic signal
• Most commonly, homoplasy is introduced by multiple and back substitutions
• P-distances almost invariably underestimate the actual number of changes
• Saturation: the point at which any phylogenetic signal is lost; so many changes have occurred that the sequences are essentially random with respect to one another

• Substitutions as homogeneous Markov processes
• Markov processes are specified in Q matrices
• Q is a 4x4 matrix in which each position gives the instantaneous rate of change from one base to another
• μ = mutation rate
• a = rate at which A↔C change occurs relative to the other possible changes (b through f play the same role for the other five base pairs)

• Most Q matrices represent a time-homogeneous, time-continuous, stationary Markov process
• Assumptions
  – At any given site in a sequence, the rate of change from base i to base j is independent of the base that occupied the site prior to i
  – Time homogeneous/continuous: substitution rates do not change over time
  – Stationary: the relative frequencies of the bases (πA, πC, πG, πT) are at equilibrium
  – Many models are also time-reversible: at equilibrium, the amount of change from i to j equals that from j to i (πi qij = πj qji), so the rates themselves are equal only when the base frequencies are equal
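The Q-matrix bullets above can be made concrete with a small sketch. It assumes the common parameterization q_ij = μ · r_ij · π_j for i ≠ j, which is not spelled out in the slide text. The assignment of the labels b through f to specific base pairs is also an assumption; only a = A↔C is given. The function names are hypothetical.

```python
BASES = "ACGT"

# Assumed labeling of the six relative rates; the text only fixes a = A<->C.
PAIR_LABELS = {"a": "AC", "b": "AG", "c": "AT", "d": "CG", "e": "CT", "f": "GT"}


def build_q(rel_rates, freqs, mu=1.0):
    """Return Q as a dict of dicts with q[i][j] = mu * r_ij * pi_j for i != j."""
    pair_rate = {frozenset(pair): rel_rates[label]
                 for label, pair in PAIR_LABELS.items()}
    q = {i: {} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                q[i][j] = mu * pair_rate[frozenset((i, j))] * freqs[j]
        # The diagonal entry makes each row sum to zero.
        q[i][i] = -sum(q[i].values())
    return q


def is_stationary(q, freqs, tol=1e-12):
    """Check that the base frequencies are at equilibrium (pi * Q = 0)."""
    return all(abs(sum(freqs[i] * q[i][j] for i in BASES)) <= tol
               for j in BASES)


def is_reversible(q, freqs, tol=1e-12):
    """Check detailed balance: pi_i * q_ij == pi_j * q_ji for every pair."""
    return all(abs(freqs[i] * q[i][j] - freqs[j] * q[j][i]) <= tol
               for i in BASES for j in BASES)


# Illustrative (made-up) values: unequal base frequencies, transitions
# (A<->G and C<->T) twice as fast as transversions.
freqs = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
rel_rates = {"a": 1, "b": 2, "c": 1, "d": 1, "e": 2, "f": 1}
q = build_q(rel_rates, freqs)
print(is_stationary(q, freqs), is_reversible(q, freqs))
```

With these values the rate A→G (0.6) differs from the rate G→A (0.2), yet the detailed-balance check passes: reversibility constrains π_i q_ij, not the raw rates.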
• These Markov-process assumptions don’t make much sense biologically but are necessary if substitutions are to be modeled as stochastic processes

• Jukes-Cantor (JC69): the simplest model
• Assumptions:
  – Equilibrium frequencies for the four nucleotides are 25% each (πA = πC = πG = πT = 1/4)
  – All substitutions are equally probable (a = b = c = d = e = f = 1)
• Once the Q matrix is stated, the probability of change from one base to another over evolutionary time, P(t), is obtained by calculating the matrix exponential, P(t) = e^(Qt)
  – Matrix algebra is involved. I took it back in 1991. Forgive me.
• The resulting correction is d = −¾ ln(1 − (4/3)p)
  – p = the observed distance (p-distance)

• Using JC69
• Note the parallel substitution at position 9
• The actual distance is higher than the observed distance
• 6 changes actually occurred

• Using JC69
• p = 4/10 = 0.4
• d (JC69) = −¾ ln[1 − (4/3)(0.4)] = 0.5716
• Scaled to the 10 sites, this is about 5.7 substitutions: a more reasonable estimate of the number of actual changes that occurred (6) than the 4 observed
• What assumptions of JC69 are violated?
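The JC69 arithmetic above can be checked with a short sketch. The two sequences are invented for illustration (they are not the ones from the slide figure) and are chosen only so that 4 of 10 sites differ. The function names are hypothetical.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of aligned sites that differ (sites with gaps or Ns skipped)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(p):
    """JC69 correction d = -3/4 * ln(1 - 4p/3); undefined once p reaches 0.75."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75); beyond that the data are saturated")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical aligned sequences differing at 4 of 10 sites.
seq_a = "ACGTACGTAC"
seq_b = "TCGAACCTAG"

p = p_distance(seq_a, seq_b)
d = jc69_distance(p)
print(p, round(d, 4))  # 0.4 0.5716
```

The corrected distance always exceeds p for p > 0, and the gap widens quickly as p approaches 0.75, which is the saturation point for this model.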
• Kimura 2-parameter (K2P)
• Generally, transitions occur at higher rates than transversions
• This violates the rate assumptions of JC69
• A different rate must be considered for transitions (α) and transversions (β), which changes the Q matrix
• π remains ¼ for all bases
• d = ½ ln[1/(1 − 2P − Q)] + ¼ ln[1/(1 − 2Q)]
• P and Q are the proportions of sites that differ between the sequences due to transitions and transversions, respectively (this Q is a proportion, not the rate matrix)
• Note: if α = β …

• Felsenstein (1981): F81
• In most taxa, A+T ≠ C+G
• If there are only a few G’s, the rate of substitution from G to A will be low compared to other substitutions
• This violates the rate assumptions of JC69
• Different frequencies must be considered for each base while the substitution rates stay the same for all, which changes the Q matrix
• π can differ for every base (πA ≠ πC ≠ πG ≠ πT)
• Note that this model assumes similar base composition for all sequences under consideration
• Note: if πA = πC = πG = πT …

• Hasegawa, Kishino and Yano (HKY85)
  – Combines F81 and K2P
• General Time Reversible (GTR)
  – Allows all six pairs of substitutions to have distinct rates
  – Allows unequal base frequencies

• A variety of other models exist:
  – Tajima-Nei (1984): refines JC69 for more accurate rates of nucleotide substitution
  – Tamura 3-parameter (1992): corrects for multiple hits
  – Tamura-Nei (1993): corrects for multiple hits, considers purine and pyrimidine transitions differently

• Varying substitution rates among sites in sequences (rate heterogeneity) can be compensated for
• Most often, a gamma (Γ) distribution is used
• An α value that determines the shape of the distribution can be estimated from the data and incorporated into calculations
• Small values of α = L-shaped Γ-distribution and extreme rate variation among sites; most sites are nearly invariable but a few sites have very high substitution rates
• Large values (>1) of α = bell-shaped Γ-distribution and minimal rate variation among sites

• Choosing the wrong model may give the wrong tree
  – Wrong model → incorrect branch lengths, transition/transversion (Ti/Tv) ratios, divergence rate estimates, mutation rates, divergence dates
• What model to choose and how to choose it?
• Generally, more complex models fit the data better
  – Thus, it may seem best to use the most complex model by default
  – However, more parameters must be estimated, making computation more difficult (longer) and increasing the possibility of error in estimation
• Find a balance between complexity and practicality

• Choosing a model
• The fit of a model to the data is proportional to the likelihood: the probability of the data (D), given a model of evolution (M), a vector of model parameters (θ), a tree topology (τ) and a vector of branch lengths (ν)
  – L = P(D | M, θ, τ, ν)
  – The log likelihood is often used to ease computation: l = ln P(D | M, θ, τ, ν)
• Likelihood ratio test (LRT)
• LRT statistic: LRT = 2(l1 − l0)
  – l1 = the maximum log likelihood under the more complex model (alternative hypothesis)
  – l0 = the maximum log likelihood under the less complex model (null hypothesis)
  – Always ≥ 0
  – Large value = the more complex model is better

• Choosing a model
• Hierarchical likelihood ratio test (hLRT)
• Most of the models described above are nested, or hierarchical
  – e.g., JC69 is a special case of F81 in which the base frequencies are equal
• ModelTest will perform all possible comparisons and evaluate them using a χ² test

• Choosing a model
• Information criteria
• The likelihood of each model is penalized by a function of the number of free parameters (K) in the model; more parameters = higher penalty
• Akaike Information Criterion (AIC)
• AIC = −2l + 2K
• AIC estimates the amount of information lost when we use a particular model
• Smaller values are better
• Implemented in ModelTest and ProtTest

• Choosing a model
• Bayesian methods
• Bayes factors are similar to the LRT
• Posterior probabilities can be calculated
• Most commonly, the Bayesian Information Criterion (BIC) is calculated
• BIC = −2l + K ln n, where n is the sample size (commonly taken as the alignment length)
• Smaller = better
• Implemented in ModelTest and ProtTest
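A minimal sketch of how the LRT, AIC and BIC formulas above compare candidate models. The log-likelihood values and the alignment length are invented for illustration. The parameter counts cover only the substitution model and leave out branch lengths, which would add the same constant to every model on a fixed tree. The function names are hypothetical, and this is not how ModelTest itself is implemented.

```python
import math

# Free substitution-model parameters (branch lengths excluded, see lead-in).
FREE_PARAMS = {"JC69": 0, "K2P": 1, "F81": 3, "HKY85": 4, "GTR": 8}


def lrt_statistic(lnl_null, lnl_alt):
    """LRT = 2 * (l1 - l0), with l1 from the more complex (alternative) model."""
    return 2.0 * (lnl_alt - lnl_null)


def aic(lnl, k):
    """AIC = -2l + 2K."""
    return -2.0 * lnl + 2.0 * k


def bic(lnl, k, n):
    """BIC = -2l + K ln n, with n the sample size (here, alignment length)."""
    return -2.0 * lnl + k * math.log(n)


def rank_models(log_likelihoods, n_sites, criterion="AIC"):
    """Return (model, score) pairs sorted from best (smallest) to worst."""
    scores = {}
    for name, lnl in log_likelihoods.items():
        k = FREE_PARAMS[name]
        scores[name] = aic(lnl, k) if criterion == "AIC" else bic(lnl, k, n_sites)
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical maximized log likelihoods for one alignment of 1000 sites.
lnl = {"JC69": -2500.0, "K2P": -2480.0, "HKY85": -2470.0, "GTR": -2468.5}

# JC69 is nested in K2P (one extra parameter, so 1 degree of freedom);
# the chi-square critical value at the 0.05 level with 1 df is 3.841.
print(lrt_statistic(lnl["JC69"], lnl["K2P"]))
print(rank_models(lnl, 1000, "AIC")[0])
print(rank_models(lnl, 1000, "BIC")[0])
```

With these made-up numbers AIC prefers HKY85 while BIC prefers K2P, because the BIC penalty per parameter (ln 1000 ≈ 6.9) is larger than the AIC penalty (2).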