p-distance

Tree Inference Methods
• Methods to infer phylogenetic trees – Introduction
• There is no one correct method
• Methods are grouped according to two criteria
– Does it use discrete character states or distance matrices?
– Does it cluster OTUs in a stepwise manner or evaluate a number of possible trees?
Tree Inference Methods
• Discrete character state methods
– Includes sequences, morphological characters, physiological characters, restriction
maps, etc.
– Each character is analyzed separately and independently (usually)
– Best tree is deduced from a set of possible trees using the character state data
– Retain information about individual characters throughout the analysis and can be
used to reconstruct ancestral states if necessary
– Extremely computer intensive
– Beyond certain numbers of taxa, it is impossible to evaluate all possible trees
• Distance matrix methods
– Calculate a measure of dissimilarity and abandon any information about the actual
character states
– The distance matrix is then used to build a tree from the ground up
– Distance matrix represents the genetic or evolutionary distance
– No need to evaluate multiple trees, computationally simple
– Information is lost
– No way to reconstruct ancestral states
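To illustrate why evaluating every possible tree quickly becomes impossible, the sketch below counts the distinct strictly bifurcating trees for a given number of taxa, using the standard double-factorial formulas ((2n − 5)!! unrooted, (2n − 3)!! rooted). The function names are illustrative only.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of distinct unrooted, strictly bifurcating trees on n_taxa labeled tips."""
    if n_taxa < 3:
        return 1
    count = 1
    # (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5)
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


def count_rooted_trees(n_taxa: int) -> int:
    """Rooted trees on n tips equal unrooted trees on n + 1 tips: (2n - 3)!!."""
    return count_unrooted_trees(n_taxa + 1)


for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

Ten taxa already give more than two million unrooted trees, and the count grows faster than exponentially from there.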
Tree Inference Methods
• Tree evaluation methods
– With these methods, you have some criterion for selecting a ‘best’ tree based on
the data
– If possible, perform an exhaustive search of all possible trees, evaluate all of them
using criterion and choose the best one
– Not possible for large numbers of OTUs
– Algorithms allow us to evaluate subsets but we risk never identifying the best tree
– Many ‘best’ trees are possible (even likely)
• Clustering methods
– Construct a tree from nothing using specific algorithms
– Cluster the two most closely related taxa
– Then add a third most closely related, and so on…
– Fast
– Produce only one tree
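The stepwise clustering idea can be sketched with UPGMA (average-linkage clustering). UPGMA is used here only as one familiar example of a clustering method; the slides do not name a specific algorithm. The distances are hypothetical, and branch lengths are not tracked in this minimal version.

```python
from itertools import combinations


def upgma(names, dist):
    """Return a nested-tuple tree built by UPGMA.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) -> pairwise distance.
    """
    sizes = {name: 1 for name in names}   # cluster -> number of tips it contains
    d = dict(dist)
    while len(sizes) > 1:
        # Step 1: find the two closest clusters and join them.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        # Step 2: the distance from the new cluster to every other cluster is
        # the size-weighted average of the distances from its two members.
        for other in sizes:
            if other in (a, b):
                continue
            d[frozenset((merged, other))] = (
                sizes[a] * d[frozenset((a, other))]
                + sizes[b] * d[frozenset((b, other))]
            ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))


# Hypothetical distance matrix; frozenset("AB") is the pair {"A", "B"}.
dist = {
    frozenset("AB"): 2.0, frozenset("AC"): 6.0, frozenset("BC"): 6.0,
    frozenset("AD"): 10.0, frozenset("BD"): 10.0, frozenset("CD"): 10.0,
}
tree = upgma(["A", "B", "C", "D"], dist)
print(tree)
```

A and B are joined first, C is added next and D last, and exactly one tree comes out, which is the behavior described for clustering methods.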
Models of DNA Evolution
• Clustering Methods: Obtaining Genetic Distances
• Nucleotide substitution models
• In order to calculate a genetic distance, we must have some model of
DNA evolution on which to “hang our hat”
• General assumptions of most models (often violated at least slightly)
– All sites are independent of one another
– Sites are homogeneous in their rates of change
– Markovian: Given the present state, future changes are unaffected by past states
– Temporal homogeneity
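• Example of a violation: compensatory changes (a substitution at one site favors a matching change at another, so the sites are not independent)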
• Strictly speaking, these assumptions apply only to regions undergoing
little or no selection
• Our task is to determine a mathematical method to model the
(presumed) stochastic processes that introduced the observed
differences among sequences
Models of DNA Evolution
• A model should:
– Provide a consistent measure of dissimilarity among sequences
– Provide linearly proportional distances to the time since divergence (if a molecular
clock is assumed)
– Provide distances representing the branch lengths on an evolutionary tree
• The basic model is just counting the number of differences: the p-distance (p = number of differences / number of sites compared)
• Intuitively simple but probably accurate only for very few cases
because of homoplasy
• Homoplasy - a character state shared by a set of sequences but not
present in the common ancestor; a misleading phylogenetic signal
• Most commonly, homoplasy is introduced because of multiple and
back substitutions
• P-distances almost invariably underestimate the actual number of
changes
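A minimal sketch of the p-distance calculation for two aligned sequences. Skipping sites with gaps or ambiguity codes is an assumption made here (other conventions exist), and the example sequences are hypothetical.

```python
def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: only sites with an unambiguous base in both sequences count.
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    differences = sum(a != b for a, b in pairs)
    return differences / len(pairs)


# Two of the ten sites differ, so p = 0.2.
print(p_distance("ACGTACGTAC", "ACGTTCGAAC"))
```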
• Saturation – the point at which any phylogenetic signal is lost; so
many changes have occurred, the sequences are essentially random
with respect to one another
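Saturation can be made concrete with a small calculation. Assuming the Jukes-Cantor model introduced later in these notes, the expected proportion of differing sites after d substitutions per site is p = ¾(1 − e^(−4d/3)), which levels off at 0.75 no matter how many more changes occur.

```python
import math


def expected_p_under_jc69(d: float) -> float:
    """Expected proportion of differing sites after d substitutions per site (JC69)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


for d in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"d = {d:>4}: expected p = {expected_p_under_jc69(d):.3f}")
```

Once p is close to 0.75 the two sequences are about as different as two random sequences with equal base frequencies, so the observed differences carry almost no information about d.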
Models of DNA Evolution
• Substitutions as homogeneous Markov processes
• Markov processes are specified in Q matrices
• A 4x4 matrix in which each position gives the instantaneous rate of
change from one base to another.
• μ = mutation rate
• a = rate at which A-C change occurs relative to other possible
changes
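A minimal sketch of how such a Q matrix can be assembled. It assumes the common parametrization in which the rate from base i to base j is μ × (relative rate for that pair of bases) × πj, with each diagonal entry set so that its row sums to zero. The function name and the pair-keyed dictionary are illustrative choices, not notation from the slides.

```python
BASES = "ACGT"


def build_q(mu, rates, freqs):
    """Build a 4x4 instantaneous rate matrix as nested lists (rows/cols in ACGT order).

    mu:    overall rate scaling.
    rates: relative rates for the six unordered base pairs, e.g. {"AC": 1.0, ...}.
    freqs: equilibrium base frequencies, e.g. {"A": 0.25, ...}.
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                pair = "".join(sorted(x + y))   # "CA" and "AC" share one rate
                q[i][j] = mu * rates[pair] * freqs[y]
        # The diagonal makes each row sum to zero.
        q[i][i] = -sum(q[i])
    return q


equal_rates = {pair: 1.0 for pair in ("AC", "AG", "AT", "CG", "CT", "GT")}
equal_freqs = {base: 0.25 for base in BASES}
for row in build_q(1.0, equal_rates, equal_freqs):
    print(row)
```

With all six relative rates equal and all frequencies at 0.25, every off-diagonal entry is the same, which is the simplest case discussed below.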
Models of DNA Evolution
• Most Q matrices represent time-homogeneous, continuous-time,
stationary Markov processes
• Assumptions
– At any given site in a sequence, the rate of change from base i to base j is
independent of the base that occupied the site prior to i.
– Time homogeneous/continuous – substitution rates do not change over time
– Stationary – the relative frequencies of the bases (πA,πC,πG,πT) are at equilibrium
– Many models are also time-reversible – at equilibrium, the amount of change from i to j
equals that from j to i (πi·qij = πj·qji), so the process looks the same forward and backward in time
• These assumptions don’t make much sense biologically but are
necessary if substitutions are to be modeled as stochastic processes
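The stationarity and reversibility conditions can be checked numerically. The sketch below uses hypothetical, unequal base frequencies and a matrix in which the rate of change to base j is proportional to πj (the F81 form described later). It shows that a process can be time-reversible even though the raw rates qij and qji differ, because the condition is πi·qij = πj·qji.

```python
# Hypothetical base frequencies (A, C, G, T), chosen only for illustration.
PI = [0.1, 0.2, 0.3, 0.4]

# Rate of change to base j is proportional to PI[j]; diagonals make rows sum to zero.
Q = [[PI[j] if i != j else 0.0 for j in range(4)] for i in range(4)]
for i in range(4):
    Q[i][i] = -sum(Q[i])


def is_stationary(pi, q, tol=1e-12):
    """pi is stationary when the vector-matrix product pi * Q is zero."""
    return all(abs(sum(pi[i] * q[i][j] for i in range(4))) < tol for j in range(4))


def is_time_reversible(pi, q, tol=1e-12):
    """Detailed balance: pi_i * q_ij equals pi_j * q_ji for every pair of bases."""
    return all(abs(pi[i] * q[i][j] - pi[j] * q[j][i]) < tol
               for i in range(4) for j in range(4))


print(is_stationary(PI, Q), is_time_reversible(PI, Q))
print(Q[0][3], Q[3][0])   # the A->T and T->A rates themselves are not equal
```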
Models of DNA Evolution
• Jukes Cantor (JC69) – the simplest model
• Assumptions:
– Equilibrium frequencies for the four nucleotides are 25% each (πA=πC=πG=πT=1/4)
– All substitutions occur at the same rate (a=b=c=d=e=f=1)
• Once the Q matrix is stated, calculating the probability of change from
one base to another over evolutionary time, P(t) is accomplished by
calculating the matrix exponential
– Matrix algebra is involved. I took it back in 1991. Forgive me
• The resulting correction becomes d=-¾ln(1-(4/3)p)
– p = the observed distance (p-distance)
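The matrix exponential step can be checked numerically without much matrix algebra. The sketch below approximates P(t) = exp(Qt) with a truncated Taylor series and compares it with the known closed form for JC69. The scaling of Q (off-diagonal rates of 1/3, so that t is in expected substitutions per site) is an assumption made for this example.

```python
import math


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, terms=30):
    """Approximate exp(Q * t) with a truncated Taylor series (adequate for small Q*t)."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity matrix
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes (Qt)^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, qt)]
        for i in range(n):
            for j in range(n):
                result[i][j] += term[i][j]
    return result


# JC69 with every off-diagonal rate 1/3 and every diagonal entry -1.
Q_JC = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]

t = 0.5
P = mat_exp(Q_JC, t)
p_same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)   # closed form: base unchanged
p_diff = 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)   # closed form: one specific change
print(P[0][0], p_same)
print(P[0][1], p_diff)
```

The probability that a site differs is 1 − p_same = ¾(1 − e^(−4d/3)); setting this equal to the observed p and solving for d gives the correction d = −¾ ln(1 − (4/3)p).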
Models of DNA Evolution
• Using JC69
• Note the parallel substitution at position 9
• The actual distance is higher than the observed distance
• 6 changes actually occurred
Models of DNA Evolution
• Using JC69
• p = 4/10 = 0.4
• d (JC69) = -3/4 ln [1-4/3 (0.4)] = 0.5716
• A more reasonable estimate of the number of actual changes that
occurred (0.5716 × 10 sites ≈ 5.7, close to the 6 changes that actually occurred)
• What assumptions of JC69 are violated?
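The worked example above can be reproduced in a few lines. The function name is illustrative.

```python
import math


def jc69_distance(p: float) -> float:
    """Jukes-Cantor corrected distance from an observed p-distance."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 distance is undefined for p >= 0.75 (saturation)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = 4 / 10                   # 4 observed differences over 10 sites
d = jc69_distance(p)
print(round(d, 4))           # 0.5716
print(round(d * 10, 1))      # 5.7 estimated substitutions over the 10 sites
```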
Models of DNA Evolution
• Kimura 2-parameter (K2P)
• Generally, transitions occur at higher rates than transversions
• This violates the rate assumptions of JC69
Models of DNA Evolution
• Kimura 2-parameter
• A different rate must be considered for transitions (α) and
transversions (β), changing the Q matrix to:
• π remains ¼ for all bases
• d = ½ ln[1/(1 − 2P − Q)] + ¼ ln[1/(1 − 2Q)]
• P and Q are the proportional differences between sequences due to
transitions and transversions, respectively
• Note: if α = β …
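A minimal sketch of the K2P distance for two aligned, gap-free sequences, counting transitions (A↔G, C↔T) and transversions separately. The sequences are hypothetical and the function names are illustrative.

```python
import math

TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def k2p_from_proportions(P: float, Q: float) -> float:
    """K2P distance from the proportions of transitions (P) and transversions (Q)."""
    return -0.5 * math.log(1.0 - 2.0 * P - Q) - 0.25 * math.log(1.0 - 2.0 * Q)


def k2p_distance(seq1: str, seq2: str) -> float:
    """K2P distance between two aligned sequences without gaps."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    n = len(seq1)
    n_ts = sum(frozenset((a, b)) in TRANSITIONS for a, b in zip(seq1, seq2))
    n_tv = sum(a != b for a, b in zip(seq1, seq2)) - n_ts
    return k2p_from_proportions(n_ts / n, n_tv / n)


# Two transitions (A/G at two sites) and one transversion (C/A) over 10 sites.
print(k2p_distance("AACCGGTTAC", "GACCGATTAA"))

# When transversions are exactly twice as frequent as transitions (Q = 2P),
# the formula gives the same value as the JC69 correction for p = P + Q.
print(k2p_from_proportions(0.1, 0.2), -0.75 * math.log(1.0 - 4.0 * 0.3 / 3.0))
```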
Models of DNA Evolution
• Felsenstein (1981) - F81
• In most taxa, A+T ≠ C+G
• If there are only a few G’s, the rate of substitution from G to A will be
low compared to other substitutions
• Violates the rate assumptions of JC69
Models of DNA Evolution
• Felsenstein (1981) - F81
• Different frequencies must be considered for all bases, substitution
rates are the same for all, changing the Q matrix to:
• π can differ for each base (πA, πC, πG and πT need not be equal)
• Note that this model assumes similar base composition for all
sequences under consideration
• Note, if πA = πC = πG = πT …
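The slides do not give a distance formula for F81. A commonly used one is d = −B ln(1 − p/B) with B = 1 − (πA² + πC² + πG² + πT²), sketched below. The base frequencies in the example are hypothetical.

```python
import math


def f81_distance(p: float, freqs) -> float:
    """F81 distance from a p-distance and the four base frequencies."""
    b = 1.0 - sum(f * f for f in freqs)
    if not 0.0 <= p < b:
        raise ValueError("distance is undefined for p >= B (saturation)")
    return -b * math.log(1.0 - p / b)


p = 0.4
print(f81_distance(p, [0.25, 0.25, 0.25, 0.25]))   # equal frequencies: B = 3/4
print(f81_distance(p, [0.1, 0.2, 0.3, 0.4]))       # skewed frequencies: B = 0.7
```

With equal frequencies B is ¾ and the expression is the JC69 correction; skewed frequencies lower B and give a larger corrected distance for the same p.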
Models of DNA Evolution
• Hasegawa, Kishino and Yano (HKY85)
• Combines F81 and K2P
• General Time Reversible (GTR)
• Allows all six pairs of substitutions to have
distinct rates
• Allows unequal base frequencies
Models of DNA Evolution
• A variety of other models exist:
• Tajima-Nei (1984) – refines JC69 for more accurate rates of
nucleotide substitution
• Tamura 3-parameter (1992) – corrects for multiple hits, accounting for transition/transversion and G+C-content biases
• Tamura-Nei (1993) – corrects for multiple hits, considers purine and
pyrimidine transitions differently
Models of DNA Evolution
• Varying substitution rates among sites in sequences (rate
heterogeneity) can be compensated for
• Most times, a gamma, Γ, distribution is used
• An α value to determine the shape of the distribution can be
estimated from the data and incorporated into calculations
Models of DNA Evolution
• Small values of α = L-shaped Γ-distribution and extreme rate variation
among sites, most sites invariable but a few sites have very high
substitution rates
• Large values (>1) of α = bell-shaped Γ-distribution and minimal rate
variation among sites
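As one example of folding Γ-distributed rates into a distance, the JC69 correction has a known gamma version: d = ¾ α [(1 − (4/3)p)^(−1/α) − 1]. The sketch below compares it with plain JC69 for a few illustrative values of α; in practice α would be estimated from the data.

```python
import math


def jc69_distance(p: float) -> float:
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc69_gamma_distance(p: float, alpha: float) -> float:
    """JC69 distance with gamma-distributed rates among sites (shape alpha)."""
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)


p = 0.4
print("JC69 (equal rates):", round(jc69_distance(p), 4))
for alpha in (0.5, 1.0, 5.0, 100.0):
    print(f"alpha = {alpha:>5}:", round(jc69_gamma_distance(p, alpha), 4))
```

Small α (strong rate variation) inflates the corrected distance well beyond the equal-rates value, while large α converges toward plain JC69.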
Models of DNA Evolution
• Choosing the wrong model may give the wrong tree
– Wrong model → incorrect branch lengths, transition/transversion (Ti/Tv) ratios, divergence rate
estimates, mutation rates, divergence dates
• What model to choose and how to choose it?
• Generally, more complex models fit the data better
– Thus, it may seem best to use the most complex model by default
– However,
• More parameters must be estimated, making computation more difficult
(longer) and increasing the possibility of error in estimation
• Find a balance between complexity and practicality
Models of DNA Evolution
• Choosing a model
• The fit of a model to the data is proportional to:
– The probability of the data (D),
– given a model of evolution (M),
– a vector of model parameters (θ),
– a tree topology (τ) and a vector of branch lengths (ν)
– L = P(D | M, θ, τ, ν)
– Often use the log likelihood to ease computation
– l = ln P(D | M, θ, τ, ν)
• Likelihood ratio test (LRT)
• LRT statistic: LRT = 2 (l1 – l0)
• l1 = the maximum log likelihood under the more complex model (alternative hypothesis)
• l0 = the maximum log likelihood under the less complex model (null hypothesis)
• Always ≥ 0
• Large value = the more complex model is better
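A minimal sketch of the LRT calculation. For nested models the statistic is usually compared with a χ² distribution whose degrees of freedom equal the number of extra free parameters in the more complex model. The log likelihoods below are hypothetical, and the χ² tail probability is computed with a small standard-library routine rather than a statistics package.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square distribution with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k < df/2 of (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite correction series.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * k + 1)
    return total


def likelihood_ratio_test(lnl_null: float, lnl_alt: float, extra_params: int):
    """Return the LRT statistic 2 * (l1 - l0) and its chi-square p-value."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf(stat, extra_params)


# Hypothetical values: JC69 (null) vs F81 (alternative, 3 extra frequency parameters).
stat, p_value = likelihood_ratio_test(-1234.5, -1229.0, extra_params=3)
print(stat, p_value)
```

Here the statistic is 11.0 with 3 degrees of freedom and the p-value falls below 0.05, so the more complex model would be preferred in this made-up case.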
Models of DNA Evolution
• Choosing a model
• Hierarchical likelihood ratio test (hLRT)
• Most of the models described above are nested, or hierarchical
– e.g. JC is a special case of F81 where the base frequencies are equal
• ModelTest will perform all possible comparisons and evaluate them
using a χ² test
Models of DNA Evolution
• Choosing a model
• Information criteria
• The likelihood of each model is penalized by a function of the number
of free parameters (K) in the model; more parameters = higher
penalty
• Akaike Information Criterion (AIC)
• AIC = -2l + 2K
• AIC = the amount of information lost when we use a particular model
• Small values are better
• ModelTest, ProtTest
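A minimal sketch of ranking models by AIC. The log likelihoods are hypothetical, and K counts only the free substitution-model parameters (0 for JC69, 1 for K2P, 4 for HKY85, 8 for GTR); branch lengths are left out on the assumption that every model is fitted on the same tree, so they add the same amount to each score.

```python
def aic(log_likelihood: float, k: int) -> float:
    """AIC = -2l + 2K."""
    return -2.0 * log_likelihood + 2.0 * k


# model name -> (hypothetical maximized log likelihood, free parameters K)
models = {
    "JC69": (-1234.5, 0),
    "K2P": (-1230.1, 1),
    "HKY85": (-1229.8, 4),
    "GTR": (-1229.0, 8),
}

scores = {name: aic(lnl, k) for name, (lnl, k) in models.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:6s} AIC = {score:.1f}")

best = min(scores, key=scores.get)
print("Preferred model:", best)
```

GTR has the highest likelihood in this made-up example, but its extra parameters are penalized enough that K2P has the smallest AIC.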
Models of DNA Evolution
• Choosing a model
• Bayesian methods
• Bayes factors are similar to LRT
• Posterior probabilities can be calculated
• Most commonly Bayesian Information Criterion (BIC) is calculated
• BIC = -2l + K ln n, where n is the sample size (usually taken as the number of sites in the alignment)
• Smaller = better
• ModelTest & ProtTest
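A minimal sketch comparing BIC with AIC, using the standard form BIC = −2l + K ln n. The log likelihoods and the alignment length n are hypothetical. Because ln n exceeds 2 for any realistic alignment, BIC penalizes extra parameters more heavily than AIC and can prefer a simpler model.

```python
import math


def aic(log_likelihood: float, k: int) -> float:
    return -2.0 * log_likelihood + 2.0 * k


def bic(log_likelihood: float, k: int, n: int) -> float:
    """BIC = -2l + K ln n, with n the sample size (here, alignment length)."""
    return -2.0 * log_likelihood + k * math.log(n)


n_sites = 500                                        # hypothetical alignment length
models = {"JC69": (-1234.5, 0), "K2P": (-1232.0, 1)}  # hypothetical fits

best_aic = min(models, key=lambda m: aic(*models[m]))
best_bic = min(models, key=lambda m: bic(*models[m], n_sites))
print("AIC prefers:", best_aic)
print("BIC prefers:", best_bic)
```

In this made-up case the two criteria disagree: the likelihood gain from K2P outweighs the AIC penalty of 2 but not the BIC penalty of ln 500 ≈ 6.2.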