Phylogenetic Methods 2

An Introduction to Phylogenetic Methods
Part two
Dr Laura Emery
Laura.Emery@ebi.ac.uk
www.ebi.ac.uk
Objectives
• After this tutorial you should be able to…
• Discuss a range of methods for phylogenetic inference, their
advantages, assumptions and limitations
• Implement some phylogenetic methods using publicly
available software
• Appreciate some approaches for assessing branch support
and selecting an appropriate substitution model
• Know where to look for further information
Outline
• Alignment for phylogenetics
• Phylogenetics: The general approach
• Phylogenetic Methods (1 – simple methods)
• Assessing Branch Support
BREAK
• Substitution Models
• Phylogenetic Methods (2 – statistical inference)
• Deciding which model to use (hypothesis testing)
• Software
The problem of multiple substitutions
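[Figure: substitutions (*) at a single site along diverging lineages, including hidden mutations that leave no trace in the observed bases]
• Hidden mutations are more likely to have occurred between distantly related species
• > We need an explicit model of evolution to account for these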
Methodological approaches
1. Distance matrix methods (pre-computed distances)
  • UPGMA assumes perfect molecular clock Sokal & Michener (1958)
  • Minimum evolution (e.g. Neighbor-joining, NJ) Saitou & Nei (1987)
2. Maximum parsimony Fitch (1971)
  • Minimises number of mutational steps
3. Maximum likelihood, ML
  • Evaluates statistical likelihood of alternative trees, based on an explicit model of substitution
4. Bayesian methods
  • Like ML but can incorporate prior knowledge
What is a substitution model?
Statistical phylogenetic inference
Figure Brian Moore
Models of sequence evolution
• We use models of substitution to ‘roughly’ describe the
way that we believe the sequences have evolved
• They are necessarily highly simplified descriptions of
more complex biological processes
• Parameters can be added to build more sophisticated
models if we believe this is relevant for our data
Substitution Models
• Common nested models
  • Jukes and Cantor (JC) 1969
  • Kimura 2 Parameter (K2P) 1980
  • Felsenstein 1981 (F81)
  • Hasegawa, Kishino and Yano 1985 (HKY85)
  • Generalised time-reversible (GTR or REV) Tavaré 1986
• Accounting for rate heterogeneity
• Other substitution models
The Jukes and Cantor (JC) 1969 model
• 1 parameter
  • μ = mutation rate
• Assumptions
  • Equal base frequencies
  • All mutations equally likely
  • All sites evolve at the same rate
  • All sites evolve independently
  • Time reversibility
[Figure: the four bases A, C, G and T, with every substitution between them labelled μ]
d = estimated nucleotide distance
p = observed distance in sequence data
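The two symbols above are usually related through the standard Jukes and Cantor correction, d = -(3/4) ln(1 - (4/3) p). The following is a minimal Python sketch of that correction, assuming two aligned, gap-free sequences of equal length. The function names and the example sequences are illustrative and are not taken from the tutorial.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, non-empty and of equal length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)

def jc_distance(p):
    """Jukes-Cantor corrected distance; undefined once p reaches 3/4 (saturation)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)

# Hypothetical toy alignment: 2 differences in 10 sites
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
d = jc_distance(p)
print(f"p = {p:.3f}, d = {d:.3f}")
```

The corrected distance d is always at least as large as p, because the correction adds back the hidden substitutions that p cannot see.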
But not all substitutions are equally likely…
Transitions are more likely to occur than transversions
Figures Andrew Rambaut
Kimura 2 Parameter (K2P) 1980
• 2 parameters
  • μ = mutation rate
  • κ = transition/transversion ratio
• Assumptions
  • Equal base frequencies
  • All sites evolve at the same rate
  • All sites evolve independently
  • Time reversibility
• No longer assumed: all mutations equally likely
[Figure: the four bases A, C, G and T; arrows for transitions (A↔G, C↔T) are labelled κ and arrows for transversions are labelled μ]
d = estimated nucleotide distance
p = observed distance in sequence data
q = proportion of sites with transversional differences
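Kimura's two-parameter distance is usually written in terms of P, the proportion of sites with transitional differences, and Q, the proportion with transversional differences (q above): d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q). If p above denotes the total proportion of differing sites, then P = p - q. The sketch below assumes aligned sequences containing only A, C, G and T. The helper names and example sequences are illustrative.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def ts_tv_proportions(seq_a, seq_b):
    """Return (P, Q): proportions of sites with transition / transversion differences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, non-empty and of equal length")
    ts = tv = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        # A<->G and C<->T stay within a chemical class, so they are transitions
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ts += 1
        else:
            tv += 1
    n = len(seq_a)
    return ts / n, tv / n

def k2p_distance(P, Q):
    """Kimura (1980) distance from transition (P) and transversion (Q) proportions."""
    a = 1 - 2 * P - Q
    b = 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("distance undefined: sequences too divergent")
    return -0.5 * math.log(a) - 0.25 * math.log(b)

# Hypothetical toy alignment: one transition (A->G) and one transversion (A->T)
P, Q = ts_tv_proportions("ACGTACGTAC", "GCGTTCGTAC")
d = k2p_distance(P, Q)
print(f"P = {P:.2f}, Q = {Q:.2f}, d = {d:.3f}")
```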
But base frequencies are often not equal...
Base frequencies vary among and within genomes
Felsenstein 1981 (F81)
• 4 symbols (3 parameters)
  • πA, πC, πG, πT = base frequencies
  • πA + πC + πG + πT = 1 (so 3 parameters)
• Assumptions
  • All mutations equally likely
  • All sites evolve at the same rate
  • All sites evolve independently
  • Time reversibility
• No longer assumed: equal base frequencies
[Figure: the four bases A, C, G and T, labelled with their frequencies πA, πC, πG and πT]
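In practice the base frequencies are often estimated directly from the alignment. As a minimal sketch, assuming sequences given as plain strings and ignoring gaps or ambiguity codes (the function name and the example sequences are illustrative):

```python
from collections import Counter

def base_frequencies(sequences):
    """Empirical frequencies of A, C, G and T pooled over all sequences.

    Any other symbol (gaps, N, ...) is ignored.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A, C, G or T characters found")
    return {base: counts[base] / total for base in "ACGT"}

# Hypothetical AT-rich toy data
freqs = base_frequencies(["ATATGATA", "ATTTAAGC"])
print(freqs)
```

Because the four frequencies must sum to 1, only three of them are free parameters.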
Hasegawa, Kishino and Yano 1985 (HKY85)
• 6 symbols (5 parameters)
  • μ = mutation rate
  • κ = transition/transversion ratio
  • πA, πC, πG, πT = base frequencies
• Assumptions
  • All sites evolve at the same rate
  • All sites evolve independently
  • Time reversibility
• No longer assumed: equal base frequencies; all mutations equally likely
[Figure: the four bases A, C, G and T with frequencies πA, πC, πG and πT; transition arrows are labelled κ and transversion arrows are labelled μ]
But there are also differences among the other nucleotide substitution rates...
1. A ↔ C
2. A ↔ G
3. A ↔ T
4. C ↔ G
5. C ↔ T
6. G ↔ T
Figures Andrew Rambaut
Generalised time-reversible (GTR) Tavaré 1986
• 10 symbols (9 parameters)
  • rAC, rAG, rAT, rCG, rCT, rGT = mutation rates
  • πA, πC, πG, πT = base frequencies
• Assumptions
  • All sites evolve at the same rate
  • All sites evolve independently
  • Time reversibility
• No longer assumed: equal base frequencies; all mutations equally likely
• Widely used
[Figure: the four bases A, C, G and T with frequencies πA, πC, πG and πT; each pair of bases is joined by its own rate (rAC, rAG, rAT, rCG, rCT, rGT)]
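One common way to combine these symbols is an instantaneous rate matrix Q in which the rate from base i to base j is r_ij × π_j. The sketch below builds such a matrix in plain Python and checks time reversibility (π_i q_ij = π_j q_ji). The function name, the dictionary layout and the example values are assumptions made for illustration, and the matrix is not rescaled to one expected substitution per unit time.

```python
BASES = "ACGT"

def gtr_rate_matrix(rates, freqs):
    """Build a 4x4 GTR rate matrix (rows and columns ordered A, C, G, T).

    rates: symmetric exchange rates keyed by base pair, e.g. "AC", "AG", ...
    freqs: base frequencies keyed by base, summing to 1.
    """
    if abs(sum(freqs.values()) - 1.0) > 1e-9:
        raise ValueError("base frequencies must sum to 1")
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                pair = "".join(sorted(x + y))   # "CA" and "AC" share one rate
                q[i][j] = rates[pair] * freqs[y]
        q[i][i] = -sum(q[i])                    # each row sums to zero
    return q

# Hypothetical values: transitions (AG, CT) four times faster than transversions
rates = {"AC": 1.0, "AG": 4.0, "AT": 1.0, "CG": 1.0, "CT": 4.0, "GT": 1.0}
freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
Q = gtr_rate_matrix(rates, freqs)

# Time reversibility: pi_i * q_ij equals pi_j * q_ji for every pair of bases
reversible = all(
    abs(freqs[x] * Q[i][j] - freqs[y] * Q[j][i]) < 1e-12
    for i, x in enumerate(BASES) for j, y in enumerate(BASES)
)
print(reversible)
```

Setting all six rates equal gives an F81-style matrix, and giving AG and CT one shared value and the four transversions another gives an HKY85-style matrix, which is why these models are described as nested.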
But some sites evolve faster overall than others...
973 mtDNA CR; parsimony analysis (with known pedigree)
Heyer et al. (2001)
Gamma distributed rates
• Rate variation among sites is often shown to be well approximated by a gamma distribution
• To use: add alpha (α) parameter to existing model e.g. GTR+G
• Assumptions
  • All sites evolve independently
  • Time reversibility
• No longer assumed: equal base frequencies; all mutations equally likely; all sites evolve at the same rate
[Figure: gamma distributions of substitution rate (x-axis, 0 to 2) against frequency (y-axis) for α = 0.5, α = 2, α = 50 and α = 200]
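To get a feel for what α does, the sketch below draws per-site relative rates from a gamma distribution with shape α and mean fixed at 1, so that α controls only the spread (the variance is 1/α). It uses Python's standard `random.gammavariate`. The function name, the sample size and the seed are arbitrary choices for illustration.

```python
import random
import statistics

def sample_site_rates(alpha, n_sites, seed=0):
    """Draw per-site relative rates from a gamma distribution with mean 1."""
    rng = random.Random(seed)
    # shape = alpha and scale = 1/alpha give mean 1 and variance 1/alpha
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

for alpha in (0.5, 2, 50, 200):
    rates = sample_site_rates(alpha, 20000)
    mean = statistics.mean(rates)
    var = statistics.pvariance(rates)
    print(f"alpha = {alpha}: mean rate = {mean:.2f}, variance = {var:.3f}")
```

A small α means strong rate heterogeneity, with many nearly invariant sites and a few fast ones. A large α means that all sites evolve at almost the same rate.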
Other substitution models
• Amino acid substitution models
  • Dayhoff 1972
  • Whelan and Goldman 2001 (WAG)
  • Le & Gascuel 2008 (LG)
• Codon models e.g. Yang 2000
• Relaxed molecular clock e.g. Drummond et al. 2006
• Mixture models
• And many more!
Maximum Likelihood
1. Calculate the probability of the observed sequence data
under a given model (including tree structure, branch
lengths, and transition parameters).
[The likelihood is proportional to this probability.]
2. Search for the tree(s) which maximize(s) the likelihood.
Likelihood:
$$L(\tau, \nu, \theta) = C \cdot \Pr(X \mid \tau, \nu, \theta)$$
where τ = topology, ν = branch lengths, θ = model parameters, C = constant, and X = data (alignment)
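As a toy illustration of the two steps, the sketch below treats the simplest possible case: two sequences, the JC model, and a single unknown, the branch length d separating them. It computes the log-likelihood of the alignment for each candidate d and picks the maximum by grid search. Real ML programs also search over tree topologies and further model parameters, so this is only a sketch under those simplifying assumptions, with illustrative function names and counts.

```python
import math

def jc_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of a pairwise alignment under JC for branch length d
    (expected substitutions per site)."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    # 0.25 is the frequency of the base observed in the first sequence
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))

def ml_branch_length(n_sites, n_diff, step=0.0001, max_d=3.0):
    """Grid search for the d that maximises the JC log-likelihood."""
    candidates = [step * k for k in range(1, int(max_d / step) + 1)]
    return max(candidates, key=lambda d: jc_log_likelihood(d, n_sites, n_diff))

# Hypothetical alignment: 1000 sites, 200 of which differ
d_hat = ml_branch_length(n_sites=1000, n_diff=200)
d_formula = -0.75 * math.log(1 - 4 * 0.2 / 3)   # JC distance for p = 0.2
print(f"grid-search ML estimate = {d_hat:.4f}, JC formula = {d_formula:.4f}")
```

For this two-sequence case the ML estimate coincides with the JC distance correction, which shows that the distance formula is itself a maximum likelihood estimate.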
Maximum Likelihood
• Advantages:
• statistically consistent
• requires the use of an explicit model of evolution
• Disadvantages:
• slow (especially if all possible trees are evaluated)
• produces a single ML tree
• Usage: Widely used and recommended method
Bayesian Inference
1. Calculate the probability of the model specified given the
sequence data observed (using equation derived from
Bayes Theorem)
2. Search the tree-space using MCMC (or equivalent) to
approximate the joint-posterior probability density
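$$\Pr[H_i \mid X] = \frac{\Pr[X \mid H_i]\,\Pr[H_i]}{\sum_{j=1}^{n} \Pr[X \mid H_j]\,\Pr[H_j]}$$
where Pr[H_i | X] is the posterior probability, Pr[X | H_i] the likelihood function, Pr[H_i] the prior probability, and the denominator the marginal likelihood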
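When the set of hypotheses is small enough to enumerate, Bayes' theorem can be applied directly. The sketch below does this for a handful of hypotheses (for example, three candidate trees) given their log-likelihoods and priors. The function name and the numbers are illustrative. For realistic tree spaces the sum in the denominator cannot be enumerated, which is why MCMC is used to approximate the posterior instead.

```python
import math

def posterior_probabilities(log_likelihoods, priors):
    """Posterior Pr[H_i | X] for a finite set of hypotheses.

    log_likelihoods: ln Pr[X | H_i] for each hypothesis.
    priors: Pr[H_i] for each hypothesis (all > 0).
    """
    if len(log_likelihoods) != len(priors):
        raise ValueError("need one prior per hypothesis")
    log_joint = [ll + math.log(pr) for ll, pr in zip(log_likelihoods, priors)]
    # Subtract the maximum before exponentiating to avoid numerical underflow
    m = max(log_joint)
    weights = [math.exp(v - m) for v in log_joint]
    total = sum(weights)   # proportional to the marginal likelihood
    return [w / total for w in weights]

# Hypothetical log-likelihoods for three trees, with equal priors
post = posterior_probabilities([-1000.0, -1002.0, -1005.0], [1 / 3, 1 / 3, 1 / 3])
print([round(p, 3) for p in post])
```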
Bayesian Inference
• Advantages:
• the option to incorporate prior knowledge
• produces probability distribution of possible trees
• unlike ML, treats model parameters as random variables
• Disadvantages:
• very slow
• heuristic methods of tree searching do not guarantee you
find the best tree
• Usage: Widely used and recommended method
Heuristic searches do not guarantee you find
the best tree
Figure Andrew Rambaut
How do I choose a substitution model?
• Biological intuition
• Develop hypothesis
  • Identify most appropriate assumptions and thus model for your data
  • Will a complex model with fewer assumptions better explain your data than a simple model?
• Test hypothesis
  • Likelihood ratio test
  • Bayes factor test
Not sure where to start? Empirical data show GTR+G (nucleotide) or LG (protein) to be a good bet for large, standard datasets
Choosing a more complex model with more parameters will always fit the data at least as well
> We want to know if the fit is significantly better
[Figure: three model fits, with R² = 0.78, R² = 0.86 and R² = 1]
Likelihood ratio test
• Requires models to be nested
• Uses likelihood ratio to evaluate if our hypothesis (H1) is significantly better than our null hypothesis (H0):
Likelihood ratio = L(H1)/L(H0)
where L(H1) is the likelihood of the hypothesis and L(H0) is the likelihood of the null hypothesis
• Twice the logarithm of this ratio (2Δ) approximates a chi-squared distribution under the null hypothesis H0:
2Δ = 2[ln(L(H1)) – ln(L(H0))]
i.e. twice the difference between the log likelihood of the hypothesis and the log likelihood of the null hypothesis
• with d degrees of freedom, corresponding to the difference in the number of free parameters between models
Likelihood ratio test example
• Question: Do the rates of transitions and transversions in
my sequence data significantly vary?
• H1: K2P better explains my data (2 rate parameters,
transitions different to transversions)
• H0: JC is adequate (1 rate parameter for all substitutions)
• Draw trees, find out ln(L(K2P)) = -23345; ln(L(JC)) = -23368
• Calculate: 2Δ = 2[ln(L(H1)) – ln(L(H0))]
2Δ = 2[-23345 – (-23368)] = 2 × 23 = 46
• d = difference in number of free parameters = 2 - 1 = 1
• Next we look this up on a χ² distribution…
Likelihood ratio test example
• Is our 2Δ (twice log of the likelihood ratio) greater than we would expect by chance (p = 0.05)?
• 2Δ = 46 (d = 1)
• The χ² critical value for d = 1 at p = 0.05 is 3.84
YES – 46 is much larger than 3.84
> We can reject H0 (JC) and accept H1 (K2P)
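The same comparison can be done numerically rather than with a table. The sketch below computes 2Δ for the K2P versus JC example and converts it to a p-value. It handles only the d = 1 case, where the chi-squared upper tail has a closed form in terms of the complementary error function (`math.erfc`). Other degrees of freedom would need a statistics library. The function names are illustrative.

```python
import math

def lrt_statistic(lnL_h1, lnL_h0):
    """Twice the difference in log-likelihood between H1 and H0."""
    return 2.0 * (lnL_h1 - lnL_h0)

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom.

    With 1 d.f. the variable is a squared standard normal, so
    P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    if x < 0:
        raise ValueError("the test statistic cannot be negative")
    return math.erfc(math.sqrt(x / 2.0))

# Log-likelihoods from the example: K2P (H1) against JC (H0)
stat = lrt_statistic(-23345, -23368)
p_value = chi2_sf_df1(stat)
reject_h0 = p_value < 0.05
print(f"2Δ = {stat:.0f}, p-value = {p_value:.2e}, reject H0: {reject_h0}")
```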
Software
• Sequence searching: BLAST, FASTA, PSI-Search
  http://www.ebi.ac.uk/services
• Multiple sequence alignment: Clustal Omega, MUSCLE, Prank (phylogenetically aware)
  http://www.ebi.ac.uk/services
• Distance-based phylogenetic methods: ClustalW2, PAUP
  http://www.ebi.ac.uk/Tools/phylogeny/
• Maximum likelihood phylogenetics: RAxML (coming soon to EBI tools), PhyML, SeaView, PAUP, PAML
• Bayesian phylogenetics: MrBayes, BEAST
• Model testing: ModelTest, PAML
• And lots more, see:
  http://evolution.genetics.washington.edu/phylip/software.html
Now it is your turn…
• Open your course manuals and begin Tutorial 2 (page
13)
• Also available to download from:
http://www.ebi.ac.uk/training/course/scuola-dibioinformatica-2013
• You will require the alignment file Rodents.txt
• You will require the software SeaView 4.4.2
http://pbil.univ-lyon1.fr/software/seaview.html
• There are answers available online but it is much better
to ask for help!
Thank you!
www.ebi.ac.uk
Twitter: @emblebi
Facebook: EMBLEBI