Slides of Barbara's talk - School of Mathematical Sciences

Mostly statistical
Current challenges for
molecular phylogenetics
Barbara Holland
School of Mathematics & Physics
University of Tasmania
Charles Darwin and Alfred Russel Wallace
Evolution as descent with modification,
implying relationships between organisms by
unbroken genetic lines
Phylogenetics seeks to determine these genetic
relationships
Darwin’s sketch: the
first phylogenetic tree?
[Portraits: Alfred Russel Wallace and Charles Darwin]
Ernst Haeckel’s Tree of Life (1866)
Since the publication of On the Origin of Species in 1859, people
have been trying to infer the evolutionary “tree of life”.
Haeckel’s Pedigree of Man
Why molecular phylogeny
• Most molecules evolve independently of
adaptations affecting morphology.
• It is fairly easy to find genes that are present in
all species of interest, e.g., a 12S RNA
molecule in mitochondria is functional over all
mammals.
• Useful mathematical models of sequence
evolution have been developed that underpin
attempts to infer evolutionary trees
[Figure: where do the mole and the golden mole belong (?); trees placing mole, golden mole, whale, hedgehog, elephant and tenrec within Laurasiatheria and Afrotheria]
A brief and incomplete history of
molecular phylogenetics
[Timeline figure, 60s to 00s]
Data: antibodies; DNA-DNA hybridisation; sequence data (amino acid then DNA); PCR; MORE sequence data
Methods: distance based; parsimony; explicit models - maximum likelihood; more complex models - Bayesian methods
Issues: systematic bias – Felsenstein Zone; assessing support - bootstrap; population processes, gene trees in species trees; various perils... anomalous gene trees, non identifiable models
Pictured: Sarich, Wilson, Felsenstein
The molecular phylogeny problem
We see the aligned modern day sequences
…ACCCCTTA…
…ACCCCATA…
…ACTGCTTA…
…ACTGCTAA…
and want to recover the underlying evolutionary tree.
[Figure: sequences evolving along a tree through time, starting from ACCGCTTA; the tree relating the modern sequences is unknown (?)]
Sequence evolution is modelled as
a Markov process
Consider a single edge in a phylogeny, i.e. evolution of
a single species, and the evolution of a single DNA base
amongst the possible states {A, C, G, T}.
The probability of mutating from state i to j over a
length of time t depends only on the current state i and
the potential future state j, not on any of the previous
history of the sequence, and can be written $p_{ij}(t)$.
[Figure: a single base changing state along an edge over time t]
Continuous time Markov chains
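Transition matrix (rows and columns ordered A, C, G, T):
$$M = \begin{pmatrix}
p_{AA} & p_{AC} & p_{AG} & p_{AT} \\
p_{CA} & p_{CC} & p_{CG} & p_{CT} \\
p_{GA} & p_{GC} & p_{GG} & p_{GT} \\
p_{TA} & p_{TC} & p_{TG} & p_{TT}
\end{pmatrix}$$
Instantaneous rate matrix:
$$Q = \begin{pmatrix}
-q_{A*} & q_{AC} & q_{AG} & q_{AT} \\
q_{CA} & -q_{C*} & q_{CG} & q_{CT} \\
q_{GA} & q_{GC} & -q_{G*} & q_{GT} \\
q_{TA} & q_{TC} & q_{TG} & -q_{T*}
\end{pmatrix}$$
where $q_{i*} = \sum_{j \neq i} q_{ij}$, i.e. rows sum to zero.
$$M = \exp(Qt)$$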
Typically we restrict to stationary, reversible models, with the stationary distribution
denoted by π. So, π Q = 0, and D(π)Q is symmetric.
Models of nucleotide
substitution
• Jukes Cantor (JC)
– All substitutions equally likely
– Base frequencies equal
• Kimura 2 Parameter (K2P)
– Transitions and transversions at different rates
– Base frequencies equal
• HKY model
– Transitions and transversions at different rates
– Base frequencies different
• General Time Reversible (GTR)
[Figure: substitution rates between A, C, G and T, built up model by model: a single rate α for JC; two rates α and β once transitions and transversions differ; six rates α, β, γ, δ, ε, ζ for GTR]
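The models listed here differ only in how the off-diagonal rates of Q are tied together, and each gives transition matrices through M = exp(Qt). The following is a minimal Python sketch, not the speaker's code. It assumes the common parameterisation q_ij = s_ij π_j with symmetric exchangeabilities s_ij, leaves the rates unnormalised, and uses a plain Taylor-series matrix exponential. All function names are illustrative.

```python
import math

BASES = "ACGT"
TRANSITIONS = (frozenset("AG"), frozenset("CT"))  # purine<->purine, pyrimidine<->pyrimidine

def exchangeabilities(transition=1.0, transversion=1.0):
    """Symmetric rates s_ij: one value for transitions, one for transversions."""
    return {
        frozenset((a, b)): transition if frozenset((a, b)) in TRANSITIONS else transversion
        for i, a in enumerate(BASES) for b in BASES[i + 1:]
    }

def rate_matrix(exch, pi):
    """Assumed form q_ij = s_ij * pi_j for i != j; diagonal makes rows sum to zero."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i != j:
                q[i][j] = exch[frozenset((a, b))] * pi[j]
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 here, so this is minus the off-diagonal sum
    return q

def mat_mul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_exp(q, t, terms=20, squarings=10):
    """exp(Q t): Taylor series on Q t / 2^s, followed by s repeated squarings."""
    n = len(q)
    scale = t / 2 ** squarings
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# JC: every exchangeability equal, uniform base frequencies.
jc = rate_matrix(exchangeabilities(), [0.25] * 4)
# HKY-style: transitions twice as fast as transversions, unequal frequencies (made-up values).
hky_pi = [0.1, 0.2, 0.3, 0.4]
hky = rate_matrix(exchangeabilities(transition=2.0), hky_pi)

m_jc = mat_exp(jc, 0.5)
m_hky = mat_exp(hky, 0.5)
```

K2P corresponds to `exchangeabilities(transition=...)` with uniform frequencies, and GTR to a dictionary with six free values. With this parameterisation π Q = 0 and D(π)Q is symmetric by construction.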
Models define probability
distributions on site patterns
The model θ consists of: the tree topology, edge weights, Q matrix*,
and root distribution π.
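Edge weights $t_1, t_2, t_3, t_{12}$, with $M_e = \exp(Q t_e)$.
[Figure: rooted tree with root y; edge 12 (matrix M12) joins y to internal node x; edges 1 and 2 (M1, M2) join x to leaves 1 and 2; edge 3 (M3) joins y to leaf 3]
$$p_{ijk} = \sum_{x,y} M_1(x,i)\, M_2(x,j)\, M_{12}(y,x)\, M_3(y,k)\, \pi(y)$$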
*More generally, this could be a set of Q matrices
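As an illustration of the site-pattern formula, the sketch below computes all 64 probabilities p_ijk for the three-leaf tree. It assumes the JC model (so each M_e has a closed form), a uniform root distribution, and edge lengths measured in expected substitutions per site. The function names and the edge lengths used are illustrative only.

```python
import math
from itertools import product

BASES = "ACGT"

def jc_transition(t):
    """JC transition matrix for an edge of length t (expected substitutions per site)."""
    same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def site_pattern_probs(t1, t2, t3, t12, pi=(0.25, 0.25, 0.25, 0.25)):
    """p_ijk = sum over x, y of M1(x,i) M2(x,j) M12(y,x) M3(y,k) pi(y)."""
    m1, m2, m3, m12 = (jc_transition(t) for t in (t1, t2, t3, t12))
    probs = {}
    for i, j, k in product(range(4), repeat=3):
        total = 0.0
        for x, y in product(range(4), repeat=2):
            total += m1[x][i] * m2[x][j] * m12[y][x] * m3[y][k] * pi[y]
        probs[BASES[i] + BASES[j] + BASES[k]] = total
    return probs

# Made-up edge lengths, purely for illustration.
probs = site_pattern_probs(t1=0.1, t2=0.1, t3=0.2, t12=0.05)
```

The 64 values sum to one, which is the sense in which the model defines a probability distribution on site patterns.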
Tree estimation using maximum
likelihood
• For a given set of parameters θ we can calculate the probability of
any particular site pattern.
• The overall probability of an alignment is then taken to be the
product of the probabilities for each site (i.i.d. assumption).
• This is the likelihood function, i.e. the probability of the data given
the model.
• We can then use optimisation techniques to find the model
parameters (tree topology, edge lengths, parameters of the
substitution model) that maximise the likelihood.
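The smallest instance of this recipe has two sequences and one parameter, the distance d between them. The sketch below assumes the JC model, sums per-site log probabilities (the i.i.d. product on a log scale), and maximises over d with a golden-section search. The two sequences and all function names are illustrative. The numerical optimum can be compared with the standard JC closed form d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites.

```python
import math

def jc_log_likelihood(seq1, seq2, d):
    """Log-likelihood of a two-sequence alignment at JC distance d."""
    same = 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)
    diff = 0.25 - 0.25 * math.exp(-4.0 * d / 3.0)
    # Each site contributes pi_i * p_ij(d); sites are treated as independent.
    return sum(math.log(0.25 * (same if a == b else diff)) for a, b in zip(seq1, seq2))

def maximise(f, lo, hi, tol=1e-8):
    """Golden-section search for the maximum of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        x1 = hi - inv_phi * (hi - lo)
        x2 = lo + inv_phi * (hi - lo)
        if f(x1) > f(x2):
            hi = x2   # maximum lies in [lo, x2]
        else:
            lo = x1   # maximum lies in [x1, hi]
    return (lo + hi) / 2.0

seq1 = "ACCCCTTAGG"   # toy data: 2 differences in 10 sites
seq2 = "ACTGCTTAGG"

d_hat = maximise(lambda d: jc_log_likelihood(seq1, seq2, d), 1e-6, 5.0)

p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
d_closed_form = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

For a full tree the same idea applies, but the search also ranges over topologies and all edge lengths, which is where the real computational cost lies.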
Extra features of sequence
evolution that can be modelled
• Site to site rate variation (usually modelled by
a gamma distribution)
• Invariant sites
BUT
• Some parts of reality are problematic…
– Base composition bias
– Sites that are free to vary change across the tree
– Non independence of sites
Likelihood versus parsimony
(the Felsenstein Zone)
Prior to the introduction of ML to the phylogenetics community by Joe
Felsenstein, Maximum Parsimony (MP) was the most widely used
method for estimating phylogenetic trees.
MP chooses the tree that requires the fewest mutations to explain the
data.
[Figure: four-taxon trees on taxa A, B, C, D with nucleotide states (A, C, G) at the leaves]
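Counting the fewest mutations a tree requires can be done with Fitch's algorithm, a standard method that the slides do not name. The sketch below scores the three possible four-taxon trees, using the four modern sequences from the earlier slide as data. The taxon labels a to d and the nested-tuple tree encoding are assumptions made for illustration.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum mutation count) for one site.

    `tree` is a leaf name or a pair (left, right) of subtrees.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one extra mutation is needed on this pair of edges.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the Fitch mutation counts over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )

alignment = {          # labels a-d are hypothetical
    "a": "ACCCCTTA",
    "b": "ACCCCATA",
    "c": "ACTGCTTA",
    "d": "ACTGCTAA",
}
trees = {
    "((a,b),(c,d))": (("a", "b"), ("c", "d")),
    "((a,c),(b,d))": (("a", "c"), ("b", "d")),
    "((a,d),(b,c))": (("a", "d"), ("b", "c")),
}
scores = {name: parsimony_score(tree, alignment) for name, tree in trees.items()}
mp_tree = min(scores, key=scores.get)
```

On this toy alignment MP picks ((a,b),(c,d)), because columns 3 and 4 need one mutation on that tree and two on either alternative.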
Likelihood versus parsimony
(the Felsenstein Zone)
The MP criterion has been shown to be statistically inconsistent on some
trees under the models of nucleotide substitution discussed previously.
Likelihood is statistically consistent (given the correct model).
Felsenstein (1978)
Hendy & Penny (1989)
Assessing confidence
• It is not just of interest to get a point estimate of
the phylogenetic tree.
• We would also like some measure of confidence
in our point estimate.
– Is our tree likely to change if we get more data?
– How robust is our result to sampling error?
• The bootstrap is a useful tool for answering
these sorts of questions.
The bootstrap
(Felsenstein 1985)
• For each bootstrap sample:
– Create a new alignment (of the same length as
the original) by resampling the columns of the
observed alignment
– Construct a tree for the ‘bootstrap’ alignment
• The bootstrap support for each edge is the number (or proportion)
of bootstrap trees that edge appears in.
Observed alignment (columns 1234567):
ATATAAA
ATTATAA
TAAAATA
TATAAAT
Bootstrap alignment (columns 1224567):
ATTTAAA
ATTATAA
TAAAATA
TAAAAAT
Bootstrap alignment (columns 1334567):
AAATAAA
ATTATAA
TAAAATA
TTTAAAT
Bootstrap alignment (columns 1244567):
ATTTAAA
ATAATAA
TAAAATA
TAAAAAT
[Figure: a tree on taxa a, b, c, d is built from each alignment; the summary tree shows an edge with bootstrap support 0.75]
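The resampling procedure can be sketched in a few lines of Python. Two assumptions are made here. The four rows of the example alignment are labelled a to d in order. The tree for each resampled alignment is chosen by a simple four-point rule on Hamming distances, which stands in for whatever tree-building method is actually used. All function names are illustrative.

```python
import random
from collections import Counter

def resample_columns(alignment, rng):
    """Sample columns with replacement; the result has the original length."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))

def four_point_split(aln):
    """Choose the pairing {w,x}|{y,z} with the smallest d(w,x) + d(y,z)."""
    a, b, c, d = sorted(aln)
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

    def cost(pairing):
        (w, x), (y, z) = pairing
        return hamming(aln[w], aln[x]) + hamming(aln[y], aln[z])

    (w, x), (y, z) = min(pairings, key=cost)   # ties go to the first pairing
    return f"{w}{x}|{y}{z}"

def bootstrap_support(alignment, replicates, seed=0):
    """Proportion of bootstrap trees containing each internal edge (split)."""
    rng = random.Random(seed)
    counts = Counter(
        four_point_split(resample_columns(alignment, rng)) for _ in range(replicates)
    )
    return {split: n / replicates for split, n in counts.items()}

alignment = {          # row labels are an assumption
    "a": "ATATAAA",
    "b": "ATTATAA",
    "c": "TAAAATA",
    "d": "TATAAAT",
}
observed_split = four_point_split(alignment)
support = bootstrap_support(alignment, replicates=200)
```

With only seven columns the support values are very noisy, which is part of what the bootstrap is meant to reveal.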
Example where the bootstrap is
useful
• Simulate data on the four taxon tree below (JC model)
• Use sequence lengths of 100, 1000, and 10000
[Figure: four-taxon tree on a, b, c, d with edge lengths 0.01 and 0.2]

Sequence length   ((a,b),(c,d))   ((a,c),(b,d))   ((a,d),(b,c))
100               5.7%            42.8%           49.8%
1000              97%             <5%             <5%
10000             100%            0               0
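A simulation of this kind can be sketched as follows. The code evolves sequences under JC along a four-taxon tree ((a,b),(c,d)) and records how often each topology is recovered. It is not the code behind the numbers on the slide. It assumes one shared length for the four pendant edges and one for the internal edge, and it picks the topology with a four-point rule on Hamming distances instead of a full tree search. All names are illustrative.

```python
import math
import random

BASES = "ACGT"

def evolve(seq, t, rng):
    """Evolve a sequence along an edge of length t under JC."""
    p_change = 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate_quartet(length, internal, pendant, rng):
    """Sequences at the leaves of ((a,b),(c,d)); x and y are the internal nodes."""
    x = "".join(rng.choice(BASES) for _ in range(length))
    y = evolve(x, internal, rng)
    return {
        "a": evolve(x, pendant, rng), "b": evolve(x, pendant, rng),
        "c": evolve(y, pendant, rng), "d": evolve(y, pendant, rng),
    }

def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))

def best_split(aln):
    scores = {
        "ab|cd": hamming(aln["a"], aln["b"]) + hamming(aln["c"], aln["d"]),
        "ac|bd": hamming(aln["a"], aln["c"]) + hamming(aln["b"], aln["d"]),
        "ad|bc": hamming(aln["a"], aln["d"]) + hamming(aln["b"], aln["c"]),
    }
    return min(scores, key=scores.get)   # ties go to the first split listed

def recovery_rates(length, internal, pendant, replicates, seed=0):
    """Proportion of simulated data sets assigned to each topology."""
    rng = random.Random(seed)
    counts = {"ab|cd": 0, "ac|bd": 0, "ad|bc": 0}
    for _ in range(replicates):
        counts[best_split(simulate_quartet(length, internal, pendant, rng))] += 1
    return {split: n / replicates for split, n in counts.items()}

# An easy case with a long internal edge; the generating split ab|cd is recovered every time.
rates = recovery_rates(length=2000, internal=0.2, pendant=0.1, replicates=20)
```

To explore the harder setting, one could try `internal=0.01, pendant=0.2` with lengths 100, 1000 and 10000, assuming that is how the two edge lengths in the figure are assigned.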
Example where it is not so
useful
• Simulate data on the two four-taxon trees below (JC
model) in the proportion 55%, 45% and concatenate the
sequences
• Use total sequence lengths of 100, 1000, and 10000
[Figure: two four-taxon trees with edge lengths 0.05 and 0.1; the first (55% of sites) groups a with b and c with d, the second (45% of sites) groups a with c and b with d]

Sequence length   ((a,b),(c,d))   ((a,c),(b,d))   ((a,d),(b,c))
100               64%             33%             3%
1000              80%             20%             0%
10000             98%             <5%             <5%
Genome-scale phylogeny
• Data sets with many concatenated genes
– Rokas et al., Nature 2003 (106 genes, 8 taxa)
– Goremykin et al., MBE 2004 (61 genes, 14 taxa)
• Estimated trees have very high bootstrap support.
• BUT... trees are sensitive to: model used, method used,
data-coding.
Case study: The Amborella
Wars
[Figure: Amborella, a New Caledonian shrub; tree labels: Angiosperms, Grasses]
[Plot: NJ bootstrap support (0 to 100) with ML distances using a GTR + gamma model, plotted against alpha (gamma shape parameter, 0.1 to 0.75), running from skewed rates to equal rates; curves labelled Amb+Nym, Grasses, Amb and Nym]
Sensitivity to model choice
• Phylogenomic datasets may involve hundreds of
genes for many species.
• These data sets create challenges for current
phylogenetic methods, as different genes have
different functions and hence evolve under
different processes.
• One question is how best to model this
heterogeneity to give reliable phylogenetic
estimates of the species tree.
Example
Rokas et al. (2003) produced 106 gene trees for 8 yeast taxa
S. cerevisiae
S. paradoxus
S. mikatae
S. kudriavzevii
S. bayanus
S. kluyveri
S. castellii
C. albicans
Two extremes
• How many parameters do we need to
adequately represent the branches of all
(unrooted) gene trees?
Between 13 (consensus tree) and 13 x 106 = 1378
• Too few parameters introduces bias
• Too many parameters increases the variance
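The two extremes follow from a standard count, not spelled out on the slide: an unrooted binary tree with n leaves has 2n - 3 edges. With n = 8 yeast taxa and 106 genes:

```latex
\[
  2n - 3 = 2 \cdot 8 - 3 = 13,
  \qquad
  13 \times 106 = 1378 .
\]
```

So a single shared set of branch lengths uses 13 parameters, while a separate set for every gene uses 1378.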
Stochastic partitioning
• Attempts to cluster genes into classes that
have evolved in a similar fashion.
• Each class is allowed its own set of
parameters (e.g. branch lengths or model
of nucleotide substitution)
Algorithm overview
1. Randomly assign the n genes to k classes.
2. Optimise parameters for each class
3. Compute the posterior probability for each
gene with the parameters from each class.
4. Move each gene into the class for which it has
highest posterior probability
5. Go to step 2; when no genes change class,
STOP
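The loop structure of these steps can be sketched with a deliberately simple stand-in model. In this toy version, which is not the method from the talk, each gene is summarised by a pair (number of differing sites, number of sites), each class has a single parameter (a substitution proportion), and classes get equal prior weight, so the highest posterior probability is simply the highest likelihood. A real implementation would optimise branch lengths or substitution-model parameters on a tree in step 2. All names and data are hypothetical.

```python
import math
import random

def log_lik(gene, p):
    """Binomial log-likelihood (up to a constant) of a gene given class parameter p."""
    k, n = gene
    p = min(max(p, 1e-9), 1.0 - 1e-9)   # keep the logarithms finite
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def fit_classes(genes, assignment, k_classes, rng):
    """Step 2: the ML parameter for each class is the pooled proportion."""
    params = []
    for c in range(k_classes):
        members = [g for g, a in zip(genes, assignment) if a == c]
        if members:
            params.append(sum(k for k, _ in members) / sum(n for _, n in members))
        else:
            params.append(rng.random())   # give an empty class a random parameter
    return params

def partition(genes, k_classes, assignment=None, seed=0, max_iter=100):
    rng = random.Random(seed)
    if assignment is None:
        # Step 1: random initial assignment of genes to classes.
        assignment = [rng.randrange(k_classes) for _ in genes]
    assignment = list(assignment)
    for _ in range(max_iter):
        params = fit_classes(genes, assignment, k_classes, rng)
        # Steps 3-4: move each gene to its best-scoring class.
        new = [max(range(k_classes), key=lambda c: log_lik(g, params[c])) for g in genes]
        if new == assignment:             # Step 5: stop when nothing moves.
            break
        assignment = new
    return assignment, fit_classes(genes, assignment, k_classes, rng)

# Made-up data: three slowly evolving genes and three fast ones.
genes = [(10, 1000), (12, 1000), (11, 1000), (200, 1000), (210, 1000), (190, 1000)]
classes, params = partition(genes, k_classes=2, assignment=[0, 1, 0, 1, 0, 1])
```

Starting from the deliberately mixed assignment above, the loop separates the slow and fast genes after one reassignment. Other starting points may converge to a different partition.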
How many classes?
Conclusions regarding
stochastic partitioning
• Pros
– AIC/BIC give a quantitative way to choose
how many parameters are needed.
– Identifies groups of genes under similar constraints
• Cons
– Slow
– Randomized algorithm so different starting points lead
to different partitions.
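The choice of the number of classes by AIC or BIC can be sketched with the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of free parameters and n the number of observations. The log-likelihood values below are invented purely to show the mechanics, and the count of 13 parameters per class assumes each class has its own set of branch lengths on an 8-taxon tree.

```python
import math

def aic(log_lik, n_params):
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    return n_params * math.log(n_obs) - 2 * log_lik

# Hypothetical maximised log-likelihoods for 1, 2 and 3 classes (not real results).
log_liks = {1: -1000.0, 2: -950.0, 3: -945.0}
params_per_class = 13        # assumption: one set of 13 branch lengths per class
n_sites = 5000               # assumption: total alignment length

aic_scores = {k: aic(ll, params_per_class * k) for k, ll in log_liks.items()}
bic_scores = {k: bic(ll, params_per_class * k, n_sites) for k, ll in log_liks.items()}

best_k_aic = min(aic_scores, key=aic_scores.get)
best_k_bic = min(bic_scores, key=bic_scores.get)
```

Lower scores are better. The third class improves the fit only slightly here, so the penalty for its 13 extra parameters outweighs the gain.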
Brief Tour…
Combinatorics of tree space
Graph Theory
Stochastic Models, Inference & Probability Theory
Algebraic Geometry
Lie groups, representation theory
….
….
Identifiability
Figure 2 Matsen and Steel (2007)
…the underlying assumption was that mixture model data on one topology can be
distinguished from data evolved on an unmixed tree of another topology given
enough data and the “correct” method. Here we show that this assumption can be
false. For biologists our results imply that, for example, the combined data
from two genes whose phylogenetic trees differ only in terms of branch
lengths can perfectly fit a tree of a different topology.
Elizabeth Allman
John Rhodes
Algebraic geometry approach
The boundary of phylogenetics
and population genetics
Fisher-Wright model
Phylogenetic tree
Gene trees in species phylogenies
James Degnan
Noah Rosenberg
Representation theory, Lie groups,
Markov invariants, closure of model
classes
Jeremy Sumner
Peter Jarvis
http://www.maths.utas.edu.au/phylomania/phylomania2011.htm