
10/27/14 Estimating Evolutionary Trees
v if the data are “consistent with infinite sites”
then all methods should yield the same tree
v it gets more complicated when there is
homoplasy, i.e., parallel or convergent
mutations at the same position
v more than one tree may be equally good
as a hypothesis of the genealogical history
Phylogenetic Methods
v UPGMA (single pass algorithm)
v neighbor-joining (single pass algorithm)
v Parsimony
²  search more or less exhaustively for the tree with the
smallest number of steps (mutations) required to
explain the data
v maximum likelihood
²  search more or less exhaustively for the tree
(topology and branch lengths) that maximizes the
likelihood of the observed data
v Bayesian MCMC methods
²  summarize the posterior distribution of trees to
estimate the probability of clades in the tree
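The parsimony criterion above (fewest steps needed to explain the data) can be illustrated for a single fixed tree. This is a minimal sketch using Fitch's counting rule, which is one standard way to count steps and is not named in the slides. The tree encoding (nested tuples), the function names, and the toy alignment are all assumptions made for illustration. A real search would score many candidate topologies this way and keep the shortest.

```python
def fitch_steps(tree, states):
    """Minimum number of changes at one site on a rooted binary tree.

    tree   -- nested 2-tuples of tip names, e.g. (("a", "b"), ("c", "d"))
    states -- dict mapping tip name -> character state at this site
    """
    def visit(node):
        if isinstance(node, str):                  # a tip: its state set, zero steps
            return {states[node]}, 0
        left_set, left_steps = visit(node[0])
        right_set, right_steps = visit(node[1])
        shared = left_set & right_set
        if shared:                                 # children agree: no extra step
            return shared, left_steps + right_steps
        return left_set | right_set, left_steps + right_steps + 1

    return visit(tree)[1]


def parsimony_length(tree, alignment):
    """Total steps over all sites; alignment maps tip name -> sequence string."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_steps(tree, {tip: seq[i] for tip, seq in alignment.items()})
        for i in range(n_sites)
    )


# Hypothetical toy data: two candidate topologies for four sequences.
alignment = {"a": "AAG", "b": "AAG", "c": "CAT", "d": "CAT"}
tree_1 = (("a", "b"), ("c", "d"))
tree_2 = (("a", "c"), ("b", "d"))
print(parsimony_length(tree_1, alignment), parsimony_length(tree_2, alignment))
```

Here tree_1 needs fewer steps than tree_2, so it is the better hypothesis under parsimony for this toy data set.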
Does it matter for pop gen?
v we don’t need to know the genealogy for
each locus to make inferences/estimate
population genetic parameters
v but, analyzing data that are not consistent with
infinite sites requires more complex coalescent
and/or mutation models
Gene Trees versus Species Trees
v “reciprocal monophyly”
v “incomplete lineage sorting”
The Lineage
Sorting Process
v Speciation at time X
²  ancestral polymorphism
retained
v The gene tree is
polyphyletic for both
species between times
X and Y
v The gene tree is
paraphyletic within
species between time
Y and Z
v Reciprocal monophyly
at time Z
Gene Trees versus Species Trees
v Incongruence…
²  between gene tree
and species tree
²  …and between
different gene trees
Probability of Incongruence
$$P(\text{incongruence}) = \frac{2}{3}\, e^{-t/(2N)}$$
v for the simple 3-taxon case with one sample per taxon,
where t is the number of generations between speciation
events and N is the (effective) population size
v also applies when lineage sorting is complete
within each of the terminal taxa
²  incongruence as a result of incomplete lineage
sorting in the past
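As a worked example of the three-taxon formula, (2/3)·exp(−t/(2N)), the sketch below evaluates it for a few internode lengths. The function name and the example values of t and N are assumptions chosen for illustration only.

```python
import math


def prob_incongruence(t, n):
    """Probability that a gene tree disagrees with the species tree.

    t -- generations between the two speciation events (internode length)
    n -- population size N, as used in the 2N term of the formula
    """
    return (2.0 / 3.0) * math.exp(-t / (2.0 * n))


# Hypothetical values: N = 10,000 and internodes of 0, 2N, and 10N generations.
for t in (0, 20_000, 100_000):
    print(t, round(prob_incongruence(t, 10_000), 4))
```

With t = 0 the three lineages coalesce in random order, so two of the three possible topologies (probability 2/3) conflict with the species tree. The probability decays exponentially as the internode gets longer.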
The “lasting effects” of incomplete lineage sorting
[Figure: Species 1, Species 2, and Species 3 (S1, S2, S3) descending from an ancestral population]
probability of mtDNA and nuclear gene
trees matching species tree as a function
of internode length
Moore 1995 Evolution 49, 718-726
Interpreting Single Gene Trees?
v human mtDNA tree
²  consistent with “out of Africa”
hypothesis
Avise et al.
1990 Evolution
Other causes of incongruence
v hybridization/introgression/horizontal
transfer
v balancing selection
v gene duplication and loss
Introgression plus Selective Sweep
[Figure: species tree and gene tree for taxa A, B, and C plotted against time; the gene tree reflects introgression followed by a selective sweep]
Balancing Selection
v results in a “balanced” allele frequency
maintained by frequency-dependent selection
v can maintain pre-existing alleles over long
stretches of time
[Figure: gene trees with tips labeled H, C, and G]
From Klein, Takahata, Ayala 1993
Gene Duplication and Loss
[Figure: actual phylogeny and apparent phylogeny for taxa a, b, c, and d, with a gene duplication marked]
phylogeny of α subunits of voltage-gated calcium channels
Piontkivska & Hughes, 2003, JME
Approaches for making inferences/
estimating parameters
v direct estimates from summary statistics
²  E.g.,
$$F_{ST} = \frac{1}{1 + 4Nm} \;\equiv\; 4Nm = \frac{1 - F_{ST}}{F_{ST}}$$
²  but this typically requires significant assumptions
²  genetic equilibrium, constant population size, etc.
v  simple coalescent simulations to generate
confidence intervals
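The summary-statistic estimate above rests on the island-model relationship F_ST = 1/(1 + 4Nm), which rearranges to Nm = (1 − F_ST)/(4 F_ST). The sketch below converts in both directions. The function names and the example F_ST values are assumptions for illustration, and the result is only as good as the equilibrium assumptions listed in the bullets.

```python
def fst_from_nm(nm):
    """Expected F_ST at equilibrium for a given number of migrants per generation (Nm)."""
    return 1.0 / (1.0 + 4.0 * nm)


def nm_from_fst(fst):
    """Invert the relationship: Nm implied by an observed F_ST (0 < F_ST <= 1)."""
    if not 0.0 < fst <= 1.0:
        raise ValueError("F_ST must be in (0, 1] for this conversion")
    return (1.0 - fst) / (4.0 * fst)


# Hypothetical observed values of F_ST.
for fst in (0.05, 0.2, 0.5):
    print(fst, round(nm_from_fst(fst), 3))
```

Small changes in F_ST near zero translate into very large changes in the implied Nm, which is one reason confidence intervals matter for this kind of estimate.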
[Figure: Distribution of θS estimates and distribution of θπ estimates, each shown for k = 10 and k = 20]
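A simple coalescent simulation of the kind used to build such distributions can be sketched as follows. This is a minimal sketch, not the procedure behind the original figure: it assumes a constant-size population, the infinite-sites mutation model, time measured in units of 2N generations, and θ = 4Nµ, with θS = S / Σ(1/i) and θπ equal to the mean number of pairwise differences. The true θ of 2.0, the replicate count, and all function names are assumptions.

```python
import random


def poisson(rng, mean):
    """Poisson draw: count unit-rate exponential arrivals that fall in [0, mean]."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def one_replicate(k, theta, rng):
    """Simulate one genealogy with mutations; return (theta_S, theta_pi)."""
    sizes = [1] * k          # sampled alleles descending from each current lineage
    seg_sites = 0            # S, the number of segregating sites
    pair_diffs = 0           # summed over sites: pairs of alleles that differ
    while len(sizes) > 1:
        j = len(sizes)
        t_j = rng.expovariate(j * (j - 1) / 2.0)    # waiting time with j lineages
        for d in sizes:
            m = poisson(rng, theta / 2.0 * t_j)     # mutations on this branch segment
            seg_sites += m
            pair_diffs += m * d * (k - d)
        a, b = sorted(rng.sample(range(j), 2))      # two lineages coalesce
        merged = sizes.pop(b)
        sizes[a] += merged
    a_k = sum(1.0 / i for i in range(1, k))
    return seg_sites / a_k, pair_diffs / (k * (k - 1) / 2.0)


def theta_intervals(k, theta, n_reps=2000, seed=1):
    """Mean and central 95% interval of both estimators across replicates."""
    rng = random.Random(seed)
    reps = [one_replicate(k, theta, rng) for _ in range(n_reps)]
    summary = {}
    for name, idx in (("theta_S", 0), ("theta_pi", 1)):
        values = sorted(r[idx] for r in reps)
        summary[name] = (
            sum(values) / n_reps,
            values[int(0.025 * n_reps)],
            values[int(0.975 * n_reps) - 1],
        )
    return summary


for k in (10, 20):
    print(k, theta_intervals(k, theta=2.0))
```

Both estimators are unbiased for θ under these assumptions, but their spread differs, and it is the spread across replicates that supplies the confidence interval.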
More sophisticated approaches for making inferences/estimating parameters
v  start with historical model…
MIGRATE-N
²  simulates N populations
connected by gene flow
²  estimates population sizes
and migration rates (both
scaled by N and µ)
²  equilibrium model
²  coalescence of all samples
requires migration between
demes because populations
do not merge as you go
back in time
Beerli P, Felsenstein J (1999) Maximum-likelihood estimation of migration rates and effective
population numbers in two populations using a coalescent approach. Genetics 152, 763–773.
Beerli P, Felsenstein J (2001) Maximum likelihood estimation of a migration matrix and effective
population sizes in n subpopulations by using a coalescent approach. PNAS 98, 4563–4568.
IM - Isolation with Migration
² model of population
divergence with gene flow
² estimates population sizes,
migration rates and
divergence time(s)
Approaches for making inferences/
estimating parameters
v Bayesian MCMC analyses to estimate
demographic and historical parameters
² based either on maximum likelihood and the
Felsenstein equation or on summary statistics
(Approximate Bayesian Computation, ABC)
² the Felsenstein Equation gives the likelihood of
the data given a set of model parameters, Θ
$$\Pr(X \mid \Theta) = \int_G \Pr(X \mid G)\, p(G \mid \Theta)\, dG$$
where X is the data, Θ is the set of model parameters,
and G is the set of all possible genealogies given Θ
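The integral over genealogies can be approximated by Monte Carlo: draw genealogies from p(G | Θ), evaluate Pr(X | G) for each, and average. The sketch below is a deliberately simplified instance, not what any particular program does. It assumes the data X are reduced to the number of segregating sites S, the only parameter is θ = 4Nµ, the population is constant in size, and time is in units of 2N generations, so that S given a genealogy of total length L is Poisson with mean θL/2. All names and values are assumptions.

```python
import math
import random


def total_tree_length(k, rng):
    """Total branch length of a random coalescent genealogy for k samples."""
    # While j lineages remain, the waiting time is exponential with rate j(j-1)/2.
    return sum(j * rng.expovariate(j * (j - 1) / 2.0) for j in range(2, k + 1))


def poisson_pmf(s, mean):
    """Pr(S = s) for a Poisson distribution with the given mean."""
    if mean <= 0.0:
        return 1.0 if s == 0 else 0.0
    return math.exp(-mean + s * math.log(mean) - math.lgamma(s + 1))


def mc_likelihood(s_obs, k, theta, n_genealogies=20000, seed=1):
    """Approximate Pr(S = s_obs | theta) by averaging Pr(S | G) over simulated G."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_genealogies):
        length = total_tree_length(k, rng)                # G drawn from p(G | theta)
        total += poisson_pmf(s_obs, theta * length / 2.0)  # Pr(X | G)
    return total / n_genealogies


# Hypothetical data: 5 segregating sites among k = 10 sequences.
for theta in (0.5, 2.0, 8.0):
    print(theta, mc_likelihood(5, 10, theta))
```

For k = 2 this integral has a closed form, θ^s / (1 + θ)^(s+1), which gives a convenient check on the simulation.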
Calculating the likelihood of the data for a given genealogy
v given a model of sequence evolution, a tree
(=genealogy) with branch lengths, and
observed character states (DNA sequences in
the samples)...
v we can calculate the likelihood (probability) of
the data at a given sequence position
[Figure: a tree/genealogy with branch lengths (t1 to t8) and the
data at a single DNA sequence position; tips A, C, C, C, G and
internal nodes x, y, z, w]
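$$\Pr(X_i \mid G) = \sum_x \sum_y \sum_z \sum_w \Pr(A, C, C, C, G, x, y, z, w \mid G)$$
where each term factors along the branches of the tree:
$$\Pr(A, C, C, C, G, x, y, z, w \mid G) = \Pr(x)\,\Pr(y \mid x, t_6)\,\Pr(A \mid y, t_1)\,\Pr(C \mid y, t_2)\,\Pr(z \mid x, t_8)\,\Pr(C \mid z, t_3)\,\Pr(w \mid z, t_7)\,\Pr(C \mid w, t_4)\,\Pr(G \mid w, t_5)$$
²  Pr(x) is the prior probability of state x at the root
²  in this example, this quantity is summed over 256 (= 4^4)
possible combinations of x, y, z, w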
²  number of calculations increases exponentially with
more taxa, but computational shortcuts are employed
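The single-site calculation can be made concrete under a specific substitution model. The sketch below assumes the Jukes-Cantor (JC69) model, which the slides do not specify, with equal root frequencies of 1/4 and made-up branch lengths in expected substitutions per site. The topology follows the branch subscripts in the formula: root x joins y (t6) and z (t8), y joins tips A (t1) and C (t2), z joins tip C (t3) and w (t7), and w joins tips C (t4) and G (t5). It computes the likelihood by the brute-force sum over all 256 internal-state combinations and again with Felsenstein's pruning algorithm, one well-known shortcut.

```python
import itertools
import math

BASES = "ACGT"


def jc69(i, j, t):
    """JC69 probability of state j at the end of a branch of length t, starting from i."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def brute_force(t):
    """Sum the joint probability over all 4^4 assignments of x, y, z, w."""
    total = 0.0
    for x, y, z, w in itertools.product(BASES, repeat=4):
        total += (0.25                                   # Pr(x), uniform at the root
                  * jc69(x, y, t[6]) * jc69(y, "A", t[1]) * jc69(y, "C", t[2])
                  * jc69(x, z, t[8]) * jc69(z, "C", t[3]) * jc69(z, w, t[7])
                  * jc69(w, "C", t[4]) * jc69(w, "G", t[5]))
    return total


def pruning(t):
    """Same likelihood via conditional likelihoods passed from the tips to the root."""
    def tip(state):
        return {b: 1.0 if b == state else 0.0 for b in BASES}

    def combine(left, t_left, right, t_right):
        return {
            s: sum(jc69(s, a, t_left) * left[a] for a in BASES)
               * sum(jc69(s, b, t_right) * right[b] for b in BASES)
            for s in BASES
        }

    y = combine(tip("A"), t[1], tip("C"), t[2])
    w = combine(tip("C"), t[4], tip("G"), t[5])
    z = combine(tip("C"), t[3], w, t[7])
    x = combine(y, t[6], z, t[8])
    return sum(0.25 * x[s] for s in BASES)


# Hypothetical branch lengths t1..t8.
branch = {1: 0.10, 2: 0.10, 3: 0.20, 4: 0.05, 5: 0.05, 6: 0.15, 7: 0.10, 8: 0.05}
print(brute_force(branch), pruning(branch))
```

The two functions return the same value. Pruning does work proportional to the number of nodes rather than exponential in it, which is why it scales to larger trees.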
v the overall likelihood of the data is the product
of the likelihoods for individual sites or the sum
of the ln likelihoods…
$$L = \Pr(X \mid G) = \prod_{i=1}^{m} \Pr(X_i \mid G) \;\equiv\; \ln L = \sum_{i=1}^{m} \ln L_i$$
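One practical reason to work with the sum of ln likelihoods is numerical: a product of many small per-site probabilities underflows in floating point. The short sketch below uses made-up per-site likelihoods to show the difference. The values and the site count are assumptions for illustration.

```python
import math

# Hypothetical per-site likelihoods for an alignment of m = 2000 sites.
site_likelihoods = [1e-3] * 2000

product = math.prod(site_likelihoods)                        # underflows to 0.0
log_likelihood = sum(math.log(p) for p in site_likelihoods)  # stays finite

print(product, log_likelihood)
```

Likelihoods for different genealogies or parameter values can then be compared through differences in ln L without ever forming the raw product.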
In practice…
v for a sample of k alleles, draw random
coalescence times from the exponential
distribution, as appropriate given the historical
and demographic model parameters
v estimate the likelihood (probability) of the
observed DNA sequences for genealogies
generated under the model
$$\Pr(X \mid \Theta) = \int_G \Pr(X \mid G)\, p(G \mid \Theta)\, dG$$
v change a model parameter (according to
carefully designed rules), generate a new set of
genealogies and calculate likelihood
v we now have two results…
v if the new result is better, accept the new set of
model parameters ( x!) and continue the
process by taking another step in the Markov
Chain (i.e., “updating” a model parameter,
generating genealogies, etc…)
v if the result is worse, either accept the new set
of model parameters ( x!) or go back to the
previous set of parameters ( x ), with the “coin
flip” probabilities as defined by the MetropolisHastings Algorithm
# P ( x") g ( x" → x ) &
((
A ( x → x") = min %%1,
"
P
x
g
x
→
x
(
)
(
)
$
'
v repeat millions of times
Markov Chain Monte Carlo methods
Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD
(2009) Inferring the joint demographic history of multiple
populations from multidimensional SNP frequency data. PLoS
Genetics 5, e1000695.
v ∂a∂i uses the joint allele frequency distribution
as the observed input data
v uses the diffusion approximation to estimate the
expected joint allele frequency distribution for a
given set of model parameters
v and then calculates the likelihood of the
observed data based on the above
[Figure: observed joint allele frequency distribution; axes are East African allele frequency and West African allele frequency (each n = 10 birds, 20 alleles)]
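The two ingredients described above, an observed joint allele frequency distribution and a likelihood comparing it with an expected one, can be sketched without relying on the ∂a∂i package itself. The code below is not the ∂a∂i API. It assumes each entry of the spectrum is an independent Poisson count, which is a common composite-likelihood form. The expected spectrum is a made-up placeholder rather than the output of a diffusion calculation, and all names and data are hypothetical.

```python
import math


def joint_sfs(counts_pop1, counts_pop2, n1, n2):
    """Tabulate SNPs by derived-allele count in each population.

    counts_pop1, counts_pop2 -- per-SNP derived allele counts (same SNP order)
    n1, n2                   -- number of sampled alleles in each population
    """
    sfs = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for c1, c2 in zip(counts_pop1, counts_pop2):
        sfs[c1][c2] += 1
    return sfs


def poisson_log_likelihood(observed, expected):
    """Sum of independent Poisson log-probabilities over spectrum entries."""
    ll = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for obs, exp in zip(obs_row, exp_row):
            if exp > 0:    # entries with zero expectation are skipped in this sketch
                ll += obs * math.log(exp) - exp - math.lgamma(obs + 1)
    return ll


# Hypothetical data: five SNPs scored in 2 + 2 sampled alleles.
observed = joint_sfs([1, 1, 2, 0, 1], [0, 0, 2, 1, 1], n1=2, n2=2)
expected_a = [[0.0, 1.0, 0.1], [2.0, 1.0, 0.1], [0.1, 0.1, 1.0]]   # placeholder model A
expected_b = [[0.0, 0.2, 2.0], [0.2, 0.2, 2.0], [2.0, 0.2, 0.2]]   # placeholder model B
print(poisson_log_likelihood(observed, expected_a),
      poisson_log_likelihood(observed, expected_b))
```

In a real analysis the expected spectrum would be recomputed for each candidate set of demographic parameters, and the parameters giving the highest likelihood would be retained.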