Methods for sampling genealogies in complex models of divergence

Jody Hey
Rutgers University
Acknowledgements
• Model development
Rasmus Nielsen
• Chimpanzee studies
Yong-Jin Won
Yong Wang
Sang Chul Choi
$$P(\Theta \mid X) = \int P(\Theta \mid G)\, P(G \mid X)\, dG$$
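The Isolation with Migration Model
[Figure: the Isolation with Migration model. Two descendant populations (the populations for data collection), with sizes N1 and N2, exchange migrants at rates m1 and m2 between the present and the splitting time t. Before t, in the past, there is a single ancestral population of size NA. Θ includes six parameters.]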
Treating genealogies as a nuisance
variable
Θ – parameters of the model (e.g.
population sizes, migration rates)
X – data
G – genealogy (i.e. coalescent tree)
$$L(\Theta \mid X) = \int P(X \mid G)\, P(G \mid \Theta)\, dG$$
• In practice
– recombination is assumed to be zero
within loci, and to be high between loci
– the integral must be approximated using
samples of genealogies
– the computation is slow
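Under these recombination assumptions each locus has a single genealogy, and loci are independent given Θ, so the multilocus likelihood factors into a product over loci. A sketch of that factorization, where the locus index l and the number of loci n are notation introduced here and not taken from the slides:

```latex
L(\Theta \mid X) = \prod_{l=1}^{n} \int P(X_l \mid G_l)\, P(G_l \mid \Theta)\, dG_l
```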
Instead of sampling genealogies:
approximate likelihoods using summary
statistics
• Summary statistic methods have become
common due to the limitations of methods
that sample genealogies
• Can work with loci that have histories of
recombination
• Can be fast
• But they do not use all of the information
in the data
• So far they do not perform well with models
and histories that include gene exchange
Competition between two lines of
research: genealogy sampling, and
summary statistics
• Genealogy Sampling
– limited by assumptions on
recombination (so far)
– Slow
– Works well for estimating parameters
• Summary Statistics
– Not limited by recombination
– Faster
– Does not work so well (so far)
A new method for sampling
genealogies
• We would like a smaller MCMC state space,
for which it is easier to design an MCMC
updating scheme that leads to rapid
convergence
• We would like to have an approach that
generates an analytic likelihood function in
multiple dimensions
• But one that avoids the frailties of that
approach, which stem from using samples of
G conditioned on a driving value of Θ, Θ0
(Kuhner et al., 1995)
Hey & Nielsen 2007 PNAS 104:2785–2790.
Reconsidering the integration over
genealogies
$$L(\Theta \mid X) = \int P(X \mid G)\, P(G \mid \Theta)\, dG$$
Consider an alternative expression, that also
integrates over G , but that directly yields a
posterior probability of Θ
$$P(\Theta \mid X) = \int P(\Theta \mid G)\, P(G \mid X)\, dG$$
This is an expectation of P(Θ |G) and can be
approximated given a sample of genealogies drawn at
random from the posterior distribution of G, P(G | X)
$$P(\Theta \mid X) \approx \frac{1}{k} \sum_{i=1}^{k} P(\Theta \mid G_i) = \frac{1}{k} \sum_{i=1}^{k} \frac{P(G_i \mid \Theta)\, P(\Theta)}{P(G_i)}$$
This step does not depend on the data, X. All the
information in the data is contained in the sample
drawn from P(G|X)
Yields an analytic function
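The reason the data drop out of this step can be written in two lines. Given the genealogy, the data carry no further information about Θ, because X depends on Θ only through G (the likelihood uses P(X | G), with no Θ). Bayes' rule then turns P(Θ | G) into quantities that come from the model alone. A sketch of that reasoning:

```latex
\begin{align}
P(\Theta \mid X) &= \int P(\Theta \mid G, X)\, P(G \mid X)\, dG
                  = \int P(\Theta \mid G)\, P(G \mid X)\, dG, \\
P(\Theta \mid G) &= \frac{P(G \mid \Theta)\, P(\Theta)}{P(G)} .
\end{align}
```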
The key to generating samples of
genealogies from P(G | X) and to
approximating P(Θ|X) is the calculation of
the prior probability of G, P(G)
$$P(G) = \int P(G \mid \Theta)\, P(\Theta)\, d\Theta$$
In fact this can be calculated analytically for
the main demographic components of Θ.
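As an illustration of why this integral can be tractable, consider the simplest case. This is an assumption made here for brevity, not the full IM model: a single population with parameter θ, time measured in mutational units, a genealogy with c coalescent events, f equal to the sum of k(k−1)·t_k over the intervals with k lineages, and a uniform prior on (0, θ_max). The symbols c, f, and θ_max are introduced only for this sketch. The integral then reduces to an upper incomplete gamma function:

```latex
\begin{align}
P(G \mid \theta) &= \left(\frac{2}{\theta}\right)^{c} e^{-f/\theta},
\qquad P(\theta) = \frac{1}{\theta_{\max}}, \\
P(G) &= \frac{1}{\theta_{\max}} \int_{0}^{\theta_{\max}}
        \left(\frac{2}{\theta}\right)^{c} e^{-f/\theta}\, d\theta
      = \frac{2^{c} f^{\,1-c}}{\theta_{\max}}\,
        \Gamma\!\left(c-1,\ \frac{f}{\theta_{\max}}\right),
\qquad c \ge 2 .
\end{align}
```

The second equality follows from the substitution u = f/θ.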
Sequence of operations
• Run a Markov chain over G and generate
random samples from P(G | X)
• For each G drawn from this distribution, save
P(G) and all necessary information for
calculating P(G|Θ).
• Build a function that approximates the
posterior density of Θ
$$P(\Theta \mid X) \approx \frac{1}{k} \sum_{i=1}^{k} \frac{P(G_i \mid \Theta)\, P(\Theta)}{P(G_i)}$$
• This is an analytic function, and can be
evaluated for any value of Θ
• The function can be differentiated and
searched for maxima.
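The sequence of operations above can be sketched in a few lines of Python for a deliberately simplified case. The assumptions are ours, not the slides': a single population with one parameter θ, a uniform prior on (0, theta_max), and each sampled genealogy reduced to a hypothetical pair (c, f). Here c is the number of coalescent events and f is the sum of k(k−1)·t_k over intervals with k lineages. The genealogy summaries are made-up inputs standing in for draws from P(G | X); no MCMC is run here. All function names are illustrative.

```python
import math


def log_prob_g_given_theta(c, f, theta):
    """log P(G | theta) for the toy model: (2/theta)^c * exp(-f/theta)."""
    return c * math.log(2.0 / theta) - f / theta


def upper_gamma_int(n, x):
    """Upper incomplete gamma function Gamma(n, x) for a positive integer n."""
    series = sum(x ** j / math.factorial(j) for j in range(n))
    return math.factorial(n - 1) * math.exp(-x) * series


def log_prior_prob_g(c, f, theta_max):
    """log P(G): P(G | theta) integrated against the uniform prior (needs c >= 2)."""
    if c < 2:
        raise ValueError("this closed form assumes at least two coalescent events")
    return (c * math.log(2.0) + (1 - c) * math.log(f)
            + math.log(upper_gamma_int(c - 1, f / theta_max))
            - math.log(theta_max))


def make_posterior(genealogies, theta_max):
    """Return a function approximating P(theta | X) from saved genealogy summaries."""
    log_prior = -math.log(theta_max)
    # P(G_i) does not depend on theta, so it is computed once and saved.
    saved = [(c, f, log_prior_prob_g(c, f, theta_max)) for c, f in genealogies]

    def posterior(theta):
        if not 0.0 < theta <= theta_max:
            return 0.0
        terms = [math.exp(log_prob_g_given_theta(c, f, theta) + log_prior - log_pg)
                 for c, f, log_pg in saved]
        return sum(terms) / len(terms)

    return posterior


# Hypothetical summaries (c, f) standing in for genealogies drawn from P(G | X).
sampled = [(4, 18.0), (4, 22.5), (4, 15.2)]
posterior = make_posterior(sampled, theta_max=50.0)
```

The returned function can be evaluated at any θ inside the prior range. Each term in the average is a normalized density P(θ | G_i), so the average is one as well, which gives a quick sanity check on an implementation.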
Comparing the likelihood ratio for a true nested
model with the likelihood for the full model
• 100 data sets simulated under a model with just
2 population sizes and 1 migration rate
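A minimal sketch of the comparison, assuming each simulated data set yields one maximized log-likelihood for the full model and one for the nested model. The numbers below are hypothetical, not results from the talk. With 2 degrees of freedom the χ² survival function is exactly exp(−x/2), so no statistics library is needed.

```python
import math


def lrt_statistic(loglik_full, loglik_nested):
    """-2 x log-likelihood ratio of the nested model against the full model."""
    return -2.0 * (loglik_nested - loglik_full)


def chi2_sf_2df(x):
    """P(chi-square with 2 df > x); for 2 df this equals exp(-x / 2)."""
    return math.exp(-x / 2.0) if x > 0.0 else 1.0


def fraction_rejected(loglik_pairs, alpha=0.05):
    """Share of data sets in which the nested model is rejected at level alpha."""
    rejected = sum(
        1 for full, nested in loglik_pairs
        if chi2_sf_2df(lrt_statistic(full, nested)) < alpha
    )
    return rejected / len(loglik_pairs)


# Hypothetical (full, nested) maximized log-likelihoods for three data sets.
example_pairs = [(-100.0, -100.4), (-98.2, -99.9), (-101.5, -105.0)]
```

If the nested model is true and the χ² approximation holds, the fraction rejected across many simulated data sets should be close to alpha.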
[Figure: distribution of the –2 × log-likelihood ratio across the 100 simulated data sets, compared with a χ² distribution with 2 degrees of freedom.]
Chimpanzee Distributions
Chimpanzee Divergence
Posterior Density for Population Size - Ne
[Figure: two panels plotting probability against Ne, effective population size (x-axis 0 to 60,000), each with curves for P. t. troglodytes, P. t. verus, and the ancestor. One panel shows the original results of Won & Hey; the other shows the new method.]
Models for more than two
populations
• Assume that we know the species
phylogeny
For three sampled populations
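[Figure: Isolation with Migration model for three sampled populations. The sampled populations have sizes N1, N2, and N3; the ancestral populations have sizes NA0 and NA1; the splitting times are t0 and t1; eight migration rates (each labelled m) connect the populations. Θ includes 15 parameters.]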
Multi-population IMa – The Good News
• Adding more populations does not
introduce new mathematical issues
– Building the application is mostly a
programming problem, not a math
problem
• Can do any number of populations
for a known phylogeny
– Program will “work” for 10 populations
(assuming a known phylogeny), with
19 population size parameters,
162 migration rate parameters, and
9 population splitting times
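The parameter counts on these slides (6 for two populations, 15 for three, and 19 + 162 + 9 for ten) are all consistent with simple closed forms in the number of sampled populations k. The sketch below uses those closed forms. They are inferred here from the three cases shown rather than stated on the slides, and the function name is hypothetical.

```python
def im_parameter_counts(k):
    """Parameter counts for an IM model with k sampled populations and a known tree.

    Closed forms inferred from the counts quoted for k = 2, 3, and 10.
    """
    if k < 2:
        raise ValueError("an IM model needs at least two sampled populations")
    counts = {
        "population_sizes": 2 * k - 1,        # k sampled plus k - 1 ancestral
        "migration_rates": 2 * (k - 1) ** 2,  # both directions per coexisting pair
        "splitting_times": k - 1,             # one per internal node of the tree
    }
    counts["total"] = sum(counts.values())
    return counts


for k in (2, 3, 4, 10):
    print(k, im_parameter_counts(k))
```

The quadratic growth in migration rates is what drives the data requirements and identifiability concerns as populations are added.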
Multiple populations – The Bad News
• A lot of data will be required for
many situations (hundreds of loci)
• Models with many parameters
introduce much more potential for
model identifiability problems
• Program is still slow, and applications
with hundreds of loci will require new
computing configurations
Chimpanzees in a four population
Isolation with Migration Model
• Pan paniscus (Bonobo)
• P. troglodytes troglodytes (Central
African Chimpanzee)
• P. t. schweinfurthii (East African
Chimpanzee)
• P. t. verus (West African
Chimpanzee)
Chimpanzee Distributions
Chimpanzee phylogeny*
[Figure: phylogeny with tips P. t. schweinfurthii (Eastern), P. t. troglodytes (Central), P. t. verus (Western), and P. paniscus (Bonobo).]
*Becquet et al. (2007) PLoS Genet 3:e66
(based on 310 microsatellite loci)
Data
• Fischer et al., Curr. Biol. 16:1133-1138.
– 26 loci, approx 20 gene copies per
species, average length 700 bp
• Yu et al., (2003) Genetics 164:1511-1518.
– 42 loci, approx 10 gene copies per
species, average length 400 bp
• Deinard & Kidd (2000), HOXB6 and APOB
• Single loci from mitochondria, X
chromosome, Y chromosome
• Total of 73 loci
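Parameter Estimates for Four Chimpanzee Populations
[Figure: four-population tree for the Eastern, Central, Western, and Bonobo populations, annotated with splitting times in years, effective population sizes, and markers for migration significantly greater than zero. Splitting times shown: 79,000 yrs, 440,000 yrs, and 890,000 yrs. Effective population sizes shown, in extraction order: 7,800; 26,000; 7,100; 8,200; 30,000; 6,900; 17,000.]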