Methods for sampling genealogies in complex models of divergence
Jody Hey
Rutgers University

Acknowledgements
• Model development
– Rasmus Nielsen
• Chimpanzee studies
– Yong-Jin Won
– Yong Wang
– Sang Chul Choi

$$P(\Theta \mid X) = \int P(\Theta \mid G)\, P(G \mid X)\, dG$$

The Isolation with Migration Model
[Figure: two descendant populations (the populations for data collection), with sizes N1 and N2 at the present, exchange migrants at rates m1 and m2. Going into the past, they join at splitting time t in an ancestral population of size NA. Θ includes six parameters: N1, N2, NA, m1, m2, t.]

Treating genealogies as a nuisance variable
Θ – parameters of the model (e.g. population sizes, migration rates)
X – data
G – genealogy (i.e. coalescent tree)

$$L(\Theta \mid X) = \int P(X \mid G)\, P(G \mid \Theta)\, dG$$

• In practice:
– Recombination is assumed to be zero within loci and high between loci
– The integral must be approximated using samples of genealogies
– The computation is slow

Instead of sampling genealogies: approximate likelihoods using summary statistics
• Summary statistic methods have become common because of the limitations of methods that sample genealogies
• Can work with loci that have histories of recombination
• Can be fast
• But do not use all of the information in the data
• So far, do not perform well for models and histories that include gene exchange

Competition between two lines of research: genealogy sampling and summary statistics
• Genealogy Sampling
– Limited by assumptions on recombination (so far)
– Slow
– Works well for estimating parameters
• Summary Statistics
– Not limited by recombination
– Faster
– Does not work so well (so far)

A new method for sampling genealogies
• We would like a smaller MCMC state space, for which it is easier to design an MCMC updating scheme that leads to rapid convergence
• We would like an approach that generates an analytic likelihood function in multiple dimensions
• But one that avoids the frailties of earlier approaches of that kind, which stem from using samples of G conditioned on a driving value of Θ, Θ0 (Kuhner et al., 1995)

Hey & Nielsen 2007 PNAS 104:2785–2790.
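To make the integral L(Θ|X) = ∫ P(X|G) P(G|Θ) dG concrete, here is a minimal toy sketch in Python. It is not the method of the talk: it assumes a deliberately tiny model (two gene copies at one locus, so the genealogy G is a single scaled coalescence time T, exponential with mean θ/2, and the number of differences X is Poisson with mean 2T), and it uses plain Monte Carlo draws from P(G|Θ) rather than MCMC. All function names are hypothetical. Under these assumptions the integral has a closed form (a geometric distribution), which allows the approximation to be checked.

```python
import math
import random


def sample_coalescence_time(theta, rng):
    """Draw a toy genealogy G: the scaled coalescence time T of two gene
    copies, assumed exponential with mean theta / 2."""
    return rng.expovariate(2.0 / theta)


def prob_data_given_genealogy(x, t):
    """P(X | G) in the toy model: x pairwise differences, Poisson(2 * t)."""
    mean = 2.0 * t
    return math.exp(-mean) * mean ** x / math.factorial(x)


def mc_likelihood(theta, x, n_samples=20000, seed=1):
    """Approximate L(theta | X) by averaging P(X | G) over genealogies
    drawn from P(G | theta). A fresh sample is needed for every theta."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        t = sample_coalescence_time(theta, rng)
        total += prob_data_given_genealogy(x, t)
    return total / n_samples


def exact_likelihood(theta, x):
    """Closed form of the same integral in the toy model (geometric)."""
    return (1.0 / (1.0 + theta)) * (theta / (1.0 + theta)) ** x


if __name__ == "__main__":
    for theta in (0.5, 2.0, 5.0):
        print(theta, mc_likelihood(theta, 3), exact_likelihood(theta, 3))
```

Even in this toy, every value of Θ needs its own set of sampled genealogies, which hints at why the approach is slow for realistic models with many loci and parameters.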
Reconsidering the integration over genealogies

$$L(\Theta \mid X) = \int P(X \mid G)\, P(G \mid \Theta)\, dG$$

Consider an alternative expression that also integrates over G, but that directly yields a posterior probability of Θ:

$$P(\Theta \mid X) = \int P(\Theta \mid G)\, P(G \mid X)\, dG$$

This is an expectation of P(Θ|G), and it can be approximated given a sample of k genealogies drawn at random from the posterior distribution of G, P(G|X):

$$P(\Theta \mid X) \approx \frac{1}{k} \sum_{i=1}^{k} P(\Theta \mid G_i) = \frac{1}{k} \sum_{i=1}^{k} \frac{P(G_i \mid \Theta)\, P(\Theta)}{P(G_i)}$$

• This step does not depend on the data, X. All the information in the data is contained in the sample drawn from P(G|X)
• Yields an analytic function

The key to generating samples of genealogies from P(G|X), and to approximating P(Θ|X), is the calculation of the prior probability of G, P(G):

$$P(G) = \int P(G \mid \Theta)\, P(\Theta)\, d\Theta$$

In fact, this can be calculated analytically for the main demographic components of Θ.

Sequence of operations
• Run a Markov chain over G and generate random samples from P(G|X)
• For each G drawn from this distribution, save P(G) and all the information needed to calculate P(G|Θ)
• Build a function that approximates the posterior density of Θ:

$$P(\Theta \mid X) \approx \frac{1}{k} \sum_{i=1}^{k} \frac{P(G_i \mid \Theta)\, P(\Theta)}{P(G_i)}$$

• This is an analytic function and can be evaluated for any value of Θ
• The function can be differentiated and searched for maxima

Comparing the likelihood ratio for a true nested model with the likelihood for the full model
• 100 data sets simulated under a model with just 2 population sizes and 1 migration rate
[Figure: distribution of -2 × log-likelihood ratio for the 100 simulated data sets, compared with a χ² distribution with 2 degrees of freedom. Axis: probability.]

Chimpanzee Distributions
[Figure: chimpanzee distributions.]

Chimpanzee Divergence
[Figure: posterior density for population size (Ne, effective population size) for P. t. troglodytes, P. t. verus and their ancestor. Two panels: original results of Won & Hey; new method.]

Models for more than two populations
• Assume that we know the species phylogeny
For three sampled populations
[Figure: three sampled populations with sizes N1, N2 and N3, two ancestral populations with sizes NA0 and NA1, splitting times t0 and t1, and eight migration rates m. Θ includes 15 parameters.]

Multi-population IMa: The Good News
• Adding more populations does not introduce new mathematical issues
– Building the application is mostly a programming problem, not a math problem
• Can do any number of populations for a known phylogeny
– Program will “work” for 10 populations (assuming a known phylogeny):
  19 population size parameters
  162 migration rate parameters
  9 population splitting times

Multiple populations: The Bad News
• A lot of data will be required for many situations (hundreds of loci)
• Models with many parameters introduce much more potential for model identifiability problems
• Program is still slow, and applications with hundreds of loci will require new computing configurations

Chimpanzees in a four-population Isolation with Migration Model
• Pan paniscus (Bonobo)
• P. troglodytes troglodytes (Central African Chimpanzee)
• P. t. schweinfurthii (East African Chimpanzee)
• P. t. verus (West African Chimpanzee)

Chimpanzee phylogeny*
[Figure: phylogeny of P. t. schweinfurthii (Eastern), P. t. troglodytes (Central), P. t. verus (West) and P. paniscus (Bonobo).]
*Becquet et al. (2007) PLoS Genet 3:e66 (based on 310 microsatellite loci).

Data
• Fischer et al., Curr. Biol. 16:1133-1138.
– 26 loci, approx. 20 gene copies per species, average length 700 bp
• Yu et al. (2003) Genetics 164:1511-1518.
– 42 loci, approx. 10 gene copies per species, average length 400 bp
• Deinard & Kidd (2000), HOXB6 and APOB
• Single loci from mitochondria, X chromosome, Y chromosome
• Total of 73 loci

Parameter Estimates for Four Chimpanzee Populations
[Figure: tree of the Eastern, Central, Western and Bonobo populations annotated with splitting times in years (79,000; 440,000; 890,000), effective population sizes (7,800; 26,000; 7,100; 8,200; 30,000; 6,900; 17,000, in the order extracted from the figure), and markers for migration significantly greater than zero.]
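The parameter counts quoted for multi-population models (6 parameters for two populations, 15 for three, and 19 sizes, 162 migration rates and 9 splitting times for ten) can be reproduced with a short Python sketch. The formulas below are an assumption, not taken from the slides: they come from counting, for a known phylogeny of k sampled populations with splits at distinct times, one size per sampled or ancestral population (2k − 1), one time per split (k − 1), and one migration rate in each direction for every pair of populations that coexist (2(k − 1)²). The function name is hypothetical.

```python
def im_parameter_counts(k):
    """Count parameters of an Isolation with Migration model for k sampled
    populations on a known phylogeny (assumed counting rules, see text)."""
    if k < 2:
        raise ValueError("need at least two sampled populations")
    sizes = 2 * k - 1             # k sampled + (k - 1) ancestral populations
    splitting_times = k - 1       # one per internal node of the phylogeny
    migration = 2 * (k - 1) ** 2  # two directions per coexisting pair
    return {
        "population_sizes": sizes,
        "migration_rates": migration,
        "splitting_times": splitting_times,
        "total": sizes + migration + splitting_times,
    }


if __name__ == "__main__":
    for k in (2, 3, 4, 10):
        print(k, im_parameter_counts(k))
```

For k = 4, the case of the chimpanzee analysis, the same rules give 7 population sizes and 3 splitting times, which matches the number of values shown in the final figure.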