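Monte Carlo methods for estimating population genetic parameters
Rasmus Nielsen
University of Copenhagen

Outline
• Idiosyncratic history and background on ML estimation of demographic parameters based on DNA sequence data.
• A new computational approach/modification.
• Idiosyncratic history and background on ML estimation of demographic parameters based on SNP data.
• Ascertainment and large scale SNP data sets.

Felsenstein’s Equation
$$\Pr(X \mid \Theta) = \int \Pr(X \mid G)\, p(G \mid \Theta)\, dG = E_{G \mid \Theta}\left[\Pr(X \mid G)\right]$$
So
$$\Pr(X \mid \Theta) \approx \frac{1}{k} \sum_{i=1}^{k} \Pr(X \mid G_i)$$
where $G_i$, $i = 1, 2, \ldots, k$, has been simulated from $p(G \mid \Theta)$.

Coefficient of Variation
[Figure: coefficient of variation (C.V.) plotted against sample size.]

Importance Sampling
$$\Pr(X \mid \Theta) = \int \Pr(X \mid G)\, p(G \mid \Theta)\, dG = \int \frac{\Pr(X \mid G)\, p(G \mid \Theta)}{h(G)}\, h(G)\, dG = E_{h(G)}\left[\frac{\Pr(X \mid G)\, p(G \mid \Theta)}{h(G)}\right]$$
So
$$\Pr(X \mid \Theta) \approx \frac{1}{k} \sum_{i=1}^{k} \frac{\Pr(X \mid G_i)\, p(G_i \mid \Theta)}{h(G_i)}$$
where $G_i$, $i = 1, 2, \ldots, k$, has been simulated from $h(G)$.

Griffiths and Tavaré Recursion
$$\Pr(X \mid \Theta) = p(\mathrm{mut} \mid \Theta) \sum_{X'_{\mathrm{mut}}} \Pr(X'_{\mathrm{mut}} \mid \Theta)\, p(X'_{\mathrm{mut}} \to X \mid \Theta, \mathrm{mut}) + p(\mathrm{coal} \mid \Theta) \sum_{X'_{\mathrm{coal}}} \Pr(X'_{\mathrm{coal}} \mid \Theta)\, p(X'_{\mathrm{coal}} \to X \mid \Theta, \mathrm{coal})$$
Simulate mutation (coalescent) from
$$\frac{p(\mathrm{mut} \mid \Theta) \sum_{X'_{\mathrm{mut}}} p(X'_{\mathrm{mut}} \to X \mid \Theta, \mathrm{mut})}{p(\mathrm{mut} \mid \Theta) \sum_{X'_{\mathrm{mut}}} p(X'_{\mathrm{mut}} \to X \mid \Theta, \mathrm{mut}) + p(\mathrm{coal} \mid \Theta) \sum_{X'_{\mathrm{coal}}} p(X'_{\mathrm{coal}} \to X \mid \Theta, \mathrm{coal})}$$
and correct using importance sampling.

Example (Nielsen 1998)
• Infinite sites model
• Estimation of T
• Estimation of population phylogenies

Integro-recursion
Ugliest equation ever published in a biological journal…
MLE: T = 1.8 (36,000 years)
Data from the Caribbean Hawksbill Turtle

MCMC
Set up a Markov chain whose state space is all supported values of $\Theta$ and $G$, with stationary distribution $p(\Theta, G \mid X)$. Now since
$$p(\Theta, G \mid X) \propto \Pr(X \mid G)\, p(G \mid \Theta)\, p(\Theta)$$
this can easily be done using Metropolis-Hastings sampling, i.e. updates to $\Theta$ and $G$ are proposed from a proposal distribution $q(\Theta, G \to \Theta', G')$ and accepted with probability
$$\min\left\{1,\ \frac{P(X \mid G')\, P(G' \mid \Theta')\, q(\Theta', G' \to \Theta, G)}{P(X \mid G)\, P(G \mid \Theta)\, q(\Theta, G \to \Theta', G')}\right\}$$
(with a non-flat prior the ratio also includes the factor $p(\Theta') / p(\Theta)$).

Some problems…
• Histogram estimator or other smoothing must be used.
• Likelihood ratios hard to estimate (e.g. M = 0).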
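The two Monte Carlo estimators from the Felsenstein’s Equation and Importance Sampling slides can be tried out on a toy problem. The sketch below is only an illustration under stated assumptions, not the coalescent setting of the talk: the "genealogy" G is a single positive number with p(G | Θ) exponential with rate Θ, the data X is a Poisson count with mean G, and the proposal h(G) is a gamma distribution built from a hypothetical driving value `theta0`. All function names are made up for this example. The toy model has a closed-form likelihood, so the estimates can be checked.

```python
import math
import random


def data_prob(x, g):
    """Toy Pr(X = x | G = g): Poisson count with mean g (an assumption of this sketch)."""
    return math.exp(-g) * g**x / math.factorial(x)


def exact_likelihood(x, theta):
    """Closed form of the integral for this toy model: theta / (1 + theta)^(x + 1)."""
    return theta / (1.0 + theta) ** (x + 1)


def plain_monte_carlo(x, theta, k, rng):
    """Average Pr(X | G_i) over G_i simulated from p(G | theta) = Exponential(rate theta)."""
    total = 0.0
    for _ in range(k):
        g = rng.expovariate(theta)
        total += data_prob(x, g)
    return total / k


def importance_sampling(x, theta, theta0, k, rng):
    """Average Pr(X | G_i) p(G_i | theta) / h(G_i) over G_i simulated from h.

    Here h is Gamma(shape x + 1, rate 1 + theta0), which in this toy model
    equals p(G | X, theta0) for the driving value theta0.
    """
    shape, rate = x + 1, 1.0 + theta0
    total = 0.0
    for _ in range(k):
        g = rng.gammavariate(shape, 1.0 / rate)  # second argument is the scale
        prior = theta * math.exp(-theta * g)     # p(G | theta)
        h = rate**shape * g**x * math.exp(-rate * g) / math.factorial(x)  # h(G)
        total += data_prob(x, g) * prior / h
    return total / k
```

Calling `importance_sampling(5, 2.0, 1.0, 50000, random.Random(1))` and comparing with `exact_likelihood(5, 2.0)` shows how one set of draws from h, generated at the driving value, can be reweighted to estimate the likelihood at a different Θ. When `theta` equals `theta0` every weight is identical, which is the ideal case where h matches p(G | X, Θ).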
A new method
• It is possible to calculate the marginal prior probability of a genealogy
$$P(G) = \int P(G \mid \Theta)\, P(\Theta)\, d\Theta$$
• It turns out that this math is doable for most components of Θ, such as θ and M.
• Then we can sample from the marginal posterior of G
$$P(G \mid X) = \frac{P(X \mid G)\, P(G)}{P(X)} \propto P(X \mid G)\, P(G)$$
using the previously discussed MCMC procedures.
Slide inspired by Jody Hey

We then recover the posterior for $\Theta$ using
$$P(\Theta \mid X) = \int P(\Theta \mid G)\, P(G \mid X)\, dG$$
Approximated by
$$P(\Theta \mid X) \approx \frac{1}{k} \sum_{i=1}^{k} P(\Theta \mid G_i) = \frac{1}{k} \sum_{i=1}^{k} \frac{P(G_i \mid \Theta)\, P(\Theta)}{P(G_i)}$$
Slide inspired by Jody Hey

Advantages
• Eliminates problems with covariance between parameters leading to mixing problems.
• Provides a smooth posterior/likelihood function useful for optimization and likelihood ratio estimation.
Disadvantages
• Requires more calculation in each MCMC iteration.

Likelihood ratio estimation
6 loci, 15 gene copies, H0: m1 = m2

Other approaches
• Kuhner and Felsenstein use a combination of MCMC and importance sampling to estimate likelihood surfaces (no prior for the parameters).
• PAC methods suggested by Stephens and Donnelly sample from a close approximation to
$$p(G \mid X, \Theta) = \frac{\Pr(X \mid G)\, p(G \mid \Theta)}{\Pr(X \mid \Theta)}$$
to estimate an approximate likelihood.
• ABC methods (Beaumont, Pritchard, Tavaré and others) are a very popular and promising class of methods based on (1) reducing the data to summary statistics, (2) simulating new data from the prior, (3) accepting the parameter value under which the data were simulated if the difference between simulated and true statistics is less than δ.

SNP Data
Nielsen and Slatkin (2000)

A more efficient method
Griffiths and Tavaré (1998), Nielsen (2000)

Ascertainment Sample vs. Typed Sample
[Diagram: ascertainment sample and typed sample.]

[Figure: true frequencies and observed frequencies plotted against allele count x (1 to 19); n = 20, d = 4, #SNPs = 1000.]

[Figure: E[D'] plotted against r = 2Nc (0 to 10), with and without ascertainment bias.]

Correcting for ascertainment biases
Now, for simplicity, consider the case without a sweep. Then
$$L_i(P) = \frac{\Pr(X_i = x, \mathrm{Asc} \mid P)}{\Pr(\mathrm{Asc} \mid P)} = \frac{\Pr(\mathrm{Asc} \mid X_i = x)\, p_x}{\Pr(\mathrm{Asc} \mid P)}$$
where (in the simplest possible case)
$$\Pr(\mathrm{Asc} \mid X_i = x) = 1 - \frac{\binom{x}{d} + \binom{n-x}{d}}{\binom{n}{d}}$$
and
$$\Pr(\mathrm{Asc} \mid P) = \sum_{j=1}^{n-1} p_j \Pr(\mathrm{Asc} \mid X_i = j)$$
In this simple case, the maximum likelihood estimate of P is simply given by
$$\hat{p}_k = \frac{n_k}{\Pr(\mathrm{Asc} \mid X = k)} \left[\sum_{j=1}^{n-1} \frac{n_j}{\Pr(\mathrm{Asc} \mid X = j)}\right]^{-1}, \quad k = 1, 2, \ldots, n-1,$$
where $n_k$ is the number of SNPs with allele frequency k.
Selective sweeps: similarly define
$$L(P, \,\cdot\,, D) = \Pr(X \mid P, \,\cdot\,, D, \mathrm{Asc})$$

[Figure: true, observed and corrected frequencies against allele count (1 to 19); 10,000 simulated SNPs with n = 20 and d = 5.]

[Figure: Hudson’s (2001) estimator, uncorrected and corrected, plotted against r = 2Nc (2 to 10); n = 100, m = 5, r = 5, and #SNP pairs = 200.]

Complications
• Double-hit ascertainment (HapMap)
• Ascertainment based on chimpanzee (HapMap)
• Panel depth may vary among SNPs and/or among regions (HapMap).
• Ascertainment method may vary among SNPs (HapMap).
• Population structure (HapMap).
• Loss of information regarding ascertainment scheme (HapMap??).

[Figure: HapMap ascertainment depth distribution (ignores many important components); depth 2 to 21 on the horizontal axis; legend entries: Perlegen, HapMap.]

Data
Directly sequenced polymorphism data from 20 European-Americans, 19 African-Americans and one chimpanzee from 9,316 protein coding genes.
Data set previously described in Bustamante, C.D. et al. 2005. Natural selection on protein-coding genes in the human genome. Nature 437, 1153-1157.
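The ascertainment correction from the "Correcting for ascertainment biases" slide can be written out in a few lines. This is a minimal sketch of the simplest case only, assuming that the ascertainment sample is a random subset of d of the n typed chromosomes and that a SNP is ascertained when it is variable in that subset. The function names and the dictionary layout for the counts are hypothetical choices for this example.

```python
import math


def prob_ascertained(x, n, d):
    """Pr(Asc | X = x) in the simplest case.

    One minus the chance that d chromosomes drawn without replacement
    from the n typed ones all carry the same allele. Requires 2 <= d <= n.
    """
    return 1.0 - (math.comb(x, d) + math.comb(n - x, d)) / math.comb(n, d)


def corrected_spectrum(counts, n, d):
    """Maximum likelihood estimate p_hat_k for k = 1, ..., n - 1.

    counts[k] is the number of SNPs observed with allele count k;
    missing keys are treated as zero.
    """
    weighted = {
        k: counts.get(k, 0) / prob_ascertained(k, n, d) for k in range(1, n)
    }
    total = sum(weighted.values())
    return {k: w / total for k, w in weighted.items()}
```

With n = 20 and d = 4, `prob_ascertained(1, 20, 4)` is 0.2, so singletons are strongly under-represented before correction and are weighted up by a factor of five relative to a SNP that is always ascertained.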
Demographic model
[Diagram: European-Americans and African-Americans, with migration, population growth, bottleneck and admixture.]

Estimation
$$L(\Theta) = \prod_{j=1}^{n-1} p_j(\Theta)^{n_j}$$
where $p_j(\Theta)$ are the sampling probabilities from the 2D frequency spectrum and $n_j$ is the number of SNPs with pattern j in the 2D frequency spectrum.
SNPs within a gene are correlated, but the estimator is consistent. The estimate has the same properties as a real likelihood estimator, except that it converges slightly more slowly because of the correlation (Wiuf 2006).

[Figure: African-Americans; simulated and observed % plotted against allele frequency.]

[Figure: European-Americans; simulated and observed % plotted against allele frequency.]

Goodness-of-fit: p = 0.6

Acknowledgements
Jody Hey, John Wakeley, Melissa Hubisz, Andy Clark, Carlos Bustamante, Scott Williamson, Aida Andres, Amit Andip, Adam Boyko, Anders Albrechtsen, Mark Adams, Michelle Cargill and other staff at Celera Genomics and Applied Biosystems.