Monte Carlo methods for estimating
population genetic parameters
Rasmus Nielsen
University of Copenhagen
Outline
• Idiosyncratic history and background on ML estimation of demographic parameters based on DNA sequence data.
• A new computational approach/modification.
• Idiosyncratic history and background on ML estimation of demographic parameters based on SNP data.
• Ascertainment and large-scale SNP data sets.
Felsenstein’s Equation
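\[ \Pr(X \mid \Theta) = \int \Pr(X \mid G)\, p(G \mid \Theta)\, dG = E_{G \mid \Theta}\left[ \Pr(X \mid G) \right] \]
So
\[ \Pr(X \mid \Theta) \approx \frac{1}{k} \sum_{i=1}^{k} \Pr(X \mid G_i) \]
where G_i, i = 1, 2, …, k, has been simulated from p(G | Θ).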
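The estimator above can be sketched on a toy model that is an assumption of this example, not taken from the slides: two sequences under the infinite sites model, so that the genealogy G reduces to one coalescence time T ~ Exp(1) and the number of pairwise differences is S | T ~ Poisson(θT). For this toy model the likelihood has a closed (geometric) form, which lets the Monte Carlo average be checked. All function names are illustrative.

```python
import math
import random


def poisson_pmf(s, mean):
    """Probability of s events under a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean ** s / math.factorial(s)


def mc_likelihood(s, theta, k, rng):
    """Estimate Pr(S = s | theta) by averaging Pr(S = s | T_i) over prior draws T_i.

    Toy assumption: n = 2 sequences, T ~ Exp(1), S | T ~ Poisson(theta * T).
    """
    total = 0.0
    for _ in range(k):
        t = rng.expovariate(1.0)            # simulate G_i from the prior on genealogies
        total += poisson_pmf(s, theta * t)  # Pr(X | G_i)
    return total / k


def exact_likelihood(s, theta):
    """Closed-form Pr(S = s | theta) for the toy model (a geometric distribution)."""
    return (theta / (1.0 + theta)) ** s / (1.0 + theta)


rng = random.Random(1)
estimate = mc_likelihood(s=3, theta=2.0, k=100_000, rng=rng)
print(estimate, exact_likelihood(3, 2.0))
```

In this one-dimensional toy case plain prior sampling works well; the difficulty the talk goes on to address arises when most simulated genealogies are nearly incompatible with the data.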
Coefficient of Variation
[Figure: C.V. (0 to 1) plotted against sample size (0 to 10).]
Importance Sampling
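\[ \int \Pr(X \mid G)\, p(G \mid \Theta)\, dG = \int \frac{\Pr(X \mid G)\, p(G \mid \Theta)}{h(G)}\, h(G)\, dG = E_h\left[ \frac{\Pr(X \mid G)\, p(G \mid \Theta)}{h(G)} \right] \]
So
\[ \Pr(X \mid \Theta) \approx \frac{1}{k} \sum_{i=1}^{k} \frac{\Pr(X \mid G_i)\, p(G_i \mid \Theta)}{h(G_i)} \]
where G_i, i = 1, 2, …, k, has been simulated from h(G).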
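A minimal sketch of the importance sampling estimator, again on an assumed toy model that is not from the slides: T ~ Exp(1) and S | T ~ Poisson(θT) for two sequences. The proposal h is taken, as an illustrative choice, to be a Gamma(s + 1, rate 1 + θ0) distribution, which is the exact posterior of T at a driving value θ0.

```python
import math
import random


def target_integrand(s, theta, t):
    """Pr(S = s | T = t) p(T = t): Poisson(theta * t) likelihood times Exp(1) prior."""
    mean = theta * t
    return math.exp(-mean) * mean ** s / math.factorial(s) * math.exp(-t)


def proposal_pdf(s, theta0, t):
    """Density h(t) of the Gamma(shape = s + 1, rate = 1 + theta0) proposal."""
    rate = 1.0 + theta0
    return rate ** (s + 1) * t ** s * math.exp(-rate * t) / math.factorial(s)


def is_likelihood(s, theta, theta0, k, rng):
    """Importance sampling estimate of Pr(S = s | theta) using draws from h."""
    scale = 1.0 / (1.0 + theta0)            # random.gammavariate takes (shape, scale)
    total = 0.0
    for _ in range(k):
        t = rng.gammavariate(s + 1, scale)  # simulate G_i from h(G)
        total += target_integrand(s, theta, t) / proposal_pdf(s, theta0, t)
    return total / k


exact = (2.0 / 3.0) ** 3 / 3.0              # closed form for s = 3, theta = 2
rng = random.Random(2)
estimate = is_likelihood(3, 2.0, theta0=1.0, k=20_000, rng=rng)
print(estimate, exact)
```

When θ0 equals θ the proposal is proportional to the integrand, so every weight equals the likelihood itself and the estimator has zero variance. Real genealogical proposals can only approximate this ideal.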
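Griffiths and Tavaré Recursion
\[
\Pr(X \mid \Theta) = p(\mathrm{mut} \mid \Theta) \sum_{X'_{\mathrm{mut}}} \Pr(X'_{\mathrm{mut}} \mid \Theta)\, p(X'_{\mathrm{mut}} \to X \mid \Theta, \mathrm{mut})
 + p(\mathrm{coal} \mid \Theta) \sum_{X'_{\mathrm{coal}}} \Pr(X'_{\mathrm{coal}} \mid \Theta)\, p(X'_{\mathrm{coal}} \to X \mid \Theta, \mathrm{coal})
\]
Simulate mutation (coalescent) from
\[
\frac{p(\mathrm{mut} \mid \Theta)\, p(X'_{\mathrm{mut}} \to X \mid \Theta, \mathrm{mut})}
     {p(\mathrm{mut} \mid \Theta) \sum_{X'_{\mathrm{mut}}} p(X'_{\mathrm{mut}} \to X \mid \Theta, \mathrm{mut}) + p(\mathrm{coal} \mid \Theta) \sum_{X'_{\mathrm{coal}}} p(X'_{\mathrm{coal}} \to X \mid \Theta, \mathrm{coal})}
\]
and correct using importance sampling.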
Example (Nielsen 1998)
• Infinite sites model
• Estimation of T
• Estimation of population phylogenies
Integro-recursion: "Ugliest equation ever published in a biological journal…"
MLE: T = 1.8 (36,000 years)
Data from the Caribbean hawksbill turtle
MCMC
Set up a Markov chain on the state space of all supported values of Θ and G, with stationary distribution p(Θ, G | X). Since
\[ p(\Theta, G \mid X) \propto \Pr(X \mid G)\, p(G \mid \Theta)\, p(\Theta) \]
this can easily be done using Metropolis-Hastings sampling, i.e. updates to Θ and G are proposed from a proposal distribution q(Θ, G → Θ', G') and accepted with probability
\[ \min\left\{ 1,\ \frac{P(X \mid G')\, P(G' \mid \Theta')\, p(\Theta')\, q(\Theta', G' \to \Theta, G)}{P(X \mid G)\, P(G \mid \Theta)\, p(\Theta)\, q(\Theta, G \to \Theta', G')} \right\} \]
(the prior terms cancel when p(Θ) is flat).
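A minimal Metropolis-Hastings sketch of the sampler described above, on an assumed toy model that is not from the slides: S | T ~ Poisson(θT) for two sequences, T ~ Exp(1), and an Exp(1) prior on θ chosen purely for illustration. Proposals are reflected Gaussian random walks, which are symmetric, so the q terms cancel in the acceptance ratio.

```python
import math
import random


def log_joint(s, theta, t):
    """log of Pr(S = s | T = t, theta) p(T) p(theta), up to an additive constant.

    Toy assumptions: S | T ~ Poisson(theta * T), T ~ Exp(1), theta ~ Exp(1).
    """
    mean = theta * t
    return -mean + s * math.log(mean) - t - theta


def metropolis_hastings(s, n_iter, step, rng):
    """Sample (theta, T) from their joint posterior given S = s."""
    theta, t = 1.0, 1.0
    current = log_joint(s, theta, t)
    samples = []
    for _ in range(n_iter):
        # Reflection at zero keeps the random-walk proposal symmetric.
        theta_new = abs(theta + rng.gauss(0.0, step))
        t_new = abs(t + rng.gauss(0.0, step))
        if theta_new > 0.0 and t_new > 0.0:
            proposed = log_joint(s, theta_new, t_new)
            # Accept with probability min(1, ratio), compared on the log scale.
            if math.log(1.0 - rng.random()) < proposed - current:
                theta, t, current = theta_new, t_new, proposed
        samples.append((theta, t))
    return samples


rng = random.Random(3)
samples = metropolis_hastings(s=3, n_iter=100_000, step=0.8, rng=rng)
kept = samples[5_000:]                      # discard burn-in
posterior_mean_theta = sum(th for th, _ in kept) / len(kept)
print(posterior_mean_theta)
```

Even in this toy model θ and T are strongly correlated in the posterior (their product is what the data inform), which is a small-scale version of the mixing issue raised later in the talk.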
Some problems…
• A histogram estimator or other smoothing must be used.
• Likelihood ratios are hard to estimate (e.g. M = 0).
A new method
• It is possible to calculate the marginal prior probability of a genealogy
\[ P(G) = \int P(G \mid \Theta)\, P(\Theta)\, d\Theta \]
• It turns out that this math is doable for most components of Θ, such as θ and M.
• Then we can sample from the marginal posterior of G
\[ P(G \mid X) = \frac{P(X \mid G)\, P(G)}{P(X)} \propto P(X \mid G)\, P(G) \]
using the previously discussed MCMC procedures.
Slide inspired by Jody Hey
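We then recover the posterior for Θ using
\[ P(\Theta \mid X) = \int P(\Theta \mid G)\, P(G \mid X)\, dG \]
Approximated by
\[ P(\Theta \mid X) \approx \frac{1}{k} \sum_{i=1}^{k} P(\Theta \mid G_i) = \frac{1}{k} \sum_{i=1}^{k} \frac{P(G_i \mid \Theta)\, P(\Theta)}{P(G_i)} \]
where G_i, i = 1, 2, …, k, are sampled from P(G | X).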
Slide inspired by Jody Hey
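The two steps (sample G from its marginal posterior, then average P(Θ | G_i)) can be sketched on an assumed toy model that is not from the slides. Here G is a single quantity τ, the expected number of mutations separating two sequences, with τ | θ exponential with mean θ and S | τ ~ Poisson(τ). To keep the integral over θ a finite sum, θ is given a discrete uniform prior on a grid; that is a simplification for illustration only.

```python
import math
import random

THETAS = [0.2 * j for j in range(1, 51)]   # grid of theta values: 0.2, 0.4, ..., 10
PRIOR = 1.0 / len(THETAS)                  # uniform prior mass on each grid point


def p_g_given_theta(tau, theta):
    """P(G | theta): tau is exponential with mean theta (toy genealogy prior)."""
    return math.exp(-tau / theta) / theta


def p_g(tau):
    """Marginal prior P(G) = sum over theta of P(G | theta) P(theta)."""
    return sum(p_g_given_theta(tau, th) * PRIOR for th in THETAS)


def log_target(s, tau):
    """log of P(X | G) P(G) up to a constant, with S | tau ~ Poisson(tau)."""
    return -tau + s * math.log(tau) + math.log(p_g(tau))


def posterior_theta(s, n_iter, burn_in, step, rng):
    """Run MCMC over G only, then average P(theta | G_i) over the sampled G_i."""
    tau = 1.0
    current = log_target(s, tau)
    post = [0.0] * len(THETAS)
    kept = 0
    for it in range(n_iter):
        tau_new = abs(tau + rng.gauss(0.0, step))   # symmetric reflected random walk
        if tau_new > 0.0:
            proposed = log_target(s, tau_new)
            if math.log(1.0 - rng.random()) < proposed - current:
                tau, current = tau_new, proposed
        if it >= burn_in:
            marginal = p_g(tau)
            for j, th in enumerate(THETAS):
                # P(theta | G_i) = P(G_i | theta) P(theta) / P(G_i)
                post[j] += p_g_given_theta(tau, th) * PRIOR / marginal
            kept += 1
    return [v / kept for v in post]


rng = random.Random(4)
post = posterior_theta(s=3, n_iter=20_000, burn_in=1_000, step=2.0, rng=rng)
print(max(zip(post, THETAS)))              # highest posterior mass and its theta
```

Because each sampled genealogy contributes a whole curve over θ rather than a single point, the resulting posterior estimate is smooth without any histogram binning.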
Advantages
• Eliminates problems with covariance between parameters leading to mixing problems.
• Provides a smooth posterior/likelihood function useful for optimization and likelihood ratio estimation.
Disadvantages
• Requires more calculation in each MCMC iteration.
Likelihood ratio estimation
6 loci, 15 gene copies, H0: m1=m2
Other approaches
• Kuhner and Felsenstein use a combination of MCMC and importance sampling to estimate likelihood surfaces (no prior for the parameters).
• PAC methods, suggested by Stephens and Donnelly, sample from a close approximation to
\[ p(G \mid X, \Theta) = \frac{\Pr(X \mid G)\, p(G \mid \Theta)}{\Pr(X \mid \Theta)} \]
to estimate an approximate likelihood.
• ABC methods (Beaumont, Pritchard, Tavaré and others) are a very popular and promising class of methods based on (1) reducing the data to summary statistics, (2) simulating new data from the prior, and (3) accepting the parameter value under which the data were simulated if the difference between the simulated and observed statistics is less than d.
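The three ABC steps can be sketched with a rejection sampler on an assumed toy model that is not from the slides: the summary statistic is the number of differences S between two sequences, with T ~ Exp(1), S | T ~ Poisson(θT), and a Uniform(0, theta_max) prior on θ. The names `abc_rejection` and `simulate_differences` are illustrative.

```python
import random


def simulate_differences(theta, rng):
    """Simulate S for two sequences: T ~ Exp(1), then S | T ~ Poisson(theta * T)."""
    mean = theta * rng.expovariate(1.0)
    # Count the events of a unit-rate Poisson process that fall in [0, mean].
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def abc_rejection(s_obs, n_sims, tolerance, theta_max, rng):
    """Keep prior draws of theta whose simulated statistic is within the tolerance.

    The comparison uses <= so that tolerance = 0 means an exact match.
    """
    accepted = []
    for _ in range(n_sims):
        theta = rng.uniform(0.0, theta_max)       # (2) draw theta from the prior
        s_sim = simulate_differences(theta, rng)  #     and simulate data under it
        if abs(s_sim - s_obs) <= tolerance:       # (3) compare summary statistics
            accepted.append(theta)
    return accepted


rng = random.Random(5)
accepted = abc_rejection(s_obs=3, n_sims=100_000, tolerance=0, theta_max=10.0, rng=rng)
abc_mean = sum(accepted) / len(accepted)
print(len(accepted), abc_mean)
```

With a tolerance of zero and a discrete statistic, the accepted values are draws from the exact posterior given the statistic; a larger tolerance raises the acceptance rate at the cost of an approximation.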
SNP Data
Nielsen and Slatkin (2000)
A more efficient method…
Griffiths and Tavaré (1998), Nielsen (2000)
Ascertainment Sample vs. Typed Sample
[Diagram: ascertainment sample and typed sample.]
[Figure: n = 20, d = 4, #SNPs = 1000. Frequency (0.00 to 0.30) against x = 1, …, 19; series: true frequencies, observed frequencies.]
[Figure: E[D'] (0.5 to 1.0) against r = 2Nc (0 to 10); series: no ascertainment bias, ascertainment bias.]
Correcting for ascertainment biases
Now, for simplicity, consider the case without a sweep, then
\[ L_i(P) = \frac{\Pr(X_i = x, \mathrm{Asc} \mid P)}{\Pr(\mathrm{Asc} \mid P)} = \frac{p_x \Pr(\mathrm{Asc} \mid X_i = x)}{\Pr(\mathrm{Asc} \mid P)} \]
where (in the simplest possible case)
\[ \Pr(\mathrm{Asc} \mid X_i = x) = 1 - \frac{\binom{x}{d} + \binom{n-x}{d}}{\binom{n}{d}} \]
and
\[ \Pr(\mathrm{Asc} \mid P) = \sum_{j=1}^{n-1} p_j \Pr(\mathrm{Asc} \mid X_i = j) \]
In this simple case, the maximum likelihood estimate of P is simply given by
\[ \hat{p}_k = \frac{n_k}{\Pr(\mathrm{Asc} \mid X = k)} \left[ \sum_{j=1}^{n-1} \frac{n_j}{\Pr(\mathrm{Asc} \mid X = j)} \right]^{-1}, \quad k = 1, 2, \ldots, n - 1, \]
where n_k is the number of SNPs with allele frequency k.
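The correction above can be sketched directly from the two formulas. The check below assumes, for illustration only, a true spectrum proportional to 1/j (the standard neutral expectation) and feeds in expected rather than sampled counts, so the correction should recover the true spectrum exactly. Function names are illustrative.

```python
from math import comb


def pr_asc_given_x(x, n, d):
    """Pr(Asc | X = x): the SNP is variable in a random subsample of size d.

    Assumes d >= 2; with d = 1 no SNP can be variable in the subsample.
    """
    return 1.0 - (comb(x, d) + comb(n - x, d)) / comb(n, d)


def corrected_spectrum(counts, n, d):
    """ML estimate of p_1, ..., p_{n-1} from observed counts n_1, ..., n_{n-1}.

    counts[k - 1] is the number of SNPs with allele count k in the typed sample.
    """
    weighted = [counts[k - 1] / pr_asc_given_x(k, n, d) for k in range(1, n)]
    total = sum(weighted)
    return [w / total for w in weighted]


n, d = 20, 4
norm = sum(1.0 / j for j in range(1, n))
true_p = [1.0 / (j * norm) for j in range(1, n)]
# Expected observed counts (up to a constant) after ascertainment.
observed = [1000.0 * p * pr_asc_given_x(j, n, d) for j, p in zip(range(1, n), true_p)]
estimated = corrected_spectrum(observed, n, d)
print(pr_asc_given_x(1, n, d), estimated[0], true_p[0])
```

For a singleton the ascertainment probability reduces to d/n (0.2 here), which is why rare alleles are so strongly under-represented in the observed spectrum.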
Selective sweeps:
Similarly define L(P, , D) = Pr(X | P, , D, Asc)
[Figure: 10,000 simulated SNPs with n = 20 and d = 5. Vertical axis 0 to 0.3, horizontal axis 1 to 19; series: true frequencies, observed frequencies, corrected frequencies.]
[Figure: Hudson's (2001) estimator when n = 100, m = 5, r = 5, and #SNP pairs = 200. Vertical axis 0 to 0.7, horizontal axis r = 2Nc (2 to 10); series: uncorrected, corrected.]
Complications
• Double-hit ascertainment (HapMap)
• Ascertainment based on chimpanzee (HapMap)
• Panel depth may vary among SNPs and/or among regions (HapMap).
• Ascertainment method may vary among SNPs (HapMap).
• Population structure (HapMap).
• Loss of information regarding asc. scheme (HapMap??).
HapMap ascertainment depth distribution (ignores many important components)
[Figure: vertical axis 0.00 to 0.30, horizontal axis 2 to 21; series: Perlegen, HapMap.]
Data
Directly sequenced polymorphism data from 20 European-Americans, 19 African-Americans and one chimpanzee from 9,316 protein-coding genes.
Data set previously described in Bustamante, C.D. et al. 2005. Natural selection on protein-coding genes in the human genome. Nature 437, 1153–7.
Demographic model
[Diagram: European-Americans and African-Americans, with migration, population growth, bottleneck, and admixture.]
Estimation
n 1
L( )    p j ( ) 
nj
j 1
Sampling probabilities from the 2D frequency spectrum
,
Number of SNPs with pattern j in the 2D frequency spectrum
SNPs within a gene are correlated. But estimator is
consistent. The estimate has the same properties as a real
likelihood estimator except that it converges slightly slower
because of the correlation (Wiuf 2006).
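A minimal sketch of maximizing this composite likelihood. The demographic model from the talk is not reproduced here; instead the example assumes a purely hypothetical one-parameter family, p_j(a) proportional to j^(-a) on a one-dimensional spectrum, and a simple grid search. Only the form of the likelihood (a product of p_j raised to n_j) is taken from the slide.

```python
import math


def spectrum_probs(a, n):
    """Hypothetical one-parameter model: p_j(a) proportional to j**(-a), j = 1..n-1."""
    weights = [j ** (-a) for j in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]


def composite_log_likelihood(counts, probs):
    """log L = sum over j of n_j * log p_j, treating SNPs as independent."""
    return sum(c * math.log(p) for c, p in zip(counts, probs) if c > 0)


def maximize_on_grid(counts, n, grid):
    """Return the grid value of the parameter with the highest composite likelihood."""
    return max(grid, key=lambda a: composite_log_likelihood(counts, spectrum_probs(a, n)))


n = 20
counts = [1000.0 * p for p in spectrum_probs(1.5, n)]   # expected counts at a = 1.5
grid = [0.05 * i for i in range(61)]                    # candidate values 0, 0.05, ..., 3
best = maximize_on_grid(counts, n, grid)
print(best)
```

Correlation among SNPs does not change the point estimate computed this way; it affects how fast the estimate converges and how its uncertainty should be assessed, as the slide notes.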
[Figure: African-Americans. % (0.00 to 0.35) against allele frequency (0 to 35); series: simulated, observed.]
[Figure: European-Americans. % (0.00 to 0.45) against allele frequency (0 to 40); series: simulated, observed. Goodness-of-fit: p = 0.6]
Acknowledgements
Jody Hey, John Wakeley, Melissa Hubisz, Andy Clark, Carlos Bustamante, Scott Williamson, Aida Andres, Amit Andip, Adam Boyko, Anders Albrechtsen, Mark Adams, Michelle Cargill and other staff at Celera Genomics and Applied Biosystems.