Chapter 2: Bayesian hierarchical models in geographical genetics

Manda Sayler
• Geographical genetics is the field of population genetics that
focuses on describing the distribution of genetic variation
within and among populations and understanding the
processes that produce those patterns.
• Statistical sampling uncertainty arises from the process of
constructing allele frequency estimates from population
samples.
• Genetic sampling uncertainty arises from the underlying
stochastic evolutionary process that gave rise to the
population we sampled.
– Note: increasing the number of alleles sampled within each
population reduces statistical uncertainty, but it cannot reduce
the magnitude of genetic uncertainty.
• The Weir and Cockerham approach is the most widely used
method for analysis of genetic diversity in hierarchically
structured populations.
• The Bayesian approach provides a model-based framework for
inference that is enormously powerful and flexible.
• Hierarchical Bayesian models provide a natural approach to
inference in geographical genetics.
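The difference between the two kinds of uncertainty can be made concrete with a small variance decomposition. The sketch below is illustrative only. It assumes, beyond what the bullets above state, that true population allele frequencies vary around a mean `pi` with variance `theta * pi * (1 - pi)` and that `n` alleles are sampled binomially from each population. The function name and its arguments are hypothetical. Under those assumptions the law of total variance splits the variance of a sample allele frequency into a part that shrinks with `n` and a part that does not.

```python
def uncertainty_components(pi, theta, n):
    """Split Var(sample allele frequency) into statistical and genetic parts.

    Assumes E(p) = pi, Var(p) = theta * pi * (1 - pi) among populations,
    and binomial sampling of n alleles within a population.
    """
    # Genetic part: variance of the true frequency p among populations.
    genetic = theta * pi * (1.0 - pi)
    # Statistical part: E[p(1 - p)] / n, where E[p(1 - p)] = (1 - theta) * pi * (1 - pi).
    statistical = (1.0 - theta) * pi * (1.0 - pi) / n
    return statistical, genetic


for n in (10, 100, 1000):
    stat, gen = uncertainty_components(pi=0.5, theta=0.1, n=n)
    print(f"n={n:5d}  statistical={stat:.5f}  genetic={gen:.5f}")
```

Only the statistical column changes as `n` grows, which is the point of the note above.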
Weir and Cockerham Approach
• To illustrate the formalism, consider a set of populations
segregating for 2 alleles, A1 and A2, at a single locus.
• $p_k$: frequency of allele A1 in the kth population
• $x_{ij,k}$: frequency of genotype AiAj in the kth population, k = 1,…,K
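$$x_{11} = \bar{p}^2 + \sigma_p^2, \qquad x_{12} = 2\bar{p}(1-\bar{p}) - 2\sigma_p^2, \qquad x_{22} = (1-\bar{p})^2 + \sigma_p^2$$
where
$$x_{ij} = \frac{1}{K}\sum_{k=1}^{K} x_{ij,k}, \qquad \bar{p} = \frac{1}{K}\sum_{k=1}^{K} p_k, \qquad \text{and} \qquad \sigma_p^2 = \frac{1}{K}\sum_{k=1}^{K} (p_k - \bar{p})^2$$
• The among-population variance in allele frequency, $\sigma_p^2$, defines
$$F_{st} = \frac{\sigma_p^2}{\bar{p}(1-\bar{p})}$$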
• Fst can be interpreted as the fraction of genetic diversity due to
differences in allele frequencies among populations.
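As a worked illustration of these definitions, the following minimal sketch computes the mean allele frequency, the among-population variance, and Fst from a list of population allele frequencies. The function names and the example frequencies are hypothetical, and the genotype frequencies assume Hardy-Weinberg proportions within each population, an assumption the slide does not state explicitly.

```python
def fst_from_frequencies(p):
    """Return (p_bar, var_p, fst) for allele frequencies p[k], k = 0..K-1."""
    K = len(p)
    p_bar = sum(p) / K
    var_p = sum((pk - p_bar) ** 2 for pk in p) / K
    return p_bar, var_p, var_p / (p_bar * (1.0 - p_bar))


def mean_genotype_frequencies(p):
    """Average genotype frequencies over populations.

    Assumes Hardy-Weinberg proportions within each population.
    """
    K = len(p)
    x11 = sum(pk ** 2 for pk in p) / K
    x12 = sum(2.0 * pk * (1.0 - pk) for pk in p) / K
    x22 = sum((1.0 - pk) ** 2 for pk in p) / K
    return x11, x12, x22


freqs = [0.2, 0.4, 0.6, 0.8]  # made-up example values
p_bar, var_p, fst = fst_from_frequencies(freqs)
x11, x12, x22 = mean_genotype_frequencies(freqs)
print(f"p_bar={p_bar:.3f}  var_p={var_p:.3f}  Fst={fst:.3f}")
print(f"x11={x11:.3f}  x12={x12:.3f}  x22={x22:.3f}")
```

The averaged genotype frequencies agree with the expressions in terms of the mean and variance: for example, `x11` equals `p_bar**2 + var_p` up to rounding.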
Hierarchical Bayesian Models
• A hierarchical Bayesian model uses the full power of the data
to estimate the parameters simultaneously while
accounting for both statistical and genetic uncertainty.
• To account for statistical uncertainty assume that alleles are
sampled independently within populations.
• Also assume the samples are drawn independently across
loci and populations.
• Likelihood of the sample from a single population is binomial.
$$P(\{l_{ik}\},\{n_{ik}\} \mid \{p_{ik}\},\{\pi_i\},\theta) \propto \prod_{i=1}^{I}\prod_{k=1}^{K} p_{ik}^{l_{ik}}(1-p_{ik})^{n_{ik}-l_{ik}}$$
• To account for genetic uncertainty we must assume a
parametric form for the among-population allele frequency
distribution.
• It is natural to assume that population allele frequencies follow
a Beta distribution,
$$P(p_{ik} \mid \pi_i, \theta) = \mathrm{Beta}\!\left(\frac{1-\theta}{\theta}\,\pi_i,\; \frac{1-\theta}{\theta}\,(1-\pi_i)\right)$$
where E(pik) = πi and Var(pik) = θπi(1 - πi).
• Thus θ = Var(pik) / [πi(1 - πi)], the same ratio of variance to
πi(1 - πi) that defines Fst, so θ is equivalent to Fst.
• The posterior distribution for the parameters is
$$P(\{p_{ik}\},\{\pi_i\},\theta \mid \{l_{ik}\},\{n_{ik}\}) \propto \left[\prod_{i=1}^{I}\left(\prod_{k=1}^{K} p_{ik}^{l_{ik}}(1-p_{ik})^{n_{ik}-l_{ik}}\, P(p_{ik} \mid \pi_i,\theta)\right) P(\pi_i)\right] P(\theta)$$

where P(πi) and P(θ) are the prior distributions for πi and θ,
respectively.
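The posterior above can be evaluated, up to an additive constant on the log scale, with a few lines of code. The sketch below is one possible arrangement, not the chapter's implementation. It assumes uniform priors on each πi and on θ (the slide leaves the priors unspecified), takes `l[i][k]` to be the count of one allele among `n[i][k]` sampled alleles at locus i in population k, and uses hypothetical function names throughout.

```python
import math


def beta_params(pi_i, theta):
    """Beta shape parameters with mean pi_i and variance theta * pi_i * (1 - pi_i)."""
    scale = (1.0 - theta) / theta
    return scale * pi_i, scale * (1.0 - pi_i)


def log_beta_pdf(x, a, b):
    """Log density of Beta(a, b) at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)


def log_posterior(p, pi, theta, l, n):
    """Unnormalized log posterior; p, l, n are indexed [locus][population].

    Assumes uniform priors on pi[i] and theta, which add 0 on the log scale.
    """
    def in_unit(v):
        return 0.0 < v < 1.0

    if not in_unit(theta) or not all(in_unit(v) for v in pi):
        return -math.inf
    total = 0.0
    for i, pi_i in enumerate(pi):
        a, b = beta_params(pi_i, theta)
        for k, p_ik in enumerate(p[i]):
            if not in_unit(p_ik):
                return -math.inf
            # Binomial likelihood term (binomial coefficient omitted).
            total += l[i][k] * math.log(p_ik)
            total += (n[i][k] - l[i][k]) * math.log(1.0 - p_ik)
            # Among-population Beta term for this locus.
            total += log_beta_pdf(p_ik, a, b)
    return total


# Made-up data: 2 loci, 3 populations, 20 alleles sampled per population.
l = [[4, 7, 9], [12, 15, 10]]
n = [[20, 20, 20], [20, 20, 20]]
p = [[0.2, 0.35, 0.45], [0.6, 0.75, 0.5]]
print(log_posterior(p, pi=[0.33, 0.62], theta=0.05, l=l, n=n))
```

Working on the log scale avoids numerical underflow when the product runs over many loci and populations, and returning negative infinity outside the unit interval lets a sampler reject invalid proposals automatically.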
A fully hierarchical model
• To estimate the correlation of allele frequencies within loci, we
need to add an additional level to the hierarchy that describes
the distribution of mean allele frequencies across loci
P(πi| π,θy).
• Regard the loci in the sample as a sample from a larger
universe of loci from which we might have sampled.
• Regard the populations in our sample as a sample from a
larger universe of populations from which we might have
sampled.
• The likelihood is unchanged. The posterior becomes
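$$P(\theta_x, \theta_y, \pi, \{\pi_i\}, \{p_{ik}\} \mid \{l_{ik}\}, \{n_{ik}\}) \propto \left[\prod_{i=1}^{I}\left(\prod_{k=1}^{K} p_{ik}^{l_{ik}}(1-p_{ik})^{n_{ik}-l_{ik}}\, P(p_{ik} \mid \pi_i, \theta_x)\right) P(\pi_i \mid \pi, \theta_y)\right] P(\theta_x)\, P(\theta_y)\, P(\pi)$$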
where P(pik | πi, θx) is the Beta distribution parameterized by θx, and
P(πi | π, θy) is the Beta distribution parameterized by θy.
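One way to build intuition for the three levels of this hierarchy is to simulate data from it. The sketch below draws a mean frequency for each locus, then population frequencies within each locus, then allele counts. It assumes both Beta levels use the same mean-and-θ parameterization as the single-level model, and the function names, argument names, and parameter values are hypothetical.

```python
import random


def beta_params(mean, theta):
    """Beta shape parameters with the given mean and variance theta * mean * (1 - mean)."""
    scale = (1.0 - theta) / theta
    return scale * mean, scale * (1.0 - mean)


def simulate(pi, theta_x, theta_y, n_loci, n_pops, n_alleles, seed=0):
    """Draw (pi_i, p_ik, l_ik) from the three-level model, top down."""
    rng = random.Random(seed)
    locus_means, freqs, counts = [], [], []
    for _ in range(n_loci):
        # Level 3: mean allele frequency at this locus, pi_i ~ Beta(pi, theta_y).
        pi_i = rng.betavariate(*beta_params(pi, theta_y))
        # Level 2: population frequencies at this locus, p_ik ~ Beta(pi_i, theta_x).
        p_i = [rng.betavariate(*beta_params(pi_i, theta_x)) for _ in range(n_pops)]
        # Level 1: binomial allele counts, built from independent Bernoulli draws.
        l_i = [sum(rng.random() < p_ik for _ in range(n_alleles)) for p_ik in p_i]
        locus_means.append(pi_i)
        freqs.append(p_i)
        counts.append(l_i)
    return locus_means, freqs, counts


locus_means, freqs, counts = simulate(
    pi=0.5, theta_x=0.1, theta_y=0.2, n_loci=4, n_pops=3, n_alleles=50
)
for pi_i, l_i in zip(locus_means, counts):
    print(f"pi_i={pi_i:.3f}  counts={l_i}")
```

Simulated data of this kind can also serve as a check on a sampler, since the parameter values that generated it are known.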
Developing an MCMC sampler
• The process begins by picking an initial value for p, called p0;
p0 is then updated repeatedly until we have a large sample of
values pt, using either
– Metropolis-Hastings algorithm (Figure 2.2)
– Slice algorithm (Figure 2.3)
• With a large enough sample, any property of the posterior can
be estimated to an arbitrary degree of accuracy.
• To ensure that the retained values come from a converged
Markov chain, the values from an initial burn-in period are discarded.
• Values retained from the following sample period represent the
full posterior distribution, and summary statistics are calculated
directly from this sample.
• To reduce the autocorrelation of values in the sample, it is
sometimes useful to thin the sample.
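The steps above (initial value, repeated updates, burn-in, thinning) can be sketched for a deliberately simple target: the posterior of a single allele frequency p given `l` copies of an allele among `n` sampled, under an assumed uniform prior. The sampler is a random-walk Metropolis update, a special case of Metropolis-Hastings with a symmetric proposal. It is not the sampler from Figure 2.2, and all names and tuning values here are hypothetical.

```python
import math
import random


def log_post(p, l, n):
    """Unnormalized log posterior of p: binomial likelihood, uniform prior (assumed)."""
    if p <= 0.0 or p >= 1.0:
        return -math.inf
    return l * math.log(p) + (n - l) * math.log(1.0 - p)


def metropolis(l, n, n_iter=20000, burn_in=2000, thin=10, step=0.1, p0=0.5, seed=1):
    """Random-walk Metropolis sampler; returns the thinned post-burn-in sample."""
    rng = random.Random(seed)
    p, lp = p0, log_post(p0, l, n)
    kept = []
    for t in range(n_iter):
        proposal = p + rng.gauss(0.0, step)  # symmetric proposal
        lp_prop = log_post(proposal, l, n)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            p, lp = proposal, lp_prop
        # Discard the burn-in period, then keep every `thin`-th value.
        if t >= burn_in and (t - burn_in) % thin == 0:
            kept.append(p)
    return kept


samples = metropolis(l=30, n=100)
post_mean = sum(samples) / len(samples)
print(f"retained={len(samples)}  posterior mean ~ {post_mean:.3f}")
```

For this target the exact posterior is Beta(l + 1, n - l + 1), with mean (l + 1) / (n + 2), so the sample mean gives a quick sanity check on the sampler.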