Mathematics Population Genetics

Introduction to
the Stochastic Theory
Guanajuato
March 2009
Warren J Ewens
Genes are of different types (= different “alleles” = different
colors). We assume initially that at the gene locus of interest
there are only two possible alleles, usually denoted (and
denoted in the handout notes) as A1 and A2. To be colorful, in
both senses of the word, we sometimes refer to these as the
“red” allele and the “green” allele respectively.
The individual shown is A1A2 (= red / green). The other two
possibilities are (of course) A1A1 (=red / red) and A2A2 (= green
/ green).
We next consider the entire population (of genes) at this locus,
and discuss the evolution of the A1 and A2 allelic frequencies.
Although these lectures (and slides) concern the stochastic theory
of population genetics, we first consider (briefly) some simple
aspects of the deterministic theory.
Hardy-Weinberg frequencies
Genotype:      A1A1      A1A2       A2A2
Frequencies:   x^2       2x(1-x)    (1-x)^2    (eqn. (6))
Fitnesses:     w11       w12        w22        (eqn. (8))
or             1 + s     1 + sh     1          (eqn. (9))
or             1 - s1    1          1 - s2     (eqn. (10))
x' – x ≈ sx(1-x) {x + h(1-2x)}
(eqn. (11))
dx/dt ≈ sx(1-x) {x + h(1-2x)}
(eqn. (12))
\[ t(x_1, x_2) = \int_{x_1}^{x_2} \left[ s x(1-x)\{x + h(1-2x)\} \right]^{-1} dx \]
(eqn. (13))
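As a rough numerical illustration of eqn. (13), the sketch below integrates the reciprocal of the right-hand side of eqn. (12) with a midpoint rule. The values s = 0.01, h = 1/2 and the frequency range 0.1 to 0.9 are arbitrary assumptions chosen only for the example; with h = 1/2 the integral has a closed form, which gives a check on the numerics.

```python
import math

def rate(x, s, h):
    """Right-hand side of eqn. (12): s x (1 - x) {x + h (1 - 2x)}."""
    return s * x * (1.0 - x) * (x + h * (1.0 - 2.0 * x))

def time_between(x1, x2, s, h, steps=20000):
    """Eqn. (13) by the composite midpoint rule: integral of 1/rate over [x1, x2]."""
    width = (x2 - x1) / steps
    return sum(width / rate(x1 + (k + 0.5) * width, s, h) for k in range(steps))

# Illustrative (assumed) values: s = 0.01, no dominance (h = 1/2).
s, h, x1, x2 = 0.01, 0.5, 0.1, 0.9
numeric = time_between(x1, x2, s, h)

# For h = 1/2 the integrand is 2 / {s x (1 - x)}, which integrates exactly.
closed_form = (2.0 / s) * (math.log(x2 / (1 - x2)) - math.log(x1 / (1 - x1)))
print(numeric, closed_form)
```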
Markov chain theory
Standard results are given in the notes in
equation (20) - absorption probabilities,
equation (21) - mean absorption times,
equations (24)-(28) – conditional processes,
equation (32) – stationary distribution
equation (34) – reversibility.
We use Markov chain theory to discuss the case where random
changes in these frequencies occur from one generation to the
next. We first consider the cases where there are no
complicating features such as selection, mutation, two sexes,
etc.
Even for this very simple situation, there are MANY possible
stochastic models describing these changes, (with greater or
lesser accuracy). The first one that we consider is the “simple”
Wright-Fisher model. This is a model of pure binomial
sampling.
It assumes a diploid population size that is constant over time
at the value N, with non-overlapping generations, and no
complicating features.
Since only two alleles (A1 and A2) are allowed, and since the
population size is assumed to be constant (= N individuals =
2N genes), it is sufficient to focus on the number of A1 genes
in any generation. In generation t, this number is denoted by
X(t). The number of A2 (“green”) genes in generation t is then
automatically 2N – X(t).
The binomial random sampling assumption implies that the
Markov chain model for the number of “red” genes in the
population is as shown on the following slide.
The “simple” Wright-Fisher model
\[ p_{ij} = \mathrm{Prob}\{X(t+1) = j \mid X(t) = i\} = \binom{2N}{j} \left(\frac{i}{2N}\right)^{j} \left(1 - \frac{i}{2N}\right)^{2N-j}, \quad i, j = 0, 1, 2, \ldots, 2N \]
(eqn. (35))
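A minimal sketch of eqn. (35) in code may help fix ideas. The function name `wright_fisher_matrix` and the choice N = 5 are illustrative assumptions, not part of the notes.

```python
from math import comb

def wright_fisher_matrix(N):
    """Binomial-sampling transition matrix of eqn. (35) for 2N genes."""
    M = 2 * N
    return [[comb(M, j) * (i / M) ** j * (1 - i / M) ** (M - j)
             for j in range(M + 1)]
            for i in range(M + 1)]

P = wright_fisher_matrix(5)           # N = 5 is an arbitrary small example
row_sums = [sum(row) for row in P]    # each row is a probability distribution
print(P[0][0], P[10][10])             # states 0 and 2N are absorbing
```

Rows 0 and 2N put all their mass on the diagonal, which is the statement that loss and fixation are absorbing.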
There are two absorbing states (corresponding to “all genes are
A1” and “all genes are A2”). With probability 1, one or other of
these two states will eventually be entered, and “fixation” has
occurred. We can ask:
(i) What is the probability that the “all A1” state is eventually
entered?
(ii) What is the mean number of generations until one of the
absorbing states is entered?
(iii) Given that eventually all genes are A1, what is the mean
number of generations until this happens?
The answer to question (i) is straightforward.
Standard Markov chain theory shows that this probability depends on
the initial number of A1 genes. If for different possible initial
numbers i, (i = 0, 1, 2, …, 2N), this probability is denoted by
πi, the set of values (π0, π1, π2,…, π2N) satisfies
πi = Σj pij πj, (i = 1, 2, …, 2N-1),
π0 = 0, π2N = 1.
It is easy to see from this that πi = i / (2N).
(eqn. (36))
Thus the required probability is X(0) / 2N.
This result can also be found using martingale arguments – see
eqn. (37).
A more “genetic” way of getting this result is this: eventually all
genes in the population will be descended from one gene in the
parental generation. The probability that this is an A1 gene is,
by symmetry, simply the initial proportion X(0) / 2N of A1
genes in the population.
(Later we “time-reverse” this argument when considering the
coalescent.)
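The claim πi = i / (2N) can be checked numerically for a small population. The sketch below, with the arbitrary assumed value N = 4, rebuilds the binomial transition matrix and confirms that the candidate solution satisfies πi = Σj pij πj at every state.

```python
from math import comb

N = 4                      # arbitrary small population size for illustration
M = 2 * N
P = [[comb(M, j) * (i / M) ** j * (1 - i / M) ** (M - j)
      for j in range(M + 1)]
     for i in range(M + 1)]

pi = [i / M for i in range(M + 1)]     # candidate solution pi_i = i / (2N)
residuals = [abs(pi[i] - sum(P[i][j] * pi[j] for j in range(M + 1)))
             for i in range(M + 1)]
max_residual = max(residuals)
print(max_residual)
```

The residual vanishes (up to rounding) because the mean of the binomial distribution in row i is exactly i, which is also the content of the martingale argument.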
“Mean time” questions are much harder to answer, and to this day
no exact answers are known.
Early approaches to this problem centered around the eigenvalues
of the Wright-Fisher transition matrix – see eqn. (38) -
\[ \lambda_0 = \lambda_1 = 1, \quad \lambda_j = \frac{(2N)(2N-1)\cdots(2N-j+1)}{(2N)^j}, \quad j = 2, 3, \ldots, 2N. \]
In particular, λ2 = 1 – 1/(2N).
The right - eigenvector corresponding to λ2 is
r2' = (0, 1(2N-1), …, i(2N-i), …. 1(2N-1), 0).
The left-eigenvector is unknown. It is approximately (1,1,1,….,
1,1,1).
This leads to
\[ p_{ij}^{(n)} \approx C\, i(2N-i)\{1 - 1/(2N)\}^{n} \]
for large n.
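The right-eigenvector statement is easy to verify numerically. The following sketch (N = 4 is an assumed illustrative value) applies the Wright-Fisher matrix to the vector with entries i(2N-i) and compares the result with λ2 times that vector.

```python
from math import comb

N = 4                      # arbitrary small population size for illustration
M = 2 * N
P = [[comb(M, j) * (i / M) ** j * (1 - i / M) ** (M - j)
      for j in range(M + 1)]
     for i in range(M + 1)]

lam2 = 1 - 1 / M                              # lambda_2 = 1 - 1/(2N)
r2 = [i * (M - i) for i in range(M + 1)]      # entries i (2N - i)
P_r2 = [sum(P[i][j] * r2[j] for j in range(M + 1)) for i in range(M + 1)]
max_error = max(abs(P_r2[i] - lam2 * r2[i]) for i in range(M + 1))
print(max_error)
```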
The Taylor series approach. (This is essentially the
diffusion approximation approach – see later.)
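eqns. (41), (42), (43)
\[ \bar{t}(x) = \sum \mathrm{Prob}\{x \to x + \delta x\}\, \bar{t}(x + \delta x) + 1 \]
\[ \bar{t}(x) \approx \bar{t}(x) + E(\delta x)\, \bar{t}'(x) + \tfrac{1}{2} E\{(\delta x)^2\}\, \bar{t}''(x) + 1 \]
\[ E(\delta x)\, \bar{t}'(x) + \tfrac{1}{2} E\{(\delta x)^2\}\, \bar{t}''(x) \approx -1 \]
For the simple Wright-Fisher model,
\[ E(\delta x) = 0, \quad E\{(\delta x)^2\} = x(1-x)/(2N). \]
This gives
\[ \bar{t}''(x) \approx -4N / \{x(1-x)\}. \]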
Mean times – Taylor series approximation
eqns. (47), (49), (50)
\[ \bar{t}(p) \approx -4N\{p \log p + (1-p)\log(1-p)\} \]
\[ \bar{t}\{(2N)^{-1}\} \approx 2 + 2\log 2N \]
\[ \bar{t}\{\tfrac{1}{2}\} \approx 2.8N \text{ generations} \]
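The two special cases follow directly from the general approximation, as the short sketch below illustrates. The population size N = 1000 is an assumed value used only for the example, and the function name is hypothetical.

```python
import math

def mean_absorption_time(p, N):
    """Approximation (47): -4N {p log p + (1 - p) log(1 - p)}, in generations."""
    return -4 * N * (p * math.log(p) + (1 - p) * math.log(1 - p))

N = 1000                                           # assumed population size
t_half = mean_absorption_time(0.5, N)              # equals 4N log 2
t_single = mean_absorption_time(1 / (2 * N), N)    # one initial A1 gene
print(t_half / N, t_single, 2 + 2 * math.log(2 * N))
```

Since 4 log 2 is about 2.77, the first value corresponds to the "2.8N generations" quoted above.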
Mean times with one initial A1 gene.
eqns. (49) and (53)
\[ \bar{t}_1 = \sum_{j=1}^{2N-1} \bar{t}_{1,j} \]
Fisher, Wright:
\[ \bar{t}_{1,j} \approx 2/j, \quad j = 1, 2, \ldots, 2N-1 \]
\[ \bar{t}_1 \approx 2\{\log(2N-1) + \gamma\} \]
(γ is Euler's constant.)
Conditional process (conditional on fixation of A1)
eqns. (24), (27), (28)
\[ p^{*}_{ij} = p_{ij}\, \pi_j / \pi_i \]
\[ p^{*(n)}_{ij} = p^{(n)}_{ij}\, \pi_j / \pi_i \]
\[ \bar{t}^{*}_{ij} = \bar{t}_{ij}\, \pi_j / \pi_i \]
Conditional mean times
eqns. (59), (60), (61)
Applying these to the Wright-Fisher model, we get
\[ \bar{t}^{*}\{(2N)^{-1}\} \approx 4N - 2 \text{ generations} \]
\[ \bar{t}^{*}\{\tfrac{1}{2}\} \approx 2.8N \text{ generations} \]
\[ \bar{t}^{*}\{1 - (2N)^{-1}\} \approx 2\log 2N \text{ generations} \]
One-way mutation:
the Wright-Fisher model
eqn. (63)
\[ p_{ij} = \binom{2N}{j} (\psi_i)^{j} (1 - \psi_i)^{2N-j}, \quad \text{where } \psi_i = i(1-u)/(2N) \]
One-way mutation: Taylor series (=diffusion)
approximation
eqns. (66), (67)
\[ \bar{t}(x, p) = 4N x^{-1} (1-\theta)^{-1} \{(1-x)^{\theta-1} - 1\}, \quad 0 \le x \le p \]
\[ \bar{t}(x, p) = 4N x^{-1} (1-\theta)^{-1} (1-x)^{\theta-1} \{1 - (1-p)^{1-\theta}\}, \quad p \le x \le 1 \]
(p is the initial frequency of A1)
Two-way mutation
eqns. (76), (77), (78)
\[ \psi_i = \{i(1-u) + (2N-i)v\}/(2N) \]
\[ \mu = 2Nv/(u+v) \]
\[ \sigma^2 = 4N^2 uv / \{(u+v)^2 (4Nu + 4Nv + 1)\} + \text{small order terms} \]
\[ \mathrm{Prob}(\text{two genes of same allelic type}) \approx (1+\theta)/(1+2\theta) \]
Homozygosity probability
The case u = v
eqn. (79)
\[ F = \{u^2 + (1-u)^2\}\left\{\frac{1}{2N} + F\left(1 - \frac{1}{2N}\right)\right\} + 2u(1-u)(1-F)\left(1 - \frac{1}{2N}\right) \]
\[ F = \frac{1 + 2u(1-u)(2N-2)}{1 + 4u(1-u)(2N-1)} \]
\[ F \approx (1+\theta)/(1+2\theta) \]
The Cannings (exchangeable) model
Gene i leaves yi offspring genes. The joint
distribution of (yi, yj, …, yk) is independent
of (i, j, …, k). As in the Wright-Fisher
model, each gene is either of allelic type A1
or A2.
Suppose that in the Cannings model, we write Xt for the number of
A1 genes in generation t. There will then be a transition matrix for
Xt.
Then the eigenvalues of this transition matrix (describing the number
of A1 genes in any generation) are (eqn. (81)):
λ0 = 1, λj = E(y1 y2 y3 ∙∙∙ yj), j = 1, 2, …, 2N.
Here
λ1 ≥ λ2 ≥ λ3 ≥ … ≥ λ2N.
This is a very useful formula.
An example
\[ \lambda_2 = E(y_1 y_2) = 1 - \sigma^2/(2N-1), \quad \text{where } \sigma^2 = \mathrm{var}(y_i) \]
eqn. (84)
The Moran (birth-death) model
eqns. (92), (93), (94)
\[ p_{i,i-1} = i(2N-i)/(2N)^2 \]
\[ p_{i,i+1} = i(2N-i)/(2N)^2 \]
\[ p_{i,i} = \{i^2 + (2N-i)^2\}/(2N)^2 \]
Mean sojourn times
eqn. (97)
\[ \bar{t}_{ij} = 2N(2N-i)/(2N-j), \quad j = 1, 2, \ldots, i \]
\[ \bar{t}_{ij} = 2Ni/j, \quad j = i+1, \ldots, 2N-1 \]
Mean times to fixation or loss
eqn. (98)
\[ \bar{t}_i = 2N(2N-i) \sum_{j=1}^{i} (2N-j)^{-1} + 2Ni \sum_{j=i+1}^{2N-1} j^{-1} \]
\[ \bar{t}(p) \approx -(2N)^2 \{p \log p + (1-p)\log(1-p)\} \]
Conditional mean times
eqns. (99), (100), (101)
\[ \bar{t}^{*}_{ij} = 2N(2N-i)j / \{i(2N-j)\}, \quad j = 1, 2, \ldots, i \]
\[ \bar{t}^{*}_{ij} = 2N, \quad j = i+1, \ldots, 2N-1 \]
\[ \bar{t}^{*}_{i} = 2N(2N-i)\, i^{-1} \sum_{j=1}^{i} j(2N-j)^{-1} + 2N(2N-i-1) \]
\[ \bar{t}^{*}_{1} = 2N(2N-1) \]
Largest non-unit eigenvalue and its eigenvectors
eqn. (104)
\[ \lambda_2 = 1 - 2/(2N)^2 \]
\[ r' = (0,\ 1(2N-1),\ \ldots,\ i(2N-i),\ \ldots,\ 1(2N-1),\ 0) \]
\[ \ell' = (-\tfrac{1}{2}(2N-1),\ 1,\ \ldots,\ 1,\ -\tfrac{1}{2}(2N-1)) \]
(Approximate) mean times (with one-way mutation)
eqns. (109), (110)
\[ \bar{t}(p) \approx (2N)^2 (1-\theta)^{-1} \left[ \int_0^p x^{-1}\{(1-x)^{\theta-1} - 1\}\, dx + \int_p^1 x^{-1}(1-x)^{\theta-1}\{1 - (1-p)^{1-\theta}\}\, dx \right] \]
\[ \bar{t}\{(2N)^{-1}\} \approx 2N \left[ 1 + \int_{(2N)^{-1}}^{1} x^{-1}(1-x)^{\theta-1}\, dx \right] \]
Another (approximate) expression
\[ \bar{t}(p) \approx \sum_{j=1}^{\infty} \frac{4N}{j(j-1+\theta)} \{1 - (1-p)^{j}\} \]
\[ \bar{t}(1) \approx \sum_{j=1}^{2N} \frac{4N}{j(j-1+\theta)} \]
Infinitely many alleles:
Wright-Fisher model
eqn. (119)
\[ \mathrm{Prob}\{X_0(t+1), X_1(t+1), X_2(t+1), \ldots \mid X_1(t), X_2(t), \ldots\} = \frac{(2N)!}{\prod_i X_i(t+1)!} \prod_i \psi_i^{X_i(t+1)} \]
where \( \psi_0 = u \) and \( \psi_i = X_i(t)(1-u)/(2N), \ i = 1, 2, 3, \ldots \)
Homozygosity probability
eqns. (120), (121)
\[ F_2^{(t+1)} = (1-u)^2 \left\{ (2N)^{-1} + \left[1 - (2N)^{-1}\right] F_2^{(t)} \right\} \]
\[ F_2 = \left\{ 1 - 2N + 2N(1-u)^{-2} \right\}^{-1} \approx (1+\theta)^{-1} \]
Identity probability with three genes
eqn. (136)
\[ F_3^{(t+1)} = (1-u)^3 (2N)^{-2} \left\{ 1 + 3(2N-1) F_2^{(t)} + (2N-1)(2N-2) F_3^{(t)} \right\} \]
\[ F_3 \approx 2(2+\theta)^{-1} F_2 \approx 2! / \{(1+\theta)(2+\theta)\} \]
Population mean of K
eqns. (125), (126), (127)
\[ E(K) \approx \theta \int_{(2N)^{-1}}^{1} x^{-1}(1-x)^{\theta-1}\, dx \]
\[ E\{K(x_1, x_2)\} \approx \theta \int_{x_1}^{x_2} x^{-1}(1-x)^{\theta-1}\, dx \]
\[ \phi(x) = \theta x^{-1}(1-x)^{\theta-1} \]
Identity probability with n genes
eqn. (138)
\[ F_n \approx (n-1)! / \{(1+\theta)(2+\theta)\cdots(n-1+\theta)\} \]
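Sample partition formula
eqn. (143)
\[ \mathrm{Prob}(A = a) = \frac{n!\, \theta^{\sum a_j}}{1^{a_1} 2^{a_2} \cdots n^{a_n}\, a_1!\, a_2! \cdots a_n!\, S_n(\theta)} \]
\[ a = (a_1, a_2, \ldots, a_n) \]
\[ S_n(\theta) = \theta(\theta+1)(\theta+2)\cdots(\theta+n-1) \]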
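Sample distribution of K
eqns. (145), (146), (147)
\[ \mathrm{Prob}(K = k) = |S_n^k|\, \theta^k / S_n(\theta) \]
\[ E(K) = \frac{\theta}{\theta} + \frac{\theta}{\theta+1} + \frac{\theta}{\theta+2} + \cdots + \frac{\theta}{\theta+n-1} \]
\[ \mathrm{var}(K) = \theta \sum_{j=1}^{n-1} \frac{j}{(\theta+j)^2} \]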
From the sampling formula,
Prob{only one allele observed in a sample of n genes}
= (n-1)! / {(1+θ)(2+θ)∙∙∙(n-1+θ)}.
Using the frequency spectrum,
Prob{only one allele observed in a sample of n genes}
\[ = \theta \int_0^1 x^{n}\, x^{-1}(1-x)^{\theta-1}\, dx = (n-1)! / \{(1+\theta)(2+\theta)\cdots(n-1+\theta)\}, \]
(as found above)
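The sampling formula lends itself to a direct numerical check. The sketch below enumerates every partition vector a for a small sample, evaluates eqn. (143), and confirms that the probabilities sum to one, that the implied mean of K agrees with eqn. (146), and that the single-allele probability matches the closed form above. The values n = 5 and θ = 1.5 and all function names are assumptions made for the example.

```python
from math import factorial

def rising_factorial(theta, n):
    """S_n(theta) = theta (theta + 1) ... (theta + n - 1)."""
    product = 1.0
    for j in range(n):
        product *= theta + j
    return product

def partition_probability(a, theta):
    """Eqn. (143): a[j - 1] is the number of alleles seen exactly j times."""
    n = sum(j * count for j, count in enumerate(a, start=1))
    denominator = rising_factorial(theta, n)
    for j, count in enumerate(a, start=1):
        denominator *= j ** count * factorial(count)
    return factorial(n) * theta ** sum(a) / denominator

def allelic_partitions(n):
    """Yield every vector a = (a_1, ..., a_n) with sum_j j * a_j = n."""
    def parts(remaining, cap):
        if remaining == 0:
            yield []
            return
        for part in range(min(remaining, cap), 0, -1):
            for rest in parts(remaining - part, part):
                yield [part] + rest
    for partition in parts(n, n):
        a = [0] * n
        for part in partition:
            a[part - 1] += 1
        yield a

n, theta = 5, 1.5                      # assumed sample size and theta
probabilities = {tuple(a): partition_probability(a, theta)
                 for a in allelic_partitions(n)}
total = sum(probabilities.values())
mean_K = sum(sum(a) * p for a, p in probabilities.items())
expected_mean_K = sum(theta / (theta + j) for j in range(n))
one_allele = probabilities[(0, 0, 0, 0, 1)]
one_allele_closed = factorial(n - 1) / rising_factorial(theta + 1, n - 1)
print(total, mean_K, one_allele)
```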
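Moran model:
the entire population
eqns. (151), (152)
\[ \mathrm{Prob}(\beta_1, \beta_2, \ldots, \beta_{2N}) = \frac{(2N)!\, \theta^{\sum \beta_j}}{1^{\beta_1} 2^{\beta_2} \cdots (2N)^{\beta_{2N}}\, \beta_1!\, \beta_2! \cdots \beta_{2N}!\, S_{2N}(\theta)} \]
\[ \theta = 2Nu/(1-u) \]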
Exact (Moran model) mean number of alleles with j
representing genes
eqn. (157), used in eqn. (156)
\[ E(\beta_j) = \frac{\theta}{j} \binom{2N}{j} \binom{2N+\theta-1}{j}^{-1}, \quad j = 1, 2, \ldots, 2N \]
Probability of quasi-fixation
eqn. (158). See also eqn. (159)
 2 N 1 2 N    1 2 N  1


   
j
j 
 j 0 

Compare this with
  2 N    1

2N 
2N


1
1



1
Quasi-fixation probabilities: the case θ = 1
eqn. (161)
\[ \frac{1}{2N} \left\{ 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{2N} \right\}^{-1} \]
Compare this with
\[ \frac{1}{2N} \]
(Note: mean number of alleles in the
population = 1 + 1/2 + 1/3 + … + 1/(2N))
Mean number of generations until loss of all current
alleles
2 N (2 N   )(  1)
1
2N

j 1
2N
2 N ( 2 N   )
j 1
1

 2 N  2 N    1 
1 


j 1  
  j 

j



1
j ( j    1)
Properties of the simple Wright-Fisher model and the
resulting effective population sizes
eqns. (175), (176), (177)
\[ \lambda_{\max} = 1 - (2N)^{-1}, \qquad N_e^{(e)} = \{2(1 - \lambda_{\max})\}^{-1} \]
\[ \pi_2 = \mathrm{Prob}(\text{two genes have same parent}) = (2N)^{-1}, \qquad N_e^{(i)} = (2\pi_2)^{-1} \]
\[ \mathrm{Var}\{x(t+1) \mid x(t)\} = x(t)\{1 - x(t)\}/(2N), \qquad N_e^{(v)} = x(t)\{1 - x(t)\} / [2\, \mathrm{Var}\{x(t+1) \mid x(t)\}] \]
Effective population size for the Cannings model
eqns. (178), (179), (181), (182)
\[ N_e^{(e)} = N_e^{(i)} = (N - \tfrac{1}{2})/\sigma^2. \]
Therefore,
\[ \theta\ (\text{Cannings model}) \approx 4Nu/\sigma^2 \]
Effective population size in the Moran model
eqn. (183)
\[ N_e^{(e)} = N_e^{(i)} = N_e^{(v)} = \tfrac{1}{2} N. \]
Eigenvalue effective population size for the two-gender
Wright-Fisher model
eqn. (193)
\[ N_e^{(e)} \approx 4 N_1 N_2 N^{-1} \]
Eigenvalue effective population size for the sub-divided
population Wright-Fisher model
eqn. (198)
N
(e)
e
 N ( H  1)1  2K ( H  1)
1
Inbreeding effective population size for the sub-divided
population Wright-Fisher model
eqn(199)
N
(i )
e
 N ( H  1) 
1
2
/1  (2 N )
1

Eigenvalue effective population size for the cyclic
population size Wright-Fisher model
eqn. (200)
\[ N_e^{(e)} \approx k \{ N_1^{-1} + \cdots + N_k^{-1} \}^{-1} \]
DIFFUSION THEORY
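The forward Kolmogorov equation (eqn. (215)):
\[ \frac{\partial}{\partial t} f(x; t) = -\frac{\partial}{\partial x}\{a(x) f(x; t)\} + \frac{1}{2} \frac{\partial^2}{\partial x^2}\{b(x) f(x; t)\} \]
The backward Kolmogorov equation (eqn. (218)):
\[ \frac{\partial}{\partial t} f(x; p, t) = a(p) \frac{\partial}{\partial p} f(x; p, t) + \frac{1}{2} b(p) \frac{\partial^2}{\partial p^2} f(x; p, t) \]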
From the backward equation we get (when relevant) fixation
probabilities (see eqns. (224) and (226)) and mean fixation times
(see eqns. (230), (231) and (232) for the case of two absorbing
boundaries, and eqns. (237), (238), (239) and (240) for the case of
one absorbing boundary).
We also get information about the variance of the fixation times:
see eqn. (236).
When there are two absorbing boundaries we can also get
conditional mean absorption times (see eqns. (247), (248),
(249), (250), (251)).
We can also get the conditional process drift and diffusion
coefficients (see eqns. (254) and (255), with the Wright-Fisher
process values in (256)), as well as the conditional
process forward and backward Kolmogorov equations (see
eqns. (258) and (259)).
From the forward equation we get (when relevant) the stationary
distribution: see eqn. (244).
The scale and speed functions
These are very important. The scale function p(x) is defined in
eqn. (260) and the speed function m(x) is defined in (261).
These lead to the functions u(s) and v(s) (see eqns. (262) and (263)),
which define boundary behavior (see eqns. (264)).
Values for the scale and speed function for diffusion processes in
genetics are given in eqns. (271) and (272).
Many applications of these in genetics are then given on pages 89
– 108.
INFERENCE OPERATIONS
1. Estimation of θ.
We have seen that the parameter θ enters into many formulae. So
it is interesting to consider how we might estimate it, from
data.
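Sample partition formula (remember?)
eqn. (143)
\[ \mathrm{Prob}(A = a) = \frac{n!\, \theta^{\sum a_j}}{1^{a_1} 2^{a_2} \cdots n^{a_n}\, a_1!\, a_2! \cdots a_n!\, S_n(\theta)} \]
\[ a = (a_1, a_2, \ldots, a_n) \]
\[ S_n(\theta) = \theta(\theta+1)(\theta+2)\cdots(\theta+n-1) \]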
Sample distribution of K (remember?)
eqn. (145)
\[ \mathrm{Prob}(K = k) = |S_n^k|\, \theta^k / S_n(\theta) \]
These give conditional partition probabilities
eqn. (328)
\[ \mathrm{Prob}\{A = a \mid K_n = k\} = \frac{n!}{|S_n^k|\, 1^{a_1} 2^{a_2} \cdots n^{a_n}\, a_1!\, a_2! \cdots a_n!} \]
This shows that k is a sufficient statistic for θ. Standard
statistical theory then shows that we must estimate θ
by using k, AND k ONLY.
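MLE of θ
eqn. (330)
\[ k = \frac{\hat{\theta}_K}{\hat{\theta}_K} + \frac{\hat{\theta}_K}{\hat{\theta}_K + 1} + \frac{\hat{\theta}_K}{\hat{\theta}_K + 2} + \cdots + \frac{\hat{\theta}_K}{\hat{\theta}_K + n - 1} \]
\[ k = \psi(\hat{\theta}_K), \qquad E(k) = \psi(\theta) \]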
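The estimating equation has no closed-form solution, but its right-hand side increases with θ, so a simple bisection finds the root. The sketch below is one way to do this under that observation; the function names, the bracketing interval and the example values k = 3, n = 5 are all assumptions for illustration, and a solution exists only when 1 < k < n.

```python
def expected_K(theta, n):
    """E(K) = theta/theta + theta/(theta+1) + ... + theta/(theta+n-1)."""
    return sum(theta / (theta + j) for j in range(n))

def mle_theta(k, n, lower=1e-9, upper=1e6, iterations=200):
    """Solve expected_K(theta, n) = k by bisection (requires 1 < k < n)."""
    for _ in range(iterations):
        middle = (lower + upper) / 2
        if expected_K(middle, n) < k:
            lower = middle       # expected K too small: theta must be larger
        else:
            upper = middle
    return (lower + upper) / 2

theta_hat = mle_theta(k=3, n=5)  # hypothetical sample: 3 alleles in 5 genes
print(theta_hat, expected_K(theta_hat, 5))
```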
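Approximating the mean square error of the estimator
eqn. (336)
\[ k - E(k) \approx (\hat{\theta}_K - \theta)\, \psi'(\theta), \quad \text{where } \psi(\theta) = E(K) \text{ as a function of } \theta \]
\[ \mathrm{MSE}(\hat{\theta}_K) \approx \frac{\mathrm{var}(K_n)}{\{\psi'(\theta)\}^2} \]
eqn. (338)
\[ \mathrm{MSE}(\hat{\theta}_K) \approx \frac{\theta}{\sum_{j=1}^{n-1} j/(j+\theta)^2} \]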
Alleles data
GTATGCCTGC
GTATGCCTGC
GTCTGCTTGA
GTATGCCTGC
CTATGCCTGC
Three alleles (k = 3): a1 = 2, a2 = 0, a3 = 1
Sites data
GTATGCCTGC
GTATGCCTGC
GTCTGCTTGA
GTATGCCTGC
CTATGCCTGC
Four polymorphic sites (s = 4)
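“Sites” data
The data consist simply of S, the number of segregating sites
in the sample of n genes (DNA sequences).
\[ E(S) = \theta g_1, \qquad \mathrm{Var}(S) = \theta g_1 + \theta^2 g_2, \]
where
\[ g_1 = \sum_{j=1}^{n-1} j^{-1}, \qquad g_2 = \sum_{j=1}^{n-1} j^{-2}. \]
Thus
\[ \hat{\theta}_S = S / g_1 \]
and
\[ \mathrm{Var}(\hat{\theta}_S) = \mathrm{Var}(S)/g_1^2 = \theta/g_1 + \theta^2 g_2/g_1^2. \]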
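As a worked illustration, the sketch below counts segregating sites in the five example sequences shown on the data slides and forms the estimate S / g1. The variable names are assumptions; the variance function simply evaluates the formula above at a supplied value of θ.

```python
sequences = [
    "GTATGCCTGC",
    "GTATGCCTGC",
    "GTCTGCTTGA",
    "GTATGCCTGC",
    "CTATGCCTGC",
]
n = len(sequences)

# A site is segregating if more than one nucleotide appears in its column.
S = sum(1 for column in zip(*sequences) if len(set(column)) > 1)

g1 = sum(1 / j for j in range(1, n))
g2 = sum(1 / j ** 2 for j in range(1, n))
theta_S = S / g1

def variance_of_theta_S(theta):
    """Var(theta_S) = theta / g1 + theta^2 g2 / g1^2."""
    return theta / g1 + theta ** 2 * g2 / g1 ** 2

print(S, theta_S, variance_of_theta_S(theta_S))
```

For these data S = 4 and g1 = 25/12, so the estimate is 48/25 = 1.92.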
Some values of Var(θ̂_S) / MSE(θ̂_K)
             θ = .5    θ = 1     θ = 3     θ = 5
n = 50       .902      .874      .891      .928
n = 100      .918      .903      .960      1.038
n = 500      .943      .942      1.047     1.178
2. Testing for neutrality
The Ewens-Watterson test.
This is based on the conditional distribution of the numbers n1, n2, …, nk of
genes of the k alleles observed in a sample of n genes (eqn. (348), the
same as eqn. (328)):
\[ \frac{n!}{|S_n^k|\, k!\, n_1 n_2 \cdots n_k} \]
The test statistic is the sample homozygosity \( \sum_j n_j^2 / n^2 \).
A test based on the “sample frequency spectrum”
eqn. (352)
\[ E(A_i \mid k = 10, n = 21) = \frac{21!\, |S_{21-i}^{9}|}{i\, (21-i)!\, |S_{21}^{10}|} \]
The Tajima test
\[ \hat{\theta}_T = \sum_{i<j} T(i, j) \Big/ \binom{n}{2} \]
\[ D = \frac{\hat{\theta}_T - \hat{\theta}_S}{\sqrt{\hat{V}}} \]
eqn. (353)
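The numerator of D can be computed directly from sequence data. The sketch below uses the five example sequences from the data slides and takes T(i, j) to be the number of sites at which sequences i and j differ, which is an assumption about the notation. The variance estimate in the denominator is not computed, since its formula is not given on the slide.

```python
from itertools import combinations

sequences = [
    "GTATGCCTGC",
    "GTATGCCTGC",
    "GTCTGCTTGA",
    "GTATGCCTGC",
    "CTATGCCTGC",
]
n = len(sequences)

def differences(first, second):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(first, second))

pairs = list(combinations(sequences, 2))          # all n(n-1)/2 pairs
theta_T = sum(differences(a, b) for a, b in pairs) / len(pairs)

S = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
theta_S = S / sum(1 / j for j in range(1, n))     # estimate based on S

numerator = theta_T - theta_S                     # numerator of D only
print(theta_T, theta_S, numerator)
```

Here the 10 pairs differ at 16 sites in total, so the pairwise estimate is 1.6, slightly below the estimate 1.92 based on the number of segregating sites.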