Mathematics Population Genetics. Introduction to the Stochastic Theory
Guanajuato, March 2009
Warren J Ewens

Genes are of different types (= different “alleles” = different colors). We assume initially that at the gene locus of interest there are only two possible alleles, usually denoted (and denoted in the handout notes) as $A_1$ and $A_2$. To be colorful, in both senses of the word, we sometimes refer to these as the “red” allele and the “green” allele respectively.

The individual shown is $A_1A_2$ (= red / green). The other two possibilities are (of course) $A_1A_1$ (= red / red) and $A_2A_2$ (= green / green).

We next consider the entire population (of genes) at this locus, and discuss the evolution of the $A_1$ and $A_2$ allelic frequencies. Although these lectures (and slides) concern the stochastic theory of population genetics, we first consider (briefly) some simple aspects of the deterministic theory.

Hardy-Weinberg frequencies

Genotype:      $A_1A_1$    $A_1A_2$     $A_2A_2$
Frequencies:   $x^2$       $2x(1-x)$    $(1-x)^2$    (eqn. (6))
Fitnesses:     $w_{11}$    $w_{12}$     $w_{22}$     (eqn. (8))
or             $1+s$       $1+sh$       $1$          (eqn. (9))
or             $1-s_1$     $1$          $1-s_2$      (eqn. (10))

$$x' - x \approx s x(1-x)\{x + h(1-2x)\} \quad \text{(eqn. (11))}$$
$$\frac{dx}{dt} \approx s x(1-x)\{x + h(1-2x)\} \quad \text{(eqn. (12))}$$
$$t(x_1, x_2) = \int_{x_1}^{x_2} \left[ s x(1-x)\{x + h(1-2x)\} \right]^{-1} dx \quad \text{(eqn. (13))}$$

Markov chain theory

Standard results are given in the notes: equation (20), absorption probabilities; equation (21), mean absorption times; equations (24)-(28), conditional processes; equation (32), stationary distribution; equation (34), reversibility.

We use Markov chain theory to discuss the case where random changes in these frequencies occur from one generation to the next. We first consider the cases where there are no complicating features such as selection, mutation, two sexes, etc.

Even for this very simple situation, there are MANY possible stochastic models describing these changes (with greater or lesser accuracy). The first one that we consider is the “simple” Wright-Fisher model. This is a model of pure binomial sampling.
It assumes a diploid population whose size is constant over time at the value $N$, with non-overlapping generations and no complicating features.

Since only two alleles ($A_1$ and $A_2$) are allowed, and since the population size is assumed to be constant (= $N$ individuals = $2N$ genes), it is sufficient to focus on the number of $A_1$ genes in any generation. In generation $t$, this number is denoted by $X(t)$. The number of $A_2$ (“green”) genes in generation $t$ is then automatically $2N - X(t)$.

The binomial random sampling assumption implies that the Markov chain model for the number of “red” genes in the population is as shown on the following slide.

The “simple” Wright-Fisher model

$$\text{Prob}\{X(t+1) = j \mid X(t) = i\} = p_{ij} = \binom{2N}{j} \left(\frac{i}{2N}\right)^{j} \left(1 - \frac{i}{2N}\right)^{2N-j}, \quad i, j = 0, 1, 2, \ldots, 2N \quad \text{(eqn. (35))}$$

There are two absorbing states (corresponding to “all genes are $A_1$” and “all genes are $A_2$”). With probability 1, one or other of these two states will eventually be entered, at which point “fixation” has occurred.

We can ask:
(i) What is the probability that the “all $A_1$” state is eventually entered?
(ii) What is the mean number of generations until one of the absorbing states is entered?
(iii) Given that eventually all genes are $A_1$, what is the mean number of generations until this happens?

The answer to question (i) is straightforward. Standard Markov chain theory shows that this probability depends on the initial number of $A_1$ genes. If for different possible initial numbers $i$ ($i = 0, 1, 2, \ldots, 2N$) this probability is denoted by $\pi_i$, the set of values $(\pi_0, \pi_1, \pi_2, \ldots, \pi_{2N})$ satisfies

$$\pi_i = \sum_j p_{ij}\,\pi_j \quad (i = 1, 2, \ldots, 2N-1), \qquad \pi_0 = 0, \quad \pi_{2N} = 1.$$

It is easy to see from this that $\pi_i = i/(2N)$ (eqn. (36)): the binomial mean of $X(t+1)$ given $X(t) = i$ is $i$, so this choice satisfies the equations. Thus the required probability is $X(0)/(2N)$.

This result can also be found using martingale arguments: see eqn. (37).

A more “genetic” way of getting this result is this: eventually all genes in the population will be descended from one gene in the parental generation.
The probability that this is an $A_1$ gene is, by symmetry, simply the initial proportion $X(0)/(2N)$ of $A_1$ genes in the population. (Later we “time-reverse” this argument when considering the coalescent.)

“Mean time” questions are much harder to answer, and to this day no exact answers are known.

Early approaches to this problem centered around the eigenvalues of the Wright-Fisher transition matrix (see eqn. (38)):

$$\lambda_0 = \lambda_1 = 1, \qquad \lambda_j = \frac{(2N)(2N-1)\cdots(2N-j+1)}{(2N)^j}, \quad j = 2, 3, \ldots, 2N.$$

In particular, $\lambda_2 = 1 - 1/(2N)$. The right eigenvector corresponding to $\lambda_2$ is

$$r_2' = (0,\ 1(2N-1),\ \ldots,\ i(2N-i),\ \ldots,\ 1(2N-1),\ 0).$$

The left eigenvector is unknown. It is approximately $(1, 1, 1, \ldots, 1, 1, 1)$. This leads to

$$p_{ij}^{(n)} \approx C\, i(2N-i)\{1 - 1/(2N)\}^n \quad \text{for large } n.$$

The Taylor series approach. (This is essentially the diffusion approximation approach; see later.) eqns (41, 42, 43)

$$\bar t(x) = \sum \text{Prob}\{x \to x + \delta x\}\, \bar t(x + \delta x) + 1$$
$$\bar t(x) \approx \bar t(x) + E(\delta x)\, \bar t'(x) + \tfrac12 E(\delta x)^2\, \bar t''(x) + 1$$
$$E(\delta x)\, \bar t'(x) + \tfrac12 E(\delta x)^2\, \bar t''(x) \approx -1$$

For the simple Wright-Fisher model, $E(\delta x) = 0$ and $E(\delta x)^2 = x(1-x)/(2N)$. This gives

$$\bar t''(x) \approx -4N/\{x(1-x)\}.$$

Mean times: Taylor series approximation. eqns (47, 49, 50)

$$\bar t(p) \approx -4N\{p \log p + (1-p)\log(1-p)\}$$
$$\bar t\{(2N)^{-1}\} \approx 2 + 2\log 2N$$
$$\bar t\{\tfrac12\} \approx 2.8N \text{ generations}$$

Mean times with one initial $A_1$ gene. eqns (49) and (53)

$$\bar t_1 = \sum_{j=1}^{2N-1} \bar t_{1,j}$$

Fisher, Wright:

$$\bar t_{1,j} \approx 2/j, \quad j = 1, 2, \ldots, 2N-1, \qquad \bar t_1 \approx 2\log(2N-1)$$

Conditional process (conditional on fixation of $A_1$). eqns (24, 27, 28)

$$p^*_{ij} = p_{ij}\,\pi_j/\pi_i, \qquad p^{*(n)}_{ij} = p^{(n)}_{ij}\,\pi_j/\pi_i, \qquad \bar t^*_{ij} = \bar t_{ij}\,\pi_j/\pi_i$$

Conditional mean times. eqns (59, 60, 61)

Applying these to the Wright-Fisher model, we get

$$\bar t^*\{(2N)^{-1}\} \approx 4N - 2 \text{ generations}$$
$$\bar t^*\{\tfrac12\} \approx 2.8N \text{ generations}$$
$$\bar t^*\{1 - (2N)^{-1}\} \approx 2\log 2N \text{ generations}$$

One-way mutation: the Wright-Fisher model. eqn. (63)

$$p_{ij} = \binom{2N}{j} \psi_i^{\,j} (1 - \psi_i)^{2N-j}, \quad \text{where } \psi_i = i(1-u)/2N$$

One-way mutation: Taylor series (= diffusion) approximation. eqns. (66), (67)

$$\bar t(x, p) = 4N x^{-1}(1-\theta)^{-1}\{(1-x)^{\theta-1} - 1\}, \quad 0 \le x \le p$$
$$\bar t(x, p) = 4N x^{-1}(1-\theta)^{-1}(1-x)^{\theta-1}\{1 - (1-p)^{1-\theta}\}, \quad p \le x \le 1$$

($p$ is the initial frequency of $A_1$, and $\theta = 4Nu$.)

Two-way mutation eqns.
(76), (77), (78)

$$\psi_i = \{i(1-u) + (2N-i)v\}/2N$$

Mean and variance of the stationary distribution of the number of $A_1$ genes:

$$\mu = 2Nv/(u+v)$$
$$\sigma^2 = \frac{4N^2uv}{(u+v)^2(4Nu + 4Nv + 1)} + \text{small order terms}$$

$$\text{Prob(two genes of same allelic type)} \approx (1+\theta)/(1+2\theta)$$

Homozygosity probability. The case $u = v$. eqn. (79)

$$F = \{u^2 + (1-u)^2\}\left\{\frac{1}{2N} + F\left(1 - \frac{1}{2N}\right)\right\} + 2u(1-u)(1-F)\left(1 - \frac{1}{2N}\right)$$
$$F = \frac{1 + 2u(1-u)(2N-2)}{1 + 4u(1-u)(2N-1)}$$
$$F \approx (1+\theta)/(1+2\theta)$$

The Cannings (exchangeable) model

Gene $i$ leaves $y_i$ offspring genes. The joint distribution of $(y_i, y_j, \ldots, y_k)$ is independent of $(i, j, \ldots, k)$. As in the Wright-Fisher model, each gene is either of allelic type $A_1$ or $A_2$.

Suppose that in the Cannings model, we write $X_t$ for the number of $A_1$ genes in generation $t$. There will then be a transition matrix for $X_t$. The eigenvalues of this transition matrix (describing the number of $A_1$ genes in any generation) are (eqn. (81)):

$$\lambda_0 = 1, \qquad \lambda_j = E(y_1 y_2 y_3 \cdots y_j), \quad j = 1, 2, \ldots, 2N.$$

Here $\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_{2N}$. This is a very useful formula.

An example (eqn. (84)):

$$\lambda_2 = E(y_1 y_2) = 1 - \sigma^2/(2N-1), \quad \text{where } \sigma^2 = \text{var}(y_i)$$

The Moran (birth-death) model. eqns. (92, 93, 94)

$$p_{i,i-1} = i(2N-i)/(2N)^2$$
$$p_{i,i+1} = i(2N-i)/(2N)^2$$
$$p_{i,i} = \{i^2 + (2N-i)^2\}/(2N)^2$$

Mean sojourn times. eqn. (97)

$$\bar t_{ij} = 2N(2N-i)/(2N-j), \quad j = 1, 2, \ldots, i$$
$$\bar t_{ij} = 2Ni/j, \quad j = i+1, \ldots, 2N-1$$

Mean times to fixation or loss. eqn. (98)

$$\bar t_i = 2N(2N-i)\sum_{j=1}^{i} (2N-j)^{-1} + 2Ni \sum_{j=i+1}^{2N-1} j^{-1}$$
$$\bar t(p) \approx -(2N)^2\{p \log p + (1-p)\log(1-p)\}$$

Conditional mean times. eqns. (99, 100, 101)

$$\bar t^*_{ij} = 2N(2N-i)j/\{i(2N-j)\}, \quad j = 1, 2, \ldots, i$$
$$\bar t^*_{ij} = 2N, \quad j = i+1, \ldots, 2N-1$$
$$\bar t^*_i = 2N(2N-i)\, i^{-1} \sum_{j=1}^{i} j(2N-j)^{-1} + 2N(2N-i-1)$$
$$\bar t^*_1 = 2N(2N-1)$$

Largest non-unit eigenvalue and its eigenvectors. eqn. (104)

$$\lambda_2 = 1 - 2/(2N)^2$$
$$r' = (0,\ 1(2N-1),\ \ldots,\ i(2N-i),\ \ldots,\ 1(2N-1),\ 0)$$
$$\ell' = \left(-\tfrac12(2N-1),\ 1,\ \ldots,\ 1,\ -\tfrac12(2N-1)\right)$$

(Approximate) mean times (with one-way mutation) eqns.
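(109, 110)

$$\bar t(p) \approx (2N)^2(1-\theta)^{-1}\left[\int_0^p x^{-1}\{(1-x)^{\theta-1} - 1\}\,dx + \int_p^1 x^{-1}(1-x)^{\theta-1}\{1 - (1-p)^{1-\theta}\}\,dx\right]$$
$$\bar t\{(2N)^{-1}\} \approx 2N\int_{(2N)^{-1}}^{1} x^{-1}(1-x)^{\theta-1}\,dx$$

Another (approximate) expression

$$\bar t(p) \approx \sum_{j=1}^{\infty} \frac{4N}{j(j-1+\theta)}\{1 - (1-p)^j\}$$
$$\bar t(1) \approx \sum_{j=1}^{2N} \frac{4N}{j(j-1+\theta)}$$

Infinitely many alleles: Wright-Fisher model. eqn. (119)

$$\text{Prob}\{X_0(t+1), X_1(t+1), X_2(t+1), \ldots \mid X_1(t), X_2(t), \ldots\} = \frac{(2N)!}{\prod_i X_i(t+1)!}\prod_i \psi_i^{X_i(t+1)}$$

where $\psi_0 = u$ and $\psi_i = X_i(t)(1-u)/(2N)$, $i = 1, 2, 3, \ldots$

Homozygosity probability. eqns. (120, 121)

$$F_2^{(t+1)} = (1-u)^2\left\{(2N)^{-1} + \left(1 - (2N)^{-1}\right)F_2^{(t)}\right\}$$
$$F_2 = \left\{1 - 2N + 2N(1-u)^{-2}\right\}^{-1} \approx (1+\theta)^{-1}$$

Identity probability with three genes. eqn. (136)

$$F_3^{(t+1)} = (1-u)^3(2N)^{-2}\left\{1 + 3(2N-1)F_2^{(t)} + (2N-1)(2N-2)F_3^{(t)}\right\}$$
$$F_3 \approx 2(2+\theta)^{-1}F_2 \approx 2!/\{(1+\theta)(2+\theta)\}$$

Population mean of K. eqns. (125, 126, 127)

$$E(K) \approx \theta\int_{(2N)^{-1}}^{1} x^{-1}(1-x)^{\theta-1}\,dx$$
$$E\{K(x_1, x_2)\} \approx \theta\int_{x_1}^{x_2} x^{-1}(1-x)^{\theta-1}\,dx$$
$$\phi(x) = \theta x^{-1}(1-x)^{\theta-1}$$

Identity probability with n genes. eqn. (138)

$$F_n \approx (n-1)!/\{(1+\theta)(2+\theta)\cdots(n-1+\theta)\}$$

Sample partition formula. eqn. (143)

$$\text{Prob}(A = a) = \frac{n!\,\theta^{\sum a_j}}{1^{a_1} 2^{a_2}\cdots n^{a_n}\, a_1!\, a_2! \cdots a_n!\, S_n(\theta)}$$

$$a = (a_1, a_2, \ldots, a_n), \qquad S_n(\theta) = \theta(\theta+1)(\theta+2)\cdots(\theta+n-1)$$

Sample distribution of K. eqns. (145, 146, 147)

$$\text{Prob}(K = k) = |S_n^k|\,\theta^k / S_n(\theta)$$
$$E(K) = \frac{\theta}{\theta} + \frac{\theta}{\theta+1} + \frac{\theta}{\theta+2} + \cdots + \frac{\theta}{\theta+n-1}$$
$$\text{var}(K) = \theta\sum_{j=1}^{n-1} \frac{j}{(\theta+j)^2}$$

From the sampling formula,

$$\text{Prob\{one allele observed in a sample of } n \text{ genes\}} = (n-1)!/\{(1+\theta)(2+\theta)\cdots(n-1+\theta)\}.$$

Using the frequency spectrum,

$$\text{Prob\{only one allele observed in a sample of } n \text{ genes\}} \approx \theta\int_0^1 x^n\, x^{-1}(1-x)^{\theta-1}\,dx = (n-1)!/\{(1+\theta)(2+\theta)\cdots(n-1+\theta)\}$$

(as found above).

Moran model: the entire population. eqns. (151, 152)

$$\text{Prob}(\beta_1, \beta_2, \ldots, \beta_{2N}) = \frac{(2N)!\,\theta^{\sum \beta_j}}{1^{\beta_1} 2^{\beta_2}\cdots(2N)^{\beta_{2N}}\, \beta_1!\, \beta_2! \cdots \beta_{2N}!\, S_{2N}(\theta)}$$
$$\theta = 2Nu/(1-u)$$

Exact (Moran model) mean number of alleles with j representing genes. eqn. (157), used in eqn. (156)

$$E(\beta_j) = \frac{\theta}{j}\binom{2N+\theta-j-1}{2N-j}\binom{2N+\theta-1}{2N}^{-1}, \quad j = 1, 2, \ldots, 2N$$

Probability of quasi-fixation. eqn. (158). See also eqn. (159)

$$\left\{\sum_{j=0}^{2N-1} \binom{2N-1+\theta}{j}\binom{2N-1}{j}^{-1}\right\}^{-1}$$

Compare this with [expression not recoverable from the extracted slide text].

Quasi-fixation probabilities: the case θ = 1 eqn.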
(161)

$$\frac{1}{2N}\left\{1 + \frac12 + \frac13 + \cdots + \frac{1}{2N}\right\}^{-1}$$

Compare this with $1/(2N)$.

(Note: mean number of alleles in the population $= 1 + 1/2 + 1/3 + \cdots + 1/(2N)$.)

Mean number of generations until loss of all current alleles

$$2N(2N+\theta)(\theta-1)^{-1}\left\{\sum_{j=1}^{2N}\frac{1}{j} - \sum_{j=1}^{2N}\frac{1}{j+\theta-1}\right\} = 2N(2N+\theta)\sum_{j=1}^{2N}\frac{1}{j(j+\theta-1)}$$

Properties of the simple Wright-Fisher model and the resulting effective population sizes. eqns. (175, 176, 177)

$$\lambda_{\max} = 1 - (2N)^{-1}, \qquad N_e^{(e)} = \tfrac12(1 - \lambda_{\max})^{-1}$$
$$\pi_2 = \text{Prob(two genes have same parent)} = (2N)^{-1}, \qquad N_e^{(i)} = (2\pi_2)^{-1}$$
$$\text{Var}\{x(t+1) \mid x(t)\} = x(t)\{1 - x(t)\}/(2N), \qquad N_e^{(v)} = \frac{x(t)\{1 - x(t)\}}{2\,\text{Var}\{x(t+1) \mid x(t)\}}$$

Effective population size for the Cannings model. eqns. (178, 179, 181, 182)

$$N_e^{(e)} = N_e^{(i)} = (N - \tfrac12)/\sigma^2.$$

Therefore (Cannings model) $\theta = 4Nu/\sigma^2$.

Effective population size in the Moran model. eqn. (183)

$$N_e^{(e)} = N_e^{(i)} = N_e^{(v)} = \tfrac12 N.$$

Eigenvalue effective population size for the two-gender Wright-Fisher model. eqn. (193)

$$N_e^{(e)} \approx 4N_1N_2N^{-1}$$

Eigenvalue effective population size for the sub-divided population Wright-Fisher model. eqn. (198)

[formula not recoverable from the extracted slide text; it involves $N$, $H$ and $K$]

Inbreeding effective population size for the sub-divided population Wright-Fisher model. eqn. (199)

[formula not recoverable from the extracted slide text]

Eigenvalue effective population size for the cyclic population size Wright-Fisher model. eqn. (200)

$$N_e^{(e)} \approx k\left\{N_1^{-1} + \cdots + N_k^{-1}\right\}^{-1}$$

DIFFUSION THEORY

The forward Kolmogorov equation (eqn. (215)):

$$\frac{\partial f(x; t)}{\partial t} = -\frac{\partial}{\partial x}\{a(x) f(x; t)\} + \frac12\frac{\partial^2}{\partial x^2}\{b(x) f(x; t)\}$$

The backward Kolmogorov equation (eqn. (218)):

$$\frac{\partial f(x; p, t)}{\partial t} = a(p)\frac{\partial f(x; p, t)}{\partial p} + \frac12 b(p)\frac{\partial^2 f(x; p, t)}{\partial p^2}$$

From the backward equation we get (when relevant) fixation probabilities (see eqns. (224) and (226)) and mean fixation times (see eqns. (230), (231) and (232) for the case of two absorbing boundaries, and eqns. (237), (238), (239) and (240) for the case of one absorbing boundary). We also get information about the variance of the fixation times: see eqn. (236).

When there are two absorbing boundaries we can also get conditional mean absorption times (see eqns.
(247), (248), (249), (250), (251)). We can also get the conditional process drift and diffusion coefficients (see eqns. (254) and (255), with the Wright-Fisher process values in (256)), as well as the conditional process forward and backward Kolmogorov equations (see eqns. (258) and (259)).

From the forward equation we get (when relevant) the stationary distribution: see eqn. (244).

The scale and speed functions

These are very important. The scale function $p(x)$ is defined in eqn. (260) and the speed function $m(x)$ is defined in (261). These lead to the functions $u(s)$ and $v(s)$ (see eqns. (262) and (263)), which define boundary behavior: see eqn. (264).

Values for the scale and speed functions for diffusion processes in genetics are given in eqns. (271) and (272). Many applications of these in genetics are then given on pages 89 – 108.

INFERENCE OPERATIONS

1. Estimation of θ.

We have seen that the parameter θ enters into many formulae. So it is interesting to consider how we might estimate it from data.

Sample partition formula (remember?). eqn. (143)

$$\text{Prob}(A = a) = \frac{n!\,\theta^{\sum a_j}}{1^{a_1} 2^{a_2}\cdots n^{a_n}\, a_1!\, a_2! \cdots a_n!\, S_n(\theta)}$$

$$a = (a_1, a_2, \ldots, a_n), \qquad S_n(\theta) = \theta(\theta+1)(\theta+2)\cdots(\theta+n-1)$$

Sample distribution of K (remember?). eqn. (145)

$$\text{Prob}(K = k) = |S_n^k|\,\theta^k / S_n(\theta)$$

These give conditional partition probabilities (eqn. (328)):

$$\text{Prob}\{A = a \mid K_n = k\} = \frac{n!}{|S_n^k|\, 1^{a_1} 2^{a_2}\cdots n^{a_n}\, a_1!\, a_2! \cdots a_n!}$$

This expression does not involve θ, which shows that k is a sufficient statistic for θ. Standard statistical theory then shows that we must estimate θ by using k, AND k ONLY.

MLE of θ. eqn. (330)

$$k = \frac{\hat\theta_K}{\hat\theta_K} + \frac{\hat\theta_K}{\hat\theta_K + 1} + \frac{\hat\theta_K}{\hat\theta_K + 2} + \cdots + \frac{\hat\theta_K}{\hat\theta_K + n - 1}$$

That is, $\hat\theta_K$ is the value of θ at which $E(K)$ equals the observed $k$.

Approximating the mean square error of the estimator. eqn. (336)

$$k - E(K) \approx (\hat\theta_K - \theta)\,\frac{dE(K)}{d\theta}, \qquad \text{MSE}(\hat\theta_K) \approx \frac{\text{var}(K_n)}{\{dE(K)/d\theta\}^2}$$

eqn. (338)

$$\text{MSE}(\hat\theta_K) \approx \theta \Big/ \sum_{j=1}^{n-1} \frac{j}{(j+\theta)^2}$$

Alleles data

TGTATGCCTGC
TGTATGCCTGC
TGTCTGCTTGA
TGTATGCCTGC
TCTATGCCTGC

Three alleles (k = 3).
$a_1 = 2$, $a_2 = 0$, $a_3 = 1$

Sites data

TGTATGCCTGC
TGTATGCCTGC
TGTCTGCTTGA
TGTATGCCTGC
TCTATGCCTGC

Four polymorphic sites (s = 4)

“Sites” data

The data consist simply of $S$, the number of segregating sites in the sample of $n$ genes (DNA sequences).

$$E(S) = g_1\theta, \qquad \text{Var}(S) = g_1\theta + g_2\theta^2$$

where

$$g_1 = \sum_{j=1}^{n-1} \frac{1}{j}, \qquad g_2 = \sum_{j=1}^{n-1} \frac{1}{j^2}.$$

Thus

$$\hat\theta_S = \frac{S}{g_1} \qquad \text{and} \qquad \text{Var}(\hat\theta_S) = \frac{\theta}{g_1} + \frac{g_2\theta^2}{g_1^2}.$$

Some values of $\text{Var}(\hat\theta_S)/\text{MSE}(\hat\theta_K)$

            θ = .5    θ = 1    θ = 3    θ = 5
n = 50      .902      .874     .891     .928
n = 100     .918      .903     .960     1.038
n = 500     .943      .942     1.047    1.178

2. Testing for neutrality

The Ewens-Watterson test. This is based on the conditional distribution of the numbers $n_1, n_2, \ldots, n_k$ of genes of the ($k$) alleles observed in a sample of $n$ genes (eqn. (348), the same as eqn. (328)):

$$\frac{n!}{|S_n^k|\, k!\, n_1 n_2 \cdots n_k}$$

The test statistic is the sample homozygosity $\sum_j n_j^2/n^2$.

A test based on the “sample frequency spectrum”. eqn. (352)

$$E(A_i \mid k = 10, n = 21) = \frac{21!}{i\,(21-i)!}\,\frac{|S_{21-i}^{9}|}{|S_{21}^{10}|}$$

The Tajima test. eqn. (353)

$$\hat\theta_T = \sum_{i<j} T(i, j) \Big/ \binom{n}{2}, \qquad D = \frac{\hat\theta_T - \hat\theta_S}{\sqrt{\hat V}}$$
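
The estimators on the last few slides can be tried out on the five aligned sequences from the “Alleles data” and “Sites data” slides. The Python sketch below is illustrative only, under stated assumptions: the function names are made up for this example, the sequences are as read from the slide text, T(i, j) is assumed to be the number of sites at which sequences i and j differ (the slide does not define it), and the maximum likelihood equation k = E(K) is solved by plain bisection, which is a convenience choice and not a method taken from the notes.

```python
from itertools import combinations

# Five aligned sequences as read from the "Alleles data" / "Sites data" slides.
SEQS = [
    "TGTATGCCTGC",
    "TGTATGCCTGC",
    "TGTCTGCTTGA",
    "TGTATGCCTGC",
    "TCTATGCCTGC",
]


def expected_k(theta, n):
    """E(K) = theta/theta + theta/(theta+1) + ... + theta/(theta+n-1)."""
    return sum(theta / (theta + j) for j in range(n))


def mle_theta_from_k(k, n, lo=1e-9, hi=1e6, iters=200):
    """Solve k = E(K) for theta by bisection (hypothetical helper).

    E(K) is strictly increasing in theta, so bisection is safe.
    A finite positive root exists only when 1 < k < n.
    """
    if not 1 < k < n:
        raise ValueError("need 1 < k < n for a finite positive estimate")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_k(mid, n) < k:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def segregating_sites(seqs):
    """Number of alignment columns containing more than one nucleotide."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def theta_s(seqs):
    """Sites-data estimate S / g1, with g1 = sum_{j=1}^{n-1} 1/j."""
    n = len(seqs)
    g1 = sum(1.0 / j for j in range(1, n))
    return segregating_sites(seqs) / g1


def theta_t(seqs):
    """Mean number of pairwise differences over all n-choose-2 pairs."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)


n = len(SEQS)
k = len(set(SEQS))            # number of distinct alleles
s = segregating_sites(SEQS)   # number of polymorphic sites
theta_k_hat = mle_theta_from_k(k, n)

print("n =", n, "k =", k, "s =", s)
print("theta_K (MLE from k):", theta_k_hat)
print("theta_S:", theta_s(SEQS))
print("theta_T:", theta_t(SEQS))
print("numerator of D:", theta_t(SEQS) - theta_s(SEQS))
```

For these data the sketch finds k = 3 and s = 4, in agreement with the slides, with θ̂_S = 4/(25/12) = 1.92 and θ̂_T = 16/10 = 1.6, so the numerator of D is negative. Computing D itself would also need the variance estimate V̂, which is not reproduced on the slide and is left out here.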