Markov chain

Markov chains Assume a gene that has three alleles A, B, and C. These can mutate into each other. 0.12 A 0.07 B 0.2 0.15 0.1 Transition probabilities 0.05 C A B C  0.68 0.07 0.1  A   P   0.12 0.78 0.05  B  0.2 0.15 0.85  C   Transition matrix Probability matrix 68% of A stays A, 12% mutates into B and 20% into C. 7% mutates from B to A and 10% from C to A. Left probability matrix: The column sums add to 1. Right probability matrix: The row sums add to 1. Transition matrices are always square The trace contains the probabilities of no change. PR  PL T Calculating probabilities Probabilities to reach another state in the next step. Probabilities to reach another state in exactly two steps. 2  0.68 0.07 0.1   0.4908 0.1172 0.1565  0.68 0.07 0.1        P2   0.12 0.78 0.05   0.1852 0.6243 0.0935 P   0.12 0.78 0.05   0.2 0.15 0.85  0.324 0.2585 0.75   0.2 0.15 0.85        The probability to reach any state in exactly n steps is given by Pn  Pn Pk  U   k  U1 Assume for instance you have a virus with N strains. Assume further that at each generation a strain mutates to another strain with probabilities ai→j. The probability to stay is therefore 1-Σai→j. What is the probability that the virus is after k generations the same as at the beginning? 1   a i 1,1  p  a N1  P A B C A B 0.5 0.3 0.2    1   a1,i 1  a1N C 0.05 0.8 0.15 0.3 0.1 0.6 k=5 PN A B C A B C 0.230675 0.20048 0.258105 0.47613 0.51785 0.43003 0.293195 0.28167 0.311865 Pk  U   k  U1 Eigenvalues 0.338197 0.561803 1 Eigenvectors 0.814984 0.550947 0.368878 -0.450512 -0.797338 0.794506 -0.364472 0.246391 0.482379 Lk 0.004424 0 0 0.055966 0 0 Inverse 0.878092 0.264583 -1.107265 0.109323 -0.798204 1.231089 0.607621 0.607621 0.607621 0 0 1 ULk 0.003606 0.030834 0.368878 -0.001993 -0.044624 0.794506 -0.001613 0.013789 0.482379 ULkU-1 0.230675 0.47613 0.293195 0.20048 0.258105 0.51785 0.43003 0.28167 0.311865 Given initial allele frequencies. What are the frequencies in the next generation?  0.68 0.07 0.1    P   0.12 0.78 0.05   0.2 0.15 0.85     0 .2    P0   0.5   0 .3    Initial allele frequencies FA (1)  FA (0) * p( A  A)  FB (0) * p( B  A)  FC (0) * p(C  A) PA  0.68 0.07 0.1  0 .2    P0   0.5   0 .3     0 .2    PA P0   0.5   0.201  0 .3     0.68 0.07 0.1   0.2   0.201  Allele frequencies      P1  PP0   0.12 0.78 0.05   0.5    0.429  in the first  0.2 0.15 0.85   0.3   0.37  generation       0.68 0.07 0.1   0.2   0.201       P1  PP0   0.12 0.78 0.05   0.5    0.429   0.2 0.15 0.85   0.3   0.37       P2  PP1  P(PP0 )  P2P0 Pn  PPn1 Pn  P(n)Pn1 Pn  Pn P0 Xn  Pn  X0  U n U 1  X0 The model assumes constant transition probabilities. Transition probabilities might change. A Markov chain is a process where step n depends only on the transition probabilities at step n-1 and the realized values at step n. A Marcov chain doesn’t have a memory. p(Xn  i | Xn 1, Xn 2 , Xn 3...Xn 1 )  p(Xn  i | Xn 1 ) Andrey Markov (1856-1922) Does our mutation process above reach in stable allele frequencies or do they change forever? Do we get stable frequencies? X n1  X n  P • X n P  Xn  1Xn  (P 1I)  Xn  0 Xn is a steady-state, stationary probability, or equilibrium vector. The associated eigenvalue is 1. The equilibrium vector is independent of the initial conditions. The largest eigenvalue (principal eigenvalue) of every probability matrix equals 1 and there is an associated stationary probability vector that defines the equilibrium conditions (PerronFrobenius theorem). Eigenvalues and eigenvectors of probability matrices P 0.006159 0.260998 0.383385 0.312983 0.491399 0.23416 0.036019 0.314422 0.292022 0.328144 0.101216 0.277682 0.087934 0.312887 0.057607 0.245795 0.03226 0.115475 0.077524 0.008197 0.41267 0.39304 0.098784 0.004584 0.114652 Column sums 1 Eigenvalues -0.49348933 -0.22284172 -0.10044735 0.139067327 1 1 Eigenvalues -0.48893388 -0.16647268 0.047131406 0.842629825 1 1 The eigenvalues of probability matrices and their transposes are identical. 1 Eigenvectors 0.676793 0.31531 0.049124 0.188368 0.5796 0.261813 -0.31217 -0.05106 0.002974 0.4894 -0.02386 0.714055 -0.81216 -0.60912 0.32066 -0.29236 -0.49289 0.451684 -0.29421 0.21635 -0.62238 -0.2243 0.362421 0.711985 0.52432 P 0.006159 0.260998 0.23416 0.036019 0.101216 0.277682 0.245795 0.03226 0.41267 0.39304 Column sums 1 Column sums of probability matrices are 1. Row sums might be higher. 1 0 0 0 0 0 1 0 0 1 0 0 0.312983 0.292022 0.312887 0.077524 0.004584 0.491399 0.328144 0.057607 0.008197 0.114652 1 1 1 Eigenvectors 0.674168 0.386913 0.097599 0.274076 0.255996 -0.71988 -0.19397 0.230028 -0.00674 0.198906 0.254115 -0.88505 -0.29806 -0.30949 -0.74062 0.10075 -0.62536 0.443543 0.582882 0.280194 One of the eigenvalues of a probability matrix is 1. To get frequencies the eigenvector has to be rescaled (normalized). If one of the entries of P is 1, the matrix is called absorbing. In this case the eigenvector of the largest eigenvalue contains only zeros and one 1. 0 0 1 0 0 Absorbing chains become monodominant by one element. Normalizing the stationary state vector P A B C D E Eigenvalues -0.1173 0.259272 0.632003 0.716025 1 A 0.5 0.5 0 0 0 B 0.15 0.25 0.35 0 0.25 C 0.05 0 0.9 0.05 0 D 0 0 0.25 0.5 0.25 E 0 0.33 0 0.33 0.34 Eigenvalues 0.141692 0.307099 0.471674 0.065218 -0.68828 -0.49866 0.341664 -0.18563 0.315521 0.017437 0.220252 0.838656 -0.32032 0.653439 -0.71444 -0.35729 0.551392 -0.47931 -0.31915 -0.36096 Sum Largest eigenvector 0.13963502 0.14698423 0.9553975 0.17638108 0.12248686 1.54088469 Rescaled 0.09062 0.09539 0.620032 0.114467 0.079491 1 Frequencies Stationary frequencies Frequencies have to add to unity! Final frequencies The sum of the eigenvector entries have to be rescaled. 1 Xn  P  X0  U  U  X0 n n N=1000 Eigenvalues 3.14436E-23 0 0 2.46845E-15 0 0 0 0 1 Eigenvectors 0.816257937 0.17364202 0.35099 -0.42522385 -0.77775251 0.385401 -0.39103409 0.604110489 0.853388 Un 2.56661E-23 4.28627E-16 0.35099 -1.3371E-23 -1.9198E-15 0.385401 -1.2296E-23 1.49122E-15 0.853388 Inverse 1.005233 -0.2379 0.629018 UnU-1 0.220779 0.242424 0.536797 -0.07159 -0.38111 -0.93492 0.520064 0.629018 0.629018 0.220779 0.22078 0.242424 0.24242 0.536797 0.5368 Do all Markov chains converge? 0.6 0.3 C Recurrent part 0.4 0.9 B 0.1 A 0.3 Periodic chain B A 0.6 A 0.8 Closed part D You can leave every state. B 0.7 C C D State D cannot be left. The chain is absorbing. Recurrent and aperiodic chains are called ergodic. The probability matrix theorem tells that every irreducible ergodic transition matrix has a steady state vector T to which the process converges. Absorbing chains Closed part It is impossible to leave state D B A A chain is called absorbing if it containes states without exit. The other states are called transient. C D Any absorbing Markov chain finally converges to the absorbing states. Absorbing part A B C D A 0.5 0.25 0.25 0 B 0 0.5 0.25 0.25 C 0 0 0.5 0.5 D 0 0 0 1 Eigenvalues Principal eigenvector 0.5 0 0 0 0 0.5 0 0 0 0 0.5 0.707107 0.707107 0.707107 0 1 -0.70711 -0.70711 -0.70711 1 The time to reach the absorbing state Assume a druncard going randomly through five streets. In the first street is his home, in the last a bar. At either home or bar he stays. 0.5 0.5 0.5 Home Bar 0  1 1/ 2 0   0 0 1/ 2 0 P   0 1/ 2 0 1/ 2   0 0 1/ 2 0 0 0 0 1/ 2  A B C D E Eigenvalues -0.70711 0 0.707107 1 1 A 1 0 0 0 0 B 0.5 0 0.5 0 0 0.5 0.5 0.5 C 0 0.5 0 0.5 0 0  0 0  0 1  D 0 0 0.5 0 0.5 0 0.143403 0.316228 0.544526 0 -0.48961 -0.63246 -0.31898 0 0.692413 0 -0.4511 0 -0.48961 0.632456 -0.31898 0 0.143403 -0.31623 0.544526 E 0 0 0 0 1 Principal eigenvectors 1 0 0 0 0 0 0 0 0 1 The canonical form We rearrange the transition matrix to have the s absorbing states in the upper left corner and the t transient states in the lower right corner. We have four compartments I Pcanonical   ss  0ts Rst   Qtt  A B C D E A 1 0 0 0 0 B 0.5 0 0.5 0 0 C 0 0.5 0 0.5 0 D 0 0 0.5 0 0.5 E 0 0 0 0 1 A B C D E A 1 0 0 0 0 E 0 0 0 0 1 B 0.5 0 0.5 0 0 C 0 0.5 0 0.5 0 D 0 0 0.5 0 0.5 A E B C D A 1 0 0 0 0 E 0 1 0 0 0 B 0.5 0 0 0.5 0 C 0 0 0.5 0 0.5 D 0 0.5 0 0.5 0 Transient part After n steps we have;  I ss n P    0ts n Rst   I ss    Qtt   0ts ?  n Qtt  The unknown matrix contains information about the frequencies to reach an absorbing state from stateB, C, or D.  I ss P    0ts n n Rst   I ss    Qtt   0ts  I ss 3 P    0ts ?  n Qtt  n Rst   I ss    Qtt   0ts R( I  Q)   2 Qtt  3 R( I  Q  Q 2 )   3  Qtt  Rst  I    ss Qtt   0ts I P 3   ss  0ts Rst  I    ss Qtt   0ts R( I  Q  Q  ...Q n Qtt limn Q n  0 2  I ss 2 P    0ts 2 n 1  )   I ss    0  ts  Rst  (I  Q)i  i 0  n  Qtt  n 1 Multiplication of probabilities gives ever smaller values n 1 limn  ( I  Q)i  ( I  Q) 1 Simple geometric series i 0 B  R(Itt  Q)1 The entries nijof the matrix B contain the probabilities of ending in an absorbing state i when started in state j. N  (Itt  Q)1 The entries nijof the fundamental matrix N of Q contain the expected numbers of time the process is in state i when started in state j. The druncard’s walk A B C D E A 1 0 0 0 0 B 0.5 0 0.5 0 0 C 0 0.5 0 0.5 0 D 0 0 0.5 0 0.5 E 0 0 0 0 1 A B C D E A 1 0 0 0 0 E 0 0 0 0 1 B 0.5 0 0.5 0 0 C 0 0.5 0 0.5 0 D 0 0 0.5 0 0.5 A E B C D A 1 0 0 0 0 E 0 1 0 0 0 B 0.5 0 0 0.5 0 C 0 0 0.5 0 0.5 D 0 0.5 0 0.5 0 N  (Itt  Q)1 t  Ntt It1  (Itt  Q)1 It1 The sum of all rows of N gives the expected number of times the chain is is state i (afterwards it falls to the absorbing state). t is a column vector that gives the expected number of steps (starting at state i) before the chain is absorbed. B C D Q B 0 0.5 0 I 1 0 0 C 0.5 0 0.5 0 1 0 D 0 0.5 0 0 0 1 I 1 1 1 B C D (I-Q)-1 B 1.5 1 0.5 B C D NI 3 4 3 B C D I-Q B 1 -0.5 0 C -0.5 1 -0.5 D 0 -0.5 1 A E RN B 0.75 0.25 C 0.5 0.5 D 0.25 0.75 The probability of reaching the absorbing state from any of the transient states. B C D C 1 2 1 D 0.5 1 1.5 The expected number of steps to reach the absorbing state. 0.6 0.3 C 0.4 0.6 0.9 B A D 0.1 A 0.8 B 0.7 C 0.3 A A B B C C D D A A 0 0 0.4 0.4 0.6 0.6 0 0 B B 0.3 0.3 0.7 0.7 0 0 0 0 C C 0.3 0.3 0 0 0.7 0.7 0 0 A A B B C C A A 0.2 0.2 0.8 0.8 0 0 B B 0 0 0.3 0.3 0.7 0.7 C C 0.6 0.6 0 0 0.4 0.4 D D 0 0 0 0 0.9 0.9 0.1 0.1 Eigenvalues Eigenvalues -0.3 -0.3 0.1 0.1 0.7 0.7 1 1 Complex Complex eigenvalues eigenvalues -0.05 0.597913 -0.05 0.597913 -0.05 -0.597913 -0.05 -0.597913 1 0 1 0 Eigenvector Eigenvector 4 4 0.384111 0.384111 0.512148 0.512148 0.768221 0.768221 0 0 Eigenvector Eigenvector 3 3 0 0 0 0 0 0 Periodic chains do not have stable points. Expected return (recurrence) times 0.50 A B 0.15 0.05 0.35 C If we start at state D, how long does it take on average to return to D? 0.25 The rescaled eigenvector u of the probability matrix P gives the steady state frequencies to be in state i. E If u is the rescaled eigenvector of the probability matrix P, the expected return time tii of state i back to i is given by the inverse of the ith element ui of the eigenvector u. 0.33 0.05 0.33 D 0.25 0.25 A B C D E Sum P A 0.5 0.5 0 0 0 B 0.25 0.15 0.35 0 0.25 C 0.05 0 0.9 0.05 0 D 0 0 0.25 0.5 0.25 E 0 0.33 0 0.33 0.34 1 1 1 1 1 0.25 -0.77 0.295 -0.23 0.456 0.328 -0.37 -0.05 0.658 -0.57 Eigenvalue -0.21 0.212 0.655 0.732 1 tii  0.448 0.197 0.406 -0.67 -0.38 0.064 0.235 -0.88 0.262 0.317 Sum Eigenvector 0.168 0.146 0.951 0.176 0.122 1.563 A B C D E Rescaled 0.107644 0.093604 0.608424 0.11232 0.078003 1/Rescaled 9.289855 10.68333 1.64359 8.90278 12.82 1 ui In the long run it takes about 9 steps to return to D First passage times in ergodic chains 0.50 A 0.05 0.15 0.35 C 0.25 0.05 If we start at state D, how long does it take on average to reach state A? B 0.25 0.33 D D 0.33 E D 0.25 The fundamental matrix of an ergodic chain N  (Itt  P  W) 1 W is the matrix containing only the rescaled stationary point vector. D D 0.25 0.25 0.25 0.25 C E E E 0.05 0.33 0.33 0.33 A B B D 0.0125 0.15 0.35 0.25 0.012375 A C C 0.05 0.05 A 0.00144375 A 0.00103125 …… We have to consider all possible ways from D to A. The inverse of the sum of these probabilities gives the expected number of steps to reach from point j to point k. nkk  n jk t jk  Applied to the original probability matrix P the wkk fundamental matrix N of P contains information on expected number of times the process is in The expected average number of steps tjk to reach from j state i when started in state j. to k comes from the entries of the fundamental matrix N divided through the respective entry of the (rescaled) stationary point vector. Average first passage time 0.50 A 0.05 P 0.15 0.35 C 0.25 0.05 B 0.25 0.33 0.33 E D 0.25 A B C D E Eigenvalue -0.117299 0.259272 0.632003 0.716025 1 Sum I A 0.5 0.5 0 0 0 B 0.15 0.25 0.35 0 0.25 C 0.05 0 0.9 0.05 0 Largest eigenvector 0.14 0.147 0.955 0.176 0.122 1.541 I-P+W D E 0 0 0.25 0.5 0.25 0 0.33 0 0.33 0.34 Rescaled 0.091 0.095 0.62 0.114 0.079 1 0.591 -0.06 0.041 0.091 0.091 -0.4 0.845 0.095 0.095 -0.23 0.62 0.27 0.72 0.37 0.62 0.114 0.114 0.064 0.614 -0.22 0.079 -0.17 0.079 -0.17 0.739 Return times 11.04 10.48 1.613 8.736 12.58 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 A B C D E W 0.091 0.095 0.62 0.114 0.079 0.091 0.095 0.62 0.114 0.079 0.091 0.095 0.62 0.114 0.079 0.091 0.095 0.62 0.114 0.079 0.091 0.095 0.62 0.114 0.079 A B C D E (I-P+W)-1 A B 1.984 0.165 1.315 1.506 -2.29 -1.05 -0.28 -0.05 0.272 0.431 C -0.08 -0.33 2.007 -0.25 -0.34 D -0.33 -0.07 -1.29 2.06 0.634 E -0.22 0.574 -2.11 0.832 1.927 A B C 0 20.07 22.78 2 0 19.26 6.935 4.935 0 20.43 18.43 20.21 20.82 18.82 28.55 A =(L22-H22)/H14 D 25.55 16.52 5.322 0 16.27 E 24.33 9.773 6.643 10.73 0 A B C D E E You have sunny, cloudy, and rainy days with respective transition probabilities. How long does it take for a sunny day to folow a rainy day? How long does it take that a sunny day comes back? Sunny Cloudy Rainy Sunny 0.5 0.25 0.5 Cloudy 0.35 0.25 0.35 Rainy 0.15 0.5 0.15 Eigenvalue Largest eigenvector -0.1 0.714 0 0.541 1 0.444 Sum 1.699 I-P+W 0.92 0.17 -0.08 -0.03 1.068 -0.03 0.111 -0.24 1.111 Return times Sunny 2.378 Cloudy 3.143 Rainy 3.826 1 0 0 Rescaled 0.42 0.318 0.261 0 1 0 0 0 1 W Sunny 0.42 0.42 0.42 Cloudy 0.318 0.318 0.318 Rainy 0.261 0.261 0.261 -1 (I-P+W) 1.072 -0.15 0.072 0.029 0.938 0.029 -0.1 0.217 0.899 Sunny Cloudy Rainy Sunny 0 2.919 2.378 Cloudy 2.857 0 2.857 Rainy 3.826 2.609 0 The construction of evolutionary trees from DNA sequence data T→C T C A→G A G→C→G T G→C→A A A C G T T C A→G A G T G C C C T Probabilities of DNA substitution We assume equal substitution Single substitution probabilities. If the total probability for a substitution is p: Parallel substitution p A T Back substitution p p p Multiple substitution C G p p(A→T)+p(A→C)+p(A→G)+p(A→A) =1 The probability that A mutates to T, C, or G is P¬A=p+p+p The probability of no mutation is pA=1-3p Independent events The probability that A mutates to T and C to G is PAC=(p)x(p) Independent events p( A  B)  p( A)  p( B) p( A  B)  p( A) p( B) The probability matrix T→C T C A→G A G→C→G T G→C→A A A C G T T C A→G A G T G C C C T Single substitution Parallel substitution Back substitution Multiple substitution T A C G p p p  1  3 p   1 3 p p p   p P p p 1 3 p p     p p p 1  3 p   A T C G What is the probability that after 5 generations A did not change? p5  (1  3 p)5 The Jukes - Cantor model (JC69) now assumes that all substitution probabilities are equal. The Jukes Cantor model assumes equal substitution probabilities within these 4 nucleotides. Arrhenius model p p p  1  3 p   dP (t )  P(t )  P(t )  P(0)e t p 1  3 p p p   P dt  p p 1 3 p p Substitution probability after time t    p p p 1  3 p   A,T,G,C A Transition matrix t P(t )  P(0)t The probability that nothing changes is the zero term of the Poisson distribution P( A  C, T , G)  e   e4 pt Substitution matrix  1 3  4 t   e 4 4  1  1 e  4 t P4 4  1 1  4 t 4 4e 1 1   e  4 t 4 4 1 1  4 t  e 4 4 1 3  4 t  e 4 4 1 1  4 t  e 4 4 1 1  4 t  e 4 4 1 1  4 t  e 4 4 1 1  4 t  e 4 4 1 3  4 t  e 4 4 1 1  4 t  e 4 4 The probability of at least one substitution is 1 1  4 t   e  4 4  1 1  4 t   e  4 4 1 1  4 t   e  4 4 1 3  4 t   e  4 4  P( A  C  T  G)  e   1  e4 pt The probability to reach a nucleotide from any other is 1 (1  e  4 pt ) 4 The probability that a nucleotide doesn’t change after time t is 1 1 3 P( A  A, T , C , G | A)  1  3( (1  e  4 pt ))   e  4 pt 4 4 4 P( A, T , G, C  A)  Probability for a single difference 1 3 3 P( A  A, T , C , G )  3( (1  e  4 pt ))   e  4 pt 4 4 4 What is the probability of n differences after time t? 0.35 0.3 0.25 f(p) We use the principle of maximum likelihood and the Bernoulli distribution x n x n  n x   3 3 3 3     p( x, t )    p (1  p) n x     e4 pt  1  (  e4 pt )  4 4     x  x  4 4 0.2 0.15 0.1 0.05 0 0 1 2 3 4 5 p 6 7 8 9  n  n 3 3  1 3  ln p( x, t )  ln   x ln p  (n  x) ln(1  p)  ln   x ln  e 4 pt   (n  x) ln  e4 pt )  4 4  4 4   x  x t 1  4x  ln1   4 p  3n  This is the mean time to get x different sites from a sequence of n nucleotides. It is also a measure of distance that dependents only on the number of substitutions 10 Homo sapiens Pan troglodytes Pan paniscus Gorilla Homo neandertalensis Phylogenetic trees are the basis of any systematic classificaton t 1  4x  ln1   4 p  3n  Time Divergence - number of substitutions

Markov chain

Related documents

Products

Support

Markov chain

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib