Lecture 3: Markov models of sequence evolution Alexei Drummond Friday quiz: How many bacterial cells are there in an average adult human? A) B) C) D) 1012 (1 trillion) 1013 (10 trillion) 1014 (100 trillion) 1015 (1000 trillion) Hint: There are about 1014 human cells in the average adult human. CS369 2007 2 Modeling genetic change • Given two or more aligned nucleotide or amino acid sequences, usually the first goal is to calculate some measure of sequence similarity (or conversely distance) • The simplest way to estimate genetic distances is the pdistance (number of differences between two sequences divided by the sequence length) – The p-distance is the hamming distance normalized by the length of the sequence. Therefore it is the proportion of positions at which the sequences differ. – The p-distance can also be consider the probability that the two sequences differ at a random position (site). CS369 2007 3 Modeling genetic change AACCTGTGCA AATCTGTGTA * * Seq1 seq2 CS369 2007 ATCCTGGGTT * * ** AATCTGTGTA ATCCTGGGTT ** * * 4 P-distance Seq1 seq2 AATCTGTGTA ATCCTGGGTT ** * * p-distance=0.4 proportion of # nt between two sequences Usually underestimate the true distance: genetic (or evolutionary) distance d CS369 2007 5 Multiple, parallel, and back-substitutions AACCTGTGCA AACCTGTGCA T A A C AACCAGTGAA * * CS369 2007 AACCTGTGCA T G A C ACCCGGTGAA * * 6 1 0.9 0.8 p-distance (p) 0.7 0.6 0.5 Relationship between p (observed) distance and d (genetic) distance 0.4 0.3 0.2 0.1 0 0 0.5 1 1.5 2 2.5 3 Genetic distance (d) CS369 2007 7 Transition probabilities • Definition: Let Pxy(t) be the probability that a nucleotide x evolves to a nucleotide y in time t. If x = y then this evolutionary pathway could involve 0, 2, 3 or more substitutions. If x y the the pathway could involve 1, 2, 3 or more substitutions. • P(t) is then a square transition probability matrix of size 4 by 4. CS369 2007 8 Modeling nucleotide substitutions as a timehomogeneous time-continuous stationary Markov process (1) i = A, C, G, T PGG(t) and PGA(t) Independent from i Markov property G t • PGG(t) PGA(t) G A At any given site in a sequence the rate of change from base i to base j is independent from the base that occupied that site prior i CS369 2007 9 Modeling nt substitutions as a time-homogeneous time-continuous stationary Markov process (2) • Homogeneity – Substitution rates do not change over time • Stationarity – The relative frequencies of A, C, G, and T (pA, pC, pG, pT) are at equilibrium, i.e. remain constant. CS369 2007 10 Models of DNA Substitution Simplest 1. Base frequencies are equal and all substitutions are equally likely (Jukes-Cantor) 2. Base frequencies are equal but transitions and transversions occur at different rates (Kimura 2 parameter) 3. Unequal base frequencies and transitions and transversions occur at different rates (Hasegawa-Kishino-Yano) Most complex CS369 2007 4. Unequal base frequencies and all substitution types occur at different rates (General Reversible Model) 11 The Q-matrix (instantaneous rate matrix) A C G T (ap C bp G cp T ) ap C bp G cp T gp A (gp A dp G ep T ) dp G ep T 1 Q hp A jp C (hp A jp C fp T ) fp T ip A kp C lp G (ip A kp C lp G ) pi frequency of nt i a, b, c, etc. relative rate parameters non-diagonal entries: rate flow from nucleotide i to nucleotide j diagonal entries: total rate flow that leaves nucleotide i (rate at which nt i disappear per site per sequence). scale factor so total output per unit time = 1.0 CS369 2007 12 The Q-matrix A * gp A Q hp A ip A CS369 2007 Qii Qij ji C G T ap C bp G * dp G jp C * kp C lp G cp T ep T fp T * A C G T total rate iQii i 13 General Time Reversible (GTR) Models A * ap A Q bp A cp A C G T ap C bp G * dp G dp C * ep C fp G cp T ep T fp T * p A p C p G p T Substitutions from nucleotide i to nucleotide j have the same rate of substitutions from nucleotide j to nucleotide i. In general: f = 1 and a, b, c, d, e are estimated from the data via maximum likelihood CS369 2007 14 Time-reversibility x z t 2 t equivalent x y y CS369 2007 15 Q-matrix for the Jukes and Cantor (JC) model p A p C p G p T 1/4 a b c d e f 1 3/4 CS369 2007 * 1/4 1/4 1/4 0.25 0.25 4 1/4 * 1/4 1/4 Q 0.25 3 1/4 1/4 * 1/4 0.25 1/4 1/4 1/4 * 16 Q-matrix for the Jukes and Cantor (JC) model * 1 Q 3 1 1 CS369 2007 1 1 1 * 1 1 1 * 1 1 1 * 0.25 0.25 0.25 0.25 17 Evolutionary meaning of the Q-matrix for the JC model 1/3 1/3 1/3 1/3 1/3 1/3 Q 1/3 1/3 1/3 1/3 1/3 1/3 Q i ii i = rate per unit time of nucleotide i (i =A, C, G, T) replacement during evolution: nt substitutions per sequence per site per unit time t = nt substitutions per site between two sequences that are separated by time t = d CS369 2007 18 Estimating transition probabilities • As soon as the Q matrix, and thus the evolutionary model, is specified, it is possible to calculate the probabilities of change from any base to any other during the evolutionary time t, P(t), by computing the matrix exponential P(t) exp(Qt) CS369 2007 19 Jukes and Cantor (JC) model solution By computing P(t)=exp(Qt) with Q according to the JC model 1 3 4 exp( t) 4 4 3 3 3 4 Pi j (t) exp( t) 4 4 3 Pi j (t) Pi=j(t) = probability of nt i to end up with the same character after time t Pij(t) = probability of nt i ending up as a different character after time t CS369 2007 20 Estimating the genetic distances(1) •The total probability of two sequences sharing the same nucleotide at a position is Pi=j(t) and therefore the probability of the two sequences being different, p = 1 - Pi=i(t) = Pij(t) p = 3/4 (1 - exp(-4/3t)) •An estimator of p is the observed proportion of different sites between two sequences ( p-distance). CS369 2007 21 Estimating the genetic distances(2) Solving for t we get t = - 3/4 ln (1- 4/3 p). Substituting t with d we finally obtain the Jukes-Cantor correction formula for the genetic distance d between two sequences: d = - 3/4 ln (1- 4/3 p) It can also be demonstrated that the variance V(d) will be given by V(d) = 9p(1-p)/(3-4p)2 CS369 2007 22 Calculating JC distance Seq1 seq2 p-distance AATCTGTGTA ATCCTGGGTT ** * * = 0.4 d (JC model) = - 3/4 ln [1- 4/3 (0.4)] = 0.5716 CS369 2007 23 Calculating JC distance AACCTGTGCA AATCTGTGTA * * p-distance ATCCTGGGTT * * ** = 0.4 d (JC model) = - 3/4 ln [1- 4/3 (0.4)] = 0.5716 CS369 2007 24 Q-matrix for the F81 model p A pC pG pT a b c d e f 1 1 (p A 2 p C 2 p G 2 p T 2 ) CS369 2007 * p A Q p A p A pC * pC pC p G p T p G p T * p T p G * 25 F81 model correction formula d ln(1 p/ ) •p = observed distance • When pA= pT= pC= pG=0.25, = 3/4, and the formula becomes equivalent to the one obtained for the JC model CS369 2007 26 Q-matrix for the Kimura-2p (K80) model p A p C p G p T 1/4 a c d f 1 b e Transversions Transitions 2 CS369 2007 * 1 1 1 * 1 Q 2 1 * 1 1 1 * 27 Nucleotide frequencies in HIV/SIV are at equilibrium: pol gene Average SEQUENCE COMPOSITION (HIV-O/HIV-M full pol) 5% chi-square test SE8538a passed 97TZ02a passed BOLO122b passed pA = 39.0% CAM1b passed NY5CGb passed pC = 16.6% 98IN022c passed pG = 22.8% 94IN112c passed 93IN101c passed pT = 21.6% VI850f passed X138g passed SE6165g passed Average VI991h passed Ti/Tv=2.6 SE9173j passed SE92809j passed MP535k passed 92UG001d passed HIVO passed CS369 2007 p-value 97.80% 94.59% 99.94% 96.73% 97.64% 99.44% 98.68% 99.61% 97.09% 86.61% 95.73% 98.23% 96.17% 96.50% 69.92% 86.20% 77.48% 28 Nucleotide frequencies in HIV/SIV are at equilibrium: env gene Average SEQUENCE COMPOSITION (SIV/HIV full envelope) pA = 34.5% pC = 17.4% pG = 23.4% pT = 24.7% Average Ti/Tv=1.5 CS369 2007 MVP5180 SIVcpzUS SIVcpzGAB 92UG037a 92UG975g 92RU131g 93IN905c 92BRO25c 92UG021d 92UG024d BSSG3b SFMHS20b 91TH652b MBC18R01b 5% chi-square test passed passed passed passed passed passed passed passed passed passed passed passed passed passed p-value 14.60% 48.09% 51.77% 84.58% 99.73% 97.45% 77.15% 59.51% 94.89% 92.60% 97.86% 92.40% 92.86% 99.59% 29 Q-matrix for the F84 model (very similar to the HKY85 model) p A pC pG pT a c d f 1 (Transversions) b 1 /(p A p G ) e 1 /(p C p T ) * pA Q [1 /(p A p G )]p A pA CS369 2007 (Transitions) pC [1 /(p A p G )]p G * pG pC [1 /(p C p T )]p C * pT [1 /(p C p T )]p T pT pG * 30 Nucleotide substitution patterns in HIV/SIV To A Average frequency of changes between states A C G T 346.2 697.4 290.3 123 320.8 SIV/HIV-1 envelope C 241.9 G 515.4 126.6 T 215.6 371 From 117.1 Transitions Transversions Average Ti/Tv=1.5 CS369 2007 144.6 31 More complex models… • More complex models, like Tamura-Nei (TN93), or the general time reversible (GTR) model usually requires numerical algorithms in order to calculate d. • Several software packages exist that can estimate genetic distances between nucleotide sequences according to different evolutionary models – – – – – – CS369 2007 MEGA3, PAUP*, PHYLIP, TREE-PUZZLE, DAMBE, Geneious 2.5.4 32 Estimating HIV genetic distances: env gene HIV-1B vs HIV-O/SIVcpz/HIV-1C full envelope p-distance JC69 K80 Tajima-Nei HIV-O 0.391 (.008) 0.552 (.018) 0.560 (.019) 0.572 (.019) SIVcpz 0.266 (.009) 0.337 (.009) 0.340 (.010) 0.427 (.013) HIV-1C 0.163 (.008) 0.184 (.008) 0.187 (.008) 0.189 (.008) CS369 2007 33 Estimating HIV genetic distances: pol gene HIV-1B vs HIV-O/HIV-1C full pol p-distance JC69 K80 Tajima-Nei HIV-O 0.257 (.007) 0.315 (.010) 0.318 (.011) 0.324 (.011) HIV-1C 0.103 (.005) 0.111 (.005) 0.113 (.006) 0.114 (.006) CS369 2007 34 When divergence is low p and d are linearly related 1 0.9 0.8 p-distance (p) 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.5 1 1.5 2 2.5 3 Genetic distance (d) CS369 2007 35 Conclusions • The genetic distance between two sequences can be estimated using a Markov model of DNA substitution. • Different models will estimate different genetic distances • We have focused on DNA models, but it is possible to consider models for proteins and models that take into account codons and the genetic code. • Markov model approaches to estimating genetic distance do not deal with indels, and presuppose an alignment • These models assume that all positions in a DNA sequence mutate at the same rate. We will talk about how to relax this assumption in later lectures. CS369 2007 36