Sistemática Teórica Hernán Dopazo Biomedical Genomics and Evolution Lab Lesson 02 Models of Sequence Evolution Facultad de Ciencias Exactas y Naturales Universidad de Buenos Aires Argentina 2013 Prof. Hernán Dopazo Lab. Genómica Biomédica y Evolución http://hdopazolab.com Models of Nucleotide Substitution Calculation of the distance between two sequences is perhaps the simplest phylogenetic analysis, yet it is important for two reasons. • • First, calculation of pairwise distances is the first step in distance-matrix methods of phylogeny reconstruction, which use cluster algorithms to convert a distance matrix into a phylogenetic tree. Second, Markov-process models of nucleotide substitution used in distance calculation form the basis of likelihood and Bayesian analysis of multiple sequences on a phylogeny. Evolutionary models are sets of assumptions about the process of nucleotide or amino-acid substitution ⎡ . ⎢ ⎢ qCT Q = qij = ⎢ q ⎢ AT ⎢ qGT ⎣ ( ) qTC qTA . qCA qAC . qGC qGA qTG ⎤ ⎥ qCG ⎥ qAG ⎥ ⎥ . ⎥ ⎦ Prof. Hernán Dopazo Lab. Genómica Biomédica y Evolución http://hdopazolab.com • The distance between two sequences is defined as: • • DNA substitution models the expected number of nucleotide substitutions per site. If the evolutionary rate is constant over time, the distance will increase linearly with the time of divergence. • A simplistic distance measure is the proportion of different sites, sometimes called the p distance. • If 10 sites are different between two sequences, each 100 nucleotides long, then p = 10% = 0.1 • This raw proportion works fine for very closely related sequences (p~5%) but is otherwise a clear underestimate of the number of substitutions that have occurred. • Sequences may saturate due to multiple changes (hits) at the same position after lineage splitting. • In the worst case, data may become random and all the phylogenetic information about relationships can be lost!!! ∧ p = 2 / 8 = 0.25....wrong d = 10 / 8 = 1.25....best Prof. Hernán Dopazo Lab. Genómica Biomédica y Evolución http://hdopazolab.com DNA substitution models Markov Models • To estimate the number of substitutions, we need a probabilistic model to describe changes between nucleotides. • Continuous-time Markov chains are commonly used for this purpose. • The nucleotide sites in the sequence are normally assumed to be evolving independently of each other. • Substitutions at any particular site are described by a Markov chain, with the four nucleotides to be the states of the chain. • The main feature of a Markov chain is that it has no memory: ‘given the present, the future does not depend on the past’ • In other words, the probability with which the chain jumps into other nucleotide states depends on the current state, but not on how the current state is reached. This is known as the Markovian property. • Adding further constraints on substitution rates between nucleotides we get different models. Prof. Hernán Dopazo Lab. Genómica Biomédica y Evolución http://hdopazolab.com DNA substitution models b A a d C c e G Ts f Tv T Ts Prof. Hernán Dopazo Lab. Genómica Biomédica y Evolución http://hdopazolab.com DNA substitution models b A a d C c e G Ts f Tv T Ts Prof. Hernán Dopazo Jukes & Cantor ’69 Lab. Genómica Biomédica y Evolución http://hdopazolab.com • It assumes that every nucleotide has the same rate λ of changing into any other nucleotide. • We use qij to denote the instantaneous rate of substitution from nucleotide i to nucleotide j, with i, j = T, C, A, or G. • Thus the substitution-rate matrix is: T C A G ⎡ −3a a a a ⎤ ⎢ ⎥ a −3a a a ⎥ ⎢ Q = qij = a −3a a ⎥ ⎢ a ⎢ a a a −3a ⎥⎦ ⎣ ( ) • A ... Sum = 0 ... Sum = 0 a a ... Sum = 0 ... Sum = 0 a C a a Note that qij Δt gives the probability that any given nucleotide i will change to a different nucleotide j in an infinitely small time interval Δt G Ts a Tv T Ts Prof. Hernán Dopazo Jukes & Cantor ’69 Lab. Genómica Biomédica y Evolución http://hdopazolab.com Imagine a long sequence with nucleotide i at every site; • t pij i Transition-probability matrix: ⎡ p0 (t) ⎢ p1 (t) Qt ⎢ P(t) = e = ⎢ p1 (t) ⎢ ⎣ p1 (t) p1 (t) p0 (t) p1 (t) p1 (t) p1 (t) p1 (t) p0 (t) p1 (t) j t>0 dt p1 (t)⎤ ... Sum = 1 ⎧ 1 3 −4 t ⎥ p (t) = + e p1 (t)⎥ , with ⎪ 0 4 4 ⎨ ! ⎪ p1 (t) = 1 − 1 e−4 t p1 (t)⎥ ⎩ 4 4 ⎥ p0 (t)⎦ ...what if t=0 ?... P(0)= Identity matrix: No evolution • pij is the probability that a given nucleotide i will become j time t later. • The matrix P(t) = { pij (t) } is known as transitional probability matrix. • Transition probabilities express the product of rates (a) and times (t) at the same time. • With no external information about either the time or the rate, we can estimate only the distance, but not time or rate individually. • The probability that the chain is in state j when t → ∞ is represented by πj and the distribution (πT, πC, πA, πG) is known as the limiting distribution of the chain = 1/4 Stationary Distribution No Phylogenetic Signal Prof. Hernán Dopazo Lab. Genómica Biomédica y Evolución http://hdopazolab.com • Transition probability, pij is the probability that a given nucleotide i will become j time t later. python from math import * def JC69 (m,t): a=0 b=1 x = 1/4.0 + 3/4.0 * exp(-4*m*t) y = 1/4.0 - 1/4.0 * exp(-4*m*t) a=x b=y matrixq = [[a,b,b,b],[b,a,b,b],[b,b,a,b],[b,b,b,a]] print matrixq ### m=0.02, t=1 JC69(0.02,1) ### m=0.2, t=1 JC69(0.2,1) ### m=0.8, t=1 JC69(0.8,1) ### m=1, t=1 JC69(1,1) ### m=3, t=1 JC69(3,1) Python Code >>> ### m=0.02, t=1 ... JC69(0.02,1) [[0.94233725978997684, 0.019220913403341061, 0.019220913403341061, 0.019220913403341061], [0.019220913403341061, 0.94233725978997684, 0.019220913403341061, 0.019220913403341061], [0.019220913403341061, 0.019220913403341061, 0.94233725978997684, 0.019220913403341061], [0.019220913403341061, 0.019220913403341061, 0.019220913403341061, 0.94233725978997684]] >>> >>> ### m=0.2, t=1 ... JC69(0.2,1) [[0.58699672308791617, 0.13766775897069461, 0.13766775897069461, 0.13766775897069461], [0.13766775897069461, 0.58699672308791617, 0.13766775897069461, 0.13766775897069461], [0.13766775897069461, 0.13766775897069461, 0.58699672308791617, 0.13766775897069461], [0.13766775897069461, 0.13766775897069461, 0.13766775897069461, 0.58699672308791617]] >>> >>> ### m=0.8, t=1 ... JC69(0.8,1) [[0.28057165298377468, 0.23980944900540846, 0.23980944900540846, 0.23980944900540846], [0.23980944900540846, 0.28057165298377468, 0.23980944900540846, 0.23980944900540846], [0.23980944900540846, 0.23980944900540846, 0.28057165298377468, 0.23980944900540846], [0.23980944900540846, 0.23980944900540846, 0.23980944900540846, 0.28057165298377468]] >>> >>> ### m=1, t=1 ... JC69(1,1) [[0.26373672916655061, 0.24542109027781644, 0.24542109027781644, 0.24542109027781644], [0.24542109027781644, 0.26373672916655061, 0.24542109027781644, 0.24542109027781644], [0.24542109027781644, 0.24542109027781644, 0.26373672916655061, 0.24542109027781644], [0.24542109027781644, 0.24542109027781644, 0.24542109027781644, 0.26373672916655061]] >>> >>> ### m=3, t=1 ... JC69(3,1) [[0.25000460815926501, 0.24999846394691166, 0.24999846394691166, 0.24999846394691166], [0.24999846394691166, 0.25000460815926501, 0.24999846394691166, 0.24999846394691166], [0.24999846394691166, 0.24999846394691166, 0.25000460815926501, 0.24999846394691166], [0.24999846394691166, 0.24999846394691166, 0.24999846394691166, 0.25000460815926501]] Prof. Hernán Dopazo Jukes & Cantor ’69 Lab. Genómica Biomédica y Evolución http://hdopazolab.com • How does the Markov-chain model correct for multiple hits and recover the hidden changes • The exponential equation accommodates all the possible paths the evolutionary process might have taken. • Qt e In particular, the transition probabilities for a Markov chain satisfy the equation, known as the Chapman– Kolmogorov theorem: • The transition probability from any nucleotide i to any nucleotide j over time t1 + t2 is a sum over all possible states k at any intermediate point t1 pij (t1 + t 2 ) = ∑ pik (t1 )pkj (t 2 ) k i t1 pik k t2 pkj j k = T, A, C, G Prof. Hernán Dopazo Jukes & Cantor ’69 Lab. Genómica Biomédica y Evolución http://hdopazolab.com • Now consider estimation of the distances between two sequences: • From the instantaneous-rate matrix the total substitution rate for any nucleotide is 3a • If the two sequences are separated by time t, for example, if they diverged from a common ancestor time t/2 ago, the distance between the two sequences will be: d = 3at • Suppose x out of n sites are different between the two sequences, so that the proportion of different sites is ∧ p = x/n • The probability that the nucleotide in a descendant sequence is different from the nucleotide in the ancestral sequence is: 3 3 −4 at 3 3 −4 d/3 p = 3p1 (t) = − e = − e 4 4 4 4 By equating this to the observed proportion pˆ we obtain the estimate of distance as: ∧ 4∧ d = −3 / 4 log(1− p ) 3 if p^ > 3/4 the method is innaplicable Prof. Hernán Dopazo Lab. Genómica Biomédica y Evolución http://hdopazolab.com Jukes & Cantor ’69 Prof. Hernán Dopazo Jukes & Cantor ’69 Lab. Genómica Biomédica y Evolución http://hdopazolab.com Jukes-Cantor distances Pii Pij https://www.desmos.com/calculator Prof. Hernán Dopazo Kimura ’80 Lab. Genómica Biomédica y Evolución http://hdopazolab.com • • In real data, transitions often occur at higher rates than transversions. A Kimura (1980) proposed a model that accounts for different transition and transversion rates. b • Let the substitution rates be a for transitions and b for transversions • Thus the substitution-rate matrix with two free parameters is: T C A • the simplest distance between two sequences separated by time t is: b C G ⎡ −(a + 2b) a b b ⎢ a −(a + 2b) b b ⎢ Q = qij = ⎢ b b −(a + 2b) a ⎢ b b a −(a + 2b) ⎢⎣ ( ) a ⎤ ⎥ ⎥ ⎥ ⎥ ⎥⎦ d = (a + 2b)t b a G Ts b Tv T Ts Prof. Hernán Dopazo Kimura ’80 Lab. Genómica Biomédica y Evolución http://hdopazolab.com • Note that at is the expected number of transitions per site and 2bt is the expected number of transversions per site. • It is more convenient to use the transition/transversion rate ratio: • The matrix of transition probabilities is obtained as: k = a /b ⎡ ⎢ ⎢ Qt P(t) = ( e ) = ⎢ ⎢ ⎢ ⎣ • where the three distinct elements of the matrix are: T C A p0 (t) p1 (t) p2 (t) p1 (t) p0 (t) p2 (t) p2 (t) p2 (t) p0 (t) p2 (t) p2 (t) p1 (t) Ts Tv ...what if t=0 ?... No evolution p0=1, p1=0, p2=0 G p2 (t) ⎤ ⎥ p2 (t) ⎥ p1 (t) ⎥ ⎥ p0 (t) ⎥ ⎦ Prof. Hernán Dopazo Python Code Lab. Genómica Biomédica y Evolución http://hdopazolab.com • Transition probability, pij is the probability that a given nucleotide i will become j time t later. >>... python from math import * def K80(a,b,t): x = 1/4.0 + 1/4.0 * exp (-4*b*t) + 1/2.0 * exp(-2*(a+b)*t) y = 1/4.0 + 1/4.0 * exp (-4*b*t) - 1/2.0 * exp(-2*(a+b)*t) z = 1/4.0 - 1/4.0 * exp (-4*b*t) p=x q=y r=z matrixs = [[p,q,r,r],[q,p,r,r],[r,r,p,q],[r,r,q,p]] print matrixs ### a=0.4, b=0.2, t=1 K80(0.4,0.2,1) ### a=0.8, b=0.4, t=1 K80(0.8,0.4,1) ### a=2, b=1, t=1 K80(2,1,1) ⎡ ⎢ ⎢ P(t) = ( eQt ) = ⎢ ⎢ ⎢ ⎣ p0 (t) p1 (t) p2 (t) p1 (t) p0 (t) p2 (t) p2 (t) p2 (t) p0 (t) p2 (t) p2 (t) p1 (t) p2 (t) ⎤ ⎥ p2 (t) ⎥ p1 (t) ⎥ ⎥ p0 (t) ⎥ ⎦ >>> ### a=0.4, b=0.2, t=1 ... K80(0.4,0.2,1) [[0.51292934698540638, 0.21173513507320438, 0.13766775897069461, 0.13766775897069461], [0.21173513507320438, 0.51292934698540638, 0.13766775897069461, 0.13766775897069461], [0.13766775897069461, 0.13766775897069461, 0.51292934698540638, 0.21173513507320438], [0.13766775897069461, 0.13766775897069461, 0.21173513507320438, 0.51292934698540638]] >>> >>> ### a=0.8, b=0.4, t=1 ... K80(0.8,0.4,1) [[0.34583310614337009, 0.25511515285395764, 0.19952587050133616, 0.19952587050133616], [0.25511515285395764, 0.34583310614337009, 0.19952587050133616, 0.19952587050133616], [0.19952587050133616, 0.19952587050133616, 0.34583310614337009, 0.25511515285395764], [0.19952587050133616, 0.19952587050133616, 0.25511515285395764, 0.34583310614337009]] >>> >>> ### a=2, b=1, t=1 ... K80(2,1,1) [[0.25581828581051674, 0.25333953363385037, 0.24542109027781644, 0.24542109027781644], [0.25333953363385037, 0.25581828581051674, 0.24542109027781644, 0.24542109027781644], [0.24542109027781644, 0.24542109027781644, 0.25581828581051674, 0.25333953363385037], [0.24542109027781644, 0.24542109027781644, 0.25333953363385037, 0.25581828581051674]] Stationary Distribution at 1/4 No Phylogenetic Signal Prof. Hernán Dopazo Kimura ’80 Lab. Genómica Biomédica y Evolución http://hdopazolab.com • The sequence data can be summarized as the proportions of sites with transitional and transversional differences. Let these be S and V , respectively. • Where: A G S V C T S Prof. Hernán Dopazo Kimura ’80 Lab. Genómica Biomédica y Evolución http://hdopazolab.com Pii K80 Distances Ts Tv Pii Ts K80 Distances Tv Prof. Hernán Dopazo Tamura-Nei ’93 Lab. Genómica Biomédica y Evolución http://hdopazolab.com • JC69 and K80 have symmetrical substitution rates, with qij = qji for all i and j. • Such Markov chains have πi = 1/4 for all i as the stationary distribution; that is, when the substitution process reaches equilibrium, the sequence will have equal proportions of the four nucleotides. • This assumption is unrealistic for virtually every real data set. • Now consider a model that accommodate unequal base compositions. • The model of Tamura and Nei (1993), referred to as TN93 • The substitution-rate matrix under the TN93 model is: T • C A b A a2 G C b G b a1 While parameters πT, πC, πA, πG are used to specify the substitution rates, they also give the stationary (equilibrium) distribution, with πY = πT + πC and πR = πA + πG to be the frequencies of pyrimidines and purines, respectively. R b Tv T Y Prof. Hernán Dopazo Lab. Genómica Biomédica y Evolución http://hdopazolab.com Tamura-Nei ’93 • After some hard algebra ... • When t increases from 0 to ∞, the diagonal element pjj (t) decreases from 1 to πj, while the off-diagonal element pij(t) increases from 0 to πj, with pij(∞) = πj, irrespective of the starting nucleotide i. • The limiting distribution (πT, πC, πA, πG) is also the stationary distribution. Prof. Hernán Dopazo Tamura-Nei ’93 Lab. Genómica Biomédica y Evolución http://hdopazolab.com • We now consider estimation of the sequence distance under the model. A G S1 V C T S2 Prof. Hernán Dopazo Lab. Genómica Biomédica y Evolución HKY ’93 & F84 http://hdopazolab.com • A special case of TN’93 is Hasegawa, Kishino and Yano, 1985 (HKY85) • The model is obtained by setting • Another special case of TN’93 is the F84 model derived by Joel Felsenstein. • The model is obtained by setting • If we assume or in the TN’93 model or we obtained the F81 model in the TN’93 model Prof. Hernán Dopazo Lab. Genómica Biomédica y Evolución http://hdopazolab.com DNA substitution models • For constructing phylogenetic trees, sophisticated distances is not necessarily more efficient • Indeed, sophisticated distances show always higher variances Prof. Hernán Dopazo Lab. Genómica Biomédica y Evolución http://hdopazolab.com • Corrected distances always give larger distances than the observed. DNA substitution models Prof. Hernán Dopazo Lab. Genómica Biomédica y Evolución http://hdopazolab.com DNA substitution models Distance correction methods share several assumptions: • All nucleotide sites change independently. • The substitution rate is constant over time and in different lineages • The base composition is at equilibrium (all sequences have the same base frequencies) • The conditional probabilities of nucleotide substitutions are the same for all sites and do not change over time. While these assumptions make the methods tractable, they are in many cases unrealistic. Prof. Hernán Dopazo Lab. Genómica Biomédica y Evolución DNA substitution models http://hdopazolab.com Distance correction methods share several assumptions: • All nucleotide sites change independently. • The substitution rate is constant over time and in different lineages • The base composition is at equilibrium (all sequences have the same base frequencies) • The conditional probabilities of nucleotide substitutions are the same for all sites and do not change over time. While these assumptions make the methods tractable, they are in many cases unrealistic. “Sobre el Rigor de la Ciencia.. J L Borges, 1960” En aquel Imperio, el Arte de la Cartografía logró tal Perfección que el Mapa de una sola Provincia ocupaba toda una Ciudad, y el Mapa del Imperio, toda una Provincia. Con el tiempo, estos Mapas Desmesurados no satisficieron y los Colegios de Cartógrafos levantaron un Mapa del Imperio, que tenía el Tamaño del Imperio y coincidía puntualmente con él. Menos Adictas al Estudio de la Cartografía, las Generaciones Siguientes entendieron que ese dilatado Mapa era Inútil y no sin Impiedad lo entregaron a las Inclemencias del Sol y los Inviernos. En los Desiertos del Oeste perduran despedazadas Ruinas del Mapa, habitadas por Animales y por Mendigos; en todo el País no hay otra reliquia de las Disciplinas Geográficas Prof. Hernán Dopazo Lab. Genómica Biomédica y Evolución http://hdopazolab.com Nested Models Prof. Hernán Dopazo Lab. Genómica Biomédica y Evolución http://hdopazolab.com Amino Acid substitution models Dayhoff at al (1979) developed a model of protein evolution that resulted in the development of a set of widely used replacement matrices. In the Dayhoff approach, • Replacement rates are derived from alignments of 71 protein sequences ~85% identical, • This ensures that the likelihood of a particular mutation (e.g., L → V) being the result of a set of successive mutations (e.g., L → x → y → V) is very low. • From this they compiled a “mutation pobability matrix” called PAM 001 (1%) that show the probabilities of change from one aa, to another, for different amounts of evolutionary change, • One of the main uses of the Dayhoff matrices has been in databases search methods, PAM50, PAM100, PAM250 corresponding to P(0.5), P(1) and P(2.5), respectivelly. • PAM250 is a widely used matrix wich result of raising PAM001 to the 250th power.. and the others correspondingly... Prof. Hernán Dopazo Lab. Genómica Biomédica y Evolución http://hdopazolab.com Amino Acid substitution models • Several later groups have attempted to extend Dayhoff’s methodology or re-apply her analysis using later databases with more examples. • Jones, et al. (1992) used the same methodology as Dayhoff but with modern databases and for membrane spanning proteins. The BLOSUM series of matrices were created by Henikoff & Henikoff (1992).Their features are: • Derived from local, ungapped alignments of distantly related sequences, • All matrices are directly calculated; no extrapolations are used, • The number of the matrix (BLOSUM62) refers to the minimum % identity of the blocks used to build the matrix; greater numbers, lesser distances, • The BLOSUM series of matrices generally perform better than PAM matrices for local similarity searches. • Specific matrices modeling mitochondrial proteins exists Prof. Hernán Dopazo Lab. Genómica Biomédica y Evolución http://hdopazolab.com • Variable Substitution Rates Across Sites In the DNA and Amino acid substitution models considered as far, the rate of substitution is assumed to be the same for all sites. This rarely holds, and rates varies from site to site, • In the case of protein coding genes this is obvious: 1, 2 and 3 positions. • In the case of RNA coding genes, secondary structure consisting in loops and stems have different substitutions rates. • One can accommodate the rate variation by assuming that rate r for any site is a random variable drawn from a statistical distribution. • Statistical analyses have suggested that the rate variation approximately follows the gamma (Γ) distribution !! Use: Rate classes Prof. Hernán Dopazo Lab. Genómica Biomédica y Evolución http://hdopazolab.com Variable Substitution Rates Across Sites • Low α values corresponds to large rate variation. • As α gets larger the rate of variation diminishes, until as α approaches ∞ all sites have the same substitution rate. • Models are labeled as JC+Γ, K80+Γ, HKY+Γ, PAM+Γ, etc. • Indeed models can be corrected by considering the proportion of invariable sites (I) and the nucleotide frequency (F): (JC+Γ+ I +F) ; (K80+Γ+I +F) ; (HKY+Γ+I ); mtREV+Γ+I, etc. Prof. Hernán Dopazo Lab. Genómica Biomédica y Evolución http://hdopazolab.com Full set with Gamma: JC : Jukes-Cantor (model 1) JC+G : Jukes-Cantor plus Gamma (model 3) F81 : Felsenstein 1981 (model 5) F81+G : Felsenstein 1981 plus Gamma (model 7) K80 : Kimura 2-parameter (model 9) K80+G : Kimura 2-parameter plus Gamma (model 11) HKY : Hasegawa-Kishino-Yano (model 13) HKY+G : Hasegawa-Kishino-Yano plus Gamma (model 15) TrNef : equal-frequency Tamura-Nei (model 17) TrNef+G : equal-frequency Tamura-Nei plus Gamma (model 19) TrN : Tamura-Nei (model 21) TrN+G : Tamura-Nei plus Gamma (model 23) K81 : Kimura 3-parameter (model 25) K81+G : Kimura 3-parameter plus Gamma (model 27) K81uf : unequal-frequency Kimura 3-parameter (model 29) K81uf+G : unequal-frequency Kimura 3-parameter plus Gamma (model 31) TIMef : equal-frequency Transition Model (model 33) TIMef+G : equal-frequency Transition Model plus Gamma (model 35) TIM : Transition Model (model 37) TIM+G : Transition Model plus Gamma (model 39) TVMef : equal-frequency Transversion Model (model 41) TVMef+G : equal-frequency Transversion Model plus Gamma (model 43) TVM : Transversion Model (model 45) TVM+G : Transversion Model plus Gamma (model 47) SYM : Symmetrical Model (model 49) SYM+G : Symmetrical Model plus Gamma (model 51) GTR : General Time Reversible (model 53) GTR+G : General Time Reversible plus Gamma (model 55) Full Set of Models Prof. Hernán Dopazo Lab. Genómica Biomédica y Evolución http://hdopazolab.com DNA substitution models References • • Jukes TH and Cantor CR (1969). Evolution of Protein Molecules. New York: Academic Press. pp. 21–132. • Felsenstein J (1981). "Evolutionary trees from DNA sequences: a maximum likelihood approach". Journal of Molecular Evolution 17 (6): 368–376. • Hasegawa M, Kishino H, Yano T (1985). "Dating of human-ape splitting by a molecular clock of mitochondrial DNA". Journal of Molecular Evolution 22 (2): 160–174.Kishino H, Hasegawa M (1989). "Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in hominoidea". Journal of Molecular Evolution 29 (2): 170–179. • Tamura K, Nei M (1993). "Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees". Molecular Biology and Evolution 10 (3): 512–526. • Tavaré S (1986). "Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences". Lectures on Mathematics in the Life Sciences (American Mathematical Society) 17: 57–86. • Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. A model of evolutionary change in proteins. In Atlas of protein sequence and structure, volume 5, pages 345–358. M. O. Dayhoff, National biomedical research foundation, Washington DC., 1978 • • Henikoff S. and J. G. Henikoff. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A, 89:10915–10919, 1992. • • • Adachi J and M. Hasegawa. Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol, 42:459–468, 1996 • • Muller T, and M. Vingron. Modeling amino acid replacement. J Comput Biol, 7:761–776, 2000. Kimura M (1980). "A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences". Journal of Molecular Evolution 16 (2): 111–120. Jones, D. T., W. R. Taylor, and J. M. Thornton. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci, 8:275–282, 1992 Lio, P and N. Goldman. Modeling mitochondrial protein evolution using structural information. J Mol Evol, 54:519–529, 2002. Whelan, S. and N. Goldman. A general empirical model of protein evolution derived from multiple protein families using a maximumlikelihood approach. Mol Biol Evol, 18:691–699, 2001 Yang, Z. Among-site variation and its inpact on phylogenetic analises. TREE, 11:367–371, 1996