# t - Biomedical Genomics & Evolution Lab ```Sistemática Teórica
Hernán Dopazo
Biomedical Genomics and Evolution Lab
Lesson 02
Models of Sequence Evolution
Facultad de Ciencias Exactas y Naturales
Argentina
2013
Prof. Hernán Dopazo
Lab. Genómica Biomédica y Evolución
http://hdopazolab.com
Models of Nucleotide Substitution
Calculation of the distance between two sequences is perhaps the simplest phylogenetic analysis, yet it is important for two reasons.
• First, calculation of pairwise distances is the first step in distance-matrix methods of phylogeny reconstruction, which use cluster algorithms to convert a distance matrix into a phylogenetic tree.
• Second, Markov-process models of nucleotide substitution used in distance calculation form the basis of likelihood and Bayesian analysis of multiple sequences on a phylogeny.
Evolutionary models describe the process of nucleotide or amino-acid substitution through a matrix of substitution rates (rows and columns ordered T, C, A, G; diagonal entries shown as dots):
$$
Q = (q_{ij}) =
\begin{bmatrix}
\cdot & q_{TC} & q_{TA} & q_{TG} \\
q_{CT} & \cdot & q_{CA} & q_{CG} \\
q_{AT} & q_{AC} & \cdot & q_{AG} \\
q_{GT} & q_{GC} & q_{GA} & \cdot
\end{bmatrix}
$$
DNA substitution models
• The distance between two sequences is defined as the expected number of nucleotide substitutions per site.
• If the evolutionary rate is constant over time, the distance will increase linearly with the time of divergence.
• A simplistic distance measure is the proportion of different sites, sometimes called the p distance.
• If 10 sites are different between two sequences, each 100 nucleotides long, then p = 10% = 0.1.
• This raw proportion works fine for very closely related sequences (p of about 5% or less) but is otherwise a clear underestimate of the number of substitutions that have occurred.
• Sequences may saturate due to multiple changes (hits) at the same position after lineage splitting.
• In the worst case, data may become random and all the phylogenetic information about relationships can be lost.
Example from the figure: $\hat{p} = 2/8 = 0.25$ (wrong); $d = 10/8 = 1.25$ (best).
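The p distance described on this slide can be sketched in a few lines. This is only an illustration written for Python 3: the function name `p_distance` and the two example sequences are hypothetical, and the sequences are assumed to be already aligned, without gaps.

```python
def p_distance(seq1, seq2):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    differences = sum(1 for x, y in zip(seq1, seq2) if x != y)
    return differences / float(len(seq1))


# Hypothetical 8-site alignment with 2 observed differences.
seq_a = "ACGTACGT"
seq_b = "ACGTTCGA"
print(p_distance(seq_a, seq_b))  # 2 differing sites out of 8
```

The function only counts the differences that are still visible, which is why it underestimates the true number of substitutions once multiple hits accumulate.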
DNA substitution models
Markov Models
• To estimate the number of substitutions, we need a probabilistic model to describe changes between nucleotides.
• Continuous-time Markov chains are commonly used for this purpose.
• The nucleotide sites in the sequence are normally assumed to be evolving independently of each other.
• Substitutions at any particular site are described by a Markov chain, with the four nucleotides as the states of the chain.
• The main feature of a Markov chain is that it has no memory: ‘given the present, the future does not depend on the past’.
• In other words, the probability with which the chain jumps into other nucleotide states depends on the current state, but not on how the current state was reached. This is known as the Markovian property.
• Adding further constraints on the substitution rates between nucleotides, we get different models.
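A rough way to see the memoryless behaviour is to simulate a single site. The sketch below is written for Python 3 and assumes, purely for illustration, that every nucleotide changes into each of the other three at the same rate; the function name `simulate_site` and all parameter values are hypothetical.

```python
import random

NUCLEOTIDES = "TCAG"


def simulate_site(start, rate, t, rng):
    """Simulate one site for time t; return (final state, number of substitutions).

    Assumption for illustration: each nucleotide changes into each of the
    other three at the same rate, so the total rate of leaving a state is 3*rate.
    """
    state, elapsed, hits = start, 0.0, 0
    while True:
        # The waiting time to the next change depends only on the current state,
        # never on the path taken to reach it (Markovian property).
        elapsed += rng.expovariate(3.0 * rate)
        if elapsed > t:
            return state, hits
        state = rng.choice([n for n in NUCLEOTIDES if n != state])
        hits += 1


rng = random.Random(1)
runs = [simulate_site("A", 0.1, 1.0, rng) for _ in range(20000)]
mean_hits = sum(h for _, h in runs) / float(len(runs))
print(mean_hits)  # should be close to 3 * 0.1 * 1.0 = 0.3
```

The average number of substitutions per site approximates the distance, even though some of those substitutions hit the same site more than once and leave no visible trace.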
DNA substitution models
[Figure: substitution diagram linking the four nucleotides A, C, G and T with six rates labelled a to f; transitions (Ts) and transversions (Tv) are marked.]
Jukes & Cantor ’69
• It assumes that every nucleotide has the same rate a (often written λ) of changing into any other nucleotide.
• We use $q_{ij}$ to denote the instantaneous rate of substitution from nucleotide i to nucleotide j, with i, j = T, C, A, or G.
• Thus the substitution-rate matrix, with rows and columns ordered T, C, A, G, is:
$$
Q = (q_{ij}) =
\begin{bmatrix}
-3a & a & a & a \\
a & -3a & a & a \\
a & a & -3a & a \\
a & a & a & -3a
\end{bmatrix}
$$
Each row sums to 0.
• Note that $q_{ij}\,\Delta t$ gives the probability that any given nucleotide i will change to a different nucleotide j in an infinitely small time interval $\Delta t$.
[Figure: the four nucleotides A, C, G and T, with every transition (Ts) and transversion (Tv) occurring at rate a.]
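The rate matrix above is easy to build and check numerically. The following is a minimal sketch for Python 3; the helper name `jc69_rate_matrix` and the value a = 0.02 are illustrative assumptions, not part of the slides.

```python
NUCLEOTIDES = "TCAG"


def jc69_rate_matrix(a):
    """Return the 4x4 JC69 rate matrix Q as a list of rows (order T, C, A, G)."""
    return [[-3.0 * a if i == j else a for j in range(4)] for i in range(4)]


Q = jc69_rate_matrix(0.02)
for name, row in zip(NUCLEOTIDES, Q):
    print(name, row, "row sum =", sum(row))

# Over a short interval dt, q_ij * dt approximates the probability of i -> j.
dt = 0.001
print(Q[0][1] * dt)
```

Each diagonal entry is minus the sum of the other entries in its row, which is what makes the rows sum to zero (up to floating-point rounding).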
Jukes & Cantor ’69
• Imagine a long sequence with nucleotide i at every site; after every site has evolved for a time t > 0, the proportion of sites showing nucleotide j is $p_{ij}(t)$.
Transition-probability matrix:
$$
P(t) = e^{Qt} =
\begin{bmatrix}
p_0(t) & p_1(t) & p_1(t) & p_1(t) \\
p_1(t) & p_0(t) & p_1(t) & p_1(t) \\
p_1(t) & p_1(t) & p_0(t) & p_1(t) \\
p_1(t) & p_1(t) & p_1(t) & p_0(t)
\end{bmatrix},
\quad \text{with} \quad
\begin{cases}
p_0(t) = \frac{1}{4} + \frac{3}{4} e^{-4at} \\
p_1(t) = \frac{1}{4} - \frac{1}{4} e^{-4at}
\end{cases}
$$
Each row sums to 1.
What if t = 0? P(0) is the identity matrix: no evolution.
• $p_{ij}(t)$ is the probability that a given nucleotide i will become j time t later.
• The matrix $P(t) = \{p_{ij}(t)\}$ is known as the transition-probability matrix.
• Transition probabilities depend on the rate a and the time t only through their product at.
• With no external information about either the time or the rate, we can estimate only the distance, but not time or rate individually.
• The probability that the chain is in state j when t → ∞ is represented by $\pi_j$, and the distribution $(\pi_T, \pi_C, \pi_A, \pi_G)$ is known as the limiting distribution of the chain. Under JC69, $\pi_j = 1/4$ for every j.
Once the chain reaches the stationary distribution, no phylogenetic signal remains.
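One way to convince oneself that the closed-form expressions really are the matrix exponential of Qt is to compare them with a truncated Taylor series. This is only a numerical sketch for Python 3: the helpers `mat_mul` and `mat_exp` and the values a = 0.2, t = 1 are assumptions made for illustration, and a truncated series is adequate only because the entries of Qt are small here.

```python
from math import exp


def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(M, terms=60):
    """Approximate e^M with a truncated Taylor series: sum of M^k / k!."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # After this step, term holds M^k / k!.
        term = [[x / k for x in row] for row in mat_mul(term, M)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


a, t = 0.2, 1.0
Qt = [[(-3.0 * a if i == j else a) * t for j in range(4)] for i in range(4)]
P = mat_exp(Qt)
p0 = 0.25 + 0.75 * exp(-4 * a * t)
p1 = 0.25 - 0.25 * exp(-4 * a * t)
print(P[0][0], p0)  # diagonal element: series versus closed form
print(P[0][1], p1)  # off-diagonal element: series versus closed form
```

Passing a matrix of zeros to `mat_exp` returns the identity matrix, which is the "no evolution" case P(0).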
Python Code
• The transition probability $p_{ij}(t)$ is the probability that a given nucleotide i will become j time t later.
python
from math import exp

def JC69(m, t):
    # m is the substitution rate (a in the rate matrix), t is the elapsed time
    x = 1/4.0 + 3/4.0 * exp(-4*m*t)  # p0(t): nucleotide unchanged after time t
    y = 1/4.0 - 1/4.0 * exp(-4*m*t)  # p1(t): change to one specific other nucleotide
    a = x
    b = y
    matrixq = [[a,b,b,b],[b,a,b,b],[b,b,a,b],[b,b,b,a]]
    print(matrixq)  # single-argument form runs under both Python 2 and Python 3

### m=0.02, t=1
JC69(0.02,1)
### m=0.2, t=1
JC69(0.2,1)
### m=0.8, t=1
JC69(0.8,1)
### m=1, t=1
JC69(1,1)
### m=3, t=1
JC69(3,1)
>>> ### m=0.02, t=1
... JC69(0.02,1)
[[0.94233725978997684, 0.019220913403341061, 0.019220913403341061, 0.019220913403341061],
[0.019220913403341061, 0.94233725978997684, 0.019220913403341061, 0.019220913403341061],
[0.019220913403341061, 0.019220913403341061, 0.94233725978997684, 0.019220913403341061],
[0.019220913403341061, 0.019220913403341061, 0.019220913403341061, 0.94233725978997684]]
>>>
>>> ### m=0.2, t=1
... JC69(0.2,1)
[[0.58699672308791617, 0.13766775897069461, 0.13766775897069461, 0.13766775897069461],
[0.13766775897069461, 0.58699672308791617, 0.13766775897069461, 0.13766775897069461],
[0.13766775897069461, 0.13766775897069461, 0.58699672308791617, 0.13766775897069461],
[0.13766775897069461, 0.13766775897069461, 0.13766775897069461, 0.58699672308791617]]
>>>
>>> ### m=0.8, t=1
... JC69(0.8,1)
[[0.28057165298377468, 0.23980944900540846, 0.23980944900540846, 0.23980944900540846],
[0.23980944900540846, 0.28057165298377468, 0.23980944900540846, 0.23980944900540846],
[0.23980944900540846, 0.23980944900540846, 0.28057165298377468, 0.23980944900540846],
[0.23980944900540846, 0.23980944900540846, 0.23980944900540846, 0.28057165298377468]]
>>>
>>> ### m=1, t=1
... JC69(1,1)
[[0.26373672916655061, 0.24542109027781644, 0.24542109027781644, 0.24542109027781644],
[0.24542109027781644, 0.26373672916655061, 0.24542109027781644, 0.24542109027781644],
[0.24542109027781644, 0.24542109027781644, 0.26373672916655061, 0.24542109027781644],
[0.24542109027781644, 0.24542109027781644, 0.24542109027781644, 0.26373672916655061]]
>>>
>>> ### m=3, t=1
... JC69(3,1)
[[0.25000460815926501, 0.24999846394691166, 0.24999846394691166, 0.24999846394691166],
[0.24999846394691166, 0.25000460815926501, 0.24999846394691166, 0.24999846394691166],
[0.24999846394691166, 0.24999846394691166, 0.25000460815926501, 0.24999846394691166],
[0.24999846394691166, 0.24999846394691166, 0.24999846394691166, 0.25000460815926501]]
Jukes & Cantor ’69
• How does the Markov-chain model correct for multiple hits and recover the hidden changes?
• The exponential $e^{Qt}$ accommodates all the possible paths the evolutionary process might have taken.
• In particular, the transition probabilities of a Markov chain satisfy the following equation, known as the Chapman–Kolmogorov theorem:
$$
p_{ij}(t_1 + t_2) = \sum_{k} p_{ik}(t_1)\, p_{kj}(t_2), \qquad k = T, C, A, G
$$
• The transition probability from any nucleotide i to any nucleotide j over time $t_1 + t_2$ is a sum over all possible states k at the intermediate time point $t_1$.
[Figure: path from i to j through an intermediate state k, with probability $p_{ik}$ over $t_1$ and $p_{kj}$ over $t_2$.]
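The Chapman–Kolmogorov relation can be checked numerically for JC69 by multiplying two transition matrices. This is a small sketch for Python 3; the helper names `jc69_P` and `mat_mul` and the values of a, t1 and t2 are arbitrary choices made for illustration.

```python
from math import exp


def jc69_P(a, t):
    """JC69 transition-probability matrix P(t) for rate a."""
    p0 = 0.25 + 0.75 * exp(-4 * a * t)
    p1 = 0.25 - 0.25 * exp(-4 * a * t)
    return [[p0 if i == j else p1 for j in range(4)] for i in range(4)]


def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


a, t1, t2 = 0.2, 0.3, 0.7
direct = jc69_P(a, t1 + t2)                     # p_ij(t1 + t2)
via_k = mat_mul(jc69_P(a, t1), jc69_P(a, t2))   # sum over k of p_ik(t1) * p_kj(t2)
max_gap = max(abs(direct[i][j] - via_k[i][j]) for i in range(4) for j in range(4))
print(max_gap)  # difference is at the level of floating-point rounding
```

The matrix product is exactly the sum over intermediate states k, so agreement between the two matrices is the theorem in numerical form.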
Jukes & Cantor ’69
• Now consider estimation of the distance between two sequences.
• From the instantaneous-rate matrix, the total substitution rate for any nucleotide is 3a.
• If the two sequences are separated by time t (for example, if they diverged from a common ancestor time t/2 ago), the distance between the two sequences will be:
$$
d = 3at
$$
• Suppose x out of n sites are different between the two sequences, so that the proportion of different sites is:
$$
\hat{p} = x/n
$$
• The probability that the nucleotide in a descendant sequence is different from the nucleotide in the ancestral sequence is:
$$
p = 3p_1(t) = \frac{3}{4} - \frac{3}{4} e^{-4at} = \frac{3}{4} - \frac{3}{4} e^{-4d/3}
$$
• By equating this to the observed proportion $\hat{p}$, we obtain the estimate of distance as:
$$
\hat{d} = -\frac{3}{4} \log\left(1 - \frac{4}{3}\hat{p}\right)
$$
• If $\hat{p} \ge 3/4$ the method is inapplicable, because the logarithm is undefined.
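The distance formula translates directly into code. The sketch below is written for Python 3; the function name `jc69_distance` and the sample values of the observed proportion are hypothetical.

```python
from math import log


def jc69_distance(p_hat):
    """JC69 distance estimate from the observed proportion of differing sites."""
    if not 0.0 <= p_hat < 0.75:
        raise ValueError("JC69 distance is undefined for p_hat >= 3/4")
    return -0.75 * log(1.0 - 4.0 * p_hat / 3.0)


for p_hat in (0.01, 0.1, 0.25, 0.5, 0.7):
    print(p_hat, round(jc69_distance(p_hat), 4))
```

For small values the corrected distance is close to the raw proportion, and the gap widens quickly as the proportion approaches 3/4.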
Jukes & Cantor ’69
[Figure: Jukes-Cantor distances, with curves for $P_{ii}$ and $P_{ij}$; plotted with https://www.desmos.com/calculator]
Kimura ’80
• In real data, transitions often occur at higher rates than transversions.
• Kimura (1980) proposed a model that accounts for different transition and transversion rates.
• Let the substitution rates be a for transitions and b for transversions.
• Thus the substitution-rate matrix with two free parameters (rows and columns ordered T, C, A, G) is:
$$
Q = (q_{ij}) =
\begin{bmatrix}
-(a + 2b) & a & b & b \\
a & -(a + 2b) & b & b \\
b & b & -(a + 2b) & a \\
b & b & a & -(a + 2b)
\end{bmatrix}
$$
• The total substitution rate for any nucleotide is a + 2b, so the distance between two sequences separated by time t is:
$$
d = (a + 2b)t
$$
[Figure: the four nucleotides A, G, C and T, with transitions (Ts) at rate a and transversions (Tv) at rate b.]
Kimura ’80
• Note that at is the expected number of transitions per site and 2bt is the expected number of transversions per site.
• It is more convenient to use the transition/transversion rate ratio:
$$
k = a/b
$$
• The matrix of transition probabilities (rows and columns ordered T, C, A, G) is obtained as:
$$
P(t) = e^{Qt} =
\begin{bmatrix}
p_0(t) & p_1(t) & p_2(t) & p_2(t) \\
p_1(t) & p_0(t) & p_2(t) & p_2(t) \\
p_2(t) & p_2(t) & p_0(t) & p_1(t) \\
p_2(t) & p_2(t) & p_1(t) & p_0(t)
\end{bmatrix}
$$
• where the three distinct elements of the matrix are $p_0$ (no change), $p_1$ (transition, Ts) and $p_2$ (transversion, Tv):
$$
\begin{aligned}
p_0(t) &= \frac{1}{4} + \frac{1}{4} e^{-4bt} + \frac{1}{2} e^{-2(a+b)t} \\
p_1(t) &= \frac{1}{4} + \frac{1}{4} e^{-4bt} - \frac{1}{2} e^{-2(a+b)t} \\
p_2(t) &= \frac{1}{4} - \frac{1}{4} e^{-4bt}
\end{aligned}
$$
What if t = 0? No evolution: $p_0 = 1$, $p_1 = 0$, $p_2 = 0$.
Python Code
• The transition probability $p_{ij}(t)$ is the probability that a given nucleotide i will become j time t later.
python
from math import exp

def K80(a, b, t):
    # a: transition rate, b: transversion rate, t: elapsed time
    x = 1/4.0 + 1/4.0 * exp(-4*b*t) + 1/2.0 * exp(-2*(a+b)*t)  # p0(t): no change
    y = 1/4.0 + 1/4.0 * exp(-4*b*t) - 1/2.0 * exp(-2*(a+b)*t)  # p1(t): transition
    z = 1/4.0 - 1/4.0 * exp(-4*b*t)                            # p2(t): transversion
    p = x
    q = y
    r = z
    matrixs = [[p,q,r,r],[q,p,r,r],[r,r,p,q],[r,r,q,p]]
    print(matrixs)  # single-argument form runs under both Python 2 and Python 3

### a=0.4, b=0.2, t=1
K80(0.4,0.2,1)
### a=0.8, b=0.4, t=1
K80(0.8,0.4,1)
### a=2, b=1, t=1
K80(2,1,1)
>>> ### a=0.4, b=0.2, t=1
... K80(0.4,0.2,1)
[[0.51292934698540638, 0.21173513507320438, 0.13766775897069461, 0.13766775897069461],
[0.21173513507320438, 0.51292934698540638, 0.13766775897069461, 0.13766775897069461],
[0.13766775897069461, 0.13766775897069461, 0.51292934698540638, 0.21173513507320438],
[0.13766775897069461, 0.13766775897069461, 0.21173513507320438, 0.51292934698540638]]
>>>
>>> ### a=0.8, b=0.4, t=1
... K80(0.8,0.4,1)
[[0.34583310614337009, 0.25511515285395764, 0.19952587050133616, 0.19952587050133616],
[0.25511515285395764, 0.34583310614337009, 0.19952587050133616, 0.19952587050133616],
[0.19952587050133616, 0.19952587050133616, 0.34583310614337009, 0.25511515285395764],
[0.19952587050133616, 0.19952587050133616, 0.25511515285395764, 0.34583310614337009]]
>>>
>>> ### a=2, b=1, t=1
... K80(2,1,1)
[[0.25581828581051674, 0.25333953363385037, 0.24542109027781644, 0.24542109027781644],
[0.25333953363385037, 0.25581828581051674, 0.24542109027781644, 0.24542109027781644],
[0.24542109027781644, 0.24542109027781644, 0.25581828581051674, 0.25333953363385037],
[0.24542109027781644, 0.24542109027781644, 0.25333953363385037, 0.25581828581051674]]
Stationary Distribution at 1/4
No Phylogenetic Signal
Kimura ’80
• The sequence data can be summarized as the proportions of sites with transitional and transversional differences. Let these be S and V, respectively.
[Figure: the four nucleotides A, G, C and T; S marks transitional differences (A↔G and C↔T) and V marks transversional differences.]
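As an illustration of how S and V are used, the sketch below counts both proportions and applies the standard K80 estimate, d = −½ log(1 − 2S − V) − ¼ log(1 − 2V). That formula is the commonly cited one from Kimura (1980) and is supplied here as background, since no distance formula is preserved on this slide. The code is for Python 3; the helper names and the example sequences are hypothetical, and the sequences are assumed to contain only A, C, G and T.

```python
from math import log

PURINES = set("AG")
PYRIMIDINES = set("CT")


def count_s_v(seq1, seq2):
    """Return (S, V): proportions of transitional and transversional differences."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    ts = tv = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        # A difference within purines or within pyrimidines is a transition.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            ts += 1
        else:
            tv += 1
    n = float(len(seq1))
    return ts / n, tv / n


def k80_distance(S, V):
    """K80 distance from the proportions of transitions (S) and transversions (V)."""
    return -0.5 * log(1.0 - 2.0 * S - V) - 0.25 * log(1.0 - 2.0 * V)


# Hypothetical 10-site alignment: one transition (A/G) and one transversion (C/A).
S, V = count_s_v("ACGTACGTAC", "GCGTACGTAA")
print(S, V, k80_distance(S, V))
```

Substituting the expected values of S and V under K80 into the formula returns (a + 2b)t, the distance defined earlier for this model.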
Kimura ’80
[Figure: K80 distances, two panels with curves for $P_{ii}$, transitions (Ts) and transversions (Tv).]
Tamura-Nei ’93
• JC69 and K80 have symmetrical substitution rates, with $q_{ij} = q_{ji}$ for all i and j.
• Such Markov chains have $\pi_i = 1/4$ for all i as the stationary distribution; that is, when the substitution process reaches equilibrium, the sequence will have equal proportions of the four nucleotides.
• This assumption is unrealistic for virtually every real data set.
• Now consider a model that accommodates unequal base compositions: the model of Tamura and Nei (1993), referred to as TN93.
• The substitution-rate matrix under the TN93 model is:
[Figure: TN93 rate matrix and diagram; transitions between the pyrimidines (Y: T and C) occur at rate a1, transitions between the purines (R: A and G) at rate a2, and transversions (Tv) at rate b.]
• While parameters $\pi_T, \pi_C, \pi_A, \pi_G$ are used to specify the substitution rates, they also give the stationary (equilibrium) distribution, with $\pi_Y = \pi_T + \pi_C$ and $\pi_R = \pi_A + \pi_G$ being the frequencies of pyrimidines and purines, respectively.
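The following sketch builds a TN93 rate matrix in its commonly cited form, in which each off-diagonal rate is a rate parameter multiplied by the frequency of the target nucleotide, and then checks that the base frequencies are stationary. The layout of the matrix is supplied here as background rather than recovered from the slide, and the function name `tn93_rate_matrix` and all numerical values are hypothetical. Written for Python 3.

```python
def tn93_rate_matrix(pi, a1, a2, b):
    """TN93 rate matrix, rows and columns ordered T, C, A, G.

    pi: frequencies (pi_T, pi_C, pi_A, pi_G)
    a1: T<->C transition rate, a2: A<->G transition rate, b: transversion rate
    """
    pT, pC, pA, pG = pi
    Q = [
        [0.0, a1 * pC, b * pA, b * pG],
        [a1 * pT, 0.0, b * pA, b * pG],
        [b * pT, b * pC, 0.0, a2 * pG],
        [b * pT, b * pC, a2 * pA, 0.0],
    ]
    for i in range(4):
        Q[i][i] = -sum(Q[i])  # the diagonal makes each row sum to zero
    return Q


pi = (0.1, 0.2, 0.3, 0.4)
Q = tn93_rate_matrix(pi, a1=2.0, a2=3.0, b=1.0)

# Stationarity check: the row vector pi times Q should be the zero vector.
pi_Q = [sum(pi[i] * Q[i][j] for i in range(4)) for j in range(4)]
print(pi_Q)
```

Unlike JC69 and K80, this matrix is not symmetrical when the base frequencies differ, yet the frequencies used to build it are still left unchanged by the process.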
Tamura-Nei ’93
• After some hard algebra, the transition-probability matrix $P(t)$ is obtained.
• When t increases from 0 to ∞, the diagonal element $p_{jj}(t)$ decreases from 1 to $\pi_j$, while the off-diagonal element $p_{ij}(t)$ increases from 0 to $\pi_j$, with $p_{ij}(\infty) = \pi_j$, irrespective of the starting nucleotide i.
• The limiting distribution $(\pi_T, \pi_C, \pi_A, \pi_G)$ is also the stationary distribution.
Tamura-Nei ’93
• We now consider estimation of the sequence distance under the model.
[Figure: the four nucleotides A, G, C and T; S1 and S2 mark the two kinds of transitional differences and V marks transversional differences.]
HKY ’85 & F84
• A special case of TN93 is the model of Hasegawa, Kishino and Yano (1985), referred to as HKY85.
• The model is obtained by setting $a_1 = a_2$ in the TN93 model.
• Another special case of TN93 is the F84 model, derived by Joel Felsenstein.
• The model is obtained by setting $a_1 = (1 + \kappa/\pi_Y)\,b$ and $a_2 = (1 + \kappa/\pi_R)\,b$ in the TN93 model, with $\kappa$ a single extra parameter.
• If we assume $a_1 = a_2 = b$ in the TN93 model, we obtain the F81 model.
DNA substitution models
• For constructing phylogenetic trees, sophisticated distances are not necessarily more efficient.
• Indeed, sophisticated distances typically show higher variances.
DNA substitution models
• Corrected distances are always larger than the observed distance.
DNA substitution models
Distance correction methods share several assumptions:
• All nucleotide sites change independently.
• The substitution rate is constant over time and in different lineages.
• The base composition is at equilibrium (all sequences have the same base frequencies).
• The conditional probabilities of nucleotide substitutions are the same for all sites and do not change over time.
While these assumptions make the methods tractable, they are in many cases unrealistic.
“On Rigor in Science”, J. L. Borges, 1960
In that Empire, the Art of Cartography attained such Perfection that the Map of a single Province occupied an entire City, and the Map of the Empire, an entire Province. In time, these Disproportionate Maps no longer satisfied, and the Colleges of Cartographers drew up a Map of the Empire that had the Size of the Empire and coincided with it point for point. Less Devoted to the Study of Cartography, the Following Generations understood that this vast Map was Useless and, not without Impiety, abandoned it to the Inclemencies of the Sun and the Winters. In the [...] by Animals and by Beggars; in the whole Country there is no other relic of the Geographical Disciplines.
Nested Models
Amino Acid substitution models
Dayhoff et al. (1978) developed a model of protein evolution that resulted in a set of widely used replacement matrices. In the Dayhoff approach:
• Replacement rates are derived from alignments of 71 groups of closely related protein sequences, about 85% identical or more.
• This ensures that the likelihood of a particular mutation (e.g., L → V) being the result of a set of successive mutations (e.g., L → x → y → V) is very low.
• From this they compiled a “mutation probability matrix” called PAM001 (1%), which shows the probabilities of change from one amino acid to another; matrices for other amounts of evolutionary change are derived from it.
• One of the main uses of the Dayhoff matrices has been in database search methods; PAM50, PAM100 and PAM250 correspond to P(0.5), P(1) and P(2.5), respectively.
• PAM250 is a widely used matrix that results from raising PAM001 to the 250th power, and the others are obtained correspondingly.
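The step from PAM001 to PAM250 is plain matrix exponentiation. The sketch below shows the idea on a made-up 3-state matrix; the numbers are NOT real PAM data, and the helper names `mat_mul` and `mat_power` are hypothetical. Written for Python 3.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(M, n):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in M]
    while n > 0:
        if n % 2 == 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n //= 2
    return result


# Hypothetical 3-state "1% change" matrix (each row sums to 1); not real PAM data.
toy_pam1 = [
    [0.990, 0.007, 0.003],
    [0.004, 0.990, 0.006],
    [0.002, 0.008, 0.990],
]
toy_pam250 = mat_power(toy_pam1, 250)
for row in toy_pam250:
    print([round(x, 4) for x in row], "row sum =", round(sum(row), 6))
```

Each row of the result is still a probability distribution, but the diagonal has dropped well below 0.99 because many successive changes are now folded into a single matrix.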
Amino Acid substitution models
• Several later groups have attempted to extend Dayhoff’s methodology or re-apply her analysis using later databases with more examples.
• Jones et al. (1992) used the same methodology as Dayhoff but with modern databases and for membrane-spanning proteins.
The BLOSUM series of matrices was created by Henikoff & Henikoff (1992). Their features are:
• Derived from local, ungapped alignments of distantly related sequences.
• All matrices are directly calculated; no extrapolations are used.
• The number of the matrix (e.g. BLOSUM62) refers to the minimum % identity of the blocks used to build the matrix; greater numbers correspond to smaller distances.
• The BLOSUM series of matrices generally performs better than PAM matrices for local similarity searches.
• Specific matrices modeling mitochondrial proteins exist.
Variable Substitution Rates Across Sites
• In the DNA and amino acid substitution models considered so far, the rate of substitution is assumed to be the same for all sites. This rarely holds, and rates vary from site to site.
• In the case of protein-coding genes this is obvious: the first, second and third codon positions evolve at different rates.
• In the case of RNA-coding genes, the loops and stems of the secondary structure have different substitution rates.
• One can accommodate the rate variation by assuming that the rate r for any site is a random variable drawn from a statistical distribution.
• Statistical analyses have suggested that the rate variation approximately follows the gamma (Γ) distribution.
• In practice: use rate classes.
Variable Substitution Rates Across Sites
• Low α values correspond to large rate variation.
• As α gets larger the rate variation diminishes, until, as α approaches ∞, all sites have the same substitution rate.
• Models are labeled as JC+Γ, K80+Γ, HKY+Γ, PAM+Γ, etc.
• Models can also be extended by considering the proportion of invariable sites (I) and the nucleotide frequencies (F): JC+Γ+I+F, K80+Γ+I+F, HKY+Γ+I, mtREV+Γ+I, etc.
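To illustrate the effect of α, the sketch below compares the plain JC69 distance with its gamma-corrected counterpart, using the commonly cited formula d = (3/4) α [(1 − 4p/3)^(−1/α) − 1] for rates drawn from a gamma distribution with mean 1. The formula is supplied here as background rather than taken from the slides, and the function names and the value p = 0.3 are hypothetical. Written for Python 3.

```python
from math import log


def jc69_distance(p_hat):
    """JC69 distance with the same rate at every site."""
    return -0.75 * log(1.0 - 4.0 * p_hat / 3.0)


def jc69_gamma_distance(p_hat, alpha):
    """JC69 distance with gamma-distributed rates across sites (shape alpha, mean 1)."""
    if not 0.0 <= p_hat < 0.75:
        raise ValueError("distance is undefined for p_hat >= 3/4")
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    return 0.75 * alpha * ((1.0 - 4.0 * p_hat / 3.0) ** (-1.0 / alpha) - 1.0)


p_hat = 0.3
for alpha in (0.5, 1.0, 5.0, 1000.0):
    print(alpha, round(jc69_gamma_distance(p_hat, alpha), 4))
print("no rate variation:", round(jc69_distance(p_hat), 4))
```

Small α (strong rate variation) gives a much larger corrected distance for the same observed proportion, and the gamma distance approaches the plain JC69 value as α grows.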
Full set with Gamma:
JC : Jukes-Cantor (model 1)
JC+G : Jukes-Cantor plus Gamma (model 3)
F81 : Felsenstein 1981 (model 5)
F81+G : Felsenstein 1981 plus Gamma (model 7)
K80 : Kimura 2-parameter (model 9)
K80+G : Kimura 2-parameter plus Gamma (model 11)
HKY : Hasegawa-Kishino-Yano (model 13)
HKY+G : Hasegawa-Kishino-Yano plus Gamma (model 15)
TrNef : equal-frequency Tamura-Nei (model 17)
TrNef+G : equal-frequency Tamura-Nei plus Gamma (model 19)
TrN : Tamura-Nei (model 21)
TrN+G : Tamura-Nei plus Gamma (model 23)
K81 : Kimura 3-parameter (model 25)
K81+G : Kimura 3-parameter plus Gamma (model 27)
K81uf : unequal-frequency Kimura 3-parameter (model 29)
K81uf+G : unequal-frequency Kimura 3-parameter plus Gamma (model 31)
TIMef : equal-frequency Transition Model (model 33)
TIMef+G : equal-frequency Transition Model plus Gamma (model 35)
TIM : Transition Model (model 37)
TIM+G : Transition Model plus Gamma (model 39)
TVMef : equal-frequency Transversion Model (model 41)
TVMef+G : equal-frequency Transversion Model plus Gamma (model 43)
TVM : Transversion Model (model 45)
TVM+G : Transversion Model plus Gamma (model 47)
SYM : Symmetrical Model (model 49)
SYM+G : Symmetrical Model plus Gamma (model 51)
GTR : General Time Reversible (model 53)
GTR+G : General Time Reversible plus Gamma (model 55)
Full Set of Models
DNA substitution models
References
• Jukes TH, Cantor CR (1969). Evolution of Protein Molecules. New York: Academic Press. pp. 21–132.
• Kimura M (1980). "A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences". Journal of Molecular Evolution 16 (2): 111–120.
• Felsenstein J (1981). "Evolutionary trees from DNA sequences: a maximum likelihood approach". Journal of Molecular Evolution 17 (6): 368–376.
• Hasegawa M, Kishino H, Yano T (1985). "Dating of human-ape splitting by a molecular clock of mitochondrial DNA". Journal of Molecular Evolution 22 (2): 160–174.
• Kishino H, Hasegawa M (1989). "Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in hominoidea". Journal of Molecular Evolution 29 (2): 170–179.
• Tamura K, Nei M (1993). "Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees". Molecular Biology and Evolution 10 (3): 512–526.
• Tavaré S (1986). "Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences". Lectures on Mathematics in the Life Sciences (American Mathematical Society) 17: 57–86.
• Dayhoff MO, Schwartz RM, Orcutt BC (1978). "A model of evolutionary change in proteins". In: Dayhoff MO (ed.), Atlas of Protein Sequence and Structure, volume 5, pp. 345–358. National Biomedical Research Foundation, Washington DC.
• Henikoff S, Henikoff JG (1992). "Amino acid substitution matrices from protein blocks". Proc Natl Acad Sci U S A 89: 10915–10919.
• Jones DT, Taylor WR, Thornton JM (1992). "The rapid generation of mutation data matrices from protein sequences". Comput Appl Biosci 8: 275–282.
• Adachi J, Hasegawa M (1996). "Model of amino acid substitution in proteins encoded by mitochondrial DNA". J Mol Evol 42: 459–468.
• Yang Z (1996). "Among-site rate variation and its impact on phylogenetic analyses". TREE 11: 367–371.
• Muller T, Vingron M (2000). "Modeling amino acid replacement". J Comput Biol 7: 761–776.
• Whelan S, Goldman N (2001). "A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach". Mol Biol Evol 18: 691–699.
• Lio P, Goldman N (2002). "Modeling mitochondrial protein evolution using structural information". J Mol Evol 54: 519–529.
```