# Document ```Lecture 3: Markov models of
sequence evolution
Alexei Drummond
Friday quiz: How many bacterial cells
are there in an average adult human?
A) $10^{12}$ (1 trillion)
B) $10^{13}$ (10 trillion)
C) $10^{14}$ (100 trillion)
D) $10^{15}$ (1000 trillion)
Hint: There are about $10^{14}$ human cells in the
CS369 2007
Modeling genetic change
• Given two or more aligned nucleotide or amino acid sequences, usually the first goal is to calculate some measure of sequence similarity (or, conversely, distance).
• The simplest way to estimate genetic distance is the p-distance (the number of differences between two sequences divided by the sequence length).
– The p-distance is the Hamming distance normalized by the length of the sequence. Therefore it is the proportion of positions at which the sequences differ.
– The p-distance can also be considered the probability that the two sequences differ at a random position (site).
Modeling genetic change
(* marks a site at which the two sequences of a pair differ)
      AACCTGTGCA
Seq1  AATCTGTGTA
        *     *
      AACCTGTGCA
Seq2  ATCCTGGGTT
       *    * **
Seq1  AATCTGTGTA
Seq2  ATCCTGGGTT
       **   *  *
P-distance
Seq1  AATCTGTGTA
Seq2  ATCCTGGGTT
       **   *  *
p-distance = 4/10 = 0.4 (the proportion of nucleotide sites that differ between the two sequences)
The p-distance usually underestimates the true genetic (or evolutionary) distance d.
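The p-distance calculation can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function name `p_distance` is hypothetical, and it assumes the two sequences are already aligned, have the same length, and contain no gaps.

```python
def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    # Hamming distance: count the positions where the characters differ.
    mismatches = sum(1 for x, y in zip(seq1, seq2) if x != y)
    # Normalize by the sequence length to get a proportion.
    return mismatches / len(seq1)


# The two sequences from the slide differ at 4 of 10 sites.
print(p_distance("AATCTGTGTA", "ATCCTGGGTT"))
```

For the slide's pair of sequences this gives 0.4.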
Multiple, parallel, and back-substitutions
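[Figure: the sequence AACCTGTGCA evolves along two lineages into AACCAGTGAA and ACCCGGTGAA. Intermediate substitutions are drawn along each lineage, illustrating multiple substitutions at one site, parallel substitutions in both lineages, and back-substitutions; the two resulting sequences differ at only 2 of 10 sites.]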
[Figure: plot of p-distance (p), vertical axis from 0 to 1, against genetic distance (d), horizontal axis from 0 to 3.]
Relationship between p (observed) distance and d (genetic) distance
Transition probabilities
• Definition: Let $P_{xy}(t)$ be the probability that a nucleotide x evolves to a nucleotide y in time t. If x = y, this evolutionary pathway could involve 0, 2, 3, or more substitutions. If x ≠ y, the pathway could involve 1, 2, 3, or more substitutions.
• $P(t)$ is then a square transition probability matrix of size 4 by 4.
Modeling nucleotide substitutions as a time-homogeneous time-continuous stationary Markov process (1)
• Markov property
– At any given site in a sequence, the rate of change from base i to base j is independent of the base that occupied that site prior to i.
– Example: for a site currently occupied by G, $P_{GG}(t)$ and $P_{GA}(t)$ do not depend on which base (A, C, G, or T) occupied the site before G.
Modeling nt substitutions as a time-homogeneous
time-continuous stationary Markov process (2)
• Homogeneity
– Substitution rates do not change over time
• Stationarity
– The relative frequencies of A, C, G, and T ($\pi_A$, $\pi_C$, $\pi_G$, $\pi_T$) are at equilibrium, i.e. remain constant.
Models of DNA Substitution
Simplest
1. Base frequencies are equal and all substitutions are equally likely (Jukes-Cantor)
2. Base frequencies are equal but transitions and transversions occur at different rates (Kimura 2 parameter)
3. Unequal base frequencies and transitions and transversions occur at different rates (Hasegawa-Kishino-Yano)
4. Unequal base frequencies and all substitution types occur at different rates (General Reversible Model)
Most complex
The Q-matrix (instantaneous rate
matrix)
Rows and columns are ordered A, C, G, T:
$$
Q = \frac{1}{\mu}
\begin{pmatrix}
-(a\pi_C + b\pi_G + c\pi_T) & a\pi_C & b\pi_G & c\pi_T \\
g\pi_A & -(g\pi_A + d\pi_G + e\pi_T) & d\pi_G & e\pi_T \\
h\pi_A & j\pi_C & -(h\pi_A + j\pi_C + f\pi_T) & f\pi_T \\
i\pi_A & k\pi_C & l\pi_G & -(i\pi_A + k\pi_C + l\pi_G)
\end{pmatrix}
$$
• $\pi_i$: frequency of nt i
• a, b, c, etc.: relative rate parameters
• Non-diagonal entries: rate flow from nucleotide i to nucleotide j
• Diagonal entries: total rate flow that leaves nucleotide i (rate at which nt i disappears per site per sequence)
• $\mu$: scale factor so total output per unit time = 1.0
The Q-matrix
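Rows and columns are ordered A, C, G, T:
$$
Q =
\begin{pmatrix}
* & a\pi_C & b\pi_G & c\pi_T \\
g\pi_A & * & d\pi_G & e\pi_T \\
h\pi_A & j\pi_C & * & f\pi_T \\
i\pi_A & k\pi_C & l\pi_G & *
\end{pmatrix}
$$
Each diagonal entry (*) is minus the sum of the other entries in its row:
$$Q_{ii} = -\sum_{j \neq i} Q_{ij}$$
$$\text{total rate } \mu = -\sum_i \pi_i Q_{ii}$$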
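The construction of the Q-matrix can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the lecture's own code: the function name `build_q` and the nested-list layout are hypothetical, rows and columns are assumed to be ordered A, C, G, T, and the matrix is rescaled by dividing every entry by the total rate so that the total rate becomes 1.

```python
def build_q(rates, freqs):
    """Build a rate matrix scaled to a total rate of 1.

    rates[i][j] is the relative rate parameter for i -> j (diagonal ignored).
    freqs[j] is the equilibrium frequency of nucleotide j.
    Returns (scaled_q, mu), where mu is the total rate before scaling.
    """
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Off-diagonal entry: relative rate times target frequency.
                q[i][j] = rates[i][j] * freqs[j]
        # The diagonal is still 0.0 here, so sum(q[i]) is the off-diagonal sum.
        q[i][i] = -sum(q[i])
    # Total rate of the unscaled matrix.
    mu = -sum(freqs[i] * q[i][i] for i in range(n))
    scaled = [[x / mu for x in row] for row in q]
    return scaled, mu


# Example: all relative rates equal to 1 and equal base frequencies.
rates = [[1.0] * 4 for _ in range(4)]
freqs = [0.25] * 4
q, mu = build_q(rates, freqs)
print(mu)
for row in q:
    print([round(x, 4) for x in row])
```

With equal rates and equal frequencies the unscaled total rate is 3/4, and the scaled matrix has -1 on the diagonal and 1/3 elsewhere.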
General Time Reversible (GTR) Models
Rows and columns are ordered A, C, G, T, with equilibrium frequencies $\pi = (\pi_A, \pi_C, \pi_G, \pi_T)$:
$$
Q =
\begin{pmatrix}
* & a\pi_C & b\pi_G & c\pi_T \\
a\pi_A & * & d\pi_G & e\pi_T \\
b\pi_A & d\pi_C & * & f\pi_T \\
c\pi_A & e\pi_C & f\pi_G & *
\end{pmatrix}
$$
• Substitutions from nucleotide i to nucleotide j have the same relative rate parameter as substitutions from nucleotide j to nucleotide i, so that $\pi_i Q_{ij} = \pi_j Q_{ji}$.
• In general: f = 1 and a, b, c, d, e are estimated from the data via maximum likelihood.
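The symmetry of the GTR rate parameters can be checked numerically. The sketch below is illustrative only: the function names and the rate values are hypothetical, the base frequencies are example values, and reversibility is tested in the standard detailed-balance form, i.e. the flow from i to j at equilibrium equals the flow from j to i.

```python
def gtr_q(a, b, c, d, e, f, freqs):
    """Unscaled GTR rate matrix, rows and columns ordered A, C, G, T."""
    # Symmetric table of relative rate parameters.
    r = [
        [0.0, a, b, c],
        [a, 0.0, d, e],
        [b, d, 0.0, f],
        [c, e, f, 0.0],
    ]
    q = [[r[i][j] * freqs[j] for j in range(4)] for i in range(4)]
    for i in range(4):
        # Diagonal entry: minus the sum of the off-diagonal entries in the row.
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q


def is_reversible(q, freqs, tol=1e-12):
    """Check detailed balance: freqs[i] * q[i][j] == freqs[j] * q[j][i]."""
    return all(
        abs(freqs[i] * q[i][j] - freqs[j] * q[j][i]) <= tol
        for i in range(4)
        for j in range(4)
    )


freqs = [0.390, 0.166, 0.228, 0.216]  # example base frequencies (A, C, G, T)
q = gtr_q(1.2, 4.5, 0.8, 0.9, 5.1, 1.0, freqs)  # hypothetical rate values, f = 1
print(is_reversible(q, freqs))
```

Because the rate parameters are shared between the two directions of each substitution type, the check succeeds for any choice of a to f and any frequencies.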
Time-reversibility
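[Figure: two sequences x and y, each separated from a common sequence z by time t, are equivalent under a time-reversible model to x evolving directly into y over time 2t.]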
Q-matrix for the Jukes and Cantor (JC) model
$$\pi_A = \pi_C = \pi_G = \pi_T = 1/4$$
$$a = b = c = d = e = f = 1$$
$$\mu = 3/4$$
$$
Q = \frac{4}{3}
\begin{pmatrix}
* & 1/4 & 1/4 & 1/4 \\
1/4 & * & 1/4 & 1/4 \\
1/4 & 1/4 & * & 1/4 \\
1/4 & 1/4 & 1/4 & *
\end{pmatrix},
\qquad
\pi = (0.25, 0.25, 0.25, 0.25)
$$
Q-matrix for the Jukes and Cantor (JC) model
*

 1
Q
3 1

1

CS369 2007
1 1 1

* 1 1
1 * 1

1 1 *
0.25


0.25


0.25


0.25

17
Evolutionary meaning of the Q-matrix for the JC model
$$
Q =
\begin{pmatrix}
-1 & 1/3 & 1/3 & 1/3 \\
1/3 & -1 & 1/3 & 1/3 \\
1/3 & 1/3 & -1 & 1/3 \\
1/3 & 1/3 & 1/3 & -1
\end{pmatrix}
$$
$$\mu = -\sum_i \pi_i Q_{ii}$$
• $\mu$ = rate per unit time of nucleotide i (i = A, C, G, T) replacement during evolution: nt substitutions per site per sequence per unit time
• $\mu t$ = nt substitutions per site between two sequences that are separated by time t = d
Estimating transition probabilities
• As soon as the Q matrix, and thus the evolutionary model, is specified, it is possible to calculate the probabilities of change from any base to any other during the evolutionary time t, $P(t)$, by computing the matrix exponential
$$P(t) = \exp(Qt)$$
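The matrix exponential can be computed numerically. The following is a minimal pure-Python sketch, assuming a small matrix and using scaling and squaring with a truncated Taylor series; the function names and the default parameters are hypothetical choices for illustration. In practice a library routine such as `scipy.linalg.expm` would normally be used instead.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_exp(m, squarings=10, terms=20):
    """Approximate exp(m) by scaling and squaring with a Taylor series."""
    n = len(m)
    scale = 2 ** squarings
    small = [[x / scale for x in row] for row in m]  # m / 2^squarings
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # current Taylor term, starts at identity
    for k in range(1, terms + 1):
        term = mat_mul(term, small)
        term = [[x / k for x in row] for row in term]  # now small^k / k!
        result = [
            [result[i][j] + term[i][j] for j in range(n)] for i in range(n)
        ]
    for _ in range(squarings):
        result = mat_mul(result, result)  # undo the scaling by repeated squaring
    return result


def transition_matrix(q, t):
    """P(t) = exp(Q t) for a rate matrix q and a time t."""
    return mat_exp([[x * t for x in row] for row in q])


# Example: a rate matrix with -1 on the diagonal and 1/3 elsewhere.
q = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
p = transition_matrix(q, 0.5)
for row in p:
    print([round(x, 4) for x in row], "row sum:", round(sum(row), 6))
```

Each row of the resulting matrix sums to 1, as expected for transition probabilities.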
Jukes and Cantor (JC) model solution
By computing $P(t) = \exp(Qt)$ with Q according to the JC model:
$$P_{i=j}(t) = \frac{1}{4} + \frac{3}{4}\exp\left(-\frac{4}{3}t\right)$$
$$P_{i \neq j}(t) = \frac{3}{4} - \frac{3}{4}\exp\left(-\frac{4}{3}t\right)$$
• $P_{i=j}(t)$ = probability of nt i ending up as the same character after time t
• $P_{i \neq j}(t)$ = probability of nt i ending up as a different character after time t
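The closed-form JC solution is easy to evaluate directly. The sketch below is an illustration with hypothetical function names; it shows how the two probabilities behave at t = 0 and for large t.

```python
import math


def jc_same(t: float) -> float:
    """Probability that a site shows the same nucleotide after time t (JC)."""
    return 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)


def jc_diff(t: float) -> float:
    """Probability that a site shows a different nucleotide after time t (JC)."""
    return 0.75 - 0.75 * math.exp(-4.0 * t / 3.0)


for t in (0.0, 0.1, 1.0, 10.0):
    print(t, round(jc_same(t), 4), round(jc_diff(t), 4))
```

At t = 0 the site is certainly unchanged; as t grows, the probability of observing a different nucleotide approaches 3/4, which is why the observed p-distance saturates.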
Estimating the genetic distances (1)
• The total probability of two sequences sharing the same nucleotide at a position is $P_{i=j}(t)$, and therefore the probability p of the two sequences being different is
$$p = 1 - P_{i=j}(t) = P_{i \neq j}(t)$$
$$p = \frac{3}{4}\left(1 - \exp\left(-\frac{4}{3}t\right)\right)$$
• An estimator of p is the observed proportion of different sites between two sequences (the p-distance).
Estimating the genetic distances (2)
Solving for t we get $t = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$. Substituting t with d, we finally obtain the Jukes-Cantor correction formula for the genetic distance d between two sequences:
$$d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$$
It can also be demonstrated that the variance V(d) is given by
$$V(d) = \frac{9p(1-p)}{(3-4p)^2\, n}$$
where n is the number of sites compared (the sequence length).
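The Jukes-Cantor correction can be sketched as two small functions. The names are hypothetical. The variance function takes the number of aligned sites n as an argument, using the large-sample form $V(d) = 9p(1-p)/((3-4p)^2 n)$. The correction is undefined for p at or above 3/4, so the sketch rejects such inputs.

```python
import math


def jc_distance(p: float) -> float:
    """Jukes-Cantor corrected distance from an observed p-distance."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75 for the JC correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc_variance(p: float, n: int) -> float:
    """Large-sample variance of the JC distance for n aligned sites."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 9.0 * p * (1.0 - p) / ((3.0 - 4.0 * p) ** 2 * n)


p = 0.4   # observed p-distance
n = 10    # number of aligned sites
print(round(jc_distance(p), 4))
print(round(jc_variance(p, n), 4))
```

For p = 0.4 the corrected distance is 0.5716, noticeably larger than the observed proportion of differences.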
Calculating JC distance
Seq1  AATCTGTGTA
Seq2  ATCCTGGGTT
       **   *  *
p-distance = 0.4
$$d \text{ (JC model)} = -\frac{3}{4}\ln\left[1 - \frac{4}{3}(0.4)\right] = 0.5716$$
Calculating JC distance
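(* marks a site at which the two sequences of a pair differ)
      AACCTGTGCA
Seq1  AATCTGTGTA
        *     *
      AACCTGTGCA
Seq2  ATCCTGGGTT
       *    * **
p-distance (Seq1 vs Seq2) = 0.4
$$d \text{ (JC model)} = -\frac{3}{4}\ln\left[1 - \frac{4}{3}(0.4)\right] = 0.5716$$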
Q-matrix for the F81 model
$$\pi_A \neq \pi_C \neq \pi_G \neq \pi_T$$
$$a = b = c = d = e = f = 1$$
$$\mu = 1 - (\pi_A^2 + \pi_C^2 + \pi_G^2 + \pi_T^2)$$
$$
Q =
\begin{pmatrix}
* & \pi_C & \pi_G & \pi_T \\
\pi_A & * & \pi_G & \pi_T \\
\pi_A & \pi_C & * & \pi_T \\
\pi_A & \pi_C & \pi_G & *
\end{pmatrix}
$$
F81 model correction formula
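$$d = -\mu \ln(1 - p/\mu)$$
• p = observed distance
• $\mu = 1 - (\pi_A^2 + \pi_C^2 + \pi_G^2 + \pi_T^2)$, as defined for the F81 Q-matrix
• When $\pi_A = \pi_T = \pi_C = \pi_G = 0.25$, $\mu = 3/4$, and the formula becomes equivalent to the one obtained for the JC model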
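The F81 correction can be sketched in the same way as the JC correction. The function name and the variable `scale` (the factor 1 minus the sum of squared base frequencies) are hypothetical names chosen for this illustration, and the unequal frequencies in the example are illustrative values only.

```python
import math


def f81_distance(p: float, freqs) -> float:
    """F81 corrected distance from an observed p-distance and base frequencies."""
    # Scale factor: 1 minus the sum of squared base frequencies.
    scale = 1.0 - sum(f * f for f in freqs)
    if not 0.0 <= p < scale:
        raise ValueError("p must satisfy 0 <= p < 1 - sum(freq^2)")
    return -scale * math.log(1.0 - p / scale)


# With equal frequencies the scale factor is 3/4 and the result matches JC.
print(round(f81_distance(0.4, [0.25, 0.25, 0.25, 0.25]), 4))

# With unequal (illustrative) frequencies the scale factor is below 3/4.
print(round(f81_distance(0.4, [0.390, 0.166, 0.228, 0.216]), 4))
```

Unequal base frequencies lower the scale factor, so for the same observed p-distance the corrected distance is somewhat larger than under the JC model.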
Q-matrix for the Kimura-2p (K80) model
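$$\pi_A = \pi_C = \pi_G = \pi_T = 1/4$$
$$a = c = d = f = 1 \quad \text{(transversions)}$$
$$b = e = \kappa \quad \text{(transitions)}$$
$$\mu = \kappa + 2$$
$$
Q = \frac{1}{\kappa + 2}
\begin{pmatrix}
* & 1 & \kappa & 1 \\
1 & * & 1 & \kappa \\
\kappa & 1 & * & 1 \\
1 & \kappa & 1 & *
\end{pmatrix}
$$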
Nucleotide frequencies in HIV/SIV are at
equilibrium: pol gene
Average sequence composition (HIV-O/HIV-M full pol): $\pi_A$ = 39.0%, $\pi_C$ = 16.6%, $\pi_G$ = 22.8%, $\pi_T$ = 21.6%
Average Ti/Tv = 2.6
Sequence    5% chi-square test    p-value
SE8538a     passed                97.80%
97TZ02a     passed                94.59%
BOLO122b    passed                99.94%
CAM1b       passed                96.73%
NY5CGb      passed                97.64%
98IN022c    passed                99.44%
94IN112c    passed                98.68%
93IN101c    passed                99.61%
VI850f      passed                97.09%
X138g       passed                86.61%
SE6165g     passed                95.73%
VI991h      passed                98.23%
SE9173j     passed                96.17%
SE92809j    passed                96.50%
MP535k      passed                69.92%
92UG001d    passed                86.20%
HIVO        passed                77.48%
Nucleotide frequencies in HIV/SIV are at
equilibrium: env gene
Average sequence composition (SIV/HIV full envelope): $\pi_A$ = 34.5%, $\pi_C$ = 17.4%, $\pi_G$ = 23.4%, $\pi_T$ = 24.7%
Average Ti/Tv = 1.5
Sequence     5% chi-square test    p-value
MVP5180      passed                14.60%
SIVcpzUS     passed                48.09%
SIVcpzGAB    passed                51.77%
92UG037a     passed                84.58%
92UG975g     passed                99.73%
92RU131g     passed                97.45%
93IN905c     passed                77.15%
92BRO25c     passed                59.51%
92UG021d     passed                94.89%
92UG024d     passed                92.60%
BSSG3b       passed                97.86%
SFMHS20b     passed                92.40%
91TH652b     passed                92.86%
MBC18R01b    passed                99.59%

Q-matrix for the F84 model
(very similar to the HKY85 model)
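$$\pi_A \neq \pi_C \neq \pi_G \neq \pi_T$$
$$a = c = d = f = 1 \quad \text{(transversions)}$$
$$b = 1 + \kappa/(\pi_A + \pi_G), \qquad e = 1 + \kappa/(\pi_C + \pi_T) \quad \text{(transitions)}$$
$$
Q =
\begin{pmatrix}
* & \pi_C & [1 + \kappa/(\pi_A + \pi_G)]\pi_G & \pi_T \\
\pi_A & * & \pi_G & [1 + \kappa/(\pi_C + \pi_T)]\pi_T \\
[1 + \kappa/(\pi_A + \pi_G)]\pi_A & \pi_C & * & \pi_T \\
\pi_A & [1 + \kappa/(\pi_C + \pi_T)]\pi_C & \pi_G & *
\end{pmatrix}
$$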
Nucleotide substitution patterns in HIV/SIV
SIV/HIV-1 envelope: average frequency of changes between states (rows: from; columns: to)
From \ To    A        C        G        T
A            -        346.2    697.4    290.3
C            241.9    -        123      320.8
G            515.4    126.6    -        144.6
T            215.6    371      117.1    -
Transitions are A↔G and C↔T; all other changes are transversions.
Average Ti/Tv = 1.5
More complex models…
• More complex models, like Tamura-Nei (TN93) or the general time reversible (GTR) model, usually require numerical algorithms in order to calculate d.
• Several software packages exist that can estimate genetic distances between nucleotide sequences according to different evolutionary models:
– MEGA3
– PAUP*
– PHYLIP
– TREE-PUZZLE
– DAMBE
– Geneious 2.5.4
Estimating HIV genetic distances: env gene
HIV-1B vs HIV-O/SIVcpz/HIV-1C, full envelope
          p-distance      JC69            K80             Tajima-Nei
HIV-O     0.391 (.008)    0.552 (.018)    0.560 (.019)    0.572 (.019)
SIVcpz    0.266 (.009)    0.337 (.009)    0.340 (.010)    0.427 (.013)
HIV-1C    0.163 (.008)    0.184 (.008)    0.187 (.008)    0.189 (.008)
Estimating HIV genetic distances: pol gene
HIV-1B vs HIV-O/HIV-1C, full pol
          p-distance      JC69            K80             Tajima-Nei
HIV-O     0.257 (.007)    0.315 (.010)    0.318 (.011)    0.324 (.011)
HIV-1C    0.103 (.005)    0.111 (.005)    0.113 (.006)    0.114 (.006)
When divergence is low p and d are linearly related
[Figure: plot of p-distance (p), vertical axis from 0 to 1, against genetic distance (d), horizontal axis from 0 to 3.]
Conclusions
• The genetic distance between two sequences can be
estimated using a Markov model of DNA substitution.
• Different models will estimate different genetic distances.
• We have focused on DNA models, but it is possible to
consider models for proteins and models that take into
account codons and the genetic code.
• Markov model approaches to estimating genetic distance do not deal with indels, and presuppose an alignment.
• These models assume that all positions in a DNA
sequence mutate at the same rate. We will talk about
how to relax this assumption in later lectures.
```