Evolutionary Changes in Nucleotide Sequences

Workshop on Computational Molecular Evolution
Chau-Ti Ting
12/22/2007
Substitution Models
Readings:
Chapter 3 of Fundamentals of Molecular Evolution by Graur and Li (2000)
Chapters 1 and 2 of Computational Molecular Evolution by Yang (2006)
Models of nucleotide substitution
Introduction
Calculating the distance between two sequences is the simplest phylogenetic analysis.
It is important because
it is the first step in distance methods for phylogeny reconstruction, and
the Markov-process models of nucleotide substitution used in distance calculation
form the basis of likelihood and Bayesian analysis.
The distance between two nucleotide sequences is defined as the expected number of
nucleotide substitutions per site.
The simplest distance measure is the proportion of different sites, sometimes called the p-distance. If 10 sites differ between two sequences, each 100 bp long, then p = 10% = 0.1.
However, a variable site may result from more than one substitution,
and even a constant site may harbor back or parallel substitutions.
Multiple hits: multiple substitutions at the same site (i.e., some changes are hidden)
Note: p is usable only for highly similar sequences, with p < 5%.
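The p-distance can be computed directly from an alignment. The following is a minimal sketch, assuming the two sequences are already aligned, have equal length, and contain no gaps; the function name `p_distance` and the example sequences are illustrative only.

```python
def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of sites at which two aligned sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


# Hypothetical example: 100 sites, 10 of which differ.
seq_a = "A" * 100
seq_b = "G" * 10 + "A" * 90
print(p_distance(seq_a, seq_b))  # 0.1
```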
Nucleotide substitution in a DNA sequence
Jukes and Cantor’s one-parameter model (JC69)
This simple model assumes that substitutions occur with equal probability among the four
nucleotide types. The rate of substitution for each nucleotide is $3\alpha$ per unit time, and the
rate of substitution in each of the three possible directions of change is $\alpha$. It is called
the one-parameter model because it involves a single parameter, $\alpha$.
[Figure: the one-parameter model of nucleotide substitution; each of the nucleotides A, C, G and T changes to each of the others at rate $\alpha$.]
Since we start with A, the probability that this site is occupied by A at time 0 is $P_A(0) = 1$.
At time 1, the probability of still having A at this site is given by
$$P_A(1) = 1 - 3\alpha$$
in which $3\alpha$ is the probability of A changing to T, C or G, and $1 - 3\alpha$ is the probability
that A has remained unchanged.
The probability of having A at time 2 is
$$P_A(2) = (1 - 3\alpha) P_A(1) + \alpha [1 - P_A(1)]$$
To derive this equation, we consider two possible scenarios:
1. the nucleotide has remained unchanged from time 0 to time 2, and
2. the nucleotide has changed to T, C, or G at time 1, but has subsequently reverted to A
at time 2.
Using the above formulation, we can show that the following recurrence equation applies
to any t:
$$P_A(t+1) = (1 - 3\alpha) P_A(t) + \alpha [1 - P_A(t)]$$
We can rewrite this equation in terms of the amount of change in $P_A(t)$ per unit time as
$$\Delta P_A(t) = P_A(t+1) - P_A(t) = -3\alpha P_A(t) + \alpha [1 - P_A(t)] = -4\alpha P_A(t) + \alpha$$
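The recurrence can be iterated numerically to see how the probability of still observing A decays from 1 toward 1/4. This is a small sketch; the function name `iterate_jc69` and the value of alpha are arbitrary choices for illustration.

```python
def iterate_jc69(alpha: float, steps: int, p0: float = 1.0) -> list[float]:
    """Iterate P_A(t+1) = (1 - 3a) P_A(t) + a [1 - P_A(t)] starting from P_A(0) = p0."""
    probs = [p0]
    for _ in range(steps):
        p = probs[-1]
        probs.append((1 - 3 * alpha) * p + alpha * (1 - p))
    return probs


alpha = 0.01  # hypothetical per-step substitution probability
trajectory = iterate_jc69(alpha, steps=1000)
for t in (0, 1, 2, 10, 100, 1000):
    print(t, round(trajectory[t], 6))
```

Subtracting 1/4 from both sides of the recurrence shows that the deviation from 1/4 shrinks by a factor of (1 - 4 alpha) at every step, which is why the trajectory settles at 1/4 regardless of the starting nucleotide.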
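We can approximate this process by a continuous-time model, by regarding $\Delta P_A(t)$ as the
rate of change at time t. With this approximation,
$$\frac{dP_A(t)}{dt} = -4\alpha P_A(t) + \alpha$$
The solution of this differential equation is
$$P_A(t) = \frac{1}{4} + \left(P_A(0) - \frac{1}{4}\right) e^{-4\alpha t}$$
Since $P_A(0) = 1$,
$$P_A(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$$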
This can also be rewritten in a more explicit form to take into account the facts that the
initial nucleotide is A and the nucleotide at time t is also A:
$$P_{AA}(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$$
A more general equation can also be written as
$$P_{ii}(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$$
If instead $P_A(0) = 0$, then
$$P_A(t) = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$$
If the initial nucleotide is G instead of A, then
$$P_{GA}(t) = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$$
In general, the probability $P_{ij}(t)$ that a nucleotide will become j at time t, given that it was i
at time 0, is
$$P_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$$
where $i \neq j$.
Yang (2006) uses $q_{ij}$ to denote the instantaneous rate of substitution from nucleotide
i to nucleotide j, with i, j = T, C, A or G. Thus the substitution-rate matrix is
$$Q = \{q_{ij}\} = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix}$$
where the nucleotides are ordered T, C, A and G. The total rate of substitution of any
nucleotide i is $3\alpha$, which is $-q_{ii}$.
The transition probability $p_{ij}(t)$ is the probability that a given nucleotide i will become j
time t later. The matrix $P(t) = \{p_{ij}(t)\}$ is known as the transition-probability matrix.
$$P(t) = e^{Qt} = \begin{pmatrix} p_0(t) & p_1(t) & p_1(t) & p_1(t) \\ p_1(t) & p_0(t) & p_1(t) & p_1(t) \\ p_1(t) & p_1(t) & p_0(t) & p_1(t) \\ p_1(t) & p_1(t) & p_1(t) & p_0(t) \end{pmatrix}$$
with
$$p_0(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$$
$$p_1(t) = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$$

1. Each row of P(t) sums to 1.
2. P(0) = I, the identity matrix, reflecting the case of no evolution (t = 0).
3. The rate $\alpha$ and time t occur in the transition probabilities only in the form of the product $\alpha t$.
With no external information about either the time or the rate, we can estimate only
the distance, but not time or rate individually.
4. When $t \to \infty$, $p_{ij}(t) \to 1/4$ for all i and j.
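These four properties can be checked numerically from the closed-form JC69 transition probabilities, 1/4 + (3/4) exp(-4 alpha t) on the diagonal and 1/4 - (1/4) exp(-4 alpha t) elsewhere. The sketch below uses only the standard library; the function name `jc69_matrix` and the rate value are illustrative assumptions.

```python
import math


def jc69_matrix(alpha: float, t: float) -> list[list[float]]:
    """4x4 JC69 transition-probability matrix, nucleotides ordered T, C, A, G."""
    decay = math.exp(-4 * alpha * t)
    p0 = 0.25 + 0.75 * decay  # nucleotide is the same after time t
    p1 = 0.25 - 0.25 * decay  # nucleotide is one specific different base
    return [[p0 if i == j else p1 for j in range(4)] for i in range(4)]


alpha = 0.1  # hypothetical substitution rate
for t in (0.0, 1.0, 10.0, 1000.0):
    P = jc69_matrix(alpha, t)
    # Columns: time, diagonal element, off-diagonal element, row sum.
    print(t, round(P[0][0], 4), round(P[0][1], 4), round(sum(P[0]), 4))
```

Because only the product of rate and time enters the formula, doubling the rate while halving the time gives the same matrix.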


[Figure: transition probabilities plotted against time; axes labeled P and Time.]
Kimura’s two-parameter model (K80)
Substitutions between the two pyrimidines (T ↔ C) or between the two purines (A ↔ G) are called
transitions, while those between a pyrimidine and a purine (T, C ↔ A, G) are called
transversions. In real data, transitions often occur at higher rates than transversions.
In this model, the rate of transitional substitution at each nucleotide site is $\alpha$ per unit
time, whereas the rate of each type of transversional substitution is $\beta$ per unit time.

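[Figure: the two-parameter model; transitions (A ↔ G and C ↔ T) occur at rate $\alpha$ and transversions at rate $\beta$.]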
Let us consider the probability that a site that has A at time 0 will have A at time t. After
one time unit, the probability of A changing to G is $\alpha$, and the probability of A changing
to either C or T is $2\beta$. Thus the probability of A remaining unchanged after one time unit
is
$$P_A(1) = 1 - \alpha - 2\beta$$
At time 2, the probability of having A at this site is given by the sum of the probabilities
of four different scenarios: 1) A remained unchanged at t = 1 and t = 2; 2) A changed to G
at t = 1 and reverted by a transition to A at t = 2; 3) A changed to C at t = 1 and reverted by a
transversion to A at t = 2; 4) A changed to T at t = 1 and reverted by a transversion to A at
t = 2. Hence,
$$P_A(2) = (1 - \alpha - 2\beta) P_A(1) + \beta P_T(1) + \beta P_C(1) + \alpha P_G(1)$$
By extension,
$$P_A(t+1) = (1 - \alpha - 2\beta) P_A(t) + \beta P_T(t) + \beta P_C(t) + \alpha P_G(t)$$
Similarly, we can obtain
$$P_T(t+1) = \beta P_A(t) + (1 - \alpha - 2\beta) P_T(t) + \alpha P_C(t) + \beta P_G(t)$$
$$P_C(t+1) = \beta P_A(t) + \alpha P_T(t) + (1 - \alpha - 2\beta) P_C(t) + \beta P_G(t)$$
$$P_G(t+1) = \alpha P_A(t) + \beta P_T(t) + \beta P_C(t) + (1 - \alpha - 2\beta) P_G(t)$$
From this set of four equations, we arrive at the following solution:
$$P_{AA}(t) = \frac{1}{4} + \frac{1}{4} e^{-4\beta t} + \frac{1}{2} e^{-2(\alpha + \beta)t}$$
As in the JC69 model, $P_{AA}(t) = P_{GG}(t) = P_{CC}(t) = P_{TT}(t)$. Denoting this common probability by X(t),
$$X(t) = \frac{1}{4} + \frac{1}{4} e^{-4\beta t} + \frac{1}{2} e^{-2(\alpha + \beta)t}$$
Let Y(t) be the probability that the initial nucleotide and the nucleotide at time t differ
from each other by a transition:
$$Y(t) = P_{AG}(t) = P_{GA}(t) = P_{TC}(t) = P_{CT}(t) = \frac{1}{4} + \frac{1}{4} e^{-4\beta t} - \frac{1}{2} e^{-2(\alpha + \beta)t}$$
The probability, Z(t), that the initial nucleotide and the nucleotide at time t differ by a
specific type of transversion is given by
$$Z(t) = \frac{1}{4} - \frac{1}{4} e^{-4\beta t}$$
Note that each nucleotide is subject to two types of transversion, but only one type of
transition. Also,
$$X(t) + Y(t) + 2Z(t) = 1$$
The rate matrix is as follows:
$$Q = \{q_{ij}\} = \begin{pmatrix} -(\alpha + 2\beta) & \alpha & \beta & \beta \\ \alpha & -(\alpha + 2\beta) & \beta & \beta \\ \beta & \beta & -(\alpha + 2\beta) & \alpha \\ \beta & \beta & \alpha & -(\alpha + 2\beta) \end{pmatrix}$$
where the nucleotides are ordered T, C, A and G. The total rate of substitution of any
nucleotide is $\alpha + 2\beta$, and the distance between two sequences separated by time t is
$d = (\alpha + 2\beta)t$. Note that $\alpha t$ is the expected number of transitions per site and $2\beta t$ is the
expected number of transversions per site. It is convenient to use the distance d and the
transition/transversion rate ratio $\kappa = \alpha / \beta$.

The transition probability $p_{ij}(t)$ is the probability that a given nucleotide i will become j
time t later. The matrix $P(t) = \{p_{ij}(t)\}$ is known as the transition-probability matrix.
$$P(t) = e^{Qt} = \begin{pmatrix} p_0(t) & p_1(t) & p_2(t) & p_2(t) \\ p_1(t) & p_0(t) & p_2(t) & p_2(t) \\ p_2(t) & p_2(t) & p_0(t) & p_1(t) \\ p_2(t) & p_2(t) & p_1(t) & p_0(t) \end{pmatrix}$$
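where the three distinct elements of the matrix are
$$p_0(t) = \frac{1}{4} + \frac{1}{4} e^{-4\beta t} + \frac{1}{2} e^{-2(\alpha + \beta)t} = \frac{1}{4} + \frac{1}{4} e^{-4d/(\kappa + 2)} + \frac{1}{2} e^{-2d(\kappa + 1)/(\kappa + 2)}$$
$$p_1(t) = \frac{1}{4} + \frac{1}{4} e^{-4\beta t} - \frac{1}{2} e^{-2(\alpha + \beta)t} = \frac{1}{4} + \frac{1}{4} e^{-4d/(\kappa + 2)} - \frac{1}{2} e^{-2d(\kappa + 1)/(\kappa + 2)}$$
$$p_2(t) = \frac{1}{4} - \frac{1}{4} e^{-4\beta t} = \frac{1}{4} - \frac{1}{4} e^{-4d/(\kappa + 2)}$$
Note that $p_0(t) + p_1(t) + 2p_2(t) = 1$.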
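The K80 transition probabilities, written in terms of the distance d and the rate ratio kappa, are easy to evaluate. The sketch below assumes the T, C, A, G ordering used for the rate matrix; the function names `k80_probs` and `k80_matrix` and the example values are illustrative only.

```python
import math


def k80_probs(d: float, kappa: float) -> tuple[float, float, float]:
    """Return (p0, p1, p2): no change, transition, one specific transversion."""
    e1 = math.exp(-4 * d / (kappa + 2))
    e2 = math.exp(-2 * d * (kappa + 1) / (kappa + 2))
    p0 = 0.25 + 0.25 * e1 + 0.5 * e2
    p1 = 0.25 + 0.25 * e1 - 0.5 * e2
    p2 = 0.25 - 0.25 * e1
    return p0, p1, p2


def k80_matrix(d: float, kappa: float) -> list[list[float]]:
    """4x4 K80 transition-probability matrix, nucleotides ordered T, C, A, G."""
    p0, p1, p2 = k80_probs(d, kappa)
    return [
        [p0, p1, p2, p2],
        [p1, p0, p2, p2],
        [p2, p2, p0, p1],
        [p2, p2, p1, p0],
    ]


# Hypothetical values: distance 0.2 substitutions per site, kappa = 4.
for row in k80_matrix(0.2, 4.0):
    print([round(x, 4) for x in row])
```

Setting kappa = 1 makes the transition and transversion probabilities equal, so the matrix reduces to the JC69 form.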

Number of nucleotide substitutions between two DNA sequences

If two sequences of length N differ from each other at n sites, then the proportion of
differences, n/N, is referred to as the degree of divergence or Hamming distance.
If the degree of divergence is substantial, then the observed number of differences is
likely to be smaller than the actual number of substitutions because of multiple substitutions
(multiple hits) at the same site.
[Figure: types of nucleotide substitution in two descendant lineages, illustrated on the sequence ACTGAACGTAACGC: single substitution, sequential substitution, coincidental substitution, parallel substitution, convergent substitution and back substitution.]
Number of nucleotide substitutions between two noncoding sequences
Let us start with the JC69 model. In this model, it is sufficient to consider only I(t), which is
the probability that the nucleotide at a given site at time t is the same in both
sequences. Suppose that the nucleotide at a given site was A at time 0. At time t, the
probability that a descendant sequence will have A at this site is $P_{AA}(t)$, and consequently
the probability that two descendant sequences both have A at this site is $P_{AA}^2(t)$. Similarly, the
probabilities that both sequences have T, C or G at this site are $P_{AT}^2(t)$, $P_{AC}^2(t)$ and $P_{AG}^2(t)$,
respectively. Therefore,
$$I(t) = P_{AA}^2(t) + P_{AT}^2(t) + P_{AC}^2(t) + P_{AG}^2(t)$$
$$I(t) = \frac{1}{4} + \frac{3}{4} e^{-8\alpha t}$$
Note that the probability that the two sequences are different at a site at time t is
p = 1 - I(t). Thus,
$$p = \frac{3}{4} \left(1 - e^{-8\alpha t}\right)$$
or
$$8\alpha t = -\ln\left(1 - \frac{4}{3} p\right)$$
The time of divergence between two sequences is usually not known, and thus we
cannot estimate $\alpha$. Instead, we compute d, which is the number of substitutions per site
since the time of divergence between the two sequences. In the case of the one-parameter
model, $d = 2(3\alpha t)$, where $3\alpha t$ is the number of substitutions per site in a single lineage.
We can calculate d as
$$d = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p\right)$$
where p is the observed proportion of different nucleotides between the two sequences. For
sequence length n, the sampling variance is approximately given by
$$\mathrm{var}(d) = \frac{p(1 - p)}{n \left(1 - \frac{4}{3} p\right)^2}$$
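The JC69 distance and its approximate sampling variance can be computed from the observed proportion of differences p and the sequence length n. The sketch below is illustrative: the function name `jc69_distance` and the sequence length of 1000 are assumptions, and the correction is undefined once p reaches 3/4.

```python
import math


def jc69_distance(p: float, n: int) -> tuple[float, float]:
    """Return (d, var_d) for observed p-distance p and sequence length n."""
    if not 0 <= p < 0.75:
        raise ValueError("the JC69 correction is defined only for 0 <= p < 0.75")
    d = -0.75 * math.log(1 - 4 * p / 3)
    var_d = p * (1 - p) / (n * (1 - 4 * p / 3) ** 2)
    return d, var_d


for p in (0.01, 0.05, 0.1, 0.3, 0.6):
    d, var_d = jc69_distance(p, n=1000)  # n = 1000 is a hypothetical length
    print(f"p = {p:.2f}  d = {d:.4f}  s.e. = {math.sqrt(var_d):.4f}")
```

The gap between p and d is small for similar sequences and widens quickly as p grows, together with the standard error.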
In the case of the two-parameter model, the differences between two sequences are
classified into transitions and transversions. Let S and V be the proportions of transitional
and transversional differences between two sequences, respectively. Then the number of
nucleotide substitutions per site between two sequences, d, and the rate ratio $\kappa$ are estimated by
$$d = -\frac{1}{2} \ln(1 - 2S - V) - \frac{1}{4} \ln(1 - 2V)$$
$$\kappa = \frac{2 \ln(1 - 2S - V)}{\ln(1 - 2V)} - 1$$
Equivalently, the transition distance and the transversion distance are estimated as
$$\alpha t = -\frac{1}{2} \ln(1 - 2S - V) + \frac{1}{4} \ln(1 - 2V)$$
$$2\beta t = -\frac{1}{2} \ln(1 - 2V)$$
Also, the variance of d is
$$\mathrm{var}(d) = [a^2 S + b^2 V - (aS + bV)^2]/n$$
where
$$a = (1 - 2S - V)^{-1}$$
$$b = \frac{1}{2} [(1 - 2S - V)^{-1} + (1 - 2V)^{-1}]$$
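The K80 estimates of d and kappa, and the variance of d, translate directly into code. This is a minimal sketch under the formulas above; the function name `k80_distance` and the example proportions are hypothetical.

```python
import math


def k80_distance(S: float, V: float, n: int) -> dict[str, float]:
    """K80 distance, kappa and var(d) from the proportions of transitional (S)
    and transversional (V) differences over n sites."""
    x = 1 - 2 * S - V
    y = 1 - 2 * V
    if x <= 0 or y <= 0:
        raise ValueError("K80 correction undefined: sequences too divergent")
    d = -0.5 * math.log(x) - 0.25 * math.log(y)
    # kappa cannot be estimated when no transversional differences are seen.
    kappa = 2 * math.log(x) / math.log(y) - 1 if V > 0 else float("nan")
    a = 1 / x
    b = 0.5 * (1 / x + 1 / y)
    var_d = (a * a * S + b * b * V - (a * S + b * V) ** 2) / n
    return {"d": d, "kappa": kappa, "var": var_d}


# Hypothetical data: 8% transitional and 4% transversional differences, 1000 sites.
print(k80_distance(0.08, 0.04, 1000))
```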
Violation of assumptions

Several assumptions have been made that are not necessarily met by the sequences under
study.
1. The rate of substitution was assumed to be the same at all sites. This assumption
might not hold, as the rate may vary greatly from site to site.
2. Substitutions were assumed to occur independently of one another.
3. The substitution matrix was assumed not to change in time, so that the nucleotide
frequencies are maintained at a constant equilibrium value throughout their
evolution.
The transition/transversion rate ratio
Three definitions of the 'transition/transversion rate ratio' are in use:
1. The ratio of the numbers of transitional and transversional differences between the two
sequences, without correcting for multiple hits, E(S)/E(V).
2. $\kappa = \alpha / \beta$, with $\kappa = 1$ meaning no rate difference between transitions and transversions.
3. The average transition/transversion ratio (R): the same as the first, but with correction for multiple hits.

Overall, R is convenient to use for comparing estimates under different models, while $\kappa$
is more suitable for formulating the null hypothesis of no transition/transversion rate
difference.

Distance estimation under different substitution models
At small distances, the different assumptions about the structure of the Q matrix do not
make much difference, and simple models such as JC69 and K80 produce estimates very
similar to those under more complex models.
At intermediate distances (20%~30%), different model assumptions become more
important. It may be favorable to use realistic models for distance estimation if the
sequences are not too short.
At large distances (>40%), the different methods often produce very different estimates,
and the estimates involve large sampling errors.
Models of amino acid and codon substitution
Introduction
With protein coding genes, we have the advantage of being able to distinguish
synonymous or silent substitutions from the nonsynonymous or replacement
substitutions.
Synonymous substitutions: nucleotide substitutions that do not change the encoded amino
acid
Nonsynonymous substitutions: nucleotide substitutions that do change the encoded amino
acid
Synonymous and nonsynonymous mutations are under very different selection pressures
and are fixed at very different rates. Thus, comparison between synonymous and
nonsynonymous substitution rates provides a means to understand the effect of natural
selection on the protein. This comparison does not require estimation of absolute
substitution rates or knowledge of the divergence time.
Models of amino acid replacement
Empirical models
Empirical models attempt to describe the relative rates of substitution between
amino acids without explicitly considering the factors that influence the evolutionary
process. They are often constructed by analyzing large quantities of sequence data, as
compiled from databases.
Mechanistic models consider the biological processes involved in amino acid substitution,
such as mutation biases in the DNA, translation of the codons into amino acids, and
filtering by natural selection. Mechanistic models have more interpretative power and are
particularly useful for studying the forces and mechanisms of gene sequence evolution.
Empirical models of amino acid substitution are all constructed by estimating relative
substitution rates between amino acids under the general time-reversible model:
$$\pi_i q_{ij} = \pi_j q_{ji}, \quad \text{for any } i \neq j$$
The first empirical amino acid substitution matrix was constructed by Dayhoff and
colleagues. They compiled and analyzed the protein sequences available at the time, using a
parsimony argument to reconstruct ancestral protein sequences and tabulating amino acid
changes along branches of the phylogeny. Dayhoff et al. approximated the transition-probability matrix for an expected distance of 0.01 changes per site, called 1 PAM (for
point accepted mutations).
Features of these matrices:
1. Amino acids with similar physico-chemical properties tend to interchange with each
other at higher rates than dissimilar amino acids (D ↔ E or I ↔ V).
2. The "mutational distance" between amino acids is determined by the structure of the
genetic code. Amino acids separated by differences at two or three codon positions
have lower rates than amino acids separated by a difference at one codon position
(R ↔ K for nuclear proteins or for mitochondrial proteins).
Both factors may be operating at the same time.
Estimating synonymous and nonsynonymous substitution rates
Two distances are usually calculated between protein-coding DNA sequences, for
synonymous and nonsynonymous substitutions, respectively.
dS or KS: the number of synonymous changes per synonymous site
dN or KN: the number of nonsynonymous changes per nonsynonymous site
Two classes of methods: heuristic counting methods and the ML method
Counting Methods
Three steps:
1. Count synonymous and nonsynonymous sites
2. Count synonymous and nonsynonymous differences
3. Calculate the proportion of differences and correct for multiple hits
Nei and Gojobori (1986)
1. Count synonymous and nonsynonymous sites: S and N
2. Count synonymous and nonsynonymous differences: Sd and Nd
3. Calculate the proportions of differences ($p_S$ and $p_N$) as
$$p_S = S_d / S$$
$$p_N = N_d / N$$
and apply the JC69 correction for multiple hits:
$$d_S = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p_S\right)$$
$$d_N = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p_N\right)$$
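The third step of the Nei and Gojobori procedure can be sketched as follows. The sketch assumes that the site counts (S, N) and difference counts (Sd, Nd) from steps 1 and 2 are already available; the function names and the example counts are hypothetical, and the counting itself is not shown.

```python
import math


def jc69_correct(p: float) -> float:
    """JC69 multiple-hit correction of a proportion of differences."""
    if not 0 <= p < 0.75:
        raise ValueError("correction is defined only for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


def ng86_distances(S: float, N: float, Sd: float, Nd: float) -> tuple[float, float, float]:
    """Return (dS, dN, dN/dS) from site counts S, N and difference counts Sd, Nd."""
    d_s = jc69_correct(Sd / S)
    d_n = jc69_correct(Nd / N)
    ratio = d_n / d_s if d_s > 0 else float("nan")
    return d_s, d_n, ratio


# Hypothetical counts for a pair of aligned coding sequences.
d_s, d_n, ratio = ng86_distances(S=100.0, N=300.0, Sd=20.0, Nd=6.0)
print(f"dS = {d_s:.4f}  dN = {d_n:.4f}  dN/dS = {ratio:.4f}")
```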
Transition/transversion rate difference and codon usage
According to Li et al. (1985), we first classify the nucleotide sites into nondegenerate,
twofold degenerate and fourfold degenerate sites.
nondegenerate ($L_0$): all possible changes at the site are nonsynonymous
twofold degenerate ($L_2$): one of the three possible changes is synonymous
fourfold degenerate ($L_4$): all possible changes at the site are synonymous
The nucleotide differences in each class are further classified into transitional ($S_i$) and
transversional ($V_i$) differences, where i = 0, 2 and 4 denote nondegeneracy, twofold
degeneracy and fourfold degeneracy, respectively.
All the substitutions at nondegenerate sites are nonsynonymous.
All the substitutions at fourfold degenerate sites are synonymous.
At twofold degenerate sites, transitional changes are synonymous, whereas
transversional changes are nonsynonymous.
The proportion of transitional differences at i-fold degenerate sites between two
sequences is calculated as
$$P_i = S_i / L_i$$
Similarly, the proportion of transversional differences at i-fold degenerate sites between
two sequences is calculated as
$$Q_i = V_i / L_i$$
Kimura's two-parameter method is used to estimate the numbers of transitional ($A_i$) and
transversional ($B_i$) substitutions per ith type of site:
$$A_i = \frac{1}{2} \ln(a_i) - \frac{1}{4} \ln(b_i)$$
$$B_i = \frac{1}{2} \ln(b_i)$$
where $a_i = 1/(1 - 2P_i - Q_i)$ and $b_i = 1/(1 - 2Q_i)$.
The total number of substitutions per ith type of degenerate site, $K_i$, is given by
$$K_i = A_i + B_i$$
$A_2$ and $B_2$ denote the numbers of synonymous and nonsynonymous substitutions per
twofold degenerate site, respectively.
$K_4 = A_4 + B_4$ denotes the number of synonymous substitutions per fourfold degenerate
site.
$K_0 = A_0 + B_0$ denotes the number of nonsynonymous substitutions per nondegenerate site.
Then the number of synonymous substitutions per synonymous site ($d_S$) and the number
of nonsynonymous substitutions per nonsynonymous site ($d_N$) can be obtained by
$$d_S = \frac{L_2 A_2 + L_4 K_4}{(L_2/3) + L_4}$$
$$d_N = \frac{L_2 B_2 + L_0 K_0}{(2L_2/3) + L_0}$$
Li (1993) and Pamilo and Bianchi (1993) proposed calculating the number of
synonymous substitutions by taking $(L_2 A_2 + L_4 A_4)/(L_2 + L_4)$ as an estimate of the
transition component of nucleotide substitution at twofold and fourfold degenerate sites:
$$d_S = \frac{L_2 A_2 + L_4 A_4}{L_2 + L_4} + B_4$$
$$d_N = A_0 + \frac{L_2 B_2 + L_0 B_0}{L_2 + L_0}$$
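The degeneracy-class calculation can be sketched in a few lines. The sketch takes the site counts per class and the proportions of transitional (P) and transversional (Q) differences per class as given, applies the two-parameter correction within each class, and then combines the classes in the two ways described in the text. It uses K4 and K0 in the Li et al. (1985) formulas and A4 and B0 in the Li (1993) formulas, as the definitions of those quantities require. All function names and input numbers are hypothetical.

```python
import math


def k2p_components(P: float, Q: float) -> tuple[float, float]:
    """Transitional (A) and transversional (B) substitutions per site under K80."""
    a = 1 / (1 - 2 * P - Q)
    b = 1 / (1 - 2 * Q)
    return 0.5 * math.log(a) - 0.25 * math.log(b), 0.5 * math.log(b)


def degeneracy_distances(L: dict, P: dict, Q: dict) -> dict:
    """L, P, Q are keyed by degeneracy class (0, 2, 4). Returns (dS, dN) pairs."""
    A, B = {}, {}
    for i in (0, 2, 4):
        A[i], B[i] = k2p_components(P[i], Q[i])
    K = {i: A[i] + B[i] for i in (0, 2, 4)}
    # Li et al. (1985): one third of twofold sites are counted as synonymous.
    lwl85 = (
        (L[2] * A[2] + L[4] * K[4]) / (L[2] / 3 + L[4]),
        (L[2] * B[2] + L[0] * K[0]) / (2 * L[2] / 3 + L[0]),
    )
    # Li (1993), Pamilo and Bianchi (1993): pool transitions and transversions separately.
    li93 = (
        (L[2] * A[2] + L[4] * A[4]) / (L[2] + L[4]) + B[4],
        A[0] + (L[2] * B[2] + L[0] * B[0]) / (L[2] + L[0]),
    )
    return {"LWL85": lwl85, "Li93": li93}


# Hypothetical site counts and proportions of differences per class.
L = {0: 650.0, 2: 150.0, 4: 200.0}
P = {0: 0.02, 2: 0.10, 4: 0.12}
Q = {0: 0.01, 2: 0.02, 4: 0.08}
for method, (d_s, d_n) in degeneracy_distances(L, P, Q).items():
    print(f"{method}: dS = {d_s:.4f}  dN = {d_n:.4f}")
```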

Maximum likelihood method
Define the numbers of sites and substitutions on a per-codon basis.
1. The expected number of substitutions per codon from codon i to codon j, $i \neq j$, over any time t
is $\pi_i q_{ij} t$. Thus the numbers of synonymous and nonsynonymous substitutions per
codon between two sequences separated by time t are
$$S_d = t\rho_S = \sum_{i \neq j,\ aa_i = aa_j} \pi_i q_{ij} t$$
$$N_d = t\rho_N = \sum_{i \neq j,\ aa_i \neq aa_j} \pi_i q_{ij} t$$
where $\rho_S$ and $\rho_N$ are the proportions of synonymous and nonsynonymous
substitutions.
2. Count sites: for $\omega = 1$, the numbers of synonymous and nonsynonymous sites per
codon are
$$S = 3\rho_S^1$$
$$N = 3\rho_N^1$$
where $\rho_S^1$ and $\rho_N^1$ are the proportions calculated with $\omega = 1$.
3. The distances are given by
$$d_S = S_d / S$$
$$d_N = N_d / N$$
$\omega = d_N/d_S = (\rho_N/\rho_S)/(\rho_N^1/\rho_S^1)$ measures the perturbation in the proportions of
synonymous and nonsynonymous substitutions caused by natural selection on the protein.


Ex: tobacco rbcL genes
1. Assume equal transition and transversion rates ($\kappa = 1$) and equal codon frequencies
($\pi_j = 1/61$), and estimate t and $\omega$ by the ML method:
$$\hat{t} = 0.363, \quad \hat{\omega} = 0.096$$
2. Calculate S, N and Sd, Nd
3. Calculate distances
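Steps 2 and 3 of the example can be sketched in Python. The sketch does not perform the ML estimation; it takes the estimates of t and omega quoted in the example as given. It also makes assumptions that these notes do not spell out: the standard genetic code, and a codon rate matrix in which changes at more than one codon position have rate zero, while single-position changes have rate proportional to the target codon frequency, multiplied by kappa for transitions and by omega for nonsynonymous changes. All names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (assumed here).
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [c for c, aa in CODE.items() if aa != "*"]  # the 61 sense codons
PURINES = {"A", "G"}


def proportions(kappa: float, omega: float) -> tuple[float, float]:
    """Return (rho_S, rho_N) under equal codon frequencies."""
    pi = 1.0 / len(SENSE)
    syn = non = 0.0
    for i in SENSE:
        for j in SENSE:
            diff = [(x, y) for x, y in zip(i, j) if x != y]
            if len(diff) != 1:
                continue  # only single-nucleotide changes have nonzero rate
            x, y = diff[0]
            q = pi  # rate proportional to the frequency of the target codon
            if (x in PURINES) == (y in PURINES):
                q *= kappa  # transition
            if CODE[i] == CODE[j]:
                syn += pi * q
            else:
                non += pi * q * omega
    total = syn + non  # rescale so that the proportions sum to 1
    return syn / total, non / total


def ml_distances(t: float, kappa: float, omega: float) -> dict[str, float]:
    """S, N, Sd, Nd, dS and dN per codon, given estimates of t, kappa and omega."""
    rho_s, rho_n = proportions(kappa, omega)
    rho1_s, rho1_n = proportions(kappa, 1.0)  # site counts use omega = 1
    S, N = 3 * rho1_s, 3 * rho1_n
    Sd, Nd = t * rho_s, t * rho_n
    return {"S": S, "N": N, "Sd": Sd, "Nd": Nd, "dS": Sd / S, "dN": Nd / N}


# Estimates quoted in the example: t = 0.363, omega = 0.096, with kappa = 1.
for name, value in ml_distances(0.363, 1.0, 0.096).items():
    print(f"{name} = {value:.4f}")
```

By construction S + N = 3 sites per codon, Sd + Nd = t, and dN/dS returns the value of omega that was supplied.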

Comparison of methods
1. Ignoring the transition/transversion rate difference leads to underestimation of S,
overestimation of dS, and underestimation of the $\omega$ ratio.
2. Ignoring codon usage bias leads to overestimation of S, underestimation of dS, and
overestimation of the $\omega$ ratio.
3. Different methods or model assumptions can produce very different estimates even
when the two sequences are highly similar.
4. Assumptions appear to matter more than methods.
Advantages of the likelihood method
1. Conceptual simplicity
2. It is much simpler to accommodate more realistic models of codon substitution