Maximum likelihood in Phylogenetics - DNA/Amino Acid Substitution Models
By
SHANTHI IYANPERUMAL
Project for
Probability and Statistics
Dr. M. Partensky
4/19/04
Maximum Likelihood-DNA Substitution Models – Shanthi Iyanperumal
Table of Contents
Maximum likelihood in Phylogenetics - DNA/Amino Acid Substitution Models
Phylogeny
Explication of the Maximum Likelihood method
Calculation of likelihood of molecular sequences
    Example 1: Likelihood of a single sequence with two nucleotides AC
    Example 2: Likelihood of a one branch tree between two aligned sequences
DNA substitution Models
    Jukes-Cantor (JC)
    Felsenstein 1981 (F81)
    Kimura 2-parameter (K80)
    Hasegawa-Kishino-Yano (HKY)
    Tamura-Nei (TrN)
    Kimura 3-parameter (K3P)
    Transition Model (TIM)
    Transversion Model (TVM)
    Symmetrical Model (SYM)
    General Time Reversible (GTR)
    Gamma Distribution (G)
    Proportion of Invariable Sites (I)
Amino acid Substitution Models
    Empirical substitution models
    PAM matrices
    Dayhoff matrices
    JTT matrices
    Other empirical models
    Blosum (Block substitution matrices)
    Poisson models
Reference
Phylogeny
Phylogeny is the evolutionary history of a genetically related group of organisms.
Phylogenetics is the study of relationships among a collection of "things" (genes,
proteins, organs, ...) that are derived from a common ancestor.
Phylogeny is used to
1. Find evolutionary ties between organisms (analyze the changes that occurred in different
organisms during evolution).
2. Find or understand relationships between an ancestral sequence and its descendants
(evolution of a family of sequences).
3. Estimate the time of divergence between groups of organisms that share a common
ancestor.
When two DNA sequences diverge from a common ancestral sequence, each of them
starts to accumulate nucleotide substitutions. The number of these substitutions is
used in molecular evolution analysis.
One of the most striking features of life is that all living organisms share highly conserved
regions in proteins, particularly in proteins that are involved in information processing
(transcription and translation). All share essentially the same genetic code. This information
leads us to accept the theory that all organisms known to us have evolved from a common
ancestor. Patrick Forterre named this ancestor LUCA (Last Universal Common
Ancestor).
When two sequences found in two organisms are very similar, we assume that they are
derived from a common ancestor.
Sequence alignment reveals which positions are conserved from the ancestral
sequence.
Most phylogenetic methods assume that each position in a sequence changes
independently of the other positions.
General Model of DNA Substitution
Maximum likelihood evaluates the probability that the chosen evolutionary model
generated the observed sequences. Phylogenies are then inferred by finding the
trees that yield the highest likelihood.
The rate matrix for a general model of DNA substitution is given by
$$
Q = \{q_{ij}\} =
\begin{pmatrix}
\cdot & r_2\pi_C & r_4\pi_G & r_6\pi_T \\
r_1\pi_A & \cdot & r_8\pi_G & r_{10}\pi_T \\
r_3\pi_A & r_7\pi_C & \cdot & r_{12}\pi_T \\
r_5\pi_A & r_9\pi_C & r_{11}\pi_G & \cdot
\end{pmatrix}
$$
The rows and columns are ordered A, C, G and T. The matrix gives the rate of change
from nucleotide i (arranged along the rows) to nucleotide j (along the columns).
For example, $r_2\pi_C$ gives the rate of change from A to C. Each diagonal element
(shown as a dot) is the negative of the sum of the other entries in its row, so that every
row sums to zero.
Let P(v,s) be the transition probability matrix, where $p_{ij}(v,s)$ is the probability that
nucleotide i changes into j over a branch of length v. The vector s contains the parameters of
the substitution model (e.g. $\pi_A, \pi_C, \pi_G, \pi_T, r_1, r_2, \ldots$).
To calculate the probability of observing a change over a branch of length v, the
following matrix exponential is computed:
$$
P(v,s) = e^{Qv}
$$
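As an illustration of the matrix exponential, the following minimal sketch computes P(v) for the Jukes-Cantor model using only the Python standard library. The helper names (`mat_mul`, `mat_exp`), the Taylor-series approach and the scaling of the rate matrix (every off-diagonal rate set to 1/3) are assumptions made for this example; production software normally uses an eigendecomposition or a library routine instead.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(q, v, squarings=10, terms=12):
    """Approximate exp(q * v) with a Taylor series plus repeated squaring."""
    n = len(q)
    scale = v / (2 ** squarings)
    x = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # After this step, term holds x^k / k!
        term = [[t / k for t in row] for row in mat_mul(term, x)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# Jukes-Cantor rate matrix (order A, C, G, T): every off-diagonal rate is 1/3,
# so each row sums to zero and one unit of v is one expected substitution per site.
Q_JC = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]

P = mat_exp(Q_JC, 0.5)
for row in P:
    print(" ".join(f"{p:.4f}" for p in row))
```

For this model the result can be checked against the known closed form: the diagonal entries equal 1/4 + (3/4)e^(-4v/3) and the off-diagonal entries equal 1/4 - (1/4)e^(-4v/3).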
Calculation of likelihood of molecular sequences:
For DNA sequence comparison the model has 2 parts, the base composition and
the process. The composition is just the proportion of the four nucleotides A, C, G,
T.
Example 1: Likelihood of a single sequence with two nucleotides AC
If the model is the Jukes-Cantor model, which has a base composition of 1/4 for each
nucleotide, then the likelihood will be 1/4 x 1/4 = 1/16. If the model has a
composition of 40% A and 10% C, the likelihood of the sequence will be 0.4 x
0.1 = 0.04.
If we take the 16 possible two-nucleotide sequences and add up their likelihoods,
the sum is 1. For any model, the sum of the likelihoods
of all the different data possibilities should be 1.
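The single-sequence calculation can be sketched in a few lines of Python. The function name `sequence_likelihood` is hypothetical, and the G and T proportions of the second composition are assumed values (the text only gives 40% A and 10% C), chosen so that the four proportions add up to 1.

```python
from itertools import product

def sequence_likelihood(seq, composition):
    """Likelihood of one sequence when only base composition is modelled."""
    lik = 1.0
    for base in seq:
        lik *= composition[base]
    return lik

jukes_cantor = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# A and C come from the example; G and T are assumed so the total is 1.
skewed = {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3}

lik_jc = sequence_likelihood("AC", jukes_cantor)   # 1/4 x 1/4
lik_skewed = sequence_likelihood("AC", skewed)     # 0.4 x 0.1

# Summing over all 16 possible two-base sequences should give 1 for any model.
total = sum(sequence_likelihood("".join(pair), skewed)
            for pair in product("ACGT", repeat=2))
print(lik_jc, lik_skewed, total)
```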
Example 2: Likelihood of a one branch tree between two aligned
sequences
Sequence 1: CCAT
Sequence 2: CCGT
The other part of the model, the process part, is needed if we have more than one
sequence related by a tree. The process might be described by sentences, or by a
matrix of numbers, describing how the nucleotides change from one to another. Let
the composition part of the model be denoted by $\pi = [0.1, 0.4, 0.2, 0.3]$. The order
of the bases is A, C, G, and T. There are 16 possible changes from one nucleotide
to the other (counting a base that stays the same). The changes can be represented as a
4 x 4 matrix, with rows (the starting base) and columns (the resulting base) both ordered
A, C, G, T:
$$
P =
\begin{pmatrix}
0.976 & 0.01 & 0.007 & 0.007 \\
0.002 & 0.983 & 0.005 & 0.01 \\
0.003 & 0.01 & 0.979 & 0.007 \\
0.002 & 0.013 & 0.005 & 0.979
\end{pmatrix}
$$
The likelihood of going from sequence 1 to sequence 2 is:
$$
L = \pi_C P_{C \to C} \times \pi_C P_{C \to C} \times \pi_A P_{A \to G} \times \pi_T P_{T \to T}
$$
= 0.4 * 0.983 * 0.4 * 0.983 * 0.1 * 0.007 * 0.3 * 0.979
= 0.0000300
(The entries of P are shown rounded to three decimals. Multiplying the rounded values
gives about 0.0000318, so the figure above presumably reflects unrounded entries.)
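The site-by-site product can be reproduced with a short sketch. The function name `pairwise_likelihood` is hypothetical, and the matrix is entered with the rounded three-decimal entries printed in the example (rows are the starting base, columns the resulting base), so the result is only expected to be close to the reported figure, not identical to it.

```python
PI = {"A": 0.1, "C": 0.4, "G": 0.2, "T": 0.3}

# Rounded transition probabilities from the example: P[x][y] = Pr(x -> y).
P = {
    "A": {"A": 0.976, "C": 0.010, "G": 0.007, "T": 0.007},
    "C": {"A": 0.002, "C": 0.983, "G": 0.005, "T": 0.010},
    "G": {"A": 0.003, "C": 0.010, "G": 0.979, "T": 0.007},
    "T": {"A": 0.002, "C": 0.013, "G": 0.005, "T": 0.979},
}

def pairwise_likelihood(seq1, seq2, pi, p):
    """Product over sites of pi[x] * p[x][y] for aligned bases x and y."""
    lik = 1.0
    for x, y in zip(seq1, seq2):
        lik *= pi[x] * p[x][y]
    return lik

lik = pairwise_likelihood("CCAT", "CCGT", PI, P)
print(f"{lik:.3e}")
```

With these rounded inputs the product comes out near 3.18e-05, slightly above the reported 0.0000300.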
In the above example we did not consider branch lengths. Intuitively for short
branch lengths the probability of a base change is low and for long branch lengths
it is high.
Let’s assume the matrix we have chosen describes a branch with a certain
evolutionary distance (CED). The likelihood we calculated was for 1 CED unit.
The likelihood for the same alignment for 2 CED units is found by multiplying
matrix P by itself (rows and columns again ordered A, C, G, T):
$$
P^2 =
\begin{pmatrix}
0.953 & 0.02 & 0.013 & 0.015 \\
0.005 & 0.966 & 0.01 & 0.02 \\
0.007 & 0.02 & 0.959 & 0.015 \\
0.005 & 0.026 & 0.01 & 0.959
\end{pmatrix}
$$
The likelihood for 2 CED units, using the base composition $\pi$ and the entries of $P^2$, is:
$$
L = \pi_C P^2_{C \to C} \times \pi_C P^2_{C \to C} \times \pi_A P^2_{A \to G} \times \pi_T P^2_{T \to T}
$$
= 0.4 * 0.966 * 0.4 * 0.966 * 0.1 * 0.013 * 0.3 * 0.959
= 0.0000559
As the branch length increases, the values on the diagonal decrease and the other
values increase, because a change becomes more likely the longer the branch is.
The table lists the likelihoods for increasing branch lengths.
Branch length (CED units)    Likelihood
1                            0.0000300
2                            0.0000559
3                            0.0000782
10                           0.000162
15                           0.000177
20                           0.000175
30                           0.000152
The likelihood rises to a maximum somewhere between 15 and 20 CED units.
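The branch-length scan can be sketched by raising the 1-CED matrix to successive powers and recomputing the likelihood each time. The helper names are hypothetical, and the matrix again uses the rounded entries from Example 2, so the numbers printed will differ somewhat from the table while following the same rise-then-fall pattern.

```python
PI = [0.1, 0.4, 0.2, 0.3]                      # order A, C, G, T
P1 = [[0.976, 0.010, 0.007, 0.007],            # rounded 1-CED matrix, rows = from
      [0.002, 0.983, 0.005, 0.010],
      [0.003, 0.010, 0.979, 0.007],
      [0.002, 0.013, 0.005, 0.979]]
IDX = {base: i for i, base in enumerate("ACGT")}

def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def likelihood_at(n_units, seq1="CCAT", seq2="CCGT"):
    """Likelihood of the alignment for a branch of n_units CED units (n_units >= 1)."""
    p = P1
    for _ in range(n_units - 1):
        p = mat_mul(p, P1)                     # p becomes P1 ** n_units
    lik = 1.0
    for x, y in zip(seq1, seq2):
        lik *= PI[IDX[x]] * p[IDX[x]][IDX[y]]
    return lik

for n in (1, 2, 3, 10, 15, 20, 30):
    print(n, f"{likelihood_at(n):.3e}")

# Branch length (in whole CED units, up to 40) with the highest likelihood.
best = max(range(1, 41), key=likelihood_at)
print("best branch length:", best)
```

Restricting the search to whole CED units is a simplification for this sketch; in practice the branch length is treated as a continuous parameter and optimized numerically.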
Example 3: Likelihood of a tree with four taxa
Assume that we have the aligned nucleotide sequences for four taxa:
For four taxa, the three possible unrooted trees are ((1,2),(3,4)), ((1,3),(2,4)) and ((1,4),(2,3)).
We want to evaluate the likelihood of the unrooted tree represented by the nucleotides of
site j in the sequence and shown below:
Since most of the models currently used are time-reversible, the likelihood of the tree is
generally independent of the position of the root. Therefore it is convenient to root the
tree at an arbitrary internal node, as done in the figure below.
Under the assumption that nucleotide sites evolve independently (the Markovian model
of evolution), we can calculate the likelihood for each site separately and combine the
likelihoods into a total value at the end. To calculate the likelihood for site j, we
have to consider all the possible scenarios by which the nucleotides present at the tips of
the tree could have evolved. The likelihood for a particular site is therefore the sum of the
probabilities of every possible reconstruction of ancestral states, given some model of
base substitution. In this specific case, each of the nucleotides A, G, C, and T can
occupy nodes (5) and (6), giving 4 x 4 = 16 possibilities:
In the case of protein sequences each site may occupy 20 states (those of the 20 amino
acids) and thus 400 possibilities have to be considered. Since any one of these scenarios
could have led to the nucleotide configuration at the tips of the tree, we must calculate the
probability of each and sum them to obtain the total probability for each site j.
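The sum over the 16 ancestral-state combinations can be sketched as follows. Several things here are assumptions made only for illustration: the topology (tips 1 and 2 attached to node 5, tips 3 and 4 attached to node 6, root placed at node 5), the Jukes-Cantor model with its closed-form transition probabilities, and the branch lengths. The function names are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"
PI = {b: 0.25 for b in BASES}   # Jukes-Cantor equilibrium frequencies

def jc_prob(x, y, v):
    """Jukes-Cantor probability of x -> y over a branch of length v
    (v measured in expected substitutions per site)."""
    e = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def site_likelihood(tips, branch_lengths):
    """Likelihood of one site for the assumed tree ((1,2),(3,4)).
    tips = (s1, s2, s3, s4); branch_lengths = (v1, v2, v3, v4, v56)."""
    s1, s2, s3, s4 = tips
    v1, v2, v3, v4, v56 = branch_lengths
    total = 0.0
    for x, y in product(BASES, repeat=2):       # x at node 5, y at node 6
        total += (PI[x]
                  * jc_prob(x, s1, v1) * jc_prob(x, s2, v2)
                  * jc_prob(x, y, v56)
                  * jc_prob(y, s3, v3) * jc_prob(y, s4, v4))
    return total

lengths = (0.1, 0.2, 0.15, 0.05, 0.3)           # assumed branch lengths
print(site_likelihood(("A", "A", "G", "G"), lengths))

# Sanity check: the likelihoods of all 4^4 possible tip patterns add up to 1.
print(sum(site_likelihood(t, lengths) for t in product(BASES, repeat=4)))
```

For larger trees this brute-force enumeration grows exponentially with the number of internal nodes, which is why programs use a recursive scheme (Felsenstein's pruning algorithm) to obtain the same sum more efficiently.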
The likelihood for the full tree is then the product of the likelihoods at each site:
$$
L = L(1) \times L(2) \times \cdots \times L(N) = \prod_{j=1}^{N} L(j)
$$
Since the individual likelihoods are extremely small numbers, it is convenient to sum the
log likelihoods at each site and report the likelihood of the entire tree as the log
likelihood:
$$
\ln L = \ln L(1) + \ln L(2) + \cdots + \ln L(N) = \sum_{j=1}^{N} \ln L(j)
$$
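The practical reason for working with logarithms can be seen in a small sketch. The per-site likelihoods below are hypothetical values chosen for illustration, and `log_likelihood` is a hypothetical helper name.

```python
import math

def log_likelihood(site_likelihoods):
    """Sum of the natural logs of the per-site likelihoods."""
    return sum(math.log(lik) for lik in site_likelihoods)

# Hypothetical alignment of 100 sites, each with likelihood 1e-5.
site_liks = [1e-5] * 100

naive_product = 1.0
for lik in site_liks:
    naive_product *= lik

print(naive_product)               # underflows to 0.0 in double precision
print(log_likelihood(site_liks))   # stays finite
```

The direct product falls below the smallest representable double and becomes zero, while the sum of logs remains an ordinary finite number that can be compared between trees.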
DNA substitution Models
The use of maximum likelihood (ML) algorithms in developing phylogenetic hypotheses
requires a model of evolution. The frequently used General Time Reversible (GTR)
family of nested models encompasses 64 models with different combinations of
parameters for DNA site substitution. The models are listed here from the least complex
to the most parameter rich.
Jukes-Cantor (JC):
Equal base frequencies, all substitutions equally likely. One substitution rate class.
Felsenstein 1981 (F81)
Variable base frequencies, all substitutions equally likely. One substitution rate class.
Kimura 2-parameter (K80)
Equal base frequencies, separate transition and transversion rates.
Two substitution rate classes.
Hasegawa-Kishino-Yano (HKY)
Variable base frequencies, separate transition and transversion rates.
Two substitution rate classes.
Tamura-Nei (TrN):
Variable base frequencies, equal transversion rates, two different transition rates.
Kimura 3-parameter (K3P)
Equal base frequencies, equal transition rates, two transversion rate classes. (The variant
with unequal base frequencies is usually labelled K81uf.)
Transition Model (TIM)
Variable base frequencies, two different transition rates, two transversion rate classes.
Transversion Model (TVM)
Variable base frequencies, variable transversion rates, equal transition rates.
Symmetrical Model (SYM)
Equal base frequencies, symmetrical substitution matrix (the A to T rate equals the T to A rate).
General Time Reversible (GTR)
Variable base frequencies, symmetrical substitution matrix. Six substitution rate classes.
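The nesting of these models can be made concrete with a sketch that builds a time-reversible rate matrix from six exchangeability parameters and four base frequencies, then obtains simpler models as special cases. The function names, the parameter layout (one rate per unordered base pair) and the numerical values are assumptions chosen for illustration, not the interface of any particular program.

```python
BASES = "ACGT"
PAIRS = ("AC", "AG", "AT", "CG", "CT", "GT")
TRANSITIONS = ("AG", "CT")            # purine-purine and pyrimidine-pyrimidine

def gtr_rate_matrix(rates, freqs):
    """Rate matrix with q[x][y] = rates[pair] * freqs[y]; each row sums to zero."""
    q = {x: {} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x != y:
                pair = "".join(sorted(x + y))
                q[x][y] = rates[pair] * freqs[y]
        q[x][x] = -sum(q[x][y] for y in BASES if y != x)
    return q

EQUAL = {b: 0.25 for b in BASES}

def jc():
    """Jukes-Cantor: one rate class, equal frequencies."""
    return gtr_rate_matrix({p: 1.0 for p in PAIRS}, EQUAL)

def k80(kappa):
    """Kimura 2-parameter: transitions scaled by kappa, equal frequencies."""
    rates = {p: (kappa if p in TRANSITIONS else 1.0) for p in PAIRS}
    return gtr_rate_matrix(rates, EQUAL)

def hky(kappa, freqs):
    """HKY: same two rate classes as K80, but with unequal frequencies."""
    rates = {p: (kappa if p in TRANSITIONS else 1.0) for p in PAIRS}
    return gtr_rate_matrix(rates, freqs)

freqs = {"A": 0.1, "C": 0.4, "G": 0.2, "T": 0.3}   # illustrative values
q = hky(2.0, freqs)
print(q["A"])
```

Because every model is produced by the same constructor with restricted parameters, each one is time-reversible: freqs[x] * q[x][y] equals freqs[y] * q[y][x] for every pair of bases.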
In addition to models describing the rates of change from one nucleotide to another, there
are models to describe rate variation among sites in a sequence. The following are the
two most commonly used models.
Gamma Distribution (G)
Rate heterogeneity can be accommodated by specifying that the rate of evolution across
different sites is distributed according to a gamma distribution.
Proportion of Invariable Sites (I)
A simpler way of accounting for rate heterogeneity is to specify that a fixed proportion of
sites is invariable, i.e. has a zero rate of evolution. The parameter I describes the extent of
static, unchanging sites in a dataset.
Amino acid Substitution Models
The divergence among sequences can be modeled with a mutation matrix. The matrix,
denoted by M, describes the probabilities of amino acid mutations for a given period of
evolution.
This corresponds to a model of evolution in which amino acids mutate randomly and
independently from one another but according to some predefined probabilities
depending on the amino acid itself. This is a Markovian model of evolution and, while
simple, it is one of the best models. Intrinsic properties of amino acids, like
hydrophobicity, size, charge, etc., can be modeled by appropriate mutation matrices.
Dependencies which relate one amino acid characteristic to the characteristics of its
neighbors cannot be modeled through this mechanism. Amino acids appear in
nature with different frequencies. These frequencies are denoted by $f_i$ and correspond to
the steady state of the Markov process defined by the matrix M, i.e., the vector f is any
of the columns of $M^{\infty}$, or equivalently the eigenvector of M whose corresponding
eigenvalue is 1 (Mf = f). This model of evolution is symmetric, i.e., the probability of having
an i which mutates to a j is the same as starting with a j which mutates into an i.
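The steady-state relation Mf = f can be illustrated by repeatedly applying a mutation matrix to a starting vector until it stops changing. The 3-state matrix below is a hypothetical toy example (not real amino acid data), stored so that column j holds the probabilities of what state j turns into; the function names are also hypothetical.

```python
def mat_vec(m, p):
    """Product of matrix m (list of rows) with column vector p."""
    return [sum(m[i][j] * p[j] for j in range(len(p))) for i in range(len(m))]

def stationary(m, iterations=2000):
    """Approximate the steady-state vector f with M f = f by power iteration."""
    n = len(m)
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = mat_vec(m, p)
    return p

# Hypothetical 3-state mutation matrix; every column sums to 1.
M = [[0.90, 0.05, 0.02],
     [0.06, 0.90, 0.03],
     [0.04, 0.05, 0.95]]

f = stationary(M)
print([round(x, 4) for x in f])
print([round(x, 4) for x in mat_vec(M, f)])   # applying M again leaves f unchanged
```

Repeated multiplication by M is the same as forming a high power of M, which is why every column of a sufficiently high power approaches f. This toy matrix only demonstrates the steady state; it was not constructed to satisfy the symmetry property described in the text.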
The following is a list of amino acid substitution models which use matrices.
Empirical substitution models
In contrast to DNA substitution models, amino acid replacement models have
concentrated on the empirical approach. Dayhoff and coworkers developed a model of
protein evolution which resulted in the development of a set of widely used replacement
matrices. In the Dayhoff approach, replacement rates are derived from alignments of
protein sequences that are at least 85% identical; this constraint ensures that the
likelihood of a particular mutation being the result of a set of successive mutations is low.
One of the main uses of the Dayhoff matrices has been in database search methods
where, for example, the matrices P(0.5), P(1) and P(2.5) (known as the PAM50, PAM100
and PAM250 matrices) are used to assess the significance of proposed matches between
target and database sequences. However, the implicit rate matrix has been used for
phylogenetic applications.
PAM matrices
In the definition of mutation, the matrix M implies a certain amount of mutation (measured
in PAM units). A 1-PAM mutation matrix describes an amount of evolution which will
change, on average, 1% of the amino acids. In mathematical terms this is expressed as
a matrix M such that
$$
\sum_{i=1}^{20} f_i (1 - M_{ii}) = 0.01
$$
The diagonal elements of M are the probabilities that a given amino acid does not change,
so $(1 - M_{ii})$ is the probability of mutating away from i.
If we have a probability or frequency vector p, the product Mp gives the probability
vector or the expected frequency of p after an evolution equivalent to 1 PAM unit. Or, if
we start with amino acid i (a probability vector which contains a 1 in position i and 0s in
all others), $M_{*i}$ (the ith column of M) is the corresponding probability vector after one
unit of random evolution. Similarly, after k units of evolution (what is called k-PAM
evolution) a frequency vector p will be changed into the frequency vector $M^k p$. Notice
that chronological time is not linearly dependent on PAM distance. Evolution rates may
be very different for different species and different proteins.
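The 1-PAM scaling and k-PAM evolution can be sketched with a deliberately simplified toy model. Everything specific here is an assumption for illustration: a 3-letter alphabet instead of 20 amino acids, hypothetical frequencies, and a construction in which a residue that changes is simply redrawn from those frequencies. Real PAM matrices are estimated from alignments.

```python
def one_pam_matrix(freqs):
    """Toy column-stochastic matrix scaled so that, on average, 1% of residues change.
    m[i][j] is the probability that residue j becomes residue i."""
    n = len(freqs)
    a = 0.01 / (1.0 - sum(f * f for f in freqs))
    return [[(1.0 - a if i == j else 0.0) + a * freqs[i] for j in range(n)]
            for i in range(n)]

def mat_vec(m, p):
    """Product of matrix m (list of rows) with column vector p."""
    return [sum(m[i][j] * p[j] for j in range(len(p))) for i in range(len(m))]

def evolve(m, p, k):
    """Apply k PAM units of evolution to the frequency vector p."""
    for _ in range(k):
        p = mat_vec(m, p)
    return p

def expected_change(m, freqs):
    """Average probability that a residue changes: sum of f_i * (1 - m_ii)."""
    return sum(f * (1.0 - m[i][i]) for i, f in enumerate(freqs))

FREQS = [0.5, 0.3, 0.2]                  # hypothetical 3-letter alphabet
M = one_pam_matrix(FREQS)
print(expected_change(M, FREQS))         # close to 0.01 by construction

start = [1.0, 0.0, 0.0]                  # begin with residue 0 for certain
after_250 = evolve(M, start, 250)
print([round(x, 4) for x in after_250])
```

After 250 PAM units the starting residue still has a higher probability than its background frequency, which illustrates that a distance of 250 PAM does not mean every position has been replaced: repeated changes at the same position overwrite one another.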
Dayhoff matrices
Dayhoff and coworkers presented a method for estimating the matrix M from the observation
of 1572 accepted mutations between 34 superfamilies of closely related sequences. Their method
was pioneering in the field. A Dayhoff matrix is computed from a 250-PAM mutation
matrix and is used for the standard dynamic programming method of sequence alignment. The
Dayhoff matrix entries are related to $M^{250}$ by
$$
D_{ij} = 10 \log_{10} \frac{(M^{250})_{ij}}{f_i}
$$
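A scoring matrix of this kind can be sketched as a log-odds transform of a 250-PAM mutation matrix. The sketch assumes the usual log-odds form, D_ij = 10 log10((M^250)_ij / f_i), and uses a hypothetical 3-letter toy model with invented frequencies rather than Dayhoff's data; all function names are illustrative.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, k):
    """Raise a square matrix to the integer power k >= 1 by repeated multiplication."""
    result = m
    for _ in range(k - 1):
        result = mat_mul(result, m)
    return result

def log_odds(m_k, freqs):
    """Assumed log-odds form: 10 * log10(m_k[i][j] / freqs[i])."""
    n = len(freqs)
    return [[10.0 * math.log10(m_k[i][j] / freqs[i]) for j in range(n)]
            for i in range(n)]

# Toy 1-PAM matrix: a changed residue is redrawn from FREQS (an assumed model).
FREQS = [0.5, 0.3, 0.2]
A = 0.01 / (1.0 - sum(f * f for f in FREQS))
M1 = [[(1.0 - A if i == j else 0.0) + A * FREQS[i] for j in range(3)]
      for i in range(3)]

D = log_odds(mat_pow(M1, 250), FREQS)
for row in D:
    print([round(x, 3) for x in row])
```

Positive scores mark pairings seen more often than chance would predict (here, the identities on the diagonal), and negative scores mark pairings seen less often. For a symmetric model of evolution the resulting score matrix is itself symmetric.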
JTT matrices
Jones et al. and Gonnet et al. have used much the same methodology as Dayhoff, but
with modern databases. The Jones et al. model has been implemented for phylogenetic
analyses with some success. Jones et al. have also calculated an amino acid replacement
matrix specifically for membrane spanning segments. This matrix has remarkably
different values from the Dayhoff matrices, which are known to be biased toward
water-soluble globular proteins.
Other empirical models
Adachi and Hasegawa have implemented a general reversible Markov model of amino
acid replacement that uses a matrix derived from the inferred replacements in
mitochondrial proteins of 20 vertebrate species. The authors show that this model
performs better than others when dealing with mitochondrial protein phylogeny.
BLOSUM (Blocks substitution matrices)
Henikoff and Henikoff have used local, ungapped alignments of distantly related
sequences to derive the BLOSUM series of matrices. Matrices of this series are identified
by a number after the matrix (e.g. BLOSUM50), which refers to the minimum percentage
identity of the blocks of multiple aligned amino acids used to construct the matrix. These
matrices are directly calculated without extrapolations, and are analogous to transition
probability matrices P(T) for different values of T, estimated without reference to any
rate matrix Q. The BLOSUM matrices often perform better than PAM matrices for local
similarity searches, but have not been widely used in phylogenetics.
Poisson models
A simple, non-empirical model of amino acid replacement was proposed by Nei (1987).
This model assumes that replacements follow a Poisson distribution, and gives accurate
estimates of the number of amino acid replacements when species are closely related.
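A common way to apply the Poisson model is the Poisson correction, which converts the observed proportion p of differing sites into an estimated number of replacements per site, d = -ln(1 - p). The sketch below assumes that form; the function names and the two short peptide sequences are hypothetical.

```python
import math

def p_distance(seq1, seq2):
    """Proportion of aligned positions at which the two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return diffs / len(seq1)

def poisson_distance(p):
    """Poisson-corrected number of replacements per site, d = -ln(1 - p)."""
    return -math.log(1.0 - p)

# Hypothetical aligned peptides differing at one of eight positions.
p = p_distance("MKTAYIAK", "MKTAYLAK")
d = poisson_distance(p)
print(p, round(d, 4))
```

For small p the corrected distance is only slightly larger than p, which matches the statement that the model works well for closely related species; as p grows, the correction increasingly accounts for multiple replacements at the same site.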
Reference:
1. Phylogeny Estimation and Hypothesis Testing Using Maximum Likelihood by
Huelsenbeck J. and Crandall K.
2. http://workshop.molecularevolution.org/resources/models/codonmodels.php
3. www.cs.technion.ac.il/~dang/courseCB/lecture13.pps
4. http://www.biology.usu.edu/biol6750/Lecture_15.htm
5. bio.wayne.edu/mf/teaching/BIO6060_Maximumlikelihood2
6. http://www.nmu.edu/biology/Lindsay/teaching/BI315/phylo/phylo_DNAmodels.html
7. http://www.icp.ucl.ac.be/~opperd/private/max_likeli.html
8. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=8642615
9. http://stat-www.berkeley.edu/users/terry/PMMB/Workshop2000/Lab3/phylo.pdf
10. http://www.biology.duke.edu/rausher/phylo1.pdf
11. Maximum Likelihood: Phylogeny Estimation - Neelima Lingareddy