Maximum likelihood in Phylogenetics
Group 3
JAN JONES
SHANTHI IYANPERUMAL
SYUSANNA KOYFMAN
Project for
Probability and Statistics
Dr. M. Partensky
4/19/04
Table of Contents
Maximum likelihood in Phylogenetics
P&S Project Part 1:
    Maximum Likelihood Overview
P&S Project Part II:
    General Model of DNA Substitution
    Calculation of likelihood of molecular sequences
        Example 1
        Example 2
        Example 3
    DNA substitution Models
        Jukes-Cantor (JC)
        Felsenstein 1981 (F81)
        Kimura 2-parameter (K80)
        Hasegawa-Kishino-Yano (HKY)
        Tamura-Nei (TrN)
        Kimura 3-parameter (K3P)
        Transition Model (TIM)
        Transversion Model (TVM)
        Symmetrical Model (SYM)
        General Time Reversible (GTR)
        Gamma Distribution (G)
        Proportion of Invariable Sites (I)
    Amino acid Substitution Models
        Empirical substitution models
        PAM matrices
        Dayhoff matrices
        JTT matrices
        Other empirical models
        Blosum (Block substitution matrices)
        Poisson models
P&S Project Part III:
    Phylogenetic trees
    Maximum parsimony
    Maximum likelihood
    LIKELIHOOD RATIO TESTS IN PHYLOGENETICS
Reference
P&S Project Part 1:
Maximum Likelihood Overview
By: Janice Jones
04/19/04
This section of the project presents an overview of the Maximum Likelihood method and
introduces its application to the construction of phylogenetic trees. The two other sections
of this project, submitted by Shanthi Iyanperumal and Syusanna Koyfman, provide
additional information about the method’s application, and explore the topics of models
of DNA substitution, the testing of these models, likelihood ratio tests, and a comparison
with the maximum parsimony method.
Phylogenetics is an area of high interest in bioinformatics. Construction of a phylogenetic
tree provides insights into the origins of genes and their protein products. A tree can
assist in formulating questions about the behavior of a gene or protein in various
organisms. A number of statistical methodologies have been devised to construct a
phylogenetic tree from sequence data for a set of related genes/proteins in one or more
organisms. One of these, the maximum likelihood method, is reviewed by Huelsenbeck
and Crandall (1). What follows is an interpretation of the methodology as they describe it,
supplemented by an additional reference and two examples in Mathematica.
The maximum likelihood method was said (1) to have been first described in 1922 by
English statistician RA Fisher. It was apparently not widely used until after 1990, when
increased computing power and optimizations of the method itself made its application
more practical. It is said (3) to be one of the most popular sequence-based criteria for
evaluating trees, along with parsimony and compatibility. It is inherently more
computationally expensive than parsimony but various newer optimizations have been
proposed to further address this problem.
General Approach
The job of the maximum likelihood method is to construct a probability model that best
describes a set of known data. As model parameters are changed, the method seeks those
parameter values that maximize the probability of observing the actual data. Felsenstein
(2) notes that when the method is applied to an evolutionary tree, the likelihood is the
probability of evolving the observed sequences given the hypothesis (a proposed tree). It
was emphasized that the likelihood is not the probability that the tree itself is correct.
To explain how a likelihood calculation works, Huelsenbeck and Crandall (1) start with
the now-familiar binomial distribution to describe a coin tossing experiment. The
distribution formula specifies the probability of getting h successes, given n trials and a
probability of success of p.
$$P(h \mid n, p) = \binom{n}{h}\, p^{h} (1-p)^{n-h}$$
They then turn this calculation around, even though the right-hand side stays the same.
In the conventional binomial probability calculation, the probability p of success is fixed and the
number of successes in n trials is what we want. In a likelihood calculation, the numbers
of successes and trials are fixed and the probability of a single success is what we
are trying to find. The left-hand side of the equation above becomes L(p|h,n): the
likelihood of the value p, given an outcome of h successes in n trials. The
calculation is repeated for a range of reasonable values of p and the result, the likelihood L of
p, is plotted against p. The value of p for which L reaches a maximum is taken to be
the “best” value of p.
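The grid search described above can be sketched in a few lines of Python. This is only an illustration, not the Mathematica code from the accompanying notebook; the observed counts (7 heads in 10 tosses) and the grid spacing of 0.01 are assumptions chosen for the example.

```python
from math import comb

def binomial_likelihood(p, h, n):
    """L(p | h, n): probability of h successes in n trials if each succeeds with probability p."""
    return comb(n, h) * p**h * (1 - p)**(n - h)

# Hypothetical data: 7 heads in 10 tosses.
h, n = 7, 10

# Evaluate the likelihood on a grid of candidate values of p.
grid = [i / 100 for i in range(101)]
likelihoods = [binomial_likelihood(p, h, n) for p in grid]

# The grid point with the largest likelihood is the maximum likelihood estimate.
p_hat = max(zip(likelihoods, grid))[1]
print(p_hat)
```

Plotting `likelihoods` against `grid` would give the single-peaked curve the authors describe.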
Data from a hypothetical coin toss experiment is then used to illustrate the method. In this
example, the goal is to determine the probability of heads, given an outcome of h heads in
n trials. I have provided Example 1 in the accompanying Mathematica file,
Proj_Group3JaniceJones.nb, to demonstrate this use of maximum likelihood. The authors
mention that in this simple experiment, an alternate to the computational solution is to
determine the value of p for which the slope of the likelihood L is zero. This works in the
simple experiment but may be more difficult to apply when the probability model
becomes more complex.
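For the coin toss, the zero-slope condition can be worked out by hand. The short derivation below takes the logarithm first, which does not move the maximum and makes the derivative easier; it is a standard result rather than something taken from the notebook.

$$\ln L(p \mid h, n) = \ln \binom{n}{h} + h \ln p + (n - h) \ln (1 - p)$$

$$\frac{d \ln L}{dp} = \frac{h}{p} - \frac{n - h}{1 - p} = 0 \quad\Longrightarrow\quad \hat{p} = \frac{h}{n}$$

So the maximum likelihood estimate is simply the observed proportion of heads.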
Second Simple Example
I have prepared a second example that has a slightly more complex probability model that
is based on permutations. Here, we choose 10 marbles from a jar that has a total of 17
marbles of three different colors. We are told the colors of the marbles we picked. The
problem is to determine how many marbles of each color still remain in the jar. The goal
of the probability function used in the solution was not to calculate the probability that a
random marble would be of a particular color, but the probability that 10 marbles would
be of the specified colors. The test parameters (colors of the 7 remaining marbles) that
resulted in the highest probability for the 10 marble colors picked were considered to be
the “most likely” values. See Example 2 in Proj_JaniceJones.nb for additional comments
and code.
One interesting aspect of this problem is that the solution does not require the likelihood
measure to be a probability value; a value that is directly proportional to the probability
works just as well. When the probability function is simplified to return a permutation
count rather than a probability, the “most likely colors” part of the result is the same,
even though the maximum likelihood value itself has changed. Another interesting aspect
is that since any two of the parameters (number of marbles of each color) can vary
independently, a simple graphical solution is not an option.
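A minimal Python sketch of this search is given below. It is not the notebook code: it assumes the draw was 5 red, 3 green, and 2 yellow marbles (the counts quoted later in this section), and it scores each hypothesis with a count of combinations rather than permutations. As noted above, any value proportional to the probability gives the same "most likely" answer.

```python
from math import comb
from itertools import product

# Assumed observed draw: 5 red, 3 green, 2 yellow.
drawn = {"R": 5, "G": 3, "Y": 2}
jar_total = 17
n_drawn = sum(drawn.values())        # 10 marbles drawn
n_left = jar_total - n_drawn         # 7 marbles still in the jar

def draw_count(left):
    """Number of ways to draw the observed colors if `left` marbles remain in the jar.

    This is proportional to the probability of the draw, because the
    denominator C(17, 10) is the same for every hypothesis.
    """
    ways = 1
    for color, k in drawn.items():
        ways *= comb(k + left[color], k)
    return ways

# Enumerate every way to split the 7 remaining marbles among the three colors.
hypotheses = [
    {"R": r, "G": g, "Y": n_left - r - g}
    for r, g in product(range(n_left + 1), repeat=2)
    if r + g <= n_left
]

best = max(hypotheses, key=draw_count)
print(best, draw_count(best) / comb(jar_total, n_drawn))
```

Dividing by `comb(jar_total, n_drawn)` converts the count into a probability without changing which hypothesis wins.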
Application to Phylogenetics
How is this approach used in the construction of a phylogenetic tree? The data we are
starting with, instead of a count of heads in n trials or marbles of different colors, is a set
of “s” sequences aligned in “n” positions. For the next example, to keep the possibilities
simple, say the sequences are of nucleic acids. Then a data point is the set of bases in a
given aligned position. For the following tiny sample alignment,
AG
AC
TG
there are two data points: {A,A,T} for position 1 and {G,C,G} for position 2. For each of
these positions, there are $r = 4^{s}$ possible values, where s is the number of sequences being
aligned. Since there are 3 sequences in this example, there are $4^{3} = 64$ possible values for each
position: {A,A,A},{A,A,T},{A,A,G},{A,A,C},{A,T,A}...etc. Each of the 64 possible
patterns can be assigned an ID number (1 to 64) to distinguish it from the other patterns.
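Disregarding data point order for now, the probability of observing particular values for
the collection of data points (positions) in the alignment can be described by the
multinomial distribution (1):
$$P(n_1, n_2, \ldots, n_r \mid p_1, p_2, \ldots, p_r) = \frac{n!}{n_1!\, n_2! \cdots n_r!} \prod_{i=1}^{r} p_i^{n_i}$$
where $n_i$ is the number of positions showing pattern $i$, $p_i$ is the probability of pattern $i$, and $n = \sum_i n_i$ is the number of aligned positions.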
The authors compare the 64 possible outcomes for each data point with the coin toss that
had only two possible outcomes. Although maximum likelihood was successfully applied
to the marble problem, its solution did not require that this same model be used. The
marble problem can be said to have only one data point, but with many possible
outcomes. Each outcome was one of the possible permutations of the choice of 10
marbles.
But back to the sequence. The probability question changes from “What is the probability
of h heads in n tosses?” or “What is the probability of drawing 5R, 3G, and 2Y marbles?”
The new question is “What is the probability of getting n1 occurrences of data point (1),
n2 occurrences of data point (2),... n64 of data point (64), given 3 sequences of 2 bases
each?” As was done with the coin toss and marble examples, the likelihood function is a
probability function that estimates the probability for data that is already known.
A trivial probability function estimates the probability (of getting the sequence data
points) based on 64 parameters. Each parameter is the probability of getting a particular
base combination for the three sequences in a random position. The most likely value for
each parameter turns out to match the proportion of the time that the sequence
combination occurs over all the positions in the alignment. In the small 2-residue, 3-sequence example, the parameters $p_i$ are 0 for all but two of the 64 possibilities. For each
of these, the maximum likelihood estimate for $p_i$ is $n_i/n$ or ½.
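The estimate for the tiny alignment can be checked with a short Python sketch. The variable names are illustrative; the only input is the three-sequence alignment shown earlier.

```python
from collections import Counter

# The three aligned sequences from the example above.
alignment = ["AG", "AC", "TG"]

# Each column of the alignment is one data point (site pattern).
patterns = list(zip(*alignment))        # [('A', 'A', 'T'), ('G', 'C', 'G')]
n = len(patterns)

# Under the unconstrained multinomial model, the ML estimate of each
# pattern probability is its observed frequency n_i / n.
counts = Counter(patterns)
p_hat = {pattern: count / n for pattern, count in counts.items()}
print(p_hat)
```

The 62 patterns that never occur are absent from `p_hat`, which corresponds to an estimate of 0 for each of them.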
Need for More Complex Model
While this type of likelihood equation does call attention to sequences, the probability
question that it addresses is not biologically interesting. It only differs from the coin toss
in the number of possible outcomes for each experiment. To become interesting, the
model that defines the probability of observing the given site patterns must become much
more complex. One interesting model is a definition of a phylogenetic tree that contains
nodes, branch lengths, and tips. Each tip represents a sequence and each node represents
a convergence of the sequences on its adjacent branches. The resulting tree, in most
circumstances, reflects the expected “tree of life” in its picture of how various species are
related. (Humans would be seen as most closely related to chimpanzee, then other
primates, then mouse/rat, etc. ) For this discussion, all the sequences are assumed to be
nucleotides. The same principles could be applied to protein sequences but the model
becomes more complex.
For the likelihood probability function, branch lengths of the tree are specified as a
function of the expected number of changes per site and a model of sequence change (1).
A given site pattern can be represented as the tips of branches that are connected directly
or indirectly. At each internal node, any of the four nucleotides is possible. Under this
model, a tree of four sequences has three internal nodes, so there are $4^{3} = 64$ possible
assignments of nucleotides to those nodes that may contribute to a site pattern representing one
position in a hypothetical alignment of four sequences.
Such a node arrangement is illustrated in the following figure copied from (1). It
represents a single position in which the aligned sequences have bases G, G, T, and T
respectively. The nodes i, j, and k may each have any of the four nucleotides, allowing a
total of 64 different trees that have the leaves shown.
Figure 1 – copied from (1)
Instead of using a simple probability value for the site pattern, as is implied by the
multinomial probability model, the phylogenetic model calculates the site pattern
probability as a sum of terms, one term for each possible configuration of nodes that
could lead to the pattern. Here, a configuration of nodes refers to the assignment of
nucleotides to the nodes. Each term is itself dependent on the “equilibrium frequency” of
the “base” node, the presumed length of all the branches that lead (directly or indirectly)
to the leaves, the probability of observing two given nucleotides at the ends of the
branches, and other unspecified parameters. The “base” node (my term) is the one
furthest from the “leaves”; in the above diagram, it is node k. Its equilibrium frequency
has a different value for each of the four nucleotides that is assigned to the node. It is
typically determined by the overall base composition of the sequences and is intended to
estimate the probability that the nucleotide, going back in evolutionary time, was of that
base.
The following equation, copied from (1), represents a probability value for the sequences
observed in a single position, illustrated in Figure 1. The values v,
as in the diagram, represent branch lengths; the value π represents the equilibrium
frequency for node k; the values θ represent “additional parameters” of the
nucleotide substitution model. It is presented here just to illustrate the complexity and
potential computational expense of a tree likelihood calculation.
As extensive as this equation appears, it only represents one position in an alignment that
contains just four sequences. As the number of sequences increases, the number of
terms required by the above probability equation goes up exponentially. According
to Felsenstein (2), the number of terms becomes $2^{2n-2}$, where n is the number of
sequences; if we are working with 10 sequences (a fairly modest number), we would
have $2^{18}$ terms for just one aligned position.
The likelihood of a tree for the entire sequence alignment is the product of these
estimated probability values taken across all the sequence positions. Each sequence
contributes equally and it is assumed that all positions are independent. This is to say that
the above probability function for a single candidate tree involves a very large number of
computations. If all the possible trees are considered, the computation becomes
impossibly long.
For the maximum likelihood method to be a useable phylogenetic tool, some
optimizations were needed. Several interesting ones were proposed by Felsenstein (2) in
1981.
Felsenstein’s first optimization reduced the number of summations in the probability
function for the tree represented by a single sequence position. It is best explained by
referring to the tree diagram (Figure 1). In the “full tree” calculation shown above, the
left side of the tree is kept static while all possible combinations of the right side are
calculated. Then one change is made to the left side, and all possible combinations on the
right are calculated again. This continues for all possible combinations on the left side.
Then the base at node k is changed and all the calculations on both sides are repeated.
In the optimized calculation, the likelihood (probability function) is first calculated
separately for each of the “outermost” nodes. In Figure 1, these are nodes labeled i and j.
For each of these nodes, four likelihood values are calculated, one for each possible value
of its parent node k. For node i, a likelihood value would be based on its two leaves and
the base of node k. Then the likelihood for the parent node is the summation, over each of
its possible values (A, T, G, or C), of the product of the likelihood values of its child nodes and the equilibrium
frequency for node k. If $\pi_k$ is the equilibrium frequency of the base at node k, and $L_{ik}$ and $L_{jk}$
represent likelihood values of nodes i and j for a given value of k, then the likelihood of
the tree would be calculated as the summation over k of $\pi_k L_{ik} L_{jk}$. Calculation of the L
values at each of nodes i and j would require 16 summations and the final likelihood
calculation would require 4 summations, for a total of 36 summations.
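The saving can be illustrated with a small Python sketch that computes the same site likelihood both ways. Several things here are assumptions made for the example: the tree shape (root k with children i and j, node i above the two G leaves and node j above the two T leaves), the branch lengths, and the use of the Jukes-Cantor model for the substitution probabilities. It is a sketch of the idea, not Felsenstein's program.

```python
from math import exp
from itertools import product

BASES = "ACGT"

def jc_prob(x, y, v):
    """Jukes-Cantor probability that base x is replaced by base y over branch length v."""
    e = exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

# Assumed tree: root k has children i and j; i carries leaves 1 and 2, j carries leaves 3 and 4.
leaves = {1: "G", 2: "G", 3: "T", 4: "T"}
v = {1: 0.1, 2: 0.2, 3: 0.15, 4: 0.05, "i": 0.3, "j": 0.25}   # hypothetical branch lengths
pi = {b: 0.25 for b in BASES}                                   # JC equilibrium frequencies

def brute_force():
    """Sum over all 4^3 = 64 assignments of bases to nodes i, j and k."""
    total = 0.0
    for bi, bj, bk in product(BASES, repeat=3):
        total += (pi[bk]
                  * jc_prob(bk, bi, v["i"]) * jc_prob(bk, bj, v["j"])
                  * jc_prob(bi, leaves[1], v[1]) * jc_prob(bi, leaves[2], v[2])
                  * jc_prob(bj, leaves[3], v[3]) * jc_prob(bj, leaves[4], v[4]))
    return total

def conditional(parent_base, branch, left, right):
    """Likelihood of a two-leaf subtree, given the base at its parent node."""
    return sum(jc_prob(parent_base, b, v[branch])
               * jc_prob(b, leaves[left], v[left])
               * jc_prob(b, leaves[right], v[right])
               for b in BASES)

def pruning():
    """Sum over k of pi_k * L_ik * L_jk, using 16 + 16 + 4 = 36 terms in total."""
    return sum(pi[k] * conditional(k, "i", 1, 2) * conditional(k, "j", 3, 4)
               for k in BASES)

print(brute_force(), pruning())
```

Both functions return the same number; the difference is that the pruning version reuses the subtree values instead of recomputing them for every combination.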
Felsenstein’s second optimization involves the choice of tree topologies to test. As stated
in his paper, two million topologies are possible for a tree that represents just 10
sequences. The optimization was to first build the tree with only two sequences, and then
to add just one sequence at a time without altering the arrangement of the existing nodes.
A stated disadvantage of this approach was that the final topology could vary, depending
on the order in which sequences were added.
A third optimization from the same paper involved the “Pulley Principle”. This was an
observation that the probability of a tree did not change if the total length of two branches
attached to the same node was kept constant. For example, in Figure 1, the probability of the
tree will not change as long as the sum of the branch lengths v5 and v6 stays the same. This permitted
incremental changes to be made to a branch length until no further improvement
appeared in the probability value.
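The Pulley Principle can be checked numerically on the simplest possible case. The sketch below assumes two tips joined through a root, the Jukes-Cantor model, and a hypothetical total path length of 0.5; it slides the root along the path and evaluates the likelihood at each position.

```python
from math import exp

BASES = "ACGT"

def jc_prob(x, y, v):
    """Jukes-Cantor probability that base x is replaced by base y over branch length v."""
    e = exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def two_tip_likelihood(a, b, v_left, v_right):
    """Likelihood of observing bases a and b at two tips joined through a root."""
    return sum(0.25 * jc_prob(k, a, v_left) * jc_prob(k, b, v_right) for k in BASES)

# Slide the root along the path while the total length stays at 0.5.
values = [two_tip_likelihood("A", "G", x, 0.5 - x) for x in (0.0, 0.1, 0.25, 0.4, 0.5)]
print(values)
```

All five values agree up to rounding, so only the sum of the two branch lengths matters under this reversible model.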
Citations
1. Huelsenbeck, J., Crandall, K. Phylogeny Estimation and Hypothesis Testing Using
Maximum Likelihood. Annu. Rev. Ecol. Syst., 1997, 28:437-66.
2. Felsenstein, J. Evolutionary trees from DNA sequences: a maximum likelihood
approach. J. Mol. Evol., 1981, 17:368-76.
3. Kim, J., Warnow, T. Tutorial on Phylogenetic Tree Estimation.
http://kim.bio.upenn.edu/~jkim/media/ISMBtutorial.pdf
P&S Project Part II:
DNA Substitution Models
By: Shanthi Iyanperumal
Phylogeny
Phylogeny is the evolution of a genetically related group of organisms. It is the study of
relationships among a collection of "things" (genes, proteins, organs, ...) that are derived from a
common ancestor.
General Model of DNA Substitution
Maximum likelihood evaluates the probability that the chosen evolutionary model will have
generated the observed sequences. Phylogenies are then inferred by finding those trees that
yield the highest likelihood.
The rate matrix for a general model of DNA substitution is given by
$$Q = \{q_{ij}\} = \begin{pmatrix} \cdot & r_2\pi_C & r_4\pi_G & r_6\pi_T \\ r_1\pi_A & \cdot & r_8\pi_G & r_{10}\pi_T \\ r_3\pi_A & r_7\pi_C & \cdot & r_{12}\pi_T \\ r_5\pi_A & r_9\pi_C & r_{11}\pi_G & \cdot \end{pmatrix}$$
The rows and columns are ordered A, C, G and T. The matrix gives the rate of change from
nucleotide i (arranged along the rows) to nucleotide j (along the columns).
For example, $r_2\pi_C$ gives the rate of change from A to C.
Let P(v,s) be the transition probability matrix where $p_{ij}(v,s)$ is the probability that nucleotide i
changes into j over branch length v. The vector s contains the parameters of the substitution
model (e.g. $\pi_A$, $\pi_C$, $\pi_G$, $\pi_T$, $r_1$, $r_2$, …).
For two-state case, to calculate the probability of observing a change over a branch of length
v, the following matrix calculation is performed:
$$P(v, s) = e^{Qv}$$
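The matrix exponential can be sketched in plain Python. The example below is an illustration under stated assumptions: it uses a truncated Taylor series with scaling and squaring (one of several ways to compute a matrix exponential), and it takes a Jukes-Cantor rate matrix scaled to one expected change per unit of branch length as the test case, because that model has a known closed form to compare against. The function names are hypothetical.

```python
from math import exp

def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_exp(q, v, squarings=10, terms=12):
    """Approximate exp(Q * v) with a truncated Taylor series plus scaling and squaring."""
    n = len(q)
    scale = v / 2**squarings
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes a^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# Jukes-Cantor rate matrix: every off-diagonal rate is 1/3, every row sums to zero.
Q = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
P = mat_exp(Q, 0.5)
print(P[0])
```

For this Q the diagonal of P should match the closed form 1/4 + 3/4 exp(-4v/3), which gives a quick sanity check on the numerical routine.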
Calculation of likelihood of molecular sequences:
For DNA sequence comparison the model has 2 parts, the base composition and the
process. The composition is just the proportion of the four nucleotides A, C, G, T.
Example1:
Likelihood of a single sequence with two nucleotides AC
If the model is the Jukes-Cantor model, which has a base composition of ¼ for each
nucleotide, then the likelihood will be 1/4 × 1/4 = 1/16. If the model has a composition
of 40% A and 10% C, the likelihood of the sequence will be 0.4 × 0.1 = 0.04.
If we take the 16 possible two-nucleotide sequences and add up their likelihoods,
the sum is 1. For any model, the sum of the likelihoods of all the
different data possibilities should be 1.
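Both statements in Example 1 can be verified with a few lines of Python. The text fixes only the frequencies of A (0.4) and C (0.1); the values used below for G and T are assumptions chosen so that the four frequencies sum to 1.

```python
from itertools import product

# A and C come from the example; G and T are hypothetical fill-ins.
composition = {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3}

def sequence_likelihood(seq):
    """Likelihood of a single sequence under a composition-only model."""
    like = 1.0
    for base in seq:
        like *= composition[base]
    return like

print(sequence_likelihood("AC"))

# Summing over all 16 possible two-base sequences should give 1.
total = sum(sequence_likelihood(a + b) for a, b in product("ACGT", repeat=2))
print(total)
```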
Example2:
Likelihood of a one branch tree between two aligned sequences
Sequence 1
CCAT
Sequence 2
CCGT
The other part of the model, the process part, is needed if we have more than one
sequence related by a tree. The process might be described by sentences, or by a
matrix of numbers, describing how the nucleotides change from one to another. Let the
composition part of the model be denoted by π = [0.1, 0.4, 0.2, 0.3]. The order of the
bases is A, C, G, and T. There are 16 possible changes from one nucleotide to the
other. The changes can be represented as a 4 × 4 matrix.
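$$P = \begin{array}{c|cccc} & A & C & G & T \\ \hline A & 0.976 & 0.01 & 0.007 & 0.007 \\ C & 0.002 & 0.983 & 0.005 & 0.01 \\ G & 0.003 & 0.01 & 0.979 & 0.007 \\ T & 0.002 & 0.013 & 0.005 & 0.979 \end{array}$$
(Rows give the starting nucleotide and columns the resulting nucleotide.)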
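The likelihood of going from sequence 1 to sequence 2 is:
$$L = \pi_C P_{C \to C} \cdot \pi_C P_{C \to C} \cdot \pi_A P_{A \to G} \cdot \pi_T P_{T \to T}$$
$$= 0.4 \times 0.983 \times 0.4 \times 0.983 \times 0.1 \times 0.007 \times 0.3 \times 0.979$$
$$= 0.0000300$$
(With the rounded matrix entries shown, the product evaluates to about 0.0000318; the quoted value 0.0000300 presumably comes from unrounded entries.)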
In the above example we did not consider branch lengths. Intuitively for short branch
lengths the probability of a base change is low and for long branch lengths it is high.
Let’s assume the matrix we have chosen describes a branch with a Certain Evolutionary
Distance (CED). The likelihood we calculated was for 1 CED.
The likelihood for the same alignment for 2 CED units is found by multiplying matrix P
by itself.
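$$P^{2} = \begin{array}{c|cccc} & A & C & G & T \\ \hline A & 0.953 & 0.02 & 0.013 & 0.015 \\ C & 0.005 & 0.966 & 0.01 & 0.02 \\ G & 0.007 & 0.02 & 0.959 & 0.015 \\ T & 0.005 & 0.026 & 0.01 & 0.959 \end{array}$$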
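The likelihood for 2 CED units is:
$$L = \pi_C P^{2}_{C \to C} \cdot \pi_C P^{2}_{C \to C} \cdot \pi_A P^{2}_{A \to G} \cdot \pi_T P^{2}_{T \to T}$$
$$= 0.4 \times 0.966 \times 0.4 \times 0.966 \times 0.1 \times 0.013 \times 0.3 \times 0.959$$
$$= 0.0000559$$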
As the branch length increases the values on the diagonal decrease and the other
values increase because change becomes more likely than being the same.
The table lists the likelihoods for increasing branch lengths.
Branch length (CED units)    Likelihood
1                            0.0000300
2                            0.0000559
3                            0.0000782
10                           0.000162
15                           0.000177
20                           0.000175
30                           0.000152
The likelihood rises to a maximum somewhere between 15 and 20 CED units.
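The branch-length sweep can be reproduced approximately in Python by raising the 1-CED matrix from Example 2 to successive powers. This is a sketch under one important caveat: it starts from the three-decimal matrix entries as printed, so its output will differ somewhat from the tabulated values, which appear to be based on unrounded entries. The helper names are illustrative.

```python
BASES = "ACGT"
pi = {"A": 0.1, "C": 0.4, "G": 0.2, "T": 0.3}

# Transition probabilities for 1 CED unit (rows = from, columns = to), as printed in Example 2.
P1 = [
    [0.976, 0.010, 0.007, 0.007],
    [0.002, 0.983, 0.005, 0.010],
    [0.003, 0.010, 0.979, 0.007],
    [0.002, 0.013, 0.005, 0.979],
]

def mat_mul(a, b):
    """Multiply two 4 x 4 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def mat_pow(m, k):
    """Return m raised to the k-th power by repeated multiplication."""
    result = [[float(i == j) for j in range(4)] for i in range(4)]
    for _ in range(k):
        result = mat_mul(result, m)
    return result

def alignment_likelihood(seq1, seq2, p):
    """Product over sites of pi[x] * P[x -> y] for aligned bases x and y."""
    like = 1.0
    for x, y in zip(seq1, seq2):
        like *= pi[x] * p[BASES.index(x)][BASES.index(y)]
    return like

likelihoods = {k: alignment_likelihood("CCAT", "CCGT", mat_pow(P1, k))
               for k in (1, 2, 3, 10, 15, 20, 30)}
for k, value in likelihoods.items():
    print(k, f"{value:.3g}")
```

Evaluating more branch lengths between the tabulated ones would locate the maximum more precisely than the table does.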
Example 3 :
Likelihood of a tree with four taxa
Assume that we have the aligned nucleotide sequences for four taxa:
The possible trees are
We want to evaluate the likelihood of the unrooted tree represented by the nucleotides of
site j in the sequence and shown below:
Since most of the models currently used are time-reversible, the likelihood of the tree is
generally independent of the position of the root. Therefore it is convenient to root the tree at
an arbitrary internal node as done in the Fig. below,
Under the assumption that nucleotide sites evolve independently (the Markovian model of
evolution), we can calculate the likelihood for each site separately and combine the likelihood
into a total value towards the end. To calculate the likelihood for site j, we have to consider all
the possible scenarios by which the nucleotides present at the tips of the tree could have
evolved. So the likelihood for a particular site is the summation of the probabilities of every
possible reconstruction of ancestral states, given some model of base substitution. So in this
specific case all possible nucleotides A, G, C, and T occupying nodes (5) and (6), or 4 x 4 = 16
possibilities :
In the case of protein sequences each site may occupy 20 states (that of the 20 amino acids)
and thus 400 possibilities have to be considered. Since any one of these scenarios could have
led to the nucleotide configuration at the tip of the tree, we must calculate the probability of
each and sum them to obtain the total probability for each site j.
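Written out, the sum over the 16 reconstructions has the following general shape. The notation is an assumption made for illustration, since the original figure is not reproduced here: $s_1, \ldots, s_4$ are the nucleotides observed at the four tips for site j, tips 1 and 2 are taken to attach to node (5) and tips 3 and 4 to node (6), the tree is rooted at node (5), $v_1, \ldots, v_4$ are the tip branch lengths, $v_5$ is the length of the internal branch, and $P_{xy}(v)$ is the probability of changing from x to y along a branch of length v.

$$L(j) = \sum_{x \in \{A,C,G,T\}} \; \sum_{y \in \{A,C,G,T\}} \pi_x \, P_{x s_1}(v_1) \, P_{x s_2}(v_2) \, P_{x y}(v_5) \, P_{y s_3}(v_3) \, P_{y s_4}(v_4)$$

Each of the 16 terms is the probability of one particular pair of ancestral states (x at node 5, y at node 6) together with the observed tips.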
The likelihood for the full tree is then the product of the likelihoods at each site.
$$L = L(1) \times L(2) \times \cdots \times L(N) = \prod_{j=1}^{N} L(j)$$
Since the individual likelihoods are extremely small numbers it is convenient to sum the log
likelihoods at each site and report the likelihood of the entire tree as the log likelihood.
$$\ln L = \ln L(1) + \ln L(2) + \cdots + \ln L(N) = \sum_{j=1}^{N} \ln L(j)$$
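The practical reason for working with logarithms is easy to demonstrate. In the sketch below the per-site likelihoods are made-up values for a hypothetical alignment of 2000 sites; the point is only to show what happens to the raw product in double-precision arithmetic.

```python
import math

# Hypothetical per-site likelihoods for an alignment of 2000 sites.
site_likelihoods = [1e-4, 3e-5, 2e-6, 5e-5] * 500

# The raw product is far smaller than the smallest representable double ...
product_form = math.prod(site_likelihoods)

# ... whereas the sum of logs stays comfortably within range.
log_likelihood = sum(math.log(x) for x in site_likelihoods)
print(product_form, log_likelihood)
```

The product underflows to zero, so two different trees could not be compared with it, while their log likelihoods remain distinguishable.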
DNA substitution Models
The use of maximum likelihood (ML) algorithms in developing phylogenetic hypotheses
requires a model of evolution. The frequently used General Time Reversible (GTR) family of
nested models encompasses 64 models with different combinations of parameters for DNA
site substitution. The models are listed here from the least complex to the most parameter rich.
Jukes-Cantor (JC):
Equal base frequencies, all substitutions equally likely. 1 level of nesting.
Felsenstein 1981 (F81)
Variable base frequencies, all substitutions equally likely. 1 level of nesting.
Kimura 2-parameter (K80)
Equal base frequencies, variable transition and transversion frequencies.
2 levels of nesting.
Hasegawa-Kishino-Yano (HKY)
Variable base frequencies, variable transition and transversion frequencies.
2 levels of nesting.
Tamura-Nei (TrN):
Variable base frequencies, equal transversion frequencies, variable transition frequencies
Kimura 3-parameter (K3P)
Equal base frequencies, equal transition frequencies, variable transversion frequencies (the K81uf variant allows variable base frequencies)
Transition Model (TIM)
Variable base frequencies, variable transitions, transversions equal
Transversion Model (TVM)
Variable base frequencies, variable transversions, transitions equal
Symmetrical Model (SYM)
Equal base frequencies, symmetrical substitution matrix (A to T = T to A)
General Time Reversible (GTR)
Variable base frequencies, symmetrical substitution matrix. 6 levels of nesting.
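The nesting of these models can be made concrete by building their rate matrices from one general constructor. The Python sketch below is an illustration, not code from any phylogenetics package: the function names, the parameter `kappa` (transition/transversion rate ratio), and the numerical values are all assumptions. It builds a reversible rate matrix with entries q_ij = r_ij * pi_j and shows JC, K80, and HKY as special cases.

```python
BASES = "ACGT"

def gtr_rate_matrix(pi, rates):
    """Build Q with q_ij = r_ij * pi_j for i != j, using symmetric exchangeabilities r_ij.

    `rates` maps unordered base pairs such as "AC" to an exchangeability parameter.
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                q[i][j] = rates["".join(sorted(x + y))] * pi[j]
        q[i][i] = -sum(q[i])          # diagonal chosen so each row sums to zero
    return q

def hky(pi, kappa):
    """HKY: one rate for transversions, kappa times that rate for transitions (A<->G, C<->T)."""
    rates = {pair: 1.0 for pair in ("AC", "AG", "AT", "CG", "CT", "GT")}
    rates["AG"] = rates["CT"] = kappa
    return gtr_rate_matrix(pi, rates)

equal = [0.25] * 4
jc = hky(equal, 1.0)                       # Jukes-Cantor: equal frequencies, one rate
k80 = hky(equal, 2.0)                      # Kimura 2-parameter: equal frequencies, kappa != 1
hky85 = hky([0.1, 0.4, 0.2, 0.3], 2.0)     # HKY: unequal frequencies, kappa != 1
```

Passing six distinct exchangeabilities and unequal frequencies to `gtr_rate_matrix` gives the full GTR model; each simpler model is obtained by constraining some of those parameters to be equal.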
In addition to models describing the rates of change from one nucleotide to another, there are
models to describe rate variation among sites in a sequence. The following are the two most
commonly used models.
Gamma Distribution (G)
Rate heterogeneity can be accommodated by specifying that the rate of evolution varies across
different sites and is distributed according to a gamma distribution.
Proportion of Invariable Sites (I)
A simpler way of accounting for rate heterogeneity is to specify that a fixed proportion of sites are invariable, i.e.
have a zero rate of evolution. The parameter I is the proportion of such static, unchanging sites in a dataset.
Amino acid Substitution Models
The divergence among sequences can be modeled with a mutation matrix. The matrix,
denoted by M, describes the probabilities of amino acid mutations for a given period of
evolution.
This corresponds to a model of evolution in which amino acids mutate randomly and
independently from one another but according to some predefined probabilities depending on
the amino acid itself. This is a Markovian model of evolution and while simple, it is one of the
best models. Intrinsic properties of amino acids, like hydrophobicity, size, charge, etc. can be
modeled by appropriate mutation matrices. Dependencies which relate one amino acid
characteristic to the characteristics of its neighbors are not possible to model through this
mechanism. Amino acids appear in nature with different frequencies. These frequencies are
denoted by $f_i$ and correspond to the steady state of the Markov process defined by the
matrix M, i.e., the vector f is any of the columns of $M^{\infty}$ or the eigenvector of M whose
corresponding eigenvalue is 1 (Mf = f). This model of evolution is symmetric, i.e., the probability
of having an i which mutates to a j is the same as starting with a j which mutates into an i.
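The steady-state property Mf = f can be illustrated with a toy example. The Python sketch below uses a hypothetical 3-letter alphabet instead of the 20 amino acids, and finds f by applying M repeatedly, which mimics taking a column of a very high power of M. The matrix values are invented for the illustration and are only meant to show the steady state, not the symmetry property.

```python
# Hypothetical 3-state mutation matrix. Column j holds the probabilities of what
# state j becomes, so every column sums to 1 and M applied to p advances p one step.
M = [
    [0.90, 0.05, 0.10],
    [0.06, 0.90, 0.10],
    [0.04, 0.05, 0.80],
]

def step(m, p):
    """Return the matrix-vector product m p."""
    return [sum(m[i][j] * p[j] for j in range(len(p))) for i in range(len(m))]

# Repeated application of M converges to the steady-state vector f with M f = f.
f = [1.0, 0.0, 0.0]
for _ in range(2000):
    f = step(M, f)
print(f)
```

Starting from a different initial vector gives the same `f`, which is why any column of a high power of M can serve as the frequency vector.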
The following is a list of amino acid substitution models which use matrices.
Empirical substitution models
In contrast to DNA substitution models, amino acid replacement models have concentrated on
the empirical approach. Dayhoff and coworkers developed a model of protein evolution which
resulted in the development of a set of widely used replacement matrices. In the Dayhoff
approach, replacement rates are derived from alignments of protein sequences that are at
least 85% identical; this constraint ensures that the likelihood of a particular mutation being the
result of a set of successive mutations is low. One of the main uses of the Dayhoff matrices
has been in database search methods where, for example, the matrices P(0.5), P(1) and
P(2.5) (known as the PAM50, PAM100 and PAM250 matrices) are used to assess the
significance of proposed matches between target and database sequences. However, the
implicit rate matrix has been used for phylogenetic applications.
PAM matrices
In the definition of mutation, the matrix M implies a certain amount of mutation (measured in PAM
units). A 1-PAM mutation matrix describes an amount of evolution which will change, on
average, 1% of the amino acids. In mathematical terms this is expressed as a matrix M such
that
$$\sum_{i=1}^{20} f_i \,(1 - M_{ii}) = 0.01$$
The diagonal elements of M are the probabilities that a given amino acid does not change, so
$(1 - M_{ii})$ is the probability of mutating away from i.
If we have a probability or frequency vector p, the product Mp gives the probability vector or
the expected frequency of p after an evolution equivalent to 1-PAM unit. Or, if we start with
amino acid i (a probability vector which contains a 1 in position i and 0s in all others), $M_{*i}$ (the
ith column of M) is the corresponding probability vector after one unit of random evolution.
Similarly, after k units of evolution (what is called k-PAM evolution) a frequency vector p will be
changed into the frequency vector $M^{k} p$. Notice that chronological time is not linearly
dependent on PAM distance. Evolution rates may be very different for different species and
different proteins.
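The PAM definitions can be tried out on a toy matrix. The Python sketch below assumes a hypothetical 3-letter alphabet with invented frequencies, and constructs M so that exactly 1% of positions are expected to change in one step; it then applies M 250 times to a vector that starts as a single letter. None of the numbers come from a real PAM matrix.

```python
f = [0.5, 0.3, 0.2]                           # hypothetical frequencies for a 3-letter alphabet
c = 0.01 / (1.0 - sum(x * x for x in f))      # scaled so that 1% of positions change per step

# Column j of M gives the fate of letter j after one PAM unit.
M = [[(1.0 - c * (1.0 - f[j])) if i == j else c * f[i] for j in range(3)] for i in range(3)]

def mat_vec(m, p):
    """Return the matrix-vector product m p."""
    return [sum(m[i][j] * p[j] for j in range(3)) for i in range(3)]

def evolve(p, k):
    """Return M^k p by applying M to p k times."""
    for _ in range(k):
        p = mat_vec(M, p)
    return p

# Expected fraction of positions that change in one step: sum of f_i * (1 - M_ii).
expected_change = sum(f[i] * (1.0 - M[i][i]) for i in range(3))
after_250 = evolve([1.0, 0.0, 0.0], 250)
print(expected_change, after_250)
```

After 250 steps the starting letter is still the most probable one, but the vector has moved a long way toward the background frequencies `f`.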
Dayhoff matrices
Dayhoff and coworkers presented a method for estimating the matrix M from the observation
of 1572 accepted mutations between 34 superfamilies of closely related sequences. Their
method was pioneering in the field. A Dayhoff matrix is computed from a 250-PAM mutation
matrix and is used for the standard dynamic programming method of sequence alignment.
The Dayhoff matrix entries D_ij are related to M^250 by
\[ D_{ij} = 10 \log_{10} \frac{(M^{250})_{ij}}{f_i} \]
where f_i is the frequency of amino acid i.
JTT matrices
Jones et al. and Gonnet et al. have used much the same methodology as Dayhoff, but with
modern databases. The Jones et al. model has been implemented for phylogenetic analyses
with some success. Jones et al. have also calculated an amino acid replacement matrix
specifically for membrane spanning segments. This matrix has remarkably different values
from the Dayhoff matrices, which are known to be biased toward water-soluble globular
proteins.
Other empirical models
Adachi and Hasegawa have implemented a general reversible Markov model of amino acid
replacement that uses a matrix derived from the inferred replacements in mitochondrial
proteins of 20 vertebrate species. The authors show that this model performs better than
others when dealing with mitochondrial protein phylogeny.
BLOSUM (Blocks substitution matrices)
Henikoff and Henikoff have used local, ungapped alignments of distantly related sequences to
derive the BLOSUM series of matrices. Matrices of this series are identified by a number after
the matrix (e.g. BLOSUM50), which refers to the minimum percentage identity of the blocks of
multiple aligned amino acids used to construct the matrix. These matrices are directly
calculated without extrapolations, and are analogous to transition probability matrices P(T) for
different values of T, estimated without reference to any rate matrix Q. The BLOSUM matrices
often perform better than PAM matrices for local similarity searches, but have not been widely
used in phylogenetics.
Poisson models
A simple, non-empirical model of amino acid replacement was proposed by Nei (1987). This
model assumes that replacements follow a Poisson distribution, and gives accurate estimates
of the number of amino acid replacements when species are closely related.
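As an illustration of the Poisson model, the standard Poisson correction converts the observed proportion p of differing sites into an estimated number of replacements per site, d = -ln(1 - p). The sketch below assumes this textbook form; the function name poisson_distance is hypothetical.

```python
import math


def poisson_distance(p):
    """Estimated replacements per site from the proportion p of differing sites.

    Uses the Poisson correction d = -ln(1 - p); p must lie in [0, 1).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return -math.log(1.0 - p)


# For closely related sequences the correction is small: d is close to p.
d_close = poisson_distance(0.10)
d_far = poisson_distance(0.60)
```

For small p the estimate stays close to p itself, which is consistent with the remark that the model works well for closely related species; for larger p the correction grows quickly.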
P&S Project Part III:
DNA Substitution Models
By: Syusanna Koyfman
Phylogenetic trees
If we assume that all life must come from a common origin, then closely related
species share a more recent common ancestor than distantly related species.
We can then show the relationship between species using a phylogenetic tree.
Fig. I.4.1 shows an example of a phylogenetic tree.
“The nodes represent taxonomic units, while the branches connecting them
reflect their relationships in terms of descent. The topology is the pattern of
branches found in a tree. The branch length is commonly used to indicate some
form of evolutionary distance represented by that branch. The actual, still existing
taxonomic units or operational taxonomic units are represented by nodes on the
tips of the branches, called external nodes. The other nodes are called internal
nodes.”1
1 http://rrna.uia.ac.be/~peter/doctoraat/evol.html
Fig. I.4.1. Example of a phylogenetic tree. The branch lengths in this type of tree
representation are given by the horizontal length only.
“A tree where a special node indicating the common ancestor to all OTUs is
present (the root) is called a rooted tree. An unrooted tree leaves the position of
the common ancestor unspecified. The total number of possible distinct,
unrooted trees for n sequences is given by (Penny et al., 1982; Li and Graur,
1991):
\[ N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!} \]”2
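To get a feeling for how fast the number of trees grows, the standard counting formulas can be evaluated directly: (2n-5)! / (2^(n-3) (n-3)!) for unrooted trees and (2n-3)! / (2^(n-2) (n-2)!) for rooted trees. This is a small sketch; the function names are hypothetical.

```python
from math import factorial


def count_unrooted_trees(n):
    """Number of distinct unrooted bifurcating trees for n >= 3 sequences."""
    if n < 3:
        raise ValueError("need at least 3 sequences")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def count_rooted_trees(n):
    """Number of distinct rooted bifurcating trees for n >= 2 sequences."""
    if n < 2:
        raise ValueError("need at least 2 sequences")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


# Tabulate the growth for a few values of n.
for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

Four sequences give only 3 unrooted trees, but ten sequences already give more than two million, which is why exhaustive evaluation of all topologies quickly becomes impractical.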
There are many methods for constructing phylogenetic trees. In this document I will
focus on the maximum parsimony and maximum likelihood methods, both of which
use all character data.
Maximum parsimony
“A maximum parsimony tree is the tree that requires the smallest number of
evolutionary changes to result in the set OTUs under study.”3
Sometimes we can find many trees with the same number of changes. Although
we consider all sites, not every site carries information about which tree is the
most parsimonious, so we “filter” out the sites that do not discriminate between
topologies. A site is informative only if it shows at least two different kinds of
residues, each present in at least two of the sequences. We find the most
parsimonious tree by choosing from all of the possible tree topologies.
“For each of these possible trees the ancestral sequences at each branching
point are reconstructed and the minimum number of evolutionary changes can be
calculated. Finally the tree requiring the smallest number of substitutions will be
chosen.”4
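The counting step can be sketched with Fitch's algorithm, a standard way of finding the minimum number of changes on a fixed tree; it is swapped in here for illustration and is not necessarily the procedure used by the quoted source. The example uses the first nine sites of the four example sequences given later in this document; the function names and the nested-tuple tree encoding are assumptions.

```python
def fitch(tree, site):
    """Return (possible ancestral states, minimum changes) for one site.

    tree is either a leaf name (str) or a pair (left, right) of subtrees;
    site maps each leaf name to the residue observed at this site.
    """
    if isinstance(tree, str):
        return {site[tree]}, 0
    left_states, left_cost = fitch(tree[0], site)
    right_states, right_cost = fitch(tree[1], site)
    common = left_states & right_states
    if common:
        # The children can agree on a state: no extra change is needed here.
        return common, left_cost + right_cost
    # The children disagree: one change is required on this branching point.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[j] for name, seq in alignment.items()})[1]
        for j in range(n_sites)
    )


alignment = {
    "1": "AGGCUCCAA",
    "2": "AGGUUCGAA",
    "3": "AGCCCAGAA",
    "4": "AUUUCGGAA",
}

# The three possible topologies for four taxa, rooted arbitrarily.
topologies = {
    "((1,2),(3,4))": (("1", "2"), ("3", "4")),
    "((1,3),(2,4))": (("1", "3"), ("2", "4")),
    "((1,4),(2,3))": (("1", "4"), ("2", "3")),
}
scores = {name: parsimony_score(t, alignment) for name, t in topologies.items()}
```

On these nine sites two of the three topologies tie for the smallest score, which illustrates the remark that several trees can require the same number of changes.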
2 http://rrna.uia.ac.be/~peter/doctoraat/evol.html
3 http://rrna.uia.ac.be/~peter/doctoraat/evol.html
4 http://rrna.uia.ac.be/~peter/doctoraat/evol.html
Some advantages of the maximum parsimony method are that it is based
on shared and derived characters, does not reduce sequence information to a
single number, tries to provide information on the ancestral sequences, and
evaluates different trees. The disadvantages are that it is slow in comparison with
distance methods, does not use all the sequence information (only informative
sites are used), does not correct for multiple mutations (it does not imply a model
of evolution), does not provide information on the branch lengths, and is
notorious for its sensitivity to codon bias.
Maximum likelihood
“Maximum Likelihood is a method for the inference of phylogeny. It evaluates a
hypothesis about evolutionary history in terms of the probability that the proposed
model and the hypothesized history would give rise to the observed data set. The
supposition is that a history with a higher probability of reaching the observed
state is preferred to a history with a lower probability. The method searches for
the tree with the highest probability or likelihood.”5
Some advantages of the maximum likelihood method are as follows: it often has
lower variance than other methods (i.e., it is frequently the estimation method
least affected by sampling error); it tends to be robust to many violations of the
assumptions in the evolutionary model; even with very short sequences it tends
to outperform alternative methods such as parsimony or distance methods; it is
statistically well founded; it evaluates different tree topologies; and it uses all
the sequence information.
The disadvantages are as follows: it is very CPU intensive and thus extremely
slow, and the result is dependent on the model of evolution used.
5 http://www.icp.ucl.ac.be/~opperd/private/max_likeli.html
The following example, taken from the website cited below, explains the method well.
“Maximum likelihood evaluates the probability that the chosen evolutionary model
will have generated the observed sequences. Phylogenies are then inferred by
finding those trees that yield the highest likelihood.
Assume that we have the aligned nucleotide sequences for four taxa:
     1         j            N
(1)  A G G C U C C A A .... A
(2)  A G G U U C G A A .... A
(3)  A G C C C A G A A .... A
(4)  A U U U C G G A A .... C
We want to evaluate the likelihood of the unrooted tree represented by the
nucleotides of site j in the sequence and shown below:
(1)         (2)
   \       /
    \     /
     -----
    /     \
   /       \
(3)         (4)
What is the probability that this tree would have generated the data presented in
the sequence under the chosen model?
Since most of the models currently used are time-reversible, the likelihood of the
tree is generally independent of the position of the root. Therefore it is convenient
to root the tree at an arbitrary internal node as done in the Fig. below,
C     C     A     G
 \   /      |    /
  \ /       |   /
   A        |  /
    \       | /
     \      |/
      \     /
       \   /
         A
Under the assumption that nucleotide sites evolve independently (the Markovian
model of evolution), we can calculate the likelihood for each site separately and
combine the likelihood into a total value towards the end. To calculate the
likelihood for site j, we have to consider all the possible scenarios by which the
nucleotides present at the tips of the tree could have evolved. So the likelihood
for a particular site is the summation of the probabilities of every possible
reconstruction of ancestral states, given some model of base substitution. So in
this specific case all possible nucleotides A, G, C, and T occupying nodes (5)
and (6), or 4 x 4 = 16 possibilities:
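\[ L(j) = \sum_{x_5} \sum_{x_6} \mathrm{Prob}\big(\text{tips } C, C, A, G;\ \text{node (5)} = x_5,\ \text{node (6)} = x_6\big), \qquad x_5, x_6 \in \{A, G, C, T\} \]
where node (5) is the ancestor of the two C tips and node (6) is the root, to which
node (5) and the tips A and G are attached.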
In the case of protein sequences each site may occupy 20 states (that of the 20
amino acids) and thus 400 possibilities have to be considered. Since any one of
these scenarios could have led to the nucleotide configuration at the tip of the
tree, we must calculate the probability of each and sum them to obtain
the total probability for each site j.
The likelihood for the full tree then is the product of the likelihood at each site.
\[ L = L(1) \times L(2) \times \cdots \times L(N) = \prod_{j=1}^{N} L(j) \]
Since the individual likelihoods are extremely small numbers it is convenient to
sum the log likelihoods at each site and report the likelihood of the entire tree as
the log likelihood.
\[ \ln L = \ln L(1) + \ln L(2) + \cdots + \ln L(N) = \sum_{j=1}^{N} \ln L(j) \]
The above procedure is then repeated for all possible topologies (or for all
possible trees). The tree with the highest probability is the tree with the highest
maximum likelihood.”6
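The calculation described in the quotation can be sketched in a few lines for the four-taxon tree used there: tips (1) and (2) attach to node (5), and node (5) together with tips (3) and (4) attaches to the root, node (6). The sketch assumes the Jukes-Cantor model with equal base frequencies, branch lengths of 0.1 substitutions per site on every branch, and T written in place of U; these choices and all function names are illustrative assumptions and are not taken from the quoted source.

```python
import math

BASES = "ACGT"


def jc69(x, y, t):
    """Jukes-Cantor probability of observing y at the end of a branch of
    length t (expected substitutions per site) that starts in state x."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def site_likelihood(tips, t):
    """Likelihood of one site, summed over the 16 states of nodes (5) and (6).

    tips holds the bases at tips (1)..(4); t maps branch numbers 1..5 to
    lengths, where branch 5 joins node (5) to the root, node (6).
    """
    total = 0.0
    for x6 in BASES:              # state at the root, node (6)
        for x5 in BASES:          # state at node (5)
            total += (
                0.25                          # prior probability of the root state
                * jc69(x6, x5, t[5])
                * jc69(x5, tips[0], t[1])
                * jc69(x5, tips[1], t[2])
                * jc69(x6, tips[2], t[3])
                * jc69(x6, tips[3], t[4])
            )
    return total


def log_likelihood(alignment, t):
    """Sum of the per-site log likelihoods, avoiding numerical underflow."""
    return sum(math.log(site_likelihood(col, t)) for col in zip(*alignment))


alignment = ["AGGCTCCAA", "AGGTTCGAA", "AGCCCAGAA", "ATTTCGGAA"]
branch_lengths = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1}

L_site_j = site_likelihood("CCAG", branch_lengths)  # site j of the example
lnL = log_likelihood(alignment, branch_lengths)
```

A full analysis would also optimise the branch lengths and repeat the calculation for every topology; practical programs avoid the explicit double loop by using Felsenstein's pruning algorithm, which gives the same value more efficiently.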
LIKELIHOOD RATIO TESTS IN PHYLOGENETICS
Phylogenetic analysis rests on many assumptions. These assumptions are
sometimes incorrect, but we will see that even incorrect assumptions may still
lead to a correct analysis.
The role of assumptions is one of the most debated subjects in the field of
phylogenetics. All phylogenetic methods make assumptions about the process of
evolution. In addition, many phylogenetic methods use bifurcating trees, in which
each ancestral lineage gives rise to exactly two descendant lineages, to describe
the phylogeny of species.
“Consequently, all the methods of phylogenetic inference depend on their
underlying models. To have confidence in inferences it is necessary to have
confidence in the models.”7 In other words, an inference is only as trustworthy
as the model it rests on.
More assumptions are made in a phylogenetic analysis. “For example, the
assumptions of a maximum likelihood analysis are mathematically explicit and,
besides the assumption of independence among sites, include parameters that
describe the substitution process, the lengths of the branches on a phylogenetic
tree, and among-site rate heterogeneity. The assumptions made in a parsimony
analysis include independence and a specific model of character transformation
6 http://www.icp.ucl.ac.be/~opperd/private/max_likeli.html
7 http://bioinformatics.oupjournals.org/cgi/reprint/14/9/817.pdf
(often called a step-matrix or weighting scheme; a commonly used weighting
scheme is to give every character transformation equal weight).”8
Despite all of these assumptions, it is surprising that phylogenetic methods can
estimate the correct tree with high probability. “In fact, the maximum likelihood,
parsimony, and several distance methods appear to be robust to violation of
many assumptions, including making incorrect assumptions about the
substitution process, among-site rate variation, and independence among sites”9
One of the advantages of making explicit assumptions about the evolutionary
process is that we can compare alternative models of evolution in a statistical
context. “Instead of being viewed as a disadvantage, the use of explicit models of
evolution in a phylogenetic analysis allows the systematist not only to estimate
phylogeny, but to learn about processes of evolution through hypothesis testing.
One measure of the relative tenability of two competing hypotheses is the ratio of
their likelihoods.”10
With this background we can begin to decide which model fits the data.
The following passage describes how the choice of a model is justified.
“Because of this, all the methods based on explicit models of evolution should
explore which is the model that fits the data best, justifying then its use. In
traditional statistical theory, a widely accepted statistic for testing the goodness of
fit of models is the likelihood ratio test statistic.
8 Huelsenbeck&Crandall1997.pdf
9 Huelsenbeck&Crandall1997.pdf
10 Huelsenbeck&Crandall1997.pdf
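\[ \delta = -2 \log \Lambda, \qquad \Lambda = \frac{\max[L_0(\text{Null Model} \mid \text{Data})]}{\max[L_1(\text{Alternative Model} \mid \text{Data})]} \]
where L0 is the likelihood under the null hypothesis (simple model) and L1 is the
likelihood under the alternative hypothesis (more complex, parameter rich,
model)”11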
“Here, the maximum likelihood calculated under the null hypothesis (H0) is in the
numerator, and the maximum likelihood calculated under the alternative
hypothesis (H1) is in the denominator. When Λ is less than one, H0 is discredited
and when Λ is greater than one, H1 is discredited. Λ greater than one is only
possible for non-nested models. When nested models are considered, (i.e., the
null hypothesis is a subset or special case of the alternative hypothesis), Λ < 1
and -2 log Λ is asymptotically χ² distributed under the null hypothesis with q
degrees of freedom, where q is the difference in the number of free parameters
between the general and restricted hypotheses.”12
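For nested models the test can be carried out with a few lines of code. The sketch below computes -2 log Λ = 2 (ln L1 - ln L0) and its χ² tail probability using closed-form expressions that hold for integer degrees of freedom; the log likelihood values are made-up numbers for illustration and the function names are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability P(X > x) of a chi-square variable with
    integer df >= 1 degrees of freedom."""
    if x <= 0.0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum_{k < df/2} y^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= y / k
            total += term
        return math.exp(-y) * total
    # Odd df: erfc(sqrt(y)) + exp(-y) * sum_{k=1}^{(df-1)/2} y^(k-1/2) / Gamma(k+1/2)
    total = math.erfc(math.sqrt(y))
    term = math.sqrt(y) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-y) * term
        term *= y / (k + 0.5)
    return total


def likelihood_ratio_test(lnL0, lnL1, q):
    """Return (-2 log Lambda, p-value) for nested models that differ by
    q free parameters; lnL0 and lnL1 are maximised log likelihoods."""
    statistic = 2.0 * (lnL1 - lnL0)
    return statistic, chi2_sf(statistic, q)


# Hypothetical values: a simple model against one with a single extra parameter.
statistic, p_value = likelihood_ratio_test(-1234.5, -1230.0, 1)
reject_null = p_value < 0.05
```

With these made-up numbers the statistic is 9.0 on one degree of freedom, so the simpler model would be rejected at the 5% level.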
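When the χ² approximation cannot be relied upon, the null distribution of the
statistic can instead be approximated by simulation (parametric bootstrapping):
“The maximum likelihood estimates of the model parameters under the null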
hypothesis are used to parameterize the simulations. For the phylogeny problem,
these parameters would include the tree topology, branch lengths, and
substitution parameters (e.g., transition:transversion rate ratio or the shape
parameter of the gamma distribution). For each simulated data set, -2 log Λ is
calculated anew by maximizing the likelihood under the null and alternative
hypotheses. The proportion of the time that the observed value of -2 log Λ
exceeds the values observed in the simulations represents the significance level
of the test.”13
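Once the simulated values of -2 log Λ are available, the comparison with the observed value is simple. The sketch below reports the fraction of simulated values that are at least as large as the observed one, which is the usual Monte Carlo p-value (one minus this fraction is the proportion of simulations that the observed value exceeds). The simulated values shown are made-up numbers; in practice each would come from re-analysing a data set simulated under the null hypothesis.

```python
def monte_carlo_p_value(observed, simulated):
    """Fraction of simulated statistics that are >= the observed statistic."""
    if not simulated:
        raise ValueError("need at least one simulated value")
    return sum(1 for s in simulated if s >= observed) / len(simulated)


# Hypothetical statistics from 10 simulated data sets (real studies use many more).
simulated_stats = [0.4, 1.1, 0.2, 2.7, 0.9, 3.5, 0.1, 1.8, 9.6, 0.6]
p_value = monte_carlo_p_value(9.0, simulated_stats)
```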
11 http://bioinformatics.oupjournals.org/cgi/reprint/14/9/817.pdf
12 Huelsenbeck&Crandall1997.pdf
13 Huelsenbeck&Crandall1997.pdf
It is common for the rejection level to be set at about 5%. Therefore, maximum
likelihood allows simplified formulation and testing of the phylogenetic hypothesis
through these likelihood ratio tests. In addition, ratio tests have desirable
statistical properties, especially if they are used on simple hypotheses.
“Over the past two decades, numerous likelihood ratio tests have been
suggested. These include tests of the null hypotheses that (a) a model of DNA
substitution adequately explains the data (32, 80, 93), (b) rates of nucleotide
substitution are biased (32, 80, 93), (c) rates of substitution are constant among
lineages (24, 65, 79, 115), (d) rates are equal among sites (123), (e) rates of
substitution are the same in different data partitions (30, 66, 122), (f)
substitution parameters are the same among data partitions (122), (g) the same
topology underlies different data partitions (51), (h) a prespecified group is
monophyletic (53), (i) hosts and associated parasites have corresponding
phylogenies (56), (j) hosts and parasites have identical speciation times (56),
and (k) rates of synonymous and nonsynonymous substitution are the same
(77).”14
14 Huelsenbeck&Crandall1997.pdf
References:
1. Huelsenbeck J. and Crandall K. Phylogeny Estimation and Hypothesis Testing Using Maximum Likelihood.
2. http://workshop.molecularevolution.org/resources/models/codonmodels.php
3. www.cs.technion.ac.il/~dang/courseCB/lecture13.pps
4. http://www.biology.usu.edu/biol6750/Lecture_15.htm
5. bio.wayne.edu/mf/teaching/BIO6060_Maximumlikelihood2
6. http://www.nmu.edu/biology/Lindsay/teaching/BI315/phylo/phylo_DNAmodels.html
7. http://www.icp.ucl.ac.be/~opperd/private/max_likeli.html
8. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=8642615
9. http://stat-www.berkeley.edu/users/terry/PMMB/Workshop2000/Lab3/phylo.pdf
10. http://www.biology.duke.edu/rausher/phylo1.pdf
11. Neelima Lingareddy. Maximum Likelihood: Phylogeny Estimation.