PHYLOGENY ESTIMATION USING MAXIMUM LIKELIHOOD
By George Shanahan
DNA: The Information for Reconstructing Phylogenetic Trees
DNA is the basic hereditary material in all cells and contains all the information
necessary to make proteins. DNA is a linear polymer that is made up of nucleotide units.
The nucleotide unit consists of a nitrogen containing base, a deoxyribose sugar, and a
phosphate. There are four types of bases: adenine (A), thymine (T), guanine (G), and
cytosine (C). Each base is connected to a sugar via a β-glycosyl linkage. The nucleotide
units are connected via the O3' and O5' oxygen atoms, forming phosphodiester linkages.
These linkages between the hydroxyl groups of the deoxyribose sugars and the phosphate
groups form the DNA backbone.
[Figure: structures of the four bases: adenine, thymine, guanine, and cytosine.]
In normal DNA, the bases form pairs: A with T and G with C. This is called complementarity.
Two complementary chains arranged in an anti-parallel manner form a duplex of
DNA. The A-T and G-C base pairs are the same length and occupy the same space
within a DNA double helix, giving it a uniform diameter. DNA is normally a
double-stranded macromolecule consisting of two polynucleotide chains held together
by hydrogen bonds between the paired bases. This structure is known as the double helix.
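Complementarity and the anti-parallel arrangement together mean that one strand fully determines the other. The following is a minimal Python sketch of that rule; the function name `reverse_complement` is chosen here for illustration and is not part of the original text. It is applied to the dodecamer sequence that appears later in this section.

```python
# Watson-Crick pairing from the text: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand: str) -> str:
    """Return the partner strand, read 5'->3' (hence the reversal)."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))

# The dodecamer is self-complementary: both chains read the same 5'->3'.
print(reverse_complement("CGCGAATTCGCG"))
```

Because the two chains run in opposite directions, the complement has to be reversed before it can be compared with the first strand in the same 5' to 3' orientation.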
In this structure, the two DNA strands wind around a common axis in a right-handed spiral. The sugar-phosphate backbones of the two DNA strands wind around the
helix axis like a spiral staircase. The two polynucleotide chains run in opposite
directions, and the bases of the individual nucleotides are on the inside of the helix,
stacked on top of each other. The larger base shape shown in the picture above depicts
the purines (adenine and guanine). The other base shape in the illustration represents the
pyrimidines (cytosine and thymine).
The results of fiber and single crystal x-ray crystallographic studies have shown that
DNA can have several conformations. The most common one is called B-DNA. B-DNA
is a right-handed double helix with a wide major groove and a narrow minor groove. The bases are
roughly perpendicular to the helix axis.
[Figure: B-DNA. Compound name: 5'-d(CpGpCpGpApApTpTpCpGpCpG)-3', 290 K.
Sequence in asymmetric unit:
Chain A: C G C G A A T T C G C G
Chain B: C G C G A A T T C G C G]
Occasionally mutations occur in which a base is changed. Base pairs still form,
but they are not in the usual Watson-Crick geometry. Sequences diverge from a
common ancestor because such mutations occur and can then become fixed in the
evolving population. The maximum likelihood method is used for the analysis of DNA
and amino acid sequence data in an attempt to answer evolutionary questions.
Maximum Likelihood in Phylogenetics
From a given set of aligned sequences, molecular phylogeny methods propose
phylogenetic trees that aim to reconstruct the history of successive divergence
events that took place during evolution between the sequences considered and
their common ancestor. A phylogenetic tree is characterized by its topology (form)
and its length (the sum of its branch lengths). Each internal node of a tree is an
estimate of the common ancestor of the elements that descend from that node.
Maximum Likelihood is a method for the inference of phylogeny. It evaluates a
hypothesis about evolutionary history in terms of the probability that the proposed model
and the hypothesized history would give rise to the observed data set. The supposition is
that a history with a higher probability of reaching the observed state is preferred to a
history with a lower probability. The method searches for the tree with the highest
probability or likelihood.
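The idea of preferring the hypothesis with the highest likelihood can be illustrated on the smallest possible problem: the distance between two aligned sequences. The sketch below assumes the Jukes-Cantor model (described later in this document), measures distance in expected substitutions per site, and uses hypothetical counts (100 sites, 10 differences). The function names are illustrative only.

```python
import math

def jc_same(t: float) -> float:
    """JC69 probability that a site shows the same base after distance t."""
    return 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)

def jc_diff(t: float) -> float:
    """JC69 probability of one specific different base after distance t."""
    return 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)

def log_likelihood(t: float, n_sites: int, n_diff: int) -> float:
    """Log-likelihood of two aligned sequences separated by distance t."""
    n_same = n_sites - n_diff
    return (n_sites * math.log(0.25)
            + n_same * math.log(jc_same(t))
            + n_diff * math.log(jc_diff(t)))

# Hypothetical data: 100 aligned sites, 10 of which differ.
N_SITES, N_DIFF = 100, 10

# Each candidate distance is one "hypothesis"; keep the most likely one.
grid = [i / 10000 for i in range(1, 20001)]
t_hat = max(grid, key=lambda t: log_likelihood(t, N_SITES, N_DIFF))

# Closed-form JC69 distance, for comparison with the grid search.
p = N_DIFF / N_SITES
t_closed = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
print(round(t_hat, 4), round(t_closed, 4))
```

The grid search is deliberately crude; it only shows that the distance with the highest likelihood agrees with the analytical Jukes-Cantor distance for the same data.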
The data for molecular phylogenetic problems are the individual site patterns. A
set of sequences is considered for analysis. Multiple sequence alignment is then
performed to get a proper alignment. The phylogenetic tree will depend strongly on the
alignments obtained.
Maximum likelihood evaluates the probability that the chosen evolutionary model
will have generated the observed sequences. Phylogenies are then inferred by finding
those trees that yield the highest likelihood.
Assume that we have the aligned nucleotide sequences for four taxa:

Site   1         j           N
(1)    A G G C T C C A A ... A
(2)    A G G T T C G A A ... A
(3)    A G C C C A G A A ... A
(4)    A T T T C G G A A ... C
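Since the data for the analysis are the individual site patterns, it can help to see how they are read off an alignment. The sketch below transposes the first nine columns of the four sequences above and counts how often each pattern occurs; the variable names are illustrative.

```python
from collections import Counter

# First nine columns of the four-taxon alignment in the text.
alignment = {
    "1": "AGGCTCCAA",
    "2": "AGGTTCGAA",
    "3": "AGCCCAGAA",
    "4": "ATTTCGGAA",
}

# zip(*rows) transposes the alignment: one string of bases per site.
columns = ["".join(col) for col in zip(*alignment.values())]
pattern_counts = Counter(columns)

print(columns[5])              # pattern at the sixth column
print(pattern_counts["AAAA"])  # number of sites with the pattern AAAA
```

Identical patterns have identical site likelihoods on a given tree, so in practice each distinct pattern only needs to be evaluated once and weighted by its count.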
We want to evaluate the likelihood of the unrooted tree shown below, using the
nucleotides observed at site j of the alignment:

    (1)           (3)
       \         /
        \       /
         -------
        /       \
       /         \
    (2)           (4)
Now, proceed to find the probability that this tree would have generated the data
presented in the sequence under the chosen model. Since most of the models currently
used are time-reversible, the likelihood of the tree is generally independent of the position
of the root. Therefore it is convenient to root the tree at an arbitrary internal node, as done
in the figure below:
   C     C
    \   /
     \ /
      A      A      G
       \     |     /
        \    |    /
         \   |   /
          \  |  /
           \ | /
             A
Under the assumption that nucleotide sites evolve independently (the Markovian model
of evolution), we can calculate the likelihood for each site separately and combine the
likelihoods into a total value at the end. To calculate the likelihood for site j, we
have to consider all the possible scenarios by which the nucleotides present at the tips of
the tree could have evolved. The likelihood for a particular site is therefore the sum of the
probabilities of every possible reconstruction of ancestral states, given some model of
base substitution. In this specific case, any of the nucleotides A, G, C, and T may
occupy nodes (5) and (6). There are $4^{n-2}$ arrangements for n taxa, so here $4^2 = 16$
possibilities:
$$L(j) = \sum_{x_5} \sum_{x_6} \mathrm{Prob}\big(\text{tips } C, C, A, G;\ \text{node (5)} = x_5,\ \text{node (6)} = x_6\big), \qquad x_5, x_6 \in \{A, C, G, T\}$$
In the case of protein sequences each site may occupy 20 states (that of the 20 amino
acids) and thus 400 possibilities have to be considered. Since any one of these scenarios
could have led to the nucleotide configuration at the tip of the tree, we must calculate the
probability of each and sum them to obtain the total probability for each site j.
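The counts of 16 and 400 follow directly from assigning one state to each of the n - 2 internal nodes. A minimal sketch that enumerates these assignments is given below; the helper name `ancestral_reconstructions` is hypothetical.

```python
from itertools import product

def ancestral_reconstructions(states: str, n_taxa: int):
    """Return every assignment of states to the n_taxa - 2 internal nodes."""
    return product(states, repeat=n_taxa - 2)

NUCLEOTIDES = "ACGT"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes, 20 states

dna = list(ancestral_reconstructions(NUCLEOTIDES, 4))
protein = list(ancestral_reconstructions(AMINO_ACIDS, 4))
print(len(dna), len(protein))
print(dna[:3])  # first few (node 5, node 6) assignments
```

The number of reconstructions grows exponentially with the number of taxa, which is why practical programs do not enumerate them one by one.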
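For each reconstruction i we calculate the probability of getting the observed
states, written L(i). The site likelihood L(j) is the sum of these terms. For the
reconstruction with A at both internal nodes and branch lengths t1 to t5:

   C     C
    \   /
  t2 \ / t3
      A      A      G
       \     |     /
     t1 \    |t4  / t5
         \   |   /
          \  |  /
           \ | /
             A

$$L(i) = P_A \times P_{A \to A}(t_1) \times P_{A \to C}(t_2) \times P_{A \to C}(t_3) \times P_{A \to A}(t_4) \times P_{A \to G}(t_5)$$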
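The calculation for site j can be sketched in a few lines. The code below is only an illustration: it assumes the Jukes-Cantor model (equal base frequencies, described later in this document) for the transition probabilities, and the branch lengths are hypothetical values in expected substitutions per site. Node (6) is taken as the root, as in the figure.

```python
import math
from itertools import product

def jc(x: str, y: str, t: float) -> float:
    """JC69 probability of observing y at the end of a branch starting at x."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def reconstruction_prob(x5: str, x6: str, tips: str, t: list) -> float:
    """Probability of one reconstruction; the root is node (6)."""
    t1, t2, t3, t4, t5 = t
    return (0.25                      # prior of the root state (equal frequencies)
            * jc(x6, x5, t1)          # root (6) -> node (5)
            * jc(x5, tips[0], t2)     # node (5) -> tip (1)
            * jc(x5, tips[1], t3)     # node (5) -> tip (2)
            * jc(x6, tips[2], t4)     # root (6) -> tip (3)
            * jc(x6, tips[3], t5))    # root (6) -> tip (4)

def site_likelihood(tips: str, t: list) -> float:
    """Sum over all 16 assignments of bases to nodes (5) and (6)."""
    return sum(reconstruction_prob(x5, x6, tips, t)
               for x5, x6 in product("ACGT", repeat=2))

branches = [0.1, 0.1, 0.1, 0.1, 0.1]   # hypothetical branch lengths t1..t5
one_term = reconstruction_prob("A", "A", "CCAG", branches)
total = site_likelihood("CCAG", branches)
print(one_term, total)
```

Here `one_term` corresponds to the single product written above (A at both internal nodes), and `total` is the site likelihood obtained by adding the other 15 reconstructions.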
The likelihood for the full tree is then the product of the likelihoods at each site:

$$L = L(1) \times L(2) \times \cdots \times L(N) = \prod_{j=1}^{N} L(j)$$
Since the individual likelihoods are extremely small numbers it is convenient to sum the
log likelihoods at each site and report the likelihood of the entire tree as the log
likelihood.
$$\ln L = \ln L(1) + \ln L(2) + \cdots + \ln L(N) = \sum_{j=1}^{N} \ln L(j)$$
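The practical reason for working with logarithms is numerical: the product of many small site likelihoods falls below the smallest number a floating-point type can represent. A small sketch with hypothetical per-site values:

```python
import math

# Hypothetical per-site likelihoods: 500 sites, each with likelihood 1e-3.
site_likelihoods = [1e-3] * 500

product_l = math.prod(site_likelihoods)              # underflows to 0.0
log_l = sum(math.log(l) for l in site_likelihoods)   # stays finite
print(product_l, log_l)
```

The raw product is reported as zero, so trees could no longer be compared, whereas the summed log likelihood remains an ordinary negative number.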
The maximum likelihood tree is now determined by repeating this procedure for
all possible topologies and, for each topology, searching for the set of branch lengths that maximizes L. The result is that tree's likelihood score. The topology with the
highest likelihood score is the maximum likelihood tree.
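Repeating the procedure for all possible topologies quickly becomes expensive. For unrooted, strictly bifurcating trees the number of topologies for n taxa is (2n - 5)!!, the product 3 x 5 x ... x (2n - 5). The sketch below evaluates that count; the function name is illustrative.

```python
def n_unrooted_topologies(n_taxa: int) -> int:
    """Number of unrooted bifurcating tree topologies: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):   # 3 * 5 * ... * (2n - 5)
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, n_unrooted_topologies(n))
```

For the four-taxon example there are only three topologies to compare, but the count passes two million at ten taxa, so larger analyses usually rely on heuristic searches rather than exhaustive evaluation.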
To calculate the probability of observing a given site pattern, the transition
probabilities need to be specified. For DNA or amino acid substitutions, the
simplest model is the Poisson model, which assumes that all changes between
nucleotides, or between amino acids, occur at the same rate. A model of evolution
that attributes a probability to each possible nucleotide or amino-acid substitution
is essential to obtain the correct tree.
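For reference, the transition probabilities of the equal-rates (Poisson) model have a simple closed form. The notation below is an assumption made for this illustration: k is the number of states (4 for nucleotides, 20 for amino acids), t is the branch length in expected substitutions per site, and all states are taken to be equally frequent.

```latex
P_{ii}(t) = \frac{1}{k} + \frac{k-1}{k}\, e^{-\frac{k}{k-1} t},
\qquad
P_{ij}(t) = \frac{1}{k} - \frac{1}{k}\, e^{-\frac{k}{k-1} t} \quad (i \neq j)
```

With k = 4 this reduces to the Jukes-Cantor model, where the exponent becomes -4t/3. Each row sums to one, since P_ii(t) + (k - 1) P_ij(t) = 1.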
Introduction to DNA Substitution Models
The use of maximum likelihood algorithms in developing phylogenetic
hypotheses requires a model of evolution, that is, a model that attributes a
probability to each possible nucleotide or amino-acid substitution. Choosing an
adequate model is essential to obtain the correct tree. The frequently used
General Time Reversible (GTR) family of nested models encompasses 64 models
with different combinations of parameters for DNA site substitution. The models
are listed here from the least complex to the most parameter rich.
Jukes-Cantor (JC, nst=1): Equal base frequencies, all substitutions equally likely
(PAUP* rate classification: aaaaaa, PAML: aaaaaa)*
Felsenstein 1981(F81, nst=1): Variable base frequencies, all substitutions equally likely
(PAUP*: aaaaaa, PAML: aaaaaa)**
Kimura 2-parameter (K80, nst=2): Equal base frequencies, variable transition and
transversion frequencies (PAUP*: abaaba, PAML: abbbba)
Hasegawa-Kishino-Yano (HKY, nst=2): Variable base frequencies, variable transition
and transversion frequencies (PAUP*: abaaba, PAML: abbbba)
Tamura-Nei (TrN): Variable base frequencies, equal transversion frequencies, variable
transition frequencies (PAUP*: abaaea, PAML: abbbbf)
Kimura 3-parameter (K3P): Equal base frequencies, equal transition frequencies,
variable transversion frequencies (PAUP*: abccba, PAML: abccba)
Transition Model (TIM): Variable base frequencies, variable transitions, transversions
equal (PAUP*: abccea, PAML: abccbe)
Transversion Model (TVM): Variable base frequencies, variable transversions,
transitions equal (PAUP*: abcdbe, PAML: abcdea)
Symmetrical Model (SYM): Equal base frequencies, symmetrical substitution matrix (A
to T = T to A) (PAUP*: abcdef, PAML: abcdef)
General Time Reversible (GTR, nst=6): Variable base frequencies, symmetrical
substitution matrix (PAUP*: abcdef, PAML: abcdef)
In addition to models describing the rates of change from one nucleotide to another, there
are models to describe rate variation among sites in a sequence. The following are the
two most commonly used models.
Gamma Distribution (G): Gamma distributed site-to-site rate variation
Proportion of Invariable Sites (I): Extent of static, unchanging sites in a dataset
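To give a feel for what these two descriptions of rate variation mean, the sketch below draws one relative rate per site from a gamma distribution with mean 1 and then sets a fraction of sites to rate 0. It is a simulation-style illustration under assumed parameter values (shape 0.5, invariable proportion 0.2), not the discrete-category approximation used by phylogenetics programs; all names are hypothetical.

```python
import random

def gamma_site_rates(alpha: float, n_sites: int, seed: int = 0) -> list:
    """Draw one relative rate per site from a gamma with mean 1."""
    rng = random.Random(seed)
    # shape = alpha, scale = 1/alpha, so mean = 1 and variance = 1/alpha
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

def apply_invariable_sites(rates: list, p_inv: float, seed: int = 0) -> list:
    """Set each site to rate 0 with probability p_inv."""
    rng = random.Random(seed + 1)
    return [0.0 if rng.random() < p_inv else r for r in rates]

rates = gamma_site_rates(alpha=0.5, n_sites=20000)
mixed = apply_invariable_sites(rates, p_inv=0.2)
mean_rate = sum(rates) / len(rates)
frac_invariable = sum(r == 0.0 for r in mixed) / len(mixed)
print(mean_rate, frac_invariable)
```

A small shape parameter produces many slowly evolving sites and a few fast ones, while a large shape parameter makes the rates nearly equal across sites.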
Substitutions are themselves grouped hierarchically: simple, general base
substitution, transitions and transversions, purine to purine and pyrimidine to pyrimidine
transitions, and AC/GT and AT/CG transversions. The groupings are symbolized as rate
classifications according to the PAUP and PAML matrices below. Substitution types that
are constrained to be equal in rate assume the leftmost letter symbol.
PAUP* Substitution Rate Matrix

      A     C     G     T
A     -     a     b     c
C           -     d     e
G                 -     f=1
T                       -

PAML Substitution Rate Matrix

      T     C     A     G
T     -     a     b     c
C           -     d     e
A                 -     f=1
G                       -
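The six-letter rate classifications can be read as instructions for filling such a matrix. The sketch below turns a PAUP*-style code (order AC, AG, AT, CG, CT, GT) into a rate matrix, using the common convention that the rate from base i to base j is the class rate multiplied by the frequency of j. The function names, the example rates, and the base frequencies are assumptions made for illustration.

```python
BASES = "ACGT"
# PAUP* order of the six substitution types: AC, AG, AT, CG, CT, GT.
PAIRS = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

def n_free_rates(code: str) -> int:
    """Distinct rate classes minus the one fixed as reference."""
    return len(set(code)) - 1

def rate_matrix(code: str, rates: dict, freqs: list) -> list:
    """Build Q with q_ij = r_ij * pi_j from a six-letter rate class code."""
    q = [[0.0] * 4 for _ in range(4)]
    for letter, (i, j) in zip(code, PAIRS):
        q[i][j] = rates[letter] * freqs[j]
        q[j][i] = rates[letter] * freqs[i]
    for i in range(4):
        q[i][i] = -sum(q[i])          # each row sums to zero
    return q

# HKY-like example: code abaaba, hypothetical rates and base frequencies.
freqs = [0.3, 0.2, 0.2, 0.3]
q = rate_matrix("abaaba", {"a": 1.0, "b": 4.0}, freqs)
print(n_free_rates("aaaaaa"), n_free_rates("abaaba"), n_free_rates("abcdef"))
```

Because the class rates are symmetric, the resulting matrix satisfies pi_i q_ij = pi_j q_ji, which is the time-reversibility property that makes the likelihood independent of the root position.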
Advantages and disadvantages of maximum likelihood methods:
There are some supposed advantages of maximum likelihood methods over other
methods.

they estimate the branch lengths of the final tree

they often have lower variance than other methods (i.e., they are frequently the
estimation methods least affected by sampling error)

they tend to be robust to many violations of the assumptions in the
evolutionary model

even with very short sequences they tend to outperform alternative
methods such as parsimony or distance methods.

the method is statistically well founded

they evaluate different tree topologies

they use all the sequence information
There are also some supposed disadvantages

maximum likelihood is very CPU intensive and thus extremely slow

the result is dependent on the model of evolution used
Bibliography
1. Adachi J, Hasegawa M. 1992. Amino acid substitution of proteins coded for in
mitochondrial DNA during mammalian evolution. Jpn. J. Genet. 67:187–97
2. Bishop MJ, Thompson EA. 1986. Maximum likelihood alignment of DNA sequences.
J. Mol. Biol. 190:159–65
3. Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood
approach. J. Mol. Evol. 17:368–376
4. Felsenstein J. 1988. Phylogenies from molecular sequences. Annu. Rev. Genet.
22:521–565
5. Huelsenbeck JP, Crandall KA. 1997. Phylogeny estimation and hypothesis testing
using maximum likelihood. Annu. Rev. Ecol. Syst. 28:437–466
6. Navidi WC, Churchill GA, von Haeseler A. 1991. Methods for inferring phylogenies
from nucleic acid sequence data by using maximum likelihood and linear invariants.
Mol. Biol. Evol. 8:128–143
7. Penny D, Hendy MD, Steel MA. 1992. Progress with methods for constructing
evolutionary trees. Trends Ecol. Evol. 7:73–79
8. Thorne JL, Kishino H, Felsenstein J. 1991. An evolutionary model for maximum
likelihood alignment of DNA sequences. J. Mol. Evol. 33:114–124