Inferring phylogeny from Protein / DNA sequence

Math REU - Inferring phylogeny from Protein / DNA sequence –
Arian Sibila – 7/30/2003
Inferring the phylogeny means building the phylogenetic tree. A phylogenetic tree is a model of
evolution. Building it means determining which branching came first, which came second, and so on.
The available information is the DNA / protein sequence of the taxa forming the leaves of the tree.
A sequence does not indicate its position in the tree, but we reconstruct the tree from pairwise
comparisons.
Introduction
The road from present-day species to the tree of evolution is long, because we have recovered very
little information. To put it in perspective, we are using observations from the past 100 or so
years to reconstruct branches and edges that Nature has been building for over 500 million years,
which seems impossible! Furthermore, some of the information we have is useless, either because no
one has fully deciphered and understood its dynamics (as with DNA) or because using it would create
enormously complicated models. Our goal is therefore to capture the essence of the evolution of
species and to approximate it.
The problem can be represented very neatly in mathematical language, since each species is
represented uniquely by a sequence of symbols. We don't know what language the sequences describe,
but we know that more differences mean more distantly related taxa. A symbol is any letter from an
alphabet: A, T, G, C (4) for DNA, and A, C, Y, W, … (20) for protein sequences. A sequence
consists of some number of symbols. The property we observe is that, in nature at least, these
sequences change or mutate over time. The simplest of the several kinds of mutation that happen in
nature is called substitution, where one symbol is replaced by another from the same alphabet. It
is the only kind we consider, for two reasons:
1. it gives a simpler model
2. we hope that other kinds of mutation can be approximated by substitutions, since substitution is
the most basic kind
Based on chemical properties, certain substitutions happen more often than others, and this is
represented by probabilities:
P(j | i)
is the probability of getting the j-th symbol (from the alphabet) given that we have the i-th
symbol, that is, the probability of the (i => j) substitution. All the probabilities are
conveniently placed in a single matrix, with the P(j | i) entry in the i-th row and j-th column:
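\[
\begin{pmatrix}
P(1 \mid 1) & P(2 \mid 1) & \cdots & P(n \mid 1) \\
P(1 \mid 2) & P(2 \mid 2) & \cdots & P(n \mid 2) \\
\vdots & \vdots & \ddots & \vdots \\
P(1 \mid n) & P(2 \mid n) & \cdots & P(n \mid n)
\end{pmatrix}
\]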
Some important properties can be seen:
1. each row adds up to 1.0
2. the matrix can be raised to a power to get the probabilities of a series of substitutions
3. any power of the matrix has the same properties
Some other properties follow from the nature of the data we use:
1. entries along the main diagonal are the largest (they represent non-change, e.g. P(1 => 1), or 1
remaining 1)
2. raising the matrix to a power gives more change, taking roots gives less change
This matrix, representing the natural substitutions or Percent of Accepted Mutations (PAM), is
constructed from a large set of sequences by counting all the pairings during pairwise alignment
(of either DNA or protein sequences). Note that this differs from the original (Dayhoff, '78) way
of calculating PAM. In this paper, we use a slightly different method based entirely on counting
and probabilities.
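The counting described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the gap character "-", the function names, the `symmetric` switch and the toy sequences are all made up for the example and are not taken from the original work.

```python
from collections import Counter

ALPHABET = "ACGT"  # DNA; a protein alphabet would have 20 symbols
GAP = "-"          # assumed gap character left by the aligner


def count_pairings(aligned_pairs, symmetric=False):
    """Count symbol pairings column by column, skipping columns with a gap."""
    counts = Counter()
    for seq_a, seq_b in aligned_pairs:
        for a, b in zip(seq_a, seq_b):
            if a == GAP or b == GAP:
                continue
            counts[(a, b)] += 1
            if symmetric:
                # Also count the reverse direction, so (i => j) equals (j => i).
                counts[(b, a)] += 1
    return counts


def transition_matrix(counts, alphabet=ALPHABET):
    """Row-normalize the counts so that entry [i][j] estimates P(j | i)."""
    matrix = []
    for i in alphabet:
        row_total = sum(counts[(i, j)] for j in alphabet)
        if row_total == 0:
            raise ValueError(f"symbol {i!r} never appears in the alignments")
        matrix.append([counts[(i, j)] / row_total for j in alphabet])
    return matrix


# Made-up toy alignments, only to exercise the functions.
pairs = [("ACGTACGT", "ACGTACGA"), ("AC-TGG", "ACGTGC")]
M = transition_matrix(count_pairings(pairs))
```

Each row of `M` adds up to 1.0 by construction, which is the first property listed earlier. A real data set would need far more aligned columns than this toy input for the estimates to mean anything.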
Based on the data, we hope to find a method that expresses the difference between sequences as a
distance, so that we can plot a tree. In general, this means finding a function:
\[ d : \text{sequence}^2 \to \mathbb{R} \]
Finally, we hope that the distance between any two taxa is the sum of their distances to their
common ancestor, although in reality this might not always be the case.
Sequence
In nature at least, sequences have different lengths. So how do you compare them? A
symbol-by-symbol comparison requires some symbols to be discarded from the longer (or even the
shorter!) sequence. Biologists have come up with a method called aligning, based entirely on
probability and scoring matrices. The idea is that an alignment with a bigger score is more likely
to have happened in nature, which is the first (possible) source of error!
The process of aligning inserts gaps where needed to pair up the chunks that probably have similar
functions. During the course of evolution, sequences lose (deletions) and gain (insertions) chunks
of symbols, but some parts of the sequences remain unaffected from one taxon to the other, and
those parts should theoretically be paired after aligning. Comparisons of aligned sequences should
then tell us something about the substitutions, whereas before aligning, the deleted and inserted
chunks would give us garbage.
Biological clock and mutations
As mentioned earlier, substitution is the only kind of mutation that we consider, and there is a
good reason for that: the so-called biological clock! The claim is that molecular processes take
place at more or less the same rate today as they did 100 million years ago. Although there is no
proof, we all agree that since substitutions are random, more time means more opportunity for them
to happen. That is not the case with insertions and deletions, where whole chunks get inserted or
removed in a short period of time.
Thus, life mutates at a more or less constant pace. Natural selection decides which mutations are
fit to live. After the weak ones die and the strong ones survive, the population contains the new
(mutated) set of beneficial genes, and we say that the species has evolved.
Evolution and Parallel Evolution
Sometimes, new taxa are created. To get a feel for how this happens, imagine the following
scenario. A population of monkeys is physically divided by a river. Monkeys are known not to cross
water, so there will be no contact between the two populations for a long time. Both sides evolve,
but they accumulate different sets of mutations as dictated by conditions such as food
availability, temperature, landscape, etc. The population was evolving even before the river
appeared, but it is the loss of contact caused by physical separation that makes the gene pools
slowly diverge over time. Eventually the two sides become two different species, unable to
interbreed. This is how all life is believed to have evolved.
In this case, the branching point, or the ancestor of the two species, is the population before it
was divided. That animal is extinct because the population has mutated, so we don't have its
sequence. We can only guess when the branching happened and which branching came first. Did the
human-chimpanzee branching occur most recently, or was it human-gorilla? Paleontologists have
debated this for decades, but we want to be able to answer questions like this using modern
sequences.
The other burning issue is how rapidly the two sides diverged. If the conditions are similar on
each side of the river, one would expect the two sides to stay on a similar course of evolution for
a while and develop into very similar taxa, despite the separation. This holds especially true for
large populations. Based on those two taxa alone, it would then be hard to tell how long ago the
branching happened compared to some other branching. Put another way, because of parallel
evolution, a distance function based on sequence similarity will give a smaller value. In most
cases, however, we assume that a splinter group of relatively few members separates from the main
population and moves into different conditions, where any mutations quickly propagate through the
group and cause it to diverge rapidly.
Matrix Distance Methods
Matrices are used as a convenient way of grouping percentages of substitutions. Suppose we have
the following probability matrix:
\[
\begin{pmatrix} 0.7 & 0.3 \\ 0.2 & 0.8 \end{pmatrix}
\]
Then P(1 => 2) = 0.3. In two steps:
\[
\begin{pmatrix} 0.7 & 0.3 \\ 0.2 & 0.8 \end{pmatrix}^{2}
=
\begin{pmatrix} 0.55 & 0.45 \\ 0.3 & 0.7 \end{pmatrix}
\]
P(1 => 2) = 0.45, and so on.
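The two-step calculation can be checked with a few lines of plain Python. This is a minimal sketch; the helper names `mat_mul` and `mat_pow` are made up for the illustration.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def mat_pow(matrix, steps):
    """Raise a square matrix to a non-negative integer power."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, matrix)
    return result


P = [[0.7, 0.3],
     [0.2, 0.8]]
P2 = mat_pow(P, 2)  # substitution probabilities after two steps
```

For example, the first row of `P2` comes from 0.7 * 0.7 + 0.3 * 0.2 = 0.55 and 0.7 * 0.3 + 0.3 * 0.8 = 0.45, and it still adds up to 1.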
PAM (Percent Accepted Mutations)
PAM was first introduced by Dayhoff in 1978 as a way of aligning and scoring. It is a matrix meant
to describe evolutionary change and to differentiate likely from unlikely mutations.
The most important PAM matrix is PAM-1 (since all other PAMs are calculated from it), which
corresponds to 1% of change. The idea is: if 99% of the symbols (amino acids or nucleotides)
remain the same, what is the probability that other mutations have taken place? Or, as an equation:
P * diag(Q) = 0.99
where Q is PAM-1 and diag(Q) is the vector along its main diagonal. Note that entries along the
main diagonal correspond to non-change while off-diagonal entries correspond to change. Diagonal
entries are expected to be large, close to 1, in order for some of our methods to work. The Dayhoff
method defines "relative mutability", M_i, to be the total amount of change for a given row. We do
not make use of this method because it has no mathematical significance (to us at least). Instead,
we use the same method as described earlier for finding the transition matrix, with one difference:
the counts for (i => j) and (j => i) are the same. In other words, the matrix is symmetric. We use
the criterion above to find a matrix Q such that:
\[ Q^t = M_Q \]
\[ P_Q \cdot \operatorname{diag}(Q) = 0.99 \]
where M_Q is the PAM transition matrix and P_Q is the fraction of each of the symbols in nature
(i.e. the counts of each amino acid in the set of all known mammals). It can be solved
algebraically or numerically. The idea of the numerical solution is to find an arbitrarily small
range of values of t such that P_Q * diag(M_Q^(1/t)) > 0.99 at the upper limit and < 0.99 at the
lower limit, by repeatedly cutting the interval in half. For example, starting with an interval of
size 100, 100 cuts would give a precision on the order of 100 / 2^100, or about 8 × 10^-29.
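The interval-halving search can be sketched as below. To stay self-contained, the sketch assumes a two-symbol alphabet, for which a fractional power of a row-stochastic matrix has a simple closed form; a real 4- or 20-symbol matrix would need a general matrix root instead. The function names, the starting interval [1, 100] and the toy values of M_Q and P_Q are assumptions made for the illustration, not values from this work.

```python
def two_state_power(matrix, s):
    """M**s for a 2x2 row-stochastic matrix and any real s >= 0.

    Uses M**s = Pi + lam**s * (I - Pi), where lam is the second eigenvalue
    and every row of Pi is the stationary distribution.
    """
    a, b = matrix[0][1], matrix[1][0]
    lam = 1.0 - a - b  # the other eigenvalue is always 1
    if not 0.0 < lam < 1.0:
        raise ValueError("this closed form needs 0 < 1 - a - b < 1")
    pi = (b / (a + b), a / (a + b))
    shrink = lam ** s
    return [[pi[j] + shrink * ((1.0 if i == j else 0.0) - pi[j])
             for j in range(2)]
            for i in range(2)]


def unchanged_fraction(freq, matrix):
    """P_Q * diag(matrix): the expected fraction of symbols left unchanged."""
    return sum(f * matrix[i][i] for i, f in enumerate(freq))


def solve_t(m_q, p_q, target=0.99, lo=1.0, hi=100.0, cuts=100):
    """Halve [lo, hi] until P_Q * diag(M_Q**(1/t)) meets the target."""
    def f(t):
        return unchanged_fraction(p_q, two_state_power(m_q, 1.0 / t))

    if not f(lo) < target < f(hi):
        raise ValueError("target is not bracketed by [lo, hi]")
    for _ in range(cuts):
        mid = (lo + hi) / 2.0
        if f(mid) < target:
            lo = mid   # too much change: t must be larger
        else:
            hi = mid   # too little change: t must be smaller
    return (lo + hi) / 2.0


M_Q = [[0.8, 0.2], [0.2, 0.8]]  # made-up symmetric transition matrix
P_Q = [0.5, 0.5]                # made-up symbol frequencies
t = solve_t(M_Q, P_Q)
Q = two_state_power(M_Q, 1.0 / t)  # plays the role of PAM-1 in this toy case
```

In double-precision arithmetic the interval stops shrinking long before the 100th cut, so the extra iterations are harmless but do not buy additional accuracy.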
Matrix Power Method for Computing Distances
If Q^2 = M for some transition matrix M, then the two taxa are 2 PAMs apart. This is a Markov
process. Similarly, they are t PAMs apart when:
\[ Q^t \approx M, \qquad t \in \mathbb{R} \]
In reality, it gets more complicated. It's not always possible to solve for t, and in fact we're working
with cases where it's not. Imagine trying to find a coefficient when vectors are not collinear!
Why it should work
Matrices Q and M are similar in how they are calculated and what they describe. It is how we
interpret them that differs:
• Q shows how likely substitutions are to happen in nature
• M shows the actual substitutions that happened, found by comparing two taxa
To look at it another way, if 80% of A's remain the same, then 20% of them got substituted by
another symbol in one step. Or perhaps it is more likely that 8% of A's got substituted and 2% of
other symbols changed back to A's, in two steps. This is what the method is about: we know that the
transition M consists of tiny substitutions here and there, and we try to group them into "steps"
and relate those to the natural tendency of symbol substitution, PAM-1.
Solving for "t"
How do we solve for t then? We don't! Some Q^t will "look like" M, and we take that t to be the
distance. It will, actually, never even remotely look like M, but some Q^t will be closest to it
according to the criterion (a matrix difference function) that we set:
\[ C = A - B, \qquad \operatorname{diff}(A, B) = \sum_{c \in C} c^2 \]
(A, B and C are matrices, and c ranges over the entries of C)
It fulfills a few basic criteria (diff(A, A) = 0, diff(A, B) = diff(B, A)).
[Figure: graph of the function f(t) = diff(Q^t, M)]
Different matrices Q and M have different looking graphs, but there's always a global minimum,
and we take the value of t at that point to be the solution for our equation.
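A rough sketch of this search is given below. It simplifies in one important way: it scans whole-number values of t only, whereas the text allows any real t, so a finer search would need fractional matrix powers. The function names, the scan limit `max_t` and the toy matrices are assumptions made for the illustration.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def diff(a, b):
    """Sum of the squared entries of C = A - B."""
    return sum((x - y) ** 2
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


def closest_t(q, m, max_t=500):
    """Return the integer t in 1..max_t that minimizes diff(Q**t, M)."""
    best_t, best_value = None, float("inf")
    power = q  # holds Q**t as t increases
    for t in range(1, max_t + 1):
        value = diff(power, m)
        if value < best_value:
            best_t, best_value = t, value
        power = mat_mul(power, q)
    return best_t


Q = [[0.99, 0.01], [0.01, 0.99]]  # made-up PAM-1-like matrix
M = [[0.90, 0.10], [0.12, 0.88]]  # made-up observed transition matrix
distance = closest_t(Q, M)
```

In this toy case Q is symmetric and M is not, so no power of Q can equal M exactly. The scan still returns the t whose Q^t is closest under diff, which is the situation described in the text.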
Tree Building
Algorithms for building the tree are left as an exercise for the reader, but here are a few things
that can be said about the tree:
1. it is binary
2. it is built from the pairwise distances (for n taxa, an n-by-n matrix of distances)
3. the closest taxa (those with the smallest distance) branched most recently (lower in the tree
if the root is at the top)
4. every distance function you can come up with will have "errors", that is, the data won't fit a
tree perfectly. The ambiguity is usually resolved statistically.
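As one possible illustration of points 1 to 3, the sketch below builds a binary tree by repeatedly joining the two closest clusters and averaging their distances to everything else (an UPGMA-style procedure). It is not the method used to produce the results in this paper, which came from Phylip; the function name, the nested-tuple output and the distances are made up for the example.

```python
def upgma(names, dist):
    """Join the two closest clusters until one binary tree remains.

    names: list of taxon labels; dist: symmetric n-by-n list of distances.
    Returns the tree as nested 2-tuples of labels.
    """
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    d = {frozenset((i, j)): dist[i][j]
         for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(d, key=d.get)  # closest pair branched most recently
        i, j = sorted(pair)
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        for k in clusters:
            d_ik = d.pop(frozenset((i, k)))
            d_jk = d.pop(frozenset((j, k)))
            # Size-weighted average distance from the merged cluster to k.
            d[frozenset((next_id, k))] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        del d[pair]
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree


# Made-up distances, for illustration only.
taxa = ["human", "chimp", "gorilla", "mouse"]
D = [[0, 2, 4, 10],
     [2, 0, 4, 10],
     [4, 4, 0, 10],
     [10, 10, 10, 0]]
tree = upgma(taxa, D)
```

With these toy distances, human and chimp are joined first, gorilla joins them next, and mouse joins last. Real distances rarely fit a tree this cleanly, which is the "error" mentioned in point 4.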
Results
The method was applied to the following set of mammals:
1. horse
2. rhinoceros
3. harbor seal
4. cat
5. blue whale
6. cow
7. human
8. common chimp
9. gorilla
10. mouse
11. rat
12. guinea pig
13. opossum
The resulting tree (output from Phylip):
[Phylip tree diagram; the branch layout is not recoverable here. Leaf labels as printed by Phylip:
1horse, 2rhinocero, 3harborsea, 4cat, 5bluewhale, 6cow, 7human, 8commonchi, 9gorilla, 10mouse,
11rat, 12guineapi, 13opossum]
Why it doesn't work as well as expected
As you can see in the tree, human and gorilla are closer than human and horse, which is
encouraging, but paleontologists' findings indicate that human and chimp should be even closer.
Furthermore, the opossum-horse branching occurred too late (the opossum is a marsupial and should
have branched off from the placentals earlier). In other words, the distance function is not very
accurate.
We made many approximations and assumptions to simplify our problem. Also, we had to rely on
sequences obtained from databases. Possible sources of error include (in order of appearance):
1. Incorrect sequences
2. Too small a data set for PAM-1 to be representative of a "unit of evolution"
3. Taking roots of matrices does not always yield the expected results (e.g. sometimes small
imaginary or negative numbers would appear)
4. The contrived notion of Q^t ≈ M, which depends entirely on the function used for the matrix
difference
5. Exclusion of parallel evolution from the model
6. Tree building software
7. The criteria for what works and what does not (we don't know what the tree should look like,
because if we did, we wouldn't need this method)
Conclusion
Our understanding of DNA needs to improve before any significant advances can be made in this
field. It is not possible to devise a phylogenetic method and claim that it works unless one
understands the obscure molecular biology of life. Even that might not be enough! The historical
evidence is so sparse that one can even wonder: did evolution take place, or is it all just a
fixation?