Math REU - Inferring phylogeny from Protein / DNA sequence
Arian Sibila, 7/30/2003

Inferring the phylogeny means building the phylogenetic tree. A phylogenetic tree is a model of evolution. Building it means determining which branching came first, which second, and so on. The available information is the DNA or protein sequence of the taxa forming the leaves of the tree. A sequence does not indicate its position in the tree, but we reconstruct the tree from pairwise comparisons.

Introduction

From species to the tree of evolution is a long way, for we have recovered very little information. To get a better idea: we're looking at the past 100 or so years to reconstruct branches and edges that Nature has been building for over 500 million years. It seems impossible! Furthermore, some of the information we have is useless, either because no one has fully deciphered and understood its dynamics (as with DNA) or because it would create enormously complicated models. Our goal is therefore to capture the essence of the evolution of species and to approximate it.

The problem can be represented very neatly in mathematical language, since species are represented uniquely by sequences of symbols. We don't know what language they describe, but we know that more differences mean more distantly related taxa. A symbol is any letter from an alphabet: A, T, G, C (4 letters) for DNA, and A, C, Y, W, … (20 letters) for protein sequences. A sequence consists of some number of symbols. The property we observe is that, in nature at least, these sequences change, or mutate, over time. The simplest of the several kinds of mutation that happen in nature is called substitution, where one symbol is replaced by another from the same alphabet. It is the only kind we consider, for two reasons:

1. it gives a simpler model
2. we're hoping that other kinds of mutation can be approximated by substitutions, since substitution is the most basic kind

Based on chemical properties, certain substitutions happen more often than others, and this is represented by probabilities: P(j | i) is the probability of getting the j-th symbol (from the alphabet) given that we have the i-th symbol, that is, the probability of the substitution (i => j). All the probabilities are conveniently placed in a single matrix, with the entry P(j | i) in the i-th row and j-th column:

$$
\begin{pmatrix}
P(1 \mid 1) & P(2 \mid 1) & \cdots & P(n \mid 1) \\
P(1 \mid 2) & P(2 \mid 2) & \cdots & P(n \mid 2) \\
\vdots & \vdots & \ddots & \vdots \\
P(1 \mid n) & P(2 \mid n) & \cdots & P(n \mid n)
\end{pmatrix}
$$

Some important properties can be seen:

1. rows add up to 1.0
2. the matrix can be raised to a power to get the probability of a series of substitutions
3. a power of the matrix has the same properties

Some other properties come from the nature of the data we use:

1. entries along the main diagonal are the largest (representing non-change, e.g. P(1 => 1), or 1 remaining 1)
2. raising the matrix to a power gives more change; taking roots gives less change

This matrix, representing the natural substitutions or Percent of Accepted Mutations (PAM), is constructed from a large set of sequences by counting all the pairings during pairwise alignment (of either DNA or protein sequences). Note the differences from the original (Dayhoff, '78) definition of calculating PAM: in this paper we use a slightly different method, based entirely on counting and probabilities.

Based on the data, we're hoping to find a method that expresses the difference between sequences as a distance, so that a tree can be plotted. In general, this means finding a function that takes a pair of sequences to a real number:

$$
d : \text{sequence}^2 \to \mathbb{R}
$$

Finally, we're hoping that the distance between any two taxa is the sum of the distances from each of them to their common ancestor, while in reality this might not always be the case.

Sequence

In nature at least, sequences are all of different lengths. So how do you compare them? Symbol-by-symbol comparison requires some symbols to be discarded from the longer (or even the shorter!) sequence. Biologists have come up with a method called aligning, based entirely on probability and scoring matrices. The idea is that an alignment with a bigger score is more likely to have happened in nature: the first (possible) source of error! The process of aligning inserts gaps where needed, to pair up the chunks that probably have similar functions. During the course of evolution, sequences lose (deletions) and gain (insertions) chunks of symbols, but there are parts of sequences that remain unaffected from one taxon to the other, and those parts should theoretically be paired after aligning. Comparisons should then tell us something about the substitutions, whereas before aligning, the deleted and inserted chunks would give us garbage.

Biological clock and mutations

As mentioned earlier, substitution is the only kind of mutation that we consider, and there is a good reason for that: the so-called biological clock! The claim is that molecular processes take place at more or less the same rate today as they did 100 mya. Although there is no proof, we accept that since substitutions are random, more time means more opportunity for them to happen. That is not the case with insertions and deletions, where whole chunks get inserted or removed in a short period of time. Thus, life mutates at a more or less constant pace. Natural selection decides which mutations are fit to live. After the weak ones die and the strong ones survive, the population contains the new (mutated) set of beneficial genes, and we say that the species has evolved.

Evolution and Parallel Evolution

Sometimes, new taxa are created. To get a clue about how, let's imagine the following scenario. A population of monkeys is physically divided by a river. It's known that monkeys do not cross water, so there will be no contact between the two populations for a long time.
Both sides evolve, but they accumulate different sets of mutations as dictated by conditions such as food availability, temperature, landscape, etc. Both sides were evolving even before the river appeared, but it's the loss of contact caused by physical separation that makes the gene pools slowly diverge over time. Eventually they become two different species, unable to interbreed. This is how all life is believed to have evolved. In this case, the branching point, or the ancestor of the two species, is the population before it was divided. That animal is extinct, because the population has mutated, so we don't have its sequence. We can only guess when the split happened.

Who branched first? Is it the human-chimpanzee branching that occurred most recently, or was it human-gorilla? Paleontologists have debated this for decades, but we want to be able to answer questions like this using modern sequences.

The other burning issue is how rapidly the two sides diverged. If the conditions are similar on each side of the river, one would expect the two sides to stay on a similar course of evolution for a while and develop into very similar taxa, despite the separation. This holds especially for large populations, meaning that from those two taxa alone it would be hard to tell how long ago the branching happened compared with some other branching. Another way to say it is that, because of parallel evolution, a distance function based on sequence similarity will give a smaller value. In most cases, however, we assume the separation of a splinter group, consisting of relatively few members, from the main population; it moves into different conditions, where any mutations quickly propagate through the group and cause it to diverge rapidly.

Matrix Distance Methods

Matrices are used as a convenient way of grouping the probabilities of substitutions. Suppose we have the following probability matrix:

$$
\begin{pmatrix}
0.7 & 0.3 \\
0.2 & 0.8
\end{pmatrix}
$$

Then P(1 => 2) = 0.3.
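The lookup above can be checked in a few lines. The following is a minimal Python sketch, not the code used for this paper; the names `P`, `mat_mul`, `mat_pow`, and `rows_sum_to_one` are illustrative assumptions. It stores the example matrix with P(j | i) in row i, column j (indices are 0-based in the code, so symbol 1 is index 0), confirms that each row adds up to 1, and raises the matrix to a whole-number power to get multi-step substitution probabilities.

```python
# Example substitution matrix: P[i][j] holds P(j | i), the probability
# that symbol i is replaced by symbol j in one step (0-based indices).
P = [[0.7, 0.3],
     [0.2, 0.8]]


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, steps):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    # Start from the identity matrix (zero steps means no change).
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, m)
    return result


def rows_sum_to_one(m, tol=1e-9):
    """Check the defining property of a transition matrix."""
    return all(abs(sum(row) - 1.0) < tol for row in m)


one_step = P[0][1]     # P(1 => 2) in a single step
P3 = mat_pow(P, 3)     # three-step substitution probabilities
print(one_step, rows_sum_to_one(P), rows_sum_to_one(P3))
```

Plain lists are used here only to keep the sketch dependency-free; for a 20-symbol protein alphabet a numerical library would be the more natural choice.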
In two steps:

$$
\begin{pmatrix}
0.7 & 0.3 \\
0.2 & 0.8
\end{pmatrix}^{2}
=
\begin{pmatrix}
0.55 & 0.45 \\
0.3 & 0.7
\end{pmatrix}
$$

so P(1 => 2) = 0.45, and so on.

PAM (Percent Accepted Mutations)

PAM was first introduced by Dayhoff in 1978 as a way of aligning and scoring. It's a matrix meant to delineate evolutionary change and to differentiate likely from unlikely mutations. The most important PAM matrix is PAM-1 (since all other PAMs are calculated from it), which corresponds to 1% of change. The idea is: if 99% of the symbols (amino acids or nucleotides) remain the same, what is the probability that other mutations have taken place? Or, as an equation:

$$
P \cdot \operatorname{diag}(Q) = 0.99
$$

where Q is PAM-1, diag(Q) is the vector along its main diagonal, and P is the vector of symbol frequencies (written $P_Q$ below). Note that entries along the main diagonal correspond to non-change, while off-diagonal entries correspond to change. Diagonal entries are expected to be large, close to 1, in order for some of our methods to work.

Dayhoff's method defines the "relative mutability", $M_i$, to be the total amount of change for the given row. We do not make use of this method because it has no mathematical significance (to us at least). Instead, we use the same method as described earlier for finding the transition matrix, with one difference: the counts for (i => j) and (j => i) are the same. In other words, the matrix of counts is symmetric. We use the criteria above to find a matrix Q such that:

$$
Q^{t} = M_Q, \qquad P_Q \cdot \operatorname{diag}(Q) = 0.99
$$

where $M_Q$ is the PAM transition matrix and $P_Q$ is the fraction of each of the symbols in nature (i.e. the counts of each amino acid in the set of all known mammals). This can be solved algebraically or numerically. The idea of the numerical solution is to find an arbitrarily small range of t such that $P_Q \cdot \operatorname{diag}(M_Q^{1/t}) > 0.99$ at the upper limit and $< 0.99$ at the lower limit, by repeatedly cutting the interval in half. For example, starting with an interval of size 100, 100 cuts would give a precision of the order of $100 / 2^{100}$, or about $10^{-28}$.

Matrix Power Method for Computing Distances

If $Q^2 = M$ for some transition matrix M, then the two taxa are 2 PAMs apart. This is a Markov process.
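Both the interval-halving search for PAM-1 and the statement that Q squared gives a 2-PAM transition rest on matrix powers, including non-integer ones. The following is a minimal Python sketch under loud assumptions, not the code used for this paper: it uses a two-symbol alphabet so that fractional powers have a simple closed form (valid for a matrix [[1-a, a], [b, 1-b]] with 0 < a + b < 1), and the matrix `M_Q`, the frequencies `P_Q`, and all function names are made-up illustrations. A real 4- or 20-symbol matrix would need a general matrix-root routine.

```python
def frac_pow_2x2(m, s):
    """Return m**s for a 2x2 transition matrix [[1-a, a], [b, 1-b]].

    Uses the closed form built from the eigenvalues 1 and lam = 1 - a - b.
    Assumes 0 < a + b < 1 so that lam**s is real for fractional s.
    """
    a, b = m[0][1], m[1][0]
    lam = 1.0 - a - b
    if not 0.0 < lam < 1.0:
        raise ValueError("closed form assumes 0 < a + b < 1")
    ls = lam ** s
    total = a + b
    return [[(b + a * ls) / total, (a - a * ls) / total],
            [(b - b * ls) / total, (a + b * ls) / total]]


def unchanged_fraction(freqs, m):
    """P_Q . diag(m): expected fraction of symbols left unchanged."""
    return sum(f * m[i][i] for i, f in enumerate(freqs))


def solve_pam1(m_q, freqs, target=0.99, lo=1.0, hi=101.0, cuts=60):
    """Bisect on t until freqs . diag(m_q**(1/t)) is close to target."""
    for _ in range(cuts):
        mid = (lo + hi) / 2.0
        if unchanged_fraction(freqs, frac_pow_2x2(m_q, 1.0 / mid)) > target:
            hi = mid   # too little change: mid is an upper limit for t
        else:
            lo = mid   # too much change: mid is a lower limit for t
    t = (lo + hi) / 2.0
    return t, frac_pow_2x2(m_q, 1.0 / t)


M_Q = [[0.7, 0.3], [0.2, 0.8]]   # hypothetical PAM transition matrix
P_Q = [0.4, 0.6]                 # hypothetical symbol frequencies
t, Q = solve_pam1(M_Q, P_Q)
back = frac_pow_2x2(Q, t)        # Q**t should reproduce M_Q
print(t, unchanged_fraction(P_Q, Q), back)
```

With these made-up inputs the search settles at a t of roughly 33, meaning that the toy `M_Q` is about 33 PAMs of change. The sketch stops at 60 cuts because further halving of an interval of size 100 is below what double-precision numbers can resolve.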
Similarly, they are t PAMs apart when:

$$
Q^{t} \approx M, \qquad t \in \mathbb{R}
$$

In reality, it gets more complicated. It's not always possible to solve for t, and in fact we're working with cases where it's not. Imagine trying to find a coefficient when vectors are not collinear!

Why it should work

Matrices Q and M are similar in how they are calculated and what they describe. It's how we interpret them that's different:

Q shows the likelihood of substitutions happening in nature
M shows the actual substitutions that happened, as found by comparing two taxa

To look at it another way, if 80% of A's remain the same, then 20% of them were substituted by another symbol in one step. Or perhaps it's more likely that, over two steps, 8% of A's were substituted and 2% of other symbols changed back to A's. This is what the method is about: we know that the transition M consists of tiny substitutions here and there, and we're trying to group them into "steps" and relate those to the natural tendency of symbol substitution, PAM-1.

Solving for "t"

How do we solve for t, then? We don't! Some $Q^t$ will "look like" M, and we take that t to be the distance. It will, actually, never even remotely look like M, but some $Q^t$ will be closest to it, according to the criterion (matrix difference function) that we set:

$$
C = A - B, \qquad \operatorname{diff}(A, B) = \sum_{c \in C} c^{2}
$$

(A, B and C are matrices, and the sum runs over the entries c of C.) It fulfills a few basic criteria: diff(A, A) = 0 and diff(A, B) = diff(B, A).

[Figure: graph of the function f(t) = diff($Q^t$, M)]

Different matrices Q and M give different-looking graphs, but there's always a global minimum, and we take the value of t at that point to be the solution of our equation.

Tree Building

Algorithms for building the tree are left as an exercise for the reader, but here are a few things that can be said about the tree:

1. it is binary
2. it is built from the pairwise distances (for n taxa, an n-by-n matrix of distances)
3. the closest taxa (those with the smallest distance) branched more recently (lower in the tree if the root is at the top)
4. every distance function you can come up with will have "errors", that is, the data won't fit a tree perfectly. The ambiguity is usually resolved statistically.

Results

The method was applied to the following set of mammals:

1. horse
2. rhinoceros
3. harbor seal
4. cat
5. blue whale
6. cow
7. human
8. common chimp
9. gorilla
10. mouse
11. rat
12. guinea pig
13. opossum

The resulting tree (output from Phylip):

[Figure: Phylip tree drawing. Tip labels as printed: 1horse, 2rhinocero, 3harborsea, 4cat, 5bluewhale, 6cow, 7human, 8commonchi, 9gorilla, 10mouse, 11rat, 12guineapi, 13opossum.]

Why it doesn't work as well as expected

As the tree shows, human and gorilla are closer than human and horse, which is encouraging, but paleontologists' findings indicate that human and chimp should be even closer. Furthermore, the opossum-horse branching occurred too late (the opossum is a marsupial and should have branched off from the placentals earlier). In other words, the distance function is not very accurate. We made lots of approximations and assumptions to simplify our problem. Also, we had to rely on sequences obtained from databases. Possible sources of error include (in order of appearance):

1. Incorrect sequences
2. Too small a data set for PAM-1 to be representative of a "unit of evolution"
3. Taking roots of matrices does not always yield the expected results (e.g. sometimes small imaginary or negative numbers appear)
4. The contrived notion of $Q^t \approx M$; it all depends on the function used for matrix difference
5. Exclusion of parallel evolution from the model
6. Tree-building software
7. The criteria for what works and what does not (we don't know what the tree should look like, because if we did, we wouldn't need this method)

Conclusion

Our understanding of DNA needs to improve before any significant advances can be made in this field. To devise a phylogenetic method and claim that it works is not possible unless one understands the obscure molecular biology of life. Even that might not be enough! The historical evidence is so sparse that one can even wonder: did evolution take place, or is it all just a fixation?