Maximum Likelihood in Phylogenetics

Group 3
JAN JONES
SHANTHI IYANPERUMAL
SYUSANNA KOYFMAN

Project for Probability and Statistics
Dr. M. Partensky
4/19/04

Table of Contents

Maximum Likelihood in Phylogenetics
P&S Project Part 1:
    Maximum Likelihood Overview
P&S Project Part II:
    General Model of DNA Substitution
    Calculation of likelihood of molecular sequences
        Example 1
        Example 2
        Example 3
    DNA substitution Models
        Jukes-Cantor (JC)
        Felsenstein 1981 (F81)
        Kimura 2-parameter (K80)
        Hasegawa-Kishino-Yano (HKY)
        Tamura-Nei (TrN)
        Kimura 3-parameter (K3P)
        Transition Model (TIM)
        Transversion Model (TVM)
        Symmetrical Model (SYM)
        General Time Reversible (GTR)
        Gamma Distribution (G)
        Proportion of Invariable Sites (I)
    Amino acid Substitution Models
        Empirical substitution models
        PAM matrices
        Dayhoff matrices
        JTT matrices
        Other empirical models
        Blosum (Block substitution matrices)
        Poisson models
P&S Project Part III:
    Phylogenetic trees
        Maximum parsimony
        Maximum likelihood
        Likelihood Ratio Tests in Phylogenetics
Reference

P&S Project Part 1: Maximum Likelihood Overview
By: Janice Jones
04/19/04

This section of the project presents an overview of the maximum likelihood method and introduces its application to the construction of phylogenetic trees. The two other sections of this project, submitted by Shanthi Iyanperumal and Syusanna Koyfman, provide additional information about the method's application and explore models of DNA substitution, the testing of these models, likelihood ratio tests, and a comparison with the maximum parsimony method.

Phylogenetics is an area of high interest in bioinformatics. Construction of a phylogenetic tree provides insights into the origins of genes and their protein products. A tree can help in formulating questions about the behavior of a gene or protein in various organisms. A number of statistical methodologies have been devised to construct a phylogenetic tree from sequence data for a set of related genes or proteins in one or more organisms. One of these, the maximum likelihood method, is reviewed by Huelsenbeck and Crandall (1). What follows is an interpretation of the methodology as they describe it, supplemented by an additional reference and two examples in Mathematica.

According to (1), the maximum likelihood method was first described in 1922 by the English statistician R. A. Fisher.
It was apparently not widely used until after 1990, when increased computing power and optimizations of the method itself made its application more practical. According to (3), it is one of the most popular sequence-based criteria for evaluating trees, along with parsimony and compatibility. It is inherently more computationally expensive than parsimony, but various newer optimizations have been proposed to address this problem.

General Approach

The job of the maximum likelihood method is to find the probability model that best describes a set of known data. As model parameters are changed, the method seeks those parameter values that maximize the probability of observing the actual data. Felsenstein (2) notes that when the method is applied to an evolutionary tree, the likelihood is the probability of evolving the observed sequences given the hypothesis (a proposed tree). He emphasizes that the likelihood is not the probability that the tree itself is correct.

To explain how a likelihood calculation works, Huelsenbeck and Crandall (1) start with the familiar binomial distribution, used to describe a coin tossing experiment. The distribution formula specifies the probability of getting h successes, given n trials and a probability of success p:

\[ p(h \mid n, p) = \binom{n}{h}\, p^{h} (1-p)^{n-h} \]

They then turn this calculation around, even though the right-hand side stays the same. In the conventional binomial probability calculation, the probability p of success is held constant and the number of successes in n trials is what we want. In a likelihood calculation, the numbers of successes and trials are held constant and the probability of a single success is what we are trying to find. The left-hand side of the equation above becomes L(p|h,n): the likelihood that the value p results in an outcome of h successes in n trials.
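The likelihood L(p|h,n) defined above can be evaluated over a grid of candidate values of p. The following is a minimal Python sketch, not the Mathematica example that accompanies the project; the data (7 heads in 10 tosses), the grid resolution, and the function names are assumptions chosen for illustration.

```python
from math import comb

def binomial_likelihood(p, h, n):
    """L(p | h, n): probability of h successes in n trials when each succeeds with probability p."""
    return comb(n, h) * p**h * (1 - p)**(n - h)

def grid_mle(h, n, steps=1000):
    """Scan p over an evenly spaced grid on [0, 1] and return the p with the largest likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: binomial_likelihood(p, h, n))

# Hypothetical experiment: 7 heads observed in 10 tosses.
h, n = 7, 10
p_hat = grid_mle(h, n)
print(p_hat, binomial_likelihood(p_hat, h, n))
```

On this grid the largest likelihood falls at p = h/n, which is the same value obtained by setting the slope of L to zero.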
The calculation is repeated for a range of reasonable values of p, and the result, the likelihood L of p, is plotted against p. The value of p for which L reaches a maximum is taken to be the "best" value of p.

Data from a hypothetical coin toss experiment is then used to illustrate the method. In this example, the goal is to determine the probability of heads, given an outcome of h heads in n trials. I have provided Example 1 in the accompanying Mathematica file, Proj_Group3JaniceJones.nb, to demonstrate this use of maximum likelihood.

The authors mention that in this simple experiment, an alternative to the computational solution is to determine the value of p for which the slope of the likelihood L is zero. This works in the simple experiment but may be more difficult to apply when the probability model becomes more complex.

Second Simple Example

I have prepared a second example with a slightly more complex probability model, one based on permutations. Here, we choose 10 marbles from a jar that holds a total of 17 marbles of three different colors. We are told the colors of the marbles we picked. The problem is to determine how many marbles of each color still remain in the jar.

The goal of the probability function used in the solution was not to calculate the probability that a random marble would be of a particular color, but the probability that the 10 marbles would be of the specified colors. The test parameters (colors of the 7 remaining marbles) that gave the highest probability for the 10 marble colors picked were considered to be the "most likely" values. See Example 2 in Proj_JaniceJones.nb for additional comments and code.

One interesting aspect of this problem is that the solution does not require the likelihood measure to be a probability value; a value that is directly proportional to the probability works just as well.
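The marble problem can be sketched in Python as follows. This is an illustration under stated assumptions, not the code from the Mathematica notebook: the draw is taken to be 5 red, 3 green, and 2 yellow marbles (the counts quoted later in this section), and the likelihood measure is a simple count of the ways to draw those colors from a candidate jar, which is proportional to the probability because the common divisor C(17, 10) is the same for every candidate.

```python
from math import comb
from itertools import product

DRAWN = {"red": 5, "green": 3, "yellow": 2}   # assumed colors of the 10 marbles picked
REMAINING = 7                                  # 17 marbles in the jar minus the 10 drawn

def draw_count(jar, drawn):
    """Number of ways to draw the observed colors from a jar with the given totals.
    Proportional to the probability of the draw; the divisor C(17, 10) is omitted."""
    ways = 1
    for color, k in drawn.items():
        ways *= comb(jar[color], k)
    return ways

def most_likely_remaining(drawn, remaining):
    """Try every split of the remaining marbles among the colors and keep the best one."""
    colors = list(drawn)
    best, best_ways = None, -1
    for left in product(range(remaining + 1), repeat=len(colors)):
        if sum(left) != remaining:
            continue
        jar = {c: drawn[c] + r for c, r in zip(colors, left)}
        ways = draw_count(jar, drawn)
        if ways > best_ways:
            best, best_ways = dict(zip(colors, left)), ways
    return best, best_ways

best, ways = most_likely_remaining(DRAWN, REMAINING)
print(best, ways)   # {'red': 4, 'green': 2, 'yellow': 1} 3780
```

Dividing every count by C(17, 10) would turn the counts into probabilities without changing which candidate wins.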
When the probability function is simplified to return a permutation count rather than a probability, the "most likely colors" part of the result is the same, even though the maximum likelihood value itself has changed. Another interesting aspect is that since any two of the parameters (number of marbles of each color) can vary independently, a simple graphical solution is not an option.

Application to Phylogenetics

How is this approach used in the construction of a phylogenetic tree? The data we start with, instead of a count of heads in n trials or marbles of different colors, is a set of s sequences aligned in n positions. For the next example, to keep the possibilities simple, say the sequences are nucleic acids. Then a data point is the set of bases in a given aligned position. For the following tiny sample alignment,

AG
AC
TG

there are two data points: {A,A,T} for position 1 and {G,C,G} for position 2. For each of these positions, there are $r = 4^s$ possible values, where s is the number of sequences being aligned. Since there are 3 sequences in this example, there are 64 possible values for each position: {A,A,A}, {A,A,T}, {A,A,G}, {A,A,C}, {A,T,A}, etc. Each of the 64 possible patterns can be assigned an ID number (1 to 64) to distinguish it from the other patterns.

Disregarding data point order for now, the probability of observing particular values for the collection of data points (positions) in the alignment can be described by the multinomial distribution, as in (1):

\[ \Pr(n_1, n_2, \ldots, n_r \mid p_1, p_2, \ldots, p_r) = \binom{n}{n_1\, n_2 \cdots n_r} \prod_{i=1}^{r} p_i^{\,n_i} \]

Here $n_i$ is the number of positions that show pattern i, $p_i$ is the probability of pattern i, and n is the total number of positions.

The authors compare the 64 possible outcomes for each data point with the coin toss, which had only two possible outcomes. Although maximum likelihood was successfully applied to the marble problem, its solution did not require that this same model be used. The marble problem can be said to have only one data point, but with many possible outcomes. Each outcome was one of the possible permutations of the choice of 10 marbles.
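The site patterns of the tiny alignment above, and the multinomial probability of their counts, can be computed with a short Python sketch. The function names are illustrative, and using the observed proportions as pattern probabilities is an assumption made here only to have concrete numbers to plug in.

```python
from collections import Counter
from math import factorial, prod

alignment = ["AG", "AC", "TG"]   # the three aligned sequences from the text

def site_patterns(seqs):
    """Return one tuple of bases per aligned position (one per column)."""
    return [tuple(column) for column in zip(*seqs)]

def multinomial_probability(counts, probs):
    """n! / (n_1! ... n_r!) * product of p_i ** n_i, taken over the observed patterns.
    Patterns that were never observed contribute a factor of 1 and are skipped."""
    n = sum(counts.values())
    coefficient = factorial(n)
    for c in counts.values():
        coefficient //= factorial(c)
    return coefficient * prod(probs[k] ** c for k, c in counts.items())

patterns = site_patterns(alignment)       # [('A', 'A', 'T'), ('G', 'C', 'G')]
counts = Counter(patterns)                # each pattern occurs once
n_possible = 4 ** len(alignment)          # 4^s possible patterns per position

# Assumed pattern probabilities: the observed proportions n_i / n.
probs = {k: c / len(patterns) for k, c in counts.items()}
likelihood = multinomial_probability(counts, probs)
print(n_possible, likelihood)             # 64 0.5
```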
But back to the sequences. The probability question changes from "What is the probability of h heads in n tosses?" or "What is the probability of drawing 5R, 3G, and 2Y marbles?" The new question is "What is the probability of getting n1 occurrences of data point (1), n2 occurrences of data point (2), ... n64 occurrences of data point (64), given 3 sequences of 2 bases each?"

As was done with the coin toss and marble examples, the likelihood function is a probability function that estimates the probability for data that is already known. A trivial probability function estimates the probability (of getting the sequence data points) based on 64 parameters. Each parameter is the probability of getting a particular base combination for the three sequences in a random position. The most likely value for each parameter turns out to match the proportion of the time that the sequence combination occurs over all the positions in the alignment. In the small 2-residue, 3-sequence example, the parameters $p_i$ are 0 for all but two of the 64 possibilities. For each of these two, the maximum likelihood estimate for $p_i$ is $n_i/n$, or 1/2.

Need for More Complex Model

While this type of likelihood equation does call attention to sequences, the probability question that it addresses is not biologically interesting. It differs from the coin toss only in the number of possible outcomes for each experiment. To become interesting, the model that defines the probability of observing the given site patterns must become much more complex.

One interesting model is a definition of a phylogenetic tree that contains nodes, branch lengths, and tips. Each tip represents a sequence and each node represents a convergence of the sequences on its adjacent branches. The resulting tree, in most circumstances, reflects the expected "tree of life" in its picture of how various species are related.
(Humans would be seen as most closely related to chimpanzee, then other primates, then mouse/rat, etc.) For this discussion, all the sequences are assumed to be nucleotides. The same principles could be applied to protein sequences, but the model becomes more complex.

For the likelihood probability function, branch lengths of the tree are specified as a function of the expected number of changes per site and a model of sequence change (1). A given site pattern can be represented as the tips of branches that are connected directly or indirectly. At each branch node, any of the four nucleotides is possible.

Assuming this model, there are 64 possible assignments of nucleotides to the internal nodes that may contribute to a site pattern representing one position in a hypothetical alignment of four sequences. Such a node arrangement is illustrated in the following figure, copied from (1). It represents a single position in which the aligned sequences have bases G, G, T, and T respectively. The nodes i, j, and k may each have any of the four nucleotides, allowing a total of 64 different trees that have the leaves shown.

Figure 1: copied from (1)

Instead of using a simple probability value for the site pattern, as is implied by the multinomial probability model, the phylogenetic model calculates the site pattern probability as a sum of terms, one term for each possible configuration of nodes that could lead to the pattern. Here, a configuration of nodes refers to the assignment of nucleotides to the nodes.

Each term itself depends on the "equilibrium frequency" of the "base" node, the presumed length of all the branches that lead (directly or indirectly) to the leaves, the probability of observing two given nucleotides at the ends of each branch, and other unspecified parameters. The "base" node (my term) is the one furthest from the "leaves"; in the above diagram, it is node k.
Its equilibrium frequency has a different value for each of the four nucleotides that can be assigned to the node. It is typically determined by the overall base composition of the sequences and is intended to estimate the probability that the nucleotide, going back in evolutionary time, was of that base.

The following equation, copied from (1), represents a probability value for the sequences observed in a single position, illustrated in Figure 1. The values v, as in the diagram, represent branch lengths; the value π represents the equilibrium frequency for node k; the values θ represent "additional parameters" of the nucleotide substitution model. It is presented here just to illustrate the complexity and potential computational expense of a tree likelihood calculation.

As extensive as this equation appears, it represents only one position in an alignment that contains just four sequences. As the number of sequences increases, the number of summations required by the above probability equation goes up exponentially. According to Felsenstein (2), the number of terms becomes $2^{2n-2}$, where n is the number of sequences; if we are working with 10 sequences (a fairly modest number), we would have $2^{18}$ terms for just one aligned position.

The likelihood of a tree for the entire sequence alignment is the product of these estimated probability values taken across all the sequence positions. Each position contributes equally, and it is assumed that all positions are independent.

In short, the above probability function for a single candidate tree involves a very large number of computations. If all the possible trees are considered, the computation becomes impossibly long. For the maximum likelihood method to be a usable phylogenetic tool, some optimizations were needed. Several interesting ones were proposed by Felsenstein (2) in 1981.
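The "full tree" sum over node configurations can be made concrete with a small Python sketch. Several things here are assumptions not taken from (1): the tree shape (root k with children i and j, leaves G and G below i on branches v1 and v2, leaves T and T below j on branches v3 and v4, and branches v5 and v6 joining k to i and j), the use of the Jukes-Cantor model with equal base frequencies, and all function names.

```python
from itertools import product
from math import exp

BASES = "ACGT"
PI = {b: 0.25 for b in BASES}   # equal equilibrium frequencies (Jukes-Cantor assumption)

def p_jc(x, y, v):
    """Jukes-Cantor probability of seeing base y at the end of a branch of length v
    that starts with base x (v in expected substitutions per site)."""
    e = exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def site_likelihood_full(leaves, v):
    """Sum over all 4^3 = 64 assignments of bases to the internal nodes k, i, j."""
    a, b, c, d = leaves
    v1, v2, v3, v4, v5, v6 = v
    total = 0.0
    for k, i, j in product(BASES, repeat=3):
        total += (PI[k]
                  * p_jc(k, i, v5) * p_jc(i, a, v1) * p_jc(i, b, v2)
                  * p_jc(k, j, v6) * p_jc(j, c, v3) * p_jc(j, d, v4))
    return total

def terms_per_site(n):
    """Number of terms in the full sum for n sequences, as quoted from Felsenstein (2)."""
    return 2 ** (2 * n - 2)

# Hypothetical branch lengths for the G, G, T, T site pattern.
print(site_likelihood_full("GGTT", (0.1, 0.1, 0.1, 0.1, 0.2, 0.2)))
print(terms_per_site(4), terms_per_site(10))   # 64 262144
```

A useful sanity check on any such function is that the site likelihoods summed over all 256 possible leaf patterns equal 1.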
Felsenstein's first optimization reduced the number of summations in the probability function for the tree represented by a single sequence position. It is best explained by referring to the tree diagram (Figure 1). In the "full tree" calculation shown above, the left side of the tree is kept static while all possible combinations of the right side are calculated. Then one change is made to the left side, and all possible combinations on the right are calculated again. This continues for all possible combinations on the left side. Then the base at node k is changed and all the calculations on both sides are repeated.

In the optimized calculation, the likelihood (probability function) is first calculated separately for each of the "outermost" nodes. In Figure 1, these are the nodes labeled i and j. For each of these nodes, four likelihood values are calculated, one for each possible value of its parent node k. For node i, a likelihood value would be based on its two leaves and the base at node k.

The likelihood for the parent node is then the summation, over each of its possible values (A, T, G, or C), of the product of the likelihoods of its child nodes and the equilibrium frequency for node k. If $\pi_k$ is the equilibrium frequency of the base at node k, and $L_{ik}$ and $L_{jk}$ represent the likelihood values of nodes i and j for a given value of k, then the likelihood of the tree would be calculated as

\[ L = \sum_{k} \pi_k \, L_{ik} \, L_{jk} \]

Calculation of the L values at each of nodes i and j would require 16 summations, and the final likelihood calculation would require 4 summations, for a total of 36 summations.

Felsenstein's second optimization involves the choice of tree topologies to test. As stated in his paper, two million topologies are possible for a tree that represents just 10 sequences. The optimization was to first build the tree with only two sequences, and then to add just one sequence at a time without altering the arrangement of the existing nodes.
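The first optimization described above, computing the likelihoods at nodes i and j once for each base at k and then combining them, can be sketched in Python. The tree shape (leaves on branches v1 to v4, with v5 and v6 joining i and j to k), the Jukes-Cantor model with equilibrium frequencies of 1/4, and the function names are assumptions made for this illustration rather than details from Felsenstein's paper.

```python
from math import exp

BASES = "ACGT"

def p_jc(x, y, v):
    """Jukes-Cantor probability of base y at the end of a branch of length v starting at x."""
    e = exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def conditional(leaf_a, leaf_b, va, vb, v_up):
    """For each base k at the parent, the likelihood of the two leaves below a child node:
    the sum over the child's base c of P(k->c) * P(c->leaf_a) * P(c->leaf_b).
    Four parent bases times four child bases gives 16 terms."""
    return {k: sum(p_jc(k, c, v_up) * p_jc(c, leaf_a, va) * p_jc(c, leaf_b, vb)
                   for c in BASES)
            for k in BASES}

def site_likelihood_pruned(leaves, v):
    """Likelihood of one site: sum over k of pi_k * L_ik * L_jk (4 terms)."""
    a, b, c, d = leaves
    v1, v2, v3, v4, v5, v6 = v
    l_i = conditional(a, b, v1, v2, v5)
    l_j = conditional(c, d, v3, v4, v6)
    return sum(0.25 * l_i[k] * l_j[k] for k in BASES)

# Hypothetical branch lengths for the G, G, T, T site pattern of Figure 1.
print(site_likelihood_pruned("GGTT", (0.1, 0.1, 0.1, 0.1, 0.2, 0.2)))
```

The result matches the 64-term sum over all assignments of bases to i, j, and k, but is reached with 16 + 16 + 4 = 36 terms.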
A stated disadvantage of this approach was that the final topology could vary, depending on the order in which sequences were added.

A third optimization from the same paper involved the "Pulley Principle". This was the observation that the probability of a tree does not change as long as the total length of two branches attached to the same node stays constant. For example, in Figure 1, the probability of the tree will not change as long as the sum of the distances v5 and v6 stays the same. This permitted incremental changes to be made to a branch length until no further improvement appeared in the probability value.

Citations

1. Huelsenbeck, J., Crandall, K. Phylogeny Estimation and Hypothesis Testing Using Maximum Likelihood. Annu. Rev. Ecol. Syst., 1997, 28:437-66.
2. Felsenstein, J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 1981, 17:368-76.
3. Kim, J., Warnow, T. Tutorial on Phylogenetic Tree Estimation. http://kim.bio.upenn.edu/~jkim/media/ISMBtutorial.pdf

P&S Project Part II: DNA Substitution Models
By: Shanthi Iyanperumal

Phylogeny

Phylogeny is the evolution of a genetically related group of organisms. Its study concerns the relationships between collections of "things" (genes, proteins, organs, ...) that are derived from a common ancestor.

General Model of DNA Substitution

Maximum likelihood evaluates the probability that the chosen evolutionary model would have generated the observed sequences. Phylogenies are then inferred by finding those trees that yield the highest likelihood. The rate matrix for a general model of DNA substitution is given by

\[
Q = \{ q_{ij} \} =
\begin{pmatrix}
\cdot & r_2 p_C & r_4 p_G & r_6 p_T \\
r_1 p_A & \cdot & r_8 p_G & r_{10} p_T \\
r_3 p_A & r_7 p_C & \cdot & r_{12} p_T \\
r_5 p_A & r_9 p_C & r_{11} p_G & \cdot
\end{pmatrix}
\]

The rows and columns are ordered A, C, G, and T. The matrix gives the rate of change from nucleotide i (arranged along the rows) to nucleotide j (along the columns). For example, $r_2 p_C$ gives the rate of change from A to C.
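The rate matrix above can be assembled in a few lines of Python. This sketch makes two assumptions that the text does not state explicitly: each diagonal entry (shown as a dot) is taken to be minus the sum of the other entries in its row, which is the usual convention for rate matrices, and transition probabilities for a branch of length v are approximated by a truncated series for the matrix exponential of Qv. The function names and the all-ones test rates are illustrative only.

```python
from math import exp

# Index of the rate parameter r_1..r_12 used at each off-diagonal position (rows/columns A, C, G, T).
R_INDEX = [
    [None, 2, 4, 6],
    [1, None, 8, 10],
    [3, 7, None, 12],
    [5, 9, 11, None],
]

def build_q(r, freqs):
    """r maps the indices 1..12 to relative rates; freqs lists p_A, p_C, p_G, p_T."""
    q = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                q[i][j] = r[R_INDEX[i][j]] * freqs[j]
        q[i][i] = -sum(q[i])          # assumed convention: each row sums to zero
    return q

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def transition_matrix(q, v, squarings=10, terms=12):
    """Approximate exp(Q v) with a truncated Taylor series plus scaling and squaring."""
    scale = 2 ** squarings
    a = [[q[i][j] * v / scale for j in range(4)] for i in range(4)]
    p = [[float(i == j) for j in range(4)] for i in range(4)]
    term = [row[:] for row in p]
    for n in range(1, terms + 1):
        term = [[x / n for x in row] for row in mat_mul(term, a)]
        p = [[p[i][j] + term[i][j] for j in range(4)] for i in range(4)]
    for _ in range(squarings):
        p = mat_mul(p, p)
    return p

# With all rates equal to 1 and equal base frequencies the model reduces to Jukes-Cantor.
q = build_q({k: 1.0 for k in range(1, 13)}, [0.25] * 4)
p = transition_matrix(q, 0.5)
print(p[0][0], 0.25 + 0.75 * exp(-0.5))   # the two values should agree closely
```

In this parameterization v is measured in the time units of Q; no rescaling to expected substitutions per site is attempted.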
Let P(v,s) be the transition probability matrix, where $p_{ij}(v,s)$ is the probability that nucleotide i changes into j over a branch of length v. The vector s contains the parameters of the substitution model (e.g. pA, pC, pG, pT, r1, r2, ...). To calculate the probability of observing a change over a branch of length v, the following matrix calculation is performed:

\[ P(v, s) = e^{Qv} \]

Calculation of likelihood of molecular sequences:

For DNA sequence comparison the model has two parts, the base composition and the process. The composition is just the proportion of the four nucleotides A, C, G, T.

Example 1: Likelihood of a single sequence with two nucleotides, AC

If the model is the Jukes-Cantor model, which has a base composition of 1/4 for each nucleotide, then the likelihood will be 1/4 x 1/4 = 1/16. If the model has a composition of 40% A and 10% C, the likelihood of the sequence will be 0.4 x 0.1 = 0.04.

If we take the 16 possible two-nucleotide combinations and add up their likelihoods, the sum is 1. For any model, the sum of the likelihoods of all the different data possibilities should be 1.

Example 2: Likelihood of a one-branch tree between two aligned sequences

Sequence 1  CCAT
Sequence 2  CCGT

The other part of the model, the process part, is needed if we have more than one sequence related by a tree. The process might be described by sentences, or by a matrix of numbers, describing how the nucleotides change from one to another.

Let the composition part of the model be denoted by π = [0.1, 0.4, 0.2, 0.3]. The order of the bases is A, C, G, and T. There are 16 possible changes from one nucleotide to another. The changes can be represented as a 4 x 4 matrix.
           A      C      G      T
P =   A  0.976  0.01   0.007  0.007
      C  0.002  0.983  0.005  0.01
      G  0.003  0.01   0.979  0.007
      T  0.002  0.013  0.005  0.979

The likelihood of going from sequence 1 to sequence 2 is:

\[ L = \pi_C P_{C \to C} \cdot \pi_C P_{C \to C} \cdot \pi_A P_{A \to G} \cdot \pi_T P_{T \to T} = 0.4 \times 0.983 \times 0.4 \times 0.983 \times 0.1 \times 0.007 \times 0.3 \times 0.979 = 0.0000300 \]

(Multiplying the rounded entries shown gives approximately 0.0000318; the reported value presumably reflects unrounded matrix entries.)

In the above example we did not consider branch lengths. Intuitively, for short branch lengths the probability of a base change is low, and for long branch lengths it is high. Let us assume the matrix we have chosen describes a branch with a Certain Evolutionary Distance (CED). The likelihood we calculated was for 1 CED unit. The likelihood for the same alignment for 2 CED units is found by multiplying matrix P by itself.

             A      C      G      T
P^2 =   A  0.953  0.02   0.013  0.015
        C  0.005  0.966  0.01   0.02
        G  0.007  0.02   0.959  0.015
        T  0.005  0.026  0.01   0.959

The likelihood for 2 CED units is:

\[ L = \pi_C P^2_{C \to C} \cdot \pi_C P^2_{C \to C} \cdot \pi_A P^2_{A \to G} \cdot \pi_T P^2_{T \to T} = 0.4 \times 0.966 \times 0.4 \times 0.966 \times 0.1 \times 0.013 \times 0.3 \times 0.959 = 0.0000559 \]

As the branch length increases, the values on the diagonal decrease and the other values increase, because a change becomes more likely relative to staying the same.

The table lists the likelihoods for increasing branch lengths.

Branch length (CED units)    Likelihood
 1                           0.0000300
 2                           0.0000559
 3                           0.0000782
10                           0.000162
15                           0.000177
20                           0.000175
30                           0.000152

The likelihood rises to a maximum somewhere between 15 and 20 CED units.

Example 3: Likelihood of a tree with four taxa

Assume that we have the aligned nucleotide sequences for four taxa:

The possible trees are

We want to evaluate the likelihood of the unrooted tree represented by the nucleotides of site j in the sequence and shown below:

Since most of the models currently used are time-reversible, the likelihood of the tree is generally independent of the position of the root. Therefore it is convenient to root the tree at an arbitrary internal node, as done in the figure below.

Under the assumption that nucleotide sites evolve independently (the Markovian model of evolution), we can calculate the likelihood for each site separately and combine the likelihoods into a total value at the end. To calculate the likelihood for site j, we have to consider all the possible scenarios by which the nucleotides present at the tips of the tree could have evolved. So the likelihood for a particular site is the summation of the probabilities of every possible reconstruction of ancestral states, given some model of base substitution. In this specific case we consider all possible nucleotides A, G, C, and T occupying nodes (5) and (6), or 4 x 4 = 16 possibilities:

In the case of protein sequences each site may occupy 20 states (those of the 20 amino acids), and thus 400 possibilities have to be considered. Since any one of these scenarios could have led to the nucleotide configuration at the tips of the tree, we must calculate the probability of each and sum them to obtain the total probability for each site j.

The likelihood for the full tree is then the product of the likelihoods at each site.

\[ L = L(1) \times L(2) \times \cdots \times L(N) \]

Since the individual likelihoods are extremely small numbers, it is convenient to sum the log likelihoods at each site and report the likelihood of the entire tree as the log likelihood.

\[ \ln L = \ln L(1) + \ln L(2) + \cdots + \ln L(N) = \sum_{j=1}^{N} \ln L(j) \]

DNA substitution Models

The use of maximum likelihood (ML) algorithms in developing phylogenetic hypotheses requires a model of evolution. The frequently used General Time Reversible (GTR) family of nested models encompasses 64 models with different combinations of parameters for DNA site substitution. The models are listed here from the least complex to the most parameter rich.

Jukes-Cantor (JC): Equal base frequencies, all substitutions equally likely. 1 level of nesting.
Felsenstein 1981 (F81): Variable base frequencies, all substitutions equally likely. 1 level of nesting.

Kimura 2-parameter (K80): Equal base frequencies, variable transition and transversion frequencies. 2 levels of nesting.

Hasegawa-Kishino-Yano (HKY): Variable base frequencies, variable transition and transversion frequencies. 2 levels of nesting.

Tamura-Nei (TrN): Variable base frequencies, equal transversion frequencies, variable transition frequencies.

Kimura 3-parameter (K3P): Variable base frequencies, equal transition frequencies, variable transversion frequencies.

Transition Model (TIM): Variable base frequencies, variable transitions, transversions equal.

Transversion Model (TVM): Variable base frequencies, variable transversions, transitions equal.

Symmetrical Model (SYM): Equal base frequencies, symmetrical substitution matrix (A to T = T to A).

General Time Reversible (GTR): Variable base frequencies, symmetrical substitution matrix. 6 levels of nesting.

In addition to models describing the rates of change from one nucleotide to another, there are models that describe rate variation among sites in a sequence. The following are the two most commonly used.

Gamma Distribution (G): Rate heterogeneity can be accommodated by specifying that the rate of evolution varies across sites and is distributed according to a gamma distribution. A simpler way of accounting for rate heterogeneity is to specify that a fixed proportion of sites are invariant, i.e. have a zero rate of evolution.

Proportion of Invariable Sites (I): The extent of static, unchanging sites in a dataset.

Amino acid Substitution Models

The divergence among sequences can be modeled with a mutation matrix. The matrix, denoted by M, describes the probabilities of amino acid mutations for a given period of evolution.
This corresponds to a model of evolution in which amino acids mutate randomly and independently of one another, but according to some predefined probabilities that depend on the amino acid itself. This is a Markovian model of evolution and, while simple, it is one of the best models. Intrinsic properties of amino acids, such as hydrophobicity, size, and charge, can be modeled by appropriate mutation matrices. Dependencies that relate one amino acid's characteristics to the characteristics of its neighbors cannot be modeled through this mechanism.

Amino acids appear in nature with different frequencies. These frequencies are denoted by $f_i$ and correspond to the steady state of the Markov process defined by the matrix M; that is, the vector f is any of the columns of $M^{\infty}$, or equivalently the eigenvector of M whose corresponding eigenvalue is 1 (Mf = f). This model of evolution is symmetric, i.e., the probability of having an i that mutates into a j is the same as that of starting with a j that mutates into an i. The following is a list of amino acid substitution models that use matrices.

Empirical substitution models

In contrast to DNA substitution models, amino acid replacement models have concentrated on the empirical approach. Dayhoff and coworkers developed a model of protein evolution that resulted in a set of widely used replacement matrices. In the Dayhoff approach, replacement rates are derived from alignments of protein sequences that are at least 85% identical; this constraint ensures that the likelihood of a particular mutation being the result of a set of successive mutations is low. One of the main uses of the Dayhoff matrices has been in database search methods where, for example, the matrices P(0.5), P(1) and P(2.5) (known as the PAM50, PAM100 and PAM250 matrices) are used to assess the significance of proposed matches between target and database sequences. However, the implicit rate matrix has also been used for phylogenetic applications.
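The steady-state relation Mf = f mentioned earlier in this section can be illustrated with a toy example. The 3-state matrix below is entirely hypothetical (a real mutation matrix has 20 states and empirically derived entries); it is written so that column j holds the probabilities of state j mutating into each state i, and the function names are illustrative. Repeatedly applying M to a starting vector approaches the same vector f that fills the columns of a high power of M.

```python
def evolve(m, p):
    """Return the matrix-vector product M p for a column-stochastic matrix M."""
    n = len(p)
    return [sum(m[i][j] * p[j] for j in range(n)) for i in range(n)]

def steady_state(m, start, tol=1e-12, max_steps=100000):
    """Apply M repeatedly until the frequency vector stops changing."""
    p = start
    for _ in range(max_steps):
        nxt = evolve(m, p)
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p

# Hypothetical 3-state mutation matrix; every column sums to 1.
M = [
    [0.98, 0.02, 0.01],
    [0.01, 0.97, 0.01],
    [0.01, 0.01, 0.98],
]

f = steady_state(M, [1.0, 0.0, 0.0])   # start with everything in state 0
print(f)                               # frequencies that satisfy M f = f (to within tol)
```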
PAM matrices

The definition of the mutation matrix M implies a certain amount of mutation (measured in PAM units). A 1-PAM mutation matrix describes an amount of evolution that will change, on average, 1% of the amino acids. In mathematical terms this is expressed as a matrix M such that

\[ \sum_{i} f_i \,(1 - M_{ii}) = 0.01 \]

The diagonal elements of M are the probabilities that a given amino acid does not change, so $(1 - M_{ii})$ is the probability of mutating away from i.

If we have a probability or frequency vector p, the product Mp gives the probability vector, or the expected frequencies, after an amount of evolution equivalent to 1 PAM unit. Or, if we start with amino acid i (a probability vector that contains a 1 in position i and 0s in all others), $M_{*i}$ (the ith column of M) is the corresponding probability vector after one unit of random evolution. Similarly, after k units of evolution (what is called k-PAM evolution) a frequency vector p will be changed into the frequency vector $M^k p$. Notice that chronological time is not linearly dependent on PAM distance. Evolution rates may be very different for different species and different proteins.

Dayhoff matrices

Dayhoff presented a method for estimating the matrix M from the observation of 1572 accepted mutations between 34 superfamilies of closely related sequences. Their method was pioneering in the field. A Dayhoff matrix is computed from a 250-PAM mutation matrix and is used for the standard dynamic programming method of sequence alignment. The Dayhoff matrix entries are related to $M^{250}$ by

\[ D_{ij} = 10 \log_{10} \frac{(M^{250})_{ij}}{f_i} \]

JTT matrices

Jones et al. and Gonnet et al. have used much the same methodology as Dayhoff, but with modern databases. The Jones et al. model has been implemented for phylogenetic analyses with some success. Jones et al. have also calculated an amino acid replacement matrix specifically for membrane spanning segments.
This matrix has remarkably different values from the Dayhoff matrices, which are known to be biased toward water-soluble globular proteins.

Other empirical models

Adachi and Hasegawa have implemented a general reversible Markov model of amino acid replacement that uses a matrix derived from the inferred replacements in mitochondrial proteins of 20 vertebrate species. The authors show that this model performs better than others when dealing with mitochondrial protein phylogeny.

Blosum (Block substitution matrices)

Henikoff and Henikoff have used local, ungapped alignments of distantly related sequences to derive the BLOSUM series of matrices. Matrices of this series are identified by a number after the matrix name (e.g. BLOSUM50), which refers to the minimum percentage identity of the blocks of multiple aligned amino acids used to construct the matrix. These matrices are calculated directly, without extrapolation, and are analogous to transition probability matrices P(T) for different values of T, estimated without reference to any rate matrix Q. The BLOSUM matrices often perform better than PAM matrices for local similarity searches, but have not been widely used in phylogenetics.

Poisson models

A simple, non-empirical model of amino acid replacement was proposed by Nei (1987). This model implements a Poisson distribution and gives accurate estimates of the number of amino acid replacements when species are closely related.

P&S Project Part III:

DNA Substitution Models
By: Syusanna Koyfman

Phylogenetic trees

If we assume that all life comes from a common origin, then closely related species share a more recent common ancestor than distantly related species. We can show the relationships between species using a phylogenetic tree. Fig. I.4.1 shows an example of a phylogenetic tree. “The nodes represent taxonomic units, while the branches connecting them reflect their relationships in terms of descent.
The topology is the pattern of branches found in a tree. The branch length is commonly used to indicate some form of evolutionary distance represented by that branch. The actual, still existing taxonomic units or operational taxonomic units are represented by nodes on the tips of the branches, called external nodes. The other nodes are called internal nodes.”[1]

[1] http://rrna.uia.ac.be/~peter/doctoraat/evol.html

Fig. I.4.1. Example of a phylogenetic tree. The branch lengths in this type of tree representation are given by the horizontal length only.

“A tree where a special node indicating the common ancestor to all OTUs is present (the root) is called a rooted tree. An unrooted tree leaves the position of the common ancestor unspecified. The total number of possible distinct, unrooted trees for n sequences is given by (Penny et al., 1982; Li and Graur, 1991):

$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$”[2]

There are many methods that are used to construct phylogenetic trees. In this document I will focus on the maximum parsimony and maximum likelihood methods, both of which use all character data.

Maximum parsimony

“A maximum parsimony tree is the tree that requires the smallest number of evolutionary changes to result in the set of OTUs under study.”[3] Sometimes several trees require the same number of changes. Although we consider all sites, not every site carries information about which tree is the most parsimonious. We “filter” out the sites that do not help to discriminate between topologies. We consider a site informative when it shows at least 2 different kinds of residues, each of them present in at least 2 of the sequences. We find the most parsimonious tree by choosing from all of the possible tree topologies. “For each of these possible trees the ancestral sequences at each branching point are reconstructed and the minimum number of evolutionary changes can be calculated. Finally the tree requiring the smallest number of substitutions will be chosen.
”[4]

[2] http://rrna.uia.ac.be/~peter/doctoraat/evol.html
[3] http://rrna.uia.ac.be/~peter/doctoraat/evol.html
[4] http://rrna.uia.ac.be/~peter/doctoraat/evol.html

Some advantages of the maximum parsimony method are that it is based on shared and derived characters, does not reduce sequence information to a single number, tries to provide information on the ancestral sequences, and evaluates different trees. The disadvantages are that it is slow in comparison with distance methods, does not use all the sequence information (only informative sites are used), does not correct for multiple mutations (it does not imply a model of evolution), does not provide information on the branch lengths, and is notorious for its sensitivity to codon bias.

Maximum likelihood

“Maximum Likelihood is a method for the inference of phylogeny. It evaluates a hypothesis about evolutionary history in terms of the probability that the proposed model and the hypothesized history would give rise to the observed data set. The supposition is that a history with a higher probability of reaching the observed state is preferred to a history with a lower probability. The method searches for the tree with the highest probability or likelihood.”[5]

Some advantages of the maximum likelihood method are as follows: it often has lower variance than other methods (i.e., it is frequently the estimation method least affected by sampling error); it tends to be robust to many violations of the assumptions in the evolutionary model; even with very short sequences it tends to outperform alternative methods such as parsimony or distance methods; the method is statistically well founded; it evaluates different tree topologies; it uses all the sequence information; are less ...

The disadvantages are as follows: it is very CPU intensive and thus extremely slow, and the result is dependent on the model of evolution used.
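To get a feeling for why an exhaustive search over topologies is so CPU intensive, the number of distinct unrooted bifurcating trees can be tabulated. The sketch below assumes the standard counting result that n sequences admit (2n-5)!/(2^(n-3) (n-3)!) unrooted trees, which equals the product 3 x 5 x ... x (2n-5); the function name `count_unrooted_trees` is made up for this illustration.

```python
def count_unrooted_trees(n):
    """Number of distinct unrooted bifurcating trees for n >= 3 sequences.

    Uses the product form 3 * 5 * ... * (2n - 5), which is equal to
    (2n - 5)! / (2**(n - 3) * (n - 3)!).
    """
    if n < 3:
        raise ValueError("an unrooted bifurcating tree needs at least 3 sequences")
    count = 1
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count

for n in (4, 5, 10, 20):
    print(n, "sequences:", count_unrooted_trees(n), "unrooted trees")
```

Four sequences give 3 trees and ten sequences already give 2,027,025, so evaluating the likelihood of every topology quickly becomes impractical and heuristic searches are normally used instead.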
[5] http://www.icp.ucl.ac.be/~opperd/private/max_likeli.html

One example, which I found on the following website, explains the method particularly well.

“Maximum likelihood evaluates the probability that the chosen evolutionary model will have generated the observed sequences. Phylogenies are then inferred by finding those trees that yield the highest likelihood. Assume that we have the aligned nucleotide sequences for four taxa:

    1         j            N
(1) A G G C U C C A A .... A
(2) A G G U U C G A A .... A
(3) A G C C C A G A A .... A
(4) A U U U C G G A A .... C

We want to evaluate the likelihood of the unrooted tree represented by the nucleotides of site j in the sequence and shown below:

(1)           (3)
   \         /
    ---------
   /         \
(2)           (4)

What is the probability that this tree would have generated the data presented in the sequence under the chosen model? Since most of the models currently used are time-reversible, the likelihood of the tree is generally independent of the position of the root. Therefore it is convenient to root the tree at an arbitrary internal node, as done in the figure below:

C     C     A     G
 \   /      |    /
  \ /       |   /
   A        |  /
    \       | /
     \      |/
      ------A

Under the assumption that nucleotide sites evolve independently (the Markovian model of evolution), we can calculate the likelihood for each site separately and combine the likelihoods into a total value towards the end. To calculate the likelihood for site j, we have to consider all the possible scenarios by which the nucleotides present at the tips of the tree could have evolved. So the likelihood for a particular site is the summation of the probabilities of every possible reconstruction of ancestral states, given some model of base substitution.
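As a rough illustration of this summation (it is not part of the quoted source), the sketch below computes the likelihood of one site for the rooted four-taxon tree of the example. Several things are assumed here and do not come from the source: the Jukes-Cantor substitution model, equal base frequencies of 1/4 at the root, and a hypothetical common branch length of 0.1 substitutions per site on all five branches. The function names are also invented for this sketch.

```python
import math
from itertools import product

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability that base a is replaced by base b along a branch of length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def site_likelihood(tips, t=0.1):
    """Likelihood of one site: tips 1 and 2 join at node (5); node (5), tip 3 and tip 4 join at the root (6)."""
    tip1, tip2, tip3, tip4 = tips
    total = 0.0
    # Sum over all 4 x 4 = 16 assignments of bases to the internal nodes (5) and (6).
    for x5, x6 in product(BASES, repeat=2):
        total += (0.25                      # assumed frequency of the root base
                  * jc_prob(x6, x5, t)      # branch (6) -> (5)
                  * jc_prob(x5, tip1, t)    # branch (5) -> tip 1
                  * jc_prob(x5, tip2, t)    # branch (5) -> tip 2
                  * jc_prob(x6, tip3, t)    # branch (6) -> tip 3
                  * jc_prob(x6, tip4, t))   # branch (6) -> tip 4
    return total

def log_likelihood(sites, t=0.1):
    """Sum of the log likelihoods of independent sites."""
    return sum(math.log(site_likelihood(s, t)) for s in sites)

print("L(j) for site pattern CCAG:", site_likelihood("CCAG"))
print("ln L for two sites:", log_likelihood(["CCAG", "CCCC"]))
```

A real program would also optimize the branch lengths and model parameters for each topology rather than fixing them as done here.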
So in this specific case all possible nucleotides A, G, C, and T occupying nodes (5) and (6), or 4 x 4 = 16 possibilities:

                 | C   C   A   G |
                 |  \ /    |  /  |
L(j) = Sum( Prob |  (5)    | /   | )
                 |    \    |/    |
                 |     (6)       |

In the case of protein sequences each site may occupy 20 states (those of the 20 amino acids) and thus 400 possibilities have to be considered. Since any one of these scenarios could have led to the nucleotide configuration at the tips of the tree, we must calculate the probability of each and sum them to obtain the total probability for each site j. The likelihood for the full tree is then the product of the likelihoods at each site.

$L = L(1) \times L(2) \times \cdots \times L(N) = \prod_{j=1}^{N} L(j)$

Since the individual likelihoods are extremely small numbers it is convenient to sum the log likelihoods at each site and report the likelihood of the entire tree as the log likelihood.

$\ln L = \ln L(1) + \ln L(2) + \cdots + \ln L(N) = \sum_{j=1}^{N} \ln L(j)$

The above procedure is then repeated for all possible topologies (or for all possible trees). The tree with the highest probability is the tree with the highest maximum likelihood.”[6]

LIKELIHOOD RATIO TESTS IN PHYLOGENETICS

Many assumptions are made in phylogenetic analysis. These assumptions are sometimes incorrect, but we will see that even incorrect assumptions may lead to a correct analysis. The role of assumptions is one of the most debated subjects in the field of phylogenetics. All phylogenetic methods make assumptions about the process of evolution. In addition, many phylogenetic methods use bifurcating trees, in which each ancestral lineage gives rise to exactly two descendant lineages, to describe the phylogeny of species. “Consequently, all the methods of phylogenetic inference depend on their underlying models.
To have confidence in inferences it is necessary to have confidence in the models.”[7] In other words, we can rely on the assumptions only to the extent that we have confidence in our models. Further assumptions are made in a phylogenetic analysis. “For example, the assumptions of a maximum likelihood analysis are mathematically explicit and, besides the assumption of independence among sites, include parameters that describe the substitution process, the lengths of the branches on a phylogenetic tree, and among-site rate heterogeneity. The assumptions made in a parsimony analysis include independence and a specific model of character transformation (often called a step-matrix or weighting scheme; a commonly used weighting scheme is to give every character transformation equal weight).”[8]

[6] http://www.icp.ucl.ac.be/~opperd/private/max_likeli.html
[7] http://bioinformatics.oupjournals.org/cgi/reprint/14/9/817.pdf

Given all of these assumptions, it is surprising that phylogenetic methods can estimate the correct tree with high probability. “In fact, the maximum likelihood, parsimony, and several distance methods appear to be robust to violation of many assumptions, including making incorrect assumptions about the substitution process, among-site rate variation, and independence among sites.”[9] One of the advantages of making explicit assumptions about the evolutionary process is that we can compare alternative models of evolution in a statistical context. “Instead of being viewed as a disadvantage, the use of explicit models of evolution in a phylogenetic analysis allows the systematist not only to estimate phylogeny, but to learn about processes of evolution through hypothesis testing. One measure of the relative tenability of two competing hypotheses is the ratio of their likelihoods.”[10] With this background we can begin to decide which model best fits the data.
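The idea of comparing two hypotheses through the ratio of their likelihoods can be sketched in a few lines. The example below is an assumption-laden illustration rather than a full test procedure: the two log likelihood values are hypothetical numbers, not results from any data set; the models are assumed to be nested and to differ by exactly one free parameter (as Jukes-Cantor and Kimura 2-parameter do); and the statistic -2 log Λ is compared with a chi-square distribution with 1 degree of freedom, whose tail probability can be written with the complementary error function.

```python
import math

def likelihood_ratio_statistic(lnL_null, lnL_alt):
    """Return -2 log(Lambda), where Lambda = L0 / L1 and both are maximized likelihoods."""
    return -2.0 * (lnL_null - lnL_alt)

def chi2_pvalue_df1(stat):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

# Hypothetical maximized log likelihoods for a simple (null) and a richer (alternative) model.
lnL_null = -1705.3
lnL_alt = -1700.1

stat = likelihood_ratio_statistic(lnL_null, lnL_alt)
p_value = chi2_pvalue_df1(stat)

print("-2 log Lambda =", round(stat, 3))
print("reject the simpler model at the 5% level:", p_value < 0.05)
```

For models that differ by more than one parameter the same statistic is used, but the tail probability then needs a chi-square distribution with the corresponding number of degrees of freedom, which is easier to obtain from a statistics library.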
The following passages indicate how the choice of a model can be justified. “Because of this, all the methods based on explicit models of evolution should explore which is the model that fits the data best, justifying then its use. In traditional statistical theory, a widely accepted statistic for testing the goodness of fit of models is the likelihood ratio test statistic

$\delta = -2 \log \Lambda, \qquad \Lambda = \frac{\max [L_0(\text{null model} \mid \text{data})]}{\max [L_1(\text{alternative model} \mid \text{data})]}$

where L0 is the likelihood under the null hypothesis (simple model) and L1 is the likelihood under the alternative hypothesis (more complex, parameter rich, model)”[11]

[8] Huelsenbeck&Crandall1997.pdf
[9] Huelsenbeck&Crandall1997.pdf
[10] Huelsenbeck&Crandall1997.pdf

“Here, the maximum likelihood calculated under the null hypothesis (H0) is in the numerator, and the maximum likelihood calculated under the alternative hypothesis (H1) is in the denominator. When Λ is less than one, H0 is discredited and when Λ is greater than one, H1 is discredited. Λ greater than one is only possible for non-nested models. When nested models are considered, (i.e., the null hypothesis is a subset or special case of the alternative hypothesis), Λ < 1 and -2 log Λ is asymptotically χ² distributed under the null hypothesis with q degrees of freedom, where q is the difference in the number of free parameters between the general and restricted hypotheses.”[12]

When the χ² approximation cannot be relied upon, the null distribution of -2 log Λ can instead be approximated by simulating data sets under the null hypothesis. “The maximum likelihood estimates of the model parameters under the null hypothesis are used to parameterize the simulations. For the phylogeny problem, these parameters would include the tree topology, branch lengths, and substitution parameters (e.g., transition:transversion rate ratio or the shape parameter of the gamma distribution). For each simulated data set, -2 log Λ is calculated anew by maximizing the likelihood under the null and alternative hypotheses.
The proportion of the time that the observed value of -2 log Λ exceeds the values observed in the simulations represents the significance level of the test.”[13]

[11] http://bioinformatics.oupjournals.org/cgi/reprint/14/9/817.pdf
[12] Huelsenbeck&Crandall1997.pdf
[13] Huelsenbeck&Crandall1997.pdf

It is common for the rejection level to be set at about 5%. Maximum likelihood therefore allows phylogenetic hypotheses to be formulated and tested in a simple way through these likelihood ratio tests. In addition, likelihood ratio tests have desirable statistical properties, especially when they are used on simple hypotheses. “Over the past two decades, numerous likelihood ratio tests have been suggested. These include tests of the null hypotheses that (a) a model of DNA substitution adequately explains the data (32, 80, 93), (b) rates of nucleotide substitution are biased (32, 80, 93), (c) rates of substitution are constant among lineages (24, 65, 79, 115), (d) rates are equal among sites (123), (e) rates of substitution are the same in different data partitions (30, 66, 122), (f) substitution parameters are the same among data partitions (122), (g) the same topology underlies different data partitions (51), (h) a prespecified group is monophyletic (53), (i) hosts and associated parasites have corresponding phylogenies (56), (j) hosts and parasites have identical speciation times (56), and (k) rates of synonymous and nonsynonymous substitution are the same (77).”[14]

[14] Huelsenbeck&Crandall1997.pdf

References:

1. Huelsenbeck J. and Crandall K., Phylogeny Estimation and Hypothesis Testing Using Maximum Likelihood
2. http://workshop.molecularevolution.org/resources/models/codonmodels.php
3. www.cs.technion.ac.il/~dang/courseCB/lecture13.pps
4. http://www.biology.usu.edu/biol6750/Lecture_15.htm
5. bio.wayne.edu/mf/teaching/BIO6060_Maximumlikelihood2
6. http://www.nmu.edu/biology/Lindsay/teaching/BI315/phylo/phylo_DNAmodels.html
7. http://www.icp.ucl.ac.be/~opperd/private/max_likeli.html
8. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=8642615
9. http://stat-www.berkeley.edu/users/terry/PMMB/Workshop2000/Lab3/phylo.pdf
10. http://www.biology.duke.edu/rausher/phylo1.pdf
11. Neelima Lingareddy, Maximum Likelihood: Phylogeny Estimation