PHYLOGENY ESTIMATION USING MAXIMUM LIKELIHOOD

By George Shanahan

DNA: The Information for Reconstructing Phylogenetic Trees

DNA is the basic hereditary material in all cells and contains all the information necessary to make proteins. DNA is a linear polymer made up of nucleotide units. Each nucleotide unit consists of a nitrogen-containing base, a deoxyribose sugar, and a phosphate. There are four types of bases: adenine (A), thymine (T), guanine (G), and cytosine (C). Each base is connected to a sugar via a β-glycosyl linkage. The nucleotide units are connected through the O3' and O5' oxygen atoms, forming phosphodiester linkages. These linkages between the hydroxyl groups of the deoxyribose sugars and the phosphate groups form the DNA backbone.

[Figure: structures of the four bases: adenine, thymine, guanine, cytosine]

In normal DNA, the bases form pairs: A with T and G with C. This is called complementarity. Two complementary chains arranged in an anti-parallel manner form a DNA duplex. The A-T and G-C base pairs are the same length and occupy the same space within a DNA double helix, giving the helix a uniform diameter.

DNA is normally a double-stranded macromolecule consisting of two polynucleotide chains held together, also known as a double helix. In this structure, the two DNA strands wind around a common axis in a right-handed spiral. The sugar-phosphate backbones of the two strands wind around the helix axis like a spiral staircase. The two polynucleotide chains run in opposite directions, and the bases of the individual nucleotides are on the inside of the helix, stacked on top of each other. The larger base shape shown in the picture above depicts the purines (adenine and guanine). The other base shape in the illustration represents the pyrimidines (cytosine and thymine).

The results of fiber and single-crystal X-ray crystallographic studies have shown that DNA can have several conformations. The most common one is called B-DNA.
B-DNA is a right-handed double helix with a wide (major) groove and a narrow (minor) groove. The bases are perpendicular to the helix axis.

[Figure: B-DNA. Compound name: 5'-d(CpGpCpGpApApTpTpCpGpCpG)-3', 290 K. Sequence in asymmetric unit: Chain A: C G C G A A T T C G C G; Chain B: C G C G A A T T C G C G]

Occasionally mutations occur in which a base is changed. Base pairs still form, but they are not in the usual Watson-Crick geometry. Sequences diverge from a common ancestor because mutations occur and can then become fixed in the evolving population. The maximum likelihood method is used to analyze DNA and amino acid sequence data in an attempt to answer evolutionary questions.

Maximum Likelihood in Phylogenetics

Given a set of aligned sequences, molecular phylogeny methods propose phylogenetic trees that aim to reconstruct the history of successive divergences that took place during evolution between the sequences considered and their common ancestor. A phylogenetic tree is characterized by its topology (form) and its length (the sum of its branch lengths). Each internal node of a tree is an estimate of the ancestor of the elements that descend from that node.

Maximum likelihood is a method for the inference of phylogeny. It evaluates a hypothesis about evolutionary history in terms of the probability that the proposed model and the hypothesized history would give rise to the observed data set. The supposition is that a history with a higher probability of reaching the observed state is preferred to a history with a lower probability. The method searches for the tree with the highest probability, or likelihood.

The data for molecular phylogenetic problems are the individual site patterns. A set of sequences is chosen for analysis, and a multiple sequence alignment is then performed. The resulting phylogenetic tree depends strongly on the alignment obtained.
Maximum likelihood evaluates the probability that the chosen evolutionary model would have generated the observed sequences. Phylogenies are then inferred by finding the trees that yield the highest likelihood.

Assume that we have the aligned nucleotide sequences for four taxa:

        1         j       ... N
(1)     A G G C T C C A A ... A
(2)     A G G T T C G A A ... A
(3)     A G C C C A G A A ... A
(4)     A T T T C G G A A ... C

We want to evaluate the likelihood of the unrooted tree shown below, given the nucleotides at site j of the sequences (C, C, A, and G for taxa (1) to (4)):

    (1)         (3)
       \       /
        -------
       /       \
    (2)         (4)

We now find the probability that this tree would have generated the observed data under the chosen model. Since most of the models currently used are time-reversible, the likelihood of the tree is generally independent of the position of the root. It is therefore convenient to root the tree at an arbitrary internal node, as in the figure below, which shows one possible assignment of ancestral states (A at both internal nodes):

    C   C
     \ /
      A      A   G
       \     |  /
        \    | /
         \   |/
           A

Under the assumption that nucleotide sites evolve independently, each according to a Markov model of substitution, we can calculate the likelihood for each site separately and combine the site likelihoods into a total value at the end. To calculate the likelihood for site j, we have to consider all the possible scenarios by which the nucleotides present at the tips of the tree could have evolved. The likelihood for a particular site is therefore the sum of the probabilities of every possible reconstruction of ancestral states, given some model of base substitution. In this specific case, each of the nucleotides A, G, C, and T may occupy the internal nodes (5) and (6). For n taxa there are 4^(n-2) arrangements, so here there are 4^2 = 16 possibilities:

    C   C
     \ /
     (5)     A   G
       \     |  /
        \    | /
         \   |/
          (6)

\[ L(j) = \sum_{x_5 \in \{A,C,G,T\}} \; \sum_{x_6 \in \{A,C,G,T\}} \Pr\bigl(\text{tips } C, C, A, G;\ \text{node (5)} = x_5,\ \text{node (6)} = x_6\bigr) \]

In the case of protein sequences, each site may occupy 20 states (the 20 amino acids), and thus 20^2 = 400 possibilities have to be considered.
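The summation over ancestral states described above can be sketched in Python. This is only an illustration under stated assumptions, not the author's implementation: it assumes the Jukes-Cantor model (equal base frequencies of 1/4, all substitutions equally likely), roots the tree at node (6), and uses hypothetical branch lengths of 0.1 expected substitutions per site. The names jc69_prob, site_likelihood, and branch_lengths are invented for this example.

```python
import itertools
import math

BASES = "ACGT"

def jc69_prob(x, y, t):
    """Probability of state y at the end of a branch given state x at its start,
    under Jukes-Cantor; t is the branch length in expected substitutions per site."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay

def site_likelihood(tips, t):
    """Likelihood of one site on the four-taxon tree ((1,2),(3,4)) rooted at node (6).

    tips: observed states at taxa (1)..(4), e.g. "CCAG".
    t:    assumed branch lengths (t1..t5), labeled here as
          t1: (6)-(5), t2: (5)-(1), t3: (5)-(2), t4: (6)-(3), t5: (6)-(4).
    """
    s1, s2, s3, s4 = tips
    t1, t2, t3, t4, t5 = t
    total = 0.0
    # Sum over all 4 x 4 = 16 assignments of states to nodes (6) and (5).
    for x6, x5 in itertools.product(BASES, repeat=2):
        total += (0.25                      # frequency of the root state under JC
                  * jc69_prob(x6, x5, t1)   # root (6) to node (5)
                  * jc69_prob(x5, s1, t2)   # node (5) to taxon (1)
                  * jc69_prob(x5, s2, t3)   # node (5) to taxon (2)
                  * jc69_prob(x6, s3, t4)   # root (6) to taxon (3)
                  * jc69_prob(x6, s4, t5))  # root (6) to taxon (4)
    return total

# Hypothetical branch lengths; in practice these are optimized for each tree.
branch_lengths = (0.1, 0.1, 0.1, 0.1, 0.1)
L_j = site_likelihood("CCAG", branch_lengths)
print(L_j, math.log(L_j))
```

Because the sketch enumerates every pair of internal states explicitly, its cost grows as 4^(n-2) with the number of taxa; it is meant to mirror the sum in the text rather than to be efficient.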
Since any one of these scenarios could have led to the nucleotide configuration at the tips of the tree, we must calculate the probability of each and sum them to obtain the total probability for site j. For each scenario, that is, each tree with its ancestral states filled in, we calculate the probability of obtaining the observed states, L(i), where i labels the scenario. For the scenario with A at both internal nodes and branch lengths t1 to t5:

    C       C
     \t2   /t3
      \   /
        A        A     G
         \t1     |t4  /t5
          \      |   /
           \     |  /
              A

\[ L(i) = P_A \times P_{A \to A}(t_1) \times P_{A \to C}(t_2) \times P_{A \to C}(t_3) \times P_{A \to A}(t_4) \times P_{A \to G}(t_5) \]

Here P_A is the probability of A at the root, and P_{X \to Y}(t) is the probability that X is replaced by Y along a branch of length t.

The likelihood for the full tree is then the product of the likelihoods at each site:

\[ L = L(1) \times L(2) \times \cdots \times L(N) = \prod_{j=1}^{N} L(j) \]

Since the individual likelihoods are extremely small numbers, it is convenient to sum the log likelihoods at each site and report the likelihood of the entire tree as the log likelihood:

\[ \ln L = \ln L(1) + \ln L(2) + \cdots + \ln L(N) = \sum_{j=1}^{N} \ln L(j) \]

The maximum likelihood tree is determined by repeating this procedure for all possible topologies (that is, for all possible trees) and, for each one, searching for the set of branch lengths that maximizes L. The result is that tree's likelihood score. The tree with the highest likelihood score is the maximum likelihood tree.

To calculate the probability of observing a given site pattern, the transition probabilities need to be specified. For DNA or amino acid substitutions, the simplest model is the Poisson model, which assumes that all changes between nucleotides or between amino acids occur at the same rate.

Introduction to DNA Substitution Models

The use of maximum likelihood algorithms in developing phylogenetic hypotheses requires a model of evolution. A model of evolution that assigns a probability to each possible nucleotide or amino acid substitution is essential for obtaining the correct tree. The frequently used General Time Reversible (GTR) family of nested models encompasses 64 models with different combinations of parameters for DNA site substitution.
The models are listed here from the least complex to the most parameter rich.

Jukes-Cantor (JC, nst=1): Equal base frequencies, all substitutions equally likely (PAUP* rate classification: aaaaaa, PAML: aaaaaa)
Felsenstein 1981 (F81, nst=1): Variable base frequencies, all substitutions equally likely (PAUP*: aaaaaa, PAML: aaaaaa)
Kimura 2-parameter (K80, nst=2): Equal base frequencies, transitions and transversions occur at different rates (PAUP*: abaaba, PAML: abbbba)
Hasegawa-Kishino-Yano (HKY, nst=2): Variable base frequencies, transitions and transversions occur at different rates (PAUP*: abaaba, PAML: abbbba)
Tamura-Nei (TrN): Variable base frequencies, equal transversion rates, variable transition rates (PAUP*: abaaea, PAML: abbbbf)
Kimura 3-parameter (K3P): Equal base frequencies, equal transition rates, variable transversion rates (PAUP*: abccba, PAML: abccba)
Transition Model (TIM): Variable base frequencies, variable transitions, transversions equal (PAUP*: abccea, PAML: abccbe)
Transversion Model (TVM): Variable base frequencies, variable transversions, transitions equal (PAUP*: abcdbe, PAML: abcdea)
Symmetrical Model (SYM): Equal base frequencies, symmetrical substitution matrix (A to T = T to A) (PAUP*: abcdef, PAML: abcdef)
General Time Reversible (GTR, nst=6): Variable base frequencies, symmetrical substitution matrix (PAUP*: abcdef, PAML: abcdef)

In addition to models describing the rates of change from one nucleotide to another, there are models that describe rate variation among sites in a sequence. The following are the two most commonly used.

Gamma Distribution (G): Gamma-distributed site-to-site rate variation
Proportion of Invariable Sites (I): Extent of static, unchanging sites in a dataset

Substitutions are themselves grouped hierarchically: simple, general base substitution; transitions and transversions; purine-to-purine and pyrimidine-to-pyrimidine transitions; and AC/GT and AT/CG transversions.
The groupings are symbolized as rate classifications according to the PAUP* and PAML matrices below. Substitution types that are constrained to be equal in rate assume the leftmost letter symbol.

PAUP* Substitution Rate Matrix

         A     C     G     T
    A    -     a     b     c
    C          -     d     e
    G                -     f=1
    T                      -

PAML Substitution Rate Matrix

         T     C     A     G
    T    -     a     b     c
    C          -     d     e
    A                -     f=1
    G                      -

Advantages and disadvantages of maximum likelihood methods:

There are some supposed advantages of maximum likelihood methods over other methods.

they estimate the branch lengths of the final tree
they often have lower variance than other methods (i.e. maximum likelihood is frequently the estimation method least affected by sampling error)
they tend to be robust to many violations of the assumptions in the evolutionary model
even with very short sequences they tend to outperform alternative methods such as parsimony or distance methods
the method is statistically well founded
they evaluate different tree topologies
they use all the sequence information

There are also some supposed disadvantages.

maximum likelihood is very CPU intensive and thus extremely slow
the result is dependent on the model of evolution used

Bibliography

1. Adachi J, Hasegawa M. 1992. Amino acid substitution of proteins coded for in mitochondrial DNA during mammalian evolution. Jpn. J. Genet. 67:187–97
2. Bishop MJ, Thompson EA. 1986. Maximum likelihood alignment of DNA sequences. J. Mol. Biol. 190:159–65
3. Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376
4. Felsenstein J. 1988. Phylogenies from molecular sequences. Annu. Rev. Genet. 22:521–565
5. Huelsenbeck JP, Crandall KA. 1997. Phylogeny estimation and hypothesis testing using maximum likelihood. Annu. Rev. Ecol. Syst. 28:437–466
6. Navidi WC, Churchill GA, von Haeseler A. 1991. Methods for inferring phylogenies from nucleic acid sequence data by using maximum likelihood and linear invariants. Mol. Biol. Evol. 8:128–143
7. Penny D, Hendy MD, Steel MA. 1992.
Progress with methods for constructing evolutionary trees. Trends Ecol. Evol. 7:73–79
8. Thorne JL, Kishino H, Felsenstein J. 1991. An evolutionary model for maximum likelihood alignment of DNA sequences. J. Mol. Evol. 33:114–124