Phylogenetic Trees
Lecture 3
Based on: Durbin et al. 7.4; Gusfield 17.

Character-based methods for constructing phylogenies
In this approach, trees are constructed by comparing the characters of the corresponding species. Characters may be morphological (teeth structures) or molecular (homologous DNA sequences).
One common approach is Maximum Parsimony.
Assumptions:
- Independence of characters (no interactions)
- The best tree is the one in which the fewest changes take place

1. Maximum Parsimony
Input: four nucleotide sequences, AAG, AAA, GGA, AGA, taken from four species.
Question: Which evolutionary tree best explains these sequences?
One answer (the parsimony principle): pick a tree that has the minimum total number of substitutions of symbols between species and their originator in the phylogenetic tree.
[Figure: a tree over the four sequences with internal nodes labeled AAA; edge labels 1, 2 and 1 give Total #substitutions = 4]

Example Continued
There are many possible trees. For example:
[Figure: two candidate trees for the same four sequences. Left tree: Total #substitutions = 3. Right tree: Total #substitutions = 4]
The left tree is preferred over the right tree.
The total number of changes is called the parsimony score.

Simple Example
Suppose we have five species, such that three have 'C' and two have 'T' at a specified position.
The minimal tree has one evolutionary change:
[Figure: a tree over the five leaves C, C, C, T, T with a single change]

Extension to Many Letters
What is the parsimony score of a tree over the following species?
A: Aardvark   CAGGTA
B: Bison      CAGACA
C: Chimp      CGGGTA
D: Dog        TGCACT
E: Elephant   TGCGTA
We do it character by character; each score is computed independently of the others.

Fitch's Algorithm for Evaluating Trees
Traverse the tree from leaves to root, determining the set of possible states (e.g.
nucleotides) for each internal node.
Traverse the tree from root to leaves, picking ancestral states for the internal nodes.

Fitch's Algorithm - Step 1
# of changes = # of union operations
[Figure: example tree whose internal nodes are annotated with the state sets (AGT, CT, GT, T) computed from the leaves]
Do a post-order (from leaves to root) traversal of the tree.
Determine the possible states $R_i$ of internal node $i$ with children $j$ and $k$:
$$R_i = \begin{cases} R_j \cap R_k & \text{if } R_j \cap R_k \neq \emptyset \\ R_j \cup R_k & \text{otherwise} \end{cases}$$

Fitch's Algorithm - Step 2
[Figure: the same example tree with one state selected at each internal node]
Do a pre-order (from root to leaves) traversal of the tree.
Select the state $r_j$ of internal node $j$ with parent $i$:
$$r_j = \begin{cases} r_i & \text{if } r_i \in R_j \\ \text{an arbitrary state in } R_j & \text{otherwise} \end{cases}$$

Weighted Version of Fitch's Algorithm
Instead of assuming all state changes are equally likely, use different costs $c(a, b)$ for different changes $a \to b$.
The first step of the algorithm is to propagate costs up through the tree.
We want to determine the minimal cost $S(i, a)$ of assigning character $a$ to node $i$.
For leaves:
$$S(i, a) = \begin{cases} 0 & \text{if } a \text{ is the character at leaf } i \\ \infty & \text{otherwise} \end{cases}$$
For an internal node $i$ with children $j$ and $k$:
$$S(i, a) = \min_b \bigl(S(j, b) + c(a, b)\bigr) + \min_b \bigl(S(k, b) + c(a, b)\bigr)$$

Weighted Version of Fitch's Algorithm - Step 2
Do a pre-order (from root to leaves) traversal of the tree.
Select the minimal-cost character for the root.
For each internal node $j$, select the character that produced the minimal cost at its parent $i$.

Weighted Parsimony Scores
Weighted parsimony score: each change is weighted by a score $c(a, b)$.
The weighted parsimony score reduces to the parsimony score when $c(a, a) = 0$ and $c(a, b) = 1$ for all $b \neq a$.

Evaluating Weighted Parsimony Scores
Each position is independent and is scored by itself.
Use dynamic programming on a given tree.
If $i$ is a node with children $j$ and $k$, then
$$S(i, a) = \min_x \bigl(S(j, x) + c(a, x)\bigr) + \min_y \bigl(S(k, y) + c(a, y)\bigr),$$
where $S(i, a)$ is the minimum score of the subtree rooted at $i$ when $i$ has character $a$.
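The bottom-up recurrence for $S(i, a)$ can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the nested-tuple tree encoding, the function names and the transition/transversion weights in `ts_tv_cost` are assumptions made for the example, and the traceback that recovers ancestral characters is omitted.

```python
import math

def sankoff(tree, leaf_state, alphabet, cost):
    """Return {a: S(root, a)} for one character position.

    tree       -- a leaf name, or a 2-tuple (left_subtree, right_subtree)
    leaf_state -- dict mapping each leaf name to its observed character
    cost       -- function c(a, b) giving the cost of a change a -> b
    """
    if not isinstance(tree, tuple):
        # Leaf: 0 for the observed character, infinity for every other one.
        return {a: 0 if a == leaf_state[tree] else math.inf for a in alphabet}
    left = sankoff(tree[0], leaf_state, alphabet, cost)
    right = sankoff(tree[1], leaf_state, alphabet, cost)
    # Internal node: S(i,a) = min_b(S(j,b) + c(a,b)) + min_b(S(k,b) + c(a,b)).
    return {a: min(left[b] + cost(a, b) for b in alphabet)
               + min(right[b] + cost(a, b) for b in alphabet)
            for a in alphabet}

def unit_cost(a, b):
    # With these costs the weighted score equals the plain parsimony score.
    return 0 if a == b else 1

PURINES = {"A", "G"}

def ts_tv_cost(a, b):
    # Hypothetical weights: transitions cost 1, transversions cost 2.
    if a == b:
        return 0
    return 1 if (a in PURINES) == (b in PURINES) else 2

# Hypothetical four-leaf tree with a different nucleotide at every leaf.
tree = (("w", "x"), ("y", "z"))
states = {"w": "A", "x": "G", "y": "C", "z": "T"}
unit_score = min(sankoff(tree, states, "ACGT", unit_cost).values())
weighted_score = min(sankoff(tree, states, "ACGT", ts_tv_cost).values())
print(unit_score, weighted_score)  # 3 4
```

The score of the tree is the minimum of the root table. Each internal node costs O(k^2) work for an alphabet of size k, since every character at the parent is compared with every character at each child.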
[Figure: node i with score S(i,a) above its children j and k with scores S(j,x) and S(k,y)]

Evaluating Parsimony Scores
Dynamic programming on a given tree:
Initialization: for each leaf $i$, set $S(i, a) = 0$ if $i$ is labeled by $a$; otherwise $S(i, a) = \infty$.
Iteration: if $i$ is a node with children $j$ and $k$, then $S(i, a) = \min_x \bigl(S(j, x) + c(a, x)\bigr) + \min_y \bigl(S(k, y) + c(a, y)\bigr)$.
Termination: the cost of the tree is $\min_x S(r, x)$, where $r$ is the root.
Comment: To reconstruct an optimal assignment, we need to keep, in each node $i$ and for each character $a$, the two characters $x$, $y$ that attain the minimum when $i$ has character $a$.

Cost of Evaluating Parsimony for Binary Trees
If there are n nodes, m characters, and k possible values for each character, then the complexity is $O(nmk^2)$.
Of course, we still need to search over all possible trees and find the best one. One usually resorts to heuristic search techniques.

Exploring the Space of Trees
We've considered how to find the minimum number of changes for a given tree topology.
We need some search procedure for exploring the space of tree topologies.
Given n sequences, there are $(2n-3)!! = 1 \cdot 3 \cdot 5 \cdots (2n-3)$ possible rooted trees.

Counting Trees
[Figure: n = 3, one tree; n = 4, three trees]
A rooted tree with n leaves has (2n-1) nodes and (2n-2) edges, discounting the edge to the root; hence an unrooted tree has (2n-3) edges.
For each additional leaf we add two edges, and the new leaf can be attached to any edge of the existing tree: an unrooted tree with n-1 leaves has (2n-5) edges, so it can be extended in (2n-5) ways.
Therefore we have 1 • 3 • 5 • … • (2n-5) unrooted trees with n leaves.
Each such tree has (2n-3) edges, any of which can be chosen as the position of the root.
Hence we have 1 • 3 • 5 • … • (2n-5) • (2n-3) rooted trees with n leaves.

Exploring the Space of Trees
taxa (n)            4    5     6     8         10
# of rooted trees   15   105   945   135,135   34,459,425

Maximum Parsimony
Site        1  2  3  4  5  6  7  8  9  10
Species 1   A  G  G  G  T  A  A  C  T  G
Species 2   A  C  G  A  T  T  A  T  T  A
Species 3   A  T  A  A  T  T  G  T  C  T
Species 4   A  A  T  G  T  T  G  T  C  G
How many possible unrooted trees?
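The counting argument can be checked numerically with a short sketch. The function names below are illustrative only; the formulas are the products 1 • 3 • … • (2n-5) for unrooted trees and 1 • 3 • … • (2n-3) for rooted trees.

```python
def odd_double_factorial(k):
    """Return 1 * 3 * 5 * ... * k for odd k (and 1 when k < 3)."""
    result = 1
    for factor in range(3, k + 1, 2):
        result *= factor
    return result

def unrooted_trees(n):
    """Number of unrooted binary trees with n >= 3 labeled leaves."""
    return odd_double_factorial(2 * n - 5)

def rooted_trees(n):
    """Number of rooted binary trees with n >= 2 labeled leaves."""
    return odd_double_factorial(2 * n - 3)

for n in (4, 5, 6, 8, 10):
    print(n, unrooted_trees(n), rooted_trees(n))
```

For four species this gives 3 unrooted and 15 rooted trees; for ten species the rooted count is 17!! = 34,459,425.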
[Figure: the three possible unrooted trees for four species: (1,2 | 3,4), (1,3 | 2,4) and (1,4 | 2,3)]

Maximum Parsimony
How many substitutions?
[Figure: two reconstructions of a single site, one requiring 1 change (MP) and one requiring 5 changes]

Maximum Parsimony
Site        1  2  3  4  5  6  7  8  9  10
Species 1   A  G  G  G  T  A  A  C  T  G
Species 2   A  C  G  A  T  T  A  T  T  A
Species 3   A  T  A  A  T  T  G  T  C  T
Species 4   A  A  T  G  T  T  G  T  C  G
The number of changes is counted site by site on each of the three trees.
[Figure: site 2 (G, C, T, A) mapped onto the three trees; each tree requires 3 changes]
[Figure: site 4 (G, A, A, G) mapped onto the three trees; they require 2, 2 and 1 changes]

Changes per site on each tree:
Site          1  2  3  4  5  6  7  8  9  10   Total
(1,2 | 3,4)   0  3  2  2  0  1  1  1  1  3    14
(1,3 | 2,4)   0  3  2  2  0  1  2  1  2  3    16
(1,4 | 2,3)   0  3  2  1  0  1  2  1  2  3    15

Maximum Parsimony
[Figure: the tree (1,2 | 3,4), which has the lowest total, 14]

Finding most parsimonious trees - exact solutions
Exact solutions can only be used for small numbers of taxa.
Exhaustive search examines all possible trees.
Typically used for problems with fewer than 10 taxa.
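For a data set this small, exhaustive search is only a few lines. The sketch below applies the first (leaves-to-root) pass of Fitch's algorithm to every site of the four-species matrix on each of the three unrooted trees. The encoding is an assumption made for illustration: each unrooted tree is rooted on its internal branch and written as nested tuples, which does not affect the unweighted count. For the matrix as printed, site 10 (G, A, T, G) has three distinct states and comes out at 2 changes on every tree rather than the 3 tallied in the slides, so the totals here are 13, 15 and 14; the ranking of the trees is unchanged.

```python
def fitch_changes(tree, site):
    """Fitch's first pass for one site.

    tree -- a leaf name, or a 2-tuple (left_subtree, right_subtree)
    site -- dict mapping each leaf name to its state at this site
    Returns (possible states at the root of `tree`, number of changes).
    """
    if not isinstance(tree, tuple):
        return {site[tree]}, 0
    left_states, left_changes = fitch_changes(tree[0], site)
    right_states, right_changes = fitch_changes(tree[1], site)
    common = left_states & right_states
    if common:
        # Non-empty intersection: no extra change is needed at this node.
        return common, left_changes + right_changes
    # Empty intersection: take the union and count one change.
    return left_states | right_states, left_changes + right_changes + 1

SEQS = {1: "AGGGTAACTG", 2: "ACGATTATTA", 3: "ATAATTGTCT", 4: "AATGTTGTCG"}
TREES = {
    "(1,2 | 3,4)": ((1, 2), (3, 4)),
    "(1,3 | 2,4)": ((1, 3), (2, 4)),
    "(1,4 | 2,3)": ((1, 4), (2, 3)),
}

def per_site_changes(tree, seqs):
    """List the number of changes at every site of the alignment."""
    length = len(next(iter(seqs.values())))
    return [fitch_changes(tree, {name: s[i] for name, s in seqs.items()})[1]
            for i in range(length)]

for label, tree in TREES.items():
    changes = per_site_changes(tree, SEQS)
    print(label, changes, sum(changes))
```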
Finding most parsimonious trees - exhaustive search
(1) Starting tree, any 3 taxa (A, B, C)
Add the fourth taxon (D) in each of the three possible positions: three trees (2a), (2b), (2c)
Add the fifth taxon (E) in each of the five possible positions on each of the three trees -> 15 trees, and so on
[Figure: the starting three-taxon tree, the three four-taxon trees (2a)-(2c), and the five positions for E on each of them]

Finding most parsimonious trees - exact solutions
Branch and bound saves time by discarding families of trees during tree construction that cannot be smaller than the smallest tree found so far. (Here "smaller" means more parsimonious.)
Can be enhanced by specifying an initial upper bound for tree length.
Typically used only for problems with fewer than 20 taxa.

Finding most parsimonious trees: branch and bound
[Figure: branch-and-bound search tree: the starting tree on A, B, C; the three four-taxon trees B1, B2, B3; and the five-taxon trees C1.1-C1.5, C2.1-C2.5, C3.1-C3.5 derived from them]

Finding most parsimonious trees - heuristics
The number of possible trees grows faster than exponentially with the number of taxa, making exhaustive searches impractical for many data sets (finding a most parsimonious tree is NP-hard).
Heuristic methods are used to search tree space for most parsimonious trees.
The trees found are not guaranteed to be the most parsimonious; they are best guesses.

Finding most parsimonious trees - heuristics
Stepwise addition:
- Asis: taxa are added in the order in which they appear in the data matrix
- Closest: starts with the shortest 3-taxon tree and adds taxa in the order that produces the least increase in tree length
- Simple: the first taxon in the matrix is taken as a reference; taxa are added to it in the order of their decreasing similarity to the reference
- Random: taxa are added in a random sequence; many different sequences can be used
Recommend random with as many (e.g.
10-100) addition sequences as practical

Finding most parsimonious trees - heuristics
Branch Swapping:
- Nearest neighbor interchange (NNI)
- Subtree pruning and regrafting (SPR)
- Tree bisection and reconnection (TBR)

Finding most parsimonious trees - heuristics 1
Nearest neighbor interchange (NNI)
[Figure: NNI on a tree with taxa A-G: the subtrees around one internal branch are exchanged, giving two alternative trees]

Finding most parsimonious trees - heuristics 2
Subtree pruning and regrafting (SPR)
[Figure: SPR on a tree with taxa A-G: a subtree is pruned and regrafted elsewhere on the remaining tree]

Finding most parsimonious trees - heuristics 3
Tree bisection and reconnection (TBR)
[Figure: TBR on a tree with taxa A-G: the tree is bisected into two subtrees, which are then reconnected by a new branch]

Finding most parsimonious trees - heuristics - summary
Branch Swapping:
- Nearest neighbor interchange (NNI)
- Subtree pruning and regrafting (SPR)
- Tree bisection and reconnection (TBR)
The nature of heuristic searches means we cannot know which method will find the most parsimonious trees or all such trees.
However, TBR is the most extensive swapping routine and its use with multiple random addition sequences should work well.

Tree space may be populated by local minima and islands of most parsimonious trees
[Figure: tree length plotted over tree space. Random addition sequence replicates followed by branch swapping either reach the global minimum (success) or get stuck in a local minimum (failure)]

Multiple most parsimonious trees
Many parsimony analyses yield multiple equally optimal trees.
Multiple trees are due to either:
- Alternative equally parsimonious optimizations of homoplastic characters
- Missing data
- Or both
We can further select among these trees with additional criteria, but most commonly the relationships common to all the optimal trees are summarized with consensus trees.

Consensus methods - 1
A consensus tree is a summary of the agreement among a set of fundamental trees.
There are many different consensus methods that differ in:
1. the kind of agreement
2.
the level of agreement
Consensus methods can be used with any type of tree, not just parsimony trees.

Strict consensus methods - 1
Strict consensus methods require agreement across all the fundamental trees.
They show only those relationships that are unambiguously supported by the parsimonious interpretation of the data.
The commonest method (strict component consensus) focuses on clades.
This method produces a consensus tree that includes all and only those clades found in all the fundamental trees.
Other relationships (those in which the fundamental trees disagree) are shown as unresolved polytomies.

Strict consensus methods - 2
[Figure: two fundamental trees on taxa A-G and their strict component consensus tree]

Majority-rule consensus methods
Majority-rule consensus methods require agreement across a majority of the fundamental trees.
May include relationships that are not supported by the most parsimonious interpretation of the data.
The commonest method focuses on clades.
This method produces a consensus tree that includes all and only those clades found in a majority (>50%) of the fundamental trees.
Other relationships are shown as unresolved polytomies.
Of particular use in bootstrapping.

Majority rule consensus
[Figure: three fundamental trees on taxa A-G and their majority-rule component consensus tree. Numbers (100, 66, 66) indicate the frequency of clades in the fundamental trees]

Reduced consensus methods - 1
Focus upon any cladistic relationships (statements that some taxa are more closely related to each other than to some other taxa).
Reduced consensus methods occur in strict and majority-rule varieties.
Other relationships are shown as unresolved polytomies.
May be more sensitive than methods focusing only on clades.

Reduced consensus methods - 2
[Figure: two fundamental trees on taxa A-G]
Strict component consensus: completely unresolved.
Strict reduced cladistic consensus tree: Taxon G is
excluded

Consensus methods - 2
[Figure: three fundamental trees for Ochromonas, Symbiodinium, Prorocentrum, Loxodes, Tetrahymena, Tracheloraphis, Spirostomum, Euplotes and Gruberia, together with their strict (component) consensus, strict reduced cladistic consensus (Euplotes excluded) and majority-rule consensus (clade frequencies 100 and 66)]

Consensus methods - 3
Use strict methods to identify those relationships unambiguously supported by the parsimonious interpretation of the data.
Use reduced methods where consensus trees are poorly resolved.
Use majority-rule methods in bootstrapping.
Avoid other methods which have ambiguous interpretations.

Parsimony - advantages
- is a simple method, easily understood
- does not seem to depend on an explicit model of evolution
- gives both trees and associated hypotheses of character evolution
- should give reliable results if the data is well structured and homoplasy is either rare or randomly distributed on the tree

Parsimony - disadvantages
May give misleading results if homoplasy is common or concentrated in particular parts of the tree, e.g.:
- thermophilic convergence
- base composition biases
- long branch attraction
Underestimates branch lengths.
Model of evolution is implicit; the behaviour of the method is not well understood.
Parsimony is often justified on purely philosophical grounds (we must prefer the simplest hypotheses), particularly by morphologists.
For most molecular systematists this is uncompelling.

Parsimony can be inconsistent
Felsenstein (1978) developed a simple model phylogeny including four taxa and a
mixture of short and long branches.
Under this model parsimony will give the wrong tree.
[Figure: model tree on taxa A, B, C, D with two long branches (p) and three short branches (q); rates or branch lengths p >> q. The parsimony tree is wrong]
Long branches are attracted, but the similarity is homoplastic.
• With more data the certainty that parsimony will give the wrong tree increases, so parsimony is statistically inconsistent.
• Advocates of parsimony initially responded by claiming that Felsenstein's result showed only that his model was unrealistic.
• It is now recognized that long-branch attraction (the Felsenstein Zone) is one of the most serious problems in phylogenetic inference.

2. Perfect Phylogeny
Data on species is given by a Character State Matrix.
Cell (p, i) has value j iff character i of object (species) p has state j.
Goal: construct an evolutionary tree for the species.
Object   c1  c2  c3  c4  c5
A        1   1   2   0   0
B        2   0   1   2   1
C        3   2   3   3   1
D        0   3   4   1   0
E        1   1   0   0   1

Motivation: Evolution Tree
Internal nodes correspond to speciation events, where some character (attribute) is acquired.
Assumptions:
1. No reversals (characters are not lost)
2. No convergences (a character is created only once)

Perfect Phylogeny for a 0-1 Matrix
A 0-1 matrix: each character is either 0 (absent) or 1 (present).
A perfect phylogeny for the matrix is a tree T with no convergences and no reversals:
- Each of the n objects labels exactly one leaf of T.
- Each of the m characters labels exactly one edge of T.
- Object p has exactly the characters labeling the path from p to the root.
Character   1  2  3  4  5
A           1  1  0  0  0
B           0  0  1  0  0
C           1  1  0  0  1
D           0  0  1  1  0
E           0  1  0  0  0
[Figure: perfect phylogeny for this matrix: edges 2 and 3 leave the root, edge 1 lies below edge 2, edge 5 below edge 1, and edge 4 below edge 3; the leaves are E (below 2), A (below 1), C (below 5), B (below 3) and D (below 4)]

The (Binary) Perfect Phylogeny Problem
Problem: Given a 0-1 matrix M, determine if it has a perfect phylogeny, and construct one if it does.
(Note: edges are labeled by characters; an edge labeled by i represents changing the state of character i from 0 to 1).
Character   1  2  3  4  5
A           1  1  0  0  0
B           0  0  1  0  0
C           1  1  0  0  1
D           0  0  1  1  0
E           0  1  0  0  0
[Figure: perfect phylogeny for this matrix: edges 2 and 3 leave the root, edge 1 lies below edge 2, edge 5 below edge 1, and edge 4 below edge 3; the leaves are E, A, C, B and D]

Solution to Perfect Phylogeny Problem
Definition: Given a 0-1 matrix M, $O_k = \{j : M_{jk} = 1\}$; i.e., $O_k$ is the set of objects that have character k.
Theorem: M has a perfect phylogenetic tree iff the sets $\{O_i\}$ are laminar, i.e., for all i, j, either $O_i$ and $O_j$ are disjoint, or one includes the other.

Laminar:
Character   1  2  3  4  5
A           1  1  0  0  0
B           0  0  1  0  0
C           1  1  0  0  1
D           0  0  1  1  0
E           0  1  0  0  0

Not laminar:
Character   1  2  3  4  5
A           1  1  0  0  0
B           0  0  1  0  1
C           1  1  0  0  1
D           0  0  1  1  0
E           0  1  0  0  1

Proof: Assume M has a perfect phylogeny, and let i, j be given. Consider the edges labeled i and j.
Case 1: There is a root-to-leaf path containing both. Then one of $O_i$, $O_j$ is included in the other (e.g. characters 2 and 1 in the example tree).
Case 2: Not case 1. Then $O_i$ and $O_j$ are disjoint (e.g. characters 2 and 3 in the example tree).
[Figure: the example tree, in which edge 1 lies below edge 2, while edges 2 and 3 lie on different root-to-leaf paths]

Proof (cont.): Assume that for all i, j, either $O_i$ and $O_j$ are disjoint, or one includes the other. We prove by induction on the number of characters that M has a perfect phylogeny.
Basis: one character. Then there are at most two objects, one with and one without this character.
[Figure: one-character matrix with A = 1 and B = 0, and the corresponding tree, in which edge 1 leads to A]

Proof (cont.): Induction step: Assume correctness for m-1 characters, and consider a matrix with m characters (non-zero columns).
WLOG assume that $O_1$ is not contained in $O_j$ for j > 1.
Let $S_1$ be the set of objects that have character 1, and $S_2$ be the remaining objects. Then each character other than character 1 belongs to objects in $S_1$ or in $S_2$, but not both.
By induction there are trees $T_1$ and $T_2$ for $S_1$ and $S_2$. Combining them as in the figure gives the desired tree.
Character   1  2  3  4  5
A           1  1  0  0  0
B           0  0  1  0  0
C           1  1  0  0  1
D           0  0  1  1  0
E           1  0  0  0  0
[Figure: a root from which edge 1 leads to subtree T1, with subtree T2 attached directly to the root]

Efficient Implementation
1. Sort the columns by decreasing value when considered as binary numbers. (Time complexity: O(mn), using radix sort.)
Claim: If the binary value of column i is larger than that of column j, then $O_i$ is not a proper subset of $O_j$.
Proof: A larger binary value for column i means that some 1 in column i is not covered by a 1 in column j, i.e. $O_i - O_j$ is not empty.
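Before turning to the O(mn) method, the laminarity condition of the theorem can be tested directly by comparing every pair of sets $O_i$, $O_j$. The sketch below does exactly that on the two example matrices; it is a naive quadratic-in-m check meant only to illustrate the definition, and the function names are assumptions made for this example.

```python
from itertools import combinations

def object_sets(matrix, objects):
    """Return the list [O_1, ..., O_m]: for each column, the objects with a 1."""
    n_cols = len(matrix[0])
    return [{objects[r] for r in range(len(matrix)) if matrix[r][k]}
            for k in range(n_cols)]

def is_laminar(sets):
    """True iff every pair of sets is disjoint or nested."""
    return all(a.isdisjoint(b) or a <= b or b <= a
               for a, b in combinations(sets, 2))

OBJECTS = ["A", "B", "C", "D", "E"]
LAMINAR = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0],
]
NOT_LAMINAR = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
]

print(is_laminar(object_sets(LAMINAR, OBJECTS)))      # True
print(is_laminar(object_sets(NOT_LAMINAR, OBJECTS)))  # False
```

In the second matrix, $O_5$ = {B, C, E} overlaps $O_1$ = {A, C} without either containing the other, which is what the check detects.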
Original matrix:
Character   1  2  3  4  5
A           1  1  0  0  0
B           0  0  1  0  0
C           1  1  0  0  1
D           0  0  1  1  0
E           0  1  0  0  0

Sorted matrix:
Character   2  1  3  5  4
A           1  1  0  0  0
B           0  0  1  0  0
C           1  1  0  1  0
D           0  0  1  0  1
E           1  0  0  0  0

Efficient Implementation (2)
2. Make a backwards linked list of the 1's in each row (the leftmost 1 in each row points at itself). Time complexity: O(mn).
Claim: If the columns are sorted, then the set of columns is laminar iff for each column i, all the links leaving column i point at the same column. This can be checked in O(mn) time.
Character   2  1  3  5  4
A           1  1  0  0  0
B           0  0  1  0  0
C           1  1  0  1  0
D           0  0  1  0  1
E           1  0  0  0  0

Examples
Not laminar:
Character   2  1  3  5  4
A           1  1  0  0  0
B           0  0  1  0  0
C           1  1  0  1  0
D           0  0  1  0  1
E           1  0  1  1  0

Laminar:
Character   2  1  3  5  4
A           1  1  0  0  0
B           0  0  1  0  0
C           1  1  0  1  0
D           0  0  1  0  1
E           1  0  0  0  0

Efficient Implementation (3)
3. When the matrix is laminar, the tree edges corresponding to characters are defined by the backwards links in the matrix. The remaining edges and leaves are determined by the characters of each object. Needs O(mn) time.
Character   2  1  3  5  4
A           1  1  0  0  0
B           0  0  1  0  0
C           1  1  0  1  0
D           0  0  1  0  1
E           1  0  0  0  0
[Figure: the resulting tree: edges 2 and 3 leave the root, edge 1 lies below edge 2, edge 5 below edge 1, and edge 4 below edge 3; the leaves are E, A, C, B and D]
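The three steps can be put together in a short sketch. It makes several simplifying assumptions that are not part of the slides: Python's built-in sort stands in for the radix sort, `None` plays the role of a leftmost 1 "pointing at itself", all-zero columns are skipped, and the tree is returned as two parent maps (one for character edges, one for leaves) instead of an explicit tree structure. The function name and return format are illustrative only.

```python
def perfect_phylogeny(matrix, objects, characters):
    """Return (char_parent, leaf_parent), or None if M is not laminar.

    char_parent[c] is the character whose edge lies directly above the edge
    of c ("root" if there is none); leaf_parent[p] is the character edge
    under which object p hangs ("root" if p has no characters).
    """
    n_rows, n_cols = len(matrix), len(characters)
    # Step 1: sort columns by decreasing binary value (first row = top bit).
    order = sorted(range(n_cols),
                   key=lambda c: [matrix[r][c] for r in range(n_rows)],
                   reverse=True)
    # Step 2: walk the sorted columns; last_one[r] is the column of the
    # previous 1 in row r, i.e. the target of that row's backward link.
    last_one = [None] * n_rows
    link = {}
    for c in order:
        rows = [r for r in range(n_rows) if matrix[r][c]]
        targets = {last_one[r] for r in rows}
        if len(targets) > 1:
            return None  # links leaving column c disagree: not laminar
        if rows:
            link[c] = targets.pop()
            for r in rows:
                last_one[r] = c
    # Step 3: read the tree off the links.
    def name(col):
        return "root" if col is None else characters[col]
    char_parent = {characters[c]: name(t) for c, t in link.items()}
    leaf_parent = {objects[r]: name(last_one[r]) for r in range(n_rows)}
    return char_parent, leaf_parent

OBJECTS = ["A", "B", "C", "D", "E"]
CHARS = [1, 2, 3, 4, 5]
M = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0],
]
result = perfect_phylogeny(M, OBJECTS, CHARS)
print(result)
```

On the example matrix this yields edges 2 and 3 at the root, 1 below 2, 5 below 1 and 4 below 3, with leaves A, B, C, D, E hanging below edges 1, 3, 5, 4 and 2 respectively. Changing rows B and E to 0 0 1 0 1 and 0 1 0 0 1 (the non-laminar example) makes the function return None.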