#31 - Phylogenetics Character-Based Methods 11/05/07
BCB 444/544 F07 ISU Terribilini; BCB 444/544 Fall 07 Dobbs

BCB 444/544
Lecture 31
Phylogenetics - Character-Based Methods
#31_Nov05

Required Reading (before lecture)
Fri Oct 30 - Lecture 30
  Phylogenetics - Distance-Based Methods
  • Chp 11 - pp 142-169
Mon Nov 5 - Lecture 31
  Phylogenetics - Parsimony and ML
  • Chp 11 - pp 142-169
Wed Nov 7 - Lecture 32
  Machine Learning
Fri Nov 9 - Lecture 33
  Functional and Comparative Genomics
  • Chp 17 and Chp 18

Assignments & Announcements
Mon Oct 29 - HW#5
  HW#5 = Hands-on exercises with phylogenetics and tree-building software
  Due: Mon Nov 5 (not Fri Nov 1 as previously posted)

BCB 544 Only: New Homework Assignment
544 Extra#2
  Due: √ PART 1 - ASAP
       PART 2 - meeting prior to 5 PM Fri Nov 2
  Part 1 - Brief outline of Project, email to Drena & Michael
  after response/approval, then:
  Part 2 - More detailed outline of project
    Read a few papers and summarize status of problem
    Schedule meeting with Drena & Michael to discuss ideas

Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html
• Nov 7 Wed - BBMB Seminar 4:10 in 1414 MBB
  • Sharon Roth Dent, MD Anderson Cancer Center
  • Role of chromatin and chromatin modifying proteins in regulating gene expression
• Nov 8 Thurs - BBMB Seminar 4:10 in 1414 MBB
  • Jianzhi George Zhang, U. Michigan
  • Evolution of new functions for proteins
• Nov 9 Fri - BCB Faculty Seminar 2:10 in 102 SciI
  • Amy Andreotti, ISU
  • Something about NMR

Chp 11 - Phylogenetic Tree Construction Methods and Programs
SECTION IV MOLECULAR PHYLOGENETICS
Xiong: Chp 11 Phylogenetic Tree Construction Methods and Programs
• Distance-Based Methods
• Character-Based Methods
• Phylogenetic Tree Evaluation
• Phylogenetic Programs

Tree Construction
• Two main categories of tree building methods
  • Distance-based
    • Overall similarity between sequences
  • Character-based
    • Consider the entire MSA

Summary of Distance-Based Methods
• Clustering-based methods:
  • Computationally very fast and can handle large datasets that other methods cannot
  • Not guaranteed to find the best tree
• Optimality-based methods:
  • Better overall accuracies
  • Computationally slow
• All distance-based methods lose all sequence information and cannot infer the most likely state at an internal node

Character-Based Methods
• Based directly on the sequence characters in the MSA rather than overall distances
• Count mutational events accumulated on sequences
• Evolutionary dynamics of each character can be studied and ancestral sequences inferred
• Two popular approaches
  • Parsimony
  • Maximum Likelihood (ML)

Parsimony
• Parsimony is based on Occam's Razor: the simplest explanation is most likely correct
• Goal: Find the tree that allows evolution of the sequences with the fewest changes

Parsimony
• Parsimony score of a tree: The smallest (weighted) number of steps required by the tree
• Two parsimony problems:
  • Large Parsimony problem: Find the tree with the lowest parsimony score
  • Small Parsimony problem: Given a tree, find its parsimony score
• Use the small parsimony problem to solve the large parsimony problem

Algorithms for Small Parsimony
• Fitch's algorithm:
  • Based on set operations
  • Evolutionary steps have the same weight
• Sankoff's algorithm:
  • Based on dynamic programming
  • Allows steps to have different weights
• Both algorithms compute the minimum (weighted) number of steps a tree requires at a given site

[Figure: Fitch's Algorithm Example]

Sankoff's Algorithm
• Allows for different weights for different evolutionary steps
• Transitions (A <-> G or C <-> T) are more probable than transversions, so give a lower weight to transitions

[Figure: Sankoff's Algorithm Example]

[Figure: Sankoff's Algorithm Traceback]

Searching for a Most Parsimonious Tree
• Solving the large parsimony problem requires searching all possible trees (or does it?)
  • Exhaustive search (exact)
  • Branch-and-Bound (exact)
  • Heuristic search methods (not exact)

Exhaustive Search
• Build the only possible unrooted tree for three taxa (can be randomly chosen)
• Try all possible places to add the fourth taxon and score each tree
• Try all places to add the fifth taxon to the trees and score again …

[Figure: Adding the Fourth Taxon]

Why Finding a True Tree is Difficult
• The number of possible trees grows faster than exponentially with the number of species (or sequences)
• Number of rooted trees for $n$ taxa: $N_r = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
• Number of unrooted trees for $n$ taxa: $N_u = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
• To find the best tree, you must explore all possibilities (or must you?)
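To get a feel for how fast the tree counts above grow, here is a minimal Python sketch that evaluates the two formulas directly. The function names `num_rooted_trees` and `num_unrooted_trees` are illustrative choices, not part of the lecture.

```python
from math import factorial


def num_rooted_trees(n: int) -> int:
    """Rooted bifurcating trees for n taxa: (2n-3)! / (2^(n-2) * (n-2)!)."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def num_unrooted_trees(n: int) -> int:
    """Unrooted bifurcating trees for n taxa: (2n-5)! / (2^(n-3) * (n-3)!)."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


if __name__ == "__main__":
    # Print a small table: taxa, unrooted trees, rooted trees.
    for n in (3, 4, 5, 10, 20):
        print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

With 10 taxa there are already about two million unrooted trees, which is why exhaustive search is only practical for very small datasets.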
[Figure: Adding the Fifth Taxon]

Branch and Bound
• Similar to exhaustive search except that we maintain the score of the best tree obtained so far
• If the score of the current tree exceeds the current best score, backtrack and take the next available path
• Main idea: The parsimony score of a tree can only stay the same or increase as we add another taxon

Branch and Bound
• When a tip of the search tree is reached, the tree is either optimal (and retained) or suboptimal (and rejected)
• When all paths leading from the initial 3-taxon tree have been explored, the algorithm terminates, and all most parsimonious trees will have been identified

Branch and Bound
• One way to find a reasonable initial bound quickly:
  • Use UPGMA or NJ to build a complete tree
  • Calculate the parsimony score of this tree and use it as an upper bound on the best score in our search

Heuristic Search
• Shortcuts have been designed to reduce the search space
• Idea: Build a tree quickly (by NJ or some other fast method) and rearrange parts of it to explore some of the possible trees
• Branch swapping
  • Nearest neighbor interchange
  • Subtree pruning and regrafting
  • Tree bisection and reconnection

[Figure: Nearest-Neighbor Interchange]
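As a rough illustration of how a heuristic search uses the small parsimony problem, the sketch below scores trees with Fitch's set-based algorithm and takes one nearest-neighbor-interchange step around a single internal branch. The nested-tuple tree encoding, the toy alignment `ALIGNMENT`, and all function names are assumptions made for this example. It handles only the four-subtree case around one branch and is not a full search program.

```python
from typing import Dict, List, Set, Tuple, Union

# A tree is a leaf name or a pair of subtrees (hypothetical encoding).
Tree = Union[str, Tuple["Tree", "Tree"]]

# Toy alignment, made up for illustration only.
ALIGNMENT: Dict[str, str] = {
    "A": "AAGT",
    "B": "AAGC",
    "C": "CTGT",
    "D": "CTGC",
}


def fitch_site(tree: Tree, states: Dict[str, str]) -> Tuple[Set[str], int]:
    """Fitch's algorithm at one site: return (state set, number of steps)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # Non-empty intersection: no extra step is needed at this node.
        return common, left_cost + right_cost
    # Empty intersection: take the union and count one step.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree: Tree, alignment: Dict[str, str]) -> int:
    """Sum the equal-weight Fitch step counts over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


def nni_neighbors(a: Tree, b: Tree, c: Tree, d: Tree) -> List[Tree]:
    """The two alternative arrangements around the branch in ((a,b),(c,d))."""
    return [((a, c), (b, d)), ((a, d), (b, c))]


def best_nni_step(start: Tree, alignment: Dict[str, str]) -> Tuple[Tree, int]:
    """Score the start tree and its two NNI neighbors; keep the best one."""
    (a, b), (c, d) = start
    candidates = [start] + nni_neighbors(a, b, c, d)
    best = min(candidates, key=lambda t: parsimony_score(t, alignment))
    return best, parsimony_score(best, alignment)


if __name__ == "__main__":
    start = (("A", "D"), ("B", "C"))
    print("start score:", parsimony_score(start, ALIGNMENT))
    print("after one NNI step:", best_nni_step(start, ALIGNMENT))
```

The tree is rooted on the internal branch purely for bookkeeping; the Fitch step count does not depend on where the root is placed. A real search would repeat the interchange over every internal branch until no neighbor improves the score.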
[Figure: Subtree Pruning and Regrafting]

[Figure: Tree Bisection and Reconnection]

Stepwise Addition - Another Heuristic
• A greedy method
• Start with 3 taxon tree
• Add one taxon at a time
• Keep only the best tree found so far
• No guarantee of optimality, but may provide a good starting point for a search

Maximum Likelihood Method
• ML is based on a Markov model of evolution
  • Observed: The species labeling the leaves
  • Hidden: The ancestral states
  • Transition probabilities: The mutation probabilities
• Assumptions:
  • Only mutations are allowed
  • Sites are independent

Models of Evolution at a Site
• Transition probability matrix: $M = [m_{ij}]$, $i, j \in \{A, C, T, G\}$
  where $m_{ij} = \mathrm{Prob}(i \to j \text{ mutation in 1 time unit})$
• Branches may have different lengths

The Probability of an Assignment
[Figure: example tree with root T, internal nodes G and T, and leaves A, G, C, T]
Probability = $m_{TG} \cdot m_{GA} \cdot m_{GG} \cdot m_{TT} \cdot m_{TC} \cdot m_{TT}$

Ancestral Reconstruction: Most Likely Assignment
[Figure: tree with unknown ancestral states X (root), Y, Z and leaves A, G, C, T]
$L^* = \max_{X,Y,Z} \{ m_{XY} \cdot m_{YA} \cdot m_{YG} \cdot m_{XZ} \cdot m_{ZC} \cdot m_{ZT} \}$
• Compute using Viterbi algorithm

Likelihood of a Tree
[Figure: tree with unknown ancestral states X (root), Y, Z and leaves A, G, C, T]
$L = \sum_{X,Y,Z} \{ m_{XY} \cdot m_{YA} \cdot m_{YG} \cdot m_{XZ} \cdot m_{ZC} \cdot m_{ZT} \}$
• Compute using forward algorithm
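The max and sum formulas for the four-leaf tree can be checked by brute force, since there are only 4 x 4 x 4 assignments of the ancestral states X, Y, Z. The sketch below does that in Python. The substitution matrix is an assumption (a simple Jukes-Cantor-style matrix with every branch one time unit long), and the function names are hypothetical. As on the slides, no prior on the root state is included.

```python
from itertools import product
from typing import Dict, Tuple

BASES = "ACGT"
Matrix = Dict[str, Dict[str, float]]


def jc_style_matrix(p_change: float) -> Matrix:
    """Assumed toy model: each of the 3 possible substitutions has
    probability p_change / 3, and no change has probability 1 - p_change."""
    return {
        i: {j: (1.0 - p_change) if i == j else p_change / 3.0 for j in BASES}
        for i in BASES
    }


def assignment_probability(
    m: Matrix, x: str, y: str, z: str, leaves: Tuple[str, str, str, str]
) -> float:
    """m_XY * m_Y,l1 * m_Y,l2 * m_XZ * m_Z,l3 * m_Z,l4 for one assignment.

    Leaves l1, l2 hang below Y; leaves l3, l4 hang below Z; X is the root.
    """
    l1, l2, l3, l4 = leaves
    return m[x][y] * m[y][l1] * m[y][l2] * m[x][z] * m[z][l3] * m[z][l4]


def most_likely_assignment(m: Matrix, leaves: Tuple[str, str, str, str]):
    """Maximize over all (X, Y, Z); returns (probability, (X, Y, Z))."""
    return max(
        (assignment_probability(m, x, y, z, leaves), (x, y, z))
        for x, y, z in product(BASES, repeat=3)
    )


def tree_likelihood(m: Matrix, leaves: Tuple[str, str, str, str]) -> float:
    """Sum the assignment probability over all (X, Y, Z)."""
    return sum(
        assignment_probability(m, x, y, z, leaves)
        for x, y, z in product(BASES, repeat=3)
    )


if __name__ == "__main__":
    m = jc_style_matrix(0.1)
    leaves = ("A", "G", "C", "T")
    print("slide example (X=T, Y=G, Z=T):",
          assignment_probability(m, "T", "G", "T", leaves))
    print("most likely assignment:", most_likely_assignment(m, leaves))
    print("tree likelihood:", tree_likelihood(m, leaves))
```

Brute-force enumeration is only feasible for tiny trees. The dynamic-programming approaches named on the slides reach the same values without listing every assignment.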
Maximum Likelihood Comments
• ML is robust
• ML converges to the correct answer as more data is added
• Can put in a Bayesian statistical framework to obtain a distribution of possible phylogenies
• ML can be slow

Phylogenetic Tree Evaluation
• Bootstrapping
• Jackknifing
• Bayesian Simulation
• Statistical difference tests (are two trees significantly different?)
  • Kishino-Hasegawa Test (paired t-test)
  • Shimodaira-Hasegawa Test (χ² test)

Bootstrapping
• A bootstrap sample is obtained by sampling sites randomly with replacement
• Obtain a data matrix with same number of taxa and number of characters as original one
• Construct trees for samples
• For each branch in original tree, compute fraction of bootstrap samples in which that branch appears
  • Assigns a bootstrap support value to each branch
• Idea: If a grouping has a lot of support, it will be supported by at least some positions in most of the bootstrap samples

Bootstrapping Comments
• Bootstrapping doesn't really assess the accuracy of a tree, only indicates the consistency of the data
• To get reliable statistics, bootstrapping needs to be done on your tree 500-1000 times; this is a big problem if your tree took a few days to construct

Jackknifing
• Another resampling technique
• Randomly delete half of the sites in the dataset
• Construct new tree with this smaller dataset, see how often taxa are grouped
• Advantage: sites aren't duplicated
• Disadvantage: again really only measuring consistency of the data

Bayesian Simulation
• Using a Bayesian ML method to produce a tree automatically calculates the probability of many trees during the search
• Most trees sampled in the Bayesian ML search are near an optimal tree

Phylogenetic Programs
• Huge list at:
  • http://evolution.genetics.washington.edu/phylip/software.html
• PAUP* - one of the most popular programs, commercial, Mac and Unix only, nice user interface
• PHYLIP - free, multiplatform, a bit difficult to use but web servers make it easier
• WebPhylip - another interface for PHYLIP online

Phylogenetic Programs
• TREE-PUZZLE - uses a heuristic to allow ML on large datasets, also available as a web server
• PHYML - web based, uses genetic algorithm
• MrBayes - Bayesian program, fast and can handle large datasets, multiplatform download
• BAMBE - web based Bayesian program

Final Comments on Phylogenetics
• No method is perfect
• Different methods make very different assumptions
• If multiple methods using different assumptions come up with similar results, we should trust the results more than any single method
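Returning to the tree-evaluation slides, the bootstrap and jackknife resampling steps can be sketched in a few lines of Python. Everything named here is an assumption for illustration. In particular, `toy_groups` is a made-up stand-in for a real tree-building method: it only reports which taxon is closest to the first taxon, so that the example stays short.

```python
import random
from collections import Counter
from typing import Callable, Dict, FrozenSet, Set

Alignment = Dict[str, str]
Grouping = FrozenSet[str]


def bootstrap_sample(alignment: Alignment, rng: random.Random) -> Alignment:
    """Sample sites (columns) with replacement; same taxa, same length."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


def jackknife_sample(alignment: Alignment, rng: random.Random) -> Alignment:
    """Randomly keep half of the sites, without duplication."""
    n_sites = len(next(iter(alignment.values())))
    keep = sorted(rng.sample(range(n_sites), n_sites // 2))
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


def support_values(
    alignment: Alignment,
    build_groups: Callable[[Alignment], Set[Grouping]],
    n_replicates: int,
    rng: random.Random,
) -> Dict[Grouping, float]:
    """Fraction of bootstrap replicates in which each grouping appears."""
    counts: Counter = Counter()
    for _ in range(n_replicates):
        counts.update(build_groups(bootstrap_sample(alignment, rng)))
    return {group: count / n_replicates for group, count in counts.items()}


def toy_groups(alignment: Alignment) -> Set[Grouping]:
    """Hypothetical stand-in for tree building: pair the first taxon with
    the taxon that differs from it at the fewest sites."""
    names = sorted(alignment)
    first = names[0]
    closest = min(
        names[1:],
        key=lambda other: sum(
            x != y for x, y in zip(alignment[first], alignment[other])
        ),
    )
    return {frozenset((first, closest))}


if __name__ == "__main__":
    aln = {"A": "AAGTAAGT", "B": "AAGCAAGC", "C": "CTGTCTGT", "D": "CTGCCTGC"}
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for group, frac in support_values(aln, toy_groups, 500, rng).items():
        print(sorted(group), round(frac, 3))
```

To obtain support values for a real tree, `toy_groups` would be replaced by a function that runs an actual tree-building method and returns the groupings of the resulting tree. The fractions for the groupings present in the original tree are then its bootstrap support values.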