BCB 444/544 Lecture 31
Phylogenetics – Character-Based Methods
#31_Nov05
BCB 444/544 F07 ISU Terribilini #31- Phylogenetics - Character-Based Methods 11/05/07

Required Reading (before lecture)
Fri Oct 30 - Lecture 30
Phylogenetic – Distance-Based Methods
• Chp 11 - pp 142 – 169
Mon Nov 5 - Lecture 31
Phylogenetics – Parsimony and ML
• Chp 11 - pp 142 – 169
Wed Nov 7 - Lecture 32
Machine Learning
Fri Nov 9 - Lecture 33
Functional and Comparative Genomics
• Chp 17 and Chp 18

Assignments & Announcements
Mon Oct 29 - HW#5
HW#5 = Hands-on exercises with phylogenetics and tree-building software
Due: Mon Nov 5 (not Fri Nov 1 as previously posted)

BCB 544 Only: New Homework Assignment
544 Extra#2
Due: √PART 1 - ASAP
     PART 2 - meeting prior to 5 PM Fri Nov 2
Part 1 - Brief outline of Project, email to Drena & Michael
after response/approval, then:
Part 2 - More detailed outline of project
Read a few papers and summarize status of problem
Schedule meeting with Drena & Michael to discuss ideas

Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html
• Nov 7 Wed - BBMB Seminar 4:10 in 1414 MBB
  • Sharon Roth Dent, MD Anderson Cancer Center
  • Role of chromatin and chromatin modifying proteins in regulating gene expression
• Nov 8 Thurs - BBMB Seminar 4:10 in 1414 MBB
  • Jianzhi George Zhang U.
Michigan
  • Evolution of new functions for proteins
• Nov 9 Fri - BCB Faculty Seminar 2:10 in 102 SciI
  • Amy Andreotti ISU
  • Something about NMR

Chp 11 – Phylogenetic Tree Construction Methods and Programs
SECTION IV MOLECULAR PHYLOGENETICS
Xiong: Chp 11 Phylogenetic Tree Construction Methods and Programs
• Distance-Based Methods
• Character-Based Methods
• Phylogenetic Tree Evaluation
• Phylogenetic Programs

Tree Construction
• Two main categories of tree building methods
  • Distance-based
    • Overall similarity between sequences
  • Character-based
    • Consider the entire MSA

Summary of Distance-Based Methods
• Clustering-based methods:
  • Computationally very fast and can handle large datasets that other methods cannot
  • Not guaranteed to find the best tree
• Optimality-based methods:
  • Better overall accuracies
  • Computationally slow
• All distance-based methods lose all sequence information and cannot infer the most likely state at an internal node

Character-Based Methods
• Based directly on the sequence characters in the MSA rather than overall distances
• Count mutational events accumulated on sequences
• Evolutionary dynamics of each character can be studied and ancestral sequences inferred
• Two popular approaches
  • Parsimony
  • Maximum Likelihood (ML)

Parsimony
• Parsimony is based on Occam’s Razor – the simplest explanation is most likely correct
• Goal: Find the tree that allows evolution of the sequences with the fewest changes

Parsimony
•
Parsimony score of a tree: The smallest (weighted) number of steps required by the tree
• Two parsimony problems:
  • Large Parsimony problem: Find the tree with the lowest parsimony score
  • Small Parsimony problem: Given a tree, find its parsimony score
• Use the small parsimony problem to solve the large parsimony problem

Algorithms for Small Parsimony
• Fitch’s algorithm:
  • Based on set operations
  • Evolutionary steps have the same weight
• Sankoff’s algorithm:
  • Based on dynamic programming
  • Allows steps to have different weights
• Both algorithms compute the minimum (weighted) number of steps a tree requires at a given site

Fitch’s Algorithm Example
[figure]

Sankoff’s Algorithm
• Allows for different weights for different evolutionary steps
• Transitions (A <-> G or C <-> T) are more probable than transversions, so give a lower weight to transitions

Sankoff’s Algorithm Example
[figure]

Sankoff’s Algorithm Traceback
[figure]

Searching for a Most Parsimonious Tree
• Solving the large parsimony problem requires searching all possible trees (or does it?)
  • Exhaustive search (exact)
  • Branch-and-Bound (exact)
  • Heuristic search methods (not exact)

Exhaustive Search
• Build the only possible unrooted tree for three taxa (the three can be chosen at random)
• Try all possible places to add the fourth taxon and score each tree
• Try all places to add the fifth taxon to each of those trees and score again …

Why Finding a True Tree is Difficult
• The number of possible trees grows faster than exponentially with the number n of species (or sequences)
• Number of rooted trees: $N_r = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
• Number of unrooted trees: $N_u = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
• Example: for n = 10, $N_r$ = 34,459,425 and $N_u$ = 2,027,025
• To find the best tree, you must explore all possibilities (or must you?)

Adding the Fourth Taxon
[figure]

Adding the Fifth Taxon
[figure]

Branch and Bound
• Similar to exhaustive search, except that we maintain the score of the best tree obtained so far
• If the score of the current (partial) tree exceeds the current best score, backtrack and take the next available path
• Main idea: The parsimony score of a tree cannot decrease as we add another taxon

Branch and Bound
• When a tip of the search tree is reached, the tree is either optimal (and retained) or suboptimal (and rejected)
• When all paths leading from the initial 3-taxon tree have been explored, the algorithm terminates, and all most parsimonious trees will have been identified

Branch and Bound
[figure]
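Both exhaustive search and branch and bound need the parsimony score of every candidate tree, which is the small parsimony problem. The following is a minimal sketch of Fitch's set-based scoring with unit step weights, applied to the three possible 4-taxon topologies. The nested-tuple tree encoding, the function names `fitch_score` and `tree_score`, and the toy alignment are assumptions made for illustration; they are not taken from the lecture.

```python
def fitch_score(tree, leaf_states):
    """Return (number of changes, state set at this node) for one site.

    tree: a leaf name (str) or a 2-tuple of subtrees (hypothetical encoding).
    leaf_states: dict mapping each leaf name to its observed character.
    """
    if isinstance(tree, str):
        # Leaf: the set holds only the observed character, no changes yet.
        return 0, {leaf_states[tree]}
    left, right = tree
    left_score, left_set = fitch_score(left, leaf_states)
    right_score, right_set = fitch_score(right, leaf_states)
    common = left_set & right_set
    if common:
        # Children share a state: keep the intersection, no extra change.
        return left_score + right_score, common
    # Children disagree: keep the union and count one change.
    return left_score + right_score + 1, left_set | right_set


def tree_score(tree, alignment):
    """Sum the per-site Fitch scores over all columns of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {name: seq[i] for name, seq in alignment.items()})[0]
        for i in range(n_sites)
    )


# Toy data (made up for this sketch): four taxa, three aligned sites.
alignment = {"a": "AAG", "b": "AAG", "c": "CAT", "d": "CGT"}

# The three unrooted 4-taxon topologies, each written with an arbitrary root.
candidates = {
    "((a,b),(c,d))": (("a", "b"), ("c", "d")),
    "((a,c),(b,d))": (("a", "c"), ("b", "d")),
    "((a,d),(b,c))": (("a", "d"), ("b", "c")),
}

scores = {name: tree_score(t, alignment) for name, t in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

With only three candidate trees this is an exhaustive search; the same scoring function could serve as the inner step of a branch-and-bound or heuristic search.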

Branch and Bound
• One way to find a reasonable initial bound quickly:
  • Use UPGMA or NJ to build a complete tree
  • Calculate the parsimony score of this tree and use it as the initial best score in our search (it is an upper bound on the optimal score)

Heuristic Search
• Shortcuts have been designed to reduce the search space
• Idea: Build a tree quickly (by NJ or some other fast method) and rearrange parts of it to explore some of the possible trees
• Branch swapping
  • Nearest neighbor interchange
  • Subtree pruning and regrafting
  • Tree bisection and reconnection

Nearest-Neighbor Interchange
[figure]

Subtree Pruning and Regrafting
[figure]

Tree Bisection and Reconnection
[figure]

Stepwise Addition – Another Heuristic
• A greedy method
  • Start with a 3-taxon tree
  • Add one taxon at a time
  • Keep only the best tree found so far
• No guarantee of optimality, but may provide a good starting point for a search

Maximum Likelihood Method
• ML is based on a Markov model of evolution
  • Observed: The species labeling the leaves
  • Hidden: The ancestral states
  • Transition probabilities: The mutation probabilities
• Assumptions:
  • Only mutations are allowed
  • Sites are independent

Models of Evolution at a Site
• Transition probability matrix: $M = [m_{ij}]$, $i, j \in \{A, C, T, G\}$
  where $m_{ij}$ = Prob(i -> j mutation in 1 time unit)
• Branches may have different lengths
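One way to read "branches may have different lengths" is that, under the Markov model, a branch spanning t time units uses the t-step probabilities $M^t$. The sketch below illustrates this under loudly stated assumptions: an equal-rates one-step matrix in which every specific change has the same probability p, integer branch lengths, and the made-up value p = 0.01. The function names `one_step_matrix` and `branch_matrix` are hypothetical and not from the lecture.

```python
NUCS = "ACTG"  # same ordering as the index set {A, C, T, G} above


def one_step_matrix(p):
    """Assumed equal-rates model: each specific change i -> j has probability p."""
    return [[1 - 3 * p if i == j else p for j in range(4)] for i in range(4)]


def mat_mul(a, b):
    """Plain 4x4 matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def branch_matrix(m, t):
    """Transition probabilities along a branch of t time units, i.e. m to the power t."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for _ in range(t):
        result = mat_mul(result, m)
    return result


M = one_step_matrix(0.01)    # p = 0.01 is an illustrative value only
M10 = branch_matrix(M, 10)   # a branch ten time units long

a, g = NUCS.index("A"), NUCS.index("G")
print("A -> G in 1 unit:  ", M[a][g])
print("A -> G in 10 units:", M10[a][g])
```

Longer branches give more opportunity for change, so the off-diagonal entries of `M10` are larger than those of `M`, while every row still sums to 1.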

The Probability of an Assignment
[figure: tree with root T; internal node G with leaves A and G; internal node T with leaves C and T]
Probability = $m_{TG} \cdot m_{GA} \cdot m_{GG} \cdot m_{TT} \cdot m_{TC} \cdot m_{TT}$

Ancestral Reconstruction: Most Likely Assignment
[figure: tree with root X; internal node Y with leaves A and G; internal node Z with leaves C and T]
$L^* = \max_{X,Y,Z} \{ m_{XY} \cdot m_{YA} \cdot m_{YG} \cdot m_{XZ} \cdot m_{ZC} \cdot m_{ZT} \}$
Compute using the Viterbi algorithm

Likelihood of a Tree
[figure: tree with root X; internal node Y with leaves A and G; internal node Z with leaves C and T]
$L = \sum_{X,Y,Z} \{ m_{XY} \cdot m_{YA} \cdot m_{YG} \cdot m_{XZ} \cdot m_{ZC} \cdot m_{ZT} \}$
Compute using the forward algorithm

Maximum Likelihood Comments
• ML is robust
• ML converges to the correct answer as more data is added
• Can put in a Bayesian statistical framework to obtain a distribution of possible phylogenies
• ML can be slow

Phylogenetic Tree Evaluation
• Bootstrapping
• Jackknifing
• Bayesian Simulation
• Statistical difference tests (are two trees significantly different?)
  • Kishino-Hasegawa Test (paired t-test)
  • Shimodaira-Hasegawa Test (χ² test)

Bootstrapping
• A bootstrap sample is obtained by sampling sites randomly with replacement
• Obtain a data matrix with the same number of taxa and number of characters as the original one
• Construct trees for the samples
• For each branch in the original tree, compute the fraction of bootstrap samples in which that branch appears
  • Assigns a bootstrap support value to each branch
• Idea: If a grouping has a lot of support, it will be supported by at least some positions in most of the bootstrap samples

Bootstrapping Comments
• Bootstrapping doesn’t really assess the accuracy of a tree; it only indicates the consistency of the data
• To get reliable statistics, bootstrapping needs to be done on your tree 500 – 1000 times; this is a big problem if your tree took a few days to construct

Jackknifing
• Another resampling technique
• Randomly delete half of the sites in the dataset
• Construct a new tree with this smaller dataset and see how often taxa are grouped
• Advantage – sites aren’t duplicated
• Disadvantage – again really only measuring consistency of the data

Bayesian Simulation
• Using a Bayesian ML method to produce a tree automatically calculates the probability of many trees during the search
• Most trees sampled in the Bayesian ML search are near an optimal tree

Phylogenetic Programs
• Huge list at:
  • http://evolution.genetics.washington.edu/phylip/software.html
• PAUP* - one of the most popular programs, commercial, Mac and Unix only, nice user interface
• PHYLIP –
free, multiplatform, a bit difficult to use but web servers make it easier
• WebPhylip – another interface for PHYLIP online

Phylogenetic Programs
• TREE-PUZZLE – uses a heuristic to allow ML on large datasets, also available as a web server
• PHYML – web based, uses genetic algorithm
• MrBayes – Bayesian program, fast and can handle large datasets, multiplatform download
• BAMBE – web based Bayesian program

Final Comments on Phylogenetics
• No method is perfect
• Different methods make very different assumptions
• If multiple methods using different assumptions come up with similar results, we should trust the results more than any single method
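As a closing illustration, the bootstrapping procedure described earlier (resample sites with replacement, rebuild, count how often a grouping reappears) can be sketched in a few lines. This is only a sketch under stated assumptions: `closest_pair` is a toy stand-in for a real tree-building method, reporting just the pair of taxa with the fewest mismatches, and the alignment, function names, replicate count, and seed are all made up for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Sample columns with replacement: same taxa, same number of sites."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def closest_pair(alignment):
    """Toy stand-in for a tree builder: the pair of taxa with fewest mismatches."""
    names = sorted(alignment)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]

    def mismatches(pair):
        return sum(x != y for x, y in zip(alignment[pair[0]], alignment[pair[1]]))

    return frozenset(min(pairs, key=mismatches))


def bootstrap_support(alignment, n_replicates=500, seed=0):
    """Fraction of bootstrap replicates that recover the original grouping."""
    rng = random.Random(seed)
    original = closest_pair(alignment)
    hits = sum(
        closest_pair(bootstrap_alignment(alignment, rng)) == original
        for _ in range(n_replicates)
    )
    return original, hits / n_replicates


# Toy alignment (made up for this sketch): four taxa, eight sites.
alignment = {
    "a": "AAGCTTGA",
    "b": "AAGCTTGC",
    "c": "CAGTTAGA",
    "d": "CGTTTAGC",
}
grouping, support = bootstrap_support(alignment)
print(sorted(grouping), support)
```

In practice the stand-in would be replaced by a full tree search, which is why 500 to 1000 replicates can be expensive when a single tree takes days to build.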