#30 - Phylogenetics Distance-Based Methods 11/02/07

BCB 444/544
Lecture 30
Phylogenetics – Distance-Based Methods
#30_Nov02
BCB 444/544 F07 ISU Terribilini
BCB 444/544 Fall 07 Dobbs

Required Reading (before lecture)
Wed Oct 31 - Lecture 29
Phylogenetics Basics
• Chp 10 - pp 127 - 141
Thurs Nov 1 - Lab 9
Gene & Regulatory Element Prediction
Fri Nov 2 - Lecture 30
Phylogenetics – Distance-Based Methods
• Chp 11 - pp 142 - 169
Mon Nov 5 - Lecture 31
Phylogenetics – Parsimony and ML
• Chp 11 - pp 142 - 169

Assignments & Announcements
Mon Oct 29 - HW#5
HW#5 = Hands-on exercises with phylogenetics and tree-building software
Due: Mon Nov 5

BCB 544 "Team" Projects
Last week of classes will be devoted to Projects
• Written reports due:
  • Mon Dec 3 (no class that day) (not Fri Nov 1 as previously posted)
• Oral presentations (20-30') will be:
  • Wed-Fri Dec 5,6,7
  • 1 or 2 teams will present during each class period
• See Guidelines for Projects posted online

BCB 544 Only: New Homework Assignment
544 Extra#2
Due: √PART 1 - ASAP
     PART 2 - meeting prior to 5 PM Fri Nov 2
Part 1 - Brief outline of Project, email to Drena & Michael
after response/approval, then:
Part 2 - More detailed outline of project
Read a few papers and summarize status of problem
Schedule meeting with Drena & Michael to discuss ideas

Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html
• Nov 2 Fri - BCB Faculty Seminar 2:10 in 102 ScI
  • Bob Jernigan BBMB, ISU
  • Control of Protein Motions by Structure

Chp 10 - Phylogenetics
SECTION IV MOLECULAR PHYLOGENETICS
Xiong: Chp 10 Phylogenetics Basics
• Evolution and Phylogenetics
• Terminology
• Gene Phylogeny vs. Species Phylogeny
• Forms of Tree Representation
• Why Finding a True Tree is Difficult
• Procedure of Building a Phylogenetic Tree

Tree Building Procedure
• Choose molecular markers
• Perform MSA
• Choose a model of evolution
• Determine tree building method
• Assess tree reliability

Choice of Molecular Markers
• Very closely related organisms - use nucleic acid sequences, which show more differences
• For individuals within a species - use noncoding regions of mtDNA, which have a faster mutation rate
• More distantly related species - use slowly evolving nucleic acid sequences like ribosomal RNA, or protein sequences
• Very distantly related species - use highly conserved protein sequences

Multiple Sequence Alignment
• Most critical step in tree building - cannot build correct tree without correct alignment
• Should build alignments with multiple programs, then inspect and compare to identify the most reasonable one
• Most alignments need manual editing
  • Make sure important functional residues align
  • Align secondary structure elements
  • Use full alignment or just parts

Automatic Editing of Alignments
• Rascal and NorMD – correct alignment errors, remove potentially unrelated or highly divergent sequences
• Gblocks – detect and eliminate poorly aligned positions and divergent regions

How do we measure divergence between sequences?
• Simple measure – just count the number of substitutions observed between the sequences in the MSA
• Problem – number of substitutions may not represent the number of evolutionary events that actually occurred

Multiple Substitutions
[Figure: a site passing through several nucleotide substitutions]
Just because we see only one difference does not mean that there was only one evolutionary event

Multiple Substitutions
[Figure: a site that changes and then reverts to the original nucleotide]
Just because we see no difference does not mean that there were no evolutionary events

Choosing Substitution Models
• Statistical models of evolution are used to correct for the multiple substitution problem
• Focus on DNA models

Jukes-Cantor Model
• Jukes-Cantor model assumes all nucleotides are substituted with equal probability
• Can be used to correct for multiple substitutions

Many Other Models

Evolutionary Models for Protein Sequences
• PAM and JTT substitution matrices already take into account multiple substitutions
• There are also models similar to Jukes-Cantor for protein sequences

What about differences in mutation rates between positions within a sequence?
• One of our assumptions was that all positions in a sequence are evolving at the same rate
• Bad assumption
  • Third position in a codon changes with higher frequency
  • In proteins, some positions tolerate amino acid changes and others do not
• This variation is called among-site rate heterogeneity
• Many tree building programs have parameters meant to deal with this problem – adds to complexity of getting the correct tree

Chp 11 – Phylogenetic Tree Construction Methods and Programs
SECTION IV MOLECULAR PHYLOGENETICS
Xiong: Chp 11 Phylogenetic Tree Construction Methods and Programs
• Distance-Based Methods
• Character-Based Methods
• Phylogenetic Tree Evaluation
• Phylogenetic Programs

Tree Construction
• Two main categories of tree building methods
  • Distance-based
    • Overall similarity between sequences
  • Character-based
    • Consider the entire MSA

Distance-Based Methods
• Given a MSA and an evolutionary model, calculate the distance between all pairs of sequences
• Construct distance matrix
• Construct phylogenetic tree based on the distance matrix

Distance Matrices
      a    b    c    d
a     0
b     6    0
c     7    3    0
d    14   10    9    0
[Figure: tree relating a, b, c, d drawn against a scale from 0 to 8]

Distance-Based Methods
• Two ways to construct a tree based on a distance matrix
  • Clustering
  • Optimality

Clustering-Based Methods
• E.g., UPGMA and Neighbor-Joining
• A cluster is a set of taxa
• Interspecies distances translate into intercluster distances
• Clusters are repeatedly merged
  • “Closest” clusters merged first
  • Distances are recomputed after merging

UPGMA
• UPGMA – Unweighted Pair Group Method Using Arithmetic Average
• Uses molecular clock assumption – all taxa evolve at a constant rate and are equally distant from the root (ultrametric tree)
• This assumption is usually wrong
• So why use UPGMA?
  • Very fast

UPGMA Example
[Worked example shown as figures on four slides; not recoverable from the extracted text]

Neighbor Joining
• Idea: Find a pair of taxa that are close to each other but far from other taxa
  • Implicitly finds a pair of neighboring taxa
• No molecular clock assumption
• NJ corrects for unequal evolutionary rates between sequences by using a conversion step
• The conversion step requires calculation of “r-values” and “transformed r-values”

The r-value for a sequence is:
$r_i = \sum_{j \neq i} d_{ij}$
The sum of the distances between sequence i and all other sequences

The transformed r-value for a sequence is:
$r'_i = \frac{r_i}{n - 2}$
Where n is the number of taxa
Transformed r-values are used to determine the distance of a taxon to the nearest node
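
The two definitions above translate directly into a few lines of code. The sketch below is a minimal illustration in Python and is not part of the lecture: the function name `r_values` and the dictionary-of-dictionaries layout are assumptions made here, and the distances are the four-taxon matrix (A, B, C, D) from the lecture's worked neighbor-joining example. It assumes at least three taxa, since the divisor is n - 2.

```python
def r_values(dist):
    """Return (r, r_prime) for a symmetric distance table.

    dist: dict mapping taxon -> dict mapping other taxon -> distance
          (hypothetical layout chosen for this sketch).
    """
    n = len(dist)
    # r_i: sum of the distances from taxon i to every other taxon
    r = {i: sum(d for j, d in row.items() if j != i) for i, row in dist.items()}
    # r'_i = r_i / (n - 2), where n is the number of taxa
    r_prime = {i: r_i / (n - 2) for i, r_i in r.items()}
    return r, r_prime


# Pairwise distances from the lecture's four-taxon example
pairs = {("A", "B"): 0.40, ("A", "C"): 0.35, ("B", "C"): 0.45,
         ("A", "D"): 0.60, ("B", "D"): 0.70, ("C", "D"): 0.55}
dist = {t: {} for t in "ABCD"}
for (i, j), d in pairs.items():
    dist[i][j] = d
    dist[j][i] = d

r, r_prime = r_values(dist)
print(round(r["A"], 3), round(r_prime["A"], 3))  # prints: 1.35 0.675
```

The printed values agree with Step 1 of the lecture's example, where r for taxon A is 1.35 and its transformed r-value is 0.675.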

Neighbor Joining
The converted distance between two sequences is:
$d'_{ij} = d_{ij} - \frac{1}{2}(r_i + r_j)$
These converted distances are used in building the tree

Neighbor Joining
The final equation we need is for computing the distance from a new cluster to each taxon.
Assume taxa i and j were merged into a cluster u. The distance from taxon i to cluster u is:
$d_{iu} = \frac{d_{ij} + (r'_i - r'_j)}{2}$

Neighbor Joining Example
• Initialize tree into a star shape with all taxa connected to the center
       A      B      C
B    0.40
C    0.35   0.45
D    0.60   0.70   0.55
[Figure: star tree with taxa A, B, C, D]

Neighbor Joining Example
• Step 1: Compute r-values and transformed r-values for all taxa
$r_A = d_{AB} + d_{AC} + d_{AD} = 0.4 + 0.35 + 0.6 = 1.35$
$r'_A = \frac{r_A}{4 - 2} = \frac{1.35}{2} = 0.675$
Likewise, $r_B = 0.4 + 0.45 + 0.7 = 1.55$ and $r'_B = 0.775$

Neighbor Joining Example
• Step 2: Compute converted distances
$d'_{AB} = d_{AB} - \frac{1}{2}(r_A + r_B) = 0.4 - \frac{1}{2}(1.35 + 1.55) = -1.05$

Neighbor Joining Example
• Step 3: Fill out converted distance matrix
       A       B       C
B    -1.05
C    -1      -1
D    -1      -1      -1.05

Neighbor Joining Example
• Step 4: Create a node by merging closest taxa
• In this example, the converted distance between A and B is the same as the converted distance between C and D
• We can pick either pair to start with
• Let’s pick A and B and create a node called U
[Figure: A and B joined at new node U, branch lengths not yet known]

Neighbor Joining Example
• Step 5: Compute branch lengths
• Use the equation for computing the distance from a taxon to a node
$d_{AU} = \frac{d_{AB} + (r'_A - r'_B)}{2} = \frac{0.4 + (0.675 - 0.775)}{2} = 0.15$
[Figure: A joined to U with branch length 0.15, B joined to U with branch length 0.25]

Neighbor Joining Example
• Step 6: Construct reduced distance matrix by computing distances from each remaining taxon to the new node U
• In UPGMA, we simply calculated the average
$d_{CU} = \frac{(d_{AC} - d_{UA}) + (d_{BC} - d_{UB})}{2} = \frac{(0.35 - 0.15) + (0.45 - 0.25)}{2} = 0.2$

Neighbor Joining Example
Our reduced distance matrix:
       U      C
C    0.20
D    0.45   0.55

Neighbor Joining Example
• From here, we go back to step 1
• Continue until all taxa have been decomposed from the star tree

Optimality-Based Methods
• Clustering methods produce a single tree with no ability to judge how good it is compared to alternative tree topologies
• Optimality-based methods compare all possible tree topologies and select a tree that best fits the distance matrix
• Two algorithms:
  • Fitch-Margoliash
  • Minimum evolution

Fitch-Margoliash
• Selects best tree among all possible trees based on minimum deviation between distances calculated in the tree and distances in the distance matrix
• Basically, a least squares method
  • Dij = distance between i and j in matrix
  • dij = distance between i and j in tree
• Objective: Find tree that minimizes
$\sum_{1 \le i < j \le n} (D_{ij} - d_{ij})^2$

Minimum Evolution
• Similar to Fitch-Margoliash, but uses a different optimality criterion
• Searches for a tree with the minimum total branch length
• This is an indirect way of achieving the best fit of the branch lengths with the original data

Summary of Distance-Based Methods
• Clustering-based methods:
  • Computationally very fast and can handle large datasets that other methods cannot
  • Not guaranteed to find the best tree
• Optimality-based methods:
  • Better overall accuracies
  • Computationally slow
• All distance-based methods reduce the alignment to pairwise distances, so sequence information is lost and the most likely state at an internal node cannot be inferred
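
The neighbor-joining example earlier in this lecture (Steps 1 to 6) can be reproduced with a short script. What follows is a minimal sketch in Python under stated assumptions, not the lecture's own code: the function name `nj_step`, the frozenset-keyed dictionary, and the tie-breaking rule (first pair in sorted order) are choices made here for illustration. It uses the converted-distance formula exactly as written in the lecture, $d'_{ij} = d_{ij} - \frac{1}{2}(r_i + r_j)$; many references divide by $n - 2$ instead of 2, which gives the same values when $n = 4$ as in this example. The sketch assumes at least three taxa.

```python
from itertools import combinations


def nj_step(taxa, d, new_name):
    """Perform one neighbor-joining merge (illustrative sketch).

    taxa: list of taxon names
    d: dict mapping frozenset({i, j}) -> distance
    Returns (merged pair, branch lengths to new node, reduced taxa, reduced d).
    """
    n = len(taxa)

    def dist(i, j):
        return d[frozenset((i, j))]

    # Step 1: r-values and transformed r-values
    r = {i: sum(dist(i, j) for j in taxa if j != i) for i in taxa}
    r_prime = {i: r[i] / (n - 2) for i in taxa}

    # Steps 2-3: converted distances, using the formula as written in the lecture
    conv = {(i, j): dist(i, j) - 0.5 * (r[i] + r[j])
            for i, j in combinations(sorted(taxa), 2)}

    # Step 4: merge the pair with the smallest converted distance.
    # Ties go to the first pair in sorted order (an assumption of this sketch);
    # rounding keeps floating-point noise from deciding the tie.
    i, j = min(conv, key=lambda p: (round(conv[p], 9), p))

    # Step 5: branch lengths from each merged taxon to the new node
    d_iu = (dist(i, j) + (r_prime[i] - r_prime[j])) / 2
    d_ju = (dist(i, j) + (r_prime[j] - r_prime[i])) / 2

    # Step 6: reduced matrix with distances from remaining taxa to the new node
    rest = [k for k in taxa if k not in (i, j)]
    reduced = {frozenset((a, b)): dist(a, b) for a, b in combinations(rest, 2)}
    for k in rest:
        reduced[frozenset((new_name, k))] = (
            (dist(i, k) - d_iu) + (dist(j, k) - d_ju)) / 2

    return (i, j), {i: d_iu, j: d_ju}, [new_name] + rest, reduced


# Distance matrix from the lecture's example
taxa = ["A", "B", "C", "D"]
d = {frozenset(p): v for p, v in {
    ("A", "B"): 0.40, ("A", "C"): 0.35, ("B", "C"): 0.45,
    ("A", "D"): 0.60, ("B", "D"): 0.70, ("C", "D"): 0.55}.items()}

pair, branches, new_taxa, reduced = nj_step(taxa, d, "U")
print(pair)                                            # ('A', 'B')
print({k: round(v, 2) for k, v in branches.items()})   # {'A': 0.15, 'B': 0.25}
print(round(reduced[frozenset(("U", "C"))], 2))        # 0.2
print(round(reduced[frozenset(("U", "D"))], 2))        # 0.45
```

Calling `nj_step` again on `new_taxa` and `reduced` corresponds to the lecture's instruction to go back to Step 1 with the reduced matrix.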