BCB 444/544
Lecture 30
Phylogenetics – Distance-Based Methods
#30_Nov02

BCB 444/544 F07 ISU Terribilini #30- Phylogenetics - Distance-Based Methods 11/02/07

Required Reading (before lecture)
Wed Oct 31 - Lecture 29
Phylogenetics Basics
• Chp 10 - pp 127 - 141
Thurs Nov 1 - Lab 9
Gene & Regulatory Element Prediction
Fri Nov 2 - Lecture 30
Phylogenetics – Distance-Based Methods
• Chp 11 - pp 142 - 169
Mon Nov 5 - Lecture 31
Phylogenetics – Parsimony and ML
• Chp 11 - pp 142 - 169

Assignments & Announcements
Mon Oct 29 - HW#5
HW#5 = Hands-on exercises with phylogenetics and tree-building software
Due: Mon Nov 5 (not Fri Nov 1 as previously posted)

BCB 544 "Team" Projects
Last week of classes will be devoted to Projects
• Written reports due:
  • Mon Dec 3 (no class that day)
• Oral presentations (20-30 min) will be:
  • Wed-Fri Dec 5, 6, 7
  • 1 or 2 teams will present during each class period
See Guidelines for Projects posted online

BCB 544 Only: New Homework Assignment
544 Extra#2
Due: √ PART 1 - ASAP
     PART 2 - meeting prior to 5 PM Fri Nov 2
Part 1 - Brief outline of Project, email to Drena & Michael
         after response/approval, then:
Part 2 - More detailed outline of project
         Read a few papers and summarize status of problem
         Schedule meeting with Drena & Michael to discuss ideas

Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html
• Nov 2 Fri - BCB Faculty Seminar 2:10 in 102 ScI
  • Bob Jernigan BBMB, ISU
  • Control of Protein Motions by Structure

Chp 10 - Phylogenetics
SECTION IV MOLECULAR PHYLOGENETICS
Xiong: Chp 10 Phylogenetics Basics
• Evolution and Phylogenetics
• Terminology
• Gene Phylogeny vs. Species Phylogeny
• Forms of Tree Representation
• Why Finding a True Tree is Difficult
• Procedure of Building a Phylogenetic Tree

Tree Building Procedure
• Choose molecular markers
• Perform MSA
• Choose a model of evolution
• Determine tree building method
• Assess tree reliability

Choice of Molecular Markers
• Very closely related organisms: use nucleic acid sequences, which show more differences than protein sequences
• Individuals within a species: use noncoding regions of mtDNA, which mutate faster
• More distantly related species: use slowly evolving nucleic acid sequences, such as ribosomal RNA, or protein sequences
• Very distantly related species: use highly conserved protein sequences

Multiple Sequence Alignment
• Most critical step in tree building: a correct tree cannot be built without a correct alignment
• Build alignments with multiple programs, then inspect and compare them to identify the most reasonable one
• Most alignments need manual editing
  • Make sure important functional residues align
  • Align secondary structure elements
• Use the full alignment or just parts of it

Automatic Editing of Alignments
• Rascal and NorMD: correct alignment errors, remove potentially unrelated or highly divergent sequences
• Gblocks: detect and eliminate poorly aligned positions and divergent regions

How do we measure divergence between sequences?
• Simple measure: count the number of substitutions observed between the sequences in the MSA
• Problem: the number of observed substitutions may not represent the number of evolutionary events that actually occurred

Multiple Substitutions
[Figure: nucleotide substitution diagram; labels C, A, T, G, A]
Seeing only one difference does not mean that only one evolutionary event occurred

Multiple Substitutions
[Figure: nucleotide substitution diagram; labels A, A, T, G, A]
Seeing no difference does not mean that no evolutionary events occurred

Choosing Substitution Models
• Statistical models of evolution are used to correct for the multiple substitution problem
• Focus on DNA models

Jukes-Cantor Model
• The Jukes-Cantor model assumes all nucleotides are substituted with equal probability
• Can be used to correct for multiple substitutions

Many Other Models
[Figure not recovered]

Evolutionary Models for Protein Sequences
• PAM and JTT substitution matrices already take multiple substitutions into account
• There are also models similar to Jukes-Cantor for protein sequences

What about differences in mutation rates between positions within a sequence?
• One of our assumptions was that all positions in a sequence evolve at the same rate
• Bad assumption
  • The third position in a codon changes with higher frequency
  • In proteins, some amino acids can change and others cannot
• This variation is called among-site rate heterogeneity
• Many tree building programs have parameters meant to deal with this problem, which adds to the complexity of getting the correct tree

Chp 11 – Phylogenetic Tree Construction Methods and Programs
SECTION IV MOLECULAR PHYLOGENETICS
Xiong: Chp 11 Phylogenetic Tree Construction Methods and Programs
• Distance-Based Methods
• Character-Based Methods
• Phylogenetic Tree Evaluation
• Phylogenetic Programs

Tree Construction
• Two main categories of tree building methods
  • Distance-based
    • Overall similarity between sequences
  • Character-based
    • Consider the entire MSA

Distance-Based Methods
• Given an MSA and an evolutionary model, calculate the distance between all pairs of sequences
• Construct a distance matrix
• Construct a phylogenetic tree based on the distance matrix

Distance Matrices
      a     b     c     d
a     0     6     7    14
b           0     3    10
c                 0     9
d                       0
[Figure: tree for taxa a, b, c, d drawn against a distance scale from 0 to 8]

Distance-Based Methods
• Two ways to construct a tree based on a distance matrix
  • Clustering
  • Optimality

Clustering-Based Methods
• E.g., UPGMA and Neighbor-Joining
• A cluster is a set of taxa
• Interspecies distances translate into intercluster distances
• Clusters are repeatedly merged
  • "Closest" clusters are merged first
  • Distances are recomputed after merging

UPGMA
• UPGMA: Unweighted Pair Group Method Using Arithmetic Average
• Uses the molecular clock assumption: all taxa evolve at a constant rate and are equally distant from the root (ultrametric tree)
• This assumption is usually wrong
• So why use UPGMA?
  • Very fast

UPGMA Example
[Figures: worked UPGMA example, not recovered]

Neighbor Joining
• Idea: find a pair of taxa that are close to each other but far from other taxa
• Implicitly finds a pair of neighboring taxa
• No molecular clock assumption

Neighbor Joining
• NJ corrects for unequal evolutionary rates between sequences by using a conversion step
• The conversion step requires calculation of "r-values" and "transformed r-values"

Neighbor Joining
The r-value for a sequence is:
$r_i = \sum_{j \neq i} d_{ij}$
The sum of the distances between sequence i and all other sequences

Neighbor Joining
The transformed r-value for a sequence is:
$r'_i = \frac{r_i}{n - 2}$
where n is the number of taxa
Transformed r-values are used to determine the distance of a taxon to the nearest node

Neighbor Joining
The converted distance between two sequences is:
$d'_{ij} = d_{ij} - \frac{1}{2}(r_i + r_j)$
These converted distances are used in building the tree

Neighbor Joining
The final equation we need is for computing the distance from a new cluster to each taxon.
Assume taxa i and j were merged into a cluster u. The distance from taxon i to cluster u is:
$d_{iu} = \frac{d_{ij} + (r'_i - r'_j)}{2}$

Neighbor Joining Example
      A       B       C
B   0.40
C   0.35    0.45
D   0.60    0.70    0.55

Neighbor Joining Example
• Initialize the tree as a star shape with all taxa connected to the center
• Step 1: Compute r-values and transformed r-values for all taxa
$r_A = d_{AB} + d_{AC} + d_{AD} = 0.4 + 0.35 + 0.6 = 1.35$
$r'_A = \frac{r_A}{4 - 2} = \frac{1.35}{2} = 0.675$

Neighbor Joining Example
• Step 2: Compute converted distances
$d'_{AB} = d_{AB} - \frac{1}{2}(r_A + r_B) = 0.4 - \frac{1}{2}(1.35 + 1.55) = -1.05$

Neighbor Joining Example
• Step 3: Fill out the converted distance matrix
      A        B        C
B   -1.05
C   -1       -1
D   -1       -1       -1.05

Neighbor Joining Example
• Step 4: Create a node by merging the closest taxa
• In this example, the converted distance between A and B is the same as the converted distance between C and D
• We can pick either pair to start with
• Let's pick A and B and create a node called U
[Figure: star tree with A and B joined at the new node U, still connected to C and D; branch lengths not yet known]
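
Steps 1 to 4 of the example can be checked with a few lines of Python. This is a minimal sketch and not part of the original slides: the function names (`r_values`, `transformed_r_values`, `converted_distances`) and the dictionary layout are assumptions chosen for illustration, and the converted-distance formula is the one given on the slides, applied here only to the four-taxon example.

```python
from itertools import combinations

# Pairwise distances from the example matrix (taxa A, B, C, D).
dist = {
    ("A", "B"): 0.40,
    ("A", "C"): 0.35,
    ("A", "D"): 0.60,
    ("B", "C"): 0.45,
    ("B", "D"): 0.70,
    ("C", "D"): 0.55,
}
taxa = ["A", "B", "C", "D"]


def d(i, j):
    """Look up a symmetric distance; d(i, i) is 0."""
    if i == j:
        return 0.0
    return dist[(i, j)] if (i, j) in dist else dist[(j, i)]


def r_values(taxa):
    """Step 1: r_i is the sum of distances from taxon i to every other taxon."""
    return {i: sum(d(i, j) for j in taxa if j != i) for i in taxa}


def transformed_r_values(r, n):
    """Step 1: r'_i = r_i / (n - 2), where n is the number of taxa."""
    return {i: r_i / (n - 2) for i, r_i in r.items()}


def converted_distances(taxa, r):
    """Steps 2-3: d'_ij = d_ij - (r_i + r_j) / 2 for every pair of taxa."""
    return {(i, j): d(i, j) - (r[i] + r[j]) / 2 for i, j in combinations(taxa, 2)}


r = r_values(taxa)
r_prime = transformed_r_values(r, len(taxa))
converted = converted_distances(taxa, r)

# Step 4: the pair with the smallest converted distance is merged first.
# Ties are detected with a small tolerance because of floating-point rounding.
smallest = min(converted.values())
candidates = [pair for pair, value in converted.items() if abs(value - smallest) < 1e-9]

for pair, value in converted.items():
    print(pair, round(value, 3))
print("candidate pairs to merge:", candidates)
```

The script reports A-B and C-D as tied candidates, which is why the example is free to pick either pair. How a tie is broken is left to the caller here; real programs may use a fixed or random rule.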
Neighbor Joining Example
• Step 5: Compute branch lengths
• Use the equation for computing the distance from a taxon to a node
$d_{AU} = \frac{d_{AB} + (r'_A - r'_B)}{2} = \frac{0.4 + (0.675 - 0.775)}{2} = 0.15$
$d_{BU} = d_{AB} - d_{AU} = 0.4 - 0.15 = 0.25$
[Figure: node U joined to A by a branch of length 0.15 and to B by a branch of length 0.25]

Neighbor Joining Example
• Step 6: Construct the reduced distance matrix by computing the distance from each remaining taxon to the new node U
• In UPGMA, we simply calculated the average
$d_{CU} = \frac{(d_{AC} - d_{UA}) + (d_{BC} - d_{UB})}{2} = \frac{(0.35 - 0.15) + (0.45 - 0.25)}{2} = 0.2$

Neighbor Joining Example
Our reduced distance matrix:
      U       C
C   0.20
D   0.45    0.55

Neighbor Joining Example
• From here, we go back to step 1
• Continue until all taxa have been decomposed from the star tree

Optimality-Based Methods
• Clustering methods produce a single tree, with no way to judge how good it is compared to alternative tree topologies
• Optimality-based methods compare all possible tree topologies and select the tree that best fits the distance matrix
• Two algorithms:
  • Fitch-Margoliash
  • Minimum evolution

Fitch-Margoliash
• Selects the best tree among all possible trees based on the minimum deviation between the distances calculated in the tree and the distances in the distance matrix
• Basically, a least squares method
  • $D_{ij}$ = distance between i and j in the matrix
  • $d_{ij}$ = distance between i and j in the tree
• Objective: find the tree that minimizes
$\sum_{1 \le i < j \le n} (D_{ij} - d_{ij})^2$

Minimum Evolution
• Similar to Fitch-Margoliash, but uses a different optimality criterion
• Searches for the tree with the minimum total branch length
• This is an indirect way of achieving the best fit of the branch lengths with the original data

Summary of Distance-Based Methods
• Clustering-based methods:
  • Computationally very fast and can handle large datasets that other methods cannot
  • Not guaranteed to find the best tree
• Optimality-based methods:
  • Better overall accuracy
  • Computationally slow
• All distance-based methods reduce the alignment to pairwise distances, so the character-level sequence information is lost and the most likely state at an internal node cannot be inferred
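
The two optimality criteria described above can be made concrete by scoring a single candidate tree. The following Python sketch is illustrative only and not part of the original slides. It uses the neighbor-joining example matrix as $D_{ij}$ and the unweighted sum of squares as written on the Fitch-Margoliash slide. The branch lengths for A-U and B-U come from step 5 of the example; the second internal node `V` and the other three branch lengths are assumptions, obtained by continuing the example by hand. The function names are hypothetical.

```python
# Matrix distances D_ij (the neighbor-joining example matrix).
D = {
    ("A", "B"): 0.40, ("A", "C"): 0.35, ("A", "D"): 0.60,
    ("B", "C"): 0.45, ("B", "D"): 0.70, ("C", "D"): 0.55,
}

# Candidate tree as undirected edges: (node, node, branch length).
# A-U and B-U are taken from step 5 of the example; the remaining three
# lengths and the internal node V are ASSUMED, derived by hand.
edges = [
    ("A", "U", 0.15), ("B", "U", 0.25),
    ("U", "V", 0.05), ("C", "V", 0.15), ("D", "V", 0.40),
]


def build_adjacency(edges):
    """Map each node to a list of (neighbor, branch length) pairs."""
    adjacency = {}
    for a, b, length in edges:
        adjacency.setdefault(a, []).append((b, length))
        adjacency.setdefault(b, []).append((a, length))
    return adjacency


def path_length(adjacency, start, goal):
    """Sum branch lengths along the unique path between two nodes of a tree."""
    stack = [(start, None, 0.0)]  # (node, parent, distance so far)
    while stack:
        node, parent, so_far = stack.pop()
        if node == goal:
            return so_far
        for neighbor, length in adjacency[node]:
            if neighbor != parent:  # never walk back along the edge just used
                stack.append((neighbor, node, so_far + length))
    raise ValueError(f"no path from {start} to {goal}")


def least_squares_score(D, edges):
    """Least-squares criterion: sum over all pairs of (D_ij - d_ij)^2."""
    adjacency = build_adjacency(edges)
    return sum((D_ij - path_length(adjacency, i, j)) ** 2 for (i, j), D_ij in D.items())


def total_branch_length(edges):
    """Minimum-evolution criterion: the total length of all branches."""
    return sum(length for _, _, length in edges)


print("least-squares score:", round(least_squares_score(D, edges), 6))
print("total branch length:", round(total_branch_length(edges), 6))
```

An optimality-based search would compute such a score for every candidate topology, after fitting branch lengths for each one, and keep the tree with the lowest value. The sketch covers only the scoring step, not the search or the branch-length fitting.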