Building phylogenetic trees

Contents
- Phylogeny
- Phylogenetic trees
- How to make a phylogenetic tree from pairwise distances
- UPGMA method (+ an example)
- Neighbor-Joining method (+ an example)
- Comparison of methods
- Conclusion

Phylogeny
- Phylogeny is the evolutionary history of related species or genes
- Phylogenetic tree: a diagram showing the evolutionary lineages of species or genes
- The history of a gene and the history of a species may be very different
- Genes can be homologous or analogous and still resemble each other
- Homologous sequences can be divided into two classes
  - Orthologous sequences diverged by speciation from a common ancestor
  - Paralogous sequences evolved by gene duplication within a species
- Analogous sequences may look and function very similarly, but they do not share a common ancestor
- WHEN WE WANT TO EXPLORE EVOLUTIONARY RELATIONSHIPS, WE NEED TO WORK WITH ORTHOLOGOUS SEQUENCES

[Diagram: genes are either homologous (orthologous or paralogous) or analogous]

Phylogenetic trees
WHY construct a phylogenetic tree?
- to understand the lineage of various species
- to understand how various functions evolved
- to inform multiple alignments

- Trees can be rooted (a common ancestor is known) or unrooted
- Leaves are the terminal nodes; they correspond to the observed sequences of genes or species (A, B, C, D)
- Internal nodes are hypothetical ancestral nodes
- All trees are assumed to be binary: an edge that branches splits into exactly two daughter edges
- Each edge carries a certain amount of evolutionary divergence, defined by some measure of distance between sequences or by a model of residue substitution over the course of evolution

[Figure: example tree of HRV sequences (HRV1A to HRV100)]

Phylogenetic trees
Different ways to represent a phylogenetic tree (illustrated by Treeview)

[Figure: the same HRV tree drawn in several Treeview layouts; scale bar 0.1]

Different algorithms used to infer phylogeny from sequence data:
1. Distance methods
2. Parsimony
3. Likelihood
4. Probabilistic methods
5. Phylogenetic invariants

Route from the molecular sequences to the phylogenetic tree
Distance methods:
- Select a set of related (orthologous) nucleotide or amino acid sequences
- Perform a multiple sequence alignment (the Clustal series is widely used)
- Calculate the pairwise distances between the sequences using a chosen evolutionary model of substitution (the distances describe the evolution: the smaller the distance between two sequences, the more closely they are related)
- Select the most suitable algorithm to infer the phylogeny
- View the tree with a tree-viewing program (Treeview, NJPlot, ...)

[Figure: Hamming distance]

Making a tree from pairwise distances
- The distance $d_{ij}$ between each pair of sequences i and j in the dataset is calculated
- Different ways of defining distances
  - For nucleotide sequences: Jukes-Cantor, Kimura 2-parameter (K2P), HKY (Hasegawa-Kishino-Yano), F84, Tamura-Nei, general time-reversible model, general 12-parameter model
  - For amino acid sequences: PAM matrices, BLOSUM matrices

Example distance matrix:

       A    B    C    D
  A    0   32   44   46
  B   32    0   29   43
  C   44   29    0   30
  D   46   43   30    0

Distance matrix methods
- UPGMA: algorithm introduced by Sokal and Michener 1958
- Neighbor-Joining: algorithm introduced by Saitou and Nei 1987, modified by Studier and Keppler 1988

Clustering method: UPGMA
- UPGMA = Unweighted pair group method using arithmetic averages
- A simple method
- It works by clustering the sequences: at each stage two clusters are merged and a new node is created on the tree
- The method assumes an equal rate of evolutionary change along all branches (molecular clock assumption)

UPGMA
[Figure: rooted UPGMA tree with leaves A, C, B, D]
- UPGMA produces a rooted tree
- Branch lengths satisfy a molecular clock: the divergence of sequences is assumed to occur at the same constant rate at all points in the tree
- Clocklike trees are rooted, and the total branch length from the root to any leaf is equal
- Such trees are often called ultrametric
- Distances are ultrametric if, for every triple of sequences i, j, k, either all three distances are equal, $d_{ij} = d_{ik} = d_{jk}$, or two of them are equal and the third is smaller, $d_{jk} < d_{ij} = d_{ik}$
- UPGMA is guaranteed to build the correct tree if the distances are ultrametric
- The method can be used for reconstructing phylogenies if evolutionary rates are assumed to be the same in all lineages; this assumption has been criticised in the phylogeny literature
- Suitable for closely related species
- Running time $O(n^2)$

Algorithm: UPGMA
Initialisation:
- Assign each sequence i in the dataset to its own cluster $C_i$
- Define one leaf of T for each sequence, and place it at height zero
Iteration:
- Find the two clusters i and j for which $d_{ij}$ is the smallest (pick randomly if several distances are equal)
- Define a new cluster ij by $C_{ij} = C_i \cup C_j$. Cluster ij has $n_{ij} = n_i + n_j$ members (initially $n_i = 1$)
- Connect i and j on the tree to a new node v
- Place the new node v at height $d_{ij}/2$ (when i and j are leaves, the branches from v to i and to j each have length $d_{ij}/2$)

Algorithm: UPGMA (cont.)
Iteration (cont.)
- Compute the distances between the new cluster and each remaining cluster k using

  $$d_{(ij),k} = \frac{n_i}{n_i + n_j}\, d_{ik} + \frac{n_j}{n_i + n_j}\, d_{jk}$$

- Add ij to the current clusters and remove i and j
Termination:
- When only two clusters i and j remain, place the root at height $d_{ij}/2$

An example UPGMA (1)
Distance matrix (arbitrary) for four items (sequences) A, B, C and D. These distances are not ultrametric: for the triple A, B, C, for example, $d_{AC} = 7$, $d_{AB} = 8$ and $d_{BC} = 9$, so the two largest distances are not equal.

       A    B    C    D
  A    0    8    7   12
  B    8    0    9   14
  C    7    9    0   11
  D   12   14   11    0

Step 1. Find the smallest distance $d_{ij}$ between two clusters: it is $d_{AC} = 7$, between A and C

An example UPGMA (2)
Step 2. Define the new cluster ij, which has $n_{ij} = n_i + n_j$ members (initially $n_i = 1$). The new cluster AC has $n_{AC} = n_A + n_C = 2$ members
Step 3. Connect A and C on the tree to a new node v1
Step 4. The branch lengths from the new node v1 to A and to C are

  $$\frac{d_{AC}}{2} = \frac{7}{2} = 3.5$$

[Figure: A and C joined at node v1, each branch of length 3.5]

An example UPGMA (3)
Step 5. Compute the distances between the new cluster AC and the remaining clusters (B and D):

  $$d_{(AC),B} = \frac{n_A}{n_A + n_C}\, d_{AB} + \frac{n_C}{n_A + n_C}\, d_{CB} = \frac{1 \cdot 8}{2} + \frac{1 \cdot 9}{2} = 8.5$$

  $$d_{(AC),D} = \frac{n_A}{n_A + n_C}\, d_{AD} + \frac{n_C}{n_A + n_C}\, d_{CD} = \frac{1 \cdot 12}{2} + \frac{1 \cdot 11}{2} = 11.5$$

Step 6. Delete the columns and rows of the distance matrix that correspond to clusters A and C, and add a column and a row for cluster AC

New distance matrix:

        AC     B      D
  AC    0     8.5   11.5
  B            0    14
  D                  0

An example UPGMA (4)
2nd iteration
Step 1. Find the two clusters i and j for which $d_{ij}$ is the smallest (pick randomly if several distances are equal): AC and B, with $d_{(AC),B} = 8.5$
Step 2. Define the new cluster ij, which has $n_{ij} = n_i + n_j$ members. The new cluster ACB has $n_{ACB} = n_{AC} + n_B = 2 + 1 = 3$ members
Step 3. Connect AC and B on the tree to a new node v2
Step 4. Place the new node v2 at height

  $$\frac{d_{(AC),B}}{2} = \frac{8.5}{2} = 4.25$$

so the branch from v2 to B has length 4.25 and the branch from v2 to v1 has length 4.25 - 3.5 = 0.75

[Figure: partial tree with A and C at 3.5 below v1, and B at 4.25 below v2]

An example UPGMA (5)
Step 5. Compute the distance between the new cluster ACB and the remaining cluster (D):

  $$d_{(ACB),D} = \frac{n_{AC}}{n_{AC} + n_B}\, d_{(AC),D} + \frac{n_B}{n_{AC} + n_B}\, d_{BD} = \frac{2}{3} \cdot 11.5 + \frac{1}{3} \cdot 14 = 12.33$$

Step 6. Delete the columns and rows of the distance matrix that correspond to clusters AC and B, and add a column and a row for cluster ACB

New distance matrix:

         ACB     D
  ACB     0    12.33
  D              0

An example UPGMA (6)
Termination: only two clusters (ACB and D) remain. Place the root at height

  $$\frac{d_{(ACB),D}}{2} = \frac{12.33}{2} = 6.17$$

Original distance matrix and final phylogenetic tree (including the branch lengths):

       A    B    C    D
  A    0    8    7   12
  B         0    9   14
  C              0   11
  D                   0

[Figure: final UPGMA tree. Leaf branches: A 3.5, C 3.5, B 4.25, D 6.17. Internal branches: v1 to v2 0.75, v2 to root 1.92]

Neighbor-Joining (N-J)
[Figure: unrooted tree with leaves A, B, C, D]
- Another algorithm that works by clustering the sequences
- Does not assume a molecular clock
- N-J trees are unrooted
- N-J assumes additivity
- Def. Edge lengths are said to be additive if the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them
- The method uses an approximate algorithm: the tree is built by finding a pair of neighboring leaves i and j that minimize the length of the tree
- Finally, the neighboring leaves are joined
- Running time: $O(n^2)$ work per iteration, $O(n^3)$ in total for the algorithm as described here

Algorithm: Neighbor-Joining
Initialisation:
- Define T to be the set of leaf nodes, one for each given sequence
Iteration:
- Compute, for each sequence i,

  $$u_i = \sum_{j \ne i} \frac{d_{ij}}{n - 2}$$

  where n is the number of sequences in the current distance matrix
- Pick the pair i and j for which $d_{ij} - u_i - u_j$ is the smallest (pick randomly if several are equal)
- Join items i and j with a new node v
- Compute the branch lengths from the new node v to items i and j: $d_{iv} = \frac{d_{ij}}{2} + \frac{u_i - u_j}{2}$ and $d_{jv} = \frac{d_{ij}}{2} + \frac{u_j - u_i}{2}$
- Compute the distances between the new node v and each remaining item k: $d_{vk} = \frac{d_{ik} + d_{jk} - d_{ij}}{2}$
- Remove i and j from the distance matrix and replace them by the new node v
Termination:
- When only two items i and j remain, add the remaining edge between i and j, with length $d_{ij}$

An example N-J (1)
Step 1. Compute $u_i = \sum_{j \ne i} \frac{d_{ij}}{n - 2}$ for each row in the distance matrix
Step 2. Compute $d_{ij} - (u_i + u_j)$ for each pair (the lower-diagonal matrix) and choose the smallest (most negative)

       A    B    C    D    Step 1: u_i
  A    0    8    7   12    (8+7+12)/(4-2) = 13.5
  B    8    0    9   14    (8+9+14)/(4-2) = 15.5
  C    7    9    0   11    (7+9+11)/(4-2) = 13.5
  D   12   14   11    0    (12+14+11)/(4-2) = 18.5

  A-B:  8 - (13.5 + 15.5) = -21
  A-C:  7 - (13.5 + 13.5) = -20
  B-C:  9 - (15.5 + 13.5) = -20
  A-D: 12 - (13.5 + 18.5) = -20
  B-D: 14 - (15.5 + 18.5) = -20
  C-D: 11 - (13.5 + 18.5) = -21

The pairs A-B and C-D tie at -21; A-B is chosen here.

An example N-J (2)
Step 3. Join A and B together with a new node v1. Compute the edge lengths from A to node v1 and from B to node v1:

  $$d_{A v_1} = \frac{d_{AB}}{2} + \frac{u_A - u_B}{2} = \frac{8}{2} + \frac{13.5 - 15.5}{2} = 3$$

  $$d_{B v_1} = \frac{d_{AB}}{2} + \frac{u_B - u_A}{2} = \frac{8}{2} + \frac{15.5 - 13.5}{2} = 5$$

[Figure: A and B joined at node v1 with edge lengths 3 and 5]

Step 4. Compute the distances between the new node v1 and the remaining items (C and D):

  $$d_{(AB),C} = \frac{d_{AC} + d_{BC} - d_{AB}}{2} = \frac{7 + 9 - 8}{2} = 4$$

  $$d_{(AB),D} = \frac{d_{AD} + d_{BD} - d_{AB}}{2} = \frac{12 + 14 - 8}{2} = 9$$

An example N-J (3)
Step 5. Delete A and B from the distance matrix and replace them by the new item AB
Step 6. Continue from step 1, because more than two items remain

New reduced distance matrix, with step 1 (now n = 3) repeated:

        AB    C    D    Step 1: u_i
  AB     0    4    9    (4+9)/1 = 13
  C      4    0   11    (4+11)/1 = 15
  D      9   11    0    (9+11)/1 = 20

Step 2. Compute $d_{ij} - (u_i + u_j)$ and choose the smallest (the lower-diagonal matrix):

  AB-C:  4 - (13 + 15) = -24
  AB-D:  9 - (13 + 20) = -24
  C-D:  11 - (15 + 20) = -24

All three pairs tie at -24; AB-C is chosen here.

An example N-J (4)
Step 3. Join v1 and C together with a new node v2. Compute the edge lengths from v1 to node v2 and from C to node v2:

  $$d_{v_1 v_2} = \frac{d_{(AB),C}}{2} + \frac{u_{AB} - u_C}{2} = \frac{4}{2} + \frac{13 - 15}{2} = 1$$

  $$d_{C v_2} = \frac{d_{(AB),C}}{2} + \frac{u_C - u_{AB}}{2} = \frac{4}{2} + \frac{15 - 13}{2} = 3$$

[Figure: A (3) and B (5) joined at v1; v1 (1) and C (3) joined at v2]

Step 4. Compute the distance between the new node v2 and the remaining item (D):

  $$d_{(ABC),D} = \frac{d_{(AB),D} + d_{CD} - d_{(AB),C}}{2} = \frac{9 + 11 - 4}{2} = 8$$

An example N-J (5)
Step 5. Delete AB and C from the distance matrix and replace them by ABC

         ABC    D
  ABC     0     8
  D             0

Step 6. Only two nodes remain: connect them with an edge of length 8

Original distance matrix and final phylogenetic tree (including the edge lengths):

       A    B    C    D
  A    0    8    7   12
  B         0    9   14
  C              0   11
  D                   0

[Figure: final unrooted N-J tree. Edges: A to v1 3, B to v1 5, v1 to v2 1, C to v2 3, v2 to D 8]

Comparison
UPGMA
- The total branch length from the root to any leaf is equal
- Produces a rooted tree, where the root is the hypothesized ancestor of the sequences in the tree
- Suitable for closely related sequences
- Can be used to infer phylogenies if one can assume that evolutionary rates are the same in all lineages
[Figure: UPGMA tree from the example, with branch lengths 3.5, 3.5, 4.25 and 6.17]

Neighbor-joining
- Produces an unrooted tree, where the direction of evolution is unknown
- Suitable for datasets with largely varying rates of evolution
- Suitable for large datasets
[Figure: N-J tree from the example, with edge lengths 3, 5, 1, 3 and 8]

Conclusion
- The UPGMA method constructs a rooted phylogenetic tree correctly if there is a molecular clock with a constant rate of mutation
- The UPGMA method is rarely used, because the molecular clock assumption is not generally true: selection pressures vary across time periods, across organisms, across genes within an organism, and across regions within a gene
- The N-J method produces an unrooted tree without the molecular clock hypothesis
- The N-J method is one of the most popular methods and is widely used by molecular evolutionists
- Distance methods are strongly dependent on the model of evolution used
- Sequence information is reduced when sequence data are transformed into distances
- Distance methods are computationally fast

Reference
- Durbin, R., Eddy, S., Krogh, A., Mitchison, G. 2003. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press.
- Li, W. 1997. Molecular Evolution. Sinauer Associates, Sunderland, MA. p. 108
- Felsenstein, J. 2003. Inferring Phylogenies. Sinauer Associates, Sunderland, MA. p. 147-170

Examples of phylogeny programs
Multiple sequence alignment
- Clustal series (W, V) (free, http://www-igbmc.ustrasbg.fr/BioInfo/ClustalX/Top.html )
Phylogeny packages
- PAUP (http://paup.csit.fsu.edu/ )
- Phylip (free, http://evolution.gs.washington.edu)
- MEGA (free, http://www.megasoftware.net)
Viewing/plotting phylogenetic trees
- Treeview (free, http://taxonomy.zoology.gla.ac.uk/rod/treeview.html)
- NJPlot (free, http://pbil.univ-lyon1.fr/software/njplot.html)

Further reading
- N-J: Saitou, N. and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4(4): 406-25.
- N-J: Studier, J. A. and K. J. Keppler. 1988. A note on the neighbor-joining algorithm of Saitou and Nei. Mol Biol Evol 5(6): 729-31.
- UPGMA: Michener, C. D., and R. R. Sokal. 1957. A quantitative approach to a problem in classification. Evolution 11: 130-162.
- ClustalX: Thompson, J. D., T. J. Gibson, et al. 1997. The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res 25(24): 4876-82.
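
Sketch: UPGMA in Python
The UPGMA iteration described in this deck (find the closest pair, merge, place the new node at half the distance, recompute size-weighted averages) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `upgma`, the `frozenset`-keyed distance dictionary, and the nested-tuple cluster names are all choices made here for illustration, and ties are broken by iteration order rather than randomly.

```python
from itertools import combinations


def upgma(labels, dist):
    """Cluster `labels` with UPGMA.

    `dist` maps frozenset({a, b}) -> distance for every pair of labels.
    Returns the merges in order as (cluster_i, cluster_j, node_height).
    Merged clusters are named by the nested tuple (i, j) (an illustrative choice).
    """
    size = {name: 1 for name in labels}  # n_i for every current cluster
    d = dict(dist)                       # working copy of the distances
    merges = []
    while len(size) > 1:
        # Closest pair of current clusters (first minimum wins on ties).
        i, j = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((i, j))] / 2
        new = (i, j)
        # Size-weighted average distance from the merged cluster to the others.
        for k in size:
            if k not in (i, j):
                d[frozenset((new, k))] = (
                    size[i] * d[frozenset((i, k))]
                    + size[j] * d[frozenset((j, k))]
                ) / (size[i] + size[j])
        size[new] = size.pop(i) + size.pop(j)
        merges.append((i, j, height))
    return merges


# The four-sequence example matrix used in the UPGMA slides.
labels = ["A", "B", "C", "D"]
matrix = [[0, 8, 7, 12], [8, 0, 9, 14], [7, 9, 0, 11], [12, 14, 11, 0]]
dist = {
    frozenset((labels[a], labels[b])): matrix[a][b]
    for a, b in combinations(range(len(labels)), 2)
}

for left, right, height in upgma(labels, dist):
    print(left, right, round(height, 2))
```

On the example matrix the three merges come out at heights 3.5, 4.25 and 6.17 (rounded), matching the worked example; the branch lengths 0.75 and 1.92 are the differences between consecutive node heights.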
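
Sketch: Neighbor-Joining in Python
The N-J iteration described in this deck (compute $u_i$, pick the pair minimising $d_{ij} - u_i - u_j$, join, compute the two edge lengths, reduce the matrix) can be sketched as follows. This is a hedged illustration only: the function name `neighbor_joining`, the `frozenset`-keyed distance dictionary, and the tuple names for internal nodes are assumptions made here, and ties are resolved by taking the first minimum instead of a random one.

```python
from itertools import combinations


def neighbor_joining(labels, dist):
    """Return the edges (node, node, length) of an unrooted N-J tree.

    `dist` maps frozenset({a, b}) -> distance for every pair of labels.
    Internal nodes are named by the tuple (i, j) of the items they join
    (an illustrative naming choice).
    """
    d = dict(dist)
    nodes = list(labels)
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        # u_i: sum of distances from i to all other items, divided by n - 2.
        u = {
            i: sum(d[frozenset((i, k))] for k in nodes if k != i) / (n - 2)
            for i in nodes
        }
        # Pair minimising d_ij - u_i - u_j (first minimum wins on ties).
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: d[frozenset(p)] - u[p[0]] - u[p[1]],
        )
        dij = d[frozenset((i, j))]
        v = (i, j)
        edges.append((i, v, dij / 2 + (u[i] - u[j]) / 2))
        edges.append((j, v, dij / 2 + (u[j] - u[i]) / 2))
        # Distances from the new node v to every remaining item.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((v, k))] = (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                ) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [v]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))  # last remaining edge
    return edges


# The four-sequence example matrix used in the N-J slides.
labels = ["A", "B", "C", "D"]
matrix = [[0, 8, 7, 12], [8, 0, 9, 14], [7, 9, 0, 11], [12, 14, 11, 0]]
dist = {
    frozenset((labels[a], labels[b])): matrix[a][b]
    for a, b in combinations(range(len(labels)), 2)
}

for a, b, length in neighbor_joining(labels, dist):
    print(a, b, length)
```

With three items left, all candidate pairs tie, so this sketch happens to join C and D where the slides join AB and C. The resulting unrooted tree is the same: leaf edges A 3, B 5, C 3, D 8 and one internal edge of length 1.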
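
Sketch: checking the ultrametric condition in Python
The ultrametric definition in this deck is a condition on every triple of sequences: the two largest of the three distances must be equal. A small check of that condition is sketched below, assuming the same `frozenset`-keyed distance dictionary as a convenient representation; the names `is_ultrametric`, `to_dist` and the tolerance `tol` are illustrative, not taken from the slides.

```python
from itertools import combinations


def is_ultrametric(labels, dist, tol=1e-9):
    """Three-point condition: in every triple the two largest distances are equal."""
    for i, j, k in combinations(labels, 3):
        _, middle, largest = sorted(
            (dist[frozenset((i, j))], dist[frozenset((i, k))], dist[frozenset((j, k))])
        )
        if abs(largest - middle) > tol:
            return False
    return True


def to_dist(labels, matrix):
    """Turn a square matrix into a frozenset-keyed pairwise distance dict."""
    return {
        frozenset((labels[a], labels[b])): matrix[a][b]
        for a, b in combinations(range(len(labels)), 2)
    }


labels = ["A", "B", "C", "D"]

# The example matrix from the slides.
example = to_dist(labels, [[0, 8, 7, 12], [8, 0, 9, 14], [7, 9, 0, 11], [12, 14, 11, 0]])

# Distances read off the final UPGMA tree: twice the height of the node
# at which two leaves first meet (3.5, 4.25 and 37/6).
t = 37 / 3
tree = to_dist(labels, [[0, 8.5, 7, t], [8.5, 0, 8.5, t], [7, 8.5, 0, t], [t, t, t, 0]])

print(is_ultrametric(labels, example))  # False
print(is_ultrametric(labels, tree))     # True
```

The example matrix fails the check (for A, B, C the distances are 7, 8, 9), while the distances implied by the UPGMA tree pass it, which illustrates how UPGMA forces a clocklike tree onto data that need not be clocklike.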