Phylogeny
Presented by Dr. Shazzad Hosain, Asst. Prof., EECS, NSU

What is phylogenetics?
Phylogenetics is the study of evolutionary relationships among and within species.
[Figure: phylogenetic tree of crocodiles, birds, lizards, snakes, rodents, primates, and marsupials. This is an example of a phylogenetic tree.]

Applications of phylogenetics
• Forensics: Did a patient’s HIV infection result from an invasive dental procedure performed by an HIV+ dentist?
• Conservation: How much gene flow is there among local populations of island foxes off the coast of California?
• Medicine: What are the evolutionary relationships among the various prion-related diseases?

Phylogenetic concepts: Interpreting a Phylogeny
[Figure: tree of Sequence A to Sequence E drawn against a time axis]
Which sequence is most closely related to B? A, because B diverged from A more recently than from any other sequence.
Physical position in the tree is not meaningful! Only the tree structure matters.

Phylogenetic concepts: Rooted and Unrooted Trees
[Figure: rooted and unrooted trees of sequences A, B, C, and D; in the unrooted tree the position of the root is unknown]

Rooting and Tree Interpretation
[Figure: trees of chicken, human, fruit fly, oak, bacteria, and archaebacteria under different rootings, annotated with gains (+) and losses (–) of bones and cell nuclei]

Rooting Methods
Outgroup rooting of a network of relationships
Given an unrooted network of relationships among four species of Carnivora (figure, left), outgroup rooting uses an additional taxon (the outgroup) known from independent evidence to be less closely related to any of the other species (the ingroup) than they are to each other. The root is then placed on the branch between the outgroup and the ingroup. In this case, Lynx is a feloid carnivore in a separate superfamily from the four canoid carnivores.
Inclusion of Lynx in the network analysis places it on the internode. This method requires accurate information about ingroup/outgroup relationships.

How Many Trees? (assuming bifurcation only)

                                      Unrooted trees              Rooted trees
# sequences   # pairwise distances    # trees       # branches    # trees       # branches
                                                    /tree                       /tree
3             3                       1             3             3             4
4             6                       3             5             15            6
5             10                      15            7             105           8
6             15                      105           9             945           10
10            45                      2,027,025     17            34,459,425    18
30            435                     8.69 × 10^36  57            4.95 × 10^38  58

For N sequences:
• # pairwise distances: $N(N-1)/2$
• Unrooted trees: $\frac{(2N-5)!}{2^{N-3}\,(N-3)!}$ trees, each with $2N-3$ branches
• Rooted trees: $\frac{(2N-3)!}{2^{N-2}\,(N-2)!}$ trees, each with $2N-2$ branches

Tree Properties
• Ultrametricity: all tips are an equal distance from the root.
  [Figure: rooted tree with tips X and Y and branch lengths a to e, where a = b + c + d + e]
• Additivity: the distance between any two tips equals the total branch length between them.
  [Figure: rooted tree with tips X and Y, where XY = a + b + c + d + e]
In simple scenarios, evolutionary trees are ultrametric and phylograms are additive.

Terminology
• External nodes: things under comparison; operational taxonomic units (OTUs)
• Internal nodes: ancestral units; hypothetical; the goal is to group current-day units
• Root: common ancestor of all OTUs under study. The path from the root to a node defines an evolutionary path
• Unrooted: specifies relationships but not the evolutionary path
  – If there is an outgroup (an external reason to believe a certain OTU branched off first), the tree can be rooted
• Topology: branching pattern of a tree
• Branch length: amount of difference that occurred along a branch

Phylogeny Applications
• Tree of Life: analyzing changes that have occurred in the evolution of different organisms
  http://tolweb.org/tree/phylogeny.html
• Phylogenetic relationships among genes can help predict which ones might have similar functions (e.g., ortholog detection)
• Follow changes occurring in rapidly changing species (e.g., HIV)

Phylogeny Packages
• PHYLIP, phylogenetic inference package – evolution.genetics.washington.edu/phylip.html – Felsenstein – Free!
• PAUP, phylogenetic analysis using parsimony – paup.csit.fsu.edu – Swofford

Similarity vs. Homology
• Similar – sequences resemble one another
• Homolog – sequences derived from a common ancestor
• Ortholog – homologous sequences in different species, separated by a speciation event
• Paralog – homologous sequences within a species, separated by a gene duplication event

Ortholog vs. Paralog
• Ortholog – sequence divergence follows speciation – hence can be used for the phylogeny of organisms
• Paralog – gene duplication occurs before speciation – hence not suitable for the phylogeny of organisms

Homoplasy
• Sequence similarity NOT due to common ancestry
• May arise due to parallelism or convergent evolution
• Parallelism or parallel evolution – the development of a similar trait in related, but distinct, species descending from the same ancestor, but from different clades
• Convergent evolution

Parallel evolution
Parallel evolution occurs when two species that have descended from the same ancestor remain similar over long periods of time because they independently acquire the same evolutionary adaptations. It occurs because genetically related species adapt to similar environmental changes in similar ways. After many years, the organisms may still resemble each other, even though they speciated in the distant past.

Convergent evolution
When species from different ancestors colonize the same environment, they may independently acquire the same adaptations. The evolution of species descended from different ancestors to become superficially similar because they are adapting to the same environment is called convergent evolution.

Divergent Evolution

Phylogeny of what?
• Organisms
  – Whole genome phylogeny
  – Ribosomal RNA (surrogate for whole genome)
• Strains (closely related microbes)
• Individual genes (or gene families)
• Repetitive DNA sequences
• Metabolic pathways
• Secondary structures
• Any discrete character(s)
• Human languages
• Microbial communities

Why compute phylogenetic trees?
• Understand evolutionary history
• Map pathogen strain diversity for vaccines
• Assist in epidemiology
  – Of infectious diseases
  – Of genetic defects
• Aid in prediction of function of novel genes
• Biodiversity studies
• Understanding microbial ecologies

Tree Building Exercises

Computational Approaches to Phylogenetic Tree Computation
• Distance Based Methods
  – UPGMA
  – Neighbor joining
• Character State Methods
  – Maximum Parsimony Method
  – Maximum Likelihood Methods
• Tree merging
  – Consensus trees, super-trees

What data is used to build trees?
• Traditionally: morphological features (e.g., number of legs, beak shape, etc.)
• Today: mostly molecular data (e.g., DNA and protein sequences)

Data for Phylogeny
• Can be classified into two categories:
  – Numerical data
    • Distance between objects, e.g., distance(man, mouse) = 500, distance(man, chimp) = 100
    • Usually derived from sequence data
  – Discrete characters
    • Each character has a finite number of states, e.g., number of legs = 1, 2, 4; DNA = {A, C, T, G}

UPGMA: a simple example

2. Determine the evolutionary distances and build a distance matrix

1. AGGCCATGAATTAAGAATAA
2. AGCCCATGGATAAAGAGTAA
3. AGGACATGAATTAAGAATAA
4. AAGCCAAGAATTACGAATAA

Distance matrix
      1      2      3      4
1     -      0.2    0.05   0.15
2            -      0.25   0.35
3                   -      0.2
4                          -

In this example the evolutionary distance is expressed as the fraction of nucleotide positions that differ for each sequence pair. For example, sequences 1 and 2 are 20 nucleotides in length and have four differences, corresponding to an evolutionary distance of 4/20 = 0.2. Sequences 2 and 4 differ at seven positions, giving 7/20 = 0.35.

3. Phylogenetic tree construction example (UPGMA algorithm)
UPGMA (Michener & Sokal 1957)

Dij        Bear   Raccoon   Weasel   Seal
Bear       -      0.26      0.34     0.29
Raccoon           -         0.42     0.44
Weasel                      -        0.44
Seal                                 -

1. Pick the smallest entry Dij. Here it is D(Bear, Raccoon) = 0.26.
2. Join the two intersecting species and assign branch lengths Dij/2 to each of the nodes. Bear and Raccoon each get a branch of length 0.13.

Next, compute new distances from the merged cluster to the other species using arithmetic means:

$D_{W(BR)} = \frac{D_{WB} + D_{WR}}{2} = \frac{0.34 + 0.42}{2} = 0.38$
$D_{S(BR)} = \frac{D_{SB} + D_{SR}}{2} = \frac{0.29 + 0.44}{2} = 0.365$

Dij        BR     Weasel   Seal
BR         -      0.38     0.365
Weasel            -        0.44
Seal                       -

Repeat steps 1 and 2: the smallest entry is D(BR, Seal) = 0.365, so Seal joins the Bear/Raccoon cluster at a height of 0.365/2 = 0.1825.

Then compute the new distance to the remaining species, again as an arithmetic mean:

$D_{W(BRS)} = \frac{D_{WB} + D_{WR} + D_{WS}}{3} = \frac{0.34 + 0.42 + 0.44}{3} = 0.4$

Dij        BRS    Weasel
BRS        -      0.4
Weasel            -

Repeat steps 1 and 2 once more: the only entry left is D(BRS, Weasel) = 0.4, so Weasel joins at a height of 0.4/2 = 0.2. Done!

[Figure: final UPGMA tree with Bear and Raccoon joined at 0.13, Seal at 0.1825, and Weasel at 0.2]

Downside of UPGMA
• Assumes a molecular clock (an approximately constant evolutionary rate)
• Generates only rooted trees
• Trees are ultrametric
• Does not work in the following case:
  [Figure]

Neighbor-joining method
• Developed in 1987 by Saitou and Nei
• Works in a similar fashion to UPGMA
• Still fast; works well for large data sets
• Does not require the data to be ultrametric
• Well suited to data with widely varying evolutionary rates

How to construct a tree with the neighbor-joining method?
Step 1: Calculate $S_x$ for every taxon x: sum all distances from x and divide by (leaves – 2).
  $S_x = \frac{\sum_y D_{xy}}{\text{leaves} - 2}$
Step 2: Calculate $M_{ij}$ for every pair and find the pair with the smallest M.
  $M_{ij} = D_{ij} - S_i - S_j$
Step 3: Create a node U that joins the pair with the lowest $M_{ij}$. The branch from i to U has length
  $S_{iU} = \frac{D_{ij}}{2} + \frac{S_i - S_j}{2}$
Step 4: Join i and j according to S and arrange all other taxa in the form of a star.
Step 5: Recalculate the distance matrix, with the distance from every other taxon x to U given by
  $D_{xU} = \frac{D_{ix} + D_{jx} - D_{ij}}{2}$

Example of Neighbor-joining

      A     B     C     D     E
B     5
C     4     7
D     7     10    7
E     6     9     6     5
F     8     11    8     9     8

Step 1: S calculation, $S_x$ = (sum of all $D_x$) / (leaves – 2)
S(A) = (5 + 4 + 7 + 6 + 8) / 4 = 7.5
S(B) = (5 + 7 + 10 + 9 + 11) / 4 = 10.5
S(C) = (4 + 7 + 7 + 6 + 8) / 4 = 8
S(D) = (7 + 10 + 7 + 5 + 9) / 4 = 9.5
S(E) = (6 + 9 + 6 + 5 + 8) / 4 = 8.5
S(F) = (8 + 11 + 8 + 9 + 8) / 4 = 11

Example of Neighbor-joining, cont. 1
Step 2: Calculate the pair with the smallest M, $M_{ij} = D_{ij} - S_i - S_j$

      A       B       C       D       E
B     -13
C     -11.5   -11.5
D     -10     -10     -10.5
E     -10     -10     -10.5   -13
F     -10.5   -10.5   -11     -11.5   -11.5

The smallest values are
M(AB) = d(AB) – S(A) – S(B) = 5 – 7.5 – 10.5 = -13
M(DE) = 5 – 9.5 – 8.5 = -13

Example of Neighbor-joining, cont. 2
Step 3: Create a node U, $S_{iU} = D_{ij}/2 + (S_i - S_j)/2$
U1 joins A and B:
S(AU1) = d(AB) / 2 + (S(A) – S(B)) / 2 = 5 / 2 + (7.5 – 10.5) / 2 = 1
S(BU1) = d(AB) / 2 + (S(B) – S(A)) / 2 = 5 / 2 + (10.5 – 7.5) / 2 = 4

Example of Neighbor-joining, cont. 3
Step 4: Join A and B according to S, and arrange all other taxa in the form of a star.
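The arithmetic of one neighbor-joining round can be checked with a short sketch. This is only an illustration under stated assumptions: the names `dist` and `nj_round` and the dictionary layout are invented here, ties in M are broken by taking the first pair in list order (which selects A and B, as in the example), and the step 5 update uses the halved form $(D_{ix} + D_{jx} - D_{ij})/2$ that the worked numbers rely on.

```python
from itertools import combinations

# Pairwise distances from the worked example (taxa A to F).
D = {
    ("A", "B"): 5, ("A", "C"): 4, ("A", "D"): 7, ("A", "E"): 6, ("A", "F"): 8,
    ("B", "C"): 7, ("B", "D"): 10, ("B", "E"): 9, ("B", "F"): 11,
    ("C", "D"): 7, ("C", "E"): 6, ("C", "F"): 8,
    ("D", "E"): 5, ("D", "F"): 9,
    ("E", "F"): 8,
}
TAXA = ["A", "B", "C", "D", "E", "F"]


def dist(d, x, y):
    """Symmetric lookup in a dictionary keyed by unordered pairs."""
    return d[(x, y)] if (x, y) in d else d[(y, x)]


def nj_round(d, taxa, new_name):
    """One neighbor-joining round: steps 1 to 3 plus the step 5 update."""
    n = len(taxa)
    # Step 1: S_x = (sum of distances from x) / (leaves - 2)
    s = {x: sum(dist(d, x, y) for y in taxa if y != x) / (n - 2) for x in taxa}
    # Step 2: M_ij = D_ij - S_i - S_j; ties go to the first pair in list order
    m = {(i, j): dist(d, i, j) - s[i] - s[j] for i, j in combinations(taxa, 2)}
    i, j = min(m, key=m.get)
    # Step 3: branch lengths from i and from j to the new node U
    d_ij = dist(d, i, j)
    branch_i = d_ij / 2 + (s[i] - s[j]) / 2
    branch_j = d_ij / 2 + (s[j] - s[i]) / 2
    # Step 5: distance from every remaining taxon x to U
    rest = [x for x in taxa if x not in (i, j)]
    reduced = {(new_name, x): (dist(d, i, x) + dist(d, j, x) - d_ij) / 2
               for x in rest}
    # Distances among the remaining taxa are carried over unchanged
    reduced.update({(x, y): dist(d, x, y) for x, y in combinations(rest, 2)})
    return s, m, (i, j), (branch_i, branch_j), reduced, [new_name] + rest


s, m, pair, branches, reduced, taxa = nj_round(D, TAXA, "U1")
print(pair, branches)         # ('A', 'B') (1.0, 4.0)
print(reduced[("U1", "C")])   # 3.0
```

Calling `nj_round` again on `reduced` and `taxa` would carry out the next round; a full implementation would also record the tree as it grows, which this sketch leaves out.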
[Figure: A and B joined at U1 with the other taxa arranged as a star. Branches in black are of unknown length; branches in red are of known length.]

Example of Neighbor-joining, cont. 4
Step 5: Calculate the new distance matrix, $D_{xU} = (D_{ix} + D_{jx} - D_{ij}) / 2$
d(CU) = (d(AC) + d(BC) – d(AB)) / 2 = (4 + 7 – 5) / 2 = 3
d(DU) = (d(AD) + d(BD) – d(AB)) / 2 = (7 + 10 – 5) / 2 = 6
EU and FU are computed in the same way. Then we get the new distance matrix:

      U1    C     D     E
C     3
D     6     7
E     5     6     5
F     7     8     9     8

Example of Neighbor-joining, cont. 5
Repeat steps 1 to 5 until all branches are done. In this example, we will get this at the end:
[Figure: final neighbor-joining tree]

Downside of Neighbor-joining
• Generates only one possible tree
• Generates only unrooted trees

Maximum Parsimony Method
Parsimony score: number of character changes (mutations) along the evolutionary tree (a tree containing labels on internal vertices).
Example:
[Figure: two trees with nodes labelled AAA, AAG, AGA, and GGA and per-edge change counts; one has score = 3, the other score = 4]
Most parsimonious tree: the tree with the minimal parsimony score (minimal evolution principle).

Small vs. Large Parsimony
We break the problem into two:
1. Small parsimony: given the topology, find the best assignment to internal nodes
2. Large parsimony: find the topology which gives the best score
Large parsimony is NP-hard. We will show solutions to small parsimony (Fitch's and Sankoff's algorithms).
Input to small parsimony: a tree with character-state assignments to leaves.
Example:
Aardvark   A: CAGGTA
Bison      B: CAGACA
Chimp      C: CGGGTA
Dog        D: TGCACT
Elephant   E: TGCGTA

Fitch’s Algorithm
Execute independently for each character:
1. Bottom-up phase: determine the set of possible states for each internal node
2. Top-down phase: pick states for each internal node
Dynamic programming framework
[Figure: tree over Aardvark (CAGGTA), Chimp (CGGGTA), Bison (CAGACA), Dog (TGCACT), and Elephant (TGCGTA), with phase 1 running upward and phase 2 running downward]

Fitch’s Algorithm: Bottom-up phase
Determine the set of possible states for each internal node.
• Initialization: $R_i = \{s_i\}$ for each leaf i
• Do a post-order (from leaves to root) traversal of the tree
  – Determine $R_i$ of internal node i with children j, k:
    $R_i = \begin{cases} R_j \cap R_k & \text{if } R_j \cap R_k \neq \emptyset \\ R_j \cup R_k & \text{otherwise} \end{cases}$
Parsimony score = number of union operations
[Figure: example tree with leaf states C, T, G, T, A, T and internal sets CT, GT, AGT, T, T; score = 3]

Fitch’s Algorithm: Top-down phase
Pick states for each internal node.
• Pick an arbitrary state in $R_{root}$ for the root
• Do a pre-order (from root to leaves) traversal of the tree
  – Determine $s_j$ of internal node j with parent i:
    $s_j = \begin{cases} s_i & \text{if } s_i \in R_j \\ \text{arbitrary state} \in R_j & \text{otherwise} \end{cases}$
Complexity: O(mnk), where m = #characters, n = #taxa/nodes, k = #states
[Figure: the same example tree, score = 3]

Weighted Parsimony: Sankoff’s algorithm
• Each mutation a↔b has its own cost, S(a,b).
1. Bottom-up phase: determine $R_i(s)$, the cost of the optimal state assignment for the subtree of i when i is assigned state s.
2. Top-down phase: pick optimal states for each internal node
Fitch’s algorithm as a special case:
• $R_i$ is the set of states which yield a minimal-cost subtree of i
Same as the algorithm for optimal lifted tree alignment (Tutorial #4).

Sankoff’s Algorithm: Bottom-up phase
Determine $R_i(s)$ for each internal node.
• Initialization, for each leaf i:
  $R_i(s) = \begin{cases} 0 & \text{if } s_i = s \\ \infty & \text{otherwise} \end{cases}$
• Do a post-order (from leaves to root) traversal of the tree
  – Determine $R_i$ of internal node i with children j, k:
    $R_i(s) = \min_{s'} \{ R_j(s') + S(s', s) \} + \min_{s'} \{ R_k(s') + S(s', s) \}$
• Natural generalization for non-binary trees
• Remember the pointers s → s’
[Figure: example tree with leaf states C, T, G, T, A, T]

Sankoff’s Algorithm: Top-down phase
Pick states for each internal node.
• Select the minimal-cost character for the root (the s minimizing $R_{root}(s)$)
• Do a pre-order (from root to leaves) traversal of the tree:
  – For internal node j with parent i, select the state that produced the minimal cost at i (use the pointers kept in the first stage)
    $R_i(s) = \min_{s'} \{ R_j(s') + S(s', s) \} + \min_{s'} \{ R_k(s') + S(s', s) \}$
Complexity: O(mnk²), where m = #characters, n = #taxa/nodes, k = #states

Fitch’s Algorithm as a special case of Sankoff’s algorithm
Unweighted parsimony:
  $S(a, b) = \begin{cases} 0 & \text{if } a = b \\ 1 & \text{otherwise} \end{cases}$
Sankoff’s algorithm:
• $R_i(s)$ - cost of the optimal subtree of i, when i is assigned state s
Fitch’s algorithm:
• Score(i) - cost of the optimal state assignment for the subtree of i
• $R_i$ - set of states at i that achieve the optimal state assignment for the subtree of i
We need to show that:
1. An optimal tree assigns node i a state from $R_i$.
2. Fitch’s bottom-up recursive formula for $R_i$ is correct (check for yourselves):
   $R_i = \begin{cases} R_j \cap R_k & \text{if } R_j \cap R_k \neq \emptyset \\ R_j \cup R_k & \text{otherwise} \end{cases}$

Proof of claim 1 (an optimal tree assigns node i a state from $R_i$):
• Trivially true for the root.
• Assume (to the contrary) that in an optimal assignment some node j, with parent i, is assigned $s_j \notin R_j$.
• The parsimony score is an integer, so $s_j \notin R_j$ implies $R_j(s_j) \geq \mathrm{Score}(j) + 1$.
• By switching from $s_j$ to some $s \in R_j$ we do not raise the parsimony score: the cost inside the subtree of j drops by at least 1, while the cost on the edge between i and j rises by at most 1.
Why is this not the case for the weighted version?
[Figure: path from the root through node i to its child j]

Maximum likelihood
• Originally developed for statistics by Ronald Fisher between 1912 and 1922
• Therefore based on an explicit statistical model
• Uses all the data
• Tends to outperform parsimony or distance matrix methods

How to construct a tree with maximum likelihood?
Step 1: Make all possible trees for the given number of leaves.
Step 2: Calculate the likelihood of each tree, L(Tree) = probability of the observed data given that tree.
  • optimizing branch length
  • generating tree topology
Step 3: Pick the tree that has the highest likelihood.

Sounds really great?
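The catch lies in step 1: the number of candidate topologies. A minimal sketch, assuming the counting formulas from the "How Many Trees?" slide, $\frac{(2N-5)!}{2^{N-3}(N-3)!}$ for unrooted and $\frac{(2N-3)!}{2^{N-2}(N-2)!}$ for rooted bifurcating trees; the function names are illustrative only.

```python
from math import factorial


def unrooted_trees(n):
    """Number of unrooted bifurcating trees on n >= 3 leaves."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def rooted_trees(n):
    """Number of rooted bifurcating trees on n >= 2 leaves."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for n in (3, 5, 10):
    # leaves, unrooted count, rooted count
    print(n, unrooted_trees(n), rooted_trees(n))
# 3 1 3
# 5 15 105
# 10 2027025 34459425
```

Python integers have arbitrary precision, so the same functions can be evaluated for larger leaf counts without overflow.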
Number of leaves   Number of possible (unrooted) trees
3                  1
5                  15
10                 2,027,025
13                 13,749,310,575
20                 221,643,095,476,699,771,875

Maximum likelihood is very expensive and extremely slow to compute.

Comparison of Methods
Distance
• Uses only pairwise distances
• Minimizes distance between nearest neighbors
• Very fast
• Easily trapped in local optima
• Good for generating a tentative tree, or choosing among multiple trees
Maximum parsimony
• Uses only shared derived characters
• Minimizes total distance
• Slow
• Assumptions fail when evolution is rapid
• Best option when tractable (<30 taxa, homoplasy rare)
Maximum likelihood
• Uses all data
• Maximizes tree likelihood given specific parameter values
• Very slow
• Highly dependent on the assumed evolution model
• Good for very small data sets and for testing trees built using other methods

Methods of evaluating trees
• Bootstrap: resample the initial data set with replacement, so that in each replicate some data are dropped and others appear more than once
• Jackknife: resample the initial data set with one datum left out and not replaced
• MCMC: complex; draws random samples to approximate a desired probability distribution, against which the model is compared

Phylogeny Flowchart
[Figure: flowchart]

Difference in Methods
• Maximum-likelihood and parsimony methods have models of evolution
• Distance methods do not necessarily
  – A useful aspect in some circumstances
    • E.g., trees built based on whole genomes, or on presence or absence of genes
• Religious wars over which methods to use
  – Most people now believe ML-based methods are best: most sensitive at large evolutionary distances, but also most time-consuming and dependent on the specific model of evolution used
• Most commonly used packages contain software for all three methods: you may want to use more than one to have confidence in the built tree

Phylip
• URL: http://evolution.genetics.washington.edu/phylip.html
• Parsimony
  – DNApenny or Protpars
• Distance
  – Compute a distance measure using DNAdist or Protdist
  – Neighbor (can use NJ or UPGMA)
• ML
  – DNAml

Visualising trees
• Treeview
• You can change the graphic presentation of a tree (cladogram, rectangular cladogram, radial tree, phylogram), but not change the structure of a tree
• http://homopan.wayne.edu/softwares/phoenix/index.html

Reference
• Mostly from Web
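Returning to the UPGMA walk-through with bear, raccoon, weasel, and seal earlier in these slides: the merge order and node heights can be reproduced with a short sketch. It is only an illustration under stated assumptions. The function name `upgma`, the tuple-of-leaves cluster representation, and the `frozenset` keys are choices made here, and the distance between two clusters is taken as the arithmetic mean over all leaf pairs, which matches the averages used in the example.

```python
from itertools import combinations


def upgma(names, distances):
    """Return the list of merges as (cluster_a, cluster_b, node_height).

    `distances` maps frozenset({x, y}) to the distance between leaves x and y.
    The distance between two clusters is the mean over all their leaf pairs.
    """
    clusters = [(name,) for name in names]  # each cluster is a tuple of leaves
    merges = []

    def cluster_distance(a, b):
        total = sum(distances[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Step 1: pick the closest pair of clusters
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        d = cluster_distance(a, b)
        # Step 2: join them; the new node sits at height d / 2 above the leaves
        merges.append((a, b, d / 2))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


NAMES = ["Bear", "Raccoon", "Weasel", "Seal"]
DIST = {
    frozenset(("Bear", "Raccoon")): 0.26,
    frozenset(("Bear", "Weasel")): 0.34,
    frozenset(("Bear", "Seal")): 0.29,
    frozenset(("Raccoon", "Weasel")): 0.42,
    frozenset(("Raccoon", "Seal")): 0.44,
    frozenset(("Weasel", "Seal")): 0.44,
}

for a, b, height in upgma(NAMES, DIST):
    print(a, "+", b, "at height", round(height, 4))
```

The three merges come out at heights 0.13, 0.1825, and 0.2, as in the walk-through. The sketch reports node heights rather than individual branch lengths; the length of an internal branch is the difference between the heights of its two end nodes.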