Tricks for trees: Having reconstructed phylogenies, what can we do with them?

Mike Steel
Allan Wilson Centre for Molecular Ecology and Evolution
Biomathematics Research Centre
University of Canterbury, Christchurch, New Zealand
DIMACS, June 2006

Where are phylogenetic trees used?
• Evolutionary biology – species relationships, dating divergences, speciation processes, molecular evolution.
• Ecology – classifying new species; biodiversity, co-phylogeny, migration of populations.
• Epidemiology – systematics, processes, dynamics.
• Extras – linguistics, stemmatology, psychology.

Phylogenetic trees
[Definition] A phylogenetic X-tree is a tree T = (V, E) with a set X of labelled leaves, and all other vertices unlabelled and of degree ≥ 3. If all non-leaf vertices have degree 3 then T is binary.

Trees and splits
[Figure: a tree T with leaves 1–6; deleting an edge e induces the split $A_e \mid B_e$ of the leaf set.]
$\Sigma(T) = \{A_e \mid B_e : e \in E\}$
Partial order $(P_X, \leq)$: $T \leq T' \iff \Sigma(T) \subseteq \Sigma(T')$
Buneman’s Theorem

Quartet trees
• A quartet tree is a binary phylogenetic tree on 4 leaves (say, x, y, w, z), written xy|wz.
[Figure: the quartet tree xy|wz.]
• A phylogenetic X-tree T displays xy|wz if there is an edge in T whose deletion separates {x, y} from {w, z}.
[Figure: example tree with leaves x, y, w, r, z, s, u.]

Corresponding notions for rooted trees
• Clusters (in place of splits)
• Triples (in place of quartets)

How are trees useful in epidemiology?
Systematics and reconstruction
• How are different types/strains of a virus related?
• When, where, and how did they arise?
• What is their likely future evolution?
• What was the ancestral sequence?

How are trees useful in epidemiology?
Processes and dynamics (“Phylodynamics”)
• How do viruses change with time in a population? (Population size etc.)
• What is their rate of mutation, recombination, selection?
Within-host dynamics
• How do viruses evolve in a single patient?
• How is this related to the progression of the disease?
• How much compartmental variation exists?

What do the shapes of these trees tell us about the processes governing their evolution? E.g.
Population dynamics, selection
Coalescent prediction

Tree shapes (non-metric)
George Yule
[Figure: tree shape on leaves a, b, c, d, e.]

Why do trees on the same taxa disagree?
1. Model violation
   1. “true model” differs from “assumed model”
   2. “true model = assumed model” but estimation method not appropriate to model
   3. model true but too parameter rich (non-identifiability)
2. Sampling error (and factors that make it worse!)
3. Alignment error
4. Evolutionary processes
   1. Lineage sorting
   2. Recombination
   3. Horizontal gene transfer; hybrid taxa
   4. Gene duplication and loss

Sampling error that’s hard to deal with
[Figure: trees T1–T4 drawn against a time axis; labels “?” and e.]

Example: Deep divergence in the Metazoan phylogeny
[Figure: Metazoan phylogeny with labelled groups including Deuterostomes, Arthropods, Nematodes, Platyhelminthes, Cnidaria, Ctenophora, Choanoflagellates and Fungi. From Huson and Bryant, 2006.]

Models
[Figure: two trees on leaves 1–4, each with an interior edge e, compared (“vs”).]
Finite state Markov process
[Formula not recoverable from the source.]

Models
[Figure: two trees on leaves 1–4, compared (“vs”).]
• “site saturation”
• subdividing long edges only offers a partial remedy (trade-off).

Gene trees vs species trees
[Figure: trees on taxa a, b, c.]
Theorem (J. H. Degnan and N. A. Rosenberg, 2006). For n ≥ 5, for any species tree topology on n taxa, there are branch lengths and population sizes for which the most likely gene tree is different from the species tree.
J. H. Degnan and N. A. Rosenberg. Discordance of species trees with their most likely gene trees. PLoS Genetics, 2(5), e68, May 2006.

Example
[Figure: tree of Orangutan, Gorilla, Chimpanzee and Human, with a “?”. Adapted from the Tree of Life website, University of Arizona.]

Distinguishing between signals
Lineage sorting vs sampling error vs HGT
[Figure: three trees on taxa A, B, C.]

Given a tree, what questions might we want to answer?
• How reliable is a split?
• Where is the root of the tree?
• Relative ranking of vertices? Dating?
• How well resolved and supported is some ‘deep divergence’?
• What model best describes the evolution of the sequences (molecular clock? dS/dN ratio constant? etc.)
Statistical approaches:
• Non-parametric bootstrap
• Parametric bootstrap
• Likelihood ratio tests
• Bayesian posterior probabilities
• Tests (KH, SH, SOWH)
Goldman, N., J. P. Anderson, and A. G. Rodrigo. 2000. Likelihood-based tests of topologies in phylogenetics. Systematic Biology 49: 652-670.
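The non-parametric bootstrap listed above can be sketched in a few lines. This is a minimal illustration, not the method of any particular package: it assumes the alignment is held as a dictionary of equal-length strings, and the names `bootstrap_replicate`, `split_support` and the toy data are hypothetical. Tree reconstruction on each replicate is left out; the second helper only assumes that each replicate tree has already been summarised as a set of splits.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns (sites) with replacement.

    `alignment` maps taxon name -> aligned sequence (all the same length).
    Returns a new alignment with the same taxa and the same number of sites.
    """
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    # Draw n_sites column indices uniformly at random, with replacement.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


def split_support(split, replicate_split_sets):
    """Fraction of replicate trees whose split set contains `split`."""
    hits = sum(1 for splits in replicate_split_sets if split in splits)
    return hits / len(replicate_split_sets)


# Toy data (hypothetical): four taxa, eight aligned sites.
toy = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "ACCTTCGA", "d": "ATCTTCGA"}
replicate = bootstrap_replicate(toy, random.Random(0))
```

In practice one would build a tree from each replicate with the same method used on the original data, and report `split_support` for each split of the original tree.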
[Figure from Steve Thompson, Florida State University.]

Example

Non-parametric bootstrap

Dealing with incompatibility: Consensus trees
• Strict
• Majority rule
• Semistrict consensus

Consensus networks
• Take the splits that are in at least x% of the trees and represent them by a graph.
• Splits graph $G(\Sigma)$ – Dress and Huson
• Each split is represented by a class of ‘parallel’ edges.
• Simplest example (n = 4).

[Figure: chloroplast JSA tree; tips annotated (NS), (SS), (A), (C,S), (N,NS), (NS,N); R. nivicola (N) marked.]

[Figure: nuclear ITS tree; tips annotated (SS), (NS), (A), (N), (NS,N), (SS,NS); R. nivicola marked.]

[Figure: consensus network (ITS tree + JSA tree) with regions I, II, III; R. nivicola marked.]

Maximum agreement subtrees
• Concept
• Computational complexity

Comparing trees
• Splits metric (Robinson-Foulds)
• Statistical aspects
• Tree rearrangement operations – the graph of trees (rSPR)
• Cophylogeny

Co-phylogeny (M. Charleston)

Supertrees
• Compatibility concept
• Compatibility of rooted trees (BUILD)
• Why do we want to do this?
• Extension – higher order taxa, dates
• Methods for handling incompatible trees (MRP; mincut variants; minflip)

Compatibility
A set Q of quartets is compatible if there is a phylogenetic X-tree T that displays each quartet of Q.
Example: Q = {12|34, 13|45, 14|26}
[Figure: tree on leaves 1–6.]
Complexity?

Phylogenetic networks
• Consensus setting: consensus networks
• Minimizing hybrid/reticulate vertices
• Supernetworks – Z-closure, filtering

[Figure: trees on taxa a, b, c, d, including the leaf orders a b c d and a c b d.]
Networks can represent:
• Reticulate evolution (e.g. hybrid species)
• Phylogenetic uncertainty (i.e.
possible alternative trees)

Z-closure
Given T1, …, Tk on overlapping sets of species, let $\Sigma = \Sigma(T_1) \cup \cdots \cup \Sigma(T_k)$, construct $\mathrm{spcl}_2(\Sigma)$, and construct the ‘splits graph’ of the resulting splits that are ‘full’.

Split closure operation (Meacham 1986)
[Figure: two splits A1 | B1 and A2 | B2 being combined; labels include B1 ∪ B2 and A1 ∪ A2.]

Reconstructing ancestral sequences
• Methods (MP, Likelihood, Bayesian)
• Quiz. MP for a balanced tree = majority state?
• Information-theoretic considerations

Statistics of parsimony (clustering on a tree)
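One way to explore the quiz on MP and the majority state is to run the bottom-up pass of Fitch's parsimony algorithm on a small example. The sketch below is an illustration under stated assumptions, not material from the talk: trees are nested 2-tuples, characters take the two states 0 and 1, and the helper names (`fitch`, `only_zero`, `ambiguous`, `all_ones`) are hypothetical. The recursive construction builds one particular balanced tree on 32 leaves so that the root state set and the leaf counts can be compared directly.

```python
from collections import Counter


def fitch(tree):
    """Bottom-up Fitch pass on a nested-tuple binary tree.

    Leaves are character states; internal nodes are 2-tuples.
    Returns (state set at the root of `tree`, parsimony score).
    """
    if not isinstance(tree, tuple):
        return frozenset([tree]), 0
    left_set, left_cost = fitch(tree[0])
    right_set, right_cost = fitch(tree[1])
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # Children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


def leaves(tree):
    """List the leaf states of a nested-tuple tree, left to right."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def all_ones(h):
    """Balanced tree of height h with every leaf in state 1."""
    return 1 if h == 0 else (all_ones(h - 1), all_ones(h - 1))


def only_zero(h):
    """Balanced tree of height h whose Fitch root set is {0}."""
    if h == 0:
        return 0
    if h == 1:
        return (0, 0)
    return (only_zero(h - 1), ambiguous(h - 1))


def ambiguous(h):
    """Balanced tree of height h >= 1 whose Fitch root set is {0, 1}."""
    if h == 1:
        return (0, 1)
    return (only_zero(h - 1), all_ones(h - 1))


tree = only_zero(5)                 # balanced tree with 2**5 = 32 leaves
root_set, score = fitch(tree)
counts = Counter(leaves(tree))
print(sorted(root_set), dict(counts))
```

Comparing `root_set` with `counts` for this tree shows how the state set assigned to the root by parsimony relates to the majority state among the leaves; other heights can be tried by changing the argument to `only_zero`.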