Phylogeny
Wayne Maddison
25 March 2003

• History + basic ideas
• Methods for reconstructing phylogeny
• Applications of phylogeny

Systematics & Taxonomy
Description of species & arrangement into groups (classification)
Beginning 1750’s: What are the species and how are they related?
By ca. 1950:
• hundreds of thousands of species described
• general idea of phylogeny, at least for multicellular creatures

Systematics of mid 20th century:
(1) Focus was not so much classification & phylogeny; focus was distinguishing species. Phylogeny was usually inferred without explicit data or analyses (i.e. seat of the pants).
(2) Phylogeny & classification were permitted to differ.
[Figure: phylogeny with tips turtles, lizards, crocodiles, birds, mammals, compared with the classification Reptiles, Birds, Mammals]

Monophyletic group: an ancestor and all of its descendants
or a group consisting of species more closely related to each other than to any other species not in the group
or a group whose most recent common ancestor is more recent than any common ancestor shared with species outside the group

Paraphyletic group: an ancestor and some but not all of its descendants
Must be a contiguous piece of the tree (e.g. Archaeopteryx + ostrich + crow isn’t paraphyletic)

Polyphyletic group: neither monophyletic nor paraphyletic, i.e. discontiguous pieces of the phylogenetic tree

Before Hennig: If a group of species is closely related, they should share traits.
Hennig: If a group of species is monophyletic, they should share traits derived within the containing group. Other types of groups (paraphyletic, polyphyletic) are not expected to possess derived traits uniquely.
Therefore sharing of derived traits is the indicator of monophyly.
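The "ancestor and all of its descendants" definition of a monophyletic group can be checked mechanically. The sketch below is only an illustration: the nested-tuple tree encoding, the function names, and the particular branching order in `TREE` are assumptions made for the example, since the tree figure itself is not recoverable from the slide text.

```python
# Hypothetical tree for illustration only; the branching order is an
# assumption, not taken from the slide figure.
TREE = ("mammals", ("turtles", ("lizards", ("crocodiles", "birds"))))


def tips(node):
    """Return the set of tip names at or below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(tips(child) for child in node))


def smallest_clade(node, group):
    """Return the tips of the smallest clade that contains every member of group."""
    if not isinstance(node, str):
        for child in node:
            if group <= tips(child):
                # The whole group sits inside this child, so look deeper.
                return smallest_clade(child, group)
    return tips(node)


def is_monophyletic(tree, group):
    """A group is monophyletic if its most recent common ancestor has no other descendants."""
    group = frozenset(group)
    return smallest_clade(tree, group) == group


print(is_monophyletic(TREE, {"crocodiles", "birds"}))
print(is_monophyletic(TREE, {"turtles", "lizards", "crocodiles"}))
```

On this assumed tree, "Reptiles" (turtles, lizards, crocodiles) fails the test because the most recent common ancestor of those three also gave rise to birds.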
1950’s & 60’s: formalizing methods begins
Hennig 1950 (German), 1966 (English)
(1) Formal logic for reconstructing phylogeny
(2) Classification should match phylogeny (all groups monophyletic)

apomorphy: derived trait
synapomorphy: shared derived trait
plesiomorphy: ancestral trait
[Figure: tree with branches labelled apomorphy and plesiomorphy]
Synapomorphy indicates monophyly

1960’s: quantification
Formal coding of data into matrix of characters and character states

Data matrix (columns are characters, e.g. a character with states 0, 1):

Taxon        A  B  C  D  E  F  G
Papilio      0  2  1  0  0  1  0
Nymphalis    0  1  0  1  1  0  1
Pieris       1  0  0  0  0  0  2
Danaus       0  1  0  1  1  0  1
Battus       0  2  1  0  0  1  0
Heliconius   0  0  0  0  1  1  0
Colias       1  0  0  0  0  0  2

Optimality methods
Examine trees, choose tree that optimizes some criterion
[Figure: the data matrix compared against candidate trees (Tree 1 to Tree 5) of the seven taxa]

Distance methods
Data matrix -> Distance matrix (some loss of information) -> Tree
[Figure: pairwise distance matrix for the seven taxa, and the resulting tree]

Distance methods
• UPGMA
• Neighbor Joining

Optimality methods
• Parsimony: seeks tree minimizing evolutionary change
• Likelihood: seeks tree maximizing probability of observed data

Parsimony
- counting steps, examples
- seeks “simplest explanation” (minimizes ad hoc hypothesis against contradictory evidence)

Likelihood: simple example
[Figure: tree with tip states A = 1, B = 1, C = 0, D = 0]
What is the probability of these states evolving if the probability of change on each branch is 0.1?
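The question in the simple likelihood example can be answered by summing over all possible states at the internal nodes. The sketch below is a minimal illustration, not the calculation shown on the slide: the tree shape ((A,B),(C,D)), the rooting, the equal prior on the root state, and all function names are assumptions, because the slide figure is not recoverable. Only the tip states and the 0.1 probability of change per branch come from the slide.

```python
P_CHANGE = 0.1                                 # probability of change on each branch (from the slide)
TIP_STATES = {"A": 1, "B": 1, "C": 0, "D": 0}  # observed tip states (from the slide)


def transition(parent, child, p):
    """Probability that a branch starting in state `parent` ends in state `child`."""
    return p if parent != child else 1.0 - p


def conditional(node, tip_states, p):
    """Return [L0, L1]: probability of the tip data below `node` given its state."""
    if isinstance(node, str):
        # A tip: probability 1 for the observed state, 0 for the other.
        return [1.0 if tip_states[node] == s else 0.0 for s in (0, 1)]
    result = [1.0, 1.0]
    for child in node:
        child_like = conditional(child, tip_states, p)
        for s in (0, 1):
            # Sum over the state at the lower end of the branch to this child.
            result[s] *= sum(transition(s, c, p) * child_like[c] for c in (0, 1))
    return result


def tree_likelihood(tree, tip_states=TIP_STATES, p=P_CHANGE, root_prior=(0.5, 0.5)):
    """Likelihood of the tip states on a rooted tree (root prior is an assumption)."""
    cond = conditional(tree, tip_states, p)
    return sum(root_prior[s] * cond[s] for s in (0, 1))


grouped = (("A", "B"), ("C", "D"))   # assumed tree: like states are sisters
mixed = (("A", "C"), ("B", "D"))     # alternative tree for comparison
print(f"{tree_likelihood(grouped):.4f}")
print(f"{tree_likelihood(mixed):.4f}")
```

Under these assumptions the tree that groups A with B and C with D gives a likelihood of about 0.0657, against about 0.0081 for the tree that splits them, so the likelihood criterion would prefer the first tree.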
Likelihood: the probability of the data observed given the hypothesis and assumptions
P(Data | Hypothesis)
Goal: to find the hypothesis that maximizes the probability

Observed data:
A  ATTGTA
B  ATTGTA
C  ACCGCA
D  ACCGCA

[Figure: a candidate tree of A, B, C, D]
If this were the tree ... what would be the probability of these sequences evolving?

Probability of evolving sequences doesn't depend only on tree
Also depends on:
- the ancestral sequence, or the probabilities of various possible ancestral sequences
- the probabilities of mutations per unit time
- the times involved (branch lengths)

Model of sequence evolution:
- probabilities of bases at ancestor
- base rates of mutations per unit time
- site to site rate variation

Simplest: Jukes-Cantor 1969
• A, C, G, T equally probable at ancestor
• All substitutions equally likely, and all rates equal

Rates of change (rates of mutation per unit time):
       A    C    G    T
A      -    α    α    α
C      α    -    α    α
G      α    α    -    α
T      α    α    α    -

For each candidate tree, probability of sequences evolving will depend on branch lengths and the α parameter of the JC model

Example of likelihood analysis using JC 69
[Figure: tree of A, B, C, D]
Search: Try alternative trees, branch lengths and α's to find the combination maximizing probability of observing the data (simultaneous estimation)

More complex, realistic models

HKY 85
• transitions and transversions can differ in rate
• equilibrium base frequencies might not be equal
Rates of change (fX = equilibrium frequency of base X):
       A     C     G     T
A      -     αfC   βfG   αfT
C      αfA   -     αfG   βfT
G      βfA   αfC   -     αfT
T      αfA   βfC   αfG   -

GTR (General time reversible)
• all changes can differ in rate as long as symmetrical
• equilibrium base frequencies might not be equal
Rates of change:
       A     C     G     T
A      -     αfC   βfG   γfT
C      αfA   -     δfG   εfT
G      βfA   δfC   -     ρfT
T      γfA   εfC   ρfG   -

Another issue: not all sites evolve at the same rate
Protein coding: 2nd positions much slower, third positions and introns fastest
Non-coding: e.g., ribosomal or tRNA: areas not vital to secondary structure or interactions may evolve quickly

One model of site-to-site rate variation: the gamma distribution
- has one parameter, the "shape"
[Figure: gamma distributions for shape α = 0.5, 2, 50, 200; horizontal axis: rate of evolution (slow to fast); vertical axis: frequency of sites with that rate]

Example of likelihood analysis using GTR + gamma
[Figure: tree of A, B, C, D with the GTR rate matrix + gamma rate variation]
Search: Try alternative trees, branch lengths, substitution rate parameters, equilibrium base frequencies and gamma shape parameters to find the combination maximizing probability of observing the data (simultaneous estimation of model & tree)

Better likelihoods with more complex model
fewer parameters
• tree + JC (rate)
• tree + HKY (transition & transversion rates + equil. freq.)
• tree + GTR (6 rates + equil. freq.)
• + gamma rate variation
more parameters
but should use model only as complex as you need!
Can use likelihood ratio tests to test significance

Parsimony: originally justified as “simplicity of explanation”
Likelihood: statistically best justified, but imposes assumptions of uniformity of process across characters & branches of tree

Practical difficulties: computation
Searching among all possible
- trees
- branch lengths
- rate matrix parameters
- equilibrium base frequencies
- gamma shape parameters
to find the combination maximizing probability is not easy!
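A much smaller version of this search shows how the JC likelihood depends on the amount of evolution separating sequences. The sketch below is a deliberate simplification and not the tree analysis from the slides: it compares only two of the observed sequences, uses the standard JC69 transition probabilities for a rate matrix whose off-diagonal entries are all α, and finds the best value of α × time by a plain grid search. The function names, the grid, and the reduction to two sequences are all assumptions made for illustration.

```python
import math

SEQ_1 = "ATTGTA"  # sequence of A and B in the observed data
SEQ_2 = "ACCGCA"  # sequence of C and D in the observed data


def jc69_prob(x, y, at):
    """P(base x is replaced by base y) under JC69, where at = alpha * time."""
    decay = math.exp(-4.0 * at)
    if x == y:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


def log_likelihood(seq_1, seq_2, at):
    """Log probability of two aligned sequences separated by at = alpha * time."""
    total = 0.0
    for x, y in zip(seq_1, seq_2):
        # 0.25 is the JC69 probability of the starting base at each site.
        total += math.log(0.25 * jc69_prob(x, y, at))
    return total


def best_at(seq_1, seq_2, step=0.001, max_at=2.0):
    """Grid search for the alpha * time value maximizing the likelihood."""
    n_points = int(round(max_at / step))
    grid = [i * step for i in range(1, n_points + 1)]
    return max(grid, key=lambda at: log_likelihood(seq_1, seq_2, at))


estimate = best_at(SEQ_1, SEQ_2)
print(f"best alpha * time: {estimate:.3f}")
print(f"expected substitutions per site: {3 * estimate:.3f}")
```

Even in this two-sequence case the answer comes from trying many parameter values. A full analysis has to repeat that kind of optimization for every branch of every candidate tree, which is where the computational difficulty comes from.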
Difficulties of searching for optimal trees

number of taxa*    number of possible trees
3                  3
4                  15
5                  105
6                  945
7                  10395
10                 3.4 e7
20                 8.2 e21
30                 4.95 e38
1,000,000          1 e5,866,723
*terminal taxa, OTU’s

The age of the universe is about 5 e29 picoseconds

Can’t search exhaustively among all trees
Heuristic search (“good guess”)
- Build initial tree by adding taxa
- Hill-climbing algorithm: making adjustments to tree; at each step check optimality criterion to see what adjustment to make

Parsimony and likelihood both make explicit predictions about character distributions, and assess disagreement between predicted & observed character distributions

Distance methods (popular 60’s & 70’s; Neighbor Joining brought back in 90’s)
Optimality methods
• Parsimony (popular 80’s & 90’s; remains popular with morphological data)
• Likelihood (popular 90’s & 00’s)

Success?
Jumping spiders: different data agree
[Figure: tree from visible structures & color; trees from four separate gene regions]
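The counts in the table of possible trees match the standard formula for the number of rooted, bifurcating trees on n labelled terminal taxa, 3 × 5 × ... × (2n − 3). The slides do not state which formula was used, so reading the table this way is an inference from the numbers; the function name below is hypothetical.

```python
def num_rooted_trees(n_taxa):
    """Number of rooted bifurcating trees on n labelled tips: 3 * 5 * ... * (2n - 3)."""
    if n_taxa < 2:
        raise ValueError("need at least two taxa")
    count = 1
    for odd in range(3, 2 * n_taxa - 2, 2):  # odd numbers 3, 5, ..., 2n - 3
        count *= odd
    return count


for n in (3, 4, 5, 6, 7, 10, 20, 30):
    print(n, f"{num_rooted_trees(n):.3g}")
```

Each added taxon multiplies the count by the next odd number, so the count passes the age of the universe in picoseconds somewhere between 20 and 30 taxa, which is why exhaustive search gives way to heuristics.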