Phylogenetics

Phylogenetic Trees
[Figure: a tree drawn along a time axis, with labels for the root, a branch, an internal node (hypothetical taxonomic unit) and a tip (operational taxonomic unit, OTU)]

Information
• Branching order (topology)
  – Relative closeness of different taxa
• Branch length
  – Amount of divergence

Rooted and unrooted trees
[Figure, three slides: five taxa (A–E) drawn as rooted trees alongside the corresponding unrooted tree]
[Figure: the possible rooted and unrooted trees for 3 OTUs (A, B, C) and 4 OTUs (A, B, C, D); there are 15 rooted trees of 4 OTUs]

Monophyletic & Paraphyletic
[Figure: tree of birds, crocodiles, snakes and lizards, turtles and tortoises, and mammals, with the "reptiles" grouping marked]
• Monophyletic
  – Natural clade: a common ancestor and all of the taxa derived from it
• Paraphyletic
  – Taxonomic group that leaves out some descendants of its most recent common ancestor, so that ancestor is shared with another taxon (e.g. reptiles without birds)

Reconstruct phylogeny from molecular data
[Figure: aligned DNA sequences (e.g. ACTGTTACCGA) at the tips of a tree whose branching order is unknown]

Types of phylogenetic analysis methods
• Phenetic (distance methods): trees are constructed based on observed characteristics, not on evolutionary history
• Cladistic (parsimony and maximum likelihood methods): trees are constructed by fitting observed characteristics to some model of evolutionary history

Methods of tree reconstruction
• Distance
• Maximum Parsimony
• Maximum Likelihood
• Bayesian
Phylogeny Estimation: Traditional and Bayesian Approaches. Nature Reviews Genetics (2003) 4:275

Genetic distance
• Distance from one sequence to another
• Hamming distance
  – Count the number of differences
• Multiple hits
  – The number of events is greater than the number of observed differences
  – Estimate the number of events
• Infer the tree from genetic distances using the neighbour-joining (NJ) method
UPGMA is shown for illustrative purposes; neighbour-joining is the preferred method.
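The two distance ideas on the genetic distance slide can be sketched in Python. The Hamming count follows the slide directly. The slide does not say which model is used to estimate the number of events, so the Jukes-Cantor correction is used here purely as an assumption; the function names and the toy sequences are also made up for illustration.

```python
import math


def hamming(a: str, b: str) -> int:
    """Count the positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def jukes_cantor(a: str, b: str) -> float:
    """Estimated substitutions per site, corrected for multiple hits.

    Assumption: the Jukes-Cantor model (equal base frequencies, equal rates),
    which gives d = -3/4 * ln(1 - 4p/3) for an observed proportion p of
    differing sites.
    """
    p = hamming(a, b) / len(a)
    if p >= 0.75:
        raise ValueError("too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical aligned sequences, invented for this example.
seq1 = "ACTGTTACCGA"
seq2 = "ACTGATACCTA"

print("differences:", hamming(seq1, seq2))
print("corrected distance:", jukes_cantor(seq1, seq2))
```

The corrected distance is never smaller than the raw proportion of differences, which reflects the point that the number of events can exceed the number of observed differences.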
• UPGMA, as described in the text: find the closest pair of sequences and cluster them; then find the next closest pair and cluster those; as sequences are added to existing clusters, use the average distance between clusters
• Work through the notation!
• UPGMA assumes a molecular clock (the same rate of evolution on every branch)
• Neighbour-joining corrects for UPGMA’s assumption of the same rate of evolution on each branch by modifying the distance matrix to reflect different rates of change
• The net divergence of sequence $i$ from all other sequences is
  – $r_i = \sum_k d_{ik}$
• The rate-corrected distance matrix is then
  – $M_{ij} = d_{ij} - (r_i + r_j)/(n - 2)$
• Join the two sequences $i$ and $j$ whose $M_{ij}$ is minimal; then calculate the distance from the new node $k$ to every other sequence $m$ using
  – $d_{km} = (d_{im} + d_{jm} - d_{ij})/2$
• Correct for rates again and join the next pair of nodes

Maximum Parsimony (MP)
• Find the topology requiring the smallest number of evolutionary changes
• Consider each position (site) in the sequence alignment independently
• Not all sites are informative
• Informative
  – Favours one topology over others

Informative sites
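A site is usually called parsimony-informative when at least two different character states each occur in at least two sequences; only then can the site favour one topology over another. The following is a minimal Python sketch of that test. The function names and the four-sequence alignment are hypothetical and are not the alignment shown on the slide.

```python
from collections import Counter


def is_informative(column) -> bool:
    """True if at least two states each occur in at least two sequences."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


def informative_sites(alignment):
    """Return the 0-based indices of parsimony-informative columns."""
    return [i for i, col in enumerate(zip(*alignment)) if is_informative(col)]


# Hypothetical alignment of four sequences, invented for this example.
alignment = [
    "AAGAC",
    "AGCCC",
    "AGATT",
    "AGAGT",
]

# Column 0 is invariant, column 1 has a single variant, columns 2 and 3 have
# no state shared by two pairs; only column 4 (C, C, T, T) is informative.
print(informative_sites(alignment))
```

Invariant sites and sites where only one sequence differs require the same number of changes on every topology, which is why parsimony ignores them.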
[Figure: example alignment sites for four sequences (a–d), each mapped onto the alternative unrooted trees to show whether the site favours one topology]

Maximum Likelihood (ML)
• The likelihood $L$ of a tree is the probability of observing the data given the tree: $L = P(\text{data} \mid \text{tree})$
• Find the tree with the highest $L$
• Results depend on the model of nucleotide substitution
• Computationally time-consuming
• All the other methods discussed implicitly use a simple model of evolution similar to the one made explicit in maximum likelihood:
  – All sites are selectively neutral
  – All sites mutate independently, with equal forward and reverse rates, given by $\mu$
  – Generations are discrete and sites change independently
• Given this model, the probability that a site with initial nucleotide $i$ will change to nucleotide $j$ within time $t$ is
  – $P_{ij}(t) = \delta_{ij} e^{-\mu t} + (1 - e^{-\mu t}) g_j$, where $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise, and $g_j$ is the equilibrium frequency of nucleotide $j$
• The likelihood that some site is in state $i$ at the $k$th node of a tree is $L_i(k)$
• The likelihoods of all states are calculated separately for each site at each node; the product of the per-site likelihoods gives the overall likelihood of the observed data
• Different tree topologies are searched to find the highest overall likelihood
• Maximum likelihood is often regarded as the “gold standard” for phylogenetic analysis, but because it is computationally intensive it is practical only for selected data sets, and only after careful tuning of the many parameters of the sequence alignment
• Often used to distinguish between several already generated trees

Bayesian (B) Phylogeny Estimation
• Searches for the best trees consistent with both model and data
• Incorporates prior knowledge (prior probability)
• B maximises the probability of the tree given the data and model
• Searches for the best set of trees

Comparison of methods
How much information are they using?
• MP, ML and B use the actual DNA sequences, whereas NJ summarises the information in a distance matrix
• BUT, not all sites are used by MP (“informative” sites only)
How can the nature of the data affect the methods?
• NJ is better for recent divergences
• MP works well when there is a high number of informative sites

Comparison of methods
How do they cope with lots of sequences?
• MP requires comparison of all possible trees
  – Not possible for a large number of taxa
• ML is computationally intensive and very slow for a large number of taxa
• NJ is efficient for a large number of taxa
Anything else?
• ML requires explicit assumptions about the rate and pattern of substitution (model)
  – ML may perform poorly if the model is incorrect
• ML or B may get stuck on local maxima

Outgroup rooting of unrooted trees
• Outgroup – a related sequence that definitely diverged earlier (paleontological evidence)
[Figure: an unrooted tree of human, mouse and rat, rooted by adding chicken as the outgroup]

Rate (r) of evolution
• $K$ = number of substitutions per site
• $T$ = time since divergence
• $r = K/(2T)$, because the two lineages have each been evolving for time $T$
• Rate is expressed as substitutions per site per year
[Figure: species A and species B diverging from a common ancestor at time T]

Estimating species divergence times
• Fossil evidence shows that $T_1$ = 310 mya
• What is $T_2$?
  – $r = (K_{AC} + K_{BC}) / (2 \cdot 2T_1)$
  – $r = K_{AB} / (2T_2)$
  – $T_2 = K_{AB} / (2r) = (K_{AB} \times 2T_1) / (K_{AC} + K_{BC})$
• Only need to have sequences and information on one divergence time
[Figure: tree of chicken (C), human (B) and rat (A); human and rat diverge at T2, and chicken diverges from both at T1]

True tree and inferred tree
• There is only one true tree of species relationships
• The inferred tree may not be correct
  1. Some genes may not be representative
  2. The tree inference method may have produced an incorrect tree
     – e.g. the parsimony method may give several equally parsimonious trees

How credible is the tree?
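One common way to put a number on this is the bootstrap: resample the alignment columns with replacement, rebuild the result from each resampled alignment, and count how often a feature of interest reappears. The Python sketch below shows only the resampling and counting. As a deliberate simplification it does not build trees; the "feature" is whether a chosen pair of taxa is the closest pair by number of differences. The function names and toy sequences are hypothetical.

```python
import random
from itertools import combinations


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement, keeping the column count."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}


def closest_pair(alignment):
    """Stand-in for tree building: the pair of taxa with fewest differences.

    Ties go to the pair that comes first in alphabetical order.
    """
    def differences(pair):
        a, b = pair
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    return min(combinations(sorted(alignment), 2), key=differences)


def bootstrap_support(alignment, pair, replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    target = tuple(sorted(pair))
    hits = sum(closest_pair(resample_columns(alignment, rng)) == target
               for _ in range(replicates))
    return hits / replicates


# Hypothetical sequences, invented for this example (not real primate data).
toy = {
    "Human":     "ACTGTTACCGA",
    "Chimp":     "ACTGTTACCGT",
    "Gorilla":   "ACTGATACCTT",
    "Orangutan": "TCTGATGCCTT",
}

print(bootstrap_support(toy, ("Human", "Chimp")))
```

In a real analysis the `closest_pair` step would be replaced by the full tree-building method (distance, parsimony or likelihood), and support would be recorded for each clade of the original tree.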
• The tree is a hypothesis of the true relationships
• Need some measure of the support for that hypothesis
• Note: Bayesian methods simultaneously estimate the tree and measures of uncertainty for each branch

Standard error of branches
[Figure: tree of human, chimp, gorilla and orangutan]
• The bootstrap: randomly sample positions (columns in an alignment) with replacement, meaning some columns can be repeated, while conserving the total number of positions; build a large set of these resampled alignments

Bootstrap
• Then use your method (distance, parsimony, likelihood) to generate a tree from each resampled alignment
• Do this a thousand or so times
• Note that if the assumptions the method is based on hold, the bootstrapped alignments should give the same tree as the original alignment
• The frequency of a feature of your phylogeny (e.g. a particular grouping) in the bootstrapped set gives a measure of the confidence you can have in that feature

Applications of phylogenetics
• Detection of orthology and paralogy
• Estimation of divergence times
• Reconstruction of ancient proteins
• Identifying residues important to selection
• Detecting recombination points
• Identifying mutations likely to be associated with disease
• Determining the identity of new pathogens

The time will come, I believe, though I shall not live to see it, when we shall have fairly true genealogical trees of each great kingdom of Nature.
Charles Darwin

The Tree of Life
• Traditional classification of life into five kingdoms
  – Bacteria (inc. cyanobacteria)
  – Protista (inc. ciliates, flagellates, amoebae)
  – Fungi
  – Plantae
  – Animalia

Archaebacteria
• Carl Woese and colleagues
• Studied relationships by comparing rRNAs
• Methanogens were expected to group with other bacteria
• BUT, they were found to be equally distant from bacteria and eukaryotes
• Made new taxon Archaebacteria
• Includes many extremophiles
  – thermophiles
  – hyperthermophiles
  – halophiles (salt dependent)

The Tree of Life
Where is the root of the Tree of Life?
• No possible outgroup (by definition)
• Iwabe et al. (1989)
• Examined phylogenetic trees of pairs of genes that exist in all organisms
  – derived from a gene duplication that predates the divergence of the lineages
[Figure: an ancestral gene A duplicates into A1 and A2 before lineages 1, 2 and 3 diverge, so each copy has its own subtree of lineages 1–3]
• Homologous elongation factor genes EF-Tu and EF-G are present in all prokaryotes and eukaryotes
• Both genes show the same topology
[Figure: EF-Tu and EF-G subtrees, each containing Archaea, Eucarya and Bacteria]

Changing view of The Tree of Life …
[Figure: successive views of the tree of life. Panel labels, in the order they appear on the slide:]
• based on morphological characteristics (Chatton, 1925)
• (Gaucher et al., 2010)
• based on DNA sequence analysis (Woese & Fox, 1977)
• based on membrane architecture & gene indels
Most modern view …
• based on ancient gene duplication
• based on phylogenies of hundreds of genes

Phylogeny of humans and apes
• Darwin – gorilla and chimpanzee are our closest relatives, and human evolutionary origins lie in Africa
• Many people preferred the anthropocentric idea that humans were special
[Figure: the traditional view of the tree of human, chimp, gorilla, orangutan and gibbon]

So what is the evidence?
• Serological precipitation (Goodman 1962)
  – H, G, C constitute a natural clade, with orangutans & gibbons diverging earlier
• However, the relationships among H, G and C remained unclear
• Most DNA sequence data support ((H,C),G)
• Some genes show a different relationship
[Figure: tree of human, chimp, gorilla, orangutan and gibbon]

Conservation biology – the dusky seaside sparrow
• Last one died June 1987 (DisneyWorld)
• Discovered 1872
• Ammodramus maritimus nigrescens
• Geographically confined to a small salt marsh in Florida
• 2000 individuals in 1900
• 6 individuals (all male) in 1980
• Conservation program – artificial breeding

Conservation genetics
• Mating of the remaining males with females from the closest subspecies available
• Female hybrids of the first generation then “back-crossed” to the original males
• Continue as long as the original males live
• Which subspecies should the females be taken from?
• 8 other A. maritimus subspecies
• Geographically dispersed along the coast
• Artificial breeding with Scott’s seaside sparrow (A. m. peninsulae)
• Chosen based on morphological and behavioural similarities
• Was this the best choice?
[Figure: nigrescens shown with the Atlantic Coast subspecies; peninsulae shown with the Gulf Coast subspecies]

Whoops!
• The two subspecies diverged about 250,000 – 500,000 years ago
• A. m. nigrescens is almost indistinguishable molecularly from the other Atlantic Coast subspecies
• Any Atlantic Coast subspecies would have been a better choice
• Created a new species instead of saving the old one
• Dusky seaside sparrow officially declared extinct in 1990

Origin of angiosperms
• Flowering plants: carpel-enclosed ovules and seeds
• Fossils
  – began to radiate mid-Cretaceous (~115 mya)
  – dominant land plants 90 mya
• 275,000 species described

Origin of angiosperms
• Probably arose from a gymnosperm-like ancestor up to 370-380 mya
• Gymnosperm = “naked seed” (e.g. conifers)
• Long time span of possible origin
• Why no fossils?
  – Didn’t exist prior to the Cretaceous?
  – Lived in habitats not conducive to fossilisation?

Monocot and Dicot divergence
• Monocotyledons
• Dicotyledons
• Two major classes of angiosperm
• Date of their divergence gives a minimum estimate for the age of angiosperms
• Phylogenetic analysis of DNA sequences

Monocot – Dicot divergence
• Initial estimate of 300-320 mya (Martin et al. 1989)
  – Glyceraldehyde-3-phosphate dehydrogenase from plants, animals and fungi
• Implied origin close (within 100 myr) to the time of origin of the earliest land plants
  – seems too ancient
  – implies all vascular plants arose within 100 myr
• Alternative study (Wolfe et al., 1989)
• Calibrated molecular clock with the maize-wheat divergence (50-70 mya)
• Monocot-dicot divergence estimated as 200 mya
• Existed long before prominence in paleoflora

Cetaceans
[Figure: tree of artiodactyls: cow, deer, hippo, pig, peccary and camel]
• Link to ungulates (hoofed mammals) suggested by comparative anatomy
• Early protein and mtDNA phylogenetic studies indicated that cetaceans are closely related to artiodactyls
• Graur and Higgins (1994)
• Protein and DNA sequences from several cetaceans and from three suborders of artiodactyls
• Showed that cetaceans are within artiodactyls
• Confirmed by analysis of the distribution of SINE elements
Cetartiodactyls