Molecular Phylogenetics

Phylogenetic trees are about visualizing evolutionary relationships
  “Nothing in Biology Makes Sense Except in the Light of Evolution”
  Theodosius Dobzhansky (1900-1975)

Phylogeny
  Hypothesis of evolutionary relationships
  Phylogenetic tree = graphical summary of evolutionary history
  We have been using trees throughout the semester
  Now we will examine how to construct them
  Phylogeny is only an estimate

Phylogenetics
  Under Darwin’s hypothesis of common descent
    Species in the same genus stem from a recent ancestor
    Hierarchical classification reflects not a mystical ordering of the universe, but rather a real historical process

Phylogenies
  Species tree (how are my species related?)
    contains only one representative from each species
    when did speciation take place?
    all nodes indicate speciation events
  Gene tree (how are my genes related?)
    normally contains a number of genes from a single species
    nodes relate either to speciation or gene duplication events

Phylogenetic Trees
  Diagram consisting of branches and nodes
  [Figure: tree with terminal nodes A, B, C, D, E, labelling a terminal node, an interior node, a branch (edge), the root of the tree, and a split (bipartition), also written AB|CDE or portrayed **---]

Unrooted vs. rooted trees

Rooting a Phylogeny
  Several methods used to identify polarity
  Most commonly used is the outgroup method
    The character state of the target taxa is compared with that of a relative that diverged earlier
    Outgroup represents the ancestral state
    Identify outgroup from other phylogenetic studies or fossil data
    Good to use several outgroups at once

Rooting Using an Outgroup
  1. The outgroup should be a sequence (or set of sequences or taxon) known to be less closely related to the rest of the sequences (taxa) than they are to each other
  2.
It should ideally be as closely related as possible to the rest of the sequences (taxa) while still satisfying condition 1
  The root must be somewhere between the outgroup and the rest (either on the node or in a branch)
  The POINT of rooting (using an outgroup) is to include the ancestor of the group of interest in the phylogeny!

Terms
  Clade: a set of species (or sequences) which includes all of the species (or sequences) derived from a single common ancestor
  Monophyly
  Polyphyly
  Paraphyly

Cladograms vs. Phylograms
  Cladogram
    Only shows you the relationships between taxa
    Branch lengths provide no data!
  Phylogram
    Shows you relationships AND the amount of change (evolution) inferred along each branch
    Therefore, branch lengths are very important!

Cladograms vs. Phylograms
  [Figure: the same tree of Species A to Species F drawn twice, as a cladogram and as a phylogram (sometimes called a phenogram), in which branch lengths mean something; scale bar = 5 changes]

Phylogenetics Terms
  Monophyletic group
    All members are believed to stem from a single common ancestor, and the group includes this common ancestor
  Paraphyletic group
    Group that is monophyletic except that some descendants of the common ancestor have been removed
  Polyphyletic group
    Group consisting of unrelated lineages, each more closely related to other lineages not placed in the taxon

Cladistic Methods
  Techniques that identify monophyletic groups based on synapomorphies
  Synapomorphies define evolutionary branching points
  Autapomorphies and ancestral characters do not
  Must be able to identify homology of traits and direction of change through time (polarization)

Homology
  The features of organisms almost always evolve from pre-existing features of their ancestors
  It is unlikely that features arise de novo from nothing…

Homology
  Homologous features are derived from a common ancestor
  Organs of 2 organisms are homologous if they have been inherited (and perhaps modified) from a single organ of a common ancestor
  A
character may be homologous among species but a character state may not
    The 5-toed state is homologous in humans and lizards, but the 3-toed state is not homologous in guinea pigs and sloths
  The wings of birds and those of bats are not homologous as wings (convergent evolution), although their forelimbs in general are homologous structures

Maximum Parsimony (Cladistic)
  Occam’s Razor
    Entia non sunt multiplicanda praeter necessitatem.
    William of Occam (c. 1287-1347)
  The best tree is the one which requires the fewest substitutions

Parsimony and Phylogeny
  Most closely related taxa should have the most traits in common
  Assume that traits are independent, heritable, and variable in target taxa
  Traits may be DNA sequence, presence or absence of skeletal elements or floral parts, mode of embryonic development, etc.
  Traits scored in different taxa must be homologous

Parsimony and Phylogeny
  Shared derived characters (ONLY) are used to deduce the branching patterns of the tree
    Synapomorphy
  Synapomorphies are used to attach two branches at a NODE on the tree

Molecular Synapomorphies

Parsimony and Phylogeny
  Traits may revert to ancestral form because of mutation or selection
    Reversal
  This may destroy phylogenetic signal and lead to reconstruction of misleading relationships
  Convergence and reversal are collectively known as homoplasy

Molecular Homoplasy via Reversal

Parsimony and Phylogeny
  Homoplasy
    Creates noise in the data
    Some characters give conflicting information about relationships
  Systematists try to minimize homoplasy in a data set
    Choose characters that evolve slowly relative to the age of the taxa

Parsimony and Phylogeny
  Parsimony minimizes the total amount of evolutionary change in a tree
  Synapomorphies are usually more common than convergence and reversal
  Most parsimonious trees minimize homoplasy to give the best estimate of phylogeny

Fitch (equal-weighted) parsimony
  Data for site 1 shown on tree topology for all 16 possible combinations of states at the 2
interior nodes. Character length is 2 for this site.

Tree length (or tree score)
  Tree length = sum of the character lengths over all sites
  [Figure: total steps = 2 (character length from site 1) + 1 (character length from site 2) + 2 + 2 + ... + 1 = 237; other topologies for taxa A, B, C, D score between 225 (best) and 241 (worst)]
  This value is used to compare this tree topology to other tree topologies (smaller is better)

Phylogenetic Characters
  Which characters should be used to reconstruct the correct phylogeny?
  Morphological characters
    e.g., skeleton
    For fossils, only morphological characters can be used
    Morphological characters are difficult to use because a taxonomic expert is needed
  Molecular characters
    Allozymes, RFLPs, DNA sequences
    MUST CHOOSE A MOLECULAR MARKER THAT IS APPROPRIATE
    The best molecular marker is one which has plenty of variation (= phylogenetic signal) yet not too much homoplasy (not too variable!)

Phylogenetic Characters
  Which characters should be used to reconstruct the phylogeny?
  Molecular data have the advantage that they can be rapidly collected and scored
  However, homoplasy is difficult to identify
    Only four bases: G, A, T, C
  Multiple types of data (including multiple gene sequences) are often the best

What sequences should I use for organism phylogenies?
  [Figure: candidate sequences placed along a slowly evolving / fast evolving axis; labels: rRNA, mitochondrion, nuclear, chloroplast]

Other Phylogenetic methods
  Parsimony is not the only method for estimating phylogenetic relationships!…
  Some pitfalls of parsimony…
    It can take quite a long time to compute a parsimony estimate of a phylogeny…
    Parsimony may also be very error prone, because it does not account well for homoplasy, when:
      rates of evolution are variable
      very divergent species (or OTUs) are compared

Other Phylogenetic methods
  Other reconstruction methods
  Distance (phenetic) methods
    e.g., Neighbor Joining and UPGMA
    Based on a clustering technique
    Based on overall similarity
    Not a cladistic method
    Use differences (distances) among character states to group taxa

Using Distance Methods to Reconstruct Phylogenetic Relationships
  Species with the LEAST genetic distance (or other distance) between them are assumed to be CLOSE relatives
  However, there are MANY cases where this may NOT be true!

Distance-Based Methods (UPGMA, Neighbor Joining, etc.)
  Distance methods are typically very fast and easy to use to estimate a phylogenetic tree
  However, they are not cladistic because they do not look for synapomorphies, but rather overall similarity…
  This means these methods are also susceptible to a lot of error when a dataset has a lot of homoplasy…

Distance methods
  Normally fast and simple
    e.g., UPGMA, Neighbour Joining, Minimum Evolution, Fitch-Margoliash
  Correction for multiple hits
    Only differences can be observed directly, not distances
    All distance methods rely (crucially) on this
    A great many models are used for nucleotide sequences (e.g., JC, K2P, HKY, REV, or maximum-likelihood distances)
    Amino acid (AA) sequences are considerably more complicated!
    Accuracy falls off drastically for highly divergent sequences

Distance methods
  Attempts to account for multiple hits using models in distance methods (observed vs. estimated amount of evolutionary
distance)

Other Phylogenetic methods
  Maximum likelihood assumes a particular model of sequence evolution and calculates how likely the observed character data are for each candidate tree and its branch lengths
  Uses all data, even autapomorphies and invariant sites
  Uses models of evolution designed to capture a pattern of change across characters (e.g., DNA)
  Allows us to account for complex patterns of nucleotide evolution across regions of genes that may evolve very differently (thus, not all types of changes are weighted evenly in determining the phylogeny…)
  Let's look at an example… although we will save more heated discussions of patterns for Bayesian MCMCMC methods…

  [Figure: Within vs. Between Gene Variation (transversions). Relative rate of substitution (G-T = 1) for the A-C, A-T and C-G classes, plotted against length along genome for Gene 1 and Gene 2]

Maximum Likelihood Methods
  Likelihood methods are among the most accurate methods to reconstruct phylogenies!
  However, they are VERY computationally intensive
    a tree with 30 species may take several days; one with 100 species may take several months!
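To make the link between a model of evolution, the correction for multiple hits, and likelihood more concrete, here is a minimal Python sketch for the simplest possible case: two aligned sequences under the Jukes-Cantor (JC) model. The function names and the toy sequences are illustrative assumptions, not taken from any phylogenetics package. For two sequences, the JC-corrected distance is also the branch length that maximizes the JC likelihood, so the same small example shows both ideas.

```python
import math

def p_distance(seq_a, seq_b):
    """Observed proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)

def jc69_distance(p):
    """Jukes-Cantor estimate of substitutions per site from the observed difference p."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: sequences are saturated under JC")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def jc69_log_likelihood(seq_a, seq_b, d):
    """Log-likelihood of two aligned sequences separated by branch length d > 0 under JC."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability that a site shows the same base after distance d
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    total = 0.0
    for a, b in zip(seq_a, seq_b):
        # 0.25 is the equilibrium frequency of the base seen in the first sequence
        total += math.log(0.25 * (p_same if a == b else p_diff))
    return total

# Toy (hypothetical) data: 2 differences in 10 sites
seq_a = "ACGTACGTAC"
seq_b = "ACGTACGTTT"
p = p_distance(seq_a, seq_b)
d = jc69_distance(p)
print("observed difference:", p)
print("JC-corrected distance:", round(d, 4))
for trial in (d / 2, d, d * 2):
    print("lnL at branch length", round(trial, 4), "=", round(jc69_log_likelihood(seq_a, seq_b, trial), 4))
```

The corrected distance is larger than the observed difference because some sites have been hit more than once, and the log-likelihood is highest at the corrected distance. A full ML tree search repeats this kind of calculation over every site, every branch and many candidate topologies, which is why it is so much slower.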
  New likelihood methods employing Bayesian statistics along with Markov Chain Monte Carlo algorithms are helping to solve this problem and are the cutting edge of phylogeny reconstruction these days…

Likelihood Methods
  Require a model of evolution
  Each substitution has an associated likelihood given a branch of a certain length
  A function is derived to represent the likelihood of the data given the tree, branch lengths and additional parameters
  So, the tree we get from ML is “the phylogeny that is most likely to have produced the observed data (under the model of evolution selected)”

The Likelihood Criterion
  Given two trees, the one maximizing the probability of the observed data is best
  Site likelihood: probability of the data for one site, conditional on the assumed model of evolution
  Tree score: sum of site log-likelihoods (note that “score” is also the general term for the derivative of the lnL)
  Unlike parsimony tree lengths, log-likelihoods are comparable across models as well as trees

Models can be made more parameter rich to increase their realism
  The most common additional parameters are:
    A correction to allow different substitution rates for each type of nucleotide change
    A correction for the proportion of sites which are unable to change
    A correction for variable site rates at those sites which can change
  The values of the additional parameters will be estimated in the process (e.g., in
PAUP)
  A gamma distribution can be used to model site rate heterogeneity

Long Branches Attract
  In a set of sequences evolving at different rates, the sequences evolving rapidly are drawn together
  Distance methods are VERY prone to making this error
  Parsimony is also prone to this error
  Likelihood methods employ an ‘informed’ view of character change (a model), which helps identify situations that probably represent homoplasy, thus decreasing long branch attraction (LBA)

Phylogenetic Methods…
  It is useful to use a variety of tree reconstruction methods
  If methods are congruent, you have more confidence in your reconstructions!

Reconstructing Phylogenies
  Phylogenies can be useful tools to answer important evolutionary questions
  One must always question the methods used to reconstruct the phylogeny to be confident in the results

Comparison of methods
  Inconsistency
  Neighbour Joining (NJ) is very fast but depends on accurate estimates of distance. This is more difficult with very divergent data
  Parsimony (MP) suffers from Long Branch Attraction. This may be a particular problem for very divergent data
  NJ (and less so, MP) can suffer from Long Branch Attraction
  Parsimony is also computationally intensive
  Codon usage bias can be a problem for MP and NJ
  Maximum Likelihood is the most reliable but depends on the choice of model and is very slow
  Methods may be combined

Finding the best tree…??
  How do we find the best tree?
  With a small number of taxa you can evaluate all possible trees
    Exhaustive search
  With more taxa the number of possible trees grows faster than exponentially
    For 8 taxa in the artiodactyl tree there are 10,395 possibilities
  Must use a shortcut method to evaluate trees

Tree Space Search Strategies
  Exhaustive
  Branch and Bound
  Heuristic
    Branch Swapping

Exhaustive Search
  Sequences   Number of unrooted, binary trees
  4           3
  5           15
  6           105
  7           945
  8           10,395
  9           135,135
  10          2,027,025
  11          34,459,425
  12          654,729,075
  13          13,749,310,575
  14          316,234,143,225
  15          7,905,853,580,625
  16          213,458,046,676,875
  17          6,190,283,353,629,375
  18          191,898,783,962,510,625
  19          6,332,659,870,762,850,625
  20          221,643,095,476,699,771,875
  This run (20 sequences) would just about be finished had we started it at the time prokaryotes diverged from eukaryotes (about 2.5 billion years ago!)

Branch-and-bound
  [Figure: branch-and-bound search. At the start, we know this tree has length 1982 steps; lineages exceeding 1982 steps are out of contention, lineages still under 1982 steps are followed further until the best tree is found]
  Guaranteed in theory to always find the best tree
  If using MP with a B&B search for 50 taxa, plan on between 2 days and 2 weeks

Tree Islands
  Imagine this depicts “tree space”
  [Figure: a landscape with 5 peaks, only 1 of which represents the global optimum (the best tree, i.e. the globally optimal tree); the other peaks are locally optimal trees. A heuristic search started on the wrong peak will not find the global optimum]
  Heuristic search algorithms are “hill climbers”: they only climb up
  However, this type of search is typically the only choice we have, even for fairly small datasets…
  Solution
    Do 100-1000 replicates starting in different parts of tree space to find the global optimum

How confident am I that my tree is correct?
  Bootstrap values
  Bootstrapping is a statistical technique that uses random resampling of the data to estimate the sampling error of tree topologies

How Reliable is a Phylogeny?
  How do we evaluate confidence in a tree?
  Bootstrap values are the percentage of replicates in which the same branch arose after repeated resampling
  Bootstrap support over 70% is often taken to indicate that the correct relationship was probably found
  Investigators usually report any bootstrap value over 50%

Bootstrapping phylogenies
  Characters are resampled with replacement to create many bootstrap replicate data sets
  Each bootstrap replicate data set is analysed (e.g., with parsimony, distance, ML, etc.)
  Agreement among the resulting trees is summarized with a majority-rule consensus tree
  Frequencies of occurrence of groups, the bootstrap proportions (BPs), are a measure of support for those groups

Bootstrapping: how reliable is our phylogeny, or part of our phylogeny?
  Bootstrap replicate #1…
  …repeat this random resampling of data lots of times and see how many of these ‘pseudodatasets’ estimate a particular relationship

Bootstrap - interpretation
  Bootstrapping is a very valuable and widely used technique (it is demanded by some journals)
  BPs give an idea of how likely a given branch would be to be unaffected if additional data, with the same distribution, became available
  BPs are not the same as confidence intervals. There is no simple relationship between bootstrap values and confidence intervals. There is no agreement about what constitutes a ‘good’ bootstrap value (> 70%, > 80%, > 85% ????)
  Some theoretical work indicates that BPs can be a conservative estimate of confidence intervals
  If the estimated tree is inconsistent, all the bootstraps in the world won’t help you…

Bootstrap - interpretation
  Let's consider why bootstraps for a particular relationship may be low…
  This can mean 2 things:
    There is conflicting signal in the data, whereby some evidence supports one relationship while other evidence supports another
    There is simply very little evidence overall (not conflicting evidence, but only one or two characters in total support this relationship)
      This is often the case if speciation is rapid and lineages split from one another (or a common ancestor) rapidly
  In general, rapid evolutionary radiations are nearly impossible to estimate accurately without enormous numbers of characters (because of low phylogenetic signal, i.e. little evidence for exclusive common ancestry)

Bootstrapping
  [Figure: trees for Ochromonas, Symbiodinium, Prorocentrum, Loxodes, Tracheloraphis, Spirostomum, Euplotes, Tetrahymena and Gruberia with bootstrap values on branches (59, 71; and 16, 59, 26, 71, 16, 21), the second shown as a majority-rule consensus (with minority components)]
  Wim de Grave et al. Fiocruz bioinformatics training course

MP: tree length based tests of hypotheses
  Templeton test
    Compare the length (# steps) of the optimal tree vs. the length of the tree that would result from the topological hypothesis you want to test
    Do this statistically using a 1-tailed Wilcoxon signed-rank test

Likelihood-based tests of topologies
  Kishino-Hasegawa test
    Trees specified a priori
    KH can be used to test whether two competing hypotheses have significantly different likelihoods
    NB: should not be used to test trees that have been chosen on the basis of the data!
  Shimodaira-Hasegawa test
    Can be used to test confidence in the ML tree compared to related trees (e.g., the second most likely tree from the data)
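
The bootstrap procedure described in the slides above (resample characters with replacement, re-analyse each pseudodataset, count how often a group is recovered) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the alignment is made up, all function names are hypothetical, and the "analysis" step is a deliberately crude stand-in (which two taxa have the fewest differences) rather than a real parsimony, distance or ML tree search.

```python
import random
from itertools import combinations

def bootstrap_columns(alignment, rng):
    """Resample alignment columns (characters) with replacement: one pseudodataset."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}

def differences(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def closest_pair(alignment):
    """Toy stand-in for tree inference: the pair of taxa with the fewest differences.
    Ties are broken alphabetically."""
    names = sorted(alignment)
    return min(combinations(names, 2),
               key=lambda pair: differences(alignment[pair[0]], alignment[pair[1]]))

def bootstrap_support(alignment, pair, replicates=1000, seed=0):
    """Percentage of bootstrap replicates in which `pair` is recovered as the closest pair."""
    rng = random.Random(seed)
    target = tuple(sorted(pair))
    hits = sum(closest_pair(bootstrap_columns(alignment, rng)) == target
               for _ in range(replicates))
    return 100.0 * hits / replicates

# Hypothetical toy alignment (4 taxa, 12 sites)
toy = {
    "A": "ACGTACGTACGT",
    "B": "ACGTACGTACGA",
    "C": "ACGAACCTTCGA",
    "D": "TCGAACCTTGGA",
}
print("support for A+B:", bootstrap_support(toy, ("A", "B"), replicates=200, seed=1), "%")
```

In a real analysis each replicate would be run through the same tree-building method as the original data, and the support for every group would be read off a majority-rule consensus of the replicate trees.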