IE68 - Biological databases
Phylogenetic analysis

Phylogenetic analysis
• Phylogeny: a reconstruction of the evolutionary (genealogical) history of a group of organisms, genes or proteins from biological data
  – organisms: populations, species, genera, ... => taxa => operational taxonomic units (OTUs)
  – data: molecular, morphological, archaeological, ... => characters
• Phylogenetic tree: the graphical representation of a phylogeny
  – tree structure: phylogram, cladogram

Phylogenetic tree
• A tree consists of nodes connected by branches
• Terminal nodes A, B, C, D, E => OTUs for which we have data
• Root (placed with an outgroup or at the midpoint) => ancestor of all the taxa that comprise the tree
• Polytomy: a node with more than two descendant branches, e.g. (C,D,E)
• Notation: ((A,B),(C,D,E))

Phylogenetics <> Phenetics
• Phenetics: method of grouping taxa that is based on overall (dis)similarities of characters => with no reference to evolution!
• Phylogenetics: method of grouping taxa that is based on shared derived characters (synapomorphies) or on a model of evolution

Why do we need phylogenies?
• Intrinsic interest in the tree => tree of life, origin of organisms
• Phylogenies can also be used as tools for investigating other problems
  – e.g. biogeography: the phylogeny reflects the order of separation of the areas that the different taxa occupy
  – e.g. forensic science

Phylogenetic analysis
• Molecular phylogenetics: reconstruction of the evolutionary (genealogical) history of a group of organisms from molecular data, i.e.
DNA or protein sequences
• In this lecture, we will focus on phylogenetic analysis of organisms based on DNA sequence data

Molecular phylogenetics: approach
Step 1: PCR with primers that target cytoplasmic DNA or nuclear loci of taxa, followed by DNA sequence analysis
Step 2: Multiple DNA sequence alignment
Step 3: Phylogenetic analysis

PCR and DNA sequencing
• Which loci? Consider available DNA sequence information, primers, variability, single or low copy number, orthology, neutrality, recombination, ...
• Gene trees versus organismal trees: phylogenies for genes do not always match those of their corresponding organisms => analyse more than one gene

Confounding influence of gene duplication
• 2 types of homology: orthology (arising from speciation) and paralogy (arising from gene duplication)

Lineage sorting and coalescence
[Figure: alleles sorting within species lineages]

Multiple DNA sequence alignment
• Problem: alternative alignments. It is possible to align any two sequences by postulating some combination of gaps (insertions/deletions = indels) and substitutions => which one to choose?
• Basic task of sequence alignment: find the alignment with the highest similarity, smallest distance, or lowest overall cost

Multiple DNA sequence alignment
• 2 sequences + scoring scheme => optimal alignment
• Scoring scheme:
  – scoring matrix: distance weights or similarity scores for each pair of aligned bases, e.g.
transition/transversion matrix:

        A  T  G  C
    A   0  5  1  5
    T   5  0  5  1
    G   1  5  0  5
    C   5  1  5  0

  – gap weight, cost or penalty

Multiple DNA sequence alignment
• Cost of an alignment: D = s + wg
  – s = number of substitutions
  – g = total length of gaps
  – w = gap penalty = cost of a gap relative to a substitution
• The gap penalty w makes implicit assumptions about how the sequences have evolved: if indels are thought to be rare, then w should be large (and vice versa)
  => have to use knowledge of biology, e.g. translation (3 bp indel, position), transition <> transversion, ...

Multiple DNA sequence alignment
• Software programs: e.g. CLUSTALW (global alignment) http://www.ebi.ac.uk/clustalw/index.html
• The optimal alignment is not always the true alignment => new developments: phylogenetic analysis without the multiple DNA sequence alignment step

Inferring phylogenies from DNA sequences
Sequence alignment (rows = taxa A, B, C):
    A  ..AGCGTCT..
    B  ..AGCGTGT..
    C  ..AG-GAGT..
=> Phylogenetic methods => unrooted tree or rooted tree for taxa A, B, C (the alignment columns are the characters)

Phylogenetic methods

                                      Character-based methods       Non-character-based methods
    Explicit model of evolution       Maximum likelihood methods    Pairwise distance methods
    No explicit model of evolution    Maximum parsimony methods

Pairwise distance methods
• Dissimilarity matrix: count the number of differences between all possible pairs of sequences
• Convert dissimilarity to evolutionary distance by correcting for multiple events per site according to a certain model of evolution
• Infer tree topology on the basis of the evolutionary distances by using a clustering algorithm or an optimality criterion

3 taxa, 3 sequences => dissimilarity matrix => evolutionary distance matrix => tree

    Pair   Dissimilarity   Evolutionary distance
    1-2    0.26            0.32
    1-3    0.20            0.23
    2-3    0.33            0.44

Models of sequence evolution
[Figure: observed versus expected difference between sequences; the expected relationship is linear, the observed one is not => correction]
• Apply a substitution model that tries to estimate the correct number of substitutions

Models of sequence evolution
• Distance "correction" methods convert observed distances into measures that correspond to the actual evolutionary distances
• Several methods have been proposed, all with different assumptions about the nature of the evolutionary process
• Essentially they differ by the number of parameters they include
• We can use a general framework to show how these models are inter-related

Substitution models: general framework
[Figure: general framework of substitution models]

e.g.
Model of Jukes & Cantor (JC)
• One of the first models proposed, and perhaps the simplest model of evolution
• Assumes that all four bases have equal frequency and that all substitutions are equally likely
• Under this model, the distance between any two sequences is given by $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$, where p is the proportion of nucleotides that differ between the two sequences

e.g. Kimura 2-parameter model (K2P)
• Incorporates the observation that transitions accumulate more rapidly than transversions
• Assumes that all four bases have equal frequencies but that there are 2 rate classes for substitutions
• Under this model, the distance between any two sequences is given by $d = \frac{1}{2}\ln\frac{1}{1-2P-Q} + \frac{1}{4}\ln\frac{1}{1-2Q}$, where P and Q are the proportional differences between the two sequences due to transitions and transversions, respectively

Substitution models
• Other models: adding more parameters
  – Felsenstein model (F81): variation in base composition => base frequencies f = [A C G T] may vary
  – Hasegawa-Kishino-Yano (HKY) model: unequal base frequencies, different rates for transitions and transversions
  – General reversible model (REV): unequal base frequencies, all six pairs of substitutions have different rates
  => ideally, we want the simplest model we can get away with that still yields a reasonable estimate

Substitution models
• Assumptions of these models:
  – all nucleotide sites change independently
  – base composition is at equilibrium
  – the substitution rate is constant over time and in different lineages
  – each site in a sequence is equally likely to undergo substitution
    => a gamma distribution has a parameter that specifies the range of rate variation among sites: model + Γ

Pairwise distance methods
• Dissimilarity matrix: count the number of differences between all possible pairs of sequences
• Convert dissimilarity to evolutionary distance by correcting for multiple events
per site according to a certain model of evolution
• Infer tree topology on the basis of the evolutionary distances by using a clustering algorithm

Clustering methods
• Clustering methods follow a set of steps (an algorithm) and arrive at a tree
  – UPGMA (Unweighted Pair Group Method using Arithmetic Averages): results in a rooted, additive tree that assumes a molecular clock (ultrametric)
  – Neighbor-joining: results in an unrooted, additive tree
  – Other approaches: least-squares, Fitch, Kitsch, ...

UPGMA clustering
Distance matrix:

        B  C  D
    A   2  4  6
    B      4  6
    C         6

• The smallest distance is d_AB = 2 => join A and B, with a branch of length 1 to each
• Compute new distances between (AB) and the other OTUs:
    d_(AB)C = (d_AC + d_BC) / 2 = 4
    d_(AB)D = (d_AD + d_BD) / 2 = 6

UPGMA clustering
Reduced distance matrix:

          C  D
    (AB)  4  6
    C        6

• The smallest distance is d_(AB)C = 4 => join (AB) and C at height 2
• Compute new distances between (ABC) and the other OTUs:
    d_(ABC)D = (d_AD + d_BD + d_CD) / 3 = 6
  UPGMA averages over all pairs of members; here the simple average (d_(AB)D + d_CD) / 2 gives the same value
• Join (ABC) and D at height 3 => tree (((A,B),C),D) with nodes at heights 1, 2 and 3

Clustering methods
• UPGMA: additive and ultrametric distances => assumes a molecular clock => very sensitive to unequal rates of evolution! => relative-rate test
• Use other clustering methods for phylogeny, e.g.
Neighbor-joining
• "Goodness of fit" statistics: used to select the metric tree that best accounts for the observed distances

Pairwise distance methods
• Dissimilarity matrix: count the number of differences between all possible pairs of sequences
• Convert dissimilarity to evolutionary distance by correcting for multiple events per site according to a certain model of evolution
• Infer tree topology on the basis of the evolutionary distances by using an optimality criterion

Minimum evolution
• Distance matrix => unrooted metric trees
• Each tree has a length L, which is the sum of all the branch lengths
• Optimality criterion: the minimum evolution (ME) tree is the tree that minimizes L

Pairwise distance methods
• Advantages
  – very fast
  – based on a model of evolution
• Disadvantages
  – sequence information is reduced to one number per pair of sequences
  – branch lengths may not be biologically interpretable
  – most methods provide only one tree topology
  – dependent on the model of evolution used

Character-based methods
• Character-based (discrete) methods operate directly on sequences, rather than on pairwise distances
• Two major discrete methods:
  – Maximum parsimony (MP): chooses the tree(s) that require the fewest evolutionary changes
  – Maximum likelihood (ML): chooses the tree(s) under which the observed data are most probable

Maximum parsimony
• Maximum parsimony infers a phylogenetic tree by minimizing the total number of evolutionary steps
• Principle:
  – Investigate all possible tree topologies
  – Reconstruct ancestral sequences
  – Choose the topology with the smallest number of steps

Maximum parsimony - principle
[Figure: the 3 possible tree topologies (A, B, C) for 4 taxa (1-4)]

Maximum parsimony - generalized
• In the previous example, the cost of each substitution was "one step" => equal weights
• Instead, we can use different costs for different types of change (e.g. transitions vs transversions) to better match our assumptions about evolutionary processes => weighted parsimony according to Dollo, Wagner, Fitch, ...

Maximum parsimony - characters
[Figure]

Maximum parsimony – search methods
• Number of unrooted tree topologies for n sequences: $N_u = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
i.e. 3 sequences ~ 1 tree, 4 sequences ~ 3 trees, 5 ~ 15, 6 ~ 105, ...
  => the more sequences (~ taxa), the more trees => computationally expensive
• Finding optimal trees:
  – Exhaustive search: limited number of taxa (<10); finds the minimum tree among all possible trees
  – Branch and bound: small number of taxa (<18); finds the minimum tree without evaluating all trees, by discarding during tree construction those families of trees that cannot be shorter than the shortest tree found so far
  – Heuristic search: large number of taxa; explores a subset of all possible trees by stepwise addition of taxa plus a rearrangement process (branch swapping), but is not guaranteed to find the minimal tree (it may stop at a local optimum instead of the global optimum)

Maximum parsimony - output
• Consensus tree: MP can yield multiple equally parsimonious (optimal) trees => relationships common to all the optimal trees are summarized with a consensus tree
  – Strict consensus: includes only the splits found in all trees
  – Majority-rule consensus: includes the splits found in the majority of the trees (> 50%)

Maximum parsimony - output
• Consistency index (CI) and retention index (RI)
  – measures of the parsimony fit of a character to a tree, or of the average fit of all characters to a tree
  – more specifically: indices of how much homoplasy the constructed tree has
  – values range from 0 to 1
  – higher value => less homoplasy

Parsimony – branch support and tree stability
• Bootstrap analysis
  – a resampling technique used to measure sampling error
  – gives an idea of the reliability of branches and clusters
  – original dataset => resample => construct trees => compare trees with the original tree
  – a value > 70% means we can be quite confident of that part of the tree topology
• Decay index (Bremer support)
  – gives us a sense of how many steps would be
required before a grouping collapses
  – higher value => better branch support

Maximum parsimony
• Advantages
  – based on shared derived characters
  – evaluates different tree topologies
  – does not reduce the sequence information to a single number
• Disadvantages
  – computationally intensive for large datasets
  – no correction for multiple mutations
  – sensitive to unequal rates of evolution (long branch attraction)

Maximum likelihood
• Statistical method
• Given some data D and a hypothesis H, the likelihood of the data is L_D = Pr(D|H), which is the probability of D given H

Maximum likelihood
• In the context of molecular phylogenetics
  – D is the set of sequences being compared
  – H is a phylogenetic tree
• We want to find the likelihood of obtaining the observed data given the tree
• The tree that makes the data the most probable evolutionary outcome is the maximum likelihood estimate of the phylogeny

Maximum likelihood
• In other words: which tree is most likely to have yielded these sequences (observed data) under a given model of evolution (JC, K2P, ...)?
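
To make L_D = Pr(D|H) concrete, here is a minimal sketch for the smallest possible "tree": two aligned sequences joined by a single branch of length d, under the JC model. The function names and the toy sequences are hypothetical and chosen for illustration only; the sketch assumes gap-free sequences of equal length and d > 0. It is not how PHYLIP or PAUP compute likelihoods for full trees.

```python
import math


def jc_site_probabilities(d):
    """Return (P(same base), P(one specific different base)) under JC for distance d."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay


def jc_log_likelihood(seq_a, seq_b, d):
    """Log-likelihood of two aligned, gap-free sequences separated by distance d > 0."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    p_same, p_diff = jc_site_probabilities(d)
    log_l = 0.0
    for x, y in zip(seq_a, seq_b):
        # 0.25 is the equal base frequency assumed by JC; sites are treated as independent.
        log_l += math.log(0.25 * (p_same if x == y else p_diff))
    return log_l


def jc_distance(seq_a, seq_b):
    """Closed-form JC distance d = -(3/4) ln(1 - (4/3) p)."""
    p = sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def ml_distance_by_grid(seq_a, seq_b, step=0.001, upper=2.0):
    """Scan candidate distances and return the one with the highest log-likelihood."""
    candidates = [step * k for k in range(1, int(upper / step) + 1)]
    return max(candidates, key=lambda d: jc_log_likelihood(seq_a, seq_b, d))


if __name__ == "__main__":
    a = "AGCGTCTAAC"  # hypothetical toy sequences, 2 differences in 10 sites
    b = "AGCGTGTAAT"
    print("JC distance (closed form):", round(jc_distance(a, b), 3))
    print("ML distance (grid search):", round(ml_distance_by_grid(a, b), 3))
```

For two sequences, the distance that maximizes the likelihood should agree with the JC distance formula up to the grid step. With more than two sequences, the same per-site probabilities are combined over all branches and all possible ancestral states of a tree, which is why ML is so much slower.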
Maximum likelihood
• Advantages
  – Statistically well founded
  – Based on a model of evolution
  – Evaluates different topologies
  – Uses all sequence information
  – Often yields estimates that have lower variance than other methods
• Disadvantages
  – Very slow (computationally intensive)
  – Dependent on the model of evolution used

Software programs for phylogenetic analysis
• Overview: http://evolution.genetics.washington.edu/phylip/software.html
• Most widely used software programs
  – PHYLIP: freely available (downloadable or online at http://bioweb.pasteur.fr/seqanal/phylogeny/phylipuk.html)
  – PAUP: user friendly but not freely available

Phylogenetic information on the internet
• http://tolweb.org/tree/phylogeny.html
• http://www.treebase.org/treebase/
• ...

If you need more information
• Jacqueline Vander Stappen, K.U.Leuven, Laboratory of Gene Technology, Kasteelpark Arenberg 21, B-3001 Leuven
• Jacqueline.vanderstappen@agr.kuleuven.ac.be