Phylogenetic Reconstruction

A simple way to understand evolution...

DATA: Alignment of gene sequences
How can we transform this information into a historical context?

Phylogeny inference
1. Distance-based methods
   - Pairwise distance matrix
   - Adjust tree branch lengths to fit the distance matrix (e.g. least squares, neighbor joining)
2. Character-based methods
   - Parsimony
   - Maximum likelihood (model-based evolution)

In 1866, Ernst Haeckel coined the word “phylogeny” and presented phylogenetic trees for most known groups of living organisms.

The Tree of Life project
Surf the tree of life at: http://tolweb.org/tree/phylogeny.html

What is a tree?
A tree is a mathematical structure used to model the actual evolutionary history of a group of sequences or organisms, i.e. an evolutionary hypothesis.
• A tree consists of nodes connected by branches.
• The ancestor of all the sequences is the root of the tree.
• Internal nodes represent hypothetical ancestors.
• Terminal nodes represent sequences or organisms for which we have data. Each is typically called an “Operational Taxonomic Unit” (OTU).

Types of Trees: Rooted vs. Unrooted
(M is the number of OTUs; the counts hold for fully bifurcating trees)

                        Rooted     Unrooted
  Branches   Interior   M - 2      M - 3
             Total      2M - 2     2M - 3
  Nodes      Interior   M - 1      M - 2
             Total      2M - 1     2M - 2

The number of rooted and unrooted trees:

  Number of OTUs   Possible rooted trees   Possible unrooted trees
        2                      1                        1
        3                      3                        1
        4                     15                        3
        5                    105                       15
        6                    945                      105
        7                  10395                      945
        8                 135135                    10395
        9                2027025                   135135
       10               34459425                  2027025

Types of Trees: Bifurcating vs. Multifurcating (polytomy)
Polytomies: Soft vs. Hard
• Soft: designates a lack of information about the order of divergence.
• Hard: the hypothesis that multiple divergences occurred simultaneously.

Types of Trees: Trees vs. Networks
• Trees: only one path between any pair of nodes
• Networks: more than one path between any pair of nodes

A shorthand for trees: the Newick format
  ((1,2),(3,4))
  (((1,2),((3,4),5)),6)
[Figure: the two trees drawn with tips 1-4 and 1-6]

Comments on Trees
• Trees give insights into the underlying data.
• Identical trees can appear different depending upon the method of display.
• Information may be lost when creating the tree. The tree is not the underlying data.
[Figure: trees of A, B and C drawn with different tip orders]

Different kinds of trees can be used to depict different aspects of evolutionary history
1. Cladogram: simply shows relative recency of common ancestry.
2. Additive trees: a cladogram with branch lengths, also called phylograms or metric trees.
3. Ultrametric trees (dendrograms): a special kind of additive tree in which the tips are all equidistant from the root.
[Figure: example trees with branch lengths]

Making trees according to morphological features
Ridley, New Scientist (Dec. 1983) 100, 647-51

Given a multiple alignment, how do we construct the tree?
  A  GCTTGTCCGTTACGAT
  B  ACTTGTCTGTTACGAT
  C  ACTTGTCCGAAACGAT
  D  ACTTGACCGTTTCCTT
  E  AGATGACCGTTTCGAT
  F  ACTACACCCTTATGAG

Distance methods
Logic: Evolutionary distance is a tree metric and hence defines a tree.
General Method:
• Evolutionary distances are computed for all pairs of taxa.
• A phylogenetic tree is constructed by considering the relationships among these distance data (fitting a tree to the matrix).
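As a rough illustration of the first step of the general method, the sketch below computes all pairwise distances for the six example sequences A-F shown above. It assumes the simplest possible distance, the raw count of mismatching sites; that choice is made here only for illustration, and corrected distances (for example under a substitution model) would give different values. The function names are hypothetical.

```python
from itertools import combinations

# Example alignment from the slide (taxa A-F, 16 aligned sites each).
ALIGNMENT = {
    "A": "GCTTGTCCGTTACGAT",
    "B": "ACTTGTCTGTTACGAT",
    "C": "ACTTGTCCGAAACGAT",
    "D": "ACTTGACCGTTTCCTT",
    "E": "AGATGACCGTTTCGAT",
    "F": "ACTACACCCTTATGAG",
}


def mismatch_count(s1, s2):
    """Number of aligned sites at which two sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s1, s2))


def distance_matrix(alignment):
    """Map each unordered pair of taxa to its raw mismatch count."""
    return {
        frozenset((t1, t2)): mismatch_count(alignment[t1], alignment[t2])
        for t1, t2 in combinations(alignment, 2)
    }


if __name__ == "__main__":
    d = distance_matrix(ALIGNMENT)
    taxa = list(ALIGNMENT)
    # Print the lower triangle, one row per taxon.
    for i, row in enumerate(taxa[1:], start=1):
        print(row, *(d[frozenset((row, col))] for col in taxa[:i]))
```

Storing each pair as a `frozenset` makes the lookup symmetric, which mirrors the symmetry rule d(a,b) = d(b,a) that any metric distance has to satisfy.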
Methods we’ll talk about
• UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
• Neighbor Joining

Distance methods: Ultrametric Trees
[Figure: an ultrametric tree of a, b and c with branch lengths]

Metric distances must obey 4 rules:
  Non-negativity         d(a,b) >= 0
  Symmetry               d(a,b) = d(b,a)
  Triangle inequality    d(a,c) <= d(a,b) + d(b,c)
  Distinctness           d(a,b) = 0 iff a = b

Construction of a distance tree using clustering with the Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
First, construct a distance matrix:
  A  GCTTGTCCGTTACGAT
  B  ACTTGTCTGTTACGAT
  C  ACTTGTCCGAAACGAT
  D  ACTTGACCGTTTCCTT
  E  AGATGACCGTTTCGAT
  F  ACTACACCCTTATGAG

       A   B   C   D   E
  B    2
  C    4   4
  D    6   6   6
  E    6   6   6   4
  F    8   8   8   8   8
From http://www.icp.ucl.ac.be/~opperd/private/upgma.html

UPGMA: First round
       A   B   C   D   E
  B    2
  C    4   4
  D    6   6   6
  E    6   6   6   4
  F    8   8   8   8   8
Choose the most similar pair (here A and B, distance 2), cluster them together and calculate the new distance matrix:
  dist(A,B),C = (distAC + distBC) / 2 = 4
  dist(A,B),D = (distAD + distBD) / 2 = 6
  dist(A,B),E = (distAE + distBE) / 2 = 6
  dist(A,B),F = (distAF + distBF) / 2 = 8

       A,B   C   D   E
  C     4
  D     6    6
  E     6    6   4
  F     8    8   8   8

UPGMA: Second round (D and E are clustered)
       A,B   C   D   E
  C     4
  D     6    6
  E     6    6   4
  F     8    8   8   8

Third round (A,B and C are clustered)
        A,B   C   D,E
  C      4
  D,E    6    6
  F      8    8    8

UPGMA: Fourth round (AB,C and D,E are clustered)
        AB,C   D,E
  D,E     6
  F       8     8

Fifth round (F joins last)
        ABC,DE
  F       8

Note that this method identifies the root of the tree.

A tree of human mitochondria sequences
http://www.genpat.uu.se/mtDB/
• The mitochondrial genome has about 16,500 base pairs.
• In 2000, Gyllensten and colleagues sequenced the mitochondrial genomes of 53 people of diverse geographical, racial and linguistic backgrounds.
• A molecular clock seems to hold: these sequences diverge at a rate of 1.7 x 10^-8 substitutions per site per year.
Ingman, M., Kaessmann, H., Pääbo, S. & Gyllensten, U. (2000) Nature 408: 708-713.

The deepest branches lead exclusively to sub-Saharan mtDNAs, with the second branch containing both Africans and non-Africans.
[Figure: tree with the sub-Saharan mtDNA branches marked]

A tree of 86 mitochondrial sequences.
Downloaded from http://www.genpat.uu.se/mtDB/sequences.html and analyzed using MEGA, method: UPGMA

Rooting the tree with an outgroup
[Figure: tree with the root and the outgroup marked]
Ingman, M., Kaessmann, H., Pääbo, S. & Gyllensten, U. (2000) Nature 408: 708-713.

Phylogeny based upon the molecular clock
• Evidence for a human mitochondrial origin in Africa: African sequence diversity is twice as large as that of non-Africans.
• Gyllensten and colleagues estimate that the divergence of Africans and non-Africans occurred 52,000 ± 28,000 years ago.
Ingman, M., Kaessmann, H., Pääbo, S. & Gyllensten, U. (2000) Nature 408: 708-713.

UPGMA assumes a molecular clock
• The UPGMA clustering method is very sensitive to unequal evolutionary rates (it assumes that the evolutionary rate is the same for all branches).
• Clustering works only if the data are ultrametric.
• Ultrametric distances are defined by the satisfaction of the 'three-point condition'.
The three-point condition: for any three taxa A, B and C, the two greatest distances are equal.

UPGMA fails when rates of evolution are not constant
A tree in which the evolutionary rates are not equal
From http://www.icp.ucl.ac.be/~opperd/private/upgma.html

       A    B    C    D    E
  B    5
  C    4    7
  D    7   10    7
  E    6    9    6    5
  F    8   11    8    9    8
(Neighbor joining will get the right tree in this case.)

Neighbors
[Figure: unrooted four-taxon tree. A and B attach by branches a and b to one internal node, C and D attach by branches c and d to the other, and the two internal nodes are joined by branch x]
A and B are neighbors because they are connected through a single internal node. C and D are also neighbors, but A and D are not neighbors.
The 4-point condition can be used to identify neighbors. It basically states that neighbors are closer than non-neighbors.
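The three-point and four-point conditions mentioned above can both be tested directly on a distance matrix. The sketch below is only an illustration: the function names are hypothetical, the two matrices are the ones from the UPGMA slides (equal rates and unequal rates), and exact equality is used because these distances are small integers. Real data would need a tolerance.

```python
from itertools import combinations


def make_lookup(entries):
    """Turn {"AB": distance, ...} into a symmetric lookup function d(a, b)."""
    table = {frozenset(pair): dist for pair, dist in entries.items()}
    return lambda a, b: 0 if a == b else table[frozenset((a, b))]


def is_ultrametric(taxa, d):
    """Three-point condition: in every triple the two greatest distances are equal."""
    for a, b, c in combinations(taxa, 3):
        _, second, largest = sorted((d(a, b), d(a, c), d(b, c)))
        if second != largest:
            return False
    return True


def is_additive(taxa, d):
    """Four-point condition: in every quartet the two greatest pair sums are equal."""
    for a, b, c, e in combinations(taxa, 4):
        sums = sorted((d(a, b) + d(c, e), d(a, c) + d(b, e), d(a, e) + d(b, c)))
        if sums[1] != sums[2]:
            return False
    return True


TAXA = "ABCDEF"

# Matrix from the UPGMA worked example (rates equal on all branches).
CLOCK = make_lookup({
    "AB": 2, "AC": 4, "BC": 4, "AD": 6, "BD": 6, "CD": 6, "AE": 6, "BE": 6,
    "CE": 6, "DE": 4, "AF": 8, "BF": 8, "CF": 8, "DF": 8, "EF": 8,
})

# Matrix from the example in which evolutionary rates are not equal.
NO_CLOCK = make_lookup({
    "AB": 5, "AC": 4, "BC": 7, "AD": 7, "BD": 10, "CD": 7, "AE": 6, "BE": 9,
    "CE": 6, "DE": 5, "AF": 8, "BF": 11, "CF": 8, "DF": 9, "EF": 8,
})

if __name__ == "__main__":
    for name, d in (("clock", CLOCK), ("no clock", NO_CLOCK)):
        print(name, "ultrametric:", is_ultrametric(TAXA, d),
              "additive:", is_additive(TAXA, d))
```

Under these assumptions the second matrix fails the three-point condition but still passes the four-point condition, which is consistent with the remark that neighbor joining recovers the tree where UPGMA does not. Its smallest entry is d(A,C) = 4, so a clustering method that always joins the closest pair would group A with C first, even though the four-point sums in every quartet containing A and B group A with B.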
The Four Point Condition
[Figure: the four-taxon tree with tips A, B, C, D, terminal branches a, b, c, d and internal branch x]
  dAC + dBD = dAD + dBC = a + b + c + d + 2x = dAB + dCD + 2x
so the sum for the neighbor pairs (left) is smaller than either sum for the non-neighbor pairs (right):
  dAB + dCD < dAC + dBD
  dAB + dCD < dAD + dBC

Neighbor Joining (Saitou and Nei, 1987)
An algorithm for finding the shortest tree
Start with a star (no hierarchical structure)
[Figure: star tree with tips a, b, c, d and the formula for the length of the tree, annotated with the pairwise distances and the number of OTUs]

Character state methods
Logic: Examine each column in the multiple alignment of the sequences. Examine all possible trees and choose among them according to some optimality criterion.
Method we’ll talk about
• Maximum parsimony

Maximum Parsimony
Simpler hypotheses are preferable to more complicated ones, and ad hoc hypotheses should be avoided whenever possible (Occam’s Razor).
Thus, find the tree that requires the smallest number of evolutionary changes.

     0123456789012345
  W  ACTTGACCCTTACGAT
  X  AGCTGGCCCTGATTAC
  Y  AGTTGACCATTACGAT
  Z  AGCTGGTCCTGATGAC
[Figure: tree of W, X, Y and Z]

Maximum Parsimony
Start by classifying the sites:

                123456789012345678901
  Mouse         CTTCGTTGGATCAGTTTGATA
  Rat           CCTCGTTGGATCATTTTGATA
  Dog           CTGCTTTGGATCAGTTTGAAC
  Human         CCGCCTTGGATCAGTTTGAAC
  -----------------------------------
  Invariant     *  * ******** *****
  Variant        ** *        *     **
  -----------------------------------
  Informative    **                **
  Non-inform.       *        *

[Figure: the three possible unrooted trees for Mouse, Rat, Dog and Human, with ancestral states drawn for sites 2, 3 and 5]
Minimum number of changes each site requires on the three possible unrooted trees:

  Site (Mouse, Rat, Dog, Human)   ((Mouse,Rat),(Dog,Human))   ((Mouse,Dog),(Rat,Human))   ((Mouse,Human),(Rat,Dog))
  Site 2 (T, C, T, C)                        2                           1                           2
  Site 3 (T, T, G, G)                        1                           2                           2
  Site 5 (G, G, T, C)                        2                           2                           2

Site 5 needs two changes on every tree, which is why it is non-informative.

Maximum Parsimony
Number of informative sites (sites 2, 3, 20 and 21) supporting each tree:
  ((Mouse,Rat),(Dog,Human))    3
  ((Dog,Rat),(Mouse,Human))    0
  ((Mouse,Dog),(Rat,Human))    1

Maximum Parsimony
The situation is more complicated when there are more than four units.
[Figure: a larger tree with sets of possible ancestral states, such as (TAGC), (AT) and (TAG), at the internal nodes]

Problems with maximum parsimony:
• Only uses “informative” sites
• Long branches “attract”

Maximum Likelihood Analysis
• Similar to maximum parsimony, except that the different nucleotide substitutions are not considered to have equal probability.
• All possible unrooted trees are evaluated. (Same for parsimony)
• Each column of the alignment is processed. (Same for parsimony)
• The substitution A -> T will have a different probability than the substitution G -> C.
• Start with a frequency distribution table that specifies the probability of one base being substituted for another base. See probabilities of nucleotide substitution (Table 6.5 pg 275).
• The probability that the unrooted tree predicts each column of the alignment is calculated.
• The probabilities for the columns are multiplied together for each tree (equivalently, their logarithms are summed).
• The unrooted tree with the highest probability is chosen.

Maximum Likelihood Example
• Four sequences are compared (w, x, y and z).
• All unrooted trees are shown.
• In this example we will examine the first unrooted tree.

  Sequence w  ACGCGTTGGG
  Sequence x  ACGCGTTGGG
  Sequence y  ACGCAATGAA
  Sequence z  ACACAGGGAA
[Figure: the three unrooted trees for w, x, y and z]

Maximum Likelihood Example Continued
  L(Tree x) = L0 * L1 * L2 * L3 * L4 * L5 * L6
• L0: base probability of the nucleotide at node 0 (0.25)
• L1: probability of the nucleotide changing from the value at node 0 to the value at node 1
• L2: probability of the nucleotide changing from the value at node 0 to the value at node 2
• L3: probability of the nucleotide changing from the value at node 1 to the value at leaf w (T)
• L4, L5, L6: probability of the nucleotide changing to the value at the leaf
[Figure: tree for column TTAG. Leaves w, x, y, z (T, T, A, G) on branches L3, L4, L5, L6; internal nodes 1 and 2 joined to node 0 by branches L1 and L2; L0 at node 0]

Maximum Likelihood Example Continued
• There are 64 assignments of bases to the internal nodes to evaluate: (number of bases) ^ (number of internal nodes), or 4^3.
• We will show the evaluation of the assignment TTG (nodes 0, 1, 2) against the first unrooted tree for column TTAG.
• Determine values for L0, … L6. Values are determined by looking up probabilities in the substitution probability table:
  – Probability of L2 is T -> G
  – Probability of L5 is G -> A
  – Probability of L3 is T -> T
• Determine the combined probability L0 * L1 * L2 * … * L6.

Maximum Likelihood Example Continued
• Determine the probability for the combination TGG.
• Determine the probability for the other 62 combinations.
• Sum over all the combinations: L(Tree) = L(Tree1) + L(Tree2) + … + L(Tree64)
• Move to the next column and repeat the same procedure.
• Once all columns are complete, multiply the column probabilities together (equivalently, sum their logarithms). This is the likelihood of the first unrooted tree.
• Continue this process for the other unrooted trees.
• Pick the unrooted tree with the highest probability. This is the most likely unrooted tree.

IN VITRO EVOLUTION BY MEANS OF PCR

Conclusion
• Phylogenetic prediction can be used for more than evolutionary distance
  – Verification of taxonomy
  – Identification of unknowns
  – Techniques work for genetic and non-genetic data (fatty acid).
• Use multiple methods for verification
  – Pick at least two different types of methods from parsimony, distance and likelihood.
  – If the analyses are in agreement, there is a higher level of confidence that the analysis is correct.

BOOTSTRAPPING
How confident are we in this tree?

Bootstrapping
A statistical method that can be used to place confidence intervals on phylogenies

Resampling from the Data
Original data:
  human_myoglobin         -GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHL ...
  pig_myoglobin           -GLSDGEWQLVLNVWGKVEADVAGHGQEVLIRLFKGHPETLEKFDKFKHL ...
  horse_myoglobin         -GLSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGHPETLEKFDKFKHL ...
  common_seal_myoglobin   -GLSEGEWQLVLNVWGKVEADLAGHGQDVLIRLFKGHPETLEKFDKFKHL ...
  sperm_whale_myoglobin   MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHL ...
  sea_hare_myoglobin      -SLSAAEADLAGKSWAPVFANKDANGDAFLVALFEKFPDSANFFADFKG- ...

Resampled data number 1 (columns picked with replacement):
  human_myoglobin         LQKWDQKHNVHTEFGAEELQGDKLSWKKLDQGKKVVKKELGLDEDEWLGE
  pig_myoglobin           LQKWDQKHNVHTEFGAEELQGDKLSWKKLDQGKKVVKKELGLDEDEWLGE
  horse_myoglobin         LQKWDQTHNVHTEFGAEELQGDKLSWKTLDQGKKVVTKELGQDEDEWLGE
  common_seal_myoglobin   LQKWEQKHNVHTEFGADELQGDKLSWKKLDQGKKVVKKELGLDEDDWLGE
  sperm_whale_myoglobin   LQRWEQKHHVHTEFAADELQGDKLSWKKLDQGRKVVKKELGLDEDDWLGE
  sea_hare_myoglobin      LDDWADENKSNSNFAAAELDANFASAPELNDGDKVAEKFAALNNAAWAAN

Repeat 99 more times (or 999, 9999, ...).

Given the following tree, estimate the confidence of the two internal branches
[Figure: tree of Chimpanzee, Human, Gibbon, Gorilla and Orang-utan with two internal branches]

Estimating Confidence from the Resamplings
1. Of the 100 trees:
[Figure: the three topologies found among the 100 bootstrap trees, occurring in 41/100, 28/100 and 31/100 of the replicates]
2. Upon the original tree we superimpose bootstrap values:
[Figure: the original tree with bootstrap values 41 and 100 on its two internal branches]
• In 41 of the 100 trees, chimp and gorilla are split from the rest.
• In 100 of the 100 trees, gibbon and orang-utan are split from the rest.
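The resampling and counting steps of the bootstrap described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure of any particular package: the function names are hypothetical, a tree is represented only by the set of splits it contains, and building a tree from each resampled alignment is left to whatever method produced the original tree.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all sequences must have the same aligned length")
    # The same random column indices are used for every sequence,
    # so each resampled column is an intact column of the original.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


def split_support(replicate_splits, split):
    """Percentage of replicate trees that contain the given split."""
    hits = sum(split in splits for splits in replicate_splits)
    return 100.0 * hits / len(replicate_splits)


if __name__ == "__main__":
    # Toy alignment (hypothetical data, for illustration only).
    toy = {"s1": "ACGTAC", "s2": "ACGAAC", "s3": "TCGAAG"}
    print(bootstrap_alignment(toy, random.Random(0)))

    # Splits recorded for 100 replicate trees, mimicking the counts on the slide.
    # Only the two splits of interest are recorded (a simplification).
    chimp_gorilla = frozenset({"Chimpanzee", "Gorilla"})
    gibbon_orang = frozenset({"Gibbon", "Orang-utan"})
    replicates = [{chimp_gorilla, gibbon_orang}] * 41 + [{gibbon_orang}] * 59
    print(split_support(replicates, chimp_gorilla))
    print(split_support(replicates, gibbon_orang))
```

Passing the random number generator in explicitly keeps a run reproducible, which is convenient when the same set of replicates has to be analyzed with more than one tree-building method.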
THE TREE OF LIFE

Relationships between 16S ribosomal RNAs
[Figure: 16S rRNA comparisons among eukaryotes, bacteria and archaea, showing distant and close relationships]

The three domains of Life as identified by phylogenetic analysis of the highly conserved 16S ribosomal RNA (Woese and Fox 1977)

Where is the root of the tree of life? (By definition there is no outgroup.)

An ancient gene duplication can root a tree
[Figure labels: Gene duplication; Speciation of 3 and 1-2; Speciation of 1 and 2; Outgroups for A2; Outgroups for A1; Root of 1,2,3]
Graur & Li. Fundamentals of Molecular Evolution (1999)

The root of the tree of life as inferred from EF-Tu and EF-G
Both trees show Archaea and Eucarya as sister taxa.
Graur & Li. Fundamentals of Molecular Evolution (1999)

Horizontal Gene Transfer
[Figure: archaea and eubacteria; Mn-dependent transcriptional regulator] (Tatusov, 1996)

What is the origin of the mitochondria?
http://www.mitomap.org/

The endosymbiotic theory
The evidence:
• Both mitochondria and chloroplasts can arise only from preexisting mitochondria and chloroplasts. They cannot be formed in a cell that lacks them because nuclear genes encode only some of the proteins of which they are made.
• Both mitochondria and chloroplasts have their own genome.
• Both genomes consist of a single circular molecule of DNA.
• There are no histones associated with the DNA.

The mitochondria sit with the proteobacteria in the tree of life
Mitochondrial (MT) small-subunit (SSU) ribosomal RNA tree
Gray MW. Nature. 1998 Nov 12;396(6707):109-10.
[Figure labels: mitochondrion; chloroplast; Lack mitochondria (?)]

The genome sequence of Rickettsia prowazekii and the origin of mitochondria.
Andersson SG. Nature 1998 Nov 12;396(6707):133-40
• Mitochondrial ribosomal proteins are most similar to those of R. prowazekii.
• Mitochondrial proteins involved in ATP synthesis are most similar to those of R. prowazekii.

Mitochondria derive from α-purple bacteria. Chloroplasts derive from cyanobacteria.
Graur & Li. Fundamentals of Molecular Evolution (1999)

The tree of life with mitochondria and chloroplast endosymbiotic events (Doolittle, 1999)

Horizontal transfer is a dominant feature of the “tree” of life (Doolittle, 1999)