Phylogenetics I

Evolution
Evolution of new organisms is driven by
• Mutations
  – The DNA sequence can be changed by single-base changes, deletion/insertion of DNA segments, etc.
• Selection bias

Theory of Evolution
• Basic idea
  – Speciation events lead to the creation of different species.
  – Speciation is caused by physical separation into groups in which different genetic variants become dominant.
• Any two species share a (possibly distant) common ancestor.

[Figure: The Tree of Life]
[Figure: Primate evolution]

A phylogeny is a tree that describes the sequence of speciation events that led to the formation of a set of current-day species; it is also called a phylogenetic tree.

Morphological vs. Molecular
• Classical phylogenetic analysis uses morphological features: number of legs, lengths of legs, etc.
• Modern biological methods allow the use of molecular features
  – Gene sequences
  – Protein sequences

Morphological topology (based on McKenna and Bell, 1997)
[Figure: tree with clades Archonta, Glires, Ungulata, Carnivora, Insectivora, Xenarthra. Leaf labels: Bonobo, Chimpanzee, Man, Gorilla, Sumatran orangutan, Bornean orangutan, Common gibbon, Barbary ape, Baboon, White-fronted capuchin, Slow loris, Tree shrew, Japanese pipistrelle, Long-tailed bat, Jamaican fruit-eating bat, Horseshoe bat, Little red flying fox, Ryukyu flying fox, Mouse, Rat, Vole, Cane-rat, Guinea pig, Squirrel, Dormouse, Rabbit, Pika, Pig, Hippopotamus, Sheep, Cow, Alpaca, Blue whale, Fin whale, Sperm whale, Donkey, Horse, Indian rhino, White rhino, Elephant, Aardvark, Grey seal, Harbor seal, Dog, Cat, Asiatic shrew, Long-clawed shrew, Small Madagascar hedgehog, Hedgehog, Gymnure, Mole, Armadillo, Bandicoot, Wallaroo, Opossum, Platypus.]

From sequences to a phylogenetic tree
Rat      QEPGGLVVPPTDA
Rabbit   QEPGGMVVPPTDA
Gorilla  QEPGGLVVPPTDA
Cat      REPGGLVVPPTEG
There are many possible types of sequences to use (e.g., mitochondrial vs. nuclear proteins).
Mitochondrial topology (based on Pupko et al.)
[Figure: tree with clades Perissodactyla, Carnivora, Cetartiodactyla, Chiroptera, Moles+Shrews, Afrotheria, Xenarthra, Lagomorpha + Scandentia, Primates, Rodentia 1, Rodentia 2, Hedgehogs. Leaf labels: Donkey, Horse, Indian rhino, White rhino, Grey seal, Harbor seal, Dog, Cat, Blue whale, Fin whale, Sperm whale, Hippopotamus, Sheep, Cow, Alpaca, Pig, Little red flying fox, Ryukyu flying fox, Horseshoe bat, Japanese pipistrelle, Long-tailed bat, Jamaican fruit-eating bat, Asiatic shrew, Long-clawed shrew, Mole, Small Madagascar hedgehog, Aardvark, Elephant, Armadillo, Rabbit, Pika, Tree shrew, Bonobo, Chimpanzee, Man, Gorilla, Sumatran orangutan, Bornean orangutan, Common gibbon, Barbary ape, Baboon, White-fronted capuchin, Slow loris, Squirrel, Dormouse, Cane-rat, Guinea pig, Mouse, Rat, Vole, Hedgehog, Gymnure, Bandicoot, Wallaroo, Opossum, Platypus.]

Nuclear topology (based on Pupko et al. slide; tree by Madsen)
[Figure: tree with four numbered groups (1 to 4) and clades Eulipotyphla, Pholidota, Chiroptera, Cetartiodactyla, Carnivora, Perissodactyla, Glires, Scandentia + Dermoptera, Primate, Xenarthra, Afrotheria. Leaf labels: Round Eared Bat, Flying Fox, Hedgehog, Mole, Pangolin, Cow, Cat, Dog, Horse, Rhino, Rat, Capybara, Rabbit, Flying Lemur, Tree Shrew, Human, Galago, Sloth, Whale, Hippo, Pig, Hyrax, Dugong, Elephant, Aardvark, Elephant Shrew, Opossum, Kangaroo.]

Phylogenetic trees
[Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]
• Leaves: current-day species (or taxa, the plural of taxon)
• Internal vertices: hypothetical common ancestors
• Edge lengths: "time" from one speciation to the next

Twists in molecular phylogenies
• Gene/protein sequences can be homologous for several different reasons:
  – Orthologs: sequences diverged after a speciation event
  – Paralogs: sequences diverged after a duplication event
  – Xenologs: sequences diverged after a horizontal transfer (e.g., by a virus)

Paralogs
Consider the evolutionary tree of three taxa, and assume that at some point in the past a gene duplication event occurred.
[Figure: tree on taxa 1, 2, 3]

Paralogs
The gene evolution is described by this tree (A and B are the copies of the same gene).
[Figure: gene tree with a gene duplication followed by speciation events; leaves 1A, 2A, 3A, 3B, 2B, 1B]

Paralogs
If we happen to consider genes 1A, 2B, and 3A of species 1, 2, 3, we get a wrong tree that does not represent the phylogeny of the host species.
[Figure: the same gene tree with the gene duplication and the speciation events (S) marked; leaves 1A, 2A, 3A, 3B, 2B, 1B]

Types of Trees
A natural model to consider is that of rooted trees.
[Figure: rooted tree with the common ancestor at the root]

Types of Trees
An unrooted tree represents the same phylogeny without the root node.
Depending on the model, data from current-day species does not distinguish between different placements of the root.

Rooted versus unrooted trees
[Figure: rooted trees a, b, and c, and an unrooted tree with positions a, b, c marked]
The unrooted tree represents the three rooted trees.

Total number of trees
• For n taxa,
  – Rooted bifurcating trees: $(2n-3)!! = \frac{(2n-3)!}{2^{n-2}(n-2)!}$
  – Unrooted bifurcating trees: $(2n-5)!!$
  – Tree shapes

Positioning Roots in Unrooted Trees
• We can estimate the position of the root by introducing an outgroup:
  – a set of species that are definitely distant from all the species of interest
[Figure: tree with leaves Falcon, Aardvark, Bison, Chimp, Dog, Elephant and the proposed root marked]

Type of Data
• Distance-based
  – Input is a matrix of distances between species
  – A distance can be the fraction of residues on which two sequences disagree, the alignment score between them, or …
• Character-based
  – Examine each character (e.g., residue) separately

Two Methods of Tree Construction
• Distance: a weighted tree that realizes the distances between the objects.
• Parsimony: a tree with the minimum total number of character changes between nodes.
We start with distance-based methods, considering the following question: given a set of species (leaves in a supposed tree) and distances between them, construct a phylogeny which best "fits" the distances.

Distance Matrix
• Given n species, we can compute the n x n distance matrix $D_{ij}$.
• $D_{ij}$ may be defined as the edit distance between a gene in species i and species j, where the gene of interest is sequenced for all n species.
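The distance matrix just described can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses unit-cost Levenshtein edit distance (one of several possible definitions of $D_{ij}$), the function names `edit_distance` and `distance_matrix` are hypothetical, and the four short protein fragments are the ones shown on the "From sequences to a phylogenetic tree" slide.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: unit cost for substitution, insertion, deletion."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # match or substitute
            ))
        prev = curr
    return prev[-1]


def distance_matrix(seqs: dict) -> tuple:
    """Return (names, D) where D[x][y] is the edit distance between sequences."""
    names = list(seqs)
    D = [[edit_distance(seqs[x], seqs[y]) for y in names] for x in names]
    return names, D


# Sequences taken from the slide; everything else here is illustrative.
SEQS = {
    "Rat": "QEPGGLVVPPTDA",
    "Rabbit": "QEPGGMVVPPTDA",
    "Gorilla": "QEPGGLVVPPTDA",
    "Cat": "REPGGLVVPPTEG",
}

names, D = distance_matrix(SEQS)
for name, row in zip(names, D):
    print(f"{name:8s}", row)
```

Raw edit distance is only a starting point; the substitution models discussed next correct such counts for multiple changes at the same site.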
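The distance between two sequences
• Protein sequences:
  – PAM
  – BLOSUM
• DNA sequences:
  – Jukes-Cantor
  – HKY
  – Kimura 2-Parameter

General Stationary Time-reversible Model
$$R = \begin{pmatrix} \cdot & \pi_C r_{CA} & \pi_G r_{GA} & \pi_T r_{TA} \\ \pi_A r_{AC} & \cdot & \pi_G r_{GC} & \pi_T r_{TC} \\ \pi_A r_{AG} & \pi_C r_{CG} & \cdot & \pi_T r_{TG} \\ \pi_A r_{AT} & \pi_C r_{CT} & \pi_G r_{GT} & \cdot \end{pmatrix}$$
(Rows and columns are ordered A, C, G, T; diagonal elements are such that rows sum to zero.)
Time reversibility: $\pi_i r_{ij} = \pi_j r_{ji}$

General Stationary Time-reversible Model
$$P(t) = e^{Rt}$$
Given rates, one can find transition probabilities, and vice versa.

Jukes-Cantor
$$R = \begin{pmatrix} \cdot & u/3 & u/3 & u/3 \\ u/3 & \cdot & u/3 & u/3 \\ u/3 & u/3 & \cdot & u/3 \\ u/3 & u/3 & u/3 & \cdot \end{pmatrix}$$

Jukes-Cantor
• $P(\text{no mutation}) = e^{-\frac{4}{3}ut}$
• $P(\text{at least one mutation}) = 1 - e^{-\frac{4}{3}ut}$
• $D_s = \frac{3}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$, the expected fraction of sites that differ
• $D = ut = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D_s\right)$

Kimura 2-Parameter
$$R = \begin{pmatrix} \cdot & b & a & b \\ b & \cdot & b & a \\ a & b & \cdot & b \\ b & a & b & \cdot \end{pmatrix}$$
(Rows and columns are ordered A, C, G, T; a is the transition rate, b the transversion rate.)
• Transition/transversion ratio: $R = a/(2b)$
• $a + 2b = 1$ per unit time

Kimura 2-Parameter
$a = R/(R+1)$, $b = 0.5/(R+1)$
$$P = \Pr(\text{transition} \mid t) = \frac{1}{4} - \frac{1}{2}\exp\left(-\frac{2R+1}{R+1}\,t\right) + \frac{1}{4}\exp\left(-\frac{2}{R+1}\,t\right)$$
$$Q = \Pr(\text{transversion} \mid t) = \frac{1}{2}\left[1 - \exp\left(-\frac{2}{R+1}\,t\right)\right]$$
$$t = -\frac{1}{4}\ln\left[(1 - 2Q)(1 - 2P - Q)^2\right]$$

HKY (Hasegawa, Kishino, Yano)
$$R = \begin{pmatrix} \cdot & \mu\pi_C & \mu\kappa\pi_G & \mu\pi_T \\ \mu\pi_A & \cdot & \mu\pi_G & \mu\kappa\pi_T \\ \mu\kappa\pi_A & \mu\pi_C & \cdot & \mu\pi_T \\ \mu\pi_A & \mu\kappa\pi_C & \mu\pi_G & \cdot \end{pmatrix}$$
$\kappa$ = transition/transversion rate ratio (it multiplies the A↔G and C↔T entries)

Distances in Trees
• Edges may have weights reflecting:
  – Number of mutations on the evolutionary path from one species to another
  – Time estimate for the evolution of one species into another
• In a tree T, we often compute $d_{ij}(T)$, the length of the path between leaves i and j.

Distance in Trees: an Example
[Figure: weighted tree with leaves i and j marked]
$d_{1,4} = 12 + 13 + 14 + 17 + 12 = 68$

Fitting Distance Matrix
• Given n species, we can compute the n x n distance matrix $D_{ij}$.
• Evolution of these genes is described by a tree that we don't know.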
• We need an algorithm to construct a tree that best fits the distance matrix $D_{ij}$.

Reconstructing a 3-Leaved Tree
• Tree reconstruction for any 3x3 matrix is straightforward.
• We have 3 leaves i, j, k and a center vertex c.
Observe:
$d_{ic} + d_{jc} = D_{ij}$
$d_{ic} + d_{kc} = D_{ik}$
$d_{jc} + d_{kc} = D_{jk}$

Reconstructing a 3-Leaved Tree
Adding the first two equations and substituting the third:
$(d_{ic} + d_{jc}) + (d_{ic} + d_{kc}) = D_{ij} + D_{ik}$
$2d_{ic} + d_{jc} + d_{kc} = D_{ij} + D_{ik}$
$2d_{ic} + D_{jk} = D_{ij} + D_{ik}$
$d_{ic} = (D_{ij} + D_{ik} - D_{jk})/2$
Similarly,
$d_{jc} = (D_{ij} + D_{jk} - D_{ik})/2$
$d_{kc} = (D_{ki} + D_{kj} - D_{ij})/2$

Trees with > 3 Leaves
• An unrooted binary tree with n leaves has 2n-3 edges.
• This means fitting a given tree to a distance matrix D requires solving a system of $\binom{n}{2}$ equations with 2n-3 variables.
• This is not always possible to solve for n > 3.

Additive Distance Matrices
Matrix D is ADDITIVE if there exists a tree T with $d_{ij}(T) = D_{ij}$, and NON-ADDITIVE otherwise.

Distance Based Phylogeny Problem
• Goal: Reconstruct an evolutionary tree from a distance matrix
• Input: n x n distance matrix $D_{ij}$
• Output: weighted tree T with n leaves fitting D
• If D is additive, this problem has a solution and there is a simple algorithm to solve it.

Using Neighboring Leaves to Construct the Tree
• Find neighboring leaves i and j with parent k.
• Remove the rows and columns of i and j.
• Add a new row and column corresponding to k, where the distance from k to any other leaf m is computed as $D_{km} = (D_{im} + D_{jm} - D_{ij})/2$.
Compress i and j into k, then iterate the algorithm for the rest of the tree.

Finding Neighboring Leaves
• To find neighboring leaves we simply select a pair of closest leaves. WRONG

Finding Neighboring Leaves
• Closest leaves aren't necessarily neighbors.
• i and j are neighbors, but $(d_{ij} = 13) > (d_{jk} = 12)$.
• Finding a pair of neighboring leaves is a nontrivial problem!
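The three-leaf formulas and the warning about closest leaves can both be checked numerically. The sketch below is illustrative only: `three_leaf_edges` and `leaf_distance` are hypothetical helper names, the 3x3 input (5, 7, 8) is made up, and the quartet is an assumed tree chosen so that $d_{ij} = 13$ and $d_{jk} = 12$ as on the slide; the remaining edge lengths are not from the original figure.

```python
from itertools import combinations


def three_leaf_edges(D_ij, D_ik, D_jk):
    """Edge lengths from leaves i, j, k to the center vertex c."""
    d_ic = (D_ij + D_ik - D_jk) / 2
    d_jc = (D_ij + D_jk - D_ik) / 2
    d_kc = (D_ik + D_jk - D_ij) / 2
    return d_ic, d_jc, d_kc


# Made-up 3x3 input: D_ij = 5, D_ik = 7, D_jk = 8.
d_ic, d_jc, d_kc = three_leaf_edges(5, 7, 8)
# The edge lengths reproduce the input distances.
assert (d_ic + d_jc, d_ic + d_kc, d_jc + d_kc) == (5, 7, 8)

# Assumed quartet: leaves i and j hang off internal node p, leaves k and l
# hang off internal node q, and p and q are joined by a middle edge.
ATTACH = {"i": ("p", 11), "j": ("p", 2), "k": ("q", 6), "l": ("q", 12)}
MIDDLE_EDGE = 4


def leaf_distance(x, y):
    """Path length between two leaves of the assumed quartet."""
    (node_x, w_x), (node_y, w_y) = ATTACH[x], ATTACH[y]
    return w_x + w_y + (0 if node_x == node_y else MIDDLE_EDGE)


D = {frozenset(pair): leaf_distance(*pair) for pair in combinations("ijkl", 2)}
closest = min(D, key=D.get)

# i and j are neighbors (same parent p), yet j and k are the closest pair.
print("closest pair:", sorted(closest), "distance:", D[closest])
print("neighbors i, j distance:", D[frozenset("ij")])
```

In this assumed tree the short edge to j pulls j close to everything, which is why picking the minimum entry of the matrix selects a non-neighboring pair.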
Neighbor Joining Algorithm
• In 1987 Naruya Saitou and Masatoshi Nei developed a neighbor joining algorithm for phylogenetic tree reconstruction.
• It finds a pair of leaves that are close to each other but far from other leaves: it implicitly finds a pair of neighboring leaves.
• Advantages: it works well for additive and also for non-additive matrices, and it does not rely on the flawed molecular clock assumption.

Constructing additive trees: The neighbor joining algorithm
Let i, j be neighboring leaves in a tree, let k be their parent, and let m be any other vertex.
The formula
$$d(k,m) = \frac{1}{2}\left[d(i,m) + d(j,m) - d(i,j)\right]$$
shows that we can compute the distances of k to all other leaves. This suggests the following method to construct a tree from a distance matrix:
1. Find neighboring leaves i, j in the tree.
2. Replace i, j by their parent k and recursively construct a tree T for the smaller set.
3. Add i, j as children of k in T.

Neighbor Finding
How can we find, from distances alone, a pair of nodes which are neighboring leaves?
Closest nodes aren't necessarily neighboring leaves.
[Figure: tree with leaves A, B, C, D]
Next we show one way to find neighbors from distances.

Neighbor Finding: Saitou & Nei algorithm
Definitions
For a leaf i, let $r_i = \sum_{u \text{ is a leaf}} d(i,u)$.
For leaves i, j: $D(i,j) = (L-2)\,d(i,j) - (r_i + r_j)$, where L is the number of leaves.
Theorem (Saitou & Nei)
Assume all edge weights are positive. If D(i,j) is minimal (among all pairs of leaves), then i and j are neighboring leaves in the tree.

Complexity of Neighbor Joining Algorithm
Naive implementation:
Initialization: $\Theta(L^2)$ to compute d(r,i) and C(i,j) for all $i, j \in L$.
Each iteration:
• $O(L^2)$ to find the maximal C(i,j).
• $O(L)$ to compute $\{C(m,k) : m \in L\}$ for the new node k.
Total of $O(L^3)$.
[Figure: object r, leaf m, and the new node k, with C(m,k) marked]

Complexity of Neighbor Joining Algorithm
Using a heap to store the C(i,j)'s:
Input: distance matrix D = d(i,j), and an arbitrary object r.
Initialization: $\Theta(L^2)$ to compute the C(i,j)'s and heapify them in a heap H.
Each iteration:
• $O(\log L)$ to find and delete the maximal C(i,j) from H.
• $O(L)$ to add the values $\{d(k,m)\}$ to D, for all objects m.
• $O(L)$ to delete $\{d(m,i), d(m,j)\}$ from D (for all m).
• $O(L \log L)$ to delete $\{C(i,m), C(j,m)\}$ from H and add $C(k,m)$ to H, for all objects m.
Total of $O(L^2 \log L)$.
(Implementation details are omitted.)

Neighbor Joining Algorithm
• Applicable to matrices which are not additive
• Known to work well in practice
• The algorithm and its variants are the most widely used distance-based algorithms today.

The Four Point Condition
Compute:
1. $D_{ij} + D_{kl}$
2. $D_{ik} + D_{jl}$
3. $D_{il} + D_{jk}$
[Figure: quartet tree with i, j on one side of the middle edge and k, l on the other, showing the three pairings 1, 2, 3]
Sums 2 and 3 represent the same number: the length of all edges plus the middle edge (it is counted twice).
Sum 1 represents a smaller number: the length of all edges minus the middle edge.

The Four Point Condition: Theorem
• The four point condition for the quartet i, j, k, l is satisfied if two of these sums are the same, with the third sum smaller than these two.
• Theorem: An n x n matrix D is additive if and only if the four point condition holds for every quartet $1 \le i, j, k, l \le n$.

Least Squares Distance Phylogeny Problem
• If the distance matrix D is NOT additive, then we look for a tree T that approximates D the best:
  Squared Error: $\sum_{i,j} \left(d_{ij}(T) - D_{ij}\right)^2$
• Squared Error is a measure of the quality of the fit between the distance matrix and the tree: we want to minimize it.
• Least Squares Distance Phylogeny Problem: finding the best approximation tree T for a non-additive matrix D (NP-hard).
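Whether a given matrix is additive, and therefore whether the least squares formulation is needed at all, can be tested directly with the four point condition. The following is a minimal sketch, not a tuned implementation: the names `four_point_ok` and `is_additive`, the tolerance parameter `tol`, and both example matrices are assumptions introduced here. The check requires the two largest sums to be equal, so it also accepts the degenerate case in which all three sums coincide (a middle edge of length zero).

```python
from itertools import combinations


def four_point_ok(D, i, j, k, l, tol=1e-9):
    """True if the two largest of the three pairwise sums are equal."""
    sums = sorted([
        D[i][j] + D[k][l],
        D[i][k] + D[j][l],
        D[i][l] + D[j][k],
    ])
    # sums[0] is automatically the smallest; compare the other two.
    return abs(sums[2] - sums[1]) <= tol


def is_additive(D, tol=1e-9):
    """Check the four point condition on every quartet: O(n^4) time."""
    n = len(D)
    return all(
        four_point_ok(D, i, j, k, l, tol)
        for i, j, k, l in combinations(range(n), 4)
    )


# Hypothetical additive matrix: distances read off a quartet tree with
# leaf edges 11, 2, 6, 12 and a middle edge of length 4.
D_additive = [
    [0, 13, 21, 27],
    [13, 0, 12, 18],
    [21, 12, 0, 18],
    [27, 18, 18, 0],
]

# Same matrix with one entry perturbed, so no tree fits it exactly.
D_perturbed = [row[:] for row in D_additive]
D_perturbed[0][3] = D_perturbed[3][0] = 30

print(is_additive(D_additive))
print(is_additive(D_perturbed))
```

For the perturbed matrix every candidate tree has nonzero squared error, which is the situation the least squares problem addresses.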