Phylogenetic Trees
The Tree of Life

Evolution
- Many theories of evolution
- Basic idea: speciation events lead to the creation of different species
- Speciation is caused by physical separation into groups in which different genetic variants become dominant
- Any two species share a (possibly distant) common ancestor

Phylogenies
- A phylogeny is a tree that describes the sequence of speciation events that led to the formation of a set of current-day species
- Leaves: current-day species
- Nodes: hypothetical most recent common ancestors
- Edge lengths: "time" from one speciation to the next

Primate evolution [figure]

- Until the mid-1950s, phylogenies were constructed by experts based on their opinion (subjective criteria)
- The Linnaean classification scheme implicitly assumes a tree structure
- Since then, the focus has been on objective criteria for constructing phylogenetic trees
- Important for many aspects of biology
  - Classification (systematics)
  - Understanding biological mechanisms

Taxonomy deals with the naming and ordering of taxa. The Linnaean hierarchy:
1. Kingdom
2. Division
3. Class
4. Order
5. Family
6. Genus
7. Species

Morphological vs. Molecular
- Classical phylogenetic analysis: morphological features
  - number of legs, lengths of legs, etc.
- Modern biological methods allow the use of molecular features
  - Gene sequences
  - Protein sequences
- Analysis is based on homologous sequences (e.g., globins) in different species

Dangers in Molecular Phylogenies
- We have to remember that gene/protein sequences can be homologous for different reasons:
  - Orthologs: sequences diverged after a speciation event
  - Paralogs: sequences diverged after a duplication event
  - Xenologs: sequences diverged after a horizontal transfer (e.g., by a virus)

Types of Trees
- A natural model to consider is that of rooted trees
- Depending on the model, data from current-day species does not distinguish between different placements of the root
- An unrooted tree represents the same phylogeny without the root node
- Trees can either contain distances, or simply links and nodes

Positioning Roots in Unrooted Trees
- We can estimate the position of the root by introducing an outgroup: a set of species that are definitely distant from all the species of interest

Type of Data
- Distance-based
  - Input is a matrix of distances between species
  - A distance can be the fraction of residues on which two sequences disagree, or the alignment score between them, or …
- Character-based
  - Examine each character (e.g., residue) separately

Simple Distance-Based Method
- Input: distance matrix between species
- Outline:
  - Cluster species together
  - Initially, clusters are singletons
  - At each iteration, combine the two "closest" clusters to get a new one

UPGMA Clustering
- Let $C_i$ and $C_j$ be clusters; define the distance between them to be the average distance over all pairs of members:
  $$d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$$
- When we combine two clusters, $C_i$ and $C_j$, to form a new cluster $C_k$, then for every other cluster $C_l$
  $$d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}$$

Molecular Clock
- UPGMA implicitly assumes that all distances measure time in the same way
- A weaker requirement is additivity
  - In the "real" tree, distances between species are the sum of distances between intermediate nodes
  - [figure: leaves $i$, $j$, $k$ joined at an internal node $m$ by edges of lengths $a$, $b$, $c$ respectively]
  $$d(i, j) = a + b, \qquad d(i, k) = a + c, \qquad d(j, k) = b + c$$
- Suppose the input distances are additive. Then
  $$d(m, k) = \frac{1}{2}\bigl(d(i, k) + d(j, k) - d(i, j)\bigr)$$

Neighbor Joining
- Can we use this fact to construct trees?
- Let
  $$D(i, j) = d(i, j) - (r_i + r_j), \qquad \text{where} \quad r_i = \frac{1}{|L| - 2} \sum_{k \in L} d(i, k)$$
- Theorem: if $D(i, j)$ is minimal (among all pairs of leaves), then $i$ and $j$ are neighbors in the tree

Neighbor Joining
- Set $L$ to contain all leaves
- Iteration:
  - Choose $i, j$ such that $D(i, j)$ is minimal
  - Create a new node $k$, and set
    $$d(i, k) = \frac{1}{2}\bigl(d(i, j) + r_i - r_j\bigr)$$
    $$d(j, k) = d(i, j) - d(i, k)$$
    $$d(k, m) = \frac{1}{2}\bigl(d(i, m) + d(j, m) - d(i, j)\bigr) \quad \text{for every other node } m \in L$$
  - Remove $i, j$ from $L$, and add $k$
- Terminate: when $|L| = 2$, connect the two remaining nodes

Distance Based Methods
- If we make strong assumptions on distances, we can reconstruct trees
- In real life, distances are not additive
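
As an illustration of the UPGMA clustering outline above, here is a minimal Python sketch. The function name `upgma`, the dictionary-based distance matrix, and the nested-tuple tree representation are assumptions made for this example, not part of the slides. It repeatedly merges the two closest clusters and applies the size-weighted update for d(C_k, C_l).

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster leaves by UPGMA (illustrative sketch).

    names: list of leaf labels.
    dist:  dict mapping a pair (x, y) of leaf labels to their distance.
    Returns (tree, merges): tree is a nested tuple of labels, merges is a
    list of (cluster_i, cluster_j, distance_at_merge).
    """
    # Store distances under unordered pairs so lookup order does not matter.
    d = {frozenset(pair): value for pair, value in dist.items()}
    # Each cluster is keyed by its subtree; the value is its number of leaves.
    size = {name: 1 for name in names}
    merges = []
    while len(size) > 1:
        # Pick the two "closest" clusters.
        ci, cj = min(combinations(size, 2), key=lambda pair: d[frozenset(pair)])
        dij = d[frozenset((ci, cj))]
        ni, nj = size.pop(ci), size.pop(cj)
        ck = (ci, cj)  # the new cluster C_k
        # Size-weighted average distance from C_k to every remaining C_l.
        for cl in size:
            d[frozenset((ck, cl))] = (
                ni * d[frozenset((ci, cl))] + nj * d[frozenset((cj, cl))]
            ) / (ni + nj)
        size[ck] = ni + nj
        merges.append((ci, cj, dij))
    (tree,) = size  # exactly one cluster is left
    return tree, merges


if __name__ == "__main__":
    # Hypothetical clock-like distances for four species.
    example = {
        ("A", "B"): 2.0, ("A", "C"): 4.0, ("B", "C"): 4.0,
        ("A", "D"): 6.0, ("B", "D"): 6.0, ("C", "D"): 6.0,
    }
    print(upgma(["A", "B", "C", "D"], example))
```

Because the example distances are consistent with a molecular clock, the merge order reflects the nesting of the species; with distances that violate that assumption, UPGMA can group the wrong leaves.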
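
The neighbor-joining iteration above can likewise be sketched in Python. This is a minimal sketch under stated assumptions: the function name `neighbor_joining`, the edge-list output, and the internal node labels `node0`, `node1`, ... are hypothetical choices for illustration (leaf labels are assumed not to collide with them), and ties in D(i, j) are broken by taking the first minimal pair.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build an unrooted tree by neighbor joining (illustrative sketch).

    names: list of leaf labels.
    dist:  dict mapping a pair (x, y) of leaf labels to their distance.
    Returns a list of edges (node, node, branch_length).
    """
    d = {frozenset(pair): value for pair, value in dist.items()}
    L = list(names)
    edges = []
    next_id = 0
    while len(L) > 2:
        # r_i = (1 / (|L| - 2)) * sum over k in L of d(i, k)
        r = {
            i: sum(d[frozenset((i, k))] for k in L if k != i) / (len(L) - 2)
            for i in L
        }
        # Choose i, j minimizing D(i, j) = d(i, j) - (r_i + r_j).
        i, j = min(
            combinations(L, 2),
            key=lambda p: d[frozenset(p)] - (r[p[0]] + r[p[1]]),
        )
        k = f"node{next_id}"  # hypothetical label for the new internal node
        next_id += 1
        dij = d[frozenset((i, j))]
        dik = 0.5 * (dij + r[i] - r[j])
        djk = dij - dik
        edges.append((i, k, dik))
        edges.append((j, k, djk))
        # Remove i, j from L, compute distances from k, then add k.
        L.remove(i)
        L.remove(j)
        for m in L:
            d[frozenset((k, m))] = 0.5 * (
                d[frozenset((i, m))] + d[frozenset((j, m))] - dij
            )
        L.append(k)
    # Terminate: connect the two remaining nodes.
    i, j = L
    edges.append((i, j, d[frozenset((i, j))]))
    return edges


if __name__ == "__main__":
    # Hypothetical additive distances from a tree with branch lengths
    # A:1, B:2, C:2, D:4 and an internal edge of length 3.
    example = {
        ("A", "B"): 3.0, ("A", "C"): 6.0, ("A", "D"): 8.0,
        ("B", "C"): 7.0, ("B", "D"): 9.0, ("C", "D"): 6.0,
    }
    for edge in neighbor_joining(["A", "B", "C", "D"], example):
        print(edge)
```

The example input is exactly additive, so the sketch recovers the branch lengths used to generate it; on real, non-additive distances the output is only an approximation.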