Phylogenetic Tree Construction Dong Xu Computer Science Department 271C Life Sciences Center 1201 East Rollins Road University of Missouri-Columbia Columbia, MO 65211-2060 E-mail: xudong@missouri.edu 573-882-7064 (O) http://digbio.missouri.edu Outline Evolution theory Concept of phylogeny Molecular clock Types of trees UPGMA Parsimony Maximum likelihood An example for bird flu Evolution Many theories of evolution Basic idea: speciation events lead to creation of different species Any two species share a (possibly distant) common ancestor Evolutionary Events Extinction: A new node u is created at the end of a lineage, no new lineage is started from u Speciation: A new node u is created at the end of a lineage, and two new lineages are started from u Hybridization: A new node u is created when two lineages combine (diploid or polyploid) when one lineage creates u and the new lineage from u has double the number of homologs (autopolyploid) Tree of Life http://tolweb.org/ Toxonomy Glycine max Taxonomy ID: 3847 Genbank common name: soybean Rank: species Genetic code: Translation table 1 (Standard) Mitochondrial genetic code: Translation table 1 (Standard) Other names: common name:soybeans Lineage( full ) cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; rosids; eurosids I; Fabales; Fabaceae; Papilionoideae; Kingdom Plantae Evolutionary From tree of plants primitive more advanced traits moncot Gymnosperms Non-vascular _______ Dicot Green alga ancestor Flowers Vascular __________ Monocot vs. dicot plants (1) FEATURE MONOCOTS DICOTS 1 2 Leaf venation parallel broad Root system Fibrous Tap In 3’s In 4’s or 5’s Scattered Arranged in a circle Herbaceous Either Cotyledons Number of floral parts Vascular bundle position Woody or herbaceous Monocot vs. dicot plants (2) Number of cotyledons: one vs. two Monocot vs. dicot plants (3) Leaf venation pattern: Monocot Dicot is parallel is net pattern Monocot vs. dicot plants (4) Flower parts: Monocot: in groups of three Dicot: in groups of four or five Outline Evolution theory Concept of phylogeny Molecular clock Types of trees UPGMA Parsimony Maximum likelihood An example for bird flu Phylogenies (1) A phylogeny is a tree that describes the sequence of speciation events that lead to the forming of a set of current day species Aardvark Bison Chimp Dog Elephant Phylogenies (2) Leafs - current day species Nodes - hypothetical most recent common ancestors Edges length - “time” from one speciation to the next Primate Evolution Tree Terminology a b c d leaf {a,b} cluster edge {a,b,c} {a,b,c,d} internal node root Rooted/Unrooted Tree Rooted trees Single common ancestor Requires more information Unrooted trees Objects are leaves Internal nodes are some common ancestors Insufficient information to tell whether not not a given internal node is a common ancestor of any 2 leaves Motivation Understand the lineage of different species Organizing principle to sort species into a taxonomy Understand how various functions evolved Understand forces and constraints on evolution Perform multiple sequence alignment Predict gene function (phylogenetic footprint) Tree Basis Phylogenies are reconstructed based on comparisons between present-day objects Two main aspects Topology How its interior nodes connect to one another and to the leaves Distance An estimate of the evolutionary distance between the nodes Assumptions homology reflects common ancestry single common ancestor treelike relationship exists positional homology independent processes no reversals or convergence molecular clock Outline Evolution theory Concept of phylogeny Molecular clock Types of trees UPGMA Parsimony Maximum likelihood An example for bird flu Molecular Clock Theory (1) For any given protein, accepted mutations in the amino acid sequence for the protein occur at constant rate Accepted = mutations that allow protein to function without death Implication # of accepted mutations proportional to length of time interval i.e. relatively constant rate of accepted mutations within a protein Molecular Clock Theory (2) Rate of accepted mutations maybe different for different proteins (depending on their tolerance for mutations) Different parts of a protein may evolve at different rates Thus, if A and B differ by k accepted mutations, then roughly k/2 mutations have occurred since divergence Science vol. 289 Outline Evolution theory Concept of phylogeny Molecular clock Types of trees UPGMA Parsimony Maximum likelihood An example for bird flu Species/Gene Trees (1) Species tree (how are my species related?) contains only one representative from each species when did speciation take place? all nodes indicate speciation events Gene tree (how are my genes related?) normally contains a number of genes from a single species nodes relate either to speciation or gene duplication events Species/Gene Trees (2) • Your sequence data may not have the same phylogenetic history as the species from which they were isolated •Different genes evolve at different speeds, and there is always the possibility of horizontal gene transfer (hybridization, vector mediated DNA movement, or direct uptake of DNA). Morphological vs. Molecular Classical phylogenetic analysis: morphological features number of legs, lengths of legs, etc. Modern biological methods allow to use molecular features Gene sequences Protein sequences Dangers in Molecular Phylogenies Gene/protein sequence can be homologous for different reasons: Orthologs -- sequences diverged after a speciation event Paralogs -- sequences diverged after a duplication event Xenologs -- sequences diverged after a horizontal transfer (e.g., by virus) Ultrametric trees (1) A metric on a set of objects O given by the assignment of a real number d(x,y) to every pair x,y in O Ultrametric trees (2) An ultrametric has to fulfill the additional requirement An ultrametric tree is characterized by the three point condition Additive Trees Generalization of ultrametric trees # of mutations were assumed to be proportional to temporal distance of a node to ancestor Also assumed, mutations took place at same rate in all branches Additive trees model different rates of mutation along different branches Additivity In “real” tree, distances between species are the sum of distances between intermediate nodes k d (i , j ) a b c a i m b d (i , k ) a c j d ( j ,k ) b c 1 c = d (m, k ) (d (i , k ) d ( j , k ) d (i , j )) 2 Phylogeny Construction parsimony methods: fewest changes likelihood methods: maximize the probability distance methods: based on pairwise evolutionary distances (sequence similarity, nucleotide composition, etc.) Outline Evolution theory Concept of phylogeny Molecular clock Types of trees UPGMA Parsimony Maximum likelihood An example for bird flu UPGMA UPGMA is the unweighted pair group method with arithmetic mean Distance matrix can come from (e.g) DNADNA hybridization, or be constructed from sequence data etc. Iteratively group the most closely related groups. The average distance between elements in two groups is the distance between the groups. UPGMA Procedure 1. find closest pair of units (species, to start with) 2. connect this pair, defining an evolutionary unit (branch) 3. compute distances from the ancestor of this unit to all other ungrouped units -Branch length is distance/2 4. go back to #1 and repeat Evolutionary distances among primates (1) nucleotide substitutions per 100 sites Chimp Gorilla Orang Rhesus Human 1.45 1.51 2.98 7.51 Chimp Gorilla Orang 1.57 2.98 7.55 3.04 7.39 7.10 Humans and chimps are closest: lump them and recompute distances H C Evolutionary distances among primates (2) Gorilla Orang Rhesus H-C 1.54 2.98 7.53 Gorilla Orang 3.04 7.39 7.10 e.g., (H-C) to gorilla distance G = (H-G+C-G)/2 = (1.51+1.57)/2 = 1.54 Gorilla is closest to H-C clade (((H, C), 1.45), G, 1.54) H C Evolutionary distances among primates (3) Orang Rhesus H-C-G 3 7.46 Orang 7.10 R O Human-ChimpGorilla is closer to Orang than to Rhesus G H C UPGMA Clustering Let Ci and Cj be clusters, define distance between them to be 1 d (Ci ,C j ) d ( p, q ) | Ci || C j | pCi q C j When we combine two cluster, Ci and Cj, to form a new cluster Ck, then d (C k , Cl ) | Ci | d (Ci , Cl ) | C j | d (C j , Cl ) | Ci | | C j | UPGMA: conclusions UPGMA gives branch lengths or evolutionary distances as well as branching order if (a big if) mutations occur at a constant rate, we can estimate dates of divergence from sequence differences Outline Evolution theory Concept of phylogeny Molecular clock Types of trees UPGMA Parsimony Maximum likelihood An example for bird flu Possible Evolutionary Tree (1) t1 1 three-taxa tree t2 1*(2*3-3) = 3 four-taxa trees t3 t1 t1 t3 t2 t2 t4 t3 t4 t2 t3 t1 t4 Possible Evolutionary Tree (2) Taxa (n) rooted unrooted (2n-3)!/(2n-2(n-2)!) (2n-5)!/(2n-3(n-3)!) 2 1 1 3 3 1 4 15 3 5 105 15 6 954 105 7 10,395 954 8 135,135 10,395 9 2,027,025 135,135 10 34,459,425 2,027,025 Possible Evolutionary Tree (3) Taxa (n): 2 Taxa (n) 3 Unrooted/rooted 2 1/1 3 1/3 4 3/15 4 Maximum parsimony (1) Minimizes the number of steps required to generate the observed variation in the sequences Guaranteed to find the "best" tree - danger of over-fitting the data Columns representing greater variation dominate Works best for small, highly conserved sequences Maximum parsimony (2) Begin with a multiple sequence alignment Identify informative sites within the sequences Tree requiring smallest number of changes identified Repeat over all informative sites Length = sum of the # of steps in each branch Choose tree with smallest length Maximum parsimony (3) Sequence position and character Taxa 1 2 3 4 5 6 7 8 9 1 A A G A G T G C A 2 A G C C G T G C G 3 A G A T A T C C A 4 A G A G A T C C G Maximum parsimony (4) A A ACGA C Tree 1 B A B ATGC C GTGC D B D GCAA Tree 2 C D A B 1 2 3 4 Total Tree 3 D C Tree 1 1 2 1 2 6 Tree 2 2 2 1 2 7 Tree 3 2 1 1 1 5 Parsimony on genomic sequence Site Human Chimp Gorilla Orang Recent branch 34 A G A G human-gorilla 560 C C A A human-chimp 1287 * * T T human-chimp 3057- **** **** TAAT TAAT human-chimp A C C A chimp-gorilla 3060 5153 human-chimp chimp-gorilla 12 3 human-gorilla 4 Outline Evolution theory Concept of phylogeny Molecular clock Types of trees UPGMA Parsimony Maximum likelihood An example for bird flu Probabilistic Approaches to Phylogeny (1) Notation and definitions: Let P(x•|T,t•) denote the probability of a set of data given a tree, where: x• denotes n sequences T denotes a tree with n leaves with sequence j at leaf j t• denotes the edge lengths of the tree The definition of P(x•|T,t•) depends on our choice of model of evolution. Probabilistic Approaches to Phylogeny (2) Let P(x|y,t) denote the probability that sequence y evolves into x along an edge of length t. Assume that we can define P(x|y,t). If we can do this for each edge of T we can calculate the probability of T. Probabilistic Approaches to Phylogeny (3) Ridiculously simplistic model of evolution: 1. Every site is independent 2. Deletions and insertions do not occur 3. Substitution accounts for all evolution Let P(b|a, t) denote the probability of the substitution of residue b for residue a over an edge length of t. Extending to aligned gapless sequences x and y, P(x | y, t) = PuP(xu|yu, t), where u indexes over sites Probabilistic Approaches to Phylogeny (4) P(x1,.., x5|T, t•) = P(x1|x4,t1)P(x2|x4,t2)P(x3|x5,t3)P(x4|x5,t4)P(x5) root x5 t4 x4 t2 t3 t1 x2 x1 x3 Divergence time estimates for major groups. Thick bars on branches denote fossil record of fungi; solid circles are calibration points. From Heckman et al. 2001. Science 293: 1132 Confidence Assessment Bootstrap values Bootstrapping is a statistical technique that can use random resampling of data to determine sampling error for tree topologies Bootstrapping phylogenies Characters are resampled with replacement to create many bootstrap replicate data sets Each bootstrap replicate data set is analysed (e.g. with parsimony, distance, ML etc.) Agreement among the resulting trees is summarized with a majority-rule consensus tree Frequencies of occurrence of groups, bootstrap proportions (BPs), are a measure of support for those groups Outline Evolution theory Concept of phylogeny Molecular clock Types of trees UPGMA Parsimony Maximum likelihood An example for bird flu Avian Influenza Viruses Single strand Negative RNA Fragmented Polymorphic 2003/2004 H5N1 Pandemic Highly pathogenic; can be transmitted to people and some cases are fatal Virus: 8 genomic segments (PB1, PB2, PA, HA, NP, NA, M, and NS) and genetic reassortment DNA sequences o South China Agricultural University, China o Genbank Sources and evolution of flu viruses? Outbreak History HA HA gene Other 6 segments (excluding PA) have a similar tree structure PA gene Our analyses suggest that 2003-04 H5N1 pandemic be caused by multiple independent transmissions with multiple genotypes from genetic reassortments. (2001) Reading Assignments Suggested reading: Chapter 14 in “Warren J. Ewens and Gregory R. Grant: Statistical Methods in Bioinformatics – An Introduction. Springer. 2001”. Optional reading: Chapter 17 in “Dan Gusfield: Algorithms on Strings, Trees, and Sequences. Cambridge University Press. 1997”. Project Assignment Develop a program that implement the UPGMA algorithm 1. Modify your code in the assignment for global alignment. 2. Use edit distance (match 1; otherwise 0) with gap penalty –1 – k (k is gap size) for pairwise sequence alignment. 3. Use the sequence identity as the tree distance between leaves. 4. Input format: FASTA in one file. 5. Output format: (((a, b), d1), c, d2))…