Genome Rearrangement Phylogeny Robert K. Jansen School of Biology University of Texas at Austin Bernard M.E. Moret Department of Computer Science University of New Mexico Li-San Wang Tandy Warnow Department of Computer Sciences University of Texas at Austin Outline • Introduction • Genome rearrangement phylogeny reconstruction • Application • Other methods • Future research 2 New Phylogenetic Signals • Large-throughput sequencing efforts lead to larger datasets − Challenge: inferring deep evolutionary events • Biologists turning to “rare genomic changes” − Rare − Large state space − High signal-to-noise ratio − Potential for clarifying early evolution − Best studied: gene order evolution (genome rearrangement) 3 Genomes As Signed Permutations 1 –5 3 4 -2 -6 or 5 –1 6 2 -4 -3 etc. 4 Gene Order Data • Rare changes on the genomic scale • Large state space DNA: 4 states/character − Protein (amino acid sequence): 20 states/character − Circular gene order with 120 genes: − 2119 119! 3.7049 10 232 states/character • High signal-to-noise ratio 5 Genomes Evolve by Rearrangements 1 2 3 4 5 6 7 8 9 10 Inversion: 1 2 –6 –5 -4 -3 7 8 9 10 Transposition: 1 2 7 8 3 5 6 9 10 Inverted Transposition: 1 2 7 8 –6 -5 -4 -3 9 10 4 6 Edit Distances Between Genomes • (INV) Inversion distance [Hannenhalli & Pevzner 1995] − Computable in linear time [Moret et al 2001] • (BP) Breakpoint distance [Watterson et al. 1982] − − Computable in linear time NJ(BP): [Blanchette, Kunisawa, Sankoff, 1999] A= 1 2 3 4 5 6 7 8 9 10 B= 1 2 3 -8 -7 -6 4 5 9 10 BP(A,B)=3 7 Our Model: the Generalized Nadeau-Taylor Model [STOC’01] • Three types of events: − − − Inversions (INV) Transpositions (TRP) Inverted Transpositions (ITP) • Events of the same type are equiprobable • Probabilities of the three types have fixed ratio • We focus on signed circular genomes in this talk. 8 Simulation Study Protocol Synthetic Input Evolutionary Process A 1 -4 2 -3 5 6 B 1 3 2 5 4 6 C 1 3 -2 5 6 4 D 1 3 4 5 -2 6 E 1 4 5 -3 -2 6 F 1 2 3 6 4 5 Phylogenetic Method True Tree C Known in simulation A D B A B D E F C E F Inferred Tree 9 Quantifying Error True Tree C A B D E F FN: false negative D B C E F A Inferred Tree (missing edge) 1/3=33.3% error rate 10 Outline • Genome rearrangement evolution • Genome rearrangement phylogeny reconstruction • Application • Other methods • Future research 11 Gene Order Parsimony A A B C D C X Y E F Z B D Length (T) = E W F min AX+BX+XY+CY+YZ+DZ+YW+EW+FW X,Y,Z,W 12 Breakpoint Phylogeny [Sankoff & Blanchette 1998] • “Maximum Parsimony”-style problem: − Find tree(s), leaf-labeled by genomes, with shortest breakpoint length • NP-hard problem on two levels: − Find the shortest tree (the space of trees has exponential size) − Given a tree, find its breakpoint length (Even for a tree with 3 leaves, but can be reduced to TSP) • BPAnalysis [Sankoff & Blanchette 1998] − Takes 200 years to compute our 13-taxon dataset on a Sun workstation 13 BPAnalysis • Tree length evaluation for EVERY tree • Given a fixed tree topology, evaluate the tree length: − Iteratively evaluate the median problem (tree length for a 3-leaf tree) AA C C X’ X Y Y’ E Y B ZZ B D W F 14 GRAPPA (Genome Rearrangement Analysis under Parsimony and other Phylogenetic Algorithms) http://www.cs.unm.edu/~moret/GRAPPA/ • Uses lowerbound techniques to speed up • Used on real datasets, producing thousand-fold speedups over BPAnalysis [ISMB’01] • Contributors: (led by Bernard Moret at UNM) U. New Mexico U. Texas at Austin Universitá di Bologna, Italy 15 The Circular Lowerbound of the Length of a Tree • Given a tree, we can lowerbound its length very quickly: 2w(T ) lb (T ) d1,2 d 2,3 d3,4 d 4,1 2w(T ) lb (T ) 1 2 3 4 16 The Lowerbound Technique • Avoid any tree X without potential: − tree X whose lowerbound lb(X) is higher than twice the length c(T) of the best tree T • Finding a good starting tree quickly is of utmost importance • We turn to distance-based methods − Neighbor joining (NJ) [Saitou and Nei 1987] − Weighbor [Bruno et al. 2000] 17 Additive Distance Matrix and True Evolutionary Distance (T.E.D.) S3 S1 5 7 3 4 S4 5 1 8 S1 S2 S3 S1 0 9 15 S2 0 14 S3 0 S4 S5 S4 14 13 13 0 S5 17 16 16 13 0 S2 S5 Theorem [Waterman et al. 1977] Given an m×m additive distance matrix, we can reconstruct a tree realizing the distance in O(m2) time. 18 Error Tolerance of Neighbor Joining Theorem [Atteson 1999] Let {Dij} be the true evolutionary distances, and {dij} be the estimated distances for T. Let be the length of the shortest edge in T. If for all taxa i,j, we have 1 | Dij d ij | 2 then neighbor joining returns T. 19 BP and INV BP/2 vs K (120 genes) (K: Actual number of inversions) INV vs K (Inversion-only evolution) 20 NJ(BP) [Blanchette, Kunisawa, Sankoff 1999] and NJ(INV) Inversion only Transpositions/ inverted transpositions only 120 genes, 160 leaves Uniformly Random Tree 21 Estimate True Evolutionary Distances Using BP To use the scatter plot to estimate the actual number of events (K): 1. Compute BP/2 2. From the curve, look up the corresponding value of K (2) (1) BP/2 vs K (120 genes) (K: Actual number of inversions) (Inversion-only evolution) 22 True Evolutionary Distance (t.e.d.) Estimators for Gene Order Data T.E.D. Estimator Exact-IEBP [WABI’01] Based on the Breakpoint Expectation of distance (Exact) Approx-IEBP [STOC’01] EDE [ISMB’01] Breakpoint distance (Approx.) Inversion distance (Approx.) Derivation Analytical Analytical Empirical Model knowledge Required Required Inversiononly IEBP: Inverting the Expected BreakPoint distance EDE: Empirically Derived Estimator 23 True Evolutionary Distance Estimators BP vs K (120 genes) (K: Actual number of inversions) Exact-IEBP vs K (Inversion-only evolution) 24 Variance of True Evolutionary Distance Estimators • There are new distance-based phylogeny reconstruction methods (though designed for DNA sequences) − Weighbor [Bruno et al. 2000] These methods use the variance of good t.e.d.’s, and yield more accurate trees than NJ. • Variance estimates for the t.e.d.s [Wang WABI’02] − Weighbor(IEBP), Weighbor(EDE) K vs Exact-IEBP (120 genes) 25 Using T.E.D. Helps 120 genes 160 leaves Uniformly random tree Transpositions/inverted transpositions only (180 runs per figure) 5% 26 Observations • EDE is the best distance estimator when used with NJ and Weighbor. • True evolutionary distance estimators are reliable even when we do not know the GNT model parameters (the probability ratios of the three types of events). 27 Outline • Genome rearrangement evolution • Genome rearrangement phylogeny reconstruction • Application • Other methods • Future research 28 Percentage of Trees Eliminated Through Bounding [ISMB’01] edge length=2 # taxa 10 20 40 80 160 10 0 0 0 1% 1% 20 0 80% 91% 1% 1% 40 91% 100% 100% 100% 100% 80 99% 100% 100% 100% 100% 160 100% 100% 100% 100% 100% 320 100% 100% 100% 100% 100% #genes Uses NJ(EDE) as starting tree 29 Campanulaceae cpDNA • 13 taxa (tobacco as outlier) • 105 gene segments • GRAPPA finds 216 trees with shortest breakpoint length (out of 654,729,075 trees) • Running Time: − − − BPAnalysis takes 2 centuries on a Sun workstation GRAPPA takes 1.5 hours on a 512-node supercluster About 2300-fold speedup on a single node 30 Trachelium Campanula Adenophora Symphandra Legousia Asyneuma Triodanis Wahlenbergia Codonopsis Merciera Cyananthus Platycodon Tobacco Campanulaceae [Moret et al. ISMB 2001] Strict consensus of 216 optimal trees found by GRAPPA 6 out of 10 max. edges found 31 Outline • Genome rearrangement evolution • Genome rearrangement phylogeny reconstruction • Application • Other methods • Future research 32 “Fast” Approaches for Genome Rearrangement Phylogeny • Basic technique: encode data as strings and apply maximum parsimony • Running time exponential in the number of genomes, but polynomial in the number of genes (faster than GRAPPA) • MPBE [ISMB’00] Maximum Parsimony using Binary Encodings • MPME [Boore et al. Nature ’95, PSB’02] Maximum Parsimony using Multi-state Encodings • The length of a tree using these two methods is a lowerbound of the true breakpoint length [Bryant ’01] 33 Maximum Parsimony using Binary Encoding (MPBE) Input genome (circular) A: 1 2 3 4 = -4 –3 –2 –1 B: 1 -4 -3 –2 = 2 3 4 -1 C: 1 2 -3 –4 = 4 3 –2 -1 MPBE Strings A: 1 B: 0 C: 1 1 1 0 1 1 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 1 34 Maximum Parsimony using Multistate Encoding (MPME) Input genome (circular) A: 1 2 3 4 = -4 –3 –2 –1 B: 1 -4 -3 –2 = 2 3 4 -1 C: 1 2 -3 –4 = 4 3 –2 -1 MPME Strings 1 2 3 4 -1 –2 –3 -4 A: 2 3 4 1 –4 –1 –2 -3 B: -4 3 4 –1 2 1 –2 -3 C: 2 –3 -2 3 4 –1 -4 1 We use PAUP to solve Maximum Parsimony => Constraint: number of states per site cannot exceed 32 35 NJ vs MP (120 genes, 160 genomes) All three event types equiprobable (datasets that exceed 32-state limit for MPME are dropped) 36 Inversion Phylogeny • Inversion median has higher running time than breakpoint median • Inversion phylogeny overall has shorter running time than breakpoint phylogeny, and returns more accurate trees [Moret et al. WABI ’02] 37 DCM-GRAPPA [Moret & Tang 2003] • Disk-Covering Method: divide the original problem into subproblems [Huson, Nettles, Parida, Warnow and Yooseph, 1998] • Uses inversion distance • DCM-GRAPPA: can now process thousands of genomes, each having hundreds of genes 38 Ongoing and Future Research • Genome rearrangement phylogeny with unequal gene content (duplications, deletions, etc.) • Non-uniform genome rearrangement models (Segment-length dependent model, hotspots) 39 Acknowledgements • University of Texas Tandy Warnow (Advisor) Robert K. Jansen Stacia Wyman Luay Nakhleh Usman Roshan Cara Stockham Jerry Sun • University of New Mexico Bernard M.E. Moret David Bader Jijun Tang Mi Yan • Central Washington University Linda Raubeson 40 Phylolab Department of Computer Sciences University of Texas at Austin Please visit us at http://www.cs.utexas.edu/users/phylo/ 41