MUMmer: fast alignment of large-scale DNA and protein sequences Presented by : Arthur Loder Course : CSE 497 Computational Issues in Molecular Biology Date : February 17, 2004 MUMmer’s Significance MUMmer is a system for rapidly aligning entire genomes or very large protein sequences Input=2 strings; Output=base-to-base alignment What distinguishes MUMmer from previous algorithms? Can align very large sequences (millions of nucleotides long) -2- CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Complete Genome Alignment What it means to align complete genomes: Human chromosome #1: 200 million base pairs Previous alignment algorithms run space efficiently O(n) Time complexity O(n2) is unacceptably slow We would like to align using a global dynamic programming algorithm (Needleman/Wunsch), but infeasible; need a shortcut -3- CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 MUMmer vs. BLAST Instead, use technique somewhat similar to BLAST (find similar regions between strings) BLAST for comparing 1 known string to a large set of unknown strings MUMmer for aligning 2 very similar known strings -4- CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 You know you’ve made it when… -5- CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Overview: MUMs Part 1 String A: …CTTCAGCTAGCTAGTCCCTAGCTAATCAAGAGACTAGGATCAAGGCTAGSCTGAAGT GCACCAGGTTGCAATCCATCTATGAGCGATCGAATCGGATCGAGTCGAGCTAGCTA AGCTAGCTAGGGAGTCCAAAGACTGCGGATGCGAGTCGAGGCTTTAGAGCTAGCT AGCGCGATCGAGGCTAGCTATGCTAGCTATCATCGCAAGCTAGCTGAGTCGCGAT CGGGCGTAGCGATGTCTCTAGTCTCTAGTCGAGCTGATCGTAGCTAGTAATGTATC CATCTACTCTAGTAGATCGATTAGTCGATCGATGCTAGATCGGATCGAGTCGAGAT CGATGGAGTCGAGATCGATCTAATCTATCTCTAAATGGAGCGA… String B: …GCATCGTAGGCTGAGGCTTCGAGGCTAGTCGATGCTAGGTTGCAATCCATCTATGA GCGATCGAATCGGATCGAGTCGAGCTAGCTAAGCTAGCTAGGGAGTCCAAACTCG CAAAGCTAGTGATCGATCGATATCGATTCGATCGGTGTCGCGATCGGGCGTAGCG ATGTCTCTAGTCTCTAGTCGAGCTGATCGTAGCTAGTAATGTATCATAGCTAATCG CACTACTACGATGCGATCTCTAGTCGATCTATCTCGGCTTCGATCGTA… How align without Needleman/Wunsch ? -6- CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Overview: MUMs Part 2 String A: …CTTCAGCTAGCTAGTCCCTAGCTAATCAAGAGACTAGGATCAAGGCTAGSCTGAAGT GCACCAGGTTGCAATCCATCTATGAGCGATCGAATCGGATCGAGTCGAGCTAGCTA AGCTAGCTAGGGAGTCCAAAGACTGCGGATGCGAGTCGAGGCTTTAGAGCTAGCT AGCGCGATCGAGGCTAGCTATGCTAGCTATCATCGCAAGCTAGCTGAGTCGCGAT CGGGCGTAGCGATGTCTCTAGTCTCTAGTCGAGCTGATCGTAGCTAGTAATGTATC CATCTACTCTAGTAGATCGATTAGTCGATCGATGCTAGATCGGATCGAGTCGAGAT CGATGGAGTCGAGATCGATCTAATCTATCTCTAAATGGAGCGA… String B: …GCATCGTAGGCTGAGGCTTCGAGGCTAGTCGATGCTAGGTTGCAATCCATCTATGA GCGATCGAATCGGATCGAGTCGAGCTAGCTAAGCTAGCTAGGGAGTCCAAACTCG CAAAGCTAGTGATCGATCGATATCGATTCGATCGGTGTCGCGATCGGGCGTAGCG ATGTCTCTAGTCTCTAGTCGAGCTGATCGTAGCTAGTAATGTATCATAGCTAATCG CACTACTACGATGCGATCTCTAGTCGATCTATCTCGGCTTCGATCGTA… Easier with large exact matches highlighted? -7- CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Overview: MUMs Part 3 Any optimal global alignment will probably use these two subsequences as “anchors” This is the shortcut needed to calculate global alignment quickly on large sequences Very intuitive in alignment process, but only a heuristic -8- CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Overview: MUMs Part 4 MUM: Maximal Unique Match MUMs occur exactly once in each sequence Ignores repeat sequences MUMs found efficiently using suffix tree data structure (to be explained later) -9- CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Overview: Choosing MUMs Once have anchors, need to choose which ones to use in alignment All MUMs: MUMs used in alignment (subset): - 10 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Overview: Closing the Gaps After choose/align anchors, what next? Close the gaps Use dynamic programming (Smith-Waterman) to extend MUMs Smaller regions, so can compute quickly Implicit assumption: sequences very similar - 11 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 3 Phases of MUMmer 3 Phases: 1) Obtaining MUMs (via Suffix Trees) 2) MUM choosing (via Longest Increasing Subsequences) 3) Gap closing (via dynamic programming, Smith-Waterman) Comprised of previously known algorithms packaged to form a unique algorithm - 12 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Critiquing MUMmer’s Output Sample Output: Sequence A: Sequence B: …ACTGC_TGAC_CTA… …ACC_CA_GGCTCG_… ^ ^ ^ MUMmer best-case: same alignment as Needleman/Wunsch MUMmer worst-case: sub optimal alignment At least computable, whereas Needleman/Wunsch is not - 13 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Phase 1: Obtaining MUMs via Suffix Trees Edward M. McCreight. (1973) A Space-Economical Suffix Tree Construction Algorithm. http://doi.acm.org/10.1145/321941.321946 Suffix Trees Outline I. Suffix Trees A. Motivations B. Tries C. Suffix Trees D. How Mummer utilizes suffix trees - 15 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Tries Term ‘trie’ comes from ‘retrieval’ Introduced in 1960’s by Fredkin Suffix trees are a type of trie Uses: Quickly search large text via preprocessing Used for regular expressions, longest common substring, automatic command completion, etc - 16 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Non-Compact Trie Example 5 strings encoded: BIG, BIGGER, BILL, GOOD, GOSH Every edge represents a symbol of the alphabet - 17 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Implementation of Tries Use linked list Include pointers to sibling and first child - 18 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Compacting Tries : Part 1 Method 1: trim chains leading to leaves Compact trie for strings: BIG, BIGGER, BILL, GOOD, GOSH - 19 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Compacting Tries : Part 2 Method 2: Patricia Tries Before, one edge per character Now, unary nodes are collapsed - 20 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Suffix Trie Normal trie, but input strings are suffixes Assume text string [t1…tn] Q: Tree has how many leaves? A: Tree has n Leaves - 21 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Suffix Tree First compact suffix trie Next collapse unary nodes - 22 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Suffix Trees: Decreasing storage Rather than storing strings, store a pair of indices (x,y) where x is beginning of string and y is the end Storage becomes O(n) - 23 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Suffix Tree Algorithms First linear-time algorithm given by Weiner (1973) McCreight developed more space efficient algorithm (1976) Two original papers’ reputations: difficult to understand - 24 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 McCreight’s Algorithm Part 1 Algorithm M: “Maps finite string S of characters into auxiliary index to S in the form of a digital search tree T whose paths are the suffixes of S, and whose terminal nodes correspond uniquely to positions within S.” A Space-Economical Suffix Tree Construction Algorithm. Edward M. McCreight. http://doi.acm.org/10.1145/321941.321946 - 25 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 McCreight’s Algorithm Part 2 S = ababc Definitions: suf3 abc =? head3 =? sufi is suffix of S beginning at character position i headi is longest prefix of sufi which is also prefix for sufj for some j<i taili is sufi – headi ab tail3 =? abc – ab = c - 26 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 McCreight’s Algorithm Part 3 Builds suffix tree by adding sufi to Treei-1 Initially, Tree1 contains only suff1 (the entire string) To obtain Tree2, add suff2 to Tree1 Continue until you have added suffn to Treen-1 Treen is the final suffix tree - 27 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 McCreight’s Algorithm Part 4 Adding a suffix (going from T2 to T3) suf3= abc; head3 = ab; tail3 = c - 28 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Suffix Trees Complexity Adding a non-terminal and a new arc that corresponds to tail takes at most constant time If could find head in at most constant time, it would run in linear time n, the length of the string S Do so by using suffix links (see paper for details) - 29 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Finding MUMs from Suffix Trees - 30 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Finding MUMs from Suffix Trees 2 - 31 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Finding MUMs from Suffix Trees 3 More general case: - 32 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Phase 2: Choosing MUMs For Alignment via Longest Increasing Subsequence (LIS) Gusfield, D. (1997) Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Motivation For Choosing MUMs Q: Why can’t we use all MUMs for alignment? A: Due to crossing of MUMs; can only choose increasing set of MUMs Problem: given a set of MUMs, how do we choose the optimal sequence? - 34 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Choosing MUMs (Continued) Configuration can be uniquely represented: P = {1, 2, 3, 4, 6, 7, 5}; LIS(P) = {1, 2, 3, 4, 6, 7} Determining optimal sequence of MUMs reduces to finding LIS of P CSE 497 Computational Issues in Molecular Biology - 35 - Arthur Loder – Spring 2004 – February 17 IS Definition Increasing Subsequence: values (strictly) increase from left to right Sequence P = {4, 2, 1, 5, 8, 6, 9, 10} Examples of two increasing subsequences: {4, 5, 9} or {2, 5, 6, 9, 10} - 36 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 DS Subsequence Decreasing Subsequence: numbers that are decreasing from left to right Sequence P = {4, 2, 1, 5, 8, 6, 9, 10} Examples? <insert class participation here> {4, 2, 1}, {4, 2}, {4, 1}, {2, 1}, or {8, 6} - 37 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Covers Definition Part 1 Cover of P: set of decreasing subsequences of P P = {7, 3, 4, 8, 6, 2, 1} that contains all numbers of P Some possible covers ? {{7, 3} {4} {8} {6, 2, 1}} OR {{7, 3, 2, 1} {4} {8, 6}} And Others… - 38 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Covers Definitions Part 2 Size of cover: number of decreasing subsequences Smallest cover: cover with minimum size it contains “If I is an increasing subsequence of P with length equal to the size of a cover of P, call it C, then I is a longest increasing subsequence of P and C is a smallest cover of P” Why? - 39 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Covers & Relation to LIS “If I is an increasing subsequence of P with length equal to the size of a cover of P, call it C, then I is a longest increasing subsequence of P and C is a smallest cover of P”. Why? Because no increasing subsequence can contain more than one character from each decreasing subsequence in a cover - 40 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Covers: Examples P = {7, 3, 4, 8, 6, 2, 1} Two possible covers: {{7, 3} {4} {8} {6, 2, 1}} {{7, 3, 2, 1} {4} {8, 6}} What is the size of the smallest cover? 3 (no cover can contain < 3 decreasing subsequences) How many elements in LIS? 3 - 41 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Covers: Examples Continued P = {7, 3, 4, 8, 6, 2, 1} For a particular cover, say: {{7, 3, 2, 1} {4} {8, 6}} You can only choose one element from each subsequence, otherwise subsequence would not be increasing. Example: IS: {3, 4, 6}; to add an element, would need to choose from a subsequence from which you already chose CSE 497 Computational Issues in Molecular Biology - 42 - Arthur Loder – Spring 2004 – February 17 Greedy Cover Algorithm Part 1 To create a smallest cover, use Greedy Cover algorithm: Start from left of sequence P Examine each number Place number at the end of the left-most subsequence it can extend If none exists, make a new decreasing subsequence (to the far right) - 43 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Greedy Cover Algorithm Part 2 Example: P = 6, 3, 5, 1, 9, 7, 2, 10, 4 {6} {6, 3} {6, 3} {5} {6, 3, 1} {5} {6, 3, 1} {5} {9} {6, 3, 1} {5} {9, 7} {6, 3, 1} {5, 2} {9, 7} {6, 3, 1} {5, 2} {9, 7} {10} {6, 3, 1} {5, 2} {9, 7, 4} {10} {6, 3, 1} {5, 2} {9, 7, 4} {10} (smallest cover) - 44 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Obtaining LIS From Smallest Cover LIS Algorithm: Set i = # subsequences in greedy cover Set I to empty list Choose any element x in subsequence I and place in front of List I While i > 1 Scan from top of subsequence (i-1) and find first element y smaller than x x = y and i = i -1 Place x in the front of list I - 45 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Obtaining LIS : Example P = {6, 3, 5, 1, 9, 7, 2, 10, 4} Smallest Cover: {6, 3, 1} {5, 2} {9, 7, 4} {10} 6 5 9 10 3 2 7 1 4 i i i i i = = = = = # subsequences = 4 4; x = 10; I = {10} 3; x = 9; I = {9, 10} 2; x = 5; I = {5, 9, 10} 1; x = 3; I = {3, 5, 9, 10} P = 6, 3, 5, 1, 9, 7, 2, 10, 4 LIS : {3, 5, 9, 10} - 46 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 How Mummer Utilizes LIS P = {1, 2, 3, 4, 6, 7, 5} LIS = {1, 2, 3, 4, 6, 7} - 47 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Obtaining LIS: Complexity Analysis Greedy cover can be found in O(nlogn) LIS found from greedy cover in O(n) - 48 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Phase 3: Closing the Gaps via Smith-Waterman Smith, T. and Waterman, M. (1981) Identification of Common Molecular Subsequences. Journal of Molecular Biology , 147, 195-197. http://citeseer.ist.psu.edu/smith81identification.html Closing the Gaps After global-MUM alignment found, need to close local gaps Gap: interruption in MUM-alignment Types of gaps: 1) 2) 3) 4) SNP Single Nucleotide Polymorphisms Insertion Highly polymorphic region Repeat - 50 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Types of Gaps : Examples - 51 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 SNP Processing SNP = Single Nucleotide Polymorphism SNPs in human DNA appear to be associated with many health issues (genetic disease) Q: How can we determine SNPs by using MUMmer? A: By looking between MUMs; SNPs surrounded by matching subsequences - 52 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Inserts: Transpositions Large gaps in one genome but not the other Transpositions: subsequences that were deleted from one location, inserted elsewhere Detected during post-processing How? Part of MUM-alignment - 53 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Simple Inserts Simple inserts: subsequences that appear in only one genome Do not appear in MUM-alignment - 54 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Polymorphic Regions Gaps in alignment caused by sequences with large numbers of differences If regions small enough, align using dynamic programming (Smith-Waterman) Gives optimal alignment given pre-specified insertion and mutation costs If regions too large, recursively call MUMmer algorithm Q: What must change when running MUMmer again? A: Minimum MUM size must be smaller - 55 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Repeat Processing Repeat sequences do not appear in MUM alignment Only includes sequences that appear exactly once in each genome In a sense, repeat sequences can “fake out” the alignment - 56 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Finding Inversions Professor Lopresti mentioned a method of finding inversions last class. Q: How can we identify inversions by using MUMmer? A: By running in both directions on gap regions - 57 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Results: Mummer 1.0 Storage 12 bytes per leaf node (suffix tree) 24 bytes per internal node (suffix tree) 1 byte for each base in genome Generous upper-bound: 37 bytes per base Therefore, comparing two 100Mb would require < 8 gigabytes of memory - 58 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Improvements: Mummer 2.0 New suffix tree algorithm (Kurtz) At most 20 bytes per base pair (compared to 37) Stream query string against suffix tree, cutting down suffix tree storage - 59 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Improvements: Mummer 3.0 New graphical modules Search operations are optimal or nearoptimal Code rewrite (modular design) OpenSource - 60 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 MUMmer: Conclusion via Arthur Loder In Conclusion MUMmer allows alignment of sequences which are too long to be aligned using previous algorithms Utilizes suffix trees, LIS, and SmithWaterman to obtain results Outputs a base-to-base alignment of two sequences - 62 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 MUMmer: Discussion Discussion Questions What if MUMs overlap? What if some are longer than others? How does LIS take this into account? Are repeats really ignored? Should repeats be ignored? Aren’t they part of the global alignment? The goal is to obtain the same optimal alignment as in Needleman/Wunsch, which does not ignore repeats. - 64 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17 Demonstration http://www.tigr.org/tigr-scripts/CMR2/webmum/mumplot - 65 - CSE 497 Computational Issues in Molecular Biology Arthur Loder – Spring 2004 – February 17