lecture_8 - Computer Science & Engineering

MUMmer: fast alignment of
large-scale DNA and protein
sequences
Presented by : Arthur Loder
Course : CSE 497 Computational Issues in Molecular Biology
Date : February 17, 2004
MUMmer’s Significance



MUMmer is a system for rapidly aligning entire
genomes or very large protein sequences
Input=2 strings; Output=base-to-base alignment
What distinguishes MUMmer from previous
algorithms?
Can align very large sequences (millions of
nucleotides long)
CSE 497 Computational Issues in Molecular Biology
Arthur Loder – Spring 2004 – February 17
Complete Genome Alignment

What it means to align complete genomes:




Human chromosome #1: more than 200 million base pairs
Previous alignment algorithms can run in
O(n) space
But their O(n^2) time complexity is unacceptably slow
We would like to align using a global dynamic
programming algorithm (Needleman/Wunsch),
but that is infeasible; we need a shortcut
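To make the quadratic cost concrete, here is a minimal sketch of Needleman/Wunsch global alignment scoring. The function name and the scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not taken from the slides. The nested loops visit every cell of an n-by-m table, which is where the O(n^2) time comes from; keeping only one previous row is what gives O(n) space.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b."""
    # Row 0: aligning the empty prefix of a against prefixes of b costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against the empty string
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "AGT"))  # 2: three matches, one gap
```

For two sequences of 200 million bases each, the table would have 4 x 10^16 cells, which is why a shortcut is needed.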
MUMmer vs. BLAST



Instead, use technique somewhat similar to
BLAST (find similar regions between strings)
BLAST for comparing 1 known string to a
large set of unknown strings
MUMmer for aligning 2 very similar known
strings
You know you’ve made it when…
Overview: MUMs Part 1
String A:

…CTTCAGCTAGCTAGTCCCTAGCTAATCAAGAGACTAGGATCAAGGCTAGSCTGAAGT
GCACCAGGTTGCAATCCATCTATGAGCGATCGAATCGGATCGAGTCGAGCTAGCTA
AGCTAGCTAGGGAGTCCAAAGACTGCGGATGCGAGTCGAGGCTTTAGAGCTAGCT
AGCGCGATCGAGGCTAGCTATGCTAGCTATCATCGCAAGCTAGCTGAGTCGCGAT
CGGGCGTAGCGATGTCTCTAGTCTCTAGTCGAGCTGATCGTAGCTAGTAATGTATC
CATCTACTCTAGTAGATCGATTAGTCGATCGATGCTAGATCGGATCGAGTCGAGAT
CGATGGAGTCGAGATCGATCTAATCTATCTCTAAATGGAGCGA…
String B:

…GCATCGTAGGCTGAGGCTTCGAGGCTAGTCGATGCTAGGTTGCAATCCATCTATGA
GCGATCGAATCGGATCGAGTCGAGCTAGCTAAGCTAGCTAGGGAGTCCAAACTCG
CAAAGCTAGTGATCGATCGATATCGATTCGATCGGTGTCGCGATCGGGCGTAGCG
ATGTCTCTAGTCTCTAGTCGAGCTGATCGTAGCTAGTAATGTATCATAGCTAATCG
CACTACTACGATGCGATCTCTAGTCGATCTATCTCGGCTTCGATCGTA…

How to align without Needleman/Wunsch?
Overview: MUMs Part 2
(Same strings A and B as in Part 1; the original slide highlighted the large exact matches they share.)

Easier with large exact matches highlighted?
Overview: MUMs Part 3



Any optimal global alignment will probably
use these two subsequences as “anchors”
This is the shortcut needed to calculate
global alignment quickly on large sequences
Very intuitive in alignment process, but only
a heuristic
Overview: MUMs Part 4

MUM: Maximal Unique Match

MUMs occur exactly once in each sequence and
cannot be extended in either direction without a mismatch


Ignores repeat sequences
MUMs found efficiently using suffix tree data
structure (to be explained later)
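The definition can be checked on small inputs with a brute-force sketch. This is not how MUMmer finds MUMs (it uses a suffix tree); the quadratic scan below, the function names, and the `min_len` parameter are assumptions chosen only to illustrate "maximal" and "unique".

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums_naive(a, b, min_len=3):
    """Return sorted (pos_in_a, pos_in_b, length) for each MUM of a and b."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Left-maximal: skip starts that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Right-maximal: extend until the characters disagree.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            s = a[i:i + length]
            # Unique: the matched string occurs exactly once in each input.
            if (length >= min_len
                    and count_occurrences(a, s) == 1
                    and count_occurrences(b, s) == 1):
                mums.append((i, j, length))
    return sorted(mums)


print(find_mums_naive("AAACGTCC", "GGACGTTT"))  # [(2, 2, 4)]
print(find_mums_naive("ACGACG", "ACGTTT"))      # []: ACG repeats in the first string
```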
Overview: Choosing MUMs

Once we have anchors, we need to choose which
ones to use in the alignment


All MUMs:
MUMs used in alignment (subset):
Overview: Closing the Gaps

After choosing and aligning anchors, what next?

Close the gaps



Use dynamic programming (Smith-Waterman)
to extend MUMs
Smaller regions, so can compute quickly
Implicit assumption: sequences very similar
3 Phases of MUMmer

3 Phases:




1) Obtaining MUMs (via Suffix Trees)
2) MUM choosing (via Longest Increasing
Subsequences)
3) Gap closing (via dynamic programming,
Smith-Waterman)
Composed of previously known algorithms
packaged to form a unique algorithm
Critiquing MUMmer’s Output



Sample Output:
Sequence A: …ACTGC_TGAC_CTA…
Sequence B: …ACC_CA_GGCTCG_…
MUMmer best-case: same alignment as
Needleman/Wunsch
MUMmer worst-case: suboptimal alignment

At least feasible to compute at this scale, whereas Needleman/Wunsch is not
Phase 1: Obtaining MUMs
via Suffix Trees
Edward M. McCreight. (1976) A Space-Economical Suffix
Tree Construction Algorithm.
http://doi.acm.org/10.1145/321941.321946
Suffix Trees Outline

I. Suffix Trees




A. Motivations
B. Tries
C. Suffix Trees
D. How MUMmer utilizes suffix trees
Tries




Term ‘trie’ comes from ‘retrieval’
Introduced in the 1960s by Fredkin
Suffix trees are a type of trie
Uses:


Quickly search large text via preprocessing
Used for regular expressions, longest common
substring, automatic command completion, etc
Non-Compact Trie Example


5 strings encoded: BIG, BIGGER, BILL, GOOD, GOSH
Every edge represents a symbol of the alphabet
Implementation of Tries


Use linked list
Include pointers to sibling and first child
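A minimal sketch of that layout, assuming hypothetical names (`TrieNode`, `find_child`, `insert`, `contains`) that do not appear on the slides. Each node keeps one pointer to its first child and one to its next sibling, so the children of a node form a linked list.

```python
class TrieNode:
    """Trie node stored in first-child / next-sibling form."""

    def __init__(self, symbol=None):
        self.symbol = symbol       # edge label leading into this node
        self.first_child = None    # head of the linked list of children
        self.next_sibling = None   # next node in the parent's child list
        self.is_word = False       # True if a stored string ends here


def find_child(node, symbol):
    """Walk the sibling list of node's children looking for symbol."""
    child = node.first_child
    while child is not None and child.symbol != symbol:
        child = child.next_sibling
    return child


def insert(root, word):
    node = root
    for symbol in word:
        child = find_child(node, symbol)
        if child is None:
            child = TrieNode(symbol)
            # Push the new child onto the front of the sibling list.
            child.next_sibling = node.first_child
            node.first_child = child
        node = child
    node.is_word = True


def contains(root, word):
    node = root
    for symbol in word:
        node = find_child(node, symbol)
        if node is None:
            return False
    return node.is_word


root = TrieNode()
for w in ["BIG", "BIGGER", "BILL", "GOOD", "GOSH"]:
    insert(root, w)
print(contains(root, "BIGGER"), contains(root, "BI"))  # True False
```

The trade-off is that a lookup scans a sibling list at each level, but each node needs only two pointers regardless of alphabet size.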
Compacting Tries : Part 1

Method 1: trim chains leading to leaves

Compact trie for strings: BIG, BIGGER, BILL, GOOD,
GOSH
Compacting Tries : Part 2

Method 2: Patricia Tries


Before, one edge per character
Now, unary nodes are collapsed
Suffix Trie




Normal trie, but input strings are the suffixes of one text
Assume text string [t1…tn]
Q: Tree has how many leaves?
A: Tree has n leaves (one per suffix, assuming a unique end-of-string symbol)
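The leaf count can be checked with a small sketch. It assumes a nested-dictionary trie and a `$` terminator that does not occur in the text; both are illustrative choices, not part of the slides.

```python
def build_suffix_trie(text, terminator="$"):
    """Insert every suffix of text (with terminator appended) into a trie."""
    s = text + terminator
    root = {}
    for i in range(len(s) - 1):  # the n suffixes that start inside text
        node = root
        for symbol in s[i:]:
            node = node.setdefault(symbol, {})
    return root


def count_leaves(node):
    """A leaf is a node with no children."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


print(count_leaves(build_suffix_trie("ababc")))  # 5
```

Without the terminator, a suffix that is a prefix of another suffix (for example in "aaaa") would end at an internal node and the count would fall below n.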
Suffix Tree


First compact suffix trie
Next collapse unary nodes
Suffix Trees: Decreasing storage


Rather than storing edge-label strings, store a pair of indices (x, y) into the text,
where x is the beginning of the label and y is the end
Storage becomes O(n)
Suffix Tree Algorithms



First linear-time algorithm given by Weiner
(1973)
McCreight developed more space efficient
algorithm (1976)
Two original papers’ reputations: difficult to
understand
McCreight’s Algorithm Part 1

Algorithm M:

“Maps finite string S of characters into auxiliary
index to S in the form of a digital search tree T
whose paths are the suffixes of S, and whose
terminal nodes correspond uniquely to positions
within S.”
A Space-Economical Suffix Tree Construction Algorithm. Edward M. McCreight.
http://doi.acm.org/10.1145/321941.321946
McCreight’s Algorithm Part 2

S = ababc

Definitions:

suf_i is the suffix of S beginning at character position i
head_i is the longest prefix of suf_i that is also a prefix of suf_j for some j < i
tail_i is suf_i – head_i

suf_3 = ? abc
head_3 = ? ab
tail_3 = ? abc – ab = c
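The definitions can be evaluated directly with a brute-force sketch. The function name is hypothetical, and this quadratic comparison only illustrates the definitions; it is not how McCreight's algorithm locates the head.

```python
def head_and_tail(s, i):
    """Return (head_i, tail_i) for string s, using 1-based position i."""
    suf_i = s[i - 1:]
    head = ""
    for j in range(1, i):  # every earlier suffix suf_j with j < i
        suf_j = s[j - 1:]  # always longer than suf_i, so indexing is safe
        k = 0
        while k < len(suf_i) and suf_i[k] == suf_j[k]:
            k += 1
        if k > len(head):
            head = suf_i[:k]
    return head, suf_i[len(head):]


print(head_and_tail("ababc", 3))  # ('ab', 'c')
```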
McCreight’s Algorithm Part 3

Builds suffix tree by adding suf_i to Tree_(i-1)

Initially, Tree_1 contains only suf_1 (the entire string)

To obtain Tree_2, add suf_2 to Tree_1

Continue until you have added suf_n to Tree_(n-1)
Tree_n is the final suffix tree
McCreight’s Algorithm Part 4


Adding a suffix (going from Tree_2 to Tree_3)
suf_3 = abc; head_3 = ab; tail_3 = c
Suffix Trees Complexity



Adding a non-terminal node and a new arc that
corresponds to the tail takes at most constant
time
If the head could be found in constant time
(amortized over all insertions), the algorithm would run in
time linear in n, the length of the string S
This is achieved by using suffix links (see paper for
details)
Finding MUMs from Suffix Trees
Finding MUMs from Suffix Trees 2
Finding MUMs from Suffix Trees 3

More general case:
Phase 2: Choosing MUMs
For Alignment
via Longest Increasing
Subsequence (LIS)
Gusfield, D. (1997) Algorithms on Strings, Trees and Sequences:
Computer Science and Computational Biology.
Motivation For Choosing MUMs



Q: Why can’t we use all MUMs for alignment?
A: Due to crossing of MUMs; can only choose
increasing set of MUMs
Problem: given a set of MUMs, how do we choose
the optimal sequence?
Choosing MUMs (Continued)


Configuration can be uniquely represented:
P = {1, 2, 3, 4, 6, 7, 5};
LIS(P) = {1, 2, 3, 4, 6, 7}
Determining optimal sequence of MUMs reduces to
finding LIS of P
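One plausible way to obtain such a P is sketched below: label the MUMs 1..k in the order they occur in sequence A, then list those labels in the order the MUMs occur in sequence B. The function name, the `(pos_in_a, pos_in_b)` tuple layout, and the coordinates are all made-up assumptions for illustration.

```python
def mum_permutation(mums):
    """mums: list of distinct (pos_in_a, pos_in_b) pairs. Return P as a list."""
    by_a = sorted(mums, key=lambda m: m[0])
    label = {m: k + 1 for k, m in enumerate(by_a)}  # 1-based label by order in A
    by_b = sorted(mums, key=lambda m: m[1])
    return [label[m] for m in by_b]                 # labels read in B order


# Hypothetical coordinates: the fifth MUM in A occurs last in B.
mums = [(10, 5), (20, 15), (30, 25), (40, 35), (50, 75), (60, 45), (70, 55)]
print(mum_permutation(mums))  # [1, 2, 3, 4, 6, 7, 5]
```

MUMs whose labels form an increasing subsequence of P appear in the same relative order in both sequences, so they can all be used as anchors without crossing.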
IS Definition

Increasing Subsequence: values (strictly)
increase from left to right

Sequence P = {4, 2, 1, 5, 8, 6, 9, 10}

Examples of two increasing subsequences:
{4, 5, 9} or {2, 5, 6, 9, 10}
DS Definition

Decreasing Subsequence: numbers that are
decreasing from left to right

Sequence P = {4, 2, 1, 5, 8, 6, 9, 10}

Examples? <insert class participation here>
{4, 2, 1}, {4, 2}, {4, 1}, {2, 1}, or {8, 6}
Covers Definition Part 1

Cover of P: set of decreasing subsequences of P
that contains all numbers of P

P = {7, 3, 4, 8, 6, 2, 1}

Some possible covers?
{{7, 3} {4} {8} {6, 2, 1}}
OR
{{7, 3, 2, 1} {4} {8, 6}}
And Others…
Covers Definitions Part 2

Size of cover: number of decreasing subsequences
it contains

Smallest cover: cover with minimum size
“If I is an increasing subsequence of P with length
equal to the size of a cover of P, call it C, then I is
a longest increasing subsequence of P and C is a
smallest cover of P”
Why?
Covers & Relation to LIS


“If I is an increasing subsequence of P with
length equal to the size of a cover of P, call
it C, then I is a longest increasing
subsequence of P and C is a smallest cover
of P”. Why?
Because no increasing subsequence can
contain more than one element from each
decreasing subsequence in a cover, so no increasing
subsequence can be longer than the size of any cover
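Written out as inequalities, the argument runs as follows, where |I| is the length of an increasing subsequence and |C| the size of a cover (this notation is an assumption added here):

```latex
% Each element of I lies in some decreasing subsequence of C,
% and no two elements of I lie in the same one, so:
\[
  |I| \le |C| \quad \text{for every increasing subsequence } I
  \text{ and every cover } C \text{ of } P .
\]
% If some pair (I^*, C^*) satisfies |I^*| = |C^*|, then for all I and C:
\[
  |I| \le |C^*| = |I^*|
  \qquad\text{and}\qquad
  |C| \ge |I^*| = |C^*| ,
\]
% so I^* is a longest increasing subsequence and C^* is a smallest cover.
```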
Covers: Examples



P = {7, 3, 4, 8, 6, 2, 1}
Two possible covers:
{{7, 3} {4} {8} {6, 2, 1}}
{{7, 3, 2, 1} {4} {8, 6}}
What is the size of the smallest cover?


3 (no cover can contain < 3 decreasing subsequences)
How many elements in LIS?

3
Covers: Examples Continued




P = {7, 3, 4, 8, 6, 2, 1}
For a particular cover, say:
{{7, 3, 2, 1} {4} {8, 6}}
You can only choose one element from each
subsequence, otherwise subsequence would not be
increasing.
Example:
IS: {3, 4, 6}; to add an element, would need to
choose from a subsequence from which you
already chose
Greedy Cover Algorithm Part 1

To create a smallest cover, use Greedy Cover
algorithm:




Start from left of sequence P
Examine each number
Place number at the end of the left-most
subsequence it can extend
If none exists, make a new decreasing
subsequence (to the far right)
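The steps above can be sketched directly. The function name is hypothetical, and the simple left-to-right scan makes this version quadratic in the worst case; it is meant to mirror the description, not to be fast.

```python
def greedy_cover(p):
    """Build a cover of p as a list of decreasing subsequences."""
    cover = []
    for number in p:
        for subsequence in cover:         # scan from the left-most one
            if number < subsequence[-1]:  # number can extend this one
                subsequence.append(number)
                break
        else:
            cover.append([number])        # none fits: start a new one at the far right
    return cover


print(greedy_cover([7, 3, 4, 8, 6, 2, 1]))  # [[7, 3, 2, 1], [4], [8, 6]]
```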
Greedy Cover Algorithm Part 2

Example: P = 6, 3, 5, 1, 9, 7, 2, 10, 4










{6}
{6, 3}
{6, 3} {5}
{6, 3, 1} {5}
{6, 3, 1} {5} {9}
{6, 3, 1} {5} {9, 7}
{6, 3, 1} {5, 2} {9, 7}
{6, 3, 1} {5, 2} {9, 7} {10}
{6, 3, 1} {5, 2} {9, 7, 4} {10}
{6, 3, 1} {5, 2} {9, 7, 4} {10} (smallest cover)
Obtaining LIS From Smallest Cover

LIS Algorithm:

Set i = # subsequences in greedy cover
Set I to empty list
Choose any element x in subsequence i and
place it at the front of list I
While i > 1
    Scan from the top of subsequence (i-1) and find the first
    element y smaller than x
    Set x = y and i = i - 1
    Place x at the front of list I
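A minimal sketch of this extraction step, assuming the cover was produced by the greedy algorithm and is passed in as a list of lists (the function name is hypothetical). For a cover built some other way, the search for y may find nothing and raise `StopIteration`.

```python
def lis_from_cover(cover):
    """Extract one longest increasing subsequence from a greedy cover."""
    if not cover:
        return []
    x = cover[-1][0]  # any element of the last subsequence works; take the top
    lis = [x]
    for subsequence in reversed(cover[:-1]):
        # First element, scanning from the top, that is smaller than x.
        x = next(y for y in subsequence if y < x)
        lis.insert(0, x)
    return lis


print(lis_from_cover([[6, 3, 1], [5, 2], [9, 7, 4], [10]]))  # [3, 5, 9, 10]
```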
Obtaining LIS : Example









P = {6, 3, 5, 1, 9, 7, 2, 10, 4}
Smallest Cover: {6, 3, 1} {5, 2} {9, 7, 4} {10}
(one subsequence per column, top to bottom)
6 5 9 10
3 2 7
1   4
i = # subsequences = 4
i = 4; x = 10; I = {10}
i = 3; x = 9; I = {9, 10}
i = 2; x = 5; I = {5, 9, 10}
i = 1; x = 3; I = {3, 5, 9, 10}
LIS : {3, 5, 9, 10}
How MUMmer Utilizes LIS


P = {1, 2, 3, 4, 6, 7, 5}
LIS = {1, 2, 3, 4, 6, 7}
Obtaining LIS: Complexity Analysis

Greedy cover can be found in O(n log n)

LIS found from greedy cover in O(n)
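The O(n log n) bound follows because the last elements of the subsequences in a greedy cover are always in increasing order from left to right, so the left-most subsequence a number can extend can be found by binary search. The sketch below assumes distinct values (as in a permutation of MUM labels) and tracks only those last elements, so it returns the size of the cover, which equals the LIS length; the function name is hypothetical.

```python
from bisect import bisect_left


def greedy_cover_size(p):
    """Return the number of subsequences in the greedy cover of p."""
    tails = []  # tails[k] is the last element of the k-th subsequence
    for number in p:
        idx = bisect_left(tails, number)  # left-most tail larger than number
        if idx == len(tails):
            tails.append(number)          # start a new subsequence
        else:
            tails[idx] = number           # extend an existing subsequence
    return len(tails)


print(greedy_cover_size([6, 3, 5, 1, 9, 7, 2, 10, 4]))  # 4
```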
Phase 3: Closing the Gaps
via Smith-Waterman
Smith, T. and Waterman, M. (1981) Identification of Common Molecular
Subsequences. Journal of Molecular Biology, 147, 195-197.
http://citeseer.ist.psu.edu/smith81identification.html
Closing the Gaps

After global-MUM alignment found, need to close
local gaps

Gap: interruption in MUM-alignment

Types of gaps:

1) SNP (Single Nucleotide Polymorphism)
2) Insertion
3) Highly polymorphic region
4) Repeat
Types of Gaps : Examples
SNP Processing




SNP = Single Nucleotide Polymorphism
SNPs in human DNA appear to be associated with
many health issues (genetic disease)
Q: How can we determine SNPs by using MUMmer?
A: By looking between MUMs; SNPs surrounded by
matching subsequences
Inserts: Transpositions

Large gaps in one genome but not the other
Transpositions: subsequences that were deleted
from one location, inserted elsewhere
Detected during post-processing
How?
Part of MUM-alignment
Simple Inserts

Simple inserts: subsequences that appear in only
one genome

Do not appear in MUM-alignment
Polymorphic Regions


Gaps in alignment caused by sequences with large
numbers of differences
If regions small enough, align using dynamic
programming (Smith-Waterman)


Gives optimal alignment given pre-specified insertion and
mutation costs
If regions too large, recursively call MUMmer
algorithm


Q: What must change when running MUMmer again?
A: Minimum MUM size must be smaller
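For the small-region case, a minimal sketch of Smith-Waterman scoring is shown below. The function name and the costs (match +2, mismatch -1, gap -1) stand in for the "pre-specified" costs and are assumptions, not MUMmer's actual parameters. The table is still quadratic in the region lengths, which is why this is applied only to short gaps between anchors.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, which lets an alignment restart.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # 6: the shared ACG
```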
Repeat Processing

Repeat sequences do not appear in MUM
alignment


Only includes sequences that appear exactly
once in each genome
In a sense, repeat sequences can “fake out” the
alignment
Finding Inversions



Professor Lopresti mentioned a method of
finding inversions last class.
Q: How can we identify inversions by using
MUMmer?
A: By running in both directions on gap
regions
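One reading of "both directions" for DNA is to compare a gap region against the reverse complement of its counterpart; that interpretation, and the function names below, are assumptions rather than something the slides spell out. The check uses exact equality to stay short, where a real pipeline would align the reversed region instead.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def looks_inverted(gap_a, gap_b):
    """True if gap_b is exactly the reverse complement of gap_a."""
    return reverse_complement(gap_a) == gap_b


print(reverse_complement("AACG"))      # CGTT
print(looks_inverted("AACG", "CGTT"))  # True
```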
Results: MUMmer 1.0 Storage





12 bytes per leaf node (suffix tree)
24 bytes per internal node (suffix tree)
1 byte for each base in genome
Generous upper-bound: 37 bytes per base
Therefore, comparing two 100 Mb genomes would
require < 8 gigabytes of memory
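The arithmetic behind this bound can be reproduced in a few lines. The constant and function names are illustrative. The bound is generous because a suffix tree has one leaf per base but fewer internal nodes than leaves, so charging one internal node per base overestimates.

```python
BYTES_PER_LEAF = 12
BYTES_PER_INTERNAL_NODE = 24
BYTES_PER_BASE = 1


def upper_bound_bytes(total_bases):
    """Memory bound: one leaf, one internal node and one stored base per base."""
    per_base = BYTES_PER_LEAF + BYTES_PER_INTERNAL_NODE + BYTES_PER_BASE  # 37
    return per_base * total_bases


two_genomes = 2 * 100_000_000                # two 100 Mb genomes
print(upper_bound_bytes(two_genomes) / 1e9)  # 7.4 (gigabytes)
```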
Improvements: MUMmer 2.0

New suffix tree algorithm (Kurtz)


At most 20 bytes per base pair (compared to 37)
Stream query string against suffix tree,
cutting down suffix tree storage
Improvements: MUMmer 3.0


New graphical modules
Search operations are optimal or near-optimal

Code rewrite (modular design)

Open source
MUMmer: Conclusion
via Arthur Loder
In Conclusion



MUMmer allows alignment of sequences
which are too long to be aligned using
previous algorithms
Utilizes suffix trees, LIS, and Smith-Waterman to obtain results
Outputs a base-to-base alignment of two
sequences
MUMmer: Discussion
Discussion Questions



What if MUMs overlap? What if some are longer
than others? How does LIS take this into account?
Are repeats really ignored?
Should repeats be ignored? Aren’t they part of the
global alignment? The goal is to obtain the same
optimal alignment as in Needleman/Wunsch, which
does not ignore repeats.
Demonstration

http://www.tigr.org/tigr-scripts/CMR2/webmum/mumplot