Seminar 10 Strings II

TDDD95: APS
Seminar in Algorithmic Problem Solving
April 15, 2016
Tommy Färnqvist
Department of Computer and Information Science
Linköping University
Outline
1 Suffix Tree and Array
What are suffix trees?
• A suffix tree stores all of the substrings of a given string in linear space, in such a
way that searching for a substring is efficient:
• the search takes time proportional to the length of the query string.
• A suffix tree can be built in linear time.
As a picture
• Figure: the suffix tree for GAAGAT.
• An edge is labelled with a substring of the original string S.
• A node’s label is the concatenation of all edge labels for the path leading to that
node.
• The path from the root to any leaf spells out a suffix of the string S.
• Each internal node has at least 2 children.
• The substrings labelling each edge out of a node must all begin with different
letters.
• For every suffix to end at a leaf, the last letter of S must be unique (it must not
occur anywhere else in S). Otherwise, a suffix may end at an internal node!
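The properties above can be checked on a small example. The following is a minimal Python sketch, not the linear-time construction: it inserts the suffixes one at a time and splits an edge wherever a mismatch occurs, which takes quadratic time in the worst case. The names Node, build_suffix_tree and leaf_labels are illustrative assumptions and do not come from the seminar material.

```python
class Node:
    def __init__(self):
        # Maps the first character of an edge label to (edge label, child node).
        self.children = {}


def build_suffix_tree(s):
    """Naive quadratic-time construction: insert each suffix of s in turn."""
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while rest:
            first = rest[0]
            if first not in node.children:
                node.children[first] = (rest, Node())  # new leaf edge
                break
            label, child = node.children[first]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # Mismatch inside the edge: split it at offset k.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[first] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def leaf_labels(node, prefix=""):
    """Return the root-to-leaf strings of all leaves below node."""
    if not node.children:
        return [prefix]
    labels = []
    for edge, child in node.children.values():
        labels.extend(leaf_labels(child, prefix + edge))
    return labels


tree = build_suffix_tree("GAAGAT")
print(sorted(leaf_labels(tree)))
# ['AAGAT', 'AGAT', 'AT', 'GAAGAT', 'GAT', 'T']
```

Because the last letter T of GAAGAT is unique, all six suffixes end at leaves, and the edges leaving each node start with different letters.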
Application I. Search for a substring
• Any substring of S is a prefix of a suffix.
• Example of using this: is the string x a substring of S?
• Start at the root and follow the path labelled by the characters of x. If all of x
can be matched, then x is a substring of S; otherwise it is not.
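The search can be sketched in a few lines of Python. To keep the example short it uses an uncompressed suffix trie (one character per edge, quadratic size) instead of a real suffix tree; the query walks down from the root in exactly the way described above. The function names build_suffix_trie and is_substring are illustrative assumptions.

```python
def build_suffix_trie(s):
    """Insert every suffix of s into a trie of nested dictionaries."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root


def is_substring(trie, x):
    """Follow the characters of x from the root; succeed if all of x is matched."""
    node = trie
    for ch in x:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("GAAGAT")
print(is_substring(trie, "AGA"))  # True
print(is_substring(trie, "AAA"))  # False
```

The query loop runs once per character of x, so its cost depends only on the length of the query string, not on the length of S.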
Application II. Longest Common Substring
• What is the longest substring common to both S1 and S2?
• Build a suffix tree for S = S1#S2$, where # and $ are unique characters that
occur in neither S1 nor S2.
• Every suffix of S that starts in S1 ends with an edge whose label includes #S2$.
• Color all nodes v of the tree:
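• red if v's label is a substring of S1 only
• blue if it is a substring of S2 only
• purple if it is a substring of both
• We want the deepest purple node, that is, the purple node with the longest label.
Its label is the longest common substring.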
• Linear time, linear space.
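The colouring idea can be illustrated with a short Python sketch. It is a simplification under stated assumptions: it uses an uncompressed suffix trie, so it needs quadratic time and space and not the linear bounds of a real suffix tree, and it stops each suffix at # or $ so that only labels lying entirely inside S1 or S2 are coloured. The function name longest_common_substring and the bitmask encoding of the colours are illustrative choices.

```python
RED, BLUE = 1, 2          # colour bits; RED | BLUE plays the role of purple


def longest_common_substring(s1, s2):
    assert "#" not in s1 + s2 and "$" not in s1 + s2
    s = s1 + "#" + s2 + "$"
    root = [{}, 0]        # each node is [children, colour bitmask]
    for i in range(len(s)):
        # Suffixes starting in S1 colour their path red, the others blue.
        colour = RED if i < len(s1) else BLUE
        node = root
        for ch in s[i:]:
            if ch in "#$":
                break     # labels containing a separator are not substrings
            node = node[0].setdefault(ch, [{}, 0])
            node[1] |= colour
    # Depth-first search restricted to purple nodes, keeping the longest label.
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        if len(label) > len(best):
            best = label
        for ch, child in node[0].items():
            if child[1] == RED | BLUE:
                stack.append((child, label + ch))
    return best


print(longest_common_substring("banana", "ananas"))  # anana
```

Every ancestor of a purple node is purple as well, so restricting the traversal to purple children still reaches the deepest purple node.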
Quick Note on Suffix Array
• A suffix tree is not a compact data structure.
• A lot of pointers
• A suffix array stores the starting positions of the suffixes of a string. Each position
is an integer, so for a string of length n this is an integer array of length n.
• Each position corresponds to a suffix starting at this position.
• The suffix array is sorted according to the string order of the corresponding
suffixes.
As a picture
• Figure: the suffix array for AGAAGAT.
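The suffix array for AGAAGAT can be computed with a few lines of Python. This is a minimal sketch that simply sorts the starting positions by the suffixes they denote; it is not a linear-time construction, and it assumes 0-based positions. The function name suffix_array is an illustrative choice.

```python
def suffix_array(s):
    """Starting positions of the suffixes of s, in lexicographic suffix order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


s = "AGAAGAT"
sa = suffix_array(s)
print(sa)  # [2, 0, 3, 5, 1, 4, 6]
for pos in sa:
    print(pos, s[pos:])
# 2 AAGAT
# 0 AGAAGAT
# 3 AGAT
# 5 AT
# 1 GAAGAT
# 4 GAT
# 6 T
```

Only the integer positions are stored; the suffixes in the right-hand column are recovered from the original string when needed.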
Suffix Array
• Use binary search to find a substring of length m in a string of length n.
• O(m log n) if implemented straightforwardly
• O(m + log n) if used with an auxiliary data structure called the longest common
prefix (LCP) array
• The suffix array (together with the LCP array) can be constructed in linear time.
• SA + LCP arrays make it possible to simulate traversing the corresponding
suffix tree in several different ways.
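The points above can be illustrated by a Python sketch under some assumptions. The suffix array is built by plain sorting (not in linear time), the search is the straightforward O(m log n) binary search, and the LCP array is computed from the suffix array with Kasai's algorithm, which is linear once the suffix array is known. The O(m + log n) search that uses the LCP array is not shown. All function names are illustrative, and positions are 0-based.

```python
def suffix_array(s):
    """Naive construction by sorting; not the linear-time algorithm."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def find_occurrences(s, sa, x):
    """All start positions of x in s, via two binary searches over sa."""
    m = len(x)
    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix whose m-prefix is >= x
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < x:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                       # first suffix whose m-prefix is > x
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= x:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def lcp_array(s, sa):
    """Kasai's algorithm: lcp[r] is the length of the longest common prefix
    of the suffixes at sa[r - 1] and sa[r]; lcp[0] is 0."""
    n = len(s)
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1
    return lcp


s = "AGAAGAT"
sa = suffix_array(s)
print(find_occurrences(s, sa, "GA"))   # [1, 4]
print(lcp_array(s, sa))                # [0, 1, 3, 1, 0, 2, 0]
```

Each comparison inside the binary search inspects at most m characters, which is where the O(m log n) bound comes from.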
Next time
Combinatorial Search...