Seminar 9 Strings I

advertisement
Strings I
Tommy Färnqvist
Seminar 9
Strings I
Outline
TDDD95: APS
String Matching
String Multi Matching
Seminar in Algorithmic Problem Solving
8 april 2016
DP over Strings
Trie
The Substring Problem
Final Note
Tommy Färnqvist
Department of Computer and Information Science
Linköping University
9.1
Strings I
Outline
Tommy Färnqvist
1 String Matching
Outline
String Matching
2 String Multi Matching
String Multi Matching
DP over Strings
Trie
3 DP over Strings
The Substring Problem
Final Note
4 Trie
5 The Substring Problem
9.2
Strings I
String Matching
Tommy Färnqvist
Outline
• Given a text string T (with n) characters and a pattern string P (with m
characters), find all occurrences of P in T .
• Easiest solution: Use string library (C++ string::find, C strstr, Java String.indexOf))
• Knuth-Morris-Pratt: O(n + m) time and O(m) space (lab 3.1)
• Boyer-Moore: also O(n + m) time and O(m) space, but more efficient when the
String Matching
String Multi Matching
DP over Strings
Trie
The Substring Problem
Final Note
alphabet is large or the pattern is long since it matches from right to left
• More efficient solutions exist, as we will see. . .
9.3
Strings I
String Multi Matching
Tommy Färnqvist
Outline
String Matching
• Given a text string T (with n) characters and pattern strings P1 , . . . , Pp , find all
occurrences of every pattern Pi in T .
DP over Strings
Trie
• The Aho-Corasick algorithm finds all matches of strings P1 , . . . , Pp in T in
O(n + m + k ) time and O(n) space, where m =
matches (lab 3.2)
String Multi Matching
P
|Pi | and k is the total number of
The Substring Problem
Final Note
9.4
Strings I
DP over Strings - Edit Distance
Tommy Färnqvist
• The edit distance between strings S1 and S2 is the minimum number of
operations I (insert the next char of S2 ), D (delete), R (replace by the next char
of S2 ) that transforms S1 into S2 (also known as the Levenshtein distance)
• Define D(i, j) to be the edit distance of prefixes S1 [1 . . . i] and S2 [1 . . . j], then
D(n, m) is the edit distance of S1 and S2 .
• Define D(i, j) = min(D(i − 1, j) + 1, D(i, j − 1) + 1), D(i − 1, j − 1) + t(i, j), where
t(i, j) = 0 if S1 [i] = S2 [j] else 1.
• DP computation of D(n, m) is in O(nm)
Outline
String Matching
String Multi Matching
DP over Strings
Trie
The Substring Problem
Final Note
• We can also consider edit operations with weights: d for deletion/insertion, r for
substitution, and e for match. Edit distance is the n a special case with
d = r = 1 and e = 0.
• The Hamming distance is also a special case, with d = ∞, r = 1, and e = 0.
(Minimization)
• Longest Common Subsequence is also a special case, with d = 0, r = −∞, and
e = 1. (Maximization)
9.5
Strings I
Trie (prefix tree)
Tommy Färnqvist
• trie: An ordered tree structure used for storing a set of data, usually strings,
optimized for doing prefix searches
• Example: Does any word in the set start with the prefix mart?
• The idea: use a “26-ary” tree
• each node has 26 children: one
for each letter A-Z
Outline
String Matching
String Multi Matching
DP over Strings
Trie
The Substring Problem
Final Note
• add a word to the trie by following
the appropriate child pointer
9.6
Strings I
The Substring Problem
Tommy Färnqvist
Outline
• The substring problem: For a text S of length n, after O(n) time preprocessing,
given any string P either find an occurence of P in S, or determine that one
does not exist in time O(|P|)
•
•
•
•
•
2
Build a trie of all substring of S, O(n ).
It is easy to find prefixes of string in a trie.
Each substring S[i . . . j] is a prefix of the suffix S[i . . . n] of S.
Therefore, create a trie of the n non-empty suffixes of S.
This can be done in O(n) time.
String Matching
String Multi Matching
DP over Strings
Trie
The Substring Problem
Final Note
9.7
Strings I
Suffix Trie and Suffix Tree
Tommy Färnqvist
• A Suffix Trie is a Trie with suffixes
Outline
• A Suffix Tree for S[1 . . . n] is a rooted tree with
• n leaves numbered 1 . . . n
• at least two children for each internal node (with the root as possible exception)
• each edge labeled by a non-empty substring of S
• no two edges out of a node beginning with the same character
String Matching
String Multi Matching
DP over Strings
Trie
The Substring Problem
Final Note
• Suffix Trees allow linear time algorithms for
• exact matching in O(n + occ), where occ is the number of matches
• Longest Repeating Substring in O(n)
• Longest Common Substring in O(n)
9.8
Strings I
Suffix Array
Tommy Färnqvist
Outline
• Suffix Trees are space inefficient and hard to implement
• A Suffix Array is an array that stores:
• A permutation of n indices of sorted suffixes (lab 3.3)
• Suffix Trees allow efficient algorithms for
• Exact matching in O(m log n)
• Longest Repeating Substring in O(n)
• Longest Common Substring in O(n)
String Matching
String Multi Matching
DP over Strings
Trie
The Substring Problem
Final Note
9.9
Strings I
Next time
Tommy Färnqvist
Outline
String Matching
String Multi Matching
DP over Strings
More strings. . .
Trie
The Substring Problem
Final Note
9.10
Download