Strings I Tommy Färnqvist Seminar 9 Strings I Outline TDDD95: APS String Matching String Multi Matching Seminar in Algorithmic Problem Solving 8 april 2016 DP over Strings Trie The Substring Problem Final Note Tommy Färnqvist Department of Computer and Information Science Linköping University 9.1 Strings I Outline Tommy Färnqvist 1 String Matching Outline String Matching 2 String Multi Matching String Multi Matching DP over Strings Trie 3 DP over Strings The Substring Problem Final Note 4 Trie 5 The Substring Problem 9.2 Strings I String Matching Tommy Färnqvist Outline • Given a text string T (with n) characters and a pattern string P (with m characters), find all occurrences of P in T . • Easiest solution: Use string library (C++ string::find, C strstr, Java String.indexOf)) • Knuth-Morris-Pratt: O(n + m) time and O(m) space (lab 3.1) • Boyer-Moore: also O(n + m) time and O(m) space, but more efficient when the String Matching String Multi Matching DP over Strings Trie The Substring Problem Final Note alphabet is large or the pattern is long since it matches from right to left • More efficient solutions exist, as we will see. . . 9.3 Strings I String Multi Matching Tommy Färnqvist Outline String Matching • Given a text string T (with n) characters and pattern strings P1 , . . . , Pp , find all occurrences of every pattern Pi in T . DP over Strings Trie • The Aho-Corasick algorithm finds all matches of strings P1 , . . . , Pp in T in O(n + m + k ) time and O(n) space, where m = matches (lab 3.2) String Multi Matching P |Pi | and k is the total number of The Substring Problem Final Note 9.4 Strings I DP over Strings - Edit Distance Tommy Färnqvist • The edit distance between strings S1 and S2 is the minimum number of operations I (insert the next char of S2 ), D (delete), R (replace by the next char of S2 ) that transforms S1 into S2 (also known as the Levenshtein distance) • Define D(i, j) to be the edit distance of prefixes S1 [1 . . . i] and S2 [1 . . . j], then D(n, m) is the edit distance of S1 and S2 . • Define D(i, j) = min(D(i − 1, j) + 1, D(i, j − 1) + 1), D(i − 1, j − 1) + t(i, j), where t(i, j) = 0 if S1 [i] = S2 [j] else 1. • DP computation of D(n, m) is in O(nm) Outline String Matching String Multi Matching DP over Strings Trie The Substring Problem Final Note • We can also consider edit operations with weights: d for deletion/insertion, r for substitution, and e for match. Edit distance is the n a special case with d = r = 1 and e = 0. • The Hamming distance is also a special case, with d = ∞, r = 1, and e = 0. (Minimization) • Longest Common Subsequence is also a special case, with d = 0, r = −∞, and e = 1. (Maximization) 9.5 Strings I Trie (prefix tree) Tommy Färnqvist • trie: An ordered tree structure used for storing a set of data, usually strings, optimized for doing prefix searches • Example: Does any word in the set start with the prefix mart? • The idea: use a “26-ary” tree • each node has 26 children: one for each letter A-Z Outline String Matching String Multi Matching DP over Strings Trie The Substring Problem Final Note • add a word to the trie by following the appropriate child pointer 9.6 Strings I The Substring Problem Tommy Färnqvist Outline • The substring problem: For a text S of length n, after O(n) time preprocessing, given any string P either find an occurence of P in S, or determine that one does not exist in time O(|P|) • • • • • 2 Build a trie of all substring of S, O(n ). It is easy to find prefixes of string in a trie. Each substring S[i . . . j] is a prefix of the suffix S[i . . . n] of S. Therefore, create a trie of the n non-empty suffixes of S. This can be done in O(n) time. String Matching String Multi Matching DP over Strings Trie The Substring Problem Final Note 9.7 Strings I Suffix Trie and Suffix Tree Tommy Färnqvist • A Suffix Trie is a Trie with suffixes Outline • A Suffix Tree for S[1 . . . n] is a rooted tree with • n leaves numbered 1 . . . n • at least two children for each internal node (with the root as possible exception) • each edge labeled by a non-empty substring of S • no two edges out of a node beginning with the same character String Matching String Multi Matching DP over Strings Trie The Substring Problem Final Note • Suffix Trees allow linear time algorithms for • exact matching in O(n + occ), where occ is the number of matches • Longest Repeating Substring in O(n) • Longest Common Substring in O(n) 9.8 Strings I Suffix Array Tommy Färnqvist Outline • Suffix Trees are space inefficient and hard to implement • A Suffix Array is an array that stores: • A permutation of n indices of sorted suffixes (lab 3.3) • Suffix Trees allow efficient algorithms for • Exact matching in O(m log n) • Longest Repeating Substring in O(n) • Longest Common Substring in O(n) String Matching String Multi Matching DP over Strings Trie The Substring Problem Final Note 9.9 Strings I Next time Tommy Färnqvist Outline String Matching String Multi Matching DP over Strings More strings. . . Trie The Substring Problem Final Note 9.10