String Matching

String Matching
• Given a pattern P[1..m] and a text T[1..n], find all occurrences of P in T. Both P and T are strings over an alphabet ∑ (i.e., belong to ∑*).
• P occurs with shift s (beginning at position s+1) if P[1]=T[s+1], P[2]=T[s+2], …, P[m]=T[s+m].
• If so, s is called a valid shift; otherwise it is an invalid shift.
• Note: one occurrence may begin within another: for P=abab and T=abcabababbc, P occurs at s=3 and s=5.

Notation and terminology
• w is a prefix of x if x=wy for some y∈∑*. Denoted w⇒x.
• w is a suffix of x if x=yw for some y∈∑*. Denoted w⇐x.
• Lemma 32.1 (Overlapping-suffix lemma):
  – Suppose x, y, z are strings such that x⇐z and y⇐z. If |x|≤|y|, then x⇐y; if |x|≥|y|, then y⇐x; if |x|=|y|, then x=y.

Naïve string matching
• Try every shift s = 0, 1, …, n−m and compare P with the corresponding window of T character by character (a sketch follows the Rabin-Karp discussion below).
• Running time: O((n−m+1)m).

Problem with the naïve algorithm
• Example: P=ababc, T=cabababcd.
  – (Figure: P is slid along T shift by shift; after a partial match, the comparison restarts one position later in T.)
• Whenever a mismatch occurs after several characters have matched, the comparison resumes in T at the character that follows the position where the last attempt began, so characters of T are examined again.
• Can we do better, i.e., never go back in T?

Rabin-Karp Algorithm
● Key idea: think of the pattern P[0..m-1] as a key and transform (hash) it into an equivalent integer p.
● Similarly, we transform substrings of the text string T[] into integers: for s=0,1,…,n−m, transform T[s..s+m-1] into an equivalent integer t_s.
● The pattern occurs at shift s if and only if p = t_s.
● If we can compute p and the t_s quickly, the pattern-matching problem is reduced to comparing p with n−m+1 integers.

Rabin-Karp Algorithm
● How to compute p?
  p = 2^(m-1)·P[0] + 2^(m-2)·P[1] + … + 2·P[m-2] + P[m-1]
● Using Horner's rule:
  p = P[m-1] + 2·(P[m-2] + 2·(P[m-3] + … + 2·(P[1] + 2·P[0])…))
  This takes O(m) time, assuming each arithmetic operation can be done in O(1) time.

Rabin-Karp Algorithm
● Similarly, we can compute the (n−m+1) integers t_s from the text string.
● Done independently for each s, this takes O((n−m+1)m) time, assuming each arithmetic operation can be done in O(1) time. This is a bit time-consuming.

Rabin-Karp Algorithm
● A better method: compute t_0 with Horner's rule, then obtain each subsequent value from the previous one in constant time:
  t_(s+1) = 2·(t_s − 2^(m-1)·T[s]) + T[s+m]
● This takes O(n+m) time, assuming each arithmetic operation can be done in O(1) time.

Problem
● The problem with the previous strategy is that when m is large, it is unreasonable to assume that each arithmetic operation can be done in O(1) time.
● In fact, such a long integer may not even fit in the default integer type.
● Therefore, we use modular arithmetic: let q be a prime number such that 2q can be stored in one computer word, and compute p and the t_s modulo q.
● This ensures that all computations can be done using single-precision arithmetic.
● Once we use modular arithmetic, p = t_s for some s no longer guarantees that P[0..m-1] equals T[s..s+m-1].
● Therefore, after the equality test p = t_s, we compare P[0..m-1] with T[s..s+m-1] character by character to make sure we really have a match.
● The worst-case running time becomes O(nm), but in practice the algorithm avoids a lot of unnecessary character comparisons.
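The following is a minimal Python sketch of the naïve matcher described above; the function name and the 0-based shifts are my own conventions, not part of the original slides.

def naive_match(P: str, T: str) -> list[int]:
    """Return every valid shift s (0-based) at which P occurs in T."""
    n, m = len(T), len(P)
    shifts = []
    for s in range(n - m + 1):               # try every possible shift
        j = 0
        while j < m and T[s + j] == P[j]:    # compare character by character
            j += 1
        if j == m:                           # all m characters matched
            shifts.append(s)
    return shifts

# Opening example of the section: P = abab occurs in T = abcabababbc at shifts 3 and 5.
print(naive_match("abab", "abcabababbc"))    # -> [3, 5]

The inner loop may re-examine characters of T that were already compared at earlier shifts, which is exactly the O((n−m+1)m) behaviour pointed out above.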
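And a sketch of Rabin-Karp with the rolling hash and the modular arithmetic just described. The radix d = 256 (one byte per character) and the prime q = 2^31 − 1 are illustrative choices of mine; the slides use radix 2 and only require that the prime be small enough for single-precision arithmetic.

def rabin_karp(P: str, T: str, d: int = 256, q: int = 2_147_483_647) -> list[int]:
    """Rabin-Karp matcher: rolling hash modulo the prime q, radix d."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                      # d^(m-1) mod q: weight of the leading character
    p = t = 0
    for i in range(m):                        # Horner's rule: hash P and the first text window
        p = (d * p + ord(P[i])) % q
        t = (d * t + ord(T[i])) % q
    shifts = []
    for s in range(n - m + 1):
        if p == t and T[s:s + m] == P:        # verify character by character on a hash match
            shifts.append(s)
        if s < n - m:                         # roll the hash: drop T[s], append T[s+m]
            t = (d * (t - ord(T[s]) * h) + ord(T[s + m])) % q
    return shifts

print(rabin_karp("abab", "abcabababbc"))      # -> [3, 5], as with the naïve matcher

Because every hash match is verified, spurious collisions cost time but never produce wrong answers; the expected running time is O(n+m), with the O(nm) worst case noted above.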
String Matching Using Finite Automata

A string-matching automaton consists of:
• a finite set of states Q,
• a start state q0 ∈ Q,
• a set of accepting states A ⊆ Q,
• an input alphabet Σ,
• a transition function δ: Q × Σ → Q.

Example: pattern P = abba
The automaton has states 0, 1, 2, 3, 4; state q records how long a prefix of P is a suffix of the input read so far, and state 4 is the accepting state. Its transition function δ is:

  state   a   b
    0     1   0
    1     1   2
    2     1   3
    3     4   0
    4     1   2

Running the automaton on the input a b a b b a b b a a visits the states

  0 1 2 1 2 3 4 2 3 4 1

It reaches the accepting state 4 after the 6th and the 9th input characters, i.e., exactly at the end of each occurrence of abba.

Finite-Automaton-Matcher
The example automaton accepts at the end of each occurrence of the pattern abba. For every pattern of length m there exists an automaton with m+1 states that solves the pattern-matching problem with the following algorithm (a Python sketch follows):

Finite-Automaton-Matcher(T, δ, m)
1. n ← length(T)
2. q ← 0
3. for i ← 1 to n do
4.   q ← δ(q, T[i])
5.   if q = m then
6.     s ← i − m
7.     print "Pattern occurs with shift" s
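Below is a minimal Python sketch of Finite-Automaton-Matcher, run on the abba automaton from the example; storing δ as a dict of dicts is my own choice of representation.

# δ for the pattern "abba" (states 0..4, state 4 accepting), copied from the table above.
delta = {
    0: {'a': 1, 'b': 0},
    1: {'a': 1, 'b': 2},
    2: {'a': 1, 'b': 3},
    3: {'a': 4, 'b': 0},
    4: {'a': 1, 'b': 2},
}

def finite_automaton_matcher(T: str, delta: dict, m: int) -> list[int]:
    """Scan T once; report shift i - m whenever the accepting state m is reached."""
    q = 0
    shifts = []
    for i, c in enumerate(T, start=1):       # i = number of characters read, as in the pseudocode
        q = delta[q][c]
        if q == m:
            shifts.append(i - m)
    return shifts

# The example input: occurrences of abba end after characters 6 and 9.
print(finite_automaton_matcher("ababbabbaa", delta, 4))   # -> [2, 5]

Each text character is handled with a single table lookup, so the scan itself takes O(n) time; all the pattern-dependent work is moved into the construction of δ.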
Computing the Transition Function: The Idea
(Figure: copies of the pattern slid against the text, showing how the characters read so far overlap prefixes of the pattern.) The state after reading some characters should be the length of the longest prefix of P that is a suffix of what has been read; hence δ(q, a) is the length of the longest prefix of P that is a suffix of P_q followed by a.

How to Compute the Transition Function?
• A string u is a prefix of a string v if there exists a string w such that uw = v.
• A string u is a suffix of a string v if there exists a string w such that wu = v.
• Let P_k denote the prefix of P consisting of its first k characters.

Compute-Transition-Function(P, Σ)
1. m ← length(P)
2. for q ← 0 to m do
3.   for each character a ∈ Σ do
4.     k ← 1 + min(m, q+1)
5.     repeat k ← k − 1
6.     until P_k is a suffix of P_q a
7.     δ(q, a) ← k

Example
• Consider a pattern beginning a b a a b a a a, and take q = 7, so P_7 = abaabaa.
• On character a: P_7 a = abaabaaa, and P_8 = abaabaaa is a suffix of it, so δ(7, a) = 8.
• On character b: P_7 b = abaabaab; P_8 = abaabaaa, P_7 = abaabaa, and P_6 = abaaba are not suffixes of it, but P_5 = abaab is, so δ(7, b) = 5.

Running time of Compute-Transition-Function
• The loop over q runs m+1 times, and the loop over the characters a runs |Σ| times.
• The repeat loop decreases k at most m+1 times (factor m), and each test "P_k is a suffix of P_q a" takes up to m character comparisons.
• Running time of the procedure: O(m^3 |Σ|). (A Python transcription of this procedure is given before the KMP analysis below.)

Knuth-Morris-Pratt (KMP) algorithm
• Idea: when q characters of P have matched T and then a mismatch occurs, those q matched characters already tell us that certain shifts are invalid, so we can skip them and go directly to the shift that is potentially valid.

Knuth-Morris-Pratt (KMP) algorithm
• The matched characters of T form a prefix of P, so whether a shift is invalid can be determined from P alone.
• Define a prefix function π, which encapsulates the knowledge about how the pattern P matches against shifts of itself:
  – π: {1,2,…,m} → {0,1,…,m−1}
  – π[q] = max{k: k < q and P_k ⇐ P_q}, that is, π[q] is the length of the longest prefix of P that is a proper suffix of P_q.

Prefix function
• If we precompute the prefix function of P (matching P against itself), then whenever a mismatch occurs, the prefix function tells us which shifts are invalid; they are ruled out directly and we move to the shift that is potentially valid.
• Moreover, the characters of T that already matched a prefix of P need not be compared again, since they are known to be equal.

Proper suffixes and prefix-function values for P = ababababca:

  i   proper suffixes of P_i              π[i]
  1   {}                                  0
  2   {b}                                 0
  3   {ba, a}                             1
  4   {bab, ab, b}                        2
  5   {baba, aba, ba, a}                  3
  6   {babab, abab, bab, ab, b}           4

Trace of COMPUTE-PREFIX-FUNCTION on P = ababababca:

  q    operations                                              π[q]
  1    k = 0                                                   0
  2    P[k+1] ≠ P[q], π[q] = k = 0                             0
  3    P[k+1] = P[q], k = k+1 = 1, π[q] = k = 1                1
  4    P[k+1] = P[2] = P[q], k = k+1 = 2, π[q] = k = 2         2
  5    P[k+1] = P[3] = P[q], k = k+1 = 3, π[q] = k = 3         3
  6    P[k+1] = P[4] = P[q], k = k+1 = 4, π[q] = k = 4         4
  7    P[k+1] = P[5] = P[q], k = k+1 = 5, π[q] = k = 5         5
  8    P[k+1] = P[6] = P[q], k = k+1 = 6, π[q] = k = 6         6
  9    P[k+1] = P[7] ≠ P[q], k = π[k] = π[6] = 4               0
       P[k+1] = P[5] ≠ P[q], k = π[k] = π[4] = 2
       P[k+1] = P[3] ≠ P[q], k = π[k] = π[2] = 0
       P[k+1] = P[1] ≠ P[q], π[q] = k = 0
  10   P[k+1] = P[1] = P[q], k = k+1 = 1, π[q] = k = 1         1
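Before turning to the KMP analysis, here is a direct Python transcription of Compute-Transition-Function given earlier (the O(m^3·|Σ|) version). P[:q] plays the role of P_q, and str.endswith tests the suffix condition.

def compute_transition_function(P: str, alphabet: str) -> list:
    """delta[q][a] = length of the longest prefix of P that is a suffix of P_q followed by a."""
    m = len(P)
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for a in alphabet:
            k = min(m, q + 1)                       # longest candidate prefix length
            while not (P[:q] + a).endswith(P[:k]):  # until P_k is a suffix of P_q a
                k -= 1                              # terminates: the empty prefix always fits
            delta[q][a] = k
    return delta

# Reproduces the abba table used in the automaton example:
# compute_transition_function("abba", "ab")
# -> [{'a': 1, 'b': 0}, {'a': 1, 'b': 2}, {'a': 1, 'b': 3}, {'a': 4, 'b': 0}, {'a': 1, 'b': 2}]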
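The slides trace and analyse COMPUTE-PREFIX-FUNCTION and KMP-MATCHER without listing them, so here is a hedged sketch of both in Python (0-based strings, so pi[q-1] below corresponds to π[q] above).

def compute_prefix_function(P: str) -> list[int]:
    """pi[q] = length of the longest proper prefix of P[:q+1] that is also a suffix of it."""
    m = len(P)
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and P[k] != P[q]:    # fall back through shorter and shorter borders
            k = pi[k - 1]
        if P[k] == P[q]:
            k += 1
        pi[q] = k
    return pi

def kmp_matcher(T: str, P: str) -> list[int]:
    """Return all shifts at which P occurs in T, scanning T left to right without going back."""
    pi = compute_prefix_function(P)
    shifts, k = [], 0
    for i, c in enumerate(T):
        while k > 0 and P[k] != c:       # mismatch: consult the prefix function, keep i fixed
            k = pi[k - 1]
        if P[k] == c:
            k += 1
        if k == len(P):                  # a full match ends at position i
            shifts.append(i - len(P) + 1)
            k = pi[k - 1]                # continue, allowing overlapping occurrences
    return shifts

# compute_prefix_function("ababababca") -> [0, 0, 1, 2, 3, 4, 5, 6, 0, 1], matching the trace above.
print(kmp_matcher("abcabababbc", "abab"))   # -> [3, 5], the overlapping occurrences from the opening example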
Analysis of KMP algorithm
• The running time of COMPUTE-PREFIX-FUNCTION is Θ(m), and the running time of KMP-MATCHER is Θ(n), for Θ(m) + Θ(n) in total.
• Amortized analysis (potential method), shown here for COMPUTE-PREFIX-FUNCTION:
  – Associate a potential of k with the current value k of the algorithm, and consider the code in lines 5 to 9 (the statements that update k).
  – The initial potential is 0; line 6 decreases k (since π[k] < k), and k never becomes negative.
  – Line 8 increases k by at most 1.
  – Amortized cost = actual cost + potential change: the actual cost of one iteration of the outer loop is (number of repetitions of line 5) + O(1), the potential decreases by at least 1 for each of those repetitions and increases by at most 1 in line 8, so the amortized cost per iteration is O(1) and the total time is Θ(m).

Trie
• A trie T for a set S of strings can be used to implement a dictionary whose keys are the strings of S.
• Namely, we perform a search in T for a string X by tracing down from the root the path indicated by the characters in X.
  – If this path can be traced and terminates at an external node, then X is in the dictionary. For example, in the trie in Figure 12.9, tracing the path for "bull" ends up at an external node.
  – If the path cannot be traced, or it can be traced but terminates at an internal node, then X is not in the dictionary. In the example in Figure 12.9, the path for "bet" cannot be traced and the path for "be" ends at an internal node.

Trie
• We can use a trie to perform a special type of pattern matching, called word matching, where we want to determine whether a given pattern matches one of the words of the text exactly.
• Word matching with a standard trie: the text to be searched, and the standard trie for the words in the text (articles and prepositions, also known as stop words, excluded), with external nodes augmented with indications of the word positions. (A small sketch follows at the end of this section.)

Trie
• Using a trie, word matching for a pattern of length m takes O(dm) time, where d is the size of the alphabet, independent of the size of the text.
• If the alphabet has constant size (as is the case for text in natural languages and DNA strings), a query takes O(m) time, proportional to the size of the pattern.

Compressed Tries
• A compressed trie is similar to a standard trie, but it ensures that each internal node in the trie has at least two children. It enforces this rule by compressing chains of single-child nodes into individual edges.
• We say that an internal node v of T is redundant if v has one child and is not the root.
• (Figure: a compressed trie, and the compact representation of the compressed trie for S.)

Suffix Tries
• One of the primary applications of tries is the case when the strings in the collection S are all the suffixes of a string X. Such a trie is called the suffix trie (also known as a suffix tree or position tree) of string X.
• Since the total length of the suffixes of a string X of length n is n(n+1)/2, storing all the suffixes of X explicitly would take O(n^2) space. The compact representation of a suffix trie T for a string X of length n uses O(n) space.
• The following reference explains an efficient approach for constructing a suffix tree in linear time: Gusfield, D. "Linear-time construction of suffix trees." In Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, 1997.
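Below is a minimal standard-trie sketch for the dictionary/word-matching use described above. Nodes are plain dicts and the end-of-word marker '$' is an arbitrary convention of mine; the word set is chosen only so that it reproduces the behaviour described for Figure 12.9.

class Trie:
    """Standard trie over a set of words; exact word lookup costs O(d*m) time."""

    def __init__(self, words=()):
        self.root = {}
        for w in words:
            self.insert(w)

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})   # follow or create the child labelled ch
        node['$'] = True                     # '$' marks that a word ends at this node

    def contains(self, word: str) -> bool:
        node = self.root
        for ch in word:
            if ch not in node:               # path cannot be traced: not in the dictionary
                return False
            node = node[ch]
        return '$' in node                   # the traced path must end where a word ends

t = Trie(["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"])
print(t.contains("bull"), t.contains("bet"), t.contains("be"))   # True False False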
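And a tiny suffix-trie sketch built on the same Trie class: inserting every suffix of X is exactly the O(n^2) construction mentioned above (linear-time constructions, such as the one covered in Gusfield's chapter, are considerably more involved). The helper names are my own.

def suffix_trie(X: str) -> Trie:
    """Naively build a trie containing every suffix of X (O(n^2) time and space)."""
    t = Trie()
    for i in range(len(X)):
        t.insert(X[i:])
    return t

def occurs_in(t: Trie, P: str) -> bool:
    """P occurs in X iff P can be traced from the root of X's suffix trie."""
    node = t.root
    for ch in P:
        if ch not in node:
            return False
        node = node[ch]
    return True

st = suffix_trie("abcabababbc")
print(occurs_in(st, "abab"), occurs_in(st, "abba"))   # True False

Once the suffix trie is built, tracing a pattern of length m answers a substring query in time proportional to m (times the alphabet-size factor d noted above).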