
String Matching Algorithms: Naive, Rabin-Karp, Automata, KMP

String Matching
• Given a pattern P[1..m] and a text T[1..n], find all occurrences of P in T. Both P and T are strings over an alphabet Σ (i.e., P, T ∈ Σ*).
• P occurs with shift s (beginning at position s+1) if P[1]=T[s+1], P[2]=T[s+2], …, P[m]=T[s+m].
• If so, s is called a valid shift; otherwise it is an invalid shift.
• Note: one occurrence may begin within another. For P=abab and T=abcabababbc, P occurs at s=3 and at s=5.
An example of string matching
Notation and terminology
• w is a prefix of x if x=wy for some y∈Σ*. Denoted w⇒x.
• w is a suffix of x if x=yw for some y∈Σ*. Denoted w⇐x.
• Lemma 32.1 (Overlapping-suffix lemma):
– Suppose x, y, and z are strings such that x⇐z and y⇐z. Then if |x|≤|y|, x⇐y; if |x|≥|y|, y⇐x; and if |x|=|y|, x=y.
Naïve string matching
Running time:
O((n-m+1)m).
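A minimal sketch of the naïve matcher in Python (0-indexed; the function name and slicing-based comparison are choices of this sketch, not part of the slides):

def naive_match(T, P):
    """Return all 0-indexed shifts s such that T[s:s+m] == P."""
    n, m = len(T), len(P)
    shifts = []
    for s in range(n - m + 1):        # n-m+1 candidate shifts
        if T[s:s+m] == P:             # up to m character comparisons per shift
            shifts.append(s)
    return shifts

# Example from the slides: P = abab occurs in T = abcabababbc
print(naive_match("abcabababbc", "abab"))   # [3, 5], i.e. shifts s=3 and s=5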
Problem with the naïve algorithm
• Suppose P = ababc and T = cabababcd:

  T: c a b a b a b c d
  P: a                    (mismatch at the 1st character)
  P:   a b a b c          (mismatch at the 5th character)
  P:     a                (mismatch at the 1st character)
  P:       a b a b c      (match, s = 3)

• Whenever a mismatch occurs after several characters have already matched, the naïve algorithm goes back in T and restarts the comparison at the character just after the previous starting position.
• Can we do better and avoid going back in T?
Rabin-Karp Algorithm
• Key idea: think of the pattern P[0..m-1] as a key and transform (hash) it into an equivalent integer p.
• Similarly, transform substrings of the text string T[] into integers: for s = 0, 1, …, n-m, transform T[s..s+m-1] into an equivalent integer t_s.
• The pattern occurs at position s if and only if p = t_s.
• If we can compute p and the t_s quickly, the pattern-matching problem is reduced to comparing p with n-m+1 integers.
Rabin-Karp Algorithm
• How to compute p?
  p = 2^(m-1) P[0] + 2^(m-2) P[1] + … + 2 P[m-2] + P[m-1]
• Using Horner's rule:
  p = P[m-1] + 2 (P[m-2] + 2 (P[m-3] + … + 2 (P[1] + 2 P[0]) … ))
• This takes O(m) time, assuming each arithmetic operation can be done in O(1) time.
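As a small illustration, a Python sketch of Horner's rule for computing p; mapping characters to integers with ord and the radix parameter d are assumptions of this sketch (the slides simply treat P[i] as a digit):

def pattern_value(P, d=2):
    """Horner's rule: p = d^(m-1)*P[0] + ... + d*P[m-2] + P[m-1], in O(m) time."""
    p = 0
    for c in P:
        p = d * p + ord(c)   # one multiplication and one addition per character
    return p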
Rabin-Karp Algorithm
• Similarly, we can compute each of the n-m+1 integers t_s from the text string from scratch.
• Done this way, it takes O((n-m+1)m) time, assuming that each arithmetic operation can be done in O(1) time.
• This is a bit time-consuming.
Rabin-Karp Algorithm
• A better method is to compute each t_(s+1) incrementally from t_s (a rolling hash):
  t_(s+1) = 2 (t_s - 2^(m-1) T[s]) + T[s+m]
• This takes O(n+m) time in total, assuming that each arithmetic operation can be done in O(1) time.
Problem
• The problem with the previous strategy is that when m is large, it is unreasonable to assume that each arithmetic operation can be done in O(1) time.
• In fact, given a very long integer, we may not even be able to represent it with the default integer type.
• Therefore, we use modular arithmetic. Let q be a prime number such that 2q can be stored in one computer word.
• This makes sure that all computations can be done using single-precision arithmetic.
• Once we use modular arithmetic, when p = t_s for some s we can no longer be sure that P[0..m-1] is equal to T[s..s+m-1].
• Therefore, after the equality test p = t_s, we compare P[0..m-1] with T[s..s+m-1] character by character to ensure that we really have a match.
• The worst-case running time becomes O(nm), but in practice the algorithm avoids most unnecessary character-by-character comparisons.
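Putting the pieces together, here is a hedged Python sketch of Rabin-Karp following the slides' radix-2 formulation with arithmetic modulo a prime q; the use of ord for character values and the particular prime 8191 are assumptions of this sketch:

def rabin_karp(T, P, q=8191, d=2):
    """Report all 0-indexed shifts s with T[s:s+m] == P, using a radix-d rolling hash mod q."""
    n, m = len(T), len(P)
    if m > n:
        return []
    h = pow(d, m - 1, q)                    # weight of the leading "digit" position
    p = t = 0
    for i in range(m):                      # Horner's rule for p and t_0
        p = (d * p + ord(P[i])) % q
        t = (d * t + ord(T[i])) % q
    matches = []
    for s in range(n - m + 1):
        if p == t and T[s:s+m] == P:        # verify to rule out spurious hash hits
            matches.append(s)
        if s < n - m:                       # rolling update: t_{s+1} from t_s
            t = (d * (t - ord(T[s]) * h) + ord(T[s + m])) % q
    return matches

print(rabin_karp("abcabababbc", "abab"))    # [3, 5]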
String Matching Using Finite Automata
Example: Pattern = abba
A string-matching automaton is specified by:
• Q, a finite set of states
• q0 ∈ Q, the start state
• A ⊆ Q, the set of accepting states
• Σ, the input alphabet
• δ: Q × Σ → Q, the transition function
For the pattern abba, the automaton has states Q = {0, 1, 2, 3, 4} (being in state q means the last q characters read match the first q characters of abba), start state 0, accepting state 4, and the following transition function:

  state   a   b
    0     1   0
    1     1   2
    2     1   3
    3     4   0
    4     1   2

(Figure: the corresponding state diagram.)
Feeding the automaton the input text a b a b b a b b a a, it moves through the states
  0, 1, 2, 1, 2, 3, 4, 2, 3, 4, 1,
reaching the accepting state 4 exactly after reading the last character of each occurrence of abba.
Finite-Automaton-Matcher
The example automaton reaches its accepting state exactly at the end of each occurrence of the pattern abba.
For every pattern of length m there exists an automaton with m+1 states that solves the pattern-matching problem with the following algorithm:

Finite-Automaton-Matcher(T, δ, m)
1. n ← length(T)
2. q ← 0
3. for i ← 1 to n do
4.     q ← δ(q, T[i])
5.     if q = m then
6.         s ← i - m
7.         print "Pattern occurs with shift" s
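A Python sketch of the matcher (0-indexed), driven by the transition table for the pattern abba given above; representing δ as a dictionary is a choice of this sketch:

# Transition table δ for the pattern "abba" (states 0..4, accepting state m = 4),
# taken from the table in the example above.
delta = {
    (0, 'a'): 1, (0, 'b'): 0,
    (1, 'a'): 1, (1, 'b'): 2,
    (2, 'a'): 1, (2, 'b'): 3,
    (3, 'a'): 4, (3, 'b'): 0,
    (4, 'a'): 1, (4, 'b'): 2,
}

def finite_automaton_matcher(T, delta, m):
    """Print every 0-indexed shift at which the pattern occurs."""
    q = 0
    for i, c in enumerate(T):
        q = delta[(q, c)]                 # one table lookup per text character
        if q == m:                        # accepting state: a full match ends at i
            print("Pattern occurs with shift", i - m + 1)

finite_automaton_matcher("ababbabbaa", delta, 4)   # prints shifts 2 and 5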
Computing the Transition Function: The Idea!

(Figure: sliding the pattern along the text. After the automaton has matched the first q characters of P and then reads a character a, the next state is the length of the longest prefix of P that is also a suffix of P_q a.)
How to Compute the Transition Function?
A string u is a prefix of a string v if there exists a string w such that uw = v.
A string u is a suffix of a string v if there exists a string w such that wu = v.
Let P_k denote the string consisting of the first k characters of P.

Compute-Transition-Function(P, Σ)
1. m ← length(P)
2. for q ← 0 to m do
3.     for each character a ∈ Σ do
4.         k ← 1 + min(m, q+1)
5.         repeat
6.             k ← k - 1
7.         until P_k is a suffix of P_q a
8.         δ(q, a) ← k
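A direct Python transcription of the procedure, for illustration (0-indexed pattern, δ returned as a dictionary; both are choices of this sketch):

def compute_transition_function(P, alphabet):
    """delta[(q, a)] = length of the longest prefix of P that is a suffix of P[:q] + a."""
    m = len(P)
    delta = {}
    for q in range(m + 1):
        for a in alphabet:
            k = min(m, q + 1) + 1
            while True:
                k -= 1
                if (P[:q] + a).endswith(P[:k]):   # is P_k a suffix of P_q·a ?
                    break
            delta[(q, a)] = k
    return delta

# Reproduces the table for the pattern "abba" shown earlier:
print(compute_transition_function("abba", "ab"))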
Example
Consider the pattern P = abaabaaa and the computation of δ(7, a). Starting from k = 1 + min(m, q+1) = 9 and decrementing, we look for the largest k such that P_k is a suffix of P_7 a:

  P_7 a:  a b a a b a a a
  P_8:    a b a a b a a a     (a suffix: P_8 = P_7 a)

So δ(7, a) = 8.
Example
Now consider δ(7, b) for the same pattern P = abaabaaa. We look for the largest k such that P_k is a suffix of P_7 b = abaabaab:

  P_7 b:  a b a a b a a b
  P_8:    a b a a b a a a     (not a suffix)
  P_7:      a b a a b a a     (not a suffix)
  P_6:        a b a a b a     (not a suffix)
  P_5:          a b a a b     (a suffix)

P_5 = abaab is the longest prefix of P that is a suffix of P_7 b, so δ(7, b) = 5.
Running time of Compute-Transition-Function

Compute-Transition-Function(P, Σ)
1. m ← length(P)
2. for q ← 0 to m do                          (factor: m+1)
3.     for each character a ∈ Σ do            (factor: |Σ|)
4.         k ← 1 + min(m, q+1)
5.         repeat                             (factor: m)
6.             k ← k - 1
7.         until P_k is a suffix of P_q a     (time for the suffix/equality check: m)
8.         δ(q, a) ← k

Running time of the procedure: O(m³ |Σ|)
Knuth-Morris-Pratt (KMP) algorithm
• Idea: after some number of characters of P (say q) have matched against T and a mismatch then occurs, the q matched characters allow us to determine immediately that certain shifts are invalid, so we can go directly to the next potentially valid shift.
Knuth-Morris-Pratt (KMP) algorithm
• The matched characters in T are in fact a prefix of P, so from P alone we can determine whether a shift is invalid.
• Define a prefix function π, which encapsulates knowledge about how the pattern P matches against shifts of itself:
– π : {1,2,…,m} → {0,1,…,m-1}
– π[q] = max{k : k < q and P_k ⇐ P_q}; that is, π[q] is the length of the longest prefix of P that is a proper suffix of P_q.
Prefix function
If we precompute the prefix function of P (matching P against itself), then whenever a mismatch occurs, the prefix function tells us which shifts are invalid so they can be ruled out directly, and we can move straight to the next potentially valid shift. Moreover, the characters that matched before the mismatch need not be compared again, since we already know they are equal.
Prefix-function values for the prefixes of a pattern beginning ababab… (as in P = ababababca below):

  i    proper suffixes of P_i           π[i]
  1    {}                                0
  2    {b}                               0
  3    {ba, a}                           1
  4    {bab, ab, b}                      2
  5    {baba, aba, ba, a}                3
  6    {babab, abab, bab, ab, b}         4
P = ababababca

  q    operations                                               π[q]
  1    k = 0                                                     0
  2    P[k+1] ≠ P[q], so π[q] = k = 0                            0
  3    P[k+1] = P[q], so k = k+1 = 1, π[q] = k = 1               1
  4    P[k+1] = P[2] = P[q], so k = k+1 = 2, π[q] = k = 2        2
  5    P[k+1] = P[3] = P[q], so k = k+1 = 3, π[q] = k = 3        3
  6    P[k+1] = P[4] = P[q], so k = k+1 = 4, π[q] = k = 4        4
  7    P[k+1] = P[5] = P[q], so k = k+1 = 5, π[q] = k = 5        5
  8    P[k+1] = P[6] = P[q], so k = k+1 = 6, π[q] = k = 6        6
  9    P[k+1] = P[7] ≠ P[q], k = π[k] = π[6] = 4;                0
       P[k+1] = P[5] ≠ P[q], k = π[k] = π[4] = 2;
       P[k+1] = P[3] ≠ P[q], k = π[k] = π[2] = 0;
       P[k+1] = P[1] ≠ P[q], so π[q] = k = 0
  10   P[k+1] = P[q], so k = k+1 = 1, π[q] = k = 1               1
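For illustration, a Python sketch of COMPUTE-PREFIX-FUNCTION using a 0-indexed array, so pi[q-1] holds π[q]; the indexing shift is an adjustment of this sketch:

def compute_prefix_function(P):
    """pi[q] = length of the longest proper prefix of P[:q+1] that is also a suffix of it."""
    m = len(P)
    pi = [0] * m
    k = 0
    for q in range(1, m):                 # positions 1 .. m-1 (0-indexed)
        while k > 0 and P[k] != P[q]:     # mismatch: fall back through shorter borders
            k = pi[k - 1]
        if P[k] == P[q]:
            k += 1
        pi[q] = k
    return pi

print(compute_prefix_function("ababababca"))   # [0, 0, 1, 2, 3, 4, 5, 6, 0, 1]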
Analysis of KMP algorithm
• The running time of COMPUTE-PREFIX-FUNCTION is Θ(m), and KMP-MATCHER runs in Θ(m) + Θ(n) = Θ(m+n) time.
• Using amortized analysis (potential method) for COMPUTE-PREFIX-FUNCTION:
– Associate a potential equal to the current value of k with the current state of the algorithm.
– Consider lines 5 to 9 (the body of the main loop of the pseudocode for COMPUTE-PREFIX-FUNCTION).
– The initial potential is 0; line 6 decreases k, since π[k] < k, and k never becomes negative.
– Line 8 increases k by at most 1 in each iteration of the outer loop.
– Amortized cost of one outer iteration = actual cost + potential change
  = (number of iterations of the loop in lines 5-6 + O(1)) + (an increase of at most 1 from line 8 minus a decrease of at least 1 for each iteration of lines 5-6) = O(1).
– Summed over the m iterations of the outer loop, the total cost is O(m).
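And a matching sketch of KMP-MATCHER, reusing compute_prefix_function from the sketch above (0-indexed shifts; the helper name is my own):

def kmp_matcher(T, P):
    """Return all 0-indexed shifts at which P occurs in T, in O(n + m) time."""
    n, m = len(T), len(P)
    pi = compute_prefix_function(P)        # Θ(m) preprocessing (sketched above)
    shifts = []
    q = 0                                  # number of characters of P matched so far
    for i in range(n):                     # scan T left to right, never going back in T
        while q > 0 and P[q] != T[i]:
            q = pi[q - 1]                  # mismatch: fall back using π
        if P[q] == T[i]:
            q += 1
        if q == m:                         # all of P matched
            shifts.append(i - m + 1)
            q = pi[q - 1]                  # continue, allowing overlapping matches
    return shifts

print(kmp_matcher("abcabababbc", "abab"))  # [3, 5]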
Trie
A trie T for a set S of strings can be used to implement a dictionary whose keys are the strings of S. Namely, we search T for a string X by tracing down from the root the path indicated by the characters in X. If this path can be traced and terminates at an external node, then X is in the dictionary. For example, in the trie in Figure 12.9, tracing the path for "bull" ends up at an external node. If the path cannot be traced, or it can be traced but terminates at an internal node, then X is not in the dictionary. In the example in Figure 12.9, the path for "bet" cannot be traced and the path for "be" ends at an internal node.
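A minimal dictionary-style trie sketch in Python; the concrete word set is assumed for illustration, since the actual words of Figure 12.9 appear only in the textbook figure, not in this text:

class TrieNode:
    def __init__(self):
        self.children = {}      # character -> TrieNode
        self.is_word = False    # True if a string of S ends here (an "external" word node)

class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for c in word:
            node = node.children.setdefault(c, TrieNode())
        node.is_word = True

    def search(self, word):
        """Trace the path for `word` from the root; report whether it ends at a word node."""
        node = self.root
        for c in word:
            if c not in node.children:
                return False          # path cannot be traced
            node = node.children[c]
        return node.is_word           # traced, but may stop at an internal node

# Assumed word set for illustration:
trie = Trie(["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"])
print(trie.search("bull"))   # True: path traced to a word node
print(trie.search("bet"))    # False: path cannot be traced
print(trie.search("be"))     # False: path stops at an internal node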
Trie
We can use a trie to perform a special type of pattern matching, called word matching, where we want to determine whether a given pattern matches one of the words of the text exactly.
Word matching with a standard trie: the text to be searched.
Trie
Word matching with a standard trie: the standard trie for the words in the text (articles and prepositions, which are also known as stop words, excluded), with external nodes augmented with indications of the word positions.
Trie
Using a trie, word matching for a pattern of length m takes O(dm) time, where d is the size of the alphabet, independent of the size of the text. If the alphabet has constant size (as is the case for text in natural languages and for DNA strings), a query takes O(m) time, proportional to the size of the pattern.
Compressed Tries
A compressed trie is similar to a standard trie, but it ensures that each internal node in the trie has at least two children. It enforces this rule by compressing chains of single-child nodes into individual edges. We say that an internal node v of T is redundant if v has one child and is not the root.
Compressed Tries
Compact representation of the compressed trie for S.
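A hedged sketch of the compression step: chains of single-child nodes are merged into single multi-character edges. Representing the trie as nested dicts with a '$' end-of-word marker, and the example word set, are assumptions of this sketch:

def build_trie(words):
    """Standard trie as nested dicts; '$' marks the end of a word (assumed convention)."""
    root = {}
    for w in words:
        node = root
        for c in w:
            node = node.setdefault(c, {})
        node['$'] = {}                      # terminal marker
    return root

def compress(node):
    """Merge chains of single-child nodes so every internal node except the root has >= 2 children."""
    compressed = {}
    for label, child in node.items():
        child = compress(child)
        while len(child) == 1:              # redundant node: exactly one child
            (sub_label, grandchild), = child.items()
            label += sub_label              # absorb it into the edge label
            child = grandchild
        compressed[label] = child
    return compressed

words = ["bear", "bell", "bid", "bull", "buy"]   # assumed example set
print(compress(build_trie(words)))

After compression, each remaining edge carries a substring rather than a single character, which is what the compact representation in the figure stores as index ranges into the strings of S.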
Suffix Tries
One of the primary applications of tries is the case when the strings in the collection S are all the suffixes of a string X. Such a trie is called the suffix trie (also known as a suffix tree or position tree) of the string X.
Suffix Tries
Since the total length of the suffixes of a string X of length n is n + (n-1) + … + 1 = n(n+1)/2, storing all the suffixes of X explicitly would take O(n²) space. The compact representation of a suffix trie T for a string X of length n uses only O(n) space.
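A sketch of naïve suffix-trie construction (inserting every suffix into a standard trie), which makes the O(n²) size of the uncompressed form visible; the node-counting helper is added only for illustration:

def suffix_trie(X):
    """Standard (uncompressed) suffix trie of X as nested dicts."""
    root = {}
    for i in range(len(X)):                 # insert every suffix X[i:]
        node = root
        for c in X[i:]:
            node = node.setdefault(c, {})
    return root

def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.values())

# For a string with few repeated substrings the trie has Θ(n^2) nodes:
X = "abcdefgh"
print(count_nodes(suffix_trie(X)))          # 37 nodes for n = 8 (1 root + 8+7+...+1)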
Suffix Tries
The following reference explains an efficient approach for constructing a suffix tree in linear time:
Gusfield, D. "Linear-time construction of suffix trees." In Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.