String Matching

Main idea: look for a string (the pattern) of length m in a text array of length n, n >= m.
Notation:
If the pattern is array P[1…m] and the text is array T[1…n], then pattern P occurs in T
with shift s if T[s+1…s+m] = P[1…m].
For example:
T = aaabbbcacbaaaa
P = abbb
Then P occurs in T with shift 2.
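The shift definition maps directly onto 0-based Python slices. This is a minimal sketch; the helper name occurs_with_shift is made up for illustration and is not from the textbook:

```python
def occurs_with_shift(text, pattern, s):
    """Return True if pattern occurs in text with shift s.

    The notes use 1-based arrays: T[s+1 .. s+m] = P[1 .. m].
    With Python's 0-based slices the same window is text[s : s+m].
    """
    m = len(pattern)
    return 0 <= s <= len(text) - m and text[s:s + m] == pattern


T = "aaabbbcacbaaaa"
P = "abbb"
print(occurs_with_shift(T, P, 2))  # True
print(occurs_with_shift(T, P, 1))  # False
```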
Implementations:
Naïve algorithm: “slide” the pattern (verbatim) along the array and see if there is a
match.
Rabin-Karp algorithm: slide the “digest” of the pattern along the array and see if there
is a match.
Naïve String Matcher:
NaiveStringMatcher(T, P) {
    n = length(T)
    m = length(P)
    for s = 0, n-m
        if P[1…m] = T[s+1…s+m]
            print “Pattern occurs with shift” s
}
Running time = __________________.
Exercise: write the lower level pseudocode to implement the “if” statement.
What does this algorithm return if the pattern repeats several times in the text?
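For experimenting with the naive matcher, here is one possible Python rendering. It is a sketch, not the textbook's code: it uses 0-based shifts, compares windows with a slice rather than a character-by-character loop, and collects the shifts in a list instead of printing them.

```python
def naive_string_matcher(text, pattern):
    """Return every shift s at which pattern occurs in text (0-based shifts)."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):          # s = 0 .. n-m, as in the pseudocode
        if text[s:s + m] == pattern:    # the "if" test, done by slice comparison
            shifts.append(s)
    return shifts


print(naive_string_matcher("aaabbbcacbaaaa", "abbb"))  # [2]
print(naive_string_matcher("abababa", "aba"))          # [0, 2, 4]
```

Running it on a text in which the pattern repeats is one way to check your answer to the question above.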
The Rabin-Karp Algorithm:
Instead of sliding the original pattern, slide the digest of the pattern.
The digest is computed from the decimal value of the string. To keep it under control
(i.e. to keep the numbers relatively small), we compute it modulo Z.
(Use Horner’s rule from Ch. 2 to compute the decimal value in the most efficient manner.)
Example:
Given:
stringT = “2359023141526739921”.
stringP= “31415”
Z = 13
Find occurrences of P in T.
decimalP = 5*10^0 + 1*10^1 + 4*10^2 + 1*10^3 + 3*10^4 if computed naively
= 5 + 10*(1 + 10*(4 + 10*(1 + 10*3))) if computed by Horner’s rule
= 31415
digestP = decimalP mod 13 = 7.
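Horner's rule can be combined with the modulo so that the intermediate values never grow large. A minimal sketch, assuming the string contains only decimal digits (the function name digest is chosen here for illustration):

```python
def digest(digits, z):
    """Value of a decimal digit string modulo z, via Horner's rule."""
    value = 0
    for ch in digits:                        # most significant digit first
        value = (10 * value + int(ch)) % z   # reduce at every step
    return value


print(digest("31415", 13))  # 7
print(int("31415") % 13)    # 7, same result computed the direct way
```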
Now slide a window of length 5 (because the original string P has 5 digits) along T.
Compute the digest of each window of 5 digits and see if it matches digestP.
Glitch: there can be collisions, i.e. different numbers can have the same value modulo 13,
so it is still necessary to check the original strings.
T = 2 3 5 9 0 2 3 1 4 1 5 2 6 7 3 9 9 2 1

shift s   window   digestT
0         23590    8     which does not equal digestP, so slide the window one over
1         35902    9
2         59023    3
3         90231    11
4         02314    0
5         23141    1
6         31415    7     which equals digestP. Since the strings match, we found a true
                         match, so P occurs in T with shift 6.
7         14152    8
8         41526    4
9         15267    5
10        52673    10
11        26739    11
12        67399    7     which equals digestP, but the strings do not match; so this is
                         a spurious hit.
13        73992    9
14        39921    11
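The window-by-window digests in this example can be recomputed with a few lines of Python. This sketch recomputes every digest from scratch (no reuse of the previous value), and the function name window_digests is made up for illustration:

```python
def window_digests(text, m, z):
    """Digest (decimal value mod z) of every length-m window of a digit string."""
    return [int(text[s:s + m]) % z for s in range(len(text) - m + 1)]


T = "2359023141526739921"
P = "31415"
digest_p = int(P) % 13
digests = window_digests(T, len(P), 13)
print(digests)  # [8, 9, 3, 11, 0, 1, 7, 8, 4, 5, 10, 11, 7, 9, 11]

for s, dg in enumerate(digests):
    if dg == digest_p:
        window = T[s:s + len(P)]
        kind = "match" if window == P else "spurious hit"
        print(s, window, kind)
# 6 31415 match
# 12 67399 spurious hit
```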
RabinKarpStringMatcher(T, P, Z) {
    n = length(T)
    m = length(P)
    digestP = Decimal(P, 1, m) mod Z
    for s = 0, n-m
        digestT = Decimal(T, s+1, s+m) mod Z
        if digestP = digestT and P[1…m] = T[s+1…s+m]
            print “Pattern occurs with shift” s
}
Decimal(A, p, r) {
    return decimal value of the portion of A from index p to index r
}
Exercise:
Write the improvement for calculating digestT, since it is not necessary to recompute
from scratch; we can reuse the previous value of digestT.
Code in the textbook, which includes all efficiency improvements (Horner’s rule and
computing the next digestT from the previous one):
//d is the decimal base, q is the mod we are working with, T is the text, P is the pattern
//look for occurrences of P in T
RabinKarpMatcher(T, P, q, d) {
    n = length(T)
    m = length(P)
    h = d^(m-1) mod q
    p = 0
    t = 0
    for i = 1, m
        p = (d*p + P[i]) mod q
        t = (d*t + T[i]) mod q
    for s = 0, n-m
        if p = t
            if P[1…m] = T[s+1…s+m]
                print “Pattern occurs with shift” s
        if s < n-m
            t = (d*(t - T[s+1]*h) + T[s+m+1]) mod q
}
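A possible Python translation of the pseudocode above, offered as a sketch rather than the textbook's code. It assumes that text and pattern are strings of decimal digits, uses 0-based shifts, and returns the list of shifts instead of printing them:

```python
def rabin_karp_matcher(text, pattern, q, d=10):
    """Return all 0-based shifts at which pattern occurs in text.

    Assumes both strings contain only digits in base d (d <= 10).
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)          # d^(m-1) mod q: weight of the leading digit
    p = t = 0
    for i in range(m):            # Horner's rule, reduced mod q at each step
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # Digests agree: verify the strings to rule out a spurious hit.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Drop the leading digit, shift left, bring in the next digit.
            # Python's % is non-negative for q > 0, so no extra correction is needed.
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return shifts


print(rabin_karp_matcher("2359023141526739921", "31415", 13))  # [6]
```

For the example text, h = 10^4 mod 13 = 3, and the first update goes from digest 8 (window 23590) to (10*(8 - 2*3) + 2) mod 13 = 22 mod 13 = 9, which is the digest of the next window, 35902.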
State Transition Diagrams
State transition diagrams are way cool; they are used in operations research,
networking, etc. For example, the entire TCP algorithm fits into a state transition diagram
that has about 10 states or so and fits on half a page. Many networking algorithms
are described as state tables.
>Q: regarding the diagrams and the state transition tables in the book, particularly
>Figure 32.7.
>
> I understand at this point how they arrived at the shaded
>values, but I don't get, for example, why in the second row, column 'a'
>has a value of 1 and 'c' has a value of 0.
>
>I'm hoping that understanding that is the key to figuring out why there
>is, for example, an arrow going from state 1 to itself labeled 'a'...
The diagram shows you what happens when a certain input gets into the machine in a
given state, and the state table below shows the same. The “bubble” is the state and the
arrow going out of it is the input that makes the machine go to the next state, i.e. the
next “bubble.”
The convention is that every input that messes up the sequence and makes you start
from scratch (in state 0) is not shown in the state transition diagram.
So, the first bubble on the left says 0, meaning it is state 0. When the machine is in state
0, then when it gets an "a" as input, it goes to state 1, as shown in the figure and also in
the first row, first slot of the table.
When it gets b or c, nothing happens: we stay in the same state. That's why we
don't show that edge with b or c. We could show it as a self-loop edge, but that would
clutter the diagram. The first row of the state table says that if the state is 0 and the
input is b or c, we go to state 0.
In state 1, if the input is a, then it is still a possible valid beginning of the string we are
looking for; we are still on track and we stay in that state (so we can have the sequence
aaaaaaa and we are just waiting for b).
If we get a b, then we have ab, which is a beginning of the string we are looking for, so
now we go to state 2.
If we get c, then we had ac, so it is a total miss and we have to start again from scratch,
so we go back to state 0. That's why there is a 0 in that slot in the state table.
In state 2, so far we had aaaaaaab. If we get an a, then we go to state 3 and wait for b.
If we get b, then we just had aaaaaaabb; it messes up everything and we should have
an arrow going back to state 0. It is in the state table below, but not on the diagram. The
convention is that every input that messes up the sequence and makes you start from
scratch, in state 0, is not shown in the state transition diagram.
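The rows of the state table discussed above can be reproduced mechanically: from state q on input x, the next state is the length of the longest prefix of the pattern that is a suffix of (the first q pattern characters followed by x). The sketch below assumes the pattern is ababaca over the alphabet {a, b, c}. These notes do not state the pattern, so treat that as an assumption to check against Figure 32.7. The function name transition_table is made up for illustration.

```python
def transition_table(pattern, alphabet):
    """Build the state table: table[q][x] is the next state from state q on input x."""
    m = len(pattern)
    table = []
    for q in range(m + 1):
        row = {}
        for x in alphabet:
            seen = pattern[:q] + x             # what we have matched, plus the new input
            k = min(m, q + 1)                  # longest prefix that could possibly fit
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1                         # try shorter prefixes; 0 means start over
            row[x] = k
        table.append(row)
    return table


# Assumed pattern and alphabet (not given in these notes).
table = transition_table("ababaca", "abc")
for state in range(3):
    print(state, table[state])
# 0 {'a': 1, 'b': 0, 'c': 0}
# 1 {'a': 1, 'b': 2, 'c': 0}
# 2 {'a': 3, 'b': 0, 'c': 0}
```

The first three rows agree with the explanation above: state 1 loops to itself on a and falls back to 0 on c, and state 2 falls back to 0 on b.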