Notes for Lecture 10 (save it as PowerPoint97 file)

advertisement
String Matching
The problem:
• Input: a text T (very long string) and a
pattern P (short string).
• Output: the index in T where a copy of P
begins.
1
Some Notations and Terminologies
• |P| and |T|: the lengths of P and T.
• P[i]: the i-th letter of P.
• Prefix of P: a substring of P starting with
P[1].
• P[1..i]: the prefix containing the first i
letters of P.
• Example: abcabbccaa.
prefix: a, ab, abc, abca, abcab, abcabb, ….
2
Some Notations and Terminologies
• suffix of P[1..i]: a substring of P[1..i]
ending at P[i], e.g. P[3..i], P[5..i] (i>4).
Example: P[1..5]=abcaa.
Suffix of P[1, 3]: c, bc, abc.
Suffix of P[1..4]: a, ca, bca, abca.
3
Straightforward method
• Basic idea:
1. i=1;
2. Start with T[i] and match P with
T[i],T[i+1], ... T[i+|P|-1]
|
|
|
P[1] P[2]
P[|P|]
3. whenever a mismatch is found,
i=i+1 and goto 2 until i+|P|-1<|T|.
• Example 1: T=ABABABCCA and P=ABABC
P: ABABC
A
ABABC
|
|
|
T: ABABABCCA ABABABCCA ABABABCCA
4
Analysis
• Step 2 takes O(|P|) comparisons in the worst
case.
• Step 2 could be repeated O(|T|) times.
• Total running time is O(|T||P|).
5
Knuth-Morris-Pratt Method
(linear time algorithm)
A better idea
• In step 3, when there is a mismatch we move
forward one position (i=i+1).
• We may move more than one position at a time
when a mismatch occurs. (carefully study the
pattern P).
For example:
P: ABABC
ABA
T: ABABABCCA ABABABCCA
6
Questions:
• How to decide how many positions we should
jump when a mismatch occurs?
• How much we can benefit? O(|T|+|P|).
Example 2:
P: abcabcabcaa
|
T: abcabcabcabcaa
|
abcabcab
back here
7
• We can move forward more than one position. Reason?
• Study of Pattern P
P[1..7] abcabca
P[1..10] abcabcabca (when trying to P[11], we have a mismatch)
P[1..7]
abcabca
P[1..4]
abca
• P[1..7] is the longest prefix that is also a suffix of
P[1..10].
• P[1..4] is a prefix that is a suffix of P[1..10], but
not the longest.
• Key: When mismatch occurs at P[i+1], we want to
find the longest prefix of P[1..i] which is also a
suffix of P[1..i].
8
Failure function
• f(i) is the largest r with (r<i) such that
P[1] P[2] ...P[r] = P[i-r+1]P[i-r+2], ..., P[i].
Prefix of length r
Suffix of P[1]P[2]…P[i] of length r
• That is, P[1..f(i)] is the longest prefix that is a suffix of P[1..i].
• Example 3: P=ababaccc and i=5.
P[1] P[2] P[3]
a
b a
a b a
b a
P[3] P[4] P[5]
(r=3) f(5)=3.
9
• Example 4:
P=abcabbabcabbaa
It is easy to verify that
f(1)=0, f(2)=0, f(3)=0, f(4)=1, f(5)=2,
f(6)=0, f(7)=1, f(8)=2, f(9)=3, f(10)=4,
f(11)=5, f(12)=6, f(13)=7, f(14)=1.
10
The Scan Algorithm
•
•
1.
2.
3.
(draw a figure to show)
i: indicates that T[i] is the next character in T to be
compared with the right end of the pattern.
q: indicates that P[q+1] is the next character in P to be
compared with T[i].
i=1 and q=0;
Compare T[i] with P[q+1]
case 1: T[i]==P[q+1]
i=i+1;q=q+1;
if q+1==|P| then print "P occurs at i+1-|P|"
case 2: T[i]≠P[q+1] and q≠0
q=f(q);
case 3: T[i]≠P[q+1] and q==0
i=i+1;
Repeat step2 until i==|T|.
11
• Example 5: P=abcabbabcabbaa
i
1 2 3 4 5 6 7 8 9 10 11 12 13 14
f(i) 0 0 0 1 2 0 1 2 3 4 5 6 7 1
T=abcabcabbabbabcabbabcabbaa
abcabb
|
|
|
abcabbabc
|
abc
|
a(i=i+1)
abcabbabcabbaa(q+1=|p|)
12
Running time complexity(hard)
• The running time of the scan algorithm is O(|T|).
• Proof:
– There are two pointers i and p.
– i: the next character in T to be compared.
– p: the position of P[1]. (See figure below)
p
i
P:abcabcabcaa
|
T:abcabcabcabcaa
|
P:
abcabcaa
p
13
Facts:
1 When a match is found, move i forward.
2 When a mismatch is found, move p forward
until p and i are the same. (When p=i and a
mismatch occur, move both i and p forward)
From facts 1 and 2, it is easy to see that the
total number of comparisons is at most 2|T|.
Thus, the time complexity is O(|T|).
14
Another version of scan algorithm (code)
n=|T|
m=|P|
q=0
for i=1 to n
{
while q>0 and P[q+1]≠T[i] do
{
q=f(q)
}
if P[q+1]==T[i] then
q=q+1
if q==m then
{
print "pattern occurs at i-m+1"
q=f(q)
}
}
15
Failure Function Construction
Basic idea:
Case 1: f(1) is always 0.
Case 2: if P[q]==P[f(q-1)+1] then f(q)=f(q-1)+1.
Example: p=abcabcc
abc
f(1)=0; f(2)=0; f(3)=0; f(4)=1; f(5)=2; f(6)=3; f(7)=0;
P[4]= P[f(4-1)+1], f(4)=f(4-1)+1=1.
P[5]= P[f(5-1)+1], f(5)=f(5-1)+1=1+1=2.
P[6]= P[f(6-1)+1]. F(6)=f(6-1)+1=2+1=3.
16
Case 3: if P[q]P[f(q-1)+1] and f(q-1)≠0 then
consider P[q] ?= P[f(f(q-1))+1] (Do it recursively)
Case 4: if P[q]  P[f(q-1)+1] and f(q-1)==0 then
f[q]=0.
Example : abc abc abb
abc abc
f(8)=5
abc
f(5)=2
a
f(2)=0
i: 1 2 3 4 5 6 7 8 9
f(i): 0 0 0 1 2 3 4 5 0
17
The algorithm (code) to compute failure function
1.
2.
3.
4.
5.
6.
7.
8.
m=|P|;
f(1)=0;
k=0;
for q=2 to |P| do
{
k=f(q-1);
if(k>0 and P[k+1]!=P[q])
{
k=f(k);
goto 6; }
if(k>0 and P[k+1]==P[q])
{
f[q]=k+1;
}
if(k==0)
{
if(P[k+1]==P[q] f[q]=1;
else f[q]=0;
}
}
18
Another version
1.
2.
3.
4.
5.
6.
7.
8.
9.
m=|P|;
f(1)=0;
k=0;
for q=2 to |P| do
{
k=f(q-1);
while(k>0 and P[k+1]!=P[q]) do
{
k=f(k);
}
if(P[k+1]==P[q]) then k=k+1;
f[q]=k;
}
19
• Example 3:
1 2 3 4 5 6 7 8 9 10 11 12
P=a b c a b c a b c a a c
f(1)=0; f(2)=0; f(3)=0; f(4)=1; f(5)=2;
f(6)=3; f(7)=4; f(8)=5; f(9)=6; f(10)=7;
f(11)=1.
(The computation of f(11) is very interesting.)
Question: Do we need to compute f(12)?
Yes, if you want to find ALL occurrences of P.
No, if you just want to find the first occurrence of P.
20
Example:
P=abcabc
T=abcabcabc
abcabc
abcabc
i
123456
f(i) 0 0 0 1 2 3
When a match is found at the end of P, call f(|p|).
Running time complexity (Fun Part, not required)
The running time of failure function construction algorithm is
O(|P|). (The proof is similar to that for scan algorithm.)
Total running time complexity
The total complexity for failure function construction and
scan algorithm is O(|P|+|T|).
21
Linear Time Algorithm for Multiple
patterns (Fun Part)
• Input: a string T (very long) and a set of
patterns P1,P2,...,Pk.
• Output: all the occurrences of Pi's in T.
Let us consider the set of patterns { he, she,
his, hers }. We can construct an automata
as follows:
22
e,i,r
h
0
e
r
1
s
2
8
s
i
6
s
3
9
h
4
7
e
5
s
123456789
f(s) 0 0 0 1 2 0 3 0 3
23
• g(s,a)=s' means that at state s if the next
input letter is a then the next state is s'.
• The states of the automata is organized
column by column.
• Each state corresponds to a prefix of some
pattern Pi.
• F: the set of final states (dark circled)
corresponding to the ends of patterns.
• For the starting state 0, add g(0,a)=0, if
g(0,a) is originally fail.
24
• Exercise: write down the g() function for
the above automata.
• Failure function
f(s) = the state for the longest prefix of some
pattern Pi that is a suffix of the string in the
path from 0 (starting state) to s.
• Example:
he is the longest prefix for hers that is a suffix
of the string she.
25
The scan algorithm
Text: T[1]T[2]...T[n]
s=0;
for i:=1 to n do
{
while g(s,T[i])=fail do s=f(s);
s:=g(s,T[i]);
if s is in F then return "yes";
}
return "no"
26
Theorem: The scan algorithm takes O(|T|)
time.
Proof: Again, the two pointer argument.
• When a match is found, move the first
pointer forward. (s:=g(s,T[i]);)
• When a mismatch is found (g(s,T[i])==fail),
move the second pointer forward. (s=f(s);)
• When a final state is meet, declare the
finding of a pattern. (if s is in F then return
"yes";)
27
• Example:
i=1 2 3 4 5 6 7 8
s h e r s h i i
3 4 5
2 8 9
3 4
1
0
0 0
s
123456789
f(s) 0 0 0 1 2 0 3 0 3
28
Failure function construction
• Basic idea: similar to that for one pattern.
for each state s of depth 1 do
f(s)=0
for each depth d>=1 do
for each state sd of depth d and
character a such that g(sd,a)=s' do
{
s=f(sd)
while g(s,a)=fail do
{
s=f(s)
}
f(s')=g(s,a)
}
29
• g(0,c)≠fail for any possible character c.
The failure function for {he, she, his, hers} is
s
123456789
f(s) 0 0 0 1 2 0 3 0 3
Time complexity: O(|P1|+|P2|+...+|Pk|).
Proof: Two pointer argument.
Leave it for assignment (optional)
30
Download