Lecture 7

MCS 101: Algorithms
Instructor
Neelima Gupta
ngupta@cs.du.ac.in
Table of Contents
• String Matching
– Naïve Method
– Finite Automata Approach
– Rabin Karp
– KMP
Pattern Matching
• Given a text string T[1..n] and a pattern
P[1..m], find all occurrences of the pattern
within the text.
• Example: T = ababcabdabcaabc and P = abc,
the occurrences are:
– first occurrence starts at T[3]
– second occurrence starts at T[9]
– third occurrence starts at T[13]
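The example above can be checked with a few lines of Python. This is only a sketch: the function name find_all is an assumption made for illustration, and positions are reported 1-indexed so that they agree with T[3], T[9] and T[13].

```python
def find_all(text, pattern):
    """Return the 1-indexed start positions of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    # Try every alignment and keep those where the whole pattern matches.
    return [i + 1 for i in range(n - m + 1) if text[i:i + m] == pattern]


print(find_all("ababcabdabcaabc", "abc"))  # [3, 9, 13]
```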
Let Σ denote the alphabet.
• Given:
A text string T[1..n] of length n
and a pattern P[1..m] of length m,
where m << n.
• To Find:
Whether the pattern P occurs in the text T or not. If it does, then
give the first occurrence of P in T.
The characters of both T and P are drawn from the finite set Σ.
NAÏVE APPROACH
T: a b c a b d a a b c d e
P: a b d
Example ( Step – 1 )
T: a b c a b d a a b c d e
P: a b d
Mismatch after 3 Comparisons
Example ( Step – 2 )
T: a b c a b d a a b c d e
P:   a b d
Mismatch after 1 Comparison
Example ( Step – 3 )
T: a b c a b d a a b c d e
P:     a b d
Mismatch after 1 Comparison
Example ( Step – 4 )
T: a b c a b d a a b c d e
P:       a b d
Match found after 3 Comparisons
Thus, after 3 + 1 + 1 + 3 = 8 comparisons the
substring P is found in T.
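The four steps above can be reproduced with a short sketch of the naïve method that also counts character comparisons. The function name naive_first_match and the returned pair are assumptions made for illustration; positions are 1-indexed.

```python
def naive_first_match(text, pattern):
    """Return (1-indexed position of the first occurrence or None, comparisons made)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for start in range(n - m + 1):       # try every alignment, left to right
        k = 0
        while k < m:
            comparisons += 1
            if text[start + k] != pattern[k]:
                break                    # mismatch: shift the pattern right by 1
            k += 1
        if k == m:                       # all m characters matched
            return start + 1, comparisons
    return None, comparisons


print(naive_first_match("abcabdaabcde", "abd"))  # (4, 8)
```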
Worst Case Running Time
T : a a a a a ... a a f, of size n
P : a a a f, of size 4
Example ( Step – 1 )
T: a a a a a . . . . a a a f
P: a a a f
Mismatch found after 4 comparisons
Example ( Step – 2 )
T: a a a a a . . . . a a a f
P:   a a a f
Mismatch found after 4 comparisons
Example ( Final Step )
T: a a a a a . . . . a a a f
P:                   a a a f
Match found after 4 comparisons
Worst Case Running Time
This continues until the pattern is aligned with the
last 4 characters of T. Each of the first (n-4)
alignments fails after 4 comparisons and the last one
matches after 4 comparisons, so the no. of comparisons
required is (n-4)4 + 4.
Worst Case Running Time
• At every step, a mismatch is found only after m
comparisons.
• These m comparisons are done for (n-m) starting
positions in T, and the final alignment takes m more.
• Thus, the running time obtained is (n-m)m + m,
which is O(nm).
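The count (n-m)m + m can be checked empirically on the worst-case input from the slides (a text of a's ending in f, and a pattern of a's ending in f). The helper count_comparisons below is a hypothetical name, and the chosen values of n and m are arbitrary.

```python
def count_comparisons(text, pattern):
    """Count character comparisons made by the naive method until the first match."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for start in range(n - m + 1):
        for k in range(m):
            comparisons += 1
            if text[start + k] != pattern[k]:
                break
        else:
            return comparisons           # no break: the pattern matched here
    return comparisons


for n in (5, 10, 100):
    for m in (2, 4):
        text = "a" * (n - 1) + "f"       # a a ... a f, length n
        pattern = "a" * (m - 1) + "f"    # a ... a f, length m
        assert count_comparisons(text, pattern) == (n - m) * m + m
```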
Finite Automata
[Figure: state diagram of the finite automaton, showing states s0, s1, s2, s3 and the labels a, #a, f and Σ]
Worst Case Running Time
• In a finite automaton, each character of the text is scanned at most
once. Thus in the worst case, the searching time is
O(n).
• Preprocessing time: an edge has to be formed for every character
in Σ at each state, thus the preprocessing time
is O(m*|Σ|).
• Thus the total running time is O(n) + O(m*|Σ|).
Drawback: If the alphabet Σ is very large, then the
time required to construct the FA will be very
large.
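A minimal sketch of the finite automaton approach follows. It assumes one standard O(m*|Σ|) table construction, which may differ from the construction intended in the slides. The names build_dfa and dfa_search are hypothetical, the pattern is assumed to be non-empty with all its characters in the alphabet, and positions are 1-indexed.

```python
def build_dfa(pattern, alphabet):
    """Build the transition table in O(m * |alphabet|) time.

    State q means: q is the length of the longest prefix of the pattern
    that is a suffix of the text read so far. State m is accepting.
    """
    m = len(pattern)
    dfa = [{ch: 0 for ch in alphabet} for _ in range(m + 1)]
    if m == 0:
        return dfa
    dfa[0][pattern[0]] = 1
    x = 0  # state reached by reading the pattern without its first character
    for q in range(1, m + 1):
        for ch in alphabet:
            dfa[q][ch] = dfa[x][ch]       # on a mismatch, behave like state x
        if q < m:
            dfa[q][pattern[q]] = q + 1    # on a match, advance one state
            x = dfa[x][pattern[q]]
    return dfa


def dfa_search(text, pattern, alphabet):
    """Return 1-indexed start positions of all occurrences; each text character is read once."""
    m = len(pattern)
    if m == 0:
        return []
    dfa = build_dfa(pattern, alphabet)
    state, positions = 0, []
    for i, ch in enumerate(text):
        state = dfa[state].get(ch, 0)     # characters outside the alphabet reset to state 0
        if state == m:
            positions.append(i - m + 2)   # i is the 0-indexed end of the match
    return positions


print(dfa_search("aaaaaaf", "aaaf", "af"))            # [4]
print(dfa_search("ababcabdabcaabc", "abc", "abcd"))   # [3, 9, 13]
```

The table has (m+1) rows of |Σ| entries each, which is where the drawback for very large alphabets shows up.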
BRUTE FORCE STRATEGY
• In this strategy, whenever a mismatch was
found, the pattern was shifted right by 1
character.
• But this was not an efficient strategy, as it
required a large number of comparisons.
Hence a better algorithm was required.
KMP : Knuth Morris Pratt Algorithm
T :  … T[j] … T[j+r-1] … T[j+k-1-r] … T[j+k-2]  T[j+k-1] …
P :    P[1] … P[r]     …            … P[k-1]    P[k] …
P (shifted) :            P[1]       … P[r]      P[r+1] …
If T[j+k-1] ≠ P[k]:
Shifting of the pattern is required. But instead of shifting right by 1
character, we look for the longest proper prefix of P[1..k-1] that matches a
suffix of T[j..j+k-2].
Since T[j..j+k-2] has already been matched with P[1..k-1], this
means we need to look for the longest proper prefix of P[1..k-1] that matches
its own suffix.
KMP Contd..
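• Let r be the length of the longest proper prefix of the
matched part P[1..k-1] that is also its suffix. Then the
pattern can be shifted by k-1-r positions instead of 1 and
T[j+k-1] should be compared with P[r+1].
• Claim 1: We have not missed any match, i.e. the
pattern does not occur starting at any position from j to j+k-r-2.
• Proof: Position j is ruled out by the mismatch at P[k]. Had there
been an occurrence starting at j+s with 0 < s < k-1-r, then
P[1..k-1-s] would match T[j+s..j+k-2] = P[s+1..k-1], so P[1..k-1]
would have a prefix of length k-1-s > r matching its suffix.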
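To make r concrete, the following sketch computes it by brute force for the matched part of a pattern and derives the shift, which is (k-1) - r positions, and the next pattern index to compare, r+1. The function name longest_border is an assumption; the pattern abcabcabcaf with a mismatch at k = 11 is the one used in the later slides.

```python
def longest_border(s):
    """Length of the longest proper prefix of s that is also a suffix of s."""
    for r in range(len(s) - 1, 0, -1):   # try the longest candidates first
        if s[:r] == s[-r:]:
            return r
    return 0


pattern = "abcabcabcaf"
k = 11                          # mismatch at P[k], 1-indexed
matched = pattern[:k - 1]       # P[1..k-1] = "abcabcabca"
r = longest_border(matched)
print(r, (k - 1) - r, r + 1)    # 7 3 8
```

This brute-force version is quadratic or worse and is meant only to illustrate the definition.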
Why LONGEST?
T: abcabcabcabcaf
P: abcabcabcaf
Mismatch found at P[11] = f against T[11] = b.
The longest prefix of the matched part abcabcabca that is also its suffix is abcabca.
Correct alignment for the pattern will be by
shifting it 3 characters right.
T: abcabcabcabcaf
P:    abcabcabcaf
Pattern found.
T: abcabcabcabcaf
P:       abcabcabcaf
Mismatch. Pattern not found.
By finding a smaller prefix (for example abca, which shifts the
pattern 6 characters right) and aligning the
pattern accordingly as shown, the pattern’s
occurrence in the text got missed (that is, we
shifted by more positions than we should have).
So it is known that we need to find the longest
prefix in the pattern that matches its suffix.
But HOW?
P : P[1] … P[k-1] P[k] …
Let the length of the longest proper prefix of P[1..k-1] that
matches its suffix be r.
T :  … T[j] … T[j+r-1] … T[j+k-1-r] … T[j+k-2]  T[j+k-1] …
P :    P[1] … P[r]     …            … P[k-1]    P[k] …
P (shifted) :            P[1]       … P[r]      P[r+1] …
If T[j+k-1] ≠ P[k]:
Let Fail[k] be a pointer that says which character of P
should come in place of P[k], by shifting P accordingly, when a
mismatch occurs at P[k]. With r the length of the longest proper
prefix of P[1..k-1] that matches its suffix, Fail[k] = r + 1.
How to compute Fail[k]?
P :  P[1] … P[r-1] P[r] P[r+1] … P[k-1] P[k] …
     P[1] … P[r’-1] P[r’] P[r’+1]
     P[1] … P[s-1] P[s] P[s+1]
Look at fail[k-1]. Let it be r’.
1: if r’ = 0 then fail[k] = 1
   else if P[r’] = P[k-1] (which has already been matched with T[j+k-2])
        then fail[k] = r’+1
   else { look at fail[r’] = s, say
          goto 1 with r’ = s
        }
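The procedure for fail[k] can be sketched in Python as follows. This keeps the slides' 1-indexed convention, with fail[1] = 0 assumed and index 0 of the returned list unused. The name compute_fail is an assumption made for illustration.

```python
def compute_fail(pattern):
    """Return fail[0..m]; fail[k] for k = 1..m follows the slides (fail[0] is unused)."""
    m = len(pattern)
    p = " " + pattern            # pad so that p[k] is the k-th pattern character
    fail = [0] * (m + 1)         # fail[1] = 0 by convention
    for k in range(2, m + 1):
        r = fail[k - 1]
        # Fall back through shorter prefixes until one extends with P[k-1].
        while r > 0 and p[r] != p[k - 1]:
            r = fail[r]
        fail[k] = r + 1
    return fail


print(compute_fail("abcabcabcaf")[1:])  # [0, 1, 1, 1, 2, 3, 4, 5, 6, 7, 8]
```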
EXAMPLE
P: abcabcabcaf
for k=1, fail[1]=0 (assumed)
for k=2,
s=fail[1]=0
therefore, fail[2]=0+1=1
for k=3,
s=fail[2]=1
check whether P[1]=P[2]
since P[1]≠P[2]
so, s=fail[1]=0
therefore, fail[3]=0+1=1
for k=4,
s=fail[3]=1
check whether P[1]=P[3]
since P[1]≠P[3]
so, s=fail[1]=0
therefore, fail[4]=0+1=1
for k=5,
s=fail[4]=1
check whether P[1]=P[4]
yes
therefore, fail[5]=1+1=2
Similarly for the others.
k        1  2  3  4  5  6  7  8  9  10  11
fail[k]  0  1  1  1  2  3  4  5  6  7   8
Example :
T: abcabcabcabcaf
P: abcabcabcaf
k: 1 2 3 4 5 6 7 8 9 10 11
Mismatch found at position k=11 (P[11] = f against T[11] = b).
Look at fail[11] = 8, which implies the
pattern must be shifted such that P[8]
comes in place of P[11]:
T: abcabcabcabcaf
P:    abcabcabcaf
Example :
T: abcabcabcabcaf
P:    abcabcabcaf
Pattern found at T[4].
Another Example :
T: abcbabcbabcabcabcaf
P: abcabcabcaf
k: 1 2 3 4 5 6 7 8 9 10 11
Mismatch found at position k=4 (P[4] = a against T[4] = b).
Look at fail[4] = 1, which implies the pattern
must be shifted such that P[1] comes in place
of P[4]:
T: abcbabcbabcabcabcaf
P:    abcabcabcaf
Another Example :
T: abcbabcbabcabcabcaf
P:    abcabcabcaf
Mismatch found at position k=1 (P[1] = a against T[4] = b).
Look at fail[1] = 0, which implies: read the next
character in the text.
Another Example :
T: abcbabcbabcabcabcaf
P:     abcabcabcaf
Mismatch found at position k=4 (P[4] = a against T[8] = b).
Look at fail[4] = 1, which implies the pattern
must be shifted such that P[1] comes in place
of P[4]:
T: abcbabcbabcabcabcaf
P:        abcabcabcaf
Another Example :
T: abcbabcbabcabcabcaf
P:        abcabcabcaf
Mismatch found at position k=1 (P[1] = a against T[8] = b).
Look at fail[1] = 0, which implies: read the next
character in the text.
Another Example :
T: abcbabcbabcabcabcaf
P:         abcabcabcaf
Pattern found at T[9].
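Both worked examples can be replayed with a sketch of the KMP search that uses the fail table and records each mismatch as a pair (text position, k). The names kmp_search and compute_fail are assumptions made for illustration, indices are 1-indexed as in the slides, and the pattern is assumed to be non-empty.

```python
def compute_fail(pattern):
    """fail[k] for k = 1..m, 1-indexed, with fail[1] = 0 (index 0 unused)."""
    m = len(pattern)
    p = " " + pattern
    fail = [0] * (m + 1)
    for k in range(2, m + 1):
        r = fail[k - 1]
        while r > 0 and p[r] != p[k - 1]:
            r = fail[r]
        fail[k] = r + 1
    return fail


def kmp_search(text, pattern):
    """Return (1-indexed first occurrence or None, list of (i, k) mismatches)."""
    n, m = len(text), len(pattern)
    fail = compute_fail(pattern)
    t, p = " " + text, " " + pattern     # pad for 1-indexed access
    i, k = 1, 1                          # i: text position, k: pattern position
    mismatches = []
    while i <= n:
        if t[i] == p[k]:
            if k == m:
                return i - m + 1, mismatches
            i += 1
            k += 1
        else:
            mismatches.append((i, k))
            k = fail[k]                  # P[fail[k]] comes in place of P[k]
            if k == 0:                   # fail[1] = 0: read the next text character
                i += 1
                k = 1
    return None, mismatches


print(kmp_search("abcabcabcabcaf", "abcabcabcaf"))
# (4, [(11, 11)])
print(kmp_search("abcbabcbabcabcabcaf", "abcabcabcaf"))
# (9, [(4, 4), (4, 1), (8, 4), (8, 1)])
```

The recorded mismatches for the second text are at k = 4, 1, 4, 1, the same sequence as in the slides.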
Analysis of KMP
# of mismatches: For every mismatch the pattern is shifted
by at least 1 position. How far it is shifted is
determined by the longest prefix of the matched part that is also its suffix.
T: ......a b c a b c a b c a b c d a f d........
P:       d e b            (mismatch)
P:         d e b          (mismatch)
   ..
Total no. of shifts <= n-m
Total no. of mismatches <= n-m+1
Analysis of KMP contd.
# of matches: For every match, the pointer in the
text moves up by 1 position.
T: ......a b c a b c a b c a b c d a f d........
P:       a b c b d e
P:             a b c b d e
   ..
=> # of matches <= length of text = n
Together with at most n-m+1 mismatches, searching takes O(n)
comparisons; computing fail takes O(m) time.
The complexity of KMP is therefore linear in nature:
O(m+n)
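The linear bound can be observed by counting comparisons on the input that was worst for the naïve method. The sketch below assumes a non-empty pattern and uses hypothetical names (kmp_count, compute_fail). It checks the looser bound of 2n comparisons, which covers at most n matches plus at most n mismatches.

```python
def compute_fail(pattern):
    """fail[k] for k = 1..m, 1-indexed, with fail[1] = 0 (index 0 unused)."""
    p = " " + pattern
    fail = [0] * (len(pattern) + 1)
    for k in range(2, len(pattern) + 1):
        r = fail[k - 1]
        while r > 0 and p[r] != p[k - 1]:
            r = fail[r]
        fail[k] = r + 1
    return fail


def kmp_count(text, pattern):
    """Return (1-indexed first occurrence or None, character comparisons made)."""
    n, m = len(text), len(pattern)
    fail = compute_fail(pattern)
    t, p = " " + text, " " + pattern
    i = k = 1
    comparisons = 0
    while i <= n:
        comparisons += 1
        if t[i] == p[k]:                 # match: the text pointer moves up by 1
            if k == m:
                return i - m + 1, comparisons
            i += 1
            k += 1
        else:                            # mismatch: the pattern shifts right
            k = fail[k]
            if k == 0:
                i += 1
                k = 1
    return None, comparisons


n = 20
text = "a" * (n - 1) + "f"               # a a ... a f
position, comparisons = kmp_count(text, "aaaf")
assert comparisons <= 2 * n
print(position, comparisons)             # 17 36
```

For the same input, the naïve count (n-m)m + m is 68.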
ACKNOWLEDGEMENTS
MSc (CS) 2009
Abhishek Behl(02)
Aarti Sethiya(01)
Akansha Aggarwal(03)
Alok Prakash (04)
Vibha Negi(31)