Uploaded by Sudha Rani M

PatternM

advertisement
Pattern Matching
T:
a
b
a
c
a
a
b
1
P:
a
b
a
240-301 Comp. Eng. Lab III (Software), Pattern Matching
a
b
c
a
a
b
4
3
2
c
a
b
1
Overview
1.
2.
3.
4.
5.
What is Pattern Matching?
The Brute Force Algorithm
The Boyer-Moore Algorithm
The Knuth-Morris-Pratt Algorithm
More Information
240-301 Comp. Eng. Lab III (Software), Pattern Matching
2
1. What is Pattern Matching?
 Definition:
– given a text string T and a pattern string P, find
the pattern inside the text
 T:
“the rain in spain stays mainly on the plain”
 P: “n th”
 Applications:
– text editors, Web search engines (e.g. Google),
image analysis
240-301 Comp. Eng. Lab III (Software), Pattern Matching
3
String Concepts
 Assume
S is a string of size m.
A
substring S[i .. j] of S is the string
fragment between indexes i and j.
A
prefix of S is a substring S[0 .. i]
 A suffix of S is a substring S[i .. m-1]
– i is any index between 0 and m-1
240-301 Comp. Eng. Lab III (Software), Pattern Matching
4
Examples
S
a n d r e w
0
 Substring
 All
5
S[1..3] == "ndr"
possible prefixes of S:
– "andrew", "andre", "andr", "and", "an”, "a"
 All
possible suffixes of S:
– "andrew", "ndrew", "drew", "rew", "ew", "w"
240-301 Comp. Eng. Lab III (Software), Pattern Matching
5
2. The Brute Force Algorithm
 Check
each position in the text T to see if
the pattern P starts in that position
T: a n d r e w
P: r e w
T: a n d r e w
P: r e w
P moves 1 char at a time through T
....
240-301 Comp. Eng. Lab III (Software), Pattern Matching
6
Example-1
240-301 Comp. Eng. Lab III (Software), Pattern Matching
7
Example-2
240-301 Comp. Eng. Lab III (Software), Pattern Matching
8
Example-3
 T=a
bacaabaccabacabaabb
 P=a b a c a b
240-301 Comp. Eng. Lab III (Software), Pattern Matching
9
Brute Force-Complexity…





Given a pattern M characters in length, and a text N characters in length...
Worst case: compares pattern to each substring of text of length M. For example,
M=5.
This kind of case can occur for image data
Total number of comparisons: M (N-M+1)
Worst case time complexity: O(MN)
240-301 Comp. Eng. Lab III (Software), Pattern Matching
10
Best case if pattern not found….




Given a pattern M characters in length, and a text N characters
in length...
For example, M=5.
Total number of comparisons: N
Best case time complexity: O(N)
240-301 Comp. Eng. Lab III (Software), Pattern Matching
11
Analysis
 Brute
force pattern matching runs in time
O(mn) in the worst case.
 But
most searches of ordinary text take
O(m+n), which is very quick.
240-301 Comp. Eng. Lab III (Software), Pattern Matching
continued
12
 The
brute force algorithm is fast when the
alphabet of the text is large
– e.g. A..Z, a..z, 1..9, etc.
 It
is slower when the alphabet is small
– e.g. 0, 1 (as in binary files, image files, etc.)
240-301 Comp. Eng. Lab III (Software), Pattern Matching
continued
13
 Example
of a worst case:
– T: "aaaaaaaaaaaaaaaaaaaaaaaaaah"
– P: "aaah"
 Example
of a more average case:
– T: "a string searching example is standard"
– P: "store"
240-301 Comp. Eng. Lab III (Software), Pattern Matching
14
3. The Boyer-Moore Algorithm
 The
Boyer-Moore pattern matching
algorithm is based on two techniques.

The looking-glass Heuristic:
– When testing a possible placement of P against
T, begin the comparisons from the end of P and
move backward to the front of P.
240-301 Comp. Eng. Lab III (Software), Pattern Matching
15
Character-Jump Heuristic:
During the testing of a possible placement of P
against T, a mismatch of text character T[i] with
the corresponding pattern character P[j] is handled
as follows.
 If c is not contained anywhere in P, then shift P
completely past T[i] (for it cannot match any
character in P).
 Otherwise, shift P until an occurrence of character
c in P gets aligned with T[i].

240-301 Comp. Eng. Lab III (Software), Pattern Matching
16
 2.
The character-jump Heuristic:
– when a mismatch occurs at T[i] == x
– the character in pattern P[j] is not the
same as T[i]

There are 3 possible
cases, tried in order.
T
P
x a
i
ba
j
240-301 Comp. Eng. Lab III (Software), Pattern Matching
17
Case 1
 If
P contains x somewhere, then try to
shift P right to align the last occurrence
of x in P with T[i].
T
P
x a
i
x c ba
j
T
and
move i and
j right, so
j at end
240-301 Comp. Eng. Lab III (Software), Pattern Matching
x a ? ?
inew
P
x c ba
jnew
18
Case 2
 If
P contains x somewhere, but a shift right
to the last occurrence is not possible, then
shift P right by 1 character to T[i+1].
T
P
x a x
i
cw ax
j
T
and
move i and
j right, so
j at end
x is after
240-301 Comp. Eng. LabjIIIposition
(Software), Pattern Matching
xa x ?
inew
P
cw ax
jnew
19
Case 3
 If
cases 1 and 2 do not apply, then shift P to
align P[0] with T[i+1].
T
x a
i
P d c ba
T
and
move i and
j right, so
j at end
j
No x in P
240-301 Comp. Eng. Lab III (Software), Pattern Matching
x a ? ??
inew
P d c ba
0
jnew
20
Boyer-Moore Example (1)
T:
a
p a t
r i
1
t h m
P:
r i
t e r n
2
t h m
r i
m a t c h i n g
3
t h m
r i
r i
240-301 Comp. Eng. Lab III (Software), Pattern Matching
4
t h m
a l g o r i
5
t h m
t h m
11 10 9 8 7
r i t h m
r i
6
t h m
21
Boyer-Moore Example (2)
 T=a
bacaabadcabacabaabb
 P=a b a c a b
240-301 Comp. Eng. Lab III (Software), Pattern Matching
22
Last Occurrence Function
algorithm preprocesses the
pattern P and the alphabet A to build a last
occurrence function L()
 Boyer-Moore’s
– L() maps all the letters in A to integers
 L(x)
is defined as:
// x is a letter in A
– the largest index i such that P[i] == x, or
– -1 if no such index exists
240-301 Comp. Eng. Lab III (Software), Pattern Matching
23
L() Example
P a b a c a b
0 1 2 3 4 5
A
= {a, b, c, d}
 P: "abacab"
x
L(x)
a
4
b
5
c
3
d
-1
L() stores indexes into P[]
240-301 Comp. Eng. Lab III (Software), Pattern Matching
24
Note
 In
Boyer-Moore code, L() is calculated
when the pattern P is read in.
 Usually
L() is stored as an array
– something like the table
240-301 Comp. Eng. Lab III (Software), Pattern Matching
25
Boyer-Moore Example (2)
T
a
b
a
c
a
a
b
a
d
c
a
b
a
c
a
b
a
a
b
b
1
Pa
b
a
a
b
c
a
a
b
4
3
2
13 12 11 10 9
8
c
a
b
a
b
b
a
c
5
a
b
a
c
a
a
7
b
a
b
a
c
a
b
6
a
b
a
c
a
b
240-301 Comp. Eng. Lab III (Software), Pattern Matching
x
L(x)
a
4
b
5
c
3
d
-1
26
Algorithm
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
Algorithm BoyreMooreMatch(T,P):
Input: Strings T (text) with n characters and P (pattern) with m characters
Output: Starting index of the first substring of T matching P, or an indication
that P is not a substring of T
compute function last L(X)
i←m−1
j←m−1
repeat
if T[i] = = P[j] then
if j = 0 then return i // a match!
else i ← i − 1
j←j−1
else i ← i + m − min(j, 1 + last(T[i])) // jump step
j←m−1
until i>n − 1
return “There is no substring of T matching P.”
240-301 Comp. Eng. Lab III (Software), Pattern Matching
27
Analysis
 Boyer-Moore
worst case running time is
O(nm + A)
 But,
Boyer-Moore is fast when the alphabet
(A) is large, slow when the alphabet is small.
– e.g. good for English text, poor for binary
 Boyer-Moore
is significantly faster than
brute force for searching English text.
240-301 Comp. Eng. Lab III (Software), Pattern Matching
28
Worst Case Example
 T:
"aaaaa…a"
 P: "baaaaa"
T: a a a a a a a a a
6
5
4
3
2
1
P: b a a a a a
12 11 10
b
a
a
9
8
7
a
a
a
18 17 16 15 14 13
b
a
a
a
a
a
24 23 22 21 20 19
b
240-301 Comp. Eng. Lab III (Software), Pattern Matching
a
a
a
a
a
29
4. The KMP Algorithm
 The
Knuth-Morris-Pratt (KMP) algorithm
looks for the pattern in the text in a left-toright order (like the brute force algorithm).
 But
it shifts the pattern more intelligently
than the brute force algorithm.
240-301 Comp. Eng. Lab III (Software), Pattern Matching
continued
30
 If
a mismatch occurs between the text and
pattern P at P[j], what is the most we can
shift the pattern to avoid wasteful
comparisons?
 Answer:
the largest prefix of P[0 .. j-1] that
is a suffix of P[1 .. j-1]
240-301 Comp. Eng. Lab III (Software), Pattern Matching
31
Example
i
T:
P:
j=5
jnew = 2
240-301 Comp. Eng. Lab III (Software), Pattern Matching
32
Why
j == 5
 Find
largest prefix (start) of:
"a b a a b"
( P[0..j-1] )
which is suffix (end) of:
"b a a b"
( p[1 .. j-1] )
 Answer:
"a b"
 Set j = 2 // the new j value
240-301 Comp. Eng. Lab III (Software), Pattern Matching
33
KMP Failure Function
 KMP
preprocesses the pattern to find
matches of prefixes of the pattern with the
pattern itself.
 j = mismatch position in P[]
 k = position before the mismatch (k = j-1).
 The failure function F(k) is defined as the
size of the largest prefix of P[0..k] that is
also a suffix of P[1..k].
240-301 Comp. Eng. Lab III (Software), Pattern Matching
34
Failure Function Example
(k == j-1)
 P:
"abaaba"
j: 012345
jk
0
1
2
3
4
F(k)
F(j)
0
0
1
1
2
3
F(k) is the size of
the largest prefix.
 In
code, F() is represented by an array, like
the table.
240-301 Comp. Eng. Lab III (Software), Pattern Matching
35
Why is F(4) == 2?
 F(4)
P: "abaaba"
means
– find the size of the largest prefix of P[0..4] that
is also a suffix of P[1..4]
= find the size largest prefix of "abaab" that
is also a suffix of "baab"
= find the size of "ab"
=2
240-301 Comp. Eng. Lab III (Software), Pattern Matching
36
Using the Failure Function
algorithm modifies
the brute-force algorithm.
 Knuth-Morris-Pratt’s
– if a mismatch occurs at P[j]
(i.e. P[j] != T[i]), then
k = j-1;
j = F(k); // obtain the new j
240-301 Comp. Eng. Lab III (Software), Pattern Matching
37
Example
T:
a b a c a a b a c c a b a c a b a a b b
1 2 3 4 5 6
P:
a b a c a b
7
a b a c a b
8 9 10 11 12
a b a c a b
13
a b a c a b
k
0
1
2
3
4
14 15 16 17 18 19
F(k)
0
0
1
0
1
a b a c a b
240-301 Comp. Eng. Lab III (Software), Pattern Matching
38
More Examples....
1.T= "THIS IS A TEST TEXT“
P=“TEXT”
2.T=“AABDACAADAABAAACA”
P=“AABA”
P="cgtacgttcgtacg“ Find Failure Function.
240-301 Comp. Eng. Lab III (Software), Pattern Matching
39
Algorithm
Algorithm KMPMatch(T,P):
Input: Strings T (text) with n characters and P (pattern) with m characters
Output: Starting index of the first substring of T matching P, or an indication that P is not a
substring of T
f ← KMPFailureFunction(P) // construct the failure function f for P
i←0
j←0
while i < n do
if P[j] = T[i] then
if j = m − 1 then
return i − m + 1 // a match!
i←i+1
j←j+1
else if j > 0 // no match, but we have advanced in P then
j ← f(j − 1) // j indexes just after prefix of P that must match
else
i←i+1
return “There is no substring of T matching P.”
240-301 Comp. Eng. Lab III (Software), Pattern Matching
40
KMP Advantages
 KMP
runs in optimal time: O(m+n)
– very fast
 The
algorithm never needs to move
backwards in the input text, T
– this makes the algorithm good for processing
very large files that are read in from external
devices or through a network stream
240-301 Comp. Eng. Lab III (Software), Pattern Matching
41
KMP Disadvantages
 KMP
doesn’t work so well as the size of the
alphabet increases
– more chance of a mismatch (more possible
mismatches)
– mismatches tend to occur early in the pattern,
but KMP is faster when the mismatches occur
later
240-301 Comp. Eng. Lab III (Software), Pattern Matching
42
Pattern Matching Applications




Applications in Computational Biology
– DNA sequence is a long word (or text) over a 4-letter alphabet
– GTTTGAGTGGTCAGTCTTTTCGTTTCGACGGAGCCCCCA
ATTAATAAACTCATAAGCAGACCTCAGTTCGCTTAGAGC
AGCCGAAA…..
– Find a Specific pattern W
Finding patterns in documents formed using a large alphabet
– Word processing
– Web searching
– Desktop search (Google, MSN)
Matching strings of bytes containing
– Graphical data
– Machine code
grep in unix
– grep searches for lines matching a pattern.
240-301 Comp. Eng. Lab III (Software), Pattern Matching
43
KMP Extensions
 The
basic algorithm doesn't take into
account the letter in the text that caused the
mismatch.
T:
a b a a b x
Basic KMP
does not do this.
P: a b a a b a
a b a a b a
240-301 Comp. Eng. Lab III (Software), Pattern Matching
44
5. More Information

Algorithms in C++
Robert Sedgewick
Addison-Wesley, 1992
– chapter 19, String Searching
 Online
Animated Algorithms:
– http://www.ics.uci.edu/~goodrich/dsa/
11strings/demos/pattern/
– http://www-sr.informatik.uni-tuebingen.de/
~buehler/BM/BM1.html
– http://www-igm.univ-mlv.fr/~lecroq/string/
240-301 Comp. Eng. Lab III (Software), Pattern Matching
45
Download