ch11

advertisement
Pattern matching
Application
text
graphics
compilers
sub-lists ...
P - pattern of length m
T - text of length n
j - pattern position
k - text position
Vanilla flavored algorithm
best case O(m)
worst case O(m*n) -- T = AAAAAAAA... P = AAB
Causes backups
P:
ABABC
T:
ABABABCCA
ABABC
ABABABCCA
ABABC
ABABABCCA
Count character comparisons as a measure of work.
If the pattern and the text can match every time, but in the
last character, the worst case would be realized. O(mn). If the
pattern is in the first m characters of the text, best case
would be realized, O(m). If the first character of the pattern
in is not in the text at all, O(n).
Consider a FSM - 2 types of nodes
Read
Stop
Let the distinct characters of T and P be , and  =


Each state has  arrows out of it, each arrow labeled with one
symbol from . After the construction of FSM, the algorithm runs
in O(n), no text is ever considered twice. Difficulty is in
constructing FSM.
B
B,C
A
A
1
A
2
B
3
B,C
4
C
*
A
C
The FSM states remember what has been seen so far. Now patterns
can be recognized in O(n). Again, the problem is in building the
FSM. It is easy to build the successful match links as follows.
A
A
B
C
Enter Knuth-Morris-Pratt (KMP)
Also uses a FSM
contains success links and failure links only
characters are in the nodes, not on the links
there is a special read node
The KMP algorithm uses success and failure links, characters are
in the nodes. The next character is read upon following a
success link.
f
f
f
s
s
s
s
s
s
A
*
B
A
B
B
C
3
4
5
2
6
1
f
f
f
Get next character
0
Cell
Number
1
2
1
0
1
2
3
4
2
1
2
3
4
5
3
4
Text scanned
Index
Character
1
A
2
C
2
C
2
C
3
A
4
B
5
A
6
A
6
A
6
A
7
B
8
A
9
B
10
A
10
A
11
none
Success or
Failure
s
f
f
get next character
s
s
s
f
f
s
s
s
s
f
s
failure
Flow chart action on pattern ‘ABABCB’ for text ‘ACABAABABA’
Setting fail links Example:
P:
ABABABCB
T:
. . . . . ABABABCB . . . . .
P:
ABABABCB
T:
. .
.
ABABAB x . . . . .
The pattern is moved forward so that the longest initial segment
that matches part of the text preceding x is lined up with that
part of the text. Now x should be tested to see if it is an A to
match the third A of the pattern. Thus the failure link for the
node containing the C should point to the node containing the
third A.
KMPsetup(string P, int fail)
{
int k, r;
fail[1] = 0;
for (k = 2; k <= P.length; k++)
{
r = fail[k-1];
while (r > 0 && p[r] != p[k-1]) {
r = fail[r];
}
fail[k] = r + 1;
}
}
Analysis:
Count comparisons - condition of the while loop - running time
is a multiple of the comparisons.
when pr = pk-1 comparison is successful, unsuccessful otherwise.
At most m-1 successful comparisons are done (for k = 2 to m)
After every successful comparison r is decreased. The maximum
number of successful comparisons can be bound by the number of
times r can be decreased.
r is initially 0
r is increased by 1 by r := fail[r]
r is incremented m-2 times total
r is never negative
so r cannot be decreased more than m-2 times. So the most
unsuccessful comparisons are at most m-2. There are at most m-1
successful comparisons - m-2 unsuccessful for a total of 2m-3.
So the complexity is O(m) -- linear on the length of the
pattern.
KMP Scan algorithm. Does at most 2n comparisons, so two together
is O(n+m), a major improvement over O(mn).
The Boyer-Moore Algorithm (BM)
Allows some characters of the text to be skipped entirely.
The BM algorithm, unlike KMP, scans the pattern right to left
instead of left to right. Uses two techniques for determining
how far the pattern can be shifted to the right.
must
must
must
must
If you wish to understand others you must
The first 4 comparisons is shown above. When t is compared to “y”, not only is there no match,
but there is no “y” in “must” so the pattern can jump past the “y”.
must
must
must
must
must
must
must
must
If you wish to understand others you must
The above shows the remaining comparisons. In this example, 18 comparisons are done with a
match found at position 38 in T. The other algorithms would do at least 41 comparisons, nearly
twice as many. However, this algorithm must be able to back up in the text by the length of the
pattern.
The algorithm:
ComputeJumps(char P[]; char charJump[]) { // indexed by each member of ∑
// P the pattern
// charJUmp size of the alphabet, determines jump ahead per
// character of the alphabet
char ch;
int k;
for (each ch in the alphabet)
charJump[ch] = strlen(P);
for (k = 1; k < strlen(P); k++)
charJump[P[k]] = strlen(P) - k;
}
This runs in O((size of the alphabet) + m) where m is the number of characters in the pattern.
SECOND HEURISTIC (similar to the KMP algorithm)
P:
T: ...
batsandcats
dats
|
j
...
Using charJump['d'] we would slide the pattern to the right one to like up the d's. But, if we know
the next occurrence of ats in the P, the pattern, we can line it up as follows:
batsandcats
... dats ...
|
j
new j
recall j is the index into the text currently being scanned for a match by the pattern.
P:
T:
The algorithm: (computing jumps based on partial matches)
ComputMatchJumps(char P[]; int matchJump[]) {
int k, q, qq, back;
int m; // length of the pattern
m = strlen(P);
for (k = 1; k < m; k++)
matchJump[k] = 2*m - k; // largest possible jump
k = m;
q = m + 1;
while (k > 0) {
back[k] = q;
while(q <= m && p[k] != p[q]) {
matchJump[q] = min(matchJump[q], m-k);
q = back[q];
}
k--;
q--;
}
for (k = 1; k < q; k++)
matchJump[k] = min(matchJump[k], m+q-k);
qq = back[q];
while(q <= m) {
while(q <= qq) {
matchJump[q] = min(matchJump[q], qq-q+m);
q++;
}
qq = back[qq];
}
}
The Boyer Moore Scan Alorithm
int BMmatch(char P[], char T[], int charJump[], int matchJump[]) {
int j, k; // j indexes text, k indexes pattern
k = j = strlen(P);
while(j <= strlen(T) && k > 0) {
if (T[j] = P[k]){
j--;
k--;
}
else { // slide P forward
j = j + max(charJump[T[j]], matchJump[k]);
k = strlen(P);
}
}
if(k = 0) return (j+1); // match found
else return(strlen(T) + 1); // no match found
}
Analysis:
For patterns of length greater than 5 on natural language alphabets, BM algorithm did roughly .3
character comparisons per character. For binary text, roughly .7 character comparisons were
done for each character.
Interestingly, if the pattern is small (m <= 3), then the overhead of preprocessing the pattern is
not worthwhile; BM does more comparisons than the brute force approach.
Kurak Eggen approach:
Process the text.
Requires on average .05 comparisons per character.
Download