String Matching - Computer Information Systems

ICS220 – Data Structures and
Algorithms Analysis
Lecture 14
Dr. Ken Cosh
Review
• Memory Management
– Memory Allocation
– Garbage Collection
This Week
• String Matching
– String matching is a common task for many
computer users:
• Internet Searches
• String manipulation in word processing
• Advanced DNA sequence matching
– Therefore effective pattern matching
algorithms are essential.
Brute Force
• Our first simple string matching algorithm
is brute force.
– We check the first character; if it is a match,
we check the second character, and so on. If it
is not a match, we step forward one character
in the text and start again from the beginning
of the pattern.
• Any useful information that could be used
in subsequent searches is then lost.
Brute Force
bruteForceStringMatching(pattern P, text T)
    i = 0;
    while i ≤ |T| - |P|
        j = 0;
        while j < |P| && Ti == Pj
            i++;
            j++;
        if j == |P|
            return match at i - |P|;
        i = i - j + 1;
    return no match;
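The pseudocode above could be written in Python roughly as follows. This is a minimal sketch rather than the lecture's own code: the name `brute_force_match` is an assumption, and it indexes the text as `text[i + j]` instead of advancing `i` and stepping it back, which gives the same result.

```python
def brute_force_match(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    i = 0
    while i <= n - m:
        j = 0
        # Check the bound on j first so pattern[j] is never read out of range.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            return i  # every pattern character matched at this alignment
        i += 1        # mismatch: step forward one character and start again
    return -1
```

For example, `brute_force_match("ababcdababababababad", "babab")` returns 7, the 0-based position of the first match.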
Brute Force
• T = ababcdababababababad, P = babab
     ababcdababababababad
  1  babab
  2   babab
  3    babab
  4     babab
  5      babab
  6       babab
  7        babab
  8         babab
In this case the match is found on the 8th try.
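The alignments tried in this example can be reproduced with a short sketch. The helper name `alignment_trace` is hypothetical; it simply shifts the pattern one place to the right per try and stops at the first match.

```python
def alignment_trace(text, pattern):
    """Return the text followed by one line per brute-force try."""
    lines = [text]
    for start in range(len(text) - len(pattern) + 1):
        # Indent the pattern so it sits under the text position being tried.
        lines.append(" " * start + pattern)
        if text.startswith(pattern, start):
            break  # match found on this try
    return lines


trace = alignment_trace("ababcdababababababad", "babab")
print("\n".join(trace))
print("tries:", len(trace) - 1)
```

The last line printed reports 8 tries, in agreement with the slide.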
Brute Force Complexity
• The best case for the algorithm is that the string
is matched straight away (consider searching
this sentence for “The”). Here |P| comparisons
are required – O(|P|).
• The worst case is if the string isn’t found, but for
each character in T, we are required to make
|P| comparisons – here worst case is O(|T||P|).
• The average case depends on the size and
frequencies of the character set.
Brute Force Complexity
• Notice the nested while loops in the Brute Force
algorithm:
while i ≤ |T| - |P|
    while j < |P| && Ti == Pj
• Shortly we’ll investigate how we can reduce the
number of iterations of each loop.
• For the worst case to occur we could search for a
string such as aaaaaaaaaaaab within a string
aaaaaaaaaaaaaaaaaaaaaaaaaaa etc.
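To see the best and worst cases in numbers, here is a small sketch that counts character comparisons. The function name `count_comparisons` and the string lengths used (27 a's for the text, 12 a's plus a b for the pattern) are assumptions chosen for illustration.

```python
def count_comparisons(text, pattern):
    """Brute-force search returning (match index or -1, comparisons made)."""
    comparisons = 0
    for start in range(len(text) - len(pattern) + 1):
        for j, p_char in enumerate(pattern):
            comparisons += 1
            if text[start + j] != p_char:
                break  # mismatch: give up on this alignment
        else:
            return start, comparisons  # inner loop finished without a mismatch
    return -1, comparisons


best = count_comparisons("The best case", "The")
worst = count_comparisons("a" * 27, "a" * 12 + "b")
print(best, worst)
```

Here `best` is `(0, 3)`, i.e. |P| comparisons, while `worst` is `(-1, 195)`: each of the 27 - 13 + 1 = 15 alignments costs all 13 comparisons before failing on the final b.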
Improving Brute Force
• A key problem with brute force is that each time
we abort the comparison we have to start from
the beginning of the pattern again.
• We could reduce the algorithm complexity by
skipping comparisons that cannot lead to a match.
• Hancart’s algorithm allows the search to step
forward 2 characters if a match won’t be found.
Hancart
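• Hancart’s algorithm refines brute force in a
couple of ways.
– First the first two characters of the pattern
are compared with each other.
• Either they are the same, or they are different.
– Second, each attempt begins by comparing P1 with
the 2nd character of the current text window, Ti+1.
• Case P0 == P1: mismatch after the first comparison
– If P1 != Ti+1, then P0 != Ti+1
– No match can start at i+1, so step 2 (i=i+2).
Otherwise step 1.
• Case P0 != P1: P1 matches Ti+1
– If P1 == Ti+1, then P0 != Ti+1
– No match can start at i+1, so once this attempt
ends step 2 (i=i+2). On a mismatch with Ti+1, step 1.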
Hancart
• Hancart’s revision works by allowing us to skip forward 2
characters in situations where there can’t be a match.
• Notice that the situations where 2 steps forward are
allowed depend on whether the first 2 characters of the
pattern are the same.
• We can refine the search further by extending this
observation – that the number of steps forward allowed
depends on the contents of the pattern.
• The Knuth Morris Pratt algorithm observes that the
pattern contains enough information to determine where
the next match could begin.
Hancart
• Hancart’s algorithm reduces the number of
iterations through the outer loop by
sometimes allowing the increment to be:
i = i - j + 2;
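As a rough illustration of the two-step skip, the sketch below follows the published form of Hancart's refinement (often called the "Not So Naive" algorithm). The function and variable names are assumptions, and patterns shorter than two characters are handed to Python's built-in `str.find` because the trick needs P0 and P1.

```python
def hancart_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m < 2:
        return text.find(pattern)  # degenerate case: no P1 to compare

    # Preprocessing: decide the step sizes from the first two pattern characters.
    if pattern[0] == pattern[1]:
        step_on_mismatch, step_on_match = 2, 1
    else:
        step_on_mismatch, step_on_match = 1, 2

    i = 0
    while i <= n - m:
        if text[i + 1] != pattern[1]:
            # First comparison (P1 against T[i+1]) failed.
            i += step_on_mismatch
        else:
            # P1 matched: check the rest of the pattern, then P0 last.
            if text[i + 2:i + m] == pattern[2:] and text[i] == pattern[0]:
                return i
            i += step_on_match
    return -1
```

On the earlier example, `hancart_search("ababcdababababababad", "babab")` returns 7, but visits only the alignments 0, 1, 3, 4, 5 and 7 instead of all eight.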
Knuth Morris Pratt
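• The Knuth Morris Pratt algorithm begins by
finding, for each position in the substring, the
longest suffix ending there which is equal to a
prefix of the same substring.
– Substring: A,B,C,D,A,B,D
– Longest Suffix: 0,0,0,0,1,2,0
• i.e. when the 2nd A comes it is both a suffix and a prefix for
the substring. The following B forms ‘AB’ a 2 character prefix
and suffix.
– Now after a mismatch the start of the next attempt can be
moved forward by j-x, where j is the number of characters
matched so far and x is the longest suffix value of the last
matched character.
• i.e. if a mismatch is found when comparing the second A, j=4
(ABCD has matched) and x=0, so the start can be moved
forward by 4 (j-x). If the mismatch is at the final D, j=6 and
x=2, so the start again moves forward by 4.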
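The longest-suffix table can be built with the standard prefix-function construction, sketched below. The name `build_suffix_table` is an assumption; the table it returns for the slide's substring matches the values listed above.

```python
def build_suffix_table(pattern):
    """table[q] = length of the longest proper suffix of pattern[:q + 1]
    that is also a prefix of the pattern."""
    table = [0] * len(pattern)
    k = 0  # length of the prefix currently being extended
    for q in range(1, len(pattern)):
        # Fall back to shorter candidate prefixes until one can be extended.
        while k > 0 and pattern[q] != pattern[k]:
            k = table[k - 1]
        if pattern[q] == pattern[k]:
            k += 1
        table[q] = k
    return table


print(build_suffix_table("ABCDABD"))  # [0, 0, 0, 0, 1, 2, 0]
```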
Test
Try searching for this substring,
A,B,C,D,A,B,D
within this string
ABCDABCABCDABDE
Knuth Morris Pratt complexity
• Knuth Morris Pratt removes some of the
complexity of the brute force algorithm by
preprocessing the substring being searched for
(to create the suffix table).
• Because characters in the text never need to be
rechecked, the search itself is O(|T|).
• Preprocessing can be performed quickly, in
O(|P|) time, leaving a total complexity of
O(|T|+|P|)
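The two phases behind the O(|T|+|P|) bound can be seen in a compact sketch of the whole search. The name `kmp_search` is an assumption, and the table construction is inlined only to keep the example self-contained.

```python
def kmp_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0

    # Preprocessing, O(|P|): longest suffix that is also a prefix.
    table = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[q] != pattern[k]:
            k = table[k - 1]
        if pattern[q] == pattern[k]:
            k += 1
        table[q] = k

    # Search, O(|T|): the text index only ever moves forward.
    j = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while j > 0 and ch != pattern[j]:
            j = table[j - 1]  # reuse the part of the match that still fits
        if ch == pattern[j]:
            j += 1
        if j == m:
            return i - m + 1
    return -1
```

Calling `kmp_search("ABCDABCABCDABDE", "ABCDABD")` is one way to check a hand-worked answer to the test above.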