Exact and Approximate String Matching Algorithms

advertisement
Exact and Approximate
String Matching
Algorithms
R. C. T. Lee
K. H. Chen
C. W. Lu
C. S. Ou
Y. K. Shieh
1-1
Table of Contents
Chapter 1 Definitions of String Matching Algorithms
Section 1.1 The Exact String Matching Problem
Section 1.2 The Approximate String Matching problem
Section 1.3 References
Part I Exact String Matching Algorithms
Chapter 2 The Exact String Matching Algorithm Based upon the Convolution
Method
Section 2.1 The Window Sliding Method
Section 2.2 The Discrete Convolution
Section 2.3 The Application of Discrete Convolution to the Exact String
Matching Problem
Section 2.4 The Bit-Parallel Approach of the Application of Convolution to
Exact String Matching
Section 2.5 The Integer Multiplication Approach to Implement Discrete
Convolution
Section 2.6 The Fourier Transform Approach to Implement Discrete
Convolution
Section 2.7 The Prefix Finding and the Discrete Convolution Techniques for
Exact String Matching
Section 2.8 The Probability of a String Matching Approach in Another String
Section 2.9 Discrete Convolution Method with Early Termination
Section 2.10 An Elimination Oriented Exact String Matching Algorithm Derived
from the Prefix Finding Concept
Section 2.11 The Relationship between the BG Algorithm and the Convolution
Method
Section 2.12 References
Chapter 3
The Exact String Matching Algorithm Based upon the Suffix Tree
Method
Section 3.1 Prefix and Suffix
Section 3.2 The Suffix Tree
Section 3.3 The Construction of Suffix Trees
Section 3.4 The Suffix Array
1-2
Section 3.5 References
Chapter 4 Rule E2, Rule E2-1 and the Reverse Factor Algorithm
Section 4.1 Rule E1: The Substring Matching Rule
Section 4.2 Rule E2-1 The Suffix to Prefix Rule
Section 4.3 Method 1: the Chu’s Method to Solve the LSP( X , Y ) finding
Problem
Section 4.4 Method 2:: The Suffix Tree Method to Solve LSP( X , Y ) finding
Problem
Section 4.5 Method 3: The Bit-Parallel Method to Solve LSP( X , Y ) finding
Problem
Section 4.6
Method 4:
The Convolution Method to Solve LSP( X , Y )
finding Problem
Section 4.7 The Reverse Factor Algorithm Based on Rule E2-1
Section 4.8 An Average Analysis of the Reverse Factor Algorithm
Section 4.9 A Further Improvement of the Reverse Factor Algorithm
Section 4.10 The Reverse Factor Algorithm with the Filtering Concept.
Section 4.11 Another Improved Algorithm Which Avoids the Repeating
Examination of Characters
Section 4.12 References
Chapter 5 Rule E2-2 and the Horspool Algorithm
Section 5.1 Rule E2-2: The 1-Suffix Rule
Section 5.2 The Horspool Algorithm
Section 5.3 The Time-Complexity of the Horspool Algorithm
Section 5.4
Chapter 6 The MP and KMP Algorithms Based upon Rule E2-1
Section 6.1 The Morris-Pratt Algorithm (MP Algorithm)
Section 6.2
Section 6.3
Section 6.4
Section 6.5
Section 6.6
The Knuth- Morris-Pratt Algorithm (KMP Algorithm)
The Simon Allgorithm
The Time-Complexity Analysis of the KMP Algorithm
The Improved KMP Algorithm
References
Chapter 7 The Boyer and Moore Algorithm Based upon Rule E3
Section 7.1 The Basic Idea of the Boyer and Moore Algorithm
1-3
Section 7.2 The Good Suffix Rule 1 of the Boyer and Moore Algorithm:
An
Application of Rule E2.
Section 7.3 The Bad Character Rule of the Boyer and Moore Algorithm:
Another Application of Rule E2
Chapter 8 Algorithms with Selective Scanning Order
Chapter 9 The Wide Window Approach
Part II Approximate String Matching Algorithms
Chapter 10 The Edit Distance
Chapter 11 The Two Types of Approximate String Matching Problems and the
Seller’s Algorithm to Solve Problem 1
Chapter 12 The LV Algorithm and the U Algorithm to Solve Problem 1 Which Are
Chapter 13
Chapter 14
Chapter 15
Chapter 16
Chapter 17
Chapter 18
Chapter 19
Chapter 20
Based upon Rule A1
The WM1 Algorithm for Problem 1, a Bit-Parallel Approximate String
Matching Algorithm
Another Bit-Parallel Approach to Solve Problem 1: the Myers Algorithm
The WM2 Algorithm, the NB Algorithm Based upon Rules A2 and A3
to Solve Problem 1
The Error Pattern Approach to Solve Problem 2 by Turning It into an
Exact String Matching Problem
The FN Algorithm Based upon Rules A3 and A4 and the Filtering
Concept to Solve Problem 2
The Ukkonen Algorithm and the Lu Algorithm Based upon Filtering
Concept for Problem 1
The TU Algorithm to Solve Problem 1
The HN Filtering and Bit-Parallel Algorithm
1-4
Chapter 1:
Problems
Definitions of String Matching
In string matching problems, there are a text T  t1t 2 t n and a pattern
P  p1 p2  pm where both t i ’s and p i ’s are characters. Throughout this book,
we assume that m  n . We denote t i t i 1  t j by T (i, j ) . We shall consider two
different kinds of string matching problems: Exact string matching problem and
approximate string matching problems.
Section 1.1 The Exact String Matching Problem
Definition 1.1-1 The Exact String Problem
For the exact string matching problem, we are given a text T  t1t 2 t n and a
pattern P  p1 p2  pm . Our job is to find whether P appears in T and if it
does, where it appears.
Example 1.1-1
We are given T  aaccgtcaccggt and P  acc .
We would find P does
appear in T as shown below:
T=aaccgtcaccggt
Example 1.1-2
We are given T  aaccgtcaccggt and P  gcc .
In this case, we will find out
that P does not appear in T .
There are many algorithms developed to solve this exact string matching
problem and we shall present them in Part I of this book. Although there are a large
number of such algorithms, we have found that they are nearly all based upon some
general rules. This book is organized by stating these rules. By this approach, the
reader can more clearly understand the basic principles of the algorithms.
1-5
Section 1.2
The Approximate String Matching Problem
For the approximate string matching problem, we first define a distance function
called edit distance which measures the similarity between two strings. Given a
string S1 and a string S 2 , we can transform S 2 to S1 by three operations:
deletions, insertions and substitutions. Throughout this book, without losing
generality, we assume that the operations are on S 2 .
Deletion Operation.
Let S1  aacgt and S 2  aaccggt . We may delete c in location 4 and g
in location 5 from S 2 . Then after these two deletion operations as illustrated below,
we can transform S 2 to S1 :
S1
S2
1 2 3 4 5 6 7 8
=a a c - - g t
=a a c c g g t
Insertion Operation
Let S1  aactgt and S 2  actt . We may insert a and g at locations 2 and 5
respectively into S 2 to transform S 2 to S1 as shown below:
S1
S2
1 2 3 4
=a a c t
=a - c t
a
5 6
g t
- t
g
Substitution Operation
Let S1  aactgtt and S 2  abctatt .
Then we may transform S 2 to S1 by
the following substitution operations at locations 2 and 5. Note that in location 2, b
is substituted by a and in location 5, a is substituted by g.
S1
S2
1 2 3 4
=a a c t
=a b c t
a
1-6
5 6 7 8
g t t
a t t
g
Example 1.2-1
Let S1  aagttctattagacg and S 2  aacgtatatttatag .
into S1 through the following operations:
S1
S2
S 2 can be transformed
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
= a a - g t t c t a t
t
a g a c g
= a a c g t - a t a t
t
t
a t
a g
t c
g
c
The operations are as follows:
1.deleting c at location 3 ,
2. inserting t at location 6,
3. substituting a at location 7 by c ,
4. deleting t at location 12,
5. substituting t at location 14 by g,
and 6. inserting c at location 16.
Thus there are totally 6 operations, including 2 deletions, 2 insertions and 2
substitutions.
Having defined these operations, we may now define edit distance as follows:
Definition 1.2-1 Edit Distance
The edit distance between two strings S1 and S 2 is the minimum number of
deletions, insertions and substitutions needed to transform S 2 to S1 . The edit
distance between S1 and S 2 is denoted as ED(S1 , S 2 ) .
Example 1.2-2
Let S1  aagttcgta and S 2  aattacaa .
S1 through the following operations:
1-7
Then S 2 can be transformed into
1 2 3 4 5 6 7 8 9 10
S1 = a a g t
S2 = a a - t
g
t
t
- c g t
a c a g t
a
a
As we can see, the operations are:
1. inserting g at location 3,
2. deleting a at location 6,
3. substituting a by g at location 8
and 4. inserting t at location 9
In this case, we perform 4 operations and it can be proved that this number is
minimal. Thus the edit distance between S1 and S 2 is 4.
Edit distance can be viewed as a measurement of the similarity between two
strings. If the edit distance is small, it means that these two strings are similar to
each other and quite different if otherwise. If edit distance is zero, these two strings
are identical. The edit distance is basically found by dynamic programming. We
shall elaborate this in later chapters.
Definition 1.2-2
Hamming Distance
A special version of edit distance only considers the substitution. This is
called the Hamming distance.
We can see that the Hamming distance is much easier to find and only a few
algorithms use Hamming distance.
Two Kinds of Approximate String Matching Problems
The approximate string matching problem is based upon the edit distance
function. There are two kinds of approximate string matching problems. We shall
call them Approximate Problem 1 and Approximate Problem 2.
Definition 1.2-3 Approximate String Matching Problem 1
Approximate String Matching Problem 1:
1-8
Given a text T , a pattern P and a
prespecified positive integer k , find every location i in T such that there is a
substring S of T which ends at location i such that ED( S , P)  k .
Example 1.2-3
Let T  aagtcggtac , P  cgt and k  1 . We can now see that there are three
locations in T , namely locations 4, 7 and 8, satisfying the conditions.
(1). At location 4 of T , there are substrings agt and gt ending at this location
whose edit distances with P are smaller than or equal to k  1 .
(2). At location 6 of T , there is substring cg satisfying the condition.
(3). At location 7 of T , there is substring cgg satisfying the condition.
(4). At location 8 of T , there are substrings cggt , ggt and gt satisfying the
condition.
Definition 1.2-4 Approximate String Matching Problem 2
Approximate String Matching Problem 2: Given a text T a pattern P and a
prespecified positive integer k , find all substrings S ’s in T such that
ED( S , P)  k .
Example 1.2-4.
Let T  aagtcggtac , P  ggt and k  1 . Then we can find five substrings in
T , namely
S1  S (2,4)  a g , t S2  S (3,4)  gt , S3  S (6,7)  gg ,
Note that
S4  S (6,8)  ggt , and S5  S (7,8)  gt as our solution.
ED(S1 , P)  1  k and ED(S 4 , P)  0  k .
Example 1.2-5.
Let T  aagtcggtac , P  ttt and k  1 . Then we cannot find any substring
in T whose edit distance with P is less than or equal to k .
Note that in Approximate Problem 1, it is not required to find substrings
satisfying our conditions. We only have to find the locations.
In this book, exact string matching algorithms are presented in Part I and
approximate string matching algorithms are presented in Part II. Again, we have
1-9
successfully found some basic rules for developing approximate string matching
algorithms and we shall present the algorithms based upon the rules which we have
found.
Section 1.3
References
The following is a list of references on string matching algorithms.
[CHL07]: Crochemore, M., Hancart, C. and Lecroq, T. Algorithms on Strings,
Cambridge University Press, Cambridge, England, 2007.
[CR94]: Crochemore, M. and Rytter, W. Text Algorithms, Oxford University Press,
N. Y. 1994.
[CR02]: Crochemore, M. and Rytter, W. Jewels of Stringology, World Scientific
Press, Singapore, Singapore, 2002
[G97]: Gusfield, P. M. Algorithms on Strings, Trees and Sequences, Cambridge
University Press, Cambridge, England 1997.
[L66]: Levenshtein, V. I. Binary codes capable of correcting deletions, insertions,
and reversals. Soviet Physics Doklady, 1996, Vol. 10, No. 8, pp. 707 - 710.
[NR02]: Navarro, G. and Raffinot, M. Flexible Pattern Matching in Strings,
Practical On-Line Search Algorithms for Texts and Biological Sequences, Cambridge
University Press, Cambridge, England, 2002.
[S01]: Szpankowski, W. Average Case Analysis of Algorithms on Sequences, Wiley
Interscience, N.Y. 2001.
1-10
Download