Exact and Approximate String Matching Algorithms R. C. T. Lee K. H. Chen C. W. Lu C. S. Ou Y. K. Shieh 1-1 Table of Contents Chapter 1 Definitions of String Matching Algorithms Section 1.1 The Exact String Matching Problem Section 1.2 The Approximate String Matching problem Section 1.3 References Part I Exact String Matching Algorithms Chapter 2 The Exact String Matching Algorithm Based upon the Convolution Method Section 2.1 The Window Sliding Method Section 2.2 The Discrete Convolution Section 2.3 The Application of Discrete Convolution to the Exact String Matching Problem Section 2.4 The Bit-Parallel Approach of the Application of Convolution to Exact String Matching Section 2.5 The Integer Multiplication Approach to Implement Discrete Convolution Section 2.6 The Fourier Transform Approach to Implement Discrete Convolution Section 2.7 The Prefix Finding and the Discrete Convolution Techniques for Exact String Matching Section 2.8 The Probability of a String Matching Approach in Another String Section 2.9 Discrete Convolution Method with Early Termination Section 2.10 An Elimination Oriented Exact String Matching Algorithm Derived from the Prefix Finding Concept Section 2.11 The Relationship between the BG Algorithm and the Convolution Method Section 2.12 References Chapter 3 The Exact String Matching Algorithm Based upon the Suffix Tree Method Section 3.1 Prefix and Suffix Section 3.2 The Suffix Tree Section 3.3 The Construction of Suffix Trees Section 3.4 The Suffix Array 1-2 Section 3.5 References Chapter 4 Rule E2, Rule E2-1 and the Reverse Factor Algorithm Section 4.1 Rule E1: The Substring Matching Rule Section 4.2 Rule E2-1 The Suffix to Prefix Rule Section 4.3 Method 1: the Chu’s Method to Solve the LSP( X , Y ) finding Problem Section 4.4 Method 2:: The Suffix Tree Method to Solve LSP( X , Y ) finding Problem Section 4.5 Method 3: The Bit-Parallel Method to Solve LSP( X , Y ) finding Problem Section 4.6 Method 4: The Convolution Method to Solve LSP( X , Y ) finding Problem Section 4.7 The Reverse Factor Algorithm Based on Rule E2-1 Section 4.8 An Average Analysis of the Reverse Factor Algorithm Section 4.9 A Further Improvement of the Reverse Factor Algorithm Section 4.10 The Reverse Factor Algorithm with the Filtering Concept. Section 4.11 Another Improved Algorithm Which Avoids the Repeating Examination of Characters Section 4.12 References Chapter 5 Rule E2-2 and the Horspool Algorithm Section 5.1 Rule E2-2: The 1-Suffix Rule Section 5.2 The Horspool Algorithm Section 5.3 The Time-Complexity of the Horspool Algorithm Section 5.4 Chapter 6 The MP and KMP Algorithms Based upon Rule E2-1 Section 6.1 The Morris-Pratt Algorithm (MP Algorithm) Section 6.2 Section 6.3 Section 6.4 Section 6.5 Section 6.6 The Knuth- Morris-Pratt Algorithm (KMP Algorithm) The Simon Allgorithm The Time-Complexity Analysis of the KMP Algorithm The Improved KMP Algorithm References Chapter 7 The Boyer and Moore Algorithm Based upon Rule E3 Section 7.1 The Basic Idea of the Boyer and Moore Algorithm 1-3 Section 7.2 The Good Suffix Rule 1 of the Boyer and Moore Algorithm: An Application of Rule E2. Section 7.3 The Bad Character Rule of the Boyer and Moore Algorithm: Another Application of Rule E2 Chapter 8 Algorithms with Selective Scanning Order Chapter 9 The Wide Window Approach Part II Approximate String Matching Algorithms Chapter 10 The Edit Distance Chapter 11 The Two Types of Approximate String Matching Problems and the Seller’s Algorithm to Solve Problem 1 Chapter 12 The LV Algorithm and the U Algorithm to Solve Problem 1 Which Are Chapter 13 Chapter 14 Chapter 15 Chapter 16 Chapter 17 Chapter 18 Chapter 19 Chapter 20 Based upon Rule A1 The WM1 Algorithm for Problem 1, a Bit-Parallel Approximate String Matching Algorithm Another Bit-Parallel Approach to Solve Problem 1: the Myers Algorithm The WM2 Algorithm, the NB Algorithm Based upon Rules A2 and A3 to Solve Problem 1 The Error Pattern Approach to Solve Problem 2 by Turning It into an Exact String Matching Problem The FN Algorithm Based upon Rules A3 and A4 and the Filtering Concept to Solve Problem 2 The Ukkonen Algorithm and the Lu Algorithm Based upon Filtering Concept for Problem 1 The TU Algorithm to Solve Problem 1 The HN Filtering and Bit-Parallel Algorithm 1-4 Chapter 1: Problems Definitions of String Matching In string matching problems, there are a text T t1t 2 t n and a pattern P p1 p2 pm where both t i ’s and p i ’s are characters. Throughout this book, we assume that m n . We denote t i t i 1 t j by T (i, j ) . We shall consider two different kinds of string matching problems: Exact string matching problem and approximate string matching problems. Section 1.1 The Exact String Matching Problem Definition 1.1-1 The Exact String Problem For the exact string matching problem, we are given a text T t1t 2 t n and a pattern P p1 p2 pm . Our job is to find whether P appears in T and if it does, where it appears. Example 1.1-1 We are given T aaccgtcaccggt and P acc . We would find P does appear in T as shown below: T=aaccgtcaccggt Example 1.1-2 We are given T aaccgtcaccggt and P gcc . In this case, we will find out that P does not appear in T . There are many algorithms developed to solve this exact string matching problem and we shall present them in Part I of this book. Although there are a large number of such algorithms, we have found that they are nearly all based upon some general rules. This book is organized by stating these rules. By this approach, the reader can more clearly understand the basic principles of the algorithms. 1-5 Section 1.2 The Approximate String Matching Problem For the approximate string matching problem, we first define a distance function called edit distance which measures the similarity between two strings. Given a string S1 and a string S 2 , we can transform S 2 to S1 by three operations: deletions, insertions and substitutions. Throughout this book, without losing generality, we assume that the operations are on S 2 . Deletion Operation. Let S1 aacgt and S 2 aaccggt . We may delete c in location 4 and g in location 5 from S 2 . Then after these two deletion operations as illustrated below, we can transform S 2 to S1 : S1 S2 1 2 3 4 5 6 7 8 =a a c - - g t =a a c c g g t Insertion Operation Let S1 aactgt and S 2 actt . We may insert a and g at locations 2 and 5 respectively into S 2 to transform S 2 to S1 as shown below: S1 S2 1 2 3 4 =a a c t =a - c t a 5 6 g t - t g Substitution Operation Let S1 aactgtt and S 2 abctatt . Then we may transform S 2 to S1 by the following substitution operations at locations 2 and 5. Note that in location 2, b is substituted by a and in location 5, a is substituted by g. S1 S2 1 2 3 4 =a a c t =a b c t a 1-6 5 6 7 8 g t t a t t g Example 1.2-1 Let S1 aagttctattagacg and S 2 aacgtatatttatag . into S1 through the following operations: S1 S2 S 2 can be transformed 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 = a a - g t t c t a t t a g a c g = a a c g t - a t a t t t a t a g t c g c The operations are as follows: 1.deleting c at location 3 , 2. inserting t at location 6, 3. substituting a at location 7 by c , 4. deleting t at location 12, 5. substituting t at location 14 by g, and 6. inserting c at location 16. Thus there are totally 6 operations, including 2 deletions, 2 insertions and 2 substitutions. Having defined these operations, we may now define edit distance as follows: Definition 1.2-1 Edit Distance The edit distance between two strings S1 and S 2 is the minimum number of deletions, insertions and substitutions needed to transform S 2 to S1 . The edit distance between S1 and S 2 is denoted as ED(S1 , S 2 ) . Example 1.2-2 Let S1 aagttcgta and S 2 aattacaa . S1 through the following operations: 1-7 Then S 2 can be transformed into 1 2 3 4 5 6 7 8 9 10 S1 = a a g t S2 = a a - t g t t - c g t a c a g t a a As we can see, the operations are: 1. inserting g at location 3, 2. deleting a at location 6, 3. substituting a by g at location 8 and 4. inserting t at location 9 In this case, we perform 4 operations and it can be proved that this number is minimal. Thus the edit distance between S1 and S 2 is 4. Edit distance can be viewed as a measurement of the similarity between two strings. If the edit distance is small, it means that these two strings are similar to each other and quite different if otherwise. If edit distance is zero, these two strings are identical. The edit distance is basically found by dynamic programming. We shall elaborate this in later chapters. Definition 1.2-2 Hamming Distance A special version of edit distance only considers the substitution. This is called the Hamming distance. We can see that the Hamming distance is much easier to find and only a few algorithms use Hamming distance. Two Kinds of Approximate String Matching Problems The approximate string matching problem is based upon the edit distance function. There are two kinds of approximate string matching problems. We shall call them Approximate Problem 1 and Approximate Problem 2. Definition 1.2-3 Approximate String Matching Problem 1 Approximate String Matching Problem 1: 1-8 Given a text T , a pattern P and a prespecified positive integer k , find every location i in T such that there is a substring S of T which ends at location i such that ED( S , P) k . Example 1.2-3 Let T aagtcggtac , P cgt and k 1 . We can now see that there are three locations in T , namely locations 4, 7 and 8, satisfying the conditions. (1). At location 4 of T , there are substrings agt and gt ending at this location whose edit distances with P are smaller than or equal to k 1 . (2). At location 6 of T , there is substring cg satisfying the condition. (3). At location 7 of T , there is substring cgg satisfying the condition. (4). At location 8 of T , there are substrings cggt , ggt and gt satisfying the condition. Definition 1.2-4 Approximate String Matching Problem 2 Approximate String Matching Problem 2: Given a text T a pattern P and a prespecified positive integer k , find all substrings S ’s in T such that ED( S , P) k . Example 1.2-4. Let T aagtcggtac , P ggt and k 1 . Then we can find five substrings in T , namely S1 S (2,4) a g , t S2 S (3,4) gt , S3 S (6,7) gg , Note that S4 S (6,8) ggt , and S5 S (7,8) gt as our solution. ED(S1 , P) 1 k and ED(S 4 , P) 0 k . Example 1.2-5. Let T aagtcggtac , P ttt and k 1 . Then we cannot find any substring in T whose edit distance with P is less than or equal to k . Note that in Approximate Problem 1, it is not required to find substrings satisfying our conditions. We only have to find the locations. In this book, exact string matching algorithms are presented in Part I and approximate string matching algorithms are presented in Part II. Again, we have 1-9 successfully found some basic rules for developing approximate string matching algorithms and we shall present the algorithms based upon the rules which we have found. Section 1.3 References The following is a list of references on string matching algorithms. [CHL07]: Crochemore, M., Hancart, C. and Lecroq, T. Algorithms on Strings, Cambridge University Press, Cambridge, England, 2007. [CR94]: Crochemore, M. and Rytter, W. Text Algorithms, Oxford University Press, N. Y. 1994. [CR02]: Crochemore, M. and Rytter, W. Jewels of Stringology, World Scientific Press, Singapore, Singapore, 2002 [G97]: Gusfield, P. M. Algorithms on Strings, Trees and Sequences, Cambridge University Press, Cambridge, England 1997. [L66]: Levenshtein, V. I. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 1996, Vol. 10, No. 8, pp. 707 - 710. [NR02]: Navarro, G. and Raffinot, M. Flexible Pattern Matching in Strings, Practical On-Line Search Algorithms for Texts and Biological Sequences, Cambridge University Press, Cambridge, England, 2002. [S01]: Szpankowski, W. Average Case Analysis of Algorithms on Sequences, Wiley Interscience, N.Y. 2001. 1-10