Chapter 9 A Review of the Exact String Matching Algorithms

We have presented and discussed many algorithms. It is time for us to have a review of them. First of all, we must note that we should not compare all of these algorithms. We may say that every algorithm has its own value as well as its weakness, and different algorithms may be used under different conditions. It is meaningless to say that there is a best string matching algorithm.

Let us imagine that our pattern string is a rather short one and our text is not too long either. In this case, we may simply use Chu's Algorithm introduced in Section 4.3, the elimination oriented algorithm introduced in Section 2.10, or the convolution algorithm introduced in Chapter 2. Let us consider the case where T = catgacgctagt and P = gatc. Suppose we use Chu's Algorithm. Although it was used to find LSP(X, Y), we can easily modify it so that it can be used as an exact string matching algorithm. It will proceed as follows:

Text:            c a t g a c g c t a g t
Initially:       D = (1 1 1 1 1 1 1 1 1 1 1 1)
Consider p1 = g: D = (0 0 0 1 0 0 1 0 0 0 1 0)
Consider p2 = a: D = (0 0 0 0 1 0 0 0 0 0 0 0)
Consider p3 = t: D = (0 0 0 0 0 0 0 0 0 0 0 0)

Since no 1 exists in D, we report "No".

Suppose we use the elimination oriented method. We would proceed as follows:

Consider p1 = g: D = (4, 7, 11)
Consider p2 = a: D = (5)
Consider p3 = t: D = ()

We may now report "No".

Let us use the convolution method, with the reversed pattern ctag:

T:      c a t g a c g c t a g t
g:      0 0 0 1 0 0 1 0 0 0 1 0
a:      0 1 0 0 1 0 0 0 0 1 0 0
t:      0 0 1 0 0 0 0 0 1 0 0 1
Result: 0 0 0 0 0 0 0 0 0 0 0 0

We report "No".

We now discuss the convolution approach. The biggest advantage of convolution is that no window is produced. It is easy to program and, besides, we may conveniently use the bit-parallel approach. That is, we perform a pre-processing on the text string to produce all of the incidence vectors.
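The elimination oriented run above can be coded in a few lines. The following is a minimal sketch of the idea, not the book's own program; the function name `eliminate` is chosen here for illustration.

```python
# A minimal sketch of the elimination oriented method: D holds the
# text positions that can still extend the partial match, and each
# pattern character eliminates candidates.

def eliminate(T, P):
    """Return True if P occurs in T, scanning P left to right."""
    n = len(T)
    # 1-based positions where p1 matches.
    D = {i for i in range(1, n + 1) if T[i - 1] == P[0]}
    for c in P[1:]:
        # Keep a candidate only if the next text character matches too.
        D = {i + 1 for i in D if i + 1 <= n and T[i] == c}
        if not D:          # early termination: no candidate survives
            return False
    return True

T, P = "catgacgctagt", "gatc"
print(eliminate(T, P))     # P does not occur in T, so False
```

Running it on the example reproduces the sets D = (4, 7, 11), then (5), then (), at which point the scan stops early.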
Suppose the size of the vocabulary set is 4. We only need 4 incidence vectors. Then we just perform some left shift and logical "and" operations. Of course, the size of the text string must not be too large; otherwise, the bit-parallel approach cannot be used. This is its major disadvantage. If we use the bit-parallel approach, we perform a pre-processing, and this pre-processing is performed on the text string, which means it will be useful for different pattern strings later. This is an advantage of the convolution approach. If the convolution approach is used, we should use the early termination approach. Actually, this is physically equivalent to prefix finding. If the size of the vocabulary is very large, this early termination approach is quite efficient.

As for the suffix tree approach, we must admit that it is not very practical because it is hard to write a program to do the tree searching. Note that the suffix tree is not a binary tree. There is a linear algorithm to construct a suffix tree, but it is a rather complicated algorithm. The advantage of using the suffix tree is that the pre-processing, namely the construction of a suffix tree, is performed on the text string, which means that this pre-processing is useful for different pattern strings. Another important thing about being familiar with the suffix tree is that it is academically interesting. There is a large amount of research results on suffix trees, and any scholar should have some knowledge of them.

Although it is not very practical to use the suffix tree approach in exact string matching, it is practical to use the suffix array. Having a suffix array, we may use binary search or hashing to solve the exact string matching problem.

We now discuss the Reverse Factor Algorithm. We would like to point out that this is indeed a very good exact string matching algorithm. Our main job is to find LSP(W, P) where W is the window.
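The suffix array route mentioned above can be sketched as follows. The construction below is a naive O(n^2 log n) one, used only for illustration (it is not one of the efficient construction algorithms), and the function names are chosen here.

```python
# A minimal sketch of exact matching with a suffix array: build the
# array once for T, then binary-search the sorted suffixes for P.

def suffix_array(T):
    # Naive construction: sort suffix start positions by the suffixes.
    return sorted(range(len(T)), key=lambda i: T[i:])

def occurs(T, sa, P):
    # Lower-bound binary search: find the first suffix >= P.
    # P occurs in T iff that suffix starts with P.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if T[sa[mid]:] < P:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and T.startswith(P, sa[lo])

T = "catgacgctagt"
sa = suffix_array(T)
print(occurs(T, sa, "gatc"))   # False: gatc is not a factor of T
print(occurs(T, sa, "gac"))    # True:  gac occurs at position 4
```

The pre-processing (sorting the suffixes) is done once on the text and then serves every later pattern, which is exactly the advantage noted above.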
If LSP(W, P) is long, we cannot shift much. But probability theory tells us that it is quite unlikely for LSP(W, P) to be long. The reader may take a pen, randomly generate two strings X and Y, and then find LSP(X, Y). You will almost always find that LSP(X, Y) is a short one. Thus we can often perform a long shift.

The reader may be worried that LSP(W, P) is found at run-time and thus the algorithm may not be very efficient. But, as we pointed out in the above paragraph, LSP(W, P) is usually very short. Thus, it usually takes only a few steps to finish the process of finding LSP(W, P). As to the algorithms to find LSP(W, P), we introduced four of them. Most of them are easy to implement and efficient.

Let us consider a case here: W = actggctatgac and P = gtatgcaccatg. Suppose we use Chu's Algorithm.

Consider w12 = c: D = (0 0 0 0 0 1 0 1 1 0 0 0)
Consider w11 = a: D = (0 0 0 0 0 0 1 0 0 0 0 0)
Consider w10 = g: D = (0 0 0 0 0 0 0 0 0 0 0 0)

We report that the length of LSP(W, P) is 0, and we may shift 12 steps.

If we use the bit-parallel method, we do a pre-processing on the pattern string. Since the Reverse Factor Algorithm is a window sliding method, this pre-processing is useful for all windows.

In the chapter discussing the Reverse Factor Algorithm, we introduced the filtering concept. This filtering is quite straightforward. We test whether a suffix of W with a particular length appears in P or not. If it does not, we ignore this window. This is actually an exact string matching problem. So, how are we going to solve this problem? Do we have to use any sophisticated algorithm in this case? Note that the suffix will be a short one, at most, say, of length 4, and the pattern is also relatively short. Therefore, it is quite easy to solve the problem. For example, suppose the data are W = actggctatgac, P = gtatgcaccatg, and we want to see whether the suffix tgac of W appears in P. As shown in the above discussion, it only takes 3 steps to conclude the search.
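The quantity driving the shift can also be computed directly. The sketch below is my own straightforward version, not the Chu-style bit-vector scan; it takes LSP(W, P) to be the longest suffix of W that is also a prefix of P, and the resulting shift as the window length minus that length.

```python
# A minimal sketch: length of LSP(W, P), the longest suffix of the
# window W that is also a prefix of the pattern P.

def lsp_length(W, P):
    # Try lengths from longest to shortest; return the first match.
    for k in range(min(len(W), len(P)), 0, -1):
        if W[-k:] == P[:k]:    # suffix of W of length k == prefix of P
            return k
    return 0

W, P = "actggctatgac", "gtatgcaccatg"
k = lsp_length(W, P)
print(k)               # 0: no nonempty suffix of W is a prefix of P
print(len(W) - k)      # 12: the window may be shifted this far
```

This naive version costs O(min(|W|, |P|)^2) in the worst case, but as the text argues, LSP(W, P) is almost always short, so in practice only a few comparisons near the short end matter.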
Although filtering was introduced with the Reverse Factor Algorithm, it can be used in almost any algorithm. The reader should remember this.

We now come to the Horspool Algorithm. The greatest advantage of this algorithm is that it is easy to implement. Besides, it is a constant space algorithm because the result of the pre-processing, namely the location table, is of length equal to the size of the alphabet set, which is considered a constant. But there is a fundamental problem with this algorithm: the average number of steps of shifting is short. The reader is encouraged to write down any string and then construct the location table. You will find out that the chance that there is a large number in the table is very small. Consider the case presented above where P = gtatgcaccatg. Only when the last character of W is g will there be a long shift. We recommend the reader to use Liu's Algorithm, which will perform much better than the Horspool Algorithm.

The KMP Algorithm is perhaps the most famous exact string matching algorithm. Researchers also like to compare their algorithms with this algorithm. But this algorithm suffers from a disadvantage: it scans from left to right and stops as soon as a mismatch occurs. Let us assume that the mismatch occurs after j steps. Then we can shift at most j steps. But, unfortunately, j must be quite small because it is unlikely that there exists a long prefix of the window which is exactly equal to a prefix of the pattern. If j = 4 and the alphabet size is 4, the probability of having such a prefix is roughly (1/4)^4, about 0.004. Because of this, we cannot expect the KMP Algorithm to be effective, because we do have to open a large number of windows. Yet the KMP Algorithm is academically interesting because in the worst case, its time-complexity is O(n). In the original KMP Algorithm paper, it was mentioned that a filtering mechanism can be used. We wonder why researchers seldom mention that mechanism. Actually, the KMP Algorithm should include that part.
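The location table for the Horspool Algorithm is easy to build. The sketch below assumes the usual Horspool rule: the shift for a character x is the distance from its rightmost occurrence in P (excluding the last position) to the end of P, or the full pattern length if x does not occur there.

```python
# A minimal sketch of the Horspool pre-processing over a 4-letter
# alphabet; shift[x] is how far we may slide when the window's last
# character is x.

def horspool_table(P, alphabet="acgt"):
    m = len(P)
    shift = {x: m for x in alphabet}    # default: character absent
    for i, x in enumerate(P[:-1]):      # last position excluded
        shift[x] = m - 1 - i            # rightmost occurrence wins
    return shift

P = "gtatgcaccatg"
print(horspool_table(P))
# Only 'g' yields a long shift (7); 'a', 'c', 't' give 2, 3, 1.
```

Building the table confirms the observation in the text: for this pattern, every character of the alphabet appears near the right end except g, so most windows slide only one to three positions.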
The filtering mechanism improves the KMP Algorithm to a large degree. Consider Fig. 9.1.

Fig. 9.1 An illustration of the KMP Algorithm

From the above figure, we can see that a good substring U is a short one. In fact, the best situation is that there is no suffix of P(1, j) equal to a prefix of P(1, j). This leads to the proposal of the Boyer and Moore Algorithm.

The Boyer and Moore Algorithm scans from right to left. Suppose a suffix U of the window W is found to be exactly the same as a suffix of P. If U also appears in P to the left of that suffix, we can move P as shown in Fig. 9.2.

Fig. 9.2 U appearing to the left of the suffix position in P

Suppose U is unique in P. Then we may find a suffix V of P, contained in U, which is a prefix of P, and we can move P as shown in Fig. 9.3.

Fig. 9.3 U being unique

Since the Boyer and Moore Algorithm scans from right to left, it is more efficient than the KMP Algorithm. Yet it suffers from one disadvantage: its pre-processing is much more complicated. As can be seen, if U is unique in P, we can shift P much farther to the right. Both the KMP and the Boyer and Moore Algorithms utilize a substring of P which exactly matches a substring of W. Ideally, U should be unique and short. But U is not pre-determined; since it is obtained at run-time, this is not guaranteed. Many algorithmsms are therefore what we call "selective scanning order" algorithms. They neither scan from left to right nor from right to left. They have a mechanism to determine a scanning order in the hope that the resulting slide is a long one.
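The two shifting cases of Fig. 9.2 and Fig. 9.3 can be expressed directly in code. The sketch below is an illustrative, unoptimized version written for this review (the real Boyer and Moore pre-processing tabulates these shifts in advance rather than recomputing them); the function name is chosen here.

```python
# A minimal sketch of the two good-suffix cases: U = P[m-k:] is the
# suffix of P that has just matched the window.

def good_suffix_shift(P, k):
    """Shift after the suffix U of length k (1 <= k < len(P)) matched."""
    m = len(P)
    U = P[m - k:]
    # Case of Fig. 9.2: U reoccurs to the left; align that occurrence.
    j = P.rfind(U, 0, m - 1)        # rightmost earlier occurrence
    if j != -1:
        return m - k - j
    # Case of Fig. 9.3: U is unique; align the longest suffix V of U
    # that is also a prefix of P.
    for v in range(k - 1, 0, -1):
        if P.startswith(U[-v:]):
            return m - v
    return m                        # nothing reusable: full shift

P = "gtatgcaccatg"
print(good_suffix_shift(P, 2))  # suffix "tg" reoccurs earlier, so 7
```

When U is unique and no suffix of it is a prefix of P, the function returns the full pattern length, which is the long shift the text describes as the ideal situation.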