Chapter 5

Rule E2-2 and the Horspool Algorithm

In Chapter 4, we introduced Rule E2, which is a substring matching rule. Given a substring S in the window W of the text string T, we try to find a substring identical to S to the left of S in the pattern string P. In Chapter 4, we also introduced a variant of Rule E2, namely Rule E2-1, in which the substring is the longest suffix of W which is equal to a prefix of P. In this chapter, we shall introduce another variant of Rule E2.

Section 5.1 Rule E2-2: The 1-Suffix Rule

Consider Fig. 5.1-1. Note that the last character of W is x. If we have to move the pattern P, we must align the rightmost x in $P(1, m-1)$, if it exists, with the x in W as shown in Fig. 5.1-1(b). If no such x exists in P, we move P entirely past the x in W as shown in Fig. 5.1-1(c).

[Fig. 5.1-1 The Basic Idea of Rule E2-2. (a) W ends with x and P contains an x; (b) P is shifted so that its x is aligned with the x in W; (c) no x exists in P, so P is shifted past the x in W.]

The following is a formal statement of Rule E2-2.

Rule E2-2: We are given a text string T, a pattern string P of length m, and a window $W = T(a, b)$ of T, which is aligned with P. Assume that the last character of W is x and we have to shift P. If x exists in $P(1, m-1)$, let $i(x)$ be the location of the rightmost x in $P(1, m-1)$, and shift P to such an extent that $p_{i(x)}$ is aligned with $t_b$. If no x exists in $P(1, m-1)$, shift P to such an extent that $p_1$ is aligned with $t_{b+1}$.

Section 5.2 The Horspool Algorithm

The Horspool Algorithm scans from the right as shown in Fig. 5.2-1. We compare each pair of corresponding characters in the text and the pattern until we find a mismatch. After we find a mismatch, we know that we should shift the pattern to the right. The shifting is based upon Rule E2-2.

[Fig. 5.2-1 The right to left scanning in the Horspool Algorithm]

To implement Rule E2-2, we must have a mechanism to find $i(x)$ for a given x. This is not done at run-time. Instead, we do a pre-processing.
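To make the rule concrete before the pre-processing is introduced, the following is a minimal Python sketch of Rule E2-2 evaluated directly at run-time. The function name `rule_e22_shift` is ours and purely illustrative. The sketch assumes the shift distance is $m - i(x)$, which is what aligning $p_{i(x)}$ with $t_b$ amounts to when $p_m$ was previously aligned with $t_b$.

```python
def rule_e22_shift(pattern: str, x: str) -> int:
    """Number of steps P is shifted under Rule E2-2 when the window ends with x."""
    m = len(pattern)
    # Scan P(1, m-1) from right to left; i is a 1-based location in P.
    for i in range(m - 1, 0, -1):
        if pattern[i - 1] == x:
            return m - i          # aligns p_i(x) with t_b
    return m                      # no x in P(1, m-1): p_1 is aligned with t_(b+1)


print(rule_e22_shift("aggttgaat", "t"))  # prints 4
```

Each call scans the pattern again, which costs up to m - 1 character tests per shift. Storing the answers for every character of the alphabet once removes that cost from the searching phase.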
Definition 5.2-1 The Location Table for the Horspool Algorithm

Given an alphabet set $\Sigma = (x_1, x_2, \ldots, x_\sigma)$ and a pattern P with length m, we create a table, denoted as the location table of P, containing $\sigma$ entries. Each entry stores the location of the rightmost $x_i$, $1 \le i \le \sigma$, in $P(1, m-1)$, counted from location $m-1$ (location $m-1$ itself counts as 1), if it exists. If $x_i$ does not exist in $P(1, m-1)$, we store m in the entry. Each entry is therefore exactly the number of steps by which Rule E2-2 shifts P.

Example 5.2-1 Let P = aggttgaat. The location table of P is displayed as follows:

Table 5.2-1 The location table of P = aggttgaat
a  c  g  t
1  9  3  4

Each time we have to move the pattern P, we consult the location table of P. For instance, consider the case shown in Fig. 5.2-1.

T = a c c g a g g t t g a a t t g c
P = a g g t t g a a t
Fig. 5.2-1 An example for the Horspool Algorithm

As can be seen, we have to move P. The last character of the window is t. There are two t's in $P(1, 8)$, at locations 4 and 5 of P in Fig. 5.2-1. We consult Table 5.2-1. The entry of t is 4. We therefore move P 4 steps to the right as shown in Fig. 5.2-2. A match is now found.

T = a c c g a g g t t g a a t t g c
P =         a g g t t g a a t
Fig. 5.2-2 The moving of P in the Horspool Algorithm

The Horspool Algorithm is very similar to the Reverse Factor Algorithm. It is now given as Algorithm 5.1 below:

Algorithm 5.1 The Horspool Algorithm Based upon Rule E2-2
Input: A text string T and a pattern string P with lengths n and m respectively
Output: All occurrences of P in T.
Construct the location table of P.
Set $i = 1$.
Step 1: Let $W = T(i, i+m-1)$ be a window. Align P with W.
    Set $j = i+m-1$ and $k = m$. If $j > n$, exit.
    While $j \ge i$ and $t_j = p_k$
        $j = j - 1$
        $k = k - 1$
    End of While
    If $j < i$, report that $W = T(i, i+m-1)$ is an exact match of P.
    Find the entry of $t_{i+m-1}$ in the location table of P. Let it be denoted as d.
    $i = i + d$.
    Go to Step 1.

Example 5.2-2 Let T = acgttattgacc and P = att. The location table is as shown in Table 5.2-2.
Table 5.2-2 The location table for P = att
a  c  g  t
2  3  3  1

The Horspool Algorithm is initialized as shown in Fig. 5.2-3.

T = a c g t t a t t g a c c
P = a t t
Fig. 5.2-3 The initial alignment for Example 5.2-2

The last character of the window is g, which is not found in $P(1, 2)$. We move P 3 steps as shown in Fig. 5.2-4.

T = a c g t t a t t g a c c
P =       a t t
Fig. 5.2-4 The first movement of P in Example 5.2-2

The last character of the window is a, which is found in $P(1, 2)$. From Table 5.2-2, we know that we should move P 2 steps as shown in Fig. 5.2-5.

T = a c g t t a t t g a c c
P =           a t t
Fig. 5.2-5 The second movement of P in Example 5.2-2

A match is found. The last character of the window is t, so P is moved 1 step as shown in Fig. 5.2-6.

T = a c g t t a t t g a c c
P =             a t t
Fig. 5.2-6 The third movement of P in Example 5.2-2

The last character of the window, namely g, does not exist in $P(1, 2)$. P is moved 3 steps as shown in Fig. 5.2-7.

T = a c g t t a t t g a c c
P =                   a t t
Fig. 5.2-7 The fourth movement of P in Example 5.2-2

Section 5.3 The Time-Complexity Analysis of the Horspool Algorithm

The worst case time-complexity of the Horspool Algorithm is easy to obtain. Let $\sigma$ denote the size of the alphabet. Then:

Preprocessing phase in $O(m + \sigma)$ time and $O(\sigma)$ space complexity.
Searching phase in $O(mn)$ time complexity.

Before we analyze the average number of comparisons of this algorithm, we must state our probabilistic assumptions. We assume that the distribution of the characters occurring in T or in P is uniform. That is, the random variable of the characters, X, ranging over the $\sigma$-alphabet A, satisfies, for any a in A,

$$\Pr(X = a) = \frac{1}{\sigma}.$$

We shall assume that the given pattern string and the text string are random. We first define a term, called "head".

Definition 5.3-1 The last character of a window is called a head.

To obtain an average case analysis of the Horspool Algorithm, we must know the probability that a location k of the text T is a head, denoted as $H_k$.
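Algorithm 5.1 and the notion of a head can be tied together in a short Python sketch. The function names `location_table` and `horspool`, the default alphabet `"acgt"`, and the returned tuple are our own choices for illustration, not part of the algorithm's statement. Locations are 1-based as in the text, and the sketch additionally records every head and counts character comparisons.

```python
def location_table(pattern, alphabet="acgt"):
    """Location table of Definition 5.2-1: shift distance for each character."""
    m = len(pattern)
    table = {ch: m for ch in alphabet}   # default: character absent from P(1, m-1)
    for i in range(1, m):                # i is a 1-based location in P(1, m-1)
        table[pattern[i - 1]] = m - i    # later locations overwrite earlier ones
    return table


def horspool(text, pattern, alphabet="acgt"):
    """Return (match locations, head locations, number of comparisons)."""
    n, m = len(text), len(pattern)
    table = location_table(pattern, alphabet)
    matches, heads, comparisons = [], [], 0
    i = 1                                # 1-based start of the window
    while i + m - 1 <= n:
        heads.append(i + m - 1)          # last location of the window
        j, k = i + m - 1, m
        while j >= i:                    # right-to-left scan
            comparisons += 1
            if text[j - 1] != pattern[k - 1]:
                break
            j -= 1
            k -= 1
        if j < i:
            matches.append(i)
        # Shift by the entry of the head character (m if it is not tabulated).
        i += table.get(text[i + m - 2], m)
    return matches, heads, comparisons


print(horspool("acgttattgacc", "att"))  # prints ([6], [3, 6, 8, 9, 12], 7)
```

For Example 5.2-2 the heads are locations 3, 6, 8, 9 and 12, that is, 5 heads among 12 text locations, and the single match starts at location 6.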
It is intuitive and correct that $H_k$ is the same for all k because there is no reason for one location to differ from the others as far as being a head is concerned. To find $H_k$, we denote the average number of steps of a shift by $E(\mathit{shift})$. With this term, we have the following equation:

$$H_k = \frac{1}{E(\mathit{shift})} \tag{5.3-1}$$

Let us imagine that $E(\mathit{shift}) = 1$. Then obviously, every location will be a head. Suppose that $E(\mathit{shift}) = 2$. Then half of the locations in T are expected to be heads. If the number of steps is large on average, only a small number of locations in T will be heads. Let $L(i)$ denote the value stored in the ith entry of the Location Table. Then we have

$$E(\mathit{shift}) = \frac{1}{\sigma}\sum_{i=1}^{\sigma} L(i). \tag{5.3-2}$$

For example, for the Location Table shown in Table 5.2-2,

$$E(\mathit{shift}) = \frac{1}{4}(1 + 3 + 3 + 2) = 2.25.$$

To obtain the average case time-complexity of the Horspool Algorithm, we must have the average number of character comparisons for a window. Let $AN(m)$ denote the average number of character comparisons for a window with size m. Then we can reason as follows:

(1) The first comparison is a mismatch. In this case, there is 1 comparison, and it contributes $1 \cdot \frac{\sigma-1}{\sigma}$ to the expected number of character comparisons, as $\frac{\sigma-1}{\sigma}$ is the probability that the first comparison yields a mismatch.

(2) The first comparison is a match and the second comparison is a mismatch. In this case, there are 2 comparisons, and the contribution is $2 \cdot \frac{1}{\sigma} \cdot \frac{\sigma-1}{\sigma}$. Note that $\frac{1}{\sigma}$ is the probability that the first comparison yields a match.

(3) The reasoning continues in the same way until the last case: the first $(m-1)$ comparisons all yield matches. Then there will be m comparisons in total, with probability $\left(\frac{1}{\sigma}\right)^{m-1}$.
Based upon the above reasoning, we have:

$$
\begin{aligned}
AN(m) &= 1\cdot\frac{\sigma-1}{\sigma} + 2\cdot\frac{1}{\sigma}\cdot\frac{\sigma-1}{\sigma} + \cdots + m\left(\frac{1}{\sigma}\right)^{m-1} \\
&= 1\left(1-\frac{1}{\sigma}\right) + 2\cdot\frac{1}{\sigma}\left(1-\frac{1}{\sigma}\right) + \cdots + m\left(\frac{1}{\sigma}\right)^{m-1} \\
&= 1 - \frac{1}{\sigma} + \frac{2}{\sigma} - \frac{2}{\sigma^{2}} + \cdots + \frac{m}{\sigma^{m-1}} \\
&= 1 + \frac{1}{\sigma} + \frac{1}{\sigma^{2}} + \cdots + \frac{1}{\sigma^{m-1}} \\
&= \frac{1-\frac{1}{\sigma^{m}}}{1-\frac{1}{\sigma}} \\
&\approx \frac{\sigma}{\sigma-1}
\end{aligned}
\tag{5.3-3}
$$

when m is reasonably large.

Let us denote the expected number of character comparisons for a text string with length n and a pattern string with length m by $C_n$. Then,

$$C_n = n H_k \, AN(m) = n H_k \frac{\sigma}{\sigma-1} \tag{5.3-4}$$

The expected number of character comparisons per text character is therefore:

$$\frac{C_n}{n} = H_k \, AN(m) = H_k \frac{\sigma}{\sigma-1} \tag{5.3-5}$$

For the case of the Location Table 5.2-2, we have:

$$\frac{C_n}{n} = \frac{1}{2.25}\cdot\frac{4}{4-1} = \frac{4}{6.75} = 0.59 \tag{5.3-6}$$

The above result is obtained under the assumption that the pattern string is given and fixed. We must understand that this is not a very good average case analysis because it fails to give an analysis based upon the assumption that the pattern is random.

In the following, we show some experiments measuring the average number of character comparisons per text character. For each of the following three pattern strings, we randomly generated 500 text strings with length 1000. The averages are shown below. They show that the theoretical result is quite close to the experimental results.

P            Theoretical result   Experimental result
att          0.5925               0.6031
cgtac        0.5333               0.5592
aggttgaat    0.3137               0.3302

The above discussion is what we shall call the first approximation of the average case analysis of the Horspool Algorithm. In this discussion, we ignored one fact: there may be another head to the left of the head in the window. Consider Fig. 5.3-1. The case shown in Fig. 5.3-1 is a special one in which $m = 4$ and $sh_1$, the distance between the two heads, is equal to 3.
Note that at Head 2, there is an exact match between the corresponding characters of T and P. The number of comparisons cannot be 3: the comparison at Head 2 always succeeds, so the fourth comparison is automatically made. We may of course ask the question: under what condition will the characters corresponding to Head 2 be compared? They will be compared if the first two comparisons both yield exact matches. In other words, as soon as the first two comparisons yield exact matches, we will have 4 comparisons.

[Fig. 5.3-1 The case where there are two heads in the window: $m = 4$, the window spans $T(i-m+1, i)$, Head 1 is at location i, and Head 2 lies to its left at distance $sh_1$.]

The expected number of character comparisons is:

$$1\cdot\left(1-\frac{1}{\sigma}\right) + 2\cdot\frac{1}{\sigma}\left(1-\frac{1}{\sigma}\right) + 4\left(\frac{1}{\sigma}\right)^{2} = 1 + \frac{1}{\sigma} + \frac{2}{\sigma^{2}} \tag{5.3-7}$$

We may now ask another question: what is the expected number of character comparisons if there is no Head 2? It will be equal to

$$1 + \frac{1}{\sigma} + \frac{1}{\sigma^{2}} + \frac{1}{\sigma^{3}} \tag{5.3-8}$$

We may now rewrite Equation (5.3-7) as follows:

$$1 + \frac{1}{\sigma} + \frac{1}{\sigma^{2}} + \frac{1}{\sigma^{3}} + \frac{1}{\sigma^{2}} - \frac{1}{\sigma^{3}} \tag{5.3-9}$$

Since $\frac{1}{\sigma^{2}} - \frac{1}{\sigma^{3}} > 0$, we conclude that the expected number of character comparisons increases if there is more than one head in the window.

In the general case, it can be easily derived that the expected number of character comparisons is:

$$
\begin{aligned}
&1 + \frac{1}{\sigma} + \frac{1}{\sigma^{2}} + \cdots + \frac{1}{\sigma^{m-1}} + \frac{1}{\sigma^{sh_1-1}} - \frac{1}{\sigma^{m-1}} \\
&\quad = 1 + \frac{1}{\sigma} + \frac{1}{\sigma^{2}} + \cdots + \frac{1}{\sigma^{m-2}} + \frac{1}{\sigma^{sh_1-1}} \\
&\quad \approx \frac{\sigma}{\sigma-1} + \frac{1}{\sigma^{sh_1-1}}
\end{aligned}
\tag{5.3-10}
$$

if m is reasonably large.

The above discussion only gives the reader some feeling about how to handle the problem where there are two heads in a window. It is simple enough to understand, but the general case is mathematically quite complicated. Since the experimental results show that the first approximation is close enough, the general case of more than one head will not be discussed in this book.
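The first approximation of this section can be recomputed with a few lines of Python. This is only a sketch under the uniform-character assumption stated earlier; the function names and the default alphabet `"acgt"` are ours. `average_comparisons` evaluates the case analysis that leads to Equation (5.3-3) term by term, and `first_approximation` chains Equations (5.3-2), (5.3-1) and (5.3-5).

```python
def location_table(pattern, alphabet="acgt"):
    """Location table of Definition 5.2-1."""
    m = len(pattern)
    table = {ch: m for ch in alphabet}
    for i in range(1, m):                    # 1-based locations of P(1, m-1)
        table[pattern[i - 1]] = m - i
    return table


def average_comparisons(m, sigma):
    """AN(m): exactly k comparisons for k < m, or m comparisons at the end."""
    match, mismatch = 1 / sigma, 1 - 1 / sigma
    total = sum(k * match ** (k - 1) * mismatch for k in range(1, m))
    return total + m * match ** (m - 1)


def first_approximation(pattern, alphabet="acgt"):
    """Expected comparisons per text character, using AN(m) ~ sigma/(sigma-1)."""
    sigma = len(alphabet)
    expected_shift = sum(location_table(pattern, alphabet).values()) / sigma
    head_probability = 1 / expected_shift
    return head_probability * sigma / (sigma - 1)


for p in ("att", "cgtac", "aggttgaat"):
    print(p, round(first_approximation(p), 4))
# prints: att 0.5926, cgtac 0.5333, aggttgaat 0.3137 (one pattern per line)
```

For $m = 4$ and $\sigma = 4$, `average_comparisons(4, 4)` evaluates to 1.328125, which equals $1 + 1/4 + 1/16 + 1/64$ and is already close to the limit $4/3$.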
Section 5.4 Some Variations of the Horspool Algorithm

In this section, we shall introduce four algorithms which are variations of the Horspool Algorithm. They are easy to understand and we shall only give a brief sketch of them.

1. The Raita Algorithm

The Raita Algorithm differs from the Horspool Algorithm in only one aspect: the order in which characters are compared. In the Horspool Algorithm, the comparison starts from the right. The Raita Algorithm has a specified order of character comparison. For instance, it may first compare $w_m$ with $p_m$ and then $w_1$ with $p_1$.

2. The Nebel Algorithm

The Nebel Algorithm also differs from the Horspool Algorithm in only one aspect: the order in which characters are compared. In the Horspool Algorithm, the comparison starts from the right of W. The Nebel Algorithm also has a specified order of character comparison. Let the alphabet set of P be $\Sigma = (x_1, x_2, \ldots, x_\sigma)$. Without losing generality, we may assume that the number of occurrences of $x_i$ in P is the ith smallest. First, it compares the positions of $x_1$ with the corresponding positions of W, and then the positions of $x_2$ with the corresponding positions of W. Finally, it compares the positions of $x_\sigma$ with the corresponding positions of W.

3. The First Sunday Algorithm

Note that for the Horspool Algorithm, we always align the pattern with the last character of the window. The First Sunday Algorithm pays attention to the character next to the window. Consider Fig. 5.4-1. The first shift aligns P with a, the character next to the first window, the second shift aligns P with c, and so on.

T = a c g g t a t c g t a c g t t
P = c g t a c

T = a c g g t a t c g t a c g t t
P =     c g t a c

T = a c g g t a t c g t a c g t t
P =       c g t a c

T = a c g g t a t c g t a c g t t
P =               c g t a c
Fig. 5.4-1 The First Sunday Algorithm

The location table of the First Sunday Algorithm therefore is different from that of the Horspool Algorithm. For instance, for the above example, the location table will be as shown in Table 5.4-1.
Table 5.4-1 The location table for P = cgtac in the First Sunday Algorithm
a  c  g  t
2  1  4  3

If the Horspool Algorithm is used, the location table is shown in Table 5.4-2.

Table 5.4-2 The location table for P = cgtac in the Horspool Algorithm
a  c  g  t
1  4  3  2

4. The Smith Algorithm

The Smith Algorithm combines the First Sunday Algorithm and the Horspool Algorithm. We have two location tables and use whichever gives the longer shift.

Section 5.5 Liu's Algorithm

Liu's Algorithm is much more sophisticated than the Horspool Algorithm although it is still in the spirit of the Horspool Algorithm. Consider the following pattern.

     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
P =  a  g  t  c  c  c  c  c  c  a  g  t  c  g  c  a  c  t

Suppose we use location 18 of the window as a reference. That is, when we shift the pattern, we align a matching character of P with $w_{18}$. The location table will be as shown in Table 5.5-1.

Table 5.5-1 The location table for reference point 18
a  c  g  t
2  1  4  6

The average number of steps of a shift is

$$\frac{2 + 1 + 4 + 6}{4} = \frac{13}{4} = 3.25$$

Suppose we set the reference point to be location 14. Then the location table will be as shown in Table 5.5-2.

Table 5.5-2 The location table for reference point 14
a  c  g  t
4  1  3  2

The average number of steps of a shift is

$$\frac{4 + 1 + 3 + 2}{4} = \frac{10}{4} = 2.5$$

Let the reference point be location 10. Then the location table will be as shown in Table 5.5-3.

Table 5.5-3 The location table for reference point 10
a  c  g  t
9  1  8  7

The average number of steps of a shift will be

$$\frac{9 + 1 + 8 + 7}{4} = \frac{25}{4} = 6.25$$

Liu's Algorithm thus conducts a pre-processing to determine the best reference point. The pre-processing can be implemented quite efficiently. In the following, we shall discuss how to find an optimal reference point for Liu's Algorithm. Let us consider the case as shown below:

     1  2  3  4  5  6  7
P =  a  g  a  g  c  t  a

We start from $i = 2$. From the above, we can see that only the character a appears to the left of $p_2$.
Thus, if $w_2 = a$, we shift 1 step; otherwise, we shift 2 steps. Assume that the alphabet size is $\sigma = 4$. The total number of steps over the four characters is $1 + 3 \times 2 = 7$ for $i = 2$.

We increase i from 2 to 3. We can now see that the characters a and g appear to the left of $p_3$. Since $p_1 = a$, if $w_3 = a$, we shift $3 - 1 = 2$ steps. Since $p_2 = g$, if $w_3 = g$, we shift $3 - 2 = 1$ step. For the other two cases, we shift 3 steps. Thus, the total number of steps is $1 \times 2 + 1 \times 1 + 2 \times 3 = 2 + 1 + 6 = 9$. Since $9 > 7$, we say that $i = 3$ is the best so far.

We increase i from 3 to 4. It can be easily seen that for $i = 4$, the total number of steps is $1 \times 1 + 1 \times 2 + 2 \times 4 = 1 + 2 + 8 = 11$. We conclude that $i = 4$ is the best so far.

Consider the case where $i = 5$. The total number of steps is $1 \times 2 + 1 \times 1 + 2 \times 5 = 2 + 1 + 10 = 13$, so $i = 5$ is the best so far.

Consider the case where $i = 6$. The total number of steps is $1 \times 3 + 1 \times 1 + 1 \times 2 + 1 \times 6 = 3 + 1 + 2 + 6 = 12$. This shows that the best reference point is still $i = 5$.

Finally, for $i = 7$, the total number of steps is $1 \times 4 + 1 \times 2 + 1 \times 3 + 1 \times 1 = 4 + 2 + 3 + 1 = 10$. Thus the final solution is that the optimal reference point is $i = 5$.

There is a very simple trick here. When we move from location i to location $i + 1$, we only have to pay attention to $p_i$ and set its contribution to the shifting to be 1. For all other characters, we merely increase their contributions by 1. Algorithm 5.2 determines the reference point and its related shift table.
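The incremental trick just described can be sketched in Python as follows. The function name `best_reference_point`, the default alphabet `"acgt"`, and the extra `totals` dictionary returned for inspection are our own illustrative choices. The sketch assumes that every character of the pattern belongs to the given alphabet.

```python
def best_reference_point(pattern, alphabet="acgt"):
    """Return (best reference point, its shift table, total per reference point)."""
    d = {ch: 1 for ch in alphabet}          # contribution of each character
    best_i, best_total, best_shift = 1, 0, dict(d)
    totals = {}
    for i in range(2, len(pattern) + 1):    # i is a 1-based reference point
        for ch in d:
            d[ch] += 1                      # everything moves one step further left
        d[pattern[i - 2]] = 1               # p_(i-1) is now adjacent to the reference
        total = sum(d.values())
        totals[i] = total
        if total > best_total:              # keep the first reference point with the largest total
            best_i, best_total, best_shift = i, total, dict(d)
    return best_i, best_shift, totals


i_b, shift, totals = best_reference_point("agagcta")
print(i_b, totals)
# prints 5 {2: 7, 3: 9, 4: 11, 5: 13, 6: 12, 7: 10}
```

Each step costs $O(\sigma)$, so the whole pre-processing takes $O(m\sigma)$ time. A naive recomputation of every table from scratch would cost $O(m)$ per reference point instead.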
Algorithm 5.2 Algorithm for determining $i_b$ and the Shift table for Liu's Algorithm
Input: $P(1, m)$
Output: $i_b$ and the Shift table
$\Sigma = \{a_1, a_2, a_3, \ldots, a_\sigma\}$ /* the set of alphabets */
For $j = 1$ to $\sigma$
    $d(a_j) = 1$
    $Shift(a_j) = 1$
End For
$Max = 0$
$Total = 0$
$i_b = 1$
For $i = 2$ to m
    For $j = 1$ to $\sigma$
        $d(a_j) = d(a_j) + 1$
    End For
    $d(p_{i-1}) = 1$
    $Total = 0$
    For $j = 1$ to $\sigma$
        $Total = Total + d(a_j)$
    End For
    If ($Total > Max$)
        $i_b = i$
        $Max = Total$
        For $j = 1$ to $\sigma$
            $Shift(a_j) = d(a_j)$
        End For
    End If
End For

We now give an example. We are given T = agtgtcagagctaca and P = agagcta.

Preprocessing Phase:

     1  2  3  4  5  6  7
P =  a  g  a  g  c  t  a

Initially,

         a  g  c  t
d     =  1  1  1  1
Shift =  1  1  1  1

and $Max = 0$, $Total = 0$ and $i_b = 1$.

When $i = 2$:

         a  g  c  t
d     =  1  2  2  2

$Total = d(a) + d(g) + d(c) + d(t) = 1 + 2 + 2 + 2 = 7$. Since $Total = 7$ is greater than $Max = 0$, we set Shift = (1, 2, 2, 2), $Max = 7$ and $i_b = 2$.

When $i = 3$:

         a  g  c  t
d     =  2  1  3  3

$Total = 2 + 1 + 3 + 3 = 9$. Since $Total = 9$ is greater than $Max = 7$, we set Shift = (2, 1, 3, 3), $Max = 9$ and $i_b = 3$.

When $i = 4$:

         a  g  c  t
d     =  1  2  4  4

$Total = 1 + 2 + 4 + 4 = 11$. Since $Total = 11$ is greater than $Max = 9$, we set Shift = (1, 2, 4, 4), $Max = 11$ and $i_b = 4$.

When $i = 5$:

         a  g  c  t
d     =  2  1  5  5

$Total = 2 + 1 + 5 + 5 = 13$. Since $Total = 13$ is greater than $Max = 11$, we set Shift = (2, 1, 5, 5), $Max = 13$ and $i_b = 5$.

When $i = 6$:

         a  g  c  t
d     =  3  2  1  6

$Total = 3 + 2 + 1 + 6 = 12$. Here $Total = 12$ is less than $Max = 13$.
We do not need to reset the Shift table, Max or $i_b$.

When $i = 7$:

         a  g  c  t
d     =  4  3  2  1

$Total = d(a) + d(g) + d(c) + d(t) = 4 + 3 + 2 + 1 = 10$. Here $Total = 10$ is less than $Max = 13$. We do not need to reset the Shift table, Max or $i_b$. The pre-processing therefore ends with $i_b = 5$ and Shift = (2, 1, 5, 5) for a, g, c and t respectively.

Section 5.6 The i-largest Number Domination Sequence

As can be seen, the Horspool Algorithm is actually a window sliding algorithm. The average number of steps by which the window is shifted is therefore very important. If, on average, the window is shifted by a large number of steps, the algorithm is efficient. For the Horspool Algorithm, the number of steps of shifting is determined by how the distinct characters are arranged in the pattern P. Let us consider

P = tcaacgtttttttttttttttt.

We can easily see that if the last character of the window is not t, the number of steps of pattern shifting is quite large. On the other hand, suppose that

P = accgtgtacccacgtt.

In this case, no matter what the last character of the window is, the number of steps of the pattern shifting is small.

We are facing an interesting problem. Recall that the alphabet set is $\Sigma = (x_1, x_2, \ldots, x_\sigma)$. Without losing generality, we may assume that when we scan from right to left, starting from $p_{m-1}$, the distinct characters we encounter are ordered as $x_1, x_2, \ldots, x_\sigma$. That is, $p_{m-1} = x_1$. Then the second distinct character we encounter is $x_2$. For example, let P = accgttgtac. Then,

$x_1 = a$
$x_2 = t$
$x_3 = g$
$x_4 = c$

For each $x_i$, we would like to know the location of its rightmost occurrence in $P(1, m-1)$, counted from location $m-1$. Let us denote this as $d_i$. For P = accgttgtac,

$d_1 = 1$
$d_2 = 2$
$d_3 = 3$
$d_4 = 7$

To find the average case performance of the Horspool Algorithm, we have to find the average values of the $d_i$'s. It will be informative for us to code the string $P(1, m-1)$ into a string consisting of positive integers only. Let us code $x_i$ by i.
For P = accgttgtac, the coding is as follows:

a → 1
t → 2
g → 3
c → 4

Thus the original pattern $P(1, m-1)$ becomes 144322321. Let us now reverse it, and we have 123223441. Scanning 123223441 from the left, we find four sequences: the sequence where the first 1 appears and 1 is the largest, the sequence where the first 2 appears and 2 is the largest, and so on, as follows:

$S_1$: 1 (The first 1 appears at location 1 counted from location $m-1$.)
$S_2$: 12 (The first 2 appears at location 2 counted from location $m-1$.)
$S_3$: 123 (The first 3 appears at location 3 counted from location $m-1$.)
$S_4$: 1232234 (The first 4 appears at location 7 counted from location $m-1$.)

We shall point out that these sequences have a common property. Before doing that, let us consider another example:

P = acgtggatcagagaat

It can be seen that the coding is as follows:

a → 1
g → 2
c → 3
t → 4

$P(1, m-1)$ becomes 132422143121211. We reverse the above sequence into 112121341224231. Then we have the following sequences:

$S_1$: 1 (The first 1 appears at location 1 counted from location $m-1$.)
$S_2$: 112 (The first 2 appears at location 3 counted from location $m-1$.)
$S_3$: 1121213 (The first 3 appears at location 7 counted from location $m-1$.)
$S_4$: 11212134 (The first 4 appears at location 8 counted from location $m-1$.)

The physical meaning of each sequence listed above is as follows:

$S_1$: The first distinct character appears at location $m-1$.
$S_2$: The second distinct character appears at location $m-3$.
$S_3$: The third distinct character appears at location $m-7$.
$S_4$: The fourth distinct character appears at location $m-8$.

From the above discussion, we can see that the first distinct character in $P(1, m-1)$, counted from the right, must be located at $m-1$. But the second distinct character may appear anywhere. To analyze the average case performance of the Horspool Algorithm, we must know, on average, where the i-th distinct character appears.
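The coding used in the two examples above is mechanical, and a short Python sketch may make it easier to repeat on other patterns. The function names `reversed_code` and `first_locations` are ours. The digits are returned as a list of integers so that the sketch also works when more than nine distinct characters occur.

```python
def reversed_code(pattern):
    """Code P(1, m-1) by order of first appearance from the right, then reverse."""
    body = pattern[:-1][::-1]        # P(1, m-1), read from location m-1 leftwards
    code, digits = {}, []
    for ch in body:
        if ch not in code:
            code[ch] = len(code) + 1 # x_1, x_2, ... in order of first appearance
        digits.append(code[ch])
    return digits, code


def first_locations(digits):
    """Location (counted from m-1) at which each integer 1, 2, ... first appears."""
    seen = {}
    for loc, value in enumerate(digits, start=1):
        seen.setdefault(value, loc)
    return [seen[i] for i in sorted(seen)]


digits, code = reversed_code("acgtggatcagagaat")
print("".join(map(str, digits)))     # prints 112121341224231
print(first_locations(digits))       # prints [1, 3, 7, 8]
```

The list returned by `first_locations` gives the values $d_1, d_2, \ldots$ directly, which are also the lengths of $S_1, S_2, \ldots$.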
It turns out that this problem can be formulated as the i-largest number domination sequence problem, which will be defined and discussed in this and the next section. We define the i-largest number domination sequence as follows:

Definition 5.6-1 The i-largest number domination sequence

An i-largest number domination sequence is a sequence S consisting of integers 1, 2, …, i satisfying the following conditions:

1. The integer i is the largest in the sequence, appears at the last position and appears only once.
2. For every positive integer k smaller than i, there exists a k-largest number domination sequence in S.

The following sequences are all i-largest number domination sequences for some i:

1
12
123
1234
112
11112
1111112
1223
1213
12223
12221334

Consider the sequence 12223. The integer 3 appears at the last position and is the only 3 in this sequence. In this sequence, the 2-largest number domination sequence is 12 and the 1-largest number domination sequence is 1. We therefore conclude that 12223 is a 3-largest number domination sequence.

The following sequences are not i-largest number domination sequences:

11
121
1122344
1233
213
2213

Consider the sequence 121. The integer 1 appears at the last position, but 1 is not the largest number in this sequence. Thus it is not an i-largest number domination sequence for any i. Consider the sequence 213. In this case, 3 appears as the last character, but there is no 2-largest number domination sequence in it. Therefore it is not an i-largest number domination sequence for any i.

That the i-largest number domination sequence is related to the Horspool Algorithm can be seen by considering the following case: Let P = accgttaccct. We code $P(1, m-1)$ as 2114332111. Let us reverse the coded sequence and get 1112334112. The largest number is 4. So, we consider the sequence up to 4, which is 1112334. This sequence is obviously a 4-largest number domination sequence.
There are in total four i-largest number domination sequences, as follows:

1 (1-largest number domination sequence with length 1)
1112 (2-largest number domination sequence with length 4)
11123 (3-largest number domination sequence with length 5)
1112334 (4-largest number domination sequence with length 7)

The above sequences indicate that the first, second, third and fourth distinct characters in P, counted from $p_{m-1}$ to the left, are located at 1, 4, 5 and 7 respectively.

Consider P = acgaaaccaccttt. The coded sequence of $P(1, m-1)$ is 3243332232211. The reverse of it is 1122322333423. Again, there are four i-largest number domination sequences, as follows:

1 (1-largest number domination sequence with length 1)
112 (2-largest number domination sequence with length 3)
11223 (3-largest number domination sequence with length 5)
11223223334 (4-largest number domination sequence with length 11)

The first, second, third and fourth distinct characters in P, counted from $p_{m-1}$ to the left, are located at 1, 3, 5 and 11 respectively.

To have an average case analysis of the Horspool Algorithm, we are interested in knowing the average location of the ith distinct character counted from $p_{m-1}$ in P. To get this, we first need to solve one problem: for a location L counted from $p_{m-1}$ in P, what is the probability that the ith distinct character occurs at L? If the ith distinct character occurs at L, the reversed coded sequence of $P(m-L, m-1)$ must be an i-largest number domination sequence. Recall that the alphabet size is $\sigma$. Therefore, there are $\sigma^L$ distinct sequences with length L. Some of them are i-largest number domination sequences. Given an i and an L, if we know the number of distinct i-largest number domination sequences, we would know the probability that the ith distinct character occurs at L. In the following section, we shall discuss the i-largest number domination sequence problem, which addresses our concern.
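Definition 5.6-1 can be checked mechanically. The Python sketch below is one possible reading of it, not a definitive implementation: it assumes that "there exists a k-largest number domination sequence in S" means that such a sequence is a prefix of S, which is how the sequences arise from a reversed coded pattern. Under that reading, each new largest integer must exceed the previous largest by exactly 1. The function names are ours, and `count_by_enumeration` is a deliberately naive counter meant only for small i and L.

```python
from itertools import product


def domination_index(seq):
    """Return i if seq is an i-largest number domination sequence, else None."""
    largest = 0
    for value in seq:
        if value == largest + 1:
            largest = value              # first appearance of the next integer
        elif value < 1 or value > largest + 1:
            return None                  # an integer appears before its predecessors
    if not seq or seq[-1] != largest or list(seq).count(largest) != 1:
        return None                      # i must be last and appear only once
    return largest


def count_by_enumeration(i, L):
    """Count the i-largest number domination sequences with length L by brute force."""
    return sum(1 for s in product(range(1, i + 1), repeat=L)
               if domination_index(s) == i)


print(domination_index([1, 2, 2, 2, 3]))  # prints 3
print(domination_index([2, 1, 3]))        # prints None
```

Enumeration visits $i^L$ candidate sequences, so it is only useful as a cross-check on small cases.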
Section 5.7 The i-largest Number Domination Sequence Problem

We first define the i-largest number domination sequence problem as follows:

Definition 5.7-1 The i-largest number domination sequence problem

The i-largest number domination sequence problem is to determine the number of i-largest number domination sequences with length L for a given i and a given L.

For instance, let $i = 3$ and $L = 3$. Then there is only one 3-largest number domination sequence with length 3, namely 123. If $i = 2$ and $L = 3$, there is also only one 2-largest number domination sequence with length 3, namely 112. For $i = 3$ and $L = 4$, there are three 3-largest number domination sequences with length 4, namely 1123, 1213 and 1223.

To solve the i-largest number domination sequence problem, let $D(i, L)$ be the number of all i-largest number domination sequences with length L. We first present some boundary conditions:

1. $D(i, i) = 1$ for all i.
2. $D(1, L) = 0$ if $L > 1$.
3. $D(i, L) = 0$ if $L < i$.
4. $D(2, L) = 1$ for $L \ge 2$.

Now, let us consider $D(4, 6)$. In this case, the length of the sequence is 6 and the largest number of this sequence is 4. Therefore, the sequence must be of the form

$1\, s_2\, s_3\, s_4\, s_5\, 4$

As for $s_5$, it cannot be 4, by definition. Thus it can be 1, 2 or 3. There are two possible cases:

Case 1: $1 s_2 s_3 s_4 s_5$ is a 3-largest number domination sequence. In this case, $1 s_2 s_3 s_4 s_5 = 1 s_2 s_3 s_4 3$. For instance, 11213 is such a sequence, and there are $D(3, 5)$ such sequences.

Case 2: $1 s_2 s_3 s_4 s_5$ is not a 3-largest number domination sequence. In this case, $s_5$ must be one of 1, 2 or 3. For instance, 11231, 12312 and 11233 are all such sequences. They have a common property: if the last character is replaced by 4, they all become 4-largest number domination sequences. For instance, 11231, 12312 and 11233 become 11234, 12314 and 11234 respectively, and they are now all 4-largest number domination sequences.
We may classify all of the Case 2 sequences $1 s_2 s_3 s_4 s_5$ into three classes: Class 1: $s_5 = 1$, Class 2: $s_5 = 2$ and Class 3: $s_5 = 3$. For each class, there are $D(4, 5)$ sequences. Therefore, for Case 2, there are $3D(4, 5)$ sequences. Combining Case 1 and Case 2, we may conclude that

$$D(4,6) = D(3,5) + 3D(4,5) \tag{5.7-1}$$

In general, suppose $S = s_1 s_2 \cdots s_L$ is an i-largest number domination sequence. There are $D(i-1, L-1)$ sequences of the form $s_1 s_2 \cdots s_{L-1}$ which are $(i-1)$-largest number domination sequences and $(i-1)D(i, L-1)$ sequences of the form $s_1 s_2 \cdots s_{L-1}$ which are not $(i-1)$-largest number domination sequences. Therefore, we have:

$$D(i, L) = D(i-1, L-1) + (i-1)D(i, L-1) \quad \text{for } L > i \tag{5.7-2}$$

with the following boundary conditions:

$$D(i, i) = 1 \text{ for all } i. \tag{5.7-3}$$
$$D(1, L) = 0 \text{ if } L > 1. \tag{5.7-4}$$
$$D(i, L) = 0 \text{ if } L < i. \tag{5.7-5}$$
$$D(2, L) = 1 \text{ for } L \ge 2. \tag{5.7-6}$$

For instance,

$$
\begin{aligned}
D(4,6) &= D(3,5) + 3D(4,5) \\
&= D(2,4) + 2D(3,4) + 3\bigl(D(3,4) + 3D(4,4)\bigr) \\
&= D(1,3) + D(2,3) + 2\bigl(D(2,3) + 2D(3,3)\bigr) + 3\bigl(D(2,3) + 2D(3,3)\bigr) + 9D(4,4) \\
&= 6D(2,3) + 10D(3,3) + 9D(4,4) \\
&= 6\bigl(D(1,2) + D(2,2)\bigr) + 10D(3,3) + 9D(4,4) \\
&= 6 + 10 + 9 \\
&= 25
\end{aligned}
\tag{5.7-7}
$$

In the following section, we shall show how to apply the i-largest number domination sequence problem to the average case analysis of the Horspool Algorithm.

Section 5.8 Application of the i-largest Number Domination Sequence Problem to the Average Case Analysis of the Horspool Algorithm

For the Horspool Algorithm, the location of the ith distinct character in $P(1, m-1)$, counted from location $m-1$, is very important. If the last character x of the window W is equal to the ith distinct character, the number of steps of shifting the pattern P is exactly equal to this location. We only know that the first distinct character must be located at location $m-1$. The ith distinct character, where $i > 1$, may appear anywhere in $P(1, m-2)$.
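Since the rest of this section relies on values of $D(i, L)$, it may help to see Equation (5.7-2) and its boundary conditions evaluated mechanically. The following is a minimal memoized Python sketch. The function name `D` mirrors the notation of the text, and the use of `functools.lru_cache` is simply our choice for avoiding repeated work.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def D(i, L):
    """Number of i-largest number domination sequences with length L."""
    if i < 1 or L < i:
        return 0                    # Equation (5.7-5)
    if L == i:
        return 1                    # Equation (5.7-3)
    if i == 1:
        return 0                    # Equation (5.7-4), since L > 1 here
    # Equation (5.7-2); Equation (5.7-6) follows from it and needs no special case.
    return D(i - 1, L - 1) + (i - 1) * D(i, L - 1)


print(D(4, 6))  # prints 25
```

With memoization, each pair $(i, L)$ is computed once, so a table of all values up to given bounds on i and L costs time proportional to the number of entries.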
We are thus interested in the average location of the ith distinct character in $P(1, m-2)$. As pointed out before, if the ith distinct character appears at location L, counted from location $m-1$, then the reverse of the coded sequence of $P(m-L, m-1)$ must be an i-largest number domination sequence. The number of i-largest number domination sequences with length L is denoted as $D(i, L)$. Let $\sigma$ be the alphabet size. Then there are $\sigma^L$ sequences with length L. The probability that one random sequence with length L is an i-largest number domination sequence is denoted as $A(i, L)$. It can be found by the following formula:

$$A(i, L) = \frac{D(i, L)}{\sigma^{L}} \tag{5.8-1}$$

It is obvious that $A(i, L)$ is also the probability that the ith distinct character appears at location L, counted from location $m-1$.

In the above section, we have given a recursive formula, expressed as Equation (5.7-2), to determine $D(i, L)$. Unfortunately, a closed formula based upon Equation (5.7-2) is still unavailable. But, based upon Equation (5.7-2) and all of the boundary conditions, we can compute the $D(i, L)$'s for a limited range of i and L. Let us consider the case where $i = 4$, $L = 6$ and $\sigma = 4$. Then, from Equations (5.7-7) and (5.8-1), we have

$$\frac{D(i, L)}{\sigma^{L}} = \frac{D(4, 6)}{4^{6}} = \frac{25}{4096} = 0.006.$$

When the ith distinct character is equal to x, the last character of W, its shift is equal to L. The reverse of the coded sequence of $P(m-L, m-1)$ must be an i-largest number domination sequence. The average number of steps for a shift for every ith distinct character in a random pattern with length m is

$$\sum_{L=i}^{m-1} L \cdot A(i, L) = \sum_{L=i}^{m-1} L \cdot \frac{D(i, L)}{\sigma^{L}} \tag{5.8-2}$$

If x does not occur in $P(1, m-1)$, then the shift is m. For example, suppose the last character of W in T is coded as 4 and $P(1, m-1) = 33211$. Then $m = 6$ and the shift is 6. However, the reversed sequence 11233 does not conform to the definition of an i-largest number domination sequence. How do we conquer this difficulty?
Although the sequence 11233 is not a 3-largest number domination sequence, it contains a 3-largest number domination sequence, namely 1123. Therefore, if we insert 4 after the last character of this sequence, it becomes 112334, which is a 4-largest number domination sequence with length 6. Since there are $D(4,6)$ 4-largest number domination sequences with length 6, we know there are $D(4,6)$ sequences, each of which satisfies the following conditions:

(1) It contains a 3-largest number domination sequence.
(2) It does not contain 4.
(3) Its length is $6 - 1 = 5$.

For instance, 111234 is a 4-largest number domination sequence with length 6, and 11123 is a possible sequence with length 5 which does not contain 4. Let us give another example, namely 123334. We can see that 12333 can be a coded sequence which does not contain 4.

Based upon the above reasoning, assuming that there are only $i-1$ distinct characters in $P(1, m-1)$, the number of sequences which can be possible $P(1, m-1)$'s containing only $i-1$ distinct characters is $D(i, m)$. The probability that a sequence falls into such a category is:

$$\frac{D(i, m)}{\sigma^{m-1}} \tag{5.8-3}$$

Thus, the average number of steps for a shift for every $i$th distinct character is:

$$\sum_{L=i}^{m-1} L \cdot \frac{D(i, L)}{\sigma^L} + m \cdot \frac{D(i, m)}{\sigma^{m-1}} \tag{5.8-4}$$

Because the alphabet size is $\sigma$ and the average number of steps for a shift of the first distinct character is 1, the average number of steps for a shift is

$$\frac{1}{\sigma} \sum_{i=1}^{\sigma} \left( \sum_{L=i}^{m-1} L \cdot \frac{D(i, L)}{\sigma^L} + m \cdot \frac{D(i, m)}{\sigma^{m-1}} \right) \tag{5.8-5}$$

The alphabet contains $\sigma$ distinct characters. Hence there are $\sigma$ choices for the first distinct character, $\sigma - 1$ choices for the second distinct character, and $\sigma - i + 1$ choices for the $i$th distinct character. Each $i$-largest number domination sequence therefore corresponds to $P(\sigma, i)$ character strings, and there are

$$D(i, L)\,P(\sigma, i) \tag{5.8-6}$$

such strings of length $L$, where $P(\sigma, i) = \dfrac{\sigma!}{(\sigma - i)!}$.
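The two counting claims above can be sanity-checked by brute force for small parameters. The sketch below rests on two assumptions that are not restated in this section: a coded sequence numbers the distinct characters 1, 2, 3, ... in order of first appearance (as the examples 11233 and 12333 suggest), and a string corresponds to an $i$-largest number domination sequence when it has exactly $i$ distinct characters and its last character does not occur earlier. All helper names are hypothetical.

```python
from functools import lru_cache
from itertools import product
from math import perm


@lru_cache(maxsize=None)
def D(i: int, L: int) -> int:
    """D(i, L) from recurrence (5.7-2) and its boundary conditions."""
    if i < 1 or L < i:
        return 0
    if L == i:
        return 1
    if i == 1:
        return 0
    return D(i - 1, L - 1) + (i - 1) * D(i, L - 1)


def coded_form(seq):
    """Relabel symbols 1, 2, 3, ... in order of first appearance (assumed coding)."""
    labels = {}
    return tuple(labels.setdefault(c, len(labels) + 1) for c in seq)


def count_coded_with_distinct(length: int, k: int) -> int:
    """Count coded sequences of the given length that use exactly k labels."""
    coded = {coded_form(s) for s in product(range(k), repeat=length)}
    return sum(1 for c in coded if max(c) == k)


def count_strings(i: int, L: int, sigma: int) -> int:
    """Count strings of length L over sigma symbols that have exactly i distinct
    symbols and whose last symbol does not occur earlier (assumed reading of an
    i-largest number domination sequence)."""
    return sum(
        1
        for s in product(range(sigma), repeat=L)
        if len(set(s)) == i and s[-1] not in s[:-1]
    )


if __name__ == "__main__":
    # Coded sequences of length 5 with exactly 3 labels, compared with D(4, 6)
    print(count_coded_with_distinct(5, 3), D(4, 6))
    # Strings of length 6 over 4 symbols, compared with D(4, 6) * P(4, 4), as in (5.8-6)
    print(count_strings(4, 6, 4), D(4, 6) * perm(4, 4))
```

Under these assumptions the enumerated counts agree with $D(i, m)$ and with $D(i, L)\,P(\sigma, i)$ for the small cases tried here. This is only a spot check, not a proof.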
In other words, if we are given a general pattern, the average number of steps for a shift, denoted as $AS$, is:

$$AS = \frac{1}{\sigma} \sum_{i=1}^{\sigma} P(\sigma, i) \left( \sum_{L=i}^{m-1} L \cdot \frac{D(i, L)}{\sigma^L} + m \cdot \frac{D(i, m)}{\sigma^{m-1}} \right) \tag{5.8-7}$$

If the length of $P$ is 11 and $\sigma = 4$, the average number of steps for a shift is 3.30. Our experimental result is 3.38. The two values are reasonably close.

The advantage of the Horspool Algorithm is that it is easy to program and uses a very small amount of memory, because only a small location table is needed. The pre-processing is also very simple. Its weakness is that the number of steps per shift is small. We do not need a theoretical analysis to reach this conclusion. Given a random pattern, it is very unlikely that the distinct characters all occur at the left side of the pattern. It is much more likely that they will appear close to the last character of the pattern. Thus we can hardly expect large shifts. Liu's Algorithm is an improvement of the Horspool Algorithm.
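The remark that the Horspool Algorithm is easy to program can be illustrated with a short sketch. The Python code below is an illustration written for this text, not the authors' implementation. It builds the location table of Definition 5.2-1 and applies Rule E2-2 after every window. The function names are hypothetical, and match positions are reported 1-based to agree with the chapter's notation.

```python
def location_table(pattern: str) -> dict:
    """Location table of Definition 5.2-1 for the characters in P(1, m-1).

    Characters that do not occur in P(1, m-1) are absent from the table;
    the caller uses the shift m for them.
    """
    m = len(pattern)
    table = {}
    # A later (more rightward) occurrence overwrites an earlier one, so each
    # entry ends as the location of the rightmost occurrence in P(1, m-1),
    # counted from location m-1, which is m - pos.
    for pos, ch in enumerate(pattern[:-1], start=1):
        table[ch] = m - pos
    return table


def horspool_search(text: str, pattern: str) -> list:
    """Return the 1-based start positions of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    table = location_table(pattern)
    matches = []
    i = 0  # 0-based start of the current window
    while i + m <= n:
        window = text[i:i + m]
        # Compare from right to left, stopping at the first mismatch
        k = m - 1
        while k >= 0 and window[k] == pattern[k]:
            k -= 1
        if k < 0:
            matches.append(i + 1)
        # Rule E2-2: shift by the table entry of the last window character
        i += table.get(window[-1], m)
    return matches


if __name__ == "__main__":
    print(location_table("aggttgaat"))            # {'a': 1, 'g': 3, 't': 4}
    print(horspool_search("acgttattgacc", "att"))  # [6]
```

The table holds one entry per distinct character of $P(1, m-1)$, so its size is bounded by the alphabet size, which is the small memory footprint mentioned above.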
