Chapter 5

Rule E2-2 and the Horspool Algorithm
In Chapter 4, we introduced Rule E2, which is a substring matching rule. Given a
substring S in the window W of the text string T, we try to find a substring
identical to S which lies to the left of S in the pattern string P.
In Chapter 4, we also introduced a variant of Rule E2, namely Rule E2-1, in which the
substring is the longest suffix of W which is equal to a prefix of P. In this chapter, we
shall introduce another variant of Rule E2.
Section 5.1 Rule E2-2: The 1-Suffix Rule
Consider Fig. 5.1-1. Note that the last character of W is x. If we have to move
the pattern P, we find the rightmost x in P to the left of its last character, if
such an x exists, and move P so that this x is aligned with the x in W, as shown in
Fig. 5.1-1(b). If no such x exists in P, we move P as shown in Fig. 5.1-1(c).
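[Figure: (a) the window W ends with x, and P contains an x to the left of its last
character; (b) P is moved so that this x is aligned with the x in W; (c) P contains
no such x and is moved past the x in W.]
Fig. 5.1-1 The Basic Idea of Rule E2-2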
The following is a formal statement of Rule E2-2.
Rule E2-2: We are given a text string T, a pattern string P of length m, and a
window W = T(a, b) of T, which is aligned with P. Assume that the last character of
W is x and we have to shift P. If x exists in P(1, m-1), let i(x) be the
location of the rightmost x in P(1, m-1), and shift P to such an extent that
$p_{i(x)}$ is aligned with $t_b$. If no x exists in P(1, m-1), shift P to such an
extent that $p_1$ is aligned with $t_{b+1}$.
Section 5.2 The Horspool Algorithm
The Horspool Algorithm starts with scanning from the right as shown in Fig. 5.2-1.
For each pair of characters in the text and pattern, we compare them until we find a
mismatch. After we find a mismatch, we know that we should shift the pattern to the
right. The shifting is based upon Rule E2-2.
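[Figure: P is aligned with a window of T; characters are compared from right to left
until a mismatching pair, X in T and Y in P, is found.]
Fig. 5.2-1 The right to left scanning in the Horspool Algorithm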
To implement Rule E2-2, we must have a mechanism to find i(x) for a given
x. This is not done at run time. Instead, it is done in a preprocessing phase.
Definition 5.2-1 The Location Table for the Horspool Algorithm
Given an alphabet set $\Sigma = (x_1, x_2, \ldots, x_\sigma)$ and a pattern P with length m, we
create a table, denoted as the location table of P, containing $\sigma$ entries. Each
entry stores the location of the rightmost $x_i$, $1 \le i \le \sigma$, in P(1, m-1), counted
from location m-1 (so that location m-1 itself counts as 1), if it exists. If $x_i$
does not exist in P(1, m-1), store m in the entry.
Example 5.2-1
Let P = aggttgaat. The location table of P is displayed as follows:
Table 5.2-1 The location table of P = aggttgaat
a c g t
1 9 3 4
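The construction in Definition 5.2-1 can be sketched in a few lines of Python. This is only an illustration, not part of the original text: the function name `location_table` and the choice of a dictionary keyed by character are assumptions made here.

```python
def location_table(pattern, alphabet):
    """Sketch of the Horspool location table of Definition 5.2-1.

    For each character x, the entry is the location of the rightmost x in
    P(1, m-1) counted from location m-1; it is m if x does not occur there.
    (Function name and dict representation are illustrative assumptions.)
    """
    m = len(pattern)
    table = {x: m for x in alphabet}
    # Scan P(1, m-1) from left to right; a later (more rightward)
    # occurrence overwrites an earlier one.
    for pos, ch in enumerate(pattern[:-1], start=1):
        table[ch] = m - pos
    return table


print(location_table("aggttgaat", "acgt"))  # {'a': 1, 'c': 9, 'g': 3, 't': 4}
```

The printed entries agree with Table 5.2-1.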
Each time we have to move the pattern P, we consult the location table of P.
For instance, consider the case shown in Fig. 5.2-1.
T = a c c g a g g t t g a a t t g c
P = a g g t t g a a t
Fig. 5.2-1 An example for the Horspool Algorithm
As can be seen, we have to move P. The last character of the window is t.
There are two t's in P(1,8), at locations 4 and 5. We consult
Table 5.2-1. The entry of t is 4. We therefore move P 4 steps to the right as
shown in Fig. 5.2-2. A match is now found.
T = a c c g a g g t t g a a t t g c
P =         a g g t t g a a t
Fig. 5.2-2 The moving of P in the Horspool Algorithm
The Horspool Algorithm is very similar to the Reverse Factor Algorithm. It is
now given as Algorithm 5.1 below:
Algorithm 5.1 The Horspool Algorithm Based upon Rule E2-2
Input: A text string T and a pattern string P with lengths n and m
respectively
Output: All occurrences of P in T.
Construct the location table of P.
Set i = 1.
Step 1: Let W = T(i, i+m-1) be a window.
    Align P with W.
    Set j = i+m-1 and k = m.
    If j > n, exit.
    While $j \ge i$ and $t_j = p_k$
        j = j-1
        k = k-1
    End of While
    If j < i, report that W = T(i, i+m-1) is an exact match of P.
    Find the entry of $t_{i+m-1}$ in the location table of P. Let it be denoted as d.
    i = i+d.
    Go to Step 1.
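Algorithm 5.1 can be sketched in Python as follows. This is a minimal illustration under a few assumptions that are not in the original text: the function name `horspool_search`, the default DNA alphabet, and the convention that a text character outside the alphabet shifts the pattern by m.

```python
def horspool_search(text, pattern, alphabet="acgt"):
    """Return the 1-based start locations of all occurrences of pattern in text.

    A sketch of Algorithm 5.1; names and defaults are illustrative assumptions.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Location table of Definition 5.2-1.
    shift = {x: m for x in alphabet}
    for pos, ch in enumerate(pattern[:-1], start=1):
        shift[ch] = m - pos

    matches = []
    i = 0  # 0-based start of the current window
    while i + m <= n:
        k = m - 1
        # Compare from right to left until a mismatch or the window is exhausted.
        while k >= 0 and text[i + k] == pattern[k]:
            k -= 1
        if k < 0:
            matches.append(i + 1)
        # Rule E2-2: shift by the entry of the last character of the window.
        i += shift.get(text[i + m - 1], m)
    return matches


print(horspool_search("accgaggttgaattgc", "aggttgaat"))  # [5]
```

The call above reproduces the situation of Fig. 5.2-2, where the match is found after P has been moved 4 steps.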
Example 5.2-2
Let T = acgttattgacc and P = att. The location table is as shown in Table 5.2-2.
Table 5.2-2 The location table for P = att
a c g t
2 3 3 1
The Horspool Algorithm is initialized as shown in Fig. 5.2-3.
T = a a g t t a t t g a c
P = a t t
Fig. 5.2-3 The initial alignment for Example 5.2-2
The last character of the window is g, which is not found in P(1,2). We move P
3 steps as shown in Fig. 5.2-4.
T = a a g t t a t t g a c
P =       a t t
Fig. 5.2-4 The first movement of P in Example 5.2-2
The last character of the window is a, which is found in P(1,2). From Table 5.2-2,
we know that we should move P 2 steps as shown in Fig. 5.2-5.
T = a a g t t a t t g a c
P =           a t t
Fig. 5.2-5 The second movement of P in Example 5.2-2
A match is found. The last character of the window is t, whose entry in Table 5.2-2
is 1, so P is moved 1 step as shown in Fig. 5.2-6.
T = a a g t t a t t g a c
P =             a t t
Fig. 5.2-6 The third movement of P in Example 5.2-2
The last character of the window, namely g, does not exist in P(1,2). P is moved
3 steps as shown in Fig. 5.2-7.
T = a a g t t a t t g a c
P =                   a t t
Fig. 5.2-7 The fourth movement of P in Example 5.2-2
Section 5.3 The Time-Complexity Analysis of the Horspool Algorithm
The worst case time-complexity of the Horspool Algorithm is easy to obtain. Let
$\sigma$ denote the alphabet size. Then:
The preprocessing phase takes O(m) time and $O(\sigma)$ space.
The searching phase takes O(mn) time.
Before we analyze the average number of comparisons of this algorithm, we
must state our probabilistic assumptions. We assume that the distribution of the
characters occurring in T or in P is uniform. That is, the random variable X of the
characters, ranging over the $\sigma$-alphabet A, satisfies, for any a in A,
$$\Pr(X = a) = \frac{1}{\sigma}.$$
We shall assume that the given pattern string and the text string are random.
We first define a term, called “head”.
Definition 5.3-1 The last character of a window is called a head.
To obtain an average case analysis of the Horspool Algorithm, we must know the
probability that a location k of the text T is a head, denoted as $H_k$. It is
intuitive and correct that $H_k$ is the same for all k, because there is no reason that
one location should differ from the others so far as being a head is concerned. To find
$H_k$, we denote the average number of steps of a shift by $E(\mathit{shift})$. With this
term, we may easily have the following equation:
$$H_k = \frac{1}{E(\mathit{shift})} \qquad (5.3-1)$$
Let us imagine that $E(\mathit{shift}) = 1$. Then obviously, every location will be a head.
Suppose that $E(\mathit{shift}) = 2$. It will be expected that half of the locations in T will
be heads. If the number of steps is large on average, then a small number of
locations in T will be heads.
Let L(i) denote the value stored in the ith entry of the Location Table. Then
we have
$$E(\mathit{shift}) = \frac{1}{\sigma} \sum_{i=1}^{\sigma} L(i). \qquad (5.3-2)$$
For example, for the Location Table shown in Table 5.2-2,
$$E(\mathit{shift}) = \frac{1}{4}(1 + 3 + 3 + 2) = 2.25.$$
To obtain the average case time-complexity of the Horspool Algorithm, we must
have the average number of character comparisons for a window. Let AN(m)
denote the average number of character comparisons for a window with size m.
Then we can reason as follows:
(1) The first comparison of characters is a mismatch. In this case, there is 1
comparison, and its contribution to the expected number of character comparisons is
$1 \cdot \frac{\sigma-1}{\sigma}$, as $\frac{\sigma-1}{\sigma}$ is the probability
that the first comparison yields a mismatch.
(2) The first comparison is a match and the second comparison of characters is a
mismatch. In this case, there are 2 comparisons, and the contribution is
$2 \cdot \frac{1}{\sigma} \cdot \frac{\sigma-1}{\sigma}$. Note that $\frac{1}{\sigma}$
is the probability that the first comparison yields a match.
...
(m) The first (m-1) comparisons all yield matchings. Then there will be m
comparisons in total, and this happens with probability $\left(\frac{1}{\sigma}\right)^{m-1}$.
Based upon the above reasoning, we have:
$$
\begin{aligned}
AN(m) &= 1 \cdot \frac{\sigma-1}{\sigma} + 2 \cdot \frac{1}{\sigma} \cdot \frac{\sigma-1}{\sigma} + \cdots + m \left(\frac{1}{\sigma}\right)^{m-1} \\
&= 1 \left(1 - \frac{1}{\sigma}\right) + 2 \cdot \frac{1}{\sigma} \left(1 - \frac{1}{\sigma}\right) + \cdots + m \left(\frac{1}{\sigma}\right)^{m-1} \\
&= 1 - \frac{1}{\sigma} + \frac{2}{\sigma} - \frac{2}{\sigma^2} + \cdots + \frac{m}{\sigma^{m-1}} \\
&= 1 + \frac{1}{\sigma} + \frac{1}{\sigma^2} + \cdots + \frac{1}{\sigma^{m-1}} \\
&= \frac{1 - \frac{1}{\sigma^m}}{1 - \frac{1}{\sigma}} \\
&\approx \frac{\sigma}{\sigma-1}
\end{aligned}
\qquad (5.3-3)
$$
when m is reasonably large.
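The telescoping step in Equation (5.3-3) can be checked numerically. The following sketch uses exact rational arithmetic; the function names `an_direct` and `an_closed` are illustrative and do not appear in the original text.

```python
from fractions import Fraction


def an_direct(m, sigma):
    """AN(m) summed term by term, as in the first line of Equation (5.3-3)."""
    q = Fraction(1, sigma)  # probability that one comparison is a match
    # k comparisons: k-1 matches followed by one mismatch, for k = 1 .. m-1.
    total = sum(k * q ** (k - 1) * (1 - q) for k in range(1, m))
    # m comparisons: the first m-1 comparisons are all matches.
    return total + m * q ** (m - 1)


def an_closed(m, sigma):
    """The geometric-series form (1 - 1/sigma^m) / (1 - 1/sigma)."""
    q = Fraction(1, sigma)
    return (1 - q ** m) / (1 - q)


for m in (3, 5, 9):
    print(m, float(an_direct(m, 4)), float(an_closed(m, 4)))
```

For $\sigma = 4$ the two forms agree exactly and approach $\sigma/(\sigma-1) = 4/3$ quickly as m grows.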
Let us denote the expected number of character comparisons for a text string
with length n and a pattern string with length m by $C_n$. Then,
$$C_n = n H_k \, AN(m) \approx n H_k \frac{\sigma}{\sigma-1}. \qquad (5.3-4)$$
The expected number of character comparisons per character, for a text string with
length n and a pattern string with length m, is therefore:
$$\frac{C_n}{n} = H_k \, AN(m) \approx H_k \frac{\sigma}{\sigma-1}. \qquad (5.3-5)$$
For the case of the Location Table 5.2-2, we have:
$$\frac{C_n}{n} \approx \frac{1}{2.25} \cdot \frac{4}{4-1} = \frac{4}{6.75} \approx 0.59. \qquad (5.3-6)$$
The above result is obtained under the assumption that the pattern string is
given and fixed, as we stated at the very beginning. We must understand that
this is not a very good average case analysis, because it fails to give an analysis
based upon the assumption that the pattern is random. In the following, we
show some experiments on the average number of character comparisons per character.
For each of the following three pattern strings, we randomly generated 500 text
strings with length 1000. The average number of character comparisons per character
is shown below. It shows that the theoretical result is quite close to the
experimental results.
P            Theoretical result   Experimental result
att          0.5925               0.6031
cgtac        0.5333               0.5592
aggttgaat    0.3137               0.3302
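The theoretical column can be recomputed from Equations (5.3-1), (5.3-2) and (5.3-5). The sketch below assumes the four-letter alphabet a, c, g, t used in the examples; the function name `expected_comparisons_per_char` is illustrative only.

```python
def expected_comparisons_per_char(pattern, alphabet="acgt"):
    """First-approximation value of C_n / n for a fixed pattern (Equation 5.3-5)."""
    m, sigma = len(pattern), len(alphabet)
    # Location table of Definition 5.2-1.
    table = {x: m for x in alphabet}
    for pos, ch in enumerate(pattern[:-1], start=1):
        table[ch] = m - pos
    e_shift = sum(table[x] for x in alphabet) / sigma  # Equation (5.3-2)
    h_k = 1 / e_shift                                  # Equation (5.3-1)
    return h_k * sigma / (sigma - 1)                   # Equation (5.3-5)


for p in ("att", "cgtac", "aggttgaat"):
    print(p, round(expected_comparisons_per_char(p), 4))
```

The values agree with the theoretical results in the table above up to rounding in the last digit.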
The above discussion is what we shall call the first approximation of the average
case analysis of the Horspool Algorithm. In this discussion, we ignored one fact:
There may be another head to the left of the head in the window. Consider Fig. 5.3-1.
The case shown in Fig. 5.3-1 is a special one in which m = 4 and $sh_1$, the distance
between the two heads, is equal to 3. Note that at Head 2, there is an exact matching
between the corresponding characters of T and P. The number of comparisons cannot
be 3, because as soon as the comparison at Head 2 is done, the fourth comparison
will automatically be done. We may of course ask the
question: Under what condition will the characters corresponding to Head 2 be
compared? They will be compared if the first two comparisons both yield exact
matchings. In other words, as soon as the first two comparisons yield exact
matchings, we will have 4 comparisons.
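[Figure: a window of T with m = 4, extending from location i-m+1 to location i.
Head 1 is at location i and Head 2 lies to its left, with $sh_1$ the distance between
them; the character a appears at Head 2 in both T and P.]
Fig. 5.3-1 The case where there are two heads in the window.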
The expected number of character comparisons is:
$$1 \cdot \frac{\sigma-1}{\sigma} + 2 \cdot \frac{1}{\sigma} \cdot \frac{\sigma-1}{\sigma} + 4 \cdot \frac{1}{\sigma} \cdot \frac{1}{\sigma} = 1 + \frac{1}{\sigma} + \frac{2}{\sigma^2} \qquad (5.3-7)$$
We may now ask another question: what is the expected number of character
comparisons if there is no Head 2? It will be equal to
$$1 + \frac{1}{\sigma} + \frac{1}{\sigma^2} + \frac{1}{\sigma^3} \qquad (5.3-8)$$
We may now rewrite Equation (5.3-7) as follows:
$$1 + \frac{1}{\sigma} + \frac{1}{\sigma^2} + \frac{1}{\sigma^3} + \frac{1}{\sigma^2} - \frac{1}{\sigma^3} \qquad (5.3-9)$$
Since $\frac{1}{\sigma^2} - \frac{1}{\sigma^3} > 0$, we mathematically conclude that the
expected number of character comparisons will be increased if there is more than one
head in the window.
In the general case, it can be easily derived that the expected number of character
comparisons is:
$$1 + \frac{1}{\sigma} + \frac{1}{\sigma^2} + \cdots + \frac{1}{\sigma^{m-1}} + \frac{1}{\sigma^{sh_1-1}} - \frac{1}{\sigma^{m-1}} \approx \frac{\sigma}{\sigma-1} + \frac{1}{\sigma^{sh_1-1}} \qquad (5.3-10)$$
if m is reasonably large.
The above discussion only gives the reader some feeling for how to handle the
problem where there are two heads in a window. It is simple enough to understand,
but the general case is quite complicated mathematically. Since the experimental
results show that the first approximation is close enough, the general case of more
than one head will not be discussed in this book.
Section 5.4 Some Variations of the Horspool Algorithm
In this section, we shall introduce four algorithms which are variations of the
Horspool Algorithm. They are easy to understand and we shall only give a brief
sketch of them.
1. The Raita Algorithm
The Raita Algorithm is different from the Horspool Algorithm in only one aspect:
the order of comparison of characters. In the Horspool Algorithm, the comparison
starts from the right. The Raita Algorithm instead has a specified order of character
comparison. For instance, it may first compare $w_m$ with $p_m$ and then $w_1$ with
$p_1$.
2. The Nebel Algorithm
The Nebel Algorithm is also different from the Horspool Algorithm in only one
aspect: the order of comparison of characters. In the Horspool Algorithm, the
comparison starts from the right of W. The Nebel Algorithm also has a specified
order of character comparison. Let the alphabet set of P be $\Sigma = (x_1, x_2, \ldots, x_\sigma)$.
Without losing generality, we may assume that the number of occurrences of $x_i$ in P is
the ith smallest. First, it compares the positions of $x_1$ with the corresponding
positions of W, and then the positions of $x_2$ with the corresponding positions of W.
Finally, it compares the positions of $x_\sigma$ with the corresponding positions of W.
3. The First Sunday Algorithm
Note that for the Horspool Algorithm, we always align the pattern with the last
character of the window. The First Sunday Algorithm pays attention to the character
immediately to the right of the window. Consider Fig. 5.4-1. The first shift aligns P
with the a which follows the first window, the second shift aligns P with the c which
follows the second window, and so on.
T = a c g g t a t c g t a c g t t
P = c g t a c
T = a c g g t a t c g t a c g t t
P =     c g t a c
T = a c g g t a t c g t a c g t t
P =       c g t a c
T = a c g g t a t c g t a c g t t
P =               c g t a c
Fig. 5.4-1 The First Sunday Algorithm
The location table of the First Sunday Algorithm therefore is different from that
of the Horspool Algorithm. For instance, for the above example, the location table
will be as shown in Table 5.4-1.
Table 5.4-1 The location table for P = cgtac in the First Sunday Algorithm
a c g t
2 1 4 3
If the Horspool Algorithm is used, the location table is as shown in Table 5.4-2.
Table 5.4-2 The location table for P = cgtac in the Horspool Algorithm
a c g t
1 4 3 2
4. The Smith Algorithm
The Smith Algorithm combines the First Sunday Algorithm and the Horspool
Algorithm. We have two location tables, and we use whichever gives us the longer
shift.
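The two location tables and their combination can be sketched as follows. This is an illustration only: the function names are hypothetical, and the entry m+1 for a character that does not occur in P in the First Sunday table is an assumption (the usual convention), since Table 5.4-1 only lists characters that occur in P.

```python
def horspool_table(pattern, alphabet):
    """Location table keyed on the last character of the window."""
    m = len(pattern)
    table = {x: m for x in alphabet}
    for pos, ch in enumerate(pattern[:-1], start=1):
        table[ch] = m - pos
    return table


def sunday_table(pattern, alphabet):
    """Location table keyed on the character just to the right of the window."""
    m = len(pattern)
    table = {x: m + 1 for x in alphabet}  # assumed entry for absent characters
    for pos, ch in enumerate(pattern, start=1):
        table[ch] = m + 1 - pos
    return table


def smith_search(text, pattern, alphabet="acgt"):
    """Sketch of the combined rule: use whichever table gives the longer shift."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    hs = horspool_table(pattern, alphabet)
    sd = sunday_table(pattern, alphabet)
    matches, i = [], 0
    while i + m <= n:
        if text[i:i + m] == pattern:
            matches.append(i + 1)  # 1-based location of the match
        shift = hs.get(text[i + m - 1], m)
        if i + m < n:  # a character to the right of the window exists
            shift = max(shift, sd.get(text[i + m], m + 1))
        i += shift
    return matches


print(sunday_table("cgtac", "acgt"))    # {'a': 2, 'c': 1, 'g': 4, 't': 3}
print(horspool_table("cgtac", "acgt"))  # {'a': 1, 'c': 4, 'g': 3, 't': 2}
print(smith_search("acggtatcgtacgtt", "cgtac"))  # [8]
```

Both shifts are safe on their own, so taking the larger of the two never skips an occurrence.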
Section 5.5 Liu's Algorithm
Liu's Algorithm is much more sophisticated than the Horspool Algorithm,
although it is still in the spirit of the Horspool Algorithm. Consider the following
pattern.
    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
P = a g t c c c c c c a  g  t  c  g  c  a  c  t
Suppose we use location 18 of the window as a reference. That is, when we
shift the pattern, we will align a character of P with $w_{18}$. The location
table will be as shown in Table 5.5-1.
Table 5.5-1 The location table for reference point 18
a c g t
2 1 4 6
The average number of steps of a shift is
$$\frac{2 + 1 + 4 + 6}{4} = \frac{13}{4} = 3.25$$
Suppose we set the reference point to be location 14. Then the location table
will be as shown in Table 5.5-2.
Table 5.5-2 The location table for reference point 14
a c g t
4 1 3 2
The average number of steps of a shift is
$$\frac{4 + 1 + 3 + 2}{4} = \frac{10}{4} = 2.5$$
Let the reference point be location 10. Then the location table will be as shown
in Table 5.5-3.
Table 5.5-3 The location table for reference point 10
a c g t
9 1 8 7
The average number of steps of a shift will be
$$\frac{9 + 1 + 8 + 7}{4} = \frac{25}{4} = 6.25$$
The Liu’s Algorithm will thus conduct a pre-processing to determine the best
reference point. The pre-processing can be quite efficiently implemented.
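The three tables above can be recomputed directly from the pattern. The sketch below is a brute-force illustration, not the efficient preprocessing mentioned in the text; the function name `reference_table` is hypothetical, and the entry for a character with no occurrence to the left of the reference point is taken to be the reference location itself, which is an assumption consistent with the worked example that follows.

```python
def reference_table(pattern, ref, alphabet="acgt"):
    """Shift table when window location `ref` (1-based) is the reference point."""
    table = {x: ref for x in alphabet}
    # Only characters of P strictly to the left of the reference point matter.
    for pos, ch in enumerate(pattern[:ref - 1], start=1):
        table[ch] = ref - pos
    return table


P = "agtccccccagtcgcact"
for ref in (18, 14, 10):
    t = reference_table(P, ref)
    print(ref, t, sum(t.values()) / len(t))
```

The printed averages are 3.25, 2.5 and 6.25, matching Tables 5.5-1 to 5.5-3.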
In the following, we shall discuss how to find an optimal reference point for
Liu's Algorithm. Let us consider the case as shown below:
    1 2 3 4 5 6 7
P = a g a g c t a
We start from i = 2. From the above, we can see that only the character a
appears to the left of $p_2$. Thus, if $w_2 = a$, we shift 1 step; otherwise, we shift 2
steps. Assume that the alphabet size $\sigma = 4$. The total number of steps over the
four characters is $1 + 3 \times 2 = 7$ for i = 2.
We increase i = 2 to i = 3. We can now see that the characters a and g appear
to the left of $p_3$. Since $p_1 = a$, if $w_3 = a$, we shift 3 - 1 = 2 steps. Since
$p_2 = g$, if $w_3 = g$, we shift 3 - 2 = 1 step. For the other two cases, we shift 3 steps.
Thus, the total number of steps is $1 \times 2 + 1 \times 1 + 2 \times 3 = 2 + 1 + 6 = 9$. Since
9 > 7, we say that i = 3 is the best so far.
We increase i = 3 to i = 4. It can be easily seen that for i = 4, the total
number of steps is $1 \times 1 + 1 \times 2 + 2 \times 4 = 1 + 2 + 8 = 11$. We conclude that i = 4 is
the best so far.
Consider the case where i = 5. We can see that the total number of steps is
$1 \times 2 + 1 \times 1 + 2 \times 5 = 2 + 1 + 10 = 13$. Thus i = 5 is the best so far.
Consider the case where i = 6. We can see that the total number of steps is
$1 \times 3 + 1 \times 1 + 1 \times 2 + 1 \times 6 = 3 + 1 + 2 + 6 = 12$. This shows that the best
reference point is still i = 5.
Finally, for i = 7, we can see that the total number of steps is
$1 \times 4 + 1 \times 2 + 1 \times 3 + 1 \times 1 = 4 + 2 + 3 + 1 = 10$. Thus the final solution is
that the optimal reference point is i = 5.
There is a very simple trick here. When we move from location i to location i+1,
we only have to pay attention to $p_i$ and set its contribution to the shifting to be 1.
For all other characters, we merely increase their contributions by 1. Algorithm 5.2
shows the algorithm to determine the reference point and its related shift table.
Algorithm 5.2 Algorithm for determining $i_b$ and the Shift table for Liu's Algorithm
Input: P(1, m)
Output: $i_b$ and the Shift table
$\Sigma = (a_1, a_2, a_3, \ldots, a_\sigma)$ /* the alphabet set */
For j = 1 to $\sigma$
    $d(a_j) = 1$
    $Shift(a_j) = 1$
End For
Max = 0
Total = 0
$i_b = 1$
For i = 2 to m
    For j = 1 to $\sigma$
        $d(a_j) = d(a_j) + 1$
    End For
    $d(p_{i-1}) = 1$
    Total = 0
    For j = 1 to $\sigma$
        $Total = Total + d(a_j)$
    End For
    If (Total > Max)
        $i_b = i$
        Max = Total
        For j = 1 to $\sigma$
            $Shift(a_j) = d(a_j)$
        End For
    End If
End For
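Algorithm 5.2 translates almost line by line into Python. The following is a sketch under the assumption that the outer loop runs over reference points i = 2, ..., m, as in the worked example that follows; the function name `best_reference_point` and the returned tuple are illustrative choices.

```python
def best_reference_point(pattern, alphabet="acgt"):
    """Return (i_b, Max, Shift) in the sense of Algorithm 5.2 (a sketch)."""
    d = {x: 1 for x in alphabet}
    shift = dict(d)
    best_i, best_total = 1, 0
    for i in range(2, len(pattern) + 1):
        # Moving the reference point one step right increases every entry ...
        for x in d:
            d[x] += 1
        # ... except that of p_{i-1}, which is now exactly one step to its left.
        d[pattern[i - 2]] = 1
        total = sum(d.values())
        if total > best_total:
            best_i, best_total, shift = i, total, dict(d)
    return best_i, best_total, shift


print(best_reference_point("agagcta"))
```

For P = agagcta this selects reference point 5 with a total of 13, in agreement with the discussion above.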
We now give an example. We are given T = agtgtcagagctaca and P = agagcta.
Preprocessing Phase:
    1 2 3 4 5 6 7
P = a g a g c t a
$\Sigma$ = a g c t
d = 1 1 1 1
$\Sigma$ = a g c t
Shift = 1 1 1 1
Max = 0, Total = 0 and $i_b = 1$.
When i = 2,
    1 2 3 4 5 6 7
P = a g a g c t a
$\Sigma$ = a g c t
d = 1 2 2 2
Total = d(a) + d(g) + d(c) + d(t) = 1 + 2 + 2 + 2 = 7
Total = 7 and Max = 0.
The value Total is greater than Max (7 > 0), so
$\Sigma$ = a g c t
Shift = 1 2 2 2
Now Max = 7 and $i_b = 2$.
When i = 3,
    1 2 3 4 5 6 7
P = a g a g c t a
$\Sigma$ = a g c t
d = 2 1 3 3
Total = d(a) + d(g) + d(c) + d(t) = 2 + 1 + 3 + 3 = 9
Total = 9 and Max = 7.
The value Total is greater than Max (9 > 7), so
$\Sigma$ = a g c t
Shift = 2 1 3 3
Now Max = 9 and $i_b = 3$.
When i = 4,
    1 2 3 4 5 6 7
P = a g a g c t a
$\Sigma$ = a g c t
d = 1 2 4 4
Total = d(a) + d(g) + d(c) + d(t) = 1 + 2 + 4 + 4 = 11
Total = 11 and Max = 9.
The value Total is greater than Max (11 > 9), so
$\Sigma$ = a g c t
Shift = 1 2 4 4
Now Max = 11 and $i_b = 4$.
When i = 5,
    1 2 3 4 5 6 7
P = a g a g c t a
$\Sigma$ = a g c t
d = 2 1 5 5
Total = d(a) + d(g) + d(c) + d(t) = 2 + 1 + 5 + 5 = 13
Total = 13 and Max = 11.
The value Total is greater than Max (13 > 11), so
$\Sigma$ = a g c t
Shift = 2 1 5 5
Now Max = 13 and $i_b = 5$.
When i = 6,
    1 2 3 4 5 6 7
P = a g a g c t a
$\Sigma$ = a g c t
d = 3 2 1 6
Total = d(a) + d(g) + d(c) + d(t) = 3 + 2 + 1 + 6 = 12
Total = 12 and Max = 13.
The value Total is less than Max (12 < 13).
We do not need to reset the Shift table, Max or $i_b$.
When i = 7,
    1 2 3 4 5 6 7
P = a g a g c t a
$\Sigma$ = a g c t
d = 4 3 2 1
Total = d(a) + d(g) + d(c) + d(t) = 4 + 3 + 2 + 1 = 10
Total = 10 and Max = 13.
The value Total is less than Max (10 < 13).
We do not need to reset the Shift table, Max or $i_b$.
Section 5.6 The i-largest Number Domination Sequence
As can be seen, the Horspool Algorithm is actually a window sliding algorithm. The
average number of steps of the shifting of the window is therefore very important. If,
on average, the number of steps of the window being shifted is large, the algorithm is
efficient. For the Horspool Algorithm, the number of steps of shifting is determined
by how the distinct characters are arranged in the pattern P. Let us consider
P = tcaacgtttttttttttttttt.
We can easily see that if the last character of the window is not t, the number of steps
of pattern shifting is quite large. On the other hand, suppose that
P = accgtgtacccacgtt.
In this case, no matter what the last character of the window is, the number of steps of
the pattern shifting is small.
We are facing an interesting problem. Recall that the alphabet set is
$\Sigma = (x_1, x_2, \ldots, x_\sigma)$. Without losing generality, we may assume that when we scan
from right to left, starting from $p_{m-1}$, the distinct characters we encounter are
ordered as $x_1, x_2, \ldots, x_\sigma$. That is, $p_{m-1} = x_1$. Then the second distinct character
we encounter is $x_2$. For example, let
P = accgttgtac.
Then,
$x_1$ = a
$x_2$ = t
$x_3$ = g
$x_4$ = c
For each $x_i$, we would like to know the location of its rightmost occurrence in
P(1, m-1), counted from location m-1. Let us denote this as $d_i$. For P = accgttgtac,
$d_1 = 1$
$d_2 = 2$
$d_3 = 3$
$d_4 = 7$
To find the average case performance of the Horspool Algorithm, we have to find
the average values of the $d_i$'s. It will be informative for us to code the string
P(1, m-1) into a string consisting of positive integers only. Let us code $x_i$ by i.
For
P = accgttgtac,
the coding is as follows:
a = 1
t = 2
g = 3
c = 4
Thus the original pattern P(1, m-1) becomes: 144322321. Let us now reverse it
and we have 123223441.
Scanning 123223441 from the left, we will find out that there are four sequences:
the sequence where the first 1 appears and 1 is the largest, the sequence where the first
2 appears and 2 is the largest, and so on, as follows:
$S_1$: 1 (The first 1 appears at location 1 counted from location m-1.)
$S_2$: 12 (The first 2 appears at location 2 counted from location m-1.)
$S_3$: 123 (The first 3 appears at location 3 counted from location m-1.)
$S_4$: 1232234 (The first 4 appears at location 7 counted from location m-1.)
We shall point out that these sequences have a common property. Before doing
that, let us consider another example:
P = acgtggatcagagaat
It can be seen that the coding is as follows:
a = 1
g = 2
c = 3
t = 4
P(1, m-1) becomes:
132422143121211
We reverse the above sequence into
112121341224231
Then we have the following sequences:
$S_1$: 1 (The first 1 appears at location 1 counted from location m-1.)
$S_2$: 112 (The first 2 appears at location 3 counted from location m-1.)
$S_3$: 1121213 (The first 3 appears at location 7 counted from location m-1.)
$S_4$: 11212134 (The first 4 appears at location 8 counted from location m-1.)
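The coding of the two examples above can be reproduced mechanically. The sketch below scans P(1, m-1) from right to left, codes the i-th new character as i, and records where each new character first appears; the function name `coded_reverse` is illustrative, and the digit-string output assumes at most nine distinct characters.

```python
def coded_reverse(pattern):
    """Return the reversed coded sequence of P(1, m-1) and the list of d_i values."""
    code, digits, d = {}, [], []
    for loc, ch in enumerate(reversed(pattern[:-1]), start=1):
        if ch not in code:
            code[ch] = len(code) + 1  # the i-th distinct character is coded as i
            d.append(loc)             # its location counted from location m-1
        digits.append(str(code[ch]))
    return "".join(digits), d


print(coded_reverse("accgttgtac"))        # ('123223441', [1, 2, 3, 7])
print(coded_reverse("acgtggatcagagaat"))  # ('112121341224231', [1, 3, 7, 8])
```

The second component of each result gives the lengths of the sequences $S_1, \ldots, S_4$ listed above.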
The physical meaning of each sequence listed above is as follows:
$S_1$: The first distinct character appears at location m-1.
$S_2$: The second distinct character appears at location m-3.
$S_3$: The third distinct character appears at location m-7.
$S_4$: The fourth distinct character appears at location m-8.
From the above discussion, we can see that the first distinct character in
P(1, m-1), counted from the right, must be located at m-1. But the second
distinct character may appear anywhere. To analyze the average case
performance of the Horspool Algorithm, we must know, on average, where the i-th
distinct character appears. It turns out that this problem can be formulated as the
i-largest number domination sequence problem which will be defined and discussed
in the next sections.
We define the i-largest number domination sequence as follows:
Definition 5.6-1 The i-largest number domination sequence
An i-largest number domination sequence is a sequence S consisting of integers 1,
2, …, i satisfying the following conditions:
1. The integer i is the largest in the sequence, appears at the last position and
appears only once.
2. For every positive integer k smaller than i, there exists a k-largest number
domination sequence in S.
The following sequences are all i-largest number domination sequences for some i:
1
12
123
1234
112
11112
1111112
1223
1213
12223
12221334
Consider the sequence 12223. The integer 3 appears at the last position and is the
only 3 appearing in this sequence. In this sequence, the 2-largest number domination
sequence is 12 and the 1-largest number domination sequence is 1. We therefore
conclude that 12223 is a 3-largest number domination sequence.
The following sequences are not i-largest number domination sequences:
11
121
1122344
1233
213
2213
Consider the sequence 121. The integer 1 appears at the last position, but 1 is not
the largest number in this sequence. Thus it is not an i-largest number domination
sequence for any i. Consider the sequence 213. In this case, 3 appears as the last
character. But there is no 2-largest number domination sequence in it. Therefore it
is not an i-largest number domination sequence for any i.
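Definition 5.6-1 can be turned into a small test. The sketch below rests on one interpretation that should be read as an assumption: the k-largest number domination sequences required by condition 2 are taken to be prefixes of S, so that the integers 1, 2, ..., i make their first appearances in increasing order. The function name `domination_order` is hypothetical.

```python
def domination_order(seq):
    """Return i if seq (a list of ints) is an i-largest number domination
    sequence under the prefix interpretation, otherwise None."""
    largest = 0
    last_new = -1
    for pos, value in enumerate(seq):
        if value < 1 or value > largest + 1:
            return None  # a number appears before all smaller ones have appeared
        if value == largest + 1:
            largest = value
            last_new = pos  # position where the current largest first appears
    # The largest number must make its first (hence only) appearance at the end.
    if largest == 0 or last_new != len(seq) - 1:
        return None
    return largest


def digits(s):
    return [int(c) for c in s]


for s in ("12223", "12221334", "121", "213", "1233"):
    print(s, domination_order(digits(s)))
```

With this reading, every sequence in the first list above is accepted and every sequence in the second list is rejected.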
That the i-largest number domination sequence is related to the Horspool Algorithm
can be seen by considering the following case:
Let P = accgttaccct. We code P(1, m-1) as 2114332111. Let us reverse the coded
sequence and get 1112334112. The largest number is 4. So, we consider the sequence
up to 4, which is 1112334. This sequence is obviously a 4-largest number domination
sequence. There are in total four i-largest number domination sequences as follows:
1 (1-largest number domination sequence with length 1)
1112 (2-largest number domination sequence with length 4)
11123 (3-largest number domination sequence with length 5)
and
1112334 (4-largest number domination sequence with length 7)
The above sequences indicate that the first, second, third and fourth distinct
characters in P, counted from $p_{m-1}$ to the left, are located at 1, 4, 5 and 7
respectively.
For P = acgaaaccaccttt, the coded sequence of P(1, m-1) is
3243332232211. The reverse of it is 1122322333423. Again, there are four i-largest
number domination sequences as follows:
1 (1-largest number domination sequence with length 1)
112 (2-largest number domination sequence with length 3)
11223 (3-largest number domination sequence with length 5)
11223223334 (4-largest number domination sequence with length 11)
The first, second, third and fourth distinct characters in P, counted from $p_{m-1}$
to the left, are located at 1, 3, 5 and 11, respectively.
To have an average case analysis of the Horspool Algorithm, we are interested in
knowing the average location of the ith distinct character counted from $p_{m-1}$ in P.
To get this, we first need to solve one problem: For a location L counted from
$p_{m-1}$ in P, what is the probability that the ith distinct character occurs at L? If
the ith distinct character occurs at L, the reversed coded sequence of
P(m-L, m-1) must be an i-largest number domination sequence. We recall that
the alphabet size is $\sigma$. Therefore, there are $\sigma^L$ distinct sequences. Some
of them are i-largest number domination sequences. Given an i and an L, if we
know the number of distinct i-largest number domination sequences, we would
know the probability that the ith distinct character occurs at L. In the following
section, we shall discuss the i-largest number domination sequence problem which
addresses our concern.
Section 5.7 The i-largest Number Domination Sequence Problem
We first define the i-largest number domination sequence problem as follows:
Definition 5.7-1 The i-largest number domination sequence problem
The i-largest number domination sequence problem is to determine the number
of i-largest number domination sequences with length L for a given i and a
given L.
For instance, let i = 3 and L = 3. Then there is only one 3-largest number
domination sequence with length 3, namely 123. If i = 2 and L = 3, there is also
only one 2-largest number domination sequence with length 3, namely 112. For
i = 3 and L = 4, there are three 3-largest number domination sequences with length
4, namely 1123, 1213 and 1223.
To solve the i-largest number domination sequence problem, let D(i, L) be the
number of all i-largest number domination sequences with length L. We first present
some boundary conditions:
1. D(i, i) = 1 for all i.
2. D(1, L) = 0 if L > 1.
3. D(i, L) = 0 if L < i.
4. D(2, L) = 1 for $L \ge 2$.
Now, let us consider D(4,6). In this case, the length of the sequence is 6 and
the largest number of this sequence is 4. Therefore, the sequence must be of the
form
$1\, s_2 s_3 s_4 s_5\, 4$.
As for $s_5$, it cannot be 4, by definition. Thus it can be either 1, 2 or 3. There are
two possible cases:
Case 1: $1\, s_2 s_3 s_4 s_5$ is a 3-largest number domination sequence. In this case,
$1\, s_2 s_3 s_4 s_5 = 1\, s_2 s_3 s_4\, 3$.
For instance, 11213 is such a sequence and there are D(3,5) such sequences.
Case 2: $1\, s_2 s_3 s_4 s_5$ is not a 3-largest number domination sequence. In this case,
$s_5$ must be one of 1, 2 or 3. For instance, 11231, 12312 and 11233 are all
such sequences. They have a common property: If the last character is replaced by
4, they all become 4-largest number domination sequences. For instance, 11231,
12312 and 11233 will become 11234, 12314 and 11234 respectively, and they are now
all 4-largest number domination sequences.
We may classify all of the Case 2 sequences $1\, s_2 s_3 s_4 s_5$ into three classes: Class
1: $s_5 = 1$, Class 2: $s_5 = 2$ and Class 3: $s_5 = 3$. For each class, there are D(4,5)
sequences. Therefore, for Case 2, there are 3D(4,5) sequences.
Combining Case 1 and Case 2, we may conclude that
$$D(4,6) = D(3,5) + 3D(4,5) \qquad (5.7-1)$$
In general, suppose $S = s_1 s_2 \cdots s_L$ is an i-largest number domination sequence.
There are D(i-1, L-1) sequences of the form $s_1 s_2 \cdots s_{L-1}$ which are
(i-1)-largest number domination sequences and (i-1)D(i, L-1) sequences of the
form $s_1 s_2 \cdots s_{L-1}$ which are not (i-1)-largest number domination sequences.
Therefore, we have:
$$D(i, L) = D(i-1, L-1) + (i-1)\,D(i, L-1) \quad \text{for } L > i \qquad (5.7-2)$$
with the following boundary conditions:
$$D(i, i) = 1 \text{ for all } i. \qquad (5.7-3)$$
$$D(1, L) = 0 \text{ if } L > 1. \qquad (5.7-4)$$
$$D(i, L) = 0 \text{ if } L < i. \qquad (5.7-5)$$
$$D(2, L) = 1 \text{ for } L \ge 2. \qquad (5.7-6)$$
For instance,
$$
\begin{aligned}
D(4,6) &= D(3,5) + 3D(4,5) \\
&= D(2,4) + 2D(3,4) + 3(D(3,4) + 3D(4,4)) \\
&= D(1,3) + D(2,3) + 2(D(2,3) + 2D(3,3)) + 3(D(2,3) + 2D(3,3)) + 9D(4,4) \\
&= 6D(2,3) + 10D(3,3) + 9D(4,4) \\
&= 6(D(1,2) + D(2,2)) + 10D(3,3) + 9D(4,4) \\
&= 6 + 10 + 9 \\
&= 25
\end{aligned}
\qquad (5.7-7)
$$
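Recurrence (5.7-2) and its boundary conditions are easy to evaluate by machine, and for small i and L the result can be cross-checked by exhaustive enumeration. In the sketch below the function names are illustrative, and the enumeration assumes the prefix reading of Definition 5.6-1: the integers 1, ..., i first appear in increasing order, and i appears only at the last position.

```python
from functools import lru_cache
from itertools import product


@lru_cache(maxsize=None)
def D(i, L):
    """Number of i-largest number domination sequences of length L (Eq. 5.7-2)."""
    if i < 1 or L < i:
        return 0          # (5.7-5)
    if L == i:
        return 1          # (5.7-3)
    if i == 1:
        return 0          # (5.7-4)
    return D(i - 1, L - 1) + (i - 1) * D(i, L - 1)


def brute_force(i, L):
    """Count the same sequences by enumerating all i^L candidates."""
    count = 0
    for seq in product(range(1, i + 1), repeat=L):
        largest, ok = 0, True
        for value in seq:
            if value > largest + 1:
                ok = False  # a number appears before all smaller ones
                break
            largest = max(largest, value)
        # i must be reached, sit at the last position and occur only once.
        if ok and largest == i and seq[-1] == i and seq.count(i) == 1:
            count += 1
    return count


print(D(4, 6), brute_force(4, 6))  # 25 25
```

The value D(4,6) = 25 agrees with Equation (5.7-7), and D(3,4) = 3 agrees with the three sequences 1123, 1213 and 1223 listed earlier.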
In the following section, we shall show how to apply the i-largest number
domination sequence problem to the average case analysis of the Horspool Algorithm.
Section 5.8 Application of the i-largest Number Domination Sequence Problem to the Average Case Analysis of the Horspool Algorithm
For the Horspool Algorithm, the location of the ith distinct character in P(1, m-1),
counted from location m-1, is very important. If the last character x of the
window W is equal to the ith distinct character, the number of steps of shifting the
pattern P is exactly equal to this location. We only know that the first distinct
character must be located at location m-1. The ith distinct character, where i > 1,
may appear anywhere in P(1, m-2). We are thus interested in the average
location of the ith distinct character in P(1, m-2).
As pointed out before, if the ith distinct character appears at location L, counted
from location m-1, then the reverse of the coded sequence of P(m-L, m-1) must
be an i-largest number domination sequence. The number of i-largest number
domination sequences with length L is denoted as D(i, L). Let $\sigma$ be the alphabet
size. Then there are $\sigma^L$ sequences with length L. The probability that one
random sequence with length L is an i-largest number domination sequence is
now denoted as A(i, L). A(i, L) can be found by the following formula:
$$A(i, L) = \frac{D(i, L)}{\sigma^L} \qquad (5.8-1)$$
It is obvious that A(i, L) is also the probability that the ith distinct character appears
at location L, counted from location m-1.
In the above section, we have given a recursive formula, expressed as Equation
(5.7-2), to determine D(i, L). Unfortunately, a closed-form formula based upon
Equation (5.7-2) has not yet been found. But, based upon Equation (5.7-2) and all of
the boundary conditions, we can compute the D(i, L)'s for a limited range of i and L.
Let us consider the case where i = 4, L = 6 and $\sigma = 4$. Then, from
Equations (5.7-7) and (5.8-1), we have
$$\frac{D(i, L)}{\sigma^L} = \frac{D(4,6)}{4^6} = \frac{25}{4096} \approx 0.006.$$
When the ith distinct character is equal to x, the last character of W, its shift is
equal to L. The reverse of the coded sequence of P(m-L, m-1) must be an
i-largest number domination sequence. The average number of steps for a shift for
every ith distinct character in a random pattern with length m is
$$\sum_{L=i}^{m-1} L \cdot A(i, L) = \sum_{L=i}^{m-1} L \cdot \frac{D(i, L)}{\sigma^L} \qquad (5.8-2)$$
If x does not occur in P(1, m-1), then the shift is m. For example, suppose the last
character of W in T is coded as 4 and P(1, m-1) = 33211, so that m = 6. Then the
shift is 6. However, the reversed sequence 11233 does not conform to the definition
of an i-largest number domination sequence. How do we overcome this difficulty?
Although the sequence 11233 is not a 3-largest number domination sequence, it
contains a 3-largest number domination sequence, namely 1123. Therefore, if we
append 4 to this sequence, it becomes 112334, which is a
4-largest number domination sequence with length 6. Since there are D(4, 6)
4-largest number domination sequences with length 6, there are also D(4, 6)
sequences each of which satisfies the following conditions:
(1) It contains a 3-largest number domination sequence.
(2) It does not contain 4.
(3) Its length is 6 - 1 = 5.
For instance, 111234 is a 4-largest number domination sequence with length 6, and
11123 is a possible sequence with length 5 which does not contain 4. Let us give
another example, namely 123334. We can see that 12333 can be a coded sequence
which does not contain 4.
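The shift in this example follows directly from Rule E2-2. The following minimal sketch, with an illustrative function name, computes the shift for a last window character x given the coded P(1, m-1). It also shows how appending the absent code to the reversed prefix produces a sequence of length m.

```python
def rule_e2_2_shift(x, prefix):
    """Shift under Rule E2-2, where prefix is P(1, m-1) and x is the last character of W.

    The shift is the distance of the rightmost x in the prefix, counted
    from location m-1 (distance 1), or m if x does not occur in the prefix.
    """
    m = len(prefix) + 1
    for distance, ch in enumerate(reversed(prefix), start=1):
        if ch == x:
            return distance
    return m


prefix = "33211"                       # coded P(1, m-1) from the example, so m = 6
print(rule_e2_2_shift("4", prefix))    # 4 is absent, so the shift is m
print(rule_e2_2_shift("2", prefix))    # rightmost 2 is the third character from the right

# Appending the absent code to the reversed prefix gives a sequence of length m.
extended = prefix[::-1] + "4"
print(extended, len(extended))
```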
Based upon the above reasoning, assuming that there are only i-1 distinct
characters in P(1, m-1), the number of coded sequences that can serve as
P(1, m-1) and contain only i-1 distinct characters is D(i, m). The probability
that a sequence falls into such a category is:
$$\frac{D(i, m)}{\sigma^{m-1}} \tag{5.8-3}$$
Thus, the average number of steps for a shift for every ith distinct character is:
$$\sum_{L=i}^{m-1} L \, \frac{D(i, L)}{\sigma^{L}} + m \, \frac{D(i, m)}{\sigma^{m-1}} \tag{5.8-4}$$
Because the alphabet size is σ and the average number of steps for a shift of the
first distinct character is 1, the average number of steps for a shift is
$$\frac{1}{r} \sum_{i=1}^{r} \left( \sum_{L=i}^{m-1} L \, \frac{D(i, L)}{\sigma^{L}} + m \, \frac{D(i, m)}{\sigma^{m-1}} \right) \tag{5.8-5}$$
The alphabet contains σ distinct characters. Hence there are σ choices for
the first distinct character, σ-1 choices for the second distinct character, and
σ-i+1 choices for the ith distinct character. Each i-largest number domination
sequence therefore corresponds to P(σ, i) strings, where P(σ, i) = σ!/(σ-i)!,
and in total there are
$$D(i, L) \, P(\sigma, i) \tag{5.8-6}$$
strings for the i-largest number domination sequences with length L.
In other words, if we are given a general pattern, the average number of steps for
a shift, denoted as AS, is:
$$AS = \frac{1}{\sigma} \sum_{i=1}^{\sigma} P(\sigma, i) \left( \sum_{L=i}^{m-1} L \, \frac{D(i, L)}{\sigma^{L}} + m \, \frac{D(i, m)}{\sigma^{m-1}} \right) \tag{5.8-7}$$
If the length of P is 11 and σ = 4, the average number of steps for a shift is 3.30.
Our experimental result is 3.38, which is reasonably close.
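The counting factor D(i, L) P(σ, i) of (5.8-6), which also appears in (5.8-7), can be sanity-checked on a small case. The sketch below is an illustration rather than part of the analysis: it uses Python's math.perm for P(σ, i), takes D(4, 6) = 25 from the worked example, and counts by brute force the strings of length L over an alphabet of size σ whose last character is new and is the ith distinct character. Strings are read in the same order as the reversed coded sequence, and the function name is hypothetical.

```python
from itertools import product
from math import perm


def count_strings(sigma, i, length):
    """Count strings whose last character is the first occurrence of their i-th distinct character."""
    total = 0
    for s in product(range(sigma), repeat=length):
        body, last = s[:-1], s[-1]
        if last not in body and len(set(body)) == i - 1:
            total += 1
    return total


sigma, i, length = 4, 4, 6
d_4_6 = 25                               # D(4, 6), taken from the text
print(perm(sigma, i))                    # P(4, 4) = 4!/0!
print(count_strings(sigma, i, length))   # brute-force count
print(d_4_6 * perm(sigma, i))            # D(i, L) * P(sigma, i)
```

For this case the brute-force count and the product D(4, 6) P(4, 4) coincide.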
The advantage of the Horspool Algorithm is that it is easy to program and uses a
very small amount of memory, because only a small location table is needed. The
pre-processing is also very simple. Its weakness is that the number of steps in a
shift tends to be small. We do not need a theoretical analysis to reach this
conclusion. Given a random pattern, it is very unlikely that the distinct characters
all occur at the left side of the pattern. It is far more likely that they appear close
to the last character of the pattern. Thus we can hardly expect large shifts.
Liu's Algorithm is an improvement on the Horspool Algorithm.
References
[A90] A. V. Aho, Algorithms for Finding Patterns in Strings, in Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity, J. van Leeuwen ed., Chapter 5, pp. 255-300, Elsevier, Amsterdam, 1990.
[BM77] R. Boyer and S. Moore, A Fast String Searching Algorithm, Communications of the ACM, Vol. 20, 1977, pp. 762-772.
[BR92] R. A. Baeza-Yates and M. Régnier, Average Running Time of the Boyer-Moore-Horspool Algorithm, Theoretical Computer Science, Vol. 92, Issue 1, 1992, pp. 19-31.
[C99] M. Crochemore and C. Hancart, Pattern Matching in Strings, in Algorithms and Theory of Computation Handbook, 1999.
[DRR2008] T. U. Devi, P. V. N. Rao and A. A. Rao, Promoter Prediction using Horspool's Algorithm, BIOCOMP, 2008, pp. 248-250.
[H80] R. N. Horspool, Practical Fast Searching in Strings, Software Practice and Experience, Vol. 10, 1980, pp. 501-506.
[L95] T. Lecroq, Experimental Results on String Matching Algorithms, Software Practice and Experience, Vol. 25, No. 7, 1995, pp. 727-765.
[MSR97] H. M. Mahmoud, R. T. Smythe and M. Régnier, Analysis of Boyer-Moore-Horspool String-Matching Heuristic, Random Structures and Algorithms, Vol. 10, Issue 1-2, 1997, pp. 153-163.
[MR2010] T. Marschall and S. Rahmann, Exact Analysis of Horspool's and Sunday's Pattern Matching Algorithms with Probabilistic Arithmetic Automata, Lecture Notes in Computer Science, Vol. 6031, 2010, pp. 439-450.
[N2006] M. E. Nebel, Fast String Matching by Using Probabilities: On an Optimal Mismatch Variant of Horspool's Algorithm, Theoretical Computer Science, Vol. 359, No. 1-3, 2006, pp. 329-343.
[R92] T. Raita, Tuning the Boyer-Moore-Horspool String Searching Algorithm, Software Practice and Experience, Vol. 22, No. 10, 1992, pp. 879-884.
[S90] D. M. Sunday, A Very Fast Substring Search Algorithm, Communications of the ACM, Vol. 33, No. 8, 1990, pp. 132-142.
[S94] P. D. Smith, On Tuning the Boyer-Moore-Horspool String Searching Algorithm, Software Practice and Experience, Vol. 24, 1994, pp. 435-436.
[S2001] R. T. Smythe, The Boyer-Moore-Horspool Heuristic with Markovian Input, Random Structures and Algorithms, Vol. 18, 2001, pp. 153-163.