
Chapter 3
String Matching
Searching for designated strings in computers is a classical and important
problem in computer science. We search and replace text in word processors,
find interesting websites through search engines (such as Yahoo and Google),
and search for papers in digital libraries. The core problem behind all of
these operations is string matching.
Many databases (such as GenBank) have been built to hold the DNA and protein
sequences contributed by researchers around the world. Since DNA and protein
sequences can be regarded as character strings over alphabets of 4 and 20
characters respectively, string matching is the core problem in searching these
databases. Designing efficient string matching algorithms becomes more and more
important as the size of such databases grows.
In this chapter, we shall introduce two efficient algorithms for the exact
string matching problem. Moreover, we shall introduce two important data
structures for string matching problems, the suffix tree and the suffix array.
Finally, we introduce an algorithm for the approximate string matching problem.
3.1 Basic Terminologies of Strings
A string is a sequence of characters from an alphabet set Σ. For example,
"AGCTTGA" is a string over Σ = {A,C,G,T}. The length of a string S, denoted by
|S|, is the number of characters of S. For example, the length of "AGCTTGA" is
7. Let S_i be the ith character of S in left-to-right order, where i >= 1, and
let S_{i,j} denote the string S_i S_{i+1} … S_j, where i <= j. String S' is
called a substring of string S if S' = S_{i,i+|S'|-1} for some i, where
1 <= i <= |S| - |S'| + 1. String S' is called a subsequence of string S if
there exists a strictly increasing sequence of indices i_1 < i_2 < … < i_{|S'|}
such that S'_j = S_{i_j} for 1 <= j <= |S'|. For example, "GCTT" and "TTG" are
both substrings and subsequences of "AGCTTGATT", while "ACT" and "GTT" are
subsequences, but not substrings, of "AGCTTGATT". String S' is called a prefix
of string S if S' = S_{1,|S'|}, and a suffix of S if S' = S_{|S|-|S'|+1,|S|}.
For example, "AGCTT" and "AG" are both prefixes of "AGCTTGATT", while "TT" and
"GATT" are both suffixes of "AGCTTGATT".
Given a text string T of length n and a pattern string P of length m, the exact string
matching problem is to find all the occurrences of P in T. It is easy to design a
brute-force algorithm for this problem, which enumerates all n – m + 1 possible
substrings of length m in T by shifting a sliding window of length m and determines
which substrings are P. The algorithm is given below:
Algorithm 3.1 A brute-force algorithm for exact matching
Input: A text string T of length n and a pattern string P of length m.
Output: All occurrences of P in T.
for i = 1 to n - m + 1 do
    if (P_{1,m} = T_{i,i+m-1}) then
        Report that pattern P appears at position i;
    endif
endfor
Since it takes O(m) time to determine whether two strings of length m are the
same, the algorithm runs in O((n - m + 1)m) = O(nm) time.
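The sliding-window enumeration above can be sketched in Python (the function
name and the 1-based reporting convention are our choices, not from the text):

```python
def brute_force_match(T, P):
    """Report all 1-based positions where P occurs in T (Algorithm 3.1)."""
    n, m = len(T), len(P)
    positions = []
    for i in range(n - m + 1):       # slide a window of length m over T
        if T[i:i + m] == P:          # O(m) comparison per window
            positions.append(i + 1)  # 1-based position, as in the text
    return positions
```

For example, brute_force_match("AGCTTGATT", "TT") reports positions 4 and 8.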
3.2 The KMP Algorithm
In this section, we shall introduce the KMP algorithm, proposed by Knuth, Morris
and Pratt in 1977 to speed up the exact matching procedure. We will use several cases
to illustrate their idea.
Case 1: In this case, the first symbol of P does not appear again in P. Consider
Figure 3.1(a). The first mismatch occurs at P_4 and T_4. Since P_1 to P_3
exactly match T_1 to T_3 and the first symbol of P does not appear anywhere
else in P, we can slide the window all the way to T_4 and match P_1 with T_4,
as shown in Figure 3.1(b).
Case 2: In this case, the first symbol of P appears again in P. Consider Figure
3.2(a). The first mismatch occurs at P_7 and T_7. Since the first symbol of P
does occur again in P, we cannot slide the window all the way to match P_1 with
T_7. Instead, we must slide the window to match P_1 with T_6, because
P_6 = "A" = T_6. This is shown in Figure 3.2(b).
Case 3: In this case, not only does the first symbol of P appear again in P;
prefixes of P also reappear in P. Consider Figure 3.3(a). First, note that the
string P_{6,8} is equal to the prefix P_{1,3} and the string P_{9,10} is equal
to the prefix P_{1,2}. In Figure 3.3(a), the first mismatch occurs at P_8 and
T_8. With a good preprocessing, we can go back to P_7 and find that P_{6,7} is
equal to the prefix P_{1,2}. Besides, since the mismatch occurs after P_7, we
know that P_{6,7} matches T_{6,7}. Therefore, we can slide the window to align
P_1 with T_6 and resume matching P_3 with T_8, as shown in Figure 3.3(b).
The basic principle of the KMP algorithm is illustrated in Figure 3.4. In
general, given a pattern P, we first conduct a preprocessing step to find all
of the places where prefixes of P reappear in P. This can be done by computing
a prefix function.
For 1 <= j <= m, let the prefix function f(j) for P_j be the largest k < j such
that P_{1,k} = P_{j-k+1,j}, and 0 if there is no such k. The prefix function is
illustrated in Figure 3.5. Note that f(1) = 0 in all cases. For example, the
prefix function f for P = "ATCACATCATCA" is shown in Figure 3.6.
Figure 3.1: The First Case for the KMP Algorithm
Figure 3.2: The Second Case for the KMP Algorithm
Figure 3.3: The Third Case for the KMP Algorithm
Figure 3.4: The KMP Algorithm
Figure 3.5: The Prefix Function
Figure 3.6: An Example of the Prefix Function
If the prefix function is computed first and stored in an array, the window
shift can be determined easily by looking up the array. Consider the pattern
P = "ATCG". If a mismatch occurs when comparing T_4 with P_4, then we can
continue by comparing T_4 with P_{f(4-1)+1} = P_1, since f(j) = 0 for
1 <= j <= 4 when P = "ATCG". Consider the pattern P = "ATCACATCATCA" shown in
Figure 3.6. If a mismatch occurs when comparing T_13 with P_8, then we can
continue by comparing T_13 with P_{f(8-1)+1} = P_{2+1} = P_3.
In general, if a mismatch occurs when T_i is compared with P_j, then we align
T_i with P_{f(j-1)+1} if j ≠ 1, and align T_{i+1} with P_1 if j = 1.
In the following, we shall discuss how to compute this prefix function.
Consider the prefix function of Figure 3.6.
(1) Suppose that f(4) has been found to be 1. This means that P_1 = P_4. Based
upon this information we can determine f(5) easily: if P_5 = P_2, then
f(5) = f(4) + 1; otherwise we fall back and compare P_5 with P_1. In this case,
since P_5 ≠ P_2 and also P_5 ≠ P_1, we set f(5) = 0.
(2) Suppose we have found f(8) = 3. This means that P_{1,3} = P_{6,8}. We now
check whether P_9 = P_4. Since P_9 = P_4, we set f(9) = f(8) + 1 = 4.
(3) Suppose we have found that f(9) = 4. We now check whether P_10 = P_5. The
answer is no. Does this mean that we should set f(10) to 0? No. Note that
f(9) = 4 and f(4) = 1. This means that P_9 = P_4 = P_1; in other words, we now
have a pointer pointing to P_1. We therefore check whether P_10 = P_2. P_10 is
indeed equal to P_2, so we set f(10) = f(4) + 1 = 1 + 1 = 2.
Let f^x(y) = f(f^{x-1}(y)) for x > 1 and f^1(y) = f(y). The prefix function
f(j), 2 <= j <= m, for P_j can be rewritten as follows:

f(j) = f^k(j-1) + 1   if there exists a smallest k >= 1 such that P_j = P_{f^k(j-1)+1};
f(j) = 0              otherwise.
It can be shown that this new formula is equivalent to the old one. Consider
the pattern P in Figure 3.6 again. Note that f(1) is 0 in all cases, and f(2)
and f(3) are 0 by the definition. We set f(4) = 1 because
P_4 = P_{f(4-1)+1} = P_1 = "A". However, we let f(5) be 0 because
"C" = P_5 ≠ P_{f(5-1)+1} = P_2 = "T" and "C" = P_5 ≠ P_{f^2(5-1)+1} = P_1 = "A".
In addition, we have f(6) = 1, f(7) = 2, f(8) = 3 and f(9) = 4 because
P_6 = P_{f(6-1)+1} = P_1 = "A", P_7 = P_{f(7-1)+1} = P_2 = "T",
P_8 = P_{f(8-1)+1} = P_3 = "C" and P_9 = P_{f(9-1)+1} = P_4 = "A",
respectively. Since "T" = P_10 ≠ P_{f(10-1)+1} = P_5 = "C" and
P_10 = P_{f^2(10-1)+1} = P_2 = "T", we have f(10) = 2. Finally, we set
f(11) = 3 and f(12) = 4 because P_11 = P_{f(11-1)+1} = P_3 = "C" and
P_12 = P_{f(12-1)+1} = P_4 = "A", respectively.
The complete algorithm for computing f is given below. Note that the final test
must compare P_j with P_{t+1}: even when t has reached 0, P_j may still equal
P_1, in which case f(j) = 1.
Algorithm 3.2 An algorithm for computing the prefix function f
Input: A string P of length m.
Output: The prefix function f for P.
f(1) = 0;
for j = 2 to m do
    t = f(j-1);
    while (P_j ≠ P_{t+1} and t ≠ 0) do
        t = f(t);
    endwhile
    if (P_j = P_{t+1}) then
        f(j) = t + 1;
    else
        f(j) = 0;
    endif
endfor
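Algorithm 3.2 can be transcribed into Python as follows (a sketch; we keep the
text's 1-based f(j) values but return them in an ordinary 0-indexed list):

```python
def prefix_function(P):
    """Return [f(1), ..., f(m)]: f(j) is the largest k < j
    with P_{1,k} = P_{j-k+1,j}, or 0 if no such k exists."""
    m = len(P)
    f = [0] * (m + 1)                 # f[1..m]; f[0] is unused
    for j in range(2, m + 1):
        t = f[j - 1]
        # follow failure links until P_j extends a shorter border or t = 0
        while t != 0 and P[j - 1] != P[t]:
            t = f[t]
        f[j] = t + 1 if P[j - 1] == P[t] else 0
    return f[1:]
```

On P = "ATCACATCATCA" this reproduces the values derived above, e.g. f(10) = 2
and f(12) = 4.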
The KMP algorithm consists of two phases. It computes the prefix function for the
pattern P in the first phase, and searches the pattern in the second phase. Refer to
Algorithm 3.3 for the details.
Let us see an example of the KMP algorithm, as shown in Figure 3.7. Initially,
P_j = P_1 is aligned with T_i = T_1 (see Figure 3.7(a)). Next, since T_1 ≠ P_1,
the algorithm continues by comparing P_1 with T_2 (see Figure 3.7(b)). A
mismatch then occurs at P_4 and T_5, so the algorithm compares T_5 with
P_{f(4-1)+1} = P_{0+1} = P_1 (see Figure 3.7(c)). After that, since T_5 ≠ P_1,
the algorithm continues by comparing T_6 with P_1 (see Figure 3.7(d)). This
time we match T_{6,17} with P_{1,12} successfully and report that pattern P
appears at position i - m + 1 = 17 - 12 + 1 = 6 of T. The next comparison
starts at T_18 and P_{f(m)+1} = P_{4+1} = P_5 (see Figure 3.7(e)). Since a
mismatch occurs at T_19 and P_6, the algorithm continues by comparing T_19 with
P_{f(6-1)+1} = P_1 (see Figure 3.7(f)). The remaining comparisons are left as
exercises.
It can be proved that the algorithm runs in O(n + m) time: O(m) time for
computing the function f and O(n) time for searching for P. The analysis is
somewhat involved and requires amortized analysis, so it is omitted here.
Algorithm 3.3 The KMP algorithm for exact matching
Input: A text string T of length n and a pattern string P of length m.
Output: All occurrences of P in T.
/* Phase 1 */
Compute f(j) for 1 <= j <= m;
/* Phase 2 */
i = 1, j = 1;
while (i <= n) do
    if (P_j = T_i and j = m) then
        Report that pattern P appears at position i - m + 1;
        j = f(m) + 1;
        i = i + 1;
    elseif (P_j = T_i and j ≠ m) then
        j = j + 1;
        i = i + 1;
    elseif (P_j ≠ T_i and j ≠ 1) then
        j = f(j - 1) + 1;
    else /* P_j ≠ T_i and j = 1 */
        i = i + 1;
    endif
endwhile
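Both phases can be rendered compactly in Python (a sketch; the failure values
are stored 0-indexed here, which slightly reshapes the index arithmetic of
Algorithm 3.3 without changing its behavior):

```python
def kmp_search(T, P):
    """Report all 1-based positions where P occurs in T in O(n + m) time."""
    n, m = len(T), len(P)
    # Phase 1: failure function; f[j] = length of the longest proper
    # prefix of P[0..j] that is also a suffix of P[0..j]
    f = [0] * m
    t = 0
    for j in range(1, m):
        while t > 0 and P[j] != P[t]:
            t = f[t - 1]
        if P[j] == P[t]:
            t += 1
        f[j] = t
    # Phase 2: scan the text; the text pointer i never moves backwards
    positions = []
    t = 0                                 # number of pattern characters matched
    for i in range(n):
        while t > 0 and T[i] != P[t]:
            t = f[t - 1]
        if T[i] == P[t]:
            t += 1
        if t == m:
            positions.append(i - m + 2)   # 1-based start of the occurrence
            t = f[t - 1]
    return positions
```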
The Boyer-Moore algorithm was proposed by Boyer and Moore in 1977. The
worst-case time complexity of this algorithm is no better than that of the KMP
algorithm, but it is more efficient in practice.
This algorithm compares the pattern with the substring within a sliding window
in right-to-left order. In addition, the bad character rule and the good suffix
rule are used to determine the movement of the sliding window.
Suppose that P_1 is currently aligned with T_s and we perform pairwise
comparisons from right to left. Assume that the first mismatch occurs when
comparing T_{s+j-1} with P_j; that is, T_{s+m-k-1} = P_{m-k} for
0 <= k <= m - j - 1, and T_{s+j-1} ≠ P_j, as shown in Figure 3.8. We have the
following possible ways to shift the sliding window.
Figure 3.7: An Example for the KMP Algorithm
Since T_{s+j-1} ≠ P_j, any exact match obtained by moving the window to the
right must align some character to the left of P_j in P with T_{s+j-1}. For
example, consider Figure 3.9(a). In this case, m = 12, s = 6, j = 10 and
T_15 = "G" ≠ P_10. We scan from P_10 to the left and find P_7 = "G" = T_15.
Therefore, we move the window to the right to align T_15 with P_7, as shown in
Figure 3.9(b). The basic idea of this rule is illustrated in Figure 3.9(c), and
we have the following rule:
Bad Character Rule: Align T_{s+j-1} with P_{j'}, where j' is the rightmost
position of the character T_{s+j-1} in P.
Figure 3.8: T_{s+m-k-1} = P_{m-k} for 0 <= k <= m - j - 1, and T_{s+j-1} ≠ P_j
Figure 3.9: Bad Character Rule
In the above bad character rule, only one character is used. We can actually do
better by using the so-called good suffix rule. Let us consider Figure 3.10(a).
In this case, m = 12, s = 6 and j = 10. We note that T_{16,17} = P_{11,12} =
"CA" and T_15 = "G" ≠ P_10 = "T". We therefore look for the rightmost substring
to the left of P_10 in P which is equal to "CA" and whose preceding character
is not equal to P_10 = "T", if such a substring exists. In our case, we find
that P_{5,6} = "CA" and P_4 ≠ P_10. Therefore we move the window to the right
to align T_15 with P_4, as shown in Figure 3.10(b). The basic idea of this rule
is illustrated in Figure 3.10(c), and we have the following rule:
Good Suffix Rule 1: Align T_{s+j-1} with P_{j'-m+j}, where j'
(m - j + 1 <= j' < m) is the largest position such that P_{j+1,m} is a suffix
of P_{1,j'} (i.e. P_{j+1,m} = P_{j'-m+j+1,j'}) and P_{j'-m+j} ≠ P_j (see
Figure 3.11).
Figure 3.10: Good Suffix Rule 1
Figure 3.11: The Movement for Good Suffix Rule 1
Figure 3.12: Good Suffix Rule 2
In the following, we make further use of the matched part. Consider Figure
3.12(a). In this case, m = 12, s = 6 and j = 8. Under the previous good suffix
rule, we would try to move the window so that another substring of P matches
T_{14,17} = "AATC" exactly. This is not possible, because no other substring of
P is equal to "AATC". Yet there is a prefix of P, namely P_{1,3} = "ATC", which
is a suffix of "AATC". We therefore move the window so that T_15 is aligned
with P_1. The basic idea of this rule is illustrated in Figure 3.12(c), and we
have the following rule:
Good Suffix Rule 2: Align T_{s+m-j'} with P_1, where j' (1 <= j' <= m - j) is
the largest position such that P_{1,j'} is a suffix of P_{j+1,m} (i.e.
P_{1,j'} = P_{m-j'+1,m}) (see Figure 3.13).
Let B(a) be the rightmost position of character a ∈ Σ in P. Figure 3.14(a)
shows the values of B for Σ = {A,T,C,G} and P = "ATCACATCATCA". This function
is used for applying the bad character rule. The algorithm for computing B is
described in Algorithm 3.4.
Figure 3.13: The Movement for Good Suffix Rule 2
Figure 3.14: Two Functions for the Good Suffix Rule: (a) Function B (b) Function G
Algorithm 3.4 An algorithm for computing Function B
Input: A pattern P of length m.
Output: Function B for P.
for each character a ∈ Σ do
    B(a) = 0;
endfor
for j = 1 to m do
    B(P_j) = j;
endfor
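In Python, Algorithm 3.4 becomes a small table-filling loop (the dictionary
representation and the default alphabet are our choices):

```python
def bad_character_table(P, alphabet="ATCG"):
    """B[a] = rightmost 1-based position of character a in P, or 0 if absent."""
    B = {a: 0 for a in alphabet}          # B(a) = 0 for every a in the alphabet
    for j, c in enumerate(P, start=1):    # later occurrences overwrite earlier
        B[c] = j
    return B
```

For P = "ATCACATCATCA" this gives B(A) = 12, B(T) = 10, B(C) = 11 and B(G) = 0,
as in Figure 3.14(a).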
As for the good suffix rule, we need to scan the pattern to store the maximum
shift for each position of P. Let g1(j) be the largest k such that P_{j+1,m} is
a suffix of P_{1,k} and P_{k-m+j} ≠ P_j, where m - j + 1 <= k < m; g1(j) = 0 if
there is no such k. Let g2(j) be the largest k such that P_{1,k} is a suffix of
P_{j+1,m}, where 1 <= k <= m - j; g2(j) = 0 if there is no such k. Functions
g1(j) and g2(j) are illustrated in Figure 3.15(a) and (b), respectively.
Consider g1 and g2 for P = "ATCACATCATCA", shown in Figure 3.14(b). We set
g1(7) = 9 because P_{8,12} = "CATCA" = P_{5,9}, which is a suffix of P_{1,9},
and P_4 ≠ P_7, while we set g2(7) = 4 because P_{1,4} = "ATCA" = P_{9,12},
which is a suffix of P_{8,12}. However, we set g1(8) = 0 by definition, while
we set g2(8) = 4 because P_{1,4} = "ATCA" = P_{9,12}.
Figure 3.15: Functions g1(j) and g2(j)
Let us consider g1(7). Note that g1(7) = 9. This means that P_{8,12} = "CATCA"
is equal to P_{5,9} and P_4 ≠ P_7. Suppose a mismatch occurs at P_7, as shown
in Figure 3.16(a). We can move the window m - g1(7) = 12 - 9 = 3 positions, as
illustrated in Figure 3.16(b).
Consider g2(4). Note that g2(4) = 4. This means that P_{1,4} is a suffix of
P_{5,12}; that is, P_{1,4} must be equal to P_{9,12}. If a mismatch occurs at
P_4, as shown in Figure 3.17(a), we can move the window
m - g2(4) = 12 - 4 = 8 positions, as illustrated in Figure 3.17(b).
Figure 3.16: Shifting for the Good Suffix Rule 1
Figure 3.17: Shifting for the Good Suffix Rule 2
Function G(j) is defined as follows:

G(j) = m - max{g1(j), g2(j)}

G(j) indicates the number of positions the window can be shifted by the good
suffix rule when a mismatch occurs while comparing P_j with some character of
T. Figure 3.14(b) shows the values of G for P = "ATCACATCATCA". Note that
G(m) = 1 for this pattern.
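Before the linear-time computation is developed below, G can be computed
directly from the definitions of g1 and g2; the following O(m^2) Python sketch
is useful as a specification and cross-check (it is not the efficient
preprocessing of Algorithm 3.5):

```python
def good_suffix_table(P):
    """G[j] = m - max(g1(j), g2(j)), computed naively from the definitions."""
    m = len(P)
    G = {}
    for j in range(1, m + 1):
        suf = P[j:]                               # P_{j+1..m}, the matched suffix
        g1 = 0
        for k in range(m - 1, m - j, -1):         # k = m-1 down to m-j+1
            # P_{j+1,m} is a suffix of P_{1,k}, and P_{k-m+j} differs from P_j
            if P[:k].endswith(suf) and P[k - (m - j) - 1] != P[j - 1]:
                g1 = k
                break
        g2 = 0
        for k in range(m - j, 0, -1):             # P_{1,k} is a suffix of P_{j+1,m}
            if P[:k] == suf[len(suf) - k:]:
                g2 = k
                break
        G[j] = m - max(g1, g2)
    return G
```

For P = "ATCACATCATCA" this yields G(7) = 3, G(1) = 8 and G(12) = 1, matching
Figure 3.14(b).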
For 1 <= j <= m - 1, let the suffix function f'(j) for P_j be the smallest k,
j + 2 <= k <= m, such that P_{k,m} = P_{j+1,m-k+j+1}; m + 1 if there is no such
k. This is illustrated in Figure 3.18. Let f'(m) = m + 2 for P_m. It is easy to
see that the function f' is similar to the prefix function for the reverse of
P. Figure 3.19 shows the values of f' for P = "ATCACATCATCA", where m = 12. We
set f'(12) = 12 + 2 = 14. Since there is no k with 13 = j + 2 <= k <= m = 12,
f'(11) is set to 12 + 1 = 13. Since P_{k,m} = P_{12,12} ≠ P_{j+1,m-k+j+1} =
P_{11,11} for j = 10 and j + 2 = 12 <= k <= m = 12, we get f'(10) = m + 1 = 13.
Similarly, f'(9) = 12 + 1 = 13. For j = 8 and j + 2 = 10 <= k <= m = 12, we let
f'(8) = 12 because 12 is the smallest value of k such that P_{k,m} = P_{12,12}
= "A" = P_{j+1,m-k+j+1} = P_{9,9}. For j = 7 and j + 2 = 9 <= k <= m = 12, we
let f'(7) = 11 because 11 is the smallest value of k such that P_{k,m} =
P_{11,12} = "CA" = P_{j+1,m-k+j+1} = P_{8,9}.
Figure 3.18: The Suffix Function f'
Figure 3.19: Functions f' and G
Let f'^x(y) = f'(f'^{x-1}(y)) for x > 1 and f'^1(y) = f'(y). Function f' can be
redefined as follows:

f'(j) = m + 2            if j = m;
f'(j) = f'^k(j+1) - 1    if 1 <= j <= m - 1 and there exists a smallest k >= 1 such that P_{j+1} = P_{f'^k(j+1)-1};
f'(j) = m + 1            otherwise.
By comparing Figure 3.5 with Figure 3.18, we can see that the suffix function
f' can be computed in the same way as f, except that the direction is reversed.
Function G can be determined by scanning P twice, first from right to left and
then from left to right. Function f' is generated in the first, right-to-left
scan, and some values of G can already be determined during this scan. Consider
Figure 3.15 again. We can easily observe the following fact:
Fact 1: If we scan from right to left and g1(j) is determined during the scan,
then g1(j) >= g2(j).
Consider Figure 3.19 again. Suppose that P_j = P_4 is considered now. We set
f'(4) to 8. This means that P_{f'(j),m} = P_{8,12} = "CATCA" = P_{5,9} =
P_{j+1,m-f'(j)+j+1}. In addition, since P_j = P_4 ≠ P_7 = P_{f'(j)-1}, we know
g1(f'(j) - 1) = g1(7) = m + j - f'(j) + 1 = 9. By Fact 1,
G(7) = m - max{g1(7), g2(7)} = m - g1(7) = m - (m + j - f'(j) + 1) =
f'(j) - 1 - j = 8 - 1 - 4 = 3.
When we compute f', there may be a "while" loop, and g1 can be determined as we
perform this loop. That is, if t = f'(j) - 1 <= m and P_j ≠ P_t, then
g1(t) = m - t + j, as illustrated in Figure 3.20.
Figure 3.20: The Computation of g1(j)
Suppose we have scanned from right to left and determined f'(j) for
j = 1, 2, …, m. Now consider f'(1). From Figure 3.19, f'(1) = 10. This means
that P_{10,12} = P_{2,4}. Let t = f'(1) - 1 = 9 and observe that
P_t = P_9 = "A" = P_1. This means P_{1,4} = P_{9,12}. Thus, from Figure 3.15,
g2(1) = 4 = m - (f'(1) - 1) + 1 = m - f'(1) + 2. Besides, it can easily be seen
that g2(j) = g2(1) for j = 2, 3, …, f'(1) - 2. This is illustrated in Figure
3.21.
Figure 3.21: The Computation of g2(1)
Let k' be the smallest k in {1, …, m} such that P_{f'^k(1)-1} = P_1 and
f'^k(1) - 1 <= m. If G(j') is not determined in the first scan and
1 <= j' <= f'^{k'}(1) - 2, then in the second scan we set
G(j') = m - max{g1(j'), g2(j')} = m - g2(j') = f'^{k'}(1) - 2. If no such k'
exists, each undetermined value of G is set to m in the second scan. For
example, since P_{f'(1)-1} = P_9 = "A" = P_1, we set G(1), G(2), G(3), G(4),
G(5), G(6) and G(8) to f'(1) - 2 = 8.
In general, let z = f'^{k'}(1) - 2 and f''(x) = f'(x) - 1, and let k'' be the
largest k such that f''^k(z) - 1 <= m. The remaining undetermined values of G
are filled by following the chain z, f''(z), f''(f''(z)), …: each undetermined
G(j') with f''^{i-1}(z) - 1 < j' <= f''^i(z) - 1 is set to m - g2(j'), where
1 <= i <= k'' and f''^0(z) = z. For example, since z = 8 and k'' = 2, we set
G(9) and G(11) to f'(8) - 1 = 12 - 1 = 11.
The complete algorithm for computing G using f' is listed below:
Algorithm 3.5 An algorithm for computing function G
Input: A pattern P of length m.
Output: Function G for P.
for j = 1 to m - 1 do
    G(j) = 0;
endfor
f'(m) = m + 2;
t = m;
for j = m - 1 downto 1 do
    f'(j) = t + 1;
    while (t <= m and P_j ≠ P_t) do
        if (G(t) = 0) then
            G(t) = t - j;
        endif
        t = f'(t) - 1;
    endwhile
    t = t - 1;
endfor
for j = 1 to m - 1 do
    if (G(j) = 0) then
        G(j) = t;
    endif
    if (j = t) then
        t = f'(t) - 1;
    endif
endfor
We essentially have to decide the maximum number of positions the window can be
moved to the right when a mismatch occurs. This is determined by the following
function:

max{G(j), j - B(T_{s+j-1})}
The Boyer-Moore algorithm has two phases, one for computing B and G and the
other for searching for the pattern. The complete algorithm is described in the
following:
Algorithm 3.6 The Boyer-Moore algorithm for exact matching
Input: A text string T of length n and a pattern string P of length m.
Output: All occurrences of P in T.
/* Phase 1 */
Compute G;
Compute B;
/* Phase 2 */
s = 1;
while (s <= n - m + 1) do
    j = m;
    while (j >= 1 and P_j = T_{s+j-1}) do
        j = j - 1;
    endwhile
    if (j = 0) then
        Report that pattern P appears at position s;
        s = s + G(1);
    else
        s = s + max{G(j), j - B(T_{s+j-1})};
    endif
endwhile
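Putting the two rules together, Algorithm 3.6 can be sketched in Python as
follows. For readability the tables are built inline with the naive O(m^2)
good-suffix computation, so this is a sketch rather than the optimized
preprocessing of Algorithm 3.5:

```python
def boyer_moore(T, P, alphabet="ATCG"):
    """Report all 1-based positions where P occurs in T (Algorithm 3.6)."""
    n, m = len(T), len(P)
    # bad character table B
    B = {a: 0 for a in alphabet}
    for j, c in enumerate(P, start=1):
        B[c] = j
    # good suffix table G, directly from the definitions of g1 and g2
    G = {}
    for j in range(1, m + 1):
        suf = P[j:]
        g1 = next((k for k in range(m - 1, m - j, -1)
                   if P[:k].endswith(suf) and P[k - (m - j) - 1] != P[j - 1]), 0)
        g2 = next((k for k in range(m - j, 0, -1)
                   if P[:k] == suf[len(suf) - k:]), 0)
        G[j] = m - max(g1, g2)
    positions = []
    s = 1
    while s <= n - m + 1:
        j = m
        while j >= 1 and P[j - 1] == T[s + j - 2]:   # right-to-left scan
            j -= 1
        if j == 0:
            positions.append(s)
            s += G[1]
        else:
            # j - B(...) may be negative; G[j] >= 1 guarantees progress
            s += max(G[j], j - B.get(T[s + j - 2], 0))
    return positions
```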
Consider the text T and pattern P given in Figure 3.22, where n = 24 and
m = 12. The functions G and B are shown in Figure 3.14. The Boyer-Moore
algorithm first sets s = 1 and starts the pairwise comparison in right-to-left
order (see Figure 3.22(a)). The first mismatch occurs at P_j = P_12 and the
algorithm recomputes s by the following statement:

s = s + max{G(j), j - B(T_{s+j-1})}
  = 1 + max{G(12), 12 - B(T_12)}
  = 1 + max{1, 2}
  = 3
Figure 3.22: An Example for the Boyer-Moore Algorithm
Then we shift the window to position s = 3 and start a new pairwise comparison
(see Figure 3.22(b)). Next, a mismatch occurs at P_j = P_7 and s is reset by
the following statement:

s = s + max{G(j), j - B(T_{s+j-1})}
  = 3 + max{G(7), 7 - B(T_9)}
  = 3 + max{3, -5}
  = 6

So we shift the window to position s = 6 and start a new pairwise comparison
(see Figure 3.22(c)). After m comparisons, an occurrence of P is found at
position s = 6. Then we recompute s by the following statement:

s = s + G(1)
  = 6 + 8
  = 14

Since s = 14 > n - m + 1 = 24 - 12 + 1 = 13, the algorithm terminates.
The preprocessing time of phase 1 is O(m) + O(m + |Σ|) = O(m + |Σ|), where
O(m + |Σ|) time is for computing B and O(m) time for G. However, the worst-case
time for phase 2 is O((n - m + 1)m). It has been proved that the algorithm
makes O(n) comparisons when P does not occur in T, but it may take O(nm)
comparisons when P occurs in T. Many Boyer-Moore-like algorithms (such as the
Apostolico-Giancarlo algorithm) have been derived that take O(n) comparison
time in the worst case while remaining more efficient in practice than the KMP
algorithm.
3.4 Suffix Trees and Suffix Arrays
Let S be a string of n characters, and let S(i) denote the suffix S_{i,n} of S
for 1 <= i <= n. For example, if S is "ATCACATCATCA", its 12 suffixes are
listed in Table 3.1.
Table 3.1: Suffixes for S = “ATCACATCATCA”
The basic idea behind the suffix tree is that we can classify all of the
substrings of a given sequence into groups. For example, consider the sequence
S = "ATCACATCATCA". Only three characters appear in this sequence: A, C and T.
Thus, we can classify all of the substrings into three groups: (1) the
substrings which start with A, (2) the substrings which start with C and (3)
the substrings which start with T. Note that A, for instance, appears at
positions 1, 4, 6, 9 and 12. This means that any substring which starts with A
must be a prefix of one of the following suffixes: S(1), S(4), S(6), S(9) and
S(12). Similarly, any substring which starts with C must be a prefix of one of
the suffixes S(3), S(5), S(8) and S(11). The suffix tree, as we shall introduce
below, groups the suffixes in such a way that this observation lets us
determine easily whether any given substring appears in S.
A suffix tree of a string S of length n is a tree with the following
properties:
- Each tree edge is labeled by a substring of S.
- Each internal node has at least 2 children.
- Each S(i) has its corresponding labeled path from the root to a leaf, for
  1 <= i <= n.
- There are n leaves.
- No two edges branching out from the same internal node can have labels
  starting with the same character.
A suffix tree for S = "ATCACATCATCA" is depicted in Figure 3.23. Note that we
append the symbol "$" to the end of S to satisfy the property that there are n
leaves. The leaf number denotes the starting index of the suffix spelled out by
the labeled path from the root to that leaf. For example, the labeled path from
the root to leaf 9 is "ATCA$", corresponding to S(9). Since the edges branching
from any node carry labels with different starting symbols, the labeled path
from the root to each leaf is unique. Each leaf corresponds to a suffix, and
the structure of the suffix tree for S is unique.
The suffix tree for S can be used to determine whether a given pattern P occurs
in S. Consider the suffix tree for S = "ATCACATCATCA" drawn in Figure 3.23.
Suppose P = "TCAT". First we examine the branches from the root. Since P_1 is
"T", we follow the branch labeled "TCA" and match P_{1,3} with "TCA". Next we
examine the branches labeled "$", "TCA$" and "CATCATCA$". Since P_4 is "T", we
follow the branch labeled "TCA$". We can now report that P occurs at position 7
in S, because P_4 matches the first symbol of "TCA$" and leaf 7 is reached
along this branch.
Suppose P = "TCA". First we examine the branches from the root. Since P_1 is
"T", we follow the branch labeled "TCA" and match P_{1,3} = "TCA" with "TCA".
We can now report that P occurs at positions 2, 7 and 10 in S, because leaves
2, 7 and 10 are reached along the branch labeled "TCA".
Figure 3.23: A Suffix Tree for S = "ATCACATCATCA"
Suppose P = "TCATT". First we examine the branches from the root. Since P_1 is
"T", we follow the branch labeled "TCA" and match P_{1,3} = "TCA" with "TCA".
Next we examine the branches labeled "$", "TCA$" and "CATCATCA$". Since P_4 is
"T", we follow the branch labeled "TCA$". Since P_{4,5} = "TT" does not match
the first two symbols of "TCA$", we report that P is not in S.
In the following, we shall give a simple algorithm to create a suffix tree.
Given a string S, we first create all of its suffixes, divide them into
distinct groups according to their starting characters, and create a root node.
For each group, if it contains only one suffix, we create a leaf node and a
branch labeled with this suffix; otherwise, we find the longest common prefix
of all suffixes in the group, create a branch out of the node labeled with this
longest common prefix, and delete this prefix from all suffixes of the group.
We repeat the above procedure for each node which is not yet terminated.
Let us consider S = "ATCACATCATCA" as an example. The 12 suffixes are shown in
Table 3.1. We divide these suffixes into three groups, N_1 = {1, 4, 6, 9, 12},
N_2 = {3, 5, 8, 11} and N_3 = {2, 7, 10}, for the starting characters "A", "C"
and "T" respectively.
Consider N_3, whose suffixes all start with "T". There are three suffixes,
namely S(2) = "TCACATCATCA", S(7) = "TCATCA" and S(10) = "TCA". Among these
three suffixes, "TCA" is the longest common prefix. It can easily be shown that
in N_2 the longest common prefix is "CA", and in N_1 it is "A". This is shown
in Figure 3.24(a).
Figure 3.24: Constructing a Suffix Tree for S = "ATCACATCATCA"
After the longest common prefix is determined, we delete it from all suffixes
of the group. Consider N_3 again. After "TCA" is deleted from S(2), S(7) and
S(10), we have three remaining suffixes, namely "CATCATCA$", "TCA$" and "$".
They are divided into three groups, each containing only one suffix, so three
branches are created for these three suffixes. Consider N_2. After "CA" is
deleted from S(3), S(5), S(8) and S(11), we have "CATCATCA$", "TCATCA$", "TCA$"
and "$". We divide this set of suffixes into three groups: those starting with
"C", those starting with "T" and the one consisting of "$". Among those
starting with "T", "TCA" is the longest common prefix. The whole development of
nodes after the first round of longest common prefixes has been processed is
shown in Figure 3.24(b).
Refer to Algorithm 3.7 for the details of creating a suffix tree.
Algorithm 3.7 An algorithm to create a suffix tree
Input: A string S.
Output: A suffix tree of S.
Step 1: Create all suffixes of S.
    Create a node N.
    Denote the set of all suffixes of S by G_N.
    Let x = N, G_x = G_N and put x, together with G_x, into a queue Q.
Step 2: Delete an element x, together with G_x, from Q.
    Divide all suffixes in G_x into groups such that in each group, all
    suffixes start with the same character.
    Denote these groups by G_1, G_2, …, G_k for some k.
Step 3: For i = 1 to k, do the following:
    If G_i contains only one suffix, create a leaf node and a branch labeled
    with this suffix, and delete the suffix from the group.
    Otherwise, among all suffixes of this group, find the longest common
    prefix. Create a node x and a branch labeled with this prefix.
    Delete this prefix from every suffix of the group.
    Let this new group of suffixes be denoted by G_x.
    Put x, together with G_x, into Q.
Step 4: If Q is empty, exit and report the tree as a suffix tree for S.
    Otherwise, go to Step 2.
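Algorithm 3.7 can be prototyped directly in Python; here a node is a dictionary
mapping edge labels to children, and a leaf is the 1-based starting index of
its suffix (these representation choices are ours):

```python
import os

def build_suffix_tree(S):
    """Build a suffix tree for S + "$" following Algorithm 3.7 (naive, O(n^2))."""
    S = S + "$"
    def build(group):                      # group: list of (remaining suffix, index)
        node = {}
        by_first = {}
        for suf, idx in group:             # Step 2: partition by first character
            by_first.setdefault(suf[0], []).append((suf, idx))
        for grp in by_first.values():      # Step 3
            if len(grp) == 1:
                suf, idx = grp[0]
                node[suf] = idx            # leaf: label is the whole remainder
            else:
                lcp = os.path.commonprefix([s for s, _ in grp])
                node[lcp] = build([(s[len(lcp):], i) for s, i in grp])
        return node
    return build([(S[i:], i + 1) for i in range(len(S))])

def leaves(node):
    if isinstance(node, int):
        return [node]
    return [x for child in node.values() for x in leaves(child)]

def find(node, P):
    """1-based starting positions of P in S, or [] if P does not occur."""
    for label, child in node.items():
        if label.startswith(P):            # P ends inside this edge
            return sorted(leaves(child))
        if P.startswith(label) and not isinstance(child, int):
            return find(child, P[len(label):])
    return []
```

For S = "ATCACATCATCA", find returns [7] for "TCAT", [2, 7, 10] for "TCA" and
[] for "TCATT", matching the walk-throughs above.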
A suffix tree for a text string T of length n can be constructed in O(n) time.
Searching for a pattern P of length m in a suffix tree takes O(m) comparisons.
Thus we have an O(n + m) time algorithm for the exact string matching problem.
An array A of n elements is called the suffix array for S if the strings
S(A[1]), S(A[2]), …, S(A[n]) are in nondecreasing lexical order. For example,
the nondecreasing lexical order of the suffixes of S = "ATCACATCATCA" is S(12),
S(4), S(9), S(1), S(6), S(11), S(3), S(8), S(5), S(10), S(2) and S(7). Table
3.2 shows the suffix array A.
A suffix array A for S can be constructed by a lexical depth first search of
the suffix tree for S. The lexical depth first search is described below. We
start the search from the root node x. Select an unvisited node y branching
from x such that the label of the edge linking x and y has the smallest
alphabetical order among the edges linking x to unvisited nodes. When a node z
is reached such that all of its adjacent nodes have been visited, we back up to
the last visited vertex and continue the depth first search. The search
terminates when no unvisited vertex can be reached from any of the visited
ones. The visiting sequence of the leaves in the lexical depth first search
gives the nondecreasing lexical order of the suffixes. Consider the suffix tree
in Figure 3.23 again. The visiting sequence of leaves in the lexical depth
first search is 12, 4, 9, 1, 6, 11, 3, 8, 5, 10, 2, 7, so the nondecreasing
lexical order of the suffixes is S(12), S(4), S(9), S(1), S(6), S(11), S(3),
S(8), S(5), S(10), S(2) and S(7).
If T is represented by a suffix array, it takes O(m log n) time to find P in T,
because a binary search can be conducted on the array. Since a suffix array can
be determined in O(n) time by a lexical depth first search of the suffix tree
for a string of length n, the total time is O(n + m log n).
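The suffix array and the binary search over it can be sketched as follows
(plain sorting is used for brevity, so this construction is not the O(n) method
via the suffix tree):

```python
import bisect

def suffix_array(S):
    """1-based suffix start positions in nondecreasing lexical order."""
    return sorted(range(1, len(S) + 1), key=lambda i: S[i - 1:])

def sa_search(S, A, P):
    """All 1-based positions of P in S, via binary search on the suffix array."""
    suffixes = [S[i - 1:] for i in A]          # sorted by construction
    lo = bisect.bisect_left(suffixes, P)       # first suffix >= P
    hi = lo
    while hi < len(suffixes) and suffixes[hi].startswith(P):
        hi += 1                                # suffixes sharing the prefix P
    return sorted(A[lo:hi])
```

For S = "ATCACATCATCA" the array is [12, 4, 9, 1, 6, 11, 3, 8, 5, 10, 2, 7],
matching the order listed above.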
A string Z is called a common substring of strings X and Y if Z is a substring
of both X and Y. The longest common substring of X and Y is a common substring
of X and Y of maximum length. For example, "PAT" is the longest common
substring of X = "APAT" and Y = "PATT". We can create a suffix tree containing
the suffixes of both X and Y to find the longest common substring. Suppose that
X = "APAT" and Y = "PATT". The suffixes of X and Y are listed in Table 3.3, and
Figure 3.25 shows the suffix tree for X and Y. We mark with "1" each internal
node reachable from leaves of suffixes of both X and Y, as shown in Figure
3.25. The labeled path from the root to a marked internal node is a common
substring of X and Y. The longest common substring can be found by generating
all such common substrings and selecting the longest one.
Table 3.3: Suffixes for X = “APAT” and Y = “PATT”
Figure 3.25: A suffix tree for X = “APAT” and Y = “PATT”
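As a simple cross-check of the suffix-tree method, the longest common substring can also be computed by a straightforward O(|X||Y|) dynamic program. This is an alternative to, not an implementation of, the marking procedure above.

```python
def longest_common_substring(x, y):
    # L[i][j] = length of the longest common suffix of x[:i] and y[:j];
    # the answer is the substring ending where L attains its maximum.
    L = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    best, end = 0, 0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
                if L[i][j] > best:
                    best, end = L[i][j], i
    return x[end - best:end]
```

For X = "APAT" and Y = "PATT", this returns "PAT", agreeing with the suffix-tree argument.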
3.5 Approximate String Matching
Given a text string T of length n, a pattern string P of length m, and a maximal number k of allowed errors, the approximate string matching problem is to find all positions in the text where the pattern matches the text with at most k errors, where an error is the substitution, deletion or insertion of a character. For instance, if T = "pttapa", P = "patt"
and k = 2, the substrings T1,2, T1,3, T1,4 and T5,6 all match P with at most 2 errors.
In the following, we first define a distance function that measures the similarity between two strings S1 and S2. This is the suffix edit distance: the minimum number of substitutions, insertions and deletions that transform some suffix of S1 into S2.
Consider S1 = "p" and S 2 = "p". The suffix edit distance between S1 and S 2
is 0.
Consider S1 = "ptt" and S2 = "p". The suffix edit distance between S1 and S2 is 1, since the suffix "pt" of S1 can be transformed into "p" by deleting the last character "t".
Consider S1 = "pttap" and S 2 = "patt". The suffix edit distance between S1
and S 2 is 3 because we need at least 3 operations to transform any suffix of S1 into
S2 .
What is the meaning of the suffix edit distance between T and P? If it is not
greater than k, then we know that there is an approximate matching of a suffix of T
with P with error not greater than k. That is, we have succeeded in finding a desired
approximate matching. Given T and P, our approach is to find the suffix edit distance
between T1,1 , T1, 2 , …, T1,n and P. For any i where the suffix edit distance between
T1,i and P is less than or equal to k, we know that there is an approximate matching
with error less than or equal to k.
Let us consider T = "pttapa" and P = "patt".
For T1,1 = "p" and P = "patt", the suffix edit distance between them is 3. There is
no approximate matching with error not greater than 2.
For T1, 2 = "pt" and P = "patt", the edit distance is 2. Thus we have found an
approximate matching with error 2.
For T1,5 = "pttap" and P = "patt", the edit distance is 3.
For T1,6 = "pttapa" and P = "patt", the edit distance is 2 again, because the suffix "pa" of T1,6 can be transformed into P by inserting the two characters "tt".
Our approximate string matching problem now reduces to the following problem: given T and P, find the suffix edit distances between T1,i and P for i = 1, 2, …, n, where n is the length of T. This problem can be solved by using the dynamic programming approach introduced in Chapter 2. The approach is almost the same as that used to find the longest common subsequence of two sequences.
Let E(i,j) denote the suffix edit distance between T1,j and P1,i. To find E(i,j), we consider the following possibilities:
Case 1. Tj = Pi. In this case, the last characters match, so we set E(i,j) = E(i - 1,j - 1).
Case 2. Tj ≠ Pi. In this case, the last step is a substitution, an insertion or a deletion, so we set E(i,j) = min{E(i - 1,j - 1), E(i - 1,j), E(i,j - 1)} + 1.
Refer to Algorithm 3.8 for the details of finding all suffix edit distances between T
and P. This problem can be solved in O(nm) time.
Algorithm 3.8 An algorithm to compute E(i,j) for 0 <= i <= m and 0 <= j <= n
Input: Strings T and P, where the lengths of T and P are n and m respectively.
Output: E(i,j) for 0 <= i <= m and 0 <= j <= n.
Step 1: E(0,j) = 0 for j = 1, 2, …, n
        E(i,0) = i for i = 0, 1, 2, …, m
Step 2: for i = 1 to m do
          for j = 1 to n do
            if Tj = Pi then
              E(i,j) = E(i - 1,j - 1)
            else
              E(i,j) = min{E(i - 1,j - 1), E(i - 1,j), E(i,j - 1)} + 1
            endif
          endfor
        endfor
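A sketch of Algorithm 3.8 in Python follows. We use the standard edit-distance recurrence, in which a mismatch may also be resolved by a substitution step through E(i - 1, j - 1); the function name is illustrative.

```python
def suffix_edit_distances(text, pattern):
    # E[i][j] = minimum number of substitutions, insertions and
    # deletions transforming some suffix of text[:j] into pattern[:i].
    n, m = len(text), len(pattern)
    E = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        E[i][0] = i              # empty text: i insertions are needed
    # E[0][j] = 0: a suffix may start anywhere in the text for free.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if text[j - 1] == pattern[i - 1]:
                E[i][j] = E[i - 1][j - 1]
            else:
                E[i][j] = 1 + min(E[i - 1][j - 1],   # substitution
                                  E[i - 1][j],       # insertion
                                  E[i][j - 1])       # deletion
    return E[m]                  # E(m, j) for j = 0, 1, ..., n
```

Positions j with E(m, j) <= k are the end positions of approximate occurrences; the whole table takes O(mn) time.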
Figure 3.26(a) shows E for T = "pttapa" and P = "patt". As can be seen, E(4,2), E(4,3) and E(4,6) are all less than or equal to k = 2. Through an appropriate tracing back, we can find all of the desired approximate matchings.
Figure 3.26: Dynamic Programming Approach for T = “pttapa”, P = “patt” and k = 2
First of all, since our goal is to find all occurrences of approximate matchings, we have to record how each E(i,j) is obtained when it is computed, as shown in Figure 3.26(b). For example, consider E(3,2). It is obtained from E(2,1) because T2 = P3 = "t". Thus there is an arrow from E(3,2) pointing to E(2,1). Consider E(4,6). It is obtained from E(3,6), which is why there is an arrow from E(3,6) to E(4,6).
All of the approximate matchings with error not greater than k = 2 can now be found by tracing back. When we trace back, we ignore every arrow pointing vertically, because such an arrow has nothing to do with locating the occurrence in T.
Consider E(4,3). The arrows traced are E(4,3) to E(3,2) to E(2,1) to E(1,1). We
ignore E(2,1) to E(1,1). Thus we have obtained an occurrence of approximate
matching, namely T1,3 = "ptt".
Consider E(4,6). The arrows traced are E(4,6) to E(3,6) to E(2,6) to E(1,5). Ignoring E(2,6) to E(3,6) and E(3,6) to E(4,6), we obtain T5,6 = "pa", which is an occurrence of the desired approximate matching.
Finally, if we change the distance function as follows:
E(i,j) = E(i - 1,j - 1) if Tj = Pi, and
E(i,j) = 0 if Tj ≠ Pi, with
E(0,j) = 1 for j = 0, 1, …, n and
E(i,0) = 0 for i = 1, 2, …, m,
then E(m,j) = 1 exactly when P occurs in T ending at position j, so this function finds all occurrences of exact matching. This approach is simple, but not efficient, because the time complexity is still O(mn).
3.6 The Convolution Approach to Solving the Exact String Matching Problem
For exact matching, one of the most straightforward approaches is the sliding window approach. Assume that we have a text string T = AAGTCTTCGA and another string P = AGTC, and our job is to check whether P appears in T. The sliding window approach slides P from left to right, one step at a time, as shown in Figure 3-27.
Figure 3-27 The sliding window approach: P = AGTC is aligned under T = AAGTCTTCGA and shifted one position to the right at each step.
This approach, although quite straightforward, cannot be programmed elegantly. Besides, its time complexity is O(mn), where m and n are the lengths of P and T respectively. We now introduce an approach, called convolution, which is essentially the same as the sliding window approach. Yet, as we shall see, there are many advantages in using this approach. We first introduce the definition of convolution in the continuous case.
In communication theory, the convolution of two functions f(t) and g(t), denoted f(t) * g(t), is defined as

f(t) * g(t) = ∫_{−∞}^{+∞} f(τ) g(t − τ) dτ.
The meaning of convolution can best be explained by imagining the following situation. We fix f(τ) and place the reflected copy g(t − τ) far to the left, which corresponds to t = −∞; we then move it from left to right, which corresponds to letting t run from −∞ to +∞. Initially the two functions do not overlap, so their product, and hence the integral, is 0. As g(t − τ) begins to overlap f(τ), the integration starts to contribute, and the computation of the convolution is complete as soon as g(t − τ) departs from f(τ). An example is given below.
Consider the following two functions x(u) and y(u), where

x(u) = 1 for −1 ≤ u ≤ 1, and x(u) = 0 otherwise;
y(u) = −u + 1 for 0 ≤ u ≤ 1, and y(u) = 0 otherwise.

The process of computing the convolution of x(u) and y(u) is illustrated in Figure 3-28.
Figure 3-28 An example of convolution: (a) x(u); (b) y(u); (c), (d) the reflected copy y(t − u) sliding toward and into x(u) as t grows; (e), (f) the overlap region for larger t; (g) the resulting convolution z(t), which is nonzero for −1 < t < 2.
The relationship between convolution and the sliding window method for string matching can be explained as follows. Imagine the sliding window method applied to two sequences T and P: first we move P to the extreme left, and then move it towards the right. As soon as P meets T, checking begins, and the checking is completed as soon as P leaves T completely. The process is illustrated in Figure 3-29.
T  T1T2 Tn
P  P1 P2  Pm
T1T2 Tn
P1 P2  Pm
shifting
P1 P2  Pm
shifting
P1 P2  Pm
Figure 3-29 A formal illustration of the sliding window approach
We can now see that the process of the sliding window method is quite similar to the process of convolution. In fact, we may use convolution to perform the job of the sliding window. In the following, we first define the discrete convolution.
The Definition of Convolution in the Discrete Case
Let X  x0 ,..., xm  , Y  y0 ,..., yn  be two given vectors, xi , yi  D . Let
 and  be two given functions, where : D  D  E , : E  E  E . Then the
convolution of X and Y with respect to  and  is
X  ,  Y  z o , z1 ,  , z n  m ;
z k   ( xi  y j ) for k  0,  , m  n
i  j k
According to the above definition:
z 0  ( xo  y 0 )
z1  ( x1  y 0 )  ( x0  y1 )
z 2  ( x 2  y 0 )  ( x1  y1 )  ( x0  y 2 )

z m  n 1  ( x m  y n 1 )  ( x m 1  y n )
z m n  ( xm  y n )
What does convolution have to do with solving the exact string matching problem by the sliding window method? First, we reverse Y to obtain Y' = (yn, y_{n−1}, …, y1, y0). We then perform the convolution between X and Y' as follows:

z0 = (x0 ⊗ yn)
z1 = (x1 ⊗ yn) ⊕ (x0 ⊗ y_{n−1})
z2 = (x2 ⊗ yn) ⊕ (x1 ⊗ y_{n−1}) ⊕ (x0 ⊗ y_{n−2})
…
z_{m+n−1} = (xm ⊗ y1) ⊕ (x_{m−1} ⊗ y0)
z_{m+n} = (xm ⊗ y0)
We further define ⊗ to be the character comparison function c(xi, yj), which is 1 if xi = yj and 0 otherwise, and we let ⊕ be integer addition. Then the convolution of X and Y' is

z0 = c(x0, yn)
z1 = c(x1, yn) + c(x0, y_{n−1})
z2 = c(x2, yn) + c(x1, y_{n−1}) + c(x0, y_{n−2})
…
z_{m+n−1} = c(xm, y1) + c(x_{m−1}, y0)
z_{m+n} = c(xm, y0)

Assume that m ≥ n. If zi equals n + 1, the length of the pattern, for some i, we know that there is an exact match.
Consider X = abaab and Y = aab again. We now have X = (x0, x1, x2, x3, x4) = (a, b, a, a, b) and Y' = (y2, y1, y0) = (b, a, a). Thus,

z0 = c(a, b) = 0
z1 = c(a, a) + c(b, b) = 2
z2 = c(a, a) + c(b, a) + c(a, b) = 1
z3 = c(b, a) + c(a, a) + c(a, b) = 1
z4 = c(a, a) + c(a, a) + c(b, b) = 3
z5 = c(a, a) + c(b, a) = 1
z6 = c(a, b) = 0
Since z4 = 3, which equals the length of the pattern, we know that there is an exact match at the 4th shift, as shown in Fig. 3-30.

Fig. 3-30 An illustration of a discrete convolution process: aab is placed under abaab at shifts 0 through 4; at shift 4 all three characters match.
We may also view the discrete convolution in a graphical way, using the above case as an example. As Figure 3-31 shows, the computation mimics long multiplication: the partial rows 10110, 10110 and 01001, obtained by comparing abaab with each character of baa, are shifted and added (without carrying) to give 0211310.

Figure 3-31 A graphical way of demonstrating the discrete convolution.
From Fig. 3-31, it can easily be seen that convolution is very similar to multiplication. This property is quite important, as we shall use it later. Meanwhile, let us give the general algorithm for discrete convolution as follows:
Algorithm 3-9: An Algorithm for Convolution Defined on ⊗, ⊕
Input: Two sequences, X = x0 x1 … xm and Y = y0 y1 … yn
Output: z_k = ⊕_{i+j=k} (x_i ⊗ y_j) for k = 0, …, m + n

for k = 0 to m + n do
  zk = identity element of ⊕
  for i = max(0, k − n) to min(k, m) do
    zk = zk ⊕ (x_i ⊗ y_{k−i})
  end for
end for
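A sketch of Algorithm 3-9 follows, with the two operations passed in as functions, together with its specialization to exact string matching (character comparison and integer addition, with the pattern reversed). The names are illustrative.

```python
def convolve(x, y, otimes, oplus, identity):
    # Algorithm 3-9: z_k is the oplus-combination over i + j = k of
    # (x_i otimes y_j); `identity` is the identity element of oplus.
    m, n = len(x) - 1, len(y) - 1
    z = []
    for k in range(m + n + 1):
        zk = identity
        for i in range(max(0, k - n), min(k, m) + 1):
            zk = oplus(zk, otimes(x[i], y[k - i]))
        z.append(zk)
    return z

def match_counts(text, pattern):
    # Exact matching: otimes is character comparison, oplus is integer
    # addition, and the pattern is reversed first.
    c = lambda a, b: 1 if a == b else 0
    return convolve(list(text), list(reversed(pattern)),
                    c, lambda u, v: u + v, 0)
```

For X = abaab and Y = aab this produces (0, 2, 1, 1, 3, 1, 0); the entry 3, equal to the pattern length, signals the exact match found above.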
To apply convolution to solve the exact string matching problem, the algorithm is as follows:

Algorithm 3-10: An Algorithm for Applying Convolution to Solve the Exact String Matching Problem
Input: Two sequences, X = x0 x1 … xm and Y = y0 y1 … yn, where Y is the reversal of the pattern
Output: z_k = Σ_{i+j=k} c(x_i, y_j) for k = 0, …, m + n

for k = 0 to m + n do
  zk = 0
  for i = max(0, k − n) to min(k, m) do
    zk = zk + c(x_i, y_{k−i})
  end for
end for
From the above algorithm, we can find occurrences of exact match as follows: if zi is equal to the length of the pattern string, there is an exact match at the i-th shift. In our example, where the text string is “abaab” of length 5 and the pattern string is “aab” of length 3, the result of the discrete convolution shows that z4 = 3, which stands for an exact match at the 4th shift.
The time complexity of the above algorithm is O(mn). In the following, we shall point out that the discrete convolution can be transformed into integer multiplication. Because integer multiplication can be performed in O(n log n) time, where n is the length of the longer input, by applying the convolution theorem and the fast Fourier transform, the time complexity of the original discrete convolution may be reduced to O(n log n).
Transforming the Discrete Convolution into Integer Multiplication
We have already shown that the exact string-matching problem may be solved by
the discrete convolution defined on “character comparison” and “integer addition”. To
speed up the process of the discrete convolution, we are going to transform the
discrete convolution into integer multiplication. This approach was proposed in
[FSK82].
The approach can be explained in three steps, as shown below. For the exact string matching problem, we are given two inputs: the text string T of length n and the pattern string P of length m.
1. We first define an indicator function to transform the input data, Ti, 0 ≤ i ≤ n, and Pi, 0 ≤ i ≤ m, which are originally characters, into binary digits, one indicator string per character of the alphabet. Let x be a character of the pattern string or the text string and let a ∈ Σ; the indicator function is defined as

f(x, a) = 1 if x and a are identical, and f(x, a) = 0 otherwise.

The indicator function is extended to strings in the natural way: if S = s1 s2 … sn, then

S(a) = f(s1, a) f(s2, a) … f(sn, a) for each a ∈ Σ.

In the exact string matching problem, we use the indicator function to transform the text string and the pattern string into T(a), a ∈ Σ, and P(a), a ∈ Σ, respectively. In this format, we may use the discrete convolution defined on the operations of “integer multiplication” and “integer addition” in Step 2 to calculate the number of matches contributed by each character.
2. For each character a ∈ Σ, we apply the discrete convolution defined on “integer multiplication” and “integer addition” to T(a) and P(a), that is, C(a) = T(a) * P(a) for every a ∈ Σ. In fact, this discrete convolution is equivalent to integer multiplication, because the transformed strings can be treated as natural numbers and the convolution can be carried out as a multiplication of those numbers (without carrying). The result C(a) gives, for every shift, the number of matches contributed by the character a.
3. The results of these convolutions (integer multiplications), C(a), a ∈ Σ, are then summed, S = Σ_{a∈Σ} C(a), to obtain the solution to the exact string matching problem. The final result is equal to that of the original convolution.
The outline of the whole process of transforming the original convolution into integer multiplication is illustrated in Figure 3-32.

Figure 3-32 The outline of transforming the discrete convolution into integer multiplication: Step 1 applies the indicator function to T and P, producing T(a) and P(a) for each a ∈ Σ; Step 2 performs the convolution of each pair as an integer multiplication, producing C(a); Step 3 sums the C(a) to give S.
In the following, we use an example to demonstrate this process. Consider the case where the text string is X = abaab, the pattern string is Y = aab, and the alphabet is Σ = {a, b}.

Step 1. The corresponding transformed strings are as follows:

X(a) = 10110
Y(a) = 110
X(b) = 01001
Y(b) = 001
To apply the discrete convolution to the exact string matching problem, we use the reversal Y' of Y as one of the inputs of the convolution. Thus, we have

Y'(a) = 011
Y'(b) = 100
Step 2. We then apply the discrete convolution defined on “integer multiplication” and “integer addition” to the transformed strings. Carried out as long multiplication without carrying, this gives

Z(a) = X(a) * Y'(a): 10110 × 011 = 0111210
Z(b) = X(b) * Y'(b): 01001 × 100 = 0100100

Fig. 3-33 A step in the multiplication.
Step 3. After all the characters in the alphabet have been considered, the results are combined as shown in Fig. 3-34:

Z(a): 0111210
Z(b): 0100100
Z:    0211310

Figure 3-34 The number of matches for every shift obtained by integer multiplication.

The combined result is equivalent to that of the original convolution. In Figure 3-34, the number of matches at each shift can be read off; an entry equal to the length of the pattern string indicates an exact match.
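The three steps can be sketched as follows. The digit-wise products are computed directly (long multiplication without carrying), and the function names are ours.

```python
def indicator(s, a):
    # Step 1: the indicator function f, extended to a whole string.
    return [1 if ch == a else 0 for ch in s]

def digitwise_product(xs, ys):
    # Step 2: long multiplication of two digit vectors without
    # carrying, which is exactly their convolution.
    z = [0] * (len(xs) + len(ys) - 1)
    for i, xd in enumerate(xs):
        for j, yd in enumerate(ys):
            z[i + j] += xd * yd
    return z

def match_counts_by_alphabet(text, pattern):
    # Step 3: sum the per-character results C(a) over the alphabet.
    rev = pattern[::-1]
    total = [0] * (len(text) + len(pattern) - 1)
    for a in set(text) | set(pattern):
        c = digitwise_product(indicator(text, a), indicator(rev, a))
        total = [t + v for t, v in zip(total, c)]
    return total
```

For X = abaab and Y = aab this reproduces Z = (0, 2, 1, 1, 3, 1, 0), the result of the original convolution.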
We summarize the above steps in Figure 3-35. In our example, we first transform the input data according to the indicator function. Then we apply the discrete convolution defined on “integer multiplication” and “integer addition” to them. When we consider ‘a’ (respectively ‘b’), the corresponding process is shown in the left (respectively right) part of Figure 3-35. Finally, we sum the results in the middle of Figure 3-35.
Figure 3-35 The decomposition of the discrete convolution for string matching: X = abaab and Y' = baa are transformed into the indicator strings X(a) = 10110, X(b) = 01001, Y'(a) = 011 and Y'(b) = 100; the two digit-wise multiplications give Z(a) = 0111210 and Z(b) = 0100100, which sum to Z = 0211310.
Without loss of generality, we may always decompose the original convolution into |Σ| integer multiplications, since the number of matches contributed by a single character of the alphabet can be computed by the corresponding integer multiplication. Going through all elements of the alphabet, we sum up the results; the sum is exactly the solution to the exact string matching problem.
Having shown that the answer to exact string matching can be found by integer multiplication, we shall now introduce the mechanism to speed up integer multiplication by using the discrete Fourier transform.
Speeding Up of Integer Multiplication by Discrete Fourier Transform
In the above, we showed that the discrete convolution can be viewed as integer multiplication. In the following, we shall show another interesting fact, namely that integer multiplication can be performed through discrete Fourier transforms. To see this, let us reason in the reverse direction: integer multiplication is actually a special kind of convolution involving multiplication and addition. That is, if we express our integers x and y as two vectors X and Y, the integer multiplication of x and y is the convolution of X and Y. That the convolution of X and Y can be computed through Fourier transforms follows from the convolution theorem.
Convolution Theorem
Given two vectors X and Y, let F(X) and F(Y) represent the Fourier transforms of X and Y respectively, let X * Y denote the convolution of X and Y, and let · denote the element-wise vector product. Then the Fourier transform of X * Y is F(X) · F(Y). Equivalently, the inverse Fourier transform of F(X) · F(Y), that is, F⁻¹(F(X) · F(Y)), is X * Y. In formulas:

F(X * Y) = F(X) · F(Y)
F⁻¹(F(X * Y)) = F⁻¹(F(X) · F(Y))
X * Y = F⁻¹(F(X) · F(Y))
By applying the convolution theorem described above, we may speed up integer multiplication by the fast Fourier transform (FFT). The procedure is explained in Figure 3-36.

Figure 3-36 The procedure of computing convolution by using the fast Fourier transform: the brute-force method computes X * Y directly with O(n²) multiplications and O(n) additions; by the convolution theorem, we instead compute F(X) and F(Y) by FFT, form their vector product with O(n) multiplications and additions, and apply the inverse FFT, for a total of O(n log n).
To simplify our discussion, we assume that the lengths of X and Y are both n. As Figure 3-36 shows, obtaining the convolution of X and Y directly consumes n² multiplications and O(n) additions, for a time complexity of O(n²). If we apply the convolution theorem and the fast Fourier transform, we may compute the convolution by first calculating the Fourier transforms of X and Y in O(n log n) time, then the vector product of F(X) and F(Y) with n multiplications and n additions, and finally the inverse Fourier transform of F(X) · F(Y) in O(n log n) time. The total time complexity is thus O(n log n).
Next, we shall explain how we compute the Fourier transform in O(n log n)
where n is the length of the input data.
Fast (Inverse) Fourier Transform
We first introduce the definition of the discrete Fourier transform (for details, please consult Section 4.6 of Professor Lee’s lecture notes on communications). Given a sequence of numbers a_k of length n, its discrete Fourier transform is

A_i = Σ_{k=0}^{n−1} a_k e^{−j2πik/n}, 0 ≤ i ≤ n − 1.
For example, consider the case where n = 4 and a_k = {a0, a1, a2, a3}. To simplify the calculation, we derive

e^{−j2π/n} = e^{−j2π/4} = e^{−jπ/2} = cos(π/2) − j sin(π/2) = −j.

The result of the discrete Fourier transform is then computed as follows:

A0 = a0 + a1 + a2 + a3
A1 = a0 + a1(−j) + a2(−j)² + a3(−j)³ = a0 − ja1 − a2 + ja3
A2 = a0 + a1(−j)² + a2(−j)⁴ + a3(−j)⁶ = a0 − a1 + a2 − a3
A3 = a0 + a1(−j)³ + a2(−j)⁶ + a3(−j)⁹ = a0 + ja1 − a2 − ja3
Next, we introduce the inverse discrete Fourier transform. Given the discrete Fourier transform A_k of a sequence of numbers of length n, the inverse discrete Fourier transform is

a_i = (1/n) Σ_{k=0}^{n−1} A_k e^{j2πik/n}, 0 ≤ i ≤ n − 1.
For example, consider the case where n = 4 and A_k = {A0, A1, A2, A3}. To simplify the calculation, we derive e^{j2π/n} = e^{jπ/2} = j. The inverse discrete Fourier transform is computed as follows:

a0 = (1/4)(A0 + A1 + A2 + A3)
a1 = (1/4)(A0 + A1 j + A2 j² + A3 j³) = (1/4)(4a1) = a1
a2 = (1/4)(A0 + A1 j² + A2 j⁴ + A3 j⁶) = (1/4)(4a2) = a2
a3 = (1/4)(A0 + A1 j³ + A2 j⁶ + A3 j⁹) = (1/4)(4a3) = a3
Computing the Fourier transform directly as described above takes O(n²) time, where n is the length of the input data. To improve this time complexity, we adopt the concept of “divide and conquer”, with which the Fourier transform may be computed in O(n log n) time instead of O(n²). An algorithm for computing the Fourier transform based upon the divide-and-conquer strategy is given below [LCTT2001].
Algorithm 3-11: Fast Fourier Transform
Input: a0, a1, …, a_{n−1}, n = 2^k
Output: A_j = Σ_{k=0}^{n−1} a_k e^{−2πijk/n} for j = 0, 1, …, n − 1
Algorithm
1. If n = 2, then
   A0 = a0 + a1
   A1 = a0 − a1
   Return.
2. Recursively find the coefficients of the Fourier transform of a0, a2, …, a_{n−2} (respectively a1, a3, …, a_{n−1}). Let the coefficients be denoted B0, B1, …, B_{n/2−1} (respectively C0, C1, …, C_{n/2−1}).
3. For j = 0 to n/2 − 1:
   A_j = B_j + ω_n^j C_j
   A_{j+n/2} = B_j − ω_n^j C_j
   where ω_n = e^{−2πi/n}.
End Algorithm
In the following, we give the fast inverse Fourier transform algorithm:

Algorithm 3-12: Fast Inverse Fourier Transform
Input: A0, A1, …, A_{n−1}, n = 2^k
Output: a_j = (1/n) Σ_{k=0}^{n−1} A_k e^{2πijk/n} for j = 0, 1, …, n − 1
Algorithm
1. If n = 2, then
   a0 = (1/2)(A0 + A1)
   a1 = (1/2)(A0 − A1)
   Return.
2. Recursively find the coefficients of the inverse Fourier transform of A0, A2, …, A_{n−2} (respectively A1, A3, …, A_{n−1}). Let the coefficients be denoted b0, b1, …, b_{n/2−1} (respectively c0, c1, …, c_{n/2−1}).
3. For j = 0 to n/2 − 1:
   a_j = (1/2)(b_j + ω_n^{−j} c_j)
   a_{j+n/2} = (1/2)(b_j − ω_n^{−j} c_j)
   where ω_n = e^{−2πi/n}.
End Algorithm
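Algorithms 3-11 and 3-12 can be sketched together as follows. We use a base case of n = 1 instead of the text's n = 2, which is equivalent, and the inverse transform's 1/n factor is accumulated as a division by 2 at every level of the recursion.

```python
import cmath

def fft(a, inverse=False):
    # Radix-2 FFT (Algorithm 3-11) and inverse FFT (Algorithm 3-12).
    # len(a) must be a power of two.
    n = len(a)
    if n == 1:
        return list(a)
    sign = 1 if inverse else -1
    b = fft(a[0::2], inverse)          # transform of even coefficients
    c = fft(a[1::2], inverse)          # transform of odd coefficients
    out = [0j] * n
    for j in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * j / n)  # twiddle factor
        out[j] = b[j] + w * c[j]
        out[j + n // 2] = b[j] - w * c[j]
    # Dividing by 2 at every level accumulates the inverse's 1/n factor.
    return [v / 2 for v in out] if inverse else out
```

For instance, fft([3, 2, 0, 0]) gives (5, 3 − 2j, 1, 3 + 2j), matching the worked n = 2 example of the Strassen method below.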
Next, we shall describe how we solve integer multiplication by combining the convolution theorem and the fast Fourier transform, an approach referred to as the “Strassen method”.
A Fast Method for Integer Multiplication (the Strassen Method)

Let n be a power of two, and let X and Y be two big integers with fewer than n coefficients in base B, such that X = Σ_{j=0}^{n−1} x_j B^j and Y = Σ_{j=0}^{n−1} y_j B^j.

Step 1: X* = (x*0, x*1, …, x*_{2n−1}) ≡ F_{2n}(x0, x1, …, x_{n−1}, 0, …, 0)  (Fourier transform)
Step 2: Y* = (y*0, y*1, …, y*_{2n−1}) ≡ F_{2n}(y0, y1, …, y_{n−1}, 0, …, 0)  (Fourier transform)
Step 3: Z* = (z*0, z*1, …, z*_{2n−1}), where z*_i = x*_i y*_i  (vector multiplication)
Step 4: Z = (z0, z1, …, z_{2n−1}) = F⁻¹_{2n}(Z*)  (inverse Fourier transform)
Step 5: After rearranging the coefficients z_i by carrying, the number Z = Σ_{i=0}^{2n−1} z_i B^i is equal to the product of X and Y.
We can see that this method adopts the idea of the convolution theorem to compute the result via Fourier transforms. Steps 1 and 2 compute the Fourier transforms of the input vectors, Step 3 computes the vector product of the two transforms, and Step 4 computes the inverse Fourier transform of the resulting vector product. By using the fast Fourier transform (FFT) and the fast inverse Fourier transform (IFFT), integer multiplication may be solved in O(n log n) time.
The following is the case with n = 2.

X = x0 B^0 + x1 B^1;  Y = y0 B^0 + y1 B^1
XY = x0y0 B^0 + (x0y1 + x1y0) B^1 + x1y1 B^2 + 0·B^3
X* = F4(x0, x1, 0, 0) = (x0 + x1, x0 − x1 j, x0 − x1, x0 + x1 j)
Y* = F4(y0, y1, 0, 0) = (y0 + y1, y0 − y1 j, y0 − y1, y0 + y1 j)
Z* = X* · Y* = (x0y0 + x0y1 + x1y0 + x1y1, x0y0 − (x0y1 + x1y0) j − x1y1,
                x0y0 − x0y1 − x1y0 + x1y1, x0y0 + (x0y1 + x1y0) j − x1y1)
F4⁻¹(Z*) = (x0y0, x0y1 + x1y0, x1y1, 0)
Z = x0y0 B^0 + (x0y1 + x1y0) B^1 + x1y1 B^2 + 0·B^3 = XY

The above result demonstrates that the approach is correct.
In the following, we show a numerical example of the Strassen method:

n = 2, X = 23, Y = 12, B = 10
X = x0·10^0 + x1·10^1;  Y = y0·10^0 + y1·10^1
x0 = 3, x1 = 2;  y0 = 2, y1 = 1
X* = F4(3, 2, 0, 0) = (5, 3 − 2j, 1, 3 + 2j)
Y* = F4(2, 1, 0, 0) = (3, 2 − j, 1, 2 + j)
Z* = (15, 4 − 7j, 1, 4 + 7j)
F4⁻¹(Z*) = (6, 7, 2, 0)
Z = z0·10^0 + z1·10^1 + z2·10^2 + z3·10^3 = 6·10^0 + 7·10^1 + 2·10^2 + 0·10^3 = 276 = 23 × 12

The reader can see that we have successfully found the answer.
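The five steps can be sketched as follows. The sketch assumes NumPy for the FFT, and Step 5's rearrangement is realized as ordinary carrying in base B; the function name is ours.

```python
import numpy as np

def multiply(x, y, base=10):
    # Steps 1-4: digit vectors (least significant digit first) are
    # zero-padded to the product length, transformed, multiplied
    # element-wise, and transformed back.
    xs = [int(d) for d in str(x)][::-1]
    ys = [int(d) for d in str(y)][::-1]
    size = len(xs) + len(ys)
    zs = np.rint(np.real(np.fft.ifft(
        np.fft.fft(xs, size) * np.fft.fft(ys, size)))).astype(int)
    # Step 5: rearrange the coefficients by carrying in the given base.
    result, carry = 0, 0
    for i, z in enumerate(zs):
        carry, digit = divmod(int(z) + carry, base)
        result += digit * base ** i
    return result + carry * base ** len(zs)
```

For X = 23 and Y = 12 the digit convolution is (6, 7, 2, 0), and carrying yields 276, as in the example above.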
To summarize, we now conclude that exact string matching can be solved by
discrete convolution with time-complexity O(n log n) where n is the length of the
longer string.
3.7 Some Other Applications of Convolution to
Sequence Analysis
In the following sections, we shall discuss how we apply convolution to several
applications. To simplify our discussion, when we mention convolution, it means the
discrete convolution defined above for solving the exact string matching problem.
The Common Substring with k Mismatches Allowed

In this problem, we are given a string T and a string P, where P is much shorter than T, and we need to find a segment of T that is very similar to P, with at most k mismatches allowed. In fact, such a segment may be found as the longest common substring with k mismatches allowed. The steps of applying convolution to find such a segment of T are as follows:

1. Apply the convolution to T and the reversal of P, and save the result in z.
2. Find the maximum value z_max in z such that z_max ≥ |P| − k.
3. If such a z_max exists, return the segment of T (of length |P|) aligned with P at the shift achieving z_max; otherwise return NULL.
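The steps above can be sketched as follows. For clarity the summed per-character convolutions are evaluated directly in O(mn) time, rather than through the FFT speed-up of the previous section, and the function name is ours.

```python
def best_window_with_mismatches(text, pattern, k):
    # Convolution of `text` with the reversed pattern: summing the
    # per-character indicator convolutions collapses to a direct
    # character comparison.
    n, m = len(text), len(pattern)
    z = [0] * (n + m - 1)
    rev = pattern[::-1]
    for i, tc in enumerate(text):
        for j, pc in enumerate(rev):
            if tc == pc:
                z[i + j] += 1
    # Only shifts where the pattern lies fully inside the text count;
    # z[s] is the number of matching positions at that alignment.
    z_max, s_max = -1, -1
    for s in range(m - 1, n):
        if z[s] > z_max:
            z_max, s_max = z[s], s
    if z_max >= m - k:
        return text[s_max - m + 1:s_max + 1]
    return None
```

With k = 0 this degenerates to exact matching; for example, the pattern "aab" in the text "abaab" is recovered as the window "aab".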
An example to demonstrate how the common substring with k mismatches
allowed could be found by convolution is shown in Fig. 3-37. In this example, we
set k to be 3.
T = CTAAAGTTCTCTGTGATTGTGTATT
P = TCTAGCAAT

Figure 3-37 The process of finding the longest common substring with 3 mismatches allowed by convolution: T is convolved with the reversal of P through per-character indicator strings, and the summed result 020112222724031322320422331112011 attains a maximum of 7 ≥ |P| − k = 6.
From Figure 3-37, we can see that the longest common substring with 3
mismatches allowed in T, namely "TAAAGTTCT", is found by convolution.
Common Substrings with k Mismatches Allowed among Multiple Sequences

We can also find common substrings with k mismatches allowed among a set of sequences by convolution. Suppose we are given several sequences S1, S2, …, Sn and we want to find a common substring with k mismatches allowed among them. We first find the longest common substring with k mismatches allowed of S1 and S2, and then test, again by convolution, whether this substring occurs with at most k mismatches in each of the remaining sequences. In this way, the common substring with k mismatches allowed of these sequences can be found if it exists.
Determining the Similarity of Two DNA Sequences

An intuitive way to measure the similarity between two DNA sequences is to use the edit distance of two strings, defined as the smallest number of insertions, deletions and substitutions required to change one string into the other. If the edit distance between two strings is small, we say that they are similar; otherwise, we say that they are distinct. But in some cases, using the edit distance as the measure of similarity between two strings is not practical. Consider the following two strings: “CTAAAGTTCTCTGTGATTGTGTATT” and “TAAAGTTCTTGGTGGTAA”. Their edit distance is large relative to the length of the longer string, which is 25, so we might conclude that these two strings are distinct. Actually, there is a good common substring between them, namely “TAAAGTTCT” (“CTAAAGTTCTCTGTGATTGTGTATT”, “TAAAGTTCTTGGTGGTAA”). It is therefore not entirely safe to say that these two strings are not similar.
We believe that in addition to edit distance, the longest common substring with mismatches allowed can also be used as a measure of the similarity of two sequences. If the score based on the longest common substring with mismatches allowed of two strings is high, we say that they are similar; otherwise, we say that they are distinct. As we indicated before, the longest common substring with mismatches allowed can be found by convolution. Thus we may say that convolution can be used to measure the similarity between two sequences.
The following is an example demonstrating the idea.

DNA sequence 1: agcctta
DNA sequence 2: cgcatc

Figure 3-38 Convolution of “agcctta” and “cgcatc”: sequence 1 is convolved with the reversal “ctacgc” of sequence 2; the per-character rows sum to 002103212000.
From Figure 3-38, we take the result at the 6th shift, which has the maximum number of matches, as the similarity score. This corresponds to a common substring with mismatches allowed, namely “gcat”. Because the length of this common substring with mismatches allowed is quite large compared to the lengths of the two sequences, we say that they are similar. Next, let us examine another example:
DNA sequence 1: attgacat
DNA sequence 2: ccgtcg

Figure 3-39 The result of the convolution of “attgacat” and “ccgtcg”: sequence 1 is convolved with the reversal of sequence 2; the per-character rows sum to 0001012001100.
From Figure 3-39, we take the result at the 7th shift as the similarity score of these two sequences. This corresponds to a common substring with mismatches allowed, namely “gac” or “gtc”. Because the length of this substring is quite small compared to the lengths of the two sequences, we say that they are distinct.
We apply convolution to DNA sequences to find their similarity. The idea is to take as the score the highest value in the result of the convolution of the two sequences, following the idea of the longest common substring with mismatches allowed. The procedure for computing the score is as follows:

Input: Two DNA sequences, X and Y.
1. Compute the convolution of X and the reversal Y' of Y.
2. From the result of step 1, choose the maximum value, namely max(zi).
3. Return max(zi).
In this experiment, we used 26 Hepatitis B virus sequences as our input. We
compared every pair of these DNA sequences and computed the maximum score by
convolution. The 26 sequences were known in advance to be divided into the
following clusters, which we use later to evaluate the result:
Cluster 1: 1, 2, 3, 4
Cluster 2: 5, 6, 7, 8, 9, 10
Cluster 3: 11, 12, 13, 14
Cluster 4: 15, 16, 17, 18
Cluster 5: 19, 20, 21, 22
Cluster 6: 23, 24, 25, 26
Table 3-x Clusters of the 26 Hepatitis B virus sequences
We applied convolution to these DNA sequences to determine whether the clusters
found by convolution are consistent with the given clustering. The result is
summarized in Figure 3-40.
[Chart: matrix of pairwise convolution scores among S1-S26; the legend ranges
from 560-580 up to 660-680.]
Figure 3-40 The result of 26 Hepatitis B viruses by the convolution method.
In Figure 3-40, the scores between S1 and each of S2, S3 and S4 are relatively
high (greater than 650) compared with the scores against the other sequences
(less than 630). The groups S5-S10, S11-S14, S15-S18, S19-S22 and S23-S26 show
the same trend as S1-S4. This result is consistent with the clustering known in
advance: within a cluster, the scores are higher (greater than 650) than across
clusters (less than 630). Thus we can distinguish DNA sequences from different
clusters by analyzing the result of the convolution of every pair of DNA
sequences.
Searching in a DNA Sequence Database
We are now given a query sequence T, and we have to find, within a large set P
of target sequences, all sequences similar to T. If we adopt the convolution
operation, we measure the similarity between two sequences by the highest score
produced by the convolution; if the score is high, we conclude that the
sequences are similar. We are not advocating that our method is the only way to
search a DNA sequence database, but the following experimental results show
that it is feasible.
In this experiment, we searched a DNA database with 1042 sequences. The
sequences in the database are organized as follows: S1-S26 are Hepatitis B
virus DNA sequences, S27-S162 are human mitochondria DNA sequences, and
S163-S1042 are other virus DNA sequences. We arbitrarily chose a segment from
one of these 1042 sequences and searched it against all of the other sequences
by the convolution method. We tested two cases, summarized as follows:
Case 1:
Query segment: (A segment from a Hepatitis B virus DNA sequence)
CACAATACCACAGAGTCTAGACTCGTGGTGGACTTCTCTCAATTTTCT
The length of this segment is 48. Thus the score obtained by the convolution method
would not exceed 48.
[Chart: convolution scores of the query segment against sequences S1-S1042;
the vertical axis ranges from 0 to 60.]
Figure 3-41 The searching result of querying a segment from a Hepatitis B virus DNA sequence
From Figure 3-41, we can see that the sequences S1-S26 (Hepatitis B virus) have
higher scores than S27-S162 (human mitochondria). Some sequences in S163-S1042
also have high scores, because they are also virus DNA sequences.
Case 2:
Query segment: (A segment from a human mitochondria DNA sequence)
AAGTATTGACTCACCCATCAACAACCGCTATGTATTTCGTACATTACT
[Chart: convolution scores of the query segment against sequences S1-S1042;
the vertical axis ranges from 0 to 60.]
Figure 3-42 The searching result of querying a segment from a human mitochondria DNA sequence
In Case 2, the searching result is similar to that of Case 1, except that this
time no high scores appeared in S163-S1042. This is because human mitochondria
DNA sequences are intrinsically different from virus DNA sequences.
Finding Repeating Groups in a DNA Sequence
In DNA sequences, there are often segments which occur repeatedly. Such
segments are called repeating groups, defined as follows: given a string S, the
repeating groups of S are the substrings which are longer than some k ≥ 2 and
occur more than q ≥ 2 times in S, where k and q are pre-specified.
The steps of using convolution to solve this problem are as follows:
1. Use S and its reversal S′ as the input of the convolution, and save the result in z.
2. Because the values in z are symmetric, we only need to use half of the result. Classify the entries with the same value into groups G1, G2, …, Gp.
3. Without loss of generality, consider G1 = {z_i, z_j} (note that z_i = z_j ≥ q); we then check whether S_{i,i+q-1} and S_{j,j+q-1} are identical or not.
4. Return S_{i,i+q-1} (or S_{j,j+q-1}) if they are identical.
We now use an example to explain how repeating groups can be found by
convolution. Consider S = “abcxyabc”. We use S and its reversal S′ as the input
of the convolution. The result of the convolution is shown in Figure 3-40.
S=abcxyabc
abcxyabc
cbayxcba
10000100
01000010
00100001
00010000
00001000
10000100
01000010
00100001
003000080000300
(The two values of 3 are the first peak and the second peak.)
Figure 3-40 An example of finding repeating groups by convolution.
From Figure 3-40, we may see that the repeating group “abc” can be found by
convolution.
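The steps above can be sketched in Python. This is a simplified sketch under our own assumptions: instead of building the groups G1, …, Gp explicitly, it examines each shift in the first half of z and verifies the candidate substrings directly; the function names are hypothetical.

```python
def conv_match(x, y):
    """z[s-1] = number of matching characters when y is slid across x at
    shift s; this equals the convolution of x with the reversal of y."""
    n, m = len(x), len(y)
    return [sum(1 for j in range(m)
                if 0 <= s - m + j < n and x[s - m + j] == y[j])
            for s in range(1, n + m)]


def repeating_groups(s, k=2):
    """Candidate repeating groups of s, i.e. substrings of length >= k that
    occur at least twice: convolve s with itself and, for every shift whose
    value is at least k, verify the runs of matching characters directly."""
    n = len(s)
    z = conv_match(s, s)
    found = set()
    # only the first half of z is needed, since z is symmetric
    for sh in range(1, n):
        if z[sh - 1] >= k:
            d = n - sh          # offset between the two occurrences
            run = ""
            for i in range(n - d):
                if s[i] == s[i + d]:
                    run += s[i]
                else:
                    if len(run) >= k:
                        found.add(run)
                    run = ""
            if len(run) >= k:
                found.add(run)
    return found


print(repeating_groups("abcxyabc"))  # {'abc'}
```

On the example of Figure 3-40, the only shift in the first half of z with value at least 2 is the peak of 3, which yields the repeating group “abc”.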
An Aid for Detecting Transposition
In this application, we are given S1 and S2. We want to know whether there
exist substrings A and B such that S1 is the concatenation of A and B and S2 is
the concatenation of B and A, as shown in Figure 3-41. If we transpose B and A
in S2, S2 becomes identical to S1.
S1: A B
S2: B A
Figure 3-41 An example of transposition
We also want to find the transposition point in S1. This can be done by observing
the result of convolution. The idea of detecting the transposition point is shown in
Figure 3-42.
[Diagram: S1 = AB is aligned against S2 = BA at two shifts: one shift where the
two copies of A coincide, and one shift where the two copies of B coincide. The
transposition point of S1 lies at the boundary between A and B.]
Figure 3-42 An illustration of detecting transposition and the transposition point
In Figure 3-42, we can see that the sum of the numbers of matches in two
particular shifts, one being the exact match of A and the other the exact match
of B, is equal to the length of S1. After we find these two shifts, the
transposition point is the position of S1 where A ends. Our task now becomes
finding these two shifts, and we can use convolution to find them. We
summarize the steps to detect the transposition point as follows:
1. Apply convolution on S1 and S2 and save the result in z.
2. If z_i + z_j = |S1| for some i and j such that j - i = |S1|, return i as the transposition point.
The following is an example.
Input:
S1=abcdef
S2=cdefab
The first detection (where the “ab” parts coincide) and the second detection
(where the “cdef” parts coincide) are shown as follows:
First detection:
    abcdef
cdefab
Second detection:
abcdef
  cdefab
They can be found by convolution as shown in Figure 3-43.
abcdef
bafedc
001000
000100
000010
000001
100000
010000
02000004000
(The value 2 is the first detection; the value 4 is the second detection.)
Figure 3-43 An example of detecting the transposition point by convolution.
In Figure 3-43, we can see that the sum of the numbers of matches in the 2nd
shift and the 8th shift is equal to the length of S1, so the transposition
point is found at position 2 + 1 = 3 of S1.
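The two detection steps can be sketched in Python. This is our own sketch: `conv_match` is the plain match-count convolution used throughout this section, and we additionally require both detections to be non-empty matches, which the original steps leave implicit; the returned value follows the example and is i + 1.

```python
def conv_match(x, y):
    """z[s-1] = number of matching characters when y is slid across x at
    shift s; this equals the convolution of x with the reversal of y."""
    n, m = len(x), len(y)
    return [sum(1 for j in range(m)
                if 0 <= s - m + j < n and x[s - m + j] == y[j])
            for s in range(1, n + m)]


def transposition_point(s1, s2):
    """If s1 = AB and s2 = BA, two shifts i < j with j - i = |s1| satisfy
    z_i + z_j = |s1|; the transposition point is then position i + 1 of s1."""
    n = len(s1)
    z = conv_match(s1, s2)
    for i in range(1, len(z) - n + 1):
        j = i + n
        # our assumption: both detections must contribute at least one match
        if z[i - 1] > 0 and z[j - 1] > 0 and z[i - 1] + z[j - 1] == n:
            return i + 1
    return None


print(transposition_point("abcdef", "cdefab"))  # 3
```

On the example of Figure 3-43, the pair of shifts (2, 8) gives 2 + 4 = 6 = |S1|, so the transposition point is position 3.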
Furthermore, we can also use convolution to detect whether S1 contains a
concatenation of B and C while S2 contains a concatenation of C and B, as shown
in Figure 3-44.
S1: A B C D
S2: A C B D
Figure 3-44 Another example of concatenation
The following example demonstrates how this can be done by convolution. Assume
that we are given S1 and S2 as follows:
S1: actgactgac
S2: acgatgactc
The result of their convolution is shown below:
actgactgac
ctcagtagca
1000100010
0100010001
0001000100
1000100010
0010001000
0001000100
1000100010
0100010001
0010001000
0100010001
0103011503240220010
Figure 3-45 Another example of finding the transposition point
These two detections correspond to the following shifts.
First detection:
actgactgac
acgatgactc
Second detection:
actgactgac
acgatgactc
An Aid for Detecting Insertion/Deletion
In this application, we are given two sequences, S1 of length n and S 2 of
length m. S1 is the sequence by inserting a segment into S 2 . We want to find the
insertion part between S1 and S 2 . These two sequences are illustrated in Figure
3-46.
S1: A C B      (the segment “C” is inserted)
S2: A B
Figure 3-46 An illustration of insertion
Consider Figure 3-48. A method to find the inserted segment “C” is to use
convolution to detect the shift where the exact match of “A” occurs and the
shift where the exact match of “B” occurs. This is shown in Figure 3-48.
[Diagram: S1 = ACB is aligned against S2 = AB at two shifts: one shift where
the two copies of A coincide, and one shift where the two copies of B coincide.
The segment of S1 covered by neither exact match is the insertion C.]
Figure 3-48 Detection of insertion
To use convolution to detect the insertion, the steps are as follows:
1. Apply convolution on S1 and S2 and save the result in z.
2. If z_{m-1} + z_{n-1} = |S2|, return the substring of S1 from position z_{m-1} + 1 to |S1| - z_{n-1} as the inserted part.
The following is an example.
Input:
S1=abxycde
S2=abcde
Their overlaps at the beginning and at the ending are shown as follows:
Beginning (the “ab” parts coincide):
abxycde
abcde
Ending (the “cde” parts coincide):
abxycde
  abcde
These overlaps can be found in the result of convolution as follows:
abxycde
edcba
1000000
0100000
0000100
0000010
0000001
000002030000
(The value 2 corresponds to the overlap at the beginning; the value 3
corresponds to the overlap at the ending.)
Figure 3-49 The checking of insertions
Since z m1  z n1  z 6  z 4  3  2  S 2 , we conclude that an insertion occurs
and the insertion part “xy” can be identified.
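The insertion test can be sketched in Python under our assumptions: the list z is indexed from 0 here, so z[m-1] and z[n-1] are the shifts where S2 hugs the front and the back of S1; the function names are ours.

```python
def conv_match(x, y):
    """z[s-1] = number of matching characters when y is slid across x at
    shift s; this equals the convolution of x with the reversal of y."""
    n, m = len(x), len(y)
    return [sum(1 for j in range(m)
                if 0 <= s - m + j < n and x[s - m + j] == y[j])
            for s in range(1, n + m)]


def find_insertion(s1, s2):
    """If s1 is s2 with one segment inserted, the shift aligning s2 with the
    front of s1 scores |A| and the shift aligning it with the back scores
    |B|; when the two scores sum to |s2|, the slice in between is C."""
    n, m = len(s1), len(s2)
    z = conv_match(s1, s2)
    a, b = z[m - 1], z[n - 1]   # front-aligned and back-aligned scores
    if a + b == m:
        return s1[a:n - b]      # the inserted segment
    return None


print(find_insertion("abxycde", "abcde"))  # xy
```

On the example of Figure 3-49, the front-aligned score is 2 (the “ab” overlap), the back-aligned score is 3 (the “cde” overlap), 2 + 3 = |S2|, and the slice in between is “xy”.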
An Aid for Detecting the Overlapping of Segments Resulting from
the Shot-Gun Operations
We can also use convolution to find the overlapping of segments resulting from
shot-gun operations on DNA sequences. The shot-gun approach will be explained
later; loosely speaking, it breaks a sequence into segments. Consider the
following sequence:
S=“AGGCTAGTTGCCTAGTAGT”
After two shot-gun operations, we may have the following segments:
Original sequence: AGGCTAGTTGCCTAGTAGT
First breaking up: AGGCT (1), AGTTG (2), CCTAG (3), TAGT (4)
Second breaking up: AGG (5), CTAGT (6), TGCCT (7), AGTAGT (8)
To reconstruct the original sequence, we often try to determine how the
segments overlap. We may apply convolution on every pair of segments. For
example,
6: CTAGT
2:   AGTTG
7:      TGCCT
The above information shows that the relationship among segments 6, 2 and 7 is
6 → 2 → 7.
Consider segments 2, 7 and 3. By using convolution, we find the following
information:
2: AGTTG
7:    TGCCT
3:      CCTAG
This corresponds to 2 → 7 → 3. This information is very useful if we want to
reconstruct the original sequence.
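The suffix-prefix overlaps above can be read off the convolution. A hedged sketch follows; the exact-overlap condition (the shift's score equals the overlap length) is our own formulation, and the function names are ours.

```python
def conv_match(x, y):
    """z[s-1] = number of matching characters when y is slid across x at
    shift s; this equals the convolution of x with the reversal of y."""
    n, m = len(x), len(y)
    return [sum(1 for j in range(m)
                if 0 <= s - m + j < n and x[s - m + j] == y[j])
            for s in range(1, n + m)]


def overlap_length(a, b):
    """Longest suffix of a equal to a prefix of b, read off the convolution:
    at shift s > len(a), the overlap has n + m - s characters, and it is an
    exact overlap when the shift's score equals that length."""
    n, m = len(a), len(b)
    z = conv_match(a, b)
    for s in range(n + 1, n + m):   # longest overlaps first
        if z[s - 1] == n + m - s:
            return n + m - s
    return 0


print(overlap_length("CTAGT", "AGTTG"))  # 3  (segment 6 overlaps segment 2)
print(overlap_length("AGTTG", "TGCCT"))  # 2  (segment 2 overlaps segment 7)
```

Running the function over every ordered pair of segments recovers the overlap relations, such as 6 → 2 → 7, used for reconstruction.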
The Corresponding Pair-wise Nucleotides in a DNA Sequence
In this application, we shall first define the substitution string of a string
X. For a string consisting of A, C, G and T, the substitution string of X is
obtained by making the following substitutions:
A → T, T → A, C → G, G → C
For instance, the substitution string of AACTGC is TTGACG.
In this problem, we are given a sequence S. We want to know whether there
exists a substring A whose substitution string also exists in S, as shown in
Figure 3-50.
[Diagram: a sequence containing a substring A and, elsewhere, its substitution
string Substitution(A).]
Figure 3-50 The corresponding nucleotides in a DNA sequence.
For instance, consider the sequence “acttgacttgaac”. Its longest substring
whose substitution string also exists in the same sequence is “acttg” (the
first five characters), whose substitution string is “tgaac” (the last five
characters). In fact, we can adopt convolution to find the corresponding
pair-wise nucleotides in a DNA sequence by defining the operation ⊗ in the
convolution as follows:
c(s_i, s_j) = 1 if (s_i, s_j) ∈ {(A,T), (T,A), (C,G), (G,C)}, and 0 otherwise,
where s_i, s_j ∈ {A, T, C, G}.
We use the same example, S = “acttgacttgaac”, to demonstrate how the
corresponding pair-wise nucleotides in a DNA sequence can be found by
convolution. The approach is similar to finding repeating groups, as discussed
earlier in this section: we use S and its reversal S′ as the input of the new
convolution whose ⊗ operation is defined as above. The result of this
convolution is shown in Figure 3-51.
acttgacttgaac
caagttcagttca
0011000110000
0000100001000
1000010000110
1000010000110
0100001000001
0011000110000
0000100001000
1000010000110
1000010000110
0100001000001
0011000110000
0011000110000
0000100001000
0001520018500058100251000
Figure 3-51 An example of finding pair-wise nucleotides in a DNA sequence by
convolution
We only need to observe half of the result, since it is symmetric. The result
of this convolution tells us that in the 4th and the 10th shifts we find two
segments of this DNA sequence which are pair-wise nucleotides of each other, as
shown in Figure 3-52.
In the 4th shift, we have
acttgacttgaac
acttgacttgaac
In the 10th shift, we have
acttgacttgaac
acttgacttgaac
(Note that the substituted character pairs of pair-wise nucleotides in a DNA
sequence are A-T and C-G, respectively.)
Figure 3-52 The corresponding pair-wise nucleotides in S = acttgacttgaac
From Figures 3-51 and 3-52, we conclude that the corresponding pair-wise
nucleotides in S, namely “acttg” and “tgaac”, can be identified by convolution.
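The modified convolution can be sketched in Python; only the ⊗ operation changes, scoring 1 exactly for Watson-Crick complement pairs. The function name is ours, and as with the earlier sketches, sliding one string across the other equals convolving with its reversal, so calling the function with S and S realizes the convolution of S with S′.

```python
def conv_complement(x, y):
    """Convolution of x with the reversal of y where two characters score 1
    exactly when they are complements (A-T or C-G), and 0 otherwise."""
    comp = {("a", "t"), ("t", "a"), ("c", "g"), ("g", "c")}
    n, m = len(x), len(y)
    return [sum(1 for j in range(m)
                if 0 <= s - m + j < n
                and (x[s - m + j].lower(), y[j].lower()) in comp)
            for s in range(1, n + m)]


s = "acttgacttgaac"
z = conv_complement(s, s)  # S together with its reversal S', as in the text
print(max(z))              # 8, the peak seen in Figure 3-51
```

The result is symmetric, and the shift of value 5 corresponds to the alignment in which “acttg” faces its substitution string “tgaac”.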