Chapter 3 String Matching

Searching for designated strings in a computer is a classical and important problem in computer science. We search or replace text in word processors, find interesting websites through search engines (such as Yahoo and Google) and search for papers in digital libraries. The core of all these operations is string matching. Many databases (such as GenBank) have been built to store the DNA and protein sequences contributed by researchers around the world. Since DNA and protein sequences can be regarded as character strings over alphabets of 4 and 20 different characters respectively, the string matching problem is the core problem in searching these databases. Designing efficient string matching algorithms becomes a more and more important research area as the size of such databases grows. In this chapter, we shall introduce two efficient algorithms for the exact string matching problem. Moreover, we shall introduce two important data structures for string matching problems, the suffix tree and the suffix array. Finally, we introduce an algorithm for the approximate string matching problem.

3.1 Basic Terminologies of Strings

A string is a sequence of characters from an alphabet set Σ. For example, "AGCTTGA" is a string over Σ = {A,C,G,T}. The length of a string S, denoted by |S|, is the number of characters of S. For example, the length of "AGCTTGA" is 7. Let S_i be the ith character of S in the left-to-right order, where i >= 1. Let S_{i,j} denote the string S_i S_{i+1} ... S_j, where i <= j. String S' is called a substring of string S if S' = S_{i,i+|S'|-1} for some i, where 1 <= i <= |S| - |S'| + 1. String S' is called a subsequence of string S if there exists an increasing sequence of indices i_1 < i_2 < ... < i_{|S'|} such that S'_k = S_{i_k} for 1 <= k <= |S'|. For example, "GCTT" and "TTG" are substrings (and hence also subsequences) of "AGCTTGATT", while "ACT" and "GTT" are subsequences, but not substrings, of "AGCTTGATT". String S' is called a prefix of string S if S' = S_{1,|S'|}.
String S' is called a suffix of string S if S' = S_{|S|-|S'|+1,|S|}. For example, "AGCTT" and "AG" are both prefixes of "AGCTTGATT", while "TT" and "GATT" are both suffixes of "AGCTTGATT". Given a text string T of length n and a pattern string P of length m, the exact string matching problem is to find all the occurrences of P in T. It is easy to design a brute-force algorithm for this problem, which enumerates all n - m + 1 substrings of length m in T by shifting a sliding window of length m and checks which of these substrings are equal to P. The algorithm is given below:

Algorithm 3.1 A brute-force algorithm for exact matching
Input: A text string T of length n and a pattern string P of length m.
Output: All occurrences of exact matchings of P in T.
  for i = 1 to n - m + 1 do
    if (P_{1,m} = T_{i,i+m-1}) then
      Report that pattern P appears at position i;
    endif
  endfor

Since it takes O(m) time to determine whether two strings of length m are the same, the algorithm runs in O((n - m + 1)m) = O(nm) time.

3.2 The KMP Algorithm

In this section, we shall introduce the KMP algorithm, proposed by Knuth, Morris and Pratt in 1977, which speeds up the exact matching procedure. We will use several cases to illustrate their idea. Case 1: In this case, the first symbol of P does not appear again in P. Consider Figure 3.1(a). The first mismatch occurs at P_4 and T_4. Since P_1 to P_3 exactly match T_1 to T_3 and the first symbol of P does not appear anywhere else in P, we can slide the window all the way to T_4 and match P_1 with T_4, as shown in Figure 3.1(b). Case 2: In this case, the first symbol of P appears again in P. Consider Figure 3.2(a). The first mismatch occurs at P_7 and T_7. Since the first symbol of P does occur again in P, we cannot slide the window all the way to match P_1 with T_7. Instead, we must slide the window to match P_1 with T_6, because P_6 = "A" = T_6. This is shown in Figure 3.2(b).
Case 3: In this case, not only does the first symbol of P appear again in P, but some prefixes of P also appear elsewhere in P. Consider Figure 3.3(a). First, note that the string P_{6,8} is equal to the prefix P_{1,3} and the string P_{9,10} is equal to the prefix P_{1,2}. In Figure 3.3(a), the first mismatch occurs at P_8 and T_8. With a good preprocessing, we can go back to P_7 and find out that P_{6,7} is equal to the prefix P_{1,2}. Besides, since the mismatch occurs after P_7, we know that P_{6,7} matches T_{6,7}. Therefore, we can slide the window to align P_1 with T_6 and start matching P_3 with T_8, as shown in Figure 3.3(b). The basic principle of the KMP algorithm is illustrated in Figure 3.4. In general, given a pattern P, we should first conduct a preprocessing to find out all of the places where prefixes of P reappear. This can be done by computing a prefix function. For 1 <= j <= m, let the prefix function f(j) for P_j be the largest k < j such that P_{1,k} = P_{j-k+1,j}; 0 if there is no such k. The prefix function is illustrated in Figure 3.5. Note that f(1) = 0 in all cases. For example, the prefix function f for P = "ATCACATCATCA" is shown in Figure 3.6.

Figure 3.1: The First Case for the KMP Algorithm
Figure 3.2: The Second Case for the KMP Algorithm
Figure 3.3: The Third Case for the KMP Algorithm
Figure 3.4: The KMP Algorithm
Figure 3.5: The Prefix Function
Figure 3.6: An Example of the Prefix Function

If the prefix function is computed first and stored in an array, the window shifting can be determined easily by looking up the array. Consider the pattern P = "ATCG". If a mismatch occurs when comparing T_4 with P_4, then we can continue by comparing T_4 with P_{f(4-1)+1} = P_1, since f(j) = 0 for 1 <= j <= 4 for P = "ATCG". Consider the pattern P = "ATCACATCATCA" shown in Figure 3.6. If a mismatch occurs when comparing T_13 with P_8, then we can continue by comparing T_13 with P_{f(8-1)+1} = P_{2+1} = P_3.
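The prefix function can also be computed directly from its definition. The following Python sketch (an illustration only; it simply tries every k from largest to smallest) is handy for checking values such as those in Figure 3.6:

```python
def prefix_function_by_definition(P):
    """f(j) = the largest k < j with P_{1,k} = P_{j-k+1,j}; 0 if no such k."""
    m = len(P)
    f = [0] * (m + 1)                    # 1-based; f[1] stays 0
    for j in range(2, m + 1):
        for k in range(j - 1, 0, -1):    # try the largest k first
            if P[:k] == P[j - k:j]:
                f[j] = k
                break
    return f[1:]
```

For P = "ATCACATCATCA" this returns [0, 0, 0, 1, 0, 1, 2, 3, 4, 2, 3, 4], agreeing with Figure 3.6. This direct method takes O(m^3) time in the worst case; the algorithm developed below computes f in O(m) time.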
In general, if a mismatch occurs when T_i is compared with P_j, then we align T_i with P_{f(j-1)+1} if j ≠ 1, and align T_{i+1} with P_1 if j = 1. In the following, we shall discuss how to find this prefix function. Consider the prefix function of Figure 3.6. (1) Suppose that f(4) has been found to be 1. Let us see what this means. It means that P_1 = P_4. Based upon this information, we can determine f(5) easily. If P_5 = P_2, then we set f(5) = f(4) + 1; otherwise, we set f(5) = 0. In this case, since P_5 ≠ P_2, we set f(5) = 0. (2) Suppose we have found f(8) = 3. This means that P_{1,3} = P_{6,8}. We now see whether P_9 = P_4. Since P_9 = P_4, we set f(9) = f(8) + 1 = 4. (3) Suppose we have found that f(9) = 4. We now check whether P_10 = P_5. The answer is no. Does this mean that we should set f(10) to be 0? No. Note that f(9) = 4 and f(4) = 1. This means that P_9 = P_4 = P_1. In other words, we now have a pointer pointing to P_1. We check whether P_10 = P_2. P_10 is indeed equal to P_2. We therefore set f(10) = f(4) + 1 = 1 + 1 = 2. Let f^x(y) = f(f^{x-1}(y)) for x > 1 and f^1(y) = f(y). The prefix function f(j), 2 <= j <= m, for P_j can be rewritten as follows:

  f(j) = f^k(j-1) + 1, if there exists a smallest k >= 1 such that P_j = P_{f^k(j-1)+1};
  f(j) = 0, otherwise.

It can be shown that this new formula is equivalent to the old one. Consider the pattern P in Figure 3.6 again. Note that f(1) is 0 in all cases. f(2) and f(3) are 0 by the definition. We set f(4) = 1 because P_4 = P_{f(4-1)+1} = P_1 = "A". However, we let f(5) be 0 because "C" = P_5 ≠ P_{f(5-1)+1} = P_2 = "T" and "C" = P_5 ≠ P_{f^2(5-1)+1} = P_1 = "A". In addition, we have f(6) = 1, f(7) = 2, f(8) = 3 and f(9) = 4 because P_6 = P_{f(6-1)+1} = P_1 = "A", P_7 = P_{f(7-1)+1} = P_2 = "T", P_8 = P_{f(8-1)+1} = P_3 = "C" and P_9 = P_{f(9-1)+1} = P_4 = "A", respectively. Since "T" = P_10 ≠ P_{f(10-1)+1} = P_5 = "C" and P_10 = P_{f^2(10-1)+1} = P_2 = "T", we have f(10) = 2.
Finally, we set f(11) = 3 and f(12) = 4 because P_11 = P_{f(11-1)+1} = P_3 = "C" and P_12 = P_{f(12-1)+1} = P_4 = "A", respectively. The complete algorithm for computing f is given below:

Algorithm 3.2 An algorithm for computing the prefix function f
Input: A string P of length m.
Output: The prefix function f for P.
  f(1) = 0;
  for j = 2 to m do
    t = f(j-1);
    while (t ≠ 0 and P_j ≠ P_{t+1}) do
      t = f(t);
    endwhile
    if (P_j = P_{t+1}) then
      f(j) = t + 1;
    else
      f(j) = 0;
    endif
  endfor

The KMP algorithm consists of two phases. It computes the prefix function for the pattern P in the first phase, and searches for the pattern in the second phase. Refer to Algorithm 3.3 for the details. Let us see an example of the KMP algorithm, as shown in Figure 3.7. Initially, P_j = P_1 is aligned with T_i = T_1 (see Figure 3.7(a)). Next, the algorithm continues by comparing P_1 with T_2, because T_1 ≠ P_1 (see Figure 3.7(b)). A mismatch then occurs at P_4 and T_5, so the algorithm compares T_5 with P_{f(4-1)+1} = P_{0+1} = P_1 (see Figure 3.7(c)). After that, since T_5 ≠ P_1, the algorithm continues by comparing T_6 with P_1 (see Figure 3.7(d)). We then match T_{6,17} with P_{1,12} successfully and report that pattern P appears at position i - m + 1 = 17 - 12 + 1 = 6 of T. The next pairwise comparison starts at T_18 and P_{f(m)+1} = P_{4+1} = P_5 (see Figure 3.7(e)). Since a mismatch occurs at T_19 and P_6, the algorithm continues by comparing T_19 with P_{f(6-1)+1} = P_1 (see Figure 3.7(f)). The remaining comparisons are left as exercises. It can be proved that the algorithm runs in O(n + m) time: O(m) time for computing the function f and O(n) time for searching for P. The analysis of the algorithm is quite involved and needs the technique of amortized analysis, so it is omitted here.

Algorithm 3.3 The KMP algorithm for exact matching
Input: A text string T of length n and a pattern string P of length m.
Output: All occurrences of P in T.
  /* Phase 1 */
  Compute f(j) for 1 <= j <= m;
  /* Phase 2 */
  i = 1; j = 1;
  while (i <= n) do
    if (P_j = T_i and j = m) then
      Report that pattern P appears at position i - m + 1;
      j = f(m) + 1; i = i + 1;
    elseif (P_j = T_i and j ≠ m) then
      j = j + 1; i = i + 1;
    elseif (P_j ≠ T_i and j ≠ 1) then
      j = f(j - 1) + 1;
    else /* P_j ≠ T_i and j = 1 */
      i = i + 1;
    endif
  endwhile

3.3 The Boyer-Moore Algorithm

The Boyer-Moore algorithm was proposed by Boyer and Moore in 1977. The worst-case time complexity of this algorithm is no better than that of the KMP algorithm, but in practice it is more efficient than the KMP algorithm. This algorithm compares the pattern with the substring inside a sliding window in the right-to-left order. In addition, the bad character rule and the good suffix rule are used to determine the movement of the sliding window. Suppose that P_1 is now aligned with T_s, and we perform pairwise comparisons from right to left. Assume that the first mismatch occurs when comparing T_{s+j-1} with P_j; that is, T_{s+m-k-1} = P_{m-k} for 0 <= k <= m - j - 1, and T_{s+j-1} ≠ P_j, as shown in Figure 3.8. We have the following possible ways to shift the sliding window.

Figure 3.7: An Example for the KMP Algorithm

Since T_{s+j-1} ≠ P_j, any exact matching that moves the window to the right must match some character to the left of P_j in P with T_{s+j-1}. For example, consider Figure 3.9(a). In this case, m = 12, s = 6 and j = 10, and T_15 = "G" ≠ P_10. We scan from P_10 to the left and find P_7 = "G" = T_15. Therefore, we move the window to the right to align T_15 with P_7, as shown in Figure 3.9(b). The basic idea of this rule is illustrated in Figure 3.9(c) and we have the following rule:

Bad Character Rule: Align T_{s+j-1} with P_{j'}, where j' is the rightmost position of the character T_{s+j-1} in P.

Figure 3.8: T_{s+m-k-1} = P_{m-k} for 0 <= k <= m - j - 1, and T_{s+j-1} ≠ P_j
Figure 3.9: Bad Character Rule

In the above bad character rule, only one character is used.
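Before refining the shift rules of the Boyer-Moore algorithm, we give a Python sketch of the complete KMP algorithm of the previous section (an illustration only; it reports 1-based positions, as in Algorithm 3.3):

```python
def kmp_search(T, P):
    """KMP (Algorithm 3.3): report the 1-based positions of all occurrences of P in T."""
    n, m = len(T), len(P)
    # Phase 1: the prefix function f(1..m), as in Algorithm 3.2.
    f = [0] * (m + 1)
    for j in range(2, m + 1):
        t = f[j - 1]
        while t != 0 and P[j - 1] != P[t]:   # compare P_j with P_{t+1}
            t = f[t]
        f[j] = t + 1 if P[j - 1] == P[t] else 0
    # Phase 2: scan the text; the text pointer i never moves backwards.
    positions, i, j = [], 1, 1
    while i <= n:
        if P[j - 1] == T[i - 1]:
            if j == m:                       # full match ending at T_i
                positions.append(i - m + 1)
                j = f[m] + 1
            else:
                j += 1
            i += 1
        elif j != 1:
            j = f[j - 1] + 1                 # shift the pattern, keep T_i
        else:
            i += 1
    return positions
```

For example, kmp_search("ATCACATCATCA", "TCA") reports the positions 2, 7 and 10.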
We can actually do better than that by using the so-called good suffix rule. Let us consider Figure 3.10(a). In this case, m = 12, s = 6 and j = 10. We note that T_{16,17} = P_{11,12} = "CA" and T_15 = "G" ≠ P_10 = "T". We should therefore find the rightmost substring to the left of P_10 in P which is equal to "CA" and whose preceding character is not equal to P_10 = "T", if it exists. In our case, we find that P_{5,6} = "CA" and P_4 ≠ P_10 = "T". Therefore we move the window right to align T_15 with P_4, as shown in Figure 3.10(b). The basic idea of this rule is illustrated in Figure 3.10(c) and we have the following rule:

Good Suffix Rule 1: Align T_{s+j-1} with P_{j'-m+j}, where j' (m - j + 1 <= j' < m) is the largest position such that P_{j+1,m} is a suffix of P_{1,j'} (i.e. P_{j+1,m} = P_{j'-m+j+1,j'}) and P_{j'-m+j} ≠ P_j (see Figure 3.11).

Figure 3.10: Good Suffix Rule 1
Figure 3.11: The Movement for Good Suffix Rule 1
Figure 3.12: Good Suffix Rule 2

In the following, we will further make use of the matched portion of the text. Consider Figure 3.12(a). In this case, m = 12, s = 6 and j = 8. Under the previous good suffix rule, we would try to move the window so that another substring of P matches T_{14,17} = "AATC" exactly. This is not possible because there is no other substring of P which is equal to "AATC". Yet there is a prefix, namely P_{1,3} = "ATC", which is a suffix of "AATC". We therefore move the window so that T_15 matches P_1. The basic idea of this rule is illustrated in Figure 3.12(c) and we have the following rule:

Good Suffix Rule 2: Align T_{s+m-j'} with P_1, where j' (1 <= j' <= m - j) is the largest position such that P_{1,j'} is a suffix of P_{j+1,m} (i.e. P_{1,j'} = P_{m-j'+1,m}) (see Figure 3.13).

Let B(a) be the rightmost position of character a ∈ Σ in P. Figure 3.14(a) shows the values of B for Σ = {A,T,C,G} and P = "ATCACATCATCA". This function will be used for applying the bad character rule.
The algorithm for computing B is described in Algorithm 3.4.

Figure 3.13: The Movement for Good Suffix Rule 2
Figure 3.14: Two Functions for the Boyer-Moore Algorithm: (a) Function B (b) Function G

Algorithm 3.4 An algorithm for computing function B
Input: A pattern P of length m.
Output: Function B for P.
  for each a in Σ do
    B(a) = 0;
  endfor
  for j = 1 to m do
    B(P_j) = j;
  endfor

As for the good suffix rule, we need to scan the pattern to store information about the maximum number of shifts at each position of P. Let g_1(j) be the largest k such that P_{j+1,m} is a suffix of P_{1,k} and P_{k-m+j} ≠ P_j, where m - j + 1 <= k < m; 0 if there is no such k. Let g_2(j) be the largest k such that P_{1,k} is a suffix of P_{j+1,m}, where 1 <= k <= m - j; 0 if there is no such k. Functions g_1(j) and g_2(j) are illustrated in Figure 3.15(a) and (b) respectively. Consider g_1 and g_2 for P = "ATCACATCATCA", shown in Figure 3.14(b). We set g_1(7) = 9 because P_{8,12} = "CATCA" = P_{5,9}, which is a suffix of P_{1,9}, and P_7 ≠ P_4, while we set g_2(7) = 4 because P_{9,12} = "ATCA" = P_{1,4}, which is a suffix of P_{8,12}. However, we set g_1(8) = 0 by the definition, while we set g_2(8) = 4 because P_{9,12} = "ATCA" = P_{1,4}.

Figure 3.15: Functions g_1(j) and g_2(j)

Let us consider g_1(7). Note that g_1(7) = 9. This means that P_{8,12} = "CATCA" is equal to P_{5,9} and P_7 ≠ P_4. Suppose a mismatch occurs at P_7, as shown in Figure 3.16(a). We can move the window m - g_1(7) = 12 - 9 = 3 positions, as illustrated in Figure 3.16(b). Consider g_2(4). Note that g_2(4) = 4. This means that P_{1,4} is a suffix of P_{5,12}. That is, P_{1,4} is equal to P_{9,12}. If a mismatch occurs at P_4, as shown in Figure 3.17(a), we can move the window m - g_2(4) = 12 - 4 = 8 positions, as illustrated in Figure 3.17(b).
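Function B of Algorithm 3.4 is straightforward to implement. The following Python sketch (an illustration only; the alphabet parameter is our assumption, defaulting to Σ = {A,T,C,G}) scans P once, so a later occurrence of a character overwrites an earlier one and the rightmost position survives:

```python
def bad_character_table(P, alphabet="ATCG"):
    """Algorithm 3.4: B(a) = the rightmost position of character a in P; 0 if absent."""
    B = {a: 0 for a in alphabet}
    for j, a in enumerate(P, start=1):
        B[a] = j                 # later (more to the right) positions win
    return B
```

For P = "ATCACATCATCA" this yields B(A) = 12, B(T) = 10, B(C) = 11 and B(G) = 0, as in Figure 3.14(a).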
Figure 3.16: Shifting for the Good Suffix Rule 1
Figure 3.17: Shifting for the Good Suffix Rule 2

Function G(j) is defined as follows:

  G(j) = m - max{g_1(j), g_2(j)}

Actually, G(j) indicates the maximum number of shifts given by the good suffix rule when a mismatch occurs while comparing P_j with some character of T. Figure 3.14(b) shows the values of G for P = "ATCACATCATCA". Note that G(m) is always 1. For 1 <= j <= m - 1, let the suffix function f'(j) for P_j be the smallest k, j + 2 <= k <= m, such that P_{k,m} = P_{j+1,m-k+j+1}; m + 1 if there is no such k. This is illustrated in Figure 3.18. Let f'(m) = m + 2 for P_m. It is easy to see that the function f' is similar to the prefix function for the reverse of P. Figure 3.19 shows the values of f' for P = "ATCACATCATCA", where m = 12. We set f'(12) = 12 + 2 = 14. Since there is no k with 13 = j + 2 <= k <= m = 12, f'(11) is set to 12 + 1 = 13. Since P_{k,m} = P_{12,12} ≠ P_{j+1,m-k+j+1} = P_{11,11} for j = 10 and j + 2 = 12 <= k <= m = 12, we get f'(10) = m + 1 = 13. Similarly, f'(9) = 12 + 1 = 13. For j = 8 and j + 2 = 10 <= k <= m = 12, we let f'(8) = 12 because 12 is the smallest value of k such that P_{k,m} = P_{12,12} = "A" = P_{j+1,m-k+j+1} = P_{9,9}. For j = 7 and j + 2 = 9 <= k <= m = 12, we let f'(7) = 11 because 11 is the smallest value of k such that P_{k,m} = P_{11,12} = "CA" = P_{j+1,m-k+j+1} = P_{8,9}.

Figure 3.18: The Suffix Function f'
Figure 3.19: Functions f' and G

Let f'^x(y) = f'(f'^{x-1}(y)) for x > 1 and f'^1(y) = f'(y). The function f' can be redefined as follows:

  f'(j) = m + 2, if j = m;
  f'(j) = f'^k(j+1) - 1, if 1 <= j <= m - 1 and there exists a smallest k >= 1 such that P_{j+1} = P_{f'^k(j+1)-1};
  f'(j) = m + 1, otherwise.

By comparing Figure 3.5 and Figure 3.18, we can easily see that the suffix function f' can be computed just like f, except that the direction is reversed. Function G can be determined by scanning P twice, where the first scan is right-to-left and the second is left-to-right.
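As a check of Figure 3.19, the suffix function f' can be computed directly from its definition by the following brute-force Python sketch (an illustration only):

```python
def suffix_function_by_definition(P):
    """f'(j) = the smallest k, j+2 <= k <= m, with P_{k,m} = P_{j+1,m-k+j+1};
    m+1 if there is no such k, and f'(m) = m+2 (1-based)."""
    m = len(P)
    fp = [0] * (m + 1)
    fp[m] = m + 2
    for j in range(1, m):
        fp[j] = m + 1
        for k in range(j + 2, m + 1):
            if P[k - 1:] == P[j:j + m - k + 1]:   # P_{k,m} vs P_{j+1,m-k+j+1}
                fp[j] = k
                break
    return fp[1:]
```

For P = "ATCACATCATCA" this returns [10, 11, 12, 8, 9, 10, 11, 12, 13, 13, 13, 14], agreeing with the values of f' in Figure 3.19.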
Function f' is generated in the first right-to-left scan, and some values of G can be determined during this scan. Consider Figure 3.15 again. We can easily observe the following fact:

Fact 1: If we scan from right to left and g_1(j) is determined during the scanning, then g_1(j) >= g_2(j).

Consider Figure 3.19 again. Suppose that P_j = P_4 is considered now. We set f'(4) to 8. This means that P_{f'(j),m} = P_{8,12} = "CATCA" = P_{5,9} = P_{j+1,m+j-f'(j)+1}. In addition, since P_j = P_4 ≠ P_7 = P_{f'(j)-1}, we know that g_1(f'(j)-1) = m + j - f'(j) + 1 = 9. By Fact 1, G(f'(j)-1) = m - max{g_1(f'(j)-1), g_2(f'(j)-1)} = m - g_1(f'(j)-1) = m - (m + j - f'(j) + 1) = (f'(j) - 1) - j = (8 - 1) - 4 = 3. When we compute f', there may be a "while" loop. The function g_1 can be determined while we perform this "while" loop. That is, if t = f'(j) - 1 <= m and P_j ≠ P_t, then g_1(t) = m - t + j, as illustrated in Figure 3.20.

Figure 3.20: The Computation of g_1(j)

Suppose we have scanned from right to left and determined f'(j) for j = 1, 2, ..., m. Now, consider f'(1). From Figure 3.19, f'(1) = 10. This means that P_{10,12} = P_{2,4}. Let t be f'(1) - 1 = 9. Observe that P_t = P_9 = "A" = P_1. This means that P_{1,4} = P_{9,12}. Thus, from Figure 3.15, g_2(1) = 4 = m - (f'(1) - 1) + 1 = m - f'(1) + 2. Besides, it can easily be seen that g_2(j) = g_2(1) for j = 2, 3, ..., f'(1) - 2. This is illustrated in Figure 3.21.

Figure 3.21: The Computation of g_2(1)

Let k' be the smallest k in {1, ..., m} such that P_{f'^(k)(1)-1} = P_1 and f'^(k)(1) - 1 <= m. If G(j') is not determined in the first scan and 1 <= j' <= f'^(k')(1) - 2, then in the second scan we set G(j') = m - max{g_1(j'), g_2(j')} = m - g_2(j') = f'^(k')(1) - 2. If no such k' exists, each undetermined value of G is set to m in the second scan. For example, since P_{f'(1)-1} = P_9 = "A" = P_1, we set G(1), G(2), G(3), G(4), G(5), G(6) and G(8) to f'(1) - 2 = 8. In general, let z = f'^(k')(1) - 2 and f''(x) = f'(x) - 1.
Let k'' be the largest value k such that f''^(k)(z) <= m. Then, for each G(j') not yet determined with f''^(i-1)(z) < j' <= f''^(i)(z), where 1 <= i <= k'' and f''^(0)(z) = z, we set G(j') = m - g_2(j') = m - (m - f''^(i)(z)) = f''^(i)(z). For example, since z = 8 and k'' = 2, we set G(9) and G(11) to f''^(1)(z) = f'(8) - 1 = 12 - 1 = 11. The complete algorithm for computing G using f' is listed below:

Algorithm 3.5 An algorithm for computing function G
Input: A pattern P of length m.
Output: Function G for P.
  for j = 1 to m do
    G(j) = 0;
  endfor
  f'(m) = m + 2;
  t = m;
  for j = m - 1 downto 1 do
    f'(j) = t + 1;
    while (t <= m and P_j ≠ P_t) do
      if (G(t) = 0) then
        G(t) = t - j;
      endif
      t = f'(t) - 1;
    endwhile
    t = t - 1;
  endfor
  for j = 1 to m - 1 do
    if (G(j) = 0) then
      G(j) = t;
    endif
    if (j = t) then
      t = f'(t) - 1;
    endif
  endfor

We essentially have to decide the maximum number of positions we can move the window to the right when a mismatch occurs. This is given by the following function: max{G(j), j - B(T_{s+j-1})}. The Boyer-Moore algorithm has two phases, one for computing B and G, and the other for searching for the pattern. The complete algorithm is described in the following:

Algorithm 3.6 The Boyer-Moore algorithm for exact matching
Input: A text string T of length n and a pattern string P of length m.
Output: All occurrences of P in T.
  /* Phase 1 */
  Compute G;
  Compute B;
  /* Phase 2 */
  s = 1;
  while (s <= n - m + 1) do
    j = m;
    while (j >= 1 and P_j = T_{s+j-1}) do
      j = j - 1;
    endwhile
    if (j = 0) then
      Report that pattern P appears at position s;
      s = s + G(1);
    else
      s = s + max{G(j), j - B(T_{s+j-1})};
    endif
  endwhile

Consider the text T and pattern P given in Figure 3.22, where n = 24 and m = 12. The functions G and B are shown in Figure 3.14. The Boyer-Moore algorithm first sets s = 1 and starts the pairwise comparisons in the right-to-left order (see Figure 3.22(a)).
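The preprocessing of Algorithms 3.4 and 3.5 and the search of Algorithm 3.6 can be sketched together in Python as follows (an illustration only; the default alphabet {A,T,C,G} is our assumption, and 1-based positions are reported):

```python
def good_suffix_table(P):
    """Algorithm 3.5: the good-suffix shift G(1..m), via the suffix function f'."""
    m = len(P)
    G = {j: 0 for j in range(1, m + 1)}
    fp = {m: m + 2}
    t = m
    for j in range(m - 1, 0, -1):            # first scan: right to left
        fp[j] = t + 1
        while t <= m and P[j - 1] != P[t - 1]:
            if G[t] == 0:
                G[t] = t - j                 # a value of g_1 found on the fly
            t = fp[t] - 1
        t -= 1
    for j in range(1, m):                    # second scan: left to right
        if G[j] == 0:
            G[j] = t                         # undetermined entries come from g_2
        if j == t:
            t = fp[t] - 1
    return G

def boyer_moore(T, P, alphabet="ATCG"):
    """Algorithm 3.6: report the 1-based positions of all occurrences of P in T."""
    n, m = len(T), len(P)
    B = {a: 0 for a in alphabet}             # Algorithm 3.4: bad-character table
    for j, a in enumerate(P, start=1):
        B[a] = j
    G = good_suffix_table(P)
    positions, s = [], 1
    while s <= n - m + 1:
        j = m
        while j >= 1 and P[j - 1] == T[s + j - 2]:   # right-to-left comparison
            j -= 1
        if j == 0:
            positions.append(s)
            s += G[1]
        else:
            s += max(G[j], j - B.get(T[s + j - 2], 0))
    return positions
```

For P = "ATCACATCATCA", good_suffix_table yields G(7) = 3, G(9) = G(11) = 11, G(10) = 6, G(12) = 1 and G(j) = 8 for the remaining positions, agreeing with Figure 3.14(b).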
The first mismatch occurs at P_j = P_12, and the algorithm recomputes s by the following statement:

  s = s + max{G(j), j - B(T_{s+j-1})} = 1 + max{G(12), 12 - B(T_12)} = 1 + max{1, 2} = 3

Figure 3.22: An Example for the Boyer-Moore Algorithm

Then we shift the window to position s = 3 and start a new round of pairwise comparisons (see Figure 3.22(b)). Next, a mismatch occurs at P_j = P_7 and s is reset by the following statement:

  s = s + max{G(j), j - B(T_{s+j-1})} = 3 + max{G(7), 7 - B(T_9)} = 3 + max{3, -5} = 6

So we shift the window to position s = 6 and start a new round of pairwise comparisons (see Figure 3.22(c)). After m comparisons, an occurrence of P is found at position s = 6. Then we recompute s by the following statement:

  s = s + G(1) = 6 + 8 = 14

Since s = 14 > n - m + 1 = 24 - 12 + 1 = 13, the algorithm terminates. The preprocessing time of phase 1 is O(m) + O(m + |Σ|) = O(m + |Σ|), where O(m + |Σ|) time is for computing B and O(m) time is for computing G. However, the worst-case time for phase 2 is O((n - m + 1)m). It has been proved that this algorithm makes O(n) comparisons when P does not occur in T, but it may make O(nm) comparisons when P occurs in T. Many Boyer-Moore-like algorithms (such as the Apostolico-Giancarlo algorithm) have been derived that take O(n) time in the worst case, and they remain more efficient in practice than the KMP algorithm.

3.4 Suffix Trees and Suffix Arrays

Let S be a string of n characters. Let S(i) denote the suffix S_{i,n} of S for 1 <= i <= n. For example, if S is ATCACATCATCA, its 12 suffixes are listed in Table 3.1.

Table 3.1: Suffixes for S = "ATCACATCATCA"

The basic idea of the suffix tree is that we can classify all of the substrings of a given string into groups. For example, consider the above string S = ATCACATCATCA. Only three characters appear in this string: A, C and T. Thus, we can classify all of the substrings into three groups: (1) the substrings which start with A, (2) the substrings which start with C and (3) the substrings which start with T.
Note that A, for instance, appears at positions 1, 4, 6, 9 and 12. This means that any substring which starts with A must be a prefix of one of the following suffixes: S(1), S(4), S(6), S(9) and S(12). Similarly, any substring which starts with C must be a prefix of one of the following suffixes: S(3), S(5), S(8) and S(11). The suffix tree, as we shall introduce below, groups the suffixes in such a way that whether any given substring appears in S can be determined easily by using the above observation. A suffix tree of a string S of length n is a tree with the following properties:

  Each tree edge is labeled by a substring of S.
  Each internal node has at least 2 children.
  Each S(i) has its corresponding labeled path from the root to a leaf, for 1 <= i <= n.
  There are n leaves.
  No two edges branching out of the same internal node can have labels beginning with the same character.

A suffix tree for S = "ATCACATCATCA" is depicted in Figure 3.23. Note that we append the symbol "$" to the end of S to satisfy the property of n leaves. The leaf number denotes the starting index of the suffix spelled by the labeled path from the root to that leaf. For example, the labeled path from the root to leaf 9 is "ATCA$" = S(9). Since the edges out of any node have labels with distinct starting symbols, the path from the root to each leaf represents a unique labeled path. Each leaf corresponds to a suffix. The structure of the suffix tree for a given S is unique. The suffix tree for S can be used to determine whether a given pattern P occurs in S. Consider the suffix tree for S = "ATCACATCATCA" drawn in Figure 3.23. Suppose P = "TCAT". First we examine the branches from the root. Since P_1 is "T", we follow the branch labeled "TCA". Then we match P_{1,3} with "TCA". Next we examine the branches labeled "$", "TCA$" and "CATCATCA$". Since P_4 is "T", we follow the branch labeled "TCA$". We can now report that P occurs at position 7 in S, because P_4 matches the first symbol of "TCA$" and leaf 7 is reached along the branch labeled "TCA$". Suppose P = "TCA". First we examine the branches from the root.
Since P_1 is "T", we follow the branch labeled "TCA". Then we match P_{1,3} = "TCA" with "TCA". We can now report that P occurs at positions 2, 7 and 10 in S, because leaves 2, 7 and 10 are all reached along the branch labeled "TCA".

Figure 3.23: A Suffix Tree for S = "ATCACATCATCA"

Suppose P = "TCATT". First we examine the branches from the root. Since P_1 is "T", we follow the branch labeled "TCA". Then we match P_{1,3} = "TCA" with "TCA". Next we examine the branches labeled "$", "TCA$" and "CATCATCA$". Since P_4 is "T", we follow the branch labeled "TCA$". Since P_{4,5} = "TT" does not match the first two symbols of "TCA$", we report that P is not in S. In the following, we shall give a simple algorithm to create a suffix tree. Given a string S, we first create all of its suffixes, divide the suffixes into distinct groups according to their starting characters, and create a node. For each group, if it contains only one suffix, we create a leaf node and a branch labeled with this suffix; otherwise, we find the longest common prefix of all suffixes of the group and create a branch out of the node labeled with this longest common prefix. We then delete this prefix from all suffixes of the group. The above procedure is repeated for each node which is not yet terminated. Let us consider S = "ATCACATCATCA" as an example. The 12 suffixes are shown in Table 3.1. We first divide these suffixes into three groups, N_1 = {1, 4, 6, 9, 12}, N_2 = {3, 5, 8, 11} and N_3 = {2, 7, 10}, for the starting characters "A", "C" and "T" respectively. Consider N_3, whose suffixes all start with "T". There are three suffixes, namely S(2) = "TCACATCATCA", S(7) = "TCATCA" and S(10) = "TCA". Among these three suffixes, "TCA" is the longest common prefix. It can easily be shown that in N_2 the longest common prefix is "CA", and in N_1 it is "A". This is shown in Figure 3.24(a).
Figure 3.24: Constructing a Suffix Tree for S = "ATCACATCATCA"

After the longest common prefix is determined, we delete this prefix from all suffixes of the group. Consider N_3 again. After "TCA" is deleted from S(2), S(7) and S(10), we have three remaining strings, namely "CATCATCA", "TCA" and "$". They are divided into three groups and each group contains only one string. Thus three branches are created for these three strings. Consider N_2. After "CA" is deleted from S(3), S(5), S(8) and S(11), we have "CATCATCA", "TCATCA", "TCA" and "$". We divide this set into three groups: those starting with "C", those starting with "T" and those starting with "$". Among those starting with "T", the longest common prefix is "TCA". The whole development of the nodes, after the first round of longest common prefixes has been determined, is shown in Figure 3.24(b). Refer to Algorithm 3.7 for the details of creating a suffix tree.

Algorithm 3.7 An algorithm to create a suffix tree
Input: A string S.
Output: A suffix tree of S.
Step 1: Create all suffixes of S. Create a node N. Denote the set of all suffixes of S by G_N. Let x = N and G_x = G_N, and put x, together with G_x, into a queue Q.
Step 2: Delete an element x from Q. Divide all suffixes in G_x into groups such that in each group, all suffixes start with the same character. Denote these groups by G_1, G_2, ..., G_k for some k.
Step 3: For i = 1 to k, do the following: If G_i contains only one suffix, create a leaf node and a branch labeled with this suffix, and delete the suffix from the group. Otherwise, among all suffixes of this group, find the longest common prefix. Create a node x and a branch labeled with this prefix. Delete this prefix from all suffixes of the group. Let this new group of suffixes be denoted by G_x. Put x, together with G_x, into Q.
Step 4: If Q is empty, exit and report the tree as a suffix tree for S. Otherwise, go to Step 2.
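Algorithm 3.7 can be sketched in Python as follows. The sketch (an illustration only) represents an internal node as a dictionary from edge labels to children and a leaf as the 1-based starting index of its suffix; the helper `find`, which is not part of Algorithm 3.7, walks the tree to locate a pattern as in the examples above:

```python
def build_suffix_tree(S):
    """Algorithm 3.7 sketch: group the suffixes of S + "$" by first character,
    peel off the longest common prefix of each group, and recurse."""
    S = S + "$"
    suffixes = [(S[i:], i + 1) for i in range(len(S))]

    def build(group):
        node = {}
        for c in sorted({s[0] for s, _ in group}):         # Step 2: split by first character
            sub = [(s, i) for s, i in group if s[0] == c]
            if len(sub) == 1:                              # Step 3: a leaf
                node[sub[0][0]] = sub[0][1]
            else:                                          # Step 3: longest common prefix
                lcp = sub[0][0]
                for s, _ in sub[1:]:
                    while not s.startswith(lcp):
                        lcp = lcp[:-1]
                node[lcp] = build([(s[len(lcp):], i) for s, i in sub])
        return node

    return build(suffixes)

def find(tree, P):
    """Return all 1-based positions of pattern P, by walking the tree."""
    node = tree
    while P:
        for label, child in node.items():
            if label.startswith(P):                        # P ends inside this edge
                node, P = child, ""
                break
            if P.startswith(label) and isinstance(child, dict):
                node, P = child, P[len(label):]            # consume the edge, go deeper
                break
        else:
            return []
    stack, out = [node], []                                # collect all leaves below
    while stack:
        x = stack.pop()
        if isinstance(x, int):
            out.append(x)
        else:
            stack.extend(x.values())
    return sorted(out)
```

For S = "ATCACATCATCA", find reproduces the three searches above: "TCAT" occurs at position 7, "TCA" at positions 2, 7 and 10, and "TCATT" does not occur.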
A suffix tree for a text string T of length n can be constructed in O(n) time. Searching for a pattern P of length m in a suffix tree needs O(m) comparisons. Thus we have an O(n + m) time algorithm for the exact string matching problem. An array A of n elements is called the suffix array for S if the strings S(A[1]), S(A[2]), ..., S(A[n]) are in non-decreasing lexical order. For example, the non-decreasing lexical order of the suffixes of S = "ATCACATCATCA" is S(12), S(4), S(9), S(1), S(6), S(11), S(3), S(8), S(5), S(10), S(2) and S(7). Table 3.2 shows the suffix array A. A suffix array A for S can be constructed by a lexical depth first search in the suffix tree for S. The lexical depth first search is described below. We start the search from the root node x. Select an unvisited node y branched from x such that the label of the edge linking x and y is alphabetically smallest among the edges linking x with its unvisited children. When a node z is reached such that all its adjacent nodes have been visited, we back up to the last vertex visited and continue the depth first search. The search terminates when no unvisited vertex can be reached from any of the visited ones. The visiting sequence of the leaves in the lexical depth first search gives the non-decreasing lexical order of the suffixes. Consider the suffix tree in Figure 3.23 again. The visiting sequence of leaves in the lexical depth first search is 12, 4, 9, 1, 6, 11, 3, 8, 5, 10, 2, 7, and hence the non-decreasing lexical order of the suffixes is S(12), S(4), S(9), S(1), S(6), S(11), S(3), S(8), S(5), S(10), S(2) and S(7). If T is represented by a suffix array, it takes O(m log n) time to find P in T, because a binary search can be conducted on the array. Since a suffix array can be determined in O(n) time by a lexical depth first search in a suffix tree for a string of length n, the total time is O(n + m log n).
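A suffix array and the O(m log n) binary search can be sketched in Python as follows (an illustration only; the construction below sorts the suffixes directly instead of using a lexical depth first search, and the sentinel character used in the search is our assumption):

```python
from bisect import bisect_left, bisect_right

def suffix_array(S):
    """1-based suffix array: the starting indices sorted by their suffixes."""
    return sorted(range(1, len(S) + 1), key=lambda i: S[i - 1:])

def suffix_array_search(S, A, P):
    """Binary search on the suffix array: all 1-based positions of P in S."""
    suffixes = [S[i - 1:] for i in A]               # already in sorted order
    lo = bisect_left(suffixes, P)
    hi = bisect_right(suffixes, P + "\U0010FFFF")   # sentinel above any character
    return sorted(A[lo:hi])
```

For S = "ATCACATCATCA", suffix_array(S) returns [12, 4, 9, 1, 6, 11, 3, 8, 5, 10, 2, 7], the order derived above, and searching for "TCA" yields the positions 2, 7 and 10.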
A string Z is called a common substring of strings X and Y if Z is a substring of both X and Y. A longest common substring of X and Y is a common substring of X and Y of the greatest length. For example, "PAT" is the longest common substring of X = "APAT" and Y = "PATT". We can create a suffix tree for X and Y to find the longest common substring. Suppose that X = "APAT" and Y = "PATT". The suffixes of X and Y are listed in Table 3.3. Figure 3.25 shows the suffix tree for X and Y. We mark by "1" each internal node that is reachable from leaves of suffixes of both X and Y, as shown in Figure 3.25. The labeled path from the root to a marked internal node is a common substring of X and Y. The longest common substring can be found by generating all such common substrings and picking the longest one.

Table 3.3: Suffixes for X = "APAT" and Y = "PATT"
Figure 3.25: A Suffix Tree for X = "APAT" and Y = "PATT"

3.5 Approximate String Matching

Given a text string T of length n, a pattern string P of length m and a maximal number k of errors allowed, the approximate string matching problem is to find all text positions where the pattern matches the text with up to k errors, where an error can be substituting, deleting, or inserting a character. For instance, if T = "pttapa", P = "patt" and k = 2, the substrings T_{1,2}, T_{1,3}, T_{1,4} and T_{5,6} all match P with at most 2 errors. In the following, we first define a distance function to measure the similarity between two strings S_1 and S_2. This is called the suffix edit distance, which is the minimum number of substitutions, insertions and deletions that will transform some suffix of S_1 into S_2. Consider S_1 = "p" and S_2 = "p". The suffix edit distance between S_1 and S_2 is 0. Consider S_1 = "ptt" and S_2 = "p". The suffix edit distance between S_1 and S_2 is 1, as we can take the suffix "t" and replace it by "p". Consider S_1 = "pttap" and S_2 = "patt".
The suffix edit distance between S1 and S2 is 3, because we need at least 3 operations to transform any suffix of S1 into S2.

What is the meaning of the suffix edit distance between T and P? If it is not greater than k, then there is an approximate matching of a suffix of T with P with error not greater than k; that is, we have succeeded in finding a desired approximate matching. Given T and P, our approach is to find the suffix edit distances between each of T1,1, T1,2, ..., T1,n and P. For any i where the suffix edit distance between T1,i and P is less than or equal to k, we know that there is an approximate matching with error less than or equal to k ending at position i.

Let us consider T = "pttapa" and P = "patt". For T1,1 = "p" and P = "patt", the suffix edit distance between them is 3, so there is no approximate matching with error not greater than 2 ending at position 1. For T1,2 = "pt" and P = "patt", the suffix edit distance is 2; thus we have found an approximate matching with error 2. For T1,5 = "pttap" and P = "patt", the suffix edit distance is 3. For T1,6 = "pttapa" and P = "patt", the suffix edit distance is 2 again, because we can take the suffix "pa" of T1,6 and insert "tt" to obtain P.

Our approximate string matching problem now reduces to the following problem: given T and P, find the suffix edit distances between T1,i and P for i = 1, 2, ..., n, where n is the length of T. This problem can be solved by the dynamic programming approach introduced in Chapter 2; the approach is almost the same as that used to find the longest common subsequence of two sequences. Let E(i,j) denote the suffix edit distance between T1,j and P1,i. To find E(i,j), we consider the following possibilities:

Case 1: Tj = Pi. In this case, we find E(i-1,j-1) and set E(i,j) = E(i-1,j-1).
Case 2: Tj ≠ Pi. In this case, we find E(i-1,j) and E(i,j-1) and set E(i,j) = min{E(i-1,j), E(i,j-1)} + 1.
Refer to Algorithm 3.8 for the details of finding all suffix edit distances between T and P. The problem can be solved in O(nm) time.

Algorithm 3.8 An algorithm to compute E(i,j) for 0 <= i <= m and 0 <= j <= n
Input: Strings T and P, where the lengths of T and P are n and m respectively.
Output: E(i,j) for 0 <= i <= m and 0 <= j <= n.
Step 1: E(0,j) = 0 for j = 1, 2, …, n
        E(i,0) = i for i = 0, 1, 2, …, m
Step 2: for i = 1 to m do
            for j = 1 to n do
                if ( Tj = Pi ) then
                    E(i,j) = E(i-1,j-1)
                else
                    E(i,j) = min{E(i-1,j), E(i,j-1)} + 1
                endif
            endfor
        endfor

Figure 3.26(a) shows E for T = "pttapa" and P = "patt". As can be seen, E(4,2), E(4,3) and E(4,6) are all less than or equal to k = 2. Through an appropriate tracing back, we can find all of the desired approximate matchings.

Figure 3.26: Dynamic programming approach for T = "pttapa", P = "patt" and k = 2

First of all, since our goal is to find all occurrences of approximate matchings, we have to record how each E(i,j) is obtained when it is computed, as shown in Figure 3.26(b). For example, consider E(3,2). It is obtained from E(2,1) because T2 = P3 = "t"; thus there is an arrow from E(3,2) pointing to E(2,1). Consider E(4,6). It is obtained from E(3,6); this is why there is an arrow from E(4,6) to E(3,6). All of the approximate matchings with error less than or equal to k = 2 can now be found by tracing back. When we trace back, we ignore every arrow pointing vertically, because a vertical arrow has nothing to do with locating the occurrence in T. Consider E(4,3). The arrows traced are E(4,3) to E(3,2) to E(2,1) to E(1,1). We ignore the vertical arrow E(2,1) to E(1,1). Thus we have obtained an occurrence of approximate matching, namely T1,3 = "ptt". Consider E(4,6). The arrows traced are E(4,6) to E(3,6) to E(2,6) to E(1,5). Ignoring the vertical arrows E(4,6) to E(3,6) and E(3,6) to E(2,6), we obtain T5,6 = "pa", which is another occurrence of a desired approximate matching.
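Algorithm 3.8 translates directly into code. The sketch below (in Python, using the running example T = "pttapa" and P = "patt") fills the table E exactly as described:

```python
def suffix_edit_distances(T, P):
    # E[i][j] = suffix edit distance between T[1..j] and P[1..i],
    # computed with the recurrence of Algorithm 3.8.
    n, m = len(T), len(P)
    E = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        E[i][0] = i              # the empty text forces i insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if T[j - 1] == P[i - 1]:
                E[i][j] = E[i - 1][j - 1]
            else:
                E[i][j] = min(E[i - 1][j], E[i][j - 1]) + 1
    return E

E = suffix_edit_distances("pttapa", "patt")
# Row m = 4 holds the suffix edit distances between T[1..j] and P:
print(E[4])   # [4, 3, 2, 1, 2, 3, 2]; entries <= k = 2 mark approximate matches
```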
Finally, if we change the distance function as follows:

E(i,j) = E(i-1,j-1) if Tj = Pi, and E(i,j) = 0 if Tj ≠ Pi,
with boundary conditions E(0,j) = 1 for j = 0, 1, 2, …, n and E(i,0) = 0 for i = 1, 2, …, m,

then E(m,j) = 1 exactly when P occurs ending at position j, so we can use this function to find all occurrences of exact matching. This approach is simple, but not efficient, because the time complexity is O(mn).

3.6 The Convolution Approach to Solve the Exact String Matching Problem

For exact matching, one of the most straightforward approaches is the sliding window approach. Let us assume that we have a text string T = AAGTCTTCGA and another string P = AGTC. Our job is to check whether P appears in T. The sliding window approach slides P from left to right step by step, as shown in Figure 3-27.

Figure 3-27: The sliding window approach: P = AGTC is shifted across T = AAGTCTTCGA one position at a time.

This approach, although quite straightforward, cannot be programmed easily. Besides, the time complexity of this approach is O(mn), where n and m are the lengths of T and P respectively. We now introduce an approach, called convolution, which is almost the same as the sliding window approach. Yet, as we shall see, there are many advantages in using this approach.

We first introduce the definition of convolution in the continuous case. In communication theory, the convolution of two functions f(t) and g(t), denoted f(t) ∗ g(t), is defined as

f(t) ∗ g(t) = ∫ f(τ) g(t − τ) dτ, where the integration is over τ from −∞ to ∞.

The meaning of convolution can best be explained by imagining the following situation: we fix f(τ) and consider g(t − τ), which is g reflected and shifted by t. Setting t from −∞ to ∞ moves g(t − τ) from the left to the right. Initially the two functions do not intersect and their product is therefore 0. As g(t − τ) begins to intersect f(τ), the integration starts to contribute. The computation of the convolution is completed as soon as g(t − τ) departs from f(τ). An example is given below.
Consider the following two functions x(u) and y(u), where

x(u) = 1 for −1 <= u <= 1, and x(u) = 0 otherwise;
y(u) = u for 0 <= u <= 1, and y(u) = 0 otherwise.

The process of computing the convolution of x(u) and y(u) is illustrated in Figure 3-28.

Figure 3-28: An example of convolution. Panels (a) and (b) show x(u) and y(u); the middle panels show the reflected and shifted y(t − u) sweeping across x(u); the last panel shows the resulting z(t).

The relationship between convolution and the sliding window method for string matching can be explained as follows. Imagine the process of the sliding window method applied to two sequences T and P: first we move P to the extreme left and then move it towards the right. As soon as P meets T, checking begins, and the checking is completed as soon as P leaves T completely. The process is illustrated in Figure 3-29.

Figure 3-29: An illustration of the sliding window approach: P = P1 P2 … Pm is shifted across T = T1 T2 … Tn.

We can now see that the process of the sliding window method is quite similar to the process of convolution. In fact, we may use convolution to perform the job of the sliding window. In the following, we first define the discrete convolution.

The Definition of Convolution in the Discrete Case

Let X = (x0, ..., xm) and Y = (y0, ..., yn) be two given vectors with xi, yj ∈ D. Let ⊗ and ⊕ be two given functions, where ⊗: D × D → E and ⊕: E × E → E. Then the convolution of X and Y with respect to ⊗ and ⊕ is

X ⊛ Y = (z0, z1, ..., z_{m+n}), where zk = ⊕ over i + j = k of (xi ⊗ yj), for k = 0, ..., m + n.

According to the above definition:

z0 = x0 ⊗ y0
z1 = (x1 ⊗ y0) ⊕ (x0 ⊗ y1)
z2 = (x2 ⊗ y0) ⊕ (x1 ⊗ y1) ⊕ (x0 ⊗ y2)
…
z_{m+n−1} = (xm ⊗ y_{n−1}) ⊕ (x_{m−1} ⊗ yn)
z_{m+n} = xm ⊗ yn

How does convolution relate to solving the exact string matching problem by the sliding window method? First, we reverse Y to obtain Y' = (yn, y_{n−1}, ..., y1, y0).
We then perform the convolution between X and Y' as follows:

z0 = x0 ⊗ yn
z1 = (x1 ⊗ yn) ⊕ (x0 ⊗ y_{n−1})
z2 = (x2 ⊗ yn) ⊕ (x1 ⊗ y_{n−1}) ⊕ (x0 ⊗ y_{n−2})
…
z_{m+n−1} = (xm ⊗ y1) ⊕ (x_{m−1} ⊗ y0)
z_{m+n} = xm ⊗ y0

We further define ⊗ to be the character comparison function c(xi, yj), which is 1 if xi = yj and 0 otherwise, and we let ⊕ be integer addition. Then the convolution of X and Y' is

z0 = c(x0, yn)
z1 = c(x1, yn) + c(x0, y_{n−1})
z2 = c(x2, yn) + c(x1, y_{n−1}) + c(x0, y_{n−2})
…
z_{m+n−1} = c(xm, y1) + c(x_{m−1}, y0)
z_{m+n} = c(xm, y0)

Assume that m >= n. If zi = n + 1, the length of Y, for some i, we know that there is an exact match. Consider X = abaab and Y = aab again. We now have X = (x0, x1, x2, x3, x4) = (a, b, a, a, b) and Y' = (y2, y1, y0) = (b, a, a). Thus,

z0 = c(a,b) = 0
z1 = c(b,b) + c(a,a) = 2
z2 = c(a,b) + c(b,a) + c(a,a) = 1
z3 = c(a,b) + c(a,a) + c(b,a) = 1
z4 = c(b,b) + c(a,a) + c(a,a) = 3
z5 = c(b,a) + c(a,a) = 1
z6 = c(b,a) = 0

Since z4 = 3, the length of Y, we know that there is an exact match at the 4th shift, as shown in Fig. 3-30.

Fig. 3-30: An illustration of the discrete convolution process for X = abaab and Y = aab (shifts 0 to 4).

We may also view the discrete convolution in a graphical way, using the above case as an example:

0 1 0 0 1          (X compared with y'0 = b)
  1 0 1 1 0        (X compared with y'1 = a, shifted one position)
    1 0 1 1 0      (X compared with y'2 = a, shifted two positions)
---------------
0 2 1 1 3 1 0

Figure 3-31: A graphical way of demonstrating the discrete convolution.

From Fig. 3-31, it can easily be seen that convolution is very much like long multiplication. This property is quite important, as we shall use it later. Meanwhile, let us give the general algorithm for discrete convolution:

Algorithm 3-9: An algorithm for convolution defined on ⊗ and ⊕
Input: Two sequences X = x0 x1 … xm and Y = y0 y1 … yn.
Output: zk = ⊕ over i + j = k of (xi ⊗ yj) for k = 0, …, m + n.
for k = 0 to m + n do
    zk = the identity element of ⊕
    for i = 0 to k do
        zk = zk ⊕ (xi ⊗ y_{k−i})
    endfor
endfor
(Terms with i > m or k − i > n are skipped.)

For applying convolution to solve the exact string matching problem, the algorithm becomes:

Algorithm 3-10: An algorithm applying convolution to solve the exact string matching problem
Input: A text string X = x0 x1 … xm and the reversal Y' = y'0 y'1 … y'n of a pattern string Y.
Output: zk = Σ over i + j = k of c(xi, y'j) for k = 0, …, m + n.
for k = 0 to m + n do
    zk = 0
    for i = 0 to k do
        zk = zk + c(xi, y'_{k−i})
    endfor
endfor

From the above algorithm, we can find occurrences of exact matches as follows: if zi is equal to the length of the pattern string, there is an exact match at the i-th shift. In our example, where the text string is "abaab" of length 5 and the pattern string is "aab" of length 3, the discrete convolution yields z4 = 3, which stands for an exact match at the 4th shift. The time complexity of the above algorithm is O(mn).

In the following, we point out that the discrete convolution can be transformed into integer multiplication. Because integer multiplication can be solved in O(n log n) time, where n is the length of the longer input, by applying the convolution theorem and the fast Fourier transform, the time complexity of the discrete convolution may be reduced to O(n log n).

Transforming the Discrete Convolution into Integer Multiplication

We have already shown that the exact string matching problem may be solved by the discrete convolution defined on "character comparison" and "integer addition". To speed up the discrete convolution, we are going to transform it into integer multiplication. This approach was proposed in [FSK82] and can be explained in three steps. Consider the exact string matching problem: we are given two inputs, the text string T of length n and the pattern string P of length m.

1.
We first transform the input data Ti, 1 <= i <= n, and Pi, 1 <= i <= m, which are characters, into binary digits, one transformed string for each character of the alphabet, by an indicator function. Let x be a character in the pattern string or the text string and let a ∈ Σ (the alphabet). The indicator function is defined as follows:

f(x, a) = 1 if x and a are identical, and f(x, a) = 0 otherwise.

We extend the indicator function to strings in the natural way: for a string S = s1 s2 … sn and a ∈ Σ,

S^(a) = f(s1, a) f(s2, a) … f(sn, a).

In the exact string matching problem, we use the indicator function to transform the text string and the pattern string into T^(a) and P^(a) for every a ∈ Σ. In this format, the discrete convolution defined on the operations of "integer multiplication" and "integer addition" in Step 2 calculates the number of matches contributed by the corresponding character alone.

2. For each character a ∈ Σ, we apply the discrete convolution defined on "integer multiplication" and "integer addition" to T^(a) and P^(a), that is, C^(a) = T^(a) ⊛ P^(a) for every a ∈ Σ. In fact, this discrete convolution is equivalent to integer multiplication, because the transformed data can be treated as the digits of natural numbers and the discrete convolution can then be regarded as an integer multiplication. The result of each convolution gives the number of matches contributed by the corresponding character of the alphabet.

3. The results of these convolutions (integer multiplications) C^(a), a ∈ Σ, are then summed, S = Sum of C^(a) over a ∈ Σ, to form the solution to the exact string matching problem. The result of this final stage is equal to that of the original convolution.

The outline of the whole process of transforming the original convolution into integer multiplications is illustrated in Figure 3-32.
Figure 3-32: The outline of transforming the discrete convolution into integer multiplication.
Step 1 (indicator function): transform T and P into T^(a) and P^(a) for every a ∈ Σ.
Step 2 (convolution as integer multiplication): compute C^(a) = T^(a) ⊛ P^(a) for every a ∈ Σ.
Step 3 (summation): S = Sum of C^(a) over a ∈ Σ.

In the following, we use an example to demonstrate this process. Let us consider the case where the text string is X = abaab, the pattern string is Y = aab, and the alphabet is Σ = {a, b}.

Step 1: The corresponding transformed strings are

X^(a) = 10110, Y^(a) = 110, X^(b) = 01001, Y^(b) = 001.

To apply the discrete convolution to solve the exact string matching problem, we need the reversal Y' of Y as one of the inputs of the discrete convolution. Thus we have

Y'^(a) = 011, Y'^(b) = 100.

Step 2: We then apply the discrete convolution defined on "integer multiplication" and "integer addition" to the transformed strings, which amounts to two long multiplications without carries:

Z^(a) = X^(a) ⊛ Y'^(a), computed as 10110 × 011:
      1 0 1 1 0
    ×     0 1 1
    -----------
      1 0 1 1 0
    1 0 1 1 0
  0 0 0 0 0
  -------------
  0 1 1 1 2 1 0

Z^(b) = X^(b) ⊛ Y'^(b), computed as 01001 × 100:
      0 1 0 0 1
    ×     1 0 0
    -----------
      0 0 0 0 0
    0 0 0 0 0
  0 1 0 0 1
  -------------
  0 1 0 0 1 0 0

Fig. 3-33: A step in the multiplication.

Step 3: After all the characters in the alphabet are considered, the results of the characters are summed, as shown in Fig. 3-34:

Z^(a) = 0111210
Z^(b) = 0100100
Sum   = 0211310

Figure 3-34: The number of matches for every shift obtained by integer multiplication.

The combined result is equal to that of the original convolution. In Figure 3-34, the number of matches at each shift has been calculated; a number equal to the length of the pattern string (here z4 = 3) is what we are looking for. We summarize the above steps in Figure 3-35. In our example, we first transform the input data by the indicator function and then apply the discrete convolution defined on "integer multiplication" and "integer addition" to the transformed data. When we consider 'a' ('b'), the corresponding process is shown in the left (right) part of Figure 3-35.
Finally, we sum the two results, as shown in the middle of Figure 3-35.

Figure 3-35: The decomposition of the discrete convolution for string matching (X = abaab, Y = aab, Σ = {a, b}; the indicator strings, the two multiplications Z^(a) = 0111210 and Z^(b) = 0100100, and their sum Z = 0211310).

Without loss of generality, we may always decompose the original convolution into integer multiplications, since the number of matches contributed by a single character of the alphabet can be computed by the corresponding integer multiplication. Going through all characters in the alphabet set, we sum up the results, and the sum is exactly the solution to the exact string matching problem.

Having shown that the answer of exact string matching can be found by integer multiplication, we shall now introduce the mechanism to speed up integer multiplication by using the discrete Fourier transform.

Speeding Up Integer Multiplication by the Discrete Fourier Transform

In the above, we showed that the discrete convolution can be viewed as integer multiplication. In the following, we shall show another interesting fact, namely that integer multiplication can be performed through discrete Fourier transforms. To come to this point, let us think in the reverse direction: integer multiplication is actually a special kind of convolution involving multiplication and addition. That is, we express our integers x and y as two digit vectors X and Y, and the integer multiplication of x and y as the convolution of X and Y. That the convolution of X and Y can be computed through Fourier transforms follows from the convolution theorem.

Convolution Theorem: Given two vectors X and Y, let F(X) and F(Y) represent the Fourier transforms of X and Y respectively.
Let X ∗ Y denote the convolution of X and Y. Then the Fourier transform of X ∗ Y is F(X) · F(Y), where · denotes the component-wise vector product. Equivalently, the inverse Fourier transform of F(X) · F(Y), that is F⁻¹(F(X) · F(Y)), is X ∗ Y. The above description can be expressed by the following formulas:

F(X ∗ Y) = F(X) · F(Y)
F⁻¹(F(X ∗ Y)) = F⁻¹(F(X) · F(Y))
X ∗ Y = F⁻¹(F(X) · F(Y))

By applying the convolution theorem described above, we may speed up integer multiplication by the fast Fourier transform (FFT). The procedure is explained in Figure 3-36.

Figure 3-36: The procedure of computing convolution by the fast Fourier transform. The brute-force method computes X ∗ Y directly with O(n²) multiplications and O(n) additions. With the convolution theorem, we compute F(X) and F(Y) by FFT, form F(X) · F(Y) with O(n) multiplications and O(n) additions, and apply the inverse FFT; the total is O(n log n) + O(n) + O(n log n).

To simplify our discussion, we assume that the lengths of X and Y are both n. As Figure 3-36 indicates, we need n² multiplications and O(n) additions to obtain the result of the convolution of X and Y directly; the time complexity is thus O(n²). If we apply the convolution theorem and the fast Fourier transform, we may compute the convolution by first calculating the Fourier transforms of X and Y in O(n log n) time, then the vector product of F(X) and F(Y) through n multiplications and n additions, and finally the inverse Fourier transform of F(X) · F(Y) in O(n log n) time. The overall time complexity is O(n log n).

Next, we shall explain how we compute the Fourier transform in O(n log n) time, where n is the length of the input data.

Fast (Inverse) Fourier Transform

We first introduce the definition of the discrete Fourier transform. (For details, please consult Section 4.6 of Professor Lee's lecture notes on communications.) Given a sequence of numbers ak of length n, its discrete Fourier transform is

Ai = Σ from k = 0 to n−1 of ak e^(−j2πik/n), 0 <= i <= n − 1.

For example, consider the case where n = 4 and the sequence is (a0, a1, a2, a3).
To simplify the calculation, note that

e^(−j2π/n) = e^(−j2π/4) = cos(2π/4) − j sin(2π/4) = −j.

The result of the discrete Fourier transform is then computed as follows:

A0 = a0 + a1 + a2 + a3
A1 = a0 + a1(−j) + a2(−j)² + a3(−j)³ = a0 − ja1 − a2 + ja3
A2 = a0 + a1(−j)² + a2(−j)⁴ + a3(−j)⁶ = a0 − a1 + a2 − a3
A3 = a0 + a1(−j)³ + a2(−j)⁶ + a3(−j)⁹ = a0 + ja1 − a2 − ja3

Next, we introduce the definition of the inverse discrete Fourier transform. Given the discrete Fourier transform Ak of a sequence of numbers of length n, the inverse discrete Fourier transform is

ai = (1/n) Σ from k = 0 to n−1 of Ak e^(j2πik/n), 0 <= i <= n − 1.

For example, consider the case where n = 4 and the transform is (A0, A1, A2, A3). To simplify the calculation, note that e^(j2π/n) = e^(j2π/4) = j. The inverse discrete Fourier transform may be computed as follows:

a0 = (1/4)(A0 + A1 + A2 + A3) = (1/4)(4a0) = a0
a1 = (1/4)(A0 + A1 j + A2 j² + A3 j³) = (1/4)(4a1) = a1
a2 = (1/4)(A0 + A1 j² + A2 j⁴ + A3 j⁶) = (1/4)(4a2) = a2
a3 = (1/4)(A0 + A1 j³ + A2 j⁶ + A3 j⁹) = (1/4)(4a3) = a3

Computing the Fourier transform directly as above takes O(n²) time, where n is the length of the input data. To improve this time complexity, we adopt the concept of "divide and conquer", with which the Fourier transform may be computed in O(n log n) time instead of O(n²). An algorithm for computing the Fourier transform based upon the divide-and-conquer strategy is given below [LCTT2001].

Algorithm 3-11: Fast Fourier Transform
Input: a0, a1, …, a_{n−1}, with n = 2^k.
Output: Ai = Σ from k = 0 to n−1 of ak e^(−j2πik/n) for i = 0, 1, 2, …, n − 1.
1. If n = 2:
       A0 = a0 + a1
       A1 = a0 − a1
   Return.
2. Recursively find the coefficients of the Fourier transform of (a0, a2, …, a_{n−2}) and of (a1, a3, …, a_{n−1}). Let the coefficients be denoted as B0, B1, …, B_{n/2−1} and C0, C1, …, C_{n/2−1} respectively.
3.
Let ωn = e^(−j2π/n). For i = 0 to n/2 − 1:
       Ai = Bi + ωn^i Ci
       A_{i+n/2} = Bi − ωn^i Ci
End of Algorithm.

In the following, we give the fast inverse Fourier transform algorithm:

Algorithm 3-12: Fast Inverse Fourier Transform
Input: A0, A1, …, A_{n−1}, with n = 2^k.
Output: ai = (1/n) Σ from k = 0 to n−1 of Ak e^(j2πik/n) for i = 0, 1, 2, …, n − 1.
1. If n = 2:
       a0 = (1/2)(A0 + A1)
       a1 = (1/2)(A0 − A1)
   Return.
2. Recursively find the coefficients of the inverse Fourier transform of (A0, A2, …, A_{n−2}) and of (A1, A3, …, A_{n−1}). Let the coefficients be denoted as b0, b1, …, b_{n/2−1} and c0, c1, …, c_{n/2−1} respectively.
3. Let ω̄n = e^(j2π/n). For i = 0 to n/2 − 1:
       ai = (1/2)(bi + ω̄n^i ci)
       a_{i+n/2} = (1/2)(bi − ω̄n^i ci)
End of Algorithm.

Next, we shall describe how to solve integer multiplication by combining the convolution theorem and the fast Fourier transform; the resulting method is referred to as the "Strassen method".

A Fast Method for Integer Multiplication (the Strassen Method)

Let n be a power of two and let X and Y be two big integers with fewer than n coefficients each, such that X = Σ from j = 0 to n−1 of xj B^j and Y = Σ from j = 0 to n−1 of yj B^j, where B is the base.

Step 1: X* = (x*0, x*1, …, x*_{2n−1}) ≡ F_{2n}(x0, x1, …, x_{n−1}, 0, …, 0) (Fourier transform of the zero-padded digit vector).
Step 2: Y* = (y*0, y*1, …, y*_{2n−1}) ≡ F_{2n}(y0, y1, …, y_{n−1}, 0, …, 0) (Fourier transform).
Step 3: Z* = (z*0, z*1, …, z*_{2n−1}), where z*i = x*i y*i (vector multiplication).
Step 4: Z = (z0, z1, …, z_{2n−1}) = F⁻¹_{2n}(Z*) (inverse Fourier transform).
Step 5: After rearranging the coefficients zi (propagating carries), the number Z = Σ from i = 0 to 2n−1 of zi B^i is equal to the product of X and Y.

We may see that this method adopts the idea of the convolution theorem to compute the result by Fourier transforms. Steps 1 and 2 compute the Fourier transforms of the two input vectors, Step 3 computes the component-wise product of these transforms, and Step 4 computes the inverse Fourier transform of the resulting vector product. By using the fast Fourier transform (FFT) and the fast inverse Fourier transform (IFFT), integer multiplication may be solved in O(n log n) time. The following is a case with n = 2.
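Before tracing the n = 2 case by hand, the pieces above can be combined into a small Python sketch: a recursive FFT (Algorithm 3-11), its inverse (Algorithm 3-12), and the Strassen method for integer multiplication. It uses floating-point complex arithmetic and rounding, so it is an illustration rather than an exact big-integer routine.

```python
import cmath

def fft(a, sign=-1):
    # Recursive radix-2 FFT (Algorithm 3-11); sign=+1 gives the
    # un-normalized inverse transform used by Algorithm 3-12.
    n = len(a)
    if n == 1:
        return list(a)
    B = fft(a[0::2], sign)    # transform of the even-indexed coefficients
    C = fft(a[1::2], sign)    # transform of the odd-indexed coefficients
    A = [0j] * n
    for i in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * i / n)
        A[i] = B[i] + w * C[i]
        A[i + n // 2] = B[i] - w * C[i]
    return A

def ifft(a):
    return [v / len(a) for v in fft(a, sign=+1)]

def strassen_multiply(x, y, base=10):
    # Steps 1-5 of the Strassen method: digit vectors, FFTs, pointwise
    # product, inverse FFT, then carry propagation.
    xd, yd = [], []
    for v, d in ((x, xd), (y, yd)):
        while v:
            d.append(v % base)
            v //= base
    n = 1
    while n < len(xd) + len(yd):
        n *= 2                               # pad to a power of two
    X = fft(xd + [0] * (n - len(xd)))
    Y = fft(yd + [0] * (n - len(yd)))
    Z = ifft([xs * ys for xs, ys in zip(X, Y)])
    return sum(round(z.real) * base ** i for i, z in enumerate(Z))

print(fft([3, 2, 0, 0]))          # approximately [5, 3-2j, 1, 3+2j]
print(strassen_multiply(23, 12))  # 276
```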
X = x0 B⁰ + x1 B¹; Y = y0 B⁰ + y1 B¹
XY = x0 y0 B⁰ + (x0 y1 + x1 y0) B¹ + x1 y1 B² + 0·B³
X* = F4(x0, x1, 0, 0) = (x0 + x1, x0 − x1 j, x0 − x1, x0 + x1 j)
Y* = F4(y0, y1, 0, 0) = (y0 + y1, y0 − y1 j, y0 − y1, y0 + y1 j)
Z* = X* · Y* = (x0 y0 + x0 y1 + x1 y0 + x1 y1,
                x0 y0 − (x0 y1 + x1 y0) j − x1 y1,
                x0 y0 − x0 y1 − x1 y0 + x1 y1,
                x0 y0 + (x0 y1 + x1 y0) j − x1 y1)
F4⁻¹(Z*) = (x0 y0, x0 y1 + x1 y0, x1 y1, 0)
Z = x0 y0 B⁰ + (x0 y1 + x1 y0) B¹ + x1 y1 B² + 0·B³ = XY

The above result demonstrates that the approach is correct. In the following, we show a numerical example of the Strassen method:

n = 2, X = 23, Y = 12, B = 10
X = x0·10⁰ + x1·10¹; Y = y0·10⁰ + y1·10¹, with x0 = 3, x1 = 2, y0 = 2, y1 = 1
X* = F4(3, 2, 0, 0) = (5, 3 − 2j, 1, 3 + 2j)
Y* = F4(2, 1, 0, 0) = (3, 2 − j, 1, 2 + j)
Z* = (15, 4 − 7j, 1, 4 + 7j)
F4⁻¹(Z*) = (6, 7, 2, 0)
Z = z0·10⁰ + z1·10¹ + z2·10² + z3·10³ = 6·10⁰ + 7·10¹ + 2·10² + 0·10³ = 276 = 23 × 12

The reader can see that we have successfully found the answer. To summarize, we conclude that exact string matching can be solved by the discrete convolution with time complexity O(n log n), where n is the length of the longer string.

3.7 Some Other Applications of Convolution to Sequence Analysis

In the following sections, we shall discuss how convolution applies to several problems. To simplify the discussion, when we mention convolution, we mean the discrete convolution defined above for solving the exact string matching problem.

The Common Substring with k Mismatches Allowed

In this problem, we are given a string T and a string P, where P is much shorter than T, and we need to find a segment of T which is very similar to P, with up to k mismatches allowed. Such a segment may be found as a longest common substring with k mismatches allowed. The steps of applying convolution to find such a segment in T are as follows:

1. Apply convolution to T and the reversal of P and save the result in z.
2. Find the maximum value z_max in z such that z_max >= |P| − k.
3. If such a z_max exists at shift i, return the corresponding segment T_{i−|P|+1, i} of T; otherwise return NULL.
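The three steps above can be sketched as follows. For simplicity, the convolution values are computed here as plain match counts, and only shifts where P lies completely inside T are considered:

```python
def best_k_mismatch_segment(T, P, k):
    # Step 1: match counts for every full-overlap shift (the convolution values).
    z = [sum(t == p for t, p in zip(T[i:i + len(P)], P))
         for i in range(len(T) - len(P) + 1)]
    # Step 2: the largest count, provided it leaves at most k mismatches.
    best = max(range(len(z)), key=lambda i: z[i])
    if z[best] >= len(P) - k:
        # Step 3: return the segment of T aligned with P at that shift.
        return T[best:best + len(P)]
    return None

print(best_k_mismatch_segment("abaab", "aab", 0))   # 'aab' (exact match)
print(best_k_mismatch_segment("abaab", "abb", 1))   # 'aba' (one mismatch)
print(best_k_mismatch_segment("abaab", "ccc", 1))   # None
```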
An example demonstrating how the common substring with k mismatches allowed can be found by convolution is shown in Fig. 3-37. In this example, we set k to 3 and

T = CTAAAGTTCTCTGTGATTGTGTATT
P = TCTAGCAAT

Figure 3-37: The process of finding the longest common substring with 3 mismatches allowed by convolution (the indicator rows for each character are summed; the resulting sequence of match counts is 020112222724031322320422331112011, whose maximum value 7 >= |P| − 3 marks the segment).

From Figure 3-37, we can see that the longest common substring with 3 mismatches allowed in T, namely "TAAAGTTCT", is found by convolution.

Common Substrings with k Mismatches Allowed among Multiple Sequences

We can also use convolution to find common substrings with k mismatches allowed among a set of sequences. Suppose we are given several sequences S1, S2, …, Sn and we want to find a common substring with k mismatches allowed among them. We first find the longest common substring with k mismatches allowed of S1 and S2. We then test, again by convolution, whether the common substring we found occurs, with k mismatches allowed, in each of the remaining sequences. By doing so, the common substring with k mismatches allowed of these sequences can be found by convolution if it exists.

Determining the Similarity of Two DNA Sequences

An intuitive way to measure the similarity between two DNA sequences is to use the concept of the edit distance of two strings. The edit distance is defined as the smallest number of insertions, deletions and substitutions required to change one string into the other. If the edit distance between two strings is small, we say that the two strings are similar; otherwise, we say that they are distinct. But in some cases, using the edit distance as the measure of similarity between two strings is not practical.
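For reference, the edit distance mentioned above can be computed by the classical dynamic programming recurrence, which, unlike the suffix edit distance of Section 3.5, also allows a direct substitution step:

```python
def edit_distance(a, b):
    # D[i][j] = minimum number of insertions, deletions and substitutions
    # needed to transform a[:i] into b[:j].
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        D[i][0] = i
    for j in range(len(b) + 1):
        D[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            D[i][j] = min(D[i - 1][j] + 1,       # delete a[i-1]
                          D[i][j - 1] + 1,       # insert b[j-1]
                          D[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitute
    return D[len(a)][len(b)]

print(edit_distance("APAT", "PATT"))  # 2
```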
Consider the following case. Suppose that we have two strings, "CTAAAGTTCTCTGTGATTGTGTATT" and "TAAAGTTCTTGGTGGTAA". Their edit distance is quite large relative to the length of the longer string, which is 25, so we might conclude that these two strings are distinct. Actually, there is a good common substring between them, namely "TAAAGTTCT" ("CTAAAGTTCTCTGTGATTGTGTATT", "TAAAGTTCTTGGTGGTAA"), so it is not entirely safe to say that these two strings are not similar. We believe that, in addition to the edit distance, the longest common substring with mismatches allowed can also be used as a measure of the similarity of two sequences. If the score based on the longest common substring with mismatches allowed of two strings is high, we say that they are similar; otherwise, we say that they are distinct. As we indicated before, the longest common substring with mismatches allowed can be found by convolution; thus we may say that convolution can be used to measure the similarity between two sequences. The following is an example demonstrating the idea.

DNA sequence 1: agcctta
DNA sequence 2: cgcatc

Figure 3-38: The convolution of "agcctta" and the reversal "ctacgc" of "cgcatc"; the resulting sequence of match counts is 002103212000.

From Figure 3-38, we choose the result at the 6th shift, which has the maximum number of matches, as the score for similarity. This corresponds to a common substring with mismatches allowed, which is "gcat". Because the length of this common substring with mismatches allowed is quite large compared to the lengths of the two sequences, we say that they are similar. Next, let us examine another example:

DNA sequence 1: attgacat
DNA sequence 2: ccgtcg

Figure 3-39: The convolution of "attgacat" and the reversal "gctgcc" of "ccgtcg"; the resulting sequence of match counts is 0001012001100.

From Figure 3-39, we choose the result at the 7th shift as the score of similarity of these two sequences.
This corresponds to a common substring with mismatches allowed, which is "gac" (in sequence 1) or "gtc" (in sequence 2). Because the length of this substring is quite small compared to the lengths of the two sequences, we say that they are distinct.

We applied convolution to DNA sequences to find their similarity. The idea is to take, as the score of two DNA sequences, the highest number of matches in the result of the convolution, following the idea of the longest common substring with mismatches allowed. The procedure for computing the score is as follows:

Input: Two DNA sequences X and Y.
1. Compute the result of the convolution of X and the reversal of Y (Y').
2. From the result of Step 1, choose the maximum value max(zi).
3. Return max(zi).

In this experiment, we used 26 Hepatitis B virus sequences as our input and compared every pair of DNA sequences to find the maximum score by convolution. The 26 Hepatitis B virus sequences were known in advance to be divided into the following clusters, for evaluating the quality of the experiment later:

Table 3-x: Clusters of the 26 Hepatitis B virus sequences
Cluster 1: 1, 2, 3, 4          Cluster 4: 15, 16, 17, 18
Cluster 2: 5, 6, 7, 8, 9, 10   Cluster 5: 19, 20, 21, 22
Cluster 3: 11, 12, 13, 14      Cluster 6: 23, 24, 25, 26

We applied convolution to these DNA sequences to determine whether the clusters found by convolution are consistent with the given clustering. The results are summarized in Figure 3-40.

Figure 3-40: The pairwise scores of the 26 Hepatitis B virus sequences obtained by the convolution method.

In Figure 3-40, the scores between S1 and S2, S3 and S4 are relatively high (greater than 650) compared to the scores between S1 and the other sequences (less than 630). S5–S10, S11–S14, S15–S18, S19–S22 and S23–S26 show the same trend as S1–S4. This result is consistent with the clustering which was known in advance.
Within the same cluster, the scores are higher (greater than 650) than between different clusters (less than 630). Thus we can distinguish DNA sequences from different clusters by analyzing the result of the convolution of every pair of DNA sequences.

Searching in a DNA Sequence Database

We are now given a query sequence T, and we have to find all of the sequences similar to T within a large set P of target sequences. If we adopt the convolution operation, we measure the similarity between two sequences by the highest score produced by the convolution; if the score is high, we conclude that the sequences are similar. We are not advocating that our method is the only way to search a DNA sequence database, but the following experimental results show that our method is feasible.

In this experiment, we searched a DNA database with 1042 sequences, organized as follows: S1–S26 are Hepatitis B virus DNA sequences, S27–S162 are human mitochondrial DNA sequences, and S163–S1042 are other virus DNA sequences. We arbitrarily chose a segment from one of these 1042 sequences and then searched this segment against all of the other sequences by the convolution method. We tested two cases, summarized as follows:

Case 1: Query segment (a segment from a Hepatitis B virus DNA sequence):
CACAATACCACAGAGTCTAGACTCGTGGTGGACTTCTCTCAATTTTCT
The length of this segment is 48, so the score obtained by the convolution method cannot exceed 48.

Figure 3-41: The search result for a query segment taken from a Hepatitis B virus DNA sequence.

From Figure 3-41, we can see that the scores within S1–S26 (Hepatitis B virus) are higher than those within S27–S162 (human mitochondria). Some sequences in S163–S1042 also have a high score; this is because they are also virus DNA sequences.
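The score used in these searches, the maximum number of matches over all shifts, can be sketched as follows (a direct O(mn) computation rather than the FFT-based one):

```python
def similarity_score(X, Y):
    # Highest number of matching characters over all alignments of Y
    # against X -- the maximum value of the convolution of X and the
    # reversal of Y.
    best = 0
    for shift in range(-(len(Y) - 1), len(X)):
        matches = sum(1 for i in range(len(Y))
                      if 0 <= shift + i < len(X) and X[shift + i] == Y[i])
        best = max(best, matches)
    return best

print(similarity_score("agcctta", "cgcatc"))   # 3, as in Figure 3-38
print(similarity_score("attgacat", "ccgtcg"))  # 2, as in Figure 3-39
```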
Case 2: Query segment (a segment from a human mitochondria DNA sequence):

AAGTATTGACTCACCCATCAACAACCGCTATGTATTTCGTACATTACT

[Figure 3-42 plots the convolution score (0-60) of the query against each of the 1042 database sequences.]

Figure 3-42  The search result of querying a segment from a human mitochondria DNA sequence

In Case 2, the search result is similar to that of Case 1, except that this time no high scores appeared in S163-S1042. This is because human mitochondria DNA sequences are intrinsically different from virus DNA sequences.

Finding Repeating Groups in a DNA Sequence

In DNA sequences, there are often segments which occur repeatedly. Such segments are called repeating groups, defined as follows: Given a string S, the repeating groups of S are the substrings which are of length at least k >= 2 and occur at least q >= 2 times in S, where k and q are pre-specified. The steps of using convolution to solve this problem are as follows:

1. Use S and its reversal S' as the input of the convolution, and save the result in z.
2. Because the values in z are symmetric, we only need to examine one half of the result. Classify the entries with the same value into groups G1, G2, ..., Gp.
3. Without loss of generality, consider G1 = {z_i, z_j} (note that z_i = z_j = q). We then check whether S_{i,i+q-1} and S_{j,j+q-1} are identical or not.
4. Return S_{i,i+q-1} (or S_{j,j+q-1}) if they are identical.

We now use an example to explain how repeating groups can be found by convolution. Consider S = "abcxyabc". We use S and its reversal S' as the input of the convolution. The result of the convolution is shown in Figure 3-40.

S = abcxyabc

abcxyabc
cbayxcba
10000100
01000010
00100001
00010000
00001000
10000100
01000010
00100001
003000080000300

The two values 3 are the first peak and the second peak; the central 8 is the full self-match.

Figure 3-40  An example of finding repeating groups by convolution

From Figure 3-40, we may see that the repeating group "abc" can be found by convolution.

An Aid for Detecting Transposition

In this application, we are given S1 and S2.
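Before moving on, the repeating-group recipe of the previous subsection can be sketched in Python as follows. This is a minimal sketch with our own names: the verification step is done by direct substring comparison, and we assume, as in the text's example, that the matching positions behind a peak are consecutive.

```python
def match_convolution(x, y):
    # z[k] = number of pairs (i, j) with i + j == k and x[i] == y[j]
    z = [0] * (len(x) + len(y) - 1)
    for i, a in enumerate(x):
        for j, b in enumerate(y):
            if a == b:
                z[i + j] += 1
    return z


def repeating_group(s, k=2, q=2):
    """A peak z[shift] = L >= k in the left half of the convolution of S
    with its reversal S' hints at a segment of length L repeated at
    distance d = (|S|-1) - shift; the candidate is then verified directly."""
    z = match_convolution(s, s[::-1])
    n = len(s)
    for shift in range(n - 1):           # left half only; z is symmetric
        L, d = z[shift], (n - 1) - shift
        if L >= k:
            for i in range(n - d - L + 1):
                cand = s[i:i + L]
                if s[i + d:i + d + L] == cand and s.count(cand) >= q:
                    return cand
    return None
```

On S = "abcxyabc", the peak of value 3 at distance 5 yields the repeating group "abc".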
We want to know whether there exist substrings A and B such that S1 is the concatenation of A and B while S2 is the concatenation of B and A, as shown in Figure 3-41. If we transpose B and A in S2, S2 will be identical to S1.

S1: A B
S2: B A

Figure 3-41  An example of transposition

We also want to find the transposition point in S1. This can be done by observing the result of convolution. The idea of detecting the transposition point is shown in Figure 3-42.

[Figure 3-42 slides S2 = B.A over S1 = A.B and marks two shifts: one in which the "A" parts exactly match and one in which the "B" parts exactly match. The transposition point of S1 is where A ends and B begins.]

Figure 3-42  An illustration of detecting transposition and the transposition point

In Figure 3-42, we can see that the sum of the numbers of matches at two particular shifts, one being the exact match of "A" and the other the exact match of "B", is equal to the length of S1. After we find these two shifts, the transposition point is the position of S1 where B begins. Our task now becomes to find these two shifts, and we can use convolution to do so. We summarize the steps to detect the transposition point as follows:

1. Apply convolution on S1 and S2 and save the result in z.
2. If z_i + z_j = |S1| for some shifts i and j such that j - i = |S1|, return i + 1 as the transposition point.

The following is an example.

Input: S1 = abcdef, S2 = cdefab

The first detection and the second detection are shown as follows:

First detection:
    abcdef
cdefab

Second detection:
abcdef
  cdefab

They can be found by convolution as shown in Figure 3-43.

abcdef
bafedc
001000
000100
000010
000001
100000
010000
02000004000

Figure 3-43  An example of detecting the transposition point by convolution

In Figure 3-43, we can see that the sum of the numbers of matches at the 2nd shift (2 matches) and the 8th shift (4 matches) is equal to the length of S1; hence the transposition point is found at the 2 + 1 = 3rd position of S1.

Furthermore, we can also use convolution to detect whether S1 contains a concatenation of B and C while S2 contains the concatenation of C and B, as shown in Figure 3-44.
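Before turning to that variant, the basic detection, two shifts i and j with j - i = |S1| whose match counts sum to |S1|, can be sketched as follows. This is a minimal sketch with our own helper and names; shifts are 0-indexed in the code, while the text counts them from 1.

```python
def match_convolution(x, y):
    # z[k] = number of pairs (i, j) with i + j == k and x[i] == y[j]
    z = [0] * (len(x) + len(y) - 1)
    for i, a in enumerate(x):
        for j, b in enumerate(y):
            if a == b:
                z[i + j] += 1
    return z


def transposition_point(s1, s2):
    """Look for shifts i and j = i + |S1| with z[i] + z[j] == |S1|.
    At the first detection z[i] = |A|, so B starts right after it;
    i + 2 translates the 0-indexed shift i to the text's 1-indexed
    shift number plus one."""
    z = match_convolution(s1, s2[::-1])
    n = len(s1)
    for i in range(len(z) - n):
        j = i + n
        if z[i] > 0 and z[i] + z[j] == n:
            return i + 2
    return None
```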
S1: A B C D
S2: A C B D

Figure 3-44  Another example of transposition: the segments B and C are exchanged

The following is an example to demonstrate how this can be done by convolution. Assume that we are given S1 and S2 as follows:

S1: actgactgac
S2: acgatgactc

The result of the convolution is shown as follows:

actgactgac
ctcagtagca
1000100010
0100010001
0001000100
1000100010
0010001000
0001000100
1000100010
0100010001
0010001000
0100010001
0103011503240220020

Fig. 3-45  Another example of finding the transposition point

These two detections correspond to the following shifts.

First detection:
actgactgac
acgatgactc

Second detection:
actgactgac
acgatgactc

An Aid for Detecting Insertion/Deletion

In this application, we are given two sequences, S1 of length n and S2 of length m, where S1 is obtained by inserting a segment into S2. We want to find the inserted part. These two sequences are illustrated in Figure 3-46.

S1: A C B    (the segment "C" is inserted)
S2: A B

Figure 3-46  An illustration of insertion

A method to find the insertion "C" is to detect the shift at which the exact match of "A" occurs and the shift at which the exact match of "B" occurs by convolution. This is shown in Figure 3-48.

[Figure 3-48 slides S2 = A.B over S1 = A.C.B and marks two shifts: one aligning the common prefix A and one aligning the common suffix B; the unmatched middle segment C is the insertion.]

Figure 3-48  Detection of insertion

To use convolution to detect the insertion, the steps are as follows:

1. Apply convolution on S1 and S2 and save the result in z.
2. If z_{m-1} + z_{n-1} = |S2|, return S1_{z_{m-1}+1, n-z_{n-1}} as the inserted part.

The following is an example.

Input: S1 = abxycde, S2 = abcde

Their overlaps at the beginning and at the end are shown as follows:

abxycde
abcde

abxycde
  abcde

These overlaps can be found in the result of the convolution as follows:

abxycde
edcba
1000000
0100000
0000100
0000010
0000001
00002030000

The value 2 marks the overlap at the beginning and the value 3 marks the overlap at the end.

Fig. 3-49  The checking of insertions
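The two-peak test for an insertion can be sketched as follows. This is a minimal sketch with our own helper and names; z is 0-indexed, so z[m-1] and z[n-1] are the shifts at which S2 is fully aligned with the beginning and with the end of S1, respectively.

```python
def match_convolution(x, y):
    # z[k] = number of pairs (i, j) with i + j == k and x[i] == y[j]
    z = [0] * (len(x) + len(y) - 1)
    for i, a in enumerate(x):
        for j, b in enumerate(y):
            if a == b:
                z[i + j] += 1
    return z


def find_insertion(s1, s2):
    """If z[m-1] + z[n-1] == |S2|, a prefix and a suffix of S2 together
    cover S2 and both match S1 exactly; whatever lies between them in S1
    is the inserted part."""
    n, m = len(s1), len(s2)
    z = match_convolution(s1, s2[::-1])
    if z[m - 1] + z[n - 1] == m:
        return s1[z[m - 1] : n - z[n - 1]]   # the unmatched middle of S1
    return None
```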
Since z_{n-1} + z_{m-1} = z_6 + z_4 = 3 + 2 = 5 = |S2|, we conclude that an insertion occurs, and the inserted part "xy" can be identified.

An Aid for Detecting the Overlapping of Segments Resulting from the Shot-Gun Operations

We can also use convolution to find the overlaps among segments resulting from the shot-gun operations on DNA sequences. The shot-gun approach will be explained later; loosely speaking, it cuts a sequence into several segments. Consider the following sequence:

S = "AGGCTAGTTGCCTAGTAGT"

After two shot-gun operations, we may have the following segments:

Original sequence:   AGGCTAGTTGCCTAGTAGT
First breaking up:   AGGCT(1)  AGTTG(2)  CCTAG(3)  TAGT(4)
Second breaking up:  AGG(5)  CTAGT(6)  TGCCT(7)  AGTAGT(8)

To reconstruct the original sequence, we often try to determine how the segments overlap. We may apply convolution on every pair of segments. For example:

6: CTAGT
2:   AGTTG
7:      TGCCT

The above information shows that the relationship among segments 6, 2 and 7 is 6 -> 2 -> 7. Consider segments 2, 7 and 3. By using convolution, we will find the following information:

2: AGTTG
7:    TGCCT
3:      CCTAG

This corresponds to 2 -> 7 -> 3. This information is very useful if we want to reconstruct the original sequence.

The Corresponding Pair-wise Nucleotides in a DNA Sequence

In this application, we shall first define the substitution string of a string X. For a string consisting of A, C, G and T, the substitution string of X is obtained by making the following substitutions:

A -> T    T -> A    C -> G    G -> C

For instance, the substitution string of AACTGC is TTGACG. In this problem, we are given a sequence S. We want to know whether there exists a substring A whose substitution string also exists in S, as shown in Figure 3-50.

... A ...... Substitution(A) ...

Figure 3-50  The corresponding nucleotides in a DNA sequence

For instance, we are given the sequence "acttgacttgaac".
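The substitution string defined above is a simple character-wise map; a minimal Python sketch (the table name is ours):

```python
# A <-> T and C <-> G, the substitution table given in the text
SUBSTITUTION = {"A": "T", "T": "A", "C": "G", "G": "C"}


def substitution_string(x):
    """Substitution string of x, e.g. AACTGC -> TTGACG."""
    return "".join(SUBSTITUTION[c] for c in x.upper())
```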
We can see that its longest substring whose substitution string also exists in the same sequence is "acttg" (acttgacttgaac), whose substitution string is "tgaac" (acttgacttgaac). In fact, we can adopt convolution to find the corresponding pair-wise nucleotides in a DNA sequence by defining the matching operation in the convolution as follows:

c(s_i, s_j) = 1 if {s_i, s_j} is {A, T} or {C, G}, and c(s_i, s_j) = 0 otherwise, where s_i, s_j are in {A, T, C, G}.

We use the same example, with input data S = "acttgacttgaac", to demonstrate how the corresponding pair-wise nucleotides in a DNA sequence can be found by convolution. This is similar to finding repeating groups, as discussed earlier in this section. We use S and its reversal S' as the input of the new convolution whose matching operation is defined as above. The result of this convolution is shown in Figure 3-51.

acttgacttgaac
caagttcagttca
0011000110000
0000100001000
1000010000110
1000010000110
0100001000001
0011000110000
0000100001000
1000010000110
1000010000110
0100001000001
0011000110000
0011000110000
0000100001000
0001520018500058100251000

Figure 3-51  An example of finding pair-wise nucleotides in a DNA sequence by convolution

We only need to observe one half of the result, since it is symmetric. The result of this convolution tells us that in the 4th and the 10th shifts we find two segments of this DNA sequence which are pair-wise nucleotides of each other, as shown in Figure 3-52. In the 4th shift, the segment "acttg" (S_{1,5}) is paired, character by character, with "tgaac" (S_{9,13}); in the 10th shift, five more complementary character pairs are found. Note that the substituted character pairs of the pair-wise nucleotides are A-T and C-G, respectively.

Figure 3-52  The corresponding pair-wise nucleotides in S = acttgacttgaac

From Figures 3-51 and 3-52, we can conclude that the corresponding pair-wise nucleotides in S, which are "acttg" and "tgaac", can be identified by convolution.
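The modified convolution, with the matching operation c defined above, differs from the plain one only in the comparison; a minimal sketch with our own names:

```python
# A <-> T and C <-> G, the substitution table given in the text
SUBSTITUTION = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_convolution(s):
    """Convolution of S with its reversal S' under the modified match
    c(a, b) = 1 iff {a, b} is {A, T} or {C, G}, and 0 otherwise."""
    s = s.upper()
    y = s[::-1]
    z = [0] * (2 * len(s) - 1)
    for i, a in enumerate(s):
        for j, b in enumerate(y):
            if SUBSTITUTION[a] == b:     # complementary pair
                z[i + j] += 1
    return z
```

On S = "acttgacttgaac", this reproduces the bottom row of Figure 3-51; the peaks of value 5 mark the pair-wise segments "acttg" and "tgaac".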