advertisement

Alignment with Non-Overlapping Inversions on Two Strings Da-Jung Cho, Yo-Sub Han, and Hwee Kim Department of Computer Science, Yonsei University 50, Yonsei-Ro, Seodaemun-Gu, Seoul 120-749, Korea {dajung,emmous,kimhwee}@cs.yonsei.ac.kr Abstract. The inversion is one of the important operations in bio sequence analysis and the sequence alignment problem is well studied for bio sequence comparisons. This gives rise to the alignment with nonoverlapping inversion problem: Given two strings x and y, does there exist an alignment with non-overlapping inversions for x and y. We, in particular, consider the alignment problem when non-overlapping inversion are allowed for both x and y. Based on the properties of the inversion operation, we design an O(n3 ) algorithm using O(n2 ) space, where n is the size of the input string. 1 Introduction In modern biology, determining the exact order of DNA sequence and retrieving information of DNA sequence are important problems and alignments of sequences are widely used for biological sequence comparisons [1, 7, 12, 13]. For a DNA sequence, a chromosomal translocation is to relocate a piece of the DNA sequence from one place to another and, thus, rearrange the DNA sequence [9]. The chromosomal translocation is a crucial operation in DNA since it alters a DNA sequence and often causes genetic diseases [10]. A chromosomal inversion occurs when a single chromosome undergoes breakage and rearrangement within itself [11]. Based on the important biological events such as translocation and inversion, there is a well-defined string matching problem: given two strings and translocation or inversion, the string matching problem is finding all matched strings allowing translocations or inversions. Moreover, people proposed an alignment with translocation or inversion problem, which is closely related to find similarity between two given strings; that is to obtain minimal occurrences of translocation or inversion that transform one to another, applying to one string. Many researchers investigated eﬃcient algorithms for this problem [1–4, 6, 8, 13, 14]. The inversions, which are one of the important biological operations, are not automatically detected by the traditional alignment algorithms [14]. Schöniger and Waterman [13] introduced the alignment problem with non-overlapping inversions, defined a simplification hypothesis that all regions will not be allowed to overlap and showed an O(n6 ) time algorithm that computes local alignments with inversions between two strings of length n and m based on the dynamic programming. Vellozo et al. [14] presented an O(n2 m) algorithm, which improved the previous algorithm by Schöniger and Waterman. They built a matrix for one string and partially inverted string using table filling method with regard to the extended edit graph. Recently, Cantone et al. [1] introduced O(nm) time algorithm using O(m2 ) space for string matching problem, which is to find all locations in a given text of length n at which a given pattern of length m matches with non-overlapping inversions. Many diseases are often caused by genetic mutations, which can be inherited through generations and can result in new sequences from a normal gene [5]. In other words, we can have two diﬀerent sequences from a normal gene by diﬀerent mutations. This motivates us to examine the problem of deciding whether or not two gene sequences are mutated from the same gene sequence. In particular, we consider an inversion mutation. See Fig. 1 for an example. T G C G A G G G T x(1,2) Θ(x) = x(5,7) C G G C G T G A ··· x C T A C G Θ(x) = Θ(y) y(n−3,n−1) y(7,8) C A A A C A C x(n−1,n) x(9,9) C A C G C C T y(2,6) ··· ··· C G T T y Fig. 1. An example of non-overlapping inversions on both strings x and y, where Θx = (1, 2)(3, 3)′ (4, 4)′ (5, 7)(8, 8)′ (9, 9) · · · (n − 1, n) and Θy = (1, 1)′ (2, 6)(7, 8)(9, 9)′ · · · (n − 3, n − 1)(n, n)′ . Note that (i, i)′ denotes the alignment at position i without complementing x[i]. Note that this problem is diﬀerent from the previous approaches [13, 14], where a non-overlapping inversion occurs in one string and transforms it to the other string; namely Θ(x) = y for a set Θ of non-overlapping inversions. On the other hand, we consider more general case where inversions can occur in both x and y simultaneously. 2 Preliminaries Let A[a1 ][a2 ] · · · [an ] be an n-dimensional array, where the size of each dimension is ai for 1 ≤ i ≤ n. Let A[i1 ][i2 ] · · · [in ] be the element of A with indices (i1 , i2 , . . . , in ). Given a finite set Σ of character and a string s over Σ, we use |s| to denote the length of s and s[i] to denote the symbol of s at position i. We use s(i,j) to denote a substring s[i]s[i + 1] · · · s[j], where 1 ≤ i ≤ j ≤ |s|. We consider biological operation inversion θ and denote by θ(s) the reverse and complement of given string s. For example, θ(A) = T and θ(AGG) = θ(G)θ(G)θ(A) = CCT . We define an inversion operation θ(i,j) for a given range i, j as follows: θ(i,j) (s) = θ(s(i,j) ). For simplicity, we use (i, j) instead of θ(i,j) if the notation is clear in the context. We say that i+j 2 is the center of the inversion for (i, j). We define a set Θ of non-overlapping inversion to be Θ = {(i, j) | 1 ≤ i ≤ j and for ∀(i′ , j ′ ) ̸= (i, j) ∈ Θ, j < i′ or j ′ < i}. Then, for a set Θ of non-overlapping inversions and a string s, we have Θ(s) = s′ where { θ(s[j + k − i]) if (j, k) ∈ Θ and j ≤ i ≤ k ′ s [i] = s[i] otherwise. For example, if Θ = {(1, 1), (2, 3)} and s = AGCC, then Θ(s) = θ(A)θ(GC)C = T GCC. From now on, we use a set of inversions instead of a set of non-overlapping inversions since we only consider sets of non-overlapping inversions. Definition 1 (Problem). We formally define a new alignment problem with non-overlapping inversions on two strings as follows: Given two strings x and y of the same length, can we determine whether or not there exist two sets Θx and Θy of inversions such that Θx (x) = Θy (y)? 3 The Algorithm Vellozo et al. [14] considered the problem of finding an alignment of non-overlapping inversions on only one string, and proposed an O(n2 m) algorithm using O(n2 ) space for given two strings s and t of length n and m, respectively. We examine the problem in which non-overlapping inversions are allowed for both strings of the same length. Note that the problem is more diﬃcult than the problem considered by Vellozo et al. [14], since we can apply two sets of non-overlapping inversions Θx and Θy to x and y, respectively. We use x = AGCT and y = CGAA as our example strings for explaining the algorithm. Remark that θ(AG)Cθ(T ) = CT CA = Cθ(GA)A and, thus, we have two sets Θx = {(1, 2), (4, 4)} and Θy = {(2, 3)}. We start from building a table in which each cell contains a pair of a range and a character in each index of an input string. We define an array Tx [n][n + 1] for x as follows: if j < i, ((j, i), θ(x[j])) Tx [i][j] = ((i, i)′ , x[i]) if j = i, ((i, j − 1), θ(x[j − 1])) if j > i. We call all elements in Tx inversion fragments of x. For an inversion fragment F = ((p, q), σ) or ((p, p)′ , σ), we say that F yields the character σ. Inversion fragments become useful to compute a substring created by any inversion because of the following property of the inversion operation: θ(i−1,j+1) (x) = θ(x[j + 1])θ(i,j) (x)θ(x[i − 1]). Observation 1 For a string x and its Tx , 1. θ(i,j) (x) = θ(x[j])θ(x[j − 1]) · · · θ(x[i + 1])θ(x[i]), 2. ((i, j), θ(x[j])),((i+1, j−1), θ(x[j−1])),. . . ,((i+1, j−1), θ(x[i+1])),((i, j), θ(x[i])) are all inversion fragments in the i, i + 1, . . . , j − 1, j columns of Tx and have the same center. Note that it takes O(n2 ) time to construct Tx and the size of Tx is O(n2 ). We also construct Ty for y. Given a pair (((p1 , p2 ), σ1 ), ((q1 , q2 ), σ2 )) of two inversion fragments, we say that the pair is an agreed pair if q1 = p2 + 1 or p1 + p2 = q1 + q2 . Otherwise, we call it a disagreed pair. For two agreed pairs (((p1 , p2 ), σ1 ), ((q1 , q2 ), σ2 )) and (((q1 , q2 ), σ2 ), ((r1 , r2 ), σ3 )), we say that two pairs are connected by ((q1 , q2 ), σ2 ). We define an agreed sequence S to be a sequence of inversion fragments ((ai , bi ), σi ) for 1 ≤ i ≤ n, where a1 = 1, bn = n and (((ai , bi ), σi ), ((ai+1 , bi+1 ), σi+1 )) is an agreed pair for 1 ≤ i ≤ n − 1. Note that an agreed sequence S and a set of inversions Θ has one-to-one relationship, where inversion fragments with the same center represents the same inversion in Θ. We define an agreed sequence Sx to be legal if there exist Θx and Θy such that Θx (x) = Θy (y) and we can generate Sx from Θx . The main idea of our algorithm is to keep tracking of all possible agreed pairs for adjacent indices and check if there exist two connected pairs Px and Py for x and y, respectively, that generate a common substring of x and y; namely, we check if there exists a common substring σ1 σ2 σ3 and Px = (F1 , F2 ), (F2 , F3 ) for x and Py for y such that Fj yields σj . For instance, for x = AGCT and y = CGAA, there is a common substring CT C from index 1 to 3 such that Px = (((1, 2), C), ((1, 2), T )), (((1, 2), T ), ((3, 3)′ , C)) and Py = (((1, 1)′ , C), ((2, 3), T )), (((2, 3), T ), ((2, 3), C)) exist. Next, we define the following four sets for each index i: • • • • Ah [i] = {((p, i), σ) = Tx [i][p] | 1 ≤ p ≤ i ≤ n − 1} ∪ {((i, i)′ , x[i])}. At [i] = {((i+1, q), σ) = Tx [i+1][q +1] | i < q ≤ n}∪{((i+1, i+1)′ , x[i+1])}. B h [i] = {((p, q), σ) = Tx [i][j] | 1 ≤ p ≤ i < q ≤ n}. B t [i] = {((p, q), σ) = Tx [i + 1][j] | 1 ≤ p ≤ i < q ≤ n}. Namely, Ah [i] is the set of all inversion fragments that end at i and At [i] is the set of all inversion fragments that start from i + 1. Remark that B h [i] and B t [i] are sets of all inversion fragments that start before i and end after i + 1. From these four sets, we make the following observations: Observation 2 Given a string x and its four sets Ah [i], At [i], B h [i] and B t [i], the following statements hold: (1) For F1 = ((p, i), σ1 ) (or ((i, i), σ1 )) in Ah [i] and F2 = ((q, r), σ2 ) ∈ B t [i], (F1 , F2 ) is a disagreed pair. Similarly, for F1 = ((p, q), σ1 ) ∈ B h [i] and F2 = ((i + 1, r), σ2 ) (or ((i + 1, i + 1), σ2 )) in At [i], (F1 , F2 ) is a disagreed pair. (2) For F1 = ((p, i), σ1 ) (or ((i, i), σ1 )) in Ah [i] and F2 = ((i + 1, q), σ2 ) (or ((i + 1, i + 1), σ2 )) in At [i], (F1 , F2 ) is an agreed pair. (3) For F1 = ((p, q), σ1 ) ∈ B h [i] and F2 = ((r, s), σ2 ) ∈ B t [i], if p + q = r + s, (F1 , F2 ) is an agreed pair. Based on Observation 2, it is possible to create all agreed pairs for an index i by comparing Ah [i] and At [i], and B h [i] and B t [i], respectively. If we can connect pairs through all indices, then we are able to generate all agreed sequences, which are essentially all sets of non-overlapping inversions. We need additional tables Txc [|Σ|][|Σ|][|Σ|] and Tyc [|Σ|][|Σ|][|Σ|] to record a sequence of three characters generated by connected pairs. For a given index i, Txc [σ1 ][σ2 ][σ3 ] is true if there exist three inversion fragments ((ai−1 , bi−1 ), σ1 ), ((ai , bi ), σ2 )), ((ai+1 , bi+1 ), σ3 ) in a legal sequence Sx . We also define Txs [2][n + 1] and Tys [2][n + 1] as follows. For the given index i and 1 ≤ j ≤ n + 1, Txs [1][j] has elements (t, σ0 σ1 σ2 ) if Tx [i][j] = ((p1 , p2 ), σ1 ) is the ith element in a legal sequence Sx , t ≤ i and ((t, p1 + p2 − t), ω) is the tth element in Sx for some ω. Next, Txs [2][k] has elements (s, σ1 σ2 σ3 ) if Tx [i + 1][k] = ((q1 , q2 ), σ3 ) is the i + 1th element in Sx , s ≤ i + 1 and ((s, q1 + q2 − s), τ ) is the sth element in Sx for some τ . In other words, Txs and Tx contains information for the ith agreed pair in legal Sx . We design an algorithm that computes Txs and Tys for each index and checks whether or not Txs and Tys are empty for the last index. If Tx [1][j] yields σ, we set Txs [1][j] = {(1, AAσ)} as initial data. We also set Tys similarly. Now we run the following steps from index 1 to n − 1 for x and y. We only illustrate the case for x. (The case for y is similar.) Step 1. We check all inversion fragments in ith column of Tx . For Tx [i][j] = ((j, i), σ3 ) (or ((i, i)′ , σ3 )), if (j, σ1 σ2 σ3 ) ∈ Txs [1][j], then we add ((j, i), σ2 σ3 ) (or ((i, i)′ , σ2 σ3 )) to a set Ah . Namely, Ah contains every inversion fragment that can be the ith element in a legal sequence and ends at i. Step 2. We check all inversion fragments in i + 1th column of Tx . For Tx [i + 1][i + 1] = ((i + 1, i + 1),′ σ), we add ((i + 1, i + 1)′ , σ) to a set At . Moreover, for each Tx [i + 1][j] = ((i + 1, j − 1), σ), we add ((i + 1, j − 1), σ) to a set At . In other words, At contains every inversion fragment that can be the i + 1th element in a legal sequence and starts from i + 1. Lemma 1. Each cell in Txs and Tys has O(n) elements. Proof. Elements in Txs and Tys are of the form (p, σ1 σ2 σ3 ). The number of the combination for the second part is |Σ|3 , which is constant. For the first part, 1 ≤ p ≤ i for index i since p is the start index of the last inversion in an inversion sequence. Therefore, the number of possible elements for each cell is O(n). ⊓ ⊔ Lemma 2. We can build Ah and At in O(n2 ) using O(n) space. Proof. For Step 1, we need to check i + 1 inversion fragments, and for each inversion fragment we need to check O(n) elements by Lemma 1. Therefore, we need O(n2 ) for Step 1 and Ah has at most i + 1 inversion fragments. For Step 2, At has n − i inversion fragments. Therefore, the total process takes O(n2 ) time and requires O(n) space to store all inversion fragments. ⊓ ⊔ Once we have two set of inversion fragments representing Ah [i] and At [i], we can calculate all agreed pairs generated from Ah [i] and At [i]. Step 3. Based on Observation 2(2), for every ((p1 , i), σ1 σ2 ) (or ((i, i)′ , σ1 σ2 )) in Ah and ((i + 1, p2 ), σ3 ) (or ((i + 1, i + 1)′ , σ3 )) in At , we set Txc [σ1 ][σ2 ][σ3 ] = true. We also add (i + 1, σ1 σ2 σ3 ) to Txs [2][p2 ]. Lemma 3. Given Ah and At , we can update Txc and Txs in O(n2 ) time by Step 3. Proof. Since |Ah | ≤ i + 1 and |At | = n − i, we repeat Step 3 at most (i + 1)(n − i) times. Thus, we can update Txc and Txs in O(n2 ) time. ⊓ ⊔ We have calculated all agreed pairs generated from Ah [i] and At [i] in Step 3, and the other case of generating agreed pairs for the index i is to use the inversion fragments with same center from B h [i] and B t [i]. Note that inversions with same center represent the same inversion from Observation 1. Step 4. Based on Observation 2(3), for ((p1 , p2 ), σ2 ) = Tx [i][j] and ((q1 q2 ), σ3 ) = Tx [i+1][k], where p1 +p2 = q1 +q2 and i ̸= j and i+1 ̸= k, if (t, σ0 σ1 σ2 ) ∈ Txs [1][j] and t ≤ p1 , then we set Txc [σ1 ][σ2 ][σ3 ] = true and add (t, σ1 σ2 σ3 ) to Txs [2][k]. Lemma 4. We can update Txc and Txs in O(n2 ) time by Step 4. Proof. As illustrated in Fig. ??, we can construct Tx such that all inversion fragments at each column are sorted by the center. Therefore, we can check two inversion fragments with same center in O(n). Since for each inversion fragment we need to check O(n) elements in Txs from Lemma 1, the whole process takes O(n2 ). ⊓ ⊔ Step 5. For all three letter strings σ1 σ2 σ3 over Σ, { true if Txc [σ1 ][σ2 ][σ3 ] and Tyc [σ1 ][σ2 ][σ3 ] are both true, c Tx [σ1 ][σ2 ][σ3 ] = false otherwise. Once we recompute Txc , then, for each (p, σ1 σ2 σ3 ) in Txs , we remove (p, σ1 σ2 σ3 ) from Txs if Txc [σ1 ][σ2 ][σ3 ] is false. The process ensures that Txs and Tys produce the same sequence of characters by connected pairs. Lemma 5. It takes O(n2 ) time to generate Txs and Tys such that elements in both array generate the same set of sequences of characters. Proof. After modifying Txc , we need to remove elements in Txs according to Txc . Since the size of Txs is O(n2 ) and the size of Txc is constant, the process takes O(n2 ) time. ⊓ ⊔ We are now ready to present the whole procedure of our algorithm. (Algorithm 1 is a pseudo description of the proposed algorithm.) Algorithm 1: 1 2 3 4 5 6 7 8 9 10 11 12 Input: String x, y Output: Boolean (whether or not there exists Θx , Θy s.t. Θx (x) = Θy (y).) /* time complexity: O(n3 ), space complexity: O(n2 ) make Tx and Ty . initialize Txs and Tys . for i ← 1 to n − 1 do clear and initialize arrays used in previous iteration. store information for agreed sequences to Txs , Txc , Tys , Tyc . Txc , Tyc ← Txc ∧ Tyc modify Txs and Tys according to Txc and Tyc . copy the second column of Txs and Tys to the first column. */ if the second column of Txs and Tys are not empty then return true else return false Once we finish calculating Txs and Tys using Steps 1, 2, 3, 4 and 5 from index 1 to n − 1, we check whether or not the last columns in Txs and Tys are empty. If they are not empty, then there are agreed sequences for x and y that generate the same string, which means that x and y can generate the same string by applying non-overlapping inversions. If they are empty, then there is no agreed sequences for x and y that generate the same string. The proposed algorithm runs in O(n3 ) time using O(n2 ) space. Theorem 3. We can solve the the the alignment with non-overlapping inversion problem in Definition 1 in O(n3 ) time using O(n2 ) space, where n is the size of input strings. 4 Conclusions The inversion is an important operation for bio sequences such as DNA or RNA and is closely related to mutations. We have, in particular, considered nonoverlapping inversions on both bio sequences, which is important to find the original common sequence from two mutated sequences. We have proposed a new problem, alignment with non-overlapping inversions on two strings, and presented a polynomial algorithm for the problem. Given two strings x and y, based on the inversion properties, our algorithm decides whether or not there exist two sets Θx and Θy of inversions for x and y such that Θx (x) = Θy (y) in O(n3 ) time using O(n2 ) space, where n = |x| = |y|. The proposed problem is about the sequence alignment and can be extended to approximate pattern matching or edit distance problem. References 1. D. Cantone, S. Cristofaro, and S. Faro. Eﬃcient string-matching allowing for nonoverlapping inversions. Theoretical Computer Science, 483:85–95, 2013. 2. D. Cantone, S. Faro, and E. Giaquinta. Approximate string matching allowing for inversions and translocations. In Proceedings of the Prague Stringology Conference, pages 37–51, 2010. 3. Z.-Z. Chen, Y. Gao, G. Lin, R. Niewiadomski, Y. Wang, and J. Wu. A spaceeﬃcient algorithm for sequence alignment with inversions and reversals. Theoretical Computer Science, 325(3):361–372, 2004. 4. S. Grabowski, S. Faro, and E. Giaquinta. String matching with inversions and translocations in linear average time (most of the time). Information Processing Letters, 111(11):516–520, 2011. 5. Z. Ignatova, K. Zimmermann, and I. Martinez-Perez. DNA Computing Models. Advances in Information Security, v. 42. 2008. 6. J. D. Kececioglu and D. Sankoﬀ. Exact and approximation algorithms for the inversion distance between two chromosomes. CPM ’93, pages 87–105, 1993. 7. S. C. Li and Y. K. Ng. On protein structure alignment under distance constraint. ISAAC ’09, pages 65–76, 2009. 8. O. Lipsky, B. Porat, E. Porat, B. R. Shalom, and A. Tzur. Approximate string matching with swap and mismatch. ISAAC’07, pages 869–880. 2007. 9. C. M. Ogilvie and P. N. Scriven. Meiotic outcomes in reciprocal translocation carriers ascertained in 3-day human embryos. European Journal of Human Genetics, 10(12):801–806, 2009. 10. M. Oliver-Bonet, J. Navarro, M. Carrera, J. Egozcue, and J. Benet. Aneuploid and unbalanced sperm in two translocation carriers: evaluation of the genetic risk. Molecular Human Reproduction, 8(10):958–963, 2002. 11. T. S. Painter. A New Method for the Study of Chromosome Rearrangements and the Plotting of Chromosome Maps. Science, 78:585–586, Dec. 1933. 12. Y. Sakai. A new algorithm for the characteristic string problem under loose similarity criteria. ISAAC’11, pages 663–672, 2011. 13. M. Schöniger and M. S. Waterman. A local algorithm for DNA sequence alignment with inversions. Bulletin of Mathematical Biology, 54(4):521–536, 1992. 14. A. F. Vellozo, C. E. R. Alves, and A. P. do Lago. Alignment with non-overlapping inversions in O(n3 )-time. WABI’06, pages 186–196, 2006.