Approximate On-line Palindrome Recognition, and Applications Amihood Amir Benny Porat Moskva River Confluence of 4 Streams CPM 2014 Palindrome Recognition - Voz'mi-ka slovo ropot, - govoril Cincinnatu ego shurin, ostriak, -- I prochti obratno. A? Smeshno poluchaetsia? Vladimir Nabokov, Invitation to a Beheading (1) "Take the word ropot [murmur]," Cincinnatus' brother-in-law, the wit, was saying to him, "and read it backwards. Eh? Comes out funny, doesn't it?" [--› topor: the axe] A palindrome is a string that is the same whether read from right to left or from left to right: Examples: доход A man, a plan, a cat, a ham, a yak, a yam, a hat, a canal-Panama! Palindrome Example Ibn Ezra: Medieval Jewish philosopher, poet, Biblical commentator, and mathematician. Was asked: ""אבי אל חי שמך למה מלך משיח לא יבא [ My Father, the Living God, why does the king messiah not arrive?] His response: " שוב אשוב אליכם כי בא מועד,"דעו מאביכם כי לא בוש אבוש [ Know you from your Father that I will not be delayed. I will return to you when the time will come ] Palindromes in Computer Science Great programming exercise in CS 101. Example of a problem that can be solved by a RAM in linear time, but not by a 1-tape Turing machine. (Can be done in linear time by a 2-tape TM) Palindrome Concatenation We may be interested in finding out whether a string is a concatenation of palindromes of length > 1. Example: ABCCBABBCCBCAACB Why would we be interested in such a funny problem? – we’ll soon see Exercise: Do this in linear time… Stream 2 - Approximations As in exact matching, there may be errors. Find the minimum number of errors that, if fixed, will give a string that is a concatenation of palindromes of length > 1 ABCCBABBCCBCAACB Example: ABCCBCBBCCBCABCB For Hamming distance: A-Porat [ISAAC 13]: Algorithm of time O(n2) Stream 3 - Reversals Why is this funny problem interesting? Sorting by reversals: In the evolutionary process a substring may “detach” and “reconnect” in reverse: ABCABCDAABCBAD ABCABCDAABCBAD CBAADCB Sorting by Reversals What is the minimum number of reversals that, when applied to string A, result in string B? History: Introduced: Bafna & Pevzner [95] NP-hard: Carpara [97] Approximations: Christie [98] Berman, Hannenhalli, Karpinski [02] Hartman [03] Sorting by Reversals – Polynomial time Relaxations Signed reversals: Hannenhalli & Pevzner [99] Kaplan, Shamir, Tarjan [00] Tannier & Sagot [04] . . . Disjointness: Swap Matching Muthu [96] Two constraints: 1. The length of the reversed substring is limited to 2. 2. All swaps are disjoint. Pattern Matching with Disjoint Reversals • Reversal Distance (RD): – The RD between s1 and s2 is the minimum number k, such that there exist s2’ , where HAM(s1,s2’) =k, and s1 reversal match s2. S1: S2: A D C B A E D B A A D A A B A B C E RD(S1,S2) = 2 Connection between Reversal Matching and Palindrome Matching Interleave Strings: S1: S2: A D C B A E D B A C D A A B A B D E A C D D C A B A A B E A D B B D A E On-line Input Suppose that we get the input a byte at a time: For the palindrome problem: A C D D C ABAABEADBBDAE On-line Input Suppose that we get the input a byte at a time: For the reversal problem: A C D D C A B A A B E A D B B D A E Main Idea – Palindrome Fingerprint The Rabin Karp Fingerprint Φ(S)=r1s0+ r2s1+… rmsm-1 mod (p) s0,s1,s2,…sm-1 ΦR(S)=r-1s0+ r-2s1+… r-msm-1 mod (p) The Reversal Fingerprint If rm+1ΦR(S) = Φ(S) => S is a palindrome. w.h.p. Palindrome Fingerprint If rm+1ΦR(S) = Φ(S) => S is a palindrome. Φ(S)=r1s0+ r2s1+… rmsm-1 mod (p) Example: r6ΦR(S)= ΦR(S)=r-1s0+ r-2s1+… r-msm-1 mod (p) S=ABCBA r6 (1/r A + 1/r2 B + 1/r3 C + 1/r4 B + 1/r5 A) = r5 A + r4 B + r3 C + r2 B + r A = Φ(S) Simple Online Algorithm for Finding a Palindrome in a Text t1,t2,t3, … ti,ti+1,ti+2 ,…ti+m, ti+m+1 , … tn Φ=r1ti+ r2ti+1+… rmti+m mod (p) If rm+1ΦR =Φ => ΦR=r-1ti+ r-2ti+1+… r-mti+m mod (p) there is a palindrome starting in the i-th position. If not, then for the next position: Φ= Φ + rm+1t i+m+1 mod (p) ΦR=ΦR + r-(m+1)ti+m+1 mod (p) Note: This algorithm finds online whether the prefix of a text is a permutation. For finding online whether the text is a concatenation of permutations, assume even-length permutations, otherwise, every text is a concatenation of length-1 permutations. Palindrome with mismatches Start with 1 mismatch case. 1-Mismatch S= … s0,s1,s2, sm-1 Choose l prime numbers q1,…,ql < m such that l q i 1 i m 1-Mismatch S= S2,0= S2,1= s0,s2,s4 … S3,0= S3,1= S3,2= s0,s3,s6 … s1s3,s5 … s1,s4,s7 … s2,s5,s8 … … s0,s1,s2, mod 2 mod 3 sm-1 Examples: q1=2, q2=3 For each qi construct qi subsequences of S as follows: subsequence Sqi,j is all elements of S whose index is j mod qi. Example S= S2,0= s0,s2,s4 S2,1= s1s3,s5 S3,0= s0,s3 S3,1= s1,s4 S3,2= s2,s5 s0,s1,s2, s3,s4,s5 mod 2 mod 3 1-Mismatch • We need to compare: s0 , s1, s2, sm-1, sm-2, sm-3 … sm-2 ,sm-1 … s1 , s0 • We prove that in the partitions strings: Sq,j= SRq,(m-1-j)mod q Example S2,0= s0,s2,s4 S2,1= s1s3,s5 S3,0= s0,s3 S3,1= S3,2= S= s0,s1,s2,s3,s4,s5 SR= s5,s4,s3,s2,s1,s0 S3,0= s0,s3 SR3,2= s5,s2 s1,s4 S3,1= s1,s4 s2,s5 SR3,1= s4,s1 S2,0= s0,s2,s4 SR2,1= s5s3,s1 Exact Matching Lemma: S=SR Sq,j = SRq,(m-1-j) mod for all q and all 0 ≤ j ≤ q. q 1-Mismatch Lemma: There is exactly one mismatch There is exactly one subpattern in each group that does not match. C.R.T Chinese Remainder Theorem Let n and m two positive integers. a mod n b mod n a mod m b mod m a mod nm b mod nm In our case: if two different indices, i and j, have an error, and only one subsequence is erroneous, since the product of all q’s > m, it means that i=j. Complexity There exists a constant c such that, for any x<m, there are at least x/log m prime numbers between x and cx. log m Therefore, choose log log m prime numbers between log m and c log m. Complexity • For each qi we compute 2qi different fingerprints: log 2 m • Overall space: l 2 q i O log log m io • Each character participates in exactly two fingerprints (the regular and the reverse). log m 2 l O • Overall time: log log m Online All fingerprint calculations can be done online We know the m at every input character, to compute the comparisons. Conclude: Our algorithm is online. k-Mismatches Use Group testing… k-Mismatches Group Testing • Given n items with some positive ones, identify all positive ones by a small number of tests. • Each test is on a subset of items. • Test outcome is positive iff there is a positive item in the subset. k-Mismatch • Group: partition of the text. • Test: distinguish between: (using the 1-mismatch algorithm) – match – 1-mismatch – more then 1-mismatch k-Mismatches S= S2,0= S2,1= s0,s2,s4 … S3,0= S3,1= S3,2= s0,s3,s6 … s1s3,s5 … s1,s4,s7 … s2,s5,s8 … … s0,s1,s2, mod 2 mod 3 sm-1 Each Sq,j is a group in our group testing Similar to the 1-mismatch algorithm just with more prime numbers… l q 1 q 2 q 3 ... q l s.t q i 1 i m k Our tests • We define The reversal pair of Sq,j to be SRq,(m-1-j)mod q • Each partition is “tested against” its reversal pair. Correctness s0,s1,s2, … sj …. sm-1 i2 i9 i5 i7 i For any group of k character i1,i2,..ik There exists a partition where sj appears alone C.R.T Correctness s0,s1,s2, … sj …. sm-1 i2 i9 i5 i7 i If sj invokes a mismatch we will catch it. Complexity • Overall space: k log 2 m O 2 log log m • Overall time: k 2 log 4 m O 2 log log m Approximate Reversal Distance Using the palindrome up to k-mismatches algorithm, can be solved in O n log k nk 2 O n k 2 time, and space. спасибо