Exact and Approximate Pattern Matching in the Streaming Model
Benny Porat and Ely Porat, FOCS 2009
Presented by - Tanushree Mitra

Problem Statement
• Find all instances of a pattern P of length m as a contiguous substring of a text string T of length n, where m < n.

Contributions
• Exact pattern matching - A fully online randomized algorithm for the classical pattern matching problem
  – Time complexity - O(log m) per character that arrives
  – Space complexity - O(log m), breaking the O(m) barrier that held for this problem for a long time
• Approximate pattern matching - An algorithm for the pattern matching with k mismatches problem
  – Time complexity - O(k^2 poly(log m)) per character
  – Space complexity - O(k^3 poly(log m))

Applications
• Monitoring Internet traffic
• Computational biology
• Large-scale web searching
• Virus and malware detection
• Automatic stock market analysis
• Robotics

Background
• Brute Force Algorithm
  – Slide the pattern along the text, and
  – Compare it to the corresponding portion of the text
  – Time complexity - O(mn)
• A speedup is possible in each of these 2 steps.
• Sliding step speedup, by pre-processing the pattern
  – Knuth-Morris-Pratt (KMP) algorithm
  – Boyer-Moore algorithm
  – Ukkonen's algorithm to construct suffix trees
• Comparison step speedup
  – Rabin-Karp algorithm

Quick History

The Intuition
• When the Rabin-Karp algorithm is done with the i'th character and advances to the next position in the text, it does not use any of the information gathered so far.
• The KMP algorithm, on the other hand, puts that information to good use.

The Idea
• Combine the key features of the KMP and Rabin-Karp algorithms to achieve an online algorithm that uses less space.
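Illustration - Rabin-Karp comparison speedup
The background above can be made concrete with a minimal sketch of the classical Rabin-Karp algorithm. This is the textbook offline version, not the streaming algorithm of the paper; the function name rabin_karp and the default base r and modulus p are assumptions chosen for the example.

```python
def rabin_karp(text: str, pattern: str, r: int = 256, p: int = 1_000_000_007) -> list[int]:
    """Return every start index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # Weight of the character that leaves the window on each slide.
    high = pow(r, m - 1, p)

    # Fingerprints of the pattern and of the first text window.
    fp_pat = fp_win = 0
    for i in range(m):
        fp_pat = (fp_pat * r + ord(pattern[i])) % p
        fp_win = (fp_win * r + ord(text[i])) % p

    hits = []
    for start in range(n - m + 1):
        # Equal fingerprints may be a false positive, so verify directly.
        if fp_win == fp_pat and text[start:start + m] == pattern:
            hits.append(start)
        if start + m < n:
            # Slide: drop text[start], append text[start + m], in O(1).
            fp_win = ((fp_win - ord(text[start]) * high) * r
                      + ord(text[start + m])) % p
    return hits


if __name__ == "__main__":
    print(rabin_karp("abracadabra", "abra"))
```

Note that each slide needs the departing character text[start], and the verification step reads the whole window, so this version keeps the last m characters available. That is the O(m) space the streaming algorithm aims to avoid.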
Definitions - Fingerprints
• Fingerprint - String S → φ(S)
• Sliding Fingerprint
• Polynomial Fingerprint - φ_{r,p}(S) = s_1 r + s_2 r^2 + … + s_l r^l mod p, where p ∈ Θ(n^4), r ∈ F_p
• False Positives - If S_1 ≠ S_2, then the probability that φ_{r,p}(S_1) = φ_{r,p}(S_2) is < 1/n^3

Definitions - Period_{P_l}
• Period - A prefix S_p = s_1, s_2, …, s_l of a string S = s_1, s_2, …, s_n is defined to be a period of S iff s_i = s_{i+l} for 1 ≤ i ≤ n − l
• Period_{P_l} - For a pattern P = p_1, p_2, …, p_m, the prefix of length l is P_l = p_1, p_2, …, p_l, 0 ≤ l ≤ m. The shortest period of P_l is period_{P_l}

Put the information to good use
• If P_l matches the text at a given index i, then there cannot be another match starting strictly between i and i + |period_{P_l}|

The Idea
• A match at the i'th index indicates that we know the last m characters, so is there any point in saving them? Slide over |period_P| positions to the next position that could be a match.
• Preprocessing phase
  – Calculate the sliding fingerprint of the pattern, φ_P, and of its shortest period, φ_period
• Online phase
  – Slide the fingerprint φ over the entire text
  – While φ = φ_P, slide φ by |period_P| characters
  – If we do not reach the end of the text, abort
• False positives?? Very LOW PROBABILITY of false positives
• Text and pattern should satisfy stringent restrictions

Go for subpatterns
• log m subpatterns
  [Figure: the pattern p_1, p_2, p_3, …, p_{m-3}, p_{m-2}, p_{m-1}, p_m divided into blocks labelled
   P_1: p_m
   P_2: p_{m-2}, p_{m-1}
   P_4: p_{m-6}, p_{m-5}, p_{m-4}, p_{m-3}
   …
   P_{m/2}: p_1, p_2, p_3, …, p_{m/2}]
• Starting point - Find a position in which the smallest subpattern matches the text. The smallest subpattern is of length 1, so this can be found easily.

Algorithm
• Guidelines
  – Find a position where P_i is a match, then try to match P_{i+1} from the same starting point as P_i
  – If P_{i+1} does not match, use the information that P_i is a match
  – Check in jumps of |period_{P_i}| until there is no overlap with the area where P_i matches

PROCESS
1. Initialize an empty sliding fingerprint φ.
2. For each character that arrives:
  – Extend φ to include the new character
  – If |φ| = 2^i for some 0 ≤ i ≤ log m and φ ≠ φ_i:
    • If φ overlaps the last match by at least |period_{P_{i-1}}| characters, slide φ by |period_{P_{i-1}}| characters.
    • Else, abort.
• What if there is a match that starts in the substring of the 1st process and ends in the substring of the 2nd process?

Exact_PM Final Algorithm - Introduce Checkpoint
• Checkpoint - Start a new process at the last checkpoint of each process

Final Algorithm - Preprocessing
• Preprocessing
  – Initialize an empty sliding fingerprint φ.
  – For each 0 ≤ i ≤ log m, calculate the sliding fingerprint φ_i of P_i and φ_{i,period} of the period of P_i

Final Algorithm - Online Phase
• Online Phase
  – Start a new process
  – Send every character that arrives to all the processes
  – If some process aborts, start a new process
  – If some process A reaches a checkpoint:
    • Stop the 'son process' of A (if it has one)
    • Start a new 'son process' of A

Complexity
• Space
  – All fingerprints from preprocessing use O(log m) space.
  – Each process saves another fingerprint, and there can be at most log m processes in parallel
  – OVERALL usage - O(log m) space
• Time
  – Each process spends O(1) time for each new character that arrives
  – At any time there are at most 3 log m processes running (1. process A, 2. son-process of A, 3. grandson-process of A; A has to die when the great-grandson of A is created)
  – OVERALL running time - O(log m) per character

Pattern Matching (1-Mismatch)
• Partition the pattern and the text
• We need to align every partition P_{q_i,j} of the pattern to q_i text shifts

Intuition
• For each P_{q_i,j}, run q_i processes of Exact_PM.
• Process_{q_i,j,σ} - the σ'th process of the subpattern P_{q_i,j}, for 0 ≤ σ < q_i. It tries to match P_{q_i,j} to the text by considering the text as if it starts from the σ'th character. (τ mod q_i = j − σ)
• If for all q_i,
  – numOfNotMatch_{q_i,σ} = 0, report 'match'
  – numOfNotMatch_{q_i,σ} = 1, report 'exactly 1-mismatch'
  – Otherwise, report 'more than 1-mismatch'

Complexity
• FACTS
  – Run ∑_{i=1}^{l} q_i^2 processes of Exact_PM
  – There exists a constant c such that for any x, there exist (x / log m) prime numbers between x and cx
  – We have q_1, q_2, …, q_l groups of partitions.
    Each q_i is a prime number
• Space - O(log^4 m / log log m)
• Time - O(log^3 m / log log m)

Pattern Matching (k-Errors)
• Preprocessing Phase
  – Initialize a process Process_{q_i,j,σ} of 1-mismatch for each q_i ∈ {q_1, q_2, …, q_l}, 0 ≤ j < q_i and 0 ≤ σ < q_i
• Online Phase
  – Send the τ'th character to each Process_{q_i,j,σ} such that τ mod q_i = j − σ
• d = all mismatches from all processes that return 'exactly 1-mismatch'
  – d > k: more than k mismatches

Complexity
• Space
  – Run ∑_{i=1}^{k log m} q_i^2 ∈ O(k^3 log^4 m / log log m) processes of 1-mismatch in parallel.
  – Each process requires log^4 m space.
  – OVERALL - O(k^3 poly(log m))
• Time
  – The number of processes of the 1-mismatch algorithm is bounded by ∑_{i=1}^{k log m} q_i^2 ∈ O(k^3 log^4 m / log log m)
  – Running time for each character - O(log^3 m)
  – OVERALL - O(k^2 poly(log m))

Concluding Discussion
• The Two-Dimensional String-Matching Problem
• The String-Matching Problem with Wild Characters
  – Example: pattern P = {abc#abc#} is found in texts T1 = {abcdcadbaccabc}, T2 = {abcabc}
• String matching with weighted mismatch
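
Illustration - Partitioning by primes
The partition idea behind the 1-mismatch algorithm can be sketched offline for a single alignment. This is only an illustration under stated assumptions: direct string comparison stands in for the Exact_PM processes, the window is already aligned with the pattern, and the primes are assumed to have a product of at least m. The names mismatching_classes and classify are hypothetical.

```python
def mismatching_classes(pattern: str, window: str, q: int) -> set[int]:
    """Residue classes j (mod q) whose subpattern differs from the window.

    pattern[j::q] plays the role of the subpattern P_{q,j}: the characters
    of the pattern at positions congruent to j modulo q.
    """
    return {j for j in range(q) if pattern[j::q] != window[j::q]}


def classify(pattern: str, window: str, primes: list[int]) -> str:
    """Classify one alignment using only per-class match/mismatch answers.

    Assumption: the product of the primes is at least len(pattern), so two
    distinct mismatch positions fall into different classes for some prime.
    """
    if len(pattern) != len(window):
        raise ValueError("window must have the same length as the pattern")
    worst = max(len(mismatching_classes(pattern, window, q)) for q in primes)
    if worst == 0:
        return "match"
    if worst == 1:
        return "exactly 1-mismatch"
    return "more than 1-mismatch"


if __name__ == "__main__":
    primes = [2, 3, 5]  # product 30 >= 8, the pattern length used here
    print(classify("abcabcab", "abcabcab", primes))
    print(classify("abcabcab", "abcxbcab", primes))
    # Mismatches at positions 0 and 6 collide modulo 2 and modulo 3,
    # and are only separated modulo 5.
    print(classify("abcabcab", "xbcabcyb", primes))
```

The last call shows why several primes are needed: with q = 2 or q = 3 alone, the two mismatches land in the same class and look like a single mismatch.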