Exact and Approximate Pattern Matching in
the Streaming Model
Benny Porat and Ely Porat
2009 FOCS
Presented by - Tanushree Mitra
Problem Statement
• Find all instances of pattern P of length m, as a
contiguous substring in a text string T, of
length n, where m < n.
Contributions
• Exact pattern matching - A fully online randomized
algorithm for the classical pattern matching
problem
Time complexity - O(log m) per character that
arrives
Space complexity - O(log m), breaking the O(m)
barrier that held for this problem for a long time.
• Approximate pattern matching – An algorithm for
the pattern matching with k mismatches problem.
Time complexity - $O(k^2\,\mathrm{poly}(\log m))$ per character
Space complexity - $O(k^3\,\mathrm{poly}(\log m))$
Applications
• Monitoring Internet traffic
• Computational Biology
• Large Scale web searching
• Viruses and Malware detection
• Automatic Stock market analysis
• Robotics
Background
Brute Force Algorithm –
– Slide the pattern along the text and
– Compare it to the corresponding portion of the text
Time Complexity – O(mn)
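A minimal sketch of the brute-force matcher described above. The function name `brute_force_matches` and the 0-based indexing are illustrative choices, not taken from the slides.

```python
def brute_force_matches(text: str, pattern: str) -> list[int]:
    """Return every 0-based start index where pattern occurs in text."""
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):             # sliding step: try each position
        if text[start:start + m] == pattern:   # comparison step: O(m) per position
            matches.append(start)
    return matches
```

The loop runs at most n times and each comparison costs up to m character checks, which gives the O(mn) bound.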
Speedup possible in these 2 steps.
• Sliding step speedup by pre-processing the pattern,
– Knuth-Morris-Pratt algorithm
– Boyer-Moore algorithm.
– Ukkonen’s algorithm to construct suffix trees
• Comparison step speedup
– Rabin-Karp algorithm.
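To make the comparison-step speedup concrete, here is a hedged sketch of a Rabin-Karp style matcher with a rolling hash. The `base` and `modulus` values are arbitrary assumptions for illustration, and this version verifies every hash hit with a direct comparison, which the streaming algorithm in these slides does not do.

```python
def rabin_karp_matches(text, pattern, base=256, modulus=1_000_000_007):
    """Return all 0-based start indices of pattern in text using a rolling hash."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, modulus)   # weight of the character leaving the window
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % modulus
        w_hash = (w_hash * base + ord(text[i])) % modulus
    matches = []
    for start in range(n - m + 1):
        # Compare hashes first; confirm with a real comparison to rule out collisions.
        if w_hash == p_hash and text[start:start + m] == pattern:
            matches.append(start)
        if start + m < n:
            # Roll the window: drop text[start], append text[start + m].
            w_hash = ((w_hash - ord(text[start]) * high) * base
                      + ord(text[start + m])) % modulus
    return matches
```

Each window hash is updated in O(1) time, but the window and the pattern still have to be stored, so the space used stays O(m).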
Quick History
The Intuition
• When Rabin-Karp’s algorithm is done with the
i’th character, and advances to the next
position in the text, it does not use any of the
information gathered.
• The KMP algorithm, on the other hand, puts
that information to good use.
The Idea
• Combine the key features of the KMP and the Rabin-Karp algorithms to achieve an online algorithm
that uses less space.
Definitions - Fingerprints
Fingerprint
String S → ф(S)
Sliding Fingerprint
Polynomial Fingerprint
$\phi_{r,p}(S) = s_1 r + s_2 r^2 + \dots + s_l r^l \bmod p$, where $p \in \Theta(n^4)$ and $r \in \mathbb{F}_p$
False Positives
If S1 ≠ S2, then the probability that фr,p(S1) = фr,p(S2) is < $1/n^3$
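A small sketch of how a polynomial fingerprint of this form can be extended by one character and slid past a known prefix. The class name `SlidingFingerprint`, its method names, and the fixed prime `P = 2^61 - 1` are assumptions made for illustration; the slides only require a prime p of size Θ(n^4) and a random r in F_p.

```python
P = (1 << 61) - 1  # assumed modulus: a Mersenne prime standing in for p


class SlidingFingerprint:
    """Polynomial fingerprint s_1*r + s_2*r^2 + ... + s_l*r^l mod P."""

    def __init__(self, r):
        self.r = r          # must be nonzero mod P so that it has an inverse
        self.value = 0
        self.length = 0

    def extend(self, ch):
        """Append one character on the right."""
        self.length += 1
        self.value = (self.value + ord(ch) * pow(self.r, self.length, P)) % P

    def drop_prefix(self, prefix):
        """Slide: remove a prefix whose fingerprint is already known."""
        inverse = pow(self.r, -prefix.length, P)  # modular inverse, Python 3.8+
        self.value = (self.value - prefix.value) * inverse % P
        self.length -= prefix.length


def fingerprint_of(s, r):
    """Build the fingerprint of a whole string character by character."""
    fp = SlidingFingerprint(r)
    for ch in s:
        fp.extend(ch)
    return fp
```

Dropping the prefix "ab" from the fingerprint of "abcab" leaves exactly the fingerprint of "cab", so only fingerprints have to be stored, never the characters themselves.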
Definitions - PeriodPl
• Period - A prefix Sp = s1,s2,…,sl of a string S of length n is
defined to be a period of S iff $s_i = s_{i+l}$ for $1 \le i \le n-l$
• PeriodPl - For a pattern P = p1,p2,…,pm, the prefix of length l is
Pl = p1,p2,…,pl, 0 ≤ l ≤ m. The shortest period of
Pl is periodPl
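One standard way to obtain the shortest period of every prefix is through the KMP border (failure) table, since the shortest period of a string has length equal to the string length minus its longest proper border. The sketch below assumes that approach; the slides do not say how periodPl is computed.

```python
def shortest_periods(pattern):
    """periods[l] = length of the shortest period of pattern[:l], for 1 <= l <= m.

    periods[0] is set to 0 for the empty prefix.
    """
    m = len(pattern)
    border = [0] * (m + 1)   # border[l] = longest proper border of pattern[:l]
    k = 0
    for l in range(2, m + 1):
        while k > 0 and pattern[l - 1] != pattern[k]:
            k = border[k]    # fall back to the next shorter border
        if pattern[l - 1] == pattern[k]:
            k += 1
        border[l] = k
    return [0] + [l - border[l] for l in range(1, m + 1)]
```

For the pattern "abcab" the prefix "abca" has shortest period "abc" (length 3), because shifting it by 3 aligns the final "a" with the first one.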
Put the information to good use
• If Pl matches the text at a given index i, then there
cannot be another match of Pl starting strictly between i and i + |periodPl|
The Idea
• A match at the i’th index indicates that we know the
last ‘m’ characters, so no point saving them?
• Slide by |periodPl| positions to the next position that
could be a match.
• Preprocessing phase – Calculate a sliding
fingerprint on the pattern, фp, and on the
shortest period, фperiod p
• Online phase
– Slide fingerprint ф over the entire text.
– While ф = фp, slide ф by |periodPl| characters
– If we do not reach the end of the text, abort
• False positives? Very low probability of false positives.
• Text and pattern should satisfy stringent restrictions.
Go for subpatterns
• Log m subpatterns
[Figure: the pattern p1, p2, p3, …, pm-3, pm-2, pm-1, pm split into subpatterns, labeled P1: pm; P2: pm-2, pm-1; P4: pm-6, pm-5, pm-4, pm-3; …; Pm/2: p1, p2, p3, …, pm/2]
• Starting point – Find a position in which the smallest
subpattern matches the text. Smallest subpattern is of
length 1 – this can be easily found.
Algorithm
• Guidelines –
– Find a position where Pi is a match, then try to match Pi+1 from the
same starting point as Pi
– If Pi+1 does not match, use the information that Pi is a match.
– Check in jumps of |periodPi| until there is no overlap with the area
where Pi matches.
PROCESS
1. Initialize an empty sliding fingerprint ф.
2. For each character that arrives:
– Extend ф to include the new character
– If |ф| = $2^i$ and ф ≠ фi for some 0 ≤ i ≤ log m:
• If ф has at least |periodPi-1| length of overlap with the last
match, slide ф by |periodPi-1| characters.
• Else, abort.
What if there is a match that starts in the substring of the 1st
process and ends in the substring of the 2nd process?
Exact_PM final Algorithm
Introduce Checkpoint
Checkpoint - Start a new process in the last
checkpoint of each process
Algorithm
• Preprocessing –
– Initialize an empty sliding fingerprint ф.
– For each 0 ≤ i ≤ log m calculate the sliding
fingerprints фi of Pi and фi,period of the period of Pi
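The preprocessing step can be sketched as follows, under two assumptions that are not spelled out on the slide: Pi is taken to be the prefix of the pattern of length 2^i, and the modulus is a fixed prime `P = 2^61 - 1`. The helper names are hypothetical. This offline sketch reads the whole pattern and is meant only to show which values are stored.

```python
P = (1 << 61) - 1  # assumed prime modulus


def fingerprint(s, r):
    """Polynomial fingerprint s_1*r + s_2*r^2 + ... + s_l*r^l mod P."""
    value, power = 0, 1
    for ch in s:
        power = power * r % P
        value = (value + ord(ch) * power) % P
    return value


def shortest_period_length(s):
    """Length of the shortest period of s, via the KMP border table."""
    border = [0] * (len(s) + 1)
    k = 0
    for l in range(2, len(s) + 1):
        while k > 0 and s[l - 1] != s[k]:
            k = border[k]
        if s[l - 1] == s[k]:
            k += 1
        border[l] = k
    return len(s) - border[len(s)]


def preprocess(pattern, r):
    """For each prefix length 2^i <= m, return
    (length, fingerprint of prefix, period length, fingerprint of period)."""
    table = []
    length = 1
    while length <= len(pattern):
        prefix = pattern[:length]
        period = prefix[:shortest_period_length(prefix)]
        table.append((length, fingerprint(prefix, r),
                      len(period), fingerprint(period, r)))
        length *= 2
    return table
```

The table has about log m rows of constant size, which matches the O(log m) space stated for the preprocessing fingerprints. If m is not a power of two, the full pattern would need one extra row that this sketch omits.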
Final Algorithm – Online Phase
• Online Phase –
– Start a new process
– Send every character that arrives to all the
processes
– If some process aborts, start a new process
– If some process A reaches a checkpoint
• Stop the ‘son process’ of A (if it has one)
• Start a new ‘son process’ of A
Complexity
• Space –
– All fingerprints from preprocessing use O(log m)
space.
– Each process saves another fingerprint and there can
be at most log m processes in parallel
– OVERALL usage – O(log m) space
• Time –
– Each process spends O(1) time for each new character
that arrives
– At any time there are at most 3 log m processes running
(1. process A, 2. son-process of A, 3. grandson-process of A; A
has to die when the great-grandson of A is created)
– OVERALL running time – O(log m) per character
Pattern Matching (1-Mismatch)
• Partition the pattern and the text
• We need to align every partition of the pattern, $P_{q_i,j}$,
to $q_i$ text shifts
Intuition
• For each $P_{q_i,j}$, run $q_i$ processes of Exact_PM.
• $\mathrm{Process}_{q_i,j,\sigma}$ - the σ’th process of the subpattern $P_{q_i,j}$,
for $0 \le \sigma < q_i$. This will try to match $P_{q_i,j}$ to
the text by considering the text as if it starts
from the σ’th character. (τ mod $q_i$ = j – σ)
• If for all $q_i$,
– $\mathrm{numOfNotMatch}_{q_i,\sigma} = 0$: ‘match’
– $\mathrm{numOfNotMatch}_{q_i,\sigma} = 1$: ‘exactly 1-mismatch’
– Otherwise: ‘more than 1-mismatch’
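The classification rule above can be illustrated offline for a single alignment. In this hedged sketch the subpattern for prime q and residue j is taken to be the characters at positions congruent to j mod q, and subpatterns are compared directly rather than by fingerprints inside Exact_PM processes. The function name `classify_alignment` is hypothetical, and the primes are assumed to have a product of at least the pattern length, so that two different mismatch positions land in different residue classes for at least one prime.

```python
def classify_alignment(window, pattern, primes):
    """Classify one alignment of pattern against an equally long text window."""
    assert len(window) == len(pattern)
    counts = []
    for q in primes:
        # Number of residue classes j (0 <= j < q) whose subpattern disagrees.
        bad = sum(1 for j in range(q) if pattern[j::q] != window[j::q])
        counts.append(bad)
    if all(c == 0 for c in counts):
        return "match"
    if all(c == 1 for c in counts):
        return "exactly 1-mismatch"
    return "more than 1-mismatch"
```

With primes 2, 3 and 5, two mismatches at positions 0 and 2 fall into the same class mod 2 but into different classes mod 3, so the second prime exposes them.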
Complexity
• FACTS –
– Run $\sum_{i=1}^{l} q_i^2$ processes of Exact_PM
– There exists a constant c such that for any x, there
exist x / log m prime numbers between x and cx
– We have q1, q2, …, ql groups of partitions. Each qi
is a prime number
• Space - $O(\log^4 m / \log\log m)$
• Time - $O(\log^3 m / \log\log m)$
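To get a feel for the process count, the sketch below picks primes and evaluates the sum of their squares. It makes a simplifying assumption that differs from the facts above: it takes the smallest primes whose product reaches m, instead of primes from a range [x, cx]. The function name is hypothetical.

```python
def first_primes_with_product_at_least(m):
    """Smallest primes 2, 3, 5, ... whose product is at least m."""
    primes, product, candidate = [], 1, 2
    while product < m:
        # Trial division by all smaller primes found so far.
        if all(candidate % p for p in primes):
            primes.append(candidate)
            product *= candidate
        candidate += 1
    return primes


def number_of_exact_pm_processes(primes):
    """Sum of q_i^2: q_i subpatterns per prime, each run at q_i text shifts."""
    return sum(q * q for q in primes)
```

For m = 1000 this selects the primes 2, 3, 5, 7, 11 and needs 208 Exact_PM processes under this simplified choice of primes.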
Pattern Matching ( k – Errors)
• Preprocessing Phase – Initialize a process
$\mathrm{Process}_{q_i,j,\sigma}$ of 1-mismatch, for each $q_i \in \{q_1, q_2, \dots, q_l\}$,
$0 \le j < q_i$ and $0 \le \sigma < q_i$
• Online Phase – Send the τ’th character to each
$\mathrm{Process}_{q_i,j,\sigma}$ such that τ mod $q_i$ = j – σ
• d = all mismatches from all processes that
return ‘exactly 1-mismatch’
– If d > k: more than k mismatches
Complexity
• Space –
– Run $\sum_{i=1}^{k \log m} q_i^2 \in O(k^3 \log^4 m / \log\log m)$
processes of 1-mismatch in parallel.
– Each process requires $\log^4 m$ space.
– OVERALL - $O(k^3\,\mathrm{poly}(\log m))$
• Time –
– Number of processes of the 1-mismatch algorithm is
bounded by $\sum_{i=1}^{k \log m} q_i^2 \in O(k^3 \log^4 m / \log\log m)$
– Running time per character is $O(\log^3 m)$
– OVERALL - $O(k^2\,\mathrm{poly}(\log m))$
Concluding Discussion
• The Two-Dimensional String-Matching
Problem
• The String-Matching Problem with Wild
Characters – Example: pattern P = {abc#abc#} is
found in texts T1 = {abcdcadbaccabc}, T2 =
{abcabc}
• String matching with weighted mismatch