PPTX

advertisement
Amihood Amir, Avivit Levy,
Ely Porat and B. Riva Shalom
CPM 2014
1
DICTIONARY
MATCHING
WITH ONE GAP
CPM 2014 - MOSCOW
CPM 2014
2
MIND THE GAP !
CPM 2014
3
OUTLINE
 The
CPM 2014
DMG(Dictionary Matching with one Gap )
Problem
 Motivation
 Previous Work
 Bidirectional Suffix Trees Solution
 Lookup Table addition
 Open Problems
4
THE DMG PROBLEM
CPM 2014
A gapped pattern is a pattern P of the form:
P1{1,1} P2{2,2}… Pk-1{k-1,k-1}Pk
Each Pj is over alphabet ,
{j,j} is a sequence of at least j and at most j
don’t cares = @.
Example: aba{3,6}cbb
aba @@@cbb
aba@@@@cbb
aba@@@@@cbb
aba@@@@@@cbb
5
THE DMG PROBLEM
The DMG problem is:
 Preprocess:
A dictionary D of d gapped
CPM 2014
patterns
P1,…, Pd over alphabet .
 Query: A text T of length n over alphabet .
 Output: all locations in T where a dictionary
gapped pattern ends.
We focus on DMG with a single gap.
6
EXAMPLE
Query
text:
CPM 2014
Dictionary: P1 = aba {3,6} cbb
P2 = ab {3,6} bbac
P3 = aa {3,6} ac
1 2 3 4 5 6 7 8 9 10 11
abaabacbbac
P1,1
P2,1
P3,1
P1,2 P2,2P3,2
First =1≤i≤d{ Pi,1 } Second=1≤i≤d{ Pi,2 }
7
MOTIVATION
 Computational
CPM 2014
Biology
 A renew interest due to cyber security.
 Network intrusion detection systems
perform protocol analysis, content
searching and content matching to
detect harmful software.
 Malware may appear in several packets!
8
PREVIOUS WORK
 Gapped
CPM 2014
pattern matching problem was
studied for a few decades,
eg. [Myers, JACM 1992],[Navaro&Raffinot, Algorithmica
2004],[Bille&Thorup, ICALP 2009] , [Bille&Thorup SODA 2010],
[Morgante et al., JCB 2005], [Rahman et al., COCOON 2006],
[Bille et al., TCS 2012]
DMG problem not studied enough !
[Kucherov&Rosinovich,TCS 1997],[Zhang et al., IPL 2010]-no
bounds on the length of the gap.
9
BI-DIRECTIONAL SUFFIX TREES
ALGORITHM
Query:
CPM 2014
Gapped pattern: a b{3,6}b b a c
abaabacbbac
10
BI-DIRECTIONAL SUFFIX TREES
ALGORITHM
Idea:
view as [Amir et al., JAL 2000]
Use suffix
tree TFR of
FirstR
CPM 2014
Gapped patterns:P1= a b a{3,6}a b a c
P2= a b a{3,6}b b a
P3= a b{3,6}b a a
Query:
abaabacbbac
gap
Use suffix
tree TS of
11
Second
BI-DIRECTIONAL SUFFIX TREES
ALGORITHM
Finds Pi,2 starting at location l.
CPM 2014
For each text location l
Insert tl tl +1…tn to TS (the node h)
to find labels on the path to h.
For f= l --1 to l --1
Insert tftf-1…t1 to TFR (the node g)
to find labels on the path to g.
Finds Pi,1 ending at location f.
Output intersection (for end locations).
12
BI-DIRECTIONAL SUFFIX TREES
ALGORITHM - INTERSECTION
Patterns: {(1,4),(2,9),(3,7),…,(6,5),…}
TS
2
Range:
[1,9]
3
6
9 g
5
CPM 2014
TFR
1
Range:
[2,7]
7 h
13
BI-DIRECTIONAL SUFFIX TREES
ALGORITHM (CONTINUED)
Intersection via range queries:
(2,9)
CPM 2014
(8,8)
(3,7)
(6,5)
Range:
[2,7]
(1,4)
14
Range: [1,9]
TIME & SPACE
Preprocessing Time:
Dictionary segments suffix tree and reverse
suffix tree: O(|D|)
Preprocessing grid for range queries:
O(d log d).
[Chan et al., SoCG 2011]

CPM 2014
Preprocessing Space:
Dictionary segments suffix tree and reverse
suffix tree: O(|D|)
Space for grid:
O(d log d).
[Chan et al., SoCG 2011]

15
TIME & SPACE
Query Time:
For each end text location, we try every gap size:
a factor of .
The number of range queries is the number of
vertical paths in a given path:
1
O(log2 min{d, log |D|}).
A range query costs: O(log log d+occ). 3

CPM 2014
[Chan et al., SoCG 2011]
Total:
O(n()log log d log2 min{d, log |D|}+occ).
6
9g
16
LOOKUP TABLE ALGORITHM
CPM 2014
Idea:
Instead of using range queries in a grid
to compute the intersection, we use a
pre-computed lookup table.
Enables intersection in O(occ) time.
Total query time becomes:
O(n()+occ).
17
LOOKUP TABLE ALGORITHM
Inter[ 3, 5 ]= {4}
CPM 2014
Inter[g,h] = all i s.t. Pi,1R appears on the
path from the root of TFR till node g
and Pi,2 appears on the path from the
root of TS till node h.
P1=(1,4), P2=(2,9), P3=(3,7),
P4=(3,2), …,P6=(6,5), P7 =(9,6)
1
3 g
6
9
2
5h
7
18
LOOKUP TABLE ALGORITHM
Inter[ 3, 5 ]= {4}
Inter[ 3, 7 ]= {3,4}
CPM 2014
Inter[g,h] = all i s.t. Pi,1R appears on the
path from the root of TFR till node g
and Pi,2 appears on the path from the
root of TS till node h.
P1=(1,4), P2=(2,9), P3=(3,7),
P4=(3,2), …,P6=(6,5), P7 =(9, 6)
1
3 g
6
9
2
5
7h
19
LOOKUP TABLE ALGORITHM
Inter[ 3, 5 ]= {4}
Inter[ 3, 7 ]= {3,4}
Inter[ 6, 7 ]= {3,4,6}
CPM 2014
Inter[g,h] = all i s.t. Pi,1R appears on the
path from the root of TFR till node g
and Pi,2 appears on the path from the
root of TS till node h.
P1=(1,4), P2=(2,9), P3=(3,7),
P4=(3,2), …,P6=(6,5), P7 =(9,6)
1
3 g
6
9
2
5
7h
20
LOOKUP TABLE ALGORITHM
Inter[
Inter[
Inter[
Inter[
3,
3,
6,
9,
5
7
7
7
]= {4}
]= {3,4}
]= {3,4,6}
]= {3,4,6}
CPM 2014
Inter[g,h] = all i s.t. Pi,1R appears on the
path from the root of TFR till node g
and Pi,2 appears on the path from the
root of TS till node h.
P1=(1,4), P2=(2,9), P3=(3,7),
P4=(3,2), …,P6=(6,5), P7 =(9,6)
1
3
6
9g
2
5
7h
21
LOOKUP TABLE ALG.
1
3
P1=(1,4), P2=(2,9), P3=(3,7),
P4=(3,2), …,P6=(6,5),P7 =(9,6)
2
--
--
….
4
5
1
--
6
7
5
7
CPM 2014
1
1
6
9
2
2
3
4
3
--
:
6
6
:
9
7
Inter[3,5]= {4}
Inter[3,7]= {3,4}
Inter[6,7]= {3,4,7}
22
LOOKUP TABLE ALGORITHM
CPM 2014
Preprocessing:
Time: Table can be computed using DP in
time O(d2 ovr + |D|) where ovr is the
number of subpatterns including other
subpattern as a prefix or suffix.
Space: O(d 2 + |D|).
Query time: O(n()+occ).
23
CPM 2014
Bi-directional
suffix trees &
OUR RESULTS
range queries
 Preprocessing time: O(d log d + |D|).
Space: O(d log d + |D|).
Query time:
O(n()log log d log2(min{d, log |D|} )+occ).
Bi-directional
suffix trees &
Lookup table
 Preprocessing time: O(d2 ovr + |D|).
Space: O(d 2 + |D|).
24
Query time: O(n()+occ).
OPEN PROBLEMS
to k gaps
Reducing the dependency on
the size 
Scalability to different gap
bounds in the dictionary
Online algorithm
CPM 2014
Generalizing
25
CPM 2014
THANK YOU!
26
Download