Subgraph Containment Search
Dayu Yuan
The Pennsylvania State University
Outline
1. Background & Related Work
   - Preliminaries & Problem Definition
   - Filter + Verification (feature-based index approach)
2. Lindex: A general index structure for subgraph search
3. Direct feature mining for subgraph search
Subgraph Search: Definition
Problem Definition:
   - Given a graph database D = {g1, g2, ..., gn} and a query graph q, the
     subgraph search algorithm returns D(q): all database graphs containing
     q as a subgraph.
Solutions:
   - Brute force: for each query q, scan the whole dataset to find D(q)
     (equivalently, C(q) = D).
   - Filter + Verification: given a query q, first compute a candidate set
     C(q), then verify each graph in C(q) to obtain D(q).
     (Figure: nested sets D ⊇ C(q) ⊇ D(q).)
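A minimal sketch of the two solutions above; `candidate_filter` and `is_subgraph` are hypothetical stand-ins supplied by the caller (a real `is_subgraph` is an NP-complete subgraph-isomorphism test), not part of the original slides:

```python
def subgraph_search(database, query, candidate_filter, is_subgraph):
    """Filter + verification: D -> C(q) -> D(q)."""
    candidates = candidate_filter(database, query)            # C(q), cheap
    return [g for g in candidates if is_subgraph(query, g)]   # D(q), expensive

def brute_force(database, query, is_subgraph):
    """Brute force is the degenerate case where the filter returns everything,
    i.e., C(q) = D."""
    return subgraph_search(database, query, lambda D, q: D, is_subgraph)
```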
Subgraph Search: Solutions
Filter + Verification:
   - Rule: if a graph g contains the query q, then g must contain all of q's
     subgraphs.
   - Inverted index: <Key, Value> pairs
     - Key: a subgraph feature (a small fragment of the database graphs)
     - Value: a posting list (IDs of all database graphs containing the key
       subgraph)
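A toy sketch of such an inverted index, assuming features are hashable keys (e.g., canonical label strings) and `is_subgraph` is supplied by the caller:

```python
from collections import defaultdict

def build_inverted_index(database, features, is_subgraph):
    """Key: a subgraph feature; value: posting list of IDs of the database
    graphs containing that feature."""
    index = defaultdict(list)
    for gid, g in enumerate(database):
        for f in features:
            if is_subgraph(f, g):
                index[f].append(gid)
    return index

def candidate_set(index, query_features):
    """C(q): intersect the posting lists of all features contained in q.
    Rule: if g contains q, then g contains every subgraph of q."""
    postings = [set(index[f]) for f in query_features]
    return set.intersection(*postings) if postings else set()
```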
Subgraph Search: Related Work
Response time:
   - (1) Filtering cost (D -> C(q)):
     - cost of searching for the subgraph features contained in the query
     - cost of loading the posting lists, cost of joining the postings
   - (2) Verification cost (C(q) -> D(q)):
     - subgraph isomorphism tests: NP-complete, and they dominate the
       overall cost
Related work:
   - Reduce the verification cost by mining better subgraph features
   - Disadvantages:
     - (1) different index structure designs for different feature types
     - (2) "batch mode" feature mining (discussed later)
Outline
1. Background
2. Lindex: A general index structure for subgraph search
   - Compact (memory consumption)
   - Effective (filtering power)
   - Efficient (response time)
   - Experiment results
3. Direct feature mining for subgraph search
Lindex: A General Index Structure
Contributions:
   - Orthogonal to related work (feature mining)
   - General: applicable to all subgraph/subtree features
   - Compact, Effective and Efficient:
     - Compact: less memory consumption
     - Effective: prunes more false positives (with the same features)
     - Efficient: runs faster
Lindex: Compact
Space saving (extension labeling):
   - Each edge in a graph is represented as
     <ID(u), ID(v), Label(u), Label(edge(u, v)), Label(v)>
   - The label of the graph sg2 is <1,2,6,1,7>, <1,3,6,2,6>
   - The label of its chosen parent sg1 is <1,2,6,1,7>
   - So sg2 can be stored as just its extension <1,3,6,2,6>
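A toy illustration of the space saving (variable names and tuples are illustrative only):

```python
# Extension labeling: a child pattern stores only the edge tuples that extend
# its chosen parent, not its full canonical label.
sg1 = [(1, 2, 6, 1, 7)]                    # <ID(u), ID(v), L(u), L(e), L(v)>
sg2 = [(1, 2, 6, 1, 7), (1, 3, 6, 2, 6)]   # full label of the child

def extension_label(child, parent):
    assert child[:len(parent)] == parent   # parent label must be a prefix
    return child[len(parent):]             # store only the suffix

def restore_label(extension, parent):
    return parent + extension

delta = extension_label(sg2, sg1)          # [(1, 3, 6, 2, 6)]: one tuple, not two
assert restore_label(delta, sg1) == sg2
```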
Lindex: Empirical Evaluation of Memory

Index \ Feature | DFG       | ∆TCFG     | MimR | Tree+∆    | DFT
Feature Count   | 7599/6238 | 9873/5712 | 5000 | 7500/6172 | 6172/38
Gindex          | 1359      | 1348      | 1339 | –         | –
FGindex         | –         | 1534      | –    | –         | –
SwiftIndex      | –         | –         | –    | 1826      | –
Lindex          | 860       | 677       | 841  | 772/676   | 671

(Unit: KB)
Lindex: Effective in Filtering
Definition (maxSub, minSup):
   $maxSub(g,S) = \{g_i \in S \mid g_i \subseteq g,\ \nexists x \in S \text{ s.t. } g_i \subset x \subseteq g\}$
   $minSup(g,S) = \{g_i \in S \mid g \subseteq g_i,\ \nexists x \in S \text{ s.t. } g \subseteq x \subset g_i\}$
Example (see figure):
   (1) sg2 and sg4 are maxSub of q
   (2) sg5 is minSup of q
Lindex: Effective in Filtering
Strategy One: Minimal Supergraph Filtering
   - Given a query q and Lindex L(D, S), the candidate set on which an
     algorithm must run subgraph-isomorphism tests is
     $C(q) = \bigcap_i D(f_i) \setminus \bigcup_j D(h_j)$, where $f_i \in maxSub(q)$, $h_j \in minSup(q)$
   - Example (see figure):
     (1) sg2 and sg4 are maxSub of q
     (2) sg5 is minSup of q
     (3) $C(q) = D(sg_2) \cap D(sg_4) \setminus D(sg_5) = \{a,b,c\} \cap \{a,b,d\} \setminus \{b\} = \{a\}$
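A minimal sketch of strategy one using the slide's numbers; posting lists are plain Python sets:

```python
def strategy_one(D, max_sub, min_sup):
    """C(q) = ∩_i D(f_i) \\ ∪_j D(h_j).  Graphs in some D(h_j) contain a
    supergraph of q, so they are answers for free and need no verification."""
    cand = set.intersection(*(D[f] for f in max_sub))
    free = set.union(*(D[h] for h in min_sup)) if min_sup else set()
    return cand - free, free

# Numbers from the example: D(sg2)={a,b,c}, D(sg4)={a,b,d}, D(sg5)={b}
D = {"sg2": {"a", "b", "c"}, "sg4": {"a", "b", "d"}, "sg5": {"b"}}
to_verify, answers = strategy_one(D, ["sg2", "sg4"], ["sg5"])
assert to_verify == {"a"} and answers == {"b"}
```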
Lindex: Effective in Filtering
Strategy Two: Postings Partition
   - Direct & indirect value sets:
     - Direct set: $V_d(sg) = \{g \in D(sg)\}$ such that sg can extend to g
       without being isomorphic to any other feature in between
     - Indirect set: $V_i(sg) = D(sg) \setminus V_d(sg)$
   - Example (see figure): $V_d(sg_1) = \{b\}$, $V_d(sg_2) = \{a\}$, $V_d(sg_3) = \{b\}$
     - Why is "b" in the direct value set of sg1, but "a" is not?
Lindex: Effective in Filtering
   - Given a query q and Lindex L(D, S), the candidate set on which an
     algorithm must run subgraph-isomorphism tests is
     $C(q) = \bigcap_i V_d(f_i) \setminus \bigcup_j D(h_j)$, where $f_i \in maxSub(q)$, $h_j \in minSup(q)$
     (proof omitted)
   - Example: graphs that must be verified for query "a":
     Traditional model: $\{a,b,c\} \cap \{a,b,c\} = \{a,b,c\}$
     Strategy (1):      $\{a,b,c\} \cap \{a,b,c\} \setminus \{c\} = \{a,b\}$
     Strategy (1 + 2):  $\{a,c\} \cap \{a\} \setminus \{c\} = \{a\}$
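The same computation under strategy two, with hypothetical direct value sets chosen to reproduce the example's last row:

```python
# Strategy (1 + 2): intersect the smaller direct value sets V_d instead of the
# full posting lists D(f_i).  The sets below mirror the query-"a" example.
Vd = {"f1": {"a", "c"}, "f2": {"a"}}   # hypothetical direct sets of q's maxSubs
D_h = {"h": {"c"}}                     # posting list of q's minSup feature

cand = set.intersection(*Vd.values()) - set.union(*D_h.values())
assert cand == {"a"}                   # {a,c} ∩ {a} − {c} = {a}
```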
Lindex: Efficient in maxSub Feature Search
   - Instead of constructing a canonical label for each subgraph of q and
     comparing it with the existing labels in the index to check whether an
     index feature matches, Lindex traverses the graph lattice: the mappings
     constructed to check that a graph sg1 is contained in q are extended to
     check whether a supergraph of sg1 in the lattice, sg2, is also contained
     in q, by incrementally expanding the mappings from sg1 to q.
   - Example: node 1 of sg1 (label <1,2,6,1,7>) is mapped to node 1 of sg2
     (label <1,2,6,1,7>, <1,3,6,2,6>), so only the extension edge
     <1,3,6,2,6> needs to be matched.
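A simplified sketch of the traversal idea; it shows only the branch pruning, and the incremental reuse of the sg1-to-q mappings is elided behind the hypothetical `contains` callback:

```python
def max_sub_search(lattice, roots, contains):
    """Find maxSub(q, S) by walking the feature lattice top-down.
    `lattice` maps a feature to its child (minimal supergraph) features;
    `contains(sg)` tests sg ⊆ q.  Children are visited only when the parent
    is contained in q, so one failed test prunes an entire branch."""
    max_subs, seen = set(), set()
    stack = [r for r in roots if contains(r)]
    while stack:
        sg = stack.pop()
        if sg in seen:
            continue
        seen.add(sg)
        contained_children = [c for c in lattice.get(sg, ()) if contains(c)]
        if contained_children:
            stack.extend(contained_children)   # sg is not maximal
        else:
            max_subs.add(sg)                   # no contained supergraph in S
    return max_subs
```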
Lindex: Efficient in minSup Feature Search
   - The set of minimal supergraphs of a query q in Lindex is a subset of the
     intersection of the descendant sets of q's subgraph nodes in the partial
     lattice:
     $minSup(q) \subseteq \bigcap_{sg \in maxSub(q)} Descendant(sg)$
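As a one-function sketch (the `descendants` map is assumed precomputed from the lattice):

```python
def min_sup_candidates(descendants, max_subs):
    """minSup(q) ⊆ ∩_{sg ∈ maxSub(q)} Descendant(sg): only features that sit
    above every maximal subgraph feature of q can be minimal supergraphs of
    q, so containment tests are restricted to this intersection."""
    sets = [descendants[sg] for sg in max_subs]
    return set.intersection(*sets) if sets else set()
```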
Outline
1. Background
2. Lindex: A general index structure for subgraph search
   - Compact (memory consumption)
   - Effective (filtering power)
   - Efficient (response time)
   - Experiment results
3. Direct feature mining for subgraph search
Lindex: Experiments
Experiments on the AIDS dataset: 40,000 graphs.
[Results shown in figures over three slides.]
Outline
1. Background
2. Lindex: A general index structure for subgraph search
3. Direct feature mining for subgraph search
   - Motivation
   - Problem Definition & Objective Function
   - Branch & Bound
   - Partition of the Search Space
   - Experiment Results
Feature Mining: A Brief History
Three generations of graph feature mining:
   1. All frequent subgraphs
   2. Batch-mode feature selection
   3. Direct feature mining
Applications: graph containment search, graph classification, ...
Feature Mining: Motivation
   - All previous feature-selection algorithms for the subgraph search
     problem follow a "batch mode":
     - they assume a stable database
     - frequent-subgraph enumeration is the bottleneck
     - parameters (minimum support, etc.) are hard to tune
   - Our contributions:
     - the first direct feature-mining algorithm for the subgraph search
       problem
     - effective for index updating
     - chooses high-quality features
Feature Mining: Problem Definition
   - Response time: $T_{resp}(q) = T_{filter}(q) + T_{verf}(q, C(q))$
   - Assuming the filtering cost is negligible ($T_{filter}(q) \approx 0$):
     $T_{resp}(q) \approx T_{verf}(q, C(q)) \propto |C(q)| = |\bigcap_{X_q[i]=1} D(p_i)|$
     (where $X_q[i] = 1$ iff feature $p_i$ is contained in q)
   - Previous work: given a graph database D, find a set P of subgraph
     (subtree) features minimizing the response time over a training query
     set Q:
     $P^* = \arg\min_{|P| \le N} \sum_{q \in Q} |C(q, P)|$
   - Our work: given a graph database D and an already-built index I with
     feature set $P_0$, search for a new feature p such that $\{p\} \cup P_0$
     minimizes the response time:
     $gain(p, P_0) = \sum_{q \in Q} |C(q, P_0)| - \sum_{q \in Q} |C(q, \{p\} \cup P_0)|$
     $p^* = \arg\max_p gain(p, P_0)$
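A direct, unoptimized sketch of the objective; `cand` is a hypothetical callable returning C(q, P), and feature sets are Python sets:

```python
def gain(p, P0, queries, cand):
    """gain(p, P0) = Σ_q |C(q, P0)| − Σ_q |C(q, {p} ∪ P0)|: how much the
    candidate sets shrink over the training queries when p joins the index."""
    return sum(len(cand(q, P0)) - len(cand(q, P0 | {p})) for q in queries)

def best_feature(candidate_features, P0, queries, cand):
    """p* = argmax_p gain(p, P0), scored exhaustively (no pruning yet)."""
    return max(candidate_features, key=lambda p: gain(p, P0, queries, cand))
```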
Feature Mining: Problem Definition
   - Iterative index updating: given database D and current index I with
     feature set $P_0$,
     (1) Remove a useless feature: find the feature whose removal hurts least,
         $p^- = \arg\min_{p \in P_0} \sum_{q \in Q} (|C(q, P_0 \setminus \{p\})| - |C(q, P_0)|)$,
         then set $P_0 \leftarrow P_0 \setminus \{p^-\}$
     (2) Add a new feature: find the feature that helps most,
         $p^+ = \arg\max_{p} \sum_{q \in Q} (|C(q, P_0)| - |C(q, \{p\} \cup P_0)|)$,
         then set $P_0 \leftarrow P_0 \cup \{p^+\}$
     (3) Go to (1)
   - where $C(q, P) = \bigcap_{X_q[i]=1} D(p_i) = \bigcap_{p_i \in maxSub(q, P)} D(p_i)$
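A sketch of the remove/add loop, with a hypothetical `cand(q, P)` returning C(q, P) and `pool` as the space of candidate features:

```python
def iterative_update(P0, pool, queries, cand, rounds=5):
    """Alternate step (1) remove and step (2) add, as on the slide."""
    cost = lambda P: sum(len(cand(q, P)) for q in queries)  # Σ_q |C(q, P)|
    for _ in range(rounds):
        # (1) remove the feature whose loss grows the candidate sets least
        p_minus = min(P0, key=lambda p: cost(P0 - {p}) - cost(P0))
        P0 = P0 - {p_minus}
        # (2) add the feature that shrinks the candidate sets most
        p_plus = max(pool, key=lambda p: cost(P0) - cost(P0 | {p}))
        P0 = P0 | {p_plus}
    return P0
```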
Feature Mining: More on the Objective Function
   - (1) Pros and cons of using query logs:
     - the objective functions of previous algorithms (e.g., Gindex, FGindex)
       also depend on queries, but only implicitly
   - (2) Features selected are "discriminative":
     - previous work: the discriminative power of sg is measured w.r.t.
       sub(sg) or sup(sg), where sub(sg) denotes all subgraphs of sg and
       sup(sg) denotes all supergraphs of sg
     - our objective function: discriminative power is measured w.r.t. $P_0$
   - (3) Computation issues (next slide)
Feature Mining: More on the Objective Function
   $gain(p, P_0) = \sum_{q \in Q} |C(q, P_0)| - \sum_{q \in Q} |C(q, \{p\} \cup P_0)|$
   - Only some queries benefit from adding p (nested sets, as in the figure):
     $Q \supseteq \{q \in Q \mid q \supseteq p\} \supseteq \{q \in Q \mid p \in maxSub(q, \{p\} \cup P_0)\} = minSupQ(p, Q)$
   - Restricting the sums to those queries:
     $gain(p, P_0) = \sum_{q \in minSupQ(p,Q)} (|C(q, P_0)| - |C(q, \{p\} \cup P_0)|) + \sum_{q \in Q} I(p \simeq q)\,|C(q, P_0)|$
     $gain(p, P_0) = \sum_{q \in minSupQ(p,Q)} (|C(q, P_0)| - |C(q, P_0) \cap D(p)|) + \sum_{q \in Q} I(p \simeq q)\,|C(q, P_0)|$
   - Computing D(p) for each enumerated feature p is expensive.
Feature Mining: Challenges
   - (1) The objective function is expensive to evaluate.
   - (2) The search space for the new index subgraph feature p is
     exponential.
   - (3) The objective function is neither monotonic nor anti-monotonic, so
     the Apriori rule cannot be used.
   - (4) Traditional graph feature-mining algorithms (e.g., LeapSearch) do
     not work: they rely only on frequencies.
Feature Mining: Estimating the Objective Function
   - The objective function of a new subgraph feature p has easy-to-compute
     upper and lower bounds (proof omitted):
     $Upp(p, P_0) = \frac{1}{|Q|} \sum_{q \in minSupQ(p,Q)} |C(q, P_0) - D(q)| + \frac{1}{|Q|} \sum_{q \in Q} I(p \simeq q)\,|C(q, P_0)|$
     $Low(p, P_0) = \frac{1}{|Q|} \sum_{q \in minSupQ(p,Q)} |C(q, P_0) - D(maxSub(p))| + \frac{1}{|Q|} \sum_{q \in Q} I(p \simeq q)\,|C(q, P_0)|$
   - Both are inexpensive to compute; two ways to use them:
     - (1) Lazy calculation: gain(p, P0) need not be computed when
       $Upp(p, P_0) < gain(p^*, P_0)$ or $Low(p, P_0) > gain(p^*, P_0)$
     - (2) Estimation: $gain(p, P_0) \approx \alpha \cdot Upp(p, P_0) + (1 - \alpha) \cdot Low(p, P_0)$
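A sketch of the lazy scheme; `upp`, `low`, and `exact_gain` are hypothetical callables, and when Low(p) already beats the incumbent this sketch keeps Low(p) as a conservative stand-in for the exact gain (an assumption of the sketch, not stated on the slide):

```python
def lazy_best(features, upp, low, exact_gain):
    """Evaluate gain(p, P0) only when the cheap bounds are inconclusive."""
    best_p, best_gain = None, 0.0
    for p in features:
        if upp(p) < best_gain:             # Upp(p) < gain(p*): p cannot win
            continue
        if low(p) > best_gain:             # Low(p) > gain(p*): p surely wins
            best_p, best_gain = p, low(p)  # conservative gain, no exact eval
            continue
        g = exact_gain(p)                  # bounds inconclusive: pay full cost
        if g > best_gain:
            best_p, best_gain = p, g
    return best_p, best_gain
```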
Feature Mining: Branch and Bound
   - Exhaustive search according to the DFS tree:
     - a graph (pattern) can be canonically labeled as a string; the DFS tree
       is a prefix tree over the labels of graphs
     - depth-first search over this tree
   - Example (see figure):
     - visit n1, n2, n3, n4; the current best pattern is n3
     - now visit n5 and pre-observe that n5 and all its offspring have a gain
       bound below n3's gain
     - prune the branch and proceed to n7
   - Note: the objective function is neither monotonic nor anti-monotonic, so
     only such bounds can prune.
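A generic branch-and-bound sketch over the pattern prefix tree; `children`, `gain`, and `b_upp` are hypothetical callbacks standing in for pattern extension, the objective, and the branch upper bound of the next slide:

```python
def branch_and_bound(root, children, gain, b_upp):
    """DFS over the prefix tree of canonical labels.  Because the objective
    is neither monotonic nor anti-monotonic, only the branch upper bound can
    prune: a child's subtree is skipped when its bound cannot beat the best
    gain found so far."""
    best_p, best_gain = root, gain(root)
    stack = [root]
    while stack:
        p = stack.pop()
        for child in children(p):
            if b_upp(child) <= best_gain:  # whole branch cannot win: prune
                continue
            g = gain(child)
            if g > best_gain:
                best_p, best_gain = child, g
            stack.append(child)
    return best_p, best_gain
```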
Feature Mining: Branch and Bound
   - For each branch (e.g., the one rooted at n5), find a branch upper bound
     that is at least the gain of every node on that branch.
   - Theorem (proof omitted): for a feature p there is an upper bound such
     that $gain(p', P_0) \le BUpp(p, P_0)$ for every supergraph p' of p:
     $BUpp(p) = \frac{1}{|Q|}\Big\{\sum_{q \in Q,\, q \supseteq p} |C(q, P_0) - D(q)| + \max_{p' \supset p} |C(p')| \sum_{q \in Q} I(q \simeq p')\Big\}$
   - Although correct, this upper bound is not tight.
Feature Mining: Heuristic-Based Search Space Partition
   - Problem: the search always starts from the same root and proceeds in the
     same order.
   - Observation: the new graph pattern p must be a supergraph of some
     pattern already in P0 (e.g., p ⊃ p2 in Figure 4). A root r is promising
     when:
     1) a large proportion of the queries are supergraphs of r; otherwise few
        queries would use a p ⊃ r for filtering
     2) the average candidate-set size for queries ⊇ r is large, which means
        improvement over those queries matters
   - Scoring function for roots:
     $sPoint(r) = \sum_{q \in minSupQ(r,Q)} |C(q, P_0) - D(q)| + \max_{p' \supset r} |C(p')| \sum_{q \in minSupQ(r,Q)} I(q \simeq p')$
Feature Mining: Heuristic-Based Search Space Partition
   - Procedure (sketched in code below):
     (1) gain(p*) = 0
     (2) sort all features in P0 by sPoint(pi) in decreasing order
     (3) iterate: for i = 1 to |P0|:
         - if the branch upper bound BUpp(ri) < gain(p*), break
         - else find the minimal-supergraph queries minSup(ri, Q)
         - p*(ri) = BranchAndBoundSearch(minSup(ri, Q), p*)
         - if gain(p*(ri)) > gain(p*), update p* = p*(ri)
   - Discussion:
     - (1) candidate features are enumerated as descendants of the root
     - (2) candidate features need only be frequent on D(r), not on all of D
       (a smaller minimum support)
     - (3) roots are visited in decreasing sPoint(r) order, so a
       close-to-optimal feature is found quickly
     - (4) supports top-k feature selection
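A sketch of the whole procedure under the assumptions above; `s_point`, `b_upp`, `min_sup_queries`, and `bb_search` are hypothetical callbacks wrapping the pieces from the previous slides:

```python
def partitioned_feature_search(P0, s_point, b_upp, min_sup_queries, bb_search):
    """Visit existing features as roots in decreasing sPoint order; run a
    branch-and-bound search below each root until no remaining root's branch
    upper bound can beat the incumbent."""
    best_p, best_gain = None, 0.0                   # (1) gain(p*) = 0
    for r in sorted(P0, key=s_point, reverse=True): # (2) sort by sPoint
        if b_upp(r) < best_gain:                    # (3) bound check
            break
        q_r = min_sup_queries(r)                    # queries where r matters
        p_r, g_r = bb_search(r, q_r, best_gain)     # B&B below root r
        if g_r > best_gain:
            best_p, best_gain = p_r, g_r
    return best_p, best_gain
```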
Outline
1. Background
2. Lindex: A general index structure for subgraph search
3. Direct feature mining for subgraph search
   - Motivation
   - Problem Definition & Objective Function
   - Branch & Bound
   - Partition of the Search Space
   - Experiment Results
Feature Mining: Experiments
   - Same AIDS dataset D
   - Index0: Gindex with minimum support 0.05
   - IndexDF: Gindex with minimum support 0.02 (1175 new features are added)
   - Index QG/BB/TK (index updated starting from Index0):
     - BB: branch and bound
     - QG: search space partitioned
     - TK: top-k features returned in one iteration
   - All methods are compared at the same decrease in candidate-set size.
Feature Mining: Experiments
   - Two datasets D1 & D2 (80% overlap):
     - DF(D1): Gindex on dataset D1
     - DF(D2): Gindex on dataset D2
   - Index QG/BB/TK (index updated starting from DF(D1)):
     - BB: branch and bound
     - QG: search space partitioned
     - TK: top-k features returned in one iteration
   - Exp1: D2 = D1 + 20% new graphs
   - Exp2: D2 = 80% of D1 + 20% new graphs
   - Updating iterates until the objective value is stable.
Feature Mining: Experiments
[Figures: DF vs. iterative methods; TCFG vs. iterative methods; MimR vs.
iterative methods. Updating iterates until the gain is stable.]
Conclusion
1. Lindex: an index structure general enough to support any features
   - Compact
   - Effective
   - Efficient
2. Direct feature mining
   - a third-generation algorithm (no frequent-feature enumeration
     bottleneck)
   - effective for updating the index to accommodate changes:
     - runs much faster than rebuilding the index from scratch
     - the selected features filter more false positives than features
       selected from scratch
Thanks
Questions?