Subgraph Containment Search
Dayu Yuan, The Pennsylvania State University

Outline
1. Background & Related Work: preliminaries & problem definition; Filter + Verification [feature-based index approach]
2. Lindex: a general index structure for subgraph search
3. Direct feature mining for subgraph search

Subgraph Search: Definition
Problem definition: Given a graph database D = {g1, g2, ..., gn} and a query graph q, the subgraph search algorithm returns D(q), the set of all database graphs that contain q as a subgraph.
Solutions:
- Brute force: for each query q, scan the whole dataset to find D(q).
- Filter + Verification: given a query q, first compute a candidate set C(q), then verify each graph in C(q) with a subgraph isomorphism test to obtain D(q), where D(q) ⊆ C(q) ⊆ D.

Subgraph Search: Solutions
Filter + Verification rule: if a graph g contains the query q, then g must contain all of q's subgraphs.
Inverted index of <key, value> pairs:
- Key: a subgraph feature (a small fragment of database graphs).
- Value: a posting list, the IDs of all database graphs containing the key subgraph.

Subgraph Search: Related Work
Response time has two components:
(1) Filtering cost (D → C(q)): the cost of finding the subgraph features contained in the query, plus the cost of loading the posting lists and joining them.
(2) Verification cost (C(q) → D(q)): the subgraph isomorphism tests; subgraph isomorphism is NP-complete, and these tests dominate the overall cost.
Related work reduces the verification cost by mining better subgraph features. Disadvantages:
(1) Different feature types require different index structure designs.
(2) Features are mined in "batch mode" (discussed later).

Outline
2. Lindex: a general index structure for subgraph search — compact (memory consumption), effective (filtering power), efficient (response time); experimental results.

Lindex: A General Index Structure
Contributions:
- Orthogonal to related work on feature mining.
- General: applicable to all subgraph and subtree features.
- Compact: consumes less memory.
- Effective: prunes more false positives (with the same features).
- Efficient: runs faster.

Lindex: Compact (Space Saving via Extension Labeling)
Each edge in a graph is represented as the tuple <ID(u), ID(v), Label(u), Label(edge(u, v)), Label(v)>.
Example: the label of graph sg2 is <1,2,6,1,7>, <1,3,6,2,6>, and the label of its chosen parent sg1 is <1,2,6,1,7>. Since sg2 extends sg1 by one edge, sg2 can be stored as just the extension <1,3,6,2,6>.

Lindex: Empirical Evaluation of Memory
[Table: memory consumption in KB of Gindex, FGindex, Tree+Δ, SwiftIndex, and Lindex built over DFG, ΔTCFG, and MimR feature sets (feature counts 7599/6238, 9873/5712, and 5000 respectively); the Lindex variants are the most compact, e.g., roughly 677 KB versus 1359 KB for Gindex on DFG features.]

Lindex: Effective in Filtering
Definition (maxSub, minSup):
maxSub(g, S) = { gi ∈ S | gi ⊆ g, and there is no x ∈ S s.t. gi ⊂ x ⊆ g }
minSup(g, S) = { gi ∈ S | g ⊆ gi, and there is no x ∈ S s.t. g ⊆ x ⊂ gi }
Example: (1) sg2 and sg4 are maxSub features of q; (2) sg5 is a minSup feature of q.

Lindex: Effective in Filtering — Strategy One: Minimal Supergraph Filtering
Given a query q and a Lindex L(D, S), the candidate set on which the algorithm must run subgraph isomorphism tests is
C(q) = ⋂_i D(fi) ∖ ⋃_j D(hj), where fi ∈ maxSub(q) and hj ∈ minSup(q).
Any graph containing a supergraph hj of q necessarily contains q, so the graphs in ⋃_j D(hj) are guaranteed answers and need no verification.
Example: (1) sg2 and sg4 are maxSub of q; (2) sg5 is minSup of q; (3) C(q) = D(sg2) ∩ D(sg4) ∖ D(sg5) = {a, b, c} ∩ {a, b, d} ∖ {b} = {a}.
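To make the set algebra concrete, here is a minimal Python sketch of Strategy One, assuming posting lists are in-memory sets of graph IDs. The function and variable names (candidate_set, postings) are illustrative, not Lindex's actual API, and the maxSub/minSup features of q are taken as given.

```python
def candidate_set(postings, max_sub_features, min_sup_features):
    """C(q) = (intersection of D(f_i)) minus (union of D(h_j)).

    Returns the graphs to verify and the guaranteed answers."""
    if not max_sub_features:
        return set(), set()
    # Intersect the posting lists of q's maximal subgraph features.
    candidates = set(postings[max_sub_features[0]])
    for f in max_sub_features[1:]:
        candidates &= postings[f]
    # Graphs containing a supergraph of q are guaranteed answers:
    # exclude them from the set that still needs verification.
    definite = set()
    for h in min_sup_features:
        definite |= postings[h]
    return candidates - definite, definite

# Toy run reproducing the slide's example:
postings = {"sg2": {"a", "b", "c"}, "sg4": {"a", "b", "d"}, "sg5": {"b"}}
to_verify, answers = candidate_set(postings, ["sg2", "sg4"], ["sg5"])
print(to_verify, answers)  # {'a'} {'b'}
```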
Lindex: Effective in Filtering — Strategy Two: Postings Partition
Direct and indirect value sets:
- Direct set: Vd(sg) = { g ∈ D(sg) | sg extends to g without passing through any other index feature }.
- Indirect set: Vi(sg) = D(sg) ∖ Vd(sg).
Example: Vd(sg2) = {a}, Vd(sg3) = {b}, Vd(sg1) = {b}.
Why is "b" in the direct value set of sg1 while "a" is not? Because sg1 extends to "a" only through another feature (sg1 ⊂ sg2 ⊆ a), whereas no index feature lies strictly between sg1 and "b".

Given a query q and a Lindex L(D, S), the candidate set on which the algorithm must run subgraph isomorphism tests tightens to
C(q) = ⋂_i Vd(fi) ∖ ⋃_j D(hj), where fi ∈ maxSub(q) and hj ∈ minSup(q). (Proof omitted.)
Example (a query q over database graphs a, b, c) — graphs that need to be verified:
- Traditional model: {a, b, c} ∩ {a, b, c} = {a, b, c}
- Strategy (1): {a, b, c} ∩ {a, b, c} ∖ {c} = {a, b}
- Strategy (1 + 2): {a, c} ∩ {a} ∖ {c} = {a}
A sketch of how the direct/indirect partition can be computed follows.
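Here is a sketch of one way the postings partition could be computed, assuming the index lattice records each feature's immediate superfeatures ("children"). Checking only immediate children suffices, since any feature lying strictly between sg and a graph g must itself contain one of sg's immediate superfeatures. All names are illustrative, not the paper's implementation.

```python
def partition_postings(postings, children):
    """Split each D(sg) into Vd(sg) (direct) and Vi(sg) (indirect)."""
    vd, vi = {}, {}
    for sg, plist in postings.items():
        # A graph is indirect for sg if some immediate superfeature
        # of sg is also contained in it.
        covered = set().union(*[postings[c] for c in children.get(sg, [])])
        vd[sg] = plist - covered
        vi[sg] = plist & covered
    return vd, vi

# Toy lattice echoing the slide: sg2 is the only superfeature of sg1.
postings = {"sg1": {"a", "b"}, "sg2": {"a"}, "sg3": {"b"}}
children = {"sg1": ["sg2"]}
vd, vi = partition_postings(postings, children)
print(vd["sg1"], vi["sg1"])  # {'b'} {'a'}
```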
Lindex: Efficient in maxSub Feature Search
Recall the extension labels: sg1 is labeled <1,2,6,1,7>, and its child sg2 is labeled <1,2,6,1,7>, <1,3,6,2,6>, with node 1 of sg1 mapped to node 1 of sg2.
Instead of constructing canonical labels for each subgraph of q and comparing them with the labels stored in the index to check whether an index feature matches, Lindex traverses the graph lattice: the mappings constructed to check that a graph sg1 is contained in q are extended incrementally to check whether sg2, a supergraph of sg1 in the lattice, is also contained in q.

Lindex: Efficient in minSup Feature Search
The set of minimal supergraphs of a query q in the Lindex is a subset of the intersection of the descendant sets of q's maxSub nodes in the partial lattice:
minSup(q) ⊆ ⋂_{sg ∈ maxSub(q)} Descendants(sg)

Lindex: Experiments
Experiments on the AIDS dataset (40,000 graphs).
[Three slides of result figures not reproduced in this transcript.]

Outline
3. Direct feature mining for subgraph search — motivation; problem definition & objective function; branch & bound; partition of the search space; experimental results.

Feature Mining: A Brief History
Three generations of graph feature mining, serving applications such as graph containment search and graph classification:
1. Mine all frequent subgraphs.
2. Batch mode feature selection.
3. Direct feature mining.

Feature Mining: Motivation
All previous feature selection algorithms for the subgraph search problem follow the "batch mode":
- They assume a stable database.
- Frequent subgraph enumeration is the bottleneck.
- The parameter settings (minimum support, etc.) are hard to tune.
Our contributions:
- The first direct feature mining algorithm for the subgraph search problem.
- Effective in index updating.
- Chooses high-quality features.

Feature Mining: Problem Definition
T_resp(q) = T_filter(q) + T_verf(q, C(q)); since T_filter(q) ≈ 0, T_resp(q) ∝ T_verf(q, C(q)) ∝ |C(q)|, where C(q) = ⋂_{i: Xq[i]=1} D(pi) and Xq[i] = 1 iff pi ⊆ q.
Previous work: given a graph database D, find a set of subgraph (subtree) features P minimizing the response time over a training query log Q:
P = argmin_{P, |P| ≤ N} Σ_{q∈Q} |C(q, P)|
Our work: given D and an already built index with feature set P0, search for a new feature p such that the feature set P0 ∪ {p} minimizes the response time:
gain(p, P0) = Σ_{q∈Q} |C(q, P0)| − Σ_{q∈Q} |C(q, P0 ∪ {p})|
p = argmax_p gain(p, P0)

Feature Mining: Iterative Index Updating
Given the database D and the current index with feature set P0:
(1) Remove useless features: p = argmin_{p∈P0} Σ_{q∈Q} ( |C(q, P0 ∖ {p})| − |C(q, P0)| ); set P0 ← P0 ∖ {p}.
(2) Add new features: p = argmax_p Σ_{q∈Q} ( |C(q, P0)| − |C(q, P0 ∪ {p})| ); set P0 ← P0 ∪ {p}.
(3) Go to (1).
Here C(q, P) = ⋂_{i: Xq[i]=1} D(pi) = ⋂_{pi ∈ maxSub(q, P)} D(pi).

Feature Mining: More on the Objective Function
(1) Pros and cons of using query logs: the objective functions of previous algorithms (e.g., Gindex, FGindex) also depend on queries, albeit implicitly.
(2) The selected features are "discriminative". In previous work, the discriminative power of sg is measured w.r.t. sub(sg) or sup(sg), where sub(sg) denotes all subgraphs of sg and sup(sg) all supergraphs of sg; in our objective function, discriminative power is measured w.r.t. P0.
(3) Computational issues (below).

Adding p changes C(q) only for the queries that contain p:
minSupQueries(p, Q) = {q ∈ Q | q ⊇ p} = {q ∈ Q | p ∈ maxSub(q, P0 ∪ {p})} ⊆ Q
gain(p, P0) = Σ_{q ∈ minSupQueries(p,Q)} ( |C(q, P0)| − |C(q, P0) ∩ D(p)| ) = Σ_{q∈Q} I(p ⊆ q) · ( |C(q, P0)| − |C(q, P0) ∩ D(p)| )
Computing D(p) for each enumerated candidate feature p is expensive.

Feature Mining: Challenges
(1) The objective function is expensive to evaluate.
(2) The search space for the new index subgraph feature p is exponential.
(3) The objective function is neither monotonic nor anti-monotonic, so the Apriori rule cannot be used.
(4) Traditional graph feature mining algorithms (e.g., LeapSearch) do not work: they rely only on frequencies.

Feature Mining: Estimating the Objective Function
The objective value of a new subgraph feature p has easy-to-compute upper and lower bounds (proof omitted; the bounds average over |Q|, a constant factor that does not affect the argmax):
Upp(p, P0) = (1/|Q|) Σ_{q∈Q} I(p ⊆ q) · |C(q, P0) ∖ D(q)|
Low(p, P0) = (1/|Q|) Σ_{q∈Q} I(p ⊆ q) · |C(q, P0) ∖ D(maxSub(p))|, with D(maxSub(p)) = ⋂_{f ∈ maxSub(p, P0)} D(f)
Both are inexpensive: neither requires computing D(p). Two ways to use them:
(1) Lazy evaluation: gain(p, P0) need not be computed when Upp(p, P0) < gain(p*, P0) (p cannot beat the current best p*) or when Low(p, P0) > gain(p*, P0) (p certainly beats p*).
(2) Interpolation: gain(p, P0) ≈ α · Upp(p, P0) + (1 − α) · Low(p, P0).

Feature Mining: Branch and Bound
Exhaustive search follows a DFS tree: a graph pattern can be canonically labeled as a string, and the DFS tree is a prefix tree of these labels; the search is depth-first.
Example: visit n1, n2, n3, n4 and find that the current best pattern is n3. Now visit n5 and pre-observe that n5 and all of its offspring have gain below that of n3: prune the branch and continue with n7.
Since the objective function is neither monotonic nor anti-monotonic, this pruning requires an explicit branch upper bound (a sketch of the search loop follows; the bound itself is defined next).
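To make the pruned traversal concrete, here is a minimal branch-and-bound sketch over the DFS-code prefix tree. The callables gain, bound, and children are abstract stand-ins for gain(p, P0), the branch upper bound BUpp of the next slide, and one-edge pattern extensions; all names and the toy numbers are illustrative.

```python
def branch_and_bound(root, gain, bound, children):
    """Depth-first search with branch pruning; returns the best pattern."""
    best, best_gain = None, float("-inf")
    stack = [root]
    while stack:
        node = stack.pop()
        # Prune: no pattern in this branch can beat the current best.
        if bound(node) <= best_gain:
            continue
        if gain(node) > best_gain:
            best, best_gain = node, gain(node)
        # Children are one-edge extensions in the prefix tree;
        # reversed() keeps a left-to-right DFS order on the stack.
        stack.extend(reversed(children(node)))
    return best, best_gain

# Toy tree reproducing the slide's walk-through: after n3 is found,
# the branch under n5 is pruned, and n7 is still visited.
tree = {"n1": ["n2", "n5", "n7"], "n2": ["n3", "n4"], "n5": ["n6"]}
gains = {"n1": 0, "n2": 1, "n3": 5, "n4": 2, "n5": 1, "n6": 3, "n7": 4}
bounds = {"n1": 9, "n2": 6, "n3": 5, "n4": 6, "n5": 4, "n6": 3, "n7": 6}
print(branch_and_bound("n1", gains.get, bounds.get,
                       lambda n: tree.get(n, [])))  # ('n3', 5)
```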
Feature Mining: Branch and Bound (continued)
For each branch, e.g., the branch rooted at n5, we need a branch upper bound that is at least the gain of every node on that branch.
Theorem: for a feature p, there exists an upper bound BUpp(p, P0) such that gain(p', P0) ≤ BUpp(p, P0) for every supergraph p' of p:
BUpp(p, P0) = (1/|Q|) { Σ_{q∈Q, q⊇p} |C(q, P0) ∖ D(q)| + max_{p'⊇p} |C(p')| · Σ_{q∈Q} I(q ⊇ p') }
(Proof omitted.) Although correct, this upper bound is not tight.

Feature Mining: Heuristic-Based Search Space Partition
Problem: the search always starts from the same root and explores patterns in the same order.
Observation: the new graph pattern p must be a supergraph of some pattern in P0, i.e., p ⊃ p2 in Figure 4. A root r is promising when:
1) A great proportion of the queries are supergraphs of r; otherwise few queries would use a feature p ⊃ r for filtering.
2) The average candidate-set size for the queries ⊇ r is large, which means improvement on those queries is important.
Both criteria are captured by the score
sPoint(r) = Σ_{q ∈ minSup(r,Q)} |C(q, P0) ∖ D(q)| + max_{p'⊇r} |C(p')| · Σ_{q ∈ minSup(r,Q)} I(q ⊇ p')

Procedure (a Python sketch appears at the end of the deck):
(1) gain(p*) = 0
(2) Sort all r ∈ P0 by sPoint(r) in decreasing order.
(3) For i = 1 to |P0|:
      if the branch upper bound BUpp(ri) < gain(p*) then break
      else
        find the minimal supergraph queries minSup(ri, Q)
        p*(ri) = BranchAndBoundSearch(minSup(ri, Q), p*)
        if gain(p*(ri)) > gain(p*) then update p* = p*(ri)
Discussion:
(1) Candidate features are enumerated as descendants of the root r.
(2) Candidate features only have to be frequent on D(r), not on all of D, which permits a smaller minimum support.
(3) Roots are visited in decreasing sPoint(r) order, so a near-optimal feature is found quickly.
(4) The procedure extends to top-k feature selection.

Outline
3. Direct feature mining for subgraph search — experimental results.

Feature Mining: Experiments
Setup on the same AIDS dataset D:
- Index0: Gindex with minimum support 0.05.
- IndexDF: Gindex with minimum support 0.02 (1,175 new features added).
- Index QG/BB/TK: indexes updated starting from Index0, where BB = branch and bound, QG = search space partitioned, TK = top-k features returned in one iteration.
All methods achieve the same decrease in candidate-set size.
[Result figures not reproduced in this transcript.]

Feature Mining: Experiments 2
Datasets D1 and D2 (80% identical):
- DF(D1): Gindex built on dataset D1; DF(D2): Gindex built on dataset D2.
- Index QG/BB/TK: indexes updated starting from DF(D1).
- Exp1: D2 = D1 + 20% new graphs. Exp2: D2 = 80% of D1 + 20% new graphs.
The update is iterated until the objective value (gain) is stable.
[Figures comparing DF, TCFG, and MimR against the iterative methods not reproduced in this transcript.]

Conclusion
1. Lindex: an index structure general enough to support any features — compact, effective, efficient.
2. Direct feature mining: a third-generation algorithm (no frequent-feature-enumeration bottleneck), effective at updating the index to accommodate changes; it runs much faster than building the index from scratch, and the selected features filter more false positives than features selected from scratch.

Thanks! Questions?
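To close, a minimal Python sketch of the driver loop for the heuristic search-space partition procedure above. Here s_point, branch_upper_bound, min_sup_queries, and bb_search are stand-ins for sPoint(r), BUpp(r, P0), minSup(r, Q), and a branch-and-bound search restricted to descendants of r; all names are illustrative, not the paper's implementation.

```python
def find_best_feature(index_features, queries, s_point, branch_upper_bound,
                      min_sup_queries, bb_search):
    """Return the best new feature p* found across the root partitions."""
    best, best_gain = None, 0.0  # step (1): gain(p*) = 0
    # Step (2): visit the most promising roots first.
    roots = sorted(index_features, key=s_point, reverse=True)
    # Step (3): scan roots; stop once no remaining branch can win.
    for r in roots:
        if branch_upper_bound(r) < best_gain:
            break  # the slide's procedure stops here
        # Search only the partition rooted at r, over the queries
        # that are minimal supergraphs of r.
        cand, cand_gain = bb_search(r, min_sup_queries(r, queries), best_gain)
        if cand_gain > best_gain:
            best, best_gain = cand, cand_gain
    return best, best_gain
```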