Efficient Subgraph Search over Large Uncertain Graphs
Ye Yuan (1), Guoren Wang (1), Haixun Wang (2), Lei Chen (3)
1. Northeastern University, China   2. Microsoft Research Asia   3. HKUST

Outline
Ⅰ Background
Ⅱ Problem Definition
Ⅲ Query Processing Framework
Ⅳ Solutions
Ⅴ Conclusions

Background
Graphs are a complex data structure used in many real applications:
- Bioinformatics: gene regulatory networks, yeast PPI networks
- Chemistry: compound databases (e.g., compounds containing a benzene ring)
- Social networks: EntityCube, Web 2.0 applications
In these applications, graph data may be noisy and incomplete, which leads to uncertain graphs:
- The STRING database (http://string-db.org) contains PPIs whose uncertain edges are derived from biological experiments.
- In visual pattern recognition, uncertain graphs are used to model visual objects.
- In social networks, uncertain links represent possible relationships or the strength of influence between people.
It is therefore important to study query processing on large uncertain graphs.

Problem Definition
Probabilistic subgraph search
- Uncertain graph: every vertex carries an existence probability, and every edge carries an existence probability conditioned on its two endpoints.
  (Figure: example uncertain graph g with vertices 1 (label A, prob. 0.8), 2 (label A, prob. 0.6), 3 (label B, prob. 0.9), and three uncertain edges labeled a and b with probabilities 0.9, 0.7, and 0.5.)
- Possible worlds: all combinations of the uncertain vertices and edges.
  (Figure: the 18 possible worlds of g and their probabilities, which sum to 1.)
- Given an uncertain graph database G = {g1, g2, ..., gn}, a query graph q, and a probability threshold ε, find all gi ∈ G whose subgraph isomorphic probability is not smaller than ε.
- Subgraph isomorphic probability (SIP): the SIP between q and gi is the sum of the probabilities of gi's possible worlds to which q is subgraph isomorphic.
  Example: q is subgraph isomorphic to five of g's possible worlds, so SIP(q, g) = 0.054 + 0.00648 + 0.13608 + 0.05832 + 0.01512 = 0.27.
- Computing the SIP is #P-complete.

Query Processing Framework
Naïve method: sequentially scan the database D and, for each gi, decide whether the SIP between q and gi is not smaller than the threshold ε.
- Deciding whether q is subgraph isomorphic to a deterministic graph is NP-complete.
- Computing the SIP is #P-complete.
- The naïve method is therefore very costly and infeasible.
Filter-and-verification framework: filtering reduces {g1, g2, ..., gn} to a candidate set {g'1, g'2, ..., g'm}; verification computes the exact SIP only for the candidates and returns the answers {g''1, g''2, ..., g''k}.
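To make the naïve method concrete, the sketch below enumerates the possible worlds of one uncertain graph and sums the probabilities of the worlds that contain q. It is a minimal illustration, not the paper's implementation: the dictionary-based graph layout is an assumption, and `is_subgraph_isomorphic` stands for any standard deterministic matcher (e.g., a VF2-style backtracking routine). Its exponential cost is exactly why the filter-and-verification framework is needed.

```python
from itertools import product

def subgraph_isomorphic_prob(uncertain_g, query, is_subgraph_isomorphic):
    """Brute-force SIP: sum the probabilities of all possible worlds of
    `uncertain_g` to which `query` is subgraph isomorphic.

    uncertain_g: {"vertices": {v: (label, prob)},
                  "edges": {(u, v): (label, prob)}}  # edge prob. given both endpoints
    query: a deterministic graph in the same layout (probabilities ignored)
    is_subgraph_isomorphic(world, query) -> bool  # assumed deterministic matcher
    """
    vertices = list(uncertain_g["vertices"].items())
    sip = 0.0
    # Each vertex independently exists or not.
    for v_choice in product([True, False], repeat=len(vertices)):
        p_vertices = 1.0
        present = set()
        for (v, (_label, pv)), keep in zip(vertices, v_choice):
            p_vertices *= pv if keep else (1.0 - pv)
            if keep:
                present.add(v)
        # Only edges whose endpoints both survived can exist.
        edges = [(e, pe) for e, (_le, pe) in uncertain_g["edges"].items()
                 if e[0] in present and e[1] in present]
        for e_choice in product([True, False], repeat=len(edges)):
            p_world = p_vertices
            world_edges = {}
            for (e, pe), keep in zip(edges, e_choice):
                p_world *= pe if keep else (1.0 - pe)
                if keep:
                    world_edges[e] = uncertain_g["edges"][e][0]
            world = {"vertices": {v: uncertain_g["vertices"][v][0] for v in present},
                     "edges": world_edges}
            if is_subgraph_isomorphic(world, query):
                sip += p_world  # this possible world contains q
    return sip
```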
Solutions
Filtering: structural pruning
- Let gc be the certain graph obtained from g by removing all uncertainty (keeping every vertex and edge, ignoring probabilities).
- Principle: if gc still does not contain q, then the original uncertain graph cannot contain q.
- Theorem: if q is not subgraph isomorphic to gc, then Pr(q ⊆ g) = 0.
  (Figure: the example uncertain graph g and query q.)

Probabilistic pruning: let f be a feature of gc, i.e., f ⊆ gc.
- Rule 1: if f ⊆ q and UpperB(Pr(f ⊆ g)) < ε, then g is pruned.
  Since f ⊆ q, we have Pr(q ⊆ g) ≤ Pr(f ⊆ g) < ε.
  (Figure: an uncertain graph, a feature f, and a query q with threshold 0.6, pruned by Rule 1.)
- Rule 2: if q ⊆ f and LowerB(Pr(f ⊆ g)) ≥ ε, then g is an answer.
  Since q ⊆ f, we have Pr(q ⊆ g) ≥ Pr(f ⊆ g) ≥ ε.
  (Figure: an uncertain graph, a feature f, and a query q with threshold 0.2, accepted by Rule 2.)
Two main issues for probabilistic pruning:
1. How to derive lower and upper bounds of the SIP?
2. How to select features with great pruning power?

Technique 1: calculation of lower and upper bounds
- Lemma: let Bf1, ..., Bf|Ef| be all embeddings of f in gc; then Pr(f ⊆ g) = Pr(Bf1 ∪ ... ∪ Bf|Ef|).
- Upper bound:
  Pr(f ⊆ g) = 1 − Pr(¬Bf1 ∩ ... ∩ ¬Bf|Ef|) ≤ 1 − ∏_{i=1..|Ef|} (1 − Pr(Bfi)) = UpperB(f)
  (overlapping embeddings make the events ¬Bfi positively correlated, so the probability of their intersection is at least the product of their individual probabilities).
- Lower bound: let IN be a set of embeddings of f that share no vertices or edges (and are therefore mutually independent); then
  Pr(f ⊆ g) ≥ Pr(∪_{j∈IN} Bfj) = 1 − ∏_{j∈IN} (1 − Pr(Bfj)) = LowerB(f).
  (Figure: the embeddings EM1, EM2, EM3 of feature f2 in graph 002, and the graph bG built over these embeddings.)
  Finding the tightest LowerB(f) converts into computing the maximum weight clique of the embedding graph bG, which is NP-hard.
- Exact value vs. upper and lower bounds:
  (Figure: the exact SIP and its upper/lower bounds, and their computation times, for database sizes from 50 to 250.)

Technique 2: optimal feature selection
- If we indexed all features, we would obtain the index with the most pruning power, but querying such an index would be very costly. We therefore select a small number of features with the greatest pruning power.
- Cost model: maximize gain = sequential-scan cost − query-index cost.
- This is a maximum set coverage problem, which is NP-complete; we use the greedy algorithm to approximate it (a sketch follows below).
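A minimal sketch of that greedy selection, under illustrative assumptions: each candidate feature is represented simply by the set of items (e.g., graph/query pairs) it can prune, and `budget` caps the number of indexed features. This is plain maximum coverage rather than the paper's exact cost model, which also weighs the cost of querying the index.

```python
def greedy_feature_selection(features, budget):
    """Greedy maximum-coverage selection of index features.

    features: {feature_id: set of items it prunes}
              (an "item" stands for, e.g., a (database graph, query) pair that
               is saved from verification if this feature is indexed)
    budget:   maximum number of features to index
    Returns the chosen feature ids; the classic greedy argument gives a
    (1 - 1/e)-approximation of the optimal coverage.
    """
    chosen, covered = [], set()
    remaining = dict(features)
    while remaining and len(chosen) < budget:
        # Pick the feature that prunes the most not-yet-covered items.
        best = max(remaining, key=lambda f: len(remaining[f] - covered))
        gain = remaining[best] - covered
        if not gain:  # no remaining feature adds pruning power
            break
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen

# Example: greedy_feature_selection({"f1": {("001", "q1"), ("002", "q1")},
#                                    "f2": {("002", "q2")},
#                                    "f3": {("002", "q3")}}, budget=2)
```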
Example feature matrix (each entry is the (LowerB, UpperB) pair of Pr(f ⊆ g)):

        001             002
f1      (0.19, 0.19)    (0.27, 0.49)
f2      (0.27, 0.27)    (0.40, 0.49)
f3      0               (0.01, 0.11)

(Figure: example queries q1, q2, q3 with their probability thresholds, matched against the feature matrix.)
The greedy algorithm approximates the optimal index within a factor of 1 − 1/e.

Probabilistic index
- Construct a string for each selected feature.
- Construct a prefix tree over all feature strings.
- Attach an inverted ID-list to each leaf node, recording for each graph the lower and upper bounds of the feature's SIP, e.g., {<g1, 0.2, 0.6>, <g2, 0.4, 0.7>, ...}.

Verification: iterative bound pruning
- Lemma: Pr(q ⊆ g) = Pr(Bq1 ∪ ... ∪ Bq|Eq|), where Bq1, ..., Bq|Eq| are all embeddings of q in gc.
- Unfolding by the inclusion-exclusion principle:
  Pr(q ⊆ g) = Σ_{i=1..|Eq|} (−1)^{i−1} S_i, where S_i = Σ_{J ⊆ {1,...,|Eq|}, |J| = i} Pr(∩_{j∈J} Bqj).
- The partial sums alternately bound the probability:
  Σ_{w=1..i} (−1)^{w−1} S_w ≥ Pr(q ⊆ g) if i is odd, and ≤ Pr(q ⊆ g) if i is even.
- This yields iterative bound pruning: stop the expansion as soon as a partial sum already decides the comparison with ε.

Performance Evaluation
- Real dataset: uncertain PPI networks, 1,500 uncertain graphs with 332 vertices and 584 edges on average; average probability 0.367.
- Synthetic dataset: the AIDS dataset with probabilities generated from a Gaussian distribution; 10k uncertain graphs with 24.3 vertices and 26.5 edges on average.
- (Figure: results on the real dataset: response time and candidate size of SCAN, Non-PF, PFiltering, and PIndex for query sizes q50 to q250.)
- (Figure: results on the real dataset: feature number and response time of Non-PF and PFiltering as the number of distinct labels varies from 250 to 50.)
- (Figure: response time and index construction time of SFiltering, PFiltering, and E-Bound for database sizes from 2k to 10k.)
- (Figure: results on the synthetic dataset: feature number and index size (MB) of SFiltering and PFiltering as the mean and variance of the probability distribution vary from 0.3 to 0.7.)

Conclusions
- We propose the first efficient solution for threshold-based probabilistic subgraph search over uncertain graph databases.
- We employ a filter-and-verification framework and develop probability bounds for filtering.
- We design a cost model to select a minimum number of features with the largest pruning power.
- We demonstrate the effectiveness of our solution through experiments on real and synthetic datasets.

Thanks!
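Backup: a minimal sketch of the iterative bound pruning used in verification. It assumes a caller-supplied `joint_prob(subset)` that returns Pr(∩_{j∈subset} Bqj); under the independence model this is the product of the probabilities of the distinct vertices and edges covered by those embeddings. All names are illustrative rather than the paper's code.

```python
from itertools import combinations

def verify_with_iterative_bounds(embeddings, joint_prob, epsilon):
    """Iterative bound pruning for verification of one candidate graph.

    embeddings: the embeddings Bq_1..Bq_m of query q in the certain graph gc
    joint_prob(subset) -> Pr(all embeddings in `subset` exist simultaneously)
    epsilon: the probability threshold

    Partial inclusion-exclusion sums alternately upper-bound (odd number of
    terms) and lower-bound (even number of terms) Pr(q in g), so the expansion
    can stop as soon as one bound decides the comparison with epsilon.
    """
    m = len(embeddings)
    partial = 0.0
    for i in range(1, m + 1):
        s_i = sum(joint_prob(subset) for subset in combinations(embeddings, i))
        partial += s_i if i % 2 == 1 else -s_i
        if i % 2 == 1 and partial < epsilon:   # upper bound already below threshold
            return False
        if i % 2 == 0 and partial >= epsilon:  # lower bound already reaches threshold
            return True
    return partial >= epsilon                  # full inclusion-exclusion: exact SIP
```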