Efficient Subgraph Search over
Large Uncertain Graphs
Ye Yuan¹, Guoren Wang¹, Haixun Wang², Lei Chen³
1. Northeastern University, China
2. Microsoft Research Asia
3. HKUST
Outline
Ⅰ. Background
Ⅱ. Problem Definition
Ⅲ. Query Processing Framework
Ⅳ. Solutions
Ⅴ. Conclusions
Background
• Graph is a complicated data structure, and has been used in many real applications.
• Bioinformatics
  – Gene regulatory networks
  – Yeast PPI networks
Background
• Compounds
  – Benzene ring
  – Compounds database
Background
• Social Networks
  – EntityCube
  – Web 2.0
Background
• In these applications, graph data may be noisy and incomplete, which leads to uncertain graphs.
• The STRING database (http://string-db.org) is a data source that contains PPIs with uncertain edges provided by biological experiments.
• In visual pattern recognition, uncertain graphs are used to model visual objects.
• In social networks, uncertain links are used to represent possible relationships or the strength of influence between people.
Therefore, it is important to study query processing on large uncertain graphs.
Problem Definition
• Probabilistic subgraph search
• Uncertain graph:
  – Vertex uncertainty (existence probability)
  – Edge uncertainty (existence probability given its two endpoints)
[Figure: an example uncertain graph g with vertices 1 (label A, 0.8), 2 (label A, 0.6), 3 (label B, 0.9) and edges labeled a and b with probabilities 0.9, 0.7, 0.5]
Problem Definition
• Probabilistic subgraph search
• Possible worlds: combinations of all uncertain edges and vertices
[Figure: the 18 possible worlds (1)–(18) of the example uncertain graph, each with its probability, e.g. 0.008, 0.032, 0.012, 0.072, 0.0432, 0.2016, 0.054, …, 0.13608, 0.05832, 0.01512]
Problem Definition
• Probabilistic subgraph search
• Given: an uncertain graph database G = {g1, g2, …, gn}, a query graph q, and a probability threshold ε
• Query: find all gi ∈ G such that the subgraph isomorphic probability is not smaller than ε.
• Subgraph isomorphic probability (SIP):
  – The SIP between q and gi = the sum of the probabilities of gi's possible worlds to which q is subgraph isomorphic
Problem Definition
• Probabilistic subgraph search
• Subgraph isomorphic probability (SIP):
[Figure: uncertain graph g, query q, and the possible worlds to which q is subgraph isomorphic; SIP = 0.054 + 0.00648 + 0.13608 + 0.05832 + 0.01512 = 0.27]
• It is #P-complete to calculate the SIP.
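The possible-world semantics of the SIP can be made concrete with a brute-force sketch: enumerate every vertex/edge combination, check subgraph isomorphism, and sum the surviving world probabilities. This is a toy illustration of the definition only (its exponential cost is exactly why the paper needs bounds); the graph encoding, function names, and the 2-vertex example are my own assumptions, not the paper's.

```python
from itertools import combinations, permutations

def subiso(qv, qe, wv, we):
    """True iff query (qv: vertex -> label, qe: edge -> label) is
    subgraph isomorphic to the possible world (wv, we)."""
    qnodes = list(qv)
    for target in permutations(wv, len(qnodes)):
        m = dict(zip(qnodes, target))
        if any(qv[u] != wv[m[u]] for u in qnodes):
            continue
        if all(we.get((m[u], m[v])) == lab or we.get((m[v], m[u])) == lab
               for (u, v), lab in qe.items()):
            return True
    return False

def sip(vertices, edges, qv, qe):
    """Sum the probabilities of every possible world of the uncertain
    graph to which q is subgraph isomorphic (exponential: illustration only).

    vertices: id -> (label, existence probability)
    edges:    (u, v) -> (label, existence probability given both endpoints)
    """
    total = 0.0
    vs = list(vertices)
    for r in range(len(vs) + 1):
        for vsub in combinations(vs, r):
            p_v = 1.0
            for v in vs:
                p = vertices[v][1]
                p_v *= p if v in vsub else 1.0 - p
            # only edges whose endpoints both survive can appear
            cand = [e for e in edges if e[0] in vsub and e[1] in vsub]
            for k in range(len(cand) + 1):
                for esub in combinations(cand, k):
                    p_world = p_v
                    for e in cand:
                        p = edges[e][1]
                        p_world *= p if e in esub else 1.0 - p
                    wv = {v: vertices[v][0] for v in vsub}
                    we = {e: edges[e][0] for e in esub}
                    if subiso(qv, qe, wv, we):
                        total += p_world
    return total
```

For a 2-vertex graph {A (0.8), B (0.5)} with one edge a (0.5), the SIP of the query A–a–B is 0.8 × 0.5 × 0.5 = 0.2, and the world probabilities always sum to 1.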
• Probabilistic subgraph query processing framework
• Naïve method: sequentially scan D, and decide if the SIP between q and each gi is not smaller than threshold ε.
  – Testing whether g1 is subgraph isomorphic to g2: NP-complete
  – Calculating the SIP: #P-complete
  – The naïve method is very costly, infeasible!
• Probabilistic subgraph query processing framework
• Filter-and-verification:
  {g1, g2, …, gn} → Filtering (with query q) → Candidates {g′1, g′2, …, g′m} → Verification → Answers {g″1, g″2, …, g″k}
Solutions
• Filtering: structural pruning
• Principle: if we remove all the uncertainty from g, and the resulting certain graph gc still does not contain q, then the original uncertain graph cannot contain q.
[Figure: the example uncertain graph g and a query q]
• Theorem: if q ⊈ gc, then Pr(q ⊆ g) = 0
Solutions
• Probabilistic pruning: let f be a feature of gc, i.e., f ⊆ gc
• Rule 1: if f ⊆ q and UpperB(Pr(f ⊆ g)) < ε, then g is pruned.
  ∵ f ⊆ q, ∴ Pr(q ⊆ g) ≤ Pr(f ⊆ g) < ε
[Figure: an example uncertain graph, a feature f with upper bound 0.6, and a query q]
Solutions
• Rule 2: if q ⊆ f and LowerB(Pr(f ⊆ g)) ≥ ε, then g is an answer.
  ∵ q ⊆ f, ∴ Pr(q ⊆ g) ≥ Pr(f ⊆ g) ≥ ε
[Figure: an example uncertain graph, a feature f with lower bound 0.2, and a query q]
• Two main issues for probabilistic pruning:
  – How to derive lower and upper bounds of the SIP?
  – How to select features with great pruning power?
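Rules 1 and 2 slot into the filter-and-verification framework as cheap pre-checks before the expensive exact SIP computation. A minimal skeleton, assuming the bound and verification routines are supplied as callables and that the bounds already come from appropriately contained features; all names here are placeholders, not the paper's API:

```python
def probabilistic_filter(db, q, eps, upper_b, lower_b, exact_sip):
    """Apply Rule 1 / Rule 2 to each database graph; verify only survivors.

    upper_b(q, g) / lower_b(q, g): feature-based bounds on Pr(q <= g);
    exact_sip(q, g): the expensive #P-hard verification step.
    """
    answers = []
    for g in db:
        if upper_b(q, g) < eps:        # Rule 1: upper bound below threshold -> prune
            continue
        if lower_b(q, g) >= eps:       # Rule 2: lower bound meets threshold -> answer
            answers.append(g)
            continue
        if exact_sip(q, g) >= eps:     # otherwise verify exactly
            answers.append(g)
    return answers
```

Note that `exact_sip` is only ever invoked for graphs that neither rule could decide, which is where the filtering gain comes from.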
Solutions
• Technique 1: calculation of lower and upper bounds
• Lemma: let Bf1, …, Bf|Ef| be all embeddings of f in gc; then
  Pr(f ⊆ g) = Pr(Bf1 ∪ … ∪ Bf|Ef|).
• UpperB(Pr(f ⊆ g)):
  Pr(f ⊆ g) = Pr(Bf1 ∪ … ∪ Bf|Ef|) = 1 − Pr(¬Bf1 ∩ … ∩ ¬Bf|Ef|)
  Pr(Bf1 ∪ … ∪ Bf|Ef|) ≤ Σ_{i=1}^{|Ef|} Pr(Bfi)
  Pr(f ⊆ g) = 1 − Pr(¬Bf1 ∩ … ∩ ¬Bf|Ef|) ≤ 1 − ∏_{i=1}^{|Ef|} (1 − Pr(Bfi)) = UpperB(f)
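The final inequality is directly computable from the embedding probabilities alone. A one-function sketch, assuming the probabilities Pr(Bfi) have already been extracted:

```python
from math import prod

def upper_b(embedding_probs):
    """UpperB(f) = 1 - prod(1 - Pr(Bfi)) over all embeddings Bfi of f in gc.
    This product form is never worse than the union bound sum(Pr(Bfi))."""
    return 1.0 - prod(1.0 - p for p in embedding_probs)
```

For two embeddings of probability 0.5 each, the bound is 1 − 0.5 × 0.5 = 0.75, versus 1.0 from the plain union bound.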
Solutions
• Technique 1: calculation of lower and upper bounds
• LowerB(Pr(f ⊆ g)): for a set IN of pairwise independent (non-overlapping) embeddings,
  Pr(f ⊆ g) = Pr(∪_{i=1}^{|Ef|} Bfi) ≥ Pr(∪_{j∈IN} Bfj) = 1 − ∏_{j∈IN} (1 − Pr(Bfj)) = LowerB(f)
• Tightest LowerB(f):
[Figure: the embeddings EM1, EM2, EM3 of feature f2 in graph 002, and the graph bG built over these embeddings]
  Finding the tightest LowerB(f) converts into computing the maximum weight clique of graph bG, which is NP-hard.
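On small embedding sets the tightest lower bound can simply be brute-forced: enumerate every set of pairwise non-overlapping embeddings and keep the best 1 − ∏(1 − Pr(Bfj)). A sketch under my own encoding (each embedding as a vertex set plus its probability); since the exact problem is a maximum weight clique on bG, this enumeration is exponential in the number of embeddings.

```python
from itertools import combinations
from math import prod

def tightest_lower_b(embeddings):
    """embeddings: list of (vertex_set, probability) pairs.
    Maximise 1 - prod(1 - p) over sets of pairwise disjoint embeddings."""
    best = 0.0
    n = len(embeddings)
    for r in range(1, n + 1):
        for subset in combinations(range(n), r):
            # keep only subsets whose embeddings are pairwise vertex-disjoint
            if all(embeddings[i][0].isdisjoint(embeddings[j][0])
                   for i, j in combinations(subset, 2)):
                best = max(best,
                           1.0 - prod(1.0 - embeddings[i][1] for i in subset))
    return best
```

Maximising 1 − ∏(1 − p) over such sets is equivalent to maximising Σ −log(1 − p), i.e. a maximum weight clique in the compatibility graph bG with node weights −log(1 − Pr(Bfi)).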
Solutions
• Technique 1: calculation of lower and upper bounds
• Exact value vs. upper and lower bounds
[Figure: two plots over database size (50–250): the exact probability plotted against the UpperBound and LowerBound values, and the calculation time in seconds (log scale, 0.1–1000) for the exact value and the two bounds]
Solutions
• Technique 2: optimal feature selection
• If we index all features, we obtain the index with the most pruning power, but querying such an index is also very costly. Thus we would like to select a small number of features that still have the greatest pruning power.
• Cost model:
  – Max gain = sequential scan cost − query index cost
  – Maximum set coverage: NP-complete; use the greedy algorithm to approximate it.
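The greedy approximation for maximum coverage picks, at each step, the feature that covers the most not-yet-covered items, which yields the classic 1 − 1/e guarantee. A generic sketch; the sets of (graph, query) pairs a feature can prune are placeholders for whatever the cost model produces:

```python
def greedy_max_coverage(candidates, k):
    """candidates: feature -> set of items (e.g. graph/query pairs) it prunes.
    Greedily select up to k features maximising total coverage."""
    chosen, covered = [], set()
    pool = dict(candidates)
    for _ in range(k):
        # feature with the largest marginal gain over what is already covered
        best = max(pool, key=lambda f: len(pool[f] - covered), default=None)
        if best is None or not (pool[best] - covered):
            break  # nothing left to gain
        chosen.append(best)
        covered |= pool.pop(best)
    return chosen, covered
```

Each iteration costs one pass over the remaining features, so the whole selection is O(k · |features|) set operations.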
Solutions
• Technique 2: optimal feature selection
• Maximum coverage: greedy algorithm; approximates the optimal index within 1 − 1/e
Feature Matrix of (LowerB, UpperB) pairs:
        001           002
  f1    (0.19, 0.19)  (0.27, 0.49)
  f2    (0.27, 0.27)  (0.4, 0.49)
  f3    0             (0.01, 0.11)
[Figure: example queries q1, q2, q3 with thresholds 0.6, 0.5, 0.2, the matrix columns each query touches, and the resulting Probabilistic Index]
Solutions
• Probabilistic Index
  – Construct a string for each feature
  – Construct a prefix tree over all feature strings
  – Construct an inverted list for each leaf node
[Figure: a prefix tree with root and nodes fa, fb, fc, fd; leaves carry ID-lists such as fa: {<g1, 0.2, 0.6>, <g2, 0.4, 0.7>, …} and fd: {<g2, 0.3, 0.8>, <g4, 0.4, 0.6>, …}]
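The three construction steps can be sketched with a dict-based prefix tree: the triple format <graph, LowerB, UpperB> follows the slide's ID-lists, but the feature-string encoding and all class/method names here are illustrative only.

```python
class ProbIndex:
    """Toy prefix tree over feature strings; each terminal node keeps an
    ID-list of (graph id, LowerB, UpperB) triples."""

    def __init__(self):
        self.root = {}

    def insert(self, feature_string, graph_id, lo, hi):
        node = self.root
        for ch in feature_string:       # walk/extend the prefix tree
            node = node.setdefault(ch, {})
        # "$" marks end-of-string and holds the inverted ID-list
        node.setdefault("$", []).append((graph_id, lo, hi))

    def lookup(self, feature_string):
        node = self.root
        for ch in feature_string:
            if ch not in node:
                return []               # feature not indexed
            node = node[ch]
        return node.get("$", [])
```

At query time, the bounds stored in a matched feature's ID-list feed Rules 1 and 2 directly, without touching the database graphs.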
Solutions
• Verification: iterative bound pruning
• Lemma: Pr(q ⊆ g) = Pr(Bq1 ∪ … ∪ Bq|Eq|)
• Unfolding by the Inclusion–Exclusion Principle:
  Pr(q ⊆ g) = Σ_{i=1}^{|Eq|} (−1)^{i−1} S_i, where S_i = Σ_{J ⊆ {1,…,|Eq|}, |J|=i} Pr(∩_{j∈J} Bqj)
• Truncating the sum after i terms gives alternating bounds:
  Pr(q ⊆ g) ≤ Σ_{w=1}^{i} (−1)^{w−1} S_w if i is odd
  Pr(q ⊆ g) ≥ Σ_{w=1}^{i} (−1)^{w−1} S_w if i is even
• Iterative bound pruning
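The alternating partial sums let verification stop as soon as an odd-step upper bound falls below ε or an even-step lower bound reaches ε. A toy sketch in which each embedding is modelled as a dict of independent elements to existence probabilities (my encoding, not the paper's), so the probability of an intersection is the product over the union of required elements:

```python
from itertools import combinations
from math import prod

def inter_prob(embs):
    """Pr that every embedding in embs exists: product over the union of
    their required independent elements."""
    required = {}
    for emb in embs:
        required.update(emb)
    return prod(required.values())

def iterative_bounds(embeddings, eps):
    """Evaluate inclusion-exclusion term by term with early termination.
    Returns (decision for Pr(q in g) >= eps, last partial sum)."""
    total, n = 0.0, len(embeddings)
    for i in range(1, n + 1):
        s_i = sum(inter_prob(c) for c in combinations(embeddings, i))
        total += s_i if i % 2 == 1 else -s_i
        if i % 2 == 1 and total < eps:
            return False, total   # upper bound already below the threshold
        if i % 2 == 0 and total >= eps:
            return True, total    # lower bound already meets the threshold
    return total >= eps, total    # sum is exact after all |Eq| terms
```

With two embeddings {e1, e2} and {e2, e3} of probability 0.25 each and shared element e2, the exact probability is 0.25 + 0.25 − 0.125 = 0.375, reached after two terms.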
Solutions
• Performance Evaluation
• Real dataset: uncertain PPI
  – 1,500 uncertain graphs
  – Average 332 vertices and 584 edges
  – Average probability: 0.367
• Synthetic dataset: AIDS dataset
  – Probabilities generated from a Gaussian distribution
  – 10k uncertain graphs
  – Average 24.3 vertices and 26.5 edges
Solutions
• Performance Evaluation: results on the real dataset
[Figure: response time in seconds (log scale) and candidate size vs. query size (q50–q250), comparing SCAN, Non-PF, PFiltering, and PIndex]
Solutions
• Performance Evaluation: results on the real dataset
[Figure: feature number (10–10000, log scale) and response time in seconds (0.01–10, log scale) vs. number of distinct labels (50–250), comparing Non-PF and PFiltering]
Solutions
• Performance Evaluation: response and construction time
[Figure: response time in seconds (log scale) and index construction time in seconds (0–300) vs. database size (2k–10k), comparing SFiltering, PFiltering, and E-Bound]
Solutions
• Performance Evaluation: results on the synthetic dataset
[Figure: feature number and index size in MB (log scales) vs. the mean (0.3–0.7) and variance of the Gaussian distribution, comparing SFiltering and PFiltering]
Conclusion
• We propose the first efficient solution to answer threshold-based probabilistic subgraph search over uncertain graph databases.
• We employ a filter-and-verification framework and develop probability bounds for filtering.
• We design a cost model to select a minimum number of features with the largest pruning power.
• We demonstrate the effectiveness of our solution through experiments on real and synthetic datasets.
Thanks!