Reachability Querying: An Independent
Permutation Labeling Approach
(published in VLDB 2014)
Presenter: WEI, Hao
Graph Reachability Query
Given a directed graph G = (V, E) and two vertices u and v, u is said to reach v if there exists a path from u to v in G.
• Any directed graph can be easily transformed into a DAG by collapsing each strongly connected component into a single vertex.
• The query is trivial if u and v are in the same strongly connected component.
[Figure: example DAG with vertices 0 to 11]
Query(v1, v8): Reachable
Query(v2, v11): Unreachable
The Issue and the Challenge
The ‘Big Data’ era brings us large graphs with millions of nodes and edges.
• web-uk dataset: 133 million nodes, 5 billion edges
• DAG of web-uk: 22 million nodes, 38 million edges
• Traditional approaches are not applicable.

Related Work
Recent works build an index, label(u), offline for every node u.
Label-Only Approach: answer Query(u, v) using only label(u) and label(v)
• Hop Labeling: TF-Label, Hierarchy Label, Distribution Label, …
• Transitive Closure Compression: Chain-Cover, Tree-Cover, …
• non-linear index construction time and index size; may generate an unacceptably large index

Label+G Approach: answer Query(u, v) using label(u) and label(v), accessing G if needed
• Interval Labeling: GRIPP, GRAIL, Ferrari, …
• linear index size, but may need to perform DFS

Main Idea of IP Labeling
Out(u) denotes the set of vertices that u can reach, including u itself.
In(u) denotes the set of vertices that can reach u, including u itself.
• u can reach v iff Out(v) ⊆ Out(u) and In(u) ⊆ In(v).
• If Out(v) ⊈ Out(u) or In(u) ⊈ In(v), u cannot reach v.

Both checks are time- and space-consuming if an exact answer is needed for large sets.
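The containment test above can be sketched in a few lines of Python. This is only an illustration of the exact (expensive) check, not the method proposed in the talk; the adjacency-dict representation `succ` and all function names are assumptions made for this example.

```python
def out_set(u, adj):
    """Return all vertices reachable from u in adj, including u itself."""
    seen, stack = {u}, [u]
    while stack:
        w = stack.pop()
        for x in adj.get(w, ()):
            if x not in seen:
                seen.add(x)
                stack.append(x)
    return seen


def reverse_graph(succ):
    """Build the predecessor lists of a graph given as successor lists."""
    pred = {}
    for u, targets in succ.items():
        for w in targets:
            pred.setdefault(w, []).append(u)
    return pred


def reaches_exact(u, v, succ):
    """Exact test: Out(v) is a subset of Out(u) and In(u) is a subset of In(v)."""
    pred = reverse_graph(succ)
    out_ok = out_set(v, succ) <= out_set(u, succ)
    # In(x) is Out(x) computed on the reversed graph.
    in_ok = out_set(u, pred) <= out_set(v, pred)
    return out_ok and in_ok
```

Each call materializes full Out and In sets, which is exactly the cost that becomes prohibitive on graphs with millions of vertices.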
Main Idea of IP Labeling
The IP label aims to answer an unreachable query pair (u, v) by detecting Out(v) ⊈ Out(u) or In(u) ⊈ In(v):
• based on min-wise independent permutations
• answers queries with a high-probability guarantee
• linear index construction time and index size
Min-wise Independent Permutation
Given two sets A and B (Out(u) and Out(v), or In(v) and In(u)) and a random permutation π, the definition of min-wise independent permutation gives
$$\Pr\bigl(\min\{\pi(A)\} > \min\{\pi(B)\}\bigr) = 1 - \frac{|A|}{|A \cup B|}$$
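As a sanity check of this identity, the probability can be computed exactly for small sets by enumerating every permutation of A ∪ B. The function name `prob_min_a_greater` is hypothetical and the brute-force enumeration is only feasible for tiny inputs; it is a sketch for intuition, not part of the labeling scheme.

```python
from fractions import Fraction
from itertools import permutations


def prob_min_a_greater(a, b):
    """Exact Pr(min pi(A) > min pi(B)) over all permutations pi of A | B."""
    universe = sorted(a | b)
    hits = total = 0
    for values in permutations(range(len(universe))):
        pi = dict(zip(universe, values))  # one permutation of the universe
        total += 1
        if min(pi[x] for x in a) > min(pi[x] for x in b):
            hits += 1
    return Fraction(hits, total)
```

For A = {1, 2} and B = {2, 3, 4}, the enumeration agrees with 1 − |A| / |A ∪ B| = 1 − 2/4.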
K-min-wise Independent Permutation
We propose to use the k smallest numbers instead of only the single smallest number to improve the performance.

• Let min_k{π(X)} be the subset of π(X) containing up to the k smallest numbers of π(X).
• Define an order (≼) between min_k{π(A)} and min_k{π(B)}: min_k{π(A)} ≼ min_k{π(B)} if every π(b_i) ∈ min_k{π(B)} \ min_k{π(A)} is larger than the largest number in min_k{π(A)}. Otherwise, we write min_k{π(A)} ≻ min_k{π(B)}.
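The two definitions translate directly into set operations. The sketch below assumes labels are stored as Python sets of permuted numbers; the names `min_k`, `precedes` and `succeeds` are chosen for this example only.

```python
import heapq


def min_k(values, k):
    """Return the set of up to k smallest numbers in values."""
    return set(heapq.nsmallest(k, values))


def precedes(label_a, label_b):
    """True if label_a ≼ label_b: every number that is in label_b but not
    in label_a is larger than the largest number in label_a."""
    largest_a = max(label_a)
    return all(x > largest_a for x in label_b - label_a)


def succeeds(label_a, label_b):
    """True if label_a ≻ label_b, i.e. the order ≼ does not hold."""
    return not precedes(label_a, label_b)
```

For example, `succeeds({2, 3, 4, 8, 10}, {1})` is true because 1 is missing from the first label yet smaller than its largest number 10.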
K-min-wise Independent Permutation
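We prove that
• if min_k{π(A)} ≻ min_k{π(B)} is true, then B ⊈ A
• Let |A| = p, |A ∪ B| = q and |min_k{π(A)}| = k_A for k_A ≤ k. Then
$$\Pr\bigl(\min\nolimits_k\{\pi(A)\} \succ \min\nolimits_k\{\pi(B)\}\bigr) = 1 - \frac{p!\,(q-k_A)!}{q!\,(p-k_A)!} \approx 1 - \left(\frac{p}{q}\right)^{k_A} \quad (\text{for } q \ge p \gg k_A)$$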
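The factorial ratio equals C(p, k_A) / C(q, k_A), so the probability is easy to evaluate exactly. The following sketch (function names are hypothetical) computes both the exact value and the approximation stated above.

```python
from fractions import Fraction
from math import comb


def prob_succeeds(p, q, k_a):
    """Exact value of 1 - p!(q-k_a)! / (q!(p-k_a)!), using
    p!(q-k_a)! / (q!(p-k_a)!) = C(p, k_a) / C(q, k_a)."""
    return 1 - Fraction(comb(p, k_a), comb(q, k_a))


def prob_succeeds_approx(p, q, k_a):
    """Approximation 1 - (p/q)^k_a, intended for q >= p >> k_a."""
    return 1 - (p / q) ** k_a
```

With p = 5, q = 9 and k_A = 5 the exact value is 1 − 1/126 = 125/126; the approximation is only meaningful when p is much larger than k_A.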
Independent Permutation Generation
[Figure: the vertex numbers 0 to 11 are randomly permuted by a Knuth shuffle]
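A minimal sketch of the Knuth (Fisher-Yates) shuffle for producing a random permutation of the vertex numbers 0 to n − 1. The function name and the optional `seed` parameter are assumptions added for reproducibility of the example.

```python
import random


def knuth_shuffle(n, seed=None):
    """Return a uniformly random permutation of 0..n-1 in O(n) time."""
    rng = random.Random(seed)
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        perm[i], perm[j] = perm[j], perm[i]
    return perm
```

Here `perm[v]` would be the permuted number π(v) assigned to vertex v.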
IP Label
The IP label of u consists of two parts:
• Lout(u): the min_k set of Out(u), i.e. min_k{π(Out(u))}
• Lin(u): the min_k set of In(u), i.e. min_k{π(In(u))}
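One way such labels could be computed is sketched below. It assumes (this is not stated on the slides) that the labels are built by visiting the DAG so that successors come first and merging their labels, which works because the k smallest numbers of Out(u) must come from π(u) or from a successor's k smallest numbers. The names `build_ip_labels` and `_min_k_labels`, and the dict-based inputs `succ` and `perm`, are hypothetical; `perm` is assumed to contain every vertex.

```python
import heapq
from graphlib import TopologicalSorter


def _min_k_labels(adj, perm, k):
    """For each vertex u, return the k smallest perm values among all
    vertices reachable from u in adj (including u itself)."""
    graph = {u: list(adj.get(u, ())) for u in perm}
    labels = {}
    # TopologicalSorter emits a node after the nodes it maps to, so the
    # neighbours listed in graph[u] are always labelled before u.
    for u in TopologicalSorter(graph).static_order():
        candidates = {perm[u]}
        for w in graph[u]:
            candidates |= labels[w]
        labels[u] = set(heapq.nsmallest(k, candidates))
    return labels


def build_ip_labels(succ, perm, k):
    """Return (l_out, l_in) for a DAG given as successor lists."""
    pred = {u: [] for u in perm}
    for u, targets in succ.items():
        for w in targets:
            pred[w].append(u)
    l_out = _min_k_labels(succ, perm, k)   # labels over Out(u)
    l_in = _min_k_labels(pred, perm, k)    # labels over In(u)
    return l_out, l_in
```

Each vertex and edge is handled once per direction with at most k numbers per merge, so the work in this sketch grows linearly with the graph size for a fixed k.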
IP Label
[Figure: example DAG showing the permuted number and the labels of each vertex]
Vertex  Lout               Lin
v0      {0, 1, 2, 3, 4}    {7}
v1      {0, 1, 2, 3, 4}    {11}
v2      {2, 3, 4, 8, 10}   {7, 8}
v3      {1, 2, 3, 4, 6}    {6, 7}
v4      {2, 3, 4, 10}      {3, 6, 7, 8, 11}
v5      {0, 1, 5, 9, 10}   {0, 7, 11}
v6      {2, 10}            {2, 3, 6, 7, 8}
v7      {1}                {0, 1, 6, 7, 11}
v8      {10}               {0, 2, 3, 6, 7}
v9      {4}                {3, 4, 6, 7, 8}
v10     {9}                {0, 7, 9, 11}
v11     {5}                {0, 5, 7, 11}
for k = 5
IP Label
Q1: Query(v2, v7)
Lout(v2) = {2, 3, 4, 8, 10}
Lout(v7) = {1}
1 ∉ Lout(v2), 1 ∈ Lout(v7), and 1 is smaller than the largest number in Lout(v2).
So Lout(v2) ≻ Lout(v7), and therefore Out(v7) ⊈ Out(v2).
for k = 5
IP Label
Q2: Query(v1, v3)
Let |A| = p, |A ∪ B| = q and |min_k{π(A)}| = k_A for k_A ≤ k:
$$\Pr\bigl(\min\nolimits_k\{\pi(A)\} \succ \min\nolimits_k\{\pi(B)\}\bigr) = 1 - \frac{p!\,(q-k_A)!}{q!\,(p-k_A)!} \approx 1 - \left(\frac{p}{q}\right)^{k_A}$$
[Figure: example DAG annotated with the probability pair 1/2, 2/3]
• Lout(v1) ≼ Lout(v3), Pr(Lout(v1) ≻ Lout(v3)) = 1/2
• Lin(v3) ≼ Lin(v1), Pr(Lin(v3) ≻ Lin(v1)) = 2/3
for k = 5
IP Label
Q4: Query(v1, v3)
[Figure: example DAG annotated with the probability pairs 1/2, 2/3 and 14/15, 9/10]
• Lout(v4) ≻ Lout(v3), Pr(Lout(v4) ≻ Lout(v3)) = 14/15
• Lin(v3) ≻ Lin(v4), Pr(Lin(v3) ≻ Lin(v4)) = 9/10
for k = 5
IP Label
Q4: Query(v1, v3)
[Figure: example DAG annotated with the probability pairs 1/2, 2/3; 14/15, 9/10; and 125/126, 5/6]
• Lout(v5) ≻ Lout(v3), Pr(Lout(v5) ≻ Lout(v3)) = 125/126
• Lin(v3) ≻ Lin(v5), Pr(Lin(v3) ≻ Lin(v5)) = 5/6
for k = 5
IP Label
Q4: Query(v1, v3)
[Figure: example DAG annotated with the probability pairs 1/2, 2/3; 14/15, 9/10; and 125/126, 5/6]
The probability increases significantly!
for k = 5
IP Label
Assume DFS is needed even though u cannot reach v.
Consider a vertex w, a descendant of u, that is visited by the DFS towards v. The following hold:
Pr(Lout(u) ≻ Lout(v)) < Pr(Lout(w) ≻ Lout(v))
Pr(Lin(v) ≻ Lin(u)) < Pr(Lin(v) ≻ Lin(w))
As the DFS goes deeper, it becomes much more likely that the labels answer the unreachability query, so the search can stop at an early stage.
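The interplay between the label test and the DFS can be sketched as follows. This is an illustrative reading of the slides, not the authors' implementation; `reaches`, `precedes`, and the dict-based inputs `succ`, `l_out` and `l_in` are names assumed for this example.

```python
def precedes(label_a, label_b):
    """True if label_a ≼ label_b (see the definition of the order)."""
    largest_a = max(label_a)
    return all(x > largest_a for x in label_b - label_a)


def reaches(u, v, succ, l_out, l_in):
    """Answer Query(u, v) with a DFS that is pruned by the IP labels."""
    visited = set()
    stack = [u]
    while stack:
        w = stack.pop()
        if w == v:
            return True
        if w in visited:
            continue
        visited.add(w)
        # Lout(w) ≻ Lout(v) proves Out(v) ⊈ Out(w): w cannot reach v.
        if not precedes(l_out[w], l_out[v]):
            continue
        # Lin(v) ≻ Lin(w) proves In(w) ⊈ In(v): w cannot reach v.
        if not precedes(l_in[v], l_in[w]):
            continue
        stack.extend(succ.get(w, ()))
    return False
```

The label test is applied at every visited vertex, so a branch that cannot lead to v is cut as soon as its labels reveal it.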
Two Optimizations

Level Label: use the topological structure to prune
the search space

Huge-Vertex Label: build additional index to handle
the huge vertices of the graph
Performance Studies
Real Dataset:
Dataset      |V(G)|   |E(G)|   davg    R-ratio
uniprotenc   25M      25M      0.999   1.30E-7
twitter      18M      18M      1.013   7.39E-2
web-uk       22M      38M      1.678   1.50E-1
citeseerx    6.5M     15M      2.295   4.07E-4
go-uniprot   6.9M     34M      4.990   3.64E-6
govwild      8.0M     23M      2.948   7.20E-5
Performance Studies
Index Construction Time (in seconds)
Dataset      TF-Label   DL       GRAIL    Ferrari   IP+
uniprotenc   58.529     22.280   58.242   24.292    18.96
twitter      15.291     13.719   32.323   19.972    12.44
web-uk       ---        24.240   44.031   26.927    17.46
citeseerx    91.877     12.045   23.170   19.792    7.54
go-uniprot   38.668     18.277   44.557   40.365    9.68
govwild      30.520     18.584   29.237   19.924    8.45
Performance Studies
Query Time (in milliseconds)
Dataset      TF-Label   DL        GRAIL     Ferrari   IP+
uniprotenc   119.164    119.618   820.249   116.351   54.205
twitter      102.923    104.698   ---       82.212    79.285
web-uk       ---        146.429   ---       214.857   253.082
citeseerx    230.318    111.329   28774     131.534   101.444
go-uniprot   55.279     153.214   499.505   313.300   34.577
govwild      254.785    128.199   719.494   295.432   112.990
Performance Studies
[Figure: distribution of the number of vertices visited]
Conclusion

• We propose a new IP labeling approach, the first to exploit randomness to answer reachability queries.

• Our new labeling approach has linear index construction time and index size. Through independent permutations, the query performance is guaranteed with high probability.

• We evaluate the proposed approach in extensive experimental studies, where it shows both good efficiency and scalability.