Reachability Querying: An Independent Permutation Labeling Approach
(published in VLDB 2014)
Presenter: WEI, Hao

Graph Reachability Query
- Given a directed graph G = (V, E) and two vertices u and v, u is said to reach v if there exists a path from u to v in G.
- Any directed graph can be transformed into a DAG by collapsing each strongly connected component into a single vertex; the query is trivial if u and v are in the same strongly connected component.
- [Figure: example DAG with vertices v0 to v11]
- Query(v1, v8): reachable
- Query(v2, v11): unreachable

The Issue and the Challenge
- The "Big Data" era brings large graphs with millions of nodes and edges.
- web-uk dataset: 133 million nodes, 5 billion edges
- DAG of web-uk: 22 million nodes, 38 million edges
- Traditional approaches are not applicable at this scale.

Related Work
- Recent work builds an index, label(u), offline for every node u.
- Label-Only approach: answer Query(u, v) using only label(u) and label(v)
  - Hop labeling: TF-Label, Hierarchy Label, Distribution Label, …
  - Transitive closure compression: Chain-Cover, Tree-Cover, …
  - Non-linear index construction time and index size; may generate an unacceptably large index
- Label+G approach: answer Query(u, v) using label(u) and label(v), accessing G if needed
  - Interval labeling: GRIPP, GRAIL, Ferrari, …
  - Linear index size, but may need to perform DFS

Main Idea of IP Labeling
- Out(u) denotes the set of vertices that u can reach, including u itself.
- In(u) denotes the set of vertices that can reach u, including u itself.
- u can reach v iff Out(v) ⊆ Out(u) and In(u) ⊆ In(v).
- If Out(v) ⊈ Out(u) or In(u) ⊈ In(v), u cannot reach v.
- Both tests are time- and space-consuming if an exact answer is needed for large sets.
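
The exact set-containment test can be sketched in a few lines of Python. This is only an illustration of the definition of Out(u): the toy DAG and the function names `out_set` and `reaches_exact` are assumptions made for this example and are not the graph from the slides. In(u) can be computed the same way on the reversed graph.

```python
def out_set(graph, u):
    """Out(u): every vertex reachable from u, including u itself (iterative DFS)."""
    seen, stack = {u}, [u]
    while stack:
        x = stack.pop()
        for y in graph.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen


def reaches_exact(graph, u, v):
    """Exact test: u reaches v iff Out(v) is a subset of Out(u)."""
    return out_set(graph, v) <= out_set(graph, u)


# Hypothetical toy DAG given as adjacency lists (not the slide example).
toy_dag = {0: [2], 1: [2, 4], 2: [3], 3: [], 4: []}

print(reaches_exact(toy_dag, 0, 3))  # True: 0 -> 2 -> 3
print(reaches_exact(toy_dag, 0, 4))  # False: Out(4) = {4} is not inside Out(0)
```

Each call materializes two full reachability sets, which is the cost that makes the exact test impractical on graphs with millions of vertices.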

Main Idea of IP Labeling (cont.)
- The IP label aims to answer an unreachable query pair (u, v) by detecting Out(v) ⊈ Out(u) or In(u) ⊈ In(v).
- Based on min-wise independent permutations
- High-probability guarantee for answering a query
- Linear index construction time and index size

Min-wise Independent Permutation
- Given two sets A and B (A = Out(u), B = Out(v), or A = In(v), B = In(u)) and a random permutation π, the definition of a min-wise independent permutation gives:
    \Pr[\min \pi(A) > \min \pi(B)] = 1 - \frac{|A|}{|A \cup B|}

K-min-wise Independent Permutation
- We propose to use the k smallest numbers instead of only the single smallest number to improve performance.
- Let min_k{π(X)} be the subset of π(X) containing up to the k smallest numbers of π(X).
- Define an order (≼) between min_k{π(A)} and min_k{π(B)}: min_k{π(A)} ≼ min_k{π(B)} if every π(b_i) ∈ min_k{π(B)} \ min_k{π(A)} is larger than the largest number in min_k{π(A)}. Otherwise, min_k{π(A)} ≻ min_k{π(B)}.

K-min-wise Independent Permutation (cont.)
- We prove that if min_k{π(A)} ≻ min_k{π(B)}, then B ⊈ A.
- Let |A| = p, |A ∪ B| = q, and |min_k{π(A)}| = k_A with k_A ≤ k. Then
    \Pr(\min_k\{\pi(A)\} \succ \min_k\{\pi(B)\}) = 1 - \frac{p!\,(q-k_A)!}{q!\,(p-k_A)!} \approx 1 - \left(\frac{p}{q}\right)^{k_A} \quad (\text{for } q \ge p \gg k_A)
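
The order ≻ and the probability formula can be checked with a short Python sketch. The function names `min_k`, `succ`, and `prob_succ` are chosen for this example only. The two label sets are the ones used in the Q1 example of these slides, and the sizes passed to `prob_succ` are hypothetical. The closed form uses the identity p!(q−k_A)! / (q!(p−k_A)!) = C(p, k_A) / C(q, k_A).

```python
from fractions import Fraction
from math import comb


def min_k(permuted_values, k):
    """Up to k smallest numbers of an already permuted set pi(X)."""
    return sorted(permuted_values)[:k]


def succ(mk_a, mk_b):
    """True if mk_a ≻ mk_b, i.e. some number of mk_b that is missing
    from mk_a is smaller than the largest number in mk_a.
    Assumes mk_a is non-empty (a label always contains its own vertex)."""
    largest = max(mk_a)
    return any(x < largest for x in set(mk_b) - set(mk_a))


def prob_succ(p, q, k_a):
    """Exact Pr(min_k{pi(A)} ≻ min_k{pi(B)}) = 1 - C(p, k_a) / C(q, k_a)."""
    return 1 - Fraction(comb(p, k_a), comb(q, k_a))


# Label sets from the Q1 example: the value 1 is missing from the first
# label and is smaller than its largest number 10, so ≻ holds.
print(succ([2, 3, 4, 8, 10], [1]))  # True

# Hypothetical sizes p = 9, q = 10, k_A = 5: C(9, 5) / C(10, 5) = 126 / 252.
print(prob_succ(9, 10, 5))  # 1/2
```

Keeping the result as a `Fraction` avoids floating-point rounding, which makes it easy to compare against the fractions shown on the example slides.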

Independent Permutation Generation
- The permutation π of the vertex IDs is generated by the Knuth shuffle.
- [Figure: vertex IDs and their permuted values]

    Vertex   0   1   2   3   4   5   6   7   8   9   10  11
    π        7   11  8   6   3   0   2   1   10  4   9   5

IP Label
- The IP label of u consists of two parts:
  - Lout(u): the min_k set of Out(u), min_k{π(Out(u))}
  - Lin(u): the min_k set of In(u), min_k{π(In(u))}

IP Label (for k = 5)

    Vertex   Lout                Lin
    v0       {0, 1, 2, 3, 4}     {7}
    v1       {0, 1, 2, 3, 4}     {11}
    v2       {2, 3, 4, 8, 10}    {7, 8}
    v3       {1, 2, 3, 4, 6}     {6, 7}
    v4       {2, 3, 4, 10}       {3, 6, 7, 8, 11}
    v5       {0, 1, 5, 9, 10}    {0, 7, 11}
    v6       {2, 10}             {2, 3, 6, 7, 8}
    v7       {1}                 {0, 1, 6, 7, 11}
    v8       {10}                {0, 2, 3, 6, 7}
    v9       {4}                 {3, 4, 6, 7, 8}
    v10      {9}                 {0, 7, 9, 11}
    v11      {5}                 {0, 5, 7, 11}

IP Label, Q1: Query(v2, v7)
- Lout(v2) = {2, 3, 4, 8, 10}, Lout(v7) = {1}
- 1 ∉ Lout(v2), 1 ∈ Lout(v7), and 1 is smaller than the largest number in Lout(v2).
- So Lout(v2) ≻ Lout(v7), hence Out(v7) ⊈ Out(v2): v2 cannot reach v7.

IP Label, Q2: Query(v1, v3)
- Let |A| = p, |A ∪ B| = q, and |min_k{π(A)}| = k_A with k_A ≤ k. Then
    \Pr(\min_k\{\pi(A)\} \succ \min_k\{\pi(B)\}) = 1 - \frac{p!\,(q-k_A)!}{q!\,(p-k_A)!} \approx 1 - \left(\frac{p}{q}\right)^{k_A}
- At v1 the labels do not decide the query:
  - Lout(v1) ≼ Lout(v3), with Pr(Lout(v1) ≻ Lout(v3)) = 1/2
  - Lin(v3) ≼ Lin(v1), with Pr(Lin(v3) ≻ Lin(v1)) = 2/3
- At v4:
  - Lout(v4) ≻ Lout(v3), with Pr(Lout(v4) ≻ Lout(v3)) = 14/15
  - Lin(v3) ≻ Lin(v4), with Pr(Lin(v3) ≻ Lin(v4)) = 9/10
- At v5:
  - Lout(v5) ≻ Lout(v3), with Pr(Lout(v5) ≻ Lout(v3)) = 125/126
  - Lin(v3) ≻ Lin(v5), with Pr(Lin(v3) ≻ Lin(v5)) = 5/6
- The probability increases significantly!

IP Label
- Assume DFS is needed even though u cannot reach v.
- Consider a vertex w, a descendant of u, that is visited by the DFS towards v. The following hold:
    Pr(Lout(u) ≻ Lout(v)) < Pr(Lout(w) ≻ Lout(v))
    Pr(Lin(v) ≻ Lin(u)) < Pr(Lin(v) ≻ Lin(w))
- As the DFS goes deeper, it becomes much more likely to answer the unreachability query, so the search can stop at an early stage.

Two Optimizations
- Level Label: use the topological structure to prune the search space
- Huge-Vertex Label: build an additional index to handle the huge vertices of the graph

Performance Studies
Real datasets:

    Dataset      |V(G)|   |E(G)|   d_avg   R-ratio
    uniprotenc   25M      25M      0.999   1.30E-7
    twitter      18M      18M      1.013   7.39E-2
    web-uk       22M      38M      1.678   1.50E-1
    citeseerx    6.5M     15M      2.295   4.07E-4
    go-uniprot   6.9M     34M      4.990   3.64E-6
    govwild      8.0M     23M      2.948   7.20E-5

Performance Studies
Index construction time (in seconds):

    Dataset      TF-Label   DL       GRAIL    Ferrari   IP+
    uniprotenc   58.529     22.280   58.242   24.292    18.96
    twitter      15.291     13.719   32.323   19.972    12.44
    web-uk       ---        24.240   44.031   26.927    17.46
    citeseerx    91.877     12.045   23.170   19.792    7.54
    go-uniprot   38.668     18.277   44.557   40.365    9.68
    govwild      30.520     18.584   29.237   19.924    8.45

Performance Studies
Query time (in milliseconds):

    Dataset      TF-Label   DL        GRAIL     Ferrari   IP+
    uniprotenc   119.164    119.618   820.249   116.351   54.205
    twitter      102.923    104.698   ---       82.212    79.285
    web-uk       ---        146.429   ---       214.857   253.082
    citeseerx    230.318    111.329   28774     131.534   101.444
    go-uniprot   55.279     153.214   499.505   313.300   34.577
    govwild      254.785    128.199   719.494   295.432   112.990

Performance Studies
- [Figure: distribution of the number of vertices visited]

Conclusion
- We propose a new IP labeling approach, the first to exploit randomness to answer reachability queries.
- Our labeling approach has linear index construction time and linear index size.
- With independent permutations, query performance is guaranteed with high probability.
- Extensive experimental studies show that our approach is both efficient and scalable.
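
The labeling and the label-guided DFS described in these slides can be put together in a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `smallest_k`, `succ`, `build_ip_labels`, and `reachable` are made up for this example, the input is assumed to be a DAG on vertices 0..n-1, labels are merged along a topological order, and the Level Label and Huge-Vertex Label optimizations are left out.

```python
import random


def smallest_k(values, k):
    """Up to k smallest distinct values, as a sorted tuple."""
    return tuple(sorted(set(values))[:k])


def succ(label_a, label_b):
    """label_a ≻ label_b: label_b holds a value missing from label_a
    that is smaller than the largest value in label_a."""
    members, largest = set(label_a), label_a[-1]
    return any(x < largest and x not in members for x in label_b)


def build_ip_labels(n, edges, k, seed=0):
    """Return (children, lout, lin) for a DAG on vertices 0..n-1."""
    pi = list(range(n))
    random.Random(seed).shuffle(pi)  # Fisher-Yates (Knuth) shuffle
    children = [[] for _ in range(n)]
    parents = [[] for _ in range(n)]
    for u, v in edges:
        children[u].append(v)
        parents[v].append(u)
    # Topological order via Kahn's algorithm (input assumed acyclic).
    indegree = [len(p) for p in parents]
    order = [u for u in range(n) if indegree[u] == 0]
    i = 0
    while i < len(order):
        for v in children[order[i]]:
            indegree[v] -= 1
            if indegree[v] == 0:
                order.append(v)
        i += 1
    lout, lin = [()] * n, [()] * n
    for u in reversed(order):  # children are labeled before u
        lout[u] = smallest_k([pi[u]] + [x for c in children[u] for x in lout[c]], k)
    for u in order:            # parents are labeled before u
        lin[u] = smallest_k([pi[u]] + [x for p in parents[u] for x in lin[p]], k)
    return children, lout, lin


def reachable(u, v, children, lout, lin):
    """DFS from u towards v that skips every vertex whose labels
    already prove it cannot reach v."""
    stack, seen = [u], {u}
    while stack:
        x = stack.pop()
        if x == v:
            return True
        if succ(lout[x], lout[v]) or succ(lin[v], lin[x]):
            continue  # Out(v) ⊈ Out(x) or In(x) ⊈ In(v)
        for c in children[x]:
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return False


# Hypothetical toy DAG (not the slide example).
edges = [(0, 2), (1, 2), (2, 3), (1, 4)]
children, lout, lin = build_ip_labels(5, edges, k=2)
print(reachable(0, 3, children, lout, lin))  # True: 0 -> 2 -> 3
print(reachable(0, 4, children, lout, lin))  # False: no path from 0 to 4
```

The labels only ever prune vertices that provably cannot reach the target, so the answer stays exact for any permutation; the permutation and k affect only how many vertices the DFS has to visit.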