Finding Skyline Nodes in Large Networks Arijit Khan Vishwakarma Singh Jian Wu Motivation If John is interested in Big Data, Cloud Computing, and Map Reduce, who will be the top-5 people John should ask about these topics? Evaluation Metrics: Distance from the query node. (John) Coverage of the Query Topics. (Big Data, Cloud Computing, Map Reduce) Finding Skyline Nodes in Large Networks 2 Homogeneous Approach ? If John is interested in Big Data, Cloud Computing, and Map Reduce, who will be the top-5 people John should ask about these topics? Score = λ . Distance + (1- λ ). Coverage How to get λ ? Finding Skyline Nodes in Large Networks 3 Weighted Set Cover ? Find nodes with smallest aggregate distance from the query node, such that they cover all query topics. u0 = q Q = { a, b, c } Ignore some interesting nodes. a Cannot rank the results. b u1 abc c u2 u3 a cd u5 u4 abc u6 de u7 Finding Skyline Nodes in Large Networks u8 4 Graph Skyline Dominance on Coverage: u >c v Query topics covered by node u is a superset of the query topics covered by node v. Dominance on Distance: u >d v Distance of u from q is less than that of v from q. Dominance: u > v (1) u >c v and u ≥d v ; or (2) u ≥c v and u >d v. u0 = q Q = { a, b, c } a b u1 c u2 abc u3 a cd u5 u4 abc u6 de u7 u8 A node is a skyline node if it is not dominated by any other node in the network. Finding Skyline Nodes in Large Networks 5 Ranking of Skyline Nodes Too many skyline nodes. u0 = q Q = { a, b, c } Rank them. Dominance Count: # nodes dominated by a skyline node. [Lin et. al., ICDE ‘07] Higher Dominance Count => more pruning from candidate set. 1. DC(u4) = {u5, u6, u7}, 2. DC(u1) = {u5} 3. DC(u2) = Φ; 4. DC(u3) = Φ a b u1 c u2 abc u3 a cd u5 u4 abc u6 de u7 u8 Given a query node and a set of query topics in a network, find the top-k skyline nodes with maximum dominance count. Finding Skyline Nodes in Large Networks 6 Algorithm Construct a Query DAG. Three variables associated with each DAG node: Count (C), Dominance (D), Traversal (T). u0 = q a Q = { a, b, c } b u1 c u2 abc u3 a cd u5 u4 abc C=2 D=T=- u6 de u7 Input Network u8 ab C=0 D=T=C=2 D=T=- a Naïve Complexity: O(n2r) abc Complexity with Preprocessing: O(nr2) C=0 D = - ac T=- C=1 D=T=- b bc c Query DAG Finding Skyline Nodes in Large Networks C=0 D=T=- C=2 D=T=- 7 Query DAG Construction the label. For each label, find a sorted list of nodes that contain Incremental DAG construction. u0 = q Q = { a, b, c } u4 a b u1 abc c u6 u7 c cd u5 u6 de u7 u4 u3 a abc u3 ab u2 u4 u7 u8 a u1 u5 b u2 Finding Skyline Nodes in Large Networks 8 Query DAG Construction (cont.) the label. For each label, find a sorted list of nodes that contains lists in order. Consider the labels and their sorted u0 = q a Q = { a, b, c } b u1 abc c cd u5 u6 de u7 u7 u3 a abc u4 ab u2 u4 abc u8 a u1 u5 b u2 Finding Skyline Nodes in Large Networks c u3 u6 9 Query DAG Construction (cont.) the label. For each label, find a sorted list of nodes that contains lists in order. Consider the labels and their sorted u0 = q a Q = { a, b, c } b u1 abc c ab u2 a u6 de u7 ac u7 bc cd u5 abc u4 u3 a u4 abc u8 u1 b u5 u2 Finding Skyline Nodes in Large Networks c u3 u6 10 Find Dominance Variable Perform a topological ordering of the DAG nodes to evaluate the Dominance variable (D) of each DAG node. # Nodes dominated (or equal) by coverage. u0 = q a Q = { a, b, c } b u1 c u2 abc u3 a cd u5 u4 abc C=2 D=7 T=- u6 de u7 Input Network u8 ab C=0 D=3 T=C=2 D=2 T=- a Naïve Complexity: O(n2r) abc C=0 D = 4 ac T=- C=1 D=1 b T=Query DAG Complexity by Topological Ordering: O(3r) bc c Finding Skyline Nodes in Large Networks C=0 D=3 T=- C=2 D=2 T=- 11 Find Traversal Variable Perform a Breadth First Search (BFS) starting from the query node. # Nodes not dominated by distance. u0 = q a b u1 abc c u2 u4 C=2 D=7 T=1 Q = { a, b, c } a cd u6 u5 abc ab u3 de u7 Input Network C=0 D=3 T=0 h =2 C = 2 u8 D=2 T=2 a Complexity by BFS: O(n+e) abc C=0 D = 4 ac T=0 C=1 D=1 b T=1 Query DAG bc c Finding Skyline Nodes in Large Networks C=0 D=3 T=0 C=2 D=2 T=2 12 Find Skyline Nodes Store DAG nodes into a Lookup Table. Skyline Bit for each DAG node. Helps to prune non-skyline nodes directly. u0 = q Q = { a, b, c } abc a b u1 abc c u2 u4 a Input Network bc u6 de u7 ac h =1 cd u5 abc ab u3 a b c u8 Query DAG Finding Skyline Nodes in Large Networks abc 0 ab 0 ac 0 bc 0 a 1 b 1 c 1 Lookup Table 13 Find Skyline Nodes (cont.) Store DAG nodes into a Lookup Table. Skyline Bit for each DAG node. Helps to prune non-skyline nodes directly. u0 = q Q = { a, b, c } abc a b u1 abc c u2 u4 a ac bc cd u5 abc ab u3 u6 h =2 de u7 Input Network a b c u8 Query DAG Finding Skyline Nodes in Large Networks abc 1 ab 1 ac 1 bc 1 a 1 b 1 c 1 Lookup Table 14 Dominance Count of Skyline Nodes DC(u4) = D(abc)-T(abc)-T(ab)-T(ac)-T(bc)-T(a)-T(b)-T(c)-1 = 3 Top-k Buffer to store top-k skyline nodes. u0 = q a b u1 abc ab C=0 D=3 T=0 u3 a cd u5 u6 h =2 abc abc c u2 u4 C=2 D=7 T=0 Q = { a, b, c } de u7 Input Network u8 C=2 a D=2 T=1 C = 0 ac D=4 T=0 C=1 D=1 b T=1 Query DAG bc C = 0 D=3 T=0 C=2 c D=2 T=1 Finding Skyline Nodes in Large Networks abc 1 ab 1 ac 1 bc 1 a 1 b 1 c 1 Lookup Table 15 Pruning and Early Termination DC(u4) = D(abc)-T(abc)-T(ab)-T(ac)-T(bc)-T(a)-T(b)-T(c)-1 = 3 Top-k Buffer to store top-k skyline nodes. Dominance Variable of a DAG node has smaller value than the smallest Dominance Count in the top-k buffer. Skyline Bits of all entries in the Lookup Table are 1’s. Finding Skyline Nodes in Large Networks 16 Experimental Results DC(u4) = D(abc)-T(abc)-T(ab)-T(ac)-T(bc)-T(a)-T(b)-T(c)-1 = 3 0.7M Nodes, Edges, 10 Node Labels (distinct). Top-k Buffer to store3M top-k skyline nodes. 5 Query Topics. Finding Skyline Nodes in Large Networks 17 Efficiency DC(u4) = D(abc)-T(abc)-T(ab)-T(ac)-T(bc)-T(a)-T(b)-T(c)-1 = 3 185M to Nodes, Node Labels (distinct). Top-k Buffer store 90M top-kEdges, skyline1000 nodes. 5 Query Topics, Top-5 Result Nodes. Finding Skyline Nodes in Large Networks 18 Conclusion and Future Works DC(u4) = D(abc)-T(abc)-T(ab)-T(ac)-T(bc)-T(a)-T(b)-T(c)-1 = 3 Efficient Algorithm to find top-k skyline nodes in large attributed network. Top-k Buffer to store top-k skyline nodes. Required experimental evaluation in real and synthetic datasets. Time Complexity is linear in the number of nodes and edges in the network. Distance based indexing might improve the efficiency. Top-k Skyline set instead of Top-k Skyline nodes might be more effective. Finding Skyline Nodes in Large Networks 19 Questions DC(u4) = D(abc)-T(abc)-T(ab)-T(ac)-T(bc)-T(a)-T(b)-T(c)-1 = 3 Top-k Buffer to store top-k skyline nodes. Thank You ! ! ! Finding Skyline Nodes in Large Networks 20