Finding Skyline Nodes in Large Networks
Arijit Khan
Vishwakarma Singh
Jian Wu
Motivation
If John is interested in Big Data, Cloud
Computing, and Map Reduce, who will be the top-5 people John should ask
about these topics?
Evaluation Metrics:
• Distance from the query node (John).
• Coverage of the query topics (Big Data, Cloud Computing, Map Reduce).
Homogeneous Approach?
If John is interested in Big Data, Cloud
Computing, and Map Reduce, who will be the top-5 people John should ask
about these topics?
Score = λ · Distance + (1 − λ) · Coverage
How should λ be chosen?
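The difficulty with a single weighted score can be seen in a minimal sketch. The two candidates and their values below are hypothetical, and both criteria are assumed to be scaled to [0, 1] with larger meaning better (so "closeness" stands in for the distance term). The point is only that the top answer flips as λ changes.

```python
# Hypothetical candidates: (closeness to the query node, fraction of query
# topics covered), both scaled to [0, 1] so that larger is better.
# The values are made up for illustration only.
CANDIDATES = {
    "near_but_narrow": (1.0, 1 / 3),
    "far_but_broad": (0.5, 1.0),
}


def score(closeness: float, coverage: float, lam: float) -> float:
    """Weighted sum in the form shown on the slide."""
    return lam * closeness + (1 - lam) * coverage


def best(lam: float) -> str:
    """Return the candidate with the highest score for a given lambda."""
    return max(CANDIDATES, key=lambda name: score(*CANDIDATES[name], lam))
```

With these made-up values, a small λ favors the broad but distant candidate and a large λ favors the close but narrow one, so the ranking depends on a parameter the user has no principled way to set.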
Weighted Set Cover?
• Find the nodes with the smallest aggregate distance from the query node such that they cover all query topics.
• Ignores some interesting nodes.
• Cannot rank the results.
[Figure: example input network with query node u0 = q, query topics Q = { a, b, c }, and nodes u1 to u8 labeled with topic sets such as a, b, c, abc, cd, and de.]
Graph Skyline
• Dominance on coverage: u >c v
The query topics covered by node u are a proper superset of the query topics covered by node v.
• Dominance on distance: u >d v
The distance of u from q is less than that of v from q.
• Dominance: u > v if
(1) u >c v and u ≥d v; or
(2) u ≥c v and u >d v.
[Figure: example input network with query node u0 = q, query topics Q = { a, b, c }, and labeled nodes u1 to u8.]
A node is a skyline node if it is not dominated by any other
node in the network.
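The dominance rules and the skyline definition can be sketched directly in code. This is a minimal brute-force illustration, not the algorithm of the talk. The toy nodes `v1` to `v6`, their distances, and their covered topics are hypothetical and are not the network in the figure. Each node is assumed to be summarized as a pair (distance from q, set of covered query topics).

```python
from typing import Dict, FrozenSet, List, Tuple

# (distance from the query node, query topics covered); hypothetical toy data.
Node = Tuple[int, FrozenSet[str]]

TOY: Dict[str, Node] = {
    "v1": (1, frozenset("a")),
    "v2": (1, frozenset("b")),
    "v3": (2, frozenset("abc")),
    "v4": (2, frozenset("a")),
    "v5": (3, frozenset("c")),
    "v6": (3, frozenset("abc")),
}


def dominates(u: Node, v: Node) -> bool:
    """u > v: strictly better on one criterion, at least as good on the other."""
    du, cu = u
    dv, cv = v
    strictly_more_coverage = cu > cv      # proper superset
    at_least_same_coverage = cu >= cv     # superset or equal
    strictly_closer = du < dv
    at_least_as_close = du <= dv
    return (strictly_more_coverage and at_least_as_close) or (
        at_least_same_coverage and strictly_closer
    )


def skyline(nodes: Dict[str, Node]) -> List[str]:
    """Nodes that no other node dominates (brute force, all pairs)."""
    return sorted(
        u
        for u in nodes
        if not any(dominates(nodes[v], nodes[u]) for v in nodes if v != u)
    )
```

On the toy data, `v4`, `v5`, and `v6` are all dominated by `v3`, while `v1` and `v2` survive because they are closer to q than any node covering a superset of their topics.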
Ranking of Skyline Nodes
• Too many skyline nodes.
• Rank them.
• Dominance count: the number of nodes dominated by a skyline node. [Lin et al., ICDE '07]
• Higher dominance count => more pruning from the candidate set.
• 1. DC(u4) = {u5, u6, u7}; 2. DC(u1) = {u5}; 3. DC(u2) = Φ; 4. DC(u3) = Φ
[Figure: example input network with query node u0 = q, query topics Q = { a, b, c }, and labeled nodes u1 to u8.]
Given a query node and a set of query topics in a
network, find the top-k skyline nodes with maximum dominance count.
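As a baseline for this problem statement, the top-k skyline nodes can be found by comparing all pairs of nodes. The sketch below is that naive baseline only, on hypothetical toy data (nodes `v1` to `v6`, not the figure's network), with each node given as (distance from q, covered query topics). Ties in dominance count are broken by node name, which is an arbitrary choice made here.

```python
from typing import Dict, FrozenSet, List, Tuple

Node = Tuple[int, FrozenSet[str]]  # (distance from q, covered query topics)

# Hypothetical toy data, made up for illustration.
TOY: Dict[str, Node] = {
    "v1": (1, frozenset("a")),
    "v2": (1, frozenset("b")),
    "v3": (2, frozenset("abc")),
    "v4": (2, frozenset("a")),
    "v5": (3, frozenset("c")),
    "v6": (3, frozenset("abc")),
}


def dominates(u: Node, v: Node) -> bool:
    """Strictly better on coverage or distance, and not worse on the other."""
    (du, cu), (dv, cv) = u, v
    return (cu > cv and du <= dv) or (cu >= cv and du < dv)


def top_k_skyline(nodes: Dict[str, Node], k: int) -> List[Tuple[str, int]]:
    """Return up to k skyline nodes with the largest dominance counts."""
    names = list(nodes)
    # For every node, the list of nodes it dominates.
    dominated_by = {
        u: [v for v in names if v != u and dominates(nodes[u], nodes[v])]
        for u in names
    }
    # A skyline node appears in nobody's dominated list.
    skyline = [u for u in names if not any(u in dominated_by[v] for v in names)]
    ranked = sorted(skyline, key=lambda u: (-len(dominated_by[u]), u))
    return [(u, len(dominated_by[u])) for u in ranked[:k]]
```

The all-pairs comparison is what the Query DAG machinery on the following slides is designed to avoid.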
Algorithm
• Construct a Query DAG.
• Three variables are associated with each DAG node: Count (C), Dominance (D), and Traversal (T).
• Naïve Complexity: O(n2r)
• Complexity with Preprocessing: O(nr2)
[Figure: input network (query node u0 = q, Q = { a, b, c }, nodes u1 to u8) next to its Query DAG over the non-empty subsets of Q.]
Query DAG, initial values:
DAG node   C   D   T
abc        2   -   -
ab         0   -   -
ac         0   -   -
bc         0   -   -
a          2   -   -
b          1   -   -
c          2   -   -
Query DAG Construction
• For each label, find a sorted list of the nodes that contain the label.
• Incremental DAG construction.
[Figure: input network with the per-label sorted node lists and the partially built Query DAG.]
Query DAG Construction (cont.)
• For each label, find a sorted list of the nodes that contain the label.
• Consider the labels and their sorted lists in order.
[Figure: input network with the sorted node lists for labels a, b, and c, and the Query DAG after adding the nodes a, b, c, ab, and abc.]
Query DAG Construction (cont.)
• For each label, find a sorted list of the nodes that contain the label.
• Consider the labels and their sorted lists in order.
[Figure: input network with the sorted node lists for labels a, b, and c, and the complete Query DAG with nodes a, b, c, ab, ac, bc, and abc.]
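The construction steps can be sketched as follows: build one sorted node list per label, walk the query labels' lists to collect each node's covered topics, and count how many nodes fall on each DAG node (the Count variable C). This is a simplified sketch rather than the incremental procedure of the slides. The label assignment in `NODE_LABELS` is an assumed reading of the example network, chosen so that the resulting counts match the C values shown for the Query DAG.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set

# Assumed labels for u1..u8 (an interpretation of the figure, not confirmed).
NODE_LABELS: Dict[str, Set[str]] = {
    "u1": {"a"}, "u2": {"b"}, "u3": {"c"}, "u4": {"a", "b", "c"},
    "u5": {"a"}, "u6": {"c", "d"}, "u7": {"a", "b", "c"}, "u8": {"d", "e"},
}


def build_label_lists(node_labels: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Inverted index: label -> list of nodes carrying it, sorted by node id."""
    lists: Dict[str, List[str]] = defaultdict(list)
    for node in sorted(node_labels):          # sorted ids keep each list sorted
        for label in node_labels[node]:
            lists[label].append(node)
    return dict(lists)


def build_query_dag(
    label_lists: Dict[str, List[str]], query: Iterable[str]
) -> Dict[FrozenSet[str], int]:
    """Count (C) for every non-empty subset of the query topics."""
    topics = sorted(query)
    count = {
        frozenset(combo): 0
        for size in range(1, len(topics) + 1)
        for combo in combinations(topics, size)
    }
    covered: Dict[str, Set[str]] = defaultdict(set)
    for topic in topics:                       # consider the labels in order
        for node in label_lists.get(topic, []):
            covered[node].add(topic)
    for topics_of_node in covered.values():
        count[frozenset(topics_of_node)] += 1
    return count
```

Nodes that cover no query topic (u8 under the assumed labels) never appear in any of the query labels' lists, so they are skipped without being inspected.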
Find Dominance Variable
• Perform a topological ordering of the DAG nodes to evaluate the Dominance variable (D) of each DAG node.
• D = number of nodes dominated (or equaled) on coverage.
• Naïve Complexity: O(n2r)
• Complexity by Topological Ordering: O(3r)
[Figure: input network (query node u0 = q, Q = { a, b, c }, nodes u1 to u8) next to its Query DAG.]
Query DAG after computing D:
DAG node   C   D   T
abc        2   7   -
ab         0   3   -
ac         0   4   -
bc         0   3   -
a          2   2   -
b          1   1   -
c          2   2   -
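The D values in the Query DAG can be reproduced from the C values: D of a DAG node is the sum of C over that node and all of its non-empty subsets. The sketch below uses plain subset enumeration instead of the topological-order pass named on the slide, so it shows what is computed rather than how the authors compute it. The `COUNT` table is taken from the C values of the example DAG.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterator

# Count (C) per DAG node, as shown for the example Query DAG.
COUNT: Dict[FrozenSet[str], int] = {
    frozenset("abc"): 2,
    frozenset("ab"): 0, frozenset("ac"): 0, frozenset("bc"): 0,
    frozenset("a"): 2, frozenset("b"): 1, frozenset("c"): 2,
}


def nonempty_subsets(topics: FrozenSet[str]) -> Iterator[FrozenSet[str]]:
    """All non-empty subsets of a topic set, including the set itself."""
    items = sorted(topics)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def dominance_variable(count: Dict[FrozenSet[str], int]) -> Dict[FrozenSet[str], int]:
    """D(S) = number of nodes whose coverage is S or a non-empty subset of S."""
    return {
        node: sum(count.get(sub, 0) for sub in nonempty_subsets(node))
        for node in count
    }
```

For example, D(ab) = C(ab) + C(a) + C(b) = 0 + 2 + 1 = 3. Enumerating every (set, subset) pair in this way touches on the order of 3^r pairs for r query topics.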
Find Traversal Variable
• Perform a Breadth First Search (BFS) starting from the query node.
• T = number of nodes not dominated on distance.
• Complexity by BFS: O(n+e)
[Figure: input network (query node u0 = q, Q = { a, b, c }, nodes u1 to u8) next to its Query DAG, with the BFS at hop h = 2.]
Query DAG at h = 2:
DAG node   C   D   T
abc        2   7   1
ab         0   3   0
ac         0   4   0
bc         0   3   0
a          2   2   2
b          1   1   1
c          2   2   2
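One way to maintain the Traversal variable is to run the BFS hop by hop and, after each hop, increment T for the DAG node matching each newly reached node's coverage. The sketch below follows that reading, which is an assumption about the bookkeeping. The adjacency list `ADJ` and the coverage map are hypothetical toy data, not the figure's network.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterator, List, Tuple

# Hypothetical undirected toy network around the query node "q".
ADJ: Dict[str, List[str]] = {
    "q": ["v1", "v2"],
    "v1": ["q", "v3"], "v2": ["q", "v4"],
    "v3": ["v1", "v5"], "v4": ["v2", "v6"],
    "v5": ["v3"], "v6": ["v4"],
}
# Query topics covered by each node (hypothetical).
COVERAGE: Dict[str, FrozenSet[str]] = {
    "v1": frozenset("a"), "v2": frozenset("b"), "v3": frozenset("abc"),
    "v4": frozenset("a"), "v5": frozenset("c"), "v6": frozenset("abc"),
}


def traversal_snapshots(
    adj: Dict[str, List[str]], coverage: Dict[str, FrozenSet[str]], q: str
) -> Iterator[Tuple[int, Dict[FrozenSet[str], int]]]:
    """Yield (hop, T) after each BFS level; T counts visited nodes per coverage."""
    seen = {q}
    frontier = [q]
    traversed: Counter = Counter()
    hop = 0
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj.get(u, []):
                if v not in seen:
                    seen.add(v)
                    next_frontier.append(v)
        hop += 1
        for v in next_frontier:
            if coverage.get(v):                 # skip nodes covering no topic
                traversed[coverage[v]] += 1
        if next_frontier:
            yield hop, dict(traversed)
        frontier = next_frontier
```

Each node and edge is visited once, which matches the O(n+e) cost of BFS stated above.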
Find Skyline Nodes
• Store the DAG nodes in a Lookup Table, with a Skyline Bit for each DAG node.
• Helps to prune non-skyline nodes directly.
[Figure: input network (query node u0 = q, Q = { a, b, c }, nodes u1 to u8) next to its Query DAG, with the BFS at hop h = 1.]
Lookup Table at h = 1:
DAG node   Skyline Bit
abc        0
ab         0
ac         0
bc         0
a          1
b          1
c          1
Find Skyline Nodes (cont.)
• Store the DAG nodes in a Lookup Table, with a Skyline Bit for each DAG node.
• Helps to prune non-skyline nodes directly.
[Figure: input network (query node u0 = q, Q = { a, b, c }, nodes u1 to u8) next to its Query DAG, with the BFS at hop h = 2.]
Lookup Table at h = 2:
DAG node   Skyline Bit
abc        1
ab         1
ac         1
bc         1
a          1
b          1
c          1
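A possible reading of the lookup table is sketched below: once a node with coverage S has been reached, the skyline bits of S and all its subsets are set to 1, and any node reached at a later hop whose coverage already has bit 1 is pruned as dominated. The exact update rule is an assumption, in particular the level-by-level update and the same-hop superset check. The `LEVELS` input is hypothetical toy data grouped by hop distance.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterator, List, Tuple


def nonempty_subsets(topics: FrozenSet[str]) -> Iterator[FrozenSet[str]]:
    """All non-empty subsets of a topic set, including the set itself."""
    items = sorted(topics)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


# Hypothetical BFS output: nodes grouped by hop, each with its covered topics.
LEVELS: List[List[Tuple[str, FrozenSet[str]]]] = [
    [("v1", frozenset("a")), ("v2", frozenset("b"))],
    [("v3", frozenset("abc")), ("v4", frozenset("a"))],
    [("v5", frozenset("c")), ("v6", frozenset("abc"))],
]


def skyline_with_bits(
    levels: List[List[Tuple[str, FrozenSet[str]]]], query: FrozenSet[str]
) -> Tuple[List[str], Dict[FrozenSet[str], int]]:
    """Collect skyline nodes hop by hop, pruning with one bit per DAG node."""
    bit = {node: 0 for node in nonempty_subsets(query)}
    skyline: List[str] = []
    for level in levels:
        coverages = [cov for _, cov in level]
        for name, cov in level:
            if not cov or bit[cov]:
                continue                      # dominated by a closer node
            if any(other > cov for other in coverages):
                continue                      # dominated by a same-hop superset
            skyline.append(name)
        for cov in coverages:                 # update bits after the whole hop
            for sub in nonempty_subsets(cov):
                bit[sub] = 1
        if all(bit.values()):
            break                             # every later node is dominated
    return skyline, bit
```

On the toy data the bits for a and b are set after the first hop and all bits are set after the second hop, so the third hop is never examined.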
Dominance Count of Skyline Nodes
• DC(u4) = D(abc) - T(abc) - T(ab) - T(ac) - T(bc) - T(a) - T(b) - T(c) - 1 = 3
• Top-k Buffer to store the top-k skyline nodes.
[Figure: input network (query node u0 = q, Q = { a, b, c }, nodes u1 to u8) next to its Query DAG and Lookup Table, with the BFS at hop h = 2.]
Query DAG when u4 is evaluated:
DAG node   C   D   T   Skyline Bit
abc        2   7   0   1
ab         0   3   0   1
ac         0   4   0   1
bc         0   3   0   1
a          2   2   1   1
b          1   1   1   1
c          2   2   1   1
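The DC(u4) computation can be checked with a few lines of code. The sketch generalizes the formula to any DAG node S as D(S), minus T over S and all its non-empty subsets, minus 1 for the node itself. That generalization is inferred from the single example on the slide. The D and T values are the ones shown for the moment u4 is evaluated.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterator


def nonempty_subsets(topics: FrozenSet[str]) -> Iterator[FrozenSet[str]]:
    """All non-empty subsets of a topic set, including the set itself."""
    items = sorted(topics)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


# Dominance (D) and Traversal (T) values from the example when u4 is evaluated.
D: Dict[FrozenSet[str], int] = {
    frozenset("abc"): 7,
    frozenset("ab"): 3, frozenset("ac"): 4, frozenset("bc"): 3,
    frozenset("a"): 2, frozenset("b"): 1, frozenset("c"): 2,
}
T: Dict[FrozenSet[str], int] = {
    frozenset("abc"): 0,
    frozenset("ab"): 0, frozenset("ac"): 0, frozenset("bc"): 0,
    frozenset("a"): 1, frozenset("b"): 1, frozenset("c"): 1,
}


def dominance_count(
    coverage: FrozenSet[str],
    dominance: Dict[FrozenSet[str], int],
    traversal: Dict[FrozenSet[str], int],
) -> int:
    """Nodes dominated on coverage, minus those already traversed, minus self."""
    already_traversed = sum(traversal.get(sub, 0) for sub in nonempty_subsets(coverage))
    return dominance[coverage] - already_traversed - 1
```

With the example values this gives 7 - (0 + 0 + 0 + 0 + 1 + 1 + 1) - 1 = 3 for u4, matching the slide.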
Pruning and Early Termination
• Pruning: skip a DAG node whose Dominance variable is smaller than the smallest dominance count in the top-k buffer.
• Early termination: stop once the skyline bits of all entries in the Lookup Table are 1.
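The top-k buffer and the pruning test can be sketched with a bounded min-heap, so that the smallest dominance count in the buffer is always at the front. The class name `TopKBuffer` and its methods are hypothetical names introduced here, and the tie-breaking on node name is an arbitrary choice.

```python
import heapq
from typing import List, Tuple


class TopKBuffer:
    """Keeps the k skyline nodes with the largest dominance counts seen so far."""

    def __init__(self, k: int) -> None:
        self.k = k
        self._heap: List[Tuple[int, str]] = []   # min-heap of (count, node)

    def offer(self, node: str, dominance_count: int) -> None:
        """Insert a skyline node, evicting the current minimum if full."""
        item = (dominance_count, node)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, item)
        elif item > self._heap[0]:
            heapq.heapreplace(self._heap, item)

    def can_prune(self, dominance_variable: int) -> bool:
        """True if a DAG node's D is below the smallest buffered count."""
        return len(self._heap) == self.k and dominance_variable < self._heap[0][0]

    def result(self) -> List[Tuple[int, str]]:
        """Buffered entries, largest dominance count first."""
        return sorted(self._heap, reverse=True)
```

Because a node's dominance count cannot exceed the Dominance variable of its DAG node, a DAG node that fails the `can_prune` test cannot contribute a node to the top-k result.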
Experimental Results
• 0.7M nodes, 3M edges, 10 distinct node labels.
• 5 query topics.
Efficiency
• 185M nodes, 90M edges, 1000 distinct node labels.
• 5 query topics, top-5 result nodes.
Conclusion and Future Works
• Efficient algorithm to find the top-k skyline nodes in a large attributed network.
• Experimental evaluation on real and synthetic datasets is required.
• Time complexity is linear in the number of nodes and edges in the network. Distance-based indexing might improve the efficiency.
• A top-k skyline set, instead of top-k skyline nodes, might be more effective.
Questions
Thank you!