Dealing with Diversity in
Mining and Query Processing
Jeffrey Xu Yu (于旭)
Department of Systems Engineering and Engineering
Management
The Chinese University of Hong Kong
yu@se.cuhk.edu.hk
Books on Social Networks

Social and Economic Networks
by Matthew O. Jackson

Social Network Data Analysis
by Charu C. Aggarwal

Exploratory Social Network Analysis with Pajek by Wouter de
Nooy, Andrej Mrvar, and Vladimir Batagelj

Networks, Crowds, and Markets: Reasoning about a Highly
Connected World
by David Easley and Jon Kleinberg

Networks: An Introduction
by M.E.J. Newman
Some Online Courses



Mining of Massive Datasets (Anand Rajaraman and Jeff Ullman)
http://infolab.stanford.edu/~ullman/mmds.html
Networks, Crowds, and Markets: Reasoning about a highly
connected world, by David Easley and Jon Kleinberg
http://www.cs.cornell.edu/home/kleinber/networks-book
Topics in Data Management & Mining – Social Networks, Laks
V.S. Lakshmanan http://www.cs.ubc.ca/~laks/534l/cpsc534l.html
Stanford Large Network Dataset Collection
http://snap.stanford.edu/data












Social networks
Communication networks
Citation networks
Collaboration networks
Web graphs
Amazon networks
Internet networks
Road networks
Autonomous systems
Signed networks
Wikipedia networks and metadata
Twitter and Memetracker
Graph Database
http://en.wikipedia.org/wiki/Graph_database




Pregel: Google’s internal graph processing platform
Trinity: Microsoft Research Asia
Neo4j: commercial graph database
…
Diversified Ranking

Why diversified ranking?
 Users' information needs are diverse
 Queries are often incomplete
Problem Statement



For query-dependent diversity ranking, the goal is to find
K nodes in a graph that are relevant to the query node
and dissimilar to each other.
For query independent diversity ranking, the goal is to
find K prestige nodes in a graph that are dissimilar to
each other.
Main applications
 Ranking nodes in social network, ranking papers, etc.
Challenges



Diversity measures
 No widely accepted diversity measures on graphs in the
literature.
Scalability
 Most existing methods do not scale to large
graphs.
Lack of intuitive interpretation.
Existing Methods





Grasshopper [Zhu, et al., HLT-NAACL’07]
ManiRank [Zhu, et al., WWW’11]
DivRank [Mei, et al., KDD’10]
DRAGON [Tong, et al., KDD’11]
Resistive Graph Centers [Dubey, et al., KDD’11]
Grasshopper/ManiRank

The main idea

Work in an iterative manner.

Select a node at one iteration by random walk.
Set the selected node to be an absorbing node, and perform
random walk again to select the second node.
Perform the same process K iterations to get K nodes.



No diversity measure


Achieving diversity only by intuition and experiments.
Cannot scale to large graphs (time complexity $O(Kn^2)$)
Grasshopper/ManiRank

Initial random walk with no absorbing states

Absorbing random walk after ranking the first item
DivRank




Based on a vertex-reinforced random walk.
No diversity measure.
Convergence properties are not clear.
Time and space complexity is $O(n^2)$
DRAGON, Resistive Graph Centers

DRAGON [Tong, et al., KDD’11]


Diversity measure lacks a clear topological interpretation
Resistive Graph Centers [Dubey, et al., KDD’11]


Based on personalized PageRank with a learnable teleportation
parameter.
Does not scale to large graphs.
A Summary

Comparison with existing methods
Our Approach

The main idea



Relevance of the top-K nodes (denoted by a set S) is achieved by the
large (Personalized) PageRank scores.
Diversity of the top-K nodes is achieved by large expansion ratio.
Expansion ratio of a set of nodes S: σ(S)=|N(S)|/n

Larger expansion ratio implies better diversity
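
As a small illustration of the expansion ratio, the sketch below computes σ(S) on a toy graph. It assumes that N(S) contains S itself together with all direct neighbors of S; the six-node graph and the function name `expansion_ratio` are hypothetical and only serve the example.

```python
def expansion_ratio(adj, S):
    """Return sigma(S) = |N(S)| / n for an adjacency dict `adj`."""
    covered = set(S)              # assumption: N(S) includes S itself
    for u in S:
        covered.update(adj[u])    # plus every direct neighbor of a node in S
    return len(covered) / len(adj)


# Hypothetical undirected graph: two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}

clustered = expansion_ratio(adj, {0, 1})   # both nodes sit in one triangle
spread = expansion_ratio(adj, {0, 4})      # one node per triangle
print(clustered, spread)
```

The set that spreads over both triangles reaches more of the graph than the clustered one, which is the intuition behind using expansion as a diversity measure.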
The K-step Expansion

K-step expansion ratio of S: σk(S)=|Nk(S)|/n

Our diversity measures
A Discrete Optimization Problem

Diversified ranking problem on graph as a discrete
optimization problem.

Submodularity


F(S) is shown to be submodular and non-decreasing.
The greedy algorithm


A 1-1/e approximation algorithm for solving Eq. (1).
Linear time and space complexity w.r.t. the size of the graph.
The Greedy Algorithm
Marginal gain


Works in K rounds
Select a node with maximal marginal gain at one round
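
The K-round greedy selection can be sketched as follows. Eq. (1) is not reproduced in this text, so the objective used here, (1 − λ) times the summed relevance plus λ times the expansion ratio, is an assumption chosen only to make the marginal gain concrete; `rel`, `lam`, and the toy graph are hypothetical.

```python
def greedy_diversified(adj, rel, K, lam=0.5):
    """Pick K nodes, each round taking the node with maximal marginal gain.

    Assumed objective: F(S) = (1 - lam) * sum(rel[u] for u in S)
                              + lam * |N(S)| / n, with N(S) including S.
    """
    n = len(adj)
    chosen, covered = [], set()
    for _ in range(K):
        best_u, best_gain = None, float("-inf")
        for u in sorted(adj):
            if u in chosen:
                continue
            newly_covered = ({u} | adj[u]) - covered
            gain = (1 - lam) * rel[u] + lam * len(newly_covered) / n
            if gain > best_gain:
                best_u, best_gain = u, gain
        if best_u is None:        # fewer than K nodes in the graph
            break
        chosen.append(best_u)
        covered |= {best_u} | adj[best_u]
    return chosen


# Hypothetical graph (two triangles joined by 2-3) and relevance scores.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
rel = {0: 0.30, 1: 0.29, 2: 0.10, 3: 0.10, 4: 0.20, 5: 0.05}
picked = greedy_diversified(adj, rel, K=2)
print(picked)
```

Ranking by relevance alone would return nodes 0 and 1 from the same triangle; with the expansion term the second pick moves to the other triangle.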
Generalized Diversified Ranking
Optimization

Maximize Fk(S) subject to cardinality constraint
|S| <= K

Submodularity



Fk(S) is shown to be submodular and non-decreasing.
Randomized greedy algorithm

Near 1-1/e approximation algorithm.

Linear time and space complexity w.r.t. the size of the graph.
Generalized Diversified Ranking
Optimization

Randomized greedy algorithm



Same idea as the greedy algorithm
Works in K rounds
At each round, select the node with maximal marginal gain. But,
evaluating the maximal marginal gain is expensive.
$w_u \, (|N_k(S \cup \{u\})| - |N_k(S)|)$

Marginal gain
Our idea: Use a probabilistic counting data structure
to sketch the k-step neighborhood for each node.
FM Sketch and Its Properties




A probabilistic counting structure, devised by Flajolet
and Martin.
Used to estimate the cardinality of a multi-set using
only log C + t bits, where C denotes the cardinality and t
is a small constant.
Each FM Sketch is a (log C + t)-bit bitmap.
Advantage: To estimate the cardinality of the union of
two multi-sets, we only need to do a bitwise-OR
between the two FM Sketches.
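
The bitwise-OR property can be seen in a minimal single-bitmap sketch. This is a simplified illustration, not the exact structure used in the talk: the 32-bit length, the SHA-256 based hash, and the helper names are assumptions, and a single bitmap gives only a coarse estimate (practical uses average many bitmaps). The constant 0.77351 is the standard Flajolet-Martin correction factor.

```python
import hashlib

NUM_BITS = 32  # assumed bitmap length, standing in for log C + t


def _lowest_set_bit(x):
    """Index of the lowest 1-bit of x (NUM_BITS - 1 if x is 0)."""
    if x == 0:
        return NUM_BITS - 1
    return (x & -x).bit_length() - 1


def fm_sketch(items):
    """Build one FM bitmap; duplicates hash to the same bit and change nothing."""
    bitmap = 0
    for item in items:
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[:4], "big")
        bitmap |= 1 << _lowest_set_bit(h)
    return bitmap


def fm_estimate(bitmap):
    """Estimate the cardinality from the position of the lowest 0-bit."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return (2 ** r) / 0.77351


a = list(range(1000))
b = list(range(500, 1500))
union_by_or = fm_sketch(a) | fm_sketch(b)   # no access to the items needed
print(fm_estimate(union_by_or))
```

Because each item only ever sets one fixed bit, OR-ing two bitmaps gives exactly the bitmap that would have been built from the union of the two multi-sets.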
The Randomized Greedy Algorithm

Randomized greedy algorithm


For each node u, use FM Sketch to sketch Nk({u})
Use the following rule to sketch Nk({u}), which can be implemented in a
recursive way
$N_k(\{u\}) = \bigcup_{(u,v) \in E} N_{k-1}(\{v\})$


Use FM sketch to sketch Nk(S)
Evaluating the marginal gain can be implemented by a bitwise-OR
between Nk(S) and Nk({u})
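
The recursion and the OR-based marginal gain can be illustrated with exact bitmasks standing in for FM sketches (bit v is set when node v is in the neighborhood). Two assumptions are made here: N_0({u}) = {u}, and every node carries a self-loop, so that the rule yields all nodes within k hops. The path graph is hypothetical.

```python
def k_step_masks(adj, k):
    """masks[u] has bit v set iff v is within k hops of u."""
    masks = {u: 1 << u for u in adj}        # assumption: N_0({u}) = {u}
    for _ in range(k):
        nxt = {}
        for u in adj:
            m = masks[u]                    # assumed self-loop (u, u)
            for v in adj[u]:
                m |= masks[v]               # union of N_{k-1}({v}) via OR
            nxt[u] = m
        masks = nxt
    return masks


def marginal_gain(mask_S, mask_u):
    """|N_k(S ∪ {u})| - |N_k(S)| computed from the two bitmasks."""
    return bin(mask_S | mask_u).count("1") - bin(mask_S).count("1")


# Hypothetical path graph 0 - 1 - 2 - 3 - 4.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
masks = k_step_masks(adj, k=2)
gain = marginal_gain(masks[0], masks[4])
print(bin(masks[0]), bin(masks[4]), gain)
```

With FM sketches in place of the exact masks, the same OR gives an estimated rather than exact neighborhood size, while each update stays a constant-size bit operation.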
Experimental Studies


We conduct experiments on 5 real networks (3
collaboration networks, 1 citation network, and 1 social
network).
We show some results with Flickr, which is a popular
photo-sharing website (from the ASU social computing data
repository).
 Undirected social network (80,513 nodes and
5,899,882 edges, and 195 different groups)
Some Testing Results on Flickr
Make a Top-K Algorithm Diversified
The result of searching “apple” in Google image

Existing top-𝐾 search algorithms


Search results are ranked independently
When searching “apple” in Google image search, 9 of the top 15 results are the
logo of Apple Inc.
Structural Keyword Search (1)
[Figure: a DBLP graph fragment and four result trees 𝑣1 to 𝑣4 for the keywords “graph patterns” and “keyword search”. Node labels:]
a1: Author: Jiawei Han
w1, w2, w3, w4: Action: Write
p1: Paper: Mining Graph Patterns
p2: Paper: Optimizing Index for Taxonomy Keyword Search
p3: Paper: Mining Significant Graph Patterns by Leap Search
p4: Paper: Keyword Search in Text Cube: Finding Top-k Aggregated Cell Documents
Example: Keyword Search in Graphs


Input: a graph with text information on each node, and a user given keyword query
Output: top-k of minimal Steiner trees that contain all user given keywords
Structural Keyword Search (2)
[Figure: the four result trees 𝑣1 to 𝑣4 built from a1, w1 to w4, and p1 to p4, with pairwise similarities of 0.6 or 0.2 between results]
𝑣1 score=0.8, 𝑣2 score=0.5, 𝑣3 score=0.5, 𝑣4 score=0.4
Suppose the similarity of 𝑣𝑖 and 𝑣𝑗 is $sim(v_i, v_j) = \frac{|v_i \cap v_j|}{\max\{|v_i|, |v_j|\}}$, e.g., $sim(v_1, v_2) = \frac{3}{5} = 0.6$

Let 𝐾 = 2

{𝑣1 , 𝑣4 } is better than {𝑣1 , 𝑣2 } because 𝑣1 and 𝑣2 are similar to each other

{𝑣1 , 𝑣4 } is better than {𝑣2 , 𝑣3 } because {𝑣1 , 𝑣4 } has a larger total score
Diversified Top-K

We should consider both similarity and score
Let 𝑆 = {𝑣1 , 𝑣2 , … } be a list of search results
Let 𝑠𝑐𝑜𝑟𝑒(𝑣𝑖 ) be the score of result 𝑣𝑖

Let 𝑠𝑖𝑚 𝑣𝑖 , 𝑣𝑗 be the similarity of 𝑣𝑖 and 𝑣𝑗

For any 𝑣𝑖 , 𝑣𝑗




𝑣𝑖 and 𝑣𝑗 are similar ⇔ 𝑠𝑖𝑚 𝑣𝑖 , 𝑣𝑗 > 𝜏

𝜏: a user given threshold
Diversified top-𝐾 result set 𝐷:



At most 𝐾 results: |𝐷| ≤ 𝐾
No two results in 𝐷 are similar
Total score of results in 𝐷 is maximized
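
The three conditions can be checked directly with a brute-force search on a tiny instance. The scores below are the ones from the keyword search example; of the pairwise similarities only sim(𝑣1, 𝑣2) = 0.6 is stated in the text, so the other values and the threshold τ = 0.5 are assumptions, and exhaustive enumeration is shown only to make the definition concrete.

```python
from itertools import combinations


def diversified_top_k(scores, sim, K, tau):
    """Enumerate all sets of at most K results with no similar pair."""
    best_set, best_score = (), 0.0
    for size in range(1, K + 1):
        for cand in combinations(sorted(scores), size):
            has_similar_pair = any(
                sim.get(frozenset(pair), 0.0) > tau
                for pair in combinations(cand, 2)
            )
            if has_similar_pair:
                continue
            total = sum(scores[v] for v in cand)
            if total > best_score:
                best_set, best_score = cand, total
    return set(best_set), best_score


scores = {"v1": 0.8, "v2": 0.5, "v3": 0.5, "v4": 0.4}
sim = {  # only the v1-v2 value is given in the text; the rest are assumed
    frozenset({"v1", "v2"}): 0.6, frozenset({"v1", "v3"}): 0.6,
    frozenset({"v2", "v4"}): 0.6, frozenset({"v3", "v4"}): 0.6,
    frozenset({"v1", "v4"}): 0.2, frozenset({"v2", "v3"}): 0.2,
}
D, total = diversified_top_k(scores, sim, K=2, tau=0.5)
print(D, total)
```

Under these assumed similarities the search returns {𝑣1, 𝑣4}, in line with the comparison made in the example.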
A Diversity Graph
[Figure: a diversity graph over results 𝑣1 to 𝑣6 with a score on each node, drawn once for 𝐾 = 2 and once for 𝐾 = 3]
𝐾 = 2, 𝐷 = {𝑣1 , 𝑣2 }
𝐾 = 3, 𝐷 = {𝑣1 , 𝑣2 }
Diversity Graph 𝐺

Undirected graph
∀𝑣𝑖 , 𝑣𝑗 , there is an edge (𝑣𝑖 ,𝑣𝑗 ) in 𝐺 ⟺ 𝑣𝑖 is similar to 𝑣𝑗

The diversified top-𝐾 result set is an independent set of 𝐺
Existing Top-K Search Frameworks


Most existing top-K search frameworks avoid exploring all search
results by finding an early stop condition.
Incremental Top-K



Results are generated one by one in ranked order
Stops when K results are output
Bounding Top-K



Results are generated not necessarily in ranked order.
A non-increasing score upper bound for unseen result u is maintained.
Stop when the K-th largest score generated is no smaller than u.
Our Framework

We support the existing top-K frameworks



Results are generated one by one
Stops if a certain stop condition is satisfied
Our framework
Step 1
• Check the stop condition sufficient()
• Stop if sufficient() is satisfied
Step 2
• Generate the next result using the original top-K algorithm
Step 3
• Check the necessary() condition
• If necessary() is satisfied, search the diversified top-K results using div-search()
• Go to Step 1
We extend the existing algorithms to get top-K diversified results by adding three
new functions.



sufficient(): a new early stop condition
necessary(): the necessary stop condition
div-search(): search top-k diversified results on the current results
Sufficient Stop Condition

Sufficient stop condition sufficient()





𝑆 : the set of current generated results
𝑏𝑒𝑠𝑡(𝑆) : an upper bound of the optimal solution calculated from current
generated results 𝑆
𝐷𝑖 (𝑆) : the current diversified top-𝑖 results with score 𝑠𝑐𝑜𝑟𝑒(𝐷𝑖 (𝑆))
𝑢 : the score upper bound of all unseen results
For each 𝑖 < 𝐾, in the ideal situation the remaining 𝐾 − 𝑖 results
all come from the unseen results and each has score 𝑢

We have $best(S) = \max_{1 \le i \le K} \{ score(D_i(S)) + (K - i) \times u \}$

The sufficient stop condition is
$score(D_K(S)) \ge best(S)$
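
A minimal sketch of this check, assuming the scores of the current diversified top-𝑖 results are available in a list indexed by 𝑖. The values 10, 18, 20 and the two bounds 𝑢 are hypothetical.

```python
def best_bound(div_scores, K, u):
    """best(S): div_scores[i] = score(D_i(S)) for i = 1..K (index 0 unused)."""
    return max(div_scores[i] + (K - i) * u for i in range(1, K + 1))


def sufficient(div_scores, K, u):
    """True when no unseen result can still improve the top-K answer."""
    return div_scores[K] >= best_bound(div_scores, K, u)


div_scores = [0, 10, 18, 20]       # hypothetical score(D_1), score(D_2), score(D_3)
early = sufficient(div_scores, K=3, u=5)   # best = max(20, 23, 20) = 23
late = sufficient(div_scores, K=3, u=2)    # best = max(14, 20, 20) = 20
print(early, late)
```

As the upper bound 𝑢 for unseen results shrinks, best(𝑆) drops until the current top-𝐾 score meets it and the search can stop.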
Necessary Stop Condition

Necessary stop condition necessary()


𝑆 : the set of current generated results
Assume the stop condition of the original algorithm is satisfied




Otherwise the algorithm cannot stop
𝑆’ : the set of results when the last time necessary() is satisfied (or ∅ if
necessary() is never satisfied)
If 𝐷𝑖 (𝑆′) ≠ ∅ for a certain 1 ≤ 𝑖 ≤ 𝐾, we need at least 𝐾 − 𝑖 + 1 more
results generated in order to get 𝐾 results
The necessary stop condition is
$|S| \ge |S'| + K - \max\{ i \mid 1 \le i \le K, D_i(S') \ne \emptyset \}$
The Possible Search Algorithms

Given the diversity graph 𝐺 for the current generated result set 𝑆
Finding 𝐷(𝑆) on 𝐺 is an NP-Hard problem

Greed is Not Good
[Figure: diversity graph 𝐺 (𝐾 = 100) containing 𝑣0 with score 100, 𝑣1, …, 𝑣100 with score 99 each, and low-scored nodes 𝑢1, 𝑢2, 𝑢3, … with scores of at most 1; drawn once for the greedy solution and once for the optimal solution]
Greedy solution: score = 199
Optimal solution: score = 9900
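
The gap between the two scores can be reproduced on a hypothetical instance that is consistent with the numbers above. The edge set is an assumption (𝑣0 adjacent to every 𝑣𝑖, and each 𝑢𝑖 adjacent to 𝑣𝑖), and the greedy rule used is "repeatedly take the highest-scored result that is not similar to anything already taken".

```python
def greedy_independent(scores, adj, K):
    """Greedy: scan by decreasing score, skip nodes adjacent to a chosen one."""
    chosen, blocked = [], set()
    for v in sorted(scores, key=scores.get, reverse=True):
        if len(chosen) == K:
            break
        if v not in blocked:
            chosen.append(v)
            blocked.add(v)
            blocked |= adj[v]
    return chosen


# Hypothetical instance: v0 = 100, v1..v100 = 99 each, u1..u100 = 1 each.
scores = {"v0": 100}
adj = {"v0": set()}
for i in range(1, 101):
    v, u = f"v{i}", f"u{i}"
    scores[v], scores[u] = 99, 1
    adj[v], adj[u] = {"v0", u}, {v}       # assumed edges v0-vi and vi-ui
    adj["v0"].add(v)

greedy = greedy_independent(scores, adj, K=100)
greedy_score = sum(scores[x] for x in greedy)

optimal = [f"v{i}" for i in range(1, 101)]  # skip v0, take every vi
optimal_score = sum(scores[x] for x in optimal)
print(greedy_score, optimal_score)
```

Taking 𝑣0 first blocks all one hundred 𝑣𝑖, so greedy can only fill the remaining 99 slots with score-1 nodes.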
Three New Search Algorithms

We propose three exact algorithms



div-astar: an A* based approach
div-dp: decompose div-astar using operator ⊕
div-cut: further decompose div-dp using operators ⊕ and ⊗
[Figure: the NP-hard search problem drawn as blocks labelled NP for div-astar, div-dp, and div-cut; the decomposition steps split the problem into progressively more and smaller blocks]
An A* Based Approach


We use a heap 𝐻 to maintain partial solutions
Each partial solution is with form 𝑒 = (𝑠, 𝑠𝑐𝑜𝑟𝑒, 𝑢𝑏)





𝑠: the set of results selected in the partial solution
𝑠𝑐𝑜𝑟𝑒: the total score of results in 𝑠
𝑢𝑏: the upper bound of score if 𝑠 is expanded to a full solution
Entries in 𝐻 are expanded in non-increasing order of 𝑒. 𝑢𝑏
The algorithm stops when 𝑢𝑏 of the next solution is no larger than
the score of the current best solution
An A* Based Approach

Calculation of 𝑢𝑏
$ub = \max \sum_{v_i \in V} score(v_i)$
s.t.
$c_1: |V| \le K$
$c_2: s \subseteq V \subseteq V(G)$
$c_3: v_i.adj \cap s = \emptyset$
$c_4: \max\{ i \mid v_i \in s \} < \min\{ i \mid v_i \in (V - s) \}$
𝑣𝑖 . 𝑎𝑑𝑗 is the set of adjacent nodes of 𝑣𝑖 in 𝐺
The equation is a relaxation of the optimal solution w.r.t. 𝑠
𝑐4 is to avoid generating redundant results
𝑢𝑏 can be calculated in 𝑂(|𝑉(𝐺)|) time in the worst case
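
The heap-based search and the relaxed bound can be sketched together as below. This is a minimal reading of the slides, not the authors' implementation: nodes are assumed to be numbered in non-increasing score order, the bound adds the best remaining scores among later-indexed nodes not adjacent to 𝑠 while ignoring edges among those nodes, and the six-node graph uses an edge list that was chosen here so that the bounds match the values in the 𝐾 = 3 example of this deck.

```python
import heapq


def div_astar(scores, adj, K):
    """scores[i] and adj[i] (a set) for nodes 0..n-1, sorted by score desc."""
    n = len(scores)

    def candidates(s):
        # c4: only indices after the last chosen one; c3: not adjacent to s
        start = s[-1] + 1 if s else 0
        chosen = set(s)
        return [i for i in range(start, n) if not (adj[i] & chosen)]

    def upper_bound(s, score):
        # relaxation: ignore edges among the candidates themselves
        return score + sum(scores[i] for i in candidates(s)[:K - len(s)])

    best_set, best_score = (), 0
    heap = [(-upper_bound((), 0), 0, ())]
    while heap:
        neg_ub, score, s = heapq.heappop(heap)
        if -neg_ub <= best_score:          # stop condition from the slide
            break
        if len(s) == K:
            continue
        for i in candidates(s):
            child, child_score = s + (i,), score + scores[i]
            if child_score > best_score:
                best_set, best_score = child, child_score
            heapq.heappush(
                heap, (-upper_bound(child, child_score), child_score, child))
    return set(best_set), best_score


scores = [10, 8, 7, 7, 6, 3]               # v1..v6 as indices 0..5
edges = [(0, 2), (0, 3), (0, 4), (1, 2), (1, 3),
         (1, 4), (1, 5), (2, 5), (3, 5), (4, 5)]   # assumed edge list
adj = [set() for _ in scores]
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

solution, total = div_astar(scores, adj, K=3)
print(solution, total)
```

On this graph the root gets bound 25 and the entry for {𝑣1} gets bound 21, and the search ends with {𝑣3, 𝑣4, 𝑣5} and score 20.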
An A* Based Approach
An example (𝐾 = 3)

[Figure: diversity graph 𝐺 with a score on each node, next to the search tree of partial solutions (𝑠, 𝑠𝑐𝑜𝑟𝑒, 𝑢𝑏)]
(∅, 0, 25)
    ({𝑣1}, 10, 21)
    ({𝑣2}, 8, 8)
    ({𝑣3}, 7, 20)
    ({𝑣4}, 7, 13)
    ({𝑣5}, 6, 6)
    ({𝑣6}, 3, 3)
Step 1: Expand node (∅, 0, 25), with 𝑉 = {𝑣1, 𝑣2, 𝑣3}
An A* Based Approach
An example (𝐾 = 3)

[Figure: diversity graph 𝐺 with a score on each node, next to the search tree of partial solutions (𝑠, 𝑠𝑐𝑜𝑟𝑒, 𝑢𝑏)]
(∅, 0, 25)
    ({𝑣1}, 10, 21)
        ({𝑣1, 𝑣2}, 18, 18)
        ({𝑣1, 𝑣6}, 13, 13)
    ({𝑣2}, 8, 8)
    ({𝑣3}, 7, 20)
    ({𝑣4}, 7, 13)
    ({𝑣5}, 6, 6)
    ({𝑣6}, 3, 3)
Step 2: Expand node ({𝑣1}, 10, 21), with 𝑉 = {𝑣1, 𝑣2, 𝑣6}
An A* Based Approach
An example (𝐾 = 3)

[Figure: diversity graph 𝐺 with a score on each node, next to the search tree of partial solutions (𝑠, 𝑠𝑐𝑜𝑟𝑒, 𝑢𝑏)]
(∅, 0, 25)
    ({𝑣1}, 10, 21)
        ({𝑣1, 𝑣2}, 18, 18)
        ({𝑣1, 𝑣6}, 13, 13)
    ({𝑣2}, 8, 8)
    ({𝑣3}, 7, 20)
        ({𝑣3, 𝑣4}, 14, 20)
        ({𝑣3, 𝑣5}, 13, 13)
    ({𝑣4}, 7, 13)
    ({𝑣5}, 6, 6)
    ({𝑣6}, 3, 3)
Step 3: Expand node ({𝑣3}, 7, 20), with 𝑉 = {𝑣3, 𝑣4, 𝑣5}
An A* Based Approach
An example (𝐾 = 3)

[Figure: diversity graph 𝐺 with a score on each node, next to the search tree of partial solutions (𝑠, 𝑠𝑐𝑜𝑟𝑒, 𝑢𝑏)]
(∅, 0, 25)
    ({𝑣1}, 10, 21)
        ({𝑣1, 𝑣2}, 18, 18)
        ({𝑣1, 𝑣6}, 13, 13)
    ({𝑣2}, 8, 8)
    ({𝑣3}, 7, 20)
        ({𝑣3, 𝑣4}, 14, 20)
            ({𝑣3, 𝑣4, 𝑣5}, 20, 20)
        ({𝑣3, 𝑣5}, 13, 13)
    ({𝑣4}, 7, 13)
    ({𝑣5}, 6, 6)
    ({𝑣6}, 3, 3)
Step 4: Expand node ({𝑣3, 𝑣4}, 14, 20), with 𝑉 = {𝑣3, 𝑣4, 𝑣5}
An A* Based Approach
An example (𝐾 = 3)

[Figure: diversity graph 𝐺 with a score on each node, next to the search tree of partial solutions (𝑠, 𝑠𝑐𝑜𝑟𝑒, 𝑢𝑏)]
(∅, 0, 25)
    ({𝑣1}, 10, 21)
        ({𝑣1, 𝑣2}, 18, 18)
        ({𝑣1, 𝑣6}, 13, 13)
    ({𝑣2}, 8, 8)
    ({𝑣3}, 7, 20)
        ({𝑣3, 𝑣4}, 14, 20)
            ({𝑣3, 𝑣4, 𝑣5}, 20, 20)
        ({𝑣3, 𝑣5}, 13, 13)
    ({𝑣4}, 7, 13)
    ({𝑣5}, 6, 6)
    ({𝑣6}, 3, 3)
Step 5: Expand node ({𝑣3, 𝑣4, 𝑣5}, 20, 20), with 𝑉 = {𝑣3, 𝑣4, 𝑣5}
Current best score is 20, and next best score is 18: stop
Optimal solution: {𝑣3, 𝑣4, 𝑣5}
A DP Based Approach

The diversity graph may contain many disconnected components



It is costly to apply A* algorithm on the whole diversity graph
Combine the results of disconnected components using operator ⊕
based on Dynamic Programming (DP)
Dynamic Programming



Suppose 𝐺 contains two disconnected components 𝐺1 and 𝐺2
State 𝐺. 𝑠𝑖 : the optimal score of the diversified top-𝑖 results on 𝐺
State transition equation:
$G.s_i = \max_{0 \le j \le i} \{ G_1.s_j + G_2.s_{i-j} \}$
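
The state transition can be written as a short function over score arrays, where index 𝑖 holds the optimal score of the diversified top-𝑖 results. The two input arrays are the 𝐺1 and 𝐺2 score columns from the 𝐾 = 5 example in this deck (0 marks a size with no solution); the function name is hypothetical.

```python
def combine_disconnected(s1, s2):
    """Operator ⊕: s[i] = max over 0 <= j <= i of s1[j] + s2[i - j]."""
    K = len(s1) - 1
    return [max(s1[j] + s2[i - j] for j in range(i + 1))
            for i in range(K + 1)]


g1 = [0, 10, 18, 20, 0, 0]   # G1.s_0 .. G1.s_5
g2 = [0, 10, 18, 22, 0, 0]   # G2.s_0 .. G2.s_5
g = combine_disconnected(g1, g2)
print(g)
```

Each entry costs O(𝐾) to fill, so combining two components takes O(𝐾²) time regardless of how large the components are.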
A DP Based Approach

An Example (𝐾 = 5)
[Figure: diversity graph 𝐺 made of two disconnected components 𝐺1 and 𝐺2, with a score on each node]

𝐺1
i   solution                    s
0   ∅                           0
1   {𝑣1}                        10
2   {𝑣1, 𝑣2}                    18
3   {𝑣1, 𝑣4, 𝑣5}                20
4   ∅                           0
5   ∅                           0

𝐺2
i   solution                    s
0   ∅                           0
1   {𝑢1}                        10
2   {𝑢1, 𝑢3}                    18
3   {𝑢2, 𝑢4, 𝑢5}                22
4   ∅                           0
5   ∅                           0

𝐺 = 𝐺1 ⊕ 𝐺2
i   solution                    s
0   ∅                           0
1   {𝑣1}                        10
2   {𝑣1, 𝑢1}                    20
3   {𝑣1, 𝑢1, 𝑢3}                28
4   {𝑣1, 𝑣2, 𝑢1, 𝑢3}            36
5   {𝑣1, 𝑣2, 𝑢2, 𝑢4, 𝑢5}        40

$G.s_5 = \max_{0 \le j \le 5} \{ G_1.s_j + G_2.s_{5-j} \} = \max\{0 + 0, 10 + 0, 18 + 22, 20 + 18, 0 + 10, 0 + 0\} = 40$
optimal solution: {𝑣1, 𝑣2, 𝑢2, 𝑢4, 𝑢5}
A Cut Point Based Approach

Cut point of graph 𝐺




Suppose 𝐺 is a connected graph
A cut point is a point whose removal makes 𝐺 disconnected
𝐺 can be further decomposed using cut points
Suppose 𝑐 is a cut point of 𝐺, there are two situations

𝐺. 𝑒𝑥(𝑐): 𝑐 is excluded in the final solution
    After removing 𝑐, 𝐺 becomes several disconnected components
𝐺. 𝑖𝑛(𝑐): 𝑐 is included in the final solution
    After removing 𝑐 and all 𝑐’s adjacent nodes, 𝐺 becomes several disconnected components
    Add 𝑐 to each result in 𝐺. 𝑖𝑛(𝑐)
𝐺. 𝑒𝑥(𝑐) and 𝐺. 𝑖𝑛(𝑐) are combined using operator ⊗ to compute 𝐺
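
Cut points (also called articulation points) can be located with the standard depth-first low-link technique. The sketch below is that textbook method, not something taken from the slides; it assumes a simple undirected graph given as an adjacency dict, and the five-node graph is hypothetical.

```python
def cut_points(adj):
    """Return the set of nodes whose removal disconnects their component."""
    disc, low, result = {}, {}, set()
    timer = [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:                      # back edge
                low[u] = min(low[u], disc[v])
            else:                              # tree edge
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if parent is not None and low[v] >= disc[u]:
                    result.add(u)              # v's subtree cannot bypass u
        if parent is None and children > 1:
            result.add(u)                      # root with several subtrees

    for u in adj:
        if u not in disc:
            dfs(u, None)
    return result


# Hypothetical graph: triangle 0-1-2 with a tail 2-3-4.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
points = cut_points(adj)
print(points)
```

The recursive form is fine for small diversity graphs; for very deep graphs an iterative DFS would avoid Python's recursion limit.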
A Cut Point Based Approach






Let 𝑐 be a cut point of 𝐺
Let 𝐺1 be the solution by excluding 𝑐
Let 𝐺2 be the solution by including 𝑐
𝐺1 and 𝐺2 are mutually exclusive with each other
𝐺. 𝑠𝑖 : the optimal score of diversified top-𝑖 results on 𝐺
Calculating 𝐺 = 𝐺1 ⊗ 𝐺2
𝐺. 𝑠𝑖 = 𝑚𝑎𝑥{𝐺1 . 𝑠𝑖 , 𝐺2 . 𝑠𝑖 }
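
Since the two cases are mutually exclusive, ⊗ reduces to an element-wise maximum over the score arrays. The inputs below are the 𝐺.𝑒𝑥(𝑤2) and 𝐺.𝑖𝑛(𝑤2) score columns from the 𝐾 = 5 cut point example in this deck; the function name is hypothetical.

```python
def combine_exclusive(s_ex, s_in):
    """Operator ⊗: for every size i keep the better of the two cases."""
    return [max(a, b) for a, b in zip(s_ex, s_in)]


g_ex = [0, 10, 20, 28, 36, 40]   # cut point excluded
g_in = [0, 13, 23, 33, 36, 39]   # cut point included
g = combine_exclusive(g_ex, g_in)
print(g)
```

A full implementation would carry the solution sets alongside the scores and keep the set belonging to the winning case at each size.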
A Cut Point Based Approach

Handling multiple cut points

Step 1: Construct a cut-point tree (cptree)



Each node: associated with a cut point (leaf node is associated with a virtual
cut point)
Each edge: associated with a subgraph that connects two cut points (the
subgraph can be empty or disconnected)
A sample cptree:
[Figure: a cptree whose nodes are the cut points 𝑐, 𝑐1, …, 𝑐6 and whose edges carry the subgraphs 𝐺1, …, 𝐺6]

Step 2: Search the cptree

In a bottom-up fashion
A Cut Point Based Approach

An Example
[Figure: graph 𝐺 composed of subgraphs 𝐺1, 𝐺2, 𝐺3, 𝐺4 joined by the cut points 𝑐12, 𝑐24, 𝑐34, with the regions 𝐺12 and 𝐺34 marked]

Suppose 𝐺3.𝑖𝑛(𝑐34), 𝐺3.𝑒𝑥(𝑐34), 𝐺1.𝑖𝑛(𝑐12), 𝐺1.𝑒𝑥(𝑐12) have been computed
𝐺 = 𝐺.𝑒𝑥(𝑐24) ⊗ 𝐺.𝑖𝑛(𝑐24)
We now compute 𝐺.𝑒𝑥(𝑐24) and 𝐺.𝑖𝑛(𝑐24)
A Cut Point Based Approach

An Example
[Figure: graph 𝐺 composed of subgraphs 𝐺1, 𝐺2, 𝐺3, 𝐺4 joined by the cut points 𝑐12, 𝑐24, 𝑐34, with the regions 𝐺12 and 𝐺34 marked]

Computing 𝐺.𝑒𝑥(𝑐24)
    𝐺.𝑒𝑥(𝑐24) = 𝐺12 ⊕ 𝐺34
Computing 𝐺12
    (Case 1) 𝑐12 is excluded: 𝐺′12 = 𝐺1.𝑒𝑥(𝑐12) ⊕ 𝐺2
    (Case 2) 𝑐12 is included: 𝐺′′12 = 𝐺1.𝑖𝑛(𝑐12) ⊕ (𝐺2 − 𝑐12.𝑎𝑑𝑗)
    𝐺2 − 𝑐12.𝑎𝑑𝑗 is the result after removing adjacent nodes of 𝑐12 from 𝐺2
    We have 𝐺12 = 𝐺′12 ⊗ 𝐺′′12
𝐺34 can be computed similarly
A Cut Point Based Approach

An Example
[Figure: graph 𝐺 composed of subgraphs 𝐺1, 𝐺2, 𝐺3, 𝐺4 joined by the cut points 𝑐12, 𝑐24, 𝑐34, with the regions 𝐺12 and 𝐺34 marked]

Computing 𝐺.𝑖𝑛(𝑐24)
    𝐺.𝑖𝑛(𝑐24) = (𝐺12 − 𝑐24.𝑎𝑑𝑗) ⊕ (𝐺34 − 𝑐24.𝑎𝑑𝑗)
Computing 𝐺12 − 𝑐24.𝑎𝑑𝑗
    (Case 1) 𝑐12 is excluded: 𝐺′12 = 𝐺1.𝑒𝑥(𝑐12) ⊕ (𝐺2 − 𝑐24.𝑎𝑑𝑗)
    (Case 2) 𝑐12 is included: 𝐺′′12 = 𝐺1.𝑖𝑛(𝑐12) ⊕ (𝐺2 − 𝑐24.𝑎𝑑𝑗 − 𝑐12.𝑎𝑑𝑗)
    We have 𝐺12 − 𝑐24.𝑎𝑑𝑗 = 𝐺′12 ⊗ 𝐺′′12
𝐺34 − 𝑐24.𝑎𝑑𝑗 can be computed similarly
Do not forget to add {𝑐24} to all the results of 𝐺.𝑖𝑛(𝑐24)
A Cut Point Based Approach

An Example (𝐾 = 5)
[Figure: graph 𝐺 in which the nodes 𝑤2 to 𝑤6 connect the subgraphs 𝐺1 to 𝐺4, with a score on each node]

𝑮. 𝒊𝒏(𝒘𝟐)
i   solution                    s
0   ∅                           0
1   {𝑤2}                        13
2   {𝑤2, 𝑣1}                    23
3   {𝑤2, 𝑣1, 𝑢1}                33
4   {𝑤2, 𝑣3, 𝑣5, 𝑢1}            36
5   {𝑤2, 𝑣3, 𝑣5, 𝑢4, 𝑢5}        39

𝑮. 𝒆𝒙(𝒘𝟐)
i   solution                    s
0   ∅                           0
1   {𝑣1}                        10
2   {𝑣1, 𝑢1}                    20
3   {𝑣1, 𝑢1, 𝑢3}                28
4   {𝑣1, 𝑣2, 𝑢1, 𝑢3}            36
5   {𝑣1, 𝑣2, 𝑢2, 𝑢4, 𝑢5}        40

𝑮 = 𝑮. 𝒆𝒙(𝒘𝟐) ⊗ 𝑮. 𝒊𝒏(𝒘𝟐)
i   solution                    s
0   ∅                           0
1   {𝑤2}                        13
2   {𝑤2, 𝑣1}                    23
3   {𝑤2, 𝑣1, 𝑢1}                33
4   {𝑣1, 𝑣2, 𝑢1, 𝑢3}            36
5   {𝑣1, 𝑣2, 𝑢2, 𝑢4, 𝑢5}        40
Further Improvements



Example
𝑤1 can be removed from 𝐺
There exists 𝑤2 s.t.
    𝑤2 ∈ 𝑤1.𝑎𝑑𝑗
    𝑠𝑐𝑜𝑟𝑒(𝑤2) ≥ 𝑠𝑐𝑜𝑟𝑒(𝑤1)
    𝑤2.𝑎𝑑𝑗 ∪ {𝑤2} ⊆ 𝑤1.𝑎𝑑𝑗 ∪ {𝑤1}
After removing 𝑤1
    𝑤2 and 𝑤5 become cut points
[Figure: graph 𝐺 with nodes 𝑤1 to 𝑤6 around the subgraphs 𝐺1 to 𝐺4, and graph 𝐺′ obtained from 𝐺 by removing 𝑤1]
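
The three removal conditions translate into a short predicate. The sketch assumes the graph is stored as a dict of neighbor sets; the three-node graph and its scores are hypothetical and are not the graph from the figure.

```python
def removable(w1, scores, adj):
    """True if some neighbor w2 of w1 scores at least as high as w1 and
    conflicts with a subset of what w1 conflicts with."""
    closed_w1 = adj[w1] | {w1}
    return any(
        scores[w2] >= scores[w1] and (adj[w2] | {w2}) <= closed_w1
        for w2 in adj[w1]
    )


# Hypothetical graph: w1 is adjacent to w2 and w3; w2 has no other neighbor.
scores = {"w1": 12, "w2": 13, "w3": 7}
adj = {"w1": {"w2", "w3"}, "w2": {"w1"}, "w3": {"w1"}}
flags = {w: removable(w, scores, adj) for w in adj}
print(flags)
```

Any solution containing 𝑤1 can swap it for 𝑤2 without creating a conflict or losing score, which is why dropping 𝑤1 is safe under these conditions.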
Performance Studies

Experimental Setup

We use 2 real datasets: Enwiki and Reuters


Enwiki: 11,930,681 articles from English Wikipedia
Reuters: 21,578 news articles from Reuters

Query: a set of keywords
Answer: top-𝐾 documents

We compare three algorithms





div-astar: A* based approach
div-dp: Dynamic programming based approach
div-cut: Cut point based approach
We vary 3 parameters:

𝐾: (two groups)




Small 𝐾: 40, 80, 120, 160, 200, default 120
Large 𝐾: 500, 700, 900, 1300, 2000, default 900
Similarity threshold 𝜏: 0.4, 0.5, 0.6, 0.7, 0.8, default 0.6
Keyword frequency 𝑘𝑓𝑟𝑒𝑞: 5 levels 1,2,3,4,5, default 3
Performance Studies

Score function:

Given a query 𝑄 and a document 𝑑
$score(Q, d) = \sum_{q \in Q} \frac{tf(q, d)}{len(d)} \times idf(q)$

𝑡𝑓(𝑞, 𝑑) is the term frequency of keyword 𝑞

𝑙𝑒𝑛(𝑑) is the total number of words in 𝑑

$idf(q) = \log \frac{|D|}{|\{d \in D : q \in d\}| + 1}$ for dataset 𝐷
Similarity function:

Given two documents 𝑑1 and 𝑑2
$sim(d_1, d_2) = \frac{\sum_{w \in d_1 \cap d_2} idf(w)}{\sum_{w \in d_1 \cup d_2} idf(w)}$
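
A minimal sketch of the two functions, assuming documents are lists of lowercase words and that the natural logarithm is used (the base is not stated). The four-document corpus is hypothetical.

```python
import math
from collections import Counter


def idf(q, docs):
    """log(|D| / (number of documents containing q + 1))."""
    containing = sum(1 for d in docs if q in d)
    return math.log(len(docs) / (containing + 1))


def score(query, d, docs):
    """Sum over query keywords of tf(q, d) / len(d) * idf(q)."""
    tf = Counter(d)
    return sum(tf[q] / len(d) * idf(q, docs) for q in query)


def sim(d1, d2, docs):
    """idf-weighted overlap of the two documents' word sets."""
    shared, union = set(d1) & set(d2), set(d1) | set(d2)
    denom = sum(idf(w, docs) for w in union)
    return sum(idf(w, docs) for w in shared) / denom if denom else 0.0


docs = [
    ["graph", "pattern", "mining"],
    ["keyword", "search", "graph"],
    ["road", "network"],
    ["social", "network", "search"],
]
s = score(["graph"], docs[0], docs)
same = sim(docs[0], docs[0], docs)
disjoint = sim(docs[0], docs[2], docs)
print(s, same, disjoint)
```

With the +1 in the denominator, a word that occurs in every document gets a negative idf, so a real implementation may want to clamp or filter such words.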
Performance Studies
[Figure: Vary 𝐾 (Enwiki), with panels for small 𝐾 and large 𝐾]
Conclusion


We study the diversified ranking.
We study the diversified top-𝐾 search problem.


Diversity is defined using only the similarity of the search results themselves
We propose a framework such that most top-𝐾 algorithms can be easily
extended to handle diversified top-𝐾 search.
APWeb 2013 in Sydney, Australia

The 15th International Asia-Pacific Web Conference (APWeb), 4-6
April, 2013, Sydney, Australia



Just before ICDE 2013.
Paper Submission Deadline: October 20.
Three Keynote Speakers
H.V. Jagadish (University of Michigan)
Dan Suciu (University of Washington)
Mark Sanderson (RMIT)
A Special Issue on WWW Journal
Research Postgraduate Study at SEEM/CUHK
[www.se.cuhk.edu.hk/programmes]

Research Postgraduate Programs


M.Phil, PhD, M.Phil-PhD (Articulated)
Deadlines:





December 1, 2012 (First Round)
January 31, 2013 (Official Final Round). But, due to Chinese New Year, submit it
early before January 20.
Postgraduate Studentship: HK$13,600 per month (non-taxable)
Current Tuition Fees: HK$42,100/year
Hong Kong PhD Fellowship Scheme 2013-2014 (135 positions in HK)




Deadline: December 1, 2012
Monthly stipend of HK$20,000
10,000 travel allowance
Current Tuition Fees: HK$42,100/year
Taught Postgraduate Study at SEEM/CUHK
[www.se.cuhk.edu.hk/programmes]






Taught Postgraduate Programmes
MSc Programme in SEEM (Systems Engineering and Engineering
Management)
MSc Programme in ECLT (E-Commerce and Logistics Technologies)
Current Tuition Fees: (Provisional) HK$128,000
Full-Time One-Year study in HK
Application deadline:



1st Round: January 15, 2013
2nd Round: March 15, 2013
Early applications are encouraged; Offers may be made to eligible
applicants well before March 15.
Thank you!
Questions?