Dealing with Diversity in Mining and Query Processing
Jeffrey Xu Yu (于旭)
Department of Systems Engineering and Engineering Management
The Chinese University of Hong Kong
yu@se.cuhk.edu.hk

Books on Social Networks
• Social and Economic Networks, by Matthew O. Jackson
• Social Network Data Analytics, by Charu C. Aggarwal
• Exploratory Social Network Analysis with Pajek, by Wouter de Nooy, Andrej Mrvar, and Vladimir Batagelj
• Networks, Crowds, and Markets: Reasoning about a Highly Connected World, by David Easley and Jon Kleinberg
• Networks: An Introduction, by M.E.J. Newman

Some Online Courses
• Mining of Massive Datasets (Anand Rajaraman and Jeff Ullman)
  http://infolab.stanford.edu/~ullman/mmds.html
• Networks, Crowds, and Markets: Reasoning about a Highly Connected World, by David Easley and Jon Kleinberg
  http://www.cs.cornell.edu/home/kleinber/networks-book
• Topics in Data Management & Mining – Social Networks, Laks V.S. Lakshmanan
  http://www.cs.ubc.ca/~laks/534l/cpsc534l.html

Stanford Large Network Dataset Collection
http://snap.stanford.edu/data
• Social networks
• Communication networks
• Citation networks
• Collaboration networks
• Web graphs
• Amazon networks
• Internet networks
• Road networks
• Autonomous systems
• Signed networks
• Wikipedia networks and metadata
• Twitter and Memetracker

Graph Database
http://en.wikipedia.org/wiki/Graph_database
• Pregel: Google’s internal graph processing platform
• Trinity: Microsoft Research Asia
• Neo4j: commercial graph database
• …

Diversified Ranking
• Why diversified ranking?
  • Information requirements are diverse.
  • Queries are incomplete.
• Problem Statement
  • For query-dependent diversified ranking, the goal is to find K nodes in a graph that are relevant to the query node and dissimilar to each other.
  • For query-independent diversified ranking, the goal is to find K prestigious nodes in a graph that are dissimilar to each other.
• Main applications
  • Ranking nodes in a social network, ranking papers, etc.
Challenges
• Diversity measures: there are no widely accepted diversity measures on graphs in the literature.
• Scalability: most existing methods do not scale to large graphs.
• Lack of intuitive interpretation.

Existing Methods
• Grasshopper [Zhu, et al., HLT-NAACL’07]
• ManiRank [Zhu, et al., WWW’11]
• DivRank [Mei, et al., KDD’10]
• DRAGON [Tong, et al., KDD’11]
• Resistive Graph Centers [Dubey, et al., KDD’11]

Grasshopper/ManiRank
• The main idea
  • Work in an iterative manner.
  • Select one node per iteration by random walk.
  • Set the selected node to be an absorbing node, and perform the random walk again to select the second node.
  • Repeat the same process for K iterations to get K nodes.
• No diversity measure: diversity is achieved only by intuition and experiments.
• Cannot scale to large graphs (time complexity O(Kn^2)).

Grasshopper/ManiRank
• Figure: initial random walk with no absorbing states
• Figure: absorbing random walk after ranking the first item

DivRank
• Based on a vertex-reinforced random walk.
• No diversity measure.
• Convergence properties are not clear.
• Time and space complexity is O(n^2).

DRAGON, Resistive Graph Centers
• DRAGON [Tong, et al., KDD’11]
  • The diversity measure lacks a clear topological interpretation.
• Resistive Graph Centers [Dubey, et al., KDD’11]
  • Based on personalized PageRank with a learnable teleportation parameter.
  • Does not scale to large graphs.

A Summary
• Comparison with existing methods

Our Approach
• The main idea
  • Relevance of the top-K nodes (denoted by a set S) is achieved by large (Personalized) PageRank scores.
  • Diversity of the top-K nodes is achieved by a large expansion ratio.
• Expansion ratio of a set of nodes S: σ(S) = |N(S)| / n
  • A larger expansion ratio implies better diversity.

The K-step Expansion
• k-step expansion ratio of S: σ_k(S) = |N_k(S)| / n
• Our diversity measures

A Discrete Optimization Problem
• The diversified ranking problem on a graph is formulated as a discrete optimization problem.
• Submodularity: F(S) is shown to be submodular and non-decreasing.
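The expansion ratio defined above can be illustrated with a minimal Python sketch. It assumes that N_k(S) contains S itself together with every node within k hops of S (the slides do not spell this out), and the function names and the toy path graph are hypothetical.

```python
from collections import deque

def k_step_neighborhood(adj, seeds, k=1):
    """Nodes within k hops of `seeds`; the seeds themselves are included (an assumption)."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, depth + 1))
    return seen

def expansion_ratio(adj, seeds, k=1):
    """sigma_k(S) = |N_k(S)| / n, with n the number of nodes in the graph."""
    return len(k_step_neighborhood(adj, seeds, k)) / len(adj)

# Toy undirected path graph 0-1-2-3-4-5 (hypothetical data).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
clustered = expansion_ratio(path, {0, 1})   # the two nodes sit next to each other
spread = expansion_ratio(path, {1, 4})      # the two nodes are far apart
```

On this toy graph the spread-out pair covers more of the graph than the adjacent pair, which is the intuition behind using the expansion ratio as a diversity measure.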
• The greedy algorithm
  • A (1 − 1/e)-approximation algorithm for solving Eq. (1).
  • Linear time and space complexity w.r.t. the size of the graph.

The Greedy Algorithm
• Marginal gain
• Works in K rounds
• Selects a node with maximal marginal gain in each round

Generalized Diversified Ranking
• Optimization: maximize F_k(S) subject to the cardinality constraint |S| ≤ K
• Submodularity: F_k(S) is shown to be submodular and non-decreasing.
• Randomized greedy algorithm
  • A near (1 − 1/e)-approximation algorithm.
  • Linear time and space complexity w.r.t. the size of the graph.

Generalized Diversified Ranking
• Optimization
• Randomized greedy algorithm
  • Same idea as the greedy algorithm.
  • Works in K rounds.
  • In each round, select the node with maximal marginal gain.
  • But evaluating the maximal marginal gain is expensive.
  • Marginal gain: w_u … (|N_k(S ∪ {u})| − |N_k(S)|)
• Our idea: use a probabilistic counting data structure to sketch the k-step neighborhood of each node.

FM Sketch and Its Properties
• A probabilistic counting structure, devised by Flajolet and Martin.
• Used to estimate the cardinality of a multi-set using only log C + t bits, where C denotes the cardinality and t is a small constant.
• Each FM Sketch is a bitmap of log C + t bits.
• Advantage: to estimate the cardinality of the union of two multi-sets, we only need a bitwise OR between the two FM Sketches.

The Randomized Greedy Algorithm
• Randomized greedy algorithm
  • For each node u, use an FM Sketch to sketch N_k({u}).
  • Use the following rule to sketch N_k({u}), which can be implemented recursively:
    $N_k(\{u\}) = \bigcup_{(u,v) \in E} N_{k-1}(\{v\})$
  • Use an FM Sketch to sketch N_k(S).
  • Evaluating the marginal gain can be implemented by a bitwise OR between the sketches of N_k(S) and N_k({u}).

Experimental Studies
• We conduct experiments on 5 real networks (3 collaboration networks, 1 citation network, and 1 social network).
• We show some results with Flickr, which is a popular photo-sharing website (from the ASU social computing data repository).
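As a concrete illustration of the FM-sketch step in the randomized greedy algorithm described earlier, the following Python sketch builds one bitmap per node and combines neighborhoods with bitwise OR. It is a simplified, single-bitmap version under stated assumptions: the hash (SHA-256 truncated to 32 bits), the 0.77351 correction constant from Flajolet and Martin, the inclusion of u in its own neighborhood, and all function names are choices made here, not details taken from the slides. A practical implementation would average several sketches to reduce variance.

```python
import hashlib

BITS = 32  # fixed bitmap width for this sketch (the slides use log C + t bits)

def lowest_set_bit(x):
    """Index of the least-significant 1 bit of x (BITS - 1 if x is 0)."""
    return (x & -x).bit_length() - 1 if x else BITS - 1

def fm_sketch(items):
    """FM bitmap of a collection: each hashed item switches on one bit."""
    bitmap = 0
    for item in items:
        digest = hashlib.sha256(repr(item).encode()).digest()
        bitmap |= 1 << lowest_set_bit(int.from_bytes(digest[:4], "big"))
    return bitmap

def fm_estimate(bitmap):
    """Cardinality estimate 2^R / 0.77351, where R is the index of the lowest 0 bit."""
    if bitmap == 0:
        return 0.0
    r = 0
    while (bitmap >> r) & 1:
        r += 1
    return 2 ** r / 0.77351

def k_step_sketches(adj, k):
    """Sketch N_k({u}) for every node u by OR-ing the (k-1)-step sketches of its neighbors."""
    sketches = {u: fm_sketch([u]) for u in adj}  # N_0({u}) = {u} (assumption)
    for _ in range(k):
        nxt = {}
        for u in adj:
            bitmap = sketches[u]
            for v in adj[u]:
                bitmap |= sketches[v]
            nxt[u] = bitmap
        sketches = nxt
    return sketches

def estimated_gain(sketch_s, sketch_u):
    """Estimated |N_k(S ∪ {u})| − |N_k(S)| from two bitmaps, using one OR."""
    return fm_estimate(sketch_s | sketch_u) - fm_estimate(sketch_s)

# Toy undirected path graph 0-1-2-3-4-5 (hypothetical data).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
two_hop = k_step_sketches(path, 2)
```

The property that matters for the greedy loop is that the sketch of a union equals the OR of the sketches, so the neighborhood of S ∪ {u} never has to be materialized.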
• Undirected social network (80,513 nodes, 5,899,882 edges, and 195 different groups)

Some Testing Results on Flickr

Make a Top-K Algorithm Diversified
• Figure: the result of searching “apple” in Google Images
• Existing top-K search algorithms
  • Search results are ranked independently.
  • When searching “apple” in Google Images, 9 out of the top 15 results are the logo of Apple Inc.

Structural Keyword Search (1)
• Figure: a DBLP graph and four results v1 to v4 for the keywords “graph patterns” and “keyword search”
  • a41 Author: Jiawei Han
  • p31 Paper: Mining Graph Patterns
  • p32 Paper: Optimizing Index for Taxonomy Keyword Search
  • p33 Paper: Mining Significant Graph Patterns by Leap Search
  • p34 Paper: Keyword Search in Text Cube: Finding Top-k Aggregated Cell Documents
  • w1, w2, w3, w4 Action: Write
• Example: Keyword Search in Graphs
  • Input: a graph with text information on each node, and a user-given keyword query
  • Output: the top-k minimal Steiner trees that contain all user-given keywords

Structural Keyword Search (2)
• Figure: results v1 (score = 0.8), v2 (score = 0.5), v3 (score = 0.5), v4 (score = 0.4), with pairwise similarities of 0.6 or 0.2
• Suppose the similarity of v_i and v_j is
  $sim(v_i, v_j) = \frac{|v_i \cap v_j|}{\max\{|v_i|, |v_j|\}}$, e.g., $sim(v_1, v_2) = \frac{3}{5}$
• Let K = 2
  • {v1, v4} is better than {v1, v2} because v1 and v2 are similar to each other.
  • {v1, v4} is better than {v2, v3} because {v1, v4} has a larger total score.

Diversified Top-K
• We should consider both similarity and score.
  • Let S = {v1, v2, …} be a list of search results.
  • Let score(v_i) be the score of result v_i.
  • Let sim(v_i, v_j) be the similarity of v_i and v_j.
• For any v_i, v_j: v_i and v_j are similar ⇔ sim(v_i, v_j) > τ
  • τ: a user-given threshold
• Diversified top-K results D:
  • At most K results: |D| ≤ K
  • No two results in D are similar.
  • The total score of the results in D is maximized.

A Diversity Graph
• Figure: a diversity graph over v1 to v6 with node scores 10, 8, 7, 7, 6, and 1
  • K = 2, D = {v1, v2}
  • K = 3, D = {v1, v2}
• Diversity Graph G
  • Undirected graph
  • For all v_i, v_j, there is an edge (v_i, v_j) in G ⟺ v_i is similar to v_j.
  • The diversified top-K result set is an independent set of G.

Existing Top-K Search Frameworks
• Most existing top-K search frameworks avoid exploring all search results by finding an early stop condition.
• Incremental Top-K
  • Results are generated one by one in ranked order.
  • Stops when K results are output.
• Bounding Top-K
  • Results are generated, not necessarily in ranked order.
  • A non-increasing score upper bound u for unseen results is maintained.
  • Stops when the K-th largest score generated is no smaller than u.

Our Framework
• We support the existing top-K frameworks.
  • Results are generated one by one.
  • Stops if a certain stop condition is satisfied.
• Our framework
  • Step 1
    • Check the stop condition sufficient().
    • Stop if sufficient() is satisfied.
  • Step 2
    • Generate the next result using the original top-K algorithm.
  • Step 3
    • Check the necessary() condition.
    • If necessary() is satisfied, search the diversified top-K results using div-search().
    • Go to Step 1.
• We extend the existing algorithms to get top-K diversified results with three new functions.
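The diversity-graph formulation above (the diversified top-K result is a maximum-score independent set with at most K nodes) can be checked by brute force on a tiny example. This is only an exhaustive reference sketch in Python with hypothetical scores and similar pairs; it is exponential in the number of results and is not one of the search algorithms proposed in the talk.

```python
from itertools import combinations

def diversified_top_k(scores, similar_pairs, k):
    """Best-scoring set of at most k results with no two similar (exhaustive search)."""
    edges = {frozenset(pair) for pair in similar_pairs}  # edges of the diversity graph
    best_score, best_set = 0, frozenset()
    for size in range(1, k + 1):
        for cand in combinations(sorted(scores), size):
            # Skip candidates that contain an edge, i.e. are not independent sets.
            if any(frozenset(p) in edges for p in combinations(cand, 2)):
                continue
            total = sum(scores[v] for v in cand)
            if total > best_score:
                best_score, best_set = total, frozenset(cand)
    return best_score, best_set

# Hypothetical results: "a" has the top score but is similar to both "b" and "c".
scores = {"a": 10, "b": 9, "c": 8, "d": 3, "e": 2}
similar = [("a", "b"), ("a", "c")]
top2 = diversified_top_k(scores, similar, 2)
```

In this toy input the highest-scoring single result is excluded from the best pair, which shows why the diversified answer cannot be read off the score ranking alone.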
• sufficient(): a new early stop condition
• necessary(): the necessary stop condition
• div-search(): search the top-K diversified results on the current results

Sufficient Stop Condition
• Sufficient stop condition sufficient()
  • S: the set of currently generated results
  • best(S): an upper bound of the optimal solution, calculated from the currently generated results S
  • D_i(S): the current diversified top-i results, with score score(D_i(S))
  • u: the score upper bound of all unseen results
• For each i < K, in the ideal situation, the remaining K − i results all come from unseen results and each has score u.
• We have
  $best(S) = \max_{1 \le i \le K} \{ score(D_i(S)) + (K - i) \times u \}$
• The sufficient stop condition is
  $score(D_K(S)) \ge best(S)$

Necessary Stop Condition
• Necessary stop condition necessary()
  • S: the set of currently generated results
• Assume the stop condition of the original algorithm is satisfied.
  • Otherwise the algorithm cannot stop.
• S′: the set of results at the last time necessary() was satisfied (or ∅ if necessary() has never been satisfied)
• If D_i(S′) ≠ ∅ for a certain 1 ≤ i ≤ K, we need at least K − i + 1 more results generated in order to get K results.
• The necessary stop condition is
  $|S| \ge |S'| + K - \max\{ i \mid 1 \le i \le K,\ D_i(S') \ne \emptyset \}$

The Possible Search Algorithms
• Given the diversity graph G for the currently generated result set S
• Finding D(S) on G is an NP-hard problem.

Greed is Not Good
• Figure: a diversity graph G (K = 100)
  • Greedy solution: score = 199
  • Optimal solution: score = 9900

Three New Search Algorithms
• We propose three exact algorithms
  • div-astar: an A* based approach
  • div-dp: decompose div-astar using operator ⊕
  • div-cut: further decompose div-dp using operators ⊕ and ⊗
• Figure: the NP-hard subproblems handled by div-astar, div-dp, and div-cut

An A* Based Approach
• We use a heap H to maintain partial solutions.
• Each partial solution has the form e = (s, score, ub)
  • s: the set of results selected in the partial solution
  • score: the total score of the results in s
  • ub: the upper bound of the score if s is expanded to a full solution
• Entries in H are expanded in non-increasing order of e.ub.
• The algorithm stops when the ub of the next solution is no larger than the score of the current best solution.

An A* Based Approach
• Calculation of ub
  $ub = \max_{V} \sum_{v_i \in V} score(v_i)$
  s.t.
  c1: $|V| \le K$
  c2: $s \subseteq V \subseteq V(G)$
  c3: $v_i.adj \cap s = \emptyset$
  c4: $\max\{ i \mid v_i \in s \} < \min\{ i \mid v_i \in (V - s) \}$
• v_i.adj is the set of adjacent nodes of v_i in G.
• The equation is a relaxation of the optimal solution w.r.t. s.
• c4 is to avoid generating redundant results.
• ub can be calculated in O(|V(G)|) time in the worst case.

An A* Based Approach
• An example (K = 3) on a diversity graph G with scores v1 = 10, v2 = 8, v3 = 7, v4 = 7, v5 = 6, v6 = 3
• Entries (s, score, ub) under the root (∅, 0, 25):
  ({v1}, 10, 21), ({v2}, 8, 8), ({v3}, 7, 20), ({v4}, 7, 13), ({v5}, 6, 6), ({v6}, 3, 3)
• Step 1: Expand node (∅, 0, 25), with V = {v1, v2, v3}
• Step 2: Expand node ({v1}, 10, 21), with V = {v1, v2, v6}
  • New entries: ({v1, v2}, 18, 18), ({v1, v6}, 13, 13)
• Step 3: Expand node ({v3}, 7, 20), with V = {v3, v4, v5}
  • New entries: ({v3, v4}, 14, 20), ({v3, v5}, 13, 13)
• Step 4: Expand node ({v3, v4}, 14, 20), with V = {v3, v4, v5}
  • New entry: ({v3, v4, v5}, 20, 20)
• Step 5: Expand node ({v3, v4, v5}, 20, 20), with V = {v3, v4, v5}
  • The current best score is 20, and the next best score is 18: stop.
  • Optimal solution: {v3, v4, v5}

A DP Based Approach
• The diversity graph may contain many disconnected components.
• It is costly to apply the A* algorithm on the whole diversity graph.
• Combine the results of disconnected components using operator ⊕, based on Dynamic Programming (DP).
• Dynamic Programming
  • Suppose G contains two disconnected components G1 and G2.
  • State G.s_i: the optimal score of the diversified top-i results on G
  • State transition equation:
    $G.s_i = \max_{0 \le j \le i} \{ G_1.s_j + G_2.s_{i-j} \}$

A DP Based Approach
• An Example (K = 5): G1 ⊕ G2 = G

  G1
  i   solution            s
  0   ∅                   0
  1   {v1}                10
  2   {v1, v2}            18
  3   {v1, v4, v5}        20
  4   ∅                   0
  5   ∅                   0

  G2
  i   solution            s
  0   ∅                   0
  1   {u1}                10
  2   {u1, u3}            18
  3   {u2, u4, u5}        22
  4   ∅                   0
  5   ∅                   0

  G
  i   solution                 s
  0   ∅                        0
  1   {v1}                     10
  2   {v1, u1}                 20
  3   {v1, u1, u3}             28
  4   {v1, v2, u1, u3}         36
  5   {v1, v2, u2, u4, u5}     40

• $G.s_5 = \max_{0 \le j \le 5} \{ G_1.s_j + G_2.s_{5-j} \} = \max\{0 + 0,\ 10 + 0,\ 18 + 22,\ 20 + 18,\ 0 + 10,\ 0 + 0\} = 40$
• Optimal solution: {v1, v2, u2, u4, u5}

A Cut Point Based Approach
• Cut point of graph G
  • Suppose G is a connected graph.
  • A cut point is a point whose removal makes G disconnected.
• G can be further decomposed using cut points.
• Suppose c is a cut point of G; there are two situations.
  • G.ex(c): c is excluded from the final solution.
    • After removing c, G becomes several disconnected components.
  • G.in(c): c is included in the final solution.
    • After removing c and all of c’s adjacent nodes, G becomes several disconnected components.
    • Add c to each result in G.in(c).
• G.ex(c) and G.in(c) are combined using operator ⊗ to compute G.

A Cut Point Based Approach
• Let c be a cut point of G.
  • Let G1 be the solution obtained by excluding c.
  • Let G2 be the solution obtained by including c.
  • G1 and G2 are mutually exclusive.
• G.s_i: the optimal score of the diversified top-i results on G
• Calculating G = G1 ⊗ G2:
  $G.s_i = \max\{ G_1.s_i,\ G_2.s_i \}$

A Cut Point Based Approach
• Handling multiple cut points
  • Step 1: Construct a cut-point tree (cptree)
    • Each node: associated with a cut point (a leaf node is associated with a virtual cut point)
    • Each edge: associated with a subgraph that connects two cut points (the subgraph can be empty or disconnected)
    • Figure: a sample cptree over cut points c0 to c6 and subgraphs G1 to G6
  • Step 2: Search the cptree in a bottom-up fashion

A Cut Point Based Approach
• An Example
  • Figure: graph G composed of subgraphs G1, G2, G3, G4 (with G12 and G34) and cut points c12, c24, c34
  • Suppose G3.in(c34), G3.ex(c34), G1.in(c12), G1.ex(c12) have been computed.
  • G = G.ex(c24) ⊗ G.in(c24)
  • We now compute G.ex(c24) and G.in(c24).

A Cut Point Based Approach
• An Example: computing G.ex(c24)
  • G.ex(c24) = G12 ⊕ G34
  • Computing G12
    • (Case 1) c12 is excluded: G′12 = G1.ex(c12) ⊕ G2
    • (Case 2) c12 is included: G″12 = G1.in(c12) ⊕ (G2 − c12.adj)
    • G2 − c12.adj is the result after removing the adjacent nodes of c12 from G2.
    • We have G12 = G′12 ⊗ G″12
  • G34 can be computed similarly.

A Cut Point Based Approach
• An Example: computing G.in(c24)
  • G.in(c24) = (G12 − c24.adj) ⊕ (G34 − c24.adj)
  • Computing G12 − c24.adj
    • (Case 1) c12 is excluded: G′12 = G1.ex(c12) ⊕ (G2 − c24.adj)
    • (Case 2) c12 is included: G″12 = G1.in(c12) ⊕ (G2 − c24.adj − c12.adj)
    • We have G12 − c24.adj = G′12 ⊗ G″12
  • G34 − c24.adj can be computed similarly.
  • Do not forget to add {c24} to all the results of G.in(c24).

A Cut Point Based Approach
• An Example (K = 5): G.in(w2) ⊗ G.ex(w2) = G
• Figure: graph G with subgraphs G1 to G4 and nodes w2 to w6

  G.in(w2)
  i   solution                 s
  0   ∅                        0
  1   {w2}                     13
  2   {w2, v1}                 23
  3   {w2, v1, u1}             33
  4   {w2, v3, v5, u1}         36
  5   {w2, v3, v5, u4, u5}     39

  G.ex(w2)
  i   solution                 s
  0   ∅                        0
  1   {v1}                     10
  2   {v1, u1}                 20
  3   {v1, u1, u3}             28
  4   {v1, v2, u1, u3}         36
  5   {v1, v2, u2, u4, u5}     40

  G
  i   solution                 s
  0   ∅                        0
  1   {w2}                     13
  2   {w2, v1}                 23
  3   {w2, v1, u1}             33
  4   {v1, v2, u1, u3}         36
  5   {v1, v2, u2, u4, u5}     40

Further Improvements
• Example
  • w1 can be removed from G.
  • There exists w2 s.t.
    • w2 ∈ w1.adj
    • score(w2) ≥ score(w1)
    • w2.adj ∪ {w2} ⊆ w1.adj ∪ {w1}
  • After removing w1, w2 and w5 become cut points.
• Figure: graph G (with w1) and graph G′ (after removing w1)

Performance Studies
• Experimental Setup
  • We use 2 real datasets: Enwiki and Reuters
    • Enwiki: 11,930,681 articles from English Wikipedia
    • Reuters: 21,578 news articles from Reuters
    • Query: a set of keywords
    • Answer: top-K documents
  • We compare three algorithms
    • div-astar: A* based approach
    • div-dp: Dynamic programming based approach
    • div-cut: Cut point based approach
  • We vary 3 parameters:
    • K (two groups)
      • Small K: 40, 80, 120, 160, 200; default 120
      • Large K: 500, 700, 900, 1300, 2000; default 900
    • Similarity threshold τ: 0.4, 0.5, 0.6, 0.7, 0.8; default 0.6
    • Keyword frequency kfreq: 5 levels 1, 2, 3, 4, 5; default 3

Performance Studies
• Score function: given a query Q and a document d,
  $score(Q, d) = \sum_{q \in Q} \frac{tf(q, d)}{len(d)} \times idf(q)$
  • tf(q, d) is the term frequency of keyword q.
  • len(d) is the total number of words in d.
  • $idf(q) = \log \frac{|D|}{|\{ d \in D : q \in d \}| + 1}$ for dataset D
• Similarity function: given two documents d1 and d2,
  $sim(d_1, d_2) = \frac{\sum_{w \in d_1 \cap d_2} idf(w)}{\sum_{w \in d_1 \cup d_2} idf(w)}$

Performance Studies
• Figure: Vary K (Enwiki), with panels for small K and large K

Conclusion
• We study diversified ranking.
• We study the diversified top-K search problem.
  • The diversity measure uses only the similarity among the search results themselves.
  • We propose a framework such that most top-K algorithms can be easily extended to handle diversified top-K search.

APWeb 2013 in Sydney, Australia
• The 15th International Asia-Pacific Web Conference (APWeb), 4-6 April, 2013, Sydney, Australia
• Just before ICDE 2013.
• Paper Submission Deadline: October 20.
• Three Keynote Speakers
  • H.V. Jagadish (University of Michigan)
  • Dan Suciu (University of Washington)
  • Mark Sanderson (RMIT)
• A Special Issue on WWW Journal

Research Postgraduate Study at SEEM/CUHK [www.se.cuhk.edu.hk/programmes]
• Research Postgraduate Programs
  • M.Phil, PhD, M.Phil-PhD (Articulated)
  • Deadlines: December 1, 2012 (First Round); January 31, 2013 (Official Final Round). But, due to Chinese New Year, submit early, before January 20.
  • Postgraduate Studentship: HK$13,600 per month (non-taxable)
  • Current Tuition Fees: HK$42,100/year
• Hong Kong PhD Fellowship Scheme 2013-2014 (135 positions in HK)
  • Deadline: December 1, 2012
  • Monthly stipend of HK$20,000
  • 10,000 travel allowance
  • Current Tuition Fees: HK$42,100/year

Taught Postgraduate Study at SEEM/CUHK [www.se.cuhk.edu.hk/programmes]
• Taught Postgraduate Programmes
  • MSc Programme in SEEM (Systems Engineering and Engineering Management)
  • MSc Programme in ECLT (E-Commerce and Logistics Technologies)
• Current Tuition Fees: (Provisional) HK$128,000
• Full-Time One-Year study in HK
• Application deadline:
  • 1st Round: January 15, 2013
  • 2nd Round: March 15, 2013
• Early applications are encouraged; offers may be made to eligible applicants well before March 15.

Thank you!
Questions?