Optimal Top-k Generation of Attribute Combinations based on Ranked Lists Jiaheng Lu, Renmin University of China Joint work with Pierre Senellart, Chunbin Lin, Xiaoyong Du, Shan Wang, and Xinxing Chen Motivation & Problem Statement Forward Game Score ID Center Guard F1 F2 C1 C2 G1 G2 Juwan Howard LeBron James Chris Bosh Eddy Curry Dwyane Wade Terrel Harris (G01, 3.81) (G06, 3.59) (G04, 3.21) (G07, 3.03) (G09, 2.07) (G11, 1.70) (G10, 1.62) (G02, 1.59) …… (G08, 1.19) (G02, 6.59) (G03, 6.19) (G04, 5.81) (G05, 4.01) (G01, 3.38) (G09, 2.25) (G06, 1.52) (G08, 1.51) …… (G07, 1.00) (G09, 7.10) (G03, 6.01) (G04, 3.79) (G08, 3.02) (G05, 2.89) (G02, 2.52) (G01, 2.00) (G10, 1.59) …… (G06, 1.52) (G01, 9.31) (G07, 9.02) (G03, 8.87) (G04, 5.02) (G11, 4.81) (G08, 4.02) (G06, 4.31) (G05, 3.59) …… (G09, 2.06) (G02, 8.91) (G08, 8.07) (G05, 7.54) (G10, 7.52) (G03, 6.14) (G01, 5.05) (G04, 5.01) (G09, 3.34) …… (G06, 3.01) (G05, 7.21) (G02, 6.01) (G06, 5.58) (G10, 5.51) (G04, 5.00) (G11, 3.09) (G01, 2.06) (G08, 2.03) …… (G09, 1.98) Goal: Select a combination of three players including forward, center, and guard positions. (a) Source data of three groups select players with the highest score in each group Methods calculate the average scores of players across all games …… Limitation: overlook team spirit Motivation & Problem Statement consider their combined scores in the same game Select the top-k combinations according to top-m aggregate scores Tuple aggregation function Instance aggregation function Top-k,m Problem Motivation & Problem Statement Guard Center Forward F1 F2 C1 C2 G1 G2 Juwan Howard LeBron James Chris Bosh Eddy Curry Dwyane Wade Terrel Harris (G01, 9.31) (G07, 9.02) (G03, 8.87) (G04, 5.02) (G11, 4.81) (G08, 4.02) (G06, 4.31) (G05, 3.59) …… (G09, 2.06) (G02, 8.91) (G08, 8.07) (G05, 7.54) (G10, 7.52) (G03, 6.14) (G01, 5.05) (G04, 5.01) (G09, 3.34) …… (G06, 3.01) (G05, 7.21) (G02, 6.01) (G06, 5.58) (G10, 5.51) (G04, 5.00) (G11, 3.09) (G01, 2.06) (G08, 2.03) …… (G09, 1.98) (G01, 3.81) (G06, 3.59) (G04, 3.21) (G07, 3.03) (G09, 2.07) (G11, 1.70) (G10, 1.62) (G02, 1.59) …… (G08, 1.19) (G02, 6.59) (G03, 6.19) (G04, 5.81) (G05, 4.01) (G01, 3.38) (G09, 2.25) (G06, 1.52) (G08, 1.51) …… (G07, 1.00) (G09, 7.10) (G03, 6.01) (G04, 3.79) (G08, 3.02) (G05, 2.89) (G02, 2.52) (G01, 2.00) (G10, 1.59) …… (G06, 1.52) Top-1,2: select the top-1 combination of players according to top-2 aggregate scores for games where they played together. (a) Source data of three groups F1C1G1 F1C1G2 F1C2G1 F1C2G2 F2C1G1 F2C1G2 F2C2G1 F2C2G2 (G04, 15.83) (G04, 13.81) (G01, 16.50) (G01, 15.12) (G02, 21.51) (G05, 17.64) (G02, 17.09) (G02, 13.02) (G05, 14.81) (G05, 13.69) (G04, 14.04) (G07, 12.05) (G05, 18.76) (G02, 17.44) (G04, 14.03) (G09, 12.51) … … … … … … … … (b) Top-2 aggregate scores for each combination F2C1G1 is the best combination, since (21.51 + 18.76) is the highest overall score. Difference between top-k queries and top-k,m queries Top-k Top-k,m Return the top-k tuples Return the top-k combinations of attributes Can be transformed into a SQL Cannot be transformed into a SQL Application XML keyword refinement Example Q = {DB;UC Irvine; 2002} Groups: G1 = {"DB"; "database"}, G2={"UCI";"UC Irvine"} G3 = {"2002"}. Consider a top-1,2 query Answer: Q’={DB, UCI, 2002} Application (Cont.) • Evidence combination mining in medical databases • Package recommendation systems • … Outline Motivation & Problem Statement Top-k,m Query Processing Experimental Results Conclusion Top-k,m Query Processing Access Model: Sorted Accesses (a, 9.0) (b, 8.7) (c, 8.7) (i, 8.8) (c, 7.9) (a, 7.5) (d, 7.4) …… (f, 6.9) …… (i, 5.3) (d, 4.7) Top-k,m Query Processing Access Model: Random Accesses (a, 9.0) (b, 8.7) (c, 8.7) (i, 8.8) (c, 7.9) (a, 7.5) (d, 7.4) …… (f, 6.9) …… (i, 5.3) (d, 4.7) Top-k,m Query Processing Baseline Method: ETA Calculate aggregate score for each combination Compute top-m tuples for each combination Threshold Algorithm (TA) (A1,B1,C1) A1 A2 (a, 10) (c, 8.3) (b,7.4) (d, 7.1) …… …… G1 (A1,B1,C2) 100 …... 105 B1 B2 C1 C2 (b, 9) (a, 8) (e, 8) (f, 7) (b, 9) (a, 8) (e, 8) (f, 7) …… …… …… …… G2 …... G3 …... Return the top-k combinations Top-k,m Query Processing Upper and Lower bounds Algorithm: ULA Lower Bound Consider top-m seen match instances Compute the upper and lower bounds for each combination Upper Bound Consider threshold value and top-m match instances Termination condition: k combinations meet the hit-condition Upper and Lower bounds Algorithm: ULA A1 A2 B1 B2 (G5,7.8) (G2, 7.9) (G4,8.0) (G2, 8.3) (G11,7.3) (G1,7.0) (G8,7.3) (G8, 3.0) (G4, 1.8) (G11, 4.2) (G4, 2.6) (G2, 1.5) (G5, 3.3) (G1, 9.3) Group 1 (G2, 4.4) (G1, 2.3) Group 2 A1B1 Upper and Lower bounds Algorithm: ULA A1 A2 (G1, 9.3) (G2, 8.3) B1 B2 (G5,7.8) (G2, 7.9) (G4,8.0) (G11,7.3) (G1,7.0) (G8,7.3) U: 34.4 A1B1 L: 32.5 U: 34.6 A1B2 L: 22.2 U: 31.4 A2B1 (G8, 3.0) (G4, 1.8) (G11, 4.2) (G4, 2.6) (G2, 1.5) (G5, 3.3) Group 1 L: 20.5 (G2, 4.4) (G1, 2.3) Group 2 A1B1 L: 32.5 (G1, 9.3+7.0), (G2, 7.9+8.3) A2B1 U: 31.4 Threshold value=7.8+7.9=15.7, 15.7*2=31.4 U: 31.6 A2B2 L: 9.8 Upper and Lower bounds Algorithm: ULA A1 A2 (G1, 9.3) (G2, 8.3) B1 B2 (G5,7.8) (G2, 7.9) (G4,8.0) (G11,7.3) (G1,7.0) (G8,7.3) U: 32.5 A1B1 L: 32.5 U: 31.2 A1B2 L: 22.2 (G8, 3.0) (G4, 1.8) (G11, 4.2) (G4, 2.6) (G2, 1.5) (G5, 3.3) Group 1 (G2, 4.4) (G1, 2.3) Group 2 A1B1 Can we run fast? Optimization heuristics (1) Pruning combinations without computing the bounds (A3,B2) is dominated by (A2,B1) 6.3<7.1 and 8.0<8.2 Optimization heuristics (2) Reducing the number of accesses Avoiding both sorted and random accesses for specific lists A1 A2 A3 (a, 10) (c, 8.3) (a, 6.3) (b,7.4) (d, 7.1) (d, 4) …… …… G1 …… B1 B2 (b, 9) (a, 8) (e, 8) (f, 7) …… …… G2 (A1,B1)and(A1,B2) cannot be part of answers, all sorted accesses and random accesses on list A1 are unnecessary. Optimization heuristics (3) Reducing the number of accesses Reducing random accesses across two lists A1 A2 (a, 10) (c, 8.3) (b,7.4) (d, 7.1) …… …… G1 B1 B2 (b, 9) (a, 8) (e, 8) (f, 7) (e, 9) (k, 8) (d, 7) (f, 7) …… …… …… G2 C1 C2 …… G3 (A1,B1,C1)and(A1,B1,C2) cannot be part of answers, random accesses between A1 and B1 are unnecessary. Optimization heuristics (4) Reducing the number of accesses Eliminating random accesses for specific tuples Random access from Le to Lt for tuple x is useless Top-k,m Query Processing Compute upper and lower bounds for unterminated combinations Prune dominated combinations ULA+ Terminate combinations by reducing number of accesses Until k combinations meet hitcondition Interesting theoretical results Optimality properties Instance Optimality If wild guesses are not allowed, and the size of each group is treated as a constant, then ULA and ULA+ are instance-optimal. for every instance there exist two constants a and b such that cost(A) <= a*cost(A’) + b The upper bound of the optimality ratio is tight Interesting theoretical results (Cont.) Optimality properties No Instance Optimal Algorithms If wild guesses are allowed, Then there is no deterministic algorithm that is instance-optimal. Outline Motivation & Problem Statement Top-k,m Query Processing Experimental Results Conclusion Experimental Results Experimental Setup Language: Java; Data sets OS: Windows XP; CPU: 2.0GHz; Disk:320GB Experimental Results Experimental results on NBA and YQL datasets ULA+ outperforms ETA by 1-2 orders of magnitude both in running time and access number. Experimental Results Performance of optimization to reduce combinations More than 60% combinations are pruned without computing their bounds Experimental Results Performance of different optimizations Combination of all optimizations has the most powerful pruning capability. Experimental Results Experimental results on XML DBLP dataset XULA and XULA+ perform better than XETA and scale well in both running time and number of accesses. Related Works Top-k with both random and sorted accesses U. Güntzer etc, VLDB2000 S. Nepal etc, ICDE1999 Fagin etc, PODS 2001 Top-k with only sorted accesses R. Fagin etc, JCSS2003 N. Mamoulis etc, TDS2007 Fagin etc, PODS 2001 Related Works Top-k with sorted access on restricted lists Top-k with no need for exact aggregate score Ad-hoc top-k queries N. Bruno etc, ICDE2002 K. C. C. Chang etc, SIGMOD2002 I. F. Ilya etc, VLDB2002 C. Li etc, SIGMOD2006 M. L. Yiu etc, DKE2008 Related Works Top-k Package recommendation T. Deng, W, Fan and F. Geerts, On the Complexity of Package Recommendation Problems PODS 2012 Outline Motivation & Problem Statement Top-k,m Query Processing Experimental Results Conclusion Conclusion • Propose a new problem called top-k,m query evaluation • Developed a family of efficient algorithms, including ULA and ULA+ • Study the optimality properties of our algorithms • Apply top-k,m query to the context of XML keyword query refinement Optimal Top-k Generation of Attribute Combinations based on Ranked Lists