Boolean + Ranking: Querying a Database by K-Constrained Optimization

Zhen Zhang
Joint work with: Seung-won Hwang, Kevin C. Chang, Min Wang, Christian A. Lang, Yuan-chi Chang
The Database and Info. Systems Lab, University of Illinois at Urbana-Champaign

Many queries naturally combine Boolean and ranking
  Traditional databases: Boolean query, e.g., dept = CS and year = 2
  Information retrieval: Ranking query, e.g., top 5 ranked by gpa
  Database applications on the Web: qualifying constraint + quantifying function
    Find top answers
    B: dept = CS and year = 2
    R: gpa

AIM

Motivating scenarios
  Data retrieval: Find houses in a certain price range with a good price/sqft ratio
    Select h.address
    from House h, CrimeRate c
    Where (h.price ≤ 200k or h.price ≥ 400k) and h.zipcode = c.zipcode
    Order by h.size/|h.price-300k| * c.crimerate^-1
    Limit 1 (or Limit 10)
  Data analysis: Find products with the highest sales increase in consecutive years
    Select itemid
    from Sales s1, Sales s2
    Where s1.itemid = s2.itemid and s2.year - s1.year = 1
    Order by s2.sale - s1.sale
    Limit 10

Boolean + Ranking form a coherent goal function
  Boolean B + Ranking R = Goal function G
  For a tuple t:
    G(t) = B(t) * R(t) = R(t)                     if B(t) is true
                         0 (i.e., lowest score)   if B(t) is false

The nature of Boolean + Ranking is a K-constrained optimization query
  Optimize goal function G over database D
  Goal function G: h.size/|h.price-300k| [h.price ≤ 200k ∨ h.price ≥ 400k]
  Database D:
    #  Addr                Zip    Price  Size
    1  Oak park, Chicago   60644  600K   4500
    2  Mattis, Champaign   61821  350K   2000
    3  …                          150K   1000
    4  …                          250K   2000
    5  …                          300K   3500
    6  …                          80K    500

What is the query evaluation mechanism?
  Boolean query + Ranking query: how to answer?
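
As a baseline answer to that question, the goal function G(t) = B(t) * R(t) can be evaluated by brute force over the six example tuples. The following is a minimal sketch under stated assumptions, not the evaluation strategy the talk proposes: prices are written in thousands of dollars, and the names `HOUSES`, `boolean`, `ranking`, `goal`, and `top_k` are hypothetical helpers introduced only for illustration.

```python
import heapq

# Example relation D from the slides: (tuple id, price in $k, size).
HOUSES = [(1, 600, 4500), (2, 350, 2000), (3, 150, 1000),
          (4, 250, 2000), (5, 300, 3500), (6, 80, 500)]


def boolean(price, size):
    """B: h.price <= 200k or h.price >= 400k."""
    return price <= 200 or price >= 400


def ranking(price, size):
    """R: h.size / |h.price - 300k| (undefined at price == 300)."""
    return size / abs(price - 300)


def goal(price, size):
    """G = B * R: R(t) if B(t) holds, otherwise 0 (the lowest score).

    B is checked first, so R is never evaluated at price == 300.
    """
    return ranking(price, size) if boolean(price, size) else 0.0


def top_k(tuples, k):
    """Score every tuple and keep the k best (exhaustive scan)."""
    scored = [(goal(price, size), tid) for tid, price, size in tuples]
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    for score, tid in top_k(HOUSES, 2):
        print(tid, round(score, 2))
```

This scan touches every tuple regardless of K, which is the cost a goal-directed search over an index tries to avoid.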

Current techniques lack a global search mechanism
  If evaluated as separate operators (a Boolean query and a ranking query applied to D one after the other):
    Current techniques optimize only condition-by-condition
  If searched by an overall goal function G as a ranking function (B and R combined into G over D):
    Current techniques restrict G to be monotonic

Our thesis: Evaluate the query as its nature suggests!
  OPT*: Optimize G over D
    Function optimization of G
    Discrete state search over D

We view the compound index as a discrete space
  [Table: relation D with columns Addr, Zip, Price, Size; tuples 1 to 6]
  [Figure: tuples 1 to 6 plotted by size (x-axis) and price in k (y-axis); index nodes a1 to a7 partition the size axis and index nodes b1 to b7 partition the price axis]
  Mij = (ai, bj)
  Conceptually, a combined space
  [Figure: state graph over the combined space, from M11 through M22, M23, M32, M33 down to M55, M56, M66, M67, M75, M76, M77, with tuples 4, 2, 5, 1 below the lowest states]

How to perform the search in the space?
  What is the search mechanism?
    How to conceptually view the index space of D for search
  How to guide the search?
    How to use function G to focus the search

Challenge 1: What is the search mechanism?

We encode as A* because it's optimal
  What A* is: finding the shortest path
  Why we choose it: completeness and optimality with proper heuristics
    Complete: guaranteed to find the shortest path
    Optimal: visits the least number of nodes
  [Figure: weighted graph with a shortest path from origin to destination]

Encoding our problem into shortest path is challenging
  K-constrained optimization: find a tuple with maximal score
  Shortest path: find a path with minimal distance
  How to encode:
    a tuple as a path?
    the score of a tuple as the distance of a path?

Therefore, we encode K-constrained opt. as:
  How to encode a tuple as a path?
    Add a virtual target t* reachable only through tuples
  How to encode the maximal tuple as the minimal path?
    Quality of a path depends solely on the tuple it passes through
    For a tuple state t:   D(t, t*) = -G(t)
    For two states r, u:   D(r, u) = 0
  [Figure: state graph from M11 down to the tuples, with weight 0 on links between states and weights -G(1), -G(4), … on links from tuples to t*]

Challenge 2: How to guide the search?

We use function opt. to sketch the landscape of G
  Function optimization measures the quality of states
  Function optimization enables:
    1. How to define heuristics?
    2. How to configure space?
    3. Where to start the search?

1. Define admissible heuristics: Measure tightest upper bound
  To guarantee completeness:
    A* requires admissible heuristics, i.e., estimates must be optimistic
  To ensure admissible heuristics:
    Function optimization gives the tightest upper bound
      Analytical approaches
      Numeric analysis package
    H(region) = OPTMAX(G, region), i.e., the maximal value of G in the region

2. Configure descending space: disconnect uphills
  To guarantee optimality:
    A* requires descending heuristics
  To ensure descending heuristics:
    Remove uphill links
  [Figure: state graph with uphill links removed]

3. Find the right start point: Start from local optima
  To guarantee correctness:
    Every tuple state must be reachable from the start states
    Taking only downhill links requires starting from high points
  To ensure reachability:
    Initial states should contain all local optima
  [Figure: state graph with local optima marked as start states]

Putting together: Executing A* on the configured space
  Top-down
    [Figure: top-down traversal of the state graph, from M11 toward the tuples]
  Bottom-up
    [Figure: bottom-up traversal of the state graph]
  Search is implemented as a priority-queue-driven traversal
  Bottom-up approach is always better than top-down

Experiments
  Comparison vs.
    Boolean then ranking
    Ranking then Boolean
  Metrics: nodes accessed = Nl + Nt
  Settings:
    Benchmark queries over a real dataset
    Controlled queries over synthetic datasets

Benchmark queries
  Dataset: 19,706 real estate listings crawled online
  Queries
    Q1: size * bedrms / |price-450k| : [40k ≤ price ≤ 50k]
    Q2: size * e^bedrms / |price-350k| : [price < 400k ∧ size > 4000]
    Q3: size/price : [bedrms = 3 ∨ bedrms = 4]
  [Chart: BR_clustered, BR_unclustered, and OPT* compared on Q1, Q2, Q3]

Controlled queries
  Datasets
    Three randomly generated datasets of 100k points
    Uniform, Gaussian, log-normal
  Queries
    Linear average queries (e.g., 0.4*a + 0.6*b)
    Nearest neighbor queries (e.g., (x-3)^2 + (y-4)^2)
    Join queries (0.4*R.a + 0.6*S.b : R.c=R.d)

Conclusion
  Problem abstraction
    Study K-constrained optimization queries as Boolean + ranking
    Encode K-constrained optimization into a shortest path problem
  Framework
    Develop OPT* to process K-constrained optimization

Thank you! Questions?
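
The ingredients recapped in the conclusion (an upper-bound heuristic H(region), and a priority-queue-driven traversal that reaches tuples only through the regions containing them) can be sketched as a best-first search. This is an illustration under loud assumptions, not the authors' OPT* implementation: the Boolean condition `a <= 0.5` is hypothetical, the ranking function is the linear example 0.4*a + 0.6*b from the controlled queries, coordinates are assumed non-negative, and recursive halving of the point set stands in for real index nodes. All function names are invented for this sketch.

```python
import heapq
import itertools
import random


def goal(a, b):
    """G = B * R with hypothetical B: a <= 0.5 and R: 0.4*a + 0.6*b."""
    return 0.4 * a + 0.6 * b if a <= 0.5 else 0.0


def upper_bound(points):
    """H(region): maximal value of G over the bounding box of the points."""
    a_lo = min(a for a, _ in points)
    a_hi = max(a for a, _ in points)
    b_hi = max(b for _, b in points)
    if a_lo > 0.5:          # B is false everywhere in the box
        return 0.0
    return 0.4 * min(a_hi, 0.5) + 0.6 * b_hi


def split(points):
    """Halve a region along its wider dimension (stand-in for index nodes)."""
    a_span = max(a for a, _ in points) - min(a for a, _ in points)
    b_span = max(b for _, b in points) - min(b for _, b in points)
    axis = 0 if a_span >= b_span else 1
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


def top_k(points, k, leaf_size=4):
    """Best-first search: always expand the state with the highest bound."""
    tie = itertools.count()  # unique tie-breaker so payloads are never compared
    heap = [(-upper_bound(points), next(tie), False, points)]
    answers = []
    while heap and len(answers) < k:
        neg_score, _, is_tuple, payload = heapq.heappop(heap)
        if is_tuple:
            # No remaining state can beat this tuple: its score is final.
            answers.append((-neg_score, payload))
        elif len(payload) <= leaf_size:
            for p in payload:
                heapq.heappush(heap, (-goal(*p), next(tie), True, p))
        else:
            for part in split(payload):
                heapq.heappush(heap, (-upper_bound(part), next(tie), False, part))
    return answers


if __name__ == "__main__":
    random.seed(0)
    data = [(random.random(), random.random()) for _ in range(1000)]
    for score, point in top_k(data, 3):
        print(round(score, 4), point)
```

Because every region's bound is at least the score of any tuple inside it, a tuple popped from the queue cannot be beaten by anything still unexplored, which is why the search can stop after K tuples without scanning the rest.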

How to implement function optimization?
How do we compare with RankSQL?
If bottom-up is always better, why consider top-down?
Computing the upper bound for each region is costly
Random vs. sequential I/O
Assuming indices on every attribute?
Materialize the state space for every query?
Exponential number of states as the number of attributes grows
  Not every attribute has an index on it
  Selectively choose the right index (attribute) to use
  We do perform experiments to study how the system scales with #attr
Your algorithm is not optimal because you change the space
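
On the first question above, one analytical way to obtain H(region) = OPTMAX(G, region) for the house example is to maximize G in closed form over a rectangular region. The sketch below is only an assumption-laden illustration: `opt_max` is a hypothetical helper, prices are in thousands of dollars, sizes are assumed non-negative, and the region is taken to be a box described by its price range and its largest size.

```python
def opt_max(price_lo, price_hi, size_hi):
    """Tight upper bound of G over prices [price_lo, price_hi], sizes <= size_hi.

    G = size / |price - 300| where price <= 200 or price >= 400, else 0.
    R grows with size and as price approaches 300, so within each feasible
    price interval the maximum sits at the largest size and at the feasible
    price closest to 300.
    """
    best = 0.0  # lowest score: no point of the box satisfies B
    if price_lo <= 200:                 # box overlaps price <= 200
        nearest = min(price_hi, 200)
        best = max(best, size_hi / (300 - nearest))
    if price_hi >= 400:                 # box overlaps price >= 400
        nearest = max(price_lo, 400)
        best = max(best, size_hi / (nearest - 300))
    return best


if __name__ == "__main__":
    print(opt_max(0, 250, 3000))     # region reaching up to price 200
    print(opt_max(250, 350, 4000))   # region where B never holds
    print(opt_max(350, 600, 4500))   # region starting at price 400
```

Folding the Boolean condition into the bound keeps it finite: the singularity of R at price = 300 lies outside every feasible interval, so a region that straddles 300 still gets a usable estimate.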