AIM - University of Illinois Urbana

Boolean + Ranking: Querying a Database by K-Constrained Optimization

Zhen Zhang
Joint work with: Seung-won Hwang, Kevin C. Chang, Min Wang, Christian A. Lang, Yuan-chi Chang
The Database and Info. Systems Lab, University of Illinois at Urbana-Champaign

Many queries naturally combine Boolean and ranking
- Boolean query (traditional databases): dept = CS and year = 2
- Ranking query (information retrieval): top 5 ranked by gpa
- Boolean + ranking (database applications on the Web): a qualifying constraint plus a quantifying function, to find the top answers
    B: dept = CS and year = 2
    R: gpa

Motivating scenarios
- Data retrieval: find houses in a certain price range with a good price/sqft ratio
    Select h.address from House h, CrimeRate c
    Where (h.price ≤ 200k or h.price ≥ 400k) and h.zipcode = c.zipcode
    Order by h.size/|h.price-300k| * c.crimerate
    Limit 10
- Data analysis: find products with the highest sale increase in consecutive years
    Select itemid from Sales s1, Sales s2
    Where s1.itemid = s2.itemid and s2.year - s1.year = 1
    Order by s2.sale - s1.sale
    Limit 10

Boolean + Ranking form a coherent goal function
- Boolean B + Ranking R = Goal function G
- For a tuple t:
    G(t) = B(t) * R(t) = R(t)                       if B(t) is true
                       = 0 (i.e., the lowest score)  if B(t) is false

The nature of Boolean + Ranking is a K-constrained optimization query
- Optimize goal function G over database D
- Goal function G: h.size/|h.price-300k| [h.price ≤ 200k or h.price ≥ 400k]
- Database D:

    #   Addr                Zip     Price   Size
    1   Oak park, Chicago   60644   600K    4500
    2   Mattis, Champaign   61821   350K    2000
    3   …                           150K    1000
    4   …                           250K    2000
    5   …                           300K    3500
    6   …                           80K     500

What is the query evaluation mechanism?
- Boolean query + Ranking query: how to answer?
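
As a baseline for the question above, the goal function G(t) = B(t) * R(t) can be evaluated by brute force over the six example tuples. This is only a minimal Python sketch of the definition, not the evaluation mechanism the talk proposes; the names `House`, `boolean_b`, `ranking_r`, `goal_g`, and `top_k_scan` are hypothetical, prices are taken in thousands, and tuples are identified by row number because the slide elides most addresses.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class House:
    tid: int        # row number in the example table
    price_k: float  # price in thousands of dollars
    size: float     # size as listed on the slide


def boolean_b(h: House) -> bool:
    """B: price <= 200k or price >= 400k."""
    return h.price_k <= 200 or h.price_k >= 400


def ranking_r(h: House) -> float:
    """R: size / |price - 300k| (only called when B holds, so never divides by 0)."""
    return h.size / abs(h.price_k - 300)


def goal_g(h: House) -> float:
    """G(t) = R(t) if B(t) is true, else 0 (the lowest score)."""
    return ranking_r(h) if boolean_b(h) else 0.0


def top_k_scan(houses, k):
    """Naive evaluation: score every tuple, keep the k qualifying tuples with highest G."""
    qualifying = [h for h in houses if boolean_b(h)]
    return heapq.nlargest(k, qualifying, key=goal_g)


HOUSES = [
    House(1, 600, 4500),
    House(2, 350, 2000),
    House(3, 150, 1000),
    House(4, 250, 2000),
    House(5, 300, 3500),
    House(6, 80, 500),
]

best = top_k_scan(HOUSES, 1)[0]
print(best.tid, goal_g(best))
```

The scan touches every tuple regardless of K, which is the cost a global search mechanism would try to avoid.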
Current techniques lack a global search mechanism
- If evaluated as separate operators (Boolean query then ranking query, or ranking query then Boolean query): current techniques optimize only condition by condition
- If searched by an overall goal function G used as a ranking function: current techniques restrict G to be monotonic
[Figure: evaluation plans over D with the Boolean query B and the ranking query R as separate operators, and with a single goal function G]

Our thesis: Evaluate the query as its nature suggests!
- OPT*: optimize G over D
- Function optimization of G
- Discrete state search over D

We view compound index as discrete space
- Conceptually, a combined space
- Each state Mij = (ai, bj) pairs a node ai of the index on size with a node bj of the index on price
[Figure: tuples 1-6 of D plotted by size (x-axis) against price in k (y-axis). Price ranges shown for index nodes b1, b2, b3, ..., b6, b7: 0-250, 250-600, 0-100, 100-250, 250-350, 350-600. Size ranges shown for index nodes a1, a2, a3, ..., a6, a7: 0-3000, 3000-4500, 0-1500, 1500-3000, 3000-4000, 4000-6000. States shown: M11, M22, M23, M32, M33, ..., M55, M56, M66, M67, M75, M76, M77.]

How to perform the search in the space?
- What is the search mechanism? How to conceptually view the index space of D for search
- How to guide the search? How to use function G to focus the search

Challenge 1: What is the search mechanism?
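
As a concrete picture of the space such a search would run over, the compound index described above can be read as pairs of single-attribute index nodes. The Python sketch below assumes that expanding a state Mij = (ai, bj) yields the cross product of the children of ai and bj, which matches M11 leading to M22, M23, M32, M33 in the figure but is not spelled out on the slides. `IndexNode`, `expand`, and `locate` are hypothetical names, and the ranges are illustrative values loosely based on the figure.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class IndexNode:
    """One node of a single-attribute index: a closed value range plus child nodes."""
    lo: float
    hi: float
    children: tuple = ()


# Illustrative two-level indices (assumed ranges, not taken verbatim from the slide).
SIZE_INDEX = IndexNode(0, 6000, (
    IndexNode(0, 3000, (IndexNode(0, 1500), IndexNode(1500, 3000))),
    IndexNode(3000, 6000, (IndexNode(3000, 4000), IndexNode(4000, 6000))),
))
PRICE_INDEX = IndexNode(0, 600, (
    IndexNode(0, 250, (IndexNode(0, 100), IndexNode(100, 250))),
    IndexNode(250, 600, (IndexNode(250, 350), IndexNode(350, 600))),
))


def expand(state):
    """Child states of (a, b): cross product of the children of a and of b."""
    a, b = state
    if not a.children and not b.children:
        return []  # leaf cell of the combined space
    return list(product(a.children or (a,), b.children or (b,)))


def locate(size, price_k, state):
    """Descend from `state` to the leaf cell whose ranges contain the point."""
    while True:
        kids = expand(state)
        if not kids:
            return state
        state = next(
            (a, b) for a, b in kids
            if a.lo <= size <= a.hi and b.lo <= price_k <= b.hi
        )


ROOT = (SIZE_INDEX, PRICE_INDEX)          # plays the role of M11
cell = locate(500, 80, ROOT)              # tuple 6 of the example table
print(len(expand(ROOT)), (cell[0].lo, cell[0].hi), (cell[1].lo, cell[1].hi))
```

Points that fall exactly on a boundary are assigned to the first matching child here; a real index would fix that convention explicitly.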
We encode as A* because it's optimal
- What A* is: finding the shortest path
- Why we choose it: completeness and optimality with proper heuristics
  - Complete: guaranteed to find the shortest path
  - Optimal: visits the least number of nodes
[Figure: weighted graph with a path from an origin to a destination]

Encoding our problem into shortest path is challenging
- K-constrained optimization: find a tuple with maximal score
- Shortest path: find a path with minimal distance
- How to encode:
  - a tuple as a path?
  - the score of a tuple as the distance of a path?

Therefore, we encode K-constrained opt. as:
- How to encode a tuple as a path? Add a virtual target t* reachable only through tuples
- How to encode the maximal tuple as the minimal path? The quality of a path depends solely on the tuple it passes through
  - For a tuple state t: D(t, t*) = -G(t)
  - For two states r, u: D(r, u) = 0
[Figure: state graph from M11 through M22, M23, M32, M33, ... down to M55, M56, M66, M67, M75, M76, M77 with edge weights 0; tuples 4, 2, 5, 1 link to t* with weights such as -G(1) and -G(4)]

Challenge 2: How to guide the search?

We use function opt. to sketch the landscape of G
- Function optimization measures the quality of states
- Function optimization answers:
  1. How to define heuristics?
  2. How to configure the space?
  3. Where to start the search?

1. Define admissible heuristics: Measure tightest upper bound
- To guarantee completeness: A* requires admissible heuristics, i.e., estimate optimistically
- To ensure admissible heuristics: function optimization gives the tightest upper bound
  - Analytical approaches
  - Numeric analysis packages
- H(region) = OPTMAX(G, region), i.e., the maximal value of G in the region

2. Configure descending space: disconnect uphills
- To guarantee optimality: A* requires descending heuristics
- To ensure descending heuristics: remove uphill links
[Figure: state graph with uphill links removed]

3. Find right start point: Start from local optima
- To guarantee correctness:
  - Every tuple state must be reachable from the start states
  - Taking only downhill links requires starting from high points
- To ensure reachability: initial states should contain all local optima
[Figure: state graph with the start states marked]

Putting together: Executing A* on the configured space
- Top-down and bottom-up traversals of the configured space
- Search is implemented as a priority-queue-driven traversal
- Bottom-up approach is always better than top-down
[Figure: top-down and bottom-up state graphs]

Experiments
- Comparison vs.:
  - Boolean then ranking
  - Ranking then Boolean
- Metrics: nodes accessed = Nl + Nt
- Settings:
  - Benchmark queries over a real dataset
  - Controlled queries over synthetic datasets

Benchmark queries
- Dataset: 19,706 real estate listings crawled online
- Queries:
  - Q1: size * bedrms / |price-450k| : [40k<=price<=50k]
  - Q2: size * ebedrms / |price-350k| : [price<400k and size>4000]
  - Q3: size/price : [bedrms=3 or bedrms=4]
[Figure: results for Q1, Q2, Q3 comparing BR_clustered, BR_unclustered, and OPT*]

Controlled queries
- Datasets: three randomly generated datasets of 100k points
  - Uniform, gaussian, logvariatenormal
- Queries:
  - Linear average queries (e.g., 0.4*a + 0.6*b)
  - Nearest neighbor queries (e.g., (x-3)^2 + (y-4)^2)
  - Join queries (0.4*R.a + 0.6*S.b : R.c=R.d)

Conclusion
- Problem: study K-constrained optimization queries as Boolean + ranking
- Abstraction: encode K-constrained optimization into the shortest path problem
- Framework: develop OPT* to process K-constrained optimization

Thank you! Questions?

Backup: anticipated questions
- How to implement function optimization?
- How do we compare with RankSQL?
- If bottom-up is always better, why consider top-down?
- Computing the upper bound for each region is costly
- Random vs. sequential I/O
- Assuming indices on every attribute?
- Materialize the state space for every query?
- Exponential number of states as the number of attributes grows
  - Not every attribute has an index on it
  - Selectively choose the right index (attribute) to use
  - We do perform experiments to study how the system scales with #attr
- Your algorithm is not optimal because you change the space

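The priority-queue-driven traversal and the upper-bound heuristic H(region) = OPTMAX(G, region) from the slides can be illustrated with a small best-first search over the house example. This is a sketch under stated assumptions, not the authors' OPT* implementation: regions are rectangles split by halving (a stand-in for index nodes), the bound is derived analytically for this one goal function, and `upper_bound` and `top_k_search` are hypothetical names.

```python
import heapq
from itertools import count


def qualifies(price_k):
    """B: price <= 200k or price >= 400k."""
    return price_k <= 200 or price_k >= 400


def goal(size, price_k):
    """G = size / |price - 300k| if B holds, else 0 (lowest score)."""
    return size / abs(price_k - 300) if qualifies(price_k) else 0.0


def upper_bound(s_lo, s_hi, p_lo, p_hi):
    """Maximal value of G over the rectangle: largest size, feasible price nearest 300k."""
    dists = []
    if p_lo <= 200:                       # feasible prices at or below 200k
        dists.append(300 - min(p_hi, 200))
    if p_hi >= 400:                       # feasible prices at or above 400k
        dists.append(max(p_lo, 400) - 300)
    return s_hi / min(dists) if dists else 0.0


def top_k_search(tuples, k, box=(0, 6000, 0, 600), max_depth=8):
    """Best-first search: pop the entry with the highest bound until k tuples surface."""
    tie = count()                         # tie-breaker so the heap never compares payloads
    heap = [(-upper_bound(*box), next(tie), "region", (box, list(tuples), 0))]
    answers = []
    while heap and len(answers) < k:
        neg_score, _, kind, payload = heapq.heappop(heap)
        if kind == "tuple":               # exact score beats every remaining bound
            answers.append((payload, -neg_score))
            continue
        (s_lo, s_hi, p_lo, p_hi), members, depth = payload
        if len(members) <= 1 or depth >= max_depth:
            for tid, size, price in members:
                if qualifies(price):
                    heapq.heappush(heap, (-goal(size, price), next(tie), "tuple", tid))
            continue
        s_mid, p_mid = (s_lo + s_hi) / 2, (p_lo + p_hi) / 2
        buckets = {}
        for t in members:                 # assign each tuple to one of four sub-rectangles
            buckets.setdefault((t[1] >= s_mid, t[2] >= p_mid), []).append(t)
        for (hi_s, hi_p), sub in buckets.items():
            child = (s_mid if hi_s else s_lo, s_hi if hi_s else s_mid,
                     p_mid if hi_p else p_lo, p_hi if hi_p else p_mid)
            heapq.heappush(heap, (-upper_bound(*child), next(tie), "region",
                                  (child, sub, depth + 1)))
    return answers


# (tuple id, size, price in thousands) from the example table
DATA = [(1, 4500, 600), (2, 2000, 350), (3, 1000, 150),
        (4, 2000, 250), (5, 3500, 300), (6, 500, 80)]
print(top_k_search(DATA, 2))
```

Because the bound never underestimates G inside a region, a tuple popped from the queue cannot be beaten by anything still enqueued, so the search can stop as soon as K tuples have surfaced.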