Supporting Queries with Imprecise Constraints
[WebDB 2004; VLDB 2005 (demo); WWW 2005 (poster); ICDE 2006]

Ullas Nambiar, Dept. of Computer Science, University of California, Davis
Subbarao Kambhampati, Dept. of Computer Science, Arizona State University

18th July, AAAI-06, Boston, USA

Dichotomy in Query Processing

Databases:
• User knows what she wants completely
• User query expresses the need
• Answers exactly match the query constraints

IR Systems:
• User has an idea of what she wants
• User query captures the need to some degree
• Answers ranked by degree of relevance

Our setting falls between the two: an autonomous, un-curated database serving an inexperienced, impatient user population.

Why Support Imprecise Queries?

Suppose the user wants a 'sedan' priced around $7000. A feasible query is Make = "Toyota", Model = "Camry", Price ≤ $7000, which returns:

Make     Model   Price   Year
Toyota   Camry   $7000   1999
Toyota   Camry   $7000   2001
Toyota   Camry   $6700   2000
Toyota   Camry   $6500   1998
………

But what about the price of a Honda Accord? Is there a Camry for $7100?
Solution: support imprecise queries.

Others are following …

What does Supporting Imprecise Queries Mean?

The problem: given a conjunctive query Q over a relation R, find the set of tuples that will be considered relevant by the user:

    Ans(Q) = {x | x ∈ R, Rel(x|Q,U) > c}

Constraints:
– Minimal burden on the end user
– No changes to the existing database
– Domain independence

Setting: an autonomous, un-curated database; an inexperienced, impatient user population.

Assessing the Relevance Function Rel(x|Q,U)

We looked at a variety of non-intrusive relevance assessment methods. The basic idea is to learn the relevance function for the user population rather than for single users.

Methods:
– From the analysis of the (sample) data itself
  • Allows us to understand the relative importance of attributes, and the similarity between the values of an attribute [ICDE 2006; WWW 2005 poster]
– From the analysis of query logs
  • Allows us to identify related queries, and then throw in their answers [WIDM 2003; WebDB 2004]
– From co-click patterns
  • Allows us to identify similarity based on user click patterns [Under Review]

Our Solution: AIMQ

The AIMQ Approach

The AIMQ imprecise query engine consists of a Query Engine, a Dependency Miner, and a Similarity Miner. Given an imprecise query Q, it proceeds as follows:

Imprecise query Q
→ Map: convert "like" to "=", giving the base query Qpr = Map(Q)
→ Derive the base set Abs = Qpr(R)
→ Use the base set as a set of relaxable selection queries
→ Using AFDs, find the relaxation order
→ Derive the extended set by executing the relaxed queries
→ Use value similarities and attribute importance to measure tuple similarities
→ Prune tuples below a threshold
→ Return the ranked set

[For the special case of the empty query, we start with a relaxation that uses AFD analysis.]

An Illustrative Example

Relation: CarDB(Make, Model, Price, Year)
Imprecise query Q :− CarDB(Model like "Camry", Price like "10k")
Base query Qpr :− CarDB(Model = "Camry", Price = "10k")
Base set Abs:
  Make = "Toyota", Model = "Camry", Price = "10k", Year = "2000"
  Make = "Toyota", Model = "Camry", Price = "10k", Year = "2001"
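To make the Map step and base-set derivation concrete, here is a minimal Python sketch over a toy in-memory CarDB relation. The names (CARDB, map_to_precise, select) and the query representation are illustrative assumptions, not AIMQ's actual code.

```python
# Minimal sketch of AIMQ's Map step and base-set derivation.
# All names and the query representation are illustrative assumptions.

CARDB = [  # toy instance of CarDB(Make, Model, Price, Year)
    {"Make": "Toyota", "Model": "Camry",  "Price": "10k", "Year": "2000"},
    {"Make": "Toyota", "Model": "Camry",  "Price": "10k", "Year": "2001"},
    {"Make": "Honda",  "Model": "Accord", "Price": "11k", "Year": "2001"},
]

def map_to_precise(imprecise_query):
    """Map(Q): turn every 'like' constraint into an equality binding."""
    return {attr: value
            for attr, (op, value) in imprecise_query.items()
            if op == "like"}

def select(relation, bindings):
    """Qpr(R): evaluate a conjunctive equality query over the relation."""
    return [t for t in relation
            if all(t[a] == v for a, v in bindings.items())]

# Q :- CarDB(Model like "Camry", Price like "10k")
q = {"Model": ("like", "Camry"), "Price": ("like", "10k")}
q_pr = map_to_precise(q)        # base query Qpr: Model = "Camry", Price = "10k"
base_set = select(CARDB, q_pr)  # Abs = Qpr(R): the two Camry tuples above
```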
Obtaining the Extended Set

Problem: given the base set, find tuples from the database that are similar to the tuples in the base set.

Solution:
– Consider each tuple in the base set as a selection query,
  e.g. Make = "Toyota", Model = "Camry", Price = "10k", Year = "2000"
– Relax each such query to obtain "similar" precise queries,
  e.g. Make = "Toyota", Model = "Camry", Price = "", Year = "2000"
– Execute the relaxed queries and keep the tuples whose similarity is above some threshold.

Challenge: which attribute should be relaxed first? Make? Model? Price? Year?
Solution: relax the least important attribute first.

Least Important Attribute

Definition: an attribute whose binding value, when changed, has minimal effect on the values binding the other attributes.
• It does not decide the values of other attributes.
• Its own value may depend on other attributes.
E.g. changing/relaxing Price will usually not affect other attributes, but changing Model usually affects Price.

Dependence between attributes is useful for deciding relative importance:
• Approximate Functional Dependencies (AFDs) & approximate keys: approximate in the sense that they are obeyed by a large percentage (but not all) of the tuples in the database.
• They can be mined with TANE, an algorithm by Huhtala et al. [1999].

Deciding Attribute Importance

• Mine AFDs and approximate keys.
• Create a dependence graph using the AFDs. The graph is strongly connected, hence a topological sort is not possible.
• Using the approximate key with the highest support, partition the attributes into
  – a deciding set and
  – a dependent set,
  and sort the subsets using dependence and influence weights.
• Measure attribute importance as

      Wimp(Ai) = ( RelaxOrder(Ai) / count(Attributes(R)) ) × ( Wtdecides(Ai) / ∑ Wtdecides )

  or, for attributes in the dependent set,

      Wimp(Ai) = ( RelaxOrder(Ai) / count(Attributes(R)) ) × ( Wtdepends(Ai) / ∑ Wtdepends )

• The attribute relaxation order is all non-key attributes first, then the keys.
• Multi-attribute relaxation is done greedily (sketched in code below).

Example: CarDB(Make, Model, Year, Price)
  Decides: Make, Year
  Depends: Model, Price
  Order: Price, Model, Year, Make
  1-attribute relaxations: {Price, Model, Year, Make}
  2-attribute relaxations: {(Price, Model), (Price, Year), (Price, Make), …}

Tuple Similarity

Tuples obtained after relaxation are ranked according to their similarity to the corresponding tuples in the base set:

    Similarity(t1, t2) = ∑ᵢ AttrSimilarity(value(t1[Ai]), value(t2[Ai])) × Wi

where the Wi are normalized influence weights, ∑ Wi = 1, i = 1 to |Attributes(R)|.

Value similarity:
• Euclidean distance for numerical attributes, e.g. Price, Year
• Concept similarity for categorical attributes, e.g. Make, Model
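The relaxation-and-ranking loop can be sketched as follows. This is a minimal illustration assuming the influence weights and relaxation order have already been mined; the names (WEIGHTS, RELAX_ORDER, relaxed_queries, similarity), the toy weight values, and the normalized-distance stand-in for the Euclidean measure are all assumptions, with categorical similarity stubbed out until the supertuple measure defined on the next slide.

```python
from itertools import combinations

# Assumed inputs: normalized influence weights (sum to 1) and a relaxation
# order, least important attribute first, as mined from AFDs/approximate keys.
WEIGHTS = {"Make": 0.4, "Model": 0.3, "Year": 0.2, "Price": 0.1}
RELAX_ORDER = ["Price", "Model", "Year", "Make"]

def relaxed_queries(base_tuple):
    """Greedy multi-attribute relaxation: yield selection queries obtained by
    dropping 1, 2, ... attribute bindings, least important attributes first."""
    for k in range(1, len(RELAX_ORDER)):
        for dropped in combinations(RELAX_ORDER, k):
            yield {a: v for a, v in base_tuple.items() if a not in dropped}

def attr_similarity(attr, v1, v2):
    """Normalized-distance stand-in for the Euclidean measure on numeric
    attributes; the categorical case is a placeholder for VSim (next slide)."""
    if attr in ("Price", "Year"):  # numeric attributes, values assumed numeric
        a, b = float(v1), float(v2)
        return 1.0 - abs(a - b) / max(abs(a), abs(b), 1.0)
    return 1.0 if v1 == v2 else 0.0  # placeholder for categorical VSim

def similarity(t1, t2):
    """Similarity(t1, t2) = sum_i AttrSimilarity(t1[Ai], t2[Ai]) * Wi."""
    return sum(w * attr_similarity(a, t1[a], t2[a]) for a, w in WEIGHTS.items())

# Ranking an extended set against a base tuple, pruning below a threshold ε:
base = {"Make": "Toyota", "Model": "Camry", "Price": 10000, "Year": 2000}
extended = [{"Make": "Toyota", "Model": "Corolla", "Price": 9500, "Year": 2001}]
ranked = sorted((t for t in extended if similarity(base, t) > 0.5),
                key=lambda t: similarity(base, t), reverse=True)
```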
Categorical Value Similarity

Two words are semantically similar if they share a common context, an idea borrowed from NLP. The context of a value is represented as a set of bags of co-occurring values, called a supertuple. Value similarity is then estimated as the percentage of common {Attribute, Value} pairs.

Supertuple for the concept Make = "Toyota", ST(QMake=Toyota):
  Model: Camry: 3, Corolla: 4, …
  Year:  2000: 6, 1999: 5, 2001: 2, …
  Price: 5995: 4, 6500: 3, 4000: 6

Value similarity is measured as the Jaccard similarity among the supertuples representing the values, summed over the m attributes Ai:

    VSim(v1, v2) = ∑ᵢ Wimp(Ai) × JaccardSim(ST(v1).Ai, ST(v2).Ai), i = 1 to m

    JaccardSim(A, B) = |A ∩ B| / |A ∪ B|

(A runnable sketch of this measure appears after the final slide.)

Value Similarity Graph

[Figure: similarity graph over Make values (Dodge, Nissan, Honda, BMW, Ford, Chevrolet, Toyota) with edge weights such as 0.25, 0.22, 0.16, 0.15, 0.12, 0.11 giving the estimated similarities between makes.]

Empirical Evaluation

Goal: evaluate the effectiveness of the query relaxation and of the similarity estimation.

Databases:
– Used-car database CarDB(Make, Model, Year, Price, Mileage, Location, Color), based on Yahoo Autos and populated with 100k tuples from Yahoo Autos
– Census database from the UCI Machine Learning Repository, populated with 45k tuples

Algorithms:
– AIMQ
  • RandomRelax – randomly picks the attribute to relax
  • GuidedRelax – uses the relaxation order determined from approximate keys and AFDs
– ROCK: RObust Clustering using linKs (Guha et al., ICDE 1999)
  • Computes neighbours and links between every pair of tuples
  • Neighbour: tuples similar to each other; link: the number of common neighbours between two tuples
  • Clusters tuples having common neighbours

Efficiency of Relaxation

[Figure: two bar charts of Work/Relevant Tuple over queries 1–10, one for Guided Relaxation and one for Random Relaxation, at thresholds ε = 0.5, 0.6, and 0.7.]

• GuidedRelax: on average 4 tuples extracted per relevant tuple for ε = 0.5, going up to 12 tuples for ε = 0.7. Resilient to changes in ε.
• RandomRelax: on average 8 tuples extracted per relevant tuple for ε = 0.5, increasing to 120 tuples for ε = 0.7. Not resilient to changes in ε.

Accuracy over CarDB

[Figure: bar chart of Average MRR (0 to 1) for GuidedRelax, RandomRelax, and ROCK over queries 1–14.]

• Similarity learned using a 25k-tuple sample
• 14 queries over 100k tuples
• Mean Reciprocal Rank (MRR) estimated as

      MRR(Q) = Avg( 1 / (|UserRank(tᵢ) − AIMQRank(tᵢ)| + 1) )

• The overall high MRR shows the high relevance of the suggested answers.

Handling Imprecision & Incompleteness

Imprecision in queries:
– Queries are posed by lay users who combine querying and browsing
→ calls for a relevance function

Incompleteness in data:
– Databases are being populated by entry by lay people and by automated extraction, e.g. entering an "Accord" without mentioning "Honda"
→ calls for a density function

General solution: "expected relevance ranking".
Challenge: automated and non-intrusive assessment of the relevance and density functions.
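As referenced on the categorical value similarity slide, here is a minimal sketch of the supertuple construction and the Jaccard-based value similarity. The names and the bag-generalized Jaccard measure (sum of minimum counts over sum of maximum counts) are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def supertuple(relation, attr, value, other_attrs):
    """ST(Q_attr=value): for every other attribute, the bag of values
    co-occurring with `value` in the answers to the query attr = value."""
    answers = [t for t in relation if t[attr] == value]
    return {a: Counter(t[a] for t in answers) for a in other_attrs}

def jaccard_bag(bag1, bag2):
    """JaccardSim(A, B) = |A ∩ B| / |A ∪ B|, generalized to bags of
    {Attribute, Value} pairs via min/max of occurrence counts."""
    keys = set(bag1) | set(bag2)
    inter = sum(min(bag1[k], bag2[k]) for k in keys)  # Counter gives 0 if absent
    union = sum(max(bag1[k], bag2[k]) for k in keys)
    return inter / union if union else 0.0

def vsim(st1, st2, wimp):
    """VSim(v1, v2) = sum_i Wimp(Ai) * JaccardSim(ST(v1).Ai, ST(v2).Ai)."""
    return sum(w * jaccard_bag(st1[a], st2[a]) for a, w in wimp.items())

# Usage, reusing the toy CARDB relation from the first sketch:
# st_toyota = supertuple(CARDB, "Make", "Toyota", ["Model", "Year", "Price"])
# st_honda  = supertuple(CARDB, "Make", "Honda",  ["Model", "Year", "Price"])
# vsim(st_toyota, st_honda, {"Model": 0.5, "Year": 0.3, "Price": 0.2})
```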