Answering Imprecise Queries over Autonomous Web Databases
Ullas Nambiar, Dept. of Computer Science, University of California, Davis
Subbarao Kambhampati, Dept. of Computer Science, Arizona State University
5th April, ICDE 2006, Atlanta, USA

Dichotomy in Query Processing
Databases:
• User knows exactly what she wants
• User query completely expresses the need
• Answers exactly match the query constraints
IR Systems:
• User has an idea of what she wants
• User query captures the need to some degree
• Answers are ranked by degree of relevance

Why Support Imprecise Queries?
A user wants a 'sedan' priced around $7000. A feasible precise query is
  Make = "Toyota", Model = "Camry", Price ≤ $7000
which returns, for example:
  Toyota Camry $7000 1999
  Toyota Camry $7000 2001
  Toyota Camry $6700 2000
  Toyota Camry $6500 1998
  ………
But what about the price of a Honda Accord? Is there a Camry for $7100?
Solution: support imprecise queries.

Others are following …

What does Supporting Imprecise Queries Mean?
The Problem: Given a conjunctive query Q over a relation R, find the set of tuples that will be considered relevant by the user:
  Ans(Q) = { x | x ∈ R, Relevance(Q, x) > c }
Objectives:
– Minimal burden on the end user
– No changes to the existing database
– Domain independence
Motivation: How far can we go with a relevance model estimated from the database?
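The answer-set definition above is, at its core, a filter-and-rank loop. A minimal sketch, where the relation, the relevance function, and the threshold c are illustrative stand-ins (not AIMQ's actual machinery):

```python
# A minimal sketch of Ans(Q) = {x in R : Relevance(Q, x) > c}.
# The toy relation, relevance function and threshold below are
# illustrative stand-ins, not AIMQ's actual implementation.

def answer(query, relation, relevance, c=0.5):
    """Return tuples whose relevance to the query exceeds c, best first."""
    scored = [(relevance(query, t), t) for t in relation]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for score, t in scored if score > c]

# Toy relation and a toy relevance measure: the fraction of matching
# attribute-value constraints (a stand-in for AIMQ's Sim(Q, t)).
cars = [
    {"Make": "Toyota", "Model": "Camry", "Price": 7000},
    {"Make": "Honda", "Model": "Accord", "Price": 7100},
    {"Make": "Toyota", "Model": "Corolla", "Price": 6500},
]

def rel(q, t):
    return sum(t.get(a) == v for a, v in q.items()) / len(q)

print(answer({"Make": "Toyota", "Model": "Camry"}, cars, rel))
```

The point of the sketch is only the shape of the problem: everything interesting in AIMQ lies in estimating `relevance` without user input.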
• Tuples represent real-world objects and relationships between them
• Use the estimated relevance model to provide a ranked set of tuples similar to the query

Challenges
Estimating query-tuple similarity:
– Weighted summation of attribute similarities
– Need to estimate semantic similarity
Measuring attribute importance:
– Not all attributes are equally important
– Users cannot quantify importance

Our Solution: AIMQ
The AIMQ imprecise query engine processes a query Q as follows:
1. Map: convert "like" to "=", giving Qpr = Map(Q)
2. Derive the base set Abs = Qpr(R)
3. Treat the base set as a set of relaxable selection queries; using AFDs (Dependency Miner), find the relaxation order
4. Derive the extended set by executing the relaxed queries
5. Use value similarities and attribute importance (Similarity Miner) to measure tuple similarities
6. Prune tuples below a threshold and return the ranked set

An Illustrative Example
Relation: CarDB(Make, Model, Price, Year)
Imprecise query Q :− CarDB(Model like "Camry", Price like "10k")
Base query Qpr :− CarDB(Model = "Camry", Price = "10k")
Base set Abs:
  Make = "Toyota", Model = "Camry", Price = "10k", Year = "2000"
  Make = "Toyota", Model = "Camry", Price = "10k", Year = "2001"

Obtaining the Extended Set
Problem: Given the base set, find tuples from the database similar to the tuples in the base set.
Solution:
– Consider each tuple in the base set as a selection query, e.g.
  Make = "Toyota", Model = "Camry", Price = "10k", Year = "2000"
– Relax each such query to obtain "similar" precise queries, e.g.
  Make = "Toyota", Model = "Camry", Price = "", Year = "2000"
– Execute the relaxed queries and keep tuples with similarity above some threshold.
Challenge: Which attribute should be relaxed first? Make? Model? Price? Year?
Solution: Relax the least important attribute first.

Least Important Attribute
Definition: An attribute whose binding value, when changed, has minimal effect on the values binding the other attributes:
– It does not decide the values of other attributes
– Its value may depend on other attributes
E.g., changing/relaxing Price will usually not affect other attributes, but changing Model usually affects Price.
Deciding relative importance requires dependence information between attributes, which the sources do not provide. AIMQ learns it using Approximate Functional Dependencies (AFDs) and approximate keys, mined with TANE, an algorithm by Huhtala et al. [1999].
• Approximate Functional Dependency (AFD): X → A is an AFD over r if X → A is an FD over some r' ⊆ r and error(X → A) = |r − r'| / |r| < 1.
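The AFD error measure just defined can be computed directly: |r − r'| is the minimum number of tuples that must be removed so that X → A holds exactly, i.e. within each X-group everything except the most common A-value. A small sketch of this measure (an illustration, not the TANE algorithm itself, which mines all AFDs efficiently):

```python
from collections import Counter, defaultdict

# Sketch of the AFD error measure error(X -> A) = |r - r'| / |r|,
# where r' is the largest subset of r on which X -> A holds exactly.
# This illustrates the measure only; it is not TANE itself.

def afd_error(tuples, X, A):
    groups = defaultdict(Counter)
    for t in tuples:
        key = tuple(t[x] for x in X)
        groups[key][t[A]] += 1
    # Within each X-group, keep the most common A-value; the rest violate the FD.
    kept = sum(c.most_common(1)[0][1] for c in groups.values())
    return 1 - kept / len(tuples)

cars = [
    {"Make": "Toyota", "Model": "Camry", "Year": 2000},
    {"Make": "Toyota", "Model": "Camry", "Year": 2001},
    {"Make": "Toyota", "Model": "Corolla", "Year": 2000},
    {"Make": "Honda", "Model": "Accord", "Year": 2001},
]

print(afd_error(cars, ["Model"], "Make"))   # 0.0: Model exactly determines Make
print(afd_error(cars, ["Make"], "Model"))   # 0.25: one Toyota tuple violates Make -> Model
```

On this toy relation, Model → Make holds exactly (error 0) while Make → Model is only approximate, matching the intuition on the slide that Model decides Make but not vice versa.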
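The relaxation step described above (treat each base-set tuple as a selection query and drop the least important attributes first) can be sketched as follows; the relaxation order is assumed to be already mined, and the function names are illustrative:

```python
from itertools import combinations

# Sketch of generating relaxed selection queries from a base-set tuple,
# relaxing least-important attributes first (greedy multi-attribute
# relaxation). The relaxation order is assumed to be given.

def relaxed_queries(base_tuple, relax_order):
    """Yield selection predicates with 1, 2, ... attributes removed."""
    for k in range(1, len(relax_order)):
        for dropped in combinations(relax_order, k):
            yield {a: v for a, v in base_tuple.items() if a not in dropped}

base = {"Make": "Toyota", "Model": "Camry", "Price": "10k", "Year": "2000"}
order = ["Price", "Model", "Year", "Make"]  # least important first

qs = list(relaxed_queries(base, order))
print(qs[0])  # the first 1-attribute relaxation drops Price
```

Because `combinations` emits subsets in the order of the input sequence, the 2-attribute relaxations come out as (Price, Model), (Price, Year), (Price, Make), …, matching the greedy order on the slide.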
• TANE is exponential in the number of attributes but linear in the number of tuples
• "Approximate" in the sense that the dependencies are obeyed by a large percentage (but not all) of the tuples in the database

Deciding Attribute Importance
– Mine AFDs and approximate keys
– Create a dependence graph using the AFDs (the graph is strongly connected, hence a topological sort is not possible)
– Using the approximate key with the highest support, partition the attributes into a deciding set and a dependent set
– Sort the subsets using dependence and influence weights
Measure attribute importance as
  Wimp(Ai) = RelaxOrder(Ai) / count(Attributes(R)) × Wtdecides(Ai) / ΣWtdecides   (if Ai is in the deciding set)
  Wimp(Ai) = RelaxOrder(Ai) / count(Attributes(R)) × Wtdepends(Ai) / ΣWtdepends   (if Ai is in the dependent set)
• Attribute relaxation order is all non-keys first, then keys
• Greedy multi-attribute relaxation
Example: CarDB(Make, Model, Year, Price)
  Decides: Make, Year; Depends: Model, Price
  Order: Price, Model, Year, Make
  1-attribute relaxations: {Price, Model, Year, Make}
  2-attribute relaxations: {(Price, Model), (Price, Year), (Price, Make), …}

Query-Tuple Similarity
Tuples in the extended set show different levels of relevance. They are ranked according to their similarity to the corresponding tuples in the base set, using
  Sim(Q, t) = Σ(i=1..n) Wimp(Ai) × sim(Q.Ai, t.Ai), where
  sim(Q.Ai, t.Ai) = VSim(Q.Ai, t.Ai)              if Dom(Ai) is categorical
  sim(Q.Ai, t.Ai) = 1 − |Q.Ai − t.Ai| / Q.Ai      if Dom(Ai) is numerical
– n = count(Attributes(R)) and Wimp(Ai) is the importance weight of attribute Ai
– Euclidean distance is used as the similarity basis for numerical attributes, e.g. Price, Year
– VSim is the semantic value similarity estimated by AIMQ for categorical attributes, e.g. Make, Model

Categorical Value Similarity
Two words are semantically similar if they have a common context (an idea from NLP). The context of a value is represented as a set of bags of co-occurring values, called a supertuple. Value similarity is estimated as the percentage of common {Attribute, Value} pairs.
Supertuple for the concept Make = Toyota:
  Model: Camry: 3, Corolla: 4, …
  Year: 2000: 6, 1999: 5, 2001: 2, …
  Price: 5995: 4, 6500: 3, 4000: 6
  VSim(v1, v2) = Σ(i=1..m) Wimp(Ai) × JaccardSim(ST(v1).Ai, ST(v2).Ai)
– Measured as the Jaccard similarity among the supertuples representing the values:
  JaccardSim(A, B) = |A ∩ B| / |A ∪ B|

Empirical Evaluation
Goal:
– Test robustness of the learned dependencies
– Evaluate the effectiveness of query relaxation and similarity estimation
Databases:
– CarDB(Make, Model, Year, Price, Mileage, Location, Color): a used-car database based on Yahoo Autos, populated with 100k tuples
– Census database from the UCI Machine Learning Repository, populated with 45k tuples
Algorithms:
– AIMQ
  • RandomRelax: randomly picks the attribute to relax
  • GuidedRelax: uses the relaxation order determined from approximate keys and AFDs
– ROCK: RObust Clustering using linKs (Guha et al., ICDE 1999)
  • Computes neighbours (tuples similar to each other) and links (the number of common neighbours between two tuples) for every pair of tuples, and clusters tuples having common neighbours

Robustness of Dependencies
[Charts: attribute dependence for Model, Color, Year and Make, and key quality, over samples of 15k, 25k, 50k and 100k tuples.]
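Returning to the Categorical Value Similarity slide, the supertuple construction and the Jaccard-based VSim can be sketched as below. This is a simplification under stated assumptions: supertuples are reduced to plain sets of co-occurring values per attribute (the paper uses bags with counts) and the attribute weights are taken as uniform.

```python
from collections import defaultdict

# Sketch of AIMQ-style categorical value similarity: build a supertuple
# (the values co-occurring with a given value, per attribute) and compare
# supertuples attribute-wise with Jaccard similarity.
# Simplified: sets instead of bags, uniform attribute weights.

def supertuple(tuples, attr, value):
    st = defaultdict(set)
    for t in tuples:
        if t[attr] == value:
            for a, v in t.items():
                if a != attr:
                    st[a].add(v)
    return st

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def vsim(tuples, attr, v1, v2):
    s1, s2 = supertuple(tuples, attr, v1), supertuple(tuples, attr, v2)
    attrs = set(s1) | set(s2)
    return sum(jaccard(s1[a], s2[a]) for a in attrs) / len(attrs)

cars = [
    {"Make": "Toyota", "Model": "Camry", "Year": 2000},
    {"Make": "Toyota", "Model": "Corolla", "Year": 2001},
    {"Make": "Honda", "Model": "Civic", "Year": 2000},
    {"Make": "Honda", "Model": "Accord", "Year": 2001},
]

print(vsim(cars, "Make", "Toyota", "Honda"))
```

Here Toyota and Honda share no Model values (Jaccard 0) but the same Year values (Jaccard 1), so the unweighted VSim is 0.5: similarity emerges from common context, not from the strings themselves.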
Attribute dependence order and key quality are unaffected by sampling.

Robustness of Value Similarities
  Value            Similar values    25k    100k
  Make = "Kia"     Hyundai           0.17   0.17
                   Isuzu             0.15   0.15
                   Subaru            0.13   0.13
  Make = "Bronco"  Aerostar          0.19   0.21
                   F-350             0      0.12
                   Econoline Van     0.11   0.11
  Year = "1985"    1986              0.16   0.16
                   1984              0.13   0.14
                   1987              0.12   0.12

Efficiency of Relaxation
[Charts: work per relevant tuple for 10 queries under guided and random relaxation, for thresholds Є = 0.5, 0.6, 0.7.]
Guided relaxation:
• On average 4 tuples extracted per relevant tuple for Є = 0.5, going up to 12 tuples for Є = 0.7
• Resilient to changes in Є
Random relaxation:
• On average 8 tuples extracted per relevant tuple for Є = 0.5, increasing to 120 tuples for Є = 0.7
• Not resilient to changes in Є

Accuracy over CarDB
[Chart: average MRR of GuidedRelax, RandomRelax and ROCK over 14 queries.]
• Similarity learned using a 25k sample; 14 queries over 100k tuples
• Mean Reciprocal Rank (MRR) estimated as
  MRR(Q) = Avg( 1 / (|UserRank(ti) − AIMQRank(ti)| + 1) )
• Overall high MRR shows high relevance of the suggested answers

Accuracy over CensusDB
[Chart: average query-tuple class similarity of the Top-1, Top-3, Top-5 and Top-10 answers for AIMQ and ROCK.]
• 1000 randomly selected tuples used as queries
• Overall high MRR for AIMQ shows higher relevance of the suggested answers

AIMQ - Summary
An approach for answering imprecise queries over Web databases:
– Mines and uses AFDs to determine the attribute relaxation order
– Domain-independent semantic similarity estimation technique
– Automatically computes attribute importance scores
Empirical evaluation shows:
– Efficiency and robustness of the algorithms
– Better performance than current approaches
– High relevance of the suggested answers
– Domain independence
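The MRR measure from the Accuracy over CarDB slide can be sketched directly from its formula; the rank lists below are toy values, not the paper's data:

```python
# Sketch of the evaluation metric from the Accuracy over CarDB slide:
# MRR(Q) = Avg( 1 / (|UserRank(t_i) - AIMQRank(t_i)| + 1) ).
# The rank lists below are illustrative toy values.

def mrr(user_ranks, aimq_ranks):
    """Average reciprocal rank-difference between user and system rankings."""
    diffs = [1 / (abs(u - a) + 1) for u, a in zip(user_ranks, aimq_ranks)]
    return sum(diffs) / len(diffs)

# Perfect agreement gives 1.0; larger rank disagreements pull the score down.
print(mrr([1, 2, 3], [1, 2, 3]))  # 1.0
print(mrr([1, 2, 3], [2, 1, 3]))  # (0.5 + 0.5 + 1.0) / 3
```

The +1 in the denominator makes a perfect match contribute exactly 1, so an MRR near 1 means the system's ranking closely tracks the user's.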