Supporting Queries with Imprecise Constraints Ullas Nambiar Subbarao Kambhampati

[WebDB 2004; VLDB 2005 (demo); WWW 2005 (poster); ICDE 2006]
Ullas Nambiar, Dept. of Computer Science, University of California, Davis
Subbarao Kambhampati, Dept. of Computer Science, Arizona State University
18th July, AAAI-06, Boston, USA
Dichotomy in Query Processing

Databases
• User knows what she wants
• User query completely expresses the need
• Answers exactly matching query constraints

IR Systems
• User has an idea of what she wants
• User query captures the need to some degree
• Answers ranked by degree of relevance

In between: autonomous, un-curated databases with an inexperienced, impatient user population.
Why Support Imprecise Queries?

The user wants a 'sedan' priced around $7000.

A feasible query: Make = "Toyota", Model = "Camry", Price ≤ $7000

Make    Model   Price   Year
Toyota  Camry   $7000   1999
Toyota  Camry   $7000   2001
Toyota  Camry   $6700   2000
Toyota  Camry   $6500   1998
………

But: What about the price of a Honda Accord? Is there a Camry for $7100?

Solution: Support Imprecise Queries
Others are following …
What does Supporting Imprecise Queries Mean?

The Problem: Given a conjunctive query Q over a relation R, find a set of tuples that will be considered relevant by the user:

Ans(Q) = {x | x ∈ R, Rel(x|Q,U) > c}

Constraints
– Minimal burden on the end user
– No changes to the existing database
– Domain independent
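The answer-set definition above can be sketched in code. This is a minimal sketch with hypothetical names: `rel` is a toy stand-in for the learned relevance function Rel(x|Q,U), and `c` is the relevance cutoff.

```python
def answer_imprecise_query(R, rel, c):
    """Ans(Q): tuples of R with relevance above c, ranked by descending relevance."""
    scored = sorted(((rel(t), t) for t in R), key=lambda p: p[0], reverse=True)
    return [t for score, t in scored if score > c]

CarDB = [{"Model": "Camry", "Price": 7000},
         {"Model": "Accord", "Price": 6800},
         {"Model": "F150", "Price": 15000}]

# Toy stand-in for Rel(x|Q,U): closeness of Price to the desired $7000
rel = lambda t: max(0.0, 1.0 - abs(t["Price"] - 7000) / 7000)

print(answer_imprecise_query(CarDB, rel, 0.5))  # Camry first, then Accord
```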
Assessing the Relevance Function Rel(x|Q,U)

We looked at a variety of non-intrusive relevance assessment methods
– The basic idea is to learn the relevance function for the user population rather than for single users

Methods
– From the analysis of the (sample) data itself
  • Allows us to understand the relative importance of attributes, and the similarity between the values of an attribute [ICDE 2006; WWW 2005 poster]
– From the analysis of query logs
  • Allows us to identify related queries, and then throw in their answers [WIDM 2003; WebDB 2004]
– From co-click patterns
  • Allows us to identify similarity based on user click patterns [Under Review]
Our Solution: AIMQ
The AIMQ Approach

The imprecise query Q flows through the query engine as follows:
1. Map: convert "like" to "=", giving Qpr = Map(Q)
2. Derive the Base Set Abs = Qpr(R)
3. Use the Base Set as a set of relaxable selection queries; using AFDs (Dependency Miner), find the relaxation order
4. Derive the Extended Set by executing the relaxed queries
5. Use value similarities and attribute importance (Similarity Miner) to measure tuple similarities
6. Prune tuples below the threshold
7. Return the Ranked Set

[For the special case of an empty query, we start with a relaxation that uses AFD analysis]
An Illustrative Example

Relation: CarDB(Make, Model, Price, Year)

Imprecise query
Q :− CarDB(Model like "Camry", Price like "10k")

Base query
Qpr :− CarDB(Model = "Camry", Price = "10k")

Base set Abs
Make = "Toyota", Model = "Camry", Price = "10k", Year = "2000"
Make = "Toyota", Model = "Camry", Price = "10k", Year = "2001"
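The Map step and base-set derivation from this example can be sketched as follows. The dict-based query representation and helper names are illustrative, not AIMQ's actual implementation.

```python
def map_to_precise(q):
    """Qpr = Map(Q): reinterpret every 'like' binding as an equality."""
    return dict(q)

def base_set(R, qpr):
    """Abs = Qpr(R): tuples matching every equality constraint."""
    return [t for t in R if all(t.get(a) == v for a, v in qpr.items())]

CarDB = [
    {"Make": "Toyota", "Model": "Camry", "Price": "10k", "Year": "2000"},
    {"Make": "Toyota", "Model": "Camry", "Price": "10k", "Year": "2001"},
    {"Make": "Toyota", "Model": "Camry", "Price": "7k",  "Year": "1999"},
]
Q = {"Model": "Camry", "Price": "10k"}   # Model like "Camry", Price like "10k"
Abs = base_set(CarDB, map_to_precise(Q))
print(len(Abs))  # the two 10k Camrys from the slide
```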
Obtaining the Extended Set

Problem: Given the base set, find tuples from the database similar to the tuples in the base set.

Solution:
– Consider each tuple in the base set as a selection query,
  e.g. Make = "Toyota", Model = "Camry", Price = "10k", Year = "2000"
– Relax each such query to obtain "similar" precise queries,
  e.g. Make = "Toyota", Model = "Camry", Price = "", Year = "2000"
– Execute these and keep tuples having similarity above some threshold.

Challenge: Which attribute should be relaxed first? Make? Model? Price? Year?

Solution: Relax the least important attribute first.
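The relaxation step above can be sketched as a generator that drops attribute constraints one at a time, least important first. The helper name and representation are hypothetical; the importance order itself comes from the AFD analysis described next.

```python
def relaxations(tuple_query, order):
    """Yield relaxed queries, dropping attributes least important first."""
    q = dict(tuple_query)
    for attr in order:                      # e.g. ["Price", "Model", "Year", "Make"]
        q = {a: v for a, v in q.items() if a != attr}
        yield dict(q)

base_tuple = {"Make": "Toyota", "Model": "Camry", "Price": "10k", "Year": "2000"}
for q in relaxations(base_tuple, ["Price", "Model"]):
    print(q)   # first without Price, then without Price and Model
```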
Least Important Attribute

Definition: An attribute whose binding value, when changed, has minimal effect on the values binding other attributes.
• Does not decide the values of other attributes
• Its value may depend on other attributes

E.g. changing/relaxing Price will usually not affect other attributes, but changing Model usually affects Price.

Dependence between attributes is useful for deciding relative importance:
• Approximate Functional Dependencies (AFDs) & Approximate Keys
  – Approximate in the sense that they are obeyed by a large percentage (but not all) of the tuples in the database
• Can use TANE, an algorithm by Huhtala et al. [1999]
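The "approximate" notion can be illustrated with a simple confidence measure: the fraction of tuples that would survive if, within each group sharing a left-hand-side value, we kept only the majority right-hand-side value. This is in the spirit of the error measure TANE minimizes, but the sketch below is a much-simplified, hypothetical illustration, not TANE itself.

```python
from collections import defaultdict

def afd_confidence(R, lhs, rhs):
    """How closely lhs -> rhs holds: fraction of tuples kept when each
    lhs-group retains only its majority rhs value."""
    groups = defaultdict(lambda: defaultdict(int))
    for t in R:
        groups[t[lhs]][t[rhs]] += 1
    kept = sum(max(counts.values()) for counts in groups.values())
    return kept / len(R)

CarDB = [
    {"Model": "Camry",  "Make": "Toyota"},
    {"Model": "Camry",  "Make": "Toyota"},
    {"Model": "Accord", "Make": "Honda"},
    {"Model": "Accord", "Make": "Hyundai"},   # a data-entry error
]
print(afd_confidence(CarDB, "Model", "Make"))  # 0.75: Model -> Make holds approximately
```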
Deciding Attribute Importance

• Mine AFDs and approximate keys
• Create a dependence graph using the AFDs
  – The graph is strongly connected, hence a topological sort is not possible
• Using the approximate key with the highest support, partition the attributes into
  – a deciding set and
  – a dependent set,
  then sort the subsets using dependence and influence weights
• Measure attribute importance as

  Wimp(Ai) = (RelaxOrder(Ai) / count(Attributes(R))) × (Wtdecides(Ai) / Σ Wtdecides)
  or
  Wimp(Ai) = (RelaxOrder(Ai) / count(Attributes(R))) × (Wtdepends(Ai) / Σ Wtdepends)

• The attribute relaxation order is all non-keys first, then keys
• Greedy multi-attribute relaxation

Example: CarDB(Make, Model, Year, Price)
  Decides: Make, Year; Depends: Model, Price
  Order: Price, Model, Year, Make
  1-attribute relaxations: {Price, Model, Year, Make}
  2-attribute relaxations: {(Price, Model), (Price, Year), (Price, Make), …}
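The greedy schedule in the CarDB example (all 1-attribute relaxations in importance order, then 2-attribute combinations, and so on) can be sketched as:

```python
from itertools import combinations

def relaxation_schedule(order, max_size=2):
    """Enumerate attribute subsets to relax: singletons first (least
    important attribute first), then pairs, up to max_size attributes."""
    schedule = []
    for k in range(1, max_size + 1):
        schedule.extend(combinations(order, k))
    return schedule

order = ["Price", "Model", "Year", "Make"]   # least to most important
print(relaxation_schedule(order)[:6])
# singletons, then ('Price', 'Model'), ('Price', 'Year'), ... as on the slide
```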
Tuple Similarity
Tuples obtained after relaxation are ranked according to their similarity to the corresponding tuples in the base set:

Similarity(t1, t2) = Σi AttrSimilarity(value(t1[Ai]), value(t2[Ai])) × Wi

where the Wi are normalized influence weights, Σ Wi = 1, i = 1 to |Attributes(R)|.

Value Similarity
• Euclidean distance for numerical attributes, e.g. Price, Year
• Concept similarity for categorical attributes, e.g. Make, Model
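The weighted similarity sum can be sketched as follows. The per-attribute similarity functions here are toys: a scaled inverted distance for Price, and exact match standing in for concept similarity on Model; the weights are made-up normalized values.

```python
def tuple_similarity(t1, t2, attr_sims, weights):
    """Similarity(t1, t2) = sum_i AttrSimilarity(t1[Ai], t2[Ai]) * Wi."""
    return sum(attr_sims[a](t1[a], t2[a]) * w for a, w in weights.items())

attr_sims = {
    "Price": lambda x, y: max(0.0, 1 - abs(x - y) / 10000),  # toy numeric similarity
    "Model": lambda x, y: 1.0 if x == y else 0.0,            # stand-in for concept similarity
}
weights = {"Price": 0.4, "Model": 0.6}                       # normalized: sum to 1

t1 = {"Model": "Camry", "Price": 10000}
t2 = {"Model": "Camry", "Price": 9000}
print(tuple_similarity(t1, t2, attr_sims, weights))  # ≈ 0.96
```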
Categorical Value Similarity

• Two words are semantically similar if they have a common context (an idea from NLP)
• The context of a value is represented as a set of bags of co-occurring values, called a supertuple
• Value similarity is estimated as the percentage of common {Attribute, Value} pairs
Supertuple for the concept Make = Toyota:
  Model: Camry: 3, Corolla: 4, …
  Year: 2000: 6, 1999: 5, 2001: 2, …
  Price: 5995: 4, 6500: 3, 4000: 6

VSim(v1, v2) = Σ(i=1..m) Wimp(Ai) × JaccardSim(ST(v1).Ai, ST(v2).Ai)

– Measured as the Jaccard similarity among the supertuples representing the values, where JaccardSim(A, B) = |A ∩ B| / |A ∪ B|
Value Similarity Graph

[Figure: similarity graph over Make values (Dodge, Nissan, Honda, BMW, Ford, Chevrolet, Toyota) with edge weights ranging from 0.11 to 0.25]
Empirical Evaluation

Goal
– Evaluate the effectiveness of the query relaxation and similarity estimation

Databases
– Used car database CarDB based on Yahoo Autos
  • CarDB(Make, Model, Year, Price, Mileage, Location, Color)
  • Populated using 100k tuples from Yahoo Autos
– Census database from the UCI Machine Learning Repository
  • Populated using 45k tuples

Algorithms
– AIMQ
  • RandomRelax: randomly picks the attribute to relax
  • GuidedRelax: uses the relaxation order determined using approximate keys and AFDs
– ROCK: RObust Clustering using linKs (Guha et al., ICDE 1999)
  • Compute neighbours and links between every tuple
    – Neighbour: tuples similar to each other
    – Link: number of common neighbours between two tuples
  • Cluster tuples having common neighbours
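The neighbour/link computation that ROCK builds on can be sketched as follows. The similarity function, threshold, and data are toys, and this is only the link-counting step, not the full clustering algorithm.

```python
def neighbours(tuples, sim, theta):
    """neighbours[i] = indices of tuples with similarity >= theta to tuple i
    (each tuple counts as its own neighbour, since sim(t, t) is maximal)."""
    n = len(tuples)
    return [{j for j in range(n) if sim(tuples[i], tuples[j]) >= theta}
            for i in range(n)]

def links(nbrs, i, j):
    """Link(i, j): number of common neighbours between tuples i and j."""
    return len(nbrs[i] & nbrs[j])

# Toy data: similarity = fraction of matching fields
data = [("Toyota", "Camry"), ("Toyota", "Corolla"), ("Honda", "Civic")]
sim = lambda a, b: sum(x == y for x, y in zip(a, b)) / len(a)
nbrs = neighbours(data, sim, 0.5)
print(links(nbrs, 0, 1))  # the two Toyotas share both Toyota tuples as neighbours
```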
Efficiency of Relaxation

[Two charts: Work per Relevant Tuple for queries 1–10 at Є = 0.5, 0.6, and 0.7, under Guided Relaxation and Random Relaxation]

• Random Relaxation: on average 8 tuples extracted per relevant tuple for Є = 0.5, increasing to 120 tuples for Є = 0.7. Not resilient to changes in Є.
• Guided Relaxation: on average 4 tuples extracted per relevant tuple for Є = 0.5, going up to 12 tuples for Є = 0.7. Resilient to changes in Є.
Accuracy over CarDB

[Chart: Average MRR per query (1–14) for GuidedRelax, RandomRelax, and ROCK]

• Similarity learned using a 25k-tuple sample
• 14 queries over 100k tuples
• Mean Reciprocal Rank (MRR) estimated as

  MRR(Q) = Avg( 1 / (|UserRank(ti) − AIMQRank(ti)| + 1) )

• The overall high MRR shows the high relevance of the suggested answers
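The MRR measure used above can be computed as follows; the rank-dictionary representation is a hypothetical convenience.

```python
def mrr(user_rank, aimq_rank):
    """MRR(Q) = Avg over answers of 1 / (|UserRank(ti) - AIMQRank(ti)| + 1).
    Both arguments map a tuple id to its rank position."""
    return sum(1.0 / (abs(user_rank[t] - aimq_rank[t]) + 1)
               for t in user_rank) / len(user_rank)

user = {"t1": 1, "t2": 2, "t3": 3}
aimq = {"t1": 1, "t2": 3, "t3": 2}   # t2 and t3 swapped by the system
print(mrr(user, aimq))  # (1 + 0.5 + 0.5) / 3
```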
Handling Imprecision & Incompleteness

• Imprecision in queries → requires a Relevance Function
  – Queries posed by lay users, who combine querying and browsing
• Incompleteness in data → requires a Density Function
  – Databases are being populated by entry by lay people and by automated extraction
  – E.g. entering an "accord" without mentioning "Honda"

General Solution: "Expected Relevance Ranking"
Challenge: Automated & non-intrusive assessment of the Relevance and Density functions
Download