Answering Imprecise Queries over Autonomous Web Databases

Ullas Nambiar
Dept. of Computer Science, University of California, Davis

Subbarao Kambhampati
Dept. of Computer Science, Arizona State University

ICDE 2006, Atlanta, USA – 5th April 2006
Dichotomy in Query Processing

Databases
• User knows what she wants
• User query completely expresses the need
• Answers exactly matching query constraints

IR Systems
• User has an idea of what she wants
• User query captures the need to some degree
• Answers ranked by degree of relevance
Why Support Imprecise Queries?

Want a ‘sedan’ priced around $7000.
A feasible query: Make = “Toyota”, Model = “Camry”, Price ≤ $7000

Make     Model    Price    Year
Toyota   Camry    $7000    1999
Toyota   Camry    $7000    2001
Toyota   Camry    $6700    2000
Toyota   Camry    $6500    1998
………

But: What about the price of a Honda Accord? Is there a Camry for $7100?

Solution: Support Imprecise Queries
Others are following …
What does Supporting Imprecise Queries Mean?

The Problem: Given a conjunctive query Q over a relation R, find a set of tuples that will be considered relevant by the user:
Ans(Q) = {x | x ∈ R, Relevance(Q, x) > c}

Objectives
– Minimal burden on the end user
– No changes to the existing database
– Domain independence

Motivation
– How far can we go with a relevance model estimated from the database?
  • Tuples represent real-world objects and the relationships between them
– Use the estimated relevance model to provide a ranked set of tuples similar to the query
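The answer-set definition above can be sketched directly. This is a minimal illustration only: `toy_relevance` is a hypothetical placeholder for the relevance model that AIMQ actually estimates from the database.

```python
# Sketch of Ans(Q) = {x | x in R, Relevance(Q, x) > c}.

def answer_set(relation, query, relevance, c):
    """Return tuples whose estimated relevance to the query exceeds threshold c."""
    return [t for t in relation if relevance(query, t) > c]

# Toy relation and a placeholder relevance function (illustrative only).
car_db = [
    {"Make": "Toyota", "Model": "Camry", "Price": 7000},
    {"Make": "Toyota", "Model": "Corolla", "Price": 6500},
    {"Make": "Honda", "Model": "Accord", "Price": 7100},
]

def toy_relevance(query, t):
    # Fraction of query constraints the tuple matches exactly.
    return sum(t[a] == v for a, v in query.items()) / len(query)

q = {"Make": "Toyota", "Model": "Camry"}
print(answer_set(car_db, q, toy_relevance, 0.5))  # only the exact Camry match
```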
Challenges
 Estimating Query-Tuple Similarity
– Weighted summation of attribute similarities
– Need to estimate semantic similarity
 Measuring Attribute Importance
– Not all attributes equally important
– Users cannot quantify importance
Our Solution: AIMQ

An imprecise query Q enters the Imprecise Query Engine:

Query Engine
– Map: convert “like” to “=”  (Qpr = Map(Q))
– Derive base set Abs  (Abs = Qpr(R))

Dependency Miner
– Use the base set as a set of relaxable selection queries
– Using AFDs, find the relaxation order
– Derive the extended set by executing the relaxed queries

Similarity Miner
– Use value similarities and attribute importance to measure tuple similarities
– Prune tuples below threshold
– Return the ranked set
An Illustrative Example
Relation:- CarDB(Make, Model, Price, Year)
Imprecise query
Q :− CarDB(Model like “Camry”, Price like “10k”)
Base query
Qpr :− CarDB(Model = “Camry”, Price = “10k”)
Base set Abs
Make = “Toyota”, Model = “Camry”, Price = “10k”, Year = “2000”
Make = “Toyota”, Model = “Camry”, Price = “10k”, Year = “2001”
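The Map step in the example can be sketched as follows; the query representation (attribute → (operator, value) pairs) is an assumption made for illustration, not AIMQ's actual data structure.

```python
# Sketch of Qpr = Map(Q): tighten every `like` predicate of the
# imprecise query into an equality to obtain the precise base query.

def map_to_precise(imprecise_query):
    """Convert {attr: ('like', value)} predicates to {attr: ('=', value)}."""
    return {attr: ('=', value) for attr, (op, value) in imprecise_query.items()}

q = {"Model": ("like", "Camry"), "Price": ("like", "10k")}
print(map_to_precise(q))  # {'Model': ('=', 'Camry'), 'Price': ('=', '10k')}
```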
Obtaining Extended Set
 Problem: Given base set, find tuples from database
similar to tuples in base set.
 Solution:
– Consider each tuple in base set as a selection query.
e.g. Make = “Toyota”, Model = “Camry”, Price = “10k”, Year = “2000”
– Relax each such query to obtain “similar” precise queries.
e.g. Make = “Toyota”, Model = “Camry”, Price = “”, Year =“2000”
– Execute the relaxed queries and determine tuples having similarity above some threshold.
 Challenge: Which attribute should be relaxed first?
– Make ? Model ? Price ? Year ?
Solution: Relax least important attribute first.
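The relaxation step above can be sketched as below, under the assumption that relaxing an attribute simply drops its binding from the selection query (the relaxation order itself comes from the AFD analysis described later).

```python
# Sketch: treat a base-set tuple as a selection query and relax it
# least-important-attribute-first by dropping one binding per step.

def relaxed_queries(base_tuple, relax_order):
    """Yield progressively relaxed selection queries from a base-set tuple.

    relax_order lists attributes from least to most important; each step
    removes one more binding from the selection query.
    """
    query = dict(base_tuple)
    queries = []
    for attr in relax_order:
        query = {a: v for a, v in query.items() if a != attr}
        queries.append(dict(query))
    return queries

base = {"Make": "Toyota", "Model": "Camry", "Price": "10k", "Year": "2000"}
# Relaxation order from the talk's CarDB example: Price first, then Model, Year.
for q in relaxed_queries(base, ["Price", "Model", "Year"]):
    print(q)
```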
Least Important Attribute

 Definition: An attribute whose binding value, when changed, has minimal effect on the values binding other attributes.
– Does not decide the values of other attributes
– Its value may depend on other attributes
E.g. changing/relaxing Price will usually not affect other attributes, but changing Model usually affects Price.

 Deciding relative importance requires dependence information between attributes
– Attribute dependence information is not provided by the sources
– Learn it using Approximate Functional Dependencies (AFDs) & Approximate Keys
  • TANE, an algorithm by Huhtala et al [1999], is used to mine AFDs and approximate keys; it is exponential in the number of attributes and linear in the number of tuples.
  • Approximate Functional Dependency (AFD): X → A is an FD over r′, r′ ⊆ r. If error(X → A) = |r − r′| / |r| < 1, then X → A is an AFD over r.
  • Approximate in the sense that they are obeyed by a large percentage (but not all) of the tuples in the database.
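The AFD error measure can be sketched as below. The sketch assumes, as is standard for this style of error (e.g. in TANE), that r′ is the largest subset on which X → A holds exactly, i.e. we keep the majority A-value within each X-group.

```python
# Sketch of error(X -> A) = |r - r'| / |r|: the fraction of tuples that must be
# removed for the dependency X -> A to hold exactly on the remainder r'.

from collections import Counter, defaultdict

def afd_error(tuples, x_attrs, a_attr):
    """Fraction of tuples violating X -> A (kept tuples = majority A-value per X-group)."""
    groups = defaultdict(Counter)
    for t in tuples:
        key = tuple(t[x] for x in x_attrs)
        groups[key][t[a_attr]] += 1
    kept = sum(c.most_common(1)[0][1] for c in groups.values())
    return (len(tuples) - kept) / len(tuples)

cars = [
    {"Model": "Camry", "Make": "Toyota"},
    {"Model": "Camry", "Make": "Toyota"},
    {"Model": "Camry", "Make": "Holden"},  # one tuple violating Model -> Make
    {"Model": "Accord", "Make": "Honda"},
]
print(afd_error(cars, ["Model"], "Make"))  # 0.25
```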
Deciding Attribute Importance

 Mine AFDs and approximate keys
 Create a dependence graph using the AFDs
– Strongly connected, hence a topological sort is not possible
 Using the approximate key with highest support, partition the attributes into
– Deciding set
– Dependent set
– Sort the subsets using dependence and influence weights
 Measure attribute importance as
  Wimp(Ai) = RelaxOrder(Ai) / count(Attributes(R)),
  where the relaxation order is derived from Wtdecides(Ai) / Σ Wtdecides or Wtdepends(Ai) / Σ Wtdepends

• Attribute relaxation order is all non-keys first, then keys
• Greedy multi-attribute relaxation

Example: CarDB(Make, Model, Year, Price)
Decides: Make, Year
Depends: Model, Price
Order: Price, Model, Year, Make
1-attribute relaxations: {Price, Model, Year, Make}
2-attribute relaxations: {(Price, Model), (Price, Year), (Price, Make).. }
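The ordering scheme above can be sketched as follows. The per-attribute weights are hypothetical hand-set values standing in for the mined dependence/influence weights; they are chosen only so the example reproduces the CarDB order from the slide.

```python
# Sketch: relax dependent (non-key) attributes first, then deciding (key)
# attributes; within each set, relax the lowest-weight attribute first.

def relaxation_order(deciding, dependent, weights):
    order_dep = sorted(dependent, key=lambda a: weights[a])
    order_dec = sorted(deciding, key=lambda a: weights[a])
    return order_dep + order_dec

def importance(order, n_attrs):
    """Wimp(Ai) = RelaxOrder(Ai) / count(Attributes(R)):
    attributes relaxed later are more important."""
    return {a: (i + 1) / n_attrs for i, a in enumerate(order)}

deciding = ["Make", "Year"]      # from the highest-support approximate key
dependent = ["Model", "Price"]   # attributes the key (approximately) decides
weights = {"Price": 0.1, "Model": 0.3, "Year": 0.5, "Make": 0.9}  # hypothetical

order = relaxation_order(deciding, dependent, weights)
print(order)                 # ['Price', 'Model', 'Year', 'Make']
print(importance(order, 4))  # Price least important, Make most important
```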
Query-Tuple Similarity

 Tuples in the extended set show different levels of relevance
 Ranked according to their similarity to the corresponding tuples in the base set, using

  Sim(Q, t) = Σ_{i=1..n} Wimp(Ai) × { VSim(Q.Ai, t.Ai)            if Dom(Ai) is Categorical
                                      1 − |Q.Ai − t.Ai| / Q.Ai    if Dom(Ai) is Numerical }

– n = count(Attributes(R)) and Wimp is the importance weight of the attribute
– Euclidean distance as similarity for numerical attributes, e.g. Price, Year
– VSim: semantic value similarity estimated by AIMQ for categorical attributes, e.g. Make, Model
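The weighted similarity above can be sketched as below. `toy_vsim` and the weights are hypothetical placeholders for the learned categorical similarity and attribute-importance scores.

```python
# Sketch of Sim(Q, t): an importance-weighted sum of per-attribute similarities,
# using VSim for categorical attributes and a normalized-difference similarity
# (1 - |Q.Ai - t.Ai| / Q.Ai, floored at 0) for numerical ones.

def query_tuple_similarity(query, t, wimp, categorical, vsim):
    total = 0.0
    for attr, q_val in query.items():
        if attr in categorical:
            s = vsim(attr, q_val, t[attr])
        else:
            s = max(0.0, 1.0 - abs(q_val - t[attr]) / q_val)
        total += wimp[attr] * s
    return total

wimp = {"Model": 0.6, "Price": 0.4}     # illustrative importance weights

def toy_vsim(attr, v1, v2):             # placeholder for the learned VSim
    return 1.0 if v1 == v2 else 0.3

q = {"Model": "Camry", "Price": 10000}
t = {"Model": "Camry", "Price": 9000}
print(query_tuple_similarity(q, t, wimp, {"Model"}, toy_vsim))  # 0.6*1.0 + 0.4*0.9
```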
Categorical Value Similarity

 Two words are semantically similar if they have a common context – an idea from NLP
 Context of a value represented as a set of bags of co-occurring values, called a supertuple
 Value similarity: estimated as the percentage of common {Attribute, Value} pairs

Supertuple for concept Make=Toyota, ST(QMake=Toyota):
  Model   Camry: 3, Corolla: 4, ….
  Year    2000: 6, 1999: 5, 2001: 2, ……
  Price   5995: 4, 6500: 3, 4000: 6

  VSim(v1, v2) = Σ_{i=1..m} Wimp(Ai) × JaccardSim(ST(v1).Ai, ST(v2).Ai)

– Measured as the Jaccard similarity among the supertuples representing the values:
  JaccardSim(A, B) = |A ∩ B| / |A ∪ B|
Empirical Evaluation

 Goal
– Test the robustness of the learned dependencies
– Evaluate the effectiveness of the query relaxation and similarity estimation

 Databases
– Used-car database CarDB, based on Yahoo Autos
  CarDB(Make, Model, Year, Price, Mileage, Location, Color)
  • Populated using 100k tuples from Yahoo Autos
– Census database from the UCI Machine Learning Repository
  • Populated using 45k tuples

 Algorithms
– AIMQ
  • RandomRelax – randomly picks the attribute to relax
  • GuidedRelax – uses the relaxation order determined using approximate keys and AFDs
– ROCK: RObust Clustering using linKs (Guha et al, ICDE 1999)
  • Compute neighbours and links between every pair of tuples
    – Neighbour: tuples similar to each other
    – Link: number of common neighbours between two tuples
  • Cluster tuples having common neighbours
Robustness of Dependencies

[Charts: attribute dependence per dependent attribute (Model, Color, Year, Make) and key quality, each measured over samples of 15k, 25k, 50k, and 100k tuples]

Attribute dependence order & key quality are unaffected by sampling.
Robustness of Value Similarities

Value            Similar Values    25k     100k
Make=“Kia”       Hyundai           0.17    0.17
                 Isuzu             0.15    0.15
                 Subaru            0.13    0.13
Make=“Bronco”    Aerostar          0.19    0.21
                 F-350             0       0.12
                 Econoline Van     0.11    0.11
Year=“1985”      1986              0.16    0.16
                 1984              0.13    0.14
                 1987              0.12    0.12
Efficiency of Relaxation

[Charts: work per relevant tuple for 10 queries, under Random Relaxation and Guided Relaxation, at thresholds ε = 0.5, 0.6, 0.7]

Random Relaxation:
• Average 8 tuples extracted per relevant tuple for ε = 0.5; increases to 120 tuples for ε = 0.7.
• Not resilient to changes in ε.

Guided Relaxation:
• Average 4 tuples extracted per relevant tuple for ε = 0.5; goes up to 12 tuples for ε = 0.7.
• Resilient to changes in ε.
Accuracy over CarDB

[Chart: average MRR of GuidedRelax, RandomRelax, and ROCK over 14 queries]

• Similarity learned using a 25k sample
• 14 queries over 100k tuples
• Mean Reciprocal Rank (MRR) estimated as
  MRR(Q) = Avg( 1 / (|UserRank(ti) − AIMQRank(ti)| + 1) )
• Overall high MRR shows high relevance of the suggested answers
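The MRR measure can be sketched as below; the tuple names and ranks are illustrative, not from the experiments.

```python
# Sketch of MRR(Q) = Avg( 1 / (|UserRank(t_i) - AIMQRank(t_i)| + 1) ):
# the closer the system's ranking is to the user's, the closer MRR is to 1.

def mrr(user_ranks, system_ranks):
    """Average reciprocal of (rank disagreement + 1) over the shared tuples."""
    return sum(1.0 / (abs(user_ranks[t] - system_ranks[t]) + 1)
               for t in user_ranks) / len(user_ranks)

user = {"t1": 1, "t2": 2, "t3": 3}
aimq = {"t1": 1, "t2": 3, "t3": 2}   # system swaps t2 and t3
print(mrr(user, aimq))               # perfect agreement would give 1.0
```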
Accuracy over CensusDB

[Chart: average query-tuple class similarity of AIMQ vs ROCK for Top-10, Top-5, Top-3, and Top-1 similar answers]

• 1000 randomly selected tuples used as queries
• Overall high MRR for AIMQ shows higher relevance of the suggested answers
AIMQ – Summary

 An approach for answering imprecise queries over Web databases
– Mines and uses AFDs to determine attribute ordering
– Domain-independent semantic similarity estimation technique
– Automatically computes attribute importance scores

 Empirical evaluation shows
– Efficiency and robustness of the algorithms
– Better performance than current approaches
– High relevance of the suggested answers
– Domain independence