Query Processing over Incomplete Autonomous Web Databases MS Thesis Defense by Hemal Khatri

Committee Members:
Prof. Subbarao Kambhampati (chair)
Prof. Chitta Baral
Prof. Yi Chen
Prof. Huan Liu
Introduction to Web databases


Many websites let users pose queries through a form-based interface and are backed by databases
Consider used-car websites such as Cars.com, Yahoo! Autos, etc.
[Figure: Autonomous Database]
Incompleteness in Web databases




Web databases are often populated by lay individuals without any curation, e.g. Cars.com, Yahoo! Autos
Web databases are also populated using automated information extraction techniques, which are inherently imperfect
The local schema of a data source may not support certain attributes supported by the global schema
Incomplete/Uncertain tuple: A tuple in which one or more attributes have a missing value
Website          # of attributes   # of tuples   incomplete tuples   body style   engine
autotrader.com   13                25127         33.67%              3.6%         8.1%
carsdirect.com   14                32564         98.74%              55.7%        55.8%
Problem Statement


Many entities corresponding to tuples with missing
values might be relevant to the user query
Current query processing techniques return answers that
exactly satisfy the user query
– Such techniques return results with high precision but
low recall
Q: Make=Honda
Example tuple: (Make=null, Model=Accord, Year=2003, Body style=sedan)
Relevant Uncertain tuple: A tuple which does not exactly
satisfy the query predicates but the entity represented by
that tuple might be relevant to the query
How to support query processing over incomplete
autonomous databases in order to retrieve ranked
uncertain results?
Challenges Involved


How to predict missing
values in autonomous
databases?
As autonomous databases
are accessible only through
form-based interfaces, how
to retrieve relevant
uncertain answers?
– How to keep query
processing cost
manageable in retrieving
uncertain tuples?

How to rank the retrieved
uncertain answers?
Related Work

Probabilistic databases
– Incomplete databases are similar to probabilistic
databases once we assess the probabilities for missing
values
– TRIO: uncertainty with lineage
– ConQuer: handling inconsistency over databases
• Assume probability distributions are given for uncertain
or inconsistent attributes
– We assess probability distribution for missing attribute and
use it to rank rewritten queries to retrieve relevant answers
since the probabilities cannot be stored in databases
– Our query rewriting framework is general and can be used
by these systems if the databases are autonomous

Handling Missing Values
– EM algorithm, Bayes Net, Association rules
Possible Approaches
For a query Q: body style = convt
1. Certain Answers Only (CAO): Return certain answers only, as in traditional databases
– Low recall
2. All Uncertain Answers (AUA): Null matches any concrete value, hence return all answers having body style=convt along with answers having body style as null
– Low precision, infeasible
3. Relevant Uncertain Answers (RUA): Rank answers by predicting the values of the missing attribute
– Costly, infeasible
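
The three approaches can be contrasted on a toy relation. This is only a sketch: the rows, the `predict` function and its probabilities are made up for illustration and are not taken from the thesis.

```python
# Toy relation; None stands for a null (missing) value. Illustrative data only.
cars = [
    {"make": "Audi", "model": "a4", "body_style": "convt"},
    {"make": "BMW", "model": "z4", "body_style": None},
    {"make": "Honda", "model": "civic", "body_style": "sedan"},
    {"make": "Audi", "model": "a4", "body_style": None},
]

def certain_answers(rows, attr, value):
    """CAO: only tuples whose stored value exactly matches the query."""
    return [r for r in rows if r[attr] == value]

def all_uncertain_answers(rows, attr, value):
    """AUA: exact matches plus every tuple whose value is null, unranked."""
    return [r for r in rows if r[attr] == value or r[attr] is None]

def relevant_uncertain_answers(rows, attr, value, predict):
    """RUA: certain answers first, then null tuples ranked by predicted probability."""
    certain = certain_answers(rows, attr, value)
    uncertain = [r for r in rows if r[attr] is None]
    uncertain.sort(key=lambda r: predict(r, value), reverse=True)
    return certain + uncertain

# Hypothetical predictor: P(body_style = convt | model), hard-coded for the sketch.
ASSUMED_P_CONVT = {"z4": 1.0, "a4": 0.4}

def predict(row, value):
    return ASSUMED_P_CONVT.get(row["model"], 0.0) if value == "convt" else 0.0

cao = certain_answers(cars, "body_style", "convt")
aua = all_uncertain_answers(cars, "body_style", "convt")
rua = relevant_uncertain_answers(cars, "body_style", "convt", predict)
```

Note that both AUA and RUA need every null tuple to be retrievable, which a form-based interface typically does not allow.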
Outline





Introduction
QPIAD: Query Processing over
Incomplete Autonomous Databases
Data Integration over Incomplete
Autonomous Databases
Other Contributions
Conclusion
QPIAD System Architecture
RRUA: Generating Rewritten Queries


The Restricted Relevant Uncertain Answers (RRUA) approach retrieves only relevant incomplete tuples instead of retrieving all tuples as in AUA and RUA
Consider a query Q: Body style=convt
Base Result Set: RS(Q)
Make      Model     Year   Price   Body style
Audi      a4        2004   20000   convt
BMW       z4        2003   17000   convt
Porsche   boxster   2000   13000   convt
…..       ……        ……     ……      ……
Rewritten queries are based on the determining attribute set (dtrSet) of the AFD for Body style: Model ~~> Body style (confidence 0.9)
Q1: model=‘a4’
Q2: model=‘z4’
Q3: model=‘boxster’
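
The rewriting step can be sketched as a projection of the base result set onto the determining set. The function name `rewritten_queries` and the dictionary representation of a query are assumptions made for this illustration, not the thesis's implementation.

```python
def rewritten_queries(base_result_set, dtr_set):
    """Project RS(Q) onto the determining set and emit one selection
    query (attribute -> value) per distinct projected tuple."""
    queries = []
    for row in base_result_set:
        query = {attr: row[attr] for attr in dtr_set}
        if query not in queries:  # keep first-seen order, drop duplicates
            queries.append(query)
    return queries

# Base result set for Q: body style = convt, as on the slide.
rs_q = [
    {"make": "Audi", "model": "a4", "year": 2004, "price": 20000, "body_style": "convt"},
    {"make": "BMW", "model": "z4", "year": 2003, "price": 17000, "body_style": "convt"},
    {"make": "Porsche", "model": "boxster", "year": 2000, "price": 13000, "body_style": "convt"},
]

# AFD Model ~~> Body style, so the determining set is {model}.
queries = rewritten_queries(rs_q, ["model"])
```

Each rewritten query would then be issued with the extra condition that body style is null, so that only incomplete tuples come back.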
Learning Attribute Correlations
Sample Database -> TANE Algorithm -> AFDs and AKeys -> Prune AFDs based on AKeys -> AFDs for Query Rewriting and Feature Selection in classifier
Example: in the AFD VIN ~~> Model, VIN is an Approximate Key (AKey) with high confidence
Such an AFD is not useful for query rewriting or feature selection, since a query on VIN cannot retrieve additional new tuples
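
A minimal sketch of the confidence measures behind this pruning, assuming the common definition of confidence as the fraction of tuples that remain once the fewest violating tuples are removed (one minus the g3 error used by TANE). This is not TANE itself, and the sample rows and the 0.9 threshold are hypothetical.

```python
from collections import Counter, defaultdict

def afd_confidence(rows, lhs, rhs):
    """Fraction of tuples kept if, for each lhs value, only the most
    frequent rhs value is retained (so that lhs -> rhs holds exactly)."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[a] for a in lhs)][r[rhs]] += 1
    kept = sum(max(counts.values()) for counts in groups.values())
    return kept / len(rows)

def akey_confidence(rows, attrs):
    """Fraction of tuples kept if duplicates on attrs are removed."""
    distinct = {tuple(r[a] for a in attrs) for r in rows}
    return len(distinct) / len(rows)

def prune_afds(rows, afds, max_akey_conf=0.9):
    """Drop AFDs whose determining set is itself an approximate key."""
    return [(lhs, rhs) for lhs, rhs in afds
            if akey_confidence(rows, lhs) < max_akey_conf]

# Hypothetical sample database.
sample = [
    {"vin": 1, "model": "z4", "body_style": "convt"},
    {"vin": 2, "model": "z4", "body_style": "convt"},
    {"vin": 3, "model": "a4", "body_style": "convt"},
    {"vin": 4, "model": "a4", "body_style": "sedan"},
    {"vin": 5, "model": "a4", "body_style": "sedan"},
]

candidate_afds = [(["vin"], "model"), (["model"], "body_style")]
kept_afds = prune_afds(sample, candidate_afds)
```

Here VIN ~~> Model holds perfectly but is pruned because VIN is a key, while Model ~~> Body style survives.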
RRUA: Ranking Rewritten Queries

All queries may not be equally good in
retrieving relevant answers
– “z4” model cars are more likely to be
convertibles than a car with “a4” model

When database or network resources
are limited, the mediator can choose to
issue the top K queries to get the most
relevant uncertain answers
Learning Value Distributions



Used to rank queries based on the determining set of attributes from the AFD for the query attribute
We use a Naïve Bayes Classifier with m-estimates, with AFDs as a feature selection step
Rank of a rewritten query Qi: R(Qi) = P(Am=vm | ti), where ti ∈ Π_dtrSet(Am)(RS(Q))
– Q1: model=‘a4’, R(Q1) = P(bodystyle=convt | model=a4) = 0.4
– Q2: model=‘z4’, R(Q2) = P(bodystyle=convt | model=z4) = 1.0
– Q3: model=‘boxster’, R(Q3) = P(bodystyle=convt | model=boxster) = 0.7
R(Q2) > R(Q3) > R(Q1)

Relevant uncertain answers are ranked based on the rank of the rewritten query that retrieved them
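
The ranking step can be sketched with a small Naïve Bayes estimator using m-estimate smoothing. This is a sketch under assumptions: the uniform prior p = 1/|domain|, m = 1, the function names and the sample rows are all chosen for illustration, so the probabilities differ from the 0.4 / 1.0 / 0.7 values quoted on the slide.

```python
from collections import Counter

def nbc_posterior(sample, features, target, evidence, target_value, m=1.0):
    """P(target = target_value | evidence) under Naive Bayes, where each
    likelihood is the m-estimate (n_fc + m * p) / (n_c + m), p = 1 / |domain|."""
    class_counts = Counter(r[target] for r in sample)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(sample)  # class prior
        for f in features:
            domain = {r[f] for r in sample}
            n_fc = sum(1 for r in sample if r[target] == c and r[f] == evidence[f])
            score *= (n_fc + m / len(domain)) / (n_c + m)
        scores[c] = score
    return scores.get(target_value, 0.0) / sum(scores.values())

def rank_queries(sample, queries, target, target_value):
    """Order rewritten queries by the predicted probability of the target value."""
    scored = [(nbc_posterior(sample, list(q), target, q, target_value), q)
              for q in queries]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical sample database: (model, body_style, number of copies).
sample = [
    {"model": model, "body_style": style}
    for model, style, count in [
        ("z4", "convt", 3),
        ("boxster", "convt", 2), ("boxster", "coupe", 1),
        ("a4", "convt", 1), ("a4", "sedan", 3),
    ]
    for _ in range(count)
]

queries = [{"model": "a4"}, {"model": "z4"}, {"model": "boxster"}]
ranked = rank_queries(sample, queries, "body_style", "convt")
top_k = [q for _, q in ranked[:2]]  # a mediator with limited resources issues only these
```

Tuples retrieved by a query inherit that query's score, which is what orders the uncertain answers.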
Combining AFDs and Classifiers


More than one AFD may exist for some
attributes
Experimented with several
approaches:
– Only best-AFD having highest confidence
– All attributes ignoring AFDs
– Hybrid One-AFD
– Ensemble of classifiers
Empirical Evaluation of QPIAD


Test Databases: AutoTrader database
containing 100K tuples and Census
database from UCI Repository containing
50K tuples
Oracular study: To evaluate the
effectiveness of our system against a
ground truth, we artificially insert missing
values in 10% of the tuples within these
databases
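
The oracular setup can be sketched as follows: hide the value of one attribute in a fraction of the tuples, keep the hidden values as ground truth, and score a retrieved list against them. The helper names and the fixed random seed are assumptions for this illustration only.

```python
import random

def mask_values(rows, attr, fraction, seed=0):
    """Replace attr with None in a random fraction of rows and return
    the masked copy together with the hidden ground-truth values."""
    rng = random.Random(seed)
    k = round(len(rows) * fraction)
    hidden = set(rng.sample(range(len(rows)), k))
    masked = [dict(r, **{attr: None}) if i in hidden else dict(r)
              for i, r in enumerate(rows)]
    truth = {i: rows[i][attr] for i in hidden}
    return masked, truth

def precision_recall(retrieved_ids, truth, value):
    """Precision and recall of retrieved masked tuples for the query attr = value."""
    relevant = {i for i, v in truth.items() if v == value}
    hits = sum(1 for i in retrieved_ids if i in relevant)
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical data: 20 tuples, 10% of them masked on body_style.
rows = [{"model": "z4" if i % 2 else "a4",
         "body_style": "convt" if i % 2 else "sedan"} for i in range(20)]
masked, truth = mask_values(rows, "body_style", 0.10)
```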
RRUA vs AUA vs RUA
[Figure: precision vs. recall for Q:body style=convt and Q:education=bachelors, comparing AUA, RUA and RRUA. Legend counts as extracted: AUA (1245, 467), RUA (1245, 467), RRUA (204, 209)]
Precision over Top K Tuples
[Figure: precision vs. top K tuples for Q:education=bachelors and Q:body style=convt, comparing AUA, RUA and RRUA]
Ranking the Rewritten Queries
[Figure: average accumulated precision vs. Kth query, for the Cars database and the Census database]
Robustness of QPIAD
[Figure: accumulated precision vs. Kth query for Q:workshop=private, with curves labeled 3%, 5%, 10% and 15%]
User Relevance Issues with QPIAD



When the query processor presents
incomplete tuples, it becomes a
recommender system
How to convince users to trust the system's results?
For a query Q: year=2000, each uncertain tuple is shown with an explanation:
Make    Model   Year   Price   Mileage
Honda   Civic   null   15000   18000
Explanation: We have determined that this car’s year is 60% likely to be 2000 based on price=15000 and mileage=18000
Outline





Introduction
QPIAD: Query Processing over
Incomplete Autonomous Databases
Data Integration over Incomplete
Autonomous Databases
Other Contributions
Conclusion
Leveraging Correlations between
Data Sources
Q:Body style=coupe
Mediator:GS(Make,Model,Year,Price,Mileage,Bodystyle)
Correlated Source and Maximum
Correlated Source

Consider four sources with schema:
– S1(Make,Model,Year,Price)
– S2(Engine,Drive,Bodystyle),
• AFD: {Engine, Drive} -> Body style confidence 0.7
– S3(Make,Model,Body style)
• AFD: Model -> Body style confidence 0.8
– S4(Make,Price,Body style)
• AFD: {Make, Price} -> Body Style confidence 0.6
– Mediator global schema GS(Make,Model,Year,Price,
Bodystyle, Engine, Drive)


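S3 and S4 are correlated sources for S1 on the Body style attribute: the determining sets of their AFDs (Model; Make and Price) are supported by S1, whereas S2's determining set {Engine, Drive} is not
S3 is the maximum correlated source for S1 on the Body style attribute, since its AFD has the highest confidence (0.8)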
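
The selection of correlated sources for this four-source example can be sketched as below. The data structures and the function name `correlated_sources` are assumptions for the illustration. A source counts as correlated when it has an AFD for the attribute whose determining set is contained in the target source's schema.

```python
# Schemas and AFDs (determining set, dependent attribute, confidence) from the slide.
sources = {
    "S1": {"schema": {"Make", "Model", "Year", "Price"}, "afds": []},
    "S2": {"schema": {"Engine", "Drive", "Body style"},
           "afds": [({"Engine", "Drive"}, "Body style", 0.7)]},
    "S3": {"schema": {"Make", "Model", "Body style"},
           "afds": [({"Model"}, "Body style", 0.8)]},
    "S4": {"schema": {"Make", "Price", "Body style"},
           "afds": [({"Make", "Price"}, "Body style", 0.6)]},
}

def correlated_sources(sources, target, attr):
    """Sources with an AFD for attr whose determining set the target supports,
    ordered by decreasing AFD confidence."""
    target_schema = sources[target]["schema"]
    found = []
    for name, src in sources.items():
        if name == target:
            continue
        for lhs, rhs, conf in src["afds"]:
            if rhs == attr and lhs <= target_schema:  # lhs is a subset of the schema
                found.append((name, conf))
    return sorted(found, key=lambda pair: pair[1], reverse=True)

correlated = correlated_sources(sources, "S1", "Body style")
max_correlated = correlated[0][0] if correlated else None
```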
Retrieving Relevant Uncertain
Answers from CarsDirect.com



Consider a query Q:body style = coupe(GS)
Cars.com has an AFD: Model ~~> Body style(0.9)
Cars.com is the maximum correlated source for
CarsDirect.com which doesn’t support Body style
but supports Model attribute
Make    Model     Year   Price   Body style
Honda   Accord    2003   19000   coupe
Ford    Mustang   2004   29100   coupe
Acura   Legend    1997   12000   coupe
BMW     325       2003   28000   coupe
Q1: model=Accord
Q2: model=Mustang
Q3: model=Legend
Q4: model=325
Empirical Evaluation of using
Correlation between Data Sources



We consider a mediator performing data
integration over three sources: Cars.com,
Yahoo! Autos and CarsDirect.com
Yahoo! Autos and CarsDirect.com do not
allow querying on body style but when the
tuples are retrieved we can check the body
style attribute to determine if the tuple
retrieved has the body style specified in the
query
Evaluation using attribute correlations and
value distributions learned from Cars.com
for 5 test queries on body style attribute
Retrieving Relevant Answers using Correlations from Cars.com
[Figure: precision vs. Kth tuple for Carsdirect.com and Yahoo! Autos]
Handling Joins over Incomplete
Autonomous databases

Mediator performing data integration across two sources:
– Source S1 is incomplete
– Source S2 is complete
Source   Local Schema
S1       Cars(Make,Model,Year,Price)
S2       Review(Model,Ratings)
Mediator View
UsedCars(Make,Model,Year,Price,Ratings) :- Cars(Make,Model,Year,Price), Review(Model,Ratings)
Issues in Handling Joins


Performing joins over probabilistic
databases will lead to a disjunction in join
results
Consider joining uncertain tuples from the
two sources:
Uncertain tuple from S1:
Make    Model                          Year   Price
Honda   null [0.6 Civic] [0.4 Accord]  2003   18000
Tuples from S2:
Model    Ratings
Civic    5
Accord   4
Join result (approximation):
Make    Model    Year   Price   Ratings   Probability
Honda   Civic    2003   18000   5         0.6
or
Honda   Accord   2003   18000   4         0.4
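
The disjunction in this join can be sketched as follows, using the Honda tuple and the two reviews from the slide. The function name `join_uncertain` and the representation of the predicted Model distribution as a dictionary are assumptions for the illustration.

```python
def join_uncertain(car, model_dist, reviews):
    """Join one tuple with a missing Model against Review(Model, Ratings).
    Each candidate model yields one alternative join result, weighted by
    the predicted probability of that model."""
    alternatives = []
    for model, prob in model_dist.items():
        if model in reviews:
            joined = dict(car, model=model, ratings=reviews[model])
            alternatives.append((prob, joined))
    return sorted(alternatives, key=lambda pair: pair[0], reverse=True)

car = {"make": "Honda", "model": None, "year": 2003, "price": 18000}
model_dist = {"Civic": 0.6, "Accord": 0.4}   # predicted values for the null Model
reviews = {"Civic": 5, "Accord": 4}          # complete source S2

alternatives = join_uncertain(car, model_dist, reviews)
best_prob, best_row = alternatives[0]        # approximation: keep the most likely alternative
```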
Handling Join Queries


Q: σMake=Honda(UsedCars)
Assume AFDs: {Make,Year} ~~> Model, Model ~~> Make

Source S1: Cars
Make     Model(FK)   Year   Price
Honda    Odyssey     2000   10000
Honda    Accord      2004   20000
Honda    null        2000   15000   (Model: 0.6 Civic, 0.4 Accord)
null     Accord      2002   18000
Toyota   Camry       2003   16000

Source S2: Review
Model(PK)   Ratings
Civic       5
Corolla     4
Accord      4
Altima      3
Camry       5
Odyssey     3

Q1: Model=Odyssey: R(Q1)=1
Q2: Model=Accord: R(Q2)=1
Queries on source S2 to join
Q3: Model=Odyssey: R(Q3)=1
Q4: Model=Accord: R(Q4)=1
Q5: Model=Civic: R(Q5)=0.6

Join results
Make    Model     Year   Price   Ratings   Rank
Honda   Odyssey   2000   10000   3
Honda   Accord    2004   20000   4
null    Accord    2002   18000   4         1.0
Honda   null      2000   15000   5         0.6
Experimental Results Joins
[Figure: precision vs. recall for Q:model=Civic, Q:make=audi and Q:ratings=4, comparing RUA and RRUA. Legend counts as extracted: RUA (2475, 2475, 4892), RRUA (157, 58, 24)]
Outline





Introduction
QPIAD: Query Processing over
Incomplete Autonomous Databases
Data Integration over Incomplete
Autonomous Databases
Other Contributions
Conclusion
QUIC: Querying under Imprecision and
Incompleteness



Consider a query Q: model like Civic (Cars)
The user might be interested in similar cars such as “Accord”, “Camry”, etc.
Ranking results in the presence of both similar and incomplete tuples
Id   Make     Model     Year   Body style
1    Honda    Civic     2000   Sedan
2    Honda    Accord    2004   Coupe
3    Toyota   Camry     2001   Sedan
4    Honda    null      2004   Coupe
5    Honda    null      2000   Sedan
6    Honda    Civic     2004   Coupe
7    BMW      3series   2001   convt
8    Toyota   null      1999   sedan
Other Contributions [*Collaboration with Garrett Wolf]



Handling multi-attribute selection
queries for incomplete databases*
QUIC system for query processing
under imprecision and incompleteness
Online learning of value distribution
based on base result set to avoid
sample biases
Conclusion


Thesis proposed a framework for query
processing over incomplete autonomous web
databases:
– QPIAD: Query processing over incomplete
autonomous databases
– QPIAD: Data Integration over multiple
incomplete data sources
Results of empirical evaluation on real world
databases show that our system returns
relevant answers with high precision while
keeping the query processing cost manageable
Thank You!!
Questions??