An Investigation of the Cost and Accuracy Tradeoffs of Supplanting AFDs with Bayes Network in Query Processing in the Presence of Incompleteness in Autonomous Databases
MS Thesis Defense
Rohit Raghunathan
August 19th, 2011
Committee Members
Dr. Subbarao Kambhampati (Chair)
Dr. Joohyung Lee
Dr. Huan Liu
1
• Introduction to Incomplete Autonomous Databases
• Overview of QPIAD and shortcomings of AFD-based approaches
• Our approach: Bayes network based imputation and query rewriting
2
• Many websites let users query through a form-based interface and are backed by databases
• Consider used-car websites such as Cars.com, Yahoo! Autos, etc.
Autonomous Database
4
• Web databases are often populated by lay individuals without any curation, e.g., Cars.com and Yahoo! Autos
• Web databases are also populated using automated information extraction techniques, which are inherently imperfect
• Incomplete/uncertain tuple: a tuple in which one or more attributes have a missing value
Website         # of attributes  # of tuples  Incomplete tuples
Autotrader.com  13               25127        33.67%
Carsdirect.com  14               32564        98.74%
5
• Many entities corresponding to tuples with missing values might be relevant to the user query
Q: Make = Honda
Make  Model   Year  Body
null  Accord  2003  Sedan
• Traditional query processing does not retrieve such tuples
6
• Single vs. multiple missing values
– Multiple missing values require capturing the correlations between them
ID  Make  Model  Year  Body   Mileage
1   Audi  null   null  Sedan  20000
2   Audi  A8     null  Sedan  15000
3   Audi  null   2005  Sedan  23000
• Imputation vs. query rewriting
– Imputation can look at all available evidence
– Query rewriting requires finding the smallest set of evidence attributes
• Looking at all the evidence -> reduced throughput
• Looking at very little evidence -> reduced precision
• Need to find a middle ground
User Q: Model = A8
Rewritten Query
Approximate Functional Dependencies (AFDs)
• AFDs are Functional Dependencies that hold on all but a small fraction of the database
Make   Model  Body
Honda  Civic  Sedan
Honda  Civic  Coupe
Honda  Civic  Sedan
Honda  Civic  Sedan
Model → Body : 0.75
Make → Body : 0.75
Model → Make : 1.0
• An AFD is of the form X → A, where X is a set of attributes and A is a single attribute
• An attribute can have multiple rules
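As a minimal sketch of how such confidences could be computed, the following Python assumes that the confidence of an AFD is the largest fraction of tuples on which the dependency holds exactly (one minus the g3 error). The slides do not state the definition, and the names `ROWS` and `afd_confidence` are invented for this illustration.

```python
from collections import Counter, defaultdict

# The four-tuple example table from the slide.
ROWS = [
    {"Make": "Honda", "Model": "Civic", "Body": "Sedan"},
    {"Make": "Honda", "Model": "Civic", "Body": "Coupe"},
    {"Make": "Honda", "Model": "Civic", "Body": "Sedan"},
    {"Make": "Honda", "Model": "Civic", "Body": "Sedan"},
]


def afd_confidence(rows, determining_set, dependent):
    """Fraction of tuples kept when, for every value of the determining
    set, only the most frequent value of the dependent attribute stays."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[attr] for attr in determining_set)
        groups[key][row[dependent]] += 1
    kept = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return kept / len(rows)


print(afd_confidence(ROWS, ["Model"], "Body"))  # 0.75
print(afd_confidence(ROWS, ["Make"], "Body"))   # 0.75
print(afd_confidence(ROWS, ["Model"], "Make"))  # 1.0
```

Under this assumed definition the three values agree with the confidences listed on the slide.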
9
Q: Body = Sedan
ID  Make   Model  Year  Body   Mileage
1   BMW    745    2005  Sedan  20000
2   Acura  Tl     2003  null   35000
3   BMW    645    2002  Convt  45000
4   BMW    745    2001  null   35000
5   Acura  Tl     2002  Sedan  24000
Relevant incomplete answers: T2 and T4
AFD: Model → Body : 0.75
• QPIAD uses AFDs and Naïve Bayes Classifiers to retrieve relevant uncertain answers
• When the mediator has access privileges to modify the underlying data source
– Missing values can be completed by a simple classification task (imputation)
– After which traditional query processing suffices
• When the mediator does not have such privileges
– Generate a set of rewritten queries and issue them to the autonomous database (query rewriting)
– Issuing Q1: Model = Tl and Q2: Model = 745 will retrieve the relevant incomplete answers T2 and T4
• QPIAD uses only the highest-confidence AFD of each attribute for imputation and query rewriting
• Techniques for combining multiple AFDs were shown to be ineffective
10
Shortcomings of AFD-based approaches
• Principles of locality and detachment do not hold for uncertain reasoning
• Model → Body (0.7)
• Intuitively, this means that the model of a car determines its body with probability 0.7 when no other evidence is available
• When other evidence is present, there is no easy way to combine the probabilities
11
Shortcomings of AFD-based approaches
ID  Make  Model  Year  Body   Mileage
1   Audi  null   null  Sedan  20000
2   Audi  A8     null  Sedan  15000
3   BMW   745    2002  Sedan  40000
4   Audi  A8     2005  Sedan  20000
5   Audi  null   2005  Sedan  20000
6   null  null   1999  Convt  25000
• Imputing the missing value in T2 uses a single AFD and ignores the influence of the other attributes
• Imputing the missing values in T1 ignores the correlation between the attributes Model and Year
• Imputing the missing values in T6 gets the AFDs into a cycle: Model → Make, Make → Model
12
• Introduction to Incomplete Autonomous Databases
• Overview of QPIAD and shortcomings of AFD-based approaches
• Our approach: Bayes network based imputation and query rewriting
– Introduction
– Learning Bayes network models from data
– Imputation
• Single and multiple missing values
• Varying levels of incompleteness in test data
– Query Rewriting
• Bayes network based rewriting
• Comparison of Bayes network based rewriting and AFDs
13
• A Bayes network is a DAG representing the probabilistic dependencies between attributes
• It is a compact representation of the full joint distribution
– Therefore, the influence of all variables is accounted for
• It represents the generative model of the autonomous database
CPDs model the strength of the probabilistic dependencies
[Figure: Bayes network over the attributes Make, Model, Year, Body, and Mileage, with an example CPD entry relating Model = Civic and Make = Honda with probability 0.8]
15
Challenges in using Bayes networks for handling incompleteness in autonomous databases
• Learning and inference with Bayes networks are computationally harder than with AFDs
– Learning the topology and parameters from data involves searching over the space of topologies
• But can be done offline
– Inference in a general Bayes network is intractable.
• But can use approximate inference
Question: Can we get benefits of exact inference while containing costs?
16
• Structure and parameter learning from data
– Challenge: involves searching over topologies
– Use the Banjo software package as a black box
– Experiments show the learned topology is robust w.r.t.
• Sample size (5-20%) – same topology
• Search time (5-30 minutes) – same topology
• Max parent count (2-4) – same topology; significantly more networks are examined when the count is 2
18
• Exact techniques
– NP-hard in the general case; therefore they do not scale well as incompleteness increases
– Junction Tree (fastest, but inapplicable when query variables do not form a clique)
– Variable Elimination
• Approximate techniques (scale well while retaining the accuracy of exact methods)
– Gibbs Sampling
– The Infer.NET package allows us to use Expectation Propagation inference
19
• Experimental Setup
– Test Databases: Cars.com database containing 8K tuples and Adult Database from UCI repository containing 15K tuples
– Bayes net inference
• Exact inference: Junction Tree, Variable Elimination
• Approximate inference: Gibbs Sampling
21
• Remove all values of the attribute being predicted
• Substitute each missing value with the most likely value
• AFD approach
– Use only the highest-confidence AFD (use all attributes if the confidence is low, e.g., Mileage in Cars). Called Hybrid-one by the authors of QPIAD.
• Bayes net
– Infer the posterior distribution of the missing attribute given the evidence from the other attributes in the tuple
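The Bayes net step can be illustrated with a small sketch. It assumes a hypothetical three-node network (Model → Make and Model → Body) with made-up probabilities. This is neither the topology learned by Banjo nor the inference code used in the experiments, and all names below are invented for illustration. The posterior is obtained by plain enumeration, which is only practical because the network is tiny.

```python
# Hypothetical network: Model -> Make and Model -> Body.
# All probabilities below are invented for illustration.
P_MODEL = {"A8": 0.2, "TT": 0.3, "745": 0.5}
P_MAKE_GIVEN_MODEL = {
    "A8": {"Audi": 1.0, "BMW": 0.0},
    "TT": {"Audi": 1.0, "BMW": 0.0},
    "745": {"Audi": 0.0, "BMW": 1.0},
}
P_BODY_GIVEN_MODEL = {
    "A8": {"Sedan": 0.9, "Coupe": 0.1},
    "TT": {"Sedan": 0.1, "Coupe": 0.9},
    "745": {"Sedan": 1.0, "Coupe": 0.0},
}


def posterior_model(evidence):
    """P(Model | evidence) by enumeration; evidence may contain Make and/or Body."""
    scores = {}
    for model, prior in P_MODEL.items():
        score = prior
        if "Make" in evidence:
            score *= P_MAKE_GIVEN_MODEL[model][evidence["Make"]]
        if "Body" in evidence:
            score *= P_BODY_GIVEN_MODEL[model][evidence["Body"]]
        scores[model] = score
    total = sum(scores.values())
    if total == 0:
        raise ValueError("evidence has zero probability under the model")
    return {model: score / total for model, score in scores.items()}


def impute_model(evidence):
    """Substitute the missing Model with its most likely value."""
    posterior = posterior_model(evidence)
    return max(posterior, key=posterior.get)


print(impute_model({"Make": "Audi"}))                   # TT
print(impute_model({"Make": "Audi", "Body": "Sedan"}))  # A8
```

With Make alone as evidence the most likely model is TT, but adding the Body evidence changes the imputed value to A8, which shows why using all the available evidence can matter.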
22
Imputation - single missing attribute
ID  Make  Model  Year  Body
1   Audi  A8     null  Sedan
2   BMW   745    2002  Sedan
3   Audi  null   2005  Sedan
4   Audi  A8     2005  Sedan
[Figure: imputation accuracy of BN-Exact, BN-Gibbs, and AFDs; y-axis from 0 to 1]
• Significant difference for the attributes Model and Year
• AFDs use only the highest-confidence rule and ignore the others
– Attempts at combining evidence from multiple rules have been ineffective
• Bayes nets systematically combine all the evidence
24
Imputation - multiple missing attributes
• AFD approach
– Predict each missing value independently
– Can get into cycles: Make → Model, Model → Make
• Bayes net
– Computes the joint distribution over the missing attributes
Make  Model  Year  Body
BMW   null   null  Sedan
BMW   null   2003  null
BMW   745    2004  Sedan
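To see why predicting from the joint distribution can differ from predicting each missing value independently, consider the following sketch. The table `JOINT` is a hypothetical posterior P(Model, Year | Make = BMW) with made-up numbers; the function names are likewise invented for this illustration.

```python
# Hypothetical posterior P(Model, Year | Make = BMW); the numbers are made up.
JOINT = {
    ("745", "2003"): 0.4,
    ("745", "2004"): 0.0,
    ("645", "2003"): 0.3,
    ("645", "2004"): 0.3,
}


def marginal(joint, index):
    """Marginal distribution of the attribute at position `index`."""
    result = {}
    for assignment, probability in joint.items():
        value = assignment[index]
        result[value] = result.get(value, 0.0) + probability
    return result


def independent_prediction(joint):
    """Pick the most likely value of each attribute separately."""
    models = marginal(joint, 0)
    years = marginal(joint, 1)
    return (max(models, key=models.get), max(years, key=years.get))


def joint_prediction(joint):
    """Pick the most likely combination of values."""
    return max(joint, key=joint.get)


print(independent_prediction(JOINT))  # ('645', '2003')
print(joint_prediction(JOINT))        # ('745', '2003')
```

In this example the independently chosen pair has joint probability 0.3, while the jointly chosen pair has probability 0.4.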
25
Imputation - multiple missing attributes
[Figure: comparison of AFD, BN-Gibbs, and BN-Exact; legible labels: Year, Model, Make, Body, Mileage, Price; y-axis from 0 to 0.8]
• When missing attributes are correlated, AFDs often get into cycles
– Only 9 out of 20 combinations could be predicted when 3 attributes are missing
• AFD accuracies are lower because each AFD uses a single rule independently for prediction
– BNs systematically combine evidence from multiple sources and capture correlations by finding the joint distribution
• When the missing attributes are d-separated and both methods predict each of them with similar accuracy, there is no difference in accuracy
26
Imputation - increase in incompleteness in test data
• Evidence for predicting missing values decreases as incompleteness increases
• AFD approach
– Chain AFDs when values in the determining set of an AFD are themselves missing
• Bayes net
– No change: just compute the posterior distribution of the attributes to be imputed given the available evidence
Make  Model  Year  Body
BMW   null   null  Sedan
BMW   null   2003  null
BMW   745    2004  Sedan
Q: Model = 745
AFDs: Make, Body → Model; Year → Body
28
Imputation - increase in incompleteness in test data
[Figure: three panels (Model, Race-Occupation, Year-Body) comparing AFD, BN-Gibbs, and BN-Exact; x-axis: percentage of incompleteness, from 0.1 to 0.9]
29
% incompleteness  AFD (sec.)  BN-Gibbs (sec., 250 samples)  BN-Exact (sec.)
0                 0.271       44.46                         16.23
10                0.267       47.15                         44.88
20                0.205       52.02                         82.52
30                0.232       54.86                         128.26
40                0.231       56.19                         182.33
50                0.234       58.12                         248.75
60                0.232       60.09                         323.78
70                0.235       61.52                         402.13
80                0.262       63.69                         490.31
90                0.219       66.19                         609.65
BN-Gibbs retains the accuracy edge of BN-Exact while containing costs
30
• When mediators do not have access privileges, missing values cannot be substituted as in the case of imputation.
• Need to generate and send “rewritten” queries to retrieve relevant uncertain answers.
32
Query Rewriting - single-attribute queries
Q: Body = Sedan
ID  Make   Model  Year  Body   Mileage
1   BMW    745    2005  Sedan  20000
2   Acura  Tl     2003  null   35000
3   BMW    645    2002  Convt  45000
4   BMW    745    2001  null   35000
5   Acura  Tl     2002  Sedan  24000
Relevant incomplete answers: T2 and T4
CERTAIN ANSWERS (BASE RESULT SET)
ID  Make   Model  Year  Body   Mileage
1   BMW    745    2005  Sedan  20000
5   Acura  Tl     2002  Sedan  24000
Can retrieve
T2 with Q’1: Model = Tl
T4 with Q’2: Model = 745
33
Q: Body = Sedan
[Figure: Bayes network over Model, Make, Year, Mileage, and Body]
CERTAIN ANSWERS (BASE RESULT SET)
ID  Make   Model  Year  Body   Mileage
1   BMW    745    2005  Sedan  20000
5   Acura  Tl     2002  Sedan  24000
Bayes networks (BN-All-MB)
• Attributes: all attributes in the Markov blanket
• Q’1: Model = 745
• Q’2: Model = Tl
AFDs
• Attributes: determining set of the AFD, Model → Body : 0.9
• Q’1: Model = 745
• Q’2: Model = Tl
Given evidence for all attributes in its Markov blanket, an attribute is independent of all other attributes
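The Markov blanket of a node consists of its parents, its children, and the other parents of its children. The sketch below computes it for a hypothetical DAG; the edges in `PARENTS` are an assumption chosen for illustration, not the topology learned in the thesis.

```python
# Hypothetical DAG, stored as node -> list of parents.
# The edges are assumed for illustration only.
PARENTS = {
    "Year": [],
    "Model": ["Year"],
    "Make": ["Model"],
    "Body": ["Model"],
    "Mileage": ["Year"],
}


def markov_blanket(parents, node):
    """Parents, children, and the children's other parents of `node`."""
    children = [child for child, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(node)
    return blanket


print(sorted(markov_blanket(PARENTS, "Body")))   # ['Model']
print(sorted(markov_blanket(PARENTS, "Model")))  # ['Body', 'Make', 'Year']
```

With these assumed edges, a query on Body is rewritten using Model only, while a query on Model would constrain Year, Make, and Body.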
34
• Not all queries are equally good at retrieving relevant answers
– Cars with model “tl” are more likely to be sedans than cars with model “745”
• Rank queries based on their expected precision (ExpPrec)
$$\mathrm{ExpPrec}(Q) = P(A_m = v_m \mid t_i)$$
where $t_i \in \Pi_{MB(A_m)}(RS(Q))$ for Bayes nets and $t_i \in \Pi_{dtrSet(A_m)}(RS(Q))$ for AFDs
Q’1: Model = ‘tl’, ExpPrec(Q’1) = P(Body = Sedan | Model = tl) = 1
Q’2: Model = ‘745’, ExpPrec(Q’2) = P(Body = Sedan | Model = 745) = 0.6
• Bayes networks: inference in the Bayes network
• AFDs: use Naïve Bayes Classifiers
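The generate-and-rank step can be sketched as follows. As a stand-in for Bayes network inference or a Naïve Bayes Classifier, the sketch estimates P(Body = Sedan | Model) by simple counting over a hypothetical training sample; `SAMPLE`, `BASE_RESULT_SET`, and the function names are invented for illustration, and the sample was chosen so that the estimates reproduce the 1 and 0.6 of the example above.

```python
# Hypothetical training sample of (Model, Body) pairs.
SAMPLE = [
    ("tl", "Sedan"), ("tl", "Sedan"),
    ("745", "Sedan"), ("745", "Sedan"), ("745", "Sedan"),
    ("745", "Convt"), ("745", "Convt"),
]

# Certain answers to the user query Q: Body = Sedan.
BASE_RESULT_SET = [
    {"ID": 1, "Make": "BMW", "Model": "745", "Body": "Sedan"},
    {"ID": 5, "Make": "Acura", "Model": "tl", "Body": "Sedan"},
]


def expected_precision(sample, model, body):
    """Counting estimate of P(Body = body | Model = model)."""
    bodies = [b for m, b in sample if m == model]
    return bodies.count(body) / len(bodies)


def ranked_rewrites(base_result_set, sample, body):
    """One rewritten query per distinct Model in the base result set,
    sorted by decreasing expected precision."""
    models = sorted({row["Model"] for row in base_result_set})
    scored = [(m, expected_precision(sample, m, body)) for m in models]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for model, precision in ranked_rewrites(BASE_RESULT_SET, SAMPLE, "Sedan"):
    print(f"Model = {model}: ExpPrec = {precision}")
```

Each rewritten query would additionally require Body to be null, so that only incomplete tuples are retrieved.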
35
Ranking Rewritten Queries - only K queries
• When database or network resources are limited, the mediator can choose to issue only the top-K queries to get the most relevant uncertain answers
– It is important to carefully trade precision against throughput
• Use the F-measure metric (idea borrowed from QPIAD)
$$F\text{-measure} = \frac{(1 + \alpha) \cdot P \cdot R}{\alpha \cdot P + R}$$
α = 0: only precision
P: expected precision (e.g., P(Model = 745 | Make = BMW))
R: expected recall
R = expected precision * expected selectivity
Expected selectivity = sample selectivity * sample ratio
The sample ratio is estimated from the cardinalities of the result sets from the sample and the original database
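A small sketch of the ranking metric follows. It reads the slide's formula as (1 + α)·P·R / (α·P + R), which is consistent with the note that α = 0 leaves only precision. The function names and the numbers in the usage lines are illustrative assumptions.

```python
def f_measure(precision, recall, alpha):
    """(1 + alpha) * P * R / (alpha * P + R); returns 0.0 when undefined."""
    denominator = alpha * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + alpha) * precision * recall / denominator


def expected_recall(expected_precision, sample_selectivity, sample_ratio):
    """R = expected precision * (sample selectivity * sample ratio)."""
    return expected_precision * sample_selectivity * sample_ratio


def top_k(queries, k, alpha):
    """queries: list of (name, precision, recall); keep the k best by F-measure."""
    ranked = sorted(queries, key=lambda q: f_measure(q[1], q[2], alpha), reverse=True)
    return [name for name, _, _ in ranked[:k]]


# Made-up candidates: one precise but narrow query, one broad but imprecise.
CANDIDATES = [("narrow", 0.9, 0.05), ("broad", 0.5, 0.6)]
print(top_k(CANDIDATES, 1, alpha=0))    # ['narrow']
print(top_k(CANDIDATES, 1, alpha=1.0))  # ['broad']
```

Raising α shifts the ranking toward queries with higher expected recall, which is the precision versus throughput trade mentioned above.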
36
• Test databases: Cars database consisting of 55K tuples and Adult database consisting of 15K tuples
• Training set: 15% of the database
• Test data is split into two halves
– One half contains no incompleteness and is used to return the base result set
– In the other half, all query-constrained attributes are made null
– A copy of the test data is used as the ground truth to compute precision and recall
– This is an aggressive setup, since most databases have <50% incompleteness
37
BN-All-MB: P(Make=bmw|model= 330)
AFD: P(Make=bmw|model=330)
Q: Make
• When the size of the determining set is > 1, the expected precision values of AFDs (estimated by NBCs) are inaccurate
• Actual precision is lower for AFDs because their expected precisions are inaccurate
38
Q: Model = 745
[Figure: Bayes network over Model, Make, Year, Mileage, and Body]
Q’1, Q’2, Q’3: each constrains Make ᴧ Body ᴧ Year
• Throughput of queries drops drastically as the Markov blanket size increases
• Use F-measure based ranking to increase recall
• When almost all queries have very low throughput, there is simply no way to increase recall
39
Q: Model = 745
ID  Make   Model  Year  Body   Mileage
1   BMW    745    2005  Sedan  20000
2   BMW    null   2005  Sedan  35000
3   BMW    645    2002  Convt  45000
4   BMW    745    2001  null   35000
5   Acura  Tl     2002  Sedan  24000
6   BMW    null   2001  Sedan  20000
[Figure: Bayes network over Year, Model, Mileage, Make, and Body]
Candidate Attribute Set = {Year, Make, Body}
40
Level 1 (best rewritten queries of size 1)
Make = BMW
Year = 2001
Body = Sedan
Level 2
Make = BMW ^ Year = 2001
Make = BMW ^ Year = 2005
Body = Sedan
Level L
Q’1, Q’2, Q’3
• Pick the top-K queries at each level based on the F-measure metric
• At level L, all (partial) queries have ≤ L attributes constrained
$$F\text{-measure} = \frac{(1 + \alpha) \cdot P \cdot R}{\alpha \cdot P + R}$$
P: expected precision (e.g., P(Model = 745 | Make = BMW))
R: expected recall
R = expected precision * expected selectivity
Expected selectivity = sample selectivity * sample ratio
The sample ratio is estimated from the cardinalities of the result sets from the sample and the original database
• Issue the queries to the database in increasing order of expected precision
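The level-wise search can be sketched as a beam search. The version below is a simplified illustration rather than the thesis implementation: precision and selectivity are estimated by counting over a hypothetical complete sample instead of by Bayes network inference, the sample ratio is taken to be 1, and recall is expressed as a fraction of the sample. `SAMPLE` and all function names are invented.

```python
# Hypothetical complete training sample.
SAMPLE = [
    {"Make": "BMW", "Model": "745", "Year": "2005", "Body": "Sedan"},
    {"Make": "BMW", "Model": "745", "Year": "2001", "Body": "Sedan"},
    {"Make": "BMW", "Model": "645", "Year": "2002", "Body": "Convt"},
    {"Make": "BMW", "Model": "745", "Year": "2001", "Body": "Sedan"},
    {"Make": "Acura", "Model": "Tl", "Year": "2002", "Body": "Sedan"},
    {"Make": "BMW", "Model": "645", "Year": "2005", "Body": "Convt"},
]


def f_measure(precision, recall, alpha):
    denominator = alpha * precision + recall
    return 0.0 if denominator == 0 else (1 + alpha) * precision * recall / denominator


def score(query, sample, target, alpha):
    """F-measure of a rewritten query (a set of (attribute, value) pairs)."""
    hits = [row for row in sample if all(row[a] == v for a, v in query)]
    if not hits:
        return 0.0
    attr, value = target
    precision = sum(row[attr] == value for row in hits) / len(hits)
    recall = precision * len(hits) / len(sample)  # simplification: sample ratio = 1
    return f_measure(precision, recall, alpha)


def beam_rewrites(sample, target, candidate_attrs, k, max_level, alpha):
    """Keep the top-k (partial) queries at each level, then extend them."""
    attr, value = target
    base = [row for row in sample if row[attr] == value]
    values = {a: sorted({row[a] for row in base}) for a in candidate_attrs}
    beam = [frozenset()]
    for _ in range(max_level):
        pool = {q for q in beam if q}  # earlier queries stay in the competition
        for q in beam:
            used = {a for a, _ in q}
            for a in candidate_attrs:
                if a not in used:
                    pool.update(q | {(a, v)} for v in values[a])
        # Sort by decreasing score; ties are broken deterministically.
        beam = sorted(pool, key=lambda q: (-score(q, sample, target, alpha), sorted(q)))[:k]
    return beam


for query in beam_rewrites(SAMPLE, ("Model", "745"), ["Year", "Make", "Body"],
                           k=2, max_level=2, alpha=1.0):
    print(" ^ ".join(f"{a} = {v}" for a, v in sorted(query)))
```

On this sample with K = 2 and α = 1, the search keeps Body = Sedan ^ Make = BMW and Body = Sedan after level 2.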
41
Recall Plot
Results for Top-10 queries for user query Year = 2002
Precision Plot
• Increasing α does not increase recall of BN-All-MB
• BN-Beam increases recall without a catastrophic reduction in precision
42
• Contribution to QPIAD
• Aim: to retrieve relevant uncertain answers with multiple missing values on query-constrained attributes
43
Q: Make = BMW ʌ Mileage = 40000
ID  Make  Model  Year  Body   Mileage
1   null  645    2002  Coupe  40000
2   BMW   645    2002  Convt  null
3   null  745    2001  Sedan  null
4   null  645    2002  Coupe  null
5   BMW   745    2001  Coupe  40000
6   BMW   645    2002  Convt  40000
Base result set = T5, T6
QPIAD retrieves T1 and T2.
BN-Beam can also retrieve T3 and T4.
Candidate attribute set: union of the attributes in the Markov blankets of all constrained attributes
All other steps are the same as in the single-attribute query case
44
Comparison over multi-attribute queries
• Two AFD approaches
1. AFD-All-Attributes: creates a conjunctive query by joining all attributes in the determining sets of the AFDs of the constrained attributes.
Consider the AFDs Model → Make and Year → Mileage
Q: Make = BMW ʌ Mileage = 40000
Rewrites for Make = BMW: Model = 745, Model = 645
Rewrites for Mileage = 40000: Year = 2001, Year = 2002
Q’1: Model = 745 ᴧ Year = 2001
Q’2: Model = 645 ᴧ Year = 2001
Q’3: Model = 745 ᴧ Year = 2002
Q’4: Model = 645 ᴧ Year = 2002
Expected precision = product of the individual queries’ expected precisions
45
Results for top-10 queries
Q: Make ^ Mileage
Precision of BN-Beam is competitive with AFD-All-Attributes
Recall of BN-Beam is higher
• AFD-All-Attributes does not consider the joint distribution between the query-constrained attributes.
• This leads to low-throughput or even empty queries
46
Comparison of multi-attribute queries
2. AFD-Highest-Confidence: uses only the AFD of the constrained attribute with the highest confidence for rewriting
Q: Make = Dodge ᴧ Year = 2004
Ignore all attributes other than Make
AFD: Model → Make
Q’1: Model = ram
Q’2: Model = intrepid
BN-Beam vs AFD-Highest-Confidence
Results for top-10 queries
Q:Make ʌ Year
(Car database)
AFD-Highest-Confidence increases recall but NOT WITHOUT a
CATASTROPHIC drop in precision
48
• A comparison of the cost and accuracy tradeoffs of using Bayes network models and AFDs for handling incompleteness in autonomous databases
• Bayes nets have a significant edge over AFDs when missing values are on highly correlated attributes and at higher levels of incompleteness in test data.
• Presented two approaches, BN-All-MB and BN-Beam, for generating rewritten queries using Bayes networks.
• Showed that BN-Beam retrieves tuples with higher recall than BN-All-MB.
• Compared Bayes network based rewriting with AFD-based rewriting and found that the former retrieves results with higher precision and recall.
49
• CAVEAT: I found two bugs in my code (Query Rewriting section)
• Corrected one bug (related to BN-based rewriting)
• Will correct the other one (related to AFD-based rewriting) after the defense
THANK YOU
QUESTIONS?
50