An Investigation of the Cost and Accuracy Tradeoffs of Supplanting AFDs with Bayes Network in Query Processing in the Presence of Incompleteness in Autonomous Databases

MS Thesis Defense

Rohit Raghunathan

August 19th, 2011

Committee Members

Dr. Subbarao Kambhampati (Chair)

Dr. Joohyung Lee

Dr. Huan Liu

Overview of the talk

• Introduction to Incomplete Autonomous Databases

• Overview of QPIAD and shortcomings of AFD-based approaches

• Our approach: Bayes network based imputation and query rewriting

Introduction to Web databases

• Many websites let users query through a form-based interface and are supported by backend databases

• Consider used-car websites such as Cars.com and Yahoo! Autos

[Figure label: Autonomous Database]

Incompleteness in Web databases

• Web databases are often populated by lay individuals without any curation, e.g., Cars.com and Yahoo! Autos

• Web databases are also being populated using automated information extraction techniques, which are inherently imperfect

• Incomplete/uncertain tuple: a tuple in which one or more attributes have a missing value

Website          # of attributes   # of tuples   Incomplete tuples
Autotrader.com   13                25127         33.67%
Carsdirect.com   14                32564         98.74%

Problem Statement

• Many entities corresponding to tuples with missing values might be relevant to the user query

Q: Make = Honda

Example tuple: Make = Null, Model = Accord, Year = 2003, Body = Sedan

• Traditional query processing does not retrieve such tuples

Dimensions of the problem

• Single vs Multiple missing values

– Multiple missing values require capturing the correlations between them

ID   Make   Model   Year   Body    Mileage
1    Audi   Null    Null   Sedan   20000
2    Audi   A8      Null   Sedan   15000
3    Audi   Null    2005   Sedan   23000

• Imputation vs Query Rewriting

– Imputation can look at all available evidence

– Query rewriting requires finding the smallest number of evidence attributes

• Looking at all the evidence reduces throughput

• Looking at very little evidence reduces precision

• Need to find a middle ground

User Q: Model = A8

Rewritten Query


Approximate Functional Dependencies (AFDs)

• AFDs are functional dependencies that hold on all but a small fraction of the database

Make    Model   Body
Honda   Civic   Sedan
Honda   Civic   Coupe
Honda   Civic   Sedan
Honda   Civic   Sedan

Model → Body : 0.75

Make → Body : 0.75

Model → Make : 1.0

• An AFD is of the form X → A, where X is a set of attributes and A is a single attribute

• An attribute can have multiple rules
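
The confidence values on the AFD slide can be reproduced with a short sketch. It assumes one common definition of AFD confidence: the largest fraction of tuples that can be kept so that X → A holds exactly. The function name `afd_confidence` and the row layout are illustrative and are not taken from QPIAD.

```python
from collections import Counter, defaultdict


def afd_confidence(rows, lhs, rhs):
    """Confidence of the AFD lhs -> rhs (assumed definition: fraction of rows
    kept when every lhs-group retains only its most frequent rhs value)."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        groups[key][row[rhs]] += 1
    kept = sum(max(counts.values()) for counts in groups.values())
    return kept / len(rows)


# The four example tuples from the AFD slide.
rows = [
    {"Make": "Honda", "Model": "Civic", "Body": "Sedan"},
    {"Make": "Honda", "Model": "Civic", "Body": "Coupe"},
    {"Make": "Honda", "Model": "Civic", "Body": "Sedan"},
    {"Make": "Honda", "Model": "Civic", "Body": "Sedan"},
]

print(afd_confidence(rows, ["Model"], "Body"))  # 0.75
print(afd_confidence(rows, ["Make"], "Body"))   # 0.75
print(afd_confidence(rows, ["Model"], "Make"))  # 1.0
```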

Overview of QPIAD

Q: Body = Sedan

ID   Make    Model   Year   Body    Mileage
1    BMW     745     2005   Sedan   20000
2    Acura   Tl      2003   Null    35000
3    BMW     645     2002   Convt   45000
4    BMW     745     2001   Null    35000
5    Acura   Tl      2002   Sedan   24000

Relevant incomplete answers: T2 and T4

Model → Body : 0.75

• QPIAD uses AFDs and Naïve Bayes classifiers to retrieve relevant uncertain answers

• When the mediator has access privileges to modify the underlying data source

– Missing values can be completed by a simple classification task (imputation)

– After which traditional query processing will suffice

• When the mediator does not have such privileges

– Generate a set of rewritten queries and issue them to the autonomous database (query rewriting)

Issuing Q1: Model = Tl and Q2: Model = 745 will retrieve the relevant incomplete answers T2 and T4.

QPIAD uses only the highest-confidence AFD of each attribute for imputation and query rewriting

Techniques for combining multiple AFDs have been shown to be ineffective
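
The rewriting step on the QPIAD slide can be sketched as follows. This is a minimal illustration, not QPIAD's implementation: the helper names are hypothetical, the in-memory list stands in for the autonomous database, and `None` marks a missing value. The determining set `["Model"]` comes from the AFD Model → Body.

```python
cars = [
    {"ID": 1, "Make": "BMW", "Model": "745", "Year": 2005, "Body": "Sedan", "Mileage": 20000},
    {"ID": 2, "Make": "Acura", "Model": "Tl", "Year": 2003, "Body": None, "Mileage": 35000},
    {"ID": 3, "Make": "BMW", "Model": "645", "Year": 2002, "Body": "Convt", "Mileage": 45000},
    {"ID": 4, "Make": "BMW", "Model": "745", "Year": 2001, "Body": None, "Mileage": 35000},
    {"ID": 5, "Make": "Acura", "Model": "Tl", "Year": 2002, "Body": "Sedan", "Mileage": 24000},
]


def certain_answers(db, attr, value):
    """Tuples that satisfy the user query outright (the base result set)."""
    return [t for t in db if t[attr] == value]


def rewritten_queries(base_set, determining_set):
    """One rewritten query per distinct projection of the base set onto the determining set."""
    queries = []
    for t in base_set:
        q = {a: t[a] for a in determining_set}
        if q not in queries:
            queries.append(q)
    return queries


def uncertain_answers(db, query, attr):
    """Tuples matching the rewritten query whose value for the constrained attribute is missing."""
    return [t for t in db if t[attr] is None and all(t[a] == v for a, v in query.items())]


base = certain_answers(cars, "Body", "Sedan")
queries = rewritten_queries(base, ["Model"])
retrieved = [t["ID"] for q in queries for t in uncertain_answers(cars, q, "Body")]
print(queries)
print(sorted(retrieved))
```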

Shortcomings of AFD-based approaches

• Principles of locality and detachment do not hold for uncertain reasoning

• Model → Body (0.7)

• Intuitively, this means that the model of a car determines its body style with a probability of 0.7 when no other evidence is available.

• When other evidence is present, there is no easy way to combine the probabilities

Shortcomings of AFD-based approaches

[Table: example tuples T1 to T6 over ID, Make, Model, Year, Body and Mileage, several with missing values. Surviving values by column: Make: Audi, Audi, BMW, Audi, Audi; Model: A8, 745, A8; Year: 2002, 2005, 2005, 1999; Body and Mileage: Sedan 20000, Sedan 15000, Sedan 40000, Sedan 20000, Sedan 20000, Convt 25000]

• Imputing the missing value in T2 uses a single AFD and ignores the influence of other attributes

• Imputing the missing values in T1 ignores the correlation between the attributes Model and Year

• Imputing the missing values in T6 gets AFDs into cycles

Model → Make, Make → Model

Overview of the talk

• Introduction to Incomplete Autonomous Databases

• Overview of QPIAD and shortcomings of AFD-based approaches

• Our approach: Bayes network based imputation and query rewriting

– Introduction

– Learning Bayes network models from data

– Imputation

• Single and multiple missing values

• Varying levels of incompleteness in test data

– Query Rewriting

• Bayes network based rewriting

• Comparison of Bayes network based rewriting and AFDs


Bayes network

• A Bayes network is a DAG representing the probabilistic dependencies between attributes

• It is a compact representation of the full joint distribution

– Therefore, the influence of all variables is accounted for

• It represents the generative model of the autonomous database

• CPDs model the strength of the probabilistic dependencies

[Figure: Bayes network over Model, Year, Make, Body and Mileage, with an example CPD row over Model and Make: Civic, Honda, 0.8]
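
The compactness claim follows from the standard chain-rule factorization of a Bayes network. It is written here for attributes $A_1, \dots, A_n$, with $\mathrm{Pa}(A_i)$ denoting the parents of $A_i$ in the DAG; these symbols are chosen for this note and do not appear on the slide.

```latex
P(A_1, \dots, A_n) = \prod_{i=1}^{n} P\bigl(A_i \mid \mathrm{Pa}(A_i)\bigr)
```

Each factor is one CPD, so the number of parameters grows with the size of the largest parent set rather than with the size of the full joint table.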

Challenges in using Bayes networks for handling incompleteness in autonomous databases

• Learning and inference with Bayes networks are computationally harder than with AFDs

– Learning the topology and parameters from data involves searching over the space of topologies

• But this can be done offline

– Inference in a general Bayes network is intractable

• But approximate inference can be used

Question: Can we get the benefits of exact inference while containing costs?

Learning a Bayes network model

• Structure and parameter learning from data

– Challenge: involves searching over topologies

– Use the Banjo software package as a black box

– Experiments show that the learned topology is robust with respect to

• Sample size (5-20%): same topology

• Search time (5-30 minutes): same topology

• Maximum parent count (2-4): same topology; significantly more networks are examined when the limit is 2

Inference in Bayes networks

• Exact techniques

– NP-hard in the general case, and therefore do not scale well as incompleteness increases

– Junction Tree (fastest, but inapplicable when the query variables do not form a clique)

– Variable Elimination

• Approximate techniques (scale well while retaining the accuracy of exact methods)

– Gibbs Sampling

– The Infer.NET package allows us to use Expectation Propagation inference

Imputation

• Experimental Setup

– Test databases: a Cars.com database containing 8K tuples and the Adult database from the UCI repository containing 15K tuples

– Bayes net inference

• Exact inference: Junction Tree, Variable Elimination

• Approximate inference: Gibbs Sampling

Imputation

• Remove all the values of the attribute being predicted

• Substitute each missing value with the most likely value

• AFD approach

– Use only the highest-confidence AFD (use all attributes if its confidence is low, e.g., Mileage in the Cars database). The authors of QPIAD call this Hybrid-One.

• Bayes net

– Infer the posterior distribution of the missing attribute, given the other attributes in the tuple as evidence
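
The Bayes net imputation step can be illustrated by exact inference by enumeration on a toy network. This is a sketch under stated assumptions: the structure (Model → Make, Model → Body) and every probability below are hypothetical, chosen only to make the arithmetic easy to follow, and are not the learned model from the thesis.

```python
# Hypothetical CPTs for a toy network Model -> Make and Model -> Body.
P_MODEL = {"745": 0.3, "645": 0.2, "Tl": 0.5}
P_MAKE_GIVEN_MODEL = {
    "745": {"BMW": 1.0, "Acura": 0.0},
    "645": {"BMW": 1.0, "Acura": 0.0},
    "Tl": {"BMW": 0.0, "Acura": 1.0},
}
P_BODY_GIVEN_MODEL = {
    "745": {"Sedan": 0.9, "Convt": 0.1},
    "645": {"Sedan": 0.2, "Convt": 0.8},
    "Tl": {"Sedan": 1.0, "Convt": 0.0},
}


def posterior_model(make, body):
    """P(Model | Make, Body) by enumeration: prior times both likelihoods, then normalize."""
    scores = {
        m: P_MODEL[m] * P_MAKE_GIVEN_MODEL[m][make] * P_BODY_GIVEN_MODEL[m][body]
        for m in P_MODEL
    }
    total = sum(scores.values())
    if total == 0:
        raise ValueError("evidence has zero probability under the model")
    return {m: s / total for m, s in scores.items()}


def impute_model(make, body):
    """Substitute the missing Model with its most likely value given the evidence."""
    post = posterior_model(make, body)
    return max(post, key=post.get)


print(impute_model("BMW", "Sedan"))
print(impute_model("BMW", "Convt"))
```

Unlike a single AFD, the posterior here uses both Make and Body as evidence for the missing Model.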


Imputation- single missing attribute

[Figure: bar chart (vertical axis 0 to 1) comparing BN-Exact, BN-Gibbs and AFDs, shown with an example table of four tuples over ID, Make, Model, Year and Body]

• Significant difference for the attributes Model and Year.

• AFDs use only the highest-confidence rule and ignore the others.

– Attempts at combining evidence from multiple rules have been ineffective.

• Bayes nets systematically combine all the evidence.

Imputation- multiple missing attributes

• AFD-approach

– Predict each missing value independently

– Can get into cycles

Make → Model

Model → Make

• Bayes net

– Computes the joint distribution over the missing attributes

Make   Model   Year   Body
BMW    Null    Null   Sedan
BMW    Null    2003   Null
BMW    745     2004   Sedan
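
A small numeric sketch shows why predicting each missing value independently can differ from using the joint distribution. The joint posterior below is entirely hypothetical (including the model name "X5"); it is only meant to show the effect, not to reproduce any result from the thesis.

```python
from collections import defaultdict

# Hypothetical joint posterior P(Model, Year | evidence) for one tuple.
joint = {
    ("745", 2004): 0.35,
    ("745", 2003): 0.05,
    ("645", 2003): 0.30,
    ("X5", 2003): 0.30,
}


def marginal(joint, index):
    """Marginal distribution of the attribute at position `index` of each assignment."""
    totals = defaultdict(float)
    for assignment, p in joint.items():
        totals[assignment[index]] += p
    return dict(totals)


def impute_independently(joint):
    """Pick the most likely value of each attribute from its own marginal."""
    picks = []
    for index in range(2):
        m = marginal(joint, index)
        picks.append(max(m, key=m.get))
    return tuple(picks)


def impute_jointly(joint):
    """Pick the most likely combination of values."""
    return max(joint, key=joint.get)


print(impute_independently(joint))  # a combination with low joint probability
print(impute_jointly(joint))
```

With these numbers the independent picks form a combination whose joint probability is only 0.05, while the joint pick has probability 0.35.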

Imputation- multiple missing attributes

[Figure: bar chart (vertical axis 0 to 0.8) comparing AFD, BN-Gibbs and BN-Exact; network over Year, Model, Make, Body, Mileage and Price]

• When missing attributes are correlated, AFDs often get into cycles

– Only 9 out of 20 combinations could be predicted when 3 attributes are missing

• AFD accuracies are lower because each prediction uses a single rule independently

– BNs systematically combine evidence from multiple sources and capture correlations by finding the joint distribution

• When the attributes are d-separated and have similar prediction accuracies under both methods, there is no difference in accuracy

Imputation- Increase in incompleteness in test data

• The evidence available for predicting missing values shrinks as incompleteness increases

• AFD approach

– Chain AFDs when values in the determining set are themselves missing

• Bayes net

– No change: just compute the posterior distribution of the attributes to be imputed, given the evidence

Make   Model   Year   Body
BMW    Null    Null   Sedan
BMW    Null    2003   Null
BMW    745     2004   Sedan

Q: Model = 745

AFDs: {Make, Body} → Model, Year → Body

Imputation- Increase in incompleteness in test data

[Figure: three plots against percentage of incompleteness (0.1 to 0.9) comparing AFD, BN-Gibbs and BN-Exact; panels: Model, Race-Occupation, Year-Body]

Time Taken For Imputation

% incompleteness   AFD (sec.)   BN-Gibbs (sec., 250 samples)   BN-Exact (sec.)
0                  0.271        44.46                          16.23
10                 0.267        47.15                          44.88
20                 0.205        52.02                          82.52
30                 0.232        54.86                          128.26
40                 0.231        56.19                          182.33
50                 0.234        58.12                          248.75
60                 0.232        60.09                          323.78
70                 0.235        61.52                          402.13
80                 0.262        63.69                          490.31
90                 0.219        66.19                          609.65

BN-Gibbs retains the accuracy edge of BN-Exact while containing costs

Query Rewriting

• When mediators do not have access privileges to modify the data source, missing values cannot be substituted as in imputation.

• The mediator instead needs to generate and send “rewritten” queries to retrieve relevant uncertain answers.

Query Rewriting– Single-attribute queries

Q: Body = Sedan

ID   Make    Model   Year   Body    Mileage
1    BMW     745     2005   Sedan   20000
2    Acura   Tl      2003   Null    35000
3    BMW     645     2002   Convt   45000
4    BMW     745     2001   Null    35000
5    Acura   Tl      2002   Sedan   24000

Relevant incomplete answers: T2 and T4

CERTAIN ANSWERS (BASE RESULT SET)

ID   Make    Model   Year   Body    Mileage
1    BMW     745     2005   Sedan   20000
5    Acura   Tl      2002   Sedan   24000

Can retrieve T2 with Q’1: Model = Tl

Can retrieve T4 with Q’2: Model = 745

Generating Rewritten Queries

Q: Body = Sedan

CERTAIN ANSWERS (BASE RESULT SET)

ID   Make    Model   Year   Body    Mileage
1    BMW     745     2005   Sedan   20000
5    Acura   Tl      2002   Sedan   24000

[Figure: Bayes network over Model, Make, Year, Mileage and Body]

Bayes networks (BN-All-MB)

• Attributes: all attributes in the Markov blanket of the constrained attribute

• Q’1: Model = 745

• Q’2: Model = Tl

AFDs

• Attributes: determining set of the AFD

• Model → Body : 0.9

• Q’1: Model = 745

• Q’2: Model = Tl

Given evidence of all attributes in its Markov blanket, an attribute is independent of all other attributes
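
The Markov blanket used by BN-All-MB consists of a node's parents, its children, and its children's other parents. A minimal sketch of that computation follows; the DAG below is hypothetical, since the learned topology is not given in the text, and was chosen so that the blankets agree with the attribute sets mentioned on the slides.

```python
def markov_blanket(parents, node):
    """Parents, children, and the children's other parents of `node`."""
    children = {child for child, ps in parents.items() if node in ps}
    co_parents = {p for child in children for p in parents[child]}
    return (set(parents[node]) | children | co_parents) - {node}


# Hypothetical DAG, stored as child -> list of parents.
parents = {
    "Model": [],
    "Make": ["Model"],
    "Body": ["Model"],
    "Year": ["Model"],
    "Mileage": ["Year"],
}

print(markov_blanket(parents, "Body"))   # attributes to constrain for Q: Body = Sedan
print(markov_blanket(parents, "Model"))  # attributes to constrain for Q: Model = 745
```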

Ranking Rewritten queries

• Not all rewritten queries are equally good at retrieving relevant answers

– Cars with model “tl” are more likely to be sedans than cars with model “745”

• Rank queries based on their expected precision (ExpPrec)

$\mathrm{ExpPrec}(Q) = P(A_m = v_m \mid t_i)$

where $t_i \in \Pi_{MB(A_m)}(RS(Q))$ for Bayes nets and $t_i \in \Pi_{dtrSet(A_m)}(RS(Q))$ for AFDs

Q’1: Model = ‘tl’

ExpPrec(Q’1) = P(Body = Sedan | Model = tl) = 1

Q’2: Model = ‘745’

ExpPrec(Q’2) = P(Body = Sedan | Model = 745) = 0.6

• Bayes networks: use inference in the Bayes network

• AFDs: use Naïve Bayes classifiers
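
Ranking by expected precision can be sketched with a plain frequency estimate. This is a stand-in: the thesis obtains the probability from Bayes network inference (or from a Naïve Bayes classifier for AFDs), whereas the function below simply counts matching tuples in a sample. The sample itself is hypothetical and was chosen so that the estimates match the slide's two numbers.

```python
def expected_precision(sample, evidence, target_attr, target_value):
    """Estimate P(target_attr = target_value | evidence) from sample frequencies."""
    rows = [
        t for t in sample
        if t.get(target_attr) is not None
        and all(t.get(a) == v for a, v in evidence.items())
    ]
    if not rows:
        return 0.0
    hits = sum(1 for t in rows if t[target_attr] == target_value)
    return hits / len(rows)


# Hypothetical training sample.
sample = (
    [{"Model": "tl", "Body": "Sedan"}] * 4
    + [{"Model": "745", "Body": "Sedan"}] * 3
    + [{"Model": "745", "Body": "Convt"}] * 2
)

queries = [{"Model": "745"}, {"Model": "tl"}]
ranked = sorted(
    queries,
    key=lambda q: expected_precision(sample, q, "Body", "Sedan"),
    reverse=True,
)
print(ranked)
```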

Ranking Rewritten Queries- only K queries

• When database or network resources are limited, the mediator can choose to issue only the top-K queries to get the most relevant uncertain answers

– It is important to trade precision against throughput carefully

• Use the F-measure metric (idea borrowed from QPIAD)

$F\text{-measure} = \dfrac{(1 + \alpha) \cdot P \cdot R}{\alpha \cdot P + R}$

α = 0: only precision

P: expected precision (e.g., P(Model = 745 | Make = BMW))

R: expected recall

R = expected precision × expected selectivity

Expected selectivity = sample selectivity × sample ratio

The sample ratio is estimated from the cardinalities of the result sets from the sample and the original database
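
The top-K selection can be sketched as below, assuming the slide's formula reads F = (1 + α)·P·R / (α·P + R), which reduces to P when α = 0. The candidate queries, their precisions and their selectivities are hypothetical, and the function names are chosen for this sketch.

```python
def f_measure(precision, recall, alpha):
    """F = (1 + alpha) * P * R / (alpha * P + R); equals P when alpha is 0 and R > 0."""
    denominator = alpha * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + alpha) * precision * recall / denominator


def expected_recall(precision, sample_selectivity, sample_ratio):
    """R = expected precision * expected selectivity, as defined on the slide."""
    return precision * sample_selectivity * sample_ratio


def top_k(candidates, k, alpha, sample_ratio=1.0):
    """candidates: list of (name, expected precision, sample selectivity)."""
    def score(candidate):
        _, p, selectivity = candidate
        return f_measure(p, expected_recall(p, selectivity, sample_ratio), alpha)

    return [c[0] for c in sorted(candidates, key=score, reverse=True)[:k]]


# Hypothetical candidates: Q1 is precise but rarely matches, Q2 is the reverse.
candidates = [("Q1", 0.9, 0.02), ("Q2", 0.6, 0.5)]
print(top_k(candidates, 1, alpha=0))  # ranks by precision only
print(top_k(candidates, 1, alpha=1))  # also rewards throughput
```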

Experimental Setup

• Test databases: a Cars database consisting of 55K tuples and the Adult database consisting of 15K tuples

• Training set: 15% of the database

• Test data is split into two halves

– One half contains no incompleteness and is used to return the base result set

– In the other half, all query-constrained attributes are made null

– A copy of the test data is used as the ground truth to compute precision and recall

– This is an aggressive setup, since most databases have less than 50% incompleteness

BN-All-MB vs AFD

Q: Make

BN-All-MB: P(Make = bmw | Model = 330)

AFD: P(Make = bmw | Model = 330)

• When the size of the determining set is greater than 1, the expected precision values of AFDs (which are represented by NBCs) are inaccurate

• Actual precision is lower for AFDs because their expected precisions are inaccurate

Shortcoming of BN-All-MB

[Figure: Bayes network over Model, Make, Year, Mileage and Body]

Q: Model = 745

Q’1: Make ∧ Body ∧ Year

Q’2: Make ∧ Body ∧ Year

Q’3: Make ∧ Body ∧ Year

• Throughput of queries drops drastically as the Markov blanket size increases

– Use F-measure based ranking to increase recall

When almost all queries have very low throughput, there is simply no way to increase recall

BN-Beam (Single-attribute queries)

Q: Model = 745

ID   Make    Model   Year   Body    Mileage
1    BMW     745     2005   Sedan   20000
2    BMW     Null    2005   Sedan   35000
3    BMW     645     2002   Convt   45000
4    BMW     745     2001   Null    35000
5    Acura   Tl      2002   Sedan   24000
6    BMW     Null    2001   Sedan   20000

[Figure: Bayes network over Year, Model, Mileage, Make and Body]

Candidate attribute set = {Year, Make, Body}

BN-Beam

Level 1 (best rewritten queries of size 1)

• Make = BMW

• Year = 2001

• Body = Sedan

Level 2

• Make = BMW ∧ Year = 2001

• Make = BMW ∧ Year = 2005

• Body = Sedan

Level L

• Q’1, Q’2, Q’3

Pick the top-K queries at each level based on the F-measure metric

At level L, all (partial) queries have ≤ L attributes constrained

$F\text{-measure} = \dfrac{(1 + \alpha) \cdot P \cdot R}{\alpha \cdot P + R}$

P: expected precision (e.g., P(Model = 745 | Make = BMW))

R: expected recall

R = expected precision × expected selectivity

Expected selectivity = sample selectivity × sample ratio

The sample ratio is estimated from the cardinalities of the result sets from the sample and the original database

Issue the final queries to the database in increasing order of expected precision
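
The level-by-level search can be sketched as a small beam search. This is an illustration under stated assumptions, not the thesis implementation: expected precision and recall are estimated from sample frequencies rather than from Bayes network inference, the sample ratio is taken to be 1, and the names `score` and `bn_beam` are hypothetical. The sample reuses the six tuples from the BN-Beam example table.

```python
def f_measure(p, r, alpha):
    denominator = alpha * p + r
    return 0.0 if denominator == 0 else (1 + alpha) * p * r / denominator


def score(sample, query, target_attr, target_value, alpha):
    """F-measure of a (partial) query, with P and R estimated from sample counts."""
    rows = [
        t for t in sample
        if t[target_attr] is not None and all(t[a] == v for a, v in query)
    ]
    if not rows:
        return 0.0
    p = sum(t[target_attr] == target_value for t in rows) / len(rows)
    r = p * len(rows) / len(sample)
    return f_measure(p, r, alpha)


def bn_beam(sample, candidate_attrs, target_attr, target_value, k, levels, alpha):
    beam = [frozenset()]
    for _ in range(levels):
        # Keep current partial queries so that level L holds queries with <= L attributes.
        expanded = {q for q in beam if q}
        for q in beam:
            used = {a for a, _ in q}
            for attr in candidate_attrs:
                if attr in used:
                    continue
                for value in {t[attr] for t in sample if t[attr] is not None}:
                    expanded.add(q | {(attr, value)})
        ranked = sorted(
            expanded,
            key=lambda q: (
                -score(sample, q, target_attr, target_value, alpha),
                sorted((a, str(v)) for a, v in q),  # deterministic tie-break
            ),
        )
        beam = ranked[:k]
    return [dict(q) for q in beam]


sample = [
    {"Make": "BMW", "Model": "745", "Year": 2005, "Body": "Sedan"},
    {"Make": "BMW", "Model": None, "Year": 2005, "Body": "Sedan"},
    {"Make": "BMW", "Model": "645", "Year": 2002, "Body": "Convt"},
    {"Make": "BMW", "Model": "745", "Year": 2001, "Body": None},
    {"Make": "Acura", "Model": "Tl", "Year": 2002, "Body": "Sedan"},
    {"Make": "BMW", "Model": None, "Year": 2001, "Body": "Sedan"},
]
queries = bn_beam(sample, ["Year", "Make", "Body"], "Model", "745", k=3, levels=2, alpha=1)
print(queries)
```

Because partial queries survive from one level to the next, a short high-throughput query such as Make = BMW can still win over longer, more selective ones.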

BN-Beam vs BN-All-MB

[Figure: recall plot and precision plot; results for the top-10 queries for the user query Year = 2002]

• Increasing α does not increase the recall of BN-All-MB

• BN-Beam increases recall without a catastrophic reduction in precision

Multi-attribute queries

• Contribution to QPIAD

• Aim: to retrieve relevant uncertain answers with multiple missing values on query-constrained attributes.

Multi-attribute queries

Q: Make = BMW ∧ Mileage = 40000

ID   Make   Model   Year   Body    Mileage
1    Null   645     2002   Coupe   40000
2    BMW    645     2002   Convt   Null
3    Null   745     2001   Sedan   Null
4    Null   645     2002   Coupe   Null
5    BMW    745     2001   Coupe   40000
6    BMW    645     2002   Convt   40000

Base result set = T5, T6

QPIAD retrieves T1 and T2.

BN-Beam can also retrieve T3 and T4.

BN-Beam

• Candidate attribute set: the union of the attributes in the Markov blankets of all constrained attributes

• All other steps are the same as in the single-attribute query case

Comparison over multi-attribute queries

• Two AFD approaches

1. AFD-All-Attributes: creates a conjunctive query by joining all attributes in the determining sets of the AFDs of the constrained attributes.

Consider the AFDs Model → Make and Year → Mileage

Q: Make = BMW ∧ Mileage = 40000

Rewritten queries for Make = BMW: Model = 745, Model = 645

Rewritten queries for Mileage = 40000: Year = 2001, Year = 2002

Q’1: Model = 745 ∧ Year = 2001

Q’2: Model = 645 ∧ Year = 2001

Q’3: Model = 745 ∧ Year = 2002

Q’4: Model = 645 ∧ Year = 2002

Expected precision = product of the individual queries’ expected precisions
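
The AFD-All-Attributes combination step can be sketched as a cross product of the per-attribute rewritten queries, with the expected precisions multiplied. The precision values below are hypothetical and the function name is chosen for this sketch.

```python
from itertools import product


def afd_all_attributes(per_attribute_rewrites):
    """Combine one rewritten query per constrained attribute into conjunctive queries.

    per_attribute_rewrites: one list per constrained attribute, each holding
    (constraints, expected precision) pairs.
    """
    combined = []
    for combo in product(*per_attribute_rewrites):
        query, precision = {}, 1.0
        for constraints, p in combo:
            query.update(constraints)
            precision *= p  # product of the individual expected precisions
        combined.append((query, precision))
    return sorted(combined, key=lambda qp: -qp[1])


# Hypothetical expected precisions for the rewrites of Make = BMW and Mileage = 40000.
make_rewrites = [({"Model": "745"}, 0.9), ({"Model": "645"}, 0.8)]
mileage_rewrites = [({"Year": 2001}, 0.5), ({"Year": 2002}, 0.4)]

combined = afd_all_attributes([make_rewrites, mileage_rewrites])
for query, precision in combined:
    print(query, round(precision, 2))
```

Multiplying the precisions treats the constrained attributes as independent, and nothing checks whether a combination such as Model = 645 ∧ Year = 2001 matches any tuple at all.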

BN-Beam vs AFD-All-Attributes

Results for top-10 queries

Q: Make ∧ Mileage

Precision of BN-Beam is competitive with AFD-All-Attributes

Recall of BN-Beam is higher

• AFD-All-Attributes does not consider the joint distribution between the query-constrained attributes.

• This leads to low-throughput or even empty queries

Comparison of multi-attribute queries

2. AFD-Highest-Confidence: uses only the AFD of the constrained attribute with the highest-confidence AFD for rewriting

Q: Make = Dodge ∧ Year = 2004

Ignore all attributes other than Make

AFD: Model → Make

Q’1: Model = ram

Q’2: Model = intrepid

BN-Beam vs AFD-Highest-Confidence

Results for top-10 queries

Q: Make ∧ Year (Car database)

AFD-Highest-Confidence increases recall, but not without a catastrophic drop in precision

Summary

• A comparison of the cost and accuracy tradeoffs of using Bayes network models and AFDs for handling incompleteness in autonomous databases

• Bayes nets have a significant edge over AFDs when missing values are on highly correlated attributes and at higher levels of incompleteness in the test data.

• Presented two approaches, BN-All-MB and BN-Beam, for generating rewritten queries using Bayes networks.

• We showed that BN-Beam is able to retrieve tuples with higher recall than BN-All-MB. We compared Bayes network based rewriting with AFD-based rewriting and found the former to retrieve results with higher precision and recall.

Deviations From the Thesis Draft

• CAVEAT: I found two bugs in my code (Query Rewriting section)

• Corrected one bug (related to BN-based rewriting)

• Will correct the other one (related to AFD-based rewriting) after the defense

THANK YOU

QUESTIONS?