Mining Approximate Functional Dependencies (AFDs) as Condensed Representations of Association Rules


Master’s Thesis Defense by Aravind Krishna Kalavagattu

Committee Members:

Dr. Subbarao Kambhampati (chair)

Dr. Yi Chen

Dr. Huan Liu

Database Systems

• Well-defined schema and method for querying (SQL)

• Query optimization

• Lately, some systems have started supporting IR-style answering of user queries

Data Mining

• Discovering useful patterns from data

• Rule mining, with several applications over databases

Rule learning is a well-researched method for discovering interesting relations between variables in large databases

• Association Rules

Introduction to AFDs

Approximate Functional Dependencies are rules denoting approximate determinations at attribute level.

AFDs are of the form (X ~~> Y), where X and Y are sets of attributes

 X is the “determining set” and Y is called the “dependent set”

Rules with singleton dependent sets are of high interest

A classic example of an AFD:

(Nationality ~~> Language)

This indicates that we can approximately guess the language of a person if we know which country she is from.

More examples:

Make ~~> Model

(Job Title, Experience) ~~> Salary

Introduction (contd.)

 Functional Dependency (FD)

 Given a relation R, a set of attributes X in R is said to functionally determine another attribute Y, also in R, (written X → Y) if and only if each X value is associated with precisely one Y value.

 AFDs can be loosely defined as FDs that approximately hold (some exception rows fail to satisfy the dependency over the current relation)

 Example: Make~~>Model (with error = 0.3)

 70% of the tuples satisfy the dependency

Applications of AFDs

Predicting missing values of attributes in relational tables (QPIAD)

• Using the values of the attributes in the determining set of the AFD

Query rewriting (AIMQ, QPIAD, QUIC)

• Example: Model ~~> BodyStyle

• Rewrite a query on Model = “RAV4” to also retrieve tuples with BodyStyle = “SUV”

Query optimization (CORDS, BHUNT)

• Maintaining correct selectivity estimates

Database design (database normalization, efficient storage)

• Similar to the way FDs are used
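As a sketch of the query-rewriting use above, in Python: the table name, the SQL shape, and the mapping from a determining-set value to its most likely dependent value are all hypothetical, and real systems such as QPIAD additionally rank and select among rewritten queries.

```python
def rewrite_query(table, attr, value, afd_best_value):
    """Rewrite a selection query using an AFD (e.g., Model ~~> BodyStyle):
    besides exact matches, also fetch tuples sharing the dependent-attribute
    value most strongly associated with the queried value, so relevant
    tuples with a missing attr value can still be retrieved."""
    dep_attr, dep_value = afd_best_value[(attr, value)]
    return (f"SELECT * FROM {table} WHERE {attr} = '{value}' "
            f"OR {dep_attr} = '{dep_value}'")

# hypothetical roll-up learned from the data: RAV4s are most often SUVs
mapping = {("Model", "RAV4"): ("BodyStyle", "SUV")}
print(rewrite_query("cars", "Model", "RAV4", mapping))
# SELECT * FROM cars WHERE Model = 'RAV4' OR BodyStyle = 'SUV'
```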

FD Mining and Implications

 FD mining aims at finding a minimal cover

 The minimum set of FDs from which the entire set of FDs can be generated

 Example: If A → B is an FD, then, ({A,C} → B) is considered redundant

 Can we likewise generate only minimal dependencies in the case of AFDs?

 NO, because a non-minimal AFD (Z ~~> B) may be interesting for the application, and we may prefer it to (A ~~> B).

 Non-minimal dependencies perform better in QPIAD, QUIC, etc.

 Example: the AFD (JobTitle, Experience) ~~> Salary vs. (JobTitle ~~> Salary)

Performance Concerns

 AFD mining is costly

 The pruning strategies of FDs are not applicable to AFDs.

 For datasets with a large number of attributes, the search space explodes!

 The method for determining whether a dependency holds is costly

 The way to traverse the search space is tricky

 Bottom-up vs. top-down?

Quality Concerns

 Before algorithms for discovering AFDs can be developed,

AFDs need better Interestingness measures

 AFDs used as feature selectors in classification are expected to give good accuracy.

 AFDs used in query rewriting are expected to give a high throughput per query.

 (VIN ~~> Make) vs. (Model ~~> Make)

 (VIN ~~> Make) looks good under the error metric

 But, intuitively (as well as practically), (Model ~~> Make) is the better AFD.

Challenges in AFD Mining

1. Defining right interestingness measures

2. Performing an efficient traversal in the search space of possible rules

3. Employing effective pruning strategies

Agenda/Outline

 Introduction

 Related Work

 Provide new perspective for AFDs

 Roll-ups/condensed representations of association rules

 Define measures for AFDs

 Present the AFDMiner algorithm

 Experimental Results

 Performance

 Quality


Related Work

FD Mining Algorithms

• Aim at finding minimal cover

• DepMiner, FUN, TANE, FD_Mine

Existing Approximation measures for AFDs

• Tau, InD metrics

Grouping association rules

• Clustering association rules: (v1 ~> u), (v2 ~> u) as (v1 ^ v2 ~> u)

Do not work well for AFDs

• Metrics do not seem to matter in practice

• No accompanying algorithm to mine AFDs

• No one combines them as AFDs

Existing AFD Miners

CORDS

• SoftFDs (C1 => C2)

• Uses |C1,C2| / (|C1| |C2|) as the approximation measure

• Restricted to singleton determining sets

• Works from a sample

• The measure used is not appropriate

AIMQ/QPIAD/QUIC

• TANE

• Post-processing over TANE

• Highly inefficient

• Quality of some AFDs is bad

Agenda/Outline

 Introduction

 Related Work

 Provide new perspective for AFDs

 Roll-ups/condensed representations of association rules

 Define measures for AFDs

 Present the AFDMiner algorithm

 Experimental Results

 Performance

 Quality

Condensing Association Rules

 Viewing database relations as transactions

 Itemsets ≈ attribute-value pairs

 Association rules

 Between itemsets (Beer ~> Diapers)

 Here, they are between attribute-value pairs

 Example: (Toyota, Camry) ~> Sedan

 AFDs are rules between attributes

 Corresponding to a lot of association rules sharing the same attributes

 Example: rolling up association rules as the AFD Make ~~> Model

 (Honda ~~> Accord), (Toyota ~~> Camry), … , (Tata ~~> Maruti800)

Confidence

 Consider an association rule of the form (α → β)

 Confidence denotes the conditional probability of β (head) given α (body).

 Similarly, for an AFD (X ~~> A),

 Confidence should denote the chance of finding the values of A, given the values of X

 Define AFD Confidence in terms of the confidence of association rules

 Specifically, pick the best association rule for every distinct value-combination of the body of the association rule.

Confidence

 For the example carDB,

Confidence = Support (Make:Honda ~~> Model:Accord) + Support (Make:Toyota ~~> Model:Camry) = 3/8 + 2/8 = 5/8

 Interestingly, this is equal to (1 - g3)

 g3 has a natural interpretation as the fraction of tuples with exceptions affecting the dependency.
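As a concrete sketch (in Python, over a hypothetical 8-tuple car table chosen to reproduce the 5/8 above), the roll-up keeps, for each distinct body value, the highest-support association rule and sums those supports:

```python
from collections import Counter, defaultdict

def afd_confidence(rows, body, head):
    """Confidence of the AFD body ~~> head: for each distinct value of the
    body attributes, keep the best (highest-support) association rule and
    sum their supports. This equals 1 - g3."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[tuple(row[a] for a in body)][row[head]] += 1
    best = sum(c.most_common(1)[0][1] for c in counts.values())
    return best / len(rows)

# hypothetical carDB: 5 Hondas (3 Accord, 2 Civic), 3 Toyotas (1 Corolla, 2 Camry)
cars = [("Honda", "Accord")] * 3 + [("Honda", "Civic")] * 2 \
     + [("Toyota", "Corolla")] + [("Toyota", "Camry")] * 2
rows = [{"Make": m, "Model": d} for m, d in cars]
print(afd_confidence(rows, ("Make",), "Model"))  # 3/8 + 2/8 = 0.625
```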

Specificity

 For an association rule (α → β), support is the probability with which the conditioning event (i.e., α) occurs

 A rule with high confidence yet low support is a bad rule!

 The presence of many association rules with low support makes an AFD bad.

 In classification, this hurts prediction accuracy.

 For query-rewriting tasks, per-query throughput is lower.

Types of AFDs

Example: Model ~~> Make rolls up (Accord ~~> Honda), (Camry ~~> Toyota), … , (Maruti800 ~~> Tata)

1. Model ~~> Make

 Few branches, uniform distribution

 Good, and might hold universally

2. VIN ~~> Make

 Many branches, uniform distribution

 Bad: the confidence of each association rule is high, but the supports are bad

3. (Model, Location) ~~> Price

 Many branches, skewed distribution

 A few association rules with high support and many with low support

Specificity

 Normalized by the worst-case Specificity, i.e., when X is a key

 The Specificity measure captures our intuition about the different types of AFDs.

 It is based on information entropy

 The higher the Specificity (above a threshold), the worse the AFD!

 Shares similar motivations with the way SplitInfo is defined in decision trees while computing the Information Gain Ratio

 Follows monotonicity
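One plausible formalization consistent with these slides (the thesis' exact normalization may differ) is the entropy of the value distribution of X, normalized by the worst-case entropy log N, which is attained when X is a key:

```python
import math
from collections import Counter

def specificity(rows, attrs):
    """Entropy of the value distribution of attrs, normalized by the
    worst case log(N) reached when attrs is a key. 0 = a single value
    shared by all tuples; 1 = every tuple has a distinct value (a key)."""
    n = len(rows)
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(n)

# hypothetical 8-tuple table: VIN is a key, Make has two skewed branches
rows = [{"VIN": i, "Make": "Honda" if i < 5 else "Toyota"} for i in range(8)]
specificity(rows, ("VIN",))   # 1.0: worst case, X is a key
specificity(rows, ("Make",))  # ≈ 0.32: few branches
```

Because adding attributes can only refine the value distribution, this measure never decreases for supersets, which is the monotonicity that AFDMiner's pruning relies on.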

Agenda/Outline

 Introduction

 Related Work

 Provide new perspective for AFDs

 Roll-ups/condensed representations of association rules

 Define measures for AFDs

 Present the AFDMiner algorithm

 Experimental Results

 Performance

 Quality

AFD Mining Problem

 Good AFDs are the ones within the desired thresholds of the Confidence and Specificity measures.

 Formally, the AFD mining problem can be stated as follows:

AFD Mining

 The problem of AFD mining is to learn all AFDs (X ~~> A) with Confidence(X ~~> A) > minConf and Specificity(X) < maxSpecificity that hold over a given relational table

 Two costs:

1. The major cost is the combinatorial cost of traversing the search space

2. Cost of visiting data to validate each rule

(To compute the interestingness measures)

 The search process for AFDs is exponential in the number of attributes

Pruning Strategies

1. Pruning by Specificity

 Specificity(Y) ≥ Specificity(X), where Y is a superset of X

 If Specificity(X) > maxSpecificity , we can prune all AFDs with X and its supersets as the determining set

2. FD-based pruning

 If (X → A) is an FD, all dependencies of the form (Y → A), where Y is a superset of X, can be pruned

3. Pruning keys

 Needed for FDs

 But, this is subsumed by case 1 in AFDMiner

 Because if Specificity(X) = 1, it means X is a key

AFDMiner algorithm

 Search starts from singleton sets of attributes and works its way to larger attribute sets through the set containment lattice level by level.

 When the algorithm processes a set X, it tests AFDs of the form ((X \ {A}) ~~> A), where A ∈ X.

 Information from previous levels is captured by maintaining RHS+ candidate sets for each set.

Traversal in the Search Space

 During the bottom-up breadth-first search, the stopping criteria at a node are:

1. The AFD confidence becomes 1, and thus it is an FD (FD-based pruning)

2. The Specificity value of X is greater than the given maximum (Specificity-based pruning)

Example:

A → C is an FD

Then, C is removed from RHS+(ABC)

Computing Confidence and Specificity

 Methods are based on representing attribute sets by equivalence-class partitions of the set of tuples

 ∏X is the collection of equivalence classes of tuples for attribute set X

 Example:

 ∏Make = {{1, 2, 3, 4, 5}, {6, 7, 8}}

 ∏Model = {{1, 2, 3}, {4, 5}, {6}, {7, 8}}

 ∏{Make ∪ Model} = {{1, 2, 3}, {4, 5}, {6}, {7, 8}}

 A functional dependency X → A holds iff ∏X = ∏{X ∪ A}

 For the AFD (X ~~> A), Confidence = 1 - g3(X ~~> A)

 In this example, Confidence(Model ~~> Make) = 1 and Confidence(Make ~~> Model) = 5/8
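A minimal sketch of the partition representation, using a hypothetical 8-tuple carDB arranged to reproduce the partitions above:

```python
from collections import defaultdict

def partition(rows, attrs):
    """Equivalence classes of tuple ids that agree on all attrs (i.e., ∏_attrs)."""
    groups = defaultdict(list)
    for tid, row in enumerate(rows, start=1):
        groups[tuple(row[a] for a in attrs)].append(tid)
    return sorted(groups.values())

def fd_holds(rows, X, A):
    """X → A is an exact FD iff adding A does not refine the partition of X."""
    return partition(rows, X) == partition(rows, tuple(X) + (A,))

cars = [("Honda", "Accord")] * 3 + [("Honda", "Civic")] * 2 \
     + [("Toyota", "Corolla")] + [("Toyota", "Camry")] * 2
rows = [{"Make": m, "Model": d} for m, d in cars]
partition(rows, ("Make",))   # [[1, 2, 3, 4, 5], [6, 7, 8]]
partition(rows, ("Model",))  # [[1, 2, 3], [4, 5], [6], [7, 8]]
fd_holds(rows, ("Model",), "Make")  # True: Confidence(Model ~~> Make) = 1
fd_holds(rows, ("Make",), "Model")  # False: the FD has exceptions
```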

Algorithms

 Algorithm AFDMiner:

• Computes Confidence

• Applies FD-based pruning

• Computes Specificity and applies Specificity-based pruning

• Computes level L l+1

• L l+1 contains only those attribute sets of size l+1 that have all their size-l subsets in L l
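The level-generation step above can be sketched apriori-style; the attribute names are placeholders:

```python
from itertools import combinations

def next_level(level):
    """Compute L_{l+1} from L_l: an attribute set of size l+1 is kept only
    if every size-l subset of it appears in L_l, so sets pruned at level l
    (e.g., by Specificity) never spawn supersets at level l+1."""
    current = set(level)
    out = set()
    for X, Y in combinations(level, 2):
        union = tuple(sorted(set(X) | set(Y)))
        if len(union) == len(X) + 1 and all(
            tuple(sorted(set(union) - {a})) in current for a in union
        ):
            out.add(union)
    return sorted(out)

next_level([("A",), ("B",), ("C",)])  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
next_level([("A", "B"), ("A", "C")])  # []: subset ('B', 'C') was pruned
```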

Agenda/Outline

 Introduction

 Related Work

 Provide new perspective for AFDs

 Roll-ups/condensed representations of association rules

 Define measures for AFDs

 Present the AFDMiner algorithm

 Experimental Results

 Performance

 Quality

Empirical Evaluation

 Experimental Setup

 Data sets

 CensusDB (199,523 tuples, 30 attributes)

 MushroomDB (8,124 tuples, 23 attributes)

 Parameters for AFDMiner

 minConf

 maxSpecificity

 No. of tuples

 No. of attributes

 Max length of the determining set

The aim of the experiments is to show that the dual-measure approach (AFDMiner, using both Confidence and Specificity) outperforms the single-measure approach (No_Specificity, which uses Confidence alone).

No_Specificity: a modified version of AFDMiner that uses only Confidence, not Specificity, for AFDs. Thus, it generates all AFDs (X ~~> A) with Confidence(X ~~> A) > minConf.

Evaluating Quality

[Chart: average classification accuracy (82-93%), No_InfoSupport vs. AFDMiner]

BestAFD:

 The highest-confidence AFD among all the AFDs with attribute A as their dependent attribute

Classification task:

 The classifier is run with the determining set of BestAFD as features

 Used 10-fold cross-validation and computed the average classification accuracy

 Weka toolkit

 Evaluated over CensusDB

Evaluating Quality (contd.)

 CensusDB: average classification accuracy for all attributes, No_Specificity vs. AFDMiner

 minConf = 0.8; maxSpecificity = 0.4

Choosing minConf

 Shows that Specificity is effective in generating better-quality AFDs.

Choosing maxSpecificity

[Chart: CensusDB, classification accuracy vs. maxSpecificity (0-0.8)]

 Classification accuracy (by varying maxSpecificity)

 Threshold too low => good rules are pruned

 Threshold too high => bad rules are not pruned

 Classification accuracy approximately forms a double-elbow-shaped curve.

Choosing maxSpecificity (contd.)

[Chart: CensusDB, time to compute AFDs vs. maxSpecificity (0-0.8), with the best value marked]

 Time to compute AFDs:

 Increases with increasing maxSpecificity

 The rate of change varies

 A good threshold value for Specificity (i.e., maxSpecificity) is the value at the first elbow in the quality graph

Query Throughput

[Chart: AFDMiner, no. of tuples returned (0-9,000) per attribute]

 No. of tuples returned for top-10 queries on each distinct determining set (denotes query throughput)

Discussion on TANE

 Primarily designed to generate FDs

 A modified version generates approximate dependencies

 Uses the error metric g3 for AFDs

 Bottom-up search in the lattice

 Generates only minimal dependencies

 Pruning applicable to FDs

Comparison (AFDMiner Vs TANE)

 TANENOMINP is a modified version of TANE that does not stop with just minimal dependencies.

 minConf is 0.8 (thus, we set the g3 threshold to 0.2)

 AFDMiner outperforms both approaches, strengthening the argument that AFDs with high Confidence and reasonable Specificity are the best

Evaluating Performance

[Charts: CensusDB, time vs. no. of tuples (0-12,000) and time vs. no. of attributes (0-35), No_Specificity vs. AFDMiner]

 Time varies linearly with the number of tuples.

 AFDMiner takes less time than No_Specificity.

 Time varies exponentially with the number of attributes.

 AFDMiner completes much faster than No_Specificity.

Evaluating Performance (contd.)

[Charts: CensusDB, time vs. length of the determining set in each AFD; MushroomDB, time vs. no. of tuples (0-10,000) and time vs. no. of attributes (0-25); No_Specificity vs. AFDMiner]

 These experiments show that AFDMiner is fast

Conclusion

 Introduced a novel perspective for AFDs

 Condensed roll-ups of association rules.

 Two metrics for AFDs

 Confidence

 Specificity

A version of this thesis is currently under review at ICDE'09

 Algorithm AFDMiner

 All AFDs (Confidence > minConf; Specificity < maxSpecificity)

 Bottom-up search in a breadth-first manner in the set containment lattice of attributes

 Pruning based on Specificity

 Experiments – AFDMiner generates high-quality AFDs faster.

 AFDs with high Confidence and reasonable Specificity

Future Direction

 Conditional Functional Dependencies (CFDs)

 Dependencies of the form ({ZipCode → City} if Country = "England")

 i.e., holding true only for certain values of one or more other attributes

 CAFDs are the probabilistic counterparts of CFDs

 CFDs and CAFDs have recently been applied in data cleaning and value prediction, but mining these conditional rules is unexplored.

Intuitively, CFDs are intermediate rules between association rules (value level) and FDs (attribute level). So, we believe that our approach can help in generating them!

Questions?
