Master’s Thesis Defense by Aravind Krishna Kalavagattu
Committee Members:
Dr. Subbarao Kambhampati (chair)
Dr. Yi Chen
Dr. Huan Liu
Database Systems
• Well-defined schema and method for querying (SQL)
• Query optimization
• Lately, some systems have started supporting IR-style answering of user queries
Data mining
Rule Mining, with several applications over databases
• Discovering useful patterns from data
• Rule learning is a well-researched method for discovering interesting relations between variables in large databases
• Association Rules
Approximate Functional Dependencies are rules denoting approximate determinations at the attribute level.
AFDs are of the form (X ~~> Y), where X and Y are sets of attributes.
X is the "determining set" and Y is called the "dependent set".
Rules with singleton dependent sets are of high interest.
A classic example of an AFD: (Nationality ~~> Language)
It indicates that we can approximately guess the language of a person if we know which country she is from.
More examples:
Make ~~> Model
(Job Title, Experience) ~~> Salary
Functional Dependency (FD)
Given a relation R, a set of attributes X in R is said to
functionally determine another attribute Y, also in R,
(written X → Y) if and only if each X value is associated with precisely one Y value.
AFDs can be loosely defined as FDs that approximately hold (there are some exception rows that fail to satisfy the dependency over the current relation).
Example: Make ~~> Model (with error = 0.3)
70% of the tuples satisfy the dependency
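A minimal sketch of this error computation (the toy rows below are an assumption chosen to reproduce the 70/30 split; this is a standard g3-style error calculation, not code from the thesis):

```python
# Minimal sketch: fraction of exception tuples for X ~~> Y (g3-style error).
# Toy rows are assumed; an error of 0.0 would mean the FD X -> Y holds exactly.
def fd_error(rows, x, y):
    groups = {}
    for r in rows:
        key = tuple(r[c] for c in x)
        groups.setdefault(key, {}).setdefault(r[y], 0)
        groups[key][r[y]] += 1
    # Keep the majority y-value in each x-group; everything else is an exception.
    kept = sum(max(counts.values()) for counts in groups.values())
    return 1 - kept / len(rows)

rows = ([{"Make": "Honda", "Model": "Accord"}] * 7 +
        [{"Make": "Honda", "Model": "Civic"}] * 3)
print(fd_error(rows, ["Make"], "Model"))   # ~0.3, i.e., 70% of tuples satisfy Make ~~> Model
```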
Predicting missing values of attributes in relational tables (QPIAD), using the values of the attributes in the determining set of an AFD
Query Rewriting
(AIMQ, QPIAD, QUIC)
Example: Model ~~> BodyStyle. Rewrite a query on Model = "RAV4" to also retrieve tuples with BodyStyle = "SUV" (a small sketch of this rewriting idea follows this list).
Query Optimization
(CORDS, BHUNT)
Maintaining correct selectivity estimates
Database design
(Database normalization)
(Efficient Storage)
Similar to the way FDs are used
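A minimal sketch of the query-rewriting idea mentioned above (a hypothetical illustration with an assumed toy table and helper name, not the AIMQ/QPIAD implementation):

```python
# Sketch of AFD-based query rewriting. Using Model ~~> BodyStyle, a query on
# Model = "RAV4" is rewritten into a query on the BodyStyle value that RAV4
# most often determines, so tuples with a missing Model can still be retrieved.
from collections import Counter

car_db = [  # toy relation (assumed); one tuple has a missing Model value
    {"Model": "RAV4", "BodyStyle": "SUV"},
    {"Model": "RAV4", "BodyStyle": "SUV"},
    {"Model": None,   "BodyStyle": "SUV"},
    {"Model": "Camry", "BodyStyle": "Sedan"},
]

def rewrite_query(db, det_attr, det_value, dep_attr):
    """Rewrite a selection on det_attr into a selection on dep_attr via the AFD."""
    counts = Counter(t[dep_attr] for t in db
                     if t[det_attr] == det_value and t[dep_attr] is not None)
    if not counts:
        return []
    likely_value = counts.most_common(1)[0][0]        # e.g., "SUV" for Model = "RAV4"
    return [t for t in db if t[dep_attr] == likely_value]

# Retrieves the two RAV4 tuples plus the SUV tuple whose Model is missing.
print(rewrite_query(car_db, "Model", "RAV4", "BodyStyle"))
```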
FD mining aims at finding a minimal cover: a minimum set of FDs from which the entire set of FDs can be generated.
Example: If A → B is an FD, then ({A, C} → B) is considered redundant.
Can we follow the same approach and generate only minimal dependencies in the case of AFDs?
NO, because a non-minimal AFD (Z ~~> B) may be interesting for the application, and we may prefer it to (A ~~> B).
Non-minimal dependencies perform better in QPIAD, QUIC, etc.
Example: AFD (JobTitle, Experience) ~~> Salary vs. (JobTitle ~~> Salary)
AFD Mining is costly
The pruning strategies of FDs are not applicable in the case of AFDs.
For datasets with a large number of attributes, the search space gets worse!
The method for determining whether a dependency holds is costly.
The way to traverse the search space is tricky: bottom-up vs. top-down?
Before algorithms for discovering AFDs can be developed, AFDs need better interestingness measures.
AFDs used as feature selectors in classification are expected to give good accuracy.
AFDs used in query rewriting are expected to give a high per-query throughput.
(VIN ~~> Make) vs. (Model ~~> Make)
(VIN ~~> Make) looks good using the error metric.
But, intuitively (as well as practically), (Model ~~> Make) is a better AFD.
Introduction
Related Work
Provide new perspective for AFDs
Roll-ups/condensed representations to association rules
Define measures for AFDs
Present the AFDMiner algorithm
Experimental Results
Performance
Quality
FD Mining Algorithms
• Aim at finding minimal cover
• DepMiner, FUN, TANE, FD_Mine
Existing Approximation measures for AFDs
• Tau, InD metrics
Grouping association rules
Clustering association rules
(v1 ~> u, v2 ~> u as (v1 ^ v2 ~> u))
These do not work well for AFDs:
• The metrics do not seem to matter in practice
• No accompanying algorithm to mine AFDs
• None of them combine the rules into AFDs
CORDS
• SoftFDs (C1 => C2)
• Uses |C1,C2| / (|C1||C2|) as the approximation measure
• Restricted to a singleton determining set
• Works from a sample
• Measure used is not appropriate
AIMQ/QPIAD/QUIC
• TANE
• Post-processing over TANE
• Highly inefficient
• Quality of some AFDs is bad
Introduction
Related Work
Provide new perspective for AFDs
Roll-ups/condensed representations to association rules
Define measures for AFDs
Present the AFDMiner algorithm
Experimental Results
Performance
Quality
Example: Association rule (Toyota, Camry) ~> Sedan
Viewing database relations as transactions: itemsets ≈ attribute-value pairs
Association rules are between itemsets (e.g., Beer ~> Diapers); here, they are between attribute-value pairs
AFDs are rules between attributes, corresponding to a lot of association rules sharing the same attributes.
Example: Make ~~> Model
Honda ~~> Accord
Toyota ~~> Camry
… …
Tata ~~> Maruti800
Consider an association rule of the form (α → β).
Confidence denotes the conditional probability of β (head) given α (body).
Similarly, for an AFD (X ~~> A), Confidence should denote the chance of finding the values of A given the values of X.
We define AFD Confidence in terms of the confidence of association rules: specifically, by picking the best association rule for every distinct value combination of the body.
For the example carDB,
Confidence = Support(Make:Honda ~~> Model:Accord) + Support(Make:Toyota ~~> Model:Camry) = 3/8 + 2/8 = 5/8
Interestingly, this is equal to (1 − g3).
g3 has a natural interpretation as the fraction of tuples with exceptions affecting the dependency.
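A compact sketch of this computation (the eight toy tuples below are an assumption standing in for the example carDB; the bookkeeping mirrors the per-value best-rule definition above):

```python
# Sketch: AFD confidence as the summed support of the best association rule for
# each distinct value of the body (toy (Make, Model) data assumed).
from collections import Counter, defaultdict

make_model = [  # 8 tuples: (Make, Model)
    ("Honda", "Accord"), ("Honda", "Accord"), ("Honda", "Accord"),
    ("Honda", "Civic"), ("Honda", "Civic"),
    ("Toyota", "Camry"), ("Toyota", "Camry"), ("Toyota", "Highlander"),
]

def afd_confidence(pairs):
    n = len(pairs)
    by_body = defaultdict(Counter)               # body value -> counts of head values
    for body, head in pairs:
        by_body[body][head] += 1
    # For each body value, keep only the support of its best association rule.
    best = sum(counts.most_common(1)[0][1] for counts in by_body.values())
    return best / n                               # equals 1 - g3

print(afd_confidence(make_model))                 # 3/8 + 2/8 = 5/8 = 0.625
```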
For an association rule (α → β), Support is the probability with which the conditioning event (i.e., α) occurs.
A rule with high confidence yet low support is a bad rule!
The presence of a lot of association rules with low supports makes the AFD bad.
In classification, this affects prediction accuracy.
For query rewriting tasks, per-query throughput is lower.
Model ~~> Make
Accord ~~> Honda
Camry ~~> Toyota
… …
Maruti800 ~~> Tata
1. Model ~~> Make
Few branches, uniform distribution
Good, and might hold universally
2. VIN ~~> Make
Many branches, uniform distribution
Bad: the confidence of each association rule is high, but the supports are bad
3. Model, Location ~~> Price
Many branches, skewed distribution
A few association rules with high support and many with low support
The Specificity measure captures our intuition about these different types of AFDs.
It is based on information entropy, normalized by the worst-case Specificity (i.e., when X is a key).
The higher the Specificity (above a threshold), the worse the AFD is!
It shares similar motivations with the way SplitInfo is defined in decision trees while computing Information Gain Ratio.
Specificity follows monotonicity.
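A minimal sketch of such an entropy-based measure, assuming Specificity is the entropy of the determining set's value distribution normalized by the worst case (X is a key); the exact formula in the thesis may differ:

```python
# Sketch of an entropy-based Specificity measure, normalized by the worst case
# (X is a key, so every value is distinct and the entropy equals log N).
import math
from collections import Counter

def specificity(values):
    n = len(values)
    if n <= 1:
        return 0.0
    counts = Counter(values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(n)      # 0 when a single value repeats, 1 when X is a key

makes = ["Honda"] * 5 + ["Toyota"] * 3        # few branches  -> low Specificity
vins = [f"VIN{i}" for i in range(8)]          # key-like      -> Specificity = 1.0
print(specificity(makes), specificity(vins))  # ~0.32 and 1.0
```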
Introduction
Related Work
Provide new perspective for AFDs
Roll-ups/condensed representations to association rules
Define measures for AFDs
Present the AFDMiner algorithm
Experimental Results
Performance
Quality
Good AFDs are the ones within the desired thresholds of the Confidence and Specificity measures.
Formally, the AFD mining problem can be stated as follows:
The problem of AFD mining is to learn all AFDs that hold over a given relational table, within the desired thresholds of the two measures.
Two costs:
1. The major cost is the combinatorial cost of traversing the search space.
2. The cost of visiting the data to validate each rule (to compute the interestingness measures).
The search process for AFDs is exponential in the number of attributes.
1. Pruning by Specificity
Specificity(Y) ≥ Specificity(X), where Y is a superset of X.
If Specificity(X) > maxSpecificity, we can prune all AFDs with X and its supersets as the determining set.
2. Pruning applicable to FDs
If (X → A) is an FD, all AFDs of the form (Y ~~> A), where Y is a superset of X, can be pruned.
3. Pruning keys
Needed for FDs, but subsumed by case 1 in AFDMiner, because if Specificity(X) = 1, X is a key.
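A small sketch of the Specificity-based pruning decision for attribute sets (helper names like should_expand are hypothetical and the toy rows are assumed; it only illustrates how monotonicity lets a whole branch of the lattice be skipped):

```python
# Sketch of Specificity-based pruning. Since Specificity never decreases for
# supersets, a determining set above maxSpecificity is dropped with its supersets.
import math
from collections import Counter

def specificity(rows, attrs):
    """Normalized entropy of the value combinations of `attrs` (1.0 when a key)."""
    n = len(rows)
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(n) if n > 1 else 0.0

def should_expand(rows, det_set, max_specificity):
    """Explore supersets of det_set only if its Specificity is within the bound."""
    return specificity(rows, det_set) <= max_specificity

rows = [{"Make": m, "VIN": f"V{i}"} for i, m in
        enumerate(["Honda"] * 5 + ["Toyota"] * 3)]
print(should_expand(rows, ["Make"], 0.4))   # True  -> keep exploring supersets of {Make}
print(should_expand(rows, ["VIN"], 0.4))    # False -> prune {VIN} and all its supersets
```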
Search starts from singleton sets of attributes and works its way to larger attribute sets through the set-containment lattice, level by level.
When the algorithm is processing a set X, it tests AFDs of the form (X \ {A}) ~~> A, where A ∈ X.
Information from previous levels is captured by maintaining RHS+ candidate sets for each set.
During the bottom-up breadth-first search, the stopping criteria at a node are:
1. The AFD confidence becomes 1, and thus it is an FD (FD-based pruning).
2. The Specificity value of X is greater than the given maximum (Specificity-based pruning).
Example: A → C is an FD. Then, C is removed from RHS+(ABC).
Methods are based on representing attribute sets by equivalence-class partitions of the set of tuples.
∏_X is the collection of equivalence classes of tuples for attribute set X.
Example:
∏_Make = {{1, 2, 3, 4, 5}, {6, 7, 8}}
∏_Model = {{1, 2, 3}, {4, 5}, {6}, {7, 8}}
∏_{Make ∪ Model} = {{1, 2, 3}, {4, 5}, {6}, {7, 8}}
A functional dependency X → A holds if ∏_X = ∏_{X ∪ A}.
For the AFD (X ~~> A), Confidence = 1 − g3(X ~~> A).
In this example, Confidence(Model ~~> Make) = 1 and Confidence(Make ~~> Model) = 5/8.
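A sketch of the partition-based computation (the toy tuples are assumed, chosen to match the eight-tuple example partitions above; the g3 calculation keeps the largest sub-class within each X-class):

```python
# Sketch: equivalence-class partitions and Confidence = 1 - g3 computed from them.
from collections import defaultdict, Counter

tuples = [  # (id, Make, Model)
    (1, "Honda", "Accord"), (2, "Honda", "Accord"), (3, "Honda", "Accord"),
    (4, "Honda", "Civic"), (5, "Honda", "Civic"),
    (6, "Toyota", "Highlander"), (7, "Toyota", "Camry"), (8, "Toyota", "Camry"),
]

def partition(rows, attr_idx):
    """Equivalence classes of tuple ids that agree on the attributes in attr_idx."""
    classes = defaultdict(set)
    for row in rows:
        classes[tuple(row[i] for i in attr_idx)].add(row[0])
    return list(classes.values())

def confidence(rows, x_idx, a_idx):
    """1 - g3 for the AFD X ~~> A, computed from the partition on X refined by X u A."""
    kept = 0
    for x_class in partition(rows, x_idx):
        sub = Counter()
        for row in rows:
            if row[0] in x_class:
                sub[tuple(row[i] for i in x_idx + a_idx)] += 1
        kept += max(sub.values())     # size of the largest sub-class
    return kept / len(rows)

print(partition(tuples, [1]))         # ∏_Make = {{1..5}, {6, 7, 8}}
print(confidence(tuples, [2], [1]))   # Confidence(Model ~~> Make) = 1.0
print(confidence(tuples, [1], [2]))   # Confidence(Make ~~> Model) = 0.625 (5/8)
```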
Algorithm AFDMiner:
• Computes Confidence
• Applies FD-based pruning
• Computes Specificity and applies Specificity-based pruning
• Computes level L_{l+1}
• L_{l+1} contains only those attribute sets of size l+1 whose subsets of size l are all in L_l
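A simplified, assumed rendering of this level-wise search (names such as afd_miner are hypothetical; the RHS+ candidate-set bookkeeping and partition reuse of the actual algorithm are omitted):

```python
# Simplified level-wise AFD search in the spirit of AFDMiner (assumed sketch).
import math
from collections import Counter

def confidence(rows, x, a):
    """1 - g3 for (x ~~> a): keep the majority a-value within each x-value group."""
    groups = {}
    for r in rows:
        key = tuple(r[c] for c in x)
        groups.setdefault(key, Counter())[r[a]] += 1
    return sum(c.most_common(1)[0][1] for c in groups.values()) / len(rows)

def specificity(rows, x):
    n = len(rows)
    counts = Counter(tuple(r[c] for c in x) for r in rows)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(n) if n > 1 else 0.0

def afd_miner(rows, attrs, min_conf, max_spec, max_len=3):
    results, level = [], [(a,) for a in attrs]          # level 1: singleton sets
    while level and len(level[0]) <= max_len:
        next_level = []
        for x in level:
            if specificity(rows, x) > max_spec:
                continue                                 # Specificity pruning: skip x, do not expand it
            for a in attrs:
                if a in x:
                    continue
                conf = confidence(rows, x, a)
                if conf >= min_conf:
                    results.append((x, a, conf))
                # an FD (conf == 1) would additionally be removed from supersets' RHS+ sets
            next_level.extend(x + (a,) for a in attrs if a > x[-1])   # grow the lattice
        level = sorted(set(next_level))
    return results

rows = ([{"Make": "Honda", "Model": "Accord"}] * 3 +
        [{"Make": "Honda", "Model": "Civic"}] * 2 +
        [{"Make": "Toyota", "Model": "Camry"}] * 2 +
        [{"Make": "Toyota", "Model": "Highlander"}])
print(afd_miner(rows, ["Make", "Model"], min_conf=0.8, max_spec=0.7))
# -> [(('Model',), 'Make', 1.0)]
```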
Introduction
Related Work
Provide new perspective for AFDs
Roll-ups/condensed representations to association rules
Define measures for AFDs
Present the AFDMiner algorithm
Experimental Results
Performance
Quality
Experimental Setup
Data sets
CensusDB (199,523 tuples, 30 attributes)
MushroomDB (8,124 tuples, 23 attributes)
Parameters for AFDMiner
minConf
maxSpecificity
No. of tuples
No. of attributes
MaxLength of determining set
The aim of the experiments is to show that the dual-measure approach (AFDMiner, which uses both Confidence and Specificity) outperforms the single-measure approach (No_Specificity, which uses Confidence alone).
No_Specificity: a modified version of AFDMiner that uses only Confidence and not Specificity. Thus, it generates all AFDs (X ~~> A) with Confidence(X ~~> A) > minConf.
[Figure: average classification accuracy (%) — No_InfoSupport vs. AFDMiner]
BestAFD: the highest-confidence AFD among all the AFDs with attribute A as their dependent attribute.
The classifier is run with the determining set of the BestAFD as features.
Used 10-fold cross-validation and computed the average classification accuracy (Weka toolkit).
Evaluated over CensusDB: average classification accuracy for all attributes (minConf = 0.8; maxSpecificity = 0.4).
Choosing minConf!
Shows that Specificity is effective in generating better-quality AFDs.
[Figure: CensusDB — varying maxSpecificity]
Classification accuracy (by varying maxSpecificity):
If the threshold is too low, good rules are pruned; if the threshold is too high, bad rules are not pruned.
Classification accuracy approximately forms a double-elbow-shaped curve, with the best value at the first elbow.
Time to compute AFDs:
Increases with increasing maxSpecificity
Rate of change varies
A good threshold value for Specificity (i.e., maxSpecificity) is the value at the first elbow in the graph on quality
[Figure: no. of tuples returned for top-10 queries on each distinct determining set, per attribute (series: AFDMiner) — denotes query throughput]
TANE: a modified version for generating approximate dependencies (using the g3 error measure).
It generates only minimal dependencies, with pruning applicable to FDs.
TANENOMINP is a modified version of TANE that does not stop with just minimal dependencies.
minConf is 0.8 (thus, we set the g3 threshold to 0.2).
AFDMiner outperforms both approaches, strengthening the argument that AFDs with high Confidence and reasonable Specificity are the best.
[Figures: CensusDB — time to compute AFDs vs. number of tuples and vs. number of attributes (series: No_Specificity, AFDMiner)]
Time varies linearly with the number of tuples; AFDMiner takes less time than No_Specificity.
Time varies exponentially with the number of attributes; AFDMiner completes much faster than No_Specificity.
[Figures: CensusDB — time to compute AFDs vs. length of the determining set in each AFD (series: No_Specificity, AFDMiner)]
[Figures: MushroomDB — time (ms) to compute AFDs vs. number of tuples and vs. number of attributes (series: No_Specificity, AFDMiner)]
These experiments show that AFDMiner is fast
Introduced a novel perspective for AFDs
Condensed roll-ups of association rules.
Two metrics for AFDs
Confidence
Specificity
A version of this thesis is currently under review at ICDE '09
Algorithm AFDMiner
all AFDs (Confidence > minConf; Specificity < maxSpecificity)
Bottom-up search in a breadth-first manner in the set containment lattice of attributes
Pruning based on Specificity
Experiments – AFDMiner generates high-quality AFDs faster.
AFDs with high Confidence and reasonable Specificity
Conditional Functional Dependencies (CFDs)
Dependencies of the form (ZipCode → City, if Country = "England").
i.e., holding true only for certain values of one or more other attributes.
CAFDs are the probabilistic counterpart of CFDs.
CFDs and CAFDs have recently been applied in data cleaning and value prediction, but mining these conditional rules is unexplored.
Intuitively, CFDs are intermediate rules between association rules (value level) and FDs (attribute level). So, we believe that our approach can help in generating them!