Slides - Subbarao Kambhampati

Improving Retrieval Accuracy in Web
Databases Using Attribute Dependencies
Ravi Gummadi & Anupam Khulbe
gummadi@asu.edu – akhulbe@asu.edu
Computer Science Department
Arizona State University
1
Agenda
• Introduction [Ravi]
• SmartINT System [Anupam]
• Query Processing [Anupam]
– Source Selection
– Tuple Expansion
• Learning [Anupam]
• Experiments [Ravi]
• Conclusion & Future Work [Ravi]
2
INTRODUCTION
3
Consider a table with the Universal Relation from the vehicle domain
This describes the imaginary schema containing all the attributes of a vehicle
VIN | MID | Make | Model | Vehicle-type | Price | Engine | Cylinders | Miles | Dealer | Address
V001 | HACC96 | Honda | Accord | Fullsize | 19000 | K24A4 | 6 | 45k | Frank | 1011 E Lemon St, Scottsdale, AZ
V002 | TYCRA08 | Toyota | Corolla | Midsize | 14000 | F23A1 | 4 | 80k | Frank | 1011 E Lemon St, Scottsdale, AZ
V003 | TYCRA09 | Toyota | Corolla | Midsize | 16000 | 155 HP | 4 | 50k | John | 900 10th Street, Tucson, AZ
V004 | TYCRY09 | Toyota | Camry | Fullsize | 12000 | 2AZ-FE I4 | 6 | 109k | Steven | 601 Apache Blvd, Glendale, AZ
V005 | HACV08 | Honda | Civic | Midsize | 11500 | F23A1 | 4 | 120k | Frank | 1011 E Lemon St, Scottsdale, AZ
Database Administrator
Introduction
4
Normalized Tables
Lossless Normalization
Name
Frank
Steven
John
Database Administrator
Primary Key
VIN
V001
V002
V003
V004
V005
Introduction
Make
MID
HACC96 Honda
TYCRA08 Toyota
TYCRA09 Toyota
TYCRY09 Toyota
HACV08 Honda
MID
Miles
HACC96
45k
TYCRA08 80k
TYCRA09 50k
TYCRY09 109k
HACV08 120k
Dealer
Frank
Frank
John
Steven
Frank
Model
Accord
Corolla
Corolla
Camry
Civic
Price
19000
14000
16000
12000
11500
Cars-for-Sale
Dealer-Info
Address
1011 E Lemon St, Scottsdale, AZ
601 Apache Blvd, Glendale, AZ
900 10th Street, Tucson, AZ
VehicleReview
type
Excellent Midsize
Good
Fullsize
Average
SUV
Excellent Fullsize
Very Good Midsize
Car-Reviews
Engine
K24A4
F23A1
155 HP
2AZ-FE I4
F23A1
Cylinders
6
4
4
6
4
Foreign Key
5
Query Processing
SELECT make, mid, model FROM cars-for-sale c, car-reviews r
WHERE cylinders = 4 AND price < $15k
MID | Make | Model
TYCRA08 | Toyota | Corolla
HACV08 | Honda | Civic
Complete Data, Certain Query, Lossless Normalization: Accurate Results
Introduction
6
Advent of Web
(in context of Vehicle Domain)
Used Car Dealers
Car Reviewers
Database Administrator
Customers Selling Cars
Engine Makers
Introduction
7
A Sample Data Model
Used Car Dealers
Name | Address
Frank | 1011 E Lemon St, Scottsdale, AZ
Steven | 601 Apache Blvd, Glendale, AZ
John | 900 10th Street, Tucson, AZ
Customers Selling Cars
MID | Make | Model | Price
HACC96 | Honda | Accord | 19000
HACV08 | Honda | Civic | 12000
TYCRY08 | Toyota | Camry | 14500
TYCRA09 | Toyota | Corolla | 14500
Car Reviewers
Model_name | Review | Vehicle-type | Dealer
Corolla | Excellent | Midsize | Frank
Accord | Good | Fullsize | Frank
Highlander | Average | SUV | John
Camry | Excellent | Fullsize | Steven
Civic | Very Good | Midsize | Frank
Engine Makers
MID | Mdl | Engine | Cylinders
HACC96 | Accord | K24A4 | 6
TYCRA08 | Corolla | F23A1 | 4
TYCRA09 | Corolla | 155 HP | 4
TYCRY09 | Camry | 2AZ-FE I4 | 6
HACV08 | Civic | F23A1 | 4
HACV07 | Civic | J27B1 | 4
Introduction
8
A Sample Data Model
• Hidden sensitive information (VIN field masked)
• Schema heterogeneity
• Unavailability of information
• Key might not be the shared attribute
Used Car Dealers – t_dealer_info
Name | Address
Frank | 1011 E Lemon St, Scottsdale, AZ
Steven | 601 Apache Blvd, Glendale, AZ
John | 900 10th Street, Tucson, AZ
Customers Selling Cars – t_car_sales
MID | Make | Model | Price
HACC96 | Honda | Accord | 19000
HACV08 | Honda | Civic | 12000
TYCRY08 | Toyota | Camry | 14500
TYCRA09 | Toyota | Corolla | 14500
Car Reviewers – t_car_reviews
Model_name | Review | Vehicle-type | Dealer
Corolla | Excellent | Midsize | Frank
Accord | Good | Fullsize | Frank
Highlander | Average | SUV | John
Camry | Excellent | Fullsize | Steven
Civic | Very Good | Midsize | Frank
Engine Makers – t_eng_makers
MID | Mdl | Engine | Cylinders
HACC96 | Accord | K24A4 | 6
TYCRA08 | Corolla | F23A1 | 4
TYCRA09 | Corolla | 155 HP | 4
TYCRY09 | Camry | 2AZ-FE I4 | 6
HACV08 | Civic | F23A1 | 4
HACV07 | Civic | J27B1 | 4
Introduction
9
Vehicles Revisited
Engine Makers
Table 2
Car Reviewers
Table 1
Customers Selling Cars
User Query
Used Car Dealers
Introduction
10
Query is Partial….
SELECT make, model
FROM cars-for-sale c, carreviews r WHERE cylinders = 4 AND price < $15k
• The attributes from one source are not visible in the other sources in WebDBs; the query is not complete
• The tables are not visible to the users
Introduction
11
Approaches – Single Table
• Answering queries from a single table
• Unable to propagate constraints; Inaccurate results
SELECT make, model WHERE cylinders = 4 AND price < $15k
Result
MID | Make | Model | Price
HACV08 | Honda | Civic | 12000
TYCRY08 | Toyota | Camry | 14500
TYCRA09 | Toyota | Corolla | 14500
Inaccurate result: Camry has 6 cylinders
Customers Selling Cars
MID | Make | Model | Price
HACC96 | Honda | Accord | 19000
HACV08 | Honda | Civic | 12000
TYCRY08 | Toyota | Camry | 14500
TYCRA09 | Toyota | Corolla | 14500
Introduction
12
Approaches – Direct Join
• Join the tables based on shared attribute
• Leads to spurious tuples which do not exist
SELECT make, model WHERE cylinders = 4 AND price < $15k
Join the following two tables
Customers Selling Cars
Make | Model | Price
Honda | Accord | 19000
Honda | Civic | 12000
Toyota | Camry | 14500
Toyota | Corolla | 14500
Engine Makers
Mdl | Engine | Cylinders
Accord | K24A4 | 6
Corolla | F23A1 | 4
Corolla | 155 HP | 4
Camry | 2AZ-FE I4 | 6
Civic | F23A1 | 4
Civic | J27B1 | 4
Result
Make | Mdl | Price | Engine | Cylinders
Honda | Civic | 12000 | F23A1 | 4
Honda | Civic | 12000 | J27B1 | 4
Toyota | Corolla | 14500 | F23A1 | 4
Toyota | Corolla | 14500 | 155 HP | 4
Spurious results: the join generates extra tuples
Introduction
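The spurious tuples on this slide can be reproduced with a small sketch. It assumes the two sample tables are held as in-memory Python lists and that the join is on the model name; the names `customers`, `engines` and `direct_join` are illustrative and not part of SmartINT.

```python
# Sample rows copied from the slide: (Make, Model, Price) and (Mdl, Engine, Cylinders).
customers = [
    ("Honda", "Accord", 19000),
    ("Honda", "Civic", 12000),
    ("Toyota", "Camry", 14500),
    ("Toyota", "Corolla", 14500),
]
engines = [
    ("Accord", "K24A4", 6),
    ("Corolla", "F23A1", 4),
    ("Corolla", "155 HP", 4),
    ("Camry", "2AZ-FE I4", 6),
    ("Civic", "F23A1", 4),
    ("Civic", "J27B1", 4),
]


def direct_join(max_price, cylinders):
    """Join on the shared (non-key) model attribute, then apply both constraints."""
    return [
        (make, model, price, engine, cyl)
        for make, model, price in customers
        for mdl, engine, cyl in engines
        if model == mdl and price < max_price and cyl == cylinders
    ]


joined = direct_join(15000, 4)
# Distinct cars that actually exist in the Customers Selling Cars table.
distinct_cars = {(make, model, price) for make, model, price, _, _ in joined}
```

Because the model name is not a key of Engine Makers, each of the two qualifying cars is paired with every engine listed for its model, so `joined` holds more tuples than `distinct_cars`.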
13
Why is JOIN not working?
The Rules of Normalization
• Eliminate Repeating Groups
• Eliminate Redundant Data
• Eliminate Columns Not Dependent On Key (cannot be ensured in autonomous Web databases)
Under normalization, all columns are dependent on the key, which is NOT necessarily true in ad hoc normalization!!
Introduction
http://www.datamodel.org/NormalizationRules.html
14
Dependencies….
• Shared attribute(s) is not the ‘Key’!
• The shared attribute’s relation with other
columns is unknown!!
• LEARN the dependencies between them
• Mine Functional Dependencies (FD) among the
columns..
– Neat…works quite well ‘IF ONLY’ the data is clean
– Lot of noisy data in Web Databases
• Instead consider
– APPROXIMATE FUNCTIONAL DEPENDENCIES
Introduction
15
Approximate Functional Dependencies
• Approximate Functional Dependencies are rules
denoting approximate determinations at attribute level.
– AFDs are of the form (X ~~> Y), where X and Y are
sets of attributes
– X is the “determining set” and Y is called “dependent
set”
– Rules with singleton dependent sets are of high
interest
• Examples of AFDs
– (Nationality ~~> Language)
– Make ~~> Model
– (Job Title, Experience) ~~> Salary
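A minimal sketch of how the strength of an AFD could be measured. The slides do not define confidence, so this assumes one common definition: the largest fraction of tuples that can be kept so that X exactly determines Y. The function name `afd_confidence` and the toy rows are illustrative only.

```python
from collections import Counter, defaultdict


def afd_confidence(rows, lhs, rhs):
    """Fraction of rows that agree with the majority rhs value of their lhs group."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key][row[rhs]] += 1
    # For each lhs value, keep only the rows carrying its most frequent rhs value.
    kept = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return kept / len(rows)


cars = [
    {"Make": "Honda", "Model": "Accord", "Cylinders": 6},
    {"Make": "Toyota", "Model": "Corolla", "Cylinders": 4},
    {"Make": "Toyota", "Model": "Corolla", "Cylinders": 4},
    {"Make": "Toyota", "Model": "Camry", "Cylinders": 6},
    {"Make": "Honda", "Model": "Civic", "Cylinders": 4},
]

model_to_make = afd_confidence(cars, ["Model"], "Make")
make_to_model = afd_confidence(cars, ["Make"], "Model")
```

On these rows Model determines Make exactly, while Make only approximately determines Model, which is the kind of asymmetry an AFD captures.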
Introduction
16
Using AFDs for Query Processing
• These AFDs make up for the missing dependency
information between columns.
• They help in propagating constraints distributed across
tables.
• They help in predicting the attributes distributed across
tables
• They assist in completing the entity information by
predicting the related attributes
MID | Make | Model | Price
HACV08 | Honda | Civic | 12000
TYCRA09 | Toyota | Corolla | 14500
Introduction
17
Summary
• Traditional query processing does not hold for
Autonomous Web Databases.
• Problems like incomplete/Noisy data, imprecise
query and ad hoc normalization exist.
• Schema Heterogeneity can be countered by
existing works.
• (Still) Missing PK-FK information leads to
inaccurate joins.
• Mine Approximate Functional Dependencies and
use them to make up for missing PK-FK
information.
Introduction
18
Problem Statement
Given a collection of ad hoc normalized
tables, the attribute mappings between
the tables and a partial query – return the
user an accurate result set covering the
majority of attributes described in the
universal relation.
Introduction
19
Agenda
• Introduction [Ravi]
• SmartINT System [Anupam]
• Query Processing [Anupam]
– Source Selection
– Tuple Expansion
• Learning [Anupam]
• Experiments [Ravi]
• Conclusion & Future Work [Ravi]
20
SMART-INT(EGRATOR) &
RELATED WORK
21
SmartINT Framework
[Architecture diagram. Components: Query Interface, Query, Source Selection, Tuple Expansion, Result Set, AFDMiner, Statistics Learner, Attribute Mapping, Web Database, Graph of Tables, Tree of Tables]
SmartINT
22
Related Work – Attribute Mapping
• Large body of research over the past few years
• Automatic and Manual Approaches
– LSD (Doan et al, SIGMOD 2001)
– Simiflood (Melnik et al, ICDE 2002)
– Cupid (J. Madhavan et al, VLDB 2001)
– SEMINT (Clifton et al, TKDE 2000)
– Clio (Hernandez et al, SIGMOD 2001)
• Schema Mapping (Translation Rules) is More Difficult!!
• 1-1 Attribute mapping is comparatively easier and can be automated
SmartINT
23
Related Work – Query Interface
• Imprecise Queries
– Vague (A. Motro, ACM TOIS 1998)
– AIMQ (U. Nambiar et al, ICDE 2006)
– QUIC (Kambhampati et al, CIDR 2007)
• Keyword Search
– BANKS (Bhalotia et al, ICDE 2002)
– DISCOVER (Hristidis et al, VLDB 2003)
– KITE (Mayassam et al, ICDE 2007)
• PK-FK Assumption does not hold!!
SmartINT
24
Related Work – Web Database
• Query Processing on Web Databases is an important research problem
– Ives et al, SIGMOD 2004
– Lembo et al, KRDB 2002
• QPIAD (G. Wolf et al, VLDB 2007) from DBYochan, close to ours in spirit, uses AFD-based prediction to make up for missing data.
SmartINT
25
Related Work – AFD Mining
• FD/AFD Mining is an important problem in the DB Community
• Mines AFDs as approximations of FDs with few error tuples
– CORDS
– TANE
• Mining them as a condensed representation of association rules
• AFDMiner (Kalavagattu, MS Thesis, ASU 2008)
SmartINT
26
Agenda
• Introduction [Ravi]
• SmartINT System [Anupam]
• Query Processing [Anupam]
– Source Selection
– Tuple Expansion
• Learning [Anupam]
• Experiments [Ravi]
• Conclusion & Future Work [Ravi]
27
QUERY PROCESSING
28
Query Answering Task
SELECT Make, Vehicle-type WHERE cylinders = 4 AND price < $15k
• Distributed constraints
• Distributed attributes
• Attributes need to be integrated
• Attribute Match
• Result set should adhere to all the constraints distributed across tables
Make | Model | Price
Honda | Accord | 19000
Honda | Civic | 12000
Toyota | Camry | 14500
Toyota | Corolla | 14500
Model_name | Review | Vehicle-type | Dealer
Corolla | Excellent | Midsize | Frank
Accord | Good | Fullsize | Frank
Highlander | Average | SUV | John
Camry | Excellent | Fullsize | Steven
Civic | Very Good | Midsize | Frank
Mdl | Engine | Cylinders
Accord | K24A4 | 6
Corolla | F23A1 | 4
Corolla | 155 HP | 4
Camry | 2AZ-FE I4 | 6
Civic | F23A1 | 4
Civic | J27B1 | 4
Name | Address
Frank | 1011 E Lemon St, Scottsdale, AZ
Steven | 601 Apache Blvd, Glendale, AZ
John | 900 10th Street, Tucson, AZ
Query Processing
Query Answering Approach
1. Select a tree
2. Propagate constraints to the root table
3. Process root table constraints to generate “seed” tuples
4. Predict attributes using AFDs to expand seed tuples
• Direction of constraint propagation and attribute prediction matters!
• Role of AFDs: accuracy of constraint propagation and attribute prediction depends on AFD confidence
Make | Model | Price
Honda | Accord | 19000
Honda | Civic | 12000
Toyota | Camry | 14500
Toyota | Corolla | 14500
Model_name | Review | Vehicle-type | Dealer
Corolla | Excellent | Midsize | Frank
Accord | Good | Fullsize | Frank
Highlander | Average | SUV | John
Camry | Excellent | Fullsize | Steven
Civic | Very Good | Midsize | Frank
Mdl | Engine | Cylinders
Accord | K24A4 | 6
Corolla | F23A1 | 4
Corolla | 155 HP | 4
Camry | 2AZ-FE I4 | 6
Civic | F23A1 | 4
Civic | J27B1 | 4
Name | Address
Frank | 1011 E Lemon St, Scottsdale, AZ
Steven | 601 Apache Blvd, Glendale, AZ
John | 900 10th Street, Tucson, AZ
Query Processing
Tuple
Expansion
Query
Source
Selection
Tree of Tables
SOURCE SELECTION
31
Selecting the best tree
Objective: Given a graph of tables and a query,
select the most relevant tree of tables of size up to k
1
4
2
Source Selection
4
2
5
3
3
6
Query
Requirements
1. Need to estimate relevance of a table, when some of the
constraints are not mapped on to its attributes
2. Need a relevance function for a tree of tables
Source Selection
32
Constraint Propagation
Distributed constraints: Price < 15k (Table 1), Cylinders = 4 (Table 2)
Table 1
Make | Model | Price
Honda | Accord | 19000
Honda | Civic | 12000
Toyota | Camry | 14500
Toyota | Corolla | 14500
Table 2
Mdl | Engine | Cylinders
Accord | K24A4 | 6
Corolla | F23A1 | 4
Corolla | 155 HP | 4
Camry | 2AZ-FE I4 | 6
Civic | F23A1 | 4
Civic | J27B1 | 4
• Propagate Cylinders = 4 to Table 1
• Translated constraint on Table 1: Model = Corolla or Civic
• The AFD provides the conditional probability P2(Cylinders = 4 | Mdl = model_i)
Source Selection
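The propagation step can be sketched as follows. This assumes the conditional probability is estimated by simple counting over Table 2 and that a model is kept when the probability reaches a cut-off; the `min_probability` threshold and the function names are hypothetical, not taken from SmartINT.

```python
from collections import Counter

# (Mdl, Cylinders) pairs copied from Table 2 on the slide.
engine_makers = [
    ("Accord", 6),
    ("Corolla", 4),
    ("Corolla", 4),
    ("Camry", 6),
    ("Civic", 4),
    ("Civic", 4),
]


def conditional_probability(rows, target_cylinders):
    """Estimate P(Cylinders = target | Mdl = m) for every model m by counting."""
    totals = Counter(mdl for mdl, _ in rows)
    hits = Counter(mdl for mdl, cyl in rows if cyl == target_cylinders)
    return {mdl: hits[mdl] / totals[mdl] for mdl in totals}


def translate_constraint(rows, target_cylinders, min_probability=0.5):
    """Rewrite 'Cylinders = target' as a set of model values usable on Table 1."""
    probabilities = conditional_probability(rows, target_cylinders)
    return {mdl for mdl, p in probabilities.items() if p >= min_probability}


allowed_models = translate_constraint(engine_makers, 4)
```

With the sample data the translated constraint is the set of models Corolla and Civic, matching the slide.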
33
Relevance of a tree
Relevance of tree T w.r.t. query q
[Formula not recoverable from the extracted slide text]
Factors?
1. Root table relevance
2. Value overlap: what fraction of tuples in the base table can be expanded by the child table
3. AFD Confidence: how accurately can the value be predicted?
C1: Price < 15k
C2: Model = ‘Corolla’ or ‘Civic’
T1
Make | Model | Price
Honda | Accord | 19000
Honda | Civic | 12000
Toyota | Camry | 14500
Toyota | Corolla | 14500
T2
Mdl | Engine | Cylinders
Accord | K24A4 | 6
Corolla | F23A1 | 4
Corolla | 155 HP | 4
Camry | 2AZ-FE I4 | 6
Civic | F23A1 | 4
Civic | J27B1 | 4
T3
Model_name | Review | Vehicle-type | Dealer
Corolla | Excellent | Midsize | Frank
Accord | Good | Fullsize | Frank
Highlander | Average | SUV | John
Camry | Excellent | Fullsize | Steven
Civic | Very Good | Midsize | Frank
Source Selection
34
Relevance of a table
SELECT Make, Vehicle-type WHERE cylinders = 4 AND price < $15k
Factors?
1. Fraction of query attributes provided – horizontal relevance
2. Conformance to constraints – vertical relevance
C1: Price < 15k
C2: Model = ‘Corolla’ or ‘Civic’
Make | Model | Price
Honda | Accord | 19000
Honda | Civic | 12000
Toyota | Camry | 14500
Toyota | Corolla | 14500
Cylinders = 4
Mdl | Engine | Cylinders
Accord | K24A4 | 6
Corolla | F23A1 | 4
Corolla | 155 HP | 4
Camry | 2AZ-FE I4 | 6
Civic | F23A1 | 4
Civic | J27B1 | 4
Source Selection
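The two factors can be illustrated with a rough sketch. The slides name the factors but not their formulas, so both definitions below are assumptions: horizontal relevance as the fraction of projected query attributes the table provides, and vertical relevance as the fraction of its tuples that satisfy the (translated) constraints. All names are illustrative.

```python
def horizontal_relevance(table_columns, query_attributes):
    """Fraction of the query's projected attributes that the table provides."""
    provided = [a for a in query_attributes if a in table_columns]
    return len(provided) / len(query_attributes)


def vertical_relevance(rows, constraints):
    """Fraction of the table's tuples that conform to every constraint."""
    if not rows:
        return 0.0
    conforming = [r for r in rows if all(check(r) for check in constraints)]
    return len(conforming) / len(rows)


table1 = [
    {"Make": "Honda", "Model": "Accord", "Price": 19000},
    {"Make": "Honda", "Model": "Civic", "Price": 12000},
    {"Make": "Toyota", "Model": "Camry", "Price": 14500},
    {"Make": "Toyota", "Model": "Corolla", "Price": 14500},
]
constraints = [
    lambda r: r["Price"] < 15000,                 # C1
    lambda r: r["Model"] in {"Corolla", "Civic"},  # C2 (translated constraint)
]

h = horizontal_relevance({"Make", "Model", "Price"}, ["Make", "Vehicle-type"])
v = vertical_relevance(table1, constraints)
```

How the two scores are combined into a single table relevance is not stated on the slide and is left open here.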
35
Tuple
Expansion
Query
Source
Selection
Tree of Tables
TUPLE EXPANSION
36
Tuple Expansion
• Tuple expansion operates on the
tree of tables given by source
selection
• It has two main steps
1. Constructing the Schema
2. Populating the tuples
37
Phase 1: Constructing schema
SELECT Make, Vehicle-type WHERE cylinders = 4 AND price < $15k
Tree of tables
Table 1
Make | Model | Price
Honda | Accord | 19000
Honda | Civic | 12000
Toyota | Camry | 14500
Toyota | Corolla | 14500
Table 3
Model_name | Review | Vehicle-type | Dealer
Corolla | Excellent | Midsize | Frank
Accord | Good | Fullsize | Frank
Highlander | Average | SUV | John
Camry | Excellent | Fullsize | Steven
Civic | Very Good | Midsize | Frank
Constructed schema
Make | Model | Price | Model_name | Vehicle-type
Tuple Expansion
38
Phase 2: Populating the tuples
Local constraint: Price < 15k
Translated constraint: Model = Corolla or Civic
Make | Model | Price
Honda | Accord | 19000
Honda | Civic | 12000
Toyota | Camry | 14500
Toyota | Corolla | 14500
Model_name | Vehicle-type
Corolla | Midsize
Accord | Fullsize
Highlander | SUV
Camry | Fullsize
Civic | Midsize
Evaluate constraints, then predict Vehicle-type:
Make | Model | Vehicle-type
Honda | Civic | Midsize
Toyota | Corolla | Midsize
Tuple Expansion
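The two steps on this slide can be sketched as follows, assuming the AFD Model ~~> Vehicle-type is applied by predicting the most frequent Vehicle-type seen for each model in the child table. Function names are illustrative, not SmartINT's.

```python
from collections import Counter, defaultdict

# Root table (Make, Model, Price) and child table (Model_name, Vehicle-type) from the slide.
table1 = [
    ("Honda", "Accord", 19000),
    ("Honda", "Civic", 12000),
    ("Toyota", "Camry", 14500),
    ("Toyota", "Corolla", 14500),
]
table3 = [
    ("Corolla", "Midsize"),
    ("Accord", "Fullsize"),
    ("Highlander", "SUV"),
    ("Camry", "Fullsize"),
    ("Civic", "Midsize"),
]


def seed_tuples(rows, max_price, allowed_models):
    """Apply the local and the translated constraint on the root table."""
    return [
        (make, model)
        for make, model, price in rows
        if price < max_price and model in allowed_models
    ]


def build_predictor(rows):
    """Map each model to its most frequent Vehicle-type in the child table."""
    votes = defaultdict(Counter)
    for model, vehicle_type in rows:
        votes[model][vehicle_type] += 1
    return {model: c.most_common(1)[0][0] for model, c in votes.items()}


def expand(seeds, predictor):
    """Attach the predicted Vehicle-type to every seed tuple (None if unknown)."""
    return [(make, model, predictor.get(model)) for make, model in seeds]


seeds = seed_tuples(table1, 15000, {"Corolla", "Civic"})
result = expand(seeds, build_predictor(table3))
```

On the sample data `result` contains the two expanded tuples shown on the slide.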
39
Agenda
• Introduction [Ravi]
• SmartINT System [Anupam]
• Query Processing [Anupam]
– Source Selection
– Tuple Expansion
• Learning [Anupam]
• Experiments [Ravi]
• Conclusion & Future Work [Ravi]
40
LEARNING
41
AFD Mining
• The problem of AFD Mining is to learn all AFDs
that hold over a given relational table
• Two costs:
1. Major cost is the Combinatoric cost of
traversing the search space
2. Cost of visiting data to validate each rule
(To compute the interestingness measures)
• Search process for AFDs is exponential in terms
of the number of attributes
Learning
Specificity
Normalized with the worst case Specificity i.e., X is a key
• The Specificity measure captures our intuition of
different types of AFDs.
• It is based on information entropy
– Shares similar motivations with the way SplitInfo is
defined in decision trees while computing
Information Gain Ratio
• Follows Monotonicity
– The Specificity of a subset is equal to or lower than the
Specificity of the set. (based on Apriori property)
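The Specificity formula itself is not in the slide text. The sketch below is therefore only an assumption that is consistent with the bullets above: it takes the entropy of the partition induced by the attribute set X and normalizes it by the worst case, where X is a key and every tuple is its own group.

```python
import math
from collections import Counter


def specificity(rows, attributes):
    """Entropy of the grouping induced by `attributes`, divided by log2(N)."""
    n = len(rows)
    if n <= 1 or not attributes:
        return 0.0
    counts = Counter(tuple(row[a] for a in attributes) for row in rows)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(n)


# Toy table: A is a key, B splits the rows in half, C is constant.
rows = [
    {"A": 1, "B": "x", "C": "k"},
    {"A": 2, "B": "x", "C": "k"},
    {"A": 3, "B": "y", "C": "k"},
    {"A": 4, "B": "y", "C": "k"},
]

key_specificity = specificity(rows, ["A"])
half_specificity = specificity(rows, ["B"])
constant_specificity = specificity(rows, ["C"])
```

Under this definition a superset of attributes can only refine the grouping, so its value never drops below that of any subset, which is the monotonicity property the slide relies on.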
Learning
Lattice Traversal
Specificity Follows Monotonicity
Lattice of attribute sets:
ABCD
ABC  ABD  ACD  BCD
AB  AC  AD  BC  BD  CD
A  B  C  D
Ø
AFDMiner mines rules with High Confidence and Low Specificity which are apt for works like QPIAD, but SmartINT requires rules with High Specificity. So we change the direction of traversal so that we can use the monotonicity of Specificity to prune more nodes.
• Upper bound on Specificity – bottom up makes sense
• Reaches the Specificity threshold
• Traversal direction through the lattice depends on the pruning techniques available
Learning
44
Lattice Traversal
Specificity Follows Monotonicity
Lattice of attribute sets:
ABCD
ABC  ABD  ACD  BCD
AB  AC  AD  BC  BD  CD
A  B  C  D
Ø
• Lower bound on Specificity – Top down makes sense
• Reaches the Specificity threshold
• All these nodes are pruned off
• Traversal direction through the lattice depends on the pruning techniques available
Learning
45
Pruning Strategies
1. Pruning off non-shared Attributes
– SmartINT is not interested in non-shared
attributes in the determining set. It is only
interested in rules with shared attributes in
determining set.
2. Pruning by Specificity
– Specificity(Y) ≥ Specificity(X), where Y is a
superset of X
– If Specificity(X) < minSpecificity, we can prune
all AFDs with X and its subsets as the
determining set
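Pruning by Specificity can be sketched as a top-down walk over the lattice of candidate determining sets. The sketch assumes the entropy-based Specificity stand-in defined inside the block (not necessarily AFDMiner's exact measure) and assumes `attributes` already contains only shared attributes, i.e. pruning strategy 1 has been applied. All names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

rows = [
    {"A": 1, "B": "x", "C": "k"},
    {"A": 2, "B": "x", "C": "k"},
    {"A": 3, "B": "y", "C": "k"},
    {"A": 4, "B": "y", "C": "k"},
]
evaluated = []  # records which attribute sets were actually scored


def specificity(attribute_set):
    """Assumed measure: normalized entropy of the grouping induced by the set."""
    evaluated.append(attribute_set)
    n = len(rows)
    counts = Counter(tuple(r[a] for a in sorted(attribute_set)) for r in rows)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(n)


def candidate_determining_sets(attributes, min_specificity):
    """Visit larger sets first; once a set falls below the threshold, skip its subsets."""
    pruned, kept = [], []
    for size in range(len(attributes), 0, -1):
        for subset in combinations(attributes, size):
            candidate = frozenset(subset)
            if any(candidate < p for p in pruned):
                continue  # subset of a pruned set: cannot reach the threshold
            if specificity(candidate) < min_specificity:
                pruned.append(candidate)
            else:
                kept.append(candidate)
    return kept


kept = candidate_determining_sets(("A", "B", "C"), 0.6)
```

Here the set {B, C} falls below the threshold, so {B} and {C} are never scored; only five of the seven non-empty sets are evaluated.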
Learning
Agenda
• Introduction [Ravi]
• SmartINT System [Anupam]
• Query Processing [Anupam]
– Source Selection
– Tuple Expansion
• Learning [Anupam]
• Experiments [Ravi]
• Conclusion & Future Work [Ravi]
47
EXPERIMENTAL EVALUATION
48
Experimental Hypothesis
In the context of Autonomous Web Databases, learning
Approximate Functional Dependencies (AFDs) and using
them in query answering results in better retrieval
accuracy than the direct-join or single-table approaches.
49
Experimental Setup
• Performed experiments over Vehicle data crawled from Google Base
– 350,000 Tuples
• Generated different partitions of the tables
• Posed queries on the data with varying projected attributes and varying constraints
• Implemented in Java
– Source code at the following location [In development]
– http://24cross7.svnrepository.com/svn/sorcerer/trunk/code/smartintweb
• Data stored in MySQL database
Experiments
50
Evaluation Methodology
• We should have the ‘Oracular Truth’ to compare the
approaches
• MASTER TABLE - Table containing all the tuples
with the universal relation which serves as oracular
truth
• Splitting MASTER TABLE into different partitions
• Issue queries over both partitioned tables and
master table – Compare the results and measure
precision
Experiments
51
Correctness & Completeness
Let's consider the following tuple from the Master Table (Ground Truth)
Tuple from Master Table (8 Attributes)
The following is the tuple from one of the approaches
Tuple from one of the approaches (6 Attributes): RIGHT, WRONG, RIGHT, WRONG, RIGHT, WRONG
• Correctness of a tuple = fraction of correct values among those retrieved. Here it is 3/6
• Completeness of a tuple = number of values retrieved out of the number of attributes in the master tuple. Here it is 6/8
• Need two metrics analogous to Precision and Recall at the tuple level
Experiments
52
Precision & Recall
Result Set from Master Table (8 Attributes)
Result Set from one of the approaches (6 Attributes): RIGHT, WRONG, RIGHT, WRONG, RIGHT, WRONG
Precision = Average Correctness of the tuples
Recall = Cumulative completeness of tuples returned
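A small sketch of the tuple-level metrics. Correctness and completeness follow the definitions on the previous slide. The slide does not say how the cumulative completeness is normalized, so dividing by the number of tuples in the master result set is an assumption, and the attribute values are made up for illustration.

```python
def correctness(retrieved, truth):
    """Fraction of retrieved values that match the master tuple."""
    if not retrieved:
        return 0.0
    right = sum(1 for attr, value in retrieved.items() if truth.get(attr) == value)
    return right / len(retrieved)


def completeness(retrieved, truth):
    """Number of retrieved values relative to the attributes of the master tuple."""
    return len(retrieved) / len(truth)


def precision(pairs):
    """Average correctness over (retrieved, truth) pairs."""
    return sum(correctness(r, t) for r, t in pairs) / len(pairs)


def recall(pairs, master_result_size):
    """Cumulative completeness, normalized by the master result size (assumption)."""
    return sum(completeness(r, t) for r, t in pairs) / master_result_size


# Illustrative values only: 8 master attributes, 6 retrieved, 3 of them right.
truth = {"Make": "Toyota", "Model": "Corolla", "Vehicle-type": "Midsize", "Price": 14000,
         "Engine": "F23A1", "Cylinders": 4, "Miles": "80k", "Dealer": "Frank"}
retrieved = {"Make": "Toyota", "Model": "Camry", "Vehicle-type": "Midsize",
             "Price": 16000, "Cylinders": 4, "Dealer": "John"}

tuple_correctness = correctness(retrieved, truth)
tuple_completeness = completeness(retrieved, truth)
```

For this single tuple the values reproduce the 3/6 and 6/8 quoted for the example tuple.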
Experiments
53
Varying No. of Projected Attributes
[Plots: Precision vs Attributes, Recall vs Attributes, F-measure vs Attributes. x-axis: Attributes (2, 4, 6); y-axis: 0 to 1]
Around 0.55 improvement in F-measure
Experiments
54
Varying No. of Constraints
[Plots: Precision vs Constraints, Recall vs Constraints, F-measure vs Constraints. x-axis: Constraints (2, 3, 4); y-axis: 0 to 1]
Experiments
55
Other Experiments
Comparison with Multiple Join Paths
[Bar chart: Precision, Recall and F-measure (0 to 1) for Join: Model; Join: Year; Join: Model, Year; SmartINT]
• SmartINT performed better than all possible joins
Variable Width Expansion
• The dip in F-measure can be used to stop the expansion
Experiments
56
Learning Evaluation
AFDMiner performs
better than TANE
approach
The execution time
and the quality of
AFDs are both higher
than TANE
Kalavagattu 2008 – M.S Thesis
Experiments
57
DEMO [work in progress]
http://149.169.227.245:8080/smartintweb/
Experiments
58
Agenda
• Introduction [Ravi]
• SmartINT System [Anupam]
• Query Processing [Anupam]
– Source Selection
– Tuple Expansion
• Learning [Anupam]
• Experiments [Ravi]
• Conclusion & Future Work [Ravi]
59
CONCLUSION &
FUTURE WORK
60
Conclusion
• Autonomous Web Databases call for
novel systems to counter the problems
due to uncertainty of the Web.
• SmartINT makes an effort to answer one
such issue – Missing PK-FK
• The system gave good improvement in
terms of F-measure over approaches like
Single Table and Direct Join.
Conclusion and Future Work
61
Autonomous Web Database | Traditional Database
Incomplete: QPIAD (VLDB ‘07, VLDBJ ‘09) | Complete Data
Imprecise: AIMQ (ICDE ‘06), QUIC (CIDR ‘07) | Certain Query
Ad hoc: SmartINT (Submitted to ICDE ‘09) | Lossless Normalization
Probabilistic | Accurate Results
Conclusion and Future Work
62
Future Work
• Back-door JOIN
– Can SmartINT be used as back-door approach to join tables?
– SmartINT performs as well as other systems when the PK-FK
relation is present
– In the absence of such information, other systems fail whereas
SmartINT gives good accuracy
• Vertical Aggregation
– Taking into account the vertical overlap between the tables
– In the absence of substantial overlap, the strength of AFDs
would not help you to retrieve accurate results
• Discover Key Info
– Using AFDMiner to discover key information
Conclusion and Future Work
63
Future Work
• Top ‘KW’ search
– Striking a balance between the number of
tuples and width of the tuple.
– The more you expand the less precise the
results are going to be
• Diverse results
– Providing the user with diverse set of results.
Conclusion and Future Work
64
Thank you…
• Prof. Subbarao Kambhampati
• Prof. Pat Langley
• Prof. Jieping Ye
• Special thanks to
–Aravind Kalavagattu
–Raju Balakrishnan
65
QUESTIONS
66
Individual Contribution
• Problem Identification and Formulation
– Identifying the problem: Joint work
– Using AFDs for Tuple Expansion: Gummadi
– Source Selection: Khulbe
• System Development and Evaluation
– Initial framework setup: Gummadi
– Tuple Expansion, Experiments (Multiple join paths, variable width expansion): Gummadi
– Source Selection, Experiments (comparison with direct-join and single table approaches): Khulbe
• Writing
– Introduction, Related Work, System Description: Gummadi
– Preliminaries, Source Selection: Khulbe
– Experiments: Joint Work
– Learning: Aravind Kalavagattu
67