Efficient Keyword Search across
Heterogeneous Relational Databases

Mayssam Sayyadian, AnHai Doan
University of Wisconsin - Madison

Hieu LeKhac
University of Illinois - Urbana

Luis Gravano
Columbia University
Key Message of Paper

Precise data integration is expensive.
But we can do IR-style data integration very cheaply, with no manual cost:
– just apply automatic schema/data matching
– then do keyword search across the databases
– no need to verify anything manually

Already very useful. Builds upon keyword search over a single database ...
Keyword Search over a Single Relational Database

A growing field, numerous current works
– DBXplorer [ICDE02], BANKS [ICDE02]
– DISCOVER [VLDB02]
– Efficient IR-style keyword search in databases [VLDB03]
– VLDB-05, SIGMOD-06, etc.

Many related works over XML / other types of data
– XKeyword [ICDE03], XRank [SIGMOD03]
– TeXQuery [WWW04]
– ObjectRank [SIGMOD06]
– TopX [VLDB05], etc.

More are coming at SIGMOD-07 ...
A Typical Scenario

Customers:
  tid  custid  name   contact        addr
  t1   c124    Cisco  Michael Jones  …
  t2   c533    IBM    David Long     …
  t3   c333    MSR    David Ross     …

Complaints:
  tid  id    emp-name       comments
  u1   c124  Michael Smith  Repair didn’t work
  u2   c124  John           Deferred work to John Smith

Foreign-key join between Customers and Complaints

Q = [Michael Smith Cisco]
Ranked list of answers:
  t1 (c124, Cisco, Michael Jones, …) ⋈ u1 (c124, Michael Smith, Repair didn’t work)   score = .8
  t1 (c124, Cisco, Michael Jones, …) ⋈ u2 (c124, John, Deferred work to John Smith)   score = .7
Our Proposal:
Keyword Search across Multiple Databases

Service DB:

Customers:
  tid  custid  name   contact        addr
  t1   c124    Cisco  Michael Jones  …
  t2   c533    IBM    David Long     …
  t3   c333    MSR    Joan Brown     …

Complaints:
  tid  id    emp-name       comments
  u1   c124  Michael Smith  Repair didn’t work
  u2   c124  John           Deferred work to John Smith

HR DB:

Employees:
  tid  empid  name
  v1   e23    Mike D. Smith
  v2   e14    John Brown
  v3   e37    Jack Lucas

Groups:
  tid  eid  reports-to
  x1   e23  e37
  x2   e14  e37

Query: [Cisco Jack Lucas]
Answer joins tuples across databases:
  t1 (c124, Cisco, Michael Jones, …)
  ⋈ u1 (c124, Michael Smith, Repair didn’t work)
  ⋈ v1 (e23, Mike D. Smith)
  ⋈ x1 (e23, e37)
  ⋈ v3 (e37, Jack Lucas)

 IR-style data integration
A Naïve Solution

1. Manually identify FK joins across DBs
2. Manually identify matching data instances across DBs
3. Now treat the combination of DBs as a single DB
    apply current keyword search techniques

Just like in traditional data integration, this is too much manual work.
Kite Solution

Automatically find FK joins / matching data instances across databases
 no manual work is required from the user

(Same example databases as before: the Service DB with Customers and
Complaints, and the HR DB with Employees and Groups.)
Automatically Find FK Joins across Databases

(Example: discover the FK join between Complaints.emp-name in the Service
DB and Employees.name in the HR DB.)

Current solutions analyze data values (e.g., Bellman)
 limited accuracy
– e.g., “waterfront” with values yes/no vs. “electricity” with values
  yes/no: the value distributions match, but the attributes are unrelated

Our solution: data analysis + schema matching
– improves accuracy drastically (by as much as 50% F-1)

Automatic join / data matching can be wrong
 incorporate confidence scores into answer scores
Incorporate Confidence Scores into Answer Scores

Recall: answer example in single-DB settings
  t1 (c124, Cisco, Michael Jones, …) ⋈ u1 (c124, Michael Smith, Repair didn’t work)   score = .8

Recall: answer example in multiple-DB settings
  t1 (c124, Cisco, Michael Jones, …)
  ⋈ u1 (c124, Michael Smith, Repair didn’t work)
  ⋈ v1 (e23, Mike D. Smith)          score 0.7 for data matching
  ⋈ x1 (e23, e37)                    score 0.9 for FK join
  ⋈ v3 (e37, Jack Lucas)

score (A, Q) = [ α·score_kw (A, Q) + β·score_join (A, Q) + γ·score_data (A, Q) ] / size (A)
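The scoring formula above can be sketched in a few lines; the weights and component scores used below are illustrative assumptions, not Kite's actual values.

```python
# Hypothetical sketch of the answer-scoring formula above; the default
# weights and the example inputs are assumptions for illustration.

def answer_score(score_kw, score_join, score_data, size,
                 alpha=1.0, beta=1.0, gamma=1.0):
    """Combine keyword score, FK-join confidence, and data-matching
    confidence, normalized by the number of tuples in the answer."""
    return (alpha * score_kw + beta * score_join + gamma * score_data) / size

# Example: a 3-tuple answer with keyword score 0.8, FK-join confidence 0.9,
# and data-matching confidence 0.7
print(round(answer_score(0.8, 0.9, 0.7, size=3), 2))  # 0.8
```

Normalizing by size(A) penalizes answers that need many joined tuples, favoring tighter connections between the keywords.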
Summary of Trade-Offs

SQL queries / precise data integration
– the holy grail

IR-style data integration, naïve way
– manually identify FK joins, matching data
– still too expensive

IR-style data integration, using Kite
– automatic FK join finding / data matching
– cheap
– only approximates the “ideal” ranked list found by the naïve way
Kite Architecture

Offline preprocessing (over databases D1 … Dn):
– Schema Matcher
– Data instance matcher
– Foreign-Key Join Finder (Data-based Join Finder)  foreign-key joins
– Index Builder  IR index1 … IR indexn

Online querying (e.g., Q = [ Smith Cisco ]):
– Condensed CN Generator
– Top-k Searcher, guided by refinement rules (Full, Partial, Deep)
– issues distributed SQL queries against D1 … Dn
Online Querying

(Example setting: two databases, each with two relations.)

What current solutions do:
1. Create answer templates
2. Materialize answer templates to obtain answers
Create Answer Templates

Find tuples that contain query keywords
– use each DB’s IR index
– example: Q = [Smith Cisco]
  Tuple sets:
    Service-DB: ComplaintsQ = {u1, u2}, CustomersQ = {v1}
    HR-DB: EmployeesQ = {t1}, GroupsQ = {}

Create the tuple-set graph

Schema graph:
  Customers –J1– Complaints –J4– Emps –J2, J3– Groups

Tuple-set graph: each relation R splits into RQ (tuples matching the query)
and R{} (the remaining tuples), and each schema-graph join J1 … J4 is
replicated among the corresponding tuple sets, e.g.:
  CustomersQ –J1– ComplaintsQ –J4– EmpsQ –J2– Groups{}
  Customers{} –J1– Complaints{} –J4– Emps{} –J3– Groups{}
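The first step above can be sketched as a lookup in per-relation inverted indexes; the toy indexes below are assumptions standing in for each database's IR index, mirroring the slide's example sets.

```python
# Illustrative sketch: compute per-relation tuple sets R^Q (tuples that
# contain at least one query keyword) from a toy inverted index.

def tuple_set(inverted_index, query_keywords):
    """Union of tuple ids matching any query keyword."""
    result = set()
    for kw in query_keywords:
        result |= inverted_index.get(kw.lower(), set())
    return result

# Toy IR indexes for the Service-DB relations (assumed contents)
complaints_index = {"smith": {"u1", "u2"}}
customers_index = {"cisco": {"v1"}}

q = ["Smith", "Cisco"]
print(sorted(tuple_set(complaints_index, q)))  # ['u1', 'u2']
print(sorted(tuple_set(customers_index, q)))   # ['v1']
```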
Create Answer Templates (cont.)

Search the tuple-set graph to generate answer templates
– also called Candidate Networks (CNs)

Each answer template = one way to join tuples to form an answer

Sample CNs:
  CN1: CustomersQ
  CN2: CustomersQ –J1– ComplaintsQ
  CN3: EmpsQ –J2– Groups{} –J3– Emps{} –J4– ComplaintsQ
  CN4: EmpsQ –J3– Groups{} –J2– Emps{} –J4– ComplaintsQ
Materialize Answer Templates to Generate Answers

By generating and executing SQL queries

  CN: CustomersQ –J1– ComplaintsQ
  (CustomersQ = {v1}, ComplaintsQ = {u1, u2})

  SQL:
  SELECT * FROM Customers C, Complaints P
  WHERE C.cust-id = P.id AND
        (C.tuple-id = v1) AND
        (P.tuple-id = u1 OR P.tuple-id = u2)

Naïve solution
– materialize all answer templates, score, rank, then return the answers

Current solutions
– find only top-k answers
– materialize only certain answer templates
– make decisions using refinement rules + statistics
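Generating that kind of SQL from a two-relation CN can be sketched as simple string assembly; the function name and CN encoding below are hypothetical, with the relation aliases, join condition, and tuple ids taken from the slide's example.

```python
# Hedged sketch of turning a two-relation CN into a SQL query like the one
# above. The helper and its encoding are illustrative assumptions.

def cn_to_sql(rel1, alias1, rel2, alias2, join_cond, ids1, ids2):
    """Build a SELECT that joins two relations and restricts each side
    to the tuple ids in its tuple set."""
    in1 = " OR ".join(f"{alias1}.tuple-id = {t}" for t in ids1)
    in2 = " OR ".join(f"{alias2}.tuple-id = {t}" for t in ids2)
    return (f"SELECT * FROM {rel1} {alias1}, {rel2} {alias2} "
            f"WHERE {join_cond} AND ({in1}) AND ({in2})")

sql = cn_to_sql("Customers", "C", "Complaints", "P",
                "C.cust-id = P.id", ["v1"], ["u1", "u2"])
print(sql)
```

Restricting each relation to its tuple set keeps the remote query cheap: only tuples that can contribute to an answer are touched.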
Challenges for Kite Setting

More databases
 way too many answer templates to generate
– can take hours on just 3–4 databases

Materializing an answer template takes way too long
– requires SQL query execution across multiple databases
– invoking each database incurs large overhead

Difficult to obtain reliable statistics across databases

See paper for our solutions (or backup slides)
Empirical Evaluation

Domains:

  Domain     # DBs  Avg # tables  Avg # attributes  Avg # approximate FK joins        Avg # tuples  Total
                    per DB        per table         per schema / across DBs / per pair  per table     size
  DBLP       2      3             3                 11 / 6 / 11                        500K          400M
  Inventory  8      5.8           5.4               890 / 804 / 33.6                   2K            50M
Sample Inventory schema (Inventory 1):
  AUTHOR, ARTIST, BOOK, CD, WH2BOOK, WH2CD, WAREHOUSE

The DBLP schemas:
  DBLP 1: AR (aid, biblo), CITE (id1, id2), PU (aid, uid)
  DBLP 2: AR (id, title), AU (id, name), CNF (id, name)
Runtime Performance (1)

[Charts: runtime (sec) vs. maximum CCN size, for DBLP (2-keyword queries,
k=10, 2 databases) and Inventory (2-keyword queries, k=10, 5 databases);
and runtime (sec) vs. # of databases for Inventory (maximum CCN size = 4,
2-keyword queries, k=10). Lines compared: the Hybrid algorithm adapted to
run over multiple databases; Kite without adaptive rule selection and
without rule Deep; Kite without condensed CNs; Kite without rule Deep; and
the full-fledged Kite algorithm.]
Runtime Performance (2)

[Charts: runtime (sec) vs. # of keywords in the query |q|, for DBLP (max
CCN = 6, k=10, 2 databases) and Inventory (max CCN = 4, k=10, 5 databases);
and runtime (sec) vs. # of answers requested k, for Inventory (2-keyword
queries, max CCN = 4, 5 databases).]
Query Result Quality

[Charts: Pr@k vs. k (1 to 20), for OR-semantic queries and for AND-semantic
queries.]

Pr@k = the fraction of answers that appear in the “ideal” list
Summary

Kite executes IR-style data integration
– performs some automatic preprocessing
– then immediately allows keyword querying

Relatively painless
– no manual work!
– no need to create a global schema, nor to understand SQL

Can be very useful in many settings,
e.g., on-the-fly, best-effort, for non-technical people:
– enterprises, on the Web: need only a few answers
– emergency (e.g., hospital + police): need answers quickly
Future Directions

Incorporate user feedback
 interactive IR-style data integration

More efficient query processing
– large # of databases, network latency

Extend to other types of data
– XML, ontologies, extracted data, Web data

IR-style data integration is feasible and useful:
– extends current works on keyword search over DBs
– raises many opportunities for future work
BACKUP
Condensing Candidate Networks

In multi-database settings  unmanageable number of CNs
– many CNs share the same tuple sets and differ only in the associated joins
– group such CNs into condensed candidate networks (CCNs)

Example: condense the sample CNs
  CN3: EmpsQ –J2– Groups{} –J3– Emps{} –J4– ComplaintsQ
  CN4: EmpsQ –J3– Groups{} –J2– Emps{} –J4– ComplaintsQ
into one CCN whose edges carry join sets:
  EmpsQ –{J2, J3}– Groups{} –{J2, J3}– Emps{} –J4– ComplaintsQ

The tuple-set graph is condensed the same way, e.g.:
  CustomersQ –J1– ComplaintsQ –J4– EmpsQ –{J2, J3}– Groups{}
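The grouping step above can be sketched as hashing CNs by their node sequence and merging join labels edge by edge; the CN encoding and function name below are assumptions for illustration.

```python
# Illustrative sketch of condensing CNs into CCNs: CNs with identical node
# sequences (tuple sets) merge into one CCN whose edges carry join sets.

from collections import defaultdict

def condense(cns):
    """Each CN is (nodes_tuple, joins_tuple), one join label per edge.
    Returns CCNs as {nodes_tuple: list of per-edge join sets}."""
    ccns = defaultdict(lambda: None)
    for nodes, joins in cns:
        if ccns[nodes] is None:
            ccns[nodes] = [set() for _ in joins]
        for edge_set, j in zip(ccns[nodes], joins):
            edge_set.add(j)
    return dict(ccns)

# CN3 and CN4 from the slide share nodes and differ only in joins
cn3 = (("EmpsQ", "Groups{}", "Emps{}", "ComplaintsQ"), ("J2", "J3", "J4"))
cn4 = (("EmpsQ", "Groups{}", "Emps{}", "ComplaintsQ"), ("J3", "J2", "J4"))
ccns = condense([cn3, cn4])
print(ccns)  # one CCN; its first two edges carry {J2, J3}, the last {J4}
```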
Top-k Search

Main ideas for top-k keyword search:
– No need to materialize all CNs
– Sometimes, even partially materializing a CN is enough
– Estimate score intervals for CNs, then branch-and-bound search

Example (three iterations over CNs P, Q, R):
  iteration 1: P [0.6, 1], Q [0.5, 0.7], R [0.4, 0.9]
  iteration 2: expand P into P1 [0.6, 0.8], P2 = 0.9, P3 = 0.7;
               K = {P2, P3}, min score = 0.7; R [0.4, 0.9] still open
  iteration 3: expand R into R1 [0.4, 0.6], R2 = 0.85;
               Res = {P2, R2}, min score = 0.85

Kite approach: materialize CNs using refinement rules
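The branch-and-bound idea above can be sketched in a few lines: keep score intervals for unmaterialized CNs, expand the one with the highest upper bound, and stop once no interval can beat the current k-th answer. The intervals and the `expand` callback below are toy assumptions mirroring the slide's example.

```python
# Minimal branch-and-bound sketch of interval-based top-k search.

import heapq

def top_k(cns, expand, k):
    """cns: dict name -> (lo, hi) score interval.
    expand(name) -> list of (answer, exact_score) for that CN."""
    frontier = [(-hi, name) for name, (_, hi) in cns.items()]
    heapq.heapify(frontier)
    answers = []  # min-heap of (score, answer), size <= k
    while frontier:
        neg_hi, name = heapq.heappop(frontier)
        kth = answers[0][0] if len(answers) == k else float("-inf")
        if -neg_hi <= kth:  # no remaining CN can beat the k-th answer
            break
        for ans, score in expand(name):
            if len(answers) < k:
                heapq.heappush(answers, (score, ans))
            elif score > answers[0][0]:
                heapq.heapreplace(answers, (score, ans))
    return sorted(answers, reverse=True)

# Toy example mirroring the slide: P yields P2 = 0.9, R yields R2 = 0.85,
# and Q's upper bound 0.7 is pruned without materialization
cns = {"P": (0.6, 1.0), "Q": (0.5, 0.7), "R": (0.4, 0.9)}
results = {"P": [("P1", 0.75), ("P2", 0.9), ("P3", 0.7)],
           "Q": [("Q1", 0.65)],
           "R": [("R1", 0.5), ("R2", 0.85)]}
print(top_k(cns, results.__getitem__, k=2))  # [(0.9, 'P2'), (0.85, 'R2')]
```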
Top-k Search Using Refinement Rules

• In single-database settings 
  select rules based on database statistics
• In multi-database settings  inaccurate statistics
• Inaccurate statistics  inappropriate rule selection
Refinement Rules

Full:
– exhaustively extract all answers from a CN (fully materialize it)
–  too much data to move around the network (data-transfer cost)

Partial:
– try to extract the most promising answer from a CN
–  invoke remote databases for only one answer (high cost of database
  invocation)

Deep:
– a middle-ground approach
– once a table in a remote database is invoked, extract all answers
  involving that table
– takes into account database invocation cost

(Example tuple sets used in the illustrations:
  TQ: t1 = 0.9, t2 = 0.7, t3 = 0.4, t4 = 0.3
  UQ: u1 = 0.8, u2 = 0.6, u3 = 0.5, u4 = 0.1)
Adaptive Search

Question: which refinement rule to apply next?
– In single-database settings  based on database statistics
– In multi-database settings  inaccurate statistics

Kite approach: adaptively select rules
  goodness-score (rule, cn) = benefit (rule, cn) – cost (rule, cn)
– cost (rule, cn): optimizer’s estimated cost for the rule’s SQL statements
– benefit (rule, cn): reduced if the rule is applied for a while without
  making any progress
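The adaptive selection above can be sketched as picking the rule with the best goodness score and decaying a rule's benefit when it makes no progress; the costs, decay factor, and helper names below are illustrative assumptions, not Kite's actual estimates.

```python
# Hedged sketch of adaptive rule selection via goodness scores.

def pick_rule(rules):
    """rules: dict name -> {'benefit': float, 'cost': float}.
    Returns the rule maximizing benefit - cost."""
    return max(rules, key=lambda r: rules[r]["benefit"] - rules[r]["cost"])

def report_no_progress(rules, name, decay=0.5):
    """Reduce a rule's benefit after it runs without producing answers."""
    rules[name]["benefit"] *= decay

rules = {"Full": {"benefit": 5.0, "cost": 4.0},
         "Partial": {"benefit": 3.0, "cost": 1.0},
         "Deep": {"benefit": 4.0, "cost": 1.5}}

print(pick_rule(rules))            # Deep (goodness 2.5)
report_no_progress(rules, "Deep")  # Deep made no progress; benefit -> 2.0
print(pick_rule(rules))            # Partial (goodness 2.0 vs Deep's 0.5)
```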
Other Experiments

[Chart: join discovery accuracy (F1) on Inventory 1 – Inventory 5,
comparing Join Discovery alone vs. Join Discovery + Schema Matching.]

 Schema matching helps improve the join discovery algorithm drastically

[Chart: Kite over a single database — time (sec) vs. max CCN size, 1–8.]

 Kite also improves the single-database keyword search algorithm mHybrid