Effective Keyword Based Selection of Relational Databases
Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H. Tung
Overview
• What is unstructured retrieval?
This is retrieving data from documents like
journals, articles etc.
• What is structured retrieval?
Retrieving data from databases, XML files
etc. (that is, structural relationship
between data exists)
Traditional IR approach
• Use keyword frequency and document
frequency statistics for query words to
determine relevance of a document
– Keyword frequency – No. of times a keyword
appears in a document
– Document frequency – No. of documents in
which a keyword appears.
• Use the combination of the two as a
weighting factor
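As a rough illustration of combining the two statistics, the sketch below uses the common tf × log(N/df) weighting. The slides do not say which combination is meant, so this formula and the function name `tf_idf` are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed weighting: keyword frequency * log(N / document frequency).
    """
    n = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # keyword frequency within this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["multimedia", "database", "database"], ["database", "vldb"]]
weights = tf_idf(docs)
```

A term that occurs in every document (here "database") gets weight 0 under this scheme, however often it appears.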
Traditional IR technique is
inadequate for relational databases
• Traditional IR techniques do not capture the
relationship between data sources in a
normalized database
• Need to take into account the relationship
between keywords in a database
• Example:
– A keyword is in a tuple referenced by many other
tuples
– No. of joins that need to be performed to get all
keywords in a query
Example
DB1

Inproceedings
id | inprocID      | title                           | procID | year | mon | annote
t1 | Adiba1986     | Historical Multimedia Databases | 23     | 1988 | Aug | temporal
t2 | Abarbanel1987 | Connection Perspective Reform   | 18     | 1987 | May | Intellicorp

Conferences
id | procID | Conference
t3 | 23     | The conference on Very Large Databases (VLDB)
t4 | 18     | ACM Sigmod Conf on management of data
Example
DB2
Example
Query = (Multimedia, Database, VLDB)
• DB1 will give us good results,
• But traditional IR model will return DB2 as the
better one as term frequencies are higher in
DB2
• Hence we need to effectively summarize
relationships between keywords in databases
Contributions
1) Address the problem of selection of structured data
sources for keyword based queries
2) Propose a method for summarizing relationships
between keywords in a database
3) Define metrics to rank source databases given a
keyword query based on keyword relationships
4) Evaluation of proposed summarization using real
datasets
Measuring Strength of
Relationships Between Keywords
• Strength of relationships between two keywords
measured as a combination of two factors:
1) Proximity factor – Inverse of distance
2) Frequency factor, given a distance d –
Number of combinations of exactly d+1
distinct tuples that can be joined in a
sequence to get the two keywords in the
end tuples
Modeling of an RDBMS
• Let m = No. of distinct keywords in database DB
• Let n = Total no. of tuples in DB.
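• Then D is an m × n matrix whose rows correspond to keywords k1 … km
and whose columns correspond to tuples t1 … tn
• D represents presence or absence of a keyword in a tuple:
D[i, j] = 1 if keyword ki appears in tuple tj, and 0 otherwise
(Similar to term-document incidence matrix in VSM)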
Modeling of an RDBMS Cont’d
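• Matrix T represents relationship between tuples
(for example, foreign key)
• T is an n × n matrix whose rows and columns both correspond to
tuples t1 … tn; T[i, j] = 1 if tuples ti and tj are directly related,
and 0 otherwise
• Sample entries shown: rows (0 1) and (1 0)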
Mathematical representation of
keyword relationships
1) δ: User supplied parameter denoting maximum number of
allowed join operators
2) K: Maximum no. of results expected from the database
Enables a user to control the quality of results
3) For each distance d (0 ≤ d ≤ δ),
ωd(ki, kj) = frequency of d-distance joining sequences to
connect ki and kj
Mathematical representation of
keyword relationships Cont’d
• A Keyword Relationship Matrix (KRM) R represents the
relationship between any two pair of keywords with
respect to δ and K
1) When
$$\sum_{d=0}^{\delta} \omega_d(k_i, k_j) \le K,$$
$$R[i, j] = r_{ij} = \sum_{d=0}^{\delta} \psi_d \cdot \omega_d(k_i, k_j), \quad \text{where } \psi_d = \frac{1}{d + 1}$$
Mathematical representation of
keyword relationships Cont’d
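2) When
$$\sum_{d=0}^{\delta} \omega_d(k_i, k_j) > K,$$
we have δ' ≤ δ such that
$$\sum_{d=0}^{\delta'} \omega_d(k_i, k_j) \ge K \quad \text{and} \quad \sum_{d=0}^{\delta'-1} \omega_d(k_i, k_j) < K$$
$$R[i, j] = r_{ij} = \sum_{d=0}^{\delta'-1} \psi_d \cdot \omega_d(k_i, k_j) + \psi_{\delta'} \cdot \Big(K - \sum_{d=0}^{\delta'-1} \omega_d(k_i, k_j)\Big), \quad \text{where } \psi_d = \frac{1}{d + 1}$$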
Example
• For two given keywords k1 and k2, and K=40
• Database A has 5 joining sequences connecting them at
distance = 1
Then score = 5 * (1/2) = 2.5
• Database B has 40 joining sequences connecting them
at distance = 4
Then score = 40*(1/5) = 8
• Here B wins.
Example (cont’d)
• If we bring down K to 10, then A wins.
• Thus one may prefer A to B due to better
quality.
• K defines the number of top results users
expect from the database.
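The effect of K in this example can be checked with a small sketch of the KRM entry computation. The function name `relationship_score` and the list layout (index d holds ωd) are assumptions for illustration only.

```python
def relationship_score(freqs, K):
    """Score one keyword pair from its per-distance frequencies.

    freqs[d] is the number of d-distance joining sequences (omega_d) for
    d = 0..delta; K caps how many joining sequences are counted in total.
    """
    score = 0.0
    remaining = K  # joining sequences that may still be counted
    for d, omega in enumerate(freqs):
        counted = min(omega, remaining)
        score += counted / (d + 1)  # psi_d = 1 / (d + 1)
        remaining -= counted
        if remaining == 0:
            break
    return score

# Database A: 5 joining sequences at distance 1.
freqs_a = [0, 5, 0, 0, 0]
# Database B: 40 joining sequences at distance 4.
freqs_b = [0, 0, 0, 0, 40]

scores_k40 = (relationship_score(freqs_a, 40), relationship_score(freqs_b, 40))
scores_k10 = (relationship_score(freqs_a, 10), relationship_score(freqs_b, 10))
```

With K = 40 the pair scores are 2.5 for A and 8.0 for B; with K = 10 only ten of B's distant sequences count, giving 2.0 against A's unchanged 2.5.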
Computation of KRM
How to compute ωd(k i, kj)
Few definitions –
• d-distance tuple relationship matrix, denoted as
Td (n × n), is a symmetric matrix with binary entries
such that for any 1 ≤ i, j ≤ n and i ≠ j,
1) Td[i, j] = Td[j, i] = 1 if and only if the shortest joining
sequence to connect the two tuples ti and tj is of distance d,
2) and Td[i, j] = Td[j, i] = 0 otherwise
Three proven propositions aiding
the computation of the KRM
Proposition 1: For any i, j (i ≠ j) and d1, d2 (d1 ≠ d2),
if Td1[i, j] = 1, then Td2[i, j] = 0
Proposition 2: Given T1 = T, and supposing
$$T_d^* = \bigvee_{k=1}^{d} T_k,$$
$$T_{d+1}[i, j] = \begin{cases} 0 & \text{if } T_d^*[i, j] = 1 \\ 1 & \text{if } T_d^*[i, j] = 0 \text{ and } \exists r\ (1 \le r \le n),\ T_d[i, r] \cdot T_1[r, j] = 1 \end{cases}$$
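Proposition 2 suggests an iterative way to build T2, T3, … from T1. The following is a minimal sketch, assuming matrices are plain lists of lists with 0/1 entries and δ ≥ 1; the function name `distance_matrices` is hypothetical.

```python
def distance_matrices(T, delta):
    """Return [T_1, ..., T_delta] for a symmetric 0/1 tuple matrix T."""
    n = len(T)
    T1 = [row[:] for row in T]
    seen = [row[:] for row in T1]  # T*_d: pairs already connected at distance <= d
    result = [T1]
    prev = T1
    for _ in range(2, delta + 1):
        nxt = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j or seen[i][j]:
                    continue  # a shorter joining sequence already exists
                # Some r with T_d[i][r] = 1 and T_1[r][j] = 1 extends the path.
                if any(prev[i][r] and T1[r][j] for r in range(n)):
                    nxt[i][j] = 1
        for i in range(n):
            for j in range(n):
                seen[i][j] |= nxt[i][j]
        result.append(nxt)
        prev = nxt
    return result

# Hypothetical chain of four tuples: t1 - t2 - t3 - t4.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
T1, T2, T3 = distance_matrices(chain, 3)
```

On the chain, T2 marks the pairs (t1, t3) and (t2, t4), and T3 marks only (t1, t4), so each pair appears in exactly one matrix as Proposition 1 requires.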
Three proven propositions aiding
the Computation of KRM Cont’d
Proposition 3: Let W0 = D × D^T
(D^T is transpose of D, where D is keyword incidence matrix)
1) We have ∀ i, j, 1 ≤ i, j ≤ m and i ≠ j,
ω0(ki, kj) = W0[i, j]
2) For d ≥ 1, let Wd = D × Td × D^T
(where Td is the d-distance tuple relationship matrix)
We have ∀ i, j, 1 ≤ i, j ≤ m and i ≠ j,
ωd(ki, kj) = Wd[i, j]
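Proposition 3 can be tried on the DB1 example with plain matrix products. In this sketch the keyword-to-tuple assignment is an assumption read off the example tables ("Databases" is treated as matching "database"), and the helpers `matmul` and `transpose` are illustrative.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

keywords = ["database", "multimedia", "VLDB"]
# Keyword incidence matrix D; columns are tuples t1..t4 (assumed assignment).
D = [
    [1, 0, 1, 0],  # database: t1 (title), t3 (conference name)
    [1, 0, 0, 0],  # multimedia: t1
    [0, 0, 1, 0],  # VLDB: t3
]
# Tuple relationship matrix T1: t1-t3 and t2-t4 share a procID.
T1 = [
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
]

W0 = matmul(D, transpose(D))              # frequencies at distance 0
W1 = matmul(matmul(D, T1), transpose(D))  # frequencies at distance 1
```

Only the off-diagonal entries are used: W0 gives 1, 0, 1 and W1 gives 1, 1, 1 for the pairs database:multimedia, multimedia:VLDB and database:VLDB.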
Comparison of frequencies of
keyword pairs in DB1 and DB2
Frequencies of keyword pairs in DB1
Keyword pair        | d=0 | d=1 | d=2 | d=3 | d=4
database:multimedia | 1   | 1   | -   | -   | -
multimedia:VLDB     | 0   | 1   | -   | -   | -
Database:VLDB       | 1   | 1   | -   | -   | -

Frequencies of keyword pairs in DB2
Keyword pair        | d=0 | d=1 | d=2 | d=3 | d=4
database:multimedia | 0   | 0   | 0   | 0   | 2
multimedia:VLDB     | 0   | 0   | 0   | 0   | 0
Database:VLDB       | 0   | 0   | 1   | 0   | 0
Our query was Q = (Multimedia, Database, VLDB )
Observation tells us that query words are more closely related in DB1
Comparison of relationship scores
of DB1 and DB2
Keyword pair        | DB1 | DB2
Database:multimedia | 1.5 | 0.4
Multimedia:VLDB     | 0.5 | 0
Database:VLDB       | 1.5 | 0.33
• Sample computation for DB1 (K=10)
Rel [ Database, multimedia ] = 1 * 1 + 0.5 * 1 = 1.5
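The other non-zero entries can be worked out the same way from the listed frequencies, with ψd = 1/(d + 1). K = 10 is assumed for DB2 as well; it is not binding because no pair has more than 10 joining sequences.

```latex
\begin{align*}
\text{DB1: } Rel[\text{Multimedia}, \text{VLDB}] &= 1 \cdot 0 + \tfrac{1}{2} \cdot 1 = 0.5 \\
\text{DB2: } Rel[\text{Database}, \text{multimedia}] &= \tfrac{1}{5} \cdot 2 = 0.4 \\
\text{DB2: } Rel[\text{Database}, \text{VLDB}] &= \tfrac{1}{3} \cdot 1 \approx 0.33
\end{align*}
```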
Implementation with SQL
• Relation RD(kId, tId) represents the non-zero
entries of the keyword incidence matrix D
• kId is the keyword ID and tId is the tuple ID
• RK(kId, keyword) stores the keyword IDs and
keywords (similar to a word dictionary in IR)
• Matrices T1, T2, T3... (Tuple relationship matrices)
are represented with relations RT1,RT2 ,RT3..
• RT1 :- Produced by joining pairs of tables
• RT2 :- Produced by self-joining RT1
Implementation with SQL Cont’d
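RT3 produced using the following SQLs
-- Extend an RT2 pair (a, b) with an RT1 pair (b, c) to obtain (a, c)
INSERT INTO RT3 (tId1, tId2)
SELECT s1.tId1, s2.tId2
FROM RT2 s1, RT1 s2
WHERE s1.tId2 = s2.tId1;
-- RT2 pair (a, b) and RT1 pair (c, b) share their second tuple: obtain (a, c)
INSERT INTO RT3 (tId1, tId2)
SELECT s1.tId1, s2.tId1
FROM RT2 s1, RT1 s2
WHERE s1.tId2 = s2.tId2 AND s1.tId1 < s2.tId1;
-- RT1 pair (c, a) ends where RT2 pair (a, b) starts: obtain (c, b)
INSERT INTO RT3 (tId1, tId2)
SELECT s2.tId1, s1.tId2
FROM RT2 s1, RT1 s2
WHERE s1.tId1 = s2.tId2;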
Implementation with SQL Cont’d
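-- RT2 pair (a, b) and RT1 pair (a, c) share their first tuple: obtain (b, c)
INSERT INTO RT3 (tId1, tId2)
SELECT s1.tId2, s2.tId2
FROM RT2 s1, RT1 s2
WHERE s1.tId1 = s2.tId1 AND s1.tId2 < s2.tId2;
-- Remove pairs that are already connected at distance 2 (RT2) or distance 1 (RT1)
DELETE a FROM RT3 a, RT2 b, RT1 c
WHERE (a.tId1 = b.tId1 AND a.tId2 = b.tId2) OR
(a.tId1 = c.tId1 AND a.tId2 = c.tId2);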
• In general, RTd is generated by joining RTd-1 with RT1
and excluding the tuples already in RTd-1, RTd-2, … RT1
Creation of W0,W1, W2….(Matrices
representing frequencies)
• W0 is represented with a relation RW0(kId1, kId2, freq)
• tuple (kId1, kId2, freq) records the pair of keywords
(kId1,kId2) (kId1 < kId2), and its frequency (freq) at 0
distance, where freq is greater than 0.
• RW0 is the result of self-joining RD (kId, tId).
• SQL for creating RW0
-- For each keyword pair (kId1 < kId2), count the tuples that contain both keywords
INSERT INTO RW0 (kId1, kId2, freq)
SELECT s1.kId AS kId1, s2.kId AS kId2, count(*)
FROM RD s1, RD s2
WHERE s1.tId = s2.tId AND s1.kId < s2.kId
GROUP BY s1.kId, s2.kId;
Creation of W0,W1, W2….(Matrices
representing frequencies)
• SQL for creating RWd , d > 0
-- For each keyword pair, count the tuple pairs in RTd that hold one keyword each
INSERT INTO RWd (kId1, kId2, freq)
SELECT s1.kId AS kId1, s2.kId AS kId2, count(*)
FROM RD s1, RD s2, RTd r
WHERE ((s1.tId = r.tId1 AND s2.tId = r.tId2) OR
(s1.tId = r.tId2 AND s2.tId = r.tId1)) AND s1.kId < s2.kId
GROUP BY s1.kId, s2.kId;
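To see these two statements run end to end, the sketch below loads a tiny RD and RT1 for the DB1 example into an in-memory SQLite database. The keyword IDs (1 = database, 2 = multimedia, 3 = VLDB), the keyword-to-tuple assignment, the table definitions, and the use of SQLite are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE RD  (kId INTEGER, tId INTEGER);
CREATE TABLE RT1 (tId1 INTEGER, tId2 INTEGER);
CREATE TABLE RW0 (kId1 INTEGER, kId2 INTEGER, freq INTEGER);
CREATE TABLE RW1 (kId1 INTEGER, kId2 INTEGER, freq INTEGER);
""")
# Non-zero entries of D: (keyword ID, tuple ID).
cur.executemany("INSERT INTO RD VALUES (?, ?)", [(1, 1), (1, 3), (2, 1), (3, 3)])
# Tuple pairs joined by a foreign key, stored once with tId1 < tId2.
cur.executemany("INSERT INTO RT1 VALUES (?, ?)", [(1, 3), (2, 4)])

# Distance 0: both keywords occur in the same tuple.
cur.execute("""
INSERT INTO RW0 (kId1, kId2, freq)
SELECT s1.kId, s2.kId, count(*)
FROM RD s1, RD s2
WHERE s1.tId = s2.tId AND s1.kId < s2.kId
GROUP BY s1.kId, s2.kId
""")
# Distance 1: the keywords occur in two tuples linked in RT1.
cur.execute("""
INSERT INTO RW1 (kId1, kId2, freq)
SELECT s1.kId, s2.kId, count(*)
FROM RD s1, RD s2, RT1 r
WHERE ((s1.tId = r.tId1 AND s2.tId = r.tId2) OR
       (s1.tId = r.tId2 AND s2.tId = r.tId1)) AND s1.kId < s2.kId
GROUP BY s1.kId, s2.kId
""")
rw0 = cur.execute("SELECT kId1, kId2, freq FROM RW0 ORDER BY kId1, kId2").fetchall()
rw1 = cur.execute("SELECT kId1, kId2, freq FROM RW1 ORDER BY kId1, kId2").fetchall()
```

Pairs with frequency 0 produce no row, which matches the statement that only frequencies greater than 0 are stored.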
Final resulting KRM
• The final resulting KRM, R is stored in a relation
RR(kId1,kId2),consisting of pairs of keywords and their
relationship score.
• It is computed using the formula –
$$R[i, j] = \sum_{d=0}^{\delta} \psi_d \cdot \omega_d(k_i, k_j)$$
• Update issues :The tables for storing these matrices can be updated
dynamically.
Estimating multi-keyword
relationships
• Multiple keywords are connected with Steiner trees.
• It is an NP-complete problem to find a minimum Steiner
tree.
• Most current keyword search algorithms rely on
heuristics to find top-K results.
• Hence the relationship between multiple keywords is estimated
using the derived keyword relationships described above.
Estimating multi-keyword
relationships Cont’d
Proposition 4
1) Given a set of keywords Q = {k1, k2, k3, ..., kq},
the number of edges of the tuple tree TQ that contains
all the keywords in Q is no less than
$$\max_{1 \le i, j \le q,\ i \ne j} \ \min\{\, d \mid d \ge 0 \ \&\ \omega_d(k_i, k_j) > 0 \,\}$$
2) If a pair of keywords is not found in a KR summary,
the no. of edges of the tuple tree containing all the keywords
must be greater than δ, so its score is set to 0
so that it can be safely pruned from selection.
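The bound in Proposition 4 can be sketched as follows, assuming the KR summary is available as a dictionary from keyword pairs to frequency lists [ω0, …, ωδ]. The function name, the data layout, and the choice to signal pruning with `None` are illustrative assumptions.

```python
from itertools import combinations

def min_tree_edges_lower_bound(query, freqs):
    """Lower bound on the edges of a tuple tree containing all query keywords.

    freqs maps frozenset({ki, kj}) -> [omega_0, ..., omega_delta].
    Returns None when some pair has no joining sequence within delta,
    in which case the database can be pruned (its score is set to 0).
    """
    bound = 0
    for ki, kj in combinations(query, 2):
        omegas = freqs.get(frozenset((ki, kj)))
        if omegas is None:
            return None
        distances = [d for d, omega in enumerate(omegas) if omega > 0]
        if not distances:
            return None
        bound = max(bound, distances[0])  # smallest d with omega_d > 0
    return bound

query = ["database", "multimedia", "vldb"]
db1 = {
    frozenset(("database", "multimedia")): [1, 1],
    frozenset(("multimedia", "vldb")): [0, 1],
    frozenset(("database", "vldb")): [1, 1],
}
db2 = {
    frozenset(("database", "multimedia")): [0, 0, 0, 0, 2],
    frozenset(("multimedia", "vldb")): [0, 0, 0, 0, 0],
    frozenset(("database", "vldb")): [0, 0, 1, 0, 0],
}
bound_db1 = min_tree_edges_lower_bound(query, db1)
bound_db2 = min_tree_edges_lower_bound(query, db2)
```

For DB1 the bound is 1 edge; for DB2 the pair multimedia:VLDB is never connected, so the database is pruned for this query.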
Estimating multi-keyword
relationships Cont’d
We can use four kinds of estimations of scores:
1) $$rel_{min}(Q, DB) = \min_{\{k_i, k_j\} \subseteq Q,\ i < j} rel(k_i, k_j)$$
This is the most conservative estimation formula
2) $$rel_{max}(Q, DB) = \max_{\{k_i, k_j\} \subseteq Q,\ i < j} rel(k_i, k_j)$$
Estimating multi-keyword
relationships Cont’d
3) $$rel_{sum}(Q, DB) = \sum_{\{k_i, k_j\} \subseteq Q,\ i < j} rel(k_i, k_j)$$
4) $$rel_{prod}(Q, DB) = \prod_{\{k_i, k_j\} \subseteq Q,\ i < j} rel(k_i, k_j)$$
This formula assumes the highest degree of intersection
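The four estimations can be compared on the pairwise scores from the DB1 and DB2 example. This is a minimal sketch; the function name `estimate` and the dictionary layout are assumptions, and a pair missing from the summary is scored 0 as described in Proposition 4.

```python
from itertools import combinations
from math import prod

def estimate(query, rel, how):
    """Combine pairwise relationship scores for a multi-keyword query.

    rel maps frozenset({ki, kj}) -> score; how is one of
    "min", "max", "sum", "prod".
    """
    scores = [rel.get(frozenset(pair), 0.0) for pair in combinations(query, 2)]
    combine = {"min": min, "max": max, "sum": sum, "prod": prod}[how]
    return combine(scores)

query = ["database", "multimedia", "vldb"]
rel_db1 = {
    frozenset(("database", "multimedia")): 1.5,
    frozenset(("multimedia", "vldb")): 0.5,
    frozenset(("database", "vldb")): 1.5,
}
rel_db2 = {
    frozenset(("database", "multimedia")): 0.4,
    frozenset(("database", "vldb")): 0.33,
}
db1_scores = {how: estimate(query, rel_db1, how) for how in ("min", "max", "sum", "prod")}
db2_scores = {how: estimate(query, rel_db2, how) for how in ("min", "max", "sum", "prod")}
```

DB1 scores higher than DB2 under all four estimations here; MIN and PROD give DB2 a score of 0 because one of its keyword pairs is unconnected.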
Database ranking and indexing
• With KR summary, we can effectively rank a set of databases
D = {DB1,DB2,…,DBN} for a given keyword query.
DB1 is ranked ahead of DB2 if and only if rel(Q, DB1) > rel(Q, DB2)
• We can use either a global index or a local index
• Global Index –
1. Analogous to an inverted index in IR
Use keyword pairs as key, and <database Id, relationship
score> as a postings entry
2. To evaluate a query, fetch the corresponding inverted
lists, and compute the score for each database.
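A global index of this kind might be sketched as a dictionary keyed by keyword pair. The names `build_global_index` and `rank_databases` are hypothetical, and summing the pair scores (the SUM estimation) is just one of the possible ways to combine them.

```python
from collections import defaultdict
from itertools import combinations

def build_global_index(summaries):
    """summaries: {db_id: {frozenset({k1, k2}): score}} -> inverted index.

    Each keyword pair maps to a postings list of (db_id, score) entries.
    """
    index = defaultdict(list)
    for db_id, pairs in summaries.items():
        for pair, score in pairs.items():
            if score > 0:
                index[pair].append((db_id, score))
    return index

def rank_databases(index, query):
    """Fetch the postings list of every query pair and sum scores per database."""
    totals = defaultdict(float)
    for pair in combinations(sorted(query), 2):
        for db_id, score in index.get(frozenset(pair), []):
            totals[db_id] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

summaries = {
    "DB1": {
        frozenset(("database", "multimedia")): 1.5,
        frozenset(("multimedia", "vldb")): 0.5,
        frozenset(("database", "vldb")): 1.5,
    },
    "DB2": {
        frozenset(("database", "multimedia")): 0.4,
        frozenset(("database", "vldb")): 0.33,
    },
}
index = build_global_index(summaries)
ranking = rank_databases(index, ["multimedia", "database", "vldb"])
```

Databases that share no keyword pair with the query never appear in a fetched postings list, so they are left out of the ranking without being examined.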
Database ranking and indexing
Cont’d
• Decentralized index
1. Each machine can store a subset of the index (that
is, keyword pairs and inverted lists)
2. When a query is received at a node, search
messages are sent across nodes and the
corresponding postings lists are retrieved.
Experiments done to evaluate
efficiency of this system
K-R score compared with score from brute force method
(real_rank) over 82 databases spread across 16 nodes.
• Effectiveness of this technique has been successfully
established over distributed databases
Definitions used for comparison:
1) DBi is ranked ahead of DBj in real_rank if and only if
real_score(Q, DBi) > real_score(Q, DBj),
where real_score is defined as
$$real\_score(Q, DB) = \sum_{i=1}^{k} Score(T_i, Q),$$
where Ti = ith top result given query Q,
and Score(Ti, Q) measures relevance of Ti to Q
Experiments done to evaluate
efficiency of this system
2) $$recall(l) = \frac{\sum_{DB \in Top_l(S)} Score(Q, DB)}{\sum_{DB \in Top_l(R)} Score(Q, DB)}$$
where S and R denote summary based and real rankings respectively,
and Score(Q, DB) is the real score of the database
(In IR, recall = (Number of relevant retrieved) / (Number of relevant))
3) $$precision(l) = \frac{|\{DB \in Top_l(S) \mid Score(Q, DB) > 0\}|}{|Top_l(R)|}$$
(Number of relevant / Number of retrieved)
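The two measures can be sketched directly from these definitions. The rankings and real scores below are made-up illustrative data, not results from the experiments, and l ≥ 1 is assumed.

```python
def recall_at(l, summary_rank, real_rank, score):
    """Real score captured by the top-l summary ranking, relative to the
    real score of the top-l real ranking."""
    denominator = sum(score[db] for db in real_rank[:l])
    if denominator == 0:
        return 0.0
    return sum(score[db] for db in summary_rank[:l]) / denominator

def precision_at(l, summary_rank, real_rank, score):
    """Fraction of the top-l summary ranking with a real score above 0."""
    useful = sum(1 for db in summary_rank[:l] if score[db] > 0)
    return useful / len(real_rank[:l])

# Hypothetical real scores and rankings for four databases.
score = {"A": 5.0, "B": 3.0, "C": 0.0, "D": 1.0}
real_rank = ["A", "B", "D", "C"]     # ordered by real score
summary_rank = ["A", "C", "B", "D"]  # ordered by summary-based estimate

recall_2 = recall_at(2, summary_rank, real_rank, score)
precision_2 = precision_at(2, summary_rank, real_rank, score)
```

Here the summary ranking places the irrelevant database C second, so at l = 2 recall is 5/8 = 0.625 and precision is 1/2 = 0.5.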
Experiments done to evaluate
efficiency of this system Cont’d
• Effects of δ (length of joining sequence)
1) Selection performance of keyword queries generally gets
better when δ grows larger.
2) Precision and recall values for different values of δ tend to
cluster into groups
3) There are big gaps in both precision and recall values
between the results for 0 ≤ δ ≤ 1 and those for larger δ
Experiments done to evaluate
efficiency of this system Cont’d
Recall and precision of 2-keyword queries using KR summaries and
KF-summaries
Experiments done to evaluate
efficiency of this system Cont’d
• Effects of number of query keywords –
1) Performance of 2-keyword queries generally better than
3-keyword and 4-keyword queries
5-keyword queries give better recall than 3 and 4 keyword queries
as they are more selective
2) Generally, the difference in the recall of queries with
different no. of keywords is less than that of the precision
This shows that the system is effective in assigning high ranks to
useful databases, although less relevant or irrelevant databases
may also be selected.
Experiments done to evaluate
efficiency of this system Cont’d
Comparison of four kinds of estimations
(MIN,MAX,SUM,PROD)
• SUM and PROD have similar behavior
and outperform the other two methods
• Hence it is more effective to take into account
relationship information of every keyword pair in the
query when estimating overall scores
Experiments done to evaluate
efficiency of this system Cont’d
Recall and precision of K-R summaries using different
estimations (δ = 3)