Automated Generation of Object
Summaries from Relational Databases:
A Novel Keyword Searching Paradigm
GEORGIOS FAKAS
Department of Computing and Mathematics,
Manchester Metropolitan University
Manchester, UK.
g.fakas@mmu.ac.uk
Related Work:
Keyword Search in Relational DBs
• Full-text Search (e.g. Oracle 9i Text)
• Kw Searching in Relational DBs (DISCOVER, BANKS)
Kw Search: Leverling, Peacock
Result: e3-o2-c2, e4-o6-c2
[Figure: Northwind data graph with the tuples of Region (r1, r2), Territories (t1 to t4), EmployeeTerritories (et1 to et4), Employees (e1 to e4), Customers (c1 to c3), Orders (o1 to o7), Shippers (s1 to s3), Order Details (od1 to od6), Products (p1, p2), Suppliers (su1) and Categories (ca1)]
Related Work:
Web Search Engines: Keyword Search
Kw Search: Peacock
Result:
A ranked set of web pages
A Novel Keyword Searching Paradigm:
Object Summaries (OSs)
Kw Search: Peacock
Result: A Ranked set of OSs
Result 1: 4 tuples out of 27
  Customer: CustomerID = Quick, CompanyName = QUICK-Stop, ContactName = Margaret Peacock, Address = Taucherstraße 10, ...
  Orders: OrderID = 10418, ShipName = QUICK-Stop, ShipAddress = Taucherstraße 10, ...
  Employees: LastName = Peacock, FirstName = Margaret, ...
  Shippers: CompanyName = Speedy Express, ...
Result 2: 4 tuples out of 24
  Employee: EmployeeID = 4, LastName = Peacock, FirstName = Margaret, Title = Sales Representative, TitleOfCourtesy = Mrs., Address = 4110 Old Redmond Rd., ...
  EmployeeTerritories: TerritoryDescription = Rockville, ...
  Region: Region = Eastern
  Orders: OrderID = 10418, ShipName = QUICK-Stop, ShipAddress = Taucherstraße 10, OrderDate = 1996-07-15, ...
Result 3: 4 tuples out of 9
  Employee: EmployeeID = 3, LastName = Peacock, FirstName = Janet, Title = Sales Representative, TitleOfCourtesy = Ms., Address = 722 Moss Bay Blvd., ...
  EmployeeTerritories: TerritoryDescription = Atlanta, ...
  Region: Region = Southern
  Orders: OrderID = 10273, ShipName = QUICK-Stop, ShipAddress = Taucherstraße 10, OrderDate = 1996-08-05, ...
Problems-Challenges:
How can we automatically (1) generate and (2) rank OSs, liberating users from knowledge of:
(1) the Schema and
(2) the Query Language?
OS Generation - Methodology
KW-ID = “Janet Leverling”
• tDS: a central tuple containing the Kw; tuples around tDS contain additional information about the Data Subject.
• RDS: the corresponding central Relation; similarly, the Relations around RDS contain additional information.
[Figure: the Northwind data graph (tuples r1 ... ca1) next to the schema graph of Employees, Orders, Shippers, EmployeeTerritories, Customers, Order Details, Territories, Region, CustomerDemographics, Products, Categories, Suppliers and CustomerCustomerDemo]
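The idea of collecting the tuples around tDS can be sketched as a breadth-first walk over the data graph. This is only an illustrative sketch: the `DATA_GRAPH` links are hypothetical (loosely modelled on the tuple ids used on the slides), and the `max_depth` cut-off is an assumed stand-in for whatever stopping rule is actually applied.

```python
from collections import deque

# Hypothetical data graph: tuple id -> tuple ids reachable through a foreign key.
DATA_GRAPH = {
    "e3": ["e2", "et1", "o2"],
    "et1": ["t4"],
    "t4": ["r2"],
    "o2": ["c2", "s3", "od1"],
    "od1": ["p2"],
    "p2": ["ca1"],
}


def object_summary(t_ds, graph, max_depth):
    """Breadth-first walk from the central tuple; returns {tuple_id: depth}."""
    seen = {t_ds: 0}
    queue = deque([t_ds])
    while queue:
        current = queue.popleft()
        if seen[current] == max_depth:
            continue  # do not expand tuples that sit on the depth limit
        for neighbour in graph.get(current, []):
            if neighbour not in seen:
                seen[neighbour] = seen[current] + 1
                queue.append(neighbour)
    return seen


os_e3 = object_summary("e3", DATA_GRAPH, max_depth=2)
```

With `max_depth=2` the walk stops at Order Details and never reaches the product or category tuples, which illustrates why a principled stopping criterion is needed.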
OS Generation - Methodology
KW-ID = “Janet Leverling”
OS for “Janet Leverling”:
Employees (e3)
    Employees (Reports To) (e2)
    Territories, Region (et1, t4, r2)
    Orders (o2)
        Customers (c2)
        Shippers (s3)
        Order Details (od1)
            Products (p2)
                Categories (ca1)
GDS, with the number of tuples of each Relation in parentheses:
Employees (9)
    Employees (2)
    Employees (9)
    EmployeeTerritories (49)
        Territories (53)
            Region (4)
    Orders (830)
        Customer (91)
            CustomerCustomerDemo (0)
        Shippers (3)
        Order Details (2155)
            Products (49)
                Categories (8)
                Suppliers (29)
OS Generation - Methodology
Problem: Not all Relations in GDS are relevant. How do we decide:
1) Which relations to select or not
2) When to stop traversing
Solution: Investigate Relational Semantics (Schema Connectivity, Cardinality, Relative Cardinality etc.) and quantify the Affinity of Relations.
Af(Ri→RDS): Affinity of Relations to RDS in GDS
Distance
• Physical (fd) and Logical (ld) distance, where ld = fd - |M:N|
• E.g. Orders is closer to Employees than Customer and CustomerDemo are
• Hubs: spurious shortcuts (RDS ... N:1 Rhub 1:M R2) that bring in rather irrelevant or lateral information
[Figure: the Northwind GDS annotated with the physical distance of each level from Employees (fdi = 0 to 4, e.g. EmployeeTerritories fdi=1, Territories fdi=2, Region fdi=3)]
[Figure: TPC-H example with RC(R1, R2) labels on the edges, over Supplier (10,000), Nation (25), Region (5), Partsupp (800,000), Customer (150,000), Orders (1,500,000), Lineitem (6,001,215) and Part (200,00*)]
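The relation ld = fd - |M:N| can be sketched with a breadth-first search over the schema graph. The `SCHEMA` adjacency and the `JUNCTIONS` set below are assumptions made for illustration (a slice of Northwind around Employees, with EmployeeTerritories treated as the only pure M:N junction); the function name is hypothetical.

```python
from collections import deque

# Hypothetical slice of the Northwind schema graph (undirected adjacency).
SCHEMA = {
    "Employees": ["EmployeeTerritories", "Orders"],
    "EmployeeTerritories": ["Employees", "Territories"],
    "Territories": ["EmployeeTerritories", "Region"],
    "Region": ["Territories"],
    "Orders": ["Employees", "Customers", "Shippers"],
    "Customers": ["Orders"],
    "Shippers": ["Orders"],
}
# Relations assumed to exist only to implement an M:N relationship.
JUNCTIONS = {"EmployeeTerritories"}


def distances(r_ds, schema, junctions):
    """Return {relation: (fd, ld)} with ld = fd - (M:N junctions passed through)."""
    state = {r_ds: (0, 0)}  # relation -> (fd, junctions crossed so far)
    queue = deque([r_ds])
    while queue:
        current = queue.popleft()
        fd, crossed = state[current]
        # Leaving a junction relation completes one M:N relationship.
        crossed_next = crossed + (1 if current in junctions else 0)
        for neighbour in schema[current]:
            if neighbour not in state:
                state[neighbour] = (fd + 1, crossed_next)
                queue.append(neighbour)
    return {rel: (fd, fd - crossed) for rel, (fd, crossed) in state.items()}


dist = distances("Employees", SCHEMA, JUNCTIONS)
```

Under these assumptions Territories is two physical steps from Employees but only one logical step, while Customers stays at distance two in both measures.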
Af(Ri→RDS): Affinity of Relations to RDS in GDS
Connectivity
• Schema Connectivity (Coi)
• Data-graph Connectivity:
  • Relative Cardinality (RCi→j), i.e. the average number of tuples of Ri that are connected with each tuple from Rj
    • for 1:M, RCi→j = |Ri|/|Rj|
    • for M:1, RCi→j = 1
  • Reverse Relative Cardinality (RRCi→j) is the reverse of RCi→j, i.e. RRCi→j = RCj→i
[Figure: the Northwind GDS with each edge labelled RC (RRC), e.g. 92.2 (1) for Orders, 1 (9.1) for Customer, 1 (276.6) for Shippers and 5.4 (1) for Territories]
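As a worked example of the two cardinality measures, the sketch below recomputes some of the edge labels from the tuple counts shown on the slides. The function names and the `r_j_to_r_i` parameter (the relationship type seen from Rj towards Ri) are illustrative assumptions, not notation from the talk.

```python
# Tuple counts as shown on the slides.
CARDINALITY = {"Employees": 9, "Orders": 830, "Customer": 91}

FLIP = {"1:M": "M:1", "M:1": "1:M"}


def relative_cardinality(r_i, r_j, r_j_to_r_i, cardinality):
    """RC(i->j): average number of R_i tuples connected with each R_j tuple."""
    if r_j_to_r_i == "1:M":  # one R_j tuple links to many R_i tuples
        return cardinality[r_i] / cardinality[r_j]
    if r_j_to_r_i == "M:1":  # each R_j tuple links to exactly one R_i tuple
        return 1.0
    raise ValueError(f"unsupported relationship type: {r_j_to_r_i}")


def reverse_relative_cardinality(r_i, r_j, r_j_to_r_i, cardinality):
    """RRC(i->j) = RC(j->i): the same relationship seen from the other side."""
    return relative_cardinality(r_j, r_i, FLIP[r_j_to_r_i], cardinality)


rc_orders = relative_cardinality("Orders", "Employees", "1:M", CARDINALITY)
rrc_orders = reverse_relative_cardinality("Orders", "Employees", "1:M", CARDINALITY)
rrc_customer = reverse_relative_cardinality("Customer", "Orders", "M:1", CARDINALITY)
```

Rounded to one decimal, `rc_orders` is 92.2 and `rrc_customer` is 9.1, which agrees with the 92.2 (1) and 1 (9.1) labels in the figure.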
Af(Ri→RDS): Affinity of Relations to RDS in GDS
• DAf(Ri) = {(m1, w1), (m2, w2), ..., (mn, wn)}
• m1 = f1(ldi), m2 = f1(log(10*RCi)), m3 = f1(log(10*RRCi)), m4 = f1(log(10*Coi))
• f1(α) = (11 - α)/10
• For a hub-child, m1 = f1(ldi*hi) and m2 = f1(RCi)
Formula 1 (Semantic Affinity):
The affinity of Ri to RDS, denoted as Af(Ri→RDS), with respect to a schema and a database conforming to the schema, can be calculated with the following formula:
$$Af(R_i \rightarrow R_{DS}) = \sum_{j} m_j w_j \cdot Af(R_{Parent} \rightarrow R_{DS})$$
where Af(RParent→RDS) is the affinity of Ri's Parent to RDS, or is 1 if RParent ≡ RDS. □
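A minimal sketch of Formula 1 follows. It assumes that the tuple <0.5, 0.4, 0.05, 0.05> shown on the evaluation charts gives the weights w1..w4, and it takes the metric values m1..m4 for Order and Customer directly from the backup table at the end of the deck rather than recomputing them; the function names are hypothetical.

```python
# Assumed weights w1..w4 (the tuple shown on the evaluation charts).
WEIGHTS = (0.5, 0.4, 0.05, 0.05)


def f1(alpha):
    """Normalisation function from the slide: f1(a) = (11 - a) / 10."""
    return (11 - alpha) / 10


def affinity(metrics, weights, parent_affinity=1.0):
    """Formula 1: weighted sum of the metrics, scaled by the parent's affinity.

    parent_affinity stays at 1.0 when the parent relation is RDS itself.
    """
    return sum(m * w for m, w in zip(metrics, weights)) * parent_affinity


# Metric values m1..m4 as listed in the backup table (RDS = Employees).
orders_af = affinity((1, 0.8, 1, 0.7), WEIGHTS)
customer_af = affinity((0.9, 1, 0.9, 0.9), WEIGHTS, parent_affinity=orders_af)
```

Under these assumptions the sketch gives about 0.905 for Order and about 0.85 for Customer, in line with the 0.90 and 0.85 reported on the slides. Because each relation's score is multiplied by its parent's, affinity can only stay equal or decrease along a path away from RDS.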
Af(Ri→RDS): Affinity of Relations to RDS in GDS
Affinity of each Relation to RDS = Employees, with its number of tuples in parentheses:
Employees: 1 (9)
Employees: 0.98 (2)
Employees: 0.98 (9)
EmployeeTerritories: no value shown (49)
Territories: 0.96 (53)
Region: 0.91 (4)
Orders: 0.90 (830)
Customer: 0.85 (91)
Shippers: 0.85 (3)
Order Details: 0.84 (2155)
Products: 0.74 (49)
Categories: 0.63 (8)
Suppliers: 0.62 (29)
CustomerCustomerDemo: Null (0)
GDS (θ)
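Assuming GDS(θ) denotes the part of GDS whose relations have an affinity of at least θ (the slide does not spell this out), the pruning step can be sketched as a simple filter over the affinity values shown above. The dictionary keys and the function name are illustrative.

```python
# Affinity values to RDS = Employees as shown on the slide; None stands for Null.
AFFINITY = {
    "Employees": 1.0,
    "Territories": 0.96,
    "Region": 0.91,
    "Orders": 0.90,
    "Customer": 0.85,
    "Shippers": 0.85,
    "Order Details": 0.84,
    "Products": 0.74,
    "Categories": 0.63,
    "Suppliers": 0.62,
    "CustomerCustomerDemo": None,
}


def gds_theta(affinity, theta):
    """Keep the relations whose affinity is known and at least theta."""
    return [rel for rel, af in affinity.items() if af is not None and af >= theta]


selected = gds_theta(AFFINITY, theta=0.7)
```

With θ = 0.7 the filter keeps everything down to Products and drops Categories, Suppliers and CustomerCustomerDemo.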
OS Ranking
A Ranked set of Partial OSs - A complete OS
[Figure: the three partial OSs returned for the Kw "Peacock" (4 tuples out of 27, 24 and 9 respectively) shown next to a complete OS tree]
OS Ranking - Problems and Challenges
• Existing Keyword Searching ranking semantics: the smaller the result, the higher its ranking.
• In contrast, in the proposed paradigm an OS containing many well-connected tuples should have greater importance than an OS with fewer tuples.
• For instance, a Customer or Employee OS involved in many Orders, or an Author who has authored many important papers and books.
OS Ranking - Importance
$$Im(OS) = \frac{\sum_{i} Im(t_i) \cdot Af(R(t_i))}{\log(|OS|) + 1}$$
where:
• ti is a tuple of OS
• Im(ti) is the Importance of ti (i.e. PageRank)
• |OS| is the number of tuples in OS
• Af(R(ti)) is the affinity of the Relation R that ti belongs to
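The importance formula can be sketched as follows. The base of the logarithm is not stated on the slide; base 10 is assumed here, and the function name and input shape (one `(importance, affinity)` pair per tuple) are illustrative.

```python
import math


def os_importance(tuples):
    """Im(OS) for a list of (Im(t_i), Af(R(t_i))) pairs, one pair per tuple.

    Assumes a base-10 logarithm in the denominator.
    """
    if not tuples:
        raise ValueError("an OS contains at least its central tuple")
    weighted_sum = sum(importance * affinity for importance, affinity in tuples)
    return weighted_sum / (math.log10(len(tuples)) + 1)


# Ten tuples of importance 1 from a relation of affinity 1.
large_os = os_importance([(1.0, 1.0)] * 10)
# A single-tuple OS: the denominator is log(1) + 1 = 1.
small_os = os_importance([(0.5, 0.9)])
```

The denominator grows only logarithmically with |OS|, so an OS with many important, high-affinity tuples still ranks above a small one, which is the behaviour argued for in the ranking discussion.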
Experimental Evaluation
• MS Northwind and TPC-H DBs
• Precision, Recall, F-Score
• Compare GDSs and OSs produced by 12 GDS(θ) v GDS(h)
• GDS(h) was proposed by 10 participants
• GDS: average F-score 86.77; OS: average F-score 83
[Figure: GDS Precision, Recall and F-score (Averages), <0.5, 0.4, 0.05, 0.05>; bar chart on a 0 to 100 axis over the Northwind relations (Customers, Employees, Suppliers, Shippers, Orders, Products) and the TPC-H relations (Customer, Supplier, Parts, Orders, Nation, Region)]
[Figure: OSs Precision, Recall and F-score (Averages), <0.5, 0.4, 0.05, 0.05>; same axes and relations]
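For reference, the three measures reported in these charts can be computed as below when a generated GDS is compared with a reference one. The sketch assumes set-based precision and recall and the balanced F-score (F1); the two example sets are made up for illustration and are not results from the evaluation.

```python
def precision_recall_f(proposed, reference):
    """Return (precision, recall, F1) of a proposed set against a reference set."""
    proposed, reference = set(proposed), set(reference)
    hits = len(proposed & reference)
    precision = hits / len(proposed) if proposed else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical relation sets, for illustration only.
generated = {"Employees", "Orders", "Customer", "Products"}
expected = {"Employees", "Orders", "Customer", "Shippers", "Territories"}
p, r, f = precision_recall_f(generated, expected)
```

Here three of the four generated relations are correct (precision 0.75) and three of the five expected relations are found (recall 0.6).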
Conclusions - Future Work
• Top-k OS results
• Top-k size of an OS
  • Challenge: the weights of new tuples are not monotonic (since a tuple's PageRank may increase while its Affinity decreases).
• Alternatives to PageRank as the weighting system are currently being investigated, e.g. ObjectRank.
Conclusions - Novel Contributions
• The formal definition of the novel Searching Paradigm, which automatically produces a ranked set of OSs for a Data Subject.
  • Minimum contribution from the user (i.e. only a Kw)
  • No prior knowledge of the DB schema or query language needed
  • Excellent Precision, Recall and F-score results
• The formal definition and quantification of a Relation's Affinity in the context of GDS
  • Considers both Schema Design and Data distributions
• A novel ranking paradigm to calculate Im(OS)
  • The quantification of tuples' and OSs' Importance
  • A Combine Function that considers:
    • the weight (e.g. PageRank) of tuples,
    • Affinity and
    • size of OS
Af(Ri→RDS): Affinity of Relations to RDS in GDS

Metrics and affinity for RDS = Employees:

Ri                      | ldi, RCi, RRCi, Coi | m1..m4              | AfRi
Employees               | RDS                 | RDS                 | 1.00
Employees (ReportsTo)   | 1, 1, 0.9, 4        | 1, 1, 1, 0.7        | 0.98
Employees (ReportedBy)  | 1, 0.9, 1, 4        | 1, 1, 1, 0.7        | 0.98
Territories             | 1, 5.4, 1, 2        | 1, 0.9, 1, 0.9      | 0.96
Region                  | 2, 1, 13.2, 1       | 0.9, 1, 0.88, 1     | 0.91
Order                   | 1, 92.2, 1, 4       | 1, 0.8, 1, 0.7      | 0.90
Customer                | 2, 1, 9.1, 2        | 0.9, 1, 0.9, 0.9    | 0.85
Shipper                 | 2, 1, 276.6, 1      | 0.9, 1, 0.75, 1     | 0.85
OrderDetails            | 2, 2.5, 1, 2        | 0.9, 0.96, 1, 0.9   | 0.84
Product                 | 3, 1, 43.9, 4       | 0.8, 1, 0.83, 0.8   | 0.74
Supplier                | 4, 1, 1.6, 1        | 0.7, 1, 0.9, 1      | 0.63
Categories              | 4, 1, 6.1, 1        | 0.7, 1, 0.92, 1     | 0.62
CustDemographics        | 3, null, null, 1    | 0.8, null, null, 1  | Null

AfRi (rRi) for other choices of RDS:

Ri                      | RDS = Customer | RDS = Order | RDS = Shipper
Employees               | 0.88 (3)       | 0.97 (4)    | 0.82 (4)
Employees (ReportsTo)   | 0.78 (5)       | 0.91 (5)    | 0.73 (5)
Employees (ReportedBy)  | 0.70 (7)       | 0.85 (7)    | 0.66 (7)
Territories             | 0.55 (10)      | 0.66 (10)   | 0.51 (10)
Region                  | 0.46 (11)      | 0.59 (11)   | 0.43 (11)
Order                   | 0.94 (1)       | 1 (RDS)     | 0.89 (1)
Customer                | 1 (RDS)        | 0.99 (1)    | 0.83 (2)
Shipper                 | 0.88 (2)       | 0.98 (2)    | 1 (RDS)
OrderDetails            | 0.88 (4)       | 0.97 (3)    | 0.82 (3)
Product                 | 0.77 (6)       | 0.91 (6)    | 0.73 (6)
Supplier                | 0.65 (8)       | 0.82 (8)    | 0.62 (8)
Categories              | 0.65 (9)       | 0.81 (9)    | 0.61 (9)
CustDemographics        | Null           | Null        | Null
[Figure: Affinity Ranking Correctness (Averages), a score out of 100 based on Σi d(rRiAf, rRih); bar chart on a 0 to 100 axis over the Northwind relations (Customers, Employees, Suppliers, Shippers, Orders, Products) and the TPC-H relations (Customer, Supplier, Parts, Orders, Nation, Region)]