Automated Generation of Object Summaries from Relational Databases: A Novel Keyword Searching Paradigm

Georgios Fakas
Department of Computing and Mathematics, Manchester Metropolitan University, Manchester, UK

Related Work: Keyword Search in Relational DBs
- Full-text search (e.g. Oracle 9i Text)
- Kw searching in relational DBs (DISCOVER, BANKS)
- Kw Search: Leverling, Peacock
- Result: e3-o2-c2, e4-o6-c2
[Figure: Northwind data graph. Tuples shown: Region (r1, r2), Territories (t1-t4), EmployeeTerritories (et1-et4), Employees (e1-e4), Customers (c1-c3), Orders (o1-o7), Shippers (s1-s3), Order Details (od1-od6), Products (p1, p2), Suppliers (su1), Categories (ca1).]

Related Work: Web Search Engines
- Kw Search: Peacock
- Result: a ranked set of web pages

A Novel Keyword Searching Paradigm: Object Summaries (OSs)
- Kw Search: Peacock
- Result: a ranked set of OSs

Result 1: 4 tuples out of 27
  Customer: CustomerID = Quick; CompanyName = QUICK-Stop; ContactName = Margaret Peacock; Address = Taucherstraße 10; ...
  Orders: OrderID = 10418; ShipName = QUICK-Stop; ShipAddress = Taucherstraße 10; ...
  Employees: LastName = Peacock; FirstName = Margaret; ...
  Shippers: CompanyName = Speedy Express; ...

Result 2: 4 tuples out of 24
  Employee: EmployeeID = 4; LastName = Peacock; FirstName = Margaret; Title = Sales Representative; TitleOfCourtesy = Mrs.; Address = 4110 Old Redmond Rd.; ...
  EmployeeTerritories: TerritoryDescription = Rockville; ...
  Region: Region = Eastern
  Orders: OrderID = 10418; ShipName = QUICK-Stop; ShipAddress = Taucherstraße 10; OrderDate = 1996-07-15; ...

Result 3: 4 tuples out of 9
  Employee: EmployeeID = 3; LastName = Peacock; FirstName = Janet; Title = Sales Representative; TitleOfCourtesy = Ms.; Address = 722 Moss Bay Blvd.; ...
  EmployeeTerritories: TerritoryDescription = Atlanta; ...
  Region: Region = Southern
  Orders: OrderID = 10273; ShipName = QUICK-Stop; ShipAddress = Taucherstraße 10; OrderDate = 1996-08-05; ...

Problems-Challenges: How can we automatically (1) generate and (2) rank OSs, liberating users from knowledge of (1) the schema and (2) the query language?
[Figure: OS tree shown next to the results: Employees (e3); Employees (Reports To) (e2); Territories, Region (et1, t4, r2); Orders (o2); Customers (c2); Shippers (s3), followed by a "?"; Order Details (od1); Products (p2).]

OS Generation - Methodology
KW-ID = "Janet Leverling"
- tDS: a central tuple containing the Kw; tuples around tDS contain additional information about the Data Subject.
- RDS: the corresponding central Relation; similarly, Relations around RDS contain additional information.
[Figure: the data graph from the first Related Work slide, together with the Northwind schema relations: Employees, Orders, Shippers, EmployeeTerritories, Customers, Order Details, Territories, Region, CustomerDemographics, Products, Categories, Suppliers, CustomerCustomerDemo.]
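The tDS and RDS definitions above can be illustrated with a minimal sketch. It assumes a toy in-memory database; `TOY_DB`, `find_data_subjects` and the rows themselves are hypothetical and only loosely modelled on the Northwind examples in the slides. A keyword matches a tuple when every keyword token appears somewhere in the tuple's attribute values; this matching rule is an assumption, not something the slides specify.

```python
# Toy in-memory database: relation name -> list of tuples (dicts).
# The rows are illustrative only, loosely modelled on the Northwind examples.
TOY_DB = {
    "Employees": [
        {"id": "e3", "FirstName": "Janet", "LastName": "Leverling"},
        {"id": "e4", "FirstName": "Margaret", "LastName": "Peacock"},
    ],
    "Customers": [
        {"id": "c2", "CompanyName": "QUICK-Stop", "ContactName": "Margaret Peacock"},
    ],
}


def find_data_subjects(db, keyword):
    """Return (R_DS, t_DS id) pairs for every tuple containing all keyword tokens."""
    tokens = keyword.lower().split()
    hits = []
    for relation, tuples in db.items():
        for row in tuples:
            # Concatenate all attribute values so a keyword may span attributes.
            text = " ".join(str(value).lower() for value in row.values())
            if all(token in text for token in tokens):
                hits.append((relation, row["id"]))
    return hits


for r_ds, t_ds in find_data_subjects(TOY_DB, "Peacock"):
    print(f"R_DS = {r_ds}, t_DS = {t_ds}")
```

Each hit would be the starting point of one OS: the relation is the RDS and the tuple is the tDS from which neighbouring tuples are collected.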
OS Generation - Methodology
KW-ID = "Janet Leverling"
OS for "Janet Leverling":
  Employees (e3)
    Employees (Reports To) (e2)
    Territories, Region (et1, t4, r2)
    Orders (o2)
      Customers (c2)
      Shippers (s3)
      Order Details (od1)
        Products (p2)
          Categories (ca1)

GDS for RDS = Employees (cardinalities in parentheses):
  Employees (9)
    Employees (2)
    Employees (9)
    EmployeeTerritories (49)
      Territories (53)
        Region (4)
    Orders (830)
      Customer (91)
        CustomerCustomerDemo (0)
      Shippers (3)
      Order Details (2155)
        Products (49)
          Categories (8)
          Suppliers (29)

OS Generation - Methodology
- Problem: not all Relations in GDS are relevant. How do we decide (1) which relations to select and (2) when to stop traversing?
- Solution: investigate relational semantics (schema connectivity, cardinality, relative cardinality etc.) and quantify the Affinity of Relations.

$Af_{R_{DS}}(R_i)$: Affinity of Relations to RDS in GDS

Distance
- Physical distance (fd) and logical distance (ld), where ld = fd - |M:N|.
- E.g. Orders is closer to Employees than Customer and CustomerDemo are.
[Figure: GDS for Employees arranged by physical distance: Employees at fd = 0; EmployeeTerritories at fd = 1; Territories at fd = 2; Region at fd = 3; deepest level fd = 4.]

Hubs
- Hubs create spurious shortcuts that carry rather irrelevant or lateral information.
- Pattern: RDS ... R1 (N:1) Rhub (1:M) R2, measured by RC(R1, R2).
[Figure: TPC-H examples with cardinalities: Supplier (10,000), Nation (25), Region (5), Partsupp (800,000), Customer (150,000), Orders (1,500,000), Lineitem (6,001,215), Part (200,00*); edge labels include 1 (400), 1 (5), 6000 (1), 10 (1), 80 (1), 4 (1), 1 (4), 7.5 (1).]

Connectivity
- Schema Connectivity ($Co_i$).
- Data-graph Connectivity: Relative Cardinality ($RC_{i \to j}$), i.e. the average number of tuples of $R_i$ that are connected with each tuple from $R_j$:
  - for 1:M, $RC_{i \to j} = |R_i| / |R_j|$
  - for M:1, $RC_{i \to j} = 1$
- Reverse Relative Cardinality ($RRC_{i \to j}$) is the reverse of $RC_{i \to j}$, i.e. $RRC_{i \to j} = RC_{j \to i}$.
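The distance and relative-cardinality definitions above can be sketched as follows. This is a minimal illustration, not the author's implementation: the `G_DS` dictionary is a hypothetical encoding of the Northwind GDS rooted at Employees, the set of M:N junction relations is an assumption read off the schema, and the cardinalities are the ones shown on the slides. Treating a junction relation itself as having ld = fd - 1 is a side effect of this sketch; the slides give ld only for the relations beyond a junction.

```python
from collections import deque

# Hypothetical encoding of the Northwind G_DS rooted at Employees: parent -> children.
G_DS = {
    "Employees": ["EmployeeTerritories", "Orders"],
    "EmployeeTerritories": ["Territories"],
    "Territories": ["Region"],
    "Orders": ["Customers", "Shippers", "Order Details"],
    "Customers": ["CustomerCustomerDemo"],
    "CustomerCustomerDemo": ["CustomerDemographics"],
    "Order Details": ["Products"],
    "Products": ["Categories", "Suppliers"],
}
# Relations assumed to implement M:N relationships (junction tables).
M_N_JUNCTIONS = {"EmployeeTerritories", "CustomerCustomerDemo"}


def distances(g_ds, junctions, r_ds):
    """Breadth-first traversal from r_ds.

    Returns {relation: (fd, ld)} where fd is the physical distance and
    ld = fd - (number of M:N junction relations on the path from r_ds).
    """
    result = {r_ds: (0, 0)}
    queue = deque([(r_ds, 0, 0)])
    while queue:
        relation, fd, mn_count = queue.popleft()
        for child in g_ds.get(relation, []):
            if child in result:
                continue
            child_mn = mn_count + (1 if child in junctions else 0)
            result[child] = (fd + 1, fd + 1 - child_mn)
            queue.append((child, fd + 1, child_mn))
    return result


def relative_cardinality(size_i, size_j, r_i_is_many_side):
    """RC(i -> j): average number of R_i tuples connected with each R_j tuple.

    r_i_is_many_side is True for a 1:M relationship from R_j to R_i.
    """
    return size_i / size_j if r_i_is_many_side else 1.0


d = distances(G_DS, M_N_JUNCTIONS, "Employees")
print("Territories (fd, ld):", d["Territories"])
print("RC(Orders -> Employees):", round(relative_cardinality(830, 9, True), 1))
# RRC of Shippers is RC in the reverse direction: Orders tuples per Shipper tuple.
print("RRC(Shippers):", round(relative_cardinality(830, 3, True), 1))
```

With the slide cardinalities (Employees 9, Orders 830, Shippers 3), the printed relative cardinalities can be compared against the edge labels of the GDS figure.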
[Figure: GDS for Employees, each relation annotated with its cardinality and each edge labelled RC (RRC), e.g. Orders (830): 92.2 (1); Customer (91): 1 (9.1); Shippers (3): 1 (276.6); Order Details (2155): 2.5 (1).]

$Af_{R_{DS}}(R_i)$: Affinity of Relations to RDS in GDS
- $DAf(R_i) = \{(m_1, w_1), (m_2, w_2), \ldots, (m_n, w_n)\}$
- $m_1 = f_1(ld_i)$, $m_2 = f_1(\log(10 \cdot RC_i))$, $m_3 = f_1(\log(10 \cdot RRC_i))$, $m_4 = f_1(\log(10 \cdot Co_i))$
- $f_1(\alpha) = (11 - \alpha)/10$
- For a hub-child: $m_1 = f_1(ld_i \cdot h_i)$ and $m_2 = f_1(RC_i)$

Formula 1 (Semantic Affinity): The affinity of $R_i$ to RDS, denoted as $Af_{R_{DS}}(R_i)$, with respect to a schema and a database conforming to the schema, can be calculated with the following formula:

$Af_{R_{DS}}(R_i) = \left( \sum_j m_j w_j \right) \cdot Af_{R_{DS}}(R_{Parent})$

where $Af_{R_{DS}}(R_{Parent})$ is the affinity of $R_i$'s Parent to RDS, or is 1 if $R_{Parent} \equiv R_{DS}$. □

$Af_{R_{DS}}(R_i)$: Affinity of Relations to RDS in GDS
[Figure: GDS(θ) for Employees with the computed affinities: Employees 1; Employees (2) 0.98; Employees (9) 0.98; Territories 0.96; Region 0.91; Orders 0.90; Shippers 0.85; Customer 0.85; Order Details 0.84; Products 0.74; Categories 0.63; Suppliers 0.62; CustomerCustomerDemo Null.]

OS Ranking
A ranked set of partial OSs; a complete OS.
[Slide repeats Results 1-3 for Kw Search "Peacock" next to the complete OS tree for Employees (e3).]
OS Ranking - Problems and Challenges
- Existing keyword-search ranking semantics: the smaller the size, the higher the ranking.
- In contrast, in the proposed paradigm an OS containing many well-connected tuples should certainly have greater importance than an OS with fewer tuples. For instance, a Customer or Employee OS involved in many Orders, or an Author who has authored many important papers and books.

OS Ranking - Importance

$Im(OS) = \dfrac{\sum_{t_i} Im(t_i) \cdot Af_{R(t_i)}}{\log(|OS|) + 1}$

where
- $t_i$ is a tuple of the OS,
- $Im(t_i)$ is the Importance of $t_i$ (i.e. PageRank),
- $|OS|$ is the number of tuples in the OS,
- $Af_{R(t_i)}$ is the affinity of the relation R that $t_i$ belongs to.

Experimental Evaluation
- MS Northwind and TPC-H DBs
- Precision, Recall, F-Score
- Compare GDSs and OSs produced by 12 GDS(θ) v GDS(h)
- GDS(h) was proposed by 10 participants
- GDS: average F-score 86.77; OS: average F-score 83
[Figure: two bar charts, "GDS Precision, Recall and F-score (Averages) <0.5, 0.4, 0.05, 0.05>" and "OSs Precision, Recall and F-score (Averages) <0.5, 0.4, 0.05, 0.05>". Series: Precision, Recall, F-Score; y-axis 0-100; x-axis: TPC-H (Region, Nation, Orders, Parts, Supplier, Customer) and Northwind (Products, Orders, Shippers, Suppliers, Employees, Customers).]

Conclusions - Future Work
- Top-k OS results
- Top-k size of an OS
- Challenge: the weights of new tuples are not monotonic (since a tuple's PageRank may increase while its Affinity decreases).
- Alternatives to PageRank as a weighting system are currently being investigated, e.g. ObjectRank.

Conclusions - Novel Contributions
- The formal definition of the novel searching paradigm, which automatically produces a ranked set of OSs for a Data Subject:
  - minimum contribution from the user (i.e. only a Kw);
  - no prior knowledge of the DB schema or query language needed;
  - Precision, Recall and F-score results: average F-score 86.77 for GDSs and 83 for OSs.
- The formal definition and quantification of a Relation's Affinity in the context of GDS, considering both schema design and data distributions.
- A novel ranking paradigm to calculate Im(OS):
  - the quantification of tuples' and OSs' Importance;
  - a combining function that considers the weight (e.g. PageRank) of tuples, their Affinity and the size of the OS.

$Af_{R_{DS}}(R_i)$: Affinity of Relations to RDS in GDS

The first three value columns are for RDS = Employees; the last three give the affinity and, in parentheses, the affinity rank $r_{R_i}$ for RDS = Customer, Order and Shipper.

| R_i | ld_i, RC_i, RRC_i, Co_i | m1..m4 | Af (Employees) | Af (Customer) | Af (Order) | Af (Shipper) |
| Employees | RDS | RDS | 1.00 | 0.88 (3) | 0.97 (4) | 0.82 (4) |
| Employees (ReportsTo) | 1, 1, 0.9, 4 | 1, 1, 1, 0.7 | 0.98 | 0.78 (5) | 0.91 (5) | 0.73 (5) |
| Employees (ReportedBy) | 1, 0.9, 1, 4 | 1, 1, 1, 0.7 | 0.98 | 0.70 (7) | 0.85 (7) | 0.66 (7) |
| Territories | 1, 5.4, 1, 2 | 1, 0.9, 1, 0.9 | 0.96 | 0.55 (10) | 0.66 (10) | 0.51 (10) |
| Region | 2, 1, 13.2, 1 | 0.9, 1, 0.88, 1 | 0.91 | 0.46 (11) | 0.59 (11) | 0.43 (11) |
| Order | 1, 92.2, 1, 4 | 1, 0.8, 1, 0.7 | 0.90 | 0.94 (1) | 1 (RDS) | 0.89 (1) |
| Customer | 2, 1, 9.1, 2 | 0.9, 1, 0.9, 0.9 | 0.85 | 1 (RDS) | 0.99 (1) | 0.83 (2) |
| Shipper | 2, 1, 276.6, 1 | 0.9, 1, 0.75, 1 | 0.85 | 0.88 (2) | 0.98 (2) | 1 (RDS) |
| OrderDetails | 2, 2.5, 1, 2 | 0.9, 0.96, 1, 0.9 | 0.84 | 0.88 (4) | 0.97 (3) | 0.82 (3) |
| Product | 3, 1, 43.9, 4 | 0.8, 1, 0.83, 0.8 | 0.74 | 0.77 (6) | 0.91 (6) | 0.73 (6) |
| Supplier | 4, 1, 1.6, 1 | 0.7, 1, 0.9, 1 | 0.63 | 0.65 (8) | 0.82 (8) | 0.62 (8) |
| Categories | 4, 1, 6.1, 1 | 0.7, 1, 0.92, 1 | 0.62 | 0.65 (9) | 0.81 (9) | 0.61 (9) |
| CustDemographics | 3, null, null, 1 | 0.8, null, null, 1 | Null | Null | Null | Null |

[Figure: "Affinity Ranking Correctness (Averages)" for Northwind and TPC-H; y-axis 0-100; x-axis: Region, Nation, Orders, Parts, Supplier, Customer, Products, Orders, Shippers, Suppliers, Employees, Customers. Chart formula fragment: 100 * d(r_{R_i}^{Af}, r_{R_i}^{h}), over i.]
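The affinity table above and the Im(OS) combining function from the OS Ranking slides can be checked with a short sketch under several stated assumptions. The weight vector <0.5, 0.4, 0.05, 0.05> is taken from the evaluation chart titles and is assumed here to be w1..w4. The tabulated m4 values (e.g. 0.7 for Co = 4) match f1(Co) rather than f1(log(10·Co)), so the sketch reads m1..m4 from the table instead of recomputing m4. For Im(OS), the summation, the "+ 1" in the denominator and the base-10 logarithm are reconstructions or assumptions, and the function names are hypothetical.

```python
import math

# Assumed to be w1..w4; the vector itself appears in the evaluation chart titles.
WEIGHTS = (0.5, 0.4, 0.05, 0.05)


def f1(alpha):
    """Normalisation function from the slides: f1(a) = (11 - a) / 10."""
    return (11 - alpha) / 10


def cardinality_metric(cardinality):
    """m2 or m3 for a relative cardinality: f1(log10(10 * RC))."""
    return f1(math.log10(10 * cardinality))


def affinity(metrics, parent_affinity=1.0, weights=WEIGHTS):
    """Formula 1: weighted sum of the metrics times the parent's affinity."""
    return parent_affinity * sum(m * w for m, w in zip(metrics, weights))


def os_importance(tuples):
    """Im(OS) for a list of (Im(t_i), Af of t_i's relation) pairs.

    Assumes Im(OS) = sum(Im * Af) / (log10(|OS|) + 1).
    """
    if not tuples:
        return 0.0
    total = sum(importance * af for importance, af in tuples)
    return total / (math.log10(len(tuples)) + 1)


# m1..m4 rows copied from the table (R_DS = Employees).
af_order = affinity((1, 0.8, 1, 0.7))                               # parent is R_DS
af_customer = affinity((0.9, 1, 0.9, 0.9), parent_affinity=af_order)
print("Af(Order):", round(af_order, 3))
print("Af(Customer):", round(af_customer, 3))
print("m2 for RC = 92.2:", round(cardinality_metric(92.2), 3))

# Hypothetical PageRank-like scores paired with the affinities above.
small_os = [(0.5, 1.0)]
large_os = [(0.5, 1.0)] + [(0.5, af_order)] * 9
print("Im(small OS):", os_importance(small_os))
print("Im(large OS):", round(os_importance(large_os), 3))
```

The table values are rounded to two decimals, so the recomputed affinities agree with them only up to rounding. Under these assumptions a larger, well-connected OS scores higher than a single-tuple OS, which is the behaviour the ranking slides ask for.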