Efficient Keyword Search across Heterogeneous Relational Databases

Mayssam Sayyadian, AnHai Doan (University of Wisconsin - Madison)
Hieu LeKhac (University of Illinois - Urbana)
Luis Gravano (Columbia University)

Key Message of Paper

Precise data integration is expensive. But we can do IR-style data integration very cheaply, with no manual cost:
– just apply automatic schema/data matching
– then do keyword search across the databases
– no need to verify anything manually
Already very useful. It builds upon keyword search over a single database ...

Keyword Search over a Single Relational Database

A growing field with numerous current works:
– DBXplorer [ICDE02], BANKS [ICDE02]
– DISCOVER [VLDB02]
– Efficient IR-style keyword search in databases [VLDB03], VLDB-05, SIGMOD-06, etc.
Many related works over XML / other types of data:
– XKeyword [ICDE03], XRank [SIGMOD03]
– TeXQuery [WWW04]
– ObjectRank [SIGMOD06]
– TopX [VLDB05], etc.
More are coming at SIGMOD-07 ...

A Typical Scenario

Customers:
tid | custid | name  | contact       | addr
t1  | c124   | Cisco | Michael Jones | …
t2  | c533   | IBM   | David Long    | …
t3  | c333   | MSR   | David Ross    | …

Complaints:
tid | id   | emp-name      | comments
u1  | c124 | Michael Smith | Repair didn’t work
u2  | c124 | John          | Deferred work to John Smith

A foreign-key join connects Customers.custid to Complaints.id.

Q = [Michael Smith Cisco] returns a ranked list of answers:
– t1 ⋈ u1 (Cisco … Michael Smith, “Repair didn’t work”), score = 0.8
– t1 ⋈ u2 (Cisco … “Deferred work to John Smith”), score = 0.7

Our Proposal: Keyword Search across Multiple Databases

Service-DB:
– Customers (tid, custid, name, contact, addr): t1 | c124 | Cisco | Michael Jones | …, t2 | c533 | IBM | David Long | …, t3 | c333 | MSR | Joan Brown | …
– Complaints (tid, id, emp-name, comments): u1 | c124 | Michael Smith | Repair didn’t work, u2 | c124 | John | Deferred work to John Smith

HR-DB:
– Employees (tid, empid, name): v1 | e23 | Mike D. Smith, v2 | e14 | John Brown, v3 | e37 | Jack Lucas
– Groups (tid, eid, reports-to): x1 | e23 | e37, x2 | e14 | e37

Query: [Cisco Jack Lucas]
Answer: t1 ⋈ u1 ⋈ v1 ⋈ x1 ⋈ v3, a join that crosses databases: “Michael Smith” in Service-DB matches “Mike D. Smith” in HR-DB, whose group reports to Jack Lucas. This is IR-style data integration.

A Naïve Solution

1. Manually identify FK joins across DBs
2. Manually identify matching data instances across DBs
3. Treat the combination of DBs as a single DB and apply current keyword search techniques
Just like in traditional data integration, this is too much manual work.

Kite Solution

Automatically find FK joins and matching data instances across databases (as in the example above): no manual work is required from the user.

Automatically Find FK Joins across Databases

Current solutions analyze data values (e.g., Bellman), with limited accuracy:
– e.g., a “waterfront” attribute with values yes/no and an “electricity” attribute with values yes/no look joinable
Our solution: data analysis + schema matching
– improves accuracy drastically (by as much as 50% F-1)
Automatic join/data matching can be wrong → incorporate confidence scores into answer scores
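To make the idea concrete, here is a minimal sketch of join discovery that combines the two kinds of evidence. It is not Kite's actual matcher: the token-level overlap measure, the 50/50 weighting, and the 0.4 threshold are all illustrative assumptions.

```python
# A minimal sketch of cross-database FK-join discovery combining
# data analysis (value overlap) with schema matching (attribute-name
# similarity). Illustrative only; not Kite's actual matcher.
from difflib import SequenceMatcher

def value_overlap(col_a, col_b):
    """Data analysis: Jaccard overlap of the word tokens in two columns."""
    tok_a = {t for v in col_a for t in str(v).lower().split()}
    tok_b = {t for v in col_b for t in str(v).lower().split()}
    return len(tok_a & tok_b) / len(tok_a | tok_b) if tok_a | tok_b else 0.0

def name_similarity(name_a, name_b):
    """Schema matching: string similarity of the attribute names."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

def candidate_fk_joins(table_a, table_b, threshold=0.4):
    """Score every attribute pair across the two tables; keep pairs above
    the threshold. Pure value analysis (Bellman-style) would also link
    coincidentally overlapping columns, e.g. two unrelated yes/no columns;
    the schema-matching term suppresses those."""
    joins = []
    for na, col_a in table_a.items():
        for nb, col_b in table_b.items():
            score = 0.5 * value_overlap(col_a, col_b) + \
                    0.5 * name_similarity(na, nb)
            if score >= threshold:
                joins.append((na, nb, round(score, 2)))
    return sorted(joins, key=lambda j: -j[2])

# Complaints (Service-DB) vs. Employees (HR-DB), from the running example:
complaints = {"id": ["c124", "c124"],
              "emp-name": ["Michael Smith", "John"]}
employees  = {"empid": ["e23", "e14", "e37"],
              "name": ["Mike D. Smith", "John Brown", "Jack Lucas"]}
print(candidate_fk_joins(complaints, employees))
# -> [('emp-name', 'name', 0.46)]: the approximate join Kite must discover
```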
Incorporate Confidence Scores into Answer Scores

Recall the answer example in the single-DB setting:
– t1 ⋈ u1, score = 0.8
Now recall the answer example in the multiple-DB setting:
– t1 ⋈ u1 ⋈ v1 ⋈ x1 ⋈ v3, which involves a data match with score 0.7 (“Michael Smith” ≈ “Mike D. Smith”) and an automatically discovered FK join with score 0.9
These confidence scores enter the answer score:

score(A, Q) = [α · score_kw(A, Q) + β · score_join(A, Q) + γ · score_data(A, Q)] / size(A)

Summary of Trade-Offs

– Precise data integration, then SQL queries: the holy grail, but expensive
– IR-style data integration the naïve way: manually identify FK joins and matching data; still too expensive
– IR-style data integration using Kite: automatic FK join finding / data matching; cheap, but only approximates the “ideal” ranked list found by the naïve way

Kite Architecture

Offline preprocessing over databases D1 … Dn:
– Index Builder → IR index1 … IR indexn
– Foreign-Key Join Finder (Data-based Join Finder + Schema Matcher) → foreign key joins
– Data instance matcher
Online querying, e.g., for Q = [Smith Cisco]:
– Condensed CN Generator
– Top-k Searcher, guided by refinement rules (Partial, Deep, Full), issuing distributed SQL queries over D1 … Dn

Online Querying

What current solutions do:
1. Create answer templates
2. Materialize answer templates to obtain answers

Create Answer Templates

Find the tuples that contain query keywords, using each DB’s IR index. For Q = [Smith Cisco], the tuple sets are:
– Service-DB: Complaints^Q = {u1, u2}, Customers^Q = {t1}
– HR-DB: Employees^Q = {v1}, Groups^Q = {}
Then create the tuple-set graph from the schema graph.
Schema graph: Customers –J1– Complaints –J4– Employees (Emps), plus two joins between Emps and Groups (J2 via eid, J3 via reports-to).
Tuple-set graph: each relation R is split into R^Q (the tuples of R that contain query keywords) and R^{} (the free tuple set, i.e., all tuples), connected by the same joins J1 … J4.
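A minimal sketch of this step, assuming each database exposes a simple keyword-to-tuple-id index (the index layout and names here are hypothetical; Kite's real IR indexes and ranking are richer):

```python
# A minimal sketch of tuple-set creation. Each per-database IR index is
# modeled as a plain dict mapping keyword -> set of tuple ids.
def tuple_sets(databases, keywords):
    """For each relation R, compute R^Q: the tuples of R that contain
    at least one query keyword."""
    result = {}
    for db_name, relations in databases.items():
        for rel_name, ir_index in relations.items():
            hits = set()
            for kw in keywords:
                hits |= ir_index.get(kw.lower(), set())
            result[(db_name, rel_name)] = hits
    return result

# Hypothetical indexes for the running example, Q = [Smith Cisco]:
service_db = {"Complaints": {"smith": {"u1", "u2"}},  # u2 matches in comments
              "Customers":  {"cisco": {"t1"}}}
hr_db      = {"Employees":  {"smith": {"v1"}},
              "Groups":     {}}
print(tuple_sets({"Service-DB": service_db, "HR-DB": hr_db},
                 ["Smith", "Cisco"]))
# -> Complaints^Q = {u1, u2}, Customers^Q = {t1},
#    Employees^Q = {v1}, Groups^Q = {} (empty)
```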
Create Answer Templates (cont.)

Search the tuple-set graph to generate answer templates, also called Candidate Networks (CNs). Each answer template is one way to join tuples to form an answer. Sample CNs include:
– CN1: Customers^Q –J1– Complaints^Q
– CN3: Emps^Q –J2– Groups^{} –J3– Emps^{} –J4– Complaints^Q
– CN4: Emps^Q –J3– Groups^{} –J2– Emps^{} –J4– Complaints^Q (the same tuple sets as CN3, connected through different joins)

Materialize Answer Templates to Generate Answers

Materialize by generating and executing SQL queries. For CN1: Customers^Q –J1– Complaints^Q, with Customers^Q = {t1} and Complaints^Q = {u1, u2}:

  SELECT *
  FROM Customers C, Complaints P
  WHERE C.custid = P.id
    AND C.tid = t1
    AND (P.tid = u1 OR P.tid = u2)

Naïve solution: materialize all answer templates, score, rank, then return the answers.
Current solutions:
– find only the top-k answers
– materialize only certain answer templates
– make these decisions using refinement rules + statistics

Challenges for the Kite Setting

– More databases → way too many answer templates to generate: generation alone can take hours on just 3-4 databases
– Materializing an answer template takes way too long: it requires SQL query execution across multiple databases, and each database invocation incurs a large overhead
– Reliable statistics are difficult to obtain across databases
See the paper (or the backup slides) for our solutions.

Empirical Evaluation

Domains:

Domain    | # DBs | Avg # tables per DB | Avg # attributes per table | Avg # tuples per table | Avg # approximate FK joins per DB pair | Total size
DBLP      | 2     | 3                   | 3                          | 500K                   | 11                                     | 400M
Inventory | 8     | 5.8                 | 5.4                        | 2K                     | 33.6                                   | 50M

Sample Inventory schema: AUTHOR, ARTIST, BOOK, CD, WH2BOOK, WH2CD, WAREHOUSE
The DBLP schemas:
– DBLP 1: AR (aid, biblo), CITE (id1, id2), PU (aid, uid)
– DBLP 2: AR (id, title), AU (id, name), CNF (id, name)

Runtime Performance (1)

[Charts: runtime (sec) vs. maximum CCN size, for DBLP (2-keyword queries, k = 10, 2 databases) and Inventory (2-keyword queries, k = 10, 5 databases); and runtime (sec) vs. # of databases, for Inventory (maximum CCN size = 4, 2-keyword queries, k = 10). CCN = condensed candidate network; see the backup slides. The curves compare the Hybrid algorithm adapted to run over multiple databases, Kite without adaptive rule selection and without rule Deep, Kite without condensed CNs, Kite without rule Deep, and the full-fledged Kite algorithm.]

Runtime Performance (2)

[Charts: runtime (sec) vs. # of keywords |q| in the query, for DBLP (max CCN size = 6, k = 10, 2 databases) and Inventory (max CCN size = 4, k = 10, 5 databases); and runtime (sec) vs. # of answers requested k, for Inventory (2-keyword queries, max CCN size = 4, 5 databases).]

Query Result Quality

[Charts: Pr@k for k = 1 … 20, for OR-semantics and for AND-semantics queries.]
Pr@k = the fraction of the returned answers that appear in the “ideal” list.
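As a concrete reading of the metric, a minimal sketch with hypothetical answer lists (the ideal list being the one the naïve approach would produce):

```python
def pr_at_k(returned, ideal, k):
    """Pr@k: the fraction of the top-k returned answers that also
    appear in the 'ideal' ranked list."""
    ideal_set = set(ideal)
    return sum(1 for a in returned[:k] if a in ideal_set) / k

# Hypothetical answer ids: Kite's ranked list vs. the ideal list.
kite_list  = ["a1", "a3", "a2", "a7", "a5"]
ideal_list = ["a1", "a2", "a3", "a4", "a5"]
print(pr_at_k(kite_list, ideal_list, 5))  # 0.8
```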
Summary

Kite executes IR-style data integration:
– it performs some automatic preprocessing
– then immediately allows keyword querying
Relatively painless:
– no manual work!
– no need to create a global schema, nor to understand SQL
Can be very useful in many settings, e.g., on-the-fly, best-effort querying for non-technical people:
– enterprises, the Web: often only a few answers are needed
– emergencies (e.g., hospital + police): answers are needed quickly

Future Directions

– Incorporate user feedback → interactive IR-style data integration
– More efficient query processing: large numbers of databases, network latency
– Extend to other types of data: XML, ontologies, extracted data, Web data

IR-style data integration is feasible and useful: it extends current work on keyword search over databases and raises many opportunities for future work.

BACKUP

Condensing Candidate Networks

In multi-database settings there is an unmanageable number of CNs:
– many CNs share the same tuple sets and differ only in the associated joins
– so group such CNs into condensed candidate networks (CCNs)
For example, CN3 and CN4 above join the same tuple sets and differ only in which of J2 and J3 connects Emps to Groups. They condense into a single CCN:
Emps^Q –{J2, J3}– Groups^{} –{J2, J3}– Emps^{} –J4– Complaints^Q
The tuple-set graph is condensed in the same way, by merging parallel join edges such as J2 and J3 into one edge labeled {J2, J3}.

Top-k Search

Main ideas for top-k keyword search:
– there is no need to materialize all CNs
– sometimes even partially materializing a CN is enough
– estimate score intervals for CNs, then run a branch-and-bound search
Example with k = 2, over CNs P, Q, R:
– iteration 1: score intervals P [0.6, 1], Q [0.5, 0.7], R [0.4, 0.9]
– iteration 2: refining P yields P1 [0.6, 0.8], P2 = 0.9, P3 = 0.7, so K = {P2, P3} with min score 0.7; Q’s upper bound of 0.7 means it can no longer improve the result
– iteration 3: refining R yields R1 [0.4, 0.6], R2 = 0.85, so Res = {P2, R2} with min score 0.85
Kite approach: materialize CNs using refinement rules.
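A minimal sketch of the interval-based branch-and-bound idea above, assuming each CN hides a fixed list of scored answers behind its interval; in Kite, "refining" a CN instead applies refinement rules that materialize parts of it via SQL:

```python
import heapq

# Branch-and-bound top-k search over CNs with score upper bounds.
# Illustrative only: answers are pre-baked rather than materialized.
def top_k(cns, k):
    """cns: list of (name, upper_bound, answers), where answers are
    (score, answer_id) pairs hidden behind the CN's interval."""
    # Max-heap on upper bounds: always refine the most promising CN.
    frontier = [(-ub, name, answers) for name, ub, answers in cns]
    heapq.heapify(frontier)
    results = []  # min-heap of (score, answer_id), size <= k
    while frontier:
        neg_ub, name, answers = heapq.heappop(frontier)
        # Prune: if this CN's upper bound cannot beat the current k-th
        # best score, no remaining CN can contribute a top-k answer.
        if len(results) == k and -neg_ub <= results[0][0]:
            break
        for score, aid in answers:  # "materialize" the CN's answers
            if len(results) < k:
                heapq.heappush(results, (score, aid))
            elif score > results[0][0]:
                heapq.heapreplace(results, (score, aid))
    return sorted(results, reverse=True)

# The slide's example: P and R get refined; Q is pruned by its bound.
cns = [("P", 1.0, [(0.9, "P2"), (0.7, "P3")]),
       ("R", 0.9, [(0.85, "R2"), (0.5, "R1")]),
       ("Q", 0.7, [(0.6, "Q1")])]
print(top_k(cns, 2))  # [(0.9, 'P2'), (0.85, 'R2')]
```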
Top-k Search Using Refinement Rules

– In the single-database setting, rules are selected based on database statistics
– In the multi-database setting, statistics are inaccurate
– Inaccurate statistics lead to inappropriate rule selection

Refinement Rules

– Full: exhaustively extract all answers from a CN (fully materialize it) → too much data to move around the network (data-transfer cost)
– Partial: try to extract only the most promising answer from a CN → invokes remote databases for just one answer (high cost of database invocation)
– Deep: a middle-ground approach; once a table in a remote database is invoked, extract all answers involving that table → takes database-invocation cost into account

[Figure: tuple sets T^Q (t1 = 0.9, t2 = 0.7, t3 = 0.4, t4 = 0.3) and U^Q (u1 = 0.8, u2 = 0.6, u3 = 0.5, u4 = 0.1); Full joins all tuple pairs, Partial extracts only the single most promising pair, and Deep extracts all answers involving the invoked tuple t1.]

Adaptive Search

Question: which refinement rule should be applied next?
– In the single-database setting: decide based on database statistics
– In the multi-database setting: statistics are inaccurate
Kite approach: adaptively select rules (sketched at the end of this document) by
goodness-score(rule, cn) = benefit(rule, cn) - cost(rule, cn)
– cost(rule, cn): the optimizer’s estimated cost for the rule’s SQL statements
– benefit(rule, cn): reduced if the rule has been applied for a while without making any progress

Other Experiments

– Join discovery accuracy: [Chart: accuracy (F1) on Inventory 1 … Inventory 5, comparing Join Discovery alone vs. Join Discovery + Schema Matching.] Schema matching improves the join discovery algorithm drastically.
– Kite over a single database: [Chart: runtime (sec) vs. max CCN size (1 … 8), comparing mHybrid with Kite over a single database.] Kite also improves the single-database keyword search algorithm.
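Returning to adaptive search: a minimal sketch of the selection loop. The numbers and the decay scheme are illustrative assumptions; in Kite, cost comes from the optimizer's SQL cost estimates and benefit tracks actual search progress.

```python
# Illustrative sketch of Kite-style adaptive rule selection.
class AdaptiveRuleSelector:
    RULES = ("Partial", "Deep", "Full")

    def __init__(self):
        self.benefit = {r: 1.0 for r in self.RULES}   # estimated benefit
        self.cost = {"Partial": 0.8, "Deep": 0.5, "Full": 0.9}  # est. SQL cost

    def next_rule(self):
        # goodness-score(rule, cn) = benefit(rule, cn) - cost(rule, cn)
        return max(self.RULES, key=lambda r: self.benefit[r] - self.cost[r])

    def report(self, rule, made_progress):
        # Reduce a rule's benefit if it keeps running without progress.
        self.benefit[rule] = 1.0 if made_progress else self.benefit[rule] * 0.5

selector = AdaptiveRuleSelector()
rule = selector.next_rule()               # 'Deep' under these numbers
selector.report(rule, made_progress=False)
print(rule, selector.next_rule())         # falls back to 'Partial'
```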