Combining Keyword Search and Forms for Ad Hoc Querying of Databases

Eric Chu, Akanksha Baid, Xiaoyong Chai, AnHai Doan, Jeffrey Naughton
Computer Sciences Department, University of Wisconsin-Madison
{ericc, baid, xchai, anhai, naughton}@cs.wisc.edu

Contents
• Motivation
• Query Forms
• Generating Forms
• Keyword Search for Forms
• Displaying Returned Forms
• Experimental Analysis
• Related Work and References

Traditional Access Methods for Databases
Relational/XML databases are structured or semi-structured, with rich metadata.
They are typically accessed by structured query languages: SQL
• Advantages: high-quality results
• Disadvantages:
  – Query languages: long learning curves
  – Schemas: complex
Small user population
“The usability of a database is as important as its capability”

Motivation
Information discovery in databases requires:
• Knowledge of the schema
• Knowledge of a query language (example: SQL)
Challenges?
• Hard for users uncomfortable with a formal query language.
What is the solution? Form-based interfaces and keyword search.

Approach
• The user submits a keyword query
• The system returns a ranked list of relevant forms
• The user selects one of the forms and builds a structured query

Relational Schema of DBLife
Entity tables:
  person(id, name, homepage, title, group, organization, country)
  publication(id, name, booktitle, year, pages, cites, clink, link)
  topic(id, name)
  organization(id, name)
  conference(id, name)
Relationship tables:
  related_people(rid, pid1, pid2, strength)
  related_topic(rid, pid, tid, strength)
  related_organization(rid, pid, oid, strength)
  give_tutorial(rid, pid, cid)
  give_conf_talk(rid, pid, cid)
  give_org_talk(rid, pid, oid)
  serve_conf(rid, pid, cid, assignment)
  write_pub(rid, pid, pub_id, position)
  co_author(rid, pid1, pid2, strength)

Query Forms
A query form is an interface for a query template.
Example: a completed form over the person relation of DBLife.
• The query represented is:
  SELECT * FROM person WHERE organization = 'Microsoft Research'
• The general template for the above form is:
  SELECT * FROM person
  WHERE name op value AND homepage op value AND title op value
    AND group op value AND organization op value AND country op value

How to generate forms?

Step 1: Specify a subset of SQL, called SQL’, as the target language to implement the queries supported by forms.
SQL’: Let B = (SELECT select-list FROM from-list WHERE qualification [GROUP BY grouping-list HAVING group-qualification]). A query in SQL’ is either B or two such blocks combined with UNION | INTERSECT.
Note: Nested queries are not allowed in the FROM and WHERE clauses.

Step 2: Determine the set of skeleton templates specifying the main clauses and join conditions, based on the chosen subset of SQL and SD.
Let Ri be a relation following a relation schema Si ∈ SD.
Case 1: Ri does not reference other relations with foreign keys.
  SELECT * FROM Ri WHERE predicate-list
Case 2: Ri references other relations with foreign keys.
  SELECT * FROM <Ri and the relations it references>
  WHERE <join conditions, plus an “attr op value” predicate for each attribute>
Example:
  Relation: give_tutorial(rid, pid, cid)
  Relations referenced: person and conference
    person(id, name, homepage, title, group, organization, country)
    conference(id, name)
  Skeleton template:
    SELECT * FROM give_tutorial t, person p, conference c
    WHERE t.pid = p.id AND t.cid = c.id AND p.name op expr AND … AND c.name op expr

Step 3: Finalize templates by modifying the skeleton templates based on form specificity.
How specific or general do we want the forms to be?
Form specificity has two components:
• Form complexity
• Data specificity
Adjusting form specificity, starting from the initial state of the form:
• Increase its complexity by adding more parameters.
• Decrease its complexity by removing parameters.
• Increase data specificity by binding more existing parameters to constants.
• Decrease data specificity by unbinding parameters with fixed values.
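Steps 2 and 3 above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions, not the authors' generator: the Table class, its params and fks fields, and the helper names skeleton_template and bind are hypothetical, every referenced key is assumed to be called "id", and key columns are left out of the predicate list because the give_tutorial example above elides them.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    params: list                               # attributes shown as "attr op value" fields
    fks: dict = field(default_factory=dict)    # local column -> referenced table name


def skeleton_template(t, schema):
    """Build the skeleton SELECT for table t (Case 1 or Case 2 of Step 2)."""
    if not t.fks:
        # Case 1: t references no other relation.
        preds = [f"{a} op value" for a in t.params]
        return f"SELECT * FROM {t.name} WHERE " + " AND ".join(preds)
    # Case 2: join t with every relation it references.
    from_list, joins = [t.name], []
    preds = [f"{t.name}.{a} op value" for a in t.params]
    for col, ref in t.fks.items():
        from_list.append(ref)
        joins.append(f"{t.name}.{col} = {ref}.id")   # assumes the referenced key is "id"
        preds += [f"{ref}.{a} op value" for a in schema[ref].params]
    return f"SELECT * FROM {', '.join(from_list)} WHERE " + " AND ".join(joins + preds)


def bind(template, attr, op, value):
    """Increase data specificity by fixing one parameter to a constant."""
    return template.replace(f"{attr} op value", f"{attr} {op} '{value}'", 1)


# A fragment of the DBLife schema from the slides (non-key attributes only).
person = Table("person", ["name", "homepage", "title", "group", "organization", "country"])
conference = Table("conference", ["name"])
give_tutorial = Table("give_tutorial", [], {"pid": "person", "cid": "conference"})
schema = {t.name: t for t in (person, conference, give_tutorial)}

print(skeleton_template(person, schema))
print(skeleton_template(give_tutorial, schema))
print(bind(skeleton_template(person, schema), "organization", "=", "Microsoft Research"))
```

The bind helper does plain string substitution, which is enough to show the idea of data specificity but would need proper quoting before being used on real input.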
Approach followed in this paper:

To adjust form complexity
Divide SQL’ into 4 query classes:
• SELECT: basic SELECT-FROM-WHERE construct
• AGGR: SELECT with aggregation
• GROUP: AGGR with GROUP BY and HAVING clauses
• UNION-INTERSECT: a UNION or INTERSECT of two SELECT queries

To adjust data specificity
• Bind the “value” fields of the “attr op value” predicates in the WHERE clause to data values.

Step 4: Map each template to a form.
Standard form components:
• Label
• Drop-down list
• Input box
• Button

Keyword Search for Forms
Basic idea: keyword search is used to find relevant forms, which are then used to pose structured queries.

Basic approaches:
• Naïve AND: returns forms containing all the terms from the keyword query.
• Naïve OR: returns forms containing at least one term from the keyword query.
Drawback? The keyword query must contain schema term(s).

Approaches proposed in this paper:
Check whether data terms from the user query appear in the database. If they do, modify the query with the relevant schema terms.
• Double Index OR (DI OR): evaluation uses OR semantics.
• Double Index AND (DI AND): evaluation uses AND semantics.

Example:
Information need: for which conferences has a researcher named “Widom” served on the program committee?
Keyword query: “Widom Conference”
Here, data term = “Widom” and schema term = “Conference”.
Results obtained:
• Naïve AND: no forms are returned, as “Widom” does not appear on any form.
• Naïve OR: ignores “Widom” and returns all forms that contain “Conference”.
• DI OR: the query is rewritten to “Widom person conference”, as “Widom” appears in the person table, and is evaluated with OR semantics.
• DI AND: two queries are generated, “person conference” and “widom conference”; each is evaluated with AND semantics and the union of the results is returned.
DBLife tables involved:
  person(id, name, homepage, title, group, organization, country)
  conference(id, name)

Double Index OR Implementation
Indexes used:
• DataIndex: takes a data term and returns a set of <tuple-id, table> pairs.
• FormIndex: takes a term and returns a set of form-ids.
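The two indexes and the “Widom Conference” example above can be pictured with a small Python sketch. This is an assumption-laden toy, not the authors' implementation: both indexes are plain dictionaries, the tuple id 17 and the form ids f_person, f_conference and f_serve_conf are made up, and the helper names rewrite and lookup_or are hypothetical.

```python
# Toy DataIndex: data term -> set of (tuple_id, table) pairs. Contents are made up.
DATA_INDEX = {"widom": {(17, "person")}}

# Toy FormIndex: form term -> set of form ids. Form ids are hypothetical.
FORM_INDEX = {
    "person": {"f_person", "f_serve_conf"},
    "conference": {"f_conference", "f_serve_conf"},
}


def rewrite(query):
    """Return the query terms plus the table names of any data terms."""
    form_terms = []
    for q in query.lower().split():
        if q not in form_terms:
            form_terms.append(q)          # q itself could be a form term
        for _tid, table in sorted(DATA_INDEX.get(q, ())):
            if table not in form_terms:
                form_terms.append(table)  # q is a data term found in this table
    return form_terms


def lookup_or(terms):
    """OR semantics: forms that contain at least one of the terms."""
    forms = set()
    for t in terms:
        forms |= FORM_INDEX.get(t, set())
    return forms


# Naive OR probes the FormIndex with the raw terms and so ignores "widom".
naive_or = lookup_or("widom conference".split())
# The Double Index rewrite first turns the query into "widom person conference".
di_or = lookup_or(rewrite("Widom Conference"))
print(rewrite("Widom Conference"), sorted(naive_or), sorted(di_or))
```

In this toy setup the rewritten query also reaches the form over the person table, which the naive OR lookup misses because “widom” is not a term on any form.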
Input: a keyword query. Output: a set of form-ids.
Step 1:
• Probe DataIndex with each query term qi in a query Q.
• If qi is a data term, DataIndex returns a set of <tuple-id, table> pairs.
• Add each table to the set FormTerms.
• Add qi to FormTerms.
Step 2:
• Probe FormIndex with the terms in FormTerms.
• Return the forms containing at least one of these terms.

DI OR
Input: A keyword query Q = [q1 q2 ... qn]
Output: A set of form-ids F’
Algorithm:
  FormTerms = {}, F’ = {}
  // Replace any data terms with table names
  for each qi ∈ Q
    if DataIndex(qi) returns <tuple-id, table> pairs
      Add each table to FormTerms
    Add qi to FormTerms            // qi could be a form term
  // Get form-ids based on FormTerms
  FormIndex(FormTerms) => F’       // OR semantics
  return F’

Double Index AND
• Generate all possible queries that result from replacing user-supplied data terms with schema terms.
• Evaluate each query with AND semantics and return the union of the query results.
Problem?
• Performing a single AND query with all the terms in FormTerms is wrong.
Why is this so?
• A data term may appear in multiple unrelated tables, such that no form contains all of these tables.

Concept of a bucket
• The query “q1 AND q2” is evaluated as “a ∈ Sq1 AND b ∈ Sq2”, where Sqi is a “bucket” containing the form terms associated with qi, and a and b are form terms from Sq1 and Sq2 respectively.

Double Index AND Implementation
Input: a keyword query. Output: a set of form-ids.
Step 1:
• For each qi, the bucket Sqi is initially empty.
• If qi is a data term, DataIndex returns <tuple-id, table> pairs.
• For each table, add the table to Sqi and FormTerms.
• Add qi to Sqi and FormTerms.
Step 2:
• Generate all distinct queries that take one term from each Sqi, and add them to SQ’.
• For each query in SQ’, probe FormIndex and retrieve the forms that contain all terms in the query.

DI AND
Input: A keyword query Q = [q1 q2 ... qn]
Output: A set of form-ids F’
Algorithm:
  FormTerms = {}, F’ = {}
  // Replace any data terms with table names
  for each qi ∈ Q
    Sqi = {}                       // Bucket for qi
    if DataIndex(qi) returns <tuple-id, table> pairs
      for each table
        if table ∉ FormTerms
          Add table to Sqi and FormTerms
    if qi ∉ FormTerms
      Add qi to Sqi and FormTerms
  // Get form-ids based on Sqi
  SQ’ = EnumQueries(∀ Sqi)         // Enumerate all unique queries, each having one term from each Sqi
  for each Q’ ∈ SQ’
    FormIndex(Q’) => F’            // AND semantics on FormIndex
  return F’

Example:
• A user wants to search for a person “John Doe”.
• “John Doe” is present in the person table but is not involved in any relationship.
What will be the output? Forms over the person table, plus forms over tables that reference person.
User action: the user enters “John Doe” in the name field of a form that is a join of, say, the person and conference tables.
Output? No results are returned. Such forms are DEAD FORMS.

Double Index Join
• Checks whether a form will return an answer if instantiated with the data terms in the user query.
How is the check performed?
Step 1:
• Given a keyword query Q, probe DataIndex with each query term qi.
• When qi is a data term that leads to a set of <tuple-id, table> pairs, look up each table T in a schema graph for SD and find the reference tables that reference T.
• For each reference table, check whether it contains any tuple-id of T.
• If not, retrieve the forms that contain both T and refTable, and record these “dead” forms in a set X.
Step 2:
• Return F’ - X. This filters out the dead forms.

DI Join
Input: A keyword query Q = [q1 q2 ... qn]
Output: A set of form-ids F’
Algorithm:
  FormTerms = {}, F’ = {}, X = {}
  for each qi ∈ Q
    Sqi = {}
    if DataIndex(qi) returns <tuple-id, table> pairs
      for each table T
        let I be the set of tuple-ids from T
        if T ∉ FormTerms
          Add T to Sqi and FormTerms
        SchemaGraph(T) returns refTables
        for each refTable
          if DataIndex(refTable:tid) is NULL for every tid ∈ I
            FormIndex(T AND refTable) => X
    if qi ∉ FormTerms
      Add qi to Sqi and FormTerms
  // Get form-ids based on form terms
  SQ’ = EnumQueries(∀ Sqi)
  for each Q’ ∈ SQ’
    FormIndex(Q’) => F’
  return F’ - X

Displaying Returned Forms
How are the returned forms ranked?
• Based on the scoring function of the Lucene index.
• The Lucene score for a query Q and a document D is:
  score(Q, D) = coord(Q, D) * queryNorm(Q) * Σ_{t in Q} ( tf(t in D) * idf(t)^2 * t.getBoost() * norm(t, D) )
Problem? “Sister forms”
Illustration: for the user query “Widom”, the result list makes it impossible to find what the user is looking for.
What is the solution? Grouping forms.
Approach 1:
• Group consecutive sister forms with the same score into first-level groups.
• Group forms by the four query classes.
• Display the classes in the order SELECT, AGGR, GROUP, UNION-INTERSECT.
Problem with the result of the “Widom” query: non-consecutive sister forms end up in different first-level groups that have the same description.
Solution? Approach 2:
• First group the returned forms by their table.
• Order the groups by the sum of their scores.
• Advantage: no repetition.

Experimental Analysis
Experimental setup:
• Data set: DBLife
• Generated a set of forms F1
• 14 skeleton templates, one for each of the 5 entity tables and 9 relationship tables
• For each skeleton template, created 14 templates (1 SELECT, 5 AGGR, 6 GROUP, 2 UNION-INTERSECT), so F1 had 14 × 14 = 196 forms.
• A user study was conducted with 7 graduate students, who found answers for 6 information needs.
Topics analyzed:
• Comparing Naïve, Double-Index, and Double-Index-Join
• Ranking and displaying forms
• Which is the best approach? Why? Let’s find out.
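As a recap of the Double Index AND and Double Index Join procedures described earlier, the following Python sketch enumerates one query per combination of bucket terms and then subtracts the dead forms. It is a toy under stated assumptions, not the authors' code: the indexes are dictionaries with made-up contents, the REFERENCED set stands in for the DataIndex(refTable:tid) probe, all form ids are hypothetical, and empty buckets are skipped, which the pseudocode does not spell out.

```python
from itertools import product

# Toy DataIndex: data term -> set of (tuple_id, table) pairs (made-up contents).
DATA_INDEX = {"doe": {(7, "person")}, "widom": {(17, "person")}}
# Toy FormIndex: form term -> ids of forms containing that term (hypothetical ids).
FORM_INDEX = {
    "person": {"f_person", "f_serve_conf"},
    "conference": {"f_conference", "f_serve_conf"},
    "serve_conf": {"f_serve_conf"},
}
# Schema graph: table -> tables that reference it through a foreign key.
SCHEMA_GRAPH = {"person": {"serve_conf"}}
# Stand-in for DataIndex(refTable:tid): (referencing table, referenced tuple id).
REFERENCED = {("serve_conf", 17)}


def form_and(terms):
    """AND semantics: forms that contain every term."""
    sets = [FORM_INDEX.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def di_join(query):
    seen, buckets, dead = set(), [], set()   # seen plays the role of FormTerms
    for q in query.lower().split():
        bucket = []
        hits = DATA_INDEX.get(q, set())
        for table in sorted({t for _tid, t in hits}):
            tids = {tid for tid, t in hits if t == table}
            if table not in seen:
                bucket.append(table)
                seen.add(table)
            # Forms joining table with ref are dead if ref never points at these tuples.
            for ref in sorted(SCHEMA_GRAPH.get(table, ())):
                if not any((ref, tid) in REFERENCED for tid in tids):
                    dead |= form_and([table, ref])
        if q not in seen:
            bucket.append(q)                 # q could be a form term
            seen.add(q)
        if bucket:                           # assumption: ignore empty buckets
            buckets.append(bucket)
    forms = set()
    for candidate in product(*buckets):      # one term from each bucket
        forms |= form_and(candidate)
    return forms - dead


print(sorted(di_join("Doe")))
print(sorted(di_join("Widom Conference")))
```

In this toy data the person with tuple id 7 never appears in serve_conf, so the join form is filtered out for “Doe”, while “Widom Conference” expands to the two AND queries “person conference” and “widom conference” from the earlier example.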
Related Work and References
• Jayapandian [11] proposed automatic form generation for a database based on a sample query workload.
  [11] M. Jayapandian and H. V. Jagadish. Automating the Design and Construction of Query Forms. ICDE 2006.
• Liu [14] proposed automatically distinguishing between schema terms and value terms in a keyword query.
  [14] F. Liu, C. Yu, W. Meng, and A. Chowdhury. Effective Keyword Search in Relational Databases. SIGMOD 2006.
• BANKS [3] proposed supporting the “attribute = value” construct in keyword queries.
  [3] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword Searching and Browsing in Databases using BANKS. ICDE 2002.
• Luo [16] proposed detecting empty-result queries by “remembering” results from previously executed empty-result queries.
  [16] G. Luo. Efficient Detection of Empty-Result Queries. VLDB 2006.

Thank You!