Light-weight Domain-based Form Assistant: Querying Web Databases On The Fly

Authors: Z. Zhang, B. He, K. C.-C. Chang (Univ. of Illinois at Urbana-Champaign)
Published in: Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005
Presented by: Bruce Vincent, CSE-718 Seminar, April 25, 2008

Outline
- Overview
  - Problem Description, Motivating Example
- System Architecture
- Design Approaches
  - Query Modeling and Translation
  - Dynamic Predicate Mapping
- Implementation - Form Assistant Toolkit
- Experiments
- Related Work

Problem Description
- “Deep Web”
  - Estimated to contain 450,000 online databases (2004)
  - Sometimes referred to as “Invisible Web” or “Hidden Web”
  - Much of this is accessible only by query forms instead of static URL links
- Common domains such as: books, cars, airfares

Problem Description
- Often it can be useful to query multiple alternative sources in the same domain
  - Automation of this entails several components
  - One key component is dynamic query translation
  - Software toolkit “Form Assistant” designed to provide potential translations of user queries for alternative sources
    - e.g., a user-entered Amazon form query is automatically translated to a potential Barnes & Noble form query

Problem Description
- Goals of query translator:
  - Source-generality
    - Built-in translation must generally cope with new or “unseen” sources
  - Domain-portability
    - Translator must be easily customizable with domain-specific knowledge, and thus deployable for new domains

Motivating Example
- Source query Qs on source form S (e.g., Amazon)
- Target query form T (e.g., Barnes & Noble)
[form screenshots not reproduced]

Motivating Example
- Query translation takes source query Qs on source form S and target query form T, and produces:
  - Union query Qt*: two target form queries, each with author “Tom Clancy”, combined by union (∪)
  - Filter: σ(title contain “red storm” and price < 35 and age > 12)
[form screenshots not reproduced]

System Architecture
- Inputs: source query Qs and target query form QI, each passed through a Form Extractor
- Form Assistant (FA) components:
  - Attribute Matcher: syntax-based schema matching
  - Predicate Mapper: type-based search-driven mapping
  - Query Rewriter: constraint-based query rewriting
- Domain-specific knowledge: domain-specific thesaurus, domain-specific type handlers
- Output: target query Qt*

Design Approaches
- Query Modeling
  - Vocabulary and Syntax
- Query Translation
- Dynamic Predicate Mapping

Query Modeling
- Vocabulary
  - Predicate templates: { P1, P2, P3, P4, P5 }
  - Example: [figure not reproduced: query form with its fields labeled P1, P3, P2, P4, P5]

Query Modeling
- Example Vocabulary (predicate templates)
  - P1 = [author; contain; $au]
  - P2 = [title; contain; $ti]
  - P3 = [subject; contain; $su]
  - P4 = [isbn; contain; $isbn]
  - P5 = [price; between; $s, $e]
- Example Syntax (valid conjunctive forms)
  - F1 = P1
  - F2 = P2
  - F3 = P3
  - F4 = P4
  - F5 = P1 ∧ P5
  - F6 = P2 ∧ P5
  - F7 = P3 ∧ P5
  - F8 = P4 ∧ P5

Query Modeling
- Example Vocabulary Instantiations
  - p1 = [author; contain; Tom Clancy]
  - p2 = [title; contain; red storm]
  - p5^1 = [price; between; 0-25]
  - p5^2 = [price; between; 25-45]
- Corresponding Form Queries:
  - f1 = p1 ∧ p5^1
  - f2 = p1 ∧ p5^2
- Resultant Union Query:
  - Qt = f1 ∪ f2

Query Modeling
- Syntax
  - Valid combination of predicate templates {F1, F2, F3, F4, F5, F6, F7, F8}
  - Example (‘v’ indicates ‘valid’):

        P1        P2       P3         P4      P5
        (author)  (title)  (subject)  (isbn)  (price)
    F1  v
    F2            v
    F3                     v
    F4                                v
    F5  v                                     v
    F6            v                           v
    F7                     v                  v
    F8                                v       v

  [form illustrations for F1 (author: Tom Clancy) and F2 not reproduced]

Query Translation
- Based on semantic closeness of query predicates:
  - Finds minimal subsuming Cmin
- Benefits of this approach:
  - No false positives
  - Minimizes false negatives
  - Has clear semantics, independent of DB content
  - Modular translation

Query Translation
- Example (price ranges read from the diagram):

    s:             0-35
    t1:            0-25
    t2:            25-45
    t3:            45-65
    t1 ∨ t2:       0-45   (Cmin)
    t1 ∨ t2 ∨ t3:  0-65

Query Translation
- Definition:
  - Given source query Qs and target query form T, a query Qt* is a “minimal subsuming translation” w.r.t. T if:
    1. Qt* is a valid query w.r.t. T
    2. Qt* subsumes Qs
       - i.e., for any database instance Di, Qs(Di) ⊆ Qt*(Di)
    3. Qt* is minimal
       - i.e., there is no query Qt such that Qt satisfies (1) and (2) above and Qt* subsumes Qt

Query Translation
- Example:
  - Consider source query Qs in the first example and three target queries Qt1, Qt2, Qt3:
    - Qt1 = (f1: p1 ∧ p5^1) ∪ (f2: p1 ∧ p5^2)
    - Qt2 = f2
    - Qt3 = f3: p1
    - where p1 = [author; contain; Tom Clancy], p5^1 = [price; between; 0-25], p5^2 = [price; between; 25-45]
  - Qt1 and Qt3 subsume Qs while Qt2 does not
    - Qt2 misses price range 0-25
    - Thus it can’t be the best translation
  - Cmin
    - Prune Qt3 because it subsumes Qt1
    - That leaves Qt1 as Cmin

Dynamic Predicate Mapping
- Tasks:
  - Choose operator
  - Fill in values
- Objective:
  - Minimal subsuming between source and target

Dynamic Predicate Mapping
- Example: [figure not reproduced: an input predicate goes through Predicate Mapping, and the output is a union (∪) of target predicates]

System Architecture (reminder)
- Same diagram as the System Architecture slide above

Implementation - Form Assistant Toolkit
- Form Extractor
  - Parses HTML into query predicate templates [attr; op; val]
  - Details discussed in a different paper [3] by the same research group
- Attribute Matcher (1:1)
  - Identifies semantically corresponding attributes between forms
  - Customized with domain thesaurus (indexes synonyms for commonly used concepts)
  - Stems (e.g., “children” -> “child”) and removes stop words (e.g., “the”)
  - Matched by value type and synonym attributes
- Predicate Mapper (discussed in previous slides)
- Query Rewriter
  - Well-studied problem to find minimal subsuming query of given predicate-mapped query (uses approach of [5] by Papakonstantinou et al.)

Experiments
- Datasets
  - 447 Deep Web sources (query forms) in 8 domains
    - 3 “Basic” domains, each with custom thesaurus in FA: Books, Airfares, Automobiles
    - 5 “New” domains (for tests, these don’t have a thesaurus): Car Rentals, Jobs, Hotels, Movies, Music/Records
- Test Approach
  - Run the FA to translate 120 form queries
    - Each translation test corresponds to a random pairing of sources within a domain
  - Count correct mappings in translation suggested by FA
    - Indicates amount of user effort the Form Assistant has saved

Experiments
- Results: Accuracy Distributions
  - X: % correct predicate translations; Y: % tested query forms
  - Forms with all 1:1 mappings had 87% perfect accuracy for Basic dataset, 85% perfect for New dataset (good domain flexibility)
  - Forms having complex mapping: 76%, 70% “near perfect” (Y>80%)
    - FA did not attempt complex (n:m) mappings, such as a full name in source mapping to separate first and last names in target
[charts not reproduced: one for Basic dataset, one for New dataset]

Experiments
- Accuracy ratio: correct results per 1:1 query
  - Raw: includes some forms whose input form extraction step had errors
  - Perfect: manually forces all correct form extractions
  - Avg. accuracy improves for perfectly correct extraction step:
    - For Basic dataset, 90.4% improves to 96.1%
    - For New dataset, 81.1% improves to 86.7%
[chart not reproduced: Basic (3 domains) and New (5 domains)]

Experiments
- Example Error in Form Extraction
  - delta.com form has a link to an alternative reservation page
    - “One-way & multi-city reservations”
    - Wrongly interpreted by Form Extractor as input field label (attribute)

Experiments
- Error Distribution
  - % of errors caused by each component:
    - Form Extraction: 40%
    - Attribute Matching: 18%
    - Predicate Mapping: 42%
  - Fewest errors are due to Attribute Matching
  - Most errors are due to Predicate Mapping
    - Cited reason for PM errors is insufficient domain knowledge
      - Example failure: source subject value “computer science” didn’t properly map to target subject value “programming languages”
    - Improvement could entail better domain-specific ontology and type handlers

Related Work
- From the same research group:
  - Complex Matchings (n:m)
    - Defines “Type Recognizer” used in Form Assistant’s Attribute Matcher, and discusses complex n:m matchings not attempted by Form Assistant:
      - [1] Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach. B. He, K. C.-C. Chang, and J. Han. In Proceedings of the 2004 ACM SIGKDD Conference (KDD 2004) (Full Paper), Seattle, Washington, August 2004.
  - MetaQuerier System
    - Fuller system for both exploring (to find) and integrating (to query) Deep Web databases:
      - [2] Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web. K. C.-C. Chang, B. He, and Z. Zhang. In Proceedings of the Second Conference on Innovative Data Systems Research (CIDR 2005), Asilomar, California, January 2005.

Related Work
- From the same research group:
  - Form Extraction
    - As used by implementation of Form Assistant:
      - [3] Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax. Z. Zhang, B. He, and K. C.-C. Chang. In Proceedings of the 2004 ACM SIGMOD Conference (SIGMOD 2004), Paris, France, June 2004.
  - 2007 thorough analysis of the Deep Web
    - Interesting survey of web databases and query interfaces:
      - [4] Accessing the Deep Web: A Survey. B. He, M. Patel, Z. Zhang, and K. C.-C. Chang. Communications of the ACM (CACM), 50(5):94-101, May 2007.
  - Public Datasets
    - Cached real world query form web pages (used in experiments): http://metaquerier.cs.uiuc.edu/repository/datasets/tel-8
    - Additional Deep Web integration resources: http://metaquerier.cs.uiuc.edu/repository

Related Work
- Query Rewriting
  - As used by implementation of Form Assistant:
    - [5] A Query Translation Scheme for Rapid Implementation of Wrappers. Y. Papakonstantinou, A. Gupta, H. Garcia-Molina, and J. Ullman. In Proceedings of the Fourth International Conference on Deductive and Object-Oriented Databases, Singapore, December 1995.

Thank you!
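
Supplementary sketch for the Query Translation slides: the price-range example (source range 0-35, target ranges t1 = 0-25, t2 = 25-45, t3 = 45-65) can be worked through with a small brute-force search. This is only an illustration under stated assumptions, not the Form Assistant's actual algorithm: it models a single numeric attribute, treats ranges as closed intervals, and the function names (merge, covers, minimal_subsuming) are hypothetical.

```python
from itertools import combinations

def merge(ranges):
    """Merge closed intervals into a sorted list of disjoint intervals."""
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1]:
            # Overlapping or touching: extend the previous interval.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged

def covers(ranges, target):
    """True if the union of `ranges` contains every interval in `target`."""
    merged = merge(ranges)
    return all(
        any(lo <= a and b <= hi for lo, hi in merged)
        for a, b in merge(target)
    )

def minimal_subsuming(source, target_ranges):
    """Return the unions of target ranges that subsume `source` and are minimal.

    A candidate is dropped if it strictly subsumes another valid candidate,
    mirroring the pruning of Qt3 in favor of Qt1 on the slides.
    """
    names = sorted(target_ranges)
    candidates = []
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            region = [target_ranges[n] for n in combo]
            if covers(region, [source]):          # condition (2): subsumes Qs
                candidates.append(combo)
    minimal = []
    for c in candidates:
        rc = [target_ranges[n] for n in c]
        strictly_larger = any(
            covers(rc, [target_ranges[n] for n in d])
            and not covers([target_ranges[n] for n in d], rc)
            for d in candidates if d != c
        )
        if not strictly_larger:                   # condition (3): minimal
            minimal.append(c)
    return minimal

# Price ranges from the Query Translation example slide.
targets = {"t1": (0, 25), "t2": (25, 45), "t3": (45, 65)}
print(minimal_subsuming((0, 35), targets))  # [('t1', 't2')]
```

The exhaustive search over subsets is exponential in the number of target ranges, so it only suits small illustrations like this one.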

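Supplementary sketch for the Attribute Matcher bullets on the Implementation slide: the label normalization described there (stop-word removal, stemming, thesaurus lookup, value-type check) might look roughly as follows. Everything here is an assumption made for illustration: the stop-word list, the crude stemmer, the thesaurus entries, and the names normalize, canonical, concept_key, and same_attribute are hypothetical and are not taken from the Form Assistant toolkit.

```python
import re

# Hypothetical stop-word list; the slide only gives "the" as an example.
STOP_WORDS = {"the", "of", "a", "an", "for", "by"}

# Irregular form taken from the slide's stemming example.
IRREGULAR = {"children": "child"}

# Hypothetical domain thesaurus: each set groups synonyms for one concept.
THESAURUS = [{"author", "writer"}, {"price", "cost"}]

def stem(word):
    """Very crude stemmer: irregular table plus plural 's' stripping."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def normalize(label):
    """Lowercase, tokenize, drop stop words, and stem a form label."""
    tokens = re.findall(r"[a-z0-9]+", label.lower())
    return frozenset(stem(t) for t in tokens if t not in STOP_WORDS)

def canonical(token):
    """Map a token to a representative of its synonym group, if any."""
    for group in THESAURUS:
        if token in group:
            return min(group)
    return token

def concept_key(label):
    """Set of canonical tokens used to compare two labels."""
    return frozenset(canonical(t) for t in normalize(label))

def same_attribute(src, tgt):
    """1:1 match: same value type and same canonical label tokens."""
    (src_label, src_type), (tgt_label, tgt_type) = src, tgt
    return src_type == tgt_type and concept_key(src_label) == concept_key(tgt_label)

print(same_attribute(("Writers", "text"), ("The Author", "text")))  # True
print(same_attribute(("Price", "numeric"), ("Title", "text")))      # False
```

A real matcher would need a proper stemmer and a much larger thesaurus; the point here is only how the listed steps compose.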