Light-weight Domain-based Form Assistant:
Querying Web Databases On The Fly
Authors:
Z. Zhang, B. He, K. C.-C. Chang
(Univ. of Illinois at Urbana-Champaign)
Published in:
Proceedings of the 31st VLDB Conference,
Trondheim, Norway, 2005
Presented by:
Bruce Vincent
CSE-718 Seminar
April 25, 2008
Outline
- Overview
  - Problem Description, Motivating Example
  - System Architecture
- Design Approaches
  - Query Modeling and Translation
  - Dynamic Predicate Mapping
- Implementation - Form Assistant Toolkit
- Experiments
- Related Work
Problem Description
- "Deep Web"
  - Estimated to contain 450,000 online databases (2004)
  - Sometimes referred to as the "Invisible Web" or "Hidden Web"
  - Much of this content is accessible only through query forms, not static URL links
- Common domains include books, cars, and airfares
Problem Description
- Often it can be useful to query multiple alternative sources in the same domain
  - Automation of this entails several components
  - One key component is dynamic query translation
- The software toolkit "Form Assistant" is designed to provide potential translations of user queries for alternative sources
  - e.g., a user-entered Amazon form query automatically translated to a potential Barnes & Noble form query
Problem Description
- Goals of the query translator:
  - Source-generality
    - Built-in translation must generally cope with new or "unseen" sources
  - Domain-portability
    - Translator must be easily customizable with domain-specific knowledge, and thus deployable for new domains
Motivating Example
- Source query Qs on source form S (e.g., Amazon)
- Target query form T (e.g., Barnes & Noble)
Motivating Example
- Source query Qs on source form S is translated to target query form T
- Query Translation:
  - Union query Qt*: (author contains "Tom Clancy", price 0-25) ∪ (author contains "Tom Clancy", price 25-45)
  - Filter applied on top: σ title contain "red storm" and price < 35 and age > 12
System Architecture
- Inputs: source query Qs and target query form T, each parsed by a Form Extractor
- Form Assistant (FA) pipeline (a structural sketch follows):
  - Attribute Matcher: syntax-based schema matching (uses a domain-specific thesaurus)
  - Predicate Mapper: type-based search-driven mapping (uses domain-specific type handlers)
  - Query Rewriter: constraint-based query rewriting
- Output: target query Qt*
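The slides contain no code, but the pipeline above can be pictured as three pluggable stages. The sketch below is a structural illustration only; the Predicate dataclass, the type aliases, and the function names are assumptions made for exposition, not the Form Assistant toolkit's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass(frozen=True)
class Predicate:
    attr: str      # e.g. "price"
    op: str        # e.g. "between"
    values: tuple  # e.g. (0, 25)

# The three FA stages modeled as pluggable functions; these signatures are
# assumptions for exposition, not the toolkit's real interfaces.
AttributeMatcher = Callable[[List[Predicate], dict], Dict[str, str]]  # source attr -> target attr
PredicateMapper  = Callable[[Predicate, str], List[Predicate]]        # source pred -> target preds
QueryRewriter    = Callable[[List[List[Predicate]], dict], object]    # -> valid target query Qt*

def form_assistant(source_preds: List[Predicate], target_form: dict,
                   matcher: AttributeMatcher, mapper: PredicateMapper,
                   rewriter: QueryRewriter):
    """Compose the pipeline: match attributes, map each predicate, rewrite."""
    matches = matcher(source_preds, target_form)                  # Attribute Matcher
    mapped = [mapper(p, matches[p.attr]) for p in source_preds]   # Predicate Mapper
    return rewriter(mapped, target_form)                          # Query Rewriter -> Qt*
```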
Design Approaches
- Query Modeling
  - Vocabulary and Syntax
- Query Translation
- Dynamic Predicate Mapping
Query Modeling
- Vocabulary
  - Predicate templates: { P1, P2, P3, P4, P5 }
  - Example: a query form whose input fields are annotated with templates P1-P5 (listed on the next slide)
Query Modeling
- Example Vocabulary (predicate templates)
  - P1 = [author; contain; $au]
  - P2 = [title; contain; $ti]
  - P3 = [subject; contain; $su]
  - P4 = [isbn; contain; $isbn]
  - P5 = [price; between; $s, $e]
- Example Syntax (valid conjunctive forms)
  - F1 = P1
  - F2 = P2
  - F3 = P3
  - F4 = P4
  - F5 = P1 ∧ P5
  - F6 = P2 ∧ P5
  - F7 = P3 ∧ P5
  - F8 = P4 ∧ P5
Query Modeling
- Example Vocabulary Instantiations
  - p1 = [author; contain; Tom Clancy]
  - p2 = [title; contain; red storm]
  - p51 = [price; between; 0-25]
  - p52 = [price; between; 25-45]
- Corresponding Form Queries
  - f1 = p1 ∧ p51
  - f2 = p1 ∧ p52
- Resultant Union Query (a sketch in code follows)
  - Qt = f1 ∪ f2
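As a concrete illustration of this query model, the following Python sketch encodes the slide's instantiated predicates, the conjunctive form queries f1 and f2, and the union query Qt. The Predicate dataclass and the list-based encoding of conjunctions and unions are assumptions made for illustration, not the paper's data structures.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Predicate:
    attr: str      # e.g. "author"
    op: str        # e.g. "contain", "between"
    values: tuple

# Instantiated predicates from the slide
p1  = Predicate("author", "contain", ("Tom Clancy",))
p2  = Predicate("title", "contain", ("red storm",))
p51 = Predicate("price", "between", (0, 25))
p52 = Predicate("price", "between", (25, 45))

# Conjunctive form queries (a list of predicates = a conjunction)
f1 = [p1, p51]
f2 = [p1, p52]

# Union query Qt = f1 ∪ f2 (a list of conjunctions = a union)
Qt = [f1, f2]
print(Qt)
```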
Query Modeling
- Syntax
  - Valid combinations of predicate templates: { F1, F2, F3, F4, F5, F6, F7, F8 }
  - Example ('v' indicates which templates each valid form uses):

    Form | P1 (author) | P2 (title) | P3 (subject) | P4 (isbn) | P5 (price)
    F1   |      v      |            |              |           |
    F2   |             |      v     |              |           |
    F3   |             |            |      v       |           |
    F4   |             |            |              |     v     |
    F5   |      v      |            |              |           |     v
    F6   |             |      v     |              |           |     v
    F7   |             |            |      v       |           |     v
    F8   |             |            |              |     v     |     v

  - (The slide also shows example filled-in forms for F1 and F2, e.g., author = "Tom Clancy")
Query Translation
- Based on semantic closeness of query predicates
  - Finds the minimal subsuming translation Cmin
- Benefits of this approach:
  - No false positives
  - Minimizes false negatives
  - Has clear semantics, independent of DB content
  - Modular translation
Query Translation
- Example (price ranges; a sketch in code follows):
  - Source s: 0-35
  - Target options t1: 0-25, t2: 25-45, t3: 45-65
  - t1 ∨ t2 covers 0-45; t1 ∨ t2 ∨ t3 covers 0-65
  - Cmin = t1 ∨ t2, the smallest target union that still subsumes s
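The sketch below reproduces this example: given the source price range and the ranges selectable on the target form, it greedily picks target ranges until the source range is covered. The greedy sweep is a simplification for illustration (it assumes sorted, contiguous target ranges), not the paper's actual search procedure.

```python
def minimal_subsuming_ranges(source, targets):
    """Greedily pick target ranges (sorted, assumed contiguous) until their
    union covers the source range; the result plays the role of Cmin here."""
    lo, hi = source
    chosen, covered_to = [], lo
    for t_lo, t_hi in sorted(targets):
        if covered_to >= hi:
            break                        # source range already covered
        if t_lo <= covered_to < t_hi:    # this range extends the coverage
            chosen.append((t_lo, t_hi))
            covered_to = t_hi
    return chosen

# Slide example: source s = 0-35; target options t1 = 0-25, t2 = 25-45, t3 = 45-65
print(minimal_subsuming_ranges((0, 35), [(0, 25), (25, 45), (45, 65)]))
# -> [(0, 25), (25, 45)]   i.e. Cmin = t1 ∨ t2
```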
Query Translation
- Definition:
  - Given a source query Qs and a target query form T, a query Qt* is a "minimal subsuming translation" w.r.t. T if:
    1. Qt* is a valid query w.r.t. T
    2. Qt* subsumes Qs
       - i.e., for any database instance Di, Qs(Di) ⊆ Qt*(Di)
    3. Qt* is minimal
       - i.e., there is no other query Qt satisfying (1) and (2) such that Qt* strictly subsumes Qt
Query Translation
- Example: consider the source query Qs from the first example and three candidate target queries (checked in the sketch below):
  - Qt1 = (f1: p1 ∧ p51) ∪ (f2: p1 ∧ p52)
  - Qt2 = f2: p1 ∧ p52
  - Qt3 = f3: p1
  - where p1 = [author; contain; Tom Clancy], p51 = [price; between; 0-25], p52 = [price; between; 25-45]
- Qt1 and Qt3 subsume Qs, while Qt2 does not
  - Qt2 misses the price range 0-25, so it cannot be the best translation Cmin
- Prune Qt3 because it subsumes Qt1
- That leaves Qt1 as Cmin
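The candidate-level reasoning on this slide can be sketched as follows: each candidate is reduced to the price intervals it covers for the fixed author, subsumption of Qs is checked, and the tightest subsuming candidate is kept. This toy encoding (interval lists, span-based minimality) is an assumption for illustration, not the toolkit's representation.

```python
def covers(ranges, lo, hi):
    """True if the union of sorted, contiguous ranges covers [lo, hi]."""
    covered_to = lo
    for r_lo, r_hi in sorted(ranges):
        if r_lo <= covered_to < r_hi:
            covered_to = r_hi
    return covered_to >= hi

def span(ranges):
    """Width of the overall interval spanned by a candidate's ranges."""
    return max(r[1] for r in ranges) - min(r[0] for r in ranges)

# Source query: author "Tom Clancy" with price 0-35 (price is what differs here).
source_price = (0, 35)

# Candidates reduced to the price intervals they cover for that author.
candidates = {
    "Qt1": [(0, 25), (25, 45)],        # f1 ∪ f2
    "Qt2": [(25, 45)],                 # f2 only: misses 0-25
    "Qt3": [(0, float("inf"))],        # f3: author-only, price unrestricted
}

subsuming = {name: r for name, r in candidates.items() if covers(r, *source_price)}
cmin = min(subsuming, key=lambda name: span(subsuming[name]))
print(sorted(subsuming), "Cmin =", cmin)   # ['Qt1', 'Qt3'] Cmin = Qt1
```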
Dynamic Predicate Mapping
- Tasks:
  - Choose the operator
  - Fill in the values
- Objective:
  - Minimal subsumption between source and target predicates
Dynamic Predicate Mapping
- Example (figure): the Predicate Mapper takes a source predicate as input and outputs a union of target predicates that minimally subsumes it (a sketch of simplified type handlers follows)
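To make the two tasks concrete, here is a sketch of two deliberately simplified type handlers: a text handler that chooses the loosest compatible operator, and a numeric handler that fills in target range values so their union covers the source interval. The handler names, the operator list, and the interval encoding of "price < 35" are all illustrative assumptions, not the paper's type-handler design.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Predicate:
    attr: str
    op: str
    values: tuple

def map_text_predicate(src, target_ops):
    """Text handler: keep the keyword, choose the loosest compatible operator
    (prefer 'contain' so the target predicate subsumes the source)."""
    op = "contain" if "contain" in target_ops else target_ops[0]
    return [Predicate(src.attr, op, src.values)]

def map_numeric_predicate(src, target_ranges):
    """Numeric handler: fill in the target range values whose union covers
    the source interval (operator on the target form is 'between')."""
    lo, hi = src.values
    chosen, covered_to = [], lo
    for r_lo, r_hi in sorted(target_ranges):
        if covered_to >= hi:
            break
        if r_lo <= covered_to < r_hi:
            chosen.append(Predicate(src.attr, "between", (r_lo, r_hi)))
            covered_to = r_hi
    return chosen

# Map the example source predicates onto a target form with fixed options:
author_preds = map_text_predicate(Predicate("author", "contain", ("Tom Clancy",)),
                                  target_ops=["contain", "start with"])
price_preds  = map_numeric_predicate(Predicate("price", "between", (0, 35)),   # "price < 35"
                                     target_ranges=[(0, 25), (25, 45), (45, 65)])
print(author_preds)
print(price_preds)   # union of two target price predicates (0-25 and 25-45)
```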
System Architecture (reminder)
- Source query Qs and target query form T are parsed by the Form Extractor, then flow through the Form Assistant (FA) pipeline: Attribute Matcher (syntax-based schema matching, with a domain-specific thesaurus), Predicate Mapper (type-based search-driven mapping, with domain-specific type handlers), and Query Rewriter (constraint-based query rewriting), producing the target query Qt*
Implementation - Form Assistant Toolkit
- Form Extractor
  - Parses HTML into query predicate templates [attr; op; val]
  - Details are discussed in a separate paper [3.] by the same research group
- Attribute Matcher (1:1) (a sketch follows)
  - Identifies semantically corresponding attributes between forms
  - Customized with a domain thesaurus (indexes synonyms for commonly used concepts)
  - Stems words (e.g., "children" -> "child") and removes stop words (e.g., "the")
  - Matches by value type and synonym attributes
- Predicate Mapper (discussed in previous slides)
- Query Rewriter
  - Finding the minimal subsuming query of a given predicate-mapped query is a well-studied problem (uses the approach of [5.] by Papakonstantinou et al.)
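Here is a minimal sketch of the Attribute Matcher's normalization and 1:1 matching, assuming a toy thesaurus, a trivial stop-word list, and crude suffix stemming; the real toolkit's thesaurus format and matching rules (including value-type matching) are not shown in the slides.

```python
STOP_WORDS = {"the", "of", "a", "an"}
IRREGULAR = {"children": "child"}                        # slide's stemming example
THESAURUS = {"writer": "author", "keyword": "subject"}   # synonym -> canonical concept

def normalize(label):
    """Lower-case, drop stop words, crudely stem, and map synonyms."""
    words = []
    for w in label.lower().split():
        if w in STOP_WORDS:
            continue
        w = IRREGULAR.get(w, w.rstrip("s"))   # e.g. "authors" -> "author"
        words.append(THESAURUS.get(w, w))
    return " ".join(sorted(words))

def match_attributes(source_attrs, target_attrs):
    """Return a 1:1 mapping from source attribute labels to target labels."""
    canon = {normalize(t): t for t in target_attrs}
    return {s: canon[normalize(s)] for s in source_attrs if normalize(s) in canon}

print(match_attributes(["Author", "Title of the Book"], ["Writer", "Book Title"]))
# -> {'Author': 'Writer', 'Title of the Book': 'Book Title'}
```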
Experiments
- Datasets
  - 447 Deep Web sources (query forms) in 8 domains
    - 3 "Basic" domains, each with a custom thesaurus in FA: Books, Airfares, Automobiles
    - 5 "New" domains (no thesaurus for these tests): Car Rentals, Jobs, Hotels, Movies, Music/Records
- Test Approach
  - Run the FA to translate 120 form queries
    - Each translation test corresponds to a random pairing of sources within a domain
  - Count the correct mappings in the translation suggested by FA
    - Indicates the amount of user effort the Form Assistant has saved
Experiments
- Results: Accuracy Distributions (figure for the Basic and New datasets)
  - X axis: % of correct predicate translations; Y axis: % of tested query forms
  - Forms with all 1:1 mappings: 87% translated perfectly for the Basic dataset, 85% for the New dataset (good domain flexibility)
  - Forms with complex mappings: 76% (Basic) and 70% (New) were "near perfect" (>80% correct)
    - FA did not attempt complex (n:m) mappings, such as a full name in the source mapping to separate first and last names in the target
Experiments
- Accuracy ratio: correct results per 1:1 query
  - Raw: includes some forms whose input form-extraction step had errors
  - Perfect: all form extractions are manually corrected first
  - Average accuracy improves when the extraction step is perfectly correct:
    - Basic dataset (3 domains): 90.4% improves to 96.1%
    - New dataset (5 domains): 81.1% improves to 86.7%
Experiments
- Example Error in Form Extraction
  - The delta.com form has a link to an alternative reservation page: "One-way & multi-city reservations"
  - This link text was wrongly interpreted by the Form Extractor as an input-field label (attribute)
Experiments
- Error Distribution (% of errors caused by each component)
  - Form Extraction: 40%
  - Attribute Matching: 18% (fewest errors)
  - Predicate Mapping: 42% (most errors)
    - Cited reason for Predicate Mapping errors: insufficient domain knowledge
      - Example failure: source subject value "computer science" did not properly map to target subject value "programming languages"
    - Improvement could entail better domain-specific ontologies and type handlers
Related Work
- From the same research group:
  - Complex Matchings (n:m)
    - Defines the "Type Recognizer" used in Form Assistant's Attribute Matcher, and discusses complex n:m matchings not attempted by Form Assistant:
    - [1.] Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach. B. He, K. C.-C. Chang, and J. Han. In Proceedings of the 2004 ACM SIGKDD Conference (KDD 2004), Seattle, Washington, August 2004.
  - MetaQuerier System
    - A fuller system for both exploring (to find) and integrating (to query) Deep Web databases:
    - [2.] Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web. K. C.-C. Chang, B. He, and Z. Zhang. In Proceedings of the Second Conference on Innovative Data Systems Research (CIDR 2005), Asilomar, California, January 2005.
Related Work
- From the same research group:
  - Form Extraction
    - As used by the implementation of Form Assistant:
    - [3.] Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax. Z. Zhang, B. He, and K. C.-C. Chang. In Proceedings of the 2004 ACM SIGMOD Conference (SIGMOD 2004), Paris, France, June 2004.
  - 2007 thorough analysis of the Deep Web
    - An interesting survey of web databases and query interfaces:
    - [4.] Accessing the Deep Web: A Survey. B. He, M. Patel, Z. Zhang, and K. C.-C. Chang. Communications of the ACM (CACM), 50(5):94-101, May 2007.
Public Datasets
- Cached real-world query-form web pages (used in the experiments): http://metaquerier.cs.uiuc.edu/repository/datasets/tel-8
- Additional Deep Web integration resources: http://metaquerier.cs.uiuc.edu/repository
Related Work
- Query Rewriting
  - As used by the implementation of Form Assistant:
  - [5.] Y. Papakonstantinou, A. Gupta, H. Garcia-Molina, and J. Ullman. A Query Translation Scheme for Rapid Implementation of Wrappers. In Proceedings of the Fourth International Conference on Deductive and Object-Oriented Databases, Singapore, December 1995.
Thank you!