Information Integration and Source Wrapping

Information Integration
and Source Wrapping
Jose Luis Ambite, USC/ISI
Outline
• Information Integration
– Definition
– Architectures
– Domain Models
– Source wrapping
• Application to EIA
– Example of Multi-State Data re-organization
• Wrapping
• Modelling
Information Integration
Single Interface to Multiple Sources
Decision Support
Application Programs
Information Agent
Databases
Knowledge Bases
The Web
Computer Programs
Information Integration
The problem of providing
uniform (sources transparent to user)
access to (query, and eventually updates too)
multiple (even 2 is hard!)
autonomous (must not affect the behavior of sources)
heterogeneous (different data models, schemas)
structured (and semistructured)
data sources (not only databases, web sources, …)
Information Integration in SIMS
To enable query access SIMS needs to:
• address semantic heterogeneity:
=> describe sources in common domain model
• address syntactic (format) heterogeneity:
=> standardize access to sources:
– Structured (DBMS): Oracle, MS Access …
– Semistructured: wrappers for html, text, pdf
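A minimal sketch of the first point, describing sources in a common domain model. The source names, field names, and sample values below are hypothetical and only illustrate the idea; this is not the SIMS source-description language.

```python
# Hypothetical source descriptions: each maps a source's own field names
# to attributes of a shared domain model (product, area, date, value).
SOURCE_DESCRIPTIONS = {
    "oracle_prices": {"PROD_NM": "product", "ST": "area",
                      "WK_END": "date", "PRICE_CTS": "value"},
    "html_table": {"Product": "product", "State": "area",
                   "Week": "date", "Price": "value"},
}


def to_domain(source, record):
    """Rename a source record's fields to domain-model attribute names."""
    mapping = SOURCE_DESCRIPTIONS[source]
    return {mapping[f]: v for f, v in record.items() if f in mapping}


# Made-up sample records, one per source, with different native schemas.
a = to_domain("oracle_prices", {"PROD_NM": "G. Regular", "ST": "CA",
                                "WK_END": "2000-01-03", "PRICE_CTS": 131.2})
b = to_domain("html_table", {"Product": "G. Regular", "State": "NY",
                             "Week": "2000-01-03", "Price": 135.0})
```

After the mapping, both records use the same attribute names, so a query can be written once against the domain model.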
Domain Model (for Time Series)
[Figure: domain model diagram]
Node labels: Time Series, Measurement, Product, Area, Period, Unit, Point of Sale, Footnote, Quality, Date, Value, Tag, Text, Gasoline, G. Leaded, G. Unleaded, G. Regular, G. Premium, G. Premium Unleaded, USA, CA, NY, Week, Month, Year, Price, Volume, CPI, PPI
Link types (legend): Subclass, Part-of, General Relation, Source Mapping
Integration Architectures
• Materialized: Data Warehouse
• Virtual: Mediator
from [Levy2000]
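The difference between the two architectures can be sketched as follows. The classes and rows are hypothetical: the warehouse copies source data when it is loaded, while the mediator only keeps references and fetches at query time.

```python
class Source:
    """A stand-in for an autonomous data source."""

    def __init__(self, rows):
        self.rows = list(rows)

    def query(self):
        return list(self.rows)


class Warehouse:
    """Materialized: source data is copied once, at load time."""

    def __init__(self, sources):
        self.rows = [r for s in sources for r in s.query()]

    def query(self):
        return list(self.rows)


class Mediator:
    """Virtual: nothing is stored; sources are queried on demand."""

    def __init__(self, sources):
        self.sources = sources

    def query(self):
        return [r for s in self.sources for r in s.query()]


s1, s2 = Source([("CA", 1)]), Source([("NY", 2)])
warehouse, mediator = Warehouse([s1, s2]), Mediator([s1, s2])
s2.rows.append(("ME", 3))  # a source changes after the warehouse was loaded
```

In this sketch the warehouse keeps answering from its stale copy until it is reloaded, while the mediator sees the new row but pays the cost of contacting every source on each query.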
Wrappers
• provide uniform mechanism for extracting data
from semi-structured sources (HTML, text, …)
• transform semi-structured sources into structured
Example: "Restaurants in Santa Monica?" => Wrapper => structured result:
Name             Address
Chinois on Main  2709 Main St.
Chao Dara        13 Union Sq.
…                …
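A minimal sketch of a hand-written wrapper for the restaurant example. The HTML layout in PAGE is an assumption made for illustration; only the names and addresses come from the slide.

```python
import re

# Assumed page layout (hypothetical); real pages will differ.
PAGE = """
<li><b>Name:</b> Chinois on Main <b>Address:</b> 2709 Main St.</li>
<li><b>Name:</b> Chao Dara <b>Address:</b> 13 Union Sq.</li>
"""

# One pattern per record: capture the text after each label.
ROW = re.compile(
    r"<b>Name:</b>\s*(?P<name>.*?)\s*"
    r"<b>Address:</b>\s*(?P<address>.*?)\s*</li>"
)


def wrap(page):
    """Turn the semi-structured page into a list of structured rows."""
    return [m.groupdict() for m in ROW.finditer(page)]


rows = wrap(PAGE)
```

A wrapper like this breaks whenever the page layout changes, which is the maintenance problem that induced extraction rules aim to reduce.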
Wrapper Building Tools
• Creating Wrappers (semi-)automatically:
– Demonstration-oriented user interface enables
users to show system what to extract by example
[Muslea99]
– System automatically induces extraction rules
– Common extraction engine
• Benefits:
– Rapid wrapper creation
– Simplified wrapper maintenance
• Fetch.com
– Start-up that commercializes the technology
Example: EIA Multi-State Data
EIA Multi-State Data: Multiple formats
EIA Multi-State Data: Table 31
Text source:
• Formatted text
• Tables contain national, regional (PADDs), and state data: extract state data
• Tables contain different measurements
EIA Multi-State: Wrapper Creation
1. Mark up a few examples and assign meaning (map to attributes from the domain model, e.g. date, value)
2. System induces extraction rules
EIA Multi-State: Metadata Extraction
• Extract
- data
- metadata
• Associate with
domain model
Extracted Data + Metadata
EIA-T31-1: Regular Gasoline Prices in Maine,
Sales to end users through retail outlets
Measured in cents per gallon
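One way to picture the association between extracted metadata and the domain model is a record whose fields are domain-model attributes. The class and field names below are hypothetical; the attribute values come from the slide, and the data points are left empty rather than invented.

```python
from dataclasses import dataclass, field


@dataclass
class TimeSeries:
    """Extracted series tagged with domain-model attributes (a sketch)."""
    series_id: str
    product: str
    measurement: str
    area: str
    point_of_sale: str
    unit: str
    points: list = field(default_factory=list)  # (date, value) pairs

    def describe(self):
        """Render the metadata as the description shown on the slide."""
        unit_text = self.unit.lower().replace("/", " per ")
        return (
            f"{self.series_id}: {self.product} {self.measurement}s "
            f"in {self.area},\n"
            f"Sales to end users through {self.point_of_sale.lower()}\n"
            f"Measured in {unit_text}"
        )


ts = TimeSeries(
    series_id="EIA-T31-1",
    product="Regular Gasoline",
    measurement="Price",
    area="Maine",
    point_of_sale="Retail outlets",
    unit="Cents/gallon",
)
description = ts.describe()
```

Because each field names a concept in the domain model, series from different tables can be compared or selected by attribute rather than by table position.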
Domain Model
[Figure: domain model diagram with the extracted series attached]
Node labels: Time Series, Measurement, Product, Area, Period, Unit, Point of Sale, Footnote, Quality, Date, Value, Tag, Text, Gasoline, G. Leaded, G. Unleaded, G. Regular, G. Premium, G. Premium Unleaded, USA, CA, Maine, Week, Month, Year, Price, Volume, Cents/gallon, Retail outlets, EIA-T31-1, EIA Multi-state Wrapper
Link types (legend): Subclass, Part-of, General Relation, Source Mapping
Additional slides
Example of Extraction Rule [Muslea et al 1999]
RULE = sequence of landmarks (e.g., Cuisine : )
Page: <b>Name:</b>Chinois on Main<b> Cuisine :<p> </b> Pacific New Wave <br>
Start: SkipTo(Cuisine :) SkipTo(</b>)
End: SkipTo(<br>)
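The Start and End rules above can be sketched with plain substring search. This is a simplification: it assumes landmarks are literal strings and that the End rule is scanned forward from the start position, which may differ from how the actual system tokenizes and scans the page.

```python
def skip_to(text, pos, landmark):
    """Return the index just past the next landmark at or after pos, or -1."""
    idx = text.find(landmark, pos)
    return -1 if idx < 0 else idx + len(landmark)


def apply_rule(text, start_landmarks, end_landmark):
    """Apply a chain of SkipTo landmarks, then cut at the end landmark."""
    pos = 0
    for landmark in start_landmarks:
        pos = skip_to(text, pos, landmark)
        if pos < 0:
            return None  # a landmark is missing: the rule does not apply
    end = text.find(end_landmark, pos)
    if end < 0:
        return None
    return text[pos:end].strip()


page = ("<b>Name:</b>Chinois on Main<b> Cuisine :<p> </b> "
        "Pacific New Wave <br>")
cuisine = apply_rule(page, ["Cuisine :", "</b>"], "<br>")
```

On the slide's page this yields the cuisine field, "Pacific New Wave".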
Example of Rule Induction [Muslea et al 1999]
Training Examples:
<p>Cuisine:<p><b>Thai</b><p>Review:<p> <b> Good
<p>Review:<p><br><p> <b> Excellent
SkipTo( <b> )
SkipTo(<p> <b>)
…
... SkipTo( : ) SkipTo(<b>) ...
SkipTo(<p>)SkipTo(<b>)
SkipTo( Review :) SkipTo( <b> )
...
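A sketch of the selection step only: it checks the candidate rules from the slide against the two training examples and keeps those that land on the review value in both. The candidates are hard-coded here rather than generated by refinement, and the tokenizer is an assumption, so this is not the full induction algorithm.

```python
import re

# Assumed tokenization: HTML tags, words, and single punctuation marks.
TOKEN = re.compile(r"<[^>]+>|\w+|[^\w\s]")


def tokenize(text):
    return TOKEN.findall(text)


def apply_rule(tokens, rule):
    """Apply SkipTo landmarks in order; return the position reached or None."""
    pos = 0
    for landmark in rule:
        n = len(landmark)
        for i in range(pos, len(tokens) - n + 1):
            if tokens[i:i + n] == landmark:
                pos = i + n
                break
        else:
            return None  # landmark not found
    return pos


def covers(rule_text, example):
    """True if the rule stops exactly at the target token of the example."""
    text, target = example
    tokens = tokenize(text)
    rule = [tokenize(landmark) for landmark in rule_text]
    return apply_rule(tokens, rule) == tokens.index(target)


examples = [
    ("<p>Cuisine:<p><b>Thai</b><p>Review:<p> <b> Good", "Good"),
    ("<p>Review:<p><br><p> <b> Excellent", "Excellent"),
]
candidates = [
    ["<b>"],
    ["<p> <b>"],
    [":", "<b>"],
    ["<p>", "<b>"],
    ["Review :", "<b>"],
]
perfect = [r for r in candidates if all(covers(r, e) for e in examples)]
```

Under this tokenization the first four candidates stop at "Thai" in the first example, so only SkipTo( Review :) SkipTo( <b> ) covers both examples.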
Mediator Architecture
• User queries in global (mediator) schema
• Mediator translates and decomposes user query into multiple source queries
from [Levy2000]
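The translate-and-decompose step can be sketched as below. The two sources, their schemas, and their rows are hypothetical, and a real mediator would also optimize the plan; this only shows a global-schema query being split into per-source queries and recombined.

```python
# Hypothetical sources: native field names mapped to global attributes,
# plus made-up rows.
SOURCES = {
    "src_prices": {
        "fields": {"state": "area", "cents": "price"},
        "rows": [{"state": "CA", "cents": 130}, {"state": "NY", "cents": 135}],
    },
    "src_volumes": {
        "fields": {"region": "area", "gallons": "volume"},
        "rows": [{"region": "CA", "gallons": 900},
                 {"region": "NY", "gallons": 700}],
    },
}


def decompose(attributes):
    """Map each relevant source to the requested attributes it supplies."""
    plan = {}
    for name, src in SOURCES.items():
        provided = set(src["fields"].values())
        wanted = [a for a in attributes if a in provided]
        if any(a != "area" for a in wanted):
            plan[name] = wanted
    return plan


def run_source_query(name, area):
    """Run one source query and translate results to the global schema."""
    src = SOURCES[name]
    translated = ({src["fields"][k]: v for k, v in row.items()}
                  for row in src["rows"])
    return [rec for rec in translated if rec["area"] == area]


def answer(attributes, area):
    """Answer a global-schema query by combining the source answers."""
    result = {"area": area}
    for name in decompose(attributes):
        for rec in run_source_query(name, area):
            result.update({a: rec[a] for a in attributes if a in rec})
    return result


combined = answer(["price", "volume"], "CA")
```

The user asks for price and volume by area without knowing that the two attributes live in different sources with different field names.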
System Architecture
Construction phase:
• Deploy DBs
• Extend ontology
User phase:
• Compose query
Access phase:
• Create DB query
• Retrieve data
Components:
• Domain modeling
- DB analysis
- text analysis
• Integrated ontology
- global terminology
- source descriptions
- integration axioms
• User Interface
- ontology browser
- query constructor
• Query processor
- reformulation
- cost optimization
• Sources: R, S, T