Crossing the Structure Chasm

Alon Halevy
University of Washington, Seattle
UCLA, April 15, 2004
The Structure Chasm
Authoring: writing text vs. creating a schema
Querying: keywords vs. using someone else’s schema (but we can pose complex queries)
Data sharing: easy vs. committees, standards
Why is This a Problem?
Databases used to be isolated and administered only by experts.
Today’s applications call for large-scale data sharing:
• Big science (bio-medicine, astrophysics, …)
• Government agencies
• Large corporations
• The web (over 100,000 searchable data sources)
The vision:
• Content authoring by anyone, anywhere
• Powerful database-style querying
• Use relevant data from anywhere to answer the query
• The Semantic Web
Fundamental problem: reconciling different models of the world.
Outline
Two motivating scenarios:
• A web of structured data
• Personal data management
A tour of recent data sharing architectures:
• Data integration systems
• Peer-data management systems
The algorithmic problems:
• Query reformulation
• Reconciling semantic heterogeneity
Reconsidering authoring and querying challenges
Large-Scale Scientific Data Sharing
SwissProt
OMIM
UW
HUGO
UW Microbiology
UW Genome Sciences
UCLA Genetics
GeneClinics
Non-urgent Applications
B of A
Fidelity
IRS
UW
1040 DB
California IRS
NY IRS
County real-estate DB
Employer Tax Reports
Personal Data Management
[Semex: Sigurdsson, Nemes, H.]
Organizer,
Participants
Event
Person
Homepage
Web Page
Cached
Document
Author
Softcopy
Data is organized by application
Sender,
Recipients
Paper
Message
Softcopy
Presentation
Cites
Mail &
calendar
Papers
HTML
Files
Presentations
Finding Publications
Publication: What Can Peer-to-Peer Do for Databases, and Vice Versa
Person: A. Halevy
Person: Dan Suciu
Person: Maya Rodrig
Person: Steven Gribble
Person: Zachary Ives
Following Associations (1)
Publication
Bernstein
Following Associations (2)
“A survey of approaches to automatic
schema matching”
“Corpus-based schema matching”
Publication
Bernstein
“Database management for peer-to-peer computing: A vision”
“Matching schemas by learning from
others”
Following Associations (3)
Cited by
Publication
Publication
Bernstein
Citations
Following Associations (4)
Cited Authors
Publication
Bernstein
PIM Data Sharing Challenges
Need to combine data from multiple applications/sources.
After an initial set of concepts is given:
• extend and personalize the concept hierarchy,
• share (parts of) our data with others,
• incorporate external data into our view.
Need also instance-level reconciliation:
• Alon Halevy, A. Halevy, Alon Y. Levy – same guy!
Data Integration
Goal: provide a uniform interface to a set of autonomous data sources.
New abstraction layer over multiple sources.
Many research projects (DB & AI):
• Mine: Information Manifold, Tukwila, BioMediator
• Cal: Garlic (IBM), Ariadne (USC), XMAS (UCSD), …
Recent “Enterprise Information Integration” industry:
• Startups: Nimble, Enosys, Composite, MetaMatrix
• Products from big players: BEA, IBM
Relational Abstraction Layer
Schema: the template for data.

Students:
SSN          Name     Category
123-45-6789  Charles  undergrad
234-56-7890  Dan      grad
…

Takes:
SSN          CID
123-45-6789  CSE444
123-45-6789  CSE444
234-56-7890  CSE142
…

Courses:
CID     Name               Quarter
CSE444  Databases          fall
CSE541  Operating systems  winter

Queries:
SELECT C.name
FROM Students S, Takes T, Courses C
WHERE S.name = 'Mary' AND
      S.ssn = T.ssn AND T.cid = C.cid
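To see the relational abstraction in action, the query above can be run against a toy instance of the three tables. This is a minimal sketch using Python's built-in sqlite3 module. The rows for "Mary" are hypothetical additions (the slide shows no such student), included only so that the query returns a result.

```python
import sqlite3

# Toy in-memory instance of the Students / Takes / Courses schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (ssn TEXT, name TEXT, category TEXT);
    CREATE TABLE Courses  (cid TEXT, name TEXT, quarter TEXT);
    CREATE TABLE Takes    (ssn TEXT, cid TEXT);

    INSERT INTO Students VALUES
        ('123-45-6789', 'Charles', 'undergrad'),
        ('234-56-7890', 'Dan',     'grad'),
        ('345-67-8901', 'Mary',    'grad');      -- hypothetical row
    INSERT INTO Courses VALUES
        ('CSE444', 'Databases',         'fall'),
        ('CSE541', 'Operating systems', 'winter');
    INSERT INTO Takes VALUES
        ('123-45-6789', 'CSE444'),
        ('234-56-7890', 'CSE142'),
        ('345-67-8901', 'CSE444');               -- hypothetical row
""")

# Same three-way join as on the slide, with the name passed as a parameter.
query = """
    SELECT C.name
    FROM Students S, Takes T, Courses C
    WHERE S.name = ? AND S.ssn = T.ssn AND T.cid = C.cid
"""
mary_courses = [row[0] for row in conn.execute(query, ("Mary",))]
print(mary_courses)
```

The query names only tables and attributes; it says nothing about how or where the rows are stored, which is the point of the abstraction layer.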
Data Integration:
Higher-level Abstraction
[Figure: a query Q posed over the mediated schema is translated, through semantic mappings, into queries Q1, Q2 and Q3 over source databases that each store variants of the Students, Courses and Takes tables.]
Mediated Schema
www.biomediator.org (Tarczy-Hornoch, Mork)
Entities: Entity, Phenotype, Gene, Sequenceable Entity, Protein, Structured Vocabulary, Experiment, Nucleotide Sequence, Microarray Experiment
Sources: OMIM, SwissProt, HUGO, GeneClinics, LocusLink, GO, Entrez, GEO
Query: For the micro-array experiment I just ran, what are the
related nucleotide sequences and for what protein do they code?
Semantic Mappings
Differences in:
• Names in schema
• Attribute grouping
• Coverage of databases
• Granularity and format of attributes

Inventory Database A:
BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)

Inventory Database B:
Books(Title, ISBN, Price, DiscountPrice, Edition)
Authors(ISBN, FirstName, LastName)
BookCategories(ISBN, Category)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
CDCategories(ASIN, Category)
Artists(ASIN, ArtistName, GroupName)
Key Issues
• Formalism for mappings
• Reformulation algorithms
• How will we create them?
[Figure: a query Q over the mediated schema is reformulated into queries Q’ over each source database.]
Beyond Data Integration
Mediated schema is a bottleneck for large-scale data sharing.
It’s hard to create, maintain, and agree upon.
Peer Data Management Systems
Piazza: [Tatarinov, H., Ives, Suciu, Mork]
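• Mappings specified locally
• Map to most convenient nodes
• Queries answered by traversing semantic paths
[Figure: a network of peers (UCLA, CiteSeer, Stanford, UW, DBLP, UC Berkeley, UCSD); a query Q posed at one peer is reformulated into queries Q1 through Q6 along the mappings.]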
PDMS-Related Projects
Hyperion (Toronto)
PeerDB (Singapore)
Local relational models (Trento)
Edutella (Hannover, Germany)
Semantic Gossiping (EPFL, Lausanne)
Raccoon (UC Irvine)
Orchestra (U. Penn)
A Few Comments about Commerce
Until 5 years ago:
• Data integration = Data warehousing.
Since then:
• A wave of startups: Nimble, MetaMatrix, Calixa, Composite, Enosys
• Big guys made announcements (IBM, BEA).
• [Delay] Big guys released products.
Success: analysts have new buzzword – EII
• New addition to acronym soup (with EAI).
Lessons:
• Performance was fine. Need management tools.
Data Integration: Before
[Figure: a query Q over the mediated schema is reformulated into queries Q’, one sent to each of five sources.]
Data Integration: After
[Figure: Nimble product architecture. Front-end: user applications, Lens™ File, Software Developers Kit, InfoBrowser™, Lens Builder™, NIMBLE™ APIs. Integration layer: Nimble Integration Engine™ with XML Query, Compiler, Executor, Cache, Metadata Server and Common XML View. Management tools: Integration Builder, Concordance Developer, Data Administrator, Security Tools. Sources: relational, data warehouse/mart, legacy, flat file, web pages, XML.]
Sound Business Models
[Chart: growth of enterprise information, 1995 to 2005. Source: Gartner, 1999]
Explosion of intranet and extranet information
80% of corporate information is unmanaged
By 2004, 30X more enterprise data than in 1999
The average company:
• maintains 49 distinct enterprise applications
• spends 35% of total IT budget on integration-related efforts
Languages for Schema Mapping
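[Figure: a query Q over the mediated schema is reformulated into queries Q’ over the sources. The mappings between the mediated schema and the sources can be written in GAV (global-as-view), LAV (local-as-view), or GLAV (global-and-local-as-view).]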
GLAV Mappings
R1a(isbn, title, n), R1b(isbn, genre, n) ⊆
Book(isbn, title, genre, year), Author(isbn, n), year < 1970
Mediated schema: Book(ISBN, Title, Genre, Year), Author(ISBN, Name)
Sources: R1a and R1b (books before 1970), R2, R3, R4, R5
Query Reformulation
Query: Find authors of humor books
R5(x, y) :- Book(x, y, "Humor", z)
Plan: R1 Join R5
Mediated schema: Book(ISBN, Title, Genre, Year), Author(ISBN, Name)
Sources: R1 (books before 1970), R2, R3, R4, R5 (humor books)
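The plan "R1 Join R5" can be sketched as a join on ISBN over toy source contents. This is only an illustration under assumptions: the layout R1(isbn, title, author_name) is assumed here (the slides do not give R1's attributes), the ISBN values are placeholders, and the function name is hypothetical.

```python
# Hypothetical source contents; only the relation names R1 and R5 come from the slides.
# R1(isbn, title, author_name): assumed to list books (before 1970) with their authors.
R1 = [
    ("111", "Three Men in a Boat", "Jerome K. Jerome"),
    ("222", "Moby-Dick", "Herman Melville"),
]
# R5(isbn, title): humor books.
R5 = [
    ("111", "Three Men in a Boat"),
    ("333", "Good Omens"),
]


def authors_of_humor_books(r1, r5):
    """Join R1 and R5 on ISBN and project the author name."""
    humor_isbns = {isbn for isbn, _title in r5}
    return sorted({name for isbn, _title, name in r1 if isbn in humor_isbns})


humor_authors = authors_of_humor_books(R1, R5)
print(humor_authors)
```

In this toy data the second humor book has no matching row in R1, so its authors are not returned. A plan over the available sources yields answers that are sound but possibly incomplete.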
Answering Queries Using Views
Formal problem: can we use previously answered queries to answer a new query?
• Challenge: need to invert query expression.
Results depend on:
• Query language used for sources and queries
• Open-world vs. closed-world assumption
• Allowable access patterns to the sources
MiniCon [Pottinger and H., 2001]: scales to thousands of sources.
Every commercial DBMS implements some version of answering queries using views.
Some Open Research Issues
Managing large networks of mappings:
• Consistency
• Trust
Improving networks: finding additional mappings
Indexing: heterogeneous data across the network
Caching: Where? What?
[Figure: the peer network of UCLA, CiteSeer, Stanford, UW, DBLP, UC Berkeley and UCSD.]
Semantic Mappings
Need mappings in every data sharing architecture
“Standards are great, but there are too many.”
[Figure: the schemas of Inventory Database A (BooksAndMusic) and Inventory Database B (Books, Authors, BookCategories, CDs, CDCategories, Artists).]
Why is it so Hard?
Schemas never fully capture their intended meaning:
• We need to leverage any additional information we may have.
A human will always be in the loop.
• Goal is to improve designer’s productivity.
• Solution must be extensible.
Two cases for schema matching:
• Find a map to a common mediated schema.
• Find a direct mapping between two schemas.

Typical Matching Heuristics
We build a model for every element from multiple sources of evidence in the schemas:
• Schema element names: BooksAndCDs/Categories ~ BookCategories/Category
• Descriptions and documentation: “ItemID: unique identifier for a book or a CD”, “ISBN: unique identifier for any book”
• Data types, data instances: DateTime ≠ Integer; addresses have similar formats
• Schema structure: all books have similar attributes
Models consider only the two schemas.
In isolation, techniques are incomplete or brittle: need principled combination.
Using Past Experience
Matching tasks are often repetitive.
• Humans improve over time at matching.
• A matching system should improve too!
LSD:
• Learns to recognize elements of mediated schema.
• [Doan, Domingos, H., SIGMOD-01, MLJ-03]
• Doan: 2003 ACM Distinguished Dissertation Award.
[Figure: a mediated schema mapped to several data sources.]
Example: Matching Real-Estate Sources
Mediated schema: address, price, agent-phone, description

Schema of realestate.com (location → address, listed-price → price, phone → agent-phone, comments → description):
location    listed-price  phone           comments
Miami, FL   $250,000      (305) 729 0831  Fantastic house
Boston, MA  $110,000      (617) 253 1429  Great location
...

Schema of homes.com:
price     contact-phone   extra-info
$550,000  (278) 345 7215  Beautiful yard
$320,000  (617) 335 2315  Great beach
...

Learned hypotheses:
• If “phone” occurs in the name => agent-phone
• If “fantastic” & “great” occur frequently in data values => description
Learning Source Descriptions
We learn a classifier for each element of the mediated schema.
Training examples are provided by the given mappings.
Multi-strategy learning:
• Base learners: name, instance, description
• Combine using stacking.
Accuracy of 70-90% in experiments.
Learning about the mediated schema.
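The idea of combining base learners can be illustrated with a much-simplified sketch. It assumes each base learner outputs a confidence score per mediated-schema element and that the combiner is a weighted sum. In real stacking the weights would be learned from the base learners' predictions on training data; here they are hard-coded, and all scores, weights and function names are hypothetical rather than taken from LSD.

```python
def combine_predictions(base_scores, learner_weights):
    """Weighted sum of per-label confidence scores from several base learners."""
    combined = {}
    for learner, scores in base_scores.items():
        weight = learner_weights[learner]
        for label, score in scores.items():
            combined[label] = combined.get(label, 0.0) + weight * score
    best_label = max(combined, key=combined.get)
    return best_label, combined


# Hypothetical scores for the source element "contact-phone" of homes.com.
base_scores = {
    "name": {"agent-phone": 0.7, "description": 0.3},
    "instance": {"agent-phone": 0.9, "price": 0.1},
}
# Hypothetical weights; a stacking meta-learner would fit these from training data.
learner_weights = {"name": 0.4, "instance": 0.6}

best_label, combined = combine_predictions(base_scores, learner_weights)
print(best_label, combined)
```

Because the final score draws on both the element name and its data values, a weak signal from one learner can be offset by a strong signal from the other.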
Corpus-Based Schema Matching
[Madhavan, Doan, Bernstein, H.]
Can we use previous experience to match two new schemas?
Learn about a domain?
• Learn general purpose knowledge: a classifier for every corpus element
• Reuse extracted knowledge to match new schemas
[Figure: a corpus of schemas and matches, with elements such as Music, Books, Authors, Items, Artists, Information, Publisher, Literature, CDs and Categories.]
Exploiting The Corpus
Given an element s ∈ S and t ∈ T, how do we determine if s and t are similar?
The PIVOT method:
• Elements are similar if they are similar to the same corpus concepts.
The AUGMENT method:
• Enrich the knowledge about an element by exploiting similar elements in the corpus.
Pivot: measuring (dis)agreement
Compute interpretations w.r.t. corpus:
Interpretation I(s) = (P1, …, Pn) for element s of schema S, where Pk = Probability(s ~ ck) and n = # concepts in corpus
For s in S and t in T, compute Similarity(I(s), I(t))
• Interpretation captures how similar an element is to each corpus concept.
• Compared using cosine distance.
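A minimal sketch of the comparison step, assuming each interpretation is a plain list of probabilities, one per corpus concept. The three-concept vectors below are made up for illustration, and the function name is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length interpretation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero interpretation is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical interpretations over three corpus concepts c1, c2, c3.
I_s = [0.9, 0.1, 0.0]  # element s of schema S: mostly resembles c1
I_t = [0.8, 0.2, 0.1]  # element t of schema T: also mostly resembles c1
I_u = [0.0, 0.1, 0.9]  # another element of T: mostly resembles c3

print(cosine_similarity(I_s, I_t), cosine_similarity(I_s, I_u))
```

Elements s and t never need to be compared directly: they score as similar because both resemble the same corpus concept.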
Augmenting element models
[Figure: for element s of schema S, search the corpus of known schemas and mappings for similar concepts (e, f), then build an augmented element model M’s (Name, Instances, Type, …) from s, e and f.]
Search similar corpus concepts
• Pick the most similar ones from the interpretation
Build augmented models
• Robust since more training data to learn from
Compare elements using the augmented models
Experimental Results
Five domains:
• Auto and real estate: web forms
• Invsmall and inventory: relational schemas
• Nameaddr: real XML schemas
Performance measure:
• F-Measure: $f = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
• Precision and recall are measured in terms of the matches predicted.
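As a worked example of the measure, the sketch below computes precision, recall and F-measure from a set of predicted matches and a set of correct matches. The match pairs are made up for illustration (loosely following the real-estate example), and the function name is hypothetical.

```python
def f_measure(predicted, correct):
    """F-measure of predicted matches against the correct matches."""
    predicted, correct = set(predicted), set(correct)
    true_positives = len(predicted & correct)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # fraction of predictions that are right
    recall = true_positives / len(correct)       # fraction of correct matches found
    return 2 * precision * recall / (precision + recall)


# Hypothetical matches, written as (source element, mediated-schema element).
predicted = {
    ("location", "address"),
    ("phone", "agent-phone"),
    ("comments", "price"),          # a wrong prediction
}
correct = {
    ("location", "address"),
    ("phone", "agent-phone"),
    ("comments", "description"),
    ("listed-price", "price"),
}

score = f_measure(predicted, correct)
print(score)
```

Here two of three predictions are right (precision 2/3) and two of four correct matches are found (recall 1/2), giving f = 4/7.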
Comparison over domains
[Chart: average F-measure (0 to 0.9) of direct, augment and pivot on the auto, real estate, invsmall, inventory and nameaddr domains.]
Corpus-based techniques perform better in all the domains.
“Tough” schema pairs
[Chart: average F-measure (0 to 0.9) of direct, augment and pivot on the auto, real estate, invsmall, inventory and nameaddr domains.]
Significant improvement on difficult-to-match schema pairs.
Mixed corpus
[Chart: average F-measure (0 to 0.9) of direct, augment and pivot; axis labels: auto + re + invsmall, difficult, auto + invsmall.]
A corpus with schemas from different domains can also be useful.
Other Corpus Based Tools
A corpus of schemas can be the basis for many useful tools:
• Mirror the success of corpora in IR and NLP?
Back to the structure chasm:
• Authoring and querying.
Auto-complete:
• I start creating a schema (or show sample data), and the tool suggests a completion.
Formulating queries on new databases:
• I ask a query using my terminology, and it gets reformulated appropriately.
Conclusion
Vision: data authoring, querying and sharing by everyone, everywhere.
Need to make it easier to enjoy the benefits of structured data.
Challenge: reconciling semantic heterogeneity
[Figure: schema mapping, supported by a corpus of schemas.]
Some References
www.cs.washington.edu/homes/alon
Piazza: ICDE03, WWW03, VLDB-03
The Structure Chasm: CIDR-03
Surveys on schema matching languages:
• Halevy, VLDB Journal 01
• Lenzerini, PODS 2002
Semi-automatic schema matching:
• Rahm and Bernstein, VLDB Journal 01.
Teaching integration to undergraduates:
• SIGMOD Record, September, 2003.