Corpus-based Schema Matching

Jayant Madhavan
Philip Bernstein
AnHai Doan
Alon Halevy
Microsoft Research
UIUC
University of Washington
Schema Matching
Inventory Database A
BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, ListPrice, Categories, Keywords)
Discounts(ItemID, DiscountPrice)
Inventory Database B
Books(Title, ISBN, Price, OurPrice, Edition)
Authors(ISBN, FirstName, LastName)
BookGenres(ISBN, Genre)
Schema Matching: Discovering correspondences between similar elements
Eventually… SQL expressions that can populate one database from the other
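One way to picture the output of schema matching for Inventory Databases A and B is a list of element pairs. The sketch below is illustrative only: the element names come from the slide, but the specific pairings and the helper names `targets_for` and `select_columns_for` are assumptions, and joins and filters are left out.

```python
# Illustrative 1:1 correspondences between Inventory Database A and B.
# Element names are from the slide; the pairings themselves are assumed.
CORRESPONDENCES = [
    ("BooksAndMusic.Title", "Books.Title"),
    ("BooksAndMusic.ListPrice", "Books.Price"),
    ("Discounts.DiscountPrice", "Books.OurPrice"),
    ("BooksAndMusic.Categories", "BookGenres.Genre"),
]


def targets_for(source_element, correspondences=CORRESPONDENCES):
    """Return every Database B element matched to a Database A element."""
    return [t for s, t in correspondences if s == source_element]


def select_columns_for(target_table, correspondences=CORRESPONDENCES):
    """Build the column list of a SELECT that could feed one target table.

    Only the renaming implied by the matches is produced; joins between
    source tables would have to be added in a later mapping step.
    """
    cols = [
        f"{s} AS {t.split('.')[1]}"
        for s, t in correspondences
        if t.split(".")[0] == target_table
    ]
    return "SELECT " + ", ".join(cols)
```

Calling `select_columns_for("Books")` yields only the renamed column list, which is the part of the eventual SQL expression that the correspondences determine directly.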
November 4th, 2004
Corpus-based Schema Matching
Heterogeneity and Data Sharing
Data Integration
Query
Mediator: Books+Music Central (Book, Music, Store, …)
Mappings
Data Sources: All Books, CD World, Amazon
Source schemas: Books, Pubs, Authors, …; Products, Discounts, …
Mappings provide the glue between independent data sources
Schema matching is important to any application
with multiple data sources
Typical Approaches
Multiple sources of evidence in the schemas
Schema element names
Abbreviations, synonyms, …
Example: BooksAndCDs/Categories ~ BookCategories/Category
Descriptions and documentation
Incomplete, absent, …
Example: ItemID: unique identifier for a book or a CD
Data types
Inconsistent, absent, …
Example: DateTime vs. Integer
Schema structure
Overlapping schemas, …
Example: All books have similar attributes
Data instances
Different values, scales, …
Example: All addresses have similar formats
Combine multiple techniques to exploit all available evidence
[Do, Rahm; VLDB 2002], [Doan, et al.; WWW 2002]…
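As a rough illustration of combining evidence, the sketch below blends a name-based score with a data-type score. It is not the method of the cited systems: the weights, the neutral score for missing types, and the function names are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """String similarity of two element names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def type_similarity(type_a, type_b):
    """1.0 for equal types, 0.0 for different ones.

    A missing type gives a neutral 0.5 (an assumption), since the slide
    notes that type information is often absent.
    """
    if type_a is None or type_b is None:
        return 0.5
    return 1.0 if type_a == type_b else 0.0


def combined_similarity(elem_a, elem_b, weights=(0.7, 0.3)):
    """Weighted sum of the two evidence sources; weights are illustrative."""
    w_name, w_type = weights
    return (
        w_name * name_similarity(elem_a["name"], elem_b["name"])
        + w_type * type_similarity(elem_a.get("type"), elem_b.get("type"))
    )


categories = {"name": "Categories", "type": "string"}
category = {"name": "Category", "type": "string"}
isbn = {"name": "ISBN", "type": "string"}

# Categories should score higher against Category than against ISBN.
print(combined_similarity(categories, category) > combined_similarity(categories, isbn))
```

A weighted sum is only one possible combiner; the point is that each evidence source produces its own score before any decision is made.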
Schemas S and T
1. Build models
Element s gets model Ms; element t gets model Mt
Each model holds Name, Instances, Type, …
2. Compare models
Matching Techniques
3. Combine results
Similarity Matrix with rows s1 … sm and columns t1 … tn
4. Generate matches
Mapping
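The four steps can be sketched as a small pipeline: score every source/target pair with several matchers, average the scores into a similarity matrix, then pick matches from it. This is a minimal sketch; averaging as the combiner, the best-per-row selection rule, and the 0.5 threshold are assumptions, not details from the talk.

```python
def similarity_matrix(source_elems, target_elems, matchers):
    """Average the scores of all matchers for each (source, target) pair."""
    return [
        [sum(m(s, t) for m in matchers) / len(matchers) for t in target_elems]
        for s in source_elems
    ]


def generate_matches(source_elems, target_elems, matrix, threshold=0.5):
    """For each source element keep its best target, if the score is high enough."""
    matches = {}
    for i, s in enumerate(source_elems):
        best = max(range(len(target_elems)), key=lambda j: matrix[i][j])
        if matrix[i][best] >= threshold:
            matches[s] = target_elems[best]
    return matches


# Two toy matchers (hypothetical): exact name equality and shared suffix.
def same_name(s, t):
    return 1.0 if s.lower() == t.lower() else 0.0


def shared_suffix(s, t):
    return 1.0 if s.lower()[-5:] == t.lower()[-5:] else 0.0


sources = ["price", "salePrice"]
targets = ["price", "discountPrice"]
matrix = similarity_matrix(sources, targets, [same_name, shared_suffix])
print(generate_matches(sources, targets, matrix))
```

Here `salePrice` ties between both targets at 0.5, which shows why name evidence alone is often not enough to choose a match.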
Insufficient evidence
Product(productID, name, price, salePrice)
0X7630AB12 | The Concert in Central Park | $13.99 | $11.99
Music(ASIN, title, artists, recordLabel, discountPrice)
(no tuples)
MusicCD(ASIN, album, artistName, price, discountPrice, recordCompany)
4Y3026DF23 | The Best of the Doors | The Doors | $16.99 | $12.99 | Columbia
CD(prodID, albumName, artists, price, salePrice)
9R4374FG56 | Saturday Night Fever | The Bee Gees | $14.99 | $9.99
Obtaining more evidence
Product, augmented with CD
productID, prodID: 0X7630AB12, 9R4374FG56
name, albumName: The Concert in Central Park, Saturday Night Fever
price: $13.99, $14.99
salePrice: $11.99, $9.99
Music, augmented with MusicCD
ASIN: 4Y3026DF23
title, album: The Best of the Doors
artists, artistName: The Doors
recordLabel
discountPrice: $12.99
Corpus-based Augment
Corpus
MusicCD(ASIN, album, artistName, price, discountPrice, recordCompany)
4Y3026DF23 | The Best of the Doors | The Doors | $16.99 | $12.99 | Columbia
CD(prodID, albumName, artists, price, salePrice)
9R4374FG56 | Saturday Night Fever | The Bee Gees | $14.99 | $9.99
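The augmentation idea on this slide can be sketched as pooling the names and data instances of similar corpus elements into the model of the element being matched. The dictionary layout and the function name `augment` are assumptions for illustration; how similar corpus elements are found is not shown here.

```python
def augment(element, similar_corpus_elements):
    """Return a new model that adds the names and instances of similar
    corpus elements to those of the given element (order preserved,
    duplicates dropped)."""
    names = list(element["names"])
    instances = list(element["instances"])
    for other in similar_corpus_elements:
        for name in other["names"]:
            if name not in names:
                names.append(name)
        for value in other["instances"]:
            if value not in instances:
                instances.append(value)
    return {"names": names, "instances": instances}


# Music.title has no tuples, so it starts with a name and nothing else.
music_title = {"names": ["title"], "instances": []}
# MusicCD.album from the corpus is assumed to have been found similar.
musiccd_album = {"names": ["album"], "instances": ["The Best of the Doors"]}

augmented = augment(music_title, [musiccd_album])
print(augmented)
```

After augmentation the empty `title` element carries an alternate name and an example value, which gives instance-based matchers something to compare.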
Corpus-based Schema Matching
Can we use known schemas and mappings to
match as yet unseen schemas?
Augment information about elements in
schemas being matched
Learn schema design patterns and constraints
from known schemas to improve matches
Multiple representations for concepts
CDs: CD, Music
Album: AlbumName, Name, TrackName
ID: CDID, ProdCode, ISBN
ArtistID
DiscountPrice: DiscountedPrice, SalePrice, OurPrice, Discounted, DiscPrice
Artist: AuthorArtist, Name, LastName, Author
RecordLabel: Label, Company, RecordingCompany
Artists: CD2Artist, AuthorArtists
Learn alternate names, data instances, names of related
elements, data types, …
Schema Design Patterns
Relations between elements
Schema element dependency
CDs → price
fax → telephone
discountPrice → price
city → state
numEmployees → manager
Frequently co-occurring concepts
(Warehouse, warehouseID, manager, telephone, fax)
(Availability, Books, CDs, Warehouses)
zipcode → Warehouses
Tables and likely columns
Table/column | Likely column/table | Other column/table
Warehouses | warehouseID, telephone, fax, state, zip, numEmployees, manager, streetAddress, city | capacity
title | Books |
isbn | Books, Availability | Keywords, Authors
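Statistics such as "tables and likely columns" could be estimated by counting how often each column appears with each table across the corpus. The sketch below assumes table and column names have already been normalized to shared concept names; the toy corpus, the function names, and the 0.75 cutoff are hypothetical.

```python
from collections import Counter, defaultdict


def column_statistics(corpus):
    """corpus: list of schemas, each a dict of table name -> column names.

    Returns table -> {column: fraction of that table's occurrences in
    which the column appears}.
    """
    table_counts = Counter()
    column_counts = defaultdict(Counter)
    for schema in corpus:
        for table, columns in schema.items():
            table_counts[table] += 1
            column_counts[table].update(set(columns))
    return {
        table: {col: n / table_counts[table] for col, n in cols.items()}
        for table, cols in column_counts.items()
    }


def likely_columns(stats, table, min_fraction=0.75):
    """Columns seen in at least min_fraction of the table's occurrences."""
    return sorted(c for c, f in stats.get(table, {}).items() if f >= min_fraction)


# Toy corpus (hypothetical) with two schemas that both contain Warehouses.
corpus = [
    {"Warehouses": ["warehouseID", "telephone", "fax"]},
    {"Warehouses": ["warehouseID", "telephone", "capacity"]},
]
stats = column_statistics(corpus)
print(likely_columns(stats, "Warehouses"))
```

Columns below the cutoff, such as `capacity` here, would fall into the "other" group rather than the "likely" one.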
Corpus of known schemas and mappings
Schemas S
Build initial models: element s gets model Ms (Name, Instances, Type, …)
Search similar elements: corpus elements e and f are found for s
Build augmented models: M's combines Ms with evidence from e and f (Name, Instances, Type, …)
Typical Schema Matcher
Learn schema design patterns: Concepts/Clusters, Domain Constraints
Generate Matches
Mapping
Contents of the Corpus
In order to augment
Learn model ensemble for each element
names, data instances, types, structure, …
Train using the schemas and mappings
An element and the elements it maps to are positive examples
In order to learn domain constraints
Cluster elements in the corpus into concepts
Estimate schema statistics
Likely tables-columns and element co-occurrence
Learn importance of individual constraints
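To make the training step concrete, the sketch below builds the example sets for one corpus element from known mappings. The slide only states what counts as a positive example; treating every other corpus element as a negative is an assumption, as are the function name and the sample mappings.

```python
def training_examples(element, corpus_elements, mappings):
    """Split corpus elements into positives and negatives for one element.

    mappings: iterable of 2-element frozensets of element names.
    Positives are the element itself plus everything it maps to.
    Negatives are all remaining corpus elements (an assumption).
    """
    positives = {element}
    for pair in mappings:
        if element in pair:
            positives |= set(pair)
    negatives = set(corpus_elements) - positives
    return positives, negatives


corpus_elements = ["MusicCD.album", "CD.albumName", "MusicCD.price", "CD.price"]
# Illustrative known mappings between the MusicCD and CD schemas.
mappings = [
    frozenset({"MusicCD.album", "CD.albumName"}),
    frozenset({"MusicCD.price", "CD.price"}),
]

pos, neg = training_examples("MusicCD.album", corpus_elements, mappings)
print(sorted(pos), sorted(neg))
```

Each set would then feed whatever learners make up the element's model ensemble (names, instances, types, and so on).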
Experimental Results
Four domains
Automatically extracted web forms
Manually created relational schemas
Techniques
Direct: Glue [WWW’2002]
Corpus-based Augment
Corpus-based Pivot [IIW’2004]
Improved Matching Performance
[Bar chart: Average FMeasure (y-axis, 0.5 to 1) for direct, augment, and pivot on four domains: auto, real estate, invsmall, inventory]
16-19 schemas and 6 mappings in the corpus
22-54 schema pairs being tested
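The results are reported as average F-measure. The slides do not spell out the formula; assuming the standard definition (the harmonic mean of precision and recall over predicted versus reference correspondences), it can be computed as in this sketch. The function name and the sample pairs are hypothetical.

```python
def f_measure(predicted, reference):
    """Precision, recall, and F-measure for two sets of correspondences.

    Each correspondence is a (source_element, target_element) pair.
    Returns (0.0, 0.0, 0.0) when nothing is predicted correctly.
    """
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(predicted)
    recall = correct / len(reference)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


predicted = {("price", "price"), ("name", "albumName"), ("productID", "artists")}
reference = {("price", "price"), ("name", "albumName"), ("productID", "prodID")}
p, r, f = f_measure(predicted, reference)
print(p, r, f)
```

An average F-measure for a domain would then be the mean of this value over all tested schema pairs.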
Difficult Match Tasks
[Bar chart: Average FMeasure (y-axis, 0.5 to 1) for direct, augment, and pivot on difficult match tasks in four domains: auto, real estate, invsmall, inventory]
More significant improvements for difficult tasks
Improvements are smaller for easy tasks
Related Work
Using past matching experience
[Doan, et al., SIGMOD’2001; Do & Rahm, VLDB’2002]
We are trying to match unseen schemas.
Using web forms to construct mediated schema
[He & Chang, SIGMOD’2003]
Clustering of elements is an intermediate step in our corpus.
Using a Domain Ontology
[Xu & Embley, DASFAA’2003]
Our corpus structures are automatically generated.
Conclusions
Schema Matching is hard with insufficient evidence
Corpus-based Schema Matching
Augment the evidence about elements in unseen schemas
Learn schema design patterns to select matches
Improves matching especially for difficult tasks
Future Work
Large schemas and complex mappings
User feedback to curate the corpus
Corpus as a tool for other data management tasks [Halevy &
Madhavan, IJCAI’2003]
http://www.cs.washington.edu/homes/jayant