Corpus-based Schema Matching
Jayant Madhavan, Philip Bernstein, AnHai Doan, Alon Halevy
Microsoft Research, UIUC, University of Washington

Schema Matching
  Inventory Database A
    BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, ListPrice, Categories, Keywords)
    Discounts(ItemID, DiscountPrice)
  Inventory Database B
    Books(Title, ISBN, Price, OurPrice, Edition)
    Authors(ISBN, FirstName, LastName)
    BookGenres(ISBN, Genre)
  Schema Matching: Discovering correspondences between similar elements
  Eventually… SQL expressions that can populate one database from the other

November 4th, 2004    Corpus-based Schema Matching

Heterogeneity and Data Sharing
  [Figure: a query is posed to the data integration mediator "Books+Music Central" (Book, Music, Store, …), which is linked by mappings to the data sources All Books, CD World, and Amazon; source schemas shown include (Books, Pubs, Authors, …) and (Products, Discounts, …)]
  Mappings provide the glue between independent data sources
  Schema matching is important to any application with multiple data sources

Typical Approaches
  Multiple sources of evidence in the schemas
    Schema element names (e.g., BooksAndCDs/Categories ~ BookCategories/Category)
      Abbreviations, synonyms, …
    Descriptions and documentation (e.g., ItemID: unique identifier for a book or a CD)
      Incomplete, absent, …
    Data types (e.g., DateTime, Integer)
      Inconsistent, absent, …
    Schema structure (e.g., all books have similar attributes)
      Overlapping schemas, …
    Data instances (e.g., all addresses have similar formats)
      Different values, scales, …
  Combine multiple techniques to exploit all available evidence
    [Do, Rahm; VLDB 2002], [Doan, et al.; WWW 2002]…

  [Figure: the generic matching pipeline for schemas S and T]
  1. Build models: element models Ms and Mt for elements s and t (Name, Instances, Type, …)
  2. Compare models: matching techniques
  3. Combine results: similarity matrix over s1 … sm and t1 … tn
  4. Generate matches: mapping between s and t

Insufficient evidence
  Product
    productID    name                          price    salePrice
    0X7630AB12   The Concept in Central Park   $13.99   $11.99
  Music
    ASIN   title   artists   recordLabel   discountPrice
    (no tuples)
  MusicCD
    ASIN         album                   artistName   price    discountPrice   recordCompany
    4Y3026DF23   The Best of the Doors   The Doors    $16.99   $12.99          Columbia
  CD
    prodID       albumName              artists        price    salePrice
    9R4374FG56   Saturday Night Fever   The Bee Gees   $14.99   $9.99

Obtaining more evidence
  Product, CD (after corpus-based augmentation)
    productID, prodID: 0X7630AB12, 9R4374FG56
    name, albumName: The Concept in Central Park, Saturday Night Fever
    price: $13.99, $14.99
    salePrice: $11.99, $9.99
  Music, MusicCD (after corpus-based augmentation)
    ASIN: 4Y3026DF23
    title, album: The Best of the Doors
    artists, artistName: The Doors
    recordLabel
    discountPrice: $12.99
  Corpus
    MusicCD
      ASIN         album                   artistName   price    discountPrice   recordCompany
      4Y3026DF23   The Best of the Doors   The Doors    $16.99   $12.99          Columbia
    CD
      prodID       albumName              artists        price    salePrice
      9R4374FG56   Saturday Night Fever   The Bee Gees   $14.99   $9.99

Corpus-based Schema Matching
  Can we use known schemas and mappings to match as yet unseen schemas?
  Augment information about elements in schemas being matched
  Learn schema design patterns and constraints from known schemas to improve matches

Multiple representations for concepts
  [Figure: alternative names used across schemas for the same concepts: CDs, CD, Music, Album, AlbumName, Name, TrackName, ID, CDID, ProdCode, ISBN, ArtistID, DiscountPrice, Artist, AuthorArtist, Name, LastName, Author, DiscountedPrice, SalePrice, OurPrice, Discounted, DiscPrice, RecordLabel, Label, Company, RecordingCompany, Artists, CD2Artist, AuthorArtists]
  Learn alternate names, data instances, names of related elements, data types, …

Schema Design Patterns
  Relations between elements
    Schema element dependency
      [Figure: dependencies among elements of CDs and Warehouses, with nodes price, discountPrice, telephone, fax, city, state, zipcode, numEmployees, manager]
    Frequently co-occurring concepts
      (Warehouse, warehouseID, manager, telephone, fax)
      (Availability, Books, CDs, Warehouses)
  Tables and likely columns (Table/column, Likely column/table, Other column/table)
    Warehouses: likely columns warehouseID, telephone, fax, state, zip, numEmployees, manager, streetAddress, city; other column capacity
    title: likely table Books
    isbn: likely tables Books, Availability
    Other column/table: Keywords, Authors

  [Figure: system architecture built around a corpus of known schemas and mappings]
  Build initial models: schema S, element s, element model Ms (Name, Instances, Type, …)
  Search similar elements: corpus elements e and f similar to s
  Build augmented models: augmented model M's from s, e, and f (Name, Instances, Type, …)
  Typical Schema Matcher
  Learn schema design patterns: Concepts/Clusters and Domain Constraints
  Generate Matches: Mapping

Contents of the Corpus
  In order to augment
    Learn a model ensemble for each element
      Names, data instances, types, structure, …
    Train using the schemas and mappings
      An element and the elements it maps to are positive examples
  In order to learn domain constraints
    Cluster elements in the corpus into concepts
    Estimate schema statistics
      Likely table-column pairs and element co-occurrence
    Learn the importance of individual constraints

Experimental Results
  Four domains
    Automatically extracted web forms
    Manually created relational schemas
  Techniques
    Direct: Glue [WWW’2004]
    Corpus-based Augment
    Corpus-based Pivot [IIW’2004]

Improved Matching Performance
  [Figure: bar chart of average F-measure (axis from 0.5 to 1) for direct, augment, and pivot on four domains: auto, real estate, invsmall, inventory]
  16-19 schemas and 6 mappings in the corpus
  22-54 schema pairs being tested

Difficult Match Tasks
  [Figure: bar chart of average F-measure (axis from 0.5 to 1) for direct, augment, and pivot on four domains: auto, real estate, invsmall, inventory]
  More significant improvements for difficult tasks
  Improvements are smaller for easy tasks

Related Work
  Using past matching experience [Doan, et al., SIGMOD’2001; Do & Rahm, VLDB’2002]
    We are trying to match unseen schemas.
  Using web forms to construct a mediated schema [He & Chang, SIGMOD’2003]
    Clustering of elements is an intermediate step in our corpus.
  Using a Domain Ontology [Xu & Embley, DASFAA’2003]
    Our corpus structures are automatically generated.

Conclusions
  Schema Matching is hard with insufficient evidence
  Corpus-based Schema Matching
    Augment the evidence about elements in unseen schemas
    Learn schema design patterns to select matches
    Improves matching, especially for difficult tasks
  Future Work
    Large schemas and complex mappings
    User feedback to curate the corpus
    Corpus as a tool for other data management tasks [Halevy & Madhavan, IJCAI’2003]
  http://www.cs.washington.edu/homes/jayant
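
Sketch: corpus-based augmentation
  The augmentation idea from the "Insufficient evidence" and "Obtaining more evidence" slides can be illustrated with a small Python sketch. It is not the system described in the talk: the talk learns a model ensemble per element, whereas this sketch assumes a plain Jaccard overlap of column names as the similarity measure and an arbitrary threshold of 0.2. The names ElementModel, jaccard, similarity, and augment are hypothetical. The table and column names and the sample values come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class ElementModel:
    """Evidence known about one schema element (here: a table)."""
    name: str
    columns: set = field(default_factory=set)
    instances: set = field(default_factory=set)


def jaccard(a, b):
    """Set overlap in [0, 1]; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(x, y):
    """Assumed stand-in for a real matcher: overlap of lower-cased column names."""
    return jaccard({c.lower() for c in x.columns},
                   {c.lower() for c in y.columns})


def augment(element, corpus, threshold=0.2):
    """Return a copy of `element` extended with the columns and instances
    of every corpus element whose similarity reaches `threshold`
    (the threshold value is an assumption, not taken from the talk)."""
    merged = ElementModel(element.name, set(element.columns), set(element.instances))
    for other in corpus:
        if similarity(element, other) >= threshold:
            merged.columns |= other.columns
            merged.instances |= other.instances
    return merged


# Schemas being matched (from the "Insufficient evidence" slide).
product = ElementModel("Product",
                       {"productID", "name", "price", "salePrice"},
                       {"The Concept in Central Park"})
music = ElementModel("Music",
                     {"ASIN", "title", "artists", "recordLabel", "discountPrice"})

# Corpus of previously seen schemas (same slide).
corpus = [
    ElementModel("MusicCD",
                 {"ASIN", "album", "artistName", "price", "discountPrice", "recordCompany"},
                 {"The Best of the Doors", "The Doors"}),
    ElementModel("CD",
                 {"prodID", "albumName", "artists", "price", "salePrice"},
                 {"Saturday Night Fever"}),
]

direct_score = similarity(product, music)
aug_product = augment(product, corpus)
aug_music = augment(music, corpus)
augmented_score = similarity(aug_product, aug_music)

print("direct:", direct_score)
print("augmented:", round(augmented_score, 3))
```

  With these assumptions, Product picks up evidence only from CD and Music only from MusicCD, so the empty Music table gains data instances and the two augmented models share column names that the original tables did not. A real matcher would combine several evidence types rather than column names alone.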